« Docker Internals and Implementing Rebase
12 December 2015
The Docker image format isn’t standardised (or really documented), so my project
was going to be quite a challenge. The basic idea I had in my head was implementing
git rebase-like feature for Docker. Luckily, I’d already been looking at how
git strategies work, since
git is actually only a collection of a few simple
graph theory tricks.
git rebase Method
git rebase works, in a rough sense is like this:
git rebase A B case, find the common parent
C. This is equivalent to
git rebase --onto A C B. We would need to make an extension to the
git rebase model here, namely what happens when there’s no common commit. This
will be dealt with by stupidly applying layers on top of
A starting from
and then going up to
B (obviously this means you can’t use the “common parent”
For each layer starting from
C, going to
- If the layer’s diff is encountered in the history of
A, we skip that layer as it has already been applied.
- Else, if the layer modifies files which don’t exist (or delete files that do), just apply it blindly – unless it’s trying to remove a file that doesn’t exist which is obviously an error.
- Else, if the layer’s file modifications are such that the diff applies cleanly,
then just apply it to the file. This is different from the above case for Docker,
but not in the
- Otherwise, bail and ask the user to solve the conflict for us.
Fortunately with a Docker image, we don’t need to ask the user for any help. We
(at least in principle) should have all the information necessary to generate the
entire image history. There are some caveats for the special
#(nop) layers which
are generated by
Dockerfile builds, which must be considered separately to
Docker images are stored as a directed, acyclic graph (DAG), just
git. Each node is an image configuration file, with a
containing the set of changes made by the layer. The on-disk format is a
archive of the above with a directory of each layer, containing a JSON representation
of the image configuration and the layer’s
tar archive. You can follow along at
home by doing the following commands:
% docker save -o image.tar <image> % mkdir image/ % tar xvfC image.tar image/ 120e218dd395ec314e7b6249f39d2853911b3d6def6ea164ae05722649f34b16/ 120e218dd395ec314e7b6249f39d2853911b3d6def6ea164ae05722649f34b16/VERSION 120e218dd395ec314e7b6249f39d2853911b3d6def6ea164ae05722649f34b16/json 120e218dd395ec314e7b6249f39d2853911b3d6def6ea164ae05722649f34b16/layer.tar 42eed7f1bf2ac3f1610c5e616d2ab1ee9c7290234240388d6297bc0f32c34229/ 42eed7f1bf2ac3f1610c5e616d2ab1ee9c7290234240388d6297bc0f32c34229/VERSION 42eed7f1bf2ac3f1610c5e616d2ab1ee9c7290234240388d6297bc0f32c34229/json 42eed7f1bf2ac3f1610c5e616d2ab1ee9c7290234240388d6297bc0f32c34229/layer.tar 511136ea3c5a64f264b78b5433614aec563103b4d4702f3ba7d4d2698e22c158/ 511136ea3c5a64f264b78b5433614aec563103b4d4702f3ba7d4d2698e22c158/VERSION 511136ea3c5a64f264b78b5433614aec563103b4d4702f3ba7d4d2698e22c158/json 511136ea3c5a64f264b78b5433614aec563103b4d4702f3ba7d4d2698e22c158/layer.tar 7318595639c20360ecb151e4d51e7d2c48dd26b517c0d55ceb6d1ece80787710/ 7318595639c20360ecb151e4d51e7d2c48dd26b517c0d55ceb6d1ece80787710/VERSION 7318595639c20360ecb151e4d51e7d2c48dd26b517c0d55ceb6d1ece80787710/json 7318595639c20360ecb151e4d51e7d2c48dd26b517c0d55ceb6d1ece80787710/layer.tar 7f79605b64c3f9579f4fd9a78537879c066576677e6466b1ac9cffefe1b1898e/ 7f79605b64c3f9579f4fd9a78537879c066576677e6466b1ac9cffefe1b1898e/VERSION 7f79605b64c3f9579f4fd9a78537879c066576677e6466b1ac9cffefe1b1898e/json 7f79605b64c3f9579f4fd9a78537879c066576677e6466b1ac9cffefe1b1898e/layer.tar a9eb172552348a9a49180694790b33a1097f546456d041b6e82e4d7716ddb721/ a9eb172552348a9a49180694790b33a1097f546456d041b6e82e4d7716ddb721/VERSION a9eb172552348a9a49180694790b33a1097f546456d041b6e82e4d7716ddb721/json a9eb172552348a9a49180694790b33a1097f546456d041b6e82e4d7716ddb721/layer.tar c3b53a67d369fb1a8a6455ac1d348d2cddd6d85b31bacdda602013e2087c8d98/ c3b53a67d369fb1a8a6455ac1d348d2cddd6d85b31bacdda602013e2087c8d98/VERSION c3b53a67d369fb1a8a6455ac1d348d2cddd6d85b31bacdda602013e2087c8d98/json c3b53a67d369fb1a8a6455ac1d348d2cddd6d85b31bacdda602013e2087c8d98/layer.tar fadc9aad1a8c5d6d7b89d1e2f1a9229d354551dd00a6bff6085225d8a43c7608/ fadc9aad1a8c5d6d7b89d1e2f1a9229d354551dd00a6bff6085225d8a43c7608/VERSION fadc9aad1a8c5d6d7b89d1e2f1a9229d354551dd00a6bff6085225d8a43c7608/json fadc9aad1a8c5d6d7b89d1e2f1a9229d354551dd00a6bff6085225d8a43c7608/layer.tar repositories %
repositories just contains a JSON object describing what tags were exported.
If you export using an image’s ID then it isn’t present, and it’s not important
for our work.
VERSION describes the version of the image format being used. For Docker 1.9,
this is always
1.0, despite the fact that the
V1Image structure is described
as being “legacy”. It appears as though the image code is currently undergoing
a migration to an updated version. The content-addressable hashes were vital in
getting Notary and the Docker content trust to operate securely. However,
docker save doesn’t output these “new and improved” images. I’m unsure if you
could actually subvert the security of Docker here (it checks that the layer IDs
json are the above mentioned layer node information. Together,
they contain the same form of information as a
git commit (with some additional
data about how they were created, which helps us when we have to deal with rebuilding
layers). One question that quickly arises is “but how does Docker use tar files
to deal with the removal of files?”. This is a very good question, and the answer
lies in the history of Docker’s storage drivers.
Despite it’s flaws, AUFS (or the state of union filesystems at the time) was quite important in the history of Docker’s design. The layered design of Docker images probably wouldn’t have been as easy to implement if it wasn’t for UFS and AUFS. Sure they caused kernel crashes, but they showed potential.
As it turned out, AUFS had already solved the layered filesystem representation
problem before Docker. They invented several metadata prefixes that represented
multiple kinds of files. The main one we’re interested in is
indicates a “whiteout file” – a file that was
unlink()d in the layer. There
are a few other AUFS metadata files and directories, which we don’t really care
In principle they are all properly replaced during the creating of the image
file. For reference, they are the
/.wh..wh.plnk/ directory which is used for
hardlinks to reference so you can reference files in other layers (hardlinks are
based on inodes, so you can’t really reference an inode number you don’t know
yet). The other is the
.wh..wh..opq file which is used to indicate that a
directory is opaque (
readdir() calls don’t descend to layers below it).
Applying on top of the
git rebase model, the following is the
design. The nomenclature is the same as the
A is the base
being applied on top of, starting from
- If an image is a
#(nop)image, we have no choice but to blindly apply it. If it modifies a file that already existed or deletes a file that didn’t exist we have to either bail and tell the user “sorry, you have to rebuild” or plough through with warnings.
- Else, if the image doesn’t affect any files which exist, and doesn’t delete any files we can apply it safely without rebuilding (but there might be some caveats with this.
- Otherwise, we have to rebuild the image. There is a possible improvement here
with layers that just modify metadata (but things like
chown -Rmight prove an issue). Rebuilding the image currently would require importing the current state of our new image into Docker and then doing the whole
docker commitshenanigans. This could be improved by faking the layers using
runc, but probably would be a major pain.
This model could definitely use some improvement. However, it offers a solid
docker build. It doesn’t have the same cache semantics as
Dockerfile, which is an improvement if you’re into hacking Docker images. In
addition, it can be a massive time saver for building applications or custom
git analogy only goes so far. While Docker layers and
commits might have very similar technical aspects, Docker layers describe an
entire system’s filesystem. Redundant copies of code are not considered a good
git doesn’t need to explicitly deal with multiple copies of the
same code. Unfortunately, it is quite common to have commands which generate files
from other files. It would be ideal if we could know what files were read by a
layer (but that’s not practical without adding some dodgy debugging facilities
to Docker). Thus, we must be able to support the user explicitly setting a flag
saying they want to explicitly rebuild all layers we know how to rebuild.
So, while all of that design is nice, we need to deal with the nitty gritty of
implementing this. There were a couple of issues faced during graph loading,
mainly involving the fact that Docker 1.9 doesn’t actually output the modern
image format, instead outputting version
1.0 legacy images. However, that’s
quickly solved by manually unmarshalling the JSON into the embedded
However, another issue you’ll run into as soon as you start working on applying images on different layers is what the new layer ID should be. In that past, back when I started contributing to Docker, the layer IDs were just random hexadecimal strings. However, they were improved to be deterministic and later to be entirely content-addressable. This results in a bunch of weirdness with layer IDs, and that’s where our issue comes in.
If you look closely, you’ll notice that the ID of an image is based on a
which represents a chain of
DiffIDs (that is, an image is a set of layers in
a particular order). However, generating a
DiffID is not straightforward. It’s
possible to get a
DiffID from a corresponding
ChainID, but that’s the only
obvious way I could find. Naturally, there must be a way that Docker bootstraps
this. They’ve hardcoded the
DiffID of an empty layer, and reading the tests
carefully appears to indicate that the
ChainID, it isn’t commented)
is generated from the
SHA256 hash of the
tar stream of the layer. The format
of the layer is not clearly specified, as the
tar stream of an empty layer is
(as you might guess) empty!
layerStore.applyTar appears to concur with this theory, but it also is not
clear what the format of the
tar stream is. Looking into the AUFS and Overlay
graph drivers, there’s not many comments explaining what to expect for the
tar stream. It does appear to just be the
layer.tar from above, but
a quick test shows this isn’t the case:
% shasum -a 256 image/120e218dd395ec314e7b6249f39d2853911b3d6def6ea164ae05722649f34b16/layer.tar 7f02483a97528d2c20fa0adb6b851743823616bc8e5bec765e04c8b061ae039b extract/120e218dd395ec314e7b6249f39d2853911b3d6def6ea164ae05722649f34b16/layer.tar
The two hashes don’t match. If you follow the graph driver all the way down the
rabbit hole you pass the
chrootarchive package (which executes a subprocess to
run a single Go function, so it can
chroot rather than dealing with paths in Go)
and end up in
archive.ApplyLayer. This function appears to back up the assertion
that it’s just the
layer.tar from above. But we know it isn’t. That’s as far
as I got investigating how to forge a
DiffID, although it’s probably possible
to patch the Docker daemon to dump the image tar.
Generating New Layers
As I mentioned above, there’s no real way of generating a new layer without
exporting the current state of the rebase to an image
tar archive, importing
it into Docker, running
docker run and
docker commit, then exporting the new
image and applying that layer to the rebase. This is very inefficient (and would
be improved if Docker offered this functionality), and could definitely use some
improvement. One possibility would be to try to use
runc to generate a new
layer. Unfortunately it would require quite a bit of trickery as we’d need to
diff the resultant filesystem (which could be a problem for large updates). Not
to mention the fact that
runc configuration files are quite different from
Docker layer JSON configuration files.
Show me the Code!
The code is available on my GitHub. It’s licensed under the Apache 2.0 license. Unfortunately, the Docker image format lacks so much documentation that I couldn’t finish implementing the rebase model. All of the graph loading code is done though (which is honestly the bulk of the work). I probably will work on this project in my spare time, or in the next Hackweek.
Unless otherwise stated, all of the opinions in the above post are solely my own and do not necessary represent the views of anyone else. This post is released under the Creative Commons BY-SA 4.0 license.
Want to keep up to date with my posts?
You can subscribe to the Atom Feed.