How to make a Docker container read-only – Ice and Fire

There are many ways to harden a Docker container; one of them is to make the container layer read-only.

This might be only a marginal improvement to security: your application should not run as root or hold special privileges (e.g. CAP_DAC_OVERRIDE) in the first place, so there is limited risk that an attacker exploiting a vulnerability in your application can modify sensitive applications. However, if you install your application within a Dockerfile as the application user (e.g. using bundle install), making the base layer read-only might protect it from unwanted modification.

I also like the idea of an immutable base layer, and of clearly identifying which data is written and whether it should be persisted. I also relate that to security: the better you know the behaviour of an application, the better you can tailor its confinement.

Setting the base layer read-only is somewhat challenging. Making the container read-only is the simple part: there is a --read-only flag to the docker run command. Identifying which data the containerised application writes is the hard part. One task is thus to identify all written data and to decide whether it should be persisted in a volume or not. In the latter case, one could use a tmpfs mount or a local volume (in a Swarm cluster).

We are going to use Docker's layering approach to identify the written data. How to check the difference depends on the storage backend, and there are too many backends for me to cover each case. I might complete the article in the future, but today I will show how to do it with the BTRFS and Overlay2 backends.

What I am going to explain is based on the current implementation of the Docker storage backends as described in their respective guides. Each guide explains how its backend works, and from that information I could find a way to compare the layers.
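Since the procedure depends on the storage backend, it helps to check first which one the daemon uses. Here is a minimal sketch, assuming the Driver field reported by docker info. The DOCKER variable and the two function names are my own illustrative choices, not part of Docker.

```shell
#!/bin/sh
# Command used to call the Docker CLI; override DOCKER if sudo is not needed.
# (DOCKER, storage_driver and section_for_driver are illustrative names.)
DOCKER="${DOCKER:-sudo docker}"

# Print the storage driver of the local daemon, e.g. "btrfs" or "overlay2".
storage_driver() {
    $DOCKER info --format '{{.Driver}}'
}

# Print which section of this article applies to a given driver name.
section_for_driver() {
    case "$1" in
        btrfs)    echo "BTRFS Storage Backend" ;;
        overlay2) echo "Overlay2 Storage Backend" ;;
        *)        echo "not covered here: $1" ;;
    esac
}

# Usage:
#   section_for_driver "$(storage_driver)"
```

Any other driver falls through to the last branch, because this article only covers these two backends.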

BTRFS Storage Backend

For BTRFS, I usually talk of subvolumes, but some of those are actually snapshots. A snapshot is just a special subvolume that is initialised with the data of the original subvolume (no copy happens). To simplify matters, I will use the term subvolume for both.

We will take the example of setting a Postgres DB as read-only. First we will run the database:

$ sudo docker run --name postgres -e POSTGRES_PASSWORD=mysecretpassword -d postgres:10.7

Now we need to identify the two layers used by this container. The first is the image layer, which should actually be the top layer of the image (here postgres:10.7). The second is the “working layer” of the running container, which is the one we want to set read-only; we call that layer the “base layer”.

The two layers are identified by two files, named init-id and mount-id, under the following directory: <DockerRootDir>/image/btrfs/layerdb/mounts/<ContainerID>/.

The <DockerRootDir> is easy to find and usually points to /var/lib/docker; you can use docker system info --format '{{json .DockerRootDir}}' to find out.

The <ContainerID> is also fairly easy to find: the command docker inspect --format="{{.ID}}" postgres provides the answer. Let us see what it gives for our example:

$ sudo docker system info --format '{{json .DockerRootDir}}'
"/var/lib/docker"
$ sudo docker inspect --format="{{.ID}}" postgres
01e2e618284eb0c9256541cb3e25be9b9f571faead74f29bb7d19126a41bfa47

Now, we need to check the names of the BTRFS subvolumes for the image layer (aka init-id) and for the base layer of the container (aka mount-id):

$ sudo grep -h '' /var/lib/docker/image/btrfs/layerdb/mounts/01e2e618284eb0c9256541cb3e25be9b9f571faead74f29bb7d19126a41bfa47/init-id
47907e3b70e72a608f62f5f34683b246058c748e9d8981b2515739ba4a6d0d62-init
$ sudo grep -h '' /var/lib/docker/image/btrfs/layerdb/mounts/01e2e618284eb0c9256541cb3e25be9b9f571faead74f29bb7d19126a41bfa47/mount-id
47907e3b70e72a608f62f5f34683b246058c748e9d8981b2515739ba4a6d0d62

We have found two pieces of information: the name of the subvolume of the image layer (47907e3b70e72a608f62f5f34683b246058c748e9d8981b2515739ba4a6d0d62-init) and that of the base layer (47907e3b70e72a608f62f5f34683b246058c748e9d8981b2515739ba4a6d0d62). They differ only by the suffix -init. We will use this ID to find information on the BTRFS subvolumes.
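The lookups above can be wrapped in a small helper. This is only a sketch under the directory layout described in this article. The function name container_layer_ids is hypothetical, and the storage driver is passed as an argument because it is part of the path.

```shell
#!/bin/sh
# Print the init-id and mount-id recorded by Docker for one container.
# Arguments: <DockerRootDir> <storage driver> <ContainerID>
# (container_layer_ids is an illustrative name, not a Docker command.)
container_layer_ids() {
    dir="$1/image/$2/layerdb/mounts/$3"
    # printf adds the newline, in case the files do not end with one.
    printf 'init-id:  %s\n' "$(cat "$dir/init-id")"
    printf 'mount-id: %s\n' "$(cat "$dir/mount-id")"
}

# Usage (run as root, the directory is normally not readable by other users):
#   container_layer_ids /var/lib/docker btrfs \
#       "$(docker inspect --format '{{.ID}}' postgres)"
```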

Listing the subvolumes of the running container and of the image for BTRFS:

$ sudo btrfs subvol list /var/lib/docker | grep 47907e3b70e72a608f62f5f34683b246058c748e9d8981b2515739ba4a6d0d62
ID 4176 gen 381073 top level 5 path lib/docker/btrfs/subvolumes/47907e3b70e72a608f62f5f34683b246058c748e9d8981b2515739ba4a6d0d62-init
ID 4177 gen 381176 top level 5 path lib/docker/btrfs/subvolumes/47907e3b70e72a608f62f5f34683b246058c748e9d8981b2515739ba4a6d0d62
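The generation number reported for the -init subvolume (gen 381073 here) is the value passed to find-new in the next step. A small sketch to extract it from the listing, assuming the output format shown above; subvol_gen is a hypothetical helper name:

```shell
#!/bin/sh
# Read `btrfs subvolume list` output on stdin and print the generation
# number of the subvolume whose path ends with the given name.
# (subvol_gen is an illustrative helper, not a btrfs command.)
subvol_gen() {
    awk -v name="$1" '
        {
            n = split($NF, parts, "/")      # last field is the path
            if (parts[n] == name) {
                for (i = 1; i < NF; i++)    # find the value after "gen"
                    if ($i == "gen") print $(i + 1)
            }
        }'
}

# Usage:
#   sudo btrfs subvolume list /var/lib/docker | subvol_gen "<mount-id>-init"
```

Matching on the last path component rather than with grep avoids picking up the subvolume without the -init suffix, whose name is a prefix of the one we want.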

Making a diff, starting from the generation number of the -init subvolume (gen 381073 in the listing above):

$ sudo btrfs subvolume find-new /var/lib/docker/btrfs/subvolumes/47907e3b70e72a608f62f5f34683b246058c748e9d8981b2515739ba4a6d0d62 381073
inode 10844 file offset 0 len 12 disk start 0 offset 0 gen 381194 flags INLINE etc/mtab
inode 10848 file offset 0 len 63 disk start 0 offset 0 gen 381195 flags INLINE run/postgresql/.s.PGSQL.5432.lock
transid marker was 381176

The output might be too verbose, so you can simplify it with the following pipeline (credit for the plumbing):

$ sudo btrfs subvolume find-new /var/lib/docker/btrfs/subvolumes/47907e3b70e72a608f62f5f34683b246058c748e9d8981b2515739ba4a6d0d62 381073 | sed '$d' | cut -f17- -d' ' | sort | uniq
etc/mtab
run/postgresql/.s.PGSQL.5432.lock
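As a variation on the pipeline above, the filter below selects lines by their content instead of by field position. It is a sketch that assumes the find-new output format shown in this article, where the path follows a single-word flags value; find_new_paths is a hypothetical helper name.

```shell
#!/bin/sh
# Read `btrfs subvolume find-new` output on stdin and print each changed
# path once. Only lines starting with "inode" are kept, so the trailing
# "transid marker" line is dropped without relying on its position.
# (find_new_paths is an illustrative helper, not a btrfs command.)
find_new_paths() {
    sed -n 's/^inode .* flags [^ ]* //p' | sort -u
}

# Usage:
#   sudo btrfs subvolume find-new <subvolume path> <gen> | find_new_paths
```

A path that itself contains the text " flags " followed by a word and a space would be cut short by this pattern, which is unlikely but worth knowing.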

Conclusion

BTRFS offers some tools to find differences between subvolumes, but finding which files have changed requires some understanding of both BTRFS and the Docker storage backend. In the end, it is possible to identify which files and directories have been created or modified. Of course, the above example is not enough: one should run a stress test, or simply a validation campaign, on the system to safely identify all possible file system changes.

For our very simple test, we could run the Postgres container read-only, with a Docker named volume, by doing:

$ sudo docker run --name postgres -v pgdata:/var/lib/postgresql/data --read-only --tmpfs /run -e POSTGRES_PASSWORD=mysecretpassword -d postgres:10.7

In the above command, we set the container to be read-only, we create a tmpfs mount on /run to contain the runtime data (here some internal locks), and we create a named volume for storing the databases.
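To check that the container really behaves as intended, one can query its configuration and try a few writes. Below is a minimal sketch, assuming the HostConfig.ReadonlyRootfs field of docker inspect and that the touch command exists in the image. The DOCKER variable and the function names are hypothetical.

```shell
#!/bin/sh
# Command used to call the Docker CLI; override DOCKER if sudo is not needed.
# (DOCKER, is_readonly and can_write are illustrative names.)
DOCKER="${DOCKER:-sudo docker}"

# Succeed if the container was created with a read-only root file system.
is_readonly() {
    [ "$($DOCKER inspect --format '{{.HostConfig.ReadonlyRootfs}}' "$1")" = "true" ]
}

# Succeed if a file can be created at the given path inside the container.
# Note: on success the probe file is left behind at that path.
can_write() {
    $DOCKER exec "$1" touch "$2" 2>/dev/null
}

# Usage:
#   is_readonly postgres && echo "root file system is read-only"
#   can_write postgres /probe     || echo "/ rejects writes, as expected"
#   can_write postgres /run/probe && echo "/run (tmpfs) is writable"
```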

Overlay2 Storage Backend

For overlay2, it is much simpler. The file structure used by this Docker storage backend gives us all the information we need. We will use the same example as for BTRFS, with a postgres container. The information we are looking for is within the GraphDriver structure reported by docker inspect:

$ sudo docker inspect postgres
(...)
    "GraphDriver": {
         "Data": {
(...)
            "UpperDir": "/var/lib/docker/overlay2/cde7ec0ff866c7bcda21109872cbe1751e20967d823bce8a8ec1a3a0be4357cd/diff",
(...)
        },
         "Name": "overlay2"
    },

The data that changed compared to the base image is all in this directory:

$ sudo ls -AlF /var/lib/docker/overlay2/cde7ec0ff866c7bcda21109872cbe1751e20967d823bce8a8ec1a3a0be4357cd/diff/
total 0
drwxr-xr-x. 3 root root 4096 Mar 27 1:12 run

We can easily see the full tree of files and folders that have changed by using the tree command (here I run the docker inspect command inline, with a format string that extracts just the directory where the differences are visible):

$ sudo tree -apugsDF $(sudo docker inspect --format '{{ .GraphDriver.Data.UpperDir }}' postgres)/
/var/lib/docker/overlay2/cde7ec0ff866c7bcda21109872cbe1751e20967d823bce8a8ec1a3a0be4357cd/diff/
└── [drwxr-xr-x root     root            4096 Mar 27  1:12]  run/
    └── [drwxrwsr-x 999      docker          4096 Apr  7 21:23]  postgresql/
        ├── [srwxrwxrwx 999      docker             0 Apr  7 21:23]  .s.PGSQL.5432=
        └── [-rw------- 999      docker            63 Apr  7 21:23]  .s.PGSQL.5432.loc
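When the tree command is not installed, or when a flat list is easier to compare between test runs, the same directory can be walked with find. This is a sketch only, and upperdir_changes is a hypothetical helper name.

```shell
#!/bin/sh
# Print every file and directory below an overlay2 UpperDir, relative to
# the container root, one per line.
# (upperdir_changes is an illustrative helper, not a Docker command.)
upperdir_changes() {
    ( cd "$1" && find . ! -path . | sed 's|^\./||' | sort )
}

# Usage (run as root):
#   upperdir_changes "$(docker inspect --format '{{ .GraphDriver.Data.UpperDir }}' postgres)"
```

Files deleted in the container also show up in this listing, since overlay records a deletion as a whiteout entry (a character device) in the upper directory.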

Conclusion

As we have seen, the overlay2 storage backend designed by Docker is very easy to explore. But I reach the same conclusion as for BTRFS: one needs to perform further testing with the container, so as to execute all likely code paths and create or modify all possible files, before the container can be set to read-only.