« Rootless Containers with runC

Aleksa Sarai

containers free software runc suse

27 June 2016


runC is the canonical Open Containers Initiative (OCI) runtime implementation. It came from the container runtime that Docker developed to replace LXC within Docker a few years ago, called libcontainer (not to be confused with Virtuozzo’s libct). Currently runC is used by Docker (through containerd) to run containers. If you’re using Docker 1.11 or later, you’re using runC explicitly. If you’re using Docker 1.8 or later, you’re using libcontainer code that was developed as part of runC. All of this means that runC is a very widely used container runtime, though many people are not using it to its full potential. It has support for things that the Docker engine does not (such as checkpoint and restore of containers). Many features that are developed in runC today won’t be available in Docker for a fairly long while.

All of that being said, let’s talk about what’s been cooking in the runC kitchen.

Problem to be Solved

First, a quick introduction into what “rootless containers” are. A rootless container is a container that was entirely set up without any privileges. This includes things like installing PAM modules or starting privileged daemons. Why is this important? Surely this doesn’t matter to people who want to use containers (we control our machines after all, which was the point of the Free Software movement). Unfortunately, that’s not always the case. A good example is research faculties inside universities, and so here follows a hypothetical researcher called James and a system administrator called Susan.

James decided to write his analysis scripts in Python 3, using some libraries that are not available for Python 2. Unfortunately, his faculty’s computing cluster only supports Python 2 and Susan doesn’t want to deal with supporting a different Python version. But James knows about containers, and writes a Dockerfile for his analysis scripts. But Susan also won’t install Docker, because Docker doesn’t have any builtin ACLs and thus will allow anyone to get root on the machine. Susan doesn’t want that to happen, because it would violate the university’s policy on privileged access to university resources.

So what should James do? Well, James could try to compile Python 3 and all of the necessary dependencies on the computing cluster. If you’ve ever had to do that, you’ll know that it doesn’t end well. And it doesn’t. James goes down the rabbit hole and emerges trying to modify an .so library to work with the older version of glibc that is running on the server. It doesn’t work, and he decides that he’ll just run his analysis scripts on his small laptop instead.

So, what James clearly needed was a way to run containers without any privileges. There must be no step where James would have to ask Susan to do anything on his behalf. And to be fair, Susan is understandably concerned about allowing researches to access the Docker daemon (it’s a part of the security model of Docker that anyone with access to docker.sock has full access to your machine). So there must be a solution. And there is – it’s called rootless containers.

Oh, and the researcher I called “James”? That was me about 6 months ago. I’ve also gotten emails from people at CERN that talk about about similar issues they’re having with the computing cluster used for the Large Hadron Collider, so this issue isn’t just isolated to one faculty at my university.

Unprivileged User Namespaces

Since Linux 3.8, it has been possible to create unprivileged user namespaces. User namespaces are different to the other namespaces, in that they change the privilege model that the process sees. Some operations require privileges in the root user namespace, others only require it in the process’s namespace. This means that having the ability to create unprivileged user namespaces results in users being able to gain certain privileges (with many caveats). The important code looks like this:

/* As any user. */
unshare(CLONE_NEWUSER);
/* We now have the full capability set in a new user namespace. */

There are some additional steps that need to be done in order for the unprivileged user namespace to be useful. You have to disable the setgroups(2) syscall, and then map your own user (and only your own user) into the namespace. Every other namespace pins what user namespace they were created in, and this information is used to determine what privileges an operation requires. So, after creating an unprivileged user namespace you just have to create all of the other namespaces you want to use and set up the container.

It should be noted that this feature would not be considered strictly safe until Linux 3.19. There were many different semantic and security discussions that happened after Linux 3.8, including the setgroups(2) syscall that I mentioned. Ultimately, as a user of runC you don’t really care about that. But as an administrator, you might have strong views on whether or not user namespaces are safe for general use. I don’t have a dog in that fight, I’m just creating some cool technology on top of the kernel technology.

Current State

So, what works right now? Currently all of the basic functionality of runC works with rootless containers (and where applicable, it works with with both privileged and unprivileged users). The current checklist is visible in the pull request. Please refrain from commenting on the pull request unless you have something constructive to add (please use the reactions provided by GitHub if you want to :+1: the issue). Unfortunately, there are still some outstanding issues that need to be resolved before I feel that rootless containers are ready to be merged, but we’ll get into those in a second.

Currently, you can do the following operations with rootless containers. There are some exceptions that I’ll get into later.

The following operations have been entirely disabled, since I’m either unsure about whether or not they’ll work properly or they cannot work in most configurations. We’ll get back to those later as well.

Consoles and Pain

There has been a long-standing bug within libcontainer with regards to console handling. In order to create a new pseudo TTY, you need to open the magical file /dev/ptmx and then do a few ioctl(2)s on it. However, all of this handling is done in the host within libcontainer. runc run “works” because it uses /dev/ptmx and is all managed internally. However, if you want to detach a container (or use the new create-start semantics with a terminal) then you’re going to run into some issues. You’re going to have to write a wrapper around runC in order to use it properly. And it starts with this little thing called devpts and Linus’ devotion to backwards compatibility.

A long time ago, it was decided that it should be possible for different mount namespaces to see different sets of ptys such that a privileged user in a particular mount namespace won’t be able to manipulate another namespace’s consoles. This is important for security reasons, as users that can read or write to your console can do quite a lot of nasty things. Thus, a new option was added to devpts called newinstance. If you give that option, then a new instance of the pts “namespace” is created where all ptys created in that namespace are only visible in that namespace. However, most GNU/Linux distributions don’t ship with that option enabled for the host’s devpts mount. If the option is not set, then the semantics will mirror the old semantics (one set of ptys visible to everyone). In addition, the file /dev/pts/ptmx is only usable by root on the host (it has file mode 0000).

Unmapped Users and Oddness

While everything works fine as “root” inside an unprivileged user namespace, the fact that you can only map your current user (and there’s a lot of additional restrictions added by the kernel) causes programs to act strangely. The most obvious example of this is package managers like apt:

% runc run ubuntu
# apt-get update
E: setgroups 65534 failed - setgroups (1: Operation not permitted)
E: setegid 65534 failed - setegid (22: Invalid argument)
E: seteuid 104 failed - seteuid (22: Invalid argument)
E: setgroups 0 failed - setgroups (1: Operation not permitted)
Reading package lists... Done
W: chown to _apt:root of directory /var/lib/apt/lists/partial failed - SetupAPTPartialDirectory (22: Invalid argument)
E: setgroups 65534 failed - setgroups (1: Operation not permitted)
E: setegid 65534 failed - setegid (22: Invalid argument)
E: seteuid 104 failed - seteuid (22: Invalid argument)
E: setgroups 0 failed - setgroups (1: Operation not permitted)
E: Method gave invalid 400 URI Failure message: Failed to setgroups - setgroups (1: Operation not permitted)
E: Method http has died unexpectedly!
E: Sub-process http returned an error code (112)
# apt-get install python
Reading package lists... Done
Building dependency tree
Reading state information... Done
E: Unable to locate package python
#

The issue here is that apt is trying to drop privileges, even though it doesn’t have any! Unfortunately, because of some regrettable decisions made by the implementors of user namespaces, a process inside a user namespace will be exposed to the oddness of unmapped uids and gids. You can both be in an unmapped group, but also have no permissions to a file with the same gid. In addition, setgroups(2) is disabled but you can see unmapped supplementary groups and you cannot leave them. Interestingly, zypper (which is openSUSE’s package manager) doesn’t suffer from this problem (though some packages have scripts that try to change the ownership of files).

Unfortunately, there isn’t a way to fix this from the kernel as it is now baked into the kernel ABI (and Linus doesn’t take kindly to people trying to fix the ABI). So, I had to work on a different solution. The full story will be in another blog post, but you can take a look at the code.

Networking

Currently creating a new network namespace will drop you into an environment with no network devices except lo. This means that you won’t have a network connection inside a rootless container that uses the network namespace. In order to create a bridge between two namespaces you need CAP_NET_ADMIN in both pinned user namespaces (including the host user namespace which is where your internet comes from in the host). You can fix this by just not using the network namespace. While this means that you can’t use iptables(8) it also means that you now have a network connection. There’s no security issue with not using a namespace, since we never have any privileges when setting up a rootless container.

There has apparently been some discussion upstream about solving this problem, but it’s quite a hard problem to solve. FreeBSD had a similar issue with the Jails implementation and it took a very long time to figure out how to securely expand network namespacing so that it could be secure and useful.

cgroups

Now for the big question: can we use cgroups in rootless containers? Unfortunately, in the general case it is not currently possible use cgroups in rootless containers. The reason for this is that the cgroup API uses a virtual filesystem, and the permissions model is based exclusively around that idea. There was the recent addition of a cgroup namespace, but it is not relevant to rootless containers.

However, it makes sense that (at least for cgroupv2) a process inside a particular cgroup should be allowed to manage its own resources (with the obvious caveat that it must obey the resource restrictions assigned to it). Unfortunately, currently there is no support for such a system within the kernel. I sent some patches to LKML and they were rejected on the grounds that the maintainer doesn’t want the cgroup API to deviate too far from VFS. I don’t agree with that, but I’m thinking about how we could implement this without needing to break the VFS API.

It should be noted that we could, in principle, use cgroups if we happen to have the right permissions. This is something that a colleague of mine, Christian Brauner, was helping me implement. You can see the current state of that here.

Once we have a way to use cgroups with the same constraints as the rest of the rootless containers implementation, we will be able to re-enable so many of the other operations that we had to disable (including pause and resume which are very useful).

Miscellaneous

Apart from the big issues I outlined above, there are a few remaining issues. These are mainly trivial things that shouldn’t take too long to fix, I just need to put in the time.

Currently, runc ps uses cgroups to get the list of processes in a container. This is nice because it does PID translation for us. Unfortunately, this also means that we can’t use it within rootless containers. Solving this requires a lot of messy engineering for very little gain. Essentially you have to create an AF_UNIX socket, then fork a process which will join the relevant namespaces in the container. That process can then enumerate the PIDs and send them to the runC process as ancillary data over the AF_UNIX socket. The PIDs will be translated and then we can carry on as before.

checkpoint and restore are currently disabled as a precautionary measure. I haven’t really used those features, and I’m unsure how well they would deal with having a rootless container that has some intricacies about how it is set up. The tool we use for the checkpoint and restore of containers is [CRIU][criu]. CRIU 2.0 brought support for [unprivileged checkpointing of processes][criu-2.0], and there are plans to support unprivileged restore in the future. So currently this isn’t a critical thing to support.

Unless otherwise stated, all of the opinions in the above post are solely my own and do not necessary represent the views of anyone else. This post is released under the Creative Commons BY-SA 4.0 license.

Want to keep up to date with my posts?

You can subscribe to the Atom Feed.