Lewis Gaul <lewis.gaul(a)gmail.com> writes:
> Hi,
> I'm trying to understand something about how capabilities in rootless
> podman work.
> How does rootless podman have the capability to set up container mounts
> (such as cgroup mounts) given a privileged container itself doesn't?
> Does podman deliberately drop caps, or somehow get elevated privileges
> to do this?
> This is the process tree podman sets up (where bash is the container
> entrypoint here):
> systemd(1)---conmon(1327421)---bash(1327432)
> I'm assuming it's conmon that sets up the container's mounts (via runc
> in this case), which is a process running as my user (rootless). How is
> it that conmon has the capabilities required (SYS_ADMIN?) to create the
> container's cgroup and sysfs mounts but within the container itself
> this is not possible?
> Thanks for any insight!

A rootless container is able to perform "privileged" operations by using
a user namespace; inside that user namespace it gains the capabilities
required to perform mounts.
Be aware that in a user namespace, the root user is still limited in
what it can do, as the kernel differentiates between the root user on
the host (known as the initial user namespace) and any other user
namespace.
The user namespace is a special namespace that alters how other
namespaces work, since each namespace is "owned" by a user namespace.
So a user namespace alone is not enough to perform mounts; the user must
also create a new mount namespace. The combination of a user namespace
and a mount namespace is what "podman unshare" creates.
For example:
$ podman unshare
$ id
uid=0(root) gid=0(root) groups=0(root),65534(nobody)
context=unconfined_u:unconfined_r:container_runtime_t:s0-s0:c0.c1023
$ mkdir /tmp/test
$ mount -t tmpfs tmpfs /tmp/test
$ exit
You can try this manually. With only a user namespace, the mount fails:
$ unshare -r bash ## creates a user namespace and maps your user to root
$ mkdir /tmp/test; mount -t tmpfs tmpfs /tmp/test
mkdir: cannot create directory ‘/tmp/test’: File exists
mount: /tmp/test: permission denied.
dmesg(1) may have more information after failed mount system call.
The failure happens because the new user namespace does not own the
current mount namespace, which is still owned by the initial user
namespace.
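One way to see this is to compare the namespace links under
/proc/self/ns outside and inside the user namespace. This is only a
minimal sketch, assuming a Linux host with util-linux's unshare and
unprivileged user namespaces enabled; the variable names are purely
illustrative:

```shell
# Namespace identifiers of the current shell.
host_userns=$(readlink /proc/self/ns/user)
host_mntns=$(readlink /proc/self/ns/mnt)
echo "outside: $host_userns $host_mntns"

# Same query from a process in a new user namespace only (-r).
# The user namespace identifier should differ, while the mount
# namespace identifier should be the same as outside.
if inner=$(unshare -r sh -c 'readlink /proc/self/ns/user; readlink /proc/self/ns/mnt' 2>/dev/null); then
    echo "inside:"
    echo "$inner"
else
    inner=""
    echo "inside: unprivileged user namespaces are not available here"
fi
```

If the mount namespace identifier is unchanged, the process is still
using a mount namespace owned by the initial user namespace, which is
why the mount is refused.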
So in order to perform a mount, you must also create a mount namespace
from within that user namespace:
$ unshare -m bash ## the new mount namespace is owned by the current
## user namespace
$ mount -t tmpfs tmpfs /tmp/test
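The two steps can also be combined into a single unshare invocation.
The following is a sketch under the same assumptions (util-linux's
unshare, unprivileged user namespaces enabled); the scratch directory
and variable names are hypothetical:

```shell
# Scratch directory, created before entering the namespaces.
dir=$(mktemp -d)

# -r: new user namespace with the current user mapped to root
# -m: new mount namespace, owned by that new user namespace
if unshare -r -m sh -c 'mount -t tmpfs tmpfs "$1" && grep -F " $1 " /proc/self/mounts' sh "$dir"; then
    result=mounted
else
    result=failed   # e.g. unprivileged user namespaces are disabled
fi
echo "inside the namespaces: $result"

# The tmpfs was mounted in the new mount namespace only, so it should
# not show up in the mount table of the original shell.
if grep -qF " $dir " /proc/self/mounts; then visible=yes; else visible=no; fi
echo "visible outside: $visible"
rmdir "$dir"
```

The mount disappears together with the mount namespace once the
process started by unshare exits.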
In the rootless container case, the container mounts are performed by
the OCI runtime that runs in the user+mount namespace created by
Podman.
Regards,
Giuseppe