Hi Giuseppe,
Thanks, some useful points there. However, my question was more
specifically around how "special" mounts get created in containers, given
it's not possible for the container process itself to create them. A
concrete example below using rootless podman...
$ podman run --rm -it --name ubuntu --privileged ubuntu:20.04
root@b2069e97cd13:/# findmnt -R /sys/fs/cgroup/freezer
TARGET                 SOURCE FSTYPE OPTIONS
/sys/fs/cgroup/freezer cgroup cgroup rw,nosuid,nodev,noexec,relatime,seclabel,freezer
root@b2069e97cd13:/# umount /sys/fs/cgroup/freezer
root@b2069e97cd13:/# mount -t cgroup cgroup /sys/fs/cgroup/freezer -o rw,nosuid,nodev,noexec,relatime,seclabel,freezer
mount: /sys/fs/cgroup/freezer: permission denied.
This shows that cgroup mounts are present in the container, and yet the
container does not have permission to create the mount.
However, I've realised these are perhaps just bind mounts from the host
mount namespace? I can simulate this as follows:
$ podman run --rm -it --name ubuntu --privileged -v /sys/fs/cgroup:/tmp/host/cgroup:ro ubuntu:20.04
root@495f11acdd5b:/# findmnt -R /tmp/host/cgroup/freezer/
TARGET                   SOURCE FSTYPE OPTIONS
/tmp/host/cgroup/freezer cgroup cgroup rw,nosuid,nodev,noexec,relatime,seclabel,freezer
root@495f11acdd5b:/# umount /sys/fs/cgroup/freezer
root@495f11acdd5b:/# mount --bind /tmp/host/cgroup/freezer /sys/fs/cgroup/freezer
root@495f11acdd5b:/# findmnt -R /sys/fs/cgroup/freezer/
TARGET                 SOURCE FSTYPE OPTIONS
/sys/fs/cgroup/freezer cgroup cgroup rw,nosuid,nodev,noexec,relatime,seclabel,freezer
One further thing I'm unclear on: it seems that when a new mount namespace
is created, the mount list is copied from the parent process's namespace,
but some of the container's cgroup mounts are bind mounts rooted at some
point inside the hierarchy (e.g. cgroup[/user.slice] below) rather than
being the same as the host mounts. Perhaps the container runtime first
unmounts /sys/fs/cgroup in the container mount namespace before creating
these bind mounts?
root@495f11acdd5b:/# findmnt /sys/fs/cgroup/devices
TARGET                 SOURCE              FSTYPE OPTIONS
/sys/fs/cgroup/devices cgroup[/user.slice] cgroup rw,nosuid,nodev,noexec,relatime,seclabel,devices
root@495f11acdd5b:/# findmnt /tmp/host/cgroup/devices
TARGET                   SOURCE FSTYPE OPTIONS
/tmp/host/cgroup/devices cgroup cgroup rw,nosuid,nodev,noexec,relatime,seclabel,devices
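The bracketed part of the SOURCE column (cgroup[/user.slice]) is how findmnt
shows a mount whose root is a subdirectory of the filesystem rather than its
top level. The same information is the "root" field (field 4) of
/proc/self/mountinfo. A minimal sketch for listing such mounts, assuming a
Linux /proc and awk; the function name list_subtree_mounts is made up for
illustration:

```shell
# Hypothetical helper: read mountinfo-format lines on stdin and print
# "<mount point> <root>" for every mount whose root is not "/".
# mountinfo fields: 1=mount ID 2=parent ID 3=major:minor 4=root 5=mount point
list_subtree_mounts() {
    awk '$4 != "/" { print $5, $4 }'
}

# Usage on a live system (output depends on the machine):
#   list_subtree_mounts < /proc/self/mountinfo
```

Run inside the container, this should list /sys/fs/cgroup/devices with root
/user.slice if the findmnt output above is anything to go by. It only shows
what the mounts look like, not the order in which the runtime created them.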
Thanks,
Lewis
On Thu, 14 Sept 2023 at 12:46, Giuseppe Scrivano <gscrivan(a)redhat.com>
wrote:
Lewis Gaul <lewis.gaul(a)gmail.com> writes:
> Hi,
>
> I'm trying to understand something about how capabilities in rootless
> podman work.
>
> How does rootless podman have the capability to set up container mounts
> (such as cgroup mounts) given a privileged container itself doesn't? Does
> podman deliberately drop caps, or somehow get elevated privileges to do
> this?
>
> This is the process tree podman sets up (where bash is the container
> entrypoint here):
> systemd(1)---conmon(1327421)---bash(1327432)
>
> I'm assuming it's conmon that sets up the container's mounts (via runc
> in this case), which is a process running as my user (rootless). How is it
> that conmon has the capabilities required (SYS_ADMIN?) to create the
> container's cgroup and sysfs mounts but within the container itself this is
> not possible?
>
> Thanks for any insight!
A rootless container is able to perform "privileged" operations by using a
user namespace, and in that user namespace it gains the capabilities
required to perform mounts.
Be aware that root in a user namespace is still limited in what it can
do, as the kernel differentiates between root in the initial user
namespace (i.e. root on the host) and root in any other user namespace.
The user namespace is special in that it alters how other namespaces
work: every other namespace is "owned" by a user namespace.
So a user namespace alone is not enough to perform mounts; the user must
also create a new mount namespace, which is then owned by that user
namespace. The combination user namespace+mount namespace is what
"podman unshare" creates.
For example:
$ podman unshare
$ id
uid=0(root) gid=0(root) groups=0(root),65534(nobody) context=unconfined_u:unconfined_r:container_runtime_t:s0-s0:c0.c1023
$ mkdir /tmp/test
$ mount -t tmpfs tmpfs /tmp/test
$ exit
You can try manually:
$ unshare -r bash ## creates a user namespace and maps your user to root
$ mkdir /tmp/test; mount -t tmpfs tmpfs /tmp/test
mkdir: cannot create directory ‘/tmp/test’: File exists
mount: /tmp/test: permission denied.
dmesg(1) may have more information after failed mount system call.
The failure happens because the new user namespace does not own the
current mount namespace, which is still owned by the initial user
namespace.
So in order to perform a mount, you must create a mount namespace:
$ unshare -m bash ## the new mount namespace is owned by the current
## user namespace
$ mount -t tmpfs tmpfs /tmp/test
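The two steps can also be done in a single unshare invocation, since -r and
-m can be given together. A minimal sketch, assuming util-linux unshare and
findmnt and a kernel that allows unprivileged user namespaces; the function
name try_userns_mount is made up for illustration:

```shell
# Hypothetical helper: mount a tmpfs in a fresh user+mount namespace pair.
# -r (--map-root-user) creates a user namespace and maps the caller to root;
# -m (--mount) creates a mount namespace owned by that new user namespace.
try_userns_mount() {
    dir=$(mktemp -d) || return 1
    # The inner shell runs as "root" inside the new namespaces; on success
    # it prints the filesystem type found at the mount point.
    unshare -r -m sh -c 'mount -t tmpfs tmpfs "$1" && findmnt -n -o FSTYPE "$1"' sh "$dir"
    status=$?
    # The mount existed only in the new mount namespace, so the directory
    # is still empty and unmounted here and can simply be removed.
    rmdir "$dir"
    return "$status"
}
```

Calling try_userns_mount should print tmpfs if the mount worked, and return
non-zero if unshare or mount was refused (for example where unprivileged
user namespaces are disabled).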
In the rootless container case, the container mounts are performed by
the OCI runtime that runs in the user+mount namespace created by
Podman.
Regards,
Giuseppe