Lewis Gaul <lewis.gaul(a)gmail.com> writes:
Hi Giuseppe,
Thanks, some useful points there. However, my question was more
specifically about how "special" mounts get created in containers, given
it's not possible for the container process itself to create them. A
concrete example below using rootless podman...
> podman run --rm -it --name ubuntu --privileged ubuntu:20.04
root@b2069e97cd13:/# findmnt -R /sys/fs/cgroup/freezer
TARGET SOURCE FSTYPE OPTIONS
/sys/fs/cgroup/freezer cgroup cgroup rw,nosuid,nodev,noexec,relatime,seclabel,freezer
root@b2069e97cd13:/# umount /sys/fs/cgroup/freezer
root@b2069e97cd13:/# mount -t cgroup cgroup /sys/fs/cgroup/freezer -o rw,nosuid,nodev,noexec,relatime,seclabel,freezer
mount: /sys/fs/cgroup/freezer: permission denied.
This shows that cgroup mounts are present in the container, and yet the container does
not have permission to create the mount.
However, I've realised these are perhaps just bind mounts from the host mount
namespace? I can simulate this as follows:
Yes, rootless containers do not use cgroup v1 controllers. The cgroup
mounts you see in the container are bind mounts from the host.
> podman run --rm -it --name ubuntu --privileged -v /sys/fs/cgroup:/tmp/host/cgroup:ro ubuntu:20.04
root@495f11acdd5b:/# findmnt -R /tmp/host/cgroup/freezer/
TARGET SOURCE FSTYPE OPTIONS
/tmp/host/cgroup/freezer cgroup cgroup rw,nosuid,nodev,noexec,relatime,seclabel,freezer
root@495f11acdd5b:/# umount /sys/fs/cgroup/freezer
root@495f11acdd5b:/# mount --bind /tmp/host/cgroup/freezer /sys/fs/cgroup/freezer
root@495f11acdd5b:/# findmnt -R /sys/fs/cgroup/freezer/
TARGET SOURCE FSTYPE OPTIONS
/sys/fs/cgroup/freezer cgroup cgroup rw,nosuid,nodev,noexec,relatime,seclabel,freezer
One further thing I'm unclear on is as follows. It seems that when a new
mount namespace is created the mount list is copied from the parent
process, but some of the container cgroup mounts are bind mounts at some
point in the hierarchy rather than being the same as the host mounts.
Perhaps the container runtime first unmounts /sys/fs/cgroup in the
container mount namespace before creating these bind mounts?
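The container runtime creates all the mounts for the container under the
container rootfs directory and then it uses pivot_root() to change the
root for the current mount namespace. You can think of pivot_root() as
being like chroot(), except that it swaps the root mount of the mount
namespace rather than only changing the calling process's root directory.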
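As a rough illustration of that sequence (this is not the actual code of
runc or crun), the sketch below builds a throwaway root on a tmpfs inside
a new user+mount namespace, bind-mounts the host's /sys/fs/cgroup into
it, and then calls pivot_root(8). The function name pivot_sketch and the
cgroup/ and oldroot/ directory names are made up for the example. It
assumes util-linux's unshare and pivot_root are on PATH and that
unprivileged user namespaces are enabled; if any step fails it just
prints "unavailable".

```shell
# Hypothetical helper: mimic "set up mounts under a rootfs, then pivot_root".
pivot_sketch() {
    root=$(mktemp -d) || return 1
    unshare --user --map-root-user --mount sh -c '
        # pivot_root needs the new root to be a mount point; a tmpfs will do.
        mount -t tmpfs tmpfs "$1" || exit 1
        mkdir "$1/oldroot" "$1/cgroup" || exit 1
        # Recursive bind mount of the cgroup tree as seen from the host.
        mount --rbind /sys/fs/cgroup "$1/cgroup" || exit 1
        # Swap the root of this mount namespace; the old root lands in /oldroot.
        cd "$1" && pivot_root . oldroot || exit 1
        # Only shell builtins from here on: the new root contains no binaries.
        cd / && [ -d /cgroup ] && [ -d /oldroot ] && echo pivoted
    ' sh "$root" 2>/dev/null || echo unavailable
    # The mounts vanish with the namespace, leaving an empty directory behind.
    rmdir "$root"
}

pivot_sketch
```

A real runtime would also detach the old root afterwards (for example
with a lazy unmount), so that the host filesystem is no longer reachable
from inside the container.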
root@495f11acdd5b:/# findmnt /sys/fs/cgroup/devices
TARGET SOURCE FSTYPE OPTIONS
/sys/fs/cgroup/devices cgroup[/user.slice] cgroup rw,nosuid,nodev,noexec,relatime,seclabel,devices
root@495f11acdd5b:/# findmnt /tmp/host/cgroup/devices
TARGET SOURCE FSTYPE OPTIONS
/tmp/host/cgroup/devices cgroup cgroup rw,nosuid,nodev,noexec,relatime,seclabel,devices
Thanks,
Lewis
On Thu, 14 Sept 2023 at 12:46, Giuseppe Scrivano <gscrivan(a)redhat.com> wrote:
Lewis Gaul <lewis.gaul(a)gmail.com> writes:
> Hi,
>
> I'm trying to understand something about how capabilities in rootless
> podman work.
>
> How does rootless podman have the capability to set up container mounts
> (such as cgroup mounts) given a privileged container itself doesn't?
> Does podman deliberately drop caps, or somehow get elevated privileges
> to do this?
>
> This is the process tree podman sets up (where bash is the container
> entrypoint here):
> systemd(1)---conmon(1327421)---bash(1327432)
>
> I'm assuming it's conmon that sets up the container's mounts (via runc
> in this case), which is a process running as my user (rootless). How is
> it that conmon has the capabilities required (SYS_ADMIN?) to create the
> container's cgroup and sysfs mounts but within the container itself this
> is not possible?
>
> Thanks for any insight!
A rootless container is able to perform "privileged" operations by using a
user namespace, and in that user namespace it gains the capabilities
required to perform mounts.
Be aware that in a user namespace, the root user is still limited in
what it can do, as the kernel differentiates between the root user on
the host (in what is known as the initial user namespace) and root in
any other user namespace.
The user namespace is a special namespace that alters how other
namespaces work, since each namespace is "owned" by a user namespace.
So a user namespace alone is not enough to perform mounts; the user must
also create a new mount namespace. The combination user namespace+mount
namespace is what "podman unshare" creates.
For example:
$ podman unshare
$ id
uid=0(root) gid=0(root) groups=0(root),65534(nobody) context=unconfined_u:unconfined_r:container_runtime_t:s0-s0:c0.c1023
$ mkdir /tmp/test
$ mount -t tmpfs tmpfs /tmp/test
$ exit
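To check which namespaces a given shell is actually in, you can compare
the namespace identifiers under /proc. This is only a small sketch, and
show_ns is a name made up for the example. Run it once in a normal shell
and once inside "podman unshare": both the user and mount namespace
identifiers should differ between the two.

```shell
# Hypothetical helper: print the user and mount namespace of the current
# shell, plus the uid mapping of its user namespace.
show_ns() {
    echo "user ns:  $(readlink /proc/self/ns/user)"
    echo "mount ns: $(readlink /proc/self/ns/mnt)"
    echo "uid_map:"
    cat /proc/self/uid_map
}

show_ns
```

In the initial user namespace the uid_map is a single identity mapping
covering the whole uid range, while under "unshare -r" it is a single
line mapping uid 0 to your own uid.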
You can try manually:
$ unshare -r bash ## creates a user namespace and maps your user to root
$ mkdir /tmp/test; mount -t tmpfs tmpfs /tmp/test
mkdir: cannot create directory ‘/tmp/test’: File exists
mount: /tmp/test: permission denied.
dmesg(1) may have more information after failed mount system call.
The mount fails because the new user namespace does not own the current
mount namespace, which is still owned by the initial user namespace.
So in order to perform a mount, you must create a mount namespace:
$ unshare -m bash ## the new mount namespace is owned by the current
## user namespace
$ mount -t tmpfs tmpfs /tmp/test
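The two steps can also be combined into a single unshare invocation,
which is roughly the user+mount namespace combination described above.
This is a sketch under assumptions: try_userns_mount is a made-up name,
it needs util-linux's unshare, and it prints "unavailable" on systems
where unprivileged user namespaces are disabled.

```shell
# Hypothetical helper: create a user namespace and a mount namespace in
# one step, then mount a tmpfs inside and verify that it is visible.
try_userns_mount() {
    dir=$(mktemp -d) || return 1
    if unshare --user --map-root-user --mount \
        sh -c 'mount -t tmpfs tmpfs "$1" && grep -qF " $1 tmpfs " /proc/self/mounts' \
        sh "$dir" 2>/dev/null
    then
        echo "mounted"
    else
        echo "unavailable"
    fi
    # The tmpfs only existed in the new mount namespace, so the
    # directory on the host side is still empty here.
    rmdir "$dir"
}

try_userns_mount
```

The mount is never visible from the host's mount namespace and
disappears as soon as the last process in the new namespace exits.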
In the rootless container case, the container mounts are performed by
the OCI runtime that runs in the user+mount namespace created by
Podman.
Regards,
Giuseppe