Hi Giuseppe,

Thanks, some useful points there. However, my question was more specifically around how "special" mounts get created in containers, given it's not possible for the container process itself to create them. A concrete example below using rootless podman...

> podman run --rm -it --name ubuntu --privileged ubuntu:20.04
root@b2069e97cd13:/# findmnt -R /sys/fs/cgroup/freezer
TARGET                 SOURCE FSTYPE OPTIONS
/sys/fs/cgroup/freezer cgroup cgroup rw,nosuid,nodev,noexec,relatime,seclabel,freezer
root@b2069e97cd13:/# umount /sys/fs/cgroup/freezer
root@b2069e97cd13:/# mount -t cgroup cgroup /sys/fs/cgroup/freezer -o rw,nosuid,nodev,noexec,relatime,seclabel,freezer
mount: /sys/fs/cgroup/freezer: permission denied.

This shows that cgroup mounts are present in the container, and yet the container does not have permission to create the mount.

However, I've realised these are perhaps just bind mounts from the host mount namespace? I can simulate this as follows:

> podman run --rm -it --name ubuntu --privileged -v /sys/fs/cgroup:/tmp/host/cgroup:ro ubuntu:20.04
root@495f11acdd5b:/# findmnt -R /tmp/host/cgroup/freezer/
TARGET                   SOURCE FSTYPE OPTIONS
/tmp/host/cgroup/freezer cgroup cgroup rw,nosuid,nodev,noexec,relatime,seclabel,freezer
root@495f11acdd5b:/# umount /sys/fs/cgroup/freezer
root@495f11acdd5b:/# mount --bind /tmp/host/cgroup/freezer /sys/fs/cgroup/freezer
root@495f11acdd5b:/# findmnt -R /sys/fs/cgroup/freezer/
TARGET                 SOURCE FSTYPE OPTIONS
/sys/fs/cgroup/freezer cgroup cgroup rw,nosuid,nodev,noexec,relatime,seclabel,freezer

One further thing I'm unclear on: when a new mount namespace is created, the mount list appears to be copied from the parent process, yet some of the container's cgroup mounts are bind mounts of a subtree of the hierarchy rather than being identical to the host mounts. Perhaps the container runtime first unmounts /sys/fs/cgroup in the container mount namespace before creating these bind mounts?

root@495f11acdd5b:/# findmnt /sys/fs/cgroup/devices
TARGET                 SOURCE              FSTYPE OPTIONS
/sys/fs/cgroup/devices cgroup[/user.slice] cgroup rw,nosuid,nodev,noexec,relatime,seclabel,devices
root@495f11acdd5b:/# findmnt /tmp/host/cgroup/devices
TARGET                   SOURCE FSTYPE OPTIONS
/tmp/host/cgroup/devices cgroup cgroup rw,nosuid,nodev,noexec,relatime,seclabel,devices
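
The bracketed part of the SOURCE column above (cgroup[/user.slice]) is the root of the mount within its filesystem, which the kernel also exposes as the fourth field of /proc/self/mountinfo. As a rough sketch only, the helper below (mount_roots is a hypothetical name, not a podman or util-linux command) prints the mount point and root of every mount of a given filesystem type, assuming the mountinfo layout described in proc(5):

```shell
# Print "mountpoint root" for every mount of the given filesystem type.
# Reads mountinfo-formatted lines on stdin: field 4 is the root of the mount
# within its filesystem, field 5 is the mount point, and the filesystem type
# is the first field after the "-" separator that ends the optional fields.
mount_roots() {
    awk -v fstype="$1" '{
        # Optional fields start at field 7; find the "-" that terminates them.
        for (i = 7; i <= NF; i++) if ($i == "-") break
        if ($(i + 1) == fstype) print $5, $4
    }'
}
```

Run as "mount_roots cgroup < /proc/self/mountinfo" inside the container. A root other than "/" should correspond to what findmnt prints in brackets, i.e. a mount exposing only a subtree of the hierarchy.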

Thanks,
Lewis

On Thu, 14 Sept 2023 at 12:46, Giuseppe Scrivano <gscrivan@redhat.com> wrote:
Lewis Gaul <lewis.gaul@gmail.com> writes:

> Hi,
>
> I'm trying to understand something about how capabilities in rootless podman work.
>
> How does rootless podman have the capability to set up container mounts (such as cgroup mounts) given a privileged container itself doesn't? Does podman deliberately drop caps, or somehow get elevated privileges to do this?
>
> This is the process tree podman sets up (where bash is the container entrypoint here):
> systemd(1)---conmon(1327421)---bash(1327432)
>
> I'm assuming it's conmon that sets up the container's mounts (via runc in this case), which is a process running as my user (rootless). How is it that conmon has the capabilities required (SYS_ADMIN?) to create the container's cgroup and sysfs mounts but within the container itself this is not possible?
>
> Thanks for any insight!

A rootless container is able to perform "privileged" operations by using a
user namespace, and in that user namespace it gains the capabilities
required to perform mounts.

Be aware that in a user namespace, the root user is still limited in
what it can do, as the kernel differentiates between the root user on
the host (known as the initial user namespace) and any other user
namespace.

The user namespace is a special namespace that alters how other
namespaces work, since each namespace is "owned" by a user namespace.

So a user namespace alone is not enough to perform mounts; the user must
also create a new mount namespace.  The combination of a user namespace
and a mount namespace is what "podman unshare" creates.

For example:

$ podman unshare
$ id
uid=0(root) gid=0(root) groups=0(root),65534(nobody) context=unconfined_u:unconfined_r:container_runtime_t:s0-s0:c0.c1023
$ mkdir /tmp/test
$ mount -t tmpfs tmpfs /tmp/test
$ exit

You can try manually:

$ unshare -r bash ## creates a user namespace and maps your user to root
$ mkdir /tmp/test; mount -t tmpfs tmpfs /tmp/test
mkdir: cannot create directory ‘/tmp/test’: File exists
mount: /tmp/test: permission denied.
       dmesg(1) may have more information after failed mount system call.

The failure happens because the new user namespace does not own the
mount namespace, which is still owned by the initial user namespace.
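
One way to observe this is to compare namespace identifiers before and after "unshare -r". The helper below is only an illustrative sketch (show_ns is a hypothetical name); it assumes a Linux /proc that exposes the ns/user and ns/mnt links:

```shell
# Print the user and mount namespace identifiers of a process.
# With no argument it reports on the calling shell's own namespaces.
show_ns() {
    pid="${1:-self}"
    printf 'user=%s mnt=%s\n' \
        "$(readlink "/proc/$pid/ns/user")" \
        "$(readlink "/proc/$pid/ns/mnt")"
}
```

Running show_ns in the original shell and again inside "unshare -r bash" should show a different user= value but the same mnt= value, meaning the shell is still in the host's mount namespace.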

So in order to perform a mount, you must also create a mount namespace
from within the user namespace:

$ unshare -m bash ## the new mount namespace is owned by the current
                  ## user namespace
$ mount -t tmpfs tmpfs /tmp/test

In the rootless container case, the container mounts are performed by
the OCI runtime that runs in the user+mount namespace created by
Podman.
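
The two manual steps can also be combined in a single unshare call. This is a minimal sketch, not what Podman or the OCI runtime actually executes; userns_tmpfs_mount is a hypothetical name, and it assumes the util-linux unshare and mount tools and a host that allows unprivileged user namespaces:

```shell
# Try a tmpfs mount inside a fresh user + mount namespace pair.
# Prints "mounted" on success, or "unavailable" if the namespaces could not
# be created or the mount was refused (for example when unprivileged user
# namespaces are disabled on the host).
userns_tmpfs_mount() {
    dir=$(mktemp -d) || return 1
    if unshare --user --map-root-user --mount \
        mount -t tmpfs tmpfs "$dir" 2>/dev/null; then
        echo mounted
    else
        echo unavailable
    fi
    # The mount lived only in the temporary mount namespace, so the
    # directory is empty again on the host side and can be removed.
    rmdir "$dir"
}
```

Because the mount namespace is created together with the user namespace, it is owned by that user namespace, so the mapped root user is allowed to mount the tmpfs there.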

Regards,
Giuseppe