Hi Podman team,

I came across some unexpected systemd warnings when running systemd inside a container. I emailed systemd-devel (this email summarises that thread, which you can find at https://lists.freedesktop.org/archives/systemd-devel/2023-January/048723.html), and Lennart suggested emailing here. Any thoughts would be great!

There are two different warnings, seen in different scenarios. Both are cgroup-related, and I believe they are related to each other, since the points below apply to both.

The first warning is seen after 'podman restart $CTR', coming from https://github.com/systemd/systemd/blob/v245/src/shared/cgroup-setup.c#L279:
Failed to attach 1 to compat systemd cgroup /machine.slice/libpod-5e4ab2a36681c092f4ef937cf03b25a8d3d7b2fa530559bf4dac4079c84d0313.scope/init.scope: No such file or directory

The second warning is seen on every boot when using '--cgroupns=private', coming from https://github.com/systemd/systemd/blob/v245/src/core/cgroup.c#L2967:
Couldn't move remaining userspace processes, ignoring: Input/output error
Failed to create compat systemd cgroup /system.slice: No such file or directory
...

Both warnings are seen together when restarting a container that uses a private cgroup namespace.

To summarise:
- The warnings are seen when running the container on a CentOS 8 host, but not on an Ubuntu 20.04 host
- It is assumed this issue is specific to cgroups v1, based on the warning messages
- Disabling SELinux on the host with 'setenforce 0' makes no difference
- Seen with systemd v245 but not with v230
- Seen with '--privileged' and also in non-privileged mode with '--cap-add sys_admin'
- Changing the cgroup driver/manager doesn't seem to have any effect
- The same is seen with Docker, except that when running privileged the first warning becomes a fatal error after hitting "Failed to open pin file: No such file or directory" (coming from https://github.com/systemd/systemd/blob/v245/src/core/cgroup.c#L2972), and the container exits. (Docker doesn't claim to support systemd, however.)
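
For anyone comparing hosts, the cgroups v1 assumption above can be checked with a quick sketch like the one below. It relies on /sys/fs/cgroup being a cgroup2 mount on a unified (v2) host and a tmpfs on a v1 or hybrid host; the function name cgroup_version is made up for illustration.

```shell
# Map the filesystem type of /sys/fs/cgroup to a cgroup layout.
# 'cgroup_version' is a hypothetical helper name, not part of any tool.
cgroup_version() {
    case "$1" in
        cgroup2fs) echo "v2 (unified)" ;;
        tmpfs)     echo "v1 (legacy or hybrid)" ;;
        *)         echo "unknown ($1)" ;;
    esac
}

# 'stat -fc %T' prints the filesystem type name of the given mount point.
cgroup_version "$(stat -fc %T /sys/fs/cgroup 2>/dev/null)"
```

Running this on both the CentOS 8 and the Ubuntu 20.04 host would show whether they actually differ in cgroup layout.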

Some extra details copied from the systemd email thread:
- On first boot PID 1 can be found in /sys/fs/cgroup/systemd/machine.slice/libpod-<ctr-id>.scope/init.scope/cgroup.procs, whereas when the container restarts the 'init.scope/' directory does not exist and PID 1 is instead found in the parent (container root) cgroup /sys/fs/cgroup/systemd/machine.slice/libpod-<ctr-id>.scope/cgroup.procs (also reflected by /proc/1/cgroup). This is strange because systemd must be the one that creates this cgroup dir on the initial boot, so I'm not sure why it wouldn't on subsequent boots.
- I confirmed that the container has permission to create the dir by running 'mkdir' in /sys/fs/cgroup/systemd/machine.slice/libpod-<ctr-id>.scope/ inside the container after the restart, so I have no idea why systemd is not creating the 'init.scope/' dir. I also notice that inside the container's systemd cgroup mount 'system.slice/' does exist after the restart, but 'user.slice/', like 'init.scope/', does not (both exist on a normal boot).
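
The PID 1 placement described above can be compared before and after a restart with a small sketch like this one. It assumes the cgroups v1 format of /proc/<pid>/cgroup (hierarchy-ID:controller-list:path); the function name systemd_cgroup_path is made up for illustration.

```shell
# Print the path a process has in the named 'systemd' v1 hierarchy.
# Reads /proc/<pid>/cgroup content on stdin; each v1 line looks like
#   <hierarchy-id>:<controller-list>:<cgroup-path>
# 'systemd_cgroup_path' is a hypothetical helper name.
systemd_cgroup_path() {
    # Strip the "<id>:name=systemd:" prefix and print what remains,
    # so that any ':' inside the path itself is preserved.
    sed -n 's/^[0-9]*:name=systemd://p'
}
```

For example, 'podman exec ubuntu cat /proc/1/cgroup | systemd_cgroup_path' could be run once after the first boot and once after 'podman restart ubuntu' (with 'ubuntu' being the container name from the reproduction below) to see whether PID 1 has moved out of 'init.scope'.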

This should be reproducible using the following:
cat << EOF > Dockerfile
FROM ubuntu:20.04
RUN apt-get update -y && apt-get install systemd -y && ln -s /lib/systemd/systemd /sbin/init
ENTRYPOINT ["/sbin/init"]
EOF
podman build . --tag ubuntu-systemd
podman run -it --name ubuntu --privileged --cgroupns private ubuntu-systemd
# From a second terminal, since 'podman run -it' stays attached to the container:
podman restart ubuntu

Thanks,
Lewis