> It's bugged in 1.4.x - fixed in 1.5.0 (just released Friday).
Right on! I just installed a fresh podman package on Ubuntu 16.04 and
got version 1.5.0.
I made an example using the `--conmon-pidfile` flag to podman create.
I'd never made a custom pid file before, so I searched around; it looks
like people typically write pid files to /var/run/foo.pid.
```
# podman create \
--name bar \
--tty \
--conmon-pidfile /var/run/bar.pid \
alpine \
/bin/sh -c 'while true; do date; sleep 1; done'
33b7486fda2846ae780cef4913c3a46be6005e6bb75cf9be09ef7868543d8496
# podman generate systemd bar > /etc/systemd/system/bar.service
# systemctl daemon-reload
# systemctl start bar
# systemctl status bar
● bar.service - 33b7486fda2846ae780cef4913c3a46be6005e6bb75cf9be09ef7868543d8496
Loaded: loaded (/etc/systemd/system/bar.service; disabled; vendor preset: ena
Active: active (running) since Mon 2019-08-12 18:53:22 UTC; 31s ago
Process: 15832 ExecStart=/usr/bin/podman start 33b7486fda2846ae780cef4913c3a46
Main PID: 15951 (conmon)
Tasks: 0
Memory: 364.0K
CPU: 71ms
CGroup: /system.slice/bar.service
‣ 15951 /usr/bin/conmon --api-version 1 -s -c 33b7486fda2846ae780cef4
Aug 12 18:53:22 srv0 systemd[1]: Starting 33b7486fda2846ae780cef4913c3a46be6005e
Aug 12 18:53:22 srv0 podman[15832]: 2019-08-12 18:53:22.43628578 +0000 UTC m=+0.
Aug 12 18:53:22 srv0 podman[15832]: 2019-08-12 18:53:22.443460842 +0000 UTC m=+0
Aug 12 18:53:22 srv0 podman[15832]: 33b7486fda2846ae780cef4913c3a46be6005e6bb75c
Aug 12 18:53:22 srv0 systemd[1]: Started 33b7486fda2846ae780cef4913c3a46be6005e6
# podman ps | grep bar
33b7486fda28 docker.io/library/alpine:latest /bin/sh -c while ...
40 seconds ago Up 18 seconds ago bar
# systemd-cgls /system.slice/bar.service
Control group /system.slice/bar.service:
# systemd-cgls /machine.slice
Control group /machine.slice:
├─libpod-33b7486fda2846ae780cef4913c3a46be6005e6bb75cf9be09ef7868543d8496.scope
│ ├─15966 /bin/sh -c while true; do date; sleep 1; done
│ └─16234 sleep 1
└─libpod-conmon-33b7486fda2846ae780cef4913c3a46be6005e6bb75cf9be09ef7868543d8496
└─15951 /usr/bin/conmon --api-version 1 -s -c 33b7486fda2846ae780cef4913c3a46b
```
`systemctl status` reports that the conmon process is in the
/system.slice/bar.service cgroup, but I'm not seeing the
/system.slice/bar.service cgroup in the output of `systemd-cgls`;
instead I only see the two new cgroups for the scope units in
/machine.slice.
Is it unexpected that `systemctl status` shows a cgroup that doesn't exist?
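One way to double-check which cgroups a process is really in is to ask
the kernel directly via /proc rather than systemd. This is just a
sketch: `CONMON_PID` is a variable I made up, and you'd substitute
conmon's Main PID from the `systemctl status` output (15951 above); it
defaults to the current shell so the snippet runs anywhere.

```shell
# Ask the kernel which cgroups a PID belongs to; systemd's accounting
# can disagree with this view.
# CONMON_PID is hypothetical - set it to conmon's Main PID from
# `systemctl status` (15951 in the output above).
pid=${CONMON_PID:-$$}
cat /proc/$pid/cgroup
```

Each output line shows a cgroup hierarchy and the path within it, so
you can see directly whether the PID sits under /system.slice or
/machine.slice.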
On Mon, Aug 12, 2019 at 11:20 AM Matt Heon <mheon(a)redhat.com> wrote:
>
> On 2019-08-12 10:56, Max Bigras wrote:
> >Thank you for the explanation and link to WIP PR #3581 [1]!
> >After reading your explanation and answers to my question on Stack
> >Exchange [2] I'm learning more about what's going on.
> >
> >>Also, a bit more context: the Conmon CGroup is not the container
> >>CGroup. Conmon creates its own CGroup (for various legacy reasons -
> >>we're evaluating whether these still hold true, and this could change)
> >>and then spawns the OCI runtime - and then the OCI runtime spawns its
> >>own CGroup. So you'll have a Conmon CGroup and another for the actual
> >>container (the two 'libpod' cgroups).
> >
> >I made an example illustrating the three cgroups you mentioned:
> >
> >1. /system.slice/example.service for my systemd service
> >2. /machine.slice/libpod-conmon-a75a16081e23b...scope for conmon
> >3. /machine.slice/libpod-a75a16081e23bb03589....scope for processes
> >inside the container (sleep in this case)
> >
> >```
> ># podman create --name example --tty alpine sleep 9999
> ># cat <<'EOF' > /etc/systemd/system/example.service
> >[Service]
> >ExecStart=/usr/bin/podman start --attach %N
> >ExecStop=/usr/bin/podman stop %N
> >EOF
> ># systemctl daemon-reload
> ># systemd-cgls /machine.slice
> >Control group /machine.slice
> ># systemctl start example
> ># systemd-cgls /machine.slice
> >Control group /machine.slice:
> >├─libpod-conmon-a75a16081e23bb03589b214580b3226d8b2ef77a382c5cf7845466823742b297
> >│ └─22385 /usr/bin/conmon -s -c a75a16081e23bb03589b214580b3226d8b2ef77a382c5cf7
> >└─libpod-a75a16081e23bb03589b214580b3226d8b2ef77a382c5cf7845466823742b297.scope
> > └─22399 sleep 9999
> ># systemd-cgls /system.slice/example.service
> >Control group /system.slice/example.service:
> >└─22308 /usr/bin/podman start --attach example
> >```
> >
> >From listening to Lennart's YouTube presentation on systemd and
> >Control Groups [3], I learned that a scope unit is like a service
> >unit in that both manage processes; however, a scope unit differs
> >in that scope units can be created dynamically at runtime. That
> >explanation matches what we're seeing here: two dynamically
> >generated scope units, one for conmon and one for sleep.
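> >
> >A transient scope can also be created on demand with `systemd-run
> >--scope` (a sketch, assuming `systemd-run` is available; `demo` is
> >just a unit name I picked):
> >
> >```
> ># systemd-run --scope --unit=demo sleep 30 &
> ># systemctl status demo.scope
> >```
> >
> >which shows systemd generating a scope unit on the fly for an
> >arbitrary command, just like it does for conmon and the container.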
> >
> >>Because of this, we recommend tracking Conmon, not Podman, with unit
> >>files. In Podman 1.5.0 and later, 'podman generate systemd' will
> >>properly handle this, creating a unit file that tracks Conmon using
> >>a PID file.
> >
> >Can you illustrate an example of using systemd to track conmon instead
> >of podman?
>
> The unit file generated by `generate systemd` below is a good example
> of what we want to use - in contrast to a `Type=simple` unit file
> that runs `podman start --attach` (which was previously recommended).
> As previously mentioned, Podman becomes effectively independent of
> the container once it's running - the Podman process can disappear
> while the container keeps running, which doesn't work well with a
> Type=simple unit.
>
> >
> >My goal is to have one way to stop, start, and enable my containers.
> >
> >In the example above I know I can check the status of the conmon scope
> >unit; it looks like the name comes from the container id:
> >
> >```
> ># systemctl status libpod-conmon-a75a16081e23bb03589b214580b3226d8b2ef77a382c5cf7845466823742b297.scope
> >● libpod-conmon-a75a16081e23bb03589b214580b3226d8b2ef77a382c5cf7845466823742b297
> > Loaded: loaded
> >Transient: yes
> > Drop-In: /run/systemd/system/libpod-conmon-a75a16081e23bb03589b214580b3226d8b2
> > └─50-DefaultDependencies.conf, 50-Delegate.conf, 50-Slice.conf
> > Active: active (running) since Mon 2019-08-12 17:20:17 UTC; 2min 20s ago
> > Tasks: 2
> > Memory: 200.0K
> > CPU: 17ms
> > CGroup: /machine.slice/libpod-conmon-a75a16081e23bb03589b214580b3226d8b2ef77a
> > └─22631 /usr/bin/conmon -s -c a75a16081e23bb03589b214580b3226d8b2ef77
> >
> >Aug 12 17:20:17 srv0 systemd[1]: Started libpod-conmon-a75a16081e23bb03589b21458
> ># podman ps
> >CONTAINER ID IMAGE COMMAND CREATED
> > STATUS PORTS NAMES
> >a75a16081e23 docker.io/library/alpine:latest sleep 9999 41 minutes
> >ago Up 22 minutes ago example
> ># systemctl show libpod-conmon-a75a16081e23bb03589b214580b3226d8b2ef77a382c5cf7845466823742b297.scope | grep ExecStart
> >```
> >
> >I didn't see an ExecStart directive inside the scope unit.
> >
> >>podman generate systemd
> >
> >Ah, I see what you mean
> >
> >```
> ># podman generate systemd example > /etc/systemd/system/example.service
> ># systemctl daemon-reload
> ># systemctl cat example
> >[Unit]
> >Description=a75a16081e23bb03589b214580b3226d8b2ef77a382c5cf7845466823742b297 Pod
> >[Service]
> >Restart=on-failure
> >ExecStart=/usr/bin/podman start a75a16081e23bb03589b214580b3226d8b2ef77a382c5cf7
> >ExecStop=/usr/bin/podman stop -t 10 a75a16081e23bb03589b214580b3226d8b2ef77a382c
> >KillMode=none
> >Type=forking
> >PIDFile=/var/lib/containers/storage/overlay-containers/a75a16081e23bb03589b21458
> >[Install]
> >WantedBy=multi-user.target
> ># systemctl start example
> ># journalctl -u example | grep PID | tail -n 1
> >Aug 12 17:50:38 srv0 systemd[1]: example.service: PID file /var/lib/containers/storage/overlay-containers/a75a16081e23bb03589b214580b3226d8b2ef77a382c5cf7845466823742b297/userdata/a75a16081e23bb03589b214580b3226d8b2ef77a382c5cf7845466823742b297.pid not readable (yet?) after start: No such file or directory
> >```
> >
> >Is using `podman generate systemd example` planned for the future or
> >is it supposed to work now?
>
> It's bugged in 1.4.x - fixed in 1.5.0 (just released Friday).
>
> >
> >[1] https://github.com/containers/libpod/pull/3581
> >[2] https://unix.stackexchange.com/questions/534843/why-is-conmon-in-a-differ...
> >[3] https://youtu.be/7CWmuhkgZWs?t=2204
> >
> >On Mon, Aug 12, 2019 at 7:01 AM Matt Heon <mheon(a)redhat.com> wrote:
> >>
> >> On 2019-08-10 13:17, Max Bigras wrote:
> >> >Given an alpine:3.10.1 image
> >> >
> >> >```
> >> >podman pull alpine:3.10.1
> >> >```
> >> >
> >> >And a unit file foo.service
> >> >
> >> >```
> >> >[Service]
> >> >ExecStart=/usr/bin/podman run --name %N --rm --tty alpine:3.10.1 sleep 99999
> >> >ExecStop=/usr/bin/podman stop %N
> >> >```
> >> >
> >> >And starting `foo.service` with `systemctl`
> >> >
> >> >```
> >> ># systemctl daemon-reload
> >> ># systemctl start foo.service
> >> >```
> >> >
> >> >I don't see my `sleep` process in `foo.service` status:
> >> >
> >> >```
> >> ># systemctl status foo.service | head
> >> >● foo.service
> >> > Loaded: loaded (/etc/systemd/system/foo.service; static; vendor
> >> >preset: enabled)
> >> > Active: active (running) since Sat 2019-08-10 19:58:05 UTC; 40s ago
> >> > Main PID: 15524 (podman)
> >> > Tasks: 9
> >> > Memory: 7.3M
> >> > CPU: 79ms
> >> > CGroup: /system.slice/foo.service
> >> > └─15524 /usr/bin/podman run --name foo --rm --tty
> >> >alpine:3.10.1 sleep 99999
> >> >```
> >> >
> >> >I see `conmon` land in a different cgroup, visible with the
> >> >`systemd-cgls` command:
> >> >
> >> >```
> >> ># systemd-cgls
> >> >Control group /:
> >> >-.slice
> >> >├─init.scope
> >> >│ └─1 /sbin/init
> >> >├─machine.slice
> >> >│ ├─libpod-conmon-c598f5a0c84881c69dcd69c5af981dd5071385138e45ce0c3b94dcc5308953a
> >> >│ │ └─15648 /usr/bin/conmon -s -c c598f5a0c84881c69dcd69c5af981dd5071385138e45ce0
> >> >│ └─libpod-c598f5a0c84881c69dcd69c5af981dd5071385138e45ce0c3b94dcc5308953a5.scope
> >> >│   └─15662 sleep 99999
> >> >├─system.slice
> >> >│ ├─mdadm.service
> >> >│ │ └─880 /sbin/mdadm --monitor --pid-file /run/mdadm/monitor.pid --daemonise --s
> >> >│ ├─foo.service
> >> >│ │ └─15524 /usr/bin/podman run --name foo --rm --tty alpine:3.10.1 sleep 99999
> >> >```
> >> >
> >> >From listening to youtube presentations about podman I thought podman
> >> >using a traditional fork exec model would imply all my processes would
> >> >show up in the same `systemctl status` and be in the same control
> >> >group controlled by systemd.
> >> >
> >> >Looking at the output of `ps` also shows that the `sleep` process is
> >> >the parent of the `conmon` process and not the `podman` process:
> >> >
> >> >```
> >> ># ps -Heo pid,ppid,comm,cgroup
> >> >15524     1 podman   11:memory:/system.slice/foo.service,8:pids:/system.sl
> >> >15648     1 conmon   11:memory:/machine.slice/libpod-conmon-c598f5a0c84881
> >> >15662 15648 sleep    11:memory:/machine.slice/libpod-c598f5a0c84881c69dcd6
> >> >```
> >> >
> >> >Instead it looks like `conmon` lands in a `scope` unit named:
> >> >
> >> >```
> >> >libpod-conmon-c598f5a0c84881c69dcd69c5af981dd5071385138e45ce0c3b94dcc5308953a5.scope
> >> >```
> >> >
> >> >Why don't `conmon` and `sleep` land in the same `foo.service` systemd unit?
> >> >_______________________________________________
> >> >Podman mailing list -- podman(a)lists.podman.io
> >> >To unsubscribe send an email to podman-leave(a)lists.podman.io
> >>
> >> All Linux containers, at present, will create and manage their own
> >> CGroups, independent of systemd (with a few potential exceptions).
> >> This is mostly done so they can independently manage resources, though
> >> it is also necessary to do some common container operations - the
> >> 'podman stats' and 'podman top' commands, for example, track processes
> >> in the container by its CGroup.
> >>
> >> The exception I mentioned was rootless containers in a CGroup v1
> >> environment. There is no support for delegation to rootless containers
> >> in the V1 hierarchy, so the containers have no permission to create
> >> CGroups.
> >>
> >> For most people, this is perfectly sufficient, but we definitely
> >> recognize that there are use cases that require keeping containers in
> >> CGroups managed elsewhere - most notably, from systemd unit files.
> >> There is work ongoing at [1] to enable this, though I caution that
> >> there are a lot of moving parts here - we don't just need support in
> >> Podman, but also in 'runc' - Podman isn't actually creating the new
> >> CGroup for the container, the OCI runtime (usually 'runc') is.
> >>
> >> Also, a bit more context: the Conmon CGroup is not the container
> >> CGroup. Conmon creates its own CGroup (for various legacy reasons -
> >> we're evaluating whether these still hold true, and this could change)
> >> and then spawns the OCI runtime - and then the OCI runtime spawns its
> >> own CGroup. So you'll have a Conmon CGroup and another for the actual
> >> container (the two 'libpod' cgroups).
> >>
> >> The parent situation is simpler to explain; Podman launches conmon,
> >> and then conmon double-forks to daemonize. At that point, the two are
> >> effectively separate; Podman itself can go away completely, and the
> >> Conmon process will continue managing the container. This is
> >> deliberate - Podman is just the frontend that launches the container,
> >> and we don't need it to keep running once the container is started.
> >>
> >> Because of this, we recommend tracking Conmon, not Podman, with unit
> >> files. In Podman 1.5.0 and later, 'podman generate systemd' will
> >> properly handle this, creating a unit file that tracks Conmon using
> >> a PID file.
> >>
> >> I hope this helps explain things.
> >>
> >> Thanks,
> >> Matt Heon
> >>
> >> [1]
https://github.com/containers/libpod/pull/3581