On 2019-08-12 10:56, Max Bigras wrote:
Thank you for the explanation and link to WIP PR #3581 [1]!
After reading your explanation and answers to my question on Stack
Exchange [2] I'm learning more about what's going on.
>Also, a bit more context: the Conmon CGroup is not the container
>CGroup. Conmon creates its own CGroup (for various legacy reasons -
>we're evaluating whether these still hold true, and this could change)
>and then spawns the OCI runtime - and then the OCI runtime spawns its
>own CGroup. So you'll have a Conmon CGroup and another for the actual
>container (the two 'libpod' cgroups).
I made an example illustrating the three cgroups you mentioned:
1. /system.slice/example.service for my systemd service
2. /machine.slice/libpod-conmon-a75a16081e23b...scope for conmon
3. /machine.slice/libpod-a75a16081e23bb03589....scope for processes
inside the container (sleep in this case)
```
# podman create --name example --tty alpine sleep 9999
# cat <<'EOF' > /etc/systemd/system/example.service
[Service]
ExecStart=/usr/bin/podman start --attach %N
ExecStop=/usr/bin/podman stop %N
EOF
# systemctl daemon-reload
# systemd-cgls /machine.slice
Control group /machine.slice
# systemctl start example
# systemd-cgls /machine.slice
Control group /machine.slice:
├─libpod-conmon-a75a16081e23bb03589b214580b3226d8b2ef77a382c5cf7845466823742b297
│ └─22385 /usr/bin/conmon -s -c a75a16081e23bb03589b214580b3226d8b2ef77a382c5cf7
└─libpod-a75a16081e23bb03589b214580b3226d8b2ef77a382c5cf7845466823742b297.scope
  └─22399 sleep 9999
# systemd-cgls /system.slice/example.service
Control group /system.slice/example.service:
└─22308 /usr/bin/podman start --attach example
```
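A minimal sketch for checking the same thing per process, assuming a
Linux host with /proc mounted. The helper names `cgroup_of` and
`cgroup_unit` are made up for illustration:

```shell
# Print the cgroup membership lines the kernel reports for a process.
# Assumes a Linux /proc filesystem; defaults to the current shell's PID.
cgroup_of() {
    pid="${1:-$$}"
    cat "/proc/${pid}/cgroup"
}

# Read cgroup lines on stdin and print the distinct final path components,
# which for systemd-managed cgroups are the unit names.
# Lines that end in "/" (the root cgroup) are skipped.
cgroup_unit() {
    sed -n 's|.*/\([^/][^/]*\)$|\1|p' | sort -u
}
```

Running `cgroup_of 22399 | cgroup_unit` on the host above is one way to
check which unit the `sleep` process (PID 22399) belongs to.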
From Lennart's YouTube presentation on systemd and control groups [3],
I learned that a scope unit is like a service unit in that it manages
processes, but differs in that scope units can be generated
dynamically. That matches what we're seeing here: two dynamically
generated scope units, one for conmon and one for sleep.
>Because of this, we recommend tracking Conmon, not Podman, with unit
>files. In Podman 1.5.0 and later, 'podman generate systemd' will
>properly handle this, creating a unit file that tracks Conmon using
>PID file.
Can you illustrate an example of using systemd to track conmon instead
of podman?
The unit file generated by `generate systemd` below is a good example
of what we want to use - in contrast to using a `Type=simple` unit
file using `podman start --attach` (which was previously recommended).
As previously mentioned, Podman becomes effectively independent of the
container when it's running - so the Podman process can disappear but
the container may still be running, which is not conducive to the
Type=simple unit.
My goal is to have one way to stop, start, and enable my containers.
In the example above I know I can check the status of the conmon scope
unit; its name appears to be derived from the container ID:
```
# systemctl status libpod-conmon-a75a16081e23bb03589b214580b3226d8b2ef77a382c5cf7845466823742b297.scope
● libpod-conmon-a75a16081e23bb03589b214580b3226d8b2ef77a382c5cf7845466823742b297
Loaded: loaded
Transient: yes
Drop-In: /run/systemd/system/libpod-conmon-a75a16081e23bb03589b214580b3226d8b2
└─50-DefaultDependencies.conf, 50-Delegate.conf, 50-Slice.conf
Active: active (running) since Mon 2019-08-12 17:20:17 UTC; 2min 20s ago
Tasks: 2
Memory: 200.0K
CPU: 17ms
CGroup: /machine.slice/libpod-conmon-a75a16081e23bb03589b214580b3226d8b2ef77a
└─22631 /usr/bin/conmon -s -c a75a16081e23bb03589b214580b3226d8b2ef77
Aug 12 17:20:17 srv0 systemd[1]: Started libpod-conmon-a75a16081e23bb03589b21458
# podman ps
CONTAINER ID  IMAGE                            COMMAND     CREATED         STATUS             PORTS  NAMES
a75a16081e23  docker.io/library/alpine:latest  sleep 9999  41 minutes ago  Up 22 minutes ago         example
# systemctl show libpod-conmon-a75a16081e23bb03589b214580b3226d8b2ef77a382c5cf7845466823742b297.scope | grep ExecStart
```
I didn't see an ExecStart directive inside the scope unit.
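That is consistent with how scope units work: they wrap processes that
something else started, so there is no command for systemd to run.
Since the scope names look like they are built from the container ID,
here is a small sketch for deriving them. The naming pattern is
inferred from the output above, not taken from Podman documentation,
and the function names are hypothetical:

```shell
# Build the transient scope unit names from a full container ID.
# Assumption: the "libpod-conmon-<id>.scope" and "libpod-<id>.scope"
# patterns seen in the systemd-cgls output hold in general.
conmon_scope() {
    printf 'libpod-conmon-%s.scope\n' "$1"
}

container_scope() {
    printf 'libpod-%s.scope\n' "$1"
}
```

Something like
`systemctl status "$(conmon_scope "$(podman inspect --format '{{.Id}}' example)")"`
should then find the conmon scope without copying the ID by hand,
assuming `podman inspect` exposes the full ID as `.Id`.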
>podman generate systemd
Ah, I see what you mean:
```
# podman generate systemd example > /etc/systemd/system/example.service
# systemctl daemon-reload
# systemctl cat example
[Unit]
Description=a75a16081e23bb03589b214580b3226d8b2ef77a382c5cf7845466823742b297 Pod
[Service]
Restart=on-failure
ExecStart=/usr/bin/podman start a75a16081e23bb03589b214580b3226d8b2ef77a382c5cf7
ExecStop=/usr/bin/podman stop -t 10 a75a16081e23bb03589b214580b3226d8b2ef77a382c
KillMode=none
Type=forking
PIDFile=/var/lib/containers/storage/overlay-containers/a75a16081e23bb03589b21458
[Install]
WantedBy=multi-user.target
# systemctl start example
# journalctl -u example | grep PID | tail -n 1
Aug 12 17:50:38 srv0 systemd[1]: example.service: PID file /var/lib/containers/storage/overlay-containers/a75a16081e23bb03589b214580b3226d8b2ef77a382c5cf7845466823742b297/userdata/a75a16081e23bb03589b214580b3226d8b2ef77a382c5cf7845466823742b297.pid not readable (yet?) after start: No such file or directory
```
Is using `podman generate systemd example` planned for the future or
is it supposed to work now?
It's bugged in 1.4.x - fixed in 1.5.0 (just released Friday).
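To check by hand whether the PID file that a `Type=forking` unit
depends on is actually usable, a rough sketch like the following could
help. The function name `pidfile_alive` is hypothetical, and this is
only an approximation of what systemd verifies:

```shell
# Succeed only if the PID file is readable, contains a plain number,
# and a process with that PID can be signalled.
# Note: kill -0 also fails when the caller lacks permission to signal
# the process, so run this as the same user (or root).
pidfile_alive() {
    pidfile="$1"
    [ -r "$pidfile" ] || return 1
    pid="$(cat "$pidfile")"
    case "$pid" in
        ''|*[!0-9]*) return 1 ;;   # empty or non-numeric content
    esac
    kill -0 "$pid" 2>/dev/null
}
```

Pointing it at the `userdata/<container-id>.pid` path from the journal
message above, e.g. `pidfile_alive "$path" && echo tracked`, shows
whether conmon has written the file yet.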
[1] https://github.com/containers/libpod/pull/3581
[2] https://unix.stackexchange.com/questions/534843/why-is-conmon-in-a-differ...
[3] https://youtu.be/7CWmuhkgZWs?t=2204
On Mon, Aug 12, 2019 at 7:01 AM Matt Heon <mheon(a)redhat.com> wrote:
>
> On 2019-08-10 13:17, Max Bigras wrote:
> >Given an alpine:3.10.1 image
> >
> >```
> >podman pull alpine:3.10.1
> >```
> >
> >And a unit file foo.service
> >
> >```
> >[Service]
> >ExecStart=/usr/bin/podman run --name %N --rm --tty alpine:3.10.1 sleep 99999
> >ExecStop=/usr/bin/podman stop %N
> >```
> >
> >And starting `foo.service` with `systemctl`
> >
> >```
> ># systemctl daemon-reload
> ># systemctl start foo.service
> >```
> >
> >I don't see my `sleep` process in `foo.service` status:
> >
> >```
> ># systemctl status foo.service | head
> >● foo.service
> > Loaded: loaded (/etc/systemd/system/foo.service; static; vendor preset: enabled)
> > Active: active (running) since Sat 2019-08-10 19:58:05 UTC; 40s ago
> > Main PID: 15524 (podman)
> > Tasks: 9
> > Memory: 7.3M
> > CPU: 79ms
> > CGroup: /system.slice/foo.service
> > └─15524 /usr/bin/podman run --name foo --rm --tty alpine:3.10.1 sleep 99999
> >```
> >
> >I see `conmon` land in a different cgroup, visible with the
> >`systemd-cgls` command:
> >
> >```
> ># systemd-cgls
> >Control group /:
> >-.slice
> >├─init.scope
> >│ └─1 /sbin/init
> >├─machine.slice
> >│ ├─libpod-conmon-c598f5a0c84881c69dcd69c5af981dd5071385138e45ce0c3b94dcc5308953a
> >│ │ └─15648 /usr/bin/conmon -s -c c598f5a0c84881c69dcd69c5af981dd5071385138e45ce0
> >│ └─libpod-c598f5a0c84881c69dcd69c5af981dd5071385138e45ce0c3b94dcc5308953a5.scope
> >│   └─15662 sleep 99999
> >├─system.slice
> >│ ├─mdadm.service
> >│ │ └─880 /sbin/mdadm --monitor --pid-file /run/mdadm/monitor.pid --daemonise --s
> >│ ├─foo.service
> >│ │ └─15524 /usr/bin/podman run --name foo --rm --tty alpine:3.10.1 sleep 99999
> >```
> >
> >From listening to YouTube presentations about podman, I thought podman
> >using a traditional fork/exec model would imply that all my processes
> >would show up in the same `systemctl status` and be in the same control
> >group controlled by systemd.
> >
> >Looking at the output of `ps` also shows that the `conmon` process,
> >not the `podman` process, is the parent of the `sleep` process:
> >
> >```
> ># ps -Heo pid,ppid,comm,cgroup
> >15524     1 podman 11:memory:/system.slice/foo.service,8:pids:/system.sl
> >15648     1 conmon 11:memory:/machine.slice/libpod-conmon-c598f5a0c84881
> >15662 15648 sleep 11:memory:/machine.slice/libpod-c598f5a0c84881c69dcd6
> >```
> >
> >Instead, it looks like `conmon` is in a `scope` unit named:
> >
> >```
> >libpod-conmon-c598f5a0c84881c69dcd69c5af981dd5071385138e45ce0c3b94dcc5308953a5.scope
> >```
> >
> >Why don't `conmon` and `sleep` land in the same `foo.service` systemd
> >unit?
>
> All Linux containers, at present, will create and manage their own
> CGroups, independent of systemd (with a few potential exceptions).
> This is mostly done so they can independently manage resources, though
> it is also necessary to do some common container operations - the
> 'podman stats' and 'podman top' commands, for example, track
> processes in the container by its CGroup.
>
> The exception I mentioned was rootless containers in a CGroup v1
> environment. There is no support for delegation to rootless containers
> in the V1 hierarchy, so the containers have no permission to create
> CGroups.
>
> For most people, this is perfectly sufficient, but we definitely
> recognize that there are use cases that require keeping containers in
> CGroups managed elsewhere - most notably, from systemd unit files.
> There is work ongoing at [1] to enable this, though I caution that
> there are a lot of moving parts here - we don't just need support in
> Podman, but also in 'runc' - Podman isn't actually creating the new
> CGroup for the container, the OCI runtime (usually 'runc') is.
>
> Also, a bit more context: the Conmon CGroup is not the container
> CGroup. Conmon creates its own CGroup (for various legacy reasons -
> we're evaluating whether these still hold true, and this could change)
> and then spawns the OCI runtime - and then the OCI runtime spawns its
> own CGroup. So you'll have a Conmon CGroup and another for the actual
> container (the two 'libpod' cgroups).
>
> The parent situation is simpler to explain; Podman launches conmon,
> and then conmon double-forks to daemonize. At that point, the two are
> effectively separate; Podman itself can go away completely, and the
> Conmon process will continue managing the container. This is
> deliberate - Podman is just the frontend that launches the container,
> and we don't need it to keep running once the container is started.
>
> Because of this, we recommend tracking Conmon, not Podman, with unit
> files. In Podman 1.5.0 and later, 'podman generate systemd' will
> properly handle this, creating a unit file that tracks Conmon using
> PID file.
>
> I hope this helps explain things.
>
> Thanks,
> Matt Heon
>
> [1] https://github.com/containers/libpod/pull/3581