Thank you for the explanation and the link to WIP PR #3581 [1]!
After reading your explanation and the answers to my question on Stack
Exchange [2], I'm getting a clearer picture of what's going on.
> Also, a bit more context: the Conmon CGroup is not the container
> CGroup. Conmon creates its own CGroup (for various legacy reasons -
> we're evaluating whether these still hold true, and this could change)
> and then spawns the OCI runtime - and then the OCI runtime spawns its
> own CGroup. So you'll have a Conmon CGroup and another for the actual
> container (the two 'libpod' cgroups).
I made an example illustrating the three cgroups involved:
1. /system.slice/example.service for my systemd service
2. /machine.slice/libpod-conmon-a75a16081e23b...scope for conmon
3. /machine.slice/libpod-a75a16081e23bb03589....scope for processes
inside the container (sleep in this case)
```
# podman create --name example --tty alpine sleep 9999
# cat <<'EOF' > /etc/systemd/system/example.service
[Service]
ExecStart=/usr/bin/podman start --attach %N
ExecStop=/usr/bin/podman stop %N
EOF
# systemctl daemon-reload
# systemd-cgls /machine.slice
Control group /machine.slice
# systemctl start example
# systemd-cgls /machine.slice
Control group /machine.slice:
├─libpod-conmon-a75a16081e23bb03589b214580b3226d8b2ef77a382c5cf7845466823742b297
│ └─22385 /usr/bin/conmon -s -c a75a16081e23bb03589b214580b3226d8b2ef77a382c5cf7
└─libpod-a75a16081e23bb03589b214580b3226d8b2ef77a382c5cf7845466823742b297.scope
└─22399 sleep 9999
# systemd-cgls /system.slice/example.service
Control group /system.slice/example.service:
└─22308 /usr/bin/podman start --attach example
```
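As a cross-check (using the PIDs from the output above: 22385 for conmon
and 22399 for sleep), the kernel's own view in /proc should agree with
systemd-cgls:
```
# grep machine.slice /proc/22385/cgroup
# grep machine.slice /proc/22399/cgroup
```
Each process should report its own libpod-*.scope path under
/machine.slice, matching the two scopes printed by systemd-cgls.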
From listening to a YouTube presentation by Lennart on systemd and
control groups [3], I learned that a scope unit is like a service unit
in that it groups and manages processes, but unlike a service unit, its
processes are started by something other than systemd and the unit is
created dynamically at runtime. That matches what we're seeing here:
two dynamically generated scope units, one for conmon and one for the
container's sleep process.
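For comparison, systemd-run can create a transient scope by hand (the
unit name 'demo' here is just something I picked for illustration):
```
# systemd-run --scope --unit=demo sleep 60 &
# systemctl status demo.scope
# systemctl show demo.scope -p Transient
```
If I understand correctly, it should come up with Transient=yes, just
like the libpod-conmon scope above, only here the creator is systemd-run
instead of Podman.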
> Because of this, we recommend tracking Conmon, not Podman, with unit
> files. In Podman 1.5.0 and later, 'podman generate systemd' will
> properly handle this, creating a unit file that tracks Conmon using
> a PID file.
Can you show an example of using systemd to track conmon instead of
podman?
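I imagine it would be something like the following sketch (not tested;
the PIDFile path is a placeholder since I don't know where conmon's PID
gets written):
```
[Service]
Type=forking
ExecStart=/usr/bin/podman start %N
ExecStop=/usr/bin/podman stop %N
# Placeholder path - wherever conmon writes its PID for this container
PIDFile=/path/to/conmon-pidfile
```
but I'm not sure that's the right way to hand conmon over to systemd.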
My goal is to have one way to stop, start, and enable my containers.
In the example above I know I can check the status of the conmon scope
unit; it looks like its name comes from the container ID:
```
# systemctl status \
    libpod-conmon-a75a16081e23bb03589b214580b3226d8b2ef77a382c5cf7845466823742b297.scope
● libpod-conmon-a75a16081e23bb03589b214580b3226d8b2ef77a382c5cf7845466823742b297
Loaded: loaded
Transient: yes
Drop-In: /run/systemd/system/libpod-conmon-a75a16081e23bb03589b214580b3226d8b2
└─50-DefaultDependencies.conf, 50-Delegate.conf, 50-Slice.conf
Active: active (running) since Mon 2019-08-12 17:20:17 UTC; 2min 20s ago
Tasks: 2
Memory: 200.0K
CPU: 17ms
CGroup: /machine.slice/libpod-conmon-a75a16081e23bb03589b214580b3226d8b2ef77a
└─22631 /usr/bin/conmon -s -c a75a16081e23bb03589b214580b3226d8b2ef77
Aug 12 17:20:17 srv0 systemd[1]: Started libpod-conmon-a75a16081e23bb03589b21458
# podman ps
CONTAINER ID IMAGE COMMAND CREATED
STATUS PORTS NAMES
a75a16081e23 docker.io/library/alpine:latest sleep 9999 41 minutes
ago Up 22 minutes ago example
# systemctl show \
    libpod-conmon-a75a16081e23bb03589b214580b3226d8b2ef77a382c5cf7845466823742b297.scope \
    | grep ExecStart
```
I didn't see an ExecStart directive inside the scope unit.
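That makes sense in hindsight: a scope only adopts processes that were
started by something else, so there is no ExecStart to show. The
properties a scope does carry can still be queried, for example:
```
# systemctl show \
    libpod-conmon-a75a16081e23bb03589b214580b3226d8b2ef77a382c5cf7845466823742b297.scope \
    -p ControlGroup -p Delegate -p Slice
```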
Ah, I see what you mean about `podman generate systemd`:
```
# podman generate systemd example > /etc/systemd/system/example.service
# systemctl daemon-reload
# systemctl cat example
[Unit]
Description=a75a16081e23bb03589b214580b3226d8b2ef77a382c5cf7845466823742b297 Pod
[Service]
Restart=on-failure
ExecStart=/usr/bin/podman start a75a16081e23bb03589b214580b3226d8b2ef77a382c5cf7
ExecStop=/usr/bin/podman stop -t 10 a75a16081e23bb03589b214580b3226d8b2ef77a382c
KillMode=none
Type=forking
PIDFile=/var/lib/containers/storage/overlay-containers/a75a16081e23bb03589b21458
[Install]
WantedBy=multi-user.target
# systemctl start example
# journalctl -u example | grep PID | tail -n 1
Aug 12 17:50:38 srv0 systemd[1]: example.service: PID file
/var/lib/containers/storage/overlay-containers/a75a16081e23bb03589b214580b3226d8b2ef77a382c5cf7845466823742b297/userdata/a75a16081e23bb03589b214580b3226d8b2ef77a382c5cf7845466823742b297.pid
not readable (yet?) after start: No such file or directory
```
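To narrow down whether the problem is the path or the file, the
userdata directory that the generated PIDFile= points at can be listed
directly:
```
# ls /var/lib/containers/storage/overlay-containers/a75a16081e23bb03589b214580b3226d8b2ef77a382c5cf7845466823742b297/userdata/
```
If no .pid file ever shows up there after `systemctl start example`,
that would explain the 'not readable (yet?)' message in the journal.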
Is using `podman generate systemd example` like this supposed to work
now, or is it planned for a future release?
[1] https://github.com/containers/libpod/pull/3581
[2] https://unix.stackexchange.com/questions/534843/why-is-conmon-in-a-differ...
[3] https://youtu.be/7CWmuhkgZWs?t=2204
On Mon, Aug 12, 2019 at 7:01 AM Matt Heon <mheon(a)redhat.com> wrote:
>
> On 2019-08-10 13:17, Max Bigras wrote:
> >Given an alpine:3.10.1 image
> >
> >```
> >podman pull alpine:3.10.1
> >```
> >
> >And a unit file foo.service
> >
> >```
> >[Service]
> >ExecStart=/usr/bin/podman run --name %N --rm --tty alpine:3.10.1 sleep 99999
> >ExecStop=/usr/bin/podman stop %N
> >```
> >
> >And starting `foo.service` with `systemctl`
> >
> >```
> ># systemctl daemon-reload
> ># systemctl start foo.service
> >```
> >
> >I don't see my `sleep` process in `foo.service` status:
> >
> >```
> ># systemctl status foo.service | head
> >● foo.service
> > Loaded: loaded (/etc/systemd/system/foo.service; static; vendor
> >preset: enabled)
> > Active: active (running) since Sat 2019-08-10 19:58:05 UTC; 40s ago
> > Main PID: 15524 (podman)
> > Tasks: 9
> > Memory: 7.3M
> > CPU: 79ms
> > CGroup: /system.slice/foo.service
> > └─15524 /usr/bin/podman run --name foo --rm --tty
> >alpine:3.10.1 sleep 99999
> >```
> >
> >I see `conmon` land in a different cgroup, visible with the
> >`systemd-cgls` command:
> >
> >```
> ># systemd-cgls
> >Control group /:
> >-.slice
> >├─init.scope
> >│ └─1 /sbin/init
> >├─machine.slice
> >│ ├─libpod-conmon-c598f5a0c84881c69dcd69c5af981dd5071385138e45ce0c3b94dcc5308953a
> >│ │ └─15648 /usr/bin/conmon -s -c c598f5a0c84881c69dcd69c5af981dd5071385138e45ce0
> >│ └─libpod-c598f5a0c84881c69dcd69c5af981dd5071385138e45ce0c3b94dcc5308953a5.scope
> >│ └─15662 sleep 99999
> >├─system.slice
> >│ ├─mdadm.service
> >│ │ └─880 /sbin/mdadm --monitor --pid-file /run/mdadm/monitor.pid
> >--daemonise --s
> >│ ├─foo.service
> >│ │ └─15524 /usr/bin/podman run --name foo --rm --tty alpine:3.10.1 sleep 99999
> >```
> >
> >From listening to youtube presentations about podman I thought podman
> >using a traditional fork exec model would imply all my processes would
> >show up in the same `systemctl status` and be in the same control
> >group controlled by systemd.
> >
> >Looking at the output of `ps` also shows that the `sleep` process is
> >the parent of the `conmon` process and not the `podman` process:
> >
> >```
> ># ps -Heo pid,ppid,comm,cgroup
> >15524 1 podman
> >11:memory:/system.slice/foo.service,8:pids:/system.sl
> >15648 1 conmon
> >11:memory:/machine.slice/libpod-conmon-c598f5a0c84881
> >15662 15648 sleep
> >11:memory:/machine.slice/libpod-c598f5a0c84881c69dcd6
> >```
> >
> >Instead it looks like `conmon` in a `scope` unit named:
> >
> >```
> >libpod-conmon-c598f5a0c84881c69dcd69c5af981dd5071385138e45ce0c3b94dcc5308953a5.scope
> >```
> >
> >Why don't `conmon` and `sleep` land in the same `foo.service` systemd unit?
> >_______________________________________________
> >Podman mailing list -- podman(a)lists.podman.io
> >To unsubscribe send an email to podman-leave(a)lists.podman.io
>
> All Linux containers, at present, will create and manage their own
> CGroups, independent of systemd (with a few potential exceptions).
> This is mostly done so they can independently manage resources, though
> it is also necessary to do some common container operations - the
> 'podman stats' and 'podman top' commands, for example, track processes
> in the container by its CGroup.
>
> The exception I mentioned was rootless containers in a CGroup v1
> environment. There is no support for delegation to rootless containers
> in the V1 hierarchy, so the containers have no permission to create
> CGroups.
>
> For most people, this is perfectly sufficient, but we definitely
> recognize that there are use cases that require keeping containers in
> CGroups managed elsewhere - most notably, from systemd unit files.
> There is work ongoing at [1] to enable this, though I caution that
> there are a lot of moving parts here - we don't just need support in
> Podman, but also in 'runc' - Podman isn't actually creating the new
> CGroup for the container, the OCI runtime (usually 'runc') is.
>
> Also, a bit more context: the Conmon CGroup is not the container
> CGroup. Conmon creates its own CGroup (for various legacy reasons -
> we're evaluating whether these still hold true, and this could change)
> and then spawns the OCI runtime - and then the OCI runtime spawns its
> own CGroup. So you'll have a Conmon CGroup and another for the actual
> container (the two 'libpod' cgroups).
>
> The parent situation is simpler to explain; Podman launches conmon,
> and then conmon double-forks to daemonize. At that point, the two are
> effectively separate; Podman itself can go away completely, and the
> Conmon process will continue managing the container. This is
> deliberate - Podman is just the frontend that launches the container,
> and we don't need it to keep running once the container is started.
>
> Because of this, we recommend tracking Conmon, not Podman, with unit
> files. In Podman 1.5.0 and later, 'podman generate systemd' will
> properly handle this, creating a unit file that tracks Conmon using
> a PID file.
>
> I hope this helps explain things.
>
> Thanks,
> Matt Heon
>
> [1] https://github.com/containers/libpod/pull/3581