Hi Mark,

Thanks for reaching out.

I suggest using `podman generate systemd` to generate a systemd unit.  There's also a new way of running Podman under systemd called Quadlet, which ships with the just-released Podman v4.4.  A blog post on that topic is in the pipeline.
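
For `podman generate systemd`, the workflow would look roughly like this (a sketch based on the flags in your unit below, run as root; adjust as needed):

  # Create the container once with the desired flags.  No separate pull is
  # needed; podman run fetches the image automatically if it isn't present.
  podman run -d --name looseleaf \
    --volume /var/storage/looseleaf/etc:/looseleaf/etc:Z,ro \
    --volume /var/storage/looseleaf/var:/looseleaf/var:Z,rw \
    --publish 20000:20000/tcp \
    docker.io/io7m/looseleaf:0.0.4 \
    /looseleaf/bin/looseleaf server --file /looseleaf/etc/config.json

  # Generate a self-contained unit (--new recreates the container on each
  # start and removes it on stop), then install and enable it.
  podman generate systemd --new --files --name looseleaf
  cp container-looseleaf.service /etc/systemd/system/
  systemctl daemon-reload
  systemctl enable --now container-looseleaf.service

The generated unit takes care of the kill/rm/run dance itself, so none of the ExecStartPre lines are needed.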

Given the complexity of running Podman in systemd, `podman generate systemd` and Quadlet are the only supported ways of doing so.
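
To give you an idea of what Quadlet looks like, here's a rough sketch of a `.container` unit for your service, based on the flags in your unit below; the key names are from the podman-systemd.unit(5) documentation, so please double-check them against the man page that ships with v4.4:

---
# /etc/containers/systemd/looseleaf.container
[Unit]
Description=looseleaf
After=network-online.target
Wants=network-online.target

[Container]
Image=docker.io/io7m/looseleaf:0.0.4
Exec=/looseleaf/bin/looseleaf server --file /looseleaf/etc/config.json
Volume=/var/storage/looseleaf/etc:/looseleaf/etc:Z,ro
Volume=/var/storage/looseleaf/var:/looseleaf/var:Z,rw
PublishPort=20000:20000/tcp
Environment="_JAVA_OPTIONS=-XX:+UseSerialGC -Xmx64m -Xms64m"
# Flags without a dedicated key can be passed through verbatim.
PodmanArgs=--memory=128m --memory-reservation=80m

[Service]
Restart=on-failure
RestartSec=10s

[Install]
WantedBy=multi-user.target
---

Quadlet turns that file into a regular looseleaf.service at boot (and on `systemctl daemon-reload`), and the generated service creates and removes the container itself, so there's no hand-written unit to maintain.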

In your case, I suggest removing `podman pull` from the service.  In contrast to `podman pull`, `podman run` won't redundantly pull the image if it's already in local storage, which should ease the network bottleneck.
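
Concretely, that just means dropping this line from the unit you quoted:

ExecStartPre=/bin/podman pull docker.io/io7m/looseleaf:0.0.4

The ExecStart line can stay as it is: `podman run`'s default pull policy (`--pull=missing`) only contacts the registry when the image isn't already in local storage, so restarts won't touch the network once the image is present.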

Kind regards,
 Valentin

On Thu, Feb 2, 2023 at 10:00 PM Mark Raynsford via Podman <podman@lists.podman.io> wrote:
Hello!

I'm using podman on Fedora CoreOS. The standard setup for a
podman-based service tends to look like this (according to the
documentation):

---
[Unit]
Description=looseleaf
After=network-online.target
Wants=network-online.target

[Service]
Type=exec
TimeoutStartSec=60
User=_looseleaf
Group=_looseleaf
Restart=on-failure
RestartSec=10s

Environment="_JAVA_OPTIONS=-XX:+UseSerialGC -Xmx64m -Xms64m"

ExecStartPre=-/bin/podman kill looseleaf
ExecStartPre=-/bin/podman rm looseleaf
ExecStartPre=/bin/podman pull docker.io/io7m/looseleaf:0.0.4

ExecStart=/bin/podman run \
  --name looseleaf \
  --volume /var/storage/looseleaf/etc:/looseleaf/etc:Z,ro \
  --volume /var/storage/looseleaf/var:/looseleaf/var:Z,rw \
  --publish 20000:20000/tcp \
  --memory=128m \
  --memory-reservation=80m \
  docker.io/io7m/looseleaf:{{looseleaf_version}} \
  /looseleaf/bin/looseleaf server --file /looseleaf/etc/config.json

ExecStop=/bin/podman stop looseleaf

[Install]
WantedBy=multi-user.target
---

The important line is this one:

/bin/podman pull docker.io/io7m/looseleaf:0.0.4

Unfortunately, this line can fail. That in itself isn't a problem: the
service will be restarted and it'll run again. The real problem is that
it can fail in ways that will break all subsequent executions.

On new Fedora CoreOS deployments, there's often a lot of network
traffic happening on first boot as the rest of the system updates
itself, and it's not unusual for `podman pull` to fail and leave the
services permanently broken (unless someone goes in and fixes them).

This is what will typically happen:

Feb 02 20:31:05 control1.io7m.com podman[1934]: Trying to pull docker.io/io7m/looseleaf:0.0.4...
Feb 02 20:31:48 control1.io7m.com podman[1934]: time="2023-02-02T20:31:48Z" level=warning msg="Failed, retrying in 1s ... (1/3). Error: initializing source docker://io7m/looseleaf:0.0.4: pinging container registry registry-1.docker.io: Get \"https://regist>
Feb 02 20:31:50 control1.io7m.com podman[1934]: Getting image source signatures
Feb 02 20:31:50 control1.io7m.com podman[1934]: Copying blob sha256:9794579c486abc6811cea048073584c869db02a4d9b615eeaa1d29e9c75738b9
Feb 02 20:31:50 control1.io7m.com podman[1934]: Copying blob sha256:8921db27df2831fa6eaa85321205a2470c669b855f3ec95d5a3c2b46de0442c9
Feb 02 20:31:50 control1.io7m.com podman[1934]: Copying blob sha256:846e3b32ee5a149e3ccb99051cdb52e96e11488293cdf72ee88168c88dd335c7
Feb 02 20:31:50 control1.io7m.com podman[1934]: Copying blob sha256:7f516ed68e97f9655d26ae3312c2aeede3dfda2dd3d19d2f9c9c118027543e87
Feb 02 20:31:50 control1.io7m.com podman[1934]: Copying blob sha256:e88daf71a034bed777eda8657762faad07639a9e27c7afb719b9a117946d1b8a
Feb 02 20:32:03 control1.io7m.com systemd[1]: looseleaf.service: start-pre operation timed out. Terminating.

It'll usually happen again on the next service restart. Then, this will
tend to happen:

Feb 02 20:34:13 control1.io7m.com podman[2745]: time="2023-02-02T20:34:13Z" level=error msg="Image docker.io/io7m/looseleaf:0.0.4 exists in local storage but may be corrupted (remove the image to resolve the issue): size for layer \"13cfed814d5b083572142bc>
Feb 02 20:34:13 control1.io7m.com podman[2745]: Trying to pull docker.io/io7m/looseleaf:0.0.4...
Feb 02 20:34:14 control1.io7m.com podman[2745]: Getting image source signatures
Feb 02 20:34:14 control1.io7m.com podman[2745]: Copying blob sha256:9794579c486abc6811cea048073584c869db02a4d9b615eeaa1d29e9c75738b9
Feb 02 20:34:14 control1.io7m.com podman[2745]: Copying blob sha256:8921db27df2831fa6eaa85321205a2470c669b855f3ec95d5a3c2b46de0442c9
Feb 02 20:34:14 control1.io7m.com podman[2745]: Copying blob sha256:846e3b32ee5a149e3ccb99051cdb52e96e11488293cdf72ee88168c88dd335c7
Feb 02 20:34:14 control1.io7m.com podman[2745]: Copying blob sha256:7f516ed68e97f9655d26ae3312c2aeede3dfda2dd3d19d2f9c9c118027543e87
Feb 02 20:34:14 control1.io7m.com podman[2745]: Copying blob sha256:e88daf71a034bed777eda8657762faad07639a9e27c7afb719b9a117946d1b8a
Feb 02 20:34:18 control1.io7m.com podman[2745]: Copying config sha256:cce9701f3b6e34e3fc26332da58edcba85bbf4f625bdb5f508805d2fa5e62e3e
Feb 02 20:34:18 control1.io7m.com podman[2745]: Writing manifest to image destination
Feb 02 20:34:18 control1.io7m.com podman[2745]: Storing signatures
Feb 02 20:34:18 control1.io7m.com podman[2745]: Error: checking platform of image cce9701f3b6e34e3fc26332da58edcba85bbf4f625bdb5f508805d2fa5e62e3e: inspecting image: size for layer "13cfed814d5b083572142bc068ae7f890f323258135f0cffe87b04cb62c3742e" is unkno>
Feb 02 20:34:18 control1.io7m.com systemd[1]: looseleaf.service: Control process exited, code=exited, status=125/n/a

At this point, there's really nothing that can be done aside from
having a human log in and run something like "podman system reset".

These systems are supposed to be as immutable as possible, and
deployments are supposed to be automated. As it stands, I can't
actually deploy a machine without having it immediately break and
require manual intervention.

Is there some better way to handle this?

--
Mark Raynsford | https://www.io7m.com

_______________________________________________
Podman mailing list -- podman@lists.podman.io
To unsubscribe send an email to podman-leave@lists.podman.io