Hi Giuseppe,

Thank you for the explanation, and for that seccomp tip which will definitely come in handy!

Best regards,
Vincent Quéméner.

On Tue, May 4, 2021 at 14:48, Giuseppe Scrivano <gscrivan@redhat.com> wrote:
Hi Vincent,

Vincent QUEMENER <vquemener@gmail.com> writes:

> Hi,
>
> I am looking for some guidance on how to securely containerize an application that depends on the `CAP_SYS_NICE` capability to work.
>
> Outside of the container world, one would probably just set the capability on the binary so that a non-privileged user could run it:
> ```
>     $ my_app
>     Error!
>     $ sudo setcap 'cap_sys_nice+ep' my_app
>     $ my_app
>     Success!
> ```
>
> When working with containers, the easiest solution would be to execute Podman as root with the `--cap-add` parameter:
> ```
>     $ sudo podman run --rm --cap-add "sys_nice" -v "$PWD/my_app:/my_app" fedora:34 /my_app
>     Success!
> ```
>
> A somewhat more secure option would be to switch to a non-privileged user with the `--user` parameter:
> ```
>     $ sudo podman run --rm --cap-add "sys_nice" -v "$PWD/my_app:/my_app" --user nobody fedora:34 /my_app
>     Success!
> ```
>
> Now, in order to mitigate potential container-breakout vulnerabilities, I would like to go a bit further and set up a rootless container.
>
> I have recently learned about ambient capabilities and have started experimenting with the `capsh` command. This seems to work:
> ```
>     $ sudo capsh --caps="cap_sys_nice+eip cap_setpcap,cap_setuid,cap_setgid+ep" --keep=1 --user="${USER}" --addamb=cap_sys_nice -- -c ./my_app
>     Success!
> ```
> But this does not (the ambient capability is not set in the container, and `strace` shows that the `setpriority` system call fails with `Permission denied`):
> ```
>     $ sudo capsh --caps="cap_sys_nice+eip cap_setpcap,cap_setuid,cap_setgid+ep" --keep=1 --user="${USER}" --addamb=cap_sys_nice -- -c "HOME=${HOME} podman run --rm --cap-add sys_nice -v $PWD/my_app:/my_app fedora:34 /my_app"
>     Error!
> ```
>
> Is this a Podman limitation, and if so, could it be improved? Is there a better approach?

When running rootless, CAP_SYS_NICE is relative to the rootless user
namespace, and the kernel still enforces limits that prevent setting
too high a priority unless the process is running in the initial user
namespace (the host context, with no user namespace created).

If you would like to cheat and let setpriority report success without
performing any action, I'd suggest looking into seccomp and how we
treat the "socket" syscall.  You can force an errnoRet: 0, so the call
is not performed but returns success to user space.
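
As a minimal sketch of that idea (the file name nice-seccomp.json is
made up for illustration, and the allow-everything default is only
there to keep the example short; a real profile would more likely start
from the default seccomp.json shipped with Podman and add just this
rule), something along these lines should make setpriority a no-op that
reports success:

```shell
# Write a hypothetical seccomp profile to the current directory.
# Assumption: every syscall is allowed by default (NOT a hardened profile);
# only setpriority is intercepted. SCMP_ACT_ERRNO with errnoRet 0 means the
# syscall is skipped and 0 (success) is returned to user space.
cat > nice-seccomp.json <<'EOF'
{
  "defaultAction": "SCMP_ACT_ALLOW",
  "syscalls": [
    {
      "names": ["setpriority"],
      "action": "SCMP_ACT_ERRNO",
      "errnoRet": 0
    }
  ]
}
EOF
```

The profile could then be passed to a rootless container by adding
`--security-opt seccomp=nice-seccomp.json` to the `podman run` command
line.  Keep in mind that the priority is not actually changed, so this
only helps if my_app can live with its default priority.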

Regards,
Giuseppe