Hi Vincent,
Vincent QUEMENER <vquemener(a)gmail.com> writes:
Hi,
I am looking for some guidance on how to securely containerize an application that
depends on the `CAP_SYS_NICE` capability to work.
Outside of the container world, one would probably just set the capability on the binary
so that a non-privileged user can run it:
```
$ my_app
Error!
$ sudo setcap 'cap_sys_nice+ep' my_app
$ my_app
Success!
```
When working with containers, the easiest solution is to run Podman as root
with the `--cap-add` parameter:
```
$ sudo podman run --rm --cap-add "sys_nice" -v \
    "$PWD/my_app:/my_app" fedora:34 /my_app
Success!
```
A somewhat more secure option is to switch to a non-privileged user with
the `--user` parameter:
```
$ sudo podman run --rm --cap-add "sys_nice" -v \
    "$PWD/my_app:/my_app" --user nobody fedora:34 /my_app
Success!
```
Now, in order to mitigate potential container-breakout vulnerabilities, I would like to
go a bit further and set up a rootless container.
I recently learned about ambient capabilities and started experimenting with
the `capsh` command. This seems to work:
```
$ sudo capsh --caps="cap_sys_nice+eip cap_setpcap,cap_setuid,cap_setgid+ep" \
    --keep=1 --user="${USER}" --addamb=cap_sys_nice -- -c ./my_app
Success!
```
But this does not (the ambient capability is not set in the container, and `strace`
shows that the `setpriority` system call fails with `Permission denied`):
```
$ sudo capsh --caps="cap_sys_nice+eip cap_setpcap,cap_setuid,cap_setgid+ep" \
    --keep=1 --user="${USER}" --addamb=cap_sys_nice -- -c "HOME=${HOME} podman \
    run --rm --cap-add sys_nice -v $PWD/my_app:/my_app fedora:34 /my_app"
Error!
```
Is this a Podman limitation, and if so, could it be improved? Is there a better approach?
When running rootless, CAP_SYS_NICE is relative to the rootless user
namespace, and the kernel still has restrictions in place that prevent
setting too high a priority unless the process is running in the initial
user namespace (the host context, without any user namespace created).
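One quick way to see which of the two cases applies to a process is to look at its UID map: in the initial user namespace it is the identity map over the whole 32-bit range, while a rootless container sees a much smaller mapping. The following is a minimal sketch under that assumption. The function name `in_initial_userns` is made up for illustration, and the check is a heuristic, not an official API.

```shell
#!/bin/sh
# Heuristic, not an official API: the initial user namespace has the identity
# UID map "0 0 4294967295". The optional argument (a map file path) exists
# only so the function can be tried on sample data.
in_initial_userns() {
    map_file="${1:-/proc/self/uid_map}"
    # First mapping line: <id inside ns> <id outside ns> <range length>
    read -r inside outside length < "$map_file" || return 1
    [ "$inside" = "0" ] && [ "$outside" = "0" ] && [ "$length" = "4294967295" ]
}

if in_initial_userns; then
    echo "initial user namespace"
else
    echo "nested user namespace"
fi
```

Saved as a script (for example a hypothetical `check-userns.sh`), it can be run directly on the host and again through `podman unshare sh ./check-userns.sh` to compare the two contexts.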
If you would like to cheat and let setpriority report success without
actually performing any action, I'd suggest looking into seccomp and how
we treat the "socket" syscall. You can force an errnoRet: 0, so the call
is not performed but still returns success to user space.
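As a rough illustration of that seccomp trick, the sketch below writes a small profile that answers `setpriority` with an errno of 0. The file name and the allow-by-default policy are assumptions made to keep the example short; a real setup would more likely start from a copy of the default seccomp profile and add this one rule to it, since `SCMP_ACT_ALLOW` as the default drops the usual filtering.

```shell
#!/bin/sh
# Sketch only: the file name and the allow-by-default policy are assumptions.
# A copy of the default seccomp profile with this rule added would be a safer
# starting point than "defaultAction": "SCMP_ACT_ALLOW".
profile="${PROFILE:-./nice-noop-seccomp.json}"

cat > "$profile" <<'EOF'
{
  "defaultAction": "SCMP_ACT_ALLOW",
  "syscalls": [
    {
      "names": ["setpriority"],
      "action": "SCMP_ACT_ERRNO",
      "errnoRet": 0
    }
  ]
}
EOF

echo "wrote $profile"
# Then, as the rootless user:
#   podman run --rm --security-opt seccomp="$profile" \
#       -v "$PWD/my_app:/my_app" fedora:34 /my_app
```

With such a profile the application believes the priority change succeeded, but the process keeps its original priority, so this only helps when the call is not essential to correct behaviour.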
Regards,
Giuseppe