[pkg-go] Bug#1078205: systemd: can't start polkitd in a podman container without CAP_SYS_ADMIN

Sat Aug 10 16:24:51 BST 2024

On 2024-08-09 04:52, Simon McVittie wrote:
> On Thu, 08 Aug 2024 at 11:12:07 -0400, Reinhard Tartler wrote:
>> In short, it seems to me if you are running a workload that requires
>> CAP_SYS_ADMIN, then it is appropriate to pass that argument to podman.
>> It is clearly much better than using --privileged
>> (cf. https://www.redhat.com/sysadmin/podman-inside-container)
> 
> Do you think this would be reasonably accurate wording to put in
> autopkgtest-virt-podman(1)?
> 
>     Note that full functionality of systemd(1) as an init system requires
>     the container to have CAP_SYS_ADMIN, which might allow code that
>     runs as root in the container to compromise processes outside the
>     container that are running under the same uid as
>     autopkgtest-virt-podman. If this is consistent with your security
>     model, it can be enabled by passing the --cap-add=CAP_SYS_ADMIN option
>     to podman_run(1):
>
>         autopkgtest ... -- podman --init $IMAGE -- --cap-add=CAP_SYS_ADMIN
> 
> Or is "might allow" too strong, or too weak?

I personally find that wording a bit too strong. How about something 
like this:

>     Note that full functionality of systemd(1) as an init system requires
>     the container to have CAP_SYS_ADMIN. This means that software like
>     systemd that runs as root in the container can issue syscalls (such
>     as mount(2) and seccomp(2); see capabilities(7) for details) to set
>     up its own sandboxing features. However, this also exposes additional
>     kernel attack surface to malicious code trying to escape the
>     container sandbox. If this is consistent with your security model,
>     it can be enabled by passing the --cap-add=CAP_SYS_ADMIN option to
>     podman-run(1):
>
>         autopkgtest ... -- podman --init $IMAGE -- --cap-add=CAP_SYS_ADMIN
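
To make the argument layout in that example explicit, here is a minimal
dry-run sketch. It only prints the command line and does not run
anything. The helper name build_cmd is hypothetical, and the layout is
taken from the wording above rather than from the autopkgtest
documentation.

```shell
#!/bin/sh
# Hypothetical helper: print (do not execute) an autopkgtest command line
# that adds CAP_SYS_ADMIN to the podman testbed.
# $1 is the container image; all remaining arguments go to autopkgtest.
build_cmd() {
    image=$1
    shift
    # Layout, as in the wording above:
    #   autopkgtest <args> -- podman --init IMAGE -- <podman-run options>
    printf '%s' "autopkgtest"
    for arg in "$@"; do
        printf ' %s' "$arg"
    done
    printf ' -- podman --init %s -- --cap-add=CAP_SYS_ADMIN\n' "$image"
}
```

Everything after the second "--" is handed to podman-run, which is why
the capability option goes last.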

> (It might also be appropriate to add a shorthand form for that, to avoid
> needing to use the "pass arbitrary options to podman-run" mechanism,
> but that would need some more design to choose a suitable name for
> that option. --trust-root-in-testbed, perhaps, if my understanding of
> the impact of CAP_SYS_ADMIN is correct.)

I'd love to see such a shortcut, but it is not obvious to me how to name
it. Your suggestion seems too strong to me, because there are typically
still other security features in play, such as seccomp, SELinux or
AppArmor.

Thank you so much for forwarding this question to upstream at
https://github.com/containers/podman/discussions/23558. Hopefully
someone like Dan can form an answer and opinion on this.

>> Since you assigned this bug to podman, may I ask what the ask is?
>> It's not clear how to improve the podman packaging in this context.
> 
> I initially reported this as a systemd bug, thinking that systemd's
> expected behaviour would be to turn off features that require
> CAP_SYS_ADMIN when it isn't available (as it already does for some but
> not all such features, as far as I can see), but Luca replied "this
> is an issue in podman". I don't know specifically what basis he had
> for that statement.

Yeah, I don't see this as an issue in podman. Again, if you need to run
a workload that requires this capability, podman offers a knob that
allows you to do so. This isn't defeating security per se; it always
depends on the specific situation and use case.

> Perhaps he was expecting that `podman run --systemd=true` (which is the
> default) would detect that we're running /sbin/init in the container,
> and automatically grant access to CAP_SYS_ADMIN? But I think that would
> be inappropriate as an automatic thing if giving access to CAP_SYS_ADMIN
> requires trusting the container payload.

Yeah, that's not what --systemd=true does. Here is the relevant part of
the reference documentation:

https://manpages.debian.org/unstable/podman/podman-run.1.en.html#--systemd=true_%7C_false_%7C_always

Running the container in systemd mode causes the following changes:

* Podman mounts tmpfs file systems on the following directories:
    /run
    /run/lock
    /tmp
    /sys/fs/cgroup/systemd (on a cgroup v1 system)
    /var/lib/journal
* Podman sets the default stop signal to SIGRTMIN+3.
* Podman sets container_uuid environment variable in the container to
  the first 32 characters of the container ID.
* Podman does not mount virtual consoles (/dev/tty\d+) when running with
  --privileged.
* On cgroup v2, /sys/fs/cgroup is mounted writeable.

This allows systemd to run in a confined container without any 
modifications.
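
As a quick way to see the tmpfs part of that list in practice, a rough
sketch along these lines could be run inside the container. The helper
name check_systemd_mounts is hypothetical. It only looks at the four
directories that apply regardless of cgroup version, and it assumes the
usual /proc/mounts field order (device, mount point, file system type).

```shell
#!/bin/sh
# Hypothetical helper: report which of the directories listed above are
# mounted as tmpfs. $1 is a mounts table in /proc/mounts format and
# defaults to /proc/mounts itself.
check_systemd_mounts() {
    mounts=${1:-/proc/mounts}
    for dir in /run /run/lock /tmp /var/lib/journal; do
        # Field 2 is the mount point, field 3 the file system type.
        if awk -v d="$dir" \
            '$2 == d && $3 == "tmpfs" { found = 1 } END { exit !found }' \
            "$mounts"; then
            echo "$dir tmpfs"
        else
            echo "$dir not-tmpfs"
        fi
    done
}
```

Called without an argument it reads the live mount table. Passing a
saved copy of /proc/mounts makes it possible to compare a container
started with --systemd=true against one started with --systemd=false.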

Re-reading https://github.com/systemd/systemd/issues/29860 clarifies
that systemd has a number of additional security hardening features,
such as `DynamicUser=`, but also things like `PrivateDevices=`,
`ProtectHome=`, `ProtectSystem=`, `MountFlags=`, `PrivateTmp=`,
`ReadWriteDirectories=`, `ReadOnlyDirectories=`, and
`InaccessibleDirectories=`.

It occurs to me that systemd is designed as a privileged process that
aims to provide sandboxing features primarily for processes that are
non-privileged, or that at least run with reduced privileges. For this,
it needs to execute a number of syscalls, such as mount(2), setns(2),
seccomp(2) and many others, for which it needs CAP_SYS_ADMIN.
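
To check whether a process actually holds that capability, one option
is to decode the CapEff mask from /proc/<pid>/status. The following is
a minimal sketch with hypothetical helper names. It relies on
CAP_SYS_ADMIN being capability number 21, as defined in the kernel's
linux/capability.h.

```shell
#!/bin/sh
# Hypothetical helper: succeed if bit 21 (CAP_SYS_ADMIN) is set in the
# hex capability mask given as $1, in the format used by the CapEff
# line of /proc/<pid>/status.
has_cap_sys_admin() {
    mask=$((0x$1))
    [ $(( (mask >> 21) & 1 )) -eq 1 ]
}

# Hypothetical helper: print the effective capability mask of the
# calling process.
current_capeff() {
    awk '/^CapEff:/ { print $2 }' /proc/self/status
}
```

Inside the container, something like
has_cap_sys_admin "$(current_capeff)" && echo yes
would then tell whether --cap-add=CAP_SYS_ADMIN took effect.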

> If nothing is going to be done about this in systemd, and nothing can be
> done about it in podman, then it'll probably have to end up as a
> documentation improvement in autopkgtest-virt-podman(1).

I tend to agree. I personally would be comfortable running containers
that have systemd inside with CAP_SYS_ADMIN, because that is closer to
how systemd runs on a real system. Also, podman provides additional
security features, such as seccomp and AppArmor/SELinux.

-rt