[pkg-go] Bug#1078205: systemd: can't start polkitd in a podman container without CAP_SYS_ADMIN
Reinhard Tartler
siretart at tauware.de
Thu Aug 8 14:35:55 BST 2024
Simon McVittie <smcv at debian.org> writes:
> It is possible to run systemd + polkitd inside a podman container by
> running it as "podman run ... --cap-add=CAP_SYS_ADMIN", but I am unsure
> whether this undermines or defeats podman's security model. A question
> for the podman maintainers: what is the security impact of adding that
> option? I'm guessing that the answer is one of these:
I'd say, significant.
> - maybe it's a sandbox escape that allows the container payload to execute
> arbitrary code as the user running podman, and this is by design, so
> reporting it as a security vulnerability would be wontfix'd?
>
> - or maybe it weakens podman's security hardening, but is not meant to
> result in a sandbox escape because other factors (e.g. use of userns
> and/or seccomp) are meant to protect the host system, so if there was
> a sandbox escape then it would be treated as a vulnerability?
>
> - or is there a better way to run a container that has a full systemd init
> system under podman?
Security is not absolute. You achieve it by layering security
measures. Running a container with CAP_SYS_ADMIN grants the processes
inside it significant privileges. Quoting from capabilities(7):
CAP_SYS_ADMIN
Note: this capability is overloaded; see Notes to kernel develop‐
ers below.
• Perform a range of system administration operations including:
quotactl(2), mount(2), umount(2), pivot_root(2), swapon(2),
swapoff(2), sethostname(2), and setdomainname(2);
• perform privileged syslog(2) operations (since Linux 2.6.37,
CAP_SYSLOG should be used to permit such operations);
• perform VM86_REQUEST_IRQ vm86(2) command;
• access the same checkpoint/restore functionality that is gov‐
erned by CAP_CHECKPOINT_RESTORE (but the latter, weaker capa‐
bility is preferred for accessing that functionality).
• perform the same BPF operations as are governed by CAP_BPF
(but the latter, weaker capability is preferred for accessing
that functionality).
• employ the same performance monitoring mechanisms as are gov‐
erned by CAP_PERFMON (but the latter, weaker capability is
preferred for accessing that functionality).
• perform IPC_SET and IPC_RMID operations on arbitrary System V
IPC objects;
• override RLIMIT_NPROC resource limit;
• perform operations on trusted and security extended attributes
(see xattr(7));
• use lookup_dcookie(2);
• use ioprio_set(2) to assign IOPRIO_CLASS_RT and (before Linux
2.6.25) IOPRIO_CLASS_IDLE I/O scheduling classes;
• forge PID when passing socket credentials via UNIX domain
sockets;
• exceed /proc/sys/fs/file-max, the system-wide limit on the
number of open files, in system calls that open files (e.g.,
accept(2), execve(2), open(2), pipe(2));
• employ CLONE_* flags that create new namespaces with clone(2)
and unshare(2) (but, since Linux 3.8, creating user namespaces
does not require any capability);
• access privileged perf event information;
• call setns(2) (requires CAP_SYS_ADMIN in the target name‐
space);
• call fanotify_init(2);
• perform privileged KEYCTL_CHOWN and KEYCTL_SETPERM keyctl(2)
operations;
• perform madvise(2) MADV_HWPOISON operation;
• employ the TIOCSTI ioctl(2) to insert characters into the in‐
put queue of a terminal other than the caller's controlling
terminal;
• employ the obsolete nfsservctl(2) system call;
• employ the obsolete bdflush(2) system call;
• perform various privileged block-device ioctl(2) operations;
• perform various privileged filesystem ioctl(2) operations;
• perform privileged ioctl(2) operations on the /dev/random de‐
vice (see random(4));
• install a seccomp(2) filter without first having to set the
no_new_privs thread attribute;
• modify allow/deny rules for device control groups;
• employ the ptrace(2) PTRACE_SECCOMP_GET_FILTER operation to
dump tracee's seccomp filters;
• employ the ptrace(2) PTRACE_SETOPTIONS operation to suspend
the tracee's seccomp protections (i.e., the PTRACE_O_SUS‐
PEND_SECCOMP flag);
• perform administrative operations on many device drivers;
• modify autogroup nice values by writing to /proc/pid/autogroup
(see sched(7)).
I'd say that pretty much translates to: any process inside the container has
roughly the same privileges as the user running podman. For rootful podman
that's equivalent to having root on the host. That's not necessarily a full
compromise; there are clearly situations where you would want to have a
process with full administrative privileges in a container. It always depends
on your use-case.
For rootless podman the situation is less clear, but from a security
assessment POV, I would consider any process running as root in the container
to have the same privileges as the UID starting the container.
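To make that assessment concrete, here is a minimal sketch of how one could
check from inside a container whether the current process actually holds
CAP_SYS_ADMIN. It assumes a Linux /proc filesystem and relies on the CapEff
line in /proc/<pid>/status and on CAP_SYS_ADMIN being capability number 21.
The function names are made up for illustration and are not part of podman
or autopkgtest.

```python
"""Check whether a process holds CAP_SYS_ADMIN in its effective set."""

CAP_SYS_ADMIN = 21  # capability number from <linux/capability.h>


def effective_caps(status_text: str) -> int:
    """Return the CapEff bitmask parsed from /proc/<pid>/status content."""
    for line in status_text.splitlines():
        if line.startswith("CapEff:"):
            # The value is a hexadecimal bitmask without a 0x prefix.
            return int(line.split(":", 1)[1].strip(), 16)
    raise ValueError("no CapEff line found")


def has_cap(mask: int, cap: int) -> bool:
    """True if capability bit `cap` is set in `mask`."""
    return bool((mask >> cap) & 1)


def self_has_sys_admin(path: str = "/proc/self/status") -> bool:
    """Hypothetical helper: does the calling process hold CAP_SYS_ADMIN?"""
    with open(path) as f:
        return has_cap(effective_caps(f.read()), CAP_SYS_ADMIN)
```

Calling self_has_sys_admin() in a container started with and without
--cap-add=CAP_SYS_ADMIN should show the difference. Note that inside a user
namespace the capability only applies to resources owned by that namespace.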
>
> My goal here is to have a way that we can run a container with systemd
> init on an ordinary Debian system, certainly without introducing a
> root security hole on the host system if the uid 0 of the container is
> malicious or compromised, and ideally also without allowing access to the
> user account that is running podman (similar to the security properties
> that we'd expect from running the potentially malicious/compromised
> payload inside qemu/kvm).
>
> If the answer is something like "you can use --cap-add=CAP_SYS_ADMIN,
> it isn't a security hole" then we can reassign this to autopkgtest,
> as a request to make `autopkgtest-virt-podman --init` automatically add
> that option.
I personally think this is a call that the mmdebstrap maintainers need to make
together with the autopkgtest-build-podman maintainers. Here's my thinking:
- The point of mmdebstrap is to allow creating Debian chroots without
requiring root.
- The main difficulty is that creating a chroot involves running maintainer
scripts that expect to run as root for various purposes.
- The main concern here is malicious or broken maintainer scripts
affecting the system in unintended ways. A practical concern for usage
on a buildd or a debci machine could be how often the machine needs to
be reset.
- Having mmdebstrap run with CAP_SYS_ADMIN grants all those maintainer
scripts all of the privileges quoted above within the user namespace that
podman created. In some cases these capabilities might be required,
for instance if a process needs to set up mounts, etc.
- Exploits to escape user namespaces do come up from time to time, and
are treated as bugs. Processes with CAP_SYS_ADMIN are in a better position
to exploit such bugs.
- Evidently, we do have a maintainer script that invokes a process,
policykit in this example, that does expect CAP_SYS_ADMIN
- Not requiring CAP_SYS_ADMIN would clearly avoid a lot of headaches, but
it is not clear to me whether it is reasonable for maintainer scripts
to assert the presence of CAP_SYS_ADMIN. Ideally debian-policy would
provide some clarifying language, for instance by saying "yes,
maintainer scripts may assert and fail if absent", or "maintainer scripts
must tolerate the absence of CAP_SYS_ADMIN gracefully".
- Ideally policykit would not require CAP_SYS_ADMIN, but I don't know
enough about policykit to tell whether that would make sense.
- All things considered, I think running mmdebstrap with
CAP_SYS_ADMIN is a significant security improvement over not running
it in a user namespace at all, and I tend to recommend having
autopkgtest-build-podman invoke podman with CAP_SYS_ADMIN for maximum
compatibility.
- Additionally, we can and probably should keep thinking about what
capabilities are reasonable to assume in maintainer scripts and
init scripts. I'd recommend splitting this out into a separate discussion.
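
As an illustration of the recommendation above, the following sketch shows
how a wrapper could make the extra capability an explicit opt-in when
assembling the podman command line. This is only an assumption about how
such a switch might look, not how autopkgtest-build-podman or
autopkgtest-virt-podman actually build their commands; the function name,
the add_sys_admin parameter, the /sbin/init default and the image name in
the usage note are all hypothetical.

```python
from typing import List


def podman_run_argv(image: str, init: str = "/sbin/init",
                    add_sys_admin: bool = False) -> List[str]:
    """Build an argv list for running `init` inside `image` with podman.

    CAP_SYS_ADMIN is only added when the caller asks for it, so the
    weaker default stays in place unless someone opts in.
    """
    argv = ["podman", "run", "--rm"]
    if add_sys_admin:
        # Same option as in the "podman run ... --cap-add=CAP_SYS_ADMIN"
        # invocation discussed in this thread.
        argv.append("--cap-add=CAP_SYS_ADMIN")
    argv += [image, init]
    return argv
```

For example, podman_run_argv("localhost/debian-systemd", add_sys_admin=True)
returns a list that can be passed to subprocess.run(). Keeping the option off
by default means the decision to widen privileges is visible at the call site.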
I hope the rambling above helps!
-rt