[pkg-go] Bug#1078205: systemd: can't start polkitd in a podman container without CAP_SYS_ADMIN

Thu Aug 8 14:35:55 BST 2024

Simon McVittie <smcv at debian.org> writes:
> It is possible to run systemd + polkitd inside a podman container by
> running it as "podman run ... --cap-add=CAP_SYS_ADMIN", but I am unsure
> whether this undermines or defeats podman's security model. A question
> for the podman maintainers: what is the security impact of adding that
> option? I'm guessing that the answer is one of these:

I'd say, significant.

> - maybe it's a sandbox escape that allows the container payload to execute
>   arbitrary code as the user running podman, and this is by design, so
>   reporting it as a security vulnerability would be wontfix'd?
>
> - or maybe it weakens podman's security hardening, but is not meant to
>   result in a sandbox escape because other factors (e.g. use of userns
>   and/or seccomp) are meant to protect the host system, so if there was
>   a sandbox escape then it would be treated as a vulnerability?
>
> - or is there a better way to run a container that has a full systemd init
>   system under podman?

Security is not absolute; you achieve it by layering security
measures. Running a container with CAP_SYS_ADMIN grants its processes
significant privileges. Quoting from capabilities(7):

CAP_SYS_ADMIN
       Note: this capability is overloaded; see Notes to kernel develop‐
       ers below.

       •  Perform a range of system administration operations including:
          quotactl(2),  mount(2),  umount(2),  pivot_root(2), swapon(2),
          swapoff(2), sethostname(2), and setdomainname(2);
       •  perform privileged syslog(2) operations (since  Linux  2.6.37,
          CAP_SYSLOG should be used to permit such operations);
       •  perform VM86_REQUEST_IRQ vm86(2) command;
       •  access  the same checkpoint/restore functionality that is gov‐
          erned by CAP_CHECKPOINT_RESTORE (but the latter, weaker  capa‐
          bility is preferred for accessing that functionality).
       •  perform  the  same  BPF  operations as are governed by CAP_BPF
          (but the latter, weaker capability is preferred for  accessing
          that functionality).
       •  employ  the same performance monitoring mechanisms as are gov‐
          erned by CAP_PERFMON (but the  latter,  weaker  capability  is
          preferred for accessing that functionality).
       •  perform  IPC_SET and IPC_RMID operations on arbitrary System V
          IPC objects;
       •  override RLIMIT_NPROC resource limit;
       •  perform operations on trusted and security extended attributes
          (see xattr(7));
       •  use lookup_dcookie(2);
       •  use ioprio_set(2) to assign IOPRIO_CLASS_RT and (before  Linux
          2.6.25) IOPRIO_CLASS_IDLE I/O scheduling classes;
       •  forge  PID  when  passing  socket  credentials via UNIX domain
          sockets;
       •  exceed /proc/sys/fs/file-max, the  system-wide  limit  on  the
          number  of  open files, in system calls that open files (e.g.,
          accept(2), execve(2), open(2), pipe(2));
       •  employ CLONE_* flags that create new namespaces with  clone(2)
          and unshare(2) (but, since Linux 3.8, creating user namespaces
          does not require any capability);
       •  access privileged perf event information;
       •  call  setns(2)  (requires  CAP_SYS_ADMIN  in  the target name‐
          space);
       •  call fanotify_init(2);
       •  perform privileged KEYCTL_CHOWN and  KEYCTL_SETPERM  keyctl(2)
          operations;
       •  perform madvise(2) MADV_HWPOISON operation;
       •  employ  the TIOCSTI ioctl(2) to insert characters into the in‐
          put queue of a terminal other than  the  caller's  controlling
          terminal;
       •  employ the obsolete nfsservctl(2) system call;
       •  employ the obsolete bdflush(2) system call;
       •  perform various privileged block-device ioctl(2) operations;
       •  perform various privileged filesystem ioctl(2) operations;
       •  perform  privileged ioctl(2) operations on the /dev/random de‐
          vice (see random(4));
       •  install a seccomp(2) filter without first having  to  set  the
          no_new_privs thread attribute;
       •  modify allow/deny rules for device control groups;
       •  employ  the  ptrace(2)  PTRACE_SECCOMP_GET_FILTER operation to
          dump tracee's seccomp filters;
       •  employ the ptrace(2) PTRACE_SETOPTIONS  operation  to  suspend
          the  tracee's  seccomp  protections  (i.e.,  the
          PTRACE_O_SUSPEND_SECCOMP flag);
       •  perform administrative operations on many device drivers;
       •  modify autogroup nice values by writing to /proc/pid/autogroup
          (see sched(7)).

In practice, I'd say that means any process inside the container has much
the same privileges as the user running podman. For rootful podman, that is
equivalent to having root on the host. That's not necessarily a full
compromise: there are clearly situations where you would want to have a
process with full administrative privileges in a container. It always
depends on your use case.

For rootless podman the situation is less clear, but from a security
assessment POV, I would consider any process running as root in the container
to have the same privileges as the UID starting the container.
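
To make the rootless case a bit more concrete: the kernel exposes the
user-namespace ID mapping in /proc/self/uid_map, one line per range with
three fields (first ID inside the namespace, first ID outside, length of the
range). The Python sketch below translates a container UID to the host UID
it maps to. The function names and the sample mapping are made up for
illustration; the real mapping depends on how podman and /etc/subuid are
configured on a given host.

```python
def parse_uid_map(text):
    """Parse uid_map content into (inside_start, outside_start, length) tuples."""
    ranges = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        inside, outside, length = (int(f) for f in fields)
        ranges.append((inside, outside, length))
    return ranges


def to_host_uid(ranges, container_uid):
    """Return the host UID for container_uid, or None if it is unmapped."""
    for inside, outside, length in ranges:
        if inside <= container_uid < inside + length:
            return outside + (container_uid - inside)
    return None


# Hypothetical mapping for a rootless container started by host UID 1000.
SAMPLE_MAP = """\
         0       1000          1
         1     100000      65536
"""

if __name__ == "__main__":
    ranges = parse_uid_map(SAMPLE_MAP)
    # Container root maps to the UID of the user who started the container.
    print(to_host_uid(ranges, 0))
```

With a mapping like this, uid 0 in the container acts on the host as the
invoking user, which is why I'd treat the two as equivalent in an assessment.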

>
> My goal here is to have a way that we can run a container with systemd
> init on an ordinary Debian system, certainly without introducing a
> root security hole on the host system if the uid 0 of the container is
> malicious or compromised, and ideally also without allowing access to the
> user account that is running podman (similar to the security properties
> that we'd expect from running the potentially malicious/compromised
> payload inside qemu/kvm).
>
> If the answer is something like "you can use --cap-add=CAP_SYS_ADMIN,
> it isn't a security hole" then we can reassign this to autopkgtest,
> as a request to make `autopkgtest-virt-podman --init` automatically add
> that option.

I personally think this is a call that the mmdebstrap maintainers need to
make together with the autopkgtest-build-podman maintainers. Here's my
thinking:

 - The point of mmdebstrap is to allow creating Debian chroots without
   requiring root.
 - The complication is that creating a chroot involves running maintainer
   scripts that expect to run as root for various purposes.
 - The main concern here is malicious or broken maintainer scripts
   affecting the system in unintended ways. A practical concern for usage
   on a buildd or a debci machine could be how often the machine needs to
   be reset.
 - Having mmdebstrap run with CAP_SYS_ADMIN grants all those maintainer
   scripts all of the privileges quoted above in the user namespace that
   podman created. In some cases these capabilities might be required,
   for instance if a process needs to set up mounts, etc.
 - Exploits to escape user namespaces do come up from time to time, and
   are treated as bugs. Processes with CAP_SYS_ADMIN are more likely
   to be able to exploit such bugs.
 - Evidently, we do have a maintainer script that invokes a process,
   policykit in this example, that does expect CAP_SYS_ADMIN
 - Not requiring CAP_SYS_ADMIN would clearly avoid a lot of headaches, but
   it is not clear to me whether it is reasonable for maintainer scripts
   to assert the presence of CAP_SYS_ADMIN. Ideally debian-policy would
   provide some clarifying language, for instance by saying "yes,
   maintainer scripts may assert and fail if absent", or "maintainer scripts
   must tolerate the absence of CAP_SYS_ADMIN gracefully".
 - Ideally policykit would not require CAP_SYS_ADMIN, but I don't know
   enough about policykit to tell whether that would make sense.
 - All things considered, I think running mmdebstrap with
   CAP_SYS_ADMIN is a significant security improvement over not running
   it in a user namespace at all, and I tend to recommend having
   autopkgtest-build-podman invoke podman with CAP_SYS_ADMIN for maximum
   compatibility.
 - Additionally, we can and probably should keep thinking about what
   capabilities are reasonable to assume in maintainer scripts and
   init scripts. I'd recommend moving this to a separate discussion.
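
As a rough illustration of the "tolerate the absence gracefully" option
above: a process can inspect its own effective capability set through the
CapEff line of /proc/self/status, which is a hexadecimal bitmask, and
CAP_SYS_ADMIN is capability number 21 in <linux/capability.h>. The Python
sketch below is only my assumption of how such a check might look. The
helper names are made up, and real maintainer scripts are usually shell.

```python
CAP_SYS_ADMIN = 21  # capability number from <linux/capability.h>


def effective_caps(status_text):
    """Extract the CapEff bitmask from /proc/<pid>/status content."""
    for line in status_text.splitlines():
        if line.startswith("CapEff:"):
            return int(line.split()[1], 16)
    raise ValueError("no CapEff line found")


def has_cap(mask, cap):
    """Return True if capability number `cap` is set in `mask`."""
    return bool(mask & (1 << cap))


def self_has_sys_admin(status_path="/proc/self/status"):
    """Check whether the calling process currently holds CAP_SYS_ADMIN."""
    with open(status_path) as f:
        return has_cap(effective_caps(f.read()), CAP_SYS_ADMIN)
```

A script could call self_has_sys_admin() and skip the privileged step when
it returns False, rather than failing outright. Note that inside a user
namespace the bit only grants the capability over resources owned by that
namespace, so a positive result says nothing about privileges on the host.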

I hope the rambling above helps!

-rt