[pkg-go] Bug#1110249: podman corrupted its internal state or something?

Ian Jackson ijackson at chiark.greenend.org.uk
Thu Sep 25 23:34:16 BST 2025


Hi.  Thanks for looking at this.

Reinhard Tartler writes ("Re: Bug#1110249: podman corrupted its internal state or something?"):
> I wish I had seen this earlier and been able to respond more swiftly.

No problem.

> It also seems that this has happened at least twice. Can you quantify
> how often that approximately happens?

So far, exactly twice.

> I haven't seen that myself so far. Unless podman is running as uid==0,
> you are using rootless mode. That makes sense for a DSA-managed host.

podman is running as a normal user.  (A DSA service account.)

> > The containers are used exclusively (at least in normal operation) via
> > autopkgtest-virt-podman.  They are created with a custom script, but
> > that runs weekly, but on Sundays, so doesn't seem implicated.  I
> > haven't been doing any admin work that might be related and I think
> > Sean hasn't either.
> 
> I'm also curious how tag2upload is interacting / using podman. Any
> chance you can point me to the relevant source code of tag2upload? I
> understand from the above that there is no direct interaction with
> podman, but only indirectly with autopkgtest-virt-podman. I'm familiar
> with this tool in the context of package testing. That doesn't appear to
> match with the purpose of the tag2upload infrastructure. What am I
> missing here?

The autopkgtest-virt-* protocol is suitable for many
virtualisation-like use cases.  We are using it for our
disposable/resettable source package build environment.

The relevant code is in the dgit repo, mostly here:
  https://salsa.debian.org/dgit-team/dgit/-/blob/main/infra/tag2upload-oracled?ref_type=heads
You may find that difficult to follow and confusing, because it
doesn't run on the host where podman runs.  Our service architecture is
here:
  https://salsa.debian.org/dgit-team/dgit/-/blob/main/TAG2UPLOAD-DESIGN.txt?ref_type=heads

The basic idea is that tag2upload-oracled runs
  ssh builder autopkgtest-virt-podman
It uses print-execute-command from the adt protocol and then does
   ssh builder the-command-that-autopkgtest-virt-podman-printed
with appropriate quoting.
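
Schematically, the exchange looks something like this (a hand-written
sketch in bash, not the real code; the image name is made up, and the
protocol replies and error handling are heavily simplified):

  # talk the adt protocol to the virt server over ssh
  coproc VIRT { ssh builder autopkgtest-virt-podman autopkgtest/sid; }
  read -r _banner <&"${VIRT[0]}"            # server says "ok" when ready
  printf 'open\n' >&"${VIRT[1]}"
  read -r _status _scratch <&"${VIRT[0]}"   # "ok <scratch dir>"
  printf 'print-execute-command\n' >&"${VIRT[1]}"
  read -r _status execcmd <&"${VIRT[0]}"    # "ok <shell fragment>"
  # run something in the testbed via a second ssh; the fragment has to
  # be re-quoted for the extra hop
  ssh builder "$execcmd true"
  printf 'close\nquit\n' >&"${VIRT[1]}"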

I've noticed that the execute command is a complex-looking shell
fragment, which I haven't tried to understand.  The quoted version of
this command can be seen in the `ssh-builder` shell script from a
sample tag2upload job - I've attached one (from t2u job 1089, for our
ref).  It's pretty nasty to read since it contains text which has been
shell-quoted multiple times.
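
To give a feel for the pileup, here is what one and then two rounds of
requoting do to a simple command (using bash's printf %q purely for
illustration; that's not necessarily what the tools themselves use):

  $ printf '%q\n' "echo 'hello world'"
  echo\ \'hello\ world\'
  $ printf '%q\n' "$(printf '%q' "echo 'hello world'")"
  echo\\\ \\\'hello\\\ world\\\'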

> > I believe our containers are running in a "rootless" mode.  We can
> > probably provide more information about the configuration if you can
> > tell us how to obtain it.
> 
> To diagnose, please provide the output of the command `podman system info`.

Attached.

> See: https://manpages.debian.org/trixie/podman/podman-system-reset.1.en.html
> The canonical way to reset is `podman system reset`.

We can try that if it happens again.

> Looking at the source code, it seems this error is coming from here:
> 
> https://sources.debian.org/src/podman/5.6.1+ds2-2/pkg/domain/infra/abi/system_linux.go?hl=102#L93-L102

I don't really understand this, but if I were to speculate, I would
guess that "BecomeRootInUserNS" failed with ESRCH.

> The fact that the
> message includes the error "no such process" is really puzzling to me:
> That indicates an issue with identifying or starting the "pause" process

I bet "no such process" is simply the errno string for ESRCH.  I
believe the kernel may return ESRCH for a variety of reasons, possibly
relating to namespace stuff I don't really understand.
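
That mapping is easy enough to confirm from a shell:

  $ python3 -c 'import errno, os; print(os.strerror(errno.ESRCH))'
  No such process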

> > Now the service is working again.  But, I don't think it should have
> > broken down.
> 
> I agree, it really shouldn't. I see that there are apparently two
> containers being stopped. Any idea what those containers are and what they
> are supposed to do? Are they supposed to be still running?

In the normal state in the current installation, I think there would
be *one* container running, if by "container running" we mean
"invocation of autopkgtest-virt-podman with an 'open testbed'".

I think two means one has probably been leaked.  (It might be a leaked
image construction container; see below.)

The long shell script fragment, and my other experiences with
autopkgtest-virt-podman etc. when we were setting the service up, have
left me with a lack of confidence that autopkgtest-virt-podman will
properly handle out-of-course exits.  We've found the need to deploy
an ad-hoc "remove old running containers" rune.
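
Something of this general shape would do it (an illustrative sketch,
not our actual script; the `until` filter and the one-day cutoff are
assumptions on my part):

  # forcibly remove containers created more than a day ago,
  # even if they are still running
  podman ps --all --filter 'until=24h' --format '{{.ID}}' |
      xargs -r podman rm --force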

All of this may be unrelated.

We do also drive podman directly to create new images.  But neither of
these failures occurred close to the time when that would be
happening, so probably that's not relevant.

> If you don't rely on long-lived containers running during your
> autopkgtest-virt-podman invocations, maybe you can invoke a `podman
> system reset -f` before the autopkgtest-virt-podman invocation. Note
> that this will clear out all local caches, containers and images. As
> such, I'd expect it to also terminate any running
> pods/containers. Depending on how you use autopkgtest-virt-podman,
> this may or may not be an acceptable workaround for increasing system
> stability.

We need medium-term images; we currently rebuild the builder image
weekly, much like a buildd chroot.  Our current running container
management strategy involves keeping one running container ready for
use.  So it sounds like this workaround wouldn't be suitable.

> What puzzles me most is that `podman system migrate` appears to solve
> this issue. That indicates to me that there is something in
> ~/.local/share/containers/storage that's triggering it.
> 
> I wonder if you could backup that directory when the system is in a
> failed state, recover it using `podman system migrate` or `podman system
> reset`, and try to reproduce the issue after restoring the tarball. If
> you can, then that tarball would be very useful for further diagnostics.

Certainly, we can try that if it happens again.
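
For the record, the procedure would presumably be something like this
(a sketch only, assuming the default rootless storage location; the
tarball name is made up):

  # while podman is in the failed state
  tar -C ~/.local/share/containers -czf ~/storage-broken.tar.gz storage
  podman system migrate                  # or: podman system reset
  # later, to try to reproduce
  rm -rf ~/.local/share/containers/storage
  tar -C ~/.local/share/containers -xzf ~/storage-broken.tar.gz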

Thanks for all the suggestions and info.

Ian.


