[Pkg-libvirt-maintainers] Bug#998090: libvirt-daemon-system: Please defer starting libvirtd.socket until libvirtd.service dependencies can be met

Ron ron at debian.org
Fri Nov 5 03:08:46 GMT 2021


Hi Guido,

On Wed, Nov 03, 2021 at 07:15:37PM +0100, Guido Günther wrote:
> Hi Ron,
> 
> Sorry for the broken boot. That's always annoying.

Thanks for looking at the details.  I filed these bugs because even
though I can step around the problem in the permutation that involves
something I maintain, if we don't fill in *all* of the contributing
potholes, then I can't prevent some other combination, which I have no
control over, from making a reboot after some future upgrade crash on
the same sharp corner case.

So it really would be nice if we can make this as naturally robust
as possible.  That we have "three-cornered" accidents like this, where
the problem would not have occurred if any one of the contributors
had not had some window for trouble, and that nobody detected and
reported this through the stable release cycle, says to me that we
ought to close every window for this that we see when we see it ...


> On Sat, Oct 30, 2021 at 05:39:45PM +1030, Ron wrote:
> > The race occurs because the .socket unit creates the libvirt control
> > socket very early in the boot, before even the network-pre target is
> > reached, and so long before the libvirtd.service dependencies are
> > satisfied and the daemon itself can be started to handle requests.
> > 
> > The deadlock in my case occurs when a udev rule for a device already
> > attached at boot tries to assign that device to a VM.
> > 
> > Prior to Bullseye, what would occur is:
> > 
> >  The udev rule calls a small script on device hot/cold plug which
> >  checks a config file, and if the device is allocated to a VM, then
> >  calls virsh to attach it to that VM.
> 
> Is that a sync call to virsh from udev via RUN ? 

It's a call to virsh attach-device - which unless I'm missing something
has no option except to be a "sync" call?

But also unless I'm really missing something, there really is no reason
that particular operation should ever block or be "long running" when
called for a local domain.  Either it fails immediately because the
local libvirtd is not running (prior to socket activation), it fails
immediately because the requested domain is not running, or it succeeds
or fails "almost immediately" according to whether the device could be
attached to it.
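
For the sake of being concrete, the shape of it is roughly like this
(the rule match, paths, and config format here are illustrative
stand-ins, not the actual bit-babbler implementation):

  # /etc/udev/rules.d/99-vm-attach.rules  (hypothetical example)
  ACTION=="add", SUBSYSTEM=="usb", ENV{ID_SERIAL_SHORT}=="?*", \
    RUN+="/usr/local/sbin/vm-attach $env{ID_SERIAL_SHORT}"

  #!/bin/sh
  # /usr/local/sbin/vm-attach  (hypothetical sketch)
  # If the named device is allocated to a VM in our config, try to
  # attach it now.  Every failure path here used to return immediately.
  serial="$1"
  domain=$(awk -v s="$serial" '$1 == s { print $2 }' \
               /etc/vm-devices.conf 2>/dev/null)
  [ -n "$domain" ] || exit 0
  # Fails fast if libvirtd has no socket yet, or if the domain isn't
  # running - in both cases a QEMU hook attaches the device later.
  virsh attach-device "$domain" "/etc/vm-devices/$serial.xml" || exit 0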

I agree there are cases where virsh *could* be "long running" (I
wouldn't try to spin up a new VM from udev RUN :), and pathological
cases where *any* process, even the most trivial, could become
"long running" - but neither of those are involved in the failure
mode I'm currently looking at avoiding.


> >  This 'immediately' either succeeds, fails because the desired VM
> >  is not actually running (yet), or fails because libvirtd is not
> >  running and virsh did not find its socket present.
> > 
> >  If either of the failure cases occur, the calling script fails
> >  gracefully, and a QEMU hook will later handle attaching the device
> >  if/when libvirtd and the desired VM is actually started.
> > 
> > But in Bullseye there's a three-way race, and if the zombie socket is
> > created before the udev rule runs, then virsh connects to it, but hangs
> > indefinitely waiting for libvirtd.service to be able to start and
> > respond to the request.
> 
> So far sounds like expected behaviour for socket activation.

Yes, I think that's the crux here.  I understand the wishful thinking in
the design of socket activation, where you just fire the starting gun
early and let everything race for the finish line with no explicit
ordering, hoping that it will just shake out as an emergent property ...

But that only works if none of the dependency relationships are cyclic.
As soon as there's a cycle (which is what we have here), the emergent
property is that you can't predict what it will break ...  and the only
tie-breaker is to time-out and kill something.

In the case I got to analyse here, the problem doesn't depend on the
bit-babbler package, or even on anything being called by udev RUN.  Any
early-start service which calls virsh for any reason, and which has an
ordering dependency requiring it to be started before anything in
libvirtd.service's After list, would fall into the same trap.
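
To spell the cycle out for the case quoted in more detail below:

  udev RUN job (virsh)  ->  blocks on the already-created libvirt socket
  libvirtd.service      ->  won't start before network.target is reached
  networking.service    ->  waits on ifupdown-pre's udevadm settle
  udevadm settle        ->  waits on the udev RUN job at the top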

So the problem in libvirt's circle of influence isn't "a long running
service" spawned by udev; it's that it's now not safe for *anything*
to call virsh without an explicit dependency somehow declaring that it
must not be required to complete before libvirtd.service is successfully
started ...

Otherwise libvirtd will never start, and virsh will never return.
Which might seem like a trivial problem to avoid, but as we see here,
it's quite easy to create it accidentally in the tangle of multiple
independent requirements created by multiple independent authors.

And I assume this would be true for even the most seemingly benign
uses of virsh (like version --daemon or sysinfo et al.) - there are
certainly many things being called by RUN rules which I'd expect are
more likely to become "long running" than virsh version ...
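
Which effectively means anything with a unit to hang it on now needs
boilerplate along these lines before its jobs may safely call virsh
(a sketch for a hypothetical early-start service - and udev RUN jobs
don't even have a unit where they could declare this):

  # /etc/systemd/system/my-early-task.service.d/virsh-order.conf
  # Hypothetical drop-in: without an ordering like this, any virsh
  # call this unit makes can block until libvirtd.service is started.
  [Unit]
  Wants=libvirtd.service
  After=libvirtd.service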


I'm not clinging tight to any particular solution to this, but it's
evidently a real trap, and the "obvious" answer to me for restoring the
previous graceful and intuitive failure modes of virsh is to not have
the socket present before the daemon can actually service it - whether
that is via unit dependencies, or by returning to libvirtd creating it
by itself ...  maybe there are better answers for that, but I haven't
seen any suggested yet to consider?
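
For the unit dependency variant, what I had in mind is something like
the sketch below - the After list shown is only indicative, the real
thing would mirror whatever libvirtd.service actually declares:

  # /etc/systemd/system/libvirtd.socket.d/defer-socket.conf
  # Sketch only: don't create the socket until the service's own
  # preconditions are met, so nothing can connect to a socket that
  # no daemon is even eligible to answer yet.
  [Unit]
  After=network.target
  After=dbus.service
  After=local-fs.target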


> > The deadlock in this specific case then happens when ifupdown-pre (but
> > it could be any of many other things) calls udevadm settle to give the
> > initial network devices a chance to be fully set up and available before
> > the networking.service brings them up.
> > 
> > Which in turn then hangs waiting for the (otherwise unrelated) udev rule
> > above to complete, which won't happen until libvirtd is started, which
> > won't happen until the udev rule returns (or udevadm settle times out)
> > and network.target (among others) can be reached.
> > 
> > Everything stops for two minutes until the systemd "bug solver" of
> > arbitrary timeouts starts killing things, and the machine finishes
> > booting without any network devices.
> > 
> > 
> > The latter can be avoided (in most cases at least) with a tweak to
> > the networking.service dependencies (the bug I've reported here
> > https://bugs.debian.org/998088 has more of the gory details of this
> > problem from the perspective of ifupdown's entanglement in it).
> > 
> > But we can avoid this specific incarnation of it completely if the
> > libvirtd.socket unit declared the same ordering dependencies as the
> > libvirtd.service does, so that anything calling virsh, at any time,
> > can reasonably expect an answer in finite time instead of blocking
> > indefinitely to wait for a service (that systemd already knows
> > does not even have the basic preconditions to make it eligible to
> > start yet but ignores that to create the socket anyway).
> > 
> > Unless systemd gets smarter about this, there may always be a race
> > with the possibility of circular deadlocks if creation of the
> > socket and responding to requests for it are not atomic with the
> > creation of the service using it - so it may actually be better to
> > just go back to letting the daemon create and manage the socket
> > itself (as its "activation" signal to users of that socket) - but
> > we can at least narrow the window for losing it significantly if
> > we defer creation of the socket until at least the point where
> > systemd thinks it can attempt to start the daemon (though with
> > no guarantee of success at that still ...)
> 
> it sounds like the problem comes about because:
> 
> - there's a sync call for a (potentially) long running program invocation
>   in a udev rule, which udev recommends against in its man page
> 
>   This can only be used for very short-running foreground tasks.
>   Running an event process for a long period of time may block all
>   further events for this or a dependent device.
> 
>   (virsh can take a long time even when libvirtd is up)

The problem here is not the time it takes a local domain attach-device
operation to complete though - it's that if you attempt it between when
the socket is created and when libvirtd is started, other things you
don't (and can't) control could create a situation where it deadlocks
and just never completes.  It's not that it "runs too long" in the
cases where it can actually complete ...

If you boil that part of this right down, it's equivalent to the classic
multi-threaded deadlock conditions created by taking multiple locks in
the wrong order.  It's a symptomless problem, until there's actually
contention for the resource.  But it's a real latent problem, with a
known fatal result when the race is lost.
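
A minimal shell rendition of that trap, using flock(1) purely for
illustration:

  # Two jobs take the same two locks in opposite order.  Run either
  # alone and it completes instantly; run them concurrently (the
  # contention case) and both block forever once each one holds the
  # lock the other needs next.
  flock /tmp/lock-a -c 'sleep 1; flock /tmp/lock-b -c true' &
  flock /tmp/lock-b -c 'sleep 1; flock /tmp/lock-a -c true' &
  wait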


> - ifupdown invokes udevadm settle which waits for the above and things
>   go downhill.

Which in this case is the "innocent victim" - because if its involvement
hadn't meant the network entirely failed to come up, this could have
quite easily gone unnoticed for a lot longer as just a "slow boot" with
a long, pointless, but otherwise harmless delay where it all stalled.

But again that's part of the issue here: if we don't smooth over all the
rough edges, ifupdown here is just a placeholder for any other process
that might reasonably become collateral damage in the future - and even
if nothing fails, I'd rather not be adding 2-3 useless minutes to a
normal boot time.
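
(And unless I'm misreading it, the "two minutes" isn't arbitrary - it's
just the default timeout of what ifupdown-pre effectively runs:)

  # waits for the udev event queue to empty; --timeout defaults to
  # 120 seconds, hence the ~2 minute stall while the stuck RUN job
  # keeps the queue non-empty.
  udevadm settle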

I'm just pointing at all the places where this analysis showed an easily
fixable problem.  This doesn't even seem to be a dilemma where we need to
choose between two bad things - we can just avoid this problem without
creating some other one in its place, can't we?

Have I really missed something, or just failed at explaining some part
of it still?


> I'm not into systemd's details of socket activation but from the above
> I'm not yet convinced there's anything on the libvirt side to fix. The
> best argument I would buy is: worked before, don't regress on upgrades
> (which could also be fixed by a mention in the NEWS file).

Yeah, we could put an orange cone next to the pothole.  But "we told you
so" isn't my personal favourite answer to people getting repeatedly bitten
by a problem that is fairly easily avoidable ...  I'm not the first to
get hit by this by a long shot, and none of the other cases I found were
using the bit-babbler package ... so I am genuinely concerned this can
bite even *me* again through someone else's accidental actions on some
future upgrade if we don't act to become robust against it.  Me knowing
it can happen is not enough to prevent it happening on my systems again
if we don't stop it at all the root causes.

Or is there really some advantage to having the zombie socket exist
before we even know whether libvirtd *can* be started?

As I say, I understand the hope in the socket activation design, but in
the case of libvirtd, I don't see it really participating meaningfully
in managing the start-up ordering?

There might be some advantage to using the systemd access controls et al.
but we can still have that even if the socket unit is not activated
until libvirtd.service is clear to start, can't we?
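
As far as I can see those controls are just the [Socket] settings,
which apply whenever the unit is eventually started - something like
(values illustrative, from memory of the upstream unit):

  # Nothing here requires the socket to exist from early boot.
  [Socket]
  ListenStream=/run/libvirt/libvirt-sock
  SocketGroup=libvirt
  SocketMode=0660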

Is there some other consideration I'm missing here?


> Can we please keep the discussion on a purely technical level?

I appreciate that very much, thank you!

  Cheers,
  Ron


