[Pkg-libvirt-maintainers] Bug#998090: libvirt-daemon-system: Please defer starting libvirtd.socket until libvirtd.service dependencies can be met

Guido Günther agx at sigxcpu.org
Wed Nov 3 18:15:37 GMT 2021


Hi Ron,

Sorry for the broken boot. That's always annoying.

On Sat, Oct 30, 2021 at 05:39:45PM +1030, Ron wrote:
> Package: libvirt-daemon-system
> Version: 7.0.0-3
> Severity: important
> 
> Hi,
> 
> Systemd has a class of boot-time races which can result in deadlock,
> which I learned more than I ever wanted to know about when Buster to
> Bullseye upgrades started leaving me with machines that were off the
> network when they were rebooted ...  The reason for that is a bit of a
> tangle of otherwise unrelated packages, and there are many ways this
> *could* happen, but the root of it in my particular case was the libvirt
> package switching to use socket activation instead of letting the daemon
> create its own socket when it is ready to respond to requests on it.
> 
> The race occurs because the .socket unit creates the libvirt control
> socket very early in the boot, before even the network-pre target is
> reached, and so long before the libvirtd.service dependencies are
> satisfied and the daemon itself can be started to handle requests.
> 
> The deadlock in my case occurs when a udev rule for a device already
> attached at boot tries to assign that device to a VM.
> 
> Prior to Bullseye, what would occur is:
> 
>  The udev rule calls a small script on device hot/cold plug which
>  checks a config file, and if the device is allocated to a VM, then
>  calls virsh to attach it to that VM.

Is that a synchronous call to virsh from udev via RUN?

>  This 'immediately' either succeeds, fails because the desired VM
>  is not actually running (yet), or fails because libvirtd is not
>  running and virsh did not find its socket present.
> 
>  If either of the failure cases occur, the calling script fails
>  gracefully, and a QEMU hook will later handle attaching the device
>  if/when libvirtd and the desired VM is actually started.
> 
> But in Bullseye there's a three-way race, and if the zombie socket is
> created before the udev rule runs, then virsh connects to it, but hangs
> indefinitely waiting for libvirtd.service to be able to start and
> respond to the request.

So far that sounds like expected behaviour for socket activation.

> The deadlock in this specific case then happens when ifupdown-pre (but
> it could be any of many other things) calls udevadm settle to give the
> initial network devices a chance to be fully set up and available before
> the networking.service brings them up.
> 
> Which in turn then hangs waiting for the (otherwise unrelated) udev rule
> above to complete, which won't happen until libvirtd is started, which
> won't happen until the udev rule returns (or udevadm settle times out)
> and network.target (among others) can be reached.
> 
> Everything stops for two minutes until the systemd "bug solver" of
> arbitrary timeouts starts killing things, and the machine finishes
> booting without any network devices.
> 
> 
> The latter can be avoided (in most cases at least) with a tweak to
> the networking.service dependencies (the bug I've reported here
> https://bugs.debian.org/998088 has more of the gory details of this
> problem from the perspective of ifupdown's entanglement in it).
> 
> But we can avoid this specific incarnation of it completely if the
> libvirtd.socket unit declared the same ordering dependencies as the
> libvirtd.service does, so that anything calling virsh, at any time,
> can reasonably expect an answer in finite time instead of blocking
> indefinitely to wait for a service (that systemd already knows
> does not even have the basic preconditions to make it eligible to
> start yet but ignores that to create the socket anyway).
> 
> Unless systemd gets smarter about this, there may always be a race
> with the possibility of circular deadlocks if creation of the
> socket and responding to requests for it are not atomic with the
> creation of the service using it - so it may actually be better to
> just go back to letting the daemon create and manage the socket
> itself (as its "activation" signal to users of that socket) - but
> we can at least narrow the window for losing it significantly if
> we defer creation of the socket until at least the point where
> systemd thinks it can attempt to start the daemon (though with
> no guarantee of success at that still ...)

It sounds like the problem comes about because:

- there's a synchronous invocation of a (potentially) long-running
  program in a udev rule, which udev recommends against in its man page:

    This can only be used for very short-running foreground tasks.
    Running an event process for a long period of time may block all
    further events for this or a dependent device.

  (virsh can take a long time even when libvirtd is up)

- ifupdown invokes udevadm settle, which waits for the above, and things
  go downhill.

I'm not into systemd's details of socket activation, but from the above
I'm not yet convinced there's anything on the libvirt side to fix. The
best argument I would buy is: it worked before, so don't regress on
upgrades (which could also be addressed by a mention in the NEWS file).
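
For illustration, here is a minimal sketch of how a udev helper script
could get the old fail-fast behaviour back: skip the virsh call while
libvirtd.service is not active, and bound it with a timeout otherwise.
The function name, the 5 second limit and the exact virsh subcommand are
assumptions made for the example, not taken from Ron's script:

```shell
#!/bin/sh
# Sketch of a udev helper that avoids blocking on the socket-activated daemon.
# Assumptions (hypothetical, not from the bug report): the function name,
# the 5 second limit, and "virsh attach-device" with a device XML file.

attach_if_ready() {
    domain=$1
    device_xml=$2

    # If libvirtd.service is not active, do not touch the socket at all:
    # connecting would wait until systemd is able to start the daemon.
    if ! systemctl is-active --quiet libvirtd.service; then
        return 0
    fi

    # Bound the call so the udev event is never held up for long.
    # A failure or timeout is not fatal here; a later hook can retry.
    timeout 5 virsh attach-device "$domain" "$device_xml" --live || return 0
}
```

The script run by the udev rule would then call something like
attach_if_ready "$domain" "$device_xml" and leave the retry to the
existing QEMU hook, as it did before the upgrade.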

> 
> I hope I haven't missed anything that makes this make sense in the
> context of libvirt ...  trying to look at and describe this from four
> entirely independent points of view, each that doesn't directly care
> about any of the others, is a bit of a hall of mirrors with small
> parts of the problem stuck to each of them!
> 
>   Cheers,
>   Ron

[..following up on another mail in this thread..]

> On Tue, 02 Nov 2021 15:34:01 +0100 Michael Biebl <biebl at debian.org> wrote:
> > I tried to explain this to Ron on IRC, but he decided to ignore my advice.
> 
> Oh Please Michael, now you're just sounding like a child whose lolly has
> fallen in the dirt ...

Can we please keep the discussion on a purely technical level?

Cheers,
 -- Guido


> 
> 
> -- System Information:
> Debian Release: 11.1
>   APT prefers stable-updates
>   APT policy: (500, 'stable-updates'), (500, 'stable-security'), (500, 'stable')
> Architecture: amd64 (x86_64)
> 
> Kernel: Linux 5.10.0-9-amd64 (SMP w/12 CPU threads)
> Locale: LANG=en_AU.utf8, LC_CTYPE=en_AU.utf8 (charmap=UTF-8), LANGUAGE=en_AU:en
> Shell: /bin/sh linked to /bin/dash
> Init: systemd (via /run/systemd/system)
> 
> Versions of packages libvirt-daemon-system depends on:
> ii  adduser                         3.118
> ii  debconf [debconf-2.0]           1.5.77
> ii  gettext-base                    0.21-4
> ii  iptables                        1.8.7-1
> ii  libvirt-clients                 7.0.0-3
> ii  libvirt-daemon                  7.0.0-3
> ii  libvirt-daemon-config-network   7.0.0-3
> ii  libvirt-daemon-config-nwfilter  7.0.0-3
> ii  libvirt-daemon-system-systemd   7.0.0-3
> ii  logrotate                       3.18.0-2
> ii  policykit-1                     0.105-31


