[Pkg-xen-devel] system still crashes with bullseye and kernel v5.3

Wed Dec 18 21:37:47 GMT 2019

Hi Alexander,

On 12/18/19 9:24 PM, Alexander Dahl wrote:
> 
> meanwhile I'm running bullseye with kernel v5.3, but the problem
> persists and my Xen system is annoyingly unstable due to this bug. I
> attach some more logs from the last days and add the debian xen devel
> list in Cc. Maybe someone over there has an idea how to fix this. After
> all the log shows plenty of hints it could have something to do with Xen.

I think the xen parts you see in the stack trace listings are usual
calls that show that a domU is asking dom0 via the hypervisor to do some
disk read/writes or send data over the network (the 'upcall').

https://wiki.xen.org/wiki/Event_Channel_Internals

So, after getting that request, the dom0 Linux kernel tries to execute
it, which is e.g. the enqueue function to throw a network packet at the
physical network interface.

The first error we see is the "transmit queue 0 timed out". This looks
like the Linux kernel is looking at the network port hardware, and
expects it to accept the packet, deal with it and put it on the wire.

When this does not happen, and the network port hardware seems frozen
and times out, it's forcibly reset (I don't know if the thing is
resetting itself because it crashed, or if the Linux kernel does
something to reset it). "Reset adapter unexpectedly" gives me the
feeling that the firmware inside the network card crashed and something
inside there also reset it.

> Anyone care to help debug this? I have no idea where to start. Can
> kernel or xen generate coredumps one could analyze? Or is the log output
> the only thing?
> 
> (If you look at the logs, the strange thing is the system does not crash
> and reboot immediately, but later after lots of errors with storage, but
> comes back fine after reboot.)

The ata errors (disk fails to process a command) happen after all of the
above happens. Usually disk errors that look like this point at broken
disk hardware or bugs in the firmware in the disk. However, if it
consistently happens 6 to 7 seconds after the network card disaster, it
might be a consequence of the network card problem rather than a
separate disk fault.
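
To check whether that 6 to 7 second gap really is consistent, something
like the following rough sketch could be run over the kernel log. It
assumes dmesg-style lines with a leading "[seconds]" timestamp, and
that the disk errors show up on lines starting with "ataN"; the function
name nic_to_ata_delay is made up for this example.

```shell
# Print the delay in seconds between each "transmit queue N timed out"
# line and the first ata line that follows it.
# Assumption: input is dmesg-style, "[ 1234.567890] message".
nic_to_ata_delay() {
    awk '
        # Extract the numeric timestamp from the leading "[ 1234.567890]".
        function ts(line) {
            sub(/^\[ */, "", line)
            sub(/\].*$/, "", line)
            return line + 0
        }
        # Remember when the network card timeout was logged.
        /transmit queue [0-9]+ timed out/ { nic = ts($0); armed = 1; next }
        # Report only the first ata line after each timeout.
        armed && /\] ata[0-9]+/ { printf "%.1f\n", ts($0) - nic; armed = 0 }
    '
}
```

Usage would be along the lines of "dmesg | nic_to_ata_delay" in dom0. If
the printed delays are nearly identical for every incident, that
supports the idea that the disk errors follow from the network card
reset.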

The first thing I would recommend is disabling transmit segmentation
offloading to the network card in dom0 (ethtool -K enp1s0 tso off) and
seeing if it prevents the network card from choking on some kind of
input. If not, play with more settings like transmit checksum offloading
(tx off).
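
As a minimal sketch of how to step through those settings in dom0:
ethtool changes offload features with the -K option, and the lowercase
-k option only lists them. The interface name enp1s0 is the one from
this thread; the run and offload_off helpers and the IFACE and DRY_RUN
variables are made up for this example. By default the script only
prints the commands instead of running them.

```shell
#!/bin/sh
# Interface name as used in this thread; override with IFACE=... if needed.
IFACE="${IFACE:-enp1s0}"

# With DRY_RUN=1 (the default in this sketch) commands are only printed.
# Set DRY_RUN=0 and run as root to actually apply them.
run() {
    if [ "${DRY_RUN:-1}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Turn one offload feature off, e.g. "tso" or "tx".
offload_off() {
    run ethtool -K "$IFACE" "$1" off
}

# List the current offload settings first (read-only).
run ethtool -k "$IFACE"

# First attempt: disable TCP segmentation offload only.
offload_off tso
```

If the timeouts keep happening, "offload_off tx" would be the next one
to try. Changing one feature at a time makes it easier to tell which
one matters. These settings are not persistent, so they need to be
applied again after a reboot.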

If this does not help, we can start asking some Xen developers if they
have an idea how we can help with debugging and what we should do. (I
help maintain the Xen packages in Debian; my knowledge of its internals
is mostly limited to all the been-there-done-thats from years of using
it as a user.)

I expect the problem to be related to Linux and the hardware, and not
specifically Xen. Knowing if the same happens when just booting Linux
without Xen is valuable debugging info. However, I realize that in that
case it's likely a bit complicated to trigger the problem by generating
the same workload that's now coming from the domUs.

Curious to hear what happens,
Thanks,
Hans van Kranenburg