[Pkg-xen-devel] Bug#912975: xen-hypervisor-4.8-amd64: Dom0 crashes randomly without logs on Debian Stretch with Xen 4.8.4

Roalt Zijlstra | webpower roalt.zijlstra at webpower.nl
Wed Nov 7 11:48:54 GMT 2018


Hi Hans,


On Tue, 6 Nov 2018 at 18:54, Hans van Kranenburg <hans at knorrie.org> wrote:

> Hi,
>
> On 11/5/18 12:37 PM, Roalt Zijlstra wrote:
> > Package: src:xen
> > Version: 4.8.4+xsa273+shim4.10.1+xsa273-1+deb9u10
> > Severity: important
> >
> > Updating Xen to the latest 4.8 version from the security repo makes
> > servers unstable.
>
> Can you confirm that this is the only change that you made between the
> before/after scenario? I mean, if you downgrade the packages, or you
> drop the old hypervisor xen-x.y-amd64.gz in /boot again, it's stable again?
>

We have several servers running the previous version, and those are still
stable. The servers that we upgraded using 'apt-get update; apt-get
upgrade' were rock solid before the upgrade.
I did prepare a downgrade script in case it is needed, but at the moment
the interval between crashes seems to be longer than before. We did have
servers crashing every 2 days, and even one crashing twice a day.


>
> > The servers randomly reset without any logs.
>
> Do you have the noreboot option set on the Xen hypervisor command line?
>
>
For now, one busy server runs an older 4.9.0-4-amd64 kernel, with a 3.16
kernel DomU with MySQL server on it. The second busy server runs all DomUs
with 4.9 (backport) kernels on the latest 4.9.0-8-amd64 kernel for the
Dom0. We are currently waiting for a crash.

The last-mentioned server was rebooted with the noreboot option, so we
can check the console for errors once it crashes. The remaining two
servers are our fall-back servers and are not that busy. We have seen
them crash too, but we noticed that the less busy servers did not crash
as often. Once they were busy, though, they crashed as quickly as the
master servers.
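
For anyone following along, here is a minimal sketch of how the noreboot
option could be put on the Xen command line on a Debian system. It assumes
GRUB 2 with the stock /etc/default/grub; the file location and the use of
GRUB_CMDLINE_XEN_DEFAULT are assumptions about the setup, not a record of
what was done on these servers.

```shell
# /etc/default/grub (fragment). Assumed location; adjust for your setup.
# Options in GRUB_CMDLINE_XEN_DEFAULT are passed to the Xen hypervisor,
# not to the Dom0 Linux kernel.
# noreboot: do not reboot automatically after a hypervisor crash, so the
# last messages stay visible on the console.
GRUB_CMDLINE_XEN_DEFAULT="noreboot"

# Regenerate grub.cfg and reboot afterwards (run as root):
#   update-grub
```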


> Are you able to configure and capture output from serial console?
>

Oh wow..  Using old technology for debugging :-) I will need to see how
that configuration is done. We could connect up physical serial cables
between different machines.
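
A rough sketch of what such a serial console configuration might look
like, assuming GRUB 2 on Debian. The port (com1), the line settings
(115200,8n1) and the extra log-level options are illustrative assumptions
and would need to match the actual hardware and the machine on the other
end of the cable.

```shell
# /etc/default/grub (fragment). Illustrative values, not a tested setup.
# Hypervisor side: enable the first serial port, send Xen console output to
# both serial and VGA, and log everything from Xen and the guests.
GRUB_CMDLINE_XEN_DEFAULT="${GRUB_CMDLINE_XEN_DEFAULT:-} com1=115200,8n1 console=com1,vga loglvl=all guest_loglvl=all"

# Dom0 side: route the Linux console through the Xen console (hvc0), so
# Dom0 kernel messages end up on the same serial line.
GRUB_CMDLINE_LINUX_DEFAULT="${GRUB_CMDLINE_LINUX_DEFAULT:-} console=hvc0"

# Regenerate grub.cfg and reboot afterwards (run as root):
#   update-grub
# On the receiving machine, something like `screen /dev/ttyS0 115200`
# (the device name is an assumption) can capture the output.
```

With output on the serial line, the last messages before a reset should
show whether the hypervisor or the Dom0 kernel went down first.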


> First interesting thing to know is if it's the Dom0 that crashes, or if
> it's the hypervisor itself, and the logging will tell you that.
>
> > We have several Debian Stretch servers running Xen 4.8 and only the
> > ones updated to the 4.8.4+xsa273+shim4.10.1+xsa273-1+deb9u10
> > version tend to crash ranging from 'twice a day' to 'once every two
> > weeks'. We have already ruled out hardware as an
> > issue, since we have 4 individual servers which are different in
> > hardware setup and also were bought at different times.
> > And these servers ran stable with the previous version
> > 4.8.3+xsa267+shim4.10.1+xsa267-1+deb9u9.
> > These servers are acting exactly the same. Everything works as it
> > should, but without any logs it crashes and resets at
> > a certain point.
> >
> > It looks like it could have something to do with DomUs running older
> > (3.16) Linux kernels. As a test we applied 4.9 kernels to
> > all Jessie DomU servers and so far it has run for 13 days (but this
> > server did crash twice in one day).
> > We have seen this behaviour with Xen on CentOS 6 and 7 too, but the
> > trouble seems to be fixed after some more updates.
>
> It can be frustrating that there's not much response on the mailing
> lists. But, these kinds of problems can be really hard to debug and
> solve. Unless there's a clear reproduction scenario and debug output,
> there's often no one who can help you remotely.
>

Well, we have been having these issues since February this year, with
unstable Xen servers crashing once a month or so. The first issues were on
fresh CentOS 7 servers, but then we also got them with updated CentOS 6
servers. We then decided to use Debian Stretch, and the first tests were
pretty stable. We did install a new R740 with it (Xen 4.8.4-pre) and that
ran for 110 days pretty well.


> > As said.. I cannot provide logs since it simply resets without notice.
>
> It's still the best starting point...


Well, hopefully the server booted with 'noreboot' crashes soon so that we
get some logs. I will check if we can do any serial console tricks.

Roalt