[Pkg-xen-devel] Bug#912975: xen-hypervisor-4.8-amd64: Dom0 crashes randomly without logs on Debian Stretch with Xen 4.8.4

Roalt Zijlstra | webpower roalt.zijlstra at webpower.nl
Wed Nov 7 14:43:23 GMT 2018


On Wed, 7 Nov 2018 at 14:30, Hans van Kranenburg <hans at knorrie.org> wrote:

> Hi,
>
> On 11/7/18 12:48 PM, Roalt Zijlstra | webpower wrote:
> >
> > On Tue, 6 Nov 2018 at 18:54, Hans van Kranenburg <hans at knorrie.org> wrote:
> >
> >     Hi,
> >
> >     On 11/5/18 12:37 PM, Roalt Zijlstra wrote:
> >     > Package: src:xen
> >     > Version: 4.8.4+xsa273+shim4.10.1+xsa273-1+deb9u10
> >     > Severity: important
> >     >
> >     > Updating Xen to the latest 4.8 version from the security repo
> >     makes servers unstable.
> >
> >     Can you confirm that this is the only change that you made between
> >     the before/after scenario? I mean, if you downgrade the packages, or
> >     you drop the old hypervisor xen-x.y-amd64.gz in /boot again, it's
> >     stable again?
> >
> >
> > We have several servers running the previous versions and those are
> > still stable. The servers that we upgraded using 'apt-get update;
> > apt-get upgrade'  were rock solid before the upgrade.
>
> Yes, that's why I was asking. Did that apt-get upgrade also upgrade your
> dom0 kernel? You can look back in /var/log/dpkg.log* about what
> happened. This is very relevant information.
>

Two servers were installed on 2018-04-24 and then upgraded:
2018-10-08 19:40:57 upgrade xen-hypervisor-4.8-amd64:amd64 4.8.3+xsa267+shim4.10.1+xsa267-1+deb9u9 4.8.4+xsa273+shim4.10.1+xsa273-1+deb9u10
2018-10-08 19:41:14 status installed xen-hypervisor-4.8-amd64:amd64 4.8.4+xsa273+shim4.10.1+xsa273-1+deb9u10
2018-07-31 18:50:01 upgrade xen-hypervisor-4.8-amd64:amd64 4.8.3+comet2+shim4.10.0+comet3-1+deb9u5 4.8.3+xsa267+shim4.10.1+xsa267-1+deb9u9
2018-07-31 18:50:45 status installed xen-hypervisor-4.8-amd64:amd64 4.8.3+xsa267+shim4.10.1+xsa267-1+deb9u9
2018-04-24 16:22:56 install xen-hypervisor-4.8-amd64:amd64 <none> 4.8.3+comet2+shim4.10.0+comet3-1+deb9u5
2018-04-24 16:23:05 status installed xen-hypervisor-4.8-amd64:amd64 4.8.3+comet2+shim4.10.0+comet3-1+deb9u5

The two other servers ran CentOS first and were converted to Debian for
other reasons, so they are fresh installs:
2018-09-26 22:01:34 install xen-hypervisor-4.8-amd64:amd64 <none> 4.8.4+xsa273+shim4.10.1+xsa273-1+deb9u10
2018-09-26 22:01:57 status installed xen-hypervisor-4.8-amd64:amd64 4.8.4+xsa273+shim4.10.1+xsa273-1+deb9u10
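To answer the "what else did that upgrade touch" question across several hosts, the install/upgrade history can be pulled out of /var/log/dpkg.log* with a few lines of Python. This is only a rough sketch: it assumes the "timestamp action package old-version new-version" layout seen in the entries above, and the function name parse_dpkg_events is made up for illustration.

```python
import re
from typing import Iterable, List, NamedTuple, Optional


class DpkgEvent(NamedTuple):
    timestamp: str
    action: str              # "install" or "upgrade"
    package: str
    old_version: Optional[str]
    new_version: str


# Matches only install/upgrade lines; "status installed ..." lines are skipped.
_LINE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) (install|upgrade) (\S+) (\S+) (\S+)$"
)


def parse_dpkg_events(lines: Iterable[str], package_prefix: str) -> List[DpkgEvent]:
    """Return install/upgrade events for matching packages, oldest first."""
    events = []
    for line in lines:
        match = _LINE.match(line.strip())
        if not match:
            continue
        timestamp, action, package, old, new = match.groups()
        if not package.startswith(package_prefix):
            continue
        # dpkg writes "<none>" as the old version on a fresh install.
        events.append(
            DpkgEvent(timestamp, action, package, None if old == "<none>" else old, new)
        )
    return sorted(events, key=lambda event: event.timestamp)


if __name__ == "__main__":
    with open("/var/log/dpkg.log") as log:
        for event in parse_dpkg_events(log, "xen-hypervisor"):
            print(event)
```

Calling it again with the prefix "linux-image" would show whether the dom0 kernel changed in the same apt-get run.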


>
> > I did prepare a downgrade script if needed, but at the moment the crash
> > interval in days seems to be higher than before. We did have servers
> > crashing every 2 days, or even one crashing twice a day.
>
> >     > The servers randomly reset without any logs.
> >
> >     Do you have the noreboot option set on the Xen hypervisor command
> >     line?
> >
> >
> > For now one busy server runs an older 4.9.0-4-amd64 kernel with a 3.16
> > kernel DomU with MySQL server on it. The second busy server runs all
> > domUs with 4.9 (backport) kernels on the latest 4.9.0-8-amd64 kernel
> > for the Dom0. Currently we are awaiting any crash.
>
> In Debian, 4.9.0-8-amd64 is in the name of the package, but the real
> kernel version is in the version of that package.
>
> So, if you have linux-image-4.9.0-8-amd64, you should always also
> mention the real version, which is now e.g. 4.9.110-3+deb9u6. This means
> it's based on 4.9.110 upstream.
>
> The kernel team follows the 4.9 LTS releases, but only when changes
> break the ABI (so custom modules have to be rebuilt) do they bump the
> number in the package name to trigger that process.
>

Right, I completely missed that detail.
Two heavily used servers run kernels:
4.9.65-3+deb9u1  with one Jessie DomU kernel: 3.16.57-2
4.9.110-3+deb9u6 with a few Jessie DomU kernels: 4.9.110-3+deb9u5~deb8u1
Two less used servers run:
4.9.110-3+deb9u5 with one Jessie DomU kernel: 3.16.59-1
4.9.110-3+deb9u5 with a few mixed Jessie DomU kernels: 3.16.59-1 and 3.16.57-2
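For comparing hosts it can help to split such package versions into the upstream part and the Debian revision. A minimal sketch, assuming the usual Debian "[epoch:]upstream[-revision]" layout, where the revision is whatever follows the last hyphen; the helper name split_debian_version is made up for this example.

```python
from typing import Tuple


def split_debian_version(version: str) -> Tuple[str, str, str]:
    """Split a Debian package version into (epoch, upstream, revision)."""
    epoch, sep, rest = version.partition(":")
    if not sep:
        # No epoch present: the whole string is upstream[-revision].
        epoch, rest = "", version
    upstream, sep, revision = rest.rpartition("-")
    if not sep:
        # No hyphen: a native package without a Debian revision.
        upstream, revision = rest, ""
    return epoch, upstream, revision


if __name__ == "__main__":
    for version in ("4.9.65-3+deb9u1", "4.9.110-3+deb9u6", "3.16.57-2"):
        print(version, "->", split_debian_version(version))
```

For 4.9.110-3+deb9u6 the upstream part is 4.9.110, which is the number that says which 4.9 LTS release the kernel is based on.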


>
> > The last mentioned server was rebooted with the noreboot option, so we
> > could eventually check the console for errors once it crashes.
> > The remaining two servers are our fall-back servers and are not that
> > busy. We have seen them crash too, but we noticed that the less busy
> > servers did not crash that often. But once they were busy they crashed
> > as quickly as the master servers.
>
> Ok, that's interesting extra data.
>
> >     Are you able to configure and capture output from serial console?
> >
> >
> > Oh wow..  Using old technology for debugging :-) I will need to see how
> > that configuration is done. We could connect up physical serial cables
> > between different machines.
>
> Well... old... It's the best way to capture text after everything
> crashes. On a vga display it scrolls away and you can't copy paste.
>
> If you're using recent Dell hardware, then I guess your drac provides an
> extra emulated serial console. I use HP hardware, there it's the ilo
> virtual serial port.
>

I will look into this. I have never used it before, since most crashes so
far did log errors before things stopped working.
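Before waiting for the next crash, it is worth confirming that the running hypervisor really was booted with noreboot (and later with the serial console options). A small sketch, assuming the xen_commandline line that "xl info" prints in dom0; the function name and the sample text are illustrative only and not taken from our servers.

```python
from typing import List


def xen_commandline_options(xl_info_text: str) -> List[str]:
    """Return the hypervisor boot options listed in `xl info` output."""
    for line in xl_info_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "xen_commandline":
            return value.split()
    return []


if __name__ == "__main__":
    # Hypothetical sample; on a real dom0, feed in the output of `xl info`.
    sample = (
        "xen_version            : 4.8.4\n"
        "xen_commandline        : placeholder noreboot com1=115200,8n1 console=com1,vga\n"
    )
    options = xen_commandline_options(sample)
    print("noreboot set:", "noreboot" in options)
```

The com1= and console= values in the sample are example settings; the right ones depend on the hardware and on how the DRAC or iLO virtual serial port is configured.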


>
> >     First interesting thing to know is if it's the Dom0 that crashes, or
> >     if it's the hypervisor itself, and the logging will tell you that.
> >
> >     > We have several Debian Stretch servers running Xen 4.8 and only
> >     > the ones updated to the 4.8.4+xsa273+shim4.10.1+xsa273-1+deb9u10
> >     > version tend to crash, ranging from 'twice a day' to 'once every
> >     > two weeks'. We have already ruled out hardware as the cause,
> >     > since we have 4 individual servers which differ in hardware
> >     > setup and were also bought at different times.
> >     > And these servers ran stable with the previous version
> >     > 4.8.3+xsa267+shim4.10.1+xsa267-1+deb9u9.
> >     > These servers all behave exactly the same. Everything works as it
> >     > should, but at a certain point the server crashes and resets
> >     > without any logs.
> >     >
> >     > It looks like it could have something to do with DomUs running
> >     > older (3.16) Linux kernels. As a test we applied 4.9 kernels to
> >     > all Jessie DomU servers and so far it has run for 13 days (but
> >     > this server did crash twice in one day).
> >     > We have seen this behaviour with Xen on CentOS 6 and 7 too, but
> >     > the trouble seems to be fixed after some more updates.
> >
> >     It can be frustrating that there's not much response on the mailing
> >     lists. But, these kinds of problems can be really hard to debug and
> >     solve. Unless there's a clear reproduction scenario and debug output,
> >     there's often no one who can help you remotely.
> >
> >
> > Well, we have been having these issues since February this year, with
> > unstable Xen servers crashing once a month or so. The first issues
> > were on fresh CentOS 7 servers, but then we also got them with updated
> > CentOS 6 servers. We then decided to use Debian Stretch and the first
> > tests were pretty stable. We did install a new R740 with it (Xen
> > 4.8.4-pre) and that ran for 110 days pretty well.
>
> I know this feeling. I've been debugging similar kinds of issues this
> year that appeared "every few weeks".
>
> >     > As said.. I cannot provide logs since it simply resets without
> >     > notice.
> >
> >     It's still the best starting point...
> >
> >
> > Well hopefully the 'noreboot' provided server crashes soon for some
> > logs. I will check if we can do any serial console tricks.
>
> Yes.
>

Oh and before I forget.. Thanks for all the feedback/help!

Roalt