[Pkg-xen-devel] Bug#988477: xen-hypervisor-4.14-amd64: xen dmesg shows (XEN) AMD-Vi: IO_PAGE_FAULT on sata pci device

Elliott Mitchell ehem+debian at m5p.com
Sun Aug 25 23:58:30 BST 2024


On Sun, Aug 25, 2024 at 11:41:44PM +0200, Maximilian Engelhardt wrote:
> I am changing the severity back to normal as the xen package works fine for 
> many people without any serious issues. From your last message it also seems 

Yet for some lucky people data is corrupted/lost.  There could be other
people who reproduce this, but don't send e-mail saying "me too" to this
bug report.

Presently the main reason there aren't very many reproductions is that
few people bother to use RAID with flash.  The initial reports are that
SSDs have a lower failure rate than disks, but the failure rate isn't
even close to zero.  The data loss/corruption, by contrast, reproduces
easily.

While both cases in #988477 were on systems with AMD hardware, I am
presently doubtful that is a requirement.  The most similar known bug was
found to be more severe on AMD hardware, but to also occur on Intel
hardware.  I suspect this issue may be similar, and simply no one has
noticed the problem yet...

> you found a workaround for your problem. Please don't change the bug severity 

Something was found which seems to have made another issue more
prominent.  It may reduce the rate at which data corruption occurs, but
I've since confirmed data loss/corruption continues to occur.

> without at least giving an explanation why you think the new severity is 
> justified.

I had thought the original reporter's justification was sufficient.  This
issue appears to have some specific requirements, but if you meet them
you may be in trouble before any alerts trigger.

So far both reports are with AMD machines with IOMMUv2 functionality (I
tried on a machine with IOMMUv1/GART and it didn't reproduce).  Both
reports feature Samsung SATA devices.  An NVMe device from another
manufacturer also showed the issue (I'm almost certain Samsung NVMe
devices will also show the issue).

I suspect Intel machines may also be affected by this issue, but it may
not manifest as severely.  I suspect this is a case of people with AMD
machines being a bit more wary of hardware failure (thus actually
bothering to use RAID1 even with flash devices).

> From the few log lines in this bug report this seems to be an upstream issue 
> with xen or the linux kernel. Please report your observations upstream. The 
> Debian xen team does not have the resources and knowledge to debug or fix such 
> problems. Once the issue has been identified and fixed upstream we can see if 
> we can backport a fix to our Debian packages, but this is only possible once 
> an upstream fix has landed.

Perhaps it has become easier to report things upstream, but the original
procedure was that reporters were supposed to report to bugs.debian.org
and NOT forward upstream.

The other problem is I've run into a chasm with upstream and have no way
to build a bridge across.

I do have one more thing to try, but don't yet have a time-frame for
when I'll check that.


-- 
(\___(\___(\______          --=> 8-) EHM <=--          ______/)___/)___/)
 \BS (    |         ehem+sigmsg at m5p.com  PGP 87145445         |    )   /
  \_CS\   |  _____  -O #include <stddisclaimer.h> O-   _____  |   /  _/
8A19\___\_|_/58D2 7E3D DDF4 7BA6 <-PGP-> 41D1 B375 37D0 8714\_|_/___/5445


