[Pkg-xen-devel] Bug#988477: Also observing #988477

Elliott Mitchell ehem+undef at m5p.com
Thu Jan 18 16:04:13 GMT 2024


tags 988477 - moreinfo
found 988477 4.17.2+76-ge1f9cb16e2-1~deb12u1
affects 988477 src:linux
severity 988477 critical
quit

I am also observing #988477 occur.  This machine has an AMD Zen 4
processor.  I first observed it after the motherboard/processor was
swapped out; the older motherboard/processor was several generations old.

The pattern which is emerging is Linux MD RAID1 plus recent AMD processor
which has full IOMMU functionality.  The older machine was believed to
have an IOMMU, but the BIOS wasn't creating appropriate ACPI tables
(IVRS) and thus Xen was unable to utilize it.

This seems to be occurring with a small percentage of write operations.
Subsequent read operations appear to be fine.

I am not convinced this is a Xen bug.  I suspect this is instead a bug
in the Linux MD subsystem.  In particular, the DMA interface may have
been designed assuming only a single device would ever access any given
page, while the MD RAID1 driver reuses the same page for both devices.

IOMMU page release could be handled by marking the page unused in a
device data structure, with the entry later removed by sweeping a table.
In that case, if the MD RAID1 driver were to redirect the page to another
device between these two steps, the entry for the subsequent device could
be wiped out when trying to invalidate the entry for the prior device.
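
To make the suspected ordering concrete, here is a toy model of it.  This
is only a sketch of the hypothesis above, not code from Xen or Linux: the
class ToyIommu, its methods, and the names "disk0"/"disk1" are all made
up for illustration, and it assumes the mapping table is keyed by page
alone with teardown deferred to a later sweep.

```python
class ToyIommu:
    """Hypothetical model: one table keyed by page, with lazy teardown."""

    def __init__(self):
        self.table = {}    # page -> device currently allowed to DMA to it
        self.pending = []  # pages marked unused, waiting for the sweep

    def map(self, page, device):
        self.table[page] = device

    def unmap_lazy(self, page):
        # Step 1: only mark the page as unused; the entry stays for now.
        self.pending.append(page)

    def sweep(self):
        # Step 2: drop every marked page, whoever owns the entry by now.
        for page in self.pending:
            self.table.pop(page, None)
        self.pending.clear()

    def can_dma(self, page, device):
        return self.table.get(page) == device


def mirrored_write(iommu, page):
    """Write one page to both mirror halves, reusing the same page."""
    iommu.map(page, "disk0")
    iommu.unmap_lazy(page)      # first device is done with the page
    iommu.map(page, "disk1")    # page redirected before the sweep runs
    iommu.sweep()               # deferred release removes disk1's entry
    return iommu.can_dma(page, "disk1")
```

In this model mirrored_write() returns False: the sweep meant for the
first device's mapping removes the second device's mapping instead, which
is the kind of failure that would only hit a fraction of writes.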


Anyway, I'm also observing bug #988477.  This could also be a kernel bug.
So far no crashes or confirmed data loss have occurred, but sweeping the
mirror does turn up small numbers of inconsistencies.
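
For anyone wanting to compare numbers, a sweep like this can be driven
through the md sysfs interface.  The sketch below assumes the sweep is the
md "check" action and that the array is md0; the function names and the
sysfs_root parameter are my own (the latter only so the code can be
pointed at a scratch directory), while sync_action and mismatch_cnt are
the standard md sysfs files.  Writing to sync_action needs root.

```python
from pathlib import Path


def md_dir(array="md0", sysfs_root="/sys/block"):
    """Return the md attribute directory for an array (array name assumed)."""
    return Path(sysfs_root) / array / "md"


def request_check(array="md0", sysfs_root="/sys/block"):
    """Ask md to start a read-only consistency check of the array."""
    (md_dir(array, sysfs_root) / "sync_action").write_text("check\n")


def read_sync_action(array="md0", sysfs_root="/sys/block"):
    """Return the current action, e.g. 'idle' once a check has finished."""
    return (md_dir(array, sysfs_root) / "sync_action").read_text().strip()


def read_mismatch_cnt(array="md0", sysfs_root="/sys/block"):
    """Return the mismatch counter left by the most recent check."""
    return int((md_dir(array, sysfs_root) / "mismatch_cnt").read_text().strip())
```

Typical use would be request_check(), waiting until read_sync_action()
reports "idle" again, then read_mismatch_cnt().  A nonzero value after a
check on RAID1 is what I mean by inconsistencies above.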


-- 
(\___(\___(\______          --=> 8-) EHM <=--          ______/)___/)___/)
 \BS (    |         ehem+sigmsg at m5p.com  PGP 87145445         |    )   /
  \_CS\   |  _____  -O #include <stddisclaimer.h> O-   _____  |   /  _/
8A19\___\_|_/58D2 7E3D DDF4 7BA6 <-PGP-> 41D1 B375 37D0 8714\_|_/___/5445


