[Pkg-xen-devel] Bug#810964: [Xen-devel] MCE/EDAC Status/Updating?

Elliott Mitchell ehem+xen at m5p.com
Fri Feb 22 05:11:18 GMT 2019


On Mon, Feb 18, 2019 at 02:37:48AM -0700, Jan Beulich wrote:
> >>> On 18.02.19 at 09:42, <ehem+xen at m5p.com> wrote:
> > On Mon, Feb 18, 2019 at 01:12:16AM -0700, Jan Beulich wrote:
> >> >>> On 15.02.19 at 19:20, <ehem+xen at m5p.com> wrote:
> >> > On Fri, Feb 15, 2019 at 03:58:49AM -0700, Jan Beulich wrote:
> >> >> Well, Fam10 is mentioned explicitly, but as per the use of e.g.
> >> >> mcheck_amd_famXX newer ones are supported by this code
> >> >> as well.
> >> > 
> >> > In that case sometime between Xen 4.1 and Xen 4.4, the AMD MCE/EDAC code
> >> > was completely broken and hasn't been fixed.
> >> 
> >> I can't say I'm surprised, but details of the breakage would still
> >> be appreciated.
> > 
> > Originally noticed with Debian: https://bugs.debian.org/810964 
> > 
> > Original observer noticed that half the memory controllers were missing
> > from Linux's Domain-0 dmesg with Xen 4.4.  EDAC capability flags are
> > missing with Xen 4.4.
> 
> And I had been commenting in this bug. I don't recall technical data
> ever having emerged on the list here as to what is really going on,
> and what the root of this perceived regression is.

I've been having an interesting time trying to figure out where to look
to find appropriate information.

I'm thinking Debian's default Xen log threshold is a little too high, so
too little gets logged.  Adding "loglvl=info" doesn't put all that much
more in Xen's dmesg.  I suspect "mce_verbosity=verbose" may be a
different story though.

"loglvl=info" gets me "AMD Fam10h machine check reporting enabled", so
it looks like Xen is successfully getting its MCE support operational.

Taking a closer look at Dom0's dmesg though:

MCE: In-kernel MCE decoding enabled.
EDAC amd64: DRAM ECC enabled.
EDAC amd64: NB MCE bank disabled, set MSR 0x0000017b[4] on node 0 to enable.
EDAC amd64: ECC disabled in the BIOS or no ECC capability, module will not load.
 Either enable ECC checking or force module loading by setting 'ecc_enable_override'.
 (Note that use of the override may cause unknown side effects.)

So it seems Linux wants bit 4 of MSR_IA32_MCG_CTL (MSR 0x17b, the NB MCE
bank enable going by the message above) set before the amd64 EDAC driver
will willingly load (beyond that, I've no idea what this bit does).

This was done in commit b272353fe98db5bdc73fff3c60a0574835df4c87.
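
For anyone wanting to check the bit by hand, here is a minimal sketch in
Python.  The MSR number and bit index come from the dmesg line above; the
helper names are made up for illustration.  read_msr() assumes the Linux
"msr" module is loaded and that it is run as root, and under Xen a Dom0
read goes through whatever Xen chooses to present, so the value may not
match the hardware.

```python
import os
import struct

MSR_IA32_MCG_CTL = 0x17B  # MSR number taken from the dmesg line
NB_MCE_BANK_BIT = 4       # bit index taken from the dmesg line


def nb_mce_bank_enabled(mcg_ctl: int) -> bool:
    """Return True if bit 4 (NB MCE bank) is set in an MCG_CTL value."""
    return bool((mcg_ctl >> NB_MCE_BANK_BIT) & 1)


def with_nb_mce_bank(mcg_ctl: int) -> int:
    """Return the MCG_CTL value with bit 4 set, other bits unchanged."""
    return mcg_ctl | (1 << NB_MCE_BANK_BIT)


def read_msr(msr: int, cpu: int = 0) -> int:
    """Read one 64-bit MSR via /dev/cpu/<cpu>/msr (needs root + msr module)."""
    fd = os.open(f"/dev/cpu/{cpu}/msr", os.O_RDONLY)
    try:
        # The msr device uses the file offset as the MSR number.
        return struct.unpack("<Q", os.pread(fd, 8, msr))[0]
    finally:
        os.close(fd)
```

Something like nb_mce_bank_enabled(read_msr(MSR_IA32_MCG_CTL)) in Dom0
should then show what the EDAC driver is seeing.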


> > I'd been working with a processor Linux was reporting as
> > "cpu family : 16" (ah yes "10h", that funky olde way of referring to
> > things) and noticing Linux's EDAC support failing on kernel start.  In
> > which case the EDAC support on AMD processors was completely broken
> > between 4.1 and 4.4 (hadn't realized that processor was just old enough
> > to be interesting).
> 
> While there's a relation, I think we need to keep #MC handling
> and EDAC separate here: The latter lives entirely in Dom0. And
> as said in the Debian bug, at least back at the time there was no
> reason to believe the driver would work on Xen other than by
> accident.

True, they might have merely been noticed at the same time and in fact
be two distinct issues.  Having EDAC reporting broken is *very* bad.

I am left noticing, though, that the state of Xen's EDAC support looks
rather odd compared to how other bits of Xen are evolving.  Rather than
moving in the direction of para-virtualization, this code looks to be
heading more towards true virtualization.

A more PV type approach might be to let Dom0 handle decoding the machine
check registers.  Dom0 would then ask Xen what is at physical address X,
potentially turn the answer into a PV message to the appropriate domain,
and potentially log the event.  Such an approach could also be used to
synthesize machine check events for testing VMs.  Qemu would then need
code to simulate the appropriate register values for an HVM guest.
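
To make the idea a bit more concrete, here is a rough sketch of that flow
in Python.  Everything in it is hypothetical: FakeXen, owner_of() and
Dom0Handler are stand-ins invented for illustration, not existing Xen or
Linux interfaces, and 4 KiB pages are assumed.

```python
from dataclasses import dataclass, field
from typing import Optional

PAGE_SHIFT = 12  # assumes 4 KiB pages


@dataclass
class MachineCheck:
    bank: int
    status: int
    addr: int  # physical address reported with the event


@dataclass
class FakeXen:
    """Stand-in for the hypervisor: maps frame numbers to owning domain IDs."""
    owners: dict = field(default_factory=dict)

    def owner_of(self, phys_addr: int) -> Optional[int]:
        # Hypothetical query: "what is at physical address X?"
        return self.owners.get(phys_addr >> PAGE_SHIFT)


@dataclass
class Dom0Handler:
    """Dom0 side: decode, ask Xen for the owner, forward, and log."""
    xen: FakeXen
    log: list = field(default_factory=list)
    sent: list = field(default_factory=list)  # (domid, event) PV messages

    def handle(self, mc: MachineCheck) -> Optional[int]:
        domid = self.xen.owner_of(mc.addr)
        self.log.append(
            f"MCE bank {mc.bank} status {mc.status:#x} "
            f"addr {mc.addr:#x} owner {domid}"
        )
        # Only forward when the page belongs to some other domain.
        if domid is not None and domid != 0:
            self.sent.append((domid, mc))
        return domid
```

Feeding handle() a hand-built MachineCheck is the "synthesize events for
testing" case: nothing in the handler cares whether the event came from
hardware.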


-- 
(\___(\___(\______          --=> 8-) EHM <=--          ______/)___/)___/)
 \BS (    |         ehem+sigmsg at m5p.com  PGP 87145445         |    )   /
  \_CS\   |  _____  -O #include <stddisclaimer.h> O-   _____  |   /  _/
8A19\___\_|_/58D2 7E3D DDF4 7BA6 <-PGP-> 41D1 B375 37D0 8714\_|_/___/5445











