[Pkg-xen-devel] Dom0 on Sid not working

Chuck Zmudzinski brchuckz at netscape.net
Tue May 17 16:24:18 BST 2022


On 4/23/22 6:03 PM, Marek Marczykowski-Górecki wrote:
> On Sat, Apr 23, 2022 at 12:42:36PM -0400, Chuck Zmudzinski wrote:
>> Hello,
>>
>> After updating Sid this morning, which included an upgrade to Linux kernel
>> version 5.17.x from 5.16.x, the system will not boot into Dom0 on Xen.
>>
>> Package versions:
>>
>> Xen hypervisor: xen-hypervisor-4.16-amd64 Debian version
>> 16.0+51-g0941d6cb-1+b1
>> Linux kernel in Dom0: linux-image-5.17.0-1-amd64 Debian version 5.17.3-1
>>
>> The system boots into Linux 5.17.3-1 fine on the bare metal, but not as Dom0
>> on the Xen 4.16 hypervisor Debian version 16.0+51-g0941d6cb-1+b1
>>
>> The system never shows boot messages or a login prompt. The monitor
>> stays powered on but displays no text or image at all, and the system
>> appears to hang indefinitely. Downgrading to the latest 5.16 kernel on
>> Xen 4.16 Debian
>> version 16.0+51-g0941d6cb-1+b1 works normally. Anybody else see this
>> problem?
> We have similar issue in Qubes OS:
> https://github.com/QubesOS/qubes-issues/issues/7462. Are you running
> this on AMD?
>

No, mine is an Intel Haswell Core i5 (4th gen). Diederik reported
that his Broadwell Intel Xeons (5th gen) work OK with 5.17 as Dom0
on Xen 4.16.

I was able to see the cause of the failure on my Intel Haswell
processor by booting successfully afterwards and looking at the
systemd journal entries from the failed boot:

Apr 23 08:44:59 debian kernel: [    3.876169] i915 0000:00:02.0: [drm:add_taint_for_CI [i915]] CI tainted:0x9 by intel_gt_init+0xb6/0x2e0 [i915]

After that, kern.log shows entries for just under a second more,
and then logging stops and the system hangs.
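
For anyone who wants to check their own logs for the same message,
here is a minimal sketch in Python that scans kernel log text for
"CI tainted" entries. The regular expression is only an assumption
based on the one journal entry quoted above (unwrapped onto a single
line); the function name find_ci_taints and the field names are made
up for this example and are not part of any kernel or systemd tool.

```python
import re

# Pattern modelled on the single log entry quoted in this message.
# Assumption: other drivers/kernels print the message in the same shape.
CI_TAINT_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+"                     # kernel timestamp
    r"(?P<driver>\S+)\s+(?P<pci>[0-9a-fA-F:.]+):\s+"  # driver and PCI address
    r".*?CI tainted:(?P<taint>0x[0-9a-fA-F]+)"        # value printed after "CI tainted:"
    r"\s+by\s+(?P<caller>\S+)"                        # symbol+offset that reported it
)


def find_ci_taints(log_text):
    """Return one dict per log line that contains a 'CI tainted' message."""
    hits = []
    for line in log_text.splitlines():
        match = CI_TAINT_RE.search(line)
        if match:
            hits.append(match.groupdict())
    return hits


# The entry from the failed boot, joined onto one line.
sample = (
    "Apr 23 08:44:59 debian kernel: [    3.876169] i915 0000:00:02.0: "
    "[drm:add_taint_for_CI [i915]] CI tainted:0x9 by "
    "intel_gt_init+0xb6/0x2e0 [i915]"
)

hits = find_ci_taints(sample)
for hit in hits:
    print(hit["ts"], hit["driver"], hit["pci"], hit["taint"], hit["caller"])
```

To use it on a real system, the kernel messages of the previous boot
(for example the output of "journalctl -k -b -1", assuming a
persistent journal) could be passed to find_ci_taints() in place of
the sample string.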

It looks like the add_taint_for_CI function of the Intel i915 GPU
driver halts the CPU. That would explain the system hang, and just
as on your AMD system on Qubes, the backlight stays on. I recover
by hitting the reset button and rebooting into a 5.16 or older Dom0
kernel on Xen, or into the 5.17 kernel on the bare metal.

It looks like my older Intel hardware is doing something the Linux
CI people don't like. Perhaps that is also happening on your AMD
system.

I spent some time looking through the Linux git logs for changes
made during the development of Linux 5.17 that might be causing
this, and I came up with the last two commits to arch/x86/xen/vga.c
in the Linux source, which are in 5.17 but not in 5.16. Those commits
involve the initialization of the Dom0 console. I tried a build of
5.17 without them, but it did not help.

By building and testing the kernel before and after it, I found that
the problem appeared with a big drm-next merge (committed to Linux
on Jan 10, 2022). The merge contains over a thousand commits, so it
is not easy to figure out which one is causing this on my system.
The link to the merge:

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=8d0749b4f83bf4768ceae45ee6a79e6e7eddfc2a
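
As a rough illustration of why narrowing this down is tedious but
feasible, here is a minimal sketch in Python of the binary search
that a bisection performs. Everything in it is hypothetical: the
commit list is just a range of integers, 1200 stands in for "over a
thousand", and is_bad() stands in for building a kernel at that
commit and checking whether it hangs as Dom0. A real merge is not a
simple linear list of commits, so this simplifies what a tool such
as git bisect actually does.

```python
def first_bad(commits, is_bad):
    """Binary-search for the first bad commit.

    Assumes commits are ordered oldest to newest, the newest one is bad,
    and every commit after the first bad one is bad as well.
    Returns (index of first bad commit, number of builds tested).
    """
    lo, hi = 0, len(commits) - 1  # invariant: first bad index is in [lo, hi]
    tests = 0
    while lo < hi:
        mid = (lo + hi) // 2
        tests += 1  # one kernel build and test boot per step
        if is_bad(commits[mid]):
            hi = mid        # culprit is at mid or earlier
        else:
            lo = mid + 1    # culprit is after mid
    return lo, tests


# Hypothetical data: 1200 commits, with commit 777 introducing the hang.
commits = list(range(1200))
culprit = 777
index, tests = first_bad(commits, lambda commit: commit >= culprit)
print(f"first bad commit index: {index}, builds tested: {tests}")
```

Under these assumptions the culprit is found in about eleven build
and test cycles rather than a thousand, though each cycle still
means a full kernel build and a test boot on Xen.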

Next I plan to test a version of 5.17 that prints some debugging
information from the intel_gt_init function where my system
hangs with 5.17 as a Dom0 on Xen.

Are you seeing any add_taint_for_CI messages in the kernel logs on
your AMD hardware?


