[Pkg-xen-devel] Dom0 on Sid not working

Chuck Zmudzinski brchuckz at netscape.net
Fri May 20 03:57:44 BST 2022


On 5/17/22 11:24 AM, Chuck Zmudzinski wrote:
> On 4/23/22 6:03 PM, Marek Marczykowski-Górecki wrote:
>> On Sat, Apr 23, 2022 at 12:42:36PM -0400, Chuck Zmudzinski wrote:
>>> Hello,
>>>
>>> After updating Sid this morning, which included an upgrade to Linux
>>> kernel version 5.17.x from 5.16.x, the system will not boot into Dom0
>>> on Xen.
>>>
>>> Package versions:
>>>
>>> Xen hypervisor: xen-hypervisor-4.16-amd64, Debian version 16.0+51-g0941d6cb-1+b1
>>> Linux kernel in Dom0: linux-image-5.17.0-1-amd64, Debian version 5.17.3-1
>>>
>>> The system boots into Linux 5.17.3-1 fine on bare metal, but not as
>>> Dom0 on the Xen 4.16 hypervisor, Debian version 16.0+51-g0941d6cb-1+b1.
>>>
>>> The system never shows boot messages or a login prompt; the monitor
>>> stays powered on but shows no text or image at all. It seems to just
>>> hang indefinitely. Downgrading to the latest 5.16 kernel on Xen 4.16,
>>> Debian version 16.0+51-g0941d6cb-1+b1, works normally. Has anybody
>>> else seen this problem?
>> We have a similar issue in Qubes OS:
>> https://github.com/QubesOS/qubes-issues/issues/7462. Are you running
>> this on AMD?
>>
>
> No, mine is an Intel Haswell Core i5 (4th gen), but Diederik
> reported that his Broadwell Intel Xeons (5th gen) work OK with 5.17
> as Dom0 on Xen 4.16.
>
> I was able to see the cause of the failure on my Intel Haswell
> processor in the systemd journal by examining the logs of the
> failed boot after a subsequent successful boot:
>
> Apr 23 08:44:59 debian kernel: [    3.876169] i915 0000:00:02.0: [drm:add_taint_for_CI [i915]] CI tainted:0x9 by intel_gt_init+0xb6/0x2e0 [i915]
>
> After that, kern.log shows entries for just under a second, and
> then the log stops and the system hangs.
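A minimal Python sketch for pulling lines like the one above out of a saved log, for example the text output of `journalctl -k -b -1` (kernel messages from the previous boot) or a copy of kern.log. The regular expression is an assumption fitted to this one i915 message, not a general kernel-log parser, and the names `CiTaint`, `parse_ci_taint` and `find_ci_taints` are hypothetical. It splits the caller written in the kernel's symbol+offset/size notation, so intel_gt_init+0xb6/0x2e0 reads as offset 0xb6 (182) bytes into a function that is 0x2e0 (736) bytes long.

```python
import re
from typing import List, NamedTuple, Optional

# Fitted to the i915 "CI tainted" message quoted above (an assumption, not a
# general kernel-log grammar). The bracketed uptime is optional because some
# log formats include it and others do not.
CI_TAINT_RE = re.compile(
    r"(?:\[\s*(?P<uptime>\d+\.\d+)\]\s+)?"                # [    3.876169]
    r"\S+\s+(?P<device>[0-9a-f:.]+):\s+"                  # i915 0000:00:02.0:
    r"\[drm:\w+ \[\w+\]\]\s+"                             # [drm:add_taint_for_CI [i915]]
    r"CI tainted:(?P<taint>0x[0-9a-f]+) by "              # CI tainted:0x9 by
    r"(?P<caller>\w+)\+(?P<offset>0x[0-9a-f]+)/(?P<size>0x[0-9a-f]+)"  # sym+off/size
)


class CiTaint(NamedTuple):
    uptime_s: Optional[float]  # seconds since boot, if the line carries it
    device: str                # PCI address of the GPU, e.g. 0000:00:02.0
    taint: int                 # value printed after "CI tainted:"
    caller: str                # function that requested the taint
    offset: int                # byte offset into caller
    size: int                  # total size of caller in bytes


def parse_ci_taint(line: str) -> Optional[CiTaint]:
    """Return the fields of an i915 'CI tainted' line, or None if it is not one."""
    m = CI_TAINT_RE.search(line)
    if m is None:
        return None
    return CiTaint(
        uptime_s=float(m["uptime"]) if m["uptime"] else None,
        device=m["device"],
        taint=int(m["taint"], 16),
        caller=m["caller"],
        offset=int(m["offset"], 16),
        size=int(m["size"], 16),
    )


def find_ci_taints(log_text: str) -> List[CiTaint]:
    """Scan a whole saved log and collect every matching line."""
    return [ev for ev in map(parse_ci_taint, log_text.splitlines()) if ev is not None]


SAMPLE = ("Apr 23 08:44:59 debian kernel: [    3.876169] i915 0000:00:02.0: "
          "[drm:add_taint_for_CI [i915]] CI tainted:0x9 by "
          "intel_gt_init+0xb6/0x2e0 [i915]")

if __name__ == "__main__":
    ev = parse_ci_taint(SAMPLE)
    print(ev.caller, hex(ev.offset), hex(ev.size))  # prints: intel_gt_init 0xb6 0x2e0
```

The sketch only locates and splits the message; it makes no claim about what the taint value means.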
>
> It looks like the add_taint_for_CI function in the Intel i915 GPU
> driver halts the CPU. That explains the system hang, and just
> as on your AMD system on Qubes, the backlight stays on. I recover
> by pressing the reset button and rebooting into a 5.16 or earlier
> Dom0 kernel on Xen, or into the 5.17 kernel on bare metal.
>
> It looks like my older Intel hardware is doing something the Linux
> CI people don't like. Perhaps that is also happening on your AMD
> system.
>
> I spent some time reading the Linux git logs for changes made during
> the development of Linux 5.17 that might be causing this, and I
> identified the last two commits to arch/x86/xen/vga.c in the Linux
> source, which are in 5.17 but not in 5.16, as candidates. Those commits
> involve the initialization of the Dom0 console. I tried a build of 5.17
> without those two commits, but it did not help.
>
> By building and testing the kernel before and after it, I found that
> the problem first appeared with a big drm-next merge (committed to
> Linux on Jan 10, 2022), but the merge contains over a thousand commits,
> so it is not easy to figure out which one is causing this on my system.
> The link to the merge:
>
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=8d0749b4f83bf4768ceae45ee6a79e6e7eddfc2a 
>
>
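Building and testing before and after a suspect change, as described above, is the idea behind `git bisect`, which can also continue the search among the commits inside the merge. A minimal sketch of that search over a linear list of commits, assuming a hypothetical `boots_ok` callback that stands in for "build the kernel at this commit and try booting it as Dom0", that the commit just before the list is good and the last one is bad, and that once broken it stays broken. Real merge history is a graph, which `git bisect` handles itself; the sketch only shows why about a thousand commits need roughly ten build-and-boot cycles, since each test halves the remaining range.

```python
import math
from typing import Callable, Sequence, Tuple


def first_bad(commits: Sequence[str], boots_ok: Callable[[str], bool]) -> Tuple[str, int]:
    """Binary-search for the first commit where boots_ok() is False.

    Assumes commits are in history order, the commit before commits[0] is
    good, commits[-1] is bad, and every commit after the first bad one is
    also bad. Returns the first bad commit and the number of test builds.
    """
    lo, hi = 0, len(commits) - 1  # the first bad index is always in [lo, hi]
    builds = 0
    while lo < hi:
        mid = (lo + hi) // 2
        builds += 1               # one kernel build plus one Dom0 boot test
        if boots_ok(commits[mid]):
            lo = mid + 1          # mid is good, so the first bad one is later
        else:
            hi = mid              # mid is bad, and it may be the first bad one
    return commits[lo], builds


if __name__ == "__main__":
    # Hypothetical stand-in for the merge: 1000 commits, index 637 breaks Dom0.
    commits = [f"commit-{i:04d}" for i in range(1000)]
    culprit = 637
    found, builds = first_bad(commits, lambda c: int(c.split("-")[1]) < culprit)
    print(found, builds, math.ceil(math.log2(len(commits))))  # prints: commit-0637 10 10
```

With git itself, `git bisect start <bad> <good>` (here the merge commit and its first parent) followed by `git bisect good` or `git bisect bad` after each test boot does this bookkeeping.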
> Next I plan to test a version of 5.17 that prints some debugging
> information from the intel_gt_init function where my system
> hangs with 5.17 as a Dom0 on Xen.
>
> Are you seeing any taint_for_CI calls in the kernel logs on your AMD
> hardware?

I found a fix for this on my Intel box with a Haswell CPU/GPU. The
upstream kernel developers are aware of this issue with Intel and
Xen, but not yet of the specific issue I am seeing with a Haswell
CPU:

https://lore.kernel.org/lkml/788dc391-6a20-5c03-9613-9f22fcc125f1@netscape.net/

I am not sure whether this fixes the issue on AMD, but the problem
of false negatives from pat_enabled() on Xen discussed in that
kernel.org thread might also affect AMD drivers.


