Bug#785557: perl: FTBFS on i386 and amd64: itimer problems on buildds?
Apollon Oikonomopoulos
apoikos at debian.org
Wed Jun 10 14:52:30 UTC 2015
Control: reassign -1 qemu-system-x86
Control: severity -1 important
Control: found -1 qemu/1:2.1+dfsg-12
Control: retitle -1 QEMU: causes vCPU steal time overflow on live migration
On 22:25 Fri 05 Jun , Dominic Hargreaves wrote:
> Great to hear that you found the underlying cause[1] of this! I note
> the workaround:
>
> -cpu qemu64,-kvm_steal_time
>
> which may be applicable to the Debian hosts?
>
> [1] <https://lists.gnu.org/archive/html/qemu-devel/2015-06/msg01295.html>
After investigating a bit more, it looks like the issue comes from an overflow
in the kernel's accumulate_steal_time() (arch/x86/kvm/x86.c:2023):
static void accumulate_steal_time(struct kvm_vcpu *vcpu)
{
u64 delta;
if (!(vcpu->arch.st.msr_val & KVM_MSR_ENABLED))
return;
delta = current->sched_info.run_delay - vcpu->arch.st.last_steal;
Using systemtap with the attached script to trace KVM execution on the
receiving host kernel, we can see that shortly before marking the vCPUs
as runnable on a migrated KVM instance with 2 vCPUs, the following
happens (** marks lines of interest):
** 0 qemu-system-x86(18446): kvm_arch_vcpu_load: run_delay=7856949 ns steal=7856949 ns
0 qemu-system-x86(18446): -> kvm_arch_vcpu_load
0 vhost-18446(18447): -> kvm_arch_vcpu_should_kick
5 vhost-18446(18447): <- kvm_arch_vcpu_should_kick
23 qemu-system-x86(18446): <- kvm_arch_vcpu_load
0 qemu-system-x86(18446): -> kvm_arch_vcpu_ioctl
2 qemu-system-x86(18446): <- kvm_arch_vcpu_ioctl
0 qemu-system-x86(18446): -> kvm_arch_vcpu_put
2 qemu-system-x86(18446): -> kvm_put_guest_fpu
3 qemu-system-x86(18446): <- kvm_put_guest_fpu
4 qemu-system-x86(18446): <- kvm_arch_vcpu_put
** 0 qemu-system-x86(18446): kvm_arch_vcpu_load: run_delay=7856949 ns steal=7856949 ns
0 qemu-system-x86(18446): -> kvm_arch_vcpu_load
1 qemu-system-x86(18446): <- kvm_arch_vcpu_load
0 qemu-system-x86(18446): -> kvm_arch_vcpu_ioctl
1 qemu-system-x86(18446): <- kvm_arch_vcpu_ioctl
0 qemu-system-x86(18446): -> kvm_arch_vcpu_put
1 qemu-system-x86(18446): -> kvm_put_guest_fpu
2 qemu-system-x86(18446): <- kvm_put_guest_fpu
3 qemu-system-x86(18446): <- kvm_arch_vcpu_put
** 0 qemu-system-x86(18449): kvm_arch_vcpu_load: run_delay=40304 ns steal=7856949 ns
0 qemu-system-x86(18449): -> kvm_arch_vcpu_load
** 7 qemu-system-x86(18449): delta: 18446744073701734971 ns, steal=7856949 ns, run_delay=40304 ns
10 qemu-system-x86(18449): <- kvm_arch_vcpu_load
** 0 qemu-system-x86(18449): -> kvm_arch_vcpu_ioctl_run
4 qemu-system-x86(18449): -> kvm_arch_vcpu_runnable
6 qemu-system-x86(18449): <- kvm_arch_vcpu_runnable
...
0 qemu-system-x86(18448): kvm_arch_vcpu_load: run_delay=0 ns steal=7856949 ns
0 qemu-system-x86(18448): -> kvm_arch_vcpu_load
** 34 qemu-system-x86(18448): delta: 18446744073701694667 ns, steal=7856949 ns, run_delay=0 ns
40 qemu-system-x86(18448): <- kvm_arch_vcpu_load
** 0 qemu-system-x86(18448): -> kvm_arch_vcpu_ioctl_run
5 qemu-system-x86(18448): -> kvm_arch_vcpu_runnable
Now, what's really interesting is that current->sched_info.run_delay
gets reset because the tasks (threads) using the vCPUs change, and thus
have a different current->sched_info: it looks like task 18446 created
the two vCPUs, and then they were handed over to 18448 and 18449
respectively. This is also verified by the fact that during the
overflow, both vCPUs have the old steal time of the last vcpu_load of
task 18446. However, according to Documentation/virtual/kvm/api.txt:
- vcpu ioctls: These query and set attributes that control the operation
of a single virtual cpu.
Only run vcpu ioctls from the same thread that was used to create the vcpu.
So it seems qemu is doing something that it shouldn't: calling vCPU
ioctls from a thread that didn't create the vCPU. Note that this
probably happens on every QEMU startup, but is not visible because the
guest kernel zeroes out the steal time on boot.
There are at least two ways to mitigate the issue without a kernel
recompilation:
- The first one is to disable the steal time propagation from host to
guest by invoking qemu with `-cpu qemu64,-kvm_steal_time`. This will
short-circuit accumulate_steal_time() due to (vcpu->arch.st.msr_val &
KVM_MSR_ENABLED) and will completely disable steal time reporting in
the guest, which may not be desired if people rely on it to detect
CPU congestion.
- The other one is using the following systemtap script to prevent the
steal time counter from overflowing by dropping the problematic
samples (WARNING: systemtap guru mode required, use at your own
risk):
probe module("kvm").statement("*@arch/x86/kvm/x86.c:2024") {
if (@defined($delta) && $delta < 0) {
printk(4, "kvm: steal time delta < 0, dropping")
$delta = 0
}
}
Note that not all *guests* handle this condition in the same way: 3.2
guests still get the overflow in /proc/stat, but their scheduler
continues to work as expected. 3.16 guests OTOH go nuts once steal time
overflows and stop accumulating system & user time, while entering an
erratic state where steal time in /proc/stat is *decreasing* on every
clock tick.
Regards,
Apollon
-------------- next part --------------
global steal
global run_delay
probe module("kvm").function("*@arch/x86/kvm/x86.c").call {
printf ("%s -> %s\n", thread_indent(1), ppfunc())
}
probe module("kvm").function("*@arch/x86/kvm/x86.c").return {
printf ("%s <- %s\n", thread_indent(-1), ppfunc())
}
probe module("kvm").function("kvm_arch_vcpu_load") {
task = task_current()
run_delay[tid()] = @cast(task, "task_struct", "kernel<linux/sched.h>")->sched_info->run_delay
steal[tid()] = $vcpu->arch->st->last_steal
printf ("%s kvm_arch_vcpu_load: run_delay=%lu ns steal=%lu ns\n", thread_indent(0), run_delay[tid()], steal[tid()])
}
probe module("kvm").statement("*@arch/x86/kvm/x86.c:2024") {
if (@defined($delta) && $delta != 0) {
printf ("%s delta: %lu ns, steal=%lu ns, run_delay=%lu ns\n", thread_indent(0), $delta, steal[tid()], run_delay[tid()])
}
}
More information about the Perl-maintainers
mailing list