Bug#785557: perl: FTBFS on i386 and amd64: itimer problems on buildds?

Wed Jun 10 14:52:30 UTC 2015

Control: reassign -1 qemu-system-x86
Control: severity -1 important
Control: found -1 qemu/1:2.1+dfsg-12
Control: retitle -1 QEMU: causes vCPU steal time overflow on live migration

On 22:25 Fri 05 Jun     , Dominic Hargreaves wrote:
> Great to hear that you found the underlying cause[1] of this! I note
> the workaround:
> 
> -cpu qemu64,-kvm_steal_time
> 
> which may be applicable to the Debian hosts?

> 
> [1] <https://lists.gnu.org/archive/html/qemu-devel/2015-06/msg01295.html>

After investigating a bit more, it looks like the issue comes from an overflow
in the kernel's accumulate_steal_time() (arch/x86/kvm/x86.c:2023):

  static void accumulate_steal_time(struct kvm_vcpu *vcpu)
  {
          u64 delta;

          if (!(vcpu->arch.st.msr_val & KVM_MSR_ENABLED))
                  return;

          delta = current->sched_info.run_delay - vcpu->arch.st.last_steal;

Using systemtap with the attached script to trace KVM execution on the 
receiving host kernel, we can see that shortly before marking the vCPUs 
as runnable on a migrated KVM instance with 2 vCPUs, the following 
happens (** marks lines of interest):

 **  0 qemu-system-x86(18446): kvm_arch_vcpu_load: run_delay=7856949 ns steal=7856949 ns
     0 qemu-system-x86(18446): -> kvm_arch_vcpu_load
     0 vhost-18446(18447): -> kvm_arch_vcpu_should_kick
     5 vhost-18446(18447): <- kvm_arch_vcpu_should_kick
    23 qemu-system-x86(18446): <- kvm_arch_vcpu_load
     0 qemu-system-x86(18446): -> kvm_arch_vcpu_ioctl
     2 qemu-system-x86(18446): <- kvm_arch_vcpu_ioctl
     0 qemu-system-x86(18446): -> kvm_arch_vcpu_put
     2 qemu-system-x86(18446):  -> kvm_put_guest_fpu
     3 qemu-system-x86(18446):  <- kvm_put_guest_fpu
     4 qemu-system-x86(18446): <- kvm_arch_vcpu_put
 **  0 qemu-system-x86(18446): kvm_arch_vcpu_load: run_delay=7856949 ns steal=7856949 ns
     0 qemu-system-x86(18446): -> kvm_arch_vcpu_load
     1 qemu-system-x86(18446): <- kvm_arch_vcpu_load
     0 qemu-system-x86(18446): -> kvm_arch_vcpu_ioctl
     1 qemu-system-x86(18446): <- kvm_arch_vcpu_ioctl
     0 qemu-system-x86(18446): -> kvm_arch_vcpu_put
     1 qemu-system-x86(18446):  -> kvm_put_guest_fpu
     2 qemu-system-x86(18446):  <- kvm_put_guest_fpu
     3 qemu-system-x86(18446): <- kvm_arch_vcpu_put
 **  0 qemu-system-x86(18449): kvm_arch_vcpu_load: run_delay=40304 ns steal=7856949 ns
     0 qemu-system-x86(18449): -> kvm_arch_vcpu_load
 **  7 qemu-system-x86(18449): delta: 18446744073701734971 ns, steal=7856949 ns, run_delay=40304 ns
    10 qemu-system-x86(18449): <- kvm_arch_vcpu_load
 **  0 qemu-system-x86(18449): -> kvm_arch_vcpu_ioctl_run
     4 qemu-system-x86(18449):  -> kvm_arch_vcpu_runnable
     6 qemu-system-x86(18449):  <- kvm_arch_vcpu_runnable
     ...
     0 qemu-system-x86(18448): kvm_arch_vcpu_load: run_delay=0 ns steal=7856949 ns
     0 qemu-system-x86(18448): -> kvm_arch_vcpu_load
 ** 34 qemu-system-x86(18448): delta: 18446744073701694667 ns, steal=7856949 ns, run_delay=0 ns
    40 qemu-system-x86(18448): <- kvm_arch_vcpu_load
 **  0 qemu-system-x86(18448): -> kvm_arch_vcpu_ioctl_run
     5 qemu-system-x86(18448):  -> kvm_arch_vcpu_runnable

Now, what's really interesting is that current->sched_info.run_delay 
gets reset because the tasks (threads) using the vCPUs change, and thus 
have a different current->sched_info: it looks like task 18446 created 
the two vCPUs, and then they were handed over to 18448 and 18449 
respectively. This is also verified by the fact that during the 
overflow, both vCPUs have the old steal time of the last vcpu_load of 
task 18446. However, according to Documentation/virtual/kvm/api.txt:

 - vcpu ioctls: These query and set attributes that control the operation
   of a single virtual cpu.

   Only run vcpu ioctls from the same thread that was used to create the vcpu.

So it seems qemu is doing something that it shouldn't: calling vCPU 
ioctls from a thread that didn't create the vCPU. Note that this 
probably happens on every QEMU startup, but is not visible because the 
guest kernel zeroes out the steal time on boot.

There are at least two ways to mitigate the issue without a kernel
recompilation:

 - The first one is to disable the steal time propagation from host to 
   guest by invoking qemu with `-cpu qemu64,-kvm_steal_time`. This will 
   short-circuit accumulate_steal_time() due to (vcpu->arch.st.msr_val & 
   KVM_MSR_ENABLED) and will completely disable steal time reporting in 
   the guest, which may not be desired if people rely on it to detect 
   CPU congestion.

 - The other one is using the following systemtap script to prevent the 
   steal time counter from overflowing by dropping the problematic 
   samples (WARNING: systemtap guru mode required, use at your own 
   risk):

      probe module("kvm").statement("*@arch/x86/kvm/x86.c:2024") {
        if (@defined($delta) && $delta < 0) {
          printk(4, "kvm: steal time delta < 0, dropping")
          $delta = 0
        }
      }

Note that not all *guests* handle this condition in the same way: 3.2 
guests still get the overflow in /proc/stat, but their scheduler 
continues to work as expected. 3.16 guests OTOH go nuts once steal time 
overflows and stop accumulating system & user time, while entering an 
erratic state where steal time in /proc/stat is *decreasing* on every 
clock tick.

Regards,
Apollon
-------------- next part --------------
global steal
global run_delay

probe module("kvm").function("*@arch/x86/kvm/x86.c").call {
	printf ("%s -> %s\n", thread_indent(1), ppfunc())
}

probe module("kvm").function("*@arch/x86/kvm/x86.c").return {
	printf ("%s <- %s\n", thread_indent(-1), ppfunc())
}

probe module("kvm").function("kvm_arch_vcpu_load") {
	task = task_current()
        run_delay[tid()] = @cast(task, "task_struct", "kernel<linux/sched.h>")->sched_info->run_delay
	steal[tid()] = $vcpu->arch->st->last_steal
	printf ("%s kvm_arch_vcpu_load: run_delay=%lu ns steal=%lu ns\n", thread_indent(0), run_delay[tid()], steal[tid()])
}

probe module("kvm").statement("*@arch/x86/kvm/x86.c:2024") {
	if (@defined($delta) && $delta != 0) {
		printf ("%s delta: %lu ns, steal=%lu ns, run_delay=%lu ns\n", thread_indent(0), $delta, steal[tid()], run_delay[tid()])
	}
}