Message ID | 1467096859-3143-1-git-send-email-wanpeng.li@hotmail.com (mailing list archive)
State      | New, archived
On Tue, 28 Jun 2016 14:54:19 +0800
Wanpeng Li <kernellwp@gmail.com> wrote:

> INFO: rcu_sched detected stalls on CPUs/tasks:
>  1-...: (11800 GPs behind) idle=45d/140000000000000/0 softirq=0/0 fqs=21663
>  (detected by 0, t=65016 jiffies, g=11500, c=11499, q=719)
> Task dump for CPU 1:
> qemu-system-x86 R  running task        0  3529   3525 0x00080808
>  ffff8802021791a0 ffff880212895040 0000000000000001 00007f1c2c00db40
>  ffff8801dd20fcd3 ffffc90002b98000 ffff8801dd20fc88 ffff8801dd20fcf8
>  0000000000000286 ffff8801dd2ac538 ffff8801dd20fcc0 ffffffffc06949c9
> Call Trace:
>  ? kvm_write_guest_cached+0xb9/0x160 [kvm]
>  ? __delay+0xf/0x20
>  ? wait_lapic_expire+0x14a/0x200 [kvm]
>  ? kvm_arch_vcpu_ioctl_run+0xcbe/0x1b00 [kvm]
>  ? kvm_arch_vcpu_ioctl_run+0xe34/0x1b00 [kvm]
>  ? kvm_vcpu_ioctl+0x2d3/0x7c0 [kvm]
>  ? __fget+0x5/0x210
>  ? do_vfs_ioctl+0x96/0x6a0
>  ? __fget_light+0x2a/0x90
>  ? SyS_ioctl+0x79/0x90
>  ? do_syscall_64+0x7c/0x1e0
>  ? entry_SYSCALL64_slow_path+0x25/0x25
>
> This can be reproduced readily by running a full dynticks guest (hrtimers
> are heavily used in such a guest) with lapic_timer_advance disabled.
>
> If we fail to program the hardware preemption timer, we fall back to the
> hrtimer-based method; however, in this scenario a previously programmed
> preemption timer is not cancelled, so one hardware preemption timer and
> one hrtimer-emulated tsc deadline timer end up running simultaneously.
> The target guest deadline tsc can then be earlier than the current guest
> tsc, the computation in vmx_set_hv_timer underflows, delta_tsc is set to
> a huge value, and the host soft-locks up as shown above.
>
> This patch fixes it by cancelling the previously programmed preemption
> timer, if there is one, whenever programming the new preemption timer
> fails, before falling back to the hrtimer-based method.

Hi, Wanpeng, thanks for the patch.

> Cc: Paolo Bonzini <pbonzini@redhat.com>
> Cc: Radim Krčmář <rkrcmar@redhat.com>
> Cc: Yunhong Jiang <yunhong.jiang@intel.com>
> Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
> ---
>  arch/x86/kvm/lapic.c | 15 +++++++++++----
>  1 file changed, 11 insertions(+), 4 deletions(-)
>
> diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
> index fdc05ae..b15e32a 100644
> --- a/arch/x86/kvm/lapic.c
> +++ b/arch/x86/kvm/lapic.c
> @@ -1454,11 +1454,18 @@ static void start_apic_timer(struct kvm_lapic *apic)
>          /* lapic timer in tsc deadline mode */
>          u64 tscdeadline = apic->lapic_timer.tscdeadline;
>
> -        if (kvm_x86_ops->set_hv_timer &&
> -            !kvm_x86_ops->set_hv_timer(apic->vcpu, tscdeadline)) {
> -            apic->lapic_timer.hv_timer_in_use = true;
> -            trace_kvm_hv_timer_state(apic->vcpu->vcpu_id,
> +        if (kvm_x86_ops->set_hv_timer) {
> +            if (kvm_x86_ops->set_hv_timer(apic->vcpu, tscdeadline)) {

Would it be better that, if set_hv_timer fails, we clear the vmx timer (i.e.
the VMCS field) before returning the failure? I'm not sure whether it makes
sense to clear the previous setup when a new setup fails, although it seems
OK to me, since we have to cancel the hv_timer anyway. Your idea?

Thanks
--jyh

> +                if (apic->lapic_timer.hv_timer_in_use) {
> +                    kvm_x86_ops->cancel_hv_timer(apic->vcpu);
> +                    apic->lapic_timer.hv_timer_in_use = false;
> +                }
> +                start_sw_tscdeadline(apic);
> +            } else {
> +                apic->lapic_timer.hv_timer_in_use = true;
> +                trace_kvm_hv_timer_state(apic->vcpu->vcpu_id,
>                  apic->lapic_timer.hv_timer_in_use);
> +            }
>          } else
>              start_sw_tscdeadline(apic);
>      }
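To make jyh's alternative concrete, here is a minimal sketch of a set_hv_timer-style
callback that cleans up after itself when programming fails, so callers such as
start_apic_timer() would never see a stale hardware timer. The helpers
hw_cancel_preemption_timer() and hw_arm_preemption_timer() are hypothetical
placeholders for the VMX-specific code, not functions in vmx.c; only
kvm_read_l1_tsc() and rdtsc() are real kernel interfaces. This is a sketch of the
idea under discussion, not the actual implementation.

/*
 * Sketch only (hypothetical helpers): a vendor callback that never leaves a
 * previously armed preemption timer running when the new deadline cannot be
 * programmed.
 */
static int sketch_set_hv_timer(struct kvm_vcpu *vcpu, u64 guest_deadline_tsc)
{
        u64 guest_tsc = kvm_read_l1_tsc(vcpu, rdtsc());

        if (guest_deadline_tsc <= guest_tsc) {
                /* Deadline already passed: clear any previously armed
                 * hardware timer (hypothetical helper) before failing. */
                hw_cancel_preemption_timer(vcpu);
                return -ERANGE;
        }

        /* Arm the preemption timer with the remaining ticks (hypothetical). */
        hw_arm_preemption_timer(vcpu, guest_deadline_tsc - guest_tsc);
        return 0;
}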
> > diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
> > index fdc05ae..b15e32a 100644
> > --- a/arch/x86/kvm/lapic.c
> > +++ b/arch/x86/kvm/lapic.c
> > @@ -1454,11 +1454,18 @@ static void start_apic_timer(struct kvm_lapic *apic)
> >          /* lapic timer in tsc deadline mode */
> >          u64 tscdeadline = apic->lapic_timer.tscdeadline;
> >
> > -        if (kvm_x86_ops->set_hv_timer &&
> > -            !kvm_x86_ops->set_hv_timer(apic->vcpu, tscdeadline)) {
> > -            apic->lapic_timer.hv_timer_in_use = true;
> > -            trace_kvm_hv_timer_state(apic->vcpu->vcpu_id,
> > +        if (kvm_x86_ops->set_hv_timer) {
> > +            if (kvm_x86_ops->set_hv_timer(apic->vcpu, tscdeadline)) {
>
> Would it be better that, if set_hv_timer fails, we clear the vmx timer (i.e.
> the VMCS field) before returning the failure? I'm not sure whether it makes
> sense to clear the previous setup when a new setup fails, although it seems
> OK to me, since we have to cancel the hv_timer anyway.

Good question. I think we should abstract the set_hv_timer and cancel_hv_timer
calls a little, for example with a new function start_hv_tscdeadline. It's also
simpler to avoid the race window by disabling interrupts around the calls to
set_hv_timer and hrtimer_cancel. I'll see what I can come up with.

Paolo
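For reference, a rough sketch of the start_hv_tscdeadline() abstraction Paolo
mentions, written against the kvm_x86_ops callbacks and lapic_timer fields
quoted above. The function name comes from his mail, but the body below is only
a guess at what such a helper could look like, not the patch he later posted.

/*
 * Sketch: one helper that either programs the hardware preemption timer or
 * reports failure, cancelling any previously armed hardware timer on the
 * failure path so the caller can simply fall back to the hrtimer.
 */
static bool start_hv_tscdeadline(struct kvm_lapic *apic)
{
        u64 tscdeadline = apic->lapic_timer.tscdeadline;

        if (!kvm_x86_ops->set_hv_timer)
                return false;

        if (kvm_x86_ops->set_hv_timer(apic->vcpu, tscdeadline)) {
                /* Programming failed: drop whatever an earlier call armed. */
                if (apic->lapic_timer.hv_timer_in_use) {
                        kvm_x86_ops->cancel_hv_timer(apic->vcpu);
                        apic->lapic_timer.hv_timer_in_use = false;
                }
                return false;
        }

        apic->lapic_timer.hv_timer_in_use = true;
        trace_kvm_hv_timer_state(apic->vcpu->vcpu_id,
                                 apic->lapic_timer.hv_timer_in_use);
        return true;
}

start_apic_timer() could then reduce to something like
"if (!start_hv_tscdeadline(apic)) start_sw_tscdeadline(apic);", keeping the
fallback logic in a single branch.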
On Tue, 28 Jun 2016 13:56:38 -0400 (EDT)
Paolo Bonzini <pbonzini@redhat.com> wrote:

> > > diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
> > > index fdc05ae..b15e32a 100644
> > > --- a/arch/x86/kvm/lapic.c
> > > +++ b/arch/x86/kvm/lapic.c
> > > @@ -1454,11 +1454,18 @@ static void start_apic_timer(struct kvm_lapic *apic)
> > >          /* lapic timer in tsc deadline mode */
> > >          u64 tscdeadline = apic->lapic_timer.tscdeadline;
> > >
> > > -        if (kvm_x86_ops->set_hv_timer &&
> > > -            !kvm_x86_ops->set_hv_timer(apic->vcpu, tscdeadline)) {
> > > -            apic->lapic_timer.hv_timer_in_use = true;
> > > -            trace_kvm_hv_timer_state(apic->vcpu->vcpu_id,
> > > +        if (kvm_x86_ops->set_hv_timer) {
> > > +            if (kvm_x86_ops->set_hv_timer(apic->vcpu, tscdeadline)) {
> >
> > Would it be better that, if set_hv_timer fails, we clear the vmx timer
> > (i.e. the VMCS field) before returning the failure? I'm not sure whether
> > it makes sense to clear the previous setup when a new setup fails,
> > although it seems OK to me, since we have to cancel the hv_timer anyway.
>
> Good question. I think we should abstract the set_hv_timer and
> cancel_hv_timer calls a little, for example with a new function
> start_hv_tscdeadline. It's also simpler to avoid the race window by
> disabling interrupts around the calls to set_hv_timer and hrtimer_cancel.
> I'll see what I can come up with.

Paolo, thanks for the reply. Which race window are you talking about? Is it
the one in kvm_lapic_switch_to_hv_timer()? If yes, I hope we will not forget
it once the lapic timer is no longer pinned in the future.

Thanks
--jyh

>
> Paolo
On 28/06/2016 20:34, yunhong jiang wrote:
> Paolo, thanks for the reply. Which race window are you talking about? Is it
> the one in kvm_lapic_switch_to_hv_timer()? If yes, I hope we will not forget
> it once the lapic timer is no longer pinned in the future.

Yes, it's that one. This is a good point against using local_irq_save/restore.

Paolo
diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index fdc05ae..b15e32a 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -1454,11 +1454,18 @@ static void start_apic_timer(struct kvm_lapic *apic)
         /* lapic timer in tsc deadline mode */
         u64 tscdeadline = apic->lapic_timer.tscdeadline;

-        if (kvm_x86_ops->set_hv_timer &&
-            !kvm_x86_ops->set_hv_timer(apic->vcpu, tscdeadline)) {
-            apic->lapic_timer.hv_timer_in_use = true;
-            trace_kvm_hv_timer_state(apic->vcpu->vcpu_id,
+        if (kvm_x86_ops->set_hv_timer) {
+            if (kvm_x86_ops->set_hv_timer(apic->vcpu, tscdeadline)) {
+                if (apic->lapic_timer.hv_timer_in_use) {
+                    kvm_x86_ops->cancel_hv_timer(apic->vcpu);
+                    apic->lapic_timer.hv_timer_in_use = false;
+                }
+                start_sw_tscdeadline(apic);
+            } else {
+                apic->lapic_timer.hv_timer_in_use = true;
+                trace_kvm_hv_timer_state(apic->vcpu->vcpu_id,
                 apic->lapic_timer.hv_timer_in_use);
+            }
         } else
             start_sw_tscdeadline(apic);
     }
INFO: rcu_sched detected stalls on CPUs/tasks:
 1-...: (11800 GPs behind) idle=45d/140000000000000/0 softirq=0/0 fqs=21663
 (detected by 0, t=65016 jiffies, g=11500, c=11499, q=719)
Task dump for CPU 1:
qemu-system-x86 R  running task        0  3529   3525 0x00080808
 ffff8802021791a0 ffff880212895040 0000000000000001 00007f1c2c00db40
 ffff8801dd20fcd3 ffffc90002b98000 ffff8801dd20fc88 ffff8801dd20fcf8
 0000000000000286 ffff8801dd2ac538 ffff8801dd20fcc0 ffffffffc06949c9
Call Trace:
 ? kvm_write_guest_cached+0xb9/0x160 [kvm]
 ? __delay+0xf/0x20
 ? wait_lapic_expire+0x14a/0x200 [kvm]
 ? kvm_arch_vcpu_ioctl_run+0xcbe/0x1b00 [kvm]
 ? kvm_arch_vcpu_ioctl_run+0xe34/0x1b00 [kvm]
 ? kvm_vcpu_ioctl+0x2d3/0x7c0 [kvm]
 ? __fget+0x5/0x210
 ? do_vfs_ioctl+0x96/0x6a0
 ? __fget_light+0x2a/0x90
 ? SyS_ioctl+0x79/0x90
 ? do_syscall_64+0x7c/0x1e0
 ? entry_SYSCALL64_slow_path+0x25/0x25

This can be reproduced readily by running a full dynticks guest (hrtimers are
heavily used in such a guest) with lapic_timer_advance disabled.

If we fail to program the hardware preemption timer, we fall back to the
hrtimer-based method; however, in this scenario a previously programmed
preemption timer is not cancelled, so one hardware preemption timer and one
hrtimer-emulated tsc deadline timer end up running simultaneously. The target
guest deadline tsc can then be earlier than the current guest tsc, the
computation in vmx_set_hv_timer underflows, delta_tsc is set to a huge value,
and the host soft-locks up as shown above.

Fix this by cancelling the previously programmed preemption timer, if there is
one, whenever programming the new preemption timer fails, before falling back
to the hrtimer-based method.

Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Yunhong Jiang <yunhong.jiang@intel.com>
Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
---
 arch/x86/kvm/lapic.c | 15 +++++++++++----
 1 file changed, 11 insertions(+), 4 deletions(-)
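To make the underflow described in the changelog concrete: the time remaining
until the deadline is computed as an unsigned difference, so once the deadline
is behind the current guest tsc the subtraction wraps around to an enormous
positive value. Below is a minimal, self-contained illustration; the variable
names are ours, and the real computation lives in vmx_set_hv_timer.

#include <stdio.h>
#include <stdint.h>

int main(void)
{
        /* The guest tsc has already passed the programmed deadline. */
        uint64_t guest_tsc          = 1000000;
        uint64_t guest_deadline_tsc =  999000;

        /* Unsigned subtraction cannot go negative, it wraps ... */
        uint64_t delta_tsc = guest_deadline_tsc - guest_tsc;

        /* ... so the computed "ticks until deadline" is close to 2^64
         * instead of (roughly) zero: this is the huge delta_tsc the
         * changelog refers to. */
        printf("delta_tsc = %llu\n", (unsigned long long)delta_tsc);
        return 0;
}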