Message ID | 20160620130531.GA8139@amt.cnet (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
On 20/06/16 14:05, Marcelo Tosatti wrote: > Alan Jenkins reports hang at > https://bugzilla.redhat.com/show_bug.cgi?id=1337667, > due to guest TSC being set far behind than > lapic_timer.expired_tscdeadline, when restoring VM state > on top of currently active VM. > > It is not possible to disable LAPIC timer advancement > (by setting lapic_timer.expired_tscdeadline = 0), at > guest TSC write I like that it acknowledges (though only implicitly) the guest can trigger arbitrary lockups of host CPUs. > because: > > * APIC write: expiration = 1000. > * LAPIC tsc deadline code sets timer to 1000-30. > * Timer fires at 970. > * Guest writes TSC=w. > > Guest fails to VM-entry to process signal to perform > "vmload" in userspace. > > Case 1: w > 970: > Guest entry can be performed. > > Case 2: w < 970: > Guest entry should not be performed because "An interrupt is generated > when the logical processor’s time-stamp counter equals or exceeds the > target value in the IA32_TSC_DEADLINE MSR." > > In case 2, hardware would not fire an interrupt. > > To fix the problem, disable timer advancement when > userspace sets the LAPIC state. Setting of APIC > resets all APIC state, including > any pending interrupt. > > Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> > Reported-by: Alan Jenkins <alan.christopher.jenkins@gmail.com> However I feel this doesn't admit (even implicitly) that host software (not necessarily root) can still hard-lockup the CPU. It depends on the sequence of operations, and the message doesn't show that sequence explicitly. I now understand what the sequence that _is_ in the message shows, but it's unfortunately distracting. I.e. if you restore the LAPIC first (or omit to do so at all), then restore the TSC deadline MSR, then the TSC MSR. The patch assumes that the LAPIC is restored after the MSRs so it can clear the incorrect value of expired_tscdeadline, right? I didn't know whether this patch would work until I tested it, because I didn't try to nail down the exact sequence QEMU is using. > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c > index ea306ad..89be6e9 100644 > --- a/arch/x86/kvm/x86.c > +++ b/arch/x86/kvm/x86.c > @@ -2991,6 +2991,7 @@ static int kvm_vcpu_ioctl_set_lapic(struct kvm_vcpu *vcpu, > { > kvm_apic_post_state_restore(vcpu, s); > update_cr8_intercept(vcpu); > + vcpu->arch.apic->lapic_timer.expired_tscdeadline = 0; > > return 0; > } -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Mon, Jun 20, 2016 at 04:20:56PM +0100, Alan Jenkins wrote: > On 20/06/16 14:05, Marcelo Tosatti wrote: > >Alan Jenkins reports hang at > >https://bugzilla.redhat.com/show_bug.cgi?id=1337667, > >due to guest TSC being set far behind than > >lapic_timer.expired_tscdeadline, when restoring VM state > >on top of currently active VM. > > > >It is not possible to disable LAPIC timer advancement > >(by setting lapic_timer.expired_tscdeadline = 0), at > >guest TSC write > > I like that it acknowledges (though only implicitly) the guest can > trigger arbitrary lockups of host CPUs. > > >because: > > > >* APIC write: expiration = 1000. > >* LAPIC tsc deadline code sets timer to 1000-30. > >* Timer fires at 970. > >* Guest writes TSC=w. > > > >Guest fails to VM-entry to process signal to perform > >"vmload" in userspace. > > > >Case 1: w > 970: > >Guest entry can be performed. > > > >Case 2: w < 970: > >Guest entry should not be performed because "An interrupt is generated > >when the logical processor’s time-stamp counter equals or exceeds the > >target value in the IA32_TSC_DEADLINE MSR." > > > >In case 2, hardware would not fire an interrupt. > > > >To fix the problem, disable timer advancement when > >userspace sets the LAPIC state. Setting of APIC > >resets all APIC state, including > >any pending interrupt. > > > >Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> > >Reported-by: Alan Jenkins <alan.christopher.jenkins@gmail.com> > > However I feel this doesn't admit (even implicitly) that host > software (not necessarily root) can still hard-lockup the CPU. It > depends on the sequence of operations, and the message doesn't show > that sequence explicitly. I now understand what the sequence that > _is_ in the message shows, but it's unfortunately distracting. > > I.e. if you restore the LAPIC first (or omit to do so at all), then > restore the TSC deadline MSR, then the TSC MSR. > > The patch assumes that the LAPIC is restored after the MSRs so it > can clear the incorrect value of expired_tscdeadline, right? Yes. > I didn't know whether this patch would work until I tested it, > because I didn't try to nail down the exact sequence QEMU is using. You are right about the host lockups -- sent another patch to fix that one. Thanks for the reports. -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 20/06/2016 15:05, Marcelo Tosatti wrote: > > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c > index ea306ad..89be6e9 100644 > --- a/arch/x86/kvm/x86.c > +++ b/arch/x86/kvm/x86.c > @@ -2991,6 +2991,7 @@ static int kvm_vcpu_ioctl_set_lapic(struct kvm_vcpu *vcpu, > { > kvm_apic_post_state_restore(vcpu, s); > update_cr8_intercept(vcpu); > + vcpu->arch.apic->lapic_timer.expired_tscdeadline = 0; > I think this is not correct. You have programmed the host timer to an early value when kvm_apic_post_state_restore called start_apic_timer. I think that: 1) post_state_restore should cancel the timer and clear lapic_timer.pending before writing the registers, not afterwards. lapic_timer.expired_tscdeadline can be cleared at the same time. 2) kvm_write_tsc should do the same and restart the timer afterwards. Thanks, Paolo -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Tue, Jun 21, 2016 at 09:50:54AM +0200, Paolo Bonzini wrote: > > > On 20/06/2016 15:05, Marcelo Tosatti wrote: > > > > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c > > index ea306ad..89be6e9 100644 > > --- a/arch/x86/kvm/x86.c > > +++ b/arch/x86/kvm/x86.c > > @@ -2991,6 +2991,7 @@ static int kvm_vcpu_ioctl_set_lapic(struct kvm_vcpu *vcpu, > > { > > kvm_apic_post_state_restore(vcpu, s); > > update_cr8_intercept(vcpu); > > + vcpu->arch.apic->lapic_timer.expired_tscdeadline = 0; > > > > I think this is not correct. You have programmed the host timer to an > early value when kvm_apic_post_state_restore called start_apic_timer. > > I think that: > > 1) post_state_restore should cancel the timer and clear > lapic_timer.pending before writing the registers, not afterwards. > lapic_timer.expired_tscdeadline can be cleared at the same time. > > 2) kvm_write_tsc should do the same and restart the timer afterwards. > > Thanks, > > Paolo Actually, the max should deal with the problem Alan is having with vmload as well (one delay of tenths of microseconds, or 0, per vmload should be irrelevant). -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 21/06/2016 13:35, Marcelo Tosatti wrote: > On Tue, Jun 21, 2016 at 09:50:54AM +0200, Paolo Bonzini wrote: >> >> >> On 20/06/2016 15:05, Marcelo Tosatti wrote: >>> >>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c >>> index ea306ad..89be6e9 100644 >>> --- a/arch/x86/kvm/x86.c >>> +++ b/arch/x86/kvm/x86.c >>> @@ -2991,6 +2991,7 @@ static int kvm_vcpu_ioctl_set_lapic(struct kvm_vcpu *vcpu, >>> { >>> kvm_apic_post_state_restore(vcpu, s); >>> update_cr8_intercept(vcpu); >>> + vcpu->arch.apic->lapic_timer.expired_tscdeadline = 0; >>> >> >> I think this is not correct. You have programmed the host timer to an >> early value when kvm_apic_post_state_restore called start_apic_timer. >> >> I think that: >> >> 1) post_state_restore should cancel the timer and clear >> lapic_timer.pending before writing the registers, not afterwards. >> lapic_timer.expired_tscdeadline can be cleared at the same time. >> >> 2) kvm_write_tsc should do the same and restart the timer afterwards. >> >> Thanks, > > Actually, the max should deal with the problem Alan is having with > vmload as well (one delay of tenths of microseconds, or 0, per vmload > should be irrelevant). Yes, it does. Though I think it's a bug that we don't clear .pending on KVM_SET_LAPIC, and I'm happy if you do a v2 of this patch to fix it. :) Paolo -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index ea306ad..89be6e9 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -2991,6 +2991,7 @@ static int kvm_vcpu_ioctl_set_lapic(struct kvm_vcpu *vcpu, { kvm_apic_post_state_restore(vcpu, s); update_cr8_intercept(vcpu); + vcpu->arch.apic->lapic_timer.expired_tscdeadline = 0; return 0; }
Alan Jenkins reports hang at https://bugzilla.redhat.com/show_bug.cgi?id=1337667, due to guest TSC being set far behind than lapic_timer.expired_tscdeadline, when restoring VM state on top of currently active VM. It is not possible to disable LAPIC timer advancement (by setting lapic_timer.expired_tscdeadline = 0), at guest TSC write because: * APIC write: expiration = 1000. * LAPIC tsc deadline code sets timer to 1000-30. * Timer fires at 970. * Guest writes TSC=w. Guest fails to VM-entry to process signal to perform "vmload" in userspace. Case 1: w > 970: Guest entry can be performed. Case 2: w < 970: Guest entry should not be performed because "An interrupt is generated when the logical processor’s time-stamp counter equals or exceeds the target value in the IA32_TSC_DEADLINE MSR." In case 2, hardware would not fire an interrupt. To fix the problem, disable timer advancement when userspace sets the LAPIC state. Setting of APIC resets all APIC state, including any pending interrupt. Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Reported-by: Alan Jenkins <alan.christopher.jenkins@gmail.com> --- v2: improve commit message -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html