diff mbox

[v2] KVM: x86: reset lapic_timer.expired_tscdeadline at SET_LAPIC time

Message ID 20160620130531.GA8139@amt.cnet (mailing list archive)
State New, archived
Headers show

Commit Message

Marcelo Tosatti June 20, 2016, 1:05 p.m. UTC
Alan Jenkins reports hang at
https://bugzilla.redhat.com/show_bug.cgi?id=1337667,
due to guest TSC being set far behind than
lapic_timer.expired_tscdeadline, when restoring VM state
on top of currently active VM.

It is not possible to disable LAPIC timer advancement 
(by setting lapic_timer.expired_tscdeadline = 0), at 
guest TSC write because:

* APIC write: expiration = 1000.
* LAPIC tsc deadline code sets timer to 1000-30.
* Timer fires at 970.
* Guest writes TSC=w.

Guest fails to VM-entry to process signal to perform
"vmload" in userspace.

Case 1: w > 970:
Guest entry can be performed.

Case 2: w < 970:
Guest entry should not be performed because "An interrupt is generated
when the logical processor’s time-stamp counter equals or exceeds the
target value in the IA32_TSC_DEADLINE MSR."

In case 2, hardware would not fire an interrupt.

To fix the problem, disable timer advancement when 
userspace sets the LAPIC state. Setting of APIC 
resets all APIC state, including 
any pending interrupt.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Reported-by: Alan Jenkins <alan.christopher.jenkins@gmail.com>

---

v2: improve commit message

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

Alan Jenkins June 20, 2016, 3:22 p.m. UTC | #1
On 20/06/16 14:05, Marcelo Tosatti wrote:
> Alan Jenkins reports hang at
> https://bugzilla.redhat.com/show_bug.cgi?id=1337667,
> due to guest TSC being set far behind than
> lapic_timer.expired_tscdeadline, when restoring VM state
> on top of currently active VM.
>
> It is not possible to disable LAPIC timer advancement
> (by setting lapic_timer.expired_tscdeadline = 0), at
> guest TSC write

I like that it acknowledges (though only implicitly) the guest can 
trigger arbitrary lockups of host CPUs.

> because:
>
> * APIC write: expiration = 1000.
> * LAPIC tsc deadline code sets timer to 1000-30.
> * Timer fires at 970.
> * Guest writes TSC=w.
>
> Guest fails to VM-entry to process signal to perform
> "vmload" in userspace.
>
> Case 1: w > 970:
> Guest entry can be performed.
>
> Case 2: w < 970:
> Guest entry should not be performed because "An interrupt is generated
> when the logical processor’s time-stamp counter equals or exceeds the
> target value in the IA32_TSC_DEADLINE MSR."
>
> In case 2, hardware would not fire an interrupt.
>
> To fix the problem, disable timer advancement when
> userspace sets the LAPIC state. Setting of APIC
> resets all APIC state, including
> any pending interrupt.
>
> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
> Reported-by: Alan Jenkins <alan.christopher.jenkins@gmail.com>

However I feel this doesn't admit (even implicitly) that host software 
(not necessarily root) can still hard-lockup the CPU. It depends on the 
sequence of operations, and the message doesn't show that sequence 
explicitly. I now understand what the sequence that _is_ in the message 
shows, but it's unfortunately distracting.

I.e. if you restore the LAPIC first (or omit to do so at all), then 
restore the TSC deadline MSR, then the TSC MSR.

The patch assumes that the LAPIC is restored after the MSRs so it can 
clear the incorrect value of expired_tscdeadline, right?

I didn't know whether this patch would work until I tested it, because I 
didn't try to nail down the exact sequence QEMU is using.

> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index ea306ad..89be6e9 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -2991,6 +2991,7 @@ static int kvm_vcpu_ioctl_set_lapic(struct kvm_vcpu *vcpu,
>   {
>   	kvm_apic_post_state_restore(vcpu, s);
>   	update_cr8_intercept(vcpu);
> +	vcpu->arch.apic->lapic_timer.expired_tscdeadline = 0;
>
>   	return 0;
>   }


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Marcelo Tosatti June 21, 2016, 1:31 a.m. UTC | #2
On Mon, Jun 20, 2016 at 04:20:56PM +0100, Alan Jenkins wrote:
> On 20/06/16 14:05, Marcelo Tosatti wrote:
> >Alan Jenkins reports hang at
> >https://bugzilla.redhat.com/show_bug.cgi?id=1337667,
> >due to guest TSC being set far behind than
> >lapic_timer.expired_tscdeadline, when restoring VM state
> >on top of currently active VM.
> >
> >It is not possible to disable LAPIC timer advancement
> >(by setting lapic_timer.expired_tscdeadline = 0), at
> >guest TSC write
> 
> I like that it acknowledges (though only implicitly) the guest can
> trigger arbitrary lockups of host CPUs.
> 
> >because:
> >
> >* APIC write: expiration = 1000.
> >* LAPIC tsc deadline code sets timer to 1000-30.
> >* Timer fires at 970.
> >* Guest writes TSC=w.
> >
> >Guest fails to VM-entry to process signal to perform
> >"vmload" in userspace.
> >
> >Case 1: w > 970:
> >Guest entry can be performed.
> >
> >Case 2: w < 970:
> >Guest entry should not be performed because "An interrupt is generated
> >when the logical processor’s time-stamp counter equals or exceeds the
> >target value in the IA32_TSC_DEADLINE MSR."
> >
> >In case 2, hardware would not fire an interrupt.
> >
> >To fix the problem, disable timer advancement when
> >userspace sets the LAPIC state. Setting of APIC
> >resets all APIC state, including
> >any pending interrupt.
> >
> >Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
> >Reported-by: Alan Jenkins <alan.christopher.jenkins@gmail.com>
> 
> However I feel this doesn't admit (even implicitly) that host
> software (not necessarily root) can still hard-lockup the CPU. It
> depends on the sequence of operations, and the message doesn't show
> that sequence explicitly. I now understand what the sequence that
> _is_ in the message shows, but it's unfortunately distracting.
> 
> I.e. if you restore the LAPIC first (or omit to do so at all), then
> restore the TSC deadline MSR, then the TSC MSR.
> 
> The patch assumes that the LAPIC is restored after the MSRs so it
> can clear the incorrect value of expired_tscdeadline, right?

Yes.

> I didn't know whether this patch would work until I tested it,
> because I didn't try to nail down the exact sequence QEMU is using.

You are right about the host lockups -- sent another patch to fix that
one.

Thanks for the reports.


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Paolo Bonzini June 21, 2016, 7:50 a.m. UTC | #3
On 20/06/2016 15:05, Marcelo Tosatti wrote:
> 
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index ea306ad..89be6e9 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -2991,6 +2991,7 @@ static int kvm_vcpu_ioctl_set_lapic(struct kvm_vcpu *vcpu,
>  {
>  	kvm_apic_post_state_restore(vcpu, s);
>  	update_cr8_intercept(vcpu);
> +	vcpu->arch.apic->lapic_timer.expired_tscdeadline = 0;
>  

I think this is not correct.  You have programmed the host timer to an
early value when kvm_apic_post_state_restore called start_apic_timer.

I think that:

1) post_state_restore should cancel the timer and clear
lapic_timer.pending before writing the registers, not afterwards.
lapic_timer.expired_tscdeadline can be cleared at the same time.

2) kvm_write_tsc should do the same and restart the timer afterwards.

Thanks,

Paolo
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Marcelo Tosatti June 21, 2016, 11:35 a.m. UTC | #4
On Tue, Jun 21, 2016 at 09:50:54AM +0200, Paolo Bonzini wrote:
> 
> 
> On 20/06/2016 15:05, Marcelo Tosatti wrote:
> > 
> > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > index ea306ad..89be6e9 100644
> > --- a/arch/x86/kvm/x86.c
> > +++ b/arch/x86/kvm/x86.c
> > @@ -2991,6 +2991,7 @@ static int kvm_vcpu_ioctl_set_lapic(struct kvm_vcpu *vcpu,
> >  {
> >  	kvm_apic_post_state_restore(vcpu, s);
> >  	update_cr8_intercept(vcpu);
> > +	vcpu->arch.apic->lapic_timer.expired_tscdeadline = 0;
> >  
> 
> I think this is not correct.  You have programmed the host timer to an
> early value when kvm_apic_post_state_restore called start_apic_timer.
> 
> I think that:
> 
> 1) post_state_restore should cancel the timer and clear
> lapic_timer.pending before writing the registers, not afterwards.
> lapic_timer.expired_tscdeadline can be cleared at the same time.
> 
> 2) kvm_write_tsc should do the same and restart the timer afterwards.
> 
> Thanks,
> 
> Paolo

Actually, the max should deal with the problem Alan is having with
vmload as well (one delay of tenths of microseconds, or 0, per vmload
should be irrelevant).

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Paolo Bonzini June 21, 2016, 11:37 a.m. UTC | #5
On 21/06/2016 13:35, Marcelo Tosatti wrote:
> On Tue, Jun 21, 2016 at 09:50:54AM +0200, Paolo Bonzini wrote:
>>
>>
>> On 20/06/2016 15:05, Marcelo Tosatti wrote:
>>>
>>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>>> index ea306ad..89be6e9 100644
>>> --- a/arch/x86/kvm/x86.c
>>> +++ b/arch/x86/kvm/x86.c
>>> @@ -2991,6 +2991,7 @@ static int kvm_vcpu_ioctl_set_lapic(struct kvm_vcpu *vcpu,
>>>  {
>>>  	kvm_apic_post_state_restore(vcpu, s);
>>>  	update_cr8_intercept(vcpu);
>>> +	vcpu->arch.apic->lapic_timer.expired_tscdeadline = 0;
>>>  
>>
>> I think this is not correct.  You have programmed the host timer to an
>> early value when kvm_apic_post_state_restore called start_apic_timer.
>>
>> I think that:
>>
>> 1) post_state_restore should cancel the timer and clear
>> lapic_timer.pending before writing the registers, not afterwards.
>> lapic_timer.expired_tscdeadline can be cleared at the same time.
>>
>> 2) kvm_write_tsc should do the same and restart the timer afterwards.
>>
>> Thanks,
> 
> Actually, the max should deal with the problem Alan is having with
> vmload as well (one delay of tenths of microseconds, or 0, per vmload
> should be irrelevant).

Yes, it does.  Though I think it's a bug that we don't clear .pending on
KVM_SET_LAPIC, and I'm happy if you do a v2 of this patch to fix it. :)

Paolo
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index ea306ad..89be6e9 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -2991,6 +2991,7 @@  static int kvm_vcpu_ioctl_set_lapic(struct kvm_vcpu *vcpu,
 {
 	kvm_apic_post_state_restore(vcpu, s);
 	update_cr8_intercept(vcpu);
+	vcpu->arch.apic->lapic_timer.expired_tscdeadline = 0;
 
 	return 0;
 }