diff mbox

hard lockup in wait_lapic_expire() - bug in TSC deadline timer?

Message ID CANRm+Cyu9vqobbiGqrN_EOXrDJCZVbf+BfU_tCZ=nxr3qS=zog@mail.gmail.com (mailing list archive)
State New, archived
Headers show

Commit Message

Wanpeng Li May 22, 2016, 1:23 a.m. UTC
2016-05-20 19:47 GMT+08:00 Alan Jenkins <alan.christopher.jenkins@gmail.com>:
> Hi
>
> I'm getting a hard lockup in wait_lapic_expire(). I'm not certain why,
> and it didn't seem to reproduce on a second setup. However I found a
> suspect in the code for TSC deadline timers. Maybe someone is interested
> in my analysis.
>
> If a guest has a TSC deadline timer set, it's not re-done when the TSC
> is adjusted, and will fire at the wrong time. The hrtimer used for
> emulation is not being recalculated. If the TSC was set _backwards_, I
> think it could trigger a lockup in wait_lapic_expire(). This function is
> a busy-loop optimization, which could be tricked into busy-looping for
> too long. The expected busy-wait is `lapic_timer_advance_ns`, which
> defaults to 0 (I haven't changed it).

The default clockevent device for hrtimer is lapic, and tsc works as a
clocksource device, even if tsc in guest is backwards/adjusted,
clockevent device is not influenced and work normally I think, so we
just need to keep clockevent device can fire asap when it expires. How
about someting like below(I can't reproduce your bug and just can send
out a formal patch after your confirm):

Regards,
Wanpeng Li
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

Alan Jenkins May 22, 2016, 9:14 a.m. UTC | #1
On 22/05/2016, Wanpeng Li <kernellwp@gmail.com> wrote:
> 2016-05-22 9:23 GMT+08:00 Wanpeng Li <kernellwp@gmail.com>:
>> 2016-05-20 19:47 GMT+08:00 Alan Jenkins
>> <alan.christopher.jenkins@gmail.com>:
>>> Hi
>>>
>>> I'm getting a hard lockup in wait_lapic_expire(). I'm not certain why,
>>> and it didn't seem to reproduce on a second setup. However I found a
>>> suspect in the code for TSC deadline timers. Maybe someone is interested
>>> in my analysis.
>>>
>>> If a guest has a TSC deadline timer set, it's not re-done when the TSC
>>> is adjusted, and will fire at the wrong time. The hrtimer used for
>>> emulation is not being recalculated. If the TSC was set _backwards_, I
>>> think it could trigger a lockup in wait_lapic_expire(). This function is
>>> a busy-loop optimization, which could be tricked into busy-looping for
>>> too long. The expected busy-wait is `lapic_timer_advance_ns`, which
>>> defaults to 0 (I haven't changed it).
>>
>> The default clockevent device for hrtimer is lapic, and tsc works as a
>> clocksource device, even if tsc in guest is backwards/adjusted,
>> clockevent device is not influenced and work normally I think, so we
>> just need to keep clockevent device can fire asap when it expires. How
>> about someting like below(I can't reproduce your bug and just can send
>> out a formal patch after your confirm):

I don't understand your explanation.  I don't mind testing such a
patch, as it will fix the issue I experience :).

In my understanding this is only a workaround, so we should also add a
(ratelimited) warning message.

>> diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
>> index bbb5b28..02a21d3 100644
>> --- a/arch/x86/kvm/lapic.c
>> +++ b/arch/x86/kvm/lapic.c
>> @@ -1310,7 +1310,7 @@ void wait_lapic_expire(struct kvm_vcpu *vcpu)
>>
>>         /* __delay is delay_tsc whenever the hardware has TSC, thus
>> always.  */
>>         if (guest_tsc < tsc_deadline)
>> -               __delay(tsc_deadline - guest_tsc);
>> +               __delay(max((tsc_deadline - guest_tsc),
>> lapic_timer_advance_ns));
>>  }
>
> s/max/min

wait I found another one

>> -               __delay(tsc_deadline - guest_tsc);


>>                __delay(
>>                        tsc_deadline - guest_tsc)


(tsc_deadline - guest_tsc) gives a value in terms of guest TSC rate.
But guest TSCs are scaled.  __delay() must expect a value in terms of
host TSC rate.  So the busy-wait could take shorter or longer than
expected depending on how the guest TSC is scaled.  Right?

Alan
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Alan Jenkins May 23, 2016, 2:30 p.m. UTC | #2
On 22/05/2016, Alan Jenkins <alan.christopher.jenkins@gmail.com> wrote:
> On 22/05/2016, Wanpeng Li <kernellwp@gmail.com> wrote:
>> 2016-05-22 9:23 GMT+08:00 Wanpeng Li <kernellwp@gmail.com>:
>>> 2016-05-20 19:47 GMT+08:00 Alan Jenkins
>>> <alan.christopher.jenkins@gmail.com>:
>>>> Hi
>>>>
>>>> I'm getting a hard lockup in wait_lapic_expire(). I'm not certain why,
>>>> and it didn't seem to reproduce on a second setup. However I found a
>>>> suspect in the code for TSC deadline timers. Maybe someone is
>>>> interested
>>>> in my analysis.
>>>>
>>>> If a guest has a TSC deadline timer set, it's not re-done when the TSC
>>>> is adjusted, and will fire at the wrong time. The hrtimer used for
>>>> emulation is not being recalculated. If the TSC was set _backwards_, I
>>>> think it could trigger a lockup in wait_lapic_expire(). This function
>>>> is
>>>> a busy-loop optimization, which could be tricked into busy-looping for
>>>> too long. The expected busy-wait is `lapic_timer_advance_ns`, which
>>>> defaults to 0 (I haven't changed it).
>>>
>>> The default clockevent device for hrtimer is lapic, and tsc works as a
>>> clocksource device, even if tsc in guest is backwards/adjusted,
>>> clockevent device is not influenced and work normally I think, so we
>>> just need to keep clockevent device can fire asap when it expires. How
>>> about someting like below(I can't reproduce your bug and just can send
>>> out a formal patch after your confirm):
>
> I don't understand your explanation.  I don't mind testing such a
> patch, as it will fix the issue I experience :).
>
> In my understanding this is only a workaround, so we should also add a
> (ratelimited) warning message.
>
>>> diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
>>> index bbb5b28..02a21d3 100644
>>> --- a/arch/x86/kvm/lapic.c
>>> +++ b/arch/x86/kvm/lapic.c
>>> @@ -1310,7 +1310,7 @@ void wait_lapic_expire(struct kvm_vcpu *vcpu)
>>>
>>>         /* __delay is delay_tsc whenever the hardware has TSC, thus
>>> always.  */
>>>         if (guest_tsc < tsc_deadline)
>>> -               __delay(tsc_deadline - guest_tsc);
>>> +               __delay(max((tsc_deadline - guest_tsc),
>>> lapic_timer_advance_ns));
>>>  }

But that's comparing TSC ticks and nanoseconds?


>> s/max/min
>
> wait I found another one
>
>>> -               __delay(tsc_deadline - guest_tsc);
>
>
>>>                __delay(
>>>                        tsc_deadline - guest_tsc)
>
>
> (tsc_deadline - guest_tsc) gives a value in terms of guest TSC rate.
> But guest TSCs are scaled.  __delay() must expect a value in terms of
> host TSC rate.  So the busy-wait could take shorter or longer than
> expected depending on how the guest TSC is scaled.  Right?


I wrote an expanded test patch, which is posted with full results on
the Fedora bug https://bugzilla.redhat.com/show_bug.cgi?id=1337667

It confirmed the lockup can be avoided by sanity-checking the delay in
wait_lapic_expire().  With my test patch, I get a warning message but
no lockup.

Wanpeng, I have another description here, maybe it is more convincing?

https://github.com/torvalds/linux/commit/5dda13b540c0baeebb69f612036e9e1d9d754ad8


My test patch also added checks for guest TSC being set backwards.
The log confirms this is happening, though I haven't confirmed exactly
how the lockup is created.  It's non-obvious if there's a match
between the adjustments shown and the exact lockup duration.
(Conversion factor = reported guest tsc rate = 1.6 Ghz = host tsc
rate).

Note the kernel log shows a VM being resumed from a snapshot, twice.
I don't stop the VM before I ask virt-manager to resume from snapshot
the second time.  Normally the lockup happens on that second resume.

While I'm ranting, adding the TSC checks made me think I understand
enough to detail the "real" fix.  Not tested:

https://github.com/torvalds/linux/compare/master...sourcejedi:kvm_lockup_fix?expand=1

Alan
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Paolo Bonzini May 24, 2016, 12:51 p.m. UTC | #3
On 23/05/2016 16:30, Alan Jenkins wrote:
> https://github.com/torvalds/linux/compare/master...sourcejedi:kvm_lockup_fix?expand=1

Here the comparison (and the expired_advance_ns field) is only used to
issue the warning.  Could you change the patch to use __delay instead of
ndelay and omit the warning?

+	if (guest_tsc < tsc_deadline) {
+		unsigned long this_tsc_khz = vcpu->arch.virtual_tsc_khz;
+		u64 delay_ns;
+
+		delay_ns = (tsc_deadline - guest_tsc) * 1000000ULL;
+		do_div(delay_ns, this_tsc_khz);
+
+		if (delay_ns > apic->lapic_timer.expired_advance_ns) {
+			vcpu_err(vcpu,
+			         "timer for tsc_deadline fired too far in advance: %lluns\n",
+			         delay_ns);
+			WARN_ON(1);
+
+			/* avoid lockup. emulated timer will fire early */
+			delay_ns = apic->lapic_timer.expired_advance_ns;
+		}
+
+		ndelay(delay_ns);
+	} 

Thanks,

Paolo
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Alan Jenkins May 24, 2016, 2:20 p.m. UTC | #4
On 24/05/16 13:51, Paolo Bonzini wrote:
> On 23/05/2016 16:30, Alan Jenkins wrote:
>> https://github.com/torvalds/linux/compare/master...sourcejedi:kvm_lockup_fix?expand=1
> Here the comparison (and the expired_advance_ns field) is only used to
> issue the warning.
No, because expired_advance_ns is used as the limit for the busy-wait.

> Could you change the patch to use __delay instead of
> ndelay and
No, because the limit is in ns, not host TSC ticks.  Also, we were 
passing _guest_ TSC ticks to __delay(), which looks like a second 
potential cause of lockups because guest TSC ticks are scaled.

>   omit the warning?
Certainly, and I have a positive suggestion to make the code more 
agreeable if it doesn't need WARN_ON().  Thanks for reviewing this. I'll 
send a revised version of the patch after testing.

The field expired_advance_ns was an attempt to prevent warnings (and 
premature timer interrupts) in case lapic_timer_advance_ns is tuned 
downwards, while we're running.  This was a third potential case I 
noticed, which could have caused a lockup.

The problem was I didn't want to make a user-triggerable WARN_ON. But if 
we don't want WARN_ON at all, then maybe it's ok to compare the module 
parameter racily, so we don't need the extra field. There's no existing 
documentation which I could update, to say that reducing 
lapic_timer_advance_ns is not supported, right?  (Because even with 
checks, it means the guest deadline timer can fire when the TSC is 
showing a time prior to the deadline, so it could break guest assumptions).

I wanted to log _a_ warning because I'm writing another patch there (if 
you've seen the parent commit) which is supposed to prevent it ever 
happening.  A better alternative would be to have the workaround first 
(and for -stable), without the warning.  (I assume you still dislike 
WARN_ON with the other patch, but I would try to sell you on at least 
printk_once(KERN_ERROR :).

> +	if (guest_tsc < tsc_deadline) {
> +		unsigned long this_tsc_khz = vcpu->arch.virtual_tsc_khz;
> +		u64 delay_ns;
> +
> +		delay_ns = (tsc_deadline - guest_tsc) * 1000000ULL;
> +		do_div(delay_ns, this_tsc_khz);
> +
> +		if (delay_ns > apic->lapic_timer.expired_advance_ns) {
> +			vcpu_err(vcpu,
> +			         "timer for tsc_deadline fired too far in advance: %lluns\n",
> +			         delay_ns);
> +			WARN_ON(1);
> +
> +			/* avoid lockup. emulated timer will fire early */
> +			delay_ns = apic->lapic_timer.expired_advance_ns;
> +		}
> +
> +		ndelay(delay_ns);
> +	}
>
> Thanks,
>
> Paolo

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Paolo Bonzini May 24, 2016, 2:36 p.m. UTC | #5
On 24/05/2016 16:20, Alan Jenkins wrote:
> On 24/05/16 13:51, Paolo Bonzini wrote:
>> On 23/05/2016 16:30, Alan Jenkins wrote:
>>> https://github.com/torvalds/linux/compare/master...sourcejedi:kvm_lockup_fix?expand=1
>>>
>> Here the comparison (and the expired_advance_ns field) is only used to
>> issue the warning.
> No, because expired_advance_ns is used as the limit for the busy-wait.

Right, but I expected the limit to go away together with the warning in
the final patch, if the root cause can be fixed.

>> Could you change the patch to use __delay instead of
>> ndelay and
>
> No, because the limit is in ns, not host TSC ticks.  Also, we were
> passing _guest_ TSC ticks to __delay(), which looks like a second
> potential cause of lockups because guest TSC ticks are scaled.

This is correct.  It's missing a call to kvm_scale_tsc if you keep the
__delay.

> The problem was I didn't want to make a user-triggerable WARN_ON. But if
> we don't want WARN_ON at all, then maybe it's ok to compare the module
> parameter racily, so we don't need the extra field. There's no existing
> documentation which I could update, to say that reducing
> lapic_timer_advance_ns is not supported, right?  (Because even with
> checks, it means the guest deadline timer can fire when the TSC is
> showing a time prior to the deadline, so it could break guest assumptions).

I'd just make lapic_timer_advance_ns read-only, I think.

Thanks,

Paolo
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Paolo Bonzini June 13, 2016, 8:23 a.m. UTC | #6
On 09/06/2016 11:31, Alan Jenkins wrote:
> diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
> index 1a2da0e..84abb1a 100644
> --- a/arch/x86/kvm/lapic.c
> +++ b/arch/x86/kvm/lapic.c
> @@ -1273,7 +1273,6 @@ static void apic_timer_expired(struct kvm_lapic *apic)
>  {
>      struct kvm_vcpu *vcpu = apic->vcpu;
>      struct swait_queue_head *q = &vcpu->wq;
> -    struct kvm_timer *ktimer = &apic->lapic_timer;
>  
>      if (atomic_read(&apic->lapic_timer.pending))
>          return;
> @@ -1283,9 +1282,6 @@ static void apic_timer_expired(struct kvm_lapic *apic)
>  
>      if (swait_active(q))
>          swake_up(q);
> -
> -    if (apic_lvtt_tscdeadline(apic))
> -        ktimer->expired_tscdeadline = ktimer->tscdeadline;
>  }
>  
>  /*
> @@ -1922,12 +1918,15 @@ int kvm_apic_accept_pic_intr(struct kvm_vcpu *vcpu)
>  void kvm_inject_apic_timer_irqs(struct kvm_vcpu *vcpu)
>  {
>      struct kvm_lapic *apic = vcpu->arch.apic;
> +    struct kvm_timer *ktimer = &apic->lapic_timer;
>  
> -    if (atomic_read(&apic->lapic_timer.pending) > 0) {
> +    if (atomic_read(&ktimer->pending) > 0) {
>          kvm_apic_local_deliver(apic, APIC_LVTT);
> -        if (apic_lvtt_tscdeadline(apic))
> -            apic->lapic_timer.tscdeadline = 0;
> -        atomic_set(&apic->lapic_timer.pending, 0);
> +        if (apic_lvtt_tscdeadline(apic)) {
> +            ktimer->expired_tscdeadline = ktimer->tscdeadline;
> +            ktimer->tscdeadline = 0;
> +        }
> +        atomic_set(&ktimer->pending, 0);
>      }
>  }
>  
> @@ -2007,6 +2006,71 @@ void __kvm_migrate_apic_timer(struct kvm_vcpu *vcpu)
>  }
>  
>  /*
> + * kvm_apic_tsc_update - called when guest or host changes the TSC
> + *
> + * Update the emulated TSC deadline timer.
> + *
> + * It is also required if host changes TSC frequency scaling.  It is not
> + * required when we "catch up" the TSC, because that is maintaining the
> + * relationship between TSC and real time.
> + */
> +void kvm_apic_tsc_update(struct kvm_vcpu *vcpu)
> +{
> +    struct kvm_lapic *apic = vcpu->arch.apic;
> +    struct kvm_timer *ktimer = &apic->lapic_timer;
> +
> +    if (!lapic_in_kernel(vcpu))
> +        return;
> +
> +    if (!apic_lvtt_tscdeadline(apic))
> +        return;
> +
> +    hrtimer_cancel(&ktimer->timer);
> +
> +    /*
> +     * Handle when guest TSC is adjusted backwards, just after
> +     * inject_apic_timer_irqs().  When we hit wait_lapic_expire(), we risk
> +     * an extended busy-wait.  Because ktimer->expired_tscdeadline will
> +     * have receded further into the future.
> +     *
> +     * The guest does not get a chance to run in this window, but the host
> +     * could modify the TSC at this point.  This race could happen when
> +     * restoring a live snapshot of a VM.
> +     *
> +     * Just clear the busy-wait.  In theory this could cause a wakeup
> +     * before the deadline, if the user configured the the busy-wait
> +     * feature (lapic_advance_timer_ns).  We don't expect this to matter:

Why not just delay by lapic_advance_timer_ns here?  It's usually around
10 nanoseconds, it should be cheap.

Paolo
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Alan Jenkins June 13, 2016, 10:33 a.m. UTC | #7
On 13/06/16 09:23, Paolo Bonzini wrote:
>
> On 09/06/2016 11:31, Alan Jenkins wrote:
>> diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
>> index 1a2da0e..84abb1a 100644
>> --- a/arch/x86/kvm/lapic.c
>> +++ b/arch/x86/kvm/lapic.c
>> @@ -1273,7 +1273,6 @@ static void apic_timer_expired(struct kvm_lapic *apic)
>>   {
>>       struct kvm_vcpu *vcpu = apic->vcpu;
>>       struct swait_queue_head *q = &vcpu->wq;
>> -    struct kvm_timer *ktimer = &apic->lapic_timer;
>>   
>>       if (atomic_read(&apic->lapic_timer.pending))
>>           return;
>> @@ -1283,9 +1282,6 @@ static void apic_timer_expired(struct kvm_lapic *apic)
>>   
>>       if (swait_active(q))
>>           swake_up(q);
>> -
>> -    if (apic_lvtt_tscdeadline(apic))
>> -        ktimer->expired_tscdeadline = ktimer->tscdeadline;
>>   }
>>   
>>   /*
>> @@ -1922,12 +1918,15 @@ int kvm_apic_accept_pic_intr(struct kvm_vcpu *vcpu)
>>   void kvm_inject_apic_timer_irqs(struct kvm_vcpu *vcpu)
>>   {
>>       struct kvm_lapic *apic = vcpu->arch.apic;
>> +    struct kvm_timer *ktimer = &apic->lapic_timer;
>>   
>> -    if (atomic_read(&apic->lapic_timer.pending) > 0) {
>> +    if (atomic_read(&ktimer->pending) > 0) {
>>           kvm_apic_local_deliver(apic, APIC_LVTT);
>> -        if (apic_lvtt_tscdeadline(apic))
>> -            apic->lapic_timer.tscdeadline = 0;
>> -        atomic_set(&apic->lapic_timer.pending, 0);
>> +        if (apic_lvtt_tscdeadline(apic)) {
>> +            ktimer->expired_tscdeadline = ktimer->tscdeadline;
>> +            ktimer->tscdeadline = 0;
>> +        }
>> +        atomic_set(&ktimer->pending, 0);
>>       }
>>   }
>>   
>> @@ -2007,6 +2006,71 @@ void __kvm_migrate_apic_timer(struct kvm_vcpu *vcpu)
>>   }
>>   
>>   /*
>> + * kvm_apic_tsc_update - called when guest or host changes the TSC
>> + *
>> + * Update the emulated TSC deadline timer.
>> + *
>> + * It is also required if host changes TSC frequency scaling.  It is not
>> + * required when we "catch up" the TSC, because that is maintaining the
>> + * relationship between TSC and real time.
>> + */
>> +void kvm_apic_tsc_update(struct kvm_vcpu *vcpu)
>> +{
>> +    struct kvm_lapic *apic = vcpu->arch.apic;
>> +    struct kvm_timer *ktimer = &apic->lapic_timer;
>> +
>> +    if (!lapic_in_kernel(vcpu))
>> +        return;
>> +
>> +    if (!apic_lvtt_tscdeadline(apic))
>> +        return;
>> +
>> +    hrtimer_cancel(&ktimer->timer);
>> +
>> +    /*
>> +     * Handle when guest TSC is adjusted backwards, just after
>> +     * inject_apic_timer_irqs().  When we hit wait_lapic_expire(), we risk
>> +     * an extended busy-wait.  Because ktimer->expired_tscdeadline will
>> +     * have receded further into the future.
>> +     *
>> +     * The guest does not get a chance to run in this window, but the host
>> +     * could modify the TSC at this point.  This race could happen when
>> +     * restoring a live snapshot of a VM.
>> +     *
>> +     * Just clear the busy-wait.  In theory this could cause a wakeup
>> +     * before the deadline, if the user configured the the busy-wait
>> +     * feature (lapic_advance_timer_ns).  We don't expect this to matter:
> Why not just delay by lapic_advance_timer_ns here?  It's usually around
> 10 nanoseconds, it should be cheap.
>
> Paolo

I did some more thinking since.  Page-long justifications seem 
sub-optimal.  I agree with your point, and I'm looking in to it. (Though 
I think it's around 10 microseconds.  40 cpu cycles sounds very short :-P).

Thanks
Alan


I thought it would be most natural to call wait_lapic_expire() before 
returning to userspace.  Then we avoid returning with 
expired_tscdeadline set, and userspace doesn't see the irq injected 
before the deadline.

So I checked the pre-conditions for wait_lapic_expire() and tripped over 
another issue.

I don't think it likes tsc_catchup=1, even where it's called now. 
vcpu_load() can make the guest TSC seem to stand still for a bit, and 
the "catchup" part is driven from get_kernel_ns().  I.e. the same 
resolution as the kernel timer tick.  The busy-wait could be extended to 
1ms (HZ=1000), so this low latency feature defeats itself again :-(.  
Like the longer busy-wait issue, it wouldn't be limited to systems where 
the busy-wait feature has been configured.

That could explain why I saw short busy-waits up to .5ms while running a 
VM, because that happened after I'd triggered the long busy-wait, and 
the TSC was marked unstable by the clocksource watchdog.

- Don't support it?  Disable lapic_timer_advance_ns on systems without 
perfectly synchronized TSCs?  Has no-one noticed because no such systems 
have been configured?  Or does it escape their validation tests and this 
change could be perceived as a regression?

- Hack tsc_catchup to take the hrtimer expiry into account?  Disable 
busy-waits only for the old cpus lacking X86_FEATURE_CONSTANT_TSC?
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Paolo Bonzini June 15, 2016, 10:40 p.m. UTC | #8
On 13/06/2016 12:33, Alan Jenkins wrote:
> 
> I did some more thinking since.  Page-long justifications seem
> sub-optimal.  I agree with your point, and I'm looking in to it. (Though
> I think it's around 10 microseconds.  40 cpu cycles sounds very short :-P).

Yeah.  It's actually less than 10 (more like 4-7), still I was 3 orders
of magnitude off. :)

> I thought it would be most natural to call wait_lapic_expire() before
> returning to userspace.  Then we avoid returning with
> expired_tscdeadline set, and userspace doesn't see the irq injected
> before the deadline.
> 
> So I checked the pre-conditions for wait_lapic_expire() and tripped over
> another issue.
> 
> I don't think it likes tsc_catchup=1, even where it's called now.

I agree that whenever the TSC becomes unstable the pre-expiration
feature should be turned off (and the hrtimer canceled and reset,
similar to kvm_set_lapic_tscdeadline_msr).

> - Don't support it?  Disable lapic_timer_advance_ns on systems without
> perfectly synchronized TSCs?  Has no-one noticed because no such systems
> have been configured?  Or does it escape their validation tests and this
> change could be perceived as a regression?

This feature is meant for real-time systems, you can expect much more
than just synchronized TSCs :) (for example you can expect SMIs to take
a very short and bounded time).

This should be a separate patch though.  Are you going to post v2 for
this one?

Paolo

> - Hack tsc_catchup to take the hrtimer expiry into account?  Disable
> busy-waits only for the old cpus lacking X86_FEATURE_CONSTANT_TSC?
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index bbb5b28..02a21d3 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -1310,7 +1310,7 @@  void wait_lapic_expire(struct kvm_vcpu *vcpu)

        /* __delay is delay_tsc whenever the hardware has TSC, thus always.  */
        if (guest_tsc < tsc_deadline)
-               __delay(tsc_deadline - guest_tsc);
+               __delay(max((tsc_deadline - guest_tsc),
lapic_timer_advance_ns));
 }