Message ID: CANRm+Cyu9vqobbiGqrN_EOXrDJCZVbf+BfU_tCZ=nxr3qS=zog@mail.gmail.com (mailing list archive)
State: New, archived
On 22/05/2016, Wanpeng Li <kernellwp@gmail.com> wrote:
> 2016-05-22 9:23 GMT+08:00 Wanpeng Li <kernellwp@gmail.com>:
>> 2016-05-20 19:47 GMT+08:00 Alan Jenkins
>> <alan.christopher.jenkins@gmail.com>:
>>> Hi
>>>
>>> I'm getting a hard lockup in wait_lapic_expire(). I'm not certain why,
>>> and it didn't seem to reproduce on a second setup. However I found a
>>> suspect in the code for TSC deadline timers. Maybe someone is interested
>>> in my analysis.
>>>
>>> If a guest has a TSC deadline timer set, it's not re-armed when the TSC
>>> is adjusted, and will fire at the wrong time. The hrtimer used for
>>> emulation is not recalculated. If the TSC was set _backwards_, I
>>> think it could trigger a lockup in wait_lapic_expire(). This function is
>>> a busy-loop optimization, which could be tricked into busy-looping for
>>> too long. The expected busy-wait is `lapic_timer_advance_ns`, which
>>> defaults to 0 (I haven't changed it).
>>
>> The default clockevent device for the hrtimer is the lapic, and the tsc
>> works as a clocksource device. Even if the tsc in the guest is
>> backwards/adjusted, the clockevent device is not influenced and works
>> normally I think, so we just need to keep the clockevent device firing
>> asap when it expires. How about something like below? (I can't reproduce
>> your bug, so I'll send out a formal patch after you confirm.)

I don't understand your explanation. I don't mind testing such a
patch, as it will fix the issue I experience :).

In my understanding this is only a workaround, so we should also add a
(ratelimited) warning message.

>> diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
>> index bbb5b28..02a21d3 100644
>> --- a/arch/x86/kvm/lapic.c
>> +++ b/arch/x86/kvm/lapic.c
>> @@ -1310,7 +1310,7 @@ void wait_lapic_expire(struct kvm_vcpu *vcpu)
>>
>>  	/* __delay is delay_tsc whenever the hardware has TSC, thus always. */
>>  	if (guest_tsc < tsc_deadline)
>> -		__delay(tsc_deadline - guest_tsc);
>> +		__delay(max((tsc_deadline - guest_tsc),
>> +			    lapic_timer_advance_ns));
>>  }
>
> s/max/min

Wait, I found another one:

>> -		__delay(tsc_deadline - guest_tsc);

(tsc_deadline - guest_tsc) gives a value in terms of guest TSC rate.
But guest TSCs are scaled. __delay() must expect a value in terms of
host TSC rate. So the busy-wait could take shorter or longer than
expected depending on how the guest TSC is scaled. Right?

Alan
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
On 22/05/2016, Alan Jenkins <alan.christopher.jenkins@gmail.com> wrote:
> On 22/05/2016, Wanpeng Li <kernellwp@gmail.com> wrote:
>> 2016-05-22 9:23 GMT+08:00 Wanpeng Li <kernellwp@gmail.com>:
>>> [earlier discussion quoted above, trimmed]
>
> I don't understand your explanation. I don't mind testing such a
> patch, as it will fix the issue I experience :).
>
> In my understanding this is only a workaround, so we should also add a
> (ratelimited) warning message.
>
>>> diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
>>> index bbb5b28..02a21d3 100644
>>> --- a/arch/x86/kvm/lapic.c
>>> +++ b/arch/x86/kvm/lapic.c
>>> @@ -1310,7 +1310,7 @@ void wait_lapic_expire(struct kvm_vcpu *vcpu)
>>>
>>>  	/* __delay is delay_tsc whenever the hardware has TSC, thus always. */
>>>  	if (guest_tsc < tsc_deadline)
>>> -		__delay(tsc_deadline - guest_tsc);
>>> +		__delay(max((tsc_deadline - guest_tsc),
>>> +			    lapic_timer_advance_ns));
>>>  }

But that's comparing TSC ticks and nanoseconds?

>> s/max/min
>
> Wait, I found another one:
>
>>> -		__delay(tsc_deadline - guest_tsc);
>
> (tsc_deadline - guest_tsc) gives a value in terms of guest TSC rate.
> But guest TSCs are scaled. __delay() must expect a value in terms of
> host TSC rate. So the busy-wait could take shorter or longer than
> expected depending on how the guest TSC is scaled. Right?

I wrote an expanded test patch, which is posted with full results on
the Fedora bug:
https://bugzilla.redhat.com/show_bug.cgi?id=1337667

It confirmed the lockup can be avoided by sanity-checking the delay in
wait_lapic_expire(). With my test patch, I get a warning message but
no lockup.

Wanpeng, I have another description here, maybe it is more convincing?
https://github.com/torvalds/linux/commit/5dda13b540c0baeebb69f612036e9e1d9d754ad8

My test patch also added checks for the guest TSC being set backwards.
The log confirms this is happening, though I haven't confirmed exactly
how the lockup is created. It's non-obvious whether there's a match
between the adjustments shown and the exact lockup duration.
(Conversion factor = reported guest tsc rate = 1.6 GHz = host tsc rate.)

Note the kernel log shows a VM being resumed from a snapshot, twice. I
don't stop the VM before I ask virt-manager to resume from snapshot
the second time. Normally the lockup happens on that second resume.

While I'm ranting: adding the TSC checks made me think I understand
enough to detail the "real" fix. Not tested:
https://github.com/torvalds/linux/compare/master...sourcejedi:kvm_lockup_fix?expand=1

Alan
On 23/05/2016 16:30, Alan Jenkins wrote:
> https://github.com/torvalds/linux/compare/master...sourcejedi:kvm_lockup_fix?expand=1
Here the comparison (and the expired_advance_ns field) is only used to
issue the warning. Could you change the patch to use __delay instead of
ndelay and omit the warning?
+	if (guest_tsc < tsc_deadline) {
+		unsigned long this_tsc_khz = vcpu->arch.virtual_tsc_khz;
+		u64 delay_ns;
+
+		delay_ns = (tsc_deadline - guest_tsc) * 1000000ULL;
+		do_div(delay_ns, this_tsc_khz);
+
+		if (delay_ns > apic->lapic_timer.expired_advance_ns) {
+			vcpu_err(vcpu,
+				 "timer for tsc_deadline fired too far in advance: %lluns\n",
+				 delay_ns);
+			WARN_ON(1);
+
+			/* avoid lockup. emulated timer will fire early */
+			delay_ns = apic->lapic_timer.expired_advance_ns;
+		}
+
+		ndelay(delay_ns);
+	}
Thanks,
Paolo
On 24/05/16 13:51, Paolo Bonzini wrote:
> On 23/05/2016 16:30, Alan Jenkins wrote:
>> https://github.com/torvalds/linux/compare/master...sourcejedi:kvm_lockup_fix?expand=1
>
> Here the comparison (and the expired_advance_ns field) is only used to
> issue the warning.

No, because expired_advance_ns is used as the limit for the busy-wait.

> Could you change the patch to use __delay instead of
> ndelay and

No, because the limit is in ns, not host TSC ticks. Also, we were
passing _guest_ TSC ticks to __delay(), which looks like a second
potential cause of lockups, because guest TSC ticks are scaled.

> omit the warning?

Certainly, and I have a positive suggestion to make the code more
agreeable if it doesn't need WARN_ON(). Thanks for reviewing this.
I'll send a revised version of the patch after testing.

The field expired_advance_ns was an attempt to prevent warnings (and
premature timer interrupts) in case lapic_timer_advance_ns is tuned
downwards while we're running. This was a third potential case I
noticed which could have caused a lockup. The problem was I didn't
want to make a user-triggerable WARN_ON. But if we don't want WARN_ON
at all, then maybe it's ok to compare the module parameter racily, so
we don't need the extra field. There's no existing documentation which
I could update, to say that reducing lapic_timer_advance_ns is not
supported, right? (Because even with checks, it means the guest
deadline timer can fire when the TSC is showing a time prior to the
deadline, so it could break guest assumptions.)

I wanted to log _a_ warning because I'm writing another patch there
(if you've seen the parent commit) which is supposed to prevent this
ever happening. A better alternative would be to have the workaround
first (and for -stable), without the warning. (I assume you still
dislike WARN_ON with the other patch, but I would try to sell you on
at least printk_once(KERN_ERR ...) :).
On 24/05/2016 16:20, Alan Jenkins wrote:
> On 24/05/16 13:51, Paolo Bonzini wrote:
>> On 23/05/2016 16:30, Alan Jenkins wrote:
>>> https://github.com/torvalds/linux/compare/master...sourcejedi:kvm_lockup_fix?expand=1
>>
>> Here the comparison (and the expired_advance_ns field) is only used to
>> issue the warning.
>
> No, because expired_advance_ns is used as the limit for the busy-wait.

Right, but I expected the limit to go away together with the warning
in the final patch, if the root cause can be fixed.

>> Could you change the patch to use __delay instead of
>> ndelay and
>
> No, because the limit is in ns, not host TSC ticks. Also, we were
> passing _guest_ TSC ticks to __delay(), which looks like a second
> potential cause of lockups because guest TSC ticks are scaled.

This is correct. It's missing a call to kvm_scale_tsc if you keep the
__delay.

> The problem was I didn't want to make a user-triggerable WARN_ON. But if
> we don't want WARN_ON at all, then maybe it's ok to compare the module
> parameter racily, so we don't need the extra field. There's no existing
> documentation which I could update, to say that reducing
> lapic_timer_advance_ns is not supported, right? (Because even with
> checks, it means the guest deadline timer can fire when the TSC is
> showing a time prior to the deadline, so it could break guest
> assumptions.)

I'd just make lapic_timer_advance_ns read-only, I think.

Thanks,

Paolo
On 09/06/2016 11:31, Alan Jenkins wrote:
> diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
> index 1a2da0e..84abb1a 100644
> --- a/arch/x86/kvm/lapic.c
> +++ b/arch/x86/kvm/lapic.c
> @@ -1273,7 +1273,6 @@ static void apic_timer_expired(struct kvm_lapic *apic)
>  {
>  	struct kvm_vcpu *vcpu = apic->vcpu;
>  	struct swait_queue_head *q = &vcpu->wq;
> -	struct kvm_timer *ktimer = &apic->lapic_timer;
>
>  	if (atomic_read(&apic->lapic_timer.pending))
>  		return;
> @@ -1283,9 +1282,6 @@ static void apic_timer_expired(struct kvm_lapic *apic)
>
>  	if (swait_active(q))
>  		swake_up(q);
> -
> -	if (apic_lvtt_tscdeadline(apic))
> -		ktimer->expired_tscdeadline = ktimer->tscdeadline;
>  }
>
>  /*
> @@ -1922,12 +1918,15 @@ int kvm_apic_accept_pic_intr(struct kvm_vcpu *vcpu)
>  void kvm_inject_apic_timer_irqs(struct kvm_vcpu *vcpu)
>  {
>  	struct kvm_lapic *apic = vcpu->arch.apic;
> +	struct kvm_timer *ktimer = &apic->lapic_timer;
>
> -	if (atomic_read(&apic->lapic_timer.pending) > 0) {
> +	if (atomic_read(&ktimer->pending) > 0) {
>  		kvm_apic_local_deliver(apic, APIC_LVTT);
> -		if (apic_lvtt_tscdeadline(apic))
> -			apic->lapic_timer.tscdeadline = 0;
> -		atomic_set(&apic->lapic_timer.pending, 0);
> +		if (apic_lvtt_tscdeadline(apic)) {
> +			ktimer->expired_tscdeadline = ktimer->tscdeadline;
> +			ktimer->tscdeadline = 0;
> +		}
> +		atomic_set(&ktimer->pending, 0);
>  	}
>  }
>
> @@ -2007,6 +2006,71 @@ void __kvm_migrate_apic_timer(struct kvm_vcpu *vcpu)
>  }
>
>  /*
> + * kvm_apic_tsc_update - called when guest or host changes the TSC
> + *
> + * Update the emulated TSC deadline timer.
> + *
> + * It is also required if the host changes TSC frequency scaling. It is not
> + * required when we "catch up" the TSC, because that is maintaining the
> + * relationship between TSC and real time.
> + */
> +void kvm_apic_tsc_update(struct kvm_vcpu *vcpu)
> +{
> +	struct kvm_lapic *apic = vcpu->arch.apic;
> +	struct kvm_timer *ktimer = &apic->lapic_timer;
> +
> +	if (!lapic_in_kernel(vcpu))
> +		return;
> +
> +	if (!apic_lvtt_tscdeadline(apic))
> +		return;
> +
> +	hrtimer_cancel(&ktimer->timer);
> +
> +	/*
> +	 * Handle when the guest TSC is adjusted backwards, just after
> +	 * inject_apic_timer_irqs(). When we hit wait_lapic_expire(), we risk
> +	 * an extended busy-wait, because ktimer->expired_tscdeadline will
> +	 * have receded further into the future.
> +	 *
> +	 * The guest does not get a chance to run in this window, but the host
> +	 * could modify the TSC at this point. This race could happen when
> +	 * restoring a live snapshot of a VM.
> +	 *
> +	 * Just clear the busy-wait. In theory this could cause a wakeup
> +	 * before the deadline, if the user configured the busy-wait
> +	 * feature (lapic_timer_advance_ns). We don't expect this to matter:

Why not just delay by lapic_timer_advance_ns here? It's usually around
10 nanoseconds, it should be cheap.

Paolo
On 13/06/16 09:23, Paolo Bonzini wrote:
> On 09/06/2016 11:31, Alan Jenkins wrote:
>> [patch quoted above, trimmed]
>
> Why not just delay by lapic_timer_advance_ns here? It's usually around
> 10 nanoseconds, it should be cheap.
>
> Paolo

I did some more thinking since. Page-long justifications seem
sub-optimal. I agree with your point, and I'm looking into it. (Though
I think it's around 10 microseconds. 40 cpu cycles sounds very short
:-P.)

Thanks
Alan

I thought it would be most natural to call wait_lapic_expire() before
returning to userspace. Then we avoid returning with
expired_tscdeadline set, and userspace doesn't see the irq injected
before the deadline.

So I checked the pre-conditions for wait_lapic_expire() and tripped
over another issue.

I don't think it likes tsc_catchup=1, even where it's called now.
vcpu_load() can make the guest TSC seem to stand still for a bit, and
the "catchup" part is driven from get_kernel_ns(), i.e. the same
resolution as the kernel timer tick. The busy-wait could be extended
to 1ms (HZ=1000), so this low-latency feature defeats itself again :-(.

Like the longer busy-wait issue, it wouldn't be limited to systems
where the busy-wait feature has been configured. That could explain
why I saw short busy-waits up to 0.5ms while running a VM, because
that happened after I'd triggered the long busy-wait, and the TSC was
marked unstable by the clocksource watchdog.

- Don't support it? Disable lapic_timer_advance_ns on systems without
  perfectly synchronized TSCs? Has no-one noticed because no such
  systems have been configured? Or does it escape their validation
  tests, so this change could be perceived as a regression?

- Hack tsc_catchup to take the hrtimer expiry into account? Disable
  busy-waits only for the old cpus lacking X86_FEATURE_CONSTANT_TSC?
On 13/06/2016 12:33, Alan Jenkins wrote:
> I did some more thinking since. Page-long justifications seem
> sub-optimal. I agree with your point, and I'm looking into it. (Though
> I think it's around 10 microseconds. 40 cpu cycles sounds very short
> :-P.)

Yeah. It's actually less than 10 (more like 4-7), still I was 3 orders
of magnitude off. :)

> I thought it would be most natural to call wait_lapic_expire() before
> returning to userspace. Then we avoid returning with
> expired_tscdeadline set, and userspace doesn't see the irq injected
> before the deadline.
>
> So I checked the pre-conditions for wait_lapic_expire() and tripped
> over another issue.
>
> I don't think it likes tsc_catchup=1, even where it's called now.

I agree that whenever the TSC becomes unstable the pre-expiration
feature should be turned off (and the hrtimer canceled and reset,
similar to kvm_set_lapic_tscdeadline_msr).

> - Don't support it? Disable lapic_timer_advance_ns on systems without
>   perfectly synchronized TSCs? Has no-one noticed because no such
>   systems have been configured? Or does it escape their validation
>   tests, so this change could be perceived as a regression?

This feature is meant for real-time systems; you can expect much more
than just synchronized TSCs :) (for example, you can expect SMIs to
take a very short and bounded time).

This should be a separate patch though. Are you going to post v2 for
this one?

Paolo

> - Hack tsc_catchup to take the hrtimer expiry into account? Disable
>   busy-waits only for the old cpus lacking X86_FEATURE_CONSTANT_TSC?
diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index bbb5b28..02a21d3 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -1310,7 +1310,7 @@ void wait_lapic_expire(struct kvm_vcpu *vcpu)
 
 	/* __delay is delay_tsc whenever the hardware has TSC, thus always. */
 	if (guest_tsc < tsc_deadline)
-		__delay(tsc_deadline - guest_tsc);
+		__delay(max((tsc_deadline - guest_tsc), lapic_timer_advance_ns));
 }