Message ID: 20190412201834.10831-3-sean.j.christopherson@intel.com (mailing list archive)
State: New, archived
Series: KVM: lapic: Fix a variety of timer adv issues
> On 12 Apr 2019, at 23:18, Sean Christopherson <sean.j.christopherson@intel.com> wrote:
>
> To minimize the latency of timer interrupts as observed by the guest,
> KVM adjusts the values it programs into the host timers to account for
> the host's overhead of programming and handling the timer event. In
> the event that the adjustments are too aggressive, i.e. the timer fires
> earlier than the guest expects, KVM busy waits immediately prior to
> entering the guest.
>
> Currently, KVM manually converts the delay from nanoseconds to clock
> cycles. But, the conversion is done in the guest's time domain, while
> the delay occurs in the host's time domain, i.e. the delay may not be
> accurate and could wait too little or too long.

Just to make sure I understand the issue you see here:
__delay() is called with a parameter which is the number of guest cycles
required to get from the current guest cycle (guest_tsc) to the deadline
cycle (tsc_deadline). But in case the guest vCPU frequency is different
than the physical CPU frequency, this parameter should further be converted
from guest cycles to host cycles.
Do I understand correctly? If yes, I think you should elaborate this part a
bit more in the commit message, as this is quite confusing. :)

> Sidestep the headache
> of shifting time domains by delaying 1ns at a time and breaking the loop
> when the guest's desired timer delay has been satisfied. Because the
> advancement, which caps the delay to avoid unbounded busy waiting, is
> stored in nanoseconds, the current advancement time can simply be used
> as a loop terminator since we're delaying 1ns at a time (plus the few
> cycles of overhead for running the code).

Looking at start_sw_tscdeadline(), we have the same issue where we need to
convert guest cycles to host ns time in order to set the hrtimer expiration
time. This is simply done by:

	host_ns = guest_cycles * 1000000ULL;
	do_div(host_ns, vcpu->arch.virtual_tsc_khz);

Why shouldn't we use the same approach for constructing the __delay()
parameter from the guest cycles currently passed to it? i.e. something
such as:

	delay_guest_cycles = min(tsc_deadline - guest_tsc,
				 nsec_to_cycles(vcpu, lapic_timer_advance_ns));
	delay_host_ns = delay_guest_cycles * 1000000ULL;
	do_div(delay_host_ns, vcpu->arch.virtual_tsc_khz);
	__delay(delay_host_ns);

-Liran

> Cc: Liran Alon <liran.alon@oracle.com>
> Cc: Wanpeng Li <wanpengli@tencent.com>
> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
> ---
> arch/x86/kvm/lapic.c | 13 +++++++------
> 1 file changed, 7 insertions(+), 6 deletions(-)
>
> diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
> index 92446cba9b24..e797e3145a8b 100644
> --- a/arch/x86/kvm/lapic.c
> +++ b/arch/x86/kvm/lapic.c
> @@ -1486,7 +1486,8 @@ static bool lapic_timer_int_injected(struct kvm_vcpu *vcpu)
> void wait_lapic_expire(struct kvm_vcpu *vcpu)
> {
> 	struct kvm_lapic *apic = vcpu->arch.apic;
> -	u64 guest_tsc, tsc_deadline, ns;
> +	u32 timer_advance_ns = lapic_timer_advance_ns;
> +	u64 guest_tsc, tmp_tsc, tsc_deadline, ns;

Nit: I would rename guest_tsc and tmp_tsc to guest_tsc_on_exit and
current_guest_tsc respectively.

>
> 	if (!lapic_in_kernel(vcpu))
> 		return;
> @@ -1499,13 +1500,13 @@ void wait_lapic_expire(struct kvm_vcpu *vcpu)
>
> 	tsc_deadline = apic->lapic_timer.expired_tscdeadline;
> 	apic->lapic_timer.expired_tscdeadline = 0;
> -	guest_tsc = kvm_read_l1_tsc(vcpu, rdtsc());
> +	tmp_tsc = guest_tsc = kvm_read_l1_tsc(vcpu, rdtsc());
> 	trace_kvm_wait_lapic_expire(vcpu->vcpu_id, guest_tsc - tsc_deadline);
>
> -	/* __delay is delay_tsc whenever the hardware has TSC, thus always. */
> -	if (guest_tsc < tsc_deadline)
> -		__delay(min(tsc_deadline - guest_tsc,
> -			nsec_to_cycles(vcpu, lapic_timer_advance_ns)));
> +	for (ns = 0; tmp_tsc < tsc_deadline && ns < timer_advance_ns; ns++) {
> +		ndelay(1);
> +		tmp_tsc = kvm_read_l1_tsc(vcpu, rdtsc());
> +	}
>
> 	if (!lapic_timer_advance_adjust_done) {
> 		/* too early */
> --
> 2.21.0
>
On Sun, Apr 14, 2019 at 02:25:44PM +0300, Liran Alon wrote:
> > On 12 Apr 2019, at 23:18, Sean Christopherson <sean.j.christopherson@intel.com> wrote:
> >
> > To minimize the latency of timer interrupts as observed by the guest,
> > KVM adjusts the values it programs into the host timers to account for
> > the host's overhead of programming and handling the timer event. In
> > the event that the adjustments are too aggressive, i.e. the timer fires
> > earlier than the guest expects, KVM busy waits immediately prior to
> > entering the guest.
> >
> > Currently, KVM manually converts the delay from nanoseconds to clock
> > cycles. But, the conversion is done in the guest's time domain, while
> > the delay occurs in the host's time domain, i.e. the delay may not be
> > accurate and could wait too little or too long.
>
> Just to make sure I understand the issue you see here:
> __delay() is called with a parameter which is the number of guest cycles
> required to get from the current guest cycle (guest_tsc) to the deadline
> cycle (tsc_deadline). But in case the guest vCPU frequency is different
> than the physical CPU frequency, this parameter should further be
> converted from guest cycles to host cycles.
> Do I understand correctly? If yes, I think you should elaborate this part
> a bit more in the commit message, as this is quite confusing. :)

Heh, I didn't elaborate precisely because it was so darn confusing. I
didn't want to say something completely incorrect so I kept it high level.

> > Sidestep the headache
> > of shifting time domains by delaying 1ns at a time and breaking the loop
> > when the guest's desired timer delay has been satisfied. Because the
> > advancement, which caps the delay to avoid unbounded busy waiting, is
> > stored in nanoseconds, the current advancement time can simply be used
> > as a loop terminator since we're delaying 1ns at a time (plus the few
> > cycles of overhead for running the code).
>
> Looking at start_sw_tscdeadline(), we have the same issue where we need
> to convert guest cycles to host ns time in order to set the hrtimer
> expiration time. This is simply done by:
>
> 	host_ns = guest_cycles * 1000000ULL;
> 	do_div(host_ns, vcpu->arch.virtual_tsc_khz);
>
> Why shouldn't we use the same approach for constructing the __delay()
> parameter from the guest cycles currently passed to it? i.e. something
> such as:
>
> 	delay_guest_cycles = min(tsc_deadline - guest_tsc,
> 				 nsec_to_cycles(vcpu, lapic_timer_advance_ns));
> 	delay_host_ns = delay_guest_cycles * 1000000ULL;
> 	do_div(delay_host_ns, vcpu->arch.virtual_tsc_khz);
> 	__delay(delay_host_ns);

That's also an option. I opted for the loop as it felt simpler and more
obvious. And explicitly rechecking kvm_read_l1_tsc() guarantees the
guest won't see an "early" TSC unless the cap kicks in. That being said,
I don't have a strong opinion either way.

As for the code, we'd probably want to do the conversion first and cap
second so as to avoid converting lapic_timer_advance_ns to the guest's
TSC and then back to ns. It should use ndelay(), and maybe just call it
"delay_ns" to avoid (temporarily) assigning a guest TSC calculation to a
host variable, e.g.:

	delay_ns = (tsc_deadline - guest_tsc) * 1000000ULL;
	do_div(delay_ns, vcpu->arch.virtual_tsc_khz);
	ndelay(min(delay_ns, lapic_timer_advance_ns));

That's actually nowhere near as bad as I thought it would be now that
it's typed out...

> -Liran
>
> > Cc: Liran Alon <liran.alon@oracle.com>
> > Cc: Wanpeng Li <wanpengli@tencent.com>
> > Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
> > ---
> > arch/x86/kvm/lapic.c | 13 +++++++------
> > 1 file changed, 7 insertions(+), 6 deletions(-)
> >
> > diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
> > index 92446cba9b24..e797e3145a8b 100644
> > --- a/arch/x86/kvm/lapic.c
> > +++ b/arch/x86/kvm/lapic.c
> > @@ -1486,7 +1486,8 @@ static bool lapic_timer_int_injected(struct kvm_vcpu *vcpu)
> > void wait_lapic_expire(struct kvm_vcpu *vcpu)
> > {
> > 	struct kvm_lapic *apic = vcpu->arch.apic;
> > -	u64 guest_tsc, tsc_deadline, ns;
> > +	u32 timer_advance_ns = lapic_timer_advance_ns;
> > +	u64 guest_tsc, tmp_tsc, tsc_deadline, ns;
>
> Nit: I would rename guest_tsc and tmp_tsc to guest_tsc_on_exit and
> current_guest_tsc respectively.

guest_tsc_on_exit isn't correct though, as it holds the guest's TSC at the
beginning of wait_lapic_expire(), e.g. guest_tsc_start would be more
appropriate. My main motivation for not using verbose names is to keep
lines short, i.e. avoid wrapping at 80 chars. For me, wrapping a bunch of
lines makes the code more difficult to read than less verbose names.

> >
> > 	if (!lapic_in_kernel(vcpu))
> > 		return;
> > @@ -1499,13 +1500,13 @@ void wait_lapic_expire(struct kvm_vcpu *vcpu)
> >
> > 	tsc_deadline = apic->lapic_timer.expired_tscdeadline;
> > 	apic->lapic_timer.expired_tscdeadline = 0;
> > -	guest_tsc = kvm_read_l1_tsc(vcpu, rdtsc());
> > +	tmp_tsc = guest_tsc = kvm_read_l1_tsc(vcpu, rdtsc());
> > 	trace_kvm_wait_lapic_expire(vcpu->vcpu_id, guest_tsc - tsc_deadline);
> >
> > -	/* __delay is delay_tsc whenever the hardware has TSC, thus always. */
> > -	if (guest_tsc < tsc_deadline)
> > -		__delay(min(tsc_deadline - guest_tsc,
> > -			nsec_to_cycles(vcpu, lapic_timer_advance_ns)));
> > +	for (ns = 0; tmp_tsc < tsc_deadline && ns < timer_advance_ns; ns++) {
> > +		ndelay(1);
> > +		tmp_tsc = kvm_read_l1_tsc(vcpu, rdtsc());
> > +	}
> >
> > 	if (!lapic_timer_advance_adjust_done) {
> > 		/* too early */
> > --
> > 2.21.0
> >
> On 15 Apr 2019, at 19:11, Sean Christopherson <sean.j.christopherson@intel.com> wrote:
>
> On Sun, Apr 14, 2019 at 02:25:44PM +0300, Liran Alon wrote:
>>> On 12 Apr 2019, at 23:18, Sean Christopherson <sean.j.christopherson@intel.com> wrote:
>>>
>>> To minimize the latency of timer interrupts as observed by the guest,
>>> KVM adjusts the values it programs into the host timers to account for
>>> the host's overhead of programming and handling the timer event. In
>>> the event that the adjustments are too aggressive, i.e. the timer fires
>>> earlier than the guest expects, KVM busy waits immediately prior to
>>> entering the guest.
>>>
>>> Currently, KVM manually converts the delay from nanoseconds to clock
>>> cycles. But, the conversion is done in the guest's time domain, while
>>> the delay occurs in the host's time domain, i.e. the delay may not be
>>> accurate and could wait too little or too long.
>>
>> Just to make sure I understand the issue you see here:
>> __delay() is called with a parameter which is the number of guest cycles
>> required to get from the current guest cycle (guest_tsc) to the deadline
>> cycle (tsc_deadline). But in case the guest vCPU frequency is different
>> than the physical CPU frequency, this parameter should further be
>> converted from guest cycles to host cycles.
>> Do I understand correctly? If yes, I think you should elaborate this
>> part a bit more in the commit message, as this is quite confusing. :)
>
> Heh, I didn't elaborate precisely because it was so darn confusing. I
> didn't want to say something completely incorrect so I kept it high level.

Heh, I can agree with that. :) But we should make sure others who read the
commit message won't get confused.

>>> Sidestep the headache
>>> of shifting time domains by delaying 1ns at a time and breaking the loop
>>> when the guest's desired timer delay has been satisfied. Because the
>>> advancement, which caps the delay to avoid unbounded busy waiting, is
>>> stored in nanoseconds, the current advancement time can simply be used
>>> as a loop terminator since we're delaying 1ns at a time (plus the few
>>> cycles of overhead for running the code).
>>
>> Looking at start_sw_tscdeadline(), we have the same issue where we need
>> to convert guest cycles to host ns time in order to set the hrtimer
>> expiration time. This is simply done by:
>>
>> 	host_ns = guest_cycles * 1000000ULL;
>> 	do_div(host_ns, vcpu->arch.virtual_tsc_khz);
>>
>> Why shouldn't we use the same approach for constructing the __delay()
>> parameter from the guest cycles currently passed to it? i.e. something
>> such as:
>>
>> 	delay_guest_cycles = min(tsc_deadline - guest_tsc,
>> 				 nsec_to_cycles(vcpu, lapic_timer_advance_ns));
>> 	delay_host_ns = delay_guest_cycles * 1000000ULL;
>> 	do_div(delay_host_ns, vcpu->arch.virtual_tsc_khz);
>> 	__delay(delay_host_ns);
>
> That's also an option. I opted for the loop as it felt simpler and more
> obvious. And explicitly rechecking kvm_read_l1_tsc() guarantees the
> guest won't see an "early" TSC unless the cap kicks in. That being said,
> I don't have a strong opinion either way.
>
> As for the code, we'd probably want to do the conversion first and cap
> second so as to avoid converting lapic_timer_advance_ns to the guest's
> TSC and then back to ns. It should use ndelay(), and maybe just call it
> "delay_ns" to avoid (temporarily) assigning a guest TSC calculation to a
> host variable, e.g.:
>
> 	delay_ns = (tsc_deadline - guest_tsc) * 1000000ULL;
> 	do_div(delay_ns, vcpu->arch.virtual_tsc_khz);
> 	ndelay(min(delay_ns, lapic_timer_advance_ns));
>
> That's actually nowhere near as bad as I thought it would be now that
> it's typed out...

Yes, I agree this code looks nice. I think it's better. I recommend
sending a v2 with this version.

>> -Liran
>>
>>> Cc: Liran Alon <liran.alon@oracle.com>
>>> Cc: Wanpeng Li <wanpengli@tencent.com>
>>> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
>>> ---
>>> arch/x86/kvm/lapic.c | 13 +++++++------
>>> 1 file changed, 7 insertions(+), 6 deletions(-)
>>>
>>> diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
>>> index 92446cba9b24..e797e3145a8b 100644
>>> --- a/arch/x86/kvm/lapic.c
>>> +++ b/arch/x86/kvm/lapic.c
>>> @@ -1486,7 +1486,8 @@ static bool lapic_timer_int_injected(struct kvm_vcpu *vcpu)
>>> void wait_lapic_expire(struct kvm_vcpu *vcpu)
>>> {
>>> 	struct kvm_lapic *apic = vcpu->arch.apic;
>>> -	u64 guest_tsc, tsc_deadline, ns;
>>> +	u32 timer_advance_ns = lapic_timer_advance_ns;
>>> +	u64 guest_tsc, tmp_tsc, tsc_deadline, ns;
>>
>> Nit: I would rename guest_tsc and tmp_tsc to guest_tsc_on_exit and
>> current_guest_tsc respectively.
>
> guest_tsc_on_exit isn't correct though, as it holds the guest's TSC at
> the beginning of wait_lapic_expire(), e.g. guest_tsc_start would be more
> appropriate. My main motivation for not using verbose names is to keep
> lines short, i.e. avoid wrapping at 80 chars. For me, wrapping a bunch of
> lines makes the code more difficult to read than less verbose names.

I think it doesn't matter now if we agree with the above mentioned
suggestion of avoiding this loop. :)

-Liran

>>
>>> 	if (!lapic_in_kernel(vcpu))
>>> 		return;
>>> @@ -1499,13 +1500,13 @@ void wait_lapic_expire(struct kvm_vcpu *vcpu)
>>>
>>> 	tsc_deadline = apic->lapic_timer.expired_tscdeadline;
>>> 	apic->lapic_timer.expired_tscdeadline = 0;
>>> -	guest_tsc = kvm_read_l1_tsc(vcpu, rdtsc());
>>> +	tmp_tsc = guest_tsc = kvm_read_l1_tsc(vcpu, rdtsc());
>>> 	trace_kvm_wait_lapic_expire(vcpu->vcpu_id, guest_tsc - tsc_deadline);
>>>
>>> -	/* __delay is delay_tsc whenever the hardware has TSC, thus always. */
>>> -	if (guest_tsc < tsc_deadline)
>>> -		__delay(min(tsc_deadline - guest_tsc,
>>> -			nsec_to_cycles(vcpu, lapic_timer_advance_ns)));
>>> +	for (ns = 0; tmp_tsc < tsc_deadline && ns < timer_advance_ns; ns++) {
>>> +		ndelay(1);
>>> +		tmp_tsc = kvm_read_l1_tsc(vcpu, rdtsc());
>>> +	}
>>>
>>> 	if (!lapic_timer_advance_adjust_done) {
>>> 		/* too early */
>>> --
>>> 2.21.0
>>>
On 12/04/19 22:18, Sean Christopherson wrote:
> To minimize the latency of timer interrupts as observed by the guest,
> KVM adjusts the values it programs into the host timers to account for
> the host's overhead of programming and handling the timer event. In
> the event that the adjustments are too aggressive, i.e. the timer fires
> earlier than the guest expects, KVM busy waits immediately prior to
> entering the guest.
>
> Currently, KVM manually converts the delay from nanoseconds to clock
> cycles. But, the conversion is done in the guest's time domain, while
> the delay occurs in the host's time domain, i.e. the delay may not be
> accurate and could wait too little or too long. Sidestep the headache
> of shifting time domains by delaying 1ns at a time and breaking the loop
> when the guest's desired timer delay has been satisfied. Because the
> advancement, which caps the delay to avoid unbounded busy waiting, is
> stored in nanoseconds, the current advancement time can simply be used
> as a loop terminator since we're delaying 1ns at a time (plus the few
> cycles of overhead for running the code).

A 1 nanosecond advance (3-5 clock cycles) is even shorter than the time
taken to execute kvm_read_l1_tsc. I would just replace it with
cpu_relax(); I can do the change when applying.

Paolo

> Cc: Liran Alon <liran.alon@oracle.com>
> Cc: Wanpeng Li <wanpengli@tencent.com>
> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
> ---
> arch/x86/kvm/lapic.c | 13 +++++++------
> 1 file changed, 7 insertions(+), 6 deletions(-)
>
> diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
> index 92446cba9b24..e797e3145a8b 100644
> --- a/arch/x86/kvm/lapic.c
> +++ b/arch/x86/kvm/lapic.c
> @@ -1486,7 +1486,8 @@ static bool lapic_timer_int_injected(struct kvm_vcpu *vcpu)
> void wait_lapic_expire(struct kvm_vcpu *vcpu)
> {
> 	struct kvm_lapic *apic = vcpu->arch.apic;
> -	u64 guest_tsc, tsc_deadline, ns;
> +	u32 timer_advance_ns = lapic_timer_advance_ns;
> +	u64 guest_tsc, tmp_tsc, tsc_deadline, ns;
>
> 	if (!lapic_in_kernel(vcpu))
> 		return;
> @@ -1499,13 +1500,13 @@ void wait_lapic_expire(struct kvm_vcpu *vcpu)
>
> 	tsc_deadline = apic->lapic_timer.expired_tscdeadline;
> 	apic->lapic_timer.expired_tscdeadline = 0;
> -	guest_tsc = kvm_read_l1_tsc(vcpu, rdtsc());
> +	tmp_tsc = guest_tsc = kvm_read_l1_tsc(vcpu, rdtsc());
> 	trace_kvm_wait_lapic_expire(vcpu->vcpu_id, guest_tsc - tsc_deadline);
>
> -	/* __delay is delay_tsc whenever the hardware has TSC, thus always. */
> -	if (guest_tsc < tsc_deadline)
> -		__delay(min(tsc_deadline - guest_tsc,
> -			nsec_to_cycles(vcpu, lapic_timer_advance_ns)));
> +	for (ns = 0; tmp_tsc < tsc_deadline && ns < timer_advance_ns; ns++) {
> +		ndelay(1);
> +		tmp_tsc = kvm_read_l1_tsc(vcpu, rdtsc());
> +	}
>
> 	if (!lapic_timer_advance_adjust_done) {
> 		/* too early */
>
> On 16 Apr 2019, at 14:02, Paolo Bonzini <pbonzini@redhat.com> wrote:
>
> On 12/04/19 22:18, Sean Christopherson wrote:
>> To minimize the latency of timer interrupts as observed by the guest,
>> KVM adjusts the values it programs into the host timers to account for
>> the host's overhead of programming and handling the timer event. In
>> the event that the adjustments are too aggressive, i.e. the timer fires
>> earlier than the guest expects, KVM busy waits immediately prior to
>> entering the guest.
>>
>> Currently, KVM manually converts the delay from nanoseconds to clock
>> cycles. But, the conversion is done in the guest's time domain, while
>> the delay occurs in the host's time domain, i.e. the delay may not be
>> accurate and could wait too little or too long. Sidestep the headache
>> of shifting time domains by delaying 1ns at a time and breaking the loop
>> when the guest's desired timer delay has been satisfied. Because the
>> advancement, which caps the delay to avoid unbounded busy waiting, is
>> stored in nanoseconds, the current advancement time can simply be used
>> as a loop terminator since we're delaying 1ns at a time (plus the few
>> cycles of overhead for running the code).
>
> A 1 nanosecond advance (3-5 clock cycles) is even shorter than the time
> taken to execute kvm_read_l1_tsc. I would just replace it with
> cpu_relax(); I can do the change when applying.
>
> Paolo

Paolo, there is also another approach Sean and I were discussing.
I think it is more elegant.
See previous comments in this thread.

-Liran

>> Cc: Liran Alon <liran.alon@oracle.com>
>> Cc: Wanpeng Li <wanpengli@tencent.com>
>> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
>> ---
>> arch/x86/kvm/lapic.c | 13 +++++++------
>> 1 file changed, 7 insertions(+), 6 deletions(-)
>>
>> diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
>> index 92446cba9b24..e797e3145a8b 100644
>> --- a/arch/x86/kvm/lapic.c
>> +++ b/arch/x86/kvm/lapic.c
>> @@ -1486,7 +1486,8 @@ static bool lapic_timer_int_injected(struct kvm_vcpu *vcpu)
>> void wait_lapic_expire(struct kvm_vcpu *vcpu)
>> {
>> 	struct kvm_lapic *apic = vcpu->arch.apic;
>> -	u64 guest_tsc, tsc_deadline, ns;
>> +	u32 timer_advance_ns = lapic_timer_advance_ns;
>> +	u64 guest_tsc, tmp_tsc, tsc_deadline, ns;
>>
>> 	if (!lapic_in_kernel(vcpu))
>> 		return;
>> @@ -1499,13 +1500,13 @@ void wait_lapic_expire(struct kvm_vcpu *vcpu)
>>
>> 	tsc_deadline = apic->lapic_timer.expired_tscdeadline;
>> 	apic->lapic_timer.expired_tscdeadline = 0;
>> -	guest_tsc = kvm_read_l1_tsc(vcpu, rdtsc());
>> +	tmp_tsc = guest_tsc = kvm_read_l1_tsc(vcpu, rdtsc());
>> 	trace_kvm_wait_lapic_expire(vcpu->vcpu_id, guest_tsc - tsc_deadline);
>>
>> -	/* __delay is delay_tsc whenever the hardware has TSC, thus always. */
>> -	if (guest_tsc < tsc_deadline)
>> -		__delay(min(tsc_deadline - guest_tsc,
>> -			nsec_to_cycles(vcpu, lapic_timer_advance_ns)));
>> +	for (ns = 0; tmp_tsc < tsc_deadline && ns < timer_advance_ns; ns++) {
>> +		ndelay(1);
>> +		tmp_tsc = kvm_read_l1_tsc(vcpu, rdtsc());
>> +	}
>>
>> 	if (!lapic_timer_advance_adjust_done) {
>> 		/* too early */
>>
>
On 16/04/19 13:04, Liran Alon wrote:
>> A 1 nanosecond advance (3-5 clock cycles) is even shorter than the time
>> taken to execute kvm_read_l1_tsc. I would just replace it with
>> cpu_relax(); I can do the change when applying.
>
> Paolo, there is also another approach Sean and I were discussing.
> I think it is more elegant.
> See previous comments in this thread.

Yep, I've seen it now.

Paolo
diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index 92446cba9b24..e797e3145a8b 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -1486,7 +1486,8 @@ static bool lapic_timer_int_injected(struct kvm_vcpu *vcpu)
 void wait_lapic_expire(struct kvm_vcpu *vcpu)
 {
 	struct kvm_lapic *apic = vcpu->arch.apic;
-	u64 guest_tsc, tsc_deadline, ns;
+	u32 timer_advance_ns = lapic_timer_advance_ns;
+	u64 guest_tsc, tmp_tsc, tsc_deadline, ns;
 
 	if (!lapic_in_kernel(vcpu))
 		return;
@@ -1499,13 +1500,13 @@ void wait_lapic_expire(struct kvm_vcpu *vcpu)
 
 	tsc_deadline = apic->lapic_timer.expired_tscdeadline;
 	apic->lapic_timer.expired_tscdeadline = 0;
-	guest_tsc = kvm_read_l1_tsc(vcpu, rdtsc());
+	tmp_tsc = guest_tsc = kvm_read_l1_tsc(vcpu, rdtsc());
 	trace_kvm_wait_lapic_expire(vcpu->vcpu_id, guest_tsc - tsc_deadline);
 
-	/* __delay is delay_tsc whenever the hardware has TSC, thus always. */
-	if (guest_tsc < tsc_deadline)
-		__delay(min(tsc_deadline - guest_tsc,
-			nsec_to_cycles(vcpu, lapic_timer_advance_ns)));
+	for (ns = 0; tmp_tsc < tsc_deadline && ns < timer_advance_ns; ns++) {
+		ndelay(1);
+		tmp_tsc = kvm_read_l1_tsc(vcpu, rdtsc());
+	}
 
 	if (!lapic_timer_advance_adjust_done) {
 		/* too early */
To minimize the latency of timer interrupts as observed by the guest,
KVM adjusts the values it programs into the host timers to account for
the host's overhead of programming and handling the timer event. In
the event that the adjustments are too aggressive, i.e. the timer fires
earlier than the guest expects, KVM busy waits immediately prior to
entering the guest.

Currently, KVM manually converts the delay from nanoseconds to clock
cycles. But, the conversion is done in the guest's time domain, while
the delay occurs in the host's time domain, i.e. the delay may not be
accurate and could wait too little or too long. Sidestep the headache
of shifting time domains by delaying 1ns at a time and breaking the loop
when the guest's desired timer delay has been satisfied. Because the
advancement, which caps the delay to avoid unbounded busy waiting, is
stored in nanoseconds, the current advancement time can simply be used
as a loop terminator since we're delaying 1ns at a time (plus the few
cycles of overhead for running the code).

Cc: Liran Alon <liran.alon@oracle.com>
Cc: Wanpeng Li <wanpengli@tencent.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
---
 arch/x86/kvm/lapic.c | 13 +++++++------
 1 file changed, 7 insertions(+), 6 deletions(-)