diff mbox series

[v3,2/9] KVM: lapic: Convert guest TSC to host time domain when delaying

Message ID 20190416203248.29429-3-sean.j.christopherson@intel.com (mailing list archive)
State New, archived
Headers show
Series KVM: lapic: Fix a variety of timer adv issues | expand

Commit Message

Sean Christopherson April 16, 2019, 8:32 p.m. UTC
To minimize the latency of timer interrupts as observed by the guest,
KVM adjusts the values it programs into the host timers to account for
the host's overhead of programming and handling the timer event.  In
the event that the adjustments are too aggressive, i.e. the timer fires
earlier than the guest expects, KVM busy waits immediately prior to
entering the guest.

Currently, KVM manually converts the delay from nanoseconds to clock
cycles.  But, the conversion is done in the guest's time domain, while
the delay occurs in the host's time domain, i.e. the delay may not be
accurate and could wait too little or too long.

Convert the delay from guest clock cycles to host nanoseconds and use
ndelay() instead of __delay() to provide more accurate timing.  This
also avoids the need to convert lapic_timer_advance_ns, which is used
to cap the delay, to guest clock cycles.

Cc: Liran Alon <liran.alon@oracle.com>
Cc: Wanpeng Li <wanpengli@tencent.com>
Reviewed-by: Liran Alon <liran.alon@oracle.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
---
 arch/x86/kvm/lapic.c | 10 ++++++----
 1 file changed, 6 insertions(+), 4 deletions(-)

Comments

Paolo Bonzini April 17, 2019, 12:56 p.m. UTC | #1
On 16/04/19 22:32, Sean Christopherson wrote:
> +	/* ndelay uses delay_tsc whenever the hardware has TSC, thus always. */
> +	if (guest_tsc < tsc_deadline) {
> +		ns = (tsc_deadline - guest_tsc) * 1000000ULL;
> +		do_div(ns, vcpu->arch.virtual_tsc_khz);
> +		ndelay(min_t(u32, ns, lapic_timer_advance_ns));
> +	}
>  

This min_t would allow the timer to expire in advance of the indicated
deadline.  If this happens, wouldn't it be a bug elsewhere?

Paolo
Sean Christopherson April 17, 2019, 2:33 p.m. UTC | #2
On Wed, Apr 17, 2019 at 02:56:53PM +0200, Paolo Bonzini wrote:
> On 16/04/19 22:32, Sean Christopherson wrote:
> > +	/* ndelay uses delay_tsc whenever the hardware has TSC, thus always. */
> > +	if (guest_tsc < tsc_deadline) {
> > +		ns = (tsc_deadline - guest_tsc) * 1000000ULL;
> > +		do_div(ns, vcpu->arch.virtual_tsc_khz);
> > +		ndelay(min_t(u32, ns, lapic_timer_advance_ns));
> > +	}
> >  
> 
> This min_t would allow the timer to expire in advance of the indicated
> deadline.  If this happens, wouldn't it be a bug elsewhere?

Maybe?

In this exact variation of the code, no, since userspace can modify
lapic_timer_advance_ns while the vCPU is running.  Obviously that case
goes away when lapic_timer_advance_ns is no longer directly consumed by
vCPUS.

Originally, the check was added because lockup occurred if the guest's
TSC was sent backwards[1].  The discussion surrounding the bugfix patch[2]
is probably more interesting/amusing/relevant.  We're basically rehashing
that discussion. :-)

Anyways, capping at a KVM controlled value is prudent since not doing
so can hang the host if something does go wrong.  As for the __delay()
vs ndelay() discussion, I'll try to dig more into how/why I saw the timer
fire with an incorrect TSC, i.e. why ndelay() "works" but __delay() does
not.

https://bugzilla.redhat.com/show_bug.cgi?id=1337667
https://patchwork.kernel.org/patch/9130687/
Paolo Bonzini April 17, 2019, 3:05 p.m. UTC | #3
On 17/04/19 16:33, Sean Christopherson wrote:
> In this exact variation of the code, no, since userspace can modify
> lapic_timer_advance_ns while the vCPU is running.  Obviously that case
> goes away when lapic_timer_advance_ns is no longer directly consumed by
> vCPUS.

Indeed, and if we weren't about to add per-vCPU lapic_timer_advance_ns,
the old suggestion of making lapic_timer_advance_ns read-only would stand.

> Anyways, capping at a KVM controlled value is prudent since not doing
> so can hang the host if something does go wrong.  

Yes---for the per-vCPU lapic_timer_advance_ns it is completely reasonable.

Paolo
diff mbox series

Patch

diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index 92446cba9b24..5891c0badfa6 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -1502,10 +1502,12 @@  void wait_lapic_expire(struct kvm_vcpu *vcpu)
 	guest_tsc = kvm_read_l1_tsc(vcpu, rdtsc());
 	trace_kvm_wait_lapic_expire(vcpu->vcpu_id, guest_tsc - tsc_deadline);
 
-	/* __delay is delay_tsc whenever the hardware has TSC, thus always.  */
-	if (guest_tsc < tsc_deadline)
-		__delay(min(tsc_deadline - guest_tsc,
-			nsec_to_cycles(vcpu, lapic_timer_advance_ns)));
+	/* ndelay uses delay_tsc whenever the hardware has TSC, thus always. */
+	if (guest_tsc < tsc_deadline) {
+		ns = (tsc_deadline - guest_tsc) * 1000000ULL;
+		do_div(ns, vcpu->arch.virtual_tsc_khz);
+		ndelay(min_t(u32, ns, lapic_timer_advance_ns));
+	}
 
 	if (!lapic_timer_advance_adjust_done) {
 		/* too early */