diff mbox series

KVM: lapic: Reuse auto-adjusted timer advance of first stable vCPU

Message ID 20190508214702.25317-1-sean.j.christopherson@intel.com (mailing list archive)
State New, archived
Headers show
Series KVM: lapic: Reuse auto-adjusted timer advance of first stable vCPU | expand

Commit Message

Sean Christopherson May 8, 2019, 9:47 p.m. UTC
Wanpeng pointed out that auto-tuning the advancement time for every vCPU
can lead to inaccurate wait times, e.g. if the advancement is calculated
while an overcommitted host is under heavy load, then KVM will waste a
lot of time busy waiting if the load decreases, e.g. when neighbour VMs
are idle.

Sidestep this issue and reduce time spent adjusting the advancement by
saving the first stable advancement value and reusing that value for all
*new* vCPUs.  This provides a safe way to auto-adjust the advancement,
minimizes the potential for a poor calculation due to system load, and
preserves the ability for userspace to change the advancement on the fly
(the module parameter is writable when KVM is loaded), as the value set
by userspace takes priority.

Regarding changing the advancement on the fly, doing so is likely less
useful with auto-adjusting, especially now that recognizing a change
requires restarting the VM.  Allowing the two concepts to coexist is
theoretically possible, but would be ugly and slow.  Auto-tuning needs
to track advancement time on a per-vCPU basis (because adjusments are
done relative to the last advancement), so supporting both approaches
would require additional code and conditionals, i.e. overhead, but would
only provide marginal value.  That being said, keep the ability to
change the module param without a reload as it's useful for debug and
testing.

Note, the comment from commit 39497d7660d9 ("KVM: lapic: Track lapic
timer advance per vCPU") about TSC scaling:

  And because virtual_tsc_khz and tsc_scaling_ratio are per-vCPU, the
  correct calibration for a given vCPU may not apply to all vCPUs.

was effectively resolved by commit b6aa57c69cb2 ("KVM: lapic: Convert
guest TSC to host time domain if necessary").  The timer advancement is
calculated and stored in nanoseconds, and operates in the host time
domain, i.e. calculates *host* overhead.  In short, reusing an
advancement time for vCPUs with different TSC scaling is ok.

Fixes: 39497d7660d9 ("KVM: lapic: Track lapic timer advance per vCPU")
Cc: Wanpeng Li <kernellwp@gmail.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
---
 arch/x86/kvm/lapic.c | 22 ++++++++++++++++++----
 1 file changed, 18 insertions(+), 4 deletions(-)

Comments

Paolo Bonzini May 20, 2019, 1:03 p.m. UTC | #1
On 08/05/19 23:47, Sean Christopherson wrote:
> +
> +		/*
> +		 * The first vCPU to get a stable advancement time "wins" and
> +		 * sets the advancement time that is used for *new* vCPUS that
> +		 * are created with auto-adjusting enabled.
> +		 */
> +		if (apic->lapic_timer.timer_advance_adjust_done)
> +			(void)cmpxchg(&adjusted_timer_advance_ns, -1,
> +				      timer_advance_ns);

This is relatively expensive, so it should only be done after setting
timer_advance_adjust_done to true.

Paolo
Sean Christopherson May 24, 2019, 8:23 p.m. UTC | #2
On Mon, May 20, 2019 at 03:03:10PM +0200, Paolo Bonzini wrote:
> On 08/05/19 23:47, Sean Christopherson wrote:
> > +
> > +		/*
> > +		 * The first vCPU to get a stable advancement time "wins" and
> > +		 * sets the advancement time that is used for *new* vCPUS that
> > +		 * are created with auto-adjusting enabled.
> > +		 */
> > +		if (apic->lapic_timer.timer_advance_adjust_done)
> > +			(void)cmpxchg(&adjusted_timer_advance_ns, -1,
> > +				      timer_advance_ns);
> 
> This is relatively expensive, so it should only be done after setting
> timer_advance_adjust_done to true.

That's already the case, or am I missing something?

This code is inside "if (!apic->lapic_timer.timer_advance_adjust_done)",
and apic->lapic_timer.timer_advance_adjust_done is set to true at creation
time if "adjusted_timer_advance_ns != -1", i.e. the cmpxchg() will only
be reached on vCPUs that are created before adjusted_timer_advance_ns is
set and will be reached at most once per vCPU.
Paolo Bonzini May 24, 2019, 8:48 p.m. UTC | #3
On 24/05/19 22:23, Sean Christopherson wrote:
> On Mon, May 20, 2019 at 03:03:10PM +0200, Paolo Bonzini wrote:
>> On 08/05/19 23:47, Sean Christopherson wrote:
>>> +
>>> +		/*
>>> +		 * The first vCPU to get a stable advancement time "wins" and
>>> +		 * sets the advancement time that is used for *new* vCPUS that
>>> +		 * are created with auto-adjusting enabled.
>>> +		 */
>>> +		if (apic->lapic_timer.timer_advance_adjust_done)
>>> +			(void)cmpxchg(&adjusted_timer_advance_ns, -1,
>>> +				      timer_advance_ns);
>>
>> This is relatively expensive, so it should only be done after setting
>> timer_advance_adjust_done to true.
> 
> That's already the case, or am I missing something?
> 
> This code is inside "if (!apic->lapic_timer.timer_advance_adjust_done)",

Right, I missed the outer if from the context.  Sorry for the noise.

Paolo
diff mbox series

Patch

diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index bd13fdddbdc4..4f234df11078 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -70,6 +70,8 @@ 
 #define APIC_BROADCAST			0xFF
 #define X2APIC_BROADCAST		0xFFFFFFFFul
 
+static u32 adjusted_timer_advance_ns = -1;
+
 #define LAPIC_TIMER_ADVANCE_ADJUST_DONE 100
 /* step-by-step approximation to mitigate fluctuation */
 #define LAPIC_TIMER_ADVANCE_ADJUST_STEP 8
@@ -1542,6 +1544,15 @@  void wait_lapic_expire(struct kvm_vcpu *vcpu)
 			apic->lapic_timer.timer_advance_adjust_done = true;
 		}
 		apic->lapic_timer.timer_advance_ns = timer_advance_ns;
+
+		/*
+		 * The first vCPU to get a stable advancement time "wins" and
+		 * sets the advancement time that is used for *new* vCPUS that
+		 * are created with auto-adjusting enabled.
+		 */
+		if (apic->lapic_timer.timer_advance_adjust_done)
+			(void)cmpxchg(&adjusted_timer_advance_ns, -1,
+				      timer_advance_ns);
 	}
 }
 
@@ -2302,12 +2313,15 @@  int kvm_create_lapic(struct kvm_vcpu *vcpu, int timer_advance_ns)
 	hrtimer_init(&apic->lapic_timer.timer, CLOCK_MONOTONIC,
 		     HRTIMER_MODE_ABS_PINNED);
 	apic->lapic_timer.timer.function = apic_timer_fn;
-	if (timer_advance_ns == -1) {
+	if (timer_advance_ns != -1) {
+		apic->lapic_timer.timer_advance_ns = timer_advance_ns;
+		apic->lapic_timer.timer_advance_adjust_done = true;
+	} else if (adjusted_timer_advance_ns != -1) {
+		apic->lapic_timer.timer_advance_ns = adjusted_timer_advance_ns;
+		apic->lapic_timer.timer_advance_adjust_done = true;
+	} else {
 		apic->lapic_timer.timer_advance_ns = 1000;
 		apic->lapic_timer.timer_advance_adjust_done = false;
-	} else {
-		apic->lapic_timer.timer_advance_ns = timer_advance_ns;
-		apic->lapic_timer.timer_advance_adjust_done = true;
 	}