[2/7] KVM: lapic: Delay 1ns at a time when waiting for timer to "expire"

Message ID 20190412201834.10831-3-sean.j.christopherson@intel.com
State New, archived
Series KVM: lapic: Fix a variety of timer adv issues

Commit Message

Sean Christopherson April 12, 2019, 8:18 p.m. UTC
To minimize the latency of timer interrupts as observed by the guest,
KVM adjusts the values it programs into the host timers to account for
the host's overhead of programming and handling the timer event.  In
the event that the adjustments are too aggressive, i.e. the timer fires
earlier than the guest expects, KVM busy waits immediately prior to
entering the guest.

Currently, KVM manually converts the delay from nanoseconds to clock
cycles.  But, the conversion is done in the guest's time domain, while
the delay occurs in the host's time domain, i.e. the delay may not be
accurate and could wait too little or too long.  Sidestep the headache
of shifting time domains by delaying 1ns at a time and breaking the loop
when the guest's desired timer delay has been satisfied.  Because the
advancement, which caps the delay to avoid unbounded busy waiting, is
stored in nanoseconds, the current advancement time can simply be used
as a loop terminator since we're delaying 1ns at a time (plus the few
cycles of overhead for running the code).
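
To illustrate the mismatch (a simplified sketch, not the exact kernel
helpers), the old code effectively did:

  /* nsec_to_cycles() scales by the guest's virtual TSC frequency, */
  /* so the result is in *guest* cycles...                         */
  guest_cycles = delay_ns * vcpu->arch.virtual_tsc_khz / 1000000;
  /* ...but __delay() is delay_tsc, which spins on the *host* TSC. */
  __delay(guest_cycles);

The actual wall-clock wait is thus skewed by the ratio of guest to host
TSC frequency.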

Cc: Liran Alon <liran.alon@oracle.com>
Cc: Wanpeng Li <wanpengli@tencent.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
---
 arch/x86/kvm/lapic.c | 13 +++++++------
 1 file changed, 7 insertions(+), 6 deletions(-)

Comments

Liran Alon April 14, 2019, 11:25 a.m. UTC | #1
> On 12 Apr 2019, at 23:18, Sean Christopherson <sean.j.christopherson@intel.com> wrote:
> 
> To minimize the latency of timer interrupts as observed by the guest,
> KVM adjusts the values it programs into the host timers to account for
> the host's overhead of programming and handling the timer event.  In
> the event that the adjustments are too aggressive, i.e. the timer fires
> earlier than the guest expects, KVM busy waits immediately prior to
> entering the guest.
> 
> Currently, KVM manually converts the delay from nanoseconds to clock
> cycles.  But, the conversion is done in the guest's time domain, while
> the delay occurs in the host's time domain, i.e. the delay may not be
> accurate and could wait too little or too long.  

Just to make sure I understand the issue you see here:
__delay() is called with a parameter which is the number of guest cycles required
to get from the current guest cycle (guest_tsc) to the deadline cycle (tsc_deadline).
But in case the guest vCPU frequency is different from the physical CPU frequency,
this parameter should further be converted from guest cycles to host cycles.
Do I understand correctly? If yes, I think you should elaborate on this part a bit
more in the commit message, as this is quite confusing. :)
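
To make it concrete (made-up numbers): a guest with virtual_tsc_khz = 2000000
(2 GHz) on a 3 GHz host that wants a 1500ns delay gets
nsec_to_cycles(vcpu, 1500) == 3000 guest cycles, but __delay(3000) spins for
3000 *host* cycles, i.e. only ~1000ns.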

> Sidestep the headache
> of shifting time domains by delaying 1ns at a time and breaking the loop
> when the guest's desired timer delay has been satisfied.  Because the
> advancement, which caps the delay to avoid unbounded busy waiting, is
> stored in nanoseconds, the current advancement time can simply be used
> as a loop terminator since we're delaying 1ns at a time (plus the few
> cycles of overhead for running the code).

Looking at start_sw_tscdeadline(), we have the same issue, where we need to convert guest cycles to host ns in order to set the hrtimer expiration time.
This is simply done by:
host_ns = guest_cycles * 1000000ULL;
do_div(host_ns, vcpu->arch.virtual_tsc_khz);

Why shouldn’t we use the same approach for constructing the __delay() parameter from the guest cycles currently passed to it?
i.e. something such as:
delay_guest_cycles = min(tsc_deadline - guest_tsc, nsec_to_cycles(vcpu, lapic_timer_advance_ns));
delay_host_ns = delay_guest_cycles * 1000000ULL;
do_div(delay_host_ns, vcpu->arch.virtual_tsc_khz);
__delay(delay_host_ns);

-Liran

> 
> Cc: Liran Alon <liran.alon@oracle.com>
> Cc: Wanpeng Li <wanpengli@tencent.com>
> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
> ---
> arch/x86/kvm/lapic.c | 13 +++++++------
> 1 file changed, 7 insertions(+), 6 deletions(-)
> 
> diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
> index 92446cba9b24..e797e3145a8b 100644
> --- a/arch/x86/kvm/lapic.c
> +++ b/arch/x86/kvm/lapic.c
> @@ -1486,7 +1486,8 @@ static bool lapic_timer_int_injected(struct kvm_vcpu *vcpu)
> void wait_lapic_expire(struct kvm_vcpu *vcpu)
> {
> 	struct kvm_lapic *apic = vcpu->arch.apic;
> -	u64 guest_tsc, tsc_deadline, ns;
> +	u32 timer_advance_ns = lapic_timer_advance_ns;
> +	u64 guest_tsc, tmp_tsc, tsc_deadline, ns;

Nit: I would rename guest_tsc and tmp_tsc to guest_tsc_on_exit and current_guest_tsc respectively.

> 
> 	if (!lapic_in_kernel(vcpu))
> 		return;
> @@ -1499,13 +1500,13 @@ void wait_lapic_expire(struct kvm_vcpu *vcpu)
> 
> 	tsc_deadline = apic->lapic_timer.expired_tscdeadline;
> 	apic->lapic_timer.expired_tscdeadline = 0;
> -	guest_tsc = kvm_read_l1_tsc(vcpu, rdtsc());
> +	tmp_tsc = guest_tsc = kvm_read_l1_tsc(vcpu, rdtsc());
> 	trace_kvm_wait_lapic_expire(vcpu->vcpu_id, guest_tsc - tsc_deadline);
> 
> -	/* __delay is delay_tsc whenever the hardware has TSC, thus always.  */
> -	if (guest_tsc < tsc_deadline)
> -		__delay(min(tsc_deadline - guest_tsc,
> -			nsec_to_cycles(vcpu, lapic_timer_advance_ns)));
> +	for (ns = 0; tmp_tsc < tsc_deadline && ns < timer_advance_ns; ns++) {
> +		ndelay(1);
> +		tmp_tsc = kvm_read_l1_tsc(vcpu, rdtsc());
> +	}
> 
> 	if (!lapic_timer_advance_adjust_done) {
> 		/* too early */
> -- 
> 2.21.0
>
Sean Christopherson April 15, 2019, 4:11 p.m. UTC | #2
On Sun, Apr 14, 2019 at 02:25:44PM +0300, Liran Alon wrote:
> 
> 
> > On 12 Apr 2019, at 23:18, Sean Christopherson <sean.j.christopherson@intel.com> wrote:
> > 
> > To minimize the latency of timer interrupts as observed by the guest,
> > KVM adjusts the values it programs into the host timers to account for
> > the host's overhead of programming and handling the timer event.  In
> > the event that the adjustments are too aggressive, i.e. the timer fires
> > earlier than the guest expects, KVM busy waits immediately prior to
> > entering the guest.
> > 
> > Currently, KVM manually converts the delay from nanoseconds to clock
> > cycles.  But, the conversion is done in the guest's time domain, while
> > the delay occurs in the host's time domain, i.e. the delay may not be
> > accurate and could wait too little or too long.  
> 
> Just to make sure I understand the issue you see here:
> __delay() is called with a parameter which is the number of guest cycles required
> to get from the current guest cycle (guest_tsc) to the deadline cycle (tsc_deadline).
> But in case the guest vCPU frequency is different from the physical CPU frequency,
> this parameter should further be converted from guest cycles to host cycles.
> Do I understand correctly? If yes, I think you should elaborate on this part a bit
> more in the commit message, as this is quite confusing. :)

Heh, I didn't elaborate precisely because it was so darn confusing.  I
didn't want to say something completely incorrect so I kept it high level.

> 
> > Sidestep the headache
> > of shifting time domains by delaying 1ns at a time and breaking the loop
> > when the guest's desired timer delay has been satisfied.  Because the
> > advancement, which caps the delay to avoid unbounded busy waiting, is
> > stored in nanoseconds, the current advancement time can simply be used
> > as a loop terminator since we're delaying 1ns at a time (plus the few
> > cycles of overhead for running the code).
> 
> Looking at start_sw_tscdeadline(), we have the same issue, where we need to convert guest cycles to host ns in order to set the hrtimer expiration time.
> This is simply done by:
> host_ns = guest_cycles * 1000000ULL;
> do_div(host_ns, vcpu->arch.virtual_tsc_khz);
> 
> Why shouldn’t we use the same approach for constructing the __delay() parameter from the guest cycles currently passed to it?
> i.e. something such as:
> delay_guest_cycles = min(tsc_deadline - guest_tsc, nsec_to_cycles(vcpu, lapic_timer_advance_ns));
> delay_host_ns = delay_guest_cycles * 1000000ULL;
> do_div(delay_host_ns, vcpu->arch.virtual_tsc_khz);
> __delay(delay_host_ns);

That's also an option.  I opted for the loop as it felt simpler and more
obvious.  And explicitly rechecking kvm_read_l1_tsc() guarantees
the guest won't see an "early" TSC unless the cap kicks in.  That being
said, I don't have a strong opinion either way.

As for the code, we'd probably want to do the conversion first and cap
second so as to avoid converting lapic_timer_advance_ns to the guest's
TSC and then back to ns.  It should use ndelay(), and maybe just call it
"delay_ns" to avoid (temporarily) assigning a guest TSC calculation to a
host variable, e.g.:

  delay_ns = (tsc_deadline - guest_tsc) * 1000000ULL;
  do_div(delay_ns, vcpu->arch.virtual_tsc_khz);
  ndelay(min(delay_ns, lapic_timer_advance_ns));

That's actually nowhere near as bad as I thought it would be now that
it's typed out...


> 
> -Liran
> 
> > 
> > Cc: Liran Alon <liran.alon@oracle.com>
> > Cc: Wanpeng Li <wanpengli@tencent.com>
> > Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
> > ---
> > arch/x86/kvm/lapic.c | 13 +++++++------
> > 1 file changed, 7 insertions(+), 6 deletions(-)
> > 
> > diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
> > index 92446cba9b24..e797e3145a8b 100644
> > --- a/arch/x86/kvm/lapic.c
> > +++ b/arch/x86/kvm/lapic.c
> > @@ -1486,7 +1486,8 @@ static bool lapic_timer_int_injected(struct kvm_vcpu *vcpu)
> > void wait_lapic_expire(struct kvm_vcpu *vcpu)
> > {
> > 	struct kvm_lapic *apic = vcpu->arch.apic;
> > -	u64 guest_tsc, tsc_deadline, ns;
> > +	u32 timer_advance_ns = lapic_timer_advance_ns;
> > +	u64 guest_tsc, tmp_tsc, tsc_deadline, ns;
> 
> Nit: I would rename guest_tsc and tmp_tsc to guest_tsc_on_exit and current_guest_tsc respectively.

guest_tsc_on_exit isn't correct though, as it holds the guest's TSC at the
beginning of wait_lapic_expire(), e.g. guest_tsc_start would be more
appropriate.  My main motivation for not using verbose names is to keep lines
short, i.e. avoid wrapping at 80 chars.  For me, wrapping a bunch of lines
makes the code more difficult to read than less verbose names.

> 
> > 
> > 	if (!lapic_in_kernel(vcpu))
> > 		return;
> > @@ -1499,13 +1500,13 @@ void wait_lapic_expire(struct kvm_vcpu *vcpu)
> > 
> > 	tsc_deadline = apic->lapic_timer.expired_tscdeadline;
> > 	apic->lapic_timer.expired_tscdeadline = 0;
> > -	guest_tsc = kvm_read_l1_tsc(vcpu, rdtsc());
> > +	tmp_tsc = guest_tsc = kvm_read_l1_tsc(vcpu, rdtsc());
> > 	trace_kvm_wait_lapic_expire(vcpu->vcpu_id, guest_tsc - tsc_deadline);
> > 
> > -	/* __delay is delay_tsc whenever the hardware has TSC, thus always.  */
> > -	if (guest_tsc < tsc_deadline)
> > -		__delay(min(tsc_deadline - guest_tsc,
> > -			nsec_to_cycles(vcpu, lapic_timer_advance_ns)));
> > +	for (ns = 0; tmp_tsc < tsc_deadline && ns < timer_advance_ns; ns++) {
> > +		ndelay(1);
> > +		tmp_tsc = kvm_read_l1_tsc(vcpu, rdtsc());
> > +	}
> > 
> > 	if (!lapic_timer_advance_adjust_done) {
> > 		/* too early */
> > -- 
> > 2.21.0
> > 
>
Liran Alon April 15, 2019, 5:06 p.m. UTC | #3
> On 15 Apr 2019, at 19:11, Sean Christopherson <sean.j.christopherson@intel.com> wrote:
> 
> On Sun, Apr 14, 2019 at 02:25:44PM +0300, Liran Alon wrote:
>> 
>> 
>>> On 12 Apr 2019, at 23:18, Sean Christopherson <sean.j.christopherson@intel.com> wrote:
>>> 
>>> To minimize the latency of timer interrupts as observed by the guest,
>>> KVM adjusts the values it programs into the host timers to account for
>>> the host's overhead of programming and handling the timer event.  In
>>> the event that the adjustments are too aggressive, i.e. the timer fires
>>> earlier than the guest expects, KVM busy waits immediately prior to
>>> entering the guest.
>>> 
>>> Currently, KVM manually converts the delay from nanoseconds to clock
>>> cycles.  But, the conversion is done in the guest's time domain, while
>>> the delay occurs in the host's time domain, i.e. the delay may not be
>>> accurate and could wait too little or too long.  
>> 
>> Just to make sure I understand the issue you see here:
>> __delay() is called with a parameter which is the number of guest cycles required
>> to get from the current guest cycle (guest_tsc) to the deadline cycle (tsc_deadline).
>> But in case the guest vCPU frequency is different from the physical CPU frequency,
>> this parameter should further be converted from guest cycles to host cycles.
>> Do I understand correctly? If yes, I think you should elaborate on this part a bit
>> more in the commit message, as this is quite confusing. :)
> 
> Heh, I didn't elaborate precisely because it was so darn confusing.  I
> didn't want to say something completely incorrect so I kept it high level.

Heh, I can agree with that :)
But we should make sure others who read the commit message won’t get confused.

> 
>> 
>>> Sidestep the headache
>>> of shifting time domains by delaying 1ns at a time and breaking the loop
>>> when the guest's desired timer delay has been satisfied.  Because the
>>> advancement, which caps the delay to avoid unbounded busy waiting, is
>>> stored in nanoseconds, the current advancement time can simply be used
>>> as a loop terminator since we're delaying 1ns at a time (plus the few
>>> cycles of overhead for running the code).
>> 
>> Looking at start_sw_tscdeadline(), we have the same issue, where we need to convert guest cycles to host ns in order to set the hrtimer expiration time.
>> This is simply done by:
>> host_ns = guest_cycles * 1000000ULL;
>> do_div(host_ns, vcpu->arch.virtual_tsc_khz);
>> 
>> Why shouldn’t we use the same approach for constructing the __delay() parameter from the guest cycles currently passed to it?
>> i.e. something such as:
>> delay_guest_cycles = min(tsc_deadline - guest_tsc, nsec_to_cycles(vcpu, lapic_timer_advance_ns));
>> delay_host_ns = delay_guest_cycles * 1000000ULL;
>> do_div(delay_host_ns, vcpu->arch.virtual_tsc_khz);
>> __delay(delay_host_ns);
> 
> That's also an option.  I opted for the loop as it felt simpler and more
> obvious.  And explicitly rechecking kvm_read_l1_tsc() guarantees
> the guest won't see an "early" TSC unless the cap kicks in.  That being
> said, I don't have a strong opinion either way.
> 
> As for the code, we'd probably want to do the conversion first and cap
> second so as to avoid converting lapic_timer_advance_ns to the guest's
> TSC and then back to ns.  It should use ndelay(), and maybe just call it
> "delay_ns" to avoid (temporarily) assigning a guest TSC calculation to a
> host variable, e.g.:
> 
>  delay_ns = (tsc_deadline - guest_tsc) * 1000000ULL;
>  do_div(delay_ns, vcpu->arch.virtual_tsc_khz);
>  ndelay(min(delay_ns, lapic_timer_advance_ns));
> 
> That's actually nowhere near as bad as I thought it would be now that
> it's typed out…

Yes, I agree this code looks nice; I think it’s better. I recommend sending a v2 with this version.

> 
> 
>> 
>> -Liran
>> 
>>> 
>>> Cc: Liran Alon <liran.alon@oracle.com>
>>> Cc: Wanpeng Li <wanpengli@tencent.com>
>>> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
>>> ---
>>> arch/x86/kvm/lapic.c | 13 +++++++------
>>> 1 file changed, 7 insertions(+), 6 deletions(-)
>>> 
>>> diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
>>> index 92446cba9b24..e797e3145a8b 100644
>>> --- a/arch/x86/kvm/lapic.c
>>> +++ b/arch/x86/kvm/lapic.c
>>> @@ -1486,7 +1486,8 @@ static bool lapic_timer_int_injected(struct kvm_vcpu *vcpu)
>>> void wait_lapic_expire(struct kvm_vcpu *vcpu)
>>> {
>>> 	struct kvm_lapic *apic = vcpu->arch.apic;
>>> -	u64 guest_tsc, tsc_deadline, ns;
>>> +	u32 timer_advance_ns = lapic_timer_advance_ns;
>>> +	u64 guest_tsc, tmp_tsc, tsc_deadline, ns;
>> 
>> Nit: I would rename guest_tsc and tmp_tsc to guest_tsc_on_exit and current_guest_tsc respectively.
> 
> guest_tsc_on_exit isn't correct though, as it holds the guest's TSC at the
> beginning of wait_lapic_expire(), e.g. guest_tsc_start would be more
> appropriate.  My main motivation for not using verbose names is to keep lines
> short, i.e. avoid wrapping at 80 chars.  For me, wrapping a bunch of lines
> makes the code more difficult to read than less verbose names.

I think it doesn’t matter now if we agree with the above-mentioned suggestion of avoiding this loop. :)

-Liran

> 
>> 
>>> 
>>> 	if (!lapic_in_kernel(vcpu))
>>> 		return;
>>> @@ -1499,13 +1500,13 @@ void wait_lapic_expire(struct kvm_vcpu *vcpu)
>>> 
>>> 	tsc_deadline = apic->lapic_timer.expired_tscdeadline;
>>> 	apic->lapic_timer.expired_tscdeadline = 0;
>>> -	guest_tsc = kvm_read_l1_tsc(vcpu, rdtsc());
>>> +	tmp_tsc = guest_tsc = kvm_read_l1_tsc(vcpu, rdtsc());
>>> 	trace_kvm_wait_lapic_expire(vcpu->vcpu_id, guest_tsc - tsc_deadline);
>>> 
>>> -	/* __delay is delay_tsc whenever the hardware has TSC, thus always.  */
>>> -	if (guest_tsc < tsc_deadline)
>>> -		__delay(min(tsc_deadline - guest_tsc,
>>> -			nsec_to_cycles(vcpu, lapic_timer_advance_ns)));
>>> +	for (ns = 0; tmp_tsc < tsc_deadline && ns < timer_advance_ns; ns++) {
>>> +		ndelay(1);
>>> +		tmp_tsc = kvm_read_l1_tsc(vcpu, rdtsc());
>>> +	}
>>> 
>>> 	if (!lapic_timer_advance_adjust_done) {
>>> 		/* too early */
>>> -- 
>>> 2.21.0
>>> 
>>
Paolo Bonzini April 16, 2019, 11:02 a.m. UTC | #4
On 12/04/19 22:18, Sean Christopherson wrote:
> To minimize the latency of timer interrupts as observed by the guest,
> KVM adjusts the values it programs into the host timers to account for
> the host's overhead of programming and handling the timer event.  In
> the event that the adjustments are too aggressive, i.e. the timer fires
> earlier than the guest expects, KVM busy waits immediately prior to
> entering the guest.
> 
> Currently, KVM manually converts the delay from nanoseconds to clock
> cycles.  But, the conversion is done in the guest's time domain, while
> the delay occurs in the host's time domain, i.e. the delay may not be
> accurate and could wait too little or too long.  Sidestep the headache
> of shifting time domains by delaying 1ns at a time and breaking the loop
> when the guest's desired timer delay has been satisfied.  Because the
> advancement, which caps the delay to avoid unbounded busy waiting, is
> stored in nanoseconds, the current advancement time can simply be used
> as a loop terminator since we're delaying 1ns at a time (plus the few
> cycles of overhead for running the code).

A 1 nanosecond advance (3-5 clock cycles) is even shorter than the time
taken to execute kvm_read_l1_tsc.  I would just replace it with
cpu_relax(); I can do the change when applying.
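
I.e., just a sketch of the substitution (note that "ns" would then bound the
number of loop iterations rather than nanoseconds; the tsc_deadline check is
what actually terminates the wait):

  for (ns = 0; tmp_tsc < tsc_deadline && ns < timer_advance_ns; ns++) {
      cpu_relax();
      tmp_tsc = kvm_read_l1_tsc(vcpu, rdtsc());
  }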

Paolo

> Cc: Liran Alon <liran.alon@oracle.com>
> Cc: Wanpeng Li <wanpengli@tencent.com>
> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
> ---
>  arch/x86/kvm/lapic.c | 13 +++++++------
>  1 file changed, 7 insertions(+), 6 deletions(-)
> 
> diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
> index 92446cba9b24..e797e3145a8b 100644
> --- a/arch/x86/kvm/lapic.c
> +++ b/arch/x86/kvm/lapic.c
> @@ -1486,7 +1486,8 @@ static bool lapic_timer_int_injected(struct kvm_vcpu *vcpu)
>  void wait_lapic_expire(struct kvm_vcpu *vcpu)
>  {
>  	struct kvm_lapic *apic = vcpu->arch.apic;
> -	u64 guest_tsc, tsc_deadline, ns;
> +	u32 timer_advance_ns = lapic_timer_advance_ns;
> +	u64 guest_tsc, tmp_tsc, tsc_deadline, ns;
>  
>  	if (!lapic_in_kernel(vcpu))
>  		return;
> @@ -1499,13 +1500,13 @@ void wait_lapic_expire(struct kvm_vcpu *vcpu)
>  
>  	tsc_deadline = apic->lapic_timer.expired_tscdeadline;
>  	apic->lapic_timer.expired_tscdeadline = 0;
> -	guest_tsc = kvm_read_l1_tsc(vcpu, rdtsc());
> +	tmp_tsc = guest_tsc = kvm_read_l1_tsc(vcpu, rdtsc());
>  	trace_kvm_wait_lapic_expire(vcpu->vcpu_id, guest_tsc - tsc_deadline);
>  
> -	/* __delay is delay_tsc whenever the hardware has TSC, thus always.  */
> -	if (guest_tsc < tsc_deadline)
> -		__delay(min(tsc_deadline - guest_tsc,
> -			nsec_to_cycles(vcpu, lapic_timer_advance_ns)));
> +	for (ns = 0; tmp_tsc < tsc_deadline && ns < timer_advance_ns; ns++) {
> +		ndelay(1);
> +		tmp_tsc = kvm_read_l1_tsc(vcpu, rdtsc());
> +	}
>  
>  	if (!lapic_timer_advance_adjust_done) {
>  		/* too early */
>
Liran Alon April 16, 2019, 11:04 a.m. UTC | #5
> On 16 Apr 2019, at 14:02, Paolo Bonzini <pbonzini@redhat.com> wrote:
> 
> On 12/04/19 22:18, Sean Christopherson wrote:
>> To minimize the latency of timer interrupts as observed by the guest,
>> KVM adjusts the values it programs into the host timers to account for
>> the host's overhead of programming and handling the timer event.  In
>> the event that the adjustments are too aggressive, i.e. the timer fires
>> earlier than the guest expects, KVM busy waits immediately prior to
>> entering the guest.
>> 
>> Currently, KVM manually converts the delay from nanoseconds to clock
>> cycles.  But, the conversion is done in the guest's time domain, while
>> the delay occurs in the host's time domain, i.e. the delay may not be
>> accurate and could wait too little or too long.  Sidestep the headache
>> of shifting time domains by delaying 1ns at a time and breaking the loop
>> when the guest's desired timer delay has been satisfied.  Because the
>> advancement, which caps the delay to avoid unbounded busy waiting, is
>> stored in nanoseconds, the current advancement time can simply be used
>> as a loop terminator since we're delaying 1ns at a time (plus the few
>> cycles of overhead for running the code).
> 
> A 1 nanosecond advance (3-5 clock cycles) is even shorter than the time
> taken to execute kvm_read_l1_tsc.  I would just replace it with
> cpu_relax(); I can do the change when applying.
> 
> Paolo

Paolo, there is also another approach Sean and I were discussing.
I think it is more elegant.
See previous comments in this thread.

-Liran

> 
>> Cc: Liran Alon <liran.alon@oracle.com>
>> Cc: Wanpeng Li <wanpengli@tencent.com>
>> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
>> ---
>> arch/x86/kvm/lapic.c | 13 +++++++------
>> 1 file changed, 7 insertions(+), 6 deletions(-)
>> 
>> diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
>> index 92446cba9b24..e797e3145a8b 100644
>> --- a/arch/x86/kvm/lapic.c
>> +++ b/arch/x86/kvm/lapic.c
>> @@ -1486,7 +1486,8 @@ static bool lapic_timer_int_injected(struct kvm_vcpu *vcpu)
>> void wait_lapic_expire(struct kvm_vcpu *vcpu)
>> {
>> 	struct kvm_lapic *apic = vcpu->arch.apic;
>> -	u64 guest_tsc, tsc_deadline, ns;
>> +	u32 timer_advance_ns = lapic_timer_advance_ns;
>> +	u64 guest_tsc, tmp_tsc, tsc_deadline, ns;
>> 
>> 	if (!lapic_in_kernel(vcpu))
>> 		return;
>> @@ -1499,13 +1500,13 @@ void wait_lapic_expire(struct kvm_vcpu *vcpu)
>> 
>> 	tsc_deadline = apic->lapic_timer.expired_tscdeadline;
>> 	apic->lapic_timer.expired_tscdeadline = 0;
>> -	guest_tsc = kvm_read_l1_tsc(vcpu, rdtsc());
>> +	tmp_tsc = guest_tsc = kvm_read_l1_tsc(vcpu, rdtsc());
>> 	trace_kvm_wait_lapic_expire(vcpu->vcpu_id, guest_tsc - tsc_deadline);
>> 
>> -	/* __delay is delay_tsc whenever the hardware has TSC, thus always.  */
>> -	if (guest_tsc < tsc_deadline)
>> -		__delay(min(tsc_deadline - guest_tsc,
>> -			nsec_to_cycles(vcpu, lapic_timer_advance_ns)));
>> +	for (ns = 0; tmp_tsc < tsc_deadline && ns < timer_advance_ns; ns++) {
>> +		ndelay(1);
>> +		tmp_tsc = kvm_read_l1_tsc(vcpu, rdtsc());
>> +	}
>> 
>> 	if (!lapic_timer_advance_adjust_done) {
>> 		/* too early */
>> 
>
Paolo Bonzini April 16, 2019, 11:09 a.m. UTC | #6
On 16/04/19 13:04, Liran Alon wrote:
>> A 1 nanosecond advance (3-5 clock cycles) is even shorter than the time
>> taken to execute kvm_read_l1_tsc.  I would just replace it with
>> cpu_relax(); I can do the change when applying.
>
> Paolo, there is also another approach Sean and I were discussing.
> I think it is more elegant.
> See previous comments in this thread.

Yep, I've seen it now.

Paolo

Patch

diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index 92446cba9b24..e797e3145a8b 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -1486,7 +1486,8 @@  static bool lapic_timer_int_injected(struct kvm_vcpu *vcpu)
 void wait_lapic_expire(struct kvm_vcpu *vcpu)
 {
 	struct kvm_lapic *apic = vcpu->arch.apic;
-	u64 guest_tsc, tsc_deadline, ns;
+	u32 timer_advance_ns = lapic_timer_advance_ns;
+	u64 guest_tsc, tmp_tsc, tsc_deadline, ns;
 
 	if (!lapic_in_kernel(vcpu))
 		return;
@@ -1499,13 +1500,13 @@  void wait_lapic_expire(struct kvm_vcpu *vcpu)
 
 	tsc_deadline = apic->lapic_timer.expired_tscdeadline;
 	apic->lapic_timer.expired_tscdeadline = 0;
-	guest_tsc = kvm_read_l1_tsc(vcpu, rdtsc());
+	tmp_tsc = guest_tsc = kvm_read_l1_tsc(vcpu, rdtsc());
 	trace_kvm_wait_lapic_expire(vcpu->vcpu_id, guest_tsc - tsc_deadline);
 
-	/* __delay is delay_tsc whenever the hardware has TSC, thus always.  */
-	if (guest_tsc < tsc_deadline)
-		__delay(min(tsc_deadline - guest_tsc,
-			nsec_to_cycles(vcpu, lapic_timer_advance_ns)));
+	for (ns = 0; tmp_tsc < tsc_deadline && ns < timer_advance_ns; ns++) {
+		ndelay(1);
+		tmp_tsc = kvm_read_l1_tsc(vcpu, rdtsc());
+	}
 
 	if (!lapic_timer_advance_adjust_done) {
 		/* too early */