diff mbox

KVM: x86: fix tsc catchup issue with tsc scaling

Message ID 20140106141859.GA27260@amt.cnet (mailing list archive)
State New, archived
Headers show

Commit Message

Marcelo Tosatti Jan. 6, 2014, 2:18 p.m. UTC
To fix a problem related to different resolution of TSC and system clock,
the offset in TSC units is approximated by 

delta = vcpu->hv_clock.tsc_timestamp 	- 	vcpu->last_guest_tsc

(Guest TSC value at 			(Guest TSC value at last VM-exit)
the last kvm_guest_time_update
call)

Delta is then later scaled using mult,shift pair found in hv_clock 
structure (which is correct against tsc_timestamp in that 
structure).

However, if a frequency change is performed between these two points, 
this delta is measured using different TSC frequencies, but scaled using 
mult,shift pair for one frequency only.

The end result is an incorrect delta.

The bug which this code works around is not the only cause for 
clock backwards events. The global accumulator is still
necessary, so remove the max_kernel_ns fix and rely on the 
global accumulator for no clock backwards events.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

Paolo Bonzini Jan. 15, 2014, 11:43 a.m. UTC | #1
Il 06/01/2014 15:18, Marcelo Tosatti ha scritto:
> 
> To fix a problem related to different resolution of TSC and system clock,
> the offset in TSC units is approximated by 
> 
> delta = vcpu->hv_clock.tsc_timestamp 	- 	vcpu->last_guest_tsc
> 
> (Guest TSC value at 			(Guest TSC value at last VM-exit)
> the last kvm_guest_time_update
> call)
> 
> Delta is then later scaled using mult,shift pair found in hv_clock 
> structure (which is correct against tsc_timestamp in that 
> structure).
> 
> However, if a frequency change is performed between these two points, 
> this delta is measured using different TSC frequencies, but scaled using 
> mult,shift pair for one frequency only.
> 
> The end result is an incorrect delta.
> 
> The bug which this code works around is not the only cause for 
> clock backwards events. The global accumulator is still
> necessary, so remove the max_kernel_ns fix and rely on the 
> global accumulator for no clock backwards events.

This is basically reverting commit 1d5f066 (KVM: x86: Fix a possible
backwards warp of kvmclock, 2010-08-19).

Your commit message basically says that the guest-side 489fb49 (x86,
paravirt: Add a global synchronization point for pvclock, 2010-05-11) is
the real solution to the problem that the host-side commit 1d5f066 was
trying to fix.  Right?

This patch makes vcpu->hv_clock.tsc_timestamp write only.  Please
provide a follow up that drops the field entirely, then I'll apply both.
 In the meanwhile:

Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>

Thanks,

Paolo

> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
> 
> Index: linux-2.6.git/arch/x86/kvm/x86.c
> ===================================================================
> --- linux-2.6.git.orig/arch/x86/kvm/x86.c
> +++ linux-2.6.git/arch/x86/kvm/x86.c
> @@ -1484,7 +1484,7 @@ static int kvm_guest_time_update(struct
>  	unsigned long flags, this_tsc_khz;
>  	struct kvm_vcpu_arch *vcpu = &v->arch;
>  	struct kvm_arch *ka = &v->kvm->arch;
> -	s64 kernel_ns, max_kernel_ns;
> +	s64 kernel_ns;
>  	u64 tsc_timestamp, host_tsc;
>  	struct pvclock_vcpu_time_info guest_hv_clock;
>  	u8 pvclock_flags;
> @@ -1543,37 +1543,6 @@ static int kvm_guest_time_update(struct
>  	if (!vcpu->pv_time_enabled)
>  		return 0;
>  
> -	/*
> -	 * Time as measured by the TSC may go backwards when resetting the base
> -	 * tsc_timestamp.  The reason for this is that the TSC resolution is
> -	 * higher than the resolution of the other clock scales.  Thus, many
> -	 * possible measurments of the TSC correspond to one measurement of any
> -	 * other clock, and so a spread of values is possible.  This is not a
> -	 * problem for the computation of the nanosecond clock; with TSC rates
> -	 * around 1GHZ, there can only be a few cycles which correspond to one
> -	 * nanosecond value, and any path through this code will inevitably
> -	 * take longer than that.  However, with the kernel_ns value itself,
> -	 * the precision may be much lower, down to HZ granularity.  If the
> -	 * first sampling of TSC against kernel_ns ends in the low part of the
> -	 * range, and the second in the high end of the range, we can get:
> -	 *
> -	 * (TSC - offset_low) * S + kns_old > (TSC - offset_high) * S + kns_new
> -	 *
> -	 * As the sampling errors potentially range in the thousands of cycles,
> -	 * it is possible such a time value has already been observed by the
> -	 * guest.  To protect against this, we must compute the system time as
> -	 * observed by the guest and ensure the new system time is greater.
> -	 */
> -	max_kernel_ns = 0;
> -	if (vcpu->hv_clock.tsc_timestamp) {
> -		max_kernel_ns = vcpu->last_guest_tsc -
> -				vcpu->hv_clock.tsc_timestamp;
> -		max_kernel_ns = pvclock_scale_delta(max_kernel_ns,
> -				    vcpu->hv_clock.tsc_to_system_mul,
> -				    vcpu->hv_clock.tsc_shift);
> -		max_kernel_ns += vcpu->last_kernel_ns;
> -	}
> -
>  	if (unlikely(vcpu->hw_tsc_khz != this_tsc_khz)) {
>  		kvm_get_time_scale(NSEC_PER_SEC / 1000, this_tsc_khz,
>  				   &vcpu->hv_clock.tsc_shift,
> @@ -1581,14 +1550,6 @@ static int kvm_guest_time_update(struct
>  		vcpu->hw_tsc_khz = this_tsc_khz;
>  	}
>  
> -	/* with a master <monotonic time, tsc value> tuple,
> -	 * pvclock clock reads always increase at the (scaled) rate
> -	 * of guest TSC - no need to deal with sampling errors.
> -	 */
> -	if (!use_master_clock) {
> -		if (max_kernel_ns > kernel_ns)
> -			kernel_ns = max_kernel_ns;
> -	}
>  	/* With all the info we got, fill in the values */
>  	vcpu->hv_clock.tsc_timestamp = tsc_timestamp;
>  	vcpu->hv_clock.system_time = kernel_ns + v->kvm->arch.kvmclock_offset;
> 

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Marcelo Tosatti Jan. 15, 2014, 12:25 p.m. UTC | #2
On Wed, Jan 15, 2014 at 12:43:02PM +0100, Paolo Bonzini wrote:
> Il 06/01/2014 15:18, Marcelo Tosatti ha scritto:
> > 
> > To fix a problem related to different resolution of TSC and system clock,
> > the offset in TSC units is approximated by 
> > 
> > delta = vcpu->hv_clock.tsc_timestamp 	- 	vcpu->last_guest_tsc
> > 
> > (Guest TSC value at 			(Guest TSC value at last VM-exit)
> > the last kvm_guest_time_update
> > call)
> > 
> > Delta is then later scaled using mult,shift pair found in hv_clock 
> > structure (which is correct against tsc_timestamp in that 
> > structure).
> > 
> > However, if a frequency change is performed between these two points, 
> > this delta is measured using different TSC frequencies, but scaled using 
> > mult,shift pair for one frequency only.
> > 
> > The end result is an incorrect delta.
> > 
> > The bug which this code works around is not the only cause for 
> > clock backwards events. The global accumulator is still
> > necessary, so remove the max_kernel_ns fix and rely on the 
> > global accumulator for no clock backwards events.
> 
> This is basically reverting commit 1d5f066 (KVM: x86: Fix a possible
> backwards warp of kvmclock, 2010-08-19).
> 
> Your commit message basically says that the guest-side 489fb49 (x86,
> paravirt: Add a global synchronization point for pvclock, 2010-05-11) is
> the real solution to the problem that the host-side commit 1d5f066 was
> trying to fix.  Right?

Yes.

> This patch makes vcpu->hv_clock.tsc_timestamp write only.  Please
> provide a follow up that drops the field entirely, then I'll apply both.

OK.

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Marcelo Tosatti Jan. 15, 2014, 12:34 p.m. UTC | #3
On Wed, Jan 15, 2014 at 12:43:02PM +0100, Paolo Bonzini wrote:
> Il 06/01/2014 15:18, Marcelo Tosatti ha scritto:
> > 
> > To fix a problem related to different resolution of TSC and system clock,
> > the offset in TSC units is approximated by 
> > 
> > delta = vcpu->hv_clock.tsc_timestamp 	- 	vcpu->last_guest_tsc
> > 
> > (Guest TSC value at 			(Guest TSC value at last VM-exit)
> > the last kvm_guest_time_update
> > call)
> > 
> > Delta is then later scaled using mult,shift pair found in hv_clock 
> > structure (which is correct against tsc_timestamp in that 
> > structure).
> > 
> > However, if a frequency change is performed between these two points, 
> > this delta is measured using different TSC frequencies, but scaled using 
> > mult,shift pair for one frequency only.
> > 
> > The end result is an incorrect delta.
> > 
> > The bug which this code works around is not the only cause for 
> > clock backwards events. The global accumulator is still
> > necessary, so remove the max_kernel_ns fix and rely on the 
> > global accumulator for no clock backwards events.
> 
> This is basically reverting commit 1d5f066 (KVM: x86: Fix a possible
> backwards warp of kvmclock, 2010-08-19).
> 
> Your commit message basically says that the guest-side 489fb49 (x86,
> paravirt: Add a global synchronization point for pvclock, 2010-05-11) is
> the real solution to the problem that the host-side commit 1d5f066 was
> trying to fix.  Right?
> 
> This patch makes vcpu->hv_clock.tsc_timestamp write only.  Please
> provide a follow up that drops the field entirely, then I'll apply both.
>  In the meanwhile:
> 
> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>

Can't do that: its inside hv_clock structure.

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Paolo Bonzini Jan. 15, 2014, 12:55 p.m. UTC | #4
Il 15/01/2014 13:34, Marcelo Tosatti ha scritto:
> On Wed, Jan 15, 2014 at 12:43:02PM +0100, Paolo Bonzini wrote:
>> Il 06/01/2014 15:18, Marcelo Tosatti ha scritto:
>>>
>>> To fix a problem related to different resolution of TSC and system clock,
>>> the offset in TSC units is approximated by 
>>>
>>> delta = vcpu->hv_clock.tsc_timestamp 	- 	vcpu->last_guest_tsc
>>>
>>> (Guest TSC value at 			(Guest TSC value at last VM-exit)
>>> the last kvm_guest_time_update
>>> call)
>>>
>>> Delta is then later scaled using mult,shift pair found in hv_clock 
>>> structure (which is correct against tsc_timestamp in that 
>>> structure).
>>>
>>> However, if a frequency change is performed between these two points, 
>>> this delta is measured using different TSC frequencies, but scaled using 
>>> mult,shift pair for one frequency only.
>>>
>>> The end result is an incorrect delta.
>>>
>>> The bug which this code works around is not the only cause for 
>>> clock backwards events. The global accumulator is still
>>> necessary, so remove the max_kernel_ns fix and rely on the 
>>> global accumulator for no clock backwards events.
>>
>> This is basically reverting commit 1d5f066 (KVM: x86: Fix a possible
>> backwards warp of kvmclock, 2010-08-19).
>>
>> Your commit message basically says that the guest-side 489fb49 (x86,
>> paravirt: Add a global synchronization point for pvclock, 2010-05-11) is
>> the real solution to the problem that the host-side commit 1d5f066 was
>> trying to fix.  Right?
>>
>> This patch makes vcpu->hv_clock.tsc_timestamp write only.  Please
>> provide a follow up that drops the field entirely, then I'll apply both.
>>  In the meanwhile:
>>
>> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
> 
> Can't do that: its inside hv_clock structure.

Right.  Another question, what about this comment:

        /* Reset of TSC must disable overshoot protection below */
        vcpu->arch.hv_clock.tsc_timestamp = 0;
        vcpu->arch.last_guest_tsc = data;

Should it be instead like this:

	vcpu->arch.hv_clock.tsc_timestamp += data - vcpu->arch.last_guest_tsc;
	vcpu->arch.last_guest_tsc = data;

?

Paolo
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Marcelo Tosatti Jan. 15, 2014, 4:37 p.m. UTC | #5
On Wed, Jan 15, 2014 at 01:55:50PM +0100, Paolo Bonzini wrote:
> Il 15/01/2014 13:34, Marcelo Tosatti ha scritto:
> > On Wed, Jan 15, 2014 at 12:43:02PM +0100, Paolo Bonzini wrote:
> >> Il 06/01/2014 15:18, Marcelo Tosatti ha scritto:
> >>>
> >>> To fix a problem related to different resolution of TSC and system clock,
> >>> the offset in TSC units is approximated by 
> >>>
> >>> delta = vcpu->hv_clock.tsc_timestamp 	- 	vcpu->last_guest_tsc
> >>>
> >>> (Guest TSC value at 			(Guest TSC value at last VM-exit)
> >>> the last kvm_guest_time_update
> >>> call)
> >>>
> >>> Delta is then later scaled using mult,shift pair found in hv_clock 
> >>> structure (which is correct against tsc_timestamp in that 
> >>> structure).
> >>>
> >>> However, if a frequency change is performed between these two points, 
> >>> this delta is measured using different TSC frequencies, but scaled using 
> >>> mult,shift pair for one frequency only.
> >>>
> >>> The end result is an incorrect delta.
> >>>
> >>> The bug which this code works around is not the only cause for 
> >>> clock backwards events. The global accumulator is still
> >>> necessary, so remove the max_kernel_ns fix and rely on the 
> >>> global accumulator for no clock backwards events.
> >>
> >> This is basically reverting commit 1d5f066 (KVM: x86: Fix a possible
> >> backwards warp of kvmclock, 2010-08-19).
> >>
> >> Your commit message basically says that the guest-side 489fb49 (x86,
> >> paravirt: Add a global synchronization point for pvclock, 2010-05-11) is
> >> the real solution to the problem that the host-side commit 1d5f066 was
> >> trying to fix.  Right?
> >>
> >> This patch makes vcpu->hv_clock.tsc_timestamp write only.  Please
> >> provide a follow up that drops the field entirely, then I'll apply both.
> >>  In the meanwhile:
> >>
> >> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
> > 
> > Can't do that: its inside hv_clock structure.
> 
> Right.  Another question, what about this comment:
> 
>         /* Reset of TSC must disable overshoot protection below */
>         vcpu->arch.hv_clock.tsc_timestamp = 0;
>         vcpu->arch.last_guest_tsc = data;
> 
> Should it be instead like this:
> 
> 	vcpu->arch.hv_clock.tsc_timestamp += data - vcpu->arch.last_guest_tsc;
> 	vcpu->arch.last_guest_tsc = data;
> 
> ?

Don't see why it should?

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Paolo Bonzini Jan. 15, 2014, 4:53 p.m. UTC | #6
Il 15/01/2014 17:37, Marcelo Tosatti ha scritto:
> > Right.  Another question, what about this comment:
> > 
> >         /* Reset of TSC must disable overshoot protection below */
> >         vcpu->arch.hv_clock.tsc_timestamp = 0;
> >         vcpu->arch.last_guest_tsc = data;
> > 
> > Should it be instead like this:
> > 
> > 	vcpu->arch.hv_clock.tsc_timestamp += data - vcpu->arch.last_guest_tsc;
> > 	vcpu->arch.last_guest_tsc = data;
> > 
> > ?
> 
> Don't see why it should?

Setting tsc_timestamp to 0 makes no sense with the current code, I'm
trying to understand if the right fix is to delete that line or
something else.  What I proposed (plus a version increase) would make a
pvclock read continuous before and after the change.

But looking more at the surrounding code, the calls to kvm_write_tsc
ultimately result in a master clock update as soon as all TSCs agree and
the master clock is re-enabled.  This master clock update rewrites
tsc_timestamp.  Then I think that the line that sets tsc_timestamp to 0
can be deleted.

Paolo
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Marcelo Tosatti Jan. 15, 2014, 4:55 p.m. UTC | #7
On Wed, Jan 15, 2014 at 05:53:22PM +0100, Paolo Bonzini wrote:
> Il 15/01/2014 17:37, Marcelo Tosatti ha scritto:
> > > Right.  Another question, what about this comment:
> > > 
> > >         /* Reset of TSC must disable overshoot protection below */
> > >         vcpu->arch.hv_clock.tsc_timestamp = 0;
> > >         vcpu->arch.last_guest_tsc = data;
> > > 
> > > Should it be instead like this:
> > > 
> > > 	vcpu->arch.hv_clock.tsc_timestamp += data - vcpu->arch.last_guest_tsc;
> > > 	vcpu->arch.last_guest_tsc = data;
> > > 
> > > ?
> > 
> > Don't see why it should?
> 
> Setting tsc_timestamp to 0 makes no sense with the current code, I'm
> trying to understand if the right fix is to delete that line or
> something else.  What I proposed (plus a version increase) would make a
> pvclock read continuous before and after the change.
> 
> But looking more at the surrounding code, the calls to kvm_write_tsc
> ultimately result in a master clock update as soon as all TSCs agree and
> the master clock is re-enabled.  This master clock update rewrites
> tsc_timestamp.  Then I think that the line that sets tsc_timestamp to 0
> can be deleted.

ACK.

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

Index: linux-2.6.git/arch/x86/kvm/x86.c
===================================================================
--- linux-2.6.git.orig/arch/x86/kvm/x86.c
+++ linux-2.6.git/arch/x86/kvm/x86.c
@@ -1484,7 +1484,7 @@  static int kvm_guest_time_update(struct
 	unsigned long flags, this_tsc_khz;
 	struct kvm_vcpu_arch *vcpu = &v->arch;
 	struct kvm_arch *ka = &v->kvm->arch;
-	s64 kernel_ns, max_kernel_ns;
+	s64 kernel_ns;
 	u64 tsc_timestamp, host_tsc;
 	struct pvclock_vcpu_time_info guest_hv_clock;
 	u8 pvclock_flags;
@@ -1543,37 +1543,6 @@  static int kvm_guest_time_update(struct
 	if (!vcpu->pv_time_enabled)
 		return 0;
 
-	/*
-	 * Time as measured by the TSC may go backwards when resetting the base
-	 * tsc_timestamp.  The reason for this is that the TSC resolution is
-	 * higher than the resolution of the other clock scales.  Thus, many
-	 * possible measurments of the TSC correspond to one measurement of any
-	 * other clock, and so a spread of values is possible.  This is not a
-	 * problem for the computation of the nanosecond clock; with TSC rates
-	 * around 1GHZ, there can only be a few cycles which correspond to one
-	 * nanosecond value, and any path through this code will inevitably
-	 * take longer than that.  However, with the kernel_ns value itself,
-	 * the precision may be much lower, down to HZ granularity.  If the
-	 * first sampling of TSC against kernel_ns ends in the low part of the
-	 * range, and the second in the high end of the range, we can get:
-	 *
-	 * (TSC - offset_low) * S + kns_old > (TSC - offset_high) * S + kns_new
-	 *
-	 * As the sampling errors potentially range in the thousands of cycles,
-	 * it is possible such a time value has already been observed by the
-	 * guest.  To protect against this, we must compute the system time as
-	 * observed by the guest and ensure the new system time is greater.
-	 */
-	max_kernel_ns = 0;
-	if (vcpu->hv_clock.tsc_timestamp) {
-		max_kernel_ns = vcpu->last_guest_tsc -
-				vcpu->hv_clock.tsc_timestamp;
-		max_kernel_ns = pvclock_scale_delta(max_kernel_ns,
-				    vcpu->hv_clock.tsc_to_system_mul,
-				    vcpu->hv_clock.tsc_shift);
-		max_kernel_ns += vcpu->last_kernel_ns;
-	}
-
 	if (unlikely(vcpu->hw_tsc_khz != this_tsc_khz)) {
 		kvm_get_time_scale(NSEC_PER_SEC / 1000, this_tsc_khz,
 				   &vcpu->hv_clock.tsc_shift,
@@ -1581,14 +1550,6 @@  static int kvm_guest_time_update(struct
 		vcpu->hw_tsc_khz = this_tsc_khz;
 	}
 
-	/* with a master <monotonic time, tsc value> tuple,
-	 * pvclock clock reads always increase at the (scaled) rate
-	 * of guest TSC - no need to deal with sampling errors.
-	 */
-	if (!use_master_clock) {
-		if (max_kernel_ns > kernel_ns)
-			kernel_ns = max_kernel_ns;
-	}
 	/* With all the info we got, fill in the values */
 	vcpu->hv_clock.tsc_timestamp = tsc_timestamp;
 	vcpu->hv_clock.system_time = kernel_ns + v->kvm->arch.kvmclock_offset;