[1/2] KVM: x86: switch KVMCLOCK base to CLOCK_MONOTONIC_RAW

Message ID 20180809145307.066923033@amt.cnet (mailing list archive)
State New, archived
Series switch kvmclock base to CLOCK_MONOTONIC_RAW (v2)

Commit Message

Marcelo Tosatti Aug. 9, 2018, 2:52 p.m. UTC
Commit 0bc48bea36d1 ("KVM: x86: update master clock before computing kvmclock_offset")
switches the order of operations to avoid the conversion 

TSC (without frequency correction) ->
system_timestamp (with frequency correction), 

which might cause a time jump.

However, it leaves every other masterclock update unsafe, which at
the moment includes:

        * HV_X64_MSR_REFERENCE_TSC MSR write.
        * TSC writes.
        * Host suspend/resume.

Avoid the time jump issue by using the frequency-uncorrected
CLOCK_MONOTONIC_RAW clock.

It is the responsibility of the guest's timekeeping software
to track a reference clock such as UTC and correct against it.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

---
 arch/x86/kvm/x86.c |   68 +++++++++++++++++++++++++++++++++++------------------
 1 file changed, 46 insertions(+), 22 deletions(-)

Comments

Paolo Bonzini Aug. 10, 2018, 7:15 a.m. UTC | #1
On 09/08/2018 16:52, Marcelo Tosatti wrote:
> Commit 0bc48bea36d1 ("KVM: x86: update master clock before computing kvmclock_offset")
> switches the order of operations to avoid the conversion 
> 
> TSC (without frequency correction) ->
> system_timestamp (with frequency correction), 
> 
> which might cause a time jump.
> 
> However, it leaves any other masterclock update unsafe, which includes, 
> at the moment:
> 
>         * HV_X64_MSR_REFERENCE_TSC MSR write.
>         * TSC writes.
>         * Host suspend/resume.
> 
> Avoid the time jump issue by using frequency uncorrected
> CLOCK_MONOTONIC_RAW clock. 
> 
> It is the responsibility of the guest's timekeeping software
> to track a reference clock such as UTC and correct against it.
> 
> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

What happens across migration?

Paolo
Marcelo Tosatti Aug. 13, 2018, 12:52 p.m. UTC | #2
On Fri, Aug 10, 2018 at 09:15:51AM +0200, Paolo Bonzini wrote:
> On 09/08/2018 16:52, Marcelo Tosatti wrote:
> > Commit 0bc48bea36d1 ("KVM: x86: update master clock before computing kvmclock_offset")
> > switches the order of operations to avoid the conversion 
> > 
> > TSC (without frequency correction) ->
> > system_timestamp (with frequency correction), 
> > 
> > which might cause a time jump.
> > 
> > However, it leaves any other masterclock update unsafe, which includes, 
> > at the moment:
> > 
> >         * HV_X64_MSR_REFERENCE_TSC MSR write.
> >         * TSC writes.
> >         * Host suspend/resume.
> > 
> > Avoid the time jump issue by using frequency uncorrected
> > CLOCK_MONOTONIC_RAW clock. 
> > 
> > It is the responsibility of the guest's timekeeping software
> > to track a reference clock such as UTC and correct against it.
> > 
> > Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
> 
> What happens across migration?
> 
> Paolo

You mean between

frequency corrected host -> frequency uncorrected host

And vice-versa? 

I'll check NTP's/Chrony's behaviour and let you know.

Thanks
Paolo Bonzini Aug. 13, 2018, 1:57 p.m. UTC | #3
On 13/08/2018 14:52, Marcelo Tosatti wrote:
>> What happens across migration?
>>
>> Paolo
> You mean between
> 
> frequency corrected host -> frequency uncorrected host
> 
> And vice-versa? 

Yes.  Worst case we can use KVM_SET_CLOCK's flags argument to switch.

Paolo

> I'll check NTP's/Chrony's behaviour and let you know.
Marcelo Tosatti Oct. 23, 2019, 2:44 p.m. UTC | #4
On Mon, Aug 13, 2018 at 09:52:55AM -0300, Marcelo Tosatti wrote:
> On Fri, Aug 10, 2018 at 09:15:51AM +0200, Paolo Bonzini wrote:
> > On 09/08/2018 16:52, Marcelo Tosatti wrote:
> > > Commit 0bc48bea36d1 ("KVM: x86: update master clock before computing kvmclock_offset")
> > > switches the order of operations to avoid the conversion 
> > > 
> > > TSC (without frequency correction) ->
> > > system_timestamp (with frequency correction), 
> > > 
> > > which might cause a time jump.
> > > 
> > > However, it leaves any other masterclock update unsafe, which includes, 
> > > at the moment:
> > > 
> > >         * HV_X64_MSR_REFERENCE_TSC MSR write.
> > >         * TSC writes.
> > >         * Host suspend/resume.
> > > 
> > > Avoid the time jump issue by using frequency uncorrected
> > > CLOCK_MONOTONIC_RAW clock. 
> > > 
> > > It is the responsibility of the guest's timekeeping software
> > > to track a reference clock such as UTC and correct against it.
> > > 
> > > Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
> > 
> > What happens across migration?
> > 
> > Paolo
> 
> You mean between
> 
> frequency corrected host -> frequency uncorrected host
> 
> And vice-versa? 
> 
> I'll check NTP's/Chrony's behaviour and let you know.
> 
> Thanks

They will measure and adjust their frequency corrections
(which is necessary anyway, for example due to temperature
variations).

http://doc.ntp.org/4.1.0/ntpd.htm

Frequency Discipline

The ntpd behavior at startup depends on whether the frequency file,
usually ntp.drift, exists. This file contains the latest estimate of
clock frequency error. When the ntpd is started and the file does not
exist, the ntpd enters a special mode designed to quickly adapt to the
particular system clock oscillator time and frequency error. This takes
approximately 15 minutes, after which the time and frequency are set to
nominal values and the ntpd enters normal mode, where the time and
frequency are continuously tracked relative to the server. After one
hour the frequency file is created and the current frequency offset
written to it. When the ntpd is started and the file does exist, the
ntpd frequency is initialized from the file and enters normal mode
immediately. After that the current frequency offset is written to the
file at hourly intervals.
Patch

Index: linux-2.6.git/arch/x86/kvm/x86.c
===================================================================
--- linux-2.6.git.orig/arch/x86/kvm/x86.c	2018-08-09 10:32:51.151942374 -0300
+++ linux-2.6.git/arch/x86/kvm/x86.c	2018-08-09 11:49:47.111293686 -0300
@@ -1240,20 +1240,25 @@ 
 }
 
 #ifdef CONFIG_X86_64
+struct pvclock_clock {
+	int vclock_mode;
+	u64 cycle_last;
+	u64 mask;
+	u32 mult;
+	u32 shift;
+};
+
 struct pvclock_gtod_data {
 	seqcount_t	seq;
 
-	struct { /* extract of a clocksource struct */
-		int vclock_mode;
-		u64	cycle_last;
-		u64	mask;
-		u32	mult;
-		u32	shift;
-	} clock;
+	struct pvclock_clock clock; /* extract of a clocksource struct */
+	struct pvclock_clock raw_clock; /* extract of a clocksource struct */
 
+	u64		boot_ns_raw;
 	u64		boot_ns;
 	u64		nsec_base;
 	u64		wall_time_sec;
+	u64		monotonic_raw_nsec;
 };
 
 static struct pvclock_gtod_data pvclock_gtod_data;
@@ -1261,10 +1266,20 @@ 
 static void update_pvclock_gtod(struct timekeeper *tk)
 {
 	struct pvclock_gtod_data *vdata = &pvclock_gtod_data;
-	u64 boot_ns;
+	u64 boot_ns, boot_ns_raw;
 
 	boot_ns = ktime_to_ns(ktime_add(tk->tkr_mono.base, tk->offs_boot));
 
+	/*
+	 * FIXME: tk->offs_boot should be converted to CLOCK_MONOTONIC_RAW
+	 * interval (that is, without frequency adjustment for that interval).
+	 *
+	 * Lack of this fix can cause system_timestamp to not be equal to
+	 * CLOCK_MONOTONIC_RAW (which happen if the host uses
+	 * suspend/resume).
+	 */
+	boot_ns_raw = ktime_to_ns(ktime_add(tk->tkr_raw.base, tk->offs_boot));
+
 	write_seqcount_begin(&vdata->seq);
 
 	/* copy pvclock gtod data */
@@ -1274,11 +1289,20 @@ 
 	vdata->clock.mult		= tk->tkr_mono.mult;
 	vdata->clock.shift		= tk->tkr_mono.shift;
 
+	vdata->raw_clock.vclock_mode	= tk->tkr_raw.clock->archdata.vclock_mode;
+	vdata->raw_clock.cycle_last	= tk->tkr_raw.cycle_last;
+	vdata->raw_clock.mask		= tk->tkr_raw.mask;
+	vdata->raw_clock.mult		= tk->tkr_raw.mult;
+	vdata->raw_clock.shift		= tk->tkr_raw.shift;
+
 	vdata->boot_ns			= boot_ns;
 	vdata->nsec_base		= tk->tkr_mono.xtime_nsec;
 
 	vdata->wall_time_sec            = tk->xtime_sec;
 
+	vdata->boot_ns_raw		= boot_ns_raw;
+	vdata->monotonic_raw_nsec	= tk->tkr_raw.xtime_nsec;
+
 	write_seqcount_end(&vdata->seq);
 }
 #endif
@@ -1713,21 +1737,21 @@ 
 	return last;
 }
 
-static inline u64 vgettsc(u64 *tsc_timestamp, int *mode)
+static inline u64 vgettsc(struct pvclock_clock *clock, u64 *tsc_timestamp,
+			  int *mode)
 {
 	long v;
-	struct pvclock_gtod_data *gtod = &pvclock_gtod_data;
 	u64 tsc_pg_val;
 
-	switch (gtod->clock.vclock_mode) {
+	switch (clock->vclock_mode) {
 	case VCLOCK_HVCLOCK:
 		tsc_pg_val = hv_read_tsc_page_tsc(hv_get_tsc_page(),
 						  tsc_timestamp);
 		if (tsc_pg_val != U64_MAX) {
 			/* TSC page valid */
 			*mode = VCLOCK_HVCLOCK;
-			v = (tsc_pg_val - gtod->clock.cycle_last) &
-				gtod->clock.mask;
+			v = (tsc_pg_val - clock->cycle_last) &
+				clock->mask;
 		} else {
 			/* TSC page invalid */
 			*mode = VCLOCK_NONE;
@@ -1736,8 +1760,8 @@ 
 	case VCLOCK_TSC:
 		*mode = VCLOCK_TSC;
 		*tsc_timestamp = read_tsc();
-		v = (*tsc_timestamp - gtod->clock.cycle_last) &
-			gtod->clock.mask;
+		v = (*tsc_timestamp - clock->cycle_last) &
+			clock->mask;
 		break;
 	default:
 		*mode = VCLOCK_NONE;
@@ -1746,10 +1770,10 @@ 
 	if (*mode == VCLOCK_NONE)
 		*tsc_timestamp = v = 0;
 
-	return v * gtod->clock.mult;
+	return v * clock->mult;
 }
 
-static int do_monotonic_boot(s64 *t, u64 *tsc_timestamp)
+static int do_monotonic_raw(s64 *t, u64 *tsc_timestamp)
 {
 	struct pvclock_gtod_data *gtod = &pvclock_gtod_data;
 	unsigned long seq;
@@ -1758,10 +1782,10 @@ 
 
 	do {
 		seq = read_seqcount_begin(&gtod->seq);
-		ns = gtod->nsec_base;
-		ns += vgettsc(tsc_timestamp, &mode);
+		ns = gtod->monotonic_raw_nsec;
+		ns += vgettsc(&gtod->raw_clock, tsc_timestamp, &mode);
 		ns >>= gtod->clock.shift;
-		ns += gtod->boot_ns;
+		ns += gtod->boot_ns_raw;
 	} while (unlikely(read_seqcount_retry(&gtod->seq, seq)));
 	*t = ns;
 
@@ -1779,7 +1803,7 @@ 
 		seq = read_seqcount_begin(&gtod->seq);
 		ts->tv_sec = gtod->wall_time_sec;
 		ns = gtod->nsec_base;
-		ns += vgettsc(tsc_timestamp, &mode);
+		ns += vgettsc(&gtod->clock, tsc_timestamp, &mode);
 		ns >>= gtod->clock.shift;
 	} while (unlikely(read_seqcount_retry(&gtod->seq, seq)));
 
@@ -1796,7 +1820,7 @@ 
 	if (!gtod_is_based_on_tsc(pvclock_gtod_data.clock.vclock_mode))
 		return false;
 
-	return gtod_is_based_on_tsc(do_monotonic_boot(kernel_ns,
+	return gtod_is_based_on_tsc(do_monotonic_raw(kernel_ns,
 						      tsc_timestamp));
 }