diff mbox series

[RFC,v2] ptp: Add vDSO-style vmclock support

Message ID 4a0a240dffc21dde4d69179288547b945142259f.camel@infradead.org (mailing list archive)
State Superseded
Delegated to: Netdev Maintainers
Headers show
Series [RFC,v2] ptp: Add vDSO-style vmclock support | expand

Checks

Context Check Description
netdev/series_format warning Single patches do not need cover letters; Target tree name not specified in the subject
netdev/tree_selection success Guessed tree name to be net-next
netdev/ynl success Generated files up to date; no warnings/errors; no diff in generated;
netdev/fixes_present success Fixes tag not required for -next series
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 842 this patch: 842
netdev/build_tools success Errors and warnings before: 0 this patch: 0
netdev/cc_maintainers warning 1 maintainers not CCed: dwmw@amazon.co.uk
netdev/build_clang success Errors and warnings before: 849 this patch: 849
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/deprecated_api success None detected
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes success No Fixes tag
netdev/build_allmodconfig_warn fail Errors and warnings before: 854 this patch: 855
netdev/checkpatch fail CHECK: Alignment should match open parenthesis CHECK: Please don't use multiple blank lines CHECK: Prefer kernel type 's32' over 'int32_t' CHECK: Prefer kernel type 'u64' over 'uint64_t' CHECK: Unbalanced braces around else statement CHECK: braces {} should be used on all arms of this statement CHECK: spaces preferred around that '|' (ctx:VxV) ERROR: code indent should use tabs where possible WARNING: Missing a blank line after declarations WARNING: Prefer "GPL" over "GPL v2" - see commit bf7fbeeae6db ("module: Cure the MODULE_LICENSE "GPL" vs. "GPL v2" bogosity") WARNING: Prefer 'unsigned int' to bare use of 'unsigned' WARNING: Prefer strscpy, strscpy_pad, or __nonstring over strncpy - see: https://github.com/KSPP/linux/issues/90 WARNING: added, moved or deleted file(s), does MAINTAINERS need updating? WARNING: line length of 81 exceeds 80 columns WARNING: line length of 82 exceeds 80 columns WARNING: line length of 84 exceeds 80 columns WARNING: line length of 87 exceeds 80 columns WARNING: line length of 92 exceeds 80 columns WARNING: memory barrier without comment WARNING: please, no spaces at the start of a line WARNING: space prohibited between function name and open parenthesis '('
netdev/build_clang_rust success No Rust files in patch. Skipping build
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/source_inline fail Was 0 now: 1

Commit Message

David Woodhouse June 25, 2024, 7:01 p.m. UTC
From: David Woodhouse <dwmw@amazon.co.uk>

The vmclock "device" provides a shared memory region with precision clock
information. By using shared memory, it is safe across Live Migration.

Like the KVM PTP clock, this can convert TSC-based cross timestamps into
KVM clock values. Unlike the KVM PTP clock, it does so only when such is
actually helpful.

The memory region of the device is also exposed to userspace so it can be
read or memory mapped by application which need reliable notification of
clock disruptions.

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
---

v2: 
 • Add gettimex64() support
 • Convert TSC values to KVM clock when appropriate
 • Require int128 support
 • Add counter_period_shift 
 • Add timeout when seq_count is invalid
 • Add flags field
 • Better comments in vmclock ABI structure
 • Explicitly forbid smearing (as clock rates would need to change)

 drivers/ptp/Kconfig          |  13 +
 drivers/ptp/Makefile         |   1 +
 drivers/ptp/ptp_vmclock.c    | 516 +++++++++++++++++++++++++++++++++++
 include/uapi/linux/vmclock.h | 138 ++++++++++
 4 files changed, 668 insertions(+)
 create mode 100644 drivers/ptp/ptp_vmclock.c
 create mode 100644 include/uapi/linux/vmclock.h

Comments

Thomas Gleixner June 25, 2024, 9:34 p.m. UTC | #1
On Tue, Jun 25 2024 at 20:01, David Woodhouse wrote:
> From: David Woodhouse <dwmw@amazon.co.uk>
>
> The vmclock "device" provides a shared memory region with precision clock
> information. By using shared memory, it is safe across Live Migration.
>
> Like the KVM PTP clock, this can convert TSC-based cross timestamps into
> KVM clock values. Unlike the KVM PTP clock, it does so only when such is
> actually helpful.
>
> The memory region of the device is also exposed to userspace so it can be
> read or memory mapped by application which need reliable notification of
> clock disruptions.

There is effort underway to expose PTP clocks to user space via
VDSO. Can we please not expose an ad hoc interface for that?

As you might have heard the sad news, I'm not feeling up to the task to
dig deeper into this right now. Give me a couple of days to look at this
with working brain.

Thanks,

        tglx
David Woodhouse June 25, 2024, 9:48 p.m. UTC | #2
On Tue, 2024-06-25 at 23:34 +0200, Thomas Gleixner wrote:
> On Tue, Jun 25 2024 at 20:01, David Woodhouse wrote:
> > From: David Woodhouse <dwmw@amazon.co.uk>
> > 
> > The vmclock "device" provides a shared memory region with precision clock
> > information. By using shared memory, it is safe across Live Migration.
> > 
> > Like the KVM PTP clock, this can convert TSC-based cross timestamps into
> > KVM clock values. Unlike the KVM PTP clock, it does so only when such is
> > actually helpful.
> > 
> > The memory region of the device is also exposed to userspace so it can be
> > read or memory mapped by application which need reliable notification of
> > clock disruptions.
> 
> There is effort underway to expose PTP clocks to user space via VDSO.

Ooh, interesting. Got a reference to that please?

>  Can we please not expose an ad hoc interface for that?

Absolutely. I'm explicitly trying to intercept the virtio-rtc
specification here, to *avoid* having to do anything ad hoc.

Note that this is a "vDSO-style" interface from hypervisor to guest via
a shared memory region, not necessarily an actual vDSO.

But yes, it *is* intended to be exposed to userspace, so that userspace
can know the *accurate* time without a system call, and know that it
hasn't been perturbed by live migration.

> As you might have heard the sad news, I'm not feeling up to the task to
> dig deeper into this right now. Give me a couple of days to look at this
> with working brain.

I have not heard any news, although now I'm making inferences.

Wishing you the best!
John Stultz June 25, 2024, 10:22 p.m. UTC | #3
On Tue, Jun 25, 2024 at 2:48 PM David Woodhouse <dwmw2@infradead.org> wrote:
> On Tue, 2024-06-25 at 23:34 +0200, Thomas Gleixner wrote:
> > On Tue, Jun 25 2024 at 20:01, David Woodhouse wrote:
> > > From: David Woodhouse <dwmw@amazon.co.uk>
> > >
> > > The vmclock "device" provides a shared memory region with precision clock
> > > information. By using shared memory, it is safe across Live Migration.
> > >
> > > Like the KVM PTP clock, this can convert TSC-based cross timestamps into
> > > KVM clock values. Unlike the KVM PTP clock, it does so only when such is
> > > actually helpful.
> > >
> > > The memory region of the device is also exposed to userspace so it can be
> > > read or memory mapped by application which need reliable notification of
> > > clock disruptions.
> >
> > There is effort underway to expose PTP clocks to user space via VDSO.
>
> Ooh, interesting. Got a reference to that please?
>
> >  Can we please not expose an ad hoc interface for that?
>
> Absolutely. I'm explicitly trying to intercept the virtio-rtc
> specification here, to *avoid* having to do anything ad hoc.
>
> Note that this is a "vDSO-style" interface from hypervisor to guest via
> a shared memory region, not necessarily an actual vDSO.
>
> But yes, it *is* intended to be exposed to userspace, so that userspace
> can know the *accurate* time without a system call, and know that it
> hasn't been perturbed by live migration.

Yea, I was going to raise a concern that just defining an mmaped
structure means it has to trust the guest logic is as expected. It's
good that it's versioned! :)

I'd fret a bit about exposing this to userland. It feels very similar
to the old powerpc systemcfg implementation that similarly mapped just
kernel data out to userland and was difficult to maintain as changes
were made. Would including a code page like a proper vdso make sense
to make this more flexible of an UABI to maintain?

thanks
-john
David Woodhouse June 26, 2024, 8:32 a.m. UTC | #4
On Tue, 2024-06-25 at 15:22 -0700, John Stultz wrote:
> On Tue, Jun 25, 2024 at 2:48 PM David Woodhouse <dwmw2@infradead.org> wrote:
> > On Tue, 2024-06-25 at 23:34 +0200, Thomas Gleixner wrote:
> > > On Tue, Jun 25 2024 at 20:01, David Woodhouse wrote:
> > > > From: David Woodhouse <dwmw@amazon.co.uk>
> > > > 
> > > > The vmclock "device" provides a shared memory region with precision clock
> > > > information. By using shared memory, it is safe across Live Migration.
> > > > 
> > > > Like the KVM PTP clock, this can convert TSC-based cross timestamps into
> > > > KVM clock values. Unlike the KVM PTP clock, it does so only when such is
> > > > actually helpful.
> > > > 
> > > > The memory region of the device is also exposed to userspace so it can be
> > > > read or memory mapped by application which need reliable notification of
> > > > clock disruptions.
> > > 
> > > There is effort underway to expose PTP clocks to user space via VDSO.
> > 
> > Ooh, interesting. Got a reference to that please?
> > 
> > >  Can we please not expose an ad hoc interface for that?
> > 
> > Absolutely. I'm explicitly trying to intercept the virtio-rtc
> > specification here, to *avoid* having to do anything ad hoc.
> > 
> > Note that this is a "vDSO-style" interface from hypervisor to guest via
> > a shared memory region, not necessarily an actual vDSO.
> > 
> > But yes, it *is* intended to be exposed to userspace, so that userspace
> > can know the *accurate* time without a system call, and know that it
> > hasn't been perturbed by live migration.
> 
> Yea, I was going to raise a concern that just defining an mmaped
> structure means it has to trust the guest logic is as expected. It's
> good that it's versioned! :)

Right. Although it's basically a pvclock, and we've had those for ages.

The main difference here is that we add an indicator that tells the
guest that it's been live migrated, so any additional NTP/PTP
refinement that the *guest* has done of its oscillator, should now be
discarded.

It's also designed to be useful in "disruption-only" mode, where the
pvclock information isn't actually populated, so *all* it does is tell
guests that their clock is now hosed due to live migration.

That part is why it needs to be mappable directly to userspace, so that
userspace can not only get a timestamp but *also* know that it's
actually valid. All without a system call.

The critical use cases are financial systems where they incur massive
fines if they submit mis-timestamped transactions, and distributed
databases which rely on accurate timestamps (and error bounds) for
eventual coherence. Live migration can screw those completely.

I'm open to changing fairly much anything about the proposal as long as
we can address those use cases (which the existing virtio-rtc and other
KVM enlightenments do not).

> I'd fret a bit about exposing this to userland. It feels very similar
> to the old powerpc systemcfg implementation that similarly mapped just
> kernel data out to userland and was difficult to maintain as changes
> were made. Would including a code page like a proper vdso make sense
> to make this more flexible of an UABI to maintain?

I think the structure itself should be stable once we've bikeshedded it
a bit. But there is certainly some potential for vDSO functions which
help us expose it to the user...

This structure exposes a 'disruption count' which is updated every time
the TSC/counter is messed with by live migration. But what is userspace
actually going to *compare* it with?

It basically needs to compare it with the disruption count when the
clock was last synchronized, so maybe the kernel could export *that* to
vDSO too, then expose a simple vDSO function which reports whether the
clock is valid?

The 'invalid' code path could turn into an actual system call which
makes the kernel (check for itself and) call ntp_clear() when the
disruption occurs. Or maybe not just ntp_clear() but actually consume
the pvclock rate information directly and apply the *new* calibration?

That kind of thing would be great, and I've definitely tried to design
the structure so that it *can* be made a first-class citizen within the
kernel's timekeeping code and used like that.

But I was going to start with a more modest proposal that it's "just a
device", and applications which care about reliable time after LM would
have to /dev/vmclock0 and mmap it and check for themselves. (Which
would be assisted by things like the ClockBound library).
Richard Cochran June 26, 2024, 4:43 p.m. UTC | #5
On Tue, Jun 25, 2024 at 11:34:24PM +0200, Thomas Gleixner wrote:

> There is effort underway to expose PTP clocks to user space via
> VDSO.

That sounds interesting.  Has anything been posted?

Thanks,
Richard
Peter Hilber June 27, 2024, 1:50 p.m. UTC | #6
On 25.06.24 21:01, David Woodhouse wrote:
> From: David Woodhouse <dwmw@amazon.co.uk>
> 
> The vmclock "device" provides a shared memory region with precision clock
> information. By using shared memory, it is safe across Live Migration.
> 
> Like the KVM PTP clock, this can convert TSC-based cross timestamps into
> KVM clock values. Unlike the KVM PTP clock, it does so only when such is
> actually helpful.
> 
> The memory region of the device is also exposed to userspace so it can be
> read or memory mapped by application which need reliable notification of
> clock disruptions.
> 
> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
> ---
> 
> v2: 
>  • Add gettimex64() support
>  • Convert TSC values to KVM clock when appropriate
>  • Require int128 support
>  • Add counter_period_shift 
>  • Add timeout when seq_count is invalid
>  • Add flags field
>  • Better comments in vmclock ABI structure
>  • Explicitly forbid smearing (as clock rates would need to change)

Leap second smearing information could still be conveyed through the
vmclock_abi. AFAIU, to cover the popular smearing variants, it should be
enough to indicate whether the driver should apply linear or cosine
smearing, and the start time and end time.

> 
>  drivers/ptp/Kconfig          |  13 +
>  drivers/ptp/Makefile         |   1 +
>  drivers/ptp/ptp_vmclock.c    | 516 +++++++++++++++++++++++++++++++++++
>  include/uapi/linux/vmclock.h | 138 ++++++++++
>  4 files changed, 668 insertions(+)
>  create mode 100644 drivers/ptp/ptp_vmclock.c
>  create mode 100644 include/uapi/linux/vmclock.h
> 

[...]

> +
> +/*
> + * Multiply a 64-bit count by a 64-bit tick 'period' in units of seconds >> 64
> + * and add the fractional second part of the reference time.
> + *
> + * The result is a 128-bit value, the top 64 bits of which are seconds, and
> + * the low 64 bits are (seconds >> 64).
> + *
> + * If __int128 isn't available, perform the calculation 32 bits at a time to
> + * avoid overflow.
> + */
> +static inline uint64_t mul_u64_u64_shr_add_u64(uint64_t *res_hi, uint64_t delta,
> +					       uint64_t period, uint8_t shift,
> +					       uint64_t frac_sec)
> +{
> +	unsigned __int128 res = (unsigned __int128)delta * period;
> +
> +	res >>= shift;
> +	res += frac_sec;
> +	*res_hi = res >> 64;
> +	return (uint64_t)res;
> +}
> +
> +static int vmclock_get_crosststamp(struct vmclock_state *st,
> +				   struct ptp_system_timestamp *sts,
> +				   struct system_counterval_t *system_counter,
> +				   struct timespec64 *tspec)
> +{
> +	ktime_t deadline = ktime_add(ktime_get(), VMCLOCK_MAX_WAIT);
> +	struct system_time_snapshot systime_snapshot;
> +	uint64_t cycle, delta, seq, frac_sec;
> +
> +#ifdef CONFIG_X86
> +	/*
> +	 * We'd expect the hypervisor to know this and to report the clock
> +	 * status as VMCLOCK_STATUS_UNRELIABLE. But be paranoid.
> +	 */
> +	if (check_tsc_unstable())
> +		return -EINVAL;
> +#endif
> +
> +	while (1) {
> +		seq = st->clk->seq_count & ~1ULL;
> +		virt_rmb();
> +
> +		if (st->clk->clock_status == VMCLOCK_STATUS_UNRELIABLE)
> +			return -EINVAL;
> +
> +		/*
> +		 * When invoked for gettimex64(), fill in the pre/post system
> +		 * times. The simple case is when system time is based on the
> +		 * same counter as st->cs_id, in which case all three times
> +		 * will be derived from the *same* counter value.
> +		 *
> +		 * If the system isn't using the same counter, then the value
> +		 * from ktime_get_snapshot() will still be used as pre_ts, and
> +		 * ptp_read_system_postts() is called to populate postts after
> +		 * calling get_cycles().
> +		 *
> +		 * The conversion to timespec64 happens further down, outside
> +		 * the seq_count loop.
> +		 */
> +		if (sts) {
> +			ktime_get_snapshot(&systime_snapshot);
> +			if (systime_snapshot.cs_id == st->cs_id) {
> +				cycle = systime_snapshot.cycles;
> +			} else {
> +				cycle = get_cycles();
> +				ptp_read_system_postts(sts);
> +			}
> +		} else
> +			cycle = get_cycles();
> +
> +		delta = cycle - st->clk->counter_value;

AFAIU in the general case this needs to be masked for non 64-bit counters.

> +
> +		frac_sec = mul_u64_u64_shr_add_u64(&tspec->tv_sec, delta,
> +						   st->clk->counter_period_frac_sec,
> +						   st->clk->counter_period_shift,
> +						   st->clk->utc_time_frac_sec);
> +		tspec->tv_nsec = mul_u64_u64_shr(frac_sec, NSEC_PER_SEC, 64);
> +		tspec->tv_sec += st->clk->utc_time_sec;
> +
> +		virt_rmb();
> +		if (seq == st->clk->seq_count)
> +			break;
> +
> +		if (ktime_after(ktime_get(), deadline))
> +			return -ETIMEDOUT;
> +	}
> +
> +	if (system_counter) {
> +		system_counter->cycles = cycle;
> +		system_counter->cs_id = st->cs_id;
> +	}
> +
> +	if (sts) {
> +		sts->pre_ts = ktime_to_timespec64(systime_snapshot.real);
> +		if (systime_snapshot.cs_id == st->cs_id)
> +			sts->post_ts = sts->pre_ts;
> +	}
> +
> +	return 0;
> +}
> +

[...]

> +
> +static const struct ptp_clock_info ptp_vmclock_info = {
> +	.owner		= THIS_MODULE,
> +	.max_adj	= 0,
> +	.n_ext_ts	= 0,
> +	.n_pins		= 0,
> +	.pps		= 0,
> +	.adjfine	= ptp_vmclock_adjfine,
> +	.adjtime	= ptp_vmclock_adjtime,
> +	.gettime64	= ptp_vmclock_gettime,

The .gettime64 op is now unneeded.

> +	.gettimex64	= ptp_vmclock_gettimex,
> +	.settime64	= ptp_vmclock_settime,
> +	.enable		= ptp_vmclock_enable,
> +	.getcrosststamp = ptp_vmclock_getcrosststamp,
> +};
> +

[...]

> diff --git a/include/uapi/linux/vmclock.h b/include/uapi/linux/vmclock.h
> new file mode 100644
> index 000000000000..cf0f22205e79
> --- /dev/null
> +++ b/include/uapi/linux/vmclock.h
> @@ -0,0 +1,138 @@
> +/* SPDX-License-Identifier: ((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause) */
> +
> +/*
> + * This structure provides a vDSO-style clock to VM guests, exposing the
> + * relationship (or lack thereof) between the CPU clock (TSC, timebase, arch
> + * counter, etc.) and real time. It is designed to address the problem of
> + * live migration, which other clock enlightenments do not.
> + *
> + * When a guest is live migrated, this affects the clock in two ways.
> + *
> + * First, even between identical hosts the actual frequency of the underlying
> + * counter will change within the tolerances of its specification (typically
> + * ±50PPM, or 4 seconds a day). The frequency also varies over time on the
> + * same host, but can be tracked by NTP as it generally varies slowly. With
> + * live migration there is a step change in the frequency, with no warning.
> + *
> + * Second, there may be a step change in the value of the counter itself, as
> + * its accuracy is limited by the precision of the NTP synchronization on the
> + * source and destination hosts.
> + *
> + * So any calibration (NTP, PTP, etc.) which the guest has done on the source
> + * host before migration is invalid, and needs to be redone on the new host.
> + *
> + * In its most basic mode, this structure provides only an indication to the
> + * guest that live migration has occurred. This allows the guest to know that
> + * its clock is invalid and take remedial action. For applications that need
> + * reliable accurate timestamps (e.g. distributed databases), the structure
> + * can be mapped all the way to userspace. This allows the application to see
> + * directly for itself that the clock is disrupted and take appropriate
> + * action, even when using a vDSO-style method to get the time instead of a
> + * system call.
> + *
> + * In its more advanced mode. this structure can also be used to expose the
> + * precise relationship of the CPU counter to real time, as calibrated by the
> + * host. This means that userspace applications can have accurate time
> + * immediately after live migration, rather than having to pause operations
> + * and wait for NTP to recover. This mode does, of course, rely on the
> + * counter being reliable and consistent across CPUs.
> + *
> + * Note that this must be true UTC, never with smeared leap seconds. If a
> + * guest wishes to construct a smeared clock, it can do so. Presenting a
> + * smeared clock through this interface would be problematic because it
> + * actually messes with the apparent counter *period*. A linear smearing
> + * of 1 ms per second would effectively tweak the counter period by 1000PPM
> + * at the start/end of the smearing period, while a sinusoidal smear would
> + * basically be impossible to represent.

Clock types other than UTC could also be supported: TAI, monotonic.

> + */
> +
> +#ifndef __VMCLOCK_H__
> +#define __VMCLOCK_H__
> +
> +#ifdef __KERNEL__
> +#include <linux/types.h>
> +#else
> +#include <stdint.h>
> +#endif
> +
> +struct vmclock_abi {
> +	uint32_t magic;
> +#define VMCLOCK_MAGIC	0x4b4c4356 /* "VCLK" */
> +	uint16_t size;		/* Size of page containing this structure */
> +	uint16_t version;	/* 1 */
> +
> +	/* Sequence lock. Low bit means an update is in progress. */
> +	uint64_t seq_count;
> +
> +	/*
> +	 * This field changes to another non-repeating value when the CPU
> +	 * counter is disrupted, for example on live migration.
> +	 */
> +	uint64_t disruption_marker;

The field could also change when the clock is stepped (leap seconds
excepted), or when the clock frequency is slewed.
David Woodhouse June 27, 2024, 2:52 p.m. UTC | #7
On Thu, 2024-06-27 at 15:50 +0200, Peter Hilber wrote:
> On 25.06.24 21:01, David Woodhouse wrote:
> > From: David Woodhouse <dwmw@amazon.co.uk>
> > 
> > The vmclock "device" provides a shared memory region with precision clock
> > information. By using shared memory, it is safe across Live Migration.
> > 
> > Like the KVM PTP clock, this can convert TSC-based cross timestamps into
> > KVM clock values. Unlike the KVM PTP clock, it does so only when such is
> > actually helpful.
> > 
> > The memory region of the device is also exposed to userspace so it can be
> > read or memory mapped by application which need reliable notification of
> > clock disruptions.
> > 
> > Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
> > ---
> > 
> > v2: 
> >  • Add gettimex64() support
> >  • Convert TSC values to KVM clock when appropriate
> >  • Require int128 support
> >  • Add counter_period_shift 
> >  • Add timeout when seq_count is invalid
> >  • Add flags field
> >  • Better comments in vmclock ABI structure
> >  • Explicitly forbid smearing (as clock rates would need to change)
> 
> Leap second smearing information could still be conveyed through the
> vmclock_abi. AFAIU, to cover the popular smearing variants, it should be
> enough to indicate whether the driver should apply linear or cosine
> smearing, and the start time and end time.

Yes. The clock information actually conveyed through the {counter,
time, rate} tuple should never be smeared, and should only ever be UTC.

But we could provide a hint to the guest operating system about what
type of smearing to perform, *if* it chooses to offer a clock other
than the standard CLOCK_REALTIME to its users.

I already added a flags field, so this might look something like:

        /*
         * Smearing flags. The UTC clock exposed through this structure
         * is only ever true UTC, but a guest operating system may
         * choose to offer a monotonic smeared clock to its users. This
         * merely offers a hint about what kind of smearing to perform,
         * for consistency with systems in the nearby environment.
         */
#define VMCLOCK_FLAGS_SMEAR_UTC_SLS (1<<5) /* draft-kuhn-leapsecond-00.txt */


(UTC-SLS is probably a bad example but are there formal definitions for
anything else?)


> > But we 
> >  drivers/ptp/Kconfig          |  13 +
> >  drivers/ptp/Makefile         |   1 +
> >  drivers/ptp/ptp_vmclock.c    | 516 +++++++++++++++++++++++++++++++++++
> >  include/uapi/linux/vmclock.h | 138 ++++++++++
> >  4 files changed, 668 insertions(+)
> >  create mode 100644 drivers/ptp/ptp_vmclock.c
> >  create mode 100644 include/uapi/linux/vmclock.h
> > 
> 
> [...]
> 
> > +
> > +/*
> > + * Multiply a 64-bit count by a 64-bit tick 'period' in units of seconds >> 64
> > + * and add the fractional second part of the reference time.
> > + *
> > + * The result is a 128-bit value, the top 64 bits of which are seconds, and
> > + * the low 64 bits are (seconds >> 64).
> > + *
> > + * If __int128 isn't available, perform the calculation 32 bits at a time to
> > + * avoid overflow.
> > + */
> > +static inline uint64_t mul_u64_u64_shr_add_u64(uint64_t *res_hi, uint64_t delta,
> > +                                              uint64_t period, uint8_t shift,
> > +                                              uint64_t frac_sec)
> > +{
> > +       unsigned __int128 res = (unsigned __int128)delta * period;
> > +
> > +       res >>= shift;
> > +       res += frac_sec;
> > +       *res_hi = res >> 64;
> > +       return (uint64_t)res;
> > +}
> > +
> > +static int vmclock_get_crosststamp(struct vmclock_state *st,
> > +                                  struct ptp_system_timestamp *sts,
> > +                                  struct system_counterval_t *system_counter,
> > +                                  struct timespec64 *tspec)
> > +{
> > +       ktime_t deadline = ktime_add(ktime_get(), VMCLOCK_MAX_WAIT);
> > +       struct system_time_snapshot systime_snapshot;
> > +       uint64_t cycle, delta, seq, frac_sec;
> > +
> > +#ifdef CONFIG_X86
> > +       /*
> > +        * We'd expect the hypervisor to know this and to report the clock
> > +        * status as VMCLOCK_STATUS_UNRELIABLE. But be paranoid.
> > +        */
> > +       if (check_tsc_unstable())
> > +               return -EINVAL;
> > +#endif
> > +
> > +       while (1) {
> > +               seq = st->clk->seq_count & ~1ULL;
> > +               virt_rmb();
> > +
> > +               if (st->clk->clock_status == VMCLOCK_STATUS_UNRELIABLE)
> > +                       return -EINVAL;
> > +
> > +               /*
> > +                * When invoked for gettimex64(), fill in the pre/post system
> > +                * times. The simple case is when system time is based on the
> > +                * same counter as st->cs_id, in which case all three times
> > +                * will be derived from the *same* counter value.
> > +                *
> > +                * If the system isn't using the same counter, then the value
> > +                * from ktime_get_snapshot() will still be used as pre_ts, and
> > +                * ptp_read_system_postts() is called to populate postts after
> > +                * calling get_cycles().
> > +                *
> > +                * The conversion to timespec64 happens further down, outside
> > +                * the seq_count loop.
> > +                */
> > +               if (sts) {
> > +                       ktime_get_snapshot(&systime_snapshot);
> > +                       if (systime_snapshot.cs_id == st->cs_id) {
> > +                               cycle = systime_snapshot.cycles;
> > +                       } else {
> > +                               cycle = get_cycles();
> > +                               ptp_read_system_postts(sts);
> > +                       }
> > +               } else
> > +                       cycle = get_cycles();
> > +
> > +               delta = cycle - st->clk->counter_value;
> 
> AFAIU in the general case this needs to be masked for non 64-bit counters.

Are there any non-64-bit counters that will be exposed this way? And
can we handle that with explicit knowledge of the counter type, rather
than a separate explicit field?


If we really dig deep, it gets horrible for scaled TSCs. The guest TSC
is actually calculated from the host TSC with a scale and offset:

  guest_tsc = ((host_tsc * SCALE) >> 48_or_32) + OFFSET

(The shift is 48 bits for Intel, 32 for AMD).

So when the host TSC wraps to zero, the guest TSC jumps from some non-
power-of-two upper limit, to OFFSET which is generally a *negative*
value. And will continue counting from there until it actually reaches
the 64-bit limit and wraps back to zero.


> 
> > +
> > +               frac_sec = mul_u64_u64_shr_add_u64(&tspec->tv_sec, delta,
> > +                                                  st->clk->counter_period_frac_sec,
> > +                                                  st->clk->counter_period_shift,
> > +                                                  st->clk->utc_time_frac_sec);
> > +               tspec->tv_nsec = mul_u64_u64_shr(frac_sec, NSEC_PER_SEC, 64);
> > +               tspec->tv_sec += st->clk->utc_time_sec;
> > +
> > +               virt_rmb();
> > +               if (seq == st->clk->seq_count)
> > +                       break;
> > +
> > +               if (ktime_after(ktime_get(), deadline))
> > +                       return -ETIMEDOUT;
> > +       }
> > +
> > +       if (system_counter) {
> > +               system_counter->cycles = cycle;
> > +               system_counter->cs_id = st->cs_id;
> > +       }
> > +
> > +       if (sts) {
> > +               sts->pre_ts = ktime_to_timespec64(systime_snapshot.real);
> > +               if (systime_snapshot.cs_id == st->cs_id)
> > +                       sts->post_ts = sts->pre_ts;
> > +       }
> > +
> > +       return 0;
> > +}
> > +
> 
> [...]
> 
> > +
> > +static const struct ptp_clock_info ptp_vmclock_info = {
> > +       .owner          = THIS_MODULE,
> > +       .max_adj        = 0,
> > +       .n_ext_ts       = 0,
> > +       .n_pins         = 0,
> > +       .pps            = 0,
> > +       .adjfine        = ptp_vmclock_adjfine,
> > +       .adjtime        = ptp_vmclock_adjtime,
> > +       .gettime64      = ptp_vmclock_gettime,
> 
> The .gettime64 op is now unneeded.

Ack, thanks.

> > +       .gettimex64     = ptp_vmclock_gettimex,
> > +       .settime64      = ptp_vmclock_settime,
> > +       .enable         = ptp_vmclock_enable,
> > +       .getcrosststamp = ptp_vmclock_getcrosststamp,
> > +};
> > +
> 
> [...]
> 
> > diff --git a/include/uapi/linux/vmclock.h b/include/uapi/linux/vmclock.h
> > new file mode 100644
> > index 000000000000..cf0f22205e79
> > --- /dev/null
> > +++ b/include/uapi/linux/vmclock.h
> > @@ -0,0 +1,138 @@
> > +/* SPDX-License-Identifier: ((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause) */
> > +
> > +/*
> > + * This structure provides a vDSO-style clock to VM guests, exposing the
> > + * relationship (or lack thereof) between the CPU clock (TSC, timebase, arch
> > + * counter, etc.) and real time. It is designed to address the problem of
> > + * live migration, which other clock enlightenments do not.
> > + *
> > + * When a guest is live migrated, this affects the clock in two ways.
> > + *
> > + * First, even between identical hosts the actual frequency of the underlying
> > + * counter will change within the tolerances of its specification (typically
> > + * ±50PPM, or 4 seconds a day). The frequency also varies over time on the
> > + * same host, but can be tracked by NTP as it generally varies slowly. With
> > + * live migration there is a step change in the frequency, with no warning.
> > + *
> > + * Second, there may be a step change in the value of the counter itself, as
> > + * its accuracy is limited by the precision of the NTP synchronization on the
> > + * source and destination hosts.
> > + *
> > + * So any calibration (NTP, PTP, etc.) which the guest has done on the source
> > + * host before migration is invalid, and needs to be redone on the new host.
> > + *
> > + * In its most basic mode, this structure provides only an indication to the
> > + * guest that live migration has occurred. This allows the guest to know that
> > + * its clock is invalid and take remedial action. For applications that need
> > + * reliable accurate timestamps (e.g. distributed databases), the structure
> > + * can be mapped all the way to userspace. This allows the application to see
> > + * directly for itself that the clock is disrupted and take appropriate
> > + * action, even when using a vDSO-style method to get the time instead of a
> > + * system call.
> > + *
> > + * In its more advanced mode. this structure can also be used to expose the
> > + * precise relationship of the CPU counter to real time, as calibrated by the
> > + * host. This means that userspace applications can have accurate time
> > + * immediately after live migration, rather than having to pause operations
> > + * and wait for NTP to recover. This mode does, of course, rely on the
> > + * counter being reliable and consistent across CPUs.
> > + *
> > + * Note that this must be true UTC, never with smeared leap seconds. If a
> > + * guest wishes to construct a smeared clock, it can do so. Presenting a
> > + * smeared clock through this interface would be problematic because it
> > + * actually messes with the apparent counter *period*. A linear smearing
> > + * of 1 ms per second would effectively tweak the counter period by 1000PPM
> > + * at the start/end of the smearing period, while a sinusoidal smear would
> > + * basically be impossible to represent.
> 
> Clock types other than UTC could also be supported: TAI, monotonic.

This exposes both TAI *and* UTC, by exposing the TAI offset. Or it can
expose UTC only, without the TAI offset if that's unknown.

Should we also have a mode for exposing TAI only, for when TAI is known
but not the offset from UTC? Is that really a likely scenario? Isn't a
host much more likely to know UTC and *not* the TAI offset?

I suppose if we have *hardware* implementations of this, they could be
based on an atomic clock and all they'll have is TAI? So OK, maybe that
makes sense?

(We'd have to add something like the ART as the counter to pair with
over an actual PCI bus, of course.)

We can add a type field like the one you have for virtio-rtc, yes?


> 
> > + */
> > +
> > +#ifndef __VMCLOCK_H__
> > +#define __VMCLOCK_H__
> > +
> > +#ifdef __KERNEL
> > +#include <linux/types.h>
> > +#else
> > +#include <stdint.h>
> > +#endif
> > +
> > +struct vmclock_abi {
> > +       uint32_t magic;
> > +#define VMCLOCK_MAGIC  0x4b4c4356 /* "VCLK" */
> > +       uint16_t size;          /* Size of page containing this structure */
> > +       uint16_t version;       /* 1 */
> > +
> > +       /* Sequence lock. Low bit means an update is in progress. */
> > +       uint64_t seq_count;
> > +
> > +       /*
> > +        * This field changes to another non-repeating value when the CPU
> > +        * counter is disrupted, for example on live migration.
> > +        */
> > +       uint64_t disruption_marker;
> 
> The field could also change when the clock is stepped (leap seconds
> excepted), or when the clock frequency is slewed.

I'm not sure. The concept of the disruption marker is that it tells the
guest to throw away any calibration of the counter that the guest has
done for *itself* (with NTP, other PTP devices, etc.).

One mode for this device would be not to populate the clock fields at
all, but *only* to signal disruption when it occurs. So the guest can
abort transactions until it's resynced its clocks (to avoid incurring
fines if breaking databases, etc.).

Exposing the host timekeeping through the structure means that the
migrated guest can keep working because it can trust the timekeeping
performed by the (new) host and exposed to it.

If the counter is actually varying in frequency over time, and the host
is slewing the clock frequency that it reports, that *isn't* a step
change and doesn't mean that the guest should throw away any
calibration that it's been doing for itself. One hopes that the guest
would have detected the *same* frequency change, and be adapting for
itself. So I don't think that should indicate a disruption.

I think the same is even true if the clock is stepped by the host. The
actual *counter* hasn't changed, so the guest is better off ignoring
the vacillating host and continuing to derive its idea of time from the
hardware counter itself, as calibrated against some external NTP/PTP
sources. Surely we actively *don't* to tell the guest to throw its own
calibrations away, in this case?
David Woodhouse June 27, 2024, 4:03 p.m. UTC | #8
I've updated the tree at
https://git.infradead.org/users/dwmw2/linux.git/shortlog/refs/heads/vmclock
(but not yet the qemu one).

I think I've taken into account all your comments apart from the one
about non-64-bit counters wrapping. I reduced the seq_count to 32 bit
to make room for a 32-bit flags field, added the time type
(UTC/TAI/MONOTONIC) and a smearing hint, with some straw man
definitions for smearing algorithms for which I could actually find
definitions.

The structure now looks like this:


struct vmclock_abi {
	uint32_t magic;
#define VMCLOCK_MAGIC	0x4b4c4356 /* "VCLK" */
	uint16_t size;		/* Size of page containing this structure */
	uint16_t version;	/* 1 */

	/* Sequence lock. Low bit means an update is in progress. */
	uint32_t seq_count;

	uint32_t flags;
	/* Indicates that the tai_offset_sec field is valid */
#define VMCLOCK_FLAG_TAI_OFFSET_VALID		(1 << 0)
	/*
	 * Optionally used to notify guests of pending maintenance events.
	 * A guest may wish to remove itself from service if an event is
	 * coming up. Two flags indicate the rough imminence of the event.
	 */
#define VMCLOCK_FLAG_DISRUPTION_SOON		(1 << 1) /* About a day */
#define VMCLOCK_FLAG_DISRUPTION_IMMINENT	(1 << 2) /* About an hour */
	/* Indicates that the utc_time_maxerror_picosec field is valid */
#define VMCLOCK_FLAG_UTC_MAXERROR_VALID		(1 << 3)
	/* Indicates counter_period_error_rate_frac_sec is valid */
#define VMCLOCK_FLAG_PERIOD_ERROR_VALID		(1 << 4)

	/*
	 * This field changes to another non-repeating value when the CPU
	 * counter is disrupted, for example on live migration. This lets
	 * the guest know that it should discard any calibration it has
	 * performed of the counter against external sources (NTP/PTP/etc.).
	 */
	uint64_t disruption_marker;

	uint8_t clock_status;
#define VMCLOCK_STATUS_UNKNOWN		0
#define VMCLOCK_STATUS_INITIALIZING	1
#define VMCLOCK_STATUS_SYNCHRONIZED	2
#define VMCLOCK_STATUS_FREERUNNING	3
#define VMCLOCK_STATUS_UNRELIABLE	4

	uint8_t counter_id;
#define VMCLOCK_COUNTER_INVALID		0
#define VMCLOCK_COUNTER_X86_TSC		1
#define VMCLOCK_COUNTER_ARM_VCNT	2
#define VMCLOCK_COUNTER_X86_ART		3

	/*
	 * By providing the offset from UTC to TAI, the guest can know both
	 * UTC and TAI reliably, whichever is indicated in the time_type
	 * field. Valid if VMCLOCK_FLAG_TAI_OFFSET_VALID is set in flags.
	 */
	int16_t tai_offset_sec;

	/*
	 * The time exposed through this device is never smeaared; if it
	 * claims to be VMCLOCK_TIME_UTC then it MUST be UTC. This field
	 * provides a hint to the guest operating system, such that *if*
	 * the guest OS wants to provide its users with an alternative
	 * clock which does not follow the POSIX CLOCK_REALTIME standard,
	 * it may do so in a fashion consistent with the other systems
	 * in the nearby environment.
	 */
	uint8_t leap_second_smearing_hint;
	/* Provide true UTC to users, unsmeared. */;
#define VMCLOCK_SMEARING_NONE			0
	/*
	 * https://aws.amazon.com/blogs/aws/look-before-you-leap-the-coming-leap-second-and-aws/
	 * From noon on the day before to noon on the day after, smear the
	 * clock by a linear 1/86400s per second.
	*/
#define VMCLOCK_SMEARING_LINEAR_86400		1
	/*
	 * draft-kuhn-leapsecond-00
	 * For the 1000s leading up to the leap second, smear the clock by
	 * clock by a linear 1ms per second.
	 */
#define VMCLOCK_SMEARING_UTC_SLS		2

	/*
	 * What time is exposed in the time_sec/time_frac_sec fields?
	 */
	uint8_t time_type;
#define VMCLOCK_TIME_UNKNOWN		0	/* Invalid / no time exposed */
#define VMCLOCK_TIME_UTC		1	/* Since 1970-01-01 00:00:00z */
#define VMCLOCK_TIME_TAI		2	/* Since 1970-01-01 00:00:00z */
#define VMCLOCK_TIME_MONOTONIC		3	/* Since undefined epoch */

	/* Bit shift for counter_period_frac_sec and its error rate */
	uint8_t counter_period_shift;

	/*
	 * Unlike in NTP, this can indicate a leap second in the past. This
	 * is needed to allow guests to derive an imprecise clock with
	 * smeared leap seconds for themselves, as some modes of smearing
	 * need the adjustments to continue even after the moment at which
	 * the leap second should have occurred.
	 */
	int8_t leapsecond_direction;
	uint64_t leapsecond_tai_sec; /* Since 1970-01-01 00:00:00z */

	/*
	 * Paired values of counter and UTC at a given point in time.
	 */
	uint64_t counter_value;
	uint64_t time_sec; /* Since 1970-01-01 00:00:00z */
	uint64_t time_frac_sec;

	/*
	 * Counter frequency, and error margin. The unit of these fields is
	 * seconds >> (64 + counter_period_shift)
	 */
	uint64_t counter_period_frac_sec;
	uint64_t counter_period_error_rate_frac_sec;

	/* Error margin of UTC reading above (± picoseconds) */
	uint64_t utc_time_maxerror_picosec;
};

#endif /*  __VMCLOCK_H__ */
Peter Hilber June 28, 2024, 11:33 a.m. UTC | #9
On 27.06.24 16:52, David Woodhouse wrote:
> On Thu, 2024-06-27 at 15:50 +0200, Peter Hilber wrote:
>> On 25.06.24 21:01, David Woodhouse wrote:
>>> From: David Woodhouse <dwmw@amazon.co.uk>
>>>
>>> The vmclock "device" provides a shared memory region with precision clock
>>> information. By using shared memory, it is safe across Live Migration.
>>>
>>> Like the KVM PTP clock, this can convert TSC-based cross timestamps into
>>> KVM clock values. Unlike the KVM PTP clock, it does so only when such is
>>> actually helpful.
>>>
>>> The memory region of the device is also exposed to userspace so it can be
>>> read or memory mapped by application which need reliable notification of
>>> clock disruptions.
>>>
>>> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
>>> ---
>>>
>>> v2: 
>>>  • Add gettimex64() support
>>>  • Convert TSC values to KVM clock when appropriate
>>>  • Require int128 support
>>>  • Add counter_period_shift 
>>>  • Add timeout when seq_count is invalid
>>>  • Add flags field
>>>  • Better comments in vmclock ABI structure
>>>  • Explicitly forbid smearing (as clock rates would need to change)
>>
>> Leap second smearing information could still be conveyed through the
>> vmclock_abi. AFAIU, to cover the popular smearing variants, it should be
>> enough to indicate whether the driver should apply linear or cosine
>> smearing, and the start time and end time.
> 
> Yes. The clock information actually conveyed through the {counter,
> time, rate} tuple should never be smeared, and should only ever be UTC.
> 
> But we could provide a hint to the guest operating system about what
> type of smearing to perform, *if* it chooses to offer a clock other
> than the standard CLOCK_REALTIME to its users.
> 
> I already added a flags field, so this might look something like:
> 
>         /*
>          * Smearing flags. The UTC clock exposed through this structure
>          * is only ever true UTC, but a guest operating system may
>          * choose to offer a monotonic smeared clock to its users. This
>          * merely offers a hint about what kind of smearing to perform,
>          * for consistency with systems in the nearby environment.
>          */
> #define VMCLOCK_FLAGS_SMEAR_UTC_SLS (1<<5) /* draft-kuhn-leapsecond-00.txt */
> 
> 
> (UTC-SLS is probably a bad example but are there formal definitions for
> anything else?)
> 
> 

I think it could also be more generic, like flags for linear smearing,
cosine smearing(?), and smear_start_sec and smear_end_sec fields (relative
to the leap second start). That could also represent UTC-SLS, and
noon-to-noon, and it would be well-defined.

This should reduce the likelihood that the guest doesn't know the smearing
variant.


[...]

>>> diff --git a/include/uapi/linux/vmclock.h b/include/uapi/linux/vmclock.h
>>> new file mode 100644
>>> index 000000000000..cf0f22205e79
>>> --- /dev/null
>>> +++ b/include/uapi/linux/vmclock.h
>>> @@ -0,0 +1,138 @@
>>> +/* SPDX-License-Identifier: ((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause) */
>>> +
>>> +/*
>>> + * This structure provides a vDSO-style clock to VM guests, exposing the
>>> + * relationship (or lack thereof) between the CPU clock (TSC, timebase, arch
>>> + * counter, etc.) and real time. It is designed to address the problem of
>>> + * live migration, which other clock enlightenments do not.
>>> + *
>>> + * When a guest is live migrated, this affects the clock in two ways.
>>> + *
>>> + * First, even between identical hosts the actual frequency of the underlying
>>> + * counter will change within the tolerances of its specification (typically
>>> + * ±50PPM, or 4 seconds a day). The frequency also varies over time on the
>>> + * same host, but can be tracked by NTP as it generally varies slowly. With
>>> + * live migration there is a step change in the frequency, with no warning.
>>> + *
>>> + * Second, there may be a step change in the value of the counter itself, as
>>> + * its accuracy is limited by the precision of the NTP synchronization on the
>>> + * source and destination hosts.
>>> + *
>>> + * So any calibration (NTP, PTP, etc.) which the guest has done on the source
>>> + * host before migration is invalid, and needs to be redone on the new host.
>>> + *
>>> + * In its most basic mode, this structure provides only an indication to the
>>> + * guest that live migration has occurred. This allows the guest to know that
>>> + * its clock is invalid and take remedial action. For applications that need
>>> + * reliable accurate timestamps (e.g. distributed databases), the structure
>>> + * can be mapped all the way to userspace. This allows the application to see
>>> + * directly for itself that the clock is disrupted and take appropriate
>>> + * action, even when using a vDSO-style method to get the time instead of a
>>> + * system call.
>>> + *
>>> + * In its more advanced mode. this structure can also be used to expose the
>>> + * precise relationship of the CPU counter to real time, as calibrated by the
>>> + * host. This means that userspace applications can have accurate time
>>> + * immediately after live migration, rather than having to pause operations
>>> + * and wait for NTP to recover. This mode does, of course, rely on the
>>> + * counter being reliable and consistent across CPUs.
>>> + *
>>> + * Note that this must be true UTC, never with smeared leap seconds. If a
>>> + * guest wishes to construct a smeared clock, it can do so. Presenting a
>>> + * smeared clock through this interface would be problematic because it
>>> + * actually messes with the apparent counter *period*. A linear smearing
>>> + * of 1 ms per second would effectively tweak the counter period by 1000PPM
>>> + * at the start/end of the smearing period, while a sinusoidal smear would
>>> + * basically be impossible to represent.
>>
>> Clock types other than UTC could also be supported: TAI, monotonic.
> 
> This exposes both TAI *and* UTC, by exposing the TAI offset. Or it can
> expose UTC only, without the TAI offset if that's unknown.
> 
> Should we also have a mode for exposing TAI only, for when TAI is known
> but not the offset from UTC? Is that really a likely scenario? Isn't a
> host much more likely to know UTC and *not* the TAI offset?
> 
> I suppose if we have *hardware* implementations of this, they could be
> based on an atomic clock and all they'll have is TAI? So OK, maybe that
> makes sense?
> 
> (We'd have to add something like the ART as the counter to pair with
> over an actual PCI bus, of course.)
> 
> We can add a type field like the one you have for virtio-rtc, yes?
> 
> 

Ack.

>>
>>> + */
>>> +
>>> +#ifndef __VMCLOCK_H__
>>> +#define __VMCLOCK_H__
>>> +
>>> +#ifdef __KERNEL
>>> +#include <linux/types.h>
>>> +#else
>>> +#include <stdint.h>
>>> +#endif
>>> +
>>> +struct vmclock_abi {
>>> +       uint32_t magic;
>>> +#define VMCLOCK_MAGIC  0x4b4c4356 /* "VCLK" */
>>> +       uint16_t size;          /* Size of page containing this structure */
>>> +       uint16_t version;       /* 1 */
>>> +
>>> +       /* Sequence lock. Low bit means an update is in progress. */
>>> +       uint64_t seq_count;
>>> +
>>> +       /*
>>> +        * This field changes to another non-repeating value when the CPU
>>> +        * counter is disrupted, for example on live migration.
>>> +        */
>>> +       uint64_t disruption_marker;
>>
>> The field could also change when the clock is stepped (leap seconds
>> excepted), or when the clock frequency is slewed.
> 
> I'm not sure. The concept of the disruption marker is that it tells the
> guest to throw away any calibration of the counter that the guest has
> done for *itself* (with NTP, other PTP devices, etc.).
> 
> One mode for this device would be not to populate the clock fields at
> all, but *only* to signal disruption when it occurs. So the guest can
> abort transactions until it's resynced its clocks (to avoid incurring
> fines if breaking databases, etc.).
> 
> Exposing the host timekeeping through the structure means that the
> migrated guest can keep working because it can trust the timekeeping
> performed by the (new) host and exposed to it.
> 
> If the counter is actually varying in frequency over time, and the host
> is slewing the clock frequency that it reports, that *isn't* a step
> change and doesn't mean that the guest should throw away any
> calibration that it's been doing for itself. One hopes that the guest
> would have detected the *same* frequency change, and be adapting for
> itself. So I don't think that should indicate a disruption.
> 
> I think the same is even true if the clock is stepped by the host. The
> actual *counter* hasn't changed, so the guest is better off ignoring
> the vacillating host and continuing to derive its idea of time from the
> hardware counter itself, as calibrated against some external NTP/PTP
> sources. Surely we actively *don't* to tell the guest to throw its own
> calibrations away, in this case?

In case the guest is also considering other time sources, it might indeed
not be a good idea to mix host clock changes into the hardware counter
disruption marker.

But if the vmclock is the authoritative source of time, it can still be
helpful to know about such changes, maybe through another marker.
Peter Hilber June 28, 2024, 11:33 a.m. UTC | #10
On 27.06.24 18:03, David Woodhouse wrote:
> 
> I've updated the tree at
> https://git.infradead.org/users/dwmw2/linux.git/shortlog/refs/heads/vmclock
> (but not yet the qemu one).
> 
> I think I've taken into account all your comments apart from the one
> about non-64-bit counters wrapping. I reduced the seq_count to 32 bit
> to make room for a 32-bit flags field, added the time type
> (UTC/TAI/MONOTONIC) and a smearing hint, with some straw man
> definitions for smearing algorithms for which I could actually find
> definitions.
> 
> The structure now looks like this:
> 
> 
> struct vmclock_abi {

[...]

> 
> 	/*
> 	 * What time is exposed in the time_sec/time_frac_sec fields?
> 	 */
> 	uint8_t time_type;
> #define VMCLOCK_TIME_UNKNOWN		0	/* Invalid / no time exposed */
> #define VMCLOCK_TIME_UTC		1	/* Since 1970-01-01 00:00:00z */
> #define VMCLOCK_TIME_TAI		2	/* Since 1970-01-01 00:00:00z */
> #define VMCLOCK_TIME_MONOTONIC		3	/* Since undefined epoch */
> 
> 	/* Bit shift for counter_period_frac_sec and its error rate */
> 	uint8_t counter_period_shift;
> 
> 	/*
> 	 * Unlike in NTP, this can indicate a leap second in the past. This
> 	 * is needed to allow guests to derive an imprecise clock with
> 	 * smeared leap seconds for themselves, as some modes of smearing
> 	 * need the adjustments to continue even after the moment at which
> 	 * the leap second should have occurred.
> 	 */
> 	int8_t leapsecond_direction;
> 	uint64_t leapsecond_tai_sec; /* Since 1970-01-01 00:00:00z */
> 
> 	/*
> 	 * Paired values of counter and UTC at a given point in time.
> 	 */
> 	uint64_t counter_value;
> 	uint64_t time_sec; /* Since 1970-01-01 00:00:00z */

Nitpick: The comment is not valid any more for TIME_MONOTONIC.
David Woodhouse June 28, 2024, 11:41 a.m. UTC | #11
On Fri, 2024-06-28 at 13:33 +0200, Peter Hilber wrote:
> 
> > 
> >         /*
> >          * What time is exposed in the time_sec/time_frac_sec fields?
> >          */
> >         uint8_t time_type;
> > #define VMCLOCK_TIME_UNKNOWN            0       /* Invalid / no time exposed */
> > #define VMCLOCK_TIME_UTC                1       /* Since 1970-01-01 00:00:00z */
> > #define VMCLOCK_TIME_TAI                2       /* Since 1970-01-01 00:00:00z */
> > #define VMCLOCK_TIME_MONOTONIC          3       /* Since undefined epoch */
> > 
> >         /* Bit shift for counter_period_frac_sec and its error rate */
> >         uint8_t counter_period_shift;
> > 
> >         /*
> >          * Unlike in NTP, this can indicate a leap second in the past. This
> >          * is needed to allow guests to derive an imprecise clock with
> >          * smeared leap seconds for themselves, as some modes of smearing
> >          * need the adjustments to continue even after the moment at which
> >          * the leap second should have occurred.
> >          */
> >         int8_t leapsecond_direction;
> >         uint64_t leapsecond_tai_sec; /* Since 1970-01-01 00:00:00z */
> > 
> >         /*
> >          * Paired values of counter and UTC at a given point in time.
> >          */
> >         uint64_t counter_value;
> >         uint64_t time_sec; /* Since 1970-01-01 00:00:00z */
> 
> Nitpick: The comment is not valid any more for TIME_MONOTONIC.

Ah yes, I "moved" that comment up to the UTC/TAI time_type values, but
neglected to actually delete it from here. Fixed; thanks.
David Woodhouse June 28, 2024, 12:15 p.m. UTC | #12
On Fri, 2024-06-28 at 13:33 +0200, Peter Hilber wrote:
> On 27.06.24 16:52, David Woodhouse wrote:
> > I already added a flags field, so this might look something like:
> > 
> >         /*
> >          * Smearing flags. The UTC clock exposed through this structure
> >          * is only ever true UTC, but a guest operating system may
> >          * choose to offer a monotonic smeared clock to its users. This
> >          * merely offers a hint about what kind of smearing to perform,
> >          * for consistency with systems in the nearby environment.
> >          */
> > #define VMCLOCK_FLAGS_SMEAR_UTC_SLS (1<<5) /* draft-kuhn-leapsecond-00.txt */
> > 
> > (UTC-SLS is probably a bad example but are there formal definitions for
> > anything else?)
> 
> I think it could also be more generic, like flags for linear smearing,
> cosine smearing(?), and smear_start_sec and smear_end_sec fields (relative
> to the leap second start). That could also represent UTC-SLS, and
> noon-to-noon, and it would be well-defined.
> 
> This should reduce the likelihood that the guest doesn't know the smearing
> variant.

I'm wary of making it too generic. That would seem to encourage a
*proliferation* of false "UTC-like" clocks.

It's bad enough that we do smearing at all, let alone that we don't
have a single definition of how to do it.

I made the smearing hint a full uint8_t instead of using bits in flags,
in the end. That gives us a full 255 ways of lying to users about what
the time is, so we're unlikely to run out. And it's easy enough to add
a new VMCLOCK_SMEARING_XXX type to the 'registry' for any new methods
that get invented.


> > > > +       /*
> > > > +        * This field changes to another non-repeating value when the CPU
> > > > +        * counter is disrupted, for example on live migration.
> > > > +        */
> > > > +       uint64_t disruption_marker;
> > > 
> > > The field could also change when the clock is stepped (leap seconds
> > > excepted), or when the clock frequency is slewed.
> > 
> > I'm not sure. The concept of the disruption marker is that it tells the
> > guest to throw away any calibration of the counter that the guest has
> > done for *itself* (with NTP, other PTP devices, etc.).
> > 
> > One mode for this device would be not to populate the clock fields at
> > all, but *only* to signal disruption when it occurs. So the guest can
> > abort transactions until it's resynced its clocks (to avoid incurring
> > fines if breaking databases, etc.).
> > 
> > Exposing the host timekeeping through the structure means that the
> > migrated guest can keep working because it can trust the timekeeping
> > performed by the (new) host and exposed to it.
> > 
> > If the counter is actually varying in frequency over time, and the host
> > is slewing the clock frequency that it reports, that *isn't* a step
> > change and doesn't mean that the guest should throw away any
> > calibration that it's been doing for itself. One hopes that the guest
> > would have detected the *same* frequency change, and be adapting for
> > itself. So I don't think that should indicate a disruption.
> > 
> > I think the same is even true if the clock is stepped by the host. The
> > actual *counter* hasn't changed, so the guest is better off ignoring
> > the vacillating host and continuing to derive its idea of time from the
> > hardware counter itself, as calibrated against some external NTP/PTP
> > sources. Surely we actively *don't* to tell the guest to throw its own
> > calibrations away, in this case?
> 
> In case the guest is also considering other time sources, it might indeed
> not be a good idea to mix host clock changes into the hardware counter
> disruption marker.
> 
> But if the vmclock is the authoritative source of time, it can still be
> helpful to know about such changes, maybe through another marker.

Could that be the existing seq_count field?

Skewing the counter_period_frac_sec as the underlying oscillator speeds
up and slows down is perfectly normal and expected, and we already
expect the seq_count to change when that happens.

Maybe step changes are different, but arguably if the time advertised
by the host steps *outside* the error bounds previously advertised,
that's just broken?

Depending on how the clock information is fed, a change in seq_count
may even result in non-monotonicity. If the underlying oscillator has
sped up and the structure is updated accordingly, the time calculated
the moment *before* that update may appear later than the time
calculated immediately after it.

It's up to the guest operating system to feed that information into its
own timekeeping system and skew towards correctness instead of stepping
the time it reports to its users.
Peter Hilber June 28, 2024, 4:38 p.m. UTC | #13
On 28.06.24 14:15, David Woodhouse wrote:
> On Fri, 2024-06-28 at 13:33 +0200, Peter Hilber wrote:
>> On 27.06.24 16:52, David Woodhouse wrote:
>>> I already added a flags field, so this might look something like:
>>>
>>>         /*
>>>          * Smearing flags. The UTC clock exposed through this structure
>>>          * is only ever true UTC, but a guest operating system may
>>>          * choose to offer a monotonic smeared clock to its users. This
>>>          * merely offers a hint about what kind of smearing to perform,
>>>          * for consistency with systems in the nearby environment.
>>>          */
>>> #define VMCLOCK_FLAGS_SMEAR_UTC_SLS (1<<5) /* draft-kuhn-leapsecond-00.txt */
>>>
>>> (UTC-SLS is probably a bad example but are there formal definitions for
>>> anything else?)
>>
>> I think it could also be more generic, like flags for linear smearing,
>> cosine smearing(?), and smear_start_sec and smear_end_sec fields (relative
>> to the leap second start). That could also represent UTC-SLS, and
>> noon-to-noon, and it would be well-defined.
>>
>> This should reduce the likelihood that the guest doesn't know the smearing
>> variant.
> 
> I'm wary of making it too generic. That would seem to encourage a
> *proliferation* of false "UTC-like" clocks.
> 
> It's bad enough that we do smearing at all, let alone that we don't
> have a single definition of how to do it.
> 
> I made the smearing hint a full uint8_t instead of using bits in flags,
> in the end. That gives us a full 255 ways of lying to users about what
> the time is, so we're unlikely to run out. And it's easy enough to add
> a new VMCLOCK_SMEARING_XXX type to the 'registry' for any new methods
> that get invented.
> 
> 

My concern is that the registry update may come after a driver has already
been implemented, so that it may be hard to ensure that the smearing which
has been chosen is actually implemented.

>>>>> +       /*
>>>>> +        * This field changes to another non-repeating value when the CPU
>>>>> +        * counter is disrupted, for example on live migration.
>>>>> +        */
>>>>> +       uint64_t disruption_marker;
>>>>
>>>> The field could also change when the clock is stepped (leap seconds
>>>> excepted), or when the clock frequency is slewed.
>>>
>>> I'm not sure. The concept of the disruption marker is that it tells the
>>> guest to throw away any calibration of the counter that the guest has
>>> done for *itself* (with NTP, other PTP devices, etc.).
>>>
>>> One mode for this device would be not to populate the clock fields at
>>> all, but *only* to signal disruption when it occurs. So the guest can
>>> abort transactions until it's resynced its clocks (to avoid incurring
>>> fines if breaking databases, etc.).
>>>
>>> Exposing the host timekeeping through the structure means that the
>>> migrated guest can keep working because it can trust the timekeeping
>>> performed by the (new) host and exposed to it.
>>>
>>> If the counter is actually varying in frequency over time, and the host
>>> is slewing the clock frequency that it reports, that *isn't* a step
>>> change and doesn't mean that the guest should throw away any
>>> calibration that it's been doing for itself. One hopes that the guest
>>> would have detected the *same* frequency change, and be adapting for
>>> itself. So I don't think that should indicate a disruption.
>>>
>>> I think the same is even true if the clock is stepped by the host. The
>>> actual *counter* hasn't changed, so the guest is better off ignoring
>>> the vacillating host and continuing to derive its idea of time from the
>>> hardware counter itself, as calibrated against some external NTP/PTP
>>> sources. Surely we actively *don't* to tell the guest to throw its own
>>> calibrations away, in this case?
>>
>> In case the guest is also considering other time sources, it might indeed
>> not be a good idea to mix host clock changes into the hardware counter
>> disruption marker.
>>
>> But if the vmclock is the authoritative source of time, it can still be
>> helpful to know about such changes, maybe through another marker.
> 
> Could that be the existing seq_count field?
> 
> Skewing the counter_period_frac_sec as the underlying oscillator speeds
> up and slows down is perfectly normal and expected, and we already
> expect the seq_count to change when that happens.
> 
> Maybe step changes are different, but arguably if the time advertised
> by the host steps *outside* the error bounds previously advertised,
> that's just broken?

But the error bounds could be large or missing. I am trying to address use
cases where the host steps or slews the clock as well.

> 
> Depending on how the clock information is fed, a change in seq_count
> may even result in non-monotonicity. If the underlying oscillator has
> sped up and the structure is updated accordingly, the time calculated
> the moment *before* that update may appear later than the time
> calculated immediately after it.
> 
> It's up to the guest operating system to feed that information into its
> own timekeeping system and skew towards correctness instead of stepping
> the time it reports to its users.
> 

The guest can anyway infer from the other information that the clock
changed, so my proposal might not be that useful. Maybe it can be added in
a future version if there is any need.
David Woodhouse June 28, 2024, 9:27 p.m. UTC | #14
On 28 June 2024 17:38:15 BST, Peter Hilber <peter.hilber@opensynergy.com> wrote:
>On 28.06.24 14:15, David Woodhouse wrote:
>> On Fri, 2024-06-28 at 13:33 +0200, Peter Hilber wrote:
>>> On 27.06.24 16:52, David Woodhouse wrote:
>>>> I already added a flags field, so this might look something like:
>>>>
>>>>         /*
>>>>          * Smearing flags. The UTC clock exposed through this structure
>>>>          * is only ever true UTC, but a guest operating system may
>>>>          * choose to offer a monotonic smeared clock to its users. This
>>>>          * merely offers a hint about what kind of smearing to perform,
>>>>          * for consistency with systems in the nearby environment.
>>>>          */
>>>> #define VMCLOCK_FLAGS_SMEAR_UTC_SLS (1<<5) /* draft-kuhn-leapsecond-00.txt */
>>>>
>>>> (UTC-SLS is probably a bad example but are there formal definitions for
>>>> anything else?)
>>>
>>> I think it could also be more generic, like flags for linear smearing,
>>> cosine smearing(?), and smear_start_sec and smear_end_sec fields (relative
>>> to the leap second start). That could also represent UTC-SLS, and
>>> noon-to-noon, and it would be well-defined.
>>>
>>> This should reduce the likelihood that the guest doesn't know the smearing
>>> variant.
>> 
>> I'm wary of making it too generic. That would seem to encourage a
>> *proliferation* of false "UTC-like" clocks.
>> 
>> It's bad enough that we do smearing at all, let alone that we don't
>> have a single definition of how to do it.
>> 
>> I made the smearing hint a full uint8_t instead of using bits in flags,
>> in the end. That gives us a full 255 ways of lying to users about what
>> the time is, so we're unlikely to run out. And it's easy enough to add
>> a new VMCLOCK_SMEARING_XXX type to the 'registry' for any new methods
>> that get invented.
>> 
>> 
>
>My concern is that the registry update may come after a driver has already
>been implemented, so that it may be hard to ensure that the smearing which
>has been chosen is actually implemented.

Well yes, but why in the name of all that is holy would anyone want to invent *new* ways to lie to users about the time? If we capture the existing ones as we write this, surely it's a good thing that there's a barrier to entry for adding more?


>But the error bounds could be large or missing. I am trying to address use
>cases where the host steps or slews the clock as well.

The host is absolutely intended to be skewing the clock to keep it accurate as the frequency of the underlying oscillator changes, and the seq_count field will change every time the host does so.

Do we need to handle steps differently? Or just let the guest deal with it?

If an NTP server suddenly steps the time it reports, what does the client do?
Simon Horman June 30, 2024, 1:28 p.m. UTC | #15
+ Kees Cook, linux-hardening

On Tue, Jun 25, 2024 at 08:01:56PM +0100, David Woodhouse wrote:
> From: David Woodhouse <dwmw@amazon.co.uk>
> 
> The vmclock "device" provides a shared memory region with precision clock
> information. By using shared memory, it is safe across Live Migration.
> 
> Like the KVM PTP clock, this can convert TSC-based cross timestamps into
> KVM clock values. Unlike the KVM PTP clock, it does so only when such is
> actually helpful.
> 
> The memory region of the device is also exposed to userspace so it can be
> read or memory mapped by application which need reliable notification of
> clock disruptions.
> 
> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>

...

> diff --git a/drivers/ptp/ptp_vmclock.c b/drivers/ptp/ptp_vmclock.c

...

> +static int vmclock_probe(struct platform_device *pdev)
> +{

...

> +	/* If there is valid clock information, register a PTP clock */
> +	if (st->cs_id) {
> +		st->ptp_clock_info = ptp_vmclock_info;
> +		strncpy(st->ptp_clock_info.name, st->name, sizeof(st->ptp_clock_info.name));

Hi David,

W=1 allmodconfig builds with gcc-13 flag the following.
Reading the documentation of strncpy() in fortify-string.h,
I wonder if strscpy() would be more appropriate in this case.

In file included from ./include/linux/string.h:374,
                 from ./include/linux/bitmap.h:13,
                 from ./include/linux/cpumask.h:13,
                 from ./arch/x86/include/asm/paravirt.h:21,
                 from ./arch/x86/include/asm/cpuid.h:62,
                 from ./arch/x86/include/asm/processor.h:19,
                 from ./include/linux/sched.h:13,
                 from ./include/linux/ratelimit.h:6,
                 from ./include/linux/dev_printk.h:16,
                 from ./include/linux/device.h:15,
                 from drivers/ptp/ptp_vmclock.c:8:
In function 'strncpy',
    inlined from 'vmclock_probe' at drivers/ptp/ptp_vmclock.c:480:3:
./include/linux/fortify-string.h:125:33: warning: '__builtin_strncpy' specified bound 32 equals destination size [-Wstringop-truncation]
  125 | #define __underlying_strncpy    __builtin_strncpy
      |                                 ^
./include/linux/fortify-string.h:205:16: note: in expansion of macro '__underlying_strncpy'
  205 |         return __underlying_strncpy(p, q, size);

...
David Woodhouse July 1, 2024, 8:02 a.m. UTC | #16
On Sun, 2024-06-30 at 14:28 +0100, Simon Horman wrote:
> 
> W=1 allmodconfig builds with gcc-13 flag the following.
> Reading the documentation of strncpy() in fortify-string.h,
> I wonder if strscpy() would be more appropriate in this case.

It's never going to matter in practice, as the buffer is 32 bytes, so
"vmclock%d" would never get to the end even if was a 64-bit integer.

But as a matter of hygiene and best practice, yes strscpy() would be
more appropriate. Fixed in git, thanks.
David Woodhouse July 1, 2024, 8:57 a.m. UTC | #17
On Fri, 2024-06-28 at 22:27 +0100, David Woodhouse wrote:
> On 28 June 2024 17:38:15 BST, Peter Hilber <peter.hilber@opensynergy.com> wrote:
> > On 28.06.24 14:15, David Woodhouse wrote:
> > > On Fri, 2024-06-28 at 13:33 +0200, Peter Hilber wrote:
> > > > On 27.06.24 16:52, David Woodhouse wrote:
> > > > > I already added a flags field, so this might look something like:
> > > > > 
> > > > >         /*
> > > > >          * Smearing flags. The UTC clock exposed through this structure
> > > > >          * is only ever true UTC, but a guest operating system may
> > > > >          * choose to offer a monotonic smeared clock to its users. This
> > > > >          * merely offers a hint about what kind of smearing to perform,
> > > > >          * for consistency with systems in the nearby environment.
> > > > >          */
> > > > > #define VMCLOCK_FLAGS_SMEAR_UTC_SLS (1<<5) /* draft-kuhn-leapsecond-00.txt */
> > > > > 
> > > > > (UTC-SLS is probably a bad example but are there formal definitions for
> > > > > anything else?)
> > > > 
> > > > I think it could also be more generic, like flags for linear smearing,
> > > > cosine smearing(?), and smear_start_sec and smear_end_sec fields (relative
> > > > to the leap second start). That could also represent UTC-SLS, and
> > > > noon-to-noon, and it would be well-defined.
> > > > 
> > > > This should reduce the likelihood that the guest doesn't know the smearing
> > > > variant.
> > > 
> > > I'm wary of making it too generic. That would seem to encourage a
> > > *proliferation* of false "UTC-like" clocks.
> > > 
> > > It's bad enough that we do smearing at all, let alone that we don't
> > > have a single definition of how to do it.
> > > 
> > > I made the smearing hint a full uint8_t instead of using bits in flags,
> > > in the end. That gives us a full 255 ways of lying to users about what
> > > the time is, so we're unlikely to run out. And it's easy enough to add
> > > a new VMCLOCK_SMEARING_XXX type to the 'registry' for any new methods
> > > that get invented.
> > > 
> > > 
> > 
> > My concern is that the registry update may come after a driver has already
> > been implemented, so that it may be hard to ensure that the smearing which
> > has been chosen is actually implemented.
> 
> Well yes, but why in the name of all that is holy would anyone want
> to invent *new* ways to lie to users about the time? If we capture
> the existing ones as we write this, surely it's a good thing that
> there's a barrier to entry for adding more?

Ultimately though, this isn't the hill for me to die on. I'm pushing on
that topic because I want to avoid the proliferation of *ambiguity*. If
we have a precision clock, we should *know* what the time is.

So how about this proposal. I line up the fields in the proposed shared
memory structure to match your virtio-rtc proposal, using 'subtype' as
you proposed. But, instead of the 'subtype' being valid only for
VIRTIO_RTC_CLOCK_UTC, we define a new top-level type for *smeared* UTC.

So, you have:

+\begin{lstlisting}
+#define VIRTIO_RTC_CLOCK_UTC 0
+#define VIRTIO_RTC_CLOCK_TAI 1
+#define VIRTIO_RTC_CLOCK_MONO 2
+\end{lstlisting}

I propose that you add

#define VIRTIO_RTC_CLOCK_SMEARED_UTC 3

If my proposed memory structure is subsumed into the virtio-rtc
proposal we'd literally use the same names, but for the time being I'll
update mine to:

	/*
	 * What time is exposed in the time_sec/time_frac_sec fields?
	 */
	uint8_t time_type;
#define VMCLOCK_TIME_UTC		0	/* Since 1970-01-01 00:00:00z */
#define VMCLOCK_TIME_TAI		1	/* Since 1970-01-01 00:00:00z */
#define VMCLOCK_TIME_MONOTONIC		2	/* Since undefined epoch */
#define VMCLOCK_TIME_INVALID		3	/* virtio-rtc uses this for smeared UTC */


I can then use your smearing subtype values as the 'hint' field in the
shared memory structure. You currently have:

+\begin{lstlisting}
+#define VIRTIO_RTC_SUBTYPE_STRICT 0
+#define VIRTIO_RTC_SUBTYPE_SMEAR 1
+#define VIRTIO_RTC_SUBTYPE_SMEAR_NOON_LINEAR 2
+#define VIRTIO_RTC_SUBTYPE_LEAP_UNSPECIFIED 3
+\end{lstlisting}

I can certainly ensure that 'noon linear' has the same value. I don't
think you need both 'SMEAR' and 'LEAP_UNSPECIFIED' though:


+\item VIRTIO_RTC_SUBTYPE_SMEAR deviates from the UTC standard by
+	smearing time in the vicinity of the leap second, in a not
+	precisely defined manner. This avoids clock steps due to UTC
+	leap seconds.

...

+\item VIRTIO_RTC_SUBTYPE_LEAP_UNSPECIFIED may deviate from the UTC
+	standard w.r.t.\ leap second introduction in an unspecified
way
+	(leap seconds may, or may not, be smeared).

To the client, both of those just mean "for a day or so around a leap
second event, you can't trust this device to know what the time is".
There isn't any point in separating "does lie to you" from "might lie
to you", surely? The guest can't do anything useful with that
distinction. Let's drop SMEAR and keep only LEAP_UNSPECIFIED?

And if you *really* want to parameterise it, I think that's a bad idea
and it encourages the proliferation of different time "standards", but
I'd probably just suck it up and do whatever you do because that's not
strictly within the remit of my live-migration part.
Kees Cook July 1, 2024, 3:39 p.m. UTC | #18
On Mon, Jul 01, 2024 at 09:02:38AM +0100, David Woodhouse wrote:
> On Sun, 2024-06-30 at 14:28 +0100, Simon Horman wrote:
> > 
> > W=1 allmodconfig builds with gcc-13 flag the following.
> > Reading the documentation of strncpy() in fortify-string.h,
> > I wonder if strscpy() would be more appropriate in this case.
> 
> It's never going to matter in practice, as the buffer is 32 bytes, so
> "vmclock%d" would never get to the end even if was a 64-bit integer.
> 
> But as a matter of hygiene and best practice, yes strscpy() would be
> more appropriate. Fixed in git, thanks.

Thanks! You can even use the 2-argument version. :)
Peter Hilber July 2, 2024, 3:03 p.m. UTC | #19
On 01.07.24 10:57, David Woodhouse wrote:
> On Fri, 2024-06-28 at 22:27 +0100, David Woodhouse wrote:
>> On 28 June 2024 17:38:15 BST, Peter Hilber <peter.hilber@opensynergy.com> wrote:
>>> On 28.06.24 14:15, David Woodhouse wrote:
>>>> On Fri, 2024-06-28 at 13:33 +0200, Peter Hilber wrote:
>>>>> On 27.06.24 16:52, David Woodhouse wrote:
>>>>>> I already added a flags field, so this might look something like:
>>>>>>
>>>>>>         /*
>>>>>>          * Smearing flags. The UTC clock exposed through this structure
>>>>>>          * is only ever true UTC, but a guest operating system may
>>>>>>          * choose to offer a monotonic smeared clock to its users. This
>>>>>>          * merely offers a hint about what kind of smearing to perform,
>>>>>>          * for consistency with systems in the nearby environment.
>>>>>>          */
>>>>>> #define VMCLOCK_FLAGS_SMEAR_UTC_SLS (1<<5) /* draft-kuhn-leapsecond-00.txt */
>>>>>>
>>>>>> (UTC-SLS is probably a bad example but are there formal definitions for
>>>>>> anything else?)
>>>>>
>>>>> I think it could also be more generic, like flags for linear smearing,
>>>>> cosine smearing(?), and smear_start_sec and smear_end_sec fields (relative
>>>>> to the leap second start). That could also represent UTC-SLS, and
>>>>> noon-to-noon, and it would be well-defined.
>>>>>
>>>>> This should reduce the likelihood that the guest doesn't know the smearing
>>>>> variant.
>>>>
>>>> I'm wary of making it too generic. That would seem to encourage a
>>>> *proliferation* of false "UTC-like" clocks.
>>>>
>>>> It's bad enough that we do smearing at all, let alone that we don't
>>>> have a single definition of how to do it.
>>>>
>>>> I made the smearing hint a full uint8_t instead of using bits in flags,
>>>> in the end. That gives us a full 255 ways of lying to users about what
>>>> the time is, so we're unlikely to run out. And it's easy enough to add
>>>> a new VMCLOCK_SMEARING_XXX type to the 'registry' for any new methods
>>>> that get invented.
>>>>
>>>>
>>>
>>> My concern is that the registry update may come after a driver has already
>>> been implemented, so that it may be hard to ensure that the smearing which
>>> has been chosen is actually implemented.
>>
>> Well yes, but why in the name of all that is holy would anyone want
>> to invent *new* ways to lie to users about the time? If we capture
>> the existing ones as we write this, surely it's a good thing that
>> there's a barrier to entry for adding more?
> 
> Ultimately though, this isn't the hill for me to die on. I'm pushing on
> that topic because I want to avoid the proliferation of *ambiguity*. If
> we have a precision clock, we should *know* what the time is.
> 
> So how about this proposal. I line up the fields in the proposed shared
> memory structure to match your virtio-rtc proposal, using 'subtype' as
> you proposed. But, instead of the 'subtype' being valid only for
> VIRTIO_RTC_CLOCK_UTC, we define a new top-level type for *smeared* UTC.
> 
> So, you have:
> 
> +\begin{lstlisting}
> +#define VIRTIO_RTC_CLOCK_UTC 0
> +#define VIRTIO_RTC_CLOCK_TAI 1
> +#define VIRTIO_RTC_CLOCK_MONO 2
> +\end{lstlisting}
> 
> I propose that you add
> 
> #define VIRTIO_RTC_CLOCK_SMEARED_UTC 3
> 
> If my proposed memory structure is subsumed into the virtio-rtc
> proposal we'd literally use the same names, but for the time being I'll
> update mine to:

Do you intend vmclock and virtio-rtc to be ABI compatible? FYI, I see a
potential problem in that Virtio does avoid the use of signed integers so
far. I did not check carefully if there might be other problems, yet.

> 
> 	/*
> 	 * What time is exposed in the time_sec/time_frac_sec fields?
> 	 */
> 	uint8_t time_type;
> #define VMCLOCK_TIME_UTC		0	/* Since 1970-01-01 00:00:00z */
> #define VMCLOCK_TIME_TAI		1	/* Since 1970-01-01 00:00:00z */
> #define VMCLOCK_TIME_MONOTONIC		2	/* Since undefined epoch */
> #define VMCLOCK_TIME_INVALID		3	/* virtio-rtc uses this for smeared UTC */
> 
> 
> I can then use your smearing subtype values as the 'hint' field in the
> shared memory structure. You currently have:
> 
> +\begin{lstlisting}
> +#define VIRTIO_RTC_SUBTYPE_STRICT 0
> +#define VIRTIO_RTC_SUBTYPE_SMEAR 1
> +#define VIRTIO_RTC_SUBTYPE_SMEAR_NOON_LINEAR 2
> +#define VIRTIO_RTC_SUBTYPE_LEAP_UNSPECIFIED 3
> +\end{lstlisting}
> 

I agree with the above part of your proposal.

> I can certainly ensure that 'noon linear' has the same value. I don't
> think you need both 'SMEAR' and 'LEAP_UNSPECIFIED' though:
> 
> 
> +\item VIRTIO_RTC_SUBTYPE_SMEAR deviates from the UTC standard by
> +	smearing time in the vicinity of the leap second, in a not
> +	precisely defined manner. This avoids clock steps due to UTC
> +	leap seconds.
> 
> ...
> 
> +\item VIRTIO_RTC_SUBTYPE_LEAP_UNSPECIFIED may deviate from the UTC
> +	standard w.r.t.\ leap second introduction in an unspecified
> way
> +	(leap seconds may, or may not, be smeared).
> 
> To the client, both of those just mean "for a day or so around a leap
> second event, you can't trust this device to know what the time is".
> There isn't any point in separating "does lie to you" from "might lie
> to you", surely? The guest can't do anything useful with that
> distinction. Let's drop SMEAR and keep only LEAP_UNSPECIFIED?

As for VIRTIO_RTC_SUBTYPE_SMEAR, I think this could be dropped indeed
(resp., UTC_SLS may be added).

But VIRTIO_RTC_CLOCK_SMEARED_UTC is an assurance that there will be no
steps (in particular, steps backwards, which some clients might not like)
due to leap seconds, while LEAP_UNSPECIFIED provides no such guarantee.

So I think this might be better handled by adding, alongside

> #define VIRTIO_RTC_CLOCK_SMEARED_UTC 3

#define VIRTIO_RTC_CLOCK_LEAP_UNSPECIFIED_UTC 4

(or any better name, like VIRTIO_RTC_CLOCK_MAYBE_SMEARED_UTC).

> 
> And if you *really* want to parameterise it, I think that's a bad idea
> and it encourages the proliferation of different time "standards", but
> I'd probably just suck it up and do whatever you do because that's not
> strictly within the remit of my live-migration part.

I think the above proposal to have subtypes for
VIRTIO_RTC_CLOCK_SMEARED_UTC should work.
David Woodhouse July 2, 2024, 4:39 p.m. UTC | #20
On Tue, 2024-07-02 at 17:03 +0200, Peter Hilber wrote:
> > On 01.07.24 10:57, David Woodhouse wrote:
> > > > If my proposed memory structure is subsumed into the virtio-rtc
> > > > proposal we'd literally use the same names, but for the time being I'll
> > > > update mine to:
> > 
> > Do you intend vmclock and virtio-rtc to be ABI compatible?

I intend you to incorporate a shared memory structure like the vmclock
one into the virtio-rtc standard :)

Because precision clocks in a VM are fundamentally nonsensical without
a way for the guest to know when live migration has screwed them up.

So yes, so facilitate that, I'm trying to align things so that the
numeric values of fields like the time_type and smearing hint are
*precisely* the same as the VIRTIO_RTC_xxx values.

Time pressure *may* mean I have to ship an ACPI-based device with a
preliminary version of the structure before I've finished persuading
you, and before we've completely finalized the structure. I am hoping
to avoid that.

(In fact, my time pressure only applies to a version of the structure
which carries the disruption_marker field; the actual clock calibration
information doesn't have to be present in the interim implementation.)


> >  FYI, I see a
> > potential problem in that Virtio does avoid the use of signed integers so
> > far. I did not check carefully if there might be other problems, yet.

Hm, you use an unsigned integer to convey the tai_offset. I suppose at
+37 and with a plan to stop doing leap seconds in the next decade,
we're unlikely to get back below zero?

The other signed integer I had was for the leap second direction, but I
think I'm happy to drop that and just adopt your uint8_t leap field
with VIRTIO_RTC_LEAP_{PRE_POS,PRE_NEG,etc.}.





> > > > 
> > > >         /*
> > > >          * What time is exposed in the time_sec/time_frac_sec fields?
> > > >          */
> > > >         uint8_t time_type;
> > > > #define VMCLOCK_TIME_UTC                0       /* Since 1970-01-01 00:00:00z */
> > > > #define VMCLOCK_TIME_TAI                1       /* Since 1970-01-01 00:00:00z */
> > > > #define VMCLOCK_TIME_MONOTONIC          2       /* Since undefined epoch */
> > > > #define VMCLOCK_TIME_INVALID            3       /* virtio-rtc uses this for smeared UTC */
> > > > 
> > > > 
> > > > I can then use your smearing subtype values as the 'hint' field in the
> > > > shared memory structure. You currently have:
> > > > 
> > > > +\begin{lstlisting}
> > > > +#define VIRTIO_RTC_SUBTYPE_STRICT 0
> > > > +#define VIRTIO_RTC_SUBTYPE_SMEAR 1
> > > > +#define VIRTIO_RTC_SUBTYPE_SMEAR_NOON_LINEAR 2
> > > > +#define VIRTIO_RTC_SUBTYPE_LEAP_UNSPECIFIED 3
> > > > +\end{lstlisting}
> > > > 
> > 
> > I agree with the above part of your proposal.
> > 
> > > > I can certainly ensure that 'noon linear' has the same value. I don't
> > > > think you need both 'SMEAR' and 'LEAP_UNSPECIFIED' though:
> > > > 
> > > > 
> > > > +\item VIRTIO_RTC_SUBTYPE_SMEAR deviates from the UTC standard by
> > > > +       smearing time in the vicinity of the leap second, in a not
> > > > +       precisely defined manner. This avoids clock steps due to UTC
> > > > +       leap seconds.
> > > > 
> > > > ...
> > > > 
> > > > +\item VIRTIO_RTC_SUBTYPE_LEAP_UNSPECIFIED may deviate from the UTC
> > > > +       standard w.r.t.\ leap second introduction in an unspecified
> > > > way
> > > > +       (leap seconds may, or may not, be smeared).
> > > > 
> > > > To the client, both of those just mean "for a day or so around a leap
> > > > second event, you can't trust this device to know what the time is".
> > > > There isn't any point in separating "does lie to you" from "might lie
> > > > to you", surely? The guest can't do anything useful with that
> > > > distinction. Let's drop SMEAR and keep only LEAP_UNSPECIFIED?
> > 
> > As for VIRTIO_RTC_SUBTYPE_SMEAR, I think this could be dropped indeed
> > (resp., UTC_SLS may be added).
> > 
> > But VIRTIO_RTC_CLOCK_SMEARED_UTC is an assurance that there will be no
> > steps (in particular, steps backwards, which some clients might not like)
> > due to leap seconds, while LEAP_UNSPECIFIED provides no such guarantee.
> > 
> > So I think this might be better handled by adding, alongside
> > 
> > > > #define VIRTIO_RTC_CLOCK_SMEARED_UTC 3
> > 
> > #define VIRTIO_RTC_CLOCK_LEAP_UNSPECIFIED_UTC 4
> > 
> > (or any better name, like VIRTIO_RTC_CLOCK_MAYBE_SMEARED_UTC).
> > 
> > > > 
> > > > And if you *really* want to parameterise it, I think that's a bad idea
> > > > and it encourages the proliferation of different time "standards", but
> > > > I'd probably just suck it up and do whatever you do because that's not
> > > > strictly within the remit of my live-migration part.
> > 
> > I think the above proposal to have subtypes for
> > VIRTIO_RTC_CLOCK_SMEARED_UTC should work.

To clarify then, the main types are

 VIRTIO_RTC_CLOCK_UTC == 0
 VIRTIO_RTC_CLOCK_TAI == 1
 VIRTIO_RTC_CLOCK_MONOTONIC == 2
 VIRTIO_RTC_CLOCK_SMEARED_UTC == 3

And the subtypes are *only* for the case of
VIRTIO_RTC_CLOCK_SMEARED_UTC. They include

 VIRTIO_RTC_SUBTYPE_STRICT
 VIRTIO_RTC_SUBTYPE_UNDEFINED /* or whatever you want to call it */
 VIRTIO_RTC_SUBTYPE_SMEAR_NOON_LINEAR 
 VIRTIO_RTC_SUBTYPE_UTC_SLS /* if it's worth doing this one */

Is that what we just agreed on?
Peter Hilber July 2, 2024, 6:12 p.m. UTC | #21
On 02.07.24 18:39, David Woodhouse wrote:
> On Tue, 2024-07-02 at 17:03 +0200, Peter Hilber wrote:
>>> On 01.07.24 10:57, David Woodhouse wrote:
>>>>> If my proposed memory structure is subsumed into the virtio-rtc
>>>>> proposal we'd literally use the same names, but for the time being I'll
>>>>> update mine to:
>>>
>>> Do you intend vmclock and virtio-rtc to be ABI compatible?
> 
> I intend you to incorporate a shared memory structure like the vmclock
> one into the virtio-rtc standard :)
> 
> Because precision clocks in a VM are fundamentally nonsensical without
> a way for the guest to know when live migration has screwed them up.
> 
> So yes, so facilitate that, I'm trying to align things so that the
> numeric values of fields like the time_type and smearing hint are
> *precisely* the same as the VIRTIO_RTC_xxx values.
> 
> Time pressure *may* mean I have to ship an ACPI-based device with a
> preliminary version of the structure before I've finished persuading
> you, and before we've completely finalized the structure. I am hoping
> to avoid that.
> 
> (In fact, my time pressure only applies to a version of the structure
> which carries the disruption_marker field; the actual clock calibration
> information doesn't have to be present in the interim implementation.)
> 
> 
>>>  FYI, I see a
>>> potential problem in that Virtio does avoid the use of signed integers so
>>> far. I did not check carefully if there might be other problems, yet.
> 
> Hm, you use an unsigned integer to convey the tai_offset. I suppose at
> +37 and with a plan to stop doing leap seconds in the next decade,
> we're unlikely to get back below zero?
> 

I think so.

> The other signed integer I had was for the leap second direction, but I
> think I'm happy to drop that and just adopt your uint8_t leap field
> with VIRTIO_RTC_LEAP_{PRE_POS,PRE_NEG,etc.}.
> 
> 
> 
> 
> 
>>>>>
>>>>>         /*
>>>>>          * What time is exposed in the time_sec/time_frac_sec fields?
>>>>>          */
>>>>>         uint8_t time_type;
>>>>> #define VMCLOCK_TIME_UTC                0       /* Since 1970-01-01 00:00:00z */
>>>>> #define VMCLOCK_TIME_TAI                1       /* Since 1970-01-01 00:00:00z */
>>>>> #define VMCLOCK_TIME_MONOTONIC          2       /* Since undefined epoch */
>>>>> #define VMCLOCK_TIME_INVALID            3       /* virtio-rtc uses this for smeared UTC */
>>>>>
>>>>>
>>>>> I can then use your smearing subtype values as the 'hint' field in the
>>>>> shared memory structure. You currently have:
>>>>>
>>>>> +\begin{lstlisting}
>>>>> +#define VIRTIO_RTC_SUBTYPE_STRICT 0
>>>>> +#define VIRTIO_RTC_SUBTYPE_SMEAR 1
>>>>> +#define VIRTIO_RTC_SUBTYPE_SMEAR_NOON_LINEAR 2
>>>>> +#define VIRTIO_RTC_SUBTYPE_LEAP_UNSPECIFIED 3
>>>>> +\end{lstlisting}
>>>>>
>>>
>>> I agree with the above part of your proposal.
>>>
>>>>> I can certainly ensure that 'noon linear' has the same value. I don't
>>>>> think you need both 'SMEAR' and 'LEAP_UNSPECIFIED' though:
>>>>>
>>>>>
>>>>> +\item VIRTIO_RTC_SUBTYPE_SMEAR deviates from the UTC standard by
>>>>> +       smearing time in the vicinity of the leap second, in a not
>>>>> +       precisely defined manner. This avoids clock steps due to UTC
>>>>> +       leap seconds.
>>>>>
>>>>> ...
>>>>>
>>>>> +\item VIRTIO_RTC_SUBTYPE_LEAP_UNSPECIFIED may deviate from the UTC
>>>>> +       standard w.r.t.\ leap second introduction in an unspecified
>>>>> way
>>>>> +       (leap seconds may, or may not, be smeared).
>>>>>
>>>>> To the client, both of those just mean "for a day or so around a leap
>>>>> second event, you can't trust this device to know what the time is".
>>>>> There isn't any point in separating "does lie to you" from "might lie
>>>>> to you", surely? The guest can't do anything useful with that
>>>>> distinction. Let's drop SMEAR and keep only LEAP_UNSPECIFIED?
>>>
>>> As for VIRTIO_RTC_SUBTYPE_SMEAR, I think this could be dropped indeed
>>> (resp., UTC_SLS may be added).
>>>
>>> But VIRTIO_RTC_CLOCK_SMEARED_UTC is an assurance that there will be no
>>> steps (in particular, steps backwards, which some clients might not like)
>>> due to leap seconds, while LEAP_UNSPECIFIED provides no such guarantee.
>>>
>>> So I think this might be better handled by adding, alongside
>>>
>>>>> #define VIRTIO_RTC_CLOCK_SMEARED_UTC 3
>>>
>>> #define VIRTIO_RTC_CLOCK_LEAP_UNSPECIFIED_UTC 4
>>>
>>> (or any better name, like VIRTIO_RTC_CLOCK_MAYBE_SMEARED_UTC).
>>>
>>>>>
>>>>> And if you *really* want to parameterise it, I think that's a bad idea
>>>>> and it encourages the proliferation of different time "standards", but
>>>>> I'd probably just suck it up and do whatever you do because that's not
>>>>> strictly within the remit of my live-migration part.
>>>
>>> I think the above proposal to have subtypes for
>>> VIRTIO_RTC_CLOCK_SMEARED_UTC should work.
> 
> To clarify then, the main types are
> 
>  VIRTIO_RTC_CLOCK_UTC == 0
>  VIRTIO_RTC_CLOCK_TAI == 1
>  VIRTIO_RTC_CLOCK_MONOTONIC == 2
>  VIRTIO_RTC_CLOCK_SMEARED_UTC == 3
> 
> And the subtypes are *only* for the case of
> VIRTIO_RTC_CLOCK_SMEARED_UTC. They include
> 
>  VIRTIO_RTC_SUBTYPE_STRICT
>  VIRTIO_RTC_SUBTYPE_UNDEFINED /* or whatever you want to call it */
>  VIRTIO_RTC_SUBTYPE_SMEAR_NOON_LINEAR 
>  VIRTIO_RTC_SUBTYPE_UTC_SLS /* if it's worth doing this one */
> 
> Is that what we just agreed on?
> 
> 

This is a misunderstanding. My idea was that the main types are

>  VIRTIO_RTC_CLOCK_UTC == 0
>  VIRTIO_RTC_CLOCK_TAI == 1
>  VIRTIO_RTC_CLOCK_MONOTONIC == 2
>  VIRTIO_RTC_CLOCK_SMEARED_UTC == 3

VIRTIO_RTC_CLOCK_MAYBE_SMEARED_UTC == 4

The subtypes would be (1st for clocks other than
VIRTIO_RTC_CLOCK_SMEARED_UTC, 2nd to last for
VIRTIO_RTC_CLOCK_SMEARED_UTC):

#define VIRTIO_RTC_SUBTYPE_STRICT 0
#define VIRTIO_RTC_SUBTYPE_SMEAR_NOON_LINEAR 1
#define VIRTIO_RTC_SUBTYPE_SMEAR_UTC_SLS 2
David Woodhouse July 2, 2024, 6:40 p.m. UTC | #22
On 2 July 2024 19:12:00 BST, Peter Hilber <peter.hilber@opensynergy.com> wrote:
>On 02.07.24 18:39, David Woodhouse wrote:
>> To clarify then, the main types are
>> 
>>  VIRTIO_RTC_CLOCK_UTC == 0
>>  VIRTIO_RTC_CLOCK_TAI == 1
>>  VIRTIO_RTC_CLOCK_MONOTONIC == 2
>>  VIRTIO_RTC_CLOCK_SMEARED_UTC == 3
>> 
>> And the subtypes are *only* for the case of
>> VIRTIO_RTC_CLOCK_SMEARED_UTC. They include
>> 
>>  VIRTIO_RTC_SUBTYPE_STRICT
>>  VIRTIO_RTC_SUBTYPE_UNDEFINED /* or whatever you want to call it */
>>  VIRTIO_RTC_SUBTYPE_SMEAR_NOON_LINEAR 
>>  VIRTIO_RTC_SUBTYPE_UTC_SLS /* if it's worth doing this one */
>> 
>> Is that what we just agreed on?
>> 
>> 
>
>This is a misunderstanding. My idea was that the main types are
>
>>  VIRTIO_RTC_CLOCK_UTC == 0
>>  VIRTIO_RTC_CLOCK_TAI == 1
>>  VIRTIO_RTC_CLOCK_MONOTONIC == 2
>>  VIRTIO_RTC_CLOCK_SMEARED_UTC == 3
>
>VIRTIO_RTC_CLOCK_MAYBE_SMEARED_UTC == 4
>
>The subtypes would be (1st for clocks other than
>VIRTIO_RTC_CLOCK_SMEARED_UTC, 2nd to last for
>VIRTIO_RTC_CLOCK_SMEARED_UTC):
>
>#define VIRTIO_RTC_SUBTYPE_STRICT 0
>#define VIRTIO_RTC_SUBTYPE_SMEAR_NOON_LINEAR 1
>#define VIRTIO_RTC_SUBTYPE_SMEAR_UTC_SLS 2
>

Thanks. I really do think that from the guest point of view there's really no distinction between "maybe smeared" and "undefined smearing", and have a preference for using the latter form, which is the key difference there?

Again though, not a hill for me to die on.
David Woodhouse July 3, 2024, 8 a.m. UTC | #23
On Mon, 2024-07-01 at 08:39 -0700, Kees Cook wrote:
> On Mon, Jul 01, 2024 at 09:02:38AM +0100, David Woodhouse wrote:
> > On Sun, 2024-06-30 at 14:28 +0100, Simon Horman wrote:
> > > 
> > > W=1 allmodconfig builds with gcc-13 flag the following.
> > > Reading the documentation of strncpy() in fortify-string.h,
> > > I wonder if strscpy() would be more appropriate in this case.
> > 
> > It's never going to matter in practice, as the buffer is 32 bytes, so
> > "vmclock%d" would never get to the end even if was a 64-bit integer.
> > 
> > But as a matter of hygiene and best practice, yes strscpy() would be
> > more appropriate. Fixed in git, thanks.
> 
> Thanks! You can even use the 2-argument version. :)

2-argument version? ... oh, that's *special*. I don't know whether to
be impressed or disturbed. Perhaps a little bit of both. 

Done; thanks.
Peter Hilber July 3, 2024, 9:56 a.m. UTC | #24
On 02.07.24 20:40, David Woodhouse wrote:
> On 2 July 2024 19:12:00 BST, Peter Hilber <peter.hilber@opensynergy.com> wrote:
>> On 02.07.24 18:39, David Woodhouse wrote:
>>> To clarify then, the main types are
>>>
>>>  VIRTIO_RTC_CLOCK_UTC == 0
>>>  VIRTIO_RTC_CLOCK_TAI == 1
>>>  VIRTIO_RTC_CLOCK_MONOTONIC == 2
>>>  VIRTIO_RTC_CLOCK_SMEARED_UTC == 3
>>>
>>> And the subtypes are *only* for the case of
>>> VIRTIO_RTC_CLOCK_SMEARED_UTC. They include
>>>
>>>  VIRTIO_RTC_SUBTYPE_STRICT
>>>  VIRTIO_RTC_SUBTYPE_UNDEFINED /* or whatever you want to call it */
>>>  VIRTIO_RTC_SUBTYPE_SMEAR_NOON_LINEAR 
>>>  VIRTIO_RTC_SUBTYPE_UTC_SLS /* if it's worth doing this one */
>>>
>>> Is that what we just agreed on?
>>>
>>>
>>
>> This is a misunderstanding. My idea was that the main types are
>>
>>>  VIRTIO_RTC_CLOCK_UTC == 0
>>>  VIRTIO_RTC_CLOCK_TAI == 1
>>>  VIRTIO_RTC_CLOCK_MONOTONIC == 2
>>>  VIRTIO_RTC_CLOCK_SMEARED_UTC == 3
>>
>> VIRTIO_RTC_CLOCK_MAYBE_SMEARED_UTC == 4
>>
>> The subtypes would be (1st for clocks other than
>> VIRTIO_RTC_CLOCK_SMEARED_UTC, 2nd to last for
>> VIRTIO_RTC_CLOCK_SMEARED_UTC):
>>
>> #define VIRTIO_RTC_SUBTYPE_STRICT 0
>> #define VIRTIO_RTC_SUBTYPE_SMEAR_NOON_LINEAR 1
>> #define VIRTIO_RTC_SUBTYPE_SMEAR_UTC_SLS 2
>>
> 
> Thanks. I really do think that from the guest point of view there's really no distinction between "maybe smeared" and "undefined smearing", and have a preference for using the latter form, which is the key difference there?
> 
> Again though, not a hill for me to die on.

I have no issue with staying with "undefined smearing", so would you agree
to something like

VIRTIO_RTC_CLOCK_SMEAR_UNDEFINED_UTC == 4

(or another name if you prefer)?
David Woodhouse July 3, 2024, 10:40 a.m. UTC | #25
On Wed, 2024-07-03 at 11:56 +0200, Peter Hilber wrote:
> On 02.07.24 20:40, David Woodhouse wrote:
> > On 2 July 2024 19:12:00 BST, Peter Hilber <peter.hilber@opensynergy.com> wrote:
> > > On 02.07.24 18:39, David Woodhouse wrote:
> > > > To clarify then, the main types are
> > > > 
> > > >  VIRTIO_RTC_CLOCK_UTC == 0
> > > >  VIRTIO_RTC_CLOCK_TAI == 1
> > > >  VIRTIO_RTC_CLOCK_MONOTONIC == 2
> > > >  VIRTIO_RTC_CLOCK_SMEARED_UTC == 3
> > > > 
> > > > And the subtypes are *only* for the case of
> > > > VIRTIO_RTC_CLOCK_SMEARED_UTC. They include
> > > > 
> > > >  VIRTIO_RTC_SUBTYPE_STRICT
> > > >  VIRTIO_RTC_SUBTYPE_UNDEFINED /* or whatever you want to call it */
> > > >  VIRTIO_RTC_SUBTYPE_SMEAR_NOON_LINEAR 
> > > >  VIRTIO_RTC_SUBTYPE_UTC_SLS /* if it's worth doing this one */
> > > > 
> > > > Is that what we just agreed on?
> > > > 
> > > > 
> > > 
> > > This is a misunderstanding. My idea was that the main types are
> > > 
> > > >  VIRTIO_RTC_CLOCK_UTC == 0
> > > >  VIRTIO_RTC_CLOCK_TAI == 1
> > > >  VIRTIO_RTC_CLOCK_MONOTONIC == 2
> > > >  VIRTIO_RTC_CLOCK_SMEARED_UTC == 3
> > > 
> > > VIRTIO_RTC_CLOCK_MAYBE_SMEARED_UTC == 4
> > > 
> > > The subtypes would be (1st for clocks other than
> > > VIRTIO_RTC_CLOCK_SMEARED_UTC, 2nd to last for
> > > VIRTIO_RTC_CLOCK_SMEARED_UTC):
> > > 
> > > #define VIRTIO_RTC_SUBTYPE_STRICT 0
> > > #define VIRTIO_RTC_SUBTYPE_SMEAR_NOON_LINEAR 1
> > > #define VIRTIO_RTC_SUBTYPE_SMEAR_UTC_SLS 2
> > > 
> > 
> > Thanks. I really do think that from the guest point of view there's
> > really no distinction between "maybe smeared" and "undefined
> > smearing", and have a preference for using the latter form, which
> > is the key difference there?
> > 
> > Again though, not a hill for me to die on.
> 
> I have no issue with staying with "undefined smearing", so would you agree
> to something like
> 
> VIRTIO_RTC_CLOCK_SMEAR_UNDEFINED_UTC == 4
> 
> (or another name if you prefer)?

Well, the point of contention was really whether that was a *type* or a
*subtype*.

Either way, it's a "precision clock" telling its consumer that the
device *itself* doesn't really know what time is being exposed. Which
seems like a bizarre thing to support.

But I think I've constructed an argument which persuades me to your
point of view that *if* we permit it, it should be a primary type...

A clock can *either* be UTC, *or* it can be monotonic. The whole point
of smearing is to produce a monotonic clock, of course.

VIRTIO_RTC_CLOCK_UTC is UTC. It is not monotonic.

VIRTIO_RTC_CLOCK_SMEARED is, presumably, monotonic (and I think we
should explicitly require that to be true in virtio-rtc).


But VIRTIO_RTC_CLOCK_MAYBE_SMEARED is the worst of both worlds. It is
neither known to be correct UTC, *nor* is it known to be monotonic. So
(again, if we permit it at all) I think it probably does make sense for
that to be a primary type.


This is what I currently have for 'struct vmclock_abi' that I'd like to
persuade you to adopt. I need to tweak it some more, for at least the
following reasons, as well as any more you can see:

 • size isn't big enough for 64KiB pages
 • Should be explicitly little-endian
 • Does it need esterror as well as maxerror?
 • Why is maxerror in picoseconds? It's the only use of that unit
 • Where do the clock_status values come from? Do they make sense?
 • Are signed integers OK? (I think so!).

 
/*
 * This structure provides a vDSO-style clock to VM guests, exposing the
 * relationship (or lack thereof) between the CPU clock (TSC, timebase, arch
 * counter, etc.) and real time. It is designed to address the problem of
 * live migration, which other clock enlightenments do not.
 *
 * When a guest is live migrated, this affects the clock in two ways.
 *
 * First, even between identical hosts the actual frequency of the underlying
 * counter will change within the tolerances of its specification (typically
 * ±50PPM, or 4 seconds a day). This frequency also varies over time on the
 * same host, but can be tracked by NTP as it generally varies slowly. With
 * live migration there is a step change in the frequency, with no warning.
 *
 * Second, there may be a step change in the value of the counter itself, as
 * its accuracy is limited by the precision of the NTP synchronization on the
 * source and destination hosts.
 *
 * So any calibration (NTP, PTP, etc.) which the guest has done on the source
 * host before migration is invalid, and needs to be redone on the new host.
 *
 * In its most basic mode, this structure provides only an indication to the
 * guest that live migration has occurred. This allows the guest to know that
 * its clock is invalid and take remedial action. For applications that need
 * reliable accurate timestamps (e.g. distributed databases), the structure
 * can be mapped all the way to userspace. This allows the application to see
 * directly for itself that the clock is disrupted and take appropriate
 * action, even when using a vDSO-style method to get the time instead of a
 * system call.
 *
 * In its more advanced mode. this structure can also be used to expose the
 * precise relationship of the CPU counter to real time, as calibrated by the
 * host. This means that userspace applications can have accurate time
 * immediately after live migration, rather than having to pause operations
 * and wait for NTP to recover. This mode does, of course, rely on the
 * counter being reliable and consistent across CPUs.
 *
 * Note that this must be true UTC, never with smeared leap seconds. If a
 * guest wishes to construct a smeared clock, it can do so. Presenting a
 * smeared clock through this interface would be problematic because it
 * actually messes with the apparent counter *period*. A linear smearing
 * of 1 ms per second would effectively tweak the counter period by 1000PPM
 * at the start/end of the smearing period, while a sinusoidal smear would
 * basically be impossible to represent.
 *
 * This structure is offered with the intent that it be adopted into the
 * nascent virtio-rtc standard, as a virtio-rtc that does not address the live
 * migration problem seems a little less than fit for purpose. For that
 * reason, certain fields use precisely the same numeric definitions as in
 * the virtio-rtc proposal. The structure can also be exposed through an ACPI
 * device with the CID "VMCLOCK", modelled on the "VMGENID" device except for
 * the fact that it uses a real _CRS to convey the address of the structure
 * (which should be a full page, to allow for mapping directly to userspace).
 */

#ifndef __VMCLOCK_ABI_H__
#define __VMCLOCK_ABI_H__

#ifdef __KERNEL__
#include <linux/types.h>
#else
#include <stdint.h>
#endif

struct vmclock_abi {
	uint64_t magic;
#define VMCLOCK_MAGIC	0x4b4c4356 /* "VCLK" */
	uint16_t size;		/* Size of page containing this structure */
	uint16_t version;	/* 1 */

	/* Sequence lock. Low bit means an update is in progress. */
	uint32_t seq_count;



	uint32_t flags;
	/* Indicates that the tai_offset_sec field is valid */
#define VMCLOCK_FLAG_TAI_OFFSET_VALID		(1 << 0)
	/*
	 * Optionally used to notify guests of pending maintenance events.
	 * A guest may wish to remove itself from service if an event is
	 * coming up. Two flags indicate the rough imminence of the event.
	 */
#define VMCLOCK_FLAG_DISRUPTION_SOON		(1 << 1) /* About a day */
#define VMCLOCK_FLAG_DISRUPTION_IMMINENT	(1 << 2) /* About an hour */
	/* Indicates that the utc_time_maxerror_picosec field is valid */
#define VMCLOCK_FLAG_UTC_MAXERROR_VALID		(1 << 3)
	/* Indicates counter_period_error_rate_frac_sec is valid */
#define VMCLOCK_FLAG_PERIOD_ERROR_VALID		(1 << 4)

	/*
	 * This field changes to another non-repeating value when the CPU
	 * counter is disrupted, for example on live migration. This lets
	 * the guest know that it should discard any calibration it has
	 * performed of the counter against external sources (NTP/PTP/etc.).
	 */
	uint64_t disruption_marker;

	uint8_t clock_status;
#define VMCLOCK_STATUS_UNKNOWN		0
#define VMCLOCK_STATUS_INITIALIZING	1
#define VMCLOCK_STATUS_SYNCHRONIZED	2
#define VMCLOCK_STATUS_FREERUNNING	3
#define VMCLOCK_STATUS_UNRELIABLE	4

	uint8_t counter_id; /* Matches VIRTIO_RTC_COUNTER_xxx */
#define VMCLOCK_COUNTER_ARM_VCNT	0
#define VMCLOCK_COUNTER_X86_TSC		1
#define VMCLOCK_COUNTER_INVALID		0xff

	/*
	 * By providing the offset from UTC to TAI, the guest can know both
	 * UTC and TAI reliably, whichever is indicated in the time_type
	 * field. Valid if VMCLOCK_FLAG_TAI_OFFSET_VALID is set in flags.
	 */
	int16_t tai_offset_sec;

	/*
	 * What time is exposed in the time_sec/time_frac_sec fields?
	 */
	uint8_t time_type; /* Matches VIRTIO_RTC_TYPE_xxx */
#define VMCLOCK_TIME_UTC			0	/* Since 1970-01-01 00:00:00z */
#define VMCLOCK_TIME_TAI			1	/* Since 1970-01-01 00:00:00z */
#define VMCLOCK_TIME_MONOTONIC			2	/* Since undefined epoch */
#define VMCLOCK_TIME_INVALID_SMEARED		3	/* Not supported */
#define VMCLOCK_TIME_INVALID_MAYBE_SMEARED	4	/* Not supported */

	/*
	 * The time exposed through this device is never smeared. This field
	 * corresponds to the 'subtype' field in virtio-rtc, which indicates
	 * the smearing method. However in this case it provides a *hint* to
	 * the guest operating system, such that *if* the guest OS wants to
	 * provide its users with an alternative clock which does not follow
	 * the POSIX CLOCK_REALTIME standard, it may do so in a fashion
	 * consistent with the other systems in the nearby environment.
	 */
	uint8_t leap_second_smearing_hint; /* Matches VIRTIO_RTC_SUBTYPE_xxx */
#define VMCLOCK_SMEARING_STRICT		0
#define VMCLOCK_SMEARING_NOON_LINEAR	1
#define VMCLOCK_SMEARING_UTC_SLS	2

	/* Bit shift for counter_period_frac_sec and its error rate */
	uint8_t counter_period_shift;

	/*
	 * Unlike in NTP, this can indicate a leap second in the past. This
	 * is needed to allow guests to derive an imprecise clock with
	 * smeared leap seconds for themselves, as some modes of smearing
	 * need the adjustments to continue even after the moment at which
	 * the leap second should have occurred.
	 */
	uint8_t leap_indicator; /* Matches VIRTIO_RTC_LEAP_xxx */
#define VMCLOCK_LEAP_NONE	0
#define VMCLOCK_LEAP_PRE_POS	1
#define VMCLOCK_LEAP_PRE_NEG	2
#define VMCLOCK_LEAP_POS	3
#define VMCLOCK_LEAP_NEG	4

	uint64_t leapsecond_tai_sec; /* Since 1970-01-01 00:00:00z */

	/*
	 * Paired values of counter and UTC at a given point in time.
	 */
	uint64_t counter_value;
	uint64_t time_sec;
	uint64_t time_frac_sec;

	/*
	 * Counter frequency, and error margin. The unit of these fields is
	 * seconds >> (64 + counter_period_shift)
	 */
	uint64_t counter_period_frac_sec;
	uint64_t counter_period_error_rate_frac_sec;

	/* Error margin of UTC reading above (± picoseconds) */
	uint64_t utc_time_maxerror_picosec;
};

#endif /*  __VMCLOCK_ABI_H__ */
Peter Hilber July 5, 2024, 8:12 a.m. UTC | #26
On 03.07.24 12:40, David Woodhouse wrote:

[...]

> 
> 
> This is what I currently have for 'struct vmclock_abi' that I'd like to
> persuade you to adopt. I need to tweak it some more, for at least the
> following reasons, as well as any more you can see:
> 
>  • size isn't big enough for 64KiB pages
>  • Should be explicitly little-endian
>  • Does it need esterror as well as maxerror?

I have no opinion about this. I can drop esterror if unwanted.

>  • Why is maxerror in picoseconds? It's the only use of that unit
>  • Where do the clock_status values come from? Do they make sense?
>  • Are signed integers OK? (I think so!).

Signed integers would need to be introduced to Virtio, which so far only
uses explicitly unsigned types: u8, le16 etc.
David Woodhouse July 5, 2024, 3:02 p.m. UTC | #27
On Fri, 2024-07-05 at 10:12 +0200, Peter Hilber wrote:
> On 03.07.24 12:40, David Woodhouse wrote:
> 
> [...]
> 
> > 
> > 
> > This is what I currently have for 'struct vmclock_abi' that I'd like to
> > persuade you to adopt. I need to tweak it some more, for at least the
> > following reasons, as well as any more you can see:
> > 
> >  • size isn't big enough for 64KiB pages
> >  • Should be explicitly little-endian
> >  • Does it need esterror as well as maxerror?
> 
> I have no opinion about this. I can drop esterror if unwanted.

I also don't care. I'm just observing the inconsistency.

> >  • Why is maxerror in picoseconds? It's the only use of that unit

Between us we now have picoseconds, nanoseconds, (seconds >> 64) and
(seconds >> 64+n).

The power-of-two fractions seem to make a lot of sense for the counter
period, because they mean we don't have to perform divisions.

Does it makes sense to harmonise on (seconds >> 64) for all of the
fractional seconds? Again I don't have a strong opinion; I only want us
to have a *reason* for any differences that exist.

> >  • Where do the clock_status values come from? Do they make sense?
> >  • Are signed integers OK? (I think so!).
> 
> Signed integers would need to be introduced to Virtio, which so far only
> uses explicitly unsigned types: u8, le16 etc.

Perhaps. Although it would also be possible (if not ideal) to define
that e.g. the tai_offset field is a 16-bit "unsigned" integer according
to virtio, but to be interpreted as follows:

If the number is <= 32767 then the TAI offset is that value, but if the
number is >= 32768 then the TAI offset is that value minus 65536.

Perhaps not pretty, but there isn't a *fundamental* dependency on
virtio supporting signed integers as a primary type.
Peter Hilber July 6, 2024, 7:50 a.m. UTC | #28
On 05.07.24 17:02, David Woodhouse wrote:
> On Fri, 2024-07-05 at 10:12 +0200, Peter Hilber wrote:
>> On 03.07.24 12:40, David Woodhouse wrote:

[...]

>>>  • Why is maxerror in picoseconds? It's the only use of that unit
> 
> Between us we now have picoseconds, nanoseconds, (seconds >> 64) and
> (seconds >> 64+n).
> 
> The power-of-two fractions seem to make a lot of sense for the counter
> period, because they mean we don't have to perform divisions.
> 
> Does it makes sense to harmonise on (seconds >> 64) for all of the
> fractional seconds? Again I don't have a strong opinion; I only want us
> to have a *reason* for any differences that exist.
> 

I don't have the expertise with fixed-point arithmetic to judge if this
would become unwieldy.

I selected ns for the virtio-rtc drafts so far because that didn't have any
impact on the precision with the Linux kernel driver message-based use
cases, but that would be different for SHM in my understanding.

So I would tend to retain ns for convenience for messages (where it doesn't
impact precision) but do not have any preference for SHM.

>>>  • Where do the clock_status values come from? Do they make sense?
>>>  • Are signed integers OK? (I think so!).
>>
>> Signed integers would need to be introduced to Virtio, which so far only
>> uses explicitly unsigned types: u8, le16 etc.
> 
> Perhaps. Although it would also be possible (if not ideal) to define
> that e.g. the tai_offset field is a 16-bit "unsigned" integer according
> to virtio, but to be interpreted as follows:
> 
> If the number is <= 32767 then the TAI offset is that value, but if the
> number is >= 32768 then the TAI offset is that value minus 65536.
> 
> Perhaps not pretty, but there isn't a *fundamental* dependency on
> virtio supporting signed integers as a primary type.
> 

Agreed.
diff mbox series

Patch

diff --git a/drivers/ptp/Kconfig b/drivers/ptp/Kconfig
index 604541dcb320..e98c9767e0ef 100644
--- a/drivers/ptp/Kconfig
+++ b/drivers/ptp/Kconfig
@@ -131,6 +131,19 @@  config PTP_1588_CLOCK_KVM
 	  To compile this driver as a module, choose M here: the module
 	  will be called ptp_kvm.
 
+config PTP_1588_CLOCK_VMCLOCK
+	tristate "Virtual machine PTP clock"
+	depends on X86_TSC || ARM_ARCH_TIMER
+	depends on PTP_1588_CLOCK && ACPI && ARCH_SUPPORTS_INT128
+	default y
+	help
+	  This driver adds support for using a virtual precision clock
+	  advertised by the hypervisor. This clock is only useful in virtual
+	  machines where such a device is present.
+
+	  To compile this driver as a module, choose M here: the module
+	  will be called ptp_vmclock.
+
 config PTP_1588_CLOCK_IDT82P33
 	tristate "IDT 82P33xxx PTP clock"
 	depends on PTP_1588_CLOCK && I2C
diff --git a/drivers/ptp/Makefile b/drivers/ptp/Makefile
index 68bf02078053..01b5cd91eb61 100644
--- a/drivers/ptp/Makefile
+++ b/drivers/ptp/Makefile
@@ -11,6 +11,7 @@  obj-$(CONFIG_PTP_1588_CLOCK_DTE)	+= ptp_dte.o
 obj-$(CONFIG_PTP_1588_CLOCK_INES)	+= ptp_ines.o
 obj-$(CONFIG_PTP_1588_CLOCK_PCH)	+= ptp_pch.o
 obj-$(CONFIG_PTP_1588_CLOCK_KVM)	+= ptp_kvm.o
+obj-$(CONFIG_PTP_1588_CLOCK_VMCLOCK)	+= ptp_vmclock.o
 obj-$(CONFIG_PTP_1588_CLOCK_QORIQ)	+= ptp-qoriq.o
 ptp-qoriq-y				+= ptp_qoriq.o
 ptp-qoriq-$(CONFIG_DEBUG_FS)		+= ptp_qoriq_debugfs.o
diff --git a/drivers/ptp/ptp_vmclock.c b/drivers/ptp/ptp_vmclock.c
new file mode 100644
index 000000000000..e19c2eed8009
--- /dev/null
+++ b/drivers/ptp/ptp_vmclock.c
@@ -0,0 +1,516 @@ 
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Virtual PTP 1588 clock for use with LM-safe VMclock device.
+ *
+ * Copyright © 2024 Amazon.com, Inc. or its affiliates.
+ */
+
+#include <linux/device.h>
+#include <linux/err.h>
+#include <linux/init.h>
+#include <linux/kernel.h>
+#include <linux/slab.h>
+#include <linux/module.h>
+#include <linux/platform_device.h>
+#include <linux/miscdevice.h>
+#include <linux/acpi.h>
+#include <uapi/linux/vmclock.h>
+
+#include <linux/ptp_clock_kernel.h>
+
+#ifdef CONFIG_X86
+#include <asm/pvclock.h>
+#include <asm/kvmclock.h>
+#endif
+
+static DEFINE_IDA(vmclock_ida);
+
+ACPI_MODULE_NAME("vmclock");
+
+struct vmclock_state {
+	phys_addr_t phys_addr;
+	struct vmclock_abi *clk;
+	struct miscdevice miscdev;
+	struct ptp_clock_info ptp_clock_info;
+	struct ptp_clock *ptp_clock;
+	enum clocksource_ids cs_id, sys_cs_id;
+	int index;
+	char *name;
+};
+
+#define VMCLOCK_MAX_WAIT ms_to_ktime(100)
+
+/*
+ * Multiply a 64-bit count by a 64-bit tick 'period' in units of seconds >> 64
+ * and add the fractional second part of the reference time.
+ *
+ * The result is a 128-bit value, the top 64 bits of which are seconds, and
+ * the low 64 bits are (seconds >> 64).
+ *
+ * If __int128 isn't available, perform the calculation 32 bits at a time to
+ * avoid overflow.
+ */
+static inline uint64_t mul_u64_u64_shr_add_u64(uint64_t *res_hi, uint64_t delta,
+					       uint64_t period, uint8_t shift,
+					       uint64_t frac_sec)
+{
+	unsigned __int128 res = (unsigned __int128)delta * period;
+
+	res >>= shift;
+	res += frac_sec;
+	*res_hi = res >> 64;
+	return (uint64_t)res;
+}
+
+static int vmclock_get_crosststamp(struct vmclock_state *st,
+				   struct ptp_system_timestamp *sts,
+				   struct system_counterval_t *system_counter,
+				   struct timespec64 *tspec)
+{
+	ktime_t deadline = ktime_add(ktime_get(), VMCLOCK_MAX_WAIT);
+	struct system_time_snapshot systime_snapshot;
+	uint64_t cycle, delta, seq, frac_sec;
+
+#ifdef CONFIG_X86
+	/*
+	 * We'd expect the hypervisor to know this and to report the clock
+	 * status as VMCLOCK_STATUS_UNRELIABLE. But be paranoid.
+	 */
+	if (check_tsc_unstable())
+		return -EINVAL;
+#endif
+
+	while (1) {
+		seq = st->clk->seq_count & ~1ULL;
+		virt_rmb();
+
+		if (st->clk->clock_status == VMCLOCK_STATUS_UNRELIABLE)
+			return -EINVAL;
+
+		/*
+		 * When invoked for gettimex64(), fill in the pre/post system
+		 * times. The simple case is when system time is based on the
+		 * same counter as st->cs_id, in which case all three times
+		 * will be derived from the *same* counter value.
+		 *
+		 * If the system isn't using the same counter, then the value
+		 * from ktime_get_snapshot() will still be used as pre_ts, and
+		 * ptp_read_system_postts() is called to populate postts after
+		 * calling get_cycles().
+		 *
+		 * The conversion to timespec64 happens further down, outside
+		 * the seq_count loop.
+		 */
+		if (sts) {
+			ktime_get_snapshot(&systime_snapshot);
+			if (systime_snapshot.cs_id == st->cs_id) {
+				cycle = systime_snapshot.cycles;
+			} else {
+				cycle = get_cycles();
+				ptp_read_system_postts(sts);
+			}
+		} else
+			cycle = get_cycles();
+
+		delta = cycle - st->clk->counter_value;
+
+		frac_sec = mul_u64_u64_shr_add_u64(&tspec->tv_sec, delta,
+						   st->clk->counter_period_frac_sec,
+						   st->clk->counter_period_shift,
+						   st->clk->utc_time_frac_sec);
+		tspec->tv_nsec = mul_u64_u64_shr(frac_sec, NSEC_PER_SEC, 64);
+		tspec->tv_sec += st->clk->utc_time_sec;
+
+		virt_rmb();
+		if (seq == st->clk->seq_count)
+			break;
+
+		if (ktime_after(ktime_get(), deadline))
+			return -ETIMEDOUT;
+	}
+
+	if (system_counter) {
+		system_counter->cycles = cycle;
+		system_counter->cs_id = st->cs_id;
+	}
+
+	if (sts) {
+		sts->pre_ts = ktime_to_timespec64(systime_snapshot.real);
+		if (systime_snapshot.cs_id == st->cs_id)
+			sts->post_ts = sts->pre_ts;
+	}
+
+	return 0;
+}
+
+#ifdef CONFIG_X86
+/*
+ * In the case where the system is using the KVM clock for timekeeping, convert
+ * the TSC value into a KVM clock time in order to return a paired reading that
+ * get_device_system_crosststamp() can cope with.
+ */
+static int vmclock_get_crosststamp_kvmclock(struct vmclock_state *st,
+					    struct ptp_system_timestamp *sts,
+					    struct system_counterval_t *system_counter,
+					    struct timespec64 *tspec)
+{
+	struct pvclock_vcpu_time_info *pvti = this_cpu_pvti();
+	unsigned pvti_ver;
+	int ret;
+
+	preempt_disable_notrace();
+
+	do {
+		pvti_ver = pvclock_read_begin(pvti);
+
+		ret = vmclock_get_crosststamp(st, sts, system_counter, tspec);
+		if (ret)
+			break;
+
+		system_counter->cycles = __pvclock_read_cycles(pvti,
+							       system_counter->cycles);
+		system_counter->cs_id = CSID_X86_KVM_CLK;
+
+		/*
+		 * This retry should never really happen; if the TSC is
+		 * stable and reliable enough across vCPUS that it is sane
+		 * for the hypervisor to expose a VMCLOCK device which uses
+		 * it as the reference counter, then the KVM clock sohuld be
+		 * in 'master clock mode' and basically never changed. But
+		 * the KVM clock is a fickle and often broken thing, so do
+		 * it "properly" just in case.
+		 */
+	} while (pvclock_read_retry(pvti, pvti_ver));
+
+	preempt_enable_notrace();
+
+	return ret;
+}
+#endif
+
+static int ptp_vmclock_get_time_fn(ktime_t *device_time,
+				   struct system_counterval_t *system_counter,
+				   void *ctx)
+{
+	struct vmclock_state *st = ctx;
+	struct timespec64 tspec;
+	int ret;
+
+#ifdef CONFIG_X86
+	if (READ_ONCE(st->sys_cs_id) == CSID_X86_KVM_CLK)
+		ret = vmclock_get_crosststamp_kvmclock(st, NULL, system_counter,
+						       &tspec);
+	else
+#endif
+		ret = vmclock_get_crosststamp(st, NULL, system_counter, &tspec);
+
+	if (!ret)
+		*device_time = timespec64_to_ktime(tspec);
+
+	return ret;
+}
+
+
+static int ptp_vmclock_getcrosststamp(struct ptp_clock_info *ptp,
+				      struct system_device_crosststamp *xtstamp)
+{
+	struct vmclock_state *st = container_of(ptp, struct vmclock_state,
+						ptp_clock_info);
+	int ret = get_device_system_crosststamp(ptp_vmclock_get_time_fn, st,
+						NULL, xtstamp);
+#ifdef CONFIG_X86
+	/*
+	 * On x86, the KVM clock may be used for the system time. We can
+	 * actually convert a TSC reading to that, and return a paired
+	 * timestamp that get_device_system_crosststamp() *can* handle.
+	 */
+	if (ret == -ENODEV) {
+		struct system_time_snapshot systime_snapshot;
+		ktime_get_snapshot(&systime_snapshot);
+
+		if (systime_snapshot.cs_id == CSID_X86_TSC ||
+		    systime_snapshot.cs_id == CSID_X86_KVM_CLK) {
+			WRITE_ONCE(st->sys_cs_id, systime_snapshot.cs_id);
+			ret = get_device_system_crosststamp(ptp_vmclock_get_time_fn,
+							    st, NULL, xtstamp);
+		}
+	}
+#endif
+	return ret;
+}
+
+/*
+ * PTP clock operations
+ */
+
+static int ptp_vmclock_adjfine(struct ptp_clock_info *ptp, long delta)
+{
+	return -EOPNOTSUPP;
+}
+
+static int ptp_vmclock_adjtime(struct ptp_clock_info *ptp, s64 delta)
+{
+	return -EOPNOTSUPP;
+}
+
+static int ptp_vmclock_settime(struct ptp_clock_info *ptp,
+			   const struct timespec64 *ts)
+{
+	return -EOPNOTSUPP;
+}
+
+static int ptp_vmclock_gettime(struct ptp_clock_info *ptp, struct timespec64 *ts)
+{
+	struct vmclock_state *st = container_of(ptp, struct vmclock_state,
+						ptp_clock_info);
+
+	return vmclock_get_crosststamp(st, NULL, NULL, ts);
+}
+
+static int ptp_vmclock_gettimex(struct ptp_clock_info *ptp, struct timespec64 *ts,
+				struct ptp_system_timestamp *sts)
+{
+	struct vmclock_state *st = container_of(ptp, struct vmclock_state,
+						ptp_clock_info);
+
+	return vmclock_get_crosststamp(st, sts, NULL, ts);
+}
+
+static int ptp_vmclock_enable(struct ptp_clock_info *ptp,
+			  struct ptp_clock_request *rq, int on)
+{
+	return -EOPNOTSUPP;
+}
+
+static const struct ptp_clock_info ptp_vmclock_info = {
+	.owner		= THIS_MODULE,
+	.max_adj	= 0,
+	.n_ext_ts	= 0,
+	.n_pins		= 0,
+	.pps		= 0,
+	.adjfine	= ptp_vmclock_adjfine,
+	.adjtime	= ptp_vmclock_adjtime,
+	.gettime64	= ptp_vmclock_gettime,
+	.gettimex64	= ptp_vmclock_gettimex,
+	.settime64	= ptp_vmclock_settime,
+	.enable		= ptp_vmclock_enable,
+	.getcrosststamp = ptp_vmclock_getcrosststamp,
+};
+
+static int vmclock_miscdev_mmap(struct file *fp, struct vm_area_struct *vma)
+{
+	struct vmclock_state *st = container_of(fp->private_data,
+						struct vmclock_state, miscdev);
+
+	if ((vma->vm_flags & (VM_READ|VM_WRITE)) != VM_READ)
+		return -EROFS;
+
+	if (vma->vm_end - vma->vm_start != PAGE_SIZE || vma->vm_pgoff)
+		return -EINVAL;
+
+        if (io_remap_pfn_range(vma, vma->vm_start,
+			       st->phys_addr >> PAGE_SHIFT, PAGE_SIZE,
+                               vma->vm_page_prot))
+                return -EAGAIN;
+
+        return 0;
+}
+
+static ssize_t vmclock_miscdev_read(struct file *fp, char __user *buf,
+				    size_t count, loff_t *ppos)
+{
+	struct vmclock_state *st = container_of(fp->private_data,
+						struct vmclock_state, miscdev);
+	ktime_t deadline = ktime_add(ktime_get(), VMCLOCK_MAX_WAIT);
+	size_t max_count;
+	int32_t seq;
+
+	if (*ppos >= PAGE_SIZE)
+		return 0;
+
+	max_count = PAGE_SIZE - *ppos;
+	if (count > max_count)
+		count = max_count;
+
+	while (1) {
+		seq = st->clk->seq_count & ~1ULL;
+		virt_rmb();
+
+		if (copy_to_user(buf, ((char *)st->clk) + *ppos, count))
+			return -EFAULT;
+
+		virt_rmb();
+		if (seq == st->clk->seq_count)
+			break;
+
+		if (ktime_after(ktime_get(), deadline))
+			return -ETIMEDOUT;
+	}
+
+	*ppos += count;
+	return count;
+}
+
+static const struct file_operations vmclock_miscdev_fops = {
+        .mmap = vmclock_miscdev_mmap,
+        .read = vmclock_miscdev_read,
+};
+
+/* module operations */
+
+static void vmclock_remove(struct platform_device *pdev)
+{
+	struct device *dev = &pdev->dev;
+	struct vmclock_state *st = dev_get_drvdata(dev);
+
+	if (st->ptp_clock)
+		ptp_clock_unregister(st->ptp_clock);
+
+	if (st->miscdev.minor == MISC_DYNAMIC_MINOR)
+		misc_deregister(&st->miscdev);
+}
+
+static int vmclock_probe_acpi(struct device *dev, struct vmclock_state *st)
+{
+	struct acpi_buffer parsed = { ACPI_ALLOCATE_BUFFER };
+	struct acpi_device *adev = ACPI_COMPANION(dev);
+	union acpi_object *obj;
+	acpi_status status;
+
+	status = acpi_evaluate_object(adev->handle, "ADDR", NULL, &parsed);
+	if (ACPI_FAILURE(status)) {
+		ACPI_EXCEPTION((AE_INFO, status, "Evaluating ADDR"));
+		return -ENODEV;
+	}
+	obj = parsed.pointer;
+	if (!obj || obj->type != ACPI_TYPE_PACKAGE || obj->package.count != 2 ||
+	    obj->package.elements[0].type != ACPI_TYPE_INTEGER ||
+	    obj->package.elements[1].type != ACPI_TYPE_INTEGER)
+		return -EINVAL;
+
+	st->phys_addr = (obj->package.elements[0].integer.value << 0) |
+		(obj->package.elements[1].integer.value << 32);
+
+	return 0;
+}
+
+static void vmclock_put_idx(void *data)
+{
+	struct vmclock_state *st = data;
+
+	ida_free(&vmclock_ida, st->index);
+}
+
+static int vmclock_probe(struct platform_device *pdev)
+{
+	struct device *dev = &pdev->dev;
+	struct vmclock_state *st;
+	int ret;
+
+	st = devm_kzalloc(dev, sizeof (*st), GFP_KERNEL);
+	if (!st)
+		return -ENOMEM;
+
+	if (has_acpi_companion(dev))
+		ret = vmclock_probe_acpi(dev, st);
+	else
+		ret = -EINVAL; /* Only ACPI for now */
+
+	if (ret) {
+		dev_info(dev, "Failed to obtain physical address: %d\n", ret);
+		goto out;
+	}
+
+	st->clk = devm_memremap(dev, st->phys_addr, sizeof(*st->clk),
+				MEMREMAP_WB);
+	if (IS_ERR(st->clk)) {
+		ret = PTR_ERR(st->clk);
+		dev_info(dev, "failed to map shared memory\n");
+		st->clk = NULL;
+		goto out;
+	}
+
+	if (st->clk->magic != VMCLOCK_MAGIC ||
+	    st->clk->size < sizeof(*st->clk) ||
+	    st->clk->version != 1) {
+		dev_info(dev, "vmclock magic fields invalid\n");
+		ret = -EINVAL;
+		goto out;
+	}
+
+	if (IS_ENABLED(CONFIG_ARM64) &&
+	    st->clk->counter_id == VMCLOCK_COUNTER_ARM_VCNT) {
+		/* Can we check it's the virtual counter? */
+		st->cs_id = CSID_ARM_ARCH_COUNTER;
+	} else if (IS_ENABLED(CONFIG_X86) &&
+		   st->clk->counter_id == VMCLOCK_COUNTER_X86_TSC) {
+		st->cs_id = CSID_X86_TSC;
+	}
+	st->sys_cs_id = st->cs_id;
+
+	ret = ida_alloc(&vmclock_ida, GFP_KERNEL);
+	if (ret < 0)
+		goto out;
+
+	st->index = ret;
+        ret = devm_add_action_or_reset(&pdev->dev, vmclock_put_idx, st);
+	if (ret)
+		goto out;
+
+	st->name = devm_kasprintf(&pdev->dev, GFP_KERNEL, "vmclock%d", st->index);
+	if (!st->name) {
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	/* If the structure is big enough, it can be mapped to userspace */
+	if (st->clk->size >= PAGE_SIZE) {
+		st->miscdev.minor = MISC_DYNAMIC_MINOR;
+		st->miscdev.fops = &vmclock_miscdev_fops;
+		st->miscdev.name = st->name;
+
+		ret = misc_register(&st->miscdev);
+		if (ret)
+			goto out;
+	}
+
+	/* If there is valid clock information, register a PTP clock */
+	if (st->cs_id) {
+		st->ptp_clock_info = ptp_vmclock_info;
+		strncpy(st->ptp_clock_info.name, st->name, sizeof(st->ptp_clock_info.name));
+		st->ptp_clock = ptp_clock_register(&st->ptp_clock_info, dev);
+
+		if (IS_ERR(st->ptp_clock)) {
+			ret = PTR_ERR(st->ptp_clock);
+			st->ptp_clock = NULL;
+			vmclock_remove(pdev);
+			goto out;
+		}
+	}
+
+	dev_set_drvdata(dev, st);
+
+ out:
+	return ret;
+}
+
+static const struct acpi_device_id vmclock_acpi_ids[] = {
+	{ "VMCLOCK", 0 },
+	{}
+};
+MODULE_DEVICE_TABLE(acpi, vmclock_acpi_ids);
+
+static struct platform_driver vmclock_platform_driver = {
+	.probe		= vmclock_probe,
+	.remove_new	= vmclock_remove,
+	.driver	= {
+		.name	= "vmclock",
+		.acpi_match_table = vmclock_acpi_ids,
+	},
+};
+
+module_platform_driver(vmclock_platform_driver)
+
+MODULE_AUTHOR("David Woodhouse <dwmw2@infradead.org>");
+MODULE_DESCRIPTION("PTP clock using VMCLOCK");
+MODULE_LICENSE("GPL v2");
diff --git a/include/uapi/linux/vmclock.h b/include/uapi/linux/vmclock.h
new file mode 100644
index 000000000000..cf0f22205e79
--- /dev/null
+++ b/include/uapi/linux/vmclock.h
@@ -0,0 +1,138 @@ 
+/* SPDX-License-Identifier: ((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause) */
+
+/*
+ * This structure provides a vDSO-style clock to VM guests, exposing the
+ * relationship (or lack thereof) between the CPU clock (TSC, timebase, arch
+ * counter, etc.) and real time. It is designed to address the problem of
+ * live migration, which other clock enlightenments do not.
+ *
+ * When a guest is live migrated, this affects the clock in two ways.
+ *
+ * First, even between identical hosts the actual frequency of the underlying
+ * counter will change within the tolerances of its specification (typically
+ * ±50PPM, or 4 seconds a day). The frequency also varies over time on the
+ * same host, but can be tracked by NTP as it generally varies slowly. With
+ * live migration there is a step change in the frequency, with no warning.
+ *
+ * Second, there may be a step change in the value of the counter itself, as
+ * its accuracy is limited by the precision of the NTP synchronization on the
+ * source and destination hosts.
+ *
+ * So any calibration (NTP, PTP, etc.) which the guest has done on the source
+ * host before migration is invalid, and needs to be redone on the new host.
+ *
+ * In its most basic mode, this structure provides only an indication to the
+ * guest that live migration has occurred. This allows the guest to know that
+ * its clock is invalid and take remedial action. For applications that need
+ * reliable accurate timestamps (e.g. distributed databases), the structure
+ * can be mapped all the way to userspace. This allows the application to see
+ * directly for itself that the clock is disrupted and take appropriate
+ * action, even when using a vDSO-style method to get the time instead of a
+ * system call.
+ *
+ * In its more advanced mode. this structure can also be used to expose the
+ * precise relationship of the CPU counter to real time, as calibrated by the
+ * host. This means that userspace applications can have accurate time
+ * immediately after live migration, rather than having to pause operations
+ * and wait for NTP to recover. This mode does, of course, rely on the
+ * counter being reliable and consistent across CPUs.
+ *
+ * Note that this must be true UTC, never with smeared leap seconds. If a
+ * guest wishes to construct a smeared clock, it can do so. Presenting a
+ * smeared clock through this interface would be problematic because it
+ * actually messes with the apparent counter *period*. A linear smearing
+ * of 1 ms per second would effectively tweak the counter period by 1000PPM
+ * at the start/end of the smearing period, while a sinusoidal smear would
+ * basically be impossible to represent.
+ */
+
+#ifndef __VMCLOCK_H__
+#define __VMCLOCK_H__
+
+#ifdef __KERNEL__
+#include <linux/types.h>
+#else
+#include <stdint.h>
+#endif
+
+struct vmclock_abi {
+	uint32_t magic;
+#define VMCLOCK_MAGIC	0x4b4c4356 /* "VCLK" */
+	uint16_t size;		/* Size of page containing this structure */
+	uint16_t version;	/* 1 */
+
+	/* Sequence lock. Low bit means an update is in progress. */
+	uint64_t seq_count;
+
+	/*
+	 * This field changes to another non-repeating value when the CPU
+	 * counter is disrupted, for example on live migration.
+	 */
+	uint64_t disruption_marker;
+
+	/*
+	 * By providing the TAI offset, the guest can know both UTC and TAI
+	 * reliably. There is no need to choose one *or* the other. Valid if
+	 * VMCLOCK_FLAG_TAI_OFFSET_VALID is set in flags.
+	 */
+	int16_t tai_offset_sec;
+
+	uint16_t flags;
+	/* Indicates that the tai_offset_sec field is valid */
+#define VMCLOCK_FLAG_TAI_OFFSET_VALID		(1 << 0)
+	/*
+	 * Optionally used to notify guests of pending maintenance events.
+	 * A guest may wish to remove itself from service if an event is
+	 * coming up. Two flags indicate the rough imminence of the event.
+	 */
+#define VMCLOCK_FLAG_DISRUPTION_SOON		(1 << 1) /* About a day */
+#define VMCLOCK_FLAG_DISRUPTION_IMMINENT	(1 << 2) /* About an hour */
+	/* Indicates that the utc_time_maxerror_picosec field is valid */
+#define VMCLOCK_FLAG_UTC_MAXERROR_VALID		(1 << 3)
+	/* Indicates counter_period_error_rate_frac_sec is valid */
+#define VMCLOCK_FLAG_UTC_PERIOD_ERROR_VALID	(1 << 4)
+
+	uint8_t clock_status;
+#define VMCLOCK_STATUS_UNKNOWN		0
+#define VMCLOCK_STATUS_INITIALIZING	1
+#define VMCLOCK_STATUS_SYNCHRONIZED	2
+#define VMCLOCK_STATUS_FREERUNNING	3
+#define VMCLOCK_STATUS_UNRELIABLE	4
+
+	uint8_t counter_id;
+#define VMCLOCK_COUNTER_INVALID		0
+#define VMCLOCK_COUNTER_X86_TSC		1
+#define VMCLOCK_COUNTER_ARM_VCNT	2
+
+	/* Bit shift for counter_period_frac_sec and its error rate */
+	uint8_t counter_period_shift;
+
+	/*
+	 * Unlike in NTP, this can indicate a leap second in the past. This
+	 * is needed to allow guests to derive an imprecise clock with
+	 * smeared leap seconds for themselves, as some modes of smearing
+	 * need the adjustments to continue even after the moment at which
+	 * the leap second should have occurred.
+	 */
+	int8_t leapsecond_direction;
+	uint64_t leapsecond_tai_sec; /* Since 1970-01-01 00:00:00z */
+
+	/*
+	 * Paired values of counter and UTC at a given point in time.
+	 */
+	uint64_t counter_value;
+	uint64_t utc_time_sec; /* Since 1970-01-01 00:00:00z */
+	uint64_t utc_time_frac_sec;
+
+	/*
+	 * Counter frequency, and error margin. The unit of these fields is
+	 * seconds >> (64 + counter_period_shift)
+	 */
+	uint64_t counter_period_frac_sec;
+	uint64_t counter_period_error_rate_frac_sec;
+
+	/* Error margin of UTC reading above (± picoseconds) */
+	uint64_t utc_time_maxerror_picosec;
+};
+
+#endif /*  __VMCLOCK_H__ */