Message ID | 4a0a240dffc21dde4d69179288547b945142259f.camel@infradead.org (mailing list archive) |
---|---|
State | Superseded |
Delegated to: | Netdev Maintainers |
Headers | show |
Series | [RFC,v2] ptp: Add vDSO-style vmclock support | expand |
On Tue, Jun 25 2024 at 20:01, David Woodhouse wrote: > From: David Woodhouse <dwmw@amazon.co.uk> > > The vmclock "device" provides a shared memory region with precision clock > information. By using shared memory, it is safe across Live Migration. > > Like the KVM PTP clock, this can convert TSC-based cross timestamps into > KVM clock values. Unlike the KVM PTP clock, it does so only when such is > actually helpful. > > The memory region of the device is also exposed to userspace so it can be > read or memory mapped by application which need reliable notification of > clock disruptions. There is effort underway to expose PTP clocks to user space via VDSO. Can we please not expose an ad hoc interface for that? As you might have heard the sad news, I'm not feeling up to the task to dig deeper into this right now. Give me a couple of days to look at this with working brain. Thanks, tglx
On Tue, 2024-06-25 at 23:34 +0200, Thomas Gleixner wrote: > On Tue, Jun 25 2024 at 20:01, David Woodhouse wrote: > > From: David Woodhouse <dwmw@amazon.co.uk> > > > > The vmclock "device" provides a shared memory region with precision clock > > information. By using shared memory, it is safe across Live Migration. > > > > Like the KVM PTP clock, this can convert TSC-based cross timestamps into > > KVM clock values. Unlike the KVM PTP clock, it does so only when such is > > actually helpful. > > > > The memory region of the device is also exposed to userspace so it can be > > read or memory mapped by application which need reliable notification of > > clock disruptions. > > There is effort underway to expose PTP clocks to user space via VDSO. Ooh, interesting. Got a reference to that please? > Can we please not expose an ad hoc interface for that? Absolutely. I'm explicitly trying to intercept the virtio-rtc specification here, to *avoid* having to do anything ad hoc. Note that this is a "vDSO-style" interface from hypervisor to guest via a shared memory region, not necessarily an actual vDSO. But yes, it *is* intended to be exposed to userspace, so that userspace can know the *accurate* time without a system call, and know that it hasn't been perturbed by live migration. > As you might have heard the sad news, I'm not feeling up to the task to > dig deeper into this right now. Give me a couple of days to look at this > with working brain. I have not heard any news, although now I'm making inferences. Wishing you the best!
On Tue, Jun 25, 2024 at 2:48 PM David Woodhouse <dwmw2@infradead.org> wrote: > On Tue, 2024-06-25 at 23:34 +0200, Thomas Gleixner wrote: > > On Tue, Jun 25 2024 at 20:01, David Woodhouse wrote: > > > From: David Woodhouse <dwmw@amazon.co.uk> > > > > > > The vmclock "device" provides a shared memory region with precision clock > > > information. By using shared memory, it is safe across Live Migration. > > > > > > Like the KVM PTP clock, this can convert TSC-based cross timestamps into > > > KVM clock values. Unlike the KVM PTP clock, it does so only when such is > > > actually helpful. > > > > > > The memory region of the device is also exposed to userspace so it can be > > > read or memory mapped by application which need reliable notification of > > > clock disruptions. > > > > There is effort underway to expose PTP clocks to user space via VDSO. > > Ooh, interesting. Got a reference to that please? > > > Can we please not expose an ad hoc interface for that? > > Absolutely. I'm explicitly trying to intercept the virtio-rtc > specification here, to *avoid* having to do anything ad hoc. > > Note that this is a "vDSO-style" interface from hypervisor to guest via > a shared memory region, not necessarily an actual vDSO. > > But yes, it *is* intended to be exposed to userspace, so that userspace > can know the *accurate* time without a system call, and know that it > hasn't been perturbed by live migration. Yea, I was going to raise a concern that just defining an mmaped structure means it has to trust the guest logic is as expected. It's good that it's versioned! :) I'd fret a bit about exposing this to userland. It feels very similar to the old powerpc systemcfg implementation that similarly mapped just kernel data out to userland and was difficult to maintain as changes were made. Would including a code page like a proper vdso make sense to make this more flexible of an UABI to maintain? thanks -john
On Tue, 2024-06-25 at 15:22 -0700, John Stultz wrote: > On Tue, Jun 25, 2024 at 2:48 PM David Woodhouse <dwmw2@infradead.org> wrote: > > On Tue, 2024-06-25 at 23:34 +0200, Thomas Gleixner wrote: > > > On Tue, Jun 25 2024 at 20:01, David Woodhouse wrote: > > > > From: David Woodhouse <dwmw@amazon.co.uk> > > > > > > > > The vmclock "device" provides a shared memory region with precision clock > > > > information. By using shared memory, it is safe across Live Migration. > > > > > > > > Like the KVM PTP clock, this can convert TSC-based cross timestamps into > > > > KVM clock values. Unlike the KVM PTP clock, it does so only when such is > > > > actually helpful. > > > > > > > > The memory region of the device is also exposed to userspace so it can be > > > > read or memory mapped by application which need reliable notification of > > > > clock disruptions. > > > > > > There is effort underway to expose PTP clocks to user space via VDSO. > > > > Ooh, interesting. Got a reference to that please? > > > > > Can we please not expose an ad hoc interface for that? > > > > Absolutely. I'm explicitly trying to intercept the virtio-rtc > > specification here, to *avoid* having to do anything ad hoc. > > > > Note that this is a "vDSO-style" interface from hypervisor to guest via > > a shared memory region, not necessarily an actual vDSO. > > > > But yes, it *is* intended to be exposed to userspace, so that userspace > > can know the *accurate* time without a system call, and know that it > > hasn't been perturbed by live migration. > > Yea, I was going to raise a concern that just defining an mmaped > structure means it has to trust the guest logic is as expected. It's > good that it's versioned! :) Right. Although it's basically a pvclock, and we've had those for ages. The main difference here is that we add an indicator that tells the guest that it's been live migrated, so any additional NTP/PTP refinement that the *guest* has done of its oscillator, should now be discarded. It's also designed to be useful in "disruption-only" mode, where the pvclock information isn't actually populated, so *all* it does is tell guests that their clock is now hosed due to live migration. That part is why it needs to be mappable directly to userspace, so that userspace can not only get a timestamp but *also* know that it's actually valid. All without a system call. The critical use cases are financial systems where they incur massive fines if they submit mis-timestamped transactions, and distributed databases which rely on accurate timestamps (and error bounds) for eventual coherence. Live migration can screw those completely. I'm open to changing fairly much anything about the proposal as long as we can address those use cases (which the existing virtio-rtc and other KVM enlightenments do not). > I'd fret a bit about exposing this to userland. It feels very similar > to the old powerpc systemcfg implementation that similarly mapped just > kernel data out to userland and was difficult to maintain as changes > were made. Would including a code page like a proper vdso make sense > to make this more flexible of an UABI to maintain? I think the structure itself should be stable once we've bikeshedded it a bit. But there is certainly some potential for vDSO functions which help us expose it to the user... This structure exposes a 'disruption count' which is updated every time the TSC/counter is messed with by live migration. But what is userspace actually going to *compare* it with? It basically needs to compare it with the disruption count when the clock was last synchronized, so maybe the kernel could export *that* to vDSO too, then expose a simple vDSO function which reports whether the clock is valid? The 'invalid' code path could turn into an actual system call which makes the kernel (check for itself and) call ntp_clear() when the disruption occurs. Or maybe not just ntp_clear() but actually consume the pvclock rate information directly and apply the *new* calibration? That kind of thing would be great, and I've definitely tried to design the structure so that it *can* be made a first-class citizen within the kernel's timekeeping code and used like that. But I was going to start with a more modest proposal that it's "just a device", and applications which care about reliable time after LM would have to /dev/vmclock0 and mmap it and check for themselves. (Which would be assisted by things like the ClockBound library).
On Tue, Jun 25, 2024 at 11:34:24PM +0200, Thomas Gleixner wrote: > There is effort underway to expose PTP clocks to user space via > VDSO. That sounds interesting. Has anything been posted? Thanks, Richard
On 25.06.24 21:01, David Woodhouse wrote: > From: David Woodhouse <dwmw@amazon.co.uk> > > The vmclock "device" provides a shared memory region with precision clock > information. By using shared memory, it is safe across Live Migration. > > Like the KVM PTP clock, this can convert TSC-based cross timestamps into > KVM clock values. Unlike the KVM PTP clock, it does so only when such is > actually helpful. > > The memory region of the device is also exposed to userspace so it can be > read or memory mapped by application which need reliable notification of > clock disruptions. > > Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> > --- > > v2: > • Add gettimex64() support > • Convert TSC values to KVM clock when appropriate > • Require int128 support > • Add counter_period_shift > • Add timeout when seq_count is invalid > • Add flags field > • Better comments in vmclock ABI structure > • Explicitly forbid smearing (as clock rates would need to change) Leap second smearing information could still be conveyed through the vmclock_abi. AFAIU, to cover the popular smearing variants, it should be enough to indicate whether the driver should apply linear or cosine smearing, and the start time and end time. > > drivers/ptp/Kconfig | 13 + > drivers/ptp/Makefile | 1 + > drivers/ptp/ptp_vmclock.c | 516 +++++++++++++++++++++++++++++++++++ > include/uapi/linux/vmclock.h | 138 ++++++++++ > 4 files changed, 668 insertions(+) > create mode 100644 drivers/ptp/ptp_vmclock.c > create mode 100644 include/uapi/linux/vmclock.h > [...] > + > +/* > + * Multiply a 64-bit count by a 64-bit tick 'period' in units of seconds >> 64 > + * and add the fractional second part of the reference time. > + * > + * The result is a 128-bit value, the top 64 bits of which are seconds, and > + * the low 64 bits are (seconds >> 64). > + * > + * If __int128 isn't available, perform the calculation 32 bits at a time to > + * avoid overflow. > + */ > +static inline uint64_t mul_u64_u64_shr_add_u64(uint64_t *res_hi, uint64_t delta, > + uint64_t period, uint8_t shift, > + uint64_t frac_sec) > +{ > + unsigned __int128 res = (unsigned __int128)delta * period; > + > + res >>= shift; > + res += frac_sec; > + *res_hi = res >> 64; > + return (uint64_t)res; > +} > + > +static int vmclock_get_crosststamp(struct vmclock_state *st, > + struct ptp_system_timestamp *sts, > + struct system_counterval_t *system_counter, > + struct timespec64 *tspec) > +{ > + ktime_t deadline = ktime_add(ktime_get(), VMCLOCK_MAX_WAIT); > + struct system_time_snapshot systime_snapshot; > + uint64_t cycle, delta, seq, frac_sec; > + > +#ifdef CONFIG_X86 > + /* > + * We'd expect the hypervisor to know this and to report the clock > + * status as VMCLOCK_STATUS_UNRELIABLE. But be paranoid. > + */ > + if (check_tsc_unstable()) > + return -EINVAL; > +#endif > + > + while (1) { > + seq = st->clk->seq_count & ~1ULL; > + virt_rmb(); > + > + if (st->clk->clock_status == VMCLOCK_STATUS_UNRELIABLE) > + return -EINVAL; > + > + /* > + * When invoked for gettimex64(), fill in the pre/post system > + * times. The simple case is when system time is based on the > + * same counter as st->cs_id, in which case all three times > + * will be derived from the *same* counter value. > + * > + * If the system isn't using the same counter, then the value > + * from ktime_get_snapshot() will still be used as pre_ts, and > + * ptp_read_system_postts() is called to populate postts after > + * calling get_cycles(). > + * > + * The conversion to timespec64 happens further down, outside > + * the seq_count loop. > + */ > + if (sts) { > + ktime_get_snapshot(&systime_snapshot); > + if (systime_snapshot.cs_id == st->cs_id) { > + cycle = systime_snapshot.cycles; > + } else { > + cycle = get_cycles(); > + ptp_read_system_postts(sts); > + } > + } else > + cycle = get_cycles(); > + > + delta = cycle - st->clk->counter_value; AFAIU in the general case this needs to be masked for non 64-bit counters. > + > + frac_sec = mul_u64_u64_shr_add_u64(&tspec->tv_sec, delta, > + st->clk->counter_period_frac_sec, > + st->clk->counter_period_shift, > + st->clk->utc_time_frac_sec); > + tspec->tv_nsec = mul_u64_u64_shr(frac_sec, NSEC_PER_SEC, 64); > + tspec->tv_sec += st->clk->utc_time_sec; > + > + virt_rmb(); > + if (seq == st->clk->seq_count) > + break; > + > + if (ktime_after(ktime_get(), deadline)) > + return -ETIMEDOUT; > + } > + > + if (system_counter) { > + system_counter->cycles = cycle; > + system_counter->cs_id = st->cs_id; > + } > + > + if (sts) { > + sts->pre_ts = ktime_to_timespec64(systime_snapshot.real); > + if (systime_snapshot.cs_id == st->cs_id) > + sts->post_ts = sts->pre_ts; > + } > + > + return 0; > +} > + [...] > + > +static const struct ptp_clock_info ptp_vmclock_info = { > + .owner = THIS_MODULE, > + .max_adj = 0, > + .n_ext_ts = 0, > + .n_pins = 0, > + .pps = 0, > + .adjfine = ptp_vmclock_adjfine, > + .adjtime = ptp_vmclock_adjtime, > + .gettime64 = ptp_vmclock_gettime, The .gettime64 op is now unneeded. > + .gettimex64 = ptp_vmclock_gettimex, > + .settime64 = ptp_vmclock_settime, > + .enable = ptp_vmclock_enable, > + .getcrosststamp = ptp_vmclock_getcrosststamp, > +}; > + [...] > diff --git a/include/uapi/linux/vmclock.h b/include/uapi/linux/vmclock.h > new file mode 100644 > index 000000000000..cf0f22205e79 > --- /dev/null > +++ b/include/uapi/linux/vmclock.h > @@ -0,0 +1,138 @@ > +/* SPDX-License-Identifier: ((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause) */ > + > +/* > + * This structure provides a vDSO-style clock to VM guests, exposing the > + * relationship (or lack thereof) between the CPU clock (TSC, timebase, arch > + * counter, etc.) and real time. It is designed to address the problem of > + * live migration, which other clock enlightenments do not. > + * > + * When a guest is live migrated, this affects the clock in two ways. > + * > + * First, even between identical hosts the actual frequency of the underlying > + * counter will change within the tolerances of its specification (typically > + * ±50PPM, or 4 seconds a day). The frequency also varies over time on the > + * same host, but can be tracked by NTP as it generally varies slowly. With > + * live migration there is a step change in the frequency, with no warning. > + * > + * Second, there may be a step change in the value of the counter itself, as > + * its accuracy is limited by the precision of the NTP synchronization on the > + * source and destination hosts. > + * > + * So any calibration (NTP, PTP, etc.) which the guest has done on the source > + * host before migration is invalid, and needs to be redone on the new host. > + * > + * In its most basic mode, this structure provides only an indication to the > + * guest that live migration has occurred. This allows the guest to know that > + * its clock is invalid and take remedial action. For applications that need > + * reliable accurate timestamps (e.g. distributed databases), the structure > + * can be mapped all the way to userspace. This allows the application to see > + * directly for itself that the clock is disrupted and take appropriate > + * action, even when using a vDSO-style method to get the time instead of a > + * system call. > + * > + * In its more advanced mode. this structure can also be used to expose the > + * precise relationship of the CPU counter to real time, as calibrated by the > + * host. This means that userspace applications can have accurate time > + * immediately after live migration, rather than having to pause operations > + * and wait for NTP to recover. This mode does, of course, rely on the > + * counter being reliable and consistent across CPUs. > + * > + * Note that this must be true UTC, never with smeared leap seconds. If a > + * guest wishes to construct a smeared clock, it can do so. Presenting a > + * smeared clock through this interface would be problematic because it > + * actually messes with the apparent counter *period*. A linear smearing > + * of 1 ms per second would effectively tweak the counter period by 1000PPM > + * at the start/end of the smearing period, while a sinusoidal smear would > + * basically be impossible to represent. Clock types other than UTC could also be supported: TAI, monotonic. > + */ > + > +#ifndef __VMCLOCK_H__ > +#define __VMCLOCK_H__ > + > +#ifdef __KERNEL__ > +#include <linux/types.h> > +#else > +#include <stdint.h> > +#endif > + > +struct vmclock_abi { > + uint32_t magic; > +#define VMCLOCK_MAGIC 0x4b4c4356 /* "VCLK" */ > + uint16_t size; /* Size of page containing this structure */ > + uint16_t version; /* 1 */ > + > + /* Sequence lock. Low bit means an update is in progress. */ > + uint64_t seq_count; > + > + /* > + * This field changes to another non-repeating value when the CPU > + * counter is disrupted, for example on live migration. > + */ > + uint64_t disruption_marker; The field could also change when the clock is stepped (leap seconds excepted), or when the clock frequency is slewed.
On Thu, 2024-06-27 at 15:50 +0200, Peter Hilber wrote: > On 25.06.24 21:01, David Woodhouse wrote: > > From: David Woodhouse <dwmw@amazon.co.uk> > > > > The vmclock "device" provides a shared memory region with precision clock > > information. By using shared memory, it is safe across Live Migration. > > > > Like the KVM PTP clock, this can convert TSC-based cross timestamps into > > KVM clock values. Unlike the KVM PTP clock, it does so only when such is > > actually helpful. > > > > The memory region of the device is also exposed to userspace so it can be > > read or memory mapped by application which need reliable notification of > > clock disruptions. > > > > Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> > > --- > > > > v2: > > • Add gettimex64() support > > • Convert TSC values to KVM clock when appropriate > > • Require int128 support > > • Add counter_period_shift > > • Add timeout when seq_count is invalid > > • Add flags field > > • Better comments in vmclock ABI structure > > • Explicitly forbid smearing (as clock rates would need to change) > > Leap second smearing information could still be conveyed through the > vmclock_abi. AFAIU, to cover the popular smearing variants, it should be > enough to indicate whether the driver should apply linear or cosine > smearing, and the start time and end time. Yes. The clock information actually conveyed through the {counter, time, rate} tuple should never be smeared, and should only ever be UTC. But we could provide a hint to the guest operating system about what type of smearing to perform, *if* it chooses to offer a clock other than the standard CLOCK_REALTIME to its users. I already added a flags field, so this might look something like: /* * Smearing flags. The UTC clock exposed through this structure * is only ever true UTC, but a guest operating system may * choose to offer a monotonic smeared clock to its users. This * merely offers a hint about what kind of smearing to perform, * for consistency with systems in the nearby environment. */ #define VMCLOCK_FLAGS_SMEAR_UTC_SLS (1<<5) /* draft-kuhn-leapsecond-00.txt */ (UTC-SLS is probably a bad example but are there formal definitions for anything else?) > > But we > > drivers/ptp/Kconfig | 13 + > > drivers/ptp/Makefile | 1 + > > drivers/ptp/ptp_vmclock.c | 516 +++++++++++++++++++++++++++++++++++ > > include/uapi/linux/vmclock.h | 138 ++++++++++ > > 4 files changed, 668 insertions(+) > > create mode 100644 drivers/ptp/ptp_vmclock.c > > create mode 100644 include/uapi/linux/vmclock.h > > > > [...] > > > + > > +/* > > + * Multiply a 64-bit count by a 64-bit tick 'period' in units of seconds >> 64 > > + * and add the fractional second part of the reference time. > > + * > > + * The result is a 128-bit value, the top 64 bits of which are seconds, and > > + * the low 64 bits are (seconds >> 64). > > + * > > + * If __int128 isn't available, perform the calculation 32 bits at a time to > > + * avoid overflow. > > + */ > > +static inline uint64_t mul_u64_u64_shr_add_u64(uint64_t *res_hi, uint64_t delta, > > + uint64_t period, uint8_t shift, > > + uint64_t frac_sec) > > +{ > > + unsigned __int128 res = (unsigned __int128)delta * period; > > + > > + res >>= shift; > > + res += frac_sec; > > + *res_hi = res >> 64; > > + return (uint64_t)res; > > +} > > + > > +static int vmclock_get_crosststamp(struct vmclock_state *st, > > + struct ptp_system_timestamp *sts, > > + struct system_counterval_t *system_counter, > > + struct timespec64 *tspec) > > +{ > > + ktime_t deadline = ktime_add(ktime_get(), VMCLOCK_MAX_WAIT); > > + struct system_time_snapshot systime_snapshot; > > + uint64_t cycle, delta, seq, frac_sec; > > + > > +#ifdef CONFIG_X86 > > + /* > > + * We'd expect the hypervisor to know this and to report the clock > > + * status as VMCLOCK_STATUS_UNRELIABLE. But be paranoid. > > + */ > > + if (check_tsc_unstable()) > > + return -EINVAL; > > +#endif > > + > > + while (1) { > > + seq = st->clk->seq_count & ~1ULL; > > + virt_rmb(); > > + > > + if (st->clk->clock_status == VMCLOCK_STATUS_UNRELIABLE) > > + return -EINVAL; > > + > > + /* > > + * When invoked for gettimex64(), fill in the pre/post system > > + * times. The simple case is when system time is based on the > > + * same counter as st->cs_id, in which case all three times > > + * will be derived from the *same* counter value. > > + * > > + * If the system isn't using the same counter, then the value > > + * from ktime_get_snapshot() will still be used as pre_ts, and > > + * ptp_read_system_postts() is called to populate postts after > > + * calling get_cycles(). > > + * > > + * The conversion to timespec64 happens further down, outside > > + * the seq_count loop. > > + */ > > + if (sts) { > > + ktime_get_snapshot(&systime_snapshot); > > + if (systime_snapshot.cs_id == st->cs_id) { > > + cycle = systime_snapshot.cycles; > > + } else { > > + cycle = get_cycles(); > > + ptp_read_system_postts(sts); > > + } > > + } else > > + cycle = get_cycles(); > > + > > + delta = cycle - st->clk->counter_value; > > AFAIU in the general case this needs to be masked for non 64-bit counters. Are there any non-64-bit counters that will be exposed this way? And can we handle that with explicit knowledge of the counter type, rather than a separate explicit field? If we really dig deep, it gets horrible for scaled TSCs. The guest TSC is actually calculated from the host TSC with a scale and offset: guest_tsc = ((host_tsc * SCALE) >> 48_or_32) + OFFSET (The shift is 48 bits for Intel, 32 for AMD). So when the host TSC wraps to zero, the guest TSC jumps from some non- power-of-two upper limit, to OFFSET which is generally a *negative* value. And will continue counting from there until it actually reaches the 64-bit limit and wraps back to zero. > > > + > > + frac_sec = mul_u64_u64_shr_add_u64(&tspec->tv_sec, delta, > > + st->clk->counter_period_frac_sec, > > + st->clk->counter_period_shift, > > + st->clk->utc_time_frac_sec); > > + tspec->tv_nsec = mul_u64_u64_shr(frac_sec, NSEC_PER_SEC, 64); > > + tspec->tv_sec += st->clk->utc_time_sec; > > + > > + virt_rmb(); > > + if (seq == st->clk->seq_count) > > + break; > > + > > + if (ktime_after(ktime_get(), deadline)) > > + return -ETIMEDOUT; > > + } > > + > > + if (system_counter) { > > + system_counter->cycles = cycle; > > + system_counter->cs_id = st->cs_id; > > + } > > + > > + if (sts) { > > + sts->pre_ts = ktime_to_timespec64(systime_snapshot.real); > > + if (systime_snapshot.cs_id == st->cs_id) > > + sts->post_ts = sts->pre_ts; > > + } > > + > > + return 0; > > +} > > + > > [...] > > > + > > +static const struct ptp_clock_info ptp_vmclock_info = { > > + .owner = THIS_MODULE, > > + .max_adj = 0, > > + .n_ext_ts = 0, > > + .n_pins = 0, > > + .pps = 0, > > + .adjfine = ptp_vmclock_adjfine, > > + .adjtime = ptp_vmclock_adjtime, > > + .gettime64 = ptp_vmclock_gettime, > > The .gettime64 op is now unneeded. Ack, thanks. > > + .gettimex64 = ptp_vmclock_gettimex, > > + .settime64 = ptp_vmclock_settime, > > + .enable = ptp_vmclock_enable, > > + .getcrosststamp = ptp_vmclock_getcrosststamp, > > +}; > > + > > [...] > > > diff --git a/include/uapi/linux/vmclock.h b/include/uapi/linux/vmclock.h > > new file mode 100644 > > index 000000000000..cf0f22205e79 > > --- /dev/null > > +++ b/include/uapi/linux/vmclock.h > > @@ -0,0 +1,138 @@ > > +/* SPDX-License-Identifier: ((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause) */ > > + > > +/* > > + * This structure provides a vDSO-style clock to VM guests, exposing the > > + * relationship (or lack thereof) between the CPU clock (TSC, timebase, arch > > + * counter, etc.) and real time. It is designed to address the problem of > > + * live migration, which other clock enlightenments do not. > > + * > > + * When a guest is live migrated, this affects the clock in two ways. > > + * > > + * First, even between identical hosts the actual frequency of the underlying > > + * counter will change within the tolerances of its specification (typically > > + * ±50PPM, or 4 seconds a day). The frequency also varies over time on the > > + * same host, but can be tracked by NTP as it generally varies slowly. With > > + * live migration there is a step change in the frequency, with no warning. > > + * > > + * Second, there may be a step change in the value of the counter itself, as > > + * its accuracy is limited by the precision of the NTP synchronization on the > > + * source and destination hosts. > > + * > > + * So any calibration (NTP, PTP, etc.) which the guest has done on the source > > + * host before migration is invalid, and needs to be redone on the new host. > > + * > > + * In its most basic mode, this structure provides only an indication to the > > + * guest that live migration has occurred. This allows the guest to know that > > + * its clock is invalid and take remedial action. For applications that need > > + * reliable accurate timestamps (e.g. distributed databases), the structure > > + * can be mapped all the way to userspace. This allows the application to see > > + * directly for itself that the clock is disrupted and take appropriate > > + * action, even when using a vDSO-style method to get the time instead of a > > + * system call. > > + * > > + * In its more advanced mode. this structure can also be used to expose the > > + * precise relationship of the CPU counter to real time, as calibrated by the > > + * host. This means that userspace applications can have accurate time > > + * immediately after live migration, rather than having to pause operations > > + * and wait for NTP to recover. This mode does, of course, rely on the > > + * counter being reliable and consistent across CPUs. > > + * > > + * Note that this must be true UTC, never with smeared leap seconds. If a > > + * guest wishes to construct a smeared clock, it can do so. Presenting a > > + * smeared clock through this interface would be problematic because it > > + * actually messes with the apparent counter *period*. A linear smearing > > + * of 1 ms per second would effectively tweak the counter period by 1000PPM > > + * at the start/end of the smearing period, while a sinusoidal smear would > > + * basically be impossible to represent. > > Clock types other than UTC could also be supported: TAI, monotonic. This exposes both TAI *and* UTC, by exposing the TAI offset. Or it can expose UTC only, without the TAI offset if that's unknown. Should we also have a mode for exposing TAI only, for when TAI is known but not the offset from UTC? Is that really a likely scenario? Isn't a host much more likely to know UTC and *not* the TAI offset? I suppose if we have *hardware* implementations of this, they could be based on an atomic clock and all they'll have is TAI? So OK, maybe that makes sense? (We'd have to add something like the ART as the counter to pair with over an actual PCI bus, of course.) We can add a type field like the one you have for virtio-rtc, yes? > > > + */ > > + > > +#ifndef __VMCLOCK_H__ > > +#define __VMCLOCK_H__ > > + > > +#ifdef __KERNEL > > +#include <linux/types.h> > > +#else > > +#include <stdint.h> > > +#endif > > + > > +struct vmclock_abi { > > + uint32_t magic; > > +#define VMCLOCK_MAGIC 0x4b4c4356 /* "VCLK" */ > > + uint16_t size; /* Size of page containing this structure */ > > + uint16_t version; /* 1 */ > > + > > + /* Sequence lock. Low bit means an update is in progress. */ > > + uint64_t seq_count; > > + > > + /* > > + * This field changes to another non-repeating value when the CPU > > + * counter is disrupted, for example on live migration. > > + */ > > + uint64_t disruption_marker; > > The field could also change when the clock is stepped (leap seconds > excepted), or when the clock frequency is slewed. I'm not sure. The concept of the disruption marker is that it tells the guest to throw away any calibration of the counter that the guest has done for *itself* (with NTP, other PTP devices, etc.). One mode for this device would be not to populate the clock fields at all, but *only* to signal disruption when it occurs. So the guest can abort transactions until it's resynced its clocks (to avoid incurring fines if breaking databases, etc.). Exposing the host timekeeping through the structure means that the migrated guest can keep working because it can trust the timekeeping performed by the (new) host and exposed to it. If the counter is actually varying in frequency over time, and the host is slewing the clock frequency that it reports, that *isn't* a step change and doesn't mean that the guest should throw away any calibration that it's been doing for itself. One hopes that the guest would have detected the *same* frequency change, and be adapting for itself. So I don't think that should indicate a disruption. I think the same is even true if the clock is stepped by the host. The actual *counter* hasn't changed, so the guest is better off ignoring the vacillating host and continuing to derive its idea of time from the hardware counter itself, as calibrated against some external NTP/PTP sources. Surely we actively *don't* to tell the guest to throw its own calibrations away, in this case?
I've updated the tree at https://git.infradead.org/users/dwmw2/linux.git/shortlog/refs/heads/vmclock (but not yet the qemu one). I think I've taken into account all your comments apart from the one about non-64-bit counters wrapping. I reduced the seq_count to 32 bit to make room for a 32-bit flags field, added the time type (UTC/TAI/MONOTONIC) and a smearing hint, with some straw man definitions for smearing algorithms for which I could actually find definitions. The structure now looks like this: struct vmclock_abi { uint32_t magic; #define VMCLOCK_MAGIC 0x4b4c4356 /* "VCLK" */ uint16_t size; /* Size of page containing this structure */ uint16_t version; /* 1 */ /* Sequence lock. Low bit means an update is in progress. */ uint32_t seq_count; uint32_t flags; /* Indicates that the tai_offset_sec field is valid */ #define VMCLOCK_FLAG_TAI_OFFSET_VALID (1 << 0) /* * Optionally used to notify guests of pending maintenance events. * A guest may wish to remove itself from service if an event is * coming up. Two flags indicate the rough imminence of the event. */ #define VMCLOCK_FLAG_DISRUPTION_SOON (1 << 1) /* About a day */ #define VMCLOCK_FLAG_DISRUPTION_IMMINENT (1 << 2) /* About an hour */ /* Indicates that the utc_time_maxerror_picosec field is valid */ #define VMCLOCK_FLAG_UTC_MAXERROR_VALID (1 << 3) /* Indicates counter_period_error_rate_frac_sec is valid */ #define VMCLOCK_FLAG_PERIOD_ERROR_VALID (1 << 4) /* * This field changes to another non-repeating value when the CPU * counter is disrupted, for example on live migration. This lets * the guest know that it should discard any calibration it has * performed of the counter against external sources (NTP/PTP/etc.). */ uint64_t disruption_marker; uint8_t clock_status; #define VMCLOCK_STATUS_UNKNOWN 0 #define VMCLOCK_STATUS_INITIALIZING 1 #define VMCLOCK_STATUS_SYNCHRONIZED 2 #define VMCLOCK_STATUS_FREERUNNING 3 #define VMCLOCK_STATUS_UNRELIABLE 4 uint8_t counter_id; #define VMCLOCK_COUNTER_INVALID 0 #define VMCLOCK_COUNTER_X86_TSC 1 #define VMCLOCK_COUNTER_ARM_VCNT 2 #define VMCLOCK_COUNTER_X86_ART 3 /* * By providing the offset from UTC to TAI, the guest can know both * UTC and TAI reliably, whichever is indicated in the time_type * field. Valid if VMCLOCK_FLAG_TAI_OFFSET_VALID is set in flags. */ int16_t tai_offset_sec; /* * The time exposed through this device is never smeaared; if it * claims to be VMCLOCK_TIME_UTC then it MUST be UTC. This field * provides a hint to the guest operating system, such that *if* * the guest OS wants to provide its users with an alternative * clock which does not follow the POSIX CLOCK_REALTIME standard, * it may do so in a fashion consistent with the other systems * in the nearby environment. */ uint8_t leap_second_smearing_hint; /* Provide true UTC to users, unsmeared. */; #define VMCLOCK_SMEARING_NONE 0 /* * https://aws.amazon.com/blogs/aws/look-before-you-leap-the-coming-leap-second-and-aws/ * From noon on the day before to noon on the day after, smear the * clock by a linear 1/86400s per second. */ #define VMCLOCK_SMEARING_LINEAR_86400 1 /* * draft-kuhn-leapsecond-00 * For the 1000s leading up to the leap second, smear the clock by * clock by a linear 1ms per second. */ #define VMCLOCK_SMEARING_UTC_SLS 2 /* * What time is exposed in the time_sec/time_frac_sec fields? */ uint8_t time_type; #define VMCLOCK_TIME_UNKNOWN 0 /* Invalid / no time exposed */ #define VMCLOCK_TIME_UTC 1 /* Since 1970-01-01 00:00:00z */ #define VMCLOCK_TIME_TAI 2 /* Since 1970-01-01 00:00:00z */ #define VMCLOCK_TIME_MONOTONIC 3 /* Since undefined epoch */ /* Bit shift for counter_period_frac_sec and its error rate */ uint8_t counter_period_shift; /* * Unlike in NTP, this can indicate a leap second in the past. This * is needed to allow guests to derive an imprecise clock with * smeared leap seconds for themselves, as some modes of smearing * need the adjustments to continue even after the moment at which * the leap second should have occurred. */ int8_t leapsecond_direction; uint64_t leapsecond_tai_sec; /* Since 1970-01-01 00:00:00z */ /* * Paired values of counter and UTC at a given point in time. */ uint64_t counter_value; uint64_t time_sec; /* Since 1970-01-01 00:00:00z */ uint64_t time_frac_sec; /* * Counter frequency, and error margin. The unit of these fields is * seconds >> (64 + counter_period_shift) */ uint64_t counter_period_frac_sec; uint64_t counter_period_error_rate_frac_sec; /* Error margin of UTC reading above (± picoseconds) */ uint64_t utc_time_maxerror_picosec; }; #endif /* __VMCLOCK_H__ */
On 27.06.24 16:52, David Woodhouse wrote: > On Thu, 2024-06-27 at 15:50 +0200, Peter Hilber wrote: >> On 25.06.24 21:01, David Woodhouse wrote: >>> From: David Woodhouse <dwmw@amazon.co.uk> >>> >>> The vmclock "device" provides a shared memory region with precision clock >>> information. By using shared memory, it is safe across Live Migration. >>> >>> Like the KVM PTP clock, this can convert TSC-based cross timestamps into >>> KVM clock values. Unlike the KVM PTP clock, it does so only when such is >>> actually helpful. >>> >>> The memory region of the device is also exposed to userspace so it can be >>> read or memory mapped by application which need reliable notification of >>> clock disruptions. >>> >>> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> >>> --- >>> >>> v2: >>> • Add gettimex64() support >>> • Convert TSC values to KVM clock when appropriate >>> • Require int128 support >>> • Add counter_period_shift >>> • Add timeout when seq_count is invalid >>> • Add flags field >>> • Better comments in vmclock ABI structure >>> • Explicitly forbid smearing (as clock rates would need to change) >> >> Leap second smearing information could still be conveyed through the >> vmclock_abi. AFAIU, to cover the popular smearing variants, it should be >> enough to indicate whether the driver should apply linear or cosine >> smearing, and the start time and end time. > > Yes. The clock information actually conveyed through the {counter, > time, rate} tuple should never be smeared, and should only ever be UTC. > > But we could provide a hint to the guest operating system about what > type of smearing to perform, *if* it chooses to offer a clock other > than the standard CLOCK_REALTIME to its users. > > I already added a flags field, so this might look something like: > > /* > * Smearing flags. The UTC clock exposed through this structure > * is only ever true UTC, but a guest operating system may > * choose to offer a monotonic smeared clock to its users. This > * merely offers a hint about what kind of smearing to perform, > * for consistency with systems in the nearby environment. > */ > #define VMCLOCK_FLAGS_SMEAR_UTC_SLS (1<<5) /* draft-kuhn-leapsecond-00.txt */ > > > (UTC-SLS is probably a bad example but are there formal definitions for > anything else?) > > I think it could also be more generic, like flags for linear smearing, cosine smearing(?), and smear_start_sec and smear_end_sec fields (relative to the leap second start). That could also represent UTC-SLS, and noon-to-noon, and it would be well-defined. This should reduce the likelihood that the guest doesn't know the smearing variant. [...] >>> diff --git a/include/uapi/linux/vmclock.h b/include/uapi/linux/vmclock.h >>> new file mode 100644 >>> index 000000000000..cf0f22205e79 >>> --- /dev/null >>> +++ b/include/uapi/linux/vmclock.h >>> @@ -0,0 +1,138 @@ >>> +/* SPDX-License-Identifier: ((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause) */ >>> + >>> +/* >>> + * This structure provides a vDSO-style clock to VM guests, exposing the >>> + * relationship (or lack thereof) between the CPU clock (TSC, timebase, arch >>> + * counter, etc.) and real time. It is designed to address the problem of >>> + * live migration, which other clock enlightenments do not. >>> + * >>> + * When a guest is live migrated, this affects the clock in two ways. >>> + * >>> + * First, even between identical hosts the actual frequency of the underlying >>> + * counter will change within the tolerances of its specification (typically >>> + * ±50PPM, or 4 seconds a day). The frequency also varies over time on the >>> + * same host, but can be tracked by NTP as it generally varies slowly. With >>> + * live migration there is a step change in the frequency, with no warning. >>> + * >>> + * Second, there may be a step change in the value of the counter itself, as >>> + * its accuracy is limited by the precision of the NTP synchronization on the >>> + * source and destination hosts. >>> + * >>> + * So any calibration (NTP, PTP, etc.) which the guest has done on the source >>> + * host before migration is invalid, and needs to be redone on the new host. >>> + * >>> + * In its most basic mode, this structure provides only an indication to the >>> + * guest that live migration has occurred. This allows the guest to know that >>> + * its clock is invalid and take remedial action. For applications that need >>> + * reliable accurate timestamps (e.g. distributed databases), the structure >>> + * can be mapped all the way to userspace. This allows the application to see >>> + * directly for itself that the clock is disrupted and take appropriate >>> + * action, even when using a vDSO-style method to get the time instead of a >>> + * system call. >>> + * >>> + * In its more advanced mode. this structure can also be used to expose the >>> + * precise relationship of the CPU counter to real time, as calibrated by the >>> + * host. This means that userspace applications can have accurate time >>> + * immediately after live migration, rather than having to pause operations >>> + * and wait for NTP to recover. This mode does, of course, rely on the >>> + * counter being reliable and consistent across CPUs. >>> + * >>> + * Note that this must be true UTC, never with smeared leap seconds. If a >>> + * guest wishes to construct a smeared clock, it can do so. Presenting a >>> + * smeared clock through this interface would be problematic because it >>> + * actually messes with the apparent counter *period*. A linear smearing >>> + * of 1 ms per second would effectively tweak the counter period by 1000PPM >>> + * at the start/end of the smearing period, while a sinusoidal smear would >>> + * basically be impossible to represent. >> >> Clock types other than UTC could also be supported: TAI, monotonic. > > This exposes both TAI *and* UTC, by exposing the TAI offset. Or it can > expose UTC only, without the TAI offset if that's unknown. > > Should we also have a mode for exposing TAI only, for when TAI is known > but not the offset from UTC? Is that really a likely scenario? Isn't a > host much more likely to know UTC and *not* the TAI offset? > > I suppose if we have *hardware* implementations of this, they could be > based on an atomic clock and all they'll have is TAI? So OK, maybe that > makes sense? > > (We'd have to add something like the ART as the counter to pair with > over an actual PCI bus, of course.) > > We can add a type field like the one you have for virtio-rtc, yes? > > Ack. >> >>> + */ >>> + >>> +#ifndef __VMCLOCK_H__ >>> +#define __VMCLOCK_H__ >>> + >>> +#ifdef __KERNEL >>> +#include <linux/types.h> >>> +#else >>> +#include <stdint.h> >>> +#endif >>> + >>> +struct vmclock_abi { >>> + uint32_t magic; >>> +#define VMCLOCK_MAGIC 0x4b4c4356 /* "VCLK" */ >>> + uint16_t size; /* Size of page containing this structure */ >>> + uint16_t version; /* 1 */ >>> + >>> + /* Sequence lock. Low bit means an update is in progress. */ >>> + uint64_t seq_count; >>> + >>> + /* >>> + * This field changes to another non-repeating value when the CPU >>> + * counter is disrupted, for example on live migration. >>> + */ >>> + uint64_t disruption_marker; >> >> The field could also change when the clock is stepped (leap seconds >> excepted), or when the clock frequency is slewed. > > I'm not sure. The concept of the disruption marker is that it tells the > guest to throw away any calibration of the counter that the guest has > done for *itself* (with NTP, other PTP devices, etc.). > > One mode for this device would be not to populate the clock fields at > all, but *only* to signal disruption when it occurs. So the guest can > abort transactions until it's resynced its clocks (to avoid incurring > fines if breaking databases, etc.). > > Exposing the host timekeeping through the structure means that the > migrated guest can keep working because it can trust the timekeeping > performed by the (new) host and exposed to it. > > If the counter is actually varying in frequency over time, and the host > is slewing the clock frequency that it reports, that *isn't* a step > change and doesn't mean that the guest should throw away any > calibration that it's been doing for itself. One hopes that the guest > would have detected the *same* frequency change, and be adapting for > itself. So I don't think that should indicate a disruption. > > I think the same is even true if the clock is stepped by the host. The > actual *counter* hasn't changed, so the guest is better off ignoring > the vacillating host and continuing to derive its idea of time from the > hardware counter itself, as calibrated against some external NTP/PTP > sources. Surely we actively *don't* to tell the guest to throw its own > calibrations away, in this case? In case the guest is also considering other time sources, it might indeed not be a good idea to mix host clock changes into the hardware counter disruption marker. But if the vmclock is the authoritative source of time, it can still be helpful to know about such changes, maybe through another marker.
On 27.06.24 18:03, David Woodhouse wrote: > > I've updated the tree at > https://git.infradead.org/users/dwmw2/linux.git/shortlog/refs/heads/vmclock > (but not yet the qemu one). > > I think I've taken into account all your comments apart from the one > about non-64-bit counters wrapping. I reduced the seq_count to 32 bit > to make room for a 32-bit flags field, added the time type > (UTC/TAI/MONOTONIC) and a smearing hint, with some straw man > definitions for smearing algorithms for which I could actually find > definitions. > > The structure now looks like this: > > > struct vmclock_abi { [...] > > /* > * What time is exposed in the time_sec/time_frac_sec fields? > */ > uint8_t time_type; > #define VMCLOCK_TIME_UNKNOWN 0 /* Invalid / no time exposed */ > #define VMCLOCK_TIME_UTC 1 /* Since 1970-01-01 00:00:00z */ > #define VMCLOCK_TIME_TAI 2 /* Since 1970-01-01 00:00:00z */ > #define VMCLOCK_TIME_MONOTONIC 3 /* Since undefined epoch */ > > /* Bit shift for counter_period_frac_sec and its error rate */ > uint8_t counter_period_shift; > > /* > * Unlike in NTP, this can indicate a leap second in the past. This > * is needed to allow guests to derive an imprecise clock with > * smeared leap seconds for themselves, as some modes of smearing > * need the adjustments to continue even after the moment at which > * the leap second should have occurred. > */ > int8_t leapsecond_direction; > uint64_t leapsecond_tai_sec; /* Since 1970-01-01 00:00:00z */ > > /* > * Paired values of counter and UTC at a given point in time. > */ > uint64_t counter_value; > uint64_t time_sec; /* Since 1970-01-01 00:00:00z */ Nitpick: The comment is not valid any more for TIME_MONOTONIC.
On Fri, 2024-06-28 at 13:33 +0200, Peter Hilber wrote: > > > > > /* > > * What time is exposed in the time_sec/time_frac_sec fields? > > */ > > uint8_t time_type; > > #define VMCLOCK_TIME_UNKNOWN 0 /* Invalid / no time exposed */ > > #define VMCLOCK_TIME_UTC 1 /* Since 1970-01-01 00:00:00z */ > > #define VMCLOCK_TIME_TAI 2 /* Since 1970-01-01 00:00:00z */ > > #define VMCLOCK_TIME_MONOTONIC 3 /* Since undefined epoch */ > > > > /* Bit shift for counter_period_frac_sec and its error rate */ > > uint8_t counter_period_shift; > > > > /* > > * Unlike in NTP, this can indicate a leap second in the past. This > > * is needed to allow guests to derive an imprecise clock with > > * smeared leap seconds for themselves, as some modes of smearing > > * need the adjustments to continue even after the moment at which > > * the leap second should have occurred. > > */ > > int8_t leapsecond_direction; > > uint64_t leapsecond_tai_sec; /* Since 1970-01-01 00:00:00z */ > > > > /* > > * Paired values of counter and UTC at a given point in time. > > */ > > uint64_t counter_value; > > uint64_t time_sec; /* Since 1970-01-01 00:00:00z */ > > Nitpick: The comment is not valid any more for TIME_MONOTONIC. Ah yes, I "moved" that comment up to the UTC/TAI time_type values, but neglected to actually delete it from here. Fixed; thanks.
On Fri, 2024-06-28 at 13:33 +0200, Peter Hilber wrote: > On 27.06.24 16:52, David Woodhouse wrote: > > I already added a flags field, so this might look something like: > > > > /* > > * Smearing flags. The UTC clock exposed through this structure > > * is only ever true UTC, but a guest operating system may > > * choose to offer a monotonic smeared clock to its users. This > > * merely offers a hint about what kind of smearing to perform, > > * for consistency with systems in the nearby environment. > > */ > > #define VMCLOCK_FLAGS_SMEAR_UTC_SLS (1<<5) /* draft-kuhn-leapsecond-00.txt */ > > > > (UTC-SLS is probably a bad example but are there formal definitions for > > anything else?) > > I think it could also be more generic, like flags for linear smearing, > cosine smearing(?), and smear_start_sec and smear_end_sec fields (relative > to the leap second start). That could also represent UTC-SLS, and > noon-to-noon, and it would be well-defined. > > This should reduce the likelihood that the guest doesn't know the smearing > variant. I'm wary of making it too generic. That would seem to encourage a *proliferation* of false "UTC-like" clocks. It's bad enough that we do smearing at all, let alone that we don't have a single definition of how to do it. I made the smearing hint a full uint8_t instead of using bits in flags, in the end. That gives us a full 255 ways of lying to users about what the time is, so we're unlikely to run out. And it's easy enough to add a new VMCLOCK_SMEARING_XXX type to the 'registry' for any new methods that get invented. > > > > + /* > > > > + * This field changes to another non-repeating value when the CPU > > > > + * counter is disrupted, for example on live migration. > > > > + */ > > > > + uint64_t disruption_marker; > > > > > > The field could also change when the clock is stepped (leap seconds > > > excepted), or when the clock frequency is slewed. > > > > I'm not sure. The concept of the disruption marker is that it tells the > > guest to throw away any calibration of the counter that the guest has > > done for *itself* (with NTP, other PTP devices, etc.). > > > > One mode for this device would be not to populate the clock fields at > > all, but *only* to signal disruption when it occurs. So the guest can > > abort transactions until it's resynced its clocks (to avoid incurring > > fines if breaking databases, etc.). > > > > Exposing the host timekeeping through the structure means that the > > migrated guest can keep working because it can trust the timekeeping > > performed by the (new) host and exposed to it. > > > > If the counter is actually varying in frequency over time, and the host > > is slewing the clock frequency that it reports, that *isn't* a step > > change and doesn't mean that the guest should throw away any > > calibration that it's been doing for itself. One hopes that the guest > > would have detected the *same* frequency change, and be adapting for > > itself. So I don't think that should indicate a disruption. > > > > I think the same is even true if the clock is stepped by the host. The > > actual *counter* hasn't changed, so the guest is better off ignoring > > the vacillating host and continuing to derive its idea of time from the > > hardware counter itself, as calibrated against some external NTP/PTP > > sources. Surely we actively *don't* to tell the guest to throw its own > > calibrations away, in this case? > > In case the guest is also considering other time sources, it might indeed > not be a good idea to mix host clock changes into the hardware counter > disruption marker. > > But if the vmclock is the authoritative source of time, it can still be > helpful to know about such changes, maybe through another marker. Could that be the existing seq_count field? Skewing the counter_period_frac_sec as the underlying oscillator speeds up and slows down is perfectly normal and expected, and we already expect the seq_count to change when that happens. Maybe step changes are different, but arguably if the time advertised by the host steps *outside* the error bounds previously advertised, that's just broken? Depending on how the clock information is fed, a change in seq_count may even result in non-monotonicity. If the underlying oscillator has sped up and the structure is updated accordingly, the time calculated the moment *before* that update may appear later than the time calculated immediately after it. It's up to the guest operating system to feed that information into its own timekeeping system and skew towards correctness instead of stepping the time it reports to its users.
On 28.06.24 14:15, David Woodhouse wrote: > On Fri, 2024-06-28 at 13:33 +0200, Peter Hilber wrote: >> On 27.06.24 16:52, David Woodhouse wrote: >>> I already added a flags field, so this might look something like: >>> >>> /* >>> * Smearing flags. The UTC clock exposed through this structure >>> * is only ever true UTC, but a guest operating system may >>> * choose to offer a monotonic smeared clock to its users. This >>> * merely offers a hint about what kind of smearing to perform, >>> * for consistency with systems in the nearby environment. >>> */ >>> #define VMCLOCK_FLAGS_SMEAR_UTC_SLS (1<<5) /* draft-kuhn-leapsecond-00.txt */ >>> >>> (UTC-SLS is probably a bad example but are there formal definitions for >>> anything else?) >> >> I think it could also be more generic, like flags for linear smearing, >> cosine smearing(?), and smear_start_sec and smear_end_sec fields (relative >> to the leap second start). That could also represent UTC-SLS, and >> noon-to-noon, and it would be well-defined. >> >> This should reduce the likelihood that the guest doesn't know the smearing >> variant. > > I'm wary of making it too generic. That would seem to encourage a > *proliferation* of false "UTC-like" clocks. > > It's bad enough that we do smearing at all, let alone that we don't > have a single definition of how to do it. > > I made the smearing hint a full uint8_t instead of using bits in flags, > in the end. That gives us a full 255 ways of lying to users about what > the time is, so we're unlikely to run out. And it's easy enough to add > a new VMCLOCK_SMEARING_XXX type to the 'registry' for any new methods > that get invented. > > My concern is that the registry update may come after a driver has already been implemented, so that it may be hard to ensure that the smearing which has been chosen is actually implemented. >>>>> + /* >>>>> + * This field changes to another non-repeating value when the CPU >>>>> + * counter is disrupted, for example on live migration. >>>>> + */ >>>>> + uint64_t disruption_marker; >>>> >>>> The field could also change when the clock is stepped (leap seconds >>>> excepted), or when the clock frequency is slewed. >>> >>> I'm not sure. The concept of the disruption marker is that it tells the >>> guest to throw away any calibration of the counter that the guest has >>> done for *itself* (with NTP, other PTP devices, etc.). >>> >>> One mode for this device would be not to populate the clock fields at >>> all, but *only* to signal disruption when it occurs. So the guest can >>> abort transactions until it's resynced its clocks (to avoid incurring >>> fines if breaking databases, etc.). >>> >>> Exposing the host timekeeping through the structure means that the >>> migrated guest can keep working because it can trust the timekeeping >>> performed by the (new) host and exposed to it. >>> >>> If the counter is actually varying in frequency over time, and the host >>> is slewing the clock frequency that it reports, that *isn't* a step >>> change and doesn't mean that the guest should throw away any >>> calibration that it's been doing for itself. One hopes that the guest >>> would have detected the *same* frequency change, and be adapting for >>> itself. So I don't think that should indicate a disruption. >>> >>> I think the same is even true if the clock is stepped by the host. The >>> actual *counter* hasn't changed, so the guest is better off ignoring >>> the vacillating host and continuing to derive its idea of time from the >>> hardware counter itself, as calibrated against some external NTP/PTP >>> sources. Surely we actively *don't* to tell the guest to throw its own >>> calibrations away, in this case? >> >> In case the guest is also considering other time sources, it might indeed >> not be a good idea to mix host clock changes into the hardware counter >> disruption marker. >> >> But if the vmclock is the authoritative source of time, it can still be >> helpful to know about such changes, maybe through another marker. > > Could that be the existing seq_count field? > > Skewing the counter_period_frac_sec as the underlying oscillator speeds > up and slows down is perfectly normal and expected, and we already > expect the seq_count to change when that happens. > > Maybe step changes are different, but arguably if the time advertised > by the host steps *outside* the error bounds previously advertised, > that's just broken? But the error bounds could be large or missing. I am trying to address use cases where the host steps or slews the clock as well. > > Depending on how the clock information is fed, a change in seq_count > may even result in non-monotonicity. If the underlying oscillator has > sped up and the structure is updated accordingly, the time calculated > the moment *before* that update may appear later than the time > calculated immediately after it. > > It's up to the guest operating system to feed that information into its > own timekeeping system and skew towards correctness instead of stepping > the time it reports to its users. > The guest can anyway infer from the other information that the clock changed, so my proposal might not be that useful. Maybe it can be added in a future version if there is any need.
On 28 June 2024 17:38:15 BST, Peter Hilber <peter.hilber@opensynergy.com> wrote: >On 28.06.24 14:15, David Woodhouse wrote: >> On Fri, 2024-06-28 at 13:33 +0200, Peter Hilber wrote: >>> On 27.06.24 16:52, David Woodhouse wrote: >>>> I already added a flags field, so this might look something like: >>>> >>>> /* >>>> * Smearing flags. The UTC clock exposed through this structure >>>> * is only ever true UTC, but a guest operating system may >>>> * choose to offer a monotonic smeared clock to its users. This >>>> * merely offers a hint about what kind of smearing to perform, >>>> * for consistency with systems in the nearby environment. >>>> */ >>>> #define VMCLOCK_FLAGS_SMEAR_UTC_SLS (1<<5) /* draft-kuhn-leapsecond-00.txt */ >>>> >>>> (UTC-SLS is probably a bad example but are there formal definitions for >>>> anything else?) >>> >>> I think it could also be more generic, like flags for linear smearing, >>> cosine smearing(?), and smear_start_sec and smear_end_sec fields (relative >>> to the leap second start). That could also represent UTC-SLS, and >>> noon-to-noon, and it would be well-defined. >>> >>> This should reduce the likelihood that the guest doesn't know the smearing >>> variant. >> >> I'm wary of making it too generic. That would seem to encourage a >> *proliferation* of false "UTC-like" clocks. >> >> It's bad enough that we do smearing at all, let alone that we don't >> have a single definition of how to do it. >> >> I made the smearing hint a full uint8_t instead of using bits in flags, >> in the end. That gives us a full 255 ways of lying to users about what >> the time is, so we're unlikely to run out. And it's easy enough to add >> a new VMCLOCK_SMEARING_XXX type to the 'registry' for any new methods >> that get invented. >> >> > >My concern is that the registry update may come after a driver has already >been implemented, so that it may be hard to ensure that the smearing which >has been chosen is actually implemented. Well yes, but why in the name of all that is holy would anyone want to invent *new* ways to lie to users about the time? If we capture the existing ones as we write this, surely it's a good thing that there's a barrier to entry for adding more? >But the error bounds could be large or missing. I am trying to address use >cases where the host steps or slews the clock as well. The host is absolutely intended to be skewing the clock to keep it accurate as the frequency of the underlying oscillator changes, and the seq_count field will change every time the host does so. Do we need to handle steps differently? Or just let the guest deal with it? If an NTP server suddenly steps the time it reports, what does the client do?
+ Kees Cook, linux-hardening On Tue, Jun 25, 2024 at 08:01:56PM +0100, David Woodhouse wrote: > From: David Woodhouse <dwmw@amazon.co.uk> > > The vmclock "device" provides a shared memory region with precision clock > information. By using shared memory, it is safe across Live Migration. > > Like the KVM PTP clock, this can convert TSC-based cross timestamps into > KVM clock values. Unlike the KVM PTP clock, it does so only when such is > actually helpful. > > The memory region of the device is also exposed to userspace so it can be > read or memory mapped by application which need reliable notification of > clock disruptions. > > Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> ... > diff --git a/drivers/ptp/ptp_vmclock.c b/drivers/ptp/ptp_vmclock.c ... > +static int vmclock_probe(struct platform_device *pdev) > +{ ... > + /* If there is valid clock information, register a PTP clock */ > + if (st->cs_id) { > + st->ptp_clock_info = ptp_vmclock_info; > + strncpy(st->ptp_clock_info.name, st->name, sizeof(st->ptp_clock_info.name)); Hi David, W=1 allmodconfig builds with gcc-13 flag the following. Reading the documentation of strncpy() in fortify-string.h, I wonder if strscpy() would be more appropriate in this case. In file included from ./include/linux/string.h:374, from ./include/linux/bitmap.h:13, from ./include/linux/cpumask.h:13, from ./arch/x86/include/asm/paravirt.h:21, from ./arch/x86/include/asm/cpuid.h:62, from ./arch/x86/include/asm/processor.h:19, from ./include/linux/sched.h:13, from ./include/linux/ratelimit.h:6, from ./include/linux/dev_printk.h:16, from ./include/linux/device.h:15, from drivers/ptp/ptp_vmclock.c:8: In function 'strncpy', inlined from 'vmclock_probe' at drivers/ptp/ptp_vmclock.c:480:3: ./include/linux/fortify-string.h:125:33: warning: '__builtin_strncpy' specified bound 32 equals destination size [-Wstringop-truncation] 125 | #define __underlying_strncpy __builtin_strncpy | ^ ./include/linux/fortify-string.h:205:16: note: in expansion of macro '__underlying_strncpy' 205 | return __underlying_strncpy(p, q, size); ...
On Sun, 2024-06-30 at 14:28 +0100, Simon Horman wrote: > > W=1 allmodconfig builds with gcc-13 flag the following. > Reading the documentation of strncpy() in fortify-string.h, > I wonder if strscpy() would be more appropriate in this case. It's never going to matter in practice, as the buffer is 32 bytes, so "vmclock%d" would never get to the end even if was a 64-bit integer. But as a matter of hygiene and best practice, yes strscpy() would be more appropriate. Fixed in git, thanks.
On Fri, 2024-06-28 at 22:27 +0100, David Woodhouse wrote: > On 28 June 2024 17:38:15 BST, Peter Hilber <peter.hilber@opensynergy.com> wrote: > > On 28.06.24 14:15, David Woodhouse wrote: > > > On Fri, 2024-06-28 at 13:33 +0200, Peter Hilber wrote: > > > > On 27.06.24 16:52, David Woodhouse wrote: > > > > > I already added a flags field, so this might look something like: > > > > > > > > > > /* > > > > > * Smearing flags. The UTC clock exposed through this structure > > > > > * is only ever true UTC, but a guest operating system may > > > > > * choose to offer a monotonic smeared clock to its users. This > > > > > * merely offers a hint about what kind of smearing to perform, > > > > > * for consistency with systems in the nearby environment. > > > > > */ > > > > > #define VMCLOCK_FLAGS_SMEAR_UTC_SLS (1<<5) /* draft-kuhn-leapsecond-00.txt */ > > > > > > > > > > (UTC-SLS is probably a bad example but are there formal definitions for > > > > > anything else?) > > > > > > > > I think it could also be more generic, like flags for linear smearing, > > > > cosine smearing(?), and smear_start_sec and smear_end_sec fields (relative > > > > to the leap second start). That could also represent UTC-SLS, and > > > > noon-to-noon, and it would be well-defined. > > > > > > > > This should reduce the likelihood that the guest doesn't know the smearing > > > > variant. > > > > > > I'm wary of making it too generic. That would seem to encourage a > > > *proliferation* of false "UTC-like" clocks. > > > > > > It's bad enough that we do smearing at all, let alone that we don't > > > have a single definition of how to do it. > > > > > > I made the smearing hint a full uint8_t instead of using bits in flags, > > > in the end. That gives us a full 255 ways of lying to users about what > > > the time is, so we're unlikely to run out. And it's easy enough to add > > > a new VMCLOCK_SMEARING_XXX type to the 'registry' for any new methods > > > that get invented. > > > > > > > > > > My concern is that the registry update may come after a driver has already > > been implemented, so that it may be hard to ensure that the smearing which > > has been chosen is actually implemented. > > Well yes, but why in the name of all that is holy would anyone want > to invent *new* ways to lie to users about the time? If we capture > the existing ones as we write this, surely it's a good thing that > there's a barrier to entry for adding more? Ultimately though, this isn't the hill for me to die on. I'm pushing on that topic because I want to avoid the proliferation of *ambiguity*. If we have a precision clock, we should *know* what the time is. So how about this proposal. I line up the fields in the proposed shared memory structure to match your virtio-rtc proposal, using 'subtype' as you proposed. But, instead of the 'subtype' being valid only for VIRTIO_RTC_CLOCK_UTC, we define a new top-level type for *smeared* UTC. So, you have: +\begin{lstlisting} +#define VIRTIO_RTC_CLOCK_UTC 0 +#define VIRTIO_RTC_CLOCK_TAI 1 +#define VIRTIO_RTC_CLOCK_MONO 2 +\end{lstlisting} I propose that you add #define VIRTIO_RTC_CLOCK_SMEARED_UTC 3 If my proposed memory structure is subsumed into the virtio-rtc proposal we'd literally use the same names, but for the time being I'll update mine to: /* * What time is exposed in the time_sec/time_frac_sec fields? */ uint8_t time_type; #define VMCLOCK_TIME_UTC 0 /* Since 1970-01-01 00:00:00z */ #define VMCLOCK_TIME_TAI 1 /* Since 1970-01-01 00:00:00z */ #define VMCLOCK_TIME_MONOTONIC 2 /* Since undefined epoch */ #define VMCLOCK_TIME_INVALID 3 /* virtio-rtc uses this for smeared UTC */ I can then use your smearing subtype values as the 'hint' field in the shared memory structure. You currently have: +\begin{lstlisting} +#define VIRTIO_RTC_SUBTYPE_STRICT 0 +#define VIRTIO_RTC_SUBTYPE_SMEAR 1 +#define VIRTIO_RTC_SUBTYPE_SMEAR_NOON_LINEAR 2 +#define VIRTIO_RTC_SUBTYPE_LEAP_UNSPECIFIED 3 +\end{lstlisting} I can certainly ensure that 'noon linear' has the same value. I don't think you need both 'SMEAR' and 'LEAP_UNSPECIFIED' though: +\item VIRTIO_RTC_SUBTYPE_SMEAR deviates from the UTC standard by + smearing time in the vicinity of the leap second, in a not + precisely defined manner. This avoids clock steps due to UTC + leap seconds. ... +\item VIRTIO_RTC_SUBTYPE_LEAP_UNSPECIFIED may deviate from the UTC + standard w.r.t.\ leap second introduction in an unspecified way + (leap seconds may, or may not, be smeared). To the client, both of those just mean "for a day or so around a leap second event, you can't trust this device to know what the time is". There isn't any point in separating "does lie to you" from "might lie to you", surely? The guest can't do anything useful with that distinction. Let's drop SMEAR and keep only LEAP_UNSPECIFIED? And if you *really* want to parameterise it, I think that's a bad idea and it encourages the proliferation of different time "standards", but I'd probably just suck it up and do whatever you do because that's not strictly within the remit of my live-migration part.
On Mon, Jul 01, 2024 at 09:02:38AM +0100, David Woodhouse wrote: > On Sun, 2024-06-30 at 14:28 +0100, Simon Horman wrote: > > > > W=1 allmodconfig builds with gcc-13 flag the following. > > Reading the documentation of strncpy() in fortify-string.h, > > I wonder if strscpy() would be more appropriate in this case. > > It's never going to matter in practice, as the buffer is 32 bytes, so > "vmclock%d" would never get to the end even if was a 64-bit integer. > > But as a matter of hygiene and best practice, yes strscpy() would be > more appropriate. Fixed in git, thanks. Thanks! You can even use the 2-argument version. :)
On 01.07.24 10:57, David Woodhouse wrote: > On Fri, 2024-06-28 at 22:27 +0100, David Woodhouse wrote: >> On 28 June 2024 17:38:15 BST, Peter Hilber <peter.hilber@opensynergy.com> wrote: >>> On 28.06.24 14:15, David Woodhouse wrote: >>>> On Fri, 2024-06-28 at 13:33 +0200, Peter Hilber wrote: >>>>> On 27.06.24 16:52, David Woodhouse wrote: >>>>>> I already added a flags field, so this might look something like: >>>>>> >>>>>> /* >>>>>> * Smearing flags. The UTC clock exposed through this structure >>>>>> * is only ever true UTC, but a guest operating system may >>>>>> * choose to offer a monotonic smeared clock to its users. This >>>>>> * merely offers a hint about what kind of smearing to perform, >>>>>> * for consistency with systems in the nearby environment. >>>>>> */ >>>>>> #define VMCLOCK_FLAGS_SMEAR_UTC_SLS (1<<5) /* draft-kuhn-leapsecond-00.txt */ >>>>>> >>>>>> (UTC-SLS is probably a bad example but are there formal definitions for >>>>>> anything else?) >>>>> >>>>> I think it could also be more generic, like flags for linear smearing, >>>>> cosine smearing(?), and smear_start_sec and smear_end_sec fields (relative >>>>> to the leap second start). That could also represent UTC-SLS, and >>>>> noon-to-noon, and it would be well-defined. >>>>> >>>>> This should reduce the likelihood that the guest doesn't know the smearing >>>>> variant. >>>> >>>> I'm wary of making it too generic. That would seem to encourage a >>>> *proliferation* of false "UTC-like" clocks. >>>> >>>> It's bad enough that we do smearing at all, let alone that we don't >>>> have a single definition of how to do it. >>>> >>>> I made the smearing hint a full uint8_t instead of using bits in flags, >>>> in the end. That gives us a full 255 ways of lying to users about what >>>> the time is, so we're unlikely to run out. And it's easy enough to add >>>> a new VMCLOCK_SMEARING_XXX type to the 'registry' for any new methods >>>> that get invented. >>>> >>>> >>> >>> My concern is that the registry update may come after a driver has already >>> been implemented, so that it may be hard to ensure that the smearing which >>> has been chosen is actually implemented. >> >> Well yes, but why in the name of all that is holy would anyone want >> to invent *new* ways to lie to users about the time? If we capture >> the existing ones as we write this, surely it's a good thing that >> there's a barrier to entry for adding more? > > Ultimately though, this isn't the hill for me to die on. I'm pushing on > that topic because I want to avoid the proliferation of *ambiguity*. If > we have a precision clock, we should *know* what the time is. > > So how about this proposal. I line up the fields in the proposed shared > memory structure to match your virtio-rtc proposal, using 'subtype' as > you proposed. But, instead of the 'subtype' being valid only for > VIRTIO_RTC_CLOCK_UTC, we define a new top-level type for *smeared* UTC. > > So, you have: > > +\begin{lstlisting} > +#define VIRTIO_RTC_CLOCK_UTC 0 > +#define VIRTIO_RTC_CLOCK_TAI 1 > +#define VIRTIO_RTC_CLOCK_MONO 2 > +\end{lstlisting} > > I propose that you add > > #define VIRTIO_RTC_CLOCK_SMEARED_UTC 3 > > If my proposed memory structure is subsumed into the virtio-rtc > proposal we'd literally use the same names, but for the time being I'll > update mine to: Do you intend vmclock and virtio-rtc to be ABI compatible? FYI, I see a potential problem in that Virtio does avoid the use of signed integers so far. I did not check carefully if there might be other problems, yet. > > /* > * What time is exposed in the time_sec/time_frac_sec fields? > */ > uint8_t time_type; > #define VMCLOCK_TIME_UTC 0 /* Since 1970-01-01 00:00:00z */ > #define VMCLOCK_TIME_TAI 1 /* Since 1970-01-01 00:00:00z */ > #define VMCLOCK_TIME_MONOTONIC 2 /* Since undefined epoch */ > #define VMCLOCK_TIME_INVALID 3 /* virtio-rtc uses this for smeared UTC */ > > > I can then use your smearing subtype values as the 'hint' field in the > shared memory structure. You currently have: > > +\begin{lstlisting} > +#define VIRTIO_RTC_SUBTYPE_STRICT 0 > +#define VIRTIO_RTC_SUBTYPE_SMEAR 1 > +#define VIRTIO_RTC_SUBTYPE_SMEAR_NOON_LINEAR 2 > +#define VIRTIO_RTC_SUBTYPE_LEAP_UNSPECIFIED 3 > +\end{lstlisting} > I agree with the above part of your proposal. > I can certainly ensure that 'noon linear' has the same value. I don't > think you need both 'SMEAR' and 'LEAP_UNSPECIFIED' though: > > > +\item VIRTIO_RTC_SUBTYPE_SMEAR deviates from the UTC standard by > + smearing time in the vicinity of the leap second, in a not > + precisely defined manner. This avoids clock steps due to UTC > + leap seconds. > > ... > > +\item VIRTIO_RTC_SUBTYPE_LEAP_UNSPECIFIED may deviate from the UTC > + standard w.r.t.\ leap second introduction in an unspecified > way > + (leap seconds may, or may not, be smeared). > > To the client, both of those just mean "for a day or so around a leap > second event, you can't trust this device to know what the time is". > There isn't any point in separating "does lie to you" from "might lie > to you", surely? The guest can't do anything useful with that > distinction. Let's drop SMEAR and keep only LEAP_UNSPECIFIED? As for VIRTIO_RTC_SUBTYPE_SMEAR, I think this could be dropped indeed (resp., UTC_SLS may be added). But VIRTIO_RTC_CLOCK_SMEARED_UTC is an assurance that there will be no steps (in particular, steps backwards, which some clients might not like) due to leap seconds, while LEAP_UNSPECIFIED provides no such guarantee. So I think this might be better handled by adding, alongside > #define VIRTIO_RTC_CLOCK_SMEARED_UTC 3 #define VIRTIO_RTC_CLOCK_LEAP_UNSPECIFIED_UTC 4 (or any better name, like VIRTIO_RTC_CLOCK_MAYBE_SMEARED_UTC). > > And if you *really* want to parameterise it, I think that's a bad idea > and it encourages the proliferation of different time "standards", but > I'd probably just suck it up and do whatever you do because that's not > strictly within the remit of my live-migration part. I think the above proposal to have subtypes for VIRTIO_RTC_CLOCK_SMEARED_UTC should work.
On Tue, 2024-07-02 at 17:03 +0200, Peter Hilber wrote: > > On 01.07.24 10:57, David Woodhouse wrote: > > > > If my proposed memory structure is subsumed into the virtio-rtc > > > > proposal we'd literally use the same names, but for the time being I'll > > > > update mine to: > > > > Do you intend vmclock and virtio-rtc to be ABI compatible? I intend you to incorporate a shared memory structure like the vmclock one into the virtio-rtc standard :) Because precision clocks in a VM are fundamentally nonsensical without a way for the guest to know when live migration has screwed them up. So yes, so facilitate that, I'm trying to align things so that the numeric values of fields like the time_type and smearing hint are *precisely* the same as the VIRTIO_RTC_xxx values. Time pressure *may* mean I have to ship an ACPI-based device with a preliminary version of the structure before I've finished persuading you, and before we've completely finalized the structure. I am hoping to avoid that. (In fact, my time pressure only applies to a version of the structure which carries the disruption_marker field; the actual clock calibration information doesn't have to be present in the interim implementation.) > > FYI, I see a > > potential problem in that Virtio does avoid the use of signed integers so > > far. I did not check carefully if there might be other problems, yet. Hm, you use an unsigned integer to convey the tai_offset. I suppose at +37 and with a plan to stop doing leap seconds in the next decade, we're unlikely to get back below zero? The other signed integer I had was for the leap second direction, but I think I'm happy to drop that and just adopt your uint8_t leap field with VIRTIO_RTC_LEAP_{PRE_POS,PRE_NEG,etc.}. > > > > > > > > /* > > > > * What time is exposed in the time_sec/time_frac_sec fields? > > > > */ > > > > uint8_t time_type; > > > > #define VMCLOCK_TIME_UTC 0 /* Since 1970-01-01 00:00:00z */ > > > > #define VMCLOCK_TIME_TAI 1 /* Since 1970-01-01 00:00:00z */ > > > > #define VMCLOCK_TIME_MONOTONIC 2 /* Since undefined epoch */ > > > > #define VMCLOCK_TIME_INVALID 3 /* virtio-rtc uses this for smeared UTC */ > > > > > > > > > > > > I can then use your smearing subtype values as the 'hint' field in the > > > > shared memory structure. You currently have: > > > > > > > > +\begin{lstlisting} > > > > +#define VIRTIO_RTC_SUBTYPE_STRICT 0 > > > > +#define VIRTIO_RTC_SUBTYPE_SMEAR 1 > > > > +#define VIRTIO_RTC_SUBTYPE_SMEAR_NOON_LINEAR 2 > > > > +#define VIRTIO_RTC_SUBTYPE_LEAP_UNSPECIFIED 3 > > > > +\end{lstlisting} > > > > > > > > I agree with the above part of your proposal. > > > > > > I can certainly ensure that 'noon linear' has the same value. I don't > > > > think you need both 'SMEAR' and 'LEAP_UNSPECIFIED' though: > > > > > > > > > > > > +\item VIRTIO_RTC_SUBTYPE_SMEAR deviates from the UTC standard by > > > > + smearing time in the vicinity of the leap second, in a not > > > > + precisely defined manner. This avoids clock steps due to UTC > > > > + leap seconds. > > > > > > > > ... > > > > > > > > +\item VIRTIO_RTC_SUBTYPE_LEAP_UNSPECIFIED may deviate from the UTC > > > > + standard w.r.t.\ leap second introduction in an unspecified > > > > way > > > > + (leap seconds may, or may not, be smeared). > > > > > > > > To the client, both of those just mean "for a day or so around a leap > > > > second event, you can't trust this device to know what the time is". > > > > There isn't any point in separating "does lie to you" from "might lie > > > > to you", surely? The guest can't do anything useful with that > > > > distinction. Let's drop SMEAR and keep only LEAP_UNSPECIFIED? > > > > As for VIRTIO_RTC_SUBTYPE_SMEAR, I think this could be dropped indeed > > (resp., UTC_SLS may be added). > > > > But VIRTIO_RTC_CLOCK_SMEARED_UTC is an assurance that there will be no > > steps (in particular, steps backwards, which some clients might not like) > > due to leap seconds, while LEAP_UNSPECIFIED provides no such guarantee. > > > > So I think this might be better handled by adding, alongside > > > > > > #define VIRTIO_RTC_CLOCK_SMEARED_UTC 3 > > > > #define VIRTIO_RTC_CLOCK_LEAP_UNSPECIFIED_UTC 4 > > > > (or any better name, like VIRTIO_RTC_CLOCK_MAYBE_SMEARED_UTC). > > > > > > > > > > And if you *really* want to parameterise it, I think that's a bad idea > > > > and it encourages the proliferation of different time "standards", but > > > > I'd probably just suck it up and do whatever you do because that's not > > > > strictly within the remit of my live-migration part. > > > > I think the above proposal to have subtypes for > > VIRTIO_RTC_CLOCK_SMEARED_UTC should work. To clarify then, the main types are VIRTIO_RTC_CLOCK_UTC == 0 VIRTIO_RTC_CLOCK_TAI == 1 VIRTIO_RTC_CLOCK_MONOTONIC == 2 VIRTIO_RTC_CLOCK_SMEARED_UTC == 3 And the subtypes are *only* for the case of VIRTIO_RTC_CLOCK_SMEARED_UTC. They include VIRTIO_RTC_SUBTYPE_STRICT VIRTIO_RTC_SUBTYPE_UNDEFINED /* or whatever you want to call it */ VIRTIO_RTC_SUBTYPE_SMEAR_NOON_LINEAR VIRTIO_RTC_SUBTYPE_UTC_SLS /* if it's worth doing this one */ Is that what we just agreed on?
On 02.07.24 18:39, David Woodhouse wrote: > On Tue, 2024-07-02 at 17:03 +0200, Peter Hilber wrote: >>> On 01.07.24 10:57, David Woodhouse wrote: >>>>> If my proposed memory structure is subsumed into the virtio-rtc >>>>> proposal we'd literally use the same names, but for the time being I'll >>>>> update mine to: >>> >>> Do you intend vmclock and virtio-rtc to be ABI compatible? > > I intend you to incorporate a shared memory structure like the vmclock > one into the virtio-rtc standard :) > > Because precision clocks in a VM are fundamentally nonsensical without > a way for the guest to know when live migration has screwed them up. > > So yes, so facilitate that, I'm trying to align things so that the > numeric values of fields like the time_type and smearing hint are > *precisely* the same as the VIRTIO_RTC_xxx values. > > Time pressure *may* mean I have to ship an ACPI-based device with a > preliminary version of the structure before I've finished persuading > you, and before we've completely finalized the structure. I am hoping > to avoid that. > > (In fact, my time pressure only applies to a version of the structure > which carries the disruption_marker field; the actual clock calibration > information doesn't have to be present in the interim implementation.) > > >>> FYI, I see a >>> potential problem in that Virtio does avoid the use of signed integers so >>> far. I did not check carefully if there might be other problems, yet. > > Hm, you use an unsigned integer to convey the tai_offset. I suppose at > +37 and with a plan to stop doing leap seconds in the next decade, > we're unlikely to get back below zero? > I think so. > The other signed integer I had was for the leap second direction, but I > think I'm happy to drop that and just adopt your uint8_t leap field > with VIRTIO_RTC_LEAP_{PRE_POS,PRE_NEG,etc.}. > > > > > >>>>> >>>>> /* >>>>> * What time is exposed in the time_sec/time_frac_sec fields? >>>>> */ >>>>> uint8_t time_type; >>>>> #define VMCLOCK_TIME_UTC 0 /* Since 1970-01-01 00:00:00z */ >>>>> #define VMCLOCK_TIME_TAI 1 /* Since 1970-01-01 00:00:00z */ >>>>> #define VMCLOCK_TIME_MONOTONIC 2 /* Since undefined epoch */ >>>>> #define VMCLOCK_TIME_INVALID 3 /* virtio-rtc uses this for smeared UTC */ >>>>> >>>>> >>>>> I can then use your smearing subtype values as the 'hint' field in the >>>>> shared memory structure. You currently have: >>>>> >>>>> +\begin{lstlisting} >>>>> +#define VIRTIO_RTC_SUBTYPE_STRICT 0 >>>>> +#define VIRTIO_RTC_SUBTYPE_SMEAR 1 >>>>> +#define VIRTIO_RTC_SUBTYPE_SMEAR_NOON_LINEAR 2 >>>>> +#define VIRTIO_RTC_SUBTYPE_LEAP_UNSPECIFIED 3 >>>>> +\end{lstlisting} >>>>> >>> >>> I agree with the above part of your proposal. >>> >>>>> I can certainly ensure that 'noon linear' has the same value. I don't >>>>> think you need both 'SMEAR' and 'LEAP_UNSPECIFIED' though: >>>>> >>>>> >>>>> +\item VIRTIO_RTC_SUBTYPE_SMEAR deviates from the UTC standard by >>>>> + smearing time in the vicinity of the leap second, in a not >>>>> + precisely defined manner. This avoids clock steps due to UTC >>>>> + leap seconds. >>>>> >>>>> ... >>>>> >>>>> +\item VIRTIO_RTC_SUBTYPE_LEAP_UNSPECIFIED may deviate from the UTC >>>>> + standard w.r.t.\ leap second introduction in an unspecified >>>>> way >>>>> + (leap seconds may, or may not, be smeared). >>>>> >>>>> To the client, both of those just mean "for a day or so around a leap >>>>> second event, you can't trust this device to know what the time is". >>>>> There isn't any point in separating "does lie to you" from "might lie >>>>> to you", surely? The guest can't do anything useful with that >>>>> distinction. Let's drop SMEAR and keep only LEAP_UNSPECIFIED? >>> >>> As for VIRTIO_RTC_SUBTYPE_SMEAR, I think this could be dropped indeed >>> (resp., UTC_SLS may be added). >>> >>> But VIRTIO_RTC_CLOCK_SMEARED_UTC is an assurance that there will be no >>> steps (in particular, steps backwards, which some clients might not like) >>> due to leap seconds, while LEAP_UNSPECIFIED provides no such guarantee. >>> >>> So I think this might be better handled by adding, alongside >>> >>>>> #define VIRTIO_RTC_CLOCK_SMEARED_UTC 3 >>> >>> #define VIRTIO_RTC_CLOCK_LEAP_UNSPECIFIED_UTC 4 >>> >>> (or any better name, like VIRTIO_RTC_CLOCK_MAYBE_SMEARED_UTC). >>> >>>>> >>>>> And if you *really* want to parameterise it, I think that's a bad idea >>>>> and it encourages the proliferation of different time "standards", but >>>>> I'd probably just suck it up and do whatever you do because that's not >>>>> strictly within the remit of my live-migration part. >>> >>> I think the above proposal to have subtypes for >>> VIRTIO_RTC_CLOCK_SMEARED_UTC should work. > > To clarify then, the main types are > > VIRTIO_RTC_CLOCK_UTC == 0 > VIRTIO_RTC_CLOCK_TAI == 1 > VIRTIO_RTC_CLOCK_MONOTONIC == 2 > VIRTIO_RTC_CLOCK_SMEARED_UTC == 3 > > And the subtypes are *only* for the case of > VIRTIO_RTC_CLOCK_SMEARED_UTC. They include > > VIRTIO_RTC_SUBTYPE_STRICT > VIRTIO_RTC_SUBTYPE_UNDEFINED /* or whatever you want to call it */ > VIRTIO_RTC_SUBTYPE_SMEAR_NOON_LINEAR > VIRTIO_RTC_SUBTYPE_UTC_SLS /* if it's worth doing this one */ > > Is that what we just agreed on? > > This is a misunderstanding. My idea was that the main types are > VIRTIO_RTC_CLOCK_UTC == 0 > VIRTIO_RTC_CLOCK_TAI == 1 > VIRTIO_RTC_CLOCK_MONOTONIC == 2 > VIRTIO_RTC_CLOCK_SMEARED_UTC == 3 VIRTIO_RTC_CLOCK_MAYBE_SMEARED_UTC == 4 The subtypes would be (1st for clocks other than VIRTIO_RTC_CLOCK_SMEARED_UTC, 2nd to last for VIRTIO_RTC_CLOCK_SMEARED_UTC): #define VIRTIO_RTC_SUBTYPE_STRICT 0 #define VIRTIO_RTC_SUBTYPE_SMEAR_NOON_LINEAR 1 #define VIRTIO_RTC_SUBTYPE_SMEAR_UTC_SLS 2
On 2 July 2024 19:12:00 BST, Peter Hilber <peter.hilber@opensynergy.com> wrote: >On 02.07.24 18:39, David Woodhouse wrote: >> To clarify then, the main types are >> >> VIRTIO_RTC_CLOCK_UTC == 0 >> VIRTIO_RTC_CLOCK_TAI == 1 >> VIRTIO_RTC_CLOCK_MONOTONIC == 2 >> VIRTIO_RTC_CLOCK_SMEARED_UTC == 3 >> >> And the subtypes are *only* for the case of >> VIRTIO_RTC_CLOCK_SMEARED_UTC. They include >> >> VIRTIO_RTC_SUBTYPE_STRICT >> VIRTIO_RTC_SUBTYPE_UNDEFINED /* or whatever you want to call it */ >> VIRTIO_RTC_SUBTYPE_SMEAR_NOON_LINEAR >> VIRTIO_RTC_SUBTYPE_UTC_SLS /* if it's worth doing this one */ >> >> Is that what we just agreed on? >> >> > >This is a misunderstanding. My idea was that the main types are > >> VIRTIO_RTC_CLOCK_UTC == 0 >> VIRTIO_RTC_CLOCK_TAI == 1 >> VIRTIO_RTC_CLOCK_MONOTONIC == 2 >> VIRTIO_RTC_CLOCK_SMEARED_UTC == 3 > >VIRTIO_RTC_CLOCK_MAYBE_SMEARED_UTC == 4 > >The subtypes would be (1st for clocks other than >VIRTIO_RTC_CLOCK_SMEARED_UTC, 2nd to last for >VIRTIO_RTC_CLOCK_SMEARED_UTC): > >#define VIRTIO_RTC_SUBTYPE_STRICT 0 >#define VIRTIO_RTC_SUBTYPE_SMEAR_NOON_LINEAR 1 >#define VIRTIO_RTC_SUBTYPE_SMEAR_UTC_SLS 2 > Thanks. I really do think that from the guest point of view there's really no distinction between "maybe smeared" and "undefined smearing", and have a preference for using the latter form, which is the key difference there? Again though, not a hill for me to die on.
On Mon, 2024-07-01 at 08:39 -0700, Kees Cook wrote: > On Mon, Jul 01, 2024 at 09:02:38AM +0100, David Woodhouse wrote: > > On Sun, 2024-06-30 at 14:28 +0100, Simon Horman wrote: > > > > > > W=1 allmodconfig builds with gcc-13 flag the following. > > > Reading the documentation of strncpy() in fortify-string.h, > > > I wonder if strscpy() would be more appropriate in this case. > > > > It's never going to matter in practice, as the buffer is 32 bytes, so > > "vmclock%d" would never get to the end even if was a 64-bit integer. > > > > But as a matter of hygiene and best practice, yes strscpy() would be > > more appropriate. Fixed in git, thanks. > > Thanks! You can even use the 2-argument version. :) 2-argument version? ... oh, that's *special*. I don't know whether to be impressed or disturbed. Perhaps a little bit of both. Done; thanks.
On 02.07.24 20:40, David Woodhouse wrote: > On 2 July 2024 19:12:00 BST, Peter Hilber <peter.hilber@opensynergy.com> wrote: >> On 02.07.24 18:39, David Woodhouse wrote: >>> To clarify then, the main types are >>> >>> VIRTIO_RTC_CLOCK_UTC == 0 >>> VIRTIO_RTC_CLOCK_TAI == 1 >>> VIRTIO_RTC_CLOCK_MONOTONIC == 2 >>> VIRTIO_RTC_CLOCK_SMEARED_UTC == 3 >>> >>> And the subtypes are *only* for the case of >>> VIRTIO_RTC_CLOCK_SMEARED_UTC. They include >>> >>> VIRTIO_RTC_SUBTYPE_STRICT >>> VIRTIO_RTC_SUBTYPE_UNDEFINED /* or whatever you want to call it */ >>> VIRTIO_RTC_SUBTYPE_SMEAR_NOON_LINEAR >>> VIRTIO_RTC_SUBTYPE_UTC_SLS /* if it's worth doing this one */ >>> >>> Is that what we just agreed on? >>> >>> >> >> This is a misunderstanding. My idea was that the main types are >> >>> VIRTIO_RTC_CLOCK_UTC == 0 >>> VIRTIO_RTC_CLOCK_TAI == 1 >>> VIRTIO_RTC_CLOCK_MONOTONIC == 2 >>> VIRTIO_RTC_CLOCK_SMEARED_UTC == 3 >> >> VIRTIO_RTC_CLOCK_MAYBE_SMEARED_UTC == 4 >> >> The subtypes would be (1st for clocks other than >> VIRTIO_RTC_CLOCK_SMEARED_UTC, 2nd to last for >> VIRTIO_RTC_CLOCK_SMEARED_UTC): >> >> #define VIRTIO_RTC_SUBTYPE_STRICT 0 >> #define VIRTIO_RTC_SUBTYPE_SMEAR_NOON_LINEAR 1 >> #define VIRTIO_RTC_SUBTYPE_SMEAR_UTC_SLS 2 >> > > Thanks. I really do think that from the guest point of view there's really no distinction between "maybe smeared" and "undefined smearing", and have a preference for using the latter form, which is the key difference there? > > Again though, not a hill for me to die on. I have no issue with staying with "undefined smearing", so would you agree to something like VIRTIO_RTC_CLOCK_SMEAR_UNDEFINED_UTC == 4 (or another name if you prefer)?
On Wed, 2024-07-03 at 11:56 +0200, Peter Hilber wrote: > On 02.07.24 20:40, David Woodhouse wrote: > > On 2 July 2024 19:12:00 BST, Peter Hilber <peter.hilber@opensynergy.com> wrote: > > > On 02.07.24 18:39, David Woodhouse wrote: > > > > To clarify then, the main types are > > > > > > > > VIRTIO_RTC_CLOCK_UTC == 0 > > > > VIRTIO_RTC_CLOCK_TAI == 1 > > > > VIRTIO_RTC_CLOCK_MONOTONIC == 2 > > > > VIRTIO_RTC_CLOCK_SMEARED_UTC == 3 > > > > > > > > And the subtypes are *only* for the case of > > > > VIRTIO_RTC_CLOCK_SMEARED_UTC. They include > > > > > > > > VIRTIO_RTC_SUBTYPE_STRICT > > > > VIRTIO_RTC_SUBTYPE_UNDEFINED /* or whatever you want to call it */ > > > > VIRTIO_RTC_SUBTYPE_SMEAR_NOON_LINEAR > > > > VIRTIO_RTC_SUBTYPE_UTC_SLS /* if it's worth doing this one */ > > > > > > > > Is that what we just agreed on? > > > > > > > > > > > > > > This is a misunderstanding. My idea was that the main types are > > > > > > > VIRTIO_RTC_CLOCK_UTC == 0 > > > > VIRTIO_RTC_CLOCK_TAI == 1 > > > > VIRTIO_RTC_CLOCK_MONOTONIC == 2 > > > > VIRTIO_RTC_CLOCK_SMEARED_UTC == 3 > > > > > > VIRTIO_RTC_CLOCK_MAYBE_SMEARED_UTC == 4 > > > > > > The subtypes would be (1st for clocks other than > > > VIRTIO_RTC_CLOCK_SMEARED_UTC, 2nd to last for > > > VIRTIO_RTC_CLOCK_SMEARED_UTC): > > > > > > #define VIRTIO_RTC_SUBTYPE_STRICT 0 > > > #define VIRTIO_RTC_SUBTYPE_SMEAR_NOON_LINEAR 1 > > > #define VIRTIO_RTC_SUBTYPE_SMEAR_UTC_SLS 2 > > > > > > > Thanks. I really do think that from the guest point of view there's > > really no distinction between "maybe smeared" and "undefined > > smearing", and have a preference for using the latter form, which > > is the key difference there? > > > > Again though, not a hill for me to die on. > > I have no issue with staying with "undefined smearing", so would you agree > to something like > > VIRTIO_RTC_CLOCK_SMEAR_UNDEFINED_UTC == 4 > > (or another name if you prefer)? Well, the point of contention was really whether that was a *type* or a *subtype*. Either way, it's a "precision clock" telling its consumer that the device *itself* doesn't really know what time is being exposed. Which seems like a bizarre thing to support. But I think I've constructed an argument which persuades me to your point of view that *if* we permit it, it should be a primary type... A clock can *either* be UTC, *or* it can be monotonic. The whole point of smearing is to produce a monotonic clock, of course. VIRTIO_RTC_CLOCK_UTC is UTC. It is not monotonic. VIRTIO_RTC_CLOCK_SMEARED is, presumably, monotonic (and I think we should explicitly require that to be true in virtio-rtc). But VIRTIO_RTC_CLOCK_MAYBE_SMEARED is the worst of both worlds. It is neither known to be correct UTC, *nor* is it known to be monotonic. So (again, if we permit it at all) I think it probably does make sense for that to be a primary type. This is what I currently have for 'struct vmclock_abi' that I'd like to persuade you to adopt. I need to tweak it some more, for at least the following reasons, as well as any more you can see: • size isn't big enough for 64KiB pages • Should be explicitly little-endian • Does it need esterror as well as maxerror? • Why is maxerror in picoseconds? It's the only use of that unit • Where do the clock_status values come from? Do they make sense? • Are signed integers OK? (I think so!). /* * This structure provides a vDSO-style clock to VM guests, exposing the * relationship (or lack thereof) between the CPU clock (TSC, timebase, arch * counter, etc.) and real time. It is designed to address the problem of * live migration, which other clock enlightenments do not. * * When a guest is live migrated, this affects the clock in two ways. * * First, even between identical hosts the actual frequency of the underlying * counter will change within the tolerances of its specification (typically * ±50PPM, or 4 seconds a day). This frequency also varies over time on the * same host, but can be tracked by NTP as it generally varies slowly. With * live migration there is a step change in the frequency, with no warning. * * Second, there may be a step change in the value of the counter itself, as * its accuracy is limited by the precision of the NTP synchronization on the * source and destination hosts. * * So any calibration (NTP, PTP, etc.) which the guest has done on the source * host before migration is invalid, and needs to be redone on the new host. * * In its most basic mode, this structure provides only an indication to the * guest that live migration has occurred. This allows the guest to know that * its clock is invalid and take remedial action. For applications that need * reliable accurate timestamps (e.g. distributed databases), the structure * can be mapped all the way to userspace. This allows the application to see * directly for itself that the clock is disrupted and take appropriate * action, even when using a vDSO-style method to get the time instead of a * system call. * * In its more advanced mode. this structure can also be used to expose the * precise relationship of the CPU counter to real time, as calibrated by the * host. This means that userspace applications can have accurate time * immediately after live migration, rather than having to pause operations * and wait for NTP to recover. This mode does, of course, rely on the * counter being reliable and consistent across CPUs. * * Note that this must be true UTC, never with smeared leap seconds. If a * guest wishes to construct a smeared clock, it can do so. Presenting a * smeared clock through this interface would be problematic because it * actually messes with the apparent counter *period*. A linear smearing * of 1 ms per second would effectively tweak the counter period by 1000PPM * at the start/end of the smearing period, while a sinusoidal smear would * basically be impossible to represent. * * This structure is offered with the intent that it be adopted into the * nascent virtio-rtc standard, as a virtio-rtc that does not address the live * migration problem seems a little less than fit for purpose. For that * reason, certain fields use precisely the same numeric definitions as in * the virtio-rtc proposal. The structure can also be exposed through an ACPI * device with the CID "VMCLOCK", modelled on the "VMGENID" device except for * the fact that it uses a real _CRS to convey the address of the structure * (which should be a full page, to allow for mapping directly to userspace). */ #ifndef __VMCLOCK_ABI_H__ #define __VMCLOCK_ABI_H__ #ifdef __KERNEL__ #include <linux/types.h> #else #include <stdint.h> #endif struct vmclock_abi { uint64_t magic; #define VMCLOCK_MAGIC 0x4b4c4356 /* "VCLK" */ uint16_t size; /* Size of page containing this structure */ uint16_t version; /* 1 */ /* Sequence lock. Low bit means an update is in progress. */ uint32_t seq_count; uint32_t flags; /* Indicates that the tai_offset_sec field is valid */ #define VMCLOCK_FLAG_TAI_OFFSET_VALID (1 << 0) /* * Optionally used to notify guests of pending maintenance events. * A guest may wish to remove itself from service if an event is * coming up. Two flags indicate the rough imminence of the event. */ #define VMCLOCK_FLAG_DISRUPTION_SOON (1 << 1) /* About a day */ #define VMCLOCK_FLAG_DISRUPTION_IMMINENT (1 << 2) /* About an hour */ /* Indicates that the utc_time_maxerror_picosec field is valid */ #define VMCLOCK_FLAG_UTC_MAXERROR_VALID (1 << 3) /* Indicates counter_period_error_rate_frac_sec is valid */ #define VMCLOCK_FLAG_PERIOD_ERROR_VALID (1 << 4) /* * This field changes to another non-repeating value when the CPU * counter is disrupted, for example on live migration. This lets * the guest know that it should discard any calibration it has * performed of the counter against external sources (NTP/PTP/etc.). */ uint64_t disruption_marker; uint8_t clock_status; #define VMCLOCK_STATUS_UNKNOWN 0 #define VMCLOCK_STATUS_INITIALIZING 1 #define VMCLOCK_STATUS_SYNCHRONIZED 2 #define VMCLOCK_STATUS_FREERUNNING 3 #define VMCLOCK_STATUS_UNRELIABLE 4 uint8_t counter_id; /* Matches VIRTIO_RTC_COUNTER_xxx */ #define VMCLOCK_COUNTER_ARM_VCNT 0 #define VMCLOCK_COUNTER_X86_TSC 1 #define VMCLOCK_COUNTER_INVALID 0xff /* * By providing the offset from UTC to TAI, the guest can know both * UTC and TAI reliably, whichever is indicated in the time_type * field. Valid if VMCLOCK_FLAG_TAI_OFFSET_VALID is set in flags. */ int16_t tai_offset_sec; /* * What time is exposed in the time_sec/time_frac_sec fields? */ uint8_t time_type; /* Matches VIRTIO_RTC_TYPE_xxx */ #define VMCLOCK_TIME_UTC 0 /* Since 1970-01-01 00:00:00z */ #define VMCLOCK_TIME_TAI 1 /* Since 1970-01-01 00:00:00z */ #define VMCLOCK_TIME_MONOTONIC 2 /* Since undefined epoch */ #define VMCLOCK_TIME_INVALID_SMEARED 3 /* Not supported */ #define VMCLOCK_TIME_INVALID_MAYBE_SMEARED 4 /* Not supported */ /* * The time exposed through this device is never smeared. This field * corresponds to the 'subtype' field in virtio-rtc, which indicates * the smearing method. However in this case it provides a *hint* to * the guest operating system, such that *if* the guest OS wants to * provide its users with an alternative clock which does not follow * the POSIX CLOCK_REALTIME standard, it may do so in a fashion * consistent with the other systems in the nearby environment. */ uint8_t leap_second_smearing_hint; /* Matches VIRTIO_RTC_SUBTYPE_xxx */ #define VMCLOCK_SMEARING_STRICT 0 #define VMCLOCK_SMEARING_NOON_LINEAR 1 #define VMCLOCK_SMEARING_UTC_SLS 2 /* Bit shift for counter_period_frac_sec and its error rate */ uint8_t counter_period_shift; /* * Unlike in NTP, this can indicate a leap second in the past. This * is needed to allow guests to derive an imprecise clock with * smeared leap seconds for themselves, as some modes of smearing * need the adjustments to continue even after the moment at which * the leap second should have occurred. */ uint8_t leap_indicator; /* Matches VIRTIO_RTC_LEAP_xxx */ #define VMCLOCK_LEAP_NONE 0 #define VMCLOCK_LEAP_PRE_POS 1 #define VMCLOCK_LEAP_PRE_NEG 2 #define VMCLOCK_LEAP_POS 3 #define VMCLOCK_LEAP_NEG 4 uint64_t leapsecond_tai_sec; /* Since 1970-01-01 00:00:00z */ /* * Paired values of counter and UTC at a given point in time. */ uint64_t counter_value; uint64_t time_sec; uint64_t time_frac_sec; /* * Counter frequency, and error margin. The unit of these fields is * seconds >> (64 + counter_period_shift) */ uint64_t counter_period_frac_sec; uint64_t counter_period_error_rate_frac_sec; /* Error margin of UTC reading above (± picoseconds) */ uint64_t utc_time_maxerror_picosec; }; #endif /* __VMCLOCK_ABI_H__ */
On 03.07.24 12:40, David Woodhouse wrote: [...] > > > This is what I currently have for 'struct vmclock_abi' that I'd like to > persuade you to adopt. I need to tweak it some more, for at least the > following reasons, as well as any more you can see: > > • size isn't big enough for 64KiB pages > • Should be explicitly little-endian > • Does it need esterror as well as maxerror? I have no opinion about this. I can drop esterror if unwanted. > • Why is maxerror in picoseconds? It's the only use of that unit > • Where do the clock_status values come from? Do they make sense? > • Are signed integers OK? (I think so!). Signed integers would need to be introduced to Virtio, which so far only uses explicitly unsigned types: u8, le16 etc.
On Fri, 2024-07-05 at 10:12 +0200, Peter Hilber wrote: > On 03.07.24 12:40, David Woodhouse wrote: > > [...] > > > > > > > This is what I currently have for 'struct vmclock_abi' that I'd like to > > persuade you to adopt. I need to tweak it some more, for at least the > > following reasons, as well as any more you can see: > > > > • size isn't big enough for 64KiB pages > > • Should be explicitly little-endian > > • Does it need esterror as well as maxerror? > > I have no opinion about this. I can drop esterror if unwanted. I also don't care. I'm just observing the inconsistency. > > • Why is maxerror in picoseconds? It's the only use of that unit Between us we now have picoseconds, nanoseconds, (seconds >> 64) and (seconds >> 64+n). The power-of-two fractions seem to make a lot of sense for the counter period, because they mean we don't have to perform divisions. Does it makes sense to harmonise on (seconds >> 64) for all of the fractional seconds? Again I don't have a strong opinion; I only want us to have a *reason* for any differences that exist. > > • Where do the clock_status values come from? Do they make sense? > > • Are signed integers OK? (I think so!). > > Signed integers would need to be introduced to Virtio, which so far only > uses explicitly unsigned types: u8, le16 etc. Perhaps. Although it would also be possible (if not ideal) to define that e.g. the tai_offset field is a 16-bit "unsigned" integer according to virtio, but to be interpreted as follows: If the number is <= 32767 then the TAI offset is that value, but if the number is >= 32768 then the TAI offset is that value minus 65536. Perhaps not pretty, but there isn't a *fundamental* dependency on virtio supporting signed integers as a primary type.
On 05.07.24 17:02, David Woodhouse wrote: > On Fri, 2024-07-05 at 10:12 +0200, Peter Hilber wrote: >> On 03.07.24 12:40, David Woodhouse wrote: [...] >>> • Why is maxerror in picoseconds? It's the only use of that unit > > Between us we now have picoseconds, nanoseconds, (seconds >> 64) and > (seconds >> 64+n). > > The power-of-two fractions seem to make a lot of sense for the counter > period, because they mean we don't have to perform divisions. > > Does it makes sense to harmonise on (seconds >> 64) for all of the > fractional seconds? Again I don't have a strong opinion; I only want us > to have a *reason* for any differences that exist. > I don't have the expertise with fixed-point arithmetic to judge if this would become unwieldy. I selected ns for the virtio-rtc drafts so far because that didn't have any impact on the precision with the Linux kernel driver message-based use cases, but that would be different for SHM in my understanding. So I would tend to retain ns for convenience for messages (where it doesn't impact precision) but do not have any preference for SHM. >>> • Where do the clock_status values come from? Do they make sense? >>> • Are signed integers OK? (I think so!). >> >> Signed integers would need to be introduced to Virtio, which so far only >> uses explicitly unsigned types: u8, le16 etc. > > Perhaps. Although it would also be possible (if not ideal) to define > that e.g. the tai_offset field is a 16-bit "unsigned" integer according > to virtio, but to be interpreted as follows: > > If the number is <= 32767 then the TAI offset is that value, but if the > number is >= 32768 then the TAI offset is that value minus 65536. > > Perhaps not pretty, but there isn't a *fundamental* dependency on > virtio supporting signed integers as a primary type. > Agreed.
diff --git a/drivers/ptp/Kconfig b/drivers/ptp/Kconfig index 604541dcb320..e98c9767e0ef 100644 --- a/drivers/ptp/Kconfig +++ b/drivers/ptp/Kconfig @@ -131,6 +131,19 @@ config PTP_1588_CLOCK_KVM To compile this driver as a module, choose M here: the module will be called ptp_kvm. +config PTP_1588_CLOCK_VMCLOCK + tristate "Virtual machine PTP clock" + depends on X86_TSC || ARM_ARCH_TIMER + depends on PTP_1588_CLOCK && ACPI && ARCH_SUPPORTS_INT128 + default y + help + This driver adds support for using a virtual precision clock + advertised by the hypervisor. This clock is only useful in virtual + machines where such a device is present. + + To compile this driver as a module, choose M here: the module + will be called ptp_vmclock. + config PTP_1588_CLOCK_IDT82P33 tristate "IDT 82P33xxx PTP clock" depends on PTP_1588_CLOCK && I2C diff --git a/drivers/ptp/Makefile b/drivers/ptp/Makefile index 68bf02078053..01b5cd91eb61 100644 --- a/drivers/ptp/Makefile +++ b/drivers/ptp/Makefile @@ -11,6 +11,7 @@ obj-$(CONFIG_PTP_1588_CLOCK_DTE) += ptp_dte.o obj-$(CONFIG_PTP_1588_CLOCK_INES) += ptp_ines.o obj-$(CONFIG_PTP_1588_CLOCK_PCH) += ptp_pch.o obj-$(CONFIG_PTP_1588_CLOCK_KVM) += ptp_kvm.o +obj-$(CONFIG_PTP_1588_CLOCK_VMCLOCK) += ptp_vmclock.o obj-$(CONFIG_PTP_1588_CLOCK_QORIQ) += ptp-qoriq.o ptp-qoriq-y += ptp_qoriq.o ptp-qoriq-$(CONFIG_DEBUG_FS) += ptp_qoriq_debugfs.o diff --git a/drivers/ptp/ptp_vmclock.c b/drivers/ptp/ptp_vmclock.c new file mode 100644 index 000000000000..e19c2eed8009 --- /dev/null +++ b/drivers/ptp/ptp_vmclock.c @@ -0,0 +1,516 @@ +// SPDX-License-Identifier: GPL-2.0-or-later +/* + * Virtual PTP 1588 clock for use with LM-safe VMclock device. + * + * Copyright © 2024 Amazon.com, Inc. or its affiliates. + */ + +#include <linux/device.h> +#include <linux/err.h> +#include <linux/init.h> +#include <linux/kernel.h> +#include <linux/slab.h> +#include <linux/module.h> +#include <linux/platform_device.h> +#include <linux/miscdevice.h> +#include <linux/acpi.h> +#include <uapi/linux/vmclock.h> + +#include <linux/ptp_clock_kernel.h> + +#ifdef CONFIG_X86 +#include <asm/pvclock.h> +#include <asm/kvmclock.h> +#endif + +static DEFINE_IDA(vmclock_ida); + +ACPI_MODULE_NAME("vmclock"); + +struct vmclock_state { + phys_addr_t phys_addr; + struct vmclock_abi *clk; + struct miscdevice miscdev; + struct ptp_clock_info ptp_clock_info; + struct ptp_clock *ptp_clock; + enum clocksource_ids cs_id, sys_cs_id; + int index; + char *name; +}; + +#define VMCLOCK_MAX_WAIT ms_to_ktime(100) + +/* + * Multiply a 64-bit count by a 64-bit tick 'period' in units of seconds >> 64 + * and add the fractional second part of the reference time. + * + * The result is a 128-bit value, the top 64 bits of which are seconds, and + * the low 64 bits are (seconds >> 64). + * + * If __int128 isn't available, perform the calculation 32 bits at a time to + * avoid overflow. + */ +static inline uint64_t mul_u64_u64_shr_add_u64(uint64_t *res_hi, uint64_t delta, + uint64_t period, uint8_t shift, + uint64_t frac_sec) +{ + unsigned __int128 res = (unsigned __int128)delta * period; + + res >>= shift; + res += frac_sec; + *res_hi = res >> 64; + return (uint64_t)res; +} + +static int vmclock_get_crosststamp(struct vmclock_state *st, + struct ptp_system_timestamp *sts, + struct system_counterval_t *system_counter, + struct timespec64 *tspec) +{ + ktime_t deadline = ktime_add(ktime_get(), VMCLOCK_MAX_WAIT); + struct system_time_snapshot systime_snapshot; + uint64_t cycle, delta, seq, frac_sec; + +#ifdef CONFIG_X86 + /* + * We'd expect the hypervisor to know this and to report the clock + * status as VMCLOCK_STATUS_UNRELIABLE. But be paranoid. + */ + if (check_tsc_unstable()) + return -EINVAL; +#endif + + while (1) { + seq = st->clk->seq_count & ~1ULL; + virt_rmb(); + + if (st->clk->clock_status == VMCLOCK_STATUS_UNRELIABLE) + return -EINVAL; + + /* + * When invoked for gettimex64(), fill in the pre/post system + * times. The simple case is when system time is based on the + * same counter as st->cs_id, in which case all three times + * will be derived from the *same* counter value. + * + * If the system isn't using the same counter, then the value + * from ktime_get_snapshot() will still be used as pre_ts, and + * ptp_read_system_postts() is called to populate postts after + * calling get_cycles(). + * + * The conversion to timespec64 happens further down, outside + * the seq_count loop. + */ + if (sts) { + ktime_get_snapshot(&systime_snapshot); + if (systime_snapshot.cs_id == st->cs_id) { + cycle = systime_snapshot.cycles; + } else { + cycle = get_cycles(); + ptp_read_system_postts(sts); + } + } else + cycle = get_cycles(); + + delta = cycle - st->clk->counter_value; + + frac_sec = mul_u64_u64_shr_add_u64(&tspec->tv_sec, delta, + st->clk->counter_period_frac_sec, + st->clk->counter_period_shift, + st->clk->utc_time_frac_sec); + tspec->tv_nsec = mul_u64_u64_shr(frac_sec, NSEC_PER_SEC, 64); + tspec->tv_sec += st->clk->utc_time_sec; + + virt_rmb(); + if (seq == st->clk->seq_count) + break; + + if (ktime_after(ktime_get(), deadline)) + return -ETIMEDOUT; + } + + if (system_counter) { + system_counter->cycles = cycle; + system_counter->cs_id = st->cs_id; + } + + if (sts) { + sts->pre_ts = ktime_to_timespec64(systime_snapshot.real); + if (systime_snapshot.cs_id == st->cs_id) + sts->post_ts = sts->pre_ts; + } + + return 0; +} + +#ifdef CONFIG_X86 +/* + * In the case where the system is using the KVM clock for timekeeping, convert + * the TSC value into a KVM clock time in order to return a paired reading that + * get_device_system_crosststamp() can cope with. + */ +static int vmclock_get_crosststamp_kvmclock(struct vmclock_state *st, + struct ptp_system_timestamp *sts, + struct system_counterval_t *system_counter, + struct timespec64 *tspec) +{ + struct pvclock_vcpu_time_info *pvti = this_cpu_pvti(); + unsigned pvti_ver; + int ret; + + preempt_disable_notrace(); + + do { + pvti_ver = pvclock_read_begin(pvti); + + ret = vmclock_get_crosststamp(st, sts, system_counter, tspec); + if (ret) + break; + + system_counter->cycles = __pvclock_read_cycles(pvti, + system_counter->cycles); + system_counter->cs_id = CSID_X86_KVM_CLK; + + /* + * This retry should never really happen; if the TSC is + * stable and reliable enough across vCPUS that it is sane + * for the hypervisor to expose a VMCLOCK device which uses + * it as the reference counter, then the KVM clock sohuld be + * in 'master clock mode' and basically never changed. But + * the KVM clock is a fickle and often broken thing, so do + * it "properly" just in case. + */ + } while (pvclock_read_retry(pvti, pvti_ver)); + + preempt_enable_notrace(); + + return ret; +} +#endif + +static int ptp_vmclock_get_time_fn(ktime_t *device_time, + struct system_counterval_t *system_counter, + void *ctx) +{ + struct vmclock_state *st = ctx; + struct timespec64 tspec; + int ret; + +#ifdef CONFIG_X86 + if (READ_ONCE(st->sys_cs_id) == CSID_X86_KVM_CLK) + ret = vmclock_get_crosststamp_kvmclock(st, NULL, system_counter, + &tspec); + else +#endif + ret = vmclock_get_crosststamp(st, NULL, system_counter, &tspec); + + if (!ret) + *device_time = timespec64_to_ktime(tspec); + + return ret; +} + + +static int ptp_vmclock_getcrosststamp(struct ptp_clock_info *ptp, + struct system_device_crosststamp *xtstamp) +{ + struct vmclock_state *st = container_of(ptp, struct vmclock_state, + ptp_clock_info); + int ret = get_device_system_crosststamp(ptp_vmclock_get_time_fn, st, + NULL, xtstamp); +#ifdef CONFIG_X86 + /* + * On x86, the KVM clock may be used for the system time. We can + * actually convert a TSC reading to that, and return a paired + * timestamp that get_device_system_crosststamp() *can* handle. + */ + if (ret == -ENODEV) { + struct system_time_snapshot systime_snapshot; + ktime_get_snapshot(&systime_snapshot); + + if (systime_snapshot.cs_id == CSID_X86_TSC || + systime_snapshot.cs_id == CSID_X86_KVM_CLK) { + WRITE_ONCE(st->sys_cs_id, systime_snapshot.cs_id); + ret = get_device_system_crosststamp(ptp_vmclock_get_time_fn, + st, NULL, xtstamp); + } + } +#endif + return ret; +} + +/* + * PTP clock operations + */ + +static int ptp_vmclock_adjfine(struct ptp_clock_info *ptp, long delta) +{ + return -EOPNOTSUPP; +} + +static int ptp_vmclock_adjtime(struct ptp_clock_info *ptp, s64 delta) +{ + return -EOPNOTSUPP; +} + +static int ptp_vmclock_settime(struct ptp_clock_info *ptp, + const struct timespec64 *ts) +{ + return -EOPNOTSUPP; +} + +static int ptp_vmclock_gettime(struct ptp_clock_info *ptp, struct timespec64 *ts) +{ + struct vmclock_state *st = container_of(ptp, struct vmclock_state, + ptp_clock_info); + + return vmclock_get_crosststamp(st, NULL, NULL, ts); +} + +static int ptp_vmclock_gettimex(struct ptp_clock_info *ptp, struct timespec64 *ts, + struct ptp_system_timestamp *sts) +{ + struct vmclock_state *st = container_of(ptp, struct vmclock_state, + ptp_clock_info); + + return vmclock_get_crosststamp(st, sts, NULL, ts); +} + +static int ptp_vmclock_enable(struct ptp_clock_info *ptp, + struct ptp_clock_request *rq, int on) +{ + return -EOPNOTSUPP; +} + +static const struct ptp_clock_info ptp_vmclock_info = { + .owner = THIS_MODULE, + .max_adj = 0, + .n_ext_ts = 0, + .n_pins = 0, + .pps = 0, + .adjfine = ptp_vmclock_adjfine, + .adjtime = ptp_vmclock_adjtime, + .gettime64 = ptp_vmclock_gettime, + .gettimex64 = ptp_vmclock_gettimex, + .settime64 = ptp_vmclock_settime, + .enable = ptp_vmclock_enable, + .getcrosststamp = ptp_vmclock_getcrosststamp, +}; + +static int vmclock_miscdev_mmap(struct file *fp, struct vm_area_struct *vma) +{ + struct vmclock_state *st = container_of(fp->private_data, + struct vmclock_state, miscdev); + + if ((vma->vm_flags & (VM_READ|VM_WRITE)) != VM_READ) + return -EROFS; + + if (vma->vm_end - vma->vm_start != PAGE_SIZE || vma->vm_pgoff) + return -EINVAL; + + if (io_remap_pfn_range(vma, vma->vm_start, + st->phys_addr >> PAGE_SHIFT, PAGE_SIZE, + vma->vm_page_prot)) + return -EAGAIN; + + return 0; +} + +static ssize_t vmclock_miscdev_read(struct file *fp, char __user *buf, + size_t count, loff_t *ppos) +{ + struct vmclock_state *st = container_of(fp->private_data, + struct vmclock_state, miscdev); + ktime_t deadline = ktime_add(ktime_get(), VMCLOCK_MAX_WAIT); + size_t max_count; + int32_t seq; + + if (*ppos >= PAGE_SIZE) + return 0; + + max_count = PAGE_SIZE - *ppos; + if (count > max_count) + count = max_count; + + while (1) { + seq = st->clk->seq_count & ~1ULL; + virt_rmb(); + + if (copy_to_user(buf, ((char *)st->clk) + *ppos, count)) + return -EFAULT; + + virt_rmb(); + if (seq == st->clk->seq_count) + break; + + if (ktime_after(ktime_get(), deadline)) + return -ETIMEDOUT; + } + + *ppos += count; + return count; +} + +static const struct file_operations vmclock_miscdev_fops = { + .mmap = vmclock_miscdev_mmap, + .read = vmclock_miscdev_read, +}; + +/* module operations */ + +static void vmclock_remove(struct platform_device *pdev) +{ + struct device *dev = &pdev->dev; + struct vmclock_state *st = dev_get_drvdata(dev); + + if (st->ptp_clock) + ptp_clock_unregister(st->ptp_clock); + + if (st->miscdev.minor == MISC_DYNAMIC_MINOR) + misc_deregister(&st->miscdev); +} + +static int vmclock_probe_acpi(struct device *dev, struct vmclock_state *st) +{ + struct acpi_buffer parsed = { ACPI_ALLOCATE_BUFFER }; + struct acpi_device *adev = ACPI_COMPANION(dev); + union acpi_object *obj; + acpi_status status; + + status = acpi_evaluate_object(adev->handle, "ADDR", NULL, &parsed); + if (ACPI_FAILURE(status)) { + ACPI_EXCEPTION((AE_INFO, status, "Evaluating ADDR")); + return -ENODEV; + } + obj = parsed.pointer; + if (!obj || obj->type != ACPI_TYPE_PACKAGE || obj->package.count != 2 || + obj->package.elements[0].type != ACPI_TYPE_INTEGER || + obj->package.elements[1].type != ACPI_TYPE_INTEGER) + return -EINVAL; + + st->phys_addr = (obj->package.elements[0].integer.value << 0) | + (obj->package.elements[1].integer.value << 32); + + return 0; +} + +static void vmclock_put_idx(void *data) +{ + struct vmclock_state *st = data; + + ida_free(&vmclock_ida, st->index); +} + +static int vmclock_probe(struct platform_device *pdev) +{ + struct device *dev = &pdev->dev; + struct vmclock_state *st; + int ret; + + st = devm_kzalloc(dev, sizeof (*st), GFP_KERNEL); + if (!st) + return -ENOMEM; + + if (has_acpi_companion(dev)) + ret = vmclock_probe_acpi(dev, st); + else + ret = -EINVAL; /* Only ACPI for now */ + + if (ret) { + dev_info(dev, "Failed to obtain physical address: %d\n", ret); + goto out; + } + + st->clk = devm_memremap(dev, st->phys_addr, sizeof(*st->clk), + MEMREMAP_WB); + if (IS_ERR(st->clk)) { + ret = PTR_ERR(st->clk); + dev_info(dev, "failed to map shared memory\n"); + st->clk = NULL; + goto out; + } + + if (st->clk->magic != VMCLOCK_MAGIC || + st->clk->size < sizeof(*st->clk) || + st->clk->version != 1) { + dev_info(dev, "vmclock magic fields invalid\n"); + ret = -EINVAL; + goto out; + } + + if (IS_ENABLED(CONFIG_ARM64) && + st->clk->counter_id == VMCLOCK_COUNTER_ARM_VCNT) { + /* Can we check it's the virtual counter? */ + st->cs_id = CSID_ARM_ARCH_COUNTER; + } else if (IS_ENABLED(CONFIG_X86) && + st->clk->counter_id == VMCLOCK_COUNTER_X86_TSC) { + st->cs_id = CSID_X86_TSC; + } + st->sys_cs_id = st->cs_id; + + ret = ida_alloc(&vmclock_ida, GFP_KERNEL); + if (ret < 0) + goto out; + + st->index = ret; + ret = devm_add_action_or_reset(&pdev->dev, vmclock_put_idx, st); + if (ret) + goto out; + + st->name = devm_kasprintf(&pdev->dev, GFP_KERNEL, "vmclock%d", st->index); + if (!st->name) { + ret = -ENOMEM; + goto out; + } + + /* If the structure is big enough, it can be mapped to userspace */ + if (st->clk->size >= PAGE_SIZE) { + st->miscdev.minor = MISC_DYNAMIC_MINOR; + st->miscdev.fops = &vmclock_miscdev_fops; + st->miscdev.name = st->name; + + ret = misc_register(&st->miscdev); + if (ret) + goto out; + } + + /* If there is valid clock information, register a PTP clock */ + if (st->cs_id) { + st->ptp_clock_info = ptp_vmclock_info; + strncpy(st->ptp_clock_info.name, st->name, sizeof(st->ptp_clock_info.name)); + st->ptp_clock = ptp_clock_register(&st->ptp_clock_info, dev); + + if (IS_ERR(st->ptp_clock)) { + ret = PTR_ERR(st->ptp_clock); + st->ptp_clock = NULL; + vmclock_remove(pdev); + goto out; + } + } + + dev_set_drvdata(dev, st); + + out: + return ret; +} + +static const struct acpi_device_id vmclock_acpi_ids[] = { + { "VMCLOCK", 0 }, + {} +}; +MODULE_DEVICE_TABLE(acpi, vmclock_acpi_ids); + +static struct platform_driver vmclock_platform_driver = { + .probe = vmclock_probe, + .remove_new = vmclock_remove, + .driver = { + .name = "vmclock", + .acpi_match_table = vmclock_acpi_ids, + }, +}; + +module_platform_driver(vmclock_platform_driver) + +MODULE_AUTHOR("David Woodhouse <dwmw2@infradead.org>"); +MODULE_DESCRIPTION("PTP clock using VMCLOCK"); +MODULE_LICENSE("GPL v2"); diff --git a/include/uapi/linux/vmclock.h b/include/uapi/linux/vmclock.h new file mode 100644 index 000000000000..cf0f22205e79 --- /dev/null +++ b/include/uapi/linux/vmclock.h @@ -0,0 +1,138 @@ +/* SPDX-License-Identifier: ((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause) */ + +/* + * This structure provides a vDSO-style clock to VM guests, exposing the + * relationship (or lack thereof) between the CPU clock (TSC, timebase, arch + * counter, etc.) and real time. It is designed to address the problem of + * live migration, which other clock enlightenments do not. + * + * When a guest is live migrated, this affects the clock in two ways. + * + * First, even between identical hosts the actual frequency of the underlying + * counter will change within the tolerances of its specification (typically + * ±50PPM, or 4 seconds a day). The frequency also varies over time on the + * same host, but can be tracked by NTP as it generally varies slowly. With + * live migration there is a step change in the frequency, with no warning. + * + * Second, there may be a step change in the value of the counter itself, as + * its accuracy is limited by the precision of the NTP synchronization on the + * source and destination hosts. + * + * So any calibration (NTP, PTP, etc.) which the guest has done on the source + * host before migration is invalid, and needs to be redone on the new host. + * + * In its most basic mode, this structure provides only an indication to the + * guest that live migration has occurred. This allows the guest to know that + * its clock is invalid and take remedial action. For applications that need + * reliable accurate timestamps (e.g. distributed databases), the structure + * can be mapped all the way to userspace. This allows the application to see + * directly for itself that the clock is disrupted and take appropriate + * action, even when using a vDSO-style method to get the time instead of a + * system call. + * + * In its more advanced mode. this structure can also be used to expose the + * precise relationship of the CPU counter to real time, as calibrated by the + * host. This means that userspace applications can have accurate time + * immediately after live migration, rather than having to pause operations + * and wait for NTP to recover. This mode does, of course, rely on the + * counter being reliable and consistent across CPUs. + * + * Note that this must be true UTC, never with smeared leap seconds. If a + * guest wishes to construct a smeared clock, it can do so. Presenting a + * smeared clock through this interface would be problematic because it + * actually messes with the apparent counter *period*. A linear smearing + * of 1 ms per second would effectively tweak the counter period by 1000PPM + * at the start/end of the smearing period, while a sinusoidal smear would + * basically be impossible to represent. + */ + +#ifndef __VMCLOCK_H__ +#define __VMCLOCK_H__ + +#ifdef __KERNEL__ +#include <linux/types.h> +#else +#include <stdint.h> +#endif + +struct vmclock_abi { + uint32_t magic; +#define VMCLOCK_MAGIC 0x4b4c4356 /* "VCLK" */ + uint16_t size; /* Size of page containing this structure */ + uint16_t version; /* 1 */ + + /* Sequence lock. Low bit means an update is in progress. */ + uint64_t seq_count; + + /* + * This field changes to another non-repeating value when the CPU + * counter is disrupted, for example on live migration. + */ + uint64_t disruption_marker; + + /* + * By providing the TAI offset, the guest can know both UTC and TAI + * reliably. There is no need to choose one *or* the other. Valid if + * VMCLOCK_FLAG_TAI_OFFSET_VALID is set in flags. + */ + int16_t tai_offset_sec; + + uint16_t flags; + /* Indicates that the tai_offset_sec field is valid */ +#define VMCLOCK_FLAG_TAI_OFFSET_VALID (1 << 0) + /* + * Optionally used to notify guests of pending maintenance events. + * A guest may wish to remove itself from service if an event is + * coming up. Two flags indicate the rough imminence of the event. + */ +#define VMCLOCK_FLAG_DISRUPTION_SOON (1 << 1) /* About a day */ +#define VMCLOCK_FLAG_DISRUPTION_IMMINENT (1 << 2) /* About an hour */ + /* Indicates that the utc_time_maxerror_picosec field is valid */ +#define VMCLOCK_FLAG_UTC_MAXERROR_VALID (1 << 3) + /* Indicates counter_period_error_rate_frac_sec is valid */ +#define VMCLOCK_FLAG_UTC_PERIOD_ERROR_VALID (1 << 4) + + uint8_t clock_status; +#define VMCLOCK_STATUS_UNKNOWN 0 +#define VMCLOCK_STATUS_INITIALIZING 1 +#define VMCLOCK_STATUS_SYNCHRONIZED 2 +#define VMCLOCK_STATUS_FREERUNNING 3 +#define VMCLOCK_STATUS_UNRELIABLE 4 + + uint8_t counter_id; +#define VMCLOCK_COUNTER_INVALID 0 +#define VMCLOCK_COUNTER_X86_TSC 1 +#define VMCLOCK_COUNTER_ARM_VCNT 2 + + /* Bit shift for counter_period_frac_sec and its error rate */ + uint8_t counter_period_shift; + + /* + * Unlike in NTP, this can indicate a leap second in the past. This + * is needed to allow guests to derive an imprecise clock with + * smeared leap seconds for themselves, as some modes of smearing + * need the adjustments to continue even after the moment at which + * the leap second should have occurred. + */ + int8_t leapsecond_direction; + uint64_t leapsecond_tai_sec; /* Since 1970-01-01 00:00:00z */ + + /* + * Paired values of counter and UTC at a given point in time. + */ + uint64_t counter_value; + uint64_t utc_time_sec; /* Since 1970-01-01 00:00:00z */ + uint64_t utc_time_frac_sec; + + /* + * Counter frequency, and error margin. The unit of these fields is + * seconds >> (64 + counter_period_shift) + */ + uint64_t counter_period_frac_sec; + uint64_t counter_period_error_rate_frac_sec; + + /* Error margin of UTC reading above (± picoseconds) */ + uint64_t utc_time_maxerror_picosec; +}; + +#endif /* __VMCLOCK_H__ */