diff mbox series

ptp: Add vDSO-style vmclock support

Message ID 14d1626bc9ddae9d8ad19d3c508538d10f5a8e44.camel@infradead.org (mailing list archive)
State New, archived
Headers show
Series ptp: Add vDSO-style vmclock support | expand

Commit Message

David Woodhouse July 24, 2024, 5:16 p.m. UTC
From: David Woodhouse <dwmw@amazon.co.uk>

The vmclock "device" provides a shared memory region with precision clock
information. By using shared memory, it is safe across Live Migration.

Like the KVM PTP clock, this can convert TSC-based cross timestamps into
KVM clock values. Unlike the KVM PTP clock, it does so only when such is
actually helpful.

The memory region of the device is also exposed to userspace so it can be
read or memory mapped by application which need reliable notification of
clock disruptions.

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
---
QEMU implementation at
https://git.infradead.org/users/dwmw2/qemu.git/shortlog/refs/heads/vmclock

Although the ACPI device implemented in QEMU (and some other
hypervisor) stands alone, most of the fields and values herein are
aligned as much as possible with the nascent virtio-rtc specification,
with the intent that a version of the same structure can be
incorporated into that standard.

v1:
 • Change absolute error fields to nanoseconds
 • Update leap second definition to match virtio-rtc intentions in 
    https://lore.kernel.org/all/85c93b42-41a2-42c4-a168-55079bbfff71@opensynergy.com

RFC v4:
 • Add esterror fields, MONOTONIC flag.
 • Reduce seq_count to 32 bits
 • Expand size to permit 64KiB pages
 • Align with virtio-rtc fields, values and leap handling
 • Drop gettime() method (since we have gettimex())
 • Add leap second smearing hint
 • Use a real _CRS on the ACPI device

RFC v3: (wrong patch sent)

RFC v2:
 • Add gettimex64() support
 • Convert TSC values to KVM clock when appropriate
 • Require int128 support
 • Add counter_period_shift
 • Add timeout when seq_count is invalid
 • Add flags field
 • Better comments in vmclock ABI structure
 • Explicitly forbid smearing (as clock rates would need to change)


 drivers/ptp/Kconfig              |  13 +
 drivers/ptp/Makefile             |   1 +
 drivers/ptp/ptp_vmclock.c        | 567 +++++++++++++++++++++++++++++++
 include/uapi/linux/vmclock-abi.h | 187 ++++++++++
 4 files changed, 768 insertions(+)
 create mode 100644 drivers/ptp/ptp_vmclock.c
 create mode 100644 include/uapi/linux/vmclock-abi.h

Comments

Michael S. Tsirkin July 25, 2024, 5:48 a.m. UTC | #1
On Wed, Jul 24, 2024 at 06:16:37PM +0100, David Woodhouse wrote:
> From: David Woodhouse <dwmw@amazon.co.uk>
> 
> The vmclock "device" provides a shared memory region with precision clock
> information. By using shared memory, it is safe across Live Migration.
> 
> Like the KVM PTP clock, this can convert TSC-based cross timestamps into
> KVM clock values. Unlike the KVM PTP clock, it does so only when such is
> actually helpful.
> 
> The memory region of the device is also exposed to userspace so it can be
> read or memory mapped by application which need reliable notification of
> clock disruptions.
> 
> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
> ---
> QEMU implementation at
> https://git.infradead.org/users/dwmw2/qemu.git/shortlog/refs/heads/vmclock
> 
> Although the ACPI device implemented in QEMU (and some other
> hypervisor) stands alone, most of the fields and values herein are
> aligned as much as possible with the nascent virtio-rtc specification,
> with the intent that a version of the same structure can be
> incorporated into that standard.

Do you want to just help complete virtio-rtc then? Would be easier than
trying to keep two specs in sync.


> v1:
>  • Change absolute error fields to nanoseconds
>  • Update leap second definition to match virtio-rtc intentions in 
>     https://lore.kernel.org/all/85c93b42-41a2-42c4-a168-55079bbfff71@opensynergy.com
> 
> RFC v4:
>  • Add esterror fields, MONOTONIC flag.
>  • Reduce seq_count to 32 bits
>  • Expand size to permit 64KiB pages
>  • Align with virtio-rtc fields, values and leap handling
>  • Drop gettime() method (since we have gettimex())
>  • Add leap second smearing hint
>  • Use a real _CRS on the ACPI device
> 
> RFC v3: (wrong patch sent)
> 
> RFC v2:
>  • Add gettimex64() support
>  • Convert TSC values to KVM clock when appropriate
>  • Require int128 support
>  • Add counter_period_shift
>  • Add timeout when seq_count is invalid
>  • Add flags field
>  • Better comments in vmclock ABI structure
>  • Explicitly forbid smearing (as clock rates would need to change)
> 
> 
>  drivers/ptp/Kconfig              |  13 +
>  drivers/ptp/Makefile             |   1 +
>  drivers/ptp/ptp_vmclock.c        | 567 +++++++++++++++++++++++++++++++
>  include/uapi/linux/vmclock-abi.h | 187 ++++++++++
>  4 files changed, 768 insertions(+)
>  create mode 100644 drivers/ptp/ptp_vmclock.c
>  create mode 100644 include/uapi/linux/vmclock-abi.h
> 
> diff --git a/drivers/ptp/Kconfig b/drivers/ptp/Kconfig
> index 604541dcb320..e98c9767e0ef 100644
> --- a/drivers/ptp/Kconfig
> +++ b/drivers/ptp/Kconfig
> @@ -131,6 +131,19 @@ config PTP_1588_CLOCK_KVM
>  	  To compile this driver as a module, choose M here: the module
>  	  will be called ptp_kvm.
>  
> +config PTP_1588_CLOCK_VMCLOCK
> +	tristate "Virtual machine PTP clock"
> +	depends on X86_TSC || ARM_ARCH_TIMER
> +	depends on PTP_1588_CLOCK && ACPI && ARCH_SUPPORTS_INT128
> +	default y
> +	help
> +	  This driver adds support for using a virtual precision clock
> +	  advertised by the hypervisor. This clock is only useful in virtual
> +	  machines where such a device is present.
> +
> +	  To compile this driver as a module, choose M here: the module
> +	  will be called ptp_vmclock.
> +
>  config PTP_1588_CLOCK_IDT82P33
>  	tristate "IDT 82P33xxx PTP clock"
>  	depends on PTP_1588_CLOCK && I2C
> diff --git a/drivers/ptp/Makefile b/drivers/ptp/Makefile
> index 68bf02078053..01b5cd91eb61 100644
> --- a/drivers/ptp/Makefile
> +++ b/drivers/ptp/Makefile
> @@ -11,6 +11,7 @@ obj-$(CONFIG_PTP_1588_CLOCK_DTE)	+= ptp_dte.o
>  obj-$(CONFIG_PTP_1588_CLOCK_INES)	+= ptp_ines.o
>  obj-$(CONFIG_PTP_1588_CLOCK_PCH)	+= ptp_pch.o
>  obj-$(CONFIG_PTP_1588_CLOCK_KVM)	+= ptp_kvm.o
> +obj-$(CONFIG_PTP_1588_CLOCK_VMCLOCK)	+= ptp_vmclock.o
>  obj-$(CONFIG_PTP_1588_CLOCK_QORIQ)	+= ptp-qoriq.o
>  ptp-qoriq-y				+= ptp_qoriq.o
>  ptp-qoriq-$(CONFIG_DEBUG_FS)		+= ptp_qoriq_debugfs.o
> diff --git a/drivers/ptp/ptp_vmclock.c b/drivers/ptp/ptp_vmclock.c
> new file mode 100644
> index 000000000000..9c508c21c062
> --- /dev/null
> +++ b/drivers/ptp/ptp_vmclock.c
> @@ -0,0 +1,567 @@
> +// SPDX-License-Identifier: GPL-2.0-or-later
> +/*
> + * Virtual PTP 1588 clock for use with LM-safe VMclock device.
> + *
> + * Copyright © 2024 Amazon.com, Inc. or its affiliates.
> + */
> +
> +#include <linux/acpi.h>
> +#include <linux/device.h>
> +#include <linux/err.h>
> +#include <linux/file.h>
> +#include <linux/fs.h>
> +#include <linux/init.h>
> +#include <linux/kernel.h>
> +#include <linux/miscdevice.h>
> +#include <linux/mm.h>
> +#include <linux/module.h>
> +#include <linux/platform_device.h>
> +#include <linux/slab.h>
> +
> +#include <uapi/linux/vmclock-abi.h>
> +
> +#include <linux/ptp_clock_kernel.h>
> +
> +#ifdef CONFIG_X86
> +#include <asm/pvclock.h>
> +#include <asm/kvmclock.h>
> +#endif
> +
> +#ifdef CONFIG_KVM_GUEST
> +#define SUPPORT_KVMCLOCK
> +#endif
> +
> +static DEFINE_IDA(vmclock_ida);
> +
> +ACPI_MODULE_NAME("vmclock");
> +
> +struct vmclock_state {
> +	struct resource res;
> +	struct vmclock_abi *clk;
> +	struct miscdevice miscdev;
> +	struct ptp_clock_info ptp_clock_info;
> +	struct ptp_clock *ptp_clock;
> +	enum clocksource_ids cs_id, sys_cs_id;
> +	int index;
> +	char *name;
> +};
> +
> +#define VMCLOCK_MAX_WAIT ms_to_ktime(100)
> +
> +/*
> + * Multiply a 64-bit count by a 64-bit tick 'period' in units of seconds >> 64
> + * and add the fractional second part of the reference time.
> + *
> + * The result is a 128-bit value, the top 64 bits of which are seconds, and
> + * the low 64 bits are (seconds >> 64).
> + *
> + * If __int128 isn't available, perform the calculation 32 bits at a time to
> + * avoid overflow.
> + */
> +static inline uint64_t mul_u64_u64_shr_add_u64(uint64_t *res_hi, uint64_t delta,
> +					       uint64_t period, uint8_t shift,
> +					       uint64_t frac_sec)
> +{
> +	unsigned __int128 res = (unsigned __int128)delta * period;
> +
> +	res >>= shift;
> +	res += frac_sec;
> +	*res_hi = res >> 64;
> +	return (uint64_t)res;
> +}
> +
> +static inline bool tai_adjust(struct vmclock_abi *clk, uint64_t *sec)
> +{
> +	if (likely(clk->time_type == VMCLOCK_TIME_UTC))
> +		return true;
> +
> +	if (clk->time_type == VMCLOCK_TIME_TAI &&
> +	    (clk->flags & VMCLOCK_FLAG_TAI_OFFSET_VALID)) {
> +		if (sec)
> +			*sec += clk->tai_offset_sec;
> +		return true;
> +	}
> +	return false;
> +}
> +
> +static int vmclock_get_crosststamp(struct vmclock_state *st,
> +				   struct ptp_system_timestamp *sts,
> +				   struct system_counterval_t *system_counter,
> +				   struct timespec64 *tspec)
> +{
> +	ktime_t deadline = ktime_add(ktime_get(), VMCLOCK_MAX_WAIT);
> +	struct system_time_snapshot systime_snapshot;
> +	uint64_t cycle, delta, seq, frac_sec;
> +
> +#ifdef CONFIG_X86
> +	/*
> +	 * We'd expect the hypervisor to know this and to report the clock
> +	 * status as VMCLOCK_STATUS_UNRELIABLE. But be paranoid.
> +	 */
> +	if (check_tsc_unstable())
> +		return -EINVAL;
> +#endif
> +
> +	while (1) {
> +		seq = st->clk->seq_count & ~1ULL;
> +		virt_rmb();
> +
> +		if (st->clk->clock_status == VMCLOCK_STATUS_UNRELIABLE)
> +			return -EINVAL;
> +
> +		/*
> +		 * When invoked for gettimex64(), fill in the pre/post system
> +		 * times. The simple case is when system time is based on the
> +		 * same counter as st->cs_id, in which case all three times
> +		 * will be derived from the *same* counter value.
> +		 *
> +		 * If the system isn't using the same counter, then the value
> +		 * from ktime_get_snapshot() will still be used as pre_ts, and
> +		 * ptp_read_system_postts() is called to populate postts after
> +		 * calling get_cycles().
> +		 *
> +		 * The conversion to timespec64 happens further down, outside
> +		 * the seq_count loop.
> +		 */
> +		if (sts) {
> +			ktime_get_snapshot(&systime_snapshot);
> +			if (systime_snapshot.cs_id == st->cs_id) {
> +				cycle = systime_snapshot.cycles;
> +			} else {
> +				cycle = get_cycles();
> +				ptp_read_system_postts(sts);
> +			}
> +		} else
> +			cycle = get_cycles();
> +
> +		delta = cycle - st->clk->counter_value;
> +
> +		frac_sec = mul_u64_u64_shr_add_u64(&tspec->tv_sec, delta,
> +						   st->clk->counter_period_frac_sec,
> +						   st->clk->counter_period_shift,
> +						   st->clk->time_frac_sec);
> +		tspec->tv_nsec = mul_u64_u64_shr(frac_sec, NSEC_PER_SEC, 64);
> +		tspec->tv_sec += st->clk->time_sec;
> +
> +		if (!tai_adjust(st->clk, &tspec->tv_sec))
> +			return -EINVAL;
> +
> +		virt_rmb();
> +		if (seq == st->clk->seq_count)
> +			break;
> +
> +		if (ktime_after(ktime_get(), deadline))
> +			return -ETIMEDOUT;
> +	}
> +
> +	if (system_counter) {
> +		system_counter->cycles = cycle;
> +		system_counter->cs_id = st->cs_id;
> +	}
> +
> +	if (sts) {
> +		sts->pre_ts = ktime_to_timespec64(systime_snapshot.real);
> +		if (systime_snapshot.cs_id == st->cs_id)
> +			sts->post_ts = sts->pre_ts;
> +	}
> +
> +	return 0;
> +}
> +
> +#ifdef SUPPORT_KVMCLOCK
> +/*
> + * In the case where the system is using the KVM clock for timekeeping, convert
> + * the TSC value into a KVM clock time in order to return a paired reading that
> + * get_device_system_crosststamp() can cope with.
> + */
> +static int vmclock_get_crosststamp_kvmclock(struct vmclock_state *st,
> +					    struct ptp_system_timestamp *sts,
> +					    struct system_counterval_t *system_counter,
> +					    struct timespec64 *tspec)
> +{
> +	struct pvclock_vcpu_time_info *pvti = this_cpu_pvti();
> +	unsigned pvti_ver;
> +	int ret;
> +
> +	preempt_disable_notrace();
> +
> +	do {
> +		pvti_ver = pvclock_read_begin(pvti);
> +
> +		ret = vmclock_get_crosststamp(st, sts, system_counter, tspec);
> +		if (ret)
> +			break;
> +
> +		system_counter->cycles = __pvclock_read_cycles(pvti,
> +							       system_counter->cycles);
> +		system_counter->cs_id = CSID_X86_KVM_CLK;
> +
> +		/*
> +		 * This retry should never really happen; if the TSC is
> +		 * stable and reliable enough across vCPUS that it is sane
> +		 * for the hypervisor to expose a VMCLOCK device which uses
> +		 * it as the reference counter, then the KVM clock sohuld be
> +		 * in 'master clock mode' and basically never changed. But
> +		 * the KVM clock is a fickle and often broken thing, so do
> +		 * it "properly" just in case.
> +		 */
> +	} while (pvclock_read_retry(pvti, pvti_ver));
> +
> +	preempt_enable_notrace();
> +
> +	return ret;
> +}
> +#endif
> +
> +static int ptp_vmclock_get_time_fn(ktime_t *device_time,
> +				   struct system_counterval_t *system_counter,
> +				   void *ctx)
> +{
> +	struct vmclock_state *st = ctx;
> +	struct timespec64 tspec;
> +	int ret;
> +
> +#ifdef SUPPORT_KVMCLOCK
> +	if (READ_ONCE(st->sys_cs_id) == CSID_X86_KVM_CLK)
> +		ret = vmclock_get_crosststamp_kvmclock(st, NULL, system_counter,
> +						       &tspec);
> +	else
> +#endif
> +		ret = vmclock_get_crosststamp(st, NULL, system_counter, &tspec);
> +
> +	if (!ret)
> +		*device_time = timespec64_to_ktime(tspec);
> +
> +	return ret;
> +}
> +
> +
> +static int ptp_vmclock_getcrosststamp(struct ptp_clock_info *ptp,
> +				      struct system_device_crosststamp *xtstamp)
> +{
> +	struct vmclock_state *st = container_of(ptp, struct vmclock_state,
> +						ptp_clock_info);
> +	int ret = get_device_system_crosststamp(ptp_vmclock_get_time_fn, st,
> +						NULL, xtstamp);
> +#ifdef SUPPORT_KVMCLOCK
> +	/*
> +	 * On x86, the KVM clock may be used for the system time. We can
> +	 * actually convert a TSC reading to that, and return a paired
> +	 * timestamp that get_device_system_crosststamp() *can* handle.
> +	 */
> +	if (ret == -ENODEV) {
> +		struct system_time_snapshot systime_snapshot;
> +		ktime_get_snapshot(&systime_snapshot);
> +
> +		if (systime_snapshot.cs_id == CSID_X86_TSC ||
> +		    systime_snapshot.cs_id == CSID_X86_KVM_CLK) {
> +			WRITE_ONCE(st->sys_cs_id, systime_snapshot.cs_id);
> +			ret = get_device_system_crosststamp(ptp_vmclock_get_time_fn,
> +							    st, NULL, xtstamp);
> +		}
> +	}
> +#endif
> +	return ret;
> +}
> +
> +/*
> + * PTP clock operations
> + */
> +
> +static int ptp_vmclock_adjfine(struct ptp_clock_info *ptp, long delta)
> +{
> +	return -EOPNOTSUPP;
> +}
> +
> +static int ptp_vmclock_adjtime(struct ptp_clock_info *ptp, s64 delta)
> +{
> +	return -EOPNOTSUPP;
> +}
> +
> +static int ptp_vmclock_settime(struct ptp_clock_info *ptp,
> +			   const struct timespec64 *ts)
> +{
> +	return -EOPNOTSUPP;
> +}
> +
> +static int ptp_vmclock_gettimex(struct ptp_clock_info *ptp, struct timespec64 *ts,
> +				struct ptp_system_timestamp *sts)
> +{
> +	struct vmclock_state *st = container_of(ptp, struct vmclock_state,
> +						ptp_clock_info);
> +
> +	return vmclock_get_crosststamp(st, sts, NULL, ts);
> +}
> +
> +static int ptp_vmclock_enable(struct ptp_clock_info *ptp,
> +			  struct ptp_clock_request *rq, int on)
> +{
> +	return -EOPNOTSUPP;
> +}
> +
> +static const struct ptp_clock_info ptp_vmclock_info = {
> +	.owner		= THIS_MODULE,
> +	.max_adj	= 0,
> +	.n_ext_ts	= 0,
> +	.n_pins		= 0,
> +	.pps		= 0,
> +	.adjfine	= ptp_vmclock_adjfine,
> +	.adjtime	= ptp_vmclock_adjtime,
> +	.gettimex64	= ptp_vmclock_gettimex,
> +	.settime64	= ptp_vmclock_settime,
> +	.enable		= ptp_vmclock_enable,
> +	.getcrosststamp = ptp_vmclock_getcrosststamp,
> +};
> +
> +static int vmclock_miscdev_mmap(struct file *fp, struct vm_area_struct *vma)
> +{
> +	struct vmclock_state *st = container_of(fp->private_data,
> +						struct vmclock_state, miscdev);
> +
> +	if ((vma->vm_flags & (VM_READ|VM_WRITE)) != VM_READ)
> +		return -EROFS;
> +
> +	if (vma->vm_end - vma->vm_start != PAGE_SIZE || vma->vm_pgoff)
> +		return -EINVAL;
> +
> +        if (io_remap_pfn_range(vma, vma->vm_start,
> +			       st->res.start >> PAGE_SHIFT, PAGE_SIZE,
> +                               vma->vm_page_prot))
> +                return -EAGAIN;
> +
> +        return 0;
> +}
> +
> +static ssize_t vmclock_miscdev_read(struct file *fp, char __user *buf,
> +				    size_t count, loff_t *ppos)
> +{
> +	struct vmclock_state *st = container_of(fp->private_data,
> +						struct vmclock_state, miscdev);
> +	ktime_t deadline = ktime_add(ktime_get(), VMCLOCK_MAX_WAIT);
> +	size_t max_count;
> +	int32_t seq;
> +
> +	if (*ppos >= PAGE_SIZE)
> +		return 0;
> +
> +	max_count = PAGE_SIZE - *ppos;
> +	if (count > max_count)
> +		count = max_count;
> +
> +	while (1) {
> +		seq = st->clk->seq_count & ~1ULL;
> +		virt_rmb();
> +
> +		if (copy_to_user(buf, ((char *)st->clk) + *ppos, count))
> +			return -EFAULT;
> +
> +		virt_rmb();
> +		if (seq == st->clk->seq_count)
> +			break;
> +
> +		if (ktime_after(ktime_get(), deadline))
> +			return -ETIMEDOUT;
> +	}
> +
> +	*ppos += count;
> +	return count;
> +}
> +
> +static const struct file_operations vmclock_miscdev_fops = {
> +        .mmap = vmclock_miscdev_mmap,
> +        .read = vmclock_miscdev_read,
> +};
> +
> +/* module operations */
> +
> +static void vmclock_remove(struct platform_device *pdev)
> +{
> +	struct device *dev = &pdev->dev;
> +	struct vmclock_state *st = dev_get_drvdata(dev);
> +
> +	if (st->ptp_clock)
> +		ptp_clock_unregister(st->ptp_clock);
> +
> +	if (st->miscdev.minor != MISC_DYNAMIC_MINOR)
> +		misc_deregister(&st->miscdev);
> +}
> +
> +static acpi_status vmclock_acpi_resources(struct acpi_resource *ares, void *data)
> +{
> +	struct vmclock_state *st = data;
> +	struct resource_win win;
> +	struct resource *res = &(win.res);
> +
> +	if (ares->type == ACPI_RESOURCE_TYPE_END_TAG)
> +		return AE_OK;
> +
> +	/* There can be only one */
> +	if (resource_type(&st->res) == IORESOURCE_MEM)
> +		return AE_ERROR;
> +
> +        if (acpi_dev_resource_memory(ares, res) ||
> +	    acpi_dev_resource_address_space(ares, &win)) {
> +
> +		if (resource_type(res) != IORESOURCE_MEM ||
> +		    resource_size(res) < sizeof(st->clk))
> +			return AE_ERROR;
> +
> +		st->res = *res;
> +		return AE_OK;
> +	}
> +
> +	return AE_ERROR;
> +}
> +
> +static int vmclock_probe_acpi(struct device *dev, struct vmclock_state *st)
> +{
> +	struct acpi_device *adev = ACPI_COMPANION(dev);
> +	acpi_status status;
> +
> +        status = acpi_walk_resources(adev->handle, METHOD_NAME__CRS,
> +                                     vmclock_acpi_resources, st);
> +        if (ACPI_FAILURE(status) || resource_type(&st->res) != IORESOURCE_MEM) {
> +                dev_err(dev, "failed to get resources\n");
> +                return -ENODEV;
> +        }
> +
> +	return 0;
> +}
> +
> +static void vmclock_put_idx(void *data)
> +{
> +	struct vmclock_state *st = data;
> +
> +	ida_free(&vmclock_ida, st->index);
> +}
> +
> +static int vmclock_probe(struct platform_device *pdev)
> +{
> +	struct device *dev = &pdev->dev;
> +	struct vmclock_state *st;
> +	int ret;
> +
> +	st = devm_kzalloc(dev, sizeof (*st), GFP_KERNEL);
> +	if (!st)
> +		return -ENOMEM;
> +
> +	if (has_acpi_companion(dev))
> +		ret = vmclock_probe_acpi(dev, st);
> +	else
> +		ret = -EINVAL; /* Only ACPI for now */
> +
> +	if (ret) {
> +		dev_info(dev, "Failed to obtain physical address: %d\n", ret);
> +		goto out;
> +	}
> +
> +	st->clk = devm_memremap(dev, st->res.start, resource_size(&st->res),
> +				MEMREMAP_WB | MEMREMAP_DEC);
> +	if (IS_ERR(st->clk)) {
> +		ret = PTR_ERR(st->clk);
> +		dev_info(dev, "failed to map shared memory\n");
> +		st->clk = NULL;
> +		goto out;
> +	}
> +
> +	if (st->clk->magic != VMCLOCK_MAGIC ||
> +	    st->clk->size < sizeof(*st->clk) ||


This is a problem and what I mention below.
As structure will evolve, sizeof(*st->clk) will grow,
then it will fail on old hosts.


> +	    st->clk->version != 1) {
> +		dev_info(dev, "vmclock magic fields invalid\n");
> +		ret = -EINVAL;
> +		goto out;
> +	}
> +
> +	ret = ida_alloc(&vmclock_ida, GFP_KERNEL);
> +	if (ret < 0)
> +		goto out;
> +
> +	st->index = ret;
> +        ret = devm_add_action_or_reset(&pdev->dev, vmclock_put_idx, st);
> +	if (ret)
> +		goto out;
> +
> +	st->name = devm_kasprintf(&pdev->dev, GFP_KERNEL, "vmclock%d", st->index);
> +	if (!st->name) {
> +		ret = -ENOMEM;
> +		goto out;
> +	}
> +
> +	/* If the structure is big enough, it can be mapped to userspace */
> +	if (st->clk->size >= PAGE_SIZE) {

This is also part of ABI. And a problematic one since linux can
increase PAGE_SIZE at any time for its own reasons.


> +		st->miscdev.minor = MISC_DYNAMIC_MINOR;
> +		st->miscdev.fops = &vmclock_miscdev_fops;
> +		st->miscdev.name = st->name;
> +
> +		ret = misc_register(&st->miscdev);
> +		if (ret)
> +			goto out;
> +	}

Hmm so you want to map a page to userspace, but who will
keep the clock in that memory page up to date?
Making hypervisor do it will mean stealing a bunch
of hypervisor time and we don't *know* userspace is
going to use it.


> +
> +	/* If there is valid clock information, register a PTP clock */
> +	if (IS_ENABLED(CONFIG_ARM64) &&
> +	    st->clk->counter_id == VMCLOCK_COUNTER_ARM_VCNT) {
> +		/* Can we check it's the virtual counter? */
> +		st->cs_id = CSID_ARM_ARCH_COUNTER;
> +	} else if (IS_ENABLED(CONFIG_X86) &&
> +		   st->clk->counter_id == VMCLOCK_COUNTER_X86_TSC) {
> +		st->cs_id = CSID_X86_TSC;
> +	}
> +
> +	/* Only UTC, or TAI with offset */
> +	if (!tai_adjust(st->clk, NULL)) {
> +		dev_info(dev, "vmclock does not provide unambiguous UTC\n");
> +		st->cs_id = CSID_GENERIC;
> +	}
> +
> +	if (st->cs_id) {
> +		st->sys_cs_id = st->cs_id;
> +
> +		st->ptp_clock_info = ptp_vmclock_info;
> +		strscpy(st->ptp_clock_info.name, st->name);
> +		st->ptp_clock = ptp_clock_register(&st->ptp_clock_info, dev);
> +
> +		if (IS_ERR(st->ptp_clock)) {
> +			ret = PTR_ERR(st->ptp_clock);
> +			st->ptp_clock = NULL;
> +			vmclock_remove(pdev);
> +			goto out;
> +		}
> +	} else if (!st->miscdev.minor) {
> +		/* Neither miscdev nor PTP registered */
> +		dev_info(dev, "vmclock: Neither miscdev nor PTP available; not registering\n");
> +		ret = -ENODEV;
> +		goto out;
> +	}
> +
> +	dev_info(dev, "%s: registered %s%s%s\n", st->name,
> +		 st->miscdev.minor ? "miscdev" : "",
> +		 (st->miscdev.minor && st->ptp_clock) ? ", " : "",
> +		 st->ptp_clock ? "PTP" : "");
> +
> +	dev_set_drvdata(dev, st);
> +
> + out:
> +	return ret;
> +}
> +
> +static const struct acpi_device_id vmclock_acpi_ids[] = {
> +	{ "VMCLOCK", 0 },
> +	{}
> +};
> +MODULE_DEVICE_TABLE(acpi, vmclock_acpi_ids);

This is also part of ABI.
How do we know no one in the world happens to have a CID like this?
We should just reserve a valid HID, I think.


> +
> +static struct platform_driver vmclock_platform_driver = {
> +	.probe		= vmclock_probe,
> +	.remove_new	= vmclock_remove,
> +	.driver	= {
> +		.name	= "vmclock",
> +		.acpi_match_table = vmclock_acpi_ids,
> +	},
> +};
> +
> +module_platform_driver(vmclock_platform_driver)
> +
> +MODULE_AUTHOR("David Woodhouse <dwmw2@infradead.org>");
> +MODULE_DESCRIPTION("PTP clock using VMCLOCK");
> +MODULE_LICENSE("GPL v2");
> diff --git a/include/uapi/linux/vmclock-abi.h b/include/uapi/linux/vmclock-abi.h
> new file mode 100644
> index 000000000000..3bde10ddec38
> --- /dev/null
> +++ b/include/uapi/linux/vmclock-abi.h
> @@ -0,0 +1,187 @@
> +/* SPDX-License-Identifier: ((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause) */
> +
> +/*
> + * This structure provides a vDSO-style clock to VM guests, exposing the
> + * relationship (or lack thereof) between the CPU clock (TSC, timebase, arch
> + * counter, etc.) and real time. It is designed to address the problem of
> + * live migration, which other clock enlightenments do not.
> + *
> + * When a guest is live migrated, this affects the clock in two ways.
> + *
> + * First, even between identical hosts the actual frequency of the underlying
> + * counter will change within the tolerances of its specification (typically
> + * ±50PPM, or 4 seconds a day). This frequency also varies over time on the
> + * same host, but can be tracked by NTP as it generally varies slowly. With
> + * live migration there is a step change in the frequency, with no warning.
> + *
> + * Second, there may be a step change in the value of the counter itself, as
> + * its accuracy is limited by the precision of the NTP synchronization on the
> + * source and destination hosts.
> + *
> + * So any calibration (NTP, PTP, etc.) which the guest has done on the source
> + * host before migration is invalid, and needs to be redone on the new host.
> + *
> + * In its most basic mode, this structure provides only an indication to the
> + * guest that live migration has occurred. This allows the guest to know that
> + * its clock is invalid and take remedial action. For applications that need
> + * reliable accurate timestamps (e.g. distributed databases), the structure
> + * can be mapped all the way to userspace. This allows the application to see
> + * directly for itself that the clock is disrupted and take appropriate
> + * action, even when using a vDSO-style method to get the time instead of a
> + * system call.
> + *
> + * In its more advanced mode. this structure can also be used to expose the
> + * precise relationship of the CPU counter to real time, as calibrated by the
> + * host. This means that userspace applications can have accurate time
> + * immediately after live migration, rather than having to pause operations
> + * and wait for NTP to recover. This mode does, of course, rely on the
> + * counter being reliable and consistent across CPUs.
> + *
> + * Note that this must be true UTC, never with smeared leap seconds. If a
> + * guest wishes to construct a smeared clock, it can do so. Presenting a
> + * smeared clock through this interface would be problematic because it
> + * actually messes with the apparent counter *period*. A linear smearing
> + * of 1 ms per second would effectively tweak the counter period by 1000PPM
> + * at the start/end of the smearing period, while a sinusoidal smear would
> + * basically be impossible to represent.
> + *
> + * This structure is offered with the intent that it be adopted into the
> + * nascent virtio-rtc standard, as a virtio-rtc that does not address the live
> + * migration problem seems a little less than fit for purpose. For that
> + * reason, certain fields use precisely the same numeric definitions as in
> + * the virtio-rtc proposal. The structure can also be exposed through an ACPI
> + * device with the CID "VMCLOCK", modelled on the "VMGENID" device except for
> + * the fact that it uses a real _CRS to convey the address of the structure
> + * (which should be a full page, to allow for mapping directly to userspace).
> + */
> +
> +#ifndef __VMCLOCK_ABI_H__
> +#define __VMCLOCK_ABI_H__
> +
> +#ifdef __KERNEL__
> +#include <linux/types.h>
> +#else
> +#include <stdint.h>
> +#endif
> +
> +struct vmclock_abi {
> +	/* CONSTANT FIELDS */
> +	uint32_t magic;
> +#define VMCLOCK_MAGIC	0x4b4c4356 /* "VCLK" */
> +	uint32_t size;		/* Size of region containing this structure */
> +	uint16_t version;	/* 1 */

I think we need to document how will compatibility be handled,
otherwise people will just go
(assert(size == sizeof(struct vmclock_abi))
and we can't extend it ever.

I also feel some kind of negotiation so host can detect guest
expectations would be a good idea, basically a la virtio feature
negotiation.


Also, I think it's not great that host is just expected to
poke at this memory at all times. Being able to know that e.g.
system is being shut down and so no need to poke
at memory would be good.


> +	uint8_t counter_id; /* Matches VIRTIO_RTC_COUNTER_xxx except INVALID */
> +#define VMCLOCK_COUNTER_ARM_VCNT	0
> +#define VMCLOCK_COUNTER_X86_TSC		1
> +#define VMCLOCK_COUNTER_INVALID		0xff
> +	uint8_t time_type; /* Matches VIRTIO_RTC_TYPE_xxx */
> +#define VMCLOCK_TIME_UTC			0	/* Since 1970-01-01 00:00:00z */
> +#define VMCLOCK_TIME_TAI			1	/* Since 1970-01-01 00:00:00z */
> +#define VMCLOCK_TIME_MONOTONIC			2	/* Since undefined epoch */
> +#define VMCLOCK_TIME_INVALID_SMEARED		3	/* Not supported */
> +#define VMCLOCK_TIME_INVALID_MAYBE_SMEARED	4	/* Not supported */
> +
> +	/* NON-CONSTANT FIELDS PROTECTED BY SEQCOUNT LOCK */
> +	uint32_t seq_count;	/* Low bit means an update is in progress */
> +	/*
> +	 * This field changes to another non-repeating value when the CPU
> +	 * counter is disrupted, for example on live migration. This lets
> +	 * the guest know that it should discard any calibration it has
> +	 * performed of the counter against external sources (NTP/PTP/etc.).
> +	 */
> +	uint64_t disruption_marker;
> +	uint64_t flags;
> +	/* Indicates that the tai_offset_sec field is valid */
> +#define VMCLOCK_FLAG_TAI_OFFSET_VALID		(1 << 0)
> +	/*
> +	 * Optionally used to notify guests of pending maintenance events.
> +	 * A guest which provides latency-sensitive services may wish to
> +	 * remove itself from service if an event is coming up. Two flags
> +	 * indicate the approximate imminence of the event.
> +	 */
> +#define VMCLOCK_FLAG_DISRUPTION_SOON		(1 << 1) /* About a day */
> +#define VMCLOCK_FLAG_DISRUPTION_IMMINENT	(1 << 2) /* About an hour */
> +#define VMCLOCK_FLAG_PERIOD_ESTERROR_VALID	(1 << 3)
> +#define VMCLOCK_FLAG_PERIOD_MAXERROR_VALID	(1 << 4)
> +#define VMCLOCK_FLAG_TIME_ESTERROR_VALID	(1 << 5)
> +#define VMCLOCK_FLAG_TIME_MAXERROR_VALID	(1 << 6)
> +	/*
> +	 * If the MONOTONIC flag is set then (other than leap seconds) it is
> +	 * guaranteed that the time calculated according this structure at
> +	 * any given moment shall never appear to be later than the time
> +	 * calculated via the structure at any *later* moment.
> +	 *
> +	 * In particular, a timestamp based on a counter reading taken
> +	 * immediately after setting the low bit of seq_count (and the
> +	 * associated memory barrier), using the previously-valid time and
> +	 * period fields, shall never be later than a timestamp based on
> +	 * a counter reading taken immediately before *clearing* the low
> +	 * bit again after the update, using the about-to-be-valid fields.
> +	 */
> +#define VMCLOCK_FLAG_TIME_MONOTONIC		(1 << 7)
> +
> +	uint8_t pad[2];
> +	uint8_t clock_status;
> +#define VMCLOCK_STATUS_UNKNOWN		0
> +#define VMCLOCK_STATUS_INITIALIZING	1
> +#define VMCLOCK_STATUS_SYNCHRONIZED	2
> +#define VMCLOCK_STATUS_FREERUNNING	3
> +#define VMCLOCK_STATUS_UNRELIABLE	4
> +
> +	/*
> +	 * The time exposed through this device is never smeared. This field
> +	 * corresponds to the 'subtype' field in virtio-rtc, which indicates
> +	 * the smearing method. However in this case it provides a *hint* to
> +	 * the guest operating system, such that *if* the guest OS wants to
> +	 * provide its users with an alternative clock which does not follow
> +	 * UTC, it may do so in a fashion consistent with the other systems
> +	 * in the nearby environment.
> +	 */
> +	uint8_t leap_second_smearing_hint; /* Matches VIRTIO_RTC_SUBTYPE_xxx */
> +#define VMCLOCK_SMEARING_STRICT		0
> +#define VMCLOCK_SMEARING_NOON_LINEAR	1
> +#define VMCLOCK_SMEARING_UTC_SLS	2
> +	int16_t tai_offset_sec;
> +	uint8_t leap_indicator;
> +	/*
> +	 * This field is based on the the VIRTIO_RTC_LEAP_xxx values as
> +	 * defined in the current draft of virtio-rtc, but since smearing
> +	 * cannot be used with the shared memory device, some values are
> +	 * not used.
> +	 *
> +	 * The _POST_POS and _POST_NEG values allow the guest to perform
> +	 * its own smearing during the day or so after a leap second when
> +	 * such smearing may need to continue being applied for a leap
> +	 * second which is now theoretically "historical".
> +	 */
> +#define VMCLOCK_LEAP_NONE	0x00	/* No known nearby leap second */
> +#define VMCLOCK_LEAP_PRE_POS	0x01	/* Positive leap second at EOM */
> +#define VMCLOCK_LEAP_PRE_NEG	0x02	/* Negative leap second at EOM */
> +#define VMCLOCK_LEAP_POS	0x03	/* Set during 23:59:60 second */
> +#define VMCLOCK_LEAP_POST_POS	0x04
> +#define VMCLOCK_LEAP_POST_NEG	0x05
> +
> +	/* Bit shift for counter_period_frac_sec and its error rate */
> +	uint8_t counter_period_shift;
> +	/*
> +	 * Paired values of counter and UTC at a given point in time.
> +	 */
> +	uint64_t counter_value;
> +	/*
> +	 * Counter frequency, and error margin. The unit of these fields is
> +	 * seconds >> (64 + counter_period_shift)
> +	 */
> +	uint64_t counter_period_frac_sec;
> +	uint64_t counter_period_esterror_rate_frac_sec;
> +	uint64_t counter_period_maxerror_rate_frac_sec;
> +
> +	/*
> +	 * Time according to time_type field above.
> +	 */
> +	uint64_t time_sec;		/* Seconds since time_type epoch */
> +	uint64_t time_frac_sec;		/* (seconds >> 64) */
> +	uint64_t time_esterror_nanosec;
> +	uint64_t time_maxerror_nanosec;
> +};
> +
> +#endif /*  __VMCLOCK_ABI_H__ */
> -- 
> 2.44.0
> 
>
Michael S. Tsirkin July 25, 2024, 5:54 a.m. UTC | #2
On Wed, Jul 24, 2024 at 06:16:37PM +0100, David Woodhouse wrote:
> From: David Woodhouse <dwmw@amazon.co.uk>
> 
> The vmclock "device" provides a shared memory region with precision clock
> information. By using shared memory, it is safe across Live Migration.
> 
> Like the KVM PTP clock, this can convert TSC-based cross timestamps into
> KVM clock values. Unlike the KVM PTP clock, it does so only when such is
> actually helpful.
> 
> The memory region of the device is also exposed to userspace so it can be
> read or memory mapped by application which need reliable notification of
> clock disruptions.
> 
> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>

one other thing worth mentioning is that this design can't work
with confidential computing setups. By comparison, mapping e.g. a
range in a PCI BAR would work for these setups.
Is there a reason this functionality is not interesting for
confidential VMs?
David Woodhouse July 25, 2024, 9:56 a.m. UTC | #3
Hi Michael, thanks for the review!

On Thu, 2024-07-25 at 01:48 -0400, Michael S. Tsirkin wrote:
> On Wed, Jul 24, 2024 at 06:16:37PM +0100, David Woodhouse wrote:
> > From: David Woodhouse <dwmw@amazon.co.uk>
> > 
> > The vmclock "device" provides a shared memory region with precision clock
> > information. By using shared memory, it is safe across Live Migration.
> > 
> > Like the KVM PTP clock, this can convert TSC-based cross timestamps into
> > KVM clock values. Unlike the KVM PTP clock, it does so only when such is
> > actually helpful.
> > 
> > The memory region of the device is also exposed to userspace so it can be
> > read or memory mapped by application which need reliable notification of
> > clock disruptions.
> > 
> > Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
> > ---
> > QEMU implementation at
> > https://git.infradead.org/users/dwmw2/qemu.git/shortlog/refs/heads/vmclock
> > 
> > Although the ACPI device implemented in QEMU (and some other
> > hypervisor) stands alone, most of the fields and values herein are
> > aligned as much as possible with the nascent virtio-rtc specification,
> > with the intent that a version of the same structure can be
> > incorporated into that standard.
> 
> Do you want to just help complete virtio-rtc then? Would be easier than
> trying to keep two specs in sync.

The ACPI version is much more lightweight and doesn't take up a
valuable PCI slot#. (I know, you can do virtio without PCI but that's
complex in other ways).

So I do see these existing in parallel, and I'm happy for the VMCLOCK
device to defer to the shared memory structure in the virtio-rtc
specification once that's final.

I've done my best to ensure we have a fair chance of it being adopted
as-is, and it's versioned so if virtio-rtc gets published with v2 of
the structure, that's also OK.
> 

> > +
> > +       if (st->clk->magic != VMCLOCK_MAGIC ||
> > +           st->clk->size < sizeof(*st->clk) ||
> 
> 
> This is a problem and what I mention below.
> As structure will evolve, sizeof(*st->clk) will grow,
> then it will fail on old hosts.

Potentially, yes. At the time a guest starts to support a larger
version of the struct, it would need to check for the size of the v1
structure.

We certainly *allow* for the structure to evolve, but as you say this
is ABI. So we've tried to plan ahead and avoid having to change it
unless something *really* surprising happens in the field of leap
seconds.


> > +
> > +       /* If the structure is big enough, it can be mapped to userspace */
> > +       if (st->clk->size >= PAGE_SIZE) {
> 
> This is also part of ABI. And a problematic one since linux can
> increase PAGE_SIZE at any time for its own reasons.

Yes. For it to be mapped to userspace, the hypervisor has to put it
into a memory region which is large enough for the page granularity
that the guest happens to use.

The hypervisor implementations so far (for x86 in QEMU since there
seems to be no ACPI support for Arm?, and for x86+Arm in my other pet
hypervisor) expose a 4KiB region.

So if the guest uses larger pages (and can't cope with using 4KiB PTEs
for MMIO mappings), then it can't use this feature unless the
hypervisor gives it a larger region. I think that's OK.

> > +               st->miscdev.minor = MISC_DYNAMIC_MINOR;
> > +               st->miscdev.fops = &vmclock_miscdev_fops;
> > +               st->miscdev.name = st->name;
> > +
> > +               ret = misc_register(&st->miscdev);
> > +               if (ret)
> > +                       goto out;
> > +       }
> 
> Hmm so you want to map a page to userspace, but who will
> keep the clock in that memory page up to date?
> Making hypervisor do it will mean stealing a bunch
> of hypervisor time and we don't *know* userspace is
> going to use it.

I think it should scale OK. The bulk of the actual *calculations* are
done only once by the hypervisor on an NTP skew adjustment anyway,
rather than per-VM. That gives the period of the actual host counter,
and the reference time.

Where TSC scaling *isn't* occurring, all that the VMM needs to do is
copy that information, applying the appropriate TSC offset. I think it
should be mostly in the noise compared with other VMM overhead?

Where TSC scaling is in use, it's a little more complex. The counter
value at the reference time needs to be converted using the standard
mul-and-shift TSC scaling method. But the counter *period* needs to
undergo the reverse conversion. That would be a bit more work if done
naïvely with a right shift and then a *division*, but an appropriate
efficient conversion to match the frequency scaling could be
precalculated, much as KVM already does in kvm_get_time_scale()? Then
the conversion of the counter period could just be a mul and shift too.

If it's just the TSC scaling which pushes it over the edge, then
*maybe* we could add the raw host→guest TSC scaling information to the
structure and make it the guest's problem? I'm not sure I'm overly fond
of that approach.

(You mention feature negotiation later, so we'll come back to that...)

> > +static const struct acpi_device_id vmclock_acpi_ids[] = {
> > +       { "VMCLOCK", 0 },
> > +       {}
> > +};
> > +MODULE_DEVICE_TABLE(acpi, vmclock_acpi_ids);
> 
> This is also part of ABI.
> How do we know no one in the world happens to have a CID like this?
> We should just reserve a valid HID, I think.

Ack. This was largely just based on the existing VMGENID code. Can I
assign a QEMUxxxx HID for it?

> > +struct vmclock_abi {
> > +       /* CONSTANT FIELDS */
> > +       uint32_t magic;
> > +#define VMCLOCK_MAGIC  0x4b4c4356 /* "VCLK" */
> > +       uint32_t size;          /* Size of region containing this structure */
> > +       uint16_t version;       /* 1 */
> 
> I think we need to document how will compatibility be handled,
> otherwise people will just go
> (assert(size == sizeof(struct vmclock_abi))
> and we can't extend it ever.

As you noted, the existing guest code only checks for it being large
*enough*. I've let this "specification" be implementation-led for the
time being, as I'm planning for Peter to accept it into the virtio-rtc
spec and for *that* to be the formal definition, where such admonitions
would reside. 

> I also feel some kind of negotiation so host can detect guest
> expectations would be a good idea, basically a la virtio feature
> negotiation.
> 
> Also, I think it's not great that host is just expected to
> poke at this memory at all times. Being able to know that e.g.
> system is being shut down and so no need to poke
> at memory would be good.

Fair enough. I think that when used in virtio-rtc, doing it through
feature negotiation is probably the right answer?

For the ACPI device, perhaps we can just add an ACPI method to
enable/disable the timekeeping (the disruption signal can always be
operational).
David Woodhouse July 25, 2024, 10 a.m. UTC | #4
On Thu, 2024-07-25 at 01:54 -0400, Michael S. Tsirkin wrote:
> one other thing worth mentioning is that this design can't work
> with confidential computing setups. By comparison, mapping e.g. a
> range in a PCI BAR would work for these setups.

Why so? This is just like mapping a PCI BAR, isn't it? It's cacheable
MMIO space, *not* part of the encrypted guest RAM ranges. It just
happens to be discovered through the _CRS of an ACPI device, not the
BAR of a PCI device.

> Is there a reason this functionality is not interesting for
> confidential VMs?

It is. In fact, that was one of the reasons for doing it as mappable
MMIO space, instead of having the guest allocate a portion of its own
RAM and invoke a hypervisor enlightenment to populate it. (Although the
latter *can* work with CC too, as demonstrated by e.g. ptp_kvm).
Paolo Abeni July 25, 2024, 11:20 a.m. UTC | #5
Hi,

Just a bunch of 'nits below

On 7/24/24 19:16, David Woodhouse wrote:
> diff --git a/drivers/ptp/ptp_vmclock.c b/drivers/ptp/ptp_vmclock.c
> new file mode 100644
> index 000000000000..9c508c21c062
> --- /dev/null
> +++ b/drivers/ptp/ptp_vmclock.c

[...]

> +/*
> + * Multiply a 64-bit count by a 64-bit tick 'period' in units of seconds >> 64
> + * and add the fractional second part of the reference time.
> + *
> + * The result is a 128-bit value, the top 64 bits of which are seconds, and
> + * the low 64 bits are (seconds >> 64).
> + *
> + * If __int128 isn't available, perform the calculation 32 bits at a time to
> + * avoid overflow.
> + */
> +static inline uint64_t mul_u64_u64_shr_add_u64(uint64_t *res_hi, uint64_t delta,
> +					       uint64_t period, uint8_t shift,
> +					       uint64_t frac_sec)

Please, no 'inline' in \.c files

> +{
> +	unsigned __int128 res = (unsigned __int128)delta * period;
> +
> +	res >>= shift;
> +	res += frac_sec;
> +	*res_hi = res >> 64;
> +	return (uint64_t)res;
> +}
> +
> +static inline bool tai_adjust(struct vmclock_abi *clk, uint64_t *sec)
> +{

Same here

> +	if (likely(clk->time_type == VMCLOCK_TIME_UTC))
> +		return true;
> +
> +	if (clk->time_type == VMCLOCK_TIME_TAI &&
> +	    (clk->flags & VMCLOCK_FLAG_TAI_OFFSET_VALID)) {
> +		if (sec)
> +			*sec += clk->tai_offset_sec;
> +		return true;
> +	}
> +	return false;
> +}
> +
> +static int vmclock_get_crosststamp(struct vmclock_state *st,
> +				   struct ptp_system_timestamp *sts,
> +				   struct system_counterval_t *system_counter,
> +				   struct timespec64 *tspec)
> +{
> +	ktime_t deadline = ktime_add(ktime_get(), VMCLOCK_MAX_WAIT);
> +	struct system_time_snapshot systime_snapshot;
> +	uint64_t cycle, delta, seq, frac_sec;
> +
> +#ifdef CONFIG_X86
> +	/*
> +	 * We'd expect the hypervisor to know this and to report the clock
> +	 * status as VMCLOCK_STATUS_UNRELIABLE. But be paranoid.
> +	 */
> +	if (check_tsc_unstable())
> +		return -EINVAL;
> +#endif
> +
> +	while (1) {
> +		seq = st->clk->seq_count & ~1ULL;
> +		virt_rmb();

Please document which other barrier pair witht this one

> +
> +		if (st->clk->clock_status == VMCLOCK_STATUS_UNRELIABLE)
> +			return -EINVAL;
> +
> +		/*
> +		 * When invoked for gettimex64(), fill in the pre/post system
> +		 * times. The simple case is when system time is based on the
> +		 * same counter as st->cs_id, in which case all three times
> +		 * will be derived from the *same* counter value.
> +		 *
> +		 * If the system isn't using the same counter, then the value
> +		 * from ktime_get_snapshot() will still be used as pre_ts, and
> +		 * ptp_read_system_postts() is called to populate postts after
> +		 * calling get_cycles().
> +		 *
> +		 * The conversion to timespec64 happens further down, outside
> +		 * the seq_count loop.
> +		 */
> +		if (sts) {
> +			ktime_get_snapshot(&systime_snapshot);
> +			if (systime_snapshot.cs_id == st->cs_id) {
> +				cycle = systime_snapshot.cycles;
> +			} else {
> +				cycle = get_cycles();
> +				ptp_read_system_postts(sts);
> +			}
> +		} else
> +			cycle = get_cycles();

Please use the brackets even for the else case

[...]
> +static int ptp_vmclock_get_time_fn(ktime_t *device_time,
> +				   struct system_counterval_t *system_counter,
> +				   void *ctx)
> +{
> +	struct vmclock_state *st = ctx;
> +	struct timespec64 tspec;
> +	int ret;
> +
> +#ifdef SUPPORT_KVMCLOCK
> +	if (READ_ONCE(st->sys_cs_id) == CSID_X86_KVM_CLK)
> +		ret = vmclock_get_crosststamp_kvmclock(st, NULL, system_counter,
> +						       &tspec);
> +	else
> +#endif
> +		ret = vmclock_get_crosststamp(st, NULL, system_counter, &tspec);
> +
> +	if (!ret)
> +		*device_time = timespec64_to_ktime(tspec);
> +
> +	return ret;
> +}
> +
> +

Please, don't add 2 consecutive blank lines.

Cheers,

Paolo
Daniel P. Berrangé July 25, 2024, 11:31 a.m. UTC | #6
On Thu, Jul 25, 2024 at 10:56:05AM +0100, David Woodhouse wrote:
> Hi Michael, thanks for the review!
> 
> On Thu, 2024-07-25 at 01:48 -0400, Michael S. Tsirkin wrote:
> > On Wed, Jul 24, 2024 at 06:16:37PM +0100, David Woodhouse wrote:
> > > From: David Woodhouse <dwmw@amazon.co.uk>
> > > 
> > > The vmclock "device" provides a shared memory region with precision clock
> > > information. By using shared memory, it is safe across Live Migration.
> > > 
> > > Like the KVM PTP clock, this can convert TSC-based cross timestamps into
> > > KVM clock values. Unlike the KVM PTP clock, it does so only when such is
> > > actually helpful.
> > > 
> > > The memory region of the device is also exposed to userspace so it can be
> > > read or memory mapped by application which need reliable notification of
> > > clock disruptions.
> > > 
> > > Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
> > > ---
> > > QEMU implementation at
> > > https://git.infradead.org/users/dwmw2/qemu.git/shortlog/refs/heads/vmclock
> > > 
> > > Although the ACPI device implemented in QEMU (and some other
> > > hypervisor) stands alone, most of the fields and values herein are
> > > aligned as much as possible with the nascent virtio-rtc specification,
> > > with the intent that a version of the same structure can be
> > > incorporated into that standard.
> > 
> > Do you want to just help complete virtio-rtc then? Would be easier than
> > trying to keep two specs in sync.
> 
> The ACPI version is much more lightweight and doesn't take up a
> valuable PCI slot#. (I know, you can do virtio without PCI but that's
> complex in other ways).

In general it shouldn't have to take up a PCI slot, that's just
a common default policy. virtio-devices only need a dedicated
slot if there's a need to do hotplug/unplug of them. There is a
set of core devices for which hotplug doesn't make sense, which
could all be put as functions in the same slot. ie virtio-rng,
virtio-balloon and virtio-rtc, could all live in one slot.

With regards,
Daniel
David Woodhouse July 25, 2024, 11:49 a.m. UTC | #7
On Thu, 2024-07-25 at 13:20 +0200, Paolo Abeni wrote:
> 
> 
> Just a bunch of 'nits below

Thank you. Fixed in
https://git.infradead.org/users/dwmw2/linux.git/shortlog/refs/heads/vmclock

I'll post a new version once I've finished resolving mst's and any
other feedback.
David Woodhouse July 25, 2024, 11:53 a.m. UTC | #8
On Thu, 2024-07-25 at 12:31 +0100, Daniel P. Berrangé wrote:
> On Thu, Jul 25, 2024 at 10:56:05AM +0100, David Woodhouse wrote:
> > Hi Michael, thanks for the review!
> > 
> > On Thu, 2024-07-25 at 01:48 -0400, Michael S. Tsirkin wrote:
> > > Do you want to just help complete virtio-rtc then? Would be easier than
> > > trying to keep two specs in sync.
> > 
> > The ACPI version is much more lightweight and doesn't take up a
> > valuable PCI slot#. (I know, you can do virtio without PCI but that's
> > complex in other ways).
> 
> In general it shouldn't have to take up a PCI slot, that's just
> a common default policy. virtio-devices only need a dedicated
> slot if there's a need to do hotplug/unplug of them. There is a
> set of core devices for which hotplug doesn't make sense, which
> could all be put as functions in the same slot. ie virtio-rng,
> virtio-balloon and virtio-rtc, could all live in one slot.

But if you don't have any virtio devices already, you still need one
slot to put them in.
Daniel P. Berrangé July 25, 2024, noon UTC | #9
On Thu, Jul 25, 2024 at 12:53:34PM +0100, David Woodhouse wrote:
> On Thu, 2024-07-25 at 12:31 +0100, Daniel P. Berrangé wrote:
> > On Thu, Jul 25, 2024 at 10:56:05AM +0100, David Woodhouse wrote:
> > > Hi Michael, thanks for the review!
> > > 
> > > On Thu, 2024-07-25 at 01:48 -0400, Michael S. Tsirkin wrote:
> > > > Do you want to just help complete virtio-rtc then? Would be easier than
> > > > trying to keep two specs in sync.
> > > 
> > > The ACPI version is much more lightweight and doesn't take up a
> > > valuable PCI slot#. (I know, you can do virtio without PCI but that's
> > > complex in other ways).
> > 
> > In general it shouldn't have to take up a PCI slot, that's just
> > a common default policy. virtio-devices only need a dedicated
> > slot if there's a need to do hotplug/unplug of them. There is a
> > set of core devices for which hotplug doesn't make sense, which
> > could all be put as functions in the same slot. ie virtio-rng,
> > virtio-balloon and virtio-rtc, could all live in one slot.
> 
> But if you don't have any virtio devices already, you still need one
> slot to put them in.

Perhaps - it may still be practical to be a function within a slot
shared with non-virtio PCI devices too.

With regards,
Daniel
Michael S. Tsirkin July 25, 2024, 12:17 p.m. UTC | #10
On Thu, Jul 25, 2024 at 10:56:05AM +0100, David Woodhouse wrote:
> > Do you want to just help complete virtio-rtc then? Would be easier than
> > trying to keep two specs in sync.
> 
> The ACPI version is much more lightweight and doesn't take up a
> valuable PCI slot#. (I know, you can do virtio without PCI but that's
> complex in other ways).
> 

Hmm, should we support virtio over ACPI? Just asking.
David Woodhouse July 25, 2024, 12:27 p.m. UTC | #11
On Thu, 2024-07-25 at 08:17 -0400, Michael S. Tsirkin wrote:
> On Thu, Jul 25, 2024 at 10:56:05AM +0100, David Woodhouse wrote:
> > > Do you want to just help complete virtio-rtc then? Would be easier than
> > > trying to keep two specs in sync.
> > 
> > The ACPI version is much more lightweight and doesn't take up a
> > valuable PCI slot#. (I know, you can do virtio without PCI but that's
> > complex in other ways).
> > 
> 
> Hmm, should we support virtio over ACPI? Just asking.

Given that we support virtio DT bindings, and the ACPI "PRP0001" device
exists with a DSM method which literally returns DT properties,
including such properties as "compatible=virtio,mmio" ... do we
already?
Michael S. Tsirkin July 25, 2024, 12:29 p.m. UTC | #12
On Thu, Jul 25, 2024 at 01:27:49PM +0100, David Woodhouse wrote:
> On Thu, 2024-07-25 at 08:17 -0400, Michael S. Tsirkin wrote:
> > On Thu, Jul 25, 2024 at 10:56:05AM +0100, David Woodhouse wrote:
> > > > Do you want to just help complete virtio-rtc then? Would be easier than
> > > > trying to keep two specs in sync.
> > > 
> > > The ACPI version is much more lightweight and doesn't take up a
> > > valuable PCI slot#. (I know, you can do virtio without PCI but that's
> > > complex in other ways).
> > > 
> > 
> > Hmm, should we support virtio over ACPI? Just asking.
> 
> Given that we support virtio DT bindings, and the ACPI "PRP0001" device
> exists with a DSM method which literally returns DT properties,
> including such properties as "compatible=virtio,mmio" ... do we
> already?
> 
> 

In a sense, but you are saying that is too complex?
Can you elaborate?
David Woodhouse July 25, 2024, 12:31 p.m. UTC | #13
On Thu, 2024-07-25 at 08:29 -0400, Michael S. Tsirkin wrote:
> On Thu, Jul 25, 2024 at 01:27:49PM +0100, David Woodhouse wrote:
> > On Thu, 2024-07-25 at 08:17 -0400, Michael S. Tsirkin wrote:
> > > On Thu, Jul 25, 2024 at 10:56:05AM +0100, David Woodhouse wrote:
> > > > > Do you want to just help complete virtio-rtc then? Would be easier than
> > > > > trying to keep two specs in sync.
> > > > 
> > > > The ACPI version is much more lightweight and doesn't take up a
> > > > valuable PCI slot#. (I know, you can do virtio without PCI but that's
> > > > complex in other ways).
> > > > 
> > > 
> > > Hmm, should we support virtio over ACPI? Just asking.
> > 
> > Given that we support virtio DT bindings, and the ACPI "PRP0001" device
> > exists with a DSM method which literally returns DT properties,
> > including such properties as "compatible=virtio,mmio" ... do we
> > already?
> > 
> > 
> 
> In a sense, but you are saying that is too complex?
> Can you elaborate?

No, I think it's fine. I encourage the use of the PRP0001 device to
expose DT devices through ACPI. I was just reminding you of its
existence.
Michael S. Tsirkin July 25, 2024, 12:33 p.m. UTC | #14
On Thu, Jul 25, 2024 at 01:31:19PM +0100, David Woodhouse wrote:
> On Thu, 2024-07-25 at 08:29 -0400, Michael S. Tsirkin wrote:
> > On Thu, Jul 25, 2024 at 01:27:49PM +0100, David Woodhouse wrote:
> > > On Thu, 2024-07-25 at 08:17 -0400, Michael S. Tsirkin wrote:
> > > > On Thu, Jul 25, 2024 at 10:56:05AM +0100, David Woodhouse wrote:
> > > > > > Do you want to just help complete virtio-rtc then? Would be easier than
> > > > > > trying to keep two specs in sync.
> > > > > 
> > > > > The ACPI version is much more lightweight and doesn't take up a
> > > > > valuable PCI slot#. (I know, you can do virtio without PCI but that's
> > > > > complex in other ways).
> > > > > 
> > > > 
> > > > Hmm, should we support virtio over ACPI? Just asking.
> > > 
> > > Given that we support virtio DT bindings, and the ACPI "PRP0001" device
> > > exists with a DSM method which literally returns DT properties,
> > > including such properties as "compatible=virtio,mmio" ... do we
> > > already?
> > > 
> > > 
> > 
> > In a sense, but you are saying that is too complex?
> > Can you elaborate?
> 
> No, I think it's fine. I encourage the use of the PRP0001 device to
> expose DT devices through ACPI. I was just reminding you of its
> existence.
> 
> 

Confused. You said "I know, you can do virtio without PCI but that's
complex in other ways" as the explanation why you are doing a custom
protocol.
David Woodhouse July 25, 2024, 1:50 p.m. UTC | #15
On Thu, 2024-07-25 at 08:33 -0400, Michael S. Tsirkin wrote:
> On Thu, Jul 25, 2024 at 01:31:19PM +0100, David Woodhouse wrote:
> > On Thu, 2024-07-25 at 08:29 -0400, Michael S. Tsirkin wrote:
> > > On Thu, Jul 25, 2024 at 01:27:49PM +0100, David Woodhouse wrote:
> > > > On Thu, 2024-07-25 at 08:17 -0400, Michael S. Tsirkin wrote:
> > > > > On Thu, Jul 25, 2024 at 10:56:05AM +0100, David Woodhouse wrote:
> > > > > > > Do you want to just help complete virtio-rtc then? Would be easier than
> > > > > > > trying to keep two specs in sync.
> > > > > > 
> > > > > > The ACPI version is much more lightweight and doesn't take up a
> > > > > > valuable PCI slot#. (I know, you can do virtio without PCI but that's
> > > > > > complex in other ways).
> > > > > > 
> > > > > 
> > > > > Hmm, should we support virtio over ACPI? Just asking.
> > > > 
> > > > Given that we support virtio DT bindings, and the ACPI "PRP0001" device
> > > > exists with a DSM method which literally returns DT properties,
> > > > including such properties as "compatible=virtio,mmio" ... do we
> > > > already?
> > > > 
> > > > 
> > > 
> > > In a sense, but you are saying that is too complex?
> > > Can you elaborate?
> > 
> > No, I think it's fine. I encourage the use of the PRP0001 device to
> > expose DT devices through ACPI. I was just reminding you of its
> > existence.
> 
> Confused. You said "I know, you can do virtio without PCI but that's
> complex in other ways" as the explanation why you are doing a custom
> protocol.

Ah, apologies, I wasn't thinking that far back in the conversation.

If we wanted to support virtio over ACPI, I think PRP0001 can be made
to work and isn't too complex (even though it probably doesn't yet work
out of the box).

But for the VMCLOCK thing, yes, the simple ACPI device is a lot simpler
than virtio-rtc and much more attractive.

Even if the virtio-rtc specification were official today, and I was
able to expose it via PCI, I probably wouldn't do it that way. There's
just far more in virtio-rtc than we need; the simple shared memory
region is perfectly sufficient for most needs, and especially ours.

I have reworked
https://git.infradead.org/users/dwmw2/linux.git/shortlog/refs/heads/vmclock
to take your other feedback into account.

It's now more flexible about the size handling, and explicitly checking
that specific fields are present before using them. 

I think I'm going to add a method on the ACPI device to enable the
precise clock information. I haven't done that in the driver yet; it
still just consumes the precise clock information if it happens to be
present already. The enable method can be added in a compatible fashion
(the failure mode is that guests which don't invoke this method when
the hypervisor needs them to will see only the disruption signal and
not precise time).

For the HID I'm going to use AMZNVCLK. I had used QEMUVCLK in the QEMU
patches, but I'll change that to use AMZNVCLK too when I repost the
QEMU patch.
Michael S. Tsirkin July 25, 2024, 2:11 p.m. UTC | #16
On Thu, Jul 25, 2024 at 02:50:50PM +0100, David Woodhouse wrote:
> Even if the virtio-rtc specification were official today, and I was
> able to expose it via PCI, I probably wouldn't do it that way. There's
> just far more in virtio-rtc than we need; the simple shared memory
> region is perfectly sufficient for most needs, and especially ours.

I can't stop amazon from shipping whatever in its hypervisor,
I'd just like to understand this better, if there is a use-case
not addressed here then we can change virtio to address it.

The rtc driver patch posted is 900 lines, yours is 700 lines, does not
look like a big difference.  As for using a memory region, this is
valid, but maybe rtc should be changed to do exactly that?
E.g. we can easily add a capability describing such a region.
or put it in device config space.

I mean yes, we can build a new transport for each specific need but in
the end we'll get a ton of interfaces with unclear compatibility
requirements.  If effort is instead spent improving common interfaces,
we get consistency and everyone benefits. That's why I'm trying to
understand the need here.
David Woodhouse July 25, 2024, 3:18 p.m. UTC | #17
On Thu, 2024-07-25 at 10:11 -0400, Michael S. Tsirkin wrote:
> On Thu, Jul 25, 2024 at 02:50:50PM +0100, David Woodhouse wrote:
> > Even if the virtio-rtc specification were official today, and I was
> > able to expose it via PCI, I probably wouldn't do it that way. There's
> > just far more in virtio-rtc than we need; the simple shared memory
> > region is perfectly sufficient for most needs, and especially ours.
> 
> I can't stop amazon from shipping whatever in its hypervisor,
> I'd just like to understand this better, if there is a use-case
> not addressed here then we can change virtio to address it.
> 
> The rtc driver patch posted is 900 lines, yours is 700 lines, does not
> look like a big difference.  As for using a memory region, this is
> valid, but maybe rtc should be changed to do exactly that?

I'm certainly aiming for virtio-rtc to include that as an *option*,
because I think I don't think it makes sense for an RTC specification
aimed at virtual machines *not* to deal with the live migration
problem.

AFAICT the only ways to deal with the LM problem are either to make a
hypercall/virtio transaction for *every* clock read which needs to be
accurate, or expose a memory region for the guest to do it "vDSO-
style".

And similarly, unless we want guest userspace to have to make a
*system* call every time, that memory region needs to be mappable all
the way to userspace.

The use case isn't necessarily for all users of gettimeofday(), of
course; this is for those applications which *need* precision time.
Like distributed databases which rely on timestamps for coherency, and
users who get fined millions of dollars when LM messes up their clocks
and they put wrong timestamps on financial transactions.

> E.g. we can easily add a capability describing such a region.
> or put it in device config space.

I think it has to be memory, not config space. But yes.

The intent is that my driver would be usable with the shared memory
region from a virtio-rtc device too. It'd need a tiny amount of
refactoring of the discovery code in vmclock_probe(), which I haven't
done yet as it would be premature optimisation. 

> I mean yes, we can build a new transport for each specific need but in
> the end we'll get a ton of interfaces with unclear compatibility
> requirements.  If effort is instead spent improving common interfaces,
> we get consistency and everyone benefits. That's why I'm trying to
> understand the need here.

It's simplicity. Because this isn't even a "transport". It's just a
simple breadcrumb given to the guest to tell it where the information
is.

In the fullness of time assuming this becomes part of virtio-rtc too,
the fact that it can *also* be discovered by ACPI is just a tiny
detail. And it allows hypervisors to implement it a *whole* lot more
simply.

The addition of an ACPI method to enable the timekeeping does make it a
tiny bit more than a 'breadcrump', I concede — but that's still
basically trivial to implement. A whole lot simpler than a full virtio
device.
Michael S. Tsirkin July 25, 2024, 4:38 p.m. UTC | #18
On Thu, Jul 25, 2024 at 04:18:43PM +0100, David Woodhouse wrote:
> On Thu, 2024-07-25 at 10:11 -0400, Michael S. Tsirkin wrote:
> > On Thu, Jul 25, 2024 at 02:50:50PM +0100, David Woodhouse wrote:
> > > Even if the virtio-rtc specification were official today, and I was
> > > able to expose it via PCI, I probably wouldn't do it that way. There's
> > > just far more in virtio-rtc than we need; the simple shared memory
> > > region is perfectly sufficient for most needs, and especially ours.
> > 
> > I can't stop amazon from shipping whatever in its hypervisor,
> > I'd just like to understand this better, if there is a use-case
> > not addressed here then we can change virtio to address it.
> > 
> > The rtc driver patch posted is 900 lines, yours is 700 lines, does not
> > look like a big difference.  As for using a memory region, this is
> > valid, but maybe rtc should be changed to do exactly that?
> 
> I'm certainly aiming for virtio-rtc to include that as an *option*,
> because I think I don't think it makes sense for an RTC specification
> aimed at virtual machines *not* to deal with the live migration
> problem.
> 
> AFAICT the only ways to deal with the LM problem are either to make a
> hypercall/virtio transaction for *every* clock read which needs to be
> accurate, or expose a memory region for the guest to do it "vDSO-
> style".

virtio can support the second option, we already have
VIRTIO_PCI_CAP_SHARED_MEMORY_CFG, I'd just use it.


> And similarly, unless we want guest userspace to have to make a
> *system* call every time, that memory region needs to be mappable all
> the way to userspace.

This part is classic for pci, mapping pci bar has been well
studied.


> The use case isn't necessarily for all users of gettimeofday(), of
> course; this is for those applications which *need* precision time.
> Like distributed databases which rely on timestamps for coherency, and
> users who get fined millions of dollars when LM messes up their clocks
> and they put wrong timestamps on financial transactions.

I would however worry that with all this pass through,
applications have to be coded to each hypervisor or even
version of the hypervisor.

I don't really know the use-case well enough - is sending
an interrupt to linux and having linux create a device
independent structure not workable?


> > E.g. we can easily add a capability describing such a region.
> > or put it in device config space.
> 
> I think it has to be memory, not config space. But yes.

virtio config space, which is just a region in a BAR.
But yes, maybe VIRTIO_PCI_CAP_SHARED_MEMORY_CFG is cleaner.

> The intent is that my driver would be usable with the shared memory
> region from a virtio-rtc device too. It'd need a tiny amount of
> refactoring of the discovery code in vmclock_probe(), which I haven't
> done yet as it would be premature optimisation. 
> 
> > I mean yes, we can build a new transport for each specific need but in
> > the end we'll get a ton of interfaces with unclear compatibility
> > requirements.  If effort is instead spent improving common interfaces,
> > we get consistency and everyone benefits. That's why I'm trying to
> > understand the need here.
> 
> It's simplicity. Because this isn't even a "transport". It's just a
> simple breadcrumb given to the guest to tell it where the information
> is.
> In the fullness of time assuming this becomes part of virtio-rtc too,
> the fact that it can *also* be discovered by ACPI is just a tiny
> detail. And it allows hypervisors to implement it a *whole* lot more
> simply.
> 
> The addition of an ACPI method to enable the timekeeping does make it a
> tiny bit more than a 'breadcrump', I concede — but that's still
> basically trivial to implement. A whole lot simpler than a full virtio
> device.

virtio has been developed with the painful experience that we keep
making mistakes, or coming up with new needed features,
and that maintaining forward and backward compatibility
becomes a whole lot harder than it seems in the beginning.
David Woodhouse July 25, 2024, 7:35 p.m. UTC | #19
On Thu, 2024-07-25 at 12:38 -0400, Michael S. Tsirkin wrote:
> On Thu, Jul 25, 2024 at 04:18:43PM +0100, David Woodhouse wrote:
> > The use case isn't necessarily for all users of gettimeofday(), of
> > course; this is for those applications which *need* precision time.
> > Like distributed databases which rely on timestamps for coherency, and
> > users who get fined millions of dollars when LM messes up their clocks
> > and they put wrong timestamps on financial transactions.
> 
> I would however worry that with all this pass through,
> applications have to be coded to each hypervisor or even
> version of the hypervisor.

Yes, that would be a problem. Which is why I feel it's so important to
harmonise the contents of the shared memory, and I'm implementing it
both QEMU and $DAYJOB, as well as aligning with virtio-rtc.

I don't think the structure should be changing between hypervisors (and
especially versions). We *will* see a progression from simply providing
the disruption signal, to providing the full clock information so that
guests don't have to abort transactions while they resync their clock.
But that's perfectly fine.

And it's also entirely agnostic to the mechanism by which the memory
region is *discovered*. It doesn't matter if it's ACPI, DT, a
hypervisor enlightenment, a BAR of a simple PCI device, virtio, or
anything else.

ACPI is one of the *simplest* options for a hypervisor and guest to
implement, and doesn't prevent us from using the same structure in
virtio-rtc. I'm happy enough using ACPI and letting virtio-rtc come
along later.

> virtio has been developed with the painful experience that we keep
> making mistakes, or coming up with new needed features,
> and that maintaining forward and backward compatibility
> becomes a whole lot harder than it seems in the beginning.

Yes. But as you note, this shared memory structure is a userspace ABI
all of its own, so we get to make a completely *different* kind of
mistake :)
Michael S. Tsirkin July 25, 2024, 8:50 p.m. UTC | #20
On Thu, Jul 25, 2024 at 08:35:40PM +0100, David Woodhouse wrote:
> On Thu, 2024-07-25 at 12:38 -0400, Michael S. Tsirkin wrote:
> > On Thu, Jul 25, 2024 at 04:18:43PM +0100, David Woodhouse wrote:
> > > The use case isn't necessarily for all users of gettimeofday(), of
> > > course; this is for those applications which *need* precision time.
> > > Like distributed databases which rely on timestamps for coherency, and
> > > users who get fined millions of dollars when LM messes up their clocks
> > > and they put wrong timestamps on financial transactions.
> > 
> > I would however worry that with all this pass through,
> > applications have to be coded to each hypervisor or even
> > version of the hypervisor.
> 
> Yes, that would be a problem. Which is why I feel it's so important to
> harmonise the contents of the shared memory, and I'm implementing it
> both QEMU and $DAYJOB, as well as aligning with virtio-rtc.


Writing an actual spec for this would be another thing that might help.

> I don't think the structure should be changing between hypervisors (and
> especially versions). We *will* see a progression from simply providing
> the disruption signal, to providing the full clock information so that
> guests don't have to abort transactions while they resync their clock.
> But that's perfectly fine.
> 
> And it's also entirely agnostic to the mechanism by which the memory
> region is *discovered*. It doesn't matter if it's ACPI, DT, a
> hypervisor enlightenment, a BAR of a simple PCI device, virtio, or
> anything else.
> 
> ACPI is one of the *simplest* options for a hypervisor and guest to
> implement, and doesn't prevent us from using the same structure in
> virtio-rtc. I'm happy enough using ACPI and letting virtio-rtc come
> along later.
> 
> > virtio has been developed with the painful experience that we keep
> > making mistakes, or coming up with new needed features,
> > and that maintaining forward and backward compatibility
> > becomes a whole lot harder than it seems in the beginning.
> 
> Yes. But as you note, this shared memory structure is a userspace ABI
> all of its own, so we get to make a completely *different* kind of
> mistake :)
> 


So, something I still don't completely understand.
Can't the VDSO thing be written to by kernel?
Let's say on LM, an interrupt triggers and kernel copies
data from a specific device to the VDSO.

Is that problematic somehow? I imagine there is a race where
userspace reads vdso after lm but before kernel updated
vdso - is that the concern?

Then can't we fix it by interrupting all CPUs right after LM?

To me that seems like a cleaner approach - we then compartmentalize
the ABI issue - kernel has its own ABI against userspace,
devices have their own ABI against kernel.
It'd mean we need a way to detect that interrupt was sent,
maybe yet another counter inside that structure.

WDYT?

By the way the same idea would work for snapshots -
some people wanted to expose that info to userspace, too.
David Woodhouse July 25, 2024, 9 p.m. UTC | #21
On Thu, 2024-07-25 at 16:50 -0400, Michael S. Tsirkin wrote:
> On Thu, Jul 25, 2024 at 08:35:40PM +0100, David Woodhouse wrote:
> > On Thu, 2024-07-25 at 12:38 -0400, Michael S. Tsirkin wrote:
> > > On Thu, Jul 25, 2024 at 04:18:43PM +0100, David Woodhouse wrote:
> > > > The use case isn't necessarily for all users of gettimeofday(), of
> > > > course; this is for those applications which *need* precision time.
> > > > Like distributed databases which rely on timestamps for coherency, and
> > > > users who get fined millions of dollars when LM messes up their clocks
> > > > and they put wrong timestamps on financial transactions.
> > > 
> > > I would however worry that with all this pass through,
> > > applications have to be coded to each hypervisor or even
> > > version of the hypervisor.
> > 
> > Yes, that would be a problem. Which is why I feel it's so important to
> > harmonise the contents of the shared memory, and I'm implementing it
> > both QEMU and $DAYJOB, as well as aligning with virtio-rtc.
> 
> 
> Writing an actual spec for this would be another thing that might help.
> 

> > I don't think the structure should be changing between hypervisors (and
> > especially versions). We *will* see a progression from simply providing
> > the disruption signal, to providing the full clock information so that
> > guests don't have to abort transactions while they resync their clock.
> > But that's perfectly fine.
> > 
> > And it's also entirely agnostic to the mechanism by which the memory
> > region is *discovered*. It doesn't matter if it's ACPI, DT, a
> > hypervisor enlightenment, a BAR of a simple PCI device, virtio, or
> > anything else.
> > 
> > ACPI is one of the *simplest* options for a hypervisor and guest to
> > implement, and doesn't prevent us from using the same structure in
> > virtio-rtc. I'm happy enough using ACPI and letting virtio-rtc come
> > along later.
> > 
> > > virtio has been developed with the painful experience that we keep
> > > making mistakes, or coming up with new needed features,
> > > and that maintaining forward and backward compatibility
> > > becomes a whole lot harder than it seems in the beginning.
> > 
> > Yes. But as you note, this shared memory structure is a userspace ABI
> > all of its own, so we get to make a completely *different* kind of
> > mistake :)
> > 
> 
> 
> So, something I still don't completely understand.
> Can't the VDSO thing be written to by kernel?
> Let's say on LM, an interrupt triggers and kernel copies
> data from a specific device to the VDSO.
> 
> Is that problematic somehow? I imagine there is a race where
> userspace reads vdso after lm but before kernel updated
> vdso - is that the concern?
> 
> Then can't we fix it by interrupting all CPUs right after LM?
> 
> To me that seems like a cleaner approach - we then compartmentalize
> the ABI issue - kernel has its own ABI against userspace,
> devices have their own ABI against kernel.
> It'd mean we need a way to detect that interrupt was sent,
> maybe yet another counter inside that structure.
> 
> WDYT?
> 
> By the way the same idea would work for snapshots -
> some people wanted to expose that info to userspace, too.
>
Michael S. Tsirkin July 25, 2024, 9:04 p.m. UTC | #22
On Thu, Jul 25, 2024 at 10:00:24PM +0100, David Woodhouse wrote:
> On Thu, 2024-07-25 at 16:50 -0400, Michael S. Tsirkin wrote:
> > On Thu, Jul 25, 2024 at 08:35:40PM +0100, David Woodhouse wrote:
> > > On Thu, 2024-07-25 at 12:38 -0400, Michael S. Tsirkin wrote:
> > > > On Thu, Jul 25, 2024 at 04:18:43PM +0100, David Woodhouse wrote:
> > > > > The use case isn't necessarily for all users of gettimeofday(), of
> > > > > course; this is for those applications which *need* precision time.
> > > > > Like distributed databases which rely on timestamps for coherency, and
> > > > > users who get fined millions of dollars when LM messes up their clocks
> > > > > and they put wrong timestamps on financial transactions.
> > > > 
> > > > I would however worry that with all this pass through,
> > > > applications have to be coded to each hypervisor or even
> > > > version of the hypervisor.
> > > 
> > > Yes, that would be a problem. Which is why I feel it's so important to
> > > harmonise the contents of the shared memory, and I'm implementing it
> > > both QEMU and $DAYJOB, as well as aligning with virtio-rtc.
> > 
> > 
> > Writing an actual spec for this would be another thing that might help.
> > 
> 
> > > I don't think the structure should be changing between hypervisors (and
> > > especially versions). We *will* see a progression from simply providing
> > > the disruption signal, to providing the full clock information so that
> > > guests don't have to abort transactions while they resync their clock.
> > > But that's perfectly fine.
> > > 
> > > And it's also entirely agnostic to the mechanism by which the memory
> > > region is *discovered*. It doesn't matter if it's ACPI, DT, a
> > > hypervisor enlightenment, a BAR of a simple PCI device, virtio, or
> > > anything else.
> > > 
> > > ACPI is one of the *simplest* options for a hypervisor and guest to
> > > implement, and doesn't prevent us from using the same structure in
> > > virtio-rtc. I'm happy enough using ACPI and letting virtio-rtc come
> > > along later.
> > > 
> > > > virtio has been developed with the painful experience that we keep
> > > > making mistakes, or coming up with new needed features,
> > > > and that maintaining forward and backward compatibility
> > > > becomes a whole lot harder than it seems in the beginning.
> > > 
> > > Yes. But as you note, this shared memory structure is a userspace ABI
> > > all of its own, so we get to make a completely *different* kind of
> > > mistake :)
> > > 
> > 
> > 
> > So, something I still don't completely understand.
> > Can't the VDSO thing be written to by kernel?
> > Let's say on LM, an interrupt triggers and kernel copies
> > data from a specific device to the VDSO.
> > 
> > Is that problematic somehow? I imagine there is a race where
> > userspace reads vdso after lm but before kernel updated
> > vdso - is that the concern?
> > 
> > Then can't we fix it by interrupting all CPUs right after LM?
> > 
> > To me that seems like a cleaner approach - we then compartmentalize
> > the ABI issue - kernel has its own ABI against userspace,
> > devices have their own ABI against kernel.
> > It'd mean we need a way to detect that interrupt was sent,
> > maybe yet another counter inside that structure.
> > 
> > WDYT?
> > 
> > By the way the same idea would work for snapshots -
> > some people wanted to expose that info to userspace, too.
> > 
> 



was there supposed to be text here, or did you just like this
so much you decided to repost my mail ;)
David Woodhouse July 25, 2024, 9:29 p.m. UTC | #23
On Thu, 2024-07-25 at 17:04 -0400, Michael S. Tsirkin wrote:
> On Thu, Jul 25, 2024 at 10:00:24PM +0100, David Woodhouse wrote:
> > On Thu, 2024-07-25 at 16:50 -0400, Michael S. Tsirkin wrote:
> > > On Thu, Jul 25, 2024 at 08:35:40PM +0100, David Woodhouse wrote:
> > > > On Thu, 2024-07-25 at 12:38 -0400, Michael S. Tsirkin wrote:
> > > > > On Thu, Jul 25, 2024 at 04:18:43PM +0100, David Woodhouse wrote:
> > > > > > The use case isn't necessarily for all users of gettimeofday(), of
> > > > > > course; this is for those applications which *need* precision time.
> > > > > > Like distributed databases which rely on timestamps for coherency, and
> > > > > > users who get fined millions of dollars when LM messes up their clocks
> > > > > > and they put wrong timestamps on financial transactions.
> > > > > 
> > > > > I would however worry that with all this pass through,
> > > > > applications have to be coded to each hypervisor or even
> > > > > version of the hypervisor.
> > > > 
> > > > Yes, that would be a problem. Which is why I feel it's so important to
> > > > harmonise the contents of the shared memory, and I'm implementing it
> > > > both QEMU and $DAYJOB, as well as aligning with virtio-rtc.
> > > 
> > > 
> > > Writing an actual spec for this would be another thing that might help.

Potentially, although working over it with our internal clock team and
with Peter on virtio-rtc has put us in good shape. I'm confident now
that we have something that's viable and extensible enough.

> > > 
> > > > > virtio has been developed with the painful experience that we keep
> > > > > making mistakes, or coming up with new needed features,
> > > > > and that maintaining forward and backward compatibility
> > > > > becomes a whole lot harder than it seems in the beginning.
> > > > 
> > > > Yes. But as you note, this shared memory structure is a userspace ABI
> > > > all of its own, so we get to make a completely *different* kind of
> > > > mistake :)
> > > > 
> > > 
> > > 
> > > So, something I still don't completely understand.
> > > Can't the VDSO thing be written to by kernel?
> > > Let's say on LM, an interrupt triggers and kernel copies
> > > data from a specific device to the VDSO.
> > > 
> > > Is that problematic somehow? I imagine there is a race where
> > > userspace reads vdso after lm but before kernel updated
> > > vdso - is that the concern?

Yes.

> > > Then can't we fix it by interrupting all CPUs right after LM?
> > > 
> > > To me that seems like a cleaner approach - we then compartmentalize
> > > the ABI issue - kernel has its own ABI against userspace,
> > > devices have their own ABI against kernel.
> > > It'd mean we need a way to detect that interrupt was sent,
> > > maybe yet another counter inside that structure.
> > > 
> > > WDYT?
> > > 
> > > By the way the same idea would work for snapshots -
> > > some people wanted to expose that info to userspace, too.

Those people included me. I wanted to interrupt all the vCPUs, even the
ones which were in userspace at the moment of migration, and have the
kernel deal with passing it on to userspace via a different ABI.

It ends up being complex and intricate, and requiring a lot of new
kernel and userspace support. I gave up on it in the end for snapshots,
and didn't go there again for this.

By contrast, a driver which merely exposes a page of MMIO space
identified by an ACPI device (without even the in-kernel PTP support)
could probably be fewer than a hundred lines of code. In an externally-
buildable module that goes back as far as RHEL8 or even further,
allowing users to just build and use it from their application.

> was there supposed to be text here, or did you just like this
> so much you decided to repost my mail ;) 

Hm, weirdness. I've known Evolution get into a state where it sends
completely *empty* messages, but I've never seen it eat only my own
part before. I had definitely typed responses (along the lines of the
above) last time.
Michael S. Tsirkin July 25, 2024, 9:47 p.m. UTC | #24
On Thu, Jul 25, 2024 at 10:29:18PM +0100, David Woodhouse wrote:
> > > > Then can't we fix it by interrupting all CPUs right after LM?
> > > > 
> > > > To me that seems like a cleaner approach - we then compartmentalize
> > > > the ABI issue - kernel has its own ABI against userspace,
> > > > devices have their own ABI against kernel.
> > > > It'd mean we need a way to detect that interrupt was sent,
> > > > maybe yet another counter inside that structure.
> > > > 
> > > > WDYT?
> > > > 
> > > > By the way the same idea would work for snapshots -
> > > > some people wanted to expose that info to userspace, too.
> 
> Those people included me. I wanted to interrupt all the vCPUs, even the
> ones which were in userspace at the moment of migration, and have the
> kernel deal with passing it on to userspace via a different ABI.
> 
> It ends up being complex and intricate, and requiring a lot of new
> kernel and userspace support. I gave up on it in the end for snapshots,
> and didn't go there again for this.

ok I believe you, I am just curious how come you need userspace
support - what I imagine would live completely in kernel ...


> By contrast, a driver which merely exposes a page of MMIO space
> identified by an ACPI device (without even the in-kernel PTP support)
> could probably be fewer than a hundred lines of code. In an externally-
> buildable module that goes back as far as RHEL8 or even further,
> allowing users to just build and use it from their application.
> 
> > was there supposed to be text here, or did you just like this
> > so much you decided to repost my mail ;) 
> 
> Hm, weirdness. I've known Evolution get into a state where it sends
> completely *empty* messages, but I've never seen it eat only my own
> part before. I had definitely typed responses (along the lines of the
> above) last time.

mutt sucks less ;)
David Woodhouse July 25, 2024, 10:20 p.m. UTC | #25
On Thu, 2024-07-25 at 17:47 -0400, Michael S. Tsirkin wrote:
> On Thu, Jul 25, 2024 at 10:29:18PM +0100, David Woodhouse wrote:
> > Those people included me. I wanted to interrupt all the vCPUs, even the
> > ones which were in userspace at the moment of migration, and have the
> > kernel deal with passing it on to userspace via a different ABI.
> > 
> > It ends up being complex and intricate, and requiring a lot of new
> > kernel and userspace support. I gave up on it in the end for snapshots,
> > and didn't go there again for this.
> 
> ok I believe you, I am just curious how come you need userspace
> support - what I imagine would live completely in kernel ...

Userspace doesn't even make a system call for gettimeofday() any more;
the relevant information is exposed to userspace through the vDSO.

If userspace needs to know that the time has been disrupted by LM, then
fundamentally that either needs to be exposed directly to it as well,
or userspace needs to go back to making actual system calls to get the
time (which is slow, and not acceptable for the same use cases which
care about it being accurate).

So how do we make it available in a form that's mappable directly to
userspace? 

Well, we could have a hypervisor enlightenment, where the guest kernel
uses an MSR or hypercall to tell the hypervisor "please write the
information to <this> GPA", and provides an address within the vDSO
information page. Which isn't nice for Confidential Compute, and is
hard to allow for expansion in the size of the structure. And is much
more complex to support consistently across different hypervisors and
different architectures.

We *could* attempt to contrive a system where we indeed interrupt *all*
vCPUs and the kernel then updates something in the vDSO page before
running userspace again. That could work in theory and *might* be a bit
simpler than what we were trying to do for VMGENID/snapshots, but it's
still complex and would take an eternity to deploy to actual users, and
would probably never work for non-Linux. And imposes an even higher
cost on the guest kernel when LM occurs.

Or there's this method, where the hypervisor puts it in a shared memory
region which is just a PCI BAR or an ACPI _CRS or attached to virtio
(we really don't care how it's discovered). There's a nit that it now
has to be page sized, and a guest which has larger pages than the
hypervisor expects is going to have to use a small PTE to map it (or
not support that mode). But I think that's reasonable.

Having gone around in circles a few times, I'm fairly sure that
exposing a memory region which the hypervisor updates directly is the
simplest and cleanest way of doing it and getting it in the hands of
users.

We're rolling out the AMZNVCLK device for internal use cases, and plan
to add it in public instances some time later. This is the guest driver
which consumes that, and I've separately posted the QEMU patch to
provide the same device. Because I absolutely do want this to be
standardised across hypervisors, for the reasons you point out. You're
preaching to the choir there; I even got Microsoft to implement the
same 15-bit MSI extensions that we added to KVM :)

Supporting the disruption signal is the critical part, which allows
applications to abort operations until their clock is good again.
Providing the actual clock information on the new host, so that
applications can keep running immediately, is what I'll be working on
next.

I'd love virtio-rtc to adopt this structure too, and I've done my best
to ensure that that's feasible, but I can't take a dependency on that
and wait for it (and as discussed, wouldn't use the virtio form in my
environment anyway).

> mutt sucks less ;)

So does 'nc' but Evolution can talk to the corporate Exchange calendar
and email. And I'm used to it and can mostly cope with its quirks :)
Michael S. Tsirkin July 26, 2024, 5:09 a.m. UTC | #26
On Thu, Jul 25, 2024 at 10:29:18PM +0100, David Woodhouse wrote:
> > > > Then can't we fix it by interrupting all CPUs right after LM?
> > > > 
> > > > To me that seems like a cleaner approach - we then compartmentalize
> > > > the ABI issue - kernel has its own ABI against userspace,
> > > > devices have their own ABI against kernel.
> > > > It'd mean we need a way to detect that interrupt was sent,
> > > > maybe yet another counter inside that structure.
> > > > 
> > > > WDYT?
> > > > 
> > > > By the way the same idea would work for snapshots -
> > > > some people wanted to expose that info to userspace, too.
> 
> Those people included me. I wanted to interrupt all the vCPUs, even the
> ones which were in userspace at the moment of migration, and have the
> kernel deal with passing it on to userspace via a different ABI.
> 
> It ends up being complex and intricate, and requiring a lot of new
> kernel and userspace support. I gave up on it in the end for snapshots,
> and didn't go there again for this.

Maybe become you insist on using ACPI?
I see a fairly simple way to do it. For example, with virtio:

one vq per CPU, with a single outstanding buffer,
callback copies from the buffer into the userspace
visible memory.

Want me to show you the code?
Michael S. Tsirkin July 26, 2024, 5:55 a.m. UTC | #27
On Fri, Jul 26, 2024 at 01:09:24AM -0400, Michael S. Tsirkin wrote:
> On Thu, Jul 25, 2024 at 10:29:18PM +0100, David Woodhouse wrote:
> > > > > Then can't we fix it by interrupting all CPUs right after LM?
> > > > > 
> > > > > To me that seems like a cleaner approach - we then compartmentalize
> > > > > the ABI issue - kernel has its own ABI against userspace,
> > > > > devices have their own ABI against kernel.
> > > > > It'd mean we need a way to detect that interrupt was sent,
> > > > > maybe yet another counter inside that structure.
> > > > > 
> > > > > WDYT?
> > > > > 
> > > > > By the way the same idea would work for snapshots -
> > > > > some people wanted to expose that info to userspace, too.
> > 
> > Those people included me. I wanted to interrupt all the vCPUs, even the
> > ones which were in userspace at the moment of migration, and have the
> > kernel deal with passing it on to userspace via a different ABI.
> > 
> > It ends up being complex and intricate, and requiring a lot of new
> > kernel and userspace support. I gave up on it in the end for snapshots,
> > and didn't go there again for this.
> 
> Maybe become you insist on using ACPI?
> I see a fairly simple way to do it. For example, with virtio:
> 
> one vq per CPU, with a single outstanding buffer,
> callback copies from the buffer into the userspace
> visible memory.
> 
> Want me to show you the code?

Couldn't resist, so I wrote a bit of this code.
Fundamentally, we keep a copy of the hypervisor abi
in the device:

struct virtclk_info *vci {
	struct vmclock_abi abi;
};

each vq will has its own copy:

struct virtqueue_info {
	struct scatterlist sg[];
	struct vmclock_abi abi;
}

we add it during probe:
        sg_init_one(vqi->sg, &vqi->abi, sizeof(vqi->abi));
	virtqueue_add_inbuf(vq,
                        vqi->sg, 1,
                        &vq->vabi,
                        GFP_ATOMIC);



We set the affinity for each vq:

       for (i = 0; i < num_online_cpus(); i++)
               virtqueue_set_affinity(vi->vq[i], i);

(virtio net does it, and it handles cpu hotplug as well)

each vq callback would do:

static void vmclock_cb(struct virtqueue *vq)
{
        struct virtclk_info *vci = vq->vdev->priv;
        struct virtqueue_info *vqi = vq->priv;
	void *buf;
        unsigned int len;

	buf = virtqueue_get_buf(vq, &len);
	if (!buf)
		return;

	BUG_ON(buf != &vq->abi);

	spin_lock(vci->lock);
	if (memcmp(&vci->abi, &vqi->abi, sizeof(vqi->abi))) {
		memcpy(&vci->abi, &vqi->abi, sizeof(vqi->abi));
	}

	/* Update the userspace visible structure now */
	.....

	/* Re-add the buffer */
	virtqueue_add_inbuf(vq,
                        vqi->sg, 1,
                        &vqi->abi,
                        GFP_ATOMIC);

	spin_unlock(vi->lock);
}

That's it!
Where's the problem here?
Michael S. Tsirkin July 26, 2024, 6:06 a.m. UTC | #28
On Thu, Jul 25, 2024 at 11:20:56PM +0100, David Woodhouse wrote:
> We're rolling out the AMZNVCLK device for internal use cases, and plan
> to add it in public instances some time later.

Let's be real. If amazon does something in its own hypervisor, and the
only way to use that is to expose the interface to userspace, there is
very little the linux community can do.  Moreover, userspace will be
written to this ABI, and be locked in to the specific hypervisor. It
might be a win for amazon short term but long term you will want to
extend things and it will be a mess.

So I feel you have chosen ACPI badly.  It just does not have the APIs
that you need. Virtio does, and would not create a userpspace lock-in
to a specific hypervisor. It's not really virtio specific either,
you can write a bare pci device with a BAR and a bunch of msix
vectors and it will get you the same effect.
David Woodhouse July 26, 2024, 8:06 a.m. UTC | #29
On Fri, 2024-07-26 at 01:55 -0400, Michael S. Tsirkin wrote:
> On Fri, Jul 26, 2024 at 01:09:24AM -0400, Michael S. Tsirkin wrote:
> > On Thu, Jul 25, 2024 at 10:29:18PM +0100, David Woodhouse wrote:
> > > > > > Then can't we fix it by interrupting all CPUs right after LM?
> > > > > > 
> > > > > > To me that seems like a cleaner approach - we then compartmentalize
> > > > > > the ABI issue - kernel has its own ABI against userspace,
> > > > > > devices have their own ABI against kernel.
> > > > > > It'd mean we need a way to detect that interrupt was sent,
> > > > > > maybe yet another counter inside that structure.
> > > > > > 
> > > > > > WDYT?
> > > > > > 
> > > > > > By the way the same idea would work for snapshots -
> > > > > > some people wanted to expose that info to userspace, too.
> > > 
> > > Those people included me. I wanted to interrupt all the vCPUs, even the
> > > ones which were in userspace at the moment of migration, and have the
> > > kernel deal with passing it on to userspace via a different ABI.
> > > 
> > > It ends up being complex and intricate, and requiring a lot of new
> > > kernel and userspace support. I gave up on it in the end for snapshots,
> > > and didn't go there again for this.
> > 
> > Maybe become you insist on using ACPI?
> > I see a fairly simple way to do it. For example, with virtio:
> > 
> > one vq per CPU, with a single outstanding buffer,
> > callback copies from the buffer into the userspace
> > visible memory.
> > 
> > Want me to show you the code?
> 
> Couldn't resist, so I wrote a bit of this code.
> Fundamentally, we keep a copy of the hypervisor abi
> in the device:
> 
> struct virtclk_info *vci {
>         struct vmclock_abi abi;
> };
> 
> each vq will has its own copy:
> 
> struct virtqueue_info {
>         struct scatterlist sg[];
>         struct vmclock_abi abi;
> }
> 
> we add it during probe:
>         sg_init_one(vqi->sg, &vqi->abi, sizeof(vqi->abi));
>         virtqueue_add_inbuf(vq,
>                         vqi->sg, 1,
>                         &vq->vabi,
>                         GFP_ATOMIC);
> 
> 
> 
> We set the affinity for each vq:
> 
>        for (i = 0; i < num_online_cpus(); i++)
>                virtqueue_set_affinity(vi->vq[i], i);
> 
> (virtio net does it, and it handles cpu hotplug as well)
> 
> each vq callback would do:
> 
> static void vmclock_cb(struct virtqueue *vq)
> {
>         struct virtclk_info *vci = vq->vdev->priv;
>         struct virtqueue_info *vqi = vq->priv;
>         void *buf;
>         unsigned int len;
> 
>         buf = virtqueue_get_buf(vq, &len);
>         if (!buf)
>                 return;
> 
>         BUG_ON(buf != &vq->abi);
> 
>         spin_lock(vci->lock);
>         if (memcmp(&vci->abi, &vqi->abi, sizeof(vqi->abi))) {
>                 memcpy(&vci->abi, &vqi->abi, sizeof(vqi->abi));
>         }
> 
>         /* Update the userspace visible structure now */
>         .....
> 
>         /* Re-add the buffer */
>         virtqueue_add_inbuf(vq,
>                         vqi->sg, 1,
>                         &vqi->abi,
>                         GFP_ATOMIC);
> 
>         spin_unlock(vi->lock);
> }
> 
> That's it!
> Where's the problem here?

That's great. You don't even need it to be per-vCPU if you let the
hypervisor write directly to the single physical location that's mapped
to userspace. It can do that before it even starts *running* the vCPUs
after migration. It's a whole lot simpler. 

Even if we were to insist one having two *different* ABIs, one for
hypervisor←→guest kernel, and another for kernel←→userspace, then you
still only need the hypervisor to write one copy. It's just that you do
need to interrupt to all the other vCPUs to ensure they aren't running
userspace before the data are updated. So yes, we can do the above and
(ab)use gratuitous data transfers to trigger those IRQs, and then the
handler has some kind of locking to ensure that we can't reenter
userspace until the user-visible structure is updated? 

Obviously the "interrupt all CPUs" doesn't work well for hardware
implementations but those won't have to deal with live migration, so
that's OK. A hypothetical hardware implementation would map something
like the ART to reference time. Obviously there's no LM but are they
any other causes of 'disruption' in the same sense? Not that I can
think of.

But I don't think we actually do want to have separate ABIs for
hypervisor-kernel and kernel-user. The ABI we have in the structure is
just fine, and extensible. And having the hypervisor update it directly
is fine. It means that applications can use it with a *simple*
externally buildable kernel driver, which is easy to persuade OSVs to
add to their product kernels. 

If Linux wants to adopt it and map it directly into the vDSO area and
use it from vDSO functions, that *great*, and still entirely possible.
But there's no need for us to take a dependency on Linux doing that
kind of thing before this functionality is even usable at all.

I think something like the above could well be how we expose the
pvclock_abi struct in virtio-rtc, and that's fine. It even fixes the
PAGE_SIZE problem that the ACPI model has.

But I see it as just another transport, completely orthogonal to the
actual contents of the pvclock_abi.
David Woodhouse July 26, 2024, 8:35 a.m. UTC | #30
On Fri, 2024-07-26 at 02:06 -0400, Michael S. Tsirkin wrote:
> On Thu, Jul 25, 2024 at 11:20:56PM +0100, David Woodhouse wrote:
> > We're rolling out the AMZNVCLK device for internal use cases, and plan
> > to add it in public instances some time later.
> 
> Let's be real. If amazon does something in its own hypervisor, and the
> only way to use that is to expose the interface to userspace, there is
> very little the linux community can do.  Moreover, userspace will be
> written to this ABI, and be locked in to the specific hypervisor. It
> might be a win for amazon short term but long term you will want to
> extend things and it will be a mess.
> 
> So I feel you have chosen ACPI badly.  It just does not have the APIs
> that you need. Virtio does, and would not create a userpspace lock-in
> to a specific hypervisor. It's not really virtio specific either,
> you can write a bare pci device with a BAR and a bunch of msix
> vectors and it will get you the same effect.

I *am* as bad as the next person for taking the "I have a hammer,
therefore everything is a nail" approach. For you that hammer is
virtio, and I respect that. But mine isn't ACPI — quite the opposite,
it's DT.

I *hate* ACPI. I hate everything about it. I hate that Arm started
using it for Arm64 instead of going with Device Tree.

That's why we have the DSM method for obtaining properties, and the
PRP0001 ACPI HID which means "look for the compatible property and
treat it like a DT node". So people can make DT bindings and hey, if
you're on a system which is afflicted with ACPI, you can still use
them. Which I'm still proselytising today, as you saw.

But for this use case, we only need a memory region that the hypervisor
can update. We don't need any of that complexity of gratuitously
interrupting all the vCPUs just to ensure that none of them can be
running userspace while one of them does an update for itself,
potentially translating from one ABI to another. The hypervisor can
just update the user-visible memory in place.

In this case, exposing a simple MMIO memory region in _CRS of an ACPI
device was the simplest and most compatible solution. 

Yes, we can add a virtio transport for that where the hypervisor is
invited to DMA into (unencrypted) guest memory, and it solves the
PAGE_SIZE problem of the trivial ACPI method. But there's still a place
in this world for the ACPI method, and it doesn't *hurt* virtio.

The important part is the vmclock_abi structure; the transport is just
fluff. And I do not agree that this is a lock-in to a specific
hypervisor. I've literally rewritten the fields in the structure to
align to what virtio-rtc does and accommodate Peter's feedback (to the
dismay of my internal team who just wanted to stick with the initial
straw man struct and didn't want to keep up, and haven't even engaged
with the public threads which have been ongoing since March¹, even when
I've beaten them with a big stick). I've added a QEMU implementation
too. We absolutely *don't* want this to be hypervisor-specific.


¹ https://lore.kernel.org/all/0e21e3e2be26acd70b5575b9932b3a911c9fe721.camel@infradead.org
Michael S. Tsirkin July 26, 2024, 12:47 p.m. UTC | #31
On Fri, Jul 26, 2024 at 09:06:29AM +0100, David Woodhouse wrote:
> That's great. You don't even need it to be per-vCPU if you let the
> hypervisor write directly to the single physical location that's mapped
> to userspace. It can do that before it even starts *running* the vCPUs
> after migration. It's a whole lot simpler. 

It *seems* simpler, until you realize that there is no way
to change anything in the interface, there is no negotiation
between hypervisor and userspace. If I learned anything at all
in tens of years working on software, it's that it is
never done. So let's have userspace talk to the kernel
and have kernel talk to the devices, please. There's
no compelling reason to have this bypass here.
David Woodhouse July 26, 2024, 12:51 p.m. UTC | #32
On Fri, 2024-07-26 at 08:47 -0400, Michael S. Tsirkin wrote:
> On Fri, Jul 26, 2024 at 09:06:29AM +0100, David Woodhouse wrote:
> > That's great. You don't even need it to be per-vCPU if you let the
> > hypervisor write directly to the single physical location that's mapped
> > to userspace. It can do that before it even starts *running* the vCPUs
> > after migration. It's a whole lot simpler. 
> 
> It *seems* simpler, until you realize that there is no way
> to change anything in the interface, there is no negotiation
> between hypervisor and userspace. If I learned anything at all
> in tens of years working on software, it's that it is
> never done. So let's have userspace talk to the kernel
> and have kernel talk to the devices, please. There's
> no compelling reason to have this bypass here.

Thanks for the useful feedback. As you see, I've incorporated most of
it into the v2 post a few minutes ago.

On this particular topic we disagree. I absolutely don't want to take
dependencies on kernel code, on virtio, or on the cross-platform
assumption (even if it's true) that a device can raise an interrupt and
guarantee that no userspace code will run before that interrupt is
handled.

Using virtio does allow for some negotiation — for handling differences
in page sizes, and enabling the timekeeping only on demand. That's
great, but the structure is still an ABI in its own right, and we know
how to do those.

We'll ship the ACPI version, and I look forward to incorporating it
into virtio-rtc too.
Michael S. Tsirkin July 26, 2024, 12:52 p.m. UTC | #33
On Fri, Jul 26, 2024 at 09:35:51AM +0100, David Woodhouse wrote:
> But for this use case, we only need a memory region that the hypervisor
> can update. We don't need any of that complexity of gratuitously
> interrupting all the vCPUs just to ensure that none of them can be
> running userspace while one of them does an update for itself,
> potentially translating from one ABI to another. The hypervisor can
> just update the user-visible memory in place.

Looks like then your userspace is hypervisor specific, and that's a
problem because it's a one way street - there is no way for hypervisor
to know what does userspace need, so no way for hypervisor to know which
information to provide. No real way to fix bugs.
David Woodhouse July 26, 2024, 1 p.m. UTC | #34
On Fri, 2024-07-26 at 08:52 -0400, Michael S. Tsirkin wrote:
> On Fri, Jul 26, 2024 at 09:35:51AM +0100, David Woodhouse wrote:
> > But for this use case, we only need a memory region that the hypervisor
> > can update. We don't need any of that complexity of gratuitously
> > interrupting all the vCPUs just to ensure that none of them can be
> > running userspace while one of them does an update for itself,
> > potentially translating from one ABI to another. The hypervisor can
> > just update the user-visible memory in place.
> 
> Looks like then your userspace is hypervisor specific, and that's a
> problem because it's a one way street - there is no way for hypervisor
> to know what does userspace need, so no way for hypervisor to know which
> information to provide. No real way to fix bugs.

It's not hypervisor specific, but you're right that as it stands there
is no negotiation of what userspace wants. So the hypervisor provides
what it feels it can provide without significant overhead (which may or
may not include the precise timekeeping, as discussed, but should
always include the disruption signal which is the most important
thing).

The guest *does* know what the hypervisor provides. And when we get to
do this in virtio, we get all the goodness of negotiation as well. The
existence of the simple ACPI model doesn't hurt that at all.
Michael S. Tsirkin July 26, 2024, 1:04 p.m. UTC | #35
On Fri, Jul 26, 2024 at 02:00:25PM +0100, David Woodhouse wrote:
> On Fri, 2024-07-26 at 08:52 -0400, Michael S. Tsirkin wrote:
> > On Fri, Jul 26, 2024 at 09:35:51AM +0100, David Woodhouse wrote:
> > > But for this use case, we only need a memory region that the hypervisor
> > > can update. We don't need any of that complexity of gratuitously
> > > interrupting all the vCPUs just to ensure that none of them can be
> > > running userspace while one of them does an update for itself,
> > > potentially translating from one ABI to another. The hypervisor can
> > > just update the user-visible memory in place.
> > 
> > Looks like then your userspace is hypervisor specific, and that's a
> > problem because it's a one way street - there is no way for hypervisor
> > to know what does userspace need, so no way for hypervisor to know which
> > information to provide. No real way to fix bugs.
> 
> It's not hypervisor specific, but you're right that as it stands there
> is no negotiation of what userspace wants. So the hypervisor provides
> what it feels it can provide without significant overhead (which may or
> may not include the precise timekeeping, as discussed, but should
> always include the disruption signal which is the most important
> thing).
> 
> The guest *does* know what the hypervisor provides. And when we get to
> do this in virtio, we get all the goodness of negotiation as well. The
> existence of the simple ACPI model doesn't hurt that at all.

Maybe it doesn't, at that. E.g. virtio does a copy, acpi doesn't?
I'll ponder compatibility over the weekend.
David Woodhouse July 26, 2024, 1:08 p.m. UTC | #36
On Fri, 2024-07-26 at 09:04 -0400, Michael S. Tsirkin wrote:
> On Fri, Jul 26, 2024 at 02:00:25PM +0100, David Woodhouse wrote:
> > On Fri, 2024-07-26 at 08:52 -0400, Michael S. Tsirkin wrote:
> > > On Fri, Jul 26, 2024 at 09:35:51AM +0100, David Woodhouse wrote:
> > > > But for this use case, we only need a memory region that the hypervisor
> > > > can update. We don't need any of that complexity of gratuitously
> > > > interrupting all the vCPUs just to ensure that none of them can be
> > > > running userspace while one of them does an update for itself,
> > > > potentially translating from one ABI to another. The hypervisor can
> > > > just update the user-visible memory in place.
> > > 
> > > Looks like then your userspace is hypervisor specific, and that's a
> > > problem because it's a one way street - there is no way for hypervisor
> > > to know what does userspace need, so no way for hypervisor to know which
> > > information to provide. No real way to fix bugs.
> > 
> > It's not hypervisor specific, but you're right that as it stands there
> > is no negotiation of what userspace wants. So the hypervisor provides
> > what it feels it can provide without significant overhead (which may or
> > may not include the precise timekeeping, as discussed, but should
> > always include the disruption signal which is the most important
> > thing).
> > 
> > The guest *does* know what the hypervisor provides. And when we get to
> > do this in virtio, we get all the goodness of negotiation as well. The
> > existence of the simple ACPI model doesn't hurt that at all.
> 
> Maybe it doesn't, at that. E.g. virtio does a copy, acpi doesn't?
> I'll ponder compatibility over the weekend.

For clarity, I think I've ditched the idea of a poor-man's negotiation
through invoking an ACPI method to enable the timekeeping info.

I think we're better off waiting for virtio, to enable that kind of
thing.

The guest gets what the hypervisor is prepared to offer "for free".

That isn't a one-way door; we *can* add an optional ACPI method later
if we really want to. But I'm generally OK with "use virtio if you want
that".
Jonathan Cameron July 26, 2024, 4:49 p.m. UTC | #37
On Thu, 25 Jul 2024 14:50:50 +0100
David Woodhouse <dwmw2@infradead.org> wrote:

> On Thu, 2024-07-25 at 08:33 -0400, Michael S. Tsirkin wrote:
> > On Thu, Jul 25, 2024 at 01:31:19PM +0100, David Woodhouse wrote:  
> > > On Thu, 2024-07-25 at 08:29 -0400, Michael S. Tsirkin wrote:  
> > > > On Thu, Jul 25, 2024 at 01:27:49PM +0100, David Woodhouse wrote:  
> > > > > On Thu, 2024-07-25 at 08:17 -0400, Michael S. Tsirkin wrote:  
> > > > > > On Thu, Jul 25, 2024 at 10:56:05AM +0100, David Woodhouse wrote:  
> > > > > > > > Do you want to just help complete virtio-rtc then? Would be easier than
> > > > > > > > trying to keep two specs in sync.  
> > > > > > > 
> > > > > > > The ACPI version is much more lightweight and doesn't take up a
> > > > > > > valuable PCI slot#. (I know, you can do virtio without PCI but that's
> > > > > > > complex in other ways).
> > > > > > >   
> > > > > > 
> > > > > > Hmm, should we support virtio over ACPI? Just asking.  
> > > > > 
> > > > > Given that we support virtio DT bindings, and the ACPI "PRP0001" device
> > > > > exists with a DSM method which literally returns DT properties,
> > > > > including such properties as "compatible=virtio,mmio" ... do we
> > > > > already?
> > > > > 
> > > > >   
> > > > 
> > > > In a sense, but you are saying that is too complex?
> > > > Can you elaborate?  
> > > 
> > > No, I think it's fine. I encourage the use of the PRP0001 device to
> > > expose DT devices through ACPI. I was just reminding you of its
> > > existence.  
> > 
> > Confused. You said "I know, you can do virtio without PCI but that's
> > complex in other ways" as the explanation why you are doing a custom
> > protocol.  
> 
> Ah, apologies, I wasn't thinking that far back in the conversation.
> 
> If we wanted to support virtio over ACPI, I think PRP0001 can be made
> to work and isn't too complex (even though it probably doesn't yet work
> out of the box).
> 
> But for the VMCLOCK thing, yes, the simple ACPI device is a lot simpler
> than virtio-rtc and much more attractive.
> 
> Even if the virtio-rtc specification were official today, and I was
> able to expose it via PCI, I probably wouldn't do it that way. There's
> just far more in virtio-rtc than we need; the simple shared memory
> region is perfectly sufficient for most needs, and especially ours.
> 
> I have reworked
> https://git.infradead.org/users/dwmw2/linux.git/shortlog/refs/heads/vmclock
> to take your other feedback into account.
> 
> It's now more flexible about the size handling, and explicitly checking
> that specific fields are present before using them. 
> 
> I think I'm going to add a method on the ACPI device to enable the
> precise clock information. I haven't done that in the driver yet; it
> still just consumes the precise clock information if it happens to be
> present already. The enable method can be added in a compatible fashion
> (the failure mode is that guests which don't invoke this method when
> the hypervisor needs them to will see only the disruption signal and
> not precise time).
> 
> For the HID I'm going to use AMZNVCLK. I had used QEMUVCLK in the QEMU
> patches, but I'll change that to use AMZNVCLK too when I repost the
> QEMU patch.

That doesn't fit with ACPI _HID definitions.
Second set 4 characters need to be hex digits as this is an
ACPI style ID (which I assume this is given AMZN is a valid
vendor ID.  6.1.5 in ACPI v6.5

Maybe I'm missing something...

J
David Woodhouse July 26, 2024, 6:28 p.m. UTC | #38
On 26 July 2024 17:49:58 BST, Jonathan Cameron <Jonathan.Cameron@Huawei.com> wrote:
>On Thu, 25 Jul 2024 14:50:50 +0100
>David Woodhouse <dwmw2@infradead.org> wrote:
>
>> On Thu, 2024-07-25 at 08:33 -0400, Michael S. Tsirkin wrote:
>> > On Thu, Jul 25, 2024 at 01:31:19PM +0100, David Woodhouse wrote:  
>> > > On Thu, 2024-07-25 at 08:29 -0400, Michael S. Tsirkin wrote:  
>> > > > On Thu, Jul 25, 2024 at 01:27:49PM +0100, David Woodhouse wrote:  
>> > > > > On Thu, 2024-07-25 at 08:17 -0400, Michael S. Tsirkin wrote:  
>> > > > > > On Thu, Jul 25, 2024 at 10:56:05AM +0100, David Woodhouse wrote:  
>> > > > > > > > Do you want to just help complete virtio-rtc then? Would be easier than
>> > > > > > > > trying to keep two specs in sync.  
>> > > > > > > 
>> > > > > > > The ACPI version is much more lightweight and doesn't take up a
>> > > > > > > valuable PCI slot#. (I know, you can do virtio without PCI but that's
>> > > > > > > complex in other ways).
>> > > > > > >   
>> > > > > > 
>> > > > > > Hmm, should we support virtio over ACPI? Just asking.  
>> > > > > 
>> > > > > Given that we support virtio DT bindings, and the ACPI "PRP0001" device
>> > > > > exists with a DSM method which literally returns DT properties,
>> > > > > including such properties as "compatible=virtio,mmio" ... do we
>> > > > > already?
>> > > > > 
>> > > > >   
>> > > > 
>> > > > In a sense, but you are saying that is too complex?
>> > > > Can you elaborate?  
>> > > 
>> > > No, I think it's fine. I encourage the use of the PRP0001 device to
>> > > expose DT devices through ACPI. I was just reminding you of its
>> > > existence.  
>> > 
>> > Confused. You said "I know, you can do virtio without PCI but that's
>> > complex in other ways" as the explanation why you are doing a custom
>> > protocol.  
>> 
>> Ah, apologies, I wasn't thinking that far back in the conversation.
>> 
>> If we wanted to support virtio over ACPI, I think PRP0001 can be made
>> to work and isn't too complex (even though it probably doesn't yet work
>> out of the box).
>> 
>> But for the VMCLOCK thing, yes, the simple ACPI device is a lot simpler
>> than virtio-rtc and much more attractive.
>> 
>> Even if the virtio-rtc specification were official today, and I was
>> able to expose it via PCI, I probably wouldn't do it that way. There's
>> just far more in virtio-rtc than we need; the simple shared memory
>> region is perfectly sufficient for most needs, and especially ours.
>> 
>> I have reworked
>> https://git.infradead.org/users/dwmw2/linux.git/shortlog/refs/heads/vmclock
>> to take your other feedback into account.
>> 
>> It's now more flexible about the size handling, and explicitly checking
>> that specific fields are present before using them. 
>> 
>> I think I'm going to add a method on the ACPI device to enable the
>> precise clock information. I haven't done that in the driver yet; it
>> still just consumes the precise clock information if it happens to be
>> present already. The enable method can be added in a compatible fashion
>> (the failure mode is that guests which don't invoke this method when
>> the hypervisor needs them to will see only the disruption signal and
>> not precise time).
>> 
>> For the HID I'm going to use AMZNVCLK. I had used QEMUVCLK in the QEMU
>> patches, but I'll change that to use AMZNVCLK too when I repost the
>> QEMU patch.
>
>That doesn't fit with ACPI _HID definitions.
>Second set 4 characters need to be hex digits as this is an
>ACPI style ID (which I assume this is given AMZN is a valid
>vendor ID.  6.1.5 in ACPI v6.5
>
>Maybe I'm missing something...
>
>J
>
>



Hm, is the same not true for QEMUVGID and AMZNVGID, which I was using as an example?

QEMU seemed to get to 0002, and AFAICT the VMGENID patches were initially posted using QEMU0003, but what's actually in QEMU now is QEMUVGID. So I presumed that was now the preferred option.
Michael S. Tsirkin July 28, 2024, 10:37 a.m. UTC | #39
On Fri, Jul 26, 2024 at 07:28:28PM +0100, David Woodhouse wrote:
> On 26 July 2024 17:49:58 BST, Jonathan Cameron <Jonathan.Cameron@Huawei.com> wrote:
> >On Thu, 25 Jul 2024 14:50:50 +0100
> >David Woodhouse <dwmw2@infradead.org> wrote:
> >
> >> On Thu, 2024-07-25 at 08:33 -0400, Michael S. Tsirkin wrote:
> >> > On Thu, Jul 25, 2024 at 01:31:19PM +0100, David Woodhouse wrote:  
> >> > > On Thu, 2024-07-25 at 08:29 -0400, Michael S. Tsirkin wrote:  
> >> > > > On Thu, Jul 25, 2024 at 01:27:49PM +0100, David Woodhouse wrote:  
> >> > > > > On Thu, 2024-07-25 at 08:17 -0400, Michael S. Tsirkin wrote:  
> >> > > > > > On Thu, Jul 25, 2024 at 10:56:05AM +0100, David Woodhouse wrote:  
> >> > > > > > > > Do you want to just help complete virtio-rtc then? Would be easier than
> >> > > > > > > > trying to keep two specs in sync.  
> >> > > > > > > 
> >> > > > > > > The ACPI version is much more lightweight and doesn't take up a
> >> > > > > > > valuable PCI slot#. (I know, you can do virtio without PCI but that's
> >> > > > > > > complex in other ways).
> >> > > > > > >   
> >> > > > > > 
> >> > > > > > Hmm, should we support virtio over ACPI? Just asking.  
> >> > > > > 
> >> > > > > Given that we support virtio DT bindings, and the ACPI "PRP0001" device
> >> > > > > exists with a DSM method which literally returns DT properties,
> >> > > > > including such properties as "compatible=virtio,mmio" ... do we
> >> > > > > already?
> >> > > > > 
> >> > > > >   
> >> > > > 
> >> > > > In a sense, but you are saying that is too complex?
> >> > > > Can you elaborate?  
> >> > > 
> >> > > No, I think it's fine. I encourage the use of the PRP0001 device to
> >> > > expose DT devices through ACPI. I was just reminding you of its
> >> > > existence.  
> >> > 
> >> > Confused. You said "I know, you can do virtio without PCI but that's
> >> > complex in other ways" as the explanation why you are doing a custom
> >> > protocol.  
> >> 
> >> Ah, apologies, I wasn't thinking that far back in the conversation.
> >> 
> >> If we wanted to support virtio over ACPI, I think PRP0001 can be made
> >> to work and isn't too complex (even though it probably doesn't yet work
> >> out of the box).
> >> 
> >> But for the VMCLOCK thing, yes, the simple ACPI device is a lot simpler
> >> than virtio-rtc and much more attractive.
> >> 
> >> Even if the virtio-rtc specification were official today, and I was
> >> able to expose it via PCI, I probably wouldn't do it that way. There's
> >> just far more in virtio-rtc than we need; the simple shared memory
> >> region is perfectly sufficient for most needs, and especially ours.
> >> 
> >> I have reworked
> >> https://git.infradead.org/users/dwmw2/linux.git/shortlog/refs/heads/vmclock
> >> to take your other feedback into account.
> >> 
> >> It's now more flexible about the size handling, and explicitly checking
> >> that specific fields are present before using them. 
> >> 
> >> I think I'm going to add a method on the ACPI device to enable the
> >> precise clock information. I haven't done that in the driver yet; it
> >> still just consumes the precise clock information if it happens to be
> >> present already. The enable method can be added in a compatible fashion
> >> (the failure mode is that guests which don't invoke this method when
> >> the hypervisor needs them to will see only the disruption signal and
> >> not precise time).
> >> 
> >> For the HID I'm going to use AMZNVCLK. I had used QEMUVCLK in the QEMU
> >> patches, but I'll change that to use AMZNVCLK too when I repost the
> >> QEMU patch.
> >
> >That doesn't fit with ACPI _HID definitions.
> >Second set 4 characters need to be hex digits as this is an
> >ACPI style ID (which I assume this is given AMZN is a valid
> >vendor ID.  6.1.5 in ACPI v6.5
> >
> >Maybe I'm missing something...
> >
> >J
> >
> >
> 
> 
> 
> Hm, is the same not true for QEMUVGID and AMZNVGID, which I was using as an example?
> 
> QEMU seemed to get to 0002, and AFAICT the VMGENID patches were initially posted using QEMU0003, but what's actually in QEMU now is QEMUVGID. So I presumed that was now the preferred option.

Glad you asked :)


ACPI 1.0 indeed did not place restrictions on it:

6.1.4 _HID
This object is used to supply the OS with the device’s Plug and Play Hardware ID. When
describing a platform, use of any _HID objects is optional. However, a _HID object must
be used to describe any device that will be enumerated by the ACPI driver. The ACPI
driver only enumerates a device when no bus enumerator can detect the device ID. For
example, devices on an ISA bus are enumerated by the ACPI driver. Use the _ADR
object to describe devices enumerated by bus enumerators other than the ACPI driver.
A _HID object evaluates to either a numeric 32-bit compressed EISA type ID or a string.


Then 3.0 was very draconic:
6.1.4 _HID (Hardware ID)
This object is used to supply OSPM with the device’s Plug and Play hardware ID.8 When describing a
platform, use of any _HID objects is optional. However, a _HID object must be used to describe any device
that will be enumerated by OSPM. OSPM only enumerates a device when no bus enumerator can detect the
device ID. For example, devices on an ISA bus are enumerated by OSPM. Use the _ADR object to describe
devices enumerated by bus enumerators other than OSPM.
A _HID object evaluates to either a numeric 32-bit compressed EISA type ID or a string. If a string, the
format must be an alphanumeric PNP or ACPI ID with no asterisk or other leading characters.
A valid PNP ID must be of the form “AAA####” where A is an uppercase letter and # is a hex digit. A
valid ACPI ID must be of the form “ACPI####” where # is a hex digit.

Then 5.0 changed it to:

6.1.5 _HID (Hardware ID)
This object is used to supply OSPM with the device’s Plug and Play hardware ID.1
1.
256
A Plug and Play ID or ACPI ID can be obtained by sending e-mail to pnpid@microsoft.com.
Hewlett-Packard/Intel/Microsoft/Phoenix/ToshibaAdvanced Configuration and Power Interface Specification
When describing a platform, use of any _HID objects is optional. However, a _HID object must be
used to describe any device that will be enumerated by OSPM. OSPM only enumerates a device
when no bus enumerator can detect the device ID. For example, devices on an ISA bus are
enumerated by OSPM. Use the _ADR object to describe devices enumerated by bus enumerators
other than OSPM.
Arguments:
None
Return Value:
An Integer or String containing the HID
A _HID object evaluates to either a numeric 32-bit compressed EISA type ID or a string. If a
string, the format must be an alphanumeric PNP or ACPI ID with no asterisk or other leading
characters.
A valid PNP ID must be of the form "AAA####" where A is an uppercase letter and # is a hex
digit. A valid ACPI ID must be of the form "NNNN####" where N is an uppercase letter or a
digit ('0'-'9') and # is a hex digit. This specification reserves the string "ACPI" for use only
with devices defined herein. It further reserves all strings representing 4 HEX digits for
exclusive use with PCI-assigned Vendor IDs.



Long story short, QEMUVGID is indeed out of spec, but it works
both because of guest compatibility with ACPI 1.0, and because no one
much uses it.
David Woodhouse July 28, 2024, 1:07 p.m. UTC | #40
On 28 July 2024 11:37:04 BST, "Michael S. Tsirkin" <mst@redhat.com> wrote:
>Glad you asked :)

Heh, I'm not sure I'm so glad. Did I mention I hate ACPI? Perhaps it's still not too late for me just to define a DT binding and use PRP0001 for it :)

>Long story short, QEMUVGID is indeed out of spec, but it works
>both because of guest compatibility with ACPI 1.0, and because no one
>much uses it.


I think it's reasonable enough to follow that example and use AMZNVCLK (or QEMUVCLK, but there seems little point in both) then?
Michael S. Tsirkin July 28, 2024, 3:23 p.m. UTC | #41
On Sun, Jul 28, 2024 at 02:07:01PM +0100, David Woodhouse wrote:
> On 28 July 2024 11:37:04 BST, "Michael S. Tsirkin" <mst@redhat.com> wrote:
> >Glad you asked :)
> 
> Heh, I'm not sure I'm so glad. Did I mention I hate ACPI? Perhaps it's still not too late for me just to define a DT binding and use PRP0001 for it :)
> 
> >Long story short, QEMUVGID is indeed out of spec, but it works
> >both because of guest compatibility with ACPI 1.0, and because no one
> >much uses it.
> 
> 
> I think it's reasonable enough to follow that example and use AMZNVCLK (or QEMUVCLK, but there seems little point in both) then?

I'd stick to spec. If you like puns, QEMUC10C maybe?
David Woodhouse July 29, 2024, 6:45 a.m. UTC | #42
On 28 July 2024 16:23:49 BST, "Michael S. Tsirkin" <mst@redhat.com> wrote:
>On Sun, Jul 28, 2024 at 02:07:01PM +0100, David Woodhouse wrote:
>> On 28 July 2024 11:37:04 BST, "Michael S. Tsirkin" <mst@redhat.com> wrote:
>> >Glad you asked :)
>> 
>> Heh, I'm not sure I'm so glad. Did I mention I hate ACPI? Perhaps it's still not too late for me just to define a DT binding and use PRP0001 for it :)
>> 
>> >Long story short, QEMUVGID is indeed out of spec, but it works
>> >both because of guest compatibility with ACPI 1.0, and because no one
>> >much uses it.

But why *did* it change from QEMU0003 which was in some of the patches that got posted?

>> I think it's reasonable enough to follow that example and use AMZNVCLK (or QEMUVCLK, but there seems little point in both) then?
>
>I'd stick to spec. If you like puns, QEMUC10C maybe?

Meh, might as well be sensible. I'll chase up which of my colleagues curates the AMZN space (which will no doubt be me by the end of that thread), and issue the next one from that.
diff mbox series

Patch

diff --git a/drivers/ptp/Kconfig b/drivers/ptp/Kconfig
index 604541dcb320..e98c9767e0ef 100644
--- a/drivers/ptp/Kconfig
+++ b/drivers/ptp/Kconfig
@@ -131,6 +131,19 @@  config PTP_1588_CLOCK_KVM
 	  To compile this driver as a module, choose M here: the module
 	  will be called ptp_kvm.
 
+config PTP_1588_CLOCK_VMCLOCK
+	tristate "Virtual machine PTP clock"
+	depends on X86_TSC || ARM_ARCH_TIMER
+	depends on PTP_1588_CLOCK && ACPI && ARCH_SUPPORTS_INT128
+	default y
+	help
+	  This driver adds support for using a virtual precision clock
+	  advertised by the hypervisor. This clock is only useful in virtual
+	  machines where such a device is present.
+
+	  To compile this driver as a module, choose M here: the module
+	  will be called ptp_vmclock.
+
 config PTP_1588_CLOCK_IDT82P33
 	tristate "IDT 82P33xxx PTP clock"
 	depends on PTP_1588_CLOCK && I2C
diff --git a/drivers/ptp/Makefile b/drivers/ptp/Makefile
index 68bf02078053..01b5cd91eb61 100644
--- a/drivers/ptp/Makefile
+++ b/drivers/ptp/Makefile
@@ -11,6 +11,7 @@  obj-$(CONFIG_PTP_1588_CLOCK_DTE)	+= ptp_dte.o
 obj-$(CONFIG_PTP_1588_CLOCK_INES)	+= ptp_ines.o
 obj-$(CONFIG_PTP_1588_CLOCK_PCH)	+= ptp_pch.o
 obj-$(CONFIG_PTP_1588_CLOCK_KVM)	+= ptp_kvm.o
+obj-$(CONFIG_PTP_1588_CLOCK_VMCLOCK)	+= ptp_vmclock.o
 obj-$(CONFIG_PTP_1588_CLOCK_QORIQ)	+= ptp-qoriq.o
 ptp-qoriq-y				+= ptp_qoriq.o
 ptp-qoriq-$(CONFIG_DEBUG_FS)		+= ptp_qoriq_debugfs.o
diff --git a/drivers/ptp/ptp_vmclock.c b/drivers/ptp/ptp_vmclock.c
new file mode 100644
index 000000000000..9c508c21c062
--- /dev/null
+++ b/drivers/ptp/ptp_vmclock.c
@@ -0,0 +1,567 @@ 
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Virtual PTP 1588 clock for use with LM-safe VMclock device.
+ *
+ * Copyright © 2024 Amazon.com, Inc. or its affiliates.
+ */
+
+#include <linux/acpi.h>
+#include <linux/device.h>
+#include <linux/err.h>
+#include <linux/file.h>
+#include <linux/fs.h>
+#include <linux/init.h>
+#include <linux/kernel.h>
+#include <linux/miscdevice.h>
+#include <linux/mm.h>
+#include <linux/module.h>
+#include <linux/platform_device.h>
+#include <linux/slab.h>
+
+#include <uapi/linux/vmclock-abi.h>
+
+#include <linux/ptp_clock_kernel.h>
+
+#ifdef CONFIG_X86
+#include <asm/pvclock.h>
+#include <asm/kvmclock.h>
+#endif
+
+#ifdef CONFIG_KVM_GUEST
+#define SUPPORT_KVMCLOCK
+#endif
+
+static DEFINE_IDA(vmclock_ida);
+
+ACPI_MODULE_NAME("vmclock");
+
+struct vmclock_state {
+	struct resource res;
+	struct vmclock_abi *clk;
+	struct miscdevice miscdev;
+	struct ptp_clock_info ptp_clock_info;
+	struct ptp_clock *ptp_clock;
+	enum clocksource_ids cs_id, sys_cs_id;
+	int index;
+	char *name;
+};
+
+#define VMCLOCK_MAX_WAIT ms_to_ktime(100)
+
+/*
+ * Multiply a 64-bit count by a 64-bit tick 'period' in units of seconds >> 64
+ * and add the fractional second part of the reference time.
+ *
+ * The result is a 128-bit value, the top 64 bits of which are seconds, and
+ * the low 64 bits are (seconds >> 64).
+ *
+ * If __int128 isn't available, perform the calculation 32 bits at a time to
+ * avoid overflow.
+ */
+static inline uint64_t mul_u64_u64_shr_add_u64(uint64_t *res_hi, uint64_t delta,
+					       uint64_t period, uint8_t shift,
+					       uint64_t frac_sec)
+{
+	unsigned __int128 res = (unsigned __int128)delta * period;
+
+	res >>= shift;
+	res += frac_sec;
+	*res_hi = res >> 64;
+	return (uint64_t)res;
+}
+
+static inline bool tai_adjust(struct vmclock_abi *clk, uint64_t *sec)
+{
+	if (likely(clk->time_type == VMCLOCK_TIME_UTC))
+		return true;
+
+	if (clk->time_type == VMCLOCK_TIME_TAI &&
+	    (clk->flags & VMCLOCK_FLAG_TAI_OFFSET_VALID)) {
+		if (sec)
+			*sec += clk->tai_offset_sec;
+		return true;
+	}
+	return false;
+}
+
+static int vmclock_get_crosststamp(struct vmclock_state *st,
+				   struct ptp_system_timestamp *sts,
+				   struct system_counterval_t *system_counter,
+				   struct timespec64 *tspec)
+{
+	ktime_t deadline = ktime_add(ktime_get(), VMCLOCK_MAX_WAIT);
+	struct system_time_snapshot systime_snapshot;
+	uint64_t cycle, delta, seq, frac_sec;
+
+#ifdef CONFIG_X86
+	/*
+	 * We'd expect the hypervisor to know this and to report the clock
+	 * status as VMCLOCK_STATUS_UNRELIABLE. But be paranoid.
+	 */
+	if (check_tsc_unstable())
+		return -EINVAL;
+#endif
+
+	while (1) {
+		seq = st->clk->seq_count & ~1ULL;
+		virt_rmb();
+
+		if (st->clk->clock_status == VMCLOCK_STATUS_UNRELIABLE)
+			return -EINVAL;
+
+		/*
+		 * When invoked for gettimex64(), fill in the pre/post system
+		 * times. The simple case is when system time is based on the
+		 * same counter as st->cs_id, in which case all three times
+		 * will be derived from the *same* counter value.
+		 *
+		 * If the system isn't using the same counter, then the value
+		 * from ktime_get_snapshot() will still be used as pre_ts, and
+		 * ptp_read_system_postts() is called to populate postts after
+		 * calling get_cycles().
+		 *
+		 * The conversion to timespec64 happens further down, outside
+		 * the seq_count loop.
+		 */
+		if (sts) {
+			ktime_get_snapshot(&systime_snapshot);
+			if (systime_snapshot.cs_id == st->cs_id) {
+				cycle = systime_snapshot.cycles;
+			} else {
+				cycle = get_cycles();
+				ptp_read_system_postts(sts);
+			}
+		} else
+			cycle = get_cycles();
+
+		delta = cycle - st->clk->counter_value;
+
+		frac_sec = mul_u64_u64_shr_add_u64(&tspec->tv_sec, delta,
+						   st->clk->counter_period_frac_sec,
+						   st->clk->counter_period_shift,
+						   st->clk->time_frac_sec);
+		tspec->tv_nsec = mul_u64_u64_shr(frac_sec, NSEC_PER_SEC, 64);
+		tspec->tv_sec += st->clk->time_sec;
+
+		if (!tai_adjust(st->clk, &tspec->tv_sec))
+			return -EINVAL;
+
+		virt_rmb();
+		if (seq == st->clk->seq_count)
+			break;
+
+		if (ktime_after(ktime_get(), deadline))
+			return -ETIMEDOUT;
+	}
+
+	if (system_counter) {
+		system_counter->cycles = cycle;
+		system_counter->cs_id = st->cs_id;
+	}
+
+	if (sts) {
+		sts->pre_ts = ktime_to_timespec64(systime_snapshot.real);
+		if (systime_snapshot.cs_id == st->cs_id)
+			sts->post_ts = sts->pre_ts;
+	}
+
+	return 0;
+}
+
+#ifdef SUPPORT_KVMCLOCK
+/*
+ * In the case where the system is using the KVM clock for timekeeping, convert
+ * the TSC value into a KVM clock time in order to return a paired reading that
+ * get_device_system_crosststamp() can cope with.
+ */
+static int vmclock_get_crosststamp_kvmclock(struct vmclock_state *st,
+					    struct ptp_system_timestamp *sts,
+					    struct system_counterval_t *system_counter,
+					    struct timespec64 *tspec)
+{
+	struct pvclock_vcpu_time_info *pvti = this_cpu_pvti();
+	unsigned pvti_ver;
+	int ret;
+
+	preempt_disable_notrace();
+
+	do {
+		pvti_ver = pvclock_read_begin(pvti);
+
+		ret = vmclock_get_crosststamp(st, sts, system_counter, tspec);
+		if (ret)
+			break;
+
+		system_counter->cycles = __pvclock_read_cycles(pvti,
+							       system_counter->cycles);
+		system_counter->cs_id = CSID_X86_KVM_CLK;
+
+		/*
+		 * This retry should never really happen; if the TSC is
+		 * stable and reliable enough across vCPUS that it is sane
+		 * for the hypervisor to expose a VMCLOCK device which uses
+		 * it as the reference counter, then the KVM clock sohuld be
+		 * in 'master clock mode' and basically never changed. But
+		 * the KVM clock is a fickle and often broken thing, so do
+		 * it "properly" just in case.
+		 */
+	} while (pvclock_read_retry(pvti, pvti_ver));
+
+	preempt_enable_notrace();
+
+	return ret;
+}
+#endif
+
+static int ptp_vmclock_get_time_fn(ktime_t *device_time,
+				   struct system_counterval_t *system_counter,
+				   void *ctx)
+{
+	struct vmclock_state *st = ctx;
+	struct timespec64 tspec;
+	int ret;
+
+#ifdef SUPPORT_KVMCLOCK
+	if (READ_ONCE(st->sys_cs_id) == CSID_X86_KVM_CLK)
+		ret = vmclock_get_crosststamp_kvmclock(st, NULL, system_counter,
+						       &tspec);
+	else
+#endif
+		ret = vmclock_get_crosststamp(st, NULL, system_counter, &tspec);
+
+	if (!ret)
+		*device_time = timespec64_to_ktime(tspec);
+
+	return ret;
+}
+
+
+static int ptp_vmclock_getcrosststamp(struct ptp_clock_info *ptp,
+				      struct system_device_crosststamp *xtstamp)
+{
+	struct vmclock_state *st = container_of(ptp, struct vmclock_state,
+						ptp_clock_info);
+	int ret = get_device_system_crosststamp(ptp_vmclock_get_time_fn, st,
+						NULL, xtstamp);
+#ifdef SUPPORT_KVMCLOCK
+	/*
+	 * On x86, the KVM clock may be used for the system time. We can
+	 * actually convert a TSC reading to that, and return a paired
+	 * timestamp that get_device_system_crosststamp() *can* handle.
+	 */
+	if (ret == -ENODEV) {
+		struct system_time_snapshot systime_snapshot;
+		ktime_get_snapshot(&systime_snapshot);
+
+		if (systime_snapshot.cs_id == CSID_X86_TSC ||
+		    systime_snapshot.cs_id == CSID_X86_KVM_CLK) {
+			WRITE_ONCE(st->sys_cs_id, systime_snapshot.cs_id);
+			ret = get_device_system_crosststamp(ptp_vmclock_get_time_fn,
+							    st, NULL, xtstamp);
+		}
+	}
+#endif
+	return ret;
+}
+
+/*
+ * PTP clock operations
+ */
+
+static int ptp_vmclock_adjfine(struct ptp_clock_info *ptp, long delta)
+{
+	return -EOPNOTSUPP;
+}
+
+static int ptp_vmclock_adjtime(struct ptp_clock_info *ptp, s64 delta)
+{
+	return -EOPNOTSUPP;
+}
+
+static int ptp_vmclock_settime(struct ptp_clock_info *ptp,
+			   const struct timespec64 *ts)
+{
+	return -EOPNOTSUPP;
+}
+
+static int ptp_vmclock_gettimex(struct ptp_clock_info *ptp, struct timespec64 *ts,
+				struct ptp_system_timestamp *sts)
+{
+	struct vmclock_state *st = container_of(ptp, struct vmclock_state,
+						ptp_clock_info);
+
+	return vmclock_get_crosststamp(st, sts, NULL, ts);
+}
+
+static int ptp_vmclock_enable(struct ptp_clock_info *ptp,
+			  struct ptp_clock_request *rq, int on)
+{
+	return -EOPNOTSUPP;
+}
+
+static const struct ptp_clock_info ptp_vmclock_info = {
+	.owner		= THIS_MODULE,
+	.max_adj	= 0,
+	.n_ext_ts	= 0,
+	.n_pins		= 0,
+	.pps		= 0,
+	.adjfine	= ptp_vmclock_adjfine,
+	.adjtime	= ptp_vmclock_adjtime,
+	.gettimex64	= ptp_vmclock_gettimex,
+	.settime64	= ptp_vmclock_settime,
+	.enable		= ptp_vmclock_enable,
+	.getcrosststamp = ptp_vmclock_getcrosststamp,
+};
+
+static int vmclock_miscdev_mmap(struct file *fp, struct vm_area_struct *vma)
+{
+	struct vmclock_state *st = container_of(fp->private_data,
+						struct vmclock_state, miscdev);
+
+	if ((vma->vm_flags & (VM_READ|VM_WRITE)) != VM_READ)
+		return -EROFS;
+
+	if (vma->vm_end - vma->vm_start != PAGE_SIZE || vma->vm_pgoff)
+		return -EINVAL;
+
+        if (io_remap_pfn_range(vma, vma->vm_start,
+			       st->res.start >> PAGE_SHIFT, PAGE_SIZE,
+                               vma->vm_page_prot))
+                return -EAGAIN;
+
+        return 0;
+}
+
+static ssize_t vmclock_miscdev_read(struct file *fp, char __user *buf,
+				    size_t count, loff_t *ppos)
+{
+	struct vmclock_state *st = container_of(fp->private_data,
+						struct vmclock_state, miscdev);
+	ktime_t deadline = ktime_add(ktime_get(), VMCLOCK_MAX_WAIT);
+	size_t max_count;
+	int32_t seq;
+
+	if (*ppos >= PAGE_SIZE)
+		return 0;
+
+	max_count = PAGE_SIZE - *ppos;
+	if (count > max_count)
+		count = max_count;
+
+	while (1) {
+		seq = st->clk->seq_count & ~1ULL;
+		virt_rmb();
+
+		if (copy_to_user(buf, ((char *)st->clk) + *ppos, count))
+			return -EFAULT;
+
+		virt_rmb();
+		if (seq == st->clk->seq_count)
+			break;
+
+		if (ktime_after(ktime_get(), deadline))
+			return -ETIMEDOUT;
+	}
+
+	*ppos += count;
+	return count;
+}
+
+static const struct file_operations vmclock_miscdev_fops = {
+        .mmap = vmclock_miscdev_mmap,
+        .read = vmclock_miscdev_read,
+};
+
+/* module operations */
+
+static void vmclock_remove(struct platform_device *pdev)
+{
+	struct device *dev = &pdev->dev;
+	struct vmclock_state *st = dev_get_drvdata(dev);
+
+	if (st->ptp_clock)
+		ptp_clock_unregister(st->ptp_clock);
+
+	if (st->miscdev.minor != MISC_DYNAMIC_MINOR)
+		misc_deregister(&st->miscdev);
+}
+
+static acpi_status vmclock_acpi_resources(struct acpi_resource *ares, void *data)
+{
+	struct vmclock_state *st = data;
+	struct resource_win win;
+	struct resource *res = &(win.res);
+
+	if (ares->type == ACPI_RESOURCE_TYPE_END_TAG)
+		return AE_OK;
+
+	/* There can be only one */
+	if (resource_type(&st->res) == IORESOURCE_MEM)
+		return AE_ERROR;
+
+        if (acpi_dev_resource_memory(ares, res) ||
+	    acpi_dev_resource_address_space(ares, &win)) {
+
+		if (resource_type(res) != IORESOURCE_MEM ||
+		    resource_size(res) < sizeof(st->clk))
+			return AE_ERROR;
+
+		st->res = *res;
+		return AE_OK;
+	}
+
+	return AE_ERROR;
+}
+
+static int vmclock_probe_acpi(struct device *dev, struct vmclock_state *st)
+{
+	struct acpi_device *adev = ACPI_COMPANION(dev);
+	acpi_status status;
+
+        status = acpi_walk_resources(adev->handle, METHOD_NAME__CRS,
+                                     vmclock_acpi_resources, st);
+        if (ACPI_FAILURE(status) || resource_type(&st->res) != IORESOURCE_MEM) {
+                dev_err(dev, "failed to get resources\n");
+                return -ENODEV;
+        }
+
+	return 0;
+}
+
+static void vmclock_put_idx(void *data)
+{
+	struct vmclock_state *st = data;
+
+	ida_free(&vmclock_ida, st->index);
+}
+
+static int vmclock_probe(struct platform_device *pdev)
+{
+	struct device *dev = &pdev->dev;
+	struct vmclock_state *st;
+	int ret;
+
+	st = devm_kzalloc(dev, sizeof (*st), GFP_KERNEL);
+	if (!st)
+		return -ENOMEM;
+
+	if (has_acpi_companion(dev))
+		ret = vmclock_probe_acpi(dev, st);
+	else
+		ret = -EINVAL; /* Only ACPI for now */
+
+	if (ret) {
+		dev_info(dev, "Failed to obtain physical address: %d\n", ret);
+		goto out;
+	}
+
+	st->clk = devm_memremap(dev, st->res.start, resource_size(&st->res),
+				MEMREMAP_WB | MEMREMAP_DEC);
+	if (IS_ERR(st->clk)) {
+		ret = PTR_ERR(st->clk);
+		dev_info(dev, "failed to map shared memory\n");
+		st->clk = NULL;
+		goto out;
+	}
+
+	if (st->clk->magic != VMCLOCK_MAGIC ||
+	    st->clk->size < sizeof(*st->clk) ||
+	    st->clk->version != 1) {
+		dev_info(dev, "vmclock magic fields invalid\n");
+		ret = -EINVAL;
+		goto out;
+	}
+
+	ret = ida_alloc(&vmclock_ida, GFP_KERNEL);
+	if (ret < 0)
+		goto out;
+
+	st->index = ret;
+        ret = devm_add_action_or_reset(&pdev->dev, vmclock_put_idx, st);
+	if (ret)
+		goto out;
+
+	st->name = devm_kasprintf(&pdev->dev, GFP_KERNEL, "vmclock%d", st->index);
+	if (!st->name) {
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	/* If the structure is big enough, it can be mapped to userspace */
+	if (st->clk->size >= PAGE_SIZE) {
+		st->miscdev.minor = MISC_DYNAMIC_MINOR;
+		st->miscdev.fops = &vmclock_miscdev_fops;
+		st->miscdev.name = st->name;
+
+		ret = misc_register(&st->miscdev);
+		if (ret)
+			goto out;
+	}
+
+	/* If there is valid clock information, register a PTP clock */
+	if (IS_ENABLED(CONFIG_ARM64) &&
+	    st->clk->counter_id == VMCLOCK_COUNTER_ARM_VCNT) {
+		/* Can we check it's the virtual counter? */
+		st->cs_id = CSID_ARM_ARCH_COUNTER;
+	} else if (IS_ENABLED(CONFIG_X86) &&
+		   st->clk->counter_id == VMCLOCK_COUNTER_X86_TSC) {
+		st->cs_id = CSID_X86_TSC;
+	}
+
+	/* Only UTC, or TAI with offset */
+	if (!tai_adjust(st->clk, NULL)) {
+		dev_info(dev, "vmclock does not provide unambiguous UTC\n");
+		st->cs_id = CSID_GENERIC;
+	}
+
+	if (st->cs_id) {
+		st->sys_cs_id = st->cs_id;
+
+		st->ptp_clock_info = ptp_vmclock_info;
+		strscpy(st->ptp_clock_info.name, st->name);
+		st->ptp_clock = ptp_clock_register(&st->ptp_clock_info, dev);
+
+		if (IS_ERR(st->ptp_clock)) {
+			ret = PTR_ERR(st->ptp_clock);
+			st->ptp_clock = NULL;
+			vmclock_remove(pdev);
+			goto out;
+		}
+	} else if (!st->miscdev.minor) {
+		/* Neither miscdev nor PTP registered */
+		dev_info(dev, "vmclock: Neither miscdev nor PTP available; not registering\n");
+		ret = -ENODEV;
+		goto out;
+	}
+
+	dev_info(dev, "%s: registered %s%s%s\n", st->name,
+		 st->miscdev.minor ? "miscdev" : "",
+		 (st->miscdev.minor && st->ptp_clock) ? ", " : "",
+		 st->ptp_clock ? "PTP" : "");
+
+	dev_set_drvdata(dev, st);
+
+ out:
+	return ret;
+}
+
+static const struct acpi_device_id vmclock_acpi_ids[] = {
+	{ "VMCLOCK", 0 },
+	{}
+};
+MODULE_DEVICE_TABLE(acpi, vmclock_acpi_ids);
+
+static struct platform_driver vmclock_platform_driver = {
+	.probe		= vmclock_probe,
+	.remove_new	= vmclock_remove,
+	.driver	= {
+		.name	= "vmclock",
+		.acpi_match_table = vmclock_acpi_ids,
+	},
+};
+
+module_platform_driver(vmclock_platform_driver)
+
+MODULE_AUTHOR("David Woodhouse <dwmw2@infradead.org>");
+MODULE_DESCRIPTION("PTP clock using VMCLOCK");
+MODULE_LICENSE("GPL v2");
diff --git a/include/uapi/linux/vmclock-abi.h b/include/uapi/linux/vmclock-abi.h
new file mode 100644
index 000000000000..3bde10ddec38
--- /dev/null
+++ b/include/uapi/linux/vmclock-abi.h
@@ -0,0 +1,187 @@ 
+/* SPDX-License-Identifier: ((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause) */
+
+/*
+ * This structure provides a vDSO-style clock to VM guests, exposing the
+ * relationship (or lack thereof) between the CPU clock (TSC, timebase, arch
+ * counter, etc.) and real time. It is designed to address the problem of
+ * live migration, which other clock enlightenments do not.
+ *
+ * When a guest is live migrated, this affects the clock in two ways.
+ *
+ * First, even between identical hosts the actual frequency of the underlying
+ * counter will change within the tolerances of its specification (typically
+ * ±50PPM, or 4 seconds a day). This frequency also varies over time on the
+ * same host, but can be tracked by NTP as it generally varies slowly. With
+ * live migration there is a step change in the frequency, with no warning.
+ *
+ * Second, there may be a step change in the value of the counter itself, as
+ * its accuracy is limited by the precision of the NTP synchronization on the
+ * source and destination hosts.
+ *
+ * So any calibration (NTP, PTP, etc.) which the guest has done on the source
+ * host before migration is invalid, and needs to be redone on the new host.
+ *
+ * In its most basic mode, this structure provides only an indication to the
+ * guest that live migration has occurred. This allows the guest to know that
+ * its clock is invalid and take remedial action. For applications that need
+ * reliable accurate timestamps (e.g. distributed databases), the structure
+ * can be mapped all the way to userspace. This allows the application to see
+ * directly for itself that the clock is disrupted and take appropriate
+ * action, even when using a vDSO-style method to get the time instead of a
+ * system call.
+ *
+ * In its more advanced mode. this structure can also be used to expose the
+ * precise relationship of the CPU counter to real time, as calibrated by the
+ * host. This means that userspace applications can have accurate time
+ * immediately after live migration, rather than having to pause operations
+ * and wait for NTP to recover. This mode does, of course, rely on the
+ * counter being reliable and consistent across CPUs.
+ *
+ * Note that this must be true UTC, never with smeared leap seconds. If a
+ * guest wishes to construct a smeared clock, it can do so. Presenting a
+ * smeared clock through this interface would be problematic because it
+ * actually messes with the apparent counter *period*. A linear smearing
+ * of 1 ms per second would effectively tweak the counter period by 1000PPM
+ * at the start/end of the smearing period, while a sinusoidal smear would
+ * basically be impossible to represent.
+ *
+ * This structure is offered with the intent that it be adopted into the
+ * nascent virtio-rtc standard, as a virtio-rtc that does not address the live
+ * migration problem seems a little less than fit for purpose. For that
+ * reason, certain fields use precisely the same numeric definitions as in
+ * the virtio-rtc proposal. The structure can also be exposed through an ACPI
+ * device with the CID "VMCLOCK", modelled on the "VMGENID" device except for
+ * the fact that it uses a real _CRS to convey the address of the structure
+ * (which should be a full page, to allow for mapping directly to userspace).
+ */
+
+#ifndef __VMCLOCK_ABI_H__
+#define __VMCLOCK_ABI_H__
+
+#ifdef __KERNEL__
+#include <linux/types.h>
+#else
+#include <stdint.h>
+#endif
+
+struct vmclock_abi {
+	/* CONSTANT FIELDS */
+	uint32_t magic;
+#define VMCLOCK_MAGIC	0x4b4c4356 /* "VCLK" */
+	uint32_t size;		/* Size of region containing this structure */
+	uint16_t version;	/* 1 */
+	uint8_t counter_id; /* Matches VIRTIO_RTC_COUNTER_xxx except INVALID */
+#define VMCLOCK_COUNTER_ARM_VCNT	0
+#define VMCLOCK_COUNTER_X86_TSC		1
+#define VMCLOCK_COUNTER_INVALID		0xff
+	uint8_t time_type; /* Matches VIRTIO_RTC_TYPE_xxx */
+#define VMCLOCK_TIME_UTC			0	/* Since 1970-01-01 00:00:00z */
+#define VMCLOCK_TIME_TAI			1	/* Since 1970-01-01 00:00:00z */
+#define VMCLOCK_TIME_MONOTONIC			2	/* Since undefined epoch */
+#define VMCLOCK_TIME_INVALID_SMEARED		3	/* Not supported */
+#define VMCLOCK_TIME_INVALID_MAYBE_SMEARED	4	/* Not supported */
+
+	/* NON-CONSTANT FIELDS PROTECTED BY SEQCOUNT LOCK */
+	uint32_t seq_count;	/* Low bit means an update is in progress */
+	/*
+	 * This field changes to another non-repeating value when the CPU
+	 * counter is disrupted, for example on live migration. This lets
+	 * the guest know that it should discard any calibration it has
+	 * performed of the counter against external sources (NTP/PTP/etc.).
+	 */
+	uint64_t disruption_marker;
+	uint64_t flags;
+	/* Indicates that the tai_offset_sec field is valid */
+#define VMCLOCK_FLAG_TAI_OFFSET_VALID		(1 << 0)
+	/*
+	 * Optionally used to notify guests of pending maintenance events.
+	 * A guest which provides latency-sensitive services may wish to
+	 * remove itself from service if an event is coming up. Two flags
+	 * indicate the approximate imminence of the event.
+	 */
+#define VMCLOCK_FLAG_DISRUPTION_SOON		(1 << 1) /* About a day */
+#define VMCLOCK_FLAG_DISRUPTION_IMMINENT	(1 << 2) /* About an hour */
+#define VMCLOCK_FLAG_PERIOD_ESTERROR_VALID	(1 << 3)
+#define VMCLOCK_FLAG_PERIOD_MAXERROR_VALID	(1 << 4)
+#define VMCLOCK_FLAG_TIME_ESTERROR_VALID	(1 << 5)
+#define VMCLOCK_FLAG_TIME_MAXERROR_VALID	(1 << 6)
+	/*
+	 * If the MONOTONIC flag is set then (other than leap seconds) it is
+	 * guaranteed that the time calculated according this structure at
+	 * any given moment shall never appear to be later than the time
+	 * calculated via the structure at any *later* moment.
+	 *
+	 * In particular, a timestamp based on a counter reading taken
+	 * immediately after setting the low bit of seq_count (and the
+	 * associated memory barrier), using the previously-valid time and
+	 * period fields, shall never be later than a timestamp based on
+	 * a counter reading taken immediately before *clearing* the low
+	 * bit again after the update, using the about-to-be-valid fields.
+	 */
+#define VMCLOCK_FLAG_TIME_MONOTONIC		(1 << 7)
+
+	uint8_t pad[2];
+	uint8_t clock_status;
+#define VMCLOCK_STATUS_UNKNOWN		0
+#define VMCLOCK_STATUS_INITIALIZING	1
+#define VMCLOCK_STATUS_SYNCHRONIZED	2
+#define VMCLOCK_STATUS_FREERUNNING	3
+#define VMCLOCK_STATUS_UNRELIABLE	4
+
+	/*
+	 * The time exposed through this device is never smeared. This field
+	 * corresponds to the 'subtype' field in virtio-rtc, which indicates
+	 * the smearing method. However in this case it provides a *hint* to
+	 * the guest operating system, such that *if* the guest OS wants to
+	 * provide its users with an alternative clock which does not follow
+	 * UTC, it may do so in a fashion consistent with the other systems
+	 * in the nearby environment.
+	 */
+	uint8_t leap_second_smearing_hint; /* Matches VIRTIO_RTC_SUBTYPE_xxx */
+#define VMCLOCK_SMEARING_STRICT		0
+#define VMCLOCK_SMEARING_NOON_LINEAR	1
+#define VMCLOCK_SMEARING_UTC_SLS	2
+	int16_t tai_offset_sec;
+	uint8_t leap_indicator;
+	/*
+	 * This field is based on the the VIRTIO_RTC_LEAP_xxx values as
+	 * defined in the current draft of virtio-rtc, but since smearing
+	 * cannot be used with the shared memory device, some values are
+	 * not used.
+	 *
+	 * The _POST_POS and _POST_NEG values allow the guest to perform
+	 * its own smearing during the day or so after a leap second when
+	 * such smearing may need to continue being applied for a leap
+	 * second which is now theoretically "historical".
+	 */
+#define VMCLOCK_LEAP_NONE	0x00	/* No known nearby leap second */
+#define VMCLOCK_LEAP_PRE_POS	0x01	/* Positive leap second at EOM */
+#define VMCLOCK_LEAP_PRE_NEG	0x02	/* Negative leap second at EOM */
+#define VMCLOCK_LEAP_POS	0x03	/* Set during 23:59:60 second */
+#define VMCLOCK_LEAP_POST_POS	0x04
+#define VMCLOCK_LEAP_POST_NEG	0x05
+
+	/* Bit shift for counter_period_frac_sec and its error rate */
+	uint8_t counter_period_shift;
+	/*
+	 * Paired values of counter and UTC at a given point in time.
+	 */
+	uint64_t counter_value;
+	/*
+	 * Counter frequency, and error margin. The unit of these fields is
+	 * seconds >> (64 + counter_period_shift)
+	 */
+	uint64_t counter_period_frac_sec;
+	uint64_t counter_period_esterror_rate_frac_sec;
+	uint64_t counter_period_maxerror_rate_frac_sec;
+
+	/*
+	 * Time according to time_type field above.
+	 */
+	uint64_t time_sec;		/* Seconds since time_type epoch */
+	uint64_t time_frac_sec;		/* (seconds >> 64) */
+	uint64_t time_esterror_nanosec;
+	uint64_t time_maxerror_nanosec;
+};
+
+#endif /*  __VMCLOCK_ABI_H__ */