mbox series

[RFC,v3,0/7] Add virtio_rtc module and related changes

Message ID 20231218073849.35294-1-peter.hilber@opensynergy.com (mailing list archive)
Headers show
Series Add virtio_rtc module and related changes | expand

Message

Peter Hilber Dec. 18, 2023, 7:38 a.m. UTC
RFC v3 updates
--------------

This series implements a driver for a virtio-rtc device conforming to spec
RFC v3 [1]. It now includes an RTC class driver with alarm, in addition to
the PTP clock driver already present before.

This patch series depends on the patch series "treewide: Use clocksource id
for get_device_system_crosststamp()" [3]. Pull [4] to get the combined
series on top of mainline.

Overview
--------

This patch series adds the virtio_rtc module, and related bugfixes. The
virtio_rtc module implements a driver compatible with the proposed Virtio
RTC device specification [1]. The Virtio RTC (Real Time Clock) device
provides information about current time. The device can provide different
clocks, e.g. for the UTC or TAI time standards, or for physical time
elapsed since some past epoch. The driver can read the clocks with simple
or more accurate methods. Optionally, the driver can set an alarm.

The series first fixes some bugs in the get_device_system_crosststamp()
interpolation code, which is required for reliable virtio_rtc operation.
Then, add the virtio_rtc implementation.

For the Virtio RTC device, there is currently a proprietary implementation,
which has been used for provisional testing.

PTP clock interface
-------------------

virtio_rtc exposes clocks as PTP clocks to userspace, similar to ptp_kvm.
If both the Virtio RTC device and this driver have special support for the
current clocksource, time synchronization programs can use
cross-timestamping using ioctl PTP_SYS_OFFSET_PRECISE2 aka
PTP_SYS_OFFSET_PRECISE. Similar to ptp_kvm, system time synchronization
with single-digit ns precision is possible with a quiescent reference clock
(from the Virtio RTC device). This works even when the Virtio device
response is slow compared to ptp_kvm hypercalls.

The following illustrates a test using PTP_SYS_OFFSET_PRECISE, with
interspersed strace log and chrony [2] refclocks log, on arm64. In the
example, chrony tracks a virtio_rtc PTP clock ("PHCV", /dev/ptp0). The raw
offset between the virtio_rtc clock and CLOCK_REALTIME is 0 to 1 ns. At the
device side, the Virtio RTC device artificially delays both the clock read
request, and the response, by 50 ms. Cross-timestamp interpolation still
works with this delay. chrony also monitors a ptp_kvm clock ("PHCK",
/dev/ptp3) for comparison, which yields a similar offset.

	ioctl(5</dev/ptp3>, PTP_SYS_OFFSET_PRECISE, 0xffffe86691c8) = 0 <0.000329>
	===============================================================================
	   Date (UTC) Time         Refid  DP L P  Raw offset   Cooked offset      Disp.
	===============================================================================
	2023-06-29 18:49:55.595742 PHCK    0 N 0  1.000000e-09  8.717931e-10  5.500e-08
	2023-06-29 18:49:55.595742 PHCK    - N -       -        8.717931e-10  5.500e-08
	ioctl(6</dev/ptp0>, PTP_SYS_OFFSET_PRECISE, 0xffffe86691c8) = 0 <0.101545>
	2023-06-29 18:49:56.147766 PHCV    0 N 0  1.000000e-09  8.801870e-10  5.500e-08
	2023-06-29 18:49:56.147766 PHCV    - N -       -        8.801870e-10  5.500e-08
	ioctl(5</dev/ptp3>, PTP_SYS_OFFSET_PRECISE, 0xffffe86691c8) = 0 <0.000195>
	2023-06-29 18:49:56.202446 PHCK    0 N 0  1.000000e-09  7.364180e-10  5.500e-08
	2023-06-29 18:49:56.202446 PHCK    - N -       -        7.364180e-10  5.500e-08
	ioctl(6</dev/ptp0>, PTP_SYS_OFFSET_PRECISE, 0xffffe86691c8) = 0 <0.101484>
	2023-06-29 18:49:56.754641 PHCV    0 N 0  0.000000e+00 -2.617368e-10  5.500e-08
	2023-06-29 18:49:56.754641 PHCV    - N -       -       -2.617368e-10  5.500e-08
	ioctl(5</dev/ptp3>, PTP_SYS_OFFSET_PRECISE, 0xffffe86691c8) = 0 <0.000270>
	2023-06-29 18:49:56.809282 PHCK    0 N 0  1.000000e-09  7.779321e-10  5.500e-08
	2023-06-29 18:49:56.809282 PHCK    - N -       -        7.779321e-10  5.500e-08
	ioctl(6</dev/ptp0>, PTP_SYS_OFFSET_PRECISE, 0xffffe86691c8) = 0 <0.101510>
	2023-06-29 18:49:57.361376 PHCV    0 N 0  0.000000e+00 -2.198794e-10  5.500e-08
	2023-06-29 18:49:57.361376 PHCV    - N -       -       -2.198794e-10  5.500e-08

This patch series only adds special support for the Arm Generic Timer
clocksource. At the driver side, it should be easy to support more
clocksources.

Fallback PTP clock interface
----------------------------

Without special support for the current clocksource, time synchronization
programs can still use ioctl PTP_SYS_OFFSET_EXTENDED2 aka
PTP_SYS_OFFSET_EXTENDED. In this case, precision will generally be worse
and will depend on the Virtio device response characteristics.

The following illustrates a test using PTP_SYS_OFFSET_EXTENDED, with
interspersed strace log and chrony refclocks log, on x86-64 (with `ts'
values omitted):

	ioctl(5, PTP_SYS_OFFSET_EXTENDED, {n_samples=10, ts=OMITTED}) = 0
	===============================================================================
	   Date (UTC) Time         Refid  DP L P  Raw offset   Cooked offset      Disp.
	===============================================================================
	2023-06-28 14:11:26.697782 PHCV    0 N 0  3.318200e-05  3.450891e-05  4.611e-06
	2023-06-28 14:11:26.697782 PHCV    - N -       -        3.450891e-05  4.611e-06
	ioctl(5, PTP_SYS_OFFSET_EXTENDED, {n_samples=10, ts=OMITTED}) = 0
	2023-06-28 14:11:27.208763 PHCV    0 N 0 -3.792800e-05 -4.023965e-05  4.611e-06
	2023-06-28 14:11:27.208763 PHCV    - N -       -       -4.023965e-05  4.611e-06
	ioctl(5, PTP_SYS_OFFSET_EXTENDED, {n_samples=10, ts=OMITTED}) = 0
	2023-06-28 14:11:27.722818 PHCV    0 N 0 -3.328600e-05 -3.134404e-05  4.611e-06
	2023-06-28 14:11:27.722818 PHCV    - N -       -       -3.134404e-05  4.611e-06
	ioctl(5, PTP_SYS_OFFSET_EXTENDED, {n_samples=10, ts=OMITTED}) = 0
	2023-06-28 14:11:28.233572 PHCV    0 N 0 -4.966900e-05 -4.584331e-05  4.611e-06
	2023-06-28 14:11:28.233572 PHCV    - N -       -       -4.584331e-05  4.611e-06
	ioctl(5, PTP_SYS_OFFSET_EXTENDED, {n_samples=10, ts=OMITTED}) = 0
	2023-06-28 14:11:28.742737 PHCV    0 N 0  4.902700e-05  5.361388e-05  4.611e-06
	2023-06-28 14:11:28.742737 PHCV    - N -       -        5.361388e-05  4.611e-06

PTP clock setup
---------------

The following udev rule can be used to get a symlink /dev/ptp_virtio to the
UTC clock:

	SUBSYSTEM=="ptp", ATTR{clock_name}=="Virtio PTP UTC", SYMLINK += "ptp_virtio"

The following chrony configuration directive can then be added in
/etc/chrony/chrony.conf to synchronize to the Virtio UTC clock:

	refclock PHC /dev/ptp_virtio refid PHCV poll -1 dpoll -1

RTC interface
-------------

This patch series adds virtio_rtc as a generic Virtio driver, including
both a PTP clock driver and an RTC class driver.

Feedback is greatly appreciated.

[1] https://lore.kernel.org/virtio-comment/20231218064253.9734-1-peter.hilber@opensynergy.com/
[2] https://chrony.tuxfamily.org/
[3] https://lore.kernel.org/lkml/20231215220612.173603-1-peter.hilber@opensynergy.com/
[4] https://github.com/OpenSynergy/linux.git virtio-rtc-v3-on-master

v3:

- Update to conform to virtio spec RFC v3 (no significant behavioral
  changes).

- Add RTC class driver with alarm according to virtio spec RFC v3.

- For cross-timestamp corner case fix, switch back to v1 style closed
  interval test (Thomas Gleixner).

v2:

- Depend on patch series "treewide: Use clocksource id for
  get_device_system_crosststamp()" to avoid requiring a clocksource pointer
  with get_device_system_crosststamp().

- Assume Arm Generic Timer will use CP15 virtual counter. Drop
  arm_arch_timer helper functions (Marc Zyngier).

- Improve cross-timestamp fixes problem description and implementation
  (John Stultz).


Peter Hilber (7):
  timekeeping: Fix cross-timestamp interpolation on counter wrap
  timekeeping: Fix cross-timestamp interpolation corner case decision
  timekeeping: Fix cross-timestamp interpolation for non-x86
  virtio_rtc: Add module and driver core
  virtio_rtc: Add PTP clocks
  virtio_rtc: Add Arm Generic Timer cross-timestamping
  virtio_rtc: Add RTC class driver

 MAINTAINERS                          |    7 +
 drivers/virtio/Kconfig               |   62 ++
 drivers/virtio/Makefile              |    5 +
 drivers/virtio/virtio_rtc_arm.c      |   22 +
 drivers/virtio/virtio_rtc_class.c    |  269 +++++
 drivers/virtio/virtio_rtc_driver.c   | 1357 ++++++++++++++++++++++++++
 drivers/virtio/virtio_rtc_internal.h |  122 +++
 drivers/virtio/virtio_rtc_ptp.c      |  342 +++++++
 include/uapi/linux/virtio_rtc.h      |  230 +++++
 kernel/time/timekeeping.c            |   24 +-
 10 files changed, 2428 insertions(+), 12 deletions(-)
 create mode 100644 drivers/virtio/virtio_rtc_arm.c
 create mode 100644 drivers/virtio/virtio_rtc_class.c
 create mode 100644 drivers/virtio/virtio_rtc_driver.c
 create mode 100644 drivers/virtio/virtio_rtc_internal.h
 create mode 100644 drivers/virtio/virtio_rtc_ptp.c
 create mode 100644 include/uapi/linux/virtio_rtc.h


base-commit: 2c41b211d72c1eb350c7629a8c85234fef0d12c1

Comments

David Woodhouse March 7, 2024, 2:02 p.m. UTC | #1
On Mon, 2023-12-18 at 08:38 +0100, Peter Hilber wrote:
> RFC v3 updates
> --------------
> 
> This series implements a driver for a virtio-rtc device conforming to spec
> RFC v3 [1]. It now includes an RTC class driver with alarm, in addition to
> the PTP clock driver already present before.
> 
> This patch series depends on the patch series "treewide: Use clocksource id
> for get_device_system_crosststamp()" [3]. Pull [4] to get the combined
> series on top of mainline.
> 
> Overview
> --------
> 
> This patch series adds the virtio_rtc module, and related bugfixes. The
> virtio_rtc module implements a driver compatible with the proposed Virtio
> RTC device specification [1]. The Virtio RTC (Real Time Clock) device
> provides information about current time. The device can provide different
> clocks, e.g. for the UTC or TAI time standards, or for physical time
> elapsed since some past epoch. 

Hm, should we allow UTC? If you tell me the time in UTC, then
(sometimes) I still don't actually know what the time is, because some
UTC seconds occur twice. UTC only makes sense if you provide the TAI
offset, surely? Should the virtio_rtc specification make it mandatory
to provide such?

Otherwise you're just designing it to allow crappy hypervisors to
expose incomplete information.

> PTP clock interface
> -------------------
> 
> virtio_rtc exposes clocks as PTP clocks to userspace, similar to ptp_kvm.
> If both the Virtio RTC device and this driver have special support for the
> current clocksource, time synchronization programs can use
> cross-timestamping using ioctl PTP_SYS_OFFSET_PRECISE2 aka
> PTP_SYS_OFFSET_PRECISE. Similar to ptp_kvm, system time synchronization
> with single-digit ns precision is possible with a quiescent reference clock
> (from the Virtio RTC device). This works even when the Virtio device
> response is slow compared to ptp_kvm hypercalls.

Is PTP the right mechanism for this? As I understand it, PTP is a way
to precisely synchronize one clock with another. But in the case of
virt guests synchronizing against the host, it isn't really *another*
clock. It really is the *same* underlying clock. As the host clock
varies with temperature, for example, so does the guest clock. The only
difference is an offset and (on x86 perhaps) a mathematical scaling of
the frequency.

I was looking at this another way, when I came across this virtio-rtc
work.

My idea was just for the hypervisor to expose its own timekeeping
information — the counter/TSC value and TAI time at a given moment,
frequency of the counter, and the precision of both that frequency
(±PPM) and the TAI timestamp (±µs).

By putting that in a host/guest shared data structure with a seqcount
for lockless updates, we can update it as time synchronization on the
host is refined, and we can even cleanly handle live migration where
the guest ends up on a completely different host. It allows for use
cases which *really* care (e.g. timestamping financial transactions) to
ensure that there is never even a moment of getting *wrong* timestamps
if they haven't yet resynced after a migration.

Now I'm trying to work out if I should attempt to reconcile with your
existing virtio-rtc work, or just decide that virtio-rtc isn't trying
to solve the actual problem that we have, and go ahead with something
different... ?
Peter Hilber March 8, 2024, 10:32 a.m. UTC | #2
On 07.03.24 15:02, David Woodhouse wrote:
> On Mon, 2023-12-18 at 08:38 +0100, Peter Hilber wrote:
>> RFC v3 updates
>> --------------
>>
>> This series implements a driver for a virtio-rtc device conforming to spec
>> RFC v3 [1]. It now includes an RTC class driver with alarm, in addition to
>> the PTP clock driver already present before.
>>
>> This patch series depends on the patch series "treewide: Use clocksource id
>> for get_device_system_crosststamp()" [3]. Pull [4] to get the combined
>> series on top of mainline.
>>
>> Overview
>> --------
>>
>> This patch series adds the virtio_rtc module, and related bugfixes. The
>> virtio_rtc module implements a driver compatible with the proposed Virtio
>> RTC device specification [1]. The Virtio RTC (Real Time Clock) device
>> provides information about current time. The device can provide different
>> clocks, e.g. for the UTC or TAI time standards, or for physical time
>> elapsed since some past epoch. 
> 
> Hm, should we allow UTC? If you tell me the time in UTC, then
> (sometimes) I still don't actually know what the time is, because some
> UTC seconds occur twice. UTC only makes sense if you provide the TAI
> offset, surely? Should the virtio_rtc specification make it mandatory
> to provide such?
> 
> Otherwise you're just designing it to allow crappy hypervisors to
> expose incomplete information.
> 

Hi David,

(adding virtio-comment@lists.oasis-open.org for spec discussion),

thank you for your insightful comments. I think I take a broadly similar
view. The reason why the current spec and driver is like this is that I
took a pragmatic approach at first and only included features which work
out-of-the-box for the current Linux ecosystem.

The current virtio_rtc features work similar to ptp_kvm, and therefore can
work out-of-the-box with time sync daemons such as chrony.

As of RFC spec v3, UTC clock only is allowed. If mandating a TAI clock as
well, I am afraid that

- in some (embedded) scenarios, the TAI clock may not be available

- crappy hypervisors will pass off the UTC clock as the TAI clock.

For the same reasons, I am also not sure about adding a *mandatory* TAI
offset to each readout. I don't know user-space software which would
leverage this already (at least not through the PTP clock interface). And
why would such software not go straight for the TAI clock instead?

How about adding a requirement to the spec that the virtio-rtc device
SHOULD expose the TAI clock whenever it is available - would this address
your concerns?

>> PTP clock interface
>> -------------------
>>
>> virtio_rtc exposes clocks as PTP clocks to userspace, similar to ptp_kvm.
>> If both the Virtio RTC device and this driver have special support for the
>> current clocksource, time synchronization programs can use
>> cross-timestamping using ioctl PTP_SYS_OFFSET_PRECISE2 aka
>> PTP_SYS_OFFSET_PRECISE. Similar to ptp_kvm, system time synchronization
>> with single-digit ns precision is possible with a quiescent reference clock
>> (from the Virtio RTC device). This works even when the Virtio device
>> response is slow compared to ptp_kvm hypercalls.
> 
> Is PTP the right mechanism for this? As I understand it, PTP is a way
> to precisely synchronize one clock with another. But in the case of
> virt guests synchronizing against the host, it isn't really *another*
> clock. It really is the *same* underlying clock. As the host clock
> varies with temperature, for example, so does the guest clock. The only
> difference is an offset and (on x86 perhaps) a mathematical scaling of
> the frequency.
> 
> I was looking at this another way, when I came across this virtio-rtc
> work.
> 
> My idea was just for the hypervisor to expose its own timekeeping
> information — the counter/TSC value and TAI time at a given moment,
> frequency of the counter, and the precision of both that frequency
> (±PPM) and the TAI timestamp (±µs).
> 
> By putting that in a host/guest shared data structure with a seqcount
> for lockless updates, we can update it as time synchronization on the
> host is refined, and we can even cleanly handle live migration where
> the guest ends up on a completely different host. It allows for use
> cases which *really* care (e.g. timestamping financial transactions) to
> ensure that there is never even a moment of getting *wrong* timestamps
> if they haven't yet resynced after a migration.

I considered a similar approach as well, but integrating that with the
kernel timekeeping seemed too much effort for the first step. However,
reading the clock from user space would be much simpler.

> 
> Now I'm trying to work out if I should attempt to reconcile with your
> existing virtio-rtc work, or just decide that virtio-rtc isn't trying
> to solve the actual problem that we have, and go ahead with something
> different... ?
> 

We are certainly interested into the discussed, say, "virtual timekeeper"
mechanism, which would also solve a lot of problems for us (especially if
it would be integrated with kernel timekeeping). Even without Linux kernel
timekeeping, the virtual timekeeper would be useful to us for guests with
simpler timekeeping, and potentially for user space applications.

Our current intent is to at first try to upstream the current (RFC spec v3)
feature set. I think the virtual timekeeper would be suitable as an
optional feature of virtio_rtc (with Virtio, this could easily be added
after initial upstreaming). It is also possible to have a virtio-rtc device
only implement the virtual timekeeper, but not the other clock reading
methods, if these are of no interest.

Best regards,

Peter
David Woodhouse March 8, 2024, 12:33 p.m. UTC | #3
On Fri, 2024-03-08 at 11:32 +0100, Peter Hilber wrote:
> On 07.03.24 15:02, David Woodhouse wrote:
> > On Mon, 2023-12-18 at 08:38 +0100, Peter Hilber wrote:
> > > RFC v3 updates
> > > --------------
> > > 
> > > This series implements a driver for a virtio-rtc device conforming to spec
> > > RFC v3 [1]. It now includes an RTC class driver with alarm, in addition to
> > > the PTP clock driver already present before.
> > > 
> > > This patch series depends on the patch series "treewide: Use clocksource id
> > > for get_device_system_crosststamp()" [3]. Pull [4] to get the combined
> > > series on top of mainline.
> > > 
> > > Overview
> > > --------
> > > 
> > > This patch series adds the virtio_rtc module, and related bugfixes. The
> > > virtio_rtc module implements a driver compatible with the proposed Virtio
> > > RTC device specification [1]. The Virtio RTC (Real Time Clock) device
> > > provides information about current time. The device can provide different
> > > clocks, e.g. for the UTC or TAI time standards, or for physical time
> > > elapsed since some past epoch. 
> > 
> > Hm, should we allow UTC? If you tell me the time in UTC, then
> > (sometimes) I still don't actually know what the time is, because some
> > UTC seconds occur twice. UTC only makes sense if you provide the TAI
> > offset, surely? Should the virtio_rtc specification make it mandatory
> > to provide such?
> > 
> > Otherwise you're just designing it to allow crappy hypervisors to
> > expose incomplete information.
> > 
> 
> Hi David,
> 
> (adding virtio-comment@lists.oasis-open.org for spec discussion),
> 
> thank you for your insightful comments. I think I take a broadly similar
> view. The reason why the current spec and driver is like this is that I
> took a pragmatic approach at first and only included features which work
> out-of-the-box for the current Linux ecosystem.
> 
> The current virtio_rtc features work similar to ptp_kvm, and therefore can
> work out-of-the-box with time sync daemons such as chrony.
> 
> As of RFC spec v3, UTC clock only is allowed. If mandating a TAI clock as
> well, I am afraid that
> 
> - in some (embedded) scenarios, the TAI clock may not be available
> 
> - crappy hypervisors will pass off the UTC clock as the TAI clock.
> 
> For the same reasons, I am also not sure about adding a *mandatory* TAI
> offset to each readout. I don't know user-space software which would
> leverage this already (at least not through the PTP clock interface). And
> why would such software not go straight for the TAI clock instead?
> 
> How about adding a requirement to the spec that the virtio-rtc device
> SHOULD expose the TAI clock whenever it is available - would this address
> your concerns?

I think that would be too easy for implementors to miss, or decide not
to obey. Or to get *wrong*, by exposing a TAI clock but actually
putting UTC in it.

I think I prefer to mandate the tai_offset field with the UTC clock.
Crappy implementations will just set it to zero, but at least that
gives a clear signal to the guests that it's *their* problem to
resolve.




> > > PTP clock interface
> > > -------------------
> > > 
> > > virtio_rtc exposes clocks as PTP clocks to userspace, similar to ptp_kvm.
> > > If both the Virtio RTC device and this driver have special support for the
> > > current clocksource, time synchronization programs can use
> > > cross-timestamping using ioctl PTP_SYS_OFFSET_PRECISE2 aka
> > > PTP_SYS_OFFSET_PRECISE. Similar to ptp_kvm, system time synchronization
> > > with single-digit ns precision is possible with a quiescent reference clock
> > > (from the Virtio RTC device). This works even when the Virtio device
> > > response is slow compared to ptp_kvm hypercalls.
> > 
> > Is PTP the right mechanism for this? As I understand it, PTP is a way
> > to precisely synchronize one clock with another. But in the case of
> > virt guests synchronizing against the host, it isn't really *another*
> > clock. It really is the *same* underlying clock. As the host clock
> > varies with temperature, for example, so does the guest clock. The only
> > difference is an offset and (on x86 perhaps) a mathematical scaling of
> > the frequency.
> > 
> > I was looking at this another way, when I came across this virtio-rtc
> > work.
> > 
> > My idea was just for the hypervisor to expose its own timekeeping
> > information — the counter/TSC value and TAI time at a given moment,
> > frequency of the counter, and the precision of both that frequency
> > (±PPM) and the TAI timestamp (±µs).
> > 
> > By putting that in a host/guest shared data structure with a seqcount
> > for lockless updates, we can update it as time synchronization on the
> > host is refined, and we can even cleanly handle live migration where
> > the guest ends up on a completely different host. It allows for use
> > cases which *really* care (e.g. timestamping financial transactions) to
> > ensure that there is never even a moment of getting *wrong* timestamps
> > if they haven't yet resynced after a migration.
> 
> I considered a similar approach as well, but integrating that with the
> kernel timekeeping seemed too much effort for the first step. However,
> reading the clock from user space would be much simpler.

Right. In fact my *first* use case was userspace, specifically in the
context of https://github.com/aws/clock-bound — but anything we design
for this absolutely has to be usable for kernel timekeeping too.

It's also critical to solve the Live Migration problem.

But is it so hard to integrate into the kernel timekeeping? My plan
would have given us effectively an infinite number of cross-reads of
the realtime clock vs. TSC. You don't have to actually read from a
virtio device; you just read the TSC and do the maths, using the values
in the shared memory region. Couldn't that be used to present a PTP
device to the guest kernel just the same as you do at the moment?

You could probably even simulate PPS with it. Typically with PPS we
have to catch the hardware interrupt and then read the TSC as soon as
possible thereafter. With this, you'd be able to *calculate* the TSC
value at the start of the next second, and wouldn't have to suffer the
real hardware latency :)

> > 
> > Now I'm trying to work out if I should attempt to reconcile with your
> > existing virtio-rtc work, or just decide that virtio-rtc isn't trying
> > to solve the actual problem that we have, and go ahead with something
> > different... ?
> > 
> 
> We are certainly interested into the discussed, say, "virtual timekeeper"
> mechanism, which would also solve a lot of problems for us (especially if
> it would be integrated with kernel timekeeping). Even without Linux kernel
> timekeeping, the virtual timekeeper would be useful to us for guests with
> simpler timekeeping, and potentially for user space applications.
> 
> Our current intent is to at first try to upstream the current (RFC spec v3)
> feature set. I think the virtual timekeeper would be suitable as an
> optional feature of virtio_rtc (with Virtio, this could easily be added
> after initial upstreaming). It is also possible to have a virtio-rtc device
> only implement the virtual timekeeper, but not the other clock reading
> methods, if these are of no interest.

Yeah, that might make sense. I was thinking of a simple ACPI/DT device
exposing a page of memory and *maybe* an interrupt for when an update
happens. (With the caveat that the interrupt would always occur too
late by definition, so it's no substitute for using the seqlock
correctly in applications that *really* care and are going to get fined
millions of dollars for mis-timestamping their transactions.)

But using the virtio-rtc device as the vehicle for that shared memory
page is reasonable too. It's not even mutually exclusive; we could
expose the *same* data structure in memory via whatever mechanisms we
wanted.

One other thing to note is I think we're being very naïve about the TSC
on x86 hosts. Theoretically, the TSC for every vCPU might run at a
different frequency, and even if they run at the same frequency they
might be offset from each other. I'm happy to be naïve but I think we
should be *explicitly* so, and just say for example that it's defined
against vCPU0 so if other vCPUs are different then all bets are off. 

We *can* cope with TSC frequencies changing. Fundamentally, that's the
whole *point*; NTP calibrates itself as the underlying frequency does
change due to temperature changes, etc. — so a deliberate frequency
scaling, or even a live migration, are just a slightly special case of
the same thing.

One thing I have added to the memory region is a migration counter. In
the ideal case, guests will be happy just to use the hypervisor's
synchronization. But in some cases the guests may want to do NTP (or
PPS, PTP or something else) for themselves, to have more precise
timekeeping than the host. Even if the host is advertising itself as
stratum 16, the guest still needs to know of *migration*, because it
has to consider itself unsynchronized when that happens.
Peter Hilber March 11, 2024, 6:24 p.m. UTC | #4
On 08.03.24 13:33, David Woodhouse wrote:
> On Fri, 2024-03-08 at 11:32 +0100, Peter Hilber wrote:
>> On 07.03.24 15:02, David Woodhouse wrote:
>>> On Mon, 2023-12-18 at 08:38 +0100, Peter Hilber wrote:
>>>> RFC v3 updates
>>>> --------------
>>>>
>>>> This series implements a driver for a virtio-rtc device conforming to
>>>> spec
>>>> RFC v3 [1]. It now includes an RTC class driver with alarm, in
>>>> addition to
>>>> the PTP clock driver already present before.
>>>>
>>>> This patch series depends on the patch series "treewide: Use
>>>> clocksource id
>>>> for get_device_system_crosststamp()" [3]. Pull [4] to get the combined
>>>> series on top of mainline.
>>>>
>>>> Overview
>>>> --------
>>>>
>>>> This patch series adds the virtio_rtc module, and related bugfixes.
>>>> The
>>>> virtio_rtc module implements a driver compatible with the proposed
>>>> Virtio
>>>> RTC device specification [1]. The Virtio RTC (Real Time Clock) device
>>>> provides information about current time. The device can provide
>>>> different
>>>> clocks, e.g. for the UTC or TAI time standards, or for physical time
>>>> elapsed since some past epoch.
>>>
>>> Hm, should we allow UTC? If you tell me the time in UTC, then
>>> (sometimes) I still don't actually know what the time is, because some
>>> UTC seconds occur twice. UTC only makes sense if you provide the TAI
>>> offset, surely? Should the virtio_rtc specification make it mandatory
>>> to provide such?
>>>
>>> Otherwise you're just designing it to allow crappy hypervisors to
>>> expose incomplete information.
>>>
>>
>> Hi David,
>>
>> (adding virtio-comment@lists.oasis-open.org for spec discussion),
>>
>> thank you for your insightful comments. I think I take a broadly similar
>> view. The reason why the current spec and driver is like this is that I
>> took a pragmatic approach at first and only included features which work
>> out-of-the-box for the current Linux ecosystem.
>>
>> The current virtio_rtc features work similar to ptp_kvm, and therefore
>> can
>> work out-of-the-box with time sync daemons such as chrony.
>>
>> As of RFC spec v3, UTC clock only is allowed. If mandating a TAI clock
>> as
>> well, I am afraid that
>>
>> - in some (embedded) scenarios, the TAI clock may not be available
>>
>> - crappy hypervisors will pass off the UTC clock as the TAI clock.
>>
>> For the same reasons, I am also not sure about adding a *mandatory* TAI
>> offset to each readout. I don't know user-space software which would
>> leverage this already (at least not through the PTP clock interface).
>> And
>> why would such software not go straight for the TAI clock instead?
>>
>> How about adding a requirement to the spec that the virtio-rtc device
>> SHOULD expose the TAI clock whenever it is available - would this
>> address
>> your concerns?
>
> I think that would be too easy for implementors to miss, or decide not
> to obey. Or to get *wrong*, by exposing a TAI clock but actually
> putting UTC in it.
>
> I think I prefer to mandate the tai_offset field with the UTC clock.
> Crappy implementations will just set it to zero, but at least that
> gives a clear signal to the guests that it's *their* problem to
> resolve.

To me there are some open questions regarding how this would work. Is there
a use case for this with the v3 clock reading methods, or would it be
enough to address this with the Virtio timekeeper?

Looking at clock_adjtime(2), the tai_offset could be exposed, but probably
best alongside some additional information about leap seconds. I am not
aware about any user-space user. In addition, leap second smearing should
also be addressed.

>
>
>
>
>>>> PTP clock interface
>>>> -------------------
>>>>
>>>> virtio_rtc exposes clocks as PTP clocks to userspace, similar to
>>>> ptp_kvm.
>>>> If both the Virtio RTC device and this driver have special support for
>>>> the
>>>> current clocksource, time synchronization programs can use
>>>> cross-timestamping using ioctl PTP_SYS_OFFSET_PRECISE2 aka
>>>> PTP_SYS_OFFSET_PRECISE. Similar to ptp_kvm, system time
>>>> synchronization
>>>> with single-digit ns precision is possible with a quiescent reference
>>>> clock
>>>> (from the Virtio RTC device). This works even when the Virtio device
>>>> response is slow compared to ptp_kvm hypercalls.
>>>
>>> Is PTP the right mechanism for this? As I understand it, PTP is a way
>>> to precisely synchronize one clock with another. But in the case of
>>> virt guests synchronizing against the host, it isn't really *another*
>>> clock. It really is the *same* underlying clock. As the host clock
>>> varies with temperature, for example, so does the guest clock. The only
>>> difference is an offset and (on x86 perhaps) a mathematical scaling of
>>> the frequency.
>>>
>>> I was looking at this another way, when I came across this virtio-rtc
>>> work.
>>>
>>> My idea was just for the hypervisor to expose its own timekeeping
>>> information — the counter/TSC value and TAI time at a given moment,
>>> frequency of the counter, and the precision of both that frequency
>>> (±PPM) and the TAI timestamp (±µs).
>>>
>>> By putting that in a host/guest shared data structure with a seqcount
>>> for lockless updates, we can update it as time synchronization on the
>>> host is refined, and we can even cleanly handle live migration where
>>> the guest ends up on a completely different host. It allows for use
>>> cases which *really* care (e.g. timestamping financial transactions) to
>>> ensure that there is never even a moment of getting *wrong* timestamps
>>> if they haven't yet resynced after a migration.
>>
>> I considered a similar approach as well, but integrating that with the
>> kernel timekeeping seemed too much effort for the first step. However,
>> reading the clock from user space would be much simpler.
>
> Right. In fact my *first* use case was userspace, specifically in the
> context of https://github.com/aws/clock-bound — but anything we design
> for this absolutely has to be usable for kernel timekeeping too.
>
> It's also critical to solve the Live Migration problem.
>
> But is it so hard to integrate into the kernel timekeeping? My plan
> would have given us effectively an infinite number of cross-reads of
> the realtime clock vs. TSC. You don't have to actually read from a
> virtio device; you just read the TSC and do the maths, using the values
> in the shared memory region. Couldn't that be used to present a PTP
> device to the guest kernel just the same as you do at the moment?

Yes, and it would also decrease the clock reading overhead (saving at least
the Virtio response interrupt and associated scheduling). Would make sense
to me.

To be clear, with "kernel timekeeping integration" I meant to have
timekeeping.c derive the time directly from the Virtio timekeeper.

>
> You could probably even simulate PPS with it. Typically with PPS we
> have to catch the hardware interrupt and then read the TSC as soon as
> possible thereafter. With this, you'd be able to *calculate* the TSC
> value at the start of the next second, and wouldn't have to suffer the
> real hardware latency :)
>
>>>
>>> Now I'm trying to work out if I should attempt to reconcile with your
>>> existing virtio-rtc work, or just decide that virtio-rtc isn't trying
>>> to solve the actual problem that we have, and go ahead with something
>>> different... ?
>>>
>>
>> We are certainly interested into the discussed, say, "virtual
>> timekeeper"
>> mechanism, which would also solve a lot of problems for us (especially
>> if
>> it would be integrated with kernel timekeeping). Even without Linux
>> kernel
>> timekeeping, the virtual timekeeper would be useful to us for guests
>> with
>> simpler timekeeping, and potentially for user space applications.
>>
>> Our current intent is to at first try to upstream the current (RFC spec
>> v3)
>> feature set. I think the virtual timekeeper would be suitable as an
>> optional feature of virtio_rtc (with Virtio, this could easily be added
>> after initial upstreaming). It is also possible to have a virtio-rtc
>> device
>> only implement the virtual timekeeper, but not the other clock reading
>> methods, if these are of no interest.
>
> Yeah, that might make sense. I was thinking of a simple ACPI/DT device
> exposing a page of memory and *maybe* an interrupt for when an update
> happens. (With the caveat that the interrupt would always occur too
> late by definition, so it's no substitute for using the seqlock
> correctly in applications that *really* care and are going to get fined
> millions of dollars for mis-timestamping their transactions.)
>
> But using the virtio-rtc device as the vehicle for that shared memory
> page is reasonable too. It's not even mutually exclusive; we could
> expose the *same* data structure in memory via whatever mechanisms we
> wanted.
>
> One other thing to note is I think we're being very naïve about the TSC
> on x86 hosts. Theoretically, the TSC for every vCPU might run at a
> different frequency, and even if they run at the same frequency they
> might be offset from each other. I'm happy to be naïve but I think we
> should be *explicitly* so, and just say for example that it's defined
> against vCPU0 so if other vCPUs are different then all bets are off.

ATM Virtio has no notion of vCPUs, or vCPU topology. So I wonder if you
have an opinion on how to represent this in a platform-independent way.

Thank you for the comments,

Peter

>
> We *can* cope with TSC frequencies changing. Fundamentally, that's the
> whole *point*; NTP calibrates itself as the underlying frequency does
> change due to temperature changes, etc. — so a deliberate frequency
> scaling, or even a live migration, are just a slightly special case of
> the same thing.
>
> One thing I have added to the memory region is a migration counter. In
> the ideal case, guests will be happy just to use the hypervisor's
> synchronization. But in some cases the guests may want to do NTP (or
> PPS, PTP or something else) for themselves, to have more precise
> timekeeping than the host. Even if the host is advertising itself as
> stratum 16, the guest still needs to know of *migration*, because it
> has to consider itself unsynchronized when that happens
David Woodhouse March 12, 2024, 5:15 p.m. UTC | #5
On Mon, 2024-03-11 at 19:24 +0100, Peter Hilber wrote:
> On 08.03.24 13:33, David Woodhouse wrote:
> > On Fri, 2024-03-08 at 11:32 +0100, Peter Hilber wrote:
> > > On 07.03.24 15:02, David Woodhouse wrote:
> > > > Hm, should we allow UTC? If you tell me the time in UTC, then
> > > > (sometimes) I still don't actually know what the time is, because some
> > > > UTC seconds occur twice. UTC only makes sense if you provide the TAI
> > > > offset, surely? Should the virtio_rtc specification make it mandatory
> > > > to provide such?
> > > > 
> > > > Otherwise you're just designing it to allow crappy hypervisors to
> > > > expose incomplete information.
> > > > 
> > > 
> > > Hi David,
> > > 
> > > (adding virtio-comment@lists.oasis-open.org for spec discussion),
> > > 
> > > thank you for your insightful comments. I think I take a broadly similar
> > > view. The reason why the current spec and driver is like this is that I
> > > took a pragmatic approach at first and only included features which work
> > > out-of-the-box for the current Linux ecosystem.
> > > 
> > > The current virtio_rtc features work similar to ptp_kvm, and therefore
> > > can work out-of-the-box with time sync daemons such as chrony.
> > > 
> > > As of RFC spec v3, UTC clock only is allowed. If mandating a TAI clock
> > > as well, I am afraid that
> > > 
> > > - in some (embedded) scenarios, the TAI clock may not be available
> > > 
> > > - crappy hypervisors will pass off the UTC clock as the TAI clock.
> > > 
> > > For the same reasons, I am also not sure about adding a *mandatory* TAI
> > > offset to each readout. I don't know user-space software which would
> > > leverage this already (at least not through the PTP clock interface).
> > > And why would such software not go straight for the TAI clock instead?
> > > 
> > > How about adding a requirement to the spec that the virtio-rtc device
> > > SHOULD expose the TAI clock whenever it is available - would this
> > > address your concerns?
> > 
> > I think that would be too easy for implementors to miss, or decide not
> > to obey. Or to get *wrong*, by exposing a TAI clock but actually
> > putting UTC in it.
> > 
> > I think I prefer to mandate the tai_offset field with the UTC clock.
> > Crappy implementations will just set it to zero, but at least that
> > gives a clear signal to the guests that it's *their* problem to
> > resolve.
> 
> To me there are some open questions regarding how this would work. Is there
> a use case for this with the v3 clock reading methods, or would it be
> enough to address this with the Virtio timekeeper?
> 
> Looking at clock_adjtime(2), the tai_offset could be exposed, but probably
> best alongside some additional information about leap seconds. I am not
> aware about any user-space user. In addition, leap second smearing should
> also be addressed.
> 

Is there even a standard yet for leap-smearing? Will it be linear over
1000 seconds like UTC-SLS? Or semi-raised-cosine over 24 hours, which I
think is what Google does? Meta does something different again, don't
they?

Exposing UTC as the only clock reference is bad enough; when leap
seconds happen there's a whole second during which you don't *know*
which second it is. It seems odd to me, for a precision clock to be
deliberately ambiguous about what the time is!

But if the virtio-rtc clock is defined as UTC and then expose something
*different* in it, that's even worse. You potentially end up providing
inaccurate time for a whole *day* leading up to the leap second.

I think you're right that leap second smearing should be addressed. At
the very least, by making it clear that the virtio-rtc clock which
advertises UTC shall be used *only* for UTC, never UTC-SLS or any other
yet-to-be-defined variant.

Please make it explicit that any hypervisor which wants to advertise a
smeared clock shall define a new type which specifies the precise
smearing algorithm and cannot be conflated with the one you're defining
here.

> > One other thing to note is I think we're being very naïve about the TSC
> > on x86 hosts. Theoretically, the TSC for every vCPU might run at a
> > different frequency, and even if they run at the same frequency they
> > might be offset from each other. I'm happy to be naïve but I think we
> > should be *explicitly* so, and just say for example that it's defined
> > against vCPU0 so if other vCPUs are different then all bets are off.
> 
> ATM Virtio has no notion of vCPUs, or vCPU topology. So I wonder if you
> have an opinion on how to represent this in a platform-independent way.

Well, it doesn't have a notion of TSCs either; you include that by
implicit reference don't you?
Peter Hilber March 13, 2024, 9:45 a.m. UTC | #6
On 12.03.24 18:15, David Woodhouse wrote:
> On Mon, 2024-03-11 at 19:24 +0100, Peter Hilber wrote:
>> On 08.03.24 13:33, David Woodhouse wrote:
>>> On Fri, 2024-03-08 at 11:32 +0100, Peter Hilber wrote:
>>>> On 07.03.24 15:02, David Woodhouse wrote:
>>>>> Hm, should we allow UTC? If you tell me the time in UTC, then
>>>>> (sometimes) I still don't actually know what the time is, because some
>>>>> UTC seconds occur twice. UTC only makes sense if you provide the TAI
>>>>> offset, surely? Should the virtio_rtc specification make it mandatory
>>>>> to provide such?
>>>>>
>>>>> Otherwise you're just designing it to allow crappy hypervisors to
>>>>> expose incomplete information.
>>>>>
>>>>
>>>> Hi David,
>>>>
>>>> (adding virtio-comment@lists.oasis-open.org for spec discussion),
>>>>
>>>> thank you for your insightful comments. I think I take a broadly similar
>>>> view. The reason why the current spec and driver is like this is that I
>>>> took a pragmatic approach at first and only included features which work
>>>> out-of-the-box for the current Linux ecosystem.
>>>>
>>>> The current virtio_rtc features work similar to ptp_kvm, and therefore
>>>> can work out-of-the-box with time sync daemons such as chrony.
>>>>
>>>> As of RFC spec v3, UTC clock only is allowed. If mandating a TAI clock
>>>> as well, I am afraid that
>>>>
>>>> - in some (embedded) scenarios, the TAI clock may not be available
>>>>
>>>> - crappy hypervisors will pass off the UTC clock as the TAI clock.
>>>>
>>>> For the same reasons, I am also not sure about adding a *mandatory* TAI
>>>> offset to each readout. I don't know user-space software which would
>>>> leverage this already (at least not through the PTP clock interface).
>>>> And why would such software not go straight for the TAI clock instead?
>>>>
>>>> How about adding a requirement to the spec that the virtio-rtc device
>>>> SHOULD expose the TAI clock whenever it is available - would this
>>>> address your concerns?
>>>
>>> I think that would be too easy for implementors to miss, or decide not
>>> to obey. Or to get *wrong*, by exposing a TAI clock but actually
>>> putting UTC in it.
>>>
>>> I think I prefer to mandate the tai_offset field with the UTC clock.
>>> Crappy implementations will just set it to zero, but at least that
>>> gives a clear signal to the guests that it's *their* problem to
>>> resolve.
>>
>> To me there are some open questions regarding how this would work. Is there
>> a use case for this with the v3 clock reading methods, or would it be
>> enough to address this with the Virtio timekeeper?
>>
>> Looking at clock_adjtime(2), the tai_offset could be exposed, but probably
>> best alongside some additional information about leap seconds. I am not
>> aware about any user-space user. In addition, leap second smearing should
>> also be addressed.
>>
> 
> Is there even a standard yet for leap-smearing? Will it be linear over
> 1000 seconds like UTC-SLS? Or semi-raised-cosine over 24 hours, which I
> think is what Google does? Meta does something different again, don't
> they?
> 
> Exposing UTC as the only clock reference is bad enough; when leap
> seconds happen there's a whole second during which you don't *know*
> which second it is. It seems odd to me, for a precision clock to be
> deliberately ambiguous about what the time is!

Just to be clear, the device can perfectly expose only a TAI reference
clock (or both UTC and TAI), the spec is just completely open about this,
as it tries to work for diverse use cases.

> 
> But if the virtio-rtc clock is defined as UTC and then expose something
> *different* in it, that's even worse. You potentially end up providing
> inaccurate time for a whole *day* leading up to the leap second.
> 
> I think you're right that leap second smearing should be addressed. At
> the very least, by making it clear that the virtio-rtc clock which
> advertises UTC shall be used *only* for UTC, never UTC-SLS or any other
> yet-to-be-defined variant.
> 

Agreed.

> Please make it explicit that any hypervisor which wants to advertise a
> smeared clock shall define a new type which specifies the precise
> smearing algorithm and cannot be conflated with the one you're defining
> here.
> 

I will add a requirement that the UTC clock can never have smeared/smoothed
leap seconds.

I think that not every vendor would bother to first add a definition of a
smearing algorithm. Also, I think in some cases knowing the precise
smearing algorithm might not be important (when having the same time as the
hypervisor is enough and accuracy w.r.t. actual time is less important).

So maybe I should add a VIRTIO_RTC_CLOCK_UTC_SMEARED clock type, which for
now could catch every UTC-like clock which smears/smoothes leap seconds,
where the vendor cannot be bothered to add the smearing algorithm to spec
and implementations.

As for UTC-SLS, this *could* also be added, although [1] says

	It is inappropriate to use Internet-Drafts as reference material or
	to cite them other than as "work in progress."

[1] https://datatracker.ietf.org/doc/html/draft-kuhn-leapsecond-00

>>> One other thing to note is I think we're being very naïve about the TSC
>>> on x86 hosts. Theoretically, the TSC for every vCPU might run at a
>>> different frequency, and even if they run at the same frequency they
>>> might be offset from each other. I'm happy to be naïve but I think we
>>> should be *explicitly* so, and just say for example that it's defined
>>> against vCPU0 so if other vCPUs are different then all bets are off.
>>
>> ATM Virtio has no notion of vCPUs, or vCPU topology. So I wonder if you
>> have an opinion on how to represent this in a platform-independent way.
> 
> Well, it doesn't have a notion of TSCs either; you include that by
> implicit reference don't you?

I think I can add a SHOULD requirement which vaguely refers to vCPU 0, or
boot vCPU. But the Virtio device is not necessarily hosted by a hypervisor,
so the device might not even know which vCPUs there are. E.g. there is even
interest to make virtio-rtc work as part of the virtio-net device (which
might be implemented in hardware).

Thanks for the comments,

Peter
Alexandre Belloni March 13, 2024, 11:18 a.m. UTC | #7
On 13/03/2024 10:45:54+0100, Peter Hilber wrote:
> > Exposing UTC as the only clock reference is bad enough; when leap
> > seconds happen there's a whole second during which you don't *know*
> > which second it is. It seems odd to me, for a precision clock to be
> > deliberately ambiguous about what the time is!
> 
> Just to be clear, the device can perfectly expose only a TAI reference
> clock (or both UTC and TAI), the spec is just completely open about this,
> as it tries to work for diverse use cases.
> 
> > 
> > But if the virtio-rtc clock is defined as UTC and then expose something
> > *different* in it, that's even worse. You potentially end up providing
> > inaccurate time for a whole *day* leading up to the leap second.
> > 
> > I think you're right that leap second smearing should be addressed. At
> > the very least, by making it clear that the virtio-rtc clock which
> > advertises UTC shall be used *only* for UTC, never UTC-SLS or any other
> > yet-to-be-defined variant.
> > 
> 
> Agreed.
> 
> > Please make it explicit that any hypervisor which wants to advertise a
> > smeared clock shall define a new type which specifies the precise
> > smearing algorithm and cannot be conflated with the one you're defining
> > here.
> > 
> 
> I will add a requirement that the UTC clock can never have smeared/smoothed
> leap seconds.
> 
> I think that not every vendor would bother to first add a definition of a
> smearing algorithm. Also, I think in some cases knowing the precise
> smearing algorithm might not be important (when having the same time as the
> hypervisor is enough and accuracy w.r.t. actual time is less important).
> 
> So maybe I should add a VIRTIO_RTC_CLOCK_UTC_SMEARED clock type, which for
> now could catch every UTC-like clock which smears/smoothes leap seconds,
> where the vendor cannot be bothered to add the smearing algorithm to spec
> and implementations.
> 

I still don't know anything about virtio but under Linux, an RTC is
always UTC (or localtime when dual booting but let's not care) and never
accounts for leap seconds. Having an RTC and RTC driver behaving
differently would be super inconvenient. Why don't you leave this to
userspace?

I guess I'm still questioning whether this is the correct interface to
expose the host system time instead of an actual RTC.
David Woodhouse March 13, 2024, 12:29 p.m. UTC | #8
On Wed, 2024-03-13 at 12:18 +0100, Alexandre Belloni wrote:
> 
> I still don't know anything about virtio but under Linux, an RTC is
> always UTC (or localtime when dual booting but let's not care) and never
> accounts for leap seconds. Having an RTC and RTC driver behaving
> differently would be super inconvenient. Why don't you leave this to
> userspace?

Well yes, we don't need to expose *anything* from the hypervisor and we
can leave it all to guest userspace. We can run NTP on every single one
of *hundreds* of guests, leaving them all to duplicate the work of
calibrating the *same* underlying oscillator.

I thought we were trying to avoid that, by having the hypervisor tell
them what the time was. If we're going to do that, we need it to be
sufficiently precise (and some clients want to *know* the precision),
and above all we need it to be *unambiguous*.

If the hypervisor says that the time is 3692217600.001, then the guest
doesn't actually know *which* 3692217600.001 it is, and thus it still
doesn't know the time to an accuracy better than 1 second.

And if we start allowing the hypervisor to smear clocks in some other
underspecified ways, then we end up with errors of up to 1 second in
the clock for long periods of time *around* the leap second.

We need to avoid that ambiguity.

> I guess I'm still questioning whether this is the correct interface to
> expose the host system time instead of an actual RTC.

If an RTC device is able to report '23:59:60' as the time of day, I
suppose that *could* resolve the ambiguity. But talking to a device is
slow; we want guests to be able to know the time — accurately — with a
simple counter/tsc read and some arithmetic. Which means *paired* reads
of 'RTC' and the counter, and a precise indication of the counter
frequency.
David Woodhouse March 13, 2024, 12:45 p.m. UTC | #9
On Wed, 2024-03-13 at 10:45 +0100, Peter Hilber wrote:
> On 12.03.24 18:15, David Woodhouse wrote:
> > On Mon, 2024-03-11 at 19:24 +0100, Peter Hilber wrote:
> > > On 08.03.24 13:33, David Woodhouse wrote:
> > > > On Fri, 2024-03-08 at 11:32 +0100, Peter Hilber wrote:
> > > > > On 07.03.24 15:02, David Woodhouse wrote:
> > > > > > Hm, should we allow UTC? If you tell me the time in UTC, then
> > > > > > (sometimes) I still don't actually know what the time is, because some
> > > > > > UTC seconds occur twice. UTC only makes sense if you provide the TAI
> > > > > > offset, surely? Should the virtio_rtc specification make it mandatory
> > > > > > to provide such?
> > > > > > 
> > > > > > Otherwise you're just designing it to allow crappy hypervisors to
> > > > > > expose incomplete information.
> > > > > > 
> > > > > 
> > > > > Hi David,
> > > > > 
> > > > > (adding virtio-comment@lists.oasis-open.org for spec discussion),
> > > > > 
> > > > > thank you for your insightful comments. I think I take a broadly similar
> > > > > view. The reason why the current spec and driver is like this is that I
> > > > > took a pragmatic approach at first and only included features which work
> > > > > out-of-the-box for the current Linux ecosystem.
> > > > > 
> > > > > The current virtio_rtc features work similar to ptp_kvm, and therefore
> > > > > can work out-of-the-box with time sync daemons such as chrony.
> > > > > 
> > > > > As of RFC spec v3, UTC clock only is allowed. If mandating a TAI clock
> > > > > as well, I am afraid that
> > > > > 
> > > > > - in some (embedded) scenarios, the TAI clock may not be available
> > > > > 
> > > > > - crappy hypervisors will pass off the UTC clock as the TAI clock.
> > > > > 
> > > > > For the same reasons, I am also not sure about adding a *mandatory* TAI
> > > > > offset to each readout. I don't know user-space software which would
> > > > > leverage this already (at least not through the PTP clock interface).
> > > > > And why would such software not go straight for the TAI clock instead?
> > > > > 
> > > > > How about adding a requirement to the spec that the virtio-rtc device
> > > > > SHOULD expose the TAI clock whenever it is available - would this
> > > > > address your concerns?
> > > > 
> > > > I think that would be too easy for implementors to miss, or decide not
> > > > to obey. Or to get *wrong*, by exposing a TAI clock but actually
> > > > putting UTC in it.
> > > > 
> > > > I think I prefer to mandate the tai_offset field with the UTC clock.
> > > > Crappy implementations will just set it to zero, but at least that
> > > > gives a clear signal to the guests that it's *their* problem to
> > > > resolve.
> > > 
> > > To me there are some open questions regarding how this would work. Is there
> > > a use case for this with the v3 clock reading methods, or would it be
> > > enough to address this with the Virtio timekeeper?
> > > 
> > > Looking at clock_adjtime(2), the tai_offset could be exposed, but probably
> > > best alongside some additional information about leap seconds. I am not
> > > aware about any user-space user. In addition, leap second smearing should
> > > also be addressed.
> > > 
> > 
> > Is there even a standard yet for leap-smearing? Will it be linear over
> > 1000 seconds like UTC-SLS? Or semi-raised-cosine over 24 hours, which I
> > think is what Google does? Meta does something different again, don't
> > they?
> > 
> > Exposing UTC as the only clock reference is bad enough; when leap
> > seconds happen there's a whole second during which you don't *know*
> > which second it is. It seems odd to me, for a precision clock to be
> > deliberately ambiguous about what the time is!
> 
> Just to be clear, the device can perfectly expose only a TAI reference
> clock (or both UTC and TAI), the spec is just completely open about this,
> as it tries to work for diverse use cases.

As long as the guest *knows* what it's getting, sure.

> > 
> > But if the virtio-rtc clock is defined as UTC and then expose something
> > *different* in it, that's even worse. You potentially end up providing
> > inaccurate time for a whole *day* leading up to the leap second.
> > 
> > I think you're right that leap second smearing should be addressed. At
> > the very least, by making it clear that the virtio-rtc clock which
> > advertises UTC shall be used *only* for UTC, never UTC-SLS or any other
> > yet-to-be-defined variant.
> > 
> 
> Agreed.
> 
> > Please make it explicit that any hypervisor which wants to advertise a
> > smeared clock shall define a new type which specifies the precise
> > smearing algorithm and cannot be conflated with the one you're defining
> > here.
> > 
> 
> I will add a requirement that the UTC clock can never have smeared/smoothed
> leap seconds.

Thanks.

> I think that not every vendor would bother to first add a definition of a
> smearing algorithm. Also, I think in some cases knowing the precise
> smearing algorithm might not be important (when having the same time as the
> hypervisor is enough and accuracy w.r.t. actual time is less important).
> 
> So maybe I should add a VIRTIO_RTC_CLOCK_UTC_SMEARED clock type, which for
> now could catch every UTC-like clock which smears/smoothes leap seconds,
> where the vendor cannot be bothered to add the smearing algorithm to spec
> and implementations.

Please $DEITY no.

Surely the whole point of this effort is to provide guests with precise
and *unambiguous* knowledge of what the time is? 

Using UTC is bad enough, because for a UTC timestamp in the middle of a
leap second the guest can't know know *which* occurrence of that leap
second it is, so it might be wrong by a second. To resolve that
ambiguity needs a leap indicator and/or tai_offset field.

But if you allow and encourage the use of smeared time without even a
specification of *how* it's smeared... that's even worse. You have an
unknown inaccuracy of up to a second for whole periods of time around a
leap second. That's surely the *antithesis* of what we're trying to do
here? Without an actual definition of the smearing, how is a guest
actually supposed to know what time it is?

(I suppose you could add a tai_offset_nanoseconds field? I don't know
that I want to *encourage* that thought process...)

> As for UTC-SLS, this *could* also be added, although [1] says
> 
>         It is inappropriate to use Internet-Drafts as reference material or
>         to cite them other than as "work in progress."
> 
> [1] https://datatracker.ietf.org/doc/html/draft-kuhn-leapsecond-00
> 
> > > > One other thing to note is I think we're being very naïve about the TSC
> > > > on x86 hosts. Theoretically, the TSC for every vCPU might run at a
> > > > different frequency, and even if they run at the same frequency they
> > > > might be offset from each other. I'm happy to be naïve but I think we
> > > > should be *explicitly* so, and just say for example that it's defined
> > > > against vCPU0 so if other vCPUs are different then all bets are off.
> > > 
> > > ATM Virtio has no notion of vCPUs, or vCPU topology. So I wonder if you
> > > have an opinion on how to represent this in a platform-independent way.
> > 
> > Well, it doesn't have a notion of TSCs either; you include that by
> > implicit reference don't you?
> 
> I think I can add a SHOULD requirement which vaguely refers to vCPU 0, or
> boot vCPU. But the Virtio device is not necessarily hosted by a hypervisor,
> so the device might not even know which vCPUs there are. E.g. there is even
> interest to make virtio-rtc work as part of the virtio-net device (which
> might be implemented in hardware).

Sure, but those implementations aren't going to offer the TSC pairing
at all, are they?
Alexandre Belloni March 13, 2024, 12:58 p.m. UTC | #10
On 13/03/2024 12:29:38+0000, David Woodhouse wrote:
> On Wed, 2024-03-13 at 12:18 +0100, Alexandre Belloni wrote:
> > 
> > I still don't know anything about virtio but under Linux, an RTC is
> > always UTC (or localtime when dual booting but let's not care) and never
> > accounts for leap seconds. Having an RTC and RTC driver behaving
> > differently would be super inconvenient. Why don't you leave this to
> > userspace?
> 
> Well yes, we don't need to expose *anything* from the hypervisor and we
> can leave it all to guest userspace. We can run NTP on every single one
> of *hundreds* of guests, leaving them all to duplicate the work of
> calibrating the *same* underlying oscillator.
> 

Really, I see the point of sharing the time accurately between the host
and the guest and not duplicating the effort. This is not what I am
objecting to.

> I thought we were trying to avoid that, by having the hypervisor tell
> them what the time was. If we're going to do that, we need it to be
> sufficiently precise (and some clients want to *know* the precision),
> and above all we need it to be *unambiguous*.
> 
> If the hypervisor says that the time is 3692217600.001, then the guest
> doesn't actually know *which* 3692217600.001 it is, and thus it still
> doesn't know the time to an accuracy better than 1 second.
> 

The RTC subsystem has a 1 second resolution and this is not going to
change because there is no point doing so for the hardware, to get the
time precisely, UIE MUST be used there is no vendor that will just
support reading the time/date or timestamp as this is way too imprecise.

> And if we start allowing the hypervisor to smear clocks in some other
> underspecified ways, then we end up with errors of up to 1 second in
> the clock for long periods of time *around* the leap second.
> 
> We need to avoid that ambiguity.

Exactly my point and I said, reading an RTC is always UTC and never
handles leap seconds so if userspace doesn't handle the leap second and
updates the RTC, too bad. There are obviously no RTC that will smear
clock unless instructed to, so the hypervisor must not smear the clock.

Note that this is not an issue for the subsystem because if you are not
capable to track an accurate clock, the RTC itself will have drifted by
much more than a second in between leap second inclusions.

> 
> > I guess I'm still questioning whether this is the correct interface to
> > expose the host system time instead of an actual RTC.
> 
> If an RTC device is able to report '23:59:60' as the time of day, I
> suppose that *could* resolve the ambiguity. But talking to a device is
> slow; we want guests to be able to know the time — accurately — with a
> simple counter/tsc read and some arithmetic. Which means *paired* reads
> of 'RTC' and the counter, and a precise indication of the counter
> frequency.

23:59:60 is not and will never be allowed in the RTC subsystem as this
is an invalid value for the hardware.

The TSC or whatever CPU counter/clock that is used to keep the system
time is not an RTC, I don't get why it has to be exposed as such to the
guests. PTP is fine and precise, RTC is not.
David Woodhouse March 13, 2024, 2:06 p.m. UTC | #11
On Wed, 2024-03-13 at 13:58 +0100, Alexandre Belloni wrote:
> The TSC or whatever CPU counter/clock that is used to keep the system
> time is not an RTC, I don't get why it has to be exposed as such to the
> guests. PTP is fine and precise, RTC is not.

Ah, I see. But the point of the virtio_rtc is not really to expose that
CPU counter. The point is to report the wallclock time, just like an
actual RTC. The real difference is the *precision*.

The virtio_rtc device has a facility to *also* expose the counter,
because that's what we actually need to gain that precision...

Applications don't read the RTC every time they want to know what the
time is. These days, they don't even make a system call; it's done
entirely in userspace mode. The kernel exposes some shared memory,
essentially saying "the counter was X at time Y, and runs at Z Hz".
Then applications just read the CPU counter and do some arithmetic.

As we require more and more precision in the calibration, it becomes
important to get *paired* readings of the CPU counter and the wallclock
time at precisely the same moment. If the guest has to read one and
then the other, potentially taking interrupts, getting preempted and
suffering steal/SMI time in the middle, that introduces an error which
is increasingly significant as we increasingly care about precision.

Peter's proposal exposes the pairs of {X,Y} and leaves *all* the guest
kernels having to repeat readings over time and perform the calibration
as the underlying hardware oscillator frequency (Z) drifts with
temperature. I'm trying to get him to let the hypervisor expose the
calibrated frequency Z too. Along with *error* bounds for ±δX and ±δZ.
Which aside from reducing the duplication of effort, will *also* fix
the problem of live migration where *all* those things suffer a step
change and leave the guest with an inaccurate clock but not knowing it.

But that part isn't relevant to the RTC, as you say. RTC doesn't care
about that level of precision; it's just what the system uses to know
roughly what year it is, when it powers up. (And isn't even really
trusted for that, which is a large part of why I added the
X509_V_FLAG_NO_CHECK_TIME flag to OpenSSL, so Secure Boot doesn't break
when the RTC is catastrophically wrong :)

If you're asking why patch 7/7 in Peter's series exists to expose the
virtio clock through RTC, and you're not particularly interested in the
first six, I suppose that's a fair question. As is the question of "why
is it called virtio_rtc not virtio_ptp?". 

But let me turn it around: if the kernel has access to this virtio
device and *not* any other RTC, why *wouldn't* the kernel use the time
from it? The fact that it can optionally *also* provide paired readings
with the CPU counter doesn't actually *hurt* for the RTC use case, does
it?
Peter Hilber March 13, 2024, 2:15 p.m. UTC | #12
On 13.03.24 12:18, Alexandre Belloni wrote:
> On 13/03/2024 10:45:54+0100, Peter Hilber wrote:
>>> Exposing UTC as the only clock reference is bad enough; when leap
>>> seconds happen there's a whole second during which you don't *know*
>>> which second it is. It seems odd to me, for a precision clock to be
>>> deliberately ambiguous about what the time is!
>>
>> Just to be clear, the device can perfectly expose only a TAI reference
>> clock (or both UTC and TAI), the spec is just completely open about this,
>> as it tries to work for diverse use cases.
>>
>>>
>>> But if the virtio-rtc clock is defined as UTC and then expose something
>>> *different* in it, that's even worse. You potentially end up providing
>>> inaccurate time for a whole *day* leading up to the leap second.
>>>
>>> I think you're right that leap second smearing should be addressed. At
>>> the very least, by making it clear that the virtio-rtc clock which
>>> advertises UTC shall be used *only* for UTC, never UTC-SLS or any other
>>> yet-to-be-defined variant.
>>>
>>
>> Agreed.
>>
>>> Please make it explicit that any hypervisor which wants to advertise a
>>> smeared clock shall define a new type which specifies the precise
>>> smearing algorithm and cannot be conflated with the one you're defining
>>> here.
>>>
>>
>> I will add a requirement that the UTC clock can never have smeared/smoothed
>> leap seconds.
>>
>> I think that not every vendor would bother to first add a definition of a
>> smearing algorithm. Also, I think in some cases knowing the precise
>> smearing algorithm might not be important (when having the same time as the
>> hypervisor is enough and accuracy w.r.t. actual time is less important).
>>
>> So maybe I should add a VIRTIO_RTC_CLOCK_UTC_SMEARED clock type, which for
>> now could catch every UTC-like clock which smears/smoothes leap seconds,
>> where the vendor cannot be bothered to add the smearing algorithm to spec
>> and implementations.
>>
> 
> I still don't know anything about virtio but under Linux, an RTC is
> always UTC (or localtime when dual booting but let's not care) and never
> accounts for leap seconds. Having an RTC and RTC driver behaving
> differently would be super inconvenient. Why don't you leave this to
> userspace?
> 
> I guess I'm still questioning whether this is the correct interface to
> expose the host system time instead of an actual RTC.
> 

virtio_rtc only registers RTC class devices for virtio_rtc clock type UTC,
so adding an UTC_SMEARED clock type would avoid leap seconds for the RTC
class.

But I understand that there are more concerns and I will re-consider. From
my POV CLOCK_{REALTIME,MONOTONIC}_ALARM support is very important however.

So the only alternative to me seems to be adding an alternative alarmtimer
backend (and time synchronization through user space).

Thanks for the comment,

Peter
Alexandre Belloni March 13, 2024, 2:50 p.m. UTC | #13
On 13/03/2024 14:06:42+0000, David Woodhouse wrote:
> If you're asking why patch 7/7 in Peter's series exists to expose the
> virtio clock through RTC, and you're not particularly interested in the
> first six, I suppose that's a fair question. As is the question of "why
> is it called virtio_rtc not virtio_ptp?". 
> 

Exactly my question, thanks :)

> But let me turn it around: if the kernel has access to this virtio
> device and *not* any other RTC, why *wouldn't* the kernel use the time
> from it? The fact that it can optionally *also* provide paired readings
> with the CPU counter doesn't actually *hurt* for the RTC use case, does
> it?

As long as it doesn't behave differently from the other RTC, I'm fine
with this. This is important because I don't want to carry any special
infrastructure for this driver or to have to special case this driver
later on because it is incompatible with some evolution of the
subsystem.
Peter Hilber March 13, 2024, 5:50 p.m. UTC | #14
On 13.03.24 13:45, David Woodhouse wrote:
> On Wed, 2024-03-13 at 10:45 +0100, Peter Hilber wrote:
>> On 12.03.24 18:15, David Woodhouse wrote:
>>> On Mon, 2024-03-11 at 19:24 +0100, Peter Hilber wrote:
>>>> On 08.03.24 13:33, David Woodhouse wrote:
>>>>> On Fri, 2024-03-08 at 11:32 +0100, Peter Hilber wrote:
>>>>>> On 07.03.24 15:02, David Woodhouse wrote:
>>>>>>> Hm, should we allow UTC? If you tell me the time in UTC, then
>>>>>>> (sometimes) I still don't actually know what the time is, because some
>>>>>>> UTC seconds occur twice. UTC only makes sense if you provide the TAI
>>>>>>> offset, surely? Should the virtio_rtc specification make it mandatory
>>>>>>> to provide such?
>>>>>>>
>>>>>>> Otherwise you're just designing it to allow crappy hypervisors to
>>>>>>> expose incomplete information.
>>>>>>>
>>>>>>
>>>>>> Hi David,
>>>>>>
>>>>>> (adding virtio-comment@lists.oasis-open.org for spec discussion),
>>>>>>
>>>>>> thank you for your insightful comments. I think I take a broadly similar
>>>>>> view. The reason why the current spec and driver is like this is that I
>>>>>> took a pragmatic approach at first and only included features which work
>>>>>> out-of-the-box for the current Linux ecosystem.
>>>>>>
>>>>>> The current virtio_rtc features work similar to ptp_kvm, and therefore
>>>>>> can work out-of-the-box with time sync daemons such as chrony.
>>>>>>
>>>>>> As of RFC spec v3, UTC clock only is allowed. If mandating a TAI clock
>>>>>> as well, I am afraid that
>>>>>>
>>>>>> - in some (embedded) scenarios, the TAI clock may not be available
>>>>>>
>>>>>> - crappy hypervisors will pass off the UTC clock as the TAI clock.
>>>>>>
>>>>>> For the same reasons, I am also not sure about adding a *mandatory* TAI
>>>>>> offset to each readout. I don't know user-space software which would
>>>>>> leverage this already (at least not through the PTP clock interface).
>>>>>> And why would such software not go straight for the TAI clock instead?
>>>>>>
>>>>>> How about adding a requirement to the spec that the virtio-rtc device
>>>>>> SHOULD expose the TAI clock whenever it is available - would this
>>>>>> address your concerns?
>>>>>
>>>>> I think that would be too easy for implementors to miss, or decide not
>>>>> to obey. Or to get *wrong*, by exposing a TAI clock but actually
>>>>> putting UTC in it.
>>>>>
>>>>> I think I prefer to mandate the tai_offset field with the UTC clock.
>>>>> Crappy implementations will just set it to zero, but at least that
>>>>> gives a clear signal to the guests that it's *their* problem to
>>>>> resolve.
>>>>
>>>> To me there are some open questions regarding how this would work. Is there
>>>> a use case for this with the v3 clock reading methods, or would it be
>>>> enough to address this with the Virtio timekeeper?
>>>>
>>>> Looking at clock_adjtime(2), the tai_offset could be exposed, but probably
>>>> best alongside some additional information about leap seconds. I am not
>>>> aware about any user-space user. In addition, leap second smearing should
>>>> also be addressed.
>>>>
>>>
>>> Is there even a standard yet for leap-smearing? Will it be linear over
>>> 1000 seconds like UTC-SLS? Or semi-raised-cosine over 24 hours, which I
>>> think is what Google does? Meta does something different again, don't
>>> they?
>>>
>>> Exposing UTC as the only clock reference is bad enough; when leap
>>> seconds happen there's a whole second during which you don't *know*
>>> which second it is. It seems odd to me, for a precision clock to be
>>> deliberately ambiguous about what the time is!
>>
>> Just to be clear, the device can perfectly expose only a TAI reference
>> clock (or both UTC and TAI), the spec is just completely open about this,
>> as it tries to work for diverse use cases.
> 
> As long as the guest *knows* what it's getting, sure.
> 
>>>
>>> But if the virtio-rtc clock is defined as UTC and then expose something
>>> *different* in it, that's even worse. You potentially end up providing
>>> inaccurate time for a whole *day* leading up to the leap second.
>>>
>>> I think you're right that leap second smearing should be addressed. At
>>> the very least, by making it clear that the virtio-rtc clock which
>>> advertises UTC shall be used *only* for UTC, never UTC-SLS or any other
>>> yet-to-be-defined variant.
>>>
>>
>> Agreed.
>>
>>> Please make it explicit that any hypervisor which wants to advertise a
>>> smeared clock shall define a new type which specifies the precise
>>> smearing algorithm and cannot be conflated with the one you're defining
>>> here.
>>>
>>
>> I will add a requirement that the UTC clock can never have smeared/smoothed
>> leap seconds.
> 
> Thanks.
> 
>> I think that not every vendor would bother to first add a definition of a
>> smearing algorithm. Also, I think in some cases knowing the precise
>> smearing algorithm might not be important (when having the same time as the
>> hypervisor is enough and accuracy w.r.t. actual time is less important).
>>
>> So maybe I should add a VIRTIO_RTC_CLOCK_UTC_SMEARED clock type, which for
>> now could catch every UTC-like clock which smears/smoothes leap seconds,
>> where the vendor cannot be bothered to add the smearing algorithm to spec
>> and implementations.
> 
> Please $DEITY no.
> 
> Surely the whole point of this effort is to provide guests with precise
> and *unambiguous* knowledge of what the time is? 

I would say, a fundamental point of this effort is to enable such
implementations, and to detect if a device is promising to support this.

Where we might differ is as to whether the Virtio clock *for every
implementation* has to be *continuously* accurate w.r.t. a time standard,
or whether *for some implementations* it could be enough that all guests in
the local system have the same, precise local notion of time, which might
be off from the actual time standard.

Also, cf. ptp_kvm, which AFAIU doesn't address leap seconds at all...

With your described use case the UTC_SMEARED clock should of course not be
used. The UTC_SMEARED clock would get a distinct name through udev, like
/dev/ptp_virtio_utc_smeared, so the incompatibility could at least be
detected.

> 
> Using UTC is bad enough, because for a UTC timestamp in the middle of a
> leap second the guest can't know know *which* occurrence of that leap
> second it is, so it might be wrong by a second. To resolve that
> ambiguity needs a leap indicator and/or tai_offset field.

I agree that virtio-rtc should communicate this. The question is, what
exactly, and for which clock read request?

As for PTP clocks:

- It doesn't fit into the ioctl PTP_SYS_OFFSET_PRECISE2.

- The clock_adjtime(2) tai_offset and return value could be set (if
  upstream will accept this). Would this help? As discussed, user space
  would need to interpret this (and currently no dynamic POSIX clock sets
  this).

> 
> But if you allow and encourage the use of smeared time without even a
> specification of *how* it's smeared... that's even worse. You have an
> unknown inaccuracy of up to a second for whole periods of time around a
> leap second. That's surely the *antithesis* of what we're trying to do
> here? Without an actual definition of the smearing, how is a guest
> actually supposed to know what time it is?

As discussed above, I think in some use cases it is enough for the guest to
have a precise notion of time shared with the other guests.

> 
> (I suppose you could add a tai_offset_nanoseconds field? I don't know
> that I want to *encourage* that thought process...)
> 
>> As for UTC-SLS, this *could* also be added, although [1] says
>>
>>         It is inappropriate to use Internet-Drafts as reference material or
>>         to cite them other than as "work in progress."
>>
>> [1] https://datatracker.ietf.org/doc/html/draft-kuhn-leapsecond-00
>>
>>>>> One other thing to note is I think we're being very naïve about the TSC
>>>>> on x86 hosts. Theoretically, the TSC for every vCPU might run at a
>>>>> different frequency, and even if they run at the same frequency they
>>>>> might be offset from each other. I'm happy to be naïve but I think we
>>>>> should be *explicitly* so, and just say for example that it's defined
>>>>> against vCPU0 so if other vCPUs are different then all bets are off.
>>>>
>>>> ATM Virtio has no notion of vCPUs, or vCPU topology. So I wonder if you
>>>> have an opinion on how to represent this in a platform-independent way.
>>>
>>> Well, it doesn't have a notion of TSCs either; you include that by
>>> implicit reference don't you?
>>
>> I think I can add a SHOULD requirement which vaguely refers to vCPU 0, or
>> boot vCPU. But the Virtio device is not necessarily hosted by a hypervisor,
>> so the device might not even know which vCPUs there are. E.g. there is even
>> interest to make virtio-rtc work as part of the virtio-net device (which
>> might be implemented in hardware).
> 
> Sure, but those implementations aren't going to offer the TSC pairing
> at all, are they?
> 

They could offer an Intel ART pairing (some physical PTP NICs are already
doing this, look for the convert_art_to_tsc() users).

Thanks for the comments,

Peter
Peter Hilber March 13, 2024, 5:50 p.m. UTC | #15
On 13.03.24 15:06, David Woodhouse wrote:
> On Wed, 2024-03-13 at 13:58 +0100, Alexandre Belloni wrote:
>> The TSC or whatever CPU counter/clock that is used to keep the system
>> time is not an RTC, I don't get why it has to be exposed as such to the
>> guests. PTP is fine and precise, RTC is not.
> 
> Ah, I see. But the point of the virtio_rtc is not really to expose that
> CPU counter. The point is to report the wallclock time, just like an
> actual RTC. The real difference is the *precision*.
> 
> The virtio_rtc device has a facility to *also* expose the counter,
> because that's what we actually need to gain that precision...
> 
> Applications don't read the RTC every time they want to know what the
> time is. These days, they don't even make a system call; it's done
> entirely in userspace mode. The kernel exposes some shared memory,
> essentially saying "the counter was X at time Y, and runs at Z Hz".
> Then applications just read the CPU counter and do some arithmetic.
> 
> As we require more and more precision in the calibration, it becomes
> important to get *paired* readings of the CPU counter and the wallclock
> time at precisely the same moment. If the guest has to read one and
> then the other, potentially taking interrupts, getting preempted and
> suffering steal/SMI time in the middle, that introduces an error which
> is increasingly significant as we increasingly care about precision.
> 
> Peter's proposal exposes the pairs of {X,Y} and leaves *all* the guest
> kernels having to repeat readings over time and perform the calibration
> as the underlying hardware oscillator frequency (Z) drifts with
> temperature. I'm trying to get him to let the hypervisor expose the
> calibrated frequency Z too. Along with *error* bounds for ±δX and ±δZ.
> Which aside from reducing the duplication of effort, will *also* fix
> the problem of live migration where *all* those things suffer a step
> change and leave the guest with an inaccurate clock but not knowing it.

I am already convinced that this would work significantly better than the
{X,Y} pair (but would be a bit more effort to implement):

1. when accessed by user space, obviously

2. when backing the PTP clock, it saves CPU time and makes non-paired
   reads more precise.

I would just prefer to try upstreaming the {X,Y} pairing first. I think the
{X,Y,Z...} pairing could be discussed and developed in parallel.

Thanks for the comments,

Peter
David Woodhouse March 13, 2024, 6:18 p.m. UTC | #16
On 13 March 2024 17:50:48 GMT, Peter Hilber <peter.hilber@opensynergy.com> wrote:
>On 13.03.24 13:45, David Woodhouse wrote:
>> Surely the whole point of this effort is to provide guests with precise
>> and *unambiguous* knowledge of what the time is? 
>
>I would say, a fundamental point of this effort is to enable such
>implementations, and to detect if a device is promising to support this.
>
>Where we might differ is as to whether the Virtio clock *for every
>implementation* has to be *continuously* accurate w.r.t. a time standard,
>or whether *for some implementations* it could be enough that all guests in
>the local system have the same, precise local notion of time, which might
>be off from the actual time standard.

That makes sense, but remember I don't just want {X, Y, Z} but *also* the error bounds of ±deltaY and ±deltaZ too.

So your example just boils down to "I'm calling it UTC, and it's really precise, but we make no promises about its *accuracy*". And that's fine.

>Also, cf. ptp_kvm, which AFAIU doesn't address leap seconds at all...

KVM is not an exemplar of good time practices. 
Not in *any* respect :)

>With your described use case the UTC_SMEARED clock should of course not be
>used. The UTC_SMEARED clock would get a distinct name through udev, like
>/dev/ptp_virtio_utc_smeared, so the incompatibility could at least be
>detected.

As long as it's clear to all concerned that this is fundamentally not usable as an accurate time source, and is only for the local-sync case you described, sure.

>> Using UTC is bad enough, because for a UTC timestamp in the middle of a
>> leap second the guest can't know know *which* occurrence of that leap
>> second it is, so it might be wrong by a second. To resolve that
>> ambiguity needs a leap indicator and/or tai_offset field.
>
>I agree that virtio-rtc should communicate this. The question is, what
>exactly, and for which clock read request?

Are we now conflating software architecture (and Linux in particular) with "hardware" design?

To a certain extent, as long as the virtio-rtc device is designed to expose time precisely and unambiguously, it's less important if the Linux kernel *today* can use that. Although of course we should strive for that. Let's be...well, *unambiguous*, I suppose... that we've changed topics to discuss that though.

>As for PTP clocks:
>
>- It doesn't fit into the ioctl PTP_SYS_OFFSET_PRECISE2.
>
>- The clock_adjtime(2) tai_offset and return value could be set (if
>  upstream will accept this). Would this help? As discussed, user space
>  would need to interpret this (and currently no dynamic POSIX clock sets
>  this).

Hm, maybe?


>>> I think I can add a SHOULD requirement which vaguely refers to vCPU 0, or
>>> boot vCPU. But the Virtio device is not necessarily hosted by a hypervisor,
>>> so the device might not even know which vCPUs there are. E.g. there is even
>>> interest to make virtio-rtc work as part of the virtio-net device (which
>>> might be implemented in hardware).
>> 
>> Sure, but those implementations aren't going to offer the TSC pairing
>> at all, are they?
>> 
>
>They could offer an Intel ART pairing (some physical PTP NICs are already
>doing this, look for the convert_art_to_tsc() users).

Right, but isn't that software's problem? The time pairing is defined against the ART in that case.
Andrew Lunn March 13, 2024, 8:12 p.m. UTC | #17
> As long as it doesn't behave differently from the other RTC, I'm fine
> with this. This is important because I don't want to carry any special
> infrastructure for this driver or to have to special case this driver
> later on because it is incompatible with some evolution of the
> subsystem.

Maybe deliberately throw away all the sub-second accuracy when
accessing the time via the RTC API? That helps to make it look like an
RTC. And when doing the rounding, add a constant offset of 10ms to
emulate the virtual i2c bus it is hanging off, like most RTCs?

	  Andrew
Peter Hilber March 14, 2024, 9:13 a.m. UTC | #18
On 13.03.24 21:12, Andrew Lunn wrote:
>> As long as it doesn't behave differently from the other RTC, I'm fine
>> with this. This is important because I don't want to carry any special
>> infrastructure for this driver or to have to special case this driver
>> later on because it is incompatible with some evolution of the
>> subsystem.
> 
> Maybe deliberately throw away all the sub-second accuracy when
> accessing the time via the RTC API? That helps to make it look like an
> RTC. And when doing the rounding, add a constant offset of 10ms to
> emulate the virtual i2c bus it is hanging off, like most RTCs?
> 
> 	  Andrew

The truncating to whole seconds is already done. As to the offset, I do not
understand how this would help. I can read out my CMOS RTC in ~50 us.

Thanks for the comment,

Peter
Peter Hilber March 14, 2024, 10:13 a.m. UTC | #19
Now CC'ing the previous commenters to the virtio-rtc spec draft, since
this discussion is mostly about the spec, and the Virtio mailing lists
still seem to be in a migration hiatus...

On 13.03.24 19:18, David Woodhouse wrote:
> On 13 March 2024 17:50:48 GMT, Peter Hilber <peter.hilber@opensynergy.com> wrote:
>> On 13.03.24 13:45, David Woodhouse wrote:
>>> Surely the whole point of this effort is to provide guests with precise
>>> and *unambiguous* knowledge of what the time is? 
>>
>> I would say, a fundamental point of this effort is to enable such
>> implementations, and to detect if a device is promising to support this.
>>
>> Where we might differ is as to whether the Virtio clock *for every
>> implementation* has to be *continuously* accurate w.r.t. a time standard,
>> or whether *for some implementations* it could be enough that all guests in
>> the local system have the same, precise local notion of time, which might
>> be off from the actual time standard.
> 
> That makes sense, but remember I don't just want {X, Y, Z} but *also* the error bounds of ±deltaY and ±deltaZ too.
> 
> So your example just boils down to "I'm calling it UTC, and it's really precise, but we make no promises about its *accuracy*". And that's fine.
> 
>> Also, cf. ptp_kvm, which AFAIU doesn't address leap seconds at all...
> 
> KVM is not an exemplar of good time practices. 
> Not in *any* respect :)
> 
>> With your described use case the UTC_SMEARED clock should of course not be
>> used. The UTC_SMEARED clock would get a distinct name through udev, like
>> /dev/ptp_virtio_utc_smeared, so the incompatibility could at least be
>> detected.
> 
> As long as it's clear to all concerned that this is fundamentally not usable as an accurate time source, and is only for the local-sync case you described, sure.
> 
>>> Using UTC is bad enough, because for a UTC timestamp in the middle of a
>>> leap second the guest can't know know *which* occurrence of that leap
>>> second it is, so it might be wrong by a second. To resolve that
>>> ambiguity needs a leap indicator and/or tai_offset field.
>>
>> I agree that virtio-rtc should communicate this. The question is, what
>> exactly, and for which clock read request?
> 
> Are we now conflating software architecture (and Linux in particular) with "hardware" design?
> 
> To a certain extent, as long as the virtio-rtc device is designed to expose time precisely and unambiguously, it's less important if the Linux kernel *today* can use that. Although of course we should strive for that. Let's be...well, *unambiguous*, I suppose... that we've changed topics to discuss that though.
> 

As Virtio is extensible (unlike hardware), my approach is to mostly specify
only what also has a PoC user and a use case.

>> As for PTP clocks:
>>
>> - It doesn't fit into the ioctl PTP_SYS_OFFSET_PRECISE2.
>>
>> - The clock_adjtime(2) tai_offset and return value could be set (if
>>  upstream will accept this). Would this help? As discussed, user space
>>  would need to interpret this (and currently no dynamic POSIX clock sets
>>  this).
> 
> Hm, maybe?
> 
> 
>>>> I think I can add a SHOULD requirement which vaguely refers to vCPU 0, or
>>>> boot vCPU. But the Virtio device is not necessarily hosted by a hypervisor,
>>>> so the device might not even know which vCPUs there are. E.g. there is even
>>>> interest to make virtio-rtc work as part of the virtio-net device (which
>>>> might be implemented in hardware).
>>>
>>> Sure, but those implementations aren't going to offer the TSC pairing
>>> at all, are they?
>>>
>>
>> They could offer an Intel ART pairing (some physical PTP NICs are already
>> doing this, look for the convert_art_to_tsc() users).
> 
> Right, but isn't that software's problem? The time pairing is defined against the ART in that case.

My point was that such a device would then not necessarily have an idea
what vCPU 0 is. But let's just say that this will be phrased as a SHOULD
best-effort requirement anyway.

Thanks for the comments,

Peter
David Woodhouse March 14, 2024, 2:19 p.m. UTC | #20
On 14 March 2024 11:13:37 CET, Peter Hilber <peter.hilber@opensynergy.com> wrote:
>> To a certain extent, as long as the virtio-rtc device is designed to expose time precisely and unambiguously, it's less important if the Linux kernel *today* can use that. Although of course we should strive for that. Let's be...well, *unambiguous*, I suppose... that we've changed topics to discuss that though.
>> 
>
>As Virtio is extensible (unlike hardware), my approach is to mostly specify
>only what also has a PoC user and a use case.

If we get memory-mapped (X, Y, Z, ±x, ±y) I'll have a user and a use case on day one. Otherwise, as I said in my first response, I can go do that as a separate device and decide that virtio_rtc doesn't meet our needs (especially for maintaining accuracy over LM).

My main concern for virto_rtc is that we avoid *ambiguity*. Yes, I get that it's extensible but we don't want a v1.0 of the spec, implemented by various hypervisors, which still leaves guests not knowing what the actual time is. That would not be good. And even UTC without a leap second indicator has that problem.
Peter Hilber March 19, 2024, 1:47 p.m. UTC | #21
While the virtio-comment list is not available, now also CC'ing Parav,
which may be interested in this virtio-rtc spec related discussion thread.

On 14.03.24 15:19, David Woodhouse wrote:
> On 14 March 2024 11:13:37 CET, Peter Hilber <peter.hilber@opensynergy.com> wrote:
>>> To a certain extent, as long as the virtio-rtc device is designed to expose time precisely and unambiguously, it's less important if the Linux kernel *today* can use that. Although of course we should strive for that. Let's be...well, *unambiguous*, I suppose... that we've changed topics to discuss that though.
>>>
>>
>> As Virtio is extensible (unlike hardware), my approach is to mostly specify
>> only what also has a PoC user and a use case.
> 
> If we get memory-mapped (X, Y, Z, ±x, ±y) I'll have a user and a use case on day one. Otherwise, as I said in my first response, I can go do that as a separate device and decide that virtio_rtc doesn't meet our needs (especially for maintaining accuracy over LM).

We plan to add 

- leap second indication,

- UTC-to-TAI offset,

- clock smearing indication (including the noon-to-noon linear smearing
  variant which seems to be somewhat popular), and

- clock accuracy indication

to the initial spec and to the PoC implementation.

However, due to resource restrictions, we cannot ourselves add the
memory-mapped clock to the initial spec.

Everyone is very welcome to contribute the memory-mapped clock to the spec,
and I think it might then still make it to the initial version.

> 
> My main concern for virto_rtc is that we avoid *ambiguity*. Yes, I get that it's extensible but we don't want a v1.0 of the spec, implemented by various hypervisors, which still leaves guests not knowing what the actual time is. That would not be good. And even UTC without a leap second indicator has that problem.

Agreed. That should be addressed by the above changes.

Best regards,

Peter
David Woodhouse March 20, 2024, 5:22 p.m. UTC | #22
On Tue, 2024-03-19 at 14:47 +0100, Peter Hilber wrote:
> While the virtio-comment list is not available, now also CC'ing Parav,
> which may be interested in this virtio-rtc spec related discussion thread.
> 
> On 14.03.24 15:19, David Woodhouse wrote:
> > On 14 March 2024 11:13:37 CET, Peter Hilber <peter.hilber@opensynergy.com> wrote:
> > > > To a certain extent, as long as the virtio-rtc device is designed to expose time precisely and unambiguously, it's less important if the Linux kernel *today* can use that. Although of course we should strive for that. Let's be...well, *unambiguous*, I suppose... that we've changed topics to discuss that though.
> > > > 
> > > 
> > > As Virtio is extensible (unlike hardware), my approach is to mostly specify
> > > only what also has a PoC user and a use case.
> > 
> > If we get memory-mapped (X, Y, Z, ±x, ±y) I'll have a user and a use case on day one. Otherwise, as I said in my first response, I can go do that as a separate device and decide that virtio_rtc doesn't meet our needs (especially for maintaining accuracy over LM).
> 
> We plan to add 
> 
> - leap second indication,
> 
> - UTC-to-TAI offset,
> 
> - clock smearing indication (including the noon-to-noon linear smearing
>   variant which seems to be somewhat popular), and
>
> - clock accuracy indication
> 
> to the initial spec and to the PoC implementation.

Sounds good, thanks! I look forward to seeing the new revision. I'm
hoping Julien can give feedback on the clock accuracy parts.

> However, due to resource restrictions, we cannot ourselves add the
> memory-mapped clock to the initial spec.
>
> Everyone is very welcome to contribute the memory-mapped clock to the spec,
> and I think it might then still make it to the initial version.

Makes sense. That is my primary target, so I'm *hoping* we can converge
and get that into your initial spec, otherwise for expediency I'm going
to have to define an ACPI or DT or PCI device of our own and expose the
memory region through that instead.

(Even if I have to do that in the short term to stop the bleeding with
customers' clocks and live migration, I'd still aspire to migrate to a
virtio_rtc version of it in future)
David Woodhouse June 15, 2024, 8:40 a.m. UTC | #23
On Mon, 2023-12-18 at 08:38 +0100, Peter Hilber wrote:
> RFC v3 updates
> --------------
> 
> This series implements a driver for a virtio-rtc device conforming to spec
> RFC v3 [1]. It now includes an RTC class driver with alarm, in addition to
> the PTP clock driver already present before.
> 
> This patch series depends on the patch series "treewide: Use clocksource id
> for get_device_system_crosststamp()" [3]. Pull [4] to get the combined
> series on top of mainline.
> 
> Overview
> --------
> 
> This patch series adds the virtio_rtc module, and related bugfixes. The
> virtio_rtc module implements a driver compatible with the proposed Virtio
> RTC device specification [1]. The Virtio RTC (Real Time Clock) device
> provides information about current time. The device can provide different
> clocks, e.g. for the UTC or TAI time standards, or for physical time
> elapsed since some past epoch. The driver can read the clocks with simple
> or more accurate methods. Optionally, the driver can set an alarm.
> 
> The series first fixes some bugs in the get_device_system_crosststamp()
> interpolation code, which is required for reliable virtio_rtc operation.
> Then, add the virtio_rtc implementation.
> 
> For the Virtio RTC device, there is currently a proprietary implementation,
> which has been used for provisional testing.

As discussed before, I don't think it makes sense to design a new high-
precision virtual clock which only gets it right *most* of the time. We
absolutely need to address the issue of live migration.

When live migration occurs, the guest's time precision suffers in two
ways.

First, even when migrating to a supposedly identical host the precise
rate of the underlying counter (TSC, arch counter, etc.) can change
within the tolerances (e.g. ±50PPM) of the hardware. Unlike the natural
changes that NTP normally manages to track, this is a *step* change,
potentially from -50PPM to +50PPM from one host to the next.

Second, the accuracy of the counter as preserved across migration is
limited by the accuracy of each host's NTP synchronization. So there is
also a step change in the value of the counter itself.

At the moment of migration, the guest's timekeeping should be
considered invalid. Any previous cross-timestamps are meaningless, and
blindly using the previously-known relationship between the counter and
real time can lead to problems such as corruption in distributed
databases, fines for mis-timestamped transactions, and other general
unhappiness.

We obviously can't get a new timestamp from the virtio_rtc device every
time an application wants to know the time reliably. We don't even want
*system* calls for that, which is why we have it in a vDSO.

We can take the same approach to warning the guest about clock
disruption due to live migration. A shared memory region from the
virtual clock device even be mapped all the way to userspace, for those
applications which need precise and *reliable* time to check a
'disruption' marker in it, and do whatever is appropriate (e.g. abort
transactions and wait for time to resync) when it happens.

We can do better than just letting the guest know about disruption, of
course. We can put the actual counter→realtime relationship into the
memory region too. As well as being able to provide a PTP driver with
this, the applications which care about *reliable* timestamps can mmap
the page directly and use it, vDSO-style, to have accurate timestamps
even from the first cycle after migration.

When disruption is signalled, the guest needs to throw away any
*additional* refinement that it's done with NTP/PTP/PPS/etc. and revert
to knowing nothing more than what the hypervisor advertises here.

Here's a first attempt at defining such a memory structure. For now
I've done it as a "vmclock" ACPI device based loosely on vmgenid, but I
think it makes most sense for this to be part of the virtio_rtc spec.
Ultimately it doesn't matter *how* the guest finds the memory region.

Very preliminary QEMU hacks at 
https://git.infradead.org/users/dwmw2/qemu.git/shortlog/refs/heads/vmclock
(I still need to do the KVM side helper for actually filling in the
host clock information, converted to the *guest* TSC)

This is the guest side. H aving heckled your assumption that we can use
the virt counter on Arm, I concede I'm doing the same thing for now.
The structure itself is at the end, or see
https://git.infradead.org/?p=users/dwmw2/linux.git;a=blob;f=include/uapi/linux/vmclock.h;hb=vmclock

From 9e1c3b823d497efa4e0acb21b226a72e4d6e8a53 Mon Sep 17 00:00:00 2001
From: David Woodhouse <dwmw@amazon.co.uk>
Date: Mon, 10 Jun 2024 15:10:11 +0100
Subject: [PATCH] ptp: Add vDSO-style vmclock support

The vmclock "device" provides a shared memory region with precision clock
information. By using shared memory, it is safe across Live Migration.

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
---
 drivers/ptp/Kconfig          |  13 ++
 drivers/ptp/Makefile         |   1 +
 drivers/ptp/ptp_vmclock.c    | 404 +++++++++++++++++++++++++++++++++++
 include/uapi/linux/vmclock.h | 102 +++++++++
 4 files changed, 520 insertions(+)
 create mode 100644 drivers/ptp/ptp_vmclock.c
 create mode 100644 include/uapi/linux/vmclock.h

diff --git a/drivers/ptp/Kconfig b/drivers/ptp/Kconfig
index 604541dcb320..ace6d58c1781 100644
--- a/drivers/ptp/Kconfig
+++ b/drivers/ptp/Kconfig
@@ -131,6 +131,19 @@ config PTP_1588_CLOCK_KVM
 	  To compile this driver as a module, choose M here: the module
 	  will be called ptp_kvm.
 
+config PTP_1588_CLOCK_VMCLOCK
+	tristate "Virtual machine PTP clock"
+	depends on X86_TSC || ARM_ARCH_TIMER
+	depends on PTP_1588_CLOCK && ACPI
+	default y
+	help
+	  This driver adds support for using a virtual precision clock
+	  advertised by the hypervisor. This clock is only useful in virtual
+	  machines where such a device is present.
+
+	  To compile this driver as a module, choose M here: the module
+	  will be called ptp_vmclock.
+
 config PTP_1588_CLOCK_IDT82P33
 	tristate "IDT 82P33xxx PTP clock"
 	depends on PTP_1588_CLOCK && I2C
diff --git a/drivers/ptp/Makefile b/drivers/ptp/Makefile
index 68bf02078053..01b5cd91eb61 100644
--- a/drivers/ptp/Makefile
+++ b/drivers/ptp/Makefile
@@ -11,6 +11,7 @@ obj-$(CONFIG_PTP_1588_CLOCK_DTE)	+= ptp_dte.o
 obj-$(CONFIG_PTP_1588_CLOCK_INES)	+= ptp_ines.o
 obj-$(CONFIG_PTP_1588_CLOCK_PCH)	+= ptp_pch.o
 obj-$(CONFIG_PTP_1588_CLOCK_KVM)	+= ptp_kvm.o
+obj-$(CONFIG_PTP_1588_CLOCK_VMCLOCK)	+= ptp_vmclock.o
 obj-$(CONFIG_PTP_1588_CLOCK_QORIQ)	+= ptp-qoriq.o
 ptp-qoriq-y				+= ptp_qoriq.o
 ptp-qoriq-$(CONFIG_DEBUG_FS)		+= ptp_qoriq_debugfs.o
diff --git a/drivers/ptp/ptp_vmclock.c b/drivers/ptp/ptp_vmclock.c
new file mode 100644
index 000000000000..3c4e027090c5
--- /dev/null
+++ b/drivers/ptp/ptp_vmclock.c
@@ -0,0 +1,404 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Virtual PTP 1588 clock for use with LM-safe VMclock device.
+ *
+ * Copyright © 2024 Amazon.com, Inc. or its affiliates.
+ */
+
+#include <linux/device.h>
+#include <linux/err.h>
+#include <linux/init.h>
+#include <linux/kernel.h>
+#include <linux/slab.h>
+#include <linux/module.h>
+#include <linux/ptp_kvm.h>
+#include <linux/platform_device.h>
+#include <linux/miscdevice.h>
+#include <linux/acpi.h>
+#include <uapi/linux/vmclock.h>
+
+#include <linux/ptp_clock_kernel.h>
+
+static DEFINE_IDA(vmclock_ida);
+
+ACPI_MODULE_NAME("vmclock");
+
+struct vmclock_state {
+	phys_addr_t phys_addr;
+	struct vmclock_abi *clk;
+	struct miscdevice miscdev;
+	struct ptp_clock_info ptp_clock_info;
+	struct ptp_clock *ptp_clock;
+	enum clocksource_ids cs_id;
+	int index;
+	char *name;
+};
+
+static int vmclock_get_crosststamp(struct vmclock_state *st,
+				   struct system_counterval_t *system_counter,
+				   struct timespec64 *tspec)
+{
+	uint64_t cycle, delta, seq, delta_s32, delta_s64, delta_s32b;
+	uint64_t delta_lo, delta_hi, period_lo, period_hi, frac_sec_lo, frac_sec_hi;
+	int ret = 0;
+
+#ifdef CONFIG_X86
+	if (check_tsc_unstable())
+		return -EINVAL;
+#endif
+
+	preempt_disable_notrace();
+
+	do {
+
+		seq = st->clk->seq_count & ~1ULL;
+		virt_rmb();
+
+		if (st->clk->clock_status == VMCLOCK_STATUS_UNRELIABLE) {
+			ret = -EINVAL;
+			virt_rmb();
+			continue;
+		}
+
+		cycle = get_cycles();
+
+		tspec->tv_sec = st->clk->utc_time_sec;
+		frac_sec_lo = st->clk->utc_time_frac_sec & 0xffffffff;
+		frac_sec_hi = st->clk->utc_time_frac_sec >> 32;
+
+		delta = cycle - st->clk->counter_value;
+
+		delta_lo = delta & 0xffffffff;
+		delta_hi = delta >> 32;
+		period_lo = st->clk->counter_period_frac_sec & 0xffffffff;
+		period_hi = st->clk->counter_period_frac_sec >> 32;
+
+		/* Delta in units of (second >> 32) */
+		delta_s32 = delta_lo * period_hi;
+
+		/* Avoid overflow by skimming off the full seconds */
+		tspec->tv_sec += delta_s32 >> 32;
+		delta_s32 &= 0xffffffff;
+
+		if (delta_hi) {
+			tspec->tv_sec += delta_hi * period_hi;
+			delta_s32b = delta_hi * period_lo;
+
+			tspec->tv_sec += delta_s32b >> 32;
+			delta_s32 += delta_s32b & 0xffffffff;
+		}
+
+		delta_s32 += frac_sec_hi;
+
+		delta_s64 = (period_lo * delta_lo) + frac_sec_lo;
+		delta_s32 += delta_s64 >> 32;
+		delta_s64 &= 0xffffffff;
+
+		tspec->tv_sec += delta_s32 >> 32;
+		delta_s32 &= 0xffffffff;
+
+		tspec->tv_nsec = ((delta_s32 * NSEC_PER_SEC) +
+				  ((delta_s64 * NSEC_PER_SEC) >> 32)) >> 32;
+
+		ret = 0;
+
+		virt_rmb();
+	} while (seq != st->clk->seq_count);
+
+	preempt_enable_notrace();
+
+	if (ret)
+		return ret;
+
+	if (system_counter) {
+		system_counter->cycles = cycle;
+		system_counter->cs_id = st->cs_id;
+	}
+
+	return 0;
+}
+
+static int ptp_vmclock_get_time_fn(ktime_t *device_time,
+				   struct system_counterval_t *system_counter,
+				   void *ctx)
+{
+	struct vmclock_state *st = ctx;
+	struct timespec64 tspec;
+	int ret;
+
+	ret = vmclock_get_crosststamp(st, system_counter, &tspec);
+	if (!ret)
+		*device_time = timespec64_to_ktime(tspec);
+
+	return ret;
+}
+
+
+static int ptp_vmclock_getcrosststamp(struct ptp_clock_info *ptp,
+				      struct system_device_crosststamp *xtstamp)
+{
+	struct vmclock_state *st = container_of(ptp, struct vmclock_state,
+						ptp_clock_info);
+
+	return get_device_system_crosststamp(ptp_vmclock_get_time_fn, st,
+					     NULL, xtstamp);
+}
+
+/*
+ * PTP clock operations
+ */
+
+static int ptp_vmclock_adjfine(struct ptp_clock_info *ptp, long delta)
+{
+	return -EOPNOTSUPP;
+}
+
+static int ptp_vmclock_adjtime(struct ptp_clock_info *ptp, s64 delta)
+{
+	return -EOPNOTSUPP;
+}
+
+static int ptp_vmclock_settime(struct ptp_clock_info *ptp,
+			   const struct timespec64 *ts)
+{
+	return -EOPNOTSUPP;
+}
+
+static int ptp_vmclock_gettime(struct ptp_clock_info *ptp, struct timespec64 *ts)
+{
+	struct vmclock_state *st = container_of(ptp, struct vmclock_state,
+						ptp_clock_info);
+
+	return vmclock_get_crosststamp(st, NULL, ts);
+}
+
+static int ptp_vmclock_enable(struct ptp_clock_info *ptp,
+			  struct ptp_clock_request *rq, int on)
+{
+	return -EOPNOTSUPP;
+}
+
+static const struct ptp_clock_info ptp_vmclock_info = {
+	.owner		= THIS_MODULE,
+	.max_adj	= 0,
+	.n_ext_ts	= 0,
+	.n_pins		= 0,
+	.pps		= 0,
+	.adjfine	= ptp_vmclock_adjfine,
+	.adjtime	= ptp_vmclock_adjtime,
+	.gettime64	= ptp_vmclock_gettime,
+	.settime64	= ptp_vmclock_settime,
+	.enable		= ptp_vmclock_enable,
+	.getcrosststamp = ptp_vmclock_getcrosststamp,
+};
+
+static int vmclock_miscdev_mmap(struct file *fp, struct vm_area_struct *vma)
+{
+	struct vmclock_state *st = container_of(fp->private_data,
+						struct vmclock_state, miscdev);
+
+	if ((vma->vm_flags & (VM_READ|VM_WRITE)) != VM_READ)
+		return -EROFS;
+
+	if (vma->vm_end - vma->vm_start != PAGE_SIZE || vma->vm_pgoff)
+		return -EINVAL;
+
+        if (io_remap_pfn_range(vma, vma->vm_start,
+			       st->phys_addr >> PAGE_SHIFT, PAGE_SIZE,
+                               vma->vm_page_prot))
+                return -EAGAIN;
+
+        return 0;
+}
+
+static ssize_t vmclock_miscdev_read(struct file *fp, char __user *buf,
+				    size_t count, loff_t *ppos)
+{
+	struct vmclock_state *st = container_of(fp->private_data,
+						struct vmclock_state, miscdev);
+	size_t max_count;
+	int32_t seq;
+
+	if (*ppos >= PAGE_SIZE)
+		return 0;
+
+	max_count = PAGE_SIZE - *ppos;
+	if (count > max_count)
+		count = max_count;
+
+	do {
+		seq = st->clk->seq_count & ~1ULL;
+		virt_rmb();
+
+		if (copy_to_user(buf, ((char *)st->clk) + *ppos, count))
+			return -EFAULT;
+
+		virt_rmb();
+	} while (seq != st->clk->seq_count);
+
+	*ppos += count;
+	return count;
+}
+
+static const struct file_operations vmclock_miscdev_fops = {
+        .mmap = vmclock_miscdev_mmap,
+        .read = vmclock_miscdev_read,
+};
+
+/* module operations */
+
+static void vmclock_remove(struct platform_device *pdev)
+{
+	struct device *dev = &pdev->dev;
+	struct vmclock_state *st = dev_get_drvdata(dev);
+
+	if (st->ptp_clock)
+		ptp_clock_unregister(st->ptp_clock);
+
+	if (st->miscdev.minor == MISC_DYNAMIC_MINOR)
+		misc_deregister(&st->miscdev);
+}
+
+static int vmclock_probe_acpi(struct device *dev, struct vmclock_state *st)
+{
+	struct acpi_buffer parsed = { ACPI_ALLOCATE_BUFFER };
+	struct acpi_device *adev = ACPI_COMPANION(dev);
+	union acpi_object *obj;
+	acpi_status status;
+
+	status = acpi_evaluate_object(adev->handle, "ADDR", NULL, &parsed);
+	if (ACPI_FAILURE(status)) {
+		ACPI_EXCEPTION((AE_INFO, status, "Evaluating ADDR"));
+		return -ENODEV;
+	}
+	obj = parsed.pointer;
+	if (!obj || obj->type != ACPI_TYPE_PACKAGE || obj->package.count != 2 ||
+	    obj->package.elements[0].type != ACPI_TYPE_INTEGER ||
+	    obj->package.elements[1].type != ACPI_TYPE_INTEGER)
+		return -EINVAL;
+
+	st->phys_addr = (obj->package.elements[0].integer.value << 0) |
+		(obj->package.elements[1].integer.value << 32);
+
+	return 0;
+}
+
+static void vmclock_put_idx(void *data)
+{
+	struct vmclock_state *st = data;
+
+	ida_free(&vmclock_ida, st->index);
+}
+
+static int vmclock_probe(struct platform_device *pdev)
+{
+	struct device *dev = &pdev->dev;
+	struct vmclock_state *st;
+	int ret;
+
+	st = devm_kzalloc(dev, sizeof (*st), GFP_KERNEL);
+	if (!st)
+		return -ENOMEM;
+
+	if (has_acpi_companion(dev))
+		ret = vmclock_probe_acpi(dev, st);
+	else
+		ret = -EINVAL; /* Only ACPI for now */
+
+	if (ret) {
+		dev_info(dev, "Failed to obtain physical address: %d\n", ret);
+		goto out;
+	}
+
+	st->clk = devm_memremap(dev, st->phys_addr, sizeof(*st->clk),
+				MEMREMAP_WB);
+	if (IS_ERR(st->clk)) {
+		ret = PTR_ERR(st->clk);
+		dev_info(dev, "failed to map shared memory\n");
+		st->clk = NULL;
+		goto out;
+	}
+
+	if (st->clk->magic != VMCLOCK_MAGIC ||
+	    st->clk->size < sizeof(*st->clk) ||
+	    st->clk->version != 1) {
+		dev_info(dev, "vmclock magic fields invalid\n");
+		ret = -EINVAL;
+		goto out;
+	}
+
+	if (IS_ENABLED(CONFIG_ARM64) &&
+	    st->clk->counter_id == VMCLOCK_COUNTER_ARM_VCNT) {
+		/* Can we check it's the virtual counter? */
+		st->cs_id = CSID_ARM_ARCH_COUNTER;
+	} else if (IS_ENABLED(CONFIG_X86) &&
+		   st->clk->counter_id == VMCLOCK_COUNTER_X86_TSC) {
+		st->cs_id = CSID_X86_TSC;
+	}
+
+	ret = ida_alloc(&vmclock_ida, GFP_KERNEL);
+	if (ret < 0)
+		goto out;
+
+	st->index = ret;
+        ret = devm_add_action_or_reset(&pdev->dev, vmclock_put_idx, st);
+	if (ret)
+		goto out;
+
+	st->name = devm_kasprintf(&pdev->dev, GFP_KERNEL, "vmclock%d", st->index);
+	if (!st->name) {
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	/* If the structure is big enough, it can be mapped to userspace */
+	if (st->clk->size >= PAGE_SIZE) {
+		st->miscdev.minor = MISC_DYNAMIC_MINOR;
+		st->miscdev.fops = &vmclock_miscdev_fops;
+		st->miscdev.name = st->name;
+
+		ret = misc_register(&st->miscdev);
+		if (ret)
+			goto out;
+	}
+
+	/* If there is valid clock information, register a PTP clock */
+	if (st->cs_id) {
+		st->ptp_clock_info = ptp_vmclock_info;
+		strncpy(st->ptp_clock_info.name, st->name, sizeof(st->ptp_clock_info.name));
+		st->ptp_clock = ptp_clock_register(&st->ptp_clock_info, dev);
+
+		if (IS_ERR(st->ptp_clock)) {
+			ret = PTR_ERR(st->ptp_clock);
+			st->ptp_clock = NULL;
+			vmclock_remove(pdev);
+			goto out;
+		}
+	}
+
+	dev_set_drvdata(dev, st);
+
+ out:
+	return ret;
+}
+
+static const struct acpi_device_id vmclock_acpi_ids[] = {
+	{ "VMCLOCK", 0 },
+	{}
+};
+MODULE_DEVICE_TABLE(acpi, vmclock_acpi_ids);
+
+static struct platform_driver vmclock_platform_driver = {
+	.probe		= vmclock_probe,
+	.remove_new	= vmclock_remove,
+	.driver	= {
+		.name	= "vmclock",
+		.acpi_match_table = vmclock_acpi_ids,
+	},
+};
+
+module_platform_driver(vmclock_platform_driver)
+
+MODULE_AUTHOR("David Woodhouse <dwmw2@infradead.org>");
+MODULE_DESCRIPTION("PTP clock using VMCLOCK");
+MODULE_LICENSE("GPL v2");
diff --git a/include/uapi/linux/vmclock.h b/include/uapi/linux/vmclock.h
new file mode 100644
index 000000000000..cac24536c5c8
--- /dev/null
+++ b/include/uapi/linux/vmclock.h
@@ -0,0 +1,102 @@
+/* SPDX-License-Identifier: ((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause) */
+
+/*
+ * This structure provides a vDSO-style clock to VM guests, exposing the
+ * relationship (or lack thereof) between the CPU clock (TSC, timebase, arch
+ * counter, etc.) and real time. It is designed to address the problem of
+ * live migration, which other clock enlightenments do not.
+ *
+ * When a guest is live migrated, this affects the clock in two ways.
+ *
+ * First, even between identical hosts the actual frequency of the underlying
+ * counter will change within the tolerances of its specification (typically
+ * ±50PPM, or 4 seconds a day). The frequency also varies over time on the
+ * same host, but can be tracked by NTP as it generally varies slowly. With
+ * live migration there is a step change in the frequency, with no warning.
+ *
+ * Second, there may be a step change in the value of the counter itself, as
+ * its accuracy is limited by the precision of the NTP synchronization on the
+ * source and destination hosts.
+ *
+ * So any calibration (NTP, PTP, etc.) which the guest has done on the source
+ * host before migration is invalid, and needs to be redone on the new host.
+ *
+ * In its most basic mode, this structure provides only an indication to the
+ * guest that live migration has occurred. This allows the guest to know that
+ * its clock is invalid and take remedial action. For applications that need
+ * reliable accurate timestamps (e.g. distributed databases), the structure
+ * can be mapped all the way to userspace. This allows the application to see
+ * directly for itself that the clock is disrupted and take appropriate
+ * action, even when using a vDSO-style method to get the time instead of a
+ * system call.
+ *
+ * In its more advanced mode. this structure can also be used to expose the
+ * precise relationship of the CPU counter to real time, as calibrated by the
+ * host. This means that userspace applications can have accurate time
+ * immediately after live migration, rather than having to pause operations
+ * and wait for NTP to recover. This mode does, of course, rely on the
+ * counter being reliable and consistent across CPUs.
+ */
+
+#ifdef __KERNEL__
+#include <linux/types.h>
+#else
+#include <stdint.h>
+#endif
+
+struct vmclock_abi {
+	uint32_t magic;
+#define VMCLOCK_MAGIC	0x4b4c4356 /* "VCLK" */
+	uint16_t size;		/* Size of page containing this structure */
+	uint16_t version;	/* 1 */
+
+	/* Sequence lock. Low bit means an update is in progress. */
+	uint64_t seq_count;
+
+	/*
+	 * This field changes to another non-repeating value when the CPU
+	 * counter is disrupted, for example on live migration.
+	 */
+	uint64_t disruption_marker;
+
+	uint8_t clock_status;
+#define VMCLOCK_STATUS_UNKNOWN		0
+#define VMCLOCK_STATUS_INITIALIZING	1
+#define VMCLOCK_STATUS_SYNCHRONIZED	2
+#define VMCLOCK_STATUS_FREERUNNING	3
+#define VMCLOCK_STATUS_UNRELIABLE	4
+
+	uint8_t counter_id;
+#define VMCLOCK_COUNTER_INVALID		0
+#define VMCLOCK_COUNTER_X86_TSC		1
+#define VMCLOCK_COUNTER_ARM_VCNT	2
+
+	uint8_t pad[3];
+
+	/*
+	 * UTC on its own is non-monotonic and ambiguous.
+	 *
+	 * Inform the guest about the TAI offset, so that it can have an
+	 * actual monotonic and reliable time reference if it needs it.
+	 *
+	 * Also indicate a nearby leap second, if one exists. Unlike in
+	 * NTP, this may indicate a leap second in the past in order to
+	 * indicate that it *has* been taken into account.
+	 */
+	int8_t leapsecond_direction;
+	int16_t tai_offset_sec;
+	uint64_t leapsecond_counter_value;
+	uint64_t leapsecond_tai_time;
+
+	/* Paired values of counter and UTC at a given point in time. */
+	uint64_t counter_value;
+	uint64_t utc_time_sec;
+	uint64_t utc_time_frac_sec;
+
+	/* Counter frequency, and error margin. Units of (second >> 64) */
+	uint64_t counter_period_frac_sec;
+	uint64_t counter_period_error_rate_frac_sec;
+
+	/* Error margin of UTC reading above (± picoseconds) */
+	uint64_t utc_time_maxerror_picosec;
+};
Peter Hilber June 20, 2024, 12:37 p.m. UTC | #24
Changing virtio-dev address to the new one. The discussion might also be
relevant for virtio-comment, but it is discouraged to forward it to both.

On 15.06.24 10:40, David Woodhouse wrote:
> On Mon, 2023-12-18 at 08:38 +0100, Peter Hilber wrote:
>> RFC v3 updates
>> --------------
>>
>> This series implements a driver for a virtio-rtc device conforming to spec
>> RFC v3 [1]. It now includes an RTC class driver with alarm, in addition to
>> the PTP clock driver already present before.
>>
>> This patch series depends on the patch series "treewide: Use clocksource id
>> for get_device_system_crosststamp()" [3]. Pull [4] to get the combined
>> series on top of mainline.
>>
>> Overview
>> --------
>>
>> This patch series adds the virtio_rtc module, and related bugfixes. The
>> virtio_rtc module implements a driver compatible with the proposed Virtio
>> RTC device specification [1]. The Virtio RTC (Real Time Clock) device
>> provides information about current time. The device can provide different
>> clocks, e.g. for the UTC or TAI time standards, or for physical time
>> elapsed since some past epoch. The driver can read the clocks with simple
>> or more accurate methods. Optionally, the driver can set an alarm.
>>
>> The series first fixes some bugs in the get_device_system_crosststamp()
>> interpolation code, which is required for reliable virtio_rtc operation.
>> Then, add the virtio_rtc implementation.
>>
>> For the Virtio RTC device, there is currently a proprietary implementation,
>> which has been used for provisional testing.
> 
> As discussed before, I don't think it makes sense to design a new high-
> precision virtual clock which only gets it right *most* of the time. We
> absolutely need to address the issue of live migration.
> 
> When live migration occurs, the guest's time precision suffers in two
> ways.
> 
> First, even when migrating to a supposedly identical host the precise
> rate of the underlying counter (TSC, arch counter, etc.) can change
> within the tolerances (e.g. ±50PPM) of the hardware. Unlike the natural
> changes that NTP normally manages to track, this is a *step* change,
> potentially from -50PPM to +50PPM from one host to the next.
> 
> Second, the accuracy of the counter as preserved across migration is
> limited by the accuracy of each host's NTP synchronization. So there is
> also a step change in the value of the counter itself.
> 
> At the moment of migration, the guest's timekeeping should be
> considered invalid. Any previous cross-timestamps are meaningless, and
> blindly using the previously-known relationship between the counter and
> real time can lead to problems such as corruption in distributed
> databases, fines for mis-timestamped transactions, and other general
> unhappiness.
> 
> We obviously can't get a new timestamp from the virtio_rtc device every
> time an application wants to know the time reliably. We don't even want
> *system* calls for that, which is why we have it in a vDSO.
> 
> We can take the same approach to warning the guest about clock
> disruption due to live migration. A shared memory region from the
> virtual clock device even be mapped all the way to userspace, for those
> applications which need precise and *reliable* time to check a
> 'disruption' marker in it, and do whatever is appropriate (e.g. abort
> transactions and wait for time to resync) when it happens.
> 
> We can do better than just letting the guest know about disruption, of
> course. We can put the actual counter→realtime relationship into the
> memory region too. As well as being able to provide a PTP driver with
> this, the applications which care about *reliable* timestamps can mmap
> the page directly and use it, vDSO-style, to have accurate timestamps
> even from the first cycle after migration.
> 
> When disruption is signalled, the guest needs to throw away any
> *additional* refinement that it's done with NTP/PTP/PPS/etc. and revert
> to knowing nothing more than what the hypervisor advertises here.
> 
> Here's a first attempt at defining such a memory structure. For now
> I've done it as a "vmclock" ACPI device based loosely on vmgenid, but I
> think it makes most sense for this to be part of the virtio_rtc spec.
> Ultimately it doesn't matter *how* the guest finds the memory region.

This looks sensible to me. I also think it would be possible to adapt this for
the virtio-rtc spec. The proposal also supports some other use cases which are
not in the virtio-rtc RFC spec v4 [2], notably vDSO-style clock reading, and
others such as indication of a past leap second.

Compared to the virtio-rtc RFC spec v4 [2], this proposal does not seem to
support leap second smearing. Also, it would be helpful to allow indicating
when some of the fields are not valid (such as leapsecond_counter_value,
leapsecond_tai_time, tai_offset_sec, utc_time_maxerror_picosec, ...).

Do you have plans to contribute this to the Virtio specification and Linux
driver implementation?

I also added a few comments below.

Thanks for sharing,

Peter

[2] https://lore.kernel.org/virtio-comment/20240522170003.52565-1-peter.hilber@opensynergy.com/

> 
> Very preliminary QEMU hacks at 
> https://git.infradead.org/users/dwmw2/qemu.git/shortlog/refs/heads/vmclock
> (I still need to do the KVM side helper for actually filling in the
> host clock information, converted to the *guest* TSC)
> 
> This is the guest side. H aving heckled your assumption that we can use
> the virt counter on Arm, I concede I'm doing the same thing for now.
> The structure itself is at the end, or see
> https://git.infradead.org/?p=users/dwmw2/linux.git;a=blob;f=include/uapi/linux/vmclock.h;hb=vmclock
> 
> From 9e1c3b823d497efa4e0acb21b226a72e4d6e8a53 Mon Sep 17 00:00:00 2001
> From: David Woodhouse <dwmw@amazon.co.uk>
> Date: Mon, 10 Jun 2024 15:10:11 +0100
> Subject: [PATCH] ptp: Add vDSO-style vmclock support
> 
> The vmclock "device" provides a shared memory region with precision clock
> information. By using shared memory, it is safe across Live Migration.
> 
> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
> ---
>  drivers/ptp/Kconfig          |  13 ++
>  drivers/ptp/Makefile         |   1 +
>  drivers/ptp/ptp_vmclock.c    | 404 +++++++++++++++++++++++++++++++++++
>  include/uapi/linux/vmclock.h | 102 +++++++++
>  4 files changed, 520 insertions(+)
>  create mode 100644 drivers/ptp/ptp_vmclock.c
>  create mode 100644 include/uapi/linux/vmclock.h
> 

[...]

> +
> +static const struct ptp_clock_info ptp_vmclock_info = {
> +	.owner		= THIS_MODULE,
> +	.max_adj	= 0,
> +	.n_ext_ts	= 0,
> +	.n_pins		= 0,
> +	.pps		= 0,
> +	.adjfine	= ptp_vmclock_adjfine,
> +	.adjtime	= ptp_vmclock_adjtime,
> +	.gettime64	= ptp_vmclock_gettime,

Should implement .gettimex64 instead.

> +	.settime64	= ptp_vmclock_settime,
> +	.enable		= ptp_vmclock_enable,
> +	.getcrosststamp = ptp_vmclock_getcrosststamp,
> +};
> +

[...]

> +/*
> + * This structure provides a vDSO-style clock to VM guests, exposing the
> + * relationship (or lack thereof) between the CPU clock (TSC, timebase, arch
> + * counter, etc.) and real time. It is designed to address the problem of
> + * live migration, which other clock enlightenments do not.
> + *
> + * When a guest is live migrated, this affects the clock in two ways.
> + *
> + * First, even between identical hosts the actual frequency of the underlying
> + * counter will change within the tolerances of its specification (typically
> + * ±50PPM, or 4 seconds a day). The frequency also varies over time on the
> + * same host, but can be tracked by NTP as it generally varies slowly. With
> + * live migration there is a step change in the frequency, with no warning.
> + *
> + * Second, there may be a step change in the value of the counter itself, as
> + * its accuracy is limited by the precision of the NTP synchronization on the
> + * source and destination hosts.
> + *
> + * So any calibration (NTP, PTP, etc.) which the guest has done on the source
> + * host before migration is invalid, and needs to be redone on the new host.
> + *
> + * In its most basic mode, this structure provides only an indication to the
> + * guest that live migration has occurred. This allows the guest to know that
> + * its clock is invalid and take remedial action. For applications that need
> + * reliable accurate timestamps (e.g. distributed databases), the structure
> + * can be mapped all the way to userspace. This allows the application to see
> + * directly for itself that the clock is disrupted and take appropriate
> + * action, even when using a vDSO-style method to get the time instead of a
> + * system call.
> + *
> + * In its more advanced mode. this structure can also be used to expose the
> + * precise relationship of the CPU counter to real time, as calibrated by the
> + * host. This means that userspace applications can have accurate time
> + * immediately after live migration, rather than having to pause operations
> + * and wait for NTP to recover. This mode does, of course, rely on the
> + * counter being reliable and consistent across CPUs.
> + */
> +
> +#ifdef __KERNEL__
> +#include <linux/types.h>
> +#else
> +#include <stdint.h>
> +#endif
> +
> +struct vmclock_abi {
> +	uint32_t magic;
> +#define VMCLOCK_MAGIC	0x4b4c4356 /* "VCLK" */
> +	uint16_t size;		/* Size of page containing this structure */
> +	uint16_t version;	/* 1 */
> +
> +	/* Sequence lock. Low bit means an update is in progress. */
> +	uint64_t seq_count;
> +
> +	/*
> +	 * This field changes to another non-repeating value when the CPU
> +	 * counter is disrupted, for example on live migration.
> +	 */
> +	uint64_t disruption_marker;
> +
> +	uint8_t clock_status;
> +#define VMCLOCK_STATUS_UNKNOWN		0
> +#define VMCLOCK_STATUS_INITIALIZING	1
> +#define VMCLOCK_STATUS_SYNCHRONIZED	2
> +#define VMCLOCK_STATUS_FREERUNNING	3
> +#define VMCLOCK_STATUS_UNRELIABLE	4
> +
> +	uint8_t counter_id;
> +#define VMCLOCK_COUNTER_INVALID		0
> +#define VMCLOCK_COUNTER_X86_TSC		1
> +#define VMCLOCK_COUNTER_ARM_VCNT	2
> +
> +	uint8_t pad[3];
> +
> +	/*
> +	 * UTC on its own is non-monotonic and ambiguous.
> +	 *
> +	 * Inform the guest about the TAI offset, so that it can have an
> +	 * actual monotonic and reliable time reference if it needs it.
> +	 *
> +	 * Also indicate a nearby leap second, if one exists. Unlike in
> +	 * NTP, this may indicate a leap second in the past in order to
> +	 * indicate that it *has* been taken into account.
> +	 */
> +	int8_t leapsecond_direction;
> +	int16_t tai_offset_sec;
> +	uint64_t leapsecond_counter_value;
> +	uint64_t leapsecond_tai_time;
> +
> +	/* Paired values of counter and UTC at a given point in time. */
> +	uint64_t counter_value;
> +	uint64_t utc_time_sec;
> +	uint64_t utc_time_frac_sec;
> +
> +	/* Counter frequency, and error margin. Units of (second >> 64) */
> +	uint64_t counter_period_frac_sec;

AFAIU this might limit the precision in case of high counter frequencies.
Could the unit be aligned to the expected frequency band of counters?

> +	uint64_t counter_period_error_rate_frac_sec;
> +
> +	/* Error margin of UTC reading above (± picoseconds) */
> +	uint64_t utc_time_maxerror_picosec;
> +};
David Woodhouse June 20, 2024, 4:19 p.m. UTC | #25
On Thu, 2024-06-20 at 14:37 +0200, Peter Hilber wrote:
> Changing virtio-dev address to the new one. The discussion might also be
> relevant for virtio-comment, but it is discouraged to forward it to both.

I will happily take it to whichever forum you think is most
appropriate. (And you have my permission to direct replies to whichever
you choose.)

> On 15.06.24 10:40, David Woodhouse wrote:
> > As discussed before, I don't think it makes sense to design a new high-
> > precision virtual clock which only gets it right *most* of the time. We
> > absolutely need to address the issue of live migration.
 ...
> > Here's a first attempt at defining such a memory structure. For now
> > I've done it as a "vmclock" ACPI device based loosely on vmgenid, but I
> > think it makes most sense for this to be part of the virtio_rtc spec.
> > Ultimately it doesn't matter *how* the guest finds the memory region.
> 
> This looks sensible to me. I also think it would be possible to adapt this for
> the virtio-rtc spec. The proposal also supports some other use cases which are
> not in the virtio-rtc RFC spec v4 [2], notably vDSO-style clock reading, and
> others such as indication of a past leap second.

Right. The vDSO-style clock reading is key to solving the live
migration problem.

The other key thing this adds is the error bounds, which some
applications care deeply about. I've been working with the team that
owns ClockBound on that part: https://github.com/aws/clock-bound

> Compared to the virtio-rtc RFC spec v4 [2], this proposal does not seem to
> support leap second smearing. 

That's kind of intentional. Leap second smearing should be considered
the *antithesis* of precise time. Anyone who wants a monotonic realtime
clock should be using the POSIX CLOCK_TAI.

Part of my motivation for fixing the LM problem is because some
financial institutions can incur significant penalties for putting
inaccurate timestamps on transactions — even the disruption caused by
live migration is enough to trigger that. So deliberately lying to them
about what the UTC time is, by up to a second in either direction, is
not necessarily in their best interest.

As you noted, this proposal does expose leap seconds in the recent
past, which is all that's needed to allow a guest to generate a smeared
clock *from* the accurate clock that is provided through this
mechanism.

(Knowledge of past leap seconds is needed because in some modes,
smearing adjustments continue for some hours *afte* the leap second
should have occurred. So the NTP style of leap indicator isn't
sufficient).

> Also, it would be helpful to allow indicating
> when some of the fields are not valid (such as leapsecond_counter_value,
> leapsecond_tai_time, tai_offset_sec, utc_time_maxerror_picosec, ...).

Right. For some of those the answer can just be 'zero means invalid',
for the max error, perhaps MAX_UINT64. But we should definitely make
that explicit.

I'm also not entirely sure I understood Julien's insistence that we
include the leapsecond_counter_value as *well* as the
leapsecond_tai_time. It seems to me that the implementation would have
to recalculate that every time the frequency is adjusted. 

For some of those fields, I was expecting a certain amount of
bikeshedding to occur and figured it was better to post an early straw
man and solicit feedback.

> Do you have plans to contribute this to the Virtio specification and Linux
> driver implementation?

Yes, absolutely. For now I've implemented it in the Linux guest¹ and in
QEMU² as an ACPI device modelled on vmgenid, but I'd love *not* to have
to do that, and just to do it based on virtio instead.

¹ https://git.infradead.org/users/dwmw2/linux.git/shortlog/refs/heads/vmclock
² https://git.infradead.org/users/dwmw2/qemu.git/shortlog/refs/heads/vmclock


> > +static const struct ptp_clock_info ptp_vmclock_info = {
> > +       .owner          = THIS_MODULE,
> > +       .max_adj        = 0,
> > +       .n_ext_ts       = 0,
> > +       .n_pins         = 0,
> > +       .pps            = 0,
> > +       .adjfine        = ptp_vmclock_adjfine,
> > +       .adjtime        = ptp_vmclock_adjtime,
> > +       .gettime64      = ptp_vmclock_gettime,
> 
> Should implement .gettimex64 instead.

Ack, thanks. I'll go play with that.

> 
> > +
> > +       /* Counter frequency, and error margin. Units of (second >> 64) */
> > +       uint64_t counter_period_frac_sec;
> 
> AFAIU this might limit the precision in case of high counter frequencies.
> Could the unit be aligned to the expected frequency band of counters?

This field indicates the period of a single tick, in units of 1>>64 of
a second. That's about 5.4e-20 seconds, or 54 zeptoseconds? 

Can you walk me through a calculation where you believe that level of
precision is insufficient?

I guess the precision matters if the structure isn't updated for a long
period of time, and the delta between the current counter and the
snapshot is high? That's a *lot* of 54 zeptosecondses? But you really
would need a *lot* of them before you care? And if nobody's been
calibrating your counter for that long, surely you have bigger worries?

Am I missing something there?
David Woodhouse June 21, 2024, 8:45 a.m. UTC | #26
On Thu, 2024-06-20 at 17:19 +0100, David Woodhouse wrote:
> 
> > 
> > > +
> > > +       /* Counter frequency, and error margin. Units of (second >> 64) */
> > > +       uint64_t counter_period_frac_sec;
> > 
> > AFAIU this might limit the precision in case of high counter frequencies.
> > Could the unit be aligned to the expected frequency band of counters?
> 
> This field indicates the period of a single tick, in units of 1>>64 of
> a second. That's about 5.4e-20 seconds, or 54 zeptoseconds? 
> 
> Can you walk me through a calculation where you believe that level of
> precision is insufficient?
> 
> I guess the precision matters if the structure isn't updated for a long
> period of time, and the delta between the current counter and the
> snapshot is high? That's a *lot* of 54 zeptosecondses? But you really
> would need a *lot* of them before you care? And if nobody's been
> calibrating your counter for that long, surely you have bigger worries?
> 
> Am I missing something there?

Hm, that was a bit rushed at the end of the day; let's take a better look...

Let's take a hypothetical example of a 100GHz counter. That's two
orders of magnitude more than today's Arm arch counter.

The period of such a counter would be 10 picoseconds. 

(Let's ignore the question of how far light actually travels in that
time and how *realistic* that example is, for the moment.)

It turns out that at that rate, there *are* a lot of 54 zeptosecondses
of precision loss in the day. It could be half a millisecond a day, or
20µs an hour.

That particular example of 10 picoseconds is 184467440.7370955
(seconds>>64) which could be truncated to 184467440 — losing about 4PPB
(a third of a millisecond a day; 14µs an hour).

So yeah, I suppose a 'shift' field could make sense. It's easy enough
to consume on the guest side as it doesn't really perturb the 128-bit
multiplication very much; especially if we don't let it be negative.

And implementations *can* just set it to zero. It hurts nobody.

Or were you thinking of just using a fixed shift like (seconds>>80)
instead?
David Woodhouse June 21, 2024, 2:02 p.m. UTC | #27
On Thu, 2024-06-20 at 14:37 +0200, Peter Hilber wrote:
> Should implement .gettimex64 instead.

Thanks. This look sane?

As noted in the code comment, in the *ideal* case we just build all
three pre/post/device timestamps from the very same counter read. So
sts->pre_ts == sts->post_ts.

In the less ideal case (which will happen on x86 when kvmclock is being
used for the system time), we use the time from ktime_get_snapshot() as
the pre_ts and take a new snapshot immediately after the get_cycles().


diff --git a/drivers/ptp/ptp_vmclock.c b/drivers/ptp/ptp_vmclock.c
index e8c65405a8f3..07a81a94d29a 100644
--- a/drivers/ptp/ptp_vmclock.c
+++ b/drivers/ptp/ptp_vmclock.c
@@ -96,9 +96,11 @@ static inline uint64_t mul_u64_u64_add_u64(uint64_t *res_hi, uint64_t delta,
 }
 
 static int vmclock_get_crosststamp(struct vmclock_state *st,
+				   struct ptp_system_timestamp *sts,
 				   struct system_counterval_t *system_counter,
 				   struct timespec64 *tspec)
 {
+	struct system_time_snapshot systime_snapshot;
 	uint64_t cycle, delta, seq, frac_sec;
 	int ret = 0;
 
@@ -119,7 +121,17 @@ static int vmclock_get_crosststamp(struct vmclock_state *st,
 			continue;
 		}
 
-		cycle = get_cycles();
+		if (sts) {
+			ktime_get_snapshot(&systime_snapshot);
+
+			if (systime_snapshot.cs_id == st->cs_id) {
+				cycle = systime_snapshot.cycles;
+			} else {
+				cycle = get_cycles();
+				ptp_read_system_postts(sts);
+			}
+		} else
+			cycle = get_cycles();
 
 		delta = cycle - st->clk->counter_value;
 
@@ -139,6 +151,21 @@ static int vmclock_get_crosststamp(struct vmclock_state *st,
 	if (ret)
 		return ret;
 
+	/*
+	 * When invoked for gettimex64, fill in the pre/post system times.
+	 * The ideal case is when system time is based on the the same
+	 * counter as st->cs_id, in which case all three pre/post/device
+	 * times are derived from the *same* counter value. If cs_id does
+	 * not match, then the value from ktime_get_snapshot() is used as
+	 * pre_ts, and ptp_read_system_postts() was already called above
+	 * for the post_ts. Those are either side of the get_cycles() call.
+	 */
+	if (sts) {
+		sts->pre_ts = ktime_to_timespec64(systime_snapshot.real);
+		if (systime_snapshot.cs_id == st->cs_id)
+			sts->post_ts = sts->pre_ts;
+	}
+
 	if (system_counter) {
 		system_counter->cycles = cycle;
 		system_counter->cs_id = st->cs_id;
@@ -155,7 +182,7 @@ static int ptp_vmclock_get_time_fn(ktime_t *device_time,
 	struct timespec64 tspec;
 	int ret;
 
-	ret = vmclock_get_crosststamp(st, system_counter, &tspec);
+	ret = vmclock_get_crosststamp(st, NULL, system_counter, &tspec);
 	if (!ret)
 		*device_time = timespec64_to_ktime(tspec);
 
@@ -198,7 +225,16 @@ static int ptp_vmclock_gettime(struct ptp_clock_info *ptp, struct timespec64 *ts
 	struct vmclock_state *st = container_of(ptp, struct vmclock_state,
 						ptp_clock_info);
 
-	return vmclock_get_crosststamp(st, NULL, ts);
+	return vmclock_get_crosststamp(st, NULL, NULL, ts);
+}
+
+static int ptp_vmclock_gettimex(struct ptp_clock_info *ptp, struct timespec64 *ts,
+				struct ptp_system_timestamp *sts)
+{
+	struct vmclock_state *st = container_of(ptp, struct vmclock_state,
+						ptp_clock_info);
+
+	return vmclock_get_crosststamp(st, sts, NULL, ts);
 }
 
 static int ptp_vmclock_enable(struct ptp_clock_info *ptp,
@@ -216,6 +252,7 @@ static const struct ptp_clock_info ptp_vmclock_info = {
 	.adjfine	= ptp_vmclock_adjfine,
 	.adjtime	= ptp_vmclock_adjtime,
 	.gettime64	= ptp_vmclock_gettime,
+	.gettimex64	= ptp_vmclock_gettimex,
 	.settime64	= ptp_vmclock_settime,
 	.enable		= ptp_vmclock_enable,
 	.getcrosststamp = ptp_vmclock_getcrosststamp,
Peter Hilber June 27, 2024, 1:50 p.m. UTC | #28
On 21.06.24 10:45, David Woodhouse wrote:
> On Thu, 2024-06-20 at 17:19 +0100, David Woodhouse wrote:
>>
>>>
>>>> +
>>>> +       /* Counter frequency, and error margin. Units of (second >> 64) */
>>>> +       uint64_t counter_period_frac_sec;
>>>
>>> AFAIU this might limit the precision in case of high counter frequencies.
>>> Could the unit be aligned to the expected frequency band of counters?
>>
>> This field indicates the period of a single tick, in units of 1>>64 of
>> a second. That's about 5.4e-20 seconds, or 54 zeptoseconds? 
>>
>> Can you walk me through a calculation where you believe that level of
>> precision is insufficient?
>>
>> I guess the precision matters if the structure isn't updated for a long
>> period of time, and the delta between the current counter and the
>> snapshot is high? That's a *lot* of 54 zeptosecondses? But you really
>> would need a *lot* of them before you care? And if nobody's been
>> calibrating your counter for that long, surely you have bigger worries?
>>
>> Am I missing something there?
> 
> Hm, that was a bit rushed at the end of the day; let's take a better look...
> 
> Let's take a hypothetical example of a 100GHz counter. That's two
> orders of magnitude more than today's Arm arch counter.
> 
> The period of such a counter would be 10 picoseconds. 
> 
> (Let's ignore the question of how far light actually travels in that
> time and how *realistic* that example is, for the moment.)
> 
> It turns out that at that rate, there *are* a lot of 54 zeptosecondses
> of precision loss in the day. It could be half a millisecond a day, or
> 20µs an hour.
> 
> That particular example of 10 picoseconds is 184467440.7370955
> (seconds>>64) which could be truncated to 184467440 — losing about 4PPB
> (a third of a millisecond a day; 14µs an hour).
> 
> So yeah, I suppose a 'shift' field could make sense. It's easy enough
> to consume on the guest side as it doesn't really perturb the 128-bit
> multiplication very much; especially if we don't let it be negative.
> 
> And implementations *can* just set it to zero. It hurts nobody.
> 
> Or were you thinking of just using a fixed shift like (seconds>>80)
> instead?

The 'shift' field should be fine.