Message ID | 20231218073849.35294-1-peter.hilber@opensynergy.com (mailing list archive) |
---|---|
Headers | show |
Series | Add virtio_rtc module and related changes | expand |
On Mon, 2023-12-18 at 08:38 +0100, Peter Hilber wrote: > RFC v3 updates > -------------- > > This series implements a driver for a virtio-rtc device conforming to spec > RFC v3 [1]. It now includes an RTC class driver with alarm, in addition to > the PTP clock driver already present before. > > This patch series depends on the patch series "treewide: Use clocksource id > for get_device_system_crosststamp()" [3]. Pull [4] to get the combined > series on top of mainline. > > Overview > -------- > > This patch series adds the virtio_rtc module, and related bugfixes. The > virtio_rtc module implements a driver compatible with the proposed Virtio > RTC device specification [1]. The Virtio RTC (Real Time Clock) device > provides information about current time. The device can provide different > clocks, e.g. for the UTC or TAI time standards, or for physical time > elapsed since some past epoch. Hm, should we allow UTC? If you tell me the time in UTC, then (sometimes) I still don't actually know what the time is, because some UTC seconds occur twice. UTC only makes sense if you provide the TAI offset, surely? Should the virtio_rtc specification make it mandatory to provide such? Otherwise you're just designing it to allow crappy hypervisors to expose incomplete information. > PTP clock interface > ------------------- > > virtio_rtc exposes clocks as PTP clocks to userspace, similar to ptp_kvm. > If both the Virtio RTC device and this driver have special support for the > current clocksource, time synchronization programs can use > cross-timestamping using ioctl PTP_SYS_OFFSET_PRECISE2 aka > PTP_SYS_OFFSET_PRECISE. Similar to ptp_kvm, system time synchronization > with single-digit ns precision is possible with a quiescent reference clock > (from the Virtio RTC device). This works even when the Virtio device > response is slow compared to ptp_kvm hypercalls. Is PTP the right mechanism for this? As I understand it, PTP is a way to precisely synchronize one clock with another. But in the case of virt guests synchronizing against the host, it isn't really *another* clock. It really is the *same* underlying clock. As the host clock varies with temperature, for example, so does the guest clock. The only difference is an offset and (on x86 perhaps) a mathematical scaling of the frequency. I was looking at this another way, when I came across this virtio-rtc work. My idea was just for the hypervisor to expose its own timekeeping information — the counter/TSC value and TAI time at a given moment, frequency of the counter, and the precision of both that frequency (±PPM) and the TAI timestamp (±µs). By putting that in a host/guest shared data structure with a seqcount for lockless updates, we can update it as time synchronization on the host is refined, and we can even cleanly handle live migration where the guest ends up on a completely different host. It allows for use cases which *really* care (e.g. timestamping financial transactions) to ensure that there is never even a moment of getting *wrong* timestamps if they haven't yet resynced after a migration. Now I'm trying to work out if I should attempt to reconcile with your existing virtio-rtc work, or just decide that virtio-rtc isn't trying to solve the actual problem that we have, and go ahead with something different... ?
On 07.03.24 15:02, David Woodhouse wrote: > On Mon, 2023-12-18 at 08:38 +0100, Peter Hilber wrote: >> RFC v3 updates >> -------------- >> >> This series implements a driver for a virtio-rtc device conforming to spec >> RFC v3 [1]. It now includes an RTC class driver with alarm, in addition to >> the PTP clock driver already present before. >> >> This patch series depends on the patch series "treewide: Use clocksource id >> for get_device_system_crosststamp()" [3]. Pull [4] to get the combined >> series on top of mainline. >> >> Overview >> -------- >> >> This patch series adds the virtio_rtc module, and related bugfixes. The >> virtio_rtc module implements a driver compatible with the proposed Virtio >> RTC device specification [1]. The Virtio RTC (Real Time Clock) device >> provides information about current time. The device can provide different >> clocks, e.g. for the UTC or TAI time standards, or for physical time >> elapsed since some past epoch. > > Hm, should we allow UTC? If you tell me the time in UTC, then > (sometimes) I still don't actually know what the time is, because some > UTC seconds occur twice. UTC only makes sense if you provide the TAI > offset, surely? Should the virtio_rtc specification make it mandatory > to provide such? > > Otherwise you're just designing it to allow crappy hypervisors to > expose incomplete information. > Hi David, (adding virtio-comment@lists.oasis-open.org for spec discussion), thank you for your insightful comments. I think I take a broadly similar view. The reason why the current spec and driver is like this is that I took a pragmatic approach at first and only included features which work out-of-the-box for the current Linux ecosystem. The current virtio_rtc features work similar to ptp_kvm, and therefore can work out-of-the-box with time sync daemons such as chrony. As of RFC spec v3, UTC clock only is allowed. If mandating a TAI clock as well, I am afraid that - in some (embedded) scenarios, the TAI clock may not be available - crappy hypervisors will pass off the UTC clock as the TAI clock. For the same reasons, I am also not sure about adding a *mandatory* TAI offset to each readout. I don't know user-space software which would leverage this already (at least not through the PTP clock interface). And why would such software not go straight for the TAI clock instead? How about adding a requirement to the spec that the virtio-rtc device SHOULD expose the TAI clock whenever it is available - would this address your concerns? >> PTP clock interface >> ------------------- >> >> virtio_rtc exposes clocks as PTP clocks to userspace, similar to ptp_kvm. >> If both the Virtio RTC device and this driver have special support for the >> current clocksource, time synchronization programs can use >> cross-timestamping using ioctl PTP_SYS_OFFSET_PRECISE2 aka >> PTP_SYS_OFFSET_PRECISE. Similar to ptp_kvm, system time synchronization >> with single-digit ns precision is possible with a quiescent reference clock >> (from the Virtio RTC device). This works even when the Virtio device >> response is slow compared to ptp_kvm hypercalls. > > Is PTP the right mechanism for this? As I understand it, PTP is a way > to precisely synchronize one clock with another. But in the case of > virt guests synchronizing against the host, it isn't really *another* > clock. It really is the *same* underlying clock. As the host clock > varies with temperature, for example, so does the guest clock. The only > difference is an offset and (on x86 perhaps) a mathematical scaling of > the frequency. > > I was looking at this another way, when I came across this virtio-rtc > work. > > My idea was just for the hypervisor to expose its own timekeeping > information — the counter/TSC value and TAI time at a given moment, > frequency of the counter, and the precision of both that frequency > (±PPM) and the TAI timestamp (±µs). > > By putting that in a host/guest shared data structure with a seqcount > for lockless updates, we can update it as time synchronization on the > host is refined, and we can even cleanly handle live migration where > the guest ends up on a completely different host. It allows for use > cases which *really* care (e.g. timestamping financial transactions) to > ensure that there is never even a moment of getting *wrong* timestamps > if they haven't yet resynced after a migration. I considered a similar approach as well, but integrating that with the kernel timekeeping seemed too much effort for the first step. However, reading the clock from user space would be much simpler. > > Now I'm trying to work out if I should attempt to reconcile with your > existing virtio-rtc work, or just decide that virtio-rtc isn't trying > to solve the actual problem that we have, and go ahead with something > different... ? > We are certainly interested into the discussed, say, "virtual timekeeper" mechanism, which would also solve a lot of problems for us (especially if it would be integrated with kernel timekeeping). Even without Linux kernel timekeeping, the virtual timekeeper would be useful to us for guests with simpler timekeeping, and potentially for user space applications. Our current intent is to at first try to upstream the current (RFC spec v3) feature set. I think the virtual timekeeper would be suitable as an optional feature of virtio_rtc (with Virtio, this could easily be added after initial upstreaming). It is also possible to have a virtio-rtc device only implement the virtual timekeeper, but not the other clock reading methods, if these are of no interest. Best regards, Peter
On Fri, 2024-03-08 at 11:32 +0100, Peter Hilber wrote: > On 07.03.24 15:02, David Woodhouse wrote: > > On Mon, 2023-12-18 at 08:38 +0100, Peter Hilber wrote: > > > RFC v3 updates > > > -------------- > > > > > > This series implements a driver for a virtio-rtc device conforming to spec > > > RFC v3 [1]. It now includes an RTC class driver with alarm, in addition to > > > the PTP clock driver already present before. > > > > > > This patch series depends on the patch series "treewide: Use clocksource id > > > for get_device_system_crosststamp()" [3]. Pull [4] to get the combined > > > series on top of mainline. > > > > > > Overview > > > -------- > > > > > > This patch series adds the virtio_rtc module, and related bugfixes. The > > > virtio_rtc module implements a driver compatible with the proposed Virtio > > > RTC device specification [1]. The Virtio RTC (Real Time Clock) device > > > provides information about current time. The device can provide different > > > clocks, e.g. for the UTC or TAI time standards, or for physical time > > > elapsed since some past epoch. > > > > Hm, should we allow UTC? If you tell me the time in UTC, then > > (sometimes) I still don't actually know what the time is, because some > > UTC seconds occur twice. UTC only makes sense if you provide the TAI > > offset, surely? Should the virtio_rtc specification make it mandatory > > to provide such? > > > > Otherwise you're just designing it to allow crappy hypervisors to > > expose incomplete information. > > > > Hi David, > > (adding virtio-comment@lists.oasis-open.org for spec discussion), > > thank you for your insightful comments. I think I take a broadly similar > view. The reason why the current spec and driver is like this is that I > took a pragmatic approach at first and only included features which work > out-of-the-box for the current Linux ecosystem. > > The current virtio_rtc features work similar to ptp_kvm, and therefore can > work out-of-the-box with time sync daemons such as chrony. > > As of RFC spec v3, UTC clock only is allowed. If mandating a TAI clock as > well, I am afraid that > > - in some (embedded) scenarios, the TAI clock may not be available > > - crappy hypervisors will pass off the UTC clock as the TAI clock. > > For the same reasons, I am also not sure about adding a *mandatory* TAI > offset to each readout. I don't know user-space software which would > leverage this already (at least not through the PTP clock interface). And > why would such software not go straight for the TAI clock instead? > > How about adding a requirement to the spec that the virtio-rtc device > SHOULD expose the TAI clock whenever it is available - would this address > your concerns? I think that would be too easy for implementors to miss, or decide not to obey. Or to get *wrong*, by exposing a TAI clock but actually putting UTC in it. I think I prefer to mandate the tai_offset field with the UTC clock. Crappy implementations will just set it to zero, but at least that gives a clear signal to the guests that it's *their* problem to resolve. > > > PTP clock interface > > > ------------------- > > > > > > virtio_rtc exposes clocks as PTP clocks to userspace, similar to ptp_kvm. > > > If both the Virtio RTC device and this driver have special support for the > > > current clocksource, time synchronization programs can use > > > cross-timestamping using ioctl PTP_SYS_OFFSET_PRECISE2 aka > > > PTP_SYS_OFFSET_PRECISE. Similar to ptp_kvm, system time synchronization > > > with single-digit ns precision is possible with a quiescent reference clock > > > (from the Virtio RTC device). This works even when the Virtio device > > > response is slow compared to ptp_kvm hypercalls. > > > > Is PTP the right mechanism for this? As I understand it, PTP is a way > > to precisely synchronize one clock with another. But in the case of > > virt guests synchronizing against the host, it isn't really *another* > > clock. It really is the *same* underlying clock. As the host clock > > varies with temperature, for example, so does the guest clock. The only > > difference is an offset and (on x86 perhaps) a mathematical scaling of > > the frequency. > > > > I was looking at this another way, when I came across this virtio-rtc > > work. > > > > My idea was just for the hypervisor to expose its own timekeeping > > information — the counter/TSC value and TAI time at a given moment, > > frequency of the counter, and the precision of both that frequency > > (±PPM) and the TAI timestamp (±µs). > > > > By putting that in a host/guest shared data structure with a seqcount > > for lockless updates, we can update it as time synchronization on the > > host is refined, and we can even cleanly handle live migration where > > the guest ends up on a completely different host. It allows for use > > cases which *really* care (e.g. timestamping financial transactions) to > > ensure that there is never even a moment of getting *wrong* timestamps > > if they haven't yet resynced after a migration. > > I considered a similar approach as well, but integrating that with the > kernel timekeeping seemed too much effort for the first step. However, > reading the clock from user space would be much simpler. Right. In fact my *first* use case was userspace, specifically in the context of https://github.com/aws/clock-bound — but anything we design for this absolutely has to be usable for kernel timekeeping too. It's also critical to solve the Live Migration problem. But is it so hard to integrate into the kernel timekeeping? My plan would have given us effectively an infinite number of cross-reads of the realtime clock vs. TSC. You don't have to actually read from a virtio device; you just read the TSC and do the maths, using the values in the shared memory region. Couldn't that be used to present a PTP device to the guest kernel just the same as you do at the moment? You could probably even simulate PPS with it. Typically with PPS we have to catch the hardware interrupt and then read the TSC as soon as possible thereafter. With this, you'd be able to *calculate* the TSC value at the start of the next second, and wouldn't have to suffer the real hardware latency :) > > > > Now I'm trying to work out if I should attempt to reconcile with your > > existing virtio-rtc work, or just decide that virtio-rtc isn't trying > > to solve the actual problem that we have, and go ahead with something > > different... ? > > > > We are certainly interested into the discussed, say, "virtual timekeeper" > mechanism, which would also solve a lot of problems for us (especially if > it would be integrated with kernel timekeeping). Even without Linux kernel > timekeeping, the virtual timekeeper would be useful to us for guests with > simpler timekeeping, and potentially for user space applications. > > Our current intent is to at first try to upstream the current (RFC spec v3) > feature set. I think the virtual timekeeper would be suitable as an > optional feature of virtio_rtc (with Virtio, this could easily be added > after initial upstreaming). It is also possible to have a virtio-rtc device > only implement the virtual timekeeper, but not the other clock reading > methods, if these are of no interest. Yeah, that might make sense. I was thinking of a simple ACPI/DT device exposing a page of memory and *maybe* an interrupt for when an update happens. (With the caveat that the interrupt would always occur too late by definition, so it's no substitute for using the seqlock correctly in applications that *really* care and are going to get fined millions of dollars for mis-timestamping their transactions.) But using the virtio-rtc device as the vehicle for that shared memory page is reasonable too. It's not even mutually exclusive; we could expose the *same* data structure in memory via whatever mechanisms we wanted. One other thing to note is I think we're being very naïve about the TSC on x86 hosts. Theoretically, the TSC for every vCPU might run at a different frequency, and even if they run at the same frequency they might be offset from each other. I'm happy to be naïve but I think we should be *explicitly* so, and just say for example that it's defined against vCPU0 so if other vCPUs are different then all bets are off. We *can* cope with TSC frequencies changing. Fundamentally, that's the whole *point*; NTP calibrates itself as the underlying frequency does change due to temperature changes, etc. — so a deliberate frequency scaling, or even a live migration, are just a slightly special case of the same thing. One thing I have added to the memory region is a migration counter. In the ideal case, guests will be happy just to use the hypervisor's synchronization. But in some cases the guests may want to do NTP (or PPS, PTP or something else) for themselves, to have more precise timekeeping than the host. Even if the host is advertising itself as stratum 16, the guest still needs to know of *migration*, because it has to consider itself unsynchronized when that happens.
On 08.03.24 13:33, David Woodhouse wrote: > On Fri, 2024-03-08 at 11:32 +0100, Peter Hilber wrote: >> On 07.03.24 15:02, David Woodhouse wrote: >>> On Mon, 2023-12-18 at 08:38 +0100, Peter Hilber wrote: >>>> RFC v3 updates >>>> -------------- >>>> >>>> This series implements a driver for a virtio-rtc device conforming to >>>> spec >>>> RFC v3 [1]. It now includes an RTC class driver with alarm, in >>>> addition to >>>> the PTP clock driver already present before. >>>> >>>> This patch series depends on the patch series "treewide: Use >>>> clocksource id >>>> for get_device_system_crosststamp()" [3]. Pull [4] to get the combined >>>> series on top of mainline. >>>> >>>> Overview >>>> -------- >>>> >>>> This patch series adds the virtio_rtc module, and related bugfixes. >>>> The >>>> virtio_rtc module implements a driver compatible with the proposed >>>> Virtio >>>> RTC device specification [1]. The Virtio RTC (Real Time Clock) device >>>> provides information about current time. The device can provide >>>> different >>>> clocks, e.g. for the UTC or TAI time standards, or for physical time >>>> elapsed since some past epoch. >>> >>> Hm, should we allow UTC? If you tell me the time in UTC, then >>> (sometimes) I still don't actually know what the time is, because some >>> UTC seconds occur twice. UTC only makes sense if you provide the TAI >>> offset, surely? Should the virtio_rtc specification make it mandatory >>> to provide such? >>> >>> Otherwise you're just designing it to allow crappy hypervisors to >>> expose incomplete information. >>> >> >> Hi David, >> >> (adding virtio-comment@lists.oasis-open.org for spec discussion), >> >> thank you for your insightful comments. I think I take a broadly similar >> view. The reason why the current spec and driver is like this is that I >> took a pragmatic approach at first and only included features which work >> out-of-the-box for the current Linux ecosystem. >> >> The current virtio_rtc features work similar to ptp_kvm, and therefore >> can >> work out-of-the-box with time sync daemons such as chrony. >> >> As of RFC spec v3, UTC clock only is allowed. If mandating a TAI clock >> as >> well, I am afraid that >> >> - in some (embedded) scenarios, the TAI clock may not be available >> >> - crappy hypervisors will pass off the UTC clock as the TAI clock. >> >> For the same reasons, I am also not sure about adding a *mandatory* TAI >> offset to each readout. I don't know user-space software which would >> leverage this already (at least not through the PTP clock interface). >> And >> why would such software not go straight for the TAI clock instead? >> >> How about adding a requirement to the spec that the virtio-rtc device >> SHOULD expose the TAI clock whenever it is available - would this >> address >> your concerns? > > I think that would be too easy for implementors to miss, or decide not > to obey. Or to get *wrong*, by exposing a TAI clock but actually > putting UTC in it. > > I think I prefer to mandate the tai_offset field with the UTC clock. > Crappy implementations will just set it to zero, but at least that > gives a clear signal to the guests that it's *their* problem to > resolve. To me there are some open questions regarding how this would work. Is there a use case for this with the v3 clock reading methods, or would it be enough to address this with the Virtio timekeeper? Looking at clock_adjtime(2), the tai_offset could be exposed, but probably best alongside some additional information about leap seconds. I am not aware about any user-space user. In addition, leap second smearing should also be addressed. > > > > >>>> PTP clock interface >>>> ------------------- >>>> >>>> virtio_rtc exposes clocks as PTP clocks to userspace, similar to >>>> ptp_kvm. >>>> If both the Virtio RTC device and this driver have special support for >>>> the >>>> current clocksource, time synchronization programs can use >>>> cross-timestamping using ioctl PTP_SYS_OFFSET_PRECISE2 aka >>>> PTP_SYS_OFFSET_PRECISE. Similar to ptp_kvm, system time >>>> synchronization >>>> with single-digit ns precision is possible with a quiescent reference >>>> clock >>>> (from the Virtio RTC device). This works even when the Virtio device >>>> response is slow compared to ptp_kvm hypercalls. >>> >>> Is PTP the right mechanism for this? As I understand it, PTP is a way >>> to precisely synchronize one clock with another. But in the case of >>> virt guests synchronizing against the host, it isn't really *another* >>> clock. It really is the *same* underlying clock. As the host clock >>> varies with temperature, for example, so does the guest clock. The only >>> difference is an offset and (on x86 perhaps) a mathematical scaling of >>> the frequency. >>> >>> I was looking at this another way, when I came across this virtio-rtc >>> work. >>> >>> My idea was just for the hypervisor to expose its own timekeeping >>> information — the counter/TSC value and TAI time at a given moment, >>> frequency of the counter, and the precision of both that frequency >>> (±PPM) and the TAI timestamp (±µs). >>> >>> By putting that in a host/guest shared data structure with a seqcount >>> for lockless updates, we can update it as time synchronization on the >>> host is refined, and we can even cleanly handle live migration where >>> the guest ends up on a completely different host. It allows for use >>> cases which *really* care (e.g. timestamping financial transactions) to >>> ensure that there is never even a moment of getting *wrong* timestamps >>> if they haven't yet resynced after a migration. >> >> I considered a similar approach as well, but integrating that with the >> kernel timekeeping seemed too much effort for the first step. However, >> reading the clock from user space would be much simpler. > > Right. In fact my *first* use case was userspace, specifically in the > context of https://github.com/aws/clock-bound — but anything we design > for this absolutely has to be usable for kernel timekeeping too. > > It's also critical to solve the Live Migration problem. > > But is it so hard to integrate into the kernel timekeeping? My plan > would have given us effectively an infinite number of cross-reads of > the realtime clock vs. TSC. You don't have to actually read from a > virtio device; you just read the TSC and do the maths, using the values > in the shared memory region. Couldn't that be used to present a PTP > device to the guest kernel just the same as you do at the moment? Yes, and it would also decrease the clock reading overhead (saving at least the Virtio response interrupt and associated scheduling). Would make sense to me. To be clear, with "kernel timekeeping integration" I meant to have timekeeping.c derive the time directly from the Virtio timekeeper. > > You could probably even simulate PPS with it. Typically with PPS we > have to catch the hardware interrupt and then read the TSC as soon as > possible thereafter. With this, you'd be able to *calculate* the TSC > value at the start of the next second, and wouldn't have to suffer the > real hardware latency :) > >>> >>> Now I'm trying to work out if I should attempt to reconcile with your >>> existing virtio-rtc work, or just decide that virtio-rtc isn't trying >>> to solve the actual problem that we have, and go ahead with something >>> different... ? >>> >> >> We are certainly interested into the discussed, say, "virtual >> timekeeper" >> mechanism, which would also solve a lot of problems for us (especially >> if >> it would be integrated with kernel timekeeping). Even without Linux >> kernel >> timekeeping, the virtual timekeeper would be useful to us for guests >> with >> simpler timekeeping, and potentially for user space applications. >> >> Our current intent is to at first try to upstream the current (RFC spec >> v3) >> feature set. I think the virtual timekeeper would be suitable as an >> optional feature of virtio_rtc (with Virtio, this could easily be added >> after initial upstreaming). It is also possible to have a virtio-rtc >> device >> only implement the virtual timekeeper, but not the other clock reading >> methods, if these are of no interest. > > Yeah, that might make sense. I was thinking of a simple ACPI/DT device > exposing a page of memory and *maybe* an interrupt for when an update > happens. (With the caveat that the interrupt would always occur too > late by definition, so it's no substitute for using the seqlock > correctly in applications that *really* care and are going to get fined > millions of dollars for mis-timestamping their transactions.) > > But using the virtio-rtc device as the vehicle for that shared memory > page is reasonable too. It's not even mutually exclusive; we could > expose the *same* data structure in memory via whatever mechanisms we > wanted. > > One other thing to note is I think we're being very naïve about the TSC > on x86 hosts. Theoretically, the TSC for every vCPU might run at a > different frequency, and even if they run at the same frequency they > might be offset from each other. I'm happy to be naïve but I think we > should be *explicitly* so, and just say for example that it's defined > against vCPU0 so if other vCPUs are different then all bets are off. ATM Virtio has no notion of vCPUs, or vCPU topology. So I wonder if you have an opinion on how to represent this in a platform-independent way. Thank you for the comments, Peter > > We *can* cope with TSC frequencies changing. Fundamentally, that's the > whole *point*; NTP calibrates itself as the underlying frequency does > change due to temperature changes, etc. — so a deliberate frequency > scaling, or even a live migration, are just a slightly special case of > the same thing. > > One thing I have added to the memory region is a migration counter. In > the ideal case, guests will be happy just to use the hypervisor's > synchronization. But in some cases the guests may want to do NTP (or > PPS, PTP or something else) for themselves, to have more precise > timekeeping than the host. Even if the host is advertising itself as > stratum 16, the guest still needs to know of *migration*, because it > has to consider itself unsynchronized when that happens
On Mon, 2024-03-11 at 19:24 +0100, Peter Hilber wrote: > On 08.03.24 13:33, David Woodhouse wrote: > > On Fri, 2024-03-08 at 11:32 +0100, Peter Hilber wrote: > > > On 07.03.24 15:02, David Woodhouse wrote: > > > > Hm, should we allow UTC? If you tell me the time in UTC, then > > > > (sometimes) I still don't actually know what the time is, because some > > > > UTC seconds occur twice. UTC only makes sense if you provide the TAI > > > > offset, surely? Should the virtio_rtc specification make it mandatory > > > > to provide such? > > > > > > > > Otherwise you're just designing it to allow crappy hypervisors to > > > > expose incomplete information. > > > > > > > > > > Hi David, > > > > > > (adding virtio-comment@lists.oasis-open.org for spec discussion), > > > > > > thank you for your insightful comments. I think I take a broadly similar > > > view. The reason why the current spec and driver is like this is that I > > > took a pragmatic approach at first and only included features which work > > > out-of-the-box for the current Linux ecosystem. > > > > > > The current virtio_rtc features work similar to ptp_kvm, and therefore > > > can work out-of-the-box with time sync daemons such as chrony. > > > > > > As of RFC spec v3, UTC clock only is allowed. If mandating a TAI clock > > > as well, I am afraid that > > > > > > - in some (embedded) scenarios, the TAI clock may not be available > > > > > > - crappy hypervisors will pass off the UTC clock as the TAI clock. > > > > > > For the same reasons, I am also not sure about adding a *mandatory* TAI > > > offset to each readout. I don't know user-space software which would > > > leverage this already (at least not through the PTP clock interface). > > > And why would such software not go straight for the TAI clock instead? > > > > > > How about adding a requirement to the spec that the virtio-rtc device > > > SHOULD expose the TAI clock whenever it is available - would this > > > address your concerns? > > > > I think that would be too easy for implementors to miss, or decide not > > to obey. Or to get *wrong*, by exposing a TAI clock but actually > > putting UTC in it. > > > > I think I prefer to mandate the tai_offset field with the UTC clock. > > Crappy implementations will just set it to zero, but at least that > > gives a clear signal to the guests that it's *their* problem to > > resolve. > > To me there are some open questions regarding how this would work. Is there > a use case for this with the v3 clock reading methods, or would it be > enough to address this with the Virtio timekeeper? > > Looking at clock_adjtime(2), the tai_offset could be exposed, but probably > best alongside some additional information about leap seconds. I am not > aware about any user-space user. In addition, leap second smearing should > also be addressed. > Is there even a standard yet for leap-smearing? Will it be linear over 1000 seconds like UTC-SLS? Or semi-raised-cosine over 24 hours, which I think is what Google does? Meta does something different again, don't they? Exposing UTC as the only clock reference is bad enough; when leap seconds happen there's a whole second during which you don't *know* which second it is. It seems odd to me, for a precision clock to be deliberately ambiguous about what the time is! But if the virtio-rtc clock is defined as UTC and then expose something *different* in it, that's even worse. You potentially end up providing inaccurate time for a whole *day* leading up to the leap second. I think you're right that leap second smearing should be addressed. At the very least, by making it clear that the virtio-rtc clock which advertises UTC shall be used *only* for UTC, never UTC-SLS or any other yet-to-be-defined variant. Please make it explicit that any hypervisor which wants to advertise a smeared clock shall define a new type which specifies the precise smearing algorithm and cannot be conflated with the one you're defining here. > > One other thing to note is I think we're being very naïve about the TSC > > on x86 hosts. Theoretically, the TSC for every vCPU might run at a > > different frequency, and even if they run at the same frequency they > > might be offset from each other. I'm happy to be naïve but I think we > > should be *explicitly* so, and just say for example that it's defined > > against vCPU0 so if other vCPUs are different then all bets are off. > > ATM Virtio has no notion of vCPUs, or vCPU topology. So I wonder if you > have an opinion on how to represent this in a platform-independent way. Well, it doesn't have a notion of TSCs either; you include that by implicit reference don't you?
On 12.03.24 18:15, David Woodhouse wrote: > On Mon, 2024-03-11 at 19:24 +0100, Peter Hilber wrote: >> On 08.03.24 13:33, David Woodhouse wrote: >>> On Fri, 2024-03-08 at 11:32 +0100, Peter Hilber wrote: >>>> On 07.03.24 15:02, David Woodhouse wrote: >>>>> Hm, should we allow UTC? If you tell me the time in UTC, then >>>>> (sometimes) I still don't actually know what the time is, because some >>>>> UTC seconds occur twice. UTC only makes sense if you provide the TAI >>>>> offset, surely? Should the virtio_rtc specification make it mandatory >>>>> to provide such? >>>>> >>>>> Otherwise you're just designing it to allow crappy hypervisors to >>>>> expose incomplete information. >>>>> >>>> >>>> Hi David, >>>> >>>> (adding virtio-comment@lists.oasis-open.org for spec discussion), >>>> >>>> thank you for your insightful comments. I think I take a broadly similar >>>> view. The reason why the current spec and driver is like this is that I >>>> took a pragmatic approach at first and only included features which work >>>> out-of-the-box for the current Linux ecosystem. >>>> >>>> The current virtio_rtc features work similar to ptp_kvm, and therefore >>>> can work out-of-the-box with time sync daemons such as chrony. >>>> >>>> As of RFC spec v3, UTC clock only is allowed. If mandating a TAI clock >>>> as well, I am afraid that >>>> >>>> - in some (embedded) scenarios, the TAI clock may not be available >>>> >>>> - crappy hypervisors will pass off the UTC clock as the TAI clock. >>>> >>>> For the same reasons, I am also not sure about adding a *mandatory* TAI >>>> offset to each readout. I don't know user-space software which would >>>> leverage this already (at least not through the PTP clock interface). >>>> And why would such software not go straight for the TAI clock instead? >>>> >>>> How about adding a requirement to the spec that the virtio-rtc device >>>> SHOULD expose the TAI clock whenever it is available - would this >>>> address your concerns? >>> >>> I think that would be too easy for implementors to miss, or decide not >>> to obey. Or to get *wrong*, by exposing a TAI clock but actually >>> putting UTC in it. >>> >>> I think I prefer to mandate the tai_offset field with the UTC clock. >>> Crappy implementations will just set it to zero, but at least that >>> gives a clear signal to the guests that it's *their* problem to >>> resolve. >> >> To me there are some open questions regarding how this would work. Is there >> a use case for this with the v3 clock reading methods, or would it be >> enough to address this with the Virtio timekeeper? >> >> Looking at clock_adjtime(2), the tai_offset could be exposed, but probably >> best alongside some additional information about leap seconds. I am not >> aware about any user-space user. In addition, leap second smearing should >> also be addressed. >> > > Is there even a standard yet for leap-smearing? Will it be linear over > 1000 seconds like UTC-SLS? Or semi-raised-cosine over 24 hours, which I > think is what Google does? Meta does something different again, don't > they? > > Exposing UTC as the only clock reference is bad enough; when leap > seconds happen there's a whole second during which you don't *know* > which second it is. It seems odd to me, for a precision clock to be > deliberately ambiguous about what the time is! Just to be clear, the device can perfectly expose only a TAI reference clock (or both UTC and TAI), the spec is just completely open about this, as it tries to work for diverse use cases. > > But if the virtio-rtc clock is defined as UTC and then expose something > *different* in it, that's even worse. You potentially end up providing > inaccurate time for a whole *day* leading up to the leap second. > > I think you're right that leap second smearing should be addressed. At > the very least, by making it clear that the virtio-rtc clock which > advertises UTC shall be used *only* for UTC, never UTC-SLS or any other > yet-to-be-defined variant. > Agreed. > Please make it explicit that any hypervisor which wants to advertise a > smeared clock shall define a new type which specifies the precise > smearing algorithm and cannot be conflated with the one you're defining > here. > I will add a requirement that the UTC clock can never have smeared/smoothed leap seconds. I think that not every vendor would bother to first add a definition of a smearing algorithm. Also, I think in some cases knowing the precise smearing algorithm might not be important (when having the same time as the hypervisor is enough and accuracy w.r.t. actual time is less important). So maybe I should add a VIRTIO_RTC_CLOCK_UTC_SMEARED clock type, which for now could catch every UTC-like clock which smears/smoothes leap seconds, where the vendor cannot be bothered to add the smearing algorithm to spec and implementations. As for UTC-SLS, this *could* also be added, although [1] says It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress." [1] https://datatracker.ietf.org/doc/html/draft-kuhn-leapsecond-00 >>> One other thing to note is I think we're being very naïve about the TSC >>> on x86 hosts. Theoretically, the TSC for every vCPU might run at a >>> different frequency, and even if they run at the same frequency they >>> might be offset from each other. I'm happy to be naïve but I think we >>> should be *explicitly* so, and just say for example that it's defined >>> against vCPU0 so if other vCPUs are different then all bets are off. >> >> ATM Virtio has no notion of vCPUs, or vCPU topology. So I wonder if you >> have an opinion on how to represent this in a platform-independent way. > > Well, it doesn't have a notion of TSCs either; you include that by > implicit reference don't you? I think I can add a SHOULD requirement which vaguely refers to vCPU 0, or boot vCPU. But the Virtio device is not necessarily hosted by a hypervisor, so the device might not even know which vCPUs there are. E.g. there is even interest to make virtio-rtc work as part of the virtio-net device (which might be implemented in hardware). Thanks for the comments, Peter
On 13/03/2024 10:45:54+0100, Peter Hilber wrote: > > Exposing UTC as the only clock reference is bad enough; when leap > > seconds happen there's a whole second during which you don't *know* > > which second it is. It seems odd to me, for a precision clock to be > > deliberately ambiguous about what the time is! > > Just to be clear, the device can perfectly expose only a TAI reference > clock (or both UTC and TAI), the spec is just completely open about this, > as it tries to work for diverse use cases. > > > > > But if the virtio-rtc clock is defined as UTC and then expose something > > *different* in it, that's even worse. You potentially end up providing > > inaccurate time for a whole *day* leading up to the leap second. > > > > I think you're right that leap second smearing should be addressed. At > > the very least, by making it clear that the virtio-rtc clock which > > advertises UTC shall be used *only* for UTC, never UTC-SLS or any other > > yet-to-be-defined variant. > > > > Agreed. > > > Please make it explicit that any hypervisor which wants to advertise a > > smeared clock shall define a new type which specifies the precise > > smearing algorithm and cannot be conflated with the one you're defining > > here. > > > > I will add a requirement that the UTC clock can never have smeared/smoothed > leap seconds. > > I think that not every vendor would bother to first add a definition of a > smearing algorithm. Also, I think in some cases knowing the precise > smearing algorithm might not be important (when having the same time as the > hypervisor is enough and accuracy w.r.t. actual time is less important). > > So maybe I should add a VIRTIO_RTC_CLOCK_UTC_SMEARED clock type, which for > now could catch every UTC-like clock which smears/smoothes leap seconds, > where the vendor cannot be bothered to add the smearing algorithm to spec > and implementations. > I still don't know anything about virtio but under Linux, an RTC is always UTC (or localtime when dual booting but let's not care) and never accounts for leap seconds. Having an RTC and RTC driver behaving differently would be super inconvenient. Why don't you leave this to userspace? I guess I'm still questioning whether this is the correct interface to expose the host system time instead of an actual RTC.
On Wed, 2024-03-13 at 12:18 +0100, Alexandre Belloni wrote: > > I still don't know anything about virtio but under Linux, an RTC is > always UTC (or localtime when dual booting but let's not care) and never > accounts for leap seconds. Having an RTC and RTC driver behaving > differently would be super inconvenient. Why don't you leave this to > userspace? Well yes, we don't need to expose *anything* from the hypervisor and we can leave it all to guest userspace. We can run NTP on every single one of *hundreds* of guests, leaving them all to duplicate the work of calibrating the *same* underlying oscillator. I thought we were trying to avoid that, by having the hypervisor tell them what the time was. If we're going to do that, we need it to be sufficiently precise (and some clients want to *know* the precision), and above all we need it to be *unambiguous*. If the hypervisor says that the time is 3692217600.001, then the guest doesn't actually know *which* 3692217600.001 it is, and thus it still doesn't know the time to an accuracy better than 1 second. And if we start allowing the hypervisor to smear clocks in some other underspecified ways, then we end up with errors of up to 1 second in the clock for long periods of time *around* the leap second. We need to avoid that ambiguity. > I guess I'm still questioning whether this is the correct interface to > expose the host system time instead of an actual RTC. If an RTC device is able to report '23:59:60' as the time of day, I suppose that *could* resolve the ambiguity. But talking to a device is slow; we want guests to be able to know the time — accurately — with a simple counter/tsc read and some arithmetic. Which means *paired* reads of 'RTC' and the counter, and a precise indication of the counter frequency.
On Wed, 2024-03-13 at 10:45 +0100, Peter Hilber wrote: > On 12.03.24 18:15, David Woodhouse wrote: > > On Mon, 2024-03-11 at 19:24 +0100, Peter Hilber wrote: > > > On 08.03.24 13:33, David Woodhouse wrote: > > > > On Fri, 2024-03-08 at 11:32 +0100, Peter Hilber wrote: > > > > > On 07.03.24 15:02, David Woodhouse wrote: > > > > > > Hm, should we allow UTC? If you tell me the time in UTC, then > > > > > > (sometimes) I still don't actually know what the time is, because some > > > > > > UTC seconds occur twice. UTC only makes sense if you provide the TAI > > > > > > offset, surely? Should the virtio_rtc specification make it mandatory > > > > > > to provide such? > > > > > > > > > > > > Otherwise you're just designing it to allow crappy hypervisors to > > > > > > expose incomplete information. > > > > > > > > > > > > > > > > Hi David, > > > > > > > > > > (adding virtio-comment@lists.oasis-open.org for spec discussion), > > > > > > > > > > thank you for your insightful comments. I think I take a broadly similar > > > > > view. The reason why the current spec and driver is like this is that I > > > > > took a pragmatic approach at first and only included features which work > > > > > out-of-the-box for the current Linux ecosystem. > > > > > > > > > > The current virtio_rtc features work similar to ptp_kvm, and therefore > > > > > can work out-of-the-box with time sync daemons such as chrony. > > > > > > > > > > As of RFC spec v3, UTC clock only is allowed. If mandating a TAI clock > > > > > as well, I am afraid that > > > > > > > > > > - in some (embedded) scenarios, the TAI clock may not be available > > > > > > > > > > - crappy hypervisors will pass off the UTC clock as the TAI clock. > > > > > > > > > > For the same reasons, I am also not sure about adding a *mandatory* TAI > > > > > offset to each readout. I don't know user-space software which would > > > > > leverage this already (at least not through the PTP clock interface). > > > > > And why would such software not go straight for the TAI clock instead? > > > > > > > > > > How about adding a requirement to the spec that the virtio-rtc device > > > > > SHOULD expose the TAI clock whenever it is available - would this > > > > > address your concerns? > > > > > > > > I think that would be too easy for implementors to miss, or decide not > > > > to obey. Or to get *wrong*, by exposing a TAI clock but actually > > > > putting UTC in it. > > > > > > > > I think I prefer to mandate the tai_offset field with the UTC clock. > > > > Crappy implementations will just set it to zero, but at least that > > > > gives a clear signal to the guests that it's *their* problem to > > > > resolve. > > > > > > To me there are some open questions regarding how this would work. Is there > > > a use case for this with the v3 clock reading methods, or would it be > > > enough to address this with the Virtio timekeeper? > > > > > > Looking at clock_adjtime(2), the tai_offset could be exposed, but probably > > > best alongside some additional information about leap seconds. I am not > > > aware about any user-space user. In addition, leap second smearing should > > > also be addressed. > > > > > > > Is there even a standard yet for leap-smearing? Will it be linear over > > 1000 seconds like UTC-SLS? Or semi-raised-cosine over 24 hours, which I > > think is what Google does? Meta does something different again, don't > > they? > > > > Exposing UTC as the only clock reference is bad enough; when leap > > seconds happen there's a whole second during which you don't *know* > > which second it is. It seems odd to me, for a precision clock to be > > deliberately ambiguous about what the time is! > > Just to be clear, the device can perfectly expose only a TAI reference > clock (or both UTC and TAI), the spec is just completely open about this, > as it tries to work for diverse use cases. As long as the guest *knows* what it's getting, sure. > > > > But if the virtio-rtc clock is defined as UTC and then expose something > > *different* in it, that's even worse. You potentially end up providing > > inaccurate time for a whole *day* leading up to the leap second. > > > > I think you're right that leap second smearing should be addressed. At > > the very least, by making it clear that the virtio-rtc clock which > > advertises UTC shall be used *only* for UTC, never UTC-SLS or any other > > yet-to-be-defined variant. > > > > Agreed. > > > Please make it explicit that any hypervisor which wants to advertise a > > smeared clock shall define a new type which specifies the precise > > smearing algorithm and cannot be conflated with the one you're defining > > here. > > > > I will add a requirement that the UTC clock can never have smeared/smoothed > leap seconds. Thanks. > I think that not every vendor would bother to first add a definition of a > smearing algorithm. Also, I think in some cases knowing the precise > smearing algorithm might not be important (when having the same time as the > hypervisor is enough and accuracy w.r.t. actual time is less important). > > So maybe I should add a VIRTIO_RTC_CLOCK_UTC_SMEARED clock type, which for > now could catch every UTC-like clock which smears/smoothes leap seconds, > where the vendor cannot be bothered to add the smearing algorithm to spec > and implementations. Please $DEITY no. Surely the whole point of this effort is to provide guests with precise and *unambiguous* knowledge of what the time is? Using UTC is bad enough, because for a UTC timestamp in the middle of a leap second the guest can't know know *which* occurrence of that leap second it is, so it might be wrong by a second. To resolve that ambiguity needs a leap indicator and/or tai_offset field. But if you allow and encourage the use of smeared time without even a specification of *how* it's smeared... that's even worse. You have an unknown inaccuracy of up to a second for whole periods of time around a leap second. That's surely the *antithesis* of what we're trying to do here? Without an actual definition of the smearing, how is a guest actually supposed to know what time it is? (I suppose you could add a tai_offset_nanoseconds field? I don't know that I want to *encourage* that thought process...) > As for UTC-SLS, this *could* also be added, although [1] says > > It is inappropriate to use Internet-Drafts as reference material or > to cite them other than as "work in progress." > > [1] https://datatracker.ietf.org/doc/html/draft-kuhn-leapsecond-00 > > > > > One other thing to note is I think we're being very naïve about the TSC > > > > on x86 hosts. Theoretically, the TSC for every vCPU might run at a > > > > different frequency, and even if they run at the same frequency they > > > > might be offset from each other. I'm happy to be naïve but I think we > > > > should be *explicitly* so, and just say for example that it's defined > > > > against vCPU0 so if other vCPUs are different then all bets are off. > > > > > > ATM Virtio has no notion of vCPUs, or vCPU topology. So I wonder if you > > > have an opinion on how to represent this in a platform-independent way. > > > > Well, it doesn't have a notion of TSCs either; you include that by > > implicit reference don't you? > > I think I can add a SHOULD requirement which vaguely refers to vCPU 0, or > boot vCPU. But the Virtio device is not necessarily hosted by a hypervisor, > so the device might not even know which vCPUs there are. E.g. there is even > interest to make virtio-rtc work as part of the virtio-net device (which > might be implemented in hardware). Sure, but those implementations aren't going to offer the TSC pairing at all, are they?
On 13/03/2024 12:29:38+0000, David Woodhouse wrote: > On Wed, 2024-03-13 at 12:18 +0100, Alexandre Belloni wrote: > > > > I still don't know anything about virtio but under Linux, an RTC is > > always UTC (or localtime when dual booting but let's not care) and never > > accounts for leap seconds. Having an RTC and RTC driver behaving > > differently would be super inconvenient. Why don't you leave this to > > userspace? > > Well yes, we don't need to expose *anything* from the hypervisor and we > can leave it all to guest userspace. We can run NTP on every single one > of *hundreds* of guests, leaving them all to duplicate the work of > calibrating the *same* underlying oscillator. > Really, I see the point of sharing the time accurately between the host and the guest and not duplicating the effort. This is not what I am objecting to. > I thought we were trying to avoid that, by having the hypervisor tell > them what the time was. If we're going to do that, we need it to be > sufficiently precise (and some clients want to *know* the precision), > and above all we need it to be *unambiguous*. > > If the hypervisor says that the time is 3692217600.001, then the guest > doesn't actually know *which* 3692217600.001 it is, and thus it still > doesn't know the time to an accuracy better than 1 second. > The RTC subsystem has a 1 second resolution and this is not going to change because there is no point doing so for the hardware, to get the time precisely, UIE MUST be used there is no vendor that will just support reading the time/date or timestamp as this is way too imprecise. > And if we start allowing the hypervisor to smear clocks in some other > underspecified ways, then we end up with errors of up to 1 second in > the clock for long periods of time *around* the leap second. > > We need to avoid that ambiguity. Exactly my point and I said, reading an RTC is always UTC and never handles leap seconds so if userspace doesn't handle the leap second and updates the RTC, too bad. There are obviously no RTC that will smear clock unless instructed to, so the hypervisor must not smear the clock. Note that this is not an issue for the subsystem because if you are not capable to track an accurate clock, the RTC itself will have drifted by much more than a second in between leap second inclusions. > > > I guess I'm still questioning whether this is the correct interface to > > expose the host system time instead of an actual RTC. > > If an RTC device is able to report '23:59:60' as the time of day, I > suppose that *could* resolve the ambiguity. But talking to a device is > slow; we want guests to be able to know the time — accurately — with a > simple counter/tsc read and some arithmetic. Which means *paired* reads > of 'RTC' and the counter, and a precise indication of the counter > frequency. 23:59:60 is not and will never be allowed in the RTC subsystem as this is an invalid value for the hardware. The TSC or whatever CPU counter/clock that is used to keep the system time is not an RTC, I don't get why it has to be exposed as such to the guests. PTP is fine and precise, RTC is not.
On Wed, 2024-03-13 at 13:58 +0100, Alexandre Belloni wrote: > The TSC or whatever CPU counter/clock that is used to keep the system > time is not an RTC, I don't get why it has to be exposed as such to the > guests. PTP is fine and precise, RTC is not. Ah, I see. But the point of the virtio_rtc is not really to expose that CPU counter. The point is to report the wallclock time, just like an actual RTC. The real difference is the *precision*. The virtio_rtc device has a facility to *also* expose the counter, because that's what we actually need to gain that precision... Applications don't read the RTC every time they want to know what the time is. These days, they don't even make a system call; it's done entirely in userspace mode. The kernel exposes some shared memory, essentially saying "the counter was X at time Y, and runs at Z Hz". Then applications just read the CPU counter and do some arithmetic. As we require more and more precision in the calibration, it becomes important to get *paired* readings of the CPU counter and the wallclock time at precisely the same moment. If the guest has to read one and then the other, potentially taking interrupts, getting preempted and suffering steal/SMI time in the middle, that introduces an error which is increasingly significant as we increasingly care about precision. Peter's proposal exposes the pairs of {X,Y} and leaves *all* the guest kernels having to repeat readings over time and perform the calibration as the underlying hardware oscillator frequency (Z) drifts with temperature. I'm trying to get him to let the hypervisor expose the calibrated frequency Z too. Along with *error* bounds for ±δX and ±δZ. Which aside from reducing the duplication of effort, will *also* fix the problem of live migration where *all* those things suffer a step change and leave the guest with an inaccurate clock but not knowing it. But that part isn't relevant to the RTC, as you say. RTC doesn't care about that level of precision; it's just what the system uses to know roughly what year it is, when it powers up. (And isn't even really trusted for that, which is a large part of why I added the X509_V_FLAG_NO_CHECK_TIME flag to OpenSSL, so Secure Boot doesn't break when the RTC is catastrophically wrong :) If you're asking why patch 7/7 in Peter's series exists to expose the virtio clock through RTC, and you're not particularly interested in the first six, I suppose that's a fair question. As is the question of "why is it called virtio_rtc not virtio_ptp?". But let me turn it around: if the kernel has access to this virtio device and *not* any other RTC, why *wouldn't* the kernel use the time from it? The fact that it can optionally *also* provide paired readings with the CPU counter doesn't actually *hurt* for the RTC use case, does it?
On 13.03.24 12:18, Alexandre Belloni wrote: > On 13/03/2024 10:45:54+0100, Peter Hilber wrote: >>> Exposing UTC as the only clock reference is bad enough; when leap >>> seconds happen there's a whole second during which you don't *know* >>> which second it is. It seems odd to me, for a precision clock to be >>> deliberately ambiguous about what the time is! >> >> Just to be clear, the device can perfectly expose only a TAI reference >> clock (or both UTC and TAI), the spec is just completely open about this, >> as it tries to work for diverse use cases. >> >>> >>> But if the virtio-rtc clock is defined as UTC and then expose something >>> *different* in it, that's even worse. You potentially end up providing >>> inaccurate time for a whole *day* leading up to the leap second. >>> >>> I think you're right that leap second smearing should be addressed. At >>> the very least, by making it clear that the virtio-rtc clock which >>> advertises UTC shall be used *only* for UTC, never UTC-SLS or any other >>> yet-to-be-defined variant. >>> >> >> Agreed. >> >>> Please make it explicit that any hypervisor which wants to advertise a >>> smeared clock shall define a new type which specifies the precise >>> smearing algorithm and cannot be conflated with the one you're defining >>> here. >>> >> >> I will add a requirement that the UTC clock can never have smeared/smoothed >> leap seconds. >> >> I think that not every vendor would bother to first add a definition of a >> smearing algorithm. Also, I think in some cases knowing the precise >> smearing algorithm might not be important (when having the same time as the >> hypervisor is enough and accuracy w.r.t. actual time is less important). >> >> So maybe I should add a VIRTIO_RTC_CLOCK_UTC_SMEARED clock type, which for >> now could catch every UTC-like clock which smears/smoothes leap seconds, >> where the vendor cannot be bothered to add the smearing algorithm to spec >> and implementations. >> > > I still don't know anything about virtio but under Linux, an RTC is > always UTC (or localtime when dual booting but let's not care) and never > accounts for leap seconds. Having an RTC and RTC driver behaving > differently would be super inconvenient. Why don't you leave this to > userspace? > > I guess I'm still questioning whether this is the correct interface to > expose the host system time instead of an actual RTC. > virtio_rtc only registers RTC class devices for virtio_rtc clock type UTC, so adding an UTC_SMEARED clock type would avoid leap seconds for the RTC class. But I understand that there are more concerns and I will re-consider. From my POV CLOCK_{REALTIME,MONOTONIC}_ALARM support is very important however. So the only alternative to me seems to be adding an alternative alarmtimer backend (and time synchronization through user space). Thanks for the comment, Peter
On 13/03/2024 14:06:42+0000, David Woodhouse wrote: > If you're asking why patch 7/7 in Peter's series exists to expose the > virtio clock through RTC, and you're not particularly interested in the > first six, I suppose that's a fair question. As is the question of "why > is it called virtio_rtc not virtio_ptp?". > Exactly my question, thanks :) > But let me turn it around: if the kernel has access to this virtio > device and *not* any other RTC, why *wouldn't* the kernel use the time > from it? The fact that it can optionally *also* provide paired readings > with the CPU counter doesn't actually *hurt* for the RTC use case, does > it? As long as it doesn't behave differently from the other RTC, I'm fine with this. This is important because I don't want to carry any special infrastructure for this driver or to have to special case this driver later on because it is incompatible with some evolution of the subsystem.
On 13.03.24 13:45, David Woodhouse wrote: > On Wed, 2024-03-13 at 10:45 +0100, Peter Hilber wrote: >> On 12.03.24 18:15, David Woodhouse wrote: >>> On Mon, 2024-03-11 at 19:24 +0100, Peter Hilber wrote: >>>> On 08.03.24 13:33, David Woodhouse wrote: >>>>> On Fri, 2024-03-08 at 11:32 +0100, Peter Hilber wrote: >>>>>> On 07.03.24 15:02, David Woodhouse wrote: >>>>>>> Hm, should we allow UTC? If you tell me the time in UTC, then >>>>>>> (sometimes) I still don't actually know what the time is, because some >>>>>>> UTC seconds occur twice. UTC only makes sense if you provide the TAI >>>>>>> offset, surely? Should the virtio_rtc specification make it mandatory >>>>>>> to provide such? >>>>>>> >>>>>>> Otherwise you're just designing it to allow crappy hypervisors to >>>>>>> expose incomplete information. >>>>>>> >>>>>> >>>>>> Hi David, >>>>>> >>>>>> (adding virtio-comment@lists.oasis-open.org for spec discussion), >>>>>> >>>>>> thank you for your insightful comments. I think I take a broadly similar >>>>>> view. The reason why the current spec and driver is like this is that I >>>>>> took a pragmatic approach at first and only included features which work >>>>>> out-of-the-box for the current Linux ecosystem. >>>>>> >>>>>> The current virtio_rtc features work similar to ptp_kvm, and therefore >>>>>> can work out-of-the-box with time sync daemons such as chrony. >>>>>> >>>>>> As of RFC spec v3, UTC clock only is allowed. If mandating a TAI clock >>>>>> as well, I am afraid that >>>>>> >>>>>> - in some (embedded) scenarios, the TAI clock may not be available >>>>>> >>>>>> - crappy hypervisors will pass off the UTC clock as the TAI clock. >>>>>> >>>>>> For the same reasons, I am also not sure about adding a *mandatory* TAI >>>>>> offset to each readout. I don't know user-space software which would >>>>>> leverage this already (at least not through the PTP clock interface). >>>>>> And why would such software not go straight for the TAI clock instead? >>>>>> >>>>>> How about adding a requirement to the spec that the virtio-rtc device >>>>>> SHOULD expose the TAI clock whenever it is available - would this >>>>>> address your concerns? >>>>> >>>>> I think that would be too easy for implementors to miss, or decide not >>>>> to obey. Or to get *wrong*, by exposing a TAI clock but actually >>>>> putting UTC in it. >>>>> >>>>> I think I prefer to mandate the tai_offset field with the UTC clock. >>>>> Crappy implementations will just set it to zero, but at least that >>>>> gives a clear signal to the guests that it's *their* problem to >>>>> resolve. >>>> >>>> To me there are some open questions regarding how this would work. Is there >>>> a use case for this with the v3 clock reading methods, or would it be >>>> enough to address this with the Virtio timekeeper? >>>> >>>> Looking at clock_adjtime(2), the tai_offset could be exposed, but probably >>>> best alongside some additional information about leap seconds. I am not >>>> aware about any user-space user. In addition, leap second smearing should >>>> also be addressed. >>>> >>> >>> Is there even a standard yet for leap-smearing? Will it be linear over >>> 1000 seconds like UTC-SLS? Or semi-raised-cosine over 24 hours, which I >>> think is what Google does? Meta does something different again, don't >>> they? >>> >>> Exposing UTC as the only clock reference is bad enough; when leap >>> seconds happen there's a whole second during which you don't *know* >>> which second it is. It seems odd to me, for a precision clock to be >>> deliberately ambiguous about what the time is! >> >> Just to be clear, the device can perfectly expose only a TAI reference >> clock (or both UTC and TAI), the spec is just completely open about this, >> as it tries to work for diverse use cases. > > As long as the guest *knows* what it's getting, sure. > >>> >>> But if the virtio-rtc clock is defined as UTC and then expose something >>> *different* in it, that's even worse. You potentially end up providing >>> inaccurate time for a whole *day* leading up to the leap second. >>> >>> I think you're right that leap second smearing should be addressed. At >>> the very least, by making it clear that the virtio-rtc clock which >>> advertises UTC shall be used *only* for UTC, never UTC-SLS or any other >>> yet-to-be-defined variant. >>> >> >> Agreed. >> >>> Please make it explicit that any hypervisor which wants to advertise a >>> smeared clock shall define a new type which specifies the precise >>> smearing algorithm and cannot be conflated with the one you're defining >>> here. >>> >> >> I will add a requirement that the UTC clock can never have smeared/smoothed >> leap seconds. > > Thanks. > >> I think that not every vendor would bother to first add a definition of a >> smearing algorithm. Also, I think in some cases knowing the precise >> smearing algorithm might not be important (when having the same time as the >> hypervisor is enough and accuracy w.r.t. actual time is less important). >> >> So maybe I should add a VIRTIO_RTC_CLOCK_UTC_SMEARED clock type, which for >> now could catch every UTC-like clock which smears/smoothes leap seconds, >> where the vendor cannot be bothered to add the smearing algorithm to spec >> and implementations. > > Please $DEITY no. > > Surely the whole point of this effort is to provide guests with precise > and *unambiguous* knowledge of what the time is? I would say, a fundamental point of this effort is to enable such implementations, and to detect if a device is promising to support this. Where we might differ is as to whether the Virtio clock *for every implementation* has to be *continuously* accurate w.r.t. a time standard, or whether *for some implementations* it could be enough that all guests in the local system have the same, precise local notion of time, which might be off from the actual time standard. Also, cf. ptp_kvm, which AFAIU doesn't address leap seconds at all... With your described use case the UTC_SMEARED clock should of course not be used. The UTC_SMEARED clock would get a distinct name through udev, like /dev/ptp_virtio_utc_smeared, so the incompatibility could at least be detected. > > Using UTC is bad enough, because for a UTC timestamp in the middle of a > leap second the guest can't know know *which* occurrence of that leap > second it is, so it might be wrong by a second. To resolve that > ambiguity needs a leap indicator and/or tai_offset field. I agree that virtio-rtc should communicate this. The question is, what exactly, and for which clock read request? As for PTP clocks: - It doesn't fit into the ioctl PTP_SYS_OFFSET_PRECISE2. - The clock_adjtime(2) tai_offset and return value could be set (if upstream will accept this). Would this help? As discussed, user space would need to interpret this (and currently no dynamic POSIX clock sets this). > > But if you allow and encourage the use of smeared time without even a > specification of *how* it's smeared... that's even worse. You have an > unknown inaccuracy of up to a second for whole periods of time around a > leap second. That's surely the *antithesis* of what we're trying to do > here? Without an actual definition of the smearing, how is a guest > actually supposed to know what time it is? As discussed above, I think in some use cases it is enough for the guest to have a precise notion of time shared with the other guests. > > (I suppose you could add a tai_offset_nanoseconds field? I don't know > that I want to *encourage* that thought process...) > >> As for UTC-SLS, this *could* also be added, although [1] says >> >> It is inappropriate to use Internet-Drafts as reference material or >> to cite them other than as "work in progress." >> >> [1] https://datatracker.ietf.org/doc/html/draft-kuhn-leapsecond-00 >> >>>>> One other thing to note is I think we're being very naïve about the TSC >>>>> on x86 hosts. Theoretically, the TSC for every vCPU might run at a >>>>> different frequency, and even if they run at the same frequency they >>>>> might be offset from each other. I'm happy to be naïve but I think we >>>>> should be *explicitly* so, and just say for example that it's defined >>>>> against vCPU0 so if other vCPUs are different then all bets are off. >>>> >>>> ATM Virtio has no notion of vCPUs, or vCPU topology. So I wonder if you >>>> have an opinion on how to represent this in a platform-independent way. >>> >>> Well, it doesn't have a notion of TSCs either; you include that by >>> implicit reference don't you? >> >> I think I can add a SHOULD requirement which vaguely refers to vCPU 0, or >> boot vCPU. But the Virtio device is not necessarily hosted by a hypervisor, >> so the device might not even know which vCPUs there are. E.g. there is even >> interest to make virtio-rtc work as part of the virtio-net device (which >> might be implemented in hardware). > > Sure, but those implementations aren't going to offer the TSC pairing > at all, are they? > They could offer an Intel ART pairing (some physical PTP NICs are already doing this, look for the convert_art_to_tsc() users). Thanks for the comments, Peter
On 13.03.24 15:06, David Woodhouse wrote: > On Wed, 2024-03-13 at 13:58 +0100, Alexandre Belloni wrote: >> The TSC or whatever CPU counter/clock that is used to keep the system >> time is not an RTC, I don't get why it has to be exposed as such to the >> guests. PTP is fine and precise, RTC is not. > > Ah, I see. But the point of the virtio_rtc is not really to expose that > CPU counter. The point is to report the wallclock time, just like an > actual RTC. The real difference is the *precision*. > > The virtio_rtc device has a facility to *also* expose the counter, > because that's what we actually need to gain that precision... > > Applications don't read the RTC every time they want to know what the > time is. These days, they don't even make a system call; it's done > entirely in userspace mode. The kernel exposes some shared memory, > essentially saying "the counter was X at time Y, and runs at Z Hz". > Then applications just read the CPU counter and do some arithmetic. > > As we require more and more precision in the calibration, it becomes > important to get *paired* readings of the CPU counter and the wallclock > time at precisely the same moment. If the guest has to read one and > then the other, potentially taking interrupts, getting preempted and > suffering steal/SMI time in the middle, that introduces an error which > is increasingly significant as we increasingly care about precision. > > Peter's proposal exposes the pairs of {X,Y} and leaves *all* the guest > kernels having to repeat readings over time and perform the calibration > as the underlying hardware oscillator frequency (Z) drifts with > temperature. I'm trying to get him to let the hypervisor expose the > calibrated frequency Z too. Along with *error* bounds for ±δX and ±δZ. > Which aside from reducing the duplication of effort, will *also* fix > the problem of live migration where *all* those things suffer a step > change and leave the guest with an inaccurate clock but not knowing it. I am already convinced that this would work significantly better than the {X,Y} pair (but would be a bit more effort to implement): 1. when accessed by user space, obviously 2. when backing the PTP clock, it saves CPU time and makes non-paired reads more precise. I would just prefer to try upstreaming the {X,Y} pairing first. I think the {X,Y,Z...} pairing could be discussed and developed in parallel. Thanks for the comments, Peter
On 13 March 2024 17:50:48 GMT, Peter Hilber <peter.hilber@opensynergy.com> wrote: >On 13.03.24 13:45, David Woodhouse wrote: >> Surely the whole point of this effort is to provide guests with precise >> and *unambiguous* knowledge of what the time is? > >I would say, a fundamental point of this effort is to enable such >implementations, and to detect if a device is promising to support this. > >Where we might differ is as to whether the Virtio clock *for every >implementation* has to be *continuously* accurate w.r.t. a time standard, >or whether *for some implementations* it could be enough that all guests in >the local system have the same, precise local notion of time, which might >be off from the actual time standard. That makes sense, but remember I don't just want {X, Y, Z} but *also* the error bounds of ±deltaY and ±deltaZ too. So your example just boils down to "I'm calling it UTC, and it's really precise, but we make no promises about its *accuracy*". And that's fine. >Also, cf. ptp_kvm, which AFAIU doesn't address leap seconds at all... KVM is not an exemplar of good time practices. Not in *any* respect :) >With your described use case the UTC_SMEARED clock should of course not be >used. The UTC_SMEARED clock would get a distinct name through udev, like >/dev/ptp_virtio_utc_smeared, so the incompatibility could at least be >detected. As long as it's clear to all concerned that this is fundamentally not usable as an accurate time source, and is only for the local-sync case you described, sure. >> Using UTC is bad enough, because for a UTC timestamp in the middle of a >> leap second the guest can't know know *which* occurrence of that leap >> second it is, so it might be wrong by a second. To resolve that >> ambiguity needs a leap indicator and/or tai_offset field. > >I agree that virtio-rtc should communicate this. The question is, what >exactly, and for which clock read request? Are we now conflating software architecture (and Linux in particular) with "hardware" design? To a certain extent, as long as the virtio-rtc device is designed to expose time precisely and unambiguously, it's less important if the Linux kernel *today* can use that. Although of course we should strive for that. Let's be...well, *unambiguous*, I suppose... that we've changed topics to discuss that though. >As for PTP clocks: > >- It doesn't fit into the ioctl PTP_SYS_OFFSET_PRECISE2. > >- The clock_adjtime(2) tai_offset and return value could be set (if > upstream will accept this). Would this help? As discussed, user space > would need to interpret this (and currently no dynamic POSIX clock sets > this). Hm, maybe? >>> I think I can add a SHOULD requirement which vaguely refers to vCPU 0, or >>> boot vCPU. But the Virtio device is not necessarily hosted by a hypervisor, >>> so the device might not even know which vCPUs there are. E.g. there is even >>> interest to make virtio-rtc work as part of the virtio-net device (which >>> might be implemented in hardware). >> >> Sure, but those implementations aren't going to offer the TSC pairing >> at all, are they? >> > >They could offer an Intel ART pairing (some physical PTP NICs are already >doing this, look for the convert_art_to_tsc() users). Right, but isn't that software's problem? The time pairing is defined against the ART in that case.
> As long as it doesn't behave differently from the other RTC, I'm fine > with this. This is important because I don't want to carry any special > infrastructure for this driver or to have to special case this driver > later on because it is incompatible with some evolution of the > subsystem. Maybe deliberately throw away all the sub-second accuracy when accessing the time via the RTC API? That helps to make it look like an RTC. And when doing the rounding, add a constant offset of 10ms to emulate the virtual i2c bus it is hanging off, like most RTCs? Andrew
On 13.03.24 21:12, Andrew Lunn wrote: >> As long as it doesn't behave differently from the other RTC, I'm fine >> with this. This is important because I don't want to carry any special >> infrastructure for this driver or to have to special case this driver >> later on because it is incompatible with some evolution of the >> subsystem. > > Maybe deliberately throw away all the sub-second accuracy when > accessing the time via the RTC API? That helps to make it look like an > RTC. And when doing the rounding, add a constant offset of 10ms to > emulate the virtual i2c bus it is hanging off, like most RTCs? > > Andrew The truncating to whole seconds is already done. As to the offset, I do not understand how this would help. I can read out my CMOS RTC in ~50 us. Thanks for the comment, Peter
Now CC'ing the previous commenters to the virtio-rtc spec draft, since this discussion is mostly about the spec, and the Virtio mailing lists still seem to be in a migration hiatus... On 13.03.24 19:18, David Woodhouse wrote: > On 13 March 2024 17:50:48 GMT, Peter Hilber <peter.hilber@opensynergy.com> wrote: >> On 13.03.24 13:45, David Woodhouse wrote: >>> Surely the whole point of this effort is to provide guests with precise >>> and *unambiguous* knowledge of what the time is? >> >> I would say, a fundamental point of this effort is to enable such >> implementations, and to detect if a device is promising to support this. >> >> Where we might differ is as to whether the Virtio clock *for every >> implementation* has to be *continuously* accurate w.r.t. a time standard, >> or whether *for some implementations* it could be enough that all guests in >> the local system have the same, precise local notion of time, which might >> be off from the actual time standard. > > That makes sense, but remember I don't just want {X, Y, Z} but *also* the error bounds of ±deltaY and ±deltaZ too. > > So your example just boils down to "I'm calling it UTC, and it's really precise, but we make no promises about its *accuracy*". And that's fine. > >> Also, cf. ptp_kvm, which AFAIU doesn't address leap seconds at all... > > KVM is not an exemplar of good time practices. > Not in *any* respect :) > >> With your described use case the UTC_SMEARED clock should of course not be >> used. The UTC_SMEARED clock would get a distinct name through udev, like >> /dev/ptp_virtio_utc_smeared, so the incompatibility could at least be >> detected. > > As long as it's clear to all concerned that this is fundamentally not usable as an accurate time source, and is only for the local-sync case you described, sure. > >>> Using UTC is bad enough, because for a UTC timestamp in the middle of a >>> leap second the guest can't know know *which* occurrence of that leap >>> second it is, so it might be wrong by a second. To resolve that >>> ambiguity needs a leap indicator and/or tai_offset field. >> >> I agree that virtio-rtc should communicate this. The question is, what >> exactly, and for which clock read request? > > Are we now conflating software architecture (and Linux in particular) with "hardware" design? > > To a certain extent, as long as the virtio-rtc device is designed to expose time precisely and unambiguously, it's less important if the Linux kernel *today* can use that. Although of course we should strive for that. Let's be...well, *unambiguous*, I suppose... that we've changed topics to discuss that though. > As Virtio is extensible (unlike hardware), my approach is to mostly specify only what also has a PoC user and a use case. >> As for PTP clocks: >> >> - It doesn't fit into the ioctl PTP_SYS_OFFSET_PRECISE2. >> >> - The clock_adjtime(2) tai_offset and return value could be set (if >> upstream will accept this). Would this help? As discussed, user space >> would need to interpret this (and currently no dynamic POSIX clock sets >> this). > > Hm, maybe? > > >>>> I think I can add a SHOULD requirement which vaguely refers to vCPU 0, or >>>> boot vCPU. But the Virtio device is not necessarily hosted by a hypervisor, >>>> so the device might not even know which vCPUs there are. E.g. there is even >>>> interest to make virtio-rtc work as part of the virtio-net device (which >>>> might be implemented in hardware). >>> >>> Sure, but those implementations aren't going to offer the TSC pairing >>> at all, are they? >>> >> >> They could offer an Intel ART pairing (some physical PTP NICs are already >> doing this, look for the convert_art_to_tsc() users). > > Right, but isn't that software's problem? The time pairing is defined against the ART in that case. My point was that such a device would then not necessarily have an idea what vCPU 0 is. But let's just say that this will be phrased as a SHOULD best-effort requirement anyway. Thanks for the comments, Peter
On 14 March 2024 11:13:37 CET, Peter Hilber <peter.hilber@opensynergy.com> wrote: >> To a certain extent, as long as the virtio-rtc device is designed to expose time precisely and unambiguously, it's less important if the Linux kernel *today* can use that. Although of course we should strive for that. Let's be...well, *unambiguous*, I suppose... that we've changed topics to discuss that though. >> > >As Virtio is extensible (unlike hardware), my approach is to mostly specify >only what also has a PoC user and a use case. If we get memory-mapped (X, Y, Z, ±x, ±y) I'll have a user and a use case on day one. Otherwise, as I said in my first response, I can go do that as a separate device and decide that virtio_rtc doesn't meet our needs (especially for maintaining accuracy over LM). My main concern for virto_rtc is that we avoid *ambiguity*. Yes, I get that it's extensible but we don't want a v1.0 of the spec, implemented by various hypervisors, which still leaves guests not knowing what the actual time is. That would not be good. And even UTC without a leap second indicator has that problem.
While the virtio-comment list is not available, now also CC'ing Parav, which may be interested in this virtio-rtc spec related discussion thread. On 14.03.24 15:19, David Woodhouse wrote: > On 14 March 2024 11:13:37 CET, Peter Hilber <peter.hilber@opensynergy.com> wrote: >>> To a certain extent, as long as the virtio-rtc device is designed to expose time precisely and unambiguously, it's less important if the Linux kernel *today* can use that. Although of course we should strive for that. Let's be...well, *unambiguous*, I suppose... that we've changed topics to discuss that though. >>> >> >> As Virtio is extensible (unlike hardware), my approach is to mostly specify >> only what also has a PoC user and a use case. > > If we get memory-mapped (X, Y, Z, ±x, ±y) I'll have a user and a use case on day one. Otherwise, as I said in my first response, I can go do that as a separate device and decide that virtio_rtc doesn't meet our needs (especially for maintaining accuracy over LM). We plan to add - leap second indication, - UTC-to-TAI offset, - clock smearing indication (including the noon-to-noon linear smearing variant which seems to be somewhat popular), and - clock accuracy indication to the initial spec and to the PoC implementation. However, due to resource restrictions, we cannot ourselves add the memory-mapped clock to the initial spec. Everyone is very welcome to contribute the memory-mapped clock to the spec, and I think it might then still make it to the initial version. > > My main concern for virto_rtc is that we avoid *ambiguity*. Yes, I get that it's extensible but we don't want a v1.0 of the spec, implemented by various hypervisors, which still leaves guests not knowing what the actual time is. That would not be good. And even UTC without a leap second indicator has that problem. Agreed. That should be addressed by the above changes. Best regards, Peter
On Tue, 2024-03-19 at 14:47 +0100, Peter Hilber wrote: > While the virtio-comment list is not available, now also CC'ing Parav, > which may be interested in this virtio-rtc spec related discussion thread. > > On 14.03.24 15:19, David Woodhouse wrote: > > On 14 March 2024 11:13:37 CET, Peter Hilber <peter.hilber@opensynergy.com> wrote: > > > > To a certain extent, as long as the virtio-rtc device is designed to expose time precisely and unambiguously, it's less important if the Linux kernel *today* can use that. Although of course we should strive for that. Let's be...well, *unambiguous*, I suppose... that we've changed topics to discuss that though. > > > > > > > > > > As Virtio is extensible (unlike hardware), my approach is to mostly specify > > > only what also has a PoC user and a use case. > > > > If we get memory-mapped (X, Y, Z, ±x, ±y) I'll have a user and a use case on day one. Otherwise, as I said in my first response, I can go do that as a separate device and decide that virtio_rtc doesn't meet our needs (especially for maintaining accuracy over LM). > > We plan to add > > - leap second indication, > > - UTC-to-TAI offset, > > - clock smearing indication (including the noon-to-noon linear smearing > variant which seems to be somewhat popular), and > > - clock accuracy indication > > to the initial spec and to the PoC implementation. Sounds good, thanks! I look forward to seeing the new revision. I'm hoping Julien can give feedback on the clock accuracy parts. > However, due to resource restrictions, we cannot ourselves add the > memory-mapped clock to the initial spec. > > Everyone is very welcome to contribute the memory-mapped clock to the spec, > and I think it might then still make it to the initial version. Makes sense. That is my primary target, so I'm *hoping* we can converge and get that into your initial spec, otherwise for expediency I'm going to have to define an ACPI or DT or PCI device of our own and expose the memory region through that instead. (Even if I have to do that in the short term to stop the bleeding with customers' clocks and live migration, I'd still aspire to migrate to a virtio_rtc version of it in future)
On Mon, 2023-12-18 at 08:38 +0100, Peter Hilber wrote: > RFC v3 updates > -------------- > > This series implements a driver for a virtio-rtc device conforming to spec > RFC v3 [1]. It now includes an RTC class driver with alarm, in addition to > the PTP clock driver already present before. > > This patch series depends on the patch series "treewide: Use clocksource id > for get_device_system_crosststamp()" [3]. Pull [4] to get the combined > series on top of mainline. > > Overview > -------- > > This patch series adds the virtio_rtc module, and related bugfixes. The > virtio_rtc module implements a driver compatible with the proposed Virtio > RTC device specification [1]. The Virtio RTC (Real Time Clock) device > provides information about current time. The device can provide different > clocks, e.g. for the UTC or TAI time standards, or for physical time > elapsed since some past epoch. The driver can read the clocks with simple > or more accurate methods. Optionally, the driver can set an alarm. > > The series first fixes some bugs in the get_device_system_crosststamp() > interpolation code, which is required for reliable virtio_rtc operation. > Then, add the virtio_rtc implementation. > > For the Virtio RTC device, there is currently a proprietary implementation, > which has been used for provisional testing. As discussed before, I don't think it makes sense to design a new high- precision virtual clock which only gets it right *most* of the time. We absolutely need to address the issue of live migration. When live migration occurs, the guest's time precision suffers in two ways. First, even when migrating to a supposedly identical host the precise rate of the underlying counter (TSC, arch counter, etc.) can change within the tolerances (e.g. ±50PPM) of the hardware. Unlike the natural changes that NTP normally manages to track, this is a *step* change, potentially from -50PPM to +50PPM from one host to the next. Second, the accuracy of the counter as preserved across migration is limited by the accuracy of each host's NTP synchronization. So there is also a step change in the value of the counter itself. At the moment of migration, the guest's timekeeping should be considered invalid. Any previous cross-timestamps are meaningless, and blindly using the previously-known relationship between the counter and real time can lead to problems such as corruption in distributed databases, fines for mis-timestamped transactions, and other general unhappiness. We obviously can't get a new timestamp from the virtio_rtc device every time an application wants to know the time reliably. We don't even want *system* calls for that, which is why we have it in a vDSO. We can take the same approach to warning the guest about clock disruption due to live migration. A shared memory region from the virtual clock device even be mapped all the way to userspace, for those applications which need precise and *reliable* time to check a 'disruption' marker in it, and do whatever is appropriate (e.g. abort transactions and wait for time to resync) when it happens. We can do better than just letting the guest know about disruption, of course. We can put the actual counter→realtime relationship into the memory region too. As well as being able to provide a PTP driver with this, the applications which care about *reliable* timestamps can mmap the page directly and use it, vDSO-style, to have accurate timestamps even from the first cycle after migration. When disruption is signalled, the guest needs to throw away any *additional* refinement that it's done with NTP/PTP/PPS/etc. and revert to knowing nothing more than what the hypervisor advertises here. Here's a first attempt at defining such a memory structure. For now I've done it as a "vmclock" ACPI device based loosely on vmgenid, but I think it makes most sense for this to be part of the virtio_rtc spec. Ultimately it doesn't matter *how* the guest finds the memory region. Very preliminary QEMU hacks at https://git.infradead.org/users/dwmw2/qemu.git/shortlog/refs/heads/vmclock (I still need to do the KVM side helper for actually filling in the host clock information, converted to the *guest* TSC) This is the guest side. H aving heckled your assumption that we can use the virt counter on Arm, I concede I'm doing the same thing for now. The structure itself is at the end, or see https://git.infradead.org/?p=users/dwmw2/linux.git;a=blob;f=include/uapi/linux/vmclock.h;hb=vmclock From 9e1c3b823d497efa4e0acb21b226a72e4d6e8a53 Mon Sep 17 00:00:00 2001 From: David Woodhouse <dwmw@amazon.co.uk> Date: Mon, 10 Jun 2024 15:10:11 +0100 Subject: [PATCH] ptp: Add vDSO-style vmclock support The vmclock "device" provides a shared memory region with precision clock information. By using shared memory, it is safe across Live Migration. Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> --- drivers/ptp/Kconfig | 13 ++ drivers/ptp/Makefile | 1 + drivers/ptp/ptp_vmclock.c | 404 +++++++++++++++++++++++++++++++++++ include/uapi/linux/vmclock.h | 102 +++++++++ 4 files changed, 520 insertions(+) create mode 100644 drivers/ptp/ptp_vmclock.c create mode 100644 include/uapi/linux/vmclock.h diff --git a/drivers/ptp/Kconfig b/drivers/ptp/Kconfig index 604541dcb320..ace6d58c1781 100644 --- a/drivers/ptp/Kconfig +++ b/drivers/ptp/Kconfig @@ -131,6 +131,19 @@ config PTP_1588_CLOCK_KVM To compile this driver as a module, choose M here: the module will be called ptp_kvm. +config PTP_1588_CLOCK_VMCLOCK + tristate "Virtual machine PTP clock" + depends on X86_TSC || ARM_ARCH_TIMER + depends on PTP_1588_CLOCK && ACPI + default y + help + This driver adds support for using a virtual precision clock + advertised by the hypervisor. This clock is only useful in virtual + machines where such a device is present. + + To compile this driver as a module, choose M here: the module + will be called ptp_vmclock. + config PTP_1588_CLOCK_IDT82P33 tristate "IDT 82P33xxx PTP clock" depends on PTP_1588_CLOCK && I2C diff --git a/drivers/ptp/Makefile b/drivers/ptp/Makefile index 68bf02078053..01b5cd91eb61 100644 --- a/drivers/ptp/Makefile +++ b/drivers/ptp/Makefile @@ -11,6 +11,7 @@ obj-$(CONFIG_PTP_1588_CLOCK_DTE) += ptp_dte.o obj-$(CONFIG_PTP_1588_CLOCK_INES) += ptp_ines.o obj-$(CONFIG_PTP_1588_CLOCK_PCH) += ptp_pch.o obj-$(CONFIG_PTP_1588_CLOCK_KVM) += ptp_kvm.o +obj-$(CONFIG_PTP_1588_CLOCK_VMCLOCK) += ptp_vmclock.o obj-$(CONFIG_PTP_1588_CLOCK_QORIQ) += ptp-qoriq.o ptp-qoriq-y += ptp_qoriq.o ptp-qoriq-$(CONFIG_DEBUG_FS) += ptp_qoriq_debugfs.o diff --git a/drivers/ptp/ptp_vmclock.c b/drivers/ptp/ptp_vmclock.c new file mode 100644 index 000000000000..3c4e027090c5 --- /dev/null +++ b/drivers/ptp/ptp_vmclock.c @@ -0,0 +1,404 @@ +// SPDX-License-Identifier: GPL-2.0-or-later +/* + * Virtual PTP 1588 clock for use with LM-safe VMclock device. + * + * Copyright © 2024 Amazon.com, Inc. or its affiliates. + */ + +#include <linux/device.h> +#include <linux/err.h> +#include <linux/init.h> +#include <linux/kernel.h> +#include <linux/slab.h> +#include <linux/module.h> +#include <linux/ptp_kvm.h> +#include <linux/platform_device.h> +#include <linux/miscdevice.h> +#include <linux/acpi.h> +#include <uapi/linux/vmclock.h> + +#include <linux/ptp_clock_kernel.h> + +static DEFINE_IDA(vmclock_ida); + +ACPI_MODULE_NAME("vmclock"); + +struct vmclock_state { + phys_addr_t phys_addr; + struct vmclock_abi *clk; + struct miscdevice miscdev; + struct ptp_clock_info ptp_clock_info; + struct ptp_clock *ptp_clock; + enum clocksource_ids cs_id; + int index; + char *name; +}; + +static int vmclock_get_crosststamp(struct vmclock_state *st, + struct system_counterval_t *system_counter, + struct timespec64 *tspec) +{ + uint64_t cycle, delta, seq, delta_s32, delta_s64, delta_s32b; + uint64_t delta_lo, delta_hi, period_lo, period_hi, frac_sec_lo, frac_sec_hi; + int ret = 0; + +#ifdef CONFIG_X86 + if (check_tsc_unstable()) + return -EINVAL; +#endif + + preempt_disable_notrace(); + + do { + + seq = st->clk->seq_count & ~1ULL; + virt_rmb(); + + if (st->clk->clock_status == VMCLOCK_STATUS_UNRELIABLE) { + ret = -EINVAL; + virt_rmb(); + continue; + } + + cycle = get_cycles(); + + tspec->tv_sec = st->clk->utc_time_sec; + frac_sec_lo = st->clk->utc_time_frac_sec & 0xffffffff; + frac_sec_hi = st->clk->utc_time_frac_sec >> 32; + + delta = cycle - st->clk->counter_value; + + delta_lo = delta & 0xffffffff; + delta_hi = delta >> 32; + period_lo = st->clk->counter_period_frac_sec & 0xffffffff; + period_hi = st->clk->counter_period_frac_sec >> 32; + + /* Delta in units of (second >> 32) */ + delta_s32 = delta_lo * period_hi; + + /* Avoid overflow by skimming off the full seconds */ + tspec->tv_sec += delta_s32 >> 32; + delta_s32 &= 0xffffffff; + + if (delta_hi) { + tspec->tv_sec += delta_hi * period_hi; + delta_s32b = delta_hi * period_lo; + + tspec->tv_sec += delta_s32b >> 32; + delta_s32 += delta_s32b & 0xffffffff; + } + + delta_s32 += frac_sec_hi; + + delta_s64 = (period_lo * delta_lo) + frac_sec_lo; + delta_s32 += delta_s64 >> 32; + delta_s64 &= 0xffffffff; + + tspec->tv_sec += delta_s32 >> 32; + delta_s32 &= 0xffffffff; + + tspec->tv_nsec = ((delta_s32 * NSEC_PER_SEC) + + ((delta_s64 * NSEC_PER_SEC) >> 32)) >> 32; + + ret = 0; + + virt_rmb(); + } while (seq != st->clk->seq_count); + + preempt_enable_notrace(); + + if (ret) + return ret; + + if (system_counter) { + system_counter->cycles = cycle; + system_counter->cs_id = st->cs_id; + } + + return 0; +} + +static int ptp_vmclock_get_time_fn(ktime_t *device_time, + struct system_counterval_t *system_counter, + void *ctx) +{ + struct vmclock_state *st = ctx; + struct timespec64 tspec; + int ret; + + ret = vmclock_get_crosststamp(st, system_counter, &tspec); + if (!ret) + *device_time = timespec64_to_ktime(tspec); + + return ret; +} + + +static int ptp_vmclock_getcrosststamp(struct ptp_clock_info *ptp, + struct system_device_crosststamp *xtstamp) +{ + struct vmclock_state *st = container_of(ptp, struct vmclock_state, + ptp_clock_info); + + return get_device_system_crosststamp(ptp_vmclock_get_time_fn, st, + NULL, xtstamp); +} + +/* + * PTP clock operations + */ + +static int ptp_vmclock_adjfine(struct ptp_clock_info *ptp, long delta) +{ + return -EOPNOTSUPP; +} + +static int ptp_vmclock_adjtime(struct ptp_clock_info *ptp, s64 delta) +{ + return -EOPNOTSUPP; +} + +static int ptp_vmclock_settime(struct ptp_clock_info *ptp, + const struct timespec64 *ts) +{ + return -EOPNOTSUPP; +} + +static int ptp_vmclock_gettime(struct ptp_clock_info *ptp, struct timespec64 *ts) +{ + struct vmclock_state *st = container_of(ptp, struct vmclock_state, + ptp_clock_info); + + return vmclock_get_crosststamp(st, NULL, ts); +} + +static int ptp_vmclock_enable(struct ptp_clock_info *ptp, + struct ptp_clock_request *rq, int on) +{ + return -EOPNOTSUPP; +} + +static const struct ptp_clock_info ptp_vmclock_info = { + .owner = THIS_MODULE, + .max_adj = 0, + .n_ext_ts = 0, + .n_pins = 0, + .pps = 0, + .adjfine = ptp_vmclock_adjfine, + .adjtime = ptp_vmclock_adjtime, + .gettime64 = ptp_vmclock_gettime, + .settime64 = ptp_vmclock_settime, + .enable = ptp_vmclock_enable, + .getcrosststamp = ptp_vmclock_getcrosststamp, +}; + +static int vmclock_miscdev_mmap(struct file *fp, struct vm_area_struct *vma) +{ + struct vmclock_state *st = container_of(fp->private_data, + struct vmclock_state, miscdev); + + if ((vma->vm_flags & (VM_READ|VM_WRITE)) != VM_READ) + return -EROFS; + + if (vma->vm_end - vma->vm_start != PAGE_SIZE || vma->vm_pgoff) + return -EINVAL; + + if (io_remap_pfn_range(vma, vma->vm_start, + st->phys_addr >> PAGE_SHIFT, PAGE_SIZE, + vma->vm_page_prot)) + return -EAGAIN; + + return 0; +} + +static ssize_t vmclock_miscdev_read(struct file *fp, char __user *buf, + size_t count, loff_t *ppos) +{ + struct vmclock_state *st = container_of(fp->private_data, + struct vmclock_state, miscdev); + size_t max_count; + int32_t seq; + + if (*ppos >= PAGE_SIZE) + return 0; + + max_count = PAGE_SIZE - *ppos; + if (count > max_count) + count = max_count; + + do { + seq = st->clk->seq_count & ~1ULL; + virt_rmb(); + + if (copy_to_user(buf, ((char *)st->clk) + *ppos, count)) + return -EFAULT; + + virt_rmb(); + } while (seq != st->clk->seq_count); + + *ppos += count; + return count; +} + +static const struct file_operations vmclock_miscdev_fops = { + .mmap = vmclock_miscdev_mmap, + .read = vmclock_miscdev_read, +}; + +/* module operations */ + +static void vmclock_remove(struct platform_device *pdev) +{ + struct device *dev = &pdev->dev; + struct vmclock_state *st = dev_get_drvdata(dev); + + if (st->ptp_clock) + ptp_clock_unregister(st->ptp_clock); + + if (st->miscdev.minor == MISC_DYNAMIC_MINOR) + misc_deregister(&st->miscdev); +} + +static int vmclock_probe_acpi(struct device *dev, struct vmclock_state *st) +{ + struct acpi_buffer parsed = { ACPI_ALLOCATE_BUFFER }; + struct acpi_device *adev = ACPI_COMPANION(dev); + union acpi_object *obj; + acpi_status status; + + status = acpi_evaluate_object(adev->handle, "ADDR", NULL, &parsed); + if (ACPI_FAILURE(status)) { + ACPI_EXCEPTION((AE_INFO, status, "Evaluating ADDR")); + return -ENODEV; + } + obj = parsed.pointer; + if (!obj || obj->type != ACPI_TYPE_PACKAGE || obj->package.count != 2 || + obj->package.elements[0].type != ACPI_TYPE_INTEGER || + obj->package.elements[1].type != ACPI_TYPE_INTEGER) + return -EINVAL; + + st->phys_addr = (obj->package.elements[0].integer.value << 0) | + (obj->package.elements[1].integer.value << 32); + + return 0; +} + +static void vmclock_put_idx(void *data) +{ + struct vmclock_state *st = data; + + ida_free(&vmclock_ida, st->index); +} + +static int vmclock_probe(struct platform_device *pdev) +{ + struct device *dev = &pdev->dev; + struct vmclock_state *st; + int ret; + + st = devm_kzalloc(dev, sizeof (*st), GFP_KERNEL); + if (!st) + return -ENOMEM; + + if (has_acpi_companion(dev)) + ret = vmclock_probe_acpi(dev, st); + else + ret = -EINVAL; /* Only ACPI for now */ + + if (ret) { + dev_info(dev, "Failed to obtain physical address: %d\n", ret); + goto out; + } + + st->clk = devm_memremap(dev, st->phys_addr, sizeof(*st->clk), + MEMREMAP_WB); + if (IS_ERR(st->clk)) { + ret = PTR_ERR(st->clk); + dev_info(dev, "failed to map shared memory\n"); + st->clk = NULL; + goto out; + } + + if (st->clk->magic != VMCLOCK_MAGIC || + st->clk->size < sizeof(*st->clk) || + st->clk->version != 1) { + dev_info(dev, "vmclock magic fields invalid\n"); + ret = -EINVAL; + goto out; + } + + if (IS_ENABLED(CONFIG_ARM64) && + st->clk->counter_id == VMCLOCK_COUNTER_ARM_VCNT) { + /* Can we check it's the virtual counter? */ + st->cs_id = CSID_ARM_ARCH_COUNTER; + } else if (IS_ENABLED(CONFIG_X86) && + st->clk->counter_id == VMCLOCK_COUNTER_X86_TSC) { + st->cs_id = CSID_X86_TSC; + } + + ret = ida_alloc(&vmclock_ida, GFP_KERNEL); + if (ret < 0) + goto out; + + st->index = ret; + ret = devm_add_action_or_reset(&pdev->dev, vmclock_put_idx, st); + if (ret) + goto out; + + st->name = devm_kasprintf(&pdev->dev, GFP_KERNEL, "vmclock%d", st->index); + if (!st->name) { + ret = -ENOMEM; + goto out; + } + + /* If the structure is big enough, it can be mapped to userspace */ + if (st->clk->size >= PAGE_SIZE) { + st->miscdev.minor = MISC_DYNAMIC_MINOR; + st->miscdev.fops = &vmclock_miscdev_fops; + st->miscdev.name = st->name; + + ret = misc_register(&st->miscdev); + if (ret) + goto out; + } + + /* If there is valid clock information, register a PTP clock */ + if (st->cs_id) { + st->ptp_clock_info = ptp_vmclock_info; + strncpy(st->ptp_clock_info.name, st->name, sizeof(st->ptp_clock_info.name)); + st->ptp_clock = ptp_clock_register(&st->ptp_clock_info, dev); + + if (IS_ERR(st->ptp_clock)) { + ret = PTR_ERR(st->ptp_clock); + st->ptp_clock = NULL; + vmclock_remove(pdev); + goto out; + } + } + + dev_set_drvdata(dev, st); + + out: + return ret; +} + +static const struct acpi_device_id vmclock_acpi_ids[] = { + { "VMCLOCK", 0 }, + {} +}; +MODULE_DEVICE_TABLE(acpi, vmclock_acpi_ids); + +static struct platform_driver vmclock_platform_driver = { + .probe = vmclock_probe, + .remove_new = vmclock_remove, + .driver = { + .name = "vmclock", + .acpi_match_table = vmclock_acpi_ids, + }, +}; + +module_platform_driver(vmclock_platform_driver) + +MODULE_AUTHOR("David Woodhouse <dwmw2@infradead.org>"); +MODULE_DESCRIPTION("PTP clock using VMCLOCK"); +MODULE_LICENSE("GPL v2"); diff --git a/include/uapi/linux/vmclock.h b/include/uapi/linux/vmclock.h new file mode 100644 index 000000000000..cac24536c5c8 --- /dev/null +++ b/include/uapi/linux/vmclock.h @@ -0,0 +1,102 @@ +/* SPDX-License-Identifier: ((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause) */ + +/* + * This structure provides a vDSO-style clock to VM guests, exposing the + * relationship (or lack thereof) between the CPU clock (TSC, timebase, arch + * counter, etc.) and real time. It is designed to address the problem of + * live migration, which other clock enlightenments do not. + * + * When a guest is live migrated, this affects the clock in two ways. + * + * First, even between identical hosts the actual frequency of the underlying + * counter will change within the tolerances of its specification (typically + * ±50PPM, or 4 seconds a day). The frequency also varies over time on the + * same host, but can be tracked by NTP as it generally varies slowly. With + * live migration there is a step change in the frequency, with no warning. + * + * Second, there may be a step change in the value of the counter itself, as + * its accuracy is limited by the precision of the NTP synchronization on the + * source and destination hosts. + * + * So any calibration (NTP, PTP, etc.) which the guest has done on the source + * host before migration is invalid, and needs to be redone on the new host. + * + * In its most basic mode, this structure provides only an indication to the + * guest that live migration has occurred. This allows the guest to know that + * its clock is invalid and take remedial action. For applications that need + * reliable accurate timestamps (e.g. distributed databases), the structure + * can be mapped all the way to userspace. This allows the application to see + * directly for itself that the clock is disrupted and take appropriate + * action, even when using a vDSO-style method to get the time instead of a + * system call. + * + * In its more advanced mode. this structure can also be used to expose the + * precise relationship of the CPU counter to real time, as calibrated by the + * host. This means that userspace applications can have accurate time + * immediately after live migration, rather than having to pause operations + * and wait for NTP to recover. This mode does, of course, rely on the + * counter being reliable and consistent across CPUs. + */ + +#ifdef __KERNEL__ +#include <linux/types.h> +#else +#include <stdint.h> +#endif + +struct vmclock_abi { + uint32_t magic; +#define VMCLOCK_MAGIC 0x4b4c4356 /* "VCLK" */ + uint16_t size; /* Size of page containing this structure */ + uint16_t version; /* 1 */ + + /* Sequence lock. Low bit means an update is in progress. */ + uint64_t seq_count; + + /* + * This field changes to another non-repeating value when the CPU + * counter is disrupted, for example on live migration. + */ + uint64_t disruption_marker; + + uint8_t clock_status; +#define VMCLOCK_STATUS_UNKNOWN 0 +#define VMCLOCK_STATUS_INITIALIZING 1 +#define VMCLOCK_STATUS_SYNCHRONIZED 2 +#define VMCLOCK_STATUS_FREERUNNING 3 +#define VMCLOCK_STATUS_UNRELIABLE 4 + + uint8_t counter_id; +#define VMCLOCK_COUNTER_INVALID 0 +#define VMCLOCK_COUNTER_X86_TSC 1 +#define VMCLOCK_COUNTER_ARM_VCNT 2 + + uint8_t pad[3]; + + /* + * UTC on its own is non-monotonic and ambiguous. + * + * Inform the guest about the TAI offset, so that it can have an + * actual monotonic and reliable time reference if it needs it. + * + * Also indicate a nearby leap second, if one exists. Unlike in + * NTP, this may indicate a leap second in the past in order to + * indicate that it *has* been taken into account. + */ + int8_t leapsecond_direction; + int16_t tai_offset_sec; + uint64_t leapsecond_counter_value; + uint64_t leapsecond_tai_time; + + /* Paired values of counter and UTC at a given point in time. */ + uint64_t counter_value; + uint64_t utc_time_sec; + uint64_t utc_time_frac_sec; + + /* Counter frequency, and error margin. Units of (second >> 64) */ + uint64_t counter_period_frac_sec; + uint64_t counter_period_error_rate_frac_sec; + + /* Error margin of UTC reading above (± picoseconds) */ + uint64_t utc_time_maxerror_picosec; +};
Changing virtio-dev address to the new one. The discussion might also be relevant for virtio-comment, but it is discouraged to forward it to both. On 15.06.24 10:40, David Woodhouse wrote: > On Mon, 2023-12-18 at 08:38 +0100, Peter Hilber wrote: >> RFC v3 updates >> -------------- >> >> This series implements a driver for a virtio-rtc device conforming to spec >> RFC v3 [1]. It now includes an RTC class driver with alarm, in addition to >> the PTP clock driver already present before. >> >> This patch series depends on the patch series "treewide: Use clocksource id >> for get_device_system_crosststamp()" [3]. Pull [4] to get the combined >> series on top of mainline. >> >> Overview >> -------- >> >> This patch series adds the virtio_rtc module, and related bugfixes. The >> virtio_rtc module implements a driver compatible with the proposed Virtio >> RTC device specification [1]. The Virtio RTC (Real Time Clock) device >> provides information about current time. The device can provide different >> clocks, e.g. for the UTC or TAI time standards, or for physical time >> elapsed since some past epoch. The driver can read the clocks with simple >> or more accurate methods. Optionally, the driver can set an alarm. >> >> The series first fixes some bugs in the get_device_system_crosststamp() >> interpolation code, which is required for reliable virtio_rtc operation. >> Then, add the virtio_rtc implementation. >> >> For the Virtio RTC device, there is currently a proprietary implementation, >> which has been used for provisional testing. > > As discussed before, I don't think it makes sense to design a new high- > precision virtual clock which only gets it right *most* of the time. We > absolutely need to address the issue of live migration. > > When live migration occurs, the guest's time precision suffers in two > ways. > > First, even when migrating to a supposedly identical host the precise > rate of the underlying counter (TSC, arch counter, etc.) can change > within the tolerances (e.g. ±50PPM) of the hardware. Unlike the natural > changes that NTP normally manages to track, this is a *step* change, > potentially from -50PPM to +50PPM from one host to the next. > > Second, the accuracy of the counter as preserved across migration is > limited by the accuracy of each host's NTP synchronization. So there is > also a step change in the value of the counter itself. > > At the moment of migration, the guest's timekeeping should be > considered invalid. Any previous cross-timestamps are meaningless, and > blindly using the previously-known relationship between the counter and > real time can lead to problems such as corruption in distributed > databases, fines for mis-timestamped transactions, and other general > unhappiness. > > We obviously can't get a new timestamp from the virtio_rtc device every > time an application wants to know the time reliably. We don't even want > *system* calls for that, which is why we have it in a vDSO. > > We can take the same approach to warning the guest about clock > disruption due to live migration. A shared memory region from the > virtual clock device even be mapped all the way to userspace, for those > applications which need precise and *reliable* time to check a > 'disruption' marker in it, and do whatever is appropriate (e.g. abort > transactions and wait for time to resync) when it happens. > > We can do better than just letting the guest know about disruption, of > course. We can put the actual counter→realtime relationship into the > memory region too. As well as being able to provide a PTP driver with > this, the applications which care about *reliable* timestamps can mmap > the page directly and use it, vDSO-style, to have accurate timestamps > even from the first cycle after migration. > > When disruption is signalled, the guest needs to throw away any > *additional* refinement that it's done with NTP/PTP/PPS/etc. and revert > to knowing nothing more than what the hypervisor advertises here. > > Here's a first attempt at defining such a memory structure. For now > I've done it as a "vmclock" ACPI device based loosely on vmgenid, but I > think it makes most sense for this to be part of the virtio_rtc spec. > Ultimately it doesn't matter *how* the guest finds the memory region. This looks sensible to me. I also think it would be possible to adapt this for the virtio-rtc spec. The proposal also supports some other use cases which are not in the virtio-rtc RFC spec v4 [2], notably vDSO-style clock reading, and others such as indication of a past leap second. Compared to the virtio-rtc RFC spec v4 [2], this proposal does not seem to support leap second smearing. Also, it would be helpful to allow indicating when some of the fields are not valid (such as leapsecond_counter_value, leapsecond_tai_time, tai_offset_sec, utc_time_maxerror_picosec, ...). Do you have plans to contribute this to the Virtio specification and Linux driver implementation? I also added a few comments below. Thanks for sharing, Peter [2] https://lore.kernel.org/virtio-comment/20240522170003.52565-1-peter.hilber@opensynergy.com/ > > Very preliminary QEMU hacks at > https://git.infradead.org/users/dwmw2/qemu.git/shortlog/refs/heads/vmclock > (I still need to do the KVM side helper for actually filling in the > host clock information, converted to the *guest* TSC) > > This is the guest side. H aving heckled your assumption that we can use > the virt counter on Arm, I concede I'm doing the same thing for now. > The structure itself is at the end, or see > https://git.infradead.org/?p=users/dwmw2/linux.git;a=blob;f=include/uapi/linux/vmclock.h;hb=vmclock > > From 9e1c3b823d497efa4e0acb21b226a72e4d6e8a53 Mon Sep 17 00:00:00 2001 > From: David Woodhouse <dwmw@amazon.co.uk> > Date: Mon, 10 Jun 2024 15:10:11 +0100 > Subject: [PATCH] ptp: Add vDSO-style vmclock support > > The vmclock "device" provides a shared memory region with precision clock > information. By using shared memory, it is safe across Live Migration. > > Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> > --- > drivers/ptp/Kconfig | 13 ++ > drivers/ptp/Makefile | 1 + > drivers/ptp/ptp_vmclock.c | 404 +++++++++++++++++++++++++++++++++++ > include/uapi/linux/vmclock.h | 102 +++++++++ > 4 files changed, 520 insertions(+) > create mode 100644 drivers/ptp/ptp_vmclock.c > create mode 100644 include/uapi/linux/vmclock.h > [...] > + > +static const struct ptp_clock_info ptp_vmclock_info = { > + .owner = THIS_MODULE, > + .max_adj = 0, > + .n_ext_ts = 0, > + .n_pins = 0, > + .pps = 0, > + .adjfine = ptp_vmclock_adjfine, > + .adjtime = ptp_vmclock_adjtime, > + .gettime64 = ptp_vmclock_gettime, Should implement .gettimex64 instead. > + .settime64 = ptp_vmclock_settime, > + .enable = ptp_vmclock_enable, > + .getcrosststamp = ptp_vmclock_getcrosststamp, > +}; > + [...] > +/* > + * This structure provides a vDSO-style clock to VM guests, exposing the > + * relationship (or lack thereof) between the CPU clock (TSC, timebase, arch > + * counter, etc.) and real time. It is designed to address the problem of > + * live migration, which other clock enlightenments do not. > + * > + * When a guest is live migrated, this affects the clock in two ways. > + * > + * First, even between identical hosts the actual frequency of the underlying > + * counter will change within the tolerances of its specification (typically > + * ±50PPM, or 4 seconds a day). The frequency also varies over time on the > + * same host, but can be tracked by NTP as it generally varies slowly. With > + * live migration there is a step change in the frequency, with no warning. > + * > + * Second, there may be a step change in the value of the counter itself, as > + * its accuracy is limited by the precision of the NTP synchronization on the > + * source and destination hosts. > + * > + * So any calibration (NTP, PTP, etc.) which the guest has done on the source > + * host before migration is invalid, and needs to be redone on the new host. > + * > + * In its most basic mode, this structure provides only an indication to the > + * guest that live migration has occurred. This allows the guest to know that > + * its clock is invalid and take remedial action. For applications that need > + * reliable accurate timestamps (e.g. distributed databases), the structure > + * can be mapped all the way to userspace. This allows the application to see > + * directly for itself that the clock is disrupted and take appropriate > + * action, even when using a vDSO-style method to get the time instead of a > + * system call. > + * > + * In its more advanced mode. this structure can also be used to expose the > + * precise relationship of the CPU counter to real time, as calibrated by the > + * host. This means that userspace applications can have accurate time > + * immediately after live migration, rather than having to pause operations > + * and wait for NTP to recover. This mode does, of course, rely on the > + * counter being reliable and consistent across CPUs. > + */ > + > +#ifdef __KERNEL__ > +#include <linux/types.h> > +#else > +#include <stdint.h> > +#endif > + > +struct vmclock_abi { > + uint32_t magic; > +#define VMCLOCK_MAGIC 0x4b4c4356 /* "VCLK" */ > + uint16_t size; /* Size of page containing this structure */ > + uint16_t version; /* 1 */ > + > + /* Sequence lock. Low bit means an update is in progress. */ > + uint64_t seq_count; > + > + /* > + * This field changes to another non-repeating value when the CPU > + * counter is disrupted, for example on live migration. > + */ > + uint64_t disruption_marker; > + > + uint8_t clock_status; > +#define VMCLOCK_STATUS_UNKNOWN 0 > +#define VMCLOCK_STATUS_INITIALIZING 1 > +#define VMCLOCK_STATUS_SYNCHRONIZED 2 > +#define VMCLOCK_STATUS_FREERUNNING 3 > +#define VMCLOCK_STATUS_UNRELIABLE 4 > + > + uint8_t counter_id; > +#define VMCLOCK_COUNTER_INVALID 0 > +#define VMCLOCK_COUNTER_X86_TSC 1 > +#define VMCLOCK_COUNTER_ARM_VCNT 2 > + > + uint8_t pad[3]; > + > + /* > + * UTC on its own is non-monotonic and ambiguous. > + * > + * Inform the guest about the TAI offset, so that it can have an > + * actual monotonic and reliable time reference if it needs it. > + * > + * Also indicate a nearby leap second, if one exists. Unlike in > + * NTP, this may indicate a leap second in the past in order to > + * indicate that it *has* been taken into account. > + */ > + int8_t leapsecond_direction; > + int16_t tai_offset_sec; > + uint64_t leapsecond_counter_value; > + uint64_t leapsecond_tai_time; > + > + /* Paired values of counter and UTC at a given point in time. */ > + uint64_t counter_value; > + uint64_t utc_time_sec; > + uint64_t utc_time_frac_sec; > + > + /* Counter frequency, and error margin. Units of (second >> 64) */ > + uint64_t counter_period_frac_sec; AFAIU this might limit the precision in case of high counter frequencies. Could the unit be aligned to the expected frequency band of counters? > + uint64_t counter_period_error_rate_frac_sec; > + > + /* Error margin of UTC reading above (± picoseconds) */ > + uint64_t utc_time_maxerror_picosec; > +};
On Thu, 2024-06-20 at 14:37 +0200, Peter Hilber wrote: > Changing virtio-dev address to the new one. The discussion might also be > relevant for virtio-comment, but it is discouraged to forward it to both. I will happily take it to whichever forum you think is most appropriate. (And you have my permission to direct replies to whichever you choose.) > On 15.06.24 10:40, David Woodhouse wrote: > > As discussed before, I don't think it makes sense to design a new high- > > precision virtual clock which only gets it right *most* of the time. We > > absolutely need to address the issue of live migration. ... > > Here's a first attempt at defining such a memory structure. For now > > I've done it as a "vmclock" ACPI device based loosely on vmgenid, but I > > think it makes most sense for this to be part of the virtio_rtc spec. > > Ultimately it doesn't matter *how* the guest finds the memory region. > > This looks sensible to me. I also think it would be possible to adapt this for > the virtio-rtc spec. The proposal also supports some other use cases which are > not in the virtio-rtc RFC spec v4 [2], notably vDSO-style clock reading, and > others such as indication of a past leap second. Right. The vDSO-style clock reading is key to solving the live migration problem. The other key thing this adds is the error bounds, which some applications care deeply about. I've been working with the team that owns ClockBound on that part: https://github.com/aws/clock-bound > Compared to the virtio-rtc RFC spec v4 [2], this proposal does not seem to > support leap second smearing. That's kind of intentional. Leap second smearing should be considered the *antithesis* of precise time. Anyone who wants a monotonic realtime clock should be using the POSIX CLOCK_TAI. Part of my motivation for fixing the LM problem is because some financial institutions can incur significant penalties for putting inaccurate timestamps on transactions — even the disruption caused by live migration is enough to trigger that. So deliberately lying to them about what the UTC time is, by up to a second in either direction, is not necessarily in their best interest. As you noted, this proposal does expose leap seconds in the recent past, which is all that's needed to allow a guest to generate a smeared clock *from* the accurate clock that is provided through this mechanism. (Knowledge of past leap seconds is needed because in some modes, smearing adjustments continue for some hours *afte* the leap second should have occurred. So the NTP style of leap indicator isn't sufficient). > Also, it would be helpful to allow indicating > when some of the fields are not valid (such as leapsecond_counter_value, > leapsecond_tai_time, tai_offset_sec, utc_time_maxerror_picosec, ...). Right. For some of those the answer can just be 'zero means invalid', for the max error, perhaps MAX_UINT64. But we should definitely make that explicit. I'm also not entirely sure I understood Julien's insistence that we include the leapsecond_counter_value as *well* as the leapsecond_tai_time. It seems to me that the implementation would have to recalculate that every time the frequency is adjusted. For some of those fields, I was expecting a certain amount of bikeshedding to occur and figured it was better to post an early straw man and solicit feedback. > Do you have plans to contribute this to the Virtio specification and Linux > driver implementation? Yes, absolutely. For now I've implemented it in the Linux guest¹ and in QEMU² as an ACPI device modelled on vmgenid, but I'd love *not* to have to do that, and just to do it based on virtio instead. ¹ https://git.infradead.org/users/dwmw2/linux.git/shortlog/refs/heads/vmclock ² https://git.infradead.org/users/dwmw2/qemu.git/shortlog/refs/heads/vmclock > > +static const struct ptp_clock_info ptp_vmclock_info = { > > + .owner = THIS_MODULE, > > + .max_adj = 0, > > + .n_ext_ts = 0, > > + .n_pins = 0, > > + .pps = 0, > > + .adjfine = ptp_vmclock_adjfine, > > + .adjtime = ptp_vmclock_adjtime, > > + .gettime64 = ptp_vmclock_gettime, > > Should implement .gettimex64 instead. Ack, thanks. I'll go play with that. > > > + > > + /* Counter frequency, and error margin. Units of (second >> 64) */ > > + uint64_t counter_period_frac_sec; > > AFAIU this might limit the precision in case of high counter frequencies. > Could the unit be aligned to the expected frequency band of counters? This field indicates the period of a single tick, in units of 1>>64 of a second. That's about 5.4e-20 seconds, or 54 zeptoseconds? Can you walk me through a calculation where you believe that level of precision is insufficient? I guess the precision matters if the structure isn't updated for a long period of time, and the delta between the current counter and the snapshot is high? That's a *lot* of 54 zeptosecondses? But you really would need a *lot* of them before you care? And if nobody's been calibrating your counter for that long, surely you have bigger worries? Am I missing something there?
On Thu, 2024-06-20 at 17:19 +0100, David Woodhouse wrote: > > > > > > + > > > + /* Counter frequency, and error margin. Units of (second >> 64) */ > > > + uint64_t counter_period_frac_sec; > > > > AFAIU this might limit the precision in case of high counter frequencies. > > Could the unit be aligned to the expected frequency band of counters? > > This field indicates the period of a single tick, in units of 1>>64 of > a second. That's about 5.4e-20 seconds, or 54 zeptoseconds? > > Can you walk me through a calculation where you believe that level of > precision is insufficient? > > I guess the precision matters if the structure isn't updated for a long > period of time, and the delta between the current counter and the > snapshot is high? That's a *lot* of 54 zeptosecondses? But you really > would need a *lot* of them before you care? And if nobody's been > calibrating your counter for that long, surely you have bigger worries? > > Am I missing something there? Hm, that was a bit rushed at the end of the day; let's take a better look... Let's take a hypothetical example of a 100GHz counter. That's two orders of magnitude more than today's Arm arch counter. The period of such a counter would be 10 picoseconds. (Let's ignore the question of how far light actually travels in that time and how *realistic* that example is, for the moment.) It turns out that at that rate, there *are* a lot of 54 zeptosecondses of precision loss in the day. It could be half a millisecond a day, or 20µs an hour. That particular example of 10 picoseconds is 184467440.7370955 (seconds>>64) which could be truncated to 184467440 — losing about 4PPB (a third of a millisecond a day; 14µs an hour). So yeah, I suppose a 'shift' field could make sense. It's easy enough to consume on the guest side as it doesn't really perturb the 128-bit multiplication very much; especially if we don't let it be negative. And implementations *can* just set it to zero. It hurts nobody. Or were you thinking of just using a fixed shift like (seconds>>80) instead?
On Thu, 2024-06-20 at 14:37 +0200, Peter Hilber wrote:
> Should implement .gettimex64 instead.
Thanks. This look sane?
As noted in the code comment, in the *ideal* case we just build all
three pre/post/device timestamps from the very same counter read. So
sts->pre_ts == sts->post_ts.
In the less ideal case (which will happen on x86 when kvmclock is being
used for the system time), we use the time from ktime_get_snapshot() as
the pre_ts and take a new snapshot immediately after the get_cycles().
diff --git a/drivers/ptp/ptp_vmclock.c b/drivers/ptp/ptp_vmclock.c
index e8c65405a8f3..07a81a94d29a 100644
--- a/drivers/ptp/ptp_vmclock.c
+++ b/drivers/ptp/ptp_vmclock.c
@@ -96,9 +96,11 @@ static inline uint64_t mul_u64_u64_add_u64(uint64_t *res_hi, uint64_t delta,
}
static int vmclock_get_crosststamp(struct vmclock_state *st,
+ struct ptp_system_timestamp *sts,
struct system_counterval_t *system_counter,
struct timespec64 *tspec)
{
+ struct system_time_snapshot systime_snapshot;
uint64_t cycle, delta, seq, frac_sec;
int ret = 0;
@@ -119,7 +121,17 @@ static int vmclock_get_crosststamp(struct vmclock_state *st,
continue;
}
- cycle = get_cycles();
+ if (sts) {
+ ktime_get_snapshot(&systime_snapshot);
+
+ if (systime_snapshot.cs_id == st->cs_id) {
+ cycle = systime_snapshot.cycles;
+ } else {
+ cycle = get_cycles();
+ ptp_read_system_postts(sts);
+ }
+ } else
+ cycle = get_cycles();
delta = cycle - st->clk->counter_value;
@@ -139,6 +151,21 @@ static int vmclock_get_crosststamp(struct vmclock_state *st,
if (ret)
return ret;
+ /*
+ * When invoked for gettimex64, fill in the pre/post system times.
+ * The ideal case is when system time is based on the the same
+ * counter as st->cs_id, in which case all three pre/post/device
+ * times are derived from the *same* counter value. If cs_id does
+ * not match, then the value from ktime_get_snapshot() is used as
+ * pre_ts, and ptp_read_system_postts() was already called above
+ * for the post_ts. Those are either side of the get_cycles() call.
+ */
+ if (sts) {
+ sts->pre_ts = ktime_to_timespec64(systime_snapshot.real);
+ if (systime_snapshot.cs_id == st->cs_id)
+ sts->post_ts = sts->pre_ts;
+ }
+
if (system_counter) {
system_counter->cycles = cycle;
system_counter->cs_id = st->cs_id;
@@ -155,7 +182,7 @@ static int ptp_vmclock_get_time_fn(ktime_t *device_time,
struct timespec64 tspec;
int ret;
- ret = vmclock_get_crosststamp(st, system_counter, &tspec);
+ ret = vmclock_get_crosststamp(st, NULL, system_counter, &tspec);
if (!ret)
*device_time = timespec64_to_ktime(tspec);
@@ -198,7 +225,16 @@ static int ptp_vmclock_gettime(struct ptp_clock_info *ptp, struct timespec64 *ts
struct vmclock_state *st = container_of(ptp, struct vmclock_state,
ptp_clock_info);
- return vmclock_get_crosststamp(st, NULL, ts);
+ return vmclock_get_crosststamp(st, NULL, NULL, ts);
+}
+
+static int ptp_vmclock_gettimex(struct ptp_clock_info *ptp, struct timespec64 *ts,
+ struct ptp_system_timestamp *sts)
+{
+ struct vmclock_state *st = container_of(ptp, struct vmclock_state,
+ ptp_clock_info);
+
+ return vmclock_get_crosststamp(st, sts, NULL, ts);
}
static int ptp_vmclock_enable(struct ptp_clock_info *ptp,
@@ -216,6 +252,7 @@ static const struct ptp_clock_info ptp_vmclock_info = {
.adjfine = ptp_vmclock_adjfine,
.adjtime = ptp_vmclock_adjtime,
.gettime64 = ptp_vmclock_gettime,
+ .gettimex64 = ptp_vmclock_gettimex,
.settime64 = ptp_vmclock_settime,
.enable = ptp_vmclock_enable,
.getcrosststamp = ptp_vmclock_getcrosststamp,
On 21.06.24 10:45, David Woodhouse wrote: > On Thu, 2024-06-20 at 17:19 +0100, David Woodhouse wrote: >> >>> >>>> + >>>> + /* Counter frequency, and error margin. Units of (second >> 64) */ >>>> + uint64_t counter_period_frac_sec; >>> >>> AFAIU this might limit the precision in case of high counter frequencies. >>> Could the unit be aligned to the expected frequency band of counters? >> >> This field indicates the period of a single tick, in units of 1>>64 of >> a second. That's about 5.4e-20 seconds, or 54 zeptoseconds? >> >> Can you walk me through a calculation where you believe that level of >> precision is insufficient? >> >> I guess the precision matters if the structure isn't updated for a long >> period of time, and the delta between the current counter and the >> snapshot is high? That's a *lot* of 54 zeptosecondses? But you really >> would need a *lot* of them before you care? And if nobody's been >> calibrating your counter for that long, surely you have bigger worries? >> >> Am I missing something there? > > Hm, that was a bit rushed at the end of the day; let's take a better look... > > Let's take a hypothetical example of a 100GHz counter. That's two > orders of magnitude more than today's Arm arch counter. > > The period of such a counter would be 10 picoseconds. > > (Let's ignore the question of how far light actually travels in that > time and how *realistic* that example is, for the moment.) > > It turns out that at that rate, there *are* a lot of 54 zeptosecondses > of precision loss in the day. It could be half a millisecond a day, or > 20µs an hour. > > That particular example of 10 picoseconds is 184467440.7370955 > (seconds>>64) which could be truncated to 184467440 — losing about 4PPB > (a third of a millisecond a day; 14µs an hour). > > So yeah, I suppose a 'shift' field could make sense. It's easy enough > to consume on the guest side as it doesn't really perturb the 128-bit > multiplication very much; especially if we don't let it be negative. > > And implementations *can* just set it to zero. It hurts nobody. > > Or were you thinking of just using a fixed shift like (seconds>>80) > instead? The 'shift' field should be fine.