Message ID: 20220316045308.2313184-1-oupton@google.com
State: New, archived
Series: Documentation: KVM: Describe guest TSC scaling in migration algorithm
On 3/16/22 05:53, Oliver Upton wrote:
> The VMM has control of both the guest's TSC scale and offset. Extend the
> described migration algorithm in the KVM_VCPU_TSC_OFFSET documentation
> to cover TSC scaling.
>
> Reported-by: David Woodhouse <dwmw@amazon.co.uk>
> Signed-off-by: Oliver Upton <oupton@google.com>
> ---
>
> Applies to kvm/queue (references KVM_{GET,SET}_TSC_KHZ on a VM fd).

A few more things that have to be changed:

> 1. Invoke the KVM_GET_CLOCK ioctl to record the host TSC (tsc_src),
>    kvmclock nanoseconds (guest_src), and host CLOCK_REALTIME nanoseconds
>    (host_src).

One of two changes:

a) Add "Multiply tsc_src by guest_freq / src_freq to obtain scaled_tsc_src",
   add a new device attribute for the host TSC frequency.

b) Add "Multiply tsc_src by src_ratio to obtain scaled_tsc_src", add a new
   device attribute for the guest_frequency/host_frequency ratio.

A third would be scaling the host TSC frequency in KVM_GET_CLOCK, but
that's confusing IMO.

> 3. Invoke the KVM_GET_TSC_KHZ ioctl to record the frequency of the
>    guest's TSC (freq).

Replace freq with guest_freq.

> 6. Adjust the guest TSC offsets for every vCPU to account for (1) time
>    elapsed since recording state and (2) difference in TSCs between the
>    source and destination machine:
>
>    ofs_dst[i] = ofs_src[i] -
>        (guest_src - guest_dest) * freq +
>        (tsc_src - tsc_dest)

Replace freq with guest_freq. Replace tsc_src with scaled_tsc_src; replace
tsc_dest with tsc_dest * guest_freq / dest_freq.

> ("ofs[i] + tsc - guest * freq" is the guest TSC value corresponding to

Replace with "ofs[i] + tsc * guest_freq / host_freq - guest * guest_freq",
or "ofs[i] + tsc * ratio - guest * guest_freq".

Paolo
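Paolo's corrections to step 6 can be collected into a small arithmetic sketch (Python, with variable names mirroring the documentation; an illustration of the scaled-offset math, not KVM code):

```python
def scaled_dest_offset(ofs_src, guest_src, guest_dest,
                       tsc_src, tsc_dest,
                       guest_freq, src_freq, dest_freq):
    """Destination TSC offset per the corrected step 6.

    TSC values are in host ticks, kvmclock values in seconds, and
    frequencies in Hz; the units only need to be mutually consistent.
    """
    # Option (a): scale both recorded host TSCs into guest-frequency ticks.
    scaled_tsc_src = tsc_src * guest_freq / src_freq
    scaled_tsc_dest = tsc_dest * guest_freq / dest_freq
    # ofs_dst[i] = ofs_src[i] - (guest_src - guest_dest) * guest_freq
    #                         + (scaled_tsc_src - scaled_tsc_dest)
    return (ofs_src
            - (guest_src - guest_dest) * guest_freq
            + (scaled_tsc_src - scaled_tsc_dest))
```

When source and destination run at the same frequency the scaling terms cancel and the expression degenerates to the original unscaled formula.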
Hi Paolo,

On Wed, Mar 16, 2022 at 08:47:59AM +0100, Paolo Bonzini wrote:
> On 3/16/22 05:53, Oliver Upton wrote:
> > The VMM has control of both the guest's TSC scale and offset. Extend the
> > described migration algorithm in the KVM_VCPU_TSC_OFFSET documentation
> > to cover TSC scaling.
> >
> > Reported-by: David Woodhouse <dwmw@amazon.co.uk>
> > Signed-off-by: Oliver Upton <oupton@google.com>
> > ---
> >
> > Applies to kvm/queue (references KVM_{GET,SET}_TSC_KHZ on a VM fd).
>
> A few more things that have to be changed:
>
> > 1. Invoke the KVM_GET_CLOCK ioctl to record the host TSC (tsc_src),
> >    kvmclock nanoseconds (guest_src), and host CLOCK_REALTIME nanoseconds
> >    (host_src).
>
> One of two changes:
>
> a) Add "Multiply tsc_src by guest_freq / src_freq to obtain scaled_tsc_src",
>    add a new device attribute for the host TSC frequency.
>
> b) Add "Multiply tsc_src by src_ratio to obtain scaled_tsc_src", add a new
>    device attribute for the guest_frequency/host_frequency ratio.
>
> A third would be scaling the host TSC frequency in KVM_GET_CLOCK, but
> that's confusing IMO.

Agreed -- I think kvmclock should remain as is.

A fourth would be to expose the host's TSC frequency outside of KVM since
we're really in the business of guest features, not host ones :-) We
already have a patch that does this internally, and it's visible in some
of our open source userspace libraries [1].

That said, I do not have any immediate reservations about adding an
attribute as the easy way out.

Ack on the rest of the feedback, I'll touch up the documentation further
once we figure out TSC frequency exposure.

[1]: https://github.com/abseil/abseil-cpp/blob/c33f21f86a14216336b87d48e9b552a13985b2bc/absl/base/internal/sysinfo.cc#L311

--
Thanks,
Oliver
On 3/18/22 19:39, Oliver Upton wrote:
> Hi Paolo,
>
> On Wed, Mar 16, 2022 at 08:47:59AM +0100, Paolo Bonzini wrote:
>> On 3/16/22 05:53, Oliver Upton wrote:
>>> The VMM has control of both the guest's TSC scale and offset. Extend the
>>> described migration algorithm in the KVM_VCPU_TSC_OFFSET documentation
>>> to cover TSC scaling.
>>>
>>> Reported-by: David Woodhouse <dwmw@amazon.co.uk>
>>> Signed-off-by: Oliver Upton <oupton@google.com>
>>> ---
>>>
>>> Applies to kvm/queue (references KVM_{GET,SET}_TSC_KHZ on a VM fd).
>>
>> A few more things that have to be changed:
>>
>>> 1. Invoke the KVM_GET_CLOCK ioctl to record the host TSC (tsc_src),
>>>    kvmclock nanoseconds (guest_src), and host CLOCK_REALTIME nanoseconds
>>>    (host_src).
>>
>> One of two changes:
>>
>> a) Add "Multiply tsc_src by guest_freq / src_freq to obtain scaled_tsc_src",
>>    add a new device attribute for the host TSC frequency.
>>
>> b) Add "Multiply tsc_src by src_ratio to obtain scaled_tsc_src", add a new
>>    device attribute for the guest_frequency/host_frequency ratio.
>>
>> A third would be scaling the host TSC frequency in KVM_GET_CLOCK, but
>> that's confusing IMO.
>
> Agreed -- I think kvmclock should remain as is.
>
> A fourth would be to expose the host's TSC frequency outside of KVM
> since we're really in the business of guest features, not host ones :-)
> We already have a patch that does this internally, and it's visible in
> some of our open source userspace libraries [1].

Yeah, it was a bit of a cop out on my part, but adding it to sysfs would
be nice. Please tell me if any of you is going to send a patch, or even
just share it so that I can handle the upstream submission.

Paolo

> That said, I do not have any immediate reservations about adding an
> attribute as the easy way out.
>
> Ack on the rest of the feedback, I'll touch up the documentation further
> once we figure out TSC frequency exposure.
>
> [1]: https://github.com/abseil/abseil-cpp/blob/c33f21f86a14216336b87d48e9b552a13985b2bc/absl/base/internal/sysinfo.cc#L311
>
> --
> Thanks,
> Oliver
On Sat, Mar 19, 2022 at 08:52:56AM +0100, Paolo Bonzini wrote: > On 3/18/22 19:39, Oliver Upton wrote: > > Hi Paolo, > > > > On Wed, Mar 16, 2022 at 08:47:59AM +0100, Paolo Bonzini wrote: > > > On 3/16/22 05:53, Oliver Upton wrote: > > > > The VMM has control of both the guest's TSC scale and offset. Extend the > > > > described migration algorithm in the KVM_VCPU_TSC_OFFSET documentation > > > > to cover TSC scaling. > > > > > > > > Reported-by: David Woodhouse<dwmw@amazon.co.uk> > > > > Signed-off-by: Oliver Upton<oupton@google.com> > > > > --- > > > > > > > > Applies to kvm/queue (references KVM_{GET,SET}_TSC_KHZ on a VM fd). > > > > > > A few more things that have to be changed: > > > > > > > 1. Invoke the KVM_GET_CLOCK ioctl to record the host TSC (tsc_src), > > > > kvmclock nanoseconds (guest_src), and host CLOCK_REALTIME nanoseconds > > > > (host_src). > > > > > > > > > > One of two changes: > > > > > > a) Add "Multiply tsc_src by guest_freq / src_freq to obtain scaled_tsc_src", > > > add a new device attribute for the host TSC frequency. > > > > > > b) Add "Multiply tsc_src by src_ratio to obtain scaled_tsc_src", add a new > > > device attribute for the guest_frequency/host_frequency ratio. > > > > > > A third would be scaling the host TSC frequency in KVM_GETCLOCK, but that's > > > confusing IMO. > > > > Agreed -- I think kvmclock should remain as is. > > > > A fourth would be to expose the host's TSC frequency outside of KVM > > since we're really in the business of guest features, not host ones :-) > > We already have a patch that does this internally, and its visible in > > some of our open source userspace libraries [1]. > > Yeah, it was a bit of a cop out on my part but adding it to sysfs would be > nice. Please tell me if any of you going to send a patch, or even just > share it so that I can handle the upstream submission. I'll have some time to look at it next week and send upstream. 
I was somewhat fishing to see if there were any other ideas here, but
sysfs does seem sensible :)

--
Thanks,
Oliver
On Sat, 2022-03-19 at 07:59 +0000, Oliver Upton wrote: > I'll have some time to look at it next week and send upstream. I was > somewhat panning if there were any other ideas here, but sysfs does seem > sensible :) I can't say this fills me with joy. If a basic API requires this much documentation, my instinct is to *fix* it with fire first, then document what's left. A userspace-friendly API for migration would be more like KVM on the source host giving me { TIME_OF_DAY, TSC } and then all I have to do on the destination host (after providing the TSC frequency) is give it precisely the same data. We depend on source and destination clocks being kept in sync with NTP so that the TIME_OF_DAY is meaningful, but that's fairly fundamental; that was *always* going to be our reference¹, right? For migration between hosts, the offset-based API doesn't seem to be much of an improvement over what we had before.
On 3/19/22 09:08, David Woodhouse wrote: > If a basic API requires this much documentation, my instinct is to > *fix* it with fire first, then document what's left. I agree, but you're missing all the improvements that went in together with the offset API in order to enable the ugly algorithm. > A userspace-friendly API for migration would be more like KVM on the > source host giving me { TIME_OF_DAY, TSC } and then all I have to do on > the destination host (after providing the TSC frequency) is give it > precisely the same data. Again I agree but it would have to be {hostTimeOfDay, hostTSC, hostTSCFrequency, guestTimeOfDay}. > For migration between hosts, the offset-based API doesn't seem to be > much of an improvement over what we had before. The improvement is that KVM now provides a consistent {hostTimeOfDay, hostTSC, guestTimeOfDay} triple, so the "get" part is already okay. I agree that the "set" leaves a substantial amount of work to the guest, because *using* that information is left to userspace rather than being done in KVM. But at least it is correct. So we only lack a way to retrieve hostTSCFrequency to have an ugly but fully correct source-side API. Then we can add a new destination-side ioctl (possibly an extension of KVM_SET_CLOCK?) that takes a {sourceTimeOfDay, sourceTSC, sourceTSCFrequency, guestTimeOfDay} tuple and relieves userspace from having to do the math. Paolo
On 3/19/22 12:54, Paolo Bonzini wrote: > On 3/19/22 09:08, David Woodhouse wrote: >> If a basic API requires this much documentation, my instinct is to >> *fix* it with fire first, then document what's left. > I agree, but you're missing all the improvements that went in together > with the offset API in order to enable the ugly algorithm. > >> A userspace-friendly API for migration would be more like KVM on the >> source host giving me { TIME_OF_DAY, TSC } and then all I have to do on >> the destination host (after providing the TSC frequency) is give it >> precisely the same data. > > Again I agree but it would have to be {hostTimeOfDay, hostTSC, > hostTSCFrequency, guestTimeOfDay}. Ah, I guess you meant {hostTimeOfDay, hostTSC} _plus_ the constant {guestTSCScale, guestTSCOffset, guestTimeOfDayOffset}. That would work, and in that case it wouldn't even be KVM returning that host information. In fact {hostTimeOfDay, hostTSC, hostTSCFrequency, guestTimeOfDay} is not enough, you also need the guestTSCFrequency and guestTSC (or equivalently the scale/offset pair). Paolo
> On 3/19/22 12:54, Paolo Bonzini wrote: >> On 3/19/22 09:08, David Woodhouse wrote: >>> If a basic API requires this much documentation, my instinct is to >>> *fix* it with fire first, then document what's left. >> I agree, but you're missing all the improvements that went in together >> with the offset API in order to enable the ugly algorithm. >> >>> A userspace-friendly API for migration would be more like KVM on the >>> source host giving me { TIME_OF_DAY, TSC } and then all I have to do on >>> the destination host (after providing the TSC frequency) is give it >>> precisely the same data. >> >> Again I agree but it would have to be {hostTimeOfDay, hostTSC, >> hostTSCFrequency, guestTimeOfDay}. > > Ah, I guess you meant {hostTimeOfDay, hostTSC} _plus_ the constant > {guestTSCScale, guestTSCOffset, guestTimeOfDayOffset}. That would work, > and in that case it wouldn't even be KVM returning that host information. > > In fact {hostTimeOfDay, hostTSC, hostTSCFrequency, guestTimeOfDay} is > not enough, you also need the guestTSCFrequency and guestTSC (or > equivalently the scale/offset pair). I would have said nobody cares about the host TSC value and frequency. That is for KVM to know and deal with internally. For guest migration it should be as simple as "guest TSC frequency is <F>, and the TSC value was <X> at (wallclock time <T>|KVM_CLOCK time <T>). Not sure I have an opinion on whether the objective time reference is the timeofday clock or the KVM clock.
On 3/19/22 14:13, David Woodhouse wrote:
>> On 3/19/22 12:54, Paolo Bonzini wrote:
>>> On 3/19/22 09:08, David Woodhouse wrote:
>>>> If a basic API requires this much documentation, my instinct is to
>>>> *fix* it with fire first, then document what's left.
>>> I agree, but you're missing all the improvements that went in together
>>> with the offset API in order to enable the ugly algorithm.
>>>
>>>> A userspace-friendly API for migration would be more like KVM on the
>>>> source host giving me { TIME_OF_DAY, TSC } and then all I have to do on
>>>> the destination host (after providing the TSC frequency) is give it
>>>> precisely the same data.
>>
>> I guess you meant {hostTimeOfDay, hostTSC} _plus_ the constant
>> {guestTSCScale, guestTSCOffset, guestTimeOfDayOffset}. That would work,
>> and in that case it wouldn't even be KVM returning that host information.
>
> I would have said nobody cares about the host TSC value and frequency.
> That is for KVM to know and deal with internally.

There are two schools as to how to do migration. The QEMU school is to
just load back the guest TOD and TSC and let NTP resync. They had better
be synced, but a difference of a few microseconds might not matter.

This has the advantage of not showing the guest that there was a pause.
QEMU is doing it this way due to not having had postcopy live migration
for a long time; precopy is subject to longer brownout between source and
destination, which might result in soft lockups. Apart from this it
really has only disadvantages.

The Google school has the destination come up with a guest TOD and TSC
that take into account the length of the brownout phase. This is where
the algorithm in Documentation/ comes into play, and why you need the
host pair as well. Actually Google does not use it because they already
have precise time available to userspace as part of Spanner.
Maybe so does Amazon (?), but for the rest of the world the host {TOD, TSC} pair is required to compute what the guest TSC "should look like" on the destination. Paolo > For guest migration it should be as simple as "guest TSC frequency is <F>, > and the TSC value was <X> at (wallclock time <T>|KVM_CLOCK time <T>). > > Not sure I have an opinion on whether the objective time reference is the > timeofday clock or the KVM clock. > >
On Sun, Mar 20, 2022 at 09:10:15AM +0100, Paolo Bonzini wrote: > On 3/19/22 14:13, David Woodhouse wrote: > > > > > > > On 3/19/22 12:54, Paolo Bonzini wrote: > > > > On 3/19/22 09:08, David Woodhouse wrote: > > > > > If a basic API requires this much documentation, my instinct is to > > > > > *fix* it with fire first, then document what's left. > > > > I agree, but you're missing all the improvements that went in together > > > > with the offset API in order to enable the ugly algorithm. > > > > > > > > > A userspace-friendly API for migration would be more like KVM on the > > > > > source host giving me { TIME_OF_DAY, TSC } and then all I have to do on > > > > > the destination host (after providing the TSC frequency) is give it > > > > > precisely the same data. > > > > > > I guess you meant {hostTimeOfDay, hostTSC} _plus_ the constant > > > {guestTSCScale, guestTSCOffset, guestTimeOfDayOffset}. That would work, > > > and in that case it wouldn't even be KVM returning that host information. > > > > I would have said nobody cares about the host TSC value and frequency. > > That is for KVM to know and deal with internally. > > There are two schools as to how to do migration. The QEMU school is to just > load back the guest TOD and TSC and let NTP resync. They had better be > synced, but a difference of a few microseconds might not matter. > > This has the advantage of not showing the guest that there was a pause. > QEMU is doing it this way due to not having postcopy live migration for a > long time; precopy is subject to longer brownout between source and > destination, which might result in soft lockups. Apart from this it really > has only disadvantage. > > The Google school has the destination come up with the guest TOD and TSC > that takes into account the length of the brownout phase. This is where the > algorithm in Documentation/ comes into play, and why you need the host pair > as well. 
> Actually Google does not use it because they already have precise
> time available to userspace as part of Spanner. Maybe so does Amazon (?),
> but for the rest of the world the host {TOD, TSC} pair is required to
> compute what the guest TSC "should look like" on the destination.

Hey, beat me to the punch :)

Paolo is pretty much spot on, but there are a few additional details here
that I believe are relevant.

I really don't think we want to effectively step the guest's monotonic
clock if at all possible. It hurts when you do this for large windows,
and leads to soft lockups as you've noted above. Nonetheless, it's a
kludgy way to advance the guest's realtime clock without informing it
that it is about to experience time travel.

Given all of this, there is a limit to how much we advance the TSC in the
Google school. If this limit is exceeded we refuse to step the TSC
further and inform the guest it has experienced time travel [1]. It is an
attempt to bridge the gap and avoid completely laying waste to guest
clocks while hiding the migration if we're confident it was smooth
enough. Beyond that, guest userspace wants to be apprised of time travel
as well (TFD_TIMER_CANCEL_ON_SET). Having the guest clean up a messy
migration ensures that this all 'just works'.

The offset interface completely punts the decision around guest clocks to
userspace. We (KVM) have absolutely no idea what userspace is about to do
with the guest. The guest could be paused for 5 seconds or 5 years.
Encouraging host userspace to just read/write a { TOD, TSC } pair and let
KVM do the heavy lifting could completely wreck the guest's monotonic
clock.

Additionally, it is impossible for userspace to enforce policy/limits on
how much to time travel a guest with a value-based interface. Any event
could sneak in between the time userspace checks the value and KVM sets
the L1 offset.
Offsets are idempotent and will still uphold userspace's intentions even
if an inordinate amount of time elapses until KVM processes it.

Apologies for grandstanding, but clocks have been a real source of pain
during migration. I do agree that the documented algorithm is a mess at
the moment, given that there's no good way for userspace to transform
host_tsc -> guest_tsc. Poking the host TSC frequency out in sysfs is nice
to have, but probably not ABI to hang this whole thing off of.

What do you folks think about having a new R/O vCPU attribute that
returns a { TOD, guest_tsc } pair? I believe that would immediately
satisfy the needs of upstream to implement clock-advancing live
migration.

[1]: https://github.com/GoogleCloudPlatform/guest-agent/blob/main/google_guest_agent/clock.go

--
Thanks,
Oliver
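The clamping behavior described above can be sketched as follows (a minimal illustration; the 5-second limit, the function names, and the notification path are all hypothetical, not the actual guest-agent logic):

```python
MAX_ADVANCE_NS = 5 * 10**9  # illustrative policy limit: 5 s of time travel

def plan_tsc_advance(brownout_ns, guest_tsc_freq_hz):
    """Decide whether to step the guest TSC over a migration brownout.

    Short brownouts are hidden by advancing the TSC; long ones leave the
    monotonic clock alone and notify the guest that it time-traveled.
    """
    if brownout_ns <= MAX_ADVANCE_NS:
        ticks = brownout_ns * guest_tsc_freq_hz // 10**9
        return ("step", ticks)
    # Too long: refuse to step further; the guest must resync its
    # realtime clock itself (e.g. via an in-guest agent and NTP).
    return ("notify", 0)
```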
> The offset interface completely punts the decision around guest clocks > to userspace. We (KVM) have absolutely no idea what userspace is about > to do with the guest. The guest could be paused for 5 seconds or 5 > years. Encouraging host userspace to just read/write a { TOD, TSC } pair > and let KVM do the heavy lifting could completely wreck the guest's > monotonic clock. > > Additionally, it is impossible for userspace to enforce policy/limits on > how much to time travel a guest with a value-based interface. Any event > could sneak in between the time userspace checks the value and KVM sets > the L1 offset. Offsets are idempotent and will still uphold userspace's > intentions even if an inordinate amount of time elapses until KVM > processes it. Thanks for the detailed explanation. One part which confuses me here... Why can't userspace impose those same limits using a (TOD, value) tuple? Userspace can still look at that TOD from before the brownout period started, and declare that is too far in the past. If the event happens *after* userspace has decided that the migration was quick enough, but before the vCPUs are actually running again, even the offset based interface doesn't protect against that. > Apologies for grandstanding, but clocks has been a real source of pain > during migration. I do agree that the documented algorithm is a mess at > the moment, given that there's no good way for userspace to transform > host_tsc -> guest_tsc. Poking the host TSC frequency out in sysfs is > nice to have, but probably not ABI to hang this whole thing off of. > > What do you folks think about having a new R/O vCPU attribute that > returns a { TOD, guest_tsc } pair? I believe that would immediately > satisfy the needs of upstream to implement clock-advancing live > migration. Hm, I need to do some more thinking here. 
I poked at this because for TSC scaling even before we think about clock jumps it was just utterly hosed — userspace naively just creates a bunch of vCPUs and sets their TSC frequency + value, and they all end up with unsynced TSC values. But coincidentally since then I have started having conversations with people who really want the guest to have an immediate knowledge of the adjtimex maxerror etc. on the new host immediately after the migration. Maybe the "if the migration isn't fast enough then let the guest know it's now unsynced" is OK, but I'll need to work out what "immediately" means when we have a guest userspace component involved in it. I'll need to do that with a real screen and keyboard though, and fingers that aren't freezing as I sit by a 9-year-old's hockey training...
On 3/20/22 09:52, Oliver Upton wrote: > What do you folks think about having a new R/O vCPU attribute that > returns a { TOD, guest_tsc } pair? I believe that would immediately > satisfy the needs of upstream to implement clock-advancing live > migration. I don't think this adds much. The missing link is on the destination side, not the source side. To recap, the data that needs to be migrated from source to destination is the hostTSC+hostTOD pairing (returned by KVM_GET_CLOCK) plus one of each of the following: * either guestTSCOffset or a guestTSC synced with the hostTSC * either guestTODOffset or a guestTOD synced with the hostTOD. * either guestTSCScale or hostTSCFreq Right now we have guestTSCOffset as a vCPU attribute, we have guestTOD returned by KVM_GET_CLOCK, and we plan to have hostTSCFreq in sysfs. It's a bit mix-and-match, but it's already a 5-tuple that the destination can use. What's missing is a ioctl on the destination side that relieves userspace from having to do the math. Adding a nicer API just means choosing a different 5-tuple that the source side sends over to the destination, but it isn't particularly useful if userspace still has to do the math. Paolo
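The math that such a destination-side ioctl would perform with the 5-tuple might look like this (hypothetical names and units; a sketch of the "advance by elapsed realtime" school, not a KVM interface):

```python
def dest_tsc_offset(src_host_tod, src_host_tsc, guest_tsc_offset,
                    src_host_freq, guest_freq,
                    dst_host_tod, dst_host_tsc, dst_host_freq):
    """Compute the destination guest TSC offset from the migrated tuple.

    TODs are in seconds, TSCs in host ticks, frequencies in Hz
    (illustrative; KVM itself works in kHz and nanoseconds).
    """
    # Guest TSC at the instant KVM_GET_CLOCK was called on the source.
    guest_tsc = guest_tsc_offset + src_host_tsc * guest_freq / src_host_freq
    # Advance it across the brownout using elapsed CLOCK_REALTIME.
    guest_tsc += (dst_host_tod - src_host_tod) * guest_freq
    # Offset such that the guest reads guest_tsc when the destination
    # host TSC (scaled to guest frequency) is added back on.
    return guest_tsc - dst_host_tsc * guest_freq / dst_host_freq
```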
On Sun, Mar 20, 2022 at 09:46:35AM -0000, David Woodhouse wrote: > > > The offset interface completely punts the decision around guest clocks > > to userspace. We (KVM) have absolutely no idea what userspace is about > > to do with the guest. The guest could be paused for 5 seconds or 5 > > years. Encouraging host userspace to just read/write a { TOD, TSC } pair > > and let KVM do the heavy lifting could completely wreck the guest's > > monotonic clock. > > > > Additionally, it is impossible for userspace to enforce policy/limits on > > how much to time travel a guest with a value-based interface. Any event > > could sneak in between the time userspace checks the value and KVM sets > > the L1 offset. Offsets are idempotent and will still uphold userspace's > > intentions even if an inordinate amount of time elapses until KVM > > processes it. > > Thanks for the detailed explanation. One part which confuses me here... > Why can't userspace impose those same limits using a (TOD, value) tuple? > Userspace can still look at that TOD from before the brownout period > started, and declare that is too far in the past. > > If the event happens *after* userspace has decided that the migration was > quick enough, but before the vCPUs are actually running again, even the > offset based interface doesn't protect against that. Right, if nastiness happens after the offset was calculated, the guest is still going to feel it. The benefit is that we've at least stopped the bleeding on the monotonic clock. Guests of KVM are probably happier having a messed up realtime clock instead of a panicked kernel that threw a fit over monotonic stepping too far. > > Apologies for grandstanding, but clocks has been a real source of pain > > during migration. I do agree that the documented algorithm is a mess at > > the moment, given that there's no good way for userspace to transform > > host_tsc -> guest_tsc. 
> > Poking the host TSC frequency out in sysfs is
> > nice to have, but probably not ABI to hang this whole thing off of.
> >
> > What do you folks think about having a new R/O vCPU attribute that
> > returns a { TOD, guest_tsc } pair? I believe that would immediately
> > satisfy the needs of upstream to implement clock-advancing live
> > migration.
>
> Hm, I need to do some more thinking here. I poked at this because for TSC
> scaling even before we think about clock jumps it was just utterly hosed —
> userspace naively just creates a bunch of vCPUs and sets their TSC
> frequency + value, and they all end up with unsynced TSC values.

And thank you for addressing that. I think there is a broader
generalization that can be made about guest timekeeping that you've
started poking at with that patch set, which is that we should only
really care about these at a VM scope. There's nothing we can do to cover
up broken hardware, so if the host's TSCs are out of whack then oh well.

To that end, we could have a single host<->guest offset that is used
across all vCPUs. Guest fiddling with TSC_ADJUST or even worse direct
writes to the TSC can then be treated as solely a part of guest state,
and get added on top of the host<->guest offset.

> But coincidentally since then I have started having conversations with
> people who really want the guest to have an immediate knowledge of the
> adjtimex maxerror etc. on the new host immediately after the migration.
> Maybe the "if the migration isn't fast enough then let the guest know it's
> now unsynced" is OK, but I'll need to work out what "immediately" means
> when we have a guest userspace component involved in it.

This has also been an area of interest to me. I think we've all seen the
many ways in which doing migrations behind the guest's back can put
software in an extremely undesirable state on the other end. If those
conversations are taking place on the mailing lists, could you please CC
me?
Our (Google) TSC adjustment clamping and userspace notification mechanism
was a halfway kludge to keep things happy on the other end. And it
generally has worked well, but misses a fundamental point. The hypervisor
should tell the guest kernel about time travel and let it cascade that
information throughout the guest system.

Regardless of what we do to the TSC, we invariably destroy one of the two
guest clocks along the way. If we told the guest "you time traveled X
seconds", it could fold that into its own idea of real time. The guest
kernel can then fire off events to inform software that wants to keep up
with clock changes, and even a new event to let NTP know it's probably
running on different hardware.

Time sucks :-)

--
Thanks,
Oliver
On Sun, Mar 20, 2022 at 02:39:34PM +0100, Paolo Bonzini wrote: > On 3/20/22 09:52, Oliver Upton wrote: > > What do you folks think about having a new R/O vCPU attribute that > > returns a { TOD, guest_tsc } pair? I believe that would immediately > > satisfy the needs of upstream to implement clock-advancing live > > migration. > > I don't think this adds much. The missing link is on the destination side, > not the source side. I think it'll work: Source: - Pick a vCPU and save its { TOD, guest_TSC } pair - Save the tsc offset of every vCPU - Using all of the offsets, calculate the drift of all the follower vCPUs from the leader vCPU (from step 1) - Save the TSC frequency Destination: - Restore the TSC frequency - Read the { TOD, guest_TSC } pair for the first vCPU - Compare with the source value to work out delta_guest_TSC and delta_TOD - Apply delta_guest_TSC to all vCPUs in a VM - If you want to account for realtime, apply guest_tsc_freq * delta_TOD to every vCPU in the VM - Account for drift between leader/follower vCPUs Userspace has some math to do, but IMO it needs to until we have a better mechanism for helping the guest clean up a slow migration. It does eliminate the need for doing TSC scaling in userspace, which I think is the trickiest piece of it all. Alternative could be to say that the VM has a single, authoritative { TOD, guest_TSC } clock that can be read or written. Any vCPU offsets then account for guest-induced drift in TSCs. > To recap, the data that needs to be migrated from source to destination is > the hostTSC+hostTOD pairing (returned by KVM_GET_CLOCK) plus one of each of > the following: > > * either guestTSCOffset or a guestTSC synced with the hostTSC > > * either guestTODOffset or a guestTOD synced with the hostTOD. > > * either guestTSCScale or hostTSCFreq > > Right now we have guestTSCOffset as a vCPU attribute, we have guestTOD > returned by KVM_GET_CLOCK, and we plan to have hostTSCFreq in sysfs. 
It's a > bit mix-and-match, but it's already a 5-tuple that the destination can use. > What's missing is a ioctl on the destination side that relieves userspace > from having to do the math. That ioctl will work fine, but userspace needs to accept all the nastiness that ensues. If it yanks the guest too hard into the future it'll need to pick up the pieces when the guest kernel panics. -- Thanks, Oliver
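Oliver's source/destination recipe above can be sketched end to end (hypothetical helper names; follower drift simplified to per-vCPU offset deltas relative to the leader):

```python
def save_clock_state(leader_tod_ns, leader_guest_tsc, vcpu_offsets, tsc_khz):
    """Source side: record the leader vCPU's { TOD, guest_TSC } pair,
    each follower's drift from the leader, and the TSC frequency."""
    leader = vcpu_offsets[0]
    return {"tod_ns": leader_tod_ns, "tsc": leader_guest_tsc,
            "drift": [ofs - leader for ofs in vcpu_offsets],
            "tsc_khz": tsc_khz}

def restore_clock_state(state, dst_tod_ns, account_realtime=True):
    """Destination side: optionally advance the leader TSC by the elapsed
    TOD, then reapply per-vCPU drift. Returns one TSC value per vCPU."""
    tsc = state["tsc"]
    if account_realtime:
        delta_ns = dst_tod_ns - state["tod_ns"]
        tsc += delta_ns * state["tsc_khz"] // 10**6  # ns * kHz -> ticks
    return [tsc + d for d in state["drift"]]
```

The userspace math is all integer arithmetic once frequencies are fixed; the trickier part Oliver alludes to, scaling between differing host TSC frequencies, is deliberately absent here because the saved state is already in guest ticks.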
On Sun, 2022-03-20 at 14:39 +0100, Paolo Bonzini wrote: > On 3/20/22 09:52, Oliver Upton wrote: > > What do you folks think about having a new R/O vCPU attribute that > > returns a { TOD, guest_tsc } pair? I believe that would immediately > > satisfy the needs of upstream to implement clock-advancing live > > migration. > > I don't think this adds much. The missing link is on the destination > side, not the source side. > > To recap, the data that needs to be migrated from source to destination > is the hostTSC+hostTOD pairing (returned by KVM_GET_CLOCK) plus one of > each of the following: > > * either guestTSCOffset or a guestTSC synced with the hostTSC > > * either guestTODOffset or a guestTOD synced with the hostTOD. > > * either guestTSCScale or hostTSCFreq > > Right now we have guestTSCOffset as a vCPU attribute, we have guestTOD > returned by KVM_GET_CLOCK, and we plan to have hostTSCFreq in sysfs. > It's a bit mix-and-match, but it's already a 5-tuple that the > destination can use. What's missing is a ioctl on the destination side > that relieves userspace from having to do the math. > > Adding a nicer API just means choosing a different 5-tuple that the > source side sends over to the destination, but it isn't particularly > useful if userspace still has to do the math. Hm, I still don't really understand why it's a 5-tuple and why we care about the *host* TSC at all. Fundamentally, guest state is *guest* state. There is a 3-tuple of { NTP-synchronized time of day, Guest KVM clock, Guest TSC value } (plus a TSC offset from vCPU0, for each other vCPU perhaps.) All else is redundant, and including host values is just wrong. We can already do clock-advancing live migration by adding the appropriate number of nanoseconds, and the appropriate number of TSC ticks, to the values being restored on the destination host based on the time at which the migrated instance is being restored there, as compared with the originally stored time-of-day. 
I understand why KVM would want to use the *same* host TSC value for calculating the time-of-day as well as the guest KVM clock and the guest TSC value — so that all three elements of that 3-tuple reflect precisely the same moment in time. I don't understand why the actual *value* of the host TSC is something that userspace needs to see.
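David's 3-tuple framing can be sketched as plain arithmetic. The helper below is a hypothetical illustration (the function name, sample values, and the assumed 2 GHz guest TSC frequency are all invented; no real KVM ioctls are involved): given the saved { TOD, kvmclock, guest TSC } tuple and the destination's time of day, the two guest clocks are advanced by the blackout duration without ever consulting a host TSC value.

```python
GUEST_TSC_KHZ = 2_000_000  # assumed guest TSC frequency (2 GHz), for illustration

def advance_guest_clocks(tod_src_ns, kvmclock_src_ns, guest_tsc_src, tod_dst_ns):
    """Advance the saved {TOD, kvmclock, guest TSC} 3-tuple by the
    wall-clock time that elapsed while the VM was in transit."""
    elapsed_ns = tod_dst_ns - tod_src_ns
    # kvmclock counts nanoseconds, so it advances 1:1 with CLOCK_REALTIME.
    kvmclock_dst_ns = kvmclock_src_ns + elapsed_ns
    # The guest TSC advances at the *guest* frequency: ticks = ns * kHz / 1e6.
    guest_tsc_dst = guest_tsc_src + elapsed_ns * GUEST_TSC_KHZ // 1_000_000
    return kvmclock_dst_ns, guest_tsc_dst
```

One second of blackout at an assumed 2 GHz guest frequency adds 10^9 ns to kvmclock and 2x10^9 ticks to the guest TSC, regardless of what either host's TSC reads.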
On 3/21/22 01:51, Oliver Upton wrote: >> >> Right now we have guestTSCOffset as a vCPU attribute, we have guestTOD >> returned by KVM_GET_CLOCK, and we plan to have hostTSCFreq in sysfs. It's a >> bit mix-and-match, but it's already a 5-tuple that the destination can use. >> What's missing is a ioctl on the destination side that relieves userspace >> from having to do the math. > That ioctl will work fine, but userspace needs to accept all the > nastiness that ensues. If it yanks the guest too hard into the future > it'll need to pick up the pieces when the guest kernel panics. It can do that before issuing the ioctl, by comparing the source and destination TOD, can't it? Paolo
On Mon, 2022-03-21 at 13:36 +0100, Paolo Bonzini wrote:
> On 3/21/22 01:51, Oliver Upton wrote:
> > > Right now we have guestTSCOffset as a vCPU attribute, we have guestTOD returned by KVM_GET_CLOCK, and we plan to have hostTSCFreq in sysfs. It's a bit mix-and-match, but it's already a 5-tuple that the destination can use. What's missing is a ioctl on the destination side that relieves userspace from having to do the math.
> > That ioctl will work fine, but userspace needs to accept all the nastiness that ensues. If it yanks the guest too hard into the future it'll need to pick up the pieces when the guest kernel panics.
> It can do that before issuing the ioctl, by comparing the source and destination TOD, can't it?

That's how we do it. We currently take the *really* naïve approach of
just letting each vCPU thread serialize its own vCPU's TSC with the
other MSRs, which means they're slightly out of sync. Then we calculate
the TOD delta when resuming on the destination, and add the
corresponding number of nanoseconds to the KVM clock, and ticks to each
vCPU's MSR.

When each vCPU thread then *restores* its vCPU's MSRs, the 'within a
second of the last sync' bit kicks in and they are all synchronized
properly.

There's definitely a bit of slop there; I don't think we even use the
KVM_CLOCK_REALTIME flag when setting the guest's KVM clock. But those
are fixable by having a single ioctl which handles the 3-tuple of
{ TOD, KVM clock, TSC } at a single instance in time, I think?
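The "compare source and destination TOD before issuing the ioctl" guard might look like the sketch below. The clamping threshold is entirely invented for illustration (it is not part of any KVM or QEMU interface); the idea, echoing the clamping approach mentioned elsewhere in the thread, is that a VMM may refuse to yank the guest TSC forward by an arbitrarily large blackout and instead let NTP inside the guest absorb the remainder.

```python
MAX_TSC_ADVANCE_NS = 60 * 1_000_000_000  # invented threshold: 60 s of blackout

def plan_tsc_advance(tod_src_ns, tod_dst_ns):
    """Decide how many nanoseconds of TSC advance to apply on resume.

    Compares source and destination CLOCK_REALTIME before touching any
    guest state; if the blackout was too long, advancing the TSC by the
    full delta risks guest watchdog panics, so the advance is clamped.
    """
    delta_ns = tod_dst_ns - tod_src_ns
    if delta_ns < 0:
        # Destination clock is behind the source; never move the guest
        # TSC backwards.
        return 0
    return min(delta_ns, MAX_TSC_ADVANCE_NS)
```

A VMM using something like this still has to "pick up the pieces" for the clamped remainder, which is exactly the guest-notification gap discussed later in the thread.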
On 3/21/22 13:16, David Woodhouse wrote: > Hm, I still don't really understand why it's a 5-tuple and why we care > about the *host* TSC at all. > > Fundamentally, guest state is *guest* state. There is a 3-tuple of > { NTP-synchronized time of day, Guest KVM clock, Guest TSC value } > > (plus a TSC offset from vCPU0, for each other vCPU perhaps.) > > All else is redundant, and including host values is just wrong. [...] > I don't understand why the actual *value* of the host TSC is something > that userspace needs to see. Ok, I understand now. You're right, you don't need the value of the host TSC; you only need a {hostTOD, guestNS, guestTSC} tuple for every vCPU recorded on the source. If all the frequencies are the same, that can be "packed" as {hostTOD, guestNS, anyTSC} plus N offsets. The N offsets in turn can be KVM_VCPU_TSC_OFFSET if you use the hostTSC, or the offsets between vCPU0 and the others if you use the vCPU0 guestTSC. I think reasoning in terms of the host TSC is nicer in general, because it doesn't make vCPU0 special. But apart from the aesthetics of having a "special" vCPU, making vCPU0 special is actually harder, because the TSC frequencies need not be the same for all vCPUs. I think that is a mistake in the KVM API, but it was done long before I was involved (and long before I actually understood this stuff). Paolo
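Paolo's "packed" form might serialize as something like the hypothetical payload below (all field and method names are invented; the `guest_tsc()` helper assumes the single TSC sample is the host TSC, that guest and host frequencies match so the scale is 1, and that the offsets are KVM_VCPU_TSC_OFFSET-style values):

```python
from dataclasses import dataclass

@dataclass
class ClockMigrationPayload:
    """Hypothetical serialization of the packed {hostTOD, guestNS, TSC}
    sample plus N per-vCPU offsets described above."""
    host_tod_ns: int            # CLOCK_REALTIME at sampling (from KVM_GET_CLOCK)
    guest_ns: int               # kvmclock nanoseconds at the same instant
    tsc: int                    # one TSC sample (host TSC, or vCPU0's guest TSC)
    vcpu_tsc_offsets: list      # KVM_VCPU_TSC_OFFSET per vCPU, or deltas from vCPU0

    def guest_tsc(self, i):
        # With a host-TSC sample, matching frequencies, and
        # KVM_VCPU_TSC_OFFSET-style offsets, each vCPU's guest TSC is
        # simply tsc + offset (guest_tsc = host_tsc + offset).
        return self.tsc + self.vcpu_tsc_offsets[i]
```

The vCPU0-relative packing Paolo mentions would reuse the same shape with `tsc` holding vCPU0's guest TSC and the offsets measured against it, which is where the "special vCPU" awkwardness creeps in.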
On Mon, 2022-03-21 at 14:10 +0100, Paolo Bonzini wrote: > On 3/21/22 13:16, David Woodhouse wrote: > > Hm, I still don't really understand why it's a 5-tuple and why we care > > about the *host* TSC at all. > > > > Fundamentally, guest state is *guest* state. There is a 3-tuple of > > { NTP-synchronized time of day, Guest KVM clock, Guest TSC value } > > > > (plus a TSC offset from vCPU0, for each other vCPU perhaps.) > > > > All else is redundant, and including host values is just wrong. [...] > > I don't understand why the actual *value* of the host TSC is something > > that userspace needs to see. > > Ok, I understand now. > > You're right, you don't need the value of the host TSC; you only need a > {hostTOD, guestNS, guestTSC} tuple for every vCPU recorded on the source. Actually, isn't that still redundant? All we really need is a *single* { hostTOD, guestNS } for the KVM clock, and then each vCPU has its own { guestNS, guestTSC } tuple. > If all the frequencies are the same, that can be "packed" as {hostTOD, > guestNS, anyTSC} plus N offsets. The N offsets in turn can be > KVM_VCPU_TSC_OFFSET if you use the hostTSC, or the offsets between vCPU0 > and the others if you use the vCPU0 guestTSC. > > I think reasoning in terms of the host TSC is nicer in general, because > it doesn't make vCPU0 special. But apart from the aesthetics of having > a "special" vCPU, making vCPU0 special is actually harder, because the > TSC frequencies need not be the same for all vCPUs. I think that is a > mistake in the KVM API, but it was done long before I was involved (and > long before I actually understood this stuff). If each vCPU has its own { guestNS, guestTSC } tuple that actually works out just fine even if they have different frequencies. The *common* case would be that they are all at the same frequency and have the same value at the same time. But other cases can be accommodated. 
I'm not averse to *reasoning* in terms of the host TSC; I just don't like exposing the actual numbers to userspace and forcing userspace to access it through some new ABI just in order to translate some other fundamental property (either the time of day, or the guest data) into that domain. And I especially don't like considering it part of 'guest state'. Right now when I'm not frowning at TSC synchronisation issues, I'm working on guest transparent live migration from actual Xen, to KVM-pretending-to-be-Xen. That kind of insanity is only really possible with a strict adherence to the design principle that "guest state is guest state", without conflating it with host/implementation details :)
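The per-vCPU { guestNS, guestTSC } tuples accommodate mixed frequencies naturally, as a small sketch shows (hypothetical helper; the frequencies are illustrative and no KVM interface is implied). Each vCPU advances by the same elapsed guest nanoseconds, scaled by its own frequency:

```python
def restore_vcpu_tsc(guest_ns_saved, guest_tsc_saved, vcpu_tsc_khz, guest_ns_now):
    """Recompute one vCPU's TSC from its own {guestNS, guestTSC} pair.

    Because each tuple carries its vCPU's own frequency, vCPUs running
    at different TSC frequencies are handled uniformly: each advances by
    (elapsed guest ns) * (its own kHz) / 1e6 ticks.
    """
    elapsed_ns = guest_ns_now - guest_ns_saved
    return guest_tsc_saved + elapsed_ns * vcpu_tsc_khz // 1_000_000
```

A 1 GHz vCPU and a 3 GHz vCPU restored over the same one-second gap simply gain 10^9 and 3x10^9 ticks respectively, with no special-casing of vCPU0.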
On Mon, 2022-03-21 at 00:38 +0000, Oliver Upton wrote:
> On Sun, Mar 20, 2022 at 09:46:35AM -0000, David Woodhouse wrote:
> > But coincidentally since then I have started having conversations with people who really want the guest to have an immediate knowledge of the adjtimex maxerror etc. on the new host immediately after the migration. Maybe the "if the migration isn't fast enough then let the guest know it's now unsynced" is OK, but I'll need to work out what "immediately" means when we have a guest userspace component involved in it.
> This has also been an area of interest to me. I think we've all seen the many ways in which doing migrations behind the guest's back can put software in an extremely undesirable state on the other end. If those conversations are taking place on the mailing lists, could you please CC me?
> Our (Google) TSC adjustment clamping and userspace notification mechanism was a halfway kludge to keep things happy on the other end. And it generally has worked well, but misses a fundamental point.
> The hypervisor should tell the guest kernel about time travel and let it cascade that information throughout the guest system. Regardless of what we do to the TSC, we invariably destroy one of the two guest clocks along the way. If we told the guest "you time traveled X seconds", it could fold that into its own idea of real time. The guest kernel can then fire off events to inform software that wants to keep up with clock changes, and even a new event to let NTP know it's probably running on different hardware.
> Time sucks :-)

So, we already have PVCLOCK_GUEST_STOPPED which tells the guest that
its clock may have experienced a jump. Linux guests will use this to
kick various watchdogs to prevent them whining. Shouldn't we *also* be
driving the NTP reset from that same signal?
On Mon, Mar 21, 2022 at 07:43:21PM +0000, David Woodhouse wrote: > On Mon, 2022-03-21 at 00:38 +0000, Oliver Upton wrote: > > On Sun, Mar 20, 2022 at 09:46:35AM -0000, David Woodhouse wrote: > > > But coincidentally since then I have started having conversations with > > > people who really want the guest to have an immediate knowledge of the > > > adjtimex maxerror etc. on the new host immediately after the migration. > > > Maybe the "if the migration isn't fast enough then let the guest know it's > > > now unsynced" is OK, but I'll need to work out what "immediately" means > > > when we have a guest userspace component involved in it. > > > > This has also been an area of interest to me. I think we've all seen the > > many ways in which doing migrations behind the guest's can put software > > in an extremely undesirable state on the other end. If those > > conversations are taking place on the mailing lists, could you please CC > > me? > > > > Our (Google) TSC adjustment clamping and userspace notification mechanism > > was a halfway kludge to keep things happy on the other end. And it > > generally has worked well, but misses a fundamental point. > > > > The hypervisor should tell the guest kernel about time travel and let it > > cascade that information throughout the guest system. Regardless of what > > we do to the TSC, we invariably destroy one of the two guest clocks along > > the way. If we told the guest "you time traveled X seconds", it could > > fold that into its own idea of real time. Guest kernel can then fire off > > events to inform software that wants to keep up with clock changes, and > > even a new event to let NTP know its probably running on different > > hardware. > > > > Time sucks :-) > > So, we already have PVCLOCK_GUEST_STOPPED which tells the guest that > its clock may have experienced a jump. Linux guests will use this to > kick various watchdogs to prevent them whining. 
> Shouldn't we *also* be driving the NTP reset from that same signal?

Right, but I'd argue that interface has some problems too. It depends on
the guest polling instead of an interrupt from the hypervisor. It also
has no way of informing the kernel exactly how much time has elapsed.

The whole point of all these hacks that we've done internally is that
we, the hypervisor, know full well how much real time has advanced
during the VM blackout. If we can at least let the guest know how much
to fudge real time, it can then poke NTP for better refinement.

I worry about using NTP as the sole source of truth for such a
mechanism, since you'll need to go out to the network and any reads
until the response comes back are hosed.

--
Thanks,
Oliver
diff --git a/Documentation/virt/kvm/devices/vcpu.rst b/Documentation/virt/kvm/devices/vcpu.rst
index 60a29972d3f1..85199e9b7f8d 100644
--- a/Documentation/virt/kvm/devices/vcpu.rst
+++ b/Documentation/virt/kvm/devices/vcpu.rst
@@ -180,13 +180,18 @@ Returns:
 ======= ======================================
 
 Specifies the guest's TSC offset relative to the host's TSC. The guest's
-TSC is then derived by the following equation:
+TSC is then derived by the following equation where tsc_scale is the
+ratio between the VM and host TSC frequencies:
 
-  guest_tsc = host_tsc + KVM_VCPU_TSC_OFFSET
+  guest_tsc = (host_tsc * tsc_scale) + KVM_VCPU_TSC_OFFSET
 
-This attribute is useful to adjust the guest's TSC on live migration,
-so that the TSC counts the time during which the VM was paused. The
-following describes a possible algorithm to use for this purpose.
+The VM's TSC frequency is configured with the KVM_{GET,SET}_TSC_KHZ
+vCPU or VM ioctls.
+
+The KVM_VCPU_TSC_OFFSET attribute is useful to adjust the guest's TSC
+on live migration, so that the TSC counts the time during which the VM
+was paused. The following describes a possible algorithm to use for this
+purpose.
 
 From the source VMM process:
 
@@ -202,7 +207,10 @@ From the source VMM process:
 
 From the destination VMM process:
 
-4. Invoke the KVM_SET_CLOCK ioctl, providing the source nanoseconds from
+4. Invoke the KVM_SET_TSC_KHZ ioctl to set the guest TSC frequency to
+   the value recorded in step 3 (freq).
+
+5. Invoke the KVM_SET_CLOCK ioctl, providing the source nanoseconds from
    kvmclock (guest_src) and CLOCK_REALTIME (host_src) in their respective
    fields.  Ensure that the KVM_CLOCK_REALTIME flag is set in the provided
    structure.
@@ -214,10 +222,10 @@ From the destination VMM process:
    between the source pausing the VMs and the destination executing steps
    4-7.
 
-5. Invoke the KVM_GET_CLOCK ioctl to record the host TSC (tsc_dest) and
+6. Invoke the KVM_GET_CLOCK ioctl to record the host TSC (tsc_dest) and
    kvmclock nanoseconds (guest_dest).
 
-6. Adjust the guest TSC offsets for every vCPU to account for (1) time
+7. Adjust the guest TSC offsets for every vCPU to account for (1) time
    elapsed since recording state and (2) difference in TSCs between the
    source and destination machine:
 
@@ -229,5 +237,5 @@ From the destination VMM process:
    a time of 0 in kvmclock. The above formula ensures that it is the same
    on the destination as it was on the source).
 
-7. Write the KVM_VCPU_TSC_OFFSET attribute for every vCPU with the
+8. Write the KVM_VCPU_TSC_OFFSET attribute for every vCPU with the
    respective value derived in the previous step.
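The per-vCPU offset adjustment in the patched step, with the host TSC readings scaled into the guest's frequency domain as suggested in review (tsc * guest_freq / host_freq on each side), can be checked with a small arithmetic sketch. All values here are hypothetical; this is not a KVM interface, just the formula:

```python
def tsc_offset_dst(ofs_src, guest_src_ns, guest_dest_ns,
                   tsc_src, tsc_dest,
                   guest_khz, src_khz, dest_khz):
    """ofs_dst = ofs_src - (guest_src - guest_dest) * guest_freq
                 + (scaled_tsc_src - scaled_tsc_dest)

    where each host TSC reading is first scaled by guest_freq/host_freq
    into guest-frequency ticks.
    """
    scaled_tsc_src = tsc_src * guest_khz // src_khz
    scaled_tsc_dest = tsc_dest * guest_khz // dest_khz
    # -(guest_src - guest_dest) * freq == +(elapsed kvmclock ns) in guest ticks
    elapsed_ticks = (guest_dest_ns - guest_src_ns) * guest_khz // 1_000_000
    return ofs_src + elapsed_ticks + (scaled_tsc_src - scaled_tsc_dest)
```

As a sanity check with invented numbers: a 1 GHz guest saved on a 2 GHz source host (tsc_src = 2x10^9, ofs_src = 0, so guest TSC = 10^9), restored one kvmclock-second later on a 1 GHz destination host reading tsc_dest = 5x10^9, needs ofs_dst = -3x10^9 so that tsc_dest * 1 + ofs_dst = 2x10^9, i.e. the saved guest TSC plus one second of guest ticks.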
The VMM has control of both the guest's TSC scale and offset. Extend the
described migration algorithm in the KVM_VCPU_TSC_OFFSET documentation
to cover TSC scaling.

Reported-by: David Woodhouse <dmw@amazon.co.uk>
Signed-off-by: Oliver Upton <oupton@google.com>
---

Applies to kvm/queue (references KVM_{GET,SET}_TSC_KHZ on a VM fd).
Parent commit: 2ca1ba339ed8 ("KVM: x86: Test case for TSC scaling and offset sync")

 Documentation/virt/kvm/devices/vcpu.rst | 26 ++++++++++++++++---------
 1 file changed, 17 insertions(+), 9 deletions(-)