Documentation: KVM: Describe guest TSC scaling in migration algorithm

Message ID 20220316045308.2313184-1-oupton@google.com
State New, archived
Series Documentation: KVM: Describe guest TSC scaling in migration algorithm

Commit Message

Oliver Upton March 16, 2022, 4:53 a.m. UTC
The VMM has control of both the guest's TSC scale and offset. Extend the
described migration algorithm in the KVM_VCPU_TSC_OFFSET documentation
to cover TSC scaling.

Reported-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Oliver Upton <oupton@google.com>
---

Applies to kvm/queue (references KVM_{GET,SET}_TSC_KHZ on a VM fd).

Parent commit: 2ca1ba339ed8 ("KVM: x86: Test case for TSC scaling and offset sync")

 Documentation/virt/kvm/devices/vcpu.rst | 26 ++++++++++++++++---------
 1 file changed, 17 insertions(+), 9 deletions(-)

Comments

Paolo Bonzini March 16, 2022, 7:47 a.m. UTC | #1
On 3/16/22 05:53, Oliver Upton wrote:
> The VMM has control of both the guest's TSC scale and offset. Extend the
> described migration algorithm in the KVM_VCPU_TSC_OFFSET documentation
> to cover TSC scaling.
> 
> Reported-by: David Woodhouse <dwmw@amazon.co.uk>
> Signed-off-by: Oliver Upton <oupton@google.com>
> ---
> 
> Applies to kvm/queue (references KVM_{GET,SET}_TSC_KHZ on a VM fd).

A few more things that have to be changed:

> 1. Invoke the KVM_GET_CLOCK ioctl to record the host TSC (tsc_src),
>    kvmclock nanoseconds (guest_src), and host CLOCK_REALTIME nanoseconds
>    (host_src).
> 

One of two changes:

a) Add "Multiply tsc_src by guest_freq / src_freq to obtain 
scaled_tsc_src", add a new device attribute for the host TSC frequency.

b) Add "Multiply tsc_src by src_ratio to obtain scaled_tsc_src", add a 
new device attribute for the guest_frequency/host_frequency ratio.

A third would be scaling the host TSC frequency in KVM_GET_CLOCK, but
that's confusing IMO.

> 3. Invoke the KVM_GET_TSC_KHZ ioctl to record the frequency of the
>    guest's TSC (freq).

Replace freq with guest_freq.

> 6. Adjust the guest TSC offsets for every vCPU to account for (1) time
>    elapsed since recording state and (2) difference in TSCs between the
>    source and destination machine:
> 
>    ofs_dst[i] = ofs_src[i] -
>      (guest_src - guest_dest) * freq +
>      (tsc_src - tsc_dest)
> 

Replace freq with guest_freq.

Replace tsc_src with scaled_tsc_src; replace tsc_dest with tsc_dest * 
guest_freq / dest_freq.

>    ("ofs[i] + tsc - guest * freq" is the guest TSC value corresponding to

Replace with "ofs[i] + tsc * guest_freq / host_freq - guest * 
guest_freq", or "ofs[i] + tsc * ratio - guest * guest_freq".

Paolo
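
As a concrete illustration of the corrected arithmetic, here is a minimal C
sketch of step 6 with the substitutions above applied. The variable names
follow the thread; converting kvmclock nanoseconds to ticks as ns * kHz /
10^6 is an assumption, since the documented formula elides units.

#include <stdint.h>

/* Scale a host TSC sample into the guest's clock domain:
 * scaled = tsc * guest_freq / host_freq (frequencies in kHz). */
static uint64_t scale_tsc(uint64_t tsc, uint64_t guest_khz, uint64_t host_khz)
{
	return (uint64_t)(((unsigned __int128)tsc * guest_khz) / host_khz);
}

/*
 * Step 6 with the corrections above:
 *
 *   ofs_dst[i] = ofs_src[i] -
 *     (guest_src - guest_dest) * guest_freq +
 *     (scaled_tsc_src - scaled_tsc_dest)
 */
static uint64_t compute_ofs_dst(uint64_t ofs_src,
				int64_t guest_src_ns, int64_t guest_dest_ns,
				uint64_t guest_khz,
				uint64_t tsc_src, uint64_t src_khz,
				uint64_t tsc_dest, uint64_t dest_khz)
{
	int64_t elapsed_ticks =
		(int64_t)(((__int128)(guest_src_ns - guest_dest_ns) *
			   guest_khz) / 1000000);

	return ofs_src - elapsed_ticks +
	       (scale_tsc(tsc_src, guest_khz, src_khz) -
		scale_tsc(tsc_dest, guest_khz, dest_khz));
}
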
Oliver Upton March 18, 2022, 6:39 p.m. UTC | #2
Hi Paolo,

On Wed, Mar 16, 2022 at 08:47:59AM +0100, Paolo Bonzini wrote:
> On 3/16/22 05:53, Oliver Upton wrote:
> > The VMM has control of both the guest's TSC scale and offset. Extend the
> > described migration algorithm in the KVM_VCPU_TSC_OFFSET documentation
> > to cover TSC scaling.
> > 
> > Reported-by: David Woodhouse <dwmw@amazon.co.uk>
> > Signed-off-by: Oliver Upton <oupton@google.com>
> > ---
> > 
> > Applies to kvm/queue (references KVM_{GET,SET}_TSC_KHZ on a VM fd).
> 
> A few more things that have to be changed:
> 
> > 1. Invoke the KVM_GET_CLOCK ioctl to record the host TSC (tsc_src),
> >    kvmclock nanoseconds (guest_src), and host CLOCK_REALTIME nanoseconds
> >    (host_src).
> > 
> 
> One of two changes:
> 
> a) Add "Multiply tsc_src by guest_freq / src_freq to obtain scaled_tsc_src",
> add a new device attribute for the host TSC frequency.
> 
> b) Add "Multiply tsc_src by src_ratio to obtain scaled_tsc_src", add a new
> device attribute for the guest_frequency/host_frequency ratio.
> 
> A third would be scaling the host TSC frequency in KVM_GET_CLOCK, but that's
> confusing IMO.

Agreed -- I think kvmclock should remain as is.

A fourth would be to expose the host's TSC frequency outside of KVM
since we're really in the business of guest features, not host ones :-)
We already have a patch that does this internally, and it's visible in
some of our open source userspace libraries [1].

That said, I do not have any immediate reservations about adding an
attribute as the easy way out.

Ack on the rest of the feedback, I'll touch up the documentation further
once we figure out TSC frequency exposure.

[1]: https://github.com/abseil/abseil-cpp/blob/c33f21f86a14216336b87d48e9b552a13985b2bc/absl/base/internal/sysinfo.cc#L311

--
Thanks,
Oliver
Paolo Bonzini March 19, 2022, 7:52 a.m. UTC | #3
On 3/18/22 19:39, Oliver Upton wrote:
> Hi Paolo,
> 
> On Wed, Mar 16, 2022 at 08:47:59AM +0100, Paolo Bonzini wrote:
>> On 3/16/22 05:53, Oliver Upton wrote:
>>> The VMM has control of both the guest's TSC scale and offset. Extend the
>>> described migration algorithm in the KVM_VCPU_TSC_OFFSET documentation
>>> to cover TSC scaling.
>>>
>>> Reported-by: David Woodhouse <dwmw@amazon.co.uk>
>>> Signed-off-by: Oliver Upton <oupton@google.com>
>>> ---
>>>
>>> Applies to kvm/queue (references KVM_{GET,SET}_TSC_KHZ on a VM fd).
>>
>> A few more things that have to be changed:
>>
>>> 1. Invoke the KVM_GET_CLOCK ioctl to record the host TSC (tsc_src),
>>>     kvmclock nanoseconds (guest_src), and host CLOCK_REALTIME nanoseconds
>>>     (host_src).
>>>
>>
>> One of two changes:
>>
>> a) Add "Multiply tsc_src by guest_freq / src_freq to obtain scaled_tsc_src",
>> add a new device attribute for the host TSC frequency.
>>
>> b) Add "Multiply tsc_src by src_ratio to obtain scaled_tsc_src", add a new
>> device attribute for the guest_frequency/host_frequency ratio.
>>
>> A third would be scaling the host TSC frequency in KVM_GET_CLOCK, but that's
>> confusing IMO.
> 
> Agreed -- I think kvmclock should remain as is.
> 
> A fourth would be to expose the host's TSC frequency outside of KVM
> since we're really in the business of guest features, not host ones :-)
> We already have a patch that does this internally, and it's visible in
> some of our open source userspace libraries [1].

Yeah, it was a bit of a cop out on my part but adding it to sysfs would 
be nice.  Please tell me if any of you are going to send a patch, or even
just share it so that I can handle the upstream submission.

Paolo

> 
> That said, I do not have any immediate reservations about adding an
> attribute as the easy way out.
> 
> Ack on the rest of the feedback, I'll touch up the documentation further
> once we figure out TSC frequency exposure.
> 
> [1]: https://github.com/abseil/abseil-cpp/blob/c33f21f86a14216336b87d48e9b552a13985b2bc/absl/base/internal/sysinfo.cc#L311
> 
> --
> Thanks,
> Oliver
>
Oliver Upton March 19, 2022, 7:59 a.m. UTC | #4
On Sat, Mar 19, 2022 at 08:52:56AM +0100, Paolo Bonzini wrote:
> On 3/18/22 19:39, Oliver Upton wrote:
> > Hi Paolo,
> > 
> > On Wed, Mar 16, 2022 at 08:47:59AM +0100, Paolo Bonzini wrote:
> > > On 3/16/22 05:53, Oliver Upton wrote:
> > > > The VMM has control of both the guest's TSC scale and offset. Extend the
> > > > described migration algorithm in the KVM_VCPU_TSC_OFFSET documentation
> > > > to cover TSC scaling.
> > > > 
> > > > Reported-by: David Woodhouse <dwmw@amazon.co.uk>
> > > > Signed-off-by: Oliver Upton <oupton@google.com>
> > > > ---
> > > > 
> > > > Applies to kvm/queue (references KVM_{GET,SET}_TSC_KHZ on a VM fd).
> > > 
> > > A few more things that have to be changed:
> > > 
> > > > 1. Invoke the KVM_GET_CLOCK ioctl to record the host TSC (tsc_src),
> > > >     kvmclock nanoseconds (guest_src), and host CLOCK_REALTIME nanoseconds
> > > >     (host_src).
> > > > 
> > > 
> > > One of two changes:
> > > 
> > > a) Add "Multiply tsc_src by guest_freq / src_freq to obtain scaled_tsc_src",
> > > add a new device attribute for the host TSC frequency.
> > > 
> > > b) Add "Multiply tsc_src by src_ratio to obtain scaled_tsc_src", add a new
> > > device attribute for the guest_frequency/host_frequency ratio.
> > > 
> > > A third would be scaling the host TSC frequency in KVM_GET_CLOCK, but that's
> > > confusing IMO.
> > 
> > Agreed -- I think kvmclock should remain as is.
> > 
> > A fourth would be to expose the host's TSC frequency outside of KVM
> > since we're really in the business of guest features, not host ones :-)
> > We already have a patch that does this internally, and it's visible in
> > some of our open source userspace libraries [1].
> 
> Yeah, it was a bit of a cop out on my part but adding it to sysfs would be
> nice.  Please tell me if any of you are going to send a patch, or even just
> share it so that I can handle the upstream submission.

I'll have some time to look at it next week and send upstream. I was
somewhat fishing to see if there were any other ideas here, but sysfs does seem
sensible :)

--
Thanks,
Oliver
David Woodhouse March 19, 2022, 8:08 a.m. UTC | #5
On Sat, 2022-03-19 at 07:59 +0000, Oliver Upton wrote:
> I'll have some time to look at it next week and send upstream. I was
> somewhat fishing to see if there were any other ideas here, but sysfs does seem
> sensible :)

I can't say this fills me with joy. If a basic API requires this much
documentation, my instinct is to *fix* it with fire first, then
document what's left.

A userspace-friendly API for migration would be more like KVM on the
source host giving me { TIME_OF_DAY, TSC } and then all I have to do on
the destination host (after providing the TSC frequency) is give it
precisely the same data.

We depend on source and destination clocks being kept in sync with NTP
so that the TIME_OF_DAY is meaningful, but that's fairly fundamental;
that was *always* going to be our reference¹, right?

For migration between hosts, the offset-based API doesn't seem to be
much of an improvement over what we had before.
Paolo Bonzini March 19, 2022, 11:54 a.m. UTC | #6
On 3/19/22 09:08, David Woodhouse wrote:
> If a basic API requires this much documentation, my instinct is to
> *fix* it with fire first, then document what's left.

I agree, but you're missing all the improvements that went in together 
with the offset API in order to enable the ugly algorithm.

> A userspace-friendly API for migration would be more like KVM on the
> source host giving me { TIME_OF_DAY, TSC } and then all I have to do on
> the destination host (after providing the TSC frequency) is give it
> precisely the same data.

Again I agree but it would have to be {hostTimeOfDay, hostTSC, 
hostTSCFrequency, guestTimeOfDay}.

> For migration between hosts, the offset-based API doesn't seem to be
> much of an improvement over what we had before.

The improvement is that KVM now provides a consistent {hostTimeOfDay, 
hostTSC, guestTimeOfDay} triple, so the "get" part is already okay.  I 
agree that the "set" leaves a substantial amount of work to userspace, 
because *using* that information is left to userspace rather than being 
done in KVM.  But at least it is correct.

So we only lack a way to retrieve hostTSCFrequency to have an ugly but 
fully correct source-side API.  Then we can add a new destination-side 
ioctl (possibly an extension of KVM_SET_CLOCK?) that takes a 
{sourceTimeOfDay, sourceTSC, sourceTSCFrequency, guestTimeOfDay} tuple 
and relieves userspace from having to do the math.

Paolo
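
As a rough illustration, the tuple for such a destination-side ioctl might
look like the struct below. This is a sketch only: no such ioctl exists in
KVM today, and the struct and field names are invented.

#include <linux/types.h>

/* Hypothetical payload for the destination-side "do the math" ioctl
 * sketched above. Not a real KVM ABI. */
struct kvm_clock_restore {
	__u64 src_realtime_ns;	/* sourceTimeOfDay: CLOCK_REALTIME ns */
	__u64 src_host_tsc;	/* sourceTSC: host TSC at the same instant */
	__u64 guest_clock_ns;	/* guestTimeOfDay: kvmclock ns */
	__u32 src_tsc_khz;	/* sourceTSCFrequency */
	__u32 pad;
};
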
Paolo Bonzini March 19, 2022, 1 p.m. UTC | #7
On 3/19/22 12:54, Paolo Bonzini wrote:
> On 3/19/22 09:08, David Woodhouse wrote:
>> If a basic API requires this much documentation, my instinct is to
>> *fix* it with fire first, then document what's left.
> I agree, but you're missing all the improvements that went in together 
> with the offset API in order to enable the ugly algorithm.
> 
>> A userspace-friendly API for migration would be more like KVM on the
>> source host giving me { TIME_OF_DAY, TSC } and then all I have to do on
>> the destination host (after providing the TSC frequency) is give it
>> precisely the same data.
> 
> Again I agree but it would have to be {hostTimeOfDay, hostTSC, 
> hostTSCFrequency, guestTimeOfDay}.

Ah, I guess you meant {hostTimeOfDay, hostTSC} _plus_ the constant 
{guestTSCScale, guestTSCOffset, guestTimeOfDayOffset}.  That would work, 
and in that case it wouldn't even be KVM returning that host information.

In fact {hostTimeOfDay, hostTSC, hostTSCFrequency, guestTimeOfDay} is 
not enough, you also need the guestTSCFrequency and guestTSC (or 
equivalently the scale/offset pair).

Paolo
David Woodhouse March 19, 2022, 1:13 p.m. UTC | #8
> On 3/19/22 12:54, Paolo Bonzini wrote:
>> On 3/19/22 09:08, David Woodhouse wrote:
>>> If a basic API requires this much documentation, my instinct is to
>>> *fix* it with fire first, then document what's left.
>> I agree, but you're missing all the improvements that went in together
>> with the offset API in order to enable the ugly algorithm.
>>
>>> A userspace-friendly API for migration would be more like KVM on the
>>> source host giving me { TIME_OF_DAY, TSC } and then all I have to do on
>>> the destination host (after providing the TSC frequency) is give it
>>> precisely the same data.
>>
>> Again I agree but it would have to be {hostTimeOfDay, hostTSC,
>> hostTSCFrequency, guestTimeOfDay}.
>
> Ah, I guess you meant {hostTimeOfDay, hostTSC} _plus_ the constant
> {guestTSCScale, guestTSCOffset, guestTimeOfDayOffset}.  That would work,
> and in that case it wouldn't even be KVM returning that host information.
>
> In fact {hostTimeOfDay, hostTSC, hostTSCFrequency, guestTimeOfDay} is
> not enough, you also need the guestTSCFrequency and guestTSC (or
> equivalently the scale/offset pair).

I would have said nobody cares about the host TSC value and frequency.
That is for KVM to know and deal with internally.

For guest migration it should be as simple as "guest TSC frequency is <F>,
and the TSC value was <X> at (wallclock time <T>|KVM_CLOCK time <T>)".

Not sure I have an opinion on whether the objective time reference is the
timeofday clock or the KVM clock.
Paolo Bonzini March 20, 2022, 8:10 a.m. UTC | #9
On 3/19/22 14:13, David Woodhouse wrote:
> 
> 
>> On 3/19/22 12:54, Paolo Bonzini wrote:
>>> On 3/19/22 09:08, David Woodhouse wrote:
>>>> If a basic API requires this much documentation, my instinct is to
>>>> *fix* it with fire first, then document what's left.
>>> I agree, but you're missing all the improvements that went in together
>>> with the offset API in order to enable the ugly algorithm.
>>>
>>>> A userspace-friendly API for migration would be more like KVM on the
>>>> source host giving me { TIME_OF_DAY, TSC } and then all I have to do on
>>>> the destination host (after providing the TSC frequency) is give it
>>>> precisely the same data.
>>
>> I guess you meant {hostTimeOfDay, hostTSC} _plus_ the constant
>> {guestTSCScale, guestTSCOffset, guestTimeOfDayOffset}.  That would work,
>> and in that case it wouldn't even be KVM returning that host information.
> 
> I would have said nobody cares about the host TSC value and frequency.
> That is for KVM to know and deal with internally.

There are two schools as to how to do migration.  The QEMU school is to 
just load back the guest TOD and TSC and let NTP resync.  They had 
better be synced, but a difference of a few microseconds might not matter.

This has the advantage of not showing the guest that there was a pause.
QEMU is doing it this way due to not having had postcopy live migration
for a long time; precopy is subject to a longer brownout between source
and destination, which might result in soft lockups.  Apart from this it
really has only disadvantages.

The Google school has the destination come up with the guest TOD and TSC 
that takes into account the length of the brownout phase.  This is where 
the algorithm in Documentation/ comes into play, and why you need the 
host pair as well.  Actually Google does not use it because they already 
have precise time available to userspace as part of Spanner.  Maybe
Amazon does too (?), but for the rest of the world the host {TOD, TSC} pair 
is required to compute what the guest TSC "should look like" on the 
destination.

Paolo

> For guest migration it should be as simple as "guest TSC frequency is <F>,
> and the TSC value was <X> at (wallclock time <T>|KVM_CLOCK time <T>)".
> 
> Not sure I have an opinion on whether the objective time reference is the
> timeofday clock or the KVM clock.
> 
>
Oliver Upton March 20, 2022, 8:52 a.m. UTC | #10
On Sun, Mar 20, 2022 at 09:10:15AM +0100, Paolo Bonzini wrote:
> On 3/19/22 14:13, David Woodhouse wrote:
> > 
> > 
> > > On 3/19/22 12:54, Paolo Bonzini wrote:
> > > > On 3/19/22 09:08, David Woodhouse wrote:
> > > > > If a basic API requires this much documentation, my instinct is to
> > > > > *fix* it with fire first, then document what's left.
> > > > I agree, but you're missing all the improvements that went in together
> > > > with the offset API in order to enable the ugly algorithm.
> > > > 
> > > > > A userspace-friendly API for migration would be more like KVM on the
> > > > > source host giving me { TIME_OF_DAY, TSC } and then all I have to do on
> > > > > the destination host (after providing the TSC frequency) is give it
> > > > > precisely the same data.
> > > 
> > > I guess you meant {hostTimeOfDay, hostTSC} _plus_ the constant
> > > {guestTSCScale, guestTSCOffset, guestTimeOfDayOffset}.  That would work,
> > > and in that case it wouldn't even be KVM returning that host information.
> > 
> > I would have said nobody cares about the host TSC value and frequency.
> > That is for KVM to know and deal with internally.
> 
> There are two schools as to how to do migration.  The QEMU school is to just
> load back the guest TOD and TSC and let NTP resync.  They had better be
> synced, but a difference of a few microseconds might not matter.
> 
> This has the advantage of not showing the guest that there was a pause.
> QEMU is doing it this way due to not having postcopy live migration for a
> long time; precopy is subject to longer brownout between source and
> destination, which might result in soft lockups.  Apart from this it really
> has only disadvantages.
> 
> The Google school has the destination come up with the guest TOD and TSC
> that takes into account the length of the brownout phase.  This is where the
> algorithm in Documentation/ comes into play, and why you need the host pair
> as well.  Actually Google does not use it because they already have precise
> time available to userspace as part of Spanner.  Maybe so does Amazon (?),
> but for the rest of the world the host {TOD, TSC} pair is required to
> compute what the guest TSC "should look like" on the destination.

Hey, beat me to the punch :) Paolo is pretty much spot on, but there are
a few additional details here that I believe are relevant.

I really don't think we want to effectively step the guest's monotonic
clock if at all possible. It hurts when you do this for large windows,
and leads to soft lockups as you've noted above. Nonetheless, it's a
kludgy way to advance the guest's realtime clock without informing it
that it is about to experience time travel.

Given all of this, there is a limit to how much we advance the TSC in
the Google school. If this limit is exceeded we refuse to step the TSC
further and inform the guest it has experienced time travel [1]. It is
an attempt to bridge the gap and avoid completely laying waste to guest
clocks while hiding the migration if we're confident it was smooth
enough. Beyond that, guest userspace wants to be apprised of time
travel as well (TFD_TIMER_CANCEL_ON_SET). Having the guest clean up a
messy migration ensures that this all 'just works'.

The offset interface completely punts the decision around guest clocks
to userspace. We (KVM) have absolutely no idea what userspace is about
to do with the guest. The guest could be paused for 5 seconds or 5
years. Encouraging host userspace to just read/write a { TOD, TSC } pair
and let KVM do the heavy lifting could completely wreck the guest's
monotonic clock.

Additionally, it is impossible for userspace to enforce policy/limits on
how much to time travel a guest with a value-based interface. Any event
could sneak in between the time userspace checks the value and KVM sets
the L1 offset. Offsets are idempotent and will still uphold userspace's
intentions even if an inordinate amount of time elapses until KVM
processes it.

Apologies for grandstanding, but clocks have been a real source of pain
during migration. I do agree that the documented algorithm is a mess at
the moment, given that there's no good way for userspace to transform
host_tsc -> guest_tsc. Poking the host TSC frequency out in sysfs is
nice to have, but probably not ABI to hang this whole thing off of.

What do you folks think about having a new R/O vCPU attribute that
returns a { TOD, guest_tsc } pair? I believe that would immediately
satisfy the needs of upstream to implement clock-advancing live
migration.

[1]: https://github.com/GoogleCloudPlatform/guest-agent/blob/main/google_guest_agent/clock.go
--
Thanks,
Oliver
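
A small sketch of the race described above, under the thread's model of
guest_tsc = host_tsc * scale + offset. This is illustrative only; the scale
factor is shown as a double purely for brevity (KVM uses a fixed-point
ratio), and the helper names are invented.

#include <stdint.h>

/* Value-based restore: turning a target guest TSC into hardware state
 * bakes in the host TSC sampled *now*, so any delay between this
 * computation and KVM applying the result leaks into the guest's clock. */
static uint64_t offset_from_value(uint64_t target_guest_tsc,
				  uint64_t host_tsc_now, double scale)
{
	return target_guest_tsc - (uint64_t)(host_tsc_now * scale);
}

/* Offset-based restore: guest_tsc = host_tsc * scale + offset holds for
 * every subsequent host_tsc, so a late write still expresses exactly
 * what userspace intended. */
static uint64_t guest_tsc_at(uint64_t host_tsc, double scale, uint64_t offset)
{
	return (uint64_t)(host_tsc * scale) + offset;
}
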
David Woodhouse March 20, 2022, 9:46 a.m. UTC | #11
> The offset interface completely punts the decision around guest clocks
> to userspace. We (KVM) have absolutely no idea what userspace is about
> to do with the guest. The guest could be paused for 5 seconds or 5
> years. Encouraging host userspace to just read/write a { TOD, TSC } pair
> and let KVM do the heavy lifting could completely wreck the guest's
> monotonic clock.
>
> Additionally, it is impossible for userspace to enforce policy/limits on
> how much to time travel a guest with a value-based interface. Any event
> could sneak in between the time userspace checks the value and KVM sets
> the L1 offset. Offsets are idempotent and will still uphold userspace's
> intentions even if an inordinate amount of time elapses until KVM
> processes it.

Thanks for the detailed explanation. One part which confuses me here...
Why can't userspace impose those same limits using a (TOD, value) tuple?
Userspace can still look at that TOD from before the brownout period
started, and declare that it is too far in the past.

If the event happens *after* userspace has decided that the migration was
quick enough, but before the vCPUs are actually running again, even the
offset-based interface doesn't protect against that.


> Apologies for grandstanding, but clocks have been a real source of pain
> during migration. I do agree that the documented algorithm is a mess at
> the moment, given that there's no good way for userspace to transform
> host_tsc -> guest_tsc. Poking the host TSC frequency out in sysfs is
> nice to have, but probably not ABI to hang this whole thing off of.
>
> What do you folks think about having a new R/O vCPU attribute that
> returns a { TOD, guest_tsc } pair? I believe that would immediately
> satisfy the needs of upstream to implement clock-advancing live
> migration.

Hm, I need to do some more thinking here. I poked at this because for TSC
scaling even before we think about clock jumps it was just utterly hosed —
userspace naively just creates a bunch of vCPUs and sets their TSC
frequency + value, and they all end up with unsynced TSC values.

But coincidentally since then I have started having conversations with
people who really want the guest to have an immediate knowledge of the
adjtimex maxerror etc. on the new host immediately after the migration.
Maybe the "if the migration isn't fast enough then let the guest know it's
now unsynced" is OK, but I'll need to work out what "immediately" means
when we have a guest userspace component involved in it.

I'll need to do that with a real screen and keyboard though, and fingers
that aren't freezing as I sit by a 9-year-old's hockey training...
Paolo Bonzini March 20, 2022, 1:39 p.m. UTC | #12
On 3/20/22 09:52, Oliver Upton wrote:
> What do you folks think about having a new R/O vCPU attribute that
> returns a { TOD, guest_tsc } pair? I believe that would immediately
> satisfy the needs of upstream to implement clock-advancing live
> migration.

I don't think this adds much.  The missing link is on the destination 
side, not the source side.

To recap, the data that needs to be migrated from source to destination 
is the hostTSC+hostTOD pairing (returned by KVM_GET_CLOCK) plus one of 
each of the following:

* either guestTSCOffset or a guestTSC synced with the hostTSC

* either guestTODOffset or a guestTOD synced with the hostTOD.

* either guestTSCScale or hostTSCFreq

Right now we have guestTSCOffset as a vCPU attribute, we have guestTOD 
returned by KVM_GET_CLOCK, and we plan to have hostTSCFreq in sysfs. 
It's a bit mix-and-match, but it's already a 5-tuple that the 
destination can use.  What's missing is a ioctl on the destination side 
that relieves userspace from having to do the math.

Adding a nicer API just means choosing a different 5-tuple that the 
source side sends over to the destination, but it isn't particularly 
useful if userspace still has to do the math.

Paolo
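
On the source side, that consistent triple comes from a single KVM_GET_CLOCK
call. A minimal sketch, assuming headers and a kernel new enough to report
the KVM_CLOCK_REALTIME and KVM_CLOCK_HOST_TSC flags:

#include <linux/kvm.h>
#include <sys/ioctl.h>

/* Read the {hostTOD, hostTSC, guestTOD} triple in one shot; the kernel
 * samples all three fields at the same instant. */
static int read_clock_triple(int vm_fd, struct kvm_clock_data *data)
{
	if (ioctl(vm_fd, KVM_GET_CLOCK, data))
		return -1;

	/* data->realtime (hostTOD) and data->host_tsc (hostTSC) are only
	 * meaningful when these flags are set; data->clock is the guest's
	 * kvmclock nanoseconds (guestTOD). */
	if (!(data->flags & KVM_CLOCK_REALTIME) ||
	    !(data->flags & KVM_CLOCK_HOST_TSC))
		return -1;

	return 0;
}
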
Oliver Upton March 21, 2022, 12:38 a.m. UTC | #13
On Sun, Mar 20, 2022 at 09:46:35AM -0000, David Woodhouse wrote:
> 
> > The offset interface completely punts the decision around guest clocks
> > to userspace. We (KVM) have absolutely no idea what userspace is about
> > to do with the guest. The guest could be paused for 5 seconds or 5
> > years. Encouraging host userspace to just read/write a { TOD, TSC } pair
> > and let KVM do the heavy lifting could completely wreck the guest's
> > monotonic clock.
> >
> > Additionally, it is impossible for userspace to enforce policy/limits on
> > how much to time travel a guest with a value-based interface. Any event
> > could sneak in between the time userspace checks the value and KVM sets
> > the L1 offset. Offsets are idempotent and will still uphold userspace's
> > intentions even if an inordinate amount of time elapses until KVM
> > processes it.
> 
> Thanks for the detailed explanation. One part which confuses me here...
> Why can't userspace impose those same limits using a (TOD, value) tuple?
> Userspace can still look at that TOD from before the brownout period
> started, and declare that it is too far in the past.
> 
> If the event happens *after* userspace has decided that the migration was
> quick enough, but before the vCPUs are actually running again, even the
> offset-based interface doesn't protect against that.

Right, if nastiness happens after the offset was calculated, the guest
is still going to feel it. The benefit is that we've at least stopped
the bleeding on the monotonic clock. Guests of KVM are probably happier
having a messed up realtime clock instead of a panicked kernel that
threw a fit over monotonic stepping too far.

> > Apologies for grandstanding, but clocks have been a real source of pain
> > during migration. I do agree that the documented algorithm is a mess at
> > the moment, given that there's no good way for userspace to transform
> > host_tsc -> guest_tsc. Poking the host TSC frequency out in sysfs is
> > nice to have, but probably not ABI to hang this whole thing off of.
> >
> > What do you folks think about having a new R/O vCPU attribute that
> > returns a { TOD, guest_tsc } pair? I believe that would immediately
> > satisfy the needs of upstream to implement clock-advancing live
> > migration.
> 
> Hm, I need to do some more thinking here. I poked at this because for TSC
> scaling even before we think about clock jumps it was just utterly hosed —
> userspace naively just creates a bunch of vCPUs and sets their TSC
> frequency + value, and they all end up with unsynced TSC values.

And thank you for addressing that. I think there is a broader
generalization that can be made about guest timekeeping that you've
started poking at with that patch set, which is that we should only
really care about these at a VM scope. There's nothing we can do
to cover up broken hardware, so if the host's TSCs are out of whack then
oh well.

To that end, we could have a single host<->guest offset that is used
across all vCPUs. Guest fiddling with TSC_ADJUST or even worse direct
writes to the TSC can then be treated as solely a part of guest state,
and get added on top of the host<->guest offset.
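
A sketch of that VM-scoped model (illustrative only: KVM's offset attribute
is per-vCPU today, these names are invented, and the double is shorthand for
KVM's fixed-point scaling ratio):

#include <stdint.h>

#define NR_VCPUS 8	/* illustrative */

/* One host<->guest offset for the whole VM; guest writes to TSC_ADJUST
 * (or to the TSC directly) are tracked purely as guest state on top. */
struct vm_tsc_state {
	uint64_t vm_offset;			/* VM-wide host<->guest offset */
	uint64_t guest_adjust[NR_VCPUS];	/* guest-visible per-vCPU delta */
};

static uint64_t guest_tsc(const struct vm_tsc_state *s, unsigned int i,
			  uint64_t host_tsc, double scale)
{
	return (uint64_t)(host_tsc * scale) + s->vm_offset + s->guest_adjust[i];
}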

> But coincidentally since then I have started having conversations with
> people who really want the guest to have an immediate knowledge of the
> adjtimex maxerror etc. on the new host immediately after the migration.
> Maybe the "if the migration isn't fast enough then let the guest know it's
> now unsynced" is OK, but I'll need to work out what "immediately" means
> when we have a guest userspace component involved in it.

This has also been an area of interest to me. I think we've all seen the
many ways in which doing migrations behind the guest's back can put software
in an extremely undesirable state on the other end. If those
conversations are taking place on the mailing lists, could you please CC
me?

Our (Google) TSC adjustment clamping and userspace notification mechanism
was a halfway kludge to keep things happy on the other end. And it
generally has worked well, but misses a fundamental point.

The hypervisor should tell the guest kernel about time travel and let it
cascade that information throughout the guest system. Regardless of what
we do to the TSC, we invariably destroy one of the two guest clocks along
the way. If we told the guest "you time traveled X seconds", it could
fold that into its own idea of real time. Guest kernel can then fire off
events to inform software that wants to keep up with clock changes, and
even a new event to let NTP know its probably running on different
hardware.

Time sucks :-)

--
Thanks,
Oliver
Oliver Upton March 21, 2022, 12:51 a.m. UTC | #14
On Sun, Mar 20, 2022 at 02:39:34PM +0100, Paolo Bonzini wrote:
> On 3/20/22 09:52, Oliver Upton wrote:
> > What do you folks think about having a new R/O vCPU attribute that
> > returns a { TOD, guest_tsc } pair? I believe that would immediately
> > satisfy the needs of upstream to implement clock-advancing live
> > migration.
> 
> I don't think this adds much.  The missing link is on the destination side,
> not the source side.

I think it'll work:

 Source:
  - Pick a vCPU and save its { TOD, guest_TSC } pair
  - Save the tsc offset of every vCPU
  - Using all of the offsets, calculate the drift of all the follower
    vCPUs from the leader vCPU (from step 1)
  - Save the TSC frequency

 Destination:
  - Restore the TSC frequency
  - Read the { TOD, guest_TSC } pair for the first vCPU
  - Compare with the source value to work out delta_guest_TSC and
    delta_TOD
  - Apply delta_guest_TSC to all vCPUs in a VM
  - If you want to account for realtime, apply guest_tsc_freq *
    delta_TOD to every vCPU in the VM
  - Account for drift between leader/follower vCPUs

Userspace has some math to do, but IMO it needs to until we have a
better mechanism for helping the guest clean up a slow migration.
It does eliminate the need for doing TSC scaling in userspace, which
I think is the trickiest piece of it all.
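
A sketch of the destination-side math in the steps above, using the thread's
names (converting the realtime delta to ticks as ns * kHz / 10^6 is an
assumption about units):

#include <stdint.h>

/* Advance the saved guest TSC by the realtime delta observed across the
 * migration, then layer the per-vCPU drift back on top. */
static uint64_t dest_guest_tsc(uint64_t src_guest_tsc,
			       int64_t src_tod_ns, int64_t dst_tod_ns,
			       uint64_t guest_khz, int64_t vcpu_drift_ticks)
{
	int64_t delta_ticks =
		(int64_t)(((__int128)(dst_tod_ns - src_tod_ns) * guest_khz) /
			  1000000);

	return src_guest_tsc + delta_ticks + vcpu_drift_ticks;
}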

An alternative could be to say that the VM has a single, authoritative {
TOD, guest_TSC } clock that can be read or written. Any vCPU offsets
then account for guest-induced drift in TSCs.

> To recap, the data that needs to be migrated from source to destination is
> the hostTSC+hostTOD pairing (returned by KVM_GET_CLOCK) plus one of each of
> the following:
> 
> * either guestTSCOffset or a guestTSC synced with the hostTSC
> 
> * either guestTODOffset or a guestTOD synced with the hostTOD.
> 
> * either guestTSCScale or hostTSCFreq
> 
> Right now we have guestTSCOffset as a vCPU attribute, we have guestTOD
> returned by KVM_GET_CLOCK, and we plan to have hostTSCFreq in sysfs. It's a
> bit mix-and-match, but it's already a 5-tuple that the destination can use.
> What's missing is a ioctl on the destination side that relieves userspace
> from having to do the math.

That ioctl will work fine, but userspace needs to accept all the
nastiness that ensues. If it yanks the guest too hard into the future
it'll need to pick up the pieces when the guest kernel panics.

--
Thanks,
Oliver
David Woodhouse March 21, 2022, 12:16 p.m. UTC | #15
On Sun, 2022-03-20 at 14:39 +0100, Paolo Bonzini wrote:
> On 3/20/22 09:52, Oliver Upton wrote:
> > What do you folks think about having a new R/O vCPU attribute that
> > returns a { TOD, guest_tsc } pair? I believe that would immediately
> > satisfy the needs of upstream to implement clock-advancing live
> > migration.
> 
> I don't think this adds much.  The missing link is on the destination 
> side, not the source side.
> 
> To recap, the data that needs to be migrated from source to destination 
> is the hostTSC+hostTOD pairing (returned by KVM_GET_CLOCK) plus one of 
> each of the following:
> 
> * either guestTSCOffset or a guestTSC synced with the hostTSC
> 
> * either guestTODOffset or a guestTOD synced with the hostTOD.
> 
> * either guestTSCScale or hostTSCFreq
> 
> Right now we have guestTSCOffset as a vCPU attribute, we have guestTOD 
> returned by KVM_GET_CLOCK, and we plan to have hostTSCFreq in sysfs. 
> It's a bit mix-and-match, but it's already a 5-tuple that the 
> destination can use.  What's missing is a ioctl on the destination side 
> that relieves userspace from having to do the math.
> 
> Adding a nicer API just means choosing a different 5-tuple that the 
> source side sends over to the destination, but it isn't particularly 
> useful if userspace still has to do the math.

Hm, I still don't really understand why it's a 5-tuple and why we care
about the *host* TSC at all.

Fundamentally, guest state is *guest* state. There is a 3-tuple of
 { NTP-synchronized time of day, Guest KVM clock, Guest TSC value } 

(plus a TSC offset from vCPU0, for each other vCPU perhaps.)

All else is redundant, and including host values is just wrong.

We can already do clock-advancing live migration by adding the
appropriate number of nanoseconds, and the appropriate number of TSC
ticks, to the values being restored on the destination host based on
the time at which the migrated instance is being restored there, as
compared with the originally stored time-of-day.

I understand why KVM would want to use the *same* host TSC value for
calculating the time-of-day as well as the guest KVM clock and the
guest TSC value — so that all three elements of that 3-tuple reflect
precisely the same moment in time.

I don't understand why the actual *value* of the host TSC is something
that userspace needs to see.
Paolo Bonzini March 21, 2022, 12:36 p.m. UTC | #16
On 3/21/22 01:51, Oliver Upton wrote:
>>
>> Right now we have guestTSCOffset as a vCPU attribute, we have guestTOD
>> returned by KVM_GET_CLOCK, and we plan to have hostTSCFreq in sysfs. It's a
>> bit mix-and-match, but it's already a 5-tuple that the destination can use.
>> What's missing is a ioctl on the destination side that relieves userspace
>> from having to do the math.
> That ioctl will work fine, but userspace needs to accept all the
> nastiness that ensues. If it yanks the guest too hard into the future
> it'll need to pick up the pieces when the guest kernel panics.

It can do that before issuing the ioctl, by comparing the source and 
destination TOD, can't it?

Paolo
David Woodhouse March 21, 2022, 12:56 p.m. UTC | #17
On Mon, 2022-03-21 at 13:36 +0100, Paolo Bonzini wrote:
> On 3/21/22 01:51, Oliver Upton wrote:
> > > 
> > > Right now we have guestTSCOffset as a vCPU attribute, we have guestTOD
> > > returned by KVM_GET_CLOCK, and we plan to have hostTSCFreq in sysfs. It's a
> > > bit mix-and-match, but it's already a 5-tuple that the destination can use.
> > > What's missing is a ioctl on the destination side that relieves userspace
> > > from having to do the math.
> > That ioctl will work fine, but userspace needs to accept all the
> > nastiness that ensues. If it yanks the guest too hard into the future
> > it'll need to pick up the pieces when the guest kernel panics.
> 
> It can do that before issuing the ioctl, by comparing the source and 
> destination TOD, can't it?

That's how we do it. 

We currently take the *really* naïve approach of just letting each vCPU
thread serialize its own vCPU's TSC with the other MSRs, which means
they're slightly out of sync.

Then we calculate the TOD delta when resuming on the destination, and
add the corresponding number of nanoseconds to the KVM clock, and ticks
to each vCPU's MSR.

When each vCPU thread then *restores* its vCPU's MSRs, the 'within a
second of the last sync" bit kicks in and they are all synchronized
properly.

There's definitely a bit of slop there; I don't think we even use the
KVM_CLOCK_REALTIME flag when setting the guest's KVM clock. But those
are fixable by having a single ioctl which handles the 3-tuple of
{ TOD, KVM clock, TSC } at a single instance in time, I think?
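
A sketch of the adjustment described above, assuming a hypothetical
saved-state layout (the ns-to-ticks conversion at tsc_khz is again an
assumption):

#include <stdint.h>

struct saved_vm {		/* hypothetical snapshot layout */
	int nr_vcpus;
	uint64_t tsc_khz;	/* guest TSC frequency */
	uint64_t kvmclock_ns;	/* fed to KVM_SET_CLOCK on restore */
	uint64_t tsc[64];	/* per-vCPU MSR_IA32_TSC values */
};

/* Advance the snapshot by the time-of-day delta measured on the
 * destination before each vCPU thread loads its MSRs back. */
static void advance_saved_state(struct saved_vm *vm, int64_t delta_ns)
{
	vm->kvmclock_ns += delta_ns;
	for (int i = 0; i < vm->nr_vcpus; i++)
		vm->tsc[i] += (uint64_t)(((__int128)delta_ns * vm->tsc_khz) /
					 1000000);
}
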
Paolo Bonzini March 21, 2022, 1:10 p.m. UTC | #18
On 3/21/22 13:16, David Woodhouse wrote:
> Hm, I still don't really understand why it's a 5-tuple and why we care
> about the *host* TSC at all.
> 
> Fundamentally, guest state is *guest* state. There is a 3-tuple of
>   { NTP-synchronized time of day, Guest KVM clock, Guest TSC value }
> 
> (plus a TSC offset from vCPU0, for each other vCPU perhaps.)
> 
> All else is redundant, and including host values is just wrong. [...]
> I don't understand why the actual *value* of the host TSC is something
> that userspace needs to see.

Ok, I understand now.

You're right, you don't need the value of the host TSC; you only need a 
{hostTOD, guestNS, guestTSC} tuple for every vCPU recorded on the source.

If all the frequencies are the same, that can be "packed" as {hostTOD, 
guestNS, anyTSC} plus N offsets.  The N offsets in turn can be 
KVM_VCPU_TSC_OFFSET if you use the hostTSC, or the offsets between vCPU0 
and the others if you use the vCPU0 guestTSC.

I think reasoning in terms of the host TSC is nicer in general, because 
it doesn't make vCPU0 special.  But apart from the aesthetics of having 
a "special" vCPU, making vCPU0 special is actually harder, because the 
TSC frequencies need not be the same for all vCPUs.  I think that is a 
mistake in the KVM API, but it was done long before I was involved (and 
long before I actually understood this stuff).

Paolo
David Woodhouse March 21, 2022, 2:59 p.m. UTC | #19
On Mon, 2022-03-21 at 14:10 +0100, Paolo Bonzini wrote:
> On 3/21/22 13:16, David Woodhouse wrote:
> > Hm, I still don't really understand why it's a 5-tuple and why we care
> > about the *host* TSC at all.
> > 
> > Fundamentally, guest state is *guest* state. There is a 3-tuple of
> >   { NTP-synchronized time of day, Guest KVM clock, Guest TSC value }
> > 
> > (plus a TSC offset from vCPU0, for each other vCPU perhaps.)
> > 
> > All else is redundant, and including host values is just wrong. [...]
> > I don't understand why the actual *value* of the host TSC is something
> > that userspace needs to see.
> 
> Ok, I understand now.
> 
> You're right, you don't need the value of the host TSC; you only need a 
> {hostTOD, guestNS, guestTSC} tuple for every vCPU recorded on the source.

Actually, isn't that still redundant? All we really need is a *single*
{ hostTOD, guestNS } for the KVM clock, and then each vCPU has its own
{ guestNS, guestTSC } tuple.

> If all the frequencies are the same, that can be "packed" as {hostTOD, 
> guestNS, anyTSC} plus N offsets.  The N offsets in turn can be 
> KVM_VCPU_TSC_OFFSET if you use the hostTSC, or the offsets between vCPU0 
> and the others if you use the vCPU0 guestTSC.
> 
> I think reasoning in terms of the host TSC is nicer in general, because 
> it doesn't make vCPU0 special.  But apart from the aesthetics of having 
> a "special" vCPU, making vCPU0 special is actually harder, because the 
> TSC frequencies need not be the same for all vCPUs.  I think that is a 
> mistake in the KVM API, but it was done long before I was involved (and 
> long before I actually understood this stuff).

If each vCPU has its own { guestNS, guestTSC } tuple that actually
works out just fine even if they have different frequencies. The
*common* case would be that they are all at the same frequency and have
the same value at the same time. But other cases can be accommodated.

I'm not averse to *reasoning* in terms of the host TSC; I just don't
like exposing the actual numbers to userspace and forcing userspace to
access it through some new ABI just in order to translate some other
fundamental property (either the time of day, or the guest data) into
that domain. And I especially don't like considering it part of 'guest
state'.

Right now when I'm not frowning at TSC synchronisation issues, I'm
working on guest transparent live migration from actual Xen, to
KVM-pretending-to-be-Xen. That kind of insanity is only really possible
with a strict adherence to the design principle that "guest state is
guest state", without conflating it with host/implementation details :)
David Woodhouse March 21, 2022, 7:43 p.m. UTC | #20
On Mon, 2022-03-21 at 00:38 +0000, Oliver Upton wrote:
> On Sun, Mar 20, 2022 at 09:46:35AM -0000, David Woodhouse wrote:
> > But coincidentally since then I have started having conversations with
> > people who really want the guest to have an immediate knowledge of the
> > adjtimex maxerror etc. on the new host immediately after the migration.
> > Maybe the "if the migration isn't fast enough then let the guest know it's
> > now unsynced" is OK, but I'll need to work out what "immediately" means
> > when we have a guest userspace component involved in it.
> 
> This has also been an area of interest to me. I think we've all seen the
> many ways in which doing migrations behind the guest's can put software
> in an extremely undesirable state on the other end. If those
> conversations are taking place on the mailing lists, could you please CC
> me?
> 
> Our (Google) TSC adjustment clamping and userspace notification mechanism
> was a halfway kludge to keep things happy on the other end. And it
> generally has worked well, but misses a fundamental point.
> 
> The hypervisor should tell the guest kernel about time travel and let it
> cascade that information throughout the guest system. Regardless of what
> we do to the TSC, we invariably destroy one of the two guest clocks along
> the way. If we told the guest "you time traveled X seconds", it could
> fold that into its own idea of real time. Guest kernel can then fire off
> events to inform software that wants to keep up with clock changes, and
> even a new event to let NTP know it's probably running on different
> hardware.
> 
> Time sucks :-)

So, we already have PVCLOCK_GUEST_STOPPED which tells the guest that
its clock may have experienced a jump. Linux guests will use this to
kick various watchdogs to prevent them whining. Shouldn't we *also* be
driving the NTP reset from that same signal?
Oliver Upton March 21, 2022, 9:23 p.m. UTC | #21
On Mon, Mar 21, 2022 at 07:43:21PM +0000, David Woodhouse wrote:
> On Mon, 2022-03-21 at 00:38 +0000, Oliver Upton wrote:
> > On Sun, Mar 20, 2022 at 09:46:35AM -0000, David Woodhouse wrote:
> > > But coincidentally since then I have started having conversations with
> > > people who really want the guest to have an immediate knowledge of the
> > > adjtimex maxerror etc. on the new host immediately after the migration.
> > > Maybe the "if the migration isn't fast enough then let the guest know it's
> > > now unsynced" is OK, but I'll need to work out what "immediately" means
> > > when we have a guest userspace component involved in it.
> > 
> > This has also been an area of interest to me. I think we've all seen the
> > many ways in which doing migrations behind the guest's back can put software
> > in an extremely undesirable state on the other end. If those
> > conversations are taking place on the mailing lists, could you please CC
> > me?
> > 
> > Our (Google) TSC adjustment clamping and userspace notification mechanism
> > was a halfway kludge to keep things happy on the other end. And it
> > generally has worked well, but misses a fundamental point.
> > 
> > The hypervisor should tell the guest kernel about time travel and let it
> > cascade that information throughout the guest system. Regardless of what
> > we do to the TSC, we invariably destroy one of the two guest clocks along
> > the way. If we told the guest "you time traveled X seconds", it could
> > fold that into its own idea of real time. Guest kernel can then fire off
> > events to inform software that wants to keep up with clock changes, and
> > even a new event to let NTP know it's probably running on different
> > hardware.
> > 
> > Time sucks :-)
> 
> So, we already have PVCLOCK_GUEST_STOPPED which tells the guest that
> its clock may have experienced a jump. Linux guests will use this to
> kick various watchdogs to prevent them whining. Shouldn't we *also* be
> driving the NTP reset from that same signal?

Right, but I'd argue that interface has some problems too. It
depends on the guest polling instead of an interrupt from the
hypervisor. It also has no way of informing the kernel exactly how much
time has elapsed.

The whole point of all these hacks that we've done internally is that we,
the hypervisor, know full well how much real time hasv advanced during the
VM blackout. If we can at least let the guest know how much to fudge real
time, it can then poke NTP for better refinement. I worry about using NTP
as the sole source of truth for such a mechanism, since you'll need to go
out to the network and any reads until the response comes back are hosed.

--
Thanks,
Oliver

Patch

diff --git a/Documentation/virt/kvm/devices/vcpu.rst b/Documentation/virt/kvm/devices/vcpu.rst
index 60a29972d3f1..85199e9b7f8d 100644
--- a/Documentation/virt/kvm/devices/vcpu.rst
+++ b/Documentation/virt/kvm/devices/vcpu.rst
@@ -180,13 +180,18 @@  Returns:
 	 ======= ======================================
 
 Specifies the guest's TSC offset relative to the host's TSC. The guest's
-TSC is then derived by the following equation:
+TSC is then derived by the following equation where tsc_scale is the
+ratio between the VM and host TSC frequencies:
 
-  guest_tsc = host_tsc + KVM_VCPU_TSC_OFFSET
+  guest_tsc = (host_tsc * tsc_scale) + KVM_VCPU_TSC_OFFSET
 
-This attribute is useful to adjust the guest's TSC on live migration,
-so that the TSC counts the time during which the VM was paused. The
-following describes a possible algorithm to use for this purpose.
+The VM's TSC frequency is configured with the KVM_{GET,SET}_TSC_KHZ
+vCPU or VM ioctls.
+
+The KVM_VCPU_TSC_OFFSET attribute is useful to adjust the guest's TSC
+on live migration, so that the TSC counts the time during which the VM
+was paused. The following describes a possible algorithm to use for this
+purpose.
 
 From the source VMM process:
 
@@ -202,7 +207,10 @@  From the source VMM process:
 
 From the destination VMM process:
 
-4. Invoke the KVM_SET_CLOCK ioctl, providing the source nanoseconds from
+4. Invoke the KVM_SET_TSC_KHZ ioctl to set the guest TSC frequency to
+   the value recorded in step 3 (freq).
+
+5. Invoke the KVM_SET_CLOCK ioctl, providing the source nanoseconds from
    kvmclock (guest_src) and CLOCK_REALTIME (host_src) in their respective
    fields.  Ensure that the KVM_CLOCK_REALTIME flag is set in the provided
    structure.
@@ -214,10 +222,10 @@  From the destination VMM process:
    between the source pausing the VMs and the destination executing
    steps 4-7.
 
-5. Invoke the KVM_GET_CLOCK ioctl to record the host TSC (tsc_dest) and
+6. Invoke the KVM_GET_CLOCK ioctl to record the host TSC (tsc_dest) and
    kvmclock nanoseconds (guest_dest).
 
-6. Adjust the guest TSC offsets for every vCPU to account for (1) time
+7. Adjust the guest TSC offsets for every vCPU to account for (1) time
    elapsed since recording state and (2) difference in TSCs between the
    source and destination machine:
 
@@ -229,5 +237,5 @@  From the destination VMM process:
    a time of 0 in kvmclock.  The above formula ensures that it is the
    same on the destination as it was on the source).
 
-7. Write the KVM_VCPU_TSC_OFFSET attribute for every vCPU with the
+8. Write the KVM_VCPU_TSC_OFFSET attribute for every vCPU with the
    respective value derived in the previous step.
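
For reference, step 8 maps onto the existing vCPU device-attribute interface
roughly as in the sketch below (error handling elided; vcpu_fd is assumed to
be an open KVM vCPU file descriptor):

#include <linux/kvm.h>
#include <stdint.h>
#include <sys/ioctl.h>

/* Write the offset derived in step 7 for one vCPU. */
static int set_tsc_offset(int vcpu_fd, uint64_t ofs)
{
	struct kvm_device_attr attr = {
		.group = KVM_VCPU_TSC_CTRL,
		.attr = KVM_VCPU_TSC_OFFSET,
		.addr = (uint64_t)&ofs,
	};

	return ioctl(vcpu_fd, KVM_SET_DEVICE_ATTR, &attr);
}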