
[v8,7/7] KVM: x86: Expose TSC offset controls to userspace

Message ID 20210916181538.968978-8-oupton@google.com (mailing list archive)
State New, archived
Series KVM: x86: Add idempotent controls for migrating system counter state

Commit Message

Oliver Upton Sept. 16, 2021, 6:15 p.m. UTC
To date, VMM-directed TSC synchronization and migration has been a bit
messy. KVM has some baked-in heuristics around TSC writes to infer if
the VMM is attempting to synchronize. This is problematic, as it depends
on host userspace writing to the guest's TSC within 1 second of the last
write.

A much cleaner approach to configuring the guest's views of the TSC is to
simply migrate the TSC offset for every vCPU. Offsets are idempotent,
and thus not subject to change depending on when the VMM actually
reads/writes values from/to KVM. The VMM can then read the TSC once with
KVM_GET_CLOCK to capture a (realtime, host_tsc) pair at the instant when
the guest is paused.
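
As a rough illustration of that capture step (an editorial sketch, not
part of the patch; it assumes the kvm_clock_data realtime/host_tsc
fields added earlier in this series and omits error handling):

#include <linux/kvm.h>
#include <sys/ioctl.h>

/* Illustrative helper: capture a (realtime, host_tsc) pair while the
 * guest is paused, via a single KVM_GET_CLOCK call on the VM fd. */
static int capture_clock_pair(int vm_fd, __u64 *realtime, __u64 *host_tsc)
{
	struct kvm_clock_data data = { 0 };

	if (ioctl(vm_fd, KVM_GET_CLOCK, &data))
		return -1;

	/* Validity of these fields is reported via data.flags. */
	*realtime = data.realtime;
	*host_tsc = data.host_tsc;
	return 0;
}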

Cc: David Matlack <dmatlack@google.com>
Cc: Sean Christopherson <seanjc@google.com>
Signed-off-by: Oliver Upton <oupton@google.com>
---
 Documentation/virt/kvm/devices/vcpu.rst |  57 ++++++++++++
 arch/x86/include/asm/kvm_host.h         |   1 +
 arch/x86/include/uapi/asm/kvm.h         |   4 +
 arch/x86/kvm/x86.c                      | 110 ++++++++++++++++++++++++
 4 files changed, 172 insertions(+)

Comments

Marcelo Tosatti Sept. 30, 2021, 7:14 p.m. UTC | #1
On Thu, Sep 16, 2021 at 06:15:38PM +0000, Oliver Upton wrote:
> To date, VMM-directed TSC synchronization and migration has been a bit
> messy. KVM has some baked-in heuristics around TSC writes to infer if
> the VMM is attempting to synchronize. This is problematic, as it depends
> on host userspace writing to the guest's TSC within 1 second of the last
> write.
> 
> A much cleaner approach to configuring the guest's views of the TSC is to
> simply migrate the TSC offset for every vCPU. Offsets are idempotent,
> and thus not subject to change depending on when the VMM actually
> reads/writes values from/to KVM. The VMM can then read the TSC once with
> KVM_GET_CLOCK to capture a (realtime, host_tsc) pair at the instant when
> the guest is paused.
> 
> Cc: David Matlack <dmatlack@google.com>
> Cc: Sean Christopherson <seanjc@google.com>
> Signed-off-by: Oliver Upton <oupton@google.com>


> ---
>  Documentation/virt/kvm/devices/vcpu.rst |  57 ++++++++++++
>  arch/x86/include/asm/kvm_host.h         |   1 +
>  arch/x86/include/uapi/asm/kvm.h         |   4 +
>  arch/x86/kvm/x86.c                      | 110 ++++++++++++++++++++++++
>  4 files changed, 172 insertions(+)
> 
> diff --git a/Documentation/virt/kvm/devices/vcpu.rst b/Documentation/virt/kvm/devices/vcpu.rst
> index 2acec3b9ef65..3b399d727c11 100644
> --- a/Documentation/virt/kvm/devices/vcpu.rst
> +++ b/Documentation/virt/kvm/devices/vcpu.rst
> @@ -161,3 +161,60 @@ Specifies the base address of the stolen time structure for this VCPU. The
>  base address must be 64 byte aligned and exist within a valid guest memory
>  region. See Documentation/virt/kvm/arm/pvtime.rst for more information
>  including the layout of the stolen time structure.
> +
> +4. GROUP: KVM_VCPU_TSC_CTRL
> +===========================
> +
> +:Architectures: x86
> +
> +4.1 ATTRIBUTE: KVM_VCPU_TSC_OFFSET
> +
> +:Parameters: 64-bit unsigned TSC offset
> +
> +Returns:
> +
> +	 ======= ======================================
> +	 -EFAULT Error reading/writing the provided
> +		 parameter address.
> +	 -ENXIO  Attribute not supported
> +	 ======= ======================================
> +
> +Specifies the guest's TSC offset relative to the host's TSC. The guest's
> +TSC is then derived by the following equation:
> +
> +  guest_tsc = host_tsc + KVM_VCPU_TSC_OFFSET
> +
> +This attribute is useful for the precise migration of a guest's TSC. The
> +following describes a possible algorithm to use for the migration of a
> +guest's TSC:
> +
> +From the source VMM process:
> +
> +1. Invoke the KVM_GET_CLOCK ioctl to record the host TSC (t_0),
> +   kvmclock nanoseconds (k_0), and realtime nanoseconds (r_0).
> +
> +2. Read the KVM_VCPU_TSC_OFFSET attribute for every vCPU to record the
> +   guest TSC offset (off_n).
> +
> +3. Invoke the KVM_GET_TSC_KHZ ioctl to record the frequency of the
> +   guest's TSC (freq).
> +
> +From the destination VMM process:
> +
> +4. Invoke the KVM_SET_CLOCK ioctl, providing the kvmclock nanoseconds
> +   (k_0) and realtime nanoseconds (r_0) in their respective fields.
> +   Ensure that the KVM_CLOCK_REALTIME flag is set in the provided
> +   structure. KVM will advance the VM's kvmclock to account for elapsed
> +   time since recording the clock values.
> +
> +5. Invoke the KVM_GET_CLOCK ioctl to record the host TSC (t_1) and
> +   kvmclock nanoseconds (k_1).
> +
> +6. Adjust the guest TSC offsets for every vCPU to account for (1) time
> +   elapsed since recording state and (2) difference in TSCs between the
> +   source and destination machine:
> +
> +   new_off_n = t_0 + off_n + (k_1 - k_0) * freq - t_1

Hi Oliver,

This won't advance the TSC values themselves, right?
This (advancing the TSC values by the realtime elapsed time) would be
awesome because TSC clock_gettime() vdso is faster, and some
applications prefer to just read from TSC directly.
See "x86: kvmguest: use TSC clocksource if invariant TSC is exposed".

The advancement with this patchset only applies to kvmclock.
Paolo Bonzini Oct. 1, 2021, 9:17 a.m. UTC | #2
On 30/09/21 21:14, Marcelo Tosatti wrote:
>> +   new_off_n = t_0 + off_n + (k_1 - k_0) * freq - t_1
> Hi Oliver,
> 
> This won't advance the TSC values themselves, right?

Why not?  It affects the TSC offset in the vmcs, so the TSC in the VM is 
advanced too.

Paolo

> This (advancing the TSC values by the realtime elapsed time) would be
> awesome because TSC clock_gettime() vdso is faster, and some
> applications prefer to just read from TSC directly.
> See "x86: kvmguest: use TSC clocksource if invariant TSC is exposed".
> 
> The advancement with this patchset only applies to kvmclock.
>
Marcelo Tosatti Oct. 1, 2021, 10:32 a.m. UTC | #3
On Fri, Oct 01, 2021 at 11:17:33AM +0200, Paolo Bonzini wrote:
> On 30/09/21 21:14, Marcelo Tosatti wrote:
> > > +   new_off_n = t_0 + off_n + (k_1 - k_0) * freq - t_1
> > Hi Oliver,
> > 
> > This won't advance the TSC values themselves, right?
> 
> Why not?  It affects the TSC offset in the vmcs, so the TSC in the VM is
> advanced too.
> 
> Paolo

+4. Invoke the KVM_SET_CLOCK ioctl, providing the kvmclock nanoseconds
+   (k_0) and realtime nanoseconds (r_0) in their respective fields.
+   Ensure that the KVM_CLOCK_REALTIME flag is set in the provided
+   structure. KVM will advance the VM's kvmclock to account for elapsed
+   time since recording the clock values.

You can't advance both kvmclock (kvmclock_offset variable) and the TSCs,
which would be double counting.

So you have to either add the elapsed realtime (1) between KVM_GET_CLOCK
and KVM_SET_CLOCK to kvmclock (which this patch is doing), or to the
TSCs. If you do both, there is double counting. Am I missing something?

To make it clearer: the TSC clocksource is faster than the kvmclock
source, so we'd rather use it when possible, which is achievable with
TSC scaling support on HW.

1: As mentioned earlier, just using the realtime clock delta between
hosts can introduce problems. So we need a scheme to:

	- Find the offset between host clocks, with upper and lower
	  bounds on error.
	- Take appropriate actions based on that (for example,
	  do not use KVM_CLOCK_REALTIME flag on KVM_SET_CLOCK
	  if the delta between hosts is large).

Which can be done in userspace or kernel space... (hum, but maybe
delegating this to userspace will introduce different solutions
to the same problem?).

> > This (advancing the TSC values by the realtime elapsed time) would be
> > awesome because TSC clock_gettime() vdso is faster, and some
> > applications prefer to just read from TSC directly.
> > See "x86: kvmguest: use TSC clocksource if invariant TSC is exposed".
> > 
> > The advancement with this patchset only applies to kvmclock.
> > 
> 
>
Paolo Bonzini Oct. 1, 2021, 3:12 p.m. UTC | #4
On 01/10/21 12:32, Marcelo Tosatti wrote:
>> +1. Invoke the KVM_GET_CLOCK ioctl to record the host TSC (t_0),
>> +   kvmclock nanoseconds (k_0), and realtime nanoseconds (r_0).
>>
>> [...]
>>
>> +4. Invoke the KVM_SET_CLOCK ioctl, providing the kvmclock nanoseconds
>> +   (k_0) and realtime nanoseconds (r_0) in their respective fields.
>> +   Ensure that the KVM_CLOCK_REALTIME flag is set in the provided
>> +   structure. KVM will advance the VM's kvmclock to account for elapsed
>> +   time since recording the clock values.
> 
> You can't advance both kvmclock (kvmclock_offset variable) and the
> TSCs, which would be double counting.
> 
> So you have to either add the elapsed realtime (1) between
> KVM_GET_CLOCK to kvmclock (which this patch is doing), or to the
> TSCs. If you do both, there is double counting. Am i missing
> something?

Probably one of these two (but it's worth pointing out both of them):

1) the attribute that's introduced here *replaces*
KVM_SET_MSR(MSR_IA32_TSC), so the TSC is not added.

2) the adjustment formula later in the algorithm does not care about how
much time passed between step 1 and step 4.  It just takes two well
known (TSC, kvmclock) pairs, and uses them to ensure the guest TSC is
the same on the destination as if the guest was still running on the
source.  It is irrelevant that one of them is before migration and one
is after; all that matters is that one is on the source and one is on the
destination.

Perhaps we can add to step 6 something like:

> +6. Adjust the guest TSC offsets for every vCPU to account for (1) time
> +   elapsed since recording state and (2) difference in TSCs between the
> +   source and destination machine:
> +
> +   new_off_n = t_0 + off_n + (k_1 - k_0) * freq - t_1
> +

"off + t - k * freq" is the guest TSC value corresponding to a time of 0
in kvmclock.  The above formula ensures that it is the same on the
destination as it was on the source.
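
To spell out the algebra behind that invariant (an editorial aside, in
the thread's notation): for a (host TSC, kvmclock) pair (t, k) and offset
off, the guest TSC at kvmclock time 0 is off + t - k * freq.  Requiring
that value to match across the migration gives

    t_0 + off_n - k_0 * freq = t_1 + new_off_n - k_1 * freq

    new_off_n = t_0 + off_n + (k_1 - k_0) * freq - t_1

which is exactly the adjustment in step 6.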

Also, the names are a bit hard to follow.  Perhaps

	t_0		tsc_src
	t_1		tsc_dest
	k_0		guest_src
	k_1		guest_dest
	r_0		host_src
	off_n		ofs_src[i]
	new_off_n	ofs_dest[i]

Paolo
Marcelo Tosatti Oct. 1, 2021, 7:11 p.m. UTC | #5
On Fri, Oct 01, 2021 at 05:12:20PM +0200, Paolo Bonzini wrote:
> On 01/10/21 12:32, Marcelo Tosatti wrote:
> > > +1. Invoke the KVM_GET_CLOCK ioctl to record the host TSC (t_0),
> > > +   kvmclock nanoseconds (k_0), and realtime nanoseconds (r_0).
> > >
> > > [...]
> > >
> > > +4. Invoke the KVM_SET_CLOCK ioctl, providing the kvmclock nanoseconds
> > > +   (k_0) and realtime nanoseconds (r_0) in their respective fields.
> > > +   Ensure that the KVM_CLOCK_REALTIME flag is set in the provided
> > > +   structure. KVM will advance the VM's kvmclock to account for elapsed
> > > +   time since recording the clock values.
> > 
> > You can't advance both kvmclock (kvmclock_offset variable) and the
> > TSCs, which would be double counting.
> > 
> > So you have to either add the elapsed realtime (1) between
> > KVM_GET_CLOCK to kvmclock (which this patch is doing), or to the
> > TSCs. If you do both, there is double counting. Am i missing
> > something?
> 
> Probably one of these two (but it's worth pointing out both of them):
> 
> 1) the attribute that's introduced here *replaces*
> KVM_SET_MSR(MSR_IA32_TSC), so the TSC is not added.
> 
> 2) the adjustment formula later in the algorithm does not care about how
> much time passed between step 1 and step 4.  It just takes two well
> known (TSC, kvmclock) pairs, and uses them to ensure the guest TSC is
> the same on the destination as if the guest was still running on the
> source.  It is irrelevant that one of them is before migration and one
> is after, all it matters is that one is on the source and one is on the
> destination.

OK, so it still relies on the NTP daemon to fix the CLOCK_REALTIME delay
which is introduced during migration (which is what I would guess is
the lower-hanging fruit) (for guests using TSC).

My point was that, by advancing the _TSC value_ by:

T0. stop guest vcpus	(source)
T1. KVM_GET_CLOCK	(source)
T2. KVM_SET_CLOCK	(destination)
T3. Write guest TSCs	(destination)
T4. resume guest	(destination)

new_off_n = t_0 + off_n + (k_1 - k_0) * freq - t_1

t_0:    host TSC at KVM_GET_CLOCK time.
off_n:  TSC offset at vcpu-n (as long as no guest TSC writes are performed,
TSC offset is fixed).
...

+4. Invoke the KVM_SET_CLOCK ioctl, providing the kvmclock nanoseconds
+   (k_0) and realtime nanoseconds (r_0) in their respective fields.
+   Ensure that the KVM_CLOCK_REALTIME flag is set in the provided
+   structure. KVM will advance the VM's kvmclock to account for elapsed
+   time since recording the clock values.

Only kvmclock is advanced (by passing r_0). But a guest might not use kvmclock
(hopefully modern guests on modern hosts will use TSC clocksource,
whose clock_gettime is faster... some people are using that already).

At some point QEMU should enable invariant TSC flag by default?

That said, the point is: why not advance the _TSC_ values
(instead of kvmclock nanoseconds), as doing so would reduce
the "CLOCK_REALTIME delay which is introduced during migration"
for both kvmclock users and modern tsc clocksource users.

So yes, I also like this patchset, but would like it even more
if it fixed the case above as well (and I am not sure whether adding
the migration delta to KVMCLOCK makes it harder to fix the TSC case
later).

> Perhaps we can add to step 6 something like:
> 
> > +6. Adjust the guest TSC offsets for every vCPU to account for (1) time
> > +   elapsed since recording state and (2) difference in TSCs between the
> > +   source and destination machine:
> > +
> > +   new_off_n = t_0 + off_n + (k_1 - k_0) * freq - t_1
> > +
> 
> "off + t - k * freq" is the guest TSC value corresponding to a time of 0
> in kvmclock.  The above formula ensures that it is the same on the
> destination as it was on the source.
> 
> Also, the names are a bit hard to follow.  Perhaps
> 
> 	t_0		tsc_src
> 	t_1		tsc_dest
> 	k_0		guest_src
> 	k_1		guest_dest
> 	r_0		host_src
> 	off_n		ofs_src[i]
> 	new_off_n	ofs_dest[i]
> 
> Paolo
> 
>
Oliver Upton Oct. 1, 2021, 7:33 p.m. UTC | #6
Marcelo,

On Fri, Oct 1, 2021 at 12:11 PM Marcelo Tosatti <mtosatti@redhat.com> wrote:
>
> On Fri, Oct 01, 2021 at 05:12:20PM +0200, Paolo Bonzini wrote:
> > On 01/10/21 12:32, Marcelo Tosatti wrote:
> > > > +1. Invoke the KVM_GET_CLOCK ioctl to record the host TSC (t_0),
> > > > +   kvmclock nanoseconds (k_0), and realtime nanoseconds (r_0).
> > > >
> > > > [...]
> > > >
> > > > +4. Invoke the KVM_SET_CLOCK ioctl, providing the kvmclock nanoseconds
> > > > +   (k_0) and realtime nanoseconds (r_0) in their respective fields.
> > > > +   Ensure that the KVM_CLOCK_REALTIME flag is set in the provided
> > > > +   structure. KVM will advance the VM's kvmclock to account for elapsed
> > > > +   time since recording the clock values.
> > >
> > > You can't advance both kvmclock (kvmclock_offset variable) and the
> > > TSCs, which would be double counting.
> > >
> > > So you have to either add the elapsed realtime (1) between
> > > KVM_GET_CLOCK to kvmclock (which this patch is doing), or to the
> > > TSCs. If you do both, there is double counting. Am i missing
> > > something?
> >
> > Probably one of these two (but it's worth pointing out both of them):
> >
> > 1) the attribute that's introduced here *replaces*
> > KVM_SET_MSR(MSR_IA32_TSC), so the TSC is not added.
> >
> > 2) the adjustment formula later in the algorithm does not care about how
> > much time passed between step 1 and step 4.  It just takes two well
> > known (TSC, kvmclock) pairs, and uses them to ensure the guest TSC is
> > the same on the destination as if the guest was still running on the
> > source.  It is irrelevant that one of them is before migration and one
> > is after, all it matters is that one is on the source and one is on the
> > destination.
>
> OK, so it still relies on NTPd daemon to fix the CLOCK_REALTIME delay
> which is introduced during migration (which is what i would guess is
> the lower hanging fruit) (for guests using TSC).

The series gives userspace the ability to modify the guest's
perception of the TSC in whatever way it sees fit. The algorithm in
the documentation provides a suggestion to userspace on how to do
exactly that. I kept that advancement logic out of the kernel because
IMO it is an implementation detail: users have differing opinions on
how clocks should behave across a migration and KVM shouldn't have any
baked-in rules around it.

At the same time, userspace can choose to _not_ jump the TSC and use
the available interfaces to just migrate the existing state of the
TSCs.

When I had initially proposed this series upstream, Paolo astutely
pointed out that there was no good way to get a (CLOCK_REALTIME, TSC)
pairing, which is critical for the TSC advancement algorithm in the
documentation. Google's best way to get (CLOCK_REALTIME, TSC) exists
in userspace [1], hence the missing kvm clock changes. So, in all, the
spirit of the KVM clock changes is to provide missing UAPI around the
clock/TSC, with the side effect of changing the guest-visible value.

[1] https://cloud.google.com/spanner/docs/true-time-external-consistency

> My point was that, by advancing the _TSC value_ by:
>
> T0. stop guest vcpus    (source)
> T1. KVM_GET_CLOCK       (source)
> T2. KVM_SET_CLOCK       (destination)
> T3. Write guest TSCs    (destination)
> T4. resume guest        (destination)
>
> new_off_n = t_0 + off_n + (k_1 - k_0) * freq - t_1
>
> t_0:    host TSC at KVM_GET_CLOCK time.
> off_n:  TSC offset at vcpu-n (as long as no guest TSC writes are performed,
> TSC offset is fixed).
> ...
>
> +4. Invoke the KVM_SET_CLOCK ioctl, providing the kvmclock nanoseconds
> +   (k_0) and realtime nanoseconds (r_0) in their respective fields.
> +   Ensure that the KVM_CLOCK_REALTIME flag is set in the provided
> +   structure. KVM will advance the VM's kvmclock to account for elapsed
> +   time since recording the clock values.
>
> Only kvmclock is advanced (by passing r_0). But a guest might not use kvmclock
> (hopefully modern guests on modern hosts will use TSC clocksource,
> whose clock_gettime is faster... some people are using that already).
>

Hopefully the above explanation made it clearer how the TSCs are
supposed to get advanced, and why it isn't done in the kernel.

> At some point QEMU should enable invariant TSC flag by default?
>
> That said, the point is: why not advance the _TSC_ values
> (instead of kvmclock nanoseconds), as doing so would reduce
> the "the CLOCK_REALTIME delay which is introduced during migration"
> for both kvmclock users and modern tsc clocksource users.
>
> So yes, i also like this patchset, but would like it even more
> if it fixed the case above as well (and not sure whether adding
> the migration delta to KVMCLOCK makes it harder to fix TSC case
> later).
>
> > Perhaps we can add to step 6 something like:
> >
> > > +6. Adjust the guest TSC offsets for every vCPU to account for (1) time
> > > +   elapsed since recording state and (2) difference in TSCs between the
> > > +   source and destination machine:
> > > +
> > > +   new_off_n = t_0 + off_n + (k_1 - k_0) * freq - t_1
> > > +
> >
> > "off + t - k * freq" is the guest TSC value corresponding to a time of 0
> > in kvmclock.  The above formula ensures that it is the same on the
> > destination as it was on the source.
> >
> > Also, the names are a bit hard to follow.  Perhaps
> >
> >       t_0             tsc_src
> >       t_1             tsc_dest
> >       k_0             guest_src
> >       k_1             guest_dest
> >       r_0             host_src
> >       off_n           ofs_src[i]
> >       new_off_n       ofs_dest[i]
> >
> > Paolo
> >

Yeah, sounds good to me. Shall I respin the whole series from what you
have in kvm/queue, or just send you the bits and pieces that ought to
be applied?

--
Thanks,
Oliver
Paolo Bonzini Oct. 4, 2021, 11:44 a.m. UTC | #7
On 01/10/21 21:11, Marcelo Tosatti wrote:
> That said, the point is: why not advance the _TSC_ values
> (instead of kvmclock nanoseconds), as doing so would reduce
> the "CLOCK_REALTIME delay which is introduced during migration"
> for both kvmclock users and modern tsc clocksource users.

It already does, that's the cool part.  Take again the formula here:

    guest_off_1 = t_0 + guest_off_0 + (k_1 - k_0) * freq - t_1

and set:

	t_1 = t_0 + host_off_0_1 + (k_1 - k_0) * freq

i.e. t_0 and t_1 are different because 1) the machines were booted at
different times, which is host_off_0_1, and 2) t_1 includes the migration
downtime between k_0 and k_1.

Now you have:

    guest_off_1 = t_0 + guest_off_0 + (k_1 - k_0) * freq
		- t_0 - host_off_0_1 - (k_1 - k_0) * freq

    guest_off_1 = guest_off_0 - host_off_0_1

That is, the TSC is exactly the same as it was on the source, just 
adjusted because the two machines were booted at different times.

The need to have precise (ns, cycle) pairings is exactly because it
ensures that everything cancels in the formula, and all that is left is
the difference between the TSCs of the two hosts.

Paolo

> So yes, i also like this patchset, but would like it even more
> if it fixed the case above as well (and not sure whether adding
> the migration delta to KVMCLOCK makes it harder to fix TSC case
> later).
Marcelo Tosatti Oct. 4, 2021, 2:30 p.m. UTC | #8
On Fri, Oct 01, 2021 at 12:33:28PM -0700, Oliver Upton wrote:
> Marcelo,
> 
> On Fri, Oct 1, 2021 at 12:11 PM Marcelo Tosatti <mtosatti@redhat.com> wrote:
> >
> > On Fri, Oct 01, 2021 at 05:12:20PM +0200, Paolo Bonzini wrote:
> > > On 01/10/21 12:32, Marcelo Tosatti wrote:
> > > > > +1. Invoke the KVM_GET_CLOCK ioctl to record the host TSC (t_0),
> > > > > +   kvmclock nanoseconds (k_0), and realtime nanoseconds (r_0).
> > > > >
> > > > > [...]
> > > > >
> > > > > +4. Invoke the KVM_SET_CLOCK ioctl, providing the kvmclock nanoseconds
> > > > > +   (k_0) and realtime nanoseconds (r_0) in their respective fields.
> > > > > +   Ensure that the KVM_CLOCK_REALTIME flag is set in the provided
> > > > > +   structure. KVM will advance the VM's kvmclock to account for elapsed
> > > > > +   time since recording the clock values.
> > > >
> > > > You can't advance both kvmclock (kvmclock_offset variable) and the
> > > > TSCs, which would be double counting.
> > > >
> > > > So you have to either add the elapsed realtime (1) between
> > > > KVM_GET_CLOCK to kvmclock (which this patch is doing), or to the
> > > > TSCs. If you do both, there is double counting. Am i missing
> > > > something?
> > >
> > > Probably one of these two (but it's worth pointing out both of them):
> > >
> > > 1) the attribute that's introduced here *replaces*
> > > KVM_SET_MSR(MSR_IA32_TSC), so the TSC is not added.
> > >
> > > 2) the adjustment formula later in the algorithm does not care about how
> > > much time passed between step 1 and step 4.  It just takes two well
> > > known (TSC, kvmclock) pairs, and uses them to ensure the guest TSC is
> > > the same on the destination as if the guest was still running on the
> > > source.  It is irrelevant that one of them is before migration and one
> > > is after, all it matters is that one is on the source and one is on the
> > > destination.
> >
> > OK, so it still relies on NTPd daemon to fix the CLOCK_REALTIME delay
> > which is introduced during migration (which is what i would guess is
> > the lower hanging fruit) (for guests using TSC).
> 
> The series gives userspace the ability to modify the guest's
> perception of the TSC in whatever way it sees fit. The algorithm in
> the documentation provides a suggestion to userspace on how to do
> exactly that. I kept that advancement logic out of the kernel because
> IMO it is an implementation detail: users have differing opinions on
> how clocks should behave across a migration and KVM shouldn't have any
> baked-in rules around it.

Ok, was just trying to visualize how this would work with QEMU Linux guests.

> 
> At the same time, userspace can choose to _not_ jump the TSC and use
> the available interfaces to just migrate the existing state of the
> TSCs.
> 
> When I had initially proposed this series upstream, Paolo astutely
> pointed out that there was no good way to get a (CLOCK_REALTIME, TSC)
> pairing, which is critical for the TSC advancement algorithm in the
> documentation. Google's best way to get (CLOCK_REALTIME, TSC) exists
> in userspace [1], hence the missing kvm clock changes. So, in all, the
> spirit of the KVM clock changes is to provide missing UAPI around the
> clock/TSC, with the side effect of changing the guest-visible value.
> 
> [1] https://cloud.google.com/spanner/docs/true-time-external-consistency
> 
> > My point was that, by advancing the _TSC value_ by:
> >
> > T0. stop guest vcpus    (source)
> > T1. KVM_GET_CLOCK       (source)
> > T2. KVM_SET_CLOCK       (destination)
> > T3. Write guest TSCs    (destination)
> > T4. resume guest        (destination)
> >
> > new_off_n = t_0 + off_n + (k_1 - k_0) * freq - t_1
> >
> > t_0:    host TSC at KVM_GET_CLOCK time.
> > off_n:  TSC offset at vcpu-n (as long as no guest TSC writes are performed,
> > TSC offset is fixed).
> > ...
> >
> > +4. Invoke the KVM_SET_CLOCK ioctl, providing the kvmclock nanoseconds
> > +   (k_0) and realtime nanoseconds (r_0) in their respective fields.
> > +   Ensure that the KVM_CLOCK_REALTIME flag is set in the provided
> > +   structure. KVM will advance the VM's kvmclock to account for elapsed
> > +   time since recording the clock values.
> >
> > Only kvmclock is advanced (by passing r_0). But a guest might not use kvmclock
> > (hopefully modern guests on modern hosts will use TSC clocksource,
> > whose clock_gettime is faster... some people are using that already).
> >
> 
> Hopefully the above explanation made it clearer how the TSCs are
> supposed to get advanced, and why it isn't done in the kernel.
> 
> > At some point QEMU should enable invariant TSC flag by default?
> >
> > That said, the point is: why not advance the _TSC_ values
> > (instead of kvmclock nanoseconds), as doing so would reduce
> > the "the CLOCK_REALTIME delay which is introduced during migration"
> > for both kvmclock users and modern tsc clocksource users.
> >
> > So yes, i also like this patchset, but would like it even more
> > if it fixed the case above as well (and not sure whether adding
> > the migration delta to KVMCLOCK makes it harder to fix TSC case
> > later).
> >
> > > Perhaps we can add to step 6 something like:
> > >
> > > > +6. Adjust the guest TSC offsets for every vCPU to account for (1) time
> > > > +   elapsed since recording state and (2) difference in TSCs between the
> > > > +   source and destination machine:
> > > > +
> > > > +   new_off_n = t_0 + off_n + (k_1 - k_0) * freq - t_1
> > > > +
> > >
> > > "off + t - k * freq" is the guest TSC value corresponding to a time of 0
> > > in kvmclock.  The above formula ensures that it is the same on the
> > > destination as it was on the source.
> > >
> > > Also, the names are a bit hard to follow.  Perhaps
> > >
> > >       t_0             tsc_src
> > >       t_1             tsc_dest
> > >       k_0             guest_src
> > >       k_1             guest_dest
> > >       r_0             host_src
> > >       off_n           ofs_src[i]
> > >       new_off_n       ofs_dest[i]
> > >
> > > Paolo
> > >
> 
> Yeah, sounds good to me. Shall I respin the whole series from what you
> have in kvm/queue, or just send you the bits and pieces that ought to
> be applied?
> 
> --
> Thanks,
> Oliver
> 
>
Sean Christopherson Oct. 5, 2021, 3:22 p.m. UTC | #9
On Thu, Sep 16, 2021, Oliver Upton wrote:
> +static int kvm_arch_tsc_get_attr(struct kvm_vcpu *vcpu,
> +				 struct kvm_device_attr *attr)
> +{
> +	u64 __user *uaddr = (u64 __user *)attr->addr;

...

> +static int kvm_arch_tsc_set_attr(struct kvm_vcpu *vcpu,
> +				 struct kvm_device_attr *attr)
> +{
> +	u64 __user *uaddr = (u64 __user *)attr->addr;

These casts break 32-bit builds because they truncate attr->addr from a
64-bit integer to a 32-bit pointer.  The address should also be checked
to verify bits 63:32 are not set on 32-bit kernels.

arch/x86/kvm/x86.c: In function ‘kvm_arch_tsc_get_attr’:
arch/x86/kvm/x86.c:4947:22: error: cast to pointer from integer of different size [-Werror=int-to-pointer-cast]
 4947 |  u64 __user *uaddr = (u64 __user *)attr->addr;
      |                      ^
arch/x86/kvm/x86.c: In function ‘kvm_arch_tsc_set_attr’:
arch/x86/kvm/x86.c:4967:22: error: cast to pointer from integer of different size [-Werror=int-to-pointer-cast]
 4967 |  u64 __user *uaddr = (u64 __user *)attr->addr;
      |                      ^


Not sure if there's a more elegant approach than casts galore?

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 8e5e462ffd65..3930e5dcdf0e 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -4944,9 +4944,12 @@ static int kvm_arch_tsc_has_attr(struct kvm_vcpu *vcpu,
 static int kvm_arch_tsc_get_attr(struct kvm_vcpu *vcpu,
                                 struct kvm_device_attr *attr)
 {
-       u64 __user *uaddr = (u64 __user *)attr->addr;
+       u64 __user *uaddr = (u64 __user *)(unsigned long)attr->addr;
        int r;

+       if ((u64)(unsigned long)uaddr != attr->addr)
+               return -EFAULT;
+
        switch (attr->attr) {
        case KVM_VCPU_TSC_OFFSET:
                r = -EFAULT;
@@ -4964,10 +4967,13 @@ static int kvm_arch_tsc_get_attr(struct kvm_vcpu *vcpu,
 static int kvm_arch_tsc_set_attr(struct kvm_vcpu *vcpu,
                                 struct kvm_device_attr *attr)
 {
-       u64 __user *uaddr = (u64 __user *)attr->addr;
+       u64 __user *uaddr = (u64 __user *)(unsigned long)attr->addr;
        struct kvm *kvm = vcpu->kvm;
        int r;

+       if ((u64)(unsigned long)uaddr != attr->addr)
+               return -EFAULT;
+
        switch (attr->attr) {
        case KVM_VCPU_TSC_OFFSET: {
                u64 offset, tsc, ns;
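
(Editorial aside: one possibly more elegant option is the kernel's
existing u64_to_user_ptr() helper, which hides the double cast; it only
typechecks and casts through (uintptr_t), so the explicit upper-bits
check is still needed on 32-bit kernels.  A hedged sketch with a
hypothetical helper:)

#include <linux/kernel.h>	/* u64_to_user_ptr() */

/* Illustrative helper: resolve attr->addr to a user pointer without
 * open-coded double casts.  Returns NULL (caller maps to -EFAULT) if
 * bits 63:32 are set on a 32-bit kernel. */
static u64 __user *tsc_attr_uaddr(struct kvm_device_attr *attr)
{
	if ((u64)(unsigned long)attr->addr != attr->addr)
		return NULL;

	return u64_to_user_ptr(attr->addr);
}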
David Woodhouse Feb. 23, 2022, 10:02 a.m. UTC | #10
On Thu, 2021-09-16 at 18:15 +0000, Oliver Upton wrote:
> To date, VMM-directed TSC synchronization and migration has been a bit
> messy. KVM has some baked-in heuristics around TSC writes to infer if
> the VMM is attempting to synchronize. This is problematic, as it depends
> on host userspace writing to the guest's TSC within 1 second of the last
> write.
> 
> A much cleaner approach to configuring the guest's views of the TSC is to
> simply migrate the TSC offset for every vCPU. Offsets are idempotent,
> and thus not subject to change depending on when the VMM actually
> reads/writes values from/to KVM. The VMM can then read the TSC once with
> KVM_GET_CLOCK to capture a (realtime, host_tsc) pair at the instant when
> the guest is paused.
> 
> Cc: David Matlack <dmatlack@google.com>
> Cc: Sean Christopherson <seanjc@google.com>
> Signed-off-by: Oliver Upton <oupton@google.com>
> ---
>  Documentation/virt/kvm/devices/vcpu.rst |  57 ++++++++++++
>  arch/x86/include/asm/kvm_host.h         |   1 +
>  arch/x86/include/uapi/asm/kvm.h         |   4 +
>  arch/x86/kvm/x86.c                      | 110 ++++++++++++++++++++++++
>  4 files changed, 172 insertions(+)
> 
> diff --git a/Documentation/virt/kvm/devices/vcpu.rst b/Documentation/virt/kvm/devices/vcpu.rst
> index 2acec3b9ef65..3b399d727c11 100644
> --- a/Documentation/virt/kvm/devices/vcpu.rst
> +++ b/Documentation/virt/kvm/devices/vcpu.rst
> @@ -161,3 +161,60 @@ Specifies the base address of the stolen time structure for this VCPU. The
>  base address must be 64 byte aligned and exist within a valid guest memory
>  region. See Documentation/virt/kvm/arm/pvtime.rst for more information
>  including the layout of the stolen time structure.
> +
> +4. GROUP: KVM_VCPU_TSC_CTRL
> +===========================
> +
> +:Architectures: x86
> +
> +4.1 ATTRIBUTE: KVM_VCPU_TSC_OFFSET
> +
> +:Parameters: 64-bit unsigned TSC offset
> +
> +Returns:
> +
> +	 ======= ======================================
> +	 -EFAULT Error reading/writing the provided
> +		 parameter address.
> +	 -ENXIO  Attribute not supported
> +	 ======= ======================================
> +
> +Specifies the guest's TSC offset relative to the host's TSC. The guest's
> +TSC is then derived by the following equation:
> +
> +  guest_tsc = host_tsc + KVM_VCPU_TSC_OFFSET

This isn't true. The guest TSC also depends on the *scaling* factor.
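
(Editorial note: with TSC scaling in use, KVM applies the offset after
scaling, so the relation is closer to

  guest_tsc = host_tsc * scaling_ratio + KVM_VCPU_TSC_OFFSET

matching what kvm_read_l1_tsc() computes.)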

Patch

diff --git a/Documentation/virt/kvm/devices/vcpu.rst b/Documentation/virt/kvm/devices/vcpu.rst
index 2acec3b9ef65..3b399d727c11 100644
--- a/Documentation/virt/kvm/devices/vcpu.rst
+++ b/Documentation/virt/kvm/devices/vcpu.rst
@@ -161,3 +161,60 @@  Specifies the base address of the stolen time structure for this VCPU. The
 base address must be 64 byte aligned and exist within a valid guest memory
 region. See Documentation/virt/kvm/arm/pvtime.rst for more information
 including the layout of the stolen time structure.
+
+4. GROUP: KVM_VCPU_TSC_CTRL
+===========================
+
+:Architectures: x86
+
+4.1 ATTRIBUTE: KVM_VCPU_TSC_OFFSET
+
+:Parameters: 64-bit unsigned TSC offset
+
+Returns:
+
+	 ======= ======================================
+	 -EFAULT Error reading/writing the provided
+		 parameter address.
+	 -ENXIO  Attribute not supported
+	 ======= ======================================
+
+Specifies the guest's TSC offset relative to the host's TSC. The guest's
+TSC is then derived by the following equation:
+
+  guest_tsc = host_tsc + KVM_VCPU_TSC_OFFSET
+
+This attribute is useful for the precise migration of a guest's TSC. The
+following describes a possible algorithm to use for the migration of a
+guest's TSC:
+
+From the source VMM process:
+
+1. Invoke the KVM_GET_CLOCK ioctl to record the host TSC (t_0),
+   kvmclock nanoseconds (k_0), and realtime nanoseconds (r_0).
+
+2. Read the KVM_VCPU_TSC_OFFSET attribute for every vCPU to record the
+   guest TSC offset (off_n).
+
+3. Invoke the KVM_GET_TSC_KHZ ioctl to record the frequency of the
+   guest's TSC (freq).
+
+From the destination VMM process:
+
+4. Invoke the KVM_SET_CLOCK ioctl, providing the kvmclock nanoseconds
+   (k_0) and realtime nanoseconds (r_0) in their respective fields.
+   Ensure that the KVM_CLOCK_REALTIME flag is set in the provided
+   structure. KVM will advance the VM's kvmclock to account for elapsed
+   time since recording the clock values.
+
+5. Invoke the KVM_GET_CLOCK ioctl to record the host TSC (t_1) and
+   kvmclock nanoseconds (k_1).
+
+6. Adjust the guest TSC offsets for every vCPU to account for (1) time
+   elapsed since recording state and (2) difference in TSCs between the
+   source and destination machine:
+
+   new_off_n = t_0 + off_n + (k_1 - k_0) * freq - t_1
+
+7. Write the KVM_VCPU_TSC_OFFSET attribute for every vCPU with the
+   respective value derived in the previous step.
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 5accfe7246ce..09c678f2e616 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1096,6 +1096,7 @@  struct kvm_arch {
 	u64 last_tsc_nsec;
 	u64 last_tsc_write;
 	u32 last_tsc_khz;
+	u64 last_tsc_offset;
 	u64 cur_tsc_nsec;
 	u64 cur_tsc_write;
 	u64 cur_tsc_offset;
diff --git a/arch/x86/include/uapi/asm/kvm.h b/arch/x86/include/uapi/asm/kvm.h
index 2ef1f6513c68..5a776a08f78c 100644
--- a/arch/x86/include/uapi/asm/kvm.h
+++ b/arch/x86/include/uapi/asm/kvm.h
@@ -504,4 +504,8 @@  struct kvm_pmu_event_filter {
 #define KVM_PMU_EVENT_ALLOW 0
 #define KVM_PMU_EVENT_DENY 1
 
+/* for KVM_{GET,SET,HAS}_DEVICE_ATTR */
+#define KVM_VCPU_TSC_CTRL 0 /* control group for the timestamp counter (TSC) */
+#define   KVM_VCPU_TSC_OFFSET 0 /* attribute for the TSC offset */
+
 #endif /* _ASM_X86_KVM_H */
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 1ea65bb2e74d..1177604c805a 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -2470,6 +2470,7 @@  static void __kvm_synchronize_tsc(struct kvm_vcpu *vcpu, u64 offset, u64 tsc,
 	kvm->arch.last_tsc_nsec = ns;
 	kvm->arch.last_tsc_write = tsc;
 	kvm->arch.last_tsc_khz = vcpu->arch.virtual_tsc_khz;
+	kvm->arch.last_tsc_offset = offset;
 
 	vcpu->arch.last_guest_tsc = tsc;
 
@@ -4069,6 +4070,7 @@  int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
 	case KVM_CAP_VM_COPY_ENC_CONTEXT_FROM:
 	case KVM_CAP_SREGS2:
 	case KVM_CAP_EXIT_ON_EMULATION_FAILURE:
+	case KVM_CAP_VCPU_ATTRIBUTES:
 		r = 1;
 		break;
 	case KVM_CAP_EXIT_HYPERCALL:
@@ -4933,6 +4935,109 @@  static int kvm_set_guest_paused(struct kvm_vcpu *vcpu)
 	return 0;
 }
 
+static int kvm_arch_tsc_has_attr(struct kvm_vcpu *vcpu,
+				 struct kvm_device_attr *attr)
+{
+	int r;
+
+	switch (attr->attr) {
+	case KVM_VCPU_TSC_OFFSET:
+		r = 0;
+		break;
+	default:
+		r = -ENXIO;
+	}
+
+	return r;
+}
+
+static int kvm_arch_tsc_get_attr(struct kvm_vcpu *vcpu,
+				 struct kvm_device_attr *attr)
+{
+	u64 __user *uaddr = (u64 __user *)attr->addr;
+	int r;
+
+	switch (attr->attr) {
+	case KVM_VCPU_TSC_OFFSET:
+		r = -EFAULT;
+		if (put_user(vcpu->arch.l1_tsc_offset, uaddr))
+			break;
+		r = 0;
+		break;
+	default:
+		r = -ENXIO;
+	}
+
+	return r;
+}
+
+static int kvm_arch_tsc_set_attr(struct kvm_vcpu *vcpu,
+				 struct kvm_device_attr *attr)
+{
+	u64 __user *uaddr = (u64 __user *)attr->addr;
+	struct kvm *kvm = vcpu->kvm;
+	int r;
+
+	switch (attr->attr) {
+	case KVM_VCPU_TSC_OFFSET: {
+		u64 offset, tsc, ns;
+		unsigned long flags;
+		bool matched;
+
+		r = -EFAULT;
+		if (get_user(offset, uaddr))
+			break;
+
+		raw_spin_lock_irqsave(&kvm->arch.tsc_write_lock, flags);
+
+		matched = (vcpu->arch.virtual_tsc_khz &&
+			   kvm->arch.last_tsc_khz == vcpu->arch.virtual_tsc_khz &&
+			   kvm->arch.last_tsc_offset == offset);
+
+		tsc = kvm_scale_tsc(vcpu, rdtsc(), vcpu->arch.l1_tsc_scaling_ratio) + offset;
+		ns = get_kvmclock_base_ns();
+
+		__kvm_synchronize_tsc(vcpu, offset, tsc, ns, matched);
+		raw_spin_unlock_irqrestore(&kvm->arch.tsc_write_lock, flags);
+
+		r = 0;
+		break;
+	}
+	default:
+		r = -ENXIO;
+	}
+
+	return r;
+}
+
+static int kvm_vcpu_ioctl_device_attr(struct kvm_vcpu *vcpu,
+				      unsigned int ioctl,
+				      void __user *argp)
+{
+	struct kvm_device_attr attr;
+	int r;
+
+	if (copy_from_user(&attr, argp, sizeof(attr)))
+		return -EFAULT;
+
+	if (attr.group != KVM_VCPU_TSC_CTRL)
+		return -ENXIO;
+
+	switch (ioctl) {
+	case KVM_HAS_DEVICE_ATTR:
+		r = kvm_arch_tsc_has_attr(vcpu, &attr);
+		break;
+	case KVM_GET_DEVICE_ATTR:
+		r = kvm_arch_tsc_get_attr(vcpu, &attr);
+		break;
+	case KVM_SET_DEVICE_ATTR:
+		r = kvm_arch_tsc_set_attr(vcpu, &attr);
+		break;
+	}
+
+	return r;
+}
+
 static int kvm_vcpu_ioctl_enable_cap(struct kvm_vcpu *vcpu,
 				     struct kvm_enable_cap *cap)
 {
@@ -5387,6 +5492,11 @@  long kvm_arch_vcpu_ioctl(struct file *filp,
 		r = __set_sregs2(vcpu, u.sregs2);
 		break;
 	}
+	case KVM_HAS_DEVICE_ATTR:
+	case KVM_GET_DEVICE_ATTR:
+	case KVM_SET_DEVICE_ATTR:
+		r = kvm_vcpu_ioctl_device_attr(vcpu, ioctl, argp);
+		break;
 	default:
 		r = -EINVAL;
 	}
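
For completeness, here is an editorial userspace sketch of the 7-step
migration algorithm documented above (not part of the patch; the helper
names are hypothetical, it assumes the kvm_clock_data realtime/host_tsc
fields added earlier in this series, a single vCPU, and omits error
handling):

#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

struct tsc_state {
	uint64_t t0, k0, r0;	/* source host TSC, kvmclock ns, realtime ns */
	uint64_t off;		/* source KVM_VCPU_TSC_OFFSET */
	uint64_t tsc_khz;	/* guest TSC frequency in kHz */
};

static void tsc_offset_attr(struct kvm_device_attr *attr, uint64_t *val)
{
	attr->flags = 0;
	attr->group = KVM_VCPU_TSC_CTRL;
	attr->attr  = KVM_VCPU_TSC_OFFSET;
	attr->addr  = (uint64_t)(uintptr_t)val;
}

/* Source side, with vCPUs already paused: steps 1-3. */
static void save_source(int vm_fd, int vcpu_fd, struct tsc_state *s)
{
	struct kvm_clock_data data = { 0 };
	struct kvm_device_attr attr;

	ioctl(vm_fd, KVM_GET_CLOCK, &data);		 /* step 1 */
	s->t0 = data.host_tsc;
	s->k0 = data.clock;
	s->r0 = data.realtime;

	tsc_offset_attr(&attr, &s->off);
	ioctl(vcpu_fd, KVM_GET_DEVICE_ATTR, &attr);	 /* step 2 */

	s->tsc_khz = ioctl(vcpu_fd, KVM_GET_TSC_KHZ, 0); /* step 3 */
}

/* Destination side, before resuming the guest: steps 4-7. */
static void restore_dest(int vm_fd, int vcpu_fd, const struct tsc_state *s)
{
	struct kvm_clock_data data = {
		.clock    = s->k0,
		.realtime = s->r0,
		.flags    = KVM_CLOCK_REALTIME,
	};
	struct kvm_device_attr attr;
	uint64_t t1, k1, new_off;

	ioctl(vm_fd, KVM_SET_CLOCK, &data);		 /* step 4 */

	ioctl(vm_fd, KVM_GET_CLOCK, &data);		 /* step 5 */
	t1 = data.host_tsc;
	k1 = data.clock;

	/*
	 * Step 6: (k1 - k0) is in nanoseconds, so scale by kHz / 1e6 to
	 * get TSC ticks.  Production code should use 128-bit math here
	 * to avoid overflow on large deltas.
	 */
	new_off = s->t0 + s->off + (k1 - s->k0) * s->tsc_khz / 1000000 - t1;

	tsc_offset_attr(&attr, &new_off);
	ioctl(vcpu_fd, KVM_SET_DEVICE_ATTR, &attr);	 /* step 7 */
}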