diff mbox series

[1/2] KVM: x86: Add KVM_[GS]ET_CLOCK_GUEST for KVM clock drift fixup

Message ID 20240408220705.7637-2-jalliste@amazon.com (mailing list archive)
State New, archived
Headers show
Series Add API to correct KVM/PV clock drift | expand

Commit Message

Jack Allister April 8, 2024, 10:07 p.m. UTC
There is a potential for drift between the TSC and a KVM/PV clock when the
guest TSC is scaled (as seen previously in [1]). Which fixed drift between
timers over the lifetime of a VM.

However, there is another factor which will cause a drift. In a situation
such as a kexec/live-update of the kernel or a live-migration of a VM the
PV clock information is recalculated by KVM (KVM_REQ_MASTERCLOCK_UPDATE).
This update samples a new system_time & tsc_timestamp to be used in the
structure.

For example, when a guest is running with a TSC frequency of 1.5GHz but the
host frequency is 3.0GHz upon an update of the PV time information a delta
of ~3500ns is observed between the TSC and the KVM/PV clock. There is no
reason why a fixup creating an accuracy of ±1ns cannot be achieved.

Additional interfaces are added to retrieve & fixup the PV time information
when a VMM may believe is appropriate (deserialization after live-update/
migration). KVM_GET_CLOCK_GUEST can be used for the VMM to retrieve the
currently used PV time information and then when the VMM believes a drift
may occur can then instruct KVM to perform a correction via the setter
KVM_SET_CLOCK_GUEST.

The KVM_SET_CLOCK_GUEST ioctl works under the following premise. The host
TSC & kernel timstamp are sampled at a singular point in time. Using the
already known scaling/offset for L1 the guest TSC is then derived from this
information.

From here two PV time information structures are created, one which is the
original time information structure prior to whatever may have caused a PV
clock re-calculation (live-update/migration). The second is then using the
singular point in time sampled just prior. An individual KVM/PV clock for
each of the PV time information structures using the singular guest TSC.

A delta is then determined between the two calculated PV times, which is
then used as a correction offset added onto the kvmclock_offset for the VM.

[1]: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=451a707813ae

Suggested-by: David Woodhouse <dwmw2@infradead.org>
Signed-off-by: Jack Allister <jalliste@amazon.com>
CC: Paul Durrant <paul@xen.org>
---
 Documentation/virt/kvm/api.rst | 43 +++++++++++++++++
 arch/x86/kvm/x86.c             | 87 ++++++++++++++++++++++++++++++++++
 include/uapi/linux/kvm.h       |  3 ++
 3 files changed, 133 insertions(+)

Comments

Dongli Zhang April 9, 2024, 12:34 a.m. UTC | #1
Hi Jack,

On 4/8/24 15:07, Jack Allister wrote:
> There is a potential for drift between the TSC and a KVM/PV clock when the
> guest TSC is scaled (as seen previously in [1]). Which fixed drift between
> timers over the lifetime of a VM.

Those patches mentioned "TSC scaling" mutiple times. Is it a necessary to
reproduce this issue? I do not think it is necessary. The tsc scaling may speed
up the drift, but not the root cause.

How about to cite the below patch as the beginning. The below patch only
*avoids* KVM_REQ_MASTERCLOCK_UPDATE in some situations, but never solve the
problem when KVM_REQ_MASTERCLOCK_UPDATE is triggered ... therefore we need this
patchset ...

KVM: x86: Don't unnecessarily force masterclock update on vCPU hotplug
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=c52ffadc65e28ab461fd055e9991e8d8106a0056

I think this patch is only closely related to KVM_REQ_MASTERCLOCK_UPDATE, not
TSC scaling.

> 
> However, there is another factor which will cause a drift. In a situation
> such as a kexec/live-update of the kernel or a live-migration of a VM the
> PV clock information is recalculated by KVM (KVM_REQ_MASTERCLOCK_UPDATE).
> This update samples a new system_time & tsc_timestamp to be used in the
> structure.
> 
> For example, when a guest is running with a TSC frequency of 1.5GHz but the
> host frequency is 3.0GHz upon an update of the PV time information a delta
> of ~3500ns is observed between the TSC and the KVM/PV clock. There is no
> reason why a fixup creating an accuracy of ±1ns cannot be achieved.

Same as above. I think the key is to explain the issue when
KVM_REQ_MASTERCLOCK_UPDATE is triggered, not to emphasize the TSC scaling.
Please correct me if I am wrong.

> 
> Additional interfaces are added to retrieve & fixup the PV time information
> when a VMM may believe is appropriate (deserialization after live-update/
> migration). KVM_GET_CLOCK_GUEST can be used for the VMM to retrieve the
> currently used PV time information and then when the VMM believes a drift
> may occur can then instruct KVM to perform a correction via the setter
> KVM_SET_CLOCK_GUEST.
> 
> The KVM_SET_CLOCK_GUEST ioctl works under the following premise. The host
> TSC & kernel timstamp are sampled at a singular point in time. Using the

Typo: "timstamp"

> already known scaling/offset for L1 the guest TSC is then derived from this

I assume you meant to derive guest TSC from TSC offset/scaling, not to derive
kvmclock. What does "TSC & kernel timstamp" mean?

> information.
> 
> From here two PV time information structures are created, one which is the
> original time information structure prior to whatever may have caused a PV
> clock re-calculation (live-update/migration). The second is then using the
> singular point in time sampled just prior. An individual KVM/PV clock for
> each of the PV time information structures using the singular guest TSC.
> 
> A delta is then determined between the two calculated PV times, which is
> then used as a correction offset added onto the kvmclock_offset for the VM.
> 
> [1]: https://urldefense.com/v3/__https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=451a707813ae__;!!ACWV5N9M2RV99hQ!OnMXeXj4Plz6xvAc5lYsKaR3d1GDGGGRhZkdLMbxr8Skc_VAv_O1H8qP9igQv4KPCtYDw2ShTUtEd2o3mD5R$ 
> 
> Suggested-by: David Woodhouse <dwmw2@infradead.org>
> Signed-off-by: Jack Allister <jalliste@amazon.com>
> CC: Paul Durrant <paul@xen.org>
> ---
>  Documentation/virt/kvm/api.rst | 43 +++++++++++++++++
>  arch/x86/kvm/x86.c             | 87 ++++++++++++++++++++++++++++++++++
>  include/uapi/linux/kvm.h       |  3 ++
>  3 files changed, 133 insertions(+)
> 
> diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> index 0b5a33ee71ee..5f74d8ac1002 100644
> --- a/Documentation/virt/kvm/api.rst
> +++ b/Documentation/virt/kvm/api.rst
> @@ -6352,6 +6352,49 @@ a single guest_memfd file, but the bound ranges must not overlap).
>  
>  See KVM_SET_USER_MEMORY_REGION2 for additional details.
>  
> +4.143 KVM_GET_CLOCK_GUEST
> +----------------------------
> +
> +:Capability: none
> +:Architectures: x86
> +:Type: vm ioctl
> +:Parameters: struct pvclock_vcpu_time_info (out)
> +:Returns: 0 on success, <0 on error
> +
> +Retrieves the current time information structure used for KVM/PV clocks.
> +On x86 a PV clock is derived from the current TSC and is then scaled based
> +upon the a specified multiplier and shift. The result of this is then added
> +to a system time.

Typo: "the a".

> +
> +The guest needs a way to determine the system time, multiplier and shift. This
> +can be done by multiple ways, for KVM guests this can be via an MSR write to
> +MSR_KVM_SYSTEM_TIME / MSR_KVM_SYSTEM_TIME_NEW which defines the guest physical
> +address KVM shall put the structure. On Xen guests this can be found in the Xen
> +vcpu_info.
> +
> +This is structure is useful information for a VMM to also know when taking into
> +account potential timer drift on live-update/migration.
> +
> +4.144 KVM_SET_CLOCK_GUEST
> +----------------------------
> +
> +:Capability: none
> +:Architectures: x86
> +:Type: vm ioctl
> +:Parameters: struct pvclock_vcpu_time_info (in)
> +:Returns: 0 on success, <0 on error
> +
> +Triggers KVM to perform a correction of the KVM/PV clock structure based upon a
> +known prior PV clock structure (see KVM_GET_CLOCK_GUEST).
> +
> +If a VM is utilizing TSC scaling there is a potential for a drift between the
> +KVM/PV clock and the TSC itself. This is due to the loss of precision when
> +performing a multiply and shift rather than divide for the TSC.
> +
> +To perform the correction a delta is calculated between the original time info
> +(which is assumed correct) at a singular point in time X. The KVM clock offset
> +is then offset by this delta.
> +
>  5. The kvm_run structure
>  ========================
>  
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 47d9f03b7778..5d2e10cd1c30 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -6988,6 +6988,87 @@ static int kvm_vm_ioctl_set_clock(struct kvm *kvm, void __user *argp)
>  	return 0;
>  }
>  
> +static struct kvm_vcpu *kvm_get_bsp_vcpu(struct kvm *kvm)
> +{
> +	struct kvm_vcpu *vcpu = NULL;
> +	int i;
> +
> +	for (i = 0; i < KVM_MAX_VCPUS; i++) {
> +		vcpu = kvm_get_vcpu_by_id(kvm, i);
> +		if (!vcpu)
> +			continue;
> +
> +		if (kvm_vcpu_is_reset_bsp(vcpu))
> +			break;
> +	}
> +
> +	return vcpu;
> +}

Would the above rely not only on TSC clocksource, but also
ka->use_master_clock==true?

 3125         ka->use_master_clock = host_tsc_clocksource && vcpus_matched
 3126                                 && !ka->backwards_tsc_observed
 3127                                 && !ka->boot_vcpu_runs_old_kvmclock;

Should the condition of (ka->use_master_clock==true) be checked in the ioctl?

> +
> +static int kvm_vm_ioctl_get_clock_guest(struct kvm *kvm, void __user *argp)
> +{
> +	struct kvm_vcpu *vcpu;
> +
> +	vcpu = kvm_get_bsp_vcpu(kvm);
> +	if (!vcpu)
> +		return -EINVAL;
> +
> +	if (!vcpu->arch.hv_clock.tsc_timestamp || !vcpu->arch.hv_clock.system_time)
> +		return -EIO;
> +
> +	if (copy_to_user(argp, &vcpu->arch.hv_clock, sizeof(vcpu->arch.hv_clock)))
> +		return -EFAULT;

What will happen if the vCPU=0 (e.g., BSP) thread is racing with here to update
the vcpu->arch.hv_clock?

It is a good idea to making assumption from the VMM (e.g., QEMU) side?

> +
> +	return 0;
> +}
> +
> +static int kvm_vm_ioctl_set_clock_guest(struct kvm *kvm, void __user *argp)
> +{
> +	struct kvm_vcpu *vcpu;
> +	struct pvclock_vcpu_time_info orig_pvti;
> +	struct pvclock_vcpu_time_info dummy_pvti;
> +	int64_t kernel_ns;
> +	uint64_t host_tsc, guest_tsc;
> +	uint64_t clock_orig, clock_dummy;
> +	int64_t correction;
> +	unsigned long i;

Please ignore me if there is not any chance to make the above (and other places
in the patchset) to honor reverse xmas tree style.

> +
> +	vcpu = kvm_get_bsp_vcpu(kvm);
> +	if (!vcpu)
> +		return -EINVAL;
> +
> +	if (copy_from_user(&orig_pvti, argp, sizeof(orig_pvti)))
> +		return -EFAULT;
> +
> +	/*
> +	 * Sample the kernel time and host TSC at a singular point.
> +	 * We then calculate the guest TSC using this exact point in time,
> +	 * From here we can then determine the delta using the
> +	 * PV time info requested from the user and what we currently have
> +	 * using the fixed point in time. This delta is then used as a
> +	 * correction factor to fixup the potential drift.
> +	 */
> +	if (!kvm_get_time_and_clockread(&kernel_ns, &host_tsc))
> +		return -EFAULT;
> +
> +	guest_tsc = kvm_read_l1_tsc(vcpu, host_tsc);
> +
> +	dummy_pvti = orig_pvti;
> +	dummy_pvti.tsc_timestamp = guest_tsc;
> +	dummy_pvti.system_time = kernel_ns + kvm->arch.kvmclock_offset;
> +
> +	clock_orig = __pvclock_read_cycles(&orig_pvti, guest_tsc);
> +	clock_dummy = __pvclock_read_cycles(&dummy_pvti, guest_tsc);
> +
> +	correction = clock_orig - clock_dummy;
> +	kvm->arch.kvmclock_offset += correction;

I am not sure if it is a good idea to rely on userspace VMM to decide the good
timepoint to issue the ioctl, without assuming any racing.

In addition to live migration, can the user call this API any time during the VM
is running (to fix the clock drift)? Therefore, any requirement to protect the
update of kvmclock_offset from racing?


Thank you very much!

Dongli Zhang


> +
> +	kvm_for_each_vcpu(i, vcpu, kvm)
> +		kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu);
> +
> +	return 0;
> +}
> +
>  int kvm_arch_vm_ioctl(struct file *filp, unsigned int ioctl, unsigned long arg)
>  {
>  	struct kvm *kvm = filp->private_data;
> @@ -7246,6 +7327,12 @@ int kvm_arch_vm_ioctl(struct file *filp, unsigned int ioctl, unsigned long arg)
>  	case KVM_GET_CLOCK:
>  		r = kvm_vm_ioctl_get_clock(kvm, argp);
>  		break;
> +	case KVM_SET_CLOCK_GUEST:
> +		r = kvm_vm_ioctl_set_clock_guest(kvm, argp);
> +		break;
> +	case KVM_GET_CLOCK_GUEST:
> +		r = kvm_vm_ioctl_get_clock_guest(kvm, argp);
> +		break;
>  	case KVM_SET_TSC_KHZ: {
>  		u32 user_tsc_khz;
>  
> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> index 2190adbe3002..0d306311e4d6 100644
> --- a/include/uapi/linux/kvm.h
> +++ b/include/uapi/linux/kvm.h
> @@ -1548,4 +1548,7 @@ struct kvm_create_guest_memfd {
>  	__u64 reserved[6];
>  };
>  
> +#define KVM_SET_CLOCK_GUEST       _IOW(KVMIO,  0xd5, struct pvclock_vcpu_time_info)
> +#define KVM_GET_CLOCK_GUEST       _IOR(KVMIO,  0xd6, struct pvclock_vcpu_time_info)
> +
>  #endif /* __LINUX_KVM_H */
David Woodhouse April 9, 2024, 3:50 a.m. UTC | #2
On Mon, 2024-04-08 at 17:34 -0700, Dongli Zhang wrote:
> Hi Jack,
> 
> On 4/8/24 15:07, Jack Allister wrote:
> > There is a potential for drift between the TSC and a KVM/PV clock when the
> > guest TSC is scaled (as seen previously in [1]). Which fixed drift between
> > timers over the lifetime of a VM.
> 
> Those patches mentioned "TSC scaling" mutiple times. Is it a necessary to
> reproduce this issue? I do not think it is necessary. The tsc scaling may speed
> up the drift, but not the root cause.

Right. See the three definitions of the KVM clock I described in
https://lore.kernel.org/kvm/7e0040f70c629d365e80d13b339a95e0affa6d61.camel@infradead.org/

One is based on the host CLOCK_MONOTONIC_RAW + ka->kvmclock_offset, the
second is based on the reference point in master_kernel_ns /
master_cycle_now, and runs at the rate of the (scaled) guest TSC.

It's the *third* definition, which I called 'Definition C', which is
purely a bug, and which comes from running at the rate of the
*unscaled* guest TSC. And which doesn't exist without TSC scaling.

I don't think this patch needs to talk about the TSC scaling issues at
all. Those, and the existence of 'Definition C', are just bugs.

For that matter, I'm not entirely sure I'd have mentioned 'drift'
either. Yes, the two non-buggy definitions do drift from each other but
that isn't the point of this patch. Let's try again to explain what
*is* the point of this patch...



In all sane environments, ka->use_master_clock is set and 'Definition
B' of the KVM clock is used. The KVM clock is a simple arithmetic
function of the guest TSC.

The existing KVM_GET_CLOCK function isn't just buggy because it uses
'Definition C' and ignores TSC scaling. It's also just fundamentally
wrong as an API.

First let's consider the case of live *update*, where we serialize the
guest and kexec into a new kernel, then resum the guest. So we don't
have to deal with the full horrors of TSC migration, and the broken
algorithm described in the kernel documentation for doing that; we can
just preserve the KVM_VCPU_TSC_OFFSET. (We'll come to that problem
later).

Now, how do we restore the KVM clock? There is no real excuse for this
not being cycle-accurate, and giving the *same* pvclock information
back to the guest as before. The TSC wasn't disrupted at all during the
steal time. And the KVM clock should be precisely the *same* arithmetic
function of the TSC. The pvclock structure shouldn't change by a single
bit! Although in fact we *will* tolerate the data structure changing,
but only if it gives the same *result* (±1ns for rounding as Jack
notes).

KVM_[GS]ET_CLOCK are fundamentally useless for this. KVM_SET_CLOCK can
attempt to set the KVM clock at a given UTC time, but UTC is an awful
reference. Its frequency is also skewed by NTP, which is probably going
to be unsynchronized on the newly-booted kernel and running at a
different rate w.r.t the TSC than it was on the source.

Using UTC was always stupid because of leap seconds; TAI would have
been better. And one might argue that that aside, KVM_[GS]ET_CLOCK
isn't *such* an awful thing to use for live *migration*.

But wait, we *also* have to migrate the TSC using a wallclock time as a
base, and that naturally introduces some unavoidable discontinuity in
the TSC. So why in $DEITY's name would we use wallclock time as the
reference for migrating the KVM clock, giving it a *different* time
jump from the TSC on which it is based?

Even for live migration, even though the accuracy of the guest clocks
are fundamentally limited by the accuracy of the NTP sync between the
source and destination hosts, why wouldn't the clock jump at least be
*consistent*? We want that precise arithmetic relationship from TSC to
KVM clock to be preserved even in the live *migration* case where we
have to accept that the TSC itself isn't cycle-accurate.

Because we don't have a way to accurately preserve the TSC→KVM clock
relationship, we are seeing the clocksource watchdog trigger in guests
and disable the TSC clocksource... because the *KVM clock* was used as
a reference and got corrupted :)

So, Jack's patch adds KVM_[GS]ET_CLOCK_GUEST which reads and writes the
actual struct pvclock_vcpu_time_info structure.

> How about to cite the below patch as the beginning. The below patch only
> *avoids* KVM_REQ_MASTERCLOCK_UPDATE in some situations, but never solve the
> problem when KVM_REQ_MASTERCLOCK_UPDATE is triggered ... therefore we need this
> patchset ...
> 
> KVM: x86: Don't unnecessarily force masterclock update on vCPU hotplug
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=c52ffadc65e28ab461fd055e9991e8d8106a0056
> 
> I think this patch is only closely related to KVM_REQ_MASTERCLOCK_UPDATE, not
> TSC scaling.
> 
> ...
> 
> Same as above. I think the key is to explain the issue when
> KVM_REQ_MASTERCLOCK_UPDATE is triggered, not to emphasize the TSC scaling.
> Please correct me if I am wrong.

Not sure; I think that's a separate issue too.

> > 
> > Additional interfaces are added to retrieve & fixup the PV time information
> > when a VMM may believe is appropriate (deserialization after live-update/
> > migration). KVM_GET_CLOCK_GUEST can be used for the VMM to retrieve the
> > currently used PV time information and then when the VMM believes a drift
> > may occur can then instruct KVM to perform a correction via the setter
> > KVM_SET_CLOCK_GUEST.
> > 
> > The KVM_SET_CLOCK_GUEST ioctl works under the following premise. The host
> > TSC & kernel timstamp are sampled at a singular point in time. Using the
> 
> Typo: "timstamp"
> 
> > already known scaling/offset for L1 the guest TSC is then derived from this
> 
> I assume you meant to derive guest TSC from TSC offset/scaling, not to derive
> kvmclock. What does "TSC & kernel timstamp" mean?

In ka->use_master_clock mode (Definition B), the KVM clock is defined
in terms of guest TSC ticks since a reference point in time, stored in
ka->master_cycle_now and ka->master_kernel_ns. That is, a TSC, and a
kernel (CLOCK_MONOTONIC_RAW, to be precise) timestamp.

Jack's KVM_SET_CLOCK_GUEST should be taking a new reference point with
kvm_get_time_and_clockread(). As we know, changing the reference point
like that has basically the same effect as a stray invocation of
KVM_REQ_MASTERCLOCK_UPDATE — it yanks the KVM clock back to 'Definition
A' based on the CLOCK_MONOTONIC_RAW. But we correct for that...

Next we calculate (correctly via guest TSC frequency) the KVM clock
value which would result with the newly changed reference point.

Then we calculate the *intended* KVM clock at that *same* host TSC
value, by converting to a guest TSC value and running it through the
pvclock information passed into the ioctl.

Then adjust ka->kvmclock_offset by the delta between the two.

And now the KVM clock at this moment is set to *precisely* what it
would was before the live {update,migration}, in relation to the guest
TSC. At least within 1ns.

> > information.
> > 
> > From here two PV time information structures are created, one which is the
> > original time information structure prior to whatever may have caused a PV
> > clock re-calculation (live-update/migration). The second is then using the
> > singular point in time sampled just prior. An individual KVM/PV clock for
> > each of the PV time information structures using the singular guest TSC.
> > 
> > A delta is then determined between the two calculated PV times, which is
> > then used as a correction offset added onto the kvmclock_offset for the VM.
> > 
> > [1]: https://urldefense.com/v3/__https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=451a707813ae__;!!ACWV5N9M2RV99hQ!OnMXeXj4Plz6xvAc5lYsKaR3d1GDGGGRhZkdLMbxr8Skc_VAv_O1H8qP9igQv4KPCtYDw2ShTUtEd2o3mD5R$ 
> > 
> > Suggested-by: David Woodhouse <dwmw2@infradead.org>
> > Signed-off-by: Jack Allister <jalliste@amazon.com>
> > CC: Paul Durrant <paul@xen.org>
> > ---
> >  Documentation/virt/kvm/api.rst | 43 +++++++++++++++++
> >  arch/x86/kvm/x86.c             | 87 ++++++++++++++++++++++++++++++++++
> >  include/uapi/linux/kvm.h       |  3 ++
> >  3 files changed, 133 insertions(+)
> > 
> > diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> > index 0b5a33ee71ee..5f74d8ac1002 100644
> > --- a/Documentation/virt/kvm/api.rst
> > +++ b/Documentation/virt/kvm/api.rst
> > @@ -6352,6 +6352,49 @@ a single guest_memfd file, but the bound ranges must not overlap).
> >  
> >  See KVM_SET_USER_MEMORY_REGION2 for additional details.
> >  
> > +4.143 KVM_GET_CLOCK_GUEST
> > +----------------------------
> > +
> > +:Capability: none
> > +:Architectures: x86
> > +:Type: vm ioctl
> > +:Parameters: struct pvclock_vcpu_time_info (out)
> > +:Returns: 0 on success, <0 on error
> > +
> > +Retrieves the current time information structure used for KVM/PV clocks.
> > +On x86 a PV clock is derived from the current TSC and is then scaled based
> > +upon the a specified multiplier and shift. The result of this is then added
> > +to a system time.
> 
> Typo: "the a".
> 
> > +
> > +The guest needs a way to determine the system time, multiplier and shift. This
> > +can be done by multiple ways, for KVM guests this can be via an MSR write to
> > +MSR_KVM_SYSTEM_TIME / MSR_KVM_SYSTEM_TIME_NEW which defines the guest physical
> > +address KVM shall put the structure. On Xen guests this can be found in the Xen
> > +vcpu_info.
> > +
> > +This is structure is useful information for a VMM to also know when taking into
> > +account potential timer drift on live-update/migration.
> > +
> > +4.144 KVM_SET_CLOCK_GUEST
> > +----------------------------
> > +
> > +:Capability: none
> > +:Architectures: x86
> > +:Type: vm ioctl
> > +:Parameters: struct pvclock_vcpu_time_info (in)
> > +:Returns: 0 on success, <0 on error
> > +
> > +Triggers KVM to perform a correction of the KVM/PV clock structure based upon a
> > +known prior PV clock structure (see KVM_GET_CLOCK_GUEST).
> > +
> > +If a VM is utilizing TSC scaling there is a potential for a drift between the
> > +KVM/PV clock and the TSC itself. This is due to the loss of precision when
> > +performing a multiply and shift rather than divide for the TSC.
> > +
> > +To perform the correction a delta is calculated between the original time info
> > +(which is assumed correct) at a singular point in time X. The KVM clock offset
> > +is then offset by this delta.
> > +
> >  5. The kvm_run structure
> >  ========================
> >  
> > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > index 47d9f03b7778..5d2e10cd1c30 100644
> > --- a/arch/x86/kvm/x86.c
> > +++ b/arch/x86/kvm/x86.c
> > @@ -6988,6 +6988,87 @@ static int kvm_vm_ioctl_set_clock(struct kvm *kvm, void __user *argp)
> >         return 0;
> >  }
> >  
> > +static struct kvm_vcpu *kvm_get_bsp_vcpu(struct kvm *kvm)
> > +{
> > +       struct kvm_vcpu *vcpu = NULL;
> > +       int i;
> > +
> > +       for (i = 0; i < KVM_MAX_VCPUS; i++) {
> > +               vcpu = kvm_get_vcpu_by_id(kvm, i);
> > +               if (!vcpu)
> > +                       continue;
> > +
> > +               if (kvm_vcpu_is_reset_bsp(vcpu))
> > +                       break;
> > +       }
> > +
> > +       return vcpu;
> > +}
> 
> Would the above rely not only on TSC clocksource, but also
> ka->use_master_clock==true?
> 
>  3125         ka->use_master_clock = host_tsc_clocksource && vcpus_matched
>  3126                                 && !ka->backwards_tsc_observed
>  3127                                 && !ka->boot_vcpu_runs_old_kvmclock;
> 
> Should the condition of (ka->use_master_clock==true) be checked in the ioctl?
> 
> > +
> > +static int kvm_vm_ioctl_get_clock_guest(struct kvm *kvm, void __user *argp)
> > +{
> > +       struct kvm_vcpu *vcpu;
> > +
> > +       vcpu = kvm_get_bsp_vcpu(kvm);
> > +       if (!vcpu)
> > +               return -EINVAL;
> > +
> > +       if (!vcpu->arch.hv_clock.tsc_timestamp || !vcpu->arch.hv_clock.system_time)
> > +               return -EIO;
> > +
> > +       if (copy_to_user(argp, &vcpu->arch.hv_clock, sizeof(vcpu->arch.hv_clock)))
> > +               return -EFAULT;
> 
> What will happen if the vCPU=0 (e.g., BSP) thread is racing with here to update
> the vcpu->arch.hv_clock?
> 
> It is a good idea to making assumption from the VMM (e.g., QEMU) side?

What if the vCPU has never set up the KVM clock and vcpu->arch.hv_clock
isn't even populated?

I don't think we should be using an actual vCPU at all.

I think we have to *create* the pvclock information, just just blindly
memcpy from some vCPU's ->arch.hv_clock.

Yes, we do need to scale via guest TSC frequency, but *only* in the
ka->use_master_clock case because otherwise, everything's hosed anyway.
In the ka->use_master_clock case, where we know all guests are running
at the same TSC frequency, why not just snapshot the scaling factor? 
We can fix the broken __get_kvmclock_ns() that way too.

In fact, in the !ka->use_master_clock case, this ioctl should probably
return an error. Because in that case, the KVM clock *isn't* stable in
terms of the guest TSC as it would be in a sane world, and it doesn't
make sense to care about migrating it accurately.

We should think about the case where ka->use_master_clock is true on
the source, but *false* on the destination. Could KVM_SET_CLOCK_GUEST
still work in that mode, by calling compute_guest_tsc()? 


> > +
> > +       return 0;
> > +}
> > +
> > +static int kvm_vm_ioctl_set_clock_guest(struct kvm *kvm, void __user *argp)
> > +{
> > +       struct kvm_vcpu *vcpu;
> > +       struct pvclock_vcpu_time_info orig_pvti;
> > +       struct pvclock_vcpu_time_info dummy_pvti;
> > +       int64_t kernel_ns;
> > +       uint64_t host_tsc, guest_tsc;
> > +       uint64_t clock_orig, clock_dummy;
> > +       int64_t correction;
> > +       unsigned long i;
> 
> Please ignore me if there is not any chance to make the above (and other places
> in the patchset) to honor reverse xmas tree style.
> 
> > +
> > +       vcpu = kvm_get_bsp_vcpu(kvm);
> > +       if (!vcpu)
> > +               return -EINVAL;
> > +
> > +       if (copy_from_user(&orig_pvti, argp, sizeof(orig_pvti)))
> > +               return -EFAULT;
> > +
> > +       /*
> > +        * Sample the kernel time and host TSC at a singular point.
> > +        * We then calculate the guest TSC using this exact point in time,
> > +        * From here we can then determine the delta using the
> > +        * PV time info requested from the user and what we currently have
> > +        * using the fixed point in time. This delta is then used as a
> > +        * correction factor to fixup the potential drift.
> > +        */
> > +       if (!kvm_get_time_and_clockread(&kernel_ns, &host_tsc))
> > +               return -EFAULT;
> > +
> > +       guest_tsc = kvm_read_l1_tsc(vcpu, host_tsc);
> > +
> > +       dummy_pvti = orig_pvti;
> > +       dummy_pvti.tsc_timestamp = guest_tsc;
> > +       dummy_pvti.system_time = kernel_ns + kvm->arch.kvmclock_offset;
> > +
> > +       clock_orig = __pvclock_read_cycles(&orig_pvti, guest_tsc);
> > +       clock_dummy = __pvclock_read_cycles(&dummy_pvti, guest_tsc);
> > +
> > +       correction = clock_orig - clock_dummy;
> > +       kvm->arch.kvmclock_offset += correction;
> 
> I am not sure if it is a good idea to rely on userspace VMM to decide the good
> timepoint to issue the ioctl, without assuming any racing.
> 
> In addition to live migration, can the user call this API any time during the VM
> is running (to fix the clock drift)? Therefore, any requirement to protect the
> update of kvmclock_offset from racing?
> 

$DEITY no. The clock drift at *runtime* due to stray calls of
KVM_REQ_MASTERCLOCK_UPDATE is just a kernel bug. It isn't the user's
responsibility to correct it!

However... where invoked in ka->use_masterclock_mode, perhaps
pvclock_update_vm_gtod_copy() *should* perform the same kind of
calculation, and actually *adjust* ka->kvmclock_offset as the true
definition of the KVM clock drifts from CLOCK_MONOTONIC_RAW?

But internally; never seen by userspace.
Allister, Jack April 10, 2024, 10:08 a.m. UTC | #3
> Would the above rely not only on TSC clocksource, but also
> ka->use_master_clock==true?

Yes this should only be in the case of master clock. Extra verification
to check this for both GET/SET should be present.

> What will happen if the vCPU=0 (e.g., BSP) thread is racing with here
> to update the vcpu->arch.hv_clock?

> In addition to live migration, can the user call this API any time
> during the VM is running (to fix the clock drift)? Therefore, any
> requirement to protect the update of kvmclock_offset from racing?

This should occur when none of the vCPUs are running, in the context of
a live-migration/update this would be after deserialization before the
resume. I may have been unintentionally misleading here describing some
of the problem space. There isn't a "drift" per say when a VM is
running and the PVTI stays constant. It's more the relationship between
the TSC & PV clock changes due to inaccuracy when KVM decides to
generate that information, e.g on a live-update/migration KVM will
perform the update. The KVM_REQ_MASTERCLOCK_UPDATE is just an example
as to how to simulate/trigger this.

I've posted a V2 which hopefully addresses some of above and makes it
clearer overall the aim behind the patches.
diff mbox series

Patch

diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index 0b5a33ee71ee..5f74d8ac1002 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -6352,6 +6352,49 @@  a single guest_memfd file, but the bound ranges must not overlap).
 
 See KVM_SET_USER_MEMORY_REGION2 for additional details.
 
+4.143 KVM_GET_CLOCK_GUEST
+----------------------------
+
+:Capability: none
+:Architectures: x86
+:Type: vm ioctl
+:Parameters: struct pvclock_vcpu_time_info (out)
+:Returns: 0 on success, <0 on error
+
+Retrieves the current time information structure used for KVM/PV clocks.
+On x86 a PV clock is derived from the current TSC and is then scaled based
+upon the a specified multiplier and shift. The result of this is then added
+to a system time.
+
+The guest needs a way to determine the system time, multiplier and shift. This
+can be done by multiple ways, for KVM guests this can be via an MSR write to
+MSR_KVM_SYSTEM_TIME / MSR_KVM_SYSTEM_TIME_NEW which defines the guest physical
+address KVM shall put the structure. On Xen guests this can be found in the Xen
+vcpu_info.
+
+This is structure is useful information for a VMM to also know when taking into
+account potential timer drift on live-update/migration.
+
+4.144 KVM_SET_CLOCK_GUEST
+----------------------------
+
+:Capability: none
+:Architectures: x86
+:Type: vm ioctl
+:Parameters: struct pvclock_vcpu_time_info (in)
+:Returns: 0 on success, <0 on error
+
+Triggers KVM to perform a correction of the KVM/PV clock structure based upon a
+known prior PV clock structure (see KVM_GET_CLOCK_GUEST).
+
+If a VM is utilizing TSC scaling there is a potential for a drift between the
+KVM/PV clock and the TSC itself. This is due to the loss of precision when
+performing a multiply and shift rather than divide for the TSC.
+
+To perform the correction a delta is calculated between the original time info
+(which is assumed correct) at a singular point in time X. The KVM clock offset
+is then offset by this delta.
+
 5. The kvm_run structure
 ========================
 
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 47d9f03b7778..5d2e10cd1c30 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -6988,6 +6988,87 @@  static int kvm_vm_ioctl_set_clock(struct kvm *kvm, void __user *argp)
 	return 0;
 }
 
+static struct kvm_vcpu *kvm_get_bsp_vcpu(struct kvm *kvm)
+{
+	struct kvm_vcpu *vcpu = NULL;
+	int i;
+
+	for (i = 0; i < KVM_MAX_VCPUS; i++) {
+		vcpu = kvm_get_vcpu_by_id(kvm, i);
+		if (!vcpu)
+			continue;
+
+		if (kvm_vcpu_is_reset_bsp(vcpu))
+			break;
+	}
+
+	return vcpu;
+}
+
+static int kvm_vm_ioctl_get_clock_guest(struct kvm *kvm, void __user *argp)
+{
+	struct kvm_vcpu *vcpu;
+
+	vcpu = kvm_get_bsp_vcpu(kvm);
+	if (!vcpu)
+		return -EINVAL;
+
+	if (!vcpu->arch.hv_clock.tsc_timestamp || !vcpu->arch.hv_clock.system_time)
+		return -EIO;
+
+	if (copy_to_user(argp, &vcpu->arch.hv_clock, sizeof(vcpu->arch.hv_clock)))
+		return -EFAULT;
+
+	return 0;
+}
+
+static int kvm_vm_ioctl_set_clock_guest(struct kvm *kvm, void __user *argp)
+{
+	struct kvm_vcpu *vcpu;
+	struct pvclock_vcpu_time_info orig_pvti;
+	struct pvclock_vcpu_time_info dummy_pvti;
+	int64_t kernel_ns;
+	uint64_t host_tsc, guest_tsc;
+	uint64_t clock_orig, clock_dummy;
+	int64_t correction;
+	unsigned long i;
+
+	vcpu = kvm_get_bsp_vcpu(kvm);
+	if (!vcpu)
+		return -EINVAL;
+
+	if (copy_from_user(&orig_pvti, argp, sizeof(orig_pvti)))
+		return -EFAULT;
+
+	/*
+	 * Sample the kernel time and host TSC at a singular point.
+	 * We then calculate the guest TSC using this exact point in time,
+	 * From here we can then determine the delta using the
+	 * PV time info requested from the user and what we currently have
+	 * using the fixed point in time. This delta is then used as a
+	 * correction factor to fixup the potential drift.
+	 */
+	if (!kvm_get_time_and_clockread(&kernel_ns, &host_tsc))
+		return -EFAULT;
+
+	guest_tsc = kvm_read_l1_tsc(vcpu, host_tsc);
+
+	dummy_pvti = orig_pvti;
+	dummy_pvti.tsc_timestamp = guest_tsc;
+	dummy_pvti.system_time = kernel_ns + kvm->arch.kvmclock_offset;
+
+	clock_orig = __pvclock_read_cycles(&orig_pvti, guest_tsc);
+	clock_dummy = __pvclock_read_cycles(&dummy_pvti, guest_tsc);
+
+	correction = clock_orig - clock_dummy;
+	kvm->arch.kvmclock_offset += correction;
+
+	kvm_for_each_vcpu(i, vcpu, kvm)
+		kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu);
+
+	return 0;
+}
+
 int kvm_arch_vm_ioctl(struct file *filp, unsigned int ioctl, unsigned long arg)
 {
 	struct kvm *kvm = filp->private_data;
@@ -7246,6 +7327,12 @@  int kvm_arch_vm_ioctl(struct file *filp, unsigned int ioctl, unsigned long arg)
 	case KVM_GET_CLOCK:
 		r = kvm_vm_ioctl_get_clock(kvm, argp);
 		break;
+	case KVM_SET_CLOCK_GUEST:
+		r = kvm_vm_ioctl_set_clock_guest(kvm, argp);
+		break;
+	case KVM_GET_CLOCK_GUEST:
+		r = kvm_vm_ioctl_get_clock_guest(kvm, argp);
+		break;
 	case KVM_SET_TSC_KHZ: {
 		u32 user_tsc_khz;
 
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 2190adbe3002..0d306311e4d6 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -1548,4 +1548,7 @@  struct kvm_create_guest_memfd {
 	__u64 reserved[6];
 };
 
+#define KVM_SET_CLOCK_GUEST       _IOW(KVMIO,  0xd5, struct pvclock_vcpu_time_info)
+#define KVM_GET_CLOCK_GUEST       _IOR(KVMIO,  0xd6, struct pvclock_vcpu_time_info)
+
 #endif /* __LINUX_KVM_H */