
[v4] KVM: x86/tsc: Don't sync user changes to TSC with KVM-initiated change

Message ID 20230801034524.64007-1-likexu@tencent.com (mailing list archive)
State New, archived
Headers show
Series [v4] KVM: x86/tsc: Don't sync user changes to TSC with KVM-initiated change

Commit Message

Like Xu Aug. 1, 2023, 3:45 a.m. UTC
From: Like Xu <likexu@tencent.com>

Add kvm->arch.user_changed_tsc to avoid synchronizing user-initiated
changes to the guest TSC with the KVM-initiated change, by requiring
that userspace has written the TSC at least once.

KVM interprets writes to the TSC with values within 1 second of each
other as an attempt to synchronize the TSC for all vCPUs in the VM,
and uses a common offset for all vCPUs in a VM. For brevity's sake
let's just ignore what happens on systems with an unstable TSC.

While this may seem odd, it is imperative for VM save/restore, as VMMs
such as QEMU have long resorted to saving the TSCs (by value) from all
vCPUs in the VM at approximately the same time. Of course, it is
impossible to synchronize all the vCPU ioctls to capture the exact
instant in time, hence KVM fudges it a bit on the restore side.

This has been useful for the 'typical' VM lifecycle, where in all
likelihood the VM goes through save/restore a considerable amount of
time after VM creation. Nonetheless, there are some use cases that
need to restore a VM snapshot that was created very shortly after boot
(<1 second). Unfortunately the TSC sync code makes no distinction
between kernel and user-initiated writes, which leads to the target VM
synchronizing on the TSC offset from creation instead of the
user-intended value.

Avoid synchronizing user-initiated changes to the guest TSC with the
KVM-initiated change in kvm_arch_vcpu_postcreate() by conditioning the
logic on userspace having written the TSC at least once.

Reported-by: Yong He <alexyonghe@tencent.com>
Closes: https://bugzilla.kernel.org/show_bug.cgi?id=217423
Suggested-by: Sean Christopherson <seanjc@google.com>
Suggested-by: Oliver Upton <oliver.upton@linux.dev>
Original-by: Oliver Upton <oliver.upton@linux.dev>
Tested-by: Like Xu <likexu@tencent.com>
Signed-off-by: Like Xu <likexu@tencent.com>
---
V3 -> V4 Changelog:
- Apply the suggested changelog to better clarify the issue; (Oliver)
V3: https://lore.kernel.org/kvm/20230731080758.29482-1-likexu@tencent.com/

 arch/x86/include/asm/kvm_host.h |  1 +
 arch/x86/kvm/x86.c              | 23 ++++++++++++++++-------
 2 files changed, 17 insertions(+), 7 deletions(-)


base-commit: 5a7591176c47cce363c1eed704241e5d1c42c5a6

Comments

Sean Christopherson Aug. 11, 2023, 10:59 p.m. UTC | #1
On Tue, Aug 01, 2023, Like Xu wrote:
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 278dbd37dab2..eeaf4ad9174d 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -2713,7 +2713,7 @@ static void __kvm_synchronize_tsc(struct kvm_vcpu *vcpu, u64 offset, u64 tsc,
>  	kvm_track_tsc_matching(vcpu);
>  }
>  
> -static void kvm_synchronize_tsc(struct kvm_vcpu *vcpu, u64 data)
> +static void kvm_synchronize_tsc(struct kvm_vcpu *vcpu, u64 data, bool user_initiated)

Rather than pass two somewhat magic values for the KVM-internal call, what about
making @data a pointer and passing NULL?

>  {
>  	struct kvm *kvm = vcpu->kvm;
>  	u64 offset, ns, elapsed;
> @@ -2734,20 +2734,29 @@ static void kvm_synchronize_tsc(struct kvm_vcpu *vcpu, u64 data)
>  			 * kvm_clock stable after CPU hotplug
>  			 */
>  			synchronizing = true;
> -		} else {
> +		} else if (kvm->arch.user_changed_tsc) {
>  			u64 tsc_exp = kvm->arch.last_tsc_write +
>  						nsec_to_cycles(vcpu, elapsed);
>  			u64 tsc_hz = vcpu->arch.virtual_tsc_khz * 1000LL;
>  			/*
> -			 * Special case: TSC write with a small delta (1 second)
> -			 * of virtual cycle time against real time is
> -			 * interpreted as an attempt to synchronize the CPU.
> +			 * Here lies UAPI baggage: user-initiated TSC write with
> +			 * a small delta (1 second) of virtual cycle time
> +			 * against real time is interpreted as an attempt to
> +			 * synchronize the CPU.
> +			 *
> +			 * Don't synchronize user changes to the TSC with the
> +			 * KVM-initiated change in kvm_arch_vcpu_postcreate()
> +			 * by conditioning this mess on userspace having
> +			 * written the TSC at least once already.

Ok, this confused me for a good long while.  As in, super duper wtf is going on
confused.  Calling out kvm_arch_vcpu_postcreate() is a gigantic red-herring,
because this path is *never* reached by the kvm_synchronize_tsc() call from
kvm_arch_vcpu_postcreate().  @data is 0, and so the internal KVM call goes straight
to synchronizing.

And the fact that KVM does synchronization from kvm_arch_vcpu_postcreate() isn't
interesting, as that's just an internal detail.  What's important is that
synchronization needs to be forced when creating or hotplugging a vCPU (data == 0),
but when NOT hotplugging should be skipped for the first write from userspace.

And IMO, this blurb from the changelog is flat out wrong:

 : Unfortunately the TSC sync code makes no distinction
 : between kernel and user-initiated writes, which leads to the target VM
 : synchronizing on the TSC offset from creation instead of the
 : user-intended value.

The problem isn't that the sync code doesn't differentiate between kernel and
user-initiated writes, because parts of the code *do* differentiate.  I think it's
more accurate to say that the problem is that the sync code doesn't differentiate
between userspace initializing the TSC and userspace attempting to synchronize the
TSC.

And the "user_changed_tsc" only adds to the confusion, because in the hotplug
              ^^^^^^^
case, AFAICT there is no guarantee that the TSC will be changed, i.e. userspace
may set the exact same value.

With a massaged changelog, I think we want this (untested):

---
 arch/x86/include/asm/kvm_host.h |  1 +
 arch/x86/kvm/x86.c              | 25 ++++++++++++++++---------
 2 files changed, 17 insertions(+), 9 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 4c2d659a1269..bf566262ebd8 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1331,6 +1331,7 @@ struct kvm_arch {
 	int nr_vcpus_matched_tsc;
 
 	u32 default_tsc_khz;
+	bool user_set_tsc;
 
 	seqcount_raw_spinlock_t pvclock_sc;
 	bool use_master_clock;
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 34945c7dba38..d01991aadf19 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -2698,8 +2698,9 @@ static void __kvm_synchronize_tsc(struct kvm_vcpu *vcpu, u64 offset, u64 tsc,
 	kvm_track_tsc_matching(vcpu);
 }
 
-static void kvm_synchronize_tsc(struct kvm_vcpu *vcpu, u64 data)
+static void kvm_synchronize_tsc(struct kvm_vcpu *vcpu, u64 *user_value)
 {
+	u64 data = user_value ? *user_value : 0;
 	struct kvm *kvm = vcpu->kvm;
 	u64 offset, ns, elapsed;
 	unsigned long flags;
@@ -2712,14 +2713,17 @@ static void kvm_synchronize_tsc(struct kvm_vcpu *vcpu, u64 data)
 	elapsed = ns - kvm->arch.last_tsc_nsec;
 
 	if (vcpu->arch.virtual_tsc_khz) {
+		/*
+		 * Force synchronization when creating or hotplugging a vCPU,
+		 * i.e. when the TSC value is '0', to help keep clocks stable.
+		 * If this is NOT a hotplug/creation case, skip synchronization
+		 * on the first write from userspace so as not to misconstrue
+		 * state restoration after live migration as an attempt from
+		 * userspace to synchronize.
+		 */
 		if (data == 0) {
-			/*
-			 * detection of vcpu initialization -- need to sync
-			 * with other vCPUs. This particularly helps to keep
-			 * kvm_clock stable after CPU hotplug
-			 */
 			synchronizing = true;
-		} else {
+		} else if (kvm->arch.user_set_tsc) {
 			u64 tsc_exp = kvm->arch.last_tsc_write +
 						nsec_to_cycles(vcpu, elapsed);
 			u64 tsc_hz = vcpu->arch.virtual_tsc_khz * 1000LL;
@@ -2733,6 +2737,9 @@ static void kvm_synchronize_tsc(struct kvm_vcpu *vcpu, u64 data)
 		}
 	}
 
+	if (user_value)
+		kvm->arch.user_set_tsc = true;
+
 	/*
 	 * For a reliable TSC, we can match TSC offsets, and for an unstable
 	 * TSC, we add elapsed time in this computation.  We could let the
@@ -3761,7 +3768,7 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 		break;
 	case MSR_IA32_TSC:
 		if (msr_info->host_initiated) {
-			kvm_synchronize_tsc(vcpu, data);
+			kvm_synchronize_tsc(vcpu, &data);
 		} else {
 			u64 adj = kvm_compute_l1_tsc_offset(vcpu, data) - vcpu->arch.l1_tsc_offset;
 			adjust_tsc_offset_guest(vcpu, adj);
@@ -11934,7 +11941,7 @@ void kvm_arch_vcpu_postcreate(struct kvm_vcpu *vcpu)
 	if (mutex_lock_killable(&vcpu->mutex))
 		return;
 	vcpu_load(vcpu);
-	kvm_synchronize_tsc(vcpu, 0);
+	kvm_synchronize_tsc(vcpu, NULL);
 	vcpu_put(vcpu);
 
 	/* poll control enabled by default */

base-commit: ba44e03ef5e5d3ece316d1384e43e3d7761c89d4
--
David Woodhouse Sept. 13, 2023, 8:10 a.m. UTC | #2
On Fri, 2023-08-11 at 15:59 -0700, Sean Christopherson wrote:
> The problem isn't that the sync code doesn't differentiate between kernel and
> user-initiated writes, because parts of the code *do* differentiate.  I think it's
> more accurate to say that the problem is that the sync code doesn't differentiate
> between userspace initializing the TSC and userspace attempting to synchronize the
> TSC.

I'm not utterly sure that *I* differentiate between userspace
"initializing the TSC" and attempting to "synchronize the TSC". What
*is* the difference? 

Userspace is merely *setting* the TSC for a given vCPU, regardless of
whether other vCPUs even exist.

But we have to work around the fundamental brokenness of the legacy
API, whose semantics are most accurately described as "Please set the
TSC to precisely <x> because that's what it should have been *some*
time around now, if I wasn't preempted very much between when I
calculated it and when you see this ioctl".
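
For reference, a minimal sketch of that legacy restore path as seen from
a VMM (illustrative only: it assumes an open vCPU fd, omits error
handling, and defines MSR_IA32_TSC locally since it isn't in the KVM
uapi headers):

#include <linux/kvm.h>
#include <string.h>
#include <sys/ioctl.h>

#define MSR_IA32_TSC	0x10	/* IA32_TIME_STAMP_COUNTER */

/* Write a saved TSC value back via the legacy KVM_SET_MSRS interface. */
static int legacy_restore_tsc(int vcpu_fd, __u64 saved_tsc)
{
	struct {
		struct kvm_msrs hdr;
		struct kvm_msr_entry entry;
	} msrs;

	memset(&msrs, 0, sizeof(msrs));
	msrs.hdr.nmsrs = 1;
	msrs.entry.index = MSR_IA32_TSC;
	/* "What the TSC should have been some time around now." */
	msrs.entry.data = saved_tsc;

	return ioctl(vcpu_fd, KVM_SET_MSRS, &msrs);
}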

That's why — for the legacy API only — we have this hack to make the
TSCs *actually* in sync if they're close. Because without it, there's
*no* way the VMM can restore a guest with its TSCs actually in sync.

I think the best answer to the bug report that led to this patch is
just "Don't use the legacy API then". Use KVM_VCPU_TSC_OFFSET which is
defined as "the TSC was <x> at KVM time <y>" and is actually *sane*.
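
A minimal sketch of that interface (illustrative only: it assumes
KVM_CAP_VCPU_ATTRIBUTES and an open vCPU fd, omits error handling, and
leaves computing the offset from a saved <tsc, time> pair to the VMM):

#include <linux/kvm.h>
#include <sys/ioctl.h>

/* Set the vCPU's L1 TSC offset (guest_tsc = host_tsc + offset). */
static int set_tsc_offset(int vcpu_fd, __u64 tsc_offset)
{
	struct kvm_device_attr attr = {
		.group = KVM_VCPU_TSC_CTRL,
		.attr  = KVM_VCPU_TSC_OFFSET,
		.addr  = (__u64)(unsigned long)&tsc_offset,
	};

	return ioctl(vcpu_fd, KVM_SET_DEVICE_ATTR, &attr);
}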
David Woodhouse Sept. 13, 2023, 8:31 a.m. UTC | #3
On Fri, 2023-08-11 at 15:59 -0700, Sean Christopherson wrote:
> On Tue, Aug 01, 2023, Like Xu wrote:
> > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > index 278dbd37dab2..eeaf4ad9174d 100644
> > --- a/arch/x86/kvm/x86.c
> > +++ b/arch/x86/kvm/x86.c
> > @@ -2713,7 +2713,7 @@ static void __kvm_synchronize_tsc(struct kvm_vcpu *vcpu, u64 offset, u64 tsc,
> >         kvm_track_tsc_matching(vcpu);
> >   }
> >   
> > -static void kvm_synchronize_tsc(struct kvm_vcpu *vcpu, u64 data)
> > +static void kvm_synchronize_tsc(struct kvm_vcpu *vcpu, u64 data, bool user_initiated)
> 
> Rather than pass two somewhat magic values for the KVM-internal call, what about
> making @data a pointer and passing NULL?

Why change that at all?

Userspace used to be able to force a sync by writing zero. You are
removing that from the ABI without any explanation about why; it
doesn't seem necessary for fixing the original issue.
Like Xu Sept. 13, 2023, 8:41 a.m. UTC | #4
On 13/9/2023 4:10 pm, David Woodhouse wrote:
> On Fri, 2023-08-11 at 15:59 -0700, Sean Christopherson wrote:
>> The problem isn't that the sync code doesn't differentiate between kernel and
>> user-initiated writes, because parts of the code *do* differentiate.  I think it's
>> more accurate to say that the problem is that the sync code doesn't differentiate
>> between userspace initializing the TSC and userspace attempting to synchronize the
>> TSC.
> 
> I'm not utterly sure that *I* differentiate between userspace
> "initializing the TSC" and attempting to "synchronize the TSC". What
> *is* the difference?

I'd be more inclined toward Oliver's explanation in this version of the
changelog: different tsc_offsets are used to calculate the guest_tsc value
between when the vCPU is created and when the TSC is first set by userspace.
This extra synchronization is not expected for the guest, based on the user's
bugzilla report.

> 
> Userspace is merely *setting* the TSC for a given vCPU, regardless of
> whether other vCPUs even exist.
> 
> But we have to work around the fundamental brokenness of the legacy
> API, whose semantics are most accurately described as "Please set the
> TSC to precisely <x> because that's what it should have been *some*
> time around now, if I wasn't preempted very much between when I
> calculated it and when you see this ioctl".
> 
> That's why — for the legacy API only — we have this hack to make the
> TSCs *actually* in sync if they're close. Because without it, there's
> *no* way the VMM can restore a guest with its TSCs actually in sync.
> 
> I think the best answer to the bug report that led to this patch is
> just "Don't use the legacy API then". Use KVM_VCPU_TSC_OFFSET which is
> defined as "the TSC was <x> at KVM time <y>" and is actually *sane*.
> 

Two hands in favor. Using the new KVM_VCPU_TSC_OFFSET API and applying a
small fix to the legacy API do not conflict. Thank you for reviewing it.
David Woodhouse Sept. 13, 2023, 8:44 a.m. UTC | #5
On Wed, 2023-09-13 at 16:41 +0800, Like Xu wrote:
> On 13/9/2023 4:10 pm, David Woodhouse wrote:
> > On Fri, 2023-08-11 at 15:59 -0700, Sean Christopherson wrote:
> > > The problem isn't that the sync code doesn't differentiate between kernel and
> > > user-initiated writes, because parts of the code *do* differentiate.  I think it's
> > > more accurate to say that the problem is that the sync code doesn't differentiate
> > > between userspace initializing the TSC and userspace attempting to synchronize the
> > > TSC.
> > 
> > I'm not utterly sure that *I* differentiate between userspace
> > "initializing the TSC" and attempting to "synchronize the TSC". What
> > *is* the difference?
> 
> I'd be more inclined toward Oliver's explanation in this version of the
> changelog: different tsc_offsets are used to calculate the guest_tsc value
> between when the vCPU is created and when the TSC is first set by userspace.
> This extra synchronization is not expected for the guest, based on the user's
> bugzilla report.
> 

Yes, it's about the kernel's default startup values (first vCPU
starting at TSC 0, others syncing to that on creation), and the fact
that the *first* userspace write (to any vCPU) should actually be
honoured even if it *does* happen to be within 1 second of the kernel's
startup values.


> Two hands in favor. Using the new KVM_VCPU_TSC_OFFSET API and applying a
> small fix to the legacy API do not conflict. Thank you for reviewing it.

I'm slightly dubious about making *changes* to an established userspace
ABI, especially when there's already a better way to do it. But I
suppose this specific change, if you *don't* also take away the ability
for userspace to explicitly write zero to force a sync (qv), is OK.
Sean Christopherson Sept. 13, 2023, 2:50 p.m. UTC | #6
On Wed, Sep 13, 2023, David Woodhouse wrote:
> On Fri, 2023-08-11 at 15:59 -0700, Sean Christopherson wrote:
> > On Tue, Aug 01, 2023, Like Xu wrote:
> > > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > > index 278dbd37dab2..eeaf4ad9174d 100644
> > > --- a/arch/x86/kvm/x86.c
> > > +++ b/arch/x86/kvm/x86.c
> > > @@ -2713,7 +2713,7 @@ static void __kvm_synchronize_tsc(struct kvm_vcpu *vcpu, u64 offset, u64 tsc,
> > >         kvm_track_tsc_matching(vcpu);
> > >   }
> > >   
> > > -static void kvm_synchronize_tsc(struct kvm_vcpu *vcpu, u64 data)
> > > +static void kvm_synchronize_tsc(struct kvm_vcpu *vcpu, u64 data, bool user_initiated)
> > 
> > Rather than pass two somewhat magic values for the KVM-internal call, what about
> > making @data a pointer and passing NULL?
> 
> Why change that at all?
> 
> Userspace used to be able to force a sync by writing zero. You are
> removing that from the ABI without any explanation about why;

No, my suggestion did not remove that from the ABI.  A @user_value of '0' would
still force synchronization.

-static void kvm_synchronize_tsc(struct kvm_vcpu *vcpu, u64 data)
+static void kvm_synchronize_tsc(struct kvm_vcpu *vcpu, u64 *user_value)
 {
+       u64 data = user_value ? *user_value : 0;  <=== "*user_value" is '0'
        struct kvm *kvm = vcpu->kvm;
        u64 offset, ns, elapsed;
        unsigned long flags;
@@ -2712,14 +2713,17 @@ static void kvm_synchronize_tsc(struct kvm_vcpu *vcpu, u64 data)
        elapsed = ns - kvm->arch.last_tsc_nsec;

        if (vcpu->arch.virtual_tsc_khz) {
+               /*
+                * Force synchronization when creating or hotplugging a vCPU,
+                * i.e. when the TSC value is '0', to help keep clocks stable.
+                * If this is NOT a hotplug/creation case, skip synchronization
+                * on the first write from userspace so as not to misconstrue
+                * state restoration after live migration as an attempt from
+                * userspace to synchronize.
+                */
                if (data == 0) { <== "data" still '0', still forces synchronization
-                       /*
-                        * detection of vcpu initialization -- need to sync
-                        * with other vCPUs. This particularly helps to keep
-                        * kvm_clock stable after CPU hotplug
-                        */
                        synchronizing = true;

> it doesn't seem necessary for fixing the original issue.

It's necessary for "user_set_tsc" to be an accurate name.  The code in v6 yields
"user_set_tsc_to_non_zero_value".  And I don't think it's just a naming issue,
e.g. if userspace writes '0' immediately after creating, and then later writes a
small delta, the v6 code wouldn't trigger synchronization because "user_set_tsc"
would be left unset by the write of '0'.
David Woodhouse Sept. 13, 2023, 3:05 p.m. UTC | #7
On Wed, 2023-09-13 at 07:50 -0700, Sean Christopherson wrote:
> On Wed, Sep 13, 2023, David Woodhouse wrote:
> > Userspace used to be able to force a sync by writing zero. You are
> > removing that from the ABI without any explanation about why;
> 
> No, my suggestion did not remove that from the ABI.  A @user_value of '0' would
> still force synchronization.

Ah, OK. Yes, you're right. Thanks.

> It's necessary for "user_set_tsc" to be an accurate name.  The code in v6 yields
> "user_set_tsc_to_non_zero_value".  And I don't think it's just a naming issue,

In another thread, you said that the sync code doesn't differentiate
between userspace initializing the TSC and userspace attempting to
synchronize the TSC. I responded that *I* don't differentiate the two
and couldn't see the difference.

I think we were both wrong. Userspace does *explicitly* synchronize the
TSC by writing zero, and the sync code *does* explicitly handle that,
yes?

And the reason I mention it here is that we could perhaps reasonable
say that userspace *syncing* the TSC like that is not the same as
userspace *setting* the TSC, and that it's OK for user_set_tsc to
remain false? It saves adding another argument to kvm_synchronize_tsc()
making it even more complex for a use case that just doesn't make sense
anyway...

> e.g. if userspace writes '0' immediately after creating, and then later writes a
> small delta, the v6 code wouldn't trigger synchronization because "user_set_tsc"
> would be left unset by the write of '0'.

True, but that's the existing behaviour, and it doesn't make much sense
for the user to write 0 to trigger a sync immediately after creating,
because the *kernel* does that anyway.

I don't feel particularly strongly. Having a commit message and code
comments which clearly set out the reasoning for the 1-second slop and
the reasons why we want to *stop* doing it in this case, is the
important part. That had got lost.
Sean Christopherson Sept. 13, 2023, 3:15 p.m. UTC | #8
On Wed, Sep 13, 2023, David Woodhouse wrote:
> On Wed, 2023-09-13 at 07:50 -0700, Sean Christopherson wrote:
> > On Wed, Sep 13, 2023, David Woodhouse wrote:
> > > Userspace used to be able to force a sync by writing zero. You are
> > > removing that from the ABI without any explanation about why;
> > 
> > No, my suggestion did not remove that from the ABI.  A @user_value of '0' would
> > still force synchronization.
> 
> Ah, OK. Yes, you're right. Thanks.
> 
> > It's necessary for "user_set_tsc" to be an accurate name.  The code in v6 yields
> > "user_set_tsc_to_non_zero_value".  And I don't think it's just a naming issue,
> 
> In another thread, you said that the sync code doesn't differentiate
> between userspace initializing the TSC And userspace attempting to
> synchronize the TSC. I responded that *I* don't differentiate the two
> and couldn't see the difference.
> 
> I think we were both wrong. Userspace does *explicitly* synchronize the
> TSC by writing zero, and the sync code *does* explicitly handle that,
> yes?

Doh, by "sync code" I meant the "conditionally sync" code, i.e. the data != 0 path.

> And the reason I mention it here is that we could perhaps reasonable
> say that userspace *syncing* the TSC like that is not the same as
> userspace *setting* the TSC, and that it's OK for user_set_tsc to
> remain false? It saves adding another argument to kvm_synchronize_tsc()
> making it even more complex for a use case that just doesn't make sense
> anyway...
> 
> > e.g. if userspace writes '0' immediately after creating, and then later writes a
> > small delta, the v6 code wouldn't trigger synchronization because "user_set_tsc"
> > would be left unset by the write of '0'.
> 
> True, but that's the existing behaviour,

No?  The existing code will fall into the "conditionally sync" logic for any
non-zero value.

		if (data == 0) {
			/*
			 * detection of vcpu initialization -- need to sync
			 * with other vCPUs. This particularly helps to keep
			 * kvm_clock stable after CPU hotplug
			 */
			synchronizing = true;
		} else {
			u64 tsc_exp = kvm->arch.last_tsc_write +
						nsec_to_cycles(vcpu, elapsed);
			u64 tsc_hz = vcpu->arch.virtual_tsc_khz * 1000LL;
			/*
			 * Special case: TSC write with a small delta (1 second)
			 * of virtual cycle time against real time is
			 * interpreted as an attempt to synchronize the CPU.
			 */
			synchronizing = data < tsc_exp + tsc_hz &&
					data + tsc_hz > tsc_exp;
		}

> and it doesn't make much sense for the user to write 0 to trigger a sync
> immediately after creating, because the *kernel* does that anyway.

I don't care (in the Tommy Lee Jones[*] sense).  All I care about is minimizing
the probability of breaking userspace, which means making the smallest possible
change to KVM's ABI.  For me, whether or not userspace is doing something sensible
doesn't factor into that equation.

[*] https://www.youtube.com/watch?v=OoTbXu1qnbc
David Woodhouse Sept. 13, 2023, 3:24 p.m. UTC | #9
On Wed, 2023-09-13 at 15:15 +0000, Sean Christopherson wrote:
> > > e.g. if userspace writes '0' immediately after creating, and then later writes a
> > > small delta, the v6 code wouldn't trigger synchronization because "user_set_tsc"
> > > would be left unset by the write of '0'.
> > 
> > True, but that's the existing behaviour,
> 
> No?  The existing code will fall into the "conditionally sync" logic for any
> non-zero value.

Yeah, OK. This isn't one of the cases we set out to deliberately
change, but it would be changed by v6 of the patch, and I suppose
you're right that we should accept a small amount of extra code
complexity just to avoid making any changes we don't *need* to, even
for stupid cases like this.


> I don't care (in the Tommy Lee Jones[*] sense).  All I care about is minimizing
> the probability of breaking userspace, which means making the smallest possible
> change to KVM's ABI.  For me, whether or not userspace is doing something sensible
> doesn't factor into that equation.

Ack.

Although there's a strong argument that adding further warts to an
already fundamentally broken API probably isn't a great idea in the
first place. Just deprecate it and use the saner replacement API...
which I just realised we don't have (qv). Ooops :)
Like Xu Sept. 14, 2023, 3:24 a.m. UTC | #10
On 13/9/2023 11:24 pm, David Woodhouse wrote:
> On Wed, 2023-09-13 at 15:15 +0000, Sean Christopherson wrote:
>>>> e.g. if userspace writes '0' immediately after creating, and then later writes a
>>>> small delta, the v6 code wouldn't trigger synchronization because "user_set_tsc"
>>>> would be left unset by the write of '0'.
>>>
>>> True, but that's the existing behaviour,
>>
>> No?  The existing code will fall into the "conditionally sync" logic for any
>> non-zero value.
> 
> Yeah, OK. This isn't one of the cases we set out to deliberately
> change, but it would be changed by v6 of the patch, and I suppose
> you're right that we should accept a small amount of extra code
> complexity just to avoid making any changes we don't *need* to, even
> for stupid cases like this.
> 
> 
>> I don't care (in the Tommy Lee Jones[*] sense).  All I care about is minimizing
>> the probability of breaking userspace, which means making the smallest possible
>> change to KVM's ABI.  For me, whether or not userspace is doing something sensible
>> doesn't factor into that equation.
> 
> Ack.

If we combine the v5 code diff (the u64 *user_value proposal) with the
refined changelog in v6, it seems we've reached a point of equilibrium
on this issue, doesn't it?

Please let me know if you have more concerns.

> 
> Although there's a strong argument that adding further warts to an
> already fundamentally broken API probably isn't a great idea in the
> first place. Just deprecate it and use the saner replacement API...
> which I just realised we don't have (qv). Ooops :)
>

Patch

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 3bc146dfd38d..e8d423ef1474 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1303,6 +1303,7 @@  struct kvm_arch {
 	u64 cur_tsc_offset;
 	u64 cur_tsc_generation;
 	int nr_vcpus_matched_tsc;
+	bool user_changed_tsc;
 
 	u32 default_tsc_khz;
 
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 278dbd37dab2..eeaf4ad9174d 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -2713,7 +2713,7 @@  static void __kvm_synchronize_tsc(struct kvm_vcpu *vcpu, u64 offset, u64 tsc,
 	kvm_track_tsc_matching(vcpu);
 }
 
-static void kvm_synchronize_tsc(struct kvm_vcpu *vcpu, u64 data)
+static void kvm_synchronize_tsc(struct kvm_vcpu *vcpu, u64 data, bool user_initiated)
 {
 	struct kvm *kvm = vcpu->kvm;
 	u64 offset, ns, elapsed;
@@ -2734,20 +2734,29 @@  static void kvm_synchronize_tsc(struct kvm_vcpu *vcpu, u64 data)
 			 * kvm_clock stable after CPU hotplug
 			 */
 			synchronizing = true;
-		} else {
+		} else if (kvm->arch.user_changed_tsc) {
 			u64 tsc_exp = kvm->arch.last_tsc_write +
 						nsec_to_cycles(vcpu, elapsed);
 			u64 tsc_hz = vcpu->arch.virtual_tsc_khz * 1000LL;
 			/*
-			 * Special case: TSC write with a small delta (1 second)
-			 * of virtual cycle time against real time is
-			 * interpreted as an attempt to synchronize the CPU.
+			 * Here lies UAPI baggage: user-initiated TSC write with
+			 * a small delta (1 second) of virtual cycle time
+			 * against real time is interpreted as an attempt to
+			 * synchronize the CPU.
+			 *
+			 * Don't synchronize user changes to the TSC with the
+			 * KVM-initiated change in kvm_arch_vcpu_postcreate()
+			 * by conditioning this mess on userspace having
+			 * written the TSC at least once already.
 			 */
 			synchronizing = data < tsc_exp + tsc_hz &&
 					data + tsc_hz > tsc_exp;
 		}
 	}
 
+	if (user_initiated)
+		kvm->arch.user_changed_tsc = true;
+
 	/*
 	 * For a reliable TSC, we can match TSC offsets, and for an unstable
 	 * TSC, we add elapsed time in this computation.  We could let the
@@ -3776,7 +3785,7 @@  int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 		break;
 	case MSR_IA32_TSC:
 		if (msr_info->host_initiated) {
-			kvm_synchronize_tsc(vcpu, data);
+			kvm_synchronize_tsc(vcpu, data, true);
 		} else {
 			u64 adj = kvm_compute_l1_tsc_offset(vcpu, data) - vcpu->arch.l1_tsc_offset;
 			adjust_tsc_offset_guest(vcpu, adj);
@@ -11950,7 +11959,7 @@  void kvm_arch_vcpu_postcreate(struct kvm_vcpu *vcpu)
 	if (mutex_lock_killable(&vcpu->mutex))
 		return;
 	vcpu_load(vcpu);
-	kvm_synchronize_tsc(vcpu, 0);
+	kvm_synchronize_tsc(vcpu, 0, false);
 	vcpu_put(vcpu);
 
 	/* poll control enabled by default */