diff mbox series

[v16,12/13] x86/tsc: Switch to native sched clock

Message ID 20250106124633.1418972-13-nikunj@amd.com (mailing list archive)
State New
Headers show
Series Add Secure TSC support for SNP guests | expand

Commit Message

Nikunj A. Dadhania Jan. 6, 2025, 12:46 p.m. UTC
Although the kernel switches over to stable TSC clocksource instead of
PV clocksource, the scheduler still keeps on using PV clocks as the
sched clock source. This is because KVM, Xen and VMWare, switch the
paravirt sched clock handler in their init routines. HyperV is the
only PV clock source that checks if the platform provides an invariant
TSC and does not switch to PV sched clock.

When switching back to stable TSC, restore the scheduler clock to
native_sched_clock().

As the clock selection happens in the stop machine context, schedule
delayed work to update the static_call()

Cc: Alexey Makhalov <alexey.makhalov@broadcom.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Signed-off-by: Nikunj A Dadhania <nikunj@amd.com>
---
 arch/x86/kernel/tsc.c | 18 ++++++++++++++++++
 1 file changed, 18 insertions(+)

Comments

Borislav Petkov Jan. 7, 2025, 7:37 p.m. UTC | #1
On Mon, Jan 06, 2025 at 06:16:32PM +0530, Nikunj A Dadhania wrote:
> Although the kernel switches over to stable TSC clocksource instead of
> PV clocksource, the scheduler still keeps on using PV clocks as the
> sched clock source. This is because KVM, Xen and VMWare, switch the
> paravirt sched clock handler in their init routines. HyperV is the
> only PV clock source that checks if the platform provides an invariant
> TSC and does not switch to PV sched clock.

So this below looks like some desperate hackery. Are you doing this because
kvm_sched_clock_init() does

	paravirt_set_sched_clock(kvm_sched_clock_read);

?

If so, why don't you simply prevent that on STSC guests?
Nikunj A. Dadhania Jan. 8, 2025, 5:20 a.m. UTC | #2
On 1/8/2025 1:07 AM, Borislav Petkov wrote:
> On Mon, Jan 06, 2025 at 06:16:32PM +0530, Nikunj A Dadhania wrote:
>> Although the kernel switches over to stable TSC clocksource instead of
>> PV clocksource, the scheduler still keeps on using PV clocks as the
>> sched clock source. This is because KVM, Xen and VMWare, switch the
>> paravirt sched clock handler in their init routines. HyperV is the
>> only PV clock source that checks if the platform provides an invariant
>> TSC and does not switch to PV sched clock.
>
> So this below looks like some desperate hackery. Are you doing this because
> kvm_sched_clock_init() does
>
> 	paravirt_set_sched_clock(kvm_sched_clock_read);
>
> ?

All guests(including STSC) running on KVM or other hypervisors like Xen/VMWare
uses the regular TSC when the guest is booted with TscInvariant bit set. But it
doesn't switch the sched clock to use TSC which it should have, pointed here[1]
by Sean:

    No, I'm saying that the guest should prefer the raw TSC over kvmclock if the
    TSC is stable, irrespective of SNP or TDX. This is effectively already done
    for the timekeeping base (see commit 7539b174aef4 ("x86: kvmguest: use TSC
    clocksource if invariant TSC is exposed")), but the scheduler still uses
    kvmclock thanks to the kvm_sched_clock_init() code.

> If so, why don't you simply prevent that on STSC guests?

This is for all guests running on different hypervisor with stable TSC. And I
did try to limit this to KVM here [2], here is Sean's comment to that
implementation:

    > Although the kernel switches over to stable TSC clocksource instead of
    > kvmclock, the scheduler still keeps on using kvmclock as the sched clock.
    > This is due to kvm_sched_clock_init() updating the pv_sched_clock()
    > unconditionally.

    All PV clocks are affected by this, no? This seems like something that
    should be handled in common code, which is the point I was trying to make in
    v11.

Regards
Nikunj

1. https://lore.kernel.org/kvm/ZurCbP7MesWXQbqZ@google.com/
2. https://lore.kernel.org/kvm/20241009092850.197575-17-nikunj@amd.com/
Borislav Petkov Jan. 8, 2025, 8:22 a.m. UTC | #3
On Wed, Jan 08, 2025 at 10:50:13AM +0530, Nikunj A. Dadhania wrote:
> All guests(including STSC) running on KVM or other hypervisors like Xen/VMWare
> uses the regular TSC when the guest is booted with TscInvariant bit set. But it
> doesn't switch the sched clock to use TSC which it should have, pointed here[1]
> by Sean:

Well, that paravirt_set_sched_clock() thing is there for a reason and all the
different guest types which call it perhaps have their reasons.

Now, if you think it would make sense to use the TSC in every guest type if
the TSC is this and that, then I'd need to see a patch or even a patchset
which explains why it is ok to do so on every guest type and then tested on
every guest type and then each patch would convert a single guest type and
properly explain why.

And your patch doesn't have that at all. I know Sean thinks it is a good idea
and perhaps it is but without proper justification and acks from the other
guest type maintainers, I simply won't take such a patch(-set).

And even if it is good, you need to solve it another way - not with this
delayed-work hackery.

IOW, this switching the paravirt clock crap to use TSC when the TSC is this
and that and the HV tells you so, should be a wholly separate patchset with
reviews and acks and the usual shebang.

If you want to take care only of STSC now, I'd take a patch which is known
good and tested properly. And that should happen very soon because the merge
window is closing in. Otherwise, there's always another merge window and
rushing things doesn't do any good anyway, as I keep preaching...

Your call.
Nikunj A. Dadhania Jan. 8, 2025, 8:34 a.m. UTC | #4
On 1/8/2025 1:52 PM, Borislav Petkov wrote:
> On Wed, Jan 08, 2025 at 10:50:13AM +0530, Nikunj A. Dadhania wrote:
> And your patch doesn't have that at all. I know Sean thinks it is a good idea
> and perhaps it is but without proper justification and acks from the other
> guest type maintainers, I simply won't take such a patch(-set).

I haven't heard anything from other guest type maintainers yet, and they are 
on CC from v12 onward.

> And even if it is good, you need to solve it another way - not with this
> delayed-work hackery.
> 
> IOW, this switching the paravirt clock crap to use TSC when the TSC is this
> and that and the HV tells you so, should be a wholly separate patchset with
> reviews and acks and the usual shebang.

I agree, we should handle this as a separate patch series.

> If you want to take care only of STSC now, I'd take a patch which is known
> good and tested properly. And that should happen very soon because the merge
> window is closing in. 

In that case, let me limit this only to STSC for now, i will send updated patch.

Regards,
Nikunj
Nikunj A. Dadhania Jan. 8, 2025, 10:20 a.m. UTC | #5
On 1/8/2025 2:04 PM, Nikunj A. Dadhania wrote:
> 
>> If you want to take care only of STSC now, I'd take a patch which is known
>> good and tested properly. And that should happen very soon because the merge
>> window is closing in. 
> 
> In that case, let me limit this only to STSC for now, i will send updated patch.

From: Nikunj A Dadhania <nikunj@amd.com>
Date: Wed, 8 Jan 2025 14:18:04 +0530
Subject: [PATCH] x86/kvmclock: Prefer TSC as the scheduler clock for Secure
 TSC guests

Although the kernel switches over to a stable TSC clocksource instead of
kvmclock, the scheduler still keeps on using kvmclock as the sched clock.
This is due to kvm_sched_clock_init() updating the pv_sched_clock()
unconditionally. Do not override the PV sched clock when Secure TSC is
enabled.

Signed-off-by: Nikunj A Dadhania <nikunj@amd.com>
---
 arch/x86/kernel/kvmclock.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
index d8fef3a65a35..82c4743a5e7a 100644
--- a/arch/x86/kernel/kvmclock.c
+++ b/arch/x86/kernel/kvmclock.c
@@ -324,8 +324,10 @@ void __init kvmclock_init(void)
 	if (kvm_para_has_feature(KVM_FEATURE_CLOCKSOURCE_STABLE_BIT))
 		pvclock_set_flags(PVCLOCK_TSC_STABLE_BIT);
 
-	flags = pvclock_read_flags(&hv_clock_boot[0].pvti);
-	kvm_sched_clock_init(flags & PVCLOCK_TSC_STABLE_BIT);
+	if (!cc_platform_has(CC_ATTR_GUEST_SNP_SECURE_TSC)) {
+		flags = pvclock_read_flags(&hv_clock_boot[0].pvti);
+		kvm_sched_clock_init(flags & PVCLOCK_TSC_STABLE_BIT);
+	}
 
 	x86_platform.calibrate_tsc = kvm_get_tsc_khz;
 	x86_platform.calibrate_cpu = kvm_get_tsc_khz;
Sean Christopherson Jan. 8, 2025, 2 p.m. UTC | #6
On Wed, Jan 08, 2025, Nikunj A. Dadhania wrote:
> 
> On 1/8/2025 2:04 PM, Nikunj A. Dadhania wrote:
> > 
> >> If you want to take care only of STSC now, I'd take a patch which is known
> >> good and tested properly. And that should happen very soon because the merge
> >> window is closing in. 
> > 
> > In that case, let me limit this only to STSC for now, i will send updated patch.
> 
> From: Nikunj A Dadhania <nikunj@amd.com>
> Date: Wed, 8 Jan 2025 14:18:04 +0530
> Subject: [PATCH] x86/kvmclock: Prefer TSC as the scheduler clock for Secure
>  TSC guests
> 
> Although the kernel switches over to a stable TSC clocksource instead of
> kvmclock, the scheduler still keeps on using kvmclock as the sched clock.
> This is due to kvm_sched_clock_init() updating the pv_sched_clock()
> unconditionally. Do not override the PV sched clock when Secure TSC is
> enabled.
> 
> Signed-off-by: Nikunj A Dadhania <nikunj@amd.com>
> ---
>  arch/x86/kernel/kvmclock.c | 6 ++++--
>  1 file changed, 4 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
> index d8fef3a65a35..82c4743a5e7a 100644
> --- a/arch/x86/kernel/kvmclock.c
> +++ b/arch/x86/kernel/kvmclock.c
> @@ -324,8 +324,10 @@ void __init kvmclock_init(void)
>  	if (kvm_para_has_feature(KVM_FEATURE_CLOCKSOURCE_STABLE_BIT))
>  		pvclock_set_flags(PVCLOCK_TSC_STABLE_BIT);
>  
> -	flags = pvclock_read_flags(&hv_clock_boot[0].pvti);
> -	kvm_sched_clock_init(flags & PVCLOCK_TSC_STABLE_BIT);
> +	if (!cc_platform_has(CC_ATTR_GUEST_SNP_SECURE_TSC)) {
> +		flags = pvclock_read_flags(&hv_clock_boot[0].pvti);
> +		kvm_sched_clock_init(flags & PVCLOCK_TSC_STABLE_BIT);
> +	}

This still misses my point.  Ditto for the "x86/tsc: Switch Secure TSC guests away
from kvm-clock".

I object to singling out kvmclock.  It's weird and misleading, because handling
only kvmclock suggests that other PV clocks are somehow trusted/ok, when in
reality the only reason kvmclock is getting singled out is (presumably) because
it's what Nikunj and the other folks enabling KVM SNP test on.

What I care most about is having a sane, consistent policy throughout the kernel.
E.g. so that a user/reader walks away with an understanding PV clocks are a
theoretical host attack vector and so should be avoided when Secure TSC is
available.

Ideally, if the TSC is the preferred clocksource, then the scheduler will use the
TSC and not a PV clock irrespective of STSC.  But I 100% agree with Boris that
it needs buy-in from other maintainers (including Paolo), because it's entirely
possible (likely, even) that there's an angle to scheduling I'm not considering.
diff mbox series

Patch

diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
index 88d8bfceea04..fe7a0b1b7cfd 100644
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -291,12 +291,26 @@  static void __init upgrade_clock_rating(struct clocksource *tsc_early,
 		tsc->rating = 450;
 	}
 }
+
+static void enable_native_sc_work(struct work_struct *work)
+{
+	pr_info("Using native sched clock\n");
+	paravirt_set_sched_clock(native_sched_clock);
+}
+static DECLARE_DELAYED_WORK(enable_native_sc, enable_native_sc_work);
+
+static void enable_native_sched_clock(void)
+{
+	if (!using_native_sched_clock())
+		schedule_delayed_work(&enable_native_sc, 0);
+}
 #else
 u64 sched_clock_noinstr(void) __attribute__((alias("native_sched_clock")));
 
 bool using_native_sched_clock(void) { return true; }
 
 static void __init upgrade_clock_rating(struct clocksource *tsc_early, struct clocksource *tsc) { }
+static void enable_native_sched_clock(void) { }
 #endif
 
 notrace u64 sched_clock(void)
@@ -1176,6 +1190,10 @@  static void tsc_cs_tick_stable(struct clocksource *cs)
 static int tsc_cs_enable(struct clocksource *cs)
 {
 	vclocks_set_used(VDSO_CLOCKMODE_TSC);
+
+	/* Restore native_sched_clock() when switching to TSC */
+	enable_native_sched_clock();
+
 	return 0;
 }