[v16,12/13] x86/tsc: Switch to native sched clock

Message ID 20250106124633.1418972-13-nikunj@amd.com (mailing list archive)
State New, archived
Series Add Secure TSC support for SNP guests

Commit Message

Nikunj A. Dadhania Jan. 6, 2025, 12:46 p.m. UTC
Although the kernel switches over to the stable TSC clocksource instead of
the PV clocksource, the scheduler still keeps on using PV clocks as the
sched clock source. This is because KVM, Xen and VMware switch the
paravirt sched clock handler in their init routines. Hyper-V is the only
PV clock source that checks whether the platform provides an invariant
TSC and does not switch to the PV sched clock.

When switching back to the stable TSC, restore the scheduler clock to
native_sched_clock().

As the clock selection happens in the stop_machine() context, schedule
delayed work to update the static_call(); a static_call update cannot be
done from atomic context.

Cc: Alexey Makhalov <alexey.makhalov@broadcom.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Signed-off-by: Nikunj A Dadhania <nikunj@amd.com>
---
 arch/x86/kernel/tsc.c | 18 ++++++++++++++++++
 1 file changed, 18 insertions(+)

Comments

Borislav Petkov Jan. 7, 2025, 7:37 p.m. UTC | #1
On Mon, Jan 06, 2025 at 06:16:32PM +0530, Nikunj A Dadhania wrote:
> Although the kernel switches over to the stable TSC clocksource instead of
> the PV clocksource, the scheduler still keeps on using PV clocks as the
> sched clock source. This is because KVM, Xen and VMware switch the
> paravirt sched clock handler in their init routines. Hyper-V is the only
> PV clock source that checks whether the platform provides an invariant
> TSC and does not switch to the PV sched clock.

So this below looks like some desperate hackery. Are you doing this because
kvm_sched_clock_init() does

	paravirt_set_sched_clock(kvm_sched_clock_read);

?

If so, why don't you simply prevent that on STSC guests?
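
[ For context: paravirt_set_sched_clock() is a thin wrapper that repoints the
  pv_sched_clock static_call, so whichever guest init routine calls it last
  owns sched_clock. A rough sketch of the current implementation:

	void paravirt_set_sched_clock(u64 (*func)(void))
	{
		static_call_update(pv_sched_clock, func);
	}
]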
Nikunj A. Dadhania Jan. 8, 2025, 5:20 a.m. UTC | #2
On 1/8/2025 1:07 AM, Borislav Petkov wrote:
> On Mon, Jan 06, 2025 at 06:16:32PM +0530, Nikunj A Dadhania wrote:
>> Although the kernel switches over to the stable TSC clocksource instead of
>> the PV clocksource, the scheduler still keeps on using PV clocks as the
>> sched clock source. This is because KVM, Xen and VMware switch the
>> paravirt sched clock handler in their init routines. Hyper-V is the only
>> PV clock source that checks whether the platform provides an invariant
>> TSC and does not switch to the PV sched clock.
>
> So this below looks like some desperate hackery. Are you doing this because
> kvm_sched_clock_init() does
>
> 	paravirt_set_sched_clock(kvm_sched_clock_read);
>
> ?

All guests (including STSC) running on KVM or other hypervisors like Xen/VMware
use the regular TSC when the guest is booted with the TscInvariant bit set. But
the sched clock is not switched to the TSC, which it should be, as Sean pointed
out here [1]:

    No, I'm saying that the guest should prefer the raw TSC over kvmclock if the
    TSC is stable, irrespective of SNP or TDX. This is effectively already done
    for the timekeeping base (see commit 7539b174aef4 ("x86: kvmguest: use TSC
    clocksource if invariant TSC is exposed")), but the scheduler still uses
    kvmclock thanks to the kvm_sched_clock_init() code.

> If so, why don't you simply prevent that on STSC guests?

This is for all guests running on different hypervisors with a stable TSC. And I
did try to limit this to KVM here [2]; here is Sean's comment on that
implementation:

    > Although the kernel switches over to a stable TSC clocksource instead of
    > kvmclock, the scheduler still keeps on using kvmclock as the sched clock.
    > This is due to kvm_sched_clock_init() updating the pv_sched_clock()
    > unconditionally.

    All PV clocks are affected by this, no? This seems like something that
    should be handled in common code, which is the point I was trying to make in
    v11.

Regards
Nikunj

1. https://lore.kernel.org/kvm/ZurCbP7MesWXQbqZ@google.com/
2. https://lore.kernel.org/kvm/20241009092850.197575-17-nikunj@amd.com/
Borislav Petkov Jan. 8, 2025, 8:22 a.m. UTC | #3
On Wed, Jan 08, 2025 at 10:50:13AM +0530, Nikunj A. Dadhania wrote:
> All guests (including STSC) running on KVM or other hypervisors like Xen/VMware
> use the regular TSC when the guest is booted with the TscInvariant bit set. But
> the sched clock is not switched to the TSC, which it should be, as Sean pointed
> out here [1]:

Well, that paravirt_set_sched_clock() thing is there for a reason and all the
different guest types which call it perhaps have their reasons.

Now, if you think it would make sense to use the TSC in every guest type if
the TSC is this and that, then I'd need to see a patch or even a patchset
which explains why it is ok to do so on every guest type, which is then
tested on every guest type, and in which each patch converts a single guest
type and properly explains why.

And your patch doesn't have that at all. I know Sean thinks it is a good idea
and perhaps it is but without proper justification and acks from the other
guest type maintainers, I simply won't take such a patch(-set).

And even if it is good, you need to solve it another way - not with this
delayed-work hackery.

IOW, this switching the paravirt clock crap to use TSC when the TSC is this
and that and the HV tells you so, should be a wholly separate patchset with
reviews and acks and the usual shebang.

If you want to take care only of STSC now, I'd take a patch which is known
good and tested properly. And that should happen very soon because the merge
window is closing in. Otherwise, there's always another merge window and
rushing things doesn't do any good anyway, as I keep preaching...

Your call.
Nikunj A. Dadhania Jan. 8, 2025, 8:34 a.m. UTC | #4
On 1/8/2025 1:52 PM, Borislav Petkov wrote:
> On Wed, Jan 08, 2025 at 10:50:13AM +0530, Nikunj A. Dadhania wrote:
> And your patch doesn't have that at all. I know Sean thinks it is a good idea
> and perhaps it is but without proper justification and acks from the other
> guest type maintainers, I simply won't take such a patch(-set).

I haven't heard anything from other guest type maintainers yet, and they are 
on CC from v12 onward.

> And even if it is good, you need to solve it another way - not with this
> delayed-work hackery.
> 
> IOW, this switching the paravirt clock crap to use TSC when the TSC is this
> and that and the HV tells you so, should be a wholly separate patchset with
> reviews and acks and the usual shebang.

I agree, we should handle this as a separate patch series.

> If you want to take care only of STSC now, I'd take a patch which is known
> good and tested properly. And that should happen very soon because the merge
> window is closing in. 

In that case, let me limit this only to STSC for now; I will send an updated patch.

Regards,
Nikunj
Nikunj A. Dadhania Jan. 8, 2025, 10:20 a.m. UTC | #5
On 1/8/2025 2:04 PM, Nikunj A. Dadhania wrote:
> 
>> If you want to take care only of STSC now, I'd take a patch which is known
>> good and tested properly. And that should happen very soon because the merge
>> window is closing in. 
> 
> In that case, let me limit this only to STSC for now; I will send an updated patch.

From: Nikunj A Dadhania <nikunj@amd.com>
Date: Wed, 8 Jan 2025 14:18:04 +0530
Subject: [PATCH] x86/kvmclock: Prefer TSC as the scheduler clock for Secure
 TSC guests

Although the kernel switches over to a stable TSC clocksource instead of
kvmclock, the scheduler still keeps on using kvmclock as the sched clock.
This is due to kvm_sched_clock_init() updating the pv_sched_clock()
unconditionally. Do not override the PV sched clock when Secure TSC is
enabled.

Signed-off-by: Nikunj A Dadhania <nikunj@amd.com>
---
 arch/x86/kernel/kvmclock.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
index d8fef3a65a35..82c4743a5e7a 100644
--- a/arch/x86/kernel/kvmclock.c
+++ b/arch/x86/kernel/kvmclock.c
@@ -324,8 +324,10 @@ void __init kvmclock_init(void)
 	if (kvm_para_has_feature(KVM_FEATURE_CLOCKSOURCE_STABLE_BIT))
 		pvclock_set_flags(PVCLOCK_TSC_STABLE_BIT);
 
-	flags = pvclock_read_flags(&hv_clock_boot[0].pvti);
-	kvm_sched_clock_init(flags & PVCLOCK_TSC_STABLE_BIT);
+	if (!cc_platform_has(CC_ATTR_GUEST_SNP_SECURE_TSC)) {
+		flags = pvclock_read_flags(&hv_clock_boot[0].pvti);
+		kvm_sched_clock_init(flags & PVCLOCK_TSC_STABLE_BIT);
+	}
 
 	x86_platform.calibrate_tsc = kvm_get_tsc_khz;
 	x86_platform.calibrate_cpu = kvm_get_tsc_khz;
Sean Christopherson Jan. 8, 2025, 2 p.m. UTC | #6
On Wed, Jan 08, 2025, Nikunj A. Dadhania wrote:
> 
> On 1/8/2025 2:04 PM, Nikunj A. Dadhania wrote:
> > 
> >> If you want to take care only of STSC now, I'd take a patch which is known
> >> good and tested properly. And that should happen very soon because the merge
> >> window is closing in. 
> > 
> > In that case, let me limit this only to STSC for now; I will send an updated patch.
> 
> From: Nikunj A Dadhania <nikunj@amd.com>
> Date: Wed, 8 Jan 2025 14:18:04 +0530
> Subject: [PATCH] x86/kvmclock: Prefer TSC as the scheduler clock for Secure
>  TSC guests
> 
> Although the kernel switches over to a stable TSC clocksource instead of
> kvmclock, the scheduler still keeps on using kvmclock as the sched clock.
> This is due to kvm_sched_clock_init() updating the pv_sched_clock()
> unconditionally. Do not override the PV sched clock when Secure TSC is
> enabled.
> 
> Signed-off-by: Nikunj A Dadhania <nikunj@amd.com>
> ---
>  arch/x86/kernel/kvmclock.c | 6 ++++--
>  1 file changed, 4 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
> index d8fef3a65a35..82c4743a5e7a 100644
> --- a/arch/x86/kernel/kvmclock.c
> +++ b/arch/x86/kernel/kvmclock.c
> @@ -324,8 +324,10 @@ void __init kvmclock_init(void)
>  	if (kvm_para_has_feature(KVM_FEATURE_CLOCKSOURCE_STABLE_BIT))
>  		pvclock_set_flags(PVCLOCK_TSC_STABLE_BIT);
>  
> -	flags = pvclock_read_flags(&hv_clock_boot[0].pvti);
> -	kvm_sched_clock_init(flags & PVCLOCK_TSC_STABLE_BIT);
> +	if (!cc_platform_has(CC_ATTR_GUEST_SNP_SECURE_TSC)) {
> +		flags = pvclock_read_flags(&hv_clock_boot[0].pvti);
> +		kvm_sched_clock_init(flags & PVCLOCK_TSC_STABLE_BIT);
> +	}

This still misses my point.  Ditto for the "x86/tsc: Switch Secure TSC guests away
from kvm-clock".

I object to singling out kvmclock.  It's weird and misleading, because handling
only kvmclock suggests that other PV clocks are somehow trusted/ok, when in
reality the only reason kvmclock is getting singled out is (presumably) because
it's what Nikunj and the other folks enabling KVM SNP test on.

What I care most about is having a sane, consistent policy throughout the kernel.
E.g. so that a user/reader walks away with an understanding that PV clocks are a
theoretical host attack vector and so should be avoided when Secure TSC is
available.

Ideally, if the TSC is the preferred clocksource, then the scheduler will use the
TSC and not a PV clock irrespective of STSC.  But I 100% agree with Boris that
it needs buy-in from other maintainers (including Paolo), because it's entirely
possible (likely, even) that there's an angle to scheduling I'm not considering.
Borislav Petkov Jan. 8, 2025, 3:34 p.m. UTC | #7
On Wed, Jan 08, 2025 at 06:00:59AM -0800, Sean Christopherson wrote:
> This still misses my point. 

Oh I got it, no worries.

> Ditto for the "x86/tsc: Switch Secure TSC guests away from kvm-clock".
> 
> I object to singling out kvmclock.  It's weird and misleading, because handling
> only kvmclock suggests that other PV clocks are somehow trusted/ok, when in
> reality the only reason kvmclock is getting singled out is (presumably) because
> it's what Nikunj and the other folks enabling KVM SNP test on.

Presumably.

> What I care most about is having a sane, consistent policy throughout the kernel.
> E.g. so that a user/reader walks away with an understanding that PV clocks are a
> theoretical host attack vector and so should be avoided when Secure TSC is
> available.

Yap, agreed.

> Ideally, if the TSC is the preferred clocksource, then the scheduler will use the
> TSC and not a PV clock irrespective of STSC.  But I 100% agree with Boris that
> it needs buy-in from other maintainers (including Paolo), because it's entirely
> possible (likely, even) that there's an angle to scheduling I'm not considering.

That's exactly why I wanted to have this taken care of only for the STSC side
of things now and temporarily. So that we can finally land those STSC patches
- they've been pending for waaay too long.

And then ask Nikunj nicely to clean up this whole pv clock gunk, potentially
kill some of those old clocksources which probably don't matter anymore.

But your call how/when you wanna do this.

If you want the cleanup first, I'll take only a subset of the STSC set so that
I can unload some of that set upstream.

Thx.
Sean Christopherson Jan. 8, 2025, 5:02 p.m. UTC | #8
On Wed, Jan 08, 2025, Borislav Petkov wrote:
> On Wed, Jan 08, 2025 at 06:00:59AM -0800, Sean Christopherson wrote:
> > Ideally, if the TSC is the preferred clocksource, then the scheduler will use the
> > TSC and not a PV clock irrespective of STSC.  But I 100% agree with Boris that
> > it needs buy-in from other maintainers (including Paolo), because it's entirely
> > possible (likely, even) that there's an angle to scheduling I'm not considering.
> 
> That's exactly why I wanted to have this taken care of only for the STSC side
> of things now and temporarily. So that we can finally land those STSC patches
> - they've been pending for waaay too long.
> 
> And then ask Nikunj nicely to clean up this whole pv clock gunk, potentially
> kill some of those old clocksources which probably don't matter anymore.
> 
> But your call how/when you wanna do this.

I'm okay starting with just TDX and SNP guests, but I don't want to special case
SNP's Secure TSC anywhere in kvmclock or common TSC/sched code.

For TDX guests, the TSC is _always_ "secure".  So similar to singling out kvmclock,
handling SNP's STSC but not the TDX case again leaves the kernel in an inconsistent
state.  Which is why I originally suggested[*] fixing the sched_clock mess
generically; doing so would avoid the need to special case SNP or TDX in code
that doesn't/shouldn't care about SNP or TDX.

[*] https://lore.kernel.org/all/ZurCbP7MesWXQbqZ@google.com

> If you want the cleanup first, I'll take only a subset of the STSC set so that
> I can unload some of that set upstream.

My vote is to apply through "x86/sev: Mark Secure TSC as reliable clocksource",
and then split "x86/tsc: Switch Secure TSC guests away from kvm-clock" to grab
only the snp_secure_tsc_init() related changes (which is how that patch should
be constructed no matter what; adding support for MSR_AMD64_GUEST_TSC_FREQ has
nothing to do with kvmclock).

And then figure out how to wrangle clocksource and sched_clock in a way that is
sane and consistent.
Borislav Petkov Jan. 8, 2025, 7:53 p.m. UTC | #9
On Wed, Jan 08, 2025 at 09:02:34AM -0800, Sean Christopherson wrote:
> I'm okay starting with just TDX and SNP guests, but I don't want to special case
> SNP's Secure TSC anywhere in kvmclock or common TSC/sched code.
> 
> For TDX guests, the TSC is _always_ "secure".

Ah, good to know. I was wondering what the situation in TDX land is wrt TSC...

> So similar to singling out kvmclock,
> handling SNP's STSC but not the TDX case again leaves the kernel in an inconsistent
> state.  Which is why I originally suggested[*] fixing the sched_clock mess
> generically; doing so would avoid the need to special case SNP or TDX in code
> that doesn't/shouldn't care about SNP or TDX.

Right.

> My vote is to apply through "x86/sev: Mark Secure TSC as reliable clocksource",
> and then split "x86/tsc: Switch Secure TSC guests away from kvm-clock" to grab
> only the snp_secure_tsc_init() related changes (which is how that patch should
> be constructed no matter what; adding support for MSR_AMD64_GUEST_TSC_FREQ has
> nothing to do with kvmclock).
> 
> And then figure out how to wrangle clocksource and sched_clock in a way that is
> sane and consistent.

Makes sense to me, I'll carve out the bits.

Nikunj, I'd appreciate it if you gathered whatever bits are left wrt kvmclock
and took care of things as Sean suggests above.

Thx.
Nikunj A. Dadhania Jan. 9, 2025, 6:32 a.m. UTC | #10
On 1/8/2025 10:32 PM, Sean Christopherson wrote:
> On Wed, Jan 08, 2025, Borislav Petkov wrote:
>> On Wed, Jan 08, 2025 at 06:00:59AM -0800, Sean Christopherson wrote:

> For TDX guests, the TSC is _always_ "secure".  So similar to singling out kvmclock,
> handling SNP's STSC but not the TDX case again leaves the kernel in an inconsistent
> state.  Which is why I originally suggested[*] fixing the sched_clock mess
> generically; doing so would avoid the need to special case SNP or TDX in code
> that doesn't/shouldn't care about SNP or TDX.

That is what I have attempted in this patch [1]: irrespective of SNP/TDX, whenever
the TSC is picked as the clocksource, the sched clock will start using the TSC
instead of any PV sched clock. This does not have any special case for STSC/SNP/TDX.

> [*] https://lore.kernel.org/all/ZurCbP7MesWXQbqZ@google.com

Regards
Nikunj

1. https://lore.kernel.org/kvm/20250106124633.1418972-13-nikunj@amd.com/
Sean Christopherson Jan. 15, 2025, 9:37 p.m. UTC | #11
On Thu, Jan 09, 2025, Nikunj A. Dadhania wrote:
> On 1/8/2025 10:32 PM, Sean Christopherson wrote:
> > On Wed, Jan 08, 2025, Borislav Petkov wrote:
> >> On Wed, Jan 08, 2025 at 06:00:59AM -0800, Sean Christopherson wrote:
> 
> > For TDX guests, the TSC is _always_ "secure".  So similar to singling out kvmclock,
> > handling SNP's STSC but not the TDX case again leaves the kernel in an inconsistent
> > state.  Which is why I originally suggested[*] fixing the sched_clock mess
> > generically; doing so would avoid the need to special case SNP or TDX in code
> > that doesn't/shouldn't care about SNP or TDX.
> 
> That is what I have attempted in this patch [1]: irrespective of SNP/TDX, whenever
> the TSC is picked as the clocksource, the sched clock will start using the TSC
> instead of any PV sched clock. This does not have any special case for STSC/SNP/TDX.

*sigh*

This is a complete and utter trainwreck.

Paolo had an idea to handle this without a deferred callback by having kvmclock
detect if kvmclock is selected as the clocksource, OR if the TSC is unreliable,
i.e. if the kernel may switch to kvmclock due to the TSC being marked unreliable.

Unfortunately, that idea doesn't work because the ordering is all wrong.  The
*early* TSC initialization in setup_arch() happens after kvmclock_init() (via
init_hypervisor_platform())

	init_hypervisor_platform();

	tsc_early_init();

and even if we mucked with that, __clocksource_select() very deliberately doesn't
change the clocksource until the kernel is "finished" booting, in order to avoid
thrashing the clocksource.

  static struct clocksource *clocksource_find_best(bool oneshot, bool skipcur)
  {
	struct clocksource *cs;

	if (!finished_booting || list_empty(&clocksource_list)) <===
		return NULL;

	...
  }


  /*
   * clocksource_done_booting - Called near the end of core bootup
   *
   * Hack to avoid lots of clocksource churn at boot time.
   * We use fs_initcall because we want this to start before
   * device_initcall but after subsys_initcall.
   */
  static int __init clocksource_done_booting(void)
  {
	mutex_lock(&clocksource_mutex);
	curr_clocksource = clocksource_default_clock();
	finished_booting = 1;
	/*
	 * Run the watchdog first to eliminate unstable clock sources
	 */
	__clocksource_watchdog_kthread();
	clocksource_select();
	mutex_unlock(&clocksource_mutex);
	return 0;
  }
  fs_initcall(clocksource_done_booting);

I fiddled with a variety of ideas to try and let kvmclock tell sched_clock that
it prefers the TSC, e.g. if TSC_RELIABLE is set, but after seeing this comment
in native_sched_clock():

	/*
	 * Fall back to jiffies if there's no TSC available:
	 * ( But note that we still use it if the TSC is marked
	 *   unstable. We do this because unlike Time Of Day,
	 *   the scheduler clock tolerates small errors and it's
	 *   very important for it to be as fast as the platform
	 *   can achieve it. )
	 */

My strong vote is to prefer TSC over kvmclock for sched_clock if the TSC is constant,
nonstop, and not marked unstable via the command line.  I.e. use the same criteria as
tweaking the clocksource rating.  As above, sched_clock is more tolerant of slop
than clocksource, so it's a bit ridiculous to care whether the TSC or kvmclock
(or something else entirely) is used for the clocksource.

If we wanted to go with a more conservative approach, e.g. to minimize the risk
of breaking existing setups, we could also condition the change on the TSC being
reliable and having a known frequency.  I.e. require SNP's Secure TSC, or require
the hypervisor to enumerate the TSC frequency via CPUID.  I don't see a ton of
value in that approach though, and long-term it would lead to some truly weird
code due to holding sched_clock to a higher standard than clocksource.

But wait, there's more!  Because TDX doesn't override .calibrate_tsc() or
.calibrate_cpu(), even though TDX provides a trusted TSC *and* enumerates the
frequency of the TSC, unless I'm missing something, tsc_early_init() will compute
the TSC frequency using the information provided by KVM, i.e. the untrusted host.

The "obvious" solution is to leave the calibration functions as-is if the TSC has
a known, reliable frequency, but even _that_ is riddled with problems, because
as-is, the kernel sets TSC_KNOWN_FREQ and TSC_RELIABLE in tsc_early_init(), which
runs *after* init_hypervisor_platform().  SNP Secure TSC fudges around this by
overriding the calibration routines, but that's a bit gross and easy to fix if we
also fix TDX.  And fixing TDX by running native_calibrate_tsc() would give the
same love to setups where the hypervisor provides CPUID 0x15 and/or 0x16.

All in all, I'm thinking something like this (across multiple patches):

---
 arch/x86/include/asm/tsc.h |  1 +
 arch/x86/kernel/kvmclock.c | 55 ++++++++++++++++++++++++++------------
 arch/x86/kernel/setup.c    |  7 +++++
 arch/x86/kernel/tsc.c      | 32 +++++++++++++++++-----
 4 files changed, 72 insertions(+), 23 deletions(-)

diff --git a/arch/x86/include/asm/tsc.h b/arch/x86/include/asm/tsc.h
index 94408a784c8e..e13d6c3f2298 100644
--- a/arch/x86/include/asm/tsc.h
+++ b/arch/x86/include/asm/tsc.h
@@ -28,6 +28,7 @@ static inline cycles_t get_cycles(void)
 }
 #define get_cycles get_cycles
 
+extern void tsc_early_detect(void);
 extern void tsc_early_init(void);
 extern void tsc_init(void);
 extern void mark_tsc_unstable(char *reason);
diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
index 5b2c15214a6b..fa6bf71cc511 100644
--- a/arch/x86/kernel/kvmclock.c
+++ b/arch/x86/kernel/kvmclock.c
@@ -317,33 +317,54 @@ void __init kvmclock_init(void)
 	if (kvm_para_has_feature(KVM_FEATURE_CLOCKSOURCE_STABLE_BIT))
 		pvclock_set_flags(PVCLOCK_TSC_STABLE_BIT);
 
-	flags = pvclock_read_flags(&hv_clock_boot[0].pvti);
-	kvm_sched_clock_init(flags & PVCLOCK_TSC_STABLE_BIT);
+	/*
+	 * If the TSC counts at a constant frequency across P/T states, counts
+	 * in deep C-states, and the TSC hasn't been marked unstable, prefer
+	 * the TSC over kvmclock for sched_clock and drop kvmclock's rating so
+	 * that TSC is chosen as the clocksource.  Note, the TSC unstable check
+	 * exists purely to honor the TSC being marked unstable via command
+	 * line; any runtime detection of an unstable TSC will happen after this.
+	 */
+	if (boot_cpu_has(X86_FEATURE_CONSTANT_TSC) &&
+	    boot_cpu_has(X86_FEATURE_NONSTOP_TSC) &&
+	    !check_tsc_unstable()) {
+		kvm_clock.rating = 299;
+		pr_warn("kvm-clock: Using native sched_clock\n");
+	} else {
+		flags = pvclock_read_flags(&hv_clock_boot[0].pvti);
+		kvm_sched_clock_init(flags & PVCLOCK_TSC_STABLE_BIT);
+	}
+
+	/*
+	 * If the TSC frequency is already known, e.g. via CPUID, rely on that
+	 * information.  For "normal" VMs, the hypervisor controls kvmclock and
+	 * CPUID, i.e. the frequency is coming from the same place.  For CoCo
+	 * VMs, the TSC frequency may be provided by trusted firmware, in which
+	 * case it's highly desirable to use that information, not kvmclock's.
+	 * Note, TSC_KNOWN_FREQ must be read before presetting loops-per-jiffy
+	 * (see kvm_get_tsc_khz()).
+	 */
+	if (!boot_cpu_has(X86_FEATURE_TSC_KNOWN_FREQ) ||
+	    !boot_cpu_has(X86_FEATURE_TSC_RELIABLE)) {
+		pr_warn("kvm-clock: Using native calibration\n");
+		x86_platform.calibrate_tsc = kvm_get_tsc_khz;
+		x86_platform.calibrate_cpu = kvm_get_tsc_khz;
+	}
 
-	x86_platform.calibrate_tsc = kvm_get_tsc_khz;
-	x86_platform.calibrate_cpu = kvm_get_tsc_khz;
 	x86_platform.get_wallclock = kvm_get_wallclock;
 	x86_platform.set_wallclock = kvm_set_wallclock;
 #ifdef CONFIG_X86_LOCAL_APIC
 	x86_cpuinit.early_percpu_clock_init = kvm_setup_secondary_clock;
 #endif
+	/*
+	 * Save/restore "sched" clock state even if kvmclock isn't being used
+	 * for sched_clock, as kvmclock is still used for wallclock and relies
+	 * on these hooks to re-enable kvmclock after suspend+resume.
+	 */
 	x86_platform.save_sched_clock_state = kvm_save_sched_clock_state;
 	x86_platform.restore_sched_clock_state = kvm_restore_sched_clock_state;
 	kvm_get_preset_lpj();
 
-	/*
-	 * X86_FEATURE_NONSTOP_TSC is TSC runs at constant rate
-	 * with P/T states and does not stop in deep C-states.
-	 *
-	 * Invariant TSC exposed by host means kvmclock is not necessary:
-	 * can use TSC as clocksource.
-	 *
-	 */
-	if (boot_cpu_has(X86_FEATURE_CONSTANT_TSC) &&
-	    boot_cpu_has(X86_FEATURE_NONSTOP_TSC) &&
-	    !check_tsc_unstable())
-		kvm_clock.rating = 299;
-
 	clocksource_register_hz(&kvm_clock, NSEC_PER_SEC);
 	pv_info.name = "KVM";
 }
diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index f1fea506e20f..2b6800426349 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -907,6 +907,13 @@ void __init setup_arch(char **cmdline_p)
 	reserve_ibft_region();
 	x86_init.resources.dmi_setup();
 
+	/*
+	 * Detect, but do not fully initialize, TSC info before initializing
+	 * the hypervisor platform, so that hypervisor code can make informed
+	 * decisions about using a paravirt clock vs. TSC.
+	 */
+	tsc_early_detect();
+
 	/*
 	 * VMware detection requires dmi to be available, so this
 	 * needs to be done after dmi_setup(), for the boot CPU.
diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
index 0864b314c26a..9baffb425386 100644
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -663,7 +663,12 @@ unsigned long native_calibrate_tsc(void)
 	unsigned int eax_denominator, ebx_numerator, ecx_hz, edx;
 	unsigned int crystal_khz;
 
-	if (boot_cpu_data.x86_vendor != X86_VENDOR_INTEL)
+	/*
+	 * Ignore the vendor when running as a VM, if the hypervisor provides
+	 * garbage CPUID information then the vendor is also suspect.
+	 */
+	if (boot_cpu_data.x86_vendor != X86_VENDOR_INTEL &&
+	    !boot_cpu_has(X86_FEATURE_HYPERVISOR))
 		return 0;
 
 	if (boot_cpu_data.cpuid_level < 0x15)
@@ -713,10 +718,13 @@ unsigned long native_calibrate_tsc(void)
 		return 0;
 
 	/*
-	 * For Atom SoCs TSC is the only reliable clocksource.
-	 * Mark TSC reliable so no watchdog on it.
+	 * For Atom SoCs TSC is the only reliable clocksource.  Similarly, in a
+	 * VM, any watchdog is going to be less reliable than the TSC as the
+	 * watchdog source will be emulated in software.  In both cases, mark
+	 * the TSC reliable so that no watchdog runs on it.
 	 */
-	if (boot_cpu_data.x86_vfm == INTEL_ATOM_GOLDMONT)
+	if (boot_cpu_has(X86_FEATURE_HYPERVISOR) ||
+	    boot_cpu_data.x86_vfm == INTEL_ATOM_GOLDMONT)
 		setup_force_cpu_cap(X86_FEATURE_TSC_RELIABLE);
 
 #ifdef CONFIG_X86_LOCAL_APIC
@@ -1509,6 +1517,20 @@ static void __init tsc_enable_sched_clock(void)
 	static_branch_enable(&__use_tsc);
 }
 
+void __init tsc_early_detect(void)
+{
+	if (!boot_cpu_has(X86_FEATURE_TSC))
+		return;
+
+	snp_secure_tsc_init();
+
+	/*
+	 * Run through native TSC calibration to set TSC_KNOWN_FREQ and/or
+	 * TSC_RELIABLE as appropriate.
+	 */
+	native_calibrate_tsc();
+}
+
 void __init tsc_early_init(void)
 {
 	if (!boot_cpu_has(X86_FEATURE_TSC))
@@ -1517,8 +1539,6 @@ void __init tsc_early_init(void)
 	if (is_early_uv_system())
 		return;
 
-	snp_secure_tsc_init();
-
 	if (!determine_cpu_tsc_frequencies(true))
 		return;
 	tsc_enable_sched_clock();

base-commit: 0563ee35ae2c9cfb0c6a7b2c0ddf7d9372bb8a98
--
Borislav Petkov Jan. 16, 2025, 4:25 p.m. UTC | #12
On Wed, Jan 15, 2025 at 01:37:25PM -0800, Sean Christopherson wrote:
> My strong vote is to prefer TSC over kvmclock for sched_clock if the TSC is constant,
> nonstop, and not marked unstable via the command line.

So how do we deal with the case here where a malicious HV could set those bits
and still tweak the TSC?

IOW, I think Secure TSC and TDX should be the only ones who trust the TSC when
in a guest.

If anything, trusting the TSC in a normal guest should at least issue
a warning saying that we do use the TSC but there's no 100% guarantee that it
is trustworthy...

> But wait, there's more!  Because TDX doesn't override .calibrate_tsc() or
> .calibrate_cpu(), even though TDX provides a trusted TSC *and* enumerates the
> frequency of the TSC, unless I'm missing something, tsc_early_init() will compute
> the TSC frequency using the information provided by KVM, i.e. the untrusted host.

Yeah, I guess we don't want that. Or at least we should warn about it.
 
> +	/*
> +	 * If the TSC counts at a constant frequency across P/T states, counts
> +	 * in deep C-states, and the TSC hasn't been marked unstable, prefer
> +	 * the TSC over kvmclock for sched_clock and drop kvmclock's rating so
> +	 * that TSC is chosen as the clocksource.  Note, the TSC unstable check
> +	 * exists purely to honor the TSC being marked unstable via command
> +	 * line, any runtime detection of an unstable will happen after this.
> +	 */
> +	if (boot_cpu_has(X86_FEATURE_CONSTANT_TSC) &&
> +	    boot_cpu_has(X86_FEATURE_NONSTOP_TSC) &&
> +	    !check_tsc_unstable()) {
> +		kvm_clock.rating = 299;
> +		pr_warn("kvm-clock: Using native sched_clock\n");

The warn is in the right direction but probably should say TSC still cannot be
trusted 100%.

> diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
> index 0864b314c26a..9baffb425386 100644
> --- a/arch/x86/kernel/tsc.c
> +++ b/arch/x86/kernel/tsc.c
> @@ -663,7 +663,12 @@ unsigned long native_calibrate_tsc(void)
>  	unsigned int eax_denominator, ebx_numerator, ecx_hz, edx;
>  	unsigned int crystal_khz;
>  
> -	if (boot_cpu_data.x86_vendor != X86_VENDOR_INTEL)
> +	/*
> +	 * Ignore the vendor when running as a VM, if the hypervisor provides
> +	 * garbage CPUID information then the vendor is also suspect.
> +	 */
> +	if (boot_cpu_data.x86_vendor != X86_VENDOR_INTEL &&
> +	    !boot_cpu_has(X86_FEATURE_HYPERVISOR))
>  		return 0;
>  
>  	if (boot_cpu_data.cpuid_level < 0x15)
> @@ -713,10 +718,13 @@ unsigned long native_calibrate_tsc(void)
>  		return 0;
>  
>  	/*
> -	 * For Atom SoCs TSC is the only reliable clocksource.
> -	 * Mark TSC reliable so no watchdog on it.
> +	 * For Atom SoCs TSC is the only reliable clocksource.  Similarly, in a
> +	 * VM, any watchdog is going to be less reliable than the TSC as the
> +	 * watchdog source will be emulated in software.  In both cases, mark
> +	 * the TSC reliable so that no watchdog runs on it.
>  	 */
> -	if (boot_cpu_data.x86_vfm == INTEL_ATOM_GOLDMONT)
> +	if (boot_cpu_has(X86_FEATURE_HYPERVISOR) ||
> +	    boot_cpu_data.x86_vfm == INTEL_ATOM_GOLDMONT)
>  		setup_force_cpu_cap(X86_FEATURE_TSC_RELIABLE);
>  
>  #ifdef CONFIG_X86_LOCAL_APIC

It looks all wrong if a function called native_* sprinkles a bunch of "am
I a guest" checks. I guess we should split it into VM and native variants.

But yeah, the general direction is ok once we agree on what we do when
exactly.

Thx.
Sean Christopherson Jan. 16, 2025, 4:56 p.m. UTC | #13
On Thu, Jan 16, 2025, Borislav Petkov wrote:
> On Wed, Jan 15, 2025 at 01:37:25PM -0800, Sean Christopherson wrote:
> > My strong vote is to prefer TSC over kvmclock for sched_clock if the TSC is constant,
> > nonstop, and not marked unstable via the command line.
> 
> So how do we deal with the case here where a malicious HV could set those bits
> and still tweak the TSC?

You don't.  If the hypervisor is malicious, it's game over for non-CoCo guests.
The clocksource being untrusted is completely uninteresting because the host has
full access to VM state.

It's probably game over for SEV and SEV-ES too, e.g. thanks to remapping attacks,
lack of robust attestation, and a variety of other avenues a truly malicious
hypervisor can exploit to attack the guest.

It's only with SNP and TDX that the clocksource becomes at all interesting.

> IOW, I think Secure TSC and TDX should be the only ones who trust the TSC when
> in a guest.
> 
> If anything, trusting the TSC in a normal guest should at least issue
> a warning saying that we do use the TSC but there's no 100% guarantee that it
> is trustworthy...

Why?  For non-TDX/SecureTSC guests, *all* clocksources are host controlled.

> > But wait, there's more!  Because TDX doesn't override .calibrate_tsc() or
> > .calibrate_cpu(), even though TDX provides a trusted TSC *and* enumerates the
> > frequency of the TSC, unless I'm missing something, tsc_early_init() will compute
> > the TSC frequency using the information provided by KVM, i.e. the untrusted host.
> 
> Yeah, I guess we don't want that. Or at least we should warn about it.

CPUID 0x15 (and 0x16?) is guaranteed to be available under TDX, and Secure TSC
would ideally assert that the kernel doesn't switch to some other calibration
method too.  Not sure where to hook into that though, without bleeding TDX and
SNP details everywhere.
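
The naive form of such an assertion, e.g. somewhere in tsc_init(), would be
something along these lines (purely illustrative, assuming this series'
securetsc_get_tsc_khz() helper, and exactly the kind of detail bleeding I'd
rather avoid):

	/*
	 * Hypothetical sketch: a Secure TSC guest should never end up on a
	 * calibration routine the untrusted host can influence.
	 */
	if (cc_platform_has(CC_ATTR_GUEST_SNP_SECURE_TSC) &&
	    x86_platform.calibrate_tsc != securetsc_get_tsc_khz)
		pr_warn("Secure TSC calibration was overridden\n");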

> > +	/*
> > +	 * If the TSC counts at a constant frequency across P/T states, counts
> > +	 * in deep C-states, and the TSC hasn't been marked unstable, prefer
> > +	 * the TSC over kvmclock for sched_clock and drop kvmclock's rating so
> > +	 * that TSC is chosen as the clocksource.  Note, the TSC unstable check
> > +	 * exists purely to honor the TSC being marked unstable via command
> > +	 * line; any runtime detection of an unstable TSC will happen after this.
> > +	 */
> > +	if (boot_cpu_has(X86_FEATURE_CONSTANT_TSC) &&
> > +	    boot_cpu_has(X86_FEATURE_NONSTOP_TSC) &&
> > +	    !check_tsc_unstable()) {
> > +		kvm_clock.rating = 299;
> > +		pr_warn("kvm-clock: Using native sched_clock\n");
> 
> The warn is in the right direction but probably should say TSC still cannot be
> trusted 100%.

Heh, I didn't mean to include the printks, they were purely for my own debugging.

> > diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
> > index 0864b314c26a..9baffb425386 100644
> > --- a/arch/x86/kernel/tsc.c
> > +++ b/arch/x86/kernel/tsc.c
> > @@ -663,7 +663,12 @@ unsigned long native_calibrate_tsc(void)
> >  	unsigned int eax_denominator, ebx_numerator, ecx_hz, edx;
> >  	unsigned int crystal_khz;
> >  
> > -	if (boot_cpu_data.x86_vendor != X86_VENDOR_INTEL)
> > +	/*
> > +	 * Ignore the vendor when running as a VM, if the hypervisor provides
> > +	 * garbage CPUID information then the vendor is also suspect.
> > +	 */
> > +	if (boot_cpu_data.x86_vendor != X86_VENDOR_INTEL &&
> > +	    !boot_cpu_has(X86_FEATURE_HYPERVISOR))
> >  		return 0;
> >  
> >  	if (boot_cpu_data.cpuid_level < 0x15)
> > @@ -713,10 +718,13 @@ unsigned long native_calibrate_tsc(void)
> >  		return 0;
> >  
> >  	/*
> > -	 * For Atom SoCs TSC is the only reliable clocksource.
> > -	 * Mark TSC reliable so no watchdog on it.
> > +	 * For Atom SoCs TSC is the only reliable clocksource.  Similarly, in a
> > +	 * VM, any watchdog is going to be less reliable than the TSC as the
> > +	 * watchdog source will be emulated in software.  In both cases, mark
> > +	 * the TSC reliable so that no watchdog runs on it.
> >  	 */
> > -	if (boot_cpu_data.x86_vfm == INTEL_ATOM_GOLDMONT)
> > +	if (boot_cpu_has(X86_FEATURE_HYPERVISOR) ||
> > +	    boot_cpu_data.x86_vfm == INTEL_ATOM_GOLDMONT)
> >  		setup_force_cpu_cap(X86_FEATURE_TSC_RELIABLE);
> >  
> >  #ifdef CONFIG_X86_LOCAL_APIC
> 
> It looks all wrong if a function called native_* sprinkles a bunch of "am
> I a guest" checks. I guess we should split it into VM and native variants.

I agree the naming is weird, but outside of the vendor checks, the VM code is
identical to the "native" code, so I don't know that it's worth splitting into
multiple functions.

What if we simply rename it to calibrate_tsc_from_cpuid()?

> But yeah, the general direction is ok once we agree on what we do when
> exactly.
Borislav Petkov Jan. 17, 2025, 8:28 p.m. UTC | #14
On Thu, Jan 16, 2025 at 08:56:25AM -0800, Sean Christopherson wrote:
> It's only with SNP and TDX that the clocksource becomes at all interesting.

So basically you're saying, let's just go ahead and trust the TSC when the HV
sets a bunch of CPUID bits.

But we really really trust it when the guest type is SNP+STSC or TDX since
there the HV is out of the picture and the only one who can flub it there is
the OEM.

> CPUID 0x15 (and 0x16?) is guaranteed to be available under TDX, and Secure TSC
> would ideally assert that the kernel doesn't switch to some other calibration
> method too.  Not sure where to hook into that though, without bleeding TDX and
> SNP details everywhere.

We could use the platform calibrate* function pointers and assign TDX- or
SNP-specific ones and perhaps even define new such function ptrs. That's what
the platform stuff is for... needs staring, ofc.
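
FWIW, the Secure TSC bits in this series already have that shape; if I'm
reading the snp_secure_tsc_init() patch right, it boils down to roughly this
(an abbreviated sketch, not the verbatim patch):

	void __init snp_secure_tsc_init(void)
	{
		unsigned long long tsc_freq_mhz;

		if (!cc_platform_has(CC_ATTR_GUEST_SNP_SECURE_TSC))
			return;

		/* The trusted frequency comes from MSR_AMD64_GUEST_TSC_FREQ. */
		setup_force_cpu_cap(X86_FEATURE_TSC_KNOWN_FREQ);
		rdmsrl(MSR_AMD64_GUEST_TSC_FREQ, tsc_freq_mhz);
		snp_tsc_freq_khz = (unsigned long)(tsc_freq_mhz * 1000);

		x86_platform.calibrate_cpu = securetsc_get_tsc_khz;
		x86_platform.calibrate_tsc = securetsc_get_tsc_khz;
	}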

> I agree the naming is weird, but outside of the vendor checks, the VM code is
> identical to the "native" code, so I don't know that it's worth splitting into
> multiple functions.
> 
> What if we simply rename it to calibrate_tsc_from_cpuid()?

This is all wrong layering with all those different guest types having their
own ->calibrate_tsc:

arch/x86/kernel/cpu/acrn.c:32:  x86_platform.calibrate_tsc = acrn_get_tsc_khz;
arch/x86/kernel/cpu/mshyperv.c:424:             x86_platform.calibrate_tsc = hv_get_tsc_khz;
arch/x86/kernel/cpu/vmware.c:419:               x86_platform.calibrate_tsc = vmware_get_tsc_khz;
arch/x86/kernel/jailhouse.c:213:        x86_platform.calibrate_tsc              = jailhouse_get_tsc;
arch/x86/kernel/kvmclock.c:323: x86_platform.calibrate_tsc = kvm_get_tsc_khz;
arch/x86/kernel/tsc.c:944:      tsc_khz = x86_platform.calibrate_tsc();
arch/x86/kernel/tsc.c:1458:                     tsc_khz = x86_platform.calibrate_tsc();
arch/x86/kernel/x86_init.c:148: .calibrate_tsc                  = native_calibrate_tsc,
arch/x86/xen/time.c:569:        x86_platform.calibrate_tsc = xen_tsc_khz;

What you want sounds like a redesign to me considering how you want to keep
the KVM guest code and baremetal pretty close... Hmmm, needs staring...

Thx.
Sean Christopherson Jan. 17, 2025, 8:59 p.m. UTC | #15
On Fri, Jan 17, 2025, Borislav Petkov wrote:
> On Thu, Jan 16, 2025 at 08:56:25AM -0800, Sean Christopherson wrote:
> > It's only with SNP and TDX that the clocksource becomes at all interesting.
> 
> So basically you're saying, let's just go ahead and trust the TSC when the HV
> sets a bunch of CPUID bits.

Sort of.  It's not a trust thing though.  The Xen, KVM, and VMware PV clocks are
all based on TSC, i.e. we already "trust" the hypervisor to not muck with TSC.

The purpose of such PV clocks is to account for things in software, that aren't
handled in hardware.  E.g. to provide a constant counter on hardware without a
constant TSC.

The proposal here, and what kvmclock already does for clocksource, is to use the
raw TSC when the hypervisor sets bits that effectively say that the massaging of
TSC done by the PV clock isn't needed.
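
For reference, the "massaging" is little more than offsetting and scaling the
raw TSC with host-provided parameters.  A sketch of a pvclock read, ignoring
the seqcount retry loop and the flags handling:

	static u64 pvclock_read_sketch(struct pvclock_vcpu_time_info *pvti)
	{
		u64 delta = rdtsc() - pvti->tsc_timestamp;

		/* Scale the guest TSC delta toward nanoseconds... */
		if (pvti->tsc_shift >= 0)
			delta <<= pvti->tsc_shift;
		else
			delta >>= -pvti->tsc_shift;

		/* ...using the host's 32.32 fixed-point multiplier. */
		return pvti->system_time +
		       mul_u64_u32_shr(delta, pvti->tsc_to_system_mul, 32);
	}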

> But we really really trust it when the guest type is SNP+STSC or TDX since
> there the HV is out of the picture and the only one who can flub it there is
> the OEM.

Yep.  This one _is_ about trust.  Specifically, we trust the raw TSC more than
any clock that is provided by the HV.

> > CPUID 0x15 (and 0x16?) is guaranteed to be available under TDX, and Secure TSC
> > would ideally assert that the kernel doesn't switch to some other calibration
> > method too.  Not sure where to hook into that though, without bleeding TDX and
> > SNP details everywhere.
> 
> We could use the platform calibrate* function pointers and assign TDX- or
> SNP-specific ones and perhaps even define new such function ptrs. That's what
> the platform stuff is for... needs staring, ofc.
> 
> > I agree the naming is weird, but outside of the vendor checks, the VM code is
> > identical to the "native" code, so I don't know that it's worth splitting into
> > multiple functions.
> > 
> > What if we simply rename it to calibrate_tsc_from_cpuid()?
> 
> This is all wrong layering with all those different guest types having their
> own ->calibrate_tsc:
> 
> arch/x86/kernel/cpu/acrn.c:32:  x86_platform.calibrate_tsc = acrn_get_tsc_khz;
> arch/x86/kernel/cpu/mshyperv.c:424:             x86_platform.calibrate_tsc = hv_get_tsc_khz;
> arch/x86/kernel/cpu/vmware.c:419:               x86_platform.calibrate_tsc = vmware_get_tsc_khz;
> arch/x86/kernel/jailhouse.c:213:        x86_platform.calibrate_tsc              = jailhouse_get_tsc;
> arch/x86/kernel/kvmclock.c:323: x86_platform.calibrate_tsc = kvm_get_tsc_khz;
> arch/x86/kernel/tsc.c:944:      tsc_khz = x86_platform.calibrate_tsc();
> arch/x86/kernel/tsc.c:1458:                     tsc_khz = x86_platform.calibrate_tsc();
> arch/x86/kernel/x86_init.c:148: .calibrate_tsc                  = native_calibrate_tsc,
> arch/x86/xen/time.c:569:        x86_platform.calibrate_tsc = xen_tsc_khz;
> 
> What you want sounds like a redesign to me considering how you want to keep
> the KVM guest code and baremetal pretty close... Hmmm, needs staring...

It's not KVM guest code though.  The CPUID stuff is Intel's architecturally
defined behavior.  There are oodles and oodles of features that are transparently
emulated by the hypervisor according to hardware specifications.  Generally
speaking, the kernel treats those as "native", e.g. native_wrmsrl(), native_cpuid(),
etc.

What I am proposing is that, for TDX especially, instead of relying on the hypervisor
to use a paravirtual channel for communicating the TSC frequency, we rely on the
hardware-defined way of getting the frequency, because CPUID is emulated by the
trusted entity, i.e. the OEM.

Hmm, though I suppose I'm arguing against myself in that case.  If the hypervisor
provides the frequency and there are no trust issues, why would we care if the
kernel gets the frequency via CPUID or the PV channel.  It's really only TDX that
matters.  And we could handle TDX by overriding .calibrate_tsc() in tsc_init(),
same as Secure TSC.

That said, I do think it makes sense to either override the vendor and F/M/S
checks in native_calibrate_tsc().  Or even better drop the initial vendor check
entirely, because both Intel and AMD have a rich history of implementing each
other's CPUID leaves.  I.e. I see no reason to ignore CPUID 0x15 just because
the CPU isn't Intel.

As for the Goldmont F/M/S check, that one is a virtualization-specific thing.
The argument is that when running as a guest, any non-TSC clocksource is going
to be emulated by the hypervisor, and therefore is going to be less reliable than
TSC.  I.e. putting a watchdog on TSC does more harm than good, because what ends
up happening is the TSC gets marked unreliable because the *watchdog* is unreliable.
Nikunj A. Dadhania Jan. 21, 2025, 3:59 a.m. UTC | #16
On 1/16/2025 3:07 AM, Sean Christopherson wrote:

> My strong vote is to prefer TSC over kvmclock for sched_clock if the TSC is constant,
> nonstop, and not marked unstable via the command line.  I.e. use the same criteria as
> tweaking the clocksource rating.  As above, sched_clock is more tolerant of slop
> than clocksource, so it's a bit ridiculous to care whether the TSC or kvmclock
> (or something else entirely) is used for the clocksource.
> 
> If we wanted to go with a more conservative approach, e.g. to minimize the risk
> of breaking existing setups, we could also condition the change on the TSC being
> reliable and having a known frequency.  I.e. require SNP's Secure TSC, or require
> the hypervisor to enumerate the TSC frequency via CPUID.  I don't see a ton of
> value in that approach though, and long-term it would lead to some truly weird
> code due to holding sched_clock to a higher standard than clocksource.
> 
> But wait, there's more!  Because TDX doesn't override .calibrate_tsc() or
> .calibrate_cpu(), even though TDX provides a trusted TSC *and* enumerates the
> frequency of the TSC, unless I'm missing something, tsc_early_init() will compute
> the TSC frequency using the information provided by KVM, i.e. the untrusted host.
> 
> The "obvious" solution is to leave the calibration functions as-is if the TSC has
> a known, reliable frequency, but even _that_ is riddled with problems, because
> as-is, the kernel sets TSC_KNOWN_FREQ and TSC_RELIABLE in tsc_early_init(), which
> runs *after* init_hypervisor_platform().  SNP Secure TSC fudges around this by
> overiding the calibration routines, but that's a bit gross and easy to fix if we
> also fix TDX.  And fixing TDX by running native_calibrate_tsc() would give the
> same love to setups where the hypervisor provides CPUID 0x15 and/or 0x16.

One change I wasn't sure about was non-Intel guests using CPUID 0x15/0x16,
which you have already answered in the subsequent emails. At present, AMD
platforms do not implement either leaf and will bail out at the check below:

        /* CPUID 15H TSC/Crystal ratio, plus optionally Crystal Hz */
        cpuid(CPUID_LEAF_TSC, &eax_denominator, &ebx_numerator, &ecx_hz, &edx);

        if (ebx_numerator == 0 || eax_denominator == 0)
                return 0;

> All in all, I'm thinking something like this (across multiple patches):

Tested on AMD Milan and the changes are working as intended:

For SecureTSC guests with TSC_INVARIANT set:
* Raw TSC is used as clocksource and sched-clock
* Calibration is done using GUEST_TSC_FREQ MSR

For non-SecureTSC guests with TSC_INVARIANT set:
* Raw TSC is used as clocksource and sched-clock
* Calibration is done using kvm-clock

For non-SecureTSC guests without TSC_INVARIANT:
* kvm-clock(based on TSC) is used as clocksource and sched-clock
* Calibration is done using kvm-clock

> diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
> index 0864b314c26a..9baffb425386 100644
> --- a/arch/x86/kernel/tsc.c
> +++ b/arch/x86/kernel/tsc.c
> @@ -663,7 +663,12 @@ unsigned long native_calibrate_tsc(void)
>  	unsigned int eax_denominator, ebx_numerator, ecx_hz, edx;
>  	unsigned int crystal_khz;
>  
> -	if (boot_cpu_data.x86_vendor != X86_VENDOR_INTEL)
> +	/*
> +	 * Ignore the vendor when running as a VM, if the hypervisor provides
> +	 * garbage CPUID information then the vendor is also suspect.
> +	 */
> +	if (boot_cpu_data.x86_vendor != X86_VENDOR_INTEL &&
> +	    !boot_cpu_has(X86_FEATURE_HYPERVISOR))
>  		return 0;
>  
>  	if (boot_cpu_data.cpuid_level < 0x15)

Regards
Nikunj
Borislav Petkov Jan. 21, 2025, 11:32 a.m. UTC | #17
On Fri, Jan 17, 2025 at 12:59:37PM -0800, Sean Christopherson wrote:
> The proposal here, and what kvmclock already does for clocksource, is to use the
> raw TSC when the hypervisor sets bits that effectively say that the massaging of
> TSC done by the PV clock isn't needed.
>
> > But we really really trust it when the guest type is SNP+STSC or TDX since
> > there the HV is out of the picture and the only one who can flub it there is
> > the OEM.
> 
> Yep.  This one _is_ about trust.  Specifically, we trust the raw TSC more than
> any clock that is provided by the HV.

Ok, makes sense.

> It's not KVM guest code though.  The CPUID stuff is Intel's architecturally
> defined behavior.  There are oodles and oodles of features that are transparently
> emulated by the hypervisor according to hardware specifications.  Generally
> speaking, the kernel treats those as "native", e.g. native_wrmsrl(), native_cpuid(),
> etc.

Uuuh, this is calling for confusion.

native_* has always been stuff the kernel calls when it runs, well, natively,
on baremetal. And not some PV or whatever-enlightened ops.

Now, if you want to emulate those transparently that's fine but this is what
native means in arch/x86/. Unless I've missed some memo but I doubt it. And
I asked around :)

> What I am proposing is that, for TDX especially, instead of relying on the
> hypervisor to use a paravirtual channel for communicating the TSC frequency,
> we rely on the hardware-defined way of getting the frequency, because CPUID
> is emulated by the trusted entity, i.e. the OEM.

I wouldn't trust the OEM with a single bit but that's a different story. :-P

> Hmm, though I suppose I'm arguing against myself in that case.  If the
> hypervisor provides the frequency and there are no trust issues, why would
> we care if the kernel gets the frequency via CPUID or the PV channel.  It's
> really only TDX that matters.  And we could handle TDX by overriding
> .calibrate_tsc() in tsc_init(), same as Secure TSC.

I guess it all boils down to the trust level you want to have with the TSC.
I don't know whether there's even a HV which tries to lie to the guest with
the TSC and I guess we'll find out eventually but for now we could treat the
two things similarly. The two things being:

- TDX and STSC SNP guests - full TSC trust
- HV with the required CPUID bits set - almost full, we assume HV is
  benevolent. Could change if we notice something.

> That said, I do think it makes sense to either override the vendor and F/M/S
> checks in native_calibrate_tsc().  Or even better drop the initial vendor check
> entirely,

I have no clue why that thing is even there:

aa297292d708 ("x86/tsc: Enumerate SKL cpu_khz and tsc_khz via CPUID")

I guess back then it meant that only Intel sports those new CPUID leafs but
those things do change.

> because both Intel and AMD have a rich history of implementing each other's
> CPUID leaves.  I.e. I see no reason to ignore CPUID 0x15 just because the
> CPU isn't Intel.
> 
> As for the Goldmost F/M/S check, that one is a virtualization specific
> thing.  The argument is that when running as a guest, any non-TSC
> clocksource is going to be emulated by the hypervisor, and therefore is
> going to be less reliable than TSC.  I.e. putting a watchdog on TSC does
> more harm than good, because what ends up happening is the TSC gets marked
> unreliable because the *watchdog* is unreliable.

So I think the best thing to do is to carve out the meat of that function
which is valid for both baremetal and virtualized and then have separate
helpers which call the common thing. So that you don't have to test
on *everything* when having to change it in the future for whatever reason and
so that both camps can be relatively free when implementing their
idiosyncrasies.

And then it should fit in the current scheme with platform function ptrs.

I know you want to use native_calibrate_tsc() but then you have to sprinkle
"am I a guest" checks and this reminds me of the "if XEN" fiasco back then.
Also, a really nasty lying HV won't simply set X86_FEATURE_HYPERVISOR...

So I'd prefer if we fit a guest which runs on a relatively honest HV :) into
the current scheme as the kernel running simply on yet another platform, even
at the price of some small code duplication.
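
IOW, something along those lines (a napkin sketch only, all names made up):

	/* The common CPUID 0x15/0x16 meat, no vendor or guest checks. */
	static unsigned long calibrate_tsc_from_cpuid(void);

	unsigned long native_calibrate_tsc(void)
	{
		if (boot_cpu_data.x86_vendor != X86_VENDOR_INTEL)
			return 0;

		/* For Atom SoCs TSC is the only reliable clocksource. */
		if (boot_cpu_data.x86_vfm == INTEL_ATOM_GOLDMONT)
			setup_force_cpu_cap(X86_FEATURE_TSC_RELIABLE);

		return calibrate_tsc_from_cpuid();
	}

	unsigned long vm_calibrate_tsc(void)
	{
		/* Any watchdog emulated by the HV is worse than the TSC. */
		setup_force_cpu_cap(X86_FEATURE_TSC_RELIABLE);

		/* No vendor check: CPUID is whatever the HV hands us anyway. */
		return calibrate_tsc_from_cpuid();
	}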

Thx.
Nikunj A. Dadhania Jan. 28, 2025, 5:41 a.m. UTC | #18
Sean Christopherson <seanjc@google.com> writes:

>
> diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
> index 0864b314c26a..9baffb425386 100644
> --- a/arch/x86/kernel/tsc.c
> +++ b/arch/x86/kernel/tsc.c
> @@ -663,7 +663,12 @@ unsigned long native_calibrate_tsc(void)
>  	unsigned int eax_denominator, ebx_numerator, ecx_hz, edx;
>  	unsigned int crystal_khz;
>  
> -	if (boot_cpu_data.x86_vendor != X86_VENDOR_INTEL)
> +	/*
> +	 * Ignore the vendor when running as a VM, if the hypervisor provides
> +	 * garbage CPUID information then the vendor is also suspect.
> +	 */
> +	if (boot_cpu_data.x86_vendor != X86_VENDOR_INTEL &&
> +	    !boot_cpu_has(X86_FEATURE_HYPERVISOR))
>  		return 0;
>  
>  	if (boot_cpu_data.cpuid_level < 0x15)
> @@ -713,10 +718,13 @@ unsigned long native_calibrate_tsc(void)
>  		return 0;
>  
>  	/*
> -	 * For Atom SoCs TSC is the only reliable clocksource.
> -	 * Mark TSC reliable so no watchdog on it.
> +	 * For Atom SoCs TSC is the only reliable clocksource.  Similarly, in a
> +	 * VM, any watchdog is going to be less reliable than the TSC as the
> +	 * watchdog source will be emulated in software.  In both cases, mark
> +	 * the TSC reliable so that no watchdog runs on it.
>  	 */
> -	if (boot_cpu_data.x86_vfm == INTEL_ATOM_GOLDMONT)
> +	if (boot_cpu_has(X86_FEATURE_HYPERVISOR) ||
> +	    boot_cpu_data.x86_vfm == INTEL_ATOM_GOLDMONT)
>  		setup_force_cpu_cap(X86_FEATURE_TSC_RELIABLE);

One more point here: for AMD guests, the TSC will not be marked reliable
as per the above change; it will only be effective for CPUs supporting
CPUID 0x15/0x16. However, the watchdog should be stopped for AMD guests
as well.

Would it make sense to move this before the cpuid_level check?

diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
index e7abcc4d02c3..2769d1598c0d 100644
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -672,6 +672,14 @@ unsigned long native_calibrate_tsc(void)
 	    !boot_cpu_has(X86_FEATURE_HYPERVISOR))
 		return 0;
 
+	/*
+	 * In a VM, any watchdog is going to be less reliable than the TSC as
+	 * the watchdog source will be emulated in software. Mark the TSC
+	 * reliable so that no watchdog runs on it.
+	 */
+	if (boot_cpu_has(X86_FEATURE_HYPERVISOR))
+		setup_force_cpu_cap(X86_FEATURE_TSC_RELIABLE);
+
 	if (boot_cpu_data.cpuid_level < CPUID_LEAF_TSC)
 		return 0;
 
@@ -719,13 +727,10 @@ unsigned long native_calibrate_tsc(void)
 		return 0;
 
 	/*
-	 * For Atom SoCs TSC is the only reliable clocksource.  Similarly, in a
-	 * VM, any watchdog is going to be less reliable than the TSC as the
-	 * watchdog source will be emulated in software.  In both cases, mark
-	 * the TSC reliable so that no watchdog runs on it.
+	 * For Atom SoCs TSC is the only reliable clocksource.
+	 * Mark TSC reliable so no watchdog on it.
 	 */
-	if (boot_cpu_has(X86_FEATURE_HYPERVISOR) ||
-	    boot_cpu_data.x86_vfm == INTEL_ATOM_GOLDMONT)
+	if (boot_cpu_data.x86_vfm == INTEL_ATOM_GOLDMONT)
 		setup_force_cpu_cap(X86_FEATURE_TSC_RELIABLE);
 
 #ifdef CONFIG_X86_LOCAL_APIC

Regards
Nikunj

Patch

diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
index 88d8bfceea04..fe7a0b1b7cfd 100644
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -291,12 +291,26 @@  static void __init upgrade_clock_rating(struct clocksource *tsc_early,
 		tsc->rating = 450;
 	}
 }
+
+static void enable_native_sc_work(struct work_struct *work)
+{
+	pr_info("Using native sched clock\n");
+	paravirt_set_sched_clock(native_sched_clock);
+}
+static DECLARE_DELAYED_WORK(enable_native_sc, enable_native_sc_work);
+
+static void enable_native_sched_clock(void)
+{
+	if (!using_native_sched_clock())
+		schedule_delayed_work(&enable_native_sc, 0);
+}
 #else
 u64 sched_clock_noinstr(void) __attribute__((alias("native_sched_clock")));
 
 bool using_native_sched_clock(void) { return true; }
 
 static void __init upgrade_clock_rating(struct clocksource *tsc_early, struct clocksource *tsc) { }
+static void enable_native_sched_clock(void) { }
 #endif
 
 notrace u64 sched_clock(void)
@@ -1176,6 +1190,10 @@  static void tsc_cs_tick_stable(struct clocksource *cs)
 static int tsc_cs_enable(struct clocksource *cs)
 {
 	vclocks_set_used(VDSO_CLOCKMODE_TSC);
+
+	/* Restore native_sched_clock() when switching to TSC */
+	enable_native_sched_clock();
+
 	return 0;
 }