Message ID | 20250106124633.1418972-13-nikunj@amd.com (mailing list archive)
---|---
State | New, archived
Series | Add Secure TSC support for SNP guests
On Mon, Jan 06, 2025 at 06:16:32PM +0530, Nikunj A Dadhania wrote:
> Although the kernel switches over to stable TSC clocksource instead of
> PV clocksource, the scheduler still keeps on using PV clocks as the
> sched clock source. This is because KVM, Xen and VMWare, switch the
> paravirt sched clock handler in their init routines. HyperV is the
> only PV clock source that checks if the platform provides an invariant
> TSC and does not switch to PV sched clock.

So this below looks like some desperate hackery. Are you doing this because
kvm_sched_clock_init() does

	paravirt_set_sched_clock(kvm_sched_clock_read);

?

If so, why don't you simply prevent that on STSC guests?
On 1/8/2025 1:07 AM, Borislav Petkov wrote:
> On Mon, Jan 06, 2025 at 06:16:32PM +0530, Nikunj A Dadhania wrote:
>> Although the kernel switches over to stable TSC clocksource instead of
>> PV clocksource, the scheduler still keeps on using PV clocks as the
>> sched clock source. This is because KVM, Xen and VMWare, switch the
>> paravirt sched clock handler in their init routines. HyperV is the
>> only PV clock source that checks if the platform provides an invariant
>> TSC and does not switch to PV sched clock.
>
> So this below looks like some desperate hackery. Are you doing this because
> kvm_sched_clock_init() does
>
> 	paravirt_set_sched_clock(kvm_sched_clock_read);
>
> ?

All guests(including STSC) running on KVM or other hypervisors like Xen/VMWare
uses the regular TSC when the guest is booted with TscInvariant bit set. But it
doesn't switch the sched clock to use TSC which it should have, pointed here[1]
by Sean:

  No, I'm saying that the guest should prefer the raw TSC over kvmclock if the
  TSC is stable, irrespective of SNP or TDX. This is effectively already done
  for the timekeeping base (see commit 7539b174aef4 ("x86: kvmguest: use TSC
  clocksource if invariant TSC is exposed")), but the scheduler still uses
  kvmclock thanks to the kvm_sched_clock_init() code.

> If so, why don't you simply prevent that on STSC guests?

This is for all guests running on different hypervisors with stable TSC. And I
did try to limit this to KVM here [2], here is Sean's comment to that
implementation:

  > Although the kernel switches over to stable TSC clocksource instead of
  > kvmclock, the scheduler still keeps on using kvmclock as the sched clock.
  > This is due to kvm_sched_clock_init() updating the pv_sched_clock()
  > unconditionally.

  All PV clocks are affected by this, no? This seems like something that
  should be handled in common code, which is the point I was trying to make
  in v11.

Regards
Nikunj

1. https://lore.kernel.org/kvm/ZurCbP7MesWXQbqZ@google.com/
2. https://lore.kernel.org/kvm/20241009092850.197575-17-nikunj@amd.com/
On Wed, Jan 08, 2025 at 10:50:13AM +0530, Nikunj A. Dadhania wrote:
> All guests(including STSC) running on KVM or other hypervisors like Xen/VMWare
> uses the regular TSC when the guest is booted with TscInvariant bit set. But it
> doesn't switch the sched clock to use TSC which it should have, pointed here[1]
> by Sean:

Well, that paravirt_set_sched_clock() thing is there for a reason and all the
different guest types which call it perhaps have their reasons.

Now, if you think it would make sense to use the TSC in every guest type if the
TSC is this and that, then I'd need to see a patch or even a patchset which
explains why it is ok to do so on every guest type and then tested on every
guest type and then each patch would convert a single guest type and properly
explain why.

And your patch doesn't have that at all. I know Sean thinks it is a good idea
and perhaps it is but without proper justification and acks from the other
guest type maintainers, I simply won't take such a patch(-set).

And even if it is good, you need to solve it another way - not with this
delayed-work hackery.

IOW, this switching the paravirt clock crap to use TSC when the TSC is this
and that and the HV tells you so, should be a wholly separate patchset with
reviews and acks and the usual shebang.

If you want to take care only of STSC now, I'd take a patch which is known
good and tested properly. And that should happen very soon because the merge
window is closing in.

Otherwise, there's always another merge window and rushing things doesn't do
any good anyway, as I keep preaching...

Your call.
On 1/8/2025 1:52 PM, Borislav Petkov wrote:
> On Wed, Jan 08, 2025 at 10:50:13AM +0530, Nikunj A. Dadhania wrote:
>
> And your patch doesn't have that at all. I know Sean thinks it is a good idea
> and perhaps it is but without proper justification and acks from the other
> guest type maintainers, I simply won't take such a patch(-set).

I haven't heard anything from other guest type maintainers yet, and they are
on CC from v12 onward.

> And even if it is good, you need to solve it another way - not with this
> delayed-work hackery.
>
> IOW, this switching the paravirt clock crap to use TSC when the TSC is this
> and that and the HV tells you so, should be a wholly separate patchset with
> reviews and acks and the usual shebang.

I agree, we should handle this as a separate patch series.

> If you want to take care only of STSC now, I'd take a patch which is known
> good and tested properly. And that should happen very soon because the merge
> window is closing in.

In that case, let me limit this only to STSC for now, i will send updated patch.

Regards,
Nikunj
On 1/8/2025 2:04 PM, Nikunj A. Dadhania wrote:
>
>> If you want to take care only of STSC now, I'd take a patch which is known
>> good and tested properly. And that should happen very soon because the merge
>> window is closing in.
>
> In that case, let me limit this only to STSC for now, i will send updated patch.

From: Nikunj A Dadhania <nikunj@amd.com>
Date: Wed, 8 Jan 2025 14:18:04 +0530
Subject: [PATCH] x86/kvmclock: Prefer TSC as the scheduler clock for Secure
 TSC guests

Although the kernel switches over to a stable TSC clocksource instead of
kvmclock, the scheduler still keeps on using kvmclock as the sched clock.
This is due to kvm_sched_clock_init() updating the pv_sched_clock()
unconditionally. Do not override the PV sched clock when Secure TSC is
enabled.

Signed-off-by: Nikunj A Dadhania <nikunj@amd.com>
---
 arch/x86/kernel/kvmclock.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
index d8fef3a65a35..82c4743a5e7a 100644
--- a/arch/x86/kernel/kvmclock.c
+++ b/arch/x86/kernel/kvmclock.c
@@ -324,8 +324,10 @@ void __init kvmclock_init(void)
 	if (kvm_para_has_feature(KVM_FEATURE_CLOCKSOURCE_STABLE_BIT))
 		pvclock_set_flags(PVCLOCK_TSC_STABLE_BIT);
 
-	flags = pvclock_read_flags(&hv_clock_boot[0].pvti);
-	kvm_sched_clock_init(flags & PVCLOCK_TSC_STABLE_BIT);
+	if (!cc_platform_has(CC_ATTR_GUEST_SNP_SECURE_TSC)) {
+		flags = pvclock_read_flags(&hv_clock_boot[0].pvti);
+		kvm_sched_clock_init(flags & PVCLOCK_TSC_STABLE_BIT);
+	}
 
 	x86_platform.calibrate_tsc = kvm_get_tsc_khz;
 	x86_platform.calibrate_cpu = kvm_get_tsc_khz;
On Wed, Jan 08, 2025, Nikunj A. Dadhania wrote:
> From: Nikunj A Dadhania <nikunj@amd.com>
> Subject: [PATCH] x86/kvmclock: Prefer TSC as the scheduler clock for Secure
>  TSC guests
>
> [...]
>
> -	flags = pvclock_read_flags(&hv_clock_boot[0].pvti);
> -	kvm_sched_clock_init(flags & PVCLOCK_TSC_STABLE_BIT);
> +	if (!cc_platform_has(CC_ATTR_GUEST_SNP_SECURE_TSC)) {
> +		flags = pvclock_read_flags(&hv_clock_boot[0].pvti);
> +		kvm_sched_clock_init(flags & PVCLOCK_TSC_STABLE_BIT);
> +	}

This still misses my point. Ditto for the "x86/tsc: Switch Secure TSC guests
away from kvm-clock".

I object to singling out kvmclock. It's weird and misleading, because handling
only kvmclock suggests that other PV clocks are somehow trusted/ok, when in
reality the only reason kvmclock is getting singled out is (presumably) because
it's what Nikunj and the other folks enabling KVM SNP test on.

What I care most about is having a sane, consistent policy throughout the
kernel. E.g. so that a user/reader walks away with an understanding that PV
clocks are a theoretical host attack vector and so should be avoided when
Secure TSC is available.

Ideally, if the TSC is the preferred clocksource, then the scheduler will use
the TSC and not a PV clock irrespective of STSC. But I 100% agree with Boris
that it needs buy-in from other maintainers (including Paolo), because it's
entirely possible (likely, even) that there's an angle to scheduling I'm not
considering.
On Wed, Jan 08, 2025 at 06:00:59AM -0800, Sean Christopherson wrote:
> This still misses my point.

Oh I got it, no worries.

> Ditto for the "x86/tsc: Switch Secure TSC guests away from kvm-clock".
>
> I object to singling out kvmclock. It's weird and misleading, because handling
> only kvmclock suggests that other PV clocks are somehow trusted/ok, when in
> reality the only reason kvmclock is getting singled out is (presumably) because
> it's what Nikunj and the other folks enabling KVM SNP test on.

Presumably.

> What I care most about is having a sane, consistent policy throughout the kernel.
> E.g. so that a user/reader walks away with an understanding PV clocks are a
> theoretical host attack vector and so should be avoided when Secure TSC is
> available.

Yap, agreed.

> Ideally, if the TSC is the preferred clocksource, then the scheduler will use the
> TSC and not a PV clock irrespective of STSC. But I 100% agree with Boris that
> it needs buy-in from other maintainers (including Paolo), because it's entirely
> possible (likely, even) that there's an angle to scheduling I'm not considering.

That's exactly why I wanted to have this taken care of only for the STSC side
of things now and temporarily. So that we can finally land those STSC patches
- they've been pending for waaay too long.

And then ask Nikunj nicely to clean up this whole pv clock gunk, potentially
kill some of those old clocksources which probably don't matter anymore.

But your call how/when you wanna do this. If you want the cleanup first, I'll
take only a subset of the STSC set so that I can unload some of that set
upstream.

Thx.
On Wed, Jan 08, 2025, Borislav Petkov wrote:
> On Wed, Jan 08, 2025 at 06:00:59AM -0800, Sean Christopherson wrote:
> > Ideally, if the TSC is the preferred clocksource, then the scheduler will use the
> > TSC and not a PV clock irrespective of STSC. But I 100% agree with Boris that
> > it needs buy-in from other maintainers (including Paolo), because it's entirely
> > possible (likely, even) that there's an angle to scheduling I'm not considering.
>
> That's exactly why I wanted to have this taken care of only for the STSC side
> of things now and temporarily. So that we can finally land those STSC patches
> - they've been pending for waaay too long.
>
> And then ask Nikunj nicely to clean up this whole pv clock gunk, potentially
> kill some of those old clocksources which probably don't matter anymore.
>
> But your call how/when you wanna do this.

I'm okay starting with just TDX and SNP guests, but I don't want to special
case SNP's Secure TSC anywhere in kvmclock or common TSC/sched code.

For TDX guests, the TSC is _always_ "secure". So similar to singling out
kvmclock, handling SNP's STSC but not the TDX case again leaves the kernel in
an inconsistent state. Which is why I originally suggested[*] fixing the
sched_clock mess generically; doing so would avoid the need to special case
SNP or TDX in code that doesn't/shouldn't care about SNP or TDX.

[*] https://lore.kernel.org/all/ZurCbP7MesWXQbqZ@google.com

> If you want the cleanup first, I'll take only a subset of the STSC set so that
> I can unload some of that set upstream.

My vote is to apply through "x86/sev: Mark Secure TSC as reliable clocksource",
and then split "x86/tsc: Switch Secure TSC guests away from kvm-clock" to grab
only the snp_secure_tsc_init() related changes (which is how that patch should
be constructed no matter what; adding support for MSR_AMD64_GUEST_TSC_FREQ has
nothing to do with kvmclock).

And then figure out how to wrangle clocksource and sched_clock in a way that
is sane and consistent.
On Wed, Jan 08, 2025 at 09:02:34AM -0800, Sean Christopherson wrote:
> I'm okay starting with just TDX and SNP guests, but I don't want to special case
> SNP's Secure TSC anywhere in kvmclock or common TSC/sched code.
>
> For TDX guests, the TSC is _always_ "secure".

Ah, good to know. I was wondering what the situation in TDX land is wrt TSC...

> So similar to singling out kvmclock, handling SNP's STSC but not the TDX case
> again leaves the kernel in an inconsistent state. Which is why I originally
> suggested[*] fixing the sched_clock mess generically; doing so would avoid
> the need to special case SNP or TDX in code that doesn't/shouldn't care about
> SNP or TDX.

Right.

> My vote is to apply through "x86/sev: Mark Secure TSC as reliable clocksource",
> and then split "x86/tsc: Switch Secure TSC guests away from kvm-clock" to grab
> only the snp_secure_tsc_init() related changes (which is how that patch should
> be constructed no matter what; adding support for MSR_AMD64_GUEST_TSC_FREQ has
> nothing to do with kvmclock).
>
> And then figure out how to wrangle clocksource and sched_clock in a way that is
> sane and consistent.

Makes sense to me, I'll carve out the bits.

Nikunj, I'd appreciate it if you gathered whatever bits are left wrt kvmclock
and take care of things as Sean suggests above.

Thx.
On 1/8/2025 10:32 PM, Sean Christopherson wrote:
> On Wed, Jan 08, 2025, Borislav Petkov wrote:
>> On Wed, Jan 08, 2025 at 06:00:59AM -0800, Sean Christopherson wrote:
>
> For TDX guests, the TSC is _always_ "secure". So similar to singling out
> kvmclock, handling SNP's STSC but not the TDX case again leaves the kernel
> in an inconsistent state. Which is why I originally suggested[*] fixing the
> sched_clock mess generically; doing so would avoid the need to special case
> SNP or TDX in code that doesn't/shouldn't care about SNP or TDX.

That is what I have attempted in this patch[1] where irrespective of SNP/TDX,
whenever TSC is picked up as a clock source, sched-clock will start using TSC
instead of any PV sched clock. This does not have any special case for
STSC/SNP/TDX.

> [*] https://lore.kernel.org/all/ZurCbP7MesWXQbqZ@google.com

Regards
Nikunj

1. https://lore.kernel.org/kvm/20250106124633.1418972-13-nikunj@amd.com/
On Thu, Jan 09, 2025, Nikunj A. Dadhania wrote:
> That is what I have attempted in this patch[1] where irrespective of SNP/TDX,
> whenever TSC is picked up as a clock source, sched-clock will start using TSC
> instead of any PV sched clock. This does not have any special case for
> STSC/SNP/TDX.

*sigh*

This is a complete and utter trainwreck.

Paolo had an idea to handle this without a deferred callback by having
kvmclock detect if kvmclock is selected as the clocksource, OR if the TSC is
unreliable, i.e. if the kernel may switch to kvmclock due to the TSC being
marked unreliable.

Unfortunately, that idea doesn't work because the ordering is all wrong. The
*early* TSC initialization in setup_arch() happens after kvmclock_init() (via
init_hypervisor_platform())

	init_hypervisor_platform();
	tsc_early_init();

and even if we mucked with that, __clocksource_select() very deliberately
doesn't change the clocksource until the kernel is "finished" booting, in
order to avoid thrashing the clocksource.

  static struct clocksource *clocksource_find_best(bool oneshot, bool skipcur)
  {
	struct clocksource *cs;

	if (!finished_booting || list_empty(&clocksource_list))   <===
		return NULL;
	...
  }

  /*
   * clocksource_done_booting - Called near the end of core bootup
   *
   * Hack to avoid lots of clocksource churn at boot time.
   * We use fs_initcall because we want this to start before
   * device_initcall but after subsys_initcall.
   */
  static int __init clocksource_done_booting(void)
  {
	mutex_lock(&clocksource_mutex);
	curr_clocksource = clocksource_default_clock();
	finished_booting = 1;

	/*
	 * Run the watchdog first to eliminate unstable clock sources
	 */
	__clocksource_watchdog_kthread();
	clocksource_select();
	mutex_unlock(&clocksource_mutex);
	return 0;
  }
  fs_initcall(clocksource_done_booting);

I fiddled with a variety of ideas to try and let kvmclock tell sched_clock
that it prefers the TSC, e.g. if TSC_RELIABLE is set, but after seeing this
comment in native_sched_clock():

	/*
	 * Fall back to jiffies if there's no TSC available:
	 * ( But note that we still use it if the TSC is marked
	 *   unstable. We do this because unlike Time Of Day,
	 *   the scheduler clock tolerates small errors and it's
	 *   very important for it to be as fast as the platform
	 *   can achieve it. )
	 */

My strong vote is prefer TSC over kvmclock for sched_clock if the TSC is
constant, nonstop, and not marked unstable via command line. I.e. use the same
criteria as tweaking the clocksource rating. As above, sched_clock is more
tolerant of slop than clocksource, so it's a bit ridiculous to care whether
the TSC or kvmclock (or something else entirely) is used for the clocksource.

If we wanted to go with a more conservative approach, e.g. to minimize the
risk of breaking existing setups, we could also condition the change on the
TSC being reliable and having a known frequency. I.e. require SNP's Secure
TSC, or require the hypervisor to enumerate the TSC frequency via CPUID. I
don't see a ton of value in that approach though, and long-term it would lead
to some truly weird code due to holding sched_clock to a higher standard than
clocksource.

But wait, there's more! Because TDX doesn't override .calibrate_tsc() or
.calibrate_cpu(), even though TDX provides a trusted TSC *and* enumerates the
frequency of the TSC, unless I'm missing something, tsc_early_init() will
compute the TSC frequency using the information provided by KVM, i.e. the
untrusted host.

The "obvious" solution is to leave the calibration functions as-is if the TSC
has a known, reliable frequency, but even _that_ is riddled with problems,
because as-is, the kernel sets TSC_KNOWN_FREQ and TSC_RELIABLE in
tsc_early_init(), which runs *after* init_hypervisor_platform(). SNP Secure
TSC fudges around this by overriding the calibration routines, but that's a
bit gross and easy to fix if we also fix TDX. And fixing TDX by running
native_calibrate_tsc() would give the same love to setups where the hypervisor
provides CPUID 0x15 and/or 0x16.

All in all, I'm thinking something like this (across multiple patches):

---
 arch/x86/include/asm/tsc.h |  1 +
 arch/x86/kernel/kvmclock.c | 55 ++++++++++++++++++++++++++------------
 arch/x86/kernel/setup.c    |  7 +++++
 arch/x86/kernel/tsc.c      | 32 +++++++++++++++++-----
 4 files changed, 72 insertions(+), 23 deletions(-)

diff --git a/arch/x86/include/asm/tsc.h b/arch/x86/include/asm/tsc.h
index 94408a784c8e..e13d6c3f2298 100644
--- a/arch/x86/include/asm/tsc.h
+++ b/arch/x86/include/asm/tsc.h
@@ -28,6 +28,7 @@ static inline cycles_t get_cycles(void)
 }
 #define get_cycles get_cycles
 
+extern void tsc_early_detect(void);
 extern void tsc_early_init(void);
 extern void tsc_init(void);
 extern void mark_tsc_unstable(char *reason);
diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
index 5b2c15214a6b..fa6bf71cc511 100644
--- a/arch/x86/kernel/kvmclock.c
+++ b/arch/x86/kernel/kvmclock.c
@@ -317,33 +317,54 @@ void __init kvmclock_init(void)
 	if (kvm_para_has_feature(KVM_FEATURE_CLOCKSOURCE_STABLE_BIT))
 		pvclock_set_flags(PVCLOCK_TSC_STABLE_BIT);
 
-	flags = pvclock_read_flags(&hv_clock_boot[0].pvti);
-	kvm_sched_clock_init(flags & PVCLOCK_TSC_STABLE_BIT);
+	/*
+	 * If the TSC counts at a constant frequency across P/T states, counts
+	 * in deep C-states, and the TSC hasn't been marked unstable, prefer
+	 * the TSC over kvmclock for sched_clock and drop kvmclock's rating so
+	 * that TSC is chosen as the clocksource.  Note, the TSC unstable check
+	 * exists purely to honor the TSC being marked unstable via command
+	 * line, any runtime detection of an unstable will happen after this.
+	 */
+	if (boot_cpu_has(X86_FEATURE_CONSTANT_TSC) &&
+	    boot_cpu_has(X86_FEATURE_NONSTOP_TSC) &&
+	    !check_tsc_unstable()) {
+		kvm_clock.rating = 299;
+		pr_warn("kvm-clock: Using native sched_clock\n");
+	} else {
+		flags = pvclock_read_flags(&hv_clock_boot[0].pvti);
+		kvm_sched_clock_init(flags & PVCLOCK_TSC_STABLE_BIT);
+	}
+
+	/*
+	 * If the TSC frequency is already known, e.g. via CPUID, rely on that
+	 * information.  For "normal" VMs, the hypervisor controls kvmclock and
+	 * CPUID, i.e. the frequency is coming from the same place.  For CoCo
+	 * VMs, the TSC frequency may be provided by trusted firmware, in which
+	 * case it's highly desirable to use that information, not kvmclock's.
+	 * Note, TSC_KNOWN_FREQ must be read before presetting loops-per-jiffy,
+	 * (see kvm_get_tsc_khz()).
+	 */
+	if (!boot_cpu_has(X86_FEATURE_TSC_KNOWN_FREQ) ||
+	    !boot_cpu_has(X86_FEATURE_TSC_RELIABLE)) {
+		pr_warn("kvm-clock: Using native calibration\n");
+		x86_platform.calibrate_tsc = kvm_get_tsc_khz;
+		x86_platform.calibrate_cpu = kvm_get_tsc_khz;
+	}
 
-	x86_platform.calibrate_tsc = kvm_get_tsc_khz;
-	x86_platform.calibrate_cpu = kvm_get_tsc_khz;
 	x86_platform.get_wallclock = kvm_get_wallclock;
 	x86_platform.set_wallclock = kvm_set_wallclock;
 #ifdef CONFIG_X86_LOCAL_APIC
 	x86_cpuinit.early_percpu_clock_init = kvm_setup_secondary_clock;
 #endif
+	/*
+	 * Save/restore "sched" clock state even if kvmclock isn't being used
+	 * for sched_clock, as kvmclock is still used for wallclock and relies
+	 * on these hooks to re-enable kvmclock after suspend+resume.
+	 */
 	x86_platform.save_sched_clock_state = kvm_save_sched_clock_state;
 	x86_platform.restore_sched_clock_state = kvm_restore_sched_clock_state;
 
 	kvm_get_preset_lpj();
 
-	/*
-	 * X86_FEATURE_NONSTOP_TSC is TSC runs at constant rate
-	 * with P/T states and does not stop in deep C-states.
-	 *
-	 * Invariant TSC exposed by host means kvmclock is not necessary:
-	 * can use TSC as clocksource.
-	 *
-	 */
-	if (boot_cpu_has(X86_FEATURE_CONSTANT_TSC) &&
-	    boot_cpu_has(X86_FEATURE_NONSTOP_TSC) &&
-	    !check_tsc_unstable())
-		kvm_clock.rating = 299;
-
 	clocksource_register_hz(&kvm_clock, NSEC_PER_SEC);
 
 	pv_info.name = "KVM";
 }
diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index f1fea506e20f..2b6800426349 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -907,6 +907,13 @@ void __init setup_arch(char **cmdline_p)
 	reserve_ibft_region();
 	x86_init.resources.dmi_setup();
 
+	/*
+	 * Detect, but do not fully initialize, TSC info before initializing
+	 * the hypervisor platform, so that hypervisor code can make informed
+	 * decisions about using a paravirt clock vs. TSC.
+	 */
+	tsc_early_detect();
+
 	/*
 	 * VMware detection requires dmi to be available, so this
 	 * needs to be done after dmi_setup(), for the boot CPU.
diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
index 0864b314c26a..9baffb425386 100644
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -663,7 +663,12 @@ unsigned long native_calibrate_tsc(void)
 	unsigned int eax_denominator, ebx_numerator, ecx_hz, edx;
 	unsigned int crystal_khz;
 
-	if (boot_cpu_data.x86_vendor != X86_VENDOR_INTEL)
+	/*
+	 * Ignore the vendor when running as a VM, if the hypervisor provides
+	 * garbage CPUID information then the vendor is also suspect.
+	 */
+	if (boot_cpu_data.x86_vendor != X86_VENDOR_INTEL &&
+	    !boot_cpu_has(X86_FEATURE_HYPERVISOR))
 		return 0;
 
 	if (boot_cpu_data.cpuid_level < 0x15)
@@ -713,10 +718,13 @@ unsigned long native_calibrate_tsc(void)
 		return 0;
 
 	/*
-	 * For Atom SoCs TSC is the only reliable clocksource.
-	 * Mark TSC reliable so no watchdog on it.
+	 * For Atom SoCs TSC is the only reliable clocksource.  Similarly, in a
+	 * VM, any watchdog is going to be less reliable than the TSC as the
+	 * watchdog source will be emulated in software.  In both cases, mark
+	 * the TSC reliable so that no watchdog runs on it.
 	 */
-	if (boot_cpu_data.x86_vfm == INTEL_ATOM_GOLDMONT)
+	if (boot_cpu_has(X86_FEATURE_HYPERVISOR) ||
+	    boot_cpu_data.x86_vfm == INTEL_ATOM_GOLDMONT)
 		setup_force_cpu_cap(X86_FEATURE_TSC_RELIABLE);
 
 #ifdef CONFIG_X86_LOCAL_APIC
@@ -1509,6 +1517,20 @@ static void __init tsc_enable_sched_clock(void)
 	static_branch_enable(&__use_tsc);
 }
 
+void __init tsc_early_detect(void)
+{
+	if (!boot_cpu_has(X86_FEATURE_TSC))
+		return;
+
+	snp_secure_tsc_init();
+
+	/*
+	 * Run through native TSC calibration to set TSC_KNOWN_FREQ and/or
+	 * TSC_RELIABLE as appropriate.
+	 */
+	native_calibrate_tsc();
+}
+
 void __init tsc_early_init(void)
 {
 	if (!boot_cpu_has(X86_FEATURE_TSC))
@@ -1517,8 +1539,6 @@ void __init tsc_early_init(void)
 	if (is_early_uv_system())
 		return;
 
-	snp_secure_tsc_init();
-
 	if (!determine_cpu_tsc_frequencies(true))
 		return;
 
 	tsc_enable_sched_clock();

base-commit: 0563ee35ae2c9cfb0c6a7b2c0ddf7d9372bb8a98
--
On Wed, Jan 15, 2025 at 01:37:25PM -0800, Sean Christopherson wrote:
> My strong vote is prefer TSC over kvmclock for sched_clock if the TSC is
> constant, nonstop, and not marked unstable via command line.

So how do we deal with the case here where a malicious HV could set those bits
and still tweak the TSC?

IOW, I think Secure TSC and TDX should be the only ones who trust the TSC when
in a guest.

If anything, trusting the TSC in a normal guest should at least issue a
warning saying that we do use the TSC but there's no 100% guarantee that it is
trustworthy...

> But wait, there's more! Because TDX doesn't override .calibrate_tsc() or
> .calibrate_cpu(), even though TDX provides a trusted TSC *and* enumerates the
> frequency of the TSC, unless I'm missing something, tsc_early_init() will compute
> the TSC frequency using the information provided by KVM, i.e. the untrusted host.

Yeah, I guess we don't want that. Or at least we should warn about it.

> +	if (boot_cpu_has(X86_FEATURE_CONSTANT_TSC) &&
> +	    boot_cpu_has(X86_FEATURE_NONSTOP_TSC) &&
> +	    !check_tsc_unstable()) {
> +		kvm_clock.rating = 299;
> +		pr_warn("kvm-clock: Using native sched_clock\n");

The warn is in the right direction but probably should say TSC still cannot be
trusted 100%.

> -	if (boot_cpu_data.x86_vendor != X86_VENDOR_INTEL)
> +	/*
> +	 * Ignore the vendor when running as a VM, if the hypervisor provides
> +	 * garbage CPUID information then the vendor is also suspect.
> +	 */
> +	if (boot_cpu_data.x86_vendor != X86_VENDOR_INTEL &&
> +	    !boot_cpu_has(X86_FEATURE_HYPERVISOR))
> 		return 0;
>
> [...]
>
> -	if (boot_cpu_data.x86_vfm == INTEL_ATOM_GOLDMONT)
> +	if (boot_cpu_has(X86_FEATURE_HYPERVISOR) ||
> +	    boot_cpu_data.x86_vfm == INTEL_ATOM_GOLDMONT)
> 		setup_force_cpu_cap(X86_FEATURE_TSC_RELIABLE);

It looks all wrong if a function called native_* sprinkles a bunch of "am I a
guest" checks. I guess we should split it into VM and native variants.

But yeah, the general direction is ok once we agree on what we do when exactly.

Thx.
On Thu, Jan 16, 2025, Borislav Petkov wrote:
> On Wed, Jan 15, 2025 at 01:37:25PM -0800, Sean Christopherson wrote:
> > My strong vote is prefer TSC over kvmclock for sched_clock if the TSC is
> > constant, nonstop, and not marked unstable via command line.
>
> So how do we deal with the case here where a malicious HV could set those bits
> and still tweak the TSC?

You don't. If the hypervisor is malicious, it's game over for non-CoCo guests.
The clocksource being untrusted is completely uninteresting because the host
has full access to VM state.

It's probably game over for SEV and SEV-ES too, e.g. thanks to remapping
attacks, lack of robust attestation, and a variety of other avenues a truly
malicious hypervisor can exploit to attack the guest.

It's only with SNP and TDX that the clocksource becomes at all interesting.

> IOW, I think Secure TSC and TDX should be the only ones who trust the TSC when
> in a guest.
>
> If anything, trusting the TSC in a normal guest should at least issue a
> warning saying that we do use the TSC but there's no 100% guarantee that it is
> trustworthy...

Why? For non-TDX/SecureTSC guests, *all* clocksources are host controlled.

> > But wait, there's more! Because TDX doesn't override .calibrate_tsc() or
> > .calibrate_cpu(), even though TDX provides a trusted TSC *and* enumerates the
> > frequency of the TSC, unless I'm missing something, tsc_early_init() will compute
> > the TSC frequency using the information provided by KVM, i.e. the untrusted host.
>
> Yeah, I guess we don't want that. Or at least we should warn about it.

CPUID 0x15 (and 0x16?) is guaranteed to be available under TDX, and Secure TSC
would ideally assert that the kernel doesn't switch to some other calibration
method too. Not sure where to hook into that though, without bleeding TDX and
SNP details everywhere.

> The warn is in the right direction but probably should say TSC still cannot be
> trusted 100%.

Heh, I didn't mean to include the printks, they were purely for my own
debugging.

> It looks all wrong if a function called native_* sprinkles a bunch of "am I a
> guest" checks. I guess we should split it into VM and native variants.

I agree the naming is weird, but outside of the vendor checks, the VM code is
identical to the "native" code, so I don't know that it's worth splitting into
multiple functions.

What if we simply rename it to calibrate_tsc_from_cpuid()?

> But yeah, the general direction is ok once we agree on what we do when
> exactly.
On Thu, Jan 16, 2025 at 08:56:25AM -0800, Sean Christopherson wrote:
> It's only with SNP and TDX that the clocksource becomes at all interesting.

So basically you're saying, let's just go ahead and trust the TSC when the HV
sets a bunch of CPUID bits.

But we really really trust it when the guest type is SNP+STSC or TDX since
there the HV is out of the picture and the only one who can flub it there is
the OEM.

> CPUID 0x15 (and 0x16?) is guaranteed to be available under TDX, and Secure TSC
> would ideally assert that the kernel doesn't switch to some other calibration
> method too. Not sure where to hook into that though, without bleeding TDX and
> SNP details everywhere.

We could use the platform calibrate* function pointers and assign TDX- or
SNP-specific ones and perhaps even define new such function ptrs. That's what
the platform stuff is for... needs staring, ofc.

> I agree the naming is weird, but outside of the vendor checks, the VM code is
> identical to the "native" code, so I don't know that it's worth splitting into
> multiple functions.
>
> What if we simply rename it to calibrate_tsc_from_cpuid()?
This is all wrong layering with all those different guest types having their
own ->calibrate_tsc:

arch/x86/kernel/cpu/acrn.c:32:      x86_platform.calibrate_tsc = acrn_get_tsc_khz;
arch/x86/kernel/cpu/mshyperv.c:424: x86_platform.calibrate_tsc = hv_get_tsc_khz;
arch/x86/kernel/cpu/vmware.c:419:   x86_platform.calibrate_tsc = vmware_get_tsc_khz;
arch/x86/kernel/jailhouse.c:213:    x86_platform.calibrate_tsc = jailhouse_get_tsc;
arch/x86/kernel/kvmclock.c:323:     x86_platform.calibrate_tsc = kvm_get_tsc_khz;
arch/x86/kernel/tsc.c:944:          tsc_khz = x86_platform.calibrate_tsc();
arch/x86/kernel/tsc.c:1458:         tsc_khz = x86_platform.calibrate_tsc();
arch/x86/kernel/x86_init.c:148:     .calibrate_tsc = native_calibrate_tsc,
arch/x86/xen/time.c:569:            x86_platform.calibrate_tsc = xen_tsc_khz;

What you want sounds like a redesign to me considering how you want to keep
the KVM guest code and baremetal pretty close... Hmmm, needs staring...

Thx.
On Fri, Jan 17, 2025, Borislav Petkov wrote:
> On Thu, Jan 16, 2025 at 08:56:25AM -0800, Sean Christopherson wrote:
> > It's only with SNP and TDX that the clocksource becomes at all interesting.
>
> So basically you're saying, let's just go ahead and trust the TSC when the HV
> sets a bunch of CPUID bits.

Sort of. It's not a trust thing though. The Xen, KVM, and VMware PV clocks are
all based on TSC, i.e. we already "trust" the hypervisor to not muck with TSC.
The purpose of such PV clocks is to account for things in software that aren't
handled in hardware, e.g. to provide a constant counter on hardware without a
constant TSC.

The proposal here, and what kvmclock already does for clocksource, is to use
the raw TSC when the hypervisor sets bits that effectively say that the
massaging of TSC done by the PV clock isn't needed.

> But we really really trust it when the guest type is SNP+STSC or TDX since
> there the HV is out of the picture and the only one who can flub it there is
> the OEM.

Yep. This one _is_ about trust. Specifically, we trust the raw TSC more than
any clock that is provided by the HV.

> > CPUID 0x15 (and 0x16?) is guaranteed to be available under TDX, and Secure TSC
> > would ideally assert that the kernel doesn't switch to some other calibration
> > method too. Not sure where to hook into that though, without bleeding TDX and
> > SNP details everywhere.
>
> We could use the platform calibrate* function pointers and assign TDX- or
> SNP-specific ones and perhaps even define new such function ptrs. That's what
> the platform stuff is for... needs staring, ofc.
>
> > I agree the naming is weird, but outside of the vendor checks, the VM code is
> > identical to the "native" code, so I don't know that it's worth splitting into
> > multiple functions.
> >
> > What if we simply rename it to calibrate_tsc_from_cpuid()?
>
> This is all wrong layering with all those different guest types having their
> own ->calibrate_tsc:
>
> arch/x86/kernel/cpu/acrn.c:32:      x86_platform.calibrate_tsc = acrn_get_tsc_khz;
> arch/x86/kernel/cpu/mshyperv.c:424: x86_platform.calibrate_tsc = hv_get_tsc_khz;
> arch/x86/kernel/cpu/vmware.c:419:   x86_platform.calibrate_tsc = vmware_get_tsc_khz;
> arch/x86/kernel/jailhouse.c:213:    x86_platform.calibrate_tsc = jailhouse_get_tsc;
> arch/x86/kernel/kvmclock.c:323:     x86_platform.calibrate_tsc = kvm_get_tsc_khz;
> arch/x86/kernel/tsc.c:944:          tsc_khz = x86_platform.calibrate_tsc();
> arch/x86/kernel/tsc.c:1458:         tsc_khz = x86_platform.calibrate_tsc();
> arch/x86/kernel/x86_init.c:148:     .calibrate_tsc = native_calibrate_tsc,
> arch/x86/xen/time.c:569:            x86_platform.calibrate_tsc = xen_tsc_khz;
>
> What you want sounds like a redesign to me considering how you want to keep
> the KVM guest code and baremetal pretty close... Hmmm, needs staring...

It's not KVM guest code though. The CPUID stuff is Intel's architecturally
defined behavior. There are oodles and oodles of features that are transparently
emulated by the hypervisor according to hardware specifications. Generally
speaking, the kernel treats those as "native", e.g. native_wrmsrl(),
native_cpuid(), etc.

What I am proposing is that, for TDX especially, instead of relying on the
hypervisor to use a paravirtual channel for communicating the TSC frequency,
we rely on the hardware-defined way of getting the frequency, because CPUID is
emulated by the trusted entity, i.e. the OEM.

Hmm, though I suppose I'm arguing against myself in that case. If the
hypervisor provides the frequency and there are no trust issues, why would we
care if the kernel gets the frequency via CPUID or the PV channel. It's really
only TDX that matters. And we could handle TDX by overriding .calibrate_tsc()
in tsc_init(), same as Secure TSC.

That said, I do think it makes sense to either override the vendor and F/M/S
checks in native_calibrate_tsc().
Or even better, drop the initial vendor check entirely, because both Intel and
AMD have a rich history of implementing each other's CPUID leaves. I.e. I see
no reason to ignore CPUID 0x15 just because the CPU isn't Intel.

As for the Goldmont F/M/S check, that one is a virtualization specific thing.
The argument is that when running as a guest, any non-TSC clocksource is going
to be emulated by the hypervisor, and therefore is going to be less reliable
than TSC. I.e. putting a watchdog on TSC does more harm than good, because
what ends up happening is the TSC gets marked unreliable because the
*watchdog* is unreliable.
On 1/16/2025 3:07 AM, Sean Christopherson wrote:
> My strong vote is prefer TSC over kvmclock for sched_clock if the TSC is constant,
> nonstop, and not marked stable via command line. I.e. use the same criteria as
> tweaking the clocksource rating. As above, sched_clock is more tolerant of slop
> than clocksource, so it's a bit ridiculous to care whether the TSC or kvmclock
> (or something else entirely) is used for the clocksource.
>
> If we wanted to go with a more conservative approach, e.g. to minimize the risk
> of breaking existing setups, we could also condition the change on the TSC being
> reliable and having a known frequency. I.e. require SNP's Secure TSC, or require
> the hypervisor to enumerate the TSC frequency via CPUID. I don't see a ton of
> value in that approach though, and long-term it would lead to some truly weird
> code due to holding sched_clock to a higher standard than clocksource.
>
> But wait, there's more! Because TDX doesn't override .calibrate_tsc() or
> .calibrate_cpu(), even though TDX provides a trusted TSC *and* enumerates the
> frequency of the TSC, unless I'm missing something, tsc_early_init() will compute
> the TSC frequency using the information provided by KVM, i.e. the untrusted host.
>
> The "obvious" solution is to leave the calibration functions as-is if the TSC has
> a known, reliable frequency, but even _that_ is riddled with problems, because
> as-is, the kernel sets TSC_KNOWN_FREQ and TSC_RELIABLE in tsc_early_init(), which
> runs *after* init_hypervisor_platform(). SNP Secure TSC fudges around this by
> overriding the calibration routines, but that's a bit gross and easy to fix if we
> also fix TDX. And fixing TDX by running native_calibrate_tsc() would give the
> same love to setups where the hypervisor provides CPUID 0x15 and/or 0x16.

One change that I wasn't sure about was non-Intel guests using CPUID 0x15H/0x16H,
which you have already answered in the subsequent emails.
At present, the AMD platform does not implement either of them and will bail
out from the below check:

	/* CPUID 15H TSC/Crystal ratio, plus optionally Crystal Hz */
	cpuid(CPUID_LEAF_TSC, &eax_denominator, &ebx_numerator, &ecx_hz, &edx);

	if (ebx_numerator == 0 || eax_denominator == 0)
		return 0;

> All in all, I'm thinking something like this (across multiple patches):

Tested on AMD Milan and the changes are working as intended:

For SecureTSC guests with TSC_INVARIANT set:
 * Raw TSC is used as clocksource and sched-clock
 * Calibration is done using the GUEST_TSC_FREQ MSR

For non-SecureTSC guests with TSC_INVARIANT set:
 * Raw TSC is used as clocksource and sched-clock
 * Calibration is done using kvm-clock

For non-SecureTSC guests without TSC_INVARIANT:
 * kvm-clock (based on TSC) is used as clocksource and sched-clock
 * Calibration is done using kvm-clock

> diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
> index 0864b314c26a..9baffb425386 100644
> --- a/arch/x86/kernel/tsc.c
> +++ b/arch/x86/kernel/tsc.c
> @@ -663,7 +663,12 @@ unsigned long native_calibrate_tsc(void)
>  	unsigned int eax_denominator, ebx_numerator, ecx_hz, edx;
>  	unsigned int crystal_khz;
>
> -	if (boot_cpu_data.x86_vendor != X86_VENDOR_INTEL)
> +	/*
> +	 * Ignore the vendor when running as a VM, if the hypervisor provides
> +	 * garbage CPUID information then the vendor is also suspect.
> +	 */
> +	if (boot_cpu_data.x86_vendor != X86_VENDOR_INTEL &&
> +	    !boot_cpu_has(X86_FEATURE_HYPERVISOR))
>  		return 0;
>
>  	if (boot_cpu_data.cpuid_level < 0x15)

Regards
Nikunj
On Fri, Jan 17, 2025 at 12:59:37PM -0800, Sean Christopherson wrote:
> The proposal here, and what kvmclock already does for clocksource, is to use the
> raw TSC when the hypervisor sets bits that effectively say that the massaging of
> TSC done by the PV clock isn't needed.
>
> > But we really really trust it when the guest type is SNP+STSC or TDX since
> > there the HV is out of the picture and the only one who can flub it there is
> > the OEM.
>
> Yep. This one _is_ about trust. Specifically, we trust the raw TSC more than
> any clock that is provided by the HV.

Ok, makes sense.

> It's not KVM guest code though. The CPUID stuff is Intel's architecturally
> defined behavior. There are oodles and oodles of features that are transparently
> emulated by the hypervisor according to hardware specifications. Generally
> speaking, the kernel treats those as "native", e.g. native_wrmsrl(), native_cpuid(),
> etc.

Uuuh, this is calling for confusion. native_* has always been stuff the kernel
calls when it runs, well, natively, on baremetal. And not some PV or
whatever-enlightened ops.

Now, if you want to emulate those transparently that's fine but this is what
native means in arch/x86/. Unless I've missed some memo but I doubt it. And I
asked around :)

> What I am proposing is that, for TDX especially, instead of relying on the
> hypervisor to use a paravirtual channel for communicating the TSC frequency,
> we rely on the hardware-defined way of getting the frequency, because CPUID
> is emulated by the trusted entity, i.e. the OEM.

I wouldn't trust the OEM with a single bit but that's a different story. :-P

> Hmm, though I suppose I'm arguing against myself in that case. If the
> hypervisor provides the frequency and there are no trust issues, why would
> we care if the kernel gets the frequency via CPUID or the PV channel. It's
> really only TDX that matters. And we could handle TDX by overriding
> .calibrate_tsc() in tsc_init(), same as Secure TSC.
I guess it all boils down to the trust level you want to have in the TSC.
I don't know whether there's even a HV which tries to lie to the guest about
the TSC and I guess we'll find out eventually, but for now we could treat the
two things similarly. The two things being:

- TDX and STSC SNP guests - full TSC trust

- HV with the required CPUID bits set - almost full, we assume the HV is
  benevolent. Could change if we notice something.

> That said, I do think it makes sense to either override the vendor and F/M/S
> checks native_calibrate_tsc(). Or even better drop the initial vendor check
> entirely,

I have no clue why that thing is even there:

  aa297292d708 ("x86/tsc: Enumerate SKL cpu_khz and tsc_khz via CPUID")

I guess back then it meant that only Intel sports those new CPUID leafs but
those things do change.

> because both Intel and AMD have a rich history of implementing each other's
> CPUID leaves. I.e. I see no reason to ignore CPUID 0x15 just because the
> CPU isn't Intel.
>
> As for the Goldmont F/M/S check, that one is a virtualization specific
> thing. The argument is that when running as a guest, any non-TSC
> clocksource is going to be emulated by the hypervisor, and therefore is
> going to be less reliable than TSC. I.e. putting a watchdog on TSC does
> more harm than good, because what ends up happening is the TSC gets marked
> unreliable because the *watchdog* is unreliable.

So I think the best thing to do is to carve out the meat of that function
which is valid for both baremetal and virtualized, and then have separate
helpers which call the common thing. So that you don't have to test on
*everything* when having to change it in the future for whatever reason, and
so that both camps can be relatively free when implementing their
idiosyncrasies.

And then it should fit in the current scheme with platform function ptrs.
I know you want to use native_calibrate_tsc() but then you have to sprinkle
"am I a guest" checks, and this reminds me of the "if XEN" fiasco back then.

Also, a really nasty lying HV won't simply set X86_FEATURE_HYPERVISOR...

So I'd prefer if we fit a guest which runs on a relatively honest HV :) into
the current scheme as the kernel running simply on yet another platform, even
at the price of some small code duplication.

Thx.
Sean Christopherson <seanjc@google.com> writes:

> diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
> index 0864b314c26a..9baffb425386 100644
> --- a/arch/x86/kernel/tsc.c
> +++ b/arch/x86/kernel/tsc.c
> @@ -663,7 +663,12 @@ unsigned long native_calibrate_tsc(void)
>  	unsigned int eax_denominator, ebx_numerator, ecx_hz, edx;
>  	unsigned int crystal_khz;
>
> -	if (boot_cpu_data.x86_vendor != X86_VENDOR_INTEL)
> +	/*
> +	 * Ignore the vendor when running as a VM, if the hypervisor provides
> +	 * garbage CPUID information then the vendor is also suspect.
> +	 */
> +	if (boot_cpu_data.x86_vendor != X86_VENDOR_INTEL &&
> +	    !boot_cpu_has(X86_FEATURE_HYPERVISOR))
>  		return 0;
>
>  	if (boot_cpu_data.cpuid_level < 0x15)
> @@ -713,10 +718,13 @@ unsigned long native_calibrate_tsc(void)
>  		return 0;
>
>  	/*
> -	 * For Atom SoCs TSC is the only reliable clocksource.
> -	 * Mark TSC reliable so no watchdog on it.
> +	 * For Atom SoCs TSC is the only reliable clocksource. Similarly, in a
> +	 * VM, any watchdog is going to be less reliable than the TSC as the
> +	 * watchdog source will be emulated in software. In both cases, mark
> +	 * the TSC reliable so that no watchdog runs on it.
>  	 */
> -	if (boot_cpu_data.x86_vfm == INTEL_ATOM_GOLDMONT)
> +	if (boot_cpu_has(X86_FEATURE_HYPERVISOR) ||
> +	    boot_cpu_data.x86_vfm == INTEL_ATOM_GOLDMONT)
>  		setup_force_cpu_cap(X86_FEATURE_TSC_RELIABLE);

One more point here: for AMD guests, the TSC will not be marked reliable as
per the above change; it will only be effective for CPUs supporting CPUID
15H/16H, although the watchdog should be stopped for AMD guests as well. Will
it make sense to move this before the cpuid_level check?
diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
index e7abcc4d02c3..2769d1598c0d 100644
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -672,6 +672,14 @@ unsigned long native_calibrate_tsc(void)
 	    !boot_cpu_has(X86_FEATURE_HYPERVISOR))
 		return 0;
 
+	/*
+	 * In a VM, any watchdog is going to be less reliable than the TSC as
+	 * the watchdog source will be emulated in software. Mark the TSC
+	 * reliable so that no watchdog runs on it.
+	 */
+	if (boot_cpu_has(X86_FEATURE_HYPERVISOR))
+		setup_force_cpu_cap(X86_FEATURE_TSC_RELIABLE);
+
 	if (boot_cpu_data.cpuid_level < CPUID_LEAF_TSC)
 		return 0;
 
@@ -719,13 +727,10 @@ unsigned long native_calibrate_tsc(void)
 		return 0;
 
 	/*
-	 * For Atom SoCs TSC is the only reliable clocksource. Similarly, in a
-	 * VM, any watchdog is going to be less reliable than the TSC as the
-	 * watchdog source will be emulated in software. In both cases, mark
-	 * the TSC reliable so that no watchdog runs on it.
+	 * For Atom SoCs TSC is the only reliable clocksource.
+	 * Mark TSC reliable so no watchdog on it.
 	 */
-	if (boot_cpu_has(X86_FEATURE_HYPERVISOR) ||
-	    boot_cpu_data.x86_vfm == INTEL_ATOM_GOLDMONT)
+	if (boot_cpu_data.x86_vfm == INTEL_ATOM_GOLDMONT)
 		setup_force_cpu_cap(X86_FEATURE_TSC_RELIABLE);
 
 #ifdef CONFIG_X86_LOCAL_APIC

Regards
Nikunj
diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
index 88d8bfceea04..fe7a0b1b7cfd 100644
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -291,12 +291,26 @@ static void __init upgrade_clock_rating(struct clocksource *tsc_early,
 		tsc->rating = 450;
 	}
 }
+
+static void enable_native_sc_work(struct work_struct *work)
+{
+	pr_info("Using native sched clock\n");
+	paravirt_set_sched_clock(native_sched_clock);
+}
+static DECLARE_DELAYED_WORK(enable_native_sc, enable_native_sc_work);
+
+static void enable_native_sched_clock(void)
+{
+	if (!using_native_sched_clock())
+		schedule_delayed_work(&enable_native_sc, 0);
+}
 #else
 u64 sched_clock_noinstr(void) __attribute__((alias("native_sched_clock")));
 
 bool using_native_sched_clock(void) { return true; }
 
 static void __init upgrade_clock_rating(struct clocksource *tsc_early,
 					struct clocksource *tsc) { }
+static void enable_native_sched_clock(void) { }
 #endif
 
 notrace u64 sched_clock(void)
@@ -1176,6 +1190,10 @@ static void tsc_cs_tick_stable(struct clocksource *cs)
 static int tsc_cs_enable(struct clocksource *cs)
 {
 	vclocks_set_used(VDSO_CLOCKMODE_TSC);
+
+	/* Restore native_sched_clock() when switching to TSC */
+	enable_native_sched_clock();
+
 	return 0;
 }
Although the kernel switches over to the stable TSC clocksource instead of a
PV clocksource, the scheduler still keeps on using PV clocks as the sched
clock source. This is because KVM, Xen and VMware switch the paravirt sched
clock handler in their init routines. Hyper-V is the only PV clock source that
checks if the platform provides an invariant TSC and does not switch to the PV
sched clock.

When switching back to the stable TSC, restore the scheduler clock to
native_sched_clock(). As the clock selection happens in the stop machine
context, schedule delayed work to update the static_call().

Cc: Alexey Makhalov <alexey.makhalov@broadcom.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Signed-off-by: Nikunj A Dadhania <nikunj@amd.com>
---
 arch/x86/kernel/tsc.c | 18 ++++++++++++++++++
 1 file changed, 18 insertions(+)