diff mbox series

VMX Preemption Timer appears to be buggy on SKX, CLX, and ICX

Message ID Zn9X0yFxZi_Mrlnt@google.com (mailing list archive)
State New
Headers show
Series VMX Preemption Timer appears to be buggy on SKX, CLX, and ICX | expand

Commit Message

Sean Christopherson June 29, 2024, 12:39 a.m. UTC
Forking this off to try and avoid confusion...

On Wed, Jun 12, 2024, Reinette Chatre wrote:
> +/*
> + * Possible TDCR values with matching divide count. Used to modify APIC
> + * timer frequency.
> + */
> +static struct {
> +	uint32_t tdcr;
> +	uint32_t divide_count;
> +} tdcrs[] = {
> +	{0x0, 2},
> +	{0x1, 4},
> +	{0x2, 8},
> +	{0x3, 16},
> +	{0x8, 32},
> +	{0x9, 64},
> +	{0xa, 128},
> +	{0xb, 1},
> +};

...

> +static void apic_guest_code(void)
> +{
> +	uint64_t tsc_hz = (uint64_t)tsc_khz * 1000;
> +	const uint32_t tmict = ~0u;
> +	uint64_t tsc0, tsc1, freq;
> +	uint32_t tmcct;
> +	int i;
> +
> +	apic_enable();
> +
> +	/*
> +	 * Setup one-shot timer.  The vector does not matter because the
> +	 * interrupt should not fire.
> +	 */
> +	apic_write_reg(APIC_LVTT, APIC_LVT_TIMER_ONESHOT | APIC_LVT_MASKED);
> +
> +	for (i = 0; i < ARRAY_SIZE(tdcrs); i++) {
> +
> +		apic_write_reg(APIC_TDCR, tdcrs[i].tdcr);
> +		apic_write_reg(APIC_TMICT, tmict);
> +
> +		tsc0 = rdtsc();
> +		udelay(delay_ms * 1000);
> +		tmcct = apic_read_reg(APIC_TMCCT);
> +		tsc1 = rdtsc();
> +
> +		/*
> +		 * Stop the timer _after_ reading the current, final count, as
> +		 * writing the initial counter also modifies the current count.
> +		 */
> +		apic_write_reg(APIC_TMICT, 0);
> +
> +		freq = (tmict - tmcct) * tdcrs[i].divide_count * tsc_hz / (tsc1 - tsc0);
> +		/* Check if measured frequency is within 1% of configured frequency. */
> +		GUEST_ASSERT(freq < apic_hz * 101 / 100);
> +		GUEST_ASSERT(freq > apic_hz * 99 / 100);
> +	}

This test fails on our SKX, CLX, and ICX systems due to what appears to be a CPU
bug.  It looks like something APICv related is clobbering internal VMX timer state?
Or maybe there's a tearing or truncation issue?

As mentioned ad nauseum at this point, I'm offline all of next week, so hopefully
there's enough info here to get a root cause...


A spurious VM-Exit will occur after programming a vmcs.PREEMPTION_TIMER_VALUE that
shouldn't exit.  Every observed failure occurs when bits 27:16 are zero, with a
small value in bits 15:0, e.g. VM-Enter with a timer value of 0xe0003bf7 or
0xa0006db6 will cause a near-immediate VM-Exit.

Weirdly, it doesn't always happen, e.g. I've observed rollover from 0xa000xxxx
to 0x9fffxxxx without failure.  However, the *test* failure is 100% reproducible,
i.e. it's not _that_ rare of a failure.  So maybe there's state build-up required?
E.g. in the failing cases, there are 10s of entries with a slightly larger timer
value, with no preemption timer exit (the host's tick IRQ interrupts the guest,
and then KVM reprograms the VMX timer).

Even more sketchy, the failure only occurs if APICv is enabled.  Turning off APICv
makes the problem go away (I initially suspected KVM was somehow botching the TMCCT
emulation).  I am 99.9% certain there is no KVM APICv bug that is causing problems.
Our IVB servers don't fail (even with APICv enabled), nor does my Raptor Lake client
box (with APICv enabled).  And I confirmed that the VMX timer is still getting
programmed with the same sequence that fails when APICv is enabled.

(Before I realized the pattern of values), I sanity checked the VMCS field just
before VM-Enter, and again after VM-Exit (KVM runs without the CPU save
vmcs.PREEMPTION_TIMER_VALUE on exit).

I also verified the CPU thinks the timer has expired by enabling "save on exit"
and verifying vmcs.PREEMPTION_TIMER_VALUE is indeed '0', and that KVM really did
get a PREEMPTION_TIMER exit.

Attached is the debug patch I used to get the below data (sort of; one of the
post-exit prints is without saving vmcs.PREEMPTION_TIMER_VALUE on exit).

In kvm_hypercall (ignore the name, I piggybacked a tracepoint because trying to
log to dmesg was too slow, guest literally couldn't make forward progress), a1 is
the VMX timer value programmed by KVM (0xe0003bf7).
 
 apic_bus_clock_-11419   [056] d..2.   146.771179: kvm_hypercall: nr 0x2c8a9e9cc6703c a0 0x2c8b0e9ce46c37 a1 0xe0003bf7 a2 0xe0003bf7 a3 0x7  
 apic_bus_clock_-11419   [056] d..2.   146.771242: kvm_exit: vcpu 0 reason PREEMPTION_TIMER rip 0x402065 info1 0x0000000000000000 info2 0x0000000000000000 intr_info 0x00000000 error_code 0x00000000
 VMX timer exit, VMCS = e0003bf7, delta = e0003690

and the post-exit print with the attached patch:

  kvm_intel: VMX timer exit, EXIT_REASON = 34, VMCS = 0, delta = e00037e9

And I've been fiddling with the below hack to coerce KVM into programming VMX
timer values.  Had I more time, I would have booted kernels with the ability to
properly fuzz the timer values.

Note, with TMICT=-1, only divide_count=1 will fail, because larger divide counts
result in a timeout that doesn't fit into the 32-bit VMX timer field (don't ask
me how long it took me to realize the divide count affects the time frequency,
not the actual count, *sigh*).
diff mbox series

Patch

From 42da584d5e4e2dfaf4296f2ea666f3a04c82052f Mon Sep 17 00:00:00 2001
From: Sean Christopherson <seanjc@google.com>
Date: Fri, 28 Jun 2024 16:33:32 -0700
Subject: [PATCH] debug

---
 arch/x86/kvm/vmx/vmx.c | 33 +++++++++++++++++++++++++++++----
 arch/x86/kvm/vmx/vmx.h |  1 +
 arch/x86/kvm/x86.c     |  1 +
 3 files changed, 31 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index f18c2d8c7476..e83351c690d9 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -4431,8 +4431,7 @@  static u32 vmx_vmexit_ctrl(void)
 	 * Not used by KVM and never set in vmcs01 or vmcs02, but emulated for
 	 * nested virtualization and thus allowed to be set in vmcs12.
 	 */
-	vmexit_ctrl &= ~(VM_EXIT_SAVE_IA32_PAT | VM_EXIT_SAVE_IA32_EFER |
-			 VM_EXIT_SAVE_VMX_PREEMPTION_TIMER);
+	vmexit_ctrl &= ~(VM_EXIT_SAVE_IA32_PAT | VM_EXIT_SAVE_IA32_EFER);
 
 	if (vmx_pt_mode_is_system())
 		vmexit_ctrl &= ~(VM_EXIT_PT_CONCEAL_PIP |
@@ -5997,6 +5996,8 @@  static fastpath_t handle_fastpath_preemption_timer(struct kvm_vcpu *vcpu,
 						   bool force_immediate_exit)
 {
 	struct vcpu_vmx *vmx = to_vmx(vcpu);
+	u32 delta_tsc;
+	u64 tscl;
 
 	/*
 	 * In the *extremely* unlikely scenario that this is a spurious VM-Exit
@@ -6020,6 +6021,16 @@  static fastpath_t handle_fastpath_preemption_timer(struct kvm_vcpu *vcpu,
 	if (is_guest_mode(vcpu))
 		return EXIT_FASTPATH_NONE;
 
+	tscl = rdtsc();
+	if (vmx->hv_deadline_tsc > tscl)
+		delta_tsc = (u32)((vmx->hv_deadline_tsc - tscl) >> cpu_preemption_timer_multi);
+	else
+		delta_tsc = 0;
+
+	pr_warn("VMX timer exit, EXIT_REASON = %x, VMCS = %x, delta = %x\n",
+		vmcs_read32(VM_EXIT_REASON),
+		vmcs_read32(VMX_PREEMPTION_TIMER_VALUE), delta_tsc);
+
 	kvm_lapic_expired_hv_timer(vcpu);
 	return EXIT_FASTPATH_REENTER_GUEST;
 }
@@ -7197,6 +7208,8 @@  static void vmx_update_hv_timer(struct kvm_vcpu *vcpu, bool force_immediate_exit
 	u32 delta_tsc;
 
 	if (force_immediate_exit) {
+		trace_kvm_hypercall(0, 0, 0, 0, 0);
+		vmx->preemption_timer = 0;
 		vmcs_write32(VMX_PREEMPTION_TIMER_VALUE, 0);
 		vmx->loaded_vmcs->hv_timer_soft_disabled = false;
 	} else if (vmx->hv_deadline_tsc != -1) {
@@ -7208,9 +7221,14 @@  static void vmx_update_hv_timer(struct kvm_vcpu *vcpu, bool force_immediate_exit
 		else
 			delta_tsc = 0;
 
+		trace_kvm_hypercall(tscl, vmx->hv_deadline_tsc, delta_tsc,
+				    ((vmx->hv_deadline_tsc - tscl) >> cpu_preemption_timer_multi),
+				    cpu_preemption_timer_multi);
+		vmx->preemption_timer = delta_tsc;
 		vmcs_write32(VMX_PREEMPTION_TIMER_VALUE, delta_tsc);
 		vmx->loaded_vmcs->hv_timer_soft_disabled = false;
 	} else if (!vmx->loaded_vmcs->hv_timer_soft_disabled) {
+		trace_kvm_hypercall(-1, -1, -1, -1, -1);
 		vmcs_write32(VMX_PREEMPTION_TIMER_VALUE, -1);
 		vmx->loaded_vmcs->hv_timer_soft_disabled = true;
 	}
@@ -7218,6 +7236,8 @@  static void vmx_update_hv_timer(struct kvm_vcpu *vcpu, bool force_immediate_exit
 
 void noinstr vmx_update_host_rsp(struct vcpu_vmx *vmx, unsigned long host_rsp)
 {
+	WARN_ON(!vmx->loaded_vmcs->hv_timer_soft_disabled &&
+		vmcs_read32(VMX_PREEMPTION_TIMER_VALUE) != vmx->preemption_timer);
 	if (unlikely(host_rsp != vmx->loaded_vmcs->host_state.rsp)) {
 		vmx->loaded_vmcs->host_state.rsp = host_rsp;
 		vmcs_writel(HOST_RSP, host_rsp);
@@ -8128,7 +8148,7 @@  int vmx_set_hv_timer(struct kvm_vcpu *vcpu, u64 guest_deadline_tsc,
 	    delta_tsc && u64_shl_div_u64(delta_tsc,
 				kvm_caps.tsc_scaling_ratio_frac_bits,
 				vcpu->arch.l1_tsc_scaling_ratio, &delta_tsc))
-		return -ERANGE;
+		goto out_of_range;
 
 	/*
 	 * If the delta tsc can't fit in the 32 bit after the multi shift,
@@ -8137,11 +8157,16 @@  int vmx_set_hv_timer(struct kvm_vcpu *vcpu, u64 guest_deadline_tsc,
 	 * on every vmentry is costly so we just use an hrtimer.
 	 */
 	if (delta_tsc >> (cpu_preemption_timer_multi + 32))
-		return -ERANGE;
+		goto out_of_range;
 
+	trace_kvm_hypercall(tscl, vmx->hv_deadline_tsc, delta_tsc, -1, -1);
 	vmx->hv_deadline_tsc = tscl + delta_tsc;
 	*expired = !delta_tsc;
 	return 0;
+
+out_of_range:
+	vmx->hv_deadline_tsc = -1;
+	return -ERANGE;
 }
 
 void vmx_cancel_hv_timer(struct kvm_vcpu *vcpu)
diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h
index 42498fa63abb..ecafbb11519d 100644
--- a/arch/x86/kvm/vmx/vmx.h
+++ b/arch/x86/kvm/vmx/vmx.h
@@ -341,6 +341,7 @@  struct vcpu_vmx {
 
 	/* apic deadline value in host tsc */
 	u64 hv_deadline_tsc;
+	u32 preemption_timer;
 
 	unsigned long host_debugctlmsr;
 
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 994743266480..00847259bcc4 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -14024,6 +14024,7 @@  EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_nested_intercepts);
 EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_write_tsc_offset);
 EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_ple_window_update);
 EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_pml_full);
+EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_hypercall);
 EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_pi_irte_update);
 EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_avic_unaccelerated_access);
 EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_avic_incomplete_ipi);

base-commit: 128c71b7f489d8115d29a487367c648f8acc8374
-- 
2.45.2.803.g4e1b14247a-goog