
[1/2] KVM: x86: Synthesize at most one PMI per VM-exit

Message ID 20230925173448.3518223-2-mizhang@google.com (mailing list archive)
State New, archived
Series [1/2] KVM: x86: Synthesize at most one PMI per VM-exit

Commit Message

Mingwei Zhang Sept. 25, 2023, 5:34 p.m. UTC
From: Jim Mattson <jmattson@google.com>

When the irq_work callback, kvm_pmi_trigger_fn(), is invoked during a
VM-exit that also invokes __kvm_perf_overflow() as a result of
instruction emulation, kvm_pmu_deliver_pmi() will be called twice
before the next VM-entry.

That shouldn't be a problem. The local APIC is supposed to
automatically set the mask flag in LVTPC when it handles a PMI, so the
second PMI should be inhibited. However, KVM's local APIC emulation
fails to set the mask flag in LVTPC when it handles a PMI, so two PMIs
are delivered via the local APIC. In the common case, where LVTPC is
configured to deliver an NMI, the first NMI is vectored through the
guest IDT, and the second one is held pending. When the NMI handler
returns, the second NMI is vectored through the IDT. For Linux guests,
this results in the "dazed and confused" spurious NMI message.

Though the obvious fix is to set the mask flag in LVTPC when handling
a PMI, KVM's logic around synthesizing a PMI is unnecessarily
convoluted.

Remove the irq_work callback for synthesizing a PMI, and all of the
logic for invoking it. Instead, to prevent a vcpu from leaving C0 with
a PMI pending, add a check for KVM_REQ_PMI to kvm_vcpu_has_events().

Fixes: 9cd803d496e7 ("KVM: x86: Update vPMCs when retiring instructions")
Signed-off-by: Jim Mattson <jmattson@google.com>
Tested-by: Mingwei Zhang <mizhang@google.com>
Tested-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
---
 arch/x86/include/asm/kvm_host.h |  1 -
 arch/x86/kvm/pmu.c              | 27 +--------------------------
 arch/x86/kvm/x86.c              |  3 +++
 3 files changed, 4 insertions(+), 27 deletions(-)
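
For context on the "obvious fix" mentioned in the commit message: the companion patch in this series has KVM set the LVTPC mask flag when the local APIC delivers a PMI (Sean refers to this below). A rough sketch of what that masking could look like, assuming it lands in kvm_apic_local_deliver() in arch/x86/kvm/lapic.c; the exact function and placement are assumptions, not taken from this patch:

static int kvm_apic_local_deliver(struct kvm_lapic *apic, int lvt_type)
{
	u32 reg = kvm_lapic_get_reg(apic, lvt_type);
	int vector, mode, trig_mode;
	int r;

	if (kvm_apic_hw_enabled(apic) && !(reg & APIC_LVT_MASKED)) {
		vector = reg & APIC_VECTOR_MASK;
		mode = reg & APIC_MODE_MASK;
		trig_mode = reg & APIC_LVT_LEVEL_TRIGGER;

		r = __apic_accept_irq(apic, mode, vector, 1, trig_mode, NULL);
		if (r && lvt_type == APIC_LVTPC)
			/*
			 * Sketch: emulate the architectural behavior of
			 * masking LVTPC once a PMI has been delivered, so a
			 * second PMI is inhibited until the guest's handler
			 * unmasks it.
			 */
			kvm_lapic_set_reg(apic, lvt_type, reg | APIC_LVT_MASKED);
		return r;
	}
	return 0;
}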

Comments

Sean Christopherson Sept. 25, 2023, 5:59 p.m. UTC | #1
On Mon, Sep 25, 2023, Mingwei Zhang wrote:
> From: Jim Mattson <jmattson@google.com>
> 
> When the irq_work callback, kvm_pmi_trigger_fn(), is invoked during a
> VM-exit that also invokes __kvm_perf_overflow() as a result of
> instruction emulation, kvm_pmu_deliver_pmi() will be called twice
> before the next VM-entry.
> 
> That shouldn't be a problem. The local APIC is supposed to
> automatically set the mask flag in LVTPC when it handles a PMI, so the
> second PMI should be inhibited. However, KVM's local APIC emulation
> fails to set the mask flag in LVTPC when it handles a PMI, so two PMIs
> are delivered via the local APIC. In the common case, where LVTPC is
> configured to deliver an NMI, the first NMI is vectored through the
> guest IDT, and the second one is held pending. When the NMI handler
> returns, the second NMI is vectored through the IDT. For Linux guests,
> this results in the "dazed and confused" spurious NMI message.
> 
> Though the obvious fix is to set the mask flag in LVTPC when handling
> a PMI, KVM's logic around synthesizing a PMI is unnecessarily
> convoluted.

Unless Jim outright objects, I strongly prefer placing this patch second, with
the above two paragraphs replaced with my suggestion (or something similar):

  Calling kvm_pmu_deliver_pmi() twice is unlikely to be problematic now that
  KVM sets the LVTPC mask bit when delivering a PMI.  But using IRQ work to
  trigger the PMI is still broken, albeit very theoretically.

  E.g. if the self-IPI to trigger IRQ work is delayed long enough for the
  vCPU to be migrated to a different pCPU, then it's possible for
  kvm_pmi_trigger_fn() to race with the kvm_pmu_deliver_pmi() from
  KVM_REQ_PMI and still generate two PMIs.

  KVM could set the mask bit using an atomic operation, but that'd just be
  piling on unnecessary code to work around what is effectively a hack.  The
  *only* reason KVM uses IRQ work is to ensure the PMI is treated as a wake
  event, e.g. if the vCPU just executed HLT.

I understand Jim's desire for the patch to be more obviously valuable, but the
people that need convincing are already convinced that the patch is worth taking.

> Remove the irq_work callback for synthesizing a PMI, and all of the
> logic for invoking it. Instead, to prevent a vcpu from leaving C0 with
> a PMI pending, add a check for KVM_REQ_PMI to kvm_vcpu_has_events().
> 
> Fixes: 9cd803d496e7 ("KVM: x86: Update vPMCs when retiring instructions")
> Signed-off-by: Jim Mattson <jmattson@google.com>
> Tested-by: Mingwei Zhang <mizhang@google.com>
> Tested-by: Dapeng Mi <dapeng1.mi@linux.intel.com>

Needs your SoB
Mingwei Zhang Sept. 25, 2023, 7:33 p.m. UTC | #2
On Mon, Sep 25, 2023 at 10:59 AM Sean Christopherson <seanjc@google.com> wrote:
>
> On Mon, Sep 25, 2023, Mingwei Zhang wrote:
> > From: Jim Mattson <jmattson@google.com>
> >
> > When the irq_work callback, kvm_pmi_trigger_fn(), is invoked during a
> > VM-exit that also invokes __kvm_perf_overflow() as a result of
> > instruction emulation, kvm_pmu_deliver_pmi() will be called twice
> > before the next VM-entry.
> >
> > That shouldn't be a problem. The local APIC is supposed to
> > automatically set the mask flag in LVTPC when it handles a PMI, so the
> > second PMI should be inhibited. However, KVM's local APIC emulation
> > fails to set the mask flag in LVTPC when it handles a PMI, so two PMIs
> > are delivered via the local APIC. In the common case, where LVTPC is
> > configured to deliver an NMI, the first NMI is vectored through the
> > guest IDT, and the second one is held pending. When the NMI handler
> > returns, the second NMI is vectored through the IDT. For Linux guests,
> > this results in the "dazed and confused" spurious NMI message.
> >
> > Though the obvious fix is to set the mask flag in LVTPC when handling
> > a PMI, KVM's logic around synthesizing a PMI is unnecessarily
> > convoluted.
>
> Unless Jim outright objects, I strongly prefer placing this patch second, with
> the above two paragraphs replaced with my suggestion (or something similar):
>
>   Calling kvm_pmu_deliver_pmi() twice is unlikely to be problematic now that
>   KVM sets the LVTPC mask bit when delivering a PMI.  But using IRQ work to
>   trigger the PMI is still broken, albeit very theoretically.
>
>   E.g. if the self-IPI to trigger IRQ work is delayed long enough for the
>   vCPU to be migrated to a different pCPU, then it's possible for
>   kvm_pmi_trigger_fn() to race with the kvm_pmu_deliver_pmi() from
>   KVM_REQ_PMI and still generate two PMIs.
>
>   KVM could set the mask bit using an atomic operation, but that'd just be
>   piling on unnecessary code to work around what is effectively a hack.  The
>   *only* reason KVM uses IRQ work is to ensure the PMI is treated as a wake
>   event, e.g. if the vCPU just executed HLT.
>
> I understand Jim's desire for the patch to be more obviously valuable, but the
> people that need convincing are already convinced that the patch is worth taking.
>
> > Remove the irq_work callback for synthesizing a PMI, and all of the
> > logic for invoking it. Instead, to prevent a vcpu from leaving C0 with
> > a PMI pending, add a check for KVM_REQ_PMI to kvm_vcpu_has_events().
> >
> > Fixes: 9cd803d496e7 ("KVM: x86: Update vPMCs when retiring instructions")
> > Signed-off-by: Jim Mattson <jmattson@google.com>
> > Tested-by: Mingwei Zhang <mizhang@google.com>
> > Tested-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
>
> Needs your SoB

Signed-off-by: Mingwei Zhang <mizhang@google.com>
Sean Christopherson Sept. 25, 2023, 9:28 p.m. UTC | #3
On Mon, Sep 25, 2023, Mingwei Zhang wrote:
> On Mon, Sep 25, 2023 at 10:59 AM Sean Christopherson <seanjc@google.com> wrote:
> >
> > On Mon, Sep 25, 2023, Mingwei Zhang wrote:
> > > From: Jim Mattson <jmattson@google.com>
> > >
> > > When the irq_work callback, kvm_pmi_trigger_fn(), is invoked during a
> > > VM-exit that also invokes __kvm_perf_overflow() as a result of
> > > instruction emulation, kvm_pmu_deliver_pmi() will be called twice
> > > before the next VM-entry.
> > >
> > > That shouldn't be a problem. The local APIC is supposed to
> > > automatically set the mask flag in LVTPC when it handles a PMI, so the
> > > second PMI should be inhibited. However, KVM's local APIC emulation
> > > fails to set the mask flag in LVTPC when it handles a PMI, so two PMIs
> > > are delivered via the local APIC. In the common case, where LVTPC is
> > > configured to deliver an NMI, the first NMI is vectored through the
> > > guest IDT, and the second one is held pending. When the NMI handler
> > > returns, the second NMI is vectored through the IDT. For Linux guests,
> > > this results in the "dazed and confused" spurious NMI message.
> > >
> > > Though the obvious fix is to set the mask flag in LVTPC when handling
> > > a PMI, KVM's logic around synthesizing a PMI is unnecessarily
> > > convoluted.
> >
> > Unless Jim outright objects, I strongly prefer placing this patch second, with
> > the above two paragraphs replaced with my suggestion (or something similar):
> >
> >   Calling kvm_pmu_deliver_pmi() twice is unlikely to be problematic now that
> >   KVM sets the LVTPC mask bit when delivering a PMI.  But using IRQ work to
> >   trigger the PMI is still broken, albeit very theoretically.
> >
> >   E.g. if the self-IPI to trigger IRQ work is delayed long enough for the
> >   vCPU to be migrated to a different pCPU, then it's possible for
> >   kvm_pmi_trigger_fn() to race with the kvm_pmu_deliver_pmi() from
> >   KVM_REQ_PMI and still generate two PMIs.
> >
> >   KVM could set the mask bit using an atomic operation, but that'd just be
> >   piling on unnecessary code to work around what is effectively a hack.  The
> >   *only* reason KVM uses IRQ work is to ensure the PMI is treated as a wake
> >   event, e.g. if the vCPU just executed HLT.
> >
> > I understand Jim's desire for the patch to be more obviously valuable, but the
> > people that need convincing are already convinced that the patch is worth taking.
> >
> > > Remove the irq_work callback for synthesizing a PMI, and all of the
> > > logic for invoking it. Instead, to prevent a vcpu from leaving C0 with
> > > a PMI pending, add a check for KVM_REQ_PMI to kvm_vcpu_has_events().
> > >
> > > Fixes: 9cd803d496e7 ("KVM: x86: Update vPMCs when retiring instructions")
> > > Signed-off-by: Jim Mattson <jmattson@google.com>
> > > Tested-by: Mingwei Zhang <mizhang@google.com>
> > > Tested-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
> >
> > Needs your SoB
> 
> Signed-off-by: Mingwei Zhang <mizhang@google.com>

Thanks!

Jim gave his blessing off-list for swapping the order, I'll do that and massage
the changelogs when applying, i.e. no need for a v3.

Patch

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 28bd38303d70..de951d6aa9a8 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -528,7 +528,6 @@  struct kvm_pmu {
 	u64 raw_event_mask;
 	struct kvm_pmc gp_counters[KVM_INTEL_PMC_MAX_GENERIC];
 	struct kvm_pmc fixed_counters[KVM_PMC_MAX_FIXED];
-	struct irq_work irq_work;
 
 	/*
 	 * Overlay the bitmap with a 64-bit atomic so that all bits can be
diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
index edb89b51b383..9ae07db6f0f6 100644
--- a/arch/x86/kvm/pmu.c
+++ b/arch/x86/kvm/pmu.c
@@ -93,14 +93,6 @@  void kvm_pmu_ops_update(const struct kvm_pmu_ops *pmu_ops)
 #undef __KVM_X86_PMU_OP
 }
 
-static void kvm_pmi_trigger_fn(struct irq_work *irq_work)
-{
-	struct kvm_pmu *pmu = container_of(irq_work, struct kvm_pmu, irq_work);
-	struct kvm_vcpu *vcpu = pmu_to_vcpu(pmu);
-
-	kvm_pmu_deliver_pmi(vcpu);
-}
-
 static inline void __kvm_perf_overflow(struct kvm_pmc *pmc, bool in_pmi)
 {
 	struct kvm_pmu *pmu = pmc_to_pmu(pmc);
@@ -124,20 +116,7 @@  static inline void __kvm_perf_overflow(struct kvm_pmc *pmc, bool in_pmi)
 		__set_bit(pmc->idx, (unsigned long *)&pmu->global_status);
 	}
 
-	if (!pmc->intr || skip_pmi)
-		return;
-
-	/*
-	 * Inject PMI. If vcpu was in a guest mode during NMI PMI
-	 * can be ejected on a guest mode re-entry. Otherwise we can't
-	 * be sure that vcpu wasn't executing hlt instruction at the
-	 * time of vmexit and is not going to re-enter guest mode until
-	 * woken up. So we should wake it, but this is impossible from
-	 * NMI context. Do it from irq work instead.
-	 */
-	if (in_pmi && !kvm_handling_nmi_from_guest(pmc->vcpu))
-		irq_work_queue(&pmc_to_pmu(pmc)->irq_work);
-	else
+	if (pmc->intr && !skip_pmi)
 		kvm_make_request(KVM_REQ_PMI, pmc->vcpu);
 }
 
@@ -675,9 +654,6 @@  void kvm_pmu_refresh(struct kvm_vcpu *vcpu)
 
 void kvm_pmu_reset(struct kvm_vcpu *vcpu)
 {
-	struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
-
-	irq_work_sync(&pmu->irq_work);
 	static_call(kvm_x86_pmu_reset)(vcpu);
 }
 
@@ -687,7 +663,6 @@  void kvm_pmu_init(struct kvm_vcpu *vcpu)
 
 	memset(pmu, 0, sizeof(*pmu));
 	static_call(kvm_x86_pmu_init)(vcpu);
-	init_irq_work(&pmu->irq_work, kvm_pmi_trigger_fn);
 	pmu->event_count = 0;
 	pmu->need_cleanup = false;
 	kvm_pmu_refresh(vcpu);
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index a6b9bea62fb8..6f24a8c1e136 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -12820,6 +12820,9 @@  static inline bool kvm_vcpu_has_events(struct kvm_vcpu *vcpu)
 		return true;
 #endif
 
+	if (kvm_test_request(KVM_REQ_PMI, vcpu))
+		return true;
+
 	if (kvm_arch_interrupt_allowed(vcpu) &&
 	    (kvm_cpu_has_interrupt(vcpu) ||
 	    kvm_guest_apic_has_interrupt(vcpu)))
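
One design note on the hunk above: kvm_test_request() only tests the request bit without consuming it, so KVM_REQ_PMI stays pending across the wake-up and is serviced on the next VM-entry. A minimal sketch of the intended interaction; the consumer shown is pre-existing KVM code as I understand it, not part of this diff:

	/* Wake-up check (this patch): peek only, do not clear the request. */
	if (kvm_test_request(KVM_REQ_PMI, vcpu))
		return true;

	/*
	 * Entry path (existing code in vcpu_enter_guest(), unchanged here):
	 * clear the request and deliver the PMI through the local APIC.
	 */
	if (kvm_check_request(KVM_REQ_PMI, vcpu))
		kvm_pmu_deliver_pmi(vcpu);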