
[3/3] KVM: x86: Decouple APICv activation state from apicv_inhibit_reasons

Message ID 405a98c2f21b9fe73eddbc35c80b60d6523db70c.1738595289.git.naveen@kernel.org (mailing list archive)
State New
Series KVM: x86: Address performance degradation due to APICv inhibits

Commit Message

Naveen N Rao Feb. 3, 2025, 5:03 p.m. UTC
apicv_inhibit_reasons is used to determine if APICv is active, and if
not, the reason(s) why it may be inhibited. In some scenarios, inhibit
reasons can be set and cleared often, resulting in increased contention
on apicv_update_lock used to guard updates to apicv_inhibit_reasons.

In particular, if a guest is using the PIT in reinject mode (the default)
and AVIC is enabled in the kvm_amd kernel module, we inhibit AVIC during
in-kernel PIT creation (APICV_INHIBIT_REASON_PIT_REINJ), resulting in KVM
emulating x2APIC for the guest. In that case, since AVIC is enabled in
the kvm_amd kernel module, KVM additionally inhibits AVIC to request an
IRQ window every time it has to inject external interrupts, resulting
in a barrage of inhibits being set and cleared. This shows significant
performance degradation compared to AVIC being disabled, due to high
contention on apicv_update_lock.

Though apicv_update_lock guards updates to apicv_inhibit_reasons, it is
only necessary when the APICv activation state changes. Introduce a
separate boolean, apicv_activated, to track whether APICv is active,
and limit use of apicv_update_lock to when APICv is being
(de)activated. Convert apicv_inhibit_reasons to an atomic and use
atomic operations to fetch/update it.

Signed-off-by: Naveen N Rao (AMD) <naveen@kernel.org>
---
 arch/x86/include/asm/kvm_host.h |   7 +-
 arch/x86/kvm/x86.c              | 116 +++++++++++++++++---------------
 2 files changed, 63 insertions(+), 60 deletions(-)
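
The archive view shows only the diffstat; as a rough userspace model of
the scheme the changelog describes (an atomic reason mask plus a
separate activation flag, so the lock is only needed when the activation
state actually flips), here is an illustrative sketch -- names are made
up, this is not the patch body:

    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stdio.h>

    static atomic_ulong inhibit_reasons;
    static bool apicv_activated = true;     /* guarded by apicv_update_lock */

    static void set_or_clear_inhibit(unsigned long bit, bool set)
    {
            unsigned long old = set ? atomic_fetch_or(&inhibit_reasons, bit)
                                    : atomic_fetch_and(&inhibit_reasons, ~bit);
            unsigned long new = set ? (old | bit) : (old & ~bit);

            /* Activation state unchanged: no need for apicv_update_lock. */
            if (!!old == !!new)
                    return;

            /* down_write(&apicv_update_lock) would go here. */
            apicv_activated = !new;
            /* up_write(&apicv_update_lock) */
    }

    int main(void)
    {
            set_or_clear_inhibit(1UL << 0, true);   /* first inhibit: deactivate */
            printf("activated=%d\n", apicv_activated);
            set_or_clear_inhibit(1UL << 0, false);  /* last inhibit cleared */
            printf("activated=%d\n", apicv_activated);
            return 0;
    }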

Comments

Sean Christopherson Feb. 3, 2025, 6:45 p.m. UTC | #1
On Mon, Feb 03, 2025, Naveen N Rao (AMD) wrote:
> apicv_inhibit_reasons is used to determine if APICv is active, and if
> not, the reason(s) why it may be inhibited. In some scenarios, inhibit
> reasons can be set and cleared often, resulting in increased contention
> on apicv_update_lock used to guard updates to apicv_inhibit_reasons.
> 
> In particular, if a guest is using the PIT in reinject mode (the default)
> and AVIC is enabled in the kvm_amd kernel module, we inhibit AVIC during
> in-kernel PIT creation (APICV_INHIBIT_REASON_PIT_REINJ), resulting in KVM
> emulating x2APIC for the guest. In that case, since AVIC is enabled in
> the kvm_amd kernel module, KVM additionally inhibits AVIC to request an
> IRQ window every time it has to inject external interrupts, resulting
> in a barrage of inhibits being set and cleared. This shows significant
> performance degradation compared to AVIC being disabled, due to high
> contention on apicv_update_lock.
> 
> Though apicv_update_lock guards updates to apicv_inhibit_reasons, it is
> only necessary when the APICv activation state changes. Introduce a
> separate boolean, apicv_activated, to track whether APICv is active,
> and limit use of apicv_update_lock to when APICv is being
> (de)activated. Convert apicv_inhibit_reasons to an atomic and use
> atomic operations to fetch/update it.

It's a moot point, but there is too much going on in this patch.  For a change
like this, it should be split into ~4 patches:

 1. Add a kvm_x86_ops.required_apicv_inhibits check outside of the lock,
    which we can and should do anyways.  I would probably vote to turn the one
    inside the lock into a WARN, as that "optimized" path is only an optimization
    if the inhibit applies to both SVM and VMX.
 2. Use a bool to track the activated state.
 3. Use an atomic.
 4. Implement the optimized updates.

#3 is optional, #1 and #2 are not.

It's a moot point though, because toggling APICv inhibits on IRQ windows is
laughably broken.  I'm guessing it "works" because the only time it triggers in
practice is in conjunction with PIT re-injection.

 1. vCPU0 waits for an IRQ window, APICV_INHIBIT_REASON_IRQWIN = 1
 2. vCPU1 waits for an IRQ window, APICV_INHIBIT_REASON_IRQWIN = 1
 3. vCPU1 gets its IRQ window, APICV_INHIBIT_REASON_IRQWIN = 0
 4. vCPU0 is sad
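
A minimal userspace model of that race, purely illustrative (not KVM
code): a single shared bit cannot represent two vCPUs waiting for a
window, while a simple count can.

    #include <stdbool.h>
    #include <stdio.h>

    int main(void)
    {
            bool irqwin_bit = false;    /* what the inhibit tracks today */
            int irqwin_count = 0;       /* what it would need to track */

            irqwin_bit = true;  irqwin_count++;     /* 1. vCPU0 waits */
            irqwin_bit = true;  irqwin_count++;     /* 2. vCPU1 waits */
            irqwin_bit = false; irqwin_count--;     /* 3. vCPU1 gets its window */

            /* 4. vCPU0 is sad: the bit claims nobody needs a window,
             * while the count still says one vCPU does. */
            printf("bit=%d count=%d\n", irqwin_bit, irqwin_count);
            return 0;
    }

This is essentially the gap the per-VM counter proposed later in the
thread closes.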

APICV_INHIBIT_REASON_BLOCKIRQ is also tied to per-vCPU state, but is a-ok because
KVM walks all vCPUs to generate the inhibit.  That approach won't work for IRQ windows
though, because it's obviously not a slow path.

What is generating so many external interrupts?  E.g. is the guest using the PIT
for TSC or APIC calibration?  I suppose it doesn't matter terribly in this case
since APICV_INHIBIT_REASON_PIT_REINJ is already set.

Unless there's a very, very good reason to support a use case that generates
ExtInts during boot, but _only_ during boot, and otherwise doesn't have any APICv
inhibits, I'm leaning towards making SVM's IRQ window inhibit sticky, i.e. never
clear it.  And then we can more simply optimize kvm_set_or_clear_apicv_inhibit()
to do nothing if the inhibit is sticky and already set.

E.g. something like this:

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 6d4a6734b2d6..6f926fd6fc1c 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -10629,6 +10629,14 @@ void kvm_set_or_clear_apicv_inhibit(struct kvm *kvm,
        if (!enable_apicv)
                return;
 
+       if (kvm_is_apicv_inhibit_sticky(reason)) {
+               if (WARN_ON_ONCE(!set))
+                       return;
+
+               if (kvm_test_apicv_inhibit(reason))
+                       return;
+       }
+
        down_write(&kvm->arch.apicv_update_lock);
        __kvm_set_or_clear_apicv_inhibit(kvm, reason, set);
        up_write(&kvm->arch.apicv_update_lock);
Paolo Bonzini Feb. 3, 2025, 10:22 p.m. UTC | #2
On 2/3/25 19:45, Sean Christopherson wrote:
> Unless there's a very, very good reason to support a use case that generates
> ExtInts during boot, but _only_ during boot, and otherwise doesn't have any APICv
> inhibits, I'm leaning towards making SVM's IRQ window inhibit sticky, i.e. never
> clear it.

BIOS tends to use PIT, so that may be too much.  With respect to Naveen's report
of contention on apicv_update_lock, I would go with the sticky-bit idea but apply
it to APICV_INHIBIT_REASON_PIT_REINJ.

Plus, to avoid crazy ExtINT configurations, something like this:

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 3ec6197b1386..3e358d55b676 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1295,6 +1295,11 @@ enum kvm_apicv_inhibit {
  	 */
  	APICV_INHIBIT_REASON_LOGICAL_ID_ALIASED,
  
+	/*
+	 * AVIC is disabled because more than one vCPU has extint unmasked
+	 */
+	APICV_INHIBIT_REASON_EXTINT,
+
  	/*********************************************************/
  	/* INHIBITs that are relevant only to the Intel's APICv. */
  	/*********************************************************/
diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index 71544b0f6301..33a5f4ef42bd 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -377,6 +377,7 @@ void kvm_recalculate_apic_map(struct kvm *kvm)
  	struct kvm_apic_map *new, *old = NULL;
  	struct kvm_vcpu *vcpu;
  	unsigned long i;
+	int extint_cnt = 0;
  	u32 max_id = 255; /* enough space for any xAPIC ID */
  	bool xapic_id_mismatch;
  	int r;
@@ -432,6 +433,8 @@ void kvm_recalculate_apic_map(struct kvm *kvm)
  		if (!kvm_apic_present(vcpu))
  			continue;
  
+		extint_cnt += kvm_apic_accept_pic_intr(vcpu);
+
  		r = kvm_recalculate_phys_map(new, vcpu, &xapic_id_mismatch);
  		if (r) {
  			kvfree(new);
@@ -457,6 +460,11 @@ void kvm_recalculate_apic_map(struct kvm *kvm)
  	else
  		kvm_clear_apicv_inhibit(kvm, APICV_INHIBIT_REASON_PHYSICAL_ID_ALIASED);
  
+	if (extint_cnt > 1)
+		kvm_set_apicv_inhibit(kvm, APICV_INHIBIT_REASON_EXTINT);
+	else
+		kvm_clear_apicv_inhibit(kvm, APICV_INHIBIT_REASON_EXTINT);
+
  	if (!new || new->logical_mode == KVM_APIC_MODE_MAP_DISABLED)
  		kvm_set_apicv_inhibit(kvm, APICV_INHIBIT_REASON_LOGICAL_ID_ALIASED);
  	else
diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
index 57ff79bc02a4..ba2fc7dd8ca2 100644
--- a/arch/x86/kvm/svm/svm.h
+++ b/arch/x86/kvm/svm/svm.h
@@ -676,6 +676,7 @@ extern struct kvm_x86_nested_ops svm_nested_ops;
  	BIT(APICV_INHIBIT_REASON_HYPERV) |		\
  	BIT(APICV_INHIBIT_REASON_NESTED) |		\
  	BIT(APICV_INHIBIT_REASON_IRQWIN) |		\
+	BIT(APICV_INHIBIT_REASON_EXTINT) |		\
  	BIT(APICV_INHIBIT_REASON_PIT_REINJ) |		\
  	BIT(APICV_INHIBIT_REASON_BLOCKIRQ) |		\
  	BIT(APICV_INHIBIT_REASON_SEV)      |		\


I don't love adding another inhibit reason but, together, these two should
remove the contention on apicv_update_lock.  Another idea could be to move
IRQWIN to per-vCPU reason but Maxim tells me that it's not so easy.

Paolo
Sean Christopherson Feb. 3, 2025, 11:46 p.m. UTC | #3
On Mon, Feb 03, 2025, Paolo Bonzini wrote:
> On 2/3/25 19:45, Sean Christopherson wrote:
> > Unless there's a very, very good reason to support a use case that generates
> > ExtInts during boot, but _only_ during boot, and otherwise doesn't have any APICv
> > inhibits, I'm leaning towards making SVM's IRQ window inhibit sticky, i.e. never
> > clear it.
> 
> BIOS tends to use PIT, so that may be too much.  With respect to Naveen's report
> of contention on apicv_update_lock, I would go with the sticky-bit idea but apply
> it to APICV_INHIBIT_REASON_PIT_REINJ.

That won't work, at least not without yet more changes, because KVM creates the
in-kernel PIT with reinjection enabled by default.  The sticky-bit idea is that
if a bit is set and can never be cleared, then there's no need to track new
updates.  Since userspace needs to explicitly disable reinjection, the inhibit
can't be sticky.
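
For reference, the explicit disable in question is the KVM_REINJECT_CONTROL
vm ioctl; a minimal userspace sketch (error handling omitted):

    #include <linux/kvm.h>
    #include <sys/ioctl.h>

    /* Turn off PIT reinjection for an existing KVM VM fd. */
    static int disable_pit_reinject(int vm_fd)
    {
            struct kvm_reinject_control control = { .pit_reinject = 0 };

            return ioctl(vm_fd, KVM_REINJECT_CONTROL, &control);
    }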

I assume we could fudge around that easily enough by deferring the inhibit until
a vCPU is created (or run?), but piggybacking PIT_REINJ won't help the userspace
I/O APIC case.

> I don't love adding another inhibit reason but, together, these two should
> remove the contention on apicv_update_lock.  Another idea could be to move
> IRQWIN to per-vCPU reason but Maxim tells me that it's not so easy.

Oh, yeah, that reminds me of the other reason I would vote for a sticky flag:
if inhibition really is toggling rapidly, performance is going to be quite bad
because inhibiting APICv requires (a) zapping APIC SPTEs and (b) serializing
writers if multiple vCPUs trigger the 0=>1 transition.

And there's some amount of serialization even if there's only a single writer,
as KVM kicks all vCPUs to toggle APICv (and again to flush TLBs, if necessary).

Hmm, something doesn't add up.  Naveen's changelog says:

  KVM additionally inhibits AVIC to request an IRQ window every time it has
  to inject external interrupts, resulting in a barrage of inhibits being set and
  cleared. This shows significant performance degradation compared to AVIC being
  disabled, due to high contention on apicv_update_lock.

But if this is a "real world" use case where the only source of ExtInt is the
PIT, and kernels typically only wire up the PIT to the BSP, why is there
contention on apicv_update_lock?  APICv isn't actually being toggled, so readers
blocking writers to handle KVM_REQ_APICV_UPDATE shouldn't be a problem.

Naveen, do you know why there's contention on apicv_update_lock?  Are multiple
vCPUs actually trying to inject ExtInt?
Maxim Levitsky Feb. 4, 2025, 1:23 a.m. UTC | #4
On Mon, 2025-02-03 at 15:46 -0800, Sean Christopherson wrote:
> On Mon, Feb 03, 2025, Paolo Bonzini wrote:
> > On 2/3/25 19:45, Sean Christopherson wrote:
> > > Unless there's a very, very good reason to support a use case that generates
> > > ExtInts during boot, but _only_ during boot, and otherwise doesn't have any APICv
> > > inhibits, I'm leaning towards making SVM's IRQ window inhibit sticky, i.e. never
> > > clear it.
> > 
> > BIOS tends to use PIT, so that may be too much.  With respect to Naveen's report
> > of contention on apicv_update_lock, I would go with the sticky-bit idea but apply
> > it to APICV_INHIBIT_REASON_PIT_REINJ.
> 
> That won't work, at least not without yet more changes, because KVM creates the
> in-kernel PIT with reinjection enabled by default.  The sticky-bit idea is that
> if a bit is set and can never be cleared, then there's no need to track new
> updates.  Since userspace needs to explicitly disable reinjection, the inhibit
> can't be sticky.
I confirmed this with a trace, this is indeed the case.

> 
> I assume we could fudge around that easily enough by deferring the inhibit until
> a vCPU is created (or run?), but piggybacking PIT_REINJ won't help the userspace
> I/O APIC case.
> 
> > I don't love adding another inhibit reason but, together, these two should
> > remove the contention on apicv_update_lock.  Another idea could be to move
> > IRQWIN to per-vCPU reason but Maxim tells me that it's not so easy.

I retract this statement; it was based on my knowledge from back when I implemented it.

Looking at the current code again, this should be possible and can be a nice cleanup regardless.

(Or I just might have forgotten the reason that made me think back then that this is not worth it,
because I do remember well that I wanted to make the IRQWIN inhibit per-vCPU)

Basically to do so we need to introduce a per-vCPU inhibit field (instead of the .vcpu_get_apicv_inhibit_reasons callback),
set the inhibit bit there in svm_enable_irq_window, and raise KVM_REQ_APICV_UPDATE.

Same thing in interrupt_window_interception().

Nested code can be updated to do so as well very easily. IMHO this is a very nice cleanup.

I'll prepare a patch soon for this.
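
Roughly, the shape would be something like the below; the per-vCPU field
and its placement are hypothetical, not an existing KVM interface:

	static void svm_enable_irq_window(struct kvm_vcpu *vcpu)
	{
		/* ... existing interrupt window setup ... */

		/* hypothetical per-vCPU inhibit mask in kvm_vcpu_arch */
		set_bit(APICV_INHIBIT_REASON_IRQWIN,
			&vcpu->arch.apicv_inhibit_reasons);
		kvm_make_request(KVM_REQ_APICV_UPDATE, vcpu);
	}

with interrupt_window_interception() clearing the bit and raising the
same request.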

Also regardless, I strongly support Paolo's idea to inhibit APICv/AVIC when more than one ExtINT entry is
enabled, although this might not be enough:

In fact, I think multiple vCPUs with ExtINT enabled can trigger the WARN_ON_ONCE below:

kvm_cpu_has_extint will be true on both vCPUs, so kvm_cpu_has_injectable_intr will be true on both
vCPUs as well, and thus both vCPUs can try to pull the interrupt from the PIC, with the second one
likely getting -1, not to mention the possibility of corrupting PIC state due to the concurrent access.


I am talking about this code:


	if (kvm_cpu_has_injectable_intr(vcpu)) {
		r = can_inject ? kvm_x86_call(interrupt_allowed)(vcpu, true) :
				 -EBUSY;
		if (r < 0)
			goto out;
		if (r) {
			int irq = kvm_cpu_get_interrupt(vcpu);

			if (!WARN_ON_ONCE(irq == -1)) {
				kvm_queue_interrupt(vcpu, irq, false);
				kvm_x86_call(inject_irq)(vcpu, false);
				WARN_ON(kvm_x86_call(interrupt_allowed)(vcpu, true) < 0);
			}
		}
		if (kvm_cpu_has_injectable_intr(vcpu))
			kvm_x86_call(enable_irq_window)(vcpu);
	}


So we might need to do something stronger than only inhibiting APICv/AVIC: we might
want to ignore the second ExtINT entry, or maybe even better, ignore both ExtINT entries and refuse to deliver ExtINT at all? 
(The guest is broken (Intel says that this configuration is frowned upon), so IMHO it deserves to keep both pieces. Do you agree?)


Best regards,
	Maxim Levitsky



> 
> Oh, yeah, that reminds me of the other reason I would vote for a sticky flag:
> if inhibition really is toggling rapidly, performance is going to be quite bad
> because inhibiting APICv requires (a) zapping APIC SPTEs and (b) serializing
> writers if multiple vCPUs trigger the 0=>1 transition.
> 
> And there's some amount of serialization even if there's only a single writer,
> as KVM kicks all vCPUs to toggle APICv (and again to flush TLBs, if necessary).
> 
> Hmm, something doesn't add up.  Naveen's changelog says:
> 
>   KVM additionally inhibits AVIC to request an IRQ window every time it has
>   to inject external interrupts, resulting in a barrage of inhibits being set and
>   cleared. This shows significant performance degradation compared to AVIC being
>   disabled, due to high contention on apicv_update_lock.
> 
> But if this is a "real world" use case where the only source of ExtInt is the
> PIT, and kernels typically only wire up the PIT to the BSP, why is there
> contention on apicv_update_lock?  APICv isn't actually being toggled, so readers
> blocking writers to handle KVM_REQ_APICV_UPDATE shouldn't be a problem.
> 
> Naveen, do you know why there's contention on apicv_update_lock?  Are multiple
> vCPUs actually trying to inject ExtInt?
>
Naveen N Rao Feb. 4, 2025, 11:06 a.m. UTC | #5
On Mon, Feb 03, 2025 at 03:46:48PM -0800, Sean Christopherson wrote:
> On Mon, Feb 03, 2025, Paolo Bonzini wrote:
> > On 2/3/25 19:45, Sean Christopherson wrote:
> > > Unless there's a very, very good reason to support a use case that generates
> > > ExtInts during boot, but _only_ during boot, and otherwise doesn't have any APICv
> > > inhibits, I'm leaning towards making SVM's IRQ window inhibit sticky, i.e. never
> > > clear it.
> > 
> > BIOS tends to use PIT, so that may be too much.  With respect to Naveen's report
> > of contention on apicv_update_lock, I would go with the sticky-bit idea but apply
> > it to APICV_INHIBIT_REASON_PIT_REINJ.
> 
> That won't work, at least not without yet more changes, because KVM creates the
> in-kernel PIT with reinjection enabled by default.  The sticky-bit idea is that
> if a bit is set and can never be cleared, then there's no need to track new
> updates.  Since userspace needs to explicitly disable reinjection, the inhibit
> can't be sticky.
> 
> I assume we could fudge around that easily enough by deferring the inhibit until
> a vCPU is created (or run?), but piggybacking PIT_REINJ won't help the userspace
> I/O APIC case.

As a separate change, I have been testing a patch that moves the 
PIT_REINJ inhibit from PIT creation to the point at which the guest 
actually programs it so that default guest configurations can utilize 
AVIC:

diff --git a/arch/x86/kvm/i8254.c b/arch/x86/kvm/i8254.c
index cd57a517d04a..8f959de7ff32 100644
--- a/arch/x86/kvm/i8254.c
+++ b/arch/x86/kvm/i8254.c
@@ -235,6 +235,7 @@ static void destroy_pit_timer(struct kvm_pit *pit)
 {
        hrtimer_cancel(&pit->pit_state.timer);
        kthread_flush_work(&pit->expired);
+       kvm_clear_apicv_inhibit(pit->kvm, APICV_INHIBIT_REASON_PIT_REINJ);
 }
 
 static void pit_do_work(struct kthread_work *work)
@@ -296,22 +297,12 @@ void kvm_pit_set_reinject(struct kvm_pit *pit, bool reinject)
        if (atomic_read(&ps->reinject) == reinject)
                return;
 
-       /*
-        * AMD SVM AVIC accelerates EOI write and does not trap.
-        * This cause in-kernel PIT re-inject mode to fail
-        * since it checks ps->irq_ack before kvm_set_irq()
-        * and relies on the ack notifier to timely queue
-        * the pt->worker work iterm and reinject the missed tick.
-        * So, deactivate APICv when PIT is in reinject mode.
-        */
        if (reinject) {
-               kvm_set_apicv_inhibit(kvm, APICV_INHIBIT_REASON_PIT_REINJ);
                /* The initial state is preserved while ps->reinject == 0. */
                kvm_pit_reset_reinject(pit);
                kvm_register_irq_ack_notifier(kvm, &ps->irq_ack_notifier);
                kvm_register_irq_mask_notifier(kvm, 0, &pit->mask_notifier);
        } else {
-               kvm_clear_apicv_inhibit(kvm, APICV_INHIBIT_REASON_PIT_REINJ);
                kvm_unregister_irq_ack_notifier(kvm, &ps->irq_ack_notifier);
                kvm_unregister_irq_mask_notifier(kvm, 0, &pit->mask_notifier);
        }
@@ -358,6 +349,16 @@ static void create_pit_timer(struct kvm_pit *pit, u32 val, int is_period)
                }
        }
 
+       /*
+        * AMD SVM AVIC accelerates EOI write and does not trap. This causes
+        * in-kernel PIT re-inject mode to fail since it checks ps->irq_ack
+        * before kvm_set_irq() and relies on the ack notifier to timely queue
+        * the pt->worker work item and reinject the missed tick. So,
+        * deactivate APICv when PIT is in reinject mode, but only when the
+        * timer is actually armed.
+        */
+       kvm_set_apicv_inhibit(kvm, APICV_INHIBIT_REASON_PIT_REINJ);
+
        hrtimer_start(&ps->timer, ktime_add_ns(ktime_get(), interval),
                      HRTIMER_MODE_ABS);
 }


Is that reasonable?

If it is, or if we choose to delay the PIT_REINJ inhibit to vCPU creation time, 
then making PIT_REINJ or IRQWIN inhibits sticky will prevent AVIC from being 
enabled later on. I can see in my tests that BIOS (both seabios and edk2) 
programs the PIT, though the Linux guest itself doesn't (unless -no-hpet is used).  
The above change will allow AVIC to be used in Linux guests even when a PIT is 
present.

> 
> > I don't love adding another inhibit reason but, together, these two should
> > remove the contention on apicv_update_lock.  Another idea could be to move
> > IRQWIN to per-vCPU reason but Maxim tells me that it's not so easy.
> 
> Oh, yeah, that reminds me of the other reason I would vote for a sticky flag:
> if inhibition really is toggling rapidly, performance is going to be quite bad
> because inhibiting APICv requires (a) zapping APIC SPTEs and (b) serializing
> writers if multiple vCPUs trigger the 0=>1 transition.
> 
> And there's some amount of serialization even if there's only a single writer,
> as KVM kicks all vCPUs to toggle APICv (and again to flush TLBs, if necessary).
> 
> Hmm, something doesn't add up.  Naveen's changelog says:
> 
>   KVM additionally inhibits AVIC to request an IRQ window every time it has
>   to inject external interrupts, resulting in a barrage of inhibits being set and
>   cleared. This shows significant performance degradation compared to AVIC being
>   disabled, due to high contention on apicv_update_lock.
> 
> But if this is a "real world" use case where the only source of ExtInt is the
> PIT, and kernels typically only wire up the PIT to the BSP, why is there
> contention on apicv_update_lock?  APICv isn't actually being toggled, so readers
> blocking writers to handle KVM_REQ_APICV_UPDATE shouldn't be a problem.
> 
> Naveen, do you know why there's contention on apicv_update_lock?  Are multiple
> vCPUs actually trying to inject ExtInt?

Apologies for the confusion, I should probably have said "device 
interrupts" (sorry, I'm still trying to get my terminology right). I 
have described the test scenario in the cover letter, but to add more 
details: the test involves two guests on the same host (using network 
tap devices with vhost and mq) with one guest running netperf TCP_RR to 
the other guest. Trimmed qemu command:

qemu-system-x86_64 -machine q35,accel=kvm,kernel-irqchip=on \
  -smp 16,cores=8,threads=2 -cpu host,topoext=on,x2apic=on -m 32G \
  -drive file=~/ubuntu_vm1.img,if=virtio,format=qcow2 \
  -netdev tap,id=v0,br=virbr0,script=~/qemu-ifup,downscript=~/qemu-ifdown,vhost=on,queues=16 \
  -device virtio-net-pci,netdev=v0,mac=52:54:00:12:34:11,mq=on,vectors=34

So, not a real-world workload per se, though there may be use cases for guests 
having to handle a large number of packets.

You're right -- APICv isn't actually being toggled, but the IRQWIN inhibit is 
constantly being set and cleared while trying to inject device interrupts into 
the guests. The places where we set/clear the IRQWIN inhibit have comments 
indicating that it is only required for ExtINT, though that's not actually the 
case here.

What is actually happening is that since the PIT is in reinject mode, APICv is 
not active in the guest. When that happens, kvm_cpu_has_injectable_intr() 
returns true when any interrupt is pending:

    /*
     * check if there is injectable interrupt:
     * when virtual interrupt delivery enabled,
     * interrupt from apic will handled by hardware,
     * we don't need to check it here.
     */
    int kvm_cpu_has_injectable_intr(struct kvm_vcpu *v)
    {
	    if (kvm_cpu_has_extint(v))
		    return 1;

	    if (!is_guest_mode(v) && kvm_vcpu_apicv_active(v))
		    return 0;

	    return kvm_apic_has_interrupt(v) != -1; /* LAPIC */
    }

The second if condition fails since APICv is not active. So, 
kvm_check_and_inject_events() calls enable_irq_window() to request an IRQ 
window to inject those interrupts. That results in all vCPUs trying to acquire 
apicv_update_lock to update the inhibit reason, though APICv state itself 
isn't changing and we are not requesting KVM_REQ_APICV_UPDATE.
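
A toy pthreads model of that hot path, purely illustrative (made-up 
names, not KVM code): every vCPU serializes on the write side of the 
same lock just to toggle IRQWIN, even though PIT_REINJ keeps APICv 
inhibited the whole time.

    #include <pthread.h>
    #include <stdio.h>

    #define PIT_REINJ (1UL << 0)
    #define IRQWIN    (1UL << 1)

    static pthread_rwlock_t apicv_update_lock = PTHREAD_RWLOCK_INITIALIZER;
    static unsigned long inhibit_reasons = PIT_REINJ;

    static void *vcpu_inject_loop(void *arg)
    {
            (void)arg;
            for (int i = 0; i < 100000; i++) {
                    /* Every iteration serializes against all other vCPUs. */
                    pthread_rwlock_wrlock(&apicv_update_lock);
                    inhibit_reasons |= IRQWIN;      /* request an IRQ window */
                    pthread_rwlock_unlock(&apicv_update_lock);

                    pthread_rwlock_wrlock(&apicv_update_lock);
                    inhibit_reasons &= ~IRQWIN;     /* window served */
                    pthread_rwlock_unlock(&apicv_update_lock);
            }
            return NULL;
    }

    int main(void)
    {
            pthread_t vcpus[16];

            for (int i = 0; i < 16; i++)
                    pthread_create(&vcpus[i], NULL, vcpu_inject_loop, NULL);
            for (int i = 0; i < 16; i++)
                    pthread_join(vcpus[i], NULL);
            printf("final inhibits: %#lx\n", inhibit_reasons);
            return 0;
    }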


Thanks,
Naveen
Paolo Bonzini Feb. 4, 2025, 2:08 p.m. UTC | #6
On Tue, Feb 4, 2025 at 12:15 PM Naveen N Rao <naveen@kernel.org> wrote:
> As a separate change, I have been testing a patch that moves the
> PIT_REINJ inhibit from PIT creation to the point at which the guest
> actually programs it so that default guest configurations can utilize
> AVIC:

In-kernel PIC and PIT are sort of a legacy setup; the so-called
"split irqchip" (LAPIC in KVM, PIC/PIT in userspace) is strongly
preferred.  So I don't think it's particularly important to cater
for PIT_REINJ.

> If it is, or if we choose to delay the PIT_REINJ inhibit to vCPU creation time,
> then making PIT_REINJ or IRQWIN inhibits sticky will prevent AVIC from being
> enabled later on. I can see in my tests that BIOS (both seabios and edk2)
> programs the PIT, though the Linux guest itself doesn't (unless -no-hpet is used).

Even with -no-hpet, Linux should turn off the PIT relatively soon
and only rely on the local APIC's timer.

> You're right -- APICv isn't actually being toggled, but the IRQWIN inhibit is
> constantly being set and cleared while trying to inject device interrupts into
> the guests. The places where we set/clear the IRQWIN inhibit have comments
> indicating that it is only required for ExtINT, though that's not actually the
> case here.
>
> What is actually happening is that since the PIT is in reinject mode, APICv is
> not active in the guest. When that happens, kvm_cpu_has_injectable_intr()
> returns true when any interrupt is pending:
>
>     /*
>      * check if there is injectable interrupt:
>      * when virtual interrupt delivery enabled,
>      * interrupt from apic will handled by hardware,
>      * we don't need to check it here.
>      */
>     int kvm_cpu_has_injectable_intr(struct kvm_vcpu *v)
>     {
>             if (kvm_cpu_has_extint(v))
>                     return 1;
>
>             if (!is_guest_mode(v) && kvm_vcpu_apicv_active(v))
>                     return 0;
>
>             return kvm_apic_has_interrupt(v) != -1; /* LAPIC */
>     }
>
> The second if condition fails since APICv is not active. So,
> kvm_check_and_inject_events() calls enable_irq_window() to request an IRQ
> window to inject those interrupts.

Ok, that's due solely to the presence of *another* active inhibit.
Since sticky inhibits cannot work, making the IRQWIN inhibit per-vCPU
will still cause vCPUs to pound on the apicv_update_lock, but only on
the read side of the rwsem, so that should be more tolerable.

Using atomics is considerably more complicated and I'd rather avoid it.

Paolo
Sean Christopherson Feb. 4, 2025, 7:18 p.m. UTC | #7
On Mon, Feb 03, 2025, Maxim Levitsky wrote:
> On Mon, 2025-02-03 at 15:46 -0800, Sean Christopherson wrote:
> > On Mon, Feb 03, 2025, Paolo Bonzini wrote:
> > > On 2/3/25 19:45, Sean Christopherson wrote:
> > > > Unless there's a very, very good reason to support a use case that generates
> > > > ExtInts during boot, but _only_ during boot, and otherwise doesn't have any APICv
> > > > inhibits, I'm leaning towards making SVM's IRQ window inhibit sticky, i.e. never
> > > > clear it.
> > > 
> > > BIOS tends to use PIT, so that may be too much.  With respect to Naveen's report
> > > of contention on apicv_update_lock, I would go with the sticky-bit idea but apply
> > > it to APICV_INHIBIT_REASON_PIT_REINJ.
> > 
> > That won't work, at least not without yet more changes, because KVM creates the
> > in-kernel PIT with reinjection enabled by default.  The sticky-bit idea is that
> > if a bit is set and can never be cleared, then there's no need to track new
> > updates.  Since userspace needs to explicitly disable reinjection, the inhibit
> > can't be sticky.
> I confirmed this with a trace, this is indeed the case.
> 
> > 
> > I assume we could fudge around that easily enough by deferring the inhibit until
> > a vCPU is created (or run?), but piggybacking PIT_REINJ won't help the userspace
> > I/O APIC case.
> > 
> > > I don't love adding another inhibit reason but, together, these two should
> > > remove the contention on apicv_update_lock.  Another idea could be to move
> > > IRQWIN to per-vCPU reason but Maxim tells me that it's not so easy.
> 
> I retract this statement; it was based on my knowledge from back when I
> implemented it.
> 
> Looking at the current code again, this should be possible and can be a nice
> cleanup regardless.
> 
> (Or I just might have forgotten the reason that made me think back then that
> this is not worth it, because I do remember well that I wanted to make the
> IRQWIN inhibit per-vCPU)

The complication is the APIC page.  That's not a problem for vCPUs running in L2
because they'll use a different MMU, i.e. a different set of SPTEs that never map
the APIC backing page.  At least, that's how it's supposed to work[*].  ;-)

For IRQWIN, turning off APICv for the current vCPU will leave the APIC SPTEs in
place and so KVM will fail to intercept accesses to the APIC page.  And making
IRQWIN a per-vCPU inhibit won't help performance in the case where there is no
other inhibit, because (a) toggling it on/off requires taking mmu_lock for writing
and doing a remote TLB flush, and (b) unless the guest is doing something bizarre,
only one vCPU will be receiving ExtInt IRQs.  I.e. I don't think trying to make
IRQWIN a pure per-vCPU inhibit is necessary for performance.

After fiddling with a bunch of ideas, I think the best approach to address both
issues is to add a counter for the IRQ window (addresses the per-vCPU aspect of
IRQ windows), set/clear the IRQWIN inhibit according to the counter when *any*
inhibit changes, and then force an immediate update if and only if the count hits
a 0<=>1 transition *and* there is no other inhibit.  That would address the flaw
Naveen found without needing to make PIT_REINJ sticky.

Guarding the count with apicv_update_lock held for read ensures that if there is
a racing change to a different inhibit, either kvm_inc_or_dec_irq_window_inhibit()
will see no inhibits and go down the slow path, or __kvm_set_or_clear_apicv_inhibit()
will set IRQWIN accordingly.

Compile tested only (and probably needs to be split into multiple patches).  I'll
try to take it for a spin later today.

[*] https://lore.kernel.org/all/20250130010825.220346-1-seanjc@google.com

---
 arch/x86/include/asm/kvm_host.h | 13 ++++++++++
 arch/x86/kvm/svm/svm.c          | 43 +++++++++------------------------
 arch/x86/kvm/svm/svm.h          |  1 +
 arch/x86/kvm/x86.c              | 36 ++++++++++++++++++++++++++-
 4 files changed, 61 insertions(+), 32 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 5193c3dfbce1..9e3465e70a0a 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1365,6 +1365,7 @@ struct kvm_arch {
 	/* Protects apicv_inhibit_reasons */
 	struct rw_semaphore apicv_update_lock;
 	unsigned long apicv_inhibit_reasons;
+	atomic_t apicv_irq_window;
 
 	gpa_t wall_clock;
 
@@ -2203,6 +2204,18 @@ static inline void kvm_clear_apicv_inhibit(struct kvm *kvm,
 	kvm_set_or_clear_apicv_inhibit(kvm, reason, false);
 }
 
+void kvm_inc_or_dec_irq_window_inhibit(struct kvm *kvm, bool inc);
+
+static inline void kvm_inc_apicv_irq_window(struct kvm *kvm)
+{
+	kvm_inc_or_dec_irq_window_inhibit(kvm, true);
+}
+
+static inline void kvm_dec_apicv_irq_window(struct kvm *kvm)
+{
+	kvm_inc_or_dec_irq_window_inhibit(kvm, false);
+}
+
 unsigned long __kvm_emulate_hypercall(struct kvm_vcpu *vcpu, unsigned long nr,
 				      unsigned long a0, unsigned long a1,
 				      unsigned long a2, unsigned long a3,
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index 7640a84e554a..668db3bfff3d 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -1636,9 +1636,13 @@ static void svm_set_vintr(struct vcpu_svm *svm)
 	struct vmcb_control_area *control;
 
 	/*
-	 * The following fields are ignored when AVIC is enabled
+	 * vIRQ is ignored by hardware when AVIC is enabled, and so AVIC must be
+	 * inhibited to detect the interrupt window.
 	 */
-	WARN_ON(kvm_vcpu_apicv_activated(&svm->vcpu));
+	if (enable_apicv && !is_guest_mode(&svm->vcpu)) {
+		svm->avic_irq_window = true;
+		kvm_inc_apicv_irq_window(svm->vcpu.kvm);
+	}
 
 	svm_set_intercept(svm, INTERCEPT_VINTR);
 
@@ -1666,6 +1670,11 @@ static void svm_set_vintr(struct vcpu_svm *svm)
 
 static void svm_clear_vintr(struct vcpu_svm *svm)
 {
+	if (svm->avic_irq_window && !is_guest_mode(&svm->vcpu)) {
+		svm->avic_irq_window = false;
+		kvm_dec_apicv_irq_window(svm->vcpu.kvm);
+	}
+
 	svm_clr_intercept(svm, INTERCEPT_VINTR);
 
 	/* Drop int_ctl fields related to VINTR injection.  */
@@ -3219,20 +3228,6 @@ static int interrupt_window_interception(struct kvm_vcpu *vcpu)
 	kvm_make_request(KVM_REQ_EVENT, vcpu);
 	svm_clear_vintr(to_svm(vcpu));
 
-	/*
-	 * If not running nested, for AVIC, the only reason to end up here is ExtINTs.
-	 * In this case AVIC was temporarily disabled for
-	 * requesting the IRQ window and we have to re-enable it.
-	 *
-	 * If running nested, still remove the VM wide AVIC inhibit to
-	 * support case in which the interrupt window was requested when the
-	 * vCPU was not running nested.
-
-	 * All vCPUs which run still run nested, will remain to have their
-	 * AVIC still inhibited due to per-cpu AVIC inhibition.
-	 */
-	kvm_clear_apicv_inhibit(vcpu->kvm, APICV_INHIBIT_REASON_IRQWIN);
-
 	++vcpu->stat.irq_window_exits;
 	return 1;
 }
@@ -3879,22 +3874,8 @@ static void svm_enable_irq_window(struct kvm_vcpu *vcpu)
 	 * enabled, the STGI interception will not occur. Enable the irq
 	 * window under the assumption that the hardware will set the GIF.
 	 */
-	if (vgif || gif_set(svm)) {
-		/*
-		 * IRQ window is not needed when AVIC is enabled,
-		 * unless we have pending ExtINT since it cannot be injected
-		 * via AVIC. In such case, KVM needs to temporarily disable AVIC,
-		 * and fallback to injecting IRQ via V_IRQ.
-		 *
-		 * If running nested, AVIC is already locally inhibited
-		 * on this vCPU, therefore there is no need to request
-		 * the VM wide AVIC inhibition.
-		 */
-		if (!is_guest_mode(vcpu))
-			kvm_set_apicv_inhibit(vcpu->kvm, APICV_INHIBIT_REASON_IRQWIN);
-
+	if (vgif || gif_set(svm))
 		svm_set_vintr(svm);
-	}
 }
 
 static void svm_enable_nmi_window(struct kvm_vcpu *vcpu)
diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
index 9d7cdb8fbf87..8eefed0a865a 100644
--- a/arch/x86/kvm/svm/svm.h
+++ b/arch/x86/kvm/svm/svm.h
@@ -323,6 +323,7 @@ struct vcpu_svm {
 
 	bool guest_state_loaded;
 
+	bool avic_irq_window;
 	bool x2avic_msrs_intercepted;
 
 	/* Guest GIF value, used when vGIF is not enabled */
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index b2d9a16fd4d3..7388f4cfe468 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -10604,7 +10604,11 @@ void __kvm_set_or_clear_apicv_inhibit(struct kvm *kvm,
 
 	old = new = kvm->arch.apicv_inhibit_reasons;
 
-	set_or_clear_apicv_inhibit(&new, reason, set);
+	if (reason != APICV_INHIBIT_REASON_IRQWIN)
+		set_or_clear_apicv_inhibit(&new, reason, set);
+
+	set_or_clear_apicv_inhibit(&new, APICV_INHIBIT_REASON_IRQWIN,
+				   atomic_read(&kvm->arch.apicv_irq_window));
 
 	if (!!old != !!new) {
 		/*
@@ -10645,6 +10649,36 @@ void kvm_set_or_clear_apicv_inhibit(struct kvm *kvm,
 }
 EXPORT_SYMBOL_GPL(kvm_set_or_clear_apicv_inhibit);
 
+void kvm_inc_or_dec_irq_window_inhibit(struct kvm *kvm, bool inc)
+{
+	bool toggle;
+
+	/*
+	 * The IRQ window inhibit has a cyclical dependency of sorts, as KVM
+	 * needs to manually inject IRQs and thus detect interrupt windows if
+	 * APICv is disabled/inhibited.  To avoid thrashing if the IRQ window
+	 * is being requested because APICv is already inhibited, toggle the
+	 * actual inhibit (and take the lock for write) if and only if there's
+	 * no other inhibit.  KVM evaluates the IRQ window count when _any_
+	 * inhibit changes, i.e. the IRQ window inhibit can be lazily updated
+	 * on the next inhibit change (if one ever occurs).
+	 */
+	down_read(&kvm->arch.apicv_update_lock);
+
+	if (inc)
+		toggle = atomic_inc_return(&kvm->arch.apicv_irq_window) == 1;
+	else
+		toggle = atomic_dec_return(&kvm->arch.apicv_irq_window) == 0;
+
+	toggle = toggle && !(kvm->arch.apicv_inhibit_reasons & ~BIT(APICV_INHIBIT_REASON_IRQWIN));
+
+	up_read(&kvm->arch.apicv_update_lock);
+
+	if (toggle)
+		kvm_set_or_clear_apicv_inhibit(kvm, APICV_INHIBIT_REASON_IRQWIN, inc);
+}
+EXPORT_SYMBOL_GPL(kvm_inc_or_dec_irq_window_inhibit);
+
 static void vcpu_scan_ioapic(struct kvm_vcpu *vcpu)
 {
 	if (!kvm_apic_present(vcpu))

base-commit: eb723766b1030a23c38adf2348b7c3d1409d11f0
--
Maxim Levitsky Feb. 4, 2025, 8:08 p.m. UTC | #8
On Tue, 2025-02-04 at 11:18 -0800, Sean Christopherson wrote:
> On Mon, Feb 03, 2025, Maxim Levitsky wrote:
> > On Mon, 2025-02-03 at 15:46 -0800, Sean Christopherson wrote:
> > > On Mon, Feb 03, 2025, Paolo Bonzini wrote:
> > > > On 2/3/25 19:45, Sean Christopherson wrote:
> > > > > Unless there's a very, very good reason to support a use case that generates
> > > > > ExtInts during boot, but _only_ during boot, and otherwise doesn't have any APICv
> > > > > inhibits, I'm leaning towards making SVM's IRQ window inhibit sticky, i.e. never
> > > > > clear it.
> > > > 
> > > > BIOS tends to use PIT, so that may be too much.  With respect to Naveen's report
> > > > of contention on apicv_update_lock, I would go with the sticky-bit idea but apply
> > > > it to APICV_INHIBIT_REASON_PIT_REINJ.
> > > 
> > > That won't work, at least not without yet more changes, because KVM creates the
> > > in-kernel PIT with reinjection enabled by default.  The sticky-bit idea is that
> > > if a bit is set and can never be cleared, then there's no need to track new
> > > updates.  Since userspace needs to explicitly disable reinjection, the inhibit
> > > can't be sticky.
> > I confirmed this with a trace, this is indeed the case.
> > 
> > > I assume we could fudge around that easily enough by deferring the inhibit until
> > > a vCPU is created (or run?), but piggybacking PIT_REINJ won't help the userspace
> > > I/O APIC case.
> > > 
> > > > I don't love adding another inhibit reason but, together, these two should
> > > > remove the contention on apicv_update_lock.  Another idea could be to move
> > > > IRQWIN to per-vCPU reason but Maxim tells me that it's not so easy.
> > 
> > I retract this statement; it was based on my knowledge from back when I
> > implemented it.
> > 
> > Looking at the current code again, this should be possible and can be a nice
> > cleanup regardless.
> > 
> > (Or I just might have forgotten the reason that made me think back then that
> > this is not worth it, because I do remember well that I wanted to make the
> > IRQWIN inhibit per-vCPU)
> 
> The complication is the APIC page.  That's not a problem for vCPUs running in L2
> because they'll use a different MMU, i.e. a different set of SPTEs that never map
> the APIC backing page.  At least, that's how it's supposed to work[*].  ;-)

Yes, it is that thing that I forgot. Because of this, AVIC can't be inhibited on a single vCPU, 
and the only exception is nesting.
I need to write this down somewhere so that I won't forget this again.


> 
> For IRQWIN, turning off APICv for the current vCPU will leave the APIC SPTEs in
> place and so KVM will fail to intercept accesses to the APIC page.  And making
> IRQWIN a per-vCPU inhibit won't help performance in the case where there is no
> other inhibit, because (a) toggling it on/off requires taking mmu_lock for writing
> and doing a remote TLB flush, and (b) unless the guest is doing something bizarre,
> only one vCPU will be receiving ExtInt IRQs.  I.e. I don't think trying to make
> IRQWIN a pure per-vCPU inhibit is necessary for performance.
> 
> After fiddling with a bunch of ideas, I think the best approach to address both
> issues is to add a counter for the IRQ window (addresses the per-vCPU aspect of
> IRQ windows), set/clear the IRQWIN inhibit according to the counter when *any*
> inhibit changes, and then force an immediate update if and only if the count hits
> a 0<=>1 transition *and* there is no other inhibit.  That would address the flaw
> Naveen found without needing to make PIT_REINJ sticky.
> 
> Guarding the count with apicv_update_lock held for read ensures that if there is
> a racing change to a different inhibit, either kvm_inc_or_dec_irq_window_inhibit()
> will see no inhibits and go down the slow path, or __kvm_set_or_clear_apicv_inhibit()
> will set IRQWIN accordingly.
> 
> Compile tested only (and probably needs to be split into multiple patches).  I'll
> try to take it for a spin later today.
> 
> [*] https://lore.kernel.org/all/20250130010825.220346-1-seanjc@google.com
> 
> ---
>  arch/x86/include/asm/kvm_host.h | 13 ++++++++++
>  arch/x86/kvm/svm/svm.c          | 43 +++++++++------------------------
>  arch/x86/kvm/svm/svm.h          |  1 +
>  arch/x86/kvm/x86.c              | 36 ++++++++++++++++++++++++++-
>  4 files changed, 61 insertions(+), 32 deletions(-)
> 
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 5193c3dfbce1..9e3465e70a0a 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -1365,6 +1365,7 @@ struct kvm_arch {
>  	/* Protects apicv_inhibit_reasons */
>  	struct rw_semaphore apicv_update_lock;
>  	unsigned long apicv_inhibit_reasons;
> +	atomic_t apicv_irq_window;
>  
>  	gpa_t wall_clock;
>  
> @@ -2203,6 +2204,18 @@ static inline void kvm_clear_apicv_inhibit(struct kvm *kvm,
>  	kvm_set_or_clear_apicv_inhibit(kvm, reason, false);
>  }
>  
> +void kvm_inc_or_dec_irq_window_inhibit(struct kvm *kvm, bool inc);
> +
> +static inline void kvm_inc_apicv_irq_window(struct kvm *kvm)
> +{
> +	kvm_inc_or_dec_irq_window_inhibit(kvm, true);
> +}
> +
> +static inline void kvm_dec_apicv_irq_window(struct kvm *kvm)
> +{
> +	kvm_inc_or_dec_irq_window_inhibit(kvm, false);
> +}
> +
>  unsigned long __kvm_emulate_hypercall(struct kvm_vcpu *vcpu, unsigned long nr,
>  				      unsigned long a0, unsigned long a1,
>  				      unsigned long a2, unsigned long a3,
> diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
> index 7640a84e554a..668db3bfff3d 100644
> --- a/arch/x86/kvm/svm/svm.c
> +++ b/arch/x86/kvm/svm/svm.c
> @@ -1636,9 +1636,13 @@ static void svm_set_vintr(struct vcpu_svm *svm)
>  	struct vmcb_control_area *control;
>  
>  	/*
> -	 * The following fields are ignored when AVIC is enabled
> +	 * vIRQ is ignored by hardware when AVIC is enabled, and so AVIC must be
> +	 * inhibited to detect the interrupt window.
>  	 */
> -	WARN_ON(kvm_vcpu_apicv_activated(&svm->vcpu));
> +	if (enable_apicv && !is_guest_mode(&svm->vcpu)) {
> +		svm->avic_irq_window = true;
> +		kvm_inc_apicv_irq_window(svm->vcpu.kvm);
> +	}
>  
>  	svm_set_intercept(svm, INTERCEPT_VINTR);
>  
> @@ -1666,6 +1670,11 @@ static void svm_set_vintr(struct vcpu_svm *svm)
>  
>  static void svm_clear_vintr(struct vcpu_svm *svm)
>  {
> +	if (svm->avic_irq_window && !is_guest_mode(&svm->vcpu)) {
> +		svm->avic_irq_window = false;
> +		kvm_dec_apicv_irq_window(svm->vcpu.kvm);
> +	}
> +
>  	svm_clr_intercept(svm, INTERCEPT_VINTR);
>  
>  	/* Drop int_ctl fields related to VINTR injection.  */
> @@ -3219,20 +3228,6 @@ static int interrupt_window_interception(struct kvm_vcpu *vcpu)
>  	kvm_make_request(KVM_REQ_EVENT, vcpu);
>  	svm_clear_vintr(to_svm(vcpu));
>  
> -	/*
> -	 * If not running nested, for AVIC, the only reason to end up here is ExtINTs.
> -	 * In this case AVIC was temporarily disabled for
> -	 * requesting the IRQ window and we have to re-enable it.
> -	 *
> -	 * If running nested, still remove the VM wide AVIC inhibit to
> -	 * support case in which the interrupt window was requested when the
> -	 * vCPU was not running nested.
> -
> -	 * All vCPUs which run still run nested, will remain to have their
> -	 * AVIC still inhibited due to per-cpu AVIC inhibition.
> -	 */

Please keep these comments that explain why the inhibit has to be cleared here,
and the code as well.

The reason is that an IRQ window can be requested before nested entry, which will lead to
a VM-wide inhibit, and the interrupt window can happen while nested because the nested hypervisor
can opt not to intercept interrupts. If that happens, AVIC will remain inhibited forever.
 


> -	kvm_clear_apicv_inhibit(vcpu->kvm, APICV_INHIBIT_REASON_IRQWIN);
> -
>  	++vcpu->stat.irq_window_exits;
>  	return 1;
>  }
> @@ -3879,22 +3874,8 @@ static void svm_enable_irq_window(struct kvm_vcpu *vcpu)
>  	 * enabled, the STGI interception will not occur. Enable the irq
>  	 * window under the assumption that the hardware will set the GIF.
>  	 */
> -	if (vgif || gif_set(svm)) {
> -		/*
> -		 * IRQ window is not needed when AVIC is enabled,
> -		 * unless we have pending ExtINT since it cannot be injected
> -		 * via AVIC. In such case, KVM needs to temporarily disable AVIC,
> -		 * and fallback to injecting IRQ via V_IRQ.
> -		 *
> -		 * If running nested, AVIC is already locally inhibited
> -		 * on this vCPU, therefore there is no need to request
> -		 * the VM wide AVIC inhibition.
> -		 */

Please keep this comment because it explains why this is needed.

As far as I remember there was a reason, not only related to performance,
to avoid inhibiting AVIC here VM-wide.


Best regards,
	Maxim Levitsky




> -		if (!is_guest_mode(vcpu))
> -			kvm_set_apicv_inhibit(vcpu->kvm, APICV_INHIBIT_REASON_IRQWIN);
> -
> +	if (vgif || gif_set(svm))
>  		svm_set_vintr(svm);
> -	}
>  }
>  
>  static void svm_enable_nmi_window(struct kvm_vcpu *vcpu)
> diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
> index 9d7cdb8fbf87..8eefed0a865a 100644
> --- a/arch/x86/kvm/svm/svm.h
> +++ b/arch/x86/kvm/svm/svm.h
> @@ -323,6 +323,7 @@ struct vcpu_svm {
>  
>  	bool guest_state_loaded;
>  
> +	bool avic_irq_window;
>  	bool x2avic_msrs_intercepted;
>  
>  	/* Guest GIF value, used when vGIF is not enabled */
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index b2d9a16fd4d3..7388f4cfe468 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -10604,7 +10604,11 @@ void __kvm_set_or_clear_apicv_inhibit(struct kvm *kvm,
>  
>  	old = new = kvm->arch.apicv_inhibit_reasons;
>  
> -	set_or_clear_apicv_inhibit(&new, reason, set);
> +	if (reason != APICV_INHIBIT_REASON_IRQWIN)
> +		set_or_clear_apicv_inhibit(&new, reason, set);
> +
> +	set_or_clear_apicv_inhibit(&new, APICV_INHIBIT_REASON_IRQWIN,
> +				   atomic_read(&kvm->arch.apicv_irq_window));
>  
>  	if (!!old != !!new) {
>  		/*
> @@ -10645,6 +10649,36 @@ void kvm_set_or_clear_apicv_inhibit(struct kvm *kvm,
>  }
>  EXPORT_SYMBOL_GPL(kvm_set_or_clear_apicv_inhibit);
>  
> +void kvm_inc_or_dec_irq_window_inhibit(struct kvm *kvm, bool inc)
> +{
> +	bool toggle;
> +
> +	/*
> +	 * The IRQ window inhibit has a cyclical dependency of sorts, as KVM
> +	 * needs to manually inject IRQs and thus detect interrupt windows if
> +	 * APICv is disabled/inhibited.  To avoid thrashing if the IRQ window
> +	 * is being requested because APICv is already inhibited, toggle the
> +	 * actual inhibit (and take the lock for write) if and only if there's
> +	 * no other inhibit.  KVM evaluates the IRQ window count when _any_
> +	 * inhibit changes, i.e. the IRQ window inhibit can be lazily updated
> +	 * on the next inhibit change (if one ever occurs).
> +	 */
> +	down_read(&kvm->arch.apicv_update_lock);
> +
> +	if (inc)
> +		toggle = atomic_inc_return(&kvm->arch.apicv_irq_window) == 1;
> +	else
> +		toggle = atomic_dec_return(&kvm->arch.apicv_irq_window) == 0;
> +
> +	toggle = toggle && !(kvm->arch.apicv_inhibit_reasons & ~BIT(APICV_INHIBIT_REASON_IRQWIN));
> +
> +	up_read(&kvm->arch.apicv_update_lock);
> +
> +	if (toggle)
> +		kvm_set_or_clear_apicv_inhibit(kvm, APICV_INHIBIT_REASON_IRQWIN, inc);
> +}
> +EXPORT_SYMBOL_GPL(kvm_inc_or_dec_irq_window_inhibit);
> +
>  static void vcpu_scan_ioapic(struct kvm_vcpu *vcpu)
>  {
>  	if (!kvm_apic_present(vcpu))
> 
> base-commit: eb723766b1030a23c38adf2348b7c3d1409d11f0
Sean Christopherson Feb. 5, 2025, 1:41 a.m. UTC | #9
On Tue, Feb 04, 2025, Maxim Levitsky wrote:
> On Tue, 2025-02-04 at 11:18 -0800, Sean Christopherson wrote:
> > On Mon, Feb 03, 2025, Maxim Levitsky wrote:
> > @@ -3219,20 +3228,6 @@ static int interrupt_window_interception(struct kvm_vcpu *vcpu)
> >  	kvm_make_request(KVM_REQ_EVENT, vcpu);
> >  	svm_clear_vintr(to_svm(vcpu));
> >  
> > -	/*
> > -	 * If not running nested, for AVIC, the only reason to end up here is ExtINTs.
> > -	 * In this case AVIC was temporarily disabled for
> > -	 * requesting the IRQ window and we have to re-enable it.
> > -	 *
> > -	 * If running nested, still remove the VM wide AVIC inhibit to
> > -	 * support case in which the interrupt window was requested when the
> > -	 * vCPU was not running nested.
> > -
> > -	 * All vCPUs which run still run nested, will remain to have their
> > -	 * AVIC still inhibited due to per-cpu AVIC inhibition.
> > -	 */
> 
> Please keep these comments that explain why the inhibit has to be cleared here,

Ya, I'll make sure there are good comments that capture everything before posting
anything.

> and the code as well.
> 
> The reason is that an IRQ window can be requested before nested entry, which
> will lead to a VM-wide inhibit, and the interrupt window can happen while
> nested because the nested hypervisor can opt not to intercept interrupts. If that
> happens, AVIC will remain inhibited forever.

Ah, because the svm_clear_vintr() triggered by enter_svm_guest_mode()'s

	svm_set_gif(svm, true);

will occur with in_guest_mode() == true.  But doesn't that bug exist today?  Clearing
the inhibit relies on interrupt_window_interception() being reached, and so if
the ExtInt is injected into L2, KVM will never reach interrupt_window_interception().

KVM will evaluate pending events in that case:

		if (svm->vcpu.arch.smi_pending ||
		    svm->vcpu.arch.nmi_pending ||
		    kvm_cpu_has_injectable_intr(&svm->vcpu) ||
		    kvm_apic_has_pending_init_or_sipi(&svm->vcpu))
			kvm_make_request(KVM_REQ_EVENT, &svm->vcpu);

but the nested pending VMRUN will block events and not get to the point where
KVM enables an IRQ window:

		r = can_inject ? kvm_x86_call(interrupt_allowed)(vcpu, true) :
				 -EBUSY;
		if (r < 0)
			goto out; <======= Taken for nested VMRUN pending
		if (r) {
			int irq = kvm_cpu_get_interrupt(vcpu);

			if (!WARN_ON_ONCE(irq == -1)) {
				kvm_queue_interrupt(vcpu, irq, false);
				kvm_x86_call(inject_irq)(vcpu, false);
				WARN_ON(kvm_x86_call(interrupt_allowed)(vcpu, true) < 0);
			}
		}
		if (kvm_cpu_has_injectable_intr(vcpu))
			kvm_x86_call(enable_irq_window)(vcpu);

and if L2 has IRQs enabled, the subsequent KVM_REQ_EVENT processing after the
immediate exit will inject into L2, without re-opening a window and thus without
triggering interrupt_window_interception().

Ha!  I was going to suggest hooking svm_leave_nested(), and lo and behold KVM
already does that for KVM_REQ_APICV_UPDATE to ensure it has an up-to-date view
of the inhibits.

After much staring, I suspect the reason KVM hooks interrupt_window_interception()
and not svm_clear_vintr() is to avoid thrashing the inhibit and thus the VM-wide
lock/state on nested entry/exit, e.g. on a typical:

	gif=0 => VMRUN => gif=1 => #VMEXIT => gif=0

sequence, KVM would clear the inhibit while running L2, and then re-enable the
inhibit when control transfers back to L1.  But that doesn't gel with this comment
in interrupt_window_interception():

	* If running nested, still remove the VM wide AVIC inhibit to
	* support case in which the interrupt window was requested when the
	* vCPU was not running nested.

That code/comment exists specifically for the case where L1 is not intercepting
IRQs, and so KVM is effectively using V_IRQ in vmcb02 to detect interrupt windows
for L1 IRQs.

If L1 _is_ intercepting IRQs, then interrupt_window_interception() will never be
reached while L2 is active, because the only reason KVM would set the V_IRQ intercept
in vmcb02 would be on behalf of L1, i.e. because of vmcb12.  svm_clear_vintr()
always operates on (at least) vmcb01, and VMRUN unconditionally sets GIF=1, which
means that enter_svm_guest_mode() will always do svm_clear_vintr() via
svm_set_gif(svm, true).  I.e. KVM will keep the VM-wide inhibit set until control
transfers back to L1 *and* an interrupt window is triggered.

Even the "L1 doesn't intercept IRQs" case is incongruous, because if IRQs are
enabled in L2 at the time of VMRUN, KVM will immediately inject L1's ExtINT into
L2 without taking an interrupt window #VMEXIT.

I don't see a perfect solution.  Barring a rather absurd number of checks, KVM
will inevitably leave the VM-wide inhibit in place while running L2, or KVM will
risk thrashing the VM-wide inhibit across VMRUN.

/facepalm

It's still not perfect, but the obvious-in-hindsight answer is to clear the
IRQ window inhibit when KVM actually injects an interrupt and there's no longer
an injectable interrupt.

That optimizes all the paths: if L1 isn't intercepting IRQs, KVM will drop the
inhibit as soon as an interrupt is injected into L2.  If L1 is intercepting IRQs,
KVM will keep the inhibit until the IRQ is injected into L2.  Unless I'm missing
something, the inhibit itself should prevent an injectable IRQ from disappearing,
i.e. AVIC won't be left inhibited.

So this for the SVM changes?

diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index 7640a84e554a..2a5cf7029b26 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -3219,20 +3219,6 @@ static int interrupt_window_interception(struct kvm_vcpu *vcpu)
 	kvm_make_request(KVM_REQ_EVENT, vcpu);
 	svm_clear_vintr(to_svm(vcpu));
 
-	/*
-	 * If not running nested, for AVIC, the only reason to end up here is ExtINTs.
-	 * In this case AVIC was temporarily disabled for
-	 * requesting the IRQ window and we have to re-enable it.
-	 *
-	 * If running nested, still remove the VM wide AVIC inhibit to
-	 * support case in which the interrupt window was requested when the
-	 * vCPU was not running nested.
-
-	 * All vCPUs which run still run nested, will remain to have their
-	 * AVIC still inhibited due to per-cpu AVIC inhibition.
-	 */
-	kvm_clear_apicv_inhibit(vcpu->kvm, APICV_INHIBIT_REASON_IRQWIN);
-
 	++vcpu->stat.irq_window_exits;
 	return 1;
 }
@@ -3670,6 +3656,23 @@ static void svm_inject_irq(struct kvm_vcpu *vcpu, bool reinjected)
 	struct vcpu_svm *svm = to_svm(vcpu);
 	u32 type;
 
+	/*
+	 * If AVIC was inhibited in order to detect an IRQ window, and there's
+	 * no other injectable interrupt pending or L2 is active (see below),
+	 * then drop the inhibit as the window has served its purpose.
+	 *
+	 * If L2 is active, this path is reachable if L1 is not intercepting
+	 * IRQs, i.e. if KVM is injecting L1 IRQs into L2.  AVIC is locally
+	 * inhibited while L2 is active; drop the VM-wide inhibit to optimize
+	 * the case in which the interrupt window was requested while L1 was
+	 * active (the vCPU was not running nested).
+	 */
+	if (svm->avic_irq_window &&
+	    (!kvm_cpu_has_injectable_intr(vcpu) || is_guest_mode(vcpu))) {
+		svm->avic_irq_window = false;
+		kvm_dec_apicv_irq_window(svm->vcpu.kvm);
+	}
+
 	if (vcpu->arch.interrupt.soft) {
 		if (svm_update_soft_interrupt_rip(vcpu))
 			return;
@@ -3881,17 +3884,28 @@ static void svm_enable_irq_window(struct kvm_vcpu *vcpu)
 	 */
 	if (vgif || gif_set(svm)) {
 		/*
-		 * IRQ window is not needed when AVIC is enabled,
-		 * unless we have pending ExtINT since it cannot be injected
-		 * via AVIC. In such case, KVM needs to temporarily disable AVIC,
-		 * and fallback to injecting IRQ via V_IRQ.
+		 * KVM only enables IRQ windows when AVIC is enabled if there's
+		 * a pending ExtINT since it cannot be injected via AVIC (ExtINT
+		 * bypasses the local APIC).  V_IRQ is ignored by hardware when
+		 * AVIC is enabled, and so KVM needs to temporarily disable
+		 * AVIC in order to detect when it's ok to inject the ExtINT.
 		 *
-		 * If running nested, AVIC is already locally inhibited
-		 * on this vCPU, therefore there is no need to request
-		 * the VM wide AVIC inhibition.
+		 * If running nested, AVIC is already locally inhibited on this
+		 * vCPU (L2 vCPUs use a different MMU that never maps the AVIC
+		 * backing page), therefore there is no need to increment the
+		 * VM-wide AVIC inhibit.  KVM will re-evaluate events when the
+		 * vCPU exits to L1 and enable an IRQ window if the ExtINT is
+		 * still pending.
+		 *
+		 * Note, the IRQ window inhibit needs to be updated even if
+		 * AVIC is inhibited for a different reason, as KVM needs to
+		 * keep AVIC inhibited if the other reason is cleared and there
+		 * is still an injectable interrupt pending.
 		 */
-		if (!is_guest_mode(vcpu))
-			kvm_set_apicv_inhibit(vcpu->kvm, APICV_INHIBIT_REASON_IRQWIN);
+		if (enable_apicv && !svm->avic_irq_window && !is_guest_mode(vcpu)) {
+			svm->avic_irq_window = true;
+			kvm_inc_apicv_irq_window(vcpu->kvm);
+		}
 
 		svm_set_vintr(svm);
 	}
Naveen N Rao Feb. 5, 2025, 10:54 a.m. UTC | #10
On Tue, Feb 04, 2025 at 05:41:35PM -0800, Sean Christopherson wrote:
> On Tue, Feb 04, 2025, Maxim Levitsky wrote:
> > On Tue, 2025-02-04 at 11:18 -0800, Sean Christopherson wrote:
> > > On Mon, Feb 03, 2025, Maxim Levitsky wrote:
> > > @@ -3219,20 +3228,6 @@ static int interrupt_window_interception(struct kvm_vcpu *vcpu)
> > >  	kvm_make_request(KVM_REQ_EVENT, vcpu);
> > >  	svm_clear_vintr(to_svm(vcpu));
> > >  
> > > -	/*
> > > -	 * If not running nested, for AVIC, the only reason to end up here is ExtINTs.
> > > -	 * In this case AVIC was temporarily disabled for
> > > -	 * requesting the IRQ window and we have to re-enable it.
> > > -	 *
> > > -	 * If running nested, still remove the VM wide AVIC inhibit to
> > > -	 * support case in which the interrupt window was requested when the
> > > -	 * vCPU was not running nested.
> > > -
> > > -	 * All vCPUs which run still run nested, will remain to have their
> > > -	 * AVIC still inhibited due to per-cpu AVIC inhibition.
> > > -	 */
> > 
> > Please keep these comments that explain why the inhibit has to be cleared here,
> 
> Ya, I'll make sure there are good comments that capture everything before posting
> anything.
> 
> > and the code as well.
> > 
> > The reason is that an IRQ window can be requested before nested entry, which
> > will lead to a VM-wide inhibit, and the interrupt window can happen while
> > nested because the nested hypervisor can opt to not intercept interrupts. If
> > that happens, the AVIC will remain inhibited forever.

<snip>

> It's still not perfect, but the obvious-in-hindsight answer is to clear the
> IRQ window inhibit when KVM actually injects an interrupt and there's no longer
> an injectable interrupt.
> 
> That optimizes all the paths: if L1 isn't intercepting IRQs, KVM will drop the
> inhibit as soon as an interrupt is injected into L2.  If L1 is intercepting IRQs,
> KVM will keep the inhibit until the IRQ is injected into L2.  Unless I'm missing
> something, the inhibit itself should prevent an injectable IRQ from disappearing,
> i.e. AVIC won't be left inhibited.
> 
> So this for the SVM changes?

I tested this in my setup (the below changes to svm.c, along with the rest 
of the changes from your earlier posting), and it ensures proper update 
of the APICv inhibit. That is, with the below changes, AVIC is properly 
enabled in the virtual machine in pit=discard mode (which wasn't the 
case with the earlier changes to svm_[set|clear]_vintr()).

This improves performance in my test by ~10%:
Average:        IFACE   rxpck/s   txpck/s    rxkB/s    txkB/s   rxcmp/s   txcmp/s  rxmcst/s   %ifutil
Average:           lo      0.12      0.12      0.01      0.01      0.00      0.00      0.00      0.00
Average:       enp0s2 1197141.51 1197158.89 226801.33 114572.12      0.00      0.00      0.00      0.00

This is still ~10% below what I see with avic=0 though.


Thanks,
Naveen

> 
> diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
> index 7640a84e554a..2a5cf7029b26 100644
> --- a/arch/x86/kvm/svm/svm.c
> +++ b/arch/x86/kvm/svm/svm.c
> @@ -3219,20 +3219,6 @@ static int interrupt_window_interception(struct kvm_vcpu *vcpu)
>  	kvm_make_request(KVM_REQ_EVENT, vcpu);
>  	svm_clear_vintr(to_svm(vcpu));
>  
> -	/*
> -	 * If not running nested, for AVIC, the only reason to end up here is ExtINTs.
> -	 * In this case AVIC was temporarily disabled for
> -	 * requesting the IRQ window and we have to re-enable it.
> -	 *
> -	 * If running nested, still remove the VM wide AVIC inhibit to
> -	 * support case in which the interrupt window was requested when the
> -	 * vCPU was not running nested.
> -
> -	 * All vCPUs which run still run nested, will remain to have their
> -	 * AVIC still inhibited due to per-cpu AVIC inhibition.
> -	 */
> -	kvm_clear_apicv_inhibit(vcpu->kvm, APICV_INHIBIT_REASON_IRQWIN);
> -
>  	++vcpu->stat.irq_window_exits;
>  	return 1;
>  }
> @@ -3670,6 +3656,23 @@ static void svm_inject_irq(struct kvm_vcpu *vcpu, bool reinjected)
>  	struct vcpu_svm *svm = to_svm(vcpu);
>  	u32 type;
>  
> +	/*
> +	 * If AVIC was inhibited in order to detect an IRQ window, and there are
> +	 * no other injectable interrupts pending or L2 is active (see below),
> +	 * then drop the inhibit as the window has served its purpose.
> +	 *
> +	 * If L2 is active, this path is reachable if L1 is not intercepting
> +	 * IRQs, i.e. if KVM is injecting L1 IRQs into L2.  AVIC is locally
> +	 * inhibited while L2 is active; drop the VM-wide inhibit to optimize
> +	 * the case in which the interrupt window was requested while L1 was
> +	 * active (the vCPU was not running nested).
> +	 */
> +	if (svm->avic_irq_window &&
> +	    (!kvm_cpu_has_injectable_intr(vcpu) || is_guest_mode(vcpu))) {
> +		svm->avic_irq_window = false;
> +		kvm_dec_apicv_irq_window(svm->vcpu.kvm);
> +	}
> +
>  	if (vcpu->arch.interrupt.soft) {
>  		if (svm_update_soft_interrupt_rip(vcpu))
>  			return;
> @@ -3881,17 +3884,28 @@ static void svm_enable_irq_window(struct kvm_vcpu *vcpu)
>  	 */
>  	if (vgif || gif_set(svm)) {
>  		/*
> -		 * IRQ window is not needed when AVIC is enabled,
> -		 * unless we have pending ExtINT since it cannot be injected
> -		 * via AVIC. In such case, KVM needs to temporarily disable AVIC,
> -		 * and fallback to injecting IRQ via V_IRQ.
> +		 * KVM only enables IRQ windows when AVIC is enabled if there's
> +	 * a pending ExtINT, since it cannot be injected via AVIC (ExtINT
> +		 * bypasses the local APIC).  V_IRQ is ignored by hardware when
> +		 * AVIC is enabled, and so KVM needs to temporarily disable
> +		 * AVIC in order to detect when it's ok to inject the ExtINT.
>  		 *
> -		 * If running nested, AVIC is already locally inhibited
> -		 * on this vCPU, therefore there is no need to request
> -		 * the VM wide AVIC inhibition.
> +		 * If running nested, AVIC is already locally inhibited on this
> +		 * vCPU (L2 vCPUs use a different MMU that never maps the AVIC
> +		 * backing page), therefore there is no need to increment the
> +		 * VM-wide AVIC inhibit.  KVM will re-evaluate events when the
> +		 * vCPU exits to L1 and enable an IRQ window if the ExtINT is
> +		 * still pending.
> +		 *
> +		 * Note, the IRQ window inhibit needs to be updated even if
> +		 * AVIC is inhibited for a different reason, as KVM needs to
> +		 * keep AVIC inhibited if the other reason is cleared and there
> +		 * is still an injectable interrupt pending.
>  		 */
> -		if (!is_guest_mode(vcpu))
> -			kvm_set_apicv_inhibit(vcpu->kvm, APICV_INHIBIT_REASON_IRQWIN);
> +		if (enable_apicv && !svm->avic_irq_window && !is_guest_mode(vcpu)) {
> +			svm->avic_irq_window = true;
> +			kvm_inc_apicv_irq_window(vcpu->kvm);
> +		}
>  
>  		svm_set_vintr(svm);
>  	}
Paolo Bonzini Feb. 5, 2025, 11:36 a.m. UTC | #11
On 2/4/25 20:18, Sean Christopherson wrote:
> On Mon, Feb 03, 2025, Maxim Levitsky wrote:
>> On Mon, 2025-02-03 at 15:46 -0800, Sean Christopherson wrote:
>>> On Mon, Feb 03, 2025, Paolo Bonzini wrote:
>>>> On 2/3/25 19:45, Sean Christopherson wrote:
>>>>> Unless there's a very, very good reason to support a use case that generates
>>>>> ExtInts during boot, but _only_ during boot, and otherwise doesn't have any APICv
> >>>>> inhibits, I'm leaning towards making SVM's IRQ window inhibit sticky, i.e. never
>>>>> clear it.
>>>>
>>>> BIOS tends to use PIT, so that may be too much.  With respect to Naveen's report
>>>> of contention on apicv_update_lock, I would go with the sticky-bit idea but apply
>>>> it to APICV_INHIBIT_REASON_PIT_REINJ.
>>>
> >>> That won't work, at least not without yet more changes, because KVM creates the
> >>> in-kernel PIT with reinjection enabled by default.  The sticky-bit idea is that
>>> if a bit is set and can never be cleared, then there's no need to track new
>>> updates.  Since userspace needs to explicitly disable reinjection, the inhibit
>>> can't be sticky.
> >> I confirmed this with a trace; this is indeed the case.
>>
>>>
> >>> I assume we could fudge around that easily enough by deferring the inhibit until
> >>> a vCPU is created (or run?), but piggybacking on PIT_REINJ won't help the userspace
>>> I/O APIC case.
>>>
>>>> I don't love adding another inhibit reason but, together, these two should
>>>> remove the contention on apicv_update_lock.  Another idea could be to move
> >>>> IRQWIN to a per-vCPU reason, but Maxim tells me that it's not so easy.
>>
> >> I retract this statement; it was based on my knowledge from back when I
>> implemented it.
>>
>> Looking at the current code again, this should be possible and can be a nice
>> cleanup regardless.
>>
>> (Or I just might have forgotten the reason that made me think back then that
> >> this is not worth it, because I do remember well that I wanted the IRQWIN
> >> inhibit to be per-vCPU)
> 
> The complication is the APIC page.  That's not a problem for vCPUs running in L2
> because they'll use a different MMU, i.e. a different set of SPTEs that never map
> the APIC backing page.  At least, that's how it's supposed to work[*].  ;-)
> 
> For IRQWIN, turning off APICv for the current vCPU will leave the APIC SPTEs in
> place and so KVM will fail to intercept accesses to the APIC page.  And making
> IRQWIN a per-vCPU inhibit won't help performance in the case where there is no
> other inhibit, because (a) toggling it on/off requires taking mmu_lock for writing
> and doing a remote TLB flush, and (b) unless the guest is doing something bizarre,
> only one vCPU will be receiving ExtInt IRQs.  I.e. I don't think trying to make
> IRQWIN a pure per-vCPU inhibit is necessary for performance.
> 
> After fiddling with a bunch of ideas, I think the best approach to address both
> issues is to add a counter for the IRQ window (addresses the per-vCPU aspect of
> IRQ windows), set/clear the IRQWIN inhibit according to the counter when *any*
> inhibit changes, and then force an immediate update if and only if the count hits
> a 0<=>1 transition *and* there is no other inhibit.  That would address the flaw
> Naveen found without needing to make PIT_REINJ sticky.
> 
> Guarding the count with apicv_update_lock held for read ensures that if there is
> a racing change to a different inhibit, that either kvm_inc_or_dec_irq_window_inhibit()
> will see no inhibits and go down the slow path, or __kvm_set_or_clear_apicv_inhibit()
> will set IRQWIN accordingly.
> 
> Compile tested only (and probably needs to be split into multiple patches).  I'll
> try to take it for a spin later today.
> 
> [*] https://lore.kernel.org/all/20250130010825.220346-1-seanjc@google.com
> 
> ---
>   arch/x86/include/asm/kvm_host.h | 13 ++++++++++
>   arch/x86/kvm/svm/svm.c          | 43 +++++++++------------------------
>   arch/x86/kvm/svm/svm.h          |  1 +
>   arch/x86/kvm/x86.c              | 36 ++++++++++++++++++++++++++-
>   4 files changed, 61 insertions(+), 32 deletions(-)
> 
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 5193c3dfbce1..9e3465e70a0a 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -1365,6 +1365,7 @@ struct kvm_arch {
>   	/* Protects apicv_inhibit_reasons */
>   	struct rw_semaphore apicv_update_lock;
>   	unsigned long apicv_inhibit_reasons;
> +	atomic_t apicv_irq_window;
>   
>   	gpa_t wall_clock;
>   
> @@ -2203,6 +2204,18 @@ static inline void kvm_clear_apicv_inhibit(struct kvm *kvm,
>   	kvm_set_or_clear_apicv_inhibit(kvm, reason, false);
>   }
>   
> +void kvm_inc_or_dec_irq_window_inhibit(struct kvm *kvm, bool inc);
> +
> +static inline void kvm_inc_apicv_irq_window(struct kvm *kvm)
> +{
> +	kvm_inc_or_dec_irq_window_inhibit(kvm, true);
> +}
> +
> +static inline void kvm_dec_apicv_irq_window(struct kvm *kvm)
> +{
> +	kvm_inc_or_dec_irq_window_inhibit(kvm, false);
> +}
> +
>   unsigned long __kvm_emulate_hypercall(struct kvm_vcpu *vcpu, unsigned long nr,
>   				      unsigned long a0, unsigned long a1,
>   				      unsigned long a2, unsigned long a3,
> diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
> index 7640a84e554a..668db3bfff3d 100644
> --- a/arch/x86/kvm/svm/svm.c
> +++ b/arch/x86/kvm/svm/svm.c
> @@ -1636,9 +1636,13 @@ static void svm_set_vintr(struct vcpu_svm *svm)
>   	struct vmcb_control_area *control;
>   
>   	/*
> -	 * The following fields are ignored when AVIC is enabled
> +	 * vIRQ is ignored by hardware when AVIC is enabled, and so AVIC must be
> +	 * inhibited to detect the interrupt window.
>   	 */
> -	WARN_ON(kvm_vcpu_apicv_activated(&svm->vcpu));
> +	if (enable_apicv && !is_guest_mode(&svm->vcpu)) {
> +		svm->avic_irq_window = true;
> +		kvm_inc_apicv_irq_window(svm->vcpu.kvm);
> +	}
>   
>   	svm_set_intercept(svm, INTERCEPT_VINTR);
>   
> @@ -1666,6 +1670,11 @@ static void svm_set_vintr(struct vcpu_svm *svm)
>   
>   static void svm_clear_vintr(struct vcpu_svm *svm)
>   {
> +	if (svm->avic_irq_window && !is_guest_mode(&svm->vcpu)) {
> +		svm->avic_irq_window = false;
> +		kvm_dec_apicv_irq_window(svm->vcpu.kvm);
> +	}
> +
>   	svm_clr_intercept(svm, INTERCEPT_VINTR);
>   
>   	/* Drop int_ctl fields related to VINTR injection.  */
> @@ -3219,20 +3228,6 @@ static int interrupt_window_interception(struct kvm_vcpu *vcpu)
>   	kvm_make_request(KVM_REQ_EVENT, vcpu);
>   	svm_clear_vintr(to_svm(vcpu));
>   
> -	/*
> -	 * If not running nested, for AVIC, the only reason to end up here is ExtINTs.
> -	 * In this case AVIC was temporarily disabled for
> -	 * requesting the IRQ window and we have to re-enable it.
> -	 *
> -	 * If running nested, still remove the VM wide AVIC inhibit to
> -	 * support case in which the interrupt window was requested when the
> -	 * vCPU was not running nested.
> -
> -	 * All vCPUs which run still run nested, will remain to have their
> -	 * AVIC still inhibited due to per-cpu AVIC inhibition.
> -	 */
> -	kvm_clear_apicv_inhibit(vcpu->kvm, APICV_INHIBIT_REASON_IRQWIN);
> -
>   	++vcpu->stat.irq_window_exits;
>   	return 1;
>   }
> @@ -3879,22 +3874,8 @@ static void svm_enable_irq_window(struct kvm_vcpu *vcpu)
>   	 * enabled, the STGI interception will not occur. Enable the irq
>   	 * window under the assumption that the hardware will set the GIF.
>   	 */
> -	if (vgif || gif_set(svm)) {
> -		/*
> -		 * IRQ window is not needed when AVIC is enabled,
> -		 * unless we have pending ExtINT since it cannot be injected
> -		 * via AVIC. In such case, KVM needs to temporarily disable AVIC,
> -		 * and fallback to injecting IRQ via V_IRQ.
> -		 *
> -		 * If running nested, AVIC is already locally inhibited
> -		 * on this vCPU, therefore there is no need to request
> -		 * the VM wide AVIC inhibition.
> -		 */
> -		if (!is_guest_mode(vcpu))
> -			kvm_set_apicv_inhibit(vcpu->kvm, APICV_INHIBIT_REASON_IRQWIN);
> -
> +	if (vgif || gif_set(svm))
>   		svm_set_vintr(svm);
> -	}
>   }
>   
>   static void svm_enable_nmi_window(struct kvm_vcpu *vcpu)
> diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
> index 9d7cdb8fbf87..8eefed0a865a 100644
> --- a/arch/x86/kvm/svm/svm.h
> +++ b/arch/x86/kvm/svm/svm.h
> @@ -323,6 +323,7 @@ struct vcpu_svm {
>   
>   	bool guest_state_loaded;
>   
> +	bool avic_irq_window;
>   	bool x2avic_msrs_intercepted;
>   
>   	/* Guest GIF value, used when vGIF is not enabled */
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index b2d9a16fd4d3..7388f4cfe468 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -10604,7 +10604,11 @@ void __kvm_set_or_clear_apicv_inhibit(struct kvm *kvm,
>   
>   	old = new = kvm->arch.apicv_inhibit_reasons;
>   
> -	set_or_clear_apicv_inhibit(&new, reason, set);
> +	if (reason != APICV_INHIBIT_REASON_IRQWIN)
> +		set_or_clear_apicv_inhibit(&new, reason, set);
> +
> +	set_or_clear_apicv_inhibit(&new, APICV_INHIBIT_REASON_IRQWIN,
> +				   atomic_read(&kvm->arch.apicv_irq_window));
>   
>   	if (!!old != !!new) {
>   		/*
> @@ -10645,6 +10649,36 @@ void kvm_set_or_clear_apicv_inhibit(struct kvm *kvm,
>   }
>   EXPORT_SYMBOL_GPL(kvm_set_or_clear_apicv_inhibit);
>   
> +void kvm_inc_or_dec_irq_window_inhibit(struct kvm *kvm, bool inc)
> +{
> +	bool toggle;
> +
> +	/*
> +	 * The IRQ window inhibit has a cyclical dependency of sorts, as KVM
> +	 * needs to manually inject IRQs and thus detect interrupt windows if
> +	 * APICv is disabled/inhibited.  To avoid thrashing if the IRQ window
> +	 * is being requested because APICv is already inhibited, toggle the
> +	 * actual inhibit (and take the lock for write) if and only if there's
> +	 * no other inhibit.  KVM evaluates the IRQ window count when _any_
> +	 * inhibit changes, i.e. the IRQ window inhibit can be lazily updated
> +	 * on the next inhibit change (if one ever occurs).
> +	 */
> +	down_read(&kvm->arch.apicv_update_lock);
> +
> +	if (inc)
> +		toggle = atomic_inc_return(&kvm->arch.apicv_irq_window) == 1;
> +	else
> +		toggle = atomic_dec_return(&kvm->arch.apicv_irq_window) == 0;
> +
> +	toggle = toggle && !(kvm->arch.apicv_inhibit_reasons & ~BIT(APICV_INHIBIT_REASON_IRQWIN));
> +
> +	up_read(&kvm->arch.apicv_update_lock);
> +
> +	if (toggle)
> +		kvm_set_or_clear_apicv_inhibit(kvm, APICV_INHIBIT_REASON_IRQWIN, inc);

I'm not super confident about breaking up the critical section...  Another possibility:

void kvm_inc_or_dec_irq_window_inhibit(struct kvm *kvm, bool inc)
{
	int add = inc ? 1 : -1;

	if (!enable_apicv)
		return;

	/*
	 * IRQ windows happen either because of ExtINT injections, or because
	 * APICv is already disabled/inhibited for another reason.  While ExtINT
	 * injections are rare and should not happen while the vCPU is running
	 * its actual workload, it's worth avoiding thrashing if the IRQ window
	 * is being requested because APICv is already inhibited.  So, toggle
	 * the actual inhibit (which requires taking the lock for write) if and
	 * only if there's no other inhibit.  kvm_set_or_clear_apicv_inhibit()
	 * always evaluates the IRQ window count; thus the IRQ window inhibit
	 * _will_ be lazily updated on the next call, if it ever happens.
	 */
	if (READ_ONCE(kvm->arch.apicv_inhibit_reasons) & ~BIT(APICV_INHIBIT_REASON_IRQWIN)) {
		guard(rwsem_read)(&kvm->arch.apicv_update_lock);
		if (READ_ONCE(kvm->arch.apicv_inhibit_reasons) & ~BIT(APICV_INHIBIT_REASON_IRQWIN)) {
			atomic_add(add, &kvm->arch.apicv_irq_window);
			return;
		}
	}

	/*
	 * Strictly speaking the lock is only needed if going 0->1 or 1->0,
	 * a la atomic_dec_and_mutex_lock.  However, ExtINTs are rare and
	 * only target a single CPU, so that is the common case; do not
	 * bother eliding the down_write()/up_write() pair.
	 */
	guard(rwsem_write)(&kvm->arch.apicv_update_lock);
	if (atomic_add_return(add, &kvm->arch.apicv_irq_window) == inc)
		__kvm_set_or_clear_apicv_inhibit(kvm, APICV_INHIBIT_REASON_IRQWIN, inc);
}
EXPORT_SYMBOL_GPL(kvm_inc_or_dec_irq_window_inhibit);

Paolo
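
A note on the pattern: both versions above hinge on taking the lock only when
the IRQ window count crosses zero, a la atomic_dec_and_mutex_lock.  The
following is a minimal userspace sketch of that pattern, for illustration
only: C11 atomics and a pthread mutex stand in for the kernel's atomic_t and
rw_semaphore, and the names (inc_or_dec_irq_window, irqwin_inhibit) are made
up for the example rather than taken from KVM.

#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

static atomic_int irq_window_count;
static pthread_mutex_t update_lock = PTHREAD_MUTEX_INITIALIZER;
static bool irqwin_inhibit;	/* protected by update_lock */

static void inc_or_dec_irq_window(bool inc)
{
	/* fetch_add returns the old value; only 0<->1 transitions matter. */
	int old = atomic_fetch_add(&irq_window_count, inc ? 1 : -1);

	if ((inc && old != 0) || (!inc && old != 1))
		return;	/* fast path: the count did not cross zero */

	/*
	 * Slow path: re-evaluate under the lock, since a racing update may
	 * have moved the count again between the fetch_add and here.
	 */
	pthread_mutex_lock(&update_lock);
	irqwin_inhibit = atomic_load(&irq_window_count) > 0;
	pthread_mutex_unlock(&update_lock);
}

int main(void)
{
	inc_or_dec_irq_window(true);	/* 0 -> 1: takes the lock */
	inc_or_dec_irq_window(true);	/* 1 -> 2: lock elided */
	inc_or_dec_irq_window(false);	/* 2 -> 1: lock elided */
	inc_or_dec_irq_window(false);	/* 1 -> 0: takes the lock */
	printf("inhibit = %d\n", irqwin_inhibit);
	return 0;
}

Callers only pay for the lock on the boundary transitions, which is the
property that keeps apicv_update_lock off the hot path in the proposals above.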
diff mbox series

Patch

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index fb93563714c2..bc4fb3c9d54c 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1359,9 +1359,10 @@  struct kvm_arch {
 	bool apic_access_memslot_enabled;
 	bool apic_access_memslot_inhibited;
 
-	/* Protects apicv_inhibit_reasons */
+	bool apicv_activated;
+	/* Protects apicv_activated */
 	struct rw_semaphore apicv_update_lock;
-	unsigned long apicv_inhibit_reasons;
+	atomic_t apicv_inhibit_reasons;
 
 	gpa_t wall_clock;
 
@@ -2183,8 +2184,6 @@  gpa_t kvm_mmu_gva_to_gpa_system(struct kvm_vcpu *vcpu, gva_t gva,
 bool kvm_apicv_activated(struct kvm *kvm);
 bool kvm_vcpu_apicv_activated(struct kvm_vcpu *vcpu);
 void __kvm_vcpu_update_apicv(struct kvm_vcpu *vcpu);
-void __kvm_set_or_clear_apicv_inhibit(struct kvm *kvm,
-				      enum kvm_apicv_inhibit reason, bool set);
 void kvm_set_or_clear_apicv_inhibit(struct kvm *kvm,
 				    enum kvm_apicv_inhibit reason, bool set);
 
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 11235e91ae90..6c8f9a9d6548 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -9917,33 +9917,42 @@  static void kvm_pv_kick_cpu_op(struct kvm *kvm, int apicid)
 
 bool kvm_apicv_activated(struct kvm *kvm)
 {
-	return (READ_ONCE(kvm->arch.apicv_inhibit_reasons) == 0);
+	return READ_ONCE(kvm->arch.apicv_activated);
 }
 EXPORT_SYMBOL_GPL(kvm_apicv_activated);
 
 bool kvm_vcpu_apicv_activated(struct kvm_vcpu *vcpu)
 {
-	ulong vm_reasons = READ_ONCE(vcpu->kvm->arch.apicv_inhibit_reasons);
+	ulong vm_apicv_activated = READ_ONCE(vcpu->kvm->arch.apicv_activated);
 	ulong vcpu_reasons =
 			kvm_x86_call(vcpu_get_apicv_inhibit_reasons)(vcpu);
 
-	return (vm_reasons | vcpu_reasons) == 0;
+	return vm_apicv_activated && vcpu_reasons == 0;
 }
 EXPORT_SYMBOL_GPL(kvm_vcpu_apicv_activated);
 
-static void set_or_clear_apicv_inhibit(unsigned long *inhibits,
-				       enum kvm_apicv_inhibit reason, bool set)
+static unsigned long set_or_clear_apicv_inhibit(atomic_t *inhibits, enum kvm_apicv_inhibit reason,
+						bool set, unsigned long *new_inhibits)
 {
 	const struct trace_print_flags apicv_inhibits[] = { APICV_INHIBIT_REASONS };
+	unsigned long old, new;
 
 	BUILD_BUG_ON(ARRAY_SIZE(apicv_inhibits) != NR_APICV_INHIBIT_REASONS);
 
-	if (set)
-		__set_bit(reason, inhibits);
-	else
-		__clear_bit(reason, inhibits);
+	if (set) {
+		old = new = atomic_fetch_or(BIT(reason), inhibits);
+		__set_bit(reason, &new);
+	} else {
+		old = new = atomic_fetch_andnot(BIT(reason), inhibits);
+		__clear_bit(reason, &new);
+	}
 
-	trace_kvm_apicv_inhibit_changed(reason, set, *inhibits);
+	trace_kvm_apicv_inhibit_changed(reason, set, new);
+
+	if (new_inhibits)
+		*new_inhibits = new;
+
+	return old;
 }
 
 static void kvm_apicv_init(struct kvm *kvm)
@@ -9951,7 +9960,7 @@  static void kvm_apicv_init(struct kvm *kvm)
 	enum kvm_apicv_inhibit reason = enable_apicv ? APICV_INHIBIT_REASON_ABSENT :
 						       APICV_INHIBIT_REASON_DISABLED;
 
-	set_or_clear_apicv_inhibit(&kvm->arch.apicv_inhibit_reasons, reason, true);
+	set_or_clear_apicv_inhibit(&kvm->arch.apicv_inhibit_reasons, reason, true, NULL);
 
 	init_rwsem(&kvm->arch.apicv_update_lock);
 }
@@ -10592,56 +10601,51 @@  static void kvm_vcpu_update_apicv(struct kvm_vcpu *vcpu)
 	__kvm_vcpu_update_apicv(vcpu);
 }
 
-void __kvm_set_or_clear_apicv_inhibit(struct kvm *kvm,
-				      enum kvm_apicv_inhibit reason, bool set)
-{
-	unsigned long old, new;
-
-	lockdep_assert_held_write(&kvm->arch.apicv_update_lock);
-
-	if (!(kvm_x86_ops.required_apicv_inhibits & BIT(reason)))
-		return;
-
-	old = new = kvm->arch.apicv_inhibit_reasons;
-
-	set_or_clear_apicv_inhibit(&new, reason, set);
-
-	if (!!old != !!new) {
-		/*
-		 * Kick all vCPUs before setting apicv_inhibit_reasons to avoid
-		 * false positives in the sanity check WARN in vcpu_enter_guest().
-		 * This task will wait for all vCPUs to ack the kick IRQ before
-		 * updating apicv_inhibit_reasons, and all other vCPUs will
-		 * block on acquiring apicv_update_lock so that vCPUs can't
-		 * redo vcpu_enter_guest() without seeing the new inhibit state.
-		 *
-		 * Note, holding apicv_update_lock and taking it in the read
-		 * side (handling the request) also prevents other vCPUs from
-		 * servicing the request with a stale apicv_inhibit_reasons.
-		 */
-		kvm_make_all_cpus_request(kvm, KVM_REQ_APICV_UPDATE);
-		kvm->arch.apicv_inhibit_reasons = new;
-		if (new) {
-			unsigned long gfn = gpa_to_gfn(APIC_DEFAULT_PHYS_BASE);
-			int idx = srcu_read_lock(&kvm->srcu);
-
-			kvm_zap_gfn_range(kvm, gfn, gfn+1);
-			srcu_read_unlock(&kvm->srcu, idx);
-		}
-	} else {
-		kvm->arch.apicv_inhibit_reasons = new;
-	}
-}
-
 void kvm_set_or_clear_apicv_inhibit(struct kvm *kvm,
 				    enum kvm_apicv_inhibit reason, bool set)
 {
-	if (!enable_apicv)
+	unsigned long old, new;
+
+	if (!enable_apicv || !(kvm_x86_ops.required_apicv_inhibits & BIT(reason)))
 		return;
 
-	down_write(&kvm->arch.apicv_update_lock);
-	__kvm_set_or_clear_apicv_inhibit(kvm, reason, set);
-	up_write(&kvm->arch.apicv_update_lock);
+	old = set_or_clear_apicv_inhibit(&kvm->arch.apicv_inhibit_reasons, reason, set, &new);
+
+	if (!old != !new) {
+		down_write(&kvm->arch.apicv_update_lock);
+
+		/*
+		 * Someone else may have updated the inhibit reasons and the flag
+		 * between the update above and taking the lock.  Confirm that the
+		 * state change is still needed before proceeding.
+		 */
+		new = atomic_read(&kvm->arch.apicv_inhibit_reasons);
+		if (!new != kvm->arch.apicv_activated) {
+			/*
+			 * Kick all vCPUs before setting apicv_activated to avoid false
+			 * positives in the sanity check WARN in vcpu_enter_guest().
+			 * This task will wait for all vCPUs to ack the kick IRQ before
+			 * updating apicv_activated, and all other vCPUs will block on
+			 * acquiring apicv_update_lock so that vCPUs can't redo
+			 * vcpu_enter_guest() without seeing the new inhibit state.
+			 *
+			 * Note, holding apicv_update_lock and taking it in the read side
+			 * (handling the request) also prevents other vCPUs from servicing
+			 * the request with a stale apicv_activated value.
+			 */
+			kvm_make_all_cpus_request(kvm, KVM_REQ_APICV_UPDATE);
+			kvm->arch.apicv_activated = !new;
+			if (new) {
+				unsigned long gfn = gpa_to_gfn(APIC_DEFAULT_PHYS_BASE);
+				int idx = srcu_read_lock(&kvm->srcu);
+
+				kvm_zap_gfn_range(kvm, gfn, gfn+1);
+				srcu_read_unlock(&kvm->srcu, idx);
+			}
+		}
+
+		up_write(&kvm->arch.apicv_update_lock);
+	}
 }
 EXPORT_SYMBOL_GPL(kvm_set_or_clear_apicv_inhibit);
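
The core of the patch is the same idea applied to a bitmask instead of a
counter: inhibit reasons are set and cleared with lock-free atomics, and
apicv_update_lock is taken (and the mask re-read) only when the mask toggles
between zero and nonzero.  Below is a minimal userspace sketch of that shape,
assuming C11 atomics and a pthread rwlock in place of atomic_t and the rwsem;
the names are illustrative, not KVM's API.

#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

static atomic_ulong inhibit_reasons;
static pthread_rwlock_t update_lock = PTHREAD_RWLOCK_INITIALIZER;
static bool activated = true;	/* protected by update_lock */

static void set_or_clear_inhibit(unsigned int reason, bool set)
{
	unsigned long bit = 1UL << reason, old, new;

	/* Lock-free fast path: update the mask and check for a toggle. */
	if (set) {
		old = atomic_fetch_or(&inhibit_reasons, bit);
		new = old | bit;
	} else {
		old = atomic_fetch_and(&inhibit_reasons, ~bit);
		new = old & ~bit;
	}
	if (!!old == !!new)
		return;	/* no zero <-> nonzero transition */

	/*
	 * Slow path: re-read the mask under the lock, as another update may
	 * have raced with ours, then reconcile the activation flag.
	 */
	pthread_rwlock_wrlock(&update_lock);
	new = atomic_load(&inhibit_reasons);
	if (!new != activated)
		activated = !new;
	pthread_rwlock_unlock(&update_lock);
}

int main(void)
{
	set_or_clear_inhibit(3, true);	/* 0 -> nonzero: takes the lock */
	set_or_clear_inhibit(5, true);	/* mask-only change: lock elided */
	set_or_clear_inhibit(3, false);	/* mask-only change: lock elided */
	set_or_clear_inhibit(5, false);	/* nonzero -> 0: takes the lock */
	printf("activated = %d\n", activated);
	return 0;
}

The re-read under the lock is what makes racing updates safe; the lock-free
fast path is what removes the contention described in the cover letter.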