diff mbox series

[RFC,v2] KVM: x86/vmx: Suppress posted interrupt notification when CPU is in host

Message ID 20220923085806.384344-1-chao.gao@intel.com (mailing list archive)
State New, archived
Headers show
Series [RFC,v2] KVM: x86/vmx: Suppress posted interrupt notification when CPU is in host | expand

Commit Message

Chao Gao Sept. 23, 2022, 8:58 a.m. UTC
PIN (Posted interrupt notification) is useless to host as KVM always syncs
pending guest interrupts in PID to guest's vAPIC before each VM entry. In
fact, Sending PINs to a CPU running in host will lead to additional
overhead due to interrupt handling.

Currently, software path, vmx_deliver_posted_interrupt(), is optimized to
issue PINs only if target vCPU is in IN_GUEST_MODE. But hardware paths
(VT-d and Intel IPI virtualization) aren't optimized.

Set PID.SN right after VM exits and clear it before VM entry to minimize
the chance of hardware issuing PINs to a CPU when it's in host.

Also honour PID.SN bit in vmx_deliver_posted_interrupt().

Opportunistically clean up vmx_vcpu_pi_put(); when a vCPU is preempted,
it is pointless to update PID.NV to wakeup vector since notification is
anyway suppressed. And since PID.SN should be already set for running
vCPUs, so, don't set it again for preempted vCPUs.

When IPI virtualization is enabled, this patch increases "perf bench" [*]
by 6.56%, and PIN count in 1 second drops from tens of thousands to
hundreds. But cpuid loop test shows this patch causes 1.58% overhead in
VM-exit round-trip latency.

[*] test cmd: perf bench sched pipe -T. Note that we change the source
code to pin two threads to two different vCPUs so that it can reproduce
stable results.

Signed-off-by: Chao Gao <chao.gao@intel.com>
---
RFC: I am not sure whether the benefits outweighs the extra VM-exit cost.

Changes in v2 (addressed comments from Kevin):
- measure/estimate the impact to non-IPC-intensive cases
- don't tie PID.SN to vcpu->mode. Instead, clear PID.SN
  right before VM-entry and set it after VM-exit.
- use !pi_is_pir_empty() in pi_enable_wakeup_handler() to
  check if any interrupt was posted after clearing SN.
- clean up vmx_vcpu_pi_put(). See comments above.

 arch/x86/kvm/lapic.c           |  2 ++
 arch/x86/kvm/vmx/posted_intr.c | 55 +++++++++++-----------------------
 arch/x86/kvm/vmx/vmx.c         | 34 ++++++++++++++++++++-
 3 files changed, 53 insertions(+), 38 deletions(-)


base-commit: 372d07084593dc7a399bf9bee815711b1fb1bcf2

Comments

Sean Christopherson Sept. 26, 2022, 4:19 p.m. UTC | #1
On Fri, Sep 23, 2022, Chao Gao wrote:
> Set PID.SN right after VM exits and clear it before VM entry to minimize
> the chance of hardware issuing PINs to a CPU when it's in host.

Toggling PID.SN as close to the world switch as possible is undesirable.  If KVM
re-enters the guest without enabling IRQs, i.e. handles the VM-Exit in the fastpath,
then the notification IRQ will be delivered in the guest.

The natural location to do the toggling is when KVM "toggles" software, i.e. when
KVM sets IN_GUEST_MODE (clear SN) and OUTSIDE_GUEST_MODE (set SN).

I believe that would also obviate the need to manually send a PI Notification IRQ,
as the existing ->sync_pir_to_irr() call that exists to handle exactly this case
(notification not sent or handled in host) would ensure any outstanding posted IRQ
gets moved to the IRR and processed accordingly.

> Opportunistically clean up vmx_vcpu_pi_put(); when a vCPU is preempted,

Uh uh, this patch is already way, way too subtle and complex to tack on clean up.
"Opportunistic" clean up is for cases where the clean up is a pure refactoring
and/or has zero impact on functionality.

> it is pointless to update PID.NV to wakeup vector since notification is
> anyway suppressed. And since PID.SN should be already set for running
> vCPUs, so, don't set it again for preempted vCPUs.

I'm pretty sure this is wrong.  If the vCPU is preempted between prepare_to_rcuwait()
and schedule(), then skipping pi_enable_wakeup_handler() will hang the guest if
the wakeup event is a posted IRQ and the event arrives while the vCPU is preempted.

> When IPI virtualization is enabled, this patch increases "perf bench" [*]
> by 6.56%, and PIN count in 1 second drops from tens of thousands to
> hundreds. But cpuid loop test shows this patch causes 1.58% overhead in
> VM-exit round-trip latency.

The overhead is more than likely due to pi_is_pir_empty() in the VM-Entry path,
i.e. should in theory go away if PID.SN is clear/set at IN_GUEST_MODE and
OUTSIDE_GUEST_MODE

> Also honour PID.SN bit in vmx_deliver_posted_interrupt().

Why?

> When IPI virtualization is enabled, this patch increases "perf bench" [*]
> by 6.56%, and PIN count in 1 second drops from tens of thousands to
> hundreds. But cpuid loop test shows this patch causes 1.58% overhead in
> VM-exit round-trip latency.
> 
> [*] test cmd: perf bench sched pipe -T. Note that we change the source
> code to pin two threads to two different vCPUs so that it can reproduce
> stable results.
> 
> Signed-off-by: Chao Gao <chao.gao@intel.com>
> ---
> RFC: I am not sure whether the benefits outweighs the extra VM-exit cost.
> 
> Changes in v2 (addressed comments from Kevin):
> - measure/estimate the impact to non-IPC-intensive cases
> - don't tie PID.SN to vcpu->mode. Instead, clear PID.SN
>   right before VM-entry and set it after VM-exit.

Ah, sorry, missed v1.  Rather than key off of IN_GUEST_MODE in the sync path, add
an explicit kvm_x86_ops hook to perform the transition.  I.e. make it explict.

> - use !pi_is_pir_empty() in pi_enable_wakeup_handler() to
>   check if any interrupt was posted after clearing SN.
> - clean up vmx_vcpu_pi_put(). See comments above.
> 
>  arch/x86/kvm/lapic.c           |  2 ++
>  arch/x86/kvm/vmx/posted_intr.c | 55 +++++++++++-----------------------
>  arch/x86/kvm/vmx/vmx.c         | 34 ++++++++++++++++++++-
>  3 files changed, 53 insertions(+), 38 deletions(-)
> 
> diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
> index 9dda989a1cf0..a9f27c6ce937 100644
> --- a/arch/x86/kvm/lapic.c
> +++ b/arch/x86/kvm/lapic.c
> @@ -128,7 +128,9 @@ static inline int __apic_test_and_clear_vector(int vec, void *bitmap)
>  }
>  
>  __read_mostly DEFINE_STATIC_KEY_DEFERRED_FALSE(apic_hw_disabled, HZ);
> +EXPORT_SYMBOL_GPL(apic_hw_disabled);
>  __read_mostly DEFINE_STATIC_KEY_DEFERRED_FALSE(apic_sw_disabled, HZ);
> +EXPORT_SYMBOL_GPL(apic_sw_disabled);

These exports will hopefully go away.

>  static inline int apic_enabled(struct kvm_lapic *apic)
>  {
> diff --git a/arch/x86/kvm/vmx/posted_intr.c b/arch/x86/kvm/vmx/posted_intr.c
> index 1b56c5e5c9fb..9cec3dd88f75 100644
> --- a/arch/x86/kvm/vmx/posted_intr.c
> +++ b/arch/x86/kvm/vmx/posted_intr.c
> @@ -70,12 +70,6 @@ void vmx_vcpu_pi_load(struct kvm_vcpu *vcpu, int cpu)
>  	 * needs to be changed.
>  	 */
>  	if (pi_desc->nv != POSTED_INTR_WAKEUP_VECTOR && vcpu->cpu == cpu) {
> -		/*
> -		 * Clear SN if it was set due to being preempted.  Again, do
> -		 * this even if there is no assigned device for simplicity.
> -		 */
> -		if (pi_test_and_clear_sn(pi_desc))
> -			goto after_clear_sn;
>  		return;
>  	}
>  
> @@ -101,11 +95,16 @@ void vmx_vcpu_pi_load(struct kvm_vcpu *vcpu, int cpu)
>  		new.control = old.control;
>  
>  		/*
> -		 * Clear SN (as above) and refresh the destination APIC ID to
> -		 * handle task migration (@cpu != vcpu->cpu).
> +		 * Set SN and refresh the destination APIC ID to handle
> +		 * task migration (@cpu != vcpu->cpu).
> +		 *
> +		 * SN is cleared when a vCPU goes to blocked state so that
> +		 * the blocked vCPU can be waken up on receiving a
> +		 * notification. For a running/runnable vCPU, such
> +		 * notifications are useless. Set SN bit to suppress them.
>  		 */
>  		new.ndst = dest;
> -		new.sn = 0;
> +		new.sn = 1;

To handle the preempted case, I believe the correct behavior is to leave SN
as-is, although that would require setting SN=1 during vCPU creation.  Arguably
KVM should do that anyways when APICv is enabled.

Hmm, or alternatively this should do the same?

		new.sn = !kvm_vcpu_is_blocking();

> @@ -172,8 +160,10 @@ static void pi_enable_wakeup_handler(struct kvm_vcpu *vcpu)
>  	 * enabled until it is safe to call try_to_wake_up() on the task being
>  	 * scheduled out).
>  	 */
> -	if (pi_test_on(&new))
> +	if (!pi_is_pir_empty(pi_desc)) {
> +		pi_set_on(pi_desc);

As much as I wish we could get rid of kvm_arch_vcpu_blocking(), I actually think
this would be a good application of that hook.  If PID.SN is cleared during
kvm_arch_vcpu_blocking() and set during kvm_arch_vcpu_unblocking(), then I believe
there's no need to manually check the PIR here, as any IRQ that isn't detected by
kvm_vcpu_check_block() is guaranteed to set PID.ON=1.

>  		apic->send_IPI_self(POSTED_INTR_WAKEUP_VECTOR);
> +	}
>  
>  	local_irq_restore(flags);
>  }
> @@ -193,21 +183,12 @@ static bool vmx_needs_pi_wakeup(struct kvm_vcpu *vcpu)
>  
>  void vmx_vcpu_pi_put(struct kvm_vcpu *vcpu)
>  {
> -	struct pi_desc *pi_desc = vcpu_to_pi_desc(vcpu);
> -
>  	if (!vmx_needs_pi_wakeup(vcpu))
>  		return;
>  
> -	if (kvm_vcpu_is_blocking(vcpu) && !vmx_interrupt_blocked(vcpu))
> +	if (!vcpu->preempted && kvm_vcpu_is_blocking(vcpu) &&

As above, I don't think skipping this for preempted is correct.

> +	    !vmx_interrupt_blocked(vcpu))
>  		pi_enable_wakeup_handler(vcpu);
> -
> -	/*
> -	 * Set SN when the vCPU is preempted.  Note, the vCPU can both be seen
> -	 * as blocking and preempted, e.g. if it's preempted between setting
> -	 * its wait state and manually scheduling out.
> -	 */

Assuming I'm correct, hoist the "note" above to document that KVM needs to enable
the wakeup handler even when the vCPU is preempted.

> -	if (vcpu->preempted)
> -		pi_set_sn(pi_desc);
>  }
>  
>  /*
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index c9b49a09e6b5..949fb80eca3d 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -4186,6 +4186,9 @@ static int vmx_deliver_posted_interrupt(struct kvm_vcpu *vcpu, int vector)
>  	if (pi_test_and_set_pir(vector, &vmx->pi_desc))
>  		return 0;
>  
> +	if (pi_test_sn(&vmx->pi_desc))
> +		return 0;

Should be unnecessary.
Chao Gao Sept. 27, 2022, 6:32 a.m. UTC | #2
On Mon, Sep 26, 2022 at 04:19:52PM +0000, Sean Christopherson wrote:
>On Fri, Sep 23, 2022, Chao Gao wrote:
>> Set PID.SN right after VM exits and clear it before VM entry to minimize
>> the chance of hardware issuing PINs to a CPU when it's in host.
>
>Toggling PID.SN as close to the world switch as possible is undesirable.  If KVM
>re-enters the guest without enabling IRQs, i.e. handles the VM-Exit in the fastpath,
>then the notification IRQ will be delivered in the guest.
>
>The natural location to do the toggling is when KVM "toggles" software, i.e. when
>KVM sets IN_GUEST_MODE (clear SN) and OUTSIDE_GUEST_MODE (set SN).

This makes sense to me.

>
>I believe that would also obviate the need to manually send a PI Notification IRQ,
>as the existing ->sync_pir_to_irr() call that exists to handle exactly this case
>(notification not sent or handled in host) would ensure any outstanding posted IRQ
>gets moved to the IRR and processed accordingly.
>
>> Opportunistically clean up vmx_vcpu_pi_put(); when a vCPU is preempted,
>
>Uh uh, this patch is already way, way too subtle and complex to tack on clean up.
>"Opportunistic" clean up is for cases where the clean up is a pure refactoring
>and/or has zero impact on functionality.

Got it. Will move this cleanup to a separate patch if it is still needed.

>
>> it is pointless to update PID.NV to wakeup vector since notification is
>> anyway suppressed. And since PID.SN should be already set for running
>> vCPUs, so, don't set it again for preempted vCPUs.
>
>I'm pretty sure this is wrong.  If the vCPU is preempted between prepare_to_rcuwait()
>and schedule(), then skipping pi_enable_wakeup_handler() will hang the guest if
>the wakeup event is a posted IRQ and the event arrives while the vCPU is preempted.

Thanks for pointing out this subtle case.

My understanding is finally there will be a call of vmx_vcpu_pi_put()
with preempted=false (I assume that preempted vCPUs will be scheduled
at some later point). In that case, pi_enable_wakeup_handler() can wake
up the vCPU by sending a self-ipi. Plus this patch checks PIR instead of
ON bit, I don't get why the guest will hang.

>
>> When IPI virtualization is enabled, this patch increases "perf bench" [*]
>> by 6.56%, and PIN count in 1 second drops from tens of thousands to
>> hundreds. But cpuid loop test shows this patch causes 1.58% overhead in
>> VM-exit round-trip latency.
>
>The overhead is more than likely due to pi_is_pir_empty() in the VM-Entry path,
>i.e. should in theory go away if PID.SN is clear/set at IN_GUEST_MODE and
>OUTSIDE_GUEST_MODE

I will collect perf data after implementing what you suggested.

>
>> Also honour PID.SN bit in vmx_deliver_posted_interrupt().
>
>Why?

VT-d hardware doesn't set ON bit if SN bit is set.

Enforce the same rule in KVM can skip unnecessary work, like the
following pi_test_and_set_on() and kvm_vcpu_trigger_posted_interrupt().

>
>> When IPI virtualization is enabled, this patch increases "perf bench" [*]
>> by 6.56%, and PIN count in 1 second drops from tens of thousands to
>> hundreds. But cpuid loop test shows this patch causes 1.58% overhead in
>> VM-exit round-trip latency.
>> 
>> [*] test cmd: perf bench sched pipe -T. Note that we change the source
>> code to pin two threads to two different vCPUs so that it can reproduce
>> stable results.
>> 
>> Signed-off-by: Chao Gao <chao.gao@intel.com>
>> ---
>> RFC: I am not sure whether the benefits outweighs the extra VM-exit cost.
>> 
>> Changes in v2 (addressed comments from Kevin):
>> - measure/estimate the impact to non-IPC-intensive cases
>> - don't tie PID.SN to vcpu->mode. Instead, clear PID.SN
>>   right before VM-entry and set it after VM-exit.
>
>Ah, sorry, missed v1.  Rather than key off of IN_GUEST_MODE in the sync path, add
>an explicit kvm_x86_ops hook to perform the transition.  I.e. make it explict.

It is ok to add a separate hook. But the question is how to coordinate clearing
SN with ->sync_pir_to_irr(). Clearing SN bit may put PIR in a state where ON/SN
are cleared but some outstanding IRQs left there. Current ->sync_pir_to_irr()
doesn't sync those IRQs to IRR in this case. Here are two options to fix the
problem:

1) clear SN with the new hook, then set ON bit if there is any outstanding IRQ.
No change to ->sync_pir_to_irr() is needed.

2) clear SN with the new hook, add a force mode to ->sync_pir_to_irr() where
PIR is synced to IRR regardless of ON/SN bits, inovke ->sync_pir_to_irr()
on VM-entry path with force_mode=true.

Both may lead to an extra check of PIR.

>> @@ -101,11 +95,16 @@ void vmx_vcpu_pi_load(struct kvm_vcpu *vcpu, int cpu)
>>  		new.control = old.control;
>>  
>>  		/*
>> -		 * Clear SN (as above) and refresh the destination APIC ID to
>> -		 * handle task migration (@cpu != vcpu->cpu).
>> +		 * Set SN and refresh the destination APIC ID to handle
>> +		 * task migration (@cpu != vcpu->cpu).
>> +		 *
>> +		 * SN is cleared when a vCPU goes to blocked state so that
>> +		 * the blocked vCPU can be waken up on receiving a
>> +		 * notification. For a running/runnable vCPU, such
>> +		 * notifications are useless. Set SN bit to suppress them.
>>  		 */
>>  		new.ndst = dest;
>> -		new.sn = 0;
>> +		new.sn = 1;
>
>To handle the preempted case, I believe the correct behavior is to leave SN
>as-is, although that would require setting SN=1 during vCPU creation.  Arguably
>KVM should do that anyways when APICv is enabled.
>
>Hmm, or alternatively this should do the same?
>
>		new.sn = !kvm_vcpu_is_blocking();

I don't get this. Probably I am misunderstanding something about vCPU preemption.

>
>> @@ -172,8 +160,10 @@ static void pi_enable_wakeup_handler(struct kvm_vcpu *vcpu)
>>  	 * enabled until it is safe to call try_to_wake_up() on the task being
>>  	 * scheduled out).
>>  	 */
>> -	if (pi_test_on(&new))
>> +	if (!pi_is_pir_empty(pi_desc)) {
>> +		pi_set_on(pi_desc);
>
>As much as I wish we could get rid of kvm_arch_vcpu_blocking(), I actually think
>this would be a good application of that hook.  If PID.SN is cleared during
>kvm_arch_vcpu_blocking() and set during kvm_arch_vcpu_unblocking(), then I believe
>there's no need to manually check the PIR here, as any IRQ that isn't detected by
>kvm_vcpu_check_block() is guaranteed to set PID.ON=1.

Using kvm_arch_vcpu_blocking() has the same problem as using a new hook
for the VM-entry path: we need a force mode for ->sync_pir_to_irr() or
set ON bit if there is any outstanding IRQ right after clearing SN

The former may help performance a little but since the call of
->sync_pir_to_irr() in kvm_vcpu_check_block() is so far away from the
place where SN is cleared, I think this would be a source of bugs.

The latter has no benefit compared to what this patch does here.
Sean Christopherson Sept. 27, 2022, 9:43 p.m. UTC | #3
On Tue, Sep 27, 2022, Chao Gao wrote:
> On Mon, Sep 26, 2022 at 04:19:52PM +0000, Sean Christopherson wrote:
> >On Fri, Sep 23, 2022, Chao Gao wrote:
> >> it is pointless to update PID.NV to wakeup vector since notification is
> >> anyway suppressed. And since PID.SN should be already set for running
> >> vCPUs, so, don't set it again for preempted vCPUs.
> >
> >I'm pretty sure this is wrong.  If the vCPU is preempted between prepare_to_rcuwait()
> >and schedule(), then skipping pi_enable_wakeup_handler() will hang the guest if
> >the wakeup event is a posted IRQ and the event arrives while the vCPU is preempted.
> 
> Thanks for pointing out this subtle case.
> 
> My understanding is finally there will be a call of vmx_vcpu_pi_put()
> with preempted=false (I assume that preempted vCPUs will be scheduled
> at some later point). In that case, pi_enable_wakeup_handler() can wake
> up the vCPU by sending a self-ipi. Plus this patch checks PIR instead of
> ON bit, I don't get why the guest will hang.

Ah, I forgot about the addition of the pi_is_pir_empty() check.  I think I was
hoping/assuming that check would go away when I wrote the above.

> >> When IPI virtualization is enabled, this patch increases "perf bench" [*]
> >> by 6.56%, and PIN count in 1 second drops from tens of thousands to
> >> hundreds. But cpuid loop test shows this patch causes 1.58% overhead in
> >> VM-exit round-trip latency.
> >
> >The overhead is more than likely due to pi_is_pir_empty() in the VM-Entry path,
> >i.e. should in theory go away if PID.SN is clear/set at IN_GUEST_MODE and
> >OUTSIDE_GUEST_MODE
> 
> I will collect perf data after implementing what you suggested.
> 
> >
> >> Also honour PID.SN bit in vmx_deliver_posted_interrupt().
> >
> >Why?
> 
> VT-d hardware doesn't set ON bit if SN bit is set.
> 
> Enforce the same rule in KVM can skip unnecessary work, like the
> following pi_test_and_set_on() and kvm_vcpu_trigger_posted_interrupt().

But in all likelihood it's a net negative.  IMO, ideally posted interrupts would
set ON=1 even if SN=1 (or have another "PI pending" bit if necessary).

In the optimal case, where the vCPU is IN_GUEST_MODE, it's completely unnecessary.
If IN_GUEST_MODE isn't set, then setting ON=1 is likely a net positive, because
it will allow KVM to avoid checking all of PIR when clearing SN, i.e. KVM can
optimize those paths to:

	if (!pi_test_on(&vmx->pi_desc) && !pi_is_pir_empty(&vmx->pi_desc))
		pi_set_on(&vmx->pi_desc);
	    
> >> When IPI virtualization is enabled, this patch increases "perf bench" [*]
> >> by 6.56%, and PIN count in 1 second drops from tens of thousands to
> >> hundreds. But cpuid loop test shows this patch causes 1.58% overhead in
> >> VM-exit round-trip latency.
> >> 
> >> [*] test cmd: perf bench sched pipe -T. Note that we change the source
> >> code to pin two threads to two different vCPUs so that it can reproduce
> >> stable results.
> >> 
> >> Signed-off-by: Chao Gao <chao.gao@intel.com>
> >> ---
> >> RFC: I am not sure whether the benefits outweighs the extra VM-exit cost.
> >> 
> >> Changes in v2 (addressed comments from Kevin):
> >> - measure/estimate the impact to non-IPC-intensive cases
> >> - don't tie PID.SN to vcpu->mode. Instead, clear PID.SN
> >>   right before VM-entry and set it after VM-exit.
> >
> >Ah, sorry, missed v1.  Rather than key off of IN_GUEST_MODE in the sync path, add
> >an explicit kvm_x86_ops hook to perform the transition.  I.e. make it explict.
> 
> It is ok to add a separate hook. But the question is how to coordinate clearing
> SN with ->sync_pir_to_irr(). Clearing SN bit may put PIR in a state where ON/SN
> are cleared but some outstanding IRQs left there. Current ->sync_pir_to_irr()
> doesn't sync those IRQs to IRR in this case.
>
> Here are two options to fix the problem:
> 
> 1) clear SN with the new hook, then set ON bit if there is any outstanding IRQ.
> No change to ->sync_pir_to_irr() is needed.
> 
> 2) clear SN with the new hook, add a force mode to ->sync_pir_to_irr() where
> PIR is synced to IRR regardless of ON/SN bits, inovke ->sync_pir_to_irr()
> on VM-entry path with force_mode=true.
>
> Both may lead to an extra check of PIR.

  3) Unconditionally set ON when clearing SN in the VM-Enter path.

#3 it's a little gross, but it would avoid the "extra" pi_is_pir_empty().

I don't like #2 because it bleeds the logic into common x86, whereas #1 and #3
keep everything in the PI code.

My vote would be #1, and only do #3 if the overhead of #1 is really truly necessary
for performance reasons.  #1 effectively optimizes for not having a pending posted
IRQ, while #3 optimizes for having a pending posted IRQ _without ON=1.  And #1 can
be optimized to scrub the PIR if and only if ON=0.

More importantly, if we go with #1, then we can use the same hook for the
->vcpu_blocking() hook.  More below.

> >> @@ -101,11 +95,16 @@ void vmx_vcpu_pi_load(struct kvm_vcpu *vcpu, int cpu)
> >>  		new.control = old.control;
> >>  
> >>  		/*
> >> -		 * Clear SN (as above) and refresh the destination APIC ID to
> >> -		 * handle task migration (@cpu != vcpu->cpu).
> >> +		 * Set SN and refresh the destination APIC ID to handle
> >> +		 * task migration (@cpu != vcpu->cpu).
> >> +		 *
> >> +		 * SN is cleared when a vCPU goes to blocked state so that
> >> +		 * the blocked vCPU can be waken up on receiving a
> >> +		 * notification. For a running/runnable vCPU, such
> >> +		 * notifications are useless. Set SN bit to suppress them.
> >>  		 */
> >>  		new.ndst = dest;
> >> -		new.sn = 0;
> >> +		new.sn = 1;
> >
> >To handle the preempted case, I believe the correct behavior is to leave SN
> >as-is, although that would require setting SN=1 during vCPU creation.  Arguably
> >KVM should do that anyways when APICv is enabled.
> >
> >Hmm, or alternatively this should do the same?
> >
> >		new.sn = !kvm_vcpu_is_blocking();
> 
> I don't get this. Probably I am misunderstanding something about vCPU preemption.

Ignore the this, I was deep down a path of stuff that I'm pretty sure ends up being
irrelevant.

> >> @@ -172,8 +160,10 @@ static void pi_enable_wakeup_handler(struct kvm_vcpu *vcpu)
> >>  	 * enabled until it is safe to call try_to_wake_up() on the task being
> >>  	 * scheduled out).
> >>  	 */
> >> -	if (pi_test_on(&new))
> >> +	if (!pi_is_pir_empty(pi_desc)) {
> >> +		pi_set_on(pi_desc);
> >
> >As much as I wish we could get rid of kvm_arch_vcpu_blocking(), I actually think
> >this would be a good application of that hook.  If PID.SN is cleared during
> >kvm_arch_vcpu_blocking() and set during kvm_arch_vcpu_unblocking(), then I believe
> >there's no need to manually check the PIR here, as any IRQ that isn't detected by
> >kvm_vcpu_check_block() is guaranteed to set PID.ON=1.
> 
> Using kvm_arch_vcpu_blocking() has the same problem as using a new hook
> for the VM-entry path: we need a force mode for ->sync_pir_to_irr() or
> set ON bit if there is any outstanding IRQ right after clearing SN
>
> The former may help performance a little but since the call of
> ->sync_pir_to_irr() in kvm_vcpu_check_block() is so far away from the
> place where SN is cleared, I think this would be a source of bugs.
> 
> The latter has no benefit compared to what this patch does here.

The benefits I see are:

  1. The code is very explicit.  When clearing SN, check PIR and set ON=1 to
     ensure IRQs aren't ignored.

  2. pi_enable_wakeup_handler() is only responsible for changing the vector,
     i.e. there's no clearing of SN "hidden" in the CMPXCHG loop.

  3. Same for vmx_vcpu_pi_load(), it's only responsible for updating the pCPU
     and the vector, it doesn't touch SN.

  4. The logic is common to all paths that clear SN, i.e. the same helper can
     be used for both VM-Enter and vCPU blocking.

E.g. the VMX hook for both VM-Enter and vCPU blocking could be:

  static void vmx_no_idea_what_to_call_this(struct kvm_vcpu *vcpu)
  {
	pi_clear_sn(&vmx->pi_desc);

	/*
	 * Clear SN before reading the bitmap.  The VT-d firmware writes the
	 * bitmap and reads SN atomically (5.2.3 in the spec), so it doesn't
	 * really have a memory barrier that pairs with this, but we cannot do
	 * that and we need one.
	 */
	smp_mb__after_atomic();

	/* blah blah blah */
	if (!pi_test_on(&vmx->pi_desc) && !pi_is_pir_empty(&vmx->pi_desc))
		pi_set_on(&vmx->pi_desc);
  }

One related thought thing I want to capture:

The proposed approach is having vmx_sync_pir_to_irr() check PIR if ON=1 || SN=1
is effectively _required_ to avoid breaking halt-polling.  I was thinking we could
keep that path optimized to check only ON=1, but with that approach, KVM won't detect
the pending IRQ until it actually starts to block the vCPU, i.e. until it clears SN
and manually checks the PIR.  Clearing SN in kvm_arch_vcpu_pending() would be better
than waiting until the vCPU is "put", but even that is too late from a halt-polling
perspective.

A comment in vmx_sync_pir_to_irr() is likely needed to capture that.
Tian, Kevin Sept. 28, 2022, 8:29 a.m. UTC | #4
> From: Gao, Chao <chao.gao@intel.com>
> Sent: Friday, September 23, 2022 4:58 PM
> 
> PIN (Posted interrupt notification) is useless to host as KVM always syncs
> pending guest interrupts in PID to guest's vAPIC before each VM entry. In
> fact, Sending PINs to a CPU running in host will lead to additional
> overhead due to interrupt handling.
> 
> Currently, software path, vmx_deliver_posted_interrupt(), is optimized to
> issue PINs only if target vCPU is in IN_GUEST_MODE. But hardware paths
> (VT-d and Intel IPI virtualization) aren't optimized.
> 
> Set PID.SN right after VM exits and clear it before VM entry to minimize
> the chance of hardware issuing PINs to a CPU when it's in host.
> 
> Also honour PID.SN bit in vmx_deliver_posted_interrupt().
> 
> Opportunistically clean up vmx_vcpu_pi_put(); when a vCPU is preempted,
> it is pointless to update PID.NV to wakeup vector since notification is
> anyway suppressed. And since PID.SN should be already set for running
> vCPUs, so, don't set it again for preempted vCPUs.
> 
> When IPI virtualization is enabled, this patch increases "perf bench" [*]
> by 6.56%, and PIN count in 1 second drops from tens of thousands to
> hundreds. But cpuid loop test shows this patch causes 1.58% overhead in
> VM-exit round-trip latency.
> 
> [*] test cmd: perf bench sched pipe -T. Note that we change the source
> code to pin two threads to two different vCPUs so that it can reproduce
> stable results.
> 
> Signed-off-by: Chao Gao <chao.gao@intel.com>
> ---
> RFC: I am not sure whether the benefits outweighs the extra VM-exit cost.
> 
> Changes in v2 (addressed comments from Kevin):
> - measure/estimate the impact to non-IPC-intensive cases
> - don't tie PID.SN to vcpu->mode. Instead, clear PID.SN
>   right before VM-entry and set it after VM-exit.

One correction here. My comment in v1 [1] was actually close to Sean's
suggestion, i.e. opposite to above description:

--
  I wonder whether it makes more sense to have 'sn' closely sync-ed
  with vcpu->mode, e.g. having a kvm_x86_set_vcpu_mode() ops
  to translate vcpu->mode into vmx/svm specific hardware bits like
  'sn' here. Then call it in common place when vcpu->mode is changed.
--

[1] https://lore.kernel.org/lkml/BN9PR11MB52766B74ADFBAEC0AA205E298CB39@BN9PR11MB5276.namprd11.prod.outlook.com/
Chao Gao Sept. 28, 2022, 10:56 a.m. UTC | #5
>> Changes in v2 (addressed comments from Kevin):
>> - measure/estimate the impact to non-IPC-intensive cases
>> - don't tie PID.SN to vcpu->mode. Instead, clear PID.SN
>>   right before VM-entry and set it after VM-exit.
>
>One correction here. My comment in v1 [1] was actually close to Sean's
>suggestion, i.e. opposite to above description:

Hi Kevin,

Yes. The changelog is misleading. You suggested using a dedicate hook.
And I indeed agreed to follow the suggestion. But as you said, what v2
does is the complete opposite of the suggestion.

The reason is when I started v2 recently, the idea of clearing SN right
before VM-entry came to my mind. Since it could also solve some problems
of v1 without a hook (hence, more self-contained), I thought it was
slightly better. But I missed that clearing SN right before VM-entry can
cause unnecessary SN toggling to VM-exit fast-path.
Chao Gao Sept. 28, 2022, 11:26 a.m. UTC | #6
On Tue, Sep 27, 2022 at 09:43:08PM +0000, Sean Christopherson wrote:
>The benefits I see are:
>
>  1. The code is very explicit.  When clearing SN, check PIR and set ON=1 to
>     ensure IRQs aren't ignored.
>
>  2. pi_enable_wakeup_handler() is only responsible for changing the vector,
>     i.e. there's no clearing of SN "hidden" in the CMPXCHG loop.
>
>  3. Same for vmx_vcpu_pi_load(), it's only responsible for updating the pCPU
>     and the vector, it doesn't touch SN.
>
>  4. The logic is common to all paths that clear SN, i.e. the same helper can
>     be used for both VM-Enter and vCPU blocking.
>
>E.g. the VMX hook for both VM-Enter and vCPU blocking could be:
>
>  static void vmx_no_idea_what_to_call_this(struct kvm_vcpu *vcpu)
>  {
>	pi_clear_sn(&vmx->pi_desc);
>
>	/*
>	 * Clear SN before reading the bitmap.  The VT-d firmware writes the
>	 * bitmap and reads SN atomically (5.2.3 in the spec), so it doesn't
>	 * really have a memory barrier that pairs with this, but we cannot do
>	 * that and we need one.
>	 */
>	smp_mb__after_atomic();
>
>	/* blah blah blah */
>	if (!pi_test_on(&vmx->pi_desc) && !pi_is_pir_empty(&vmx->pi_desc))
>		pi_set_on(&vmx->pi_desc);
>  }
>
>One related thought thing I want to capture:
>
>The proposed approach is having vmx_sync_pir_to_irr() check PIR if ON=1 || SN=1
>is effectively _required_ to avoid breaking halt-polling.  I was thinking we could
>keep that path optimized to check only ON=1, but with that approach, KVM won't detect
>the pending IRQ until it actually starts to block the vCPU, i.e. until it clears SN
>and manually checks the PIR.  Clearing SN in kvm_arch_vcpu_pending() would be better
>than waiting until the vCPU is "put", but even that is too late from a halt-polling
>perspective.
>
>A comment in vmx_sync_pir_to_irr() is likely needed to capture that.

All your comments make sense to me. I just posted v3 with all issues of v2
addressed.
diff mbox series

Patch

diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index 9dda989a1cf0..a9f27c6ce937 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -128,7 +128,9 @@  static inline int __apic_test_and_clear_vector(int vec, void *bitmap)
 }
 
 __read_mostly DEFINE_STATIC_KEY_DEFERRED_FALSE(apic_hw_disabled, HZ);
+EXPORT_SYMBOL_GPL(apic_hw_disabled);
 __read_mostly DEFINE_STATIC_KEY_DEFERRED_FALSE(apic_sw_disabled, HZ);
+EXPORT_SYMBOL_GPL(apic_sw_disabled);
 
 static inline int apic_enabled(struct kvm_lapic *apic)
 {
diff --git a/arch/x86/kvm/vmx/posted_intr.c b/arch/x86/kvm/vmx/posted_intr.c
index 1b56c5e5c9fb..9cec3dd88f75 100644
--- a/arch/x86/kvm/vmx/posted_intr.c
+++ b/arch/x86/kvm/vmx/posted_intr.c
@@ -70,12 +70,6 @@  void vmx_vcpu_pi_load(struct kvm_vcpu *vcpu, int cpu)
 	 * needs to be changed.
 	 */
 	if (pi_desc->nv != POSTED_INTR_WAKEUP_VECTOR && vcpu->cpu == cpu) {
-		/*
-		 * Clear SN if it was set due to being preempted.  Again, do
-		 * this even if there is no assigned device for simplicity.
-		 */
-		if (pi_test_and_clear_sn(pi_desc))
-			goto after_clear_sn;
 		return;
 	}
 
@@ -101,11 +95,16 @@  void vmx_vcpu_pi_load(struct kvm_vcpu *vcpu, int cpu)
 		new.control = old.control;
 
 		/*
-		 * Clear SN (as above) and refresh the destination APIC ID to
-		 * handle task migration (@cpu != vcpu->cpu).
+		 * Set SN and refresh the destination APIC ID to handle
+		 * task migration (@cpu != vcpu->cpu).
+		 *
+		 * SN is cleared when a vCPU goes to blocked state so that
+		 * the blocked vCPU can be waken up on receiving a
+		 * notification. For a running/runnable vCPU, such
+		 * notifications are useless. Set SN bit to suppress them.
 		 */
 		new.ndst = dest;
-		new.sn = 0;
+		new.sn = 1;
 
 		/*
 		 * Restore the notification vector; in the blocking case, the
@@ -115,19 +114,6 @@  void vmx_vcpu_pi_load(struct kvm_vcpu *vcpu, int cpu)
 	} while (pi_try_set_control(pi_desc, &old.control, new.control));
 
 	local_irq_restore(flags);
-
-after_clear_sn:
-
-	/*
-	 * Clear SN before reading the bitmap.  The VT-d firmware
-	 * writes the bitmap and reads SN atomically (5.2.3 in the
-	 * spec), so it doesn't really have a memory barrier that
-	 * pairs with this, but we cannot do that and we need one.
-	 */
-	smp_mb__after_atomic();
-
-	if (!pi_is_pir_empty(pi_desc))
-		pi_set_on(pi_desc);
 }
 
 static bool vmx_can_use_vtd_pi(struct kvm *kvm)
@@ -155,13 +141,15 @@  static void pi_enable_wakeup_handler(struct kvm_vcpu *vcpu)
 		      &per_cpu(wakeup_vcpus_on_cpu, vcpu->cpu));
 	raw_spin_unlock(&per_cpu(wakeup_vcpus_on_cpu_lock, vcpu->cpu));
 
-	WARN(pi_desc->sn, "PI descriptor SN field set before blocking");
-
 	old.control = READ_ONCE(pi_desc->control);
 	do {
-		/* set 'NV' to 'wakeup vector' */
+		/*
+		 * set 'NV' to 'wakeup vector' and clear SN bit so that
+		 * the blocked vCPU can be waken up on receiving interrupts.
+		 */
 		new.control = old.control;
 		new.nv = POSTED_INTR_WAKEUP_VECTOR;
+		new.sn = 0;
 	} while (pi_try_set_control(pi_desc, &old.control, new.control));
 
 	/*
@@ -172,8 +160,10 @@  static void pi_enable_wakeup_handler(struct kvm_vcpu *vcpu)
 	 * enabled until it is safe to call try_to_wake_up() on the task being
 	 * scheduled out).
 	 */
-	if (pi_test_on(&new))
+	if (!pi_is_pir_empty(pi_desc)) {
+		pi_set_on(pi_desc);
 		apic->send_IPI_self(POSTED_INTR_WAKEUP_VECTOR);
+	}
 
 	local_irq_restore(flags);
 }
@@ -193,21 +183,12 @@  static bool vmx_needs_pi_wakeup(struct kvm_vcpu *vcpu)
 
 void vmx_vcpu_pi_put(struct kvm_vcpu *vcpu)
 {
-	struct pi_desc *pi_desc = vcpu_to_pi_desc(vcpu);
-
 	if (!vmx_needs_pi_wakeup(vcpu))
 		return;
 
-	if (kvm_vcpu_is_blocking(vcpu) && !vmx_interrupt_blocked(vcpu))
+	if (!vcpu->preempted && kvm_vcpu_is_blocking(vcpu) &&
+	    !vmx_interrupt_blocked(vcpu))
 		pi_enable_wakeup_handler(vcpu);
-
-	/*
-	 * Set SN when the vCPU is preempted.  Note, the vCPU can both be seen
-	 * as blocking and preempted, e.g. if it's preempted between setting
-	 * its wait state and manually scheduling out.
-	 */
-	if (vcpu->preempted)
-		pi_set_sn(pi_desc);
 }
 
 /*
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index c9b49a09e6b5..949fb80eca3d 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -4186,6 +4186,9 @@  static int vmx_deliver_posted_interrupt(struct kvm_vcpu *vcpu, int vector)
 	if (pi_test_and_set_pir(vector, &vmx->pi_desc))
 		return 0;
 
+	if (pi_test_sn(&vmx->pi_desc))
+		return 0;
+
 	/* If a previous notification has sent the IPI, nothing to do.  */
 	if (pi_test_and_set_on(&vmx->pi_desc))
 		return 0;
@@ -6706,7 +6709,7 @@  static int vmx_sync_pir_to_irr(struct kvm_vcpu *vcpu)
 	if (KVM_BUG_ON(!enable_apicv, vcpu->kvm))
 		return -EIO;
 
-	if (pi_test_on(&vmx->pi_desc)) {
+	if (pi_test_on(&vmx->pi_desc) || pi_test_sn(&vmx->pi_desc)) {
 		pi_clear_on(&vmx->pi_desc);
 		/*
 		 * IOMMU can write to PID.ON, so the barrier matters even on UP.
@@ -7187,11 +7190,40 @@  static fastpath_t vmx_vcpu_run(struct kvm_vcpu *vcpu)
 	if (enable_preemption_timer)
 		vmx_update_hv_timer(vcpu);
 
+	/* do this even if apicv is disabled for simplicity */
+	if (kvm_lapic_enabled(vcpu)) {
+		pi_clear_sn(&vmx->pi_desc);
+		/*
+		 * Clear SN before reading the bitmap.  The VT-d firmware
+		 * writes the bitmap and reads SN atomically (5.2.3 in the
+		 * spec), so it doesn't really have a memory barrier that
+		 * pairs with this, but we cannot do that and we need one.
+		 */
+		smp_mb__after_atomic();
+
+		if (!pi_is_pir_empty(&vmx->pi_desc)) {
+			pi_set_on(&vmx->pi_desc);
+			apic->send_IPI_self(POSTED_INTR_VECTOR);
+		}
+	}
+
 	kvm_wait_lapic_expire(vcpu);
 
 	/* The actual VMENTER/EXIT is in the .noinstr.text section. */
 	vmx_vcpu_enter_exit(vcpu, vmx, __vmx_vcpu_run_flags(vmx));
 
+	/*
+	 * Suppress notification right after VM exits to minimize the
+	 * window where VT-d or remote CPU may send a useless notification
+	 * when posting interrupts to a VM. Note that the notification is
+	 * useless because KVM syncs pending interrupts in PID.IRR to vAPIC
+	 * IRR before VM entry.
+
+	 * do this even if apicv is disabled for simplicity.
+	 */
+	if (kvm_lapic_enabled(vcpu))
+		pi_set_sn(&vmx->pi_desc);
+
 	/* All fields are clean at this point */
 	if (static_branch_unlikely(&enable_evmcs)) {
 		current_evmcs->hv_clean_fields |=