
[6/6] kvm: x86: do not use KVM_REQ_EVENT for APICv interrupt injection

Message ID 1482164232-130035-7-git-send-email-pbonzini@redhat.com (mailing list archive)
State New, archived

Commit Message

Paolo Bonzini Dec. 19, 2016, 4:17 p.m. UTC
Since bf9f6ac8d749 ("KVM: Update Posted-Interrupts Descriptor when vCPU
is blocked", 2015-09-18) the posted interrupt descriptor is checked
unconditionally for PIR.ON.  Therefore we don't need KVM_REQ_EVENT to
trigger the scan and, if NMIs or SMIs are not involved, we can avoid
the complicated event injection path.

Calling kvm_vcpu_kick if PIR.ON=1 is also useless, though it has been
there since APICv was introduced.

However, without the KVM_REQ_EVENT safety net KVM needs to be much
more careful about races between vmx_deliver_posted_interrupt and
vcpu_enter_guest.  First, the IPI for posted interrupts may be issued
between setting vcpu->mode = IN_GUEST_MODE and disabling interrupts.
If that happens, kvm_trigger_posted_interrupt returns true, but
smp_kvm_posted_intr_ipi doesn't do anything about it.  The guest is
entered with PIR.ON, but the posted interrupt IPI has not been sent
and the interrupt is only delivered to the guest on the next vmentry
(if any).  To fix this, disable interrupts before setting vcpu->mode.
This ensures that the IPI is delayed until the guest enters non-root mode;
it is then trapped by the processor causing the interrupt to be injected.

Second, the IPI may be issued between

                        kvm_x86_ops->hwapic_irr_update(vcpu,
                                kvm_lapic_find_highest_irr(vcpu));

and vcpu->mode = IN_GUEST_MODE.  In this case, kvm_vcpu_kick is called
but it (correctly) doesn't do anything because it sees vcpu->mode ==
OUTSIDE_GUEST_MODE.  Again, the guest is entered with PIR.ON but no
posted interrupt IPI is pending; this time, the fix for this is to move
the RVI update after IN_GUEST_MODE.

Both issues were previously masked by the liberal usage of KVM_REQ_EVENT.
In both race scenarios KVM_REQ_EVENT would cancel guest entry, resulting
in another vmentry which would inject the interrupt.

This saves about 300 cycles on the self_ipi_* tests of vmexit.flat.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 arch/x86/kvm/lapic.c | 11 ++++-------
 arch/x86/kvm/vmx.c   |  8 +++++---
 arch/x86/kvm/x86.c   | 44 +++++++++++++++++++++++++-------------------
 3 files changed, 34 insertions(+), 29 deletions(-)

Comments

Radim Krčmář Feb. 7, 2017, 7:58 p.m. UTC | #1
2016-12-19 17:17+0100, Paolo Bonzini:
> [...]

Please mention that this also fixes an existing problem with posted
interrupts from devices.  If we didn't check PIR.ON after disabling host
interrupts, we might delay delivery to the next VM exit or posted
interrupt.  (It was recently posted.)

> [...]
> -	local_irq_disable();
> +	if (kvm_lapic_enabled(vcpu)) {
> +		/*
> +		 * This handles the case where a posted interrupt was
> +		 * notified with kvm_vcpu_kick.
> +		 */
> +		if (kvm_x86_ops->sync_pir_to_irr)
> +			kvm_x86_ops->sync_pir_to_irr(vcpu);

Hm, this is not working well when nesting while L1 has assigned devices:
if the posted interrupt arrives just before local_irq_disable(), then
we'll just enter L2 instead of doing a nested VM exit (in case we have
interrupt exiting).

And after reading the code a bit, I think we allow posted interrupts in
L2 while L1 has assigned devices that use posted interrupts, and that it
doesn't work.

Am I missing something?

Thanks.
Paolo Bonzini Feb. 8, 2017, 4:23 p.m. UTC | #2
On 07/02/2017 20:58, Radim Krčmář wrote:
>> -	local_irq_disable();
>> +	if (kvm_lapic_enabled(vcpu)) {
>> +		/*
>> +		 * This handles the case where a posted interrupt was
>> +		 * notified with kvm_vcpu_kick.
>> +		 */
>> +		if (kvm_x86_ops->sync_pir_to_irr)
>> +			kvm_x86_ops->sync_pir_to_irr(vcpu);
> Hm, this is not working well when nesting while L1 has assigned devices:
> if the posted interrupt arrives just before local_irq_disable(), then
> we'll just enter L2 instead of doing a nested VM exit (in case we have
> interrupt exiting).
> 
> And after reading the code a bit, I think we allow posted interrupts in
> L2 while L1 has assigned devices that use posted interrupts, and that it
> doesn't work.

So you mean the interrupt is delivered to L2?  The fix would be to wrap
L2 entry and exit with some subset of pi_pre_block/pi_post_block.

Paolo
Radim Krčmář Feb. 9, 2017, 3:11 p.m. UTC | #3
2017-02-08 17:23+0100, Paolo Bonzini:
> On 07/02/2017 20:58, Radim Krčmář wrote:
>>> -	local_irq_disable();
>>> +	if (kvm_lapic_enabled(vcpu)) {
>>> +		/*
>>> +		 * This handles the case where a posted interrupt was
>>> +		 * notified with kvm_vcpu_kick.
>>> +		 */
>>> +		if (kvm_x86_ops->sync_pir_to_irr)
>>> +			kvm_x86_ops->sync_pir_to_irr(vcpu);
>> Hm, this is not working well when nesting while L1 has assigned devices:
>> if the posted interrupt arrives just before local_irq_disable(), then
>> we'll just enter L2 instead of doing a nested VM exit (in case we have
>> interrupt exiting).
>> 
>> And after reading the code a bit, I think we allow posted interrupts in
>> L2 while L1 has assigned devices that use posted interrupts, and that it
>> doesn't work.
> 
> So you mean the interrupt is delivered to L2?  The fix would be to wrap
> L2 entry and exit with some subset of pi_pre_block/pi_post_block.

I hope not, as their PI structures are separate, so we'd just be
delaying the interrupt injection to L1.  The CPU running the L2 guest will
notice a posted notification, but its PIR.ON will/might not be set.
L1's PIR.ON will be set, but no-one is going to care until the next VM
exit.

I'll add some unit tests to check that I understood the bug correctly.

Changing the notification vector for L2 would be an ok solution.
We'd reserve a new vector in L0 and check L1's interrupts.  If it were
targeting a VCPU that is currently in L2 with a notification vector
configured for L2, we'd translate that vector into the notification
vector we set for L2 -- L1 could then post interrupts to L2 without a VM
exit.  And "posted" interrupts for L1 while in L2 would trigger a VM
exit, because the notification vector would be different.
Wanpeng Li March 9, 2017, 1:23 a.m. UTC | #4
2016-12-20 0:17 GMT+08:00 Paolo Bonzini <pbonzini@redhat.com>:
> [...]
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index c666414adc1d..725473ba6dd3 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -6710,19 +6710,6 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
>                         kvm_hv_process_stimers(vcpu);
>         }
>
> -       /*
> -        * KVM_REQ_EVENT is not set when posted interrupts are set by
> -        * VT-d hardware, so we have to update RVI unconditionally.
> -        */
> -       if (kvm_lapic_enabled(vcpu)) {
> -               /*
> -                * Update architecture specific hints for APIC
> -                * virtual interrupt delivery.
> -                */
> -               if (kvm_x86_ops->sync_pir_to_irr)
> -                       kvm_x86_ops->sync_pir_to_irr(vcpu);
> -       }
> -
>         if (kvm_check_request(KVM_REQ_EVENT, vcpu) || req_int_win) {
>                 ++vcpu->stat.req_event;
>                 kvm_apic_accept_events(vcpu);
> @@ -6767,20 +6754,39 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
>         kvm_x86_ops->prepare_guest_switch(vcpu);
>         if (vcpu->fpu_active)
>                 kvm_load_guest_fpu(vcpu);
> +
> +       /*
> +        * Disabling IRQs before setting IN_GUEST_MODE.  Posted interrupt
> +        * IPI are then delayed after guest entry, which ensures that they
> +        * result in virtual interrupt delivery.
> +        */
> +       local_irq_disable();
>         vcpu->mode = IN_GUEST_MODE;
>
>         srcu_read_unlock(&vcpu->kvm->srcu, vcpu->srcu_idx);
>
>         /*
> -        * We should set ->mode before check ->requests,
> -        * Please see the comment in kvm_make_all_cpus_request.
> -        * This also orders the write to mode from any reads
> -        * to the page tables done while the VCPU is running.
> -        * Please see the comment in kvm_flush_remote_tlbs.
> +        * 1) We should set ->mode before checking ->requests.  Please see
> +        * the comment in kvm_make_all_cpus_request.
> +        *
> +        * 2) For APICv, we should set ->mode before checking PIR.ON.  This
> +        * pairs with the memory barrier implicit in pi_test_and_set_on
> +        * (see vmx_deliver_posted_interrupt).
> +        *
> +        * 3) This also orders the write to mode from any reads to the page
> +        * tables done while the VCPU is running.  Please see the comment
> +        * in kvm_flush_remote_tlbs.
>          */
>         smp_mb__after_srcu_read_unlock();
>
> -       local_irq_disable();

The local_irq_disable() movement is unnecessary if you move sync_pir_to_irr.

- If the IPI arrives after vcpu->mode = IN_GUEST_MODE and the interrupt
disable, posted interrupt delivery succeeds.
- If the IPI arrives between vcpu->mode = IN_GUEST_MODE and the interrupt
disable, sync_pir_to_irr will catch the PIR and set RVI.

Regards,
Wanpeng Li

> +       if (kvm_lapic_enabled(vcpu)) {
> +               /*
> +                * This handles the case where a posted interrupt was
> +                * notified with kvm_vcpu_kick.
> +                */
> +               if (kvm_x86_ops->sync_pir_to_irr)
> +                       kvm_x86_ops->sync_pir_to_irr(vcpu);
> +       }
>
>         if (vcpu->mode == EXITING_GUEST_MODE || vcpu->requests
>             || need_resched() || signal_pending(current)) {
Wanpeng Li March 9, 2017, 9:40 a.m. UTC | #5
2017-03-09 9:23 GMT+08:00 Wanpeng Li <kernellwp@gmail.com>:
> 2016-12-20 0:17 GMT+08:00 Paolo Bonzini <pbonzini@redhat.com>:
>> [...]
>> -       local_irq_disable();
>
> The local_irq_disable() movement is unnecessary if you move sync_pir_to_irr.

In addition, this movement increases the interrupts-disabled window to
some degree.  Do you think I can send a patch to revert it?

Regards,
Wanpeng Li

Paolo Bonzini March 9, 2017, 10:03 a.m. UTC | #6
On 09/03/2017 10:40, Wanpeng Li wrote:
> 2017-03-09 9:23 GMT+08:00 Wanpeng Li <kernellwp@gmail.com>:
>> 2016-12-20 0:17 GMT+08:00 Paolo Bonzini <pbonzini@redhat.com>:
>>> [...]
>>> -       local_irq_disable();
>>
>> The local_irq_disable() movement is unnecessary if you move sync_pir_to_irr.
> 
> In addition, this movement will increase the time of irq disable to
> some degree. Do you think I can send a patch to revert it?

The difference is a few hundred clock cycles; I don't think it
matters.  Also, a posted interrupt sent to the host while IN_GUEST_MODE
is more expensive than one sent while the processor is in non-root mode.

All in all, I think it's preferable to keep the local_irq_disable()
here.  Your observation seems correct, though.

Paolo

> Regards,
> Wanpeng Li
> 
>>
>> - If the IPI arrives after vcpu->mode = IN_GUEST_MODE and the interrupt
>> disable, posted-interrupt delivery succeeds.
>> - If the IPI arrives between vcpu->mode = IN_GUEST_MODE and the interrupt
>> disable, sync_pir_to_irr will catch the PIR and set RVI.
>>
>> Regards,
>> Wanpeng Li
>>
>>> +       if (kvm_lapic_enabled(vcpu)) {
>>> +               /*
>>> +                * This handles the case where a posted interrupt was
>>> +                * notified with kvm_vcpu_kick.
>>> +                */
>>> +               if (kvm_x86_ops->sync_pir_to_irr)
>>> +                       kvm_x86_ops->sync_pir_to_irr(vcpu);
>>> +       }
>>>
>>>         if (vcpu->mode == EXITING_GUEST_MODE || vcpu->requests
>>>             || need_resched() || signal_pending(current)) {
>>> --
>>> 1.8.3.1
>>>
Patch

diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index f644dd1dbe71..5ea94b622e88 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -385,12 +385,8 @@  int __kvm_apic_update_irr(u32 *pir, void *regs)
 int kvm_apic_update_irr(struct kvm_vcpu *vcpu, u32 *pir)
 {
 	struct kvm_lapic *apic = vcpu->arch.apic;
-	int max_irr;
 
-	max_irr = __kvm_apic_update_irr(pir, apic->regs);
-
-	kvm_make_request(KVM_REQ_EVENT, vcpu);
-	return max_irr;
+	return __kvm_apic_update_irr(pir, apic->regs);
 }
 EXPORT_SYMBOL_GPL(kvm_apic_update_irr);
 
@@ -423,9 +419,10 @@  static inline void apic_clear_irr(int vec, struct kvm_lapic *apic)
 	vcpu = apic->vcpu;
 
 	if (unlikely(vcpu->arch.apicv_active)) {
-		/* try to update RVI */
+		/* need to update RVI */
 		apic_clear_vector(vec, apic->regs + APIC_IRR);
-		kvm_make_request(KVM_REQ_EVENT, vcpu);
+		kvm_x86_ops->hwapic_irr_update(vcpu,
+				apic_find_highest_irr(apic));
 	} else {
 		apic->irr_pending = false;
 		apic_clear_vector(vec, apic->regs + APIC_IRR);
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 27e40b180242..3dd4fad35a3e 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -5062,9 +5062,11 @@  static void vmx_deliver_posted_interrupt(struct kvm_vcpu *vcpu, int vector)
 	if (pi_test_and_set_pir(vector, &vmx->pi_desc))
 		return;
 
-	r = pi_test_and_set_on(&vmx->pi_desc);
-	kvm_make_request(KVM_REQ_EVENT, vcpu);
-	if (r || !kvm_vcpu_trigger_posted_interrupt(vcpu))
+	/* If a previous notification has sent the IPI, nothing to do.  */
+	if (pi_test_and_set_on(&vmx->pi_desc))
+		return;
+
+	if (!kvm_vcpu_trigger_posted_interrupt(vcpu))
 		kvm_vcpu_kick(vcpu);
 }
 
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index c666414adc1d..725473ba6dd3 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -6710,19 +6710,6 @@  static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
 			kvm_hv_process_stimers(vcpu);
 	}
 
-	/*
-	 * KVM_REQ_EVENT is not set when posted interrupts are set by
-	 * VT-d hardware, so we have to update RVI unconditionally.
-	 */
-	if (kvm_lapic_enabled(vcpu)) {
-		/*
-		 * Update architecture specific hints for APIC
-		 * virtual interrupt delivery.
-		 */
-		if (kvm_x86_ops->sync_pir_to_irr)
-			kvm_x86_ops->sync_pir_to_irr(vcpu);
-	}
-
 	if (kvm_check_request(KVM_REQ_EVENT, vcpu) || req_int_win) {
 		++vcpu->stat.req_event;
 		kvm_apic_accept_events(vcpu);
@@ -6767,20 +6754,39 @@  static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
 	kvm_x86_ops->prepare_guest_switch(vcpu);
 	if (vcpu->fpu_active)
 		kvm_load_guest_fpu(vcpu);
+
+	/*
+	 * Disabling IRQs before setting IN_GUEST_MODE.  Posted interrupt
+	 * IPI are then delayed after guest entry, which ensures that they
+	 * result in virtual interrupt delivery.
+	 */
+	local_irq_disable();
 	vcpu->mode = IN_GUEST_MODE;
 
 	srcu_read_unlock(&vcpu->kvm->srcu, vcpu->srcu_idx);
 
 	/*
-	 * We should set ->mode before check ->requests,
-	 * Please see the comment in kvm_make_all_cpus_request.
-	 * This also orders the write to mode from any reads
-	 * to the page tables done while the VCPU is running.
-	 * Please see the comment in kvm_flush_remote_tlbs.
+	 * 1) We should set ->mode before checking ->requests.  Please see
+	 * the comment in kvm_make_all_cpus_request.
+	 *
+	 * 2) For APICv, we should set ->mode before checking PIR.ON.  This
+	 * pairs with the memory barrier implicit in pi_test_and_set_on
+	 * (see vmx_deliver_posted_interrupt).
+	 *
+	 * 3) This also orders the write to mode from any reads to the page
+	 * tables done while the VCPU is running.  Please see the comment
+	 * in kvm_flush_remote_tlbs.
 	 */
 	smp_mb__after_srcu_read_unlock();
 
-	local_irq_disable();
+	if (kvm_lapic_enabled(vcpu)) {
+		/*
+		 * This handles the case where a posted interrupt was
+		 * notified with kvm_vcpu_kick.
+		 */
+		if (kvm_x86_ops->sync_pir_to_irr)
+			kvm_x86_ops->sync_pir_to_irr(vcpu);
+	}
 
 	if (vcpu->mode == EXITING_GUEST_MODE || vcpu->requests
 	    || need_resched() || signal_pending(current)) {