
[v2,3/5] KVM: SVM: fix race between interrupt delivery and AVIC inhibition

Message ID 20211213104634.199141-4-mlevitsk@redhat.com (mailing list archive)
State New, archived
Series RFC: KVM: SVM: Allow L1's AVIC to co-exist with nesting

Commit Message

Maxim Levitsky Dec. 13, 2021, 10:46 a.m. UTC
If svm_deliver_avic_intr is called just after the target vcpu's AVIC got
inhibited, it might read a stale value of vcpu->arch.apicv_active
which can lead to the target vCPU not noticing the interrupt.

To fix this use load-acquire/store-release so that, if the target vCPU
is IN_GUEST_MODE, we're guaranteed to see a previous disabling of the
AVIC.  If AVIC has been disabled in the meanwhile, proceed with the
KVM_REQ_EVENT-based delivery.
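
Roughly, the pairing being relied on can be sketched as follows (a simplified
illustration, not the exact code: the inhibition really happens via
KVM_REQ_APICV_UPDATE processing in vcpu_enter_guest, and the pCPU migration
handling on the doorbell path is elided):

	/* target vCPU, in vcpu_enter_guest() */
	vcpu->arch.apicv_active = false;		/* AVIC inhibited */
	smp_store_release(&vcpu->mode, IN_GUEST_MODE);	/* publishes the store above */

	/* sender, in svm_deliver_avic_intr() / avic_kick_target_vcpu() */
	in_guest_mode = smp_load_acquire(&vcpu->mode) == IN_GUEST_MODE;
	if (READ_ONCE(vcpu->arch.apicv_active)) {
		/* if in_guest_mode, this cannot be a stale 'true' */
		if (in_guest_mode)
			/* doorbell; pCPU migration handling elided */
			wrmsrl(SVM_AVIC_DOORBELL, kvm_cpu_get_apicid(vcpu->cpu));
		else
			kvm_vcpu_wake_up(vcpu);
	} else {
		/* AVIC was disabled in the meantime: fall back to KVM_REQ_EVENT */
		kvm_make_request(KVM_REQ_EVENT, vcpu);
		kvm_vcpu_kick(vcpu);
	}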

All this complicated logic is actually exactly how we can handle an
incomplete IPI vmexit; the only difference lies in who sets IRR, whether
KVM or the processor.

Also, the incomplete IPI vmexit has the same races as svm_deliver_avic_intr,
therefore just reuse avic_kick_target_vcpu for it as well.

Reported-by: Maxim Levitsky <mlevitsk@redhat.com>
Co-developed-with: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
---
 arch/x86/kvm/svm/avic.c | 85 +++++++++++++++++++++++++----------------
 arch/x86/kvm/x86.c      |  4 +-
 2 files changed, 55 insertions(+), 34 deletions(-)

Comments

Sean Christopherson Jan. 4, 2022, 10:52 p.m. UTC | #1
On Mon, Dec 13, 2021, Maxim Levitsky wrote:
> If svm_deliver_avic_intr is called just after the target vcpu's AVIC got
> inhibited, it might read a stale value of vcpu->arch.apicv_active
> which can lead to the target vCPU not noticing the interrupt.
> 
> To fix this use load-acquire/store-release so that, if the target vCPU
> is IN_GUEST_MODE, we're guaranteed to see a previous disabling of the
> AVIC.  If AVIC has been disabled in the meanwhile, proceed with the
> KVM_REQ_EVENT-based delivery.
> 
> All this complicated logic is actually exactly how we can handle an
> incomplete IPI vmexit; the only difference lies in who sets IRR, whether
> KVM or the processor.
> 
> Also incomplete IPI vmexit, has the same races as svm_deliver_avic_intr.
> therefore just reuse the avic_kick_target_vcpu for it as well.
> 
> Reported-by: Maxim Levitsky <mlevitsk@redhat.com>

Heh, probably don't need a Reported-by for a patch you wrote :-)

> Co-developed-with: Paolo Bonzini <pbonzini@redhat.com>

Co-developed-by: is preferred, and should be accompanied by Paolo's SoB.

> Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
> ---
>  arch/x86/kvm/svm/avic.c | 85 +++++++++++++++++++++++++----------------
>  arch/x86/kvm/x86.c      |  4 +-
>  2 files changed, 55 insertions(+), 34 deletions(-)
> 
> diff --git a/arch/x86/kvm/svm/avic.c b/arch/x86/kvm/svm/avic.c
> index 90364d02f22aa..34f62da2fbadd 100644
> --- a/arch/x86/kvm/svm/avic.c
> +++ b/arch/x86/kvm/svm/avic.c
> @@ -289,6 +289,47 @@ static int avic_init_backing_page(struct kvm_vcpu *vcpu)
>  	return 0;
>  }
>  
> +static void avic_kick_target_vcpu(struct kvm_vcpu *vcpu)
> +{
> +	bool in_guest_mode;
> +
> +	/*
> +	 * vcpu->arch.apicv_active is read after vcpu->mode.  Pairs

This should say "must be read", not "is read".  It's obvious from the code that
apicv_active is read second, the comment is there to say that it _must_ be read
after vcpu->mode.

> +	 * with smp_store_release in vcpu_enter_guest.
> +	 */
> +	in_guest_mode = (smp_load_acquire(&vcpu->mode) == IN_GUEST_MODE);

IMO, it's marginally clearer to initialize the bool.

	bool in_guest_mode = (smp_load_acquire(&vcpu->mode) == IN_GUEST_MODE);

> +	if (READ_ONCE(vcpu->arch.apicv_active)) {
> +		if (in_guest_mode) {
> +			/*
> +			 * Signal the doorbell to tell hardware to inject the IRQ if the vCPU
> +			 * is in the guest.  If the vCPU is not in the guest, hardware will
> +			 * automatically process AVIC interrupts at VMRUN.

Might as well wrap these comments at 80 chars since they're being moved.  Or
maybe even better....

	/* blah blah blah */
	if (!READ_ONCE(vcpu->arch.apicv_active)) {
		kvm_make_request(KVM_REQ_EVENT, vcpu);
		kvm_vcpu_kick(vcpu);
		return;
	}

	if (in_guest_mode) {
		...
	} else {
		....
	}

...so that the existing comments can be preserved as is.

> +			 *
> +			 * Note, the vCPU could get migrated to a different pCPU at any
> +			 * point, which could result in signalling the wrong/previous
> +			 * pCPU.  But if that happens the vCPU is guaranteed to do a
> +			 * VMRUN (after being migrated) and thus will process pending
> +			 * interrupts, i.e. a doorbell is not needed (and the spurious
> +			 * one is harmless).
> +			 */
> +			int cpu = READ_ONCE(vcpu->cpu);
> +			if (cpu != get_cpu())
> +				wrmsrl(SVM_AVIC_DOORBELL, kvm_cpu_get_apicid(cpu));
> +			put_cpu();
> +		} else {
> +			/*
> +			 * Wake the vCPU if it was blocking.  KVM will then detect the
> +			 * pending IRQ when checking if the vCPU has a wake event.
> +			 */
> +			kvm_vcpu_wake_up(vcpu);
> +		}
> +	} else {
> +		/* Compare this case with __apic_accept_irq.  */

Honestly, this comment isn't very helpful.  It only takes a few lines to say:

		/*
		 * Manually signal the event, the __apic_accept_irq() fallback
		 * path can't be used if AVIC is disabled after the vector is
		 * already queued in the vIRR.
		 */

(incorporating more feedback below)

> +		kvm_make_request(KVM_REQ_EVENT, vcpu);
> +		kvm_vcpu_kick(vcpu);
> +	}
> +}
> +
>  static void avic_kick_target_vcpus(struct kvm *kvm, struct kvm_lapic *source,
>  				   u32 icrl, u32 icrh)
>  {
> @@ -304,8 +345,10 @@ static void avic_kick_target_vcpus(struct kvm *kvm, struct kvm_lapic *source,
>  	kvm_for_each_vcpu(i, vcpu, kvm) {
>  		if (kvm_apic_match_dest(vcpu, source, icrl & APIC_SHORT_MASK,
>  					GET_APIC_DEST_FIELD(icrh),
> -					icrl & APIC_DEST_MASK))
> -			kvm_vcpu_wake_up(vcpu);
> +					icrl & APIC_DEST_MASK)) {
> +			vcpu->arch.apic->irr_pending = true;
> +			avic_kick_target_vcpu(vcpu);
> +		}
>  	}
>  }
>  
> @@ -671,9 +714,12 @@ void svm_load_eoi_exitmap(struct kvm_vcpu *vcpu, u64 *eoi_exit_bitmap)
>  
>  int svm_deliver_avic_intr(struct kvm_vcpu *vcpu, int vec)
>  {
> -	if (!vcpu->arch.apicv_active)
> -		return -1;
> -
> +	/*
> +	 * Below, we have to handle anyway the case of AVIC being disabled
> +	 * in the middle of this function, and there is hardly any overhead
> +	 * if AVIC is disabled.  So, we do not bother returning -1 and handle
> +	 * the kick ourselves for disabled APICv.

Hmm, my preference would be to keep the "return -1" even though apicv_active must
be rechecked.  That would help highlight that returning "failure" after this point
is not an option as it would result in kvm_lapic_set_irr() being called twice.
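
For illustration, keeping the early return while still rechecking later could
look roughly like this (a sketch, not code from this series):

	int svm_deliver_avic_intr(struct kvm_vcpu *vcpu, int vec)
	{
		if (!READ_ONCE(vcpu->arch.apicv_active))
			return -1;	/* caller falls back to lapic.c delivery */

		kvm_lapic_set_irr(vec, vcpu->arch.apic);
		smp_mb__after_atomic();

		/*
		 * From here on, returning "failure" is not an option: the
		 * vector is already queued in the vIRR and the caller would
		 * call kvm_lapic_set_irr() a second time.  Re-check
		 * apicv_active inside avic_kick_target_vcpu() and handle the
		 * fallback kick there if AVIC was disabled in the meantime.
		 */
		avic_kick_target_vcpu(vcpu);
		return 0;
	}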

> +	 */
>  	kvm_lapic_set_irr(vec, vcpu->arch.apic);
>  
>  	/*
> @@ -684,34 +730,7 @@ int svm_deliver_avic_intr(struct kvm_vcpu *vcpu, int vec)
>  	 * the doorbell if the vCPU is already running in the guest.
>  	 */
>  	smp_mb__after_atomic();
> -
> -	/*
> -	 * Signal the doorbell to tell hardware to inject the IRQ if the vCPU
> -	 * is in the guest.  If the vCPU is not in the guest, hardware will
> -	 * automatically process AVIC interrupts at VMRUN.
> -	 */
> -	if (vcpu->mode == IN_GUEST_MODE) {
> -		int cpu = READ_ONCE(vcpu->cpu);
> -
> -		/*
> -		 * Note, the vCPU could get migrated to a different pCPU at any
> -		 * point, which could result in signalling the wrong/previous
> -		 * pCPU.  But if that happens the vCPU is guaranteed to do a
> -		 * VMRUN (after being migrated) and thus will process pending
> -		 * interrupts, i.e. a doorbell is not needed (and the spurious
> -		 * one is harmless).
> -		 */
> -		if (cpu != get_cpu())
> -			wrmsrl(SVM_AVIC_DOORBELL, kvm_cpu_get_apicid(cpu));
> -		put_cpu();
> -	} else {
> -		/*
> -		 * Wake the vCPU if it was blocking.  KVM will then detect the
> -		 * pending IRQ when checking if the vCPU has a wake event.
> -		 */
> -		kvm_vcpu_wake_up(vcpu);
> -	}
> -
> +	avic_kick_target_vcpu(vcpu);
>  	return 0;
>  }
>  
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 85127b3e3690b..81a74d86ee5eb 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -9869,7 +9869,9 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
>  	 * result in virtual interrupt delivery.
>  	 */
>  	local_irq_disable();
> -	vcpu->mode = IN_GUEST_MODE;
> +
> +	/* Store vcpu->apicv_active before vcpu->mode.  */
> +	smp_store_release(&vcpu->mode, IN_GUEST_MODE);
>  
>  	srcu_read_unlock(&vcpu->kvm->srcu, vcpu->srcu_idx);
>  
> -- 
> 2.26.3
>
Maxim Levitsky Jan. 5, 2022, 11:03 a.m. UTC | #2
On Tue, 2022-01-04 at 22:52 +0000, Sean Christopherson wrote:
> On Mon, Dec 13, 2021, Maxim Levitsky wrote:
> > If svm_deliver_avic_intr is called just after the target vcpu's AVIC got
> > inhibited, it might read a stale value of vcpu->arch.apicv_active
> > which can lead to the target vCPU not noticing the interrupt.
> > 
> > To fix this use load-acquire/store-release so that, if the target vCPU
> > is IN_GUEST_MODE, we're guaranteed to see a previous disabling of the
> > AVIC.  If AVIC has been disabled in the meanwhile, proceed with the
> > KVM_REQ_EVENT-based delivery.
> > 
> > All this complicated logic is actually exactly how we can handle an
> > incomplete IPI vmexit; the only difference lies in who sets IRR, whether
> > KVM or the processor.
> > 
> > Also incomplete IPI vmexit, has the same races as svm_deliver_avic_intr.
> > therefore just reuse the avic_kick_target_vcpu for it as well.
> > 
> > Reported-by: Maxim Levitsky <mlevitsk@redhat.com>
> 
> Heh, probably don't need a Reported-by for a patch you wrote :-)

Paolo gave me this version; I pretty much sent it as is. We had a few iterations
of this patch before, though, until we agreed that the race was finally gone.

> 
> > Co-developed-with: Paolo Bonzini <pbonzini@redhat.com>
> 
> Co-developed-by: is preferred, and should be accompanied by Paolo's SoB.

This is the first time I've used this format, so I didn't know about this.

> 
> > Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
> > ---
> >  arch/x86/kvm/svm/avic.c | 85 +++++++++++++++++++++++++----------------
> >  arch/x86/kvm/x86.c      |  4 +-
> >  2 files changed, 55 insertions(+), 34 deletions(-)
> > 
> > diff --git a/arch/x86/kvm/svm/avic.c b/arch/x86/kvm/svm/avic.c
> > index 90364d02f22aa..34f62da2fbadd 100644
> > --- a/arch/x86/kvm/svm/avic.c
> > +++ b/arch/x86/kvm/svm/avic.c
> > @@ -289,6 +289,47 @@ static int avic_init_backing_page(struct kvm_vcpu *vcpu)
> >  	return 0;
> >  }
> >  
> > +static void avic_kick_target_vcpu(struct kvm_vcpu *vcpu)
> > +{
> > +	bool in_guest_mode;
> > +
> > +	/*
> > +	 * vcpu->arch.apicv_active is read after vcpu->mode.  Pairs
> 
> This should say "must be read", not "is read".  It's obvious from the code that
> apicv_active is read second, the comment is there to say that it _must_ be read
> after vcpu->mode.
> 
> > +	 * with smp_store_release in vcpu_enter_guest.
> > +	 */
> > +	in_guest_mode = (smp_load_acquire(&vcpu->mode) == IN_GUEST_MODE);
> 
> IMO, it's marginally clear to initialize the bool.
> 
> 	bool in_guest_mode = (smp_load_acquire(&vcpu->mode) == IN_GUEST_MODE);
> 
> > +	if (READ_ONCE(vcpu->arch.apicv_active)) {
> > +		if (in_guest_mode) {
> > +			/*
> > +			 * Signal the doorbell to tell hardware to inject the IRQ if the vCPU
> > +			 * is in the guest.  If the vCPU is not in the guest, hardware will
> > +			 * automatically process AVIC interrupts at VMRUN.
> 
> Might as well wrap these comments at 80 chars since they're being moved.  Or
> maybe even better....
> 
> 	/* blah blah blah */
> 	if (!READ_ONCE(vcpu->arch.apicv_active)) {
> 		kvm_make_request(KVM_REQ_EVENT, vcpu);
> 		kvm_vcpu_kick(vcpu);
> 		return;
> 	}
> 
> 	if (in_guest_mode) {
> 		...
> 	} else {
> 		....
> 	}
> 
> ...so that the existing comments can be preserved as is.
> 
> > +			 *
> > +			 * Note, the vCPU could get migrated to a different pCPU at any
> > +			 * point, which could result in signalling the wrong/previous
> > +			 * pCPU.  But if that happens the vCPU is guaranteed to do a
> > +			 * VMRUN (after being migrated) and thus will process pending
> > +			 * interrupts, i.e. a doorbell is not needed (and the spurious
> > +			 * one is harmless).
> > +			 */
> > +			int cpu = READ_ONCE(vcpu->cpu);
> > +			if (cpu != get_cpu())
> > +				wrmsrl(SVM_AVIC_DOORBELL, kvm_cpu_get_apicid(cpu));
> > +			put_cpu();
> > +		} else {
> > +			/*
> > +			 * Wake the vCPU if it was blocking.  KVM will then detect the
> > +			 * pending IRQ when checking if the vCPU has a wake event.
> > +			 */
> > +			kvm_vcpu_wake_up(vcpu);
> > +		}
> > +	} else {
> > +		/* Compare this case with __apic_accept_irq.  */
> 
> Honestly, this comment isn't very helpful.  It only takes a few lines to say:
> 
> 		/*
> 		 * Manually signal the event, the __apic_accept_irq() fallback
> 		 * path can't be used if AVIC is disabled after the vector is
> 		 * already queued in the vIRR.
> 		 */
> 
> (incorporating more feedback below)
> 
> > +		kvm_make_request(KVM_REQ_EVENT, vcpu);
> > +		kvm_vcpu_kick(vcpu);
> > +	}
> > +}
> > +
> >  static void avic_kick_target_vcpus(struct kvm *kvm, struct kvm_lapic *source,
> >  				   u32 icrl, u32 icrh)
> >  {
> > @@ -304,8 +345,10 @@ static void avic_kick_target_vcpus(struct kvm *kvm, struct kvm_lapic *source,
> >  	kvm_for_each_vcpu(i, vcpu, kvm) {
> >  		if (kvm_apic_match_dest(vcpu, source, icrl & APIC_SHORT_MASK,
> >  					GET_APIC_DEST_FIELD(icrh),
> > -					icrl & APIC_DEST_MASK))
> > -			kvm_vcpu_wake_up(vcpu);
> > +					icrl & APIC_DEST_MASK)) {
> > +			vcpu->arch.apic->irr_pending = true;
> > +			avic_kick_target_vcpu(vcpu);
> > +		}
> >  	}
> >  }
> >  
> > @@ -671,9 +714,12 @@ void svm_load_eoi_exitmap(struct kvm_vcpu *vcpu, u64 *eoi_exit_bitmap)
> >  
> >  int svm_deliver_avic_intr(struct kvm_vcpu *vcpu, int vec)
> >  {
> > -	if (!vcpu->arch.apicv_active)
> > -		return -1;
> > -
> > +	/*
> > +	 * Below, we have to handle anyway the case of AVIC being disabled
> > +	 * in the middle of this function, and there is hardly any overhead
> > +	 * if AVIC is disabled.  So, we do not bother returning -1 and handle
> > +	 * the kick ourselves for disabled APICv.
> 
> Hmm, my preference would be to keep the "return -1" even though apicv_active must
> be rechecked.  That would help highlight that returning "failure" after this point
> is not an option as it would result in kvm_lapic_set_irr() being called twice.

I don't mind either way - this will fix the tracepoint I recently added to report
the number of interrupts that were delivered by AVIC/APICv; with this patch as is,
all of them count as such.


I will also address all the other feedback about the comments and send a new version soon.

Thanks for the review!
Best regards,
	Maxim Levitsky

> 
> > +	 */
> >  	kvm_lapic_set_irr(vec, vcpu->arch.apic);
> >  
> >  	/*
> > @@ -684,34 +730,7 @@ int svm_deliver_avic_intr(struct kvm_vcpu *vcpu, int vec)
> >  	 * the doorbell if the vCPU is already running in the guest.
> >  	 */
> >  	smp_mb__after_atomic();
> > -
> > -	/*
> > -	 * Signal the doorbell to tell hardware to inject the IRQ if the vCPU
> > -	 * is in the guest.  If the vCPU is not in the guest, hardware will
> > -	 * automatically process AVIC interrupts at VMRUN.
> > -	 */
> > -	if (vcpu->mode == IN_GUEST_MODE) {
> > -		int cpu = READ_ONCE(vcpu->cpu);
> > -
> > -		/*
> > -		 * Note, the vCPU could get migrated to a different pCPU at any
> > -		 * point, which could result in signalling the wrong/previous
> > -		 * pCPU.  But if that happens the vCPU is guaranteed to do a
> > -		 * VMRUN (after being migrated) and thus will process pending
> > -		 * interrupts, i.e. a doorbell is not needed (and the spurious
> > -		 * one is harmless).
> > -		 */
> > -		if (cpu != get_cpu())
> > -			wrmsrl(SVM_AVIC_DOORBELL, kvm_cpu_get_apicid(cpu));
> > -		put_cpu();
> > -	} else {
> > -		/*
> > -		 * Wake the vCPU if it was blocking.  KVM will then detect the
> > -		 * pending IRQ when checking if the vCPU has a wake event.
> > -		 */
> > -		kvm_vcpu_wake_up(vcpu);
> > -	}
> > -
> > +	avic_kick_target_vcpu(vcpu);
> >  	return 0;
> >  }
> >  
> > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > index 85127b3e3690b..81a74d86ee5eb 100644
> > --- a/arch/x86/kvm/x86.c
> > +++ b/arch/x86/kvm/x86.c
> > @@ -9869,7 +9869,9 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
> >  	 * result in virtual interrupt delivery.
> >  	 */
> >  	local_irq_disable();
> > -	vcpu->mode = IN_GUEST_MODE;
> > +
> > +	/* Store vcpu->apicv_active before vcpu->mode.  */
> > +	smp_store_release(&vcpu->mode, IN_GUEST_MODE);
> >  
> >  	srcu_read_unlock(&vcpu->kvm->srcu, vcpu->srcu_idx);
> >  
> > -- 
> > 2.26.3
> >
Paolo Bonzini Jan. 5, 2022, 11:54 a.m. UTC | #3
On 1/5/22 12:03, Maxim Levitsky wrote:
>> Hmm, my preference would be to keep the "return -1" even though apicv_active must
>> be rechecked.  That would help highlight that returning "failure" after this point
>> is not an option as it would result in kvm_lapic_set_irr() being called twice.
> I don't mind either - this will fix the tracepoint I recently added to report the
> number of interrupts that were delivered by AVIC/APICv - with this patch,
> all of them count as such.

Perhaps we can move the tracepoints into the delivery functions.  This
also makes them more precise in the rare case where apicv_active changes
in the middle of the function.

Paolo
Maxim Levitsky Jan. 5, 2022, 12:15 p.m. UTC | #4
On Wed, 2022-01-05 at 12:54 +0100, Paolo Bonzini wrote:
> On 1/5/22 12:03, Maxim Levitsky wrote:
> > > Hmm, my preference would be to keep the "return -1" even though apicv_active must
> > > be rechecked.  That would help highlight that returning "failure" after this point
> > > is not an option as it would result in kvm_lapic_set_irr() being called twice.
> > I don't mind either - this will fix the tracepoint I recently added to report the
> > number of interrupts that were delivered by AVIC/APICv - with this patch,
> > all of them count as such.
> 
> Perhaps we can move the tracepoints in the delivery functions.  This 
> also makes them more precise in the rare case where apicv_active changes 
> in the middle of the function.

That is what I was thinking of doing as well, but I don't mind keeping the 'return -1' either.
Best regards,
	Maxim Levitsky

> 
> Paolo
>
Paolo Bonzini Jan. 7, 2022, 3:32 p.m. UTC | #5
On 1/5/22 12:03, Maxim Levitsky wrote:
>>> -	if (!vcpu->arch.apicv_active)
>>> -		return -1;
>>> -
>>> +	/*
>>> +	 * Below, we have to handle anyway the case of AVIC being disabled
>>> +	 * in the middle of this function, and there is hardly any overhead
>>> +	 * if AVIC is disabled.  So, we do not bother returning -1 and handle
>>> +	 * the kick ourselves for disabled APICv.
>> Hmm, my preference would be to keep the "return -1" even though apicv_active must
>> be rechecked.  That would help highlight that returning "failure" after this point
>> is not an option as it would result in kvm_lapic_set_irr() being called twice.
> I don't mind either - this will fix the tracepoint I recently added to report the
> number of interrupts that were delivered by AVIC/APICv - with this patch,
> all of them count as such.

The reasoning here is that, unlike VMX, we have to react anyway to 
vcpu->arch.apicv_active becoming false halfway through the function.

Removing the early return means that there's one less case of load 
(mis)reordering that the reader has to check.  So I really would prefer 
to remove it.

Agreed with the other feedback.

Paolo
Sean Christopherson Jan. 7, 2022, 11:42 p.m. UTC | #6
On Fri, Jan 07, 2022, Paolo Bonzini wrote:
> On 1/5/22 12:03, Maxim Levitsky wrote:
> > > > -	if (!vcpu->arch.apicv_active)
> > > > -		return -1;
> > > > -
> > > > +	/*
> > > > +	 * Below, we have to handle anyway the case of AVIC being disabled
> > > > +	 * in the middle of this function, and there is hardly any overhead
> > > > +	 * if AVIC is disabled.  So, we do not bother returning -1 and handle
> > > > +	 * the kick ourselves for disabled APICv.
> > > Hmm, my preference would be to keep the "return -1" even though apicv_active must
> > > be rechecked.  That would help highlight that returning "failure" after this point
> > > is not an option as it would result in kvm_lapic_set_irr() being called twice.
> > I don't mind either - this will fix the tracepoint I recently added to report the
> > number of interrupts that were delivered by AVIC/APICv - with this patch,
> > all of them count as such.
> 
> The reasoning here is that, unlike VMX, we have to react anyway to
> vcpu->arch.apicv_active becoming false halfway through the function.
> 
> Removing the early return means that there's one less case of load
> (mis)reordering that the reader has to check.

Yeah, I don't disagree, but the flip side is that without the early check, it's
not all that obvious that SVM must not return -1.  And when AVIC isn't supported
or is disabled at the module level, flowing into AVIC "specific" IRR logic is
a bit weird.  And the LAPIC code effectively becomes Intel-only.

To make everyone happy, and fix the tracepoint issue, what about moving delivery
into vendor code?  E.g. the below (incomplete), with SVM functions renamed so that
anything that isn't guaranteed to be AVIC specific uses svm_ instead of avic_.

diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index baca9fa37a91..a9ac724c6305 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -1096,14 +1096,7 @@ static int __apic_accept_irq(struct kvm_lapic *apic, int delivery_mode,
                                                       apic->regs + APIC_TMR);
                }

-               if (static_call(kvm_x86_deliver_posted_interrupt)(vcpu, vector)) {
-                       kvm_lapic_set_irr(vector, apic);
-                       kvm_make_request(KVM_REQ_EVENT, vcpu);
-                       kvm_vcpu_kick(vcpu);
-               } else {
-                       trace_kvm_apicv_accept_irq(vcpu->vcpu_id, delivery_mode,
-                                                  trig_mode, vector);
-               }
+               static_call(kvm_x86_deliver_interrupt)(vcpu, vector);
                break;

        case APIC_DM_REMRD:
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index fe06b02994e6..1fadd14ea884 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -4012,6 +4012,18 @@ static int vmx_deliver_posted_interrupt(struct kvm_vcpu *vcpu, int vector)
        return 0;
 }

+static void vmx_deliver_interrupt(struct kvm_vcpu *vcpu, int vector)
+{
+       if (vmx_deliver_posted_interrupt(vcpu, vector)) {
+               kvm_lapic_set_irr(vector, apic);
+               kvm_make_request(KVM_REQ_EVENT, vcpu);
+               kvm_vcpu_kick(vcpu);
+       } else {
+               trace_kvm_apicv_accept_irq(vcpu->vcpu_id, delivery_mode,
+                                          trig_mode, vector);
+       }
+}
+
 /*
  * Set up the vmcs's constant host-state fields, i.e., host-state fields that
  * will not change in the lifetime of the guest.
@@ -7651,7 +7663,7 @@ static struct kvm_x86_ops vmx_x86_ops __initdata = {
        .hwapic_isr_update = vmx_hwapic_isr_update,
        .guest_apic_has_interrupt = vmx_guest_apic_has_interrupt,
        .sync_pir_to_irr = vmx_sync_pir_to_irr,
-       .deliver_posted_interrupt = vmx_deliver_posted_interrupt,
+       .deliver_interrupt = vmx_deliver_interrupt,
        .dy_apicv_has_pending_interrupt = pi_has_pending_interrupt,

        .set_tss_addr = vmx_set_tss_addr,
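
As posted, the vmx_deliver_interrupt() sketch still uses apic, delivery_mode
and trig_mode even though they are not parameters; presumably a complete
version would pass them through the new hook, along these lines (hypothetical
signature, not part of the posted diff):

	static void vmx_deliver_interrupt(struct kvm_lapic *apic, int delivery_mode,
					  int trig_mode, int vector)
	{
		struct kvm_vcpu *vcpu = apic->vcpu;

		if (vmx_deliver_posted_interrupt(vcpu, vector)) {
			/* posted delivery failed: queue in vIRR and kick */
			kvm_lapic_set_irr(vector, apic);
			kvm_make_request(KVM_REQ_EVENT, vcpu);
			kvm_vcpu_kick(vcpu);
		} else {
			trace_kvm_apicv_accept_irq(vcpu->vcpu_id, delivery_mode,
						   trig_mode, vector);
		}
	}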
Paolo Bonzini Jan. 10, 2022, 3:10 p.m. UTC | #7
On 1/8/22 00:42, Sean Christopherson wrote:
> Yeah, I don't disagree, but the flip side is that without the early check, it's
> not all that obvious that SVM must not return -1.

But that's what the comment is about.

> And when AVIC isn't supported
> or is disabled at the module level, flowing into AVIC "specific" IRR logic is
> a bit weird.  And the LAPIC code effectively becomes Intel-only.

Yeah, I agree that this particular aspect is a bit weird at first.  But 
it is also natural if you consider how IRR is implemented for AVIC vs. 
APICv, especially with respect to incomplete IPIs.  And the LAPIC code 
becomes a blueprint for AVIC's, i.e. you can see that the AVIC code 
effectively becomes the one in lapic.c when you have either 
!vcpu->arch.apicv_active or an incomplete IPI.

I don't hate the code below; it does lose a bit of the expressiveness of the
LAPIC code, but I guess it's a good middle ground.

Paolo

> To make everyone happy, and fix the tracepoint issue, what about moving delivery
> into vendor code?  E.g. the below (incomplete), with SVM functions renamed so that
> anything that isn't guaranteed to be AVIC specific uses svm_ instead of avic_.
>
> diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
> index baca9fa37a91..a9ac724c6305 100644
> --- a/arch/x86/kvm/lapic.c
> +++ b/arch/x86/kvm/lapic.c
> @@ -1096,14 +1096,7 @@ static int __apic_accept_irq(struct kvm_lapic *apic, int delivery_mode,
>                                                         apic->regs + APIC_TMR);
>                  }
> 
> -               if (static_call(kvm_x86_deliver_posted_interrupt)(vcpu, vector)) {
> -                       kvm_lapic_set_irr(vector, apic);
> -                       kvm_make_request(KVM_REQ_EVENT, vcpu);
> -                       kvm_vcpu_kick(vcpu);
> -               } else {
> -                       trace_kvm_apicv_accept_irq(vcpu->vcpu_id, delivery_mode,
> -                                                  trig_mode, vector);
> -               }
> +               static_call(kvm_x86_deliver_interrupt)(vcpu, vector);
>                  break;
> 
>          case APIC_DM_REMRD:
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index fe06b02994e6..1fadd14ea884 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -4012,6 +4012,18 @@ static int vmx_deliver_posted_interrupt(struct kvm_vcpu *vcpu, int vector)
>          return 0;
>   }
> 
> +static void vmx_deliver_interrupt(struct kvm_vcpu *vcpu, int vector)
> +{
> +       if (vmx_deliver_posted_interrupt(vcpu, vector)) {
> +               kvm_lapic_set_irr(vector, apic);
> +               kvm_make_request(KVM_REQ_EVENT, vcpu);
> +               kvm_vcpu_kick(vcpu);
> +       } else {
> +               trace_kvm_apicv_accept_irq(vcpu->vcpu_id, delivery_mode,
> +                                          trig_mode, vector);
> +       }
> +}
> +
>   /*
>    * Set up the vmcs's constant host-state fields, i.e., host-state fields that
>    * will not change in the lifetime of the guest.
> @@ -7651,7 +7663,7 @@ static struct kvm_x86_ops vmx_x86_ops __initdata = {
>          .hwapic_isr_update = vmx_hwapic_isr_update,
>          .guest_apic_has_interrupt = vmx_guest_apic_has_interrupt,
>          .sync_pir_to_irr = vmx_sync_pir_to_irr,
> -       .deliver_posted_interrupt = vmx_deliver_posted_interrupt,
> +       .deliver_interrupt = vmx_deliver_interrupt,
>          .dy_apicv_has_pending_interrupt = pi_has_pending_interrupt,
> 
>          .set_tss_addr = vmx_set_tss_addr,
>

Patch

diff --git a/arch/x86/kvm/svm/avic.c b/arch/x86/kvm/svm/avic.c
index 90364d02f22aa..34f62da2fbadd 100644
--- a/arch/x86/kvm/svm/avic.c
+++ b/arch/x86/kvm/svm/avic.c
@@ -289,6 +289,47 @@  static int avic_init_backing_page(struct kvm_vcpu *vcpu)
 	return 0;
 }
 
+static void avic_kick_target_vcpu(struct kvm_vcpu *vcpu)
+{
+	bool in_guest_mode;
+
+	/*
+	 * vcpu->arch.apicv_active is read after vcpu->mode.  Pairs
+	 * with smp_store_release in vcpu_enter_guest.
+	 */
+	in_guest_mode = (smp_load_acquire(&vcpu->mode) == IN_GUEST_MODE);
+	if (READ_ONCE(vcpu->arch.apicv_active)) {
+		if (in_guest_mode) {
+			/*
+			 * Signal the doorbell to tell hardware to inject the IRQ if the vCPU
+			 * is in the guest.  If the vCPU is not in the guest, hardware will
+			 * automatically process AVIC interrupts at VMRUN.
+			 *
+			 * Note, the vCPU could get migrated to a different pCPU at any
+			 * point, which could result in signalling the wrong/previous
+			 * pCPU.  But if that happens the vCPU is guaranteed to do a
+			 * VMRUN (after being migrated) and thus will process pending
+			 * interrupts, i.e. a doorbell is not needed (and the spurious
+			 * one is harmless).
+			 */
+			int cpu = READ_ONCE(vcpu->cpu);
+			if (cpu != get_cpu())
+				wrmsrl(SVM_AVIC_DOORBELL, kvm_cpu_get_apicid(cpu));
+			put_cpu();
+		} else {
+			/*
+			 * Wake the vCPU if it was blocking.  KVM will then detect the
+			 * pending IRQ when checking if the vCPU has a wake event.
+			 */
+			kvm_vcpu_wake_up(vcpu);
+		}
+	} else {
+		/* Compare this case with __apic_accept_irq.  */
+		kvm_make_request(KVM_REQ_EVENT, vcpu);
+		kvm_vcpu_kick(vcpu);
+	}
+}
+
 static void avic_kick_target_vcpus(struct kvm *kvm, struct kvm_lapic *source,
 				   u32 icrl, u32 icrh)
 {
@@ -304,8 +345,10 @@  static void avic_kick_target_vcpus(struct kvm *kvm, struct kvm_lapic *source,
 	kvm_for_each_vcpu(i, vcpu, kvm) {
 		if (kvm_apic_match_dest(vcpu, source, icrl & APIC_SHORT_MASK,
 					GET_APIC_DEST_FIELD(icrh),
-					icrl & APIC_DEST_MASK))
-			kvm_vcpu_wake_up(vcpu);
+					icrl & APIC_DEST_MASK)) {
+			vcpu->arch.apic->irr_pending = true;
+			avic_kick_target_vcpu(vcpu);
+		}
 	}
 }
 
@@ -671,9 +714,12 @@  void svm_load_eoi_exitmap(struct kvm_vcpu *vcpu, u64 *eoi_exit_bitmap)
 
 int svm_deliver_avic_intr(struct kvm_vcpu *vcpu, int vec)
 {
-	if (!vcpu->arch.apicv_active)
-		return -1;
-
+	/*
+	 * Below, we have to handle anyway the case of AVIC being disabled
+	 * in the middle of this function, and there is hardly any overhead
+	 * if AVIC is disabled.  So, we do not bother returning -1 and handle
+	 * the kick ourselves for disabled APICv.
+	 */
 	kvm_lapic_set_irr(vec, vcpu->arch.apic);
 
 	/*
@@ -684,34 +730,7 @@  int svm_deliver_avic_intr(struct kvm_vcpu *vcpu, int vec)
 	 * the doorbell if the vCPU is already running in the guest.
 	 */
 	smp_mb__after_atomic();
-
-	/*
-	 * Signal the doorbell to tell hardware to inject the IRQ if the vCPU
-	 * is in the guest.  If the vCPU is not in the guest, hardware will
-	 * automatically process AVIC interrupts at VMRUN.
-	 */
-	if (vcpu->mode == IN_GUEST_MODE) {
-		int cpu = READ_ONCE(vcpu->cpu);
-
-		/*
-		 * Note, the vCPU could get migrated to a different pCPU at any
-		 * point, which could result in signalling the wrong/previous
-		 * pCPU.  But if that happens the vCPU is guaranteed to do a
-		 * VMRUN (after being migrated) and thus will process pending
-		 * interrupts, i.e. a doorbell is not needed (and the spurious
-		 * one is harmless).
-		 */
-		if (cpu != get_cpu())
-			wrmsrl(SVM_AVIC_DOORBELL, kvm_cpu_get_apicid(cpu));
-		put_cpu();
-	} else {
-		/*
-		 * Wake the vCPU if it was blocking.  KVM will then detect the
-		 * pending IRQ when checking if the vCPU has a wake event.
-		 */
-		kvm_vcpu_wake_up(vcpu);
-	}
-
+	avic_kick_target_vcpu(vcpu);
 	return 0;
 }
 
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 85127b3e3690b..81a74d86ee5eb 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -9869,7 +9869,9 @@  static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
 	 * result in virtual interrupt delivery.
 	 */
 	local_irq_disable();
-	vcpu->mode = IN_GUEST_MODE;
+
+	/* Store vcpu->apicv_active before vcpu->mode.  */
+	smp_store_release(&vcpu->mode, IN_GUEST_MODE);
 
 	srcu_read_unlock(&vcpu->kvm->srcu, vcpu->srcu_idx);