
[5/6] KVM: x86: never clear irr_pending in kvm_apic_update_apicv

Message ID 20211209115440.394441-6-mlevitsk@redhat.com (mailing list archive)
State New, archived
Series RFC: KVM: SVM: Allow L1's AVIC to co-exist with nesting

Commit Message

Maxim Levitsky Dec. 9, 2021, 11:54 a.m. UTC
It is possible that during the AVIC incomplete IPI vmexit,
its handler will set irr_pending to true,
but the target vCPU will still see the IRR bit not set,
due to the apparent lack of memory ordering between CPU's vIRR write
that is supposed to happen prior to the AVIC incomplete IPI
vmexit and the write of the irr_pending in that handler.

The AVIC incomplete IPI handler sets this boolean, then issues
a write barrier and then raises KVM_REQ_EVENT,
thus when we later process the KVM_REQ_EVENT we will notice
the vIRR bits set.

Also reorder call to kvm_apic_update_apicv to be after
.refresh_apicv_exec_ctrl, although that doesn't guarantee
that it will see up to date IRR bits.

Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
---
 arch/x86/kvm/lapic.c | 3 ++-
 arch/x86/kvm/x86.c   | 2 +-
 2 files changed, 3 insertions(+), 2 deletions(-)

Comments

Paolo Bonzini Dec. 9, 2021, 2:12 p.m. UTC | #1
On 12/9/21 12:54, Maxim Levitsky wrote:
> Also reorder call to kvm_apic_update_apicv to be after
> .refresh_apicv_exec_ctrl, although that doesn't guarantee
> that it will see up to date IRR bits.

Can you spell out why you do that?

Paolo
Maxim Levitsky Dec. 9, 2021, 3:03 p.m. UTC | #2
On Thu, 2021-12-09 at 15:12 +0100, Paolo Bonzini wrote:
> On 12/9/21 12:54, Maxim Levitsky wrote:
> > Also reorder call to kvm_apic_update_apicv to be after
> > .refresh_apicv_exec_ctrl, although that doesn't guarantee
> > that it will see up to date IRR bits.
> 
> Can you spell out why you do that?


Here is what I saw happening during kvm_vcpu_update_apicv when we are about to disable AVIC:

1. we call kvm_apic_update_apicv which sets irr_pending = false,
because there is nothing in IRR yet.

2. we call kvm_x86_refresh_apicv_exec_ctrl which disables AVIC

If an IPI arrives between 1 and 2, the IRR bits are set, and legitimately no
VM exit happens, so there is no chance for irr_pending to be set to true.


This is why I reordered those calls and added a memory barrier between them
(but I didn't post it in the series)
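
The window between steps 1 and 2 can be replayed deterministically in a small userspace model. This is only a sketch: the `toy_*` names and the single-word IRR are assumptions made for illustration and are not the kernel's data structures. It contrasts a scan that overwrites irr_pending with one that may only set it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy model of the per-vCPU APIC state; not the kernel's struct kvm_lapic. */
struct toy_apic {
	uint32_t irr;      /* one word of IRR bits, set by "hardware" */
	bool irr_pending;  /* software hint: IRR may contain set bits */
	bool avic_active;  /* whether the toy AVIC is still enabled */
};

/* Pre-patch behaviour: overwrite the hint with the result of the scan. */
static void toy_update_apicv_assign(struct toy_apic *a)
{
	a->irr_pending = (a->irr != 0);
}

/* Patched behaviour: the scan may set the hint but never clears it. */
static void toy_update_apicv_set_only(struct toy_apic *a)
{
	if (a->irr != 0)
		a->irr_pending = true;
}

/* Models an AVIC sender posting a vector without any VM exit. */
static void toy_hw_post_interrupt(struct toy_apic *a, unsigned int bit)
{
	assert(bit < 32);
	a->irr |= UINT32_C(1) << bit;
}

/* Event processing consults the hint first and skips the scan if it is clear. */
static bool toy_interrupt_found(const struct toy_apic *a)
{
	return a->irr_pending && a->irr != 0;
}

/* Replays: scan (step 1), IPI lands in the window, AVIC disabled (step 2). */
static bool toy_replay(void (*update)(struct toy_apic *))
{
	/* irr_pending starts out true because the toy AVIC is active. */
	struct toy_apic a = { .irr = 0, .irr_pending = true, .avic_active = true };

	update(&a);                    /* step 1: the scan sees an empty IRR */
	toy_hw_post_interrupt(&a, 5);  /* IPI arrives between 1 and 2 */
	a.avic_active = false;         /* step 2: AVIC is disabled */
	return toy_interrupt_found(&a);
}
```

In this model, `toy_replay(toy_update_apicv_assign)` loses the vector posted in the window, while `toy_replay(toy_update_apicv_set_only)` still finds it.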

However, I then found out that even with the incomplete IPI handler setting irr_pending,
I can observe irr_pending = true here but no bits set in IRR, so kvm_apic_update_apicv
would reset it. I expected a VM exit to act as a write barrier, but it seems that it doesn't.

However I ended up fixing the incomplete IPI handler to just always 
	- set irr_pending
	- raise KVM_REQ_EVENT
	- kick the vcpu

Always kicking is fine because kicking a sleeping vCPU just wakes it up,
and otherwise a vCPU kick only sends an IPI when the target vCPU
is in guest mode anyway.

That, I think, ensures for good that the interrupt will be processed by
this vCPU regardless of the order of these calls and of any barrier between them.

The only thing I kept is making kvm_apic_update_apicv never clear
irr_pending, so that it doesn't reset it if it sees the writes out of order.

Later the KVM_REQ_EVENT should see writes in order because kvm_make_request
includes a write barrier, and the kick should ensure that the vCPU will
process that request.
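
The publish/consume ordering described above can be sketched in userspace with C11 atomics. This is a rough model under stated assumptions: the `toy_*` names and `TOY_REQ_EVENT` are hypothetical, and the release/acquire pair only stands in for the write barrier in kvm_make_request and its counterpart on the reader side.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define TOY_REQ_EVENT (1u << 0)  /* hypothetical stand-in for KVM_REQ_EVENT */

struct toy_vcpu {
	bool irr_pending;      /* plain store, published by the request bit */
	atomic_uint requests;  /* stand-in for the vCPU request bitmap */
	atomic_bool kicked;    /* stand-in for a vCPU kick */
};

/* Sender side: set the hint first, then raise the request, then kick. */
static void toy_incomplete_ipi(struct toy_vcpu *v)
{
	assert(v);
	v->irr_pending = true;
	/* Release ordering plays the role of the write barrier before set_bit. */
	atomic_fetch_or_explicit(&v->requests, TOY_REQ_EVENT,
				 memory_order_release);
	atomic_store_explicit(&v->kicked, true, memory_order_release);
}

/*
 * Target side: consume the request with acquire ordering, then read the hint.
 * Returns true if a request was pending and the hint was seen set.
 */
static bool toy_process_requests(struct toy_vcpu *v)
{
	unsigned int old;

	assert(v);
	old = atomic_fetch_and_explicit(&v->requests, ~TOY_REQ_EVENT,
					memory_order_acquire);
	if (!(old & TOY_REQ_EVENT))
		return false;
	return v->irr_pending;
}
```

Because the hint is stored before the release operation on `requests`, any thread that observes the request bit through the acquire operation also observes `irr_pending == true`.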

So in summary, this reorder is not needed anymore, but it seems more logical
to scan IRR after we disable AVIC.
Or, on second thought, I think we should drop the IRR scan from here altogether,
now that the callers do vCPU kicks.

Best regards,
	Maxim Levitsky





Paolo Bonzini Dec. 10, 2021, 12:07 p.m. UTC | #3
On 12/9/21 12:54, Maxim Levitsky wrote:
> It is possible that during the AVIC incomplete IPI vmexit,
> its handler will set irr_pending to true,
> but the target vCPU will still see the IRR bit not set,
> due to the apparent lack of memory ordering between CPU's vIRR write
> that is supposed to happen prior to the AVIC incomplete IPI
> vmexit and the write of the irr_pending in that handler.

Are you sure about this?  Store-to-store ordering should be 
guaranteed---if not by the architecture---by existing memory barriers 
between vmrun returning and avic_incomplete_ipi_interception().  For 
example, srcu_read_lock implies an smp_mb().

Even more damning: no matter what internal black magic the processor 
could be using to write to IRR, the processor needs to order the writes 
against reads of IsRunning on processors without the erratum.  That 
would be equivalent to flushing the store buffer, and it would imply 
that the write of vIRR is ordered before the write to irr_pending.

Paolo
Maxim Levitsky Dec. 10, 2021, 12:20 p.m. UTC | #4
On Fri, 2021-12-10 at 13:07 +0100, Paolo Bonzini wrote:
> On 12/9/21 12:54, Maxim Levitsky wrote:
> > It is possible that during the AVIC incomplete IPI vmexit,
> > its handler will set irr_pending to true,
> > but the target vCPU will still see the IRR bit not set,
> > due to the apparent lack of memory ordering between CPU's vIRR write
> > that is supposed to happen prior to the AVIC incomplete IPI
> > vmexit and the write of the irr_pending in that handler.
> 
> Are you sure about this?  Store-to-store ordering should be 
> guaranteed---if not by the architecture---by existing memory barriers 
> between vmrun returning and avic_incomplete_ipi_interception().  For 
> example, srcu_read_lock implies an smp_mb().
> 
> Even more damning: no matter what internal black magic the processor 
> could be using to write to IRR, the processor needs to order the writes 
> against reads of IsRunning on processors without the erratum.  That 
> would be equivalent to flushing the store buffer, and it would imply 
> that the write of vIRR is ordered before the write to irr_pending.
> 
> Paolo
> 
Yes, I am almost 100% sure now that this patch is wrong.
The code was just seeing irr_pending true because it is set
to true while APICv/AVIC is in use, and was not yet seeing the vIRR bits,
because they hadn't arrived yet. Thus this patch isn't needed.

Thanks again for help!
I am testing your version of fixes to avic inhibition races,
and then I'll send a new version of these patches.

Best regards,
	Maxim Levitsky
Maxim Levitsky Dec. 10, 2021, 12:47 p.m. UTC | #5
On Fri, 2021-12-10 at 14:20 +0200, Maxim Levitsky wrote:
> On Fri, 2021-12-10 at 13:07 +0100, Paolo Bonzini wrote:
> > On 12/9/21 12:54, Maxim Levitsky wrote:
> > > It is possible that during the AVIC incomplete IPI vmexit,
> > > its handler will set irr_pending to true,
> > > but the target vCPU will still see the IRR bit not set,
> > > due to the apparent lack of memory ordering between CPU's vIRR write
> > > that is supposed to happen prior to the AVIC incomplete IPI
> > > vmexit and the write of the irr_pending in that handler.
> > 
> > Are you sure about this?  Store-to-store ordering should be 
> > guaranteed---if not by the architecture---by existing memory barriers 
> > between vmrun returning and avic_incomplete_ipi_interception().  For 
> > example, srcu_read_lock implies an smp_mb().
> > 
> > Even more damning: no matter what internal black magic the processor 
> > could be using to write to IRR, the processor needs to order the writes 
> > against reads of IsRunning on processors without the erratum.  That 
> > would be equivalent to flushing the store buffer, and it would imply 
> > that the write of vIRR is ordered before the write to irr_pending.
> > 
> > Paolo
> > 
> Yes, I am almost 100% sure now that this patch is wrong.
> The code was just seeing irr_pending true because it is set
> to true while APICv/AVIC is in use, and was not yet seeing the vIRR bits,
> because they hadn't arrived yet. Thus this patch isn't needed.
> 
> Thanks again for help!
> I am testing your version of fixes to avic inhibition races,
> and then I'll send a new version of these patches.
> 
> Best regards,
> 	Maxim Levitsky

And yet that patch is needed for a different reason.

If the sender has AVIC enabled, it can turn on vIRR bits at any moment
without setting irr_pending = true - there are no VM exits happening
on the sender side.

If we scan vIRR here and see no bits, and *then* disable AVIC,
there is a window where they could legitimately be turned on without any CPU erratum,
and we will not have irr_pending == true, and thus the following
KVM_REQ_EVENT will make no difference.

Not touching irr_pending and letting just the KVM_REQ_EVENT do the work
will work too, and if the AVIC erratum is present, slightly reduce
the chances of it happening.

Best regards,
	Maxim Levitsky
Paolo Bonzini Dec. 10, 2021, 1:03 p.m. UTC | #6
On 12/10/21 13:47, Maxim Levitsky wrote:
> If we scan vIRR here and see no bits, and *then* disable AVIC,
> there is a window where they could legitimately be turned on without any CPU erratum,
> and we will not have irr_pending == true, and thus the following
> KVM_REQ_EVENT will make no difference.

Right.

> Not touching irr_pending and letting just the KVM_REQ_EVENT do the work
> will work too,

Yeah, I think that's preferable.  irr_pending == true is a conservative
setting that works; irr_pending will be evaluated again on the first 
call to apic_clear_irr and that's enough.
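
Why a stale irr_pending == true is the safe direction can be illustrated with a toy model. The `hint_*` names and the one-word IRR are assumptions for illustration only; the recompute-on-clear step mirrors the behaviour attributed to apic_clear_irr above and is not a copy of the kernel code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical one-word model of IRR plus the software hint. */
struct hint_apic {
	uint32_t irr;
	bool irr_pending;
};

/* Returns the highest set bit, or -1 when the word is empty. */
static int hint_search_irr(const struct hint_apic *a)
{
	for (int bit = 31; bit >= 0; bit--)
		if (a->irr & (UINT32_C(1) << bit))
			return bit;
	return -1;
}

/* The hint only gates the scan: a stale "true" costs one extra scan. */
static int hint_find_highest_irr(const struct hint_apic *a)
{
	if (!a->irr_pending)
		return -1;
	return hint_search_irr(a);
}

/* Clearing a vector recomputes the hint from what remains in IRR. */
static void hint_clear_irr(struct hint_apic *a, int bit)
{
	assert(bit >= 0 && bit < 32);
	a->irr_pending = false;
	a->irr &= ~(UINT32_C(1) << bit);
	if (hint_search_irr(a) != -1)
		a->irr_pending = true;
}
```

In this model a hint that is wrongly true only triggers a scan that finds nothing, whereas a hint that is wrongly false hides a vector that is actually set; the first clear of a vector brings the hint back in sync.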

With that justification, you don't need to reorder the call to 
kvm_apic_update_apicv to be after kvm_x86_refresh_apicv_exec_ctrl.

Paolo

> and if the avic errata is present, reduce slightly
> the chances of it happening.
Maxim Levitsky Dec. 10, 2021, 1:10 p.m. UTC | #7
On Fri, 2021-12-10 at 14:03 +0100, Paolo Bonzini wrote:
> On 12/10/21 13:47, Maxim Levitsky wrote:
> > If we scan vIRR here and see no bits, and *then* disable AVIC,
> > there is a window where they could legitimately be turned on without any CPU erratum,
> > and we will not have irr_pending == true, and thus the following
> > KVM_REQ_EVENT will make no difference.
> 
> Right.
> 
> > Not touching irr_pending and letting just the KVM_REQ_EVENT do the work
> > will work too,
> 
> Yeah, I think that's preferable.  irr_pending == true is a conservative
> setting that works; irr_pending will be evaluated again on the first 
> call to apic_clear_irr and that's enough.
> 
> With that justification, you don't need to reorder the call to 
> kvm_apic_update_apicv to be after kvm_x86_refresh_apicv_exec_ctrl.

Yes, exactly! But there is no need to scan IRR here, since irr_pending is already
true at that point anyway - it is always true while AVIC is enabled.
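
A minimal sketch of that variant, in which the deactivation path leaves irr_pending alone and only recomputes the ISR count, might look as follows. The `noscan_*` names and the one-word registers are hypothetical; this illustrates the idea discussed here and is not the code that was posted in this series.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical one-word model of the IRR/ISR state. */
struct noscan_apic {
	uint32_t irr;
	uint32_t isr;
	bool irr_pending;
	int isr_count;
};

/* Counts the set bits in one register word. */
static int noscan_count_bits(uint32_t word)
{
	int n = 0;

	for (; word; word &= word - 1)
		n++;
	return n;
}

static void noscan_update_apicv(struct noscan_apic *a, bool apicv_active)
{
	assert(a);
	if (apicv_active) {
		/* The hint is unconditionally true while APICv/AVIC is active. */
		a->irr_pending = true;
		a->isr_count = 1;
	} else {
		/*
		 * No IRR scan: irr_pending was already true while AVIC was
		 * active, so it is left untouched and only the ISR count is
		 * recomputed.
		 */
		a->isr_count = noscan_count_bits(a->isr);
	}
}
```

With this shape there is no scan that could race with hardware setting IRR bits, so the order of the two calls in kvm_vcpu_update_apicv no longer matters for irr_pending.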


Best regards,
	Maxim Levitsky
> 
> Paolo
> 
>   and if the avic errata is present, reduce slightly
> > the chances of it happening.

Patch

diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index c5028e6b0f96..ecd6111b9a0d 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -2314,7 +2314,8 @@  void kvm_apic_update_apicv(struct kvm_vcpu *vcpu)
 		apic->irr_pending = true;
 		apic->isr_count = 1;
 	} else {
-		apic->irr_pending = (apic_search_irr(apic) != -1);
+		if (apic_search_irr(apic) != -1)
+			apic->irr_pending = true;
 		apic->isr_count = count_vectors(apic->regs + APIC_ISR);
 	}
 }
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 26cb3a4cd0e9..ca037ac2ea08 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -9542,8 +9542,8 @@  void kvm_vcpu_update_apicv(struct kvm_vcpu *vcpu)
 		goto out;
 
 	vcpu->arch.apicv_active = activate;
-	kvm_apic_update_apicv(vcpu);
 	static_call(kvm_x86_refresh_apicv_exec_ctrl)(vcpu);
+	kvm_apic_update_apicv(vcpu);
 
 	/*
 	 * When APICv gets disabled, we may still have injected interrupts