[v2,1/3] KVM: PPC: Book3S HV: Fix race in re-enabling XIVE escalation interrupts

Message ID 20190813100349.GD9567@blackberry (mailing list archive)
State New, archived
Series powerpc/xive: Fix race condition leading to host crashes and hangs

Commit Message

Paul Mackerras Aug. 13, 2019, 10:03 a.m. UTC
Escalation interrupts are interrupts sent to the host by the XIVE
hardware when it has an interrupt to deliver to a guest VCPU but that
VCPU is not running anywhere in the system.  Hence we disable the
escalation interrupt for the VCPU being run when we enter the guest
and re-enable it when the guest does an H_CEDE hypercall indicating
it is idle.
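
In XIVE ESB (Event State Buffer) terms, that lifecycle looks roughly like
the C sketch below.  This is an illustration only, not the kernel's code:
the real entry/exit paths are real-mode assembly in
book3s_hv_rmhandlers.S, the struct and helper names here are invented for
the example, and the choice of PQ=01 as the "masked on guest entry" state
comes from the earlier "Keep XIVE escalation interrupt masked unless
ceded" change rather than from this patch.

#include <stdint.h>

/*
 * ESB "special load" offsets, as in arch/powerpc/include/asm/xive-regs.h.
 * An MMIO load at base+offset atomically sets the source's PQ bits and
 * returns the previous PQ state in the low bits of the value read.
 */
#define XIVE_ESB_SET_PQ_00	0xc00	/* enabled: next trigger queues an entry */
#define XIVE_ESB_SET_PQ_01	0xd00	/* masked on guest entry (assumed here) */
#define XIVE_ESB_SET_PQ_10	0xe00	/* "already triggered": no new queue entry */

/* Illustrative stand-in for the relevant vcpu->arch fields */
struct vcpu_sketch {
	volatile uint8_t *xive_esc_mmio;	/* ESB page of the escalation irq */
	int xive_esc_on;			/* vcpu->arch.xive_esc_on */
	int ceded;				/* vcpu->arch.ceded */
};

/* Illustrative helper: perform the ESB special load for this vcpu's
 * escalation interrupt and return the old PQ state. */
static uint64_t esb_load(struct vcpu_sketch *vcpu, uint32_t offset)
{
	return *(volatile uint64_t *)(vcpu->xive_esc_mmio + offset);
}

static void guest_entry_mask_escalation(struct vcpu_sketch *vcpu)
{
	/* The VCPU is about to run: stop the escalation irq from firing */
	esb_load(vcpu, XIVE_ESB_SET_PQ_01);
}

static void h_cede_unmask_escalation(struct vcpu_sketch *vcpu)
{
	/* The VCPU is idle again: let the escalation irq fire */
	esb_load(vcpu, XIVE_ESB_SET_PQ_00);
}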

It is possible that an escalation interrupt gets generated just as we
are entering the guest.  In that case the escalation interrupt may be
using a queue entry in one of the interrupt queues, and that queue
entry may not have been processed when the guest exits with an H_CEDE.
The existing entry code detects this situation and does not clear the
vcpu->arch.xive_esc_on flag as an indication that there is a pending
queue entry (if the queue entry gets processed, xive_esc_irq() will
clear the flag).  There is a comment in the code saying that if the
flag is still set on H_CEDE, we have to abort the cede rather than
re-enabling the escalation interrupt, lest we end up with two
occurrences of the escalation interrupt in the interrupt queue.
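
A rough C paraphrase of that entry-side check, reusing the illustrative
types and esb_load() helper from the sketch above; the detail that the
masking load returns the old PQ state with P in bit 0x2 (XIVE_ESB_VAL_P)
is an assumption of the sketch, not something stated in this patch.

#define XIVE_ESB_VAL_P	0x2	/* P bit of the PQ state returned by the load */

static void guest_entry_check_pending(struct vcpu_sketch *vcpu)
{
	uint64_t old_pq;

	/* Mask the escalation source and fetch its previous PQ state */
	old_pq = esb_load(vcpu, XIVE_ESB_SET_PQ_01);

	/*
	 * If P was already set, the escalation fired as we were entering the
	 * guest and its queue entry may not have been processed yet: leave
	 * xive_esc_on set so the H_CEDE path knows, and let xive_esc_irq()
	 * clear it once the queue entry is actually handled.
	 */
	if (!(old_pq & XIVE_ESB_VAL_P))
		vcpu->xive_esc_on = 0;
}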

However, the exit code doesn't do that; it aborts the cede in the sense
that vcpu->arch.ceded gets cleared, but it still enables the escalation
interrupt by setting the source's PQ bits to 00.  Instead we need to
set the PQ bits to 10, indicating that an interrupt has been triggered.
We also need to avoid setting vcpu->arch.xive_esc_on in this case
(i.e. when vcpu->arch.xive_esc_on is seen to be set on H_CEDE), because
xive_esc_irq() will run at some point and clear it, and if we race with
that we may end up with an incorrect result (i.e. xive_esc_on set when
the escalation interrupt has just been handled).
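
Put together, the fixed H_CEDE exit path amounts to roughly the following
C (again reusing the helpers from the sketches above; the real code is
assembly and additionally distinguishes virtual-mode accesses through
VCPU_XIVE_ESC_VADDR from real-mode cache-inhibited accesses through
VCPU_XIVE_ESC_RADDR, which the sketch glosses over):

static void kvm_cede_exit_sketch(struct vcpu_sketch *vcpu)
{
	uint32_t pq_op;

	if (!vcpu->xive_esc_mmio)	/* not using a XIVE escalation irq */
		return;

	if (vcpu->xive_esc_on) {
		/*
		 * A queue entry may still be outstanding: abort the cede and
		 * set PQ to 10 so that no second entry can be queued.  Do
		 * not touch xive_esc_on; xive_esc_irq() will clear it.
		 */
		vcpu->ceded = 0;
		pq_op = XIVE_ESB_SET_PQ_10;
	} else {
		/* Normal case: re-enable the escalation interrupt */
		vcpu->xive_esc_on = 1;
		/* Make the store to xive_esc_on visible before the ESB load,
		 * i.e. before xive_esc_irq() can run (the assembly uses sync) */
		__sync_synchronize();
		pq_op = XIVE_ESB_SET_PQ_00;
	}

	esb_load(vcpu, pq_op);
}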

It is extremely unlikely that having two queue entries would cause
observable problems; theoretically it could cause queue overflow, but
the CPU would have to have thousands of interrupts targeted to it for
that to be possible.  However, this fix will also make it possible to
determine accurately whether there is an unhandled escalation
interrupt in the queue, which will be needed by the following patch.

Cc: stable@vger.kernel.org # v4.16+
Fixes: 9b9b13a6d153 ("KVM: PPC: Book3S HV: Keep XIVE escalation interrupt masked unless ceded")
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
---
v2: don't set xive_esc_on if we're not using a XIVE escalation
interrupt.

 arch/powerpc/kvm/book3s_hv_rmhandlers.S | 36 +++++++++++++++++++++------------
 1 file changed, 23 insertions(+), 13 deletions(-)

Comments

Jordan Niethe Aug. 14, 2019, 4:46 a.m. UTC | #1
On Tue, 2019-08-13 at 20:03 +1000, Paul Mackerras wrote:
> Escalation interrupts are interrupts sent to the host by the XIVE
> hardware when it has an interrupt to deliver to a guest VCPU but that
> VCPU is not running anywhere in the system.  Hence we disable the
> escalation interrupt for the VCPU being run when we enter the guest
> and re-enable it when the guest does an H_CEDE hypercall indicating
> it is idle.
> 
> It is possible that an escalation interrupt gets generated just as we
> are entering the guest.  In that case the escalation interrupt may be
> using a queue entry in one of the interrupt queues, and that queue
> entry may not have been processed when the guest exits with an
> H_CEDE.
> The existing entry code detects this situation and does not clear the
> vcpu->arch.xive_esc_on flag as an indication that there is a pending
> queue entry (if the queue entry gets processed, xive_esc_irq() will
> clear the flag).  There is a comment in the code saying that if the
> flag is still set on H_CEDE, we have to abort the cede rather than
> re-enabling the escalation interrupt, lest we end up with two
> occurrences of the escalation interrupt in the interrupt queue.
> 
> However, the exit code doesn't do that; it aborts the cede in the
> sense
> that vcpu->arch.ceded gets cleared, but it still enables the
> escalation
> interrupt by setting the source's PQ bits to 00.  Instead we need to
> set the PQ bits to 10, indicating that an interrupt has been
> triggered.
> We also need to avoid setting vcpu->arch.xive_esc_on in this case
> (i.e. vcpu->arch.xive_esc_on seen to be set on H_CEDE) because
> xive_esc_irq() will run at some point and clear it, and if we race
> with
> that we may end up with an incorrect result (i.e. xive_esc_on set
> when
> the escalation interrupt has just been handled).
> 
> It is extremely unlikely that having two queue entries would cause
> observable problems; theoretically it could cause queue overflow, but
> the CPU would have to have thousands of interrupts targeted to it
> for
> that to be possible.  However, this fix will also make it possible to
> determine accurately whether there is an unhandled escalation
> interrupt in the queue, which will be needed by the following patch.
> 
> Cc: stable@vger.kernel.org # v4.16+
> Fixes: 9b9b13a6d153 ("KVM: PPC: Book3S HV: Keep XIVE escalation
> interrupt masked unless ceded")
> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
> ---
> v2: don't set xive_esc_on if we're not using a XIVE escalation
> interrupt.
> 
>  arch/powerpc/kvm/book3s_hv_rmhandlers.S | 36 +++++++++++++++++++++
> ------------
>  1 file changed, 23 insertions(+), 13 deletions(-)
> 
> diff --git a/arch/powerpc/kvm/book3s_hv_rmhandlers.S
> b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
> index 337e644..2e7e788 100644
> --- a/arch/powerpc/kvm/book3s_hv_rmhandlers.S
> +++ b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
> @@ -2831,29 +2831,39 @@ kvm_cede_prodded:
>  kvm_cede_exit:
>  	ld	r9, HSTATE_KVM_VCPU(r13)
>  #ifdef CONFIG_KVM_XICS
> -	/* Abort if we still have a pending escalation */
> +	/* are we using XIVE with single escalation? */
> +	ld	r10, VCPU_XIVE_ESC_VADDR(r9)
> +	cmpdi	r10, 0
> +	beq	3f
> +	li	r6, XIVE_ESB_SET_PQ_00
Would it make sense to put the above instruction down into the 4: label
instead? If we do not branch to 4, r6 is overwritten anyway.
I think that would save a load when we do not branch to 4. Also it
would mean that you could use r5 everywhere instead of changing it to
r6? 
> +	/*
> +	 * If we still have a pending escalation, abort the cede,
> +	 * and we must set PQ to 10 rather than 00 so that we don't
> +	 * potentially end up with two entries for the escalation
> +	 * interrupt in the XIVE interrupt queue.  In that case
> +	 * we also don't want to set xive_esc_on to 1 here in
> +	 * case we race with xive_esc_irq().
> +	 */
>  	lbz	r5, VCPU_XIVE_ESC_ON(r9)
>  	cmpwi	r5, 0
> -	beq	1f
> +	beq	4f
>  	li	r0, 0
>  	stb	r0, VCPU_CEDED(r9)
> -1:	/* Enable XIVE escalation */
> -	li	r5, XIVE_ESB_SET_PQ_00
> +	li	r6, XIVE_ESB_SET_PQ_10
> +	b	5f
> +4:	li	r0, 1
> +	stb	r0, VCPU_XIVE_ESC_ON(r9)
> +	/* make sure store to xive_esc_on is seen before xive_esc_irq
> runs */
> +	sync
> +5:	/* Enable XIVE escalation */
>  	mfmsr	r0
>  	andi.	r0, r0, MSR_DR		/* in real mode? */
>  	beq	1f
> -	ld	r10, VCPU_XIVE_ESC_VADDR(r9)
> -	cmpdi	r10, 0
> -	beq	3f
> -	ldx	r0, r10, r5
> +	ldx	r0, r10, r6
>  	b	2f
>  1:	ld	r10, VCPU_XIVE_ESC_RADDR(r9)
> -	cmpdi	r10, 0
> -	beq	3f
> -	ldcix	r0, r10, r5
> +	ldcix	r0, r10, r6
>  2:	sync
> -	li	r0, 1
> -	stb	r0, VCPU_XIVE_ESC_ON(r9)
>  #endif /* CONFIG_KVM_XICS */
>  3:	b	guest_exit_cont
>
Paul Mackerras Aug. 14, 2019, 6:05 a.m. UTC | #2
On Wed, Aug 14, 2019 at 02:46:38PM +1000, Jordan Niethe wrote:
> On Tue, 2019-08-13 at 20:03 +1000, Paul Mackerras wrote:

[snip]
> > diff --git a/arch/powerpc/kvm/book3s_hv_rmhandlers.S
> > b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
> > index 337e644..2e7e788 100644
> > --- a/arch/powerpc/kvm/book3s_hv_rmhandlers.S
> > +++ b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
> > @@ -2831,29 +2831,39 @@ kvm_cede_prodded:
> >  kvm_cede_exit:
> >  	ld	r9, HSTATE_KVM_VCPU(r13)
> >  #ifdef CONFIG_KVM_XICS
> > -	/* Abort if we still have a pending escalation */
> > +	/* are we using XIVE with single escalation? */
> > +	ld	r10, VCPU_XIVE_ESC_VADDR(r9)
> > +	cmpdi	r10, 0
> > +	beq	3f
> > +	li	r6, XIVE_ESB_SET_PQ_00
> Would it make sense to put the above instruction down into the 4: label
> instead? If we do not branch to 4, r6 is overwritten anyway.

Right.

> I think that would save a load when we do not branch to 4. Also it

Well, li is a load immediate rather than a load ("load" would normally
imply a load from memory).  Load-immediate instructions are
essentially free since they can easily be executed in parallel with
other instructions and execute in a single cycle.

> would mean that you could use r5 everywhere instead of changing it to
> r6? 

Yes.  If I have to respin the patch for other reasons then I will
rearrange things as you suggest.  I don't think it's worth respinning
just for this change -- it won't reduce the total number of
instructions, and I strongly doubt there would be any measurable
performance difference.

> > +	/*
> > +	 * If we still have a pending escalation, abort the cede,
> > +	 * and we must set PQ to 10 rather than 00 so that we don't
> > +	 * potentially end up with two entries for the escalation
> > +	 * interrupt in the XIVE interrupt queue.  In that case
> > +	 * we also don't want to set xive_esc_on to 1 here in
> > +	 * case we race with xive_esc_irq().
> > +	 */
> >  	lbz	r5, VCPU_XIVE_ESC_ON(r9)
> >  	cmpwi	r5, 0
> > -	beq	1f
> > +	beq	4f
> >  	li	r0, 0
> >  	stb	r0, VCPU_CEDED(r9)
> > -1:	/* Enable XIVE escalation */
> > -	li	r5, XIVE_ESB_SET_PQ_00
> > +	li	r6, XIVE_ESB_SET_PQ_10
> > +	b	5f
> > +4:	li	r0, 1
> > +	stb	r0, VCPU_XIVE_ESC_ON(r9)
> > +	/* make sure store to xive_esc_on is seen before xive_esc_irq
> > runs */
> > +	sync
> > +5:	/* Enable XIVE escalation */
> >  	mfmsr	r0
> >  	andi.	r0, r0, MSR_DR		/* in real mode? */
> >  	beq	1f
> > -	ld	r10, VCPU_XIVE_ESC_VADDR(r9)
> > -	cmpdi	r10, 0
> > -	beq	3f
> > -	ldx	r0, r10, r5
> > +	ldx	r0, r10, r6
> >  	b	2f
> >  1:	ld	r10, VCPU_XIVE_ESC_RADDR(r9)
> > -	cmpdi	r10, 0
> > -	beq	3f
> > -	ldcix	r0, r10, r5
> > +	ldcix	r0, r10, r6
> >  2:	sync
> > -	li	r0, 1
> > -	stb	r0, VCPU_XIVE_ESC_ON(r9)
> >  #endif /* CONFIG_KVM_XICS */
> >  3:	b	guest_exit_cont
> >  

Paul.
Michael Ellerman Aug. 22, 2019, 10:46 a.m. UTC | #3
On Tue, 2019-08-13 at 10:03:49 UTC, Paul Mackerras wrote:
> Escalation interrupts are interrupts sent to the host by the XIVE
> hardware when it has an interrupt to deliver to a guest VCPU but that
> VCPU is not running anywhere in the system.  Hence we disable the
> escalation interrupt for the VCPU being run when we enter the guest
> and re-enable it when the guest does an H_CEDE hypercall indicating
> it is idle.
> 
> It is possible that an escalation interrupt gets generated just as we
> are entering the guest.  In that case the escalation interrupt may be
> using a queue entry in one of the interrupt queues, and that queue
> entry may not have been processed when the guest exits with an H_CEDE.
> The existing entry code detects this situation and does not clear the
> vcpu->arch.xive_esc_on flag as an indication that there is a pending
> queue entry (if the queue entry gets processed, xive_esc_irq() will
> clear the flag).  There is a comment in the code saying that if the
> flag is still set on H_CEDE, we have to abort the cede rather than
> re-enabling the escalation interrupt, lest we end up with two
> occurrences of the escalation interrupt in the interrupt queue.
> 
> However, the exit code doesn't do that; it aborts the cede in the sense
> that vcpu->arch.ceded gets cleared, but it still enables the escalation
> interrupt by setting the source's PQ bits to 00.  Instead we need to
> set the PQ bits to 10, indicating that an interrupt has been triggered.
> We also need to avoid setting vcpu->arch.xive_esc_on in this case
> (i.e. vcpu->arch.xive_esc_on seen to be set on H_CEDE) because
> xive_esc_irq() will run at some point and clear it, and if we race with
> that we may end up with an incorrect result (i.e. xive_esc_on set when
> the escalation interrupt has just been handled).
> 
> It is extremely unlikely that having two queue entries would cause
> observable problems; theoretically it could cause queue overflow, but
> the CPU would have to have thousands of interrupts targeted to it for
> that to be possible.  However, this fix will also make it possible to
> determine accurately whether there is an unhandled escalation
> interrupt in the queue, which will be needed by the following patch.
> 
> Cc: stable@vger.kernel.org # v4.16+
> Fixes: 9b9b13a6d153 ("KVM: PPC: Book3S HV: Keep XIVE escalation interrupt masked unless ceded")
> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>

Applied to powerpc topic/ppc-kvm, thanks.

https://git.kernel.org/powerpc/c/959c5d5134786b4988b6fdd08e444aa67d1667ed

cheers
diff mbox series

Patch

diff --git a/arch/powerpc/kvm/book3s_hv_rmhandlers.S b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
index 337e644..2e7e788 100644
--- a/arch/powerpc/kvm/book3s_hv_rmhandlers.S
+++ b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
@@ -2831,29 +2831,39 @@  kvm_cede_prodded:
 kvm_cede_exit:
 	ld	r9, HSTATE_KVM_VCPU(r13)
 #ifdef CONFIG_KVM_XICS
-	/* Abort if we still have a pending escalation */
+	/* are we using XIVE with single escalation? */
+	ld	r10, VCPU_XIVE_ESC_VADDR(r9)
+	cmpdi	r10, 0
+	beq	3f
+	li	r6, XIVE_ESB_SET_PQ_00
+	/*
+	 * If we still have a pending escalation, abort the cede,
+	 * and we must set PQ to 10 rather than 00 so that we don't
+	 * potentially end up with two entries for the escalation
+	 * interrupt in the XIVE interrupt queue.  In that case
+	 * we also don't want to set xive_esc_on to 1 here in
+	 * case we race with xive_esc_irq().
+	 */
 	lbz	r5, VCPU_XIVE_ESC_ON(r9)
 	cmpwi	r5, 0
-	beq	1f
+	beq	4f
 	li	r0, 0
 	stb	r0, VCPU_CEDED(r9)
-1:	/* Enable XIVE escalation */
-	li	r5, XIVE_ESB_SET_PQ_00
+	li	r6, XIVE_ESB_SET_PQ_10
+	b	5f
+4:	li	r0, 1
+	stb	r0, VCPU_XIVE_ESC_ON(r9)
+	/* make sure store to xive_esc_on is seen before xive_esc_irq runs */
+	sync
+5:	/* Enable XIVE escalation */
 	mfmsr	r0
 	andi.	r0, r0, MSR_DR		/* in real mode? */
 	beq	1f
-	ld	r10, VCPU_XIVE_ESC_VADDR(r9)
-	cmpdi	r10, 0
-	beq	3f
-	ldx	r0, r10, r5
+	ldx	r0, r10, r6
 	b	2f
 1:	ld	r10, VCPU_XIVE_ESC_RADDR(r9)
-	cmpdi	r10, 0
-	beq	3f
-	ldcix	r0, r10, r5
+	ldcix	r0, r10, r6
 2:	sync
-	li	r0, 1
-	stb	r0, VCPU_XIVE_ESC_ON(r9)
 #endif /* CONFIG_KVM_XICS */
 3:	b	guest_exit_cont