diff mbox series

[v2] KVM: PPC: Book3S HV: Mask off LPCR_MER for a vCPU before running it to avoid spurious interrupts

Message ID 20241024173417.95395-1-gautam@linux.ibm.com (mailing list archive)
State New
Headers show
Series [v2] KVM: PPC: Book3S HV: Mask off LPCR_MER for a vCPU before running it to avoid spurious interrupts | expand

Commit Message

Gautam Menghani Oct. 24, 2024, 5:34 p.m. UTC
Mask off the LPCR_MER bit before running a vCPU to ensure that it is not
set if there are no pending interrupts. Running a vCPU with LPCR_MER bit
set and no pending interrupts results in L2 vCPU getting an infinite flood
of spurious interrupts. The 'if check' in kvmhv_run_single_vcpu() sets
the LPCR_MER bit if there are pending interrupts.

The spurious flood problem can be observed in 2 cases:
1. Crashing the guest while interrupt heavy workload is running
  a. Start a L2 guest and run an interrupt heavy workload (eg: ipistorm)
  b. While the workload is running, crash the guest (make sure kdump
     is configured)
  c. Any one of the vCPUs of the guest will start getting an infinite
     flood of spurious interrupts.

2. Running LTP stress tests in multiple guests at the same time
   a. Start 4 L2 guests.
   b. Start running LTP stress tests on all 4 guests at same time.
   c. In some time, any one/more of the vCPUs of any of the guests will
      start getting an infinite flood of spurious interrupts.

The root cause of both the above issues is the same:
1. A NMI is sent to a running vCPU that has LPCR_MER bit set.
2. In the NMI path, all registers are refreshed, i.e, H_GUEST_GET_STATE
   is called for all the registers.
3. When H_GUEST_GET_STATE is called for lpcr, the vcpu->arch.vcore->lpcr
   of that vCPU at L1 level gets updated with LPCR_MER set to 1, and this
   new value is always used whenever that vCPU runs, regardless of whether
   there was a pending interrupt.
4. Since LPCR_MER is set, the vCPU in L2 always jumps to the external
   interrupt handler, and this cycle never ends.

Fix the spurious flood by making sure a vCPU's LPCR_MER is always masked
before running a vCPU.

Fixes: ec0f6639fa88 ("KVM: PPC: Book3S HV nestedv2: Ensure LPCR_MER bit is passed to the L0")
Cc: stable@vger.kernel.org # v6.8+
Signed-off-by: Gautam Menghani <gautam@linux.ibm.com>
---
V1 -> V2:
1. Mask off the LPCR_MER in vcpu->arch.vcore->lpcr instead of resetting
it so that we avoid grabbing vcpu->arch.vcore->lock. (Suggested by
Ritesh in an internal review)

 arch/powerpc/kvm/book3s_hv.c | 12 +++++++++++-
 1 file changed, 11 insertions(+), 1 deletion(-)

Comments

Michael Ellerman Oct. 25, 2024, 3:56 a.m. UTC | #1
Hi Gautam,

A few comments below ...

Gautam Menghani <gautam@linux.ibm.com> writes:
> Mask off the LPCR_MER bit before running a vCPU to ensure that it is not
> set if there are no pending interrupts.

I would typically leave this until the end of the change log. ie.
describe the bug and how it happens first, then the fix at the end.

But it's not a hard rule, so up to you.

> Running a vCPU with LPCR_MER bit
            ^
        "an L2 vCPU"

In general if you can qualify L0 vs L1 vs L2 everywhere it would help
folks follow the description.

> set and no pending interrupts results in L2 vCPU getting an infinite flood
> of spurious interrupts. The 'if check' in kvmhv_run_single_vcpu() sets
> the LPCR_MER bit if there are pending interrupts.
>
> The spurious flood problem can be observed in 2 cases:
> 1. Crashing the guest while interrupt heavy workload is running
>   a. Start a L2 guest and run an interrupt heavy workload (eg: ipistorm)
>   b. While the workload is running, crash the guest (make sure kdump
>      is configured)
>   c. Any one of the vCPUs of the guest will start getting an infinite
>      flood of spurious interrupts.
>
> 2. Running LTP stress tests in multiple guests at the same time
>    a. Start 4 L2 guests.
>    b. Start running LTP stress tests on all 4 guests at same time.
>    c. In some time, any one/more of the vCPUs of any of the guests will
>       start getting an infinite flood of spurious interrupts.
>
> The root cause of both the above issues is the same:
> 1. A NMI is sent to a running vCPU that has LPCR_MER bit set.
> 2. In the NMI path, all registers are refreshed, i.e, H_GUEST_GET_STATE
>    is called for all the registers.
> 3. When H_GUEST_GET_STATE is called for lpcr, the vcpu->arch.vcore->lpcr
>    of that vCPU at L1 level gets updated with LPCR_MER set to 1, and this
>    new value is always used whenever that vCPU runs, regardless of whether
>    there was a pending interrupt.
> 4. Since LPCR_MER is set, the vCPU in L2 always jumps to the external
>    interrupt handler, and this cycle never ends.
>
> Fix the spurious flood by making sure a vCPU's LPCR_MER is always masked
> before running a vCPU.

I think your original sentence at the top of the change log is actually more
accurate. ie. it's not that LPCR_MER is always cleared, it's cleared
*unless there's a pending interrupt*.

> Fixes: ec0f6639fa88 ("KVM: PPC: Book3S HV nestedv2: Ensure LPCR_MER bit is passed to the L0")
> Cc: stable@vger.kernel.org # v6.8+
> Signed-off-by: Gautam Menghani <gautam@linux.ibm.com>
> ---
> V1 -> V2:
> 1. Mask off the LPCR_MER in vcpu->arch.vcore->lpcr instead of resetting
> it so that we avoid grabbing vcpu->arch.vcore->lock. (Suggested by
> Ritesh in an internal review)

Did v1 take the vcore->lock? I don't remember it.

> diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
> index 8f7d7e37bc8c..b8701b5dde50 100644
> --- a/arch/powerpc/kvm/book3s_hv.c
> +++ b/arch/powerpc/kvm/book3s_hv.c
> @@ -5089,9 +5089,19 @@ static int kvmppc_vcpu_run_hv(struct kvm_vcpu *vcpu)
>  
>  	do {
>  		accumulate_time(vcpu, &vcpu->arch.guest_entry);
> +		/*
> +		 * L1's copy of L2's lpcr (vcpu->arch.vcore->lpcr) can get its MER bit
                                     ^
                                     LPCR
> +		 * unexpectedly set - for e.g. during NMI handling when all register
> +		 * states are synchronized from L0 to L1. L1 needs to inform L0 about
> +		 * MER=1 only when there are pending external interrupts.
> +		 * kvmhv_run_single_vcpu() anyway sets MER bit if there are pending
> +		 * external interrupts. Hence, mask off MER bit when passing vcore->lpcr
> +		 * here as otherwise it may generate spurious interrupts in L2 KVM
> +		 * causing an endless loop, which results in L2 guest getting hung.
> +		 */
>  		if (cpu_has_feature(CPU_FTR_ARCH_300))
>  			r = kvmhv_run_single_vcpu(vcpu, ~(u64)0,
> -						  vcpu->arch.vcore->lpcr);
> +						  vcpu->arch.vcore->lpcr & ~LPCR_MER);
 
This is much better than v1 which hid the clearing of LPCR_MER in a macro.

But I still wonder if it would be better to clear it in
kvmhv_run_single_vcpu() itself.

The logic to set LPCR_MER is already in there, so why not ensure
LPCR_MER is cleared as part of that some block?

I realise there's another caller of kvmhv_run_single_vcpu() from the
nested code, but that's OK because there's already a nested check in
kvmhv_run_single_vcpu(), so you can still isolate this change to just
the non-nested case.

cheers
Ritesh Harjani (IBM) Oct. 25, 2024, 6:07 a.m. UTC | #2
Gautam Menghani <gautam@linux.ibm.com> writes:

> Mask off the LPCR_MER bit before running a vCPU to ensure that it is not
> set if there are no pending interrupts. Running a vCPU with LPCR_MER bit
> set and no pending interrupts results in L2 vCPU getting an infinite flood
> of spurious interrupts. The 'if check' in kvmhv_run_single_vcpu() sets
> the LPCR_MER bit if there are pending interrupts.
>
> The spurious flood problem can be observed in 2 cases:
> 1. Crashing the guest while interrupt heavy workload is running
>   a. Start a L2 guest and run an interrupt heavy workload (eg: ipistorm)
>   b. While the workload is running, crash the guest (make sure kdump
>      is configured)
>   c. Any one of the vCPUs of the guest will start getting an infinite
>      flood of spurious interrupts.
>
> 2. Running LTP stress tests in multiple guests at the same time
>    a. Start 4 L2 guests.
>    b. Start running LTP stress tests on all 4 guests at same time.
>    c. In some time, any one/more of the vCPUs of any of the guests will
>       start getting an infinite flood of spurious interrupts.
>
> The root cause of both the above issues is the same:
> 1. A NMI is sent to a running vCPU that has LPCR_MER bit set.
> 2. In the NMI path, all registers are refreshed, i.e, H_GUEST_GET_STATE
>    is called for all the registers.
> 3. When H_GUEST_GET_STATE is called for lpcr, the vcpu->arch.vcore->lpcr
>    of that vCPU at L1 level gets updated with LPCR_MER set to 1, and this
>    new value is always used whenever that vCPU runs, regardless of whether
>    there was a pending interrupt.
> 4. Since LPCR_MER is set, the vCPU in L2 always jumps to the external
>    interrupt handler, and this cycle never ends.
>
> Fix the spurious flood by making sure a vCPU's LPCR_MER is always masked
> before running a vCPU.
>
> Fixes: ec0f6639fa88 ("KVM: PPC: Book3S HV nestedv2: Ensure LPCR_MER bit is passed to the L0")
> Cc: stable@vger.kernel.org # v6.8+
> Signed-off-by: Gautam Menghani <gautam@linux.ibm.com>
> ---
> V1 -> V2:
> 1. Mask off the LPCR_MER in vcpu->arch.vcore->lpcr instead of resetting
> it so that we avoid grabbing vcpu->arch.vcore->lock. (Suggested by
> Ritesh in an internal review)

Thanks Gautam for addressing the review comment. But let me improve the
changelog a little to make it more accurate for others too.

Removed the macro which was silently clearing LPCR_MER bit from vcore->lpcr
and instead just mask it off while sending it to kvmhv_run_single_vcpu().
Added an inline comment describing the reason to avoid anyone tipping
it over. - (suggested ...)


Yes, that would also mean that no need of taking any vcore lock since we
are not modifying any of the vcore state variables which came up in the
internal review discussion.
Having said that it will be good to document the usage of vcore->lock
above the struct kvmppc_vcore definition. Because it isn't obvious of
when all it should be taken and/or what all it protects?

>
>  arch/powerpc/kvm/book3s_hv.c | 12 +++++++++++-
>  1 file changed, 11 insertions(+), 1 deletion(-)
>
> diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
> index 8f7d7e37bc8c..b8701b5dde50 100644
> --- a/arch/powerpc/kvm/book3s_hv.c
> +++ b/arch/powerpc/kvm/book3s_hv.c
> @@ -5089,9 +5089,19 @@ static int kvmppc_vcpu_run_hv(struct kvm_vcpu *vcpu)
>  
>  	do {
>  		accumulate_time(vcpu, &vcpu->arch.guest_entry);
> +		/*
> +		 * L1's copy of L2's lpcr (vcpu->arch.vcore->lpcr) can get its MER bit
> +		 * unexpectedly set - for e.g. during NMI handling when all register
> +		 * states are synchronized from L0 to L1. L1 needs to inform L0 about
> +		 * MER=1 only when there are pending external interrupts.
> +		 * kvmhv_run_single_vcpu() anyway sets MER bit if there are pending
> +		 * external interrupts. Hence, mask off MER bit when passing vcore->lpcr
> +		 * here as otherwise it may generate spurious interrupts in L2 KVM
> +		 * causing an endless loop, which results in L2 guest getting hung.
> +		 */

Thanks for describing this inline.

>  		if (cpu_has_feature(CPU_FTR_ARCH_300))
>  			r = kvmhv_run_single_vcpu(vcpu, ~(u64)0,
> -						  vcpu->arch.vcore->lpcr);
> +						  vcpu->arch.vcore->lpcr & ~LPCR_MER);

While still at it - 
I too like mpe suggestion to clear the LPCR_MER bit at one place
itself within kvmhv_run_single_vcpu(). 

>  		else
>  			r = kvmppc_run_vcpu(vcpu);
>  
> -- 
> 2.47.0

-ritesh
Gautam Menghani Oct. 25, 2024, 2:22 p.m. UTC | #3
On Fri, Oct 25, 2024 at 02:56:05PM +1100, Michael Ellerman wrote:
> Hi Gautam,
> 
> A few comments below ...
> 
> Gautam Menghani <gautam@linux.ibm.com> writes:
> > Mask off the LPCR_MER bit before running a vCPU to ensure that it is not
> > set if there are no pending interrupts.
> 
> I would typically leave this until the end of the change log. ie.
> describe the bug and how it happens first, then the fix at the end.
> 
> But it's not a hard rule, so up to you.

Yes agreed, that would make more sense.

> 
> > Running a vCPU with LPCR_MER bit
>             ^
>         "an L2 vCPU"
> 
> In general if you can qualify L0 vs L1 vs L2 everywhere it would help
> folks follow the description.

yes will add it in v3
> 
> > set and no pending interrupts results in L2 vCPU getting an infinite flood
> > of spurious interrupts. The 'if check' in kvmhv_run_single_vcpu() sets
> > the LPCR_MER bit if there are pending interrupts.
> >
> > The spurious flood problem can be observed in 2 cases:
> > 1. Crashing the guest while interrupt heavy workload is running
> >   a. Start a L2 guest and run an interrupt heavy workload (eg: ipistorm)
> >   b. While the workload is running, crash the guest (make sure kdump
> >      is configured)
> >   c. Any one of the vCPUs of the guest will start getting an infinite
> >      flood of spurious interrupts.
> >
> > 2. Running LTP stress tests in multiple guests at the same time
> >    a. Start 4 L2 guests.
> >    b. Start running LTP stress tests on all 4 guests at same time.
> >    c. In some time, any one/more of the vCPUs of any of the guests will
> >       start getting an infinite flood of spurious interrupts.
> >
> > The root cause of both the above issues is the same:
> > 1. A NMI is sent to a running vCPU that has LPCR_MER bit set.
> > 2. In the NMI path, all registers are refreshed, i.e, H_GUEST_GET_STATE
> >    is called for all the registers.
> > 3. When H_GUEST_GET_STATE is called for lpcr, the vcpu->arch.vcore->lpcr
> >    of that vCPU at L1 level gets updated with LPCR_MER set to 1, and this
> >    new value is always used whenever that vCPU runs, regardless of whether
> >    there was a pending interrupt.
> > 4. Since LPCR_MER is set, the vCPU in L2 always jumps to the external
> >    interrupt handler, and this cycle never ends.
> >
> > Fix the spurious flood by making sure a vCPU's LPCR_MER is always masked
> > before running a vCPU.
> 
> I think your original sentence at the top of the change log is actually more
> accurate. ie. it's not that LPCR_MER is always cleared, it's cleared
> *unless there's a pending interrupt*.

Yes agreed
> 
> > Fixes: ec0f6639fa88 ("KVM: PPC: Book3S HV nestedv2: Ensure LPCR_MER bit is passed to the L0")
> > Cc: stable@vger.kernel.org # v6.8+
> > Signed-off-by: Gautam Menghani <gautam@linux.ibm.com>
> > ---
> > V1 -> V2:
> > 1. Mask off the LPCR_MER in vcpu->arch.vcore->lpcr instead of resetting
> > it so that we avoid grabbing vcpu->arch.vcore->lock. (Suggested by
> > Ritesh in an internal review)
> 
> Did v1 take the vcore->lock? I don't remember it.

No v1 did not take a lock, but ideally was supposed to take a lock. I
missed the locking part there.
> 
> > diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
> > index 8f7d7e37bc8c..b8701b5dde50 100644
> > --- a/arch/powerpc/kvm/book3s_hv.c
> > +++ b/arch/powerpc/kvm/book3s_hv.c
> > @@ -5089,9 +5089,19 @@ static int kvmppc_vcpu_run_hv(struct kvm_vcpu *vcpu)
> >  
> >  	do {
> >  		accumulate_time(vcpu, &vcpu->arch.guest_entry);
> > +		/*
> > +		 * L1's copy of L2's lpcr (vcpu->arch.vcore->lpcr) can get its MER bit
>                                      ^
>                                      LPCR

Ack. 
> > +		 * unexpectedly set - for e.g. during NMI handling when all register
> > +		 * states are synchronized from L0 to L1. L1 needs to inform L0 about
> > +		 * MER=1 only when there are pending external interrupts.
> > +		 * kvmhv_run_single_vcpu() anyway sets MER bit if there are pending
> > +		 * external interrupts. Hence, mask off MER bit when passing vcore->lpcr
> > +		 * here as otherwise it may generate spurious interrupts in L2 KVM
> > +		 * causing an endless loop, which results in L2 guest getting hung.
> > +		 */
> >  		if (cpu_has_feature(CPU_FTR_ARCH_300))
> >  			r = kvmhv_run_single_vcpu(vcpu, ~(u64)0,
> > -						  vcpu->arch.vcore->lpcr);
> > +						  vcpu->arch.vcore->lpcr & ~LPCR_MER);
>  
> This is much better than v1 which hid the clearing of LPCR_MER in a macro.
> 
> But I still wonder if it would be better to clear it in
> kvmhv_run_single_vcpu() itself.
> 
> The logic to set LPCR_MER is already in there, so why not ensure
> LPCR_MER is cleared as part of that some block?
> 
> I realise there's another caller of kvmhv_run_single_vcpu() from the
> nested code, but that's OK because there's already a nested check in
> kvmhv_run_single_vcpu(), so you can still isolate this change to just
> the non-nested case.
> 

Yes it would be better to mask off LPCR_MER inside
kvmhv_run_single_vcpu(), will make that change and send v3. 

> cheers

Thanks,
Gautam
Gautam Menghani Oct. 25, 2024, 2:30 p.m. UTC | #4
On Fri, Oct 25, 2024 at 11:37:52AM +0530, Ritesh Harjani (IBM) wrote:
> Gautam Menghani <gautam@linux.ibm.com> writes:
> 
> > Mask off the LPCR_MER bit before running a vCPU to ensure that it is not
> > set if there are no pending interrupts. Running a vCPU with LPCR_MER bit
> > set and no pending interrupts results in L2 vCPU getting an infinite flood
> > of spurious interrupts. The 'if check' in kvmhv_run_single_vcpu() sets
> > the LPCR_MER bit if there are pending interrupts.
> >
> > The spurious flood problem can be observed in 2 cases:
> > 1. Crashing the guest while interrupt heavy workload is running
> >   a. Start a L2 guest and run an interrupt heavy workload (eg: ipistorm)
> >   b. While the workload is running, crash the guest (make sure kdump
> >      is configured)
> >   c. Any one of the vCPUs of the guest will start getting an infinite
> >      flood of spurious interrupts.
> >
> > 2. Running LTP stress tests in multiple guests at the same time
> >    a. Start 4 L2 guests.
> >    b. Start running LTP stress tests on all 4 guests at same time.
> >    c. In some time, any one/more of the vCPUs of any of the guests will
> >       start getting an infinite flood of spurious interrupts.
> >
> > The root cause of both the above issues is the same:
> > 1. A NMI is sent to a running vCPU that has LPCR_MER bit set.
> > 2. In the NMI path, all registers are refreshed, i.e, H_GUEST_GET_STATE
> >    is called for all the registers.
> > 3. When H_GUEST_GET_STATE is called for lpcr, the vcpu->arch.vcore->lpcr
> >    of that vCPU at L1 level gets updated with LPCR_MER set to 1, and this
> >    new value is always used whenever that vCPU runs, regardless of whether
> >    there was a pending interrupt.
> > 4. Since LPCR_MER is set, the vCPU in L2 always jumps to the external
> >    interrupt handler, and this cycle never ends.
> >
> > Fix the spurious flood by making sure a vCPU's LPCR_MER is always masked
> > before running a vCPU.
> >
> > Fixes: ec0f6639fa88 ("KVM: PPC: Book3S HV nestedv2: Ensure LPCR_MER bit is passed to the L0")
> > Cc: stable@vger.kernel.org # v6.8+
> > Signed-off-by: Gautam Menghani <gautam@linux.ibm.com>
> > ---
> > V1 -> V2:
> > 1. Mask off the LPCR_MER in vcpu->arch.vcore->lpcr instead of resetting
> > it so that we avoid grabbing vcpu->arch.vcore->lock. (Suggested by
> > Ritesh in an internal review)
> 
> Thanks Gautam for addressing the review comment. But let me improve the
> changelog a little to make it more accurate for others too.
> 
> Removed the macro which was silently clearing LPCR_MER bit from vcore->lpcr
> and instead just mask it off while sending it to kvmhv_run_single_vcpu().
> Added an inline comment describing the reason to avoid anyone tipping
> it over. - (suggested ...)

As suggested by mpe, it would be better to clear LPCR_MER inside
kvmhv_run_single_vcpu() where we check for pending interrupts. But I'll
add the rest of your changelog suggestion in v3.
> 
> 
> Yes, that would also mean that no need of taking any vcore lock since we
> are not modifying any of the vcore state variables which came up in the
> internal review discussion.
> Having said that it will be good to document the usage of vcore->lock
> above the struct kvmppc_vcore definition. Because it isn't obvious of
> when all it should be taken and/or what all it protects?

Yes agreed, I'll send a separate patch documenting this.
> 
> >
> >  arch/powerpc/kvm/book3s_hv.c | 12 +++++++++++-
> >  1 file changed, 11 insertions(+), 1 deletion(-)
> >
> > diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
> > index 8f7d7e37bc8c..b8701b5dde50 100644
> > --- a/arch/powerpc/kvm/book3s_hv.c
> > +++ b/arch/powerpc/kvm/book3s_hv.c
> > @@ -5089,9 +5089,19 @@ static int kvmppc_vcpu_run_hv(struct kvm_vcpu *vcpu)
> >  
> >  	do {
> >  		accumulate_time(vcpu, &vcpu->arch.guest_entry);
> > +		/*
> > +		 * L1's copy of L2's lpcr (vcpu->arch.vcore->lpcr) can get its MER bit
> > +		 * unexpectedly set - for e.g. during NMI handling when all register
> > +		 * states are synchronized from L0 to L1. L1 needs to inform L0 about
> > +		 * MER=1 only when there are pending external interrupts.
> > +		 * kvmhv_run_single_vcpu() anyway sets MER bit if there are pending
> > +		 * external interrupts. Hence, mask off MER bit when passing vcore->lpcr
> > +		 * here as otherwise it may generate spurious interrupts in L2 KVM
> > +		 * causing an endless loop, which results in L2 guest getting hung.
> > +		 */
> 
> Thanks for describing this inline.
> 
> >  		if (cpu_has_feature(CPU_FTR_ARCH_300))
> >  			r = kvmhv_run_single_vcpu(vcpu, ~(u64)0,
> > -						  vcpu->arch.vcore->lpcr);
> > +						  vcpu->arch.vcore->lpcr & ~LPCR_MER);
> 
> While still at it - 
> I too like mpe suggestion to clear the LPCR_MER bit at one place
> itself within kvmhv_run_single_vcpu(). 

Yes will send v3.
> 
> >  		else
> >  			r = kvmppc_run_vcpu(vcpu);
> >  
> > -- 
> > 2.47.0
> 
> -ritesh

Thanks,
Gautam
diff mbox series

Patch

diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 8f7d7e37bc8c..b8701b5dde50 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -5089,9 +5089,19 @@  static int kvmppc_vcpu_run_hv(struct kvm_vcpu *vcpu)
 
 	do {
 		accumulate_time(vcpu, &vcpu->arch.guest_entry);
+		/*
+		 * L1's copy of L2's lpcr (vcpu->arch.vcore->lpcr) can get its MER bit
+		 * unexpectedly set - for e.g. during NMI handling when all register
+		 * states are synchronized from L0 to L1. L1 needs to inform L0 about
+		 * MER=1 only when there are pending external interrupts.
+		 * kvmhv_run_single_vcpu() anyway sets MER bit if there are pending
+		 * external interrupts. Hence, mask off MER bit when passing vcore->lpcr
+		 * here as otherwise it may generate spurious interrupts in L2 KVM
+		 * causing an endless loop, which results in L2 guest getting hung.
+		 */
 		if (cpu_has_feature(CPU_FTR_ARCH_300))
 			r = kvmhv_run_single_vcpu(vcpu, ~(u64)0,
-						  vcpu->arch.vcore->lpcr);
+						  vcpu->arch.vcore->lpcr & ~LPCR_MER);
 		else
 			r = kvmppc_run_vcpu(vcpu);