diff mbox

KVM: PPC: Exit guest upon fatal machine check exception

Message ID 20151111165845.3721.98296.stgit@aravindap (mailing list archive)
State New, archived
Headers show

Commit Message

Aravinda Prasad Nov. 11, 2015, 4:58 p.m. UTC
This patch modifies KVM to cause a guest exit with
KVM_EXIT_NMI instead of immediately delivering a 0x200
interrupt to guest upon machine check exception in
guest address. Exiting the guest enables QEMU to build
error log and deliver machine check exception to guest
OS (either via guest OS registered machine check
handler or via 0x200 guest OS interrupt vector).

This approach simplifies the delivering of machine
check exception to guest OS compared to the earlier approach
of KVM directly invoking 0x200 guest interrupt vector.
In the earlier approach QEMU patched the 0x200 interrupt
vector during boot. The patched code at 0x200 issued a
private hcall to pass the control to QEMU to build the
error log.

This design/approach is based on the feedback for the
QEMU patches to handle machine check exception. Details
of earlier approach of handling machine check exception
in QEMU and related discussions can be found at:

https://lists.nongnu.org/archive/html/qemu-devel/2014-11/msg00813.html

Signed-off-by: Aravinda Prasad <aravinda@linux.vnet.ibm.com>
---
 arch/powerpc/kvm/book3s_hv.c            |   12 +++---------
 arch/powerpc/kvm/book3s_hv_rmhandlers.S |   31 +++++++++++--------------------
 2 files changed, 14 insertions(+), 29 deletions(-)


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

Daniel Axtens Nov. 12, 2015, 2:24 a.m. UTC | #1
Aravinda Prasad <aravinda@linux.vnet.ibm.com> writes:

> This patch modifies KVM to cause a guest exit with
> KVM_EXIT_NMI instead of immediately delivering a 0x200
> interrupt to guest upon machine check exception in
> guest address. Exiting the guest enables QEMU to build
> error log and deliver machine check exception to guest
> OS (either via guest OS registered machine check
> handler or via 0x200 guest OS interrupt vector).
>
> This approach simplifies the delivering of machine
> check exception to guest OS compared to the earlier approach
> of KVM directly invoking 0x200 guest interrupt vector.
> In the earlier approach QEMU patched the 0x200 interrupt
> vector during boot. The patched code at 0x200 issued a
> private hcall to pass the control to QEMU to build the
> error log.
>
> This design/approach is based on the feedback for the
> QEMU patches to handle machine check exception. Details
> of earlier approach of handling machine check exception
> in QEMU and related discussions can be found at:
>
> https://lists.nongnu.org/archive/html/qemu-devel/2014-11/msg00813.html

I've poked at the MCE code, but not the KVM MCE code, so I may be
mistaken here, but I'm not clear on how this handles errors that the
guest can recover without terminating.

For example, a Linux guest can handle a UE in guest userspace by killing
the guest process. A hypthetical non-linux guest with a microkernel
could even survive UEs in drivers.

It sounds from your patch like you're changing this behaviour. Is this
right?

Regards,
Daniel

>
> Signed-off-by: Aravinda Prasad <aravinda@linux.vnet.ibm.com>
> ---
>  arch/powerpc/kvm/book3s_hv.c            |   12 +++---------
>  arch/powerpc/kvm/book3s_hv_rmhandlers.S |   31 +++++++++++--------------------
>  2 files changed, 14 insertions(+), 29 deletions(-)
>
> diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
> index 2280497..1b1dff0 100644
> --- a/arch/powerpc/kvm/book3s_hv.c
> +++ b/arch/powerpc/kvm/book3s_hv.c
> @@ -859,15 +859,9 @@ static int kvmppc_handle_exit_hv(struct kvm_run *run, struct kvm_vcpu *vcpu,
>  		r = RESUME_GUEST;
>  		break;
>  	case BOOK3S_INTERRUPT_MACHINE_CHECK:
> -		/*
> -		 * Deliver a machine check interrupt to the guest.
> -		 * We have to do this, even if the host has handled the
> -		 * machine check, because machine checks use SRR0/1 and
> -		 * the interrupt might have trashed guest state in them.
> -		 */
> -		kvmppc_book3s_queue_irqprio(vcpu,
> -					    BOOK3S_INTERRUPT_MACHINE_CHECK);
> -		r = RESUME_GUEST;
> +		/* Exit to guest with KVM_EXIT_NMI as exit reason */
> +		run->exit_reason = KVM_EXIT_NMI;
> +		r = RESUME_HOST;
>  		break;
>  	case BOOK3S_INTERRUPT_PROGRAM:
>  	{
> diff --git a/arch/powerpc/kvm/book3s_hv_rmhandlers.S b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
> index b98889e..672b4f6 100644
> --- a/arch/powerpc/kvm/book3s_hv_rmhandlers.S
> +++ b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
> @@ -147,7 +147,6 @@ END_FTR_SECTION_IFSET(CPU_FTR_ARCH_207S)
>  	addi	r1, r1, 112
>  	ld	r7, HSTATE_HOST_MSR(r13)
>  
> -	cmpwi	cr1, r12, BOOK3S_INTERRUPT_MACHINE_CHECK
>  	cmpwi	r12, BOOK3S_INTERRUPT_EXTERNAL
>  	beq	11f
>  	cmpwi	cr2, r12, BOOK3S_INTERRUPT_HMI
> @@ -160,7 +159,6 @@ END_FTR_SECTION_IFSET(CPU_FTR_ARCH_207S)
>  	mtmsrd	r6, 1			/* Clear RI in MSR */
>  	mtsrr0	r8
>  	mtsrr1	r7
> -	beq	cr1, 13f		/* machine check */
>  	RFI
>  
>  	/* On POWER7, we have external interrupts set to use HSRR0/1 */
> @@ -168,8 +166,6 @@ END_FTR_SECTION_IFSET(CPU_FTR_ARCH_207S)
>  	mtspr	SPRN_HSRR1, r7
>  	ba	0x500
>  
> -13:	b	machine_check_fwnmi
> -
>  14:	mtspr	SPRN_HSRR0, r8
>  	mtspr	SPRN_HSRR1, r7
>  	b	hmi_exception_after_realmode
> @@ -2381,24 +2377,19 @@ machine_check_realmode:
>  	ld	r9, HSTATE_KVM_VCPU(r13)
>  	li	r12, BOOK3S_INTERRUPT_MACHINE_CHECK
>  	/*
> -	 * Deliver unhandled/fatal (e.g. UE) MCE errors to guest through
> -	 * machine check interrupt (set HSRR0 to 0x200). And for handled
> -	 * errors (no-fatal), just go back to guest execution with current
> -	 * HSRR0 instead of exiting guest. This new approach will inject
> -	 * machine check to guest for fatal error causing guest to crash.
> -	 *
> -	 * The old code used to return to host for unhandled errors which
> -	 * was causing guest to hang with soft lockups inside guest and
> -	 * makes it difficult to recover guest instance.
> +	 * Deliver unhandled/fatal (e.g. UE) MCE errors to guest
> +	 * by exiting the guest with KVM_EXIT_NMI exit reason (exit
> +	 * reason set later based on trap). For handled errors
> +	 * (no-fatal), go back to guest execution with current HSRR0
> +	 * instead of exiting the guest. This approach will cause
> +	 * the guest to exit in case of fatal machine check error.
>  	 */
> -	ld	r10, VCPU_PC(r9)
> +	bne	2f	/* Continue guest execution? */
> +	/* If not, exit the guest. SRR0/1 are already set */
> +	b	mc_cont
> +2:	ld	r10, VCPU_PC(r9)
>  	ld	r11, VCPU_MSR(r9)
> -	bne	2f	/* Continue guest execution. */
> -	/* If not, deliver a machine check.  SRR0/1 are already set */
> -	li	r10, BOOK3S_INTERRUPT_MACHINE_CHECK
> -	ld	r11, VCPU_MSR(r9)
> -	bl	kvmppc_msr_interrupt
> -2:	b	fast_interrupt_c_return
> +	b	fast_interrupt_c_return
>  
>  /*
>   * Check the reason we woke from nap, and take appropriate action.
>
> _______________________________________________
> Linuxppc-dev mailing list
> Linuxppc-dev@lists.ozlabs.org
> https://lists.ozlabs.org/listinfo/linuxppc-dev
David Gibson Nov. 12, 2015, 3:34 a.m. UTC | #2
On Wed, Nov 11, 2015 at 10:28:46PM +0530, Aravinda Prasad wrote:
> This patch modifies KVM to cause a guest exit with
> KVM_EXIT_NMI instead of immediately delivering a 0x200
> interrupt to guest upon machine check exception in
> guest address. Exiting the guest enables QEMU to build
> error log and deliver machine check exception to guest
> OS (either via guest OS registered machine check
> handler or via 0x200 guest OS interrupt vector).
> 
> This approach simplifies the delivering of machine
> check exception to guest OS compared to the earlier approach
> of KVM directly invoking 0x200 guest interrupt vector.
> In the earlier approach QEMU patched the 0x200 interrupt
> vector during boot. The patched code at 0x200 issued a
> private hcall to pass the control to QEMU to build the
> error log.
> 
> This design/approach is based on the feedback for the
> QEMU patches to handle machine check exception. Details
> of earlier approach of handling machine check exception
> in QEMU and related discussions can be found at:
> 
> https://lists.nongnu.org/archive/html/qemu-devel/2014-11/msg00813.html
> 
> Signed-off-by: Aravinda Prasad <aravinda@linux.vnet.ibm.com>
> ---
>  arch/powerpc/kvm/book3s_hv.c            |   12 +++---------
>  arch/powerpc/kvm/book3s_hv_rmhandlers.S |   31 +++++++++++--------------------
>  2 files changed, 14 insertions(+), 29 deletions(-)
> 
> diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
> index 2280497..1b1dff0 100644
> --- a/arch/powerpc/kvm/book3s_hv.c
> +++ b/arch/powerpc/kvm/book3s_hv.c
> @@ -859,15 +859,9 @@ static int kvmppc_handle_exit_hv(struct kvm_run *run, struct kvm_vcpu *vcpu,
>  		r = RESUME_GUEST;
>  		break;
>  	case BOOK3S_INTERRUPT_MACHINE_CHECK:
> -		/*
> -		 * Deliver a machine check interrupt to the guest.
> -		 * We have to do this, even if the host has handled the
> -		 * machine check, because machine checks use SRR0/1 and
> -		 * the interrupt might have trashed guest state in them.
> -		 */
> -		kvmppc_book3s_queue_irqprio(vcpu,
> -					    BOOK3S_INTERRUPT_MACHINE_CHECK);
> -		r = RESUME_GUEST;
> +		/* Exit to guest with KVM_EXIT_NMI as exit reason */
> +		run->exit_reason = KVM_EXIT_NMI;
> +		r = RESUME_HOST;
>  		break;
>  	case BOOK3S_INTERRUPT_PROGRAM:
>  	{
> diff --git a/arch/powerpc/kvm/book3s_hv_rmhandlers.S b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
> index b98889e..672b4f6 100644
> --- a/arch/powerpc/kvm/book3s_hv_rmhandlers.S
> +++ b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
> @@ -147,7 +147,6 @@ END_FTR_SECTION_IFSET(CPU_FTR_ARCH_207S)
>  	addi	r1, r1, 112
>  	ld	r7, HSTATE_HOST_MSR(r13)
>  
> -	cmpwi	cr1, r12, BOOK3S_INTERRUPT_MACHINE_CHECK
>  	cmpwi	r12, BOOK3S_INTERRUPT_EXTERNAL
>  	beq	11f
>  	cmpwi	cr2, r12, BOOK3S_INTERRUPT_HMI
> @@ -160,7 +159,6 @@ END_FTR_SECTION_IFSET(CPU_FTR_ARCH_207S)
>  	mtmsrd	r6, 1			/* Clear RI in MSR */
>  	mtsrr0	r8
>  	mtsrr1	r7
> -	beq	cr1, 13f		/* machine check */
>  	RFI
>  
>  	/* On POWER7, we have external interrupts set to use HSRR0/1 */
> @@ -168,8 +166,6 @@ END_FTR_SECTION_IFSET(CPU_FTR_ARCH_207S)
>  	mtspr	SPRN_HSRR1, r7
>  	ba	0x500
>  
> -13:	b	machine_check_fwnmi
> -

I don't quite understand the 3 hunks above.  The rest seems to change
the delivery of MCs as I'd expect, but the above seems to change their
detection and the reason for that isn't obvious to me.


>  14:	mtspr	SPRN_HSRR0, r8
>  	mtspr	SPRN_HSRR1, r7
>  	b	hmi_exception_after_realmode
> @@ -2381,24 +2377,19 @@ machine_check_realmode:
>  	ld	r9, HSTATE_KVM_VCPU(r13)
>  	li	r12, BOOK3S_INTERRUPT_MACHINE_CHECK
>  	/*
> -	 * Deliver unhandled/fatal (e.g. UE) MCE errors to guest through
> -	 * machine check interrupt (set HSRR0 to 0x200). And for handled
> -	 * errors (no-fatal), just go back to guest execution with current
> -	 * HSRR0 instead of exiting guest. This new approach will inject
> -	 * machine check to guest for fatal error causing guest to crash.
> -	 *
> -	 * The old code used to return to host for unhandled errors which
> -	 * was causing guest to hang with soft lockups inside guest and
> -	 * makes it difficult to recover guest instance.
> +	 * Deliver unhandled/fatal (e.g. UE) MCE errors to guest
> +	 * by exiting the guest with KVM_EXIT_NMI exit reason (exit
> +	 * reason set later based on trap). For handled errors
> +	 * (no-fatal), go back to guest execution with current HSRR0
> +	 * instead of exiting the guest. This approach will cause
> +	 * the guest to exit in case of fatal machine check error.
>  	 */
> -	ld	r10, VCPU_PC(r9)
> +	bne	2f	/* Continue guest execution? */
> +	/* If not, exit the guest. SRR0/1 are already set */
> +	b	mc_cont
> +2:	ld	r10, VCPU_PC(r9)
>  	ld	r11, VCPU_MSR(r9)
> -	bne	2f	/* Continue guest execution. */
> -	/* If not, deliver a machine check.  SRR0/1 are already set */
> -	li	r10, BOOK3S_INTERRUPT_MACHINE_CHECK
> -	ld	r11, VCPU_MSR(r9)
> -	bl	kvmppc_msr_interrupt
> -2:	b	fast_interrupt_c_return
> +	b	fast_interrupt_c_return
>  
>  /*
>   * Check the reason we woke from nap, and take appropriate action.
>
David Gibson Nov. 12, 2015, 3:38 a.m. UTC | #3
On Thu, Nov 12, 2015 at 01:24:19PM +1100, Daniel Axtens wrote:
> Aravinda Prasad <aravinda@linux.vnet.ibm.com> writes:
> 
> > This patch modifies KVM to cause a guest exit with
> > KVM_EXIT_NMI instead of immediately delivering a 0x200
> > interrupt to guest upon machine check exception in
> > guest address. Exiting the guest enables QEMU to build
> > error log and deliver machine check exception to guest
> > OS (either via guest OS registered machine check
> > handler or via 0x200 guest OS interrupt vector).
> >
> > This approach simplifies the delivering of machine
> > check exception to guest OS compared to the earlier approach
> > of KVM directly invoking 0x200 guest interrupt vector.
> > In the earlier approach QEMU patched the 0x200 interrupt
> > vector during boot. The patched code at 0x200 issued a
> > private hcall to pass the control to QEMU to build the
> > error log.
> >
> > This design/approach is based on the feedback for the
> > QEMU patches to handle machine check exception. Details
> > of earlier approach of handling machine check exception
> > in QEMU and related discussions can be found at:
> >
> > https://lists.nongnu.org/archive/html/qemu-devel/2014-11/msg00813.html
> 
> I've poked at the MCE code, but not the KVM MCE code, so I may be
> mistaken here, but I'm not clear on how this handles errors that the
> guest can recover without terminating.
> 
> For example, a Linux guest can handle a UE in guest userspace by killing
> the guest process. A hypthetical non-linux guest with a microkernel
> could even survive UEs in drivers.
> 
> It sounds from your patch like you're changing this behaviour. Is this
> right?

So, IIUC.  Once the qemu pieces are in place as well it shouldn't
change this behaviour: KVM will exit to qemu, qemu will log the error
information (new), then reinject the MC to the guest which can still
handle it as you describe above.

But, there could be a problem if you have a new kernel with an old
qemu, in that case qemu might not understand the new exit type and
treat it as a fatal error, even though the guest could actually cope
with it.

Aravinda, do we need to change this so that qemu has to explicitly
enable the new NMI behaviour?  Or have I missed something that will
make that case work already.
Aravinda Prasad Nov. 12, 2015, 4:32 a.m. UTC | #4
On Thursday 12 November 2015 09:08 AM, David Gibson wrote:
> On Thu, Nov 12, 2015 at 01:24:19PM +1100, Daniel Axtens wrote:
>> Aravinda Prasad <aravinda@linux.vnet.ibm.com> writes:
>>
>>> This patch modifies KVM to cause a guest exit with
>>> KVM_EXIT_NMI instead of immediately delivering a 0x200
>>> interrupt to guest upon machine check exception in
>>> guest address. Exiting the guest enables QEMU to build
>>> error log and deliver machine check exception to guest
>>> OS (either via guest OS registered machine check
>>> handler or via 0x200 guest OS interrupt vector).
>>>
>>> This approach simplifies the delivering of machine
>>> check exception to guest OS compared to the earlier approach
>>> of KVM directly invoking 0x200 guest interrupt vector.
>>> In the earlier approach QEMU patched the 0x200 interrupt
>>> vector during boot. The patched code at 0x200 issued a
>>> private hcall to pass the control to QEMU to build the
>>> error log.
>>>
>>> This design/approach is based on the feedback for the
>>> QEMU patches to handle machine check exception. Details
>>> of earlier approach of handling machine check exception
>>> in QEMU and related discussions can be found at:
>>>
>>> https://lists.nongnu.org/archive/html/qemu-devel/2014-11/msg00813.html
>>
>> I've poked at the MCE code, but not the KVM MCE code, so I may be
>> mistaken here, but I'm not clear on how this handles errors that the
>> guest can recover without terminating.
>>
>> For example, a Linux guest can handle a UE in guest userspace by killing
>> the guest process. A hypthetical non-linux guest with a microkernel
>> could even survive UEs in drivers.
>>
>> It sounds from your patch like you're changing this behaviour. Is this
>> right?
> 
> So, IIUC.  Once the qemu pieces are in place as well it shouldn't
> change this behaviour: KVM will exit to qemu, qemu will log the error
> information (new), then reinject the MC to the guest which can still
> handle it as you describe above.

Yes. With KVM and QEMU both in place this will not change the behavior.
QEMU will inject the UE to guest and the guest handles the UE based on
where it occurred. For example if an UE happens in a guest process
address space, that process will be killed.

> 
> But, there could be a problem if you have a new kernel with an old
> qemu, in that case qemu might not understand the new exit type and
> treat it as a fatal error, even though the guest could actually cope
> with it.

In case of new kernel and old QEMU, the guest terminates as old QEMU
does not understand the NMI exit reason. However, this is the case with
old kernel and old QEMU as they do not handle UE belonging to guest. The
difference is that the guest kernel terminates with different error code.

old kernel and old QEMU -> guest panics [1] irrespective of where UE
                           happened in guest address space.
old kernel and new QEMU -> guest panics. same as above.
new kernel and old QEMU -> guest terminates with unhanded NMI error
                           irrespective of where UE happened in guest
new kernel and new QEMU -> guest handles UEs in process address space
                           by killing the process. guest terminates
                           for UEs in guest kernel address space.

[1] https://lists.ozlabs.org/pipermail/linuxppc-dev/2014-June/118329.html

> 
> Aravinda, do we need to change this so that qemu has to explicitly
> enable the new NMI behaviour?  Or have I missed something that will
> make that case work already.

I think we don't need to explicitly enable the new behavior. With new
kernel and new QEMU this should just work. As mentioned above this is
already broken for old kernel/QEMU. Any thoughts?

Regards,
Aravinda

> 
> 
> 
> _______________________________________________
> Linuxppc-dev mailing list
> Linuxppc-dev@lists.ozlabs.org
> https://lists.ozlabs.org/listinfo/linuxppc-dev
>
David Gibson Nov. 12, 2015, 4:43 a.m. UTC | #5
On Thu, Nov 12, 2015 at 10:02:10AM +0530, Aravinda Prasad wrote:
> 
> 
> On Thursday 12 November 2015 09:08 AM, David Gibson wrote:
> > On Thu, Nov 12, 2015 at 01:24:19PM +1100, Daniel Axtens wrote:
> >> Aravinda Prasad <aravinda@linux.vnet.ibm.com> writes:
> >>
> >>> This patch modifies KVM to cause a guest exit with
> >>> KVM_EXIT_NMI instead of immediately delivering a 0x200
> >>> interrupt to guest upon machine check exception in
> >>> guest address. Exiting the guest enables QEMU to build
> >>> error log and deliver machine check exception to guest
> >>> OS (either via guest OS registered machine check
> >>> handler or via 0x200 guest OS interrupt vector).
> >>>
> >>> This approach simplifies the delivering of machine
> >>> check exception to guest OS compared to the earlier approach
> >>> of KVM directly invoking 0x200 guest interrupt vector.
> >>> In the earlier approach QEMU patched the 0x200 interrupt
> >>> vector during boot. The patched code at 0x200 issued a
> >>> private hcall to pass the control to QEMU to build the
> >>> error log.
> >>>
> >>> This design/approach is based on the feedback for the
> >>> QEMU patches to handle machine check exception. Details
> >>> of earlier approach of handling machine check exception
> >>> in QEMU and related discussions can be found at:
> >>>
> >>> https://lists.nongnu.org/archive/html/qemu-devel/2014-11/msg00813.html
> >>
> >> I've poked at the MCE code, but not the KVM MCE code, so I may be
> >> mistaken here, but I'm not clear on how this handles errors that the
> >> guest can recover without terminating.
> >>
> >> For example, a Linux guest can handle a UE in guest userspace by killing
> >> the guest process. A hypthetical non-linux guest with a microkernel
> >> could even survive UEs in drivers.
> >>
> >> It sounds from your patch like you're changing this behaviour. Is this
> >> right?
> > 
> > So, IIUC.  Once the qemu pieces are in place as well it shouldn't
> > change this behaviour: KVM will exit to qemu, qemu will log the error
> > information (new), then reinject the MC to the guest which can still
> > handle it as you describe above.
> 
> Yes. With KVM and QEMU both in place this will not change the behavior.
> QEMU will inject the UE to guest and the guest handles the UE based on
> where it occurred. For example if an UE happens in a guest process
> address space, that process will be killed.
> 
> > 
> > But, there could be a problem if you have a new kernel with an old
> > qemu, in that case qemu might not understand the new exit type and
> > treat it as a fatal error, even though the guest could actually cope
> > with it.
> 
> In case of new kernel and old QEMU, the guest terminates as old QEMU
> does not understand the NMI exit reason. However, this is the case with
> old kernel and old QEMU as they do not handle UE belonging to guest. The
> difference is that the guest kernel terminates with different error
> code.

Ok.. assuming the guest has code to handle the UE in 0x200, why would
the guest terminate with old kernel and old qemu?  I haven't quite
followed the logic.

> 
> old kernel and old QEMU -> guest panics [1] irrespective of where UE
>                            happened in guest address space.
> old kernel and new QEMU -> guest panics. same as above.
> new kernel and old QEMU -> guest terminates with unhanded NMI error
>                            irrespective of where UE happened in guest
> new kernel and new QEMU -> guest handles UEs in process address space
>                            by killing the process. guest terminates
>                            for UEs in guest kernel address space.
> 
> [1] https://lists.ozlabs.org/pipermail/linuxppc-dev/2014-June/118329.html
> 
> > 
> > Aravinda, do we need to change this so that qemu has to explicitly
> > enable the new NMI behaviour?  Or have I missed something that will
> > make that case work already.
> 
> I think we don't need to explicitly enable the new behavior. With new
> kernel and new QEMU this should just work. As mentioned above this is
> already broken for old kernel/QEMU. Any thoughts?
> 
> Regards,
> Aravinda
> 
> > 
> > 
> > 
> > _______________________________________________
> > Linuxppc-dev mailing list
> > Linuxppc-dev@lists.ozlabs.org
> > https://lists.ozlabs.org/listinfo/linuxppc-dev
> > 
>
Daniel Axtens Nov. 12, 2015, 4:58 a.m. UTC | #6
> So, IIUC.  Once the qemu pieces are in place as well it shouldn't
> change this behaviour: KVM will exit to qemu, qemu will log the error
> information (new), then reinject the MC to the guest which can still
> handle it as you describe above.

Ah, that makes *much* more sense now! Thanks for the explanation: I
don't really follow qemu development.

>
> But, there could be a problem if you have a new kernel with an old
> qemu, in that case qemu might not understand the new exit type and
> treat it as a fatal error, even though the guest could actually cope
> with it.
>
> Aravinda, do we need to change this so that qemu has to explicitly
> enable the new NMI behaviour?  Or have I missed something that will
> make that case work already.

<puts on CAPI hat>Yeah, it would be good not to break this.</hat>

Regards,
Daniel


> -- 
> David Gibson			| I'll have my music baroque, and my code
> david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
> 				| _way_ _around_!
> http://www.ozlabs.org/~dgibson
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Aravinda Prasad Nov. 12, 2015, 5:18 a.m. UTC | #7
On Thursday 12 November 2015 09:04 AM, David Gibson wrote:
> On Wed, Nov 11, 2015 at 10:28:46PM +0530, Aravinda Prasad wrote:
>> This patch modifies KVM to cause a guest exit with
>> KVM_EXIT_NMI instead of immediately delivering a 0x200
>> interrupt to guest upon machine check exception in
>> guest address. Exiting the guest enables QEMU to build
>> error log and deliver machine check exception to guest
>> OS (either via guest OS registered machine check
>> handler or via 0x200 guest OS interrupt vector).
>>
>> This approach simplifies the delivering of machine
>> check exception to guest OS compared to the earlier approach
>> of KVM directly invoking 0x200 guest interrupt vector.
>> In the earlier approach QEMU patched the 0x200 interrupt
>> vector during boot. The patched code at 0x200 issued a
>> private hcall to pass the control to QEMU to build the
>> error log.
>>
>> This design/approach is based on the feedback for the
>> QEMU patches to handle machine check exception. Details
>> of earlier approach of handling machine check exception
>> in QEMU and related discussions can be found at:
>>
>> https://lists.nongnu.org/archive/html/qemu-devel/2014-11/msg00813.html
>>
>> Signed-off-by: Aravinda Prasad <aravinda@linux.vnet.ibm.com>
>> ---
>>  arch/powerpc/kvm/book3s_hv.c            |   12 +++---------
>>  arch/powerpc/kvm/book3s_hv_rmhandlers.S |   31 +++++++++++--------------------
>>  2 files changed, 14 insertions(+), 29 deletions(-)
>>
>> diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
>> index 2280497..1b1dff0 100644
>> --- a/arch/powerpc/kvm/book3s_hv.c
>> +++ b/arch/powerpc/kvm/book3s_hv.c
>> @@ -859,15 +859,9 @@ static int kvmppc_handle_exit_hv(struct kvm_run *run, struct kvm_vcpu *vcpu,
>>  		r = RESUME_GUEST;
>>  		break;
>>  	case BOOK3S_INTERRUPT_MACHINE_CHECK:
>> -		/*
>> -		 * Deliver a machine check interrupt to the guest.
>> -		 * We have to do this, even if the host has handled the
>> -		 * machine check, because machine checks use SRR0/1 and
>> -		 * the interrupt might have trashed guest state in them.
>> -		 */
>> -		kvmppc_book3s_queue_irqprio(vcpu,
>> -					    BOOK3S_INTERRUPT_MACHINE_CHECK);
>> -		r = RESUME_GUEST;
>> +		/* Exit to guest with KVM_EXIT_NMI as exit reason */
>> +		run->exit_reason = KVM_EXIT_NMI;
>> +		r = RESUME_HOST;
>>  		break;
>>  	case BOOK3S_INTERRUPT_PROGRAM:
>>  	{
>> diff --git a/arch/powerpc/kvm/book3s_hv_rmhandlers.S b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
>> index b98889e..672b4f6 100644
>> --- a/arch/powerpc/kvm/book3s_hv_rmhandlers.S
>> +++ b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
>> @@ -147,7 +147,6 @@ END_FTR_SECTION_IFSET(CPU_FTR_ARCH_207S)
>>  	addi	r1, r1, 112
>>  	ld	r7, HSTATE_HOST_MSR(r13)
>>  
>> -	cmpwi	cr1, r12, BOOK3S_INTERRUPT_MACHINE_CHECK
>>  	cmpwi	r12, BOOK3S_INTERRUPT_EXTERNAL
>>  	beq	11f
>>  	cmpwi	cr2, r12, BOOK3S_INTERRUPT_HMI
>> @@ -160,7 +159,6 @@ END_FTR_SECTION_IFSET(CPU_FTR_ARCH_207S)
>>  	mtmsrd	r6, 1			/* Clear RI in MSR */
>>  	mtsrr0	r8
>>  	mtsrr1	r7
>> -	beq	cr1, 13f		/* machine check */
>>  	RFI
>>  
>>  	/* On POWER7, we have external interrupts set to use HSRR0/1 */
>> @@ -168,8 +166,6 @@ END_FTR_SECTION_IFSET(CPU_FTR_ARCH_207S)
>>  	mtspr	SPRN_HSRR1, r7
>>  	ba	0x500
>>  
>> -13:	b	machine_check_fwnmi
>> -
> 
> I don't quite understand the 3 hunks above.  The rest seems to change
> the delivery of MCs as I'd expect, but the above seems to change their
> detection and the reason for that isn't obvious to me.

We need to do guest exit here and hence we continue with RFI or else if
we branch to machine_check_fwnmi then the control will flow to
opal_recover_mce routine without causing the guest to exit with NMI exit
code. And I think, opal_recover_mce() is used to recover from UEs in
host user/kernel space and does not have a check for UEs belonging to
guest. Hence if we branch to machine_check_fwnmi we end up running into
panic() in opal_machine_check() if UE belonged to guest.

Regards,
Aravinda

> 
> 
>>  14:	mtspr	SPRN_HSRR0, r8
>>  	mtspr	SPRN_HSRR1, r7
>>  	b	hmi_exception_after_realmode
>> @@ -2381,24 +2377,19 @@ machine_check_realmode:
>>  	ld	r9, HSTATE_KVM_VCPU(r13)
>>  	li	r12, BOOK3S_INTERRUPT_MACHINE_CHECK
>>  	/*
>> -	 * Deliver unhandled/fatal (e.g. UE) MCE errors to guest through
>> -	 * machine check interrupt (set HSRR0 to 0x200). And for handled
>> -	 * errors (no-fatal), just go back to guest execution with current
>> -	 * HSRR0 instead of exiting guest. This new approach will inject
>> -	 * machine check to guest for fatal error causing guest to crash.
>> -	 *
>> -	 * The old code used to return to host for unhandled errors which
>> -	 * was causing guest to hang with soft lockups inside guest and
>> -	 * makes it difficult to recover guest instance.
>> +	 * Deliver unhandled/fatal (e.g. UE) MCE errors to guest
>> +	 * by exiting the guest with KVM_EXIT_NMI exit reason (exit
>> +	 * reason set later based on trap). For handled errors
>> +	 * (no-fatal), go back to guest execution with current HSRR0
>> +	 * instead of exiting the guest. This approach will cause
>> +	 * the guest to exit in case of fatal machine check error.
>>  	 */
>> -	ld	r10, VCPU_PC(r9)
>> +	bne	2f	/* Continue guest execution? */
>> +	/* If not, exit the guest. SRR0/1 are already set */
>> +	b	mc_cont
>> +2:	ld	r10, VCPU_PC(r9)
>>  	ld	r11, VCPU_MSR(r9)
>> -	bne	2f	/* Continue guest execution. */
>> -	/* If not, deliver a machine check.  SRR0/1 are already set */
>> -	li	r10, BOOK3S_INTERRUPT_MACHINE_CHECK
>> -	ld	r11, VCPU_MSR(r9)
>> -	bl	kvmppc_msr_interrupt
>> -2:	b	fast_interrupt_c_return
>> +	b	fast_interrupt_c_return
>>  
>>  /*
>>   * Check the reason we woke from nap, and take appropriate action.
>>
>
Aravinda Prasad Nov. 12, 2015, 5:22 p.m. UTC | #8
On Thursday 12 November 2015 10:28 AM, Daniel Axtens wrote:
> 
>> So, IIUC.  Once the qemu pieces are in place as well it shouldn't
>> change this behaviour: KVM will exit to qemu, qemu will log the error
>> information (new), then reinject the MC to the guest which can still
>> handle it as you describe above.
> 
> Ah, that makes *much* more sense now! Thanks for the explanation: I
> don't really follow qemu development.
> 
>>
>> But, there could be a problem if you have a new kernel with an old
>> qemu, in that case qemu might not understand the new exit type and
>> treat it as a fatal error, even though the guest could actually cope
>> with it.
>>
>> Aravinda, do we need to change this so that qemu has to explicitly
>> enable the new NMI behaviour?  Or have I missed something that will
>> make that case work already.
> 
> <puts on CAPI hat>Yeah, it would be good not to break this.</hat>

I am not familiar with CAPI. Does this affect CAPI?

Regards,
Aravinda

> 
> Regards,
> Daniel
> 
> 
>> -- 
>> David Gibson			| I'll have my music baroque, and my code
>> david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
>> 				| _way_ _around_!
>> http://www.ozlabs.org/~dgibson
> _______________________________________________
> Linuxppc-dev mailing list
> Linuxppc-dev@lists.ozlabs.org
> https://lists.ozlabs.org/listinfo/linuxppc-dev
>
Aravinda Prasad Nov. 12, 2015, 5:52 p.m. UTC | #9
On Thursday 12 November 2015 10:13 AM, David Gibson wrote:
> On Thu, Nov 12, 2015 at 10:02:10AM +0530, Aravinda Prasad wrote:
>>
>>
>> On Thursday 12 November 2015 09:08 AM, David Gibson wrote:
>>> On Thu, Nov 12, 2015 at 01:24:19PM +1100, Daniel Axtens wrote:
>>>> Aravinda Prasad <aravinda@linux.vnet.ibm.com> writes:
>>>>
>>>>> This patch modifies KVM to cause a guest exit with
>>>>> KVM_EXIT_NMI instead of immediately delivering a 0x200
>>>>> interrupt to guest upon machine check exception in
>>>>> guest address. Exiting the guest enables QEMU to build
>>>>> error log and deliver machine check exception to guest
>>>>> OS (either via guest OS registered machine check
>>>>> handler or via 0x200 guest OS interrupt vector).
>>>>>
>>>>> This approach simplifies the delivering of machine
>>>>> check exception to guest OS compared to the earlier approach
>>>>> of KVM directly invoking 0x200 guest interrupt vector.
>>>>> In the earlier approach QEMU patched the 0x200 interrupt
>>>>> vector during boot. The patched code at 0x200 issued a
>>>>> private hcall to pass the control to QEMU to build the
>>>>> error log.
>>>>>
>>>>> This design/approach is based on the feedback for the
>>>>> QEMU patches to handle machine check exception. Details
>>>>> of earlier approach of handling machine check exception
>>>>> in QEMU and related discussions can be found at:
>>>>>
>>>>> https://lists.nongnu.org/archive/html/qemu-devel/2014-11/msg00813.html
>>>>
>>>> I've poked at the MCE code, but not the KVM MCE code, so I may be
>>>> mistaken here, but I'm not clear on how this handles errors that the
>>>> guest can recover without terminating.
>>>>
>>>> For example, a Linux guest can handle a UE in guest userspace by killing
>>>> the guest process. A hypthetical non-linux guest with a microkernel
>>>> could even survive UEs in drivers.
>>>>
>>>> It sounds from your patch like you're changing this behaviour. Is this
>>>> right?
>>>
>>> So, IIUC.  Once the qemu pieces are in place as well it shouldn't
>>> change this behaviour: KVM will exit to qemu, qemu will log the error
>>> information (new), then reinject the MC to the guest which can still
>>> handle it as you describe above.
>>
>> Yes. With KVM and QEMU both in place this will not change the behavior.
>> QEMU will inject the UE to guest and the guest handles the UE based on
>> where it occurred. For example if an UE happens in a guest process
>> address space, that process will be killed.
>>
>>>
>>> But, there could be a problem if you have a new kernel with an old
>>> qemu, in that case qemu might not understand the new exit type and
>>> treat it as a fatal error, even though the guest could actually cope
>>> with it.
>>
>> In case of new kernel and old QEMU, the guest terminates as old QEMU
>> does not understand the NMI exit reason. However, this is the case with
>> old kernel and old QEMU as they do not handle UE belonging to guest. The
>> difference is that the guest kernel terminates with different error
>> code.
> 
> Ok.. assuming the guest has code to handle the UE in 0x200, why would
> the guest terminate with old kernel and old qemu?  I haven't quite
> followed the logic.

I overlooked it. I think I need to take into consideration whether guest
issued "ibm, nmi-register". If the guest has issued "ibm, nmi-register"
then we should not jump to 0x200 upon UE. With the old kernel and old
QEMU this is broken as we always jump to 0x200.

However, if the guest has not issued "ibm, nmi-register" then upon UE we
should jump to 0x200. If new kernel is used with old QEMU this
functionality breaks as it causes guest to terminate with unhandled NMI
exit.

So thinking whether qemu should explicitly enable the new NMI behavior.

Regards,
Aravinda

> 
>>
>> old kernel and old QEMU -> guest panics [1] irrespective of where UE
>>                            happened in guest address space.
>> old kernel and new QEMU -> guest panics. same as above.
>> new kernel and old QEMU -> guest terminates with unhanded NMI error
>>                            irrespective of where UE happened in guest
>> new kernel and new QEMU -> guest handles UEs in process address space
>>                            by killing the process. guest terminates
>>                            for UEs in guest kernel address space.
>>
>> [1] https://lists.ozlabs.org/pipermail/linuxppc-dev/2014-June/118329.html
>>
>>>
>>> Aravinda, do we need to change this so that qemu has to explicitly
>>> enable the new NMI behaviour?  Or have I missed something that will
>>> make that case work already.
>>
>> I think we don't need to explicitly enable the new behavior. With new
>> kernel and new QEMU this should just work. As mentioned above this is
>> already broken for old kernel/QEMU. Any thoughts?
>>
>> Regards,
>> Aravinda
>>
>>>
>>>
>>>
>>> _______________________________________________
>>> Linuxppc-dev mailing list
>>> Linuxppc-dev@lists.ozlabs.org
>>> https://lists.ozlabs.org/listinfo/linuxppc-dev
>>>
>>
>
Daniel Axtens Nov. 12, 2015, 9:37 p.m. UTC | #10
Aravinda Prasad <aravinda@linux.vnet.ibm.com> writes:

>> <puts on CAPI hat>Yeah, it would be good not to break this.</hat>
>
> I am not familiar with CAPI. Does this affect CAPI?

When a CAPI card experiences an EEH event, any cache lines it holds are
filled with SUEs (Special UEs, interpreted by the kernel the same as
regular UEs). When these are read, we get an MCE. Currently CAPI does
not support virtualisation, but that is actively being worked
on. There's a _very_ good chance it will then be backported to various
distros, which could have old qemu.

Therefore, I'd ideally like to make sure UEs in KVM guests work properly
and continue to do so in all combinations of kernel and qemu. I'm
following up the email from Mahesh that you linked: I'm not sure I quite
follow his logic so I'll try to make sense of that and then get back to
you.

Regards,
Daniel
David Gibson Nov. 13, 2015, 1:50 a.m. UTC | #11
On Thu, Nov 12, 2015 at 11:22:29PM +0530, Aravinda Prasad wrote:
> 
> 
> On Thursday 12 November 2015 10:13 AM, David Gibson wrote:
> > On Thu, Nov 12, 2015 at 10:02:10AM +0530, Aravinda Prasad wrote:
> >>
> >>
> >> On Thursday 12 November 2015 09:08 AM, David Gibson wrote:
> >>> On Thu, Nov 12, 2015 at 01:24:19PM +1100, Daniel Axtens wrote:
> >>>> Aravinda Prasad <aravinda@linux.vnet.ibm.com> writes:
> >>>>
> >>>>> This patch modifies KVM to cause a guest exit with
> >>>>> KVM_EXIT_NMI instead of immediately delivering a 0x200
> >>>>> interrupt to guest upon machine check exception in
> >>>>> guest address. Exiting the guest enables QEMU to build
> >>>>> error log and deliver machine check exception to guest
> >>>>> OS (either via guest OS registered machine check
> >>>>> handler or via 0x200 guest OS interrupt vector).
> >>>>>
> >>>>> This approach simplifies the delivering of machine
> >>>>> check exception to guest OS compared to the earlier approach
> >>>>> of KVM directly invoking 0x200 guest interrupt vector.
> >>>>> In the earlier approach QEMU patched the 0x200 interrupt
> >>>>> vector during boot. The patched code at 0x200 issued a
> >>>>> private hcall to pass the control to QEMU to build the
> >>>>> error log.
> >>>>>
> >>>>> This design/approach is based on the feedback for the
> >>>>> QEMU patches to handle machine check exception. Details
> >>>>> of earlier approach of handling machine check exception
> >>>>> in QEMU and related discussions can be found at:
> >>>>>
> >>>>> https://lists.nongnu.org/archive/html/qemu-devel/2014-11/msg00813.html
> >>>>
> >>>> I've poked at the MCE code, but not the KVM MCE code, so I may be
> >>>> mistaken here, but I'm not clear on how this handles errors that the
> >>>> guest can recover without terminating.
> >>>>
> >>>> For example, a Linux guest can handle a UE in guest userspace by killing
> >>>> the guest process. A hypthetical non-linux guest with a microkernel
> >>>> could even survive UEs in drivers.
> >>>>
> >>>> It sounds from your patch like you're changing this behaviour. Is this
> >>>> right?
> >>>
> >>> So, IIUC.  Once the qemu pieces are in place as well it shouldn't
> >>> change this behaviour: KVM will exit to qemu, qemu will log the error
> >>> information (new), then reinject the MC to the guest which can still
> >>> handle it as you describe above.
> >>
> >> Yes. With KVM and QEMU both in place this will not change the behavior.
> >> QEMU will inject the UE to guest and the guest handles the UE based on
> >> where it occurred. For example if an UE happens in a guest process
> >> address space, that process will be killed.
> >>
> >>>
> >>> But, there could be a problem if you have a new kernel with an old
> >>> qemu, in that case qemu might not understand the new exit type and
> >>> treat it as a fatal error, even though the guest could actually cope
> >>> with it.
> >>
> >> In case of new kernel and old QEMU, the guest terminates as old QEMU
> >> does not understand the NMI exit reason. However, this is the case with
> >> old kernel and old QEMU as they do not handle UE belonging to guest. The
> >> difference is that the guest kernel terminates with different error
> >> code.
> > 
> > Ok.. assuming the guest has code to handle the UE in 0x200, why would
> > the guest terminate with old kernel and old qemu?  I haven't quite
> > followed the logic.
> 
> I overlooked it. I think I need to take into consideration whether guest
> issued "ibm, nmi-register". If the guest has issued "ibm, nmi-register"
> then we should not jump to 0x200 upon UE. With the old kernel and old
> QEMU this is broken as we always jump to 0x200.
> 
> However, if the guest has not issued "ibm, nmi-register" then upon UE we
> should jump to 0x200. If new kernel is used with old QEMU this
> functionality breaks as it causes guest to terminate with unhandled NMI
> exit.
> 
> So thinking whether qemu should explicitly enable the new NMI
> behavior.

So, I think the reasoning above tends towards having qemu control the
MC behaviour.  If qemu does nothing, MCs are delivered direct to
0x200, if it enables the new handling, they cause a KVM exit and qemu
will deliver the MC.

Then I'd expect qemu to switch on the new-style handling from
ibm,nmi-register.

> 
> Regards,
> Aravinda
> 
> > 
> >>
> >> old kernel and old QEMU -> guest panics [1] irrespective of where UE
> >>                            happened in guest address space.
> >> old kernel and new QEMU -> guest panics. same as above.
> >> new kernel and old QEMU -> guest terminates with unhanded NMI error
> >>                            irrespective of where UE happened in guest
> >> new kernel and new QEMU -> guest handles UEs in process address space
> >>                            by killing the process. guest terminates
> >>                            for UEs in guest kernel address space.
> >>
> >> [1] https://lists.ozlabs.org/pipermail/linuxppc-dev/2014-June/118329.html
> >>
> >>>
> >>> Aravinda, do we need to change this so that qemu has to explicitly
> >>> enable the new NMI behaviour?  Or have I missed something that will
> >>> make that case work already.
> >>
> >> I think we don't need to explicitly enable the new behavior. With new
> >> kernel and new QEMU this should just work. As mentioned above this is
> >> already broken for old kernel/QEMU. Any thoughts?
> >>
> >> Regards,
> >> Aravinda
> >>
> >>>
> >>>
> >>>
> >>> _______________________________________________
> >>> Linuxppc-dev mailing list
> >>> Linuxppc-dev@lists.ozlabs.org
> >>> https://lists.ozlabs.org/listinfo/linuxppc-dev
> >>>
> >>
> > 
>
Aravinda Prasad Nov. 13, 2015, 4:58 a.m. UTC | #12
On Friday 13 November 2015 03:07 AM, Daniel Axtens wrote:
> Aravinda Prasad <aravinda@linux.vnet.ibm.com> writes:
> 
>>> <puts on CAPI hat>Yeah, it would be good not to break this.</hat>
>>
>> I am not familiar with CAPI. Does this affect CAPI?
> 
> When a CAPI card experiences an EEH event, any cache lines it holds are
> filled with SUEs (Special UEs, interpreted by the kernel the same as
> regular UEs). When these are read, we get an MCE. Currently CAPI does
> not support virtualisation, but that is actively being worked
> on. There's a _very_ good chance it will then be backported to various
> distros, which could have old qemu.
> 
> Therefore, I'd ideally like to make sure UEs in KVM guests work properly
> and continue to do so in all combinations of kernel and qemu. I'm
> following up the email from Mahesh that you linked: I'm not sure I quite
> follow his logic so I'll try to make sense of that and then get back to
> you.

sure. Thanks.

Regards,
Aravinda

> 
> Regards,
> Daniel
>
Aravinda Prasad Nov. 13, 2015, 6:26 a.m. UTC | #13
On Friday 13 November 2015 07:20 AM, David Gibson wrote:
> On Thu, Nov 12, 2015 at 11:22:29PM +0530, Aravinda Prasad wrote:

[...]

>>
>> I overlooked it. I think I need to take into consideration whether guest
>> issued "ibm, nmi-register". If the guest has issued "ibm, nmi-register"
>> then we should not jump to 0x200 upon UE. With the old kernel and old
>> QEMU this is broken as we always jump to 0x200.
>>
>> However, if the guest has not issued "ibm, nmi-register" then upon UE we
>> should jump to 0x200. If new kernel is used with old QEMU this
>> functionality breaks as it causes guest to terminate with unhandled NMI
>> exit.
>>
>> So thinking whether qemu should explicitly enable the new NMI
>> behavior.
> 
> So, I think the reasoning above tends towards having qemu control the
> MC behaviour.  If qemu does nothing, MCs are delivered direct to
> 0x200, if it enables the new handling, they cause a KVM exit and qemu
> will deliver the MC.

This essentially requires qemu to control how KVM behaves as KVM does
the actual redirection of MC either to guest's 0x200 vector or to exit
guest. So, if we are running new qemu, then KVM should exit guest and if
we are running old qemu, KVM should redirect MC to 0x200. Is there any
way to communicate this to KVM? ioctl?

However, this should not be a problem (in terms of breaking existing
functionality) with old KVM as it always redirects MC to 0x200
irrespective of old or new qemu.

> 
> Then I'd expect qemu to switch on the new-style handling from
> ibm,nmi-register.
> 
>>
Thomas Huth Nov. 13, 2015, 7:38 a.m. UTC | #14
On 13/11/15 07:26, Aravinda Prasad wrote:
> 
> On Friday 13 November 2015 07:20 AM, David Gibson wrote:
>> On Thu, Nov 12, 2015 at 11:22:29PM +0530, Aravinda Prasad wrote:
[...]
>>> So thinking whether qemu should explicitly enable the new NMI
>>> behavior.
>>
>> So, I think the reasoning above tends towards having qemu control the
>> MC behaviour.  If qemu does nothing, MCs are delivered direct to
>> 0x200, if it enables the new handling, they cause a KVM exit and qemu
>> will deliver the MC.
> 
> This essentially requires qemu to control how KVM behaves as KVM does
> the actual redirection of MC either to guest's 0x200 vector or to exit
> guest. So, if we are running new qemu, then KVM should exit guest and if
> we are running old qemu, KVM should redirect MC to 0x200. Is there any
> way to communicate this to KVM? ioctl?

Simply introduce a KVM capability that can be enabled by userspace.
See kvm_vcpu_ioctl_enable_cap() in arch/powerpc/kvm/powerpc.c.

 Thomas

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Aravinda Prasad Nov. 13, 2015, 11:25 a.m. UTC | #15
On Friday 13 November 2015 01:08 PM, Thomas Huth wrote:
> On 13/11/15 07:26, Aravinda Prasad wrote:
>>
>> On Friday 13 November 2015 07:20 AM, David Gibson wrote:
>>> On Thu, Nov 12, 2015 at 11:22:29PM +0530, Aravinda Prasad wrote:
> [...]
>>>> So thinking whether qemu should explicitly enable the new NMI
>>>> behavior.
>>>
>>> So, I think the reasoning above tends towards having qemu control the
>>> MC behaviour.  If qemu does nothing, MCs are delivered direct to
>>> 0x200, if it enables the new handling, they cause a KVM exit and qemu
>>> will deliver the MC.
>>
>> This essentially requires qemu to control how KVM behaves as KVM does
>> the actual redirection of MC either to guest's 0x200 vector or to exit
>> guest. So, if we are running new qemu, then KVM should exit guest and if
>> we are running old qemu, KVM should redirect MC to 0x200. Is there any
>> way to communicate this to KVM? ioctl?
> 
> Simply introduce a KVM capability that can be enabled by userspace.
> See kvm_vcpu_ioctl_enable_cap() in arch/powerpc/kvm/powerpc.c.

Thanks. I will look at it.

Regards,
Aravinda

> 
>  Thomas
>
diff mbox

Patch

diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 2280497..1b1dff0 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -859,15 +859,9 @@  static int kvmppc_handle_exit_hv(struct kvm_run *run, struct kvm_vcpu *vcpu,
 		r = RESUME_GUEST;
 		break;
 	case BOOK3S_INTERRUPT_MACHINE_CHECK:
-		/*
-		 * Deliver a machine check interrupt to the guest.
-		 * We have to do this, even if the host has handled the
-		 * machine check, because machine checks use SRR0/1 and
-		 * the interrupt might have trashed guest state in them.
-		 */
-		kvmppc_book3s_queue_irqprio(vcpu,
-					    BOOK3S_INTERRUPT_MACHINE_CHECK);
-		r = RESUME_GUEST;
+		/* Exit to guest with KVM_EXIT_NMI as exit reason */
+		run->exit_reason = KVM_EXIT_NMI;
+		r = RESUME_HOST;
 		break;
 	case BOOK3S_INTERRUPT_PROGRAM:
 	{
diff --git a/arch/powerpc/kvm/book3s_hv_rmhandlers.S b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
index b98889e..672b4f6 100644
--- a/arch/powerpc/kvm/book3s_hv_rmhandlers.S
+++ b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
@@ -147,7 +147,6 @@  END_FTR_SECTION_IFSET(CPU_FTR_ARCH_207S)
 	addi	r1, r1, 112
 	ld	r7, HSTATE_HOST_MSR(r13)
 
-	cmpwi	cr1, r12, BOOK3S_INTERRUPT_MACHINE_CHECK
 	cmpwi	r12, BOOK3S_INTERRUPT_EXTERNAL
 	beq	11f
 	cmpwi	cr2, r12, BOOK3S_INTERRUPT_HMI
@@ -160,7 +159,6 @@  END_FTR_SECTION_IFSET(CPU_FTR_ARCH_207S)
 	mtmsrd	r6, 1			/* Clear RI in MSR */
 	mtsrr0	r8
 	mtsrr1	r7
-	beq	cr1, 13f		/* machine check */
 	RFI
 
 	/* On POWER7, we have external interrupts set to use HSRR0/1 */
@@ -168,8 +166,6 @@  END_FTR_SECTION_IFSET(CPU_FTR_ARCH_207S)
 	mtspr	SPRN_HSRR1, r7
 	ba	0x500
 
-13:	b	machine_check_fwnmi
-
 14:	mtspr	SPRN_HSRR0, r8
 	mtspr	SPRN_HSRR1, r7
 	b	hmi_exception_after_realmode
@@ -2381,24 +2377,19 @@  machine_check_realmode:
 	ld	r9, HSTATE_KVM_VCPU(r13)
 	li	r12, BOOK3S_INTERRUPT_MACHINE_CHECK
 	/*
-	 * Deliver unhandled/fatal (e.g. UE) MCE errors to guest through
-	 * machine check interrupt (set HSRR0 to 0x200). And for handled
-	 * errors (no-fatal), just go back to guest execution with current
-	 * HSRR0 instead of exiting guest. This new approach will inject
-	 * machine check to guest for fatal error causing guest to crash.
-	 *
-	 * The old code used to return to host for unhandled errors which
-	 * was causing guest to hang with soft lockups inside guest and
-	 * makes it difficult to recover guest instance.
+	 * Deliver unhandled/fatal (e.g. UE) MCE errors to guest
+	 * by exiting the guest with KVM_EXIT_NMI exit reason (exit
+	 * reason set later based on trap). For handled errors
+	 * (no-fatal), go back to guest execution with current HSRR0
+	 * instead of exiting the guest. This approach will cause
+	 * the guest to exit in case of fatal machine check error.
 	 */
-	ld	r10, VCPU_PC(r9)
+	bne	2f	/* Continue guest execution? */
+	/* If not, exit the guest. SRR0/1 are already set */
+	b	mc_cont
+2:	ld	r10, VCPU_PC(r9)
 	ld	r11, VCPU_MSR(r9)
-	bne	2f	/* Continue guest execution. */
-	/* If not, deliver a machine check.  SRR0/1 are already set */
-	li	r10, BOOK3S_INTERRUPT_MACHINE_CHECK
-	ld	r11, VCPU_MSR(r9)
-	bl	kvmppc_msr_interrupt
-2:	b	fast_interrupt_c_return
+	b	fast_interrupt_c_return
 
 /*
  * Check the reason we woke from nap, and take appropriate action.