kvm: pass the virtual SEI syndrome to guest OS

Message ID	1489996534-8270-1-git-send-email-gengdongjiu@huawei.com (mailing list archive)
State	New, archived
Headers	show Return-Path: <linux-arm-kernel-bounces+patchwork-linux-arm=patchwork.kernel.org@lists.infradead.org> From: Dongjiu Geng <gengdongjiu@huawei.com> To: <catalin.marinas@arm.com>, <will.deacon@arm.com>, <marc.zyngier@arm.com>, <christoffer.dall@linaro.org>, <rkrcmar@redhat.com>, <suzuki.poulose@arm.com>, <andre.przywara@arm.com>, <mark.rutland@arm.com>, <vladimir.murzin@arm.com>, <linux-arm-kernel@lists.infradead.org>, <kvmarm@lists.cs.columbia.edu>, <kvm@vger.kernel.org>, <linux-kernel@vger.kernel.org> Subject: [PATCH] kvm: pass the virtual SEI syndrome to guest OS Date: Mon, 20 Mar 2017 15:55:34 +0800 Message-ID: <1489996534-8270-1-git-send-email-gengdongjiu@huawei.com> MIME-Version: 1.0 Precedence: list Cc: wuquanming@huawei.com, wangxiongfeng2@huawei.com, xiexiuqi@huawei.com Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit Sender: "linux-arm-kernel" <linux-arm-kernel-bounces@lists.infradead.org> Errors-To: linux-arm-kernel-bounces+patchwork-linux-arm=patchwork.kernel.org@lists.infradead.org

Dongjiu Geng March 20, 2017, 7:55 a.m. UTC

In the RAS implementation, hardware pass the virtual SEI
syndrome information through the VSESR_EL2, so set the virtual
SEI syndrome using physical SEI syndrome el2_elr to pass to
the guest OS

Signed-off-by: Dongjiu Geng <gengdongjiu@huawei.com>
Signed-off-by: Quanming wu <wuquanming@huawei.com>
---
 arch/arm64/Kconfig                   |  8 ++++++++
 arch/arm64/include/asm/esr.h         |  1 +
 arch/arm64/include/asm/kvm_emulate.h | 12 ++++++++++++
 arch/arm64/include/asm/kvm_host.h    |  4 ++++
 arch/arm64/kvm/hyp/switch.c          | 15 ++++++++++++++-
 arch/arm64/kvm/inject_fault.c        | 10 ++++++++++
 6 files changed, 49 insertions(+), 1 deletion(-)

Marc Zyngier March 20, 2017, 11:24 a.m. UTC | #1

Please include James Morse on anything RAS related, as he's already
looking at related patches.

On 20/03/17 07:55, Dongjiu Geng wrote:
> In the RAS implementation, hardware pass the virtual SEI
> syndrome information through the VSESR_EL2, so set the virtual
> SEI syndrome using physical SEI syndrome el2_elr to pass to
> the guest OS
> 
> Signed-off-by: Dongjiu Geng <gengdongjiu@huawei.com>
> Signed-off-by: Quanming wu <wuquanming@huawei.com>
> ---
>  arch/arm64/Kconfig                   |  8 ++++++++
>  arch/arm64/include/asm/esr.h         |  1 +
>  arch/arm64/include/asm/kvm_emulate.h | 12 ++++++++++++
>  arch/arm64/include/asm/kvm_host.h    |  4 ++++
>  arch/arm64/kvm/hyp/switch.c          | 15 ++++++++++++++-
>  arch/arm64/kvm/inject_fault.c        | 10 ++++++++++
>  6 files changed, 49 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
> index 8c7c244247b6..ea62170a3b75 100644
> --- a/arch/arm64/Kconfig
> +++ b/arch/arm64/Kconfig
> @@ -908,6 +908,14 @@ endmenu
>  
>  menu "ARMv8.2 architectural features"
>  
> +config HAS_RAS_EXTENSION
> +	bool "Support arm64 RAS extension"
> +	default n
> +	help
> +	  Reliability, Availability, Serviceability(RAS; part of the ARMv8.2 Extensions).
> +
> +	  Selecting this option OS will try to recover the error that RAS hardware node detected.
> +

As this is an architectural extension, this should be controlled by the
CPU feature mechanism, and not be chosen at compile time. What you have
here will break horribly when booted on a CPU that doesn't implement RAS.

>  config ARM64_UAO
>  	bool "Enable support for User Access Override (UAO)"
>  	default y
> diff --git a/arch/arm64/include/asm/esr.h b/arch/arm64/include/asm/esr.h
> index d14c478976d0..e38d32b2bdad 100644
> --- a/arch/arm64/include/asm/esr.h
> +++ b/arch/arm64/include/asm/esr.h
> @@ -111,6 +111,7 @@
>  #define ESR_ELx_COND_MASK	(UL(0xF) << ESR_ELx_COND_SHIFT)
>  #define ESR_ELx_WFx_ISS_WFE	(UL(1) << 0)
>  #define ESR_ELx_xVC_IMM_MASK	((1UL << 16) - 1)
> +#define VSESR_ELx_IDS_ISS_MASK    ((1UL << 25) - 1)
>  
>  /* ESR value templates for specific events */
>  
> diff --git a/arch/arm64/include/asm/kvm_emulate.h b/arch/arm64/include/asm/kvm_emulate.h
> index f5ea0ba70f07..20d4da7f5dce 100644
> --- a/arch/arm64/include/asm/kvm_emulate.h
> +++ b/arch/arm64/include/asm/kvm_emulate.h
> @@ -148,6 +148,18 @@ static inline u32 kvm_vcpu_get_hsr(const struct kvm_vcpu *vcpu)
>  	return vcpu->arch.fault.esr_el2;
>  }
>  
> +#ifdef CONFIG_HAS_RAS_EXTENSION
> +static inline u32 kvm_vcpu_get_vsesr(const struct kvm_vcpu *vcpu)
> +{
> +	return vcpu->arch.fault.vsesr_el2;
> +}
> +
> +static inline void kvm_vcpu_set_vsesr(struct kvm_vcpu *vcpu, unsigned long val)
> +{
> +	vcpu->arch.fault.vsesr_el2 = val;
> +}
> +#endif
> +
>  static inline int kvm_vcpu_get_condition(const struct kvm_vcpu *vcpu)
>  {
>  	u32 esr = kvm_vcpu_get_hsr(vcpu);
> diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
> index e7705e7bb07b..f9e3bb57c461 100644
> --- a/arch/arm64/include/asm/kvm_host.h
> +++ b/arch/arm64/include/asm/kvm_host.h
> @@ -83,6 +83,10 @@ struct kvm_mmu_memory_cache {
>  };
>  
>  struct kvm_vcpu_fault_info {
> +#ifdef CONFIG_HAS_RAS_EXTENSION
> +	/* Virtual SError Exception Syndrome Register */
> +	u32 vsesr_el2;
> +#endif
>  	u32 esr_el2;		/* Hyp Syndrom Register */
>  	u64 far_el2;		/* Hyp Fault Address Register */
>  	u64 hpfar_el2;		/* Hyp IPA Fault Address Register */
> diff --git a/arch/arm64/kvm/hyp/switch.c b/arch/arm64/kvm/hyp/switch.c
> index aede1658aeda..770a153fb6ba 100644
> --- a/arch/arm64/kvm/hyp/switch.c
> +++ b/arch/arm64/kvm/hyp/switch.c
> @@ -86,6 +86,13 @@ static void __hyp_text __activate_traps(struct kvm_vcpu *vcpu)
>  		isb();
>  	}
>  	write_sysreg(val, hcr_el2);
> +#ifdef CONFIG_HAS_RAS_EXTENSION
> +	/* If virtual System Error or Asynchronous Abort is pending. set
> +	 * the virtual exception syndrome information
> +	 */
> +	if (vcpu->arch.hcr_el2 & HCR_VSE)
> +		write_sysreg(vcpu->arch.fault.vsesr_el2, vsesr_el2);
> +#endif
>  	/* Trap on AArch32 cp15 c15 accesses (EL1 or EL0) */
>  	write_sysreg(1 << 15, hstr_el2);
>  	/*
> @@ -139,8 +146,14 @@ static void __hyp_text __deactivate_traps(struct kvm_vcpu *vcpu)
>  	 * the crucial bit is "On taking a vSError interrupt,
>  	 * HCR_EL2.VSE is cleared to 0."
>  	 */
> -	if (vcpu->arch.hcr_el2 & HCR_VSE)
> +	if (vcpu->arch.hcr_el2 & HCR_VSE) {
>  		vcpu->arch.hcr_el2 = read_sysreg(hcr_el2);
> +#ifdef CONFIG_HAS_RAS_EXTENSION
> +		/* set vsesr_el2[24:0] with esr_el2[24:0] */
> +		kvm_vcpu_set_vsesr(vcpu, read_sysreg_el2(esr)
> +					& VSESR_ELx_IDS_ISS_MASK);

What guarantees that ESR_EL2 still contains the latest exception? What
does it mean to store something that is the current EL2 exception
syndrome together with an SError that has already been injected?

Also, is it correct to directly copy the ESR_EL2 bits into VSESR_EL2? My
own reading of the specification seem to imply that there is at least
differences when the guest is AArch32. Surely there would be some
processing here.

> +#endif
> +	}
>  
>  	__deactivate_traps_arch()();
>  	write_sysreg(0, hstr_el2);
> diff --git a/arch/arm64/kvm/inject_fault.c b/arch/arm64/kvm/inject_fault.c
> index da6a8cfa54a0..08a13dfe28a8 100644
> --- a/arch/arm64/kvm/inject_fault.c
> +++ b/arch/arm64/kvm/inject_fault.c
> @@ -242,4 +242,14 @@ void kvm_inject_undefined(struct kvm_vcpu *vcpu)
>  void kvm_inject_vabt(struct kvm_vcpu *vcpu)
>  {
>  	vcpu_set_hcr(vcpu, vcpu_get_hcr(vcpu) | HCR_VSE);
> +#ifdef CONFIG_HAS_RAS_EXTENSION
> +	/* If virtual System Error or Asynchronous Abort is set. set
> +	 * the virtual exception syndrome information
> +	 */
> +	kvm_vcpu_set_vsesr(vcpu, ((kvm_vcpu_get_vsesr(vcpu)
> +				& (~VSESR_ELx_IDS_ISS_MASK))
> +				| (kvm_vcpu_get_hsr(vcpu)
> +				& VSESR_ELx_IDS_ISS_MASK)));

What is the rational for setting VSESR_EL2 with the EL1 syndrome
information? That doesn't make any sense to me.

Overall, this patch is completely inconsistent and unclear in what it
tries to achieve. Also, as I already tated before, I'd like to see the
"firmware first" mode of operation be enforced here, going back to
userspace and let the VMM decide what to do.

Thanks,

	M.

Dongjiu Geng March 20, 2017, 12:28 p.m. UTC | #2

On 2017/3/20 19:24, Marc Zyngier wrote:
> Please include James Morse on anything RAS related, as he's already
> looking at related patches.
> 
> On 20/03/17 07:55, Dongjiu Geng wrote:
>> In the RAS implementation, hardware pass the virtual SEI
>> syndrome information through the VSESR_EL2, so set the virtual
>> SEI syndrome using physical SEI syndrome el2_elr to pass to
>> the guest OS
>>
>> Signed-off-by: Dongjiu Geng <gengdongjiu@huawei.com>
>> Signed-off-by: Quanming wu <wuquanming@huawei.com>
>> ---
>>  arch/arm64/Kconfig                   |  8 ++++++++
>>  arch/arm64/include/asm/esr.h         |  1 +
>>  arch/arm64/include/asm/kvm_emulate.h | 12 ++++++++++++
>>  arch/arm64/include/asm/kvm_host.h    |  4 ++++
>>  arch/arm64/kvm/hyp/switch.c          | 15 ++++++++++++++-
>>  arch/arm64/kvm/inject_fault.c        | 10 ++++++++++
>>  6 files changed, 49 insertions(+), 1 deletion(-)
>>
>> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
>> index 8c7c244247b6..ea62170a3b75 100644
>> --- a/arch/arm64/Kconfig
>> +++ b/arch/arm64/Kconfig
>> @@ -908,6 +908,14 @@ endmenu
>>  
>>  menu "ARMv8.2 architectural features"
>>  
>> +config HAS_RAS_EXTENSION
>> +	bool "Support arm64 RAS extension"
>> +	default n
>> +	help
>> +	  Reliability, Availability, Serviceability(RAS; part of the ARMv8.2 Extensions).
>> +
>> +	  Selecting this option OS will try to recover the error that RAS hardware node detected.
>> +
> 
> As this is an architectural extension, this should be controlled by the
> CPU feature mechanism, and not be chosen at compile time. What you have
> here will break horribly when booted on a CPU that doesn't implement RAS.

thanks very much for your review, yes, it is, you are right.

> 
>>  config ARM64_UAO
>>  	bool "Enable support for User Access Override (UAO)"
>>  	default y
>> diff --git a/arch/arm64/include/asm/esr.h b/arch/arm64/include/asm/esr.h
>> index d14c478976d0..e38d32b2bdad 100644
>> --- a/arch/arm64/include/asm/esr.h
>> +++ b/arch/arm64/include/asm/esr.h
>> @@ -111,6 +111,7 @@
>>  #define ESR_ELx_COND_MASK	(UL(0xF) << ESR_ELx_COND_SHIFT)
>>  #define ESR_ELx_WFx_ISS_WFE	(UL(1) << 0)
>>  #define ESR_ELx_xVC_IMM_MASK	((1UL << 16) - 1)
>> +#define VSESR_ELx_IDS_ISS_MASK    ((1UL << 25) - 1)
>>  
>>  /* ESR value templates for specific events */
>>  
>> diff --git a/arch/arm64/include/asm/kvm_emulate.h b/arch/arm64/include/asm/kvm_emulate.h
>> index f5ea0ba70f07..20d4da7f5dce 100644
>> --- a/arch/arm64/include/asm/kvm_emulate.h
>> +++ b/arch/arm64/include/asm/kvm_emulate.h
>> @@ -148,6 +148,18 @@ static inline u32 kvm_vcpu_get_hsr(const struct kvm_vcpu *vcpu)
>>  	return vcpu->arch.fault.esr_el2;
>>  }
>>  
>> +#ifdef CONFIG_HAS_RAS_EXTENSION
>> +static inline u32 kvm_vcpu_get_vsesr(const struct kvm_vcpu *vcpu)
>> +{
>> +	return vcpu->arch.fault.vsesr_el2;
>> +}
>> +
>> +static inline void kvm_vcpu_set_vsesr(struct kvm_vcpu *vcpu, unsigned long val)
>> +{
>> +	vcpu->arch.fault.vsesr_el2 = val;
>> +}
>> +#endif
>> +
>>  static inline int kvm_vcpu_get_condition(const struct kvm_vcpu *vcpu)
>>  {
>>  	u32 esr = kvm_vcpu_get_hsr(vcpu);
>> diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
>> index e7705e7bb07b..f9e3bb57c461 100644
>> --- a/arch/arm64/include/asm/kvm_host.h
>> +++ b/arch/arm64/include/asm/kvm_host.h
>> @@ -83,6 +83,10 @@ struct kvm_mmu_memory_cache {
>>  };
>>  
>>  struct kvm_vcpu_fault_info {
>> +#ifdef CONFIG_HAS_RAS_EXTENSION
>> +	/* Virtual SError Exception Syndrome Register */
>> +	u32 vsesr_el2;
>> +#endif
>>  	u32 esr_el2;		/* Hyp Syndrom Register */
>>  	u64 far_el2;		/* Hyp Fault Address Register */
>>  	u64 hpfar_el2;		/* Hyp IPA Fault Address Register */
>> diff --git a/arch/arm64/kvm/hyp/switch.c b/arch/arm64/kvm/hyp/switch.c
>> index aede1658aeda..770a153fb6ba 100644
>> --- a/arch/arm64/kvm/hyp/switch.c
>> +++ b/arch/arm64/kvm/hyp/switch.c
>> @@ -86,6 +86,13 @@ static void __hyp_text __activate_traps(struct kvm_vcpu *vcpu)
>>  		isb();
>>  	}
>>  	write_sysreg(val, hcr_el2);
>> +#ifdef CONFIG_HAS_RAS_EXTENSION
>> +	/* If virtual System Error or Asynchronous Abort is pending. set
>> +	 * the virtual exception syndrome information
>> +	 */
>> +	if (vcpu->arch.hcr_el2 & HCR_VSE)
>> +		write_sysreg(vcpu->arch.fault.vsesr_el2, vsesr_el2);
>> +#endif
>>  	/* Trap on AArch32 cp15 c15 accesses (EL1 or EL0) */
>>  	write_sysreg(1 << 15, hstr_el2);
>>  	/*
>> @@ -139,8 +146,14 @@ static void __hyp_text __deactivate_traps(struct kvm_vcpu *vcpu)
>>  	 * the crucial bit is "On taking a vSError interrupt,
>>  	 * HCR_EL2.VSE is cleared to 0."
>>  	 */
>> -	if (vcpu->arch.hcr_el2 & HCR_VSE)
>> +	if (vcpu->arch.hcr_el2 & HCR_VSE) {
>>  		vcpu->arch.hcr_el2 = read_sysreg(hcr_el2);
>> +#ifdef CONFIG_HAS_RAS_EXTENSION
>> +		/* set vsesr_el2[24:0] with esr_el2[24:0] */
>> +		kvm_vcpu_set_vsesr(vcpu, read_sysreg_el2(esr)
>> +					& VSESR_ELx_IDS_ISS_MASK);
> 
> What guarantees that ESR_EL2 still contains the latest exception? What
> does it mean to store something that is the current EL2 exception
> syndrome together with an SError that has already been injected?

yes, thanks for the review, I will add a judgement condition for the "exit_code"
if the "exit_code == ARM_EXCEPTION_EL1_SERROR" then set the vsesr_el2.

for the aarch32, it only need set the "ExT, bit [12]" and AET, "bits [15:14]", other bit is RES0

> 
> Also, is it correct to directly copy the ESR_EL2 bits into VSESR_EL2? My
please see below spec description, it virtual SERROR syndrome from VSESR_EL2.
-----
 Control returns to the OS, and the ESB instruction is re-executed.
— The physical asynchronous SError interrupt has been cleared, so it is not taken again.
— The PE sets VDISR_EL2.A to 1 and records the syndrome from VSESR_EL2 in VDISR_EL2.
-----

> own reading of the specification seem to imply that there is at least
> differences when the guest is AArch32. Surely there would be some
> processing here.



> 
>> +#endif
>> +	}
>>  
>>  	__deactivate_traps_arch()();
>>  	write_sysreg(0, hstr_el2);
>> diff --git a/arch/arm64/kvm/inject_fault.c b/arch/arm64/kvm/inject_fault.c
>> index da6a8cfa54a0..08a13dfe28a8 100644
>> --- a/arch/arm64/kvm/inject_fault.c
>> +++ b/arch/arm64/kvm/inject_fault.c
>> @@ -242,4 +242,14 @@ void kvm_inject_undefined(struct kvm_vcpu *vcpu)
>>  void kvm_inject_vabt(struct kvm_vcpu *vcpu)
>>  {
>>  	vcpu_set_hcr(vcpu, vcpu_get_hcr(vcpu) | HCR_VSE);
>> +#ifdef CONFIG_HAS_RAS_EXTENSION
>> +	/* If virtual System Error or Asynchronous Abort is set. set
>> +	 * the virtual exception syndrome information
>> +	 */
>> +	kvm_vcpu_set_vsesr(vcpu, ((kvm_vcpu_get_vsesr(vcpu)
>> +				& (~VSESR_ELx_IDS_ISS_MASK))
>> +				| (kvm_vcpu_get_hsr(vcpu)
>> +				& VSESR_ELx_IDS_ISS_MASK)));
> 
> What is the rational for setting VSESR_EL2 with the EL1 syndrome
> information? That doesn't make any sense to me.
thanks, I set the VSESR_EL2 using the  EL2 syndrome information, "kvm_vcpu_get_hsr"
return the value of esr_el2, not EL1 syndrome information

> 
> Overall, this patch is completely inconsistent and unclear in what it
> tries to achieve. Also, as I already tated before, I'd like to see the
> "firmware first" mode of operation be enforced here, going back to
> userspace and let the VMM decide what to do.
> 
> Thanks,
> 
> 	M.
>

Marc Zyngier March 20, 2017, 1:58 p.m. UTC | #3

On 20/03/17 12:28, gengdongjiu wrote:
> 
> 
> On 2017/3/20 19:24, Marc Zyngier wrote:
>> Please include James Morse on anything RAS related, as he's already
>> looking at related patches.
>>
>> On 20/03/17 07:55, Dongjiu Geng wrote:
>>> In the RAS implementation, hardware pass the virtual SEI
>>> syndrome information through the VSESR_EL2, so set the virtual
>>> SEI syndrome using physical SEI syndrome el2_elr to pass to
>>> the guest OS
>>>
>>> Signed-off-by: Dongjiu Geng <gengdongjiu@huawei.com>
>>> Signed-off-by: Quanming wu <wuquanming@huawei.com>
>>> ---
>>>  arch/arm64/Kconfig                   |  8 ++++++++
>>>  arch/arm64/include/asm/esr.h         |  1 +
>>>  arch/arm64/include/asm/kvm_emulate.h | 12 ++++++++++++
>>>  arch/arm64/include/asm/kvm_host.h    |  4 ++++
>>>  arch/arm64/kvm/hyp/switch.c          | 15 ++++++++++++++-
>>>  arch/arm64/kvm/inject_fault.c        | 10 ++++++++++
>>>  6 files changed, 49 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
>>> index 8c7c244247b6..ea62170a3b75 100644
>>> --- a/arch/arm64/Kconfig
>>> +++ b/arch/arm64/Kconfig
>>> @@ -908,6 +908,14 @@ endmenu
>>>  
>>>  menu "ARMv8.2 architectural features"
>>>  
>>> +config HAS_RAS_EXTENSION
>>> +	bool "Support arm64 RAS extension"
>>> +	default n
>>> +	help
>>> +	  Reliability, Availability, Serviceability(RAS; part of the ARMv8.2 Extensions).
>>> +
>>> +	  Selecting this option OS will try to recover the error that RAS hardware node detected.
>>> +
>>
>> As this is an architectural extension, this should be controlled by the
>> CPU feature mechanism, and not be chosen at compile time. What you have
>> here will break horribly when booted on a CPU that doesn't implement RAS.
> 
> thanks very much for your review, yes, it is, you are right.
> 
>>
>>>  config ARM64_UAO
>>>  	bool "Enable support for User Access Override (UAO)"
>>>  	default y
>>> diff --git a/arch/arm64/include/asm/esr.h b/arch/arm64/include/asm/esr.h
>>> index d14c478976d0..e38d32b2bdad 100644
>>> --- a/arch/arm64/include/asm/esr.h
>>> +++ b/arch/arm64/include/asm/esr.h
>>> @@ -111,6 +111,7 @@
>>>  #define ESR_ELx_COND_MASK	(UL(0xF) << ESR_ELx_COND_SHIFT)
>>>  #define ESR_ELx_WFx_ISS_WFE	(UL(1) << 0)
>>>  #define ESR_ELx_xVC_IMM_MASK	((1UL << 16) - 1)
>>> +#define VSESR_ELx_IDS_ISS_MASK    ((1UL << 25) - 1)
>>>  
>>>  /* ESR value templates for specific events */
>>>  
>>> diff --git a/arch/arm64/include/asm/kvm_emulate.h b/arch/arm64/include/asm/kvm_emulate.h
>>> index f5ea0ba70f07..20d4da7f5dce 100644
>>> --- a/arch/arm64/include/asm/kvm_emulate.h
>>> +++ b/arch/arm64/include/asm/kvm_emulate.h
>>> @@ -148,6 +148,18 @@ static inline u32 kvm_vcpu_get_hsr(const struct kvm_vcpu *vcpu)
>>>  	return vcpu->arch.fault.esr_el2;
>>>  }
>>>  
>>> +#ifdef CONFIG_HAS_RAS_EXTENSION
>>> +static inline u32 kvm_vcpu_get_vsesr(const struct kvm_vcpu *vcpu)
>>> +{
>>> +	return vcpu->arch.fault.vsesr_el2;
>>> +}
>>> +
>>> +static inline void kvm_vcpu_set_vsesr(struct kvm_vcpu *vcpu, unsigned long val)
>>> +{
>>> +	vcpu->arch.fault.vsesr_el2 = val;
>>> +}
>>> +#endif
>>> +
>>>  static inline int kvm_vcpu_get_condition(const struct kvm_vcpu *vcpu)
>>>  {
>>>  	u32 esr = kvm_vcpu_get_hsr(vcpu);
>>> diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
>>> index e7705e7bb07b..f9e3bb57c461 100644
>>> --- a/arch/arm64/include/asm/kvm_host.h
>>> +++ b/arch/arm64/include/asm/kvm_host.h
>>> @@ -83,6 +83,10 @@ struct kvm_mmu_memory_cache {
>>>  };
>>>  
>>>  struct kvm_vcpu_fault_info {
>>> +#ifdef CONFIG_HAS_RAS_EXTENSION
>>> +	/* Virtual SError Exception Syndrome Register */
>>> +	u32 vsesr_el2;
>>> +#endif
>>>  	u32 esr_el2;		/* Hyp Syndrom Register */
>>>  	u64 far_el2;		/* Hyp Fault Address Register */
>>>  	u64 hpfar_el2;		/* Hyp IPA Fault Address Register */
>>> diff --git a/arch/arm64/kvm/hyp/switch.c b/arch/arm64/kvm/hyp/switch.c
>>> index aede1658aeda..770a153fb6ba 100644
>>> --- a/arch/arm64/kvm/hyp/switch.c
>>> +++ b/arch/arm64/kvm/hyp/switch.c
>>> @@ -86,6 +86,13 @@ static void __hyp_text __activate_traps(struct kvm_vcpu *vcpu)
>>>  		isb();
>>>  	}
>>>  	write_sysreg(val, hcr_el2);
>>> +#ifdef CONFIG_HAS_RAS_EXTENSION
>>> +	/* If virtual System Error or Asynchronous Abort is pending. set
>>> +	 * the virtual exception syndrome information
>>> +	 */
>>> +	if (vcpu->arch.hcr_el2 & HCR_VSE)
>>> +		write_sysreg(vcpu->arch.fault.vsesr_el2, vsesr_el2);
>>> +#endif
>>>  	/* Trap on AArch32 cp15 c15 accesses (EL1 or EL0) */
>>>  	write_sysreg(1 << 15, hstr_el2);
>>>  	/*
>>> @@ -139,8 +146,14 @@ static void __hyp_text __deactivate_traps(struct kvm_vcpu *vcpu)
>>>  	 * the crucial bit is "On taking a vSError interrupt,
>>>  	 * HCR_EL2.VSE is cleared to 0."
>>>  	 */
>>> -	if (vcpu->arch.hcr_el2 & HCR_VSE)
>>> +	if (vcpu->arch.hcr_el2 & HCR_VSE) {
>>>  		vcpu->arch.hcr_el2 = read_sysreg(hcr_el2);
>>> +#ifdef CONFIG_HAS_RAS_EXTENSION
>>> +		/* set vsesr_el2[24:0] with esr_el2[24:0] */
>>> +		kvm_vcpu_set_vsesr(vcpu, read_sysreg_el2(esr)
>>> +					& VSESR_ELx_IDS_ISS_MASK);
>>
>> What guarantees that ESR_EL2 still contains the latest exception? What
>> does it mean to store something that is the current EL2 exception
>> syndrome together with an SError that has already been injected?
> 
> yes, thanks for the review, I will add a judgement condition for the "exit_code"
> if the "exit_code == ARM_EXCEPTION_EL1_SERROR" then set the vsesr_el2.
> 
> for the aarch32, it only need set the "ExT, bit [12]" and AET, "bits [15:14]", other bit is RES0
> 
>>
>> Also, is it correct to directly copy the ESR_EL2 bits into VSESR_EL2? My
> please see below spec description, it virtual SERROR syndrome from VSESR_EL2.
> -----
>  Control returns to the OS, and the ESB instruction is re-executed.
> — The physical asynchronous SError interrupt has been cleared, so it is not taken again.
> — The PE sets VDISR_EL2.A to 1 and records the syndrome from VSESR_EL2 in VDISR_EL2.
> -----

Which doesn't say anything about directly using ESR_EL2 and propagating
it directly to the guest, specially when there is *already* an SError
pending (just in case you haven't noticed, this is the *exit* path, and
I can't see why you would overload VSESR_EL2 at at point).

> 
>> own reading of the specification seem to imply that there is at least
>> differences when the guest is AArch32. Surely there would be some
>> processing here.
> 
> 
> 
>>
>>> +#endif
>>> +	}
>>>  
>>>  	__deactivate_traps_arch()();
>>>  	write_sysreg(0, hstr_el2);
>>> diff --git a/arch/arm64/kvm/inject_fault.c b/arch/arm64/kvm/inject_fault.c
>>> index da6a8cfa54a0..08a13dfe28a8 100644
>>> --- a/arch/arm64/kvm/inject_fault.c
>>> +++ b/arch/arm64/kvm/inject_fault.c
>>> @@ -242,4 +242,14 @@ void kvm_inject_undefined(struct kvm_vcpu *vcpu)
>>>  void kvm_inject_vabt(struct kvm_vcpu *vcpu)
>>>  {
>>>  	vcpu_set_hcr(vcpu, vcpu_get_hcr(vcpu) | HCR_VSE);
>>> +#ifdef CONFIG_HAS_RAS_EXTENSION
>>> +	/* If virtual System Error or Asynchronous Abort is set. set
>>> +	 * the virtual exception syndrome information
>>> +	 */
>>> +	kvm_vcpu_set_vsesr(vcpu, ((kvm_vcpu_get_vsesr(vcpu)
>>> +				& (~VSESR_ELx_IDS_ISS_MASK))
>>> +				| (kvm_vcpu_get_hsr(vcpu)
>>> +				& VSESR_ELx_IDS_ISS_MASK)));
>>
>> What is the rational for setting VSESR_EL2 with the EL1 syndrome
>> information? That doesn't make any sense to me.
> thanks, I set the VSESR_EL2 using the  EL2 syndrome information, "kvm_vcpu_get_hsr"
> return the value of esr_el2, not EL1 syndrome information

Ah, good point. But that doesn't make it more valid. I still don't see
anything in the spec that supports this behaviour, and I still propose
that when RAS is enabled, the VSError injection is mediated by userspace.

Thanks,

	M.

James Morse March 20, 2017, 3:08 p.m. UTC | #4

Hi Dongjiu Geng,

On 20/03/17 13:58, Marc Zyngier wrote:
> On 20/03/17 12:28, gengdongjiu wrote:
>> On 2017/3/20 19:24, Marc Zyngier wrote:
>>> Please include James Morse on anything RAS related, as he's already
>>> looking at related patches.

(Thanks Marc,)

>>> On 20/03/17 07:55, Dongjiu Geng wrote:
>>>> In the RAS implementation, hardware pass the virtual SEI
>>>> syndrome information through the VSESR_EL2, so set the virtual
>>>> SEI syndrome using physical SEI syndrome el2_elr to pass to
>>>> the guest OS

How does this work with firmware first?
If we took a Physical SError Interrupt the CPER records are in the hosts memory.
To deliver a RAS event to the guest something needs to generate CPER records and
put them in the guest memory. Only Qemu knows where these memory regions are.

Put another way, what is the guest expected to do with this SError interrupt?
The only choice is panic(). We should send this via Qemu so that we can add
proper guest RAS support later. Once Qemu has written the CPER records into
guest memory, it can notify the guest.

Is anyone from Huawei looking at adding RAS support for Qemu?

It looks like we should save/restore VSESR_EL2 as part of the guest CPU state,
but this needs doing with the cpufeature framework so that the single-image
kernel works on platforms with and without these features.

Xie XiuQi's series for SEI also touches the cpufeature framework.

>>>> diff --git a/arch/arm64/kvm/hyp/switch.c b/arch/arm64/kvm/hyp/switch.c
>>>> index aede1658aeda..770a153fb6ba 100644
>>>> --- a/arch/arm64/kvm/hyp/switch.c
>>>> +++ b/arch/arm64/kvm/hyp/switch.c
>>>> @@ -86,6 +86,13 @@ static void __hyp_text __activate_traps(struct kvm_vcpu *vcpu)
>>>>  		isb();
>>>>  	}
>>>>  	write_sysreg(val, hcr_el2);
>>>> +#ifdef CONFIG_HAS_RAS_EXTENSION
>>>> +	/* If virtual System Error or Asynchronous Abort is pending. set
>>>> +	 * the virtual exception syndrome information
>>>> +	 */
>>>> +	if (vcpu->arch.hcr_el2 & HCR_VSE)

>>>> +		write_sysreg(vcpu->arch.fault.vsesr_el2, vsesr_el2);

This won't build with versions of binutils that don't recognise vsesr_el2.
Is there another patch out there that adds a sysreg definition for vsesr_el2?

>>>> diff --git a/arch/arm64/kvm/inject_fault.c b/arch/arm64/kvm/inject_fault.c
>>>> index da6a8cfa54a0..08a13dfe28a8 100644
>>>> --- a/arch/arm64/kvm/inject_fault.c
>>>> +++ b/arch/arm64/kvm/inject_fault.c
>>>> @@ -242,4 +242,14 @@ void kvm_inject_undefined(struct kvm_vcpu *vcpu)
>>>>  void kvm_inject_vabt(struct kvm_vcpu *vcpu)
>>>>  {
>>>>  	vcpu_set_hcr(vcpu, vcpu_get_hcr(vcpu) | HCR_VSE);
>>>> +#ifdef CONFIG_HAS_RAS_EXTENSION
>>>> +	/* If virtual System Error or Asynchronous Abort is set. set
>>>> +	 * the virtual exception syndrome information
>>>> +	 */
>>>> +	kvm_vcpu_set_vsesr(vcpu, ((kvm_vcpu_get_vsesr(vcpu)
>>>> +				& (~VSESR_ELx_IDS_ISS_MASK))
>>>> +				| (kvm_vcpu_get_hsr(vcpu)
>>>> +				& VSESR_ELx_IDS_ISS_MASK)));
>>>
>>> What is the rational for setting VSESR_EL2 with the EL1 syndrome
>>> information? That doesn't make any sense to me.
>> thanks, I set the VSESR_EL2 using the  EL2 syndrome information, "kvm_vcpu_get_hsr"
>> return the value of esr_el2, not EL1 syndrome information
> 
> Ah, good point. But that doesn't make it more valid. I still don't see
> anything in the spec that supports this behaviour, and I still propose
> that when RAS is enabled, the VSError injection is mediated by userspace.

I agree, we should be handling RAS errors as firmware-first, and Qemu plays the
part of firmware for a guest. We will probably need to have a KVM API for Qemu
to pend an SError with a specific ESR value.

If this isn't a firmware-first RAS error the existing code will pend an SError
for the guest.

Thanks,

James

Dongjiu Geng March 21, 2017, 6:07 a.m. UTC | #5

Hi Marc,
  Thank you very much for your review.


On 2017/3/20 21:58, Marc Zyngier wrote:
> On 20/03/17 12:28, gengdongjiu wrote:
>>
>>
>> On 2017/3/20 19:24, Marc Zyngier wrote:
>>> Please include James Morse on anything RAS related, as he's already
>>> looking at related patches.
>>>
>>> On 20/03/17 07:55, Dongjiu Geng wrote:
>>>> In the RAS implementation, hardware pass the virtual SEI
>>>> syndrome information through the VSESR_EL2, so set the virtual
>>>> SEI syndrome using physical SEI syndrome el2_elr to pass to
>>>> the guest OS
>>>>
>>>> Signed-off-by: Dongjiu Geng <gengdongjiu@huawei.com>
>>>> Signed-off-by: Quanming wu <wuquanming@huawei.com>
>>>> ---
>>>>  arch/arm64/Kconfig                   |  8 ++++++++
>>>>  arch/arm64/include/asm/esr.h         |  1 +
>>>>  arch/arm64/include/asm/kvm_emulate.h | 12 ++++++++++++
>>>>  arch/arm64/include/asm/kvm_host.h    |  4 ++++
>>>>  arch/arm64/kvm/hyp/switch.c          | 15 ++++++++++++++-
>>>>  arch/arm64/kvm/inject_fault.c        | 10 ++++++++++
>>>>  6 files changed, 49 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
>>>> index 8c7c244247b6..ea62170a3b75 100644
>>>> --- a/arch/arm64/Kconfig
>>>> +++ b/arch/arm64/Kconfig
>>>> @@ -908,6 +908,14 @@ endmenu
>>>>  
>>>>  menu "ARMv8.2 architectural features"
>>>>  
>>>> +config HAS_RAS_EXTENSION
>>>> +	bool "Support arm64 RAS extension"
>>>> +	default n
>>>> +	help
>>>> +	  Reliability, Availability, Serviceability(RAS; part of the ARMv8.2 Extensions).
>>>> +
>>>> +	  Selecting this option OS will try to recover the error that RAS hardware node detected.
>>>> +
>>>
>>> As this is an architectural extension, this should be controlled by the
>>> CPU feature mechanism, and not be chosen at compile time. What you have
>>> here will break horribly when booted on a CPU that doesn't implement RAS.
>>
>> thanks very much for your review, yes, it is, you are right.
>>
>>>
>>>>  config ARM64_UAO
>>>>  	bool "Enable support for User Access Override (UAO)"
>>>>  	default y
>>>> diff --git a/arch/arm64/include/asm/esr.h b/arch/arm64/include/asm/esr.h
>>>> index d14c478976d0..e38d32b2bdad 100644
>>>> --- a/arch/arm64/include/asm/esr.h
>>>> +++ b/arch/arm64/include/asm/esr.h
>>>> @@ -111,6 +111,7 @@
>>>>  #define ESR_ELx_COND_MASK	(UL(0xF) << ESR_ELx_COND_SHIFT)
>>>>  #define ESR_ELx_WFx_ISS_WFE	(UL(1) << 0)
>>>>  #define ESR_ELx_xVC_IMM_MASK	((1UL << 16) - 1)
>>>> +#define VSESR_ELx_IDS_ISS_MASK    ((1UL << 25) - 1)
>>>>  
>>>>  /* ESR value templates for specific events */
>>>>  
>>>> diff --git a/arch/arm64/include/asm/kvm_emulate.h b/arch/arm64/include/asm/kvm_emulate.h
>>>> index f5ea0ba70f07..20d4da7f5dce 100644
>>>> --- a/arch/arm64/include/asm/kvm_emulate.h
>>>> +++ b/arch/arm64/include/asm/kvm_emulate.h
>>>> @@ -148,6 +148,18 @@ static inline u32 kvm_vcpu_get_hsr(const struct kvm_vcpu *vcpu)
>>>>  	return vcpu->arch.fault.esr_el2;
>>>>  }
>>>>  
>>>> +#ifdef CONFIG_HAS_RAS_EXTENSION
>>>> +static inline u32 kvm_vcpu_get_vsesr(const struct kvm_vcpu *vcpu)
>>>> +{
>>>> +	return vcpu->arch.fault.vsesr_el2;
>>>> +}
>>>> +
>>>> +static inline void kvm_vcpu_set_vsesr(struct kvm_vcpu *vcpu, unsigned long val)
>>>> +{
>>>> +	vcpu->arch.fault.vsesr_el2 = val;
>>>> +}
>>>> +#endif
>>>> +
>>>>  static inline int kvm_vcpu_get_condition(const struct kvm_vcpu *vcpu)
>>>>  {
>>>>  	u32 esr = kvm_vcpu_get_hsr(vcpu);
>>>> diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
>>>> index e7705e7bb07b..f9e3bb57c461 100644
>>>> --- a/arch/arm64/include/asm/kvm_host.h
>>>> +++ b/arch/arm64/include/asm/kvm_host.h
>>>> @@ -83,6 +83,10 @@ struct kvm_mmu_memory_cache {
>>>>  };
>>>>  
>>>>  struct kvm_vcpu_fault_info {
>>>> +#ifdef CONFIG_HAS_RAS_EXTENSION
>>>> +	/* Virtual SError Exception Syndrome Register */
>>>> +	u32 vsesr_el2;
>>>> +#endif
>>>>  	u32 esr_el2;		/* Hyp Syndrom Register */
>>>>  	u64 far_el2;		/* Hyp Fault Address Register */
>>>>  	u64 hpfar_el2;		/* Hyp IPA Fault Address Register */
>>>> diff --git a/arch/arm64/kvm/hyp/switch.c b/arch/arm64/kvm/hyp/switch.c
>>>> index aede1658aeda..770a153fb6ba 100644
>>>> --- a/arch/arm64/kvm/hyp/switch.c
>>>> +++ b/arch/arm64/kvm/hyp/switch.c
>>>> @@ -86,6 +86,13 @@ static void __hyp_text __activate_traps(struct kvm_vcpu *vcpu)
>>>>  		isb();
>>>>  	}
>>>>  	write_sysreg(val, hcr_el2);
>>>> +#ifdef CONFIG_HAS_RAS_EXTENSION
>>>> +	/* If virtual System Error or Asynchronous Abort is pending. set
>>>> +	 * the virtual exception syndrome information
>>>> +	 */
>>>> +	if (vcpu->arch.hcr_el2 & HCR_VSE)
>>>> +		write_sysreg(vcpu->arch.fault.vsesr_el2, vsesr_el2);
>>>> +#endif
>>>>  	/* Trap on AArch32 cp15 c15 accesses (EL1 or EL0) */
>>>>  	write_sysreg(1 << 15, hstr_el2);
>>>>  	/*
>>>> @@ -139,8 +146,14 @@ static void __hyp_text __deactivate_traps(struct kvm_vcpu *vcpu)
>>>>  	 * the crucial bit is "On taking a vSError interrupt,
>>>>  	 * HCR_EL2.VSE is cleared to 0."
>>>>  	 */
>>>> -	if (vcpu->arch.hcr_el2 & HCR_VSE)
>>>> +	if (vcpu->arch.hcr_el2 & HCR_VSE) {
>>>>  		vcpu->arch.hcr_el2 = read_sysreg(hcr_el2);
>>>> +#ifdef CONFIG_HAS_RAS_EXTENSION
>>>> +		/* set vsesr_el2[24:0] with esr_el2[24:0] */
>>>> +		kvm_vcpu_set_vsesr(vcpu, read_sysreg_el2(esr)
>>>> +					& VSESR_ELx_IDS_ISS_MASK);
>>>
>>> What guarantees that ESR_EL2 still contains the latest exception? What
>>> does it mean to store something that is the current EL2 exception
>>> syndrome together with an SError that has already been injected?
>>
>> yes, thanks for the review, I will add a judgement condition for the "exit_code"
>> if the "exit_code == ARM_EXCEPTION_EL1_SERROR" then set the vsesr_el2.
>>
>> for the aarch32, it only need set the "ExT, bit [12]" and AET, "bits [15:14]", other bit is RES0
>>
>>>
>>> Also, is it correct to directly copy the ESR_EL2 bits into VSESR_EL2? My
>> please see below spec description, it virtual SERROR syndrome from VSESR_EL2.
>> -----
>>  Control returns to the OS, and the ESB instruction is re-executed.
>> — The physical asynchronous SError interrupt has been cleared, so it is not taken again.
>> — The PE sets VDISR_EL2.A to 1 and records the syndrome from VSESR_EL2 in VDISR_EL2.
>> -----
> 
> Which doesn't say anything about directly using ESR_EL2 and propagating
> it directly to the guest, specially when there is *already* an SError

thank you for the comments.
In my current solution which I refer to ARM RAS spec(RAS_Extension_PRD03-PRDC-010953-32-0.pdf).
when the Guest OS triggers an SEI, it will firstly trap to EL3 firmware, El3 firmware records the error
info to the APEI table, then copy the ESR_EL3 ELR_EL3 to ESR_EL2 ELR_EL2 and transfers control to the
hypervisor, hypervisor delegates the error exception to EL1 guest OS by setting HCR_EL2.VSE to 1 and pass the
virtual SEI syndrome through vsesr_el2. The EL1 guest OS check the DISR_EL1 syndrome information to decide to
terminate the application, or do some other recovery action. because the HCR_EL2.AMO is set, so in fact, read
DISR_EL1, it returns the VDISR_EL2. and VDISR_EL2 is loaded from VSESR_EL2, so here I pass the virtual SEI
syndrome vsesr_el2.

please see the ARM spec example about SEI with ESB
----------------------------------------------------------------------------------------------------
 The firmware logs, triages, and delegates the error exception to the hypervisor. As the error came
from EL1, it does by faking an SError interrupt exception entry to EL2.
 Control transfers to the hypervisor’s delegated error recovery agent. It repeats the process of triaging
the error.
 The hypervisor delegates the error exception to EL1 by setting HCR_EL2.VSE to 1, and returns
control to the OS using ERET.
 Control returns to the OS, and the ESB instruction is re-executed.
— The physical asynchronous SError interrupt has been cleared, so it is not taken again.
— The PE sets VDISR_EL2.A to 1 and records the syndrome from VSESR_EL2 in VDISR_EL2.
 The SVC exception entry code completes, and then the OS checks DISR_EL1 for a deferred SError
interrupt.
— See the example code sequence in Error Synchronization Barrier.
— This in fact returns VDISR_EL2 because the HCR_EL2.AMO bit is set.
 Because this returns DISR_EL1.A set, the OS instead processes the error.
— Although the error exception was taken from EL1, the OS error recovery agent can use the
information that it was taken from the ESB instruction at the exception entry point to isolate
the error to the EL0 application and not affect the OS.
 The error has been logged, triaged, and contained to the EL0 application.
-------------------------------------------------------------------------------------------------------------


> pending (just in case you haven't noticed, this is the *exit* path, and
> I can't see why you would overload VSESR_EL2 at at point).
>
thanks for the question.
This is exit path, for the case that already an SError is pending.it should be overload VSESR_EL2 using
the esr_el2 in below situation, right?

int handle_exit(struct kvm_vcpu *vcpu, struct kvm_run *run,
                       int exception_index)
{
        exit_handle_fn exit_handler;

        if (ARM_SERROR_PENDING(exception_index)) {
                u8 hsr_ec = ESR_ELx_EC(kvm_vcpu_get_hsr(vcpu));

                /*
                 * HVC/SMC already have an adjusted PC, which we need
                 * to correct in order to return to after having
                 * injected the SError.
                 */
                if (hsr_ec == ESR_ELx_EC_HVC32 || hsr_ec == ESR_ELx_EC_HVC64 ||
                    hsr_ec == ESR_ELx_EC_SMC32 || hsr_ec == ESR_ELx_EC_SMC64) {
                        u32 adj =  kvm_vcpu_trap_il_is32bit(vcpu) ? 4 : 2;
                        *vcpu_pc(vcpu) -= adj;
                }

                kvm_inject_vabt(vcpu);
                return 1;
        }


>>
>>> own reading of the specification seem to imply that there is at least
>>> differences when the guest is AArch32. Surely there would be some
>>> processing here.
>>
>>
>>
>>>
>>>> +#endif
>>>> +	}
>>>>  
>>>>  	__deactivate_traps_arch()();
>>>>  	write_sysreg(0, hstr_el2);
>>>> diff --git a/arch/arm64/kvm/inject_fault.c b/arch/arm64/kvm/inject_fault.c
>>>> index da6a8cfa54a0..08a13dfe28a8 100644
>>>> --- a/arch/arm64/kvm/inject_fault.c
>>>> +++ b/arch/arm64/kvm/inject_fault.c
>>>> @@ -242,4 +242,14 @@ void kvm_inject_undefined(struct kvm_vcpu *vcpu)
>>>>  void kvm_inject_vabt(struct kvm_vcpu *vcpu)
>>>>  {
>>>>  	vcpu_set_hcr(vcpu, vcpu_get_hcr(vcpu) | HCR_VSE);
>>>> +#ifdef CONFIG_HAS_RAS_EXTENSION
>>>> +	/* If virtual System Error or Asynchronous Abort is set. set
>>>> +	 * the virtual exception syndrome information
>>>> +	 */
>>>> +	kvm_vcpu_set_vsesr(vcpu, ((kvm_vcpu_get_vsesr(vcpu)
>>>> +				& (~VSESR_ELx_IDS_ISS_MASK))
>>>> +				| (kvm_vcpu_get_hsr(vcpu)
>>>> +				& VSESR_ELx_IDS_ISS_MASK)));
>>>
>>> What is the rational for setting VSESR_EL2 with the EL1 syndrome
>>> information? That doesn't make any sense to me.
>> thanks, I set the VSESR_EL2 using the  EL2 syndrome information, "kvm_vcpu_get_hsr"
>> return the value of esr_el2, not EL1 syndrome information
> 
> Ah, good point. But that doesn't make it more valid. I still don't see
> anything in the spec that supports this behaviour, and I still propose
> that when RAS is enabled, the VSError injection is mediated by userspace.
> 
For the reason that set vsesr_el2 using esr_el2, I have explained it in above.
do you mean below steps?

EL0/EL1 SEI ---> EL3 firmware ------> EL2 hypervisor notify the Qemu to inject VSError ------>Qemu call KVM API to inject VSError ---->KVM



> Thanks,
> 
> 	M.
>

Dongjiu Geng March 21, 2017, 6:32 a.m. UTC | #6

On 2017/3/20 23:08, James Morse wrote:
> Hi Dongjiu Geng,
> 
> On 20/03/17 13:58, Marc Zyngier wrote:
>> On 20/03/17 12:28, gengdongjiu wrote:
>>> On 2017/3/20 19:24, Marc Zyngier wrote:
>>>> Please include James Morse on anything RAS related, as he's already
>>>> looking at related patches.
> 
> (Thanks Marc,)
> 
>>>> On 20/03/17 07:55, Dongjiu Geng wrote:
>>>>> In the RAS implementation, hardware pass the virtual SEI
>>>>> syndrome information through the VSESR_EL2, so set the virtual
>>>>> SEI syndrome using physical SEI syndrome el2_elr to pass to
>>>>> the guest OS
> 
> How does this work with firmware first?

I explained it in previous mail about the work flow.

when the Guest OS triggers an SEI, it will firstly trap to EL3 firmware, El3 firmware records the error
info to the APEI table, then copy the ESR_EL3 ELR_EL3 to ESR_EL2 ELR_EL2 and transfers control to the
hypervisor, hypervisor delegates the error exception to EL1 guest OS by setting HCR_EL2.VSE to 1 and pass the
virtual SEI syndrome through vsesr_el2. The EL1 guest OS check the DISR_EL1 syndrome information to decide to
terminate the application, or do some other recovery action. because the HCR_EL2.AMO is set, so in fact, read
DISR_EL1, it returns the VDISR_EL2. and VDISR_EL2 is loaded from VSESR_EL2, so here I pass the virtual SEI
syndrome vsesr_el2.

> If we took a Physical SError Interrupt the CPER records are in the hosts memory.
> To deliver a RAS event to the guest something needs to generate CPER records and
> put them in the guest memory. Only Qemu knows where these memory regions are.
> 
> Put another way, what is the guest expected to do with this SError interrupt?
No, we do not only panic,if it is EL0 application SEI. the OS error recovery
agent will terminate the EL0 application to isolate the error; If it is EL1 guest
OS SError, guest OS can see whether it can recover. if the error was in a read-only file cache buffer, guest OS
can invalidate the page and reload the data from disk.

if all of the above are failed, OS will panic.


> The only choice is panic(). We should send this via Qemu so that we can add
> proper guest RAS support later. Once Qemu has written the CPER records into
> guest memory, it can notify the guest.
> 
> Is anyone from Huawei looking at adding RAS support for Qemu?
 yes, I am looking at Qemu and want to add RAS support.
 do you mean let Qemu inject both the SEA and SEI?

> 
> 
> It looks like we should save/restore VSESR_EL2 as part of the guest CPU state,
> but this needs doing with the cpufeature framework so that the single-image
> kernel works on platforms with and without these features.
 yes, you are right, we will follow cpufeature framework.


> 
> Xie XiuQi's series for SEI also touches the cpufeature framework.
> 
> 
>>>>> diff --git a/arch/arm64/kvm/hyp/switch.c b/arch/arm64/kvm/hyp/switch.c
>>>>> index aede1658aeda..770a153fb6ba 100644
>>>>> --- a/arch/arm64/kvm/hyp/switch.c
>>>>> +++ b/arch/arm64/kvm/hyp/switch.c
>>>>> @@ -86,6 +86,13 @@ static void __hyp_text __activate_traps(struct kvm_vcpu *vcpu)
>>>>>  		isb();
>>>>>  	}
>>>>>  	write_sysreg(val, hcr_el2);
>>>>> +#ifdef CONFIG_HAS_RAS_EXTENSION
>>>>> +	/* If virtual System Error or Asynchronous Abort is pending. set
>>>>> +	 * the virtual exception syndrome information
>>>>> +	 */
>>>>> +	if (vcpu->arch.hcr_el2 & HCR_VSE)
> 
>>>>> +		write_sysreg(vcpu->arch.fault.vsesr_el2, vsesr_el2);
> 
> This won't build with versions of binutils that don't recognise vsesr_el2.
> Is there another patch out there that adds a sysreg definition for vsesr_el2?
> 
> 
>>>>> diff --git a/arch/arm64/kvm/inject_fault.c b/arch/arm64/kvm/inject_fault.c
>>>>> index da6a8cfa54a0..08a13dfe28a8 100644
>>>>> --- a/arch/arm64/kvm/inject_fault.c
>>>>> +++ b/arch/arm64/kvm/inject_fault.c
>>>>> @@ -242,4 +242,14 @@ void kvm_inject_undefined(struct kvm_vcpu *vcpu)
>>>>>  void kvm_inject_vabt(struct kvm_vcpu *vcpu)
>>>>>  {
>>>>>  	vcpu_set_hcr(vcpu, vcpu_get_hcr(vcpu) | HCR_VSE);
>>>>> +#ifdef CONFIG_HAS_RAS_EXTENSION
>>>>> +	/* If virtual System Error or Asynchronous Abort is set. set
>>>>> +	 * the virtual exception syndrome information
>>>>> +	 */
>>>>> +	kvm_vcpu_set_vsesr(vcpu, ((kvm_vcpu_get_vsesr(vcpu)
>>>>> +				& (~VSESR_ELx_IDS_ISS_MASK))
>>>>> +				| (kvm_vcpu_get_hsr(vcpu)
>>>>> +				& VSESR_ELx_IDS_ISS_MASK)));
>>>>
>>>> What is the rational for setting VSESR_EL2 with the EL1 syndrome
>>>> information? That doesn't make any sense to me.
>>> thanks, I set the VSESR_EL2 using the  EL2 syndrome information, "kvm_vcpu_get_hsr"
>>> return the value of esr_el2, not EL1 syndrome information
>>
>> Ah, good point. But that doesn't make it more valid. I still don't see
>> anything in the spec that supports this behaviour, and I still propose
>> that when RAS is enabled, the VSError injection is mediated by userspace.
> 
> I agree, we should be handling RAS errors as firmware-first, and Qemu plays the
> part of firmware for a guest. We will probably need to have a KVM API for Qemu
> to pend an SError with a specific ESR value.
> 
> If this isn't a firmware-first RAS error the existing code will pend an SError
> for the guest.
> 
so for both SEA and SEI, do you prefer to below steps?
EL0/EL1 SEI/SEA ---> EL3 firmware first handle ------> EL2 hypervisor notify the Qemu to inject SEI/SEA------>Qemu call KVM API to inject SEA/SEI---->KVM inject SEA/SEI to guest OS


> 
> Thanks,
> 
> James
> 
> 
> .
>

Christoffer Dall March 21, 2017, 11:34 a.m. UTC | #7

On Tue, Mar 21, 2017 at 02:32:29PM +0800, gengdongjiu wrote:
> 
> 
> On 2017/3/20 23:08, James Morse wrote:
> > Hi Dongjiu Geng,
> > 
> > On 20/03/17 13:58, Marc Zyngier wrote:
> >> On 20/03/17 12:28, gengdongjiu wrote:
> >>> On 2017/3/20 19:24, Marc Zyngier wrote:
> >>>> Please include James Morse on anything RAS related, as he's already
> >>>> looking at related patches.
> > 
> > (Thanks Marc,)
> > 
> >>>> On 20/03/17 07:55, Dongjiu Geng wrote:
> >>>>> In the RAS implementation, hardware pass the virtual SEI
> >>>>> syndrome information through the VSESR_EL2, so set the virtual
> >>>>> SEI syndrome using physical SEI syndrome el2_elr to pass to
> >>>>> the guest OS
> > 
> > How does this work with firmware first?
> 
> I explained it in previous mail about the work flow.

When delivering and reporting SEIs to the VM, should this happen
directly to the OS running in the VM, or to the guest firmware (e.g.
UEFI) running in the VM as well?

Thanks,
-Christoffer

> 
> when the Guest OS triggers an SEI, it will firstly trap to EL3 firmware, El3 firmware records the error
> info to the APEI table, then copy the ESR_EL3 ELR_EL3 to ESR_EL2 ELR_EL2 and transfers control to the
> hypervisor, hypervisor delegates the error exception to EL1 guest OS by setting HCR_EL2.VSE to 1 and pass the
> virtual SEI syndrome through vsesr_el2. The EL1 guest OS check the DISR_EL1 syndrome information to decide to
> terminate the application, or do some other recovery action. because the HCR_EL2.AMO is set, so in fact, read
> DISR_EL1, it returns the VDISR_EL2. and VDISR_EL2 is loaded from VSESR_EL2, so here I pass the virtual SEI
> syndrome vsesr_el2.
> 
> > If we took a Physical SError Interrupt the CPER records are in the hosts memory.
> > To deliver a RAS event to the guest something needs to generate CPER records and
> > put them in the guest memory. Only Qemu knows where these memory regions are.
> > 
> > Put another way, what is the guest expected to do with this SError interrupt?
> No, we do not only panic,if it is EL0 application SEI. the OS error recovery
> agent will terminate the EL0 application to isolate the error; If it is EL1 guest
> OS SError, guest OS can see whether it can recover. if the error was in a read-only file cache buffer, guest OS
> can invalidate the page and reload the data from disk.
> 
> if all of the above are failed, OS will panic.
> 
> 
> > The only choice is panic(). We should send this via Qemu so that we can add
> > proper guest RAS support later. Once Qemu has written the CPER records into
> > guest memory, it can notify the guest.
> > 
> > Is anyone from Huawei looking at adding RAS support for Qemu?
>  yes, I am looking at Qemu and want to add RAS support.
>  do you mean let Qemu inject both the SEA and SEI?
> 
> > 
> > 
> > It looks like we should save/restore VSESR_EL2 as part of the guest CPU state,
> > but this needs doing with the cpufeature framework so that the single-image
> > kernel works on platforms with and without these features.
>  yes, you are right, we will follow cpufeature framework.
> 
> 
> > 
> > Xie XiuQi's series for SEI also touches the cpufeature framework.
> > 
> > 
> >>>>> diff --git a/arch/arm64/kvm/hyp/switch.c b/arch/arm64/kvm/hyp/switch.c
> >>>>> index aede1658aeda..770a153fb6ba 100644
> >>>>> --- a/arch/arm64/kvm/hyp/switch.c
> >>>>> +++ b/arch/arm64/kvm/hyp/switch.c
> >>>>> @@ -86,6 +86,13 @@ static void __hyp_text __activate_traps(struct kvm_vcpu *vcpu)
> >>>>>  		isb();
> >>>>>  	}
> >>>>>  	write_sysreg(val, hcr_el2);
> >>>>> +#ifdef CONFIG_HAS_RAS_EXTENSION
> >>>>> +	/* If virtual System Error or Asynchronous Abort is pending. set
> >>>>> +	 * the virtual exception syndrome information
> >>>>> +	 */
> >>>>> +	if (vcpu->arch.hcr_el2 & HCR_VSE)
> > 
> >>>>> +		write_sysreg(vcpu->arch.fault.vsesr_el2, vsesr_el2);
> > 
> > This won't build with versions of binutils that don't recognise vsesr_el2.
> > Is there another patch out there that adds a sysreg definition for vsesr_el2?
> > 
> > 
> >>>>> diff --git a/arch/arm64/kvm/inject_fault.c b/arch/arm64/kvm/inject_fault.c
> >>>>> index da6a8cfa54a0..08a13dfe28a8 100644
> >>>>> --- a/arch/arm64/kvm/inject_fault.c
> >>>>> +++ b/arch/arm64/kvm/inject_fault.c
> >>>>> @@ -242,4 +242,14 @@ void kvm_inject_undefined(struct kvm_vcpu *vcpu)
> >>>>>  void kvm_inject_vabt(struct kvm_vcpu *vcpu)
> >>>>>  {
> >>>>>  	vcpu_set_hcr(vcpu, vcpu_get_hcr(vcpu) | HCR_VSE);
> >>>>> +#ifdef CONFIG_HAS_RAS_EXTENSION
> >>>>> +	/* If virtual System Error or Asynchronous Abort is set. set
> >>>>> +	 * the virtual exception syndrome information
> >>>>> +	 */
> >>>>> +	kvm_vcpu_set_vsesr(vcpu, ((kvm_vcpu_get_vsesr(vcpu)
> >>>>> +				& (~VSESR_ELx_IDS_ISS_MASK))
> >>>>> +				| (kvm_vcpu_get_hsr(vcpu)
> >>>>> +				& VSESR_ELx_IDS_ISS_MASK)));
> >>>>
> >>>> What is the rational for setting VSESR_EL2 with the EL1 syndrome
> >>>> information? That doesn't make any sense to me.
> >>> thanks, I set the VSESR_EL2 using the  EL2 syndrome information, "kvm_vcpu_get_hsr"
> >>> return the value of esr_el2, not EL1 syndrome information
> >>
> >> Ah, good point. But that doesn't make it more valid. I still don't see
> >> anything in the spec that supports this behaviour, and I still propose
> >> that when RAS is enabled, the VSError injection is mediated by userspace.
> > 
> > I agree, we should be handling RAS errors as firmware-first, and Qemu plays the
> > part of firmware for a guest. We will probably need to have a KVM API for Qemu
> > to pend an SError with a specific ESR value.
> > 
> > If this isn't a firmware-first RAS error the existing code will pend an SError
> > for the guest.
> > 
> so for both SEA and SEI, do you prefer to below steps?
> EL0/EL1 SEI/SEA ---> EL3 firmware first handle ------> EL2 hypervisor notify the Qemu to inject SEI/SEA------>Qemu call KVM API to inject SEA/SEI---->KVM inject SEA/SEI to guest OS
> 
> 
> > 
> > Thanks,
> > 
> > James
> > 
> > 
> > .
> > 
>

James Morse March 21, 2017, 1:10 p.m. UTC | #8

Hi,

On 21/03/17 06:32, gengdongjiu wrote:
> On 2017/3/20 23:08, James Morse wrote:
>> On 20/03/17 13:58, Marc Zyngier wrote:
>>> On 20/03/17 12:28, gengdongjiu wrote:
>>>> On 2017/3/20 19:24, Marc Zyngier wrote:
>>>>> On 20/03/17 07:55, Dongjiu Geng wrote:
>>>>>> In the RAS implementation, hardware pass the virtual SEI
>>>>>> syndrome information through the VSESR_EL2, so set the virtual
>>>>>> SEI syndrome using physical SEI syndrome el2_elr to pass to
>>>>>> the guest OS

(I've juggled the order of your replies:)

> so for both SEA and SEI, do you prefer to below steps?
> EL0/EL1 SEI/SEA ---> EL3 firmware first handle ------> EL2 hypervisor notify >
the Qemu to inject SEI/SEA------>Qemu call KVM API to inject SEA/SEI---->KVM >
inject SEA/SEI to guest OS

Yes, to expand your EL2 hypervisor notify Qemu step:
1 The host should call its APEI code to parse the CPER records.
2 User space processes are then notified via SIGBUS (or for rasdaemon, trace
  points).
3 Qemu can take the address delivered via SIGBUS and generate CPER records for
  the guest. It knows how to convert host addresses to guest IPAs, and it knows
  where in guest memory to write the CPER records.
4 Qemu can then notify the guest via whatever mechanism it advertised via the
  HEST/GHES table. It might not be the same mechanism that the host received
  the notification through.

Steps 1 and 2 are the same even if no guest is running, so we don't have to add
any special case for KVM. This is existing code that x86 uses.
We can test the Qemu parts without any firmware support and the APEI path in the
host and guest is the same.

>> Is anyone from Huawei looking at adding RAS support for Qemu?
>  yes, I am looking at Qemu and want to add RAS support.

Great, support in Qemu is one of the missing pieces. On x86 it looks like it
emulates machine-check-exceptions, which is how x86 did this before
firmware-first and APEI became the standard.

>  do you mean let Qemu inject both the SEA and SEI?

To do the notification, yes. It needs to happen after the CPER records have been
written, and the mechanism and CPER memory location need to match what the guest
was told via the HEST/GHES table.

If Qemu didn't tell the guest about firmware-first, it can still deliver the
guest an SError Interrupt.

SEA should be possible to do with the KVM_SET_REG API, GPIO/GSIV and the other
kind of interrupts can use irqfd. For SEI we may need to add an API call to KVM
to let it pend SError with a specific ESR.

>> How does this work with firmware first?

> when the Guest OS triggers an SEI, it will firstly trap to EL3 firmware, El3 firmware records the error
> info to the APEI table, 

These are CPER records in a memory area pointed to by one of HEST's GHES entries?

> then copy the ESR_EL3 ELR_EL3 to ESR_EL2 ELR_EL2 and transfers control to the
> hypervisor, hypervisor delegates the error exception to EL1 guest

This is a problem, just because the error occurred while the guest was running
doesn't mean we should deliver it directly to the guest. Some of these errors
will be fatal for the CPU and the host should try and power it off to contain
the fault. For example: CPER's 'micro-architectural error', should the guest
power-off the vCPU? All that really does is return to the hypervisor, the error
hasn't been contained.

Firmware should handle the error first, then the host, finally the guest via Qemu.

> OS by setting HCR_EL2.VSE to 1 and pass the virtual SEI syndrome through vsesr_el2. 
> The EL1 guest OS check the DISR_EL1 syndrome information to decide to
> terminate the application, or do some other recovery action. because the HCR_EL2.AMO is set, so in fact, read
> DISR_EL1, it returns the VDISR_EL2. and VDISR_EL2 is loaded from VSESR_EL2, so here I pass the virtual SEI
> syndrome vsesr_el2.

So this is how an SError Interrupt's ESR gets into a guest. How does it get hold
of the CPER records?

>> If we took a Physical SError Interrupt the CPER records are in the hosts memory.
>> To deliver a RAS event to the guest something needs to generate CPER records and
>> put them in the guest memory. Only Qemu knows where these memory regions are.
>>
>> Put another way, what is the guest expected to do with this SError interrupt?
>
> No, we do not only panic,if it is EL0 application SEI. the OS error recovery
> agent will terminate the EL0 application to isolate the error; If it is EL1 guest
> OS SError, guest OS can see whether it can recover. if the error was in a read-only file cache buffer, guest OS
> can invalidate the page and reload the data from disk.

How do we get an address for memory failure? SError is asynchronous, I don't
think it sets the FAR. (SEA is synchronous and its not guaranteed to set the
FAR..). As far as I understand this information is in the CPER records in host
memory.

If we did have an address it would be a host address, how is it converted to a
guest IPA? I think Qemu should do this as part of its CPER record generation,
once the host has decided the error wasn't catastrophic.

Thanks,

James

kernel test robot March 21, 2017, 1:51 p.m. UTC | #9

Hi Dongjiu,

[auto build test ERROR on arm64/for-next/core]
[also build test ERROR on v4.11-rc3 next-20170321]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Dongjiu-Geng/kvm-pass-the-virtual-SEI-syndrome-to-guest-OS/20170321-152433
base:   https://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux.git for-next/core
config: arm64-allmodconfig (attached as .config)
compiler: aarch64-linux-gnu-gcc (Debian 6.1.1-9) 6.1.1 20160705
reproduce:
        wget https://raw.githubusercontent.com/01org/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # save the attached .config to linux build tree
        make.cross ARCH=arm64 

All errors (new ones prefixed by >>):

   /tmp/ccWmLqCE.s: Assembler messages:
>> /tmp/ccWmLqCE.s:677: Error: selected processor does not support system register name 'vsesr_el2'

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

James Morse March 21, 2017, 7:11 p.m. UTC | #10

Hi Christoffer,

On 21/03/17 11:34, Christoffer Dall wrote:
> On Tue, Mar 21, 2017 at 02:32:29PM +0800, gengdongjiu wrote:
>> On 2017/3/20 23:08, James Morse wrote:
>>>>>> On 20/03/17 07:55, Dongjiu Geng wrote:
>>>>>>> In the RAS implementation, hardware pass the virtual SEI
>>>>>>> syndrome information through the VSESR_EL2, so set the virtual
>>>>>>> SEI syndrome using physical SEI syndrome el2_elr to pass to
>>>>>>> the guest OS
>>>
>>> How does this work with firmware first?
>>
>> I explained it in previous mail about the work flow.
> 
> When delivering and reporting SEIs to the VM, should this happen
> directly to the OS running in the VM, or to the guest firmware (e.g.
> UEFI) running in the VM as well?

'firmware first' is the ACPI specs name for x86's BIOS or management-mode
handling the error. On arm64 we have multiple things called firmware, so the
name might be more confusing than helpful.

As far as I understand it, firmware here refers to the secure-world and EL3.
Something like ATF can use SCR_EL3.EA to claim SErrors and external aborts,
routing them to EL3 where secure platform specific firmware generates CPER records.
For a guest, Qemu takes the role of this EL3-firmware.

Thanks,

James

Christoffer Dall March 21, 2017, 7:39 p.m. UTC | #11

[resending as clear text - I thought GMail supported that - sorry]

On Tue, Mar 21, 2017 at 07:11:44PM +0000, James Morse wrote:
> Hi Christoffer,
> 
> On 21/03/17 11:34, Christoffer Dall wrote:
> > On Tue, Mar 21, 2017 at 02:32:29PM +0800, gengdongjiu wrote:
> >> On 2017/3/20 23:08, James Morse wrote:
> >>>>>> On 20/03/17 07:55, Dongjiu Geng wrote:
> >>>>>>> In the RAS implementation, hardware pass the virtual SEI
> >>>>>>> syndrome information through the VSESR_EL2, so set the virtual
> >>>>>>> SEI syndrome using physical SEI syndrome el2_elr to pass to
> >>>>>>> the guest OS
> >>>
> >>> How does this work with firmware first?
> >>
> >> I explained it in previous mail about the work flow.
> > 
> > When delivering and reporting SEIs to the VM, should this happen
> > directly to the OS running in the VM, or to the guest firmware (e.g.
> > UEFI) running in the VM as well?
> 
> 'firmware first' is the ACPI specs name for x86's BIOS or management-mode
> handling the error. On arm64 we have multiple things called firmware, so the
> name might be more confusing than helpful.
> 
> As far as I understand it, firmware here refers to the secure-world and EL3.
> Something like ATF can use SCR_EL3.EA to claim SErrors and external aborts,
> routing them to EL3 where secure platform specific firmware generates CPER records.
> For a guest, Qemu takes the role of this EL3-firmware.
> 
Thanks for the clarification.  So UEFI in the VM would not be involved
in this at all?

My confusion here comes from not thinking about QEMU or KVM as firmware,
but as the machine, so it would be sort of like the functionality is
baked into hardware rather than firmware.

Note that to the VM, the environment will look like hardware without EL3
and without a secure world, so any software assuming there's something
'hidden' behind the available non-secure modes must not decide to
disable features if discovering the lack of a secure world.

Thanks,
-Christoffer

Peter Maydell March 21, 2017, 10:10 p.m. UTC | #12

On 21 March 2017 at 19:39, Christoffer Dall <cdall@linaro.org> wrote:
> My confusion here comes from not thinking about QEMU or KVM as firmware,
> but as the machine, so it would be sort of like the functionality is
> baked into hardware rather than firmware.

There is precedent for that kind of thing -- we implement PSCI
in KVM/QEMU for the guest, though in real hardware it would be
provided by firmware at EL3.

thanks
-- PMM

Dongjiu Geng March 22, 2017, 3:20 a.m. UTC | #13

Hi kbuild test robot,

  Thank you.
  The build error is due to "vsesr_el2" is armv8.2 register, I will change "vsesr_el2" to sysreg usage


On 2017/3/21 21:51, kbuild test robot wrote:
> Hi Dongjiu,
> 
> [auto build test ERROR on arm64/for-next/core]
> [also build test ERROR on v4.11-rc3 next-20170321]
> [if your patch is applied to the wrong git tree, please drop us a note to help improve the system]
> 
> url:    https://github.com/0day-ci/linux/commits/Dongjiu-Geng/kvm-pass-the-virtual-SEI-syndrome-to-guest-OS/20170321-152433
> base:   https://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux.git for-next/core
> config: arm64-allmodconfig (attached as .config)
> compiler: aarch64-linux-gnu-gcc (Debian 6.1.1-9) 6.1.1 20160705
> reproduce:
>         wget https://raw.githubusercontent.com/01org/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
>         chmod +x ~/bin/make.cross
>         # save the attached .config to linux build tree
>         make.cross ARCH=arm64 
> 
> All errors (new ones prefixed by >>):
> 
>    /tmp/ccWmLqCE.s: Assembler messages:
>>> /tmp/ccWmLqCE.s:677: Error: selected processor does not support system register name 'vsesr_el2'
> 
> ---
> 0-DAY kernel test infrastructure                Open Source Technology Center
> https://lists.01.org/pipermail/kbuild-all                   Intel Corporation
>

Marc Zyngier March 22, 2017, 11:15 a.m. UTC | #14

On 21/03/17 22:10, Peter Maydell wrote:
> On 21 March 2017 at 19:39, Christoffer Dall <cdall@linaro.org> wrote:
>> My confusion here comes from not thinking about QEMU or KVM as firmware,
>> but as the machine, so it would be sort of like the functionality is
>> baked into hardware rather than firmware.
> 
> There is precedent for that kind of thing -- we implement PSCI
> in KVM/QEMU for the guest, though in real hardware it would be
> provided by firmware at EL3.

[probably vastly off topic]

In retrospect, I now believe this was a rather big mistake to implement
PSCI in KVM.

Eventually, we want to be able to handle the full spectrum of the SMCCC
and forward things to an actual TEE if available. There is no real
reason why PSCI shouldn't be handled in userspace the same way (and we
already offload reset and halt to QEMU).

Thanks,

	M.

Dongjiu Geng March 22, 2017, 1:37 p.m. UTC | #15

Hi James,
  Thank you very much for your detailed comment and answer.

On 2017/3/21 21:10, James Morse wrote:
> Hi,
> 
> On 21/03/17 06:32, gengdongjiu wrote:
>> On 2017/3/20 23:08, James Morse wrote:
>>> On 20/03/17 13:58, Marc Zyngier wrote:
>>>> On 20/03/17 12:28, gengdongjiu wrote:
>>>>> On 2017/3/20 19:24, Marc Zyngier wrote:
>>>>>> On 20/03/17 07:55, Dongjiu Geng wrote:
>>>>>>> In the RAS implementation, hardware pass the virtual SEI
>>>>>>> syndrome information through the VSESR_EL2, so set the virtual
>>>>>>> SEI syndrome using physical SEI syndrome el2_elr to pass to
>>>>>>> the guest OS
> 
> (I've juggled the order of your replies:)
> 
>> so for both SEA and SEI, do you prefer to below steps?
>> EL0/EL1 SEI/SEA ---> EL3 firmware first handle ------> EL2 hypervisor notify >
> the Qemu to inject SEI/SEA------>Qemu call KVM API to inject SEA/SEI---->KVM >
> inject SEA/SEI to guest OS
> 
> Yes, to expand your EL2 hypervisor notify Qemu step:
> 1 The host should call its APEI code to parse the CPER records.
> 2 User space processes are then notified via SIGBUS (or for rasdaemon, trace
>   points).
> 3 Qemu can take the address delivered via SIGBUS and generate CPER records for
>   the guest. It knows how to convert host addresses to guest IPAs, and it knows
>   where in guest memory to write the CPER records.
> 4 Qemu can then notify the guest via whatever mechanism it advertised via the
>   HEST/GHES table. It might not be the same mechanism that the host received
>   the notification through.
> 
> Steps 1 and 2 are the same even if no guest is running, so we don't have to add
> any special case for KVM. This is existing code that x86 uses.
> We can test the Qemu parts without any firmware support and the APEI path in the
> host and guest is the same.
   here do you mean map host APEI table to guest for steps 1 and 2 test? so that the APEI path in the
  host and guest is the same.

> 
> 
>>> Is anyone from Huawei looking at adding RAS support for Qemu?
>>  yes, I am looking at Qemu and want to add RAS support.
> 
> Great, support in Qemu is one of the missing pieces. On x86 it looks like it
> emulates machine-check-exceptions, which is how x86 did this before
> firmware-first and APEI became the standard.
> 
> 
>>  do you mean let Qemu inject both the SEA and SEI?
> 
> To do the notification, yes. It needs to happen after the CPER records have been
> written, and the mechanism and CPER memory location need to match what the guest
> was told via the HEST/GHES table.
> 
> If Qemu didn't tell the guest about firmware-first, it can still deliver the
> guest an SError Interrupt.
> 
> 
> SEA should be possible to do with the KVM_SET_REG API, GPIO/GSIV and the other
> kind of interrupts can use irqfd. For SEI we may need to add an API call to KVM
> to let it pend SError with a specific ESR.
> 
> 
> 
>>> How does this work with firmware first?
> 
>> when the Guest OS triggers an SEI, it will firstly trap to EL3 firmware, El3 firmware records the error
>> info to the APEI table, 
> 
> These are CPER records in a memory area pointed to by one of HEST's GHES entries?
> 
> 
>> then copy the ESR_EL3 ELR_EL3 to ESR_EL2 ELR_EL2 and transfers control to the
>> hypervisor, hypervisor delegates the error exception to EL1 guest
> 
> This is a problem, just because the error occurred while the guest was running
> doesn't mean we should deliver it directly to the guest. Some of these errors
> will be fatal for the CPU and the host should try and power it off to contain
yes, some of error does not need to deliver to guest OS directly. for example if the error is guest kernel fault error,
hypervisor can directly power off the whole guest OS

> the fault. For example: CPER's 'micro-architectural error', should the guest
> power-off the vCPU? All that really does is return to the hypervisor, the error
for this example, I think it is better hypervisor directly close the whole guest OS, instead of
guest power-off the vCPU.

> hasn't been contained.


> 
> Firmware should handle the error first, then the host, finally the guest via Qemu.
> 
> 
>> OS by setting HCR_EL2.VSE to 1 and pass the virtual SEI syndrome through vsesr_el2. 
>> The EL1 guest OS check the DISR_EL1 syndrome information to decide to
>> terminate the application, or do some other recovery action. because the HCR_EL2.AMO is set, so in fact, read
>> DISR_EL1, it returns the VDISR_EL2. and VDISR_EL2 is loaded from VSESR_EL2, so here I pass the virtual SEI
>> syndrome vsesr_el2.
> 
> So this is how an SError Interrupt's ESR gets into a guest. How does it get hold
> of the CPER records?
> 
> 
>>> If we took a Physical SError Interrupt the CPER records are in the hosts memory.
>>> To deliver a RAS event to the guest something needs to generate CPER records and
>>> put them in the guest memory. Only Qemu knows where these memory regions are.
>>>
>>> Put another way, what is the guest expected to do with this SError interrupt?
>>
>> No, we do not only panic,if it is EL0 application SEI. the OS error recovery
>> agent will terminate the EL0 application to isolate the error; If it is EL1 guest
>> OS SError, guest OS can see whether it can recover. if the error was in a read-only file cache buffer, guest OS
>> can invalidate the page and reload the data from disk.
> 
> How do we get an address for memory failure? SError is asynchronous, I don't
> think it sets the FAR. (SEA is synchronous and its not guaranteed to set the
Thank you to point that. sorry, my answer is not right. in fact, I think the FAR and
CPER are both not accurate for the  asynchronous SError. so guest OS can not try to recover.
but it can still know which application create this SError which is deferred by ESB, then guest OS close the APP.
by the way, for the synchronous SEA, do you think which address should be used? FAR or CPER that record come from ERR<n>ADDR?
I see Qualcomm series patches mainly use FAR not CPER record that come from ERR<n>ADDR for SEA.
so for the SEA case, I do not know which address is more accurate for FAR and CPER record

> FAR..). As far as I understand this information is in the CPER records in host
> memory.


> 
> If we did have an address it would be a host address, how is it converted to a
> guest IPA? I think Qemu should do this as part of its CPER record generation,
> once the host has decided the error wasn't catastrophic.
 thanks for your suggestion.

> 
> 
> Thanks,
> 
> James
> 
> 
> .
>

James Morse March 22, 2017, 6:56 p.m. UTC | #16

Hi gengdongjiu

On 22/03/17 13:37, gengdongjiu wrote:
> On 2017/3/21 21:10, James Morse wrote:
>> On 21/03/17 06:32, gengdongjiu wrote:
>>> so for both SEA and SEI, do you prefer to below steps?
>>> EL0/EL1 SEI/SEA ---> EL3 firmware first handle ------> EL2 hypervisor notify >
>> the Qemu to inject SEI/SEA------>Qemu call KVM API to inject SEA/SEI---->KVM >
>> inject SEA/SEI to guest OS
>>
>> Yes, to expand your EL2 hypervisor notify Qemu step:
>> 1 The host should call its APEI code to parse the CPER records.
>> 2 User space processes are then notified via SIGBUS (or for rasdaemon, trace
>>   points).
>> 3 Qemu can take the address delivered via SIGBUS and generate CPER records for
>>   the guest. It knows how to convert host addresses to guest IPAs, and it knows
>>   where in guest memory to write the CPER records.
>> 4 Qemu can then notify the guest via whatever mechanism it advertised via the
>>   HEST/GHES table. It might not be the same mechanism that the host received
>>   the notification through.
>>
>> Steps 1 and 2 are the same even if no guest is running, so we don't have to add
>> any special case for KVM. This is existing code that x86 uses.
>> We can test the Qemu parts without any firmware support and the APEI path in the
>> host and guest is the same.

>    here do you mean map host APEI table to guest for steps 1 and 2 test? so that the APEI path in the
>   host and guest is the same.

No, the hosts ACPI/APEI tables describe host physical addresses, the guest can't
access these.

Instead we can use Linux's hwpoison mechanism to call memory_failure() and if we
pick the address carefully, signal Qemu. From there we can test Qemu's
generation of CPER records and signalling the guest.

When a host and a guest both use APEI the HEST tables will be different because
the memory layout is different, but the path through APEI and the kernel's error
handling code would be the same.

>>>> How does this work with firmware first?
>>
>>> when the Guest OS triggers an SEI, it will firstly trap to EL3 firmware, El3 firmware records the error
>>> info to the APEI table, 
>>
>> These are CPER records in a memory area pointed to by one of HEST's GHES entries?
>>
>>
>>> then copy the ESR_EL3 ELR_EL3 to ESR_EL2 ELR_EL2 and transfers control to the
>>> hypervisor, hypervisor delegates the error exception to EL1 guest
>>
>> This is a problem, just because the error occurred while the guest was running
>> doesn't mean we should deliver it directly to the guest. Some of these errors
>> will be fatal for the CPU and the host should try and power it off to contain

> yes, some of error does not need to deliver to guest OS directly. for example if the error is guest kernel fault error,
> hypervisor can directly power off the whole guest OS

I agree, Qemu should make that decision, depending on the users choice it can
print a helpful error message and exit, or try and restart the guest.

>> the fault. For example: CPER's 'micro-architectural error', should the guest
>> power-off the vCPU? All that really does is return to the hypervisor, the error

> for this example, I think it is better hypervisor directly close the whole guest OS, instead of
> guest power-off the vCPU.

I picked this as an example because its not clear what it means and it probably
affects the host as well as the guest. We need to do the host error containment
first.

>>>> If we took a Physical SError Interrupt the CPER records are in the hosts memory.
>>>> To deliver a RAS event to the guest something needs to generate CPER records and
>>>> put them in the guest memory. Only Qemu knows where these memory regions are.
>>>>
>>>> Put another way, what is the guest expected to do with this SError interrupt?
>>>
>>> No, we do not only panic,if it is EL0 application SEI. the OS error recovery
>>> agent will terminate the EL0 application to isolate the error; If it is EL1 guest
>>> OS SError, guest OS can see whether it can recover. if the error was in a read-only file cache buffer, guest OS
>>> can invalidate the page and reload the data from disk.
>>
>> How do we get an address for memory failure? SError is asynchronous, I don't
>> think it sets the FAR. (SEA is synchronous and its not guaranteed to set the

> Thank you to point that. sorry, my answer is not right. in fact, I think the FAR and
> CPER are both not accurate for the  asynchronous SError. so guest OS can not try to recover.

My point was only that the architecture doesn't tell us the FAR is always set,
so we have to find out from somewhere else, like the host's CPER records.

SError Interrupt is one of the notification types for GHES which may be
triggered by firmware. Firmware should generate the CPER records before
triggering the SEI notification. When the host gets an SEI it can parse the list
of GHES addresses looking for CPER records. The asynchronous delay doesn't
affect the CPER records from firmware.

Qemu can do the same for a guest before it pends an SError.

> but it can still know which application create this SError which is deferred by ESB, then guest OS close the APP.

Once the host has done its error containment, yes. If the SError interrupted a
vcpu and the host signalled Qemu the signal will be delivered before the vcpu is
run again, (if (signal_pending(current))... in kvm_arch_vcpu_ioctl_run()).

If Qemu does its work and decides to pend an SError before running the vcpu
again then it will be as if the guest isn't running under a hypervisor.

> by the way, for the synchronous SEA, do you think which address should be used? FAR or CPER that record come from ERR<n>ADDR?
> I see Qualcomm series patches mainly use FAR not CPER record that come from ERR<n>ADDR for SEA.
> so for the SEA case, I do not know which address is more accurate for FAR and CPER record

It uses both. The FAR may get copied to the signal address if the ESR says the
FAR is valid. This works on systems that don't have APEI SEA.
Systems that do have APEI SEA will parse the CPER records as well to find the
physical address and do the memory_failure() routine to signal all the affected
processes, not just the one that was running.

Thanks,

James

James Morse March 28, 2017, 10:48 a.m. UTC | #17

Hi Christoffer,

(CC: Leif and Achin who know more about how UEFI fits into this picture)

On 21/03/17 19:39, Christoffer Dall wrote:
> On Tue, Mar 21, 2017 at 07:11:44PM +0000, James Morse wrote:
>> On 21/03/17 11:34, Christoffer Dall wrote:
>>> On Tue, Mar 21, 2017 at 02:32:29PM +0800, gengdongjiu wrote:
>>>> On 2017/3/20 23:08, James Morse wrote:
>>>>>>>> On 20/03/17 07:55, Dongjiu Geng wrote:
>>>>>>>>> In the RAS implementation, hardware pass the virtual SEI
>>>>>>>>> syndrome information through the VSESR_EL2, so set the virtual
>>>>>>>>> SEI syndrome using physical SEI syndrome el2_elr to pass to
>>>>>>>>> the guest OS
>>>>>
>>>>> How does this work with firmware first?
>>>>
>>>> I explained it in previous mail about the work flow.
>>>
>>> When delivering and reporting SEIs to the VM, should this happen
>>> directly to the OS running in the VM, or to the guest firmware (e.g.
>>> UEFI) running in the VM as well?
>>
>> 'firmware first' is the ACPI specs name for x86's BIOS or management-mode
>> handling the error. On arm64 we have multiple things called firmware, so the
>> name might be more confusing than helpful.
>>
>> As far as I understand it, firmware here refers to the secure-world and EL3.
>> Something like ATF can use SCR_EL3.EA to claim SErrors and external aborts,
>> routing them to EL3 where secure platform specific firmware generates CPER records.
>> For a guest, Qemu takes the role of this EL3-firmware.
>>
> Thanks for the clarification.  So UEFI in the VM would not be involved
> in this at all?

On the host, part of UEFI is involved to generate the CPER records.
In a guest?, I don't know.
Qemu could generate the records, or drive some other component to do it.

Leif and Achin are the people with the UEFI/bigger picture.


> My confusion here comes from not thinking about QEMU or KVM as firmware,
> but as the machine, so it would be sort of like the functionality is
> baked into hardware rather than firmware.
> 
> Note that to the VM, the environment will look like hardware without EL3
> and without a secure world, so any software assuming there's something
> 'hidden' behind the available non-secure modes must not decide to
> disable features if discovering the lack of a secure world.


Thanks,

James

Christoffer Dall March 28, 2017, 11:23 a.m. UTC | #18

On Tue, Mar 28, 2017 at 11:48:08AM +0100, James Morse wrote:
> Hi Christoffer,
> 
> (CC: Leif and Achin who know more about how UEFI fits into this picture)
> 
> On 21/03/17 19:39, Christoffer Dall wrote:
> > On Tue, Mar 21, 2017 at 07:11:44PM +0000, James Morse wrote:
> >> On 21/03/17 11:34, Christoffer Dall wrote:
> >>> On Tue, Mar 21, 2017 at 02:32:29PM +0800, gengdongjiu wrote:
> >>>> On 2017/3/20 23:08, James Morse wrote:
> >>>>>>>> On 20/03/17 07:55, Dongjiu Geng wrote:
> >>>>>>>>> In the RAS implementation, hardware pass the virtual SEI
> >>>>>>>>> syndrome information through the VSESR_EL2, so set the virtual
> >>>>>>>>> SEI syndrome using physical SEI syndrome el2_elr to pass to
> >>>>>>>>> the guest OS
> >>>>>
> >>>>> How does this work with firmware first?
> >>>>
> >>>> I explained it in previous mail about the work flow.
> >>>
> >>> When delivering and reporting SEIs to the VM, should this happen
> >>> directly to the OS running in the VM, or to the guest firmware (e.g.
> >>> UEFI) running in the VM as well?
> >>
> >> 'firmware first' is the ACPI specs name for x86's BIOS or management-mode
> >> handling the error. On arm64 we have multiple things called firmware, so the
> >> name might be more confusing than helpful.
> >>
> >> As far as I understand it, firmware here refers to the secure-world and EL3.
> >> Something like ATF can use SCR_EL3.EA to claim SErrors and external aborts,
> >> routing them to EL3 where secure platform specific firmware generates CPER records.
> >> For a guest, Qemu takes the role of this EL3-firmware.
> >>
> > Thanks for the clarification.  So UEFI in the VM would not be involved
> > in this at all?
> 
> On the host, part of UEFI is involved to generate the CPER records.
> In a guest?, I don't know.
> Qemu could generate the records, or drive some other component to do it.

I think I am beginning to understand this a bit.  Since the guet UEFI
instance is specifically built for the machine it runs on, QEMU's virt
machine in this case, they could simply agree (by some contract) to
place the records at some specific location in memory, and if the guest
kernel asks its guest UEFI for that location, things should just work by
having logic in QEMU to process error reports and populate guest memory.

Is this how others see the world too?

> 
> Leif and Achin are the people with the UEFI/bigger picture.
> 
> 

Thanks!
-Christoffer

Peter Maydell March 28, 2017, 11:33 a.m. UTC | #19

On 28 March 2017 at 12:23, Christoffer Dall <cdall@linaro.org> wrote:
> On Tue, Mar 28, 2017 at 11:48:08AM +0100, James Morse wrote:
>> On the host, part of UEFI is involved to generate the CPER records.
>> In a guest?, I don't know.
>> Qemu could generate the records, or drive some other component to do it.
>
> I think I am beginning to understand this a bit.  Since the guet UEFI
> instance is specifically built for the machine it runs on, QEMU's virt
> machine in this case, they could simply agree (by some contract) to
> place the records at some specific location in memory, and if the guest
> kernel asks its guest UEFI for that location, things should just work by
> having logic in QEMU to process error reports and populate guest memory.

Is "write direct to guest memory" the best ABI here or would
it be preferable to use the fw_cfg interface for the guest UEFI
to retrieve the data items on demand?

Is there a pre-existing "this is how it works on x86" implementation?

thanks
-- PMM

Achin Gupta March 28, 2017, 11:54 a.m. UTC | #20

On Tue, Mar 28, 2017 at 01:23:28PM +0200, Christoffer Dall wrote:
> On Tue, Mar 28, 2017 at 11:48:08AM +0100, James Morse wrote:
> > Hi Christoffer,
> >
> > (CC: Leif and Achin who know more about how UEFI fits into this picture)
> >
> > On 21/03/17 19:39, Christoffer Dall wrote:
> > > On Tue, Mar 21, 2017 at 07:11:44PM +0000, James Morse wrote:
> > >> On 21/03/17 11:34, Christoffer Dall wrote:
> > >>> On Tue, Mar 21, 2017 at 02:32:29PM +0800, gengdongjiu wrote:
> > >>>> On 2017/3/20 23:08, James Morse wrote:
> > >>>>>>>> On 20/03/17 07:55, Dongjiu Geng wrote:
> > >>>>>>>>> In the RAS implementation, hardware pass the virtual SEI
> > >>>>>>>>> syndrome information through the VSESR_EL2, so set the virtual
> > >>>>>>>>> SEI syndrome using physical SEI syndrome el2_elr to pass to
> > >>>>>>>>> the guest OS
> > >>>>>
> > >>>>> How does this work with firmware first?
> > >>>>
> > >>>> I explained it in previous mail about the work flow.
> > >>>
> > >>> When delivering and reporting SEIs to the VM, should this happen
> > >>> directly to the OS running in the VM, or to the guest firmware (e.g.
> > >>> UEFI) running in the VM as well?
> > >>
> > >> 'firmware first' is the ACPI specs name for x86's BIOS or management-mode
> > >> handling the error. On arm64 we have multiple things called firmware, so the
> > >> name might be more confusing than helpful.
> > >>
> > >> As far as I understand it, firmware here refers to the secure-world and EL3.
> > >> Something like ATF can use SCR_EL3.EA to claim SErrors and external aborts,
> > >> routing them to EL3 where secure platform specific firmware generates CPER records.
> > >> For a guest, Qemu takes the role of this EL3-firmware.

+1

> > >>
> > > Thanks for the clarification.  So UEFI in the VM would not be involved
> > > in this at all?
> >
> > On the host, part of UEFI is involved to generate the CPER records.
> > In a guest?, I don't know.
> > Qemu could generate the records, or drive some other component to do it.
>
> I think I am beginning to understand this a bit.  Since the guet UEFI
> instance is specifically built for the machine it runs on, QEMU's virt
> machine in this case, they could simply agree (by some contract) to
> place the records at some specific location in memory, and if the guest
> kernel asks its guest UEFI for that location, things should just work by
> having logic in QEMU to process error reports and populate guest memory.
>
> Is this how others see the world too?

I think so!

AFAIU, the memory where CPERs will reside should be specified in a GHES entry in
the HEST. Is this not the case with a guest kernel i.e. the guest UEFI creates a
HEST for the guest Kernel?

If so, then the question is how the guest UEFI finds out where QEMU (acting as
EL3 firmware) will populate the CPERs. This could either be a contract between
the two or a guest DXE driver uses the MM_COMMUNICATE call (see [1]) to ask QEMU
where the memory is.

This is the way I expect it to work at the EL3/EL2 boundary. So I am
extrapolating it to the guest/hypervisor boundary. Do shout if I am missing
anything.

hth,
Achin

>
> >
> > Leif and Achin are the people with the UEFI/bigger picture.
> >
> >
>
> Thanks!
> -Christoffer

[1] http://infocenter.arm.com/help/topic/com.arm.doc.den0060a/DEN0060A_ARM_MM_Interface_Specification.pdf

Dongjiu Geng March 28, 2017, 12:16 p.m. UTC | #21

Hi all,

On 2017/3/28 19:54, Achin Gupta wrote:
> On Tue, Mar 28, 2017 at 01:23:28PM +0200, Christoffer Dall wrote:
>> On Tue, Mar 28, 2017 at 11:48:08AM +0100, James Morse wrote:
>>> Hi Christoffer,
>>>
>>> (CC: Leif and Achin who know more about how UEFI fits into this picture)
>>>
>>> On 21/03/17 19:39, Christoffer Dall wrote:
>>>> On Tue, Mar 21, 2017 at 07:11:44PM +0000, James Morse wrote:
>>>>> On 21/03/17 11:34, Christoffer Dall wrote:
>>>>>> On Tue, Mar 21, 2017 at 02:32:29PM +0800, gengdongjiu wrote:
>>>>>>> On 2017/3/20 23:08, James Morse wrote:
>>>>>>>>>>> On 20/03/17 07:55, Dongjiu Geng wrote:
>>>>>>>>>>>> In the RAS implementation, hardware pass the virtual SEI
>>>>>>>>>>>> syndrome information through the VSESR_EL2, so set the virtual
>>>>>>>>>>>> SEI syndrome using physical SEI syndrome el2_elr to pass to
>>>>>>>>>>>> the guest OS
>>>>>>>>
>>>>>>>> How does this work with firmware first?
>>>>>>>
>>>>>>> I explained it in previous mail about the work flow.
>>>>>>
>>>>>> When delivering and reporting SEIs to the VM, should this happen
>>>>>> directly to the OS running in the VM, or to the guest firmware (e.g.
>>>>>> UEFI) running in the VM as well?
>>>>>
>>>>> 'firmware first' is the ACPI specs name for x86's BIOS or management-mode
>>>>> handling the error. On arm64 we have multiple things called firmware, so the
>>>>> name might be more confusing than helpful.
>>>>>
>>>>> As far as I understand it, firmware here refers to the secure-world and EL3.
>>>>> Something like ATF can use SCR_EL3.EA to claim SErrors and external aborts,
>>>>> routing them to EL3 where secure platform specific firmware generates CPER records.
>>>>> For a guest, Qemu takes the role of this EL3-firmware.
> 
> +1
> 
>>>>>
>>>> Thanks for the clarification.  So UEFI in the VM would not be involved
>>>> in this at all?
>>>
>>> On the host, part of UEFI is involved to generate the CPER records.
>>> In a guest?, I don't know.
>>> Qemu could generate the records, or drive some other component to do it.
>>
>> I think I am beginning to understand this a bit.  Since the guet UEFI
>> instance is specifically built for the machine it runs on, QEMU's virt
>> machine in this case, they could simply agree (by some contract) to
>> place the records at some specific location in memory, and if the guest
>> kernel asks its guest UEFI for that location, things should just work by
>> having logic in QEMU to process error reports and populate guest memory.
>>
>> Is this how others see the world too?
> 
> I think so!
> 
> AFAIU, the memory where CPERs will reside should be specified in a GHES entry in
> the HEST. Is this not the case with a guest kernel i.e. the guest UEFI creates a
> HEST for the guest Kernel?
> 
> If so, then the question is how the guest UEFI finds out where QEMU (acting as
> EL3 firmware) will populate the CPERs. This could either be a contract between
> the two or a guest DXE driver uses the MM_COMMUNICATE call (see [1]) to ask QEMU
> where the memory is.

whether invoke the guest UEFI will be complex? not see the advantage. it seems x86 Qemu directly generate the ACPI table, but I am not sure, we are checking the qemu logical.
let Qemu generate CPER record may be clear.

when qemu can take the address delivered via SIGBUS and generate CPER records for
the guest, seems the CPER table is simple because there is only one address passed by SIGBUS.



> 
> This is the way I expect it to work at the EL3/EL2 boundary. So I am
> extrapolating it to the guest/hypervisor boundary. Do shout if I am missing
> anything.
> 
> hth,
> Achin
> 
>>
>>>
>>> Leif and Achin are the people with the UEFI/bigger picture.
>>>
>>>
>>
>> Thanks!
>> -Christoffer
> 
> [1] http://infocenter.arm.com/help/topic/com.arm.doc.den0060a/DEN0060A_ARM_MM_Interface_Specification.pdf
> 
> .
>

Christoffer Dall March 28, 2017, 12:22 p.m. UTC | #22

On Tue, Mar 28, 2017 at 12:54:13PM +0100, Achin Gupta wrote:
> On Tue, Mar 28, 2017 at 01:23:28PM +0200, Christoffer Dall wrote:
> > On Tue, Mar 28, 2017 at 11:48:08AM +0100, James Morse wrote:
> > > Hi Christoffer,
> > >
> > > (CC: Leif and Achin who know more about how UEFI fits into this picture)
> > >
> > > On 21/03/17 19:39, Christoffer Dall wrote:
> > > > On Tue, Mar 21, 2017 at 07:11:44PM +0000, James Morse wrote:
> > > >> On 21/03/17 11:34, Christoffer Dall wrote:
> > > >>> On Tue, Mar 21, 2017 at 02:32:29PM +0800, gengdongjiu wrote:
> > > >>>> On 2017/3/20 23:08, James Morse wrote:
> > > >>>>>>>> On 20/03/17 07:55, Dongjiu Geng wrote:
> > > >>>>>>>>> In the RAS implementation, hardware pass the virtual SEI
> > > >>>>>>>>> syndrome information through the VSESR_EL2, so set the virtual
> > > >>>>>>>>> SEI syndrome using physical SEI syndrome el2_elr to pass to
> > > >>>>>>>>> the guest OS
> > > >>>>>
> > > >>>>> How does this work with firmware first?
> > > >>>>
> > > >>>> I explained it in previous mail about the work flow.
> > > >>>
> > > >>> When delivering and reporting SEIs to the VM, should this happen
> > > >>> directly to the OS running in the VM, or to the guest firmware (e.g.
> > > >>> UEFI) running in the VM as well?
> > > >>
> > > >> 'firmware first' is the ACPI specs name for x86's BIOS or management-mode
> > > >> handling the error. On arm64 we have multiple things called firmware, so the
> > > >> name might be more confusing than helpful.
> > > >>
> > > >> As far as I understand it, firmware here refers to the secure-world and EL3.
> > > >> Something like ATF can use SCR_EL3.EA to claim SErrors and external aborts,
> > > >> routing them to EL3 where secure platform specific firmware generates CPER records.
> > > >> For a guest, Qemu takes the role of this EL3-firmware.
> 
> +1
> 
> > > >>
> > > > Thanks for the clarification.  So UEFI in the VM would not be involved
> > > > in this at all?
> > >
> > > On the host, part of UEFI is involved to generate the CPER records.
> > > In a guest?, I don't know.
> > > Qemu could generate the records, or drive some other component to do it.
> >
> > I think I am beginning to understand this a bit.  Since the guet UEFI
> > instance is specifically built for the machine it runs on, QEMU's virt
> > machine in this case, they could simply agree (by some contract) to
> > place the records at some specific location in memory, and if the guest
> > kernel asks its guest UEFI for that location, things should just work by
> > having logic in QEMU to process error reports and populate guest memory.
> >
> > Is this how others see the world too?
> 
> I think so!
> 
> AFAIU, the memory where CPERs will reside should be specified in a GHES entry in
> the HEST. Is this not the case with a guest kernel i.e. the guest UEFI creates a
> HEST for the guest Kernel?
> 
> If so, then the question is how the guest UEFI finds out where QEMU (acting as
> EL3 firmware) will populate the CPERs. This could either be a contract between
> the two or a guest DXE driver uses the MM_COMMUNICATE call (see [1]) to ask QEMU
> where the memory is.
> 
> This is the way I expect it to work at the EL3/EL2 boundary. So I am
> extrapolating it to the guest/hypervisor boundary. Do shout if I am missing
> anything.

No that sounds like a resonable comparison.

I'm not entirely sure what a HEST or GHES is, but I think the only place
where I'm still not clear is if when the guest kernel is notified of
errors does it (a) just traverse memory by following some pointers
(which it may have pre-loaded at boot from UEFI), or (b) run UEFI code
which can call into QEMU and generate error records on demand?

Thanks,
-Christoffer

Achin Gupta March 28, 2017, 1:24 p.m. UTC | #23

On Tue, Mar 28, 2017 at 02:22:29PM +0200, Christoffer Dall wrote:
> On Tue, Mar 28, 2017 at 12:54:13PM +0100, Achin Gupta wrote:
> > On Tue, Mar 28, 2017 at 01:23:28PM +0200, Christoffer Dall wrote:
> > > On Tue, Mar 28, 2017 at 11:48:08AM +0100, James Morse wrote:
> > > > Hi Christoffer,
> > > >
> > > > (CC: Leif and Achin who know more about how UEFI fits into this picture)
> > > >
> > > > On 21/03/17 19:39, Christoffer Dall wrote:
> > > > > On Tue, Mar 21, 2017 at 07:11:44PM +0000, James Morse wrote:
> > > > >> On 21/03/17 11:34, Christoffer Dall wrote:
> > > > >>> On Tue, Mar 21, 2017 at 02:32:29PM +0800, gengdongjiu wrote:
> > > > >>>> On 2017/3/20 23:08, James Morse wrote:
> > > > >>>>>>>> On 20/03/17 07:55, Dongjiu Geng wrote:
> > > > >>>>>>>>> In the RAS implementation, hardware pass the virtual SEI
> > > > >>>>>>>>> syndrome information through the VSESR_EL2, so set the virtual
> > > > >>>>>>>>> SEI syndrome using physical SEI syndrome el2_elr to pass to
> > > > >>>>>>>>> the guest OS
> > > > >>>>>
> > > > >>>>> How does this work with firmware first?
> > > > >>>>
> > > > >>>> I explained it in previous mail about the work flow.
> > > > >>>
> > > > >>> When delivering and reporting SEIs to the VM, should this happen
> > > > >>> directly to the OS running in the VM, or to the guest firmware (e.g.
> > > > >>> UEFI) running in the VM as well?
> > > > >>
> > > > >> 'firmware first' is the ACPI specs name for x86's BIOS or management-mode
> > > > >> handling the error. On arm64 we have multiple things called firmware, so the
> > > > >> name might be more confusing than helpful.
> > > > >>
> > > > >> As far as I understand it, firmware here refers to the secure-world and EL3.
> > > > >> Something like ATF can use SCR_EL3.EA to claim SErrors and external aborts,
> > > > >> routing them to EL3 where secure platform specific firmware generates CPER records.
> > > > >> For a guest, Qemu takes the role of this EL3-firmware.
> >
> > +1
> >
> > > > >>
> > > > > Thanks for the clarification.  So UEFI in the VM would not be involved
> > > > > in this at all?
> > > >
> > > > On the host, part of UEFI is involved to generate the CPER records.
> > > > In a guest?, I don't know.
> > > > Qemu could generate the records, or drive some other component to do it.
> > >
> > > I think I am beginning to understand this a bit.  Since the guet UEFI
> > > instance is specifically built for the machine it runs on, QEMU's virt
> > > machine in this case, they could simply agree (by some contract) to
> > > place the records at some specific location in memory, and if the guest
> > > kernel asks its guest UEFI for that location, things should just work by
> > > having logic in QEMU to process error reports and populate guest memory.
> > >
> > > Is this how others see the world too?
> >
> > I think so!
> >
> > AFAIU, the memory where CPERs will reside should be specified in a GHES entry in
> > the HEST. Is this not the case with a guest kernel i.e. the guest UEFI creates a
> > HEST for the guest Kernel?
> >
> > If so, then the question is how the guest UEFI finds out where QEMU (acting as
> > EL3 firmware) will populate the CPERs. This could either be a contract between
> > the two or a guest DXE driver uses the MM_COMMUNICATE call (see [1]) to ask QEMU
> > where the memory is.
> >
> > This is the way I expect it to work at the EL3/EL2 boundary. So I am
> > extrapolating it to the guest/hypervisor boundary. Do shout if I am missing
> > anything.
>
> No that sounds like a resonable comparison.
>
> I'm not entirely sure what a HEST or GHES is, but I think the only place
> where I'm still not clear is if when the guest kernel is notified of
> errors does it (a) just traverse memory by following some pointers
> (which it may have pre-loaded at boot from UEFI), or (b) run UEFI code
> which can call into QEMU and generate error records on demand?

So HEST is the ACPI Harware Error Source Table. It has entries in it for Generic
HW Error Sources (GHES) amongst other types of error sources (x86 MCE etc). Each
Error source specifies an address where the address of the CPER created by
firmware will be populated. OS upon receipt of an error reads the CPERs to find
the error source. It uses the addresses specified in the GHES entries of the
HEST. This is closer to (a) above. HEST has the pointers preloaded at boot by
UEFI.

hth,
Achin

>
> Thanks,
> -Christoffer

James Morse March 28, 2017, 1:27 p.m. UTC | #24

Hi Peter,

On 28/03/17 12:33, Peter Maydell wrote:
> On 28 March 2017 at 12:23, Christoffer Dall <cdall@linaro.org> wrote:
>> On Tue, Mar 28, 2017 at 11:48:08AM +0100, James Morse wrote:
>>> On the host, part of UEFI is involved to generate the CPER records.
>>> In a guest?, I don't know.
>>> Qemu could generate the records, or drive some other component to do it.
>>
>> I think I am beginning to understand this a bit.  Since the guet UEFI
>> instance is specifically built for the machine it runs on, QEMU's virt
>> machine in this case, they could simply agree (by some contract) to
>> place the records at some specific location in memory, and if the guest
>> kernel asks its guest UEFI for that location, things should just work by
>> having logic in QEMU to process error reports and populate guest memory.
> 
> Is "write direct to guest memory" the best ABI here or would
> it be preferable to use the fw_cfg interface for the guest UEFI
> to retrieve the data items on demand?

As far as I understand the interaction between Qemu and UEFI isn't defined. The
eventual aim is to emulate ACPI's firmware first error handling for a guest.
This way the RAS behaviour for a host and the guest is the same.

The ABI is the guest OS gets a 'notification' (there is a list in acpi: 18.3.2.9
Hardware Error Notification), and then finds a pointer to the CPER records
(defined in the UEFI Spec Appendix N) at the address advertised by one of the
Generic Hardware Error Source (GHES) entries of the ACPI Hardware Error Source
Table (HEST).

How Qemu and UEFI conspire to make this happen is up for discussion.
My suggestion would be to try and closely mirror whatever happens on a physical
system so that UEFI only needs to test one set of code.

> Is there a pre-existing "this is how it works on x86" implementation?

I found [0], which on page 16 talks about Qemu injecting a Pseudo 'Software
Recoverable Action Required', which I assume is a flavour of NMI.

The ACPI/CPER stuff is arch agnostic and given Qemu is only ever likely to have
to handle memory errors it should be possible to use the same code for both x86
and arm64.

Thanks,

James

[0] https://events.linuxfoundation.org/images/stories/slides/lfcs2013_tanino.pdf

James Morse March 28, 2017, 1:40 p.m. UTC | #25

Hi gengdongjiu,

On 28/03/17 13:16, gengdongjiu wrote:
> On 2017/3/28 19:54, Achin Gupta wrote:
>> On Tue, Mar 28, 2017 at 01:23:28PM +0200, Christoffer Dall wrote:
>>> On Tue, Mar 28, 2017 at 11:48:08AM +0100, James Morse wrote:
>>>> On the host, part of UEFI is involved to generate the CPER records.
>>>> In a guest?, I don't know.
>>>> Qemu could generate the records, or drive some other component to do it.
>>>
>>> I think I am beginning to understand this a bit.  Since the guet UEFI
>>> instance is specifically built for the machine it runs on, QEMU's virt
>>> machine in this case, they could simply agree (by some contract) to
>>> place the records at some specific location in memory, and if the guest
>>> kernel asks its guest UEFI for that location, things should just work by
>>> having logic in QEMU to process error reports and populate guest memory.
>>>
>>> Is this how others see the world too?
>>
>> I think so!
>>
>> AFAIU, the memory where CPERs will reside should be specified in a GHES entry in
>> the HEST. Is this not the case with a guest kernel i.e. the guest UEFI creates a
>> HEST for the guest Kernel?
>>
>> If so, then the question is how the guest UEFI finds out where QEMU (acting as
>> EL3 firmware) will populate the CPERs. This could either be a contract between
>> the two or a guest DXE driver uses the MM_COMMUNICATE call (see [1]) to ask QEMU
>> where the memory is.
> 
> whether invoke the guest UEFI will be complex? not see the advantage. it seems x86 Qemu
> directly generate the ACPI table, but I am not sure, we are checking the qemu
logical.
> let Qemu generate CPER record may be clear.

At boot UEFI in the guest will need to make sure the areas of memory that may be
used for CPER records are reserved. Whether UEFI or Qemu decides where these are
needs deciding, (but probably not here)...

At runtime, when an error has occurred, I agree it would be simpler (fewer
components involved) if Qemu generates the CPER records. But if UEFI made the
memory choice above they need to interact and it gets complicated again. The
CPER records are defined in the UEFI spec, so I would expect UEFI to contain
code to generate/parse them.


Thanks,

James

Christoffer Dall March 28, 2017, 1:40 p.m. UTC | #26

On Tue, Mar 28, 2017 at 02:24:55PM +0100, Achin Gupta wrote:
> On Tue, Mar 28, 2017 at 02:22:29PM +0200, Christoffer Dall wrote:
> > On Tue, Mar 28, 2017 at 12:54:13PM +0100, Achin Gupta wrote:
> > > On Tue, Mar 28, 2017 at 01:23:28PM +0200, Christoffer Dall wrote:
> > > > On Tue, Mar 28, 2017 at 11:48:08AM +0100, James Morse wrote:
> > > > > Hi Christoffer,
> > > > >
> > > > > (CC: Leif and Achin who know more about how UEFI fits into this picture)
> > > > >
> > > > > On 21/03/17 19:39, Christoffer Dall wrote:
> > > > > > On Tue, Mar 21, 2017 at 07:11:44PM +0000, James Morse wrote:
> > > > > >> On 21/03/17 11:34, Christoffer Dall wrote:
> > > > > >>> On Tue, Mar 21, 2017 at 02:32:29PM +0800, gengdongjiu wrote:
> > > > > >>>> On 2017/3/20 23:08, James Morse wrote:
> > > > > >>>>>>>> On 20/03/17 07:55, Dongjiu Geng wrote:
> > > > > >>>>>>>>> In the RAS implementation, hardware pass the virtual SEI
> > > > > >>>>>>>>> syndrome information through the VSESR_EL2, so set the virtual
> > > > > >>>>>>>>> SEI syndrome using physical SEI syndrome el2_elr to pass to
> > > > > >>>>>>>>> the guest OS
> > > > > >>>>>
> > > > > >>>>> How does this work with firmware first?
> > > > > >>>>
> > > > > >>>> I explained it in previous mail about the work flow.
> > > > > >>>
> > > > > >>> When delivering and reporting SEIs to the VM, should this happen
> > > > > >>> directly to the OS running in the VM, or to the guest firmware (e.g.
> > > > > >>> UEFI) running in the VM as well?
> > > > > >>
> > > > > >> 'firmware first' is the ACPI specs name for x86's BIOS or management-mode
> > > > > >> handling the error. On arm64 we have multiple things called firmware, so the
> > > > > >> name might be more confusing than helpful.
> > > > > >>
> > > > > >> As far as I understand it, firmware here refers to the secure-world and EL3.
> > > > > >> Something like ATF can use SCR_EL3.EA to claim SErrors and external aborts,
> > > > > >> routing them to EL3 where secure platform specific firmware generates CPER records.
> > > > > >> For a guest, Qemu takes the role of this EL3-firmware.
> > >
> > > +1
> > >
> > > > > >>
> > > > > > Thanks for the clarification.  So UEFI in the VM would not be involved
> > > > > > in this at all?
> > > > >
> > > > > On the host, part of UEFI is involved to generate the CPER records.
> > > > > In a guest?, I don't know.
> > > > > Qemu could generate the records, or drive some other component to do it.
> > > >
> > > > I think I am beginning to understand this a bit.  Since the guet UEFI
> > > > instance is specifically built for the machine it runs on, QEMU's virt
> > > > machine in this case, they could simply agree (by some contract) to
> > > > place the records at some specific location in memory, and if the guest
> > > > kernel asks its guest UEFI for that location, things should just work by
> > > > having logic in QEMU to process error reports and populate guest memory.
> > > >
> > > > Is this how others see the world too?
> > >
> > > I think so!
> > >
> > > AFAIU, the memory where CPERs will reside should be specified in a GHES entry in
> > > the HEST. Is this not the case with a guest kernel i.e. the guest UEFI creates a
> > > HEST for the guest Kernel?
> > >
> > > If so, then the question is how the guest UEFI finds out where QEMU (acting as
> > > EL3 firmware) will populate the CPERs. This could either be a contract between
> > > the two or a guest DXE driver uses the MM_COMMUNICATE call (see [1]) to ask QEMU
> > > where the memory is.
> > >
> > > This is the way I expect it to work at the EL3/EL2 boundary. So I am
> > > extrapolating it to the guest/hypervisor boundary. Do shout if I am missing
> > > anything.
> >
> > No that sounds like a resonable comparison.
> >
> > I'm not entirely sure what a HEST or GHES is, but I think the only place
> > where I'm still not clear is if when the guest kernel is notified of
> > errors does it (a) just traverse memory by following some pointers
> > (which it may have pre-loaded at boot from UEFI), or (b) run UEFI code
> > which can call into QEMU and generate error records on demand?
> 
> So HEST is the ACPI Harware Error Source Table. It has entries in it for Generic
> HW Error Sources (GHES) amongst other types of error sources (x86 MCE etc). Each
> Error source specifies an address where the address of the CPER created by
> firmware will be populated. OS upon receipt of an error reads the CPERs to find
> the error source. It uses the addresses specified in the GHES entries of the
> HEST. This is closer to (a) above. HEST has the pointers preloaded at boot by
> UEFI.
> 
Thanks for the explanation.  Sounds to me like QEMU, through whatever
abstractions and proper methods they have to do that, must populate
memory more or less directly.

I guess this is up to whoever will actually implement support for this
to figure out.

-Christoffer

Dongjiu Geng March 29, 2017, 9:36 a.m. UTC | #27

Hi Laszlo/Biesheuvel/Qemu developer,

   Now I encounter a issue and want to consult with you in ARM64 platform， as described below:

   when guest OS happen synchronous or asynchronous abort, kvm needs to send the error address to Qemu or UEFI through sigbus to dynamically generate APEI table. from my investigation, there are two ways:

   (1) Qemu get the error address, and generate the APEI table, then notify UEFI to know this generation, then inject abort error to guest OS, guest OS read the APEI table.
   (2) Qemu get the error address, and let UEFI to generate the APEI table, then inject abort error to guest OS, guest OS read the APEI table.


   Do you think which modules generates the APEI table is better? UEFI or Qemu?




On 2017/3/28 21:40, James Morse wrote:
> Hi gengdongjiu,
> 
> On 28/03/17 13:16, gengdongjiu wrote:
>> On 2017/3/28 19:54, Achin Gupta wrote:
>>> On Tue, Mar 28, 2017 at 01:23:28PM +0200, Christoffer Dall wrote:
>>>> On Tue, Mar 28, 2017 at 11:48:08AM +0100, James Morse wrote:
>>>>> On the host, part of UEFI is involved to generate the CPER records.
>>>>> In a guest?, I don't know.
>>>>> Qemu could generate the records, or drive some other component to do it.
>>>>
>>>> I think I am beginning to understand this a bit.  Since the guet UEFI
>>>> instance is specifically built for the machine it runs on, QEMU's virt
>>>> machine in this case, they could simply agree (by some contract) to
>>>> place the records at some specific location in memory, and if the guest
>>>> kernel asks its guest UEFI for that location, things should just work by
>>>> having logic in QEMU to process error reports and populate guest memory.
>>>>
>>>> Is this how others see the world too?
>>>
>>> I think so!
>>>
>>> AFAIU, the memory where CPERs will reside should be specified in a GHES entry in
>>> the HEST. Is this not the case with a guest kernel i.e. the guest UEFI creates a
>>> HEST for the guest Kernel?
>>>
>>> If so, then the question is how the guest UEFI finds out where QEMU (acting as
>>> EL3 firmware) will populate the CPERs. This could either be a contract between
>>> the two or a guest DXE driver uses the MM_COMMUNICATE call (see [1]) to ask QEMU
>>> where the memory is.
>>
>> whether invoke the guest UEFI will be complex? not see the advantage. it seems x86 Qemu
>> directly generate the ACPI table, but I am not sure, we are checking the qemu
> logical.
>> let Qemu generate CPER record may be clear.
> 
> At boot UEFI in the guest will need to make sure the areas of memory that may be
> used for CPER records are reserved. Whether UEFI or Qemu decides where these are
> needs deciding, (but probably not here)...
> 
> At runtime, when an error has occurred, I agree it would be simpler (fewer
> components involved) if Qemu generates the CPER records. But if UEFI made the
> memory choice above they need to interact and it gets complicated again. The
> CPER records are defined in the UEFI spec, so I would expect UEFI to contain
> code to generate/parse them.
> 
> 
> Thanks,
> 
> James
> 
> 
> .
>

Achin Gupta March 29, 2017, 10:36 a.m. UTC | #28

Hi gengdongjiu,

On Wed, Mar 29, 2017 at 05:36:37PM +0800, gengdongjiu wrote:
>
> Hi Laszlo/Biesheuvel/Qemu developer,
>
>    Now I encounter a issue and want to consult with you in ARM64 platform， as described below:
>
>    when guest OS happen synchronous or asynchronous abort, kvm needs to send the error address to Qemu or UEFI through sigbus to dynamically generate APEI table. from my investigation, there are two ways:
>
>    (1) Qemu get the error address, and generate the APEI table, then notify UEFI to know this generation, then inject abort error to guest OS, guest OS read the APEI table.
>    (2) Qemu get the error address, and let UEFI to generate the APEI table, then inject abort error to guest OS, guest OS read the APEI table.

Just being pedantic! I don't think we are talking about creating the APEI table
dynamically here. The issue is: Once KVM has received an error that is destined
for a guest it will raise a SIGBUS to Qemu. Now before Qemu can inject the error
into the guest OS, a CPER (Common Platform Error Record) has to be generated
corresponding to the error source (GHES corresponding to memory subsystem,
processor etc) to allow the guest OS to do anything meaningful with the
error. So who should create the CPER is the question.

At the EL3/EL2 interface (Secure Firmware and OS/Hypervisor), an error arrives
at EL3 and secure firmware (at EL3 or a lower secure exception level) is
responsible for creating the CPER. ARM is experimenting with using a Standalone
MM EDK2 image in the secure world to do the CPER creation. This will avoid
adding the same code in ARM TF in EL3 (better for security). The error will then
be injected into the OS/Hypervisor (through SEA/SEI/SDEI) through ARM Trusted
Firmware.

Qemu is essentially fulfilling the role of secure firmware at the EL2/EL1
interface (as discussed with Christoffer below). So it should generate the CPER
before injecting the error.

This is corresponds to (1) above apart from notifying UEFI (I am assuming you
mean guest UEFI). At this time, the guest OS already knows where to pick up the
CPER from through the HEST. Qemu has to create the CPER and populate its address
at the address exported in the HEST. Guest UEFI should not be involved in this
flow. Its job was to create the HEST at boot and that has been done by this
stage.

Qemu folk will be able to add but it looks like support for CPER generation will
need to be added to Qemu. We need to resolve this.

Do shout if I am missing anything above.

cheers,
Achin

>
>
>    Do you think which modules generates the APEI table is better? UEFI or Qemu?
>
>
>
>
> On 2017/3/28 21:40, James Morse wrote:
> > Hi gengdongjiu,
> >
> > On 28/03/17 13:16, gengdongjiu wrote:
> >> On 2017/3/28 19:54, Achin Gupta wrote:
> >>> On Tue, Mar 28, 2017 at 01:23:28PM +0200, Christoffer Dall wrote:
> >>>> On Tue, Mar 28, 2017 at 11:48:08AM +0100, James Morse wrote:
> >>>>> On the host, part of UEFI is involved to generate the CPER records.
> >>>>> In a guest?, I don't know.
> >>>>> Qemu could generate the records, or drive some other component to do it.
> >>>>
> >>>> I think I am beginning to understand this a bit.  Since the guet UEFI
> >>>> instance is specifically built for the machine it runs on, QEMU's virt
> >>>> machine in this case, they could simply agree (by some contract) to
> >>>> place the records at some specific location in memory, and if the guest
> >>>> kernel asks its guest UEFI for that location, things should just work by
> >>>> having logic in QEMU to process error reports and populate guest memory.
> >>>>
> >>>> Is this how others see the world too?
> >>>
> >>> I think so!
> >>>
> >>> AFAIU, the memory where CPERs will reside should be specified in a GHES entry in
> >>> the HEST. Is this not the case with a guest kernel i.e. the guest UEFI creates a
> >>> HEST for the guest Kernel?
> >>>
> >>> If so, then the question is how the guest UEFI finds out where QEMU (acting as
> >>> EL3 firmware) will populate the CPERs. This could either be a contract between
> >>> the two or a guest DXE driver uses the MM_COMMUNICATE call (see [1]) to ask QEMU
> >>> where the memory is.
> >>
> >> whether invoke the guest UEFI will be complex? not see the advantage. it seems x86 Qemu
> >> directly generate the ACPI table, but I am not sure, we are checking the qemu
> > logical.
> >> let Qemu generate CPER record may be clear.
> >
> > At boot UEFI in the guest will need to make sure the areas of memory that may be
> > used for CPER records are reserved. Whether UEFI or Qemu decides where these are
> > needs deciding, (but probably not here)...
> >
> > At runtime, when an error has occurred, I agree it would be simpler (fewer
> > components involved) if Qemu generates the CPER records. But if UEFI made the
> > memory choice above they need to interact and it gets complicated again. The
> > CPER records are defined in the UEFI spec, so I would expect UEFI to contain
> > code to generate/parse them.
> >
> >
> > Thanks,
> >
> > James
> >
> >
> > .
> >
>

Laszlo Ersek March 29, 2017, 11:58 a.m. UTC | #29

(This ought to be one of the longest address lists I've ever seen :)
Thanks for the CC. I'm glad Shannon is already on the CC list. For good
measure, I'm adding MST and Igor.)

On 03/29/17 12:36, Achin Gupta wrote:
> Hi gengdongjiu,
> 
> On Wed, Mar 29, 2017 at 05:36:37PM +0800, gengdongjiu wrote:
>>
>> Hi Laszlo/Biesheuvel/Qemu developer,
>>
>>    Now I encounter a issue and want to consult with you in ARM64 platform， as described below:
>>
>> when guest OS happen synchronous or asynchronous abort, kvm needs
>> to send the error address to Qemu or UEFI through sigbus to
>> dynamically generate APEI table. from my investigation, there are
>> two ways:
>>
>> (1) Qemu get the error address, and generate the APEI table, then
>> notify UEFI to know this generation, then inject abort error to
>> guest OS, guest OS read the APEI table.
>> (2) Qemu get the error address, and let UEFI to generate the APEI
>> table, then inject abort error to guest OS, guest OS read the APEI
>> table.
> 
> Just being pedantic! I don't think we are talking about creating the APEI table
> dynamically here. The issue is: Once KVM has received an error that is destined
> for a guest it will raise a SIGBUS to Qemu. Now before Qemu can inject the error
> into the guest OS, a CPER (Common Platform Error Record) has to be generated
> corresponding to the error source (GHES corresponding to memory subsystem,
> processor etc) to allow the guest OS to do anything meaningful with the
> error. So who should create the CPER is the question.
> 
> At the EL3/EL2 interface (Secure Firmware and OS/Hypervisor), an error arrives
> at EL3 and secure firmware (at EL3 or a lower secure exception level) is
> responsible for creating the CPER. ARM is experimenting with using a Standalone
> MM EDK2 image in the secure world to do the CPER creation. This will avoid
> adding the same code in ARM TF in EL3 (better for security). The error will then
> be injected into the OS/Hypervisor (through SEA/SEI/SDEI) through ARM Trusted
> Firmware.
> 
> Qemu is essentially fulfilling the role of secure firmware at the EL2/EL1
> interface (as discussed with Christoffer below). So it should generate the CPER
> before injecting the error.
> 
> This is corresponds to (1) above apart from notifying UEFI (I am assuming you
> mean guest UEFI). At this time, the guest OS already knows where to pick up the
> CPER from through the HEST. Qemu has to create the CPER and populate its address
> at the address exported in the HEST. Guest UEFI should not be involved in this
> flow. Its job was to create the HEST at boot and that has been done by this
> stage.
> 
> Qemu folk will be able to add but it looks like support for CPER generation will
> need to be added to Qemu. We need to resolve this.
> 
> Do shout if I am missing anything above.

After reading this email, the use case looks *very* similar to what
we've just done with VMGENID for QEMU 2.9.

We have a facility between QEMU and the guest firmware, called "ACPI
linker/loader", with which QEMU instructs the firmware to

- allocate and download blobs into guest RAM (AcpiNVS type memory) --
ALLOCATE command,

- relocate pointers in those blobs, to fields in other (or the same)
blobs -- ADD_POINTER command,

- set ACPI table checksums -- ADD_CHECKSUM command,

- and send GPAs of fields within such blobs back to QEMU --
WRITE_POINTER command.

This is how I imagine we can map the facility to the current use case
(note that this is the first time I read about HEST / GHES / CPER):

    etc/acpi/tables                 etc/hardware_errors
    ================     ==========================================
                         +-----------+
    +--------------+     | address   |         +-> +--------------+
    |    HEST      +     | registers |         |   | Error Status |
    + +------------+     | +---------+         |   | Data Block 1 |
    | | GHES       | --> | | address | --------+   | +------------+
    | | GHES       | --> | | address | ------+     | |  CPER      |
    | | GHES       | --> | | address | ----+ |     | |  CPER      |
    | | GHES       | --> | | address | -+  | |     | |  CPER      |
    +-+------------+     +-+---------+  |  | |     +-+------------+
                                        |  | |
                                        |  | +---> +--------------+
                                        |  |       | Error Status |
                                        |  |       | Data Block 2 |
                                        |  |       | +------------+
                                        |  |       | |  CPER      |
                                        |  |       | |  CPER      |
                                        |  |       +-+------------+
                                        |  |
                                        |  +-----> +--------------+
                                        |          | Error Status |
                                        |          | Data Block 3 |
                                        |          | +------------+
                                        |          | |  CPER      |
                                        |          +-+------------+
                                        |
                                        +--------> +--------------+
                                                   | Error Status |
                                                   | Data Block 4 |
                                                   | +------------+
                                                   | |  CPER      |
                                                   | |  CPER      |
                                                   | |  CPER      |
                                                   +-+------------+

(1) QEMU generates the HEST ACPI table. This table goes in the current
"etc/acpi/tables" fw_cfg blob. Given N error sources, there will be N
GHES objects in the HEST.

(2) We introduce a new fw_cfg blob called "etc/hardware_errors". QEMU
also populates this blob.

(2a) Given N error sources, the (unnamed) table of address registers
will contain N address registers.

(2b) Given N error sources, the "etc/hardwre_errors" fw_cfg blob will
also contain N Error Status Data Blocks.

I don't know about the sizing (number of CPERs) each Error Status Data
Block has to contain, but I understand it is all pre-allocated as far as
the OS is concerned, which matches our capabilities well.

(3) QEMU generates the ACPI linker/loader script for the firmware, as
always.

(3a) The HEST table is part of "etc/acpi/tables", which the firmware
already allocates memory for, and downloads (because QEMU already
generates an ALLOCATE linker/loader command for it already).

(3b) QEMU will have to create another ALLOCATE command for the
"etc/hardware_errors" blob. The firmware allocates memory for this blob,
and downloads it.

(4) QEMU generates, in the ACPI linker/loader script for the firwmare, N
ADD_POINTER commands, which point the GHES."Error Status
Address" fields in the HEST table, to the corresponding address
registers in the downloaded "etc/hardware_errors" blob.

(5) QEMU generates an ADD_CHECKSUM command for the firmware, so that the
HEST table is correctly checksummed after executing the N ADD_POINTER
commands from (4).

(6) QEMU generates N ADD_POINTER commands for the firmware, pointing the
address registers (located in guest memory, in the downloaded
"etc/hardware_errors" blob) to the respective Error Status Data Blocks.

(7) (This is the trick.) For this step, we need a third, write-only
fw_cfg blob, called "etc/hardware_errors_addr". Through that blob, the
firmware can send back the guest-side allocation addresses to QEMU.

Namely, the "etc/hardware_errors_addr" blob contains N 8-byte entries.
QEMU generates N WRITE_POINTER commands for the firmware.

For error source K (0 <= K < N), QEMU instructs the firmware to
calculate the guest address of Error Status Data Block K, from the
QEMU-dictated offset within "etc/hardware_errors", and from the
guest-determined allocation base address for "etc/hardware_errors". The
firmware then writes the calculated address back to fw_cfg file
"etc/hardware_errors_addr", at offset K*8, according to the
WRITE_POINTER command.

This way QEMU will know the GPA of each Error Status Data Block.

(In fact this can be simplified to a single WRITE_POINTER command: the
address of the "address register table" can be sent back to QEMU as
well, which already contains all Error Status Data Block addresses.)

(8) When QEMU gets SIGBUS from the kernel -- I hope that's going to come
through a signalfd -- QEMU can format the CPER right into guest memory,
and then inject whatever interrupt (or assert whatever GPIO line) is
necessary for notifying the guest.

(9) This notification (in virtual hardware) can either be handled by the
guest kernel stand-alone, or else the guest kernel can invoke an ACPI
event handler method with it (which would be in the DSDT or one of the
SSDTs, also generated by QEMU). The ACPI event handler method could
invoke the specific guest kernel driver for errror handling via a
Notify() operation.

I'm attracted to the above design because:
- it would leave the firmware alone after OS boot, and
- it would leave the firmware blissfully ignorant about HEST, GHES,
CPER, and the like. (That's why QEMU's ACPI linker/loader was invented
in the first place.)

Thanks
Laszlo

>>    Do you think which modules generates the APEI table is better? UEFI or Qemu?
>>
>>
>>
>>
>> On 2017/3/28 21:40, James Morse wrote:
>>> Hi gengdongjiu,
>>>
>>> On 28/03/17 13:16, gengdongjiu wrote:
>>>> On 2017/3/28 19:54, Achin Gupta wrote:
>>>>> On Tue, Mar 28, 2017 at 01:23:28PM +0200, Christoffer Dall wrote:
>>>>>> On Tue, Mar 28, 2017 at 11:48:08AM +0100, James Morse wrote:
>>>>>>> On the host, part of UEFI is involved to generate the CPER records.
>>>>>>> In a guest?, I don't know.
>>>>>>> Qemu could generate the records, or drive some other component to do it.
>>>>>>
>>>>>> I think I am beginning to understand this a bit.  Since the guet UEFI
>>>>>> instance is specifically built for the machine it runs on, QEMU's virt
>>>>>> machine in this case, they could simply agree (by some contract) to
>>>>>> place the records at some specific location in memory, and if the guest
>>>>>> kernel asks its guest UEFI for that location, things should just work by
>>>>>> having logic in QEMU to process error reports and populate guest memory.
>>>>>>
>>>>>> Is this how others see the world too?
>>>>>
>>>>> I think so!
>>>>>
>>>>> AFAIU, the memory where CPERs will reside should be specified in a GHES entry in
>>>>> the HEST. Is this not the case with a guest kernel i.e. the guest UEFI creates a
>>>>> HEST for the guest Kernel?
>>>>>
>>>>> If so, then the question is how the guest UEFI finds out where QEMU (acting as
>>>>> EL3 firmware) will populate the CPERs. This could either be a contract between
>>>>> the two or a guest DXE driver uses the MM_COMMUNICATE call (see [1]) to ask QEMU
>>>>> where the memory is.
>>>>
>>>> whether invoke the guest UEFI will be complex? not see the advantage. it seems x86 Qemu
>>>> directly generate the ACPI table, but I am not sure, we are checking the qemu
>>> logical.
>>>> let Qemu generate CPER record may be clear.
>>>
>>> At boot UEFI in the guest will need to make sure the areas of memory that may be
>>> used for CPER records are reserved. Whether UEFI or Qemu decides where these are
>>> needs deciding, (but probably not here)...
>>>
>>> At runtime, when an error has occurred, I agree it would be simpler (fewer
>>> components involved) if Qemu generates the CPER records. But if UEFI made the
>>> memory choice above they need to interact and it gets complicated again. The
>>> CPER records are defined in the UEFI spec, so I would expect UEFI to contain
>>> code to generate/parse them.
>>>
>>>
>>> Thanks,
>>>
>>> James
>>>
>>>
>>> .
>>>
>>

Michael S. Tsirkin March 29, 2017, 12:51 p.m. UTC | #30

On Wed, Mar 29, 2017 at 01:58:29PM +0200, Laszlo Ersek wrote:
> (8) When QEMU gets SIGBUS from the kernel -- I hope that's going to come
> through a signalfd -- QEMU can format the CPER right into guest memory,
> and then inject whatever interrupt (or assert whatever GPIO line) is
> necessary for notifying the guest.

I think I see a race condition potential - what if guest accesses
CPER in guest memory while it's being written?

We can probably use another level of indirection to fix this:

allocate twice the space, add a pointer to where the valid
table is located and update that after writing CPER completely.
The pointer can be written atomically but also needs to
be read atomically, so I suspect it should be a single byte
as we don't know how are OSPMs implementing this.

Laszlo Ersek March 29, 2017, 1:36 p.m. UTC | #31

On 03/29/17 14:51, Michael S. Tsirkin wrote:
> On Wed, Mar 29, 2017 at 01:58:29PM +0200, Laszlo Ersek wrote:
>> (8) When QEMU gets SIGBUS from the kernel -- I hope that's going to come
>> through a signalfd -- QEMU can format the CPER right into guest memory,
>> and then inject whatever interrupt (or assert whatever GPIO line) is
>> necessary for notifying the guest.
> 
> I think I see a race condition potential - what if guest accesses
> CPER in guest memory while it's being written?

I'm not entirely sure about the data flow here (these parts of the ACPI
spec are particularly hard to read...), but I thought the OS wouldn't
look until it got a notification.

Or, are you concerned about the next CPER write by QEMU, while the OS is
reading the last one (and maybe the CPER area could wrap around?)

> 
> We can probably use another level of indirection to fix this:
> 
> allocate twice the space, add a pointer to where the valid
> table is located and update that after writing CPER completely.
> The pointer can be written atomically but also needs to
> be read atomically, so I suspect it should be a single byte
> as we don't know how are OSPMs implementing this.
> 

A-B-A problem? (Is that usually solved with a cookie or a wider
generation counter? But that would again require wider atomics.)

I do wonder though how this is handled on physical hardware. Assuming
the hardware error traps to the firmware first (which, on phys hw, is
responsible for depositing the CPER), in that scenario the phys firmware
would face the same issue (i.e., asynchronously interrupting the OS,
which could be reading the previously stored CPER).

Thanks,
Laszlo

Michael S. Tsirkin March 29, 2017, 1:54 p.m. UTC | #32

On Wed, Mar 29, 2017 at 03:36:59PM +0200, Laszlo Ersek wrote:
> On 03/29/17 14:51, Michael S. Tsirkin wrote:
> > On Wed, Mar 29, 2017 at 01:58:29PM +0200, Laszlo Ersek wrote:
> >> (8) When QEMU gets SIGBUS from the kernel -- I hope that's going to come
> >> through a signalfd -- QEMU can format the CPER right into guest memory,
> >> and then inject whatever interrupt (or assert whatever GPIO line) is
> >> necessary for notifying the guest.
> > 
> > I think I see a race condition potential - what if guest accesses
> > CPER in guest memory while it's being written?
> 
> I'm not entirely sure about the data flow here (these parts of the ACPI
> spec are particularly hard to read...), but I thought the OS wouldn't
> look until it got a notification.

There could be multiple notifications, OS might be looking
there because of them.

> Or, are you concerned about the next CPER write by QEMU, while the OS is
> reading the last one (and maybe the CPER area could wrap around?)
> 
> > 
> > We can probably use another level of indirection to fix this:
> > 
> > allocate twice the space, add a pointer to where the valid
> > table is located and update that after writing CPER completely.
> > The pointer can be written atomically but also needs to
> > be read atomically, so I suspect it should be a single byte
> > as we don't know how are OSPMs implementing this.
> > 
> 
> A-B-A problem? (Is that usually solved with a cookie or a wider
> generation counter? But that would again require wider atomics.)
> 
> I do wonder though how this is handled on physical hardware. Assuming
> the hardware error traps to the firmware first (which, on phys hw, is
> responsible for depositing the CPER), in that scenario the phys firmware
> would face the same issue (i.e., asynchronously interrupting the OS,
> which could be reading the previously stored CPER).
> 
> Thanks,
> Laszlo

ACPI spec seems to specify a set of serialization actions. I'm guessing
this is what you need to use to avoid changing guest state
while it's reading entries.

Punit Agrawal March 29, 2017, 1:56 p.m. UTC | #33

Laszlo Ersek <lersek@redhat.com> writes:

> On 03/29/17 14:51, Michael S. Tsirkin wrote:
>> On Wed, Mar 29, 2017 at 01:58:29PM +0200, Laszlo Ersek wrote:
>>> (8) When QEMU gets SIGBUS from the kernel -- I hope that's going to come
>>> through a signalfd -- QEMU can format the CPER right into guest memory,
>>> and then inject whatever interrupt (or assert whatever GPIO line) is
>>> necessary for notifying the guest.
>> 
>> I think I see a race condition potential - what if guest accesses
>> CPER in guest memory while it's being written?
>
> I'm not entirely sure about the data flow here (these parts of the ACPI
> spec are particularly hard to read...), but I thought the OS wouldn't
> look until it got a notification.
>
> Or, are you concerned about the next CPER write by QEMU, while the OS is
> reading the last one (and maybe the CPER area could wrap around?)
>
>> 
>> We can probably use another level of indirection to fix this:
>> 
>> allocate twice the space, add a pointer to where the valid
>> table is located and update that after writing CPER completely.
>> The pointer can be written atomically but also needs to
>> be read atomically, so I suspect it should be a single byte
>> as we don't know how are OSPMs implementing this.
>> 
>
> A-B-A problem? (Is that usually solved with a cookie or a wider
> generation counter? But that would again require wider atomics.)
>
> I do wonder though how this is handled on physical hardware. Assuming
> the hardware error traps to the firmware first (which, on phys hw, is
> responsible for depositing the CPER), in that scenario the phys firmware
> would face the same issue (i.e., asynchronously interrupting the OS,
> which could be reading the previously stored CPER).

Not sure about other error sources but for GHESv2 (ACPI 6.1, Section
18.3.2.8) the OS is expected to acknowledge the error before the
firmware is allowed to reuse the memory.

>
> Thanks,
> Laszlo
> _______________________________________________
> kvmarm mailing list
> kvmarm@lists.cs.columbia.edu
> https://lists.cs.columbia.edu/mailman/listinfo/kvmarm

gengdongjiu March 29, 2017, 2:36 p.m. UTC | #34

Hi Achin,
  Thanks for your mail and  answer.

2017-03-29 18:36 GMT+08:00, Achin Gupta <achin.gupta@arm.com>:
> Hi gengdongjiu,
>
> On Wed, Mar 29, 2017 at 05:36:37PM +0800, gengdongjiu wrote:
>>
>> Hi Laszlo/Biesheuvel/Qemu developer,
>>
>>    Now I encounter a issue and want to consult with you in ARM64 platform，
>> as described below:
>>
>>    when guest OS happen synchronous or asynchronous abort, kvm needs to
>> send the error address to Qemu or UEFI through sigbus to dynamically
>> generate APEI table. from my investigation, there are two ways:
>>
>>    (1) Qemu get the error address, and generate the APEI table, then
>> notify UEFI to know this generation, then inject abort error to guest OS,
>> guest OS read the APEI table.
>>    (2) Qemu get the error address, and let UEFI to generate the APEI
>> table, then inject abort error to guest OS, guest OS read the APEI table.

The description may be not precise, I update it

   (1) Qemu get the error address, and generate the CPER table, then
notify guest UEFI to place this CPER table to Guest OS memory, then
Qemu let KVM inject abort error to guest OS, guest OS read the CPER
table.

 (2) Qemu get the error address, and let guest UEFI to directly
generate the CPER
table and place this table to the guest OS memory, not let Qemu gerate
it. then KVM inject abort error to guest OS, guest OS read the CPER
table.

>
> Just being pedantic! I don't think we are talking about creating the APEI
> table
> dynamically here. The issue is: Once KVM has received an error that is
> destined
> for a guest it will raise a SIGBUS to Qemu. Now before Qemu can inject the
> error
> into the guest OS, a CPER (Common Platform Error Record) has to be
> generated
> corresponding to the error source (GHES corresponding to memory subsystem,
> processor etc) to allow the guest OS to do anything meaningful with the
> error. So who should create the CPER is the question.
>
> At the EL3/EL2 interface (Secure Firmware and OS/Hypervisor), an error
> arrives
> at EL3 and secure firmware (at EL3 or a lower secure exception level) is
> responsible for creating the CPER. ARM is experimenting with using a
> Standalone
> MM EDK2 image in the secure world to do the CPER creation. This will avoid
> adding the same code in ARM TF in EL3 (better for security). The error will
> then
> be injected into the OS/Hypervisor (through SEA/SEI/SDEI) through ARM
> Trusted
> Firmware.
>
> Qemu is essentially fulfilling the role of secure firmware at the EL2/EL1
> interface (as discussed with Christoffer below). So it should generate the
> CPER
> before injecting the error.
>
> This is corresponds to (1) above apart from notifying UEFI (I am assuming
> you
> mean guest UEFI). At this time, the guest OS already knows where to pick up
> the
> CPER from through the HEST. Qemu has to create the CPER and populate its
> address
> at the address exported in the HEST. Guest UEFI should not be involved in
> this
> flow. Its job was to create the HEST at boot and that has been done by this
> stage.

 Sorry,  As I understand it, after Qemu generate the CPER table, it
should pass the CPER table to the guest UEFI, then Guest UEFI  place
this CPER table to the guest OS memory. In this flow, the Guest UEFI
should be involved, else the Guest OS can not see the CPER table.

>
> Qemu folk will be able to add but it looks like support for CPER generation
> will
> need to be added to Qemu. We need to resolve this.
>
> Do shout if I am missing anything above.
>
> cheers,
> Achin
>
>
>>
>>
>>    Do you think which modules generates the APEI table is better? UEFI or
>> Qemu?
>>
>>
>>
>>
>> On 2017/3/28 21:40, James Morse wrote:
>> > Hi gengdongjiu,
>> >
>> > On 28/03/17 13:16, gengdongjiu wrote:
>> >> On 2017/3/28 19:54, Achin Gupta wrote:
>> >>> On Tue, Mar 28, 2017 at 01:23:28PM +0200, Christoffer Dall wrote:
>> >>>> On Tue, Mar 28, 2017 at 11:48:08AM +0100, James Morse wrote:
>> >>>>> On the host, part of UEFI is involved to generate the CPER records.
>> >>>>> In a guest?, I don't know.
>> >>>>> Qemu could generate the records, or drive some other component to do
>> >>>>> it.
>> >>>>
>> >>>> I think I am beginning to understand this a bit.  Since the guet
>> >>>> UEFI
>> >>>> instance is specifically built for the machine it runs on, QEMU's
>> >>>> virt
>> >>>> machine in this case, they could simply agree (by some contract) to
>> >>>> place the records at some specific location in memory, and if the
>> >>>> guest
>> >>>> kernel asks its guest UEFI for that location, things should just work
>> >>>> by
>> >>>> having logic in QEMU to process error reports and populate guest
>> >>>> memory.
>> >>>>
>> >>>> Is this how others see the world too?
>> >>>
>> >>> I think so!
>> >>>
>> >>> AFAIU, the memory where CPERs will reside should be specified in a
>> >>> GHES entry in
>> >>> the HEST. Is this not the case with a guest kernel i.e. the guest UEFI
>> >>> creates a
>> >>> HEST for the guest Kernel?
>> >>>
>> >>> If so, then the question is how the guest UEFI finds out where QEMU
>> >>> (acting as
>> >>> EL3 firmware) will populate the CPERs. This could either be a contract
>> >>> between
>> >>> the two or a guest DXE driver uses the MM_COMMUNICATE call (see [1])
>> >>> to ask QEMU
>> >>> where the memory is.
>> >>
>> >> whether invoke the guest UEFI will be complex? not see the advantage.
>> >> it seems x86 Qemu
>> >> directly generate the ACPI table, but I am not sure, we are checking
>> >> the qemu
>> > logical.
>> >> let Qemu generate CPER record may be clear.
>> >
>> > At boot UEFI in the guest will need to make sure the areas of memory
>> > that may be
>> > used for CPER records are reserved. Whether UEFI or Qemu decides where
>> > these are
>> > needs deciding, (but probably not here)...
>> >
>> > At runtime, when an error has occurred, I agree it would be simpler
>> > (fewer
>> > components involved) if Qemu generates the CPER records. But if UEFI
>> > made the
>> > memory choice above they need to interact and it gets complicated again.
>> > The
>> > CPER records are defined in the UEFI spec, so I would expect UEFI to
>> > contain
>> > code to generate/parse them.
>> >
>> >
>> > Thanks,
>> >
>> > James
>> >
>> >
>> > .
>> >
>>
>

Christoffer Dall March 29, 2017, 2:48 p.m. UTC | #35

On Wed, Mar 29, 2017 at 10:36:51PM +0800, gengdongjiu wrote:
> Hi Achin,
>   Thanks for your mail and  answer.
> 
> 2017-03-29 18:36 GMT+08:00, Achin Gupta <achin.gupta@arm.com>:
> > Hi gengdongjiu,
> >
> > On Wed, Mar 29, 2017 at 05:36:37PM +0800, gengdongjiu wrote:
> >>
> >> Hi Laszlo/Biesheuvel/Qemu developer,
> >>
> >>    Now I encounter a issue and want to consult with you in ARM64 platform，
> >> as described below:
> >>
> >>    when guest OS happen synchronous or asynchronous abort, kvm needs to
> >> send the error address to Qemu or UEFI through sigbus to dynamically
> >> generate APEI table. from my investigation, there are two ways:
> >>
> >>    (1) Qemu get the error address, and generate the APEI table, then
> >> notify UEFI to know this generation, then inject abort error to guest OS,
> >> guest OS read the APEI table.
> >>    (2) Qemu get the error address, and let UEFI to generate the APEI
> >> table, then inject abort error to guest OS, guest OS read the APEI table.
> 
> The description may be not precise, I update it
> 
>    (1) Qemu get the error address, and generate the CPER table, then
> notify guest UEFI to place this CPER table to Guest OS memory, then
> Qemu let KVM inject abort error to guest OS, guest OS read the CPER
> table.
> 
>  (2) Qemu get the error address, and let guest UEFI to directly
> generate the CPER
> table and place this table to the guest OS memory, not let Qemu gerate
> it. then KVM inject abort error to guest OS, guest OS read the CPER
> table.
> 

I don't understand how you are going to notify the guest UEFI instance
of anything or run the guest UEFI instance without going through the
guest kernel.

AFAIU, the guest UEFI instance is just a blob of software running at EL1
together with the guest kernel, and you're notification mechanism from
QEMU to the VM is pretty much limited to injecting a virtual interrupt
(of some kind) which is initially handled by the guest kernel.

> >
> > Just being pedantic! I don't think we are talking about creating the APEI
> > table
> > dynamically here. The issue is: Once KVM has received an error that is
> > destined
> > for a guest it will raise a SIGBUS to Qemu. Now before Qemu can inject the
> > error
> > into the guest OS, a CPER (Common Platform Error Record) has to be
> > generated
> > corresponding to the error source (GHES corresponding to memory subsystem,
> > processor etc) to allow the guest OS to do anything meaningful with the
> > error. So who should create the CPER is the question.
> >
> > At the EL3/EL2 interface (Secure Firmware and OS/Hypervisor), an error
> > arrives
> > at EL3 and secure firmware (at EL3 or a lower secure exception level) is
> > responsible for creating the CPER. ARM is experimenting with using a
> > Standalone
> > MM EDK2 image in the secure world to do the CPER creation. This will avoid
> > adding the same code in ARM TF in EL3 (better for security). The error will
> > then
> > be injected into the OS/Hypervisor (through SEA/SEI/SDEI) through ARM
> > Trusted
> > Firmware.
> >
> > Qemu is essentially fulfilling the role of secure firmware at the EL2/EL1
> > interface (as discussed with Christoffer below). So it should generate the
> > CPER
> > before injecting the error.
> >
> > This is corresponds to (1) above apart from notifying UEFI (I am assuming
> > you
> > mean guest UEFI). At this time, the guest OS already knows where to pick up
> > the
> > CPER from through the HEST. Qemu has to create the CPER and populate its
> > address
> > at the address exported in the HEST. Guest UEFI should not be involved in
> > this
> > flow. Its job was to create the HEST at boot and that has been done by this
> > stage.
> 
>  Sorry,  As I understand it, after Qemu generate the CPER table, it
> should pass the CPER table to the guest UEFI, then Guest UEFI  place
> this CPER table to the guest OS memory. In this flow, the Guest UEFI
> should be involved, else the Guest OS can not see the CPER table.
> 

I think you need to explain the "pass the CPER table to the guest UEFI"
concept in terms of what really happens, step by step, and when you say
"then Guest UEFI place the CPER table to the guest OS memory", I'm
curious who is running what code on the hardware when doing that.


Thanks,
-Christoffer

Laszlo Ersek March 29, 2017, 3:37 p.m. UTC | #36

On 03/29/17 16:48, Christoffer Dall wrote:
> On Wed, Mar 29, 2017 at 10:36:51PM +0800, gengdongjiu wrote:
>> 2017-03-29 18:36 GMT+08:00, Achin Gupta <achin.gupta@arm.com>:

>>> Qemu is essentially fulfilling the role of secure firmware at the
>>> EL2/EL1 interface (as discussed with Christoffer below). So it
>>> should generate the CPER before injecting the error.
>>>
>>> This is corresponds to (1) above apart from notifying UEFI (I am
>>> assuming you mean guest UEFI). At this time, the guest OS already
>>> knows where to pick up the CPER from through the HEST. Qemu has
>>> to create the CPER and populate its address at the address
>>> exported in the HEST. Guest UEFI should not be involved in this 
>>> flow. Its job was to create the HEST at boot and that has been
>>> done by this stage.
>>
>> Sorry,  As I understand it, after Qemu generate the CPER table, it
>> should pass the CPER table to the guest UEFI, then Guest UEFI  place
>> this CPER table to the guest OS memory. In this flow, the Guest UEFI
>> should be involved, else the Guest OS can not see the CPER table.
>>
> 
> I think you need to explain the "pass the CPER table to the guest UEFI"
> concept in terms of what really happens, step by step, and when you say
> "then Guest UEFI place the CPER table to the guest OS memory", I'm
> curious who is running what code on the hardware when doing that.

I strongly suggest to keep the guest firmware's runtime involvement to
zero. Two reasons:

(1) As you explained above (... which I conveniently snipped), when you
inject an interrupt to the guest, the handler registered for that
interrupt will come from the guest kernel.

The only exception to this is when the platform provides a type of
interrupt whose handler can be registered and then locked down by the
firmware. On x86, this is the SMI.

In practice though,
- in OVMF (x86), we only do synchronous (software-initiated) SMIs (for
privileged UEFI varstore access),
- and in ArmVirtQemu (ARM / aarch64), none of the management mode stuff
exists at all.

I understand that the Platform Init 1.5 (or 1.6?) spec abstracted away
the MM (management mode) protocols from Intel SMM, but at this point
there is zero code in ArmVirtQemu for that. (And I'm unsure how much of
any eligible underlying hw emulation exists in QEMU.)

So you can't get the guest firmware to react to the injected interrupt
without the guest OS coming between first.

(2) Achin's description matches really-really closely what is possible,
and what should be done with QEMU, ArmVirtQemu, and the guest kernel.

In any solution for this feature, the firmware has to reserve some
memory from the OS at boot. The current facilities we have enable this.
As I described previously, the ACPI linker/loader actions can be mapped
more or less 1:1 to Achin's design. From a practical perspective, you
really want to keep the guest firmware as dumb as possible (meaning: as
generic as possible), and keep the ACPI specifics to the QEMU and the
guest kernel sides.

The error serialization actions -- the co-operation between guest kernel
and QEMU on the special memory areas -- that were mentioned earlier by
Michael and Punit look like a complication. But, IMO, they don't differ
from any other device emulation -- DMA actions in particular -- that
QEMU already does. Device models are what QEMU *does*. Read the command
block that the guest driver placed in guest memory, parse it, sanity
check it, verify it, execute it, write back the status code, inject an
interrupt (and/or let any polling guest driver notice it "soon after" --
use barriers as necessary).

Thus, I suggest to rely on the generic ACPI linker/loader interface
(between QEMU and guest firmware) *only* to make the firmware lay out
stuff (= reserve buffers, set up pointers, install QEMU's ACPI tables)
*at boot*. Then, at runtime, let the guest kernel and QEMU (the "device
model") talk to each other directly. Keep runtime firmware involvement
to zero.

You *really* don't want to debug three components at runtime, when you
can solve the thing with two. (Two components whose build systems won't
drive you mad, I should add.)

IMO, Achin's design nailed it. We can do that.

Laszlo

Christoffer Dall March 29, 2017, 5:44 p.m. UTC | #37

On Wed, Mar 29, 2017 at 05:37:49PM +0200, Laszlo Ersek wrote:
> On 03/29/17 16:48, Christoffer Dall wrote:
> > On Wed, Mar 29, 2017 at 10:36:51PM +0800, gengdongjiu wrote:
> >> 2017-03-29 18:36 GMT+08:00, Achin Gupta <achin.gupta@arm.com>:
> 
> >>> Qemu is essentially fulfilling the role of secure firmware at the
> >>> EL2/EL1 interface (as discussed with Christoffer below). So it
> >>> should generate the CPER before injecting the error.
> >>>
> >>> This is corresponds to (1) above apart from notifying UEFI (I am
> >>> assuming you mean guest UEFI). At this time, the guest OS already
> >>> knows where to pick up the CPER from through the HEST. Qemu has
> >>> to create the CPER and populate its address at the address
> >>> exported in the HEST. Guest UEFI should not be involved in this 
> >>> flow. Its job was to create the HEST at boot and that has been
> >>> done by this stage.
> >>
> >> Sorry,  As I understand it, after Qemu generate the CPER table, it
> >> should pass the CPER table to the guest UEFI, then Guest UEFI  place
> >> this CPER table to the guest OS memory. In this flow, the Guest UEFI
> >> should be involved, else the Guest OS can not see the CPER table.
> >>
> > 
> > I think you need to explain the "pass the CPER table to the guest UEFI"
> > concept in terms of what really happens, step by step, and when you say
> > "then Guest UEFI place the CPER table to the guest OS memory", I'm
> > curious who is running what code on the hardware when doing that.
> 
> I strongly suggest to keep the guest firmware's runtime involvement to
> zero. Two reasons:
> 
> (1) As you explained above (... which I conveniently snipped), when you
> inject an interrupt to the guest, the handler registered for that
> interrupt will come from the guest kernel.
> 
> The only exception to this is when the platform provides a type of
> interrupt whose handler can be registered and then locked down by the
> firmware. On x86, this is the SMI.
> 
> In practice though,
> - in OVMF (x86), we only do synchronous (software-initiated) SMIs (for
> privileged UEFI varstore access),
> - and in ArmVirtQemu (ARM / aarch64), none of the management mode stuff
> exists at all.
> 
> I understand that the Platform Init 1.5 (or 1.6?) spec abstracted away
> the MM (management mode) protocols from Intel SMM, but at this point
> there is zero code in ArmVirtQemu for that. (And I'm unsure how much of
> any eligible underlying hw emulation exists in QEMU.)
> 
> So you can't get the guest firmware to react to the injected interrupt
> without the guest OS coming between first.
> 
> (2) Achin's description matches really-really closely what is possible,
> and what should be done with QEMU, ArmVirtQemu, and the guest kernel.
> 
> In any solution for this feature, the firmware has to reserve some
> memory from the OS at boot. The current facilities we have enable this.
> As I described previously, the ACPI linker/loader actions can be mapped
> more or less 1:1 to Achin's design. From a practical perspective, you
> really want to keep the guest firmware as dumb as possible (meaning: as
> generic as possible), and keep the ACPI specifics to the QEMU and the
> guest kernel sides.
> 
> The error serialization actions -- the co-operation between guest kernel
> and QEMU on the special memory areas -- that were mentioned earlier by
> Michael and Punit look like a complication. But, IMO, they don't differ
> from any other device emulation -- DMA actions in particular -- that
> QEMU already does. Device models are what QEMU *does*. Read the command
> block that the guest driver placed in guest memory, parse it, sanity
> check it, verify it, execute it, write back the status code, inject an
> interrupt (and/or let any polling guest driver notice it "soon after" --
> use barriers as necessary).
> 
> Thus, I suggest to rely on the generic ACPI linker/loader interface
> (between QEMU and guest firmware) *only* to make the firmware lay out
> stuff (= reserve buffers, set up pointers, install QEMU's ACPI tables)
> *at boot*. Then, at runtime, let the guest kernel and QEMU (the "device
> model") talk to each other directly. Keep runtime firmware involvement
> to zero.
> 
> You *really* don't want to debug three components at runtime, when you
> can solve the thing with two. (Two components whose build systems won't
> drive you mad, I should add.)
> 
> IMO, Achin's design nailed it. We can do that.
> 
I completely agree.

My questions were intended for gengdongjiu to clarify his/her position
and clear up any misunderstandings between what Achin suggested and what
he/she wrote.

Thanks,
-Christoffer

Dongjiu Geng March 30, 2017, 1:22 a.m. UTC | #38

Hi Christoffer/Laszlo,

On 2017/3/30 1:44, Christoffer Dall wrote:
> On Wed, Mar 29, 2017 at 05:37:49PM +0200, Laszlo Ersek wrote:
>> On 03/29/17 16:48, Christoffer Dall wrote:
>>> On Wed, Mar 29, 2017 at 10:36:51PM +0800, gengdongjiu wrote:
>>>> 2017-03-29 18:36 GMT+08:00, Achin Gupta <achin.gupta@arm.com>:
>>
>>>>> Qemu is essentially fulfilling the role of secure firmware at the
>>>>> EL2/EL1 interface (as discussed with Christoffer below). So it
>>>>> should generate the CPER before injecting the error.
>>>>>
>>>>> This is corresponds to (1) above apart from notifying UEFI (I am
>>>>> assuming you mean guest UEFI). At this time, the guest OS already
>>>>> knows where to pick up the CPER from through the HEST. Qemu has
>>>>> to create the CPER and populate its address at the address
>>>>> exported in the HEST. Guest UEFI should not be involved in this 
>>>>> flow. Its job was to create the HEST at boot and that has been
>>>>> done by this stage.
>>>>
>>>> Sorry,  As I understand it, after Qemu generate the CPER table, it
>>>> should pass the CPER table to the guest UEFI, then Guest UEFI  place
>>>> this CPER table to the guest OS memory. In this flow, the Guest UEFI
>>>> should be involved, else the Guest OS can not see the CPER table.
>>>>
>>>
>>> I think you need to explain the "pass the CPER table to the guest UEFI"
>>> concept in terms of what really happens, step by step, and when you say
>>> "then Guest UEFI place the CPER table to the guest OS memory", I'm
>>> curious who is running what code on the hardware when doing that.
>>
>> I strongly suggest to keep the guest firmware's runtime involvement to
>> zero. Two reasons:
>>
>> (1) As you explained above (... which I conveniently snipped), when you
>> inject an interrupt to the guest, the handler registered for that
>> interrupt will come from the guest kernel.
>>
>> The only exception to this is when the platform provides a type of
>> interrupt whose handler can be registered and then locked down by the
>> firmware. On x86, this is the SMI.
>>
>> In practice though,
>> - in OVMF (x86), we only do synchronous (software-initiated) SMIs (for
>> privileged UEFI varstore access),
>> - and in ArmVirtQemu (ARM / aarch64), none of the management mode stuff
>> exists at all.
>>
>> I understand that the Platform Init 1.5 (or 1.6?) spec abstracted away
>> the MM (management mode) protocols from Intel SMM, but at this point
>> there is zero code in ArmVirtQemu for that. (And I'm unsure how much of
>> any eligible underlying hw emulation exists in QEMU.)
>>
>> So you can't get the guest firmware to react to the injected interrupt
>> without the guest OS coming between first.
>>
>> (2) Achin's description matches really-really closely what is possible,
>> and what should be done with QEMU, ArmVirtQemu, and the guest kernel.
>>
>> In any solution for this feature, the firmware has to reserve some
>> memory from the OS at boot. The current facilities we have enable this.
>> As I described previously, the ACPI linker/loader actions can be mapped
>> more or less 1:1 to Achin's design. From a practical perspective, you
>> really want to keep the guest firmware as dumb as possible (meaning: as
>> generic as possible), and keep the ACPI specifics to the QEMU and the
>> guest kernel sides.
>>
>> The error serialization actions -- the co-operation between guest kernel
>> and QEMU on the special memory areas -- that were mentioned earlier by
>> Michael and Punit look like a complication. But, IMO, they don't differ
>> from any other device emulation -- DMA actions in particular -- that
>> QEMU already does. Device models are what QEMU *does*. Read the command
>> block that the guest driver placed in guest memory, parse it, sanity
>> check it, verify it, execute it, write back the status code, inject an
>> interrupt (and/or let any polling guest driver notice it "soon after" --
>> use barriers as necessary).
>>
>> Thus, I suggest to rely on the generic ACPI linker/loader interface
>> (between QEMU and guest firmware) *only* to make the firmware lay out
>> stuff (= reserve buffers, set up pointers, install QEMU's ACPI tables)
>> *at boot*. Then, at runtime, let the guest kernel and QEMU (the "device
>> model") talk to each other directly. Keep runtime firmware involvement
>> to zero.
>>
>> You *really* don't want to debug three components at runtime, when you
>> can solve the thing with two. (Two components whose build systems won't
>> drive you mad, I should add.)
>>
>> IMO, Achin's design nailed it. We can do that.
>>
> I completely agree.
> 
> My questions were intended for gengdongjiu to clarify his/her 
> and clear up any misunderstandings between what Achin suggested and what
> he/she wrote.

  Achin and Laszlo's understanding are right. thanks for your suggestion.

> 
> Thanks,
> -Christoffer
> 
> .
>

Dongjiu Geng April 6, 2017, 12:35 p.m. UTC | #39

Dear, Laszlo
   Thanks for your detailed explanation.

On 2017/3/29 19:58, Laszlo Ersek wrote:
> (This ought to be one of the longest address lists I've ever seen :)
> Thanks for the CC. I'm glad Shannon is already on the CC list. For good
> measure, I'm adding MST and Igor.)
> 
> On 03/29/17 12:36, Achin Gupta wrote:
>> Hi gengdongjiu,
>>
>> On Wed, Mar 29, 2017 at 05:36:37PM +0800, gengdongjiu wrote:
>>>
>>> Hi Laszlo/Biesheuvel/Qemu developer,
>>>
>>>    Now I encounter a issue and want to consult with you in ARM64 platform， as described below:
>>>
>>> when guest OS happen synchronous or asynchronous abort, kvm needs
>>> to send the error address to Qemu or UEFI through sigbus to
>>> dynamically generate APEI table. from my investigation, there are
>>> two ways:
>>>
>>> (1) Qemu get the error address, and generate the APEI table, then
>>> notify UEFI to know this generation, then inject abort error to
>>> guest OS, guest OS read the APEI table.
>>> (2) Qemu get the error address, and let UEFI to generate the APEI
>>> table, then inject abort error to guest OS, guest OS read the APEI
>>> table.
>>
>> Just being pedantic! I don't think we are talking about creating the APEI table
>> dynamically here. The issue is: Once KVM has received an error that is destined
>> for a guest it will raise a SIGBUS to Qemu. Now before Qemu can inject the error
>> into the guest OS, a CPER (Common Platform Error Record) has to be generated
>> corresponding to the error source (GHES corresponding to memory subsystem,
>> processor etc) to allow the guest OS to do anything meaningful with the
>> error. So who should create the CPER is the question.
>>
>> At the EL3/EL2 interface (Secure Firmware and OS/Hypervisor), an error arrives
>> at EL3 and secure firmware (at EL3 or a lower secure exception level) is
>> responsible for creating the CPER. ARM is experimenting with using a Standalone
>> MM EDK2 image in the secure world to do the CPER creation. This will avoid
>> adding the same code in ARM TF in EL3 (better for security). The error will then
>> be injected into the OS/Hypervisor (through SEA/SEI/SDEI) through ARM Trusted
>> Firmware.
>>
>> Qemu is essentially fulfilling the role of secure firmware at the EL2/EL1
>> interface (as discussed with Christoffer below). So it should generate the CPER
>> before injecting the error.
>>
>> This is corresponds to (1) above apart from notifying UEFI (I am assuming you
>> mean guest UEFI). At this time, the guest OS already knows where to pick up the
>> CPER from through the HEST. Qemu has to create the CPER and populate its address
>> at the address exported in the HEST. Guest UEFI should not be involved in this
>> flow. Its job was to create the HEST at boot and that has been done by this
>> stage.
>>
>> Qemu folk will be able to add but it looks like support for CPER generation will
>> need to be added to Qemu. We need to resolve this.
>>
>> Do shout if I am missing anything above.
> 
> After reading this email, the use case looks *very* similar to what
> we've just done with VMGENID for QEMU 2.9.
> 
> We have a facility between QEMU and the guest firmware, called "ACPI
> linker/loader", with which QEMU instructs the firmware to
> 
> - allocate and download blobs into guest RAM (AcpiNVS type memory) --
> ALLOCATE command,
> 
> - relocate pointers in those blobs, to fields in other (or the same)
> blobs -- ADD_POINTER command,
> 
> - set ACPI table checksums -- ADD_CHECKSUM command,
> 
> - and send GPAs of fields within such blobs back to QEMU --
> WRITE_POINTER command.
> 
> This is how I imagine we can map the facility to the current use case
> (note that this is the first time I read about HEST / GHES / CPER):
> 
>     etc/acpi/tables                 etc/hardware_errors
>     ================     ==========================================
>                          +-----------+
>     +--------------+     | address   |         +-> +--------------+
>     |    HEST      +     | registers |         |   | Error Status |
>     + +------------+     | +---------+         |   | Data Block 1 |
>     | | GHES       | --> | | address | --------+   | +------------+
>     | | GHES       | --> | | address | ------+     | |  CPER      |
>     | | GHES       | --> | | address | ----+ |     | |  CPER      |
>     | | GHES       | --> | | address | -+  | |     | |  CPER      |
>     +-+------------+     +-+---------+  |  | |     +-+------------+
>                                         |  | |
>                                         |  | +---> +--------------+
>                                         |  |       | Error Status |
>                                         |  |       | Data Block 2 |
>                                         |  |       | +------------+
>                                         |  |       | |  CPER      |
>                                         |  |       | |  CPER      |
>                                         |  |       +-+------------+
>                                         |  |
>                                         |  +-----> +--------------+
>                                         |          | Error Status |
>                                         |          | Data Block 3 |
>                                         |          | +------------+
>                                         |          | |  CPER      |
>                                         |          +-+------------+
>                                         |
>                                         +--------> +--------------+
>                                                    | Error Status |
>                                                    | Data Block 4 |
>                                                    | +------------+
>                                                    | |  CPER      |
>                                                    | |  CPER      |
>                                                    | |  CPER      |
>                                                    +-+------------+
> 
> (1) QEMU generates the HEST ACPI table. This table goes in the current
> "etc/acpi/tables" fw_cfg blob. Given N error sources, there will be N
> GHES objects in the HEST.
> 
> (2) We introduce a new fw_cfg blob called "etc/hardware_errors". QEMU
> also populates this blob.
> 
> (2a) Given N error sources, the (unnamed) table of address registers
> will contain N address registers.
> 
> (2b) Given N error sources, the "etc/hardwre_errors" fw_cfg blob will
> also contain N Error Status Data Blocks.
> 
> I don't know about the sizing (number of CPERs) each Error Status Data
> Block has to contain, but I understand it is all pre-allocated as far as
> the OS is concerned, which matches our capabilities well.
here I have a question. as you comment: " 'etc/hardwre_errors' fw_cfg blob will also contain N Error Status Data Blocks",
Because the CPER numbers is not fixed, how to assign each "Error Status Data Block" size using one "etc/hardwre_errors" fw_cfg blob.
when use one etc/hardwre_errors, will the N Error Status Data Block use one continuous buffer? as shown below. if so, maybe it not convenient for each data block size extension.
I see the bios_linker_loader_alloc will allocate one continuous buffer for a blob(such as VMGENID_GUID_FW_CFG_FILE)

    /* Allocate guest memory for the Data fw_cfg blob */
    bios_linker_loader_alloc(linker, VMGENID_GUID_FW_CFG_FILE, guid, 4096,
                             false /* page boundary, high memory */);



-> +--------------+
     |    HEST      +     | registers |             | Error Status |
     + +------------+     | +---------+             | Data Block  |
     | | GHES       | --> | | address | --------+-->| +------------+
     | | GHES       | --> | | address | ------+     | |  CPER      |
     | | GHES       | --> | | address | ----+ |     | |  CPER      |
     | | GHES       | --> | | address | -+  | |     | |  CPER      |
     +-+------------+     +-+---------+  |  | +---> +--------------+
                                         |  |       | |  CPER      |
                                         |  |       | |  CPER      |
                                         |  +-----> +--------------+
                                         |          | |  CPER      |
                                         +--------> +--------------+
                                                    | |  CPER      |
                                                    | |  CPER      |
                                                    | |  CPER      |
                                                    +-+------------+



so how about we use separate etc/hardwre_errorsN for each Error Status status Block? then

etc/hardwre_errors0
etc/hardwre_errors1
...................
etc/hardwre_errors10
(the max N is 10)


the N can be one of below values, according to ACPI spec "Table 18-345 Hardware Error Notification Structure"
0 – Polled
1 – External Interrupt
2 – Local Interrupt
3 – SCI
4 – NMI
5 - CMCI
6 - MCE
7 - GPIO-Signal
8 - ARMv8 SEA
9 - ARMv8 SEI
10 - External Interrupt - GSIV




> 
> (3) QEMU generates the ACPI linker/loader script for the firmware, as
> always.
> 
> (3a) The HEST table is part of "etc/acpi/tables", which the firmware
> already allocates memory for, and downloads (because QEMU already
> generates an ALLOCATE linker/loader command for it already).
> 
> (3b) QEMU will have to create another ALLOCATE command for the
> "etc/hardware_errors" blob. The firmware allocates memory for this blob,
> and downloads it.
> 
> (4) QEMU generates, in the ACPI linker/loader script for the firwmare, N
> ADD_POINTER commands, which point the GHES."Error Status
> Address" fields in the HEST table, to the corresponding address
> registers in the downloaded "etc/hardware_errors" blob.
> 
> (5) QEMU generates an ADD_CHECKSUM command for the firmware, so that the
> HEST table is correctly checksummed after executing the N ADD_POINTER
> commands from (4).
> 
> (6) QEMU generates N ADD_POINTER commands for the firmware, pointing the
> address registers (located in guest memory, in the downloaded
> "etc/hardware_errors" blob) to the respective Error Status Data Blocks.
> 
> (7) (This is the trick.) For this step, we need a third, write-only
> fw_cfg blob, called "etc/hardware_errors_addr". Through that blob, the
> firmware can send back the guest-side allocation addresses to QEMU.
> 
> Namely, the "etc/hardware_errors_addr" blob contains N 8-byte entries.
> QEMU generates N WRITE_POINTER commands for the firmware.
> 
> For error source K (0 <= K < N), QEMU instructs the firmware to
> calculate the guest address of Error Status Data Block K, from the
> QEMU-dictated offset within "etc/hardware_errors", and from the
> guest-determined allocation base address for "etc/hardware_errors". The
> firmware then writes the calculated address back to fw_cfg file
> "etc/hardware_errors_addr", at offset K*8, according to the
> WRITE_POINTER command.
> 
> This way QEMU will know the GPA of each Error Status Data Block.
> 
> (In fact this can be simplified to a single WRITE_POINTER command: the
> address of the "address register table" can be sent back to QEMU as
> well, which already contains all Error Status Data Block addresses.)
> 
> (8) When QEMU gets SIGBUS from the kernel -- I hope that's going to come
> through a signalfd -- QEMU can format the CPER right into guest memory,
> and then inject whatever interrupt (or assert whatever GPIO line) is
> necessary for notifying the guest.
> 
> (9) This notification (in virtual hardware) can either be handled by the
> guest kernel stand-alone, or else the guest kernel can invoke an ACPI
> event handler method with it (which would be in the DSDT or one of the
> SSDTs, also generated by QEMU). The ACPI event handler method could
> invoke the specific guest kernel driver for errror handling via a
> Notify() operation.
> 
> I'm attracted to the above design because:
> - it would leave the firmware alone after OS boot, and
> - it would leave the firmware blissfully ignorant about HEST, GHES,
> CPER, and the like. (That's why QEMU's ACPI linker/loader was invented
> in the first place.)
> 
> Thanks
> Laszlo
> 
>>>    Do you think which modules generates the APEI table is better? UEFI or Qemu?
>>>
>>>
>>>
>>>
>>> On 2017/3/28 21:40, James Morse wrote:
>>>> Hi gengdongjiu,
>>>>
>>>> On 28/03/17 13:16, gengdongjiu wrote:
>>>>> On 2017/3/28 19:54, Achin Gupta wrote:
>>>>>> On Tue, Mar 28, 2017 at 01:23:28PM +0200, Christoffer Dall wrote:
>>>>>>> On Tue, Mar 28, 2017 at 11:48:08AM +0100, James Morse wrote:
>>>>>>>> On the host, part of UEFI is involved to generate the CPER records.
>>>>>>>> In a guest?, I don't know.
>>>>>>>> Qemu could generate the records, or drive some other component to do it.
>>>>>>>
>>>>>>> I think I am beginning to understand this a bit.  Since the guet UEFI
>>>>>>> instance is specifically built for the machine it runs on, QEMU's virt
>>>>>>> machine in this case, they could simply agree (by some contract) to
>>>>>>> place the records at some specific location in memory, and if the guest
>>>>>>> kernel asks its guest UEFI for that location, things should just work by
>>>>>>> having logic in QEMU to process error reports and populate guest memory.
>>>>>>>
>>>>>>> Is this how others see the world too?
>>>>>>
>>>>>> I think so!
>>>>>>
>>>>>> AFAIU, the memory where CPERs will reside should be specified in a GHES entry in
>>>>>> the HEST. Is this not the case with a guest kernel i.e. the guest UEFI creates a
>>>>>> HEST for the guest Kernel?
>>>>>>
>>>>>> If so, then the question is how the guest UEFI finds out where QEMU (acting as
>>>>>> EL3 firmware) will populate the CPERs. This could either be a contract between
>>>>>> the two or a guest DXE driver uses the MM_COMMUNICATE call (see [1]) to ask QEMU
>>>>>> where the memory is.
>>>>>
>>>>> whether invoke the guest UEFI will be complex? not see the advantage. it seems x86 Qemu
>>>>> directly generate the ACPI table, but I am not sure, we are checking the qemu
>>>> logical.
>>>>> let Qemu generate CPER record may be clear.
>>>>
>>>> At boot UEFI in the guest will need to make sure the areas of memory that may be
>>>> used for CPER records are reserved. Whether UEFI or Qemu decides where these are
>>>> needs deciding, (but probably not here)...
>>>>
>>>> At runtime, when an error has occurred, I agree it would be simpler (fewer
>>>> components involved) if Qemu generates the CPER records. But if UEFI made the
>>>> memory choice above they need to interact and it gets complicated again. The
>>>> CPER records are defined in the UEFI spec, so I would expect UEFI to contain
>>>> code to generate/parse them.
>>>>
>>>>
>>>> Thanks,
>>>>
>>>> James
>>>>
>>>>
>>>> .
>>>>
>>>
> 
> 
> .
>

Laszlo Ersek April 6, 2017, 6:55 p.m. UTC | #40

On 04/06/17 14:35, gengdongjiu wrote:
> Dear, Laszlo
>    Thanks for your detailed explanation.
> 
> On 2017/3/29 19:58, Laszlo Ersek wrote:
>> (This ought to be one of the longest address lists I've ever seen :)
>> Thanks for the CC. I'm glad Shannon is already on the CC list. For good
>> measure, I'm adding MST and Igor.)
>>
>> On 03/29/17 12:36, Achin Gupta wrote:
>>> Hi gengdongjiu,
>>>
>>> On Wed, Mar 29, 2017 at 05:36:37PM +0800, gengdongjiu wrote:
>>>>
>>>> Hi Laszlo/Biesheuvel/Qemu developer,
>>>>
>>>>    Now I encounter a issue and want to consult with you in ARM64 platform， as described below:
>>>>
>>>> when guest OS happen synchronous or asynchronous abort, kvm needs
>>>> to send the error address to Qemu or UEFI through sigbus to
>>>> dynamically generate APEI table. from my investigation, there are
>>>> two ways:
>>>>
>>>> (1) Qemu get the error address, and generate the APEI table, then
>>>> notify UEFI to know this generation, then inject abort error to
>>>> guest OS, guest OS read the APEI table.
>>>> (2) Qemu get the error address, and let UEFI to generate the APEI
>>>> table, then inject abort error to guest OS, guest OS read the APEI
>>>> table.
>>>
>>> Just being pedantic! I don't think we are talking about creating the APEI table
>>> dynamically here. The issue is: Once KVM has received an error that is destined
>>> for a guest it will raise a SIGBUS to Qemu. Now before Qemu can inject the error
>>> into the guest OS, a CPER (Common Platform Error Record) has to be generated
>>> corresponding to the error source (GHES corresponding to memory subsystem,
>>> processor etc) to allow the guest OS to do anything meaningful with the
>>> error. So who should create the CPER is the question.
>>>
>>> At the EL3/EL2 interface (Secure Firmware and OS/Hypervisor), an error arrives
>>> at EL3 and secure firmware (at EL3 or a lower secure exception level) is
>>> responsible for creating the CPER. ARM is experimenting with using a Standalone
>>> MM EDK2 image in the secure world to do the CPER creation. This will avoid
>>> adding the same code in ARM TF in EL3 (better for security). The error will then
>>> be injected into the OS/Hypervisor (through SEA/SEI/SDEI) through ARM Trusted
>>> Firmware.
>>>
>>> Qemu is essentially fulfilling the role of secure firmware at the EL2/EL1
>>> interface (as discussed with Christoffer below). So it should generate the CPER
>>> before injecting the error.
>>>
>>> This is corresponds to (1) above apart from notifying UEFI (I am assuming you
>>> mean guest UEFI). At this time, the guest OS already knows where to pick up the
>>> CPER from through the HEST. Qemu has to create the CPER and populate its address
>>> at the address exported in the HEST. Guest UEFI should not be involved in this
>>> flow. Its job was to create the HEST at boot and that has been done by this
>>> stage.
>>>
>>> Qemu folk will be able to add but it looks like support for CPER generation will
>>> need to be added to Qemu. We need to resolve this.
>>>
>>> Do shout if I am missing anything above.
>>
>> After reading this email, the use case looks *very* similar to what
>> we've just done with VMGENID for QEMU 2.9.
>>
>> We have a facility between QEMU and the guest firmware, called "ACPI
>> linker/loader", with which QEMU instructs the firmware to
>>
>> - allocate and download blobs into guest RAM (AcpiNVS type memory) --
>> ALLOCATE command,
>>
>> - relocate pointers in those blobs, to fields in other (or the same)
>> blobs -- ADD_POINTER command,
>>
>> - set ACPI table checksums -- ADD_CHECKSUM command,
>>
>> - and send GPAs of fields within such blobs back to QEMU --
>> WRITE_POINTER command.
>>
>> This is how I imagine we can map the facility to the current use case
>> (note that this is the first time I read about HEST / GHES / CPER):
>>
>>     etc/acpi/tables                 etc/hardware_errors
>>     ================     ==========================================
>>                          +-----------+
>>     +--------------+     | address   |         +-> +--------------+
>>     |    HEST      +     | registers |         |   | Error Status |
>>     + +------------+     | +---------+         |   | Data Block 1 |
>>     | | GHES       | --> | | address | --------+   | +------------+
>>     | | GHES       | --> | | address | ------+     | |  CPER      |
>>     | | GHES       | --> | | address | ----+ |     | |  CPER      |
>>     | | GHES       | --> | | address | -+  | |     | |  CPER      |
>>     +-+------------+     +-+---------+  |  | |     +-+------------+
>>                                         |  | |
>>                                         |  | +---> +--------------+
>>                                         |  |       | Error Status |
>>                                         |  |       | Data Block 2 |
>>                                         |  |       | +------------+
>>                                         |  |       | |  CPER      |
>>                                         |  |       | |  CPER      |
>>                                         |  |       +-+------------+
>>                                         |  |
>>                                         |  +-----> +--------------+
>>                                         |          | Error Status |
>>                                         |          | Data Block 3 |
>>                                         |          | +------------+
>>                                         |          | |  CPER      |
>>                                         |          +-+------------+
>>                                         |
>>                                         +--------> +--------------+
>>                                                    | Error Status |
>>                                                    | Data Block 4 |
>>                                                    | +------------+
>>                                                    | |  CPER      |
>>                                                    | |  CPER      |
>>                                                    | |  CPER      |
>>                                                    +-+------------+
>>
>> (1) QEMU generates the HEST ACPI table. This table goes in the current
>> "etc/acpi/tables" fw_cfg blob. Given N error sources, there will be N
>> GHES objects in the HEST.
>>
>> (2) We introduce a new fw_cfg blob called "etc/hardware_errors". QEMU
>> also populates this blob.
>>
>> (2a) Given N error sources, the (unnamed) table of address registers
>> will contain N address registers.
>>
>> (2b) Given N error sources, the "etc/hardwre_errors" fw_cfg blob will
>> also contain N Error Status Data Blocks.
>>
>> I don't know about the sizing (number of CPERs) each Error Status Data
>> Block has to contain, but I understand it is all pre-allocated as far as
>> the OS is concerned, which matches our capabilities well.
> here I have a question. as you comment: " 'etc/hardwre_errors' fw_cfg blob will also contain N Error Status Data Blocks",
> Because the CPER numbers is not fixed, how to assign each "Error Status Data Block" size using one "etc/hardwre_errors" fw_cfg blob.
> when use one etc/hardwre_errors, will the N Error Status Data Block use one continuous buffer? as shown below. if so, maybe it not convenient for each data block size extension.
> I see the bios_linker_loader_alloc will allocate one continuous buffer for a blob(such as VMGENID_GUID_FW_CFG_FILE)
> 
>     /* Allocate guest memory for the Data fw_cfg blob */
>     bios_linker_loader_alloc(linker, VMGENID_GUID_FW_CFG_FILE, guid, 4096,
>                              false /* page boundary, high memory */);
> 
> 
> 
> -> +--------------+
>      |    HEST      +     | registers |             | Error Status |
>      + +------------+     | +---------+             | Data Block  |
>      | | GHES       | --> | | address | --------+-->| +------------+
>      | | GHES       | --> | | address | ------+     | |  CPER      |
>      | | GHES       | --> | | address | ----+ |     | |  CPER      |
>      | | GHES       | --> | | address | -+  | |     | |  CPER      |
>      +-+------------+     +-+---------+  |  | +---> +--------------+
>                                          |  |       | |  CPER      |
>                                          |  |       | |  CPER      |
>                                          |  +-----> +--------------+
>                                          |          | |  CPER      |
>                                          +--------> +--------------+
>                                                     | |  CPER      |
>                                                     | |  CPER      |
>                                                     | |  CPER      |
>                                                     +-+------------+
> 
> 
> 
> so how about we use separate etc/hardwre_errorsN for each Error Status status Block? then
> 
> etc/hardwre_errors0
> etc/hardwre_errors1
> ...................
> etc/hardwre_errors10
> (the max N is 10)
> 
> 
> the N can be one of below values, according to ACPI spec "Table 18-345 Hardware Error Notification Structure"
> 0 – Polled
> 1 – External Interrupt
> 2 – Local Interrupt
> 3 – SCI
> 4 – NMI
> 5 - CMCI
> 6 - MCE
> 7 - GPIO-Signal
> 8 - ARMv8 SEA
> 9 - ARMv8 SEI
> 10 - External Interrupt - GSIV

I'm unsure if, by "not fixed", you are saying

  the number of CPER entries that fits in Error Status Data Block N is
  not *uniform* across 0 <= N <= 10 [1]

or

  the number of CPER entries that fits in Error Status Data Block N is
  not *known* in advance, for all of 0 <= N <= 10 [2]

Which one is your point?

If [1], that's no problem; you can simply sum the individual error
status data block sizes in advance, and allocate "etc/hardware_errors"
accordingly, using the total size.

(Allocating one shared fw_cfg blob for all status data blocks is more
memory efficient, as each ALLOCATE command will allocate whole pages
(rounded up from the actual blob size).)

If your point is [2], then splitting the error status data blocks to
separate fw_cfg blobs makes no difference: regardless of whether we try
to place all the error status data blocks in a single fw_cfg blob, or in
separate fw_cfg blobs, the individual data block cannot be resized at OS
runtime, so there's no way to make it work.

Thanks,
Laszlo

> 
> 
> 
> 
>>
>> (3) QEMU generates the ACPI linker/loader script for the firmware, as
>> always.
>>
>> (3a) The HEST table is part of "etc/acpi/tables", which the firmware
>> already allocates memory for, and downloads (because QEMU already
>> generates an ALLOCATE linker/loader command for it already).
>>
>> (3b) QEMU will have to create another ALLOCATE command for the
>> "etc/hardware_errors" blob. The firmware allocates memory for this blob,
>> and downloads it.
>>
>> (4) QEMU generates, in the ACPI linker/loader script for the firwmare, N
>> ADD_POINTER commands, which point the GHES."Error Status
>> Address" fields in the HEST table, to the corresponding address
>> registers in the downloaded "etc/hardware_errors" blob.
>>
>> (5) QEMU generates an ADD_CHECKSUM command for the firmware, so that the
>> HEST table is correctly checksummed after executing the N ADD_POINTER
>> commands from (4).
>>
>> (6) QEMU generates N ADD_POINTER commands for the firmware, pointing the
>> address registers (located in guest memory, in the downloaded
>> "etc/hardware_errors" blob) to the respective Error Status Data Blocks.
>>
>> (7) (This is the trick.) For this step, we need a third, write-only
>> fw_cfg blob, called "etc/hardware_errors_addr". Through that blob, the
>> firmware can send back the guest-side allocation addresses to QEMU.
>>
>> Namely, the "etc/hardware_errors_addr" blob contains N 8-byte entries.
>> QEMU generates N WRITE_POINTER commands for the firmware.
>>
>> For error source K (0 <= K < N), QEMU instructs the firmware to
>> calculate the guest address of Error Status Data Block K, from the
>> QEMU-dictated offset within "etc/hardware_errors", and from the
>> guest-determined allocation base address for "etc/hardware_errors". The
>> firmware then writes the calculated address back to fw_cfg file
>> "etc/hardware_errors_addr", at offset K*8, according to the
>> WRITE_POINTER command.
>>
>> This way QEMU will know the GPA of each Error Status Data Block.
>>
>> (In fact this can be simplified to a single WRITE_POINTER command: the
>> address of the "address register table" can be sent back to QEMU as
>> well, which already contains all Error Status Data Block addresses.)
>>
>> (8) When QEMU gets SIGBUS from the kernel -- I hope that's going to come
>> through a signalfd -- QEMU can format the CPER right into guest memory,
>> and then inject whatever interrupt (or assert whatever GPIO line) is
>> necessary for notifying the guest.
>>
>> (9) This notification (in virtual hardware) can either be handled by the
>> guest kernel stand-alone, or else the guest kernel can invoke an ACPI
>> event handler method with it (which would be in the DSDT or one of the
>> SSDTs, also generated by QEMU). The ACPI event handler method could
>> invoke the specific guest kernel driver for errror handling via a
>> Notify() operation.
>>
>> I'm attracted to the above design because:
>> - it would leave the firmware alone after OS boot, and
>> - it would leave the firmware blissfully ignorant about HEST, GHES,
>> CPER, and the like. (That's why QEMU's ACPI linker/loader was invented
>> in the first place.)
>>
>> Thanks
>> Laszlo
>>
>>>>    Do you think which modules generates the APEI table is better? UEFI or Qemu?
>>>>
>>>>
>>>>
>>>>
>>>> On 2017/3/28 21:40, James Morse wrote:
>>>>> Hi gengdongjiu,
>>>>>
>>>>> On 28/03/17 13:16, gengdongjiu wrote:
>>>>>> On 2017/3/28 19:54, Achin Gupta wrote:
>>>>>>> On Tue, Mar 28, 2017 at 01:23:28PM +0200, Christoffer Dall wrote:
>>>>>>>> On Tue, Mar 28, 2017 at 11:48:08AM +0100, James Morse wrote:
>>>>>>>>> On the host, part of UEFI is involved to generate the CPER records.
>>>>>>>>> In a guest?, I don't know.
>>>>>>>>> Qemu could generate the records, or drive some other component to do it.
>>>>>>>>
>>>>>>>> I think I am beginning to understand this a bit.  Since the guet UEFI
>>>>>>>> instance is specifically built for the machine it runs on, QEMU's virt
>>>>>>>> machine in this case, they could simply agree (by some contract) to
>>>>>>>> place the records at some specific location in memory, and if the guest
>>>>>>>> kernel asks its guest UEFI for that location, things should just work by
>>>>>>>> having logic in QEMU to process error reports and populate guest memory.
>>>>>>>>
>>>>>>>> Is this how others see the world too?
>>>>>>>
>>>>>>> I think so!
>>>>>>>
>>>>>>> AFAIU, the memory where CPERs will reside should be specified in a GHES entry in
>>>>>>> the HEST. Is this not the case with a guest kernel i.e. the guest UEFI creates a
>>>>>>> HEST for the guest Kernel?
>>>>>>>
>>>>>>> If so, then the question is how the guest UEFI finds out where QEMU (acting as
>>>>>>> EL3 firmware) will populate the CPERs. This could either be a contract between
>>>>>>> the two or a guest DXE driver uses the MM_COMMUNICATE call (see [1]) to ask QEMU
>>>>>>> where the memory is.
>>>>>>
>>>>>> whether invoke the guest UEFI will be complex? not see the advantage. it seems x86 Qemu
>>>>>> directly generate the ACPI table, but I am not sure, we are checking the qemu
>>>>> logical.
>>>>>> let Qemu generate CPER record may be clear.
>>>>>
>>>>> At boot UEFI in the guest will need to make sure the areas of memory that may be
>>>>> used for CPER records are reserved. Whether UEFI or Qemu decides where these are
>>>>> needs deciding, (but probably not here)...
>>>>>
>>>>> At runtime, when an error has occurred, I agree it would be simpler (fewer
>>>>> components involved) if Qemu generates the CPER records. But if UEFI made the
>>>>> memory choice above they need to interact and it gets complicated again. The
>>>>> CPER records are defined in the UEFI spec, so I would expect UEFI to contain
>>>>> code to generate/parse them.
>>>>>
>>>>>
>>>>> Thanks,
>>>>>
>>>>> James
>>>>>
>>>>>
>>>>> .
>>>>>
>>>>
>>
>>
>> .
>>
>

Dongjiu Geng April 7, 2017, 2:52 a.m. UTC | #41

Hi Laszlo,
  thanks.

On 2017/4/7 2:55, Laszlo Ersek wrote:
> On 04/06/17 14:35, gengdongjiu wrote:
>> Dear, Laszlo
>>    Thanks for your detailed explanation.
>>
>> On 2017/3/29 19:58, Laszlo Ersek wrote:
>>> (This ought to be one of the longest address lists I've ever seen :)
>>> Thanks for the CC. I'm glad Shannon is already on the CC list. For good
>>> measure, I'm adding MST and Igor.)
>>>
>>> On 03/29/17 12:36, Achin Gupta wrote:
>>>> Hi gengdongjiu,
>>>>
>>>> On Wed, Mar 29, 2017 at 05:36:37PM +0800, gengdongjiu wrote:
>>>>>
>>>>> Hi Laszlo/Biesheuvel/Qemu developer,
>>>>>
>>>>>    Now I encounter a issue and want to consult with you in ARM64 platform， as described below:
>>>>>
>>>>> when guest OS happen synchronous or asynchronous abort, kvm needs
>>>>> to send the error address to Qemu or UEFI through sigbus to
>>>>> dynamically generate APEI table. from my investigation, there are
>>>>> two ways:
>>>>>
>>>>> (1) Qemu get the error address, and generate the APEI table, then
>>>>> notify UEFI to know this generation, then inject abort error to
>>>>> guest OS, guest OS read the APEI table.
>>>>> (2) Qemu get the error address, and let UEFI to generate the APEI
>>>>> table, then inject abort error to guest OS, guest OS read the APEI
>>>>> table.
>>>>
>>>> Just being pedantic! I don't think we are talking about creating the APEI table
>>>> dynamically here. The issue is: Once KVM has received an error that is destined
>>>> for a guest it will raise a SIGBUS to Qemu. Now before Qemu can inject the error
>>>> into the guest OS, a CPER (Common Platform Error Record) has to be generated
>>>> corresponding to the error source (GHES corresponding to memory subsystem,
>>>> processor etc) to allow the guest OS to do anything meaningful with the
>>>> error. So who should create the CPER is the question.
>>>>
>>>> At the EL3/EL2 interface (Secure Firmware and OS/Hypervisor), an error arrives
>>>> at EL3 and secure firmware (at EL3 or a lower secure exception level) is
>>>> responsible for creating the CPER. ARM is experimenting with using a Standalone
>>>> MM EDK2 image in the secure world to do the CPER creation. This will avoid
>>>> adding the same code in ARM TF in EL3 (better for security). The error will then
>>>> be injected into the OS/Hypervisor (through SEA/SEI/SDEI) through ARM Trusted
>>>> Firmware.
>>>>
>>>> Qemu is essentially fulfilling the role of secure firmware at the EL2/EL1
>>>> interface (as discussed with Christoffer below). So it should generate the CPER
>>>> before injecting the error.
>>>>
>>>> This is corresponds to (1) above apart from notifying UEFI (I am assuming you
>>>> mean guest UEFI). At this time, the guest OS already knows where to pick up the
>>>> CPER from through the HEST. Qemu has to create the CPER and populate its address
>>>> at the address exported in the HEST. Guest UEFI should not be involved in this
>>>> flow. Its job was to create the HEST at boot and that has been done by this
>>>> stage.
>>>>
>>>> Qemu folk will be able to add but it looks like support for CPER generation will
>>>> need to be added to Qemu. We need to resolve this.
>>>>
>>>> Do shout if I am missing anything above.
>>>
>>> After reading this email, the use case looks *very* similar to what
>>> we've just done with VMGENID for QEMU 2.9.
>>>
>>> We have a facility between QEMU and the guest firmware, called "ACPI
>>> linker/loader", with which QEMU instructs the firmware to
>>>
>>> - allocate and download blobs into guest RAM (AcpiNVS type memory) --
>>> ALLOCATE command,
>>>
>>> - relocate pointers in those blobs, to fields in other (or the same)
>>> blobs -- ADD_POINTER command,
>>>
>>> - set ACPI table checksums -- ADD_CHECKSUM command,
>>>
>>> - and send GPAs of fields within such blobs back to QEMU --
>>> WRITE_POINTER command.
>>>
>>> This is how I imagine we can map the facility to the current use case
>>> (note that this is the first time I read about HEST / GHES / CPER):
>>>
>>>     etc/acpi/tables                 etc/hardware_errors
>>>     ================     ==========================================
>>>                          +-----------+
>>>     +--------------+     | address   |         +-> +--------------+
>>>     |    HEST      +     | registers |         |   | Error Status |
>>>     + +------------+     | +---------+         |   | Data Block 1 |
>>>     | | GHES       | --> | | address | --------+   | +------------+
>>>     | | GHES       | --> | | address | ------+     | |  CPER      |
>>>     | | GHES       | --> | | address | ----+ |     | |  CPER      |
>>>     | | GHES       | --> | | address | -+  | |     | |  CPER      |
>>>     +-+------------+     +-+---------+  |  | |     +-+------------+
>>>                                         |  | |
>>>                                         |  | +---> +--------------+
>>>                                         |  |       | Error Status |
>>>                                         |  |       | Data Block 2 |
>>>                                         |  |       | +------------+
>>>                                         |  |       | |  CPER      |
>>>                                         |  |       | |  CPER      |
>>>                                         |  |       +-+------------+
>>>                                         |  |
>>>                                         |  +-----> +--------------+
>>>                                         |          | Error Status |
>>>                                         |          | Data Block 3 |
>>>                                         |          | +------------+
>>>                                         |          | |  CPER      |
>>>                                         |          +-+------------+
>>>                                         |
>>>                                         +--------> +--------------+
>>>                                                    | Error Status |
>>>                                                    | Data Block 4 |
>>>                                                    | +------------+
>>>                                                    | |  CPER      |
>>>                                                    | |  CPER      |
>>>                                                    | |  CPER      |
>>>                                                    +-+------------+
>>>
>>> (1) QEMU generates the HEST ACPI table. This table goes in the current
>>> "etc/acpi/tables" fw_cfg blob. Given N error sources, there will be N
>>> GHES objects in the HEST.
>>>
>>> (2) We introduce a new fw_cfg blob called "etc/hardware_errors". QEMU
>>> also populates this blob.
>>>
>>> (2a) Given N error sources, the (unnamed) table of address registers
>>> will contain N address registers.
>>>
>>> (2b) Given N error sources, the "etc/hardwre_errors" fw_cfg blob will
>>> also contain N Error Status Data Blocks.
>>>
>>> I don't know about the sizing (number of CPERs) each Error Status Data
>>> Block has to contain, but I understand it is all pre-allocated as far as
>>> the OS is concerned, which matches our capabilities well.
>> here I have a question. as you comment: " 'etc/hardwre_errors' fw_cfg blob will also contain N Error Status Data Blocks",
>> Because the CPER numbers is not fixed, how to assign each "Error Status Data Block" size using one "etc/hardwre_errors" fw_cfg blob.
>> when use one etc/hardwre_errors, will the N Error Status Data Block use one continuous buffer? as shown below. if so, maybe it not convenient for each data block size extension.
>> I see the bios_linker_loader_alloc will allocate one continuous buffer for a blob(such as VMGENID_GUID_FW_CFG_FILE)
>>
>>     /* Allocate guest memory for the Data fw_cfg blob */
>>     bios_linker_loader_alloc(linker, VMGENID_GUID_FW_CFG_FILE, guid, 4096,
>>                              false /* page boundary, high memory */);
>>
>>
>>
>> -> +--------------+
>>      |    HEST      +     | registers |             | Error Status |
>>      + +------------+     | +---------+             | Data Block  |
>>      | | GHES       | --> | | address | --------+-->| +------------+
>>      | | GHES       | --> | | address | ------+     | |  CPER      |
>>      | | GHES       | --> | | address | ----+ |     | |  CPER      |
>>      | | GHES       | --> | | address | -+  | |     | |  CPER      |
>>      +-+------------+     +-+---------+  |  | +---> +--------------+
>>                                          |  |       | |  CPER      |
>>                                          |  |       | |  CPER      |
>>                                          |  +-----> +--------------+
>>                                          |          | |  CPER      |
>>                                          +--------> +--------------+
>>                                                     | |  CPER      |
>>                                                     | |  CPER      |
>>                                                     | |  CPER      |
>>                                                     +-+------------+
>>
>>
>>
>> so how about we use separate etc/hardwre_errorsN for each Error Status status Block? then
>>
>> etc/hardwre_errors0
>> etc/hardwre_errors1
>> ...................
>> etc/hardwre_errors10
>> (the max N is 10)
>>
>>
>> the N can be one of below values, according to ACPI spec "Table 18-345 Hardware Error Notification Structure"
>> 0 – Polled
>> 1 – External Interrupt
>> 2 – Local Interrupt
>> 3 – SCI
>> 4 – NMI
>> 5 - CMCI
>> 6 - MCE
>> 7 - GPIO-Signal
>> 8 - ARMv8 SEA
>> 9 - ARMv8 SEI
>> 10 - External Interrupt - GSIV
> 
> I'm unsure if, by "not fixed", you are saying
> 
>   the number of CPER entries that fits in Error Status Data Block N is
>   not *uniform* across 0 <= N <= 10 [1]
> 
> or
> 
>   the number of CPER entries that fits in Error Status Data Block N is
>   not *known* in advance, for all of 0 <= N <= 10 [2]
> 
> Which one is your point?
> 
> If [1], that's no problem; you can simply sum the individual error
> status data block sizes in advance, and allocate "etc/hardware_errors"
> accordingly, using the total size.
> 
> (Allocating one shared fw_cfg blob for all status data blocks is more
> memory efficient, as each ALLOCATE command will allocate whole pages
> (rounded up from the actual blob size).)
> 
> If your point is [2], then splitting the error status data blocks to
> separate fw_cfg blobs makes no difference: regardless of whether we try
> to place all the error status data blocks in a single fw_cfg blob, or in
> separate fw_cfg blobs, the individual data block cannot be resized at OS
> runtime, so there's no way to make it work.
>
My Point is [2]. The HEST(Hardware Error Source Table) table format is here:
https://wiki.linaro.org/LEG/Engineering/Kernel/RAS/APEITables#Hardware_Error_Source_Table_.28HEST.29

Now I understand your thought.

> Thanks,
> Laszlo
> 
>>
>>
>>
>>
>>>
>>> (3) QEMU generates the ACPI linker/loader script for the firmware, as
>>> always.
>>>
>>> (3a) The HEST table is part of "etc/acpi/tables", which the firmware
>>> already allocates memory for, and downloads (because QEMU already
>>> generates an ALLOCATE linker/loader command for it already).
>>>
>>> (3b) QEMU will have to create another ALLOCATE command for the
>>> "etc/hardware_errors" blob. The firmware allocates memory for this blob,
>>> and downloads it.
>>>
>>> (4) QEMU generates, in the ACPI linker/loader script for the firwmare, N
>>> ADD_POINTER commands, which point the GHES."Error Status
>>> Address" fields in the HEST table, to the corresponding address
>>> registers in the downloaded "etc/hardware_errors" blob.
>>>
>>> (5) QEMU generates an ADD_CHECKSUM command for the firmware, so that the
>>> HEST table is correctly checksummed after executing the N ADD_POINTER
>>> commands from (4).
>>>
>>> (6) QEMU generates N ADD_POINTER commands for the firmware, pointing the
>>> address registers (located in guest memory, in the downloaded
>>> "etc/hardware_errors" blob) to the respective Error Status Data Blocks.
>>>
>>> (7) (This is the trick.) For this step, we need a third, write-only
>>> fw_cfg blob, called "etc/hardware_errors_addr". Through that blob, the
>>> firmware can send back the guest-side allocation addresses to QEMU.
>>>
>>> Namely, the "etc/hardware_errors_addr" blob contains N 8-byte entries.
>>> QEMU generates N WRITE_POINTER commands for the firmware.
>>>
>>> For error source K (0 <= K < N), QEMU instructs the firmware to
>>> calculate the guest address of Error Status Data Block K, from the
>>> QEMU-dictated offset within "etc/hardware_errors", and from the
>>> guest-determined allocation base address for "etc/hardware_errors". The
>>> firmware then writes the calculated address back to fw_cfg file
>>> "etc/hardware_errors_addr", at offset K*8, according to the
>>> WRITE_POINTER command.
>>>
>>> This way QEMU will know the GPA of each Error Status Data Block.
>>>
>>> (In fact this can be simplified to a single WRITE_POINTER command: the
>>> address of the "address register table" can be sent back to QEMU as
>>> well, which already contains all Error Status Data Block addresses.)
>>>
>>> (8) When QEMU gets SIGBUS from the kernel -- I hope that's going to come
>>> through a signalfd -- QEMU can format the CPER right into guest memory,
>>> and then inject whatever interrupt (or assert whatever GPIO line) is
>>> necessary for notifying the guest.
>>>
>>> (9) This notification (in virtual hardware) can either be handled by the
>>> guest kernel stand-alone, or else the guest kernel can invoke an ACPI
>>> event handler method with it (which would be in the DSDT or one of the
>>> SSDTs, also generated by QEMU). The ACPI event handler method could
>>> invoke the specific guest kernel driver for errror handling via a
>>> Notify() operation.
>>>
>>> I'm attracted to the above design because:
>>> - it would leave the firmware alone after OS boot, and
>>> - it would leave the firmware blissfully ignorant about HEST, GHES,
>>> CPER, and the like. (That's why QEMU's ACPI linker/loader was invented
>>> in the first place.)
>>>
>>> Thanks
>>> Laszlo
>>>
>>>>>    Do you think which modules generates the APEI table is better? UEFI or Qemu?
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On 2017/3/28 21:40, James Morse wrote:
>>>>>> Hi gengdongjiu,
>>>>>>
>>>>>> On 28/03/17 13:16, gengdongjiu wrote:
>>>>>>> On 2017/3/28 19:54, Achin Gupta wrote:
>>>>>>>> On Tue, Mar 28, 2017 at 01:23:28PM +0200, Christoffer Dall wrote:
>>>>>>>>> On Tue, Mar 28, 2017 at 11:48:08AM +0100, James Morse wrote:
>>>>>>>>>> On the host, part of UEFI is involved to generate the CPER records.
>>>>>>>>>> In a guest?, I don't know.
>>>>>>>>>> Qemu could generate the records, or drive some other component to do it.
>>>>>>>>>
>>>>>>>>> I think I am beginning to understand this a bit.  Since the guet UEFI
>>>>>>>>> instance is specifically built for the machine it runs on, QEMU's virt
>>>>>>>>> machine in this case, they could simply agree (by some contract) to
>>>>>>>>> place the records at some specific location in memory, and if the guest
>>>>>>>>> kernel asks its guest UEFI for that location, things should just work by
>>>>>>>>> having logic in QEMU to process error reports and populate guest memory.
>>>>>>>>>
>>>>>>>>> Is this how others see the world too?
>>>>>>>>
>>>>>>>> I think so!
>>>>>>>>
>>>>>>>> AFAIU, the memory where CPERs will reside should be specified in a GHES entry in
>>>>>>>> the HEST. Is this not the case with a guest kernel i.e. the guest UEFI creates a
>>>>>>>> HEST for the guest Kernel?
>>>>>>>>
>>>>>>>> If so, then the question is how the guest UEFI finds out where QEMU (acting as
>>>>>>>> EL3 firmware) will populate the CPERs. This could either be a contract between
>>>>>>>> the two or a guest DXE driver uses the MM_COMMUNICATE call (see [1]) to ask QEMU
>>>>>>>> where the memory is.
>>>>>>>
>>>>>>> whether invoke the guest UEFI will be complex? not see the advantage. it seems x86 Qemu
>>>>>>> directly generate the ACPI table, but I am not sure, we are checking the qemu
>>>>>> logical.
>>>>>>> let Qemu generate CPER record may be clear.
>>>>>>
>>>>>> At boot UEFI in the guest will need to make sure the areas of memory that may be
>>>>>> used for CPER records are reserved. Whether UEFI or Qemu decides where these are
>>>>>> needs deciding, (but probably not here)...
>>>>>>
>>>>>> At runtime, when an error has occurred, I agree it would be simpler (fewer
>>>>>> components involved) if Qemu generates the CPER records. But if UEFI made the
>>>>>> memory choice above they need to interact and it gets complicated again. The
>>>>>> CPER records are defined in the UEFI spec, so I would expect UEFI to contain
>>>>>> code to generate/parse them.
>>>>>>
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> James
>>>>>>
>>>>>>
>>>>>> .
>>>>>>
>>>>>
>>>
>>>
>>> .
>>>
>>
> 
> 
> .
>

Laszlo Ersek April 7, 2017, 9:21 a.m. UTC | #42

On 04/07/17 04:52, gengdongjiu wrote:
> 
> On 2017/4/7 2:55, Laszlo Ersek wrote:

>> I'm unsure if, by "not fixed", you are saying
>>
>>   the number of CPER entries that fits in Error Status Data Block N is
>>   not *uniform* across 0 <= N <= 10 [1]
>>
>> or
>>
>>   the number of CPER entries that fits in Error Status Data Block N is
>>   not *known* in advance, for all of 0 <= N <= 10 [2]
>>
>> Which one is your point?
>>
>> If [1], that's no problem; you can simply sum the individual error
>> status data block sizes in advance, and allocate "etc/hardware_errors"
>> accordingly, using the total size.
>>
>> (Allocating one shared fw_cfg blob for all status data blocks is more
>> memory efficient, as each ALLOCATE command will allocate whole pages
>> (rounded up from the actual blob size).)
>>
>> If your point is [2], then splitting the error status data blocks to
>> separate fw_cfg blobs makes no difference: regardless of whether we try
>> to place all the error status data blocks in a single fw_cfg blob, or in
>> separate fw_cfg blobs, the individual data block cannot be resized at OS
>> runtime, so there's no way to make it work.
>>
> My Point is [2]. The HEST(Hardware Error Source Table) table format is here:
> https://wiki.linaro.org/LEG/Engineering/Kernel/RAS/APEITables#Hardware_Error_Source_Table_.28HEST.29
> 
> Now I understand your thought.

But if you mean [2], then I am confused, with regard to firmware on
physical hardware. Namely, even on physical machines, the firmware has
to estimate, in advance, the area size that will be needed for CPERs,
doesn't it? And once the firmware allocates that memory area, it cannot
be resized at OS runtime. If there are more CPERs at runtime (due to
hardware errors) than the firmware allocated room for, they must surely
wrap around in the preallocated buffer (like in a ring buffer). Isn't
that correct?

On the diagrams that you linked above (great looking diagrams BTW!), I
see CPER in two places (it is helpfully shaded red):

- to the right of BERT; the CPER is part of a box that is captioned
"firmware reserved memory"

- to the right of HEST; again the CPER is part of a box that is
captioned "firmware reserved memory"

So, IMO, when QEMU has to guesstimate the room for CPERs in advance,
that doesn't differ from the physical firmware case. In QEMU maybe you
can let the user specify the area size on the command line, with a
machine type property or similar.

Thanks
Laszlo

Dongjiu Geng April 21, 2017, 1:27 p.m. UTC | #43

Hi all/Laszlo,

  sorry, I have a question to consult with you.


On 2017/4/7 2:55, Laszlo Ersek wrote:
> On 04/06/17 14:35, gengdongjiu wrote:
>> Dear, Laszlo
>>    Thanks for your detailed explanation.
>>
>> On 2017/3/29 19:58, Laszlo Ersek wrote:
>>> (This ought to be one of the longest address lists I've ever seen :)
>>> Thanks for the CC. I'm glad Shannon is already on the CC list. For good
>>> measure, I'm adding MST and Igor.)
>>>
>>> On 03/29/17 12:36, Achin Gupta wrote:
>>>> Hi gengdongjiu,
>>>>
>>>> On Wed, Mar 29, 2017 at 05:36:37PM +0800, gengdongjiu wrote:
>>>>>
>>>>> Hi Laszlo/Biesheuvel/Qemu developer,
>>>>>
>>>>>    Now I encounter a issue and want to consult with you in ARM64 platform， as described below:
>>>>>
>>>>> when guest OS happen synchronous or asynchronous abort, kvm needs
>>>>> to send the error address to Qemu or UEFI through sigbus to
>>>>> dynamically generate APEI table. from my investigation, there are
>>>>> two ways:
>>>>>
>>>>> (1) Qemu get the error address, and generate the APEI table, then
>>>>> notify UEFI to know this generation, then inject abort error to
>>>>> guest OS, guest OS read the APEI table.
>>>>> (2) Qemu get the error address, and let UEFI to generate the APEI
>>>>> table, then inject abort error to guest OS, guest OS read the APEI
>>>>> table.
>>>>
>>>> Just being pedantic! I don't think we are talking about creating the APEI table
>>>> dynamically here. The issue is: Once KVM has received an error that is destined
>>>> for a guest it will raise a SIGBUS to Qemu. Now before Qemu can inject the error
>>>> into the guest OS, a CPER (Common Platform Error Record) has to be generated
>>>> corresponding to the error source (GHES corresponding to memory subsystem,
>>>> processor etc) to allow the guest OS to do anything meaningful with the
>>>> error. So who should create the CPER is the question.
>>>>
>>>> At the EL3/EL2 interface (Secure Firmware and OS/Hypervisor), an error arrives
>>>> at EL3 and secure firmware (at EL3 or a lower secure exception level) is
>>>> responsible for creating the CPER. ARM is experimenting with using a Standalone
>>>> MM EDK2 image in the secure world to do the CPER creation. This will avoid
>>>> adding the same code in ARM TF in EL3 (better for security). The error will then
>>>> be injected into the OS/Hypervisor (through SEA/SEI/SDEI) through ARM Trusted
>>>> Firmware.
>>>>
>>>> Qemu is essentially fulfilling the role of secure firmware at the EL2/EL1
>>>> interface (as discussed with Christoffer below). So it should generate the CPER
>>>> before injecting the error.
>>>>
>>>> This is corresponds to (1) above apart from notifying UEFI (I am assuming you
>>>> mean guest UEFI). At this time, the guest OS already knows where to pick up the
>>>> CPER from through the HEST. Qemu has to create the CPER and populate its address
>>>> at the address exported in the HEST. Guest UEFI should not be involved in this
>>>> flow. Its job was to create the HEST at boot and that has been done by this
>>>> stage.
>>>>
>>>> Qemu folk will be able to add but it looks like support for CPER generation will
>>>> need to be added to Qemu. We need to resolve this.
>>>>
>>>> Do shout if I am missing anything above.
>>>
>>> After reading this email, the use case looks *very* similar to what
>>> we've just done with VMGENID for QEMU 2.9.
>>>
>>> We have a facility between QEMU and the guest firmware, called "ACPI
>>> linker/loader", with which QEMU instructs the firmware to
>>>
>>> - allocate and download blobs into guest RAM (AcpiNVS type memory) --
>>> ALLOCATE command,
>>>
>>> - relocate pointers in those blobs, to fields in other (or the same)
>>> blobs -- ADD_POINTER command,
>>>
>>> - set ACPI table checksums -- ADD_CHECKSUM command,
>>>
>>> - and send GPAs of fields within such blobs back to QEMU --
>>> WRITE_POINTER command.
>>>
>>> This is how I imagine we can map the facility to the current use case
>>> (note that this is the first time I read about HEST / GHES / CPER):

Laszlo lists a Qemu GHES table generation solution, Mainly use the four commands: "ALLOCATE/ADD_POINTER/ADD_CHECKSUM/WRITE_POINTER" to communicate with BIOS
so whether the four commands needs to be supported by the guest firware/UEFI.  I found the  "WRITE_POINTER" always failed. so I suspect guest UEFI/firmware not support the "WRITE_POINTER" command. please help me confirm it, thanks so much.

Laszlo Ersek April 24, 2017, 11:27 a.m. UTC | #44

On 04/21/17 15:27, gengdongjiu wrote:
> Hi all/Laszlo,
> 
>   sorry, I have a question to consult with you.
> 
> 
> On 2017/4/7 2:55, Laszlo Ersek wrote:
>> On 04/06/17 14:35, gengdongjiu wrote:
>>> Dear, Laszlo
>>>    Thanks for your detailed explanation.
>>>
>>> On 2017/3/29 19:58, Laszlo Ersek wrote:
>>>> (This ought to be one of the longest address lists I've ever seen :)
>>>> Thanks for the CC. I'm glad Shannon is already on the CC list. For good
>>>> measure, I'm adding MST and Igor.)
>>>>
>>>> On 03/29/17 12:36, Achin Gupta wrote:
>>>>> Hi gengdongjiu,
>>>>>
>>>>> On Wed, Mar 29, 2017 at 05:36:37PM +0800, gengdongjiu wrote:
>>>>>>
>>>>>> Hi Laszlo/Biesheuvel/Qemu developer,
>>>>>>
>>>>>>    Now I encounter a issue and want to consult with you in ARM64 platform， as described below:
>>>>>>
>>>>>> when guest OS happen synchronous or asynchronous abort, kvm needs
>>>>>> to send the error address to Qemu or UEFI through sigbus to
>>>>>> dynamically generate APEI table. from my investigation, there are
>>>>>> two ways:
>>>>>>
>>>>>> (1) Qemu get the error address, and generate the APEI table, then
>>>>>> notify UEFI to know this generation, then inject abort error to
>>>>>> guest OS, guest OS read the APEI table.
>>>>>> (2) Qemu get the error address, and let UEFI to generate the APEI
>>>>>> table, then inject abort error to guest OS, guest OS read the APEI
>>>>>> table.
>>>>>
>>>>> Just being pedantic! I don't think we are talking about creating the APEI table
>>>>> dynamically here. The issue is: Once KVM has received an error that is destined
>>>>> for a guest it will raise a SIGBUS to Qemu. Now before Qemu can inject the error
>>>>> into the guest OS, a CPER (Common Platform Error Record) has to be generated
>>>>> corresponding to the error source (GHES corresponding to memory subsystem,
>>>>> processor etc) to allow the guest OS to do anything meaningful with the
>>>>> error. So who should create the CPER is the question.
>>>>>
>>>>> At the EL3/EL2 interface (Secure Firmware and OS/Hypervisor), an error arrives
>>>>> at EL3 and secure firmware (at EL3 or a lower secure exception level) is
>>>>> responsible for creating the CPER. ARM is experimenting with using a Standalone
>>>>> MM EDK2 image in the secure world to do the CPER creation. This will avoid
>>>>> adding the same code in ARM TF in EL3 (better for security). The error will then
>>>>> be injected into the OS/Hypervisor (through SEA/SEI/SDEI) through ARM Trusted
>>>>> Firmware.
>>>>>
>>>>> Qemu is essentially fulfilling the role of secure firmware at the EL2/EL1
>>>>> interface (as discussed with Christoffer below). So it should generate the CPER
>>>>> before injecting the error.
>>>>>
>>>>> This is corresponds to (1) above apart from notifying UEFI (I am assuming you
>>>>> mean guest UEFI). At this time, the guest OS already knows where to pick up the
>>>>> CPER from through the HEST. Qemu has to create the CPER and populate its address
>>>>> at the address exported in the HEST. Guest UEFI should not be involved in this
>>>>> flow. Its job was to create the HEST at boot and that has been done by this
>>>>> stage.
>>>>>
>>>>> Qemu folk will be able to add but it looks like support for CPER generation will
>>>>> need to be added to Qemu. We need to resolve this.
>>>>>
>>>>> Do shout if I am missing anything above.
>>>>
>>>> After reading this email, the use case looks *very* similar to what
>>>> we've just done with VMGENID for QEMU 2.9.
>>>>
>>>> We have a facility between QEMU and the guest firmware, called "ACPI
>>>> linker/loader", with which QEMU instructs the firmware to
>>>>
>>>> - allocate and download blobs into guest RAM (AcpiNVS type memory) --
>>>> ALLOCATE command,
>>>>
>>>> - relocate pointers in those blobs, to fields in other (or the same)
>>>> blobs -- ADD_POINTER command,
>>>>
>>>> - set ACPI table checksums -- ADD_CHECKSUM command,
>>>>
>>>> - and send GPAs of fields within such blobs back to QEMU --
>>>> WRITE_POINTER command.
>>>>
>>>> This is how I imagine we can map the facility to the current use case
>>>> (note that this is the first time I read about HEST / GHES / CPER):
> 
> Laszlo lists a Qemu GHES table generation solution, Mainly use the
> four commands: "ALLOCATE/ADD_POINTER/ADD_CHECKSUM/WRITE_POINTER" to
> communicate with BIOS so whether the four commands needs to be
> supported by the guest firware/UEFI.  I found the  "WRITE_POINTER"
> always failed. so I suspect guest UEFI/firmware not support the
> "WRITE_POINTER" command. please help me confirm it, thanks so much.

That's incorrect, both OVMF and ArmVirtQemu support the WRITE_POINTER
command (see <https://bugzilla.tianocore.org/show_bug.cgi?id=359>.) A
number of OvmfPkg/ modules are included in ArmVirtPkg binaries as well.

In QEMU, the WRITE_POINTER command is currently generated for the
VMGENID device only. If you try to test VMGENID with qemu-system-aarch64
(for the purposes of WRITE_POINTER testing), that won't work, because
the VMGENID device is not available for aarch64. (The Microsoft spec
that describes the device lists Windows OS versions that are x86 only.)

In other words, no QEMU code exists at the moment that would allow you
to readily test WRITE_POINTER in aarch64 guests. However, the
firmware-side code is not architecture specific, and WRITE_POINTER
support is already being built into ArmVirtQemu.

Thanks,
Laszlo

kvm: pass the virtual SEI syndrome to guest OS

Commit Message

Comments

Patch