[v5,2/2] acpi: apei: Add SEI notification type support for ARMv8

Message ID 1508227341-15651-2-git-send-email-gengdongjiu@huawei.com (mailing list archive)
State New, archived

Commit Message

Dongjiu Geng Oct. 17, 2017, 8:02 a.m. UTC
ARMv8.2 requires implementation of the RAS extension, in
this extension it adds SEI(SError Interrupt) notification
type, this patch adds new GHES error source SEI handling
functions. Because this error source parsing and handling
methods are similar with the SEA. So share some SEA handling
functions with the SEI

Expose one API ghes_notify_abort() to external users. External
modules can call this exposed API to parse and handle the
SEA or SEI.

Note: For the SEI(SError Interrupt), it is asynchronous external
abort, the error address recorded by firmware may be not accurate.
If not accurate, EL3 firmware needs to identify the address to a
invalid value.

Cc: Borislav Petkov <bp@suse.de>
Cc: James Morse <james.morse@arm.com>
Signed-off-by: Dongjiu Geng <gengdongjiu@huawei.com>
Tested-by: Tyler Baicar <tbaicar@codeaurora.org>
Tested-by: Dongjiu Geng <gengdongjiu@huawei.com>
---
 arch/arm64/mm/fault.c     |  4 +--
 drivers/acpi/apei/Kconfig | 15 ++++++++++
 drivers/acpi/apei/ghes.c  | 71 ++++++++++++++++++++++++++++++++++-------------
 include/acpi/ghes.h       |  2 +-
 4 files changed, 70 insertions(+), 22 deletions(-)

Comments

Borislav Petkov Oct. 17, 2017, 5:06 p.m. UTC | #1
On Tue, Oct 17, 2017 at 04:02:21PM +0800, Dongjiu Geng wrote:
> ARMv8.2 requires implementation of the RAS extension, in
> this extension it adds SEI(SError Interrupt) notification
> type, this patch adds new GHES error source SEI handling
> functions. Because this error source parsing and handling
> methods are similar with the SEA. So share some SEA handling
> functions with the SEI
> 
> Expose one API ghes_notify_abort() to external users. External
> modules can call this exposed API to parse and handle the
> SEA or SEI.
> 
> Note: For the SEI(SError Interrupt), it is asynchronous external
> abort, the error address recorded by firmware may be not accurate.
> If not accurate, EL3 firmware needs to identify the address to a
> invalid value.
> 
> Cc: Borislav Petkov <bp@suse.de>
> Cc: James Morse <james.morse@arm.com>
> Signed-off-by: Dongjiu Geng <gengdongjiu@huawei.com>
> Tested-by: Tyler Baicar <tbaicar@codeaurora.org>
> Tested-by: Dongjiu Geng <gengdongjiu@huawei.com>
> ---
>  arch/arm64/mm/fault.c     |  4 +--
>  drivers/acpi/apei/Kconfig | 15 ++++++++++
>  drivers/acpi/apei/ghes.c  | 71 ++++++++++++++++++++++++++++++++++-------------
>  include/acpi/ghes.h       |  2 +-
>  4 files changed, 70 insertions(+), 22 deletions(-)
> 
> diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
> index 2509e4f..c98c1b3 100644
> --- a/arch/arm64/mm/fault.c
> +++ b/arch/arm64/mm/fault.c
> @@ -585,7 +585,7 @@ static int do_sea(unsigned long addr, unsigned int esr, struct pt_regs *regs)
>  		if (interrupts_enabled(regs))
>  			nmi_enter();
>  
> -		ret = ghes_notify_sea();
> +		ret = ghes_notify_abort(ACPI_HEST_NOTIFY_SEA);
>  
>  		if (interrupts_enabled(regs))
>  			nmi_exit();
> @@ -682,7 +682,7 @@ int handle_guest_sea(phys_addr_t addr, unsigned int esr)
>  	int ret = -ENOENT;
>  
>  	if (IS_ENABLED(CONFIG_ACPI_APEI_SEA))
> -		ret = ghes_notify_sea();
> +		ret = ghes_notify_abort(ACPI_HEST_NOTIFY_SEA);
>  
>  	return ret;
>  }
> diff --git a/drivers/acpi/apei/Kconfig b/drivers/acpi/apei/Kconfig
> index de14d49..47fcb0c 100644
> --- a/drivers/acpi/apei/Kconfig
> +++ b/drivers/acpi/apei/Kconfig
> @@ -54,6 +54,21 @@ config ACPI_APEI_SEA
>  	  option allows the OS to look for such hardware error record, and
>  	  take appropriate action.
>  
> +config ACPI_APEI_SEI
> +	bool "APEI Asynchronous SError Interrupt logging/recovering support"

What is "SError" ?

> +	depends on ARM64 && ACPI_APEI_GHES
> +	default y
> +	help
> +	  This option should be enabled if the system supports
> +	  firmware first handling of SEI (asynchronous SError interrupt).
> +
> +	  SEI happens with asynchronous external abort for errors on device
> +	  memory reads on ARMv8 systems. If a system supports firmware first
> +	  handling of SEI, the platform analyzes and handles hardware error
> +	  notifications from SEI, and it may then form a HW error record for
> +	  the OS to parse and handle. This option allows the OS to look for
> +	  such hardware error record, and take appropriate action.
> +
>  config ACPI_APEI_MEMORY_FAILURE
>  	bool "APEI memory error recovering support"
>  	depends on ACPI_APEI && MEMORY_FAILURE
> diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
> index 3eee30a..24b4233 100644
> --- a/drivers/acpi/apei/ghes.c
> +++ b/drivers/acpi/apei/ghes.c
> @@ -815,43 +815,67 @@ static struct notifier_block ghes_notifier_hed = {
>  
>  #ifdef CONFIG_ACPI_APEI_SEA
>  static LIST_HEAD(ghes_sea);
> +#endif
> +
> +#ifdef CONFIG_ACPI_APEI_SEI
> +static LIST_HEAD(ghes_sei);
> +#endif
>  
> +#if defined(CONFIG_ACPI_APEI_SEA) || defined(CONFIG_ACPI_APEI_SEI)
>  /*
> - * Return 0 only if one of the SEA error sources successfully reported an error
> - * record sent from the firmware.
> + * Return 0 only if one of the SEA or SEI error sources successfully
> + * reported an error record sent from the firmware.
>   */
> -int ghes_notify_sea(void)
> +int ghes_notify_abort(u8 type)

Adding "abort" everywhere makes it worse: what does this function now
do, notify or abort or both?

Ditto for the remaining ones. Please think of a better name.


ghes_notify_sei() sounds much better to me, for example. And then you
can add a whole set of *_sei() functions similar to the *_sea() ones and
then you don't have to do all that checking of the type but simply call
the proper function set. They're not huge so that the duplication of
code should be minimal.
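
A minimal user-space sketch of the layout suggested above, with one small entry point per notification type and a shared list walker, so that no caller has to pass a type. Everything here is a stand-in made up for illustration: struct ghes_model, ghes_proc_model() and the plain singly linked lists replace the kernel's struct ghes, ghes_proc() and RCU-protected lists. ghes_notify_sei() and ghes_sei_add() are the proposed names, not existing kernel functions.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for struct ghes: only what this sketch needs. */
struct ghes_model {
	int has_record;			/* non-zero if firmware left a record */
	struct ghes_model *next;	/* plain list link, not an RCU list */
};

/* Stand-in for ghes_proc(): returns 0 when a record was found and consumed. */
static int ghes_proc_model(struct ghes_model *ghes)
{
	if (!ghes->has_record)
		return -ENOENT;
	ghes->has_record = 0;
	return 0;
}

/* Shared walker: returns 0 only if at least one source reported a record. */
static int ghes_notify_list(struct ghes_model *head)
{
	struct ghes_model *ghes;
	int ret = -ENOENT;

	for (ghes = head; ghes; ghes = ghes->next) {
		if (!ghes_proc_model(ghes))
			ret = 0;
	}
	return ret;
}

/* One list per notification type, each with its own small function set. */
static struct ghes_model *ghes_sea_sources;
static struct ghes_model *ghes_sei_sources;

int ghes_notify_sea(void)
{
	return ghes_notify_list(ghes_sea_sources);
}

int ghes_notify_sei(void)
{
	return ghes_notify_list(ghes_sei_sources);
}

void ghes_sea_add(struct ghes_model *ghes)
{
	assert(ghes != NULL);
	ghes->next = ghes_sea_sources;
	ghes_sea_sources = ghes;
}

void ghes_sei_add(struct ghes_model *ghes)
{
	assert(ghes != NULL);
	ghes->next = ghes_sei_sources;
	ghes_sei_sources = ghes;
}
```

The duplicated part is limited to the thin wrappers. In real kernel code each set would presumably sit under its own CONFIG_ACPI_APEI_SEA or CONFIG_ACPI_APEI_SEI guard, which is an assumption about how a later version might be structured.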
Dongjiu Geng Oct. 18, 2017, 5 a.m. UTC | #2
Hi Borislav,

On 2017/10/18 1:06, Borislav Petkov wrote:
> On Tue, Oct 17, 2017 at 04:02:21PM +0800, Dongjiu Geng wrote:
>> ARMv8.2 requires implementation of the RAS extension, in
>> this extension it adds SEI(SError Interrupt) notification
>> type, this patch adds new GHES error source SEI handling
>> functions. Because this error source parsing and handling
>> methods are similar with the SEA. So share some SEA handling
>> functions with the SEI
>>
>> Expose one API ghes_notify_abort() to external users. External
>> modules can call this exposed API to parse and handle the
>> SEA or SEI.
>>
>> Note: For the SEI(SError Interrupt), it is asynchronous external
>> abort, the error address recorded by firmware may be not accurate.
>> If not accurate, EL3 firmware needs to identify the address to a
>> invalid value.
>>
>> Cc: Borislav Petkov <bp@suse.de>
>> Cc: James Morse <james.morse@arm.com>
>> Signed-off-by: Dongjiu Geng <gengdongjiu@huawei.com>
>> Tested-by: Tyler Baicar <tbaicar@codeaurora.org>
>> Tested-by: Dongjiu Geng <gengdongjiu@huawei.com>
>> ---
>>  arch/arm64/mm/fault.c     |  4 +--
>>  drivers/acpi/apei/Kconfig | 15 ++++++++++
>>  drivers/acpi/apei/ghes.c  | 71 ++++++++++++++++++++++++++++++++++-------------
>>  include/acpi/ghes.h       |  2 +-
>>  4 files changed, 70 insertions(+), 22 deletions(-)
>>
>> diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
>> index 2509e4f..c98c1b3 100644
>> --- a/arch/arm64/mm/fault.c
>> +++ b/arch/arm64/mm/fault.c
>> @@ -585,7 +585,7 @@ static int do_sea(unsigned long addr, unsigned int esr, struct pt_regs *regs)
>>  		if (interrupts_enabled(regs))
>>  			nmi_enter();
>>  
>> -		ret = ghes_notify_sea();
>> +		ret = ghes_notify_abort(ACPI_HEST_NOTIFY_SEA);
>>  
>>  		if (interrupts_enabled(regs))
>>  			nmi_exit();
>> @@ -682,7 +682,7 @@ int handle_guest_sea(phys_addr_t addr, unsigned int esr)
>>  	int ret = -ENOENT;
>>  
>>  	if (IS_ENABLED(CONFIG_ACPI_APEI_SEA))
>> -		ret = ghes_notify_sea();
>> +		ret = ghes_notify_abort(ACPI_HEST_NOTIFY_SEA);
>>  
>>  	return ret;
>>  }
>> diff --git a/drivers/acpi/apei/Kconfig b/drivers/acpi/apei/Kconfig
>> index de14d49..47fcb0c 100644
>> --- a/drivers/acpi/apei/Kconfig
>> +++ b/drivers/acpi/apei/Kconfig
>> @@ -54,6 +54,21 @@ config ACPI_APEI_SEA
>>  	  option allows the OS to look for such hardware error record, and
>>  	  take appropriate action.
>>  
>> +config ACPI_APEI_SEI
>> +	bool "APEI Asynchronous SError Interrupt logging/recovering support"
> 
> What is "SError" ?
SError is System Error, which is an asynchronous exception inside the CPU.

In the ARM RAS Extension, there are mainly two types of abort for the CPU:
SEA (Synchronous External Abort)
SEI (SError Interrupt)

> 
>> +	depends on ARM64 && ACPI_APEI_GHES
>> +	default y
>> +	help
>> +	  This option should be enabled if the system supports
>> +	  firmware first handling of SEI (asynchronous SError interrupt).
>> +
>> +	  SEI happens with asynchronous external abort for errors on device
>> +	  memory reads on ARMv8 systems. If a system supports firmware first
>> +	  handling of SEI, the platform analyzes and handles hardware error
>> +	  notifications from SEI, and it may then form a HW error record for
>> +	  the OS to parse and handle. This option allows the OS to look for
>> +	  such hardware error record, and take appropriate action.
>> +
>>  config ACPI_APEI_MEMORY_FAILURE
>>  	bool "APEI memory error recovering support"
>>  	depends on ACPI_APEI && MEMORY_FAILURE
>> diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
>> index 3eee30a..24b4233 100644
>> --- a/drivers/acpi/apei/ghes.c
>> +++ b/drivers/acpi/apei/ghes.c
>> @@ -815,43 +815,67 @@ static struct notifier_block ghes_notifier_hed = {
>>  
>>  #ifdef CONFIG_ACPI_APEI_SEA
>>  static LIST_HEAD(ghes_sea);
>> +#endif
>> +
>> +#ifdef CONFIG_ACPI_APEI_SEI
>> +static LIST_HEAD(ghes_sei);
>> +#endif
>>  
>> +#if defined(CONFIG_ACPI_APEI_SEA) || defined(CONFIG_ACPI_APEI_SEI)
>>  /*
>> - * Return 0 only if one of the SEA error sources successfully reported an error
>> - * record sent from the firmware.
>> + * Return 0 only if one of the SEA or SEI error sources successfully
>> + * reported an error record sent from the firmware.
>>   */
>> -int ghes_notify_sea(void)
>> +int ghes_notify_abort(u8 type)
> 
> Adding "abort" everywhere makes it worse: what does this function now
> do, notify or abort or both?
This function is used to notify the APEI driver to parse the APEI table and do some
recovery, such as calling memory_failure() to mark the address as poisoned
memory and deliver SIGBUS to the related application.

> 
> Ditto for the remaining ones. Please think of a better name.
OK, thanks for your good suggestion. I will consider using a better name.

> 
> 
> ghes_notify_sei() sounds much better to me, for example. And then you
> can add a whole set of *_sei() functions similar to the *_sea() ones and
> then you don't have to do all that checking of the type but simply call
> the proper function set. They're not huge so that the duplication of
> code should be minimal.

Thanks for your suggestion again. I will follow that and add a whole set
of *_sei() functions.

>
Borislav Petkov Oct. 18, 2017, 9:06 a.m. UTC | #3
On Wed, Oct 18, 2017 at 01:00:44PM +0800, gengdongjiu wrote:
> SError is System Error, which is a asynchronous exception in the Internal CPU.
> 
> In the ARM RAS Extension, there are mainly two type abort for CPU:
> SEA(Synchronous External Abort)
> SEI(SError Interrupt)

And you're not writing it out as "System Error" because ...?
Dongjiu Geng Oct. 18, 2017, 9:17 a.m. UTC | #4
On 2017/10/18 17:06, Borislav Petkov wrote:
> On Wed, Oct 18, 2017 at 01:00:44PM +0800, gengdongjiu wrote:
>> SError is System Error, which is a asynchronous exception in the Internal CPU.
>>
>> In the ARM RAS Extension, there are mainly two type abort for CPU:
>> SEA(Synchronous External Abort)
>> SEI(SError Interrupt)
> And you're not writing it out as "System Error" because ...?

Thanks Borislav, can I write it as asynchronous exception or asynchronous abort?
Borislav Petkov Oct. 18, 2017, 9:25 a.m. UTC | #5
On Wed, Oct 18, 2017 at 05:17:27PM +0800, gengdongjiu wrote:
> Thanks Borislav, can I write it as asynchronous exception or
> asynchronous abort?

WTF?!

The thing is abbreviated as "SEI" and apparently means "System Error
Interrupt". Nothing else.
James Morse Oct. 18, 2017, 9:44 a.m. UTC | #6
Hi Borislav!

On 18/10/17 10:25, Borislav Petkov wrote:
> On Wed, Oct 18, 2017 at 05:17:27PM +0800, gengdongjiu wrote:
>> Thanks Borislav, can I write it as asynchronous exception or
>> asynchronous abort?
> 
> WTF?!

Yup.


> The thing is abbreviated as "SEI" and apparently means "System Error
> Interrupt". Nothing else.

ARM has 'external abort', which are either synchronous or asynchronous, both are
delivered as different types of exception.

Asynchronous external abort is treated as a special kind of interrupt, 'SError
Interrupt', (where SError stands for System Error, but it's rarely written like
that). 'SEI' is a relatively new abbreviation for SError interrupt.


What should we call this thing? In the ACPI code I'd prefer 'SEI' as that is
what the ACPI spec calls it. Here we are talking about a GHES notification.

But in the arm64 arch code this should be called SError Interrupt as this is
what the ARM-ARM calls it. This code cares about exception routing and interrupt
masking.


But, I don't really care.


Thanks,

James
Borislav Petkov Oct. 18, 2017, 10:04 a.m. UTC | #7
On Wed, Oct 18, 2017 at 10:44:48AM +0100, James Morse wrote:
> What should we call this thing?

My only pet peeve is having abbreviations everywhere and nothing
explaining them.

So whatever you guys decide upon and as long as there's an explanation
what those things mean and you stick with that name, is perfectly fine
with me.

Thx.
Dongjiu Geng Oct. 18, 2017, 10:21 a.m. UTC | #8
On 2017/10/18 17:44, James Morse wrote:
>> The thing is abbreviated as "SEI" and apparently means "System Error
>> Interrupt". Nothing else.
> ARM has 'external abort', which are either synchronous or asynchronous, both are
> delivered as different types of exception.
> 
> Asynchronous external abort is treated as a special kind of interrupt, 'SError
> Interrupt', (where SError stands for System Error, but its rarely written like
> that). 'SEI' is a relatively new abbreviation for SError interrupt.
> 
> 
> What should we call this thing? In the ACPI code I'd prefer 'SEI' as that is
> what the ACPI spec calls it. Here we are talking about an GHES notification.
> 
> But in the arm64 arch code this should be called SError Interrupt as this is
> what the ARM-ARM calls it. This code cares about exception routing and interrupt
> masking.
> 
> 
> But, I don't really care.

Thanks very much for James's clear explanation.
I agree with James.

In the ACPI spec, SEI is usually written out as SError Interrupt; SError is rarely expanded to System Error.
Anyway, I will explain the abbreviations clearly in the next version of the patch.
Dongjiu Geng Oct. 18, 2017, 10:25 a.m. UTC | #9
On 2017/10/18 18:04, Borislav Petkov wrote:
> On Wed, Oct 18, 2017 at 10:44:48AM +0100, James Morse wrote:
>> What should we call this thing?
> My only pet peeve is having abbreviations everywhere and nothing
> explaining them.
> 
> So whatever you guys decide upon and as long as there's an explanation
> what those things mean and you stick with that name, is perfectly fine
> with me.

Sure, we will. Thanks for the reminder, Borislav.
James Morse Oct. 18, 2017, 10:26 a.m. UTC | #10
Hi Dongjiu Geng,

On 17/10/17 09:02, Dongjiu Geng wrote:
> ARMv8.2 requires implementation of the RAS extension, in
> this extension it adds SEI(SError Interrupt) notification
> type, this patch adds new GHES error source SEI handling
> functions.

This paragraph is merging two things that aren't related.
The 'ARM v8.2 architecture extensions' have some RAS bits, which if your CPU
implements v8.2 are required.

ACPIv6.1 added NOTIFY_SEI as a notification type for ARMv8 systems.

This patch adds a GHES function for NOTIFY_SEI. Please leave the CPU RAS
extensions out of it.


> Because this error source parsing and handling
> methods are similar with the SEA. So share some SEA handling
> functions with the SEI
> 
> Expose one API ghes_notify_abort() to external users. External
> modules can call this exposed API to parse and handle the
> SEA or SEI.

This series doesn't add a caller/user for this new API, so why do we need to do
this now?

(I still haven't had a usable answer for 'what does your firmware do when SError
is masked', but I'll go beat that drum on the other thread).


More important for the APEI code is: How do SEA and SEI interact?

As far as I can see they can both interrupt each other, which isn't something
the single in_nmi() path in APEI can handle. I think we should fix this first.
(I'll try and polish my RFC that had a stab at that...)


SEA gets away with a lot of things because it's synchronous. SEI isn't. Xie XiuQi
pointed to the memory_failure_queue() code. We can use this directly from SEA,
but not SEI. (What happens if an SError arrives while we are queueing
memory_failure work from an IRQ?)

The one that scares me is the trace-point reporting stuff. What happens if an
SError arrives while we are enabling a trace point? (these are static-keys right?)


I don't think we can just plumb SEI in like this and be done with it.
(I'm looking at teasing out the estatus cache code from being x86:NMI only. This
way we solve the same 'can't do this from NMI context' problem with the same code.)


Thanks,

James



boring nits below:

> Note: For the SEI(SError Interrupt), it is asynchronous external
> abort, the error address recorded by firmware may be not accurate.
> If not accurate, EL3 firmware needs to identify the address to a
> invalid value.

This paragraph keeps cropping up. Who expects an address with an SError?
We don't get one for IRQs, but that never needs stating.


> Cc: Borislav Petkov <bp@suse.de>
> Cc: James Morse <james.morse@arm.com>
> Signed-off-by: Dongjiu Geng <gengdongjiu@huawei.com>
> Tested-by: Tyler Baicar <tbaicar@codeaurora.org>

> Tested-by: Dongjiu Geng <gengdongjiu@huawei.com>

(It's expected you test your own code)



> diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
> index 2509e4f..c98c1b3 100644
> --- a/arch/arm64/mm/fault.c
> +++ b/arch/arm64/mm/fault.c
> @@ -585,7 +585,7 @@ static int do_sea(unsigned long addr, unsigned int esr, struct pt_regs *regs)
>  		if (interrupts_enabled(regs))
>  			nmi_enter();
>  
> -		ret = ghes_notify_sea();
> +		ret = ghes_notify_abort(ACPI_HEST_NOTIFY_SEA);
>  
>  		if (interrupts_enabled(regs))
>  			nmi_exit();
> @@ -682,7 +682,7 @@ int handle_guest_sea(phys_addr_t addr, unsigned int esr)
>  	int ret = -ENOENT;
>  
>  	if (IS_ENABLED(CONFIG_ACPI_APEI_SEA))
> -		ret = ghes_notify_sea();
> +		ret = ghes_notify_abort(ACPI_HEST_NOTIFY_SEA);
>  
>  	return ret;
>  }
> diff --git a/drivers/acpi/apei/Kconfig b/drivers/acpi/apei/Kconfig
> index de14d49..47fcb0c 100644
> --- a/drivers/acpi/apei/Kconfig
> +++ b/drivers/acpi/apei/Kconfig
> @@ -54,6 +54,21 @@ config ACPI_APEI_SEA
>  	  option allows the OS to look for such hardware error record, and
>  	  take appropriate action.
>  
> +config ACPI_APEI_SEI
> +	bool "APEI Asynchronous SError Interrupt logging/recovering support"
> +	depends on ARM64 && ACPI_APEI_GHES
> +	default y
> +	help
> +	  This option should be enabled if the system supports
> +	  firmware first handling of SEI (asynchronous SError interrupt).
> +
> +	  SEI happens with asynchronous external abort for errors on device
> +	  memory reads on ARMv8 systems. If a system supports firmware first
> +	  handling of SEI, the platform analyzes and handles hardware error
> +	  notifications from SEI, and it may then form a HW error record for
> +	  the OS to parse and handle. This option allows the OS to look for
> +	  such hardware error record, and take appropriate action.
> +
>  config ACPI_APEI_MEMORY_FAILURE
>  	bool "APEI memory error recovering support"
>  	depends on ACPI_APEI && MEMORY_FAILURE
> diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
> index 3eee30a..24b4233 100644
> --- a/drivers/acpi/apei/ghes.c
> +++ b/drivers/acpi/apei/ghes.c
> @@ -815,43 +815,67 @@ static struct notifier_block ghes_notifier_hed = {
>  
>  #ifdef CONFIG_ACPI_APEI_SEA
>  static LIST_HEAD(ghes_sea);
> +#endif
> +
> +#ifdef CONFIG_ACPI_APEI_SEI
> +static LIST_HEAD(ghes_sei);
> +#endif
>  
> +#if defined(CONFIG_ACPI_APEI_SEA) || defined(CONFIG_ACPI_APEI_SEI)
>  /*
> - * Return 0 only if one of the SEA error sources successfully reported an error
> - * record sent from the firmware.
> + * Return 0 only if one of the SEA or SEI error sources successfully
> + * reported an error record sent from the firmware.
>   */
> -int ghes_notify_sea(void)
> +int ghes_notify_abort(u8 type)
>  {
>  	struct ghes *ghes;
> +	struct list_head *head = NULL;
>  	int ret = -ENOENT;
>  
> -	rcu_read_lock();
> -	list_for_each_entry_rcu(ghes, &ghes_sea, list) {
> -		if (!ghes_proc(ghes))
> -			ret = 0;

> +	if (type == ACPI_HEST_NOTIFY_SEA)
> +		head = &ghes_sea;
> +	else if (type == ACPI_HEST_NOTIFY_SEI)
> +		head = &ghes_sei;

Surely if I only have one of CONFIG_ACPI_APEI_SE{A,I} this can't be compiled.
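
A small user-space sketch of the concern above. In the hunk as posted, ghes_sea and ghes_sei are each defined under their own #ifdef, while the function body sits under the combined "||" guard and names both, so a build with only one option set would reference an undeclared list head. One possible arrangement, shown below, puts every reference under the same guard as its definition. The names list_model, notify_type_model, ghes_abort_head() and check_sea_only_config() are hypothetical and exist only for this illustration; the CONFIG_ macro is defined by hand to model a SEA-only build.

```c
#include <assert.h>
#include <stddef.h>

/* Model a build where only SEA is configured; the SEI option stays undefined. */
#define CONFIG_ACPI_APEI_SEA 1

/* Stand-ins for struct list_head and the HEST notification type values. */
struct list_model {
	struct list_model *next;
};

enum notify_type_model {
	NOTIFY_SEA_MODEL,
	NOTIFY_SEI_MODEL,
};

#ifdef CONFIG_ACPI_APEI_SEA
static struct list_model ghes_sea;
#endif

#ifdef CONFIG_ACPI_APEI_SEI
static struct list_model ghes_sei;
#endif

/*
 * Pick the list for a notification type. Each branch is compiled only
 * when the list it refers to exists, so any combination of the two
 * options builds. Unsupported types yield NULL.
 */
struct list_model *ghes_abort_head(enum notify_type_model type)
{
#ifdef CONFIG_ACPI_APEI_SEA
	if (type == NOTIFY_SEA_MODEL)
		return &ghes_sea;
#endif
#ifdef CONFIG_ACPI_APEI_SEI
	if (type == NOTIFY_SEI_MODEL)
		return &ghes_sei;
#endif
	return NULL;
}

/* Usage check for the SEA-only configuration modelled above. */
void check_sea_only_config(void)
{
	assert(ghes_abort_head(NOTIFY_SEA_MODEL) != NULL);
	assert(ghes_abort_head(NOTIFY_SEI_MODEL) == NULL);
}
```

Dropping the two per-branch guards while keeping the list definitions guarded reproduces the problem: the compiler rejects &ghes_sei as an undeclared identifier. Separate *_sea() and *_sei() function sets avoid the issue without any lookup helper.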


> +
> +	if (head) {
> +		rcu_read_lock();
> +		list_for_each_entry_rcu(ghes, head, list) {
> +			if (!ghes_proc(ghes))
> +				ret = 0;
> +		}
> +		rcu_read_unlock();
>  	}
> -	rcu_read_unlock();
>  	return ret;
>  }
>  
> -static void ghes_sea_add(struct ghes *ghes)
> +static void ghes_abort_add(struct ghes *ghes)
>  {
> -	mutex_lock(&ghes_list_mutex);
> -	list_add_rcu(&ghes->list, &ghes_sea);
> -	mutex_unlock(&ghes_list_mutex);
> +	struct list_head *head = NULL;
> +	u8 notify_type = ghes->generic->notify.type;
> +

> +	if (notify_type == ACPI_HEST_NOTIFY_SEA)
> +		head = &ghes_sea;
> +	else if (notify_type == ACPI_HEST_NOTIFY_SEI)
> +		head = &ghes_sei;

And here.


> +
> +	if (head) {
> +		mutex_lock(&ghes_list_mutex);
> +		list_add_rcu(&ghes->list, head);
> +		mutex_unlock(&ghes_list_mutex);
> +	}
>  }
>  
> -static void ghes_sea_remove(struct ghes *ghes)
> +static void ghes_abort_remove(struct ghes *ghes)
>  {
>  	mutex_lock(&ghes_list_mutex);
>  	list_del_rcu(&ghes->list);
>  	mutex_unlock(&ghes_list_mutex);
>  	synchronize_rcu();
>  }
> -#else /* CONFIG_ACPI_APEI_SEA */
> -static inline void ghes_sea_add(struct ghes *ghes) { }
> -static inline void ghes_sea_remove(struct ghes *ghes) { }
> -#endif /* CONFIG_ACPI_APEI_SEA */
> +#else
> +static inline void ghes_abort_add(struct ghes *ghes) { }
> +static inline void ghes_abort_remove(struct ghes *ghes) { }
> +#endif
>  
>  #ifdef CONFIG_HAVE_ACPI_APEI_NMI
>  /*
> @@ -1084,6 +1108,13 @@ static int ghes_probe(struct platform_device *ghes_dev)
>  			goto err;
>  		}
>  		break;
> +	case ACPI_HEST_NOTIFY_SEI:
> +		if (!IS_ENABLED(CONFIG_ACPI_APEI_SEI)) {
> +			pr_warn(GHES_PFX "Generic hardware error source: %d notified via SEI is not supported!\n",
> +				generic->header.source_id);
> +		goto err;
> +	}
> +	break;
>  	case ACPI_HEST_NOTIFY_NMI:
>  		if (!IS_ENABLED(CONFIG_HAVE_ACPI_APEI_NMI)) {
>  			pr_warn(GHES_PFX "Generic hardware error source: %d notified via NMI interrupt is not supported!\n",
> @@ -1153,7 +1184,8 @@ static int ghes_probe(struct platform_device *ghes_dev)
>  		break;
>  
>  	case ACPI_HEST_NOTIFY_SEA:
> -		ghes_sea_add(ghes);
> +	case ACPI_HEST_NOTIFY_SEI:
> +		ghes_abort_add(ghes);
>  		break;
>  	case ACPI_HEST_NOTIFY_NMI:
>  		ghes_nmi_add(ghes);
> @@ -1206,7 +1238,8 @@ static int ghes_remove(struct platform_device *ghes_dev)
>  		break;
>  
>  	case ACPI_HEST_NOTIFY_SEA:
> -		ghes_sea_remove(ghes);
> +	case ACPI_HEST_NOTIFY_SEI:
> +		ghes_abort_remove(ghes);
>  		break;
>  	case ACPI_HEST_NOTIFY_NMI:
>  		ghes_nmi_remove(ghes);
> diff --git a/include/acpi/ghes.h b/include/acpi/ghes.h
> index 9061c5c..ec6f4ba 100644
> --- a/include/acpi/ghes.h
> +++ b/include/acpi/ghes.h
> @@ -118,6 +118,6 @@ static inline void *acpi_hest_get_next(struct acpi_hest_generic_data *gdata)
>  	     (void *)section - (void *)(estatus + 1) < estatus->data_length; \
>  	     section = acpi_hest_get_next(section))
>  
> -int ghes_notify_sea(void);
> +int ghes_notify_abort(u8 type);
>  
>  #endif /* GHES_H */
>
Dongjiu Geng Oct. 18, 2017, 11:39 a.m. UTC | #11
Hi James,

On 2017/10/18 18:26, James Morse wrote:
> Hi Dongjiu Geng,
> 
> On 17/10/17 09:02, Dongjiu Geng wrote:
>> ARMv8.2 requires implementation of the RAS extension, in
>> this extension it adds SEI(SError Interrupt) notification
>> type, this patch adds new GHES error source SEI handling
>> functions.
> 
> This paragraph is merging two things that aren't related.
> The 'ARM v8.2 architecture extensions' have some RAS bits, which if your CPU
> implements v8.2 are required.
> 
> ACPIv6.1 added NOTIFY_SEI as a notification type for ARMv8 systems.
> 
> This patch adds a GHES function for NOTIFY_SEI. Please leave the CPU RAS
> extensions out of it.
Ok, thanks

> 
> 
>> Because this error source parsing and handling
>> methods are similar with the SEA. So share some SEA handling
>> functions with the SEI
>>
>> Expose one API ghes_notify_abort() to external users. External
>> modules can call this exposed API to parse and handle the
>> SEA or SEI.
> 
> This series doesn't add a caller/user for this new API, so why do we need to do
> this now?
There is a caller; it is in another series (the RAS virtualization series) and is not included in this one.

As shown:

+int handle_guest_sei(unsigned int esr)
+{
+	int ret = -ENOENT;
+
+	if (IS_ENABLED(CONFIG_ACPI_APEI_SEI))
+		ret = ghes_notify_abort(ACPI_HEST_NOTIFY_SEI);
+
+	return ret;
+}

> 
> (I still haven't had a usable answer for 'what does your firmware do when SError
> is masked', but I'll go beat that drum on the other thread).
Sorry for my late response; I have been busy recently. I will answer your question in the other thread,
maybe tomorrow.


In short, regardless of whether the physical SError is masked or unmasked, firmware will jump to
the corresponding SEA/SEI exception vector entry. There is only one PSTATE.DAIF, which is shared by
all exception levels (EL1, EL2 and EL3).

> 
> 
> More important for the APEI code is: How do SEA and SEI interact?
> 
> As far as I can see they can both interrupt each other, which isn't something
> the single in_nmi() path in APEI can handle. I thinks we should fix this first.
> (I'll try and polish my RFC that had a stab at that...)
If you have a fix patch, please CC me. Thanks.

> 
> 
> SEA gets away with a lot of things because its synchronous. SEI isn't. Xie XiuQi
> pointed to the memory_failure_queue() code. We can use this directly from SEA,
> but not SEI. (what happens if an SError arrives while we are queueing
> memory_failure work from an IRQ).
Do you mean that an SError can interrupt memory_failure work queued from an IRQ?
memory_failure runs in process context, in a work queue, not in IRQ context.


> 
> The one that scares me is the trace-point reporting stuff. What happens if an
> SError arrives while we are enabling a trace point? (these are static-keys right?)
For the trace-point issue, maybe we can consider it in the next step.
I am not considering the trace-point issue for now.


> 
> 
> I don't think we can just plumb SEI in like this and be done with it.
> (I'm looking at teasing out the estatus cache code from being x86:NMI only. This
> way we solve the same 'cant do this from NMI context' with the same code'.)
> 
> 
> Thanks,
> 
> James
> 
> 
> 
> boring nits below:
> 
>> Note: For the SEI(SError Interrupt), it is asynchronous external
>> abort, the error address recorded by firmware may be not accurate.
>> If not accurate, EL3 firmware needs to identify the address to a
>> invalid value.
> 
> This paragraph keeps cropping up. Who expects an address with an SError?
> We don't get one for IRQs, but that never needs stating.
> 
> 
>> Cc: Borislav Petkov <bp@suse.de>
>> Cc: James Morse <james.morse@arm.com>
>> Signed-off-by: Dongjiu Geng <gengdongjiu@huawei.com>
>> Tested-by: Tyler Baicar <tbaicar@codeaurora.org>
> 
>> Tested-by: Dongjiu Geng <gengdongjiu@huawei.com>
> (It's expected you test your own code)

Ok

> 
> 
> 
>> diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
>> index 2509e4f..c98c1b3 100644
>> --- a/arch/arm64/mm/fault.c
>> +++ b/arch/arm64/mm/fault.c
>> @@ -585,7 +585,7 @@ static int do_sea(unsigned long addr, unsigned int esr, struct pt_regs *regs)
>>  		if (interrupts_enabled(regs))
>>  			nmi_enter();
>>  
>> -		ret = ghes_notify_sea();
>> +		ret = ghes_notify_abort(ACPI_HEST_NOTIFY_SEA);
>>  
>>  		if (interrupts_enabled(regs))
>>  			nmi_exit();
>> @@ -682,7 +682,7 @@ int handle_guest_sea(phys_addr_t addr, unsigned int esr)
>>  	int ret = -ENOENT;
>>  
>>  	if (IS_ENABLED(CONFIG_ACPI_APEI_SEA))
>> -		ret = ghes_notify_sea();
>> +		ret = ghes_notify_abort(ACPI_HEST_NOTIFY_SEA);
>>  
>>  	return ret;
>>  }
>> diff --git a/drivers/acpi/apei/Kconfig b/drivers/acpi/apei/Kconfig
>> index de14d49..47fcb0c 100644
>> --- a/drivers/acpi/apei/Kconfig
>> +++ b/drivers/acpi/apei/Kconfig
>> @@ -54,6 +54,21 @@ config ACPI_APEI_SEA
>>  	  option allows the OS to look for such hardware error record, and
>>  	  take appropriate action.
>>  
>> +config ACPI_APEI_SEI
>> +	bool "APEI Asynchronous SError Interrupt logging/recovering support"
>> +	depends on ARM64 && ACPI_APEI_GHES
>> +	default y
>> +	help
>> +	  This option should be enabled if the system supports
>> +	  firmware first handling of SEI (asynchronous SError interrupt).
>> +
>> +	  SEI happens with asynchronous external abort for errors on device
>> +	  memory reads on ARMv8 systems. If a system supports firmware first
>> +	  handling of SEI, the platform analyzes and handles hardware error
>> +	  notifications from SEI, and it may then form a HW error record for
>> +	  the OS to parse and handle. This option allows the OS to look for
>> +	  such hardware error record, and take appropriate action.
>> +
>>  config ACPI_APEI_MEMORY_FAILURE
>>  	bool "APEI memory error recovering support"
>>  	depends on ACPI_APEI && MEMORY_FAILURE
>> diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
>> index 3eee30a..24b4233 100644
>> --- a/drivers/acpi/apei/ghes.c
>> +++ b/drivers/acpi/apei/ghes.c
>> @@ -815,43 +815,67 @@ static struct notifier_block ghes_notifier_hed = {
>>  
>>  #ifdef CONFIG_ACPI_APEI_SEA
>>  static LIST_HEAD(ghes_sea);
>> +#endif
>> +
>> +#ifdef CONFIG_ACPI_APEI_SEI
>> +static LIST_HEAD(ghes_sei);
>> +#endif
>>  
>> +#if defined(CONFIG_ACPI_APEI_SEA) || defined(CONFIG_ACPI_APEI_SEI)
>>  /*
>> - * Return 0 only if one of the SEA error sources successfully reported an error
>> - * record sent from the firmware.
>> + * Return 0 only if one of the SEA or SEI error sources successfully
>> + * reported an error record sent from the firmware.
>>   */
>> -int ghes_notify_sea(void)
>> +int ghes_notify_abort(u8 type)
>>  {
>>  	struct ghes *ghes;
>> +	struct list_head *head = NULL;
>>  	int ret = -ENOENT;
>>  
>> -	rcu_read_lock();
>> -	list_for_each_entry_rcu(ghes, &ghes_sea, list) {
>> -		if (!ghes_proc(ghes))
>> -			ret = 0;
> 
>> +	if (type == ACPI_HEST_NOTIFY_SEA)
>> +		head = &ghes_sea;
>> +	else if (type == ACPI_HEST_NOTIFY_SEI)
>> +		head = &ghes_sei;
> 
> Surely if I only have one of CONFIG_ACPI_APEI_SE{A,I} this can't be compiled.
  No, it can be compiled; the guard is "||", not "&&".
> 
> 
>> +
>> +	if (head) {
>> +		rcu_read_lock();
>> +		list_for_each_entry_rcu(ghes, head, list) {
>> +			if (!ghes_proc(ghes))
>> +				ret = 0;
>> +		}
>> +		rcu_read_unlock();
>>  	}
>> -	rcu_read_unlock();
>>  	return ret;
>>  }
>>  
>> -static void ghes_sea_add(struct ghes *ghes)
>> +static void ghes_abort_add(struct ghes *ghes)
>>  {
>> -	mutex_lock(&ghes_list_mutex);
>> -	list_add_rcu(&ghes->list, &ghes_sea);
>> -	mutex_unlock(&ghes_list_mutex);
>> +	struct list_head *head = NULL;
>> +	u8 notify_type = ghes->generic->notify.type;
>> +
> 
>> +	if (notify_type == ACPI_HEST_NOTIFY_SEA)
>> +		head = &ghes_sea;
>> +	else if (notify_type == ACPI_HEST_NOTIFY_SEI)
>> +		head = &ghes_sei;
> 
> And here.
No, same as above.
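
To make the point under discussion concrete outside the kernel tree, here is a minimal stand-alone sketch. Everything in it (the DEMO_* macros, struct demo_list, demo_pick_list()) is hypothetical and only stands in for CONFIG_ACPI_APEI_SEA/SEI and the ghes_sea/ghes_sei list heads. It shows that an "||" guard keeps the function when only one option is set, and that each branch then needs its own #ifdef so that it never names a list that was not declared:

```c
#include <stddef.h>

/*
 * Hypothetical stand-ins for the two Kconfig options. Only the SEA one
 * is defined, to model a build with a single option enabled.
 */
#define DEMO_APEI_SEA 1
/* #define DEMO_APEI_SEI 1 */

/* Illustrative notification types; not the ACPI_HEST_NOTIFY_* values. */
enum demo_notify { DEMO_NOTIFY_SEA, DEMO_NOTIFY_SEI };

/* Stand-in for a list head of registered error sources. */
struct demo_list { int id; };

#ifdef DEMO_APEI_SEA
static struct demo_list demo_sea = { 1 };
#endif

#ifdef DEMO_APEI_SEI
static struct demo_list demo_sei = { 2 };
#endif

/* The outer guard is true as soon as either option is defined. */
#if defined(DEMO_APEI_SEA) || defined(DEMO_APEI_SEI)
static struct demo_list *demo_pick_list(enum demo_notify type)
{
	/* Each branch is guarded separately, so it only refers to a
	 * list that exists in this configuration. */
#ifdef DEMO_APEI_SEA
	if (type == DEMO_NOTIFY_SEA)
		return &demo_sea;
#endif
#ifdef DEMO_APEI_SEI
	if (type == DEMO_NOTIFY_SEI)
		return &demo_sei;
#endif
	return NULL;	/* type not built in: caller skips the walk */
}
#endif
```

With DEMO_APEI_SEI left undefined, demo_pick_list(DEMO_NOTIFY_SEI) returns NULL, which lines up with the "if (head)" check in the patch. Removing the inner #ifdef lines makes this sketch fail to build, because demo_sei is then referenced without being declared.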

> 
> 
>> +
>> +	if (head) {
>> +		mutex_lock(&ghes_list_mutex);
>> +		list_add_rcu(&ghes->list, head);
>> +		mutex_unlock(&ghes_list_mutex);
>> +	}
>>  }
>>  
>> -static void ghes_sea_remove(struct ghes *ghes)
>> +static void ghes_abort_remove(struct ghes *ghes)
>>  {
>>  	mutex_lock(&ghes_list_mutex);
>>  	list_del_rcu(&ghes->list);
>>  	mutex_unlock(&ghes_list_mutex);
>>  	synchronize_rcu();
>>  }
>> -#else /* CONFIG_ACPI_APEI_SEA */
>> -static inline void ghes_sea_add(struct ghes *ghes) { }
>> -static inline void ghes_sea_remove(struct ghes *ghes) { }
>> -#endif /* CONFIG_ACPI_APEI_SEA */
>> +#else
>> +static inline void ghes_abort_add(struct ghes *ghes) { }
>> +static inline void ghes_abort_remove(struct ghes *ghes) { }
>> +#endif
>>  
>>  #ifdef CONFIG_HAVE_ACPI_APEI_NMI
>>  /*
>> @@ -1084,6 +1108,13 @@ static int ghes_probe(struct platform_device *ghes_dev)
>>  			goto err;
>>  		}
>>  		break;
>> +	case ACPI_HEST_NOTIFY_SEI:
>> +		if (!IS_ENABLED(CONFIG_ACPI_APEI_SEI)) {
>> +			pr_warn(GHES_PFX "Generic hardware error source: %d notified via SEI is not supported!\n",
>> +				generic->header.source_id);
>> +			goto err;
>> +		}
>> +		break;
>>  	case ACPI_HEST_NOTIFY_NMI:
>>  		if (!IS_ENABLED(CONFIG_HAVE_ACPI_APEI_NMI)) {
>>  			pr_warn(GHES_PFX "Generic hardware error source: %d notified via NMI interrupt is not supported!\n",
>> @@ -1153,7 +1184,8 @@ static int ghes_probe(struct platform_device *ghes_dev)
>>  		break;
>>  
>>  	case ACPI_HEST_NOTIFY_SEA:
>> -		ghes_sea_add(ghes);
>> +	case ACPI_HEST_NOTIFY_SEI:
>> +		ghes_abort_add(ghes);
>>  		break;
>>  	case ACPI_HEST_NOTIFY_NMI:
>>  		ghes_nmi_add(ghes);
>> @@ -1206,7 +1238,8 @@ static int ghes_remove(struct platform_device *ghes_dev)
>>  		break;
>>  
>>  	case ACPI_HEST_NOTIFY_SEA:
>> -		ghes_sea_remove(ghes);
>> +	case ACPI_HEST_NOTIFY_SEI:
>> +		ghes_abort_remove(ghes);
>>  		break;
>>  	case ACPI_HEST_NOTIFY_NMI:
>>  		ghes_nmi_remove(ghes);
>> diff --git a/include/acpi/ghes.h b/include/acpi/ghes.h
>> index 9061c5c..ec6f4ba 100644
>> --- a/include/acpi/ghes.h
>> +++ b/include/acpi/ghes.h
>> @@ -118,6 +118,6 @@ static inline void *acpi_hest_get_next(struct acpi_hest_generic_data *gdata)
>>  	     (void *)section - (void *)(estatus + 1) < estatus->data_length; \
>>  	     section = acpi_hest_get_next(section))
>>  
>> -int ghes_notify_sea(void);
>> +int ghes_notify_abort(u8 type);
>>  
>>  #endif /* GHES_H */
>>
Patch

diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
index 2509e4f..c98c1b3 100644
--- a/arch/arm64/mm/fault.c
+++ b/arch/arm64/mm/fault.c
@@ -585,7 +585,7 @@  static int do_sea(unsigned long addr, unsigned int esr, struct pt_regs *regs)
 		if (interrupts_enabled(regs))
 			nmi_enter();
 
-		ret = ghes_notify_sea();
+		ret = ghes_notify_abort(ACPI_HEST_NOTIFY_SEA);
 
 		if (interrupts_enabled(regs))
 			nmi_exit();
@@ -682,7 +682,7 @@  int handle_guest_sea(phys_addr_t addr, unsigned int esr)
 	int ret = -ENOENT;
 
 	if (IS_ENABLED(CONFIG_ACPI_APEI_SEA))
-		ret = ghes_notify_sea();
+		ret = ghes_notify_abort(ACPI_HEST_NOTIFY_SEA);
 
 	return ret;
 }
diff --git a/drivers/acpi/apei/Kconfig b/drivers/acpi/apei/Kconfig
index de14d49..47fcb0c 100644
--- a/drivers/acpi/apei/Kconfig
+++ b/drivers/acpi/apei/Kconfig
@@ -54,6 +54,21 @@  config ACPI_APEI_SEA
 	  option allows the OS to look for such hardware error record, and
 	  take appropriate action.
 
+config ACPI_APEI_SEI
+	bool "APEI Asynchronous SError Interrupt logging/recovering support"
+	depends on ARM64 && ACPI_APEI_GHES
+	default y
+	help
+	  This option should be enabled if the system supports
+	  firmware-first handling of SEI (asynchronous SError interrupt).
+
+	  SEI happens with asynchronous external abort for errors on device
+	  memory reads on ARMv8 systems. If a system supports firmware-first
+	  handling of SEI, the platform analyzes and handles hardware error
+	  notifications from SEI, and it may then form a HW error record for
+	  the OS to parse and handle. This option allows the OS to look for
+	  such hardware error records and take appropriate action.
+
 config ACPI_APEI_MEMORY_FAILURE
 	bool "APEI memory error recovering support"
 	depends on ACPI_APEI && MEMORY_FAILURE
diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
index 3eee30a..24b4233 100644
--- a/drivers/acpi/apei/ghes.c
+++ b/drivers/acpi/apei/ghes.c
@@ -815,43 +815,67 @@  static struct notifier_block ghes_notifier_hed = {
 
 #ifdef CONFIG_ACPI_APEI_SEA
 static LIST_HEAD(ghes_sea);
+#endif
+
+#ifdef CONFIG_ACPI_APEI_SEI
+static LIST_HEAD(ghes_sei);
+#endif
 
+#if defined(CONFIG_ACPI_APEI_SEA) || defined(CONFIG_ACPI_APEI_SEI)
 /*
- * Return 0 only if one of the SEA error sources successfully reported an error
- * record sent from the firmware.
+ * Return 0 only if one of the SEA or SEI error sources successfully
+ * reported an error record sent from the firmware.
  */
-int ghes_notify_sea(void)
+int ghes_notify_abort(u8 type)
 {
 	struct ghes *ghes;
+	struct list_head *head = NULL;
 	int ret = -ENOENT;
 
-	rcu_read_lock();
-	list_for_each_entry_rcu(ghes, &ghes_sea, list) {
-		if (!ghes_proc(ghes))
-			ret = 0;
+	if (type == ACPI_HEST_NOTIFY_SEA)
+		head = &ghes_sea;
+	else if (type == ACPI_HEST_NOTIFY_SEI)
+		head = &ghes_sei;
+
+	if (head) {
+		rcu_read_lock();
+		list_for_each_entry_rcu(ghes, head, list) {
+			if (!ghes_proc(ghes))
+				ret = 0;
+		}
+		rcu_read_unlock();
 	}
-	rcu_read_unlock();
 	return ret;
 }
 
-static void ghes_sea_add(struct ghes *ghes)
+static void ghes_abort_add(struct ghes *ghes)
 {
-	mutex_lock(&ghes_list_mutex);
-	list_add_rcu(&ghes->list, &ghes_sea);
-	mutex_unlock(&ghes_list_mutex);
+	struct list_head *head = NULL;
+	u8 notify_type = ghes->generic->notify.type;
+
+	if (notify_type == ACPI_HEST_NOTIFY_SEA)
+		head = &ghes_sea;
+	else if (notify_type == ACPI_HEST_NOTIFY_SEI)
+		head = &ghes_sei;
+
+	if (head) {
+		mutex_lock(&ghes_list_mutex);
+		list_add_rcu(&ghes->list, head);
+		mutex_unlock(&ghes_list_mutex);
+	}
 }
 
-static void ghes_sea_remove(struct ghes *ghes)
+static void ghes_abort_remove(struct ghes *ghes)
 {
 	mutex_lock(&ghes_list_mutex);
 	list_del_rcu(&ghes->list);
 	mutex_unlock(&ghes_list_mutex);
 	synchronize_rcu();
 }
-#else /* CONFIG_ACPI_APEI_SEA */
-static inline void ghes_sea_add(struct ghes *ghes) { }
-static inline void ghes_sea_remove(struct ghes *ghes) { }
-#endif /* CONFIG_ACPI_APEI_SEA */
+#else
+static inline void ghes_abort_add(struct ghes *ghes) { }
+static inline void ghes_abort_remove(struct ghes *ghes) { }
+#endif
 
 #ifdef CONFIG_HAVE_ACPI_APEI_NMI
 /*
@@ -1084,6 +1108,13 @@  static int ghes_probe(struct platform_device *ghes_dev)
 			goto err;
 		}
 		break;
+	case ACPI_HEST_NOTIFY_SEI:
+		if (!IS_ENABLED(CONFIG_ACPI_APEI_SEI)) {
+			pr_warn(GHES_PFX "Generic hardware error source: %d notified via SEI is not supported!\n",
+				generic->header.source_id);
+			goto err;
+		}
+		break;
 	case ACPI_HEST_NOTIFY_NMI:
 		if (!IS_ENABLED(CONFIG_HAVE_ACPI_APEI_NMI)) {
 			pr_warn(GHES_PFX "Generic hardware error source: %d notified via NMI interrupt is not supported!\n",
@@ -1153,7 +1184,8 @@  static int ghes_probe(struct platform_device *ghes_dev)
 		break;
 
 	case ACPI_HEST_NOTIFY_SEA:
-		ghes_sea_add(ghes);
+	case ACPI_HEST_NOTIFY_SEI:
+		ghes_abort_add(ghes);
 		break;
 	case ACPI_HEST_NOTIFY_NMI:
 		ghes_nmi_add(ghes);
@@ -1206,7 +1238,8 @@  static int ghes_remove(struct platform_device *ghes_dev)
 		break;
 
 	case ACPI_HEST_NOTIFY_SEA:
-		ghes_sea_remove(ghes);
+	case ACPI_HEST_NOTIFY_SEI:
+		ghes_abort_remove(ghes);
 		break;
 	case ACPI_HEST_NOTIFY_NMI:
 		ghes_nmi_remove(ghes);
diff --git a/include/acpi/ghes.h b/include/acpi/ghes.h
index 9061c5c..ec6f4ba 100644
--- a/include/acpi/ghes.h
+++ b/include/acpi/ghes.h
@@ -118,6 +118,6 @@  static inline void *acpi_hest_get_next(struct acpi_hest_generic_data *gdata)
 	     (void *)section - (void *)(estatus + 1) < estatus->data_length; \
 	     section = acpi_hest_get_next(section))
 
-int ghes_notify_sea(void);
+int ghes_notify_abort(u8 type);
 
 #endif /* GHES_H */
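
As a reading aid for the return-value comment on ghes_notify_abort(), the following user-space sketch models the loop. It is not the kernel implementation: struct demo_source, demo_proc() and demo_notify() are hypothetical stand-ins for struct ghes, ghes_proc() and ghes_notify_abort(), and a plain singly linked list replaces the RCU-protected list used in the patch.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for struct ghes: one registered error source. */
struct demo_source {
	int has_record;			/* 1 if a record is waiting */
	struct demo_source *next;	/* next source on the same list */
};

/* Stand-in for ghes_proc(): 0 when a record was found and consumed. */
static int demo_proc(struct demo_source *src)
{
	if (!src->has_record)
		return -ENOENT;
	src->has_record = 0;
	return 0;
}

/*
 * Same loop shape as in the patch: visit every source on the list and
 * return 0 if at least one of them reported a record, else -ENOENT.
 * A NULL head models an unknown notification type.
 */
static int demo_notify(struct demo_source *head)
{
	int ret = -ENOENT;

	for (struct demo_source *s = head; s; s = s->next) {
		if (!demo_proc(s))
			ret = 0;
	}
	return ret;
}
```

Every source is visited even after one has succeeded, so a single call can pick up records from several sources. The result is -ENOENT only when none of them had a record, or when there was no list to walk.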