diff mbox series

[v12,1/3] ACPI: APEI: send SIGBUS to current task if synchronous memory error not recovered

Message ID 20240902030034.67152-2-xueshuai@linux.alibaba.com (mailing list archive)
State Superseded, archived
Headers show
Series [v12,1/3] ACPI: APEI: send SIGBUS to current task if synchronous memory error not recovered | expand

Commit Message

Shuai Xue Sept. 2, 2024, 3 a.m. UTC
Synchronous error was detected as a result of user-space process accessing
a 2-bit uncorrected error. The CPU will take a synchronous error exception
such as Synchronous External Abort (SEA) on Arm64. The kernel will queue a
memory_failure() work which poisons the related page, unmaps the page, and
then sends a SIGBUS to the process, so that a system wide panic can be
avoided.

However, no memory_failure() work will be queued unless all bellow
preconditions check passed:

- `if (!(mem_err->validation_bits & CPER_MEM_VALID_PA))` in ghes_handle_memory_failure()
- `if (flags == -1)` in ghes_handle_memory_failure()
- `if (!IS_ENABLED(CONFIG_ACPI_APEI_MEMORY_FAILURE))` in ghes_do_memory_failure()
- `if (!pfn_valid(pfn) && !arch_is_platform_page(physical_addr)) ` in ghes_do_memory_failure()

In such case, the user-space process will trigger SEA again.  This loop
can potentially exceed the platform firmware threshold or even trigger a
kernel hard lockup, leading to a system reboot.

Fix it by performing a force kill if no memory_failure() work is queued
for synchronous errors.

Suggested-by: Xiaofei Tan <tanxiaofei@huawei.com>
Signed-off-by: Shuai Xue <xueshuai@linux.alibaba.com>

---
 drivers/acpi/apei/ghes.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

Comments

Jarkko Sakkinen Sept. 3, 2024, 4:09 p.m. UTC | #1
On Mon Sep 2, 2024 at 6:00 AM EEST, Shuai Xue wrote:
> Synchronous error was detected as a result of user-space process accessing
> a 2-bit uncorrected error. The CPU will take a synchronous error exception
> such as Synchronous External Abort (SEA) on Arm64. The kernel will queue a
> memory_failure() work which poisons the related page, unmaps the page, and
> then sends a SIGBUS to the process, so that a system wide panic can be
> avoided.
>
> However, no memory_failure() work will be queued unless all bellow
> preconditions check passed:
>
> - `if (!(mem_err->validation_bits & CPER_MEM_VALID_PA))` in ghes_handle_memory_failure()
> - `if (flags == -1)` in ghes_handle_memory_failure()
> - `if (!IS_ENABLED(CONFIG_ACPI_APEI_MEMORY_FAILURE))` in ghes_do_memory_failure()
> - `if (!pfn_valid(pfn) && !arch_is_platform_page(physical_addr)) ` in ghes_do_memory_failure()
>
> In such case, the user-space process will trigger SEA again.  This loop
> can potentially exceed the platform firmware threshold or even trigger a
> kernel hard lockup, leading to a system reboot.
>
> Fix it by performing a force kill if no memory_failure() work is queued
> for synchronous errors.
>
> Suggested-by: Xiaofei Tan <tanxiaofei@huawei.com>
> Signed-off-by: Shuai Xue <xueshuai@linux.alibaba.com>
>
> ---
>  drivers/acpi/apei/ghes.c | 10 ++++++++++
>  1 file changed, 10 insertions(+)
>
> diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
> index 623cc0cb4a65..b0b20ee533d9 100644
> --- a/drivers/acpi/apei/ghes.c
> +++ b/drivers/acpi/apei/ghes.c
> @@ -801,6 +801,16 @@ static bool ghes_do_proc(struct ghes *ghes,
>  		}
>  	}
>  
> +	/*
> +	 * If no memory failure work is queued for abnormal synchronous
> +	 * errors, do a force kill.
> +	 */
> +	if (sync && !queued) {
> +		pr_err("Sending SIGBUS to %s:%d due to hardware memory corruption\n",
> +			current->comm, task_pid_nr(current));

Hmm... doest this need "hardware" or would "memory corruption" be
enough?

Also, does this need to say that it is sending SIGBUS when the signal
itself tells that already?

I.e. could "%s:%d has memory corruption" be enough information?

> +		force_sig(SIGBUS);
> +	}
> +
>  	return queued;
>  }
>  

BR, Jarkko
Shuai Xue Sept. 5, 2024, 3:04 a.m. UTC | #2
在 2024/9/4 00:09, Jarkko Sakkinen 写道:
> On Mon Sep 2, 2024 at 6:00 AM EEST, Shuai Xue wrote:
>> Synchronous error was detected as a result of user-space process accessing
>> a 2-bit uncorrected error. The CPU will take a synchronous error exception
>> such as Synchronous External Abort (SEA) on Arm64. The kernel will queue a
>> memory_failure() work which poisons the related page, unmaps the page, and
>> then sends a SIGBUS to the process, so that a system wide panic can be
>> avoided.
>>
>> However, no memory_failure() work will be queued unless all bellow
>> preconditions check passed:
>>
>> - `if (!(mem_err->validation_bits & CPER_MEM_VALID_PA))` in ghes_handle_memory_failure()
>> - `if (flags == -1)` in ghes_handle_memory_failure()
>> - `if (!IS_ENABLED(CONFIG_ACPI_APEI_MEMORY_FAILURE))` in ghes_do_memory_failure()
>> - `if (!pfn_valid(pfn) && !arch_is_platform_page(physical_addr)) ` in ghes_do_memory_failure()
>>
>> In such case, the user-space process will trigger SEA again.  This loop
>> can potentially exceed the platform firmware threshold or even trigger a
>> kernel hard lockup, leading to a system reboot.
>>
>> Fix it by performing a force kill if no memory_failure() work is queued
>> for synchronous errors.
>>
>> Suggested-by: Xiaofei Tan <tanxiaofei@huawei.com>
>> Signed-off-by: Shuai Xue <xueshuai@linux.alibaba.com>
>>
>> ---
>>   drivers/acpi/apei/ghes.c | 10 ++++++++++
>>   1 file changed, 10 insertions(+)
>>
>> diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
>> index 623cc0cb4a65..b0b20ee533d9 100644
>> --- a/drivers/acpi/apei/ghes.c
>> +++ b/drivers/acpi/apei/ghes.c
>> @@ -801,6 +801,16 @@ static bool ghes_do_proc(struct ghes *ghes,
>>   		}
>>   	}
>>   
>> +	/*
>> +	 * If no memory failure work is queued for abnormal synchronous
>> +	 * errors, do a force kill.
>> +	 */
>> +	if (sync && !queued) {
>> +		pr_err("Sending SIGBUS to %s:%d due to hardware memory corruption\n",
>> +			current->comm, task_pid_nr(current));
> 
> Hmm... doest this need "hardware" or would "memory corruption" be
> enough?
> 
> Also, does this need to say that it is sending SIGBUS when the signal
> itself tells that already?
> 
> I.e. could "%s:%d has memory corruption" be enough information?

Hi, Jarkko,

Thank you for your suggestion. Maybe it could.

There are some similar error info which use "hardware memory error", e.g.

	static void kill_me_maybe(struct callback_head *cb)
	{
		pr_err("Uncorrected hardware memory error in user-access at %llx", 
p->mce_addr);
		...
		pr_err("Memory error not recovered");
		kill_me_now(cb);
	}
	
	static int kill_proc(struct to_kill *tk, unsigned long pfn, int flags)
	{
		pr_err("%#lx: Sending SIGBUS to %s:%d due to hardware memory 
corruption\n",
				pfn, t->comm, task_pid_nr(t));
		...
			ret = force_sig_mceerr(BUS_MCEERR_AR,
					 (void __user *)tk->addr, addr_lsb);
	
	}	

So, personally, I prefer this info to be consistent with them.


Best Regards,
Shuai
Jarkko Sakkinen Sept. 5, 2024, 2:14 p.m. UTC | #3
On Thu Sep 5, 2024 at 6:04 AM EEST, Shuai Xue wrote:
>
>
> 在 2024/9/4 00:09, Jarkko Sakkinen 写道:
> > On Mon Sep 2, 2024 at 6:00 AM EEST, Shuai Xue wrote:
> >> Synchronous error was detected as a result of user-space process accessing
> >> a 2-bit uncorrected error. The CPU will take a synchronous error exception
> >> such as Synchronous External Abort (SEA) on Arm64. The kernel will queue a
> >> memory_failure() work which poisons the related page, unmaps the page, and
> >> then sends a SIGBUS to the process, so that a system wide panic can be
> >> avoided.
> >>
> >> However, no memory_failure() work will be queued unless all bellow
> >> preconditions check passed:
> >>
> >> - `if (!(mem_err->validation_bits & CPER_MEM_VALID_PA))` in ghes_handle_memory_failure()
> >> - `if (flags == -1)` in ghes_handle_memory_failure()
> >> - `if (!IS_ENABLED(CONFIG_ACPI_APEI_MEMORY_FAILURE))` in ghes_do_memory_failure()
> >> - `if (!pfn_valid(pfn) && !arch_is_platform_page(physical_addr)) ` in ghes_do_memory_failure()
> >>
> >> In such case, the user-space process will trigger SEA again.  This loop
> >> can potentially exceed the platform firmware threshold or even trigger a
> >> kernel hard lockup, leading to a system reboot.
> >>
> >> Fix it by performing a force kill if no memory_failure() work is queued
> >> for synchronous errors.
> >>
> >> Suggested-by: Xiaofei Tan <tanxiaofei@huawei.com>
> >> Signed-off-by: Shuai Xue <xueshuai@linux.alibaba.com>
> >>
> >> ---
> >>   drivers/acpi/apei/ghes.c | 10 ++++++++++
> >>   1 file changed, 10 insertions(+)
> >>
> >> diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
> >> index 623cc0cb4a65..b0b20ee533d9 100644
> >> --- a/drivers/acpi/apei/ghes.c
> >> +++ b/drivers/acpi/apei/ghes.c
> >> @@ -801,6 +801,16 @@ static bool ghes_do_proc(struct ghes *ghes,
> >>   		}
> >>   	}
> >>   
> >> +	/*
> >> +	 * If no memory failure work is queued for abnormal synchronous
> >> +	 * errors, do a force kill.
> >> +	 */
> >> +	if (sync && !queued) {
> >> +		pr_err("Sending SIGBUS to %s:%d due to hardware memory corruption\n",
> >> +			current->comm, task_pid_nr(current));
> > 
> > Hmm... doest this need "hardware" or would "memory corruption" be
> > enough?
> > 
> > Also, does this need to say that it is sending SIGBUS when the signal
> > itself tells that already?
> > 
> > I.e. could "%s:%d has memory corruption" be enough information?
>
> Hi, Jarkko,
>
> Thank you for your suggestion. Maybe it could.
>
> There are some similar error info which use "hardware memory error", e.g.

By tweaking my original suggestion just a bit:

"%s:%d: hardware memory corruption"

Can't get clearer than that, right?

BR, Jarkko
Jarkko Sakkinen Sept. 5, 2024, 2:17 p.m. UTC | #4
On Thu Sep 5, 2024 at 5:14 PM EEST, Jarkko Sakkinen wrote:
> On Thu Sep 5, 2024 at 6:04 AM EEST, Shuai Xue wrote:
> >
> >
> > 在 2024/9/4 00:09, Jarkko Sakkinen 写道:
> > > On Mon Sep 2, 2024 at 6:00 AM EEST, Shuai Xue wrote:
> > >> Synchronous error was detected as a result of user-space process accessing
> > >> a 2-bit uncorrected error. The CPU will take a synchronous error exception
> > >> such as Synchronous External Abort (SEA) on Arm64. The kernel will queue a
> > >> memory_failure() work which poisons the related page, unmaps the page, and
> > >> then sends a SIGBUS to the process, so that a system wide panic can be
> > >> avoided.
> > >>
> > >> However, no memory_failure() work will be queued unless all bellow
> > >> preconditions check passed:
> > >>
> > >> - `if (!(mem_err->validation_bits & CPER_MEM_VALID_PA))` in ghes_handle_memory_failure()
> > >> - `if (flags == -1)` in ghes_handle_memory_failure()
> > >> - `if (!IS_ENABLED(CONFIG_ACPI_APEI_MEMORY_FAILURE))` in ghes_do_memory_failure()
> > >> - `if (!pfn_valid(pfn) && !arch_is_platform_page(physical_addr)) ` in ghes_do_memory_failure()
> > >>
> > >> In such case, the user-space process will trigger SEA again.  This loop
> > >> can potentially exceed the platform firmware threshold or even trigger a
> > >> kernel hard lockup, leading to a system reboot.
> > >>
> > >> Fix it by performing a force kill if no memory_failure() work is queued
> > >> for synchronous errors.
> > >>
> > >> Suggested-by: Xiaofei Tan <tanxiaofei@huawei.com>
> > >> Signed-off-by: Shuai Xue <xueshuai@linux.alibaba.com>
> > >>
> > >> ---
> > >>   drivers/acpi/apei/ghes.c | 10 ++++++++++
> > >>   1 file changed, 10 insertions(+)
> > >>
> > >> diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
> > >> index 623cc0cb4a65..b0b20ee533d9 100644
> > >> --- a/drivers/acpi/apei/ghes.c
> > >> +++ b/drivers/acpi/apei/ghes.c
> > >> @@ -801,6 +801,16 @@ static bool ghes_do_proc(struct ghes *ghes,
> > >>   		}
> > >>   	}
> > >>   
> > >> +	/*
> > >> +	 * If no memory failure work is queued for abnormal synchronous
> > >> +	 * errors, do a force kill.
> > >> +	 */
> > >> +	if (sync && !queued) {
> > >> +		pr_err("Sending SIGBUS to %s:%d due to hardware memory corruption\n",
> > >> +			current->comm, task_pid_nr(current));
> > > 
> > > Hmm... doest this need "hardware" or would "memory corruption" be
> > > enough?
> > > 
> > > Also, does this need to say that it is sending SIGBUS when the signal
> > > itself tells that already?
> > > 
> > > I.e. could "%s:%d has memory corruption" be enough information?
> >
> > Hi, Jarkko,
> >
> > Thank you for your suggestion. Maybe it could.
> >
> > There are some similar error info which use "hardware memory error", e.g.
>
> By tweaking my original suggestion just a bit:
>
> "%s:%d: hardware memory corruption"
>
> Can't get clearer than that, right?

And obvious reason that shorter and more consistent klog message is easy
to spot and grep. It is simply less convoluted.

If you want also SIGBUS, I'd just put it as "%s:%d: hardware memory
corruption (SIGBUS)"

BR, Jarkko
Shuai Xue Sept. 6, 2024, 1:53 a.m. UTC | #5
在 2024/9/5 22:17, Jarkko Sakkinen 写道:
> On Thu Sep 5, 2024 at 5:14 PM EEST, Jarkko Sakkinen wrote:
>> On Thu Sep 5, 2024 at 6:04 AM EEST, Shuai Xue wrote:
>>>
>>>
>>> 在 2024/9/4 00:09, Jarkko Sakkinen 写道:
>>>> On Mon Sep 2, 2024 at 6:00 AM EEST, Shuai Xue wrote:
>>>>> Synchronous error was detected as a result of user-space process accessing
>>>>> a 2-bit uncorrected error. The CPU will take a synchronous error exception
>>>>> such as Synchronous External Abort (SEA) on Arm64. The kernel will queue a
>>>>> memory_failure() work which poisons the related page, unmaps the page, and
>>>>> then sends a SIGBUS to the process, so that a system wide panic can be
>>>>> avoided.
>>>>>
>>>>> However, no memory_failure() work will be queued unless all bellow
>>>>> preconditions check passed:
>>>>>
>>>>> - `if (!(mem_err->validation_bits & CPER_MEM_VALID_PA))` in ghes_handle_memory_failure()
>>>>> - `if (flags == -1)` in ghes_handle_memory_failure()
>>>>> - `if (!IS_ENABLED(CONFIG_ACPI_APEI_MEMORY_FAILURE))` in ghes_do_memory_failure()
>>>>> - `if (!pfn_valid(pfn) && !arch_is_platform_page(physical_addr)) ` in ghes_do_memory_failure()
>>>>>
>>>>> In such case, the user-space process will trigger SEA again.  This loop
>>>>> can potentially exceed the platform firmware threshold or even trigger a
>>>>> kernel hard lockup, leading to a system reboot.
>>>>>
>>>>> Fix it by performing a force kill if no memory_failure() work is queued
>>>>> for synchronous errors.
>>>>>
>>>>> Suggested-by: Xiaofei Tan <tanxiaofei@huawei.com>
>>>>> Signed-off-by: Shuai Xue <xueshuai@linux.alibaba.com>
>>>>>
>>>>> ---
>>>>>    drivers/acpi/apei/ghes.c | 10 ++++++++++
>>>>>    1 file changed, 10 insertions(+)
>>>>>
>>>>> diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
>>>>> index 623cc0cb4a65..b0b20ee533d9 100644
>>>>> --- a/drivers/acpi/apei/ghes.c
>>>>> +++ b/drivers/acpi/apei/ghes.c
>>>>> @@ -801,6 +801,16 @@ static bool ghes_do_proc(struct ghes *ghes,
>>>>>    		}
>>>>>    	}
>>>>>    
>>>>> +	/*
>>>>> +	 * If no memory failure work is queued for abnormal synchronous
>>>>> +	 * errors, do a force kill.
>>>>> +	 */
>>>>> +	if (sync && !queued) {
>>>>> +		pr_err("Sending SIGBUS to %s:%d due to hardware memory corruption\n",
>>>>> +			current->comm, task_pid_nr(current));
>>>>
>>>> Hmm... doest this need "hardware" or would "memory corruption" be
>>>> enough?
>>>>
>>>> Also, does this need to say that it is sending SIGBUS when the signal
>>>> itself tells that already?
>>>>
>>>> I.e. could "%s:%d has memory corruption" be enough information?
>>>
>>> Hi, Jarkko,
>>>
>>> Thank you for your suggestion. Maybe it could.
>>>
>>> There are some similar error info which use "hardware memory error", e.g.
>>
>> By tweaking my original suggestion just a bit:
>>
>> "%s:%d: hardware memory corruption"
>>
>> Can't get clearer than that, right?
> 
> And obvious reason that shorter and more consistent klog message is easy
> to spot and grep. It is simply less convoluted.
> 
> If you want also SIGBUS, I'd just put it as "%s:%d: hardware memory
> corruption (SIGBUS)"
> 
> BR, Jarkko

Hi, Jarkko,

I will change it to "%s:%d: hardware memory corruption (SIGBUS)".

Thank you for valuable suggestion.

Best Regards,
Shuai
Jarkko Sakkinen Sept. 6, 2024, 2:42 p.m. UTC | #6
On Fri Sep 6, 2024 at 4:53 AM EEST, Shuai Xue wrote:
>
>
> 在 2024/9/5 22:17, Jarkko Sakkinen 写道:
> > On Thu Sep 5, 2024 at 5:14 PM EEST, Jarkko Sakkinen wrote:
> >> On Thu Sep 5, 2024 at 6:04 AM EEST, Shuai Xue wrote:
> >>>
> >>>
> >>> 在 2024/9/4 00:09, Jarkko Sakkinen 写道:
> >>>> On Mon Sep 2, 2024 at 6:00 AM EEST, Shuai Xue wrote:
> >>>>> Synchronous error was detected as a result of user-space process accessing
> >>>>> a 2-bit uncorrected error. The CPU will take a synchronous error exception
> >>>>> such as Synchronous External Abort (SEA) on Arm64. The kernel will queue a
> >>>>> memory_failure() work which poisons the related page, unmaps the page, and
> >>>>> then sends a SIGBUS to the process, so that a system wide panic can be
> >>>>> avoided.
> >>>>>
> >>>>> However, no memory_failure() work will be queued unless all bellow
> >>>>> preconditions check passed:
> >>>>>
> >>>>> - `if (!(mem_err->validation_bits & CPER_MEM_VALID_PA))` in ghes_handle_memory_failure()
> >>>>> - `if (flags == -1)` in ghes_handle_memory_failure()
> >>>>> - `if (!IS_ENABLED(CONFIG_ACPI_APEI_MEMORY_FAILURE))` in ghes_do_memory_failure()
> >>>>> - `if (!pfn_valid(pfn) && !arch_is_platform_page(physical_addr)) ` in ghes_do_memory_failure()
> >>>>>
> >>>>> In such case, the user-space process will trigger SEA again.  This loop
> >>>>> can potentially exceed the platform firmware threshold or even trigger a
> >>>>> kernel hard lockup, leading to a system reboot.
> >>>>>
> >>>>> Fix it by performing a force kill if no memory_failure() work is queued
> >>>>> for synchronous errors.
> >>>>>
> >>>>> Suggested-by: Xiaofei Tan <tanxiaofei@huawei.com>
> >>>>> Signed-off-by: Shuai Xue <xueshuai@linux.alibaba.com>
> >>>>>
> >>>>> ---
> >>>>>    drivers/acpi/apei/ghes.c | 10 ++++++++++
> >>>>>    1 file changed, 10 insertions(+)
> >>>>>
> >>>>> diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
> >>>>> index 623cc0cb4a65..b0b20ee533d9 100644
> >>>>> --- a/drivers/acpi/apei/ghes.c
> >>>>> +++ b/drivers/acpi/apei/ghes.c
> >>>>> @@ -801,6 +801,16 @@ static bool ghes_do_proc(struct ghes *ghes,
> >>>>>    		}
> >>>>>    	}
> >>>>>    
> >>>>> +	/*
> >>>>> +	 * If no memory failure work is queued for abnormal synchronous
> >>>>> +	 * errors, do a force kill.
> >>>>> +	 */
> >>>>> +	if (sync && !queued) {
> >>>>> +		pr_err("Sending SIGBUS to %s:%d due to hardware memory corruption\n",
> >>>>> +			current->comm, task_pid_nr(current));
> >>>>
> >>>> Hmm... doest this need "hardware" or would "memory corruption" be
> >>>> enough?
> >>>>
> >>>> Also, does this need to say that it is sending SIGBUS when the signal
> >>>> itself tells that already?
> >>>>
> >>>> I.e. could "%s:%d has memory corruption" be enough information?
> >>>
> >>> Hi, Jarkko,
> >>>
> >>> Thank you for your suggestion. Maybe it could.
> >>>
> >>> There are some similar error info which use "hardware memory error", e.g.
> >>
> >> By tweaking my original suggestion just a bit:
> >>
> >> "%s:%d: hardware memory corruption"
> >>
> >> Can't get clearer than that, right?
> > 
> > And obvious reason that shorter and more consistent klog message is easy
> > to spot and grep. It is simply less convoluted.
> > 
> > If you want also SIGBUS, I'd just put it as "%s:%d: hardware memory
> > corruption (SIGBUS)"
> > 
> > BR, Jarkko
>
> Hi, Jarkko,
>
> I will change it to "%s:%d: hardware memory corruption (SIGBUS)".
>
> Thank you for valuable suggestion.

Yeah, no intention nitpick, it has a practical value when debugging
issues :-)

>
> Best Regards,
> Shuai

BR, Jarkko
diff mbox series

Patch

diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
index 623cc0cb4a65..b0b20ee533d9 100644
--- a/drivers/acpi/apei/ghes.c
+++ b/drivers/acpi/apei/ghes.c
@@ -801,6 +801,16 @@  static bool ghes_do_proc(struct ghes *ghes,
 		}
 	}
 
+	/*
+	 * If no memory failure work is queued for abnormal synchronous
+	 * errors, do a force kill.
+	 */
+	if (sync && !queued) {
+		pr_err("Sending SIGBUS to %s:%d due to hardware memory corruption\n",
+			current->comm, task_pid_nr(current));
+		force_sig(SIGBUS);
+	}
+
 	return queued;
 }