Message ID | 20241028081142.66028-2-xueshuai@linux.alibaba.com (mailing list archive) |
---|---|
State | Needs ACK |
Headers | show |
Series | [v15,1/3] ACPI: APEI: send SIGBUS to current task if synchronous memory error not recovered | expand |
On Mon, Oct 28, 2024 at 04:11:40PM +0800, Shuai Xue wrote: > Synchronous error was detected as a result of user-space process accessing > a 2-bit uncorrected error. The CPU will take a synchronous error exception > such as Synchronous External Abort (SEA) on Arm64. The kernel will queue a > memory_failure() work which poisons the related page, unmaps the page, and > then sends a SIGBUS to the process, so that a system wide panic can be > avoided. > > However, no memory_failure() work will be queued when abnormal synchronous > errors occur. These errors can include situations such as invalid PA, > unexpected severity, no memory failure config support, invalid GUID > section, etc. In such case, the user-space process will trigger SEA again. > This loop can potentially exceed the platform firmware threshold or even > trigger a kernel hard lockup, leading to a system reboot. > > Fix it by performing a force kill if no memory_failure() work is queued > for synchronous errors. > > Signed-off-by: Shuai Xue <xueshuai@linux.alibaba.com> > Reviewed-by: Jarkko Sakkinen <jarkko@kernel.org> > Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> > --- > drivers/acpi/apei/ghes.c | 10 ++++++++++ > 1 file changed, 10 insertions(+) > > diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c > index ada93cfde9ba..f2ee28c44d7a 100644 > --- a/drivers/acpi/apei/ghes.c > +++ b/drivers/acpi/apei/ghes.c > @@ -801,6 +801,16 @@ static bool ghes_do_proc(struct ghes *ghes, > } > } > > + /* > + * If no memory failure work is queued for abnormal synchronous > + * errors, do a force kill. > + */ > + if (sync && !queued) { > + pr_err("%s:%d: hardware memory corruption (SIGBUS)\n", > + current->comm, task_pid_nr(current)); I think it would help to include the GHES_PFX to indicate where this message is coming from. The pr_fmt() macro could also be introduced instead. Also, you may want to include the HW_ERR prefix. Not all kernel messages related to hardware errors have this prefix today. But maybe that should be changed so there is more consistent messaging. Thanks, Yazen
在 2024/10/30 04:48, Yazen Ghannam 写道: > On Mon, Oct 28, 2024 at 04:11:40PM +0800, Shuai Xue wrote: >> Synchronous error was detected as a result of user-space process accessing >> a 2-bit uncorrected error. The CPU will take a synchronous error exception >> such as Synchronous External Abort (SEA) on Arm64. The kernel will queue a >> memory_failure() work which poisons the related page, unmaps the page, and >> then sends a SIGBUS to the process, so that a system wide panic can be >> avoided. >> >> However, no memory_failure() work will be queued when abnormal synchronous >> errors occur. These errors can include situations such as invalid PA, >> unexpected severity, no memory failure config support, invalid GUID >> section, etc. In such case, the user-space process will trigger SEA again. >> This loop can potentially exceed the platform firmware threshold or even >> trigger a kernel hard lockup, leading to a system reboot. >> >> Fix it by performing a force kill if no memory_failure() work is queued >> for synchronous errors. >> >> Signed-off-by: Shuai Xue <xueshuai@linux.alibaba.com> >> Reviewed-by: Jarkko Sakkinen <jarkko@kernel.org> >> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> >> --- >> drivers/acpi/apei/ghes.c | 10 ++++++++++ >> 1 file changed, 10 insertions(+) >> >> diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c >> index ada93cfde9ba..f2ee28c44d7a 100644 >> --- a/drivers/acpi/apei/ghes.c >> +++ b/drivers/acpi/apei/ghes.c >> @@ -801,6 +801,16 @@ static bool ghes_do_proc(struct ghes *ghes, >> } >> } >> >> + /* >> + * If no memory failure work is queued for abnormal synchronous >> + * errors, do a force kill. >> + */ >> + if (sync && !queued) { >> + pr_err("%s:%d: hardware memory corruption (SIGBUS)\n", >> + current->comm, task_pid_nr(current)); > > I think it would help to include the GHES_PFX to indicate where this > message is coming from. The pr_fmt() macro could also be introduced > instead. Yes, GHES_PFX is a effective prefix and will be consistent to other message in GHES driver. Will add it in next version. What do you mean about pr_fmt()? > > Also, you may want to include the HW_ERR prefix. Not all kernel messages > related to hardware errors have this prefix today. But maybe that should > be changed so there is more consistent messaging. > Do we really need a HW_ERR prefix? The other case which use HW_ERR prefix are for hardware registers. The messages which send SIGBUS does not include HW_ERR, e.g. in kill_proc(), kill_procs(). pr_err("%#lx: Sending SIGBUS to %s:%d due to hardware memory corruption\n",... pr_err("%#lx: forcibly killing %s:%d because of failure to unmap corrupted page\n",... > Thanks, > Yazen Thanks for valuable comments. Best Regards, Shuai
diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c index ada93cfde9ba..f2ee28c44d7a 100644 --- a/drivers/acpi/apei/ghes.c +++ b/drivers/acpi/apei/ghes.c @@ -801,6 +801,16 @@ static bool ghes_do_proc(struct ghes *ghes, } } + /* + * If no memory failure work is queued for abnormal synchronous + * errors, do a force kill. + */ + if (sync && !queued) { + pr_err("%s:%d: hardware memory corruption (SIGBUS)\n", + current->comm, task_pid_nr(current)); + force_sig(SIGBUS); + } + return queued; }