[v3,0/4] arm64: improve handle synchronous External Data Abort

Message ID 20221205160043.57465-1-xiexiuqi@huawei.com (mailing list archive)

Message

Xie XiuQi Dec. 5, 2022, 4 p.m. UTC
This series fixes some issues in arm64 synchronous External Data Abort
(SEA) handling.

1. Fix unhandled processor errors

According to the RAS documentation, if the details of the error do not
let us determine its impact when an SEA occurs, the process cannot
safely continue to run. Therefore, for an unhandled error, we should
signal the system and terminate the process immediately.

2. Improve handling of memory errors

If the error happened in the current execution context, we need to pass
the MF_ACTION_REQUIRED flag to memory_failure(), and if memory_failure()
fails to recover, we must handle that case rather than ignore it.
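
As a rough illustration of the flag selection described in item 2, here
is a minimal userspace sketch. The helper sketch_mf_flags() and the
SKETCH_MF_ACTION_REQUIRED stand-in are hypothetical names used only for
this example; they are not part of the series, and the real flag comes
from the kernel's mm headers.

```c
#include <stdbool.h>

/* Stand-in for the kernel's MF_ACTION_REQUIRED; illustrative value only. */
#define SKETCH_MF_ACTION_REQUIRED (1 << 1)

/*
 * Hypothetical helper: choose the flags to hand to memory_failure().
 * An error consumed in the current execution context needs immediate
 * action; anything else can be handled without that flag.
 */
static int sketch_mf_flags(bool error_in_current_context)
{
	return error_in_current_context ? SKETCH_MF_ACTION_REQUIRED : 0;
}
```

With this sketch, a synchronous error taken by the running task yields
SKETCH_MF_ACTION_REQUIRED, and every other case yields 0.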

---
v3: add improvements for handling memory errors
v2: fix compile warning reported by kernel test robot.

Xie XiuQi (4):
  ACPI: APEI: include missing acpi/apei.h
  arm64: ghes: fix error unhandling in synchronous External Data Abort
  arm64: ghes: handle the case when memory_failure recovery failed
  arm64: ghes: pass MF_ACTION_REQUIRED to memory_failure when sea

 arch/arm64/kernel/acpi.c      |  6 ++++++
 drivers/acpi/apei/apei-base.c |  5 +++++
 drivers/acpi/apei/ghes.c      | 31 ++++++++++++++++++++++++-------
 include/acpi/apei.h           |  1 +
 include/linux/mm.h            |  2 +-
 mm/memory-failure.c           | 24 +++++++++++++++++-------
 6 files changed, 54 insertions(+), 15 deletions(-)

Comments

Shuai Xue Dec. 10, 2022, 1:35 p.m. UTC | #1
Hi, XiuQi,

As we discussed, if you want to fix this problem before the new UEFI version comes out,
you need another patch that separates synchronous error handling into task work when SEA
notification is used. Be careful not to break error handling for the other notification
types.

Reference code is pasted below.

Thank you.

Best Regards,
Shuai

----

diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
index 57cae48ebc1f..1982a5e3fd8c 100644
--- a/drivers/acpi/apei/ghes.c
+++ b/drivers/acpi/apei/ghes.c
@@ -445,15 +445,71 @@ static void ghes_kick_task_work(struct callback_head *head)
 	gen_pool_free(ghes_estatus_pool, (unsigned long)estatus_node, node_len);
 }

+/**
+ * struct mce_task_work - for synchronous RAS event
+ *
+ * @twork:                callback_head for task work
+ * @pfn:                  page frame number of corrupted page
+ * @flags:                fine tune action taken
+ *
+ * Structure to pass task work to be handled before
+ * ret_to_user via task_work_add().
+ */
+struct mce_task_work {
+	struct callback_head twork;
+	u64 pfn;
+	int flags;
+};
+
+static void memory_failure_cb(struct callback_head *twork)
+{
+	int rc;
+	struct mce_task_work *twcb =
+		container_of(twork, struct mce_task_work, twork);
+
+	rc = memory_failure(twcb->pfn, twcb->flags);
+	kfree(twcb);
+
+	if (!rc)
+		return;
+	/*
+	 * -EHWPOISON from memory_failure() means that it already sent SIGBUS
+	 * to the current process with the proper error info,
+	 * -EOPNOTSUPP means hwpoison_filter() filtered the error event,
+	 *
+	 * In both cases, no further processing is required.
+	 */
+	if (rc == -EHWPOISON || rc == -EOPNOTSUPP)
+		return;
+
+	pr_err("Memory error not recovered\n");
+	force_sig(SIGBUS);
+}
+
 static bool ghes_do_memory_failure(u64 physical_addr, int flags)
 {
 	unsigned long pfn;
+	struct mce_task_work *twcb;

 	if (!IS_ENABLED(CONFIG_ACPI_APEI_MEMORY_FAILURE))
 		return false;

 	pfn = PHYS_PFN(physical_addr);
-	memory_failure_queue(pfn, flags);
+
+	if (flags == MF_ACTION_REQUIRED && current->mm) {
+		twcb = kmalloc(sizeof(*twcb), GFP_ATOMIC);
+		if (!twcb)
+			return false;
+
+		twcb->pfn = pfn;
+		twcb->flags = flags;
+		init_task_work(&twcb->twork, memory_failure_cb);
+		task_work_add(current, &twcb->twork, TWA_RESUME);
+		return false;
+	} else {
+		memory_failure_queue(pfn, flags);
+	}
+
 	return true;
 }
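
The return-code checks in memory_failure_cb() can be tried outside the
kernel. The following is only a userspace sketch under stated
assumptions: sketch_needs_sigbus() is a hypothetical helper, not a kernel
function, and the fallback value for EHWPOISON is assumed from the
generic Linux errno numbering for builds where <errno.h> lacks it.

```c
#include <errno.h>
#include <stdbool.h>

/* EHWPOISON is Linux-specific; assumed fallback so the sketch builds elsewhere. */
#ifndef EHWPOISON
#define EHWPOISON 133
#endif

/*
 * Hypothetical helper mirroring the checks in memory_failure_cb():
 * returns true when the caller would still have to force SIGBUS.
 */
static bool sketch_needs_sigbus(int rc)
{
	if (rc == 0)
		return false;	/* recovery succeeded */
	if (rc == -EHWPOISON || rc == -EOPNOTSUPP)
		return false;	/* already signalled, or event filtered */
	return true;		/* any other failure: not recovered */
}
```

Under these assumptions, only return codes other than 0, -EHWPOISON and
-EOPNOTSUPP lead to the forced SIGBUS path.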