Message ID | 20221027042445.60108-1-xueshuai@linux.alibaba.com (mailing list archive) |
---|---|
State | Superseded, archived |
Series | ACPI: APEI: set memory failure flags as MF_ACTION_REQUIRED on action required events |
On Thu, Oct 27, 2022 at 6:25 AM Shuai Xue <xueshuai@linux.alibaba.com> wrote:
>
> There are two major types of uncorrected error (UC):
>
> - Action Required: The error is detected and the processor has already
>   consumed the memory. The OS is required to take action (for example,
>   offline the failed page or kill the failing thread) to recover from this
>   uncorrectable error.
>
> - Action Optional: The error is detected outside of processor execution
>   context. Some data in memory are corrupted, but the data have not been
>   consumed. The OS may optionally take action to recover from this
>   uncorrectable error.
>
> On x86 platforms, we can easily distinguish between these two types
> based on the MCA bank. On arm64 platforms, however, the memory failure
> flags for all UCs whose severity is GHES_SEV_RECOVERABLE are currently
> set to 0, i.e. Action Optional.
>
> If a UC is detected by a background scrubber, it is obviously an Action
> Optional error. For other errors, we should conservatively regard them
> as Action Required.
>
> cper_sec_mem_err::error_type identifies the type of error that occurred
> if CPER_MEM_VALID_ERROR_TYPE is set. So, set memory failure flags as 0
> for Scrub Uncorrected Error (type 14). Otherwise, set memory failure
> flags as MF_ACTION_REQUIRED.
>
> Signed-off-by: Shuai Xue <xueshuai@linux.alibaba.com>

I need input from the APEI reviewers on this.

Thanks!

> ---
>  drivers/acpi/apei/ghes.c | 10 ++++++++--
>  include/linux/cper.h     |  3 +++
>  2 files changed, 11 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
> index 80ad530583c9..6c03059cbfc6 100644
> --- a/drivers/acpi/apei/ghes.c
> +++ b/drivers/acpi/apei/ghes.c
> @@ -474,8 +474,14 @@ static bool ghes_handle_memory_failure(struct acpi_hest_generic_data *gdata,
>         if (sec_sev == GHES_SEV_CORRECTED &&
>             (gdata->flags & CPER_SEC_ERROR_THRESHOLD_EXCEEDED))
>                 flags = MF_SOFT_OFFLINE;
> -       if (sev == GHES_SEV_RECOVERABLE && sec_sev == GHES_SEV_RECOVERABLE)
> -               flags = 0;
> +       if (sev == GHES_SEV_RECOVERABLE && sec_sev == GHES_SEV_RECOVERABLE) {
> +               if (mem_err->validation_bits & CPER_MEM_VALID_ERROR_TYPE)
> +                       flags = mem_err->error_type == CPER_MEM_SCRUB_UC ?
> +                                       0 :
> +                                       MF_ACTION_REQUIRED;
> +               else
> +                       flags = MF_ACTION_REQUIRED;
> +       }
>
>         if (flags != -1)
>                 return ghes_do_memory_failure(mem_err->physical_addr, flags);
> diff --git a/include/linux/cper.h b/include/linux/cper.h
> index eacb7dd7b3af..b77ab7636614 100644
> --- a/include/linux/cper.h
> +++ b/include/linux/cper.h
> @@ -235,6 +235,9 @@ enum {
>  #define CPER_MEM_VALID_BANK_ADDRESS            0x100000
>  #define CPER_MEM_VALID_CHIP_ID                 0x200000
>
> +#define CPER_MEM_SCRUB_CE                      13
> +#define CPER_MEM_SCRUB_UC                      14
> +
>  #define CPER_MEM_EXT_ROW_MASK                  0x3
>  #define CPER_MEM_EXT_ROW_SHIFT                 16
>
> --
> 2.20.1.9.gb50a0d7
>
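For readers following the thread, the decision made by the quoted diff can be restated as a small stand-alone C function. This is only a sketch under stated assumptions: pick_mf_flags() and struct mem_err_sketch are hypothetical names, and the constant values are local stand-ins assumed to match the kernel headers (CPER_MEM_SCRUB_UC is taken from the patch itself). The real logic lives in ghes_handle_memory_failure().

```c
#include <assert.h>
#include <stdint.h>

/* Local stand-ins for kernel definitions; the values are assumptions. */
enum { GHES_SEV_NO, GHES_SEV_CORRECTED, GHES_SEV_RECOVERABLE, GHES_SEV_PANIC };

#define MF_ACTION_REQUIRED                (1 << 1)
#define MF_SOFT_OFFLINE                   (1 << 3)
#define CPER_SEC_ERROR_THRESHOLD_EXCEEDED 0x0008
#define CPER_MEM_VALID_ERROR_TYPE         0x0001
#define CPER_MEM_SCRUB_UC                 14   /* value added by the patch */

#define MF_FLAGS_NONE (-1)  /* "do not call memory_failure()", as in the diff */

/* Hypothetical cut-down view of struct cper_sec_mem_err. */
struct mem_err_sketch {
	uint64_t validation_bits;
	uint8_t error_type;
};

/* Mirrors the flag selection in the proposed hunk, nothing more. */
static int pick_mf_flags(int sev, int sec_sev, uint32_t gdata_flags,
			 const struct mem_err_sketch *mem_err)
{
	int flags = MF_FLAGS_NONE;

	assert(mem_err != 0);

	/* Corrected error over threshold: soft-offline the page. */
	if (sec_sev == GHES_SEV_CORRECTED &&
	    (gdata_flags & CPER_SEC_ERROR_THRESHOLD_EXCEEDED))
		flags = MF_SOFT_OFFLINE;

	/* Recoverable UC: only a valid "scrub UC" type stays Action Optional. */
	if (sev == GHES_SEV_RECOVERABLE && sec_sev == GHES_SEV_RECOVERABLE) {
		int type_valid =
			(mem_err->validation_bits & CPER_MEM_VALID_ERROR_TYPE) != 0;

		if (type_valid && mem_err->error_type == CPER_MEM_SCRUB_UC)
			flags = 0;
		else
			flags = MF_ACTION_REQUIRED;
	}

	return flags;
}
```

Under this sketch, a recoverable error whose type field is not marked valid is treated as Action Required, which is the conservative default the commit message argues for.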
>> cper_sec_mem_err::error_type identifies the type of error that occurred
>> if CPER_MEM_VALID_ERROR_TYPE is set. So, set memory failure flags as 0
>> for Scrub Uncorrected Error (type 14). Otherwise, set memory failure
>> flags as MF_ACTION_REQUIRED.

On x86 the "action required" cases are signaled by a synchronous machine check
that is delivered before the instruction that is attempting to consume the uncorrected
data retires. I.e., it is guaranteed that the uncorrected error has not been propagated
because it is not visible in any architectural state.

APEI signaled errors don't fall into that category on x86 ... the uncorrected data
could have been consumed and propagated long before the signaling used for
APEI can alert the OS.

Does ARM deliver APEI signals synchronously?

If not, then this patch might deliver a false sense of security to applications
about the state of uncorrected data in the system.

-Tony
On 2022/10/29 1:08 AM, Rafael J. Wysocki wrote:
> On Thu, Oct 27, 2022 at 6:25 AM Shuai Xue <xueshuai@linux.alibaba.com> wrote:
>> [...]
>> Signed-off-by: Shuai Xue <xueshuai@linux.alibaba.com>
>
> I need input from the APEI reviewers on this.
>
> Thanks!

Hi, Rafael,

Sorry, I missed this email. Thank you for your quick reply. Let's discuss with the reviewers.

Thank you.

Cheers,
Shuai
On 2022/10/29 1:25 AM, Luck, Tony wrote:
>>> cper_sec_mem_err::error_type identifies the type of error that occurred
>>> if CPER_MEM_VALID_ERROR_TYPE is set. So, set memory failure flags as 0
>>> for Scrub Uncorrected Error (type 14). Otherwise, set memory failure
>>> flags as MF_ACTION_REQUIRED.
>
> On x86 the "action required" cases are signaled by a synchronous machine check
> that is delivered before the instruction that is attempting to consume the uncorrected
> data retires. I.e., it is guaranteed that the uncorrected error has not been propagated
> because it is not visible in any architectural state.

On arm, if a 2-bit (uncorrectable) error is detected and the memory access has been
architecturally executed, that error is considered "consumed". The CPU will take a
synchronous error exception, signaled as a synchronous external abort (SEA), which is
analogous to an MCE.

> APEI signaled errors don't fall into that category on x86 ... the uncorrected data
> could have been consumed and propagated long before the signaling used for
> APEI can alert the OS.
>
> Does ARM deliver APEI signals synchronously?
>
> If not, then this patch might deliver a false sense of security to applications
> about the state of uncorrected data in the system.

Well, not always. There are many APEI notification types, such as SCI, GSIV, GPIO,
SDEI, SEA, etc. Not all APEI notifications are synchronous; it depends on the
hardware signal. As far as I know, if a UE is detected and consumed, a synchronous
external abort is signaled to firmware, and firmware then performs first-level triage
and synchronously notifies the OS by an SDEI or SEA notification. On the other hand,
if a CE is detected, an asynchronous interrupt will be signaled and firmware could
notify the OS by GPIO or GSIV.

Best Regards,
Shuai
On 2022/11/2 7:53 PM, Shuai Xue wrote:
> [...]
>
> Best Regards,
> Shuai

Hi, Tony,

Prefetching data with a UE triggers an asynchronous interrupt on both x86 and arm64
platforms (CMCI on x86 and SPI on arm64). Such an error does not belong to the scrub
UE category. I have to admit that cper_sec_mem_err::error_type is not an appropriate
basis for distinguishing "action required" cases.

acpi_hest_generic_data::flags (UEFI spec section N.2.2) could be used to indicate
Action Optional (Scrub/Prefetch):

    Bit 5 – Latent error: If set this flag indicates that action has been taken to
    ensure error containment (such as poisoning data), but the error has not been
    fully corrected and the data has not been consumed. System software may choose
    to take further corrective action before the data is consumed.

Our hardware team has submitted a proposal to the UEFI community to add a new bit:

    Bit 8 – sync flag: if set, this flag indicates that this event record is
    synchronous (e.g. the CPU core consumes poisoned data, which then causes an
    instruction/data abort); if not set, this event record is asynchronous.

With bit 8, we will know the error is "Action Required". I will send a new patch set
to rework GHES error handling after the proposal is accepted.

Thank you.

Best Regards
Shuai
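The flag-based classification described in the message above could look roughly like the following C sketch. Everything here is illustrative: the macro and function names are hypothetical, bit 8 is only the proposal mentioned in the message (not part of any published specification as far as this thread shows), and the precedence between the two bits is an assumption made for the sketch.

```c
#include <assert.h>
#include <stdint.h>

/* Bit 5 of acpi_hest_generic_data::flags, "Latent error", as quoted above. */
#define SKETCH_SEC_LATENT_ERROR   (1u << 5)
/* Bit 8, "sync flag": PROPOSAL ONLY, taken from the message above. */
#define SKETCH_SEC_SYNC_PROPOSED  (1u << 8)

enum ue_class {
	UE_ACTION_OPTIONAL,   /* contained and not yet consumed */
	UE_ACTION_REQUIRED,   /* reported synchronously, i.e. consumed */
	UE_UNCLASSIFIED       /* the flags alone do not say */
};

/*
 * Classify an uncorrected error from the section flags alone.
 * Assumption: if both bits were ever set, the sync bit wins, because
 * treating a consumed error as optional is the more dangerous mistake.
 */
static enum ue_class classify_ue(uint32_t section_flags)
{
	if (section_flags & SKETCH_SEC_SYNC_PROPOSED)
		return UE_ACTION_REQUIRED;
	if (section_flags & SKETCH_SEC_LATENT_ERROR)
		return UE_ACTION_OPTIONAL;
	return UE_UNCLASSIFIED;
}
```

A record with neither bit set is deliberately left unclassified here, since the message does not say how such records should be treated.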
changes since v2 by addressing comments from Naoya:
- rename mce_task_work to sync_task_work
- drop ACPI_HEST_NOTIFY_MCE case in is_hest_sync_notify()
- add steps to reproduce this problem in cover letter
- Link: https://lore.kernel.org/lkml/1aa0ca90-d44c-aa99-1e2d-bd2ae610b088@linux.alibaba.com/T/#mb3dede6b7a6d189dc8de3cf9310071e38a192f8e

changes since v1:
- synchronous events by notify type
- Link: https://lore.kernel.org/lkml/20221206153354.92394-3-xueshuai@linux.alibaba.com/

Currently, both synchronous and asynchronous errors are queued and handled
by a dedicated kthread in a workqueue. Memory failure for synchronous
errors is synced by a cancel_work_sync trick which ensures that the
corrupted page is unmapped and poisoned. After returning to user-space,
the task restarts at the current instruction, which triggers a page fault in
which the kernel sends SIGBUS to the current process due to VM_FAULT_HWPOISON.

However, the memory failure recovery for hwpoison-aware mechanisms does not
work as expected. For example, hwpoison-aware user-space processes like
QEMU register their customized SIGBUS handler and enable early kill mode by
setting PF_MCE_EARLY at initialization. Then the kernel will directly notify
the process by sending a SIGBUS signal in memory failure with the wrong
si_code: the user-space process receives BUS_MCEERR_AO instead of
BUS_MCEERR_AR.

To address this problem:

- PATCH 1 sets mf_flags as MF_ACTION_REQUIRED on synchronous events, which
  indicates that the error happened in the current execution context
- PATCH 2 separates synchronous error handling into task work so that the
  current context in memory failure is exactly the task consuming the
  poisoned data.

Then, the kernel will send SIGBUS with the proper si_code in kill_proc().

Lv Ying and XiuQi also proposed to address a similar problem, and we discussed
a new solution: adding a new flag (acpi_hest_generic_data::flags bit 8) to
distinguish synchronous events. [2][3] The UEFI community still has not responded.
After a deep dive into the SDEI TRM, the SDEI notification should be used for
asynchronous errors. As the SDEI TRM [1] describes, "the dispatcher can simulate an
exception-like entry into the client, **with the client providing an additional
asynchronous entry point similar to an interrupt entry point**". The client
(kernel) lacks complete synchronous context, e.g. system registers (ELR, ESR,
etc). So the notify type is enough to distinguish synchronous events.

To reproduce this problem:

    # STEP1: enable early kill mode
    #sysctl -w vm.memory_failure_early_kill=1
    vm.memory_failure_early_kill = 1

    # STEP2: inject an UCE error and consume it to trigger a synchronous error
    #einj_mem_uc single
    0: single   vaddr = 0xffffb0d75400 paddr = 4092d55b400
    injecting ...
    triggering ...
    signal 7 code 5 addr 0xffffb0d75000
    page not present
    Test passed

The si_code (code 5) from einj_mem_uc indicates that it is a BUS_MCEERR_AO error,
which is not the case.

After this patch set:

    # STEP1: enable early kill mode
    #sysctl -w vm.memory_failure_early_kill=1
    vm.memory_failure_early_kill = 1

    # STEP2: inject an UCE error and consume it to trigger a synchronous error
    #einj_mem_uc single
    0: single   vaddr = 0xffffb0d75400 paddr = 4092d55b400
    injecting ...
    triggering ...
    signal 7 code 4 addr 0xffffb0d75000
    page not present
    Test passed

The si_code (code 4) from einj_mem_uc indicates that it is a BUS_MCEERR_AR error,
as we expected.

[1] https://developer.arm.com/documentation/den0054/latest/
[2] https://lore.kernel.org/linux-arm-kernel/20221205160043.57465-4-xiexiuqi@huawei.com/T/
[3] https://lore.kernel.org/lkml/20221209095407.383211-1-lvying6@huawei.com/

Shuai Xue (2):
  ACPI: APEI: set memory failure flags as MF_ACTION_REQUIRED on
    synchronous events
  ACPI: APEI: handle synchronous exceptions in task work

 drivers/acpi/apei/ghes.c | 135 ++++++++++++++++++++++++---------------
 include/acpi/ghes.h      |   3 -
 mm/memory-failure.c      |  13 ----
 3 files changed, 83 insertions(+), 68 deletions(-)
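To make the cover letter's reasoning concrete, the chain from notification type to si_code can be sketched as a few stand-alone C functions. This is a simplified model, not the patch itself: all names carrying a sketch_ or SKETCH_ prefix are hypothetical, the notification type numbers are assumed to follow the ACPI HEST encoding, and expected_si_code() reflects one reading of what kill_proc() does (AR only when the flag is set and the signalled task is the current one). The si_code values 4 and 5 are the ones shown in the einj_mem_uc output above.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed ACPI HEST notification type numbers (a subset). */
enum sketch_hest_notify {
	SKETCH_NOTIFY_SCI  = 3,
	SKETCH_NOTIFY_GPIO = 7,
	SKETCH_NOTIFY_SEA  = 8,
	SKETCH_NOTIFY_GSIV = 10,
	SKETCH_NOTIFY_SDEI = 11
};

#define SKETCH_MF_ACTION_REQUIRED (1 << 1)  /* stand-in for MF_ACTION_REQUIRED */
#define SKETCH_BUS_MCEERR_AR 4              /* "code 4" in the output above */
#define SKETCH_BUS_MCEERR_AO 5              /* "code 5" in the output above */

/*
 * v3 drops the MCE case, and the cover letter argues that SDEI is
 * asynchronous, so only SEA is treated as synchronous in this sketch.
 */
static bool sketch_is_sync_notify(int notify_type)
{
	return notify_type == SKETCH_NOTIFY_SEA;
}

/* PATCH 1 in miniature: synchronous events get MF_ACTION_REQUIRED. */
static int sketch_mf_flags(int notify_type)
{
	return sketch_is_sync_notify(notify_type) ? SKETCH_MF_ACTION_REQUIRED : 0;
}

/*
 * Why PATCH 2 matters: with the flag set, AR is still only delivered when
 * memory failure runs in the context of the task that consumed the data.
 */
static int expected_si_code(int mf_flags, bool signalled_task_is_current)
{
	if ((mf_flags & SKETCH_MF_ACTION_REQUIRED) && signalled_task_is_current)
		return SKETCH_BUS_MCEERR_AR;
	return SKETCH_BUS_MCEERR_AO;
}
```

In this model, an SEA handled from a workqueue kthread (where the consuming task is not the current one) still yields code 5, which matches the "before" output; moving the handling into task work makes the consuming task current and yields code 4.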
On Fri, Mar 17, 2023 at 8:25 AM Shuai Xue <xueshuai@linux.alibaba.com> wrote:
> [...]

I really need the designated APEI reviewers to give their feedback on this.
Tested-by: Ma Wupeng <mawupeng1@huawei.com>

I have tested this on arm64 with the following steps:
1. make memory failure return EBUSY
2. force a UCE with einj

Without this patchset, the user task will not be killed, since memory_failure
cannot handle this UCE properly, and the user task stays in D state. The stack
can be found at the end.
With this patchset, the user task can be killed even if memory_failure returns
-EBUSY without doing anything.

Here is the stack of the user task in D state:

# cat /proc/7001/stack
[<0>] __flush_work.isra.0+0x80/0xa8
[<0>] __cancel_work_timer+0x144/0x1c8
[<0>] cancel_work_sync+0x1c/0x30
[<0>] memory_failure_queue_kick+0x3c/0x88
[<0>] ghes_kick_task_work+0x28/0x78
[<0>] task_work_run+0xb8/0x188
[<0>] do_notify_resume+0x1e0/0x280
[<0>] el0_da+0x130/0x138
[<0>] el0t_64_sync_handler+0x68/0xc0
[<0>] el0t_64_sync+0x188/0x190

On 2023/3/17 15:24, Shuai Xue wrote:
> [...]
On 2023/3/21 3:17 PM, mawupeng wrote:
> [...]

Thank you :)

Cheers,
Shuai
On 2023/3/21 2:03 AM, Rafael J. Wysocki wrote:
> On Fri, Mar 17, 2023 at 8:25 AM Shuai Xue <xueshuai@linux.alibaba.com> wrote:
>> [...]
>
> I really need the designated APEI reviewers to give their feedback on this.

Gentle ping.

Best Regards.
Shuai
On Thu, Mar 30, 2023 at 8:11 AM Shuai Xue <xueshuai@linux.alibaba.com> wrote: > > > On 2023/3/21 AM2:03, Rafael J. Wysocki wrote: > > On Fri, Mar 17, 2023 at 8:25 AM Shuai Xue <xueshuai@linux.alibaba.com> wrote: > >> > >> changes since v2 by addressing comments from Naoya: > >> - rename mce_task_work to sync_task_work > >> - drop ACPI_HEST_NOTIFY_MCE case in is_hest_sync_notify() > >> - add steps to reproduce this problem in cover letter > >> - Link: https://lore.kernel.org/lkml/1aa0ca90-d44c-aa99-1e2d-bd2ae610b088@linux.alibaba.com/T/#mb3dede6b7a6d189dc8de3cf9310071e38a192f8e > >> > >> changes since v1: > >> - synchronous events by notify type > >> - Link: https://lore.kernel.org/lkml/20221206153354.92394-3-xueshuai@linux.alibaba.com/ > >> > >> Currently, both synchronous and asynchronous error are queued and handled > >> by a dedicated kthread in workqueue. And Memory failure for synchronous > >> error is synced by a cancel_work_sync trick which ensures that the > >> corrupted page is unmapped and poisoned. And after returning to user-space, > >> the task starts at current instruction which triggering a page fault in > >> which kernel will send SIGBUS to current process due to VM_FAULT_HWPOISON. > >> > >> However, the memory failure recovery for hwpoison-aware mechanisms does not > >> work as expected. For example, hwpoison-aware user-space processes like > >> QEMU register their customized SIGBUS handler and enable early kill mode by > >> seting PF_MCE_EARLY at initialization. Then the kernel will directy notify > >> the process by sending a SIGBUS signal in memory failure with wrong > >> si_code: BUS_MCEERR_AO si_code to the actual user-space process instead of > >> BUS_MCEERR_AR. 
> >> > >> To address this problem: > >> > >> - PATCH 1 sets mf_flags as MF_ACTION_REQUIRED on synchronous events which > >> indicates error happened in current execution context > >> - PATCH 2 separates synchronous error handling into task work so that the > >> current context in memory failure is exactly belongs to the task > >> consuming poison data. > >> > >> Then, kernel will send SIGBUS with proper si_code in kill_proc(). > >> > >> Lv Ying and XiuQi also proposed to address similar problem and we discussed > >> about new solution to add a new flag(acpi_hest_generic_data::flags bit 8) to > >> distinguish synchronous event. [2][3] The UEFI community still has no response. > >> After a deep dive into the SDEI TRM, the SDEI notification should be used for > >> asynchronous error. As SDEI TRM[1] describes "the dispatcher can simulate an > >> exception-like entry into the client, **with the client providing an additional > >> asynchronous entry point similar to an interrupt entry point**". The client > >> (kernel) lacks complete synchronous context, e.g. systeam register (ELR, ESR, > >> etc). So notify type is enough to distinguish synchronous event. > >> > >> To reproduce this problem: > >> > >> # STEP1: enable early kill mode > >> #sysctl -w vm.memory_failure_early_kill=1 > >> vm.memory_failure_early_kill = 1 > >> > >> # STEP2: inject an UCE error and consume it to trigger a synchronous error > >> #einj_mem_uc single > >> 0: single vaddr = 0xffffb0d75400 paddr = 4092d55b400 > >> injecting ... > >> triggering ... > >> signal 7 code 5 addr 0xffffb0d75000 > >> page not present > >> Test passed > >> > >> The si_code (code 5) from einj_mem_uc indicates that it is BUS_MCEERR_AO error > >> and it is not fact. 
> >> > >> After this patch set: > >> > >> # STEP1: enable early kill mode > >> #sysctl -w vm.memory_failure_early_kill=1 > >> vm.memory_failure_early_kill = 1 > >> > >> # STEP2: inject an UCE error and consume it to trigger a synchronous error > >> #einj_mem_uc single > >> 0: single vaddr = 0xffffb0d75400 paddr = 4092d55b400 > >> injecting ... > >> triggering ... > >> signal 7 code 4 addr 0xffffb0d75000 > >> page not present > >> Test passed > >> > >> The si_code (code 4) from einj_mem_uc indicates that it is BUS_MCEERR_AR error > >> as we expected. > >> > >> [1] https://developer.arm.com/documentation/den0054/latest/ > >> [2] https://lore.kernel.org/linux-arm-kernel/20221205160043.57465-4-xiexiuqi@huawei.com/T/ > >> [3] https://lore.kernel.org/lkml/20221209095407.383211-1-lvying6@huawei.com/ > >> > >> Shuai Xue (2): > >> ACPI: APEI: set memory failure flags as MF_ACTION_REQUIRED on > >> synchronous events > >> ACPI: APEI: handle synchronous exceptions in task work > >> > >> drivers/acpi/apei/ghes.c | 135 ++++++++++++++++++++++++--------------- > >> include/acpi/ghes.h | 3 - > >> mm/memory-failure.c | 13 ---- > >> 3 files changed, 83 insertions(+), 68 deletions(-) > >> > >> -- > > > > I really need the designated APEI reviewers to give their feedback on this. > > Gentle ping. As already stated in this thread, this series requires reviews from the designated APEI reviewers (Tony, Boris, James). Thanks!
changes since v3 by addressing comments from Xiaofei:
- do a force kill for abnormal memory failure errors such as invalid PA,
  unexpected severity, OOM, etc
- pick up tested-by tag from Ma Wupeng

changes since v2 by addressing comments from Naoya:
- rename mce_task_work to sync_task_work
- drop ACPI_HEST_NOTIFY_MCE case in is_hest_sync_notify()
- add steps to reproduce this problem in cover letter
- Link: https://lore.kernel.org/lkml/1aa0ca90-d44c-aa99-1e2d-bd2ae610b088@linux.alibaba.com/

changes since v1:
- identify synchronous events by notify type
- Link: https://lore.kernel.org/lkml/20221206153354.92394-3-xueshuai@linux.alibaba.com/

Currently, both synchronous and asynchronous errors are queued and handled
by a dedicated kthread in a workqueue. Memory failure handling for a
synchronous error is synchronized by a cancel_work_sync() trick, which
ensures that the corrupted page is unmapped and poisoned. After returning
to user-space, the task restarts at the current instruction, which triggers
a page fault in which the kernel sends SIGBUS to the current process due to
VM_FAULT_HWPOISON.

However, memory failure recovery does not work as expected for
hwpoison-aware mechanisms. For example, hwpoison-aware user-space processes
like QEMU register their own SIGBUS handler and enable early kill mode by
setting PF_MCE_EARLY at initialization. The kernel then notifies the process
directly by sending SIGBUS from memory failure handling, but with the wrong
si_code: BUS_MCEERR_AO instead of BUS_MCEERR_AR.

To address this problem:

- PATCH 1 sets mf_flags to MF_ACTION_REQUIRED on synchronous events, which
  indicates that the error happened in the current execution context
- PATCH 2 separates synchronous error handling into task work so that the
  current context in memory failure handling belongs exactly to the task
  consuming the poisoned data.

Then, the kernel will send SIGBUS with the proper si_code in kill_proc().

Lv Ying and XiuQi also proposed to address a similar problem, and we
discussed a new solution that adds a new flag
(acpi_hest_generic_data::flags bit 8) to distinguish synchronous events.
[2][3] The UEFI community has not responded yet. After a deep dive into the
SDEI TRM, I conclude that the SDEI notification should be used for
asynchronous errors. As the SDEI TRM [1] describes, "the dispatcher can
simulate an exception-like entry into the client, **with the client
providing an additional asynchronous entry point similar to an interrupt
entry point**". The client (kernel) lacks the complete synchronous context,
e.g. system registers (ELR, ESR, etc). So the notify type is enough to
distinguish synchronous events.

To reproduce this problem:

    # STEP1: enable early kill mode
    #sysctl -w vm.memory_failure_early_kill=1
    vm.memory_failure_early_kill = 1

    # STEP2: inject an UCE error and consume it to trigger a synchronous error
    #einj_mem_uc single
    0: single vaddr = 0xffffb0d75400 paddr = 4092d55b400
    injecting ...
    triggering ...
    signal 7 code 5 addr 0xffffb0d75000
    page not present
    Test passed

The si_code (code 5) reported by einj_mem_uc indicates a BUS_MCEERR_AO
error, which is not correct for a consumed error.

After this patch set:

    # STEP1: enable early kill mode
    #sysctl -w vm.memory_failure_early_kill=1
    vm.memory_failure_early_kill = 1

    # STEP2: inject an UCE error and consume it to trigger a synchronous error
    #einj_mem_uc single
    0: single vaddr = 0xffffb0d75400 paddr = 4092d55b400
    injecting ...
    triggering ...
    signal 7 code 4 addr 0xffffb0d75000
    page not present
    Test passed

The si_code (code 4) reported by einj_mem_uc indicates a BUS_MCEERR_AR
error, as expected.

[1] https://developer.arm.com/documentation/den0054/latest/
[2] https://lore.kernel.org/linux-arm-kernel/20221205160043.57465-4-xiexiuqi@huawei.com/T/
[3] https://lore.kernel.org/lkml/20221209095407.383211-1-lvying6@huawei.com/

Shuai Xue (2):
  ACPI: APEI: set memory failure flags as MF_ACTION_REQUIRED on
    synchronous events
  ACPI: APEI: handle synchronous exceptions in task work

 drivers/acpi/apei/ghes.c | 120 +++++++++++++++++++++++++++------------
 include/acpi/ghes.h      |   3 -
 mm/memory-failure.c      |  13 -----
 3 files changed, 84 insertions(+), 52 deletions(-)
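For readers unfamiliar with how a hwpoison-aware process consumes these signals, the following is a minimal user-space sketch, not code from this series. It shows a SIGBUS handler that separates the two si_code values seen in the einj_mem_uc output (code 4 and code 5). The names `poison_action`, `classify_sigbus()`, `install_sigbus_handler()` and `last_poison_action()` are hypothetical helpers introduced only for illustration.

```c
#define _GNU_SOURCE
#include <signal.h>
#include <stddef.h>
#include <string.h>

/* Fallbacks for older headers; the values match the "code 4" and
 * "code 5" lines printed by einj_mem_uc in the cover letter. */
#ifndef BUS_MCEERR_AR
#define BUS_MCEERR_AR 4
#endif
#ifndef BUS_MCEERR_AO
#define BUS_MCEERR_AO 5
#endif

/* Hypothetical classification of what the process has to do. */
enum poison_action {
	POISON_NOT_MCE,     /* SIGBUS raised for an unrelated reason */
	POISON_MUST_HANDLE, /* action required: poison was consumed */
	POISON_MAY_DEFER,   /* action optional: poison not yet consumed */
};

enum poison_action classify_sigbus(int si_code)
{
	switch (si_code) {
	case BUS_MCEERR_AR:
		return POISON_MUST_HANDLE;
	case BUS_MCEERR_AO:
		return POISON_MAY_DEFER;
	default:
		return POISON_NOT_MCE;
	}
}

/* Only async-signal-safe work happens in the handler: it records the
 * classification and leaves the real recovery to the main loop. */
static volatile sig_atomic_t last_action = POISON_NOT_MCE;

static void sigbus_handler(int sig, siginfo_t *info, void *ucontext)
{
	(void)sig;
	(void)ucontext;
	last_action = classify_sigbus(info->si_code);
}

int last_poison_action(void)
{
	return last_action;
}

/* Returns 0 on success, -1 on failure (as sigaction() does). */
int install_sigbus_handler(void)
{
	struct sigaction sa;

	memset(&sa, 0, sizeof(sa));
	sa.sa_sigaction = sigbus_handler;
	sa.sa_flags = SA_SIGINFO; /* needed to receive siginfo_t */
	sigemptyset(&sa.sa_mask);
	return sigaction(SIGBUS, &sa, NULL);
}
```

With the wrong si_code described in the cover letter, a handler of this shape would take the "may defer" branch for an error the task has in fact already consumed, which is why the flag passed to memory failure handling matters.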
changes since v4 by addressing comments from Xiaofei:
- do a force kill only for abnormal sync errors
- Link: https://lore.kernel.org/lkml/1aa0ca90-d44c-aa99-1e2d-bd2ae610b088@linux.alibaba.com/

changes since v3 by addressing comments from Xiaofei:
- do a force kill for abnormal memory failure errors such as invalid PA,
  unexpected severity, OOM, etc
- pick up tested-by tag from Ma Wupeng
- Link: https://lore.kernel.org/lkml/1aa0ca90-d44c-aa99-1e2d-bd2ae610b088@linux.alibaba.com/

changes since v2 by addressing comments from Naoya:
- rename mce_task_work to sync_task_work
- drop ACPI_HEST_NOTIFY_MCE case in is_hest_sync_notify()
- add steps to reproduce this problem in cover letter
- Link: https://lore.kernel.org/lkml/1aa0ca90-d44c-aa99-1e2d-bd2ae610b088@linux.alibaba.com/

changes since v1:
- identify synchronous events by notify type
- Link: https://lore.kernel.org/lkml/20221206153354.92394-3-xueshuai@linux.alibaba.com/

Currently, both synchronous and asynchronous errors are queued and handled
by a dedicated kthread in a workqueue. Memory failure handling for a
synchronous error is synchronized by a cancel_work_sync() trick, which
ensures that the corrupted page is unmapped and poisoned. After returning
to user-space, the task restarts at the current instruction, which triggers
a page fault in which the kernel sends SIGBUS to the current process due to
VM_FAULT_HWPOISON.

However, memory failure recovery does not work as expected for
hwpoison-aware mechanisms. For example, hwpoison-aware user-space processes
like QEMU register their own SIGBUS handler and enable early kill mode by
setting PF_MCE_EARLY at initialization. The kernel then notifies the process
directly by sending SIGBUS from memory failure handling, but with the wrong
si_code: BUS_MCEERR_AO instead of BUS_MCEERR_AR.

To address this problem:

- PATCH 1 sets mf_flags to MF_ACTION_REQUIRED on synchronous events, which
  indicates that the error happened in the current execution context
- PATCH 2 separates synchronous error handling into task work so that the
  current context in memory failure handling belongs exactly to the task
  consuming the poisoned data.

Then, the kernel will send SIGBUS with the proper si_code in kill_proc().

Lv Ying and XiuQi also proposed to address a similar problem, and we
discussed a new solution that adds a new flag
(acpi_hest_generic_data::flags bit 8) to distinguish synchronous events.
[2][3] The UEFI community has not responded yet. After a deep dive into the
SDEI TRM, I conclude that the SDEI notification should be used for
asynchronous errors. As the SDEI TRM [1] describes, "the dispatcher can
simulate an exception-like entry into the client, **with the client
providing an additional asynchronous entry point similar to an interrupt
entry point**". The client (kernel) lacks the complete synchronous context,
e.g. system registers (ELR, ESR, etc). So the notify type is enough to
distinguish synchronous events.

To reproduce this problem:

    # STEP1: enable early kill mode
    #sysctl -w vm.memory_failure_early_kill=1
    vm.memory_failure_early_kill = 1

    # STEP2: inject an UCE error and consume it to trigger a synchronous error
    #einj_mem_uc single
    0: single vaddr = 0xffffb0d75400 paddr = 4092d55b400
    injecting ...
    triggering ...
    signal 7 code 5 addr 0xffffb0d75000
    page not present
    Test passed

The si_code (code 5) reported by einj_mem_uc indicates a BUS_MCEERR_AO
error, which is not correct for a consumed error.

After this patch set:

    # STEP1: enable early kill mode
    #sysctl -w vm.memory_failure_early_kill=1
    vm.memory_failure_early_kill = 1

    # STEP2: inject an UCE error and consume it to trigger a synchronous error
    #einj_mem_uc single
    0: single vaddr = 0xffffb0d75400 paddr = 4092d55b400
    injecting ...
    triggering ...
    signal 7 code 4 addr 0xffffb0d75000
    page not present
    Test passed

The si_code (code 4) reported by einj_mem_uc indicates a BUS_MCEERR_AR
error, as expected.

[1] https://developer.arm.com/documentation/den0054/latest/
[2] https://lore.kernel.org/linux-arm-kernel/20221205160043.57465-4-xiexiuqi@huawei.com/T/
[3] https://lore.kernel.org/lkml/20221209095407.383211-1-lvying6@huawei.com/

Shuai Xue (2):
  ACPI: APEI: set memory failure flags as MF_ACTION_REQUIRED on
    synchronous events
  ACPI: APEI: handle synchronous exceptions in task work

 drivers/acpi/apei/ghes.c | 120 +++++++++++++++++++++++++++------------
 include/acpi/ghes.h      |   3 -
 mm/memory-failure.c      |  13 -----
 3 files changed, 84 insertions(+), 52 deletions(-)
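The reproduction steps use the system-wide sysctl, while the cover letter also mentions processes that set PF_MCE_EARLY themselves. As a hedged illustration of the per-process route (this is not part of the series), a task can request early kill with prctl(PR_MCE_KILL, ...), which is what sets PF_MCE_EARLY on the calling task. The wrapper names `enable_early_kill()` and `early_kill_enabled()` are hypothetical.

```c
#include <sys/prctl.h>

/* Ask the kernel to deliver SIGBUS to this task as soon as a memory
 * error is detected in a page it maps (per-process early kill).
 * Returns 0 on success, -1 on failure with errno set. */
int enable_early_kill(void)
{
	return prctl(PR_MCE_KILL, (unsigned long)PR_MCE_KILL_SET,
		     (unsigned long)PR_MCE_KILL_EARLY, 0UL, 0UL);
}

/* Query the current per-process policy; returns 1 if early kill is
 * set for the calling task, 0 otherwise (including on error). */
int early_kill_enabled(void)
{
	return prctl(PR_MCE_KILL_GET, 0UL, 0UL, 0UL, 0UL) == PR_MCE_KILL_EARLY;
}
```

A process would typically call `enable_early_kill()` once at start-up, before installing its SIGBUS handler.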
changes since v5 by addressing comments from Kefeng:
- document return value of memory_failure()
- drop redundant comments in call site of memory_failure()
- make ghes_do_proc void and handle abnormal case within it
- pick up reviewed-by tag from Kefeng Wang

changes since v4 by addressing comments from Xiaofei:
- do a force kill only for abnormal sync errors

changes since v3 by addressing comments from Xiaofei:
- do a force kill for abnormal memory failure errors such as invalid PA,
  unexpected severity, OOM, etc
- pick up tested-by tag from Ma Wupeng

changes since v2 by addressing comments from Naoya:
- rename mce_task_work to sync_task_work
- drop ACPI_HEST_NOTIFY_MCE case in is_hest_sync_notify()
- add steps to reproduce this problem in cover letter
- Link: https://lore.kernel.org/lkml/1aa0ca90-d44c-aa99-1e2d-bd2ae610b088@linux.alibaba.com/

changes since v1:
- identify synchronous events by notify type
- Link: https://lore.kernel.org/lkml/20221206153354.92394-3-xueshuai@linux.alibaba.com/

Shuai Xue (2):
  ACPI: APEI: set memory failure flags as MF_ACTION_REQUIRED on
    synchronous events
  ACPI: APEI: handle synchronous exceptions in task work

 arch/x86/kernel/cpu/mce/core.c |   7 ---
 drivers/acpi/apei/ghes.c       | 111 ++++++++++++++++++++++-----------
 include/acpi/ghes.h            |   3 -
 mm/memory-failure.c            |  17 +----
 4 files changed, 76 insertions(+), 62 deletions(-)
changes since v6:
- add more explicit error message suggested by Xiaofei
- pick up reviewed-by tag from Xiaofei
- pick up internal reviewed-by tag from Baolin

changes since v5 by addressing comments from Kefeng:
- document return value of memory_failure()
- drop redundant comments in call site of memory_failure()
- make ghes_do_proc void and handle abnormal case within it
- pick up reviewed-by tag from Kefeng Wang

changes since v4 by addressing comments from Xiaofei:
- do a force kill only for abnormal sync errors

changes since v3 by addressing comments from Xiaofei:
- do a force kill for abnormal memory failure errors such as invalid PA,
  unexpected severity, OOM, etc
- pick up tested-by tag from Ma Wupeng

changes since v2 by addressing comments from Naoya:
- rename mce_task_work to sync_task_work
- drop ACPI_HEST_NOTIFY_MCE case in is_hest_sync_notify()
- add steps to reproduce this problem in cover letter

changes since v1:
- identify synchronous events by notify type
- Link: https://lore.kernel.org/lkml/20221206153354.92394-3-xueshuai@linux.alibaba.com/

Shuai Xue (2):
  ACPI: APEI: set memory failure flags as MF_ACTION_REQUIRED on
    synchronous events
  ACPI: APEI: handle synchronous exceptions in task work

 arch/x86/kernel/cpu/mce/core.c |   9 +--
 drivers/acpi/apei/ghes.c       | 113 ++++++++++++++++++++++-----------
 include/acpi/ghes.h            |   3 -
 mm/memory-failure.c            |  17 +----
 4 files changed, 79 insertions(+), 63 deletions(-)
On 2023/4/17 AM9:14, Shuai Xue wrote:
> changes since v6:
> - add more explicty error message suggested by Xiaofei
> - pick up reviewed-by tag from Xiaofei
> - pick up internal reviewed-by tag from Baolin
>
> changes since v5 by addressing comments from Kefeng:
> - document return value of memory_failure()
> - drop redundant comments in call site of memory_failure()
> - make ghes_do_proc void and handle abnormal case within it
> - pick up reviewed-by tag from Kefeng Wang
>
> changes since v4 by addressing comments from Xiaofei:
> - do a force kill only for abnormal sync errors
>
> changes since v3 by addressing comments from Xiaofei:
> - do a force kill for abnormal memory failure error such as invalid PA,
> unexpected severity, OOM, etc
> - pcik up tested-by tag from Ma Wupeng
>
> changes since v2 by addressing comments from Naoya:
> - rename mce_task_work to sync_task_work
> - drop ACPI_HEST_NOTIFY_MCE case in is_hest_sync_notify()
> - add steps to reproduce this problem in cover letter
>
> changes since v1:
> - synchronous events by notify type
> - Link: https://lore.kernel.org/lkml/20221206153354.92394-3-xueshuai@linux.alibaba.com/
>
> Shuai Xue (2):
> ACPI: APEI: set memory failure flags as MF_ACTION_REQUIRED on
> synchronous events
> ACPI: APEI: handle synchronous exceptions in task work
>
> arch/x86/kernel/cpu/mce/core.c | 9 +--
> drivers/acpi/apei/ghes.c | 113 ++++++++++++++++++++++-----------
> include/acpi/ghes.h | 3 -
> mm/memory-failure.c | 17 +----
> 4 files changed, 79 insertions(+), 63 deletions(-)
>

Hi, Rafael,

Gentle ping. Are you happy to queue this patch set into your next tree, so
that we can merge that in next merge window.

Thank you.

Best Regards,
Shuai
On 2023/4/24 14:24, Shuai Xue wrote:
>
>
> On 2023/4/17 AM9:14, Shuai Xue wrote:
>> changes since v6:
>> - add more explicty error message suggested by Xiaofei
>> - pick up reviewed-by tag from Xiaofei
>> - pick up internal reviewed-by tag from Baolin
>>
>> changes since v5 by addressing comments from Kefeng:
>> - document return value of memory_failure()
>> - drop redundant comments in call site of memory_failure()
>> - make ghes_do_proc void and handle abnormal case within it
>> - pick up reviewed-by tag from Kefeng Wang
>>
>> changes since v4 by addressing comments from Xiaofei:
>> - do a force kill only for abnormal sync errors
>>
>> changes since v3 by addressing comments from Xiaofei:
>> - do a force kill for abnormal memory failure error such as invalid PA,
>> unexpected severity, OOM, etc
>> - pcik up tested-by tag from Ma Wupeng
>>
>> changes since v2 by addressing comments from Naoya:
>> - rename mce_task_work to sync_task_work
>> - drop ACPI_HEST_NOTIFY_MCE case in is_hest_sync_notify()
>> - add steps to reproduce this problem in cover letter
>>
>> changes since v1:
>> - synchronous events by notify type
>> - Link: https://lore.kernel.org/lkml/20221206153354.92394-3-xueshuai@linux.alibaba.com/
>>
>> Shuai Xue (2):
>> ACPI: APEI: set memory failure flags as MF_ACTION_REQUIRED on
>> synchronous events
>> ACPI: APEI: handle synchronous exceptions in task work
>>
>> arch/x86/kernel/cpu/mce/core.c | 9 +--
>> drivers/acpi/apei/ghes.c | 113 ++++++++++++++++++++++-----------
>> include/acpi/ghes.h | 3 -
>> mm/memory-failure.c | 17 +----
>> 4 files changed, 79 insertions(+), 63 deletions(-)
>>
>
> Hi, Rafael,
>
> Gentle ping. Are you happy to queue this patch set into your next tree, so that we can merge
> that in next merge window.
>
> Thank you.
>

Gentle ping :)

Thanks.

> Best Regards,
> Shuai
Hi, ALL,

I have rewritten the cover letter with the hope that the maintainer will
truly understand the necessity of this patch. Both Alibaba and Huawei met
the same issue in products, and we hope it could be fixed ASAP.

changes since v7:
- rebase to Linux v6.6-rc2 (no code changed)
- rewrite the cover letter to explain the motivation of this patchset

changes since v6:
- add more explicit error message suggested by Xiaofei
- pick up reviewed-by tag from Xiaofei
- pick up internal reviewed-by tag from Baolin

changes since v5 by addressing comments from Kefeng:
- document return value of memory_failure()
- drop redundant comments in call site of memory_failure()
- make ghes_do_proc void and handle abnormal case within it
- pick up reviewed-by tag from Kefeng Wang

changes since v4 by addressing comments from Xiaofei:
- do a force kill only for abnormal sync errors

changes since v3 by addressing comments from Xiaofei:
- do a force kill for abnormal memory failure errors such as invalid PA,
  unexpected severity, OOM, etc
- pick up tested-by tag from Ma Wupeng

changes since v2 by addressing comments from Naoya:
- rename mce_task_work to sync_task_work
- drop ACPI_HEST_NOTIFY_MCE case in is_hest_sync_notify()
- add steps to reproduce this problem in cover letter

changes since v1:
- identify synchronous events by notify type
- Link: https://lore.kernel.org/lkml/20221206153354.92394-3-xueshuai@linux.alibaba.com/

There are two major types of uncorrected recoverable (UCR) errors:

- Action Required (AR): The error is detected after the processor has
  already consumed the corrupted memory. The OS must take action (for
  example, offline the failing page or kill the failing thread) to recover
  from this error.

- Action Optional (AO): The error is detected outside the processor
  execution context. Some data in memory is corrupted, but it has not been
  consumed yet. The OS may optionally take action to recover from this
  error.

The main difference between AR and AO errors is that AR errors are
synchronous events, while AO errors are asynchronous events. Synchronous
exceptions, such as Machine Check Exception (MCE) on X86 and Synchronous
External Abort (SEA) on Arm64, are signaled by the hardware when an error is
detected and the memory access has architecturally been executed.

Currently, both synchronous and asynchronous errors are queued as AO errors
and handled by a dedicated kernel thread in a work queue on the ARM64
platform. For synchronous errors, memory_failure() is synced using a
cancel_work_sync() trick to ensure that the corrupted page is unmapped and
poisoned. Upon returning to user-space, the process resumes at the current
instruction, triggering a page fault. As a result, the kernel sends a SIGBUS
signal to the current process due to VM_FAULT_HWPOISON.

However, this trick is not always effective. This patch set improves the
recovery process in three specific aspects:

1. Handle synchronous exceptions with proper si_code

ghes_handle_memory_failure() queues both synchronous and asynchronous errors
with flags=0. The kernel then notifies the user-space process by sending a
SIGBUS signal from memory_failure() with the wrong si_code: BUS_MCEERR_AO
instead of BUS_MCEERR_AR. User-space processes rely on the si_code to decide
how to handle a memory failure.

For example, hwpoison-aware user-space processes use the si_code
BUS_MCEERR_AO for 'action optional' early notifications, and BUS_MCEERR_AR
for 'action required' synchronous/late notifications. Specifically, when a
SIGBUS with BUS_MCEERR_AR is delivered to QEMU, it injects a vSEA into the
guest kernel. In contrast, a SIGBUS with BUS_MCEERR_AO is ignored by
QEMU.[1]

Fix it by setting memory failure flags to MF_ACTION_REQUIRED on synchronous
events. (PATCH 1)

2. Handle abnormal memory_failure() failures to avoid an unnecessary reboot

A process may map the faulty page while memory_failure() returns abnormally
before try_to_unmap(), for example when the faulty page mapped by the
process is a KSM page. In this case, arm64 cannot use the page fault path to
terminate the synchronous exception loop.[4]

This loop can potentially exceed the platform firmware threshold or even
trigger a kernel hard lockup, leading to a system reboot. However, the
kernel has the capability to recover from this error.

Fix it by performing a force kill when memory_failure() fails abnormally or
when other abnormal synchronous errors occur. These errors can include
situations such as invalid PA, unexpected severity, no memory failure config
support, invalid GUID section, OOM, etc. (PATCH 2)

3. Handle memory_failure() in the context of the process consuming poison

When synchronous errors occur, memory_failure() assumes that the current
process context is exactly the one that consumed the poison and took the
synchronous error.

For example, kill_accessing_process() takes the mmap lock of current->mm,
walks the page table to find the error virtual address, and sends SIGBUS to
the current process with error info. However, the mm of a kworker is not
valid, resulting in a null-pointer dereference. I have fixed this in [3]:

    commit 77677cdbc2aa mm,hwpoison: check mm when killing accessing process

Another example is that collect_procs()/kill_procs() walk the task list and
only collect and send SIGBUS to the task that consumed the poison. But
memory_failure() is queued and handled by a dedicated kernel thread on the
arm64 platform.

Fix it by queuing memory_failure() as a task work which runs in the current
execution context to synchronously send SIGBUS before ret_to_user. (PATCH 2)

** In summary, this patch set handles synchronous errors in task work with
proper si_code so that hwpoison-aware processes can recover from errors, and
fixes (potentially) abnormal cases. **

Lv Ying and XiuQi from Huawei also proposed to address a similar
problem[2][4]. Thanks to them for the discussion.

To reproduce this problem:

    # STEP1: enable early kill mode
    #sysctl -w vm.memory_failure_early_kill=1
    vm.memory_failure_early_kill = 1

    # STEP2: inject an UCE error and consume it to trigger a synchronous error
    #einj_mem_uc single
    0: single vaddr = 0xffffb0d75400 paddr = 4092d55b400
    injecting ...
    triggering ...
    signal 7 code 5 addr 0xffffb0d75000
    page not present
    Test passed

The si_code (code 5) reported by einj_mem_uc indicates a BUS_MCEERR_AO
error, which is not correct for a consumed error.

After this patch set:

    # STEP1: enable early kill mode
    #sysctl -w vm.memory_failure_early_kill=1
    vm.memory_failure_early_kill = 1

    # STEP2: inject an UCE error and consume it to trigger a synchronous error
    #einj_mem_uc single
    0: single vaddr = 0xffffb0d75400 paddr = 4092d55b400
    injecting ...
    triggering ...
    signal 7 code 4 addr 0xffffb0d75000
    page not present
    Test passed

The si_code (code 4) reported by einj_mem_uc indicates a BUS_MCEERR_AR
error, as expected.

[1] Add ARMv8 RAS virtualization support in QEMU https://patchew.org/QEMU/20200512030609.19593-1-gengdongjiu@huawei.com/
[2] https://lore.kernel.org/lkml/20221205115111.131568-3-lvying6@huawei.com/
[3] https://lkml.kernel.org/r/20220914064935.7851-1-xueshuai@linux.alibaba.com
[4] https://lore.kernel.org/lkml/20221209095407.383211-1-lvying6@huawei.com/

Shuai Xue (2):
  ACPI: APEI: set memory failure flags as MF_ACTION_REQUIRED on
    synchronous events
  ACPI: APEI: handle synchronous exceptions in task work

 arch/x86/kernel/cpu/mce/core.c |   9 +--
 drivers/acpi/apei/ghes.c       | 113 ++++++++++++++++++++++-----------
 include/acpi/ghes.h            |   3 -
 mm/memory-failure.c            |  17 +----
 4 files changed, 79 insertions(+), 63 deletions(-)
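As a rough illustration of aspect 1 (this is a stand-alone model, not the patch itself), the flag selection can be sketched as below. It follows the shape of the original ghes_handle_memory_failure() hunk quoted at the top of the thread and assumes, per the changelog, that only the SEA notify type is treated as synchronous. All identifiers here (`notify_type`, `severity`, `MODEL_MF_*`, `select_mf_flags()`) are hypothetical stand-ins for the kernel definitions, and the numeric values are chosen only for this model.

```c
#include <stdbool.h>

/* Stand-ins for the HEST notification types relevant to this model. */
enum notify_type { NOTIFY_SCI, NOTIFY_SEA, NOTIFY_SDEI };

/* Stand-ins for the GHES severities. */
enum severity { SEV_NO, SEV_CORRECTED, SEV_RECOVERABLE, SEV_PANIC };

/* Stand-ins for the memory failure flags; -1 means "do not queue". */
#define MODEL_MF_NONE            (-1)
#define MODEL_MF_ACTION_OPTIONAL 0
#define MODEL_MF_ACTION_REQUIRED (1 << 1)
#define MODEL_MF_SOFT_OFFLINE    (1 << 3)

/* Assumption from the changelog: synchronous events are identified by
 * notify type, and the MCE case was dropped, leaving SEA. */
static bool is_sync_notify(enum notify_type type)
{
	return type == NOTIFY_SEA;
}

int select_mf_flags(enum notify_type notify, enum severity sev,
		    enum severity sec_sev, bool threshold_exceeded)
{
	int flags = MODEL_MF_NONE;

	/* Corrected errors over threshold: soft-offline the page. */
	if (sec_sev == SEV_CORRECTED && threshold_exceeded)
		flags = MODEL_MF_SOFT_OFFLINE;

	/* Recoverable errors: action required only if synchronous. */
	if (sev == SEV_RECOVERABLE && sec_sev == SEV_RECOVERABLE)
		flags = is_sync_notify(notify) ? MODEL_MF_ACTION_REQUIRED
					       : MODEL_MF_ACTION_OPTIONAL;

	return flags;
}
```

In this model a recoverable error reported through SEA yields the action-required flag, while the same error reported through an asynchronous notification keeps the previous value of 0.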
Hi, ALL,

I have rewritten the cover letter with the hope that the maintainer will
truly understand the necessity of this patch. Both Alibaba and Huawei met
the same issue in products, and we hope it could be fixed ASAP.

## Changes Log

changes since v8:
- remove the bug fix tag of patch 2 (per Jarkko Sakkinen)
- remove the declaration of memory_failure_queue_kick (per Naoya Horiguchi)
- rewrite the return value comments of memory_failure (per Naoya Horiguchi)

changes since v7:
- rebase to Linux v6.6-rc2 (no code changed)
- rewrite the cover letter to explain the motivation of this patchset

changes since v6:
- add more explicit error message suggested by Xiaofei
- pick up reviewed-by tag from Xiaofei
- pick up internal reviewed-by tag from Baolin

changes since v5 by addressing comments from Kefeng:
- document return value of memory_failure()
- drop redundant comments in call site of memory_failure()
- make ghes_do_proc void and handle abnormal case within it
- pick up reviewed-by tag from Kefeng Wang

changes since v4 by addressing comments from Xiaofei:
- do a force kill only for abnormal sync errors

changes since v3 by addressing comments from Xiaofei:
- do a force kill for abnormal memory failure errors such as invalid PA,
  unexpected severity, OOM, etc
- pick up tested-by tag from Ma Wupeng

changes since v2 by addressing comments from Naoya:
- rename mce_task_work to sync_task_work
- drop ACPI_HEST_NOTIFY_MCE case in is_hest_sync_notify()
- add steps to reproduce this problem in cover letter

changes since v1:
- identify synchronous events by notify type
- Link: https://lore.kernel.org/lkml/20221206153354.92394-3-xueshuai@linux.alibaba.com/

## Cover Letter

There are two major types of uncorrected recoverable (UCR) errors:

- Action Required (AR): The error is detected after the processor has
  already consumed the corrupted memory. The OS must take action (for
  example, offline the failing page or kill the failing thread) to recover
  from this error.

- Action Optional (AO): The error is detected outside the processor
  execution context. Some data in memory is corrupted, but it has not been
  consumed yet. The OS may optionally take action to recover from this
  error.

The main difference between AR and AO errors is that AR errors are
synchronous events, while AO errors are asynchronous events. Synchronous
exceptions, such as Machine Check Exception (MCE) on X86 and Synchronous
External Abort (SEA) on Arm64, are signaled by the hardware when an error is
detected and the memory access has architecturally been executed.

Currently, both synchronous and asynchronous errors are queued as AO errors
and handled by a dedicated kernel thread in a work queue on the ARM64
platform. For synchronous errors, memory_failure() is synced using a
cancel_work_sync() trick to ensure that the corrupted page is unmapped and
poisoned. Upon returning to user-space, the process resumes at the current
instruction, triggering a page fault. As a result, the kernel sends a SIGBUS
signal to the current process due to VM_FAULT_HWPOISON.

However, this trick is not always effective. This patch set improves the
recovery process in three specific aspects:

1. Handle synchronous exceptions with proper si_code

ghes_handle_memory_failure() queues both synchronous and asynchronous errors
with flags=0. The kernel then notifies the user-space process by sending a
SIGBUS signal from memory_failure() with the wrong si_code: BUS_MCEERR_AO
instead of BUS_MCEERR_AR. User-space processes rely on the si_code to decide
how to handle a memory failure.

For example, hwpoison-aware user-space processes use the si_code
BUS_MCEERR_AO for 'action optional' early notifications, and BUS_MCEERR_AR
for 'action required' synchronous/late notifications. Specifically, when a
SIGBUS with BUS_MCEERR_AR is delivered to QEMU, it injects a vSEA into the
guest kernel. In contrast, a SIGBUS with BUS_MCEERR_AO is ignored by
QEMU.[1]

Fix it by setting memory failure flags to MF_ACTION_REQUIRED on synchronous
events. (PATCH 1)

2. Handle abnormal memory_failure() failures to avoid an unnecessary reboot

A process may map the faulty page while memory_failure() returns abnormally
before try_to_unmap(), for example when the faulty page mapped by the
process is a KSM page. In this case, arm64 cannot use the page fault path to
terminate the synchronous exception loop.[4]

This loop can potentially exceed the platform firmware threshold or even
trigger a kernel hard lockup, leading to a system reboot. However, the
kernel has the capability to recover from this error.

Fix it by performing a force kill when memory_failure() fails abnormally or
when other abnormal synchronous errors occur. These errors can include
situations such as invalid PA, unexpected severity, no memory failure config
support, invalid GUID section, OOM, etc. (PATCH 2)

3. Handle memory_failure() in the context of the process consuming poison

When synchronous errors occur, memory_failure() assumes that the current
process context is exactly the one that consumed the poison and took the
synchronous error.

For example, kill_accessing_process() takes the mmap lock of current->mm,
walks the page table to find the error virtual address, and sends SIGBUS to
the current process with error info. However, the mm of a kworker is not
valid, resulting in a null-pointer dereference. I have fixed this in [3]:

    commit 77677cdbc2aa mm,hwpoison: check mm when killing accessing process

Another example is that collect_procs()/kill_procs() walk the task list and
only collect and send SIGBUS to the task that consumed the poison. But
memory_failure() is queued and handled by a dedicated kernel thread on the
arm64 platform.

Fix it by queuing memory_failure() as a task work which runs in the current
execution context to synchronously send SIGBUS before ret_to_user. (PATCH 2)

** In summary, this patch set handles synchronous errors in task work with
proper si_code so that hwpoison-aware processes can recover from errors, and
fixes (potentially) abnormal cases. **

Lv Ying and XiuQi from Huawei also proposed to address a similar
problem[2][4]. Thanks to them for the discussion.

## Steps to Reproduce This Problem

To reproduce this problem:

    # STEP1: enable early kill mode
    #sysctl -w vm.memory_failure_early_kill=1
    vm.memory_failure_early_kill = 1

    # STEP2: inject an UCE error and consume it to trigger a synchronous error
    #einj_mem_uc single
    0: single vaddr = 0xffffb0d75400 paddr = 4092d55b400
    injecting ...
    triggering ...
    signal 7 code 5 addr 0xffffb0d75000
    page not present
    Test passed

The si_code (code 5) reported by einj_mem_uc indicates a BUS_MCEERR_AO
error, which is not correct for a consumed error.

After this patch set:

    # STEP1: enable early kill mode
    #sysctl -w vm.memory_failure_early_kill=1
    vm.memory_failure_early_kill = 1

    # STEP2: inject an UCE error and consume it to trigger a synchronous error
    #einj_mem_uc single
    0: single vaddr = 0xffffb0d75400 paddr = 4092d55b400
    injecting ...
    triggering ...
    signal 7 code 4 addr 0xffffb0d75000
    page not present
    Test passed

The si_code (code 4) reported by einj_mem_uc indicates a BUS_MCEERR_AR
error, as expected.
[1] Add ARMv8 RAS virtualization support in QEMU https://patchew.org/QEMU/20200512030609.19593-1-gengdongjiu@huawei.com/ [2] https://lore.kernel.org/lkml/20221205115111.131568-3-lvying6@huawei.com/ [3] https://lkml.kernel.org/r/20220914064935.7851-1-xueshuai@linux.alibaba.com [4] https://lore.kernel.org/lkml/20221209095407.383211-1-lvying6@huawei.com/ Shuai Xue (2): ACPI: APEI: set memory failure flags as MF_ACTION_REQUIRED on synchronous events ACPI: APEI: handle synchronous exceptions in task work arch/x86/kernel/cpu/mce/core.c | 9 +-- drivers/acpi/apei/ghes.c | 113 ++++++++++++++++++++++----------- include/acpi/ghes.h | 3 - include/linux/mm.h | 1 - mm/memory-failure.c | 22 ++----- 5 files changed, 82 insertions(+), 66 deletions(-)
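As a minimal user-space sketch of what relying on the si_code means in practice: a hwpoison-aware process can branch on the si_code of a SIGBUS to tell an 'action required' notification from an 'action optional' one. The helper names (classify_sigbus, classify_siginfo, install_sigbus_handler) and the enum are hypothetical, chosen only for illustration; they are not taken from QEMU or from this patch set. The fallback values 4 and 5 are the codes printed by einj_mem_uc in the transcripts of the cover letter.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <signal.h>
#include <string.h>

/* Fallbacks for C libraries that do not expose these codes; 4 and 5 are
 * the values printed by einj_mem_uc in the cover letter. */
#ifndef BUS_MCEERR_AR
#define BUS_MCEERR_AR 4
#endif
#ifndef BUS_MCEERR_AO
#define BUS_MCEERR_AO 5
#endif

/* Hypothetical classification, used only in this sketch. */
enum mce_action {
	MCE_NOT_MEMORY_ERROR,
	MCE_ACTION_REQUIRED,	/* poison was consumed: must act now */
	MCE_ACTION_OPTIONAL,	/* poison detected but not consumed */
};

static enum mce_action classify_sigbus(int signo, int si_code)
{
	if (signo != SIGBUS)
		return MCE_NOT_MEMORY_ERROR;

	switch (si_code) {
	case BUS_MCEERR_AR:
		return MCE_ACTION_REQUIRED;
	case BUS_MCEERR_AO:
		return MCE_ACTION_OPTIONAL;
	default:
		return MCE_NOT_MEMORY_ERROR;
	}
}

/* Convenience wrapper for code that already holds a siginfo_t. */
static enum mce_action classify_siginfo(const siginfo_t *info)
{
	assert(info != NULL);
	return classify_sigbus(info->si_signo, info->si_code);
}

/* Last classification seen by the handler; a real program would act on it. */
static volatile sig_atomic_t last_action = MCE_NOT_MEMORY_ERROR;

static void sigbus_handler(int signo, siginfo_t *info, void *ucontext)
{
	(void)ucontext;
	last_action = classify_sigbus(signo, info->si_code);
}

/* Returns 0 on success, -1 on failure (as sigaction() does). */
static int install_sigbus_handler(void)
{
	struct sigaction sa;

	memset(&sa, 0, sizeof(sa));
	sa.sa_sigaction = sigbus_handler;
	sa.sa_flags = SA_SIGINFO;	/* required to receive si_code */
	sigemptyset(&sa.sa_mask);
	return sigaction(SIGBUS, &sa, NULL);
}
```

With such a handler, the "code 5" run in the reproduction steps would be classified as action optional and the "code 4" run as action required, which is the distinction the cover letter says user space depends on.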
Hi, ALL,

Gentle ping.

Best Regards,
Shuai
On Sat, Oct 07, 2023 at 03:28:16PM +0800, Shuai Xue wrote:
> However, this trick is not always effective

So far so good.

What's missing here is why "this trick" is not always effective.

Basically to explain what exactly the problem is.

> For example, hwpoison-aware user-space processes use the si_code:
> BUS_MCEERR_AO for 'action optional' early notifications, and BUS_MCEERR_AR
> for 'action required' synchronous/late notifications. Specifically, when a
> SIGBUS with si_code BUS_MCEERR_AR is delivered to QEMU, it will inject a vSEA
> into the guest kernel. In contrast, a SIGBUS with si_code BUS_MCEERR_AO will be
> ignored by QEMU.[1]
>
> Fix it by setting memory failure flags as MF_ACTION_REQUIRED on synchronous events. (PATCH 1)

So you're fixing qemu by "fixing" the kernel?

This doesn't make any sense.

Make errors which are ACPI_HEST_NOTIFY_SEA type return
MF_ACTION_REQUIRED so that it *happens* to fix your use case.

Sounds like a lot of nonsense to me.

What is the issue here you're trying to solve?

> 2. Handle abnormal memory_failure() failures to avoid an unnecessary reboot
>
> If a process maps the faulty page, but memory_failure() returns abnormally before
> try_to_unmap(), for example because the faulty page mapped by the process is a
> KSM page, arm64 cannot use the page fault process to terminate the
> synchronous exception loop.[4]
>
> This loop can potentially exceed the platform firmware threshold or even trigger
> a kernel hard lockup, leading to a system reboot. However, the kernel has the
> capability to recover from this error.
>
> Fix it by performing a force kill when memory_failure() fails abnormally or when
> other abnormal synchronous errors occur.

Just like that?

Without giving the process the opportunity to even save its other data?

So this all is still very confusing, patches definitely need splitting
and this whole thing needs restraint.

You go and do this: you split *each* issue you're addressing into
a separate patch and explain it like this:

---
1. Prepare the context for the explanation briefly.

2. Explain the problem at hand.

3. "It happens because of <...>"

4. "Fix it by doing X"

5. "(Potentially do Y)."
---

and each patch explains *exactly* *one* issue, what happens, why it
happens and just the fix for it and *why* it is needed.

Otherwise, this is unreviewable.

Thx.
On 2023/11/23 23:07, Borislav Petkov wrote:

Hi, Borislav,

Thank you for your reply and advice.

> On Sat, Oct 07, 2023 at 03:28:16PM +0800, Shuai Xue wrote:
>> However, this trick is not always effective
>
> So far so good.
>
> What's missing here is why "this trick" is not always effective.
>
> Basically to explain what exactly the problem is.

I think the main point is that this trick is not effective for AR errors,
because:

- an AR error consumed by the current process is deferred to be handled in a
  dedicated kernel thread, but memory_failure() assumes that it runs in the
  current context
- another page fault is unnecessary: we can send SIGBUS to the current
  process on the first Synchronous External Abort (SEA) on arm64 (analogous
  to a Machine Check Exception on x86)

>
>> For example, hwpoison-aware user-space processes use the si_code:
>> BUS_MCEERR_AO for 'action optional' early notifications, and BUS_MCEERR_AR
>> for 'action required' synchronous/late notifications. Specifically, when a
>> SIGBUS with si_code BUS_MCEERR_AR is delivered to QEMU, it will inject a vSEA
>> into the guest kernel. In contrast, a SIGBUS with si_code BUS_MCEERR_AO will be
>> ignored by QEMU.[1]
>>
>> Fix it by setting memory failure flags as MF_ACTION_REQUIRED on synchronous events. (PATCH 1)
>
> So you're fixing qemu by "fixing" the kernel?
>
> This doesn't make any sense.

I just gave an example to show that user-space processes *really* rely on the
si_code of the signal to handle hardware errors.

>
> Make errors which are ACPI_HEST_NOTIFY_SEA type return
> MF_ACTION_REQUIRED so that it *happens* to fix your use case.
>
> Sounds like a lot of nonsense to me.
>
> What is the issue here you're trying to solve?

The SIGBUS si_codes are defined in include/uapi/asm-generic/siginfo.h as:

    /* hardware memory error consumed on a machine check: action required */
    #define BUS_MCEERR_AR	4
    /* hardware memory error detected in process but not consumed: action optional*/
    #define BUS_MCEERR_AO	5

When a synchronous error is consumed by the guest, the kernel should send a
signal with BUS_MCEERR_AR instead of BUS_MCEERR_AO.

>
>> 2. Handle abnormal memory_failure() failures to avoid an unnecessary reboot
>>
>> If a process maps the faulty page, but memory_failure() returns abnormally before
>> try_to_unmap(), for example because the faulty page mapped by the process is a
>> KSM page, arm64 cannot use the page fault process to terminate the
>> synchronous exception loop.[4]
>>
>> This loop can potentially exceed the platform firmware threshold or even trigger
>> a kernel hard lockup, leading to a system reboot. However, the kernel has the
>> capability to recover from this error.
>>
>> Fix it by performing a force kill when memory_failure() fails abnormally or when
>> other abnormal synchronous errors occur.
>
> Just like that?
>
> Without giving the process the opportunity to even save its other data?

Exactly.

>
> So this all is still very confusing, patches definitely need splitting
> and this whole thing needs restraint.
>
> You go and do this: you split *each* issue you're addressing into
> a separate patch and explain it like this:
>
> ---
> 1. Prepare the context for the explanation briefly.
>
> 2. Explain the problem at hand.
>
> 3. "It happens because of <...>"
>
> 4. "Fix it by doing X"
>
> 5. "(Potentially do Y)."
> ---
>
> and each patch explains *exactly* *one* issue, what happens, why it
> happens and just the fix for it and *why* it is needed.
>
> Otherwise, this is unreviewable.

Thank you for your valuable suggestion. I will split the patches and resubmit
a new patch set.

>
> Thx.
>

Best Regards,
Shuai
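The behaviour argued for in the reply above can be modelled in a few lines of user-space C. This is only a sketch under stated assumptions: the enum, the MODEL_* constants and the model_* functions are local stand-ins invented for illustration, not the kernel's definitions. The only values taken from the thread are the si_codes 4 and 5 quoted from siginfo.h. It assumes, as the cover letter and the review describe, that SEA notifications are the ones treated as synchronous and that an error queued with flags 0 ends up reported as 'action optional'.

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-in notification types; NOT the kernel's ACPI_HEST_NOTIFY_* values. */
enum model_notify {
	MODEL_NOTIFY_POLLED,
	MODEL_NOTIFY_SCI,
	MODEL_NOTIFY_SEA,
};

/* Stand-in flag bit; the real MF_ACTION_REQUIRED lives in kernel headers. */
#define MODEL_MF_ACTION_REQUIRED	(1 << 1)

/* si_codes as quoted from include/uapi/asm-generic/siginfo.h. */
#define MODEL_BUS_MCEERR_AR	4
#define MODEL_BUS_MCEERR_AO	5

static_assert(MODEL_BUS_MCEERR_AR != MODEL_BUS_MCEERR_AO,
	      "user space must be able to tell the two codes apart");

/* Assumption from the thread: only SEA is treated as synchronous. */
static bool model_is_sync_notify(enum model_notify type)
{
	return type == MODEL_NOTIFY_SEA;
}

/* 'patched' selects between the behaviour before and after PATCH 1. */
static int model_mf_flags(enum model_notify type, bool patched)
{
	if (patched && model_is_sync_notify(type))
		return MODEL_MF_ACTION_REQUIRED;
	return 0;	/* current behaviour: everything is queued with flags 0 */
}

/* si_code that the model reports to user space for the given flags. */
static int model_si_code(int mf_flags)
{
	return (mf_flags & MODEL_MF_ACTION_REQUIRED) ?
		MODEL_BUS_MCEERR_AR : MODEL_BUS_MCEERR_AO;
}
```

In this model, a SEA yields code 5 before the change and code 4 after it, while asynchronous notifications keep code 5 in both cases.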
On Sat, Nov 25, 2023 at 02:44:52PM +0800, Shuai Xue wrote:
> - an AR error consumed by the current process is deferred to be handled in a
>   dedicated kernel thread, but memory_failure() assumes that it runs in the
>   current context

On x86? ARM?

Please point to the exact code flow.

> - another page fault is unnecessary: we can send SIGBUS to the current
>   process on the first Synchronous External Abort (SEA) on arm64 (analogous
>   to a Machine Check Exception on x86)

I have no clue what that means. What page fault?

> I just gave an example to show that user-space processes *really* rely on the
> si_code of the signal to handle hardware errors.

No, don't give examples.

Explain what the exact problem is you're seeing, in your use case, point
to the code and then state how you think it should be fixed and why.

Right now your text is "all over the place" and I have no clue what you
even want.

> The SIGBUS si_codes are defined in include/uapi/asm-generic/siginfo.h as:
>
>     /* hardware memory error consumed on a machine check: action required */
>     #define BUS_MCEERR_AR	4
>     /* hardware memory error detected in process but not consumed: action optional*/
>     #define BUS_MCEERR_AO	5
>
> When a synchronous error is consumed by the guest, the kernel should send a
> signal with BUS_MCEERR_AR instead of BUS_MCEERR_AO.

Can you drop this "synchronous" bla and concentrate on the error
*severity*?

I think you want to say that there are some types of errors for which
error handling needs to happen immediately and for some reason that
doesn't happen.

Which errors are those? Types?

Why do you need them to be handled immediately?

> Exactly.

No, not exactly. Why is it ok to do that? What are the implications of
this?

Is immediate killing the right decision?

Is this ok for *every* possible kernel running out there - not only for
your use case?

And so on and so on...
On 2023/11/25 20:10, Borislav Petkov wrote:

Hi, Borislav,

Thank you for your reply, and sorry for the confusion I caused. Please see my
reply inline.

Best Regards,
Shuai

> On Sat, Nov 25, 2023 at 02:44:52PM +0800, Shuai Xue wrote:
>> - an AR error consumed by the current process is deferred to be handled in a
>>   dedicated kernel thread, but memory_failure() assumes that it runs in the
>>   current context
>
> On x86? ARM?
>
> Please point to the exact code flow.

An AR error consumed by the current process is deferred to be handled in a
dedicated kernel thread on the ARM platform. The AR error is handled in the
flow below:

-----------------------------------------------------------------------------
[user space task einj_mem_uc consumed data poison, CPU 3]       STEP 0

-----------------------------------------------------------------------------
[ghes_sdei_critical_callback: current einj_mem_uc, CPU 3]       STEP 1
ghes_sdei_critical_callback
    => __ghes_sdei_callback
        => ghes_in_nmi_queue_one_entry // peek and read estatus
        => irq_work_queue(&ghes_proc_irq_work) <=> ghes_proc_in_irq // irq_work
[ghes_sdei_critical_callback: return]
-----------------------------------------------------------------------------
[ghes_proc_in_irq: current einj_mem_uc, CPU 3]                  STEP 2
    => ghes_do_proc
        => ghes_handle_memory_failure
            => ghes_do_memory_failure
                => memory_failure_queue // put work task on current CPU
                    => if (kfifo_put(&mf_cpu->fifo, entry))
                           schedule_work_on(smp_processor_id(), &mf_cpu->work);
    => task_work_add(current, &estatus_node->task_work, TWA_RESUME);
[ghes_proc_in_irq: return]
-----------------------------------------------------------------------------
// kworker preempts einj_mem_uc on CPU 3 due to RESCHED flag    STEP 3
[memory_failure_work_func: current kworker, CPU 3]
    => memory_failure_work_func(&mf_cpu->work)
        => while kfifo_get(&mf_cpu->fifo, &entry); // until get no work
            => memory_failure(entry.pfn, entry.flags);
-----------------------------------------------------------------------------
[ghes_kick_task_work: current einj_mem_uc, other cpu]           STEP 4
    => memory_failure_queue_kick
        => cancel_work_sync - waiting memory_failure_work_func finish
            => memory_failure_work_func(&mf_cpu->work)
                => kfifo_get(&mf_cpu->fifo, &entry); // no work
-----------------------------------------------------------------------------
[einj_mem_uc resume at the same PC, trigger a page fault]       STEP 5

STEP 0: A user-space task named einj_mem_uc consumes a poison. The firmware
notifies the kernel of the hardware error through SDEI
(ACPI_HEST_NOTIFY_SOFTWARE_DELEGATED).

STEP 1: The swapper running on CPU 3 is interrupted. irq_work_queue() raises
an irq_work to handle hardware errors in IRQ context.

STEP 2: In IRQ context, ghes_proc_in_irq() queues memory failure work on the
current CPU in a workqueue and adds a task work to sync with the workqueue.

STEP 3: The kworker preempts the currently running thread and gets CPU 3. Then
memory_failure() is processed in the kworker.

STEP 4: ghes_kick_task_work() is called as task_work to ensure any queued
work has completed before returning to user-space.

STEP 5: Upon returning to user-space, the task einj_mem_uc resumes at the
current instruction. Because the poisoned page was unmapped by
memory_failure() in step 3, a page fault is triggered.

memory_failure() assumes that it runs in the current context on both x86
and ARM platforms.

For example, memory_failure() in mm/memory-failure.c:

    if (flags & MF_ACTION_REQUIRED) {
        folio = page_folio(p);
        res = kill_accessing_process(current, folio_pfn(folio), flags);
    }

>
>> - another page fault is unnecessary: we can send SIGBUS to the current
>>   process on the first Synchronous External Abort (SEA) on arm64 (analogous
>>   to a Machine Check Exception on x86)
>
> I have no clue what that means. What page fault?

I mean the page fault in step 5. We can simplify the above flow by queuing
memory_failure() as a task work for AR errors in step 3 directly.

>
>> I just gave an example to show that user-space processes *really* rely on the
>> si_code of the signal to handle hardware errors.
>
> No, don't give examples.
>
> Explain what the exact problem is you're seeing, in your use case, point
> to the code and then state how you think it should be fixed and why.
>
> Right now your text is "all over the place" and I have no clue what you
> even want.

Ok, got it. Thank you.

>
>> The SIGBUS si_codes are defined in include/uapi/asm-generic/siginfo.h as:
>>
>>     /* hardware memory error consumed on a machine check: action required */
>>     #define BUS_MCEERR_AR	4
>>     /* hardware memory error detected in process but not consumed: action optional*/
>>     #define BUS_MCEERR_AO	5
>>
>> When a synchronous error is consumed by the guest, the kernel should send a
>> signal with BUS_MCEERR_AR instead of BUS_MCEERR_AO.
>
> Can you drop this "synchronous" bla and concentrate on the error
> *severity*?
>
> I think you want to say that there are some types of errors for which
> error handling needs to happen immediately and for some reason that
> doesn't happen.
>
> Which errors are those? Types?
>
> Why do you need them to be handled immediately?

Well, the severities defined on the x86 and ARM platforms are quite different.
I guess you mean the taxonomy of producer error types.

- X86: Software recoverable action required (SRAR)

  A UCR error that *requires* system software to take a recovery action on
  this processor *before scheduling another stream of execution on this
  processor*.
  (15.6.3 UCR Error Classification in Intel® 64 and IA-32 Architectures
  Software Developer’s Manual Volume 3)

- ARM: Recoverable state (UER)

  The PE determines that software *must* take action to locate and repair
  the error to successfully recover execution. This might be because the
  exception was taken before the error was architecturally consumed by
  the PE, at the point when the PE was not be able to make correct
  progress without either consuming the error or *otherwise making the
  state of the PE unrecoverable*.
  (2.3.2 PE error state classification in Arm RAS Supplement
  https://documentation-service.arm.com/static/63185614f72fad1903828eda)

I think the above two types of error need to be handled immediately.

>
>> Exactly.
>
> No, not exactly. Why is it ok to do that? What are the implications of
> this?
>
> Is immediate killing the right decision?
>
> Is this ok for *every* possible kernel running out there - not only for
> your use case?
>
> And so on and so on...
>

I don't have a clear answer here. I guess the poisoned data only affects the
user-space task which triggers the exception. A panic is not necessary.

On the x86 platform, the current error handling of memory_failure() in
kill_me_maybe() just forcibly sends a SIGBUS.

    kill_me_maybe():

        ret = memory_failure(pfn, flags);
        if (ret == -EHWPOISON || ret == -EOPNOTSUPP)
            return;

        pr_err("Memory error not recovered");
        kill_me_now(cb);

Do you have any comments or suggestions about this? I don't change the x86
behavior.

On the arm64 platform, in step 3 of the above flow, memory_failure_work_func(),
the call site of memory_failure(), does not handle the return code of
memory_failure(). I just add the same behavior.
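The return-code handling quoted from kill_me_maybe() can be reduced to a small decision function. The following is a user-space sketch only: model_needs_force_kill is a hypothetical name, the fallback value for EHWPOISON is an assumption used only when <errno.h> does not provide it, and treating a return value of 0 as "recovered" is added here because the quoted snippet does not show that branch.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Assumption: fallback used only when the C library lacks EHWPOISON. */
#ifndef EHWPOISON
#define EHWPOISON 133
#endif

static_assert(EHWPOISON != EOPNOTSUPP,
	      "the two tolerated error codes are expected to be distinct");

/*
 * Hypothetical helper mirroring the quoted logic: given the return value
 * of a memory_failure()-like call, decide whether the caller should fall
 * back to forcibly killing the task.
 */
static bool model_needs_force_kill(int ret)
{
	/* Assumption added in this sketch: 0 means the error was handled. */
	if (ret == 0)
		return false;

	/* As in the quoted snippet: these two results are not escalated. */
	if (ret == -EHWPOISON || ret == -EOPNOTSUPP)
		return false;

	/* Anything else corresponds to "Memory error not recovered". */
	return true;
}
```

Under this model, results such as -EFAULT or -ENOMEM lead to the force kill, which is the behaviour the reply above says it adds on the arm64 side.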
Moving James to To:

On Sun, Nov 26, 2023 at 08:25:38PM +0800, Shuai Xue wrote:
> > On Sat, Nov 25, 2023 at 02:44:52PM +0800, Shuai Xue wrote:
> >> - an AR error consumed by the current process is deferred to be handled in a
> >>   dedicated kernel thread, but memory_failure() assumes that it runs in the
> >>   current context
> >
> > On x86? ARM?
> >
> > Please point to the exact code flow.
>
> An AR error consumed by the current process is deferred to be handled in a
> dedicated kernel thread on the ARM platform. The AR error is handled in the
> flow below:
>
> -----------------------------------------------------------------------------
> [user space task einj_mem_uc consumed data poison, CPU 3]       STEP 0
>
> -----------------------------------------------------------------------------
> [ghes_sdei_critical_callback: current einj_mem_uc, CPU 3]       STEP 1
> ghes_sdei_critical_callback
>     => __ghes_sdei_callback
>         => ghes_in_nmi_queue_one_entry // peek and read estatus
>         => irq_work_queue(&ghes_proc_irq_work) <=> ghes_proc_in_irq // irq_work
> [ghes_sdei_critical_callback: return]
> -----------------------------------------------------------------------------
> [ghes_proc_in_irq: current einj_mem_uc, CPU 3]                  STEP 2
>     => ghes_do_proc
>         => ghes_handle_memory_failure
>             => ghes_do_memory_failure
>                 => memory_failure_queue // put work task on current CPU
>                     => if (kfifo_put(&mf_cpu->fifo, entry))
>                            schedule_work_on(smp_processor_id(), &mf_cpu->work);
>     => task_work_add(current, &estatus_node->task_work, TWA_RESUME);
> [ghes_proc_in_irq: return]
> -----------------------------------------------------------------------------
> // kworker preempts einj_mem_uc on CPU 3 due to RESCHED flag    STEP 3
> [memory_failure_work_func: current kworker, CPU 3]
>     => memory_failure_work_func(&mf_cpu->work)
>         => while kfifo_get(&mf_cpu->fifo, &entry); // until get no work
>             => memory_failure(entry.pfn, entry.flags);

From the comment above that function:

 * The function is primarily of use for corruptions that
 * happen outside the current execution context (e.g. when
 * detected by a background scrubber)
 *
 * Must run in process context (e.g. a work queue) with interrupts
 * enabled and no spinlocks held.

> -----------------------------------------------------------------------------
> [ghes_kick_task_work: current einj_mem_uc, other cpu]           STEP 4
>     => memory_failure_queue_kick
>         => cancel_work_sync - waiting memory_failure_work_func finish
>             => memory_failure_work_func(&mf_cpu->work)
>                 => kfifo_get(&mf_cpu->fifo, &entry); // no work
> -----------------------------------------------------------------------------
> [einj_mem_uc resume at the same PC, trigger a page fault]       STEP 5
>
> STEP 0: A user-space task named einj_mem_uc consumes a poison. The firmware
> notifies the kernel of the hardware error through SDEI
> (ACPI_HEST_NOTIFY_SOFTWARE_DELEGATED).
>
> STEP 1: The swapper running on CPU 3 is interrupted. irq_work_queue() raises
> an irq_work to handle hardware errors in IRQ context.
>
> STEP 2: In IRQ context, ghes_proc_in_irq() queues memory failure work on the
> current CPU in a workqueue and adds a task work to sync with the workqueue.
>
> STEP 3: The kworker preempts the currently running thread and gets CPU 3. Then
> memory_failure() is processed in the kworker.

See above.

> STEP 4: ghes_kick_task_work() is called as task_work to ensure any queued
> work has completed before returning to user-space.
>
> STEP 5: Upon returning to user-space, the task einj_mem_uc resumes at the
> current instruction. Because the poisoned page was unmapped by
> memory_failure() in step 3, a page fault is triggered.
>
> memory_failure() assumes that it runs in the current context on both x86
> and ARM platforms.
>
> For example, memory_failure() in mm/memory-failure.c:
>
>     if (flags & MF_ACTION_REQUIRED) {
>         folio = page_folio(p);
>         res = kill_accessing_process(current, folio_pfn(folio), flags);
>     }

And?

Do you see the check above it?

	if (TestSetPageHWPoison(p)) {

test_and_set_bit() returns true only when the page was poisoned already.

 * This function is intended to handle "Action Required" MCEs on already
 * hardware poisoned pages. They could happen, for example, when
 * memory_failure() failed to unmap the error page at the first call, or
 * when multiple local machine checks happened on different CPUs.

And that's kill_accessing_process().

So AFAIU, the kworker running memory_failure() would only mark the page
as poison.

The killing happens when memory_failure() runs again and the process
touches the page again.

But I'd let James confirm here.

I still don't know what you're fixing here.

Is this something you're encountering on some machine or you simply
stared at code?

What does that

"Both Alibaba and Huawei met the same issue in products, and we hope it
could be fixed ASAP."

mean?

What did you meet?

What was the problem?

I still note that you're avoiding answering the question what the issue
is and if you keep avoiding it, I'll ignore this whole thread.
On 2023/11/30 02:54, Borislav Petkov wrote: > Moving James to To: > > On Sun, Nov 26, 2023 at 08:25:38PM +0800, Shuai Xue wrote: >>> On Sat, Nov 25, 2023 at 02:44:52PM +0800, Shuai Xue wrote: >>>> - an AR error consumed by current process is deferred to handle in a >>>> dedicated kernel thread, but memory_failure() assumes that it runs in the >>>> current context >>> >>> On x86? ARM? >>> >>> Pease point to the exact code flow. >> >> An AR error consumed by current process is deferred to handle in a >> dedicated kernel thread on ARM platform. The AR error is handled in bellow >> flow: >> >> ----------------------------------------------------------------------------- >> [usr space task einj_mem_uc consumd data poison, CPU 3] STEP 0 >> >> ----------------------------------------------------------------------------- >> [ghes_sdei_critical_callback: current einj_mem_uc, CPU 3] STEP 1 >> ghes_sdei_critical_callback >> => __ghes_sdei_callback >> => ghes_in_nmi_queue_one_entry // peak and read estatus >> => irq_work_queue(&ghes_proc_irq_work) <=> ghes_proc_in_irq // irq_work >> [ghes_sdei_critical_callback: return] >> ----------------------------------------------------------------------------- >> [ghes_proc_in_irq: current einj_mem_uc, CPU 3] STEP 2 >> => ghes_do_proc >> => ghes_handle_memory_failure >> => ghes_do_memory_failure >> => memory_failure_queue // put work task on current CPU >> => if (kfifo_put(&mf_cpu->fifo, entry)) >> schedule_work_on(smp_processor_id(), &mf_cpu->work); >> => task_work_add(current, &estatus_node->task_work, TWA_RESUME); >> [ghes_proc_in_irq: return] >> ----------------------------------------------------------------------------- >> // kworker preempts einj_mem_uc on CPU 3 due to RESCHED flag STEP 3 >> [memory_failure_work_func: current kworker, CPU 3] >> => memory_failure_work_func(&mf_cpu->work) >> => while kfifo_get(&mf_cpu->fifo, &entry); // until get no work >> => memory_failure(entry.pfn, entry.flags); > > From the comment above that 
function: > > * The function is primarily of use for corruptions that > * happen outside the current execution context (e.g. when > * detected by a background scrubber) > * > * Must run in process context (e.g. a work queue) with interrupts > * enabled and no spinlocks held. Hi, Borislav, Thank you for your comments. But we are talking about Action Required error, it does happen *inside the current execution context*. The Action Required error does not meet the function comments. > >> ----------------------------------------------------------------------------- >> [ghes_kick_task_work: current einj_mem_uc, other cpu] STEP 4 >> => memory_failure_queue_kick >> => cancel_work_sync - waiting memory_failure_work_func finish >> => memory_failure_work_func(&mf_cpu->work) >> => kfifo_get(&mf_cpu->fifo, &entry); // no work >> ----------------------------------------------------------------------------- >> [einj_mem_uc resume at the same PC, trigger a page fault STEP 5 >> >> STEP 0: A user space task, named einj_mem_uc consume a poison. The firmware >> notifies hardware error to kernel through is SDEI >> (ACPI_HEST_NOTIFY_SOFTWARE_DELEGATED). >> >> STEP 1: The swapper running on CPU 3 is interrupted. irq_work_queue() rasie >> a irq_work to handle hardware errors in IRQ context >> >> STEP2: In IRQ context, ghes_proc_in_irq() queues memory failure work on >> current CPU in workqueue and add task work to sync with the workqueue. >> >> STEP3: The kworker preempts the current running thread and get CPU 3. Then >> memory_failure() is processed in kworker. > > See above. > >> STEP4: ghes_kick_task_work() is called as task_work to ensure any queued >> workqueue has been done before returning to user-space. >> >> STEP5: Upon returning to user-space, the task einj_mem_uc resumes at the >> current instruction, because the poison page is unmapped by >> memory_failure() in step 3, so a page fault will be triggered. 
>>
>> memory_failure() assumes that it runs in the current context on both x86
>> and ARM platform.
>>
>>
>> for example:
>> 	memory_failure() in mm/memory-failure.c:
>>
>> 		if (flags & MF_ACTION_REQUIRED) {
>> 			folio = page_folio(p);
>> 			res = kill_accessing_process(current, folio_pfn(folio), flags);
>> 		}
>
> And?
>
> Do you see the check above it?
>
> 	if (TestSetPageHWPoison(p)) {
>
> test_and_set_bit() returns true only when the page was poisoned already.
>
>  * This function is intended to handle "Action Required" MCEs on already
>  * hardware poisoned pages. They could happen, for example, when
>  * memory_failure() failed to unmap the error page at the first call, or
>  * when multiple local machine checks happened on different CPUs.
>
> And that's kill_accessing_process().
>
> So AFAIU, the kworker running memory_failure() would only mark the page
> as poison.
>
> The killing happens when memory_failure() runs again and the process
> touches the page again.

When an Action Required error occurs, it triggers an MCE-like exception
(SEA). The first call of memory_failure() poisons the page. If it fails to
unmap the error page, the user-space task resumes at the current PC and
triggers another SEA exception; the second call of memory_failure() then runs
into kill_accessing_process(), which does nothing and just returns -EFAULT. As
a result, a third SEA exception is triggered. Finally, an exception loop
occurs, resulting in a hard lockup panic.

>
> But I'd let James confirm here.
>
>
> I still don't know what you're fixing here.

On the ARM64 platform, when an Action Required error occurs, the kernel should
send SIGBUS with si_code BUS_MCEERR_AR instead of BUS_MCEERR_AO. (That is also
the subject of this thread.)

>
> Is this something you're encountering on some machine or you simply
> stared at code?

I hit the wrong si_code problem on a Yitian 710 machine, which is based on the
ARM64 platform. And I think it is general on the ARM64 platform.
To reproduce this problem:

	# STEP1: enable early kill mode
	#sysctl -w vm.memory_failure_early_kill=1
	vm.memory_failure_early_kill = 1

	# STEP2: inject an UCE error and consume it to trigger a synchronous error
	#einj_mem_uc single
	0: single vaddr = 0xffffb0d75400 paddr = 4092d55b400
	injecting ...
	triggering ...
	signal 7 code 5 addr 0xffffb0d75000
	page not present
	Test passed

The si_code (code 5) from einj_mem_uc indicates a BUS_MCEERR_AO error, which
is not correct for a synchronous error.

After this patch set:

	# STEP1: enable early kill mode
	#sysctl -w vm.memory_failure_early_kill=1
	vm.memory_failure_early_kill = 1

	# STEP2: inject an UCE error and consume it to trigger a synchronous error
	#einj_mem_uc single
	0: single vaddr = 0xffffb0d75400 paddr = 4092d55b400
	injecting ...
	triggering ...
	signal 7 code 4 addr 0xffffb0d75000
	page not present
	Test passed

The si_code (code 4) from einj_mem_uc indicates a BUS_MCEERR_AR error, as
expected.

>
> What does that
>
> "Both Alibaba and Huawei met the same issue in products, and we hope it
> could be fixed ASAP."
>
> mean?
>
> What did you meet?
>
> What was the problem?

We both got the wrong si_code for SIGBUS from the kernel on the ARM64
platform.

The VMM in our product relies on the si_code of SIGBUS to handle memory
failure in userspace.

- For BUS_MCEERR_AO, we regard the corruption as happening *outside the
  current execution context*, e.g. detected by a background scrubber; the VMM
  will ignore the error and the VM will not be killed immediately.
- For BUS_MCEERR_AR, we regard the corruption as happening *inside the
  current execution context*, e.g. when a data poison is consumed; the VMM
  will kill the VM immediately to avoid any further propagation of the
  corrupted data.

>
> I still note that you're avoiding answering the question what the issue
> is and if you keep avoiding it, I'll ignore this whole thread.
>

Sorry, Borislav, and thank you for your patience and time. I really appreciate
that you are getting involved in reviewing this patchset.
But I have to say that is not true: I am not avoiding anything. I have tried
my best to answer every comment you raised and to give the details of the ARM
RAS specifics and the code flow.

Best Regards,
Shuai
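The VMM policy described in this message (ignore on BUS_MCEERR_AO, kill the VM on BUS_MCEERR_AR) could be sketched in user space roughly as follows. This is only an illustration under stated assumptions: the `vmm_action` enum, `classify_sigbus()`, and `install_sigbus_handler()` are hypothetical names that do not come from the thread or from any real VMM, and the fallback `#define`s simply mirror the "code 4" and "code 5" values shown in the transcript above.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <signal.h>
#include <string.h>

/* Fallbacks in case the libc headers do not expose these names; the values
 * match the "code 4" / "code 5" output in the einj_mem_uc transcript. */
#ifndef BUS_MCEERR_AR
#define BUS_MCEERR_AR 4
#endif
#ifndef BUS_MCEERR_AO
#define BUS_MCEERR_AO 5
#endif

/* Hypothetical policy names, chosen for this sketch only. */
enum vmm_action { VMM_IGNORE, VMM_KILL_VM, VMM_UNKNOWN };

/* Map a SIGBUS si_code to the policy described in the message above. */
static enum vmm_action classify_sigbus(int si_code)
{
	switch (si_code) {
	case BUS_MCEERR_AR:	/* poison consumed in the current context */
		return VMM_KILL_VM;
	case BUS_MCEERR_AO:	/* found out of context, e.g. by a scrubber */
		return VMM_IGNORE;
	default:		/* any other SIGBUS cause */
		return VMM_UNKNOWN;
	}
}

/* Last decision taken by the handler; sig_atomic_t keeps the store safe
 * inside a signal handler. */
static volatile sig_atomic_t last_action = VMM_UNKNOWN;

static void sigbus_handler(int sig, siginfo_t *info, void *ucontext)
{
	(void)sig;
	(void)ucontext;
	last_action = classify_sigbus(info->si_code);
}

/* Install the handler with SA_SIGINFO so si_code is delivered.
 * Returns 0 on success, -1 on failure, like sigaction(). */
static int install_sigbus_handler(void)
{
	struct sigaction sa;

	memset(&sa, 0, sizeof(sa));
	sa.sa_sigaction = sigbus_handler;
	sa.sa_flags = SA_SIGINFO;
	sigemptyset(&sa.sa_mask);
	return sigaction(SIGBUS, &sa, NULL);
}
```

With a policy like this, a synchronous error reported as BUS_MCEERR_AO would be classified as ignorable, which is the misbehaviour the message describes.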
FTR, this is starting to make sense, thanks for explaining.

Replying only to this one for now:

On Thu, Nov 30, 2023 at 10:58:53AM +0800, Shuai Xue wrote:
> To reproduce this problem:
>
> 	# STEP1: enable early kill mode
> 	#sysctl -w vm.memory_failure_early_kill=1
> 	vm.memory_failure_early_kill = 1
>
> 	# STEP2: inject an UCE error and consume it to trigger a synchronous error

So this is for ARM folks to deal with, BUT:

A consumed uncorrectable error on x86 means panic. On some hw like on
AMD, that error doesn't even get seen by the OS but the hw does
something called syncflood to prevent further error propagation. So
there's no action required - the hw does that.

But I'd like to hear from ARM folks whether consuming an uncorrectable
error even lets software run. Dunno.

Thx.
Hi Boris, Shuai, On 29/11/2023 18:54, Borislav Petkov wrote: > On Sun, Nov 26, 2023 at 08:25:38PM +0800, Shuai Xue wrote: >>> On Sat, Nov 25, 2023 at 02:44:52PM +0800, Shuai Xue wrote: >>>> - an AR error consumed by current process is deferred to handle in a >>>> dedicated kernel thread, but memory_failure() assumes that it runs in the >>>> current context >>> >>> On x86? ARM? >>> >>> Pease point to the exact code flow. >> An AR error consumed by current process is deferred to handle in a >> dedicated kernel thread on ARM platform. The AR error is handled in bellow >> flow: Please don't think of errors as "action required" - that's a user-space signal code. If the page could be fixed by memory-failure(), you may never get a signal. (all this was the fix for always sending an action-required signal) I assume you mean the CPU accessed a poisoned location and took a synchronous error. >> ----------------------------------------------------------------------------- >> [usr space task einj_mem_uc consumd data poison, CPU 3] STEP 0 >> >> ----------------------------------------------------------------------------- >> [ghes_sdei_critical_callback: current einj_mem_uc, CPU 3] STEP 1 >> ghes_sdei_critical_callback >> => __ghes_sdei_callback >> => ghes_in_nmi_queue_one_entry // peak and read estatus >> => irq_work_queue(&ghes_proc_irq_work) <=> ghes_proc_in_irq // irq_work >> [ghes_sdei_critical_callback: return] >> ----------------------------------------------------------------------------- >> [ghes_proc_in_irq: current einj_mem_uc, CPU 3] STEP 2 >> => ghes_do_proc >> => ghes_handle_memory_failure >> => ghes_do_memory_failure >> => memory_failure_queue // put work task on current CPU >> => if (kfifo_put(&mf_cpu->fifo, entry)) >> schedule_work_on(smp_processor_id(), &mf_cpu->work); >> => task_work_add(current, &estatus_node->task_work, TWA_RESUME); >> [ghes_proc_in_irq: return] >> ----------------------------------------------------------------------------- >> // kworker 
preempts einj_mem_uc on CPU 3 due to RESCHED flag STEP 3 >> [memory_failure_work_func: current kworker, CPU 3] >> => memory_failure_work_func(&mf_cpu->work) >> => while kfifo_get(&mf_cpu->fifo, &entry); // until get no work >> => memory_failure(entry.pfn, entry.flags); > > From the comment above that function: > > * The function is primarily of use for corruptions that > * happen outside the current execution context (e.g. when > * detected by a background scrubber) > * > * Must run in process context (e.g. a work queue) with interrupts > * enabled and no spinlocks held. > >> ----------------------------------------------------------------------------- >> [ghes_kick_task_work: current einj_mem_uc, other cpu] STEP 4 >> => memory_failure_queue_kick >> => cancel_work_sync - waiting memory_failure_work_func finish >> => memory_failure_work_func(&mf_cpu->work) >> => kfifo_get(&mf_cpu->fifo, &entry); // no work >> ----------------------------------------------------------------------------- >> [einj_mem_uc resume at the same PC, trigger a page fault STEP 5 >> >> STEP 0: A user space task, named einj_mem_uc consume a poison. The firmware >> notifies hardware error to kernel through is SDEI >> (ACPI_HEST_NOTIFY_SOFTWARE_DELEGATED). >> >> STEP 1: The swapper running on CPU 3 is interrupted. irq_work_queue() rasie >> a irq_work to handle hardware errors in IRQ context >> >> STEP2: In IRQ context, ghes_proc_in_irq() queues memory failure work on >> current CPU in workqueue and add task work to sync with the workqueue. >> >> STEP3: The kworker preempts the current running thread and get CPU 3. Then >> memory_failure() is processed in kworker. > > See above. > >> STEP4: ghes_kick_task_work() is called as task_work to ensure any queued >> workqueue has been done before returning to user-space. 
>> >> STEP5: Upon returning to user-space, the task einj_mem_uc resumes at the >> current instruction, because the poison page is unmapped by >> memory_failure() in step 3, so a page fault will be triggered. >> >> memory_failure() assumes that it runs in the current context on both x86 >> and ARM platform. >> >> >> for example: >> memory_failure() in mm/memory-failure.c: >> >> if (flags & MF_ACTION_REQUIRED) { >> folio = page_folio(p); >> res = kill_accessing_process(current, folio_pfn(folio), flags); >> } > > And? > > Do you see the check above it? > > if (TestSetPageHWPoison(p)) { > > test_and_set_bit() returns true only when the page was poisoned already. > > * This function is intended to handle "Action Required" MCEs on already > * hardware poisoned pages. They could happen, for example, when > * memory_failure() failed to unmap the error page at the first call, or > * when multiple local machine checks happened on different CPUs. > > And that's kill_accessing_process(). > > So AFAIU, the kworker running memory_failure() would only mark the page > as poison. > > The killing happens when memory_failure() runs again and the process > touches the page again. > > But I'd let James confirm here. Yes, this is what is expected to happen with the existing code. The first pass will remove the pages from all processes that have it mapped before this user-space task can restart. Restarting the task will make it access a poisoned page, kicking off the second path which delivers the signal. The reason for two passes is send_sig_mceerr() likes to clear_siginfo(), so even if you queued action-required before leaving GHES, memory-failure() would stomp on it. > I still don't know what you're fixing here. The problem is if the user-space process registered for early messages, it gets a signal on the first pass. If it returns from that signal, it will access the poisoned page and get the action-required signal. How is this making Qemu go wrong? 
As to how this works for you given Boris' comments above: kill_procs() is also called from hwpoison_user_mappings(), which takes the flags given to memory-failure(). This is where the action-optional signals come from. Thanks, James
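For readers unfamiliar with "registered for early messages": besides the system-wide vm.memory_failure_early_kill sysctl used in the reproduction steps, Linux documents a per-process policy through prctl(2) with PR_MCE_KILL. The sketch below is an illustration only; the helper names `request_early_kill()`, `mce_kill_policy_name()`, and `current_mce_kill_policy()` are hypothetical, and whether the call succeeds depends on the kernel being built with memory failure support.

```c
#include <sys/prctl.h>

/* Ask the kernel to signal this process as soon as a memory error is found
 * in one of its pages ("early kill"). Returns 0 on success, -1 on failure
 * with errno set, exactly as prctl() does. */
static int request_early_kill(void)
{
	return prctl(PR_MCE_KILL, PR_MCE_KILL_SET, PR_MCE_KILL_EARLY, 0, 0);
}

/* Translate a PR_MCE_KILL_GET result into a printable name. */
static const char *mce_kill_policy_name(int policy)
{
	switch (policy) {
	case PR_MCE_KILL_EARLY:
		return "early";
	case PR_MCE_KILL_LATE:
		return "late";
	case PR_MCE_KILL_DEFAULT:
		return "default";
	default:
		/* includes -1, which prctl() returns on error */
		return "unknown";
	}
}

/* Query the policy currently in effect for the calling process. */
static const char *current_mce_kill_policy(void)
{
	return mce_kill_policy_name(prctl(PR_MCE_KILL_GET, 0, 0, 0, 0));
}
```

A process that opts in this way is the kind James describes: it can receive a signal on the first pass and, if it returns from the handler, touch the poisoned page again.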
Hi Boris, On 30/11/2023 14:40, Borislav Petkov wrote: > FTR, this is starting to make sense, thanks for explaining. > > Replying only to this one for now: > > On Thu, Nov 30, 2023 at 10:58:53AM +0800, Shuai Xue wrote: >> To reproduce this problem: >> >> # STEP1: enable early kill mode >> #sysctl -w vm.memory_failure_early_kill=1 >> vm.memory_failure_early_kill = 1 >> >> # STEP2: inject an UCE error and consume it to trigger a synchronous error > > So this is for ARM folks to deal with, BUT: > > A consumed uncorrectable error on x86 means panic. On some hw like on > AMD, that error doesn't even get seen by the OS but the hw does > something called syncflood to prevent further error propagation. So > there's no any action required - the hw does that. > > But I'd like to hear from ARM folks whether consuming an uncorrectable > error even lets software run. Dunno. I think we mean different things by 'consume' here. I'd assume Shuai's test is poisoning a cache-line. When the CPU tries to access that cache-line it will get an 'external abort' signal back from the memory system. Shuai - is this what you mean by 'consume' - the CPU received external abort from the poisoned cache line? It's then up to the CPU whether it can put the world back in order to take this as synchronous-external-abort or asynchronous-external-abort, which for arm64 are two different interrupt/exception types. The synchronous exceptions can't be masked, but the asynchronous one can. If by the time the asynchronous-external-abort interrupt/exception has been unmasked, the CPU has used the poisoned value in some calculation (which is what we usually mean by consume) which has resulted in a memory access - it will report the error as 'uncontained' because the error has been silently propagated. APEI should always report those a 'fatal', and there is little point getting the OS involved at this point. 
Also in this category are things like 'tag ram corruption', where you can no longer trust
anything about memory.

Everything in this thread is about synchronous errors where this can't happen. The CPU
stops and takes an interrupt/exception instead.

Thanks,

James
On 2023/12/1 01:43, James Morse wrote:
> Hi Boris,
>
> On 30/11/2023 14:40, Borislav Petkov wrote:
>> FTR, this is starting to make sense, thanks for explaining.
>>
>> Replying only to this one for now:
>>
>> On Thu, Nov 30, 2023 at 10:58:53AM +0800, Shuai Xue wrote:
>>> To reproduce this problem:
>>>
>>> 	# STEP1: enable early kill mode
>>> 	#sysctl -w vm.memory_failure_early_kill=1
>>> 	vm.memory_failure_early_kill = 1
>>>
>>> 	# STEP2: inject an UCE error and consume it to trigger a synchronous error
>>
>> So this is for ARM folks to deal with, BUT:
>>
>> A consumed uncorrectable error on x86 means panic. On some hw like on
>> AMD, that error doesn't even get seen by the OS but the hw does
>> something called syncflood to prevent further error propagation. So
>> there's no any action required - the hw does that.

"Consume" here is from the application's point of view, e.g. a memory read.
If poison is enabled, an SRAR error will be detected and an MCE raised at the
point of consumption in the execution flow.

Generic Intel x86 hardware behaves as follows:

1. UE error injected at a known physical address (by einj_mem_uc through the
   EINJ interface).
2. Core issues a memory read to the same physical address (by a single
   memory read).
3. iMC detects the error.
4. HA logs UCA error and signals CMCI if enabled.
5. HA forwards data with poison indication bit set.
6. CBo detects the poison data. Does not log any error.
7. MLC detects the poison data.
8. DCU detects the poison data, logs SRAR error and triggers MCERR if
   recoverable.
9. OS/VMM takes corresponding recovery action based on affected state.

In our example:
- step 2 is triggered by a single memory read.
- step 8: UCR errors detected on data load, MCACOD 134H, triggering MCERR
- step 9: the kernel is expected to send SIGBUS with si_code BUS_MCEERR_AR (code 4)

I also ran the same test on AMD EPYC platforms, e.g. Milan and Genoa, which
behave the same as Intel Xeon platforms, e.g. Icelake and SPR.
The ARMv8.2 RAS extension supports a similar data poison mechanism: a
Synchronous External Abort on arm64 (analogous to a Machine Check Exception
on x86) will be triggered in step 8. See James's comments for details. But the
kernel sends SIGBUS with si_code BUS_MCEERR_AO (code 5), as tested on Alibaba
Yitian 710 and Huawei Kunpeng 920.

>>
>> But I'd like to hear from ARM folks whether consuming an uncorrectable
>> error even lets software run. Dunno.
>
> I think we mean different things by 'consume' here.
>
> I'd assume Shuai's test is poisoning a cache-line. When the CPU tries to access that
> cache-line it will get an 'external abort' signal back from the memory system. Shuai - is
> this what you mean by 'consume' - the CPU received external abort from the poisoned cache
> line?
>

Yes, exactly. Thank you for pointing it out. We are talking about synchronous
errors.

> It's then up to the CPU whether it can put the world back in order to take this as
> synchronous-external-abort or asynchronous-external-abort, which for arm64 are two
> different interrupt/exception types.
> The synchronous exceptions can't be masked, but the asynchronous one can.
> If by the time the asynchronous-external-abort interrupt/exception has been unmasked, the
> CPU has used the poisoned value in some calculation (which is what we usually mean by
> consume) which has resulted in a memory access - it will report the error as 'uncontained'
> because the error has been silently propagated. APEI should always report those a 'fatal',
> and there is little point getting the OS involved at this point. Also in this category are
> things like 'tag ram corruption', where you can no longer trust anything about memory.
>
> Everything in this thread is about synchronous errors where this can't happen. The CPU
> stops and does takes an interrupt/exception instead.
>
>

Thank you for explaining.

Best Regards,
Shuai
On 2023/12/1 01:39, James Morse wrote: > Hi Boris, Shuai, > > On 29/11/2023 18:54, Borislav Petkov wrote: >> On Sun, Nov 26, 2023 at 08:25:38PM +0800, Shuai Xue wrote: >>>> On Sat, Nov 25, 2023 at 02:44:52PM +0800, Shuai Xue wrote: >>>>> - an AR error consumed by current process is deferred to handle in a >>>>> dedicated kernel thread, but memory_failure() assumes that it runs in the >>>>> current context >>>> >>>> On x86? ARM? >>>> >>>> Pease point to the exact code flow. > > >>> An AR error consumed by current process is deferred to handle in a >>> dedicated kernel thread on ARM platform. The AR error is handled in bellow >>> flow: > > Please don't think of errors as "action required" - that's a user-space signal code. If > the page could be fixed by memory-failure(), you may never get a signal. (all this was the > fix for always sending an action-required signal) > > I assume you mean the CPU accessed a poisoned location and took a synchronous error. Yes, I mean that CPU accessed a poisoned location and took a synchronous error. 
> > >>> ----------------------------------------------------------------------------- >>> [usr space task einj_mem_uc consumd data poison, CPU 3] STEP 0 >>> >>> ----------------------------------------------------------------------------- >>> [ghes_sdei_critical_callback: current einj_mem_uc, CPU 3] STEP 1 >>> ghes_sdei_critical_callback >>> => __ghes_sdei_callback >>> => ghes_in_nmi_queue_one_entry // peak and read estatus >>> => irq_work_queue(&ghes_proc_irq_work) <=> ghes_proc_in_irq // irq_work >>> [ghes_sdei_critical_callback: return] >>> ----------------------------------------------------------------------------- >>> [ghes_proc_in_irq: current einj_mem_uc, CPU 3] STEP 2 >>> => ghes_do_proc >>> => ghes_handle_memory_failure >>> => ghes_do_memory_failure >>> => memory_failure_queue // put work task on current CPU >>> => if (kfifo_put(&mf_cpu->fifo, entry)) >>> schedule_work_on(smp_processor_id(), &mf_cpu->work); >>> => task_work_add(current, &estatus_node->task_work, TWA_RESUME); >>> [ghes_proc_in_irq: return] >>> ----------------------------------------------------------------------------- >>> // kworker preempts einj_mem_uc on CPU 3 due to RESCHED flag STEP 3 >>> [memory_failure_work_func: current kworker, CPU 3] >>> => memory_failure_work_func(&mf_cpu->work) >>> => while kfifo_get(&mf_cpu->fifo, &entry); // until get no work >>> => memory_failure(entry.pfn, entry.flags); >> >> From the comment above that function: >> >> * The function is primarily of use for corruptions that >> * happen outside the current execution context (e.g. when >> * detected by a background scrubber) >> * >> * Must run in process context (e.g. a work queue) with interrupts >> * enabled and no spinlocks held. 
>> >>> ----------------------------------------------------------------------------- >>> [ghes_kick_task_work: current einj_mem_uc, other cpu] STEP 4 >>> => memory_failure_queue_kick >>> => cancel_work_sync - waiting memory_failure_work_func finish >>> => memory_failure_work_func(&mf_cpu->work) >>> => kfifo_get(&mf_cpu->fifo, &entry); // no work >>> ----------------------------------------------------------------------------- >>> [einj_mem_uc resume at the same PC, trigger a page fault STEP 5 >>> >>> STEP 0: A user space task, named einj_mem_uc consume a poison. The firmware >>> notifies hardware error to kernel through is SDEI >>> (ACPI_HEST_NOTIFY_SOFTWARE_DELEGATED). >>> >>> STEP 1: The swapper running on CPU 3 is interrupted. irq_work_queue() rasie >>> a irq_work to handle hardware errors in IRQ context >>> >>> STEP2: In IRQ context, ghes_proc_in_irq() queues memory failure work on >>> current CPU in workqueue and add task work to sync with the workqueue. >>> >>> STEP3: The kworker preempts the current running thread and get CPU 3. Then >>> memory_failure() is processed in kworker. >> >> See above. >> >>> STEP4: ghes_kick_task_work() is called as task_work to ensure any queued >>> workqueue has been done before returning to user-space. >>> >>> STEP5: Upon returning to user-space, the task einj_mem_uc resumes at the >>> current instruction, because the poison page is unmapped by >>> memory_failure() in step 3, so a page fault will be triggered. >>> >>> memory_failure() assumes that it runs in the current context on both x86 >>> and ARM platform. >>> >>> >>> for example: >>> memory_failure() in mm/memory-failure.c: >>> >>> if (flags & MF_ACTION_REQUIRED) { >>> folio = page_folio(p); >>> res = kill_accessing_process(current, folio_pfn(folio), flags); >>> } >> >> And? >> >> Do you see the check above it? >> >> if (TestSetPageHWPoison(p)) { >> >> test_and_set_bit() returns true only when the page was poisoned already. 
>>
>>  * This function is intended to handle "Action Required" MCEs on already
>>  * hardware poisoned pages. They could happen, for example, when
>>  * memory_failure() failed to unmap the error page at the first call, or
>>  * when multiple local machine checks happened on different CPUs.
>>
>> And that's kill_accessing_process().
>>
>> So AFAIU, the kworker running memory_failure() would only mark the page
>> as poison.
>>
>> The killing happens when memory_failure() runs again and the process
>> touches the page again.
>>
>> But I'd let James confirm here.
>
> Yes, this is what is expected to happen with the existing code.
>
> The first pass will remove the pages from all processes that have it mapped before this
> user-space task can restart. Restarting the task will make it access a poisoned page,
> kicking off the second path which delivers the signal.
>
> The reason for two passes is send_sig_mceerr() likes to clear_siginfo(), so even if you
> queued action-required before leaving GHES, memory-failure() would stomp on it.
>
>
>> I still don't know what you're fixing here.
>
> The problem is if the user-space process registered for early messages, it gets a signal
> on the first pass. If it returns from that signal, it will access the poisoned page and
> get the action-required signal.
>
> How is this making Qemu go wrong?

The problem here is that we need to assume that the first pass of memory
failure handling unmaps the poisoned page successfully.

- If so, it may work via the second-pass action-required signal, because the
  task accesses an unmapped page. But IMHO, we can improve this by sending the
  signal in just one pass, so that the guest will vmexit only once, right?

- If not, there is no second-pass signal. The existing code does not handle
  the error code from memory_failure(), so an exception loop occurs, resulting
  in a hard lockup panic.

Besides, in a production environment, a second access to an already known
poisoned page introduces more risk of error propagation.
>
>
> As to how this works for you given Boris' comments above: kill_procs() is also called from
> hwpoison_user_mappings(), which takes the flags given to memory-failure(). This is where
> the action-optional signals come from.
>
>

Thank you very much for taking the time to review and comment.

Best Regards,
Shuai
## Changes Log

changes since v9:
- split patch 2 to address exactly one issue in one patch (per Borislav)
- rewrite commit log according to template (per Borislav)
- pick up reviewed-by tag of patch 1 from James Morse
- alloc and free twcb through gen_pool_{alloc, free} (per James)
- rewrite cover letter

changes since v8:
- remove the bug fix tag of patch 2 (per Jarkko Sakkinen)
- remove the declaration of memory_failure_queue_kick (per Naoya Horiguchi)
- rewrite the return value comments of memory_failure (per Naoya Horiguchi)

changes since v7:
- rebase to Linux v6.6-rc2 (no code changed)
- rewrite the cover letter to explain the motivation of this patchset

changes since v6:
- add more explicit error message suggested by Xiaofei
- pick up reviewed-by tag from Xiaofei
- pick up internal reviewed-by tag from Baolin

changes since v5 by addressing comments from Kefeng:
- document return value of memory_failure()
- drop redundant comments in call site of memory_failure()
- make ghes_do_proc void and handle abnormal case within it
- pick up reviewed-by tag from Kefeng Wang

changes since v4 by addressing comments from Xiaofei:
- do a force kill only for abnormal sync errors

changes since v3 by addressing comments from Xiaofei:
- do a force kill for abnormal memory failure error such as invalid PA,
  unexpected severity, OOM, etc
- pick up tested-by tag from Ma Wupeng

changes since v2 by addressing comments from Naoya:
- rename mce_task_work to sync_task_work
- drop ACPI_HEST_NOTIFY_MCE case in is_hest_sync_notify()
- add steps to reproduce this problem in cover letter

changes since v1:
- synchronous events by notify type
- Link: https://lore.kernel.org/lkml/20221206153354.92394-3-xueshuai@linux.alibaba.com/

## Cover Letter

There are two major types of uncorrected recoverable (UCR) errors:

- Synchronous error: The error is detected and raised at the point of
  consumption in the execution flow, e.g. when a CPU tries to access
  a poisoned cache line. The CPU will take a synchronous error exception
  such as Synchronous External Abort (SEA) on Arm64 and Machine Check
  Exception (MCE) on X86. The OS is required to take action (for example,
  offline the failed page or kill the failing thread) to recover from this
  uncorrectable error.

- Asynchronous error: The error is detected outside of the processor
  execution context, e.g. when an error is detected by a background scrubber.
  Some data in memory is corrupted, but the data has not been consumed.
  The OS may optionally take action to recover from this uncorrectable error.

Currently, both synchronous and asynchronous errors are queued by
ghes_handle_memory_failure() with flag 0, and handled by a dedicated kernel
thread in a work queue on the ARM64 platform. As a result, the memory failure
recovery sends SIGBUS with the wrong si_code, BUS_MCEERR_AO, for synchronous
errors in early kill mode. The main problem is that the memory_failure() work
is handled in kthread context rather than in the context of the user-space
process that is accessing the corrupt memory location, so kill_proc() sends
SIGBUS with si_code BUS_MCEERR_AO to the user-space process instead of
BUS_MCEERR_AR.

Fix the problem by:

- Patch 1: setting memory_failure() flags to MF_ACTION_REQUIRED on
  synchronous errors.
- Patch 2: performing a force kill if no memory_failure() work is queued for
  synchronous errors.
- Patch 3: a minor comment improvement.
- Patch 4: queueing memory_failure() as a task_work so that the current
  context in memory_failure() belongs exactly to the process consuming the
  poisoned data.

Lv Ying and XiuQi from Huawei also proposed to address a similar problem[2][4].
Thanks to them for the discussion.

## Steps to Reproduce This Problem

To reproduce this problem:

	# STEP1: enable early kill mode
	#sysctl -w vm.memory_failure_early_kill=1
	vm.memory_failure_early_kill = 1

	# STEP2: inject an UCE error and consume it to trigger a synchronous error
	#einj_mem_uc single
	0: single vaddr = 0xffffb0d75400 paddr = 4092d55b400
	injecting ...
	triggering ...
	signal 7 code 5 addr 0xffffb0d75000
	page not present
	Test passed

The si_code (code 5) from einj_mem_uc indicates a BUS_MCEERR_AO error, which
is not correct for a synchronous error.

After this patch set:

	# STEP1: enable early kill mode
	#sysctl -w vm.memory_failure_early_kill=1
	vm.memory_failure_early_kill = 1

	# STEP2: inject an UCE error and consume it to trigger a synchronous error
	#einj_mem_uc single
	0: single vaddr = 0xffffb0d75400 paddr = 4092d55b400
	injecting ...
	triggering ...
	signal 7 code 4 addr 0xffffb0d75000
	page not present
	Test passed

The si_code (code 4) from einj_mem_uc indicates a BUS_MCEERR_AR error, as
expected.

[1] Add ARMv8 RAS virtualization support in QEMU https://patchew.org/QEMU/20200512030609.19593-1-gengdongjiu@huawei.com/
[2] https://lore.kernel.org/lkml/20221205115111.131568-3-lvying6@huawei.com/
[3] https://lkml.kernel.org/r/20220914064935.7851-1-xueshuai@linux.alibaba.com
[4] https://lore.kernel.org/lkml/20221209095407.383211-1-lvying6@huawei.com/

Shuai Xue (4):
  ACPI: APEI: set memory failure flags as MF_ACTION_REQUIRED on synchronous events
  ACPI: APEI: send SIGBUS to current task if synchronous memory error not recovered
  mm: memory-failure: move memory_failure() return value documentation to function declaration
  ACPI: APEI: handle synchronous exceptions in task work

 arch/x86/kernel/cpu/mce/core.c |   9 +--
 drivers/acpi/apei/ghes.c       | 113 ++++++++++++++++++++++---------
 include/acpi/ghes.h            |   3 -
 mm/memory-failure.c            |  22 ++-----
 4 files changed, 82 insertions(+), 65 deletions(-)
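The cover letter's central claim is that the si_code depends on two things at once: whether MF_ACTION_REQUIRED is set, and whether the code doing the kill runs in the context of the task that consumed the poison. A simplified user-space model of that decision might look like the following. It is a sketch based on the description above and not the kernel's kill_proc() source; every `SKETCH_*` name and `sketch_pick_si_code()` are hypothetical, and the flag's bit position is an assumption made only for this model.

```c
#include <stdbool.h>

/* Stand-ins for illustration. The si_code values mirror the "code 4" and
 * "code 5" output in the transcript; the flag bit is an assumption. */
#define SKETCH_BUS_MCEERR_AR		4
#define SKETCH_BUS_MCEERR_AO		5
#define SKETCH_MF_ACTION_REQUIRED	(1u << 1)

/*
 * Model of the choice described in the cover letter: action-required is
 * reported only when the flag is set AND the task being signalled is the
 * one currently running the handling code. Otherwise the signal is
 * reported as action-optional.
 */
static int sketch_pick_si_code(unsigned int flags, bool target_is_current)
{
	if ((flags & SKETCH_MF_ACTION_REQUIRED) && target_is_current)
		return SKETCH_BUS_MCEERR_AR;
	return SKETCH_BUS_MCEERR_AO;
}
```

In this model, flag 0 always yields AO (the situation before patch 1), the flag alone still yields AO when the work runs in a kworker (the situation patch 4 addresses), and only the flag plus the consuming task's own context yields AR.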
## Changes Log

changes since v10:
- rebase to v6.8-rc2

changes since v9:
- split patch 2 to address exactly one issue in one patch (per Borislav)
- rewrite commit log according to template (per Borislav)
- pick up reviewed-by tag of patch 1 from James Morse
- alloc and free twcb through gen_pool_{alloc,free} (per James)
- rewrite cover letter

changes since v8:
- remove the bug fix tag of patch 2 (per Jarkko Sakkinen)
- remove the declaration of memory_failure_queue_kick (per Naoya Horiguchi)
- rewrite the return value comments of memory_failure (per Naoya Horiguchi)

changes since v7:
- rebase to Linux v6.6-rc2 (no code changed)
- rewrite the cover letter to explain the motivation of this patchset

changes since v6:
- add more explicit error message suggested by Xiaofei
- pick up reviewed-by tag from Xiaofei
- pick up internal reviewed-by tag from Baolin

changes since v5 by addressing comments from Kefeng:
- document return value of memory_failure()
- drop redundant comments in call site of memory_failure()
- make ghes_do_proc void and handle abnormal case within it
- pick up reviewed-by tag from Kefeng Wang

changes since v4 by addressing comments from Xiaofei:
- do a force kill only for abnormal sync errors

changes since v3 by addressing comments from Xiaofei:
- do a force kill for abnormal memory failure error such as invalid PA,
  unexpected severity, OOM, etc
- pick up tested-by tag from Ma Wupeng

changes since v2 by addressing comments from Naoya:
- rename mce_task_work to sync_task_work
- drop ACPI_HEST_NOTIFY_MCE case in is_hest_sync_notify()
- add steps to reproduce this problem in cover letter

changes since v1:
- synchronous events by notify type
- Link: https://lore.kernel.org/lkml/20221206153354.92394-3-xueshuai@linux.alibaba.com/

## Cover Letter

There are two major types of uncorrected recoverable (UCR) errors:

- Synchronous error: The error is detected and raised at the point of
  consumption in the execution flow, e.g. when a CPU tries to access
  a poisoned cache line. The CPU takes a synchronous error exception
  such as Synchronous External Abort (SEA) on Arm64 or Machine Check
  Exception (MCE) on X86. The OS is required to take action (for example,
  offline the failing page or kill the failing thread) to recover from
  this uncorrectable error.

- Asynchronous error: The error is detected outside of processor
  execution context, e.g. by a background scrubber. Some data in memory
  are corrupted, but the data have not been consumed. Action by the OS
  to recover from this uncorrectable error is optional.

Since commit a70297d22132 ("ACPI: APEI: set memory failure flags as
MF_ACTION_REQUIRED on synchronous events"), the flag MF_ACTION_REQUIRED
can be used to determine whether a synchronous exception has occurred on
the ARM64 platform. When a synchronous exception is detected, the kernel
should terminate the current process, which is accessing the poisoned
page. This is done by sending a SIGBUS signal with the error code
BUS_MCEERR_AR, indicating an action-required machine check error on read.

However, memory failure recovery incorrectly sends SIGBUS with the wrong
error code BUS_MCEERR_AO for synchronous errors in early kill mode, even
when MF_ACTION_REQUIRED is set. The main problem is that, on the ARM64
platform, synchronous errors are queued as memory_failure() work and are
executed in a kernel thread context, not in the user-space process that
encountered the corrupted memory. As a result, when kill_proc() is called
to terminate the process, it sends the incorrect SIGBUS error code because
the context in which it operates is not the one where the error was
triggered.

To this end, fix the problem by:

- Patch 1: performing a force kill if no memory_failure() work is queued
  for synchronous errors.
- Patch 2: a minor comments improvement.
- Patch 3: queueing memory_failure() as a task_work so that it runs in the
  context of the process that is actually consuming the poisoned data,
  and so that it sends SIGBUS with si_code BUS_MCEERR_AR.

Lv Ying and XiuQi from Huawei also proposed to address a similar
problem [2][4]. Thanks to them for the discussion.

## Steps to Reproduce This Problem

To reproduce this problem:

    # STEP1: enable early kill mode
    #sysctl -w vm.memory_failure_early_kill=1
    vm.memory_failure_early_kill = 1

    # STEP2: inject an UCE error and consume it to trigger a synchronous error
    #einj_mem_uc single
    0: single vaddr = 0xffffb0d75400 paddr = 4092d55b400
    injecting ...
    triggering ...
    signal 7 code 5 addr 0xffffb0d75000
    page not present
    Test passed

The si_code (code 5) reported by einj_mem_uc indicates a BUS_MCEERR_AO
error, which is wrong for a synchronous error.

After this patch set:

    # STEP1: enable early kill mode
    #sysctl -w vm.memory_failure_early_kill=1
    vm.memory_failure_early_kill = 1

    # STEP2: inject an UCE error and consume it to trigger a synchronous error
    #einj_mem_uc single
    0: single vaddr = 0xffffb0d75400 paddr = 4092d55b400
    injecting ...
    triggering ...
    signal 7 code 4 addr 0xffffb0d75000
    page not present
    Test passed

The si_code (code 4) reported by einj_mem_uc indicates a BUS_MCEERR_AR
error, as expected.
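The two si_code values seen in the transcripts can be told apart in user space. The following is a minimal sketch, assuming a Linux target whose `<signal.h>` exposes `BUS_MCEERR_AR` and `BUS_MCEERR_AO` (fallback definitions are included in case it does not). The helper name `mce_si_code_name` is hypothetical and is not part of einj_mem_uc.

```c
#include <signal.h>

/* Fallbacks in case the libc headers do not expose the constants.
 * The values match the codes printed in the transcripts (4 and 5). */
#ifndef BUS_MCEERR_AR
#define BUS_MCEERR_AR 4
#endif
#ifndef BUS_MCEERR_AO
#define BUS_MCEERR_AO 5
#endif

/* Hypothetical helper: map a SIGBUS si_code to a readable name. */
static const char *mce_si_code_name(int si_code)
{
	switch (si_code) {
	case BUS_MCEERR_AR:
		/* Action required: the poisoned data was consumed. */
		return "BUS_MCEERR_AR";
	case BUS_MCEERR_AO:
		/* Action optional: the poisoned data was not consumed yet. */
		return "BUS_MCEERR_AO";
	default:
		return "other";
	}
}
```

Applied to the transcripts, code 5 (before the series) maps to `BUS_MCEERR_AO` and code 4 (after the series) maps to `BUS_MCEERR_AR`.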
[1] Add ARMv8 RAS virtualization support in QEMU https://patchew.org/QEMU/20200512030609.19593-1-gengdongjiu@huawei.com/
[2] https://lore.kernel.org/lkml/20221205115111.131568-3-lvying6@huawei.com/
[3] https://lkml.kernel.org/r/20220914064935.7851-1-xueshuai@linux.alibaba.com
[4] https://lore.kernel.org/lkml/20221209095407.383211-1-lvying6@huawei.com/

Shuai Xue (3):
  ACPI: APEI: send SIGBUS to current task if synchronous memory error
    not recovered
  mm: memory-failure: move return value documentation to function
    declaration
  ACPI: APEI: handle synchronous exceptions in task work to send correct
    SIGBUS si_code

 arch/x86/kernel/cpu/mce/core.c |  9 +---
 drivers/acpi/apei/ghes.c       | 84 +++++++++++++++++++++-------------
 include/acpi/ghes.h            |  3 --
 mm/memory-failure.c            | 22 +++------
 4 files changed, 59 insertions(+), 59 deletions(-)
Hi, James and Borislav,

Gentle ping. Any feedback on this new version?

Thank you.

Best Regards,
Shuai

On 2024/2/4 16:01, Shuai Xue wrote:
> [...]
## Changes Log

changes since v11:
- rebase to Linux 6.11-rc6
- fix grammar and typo in commit log (per Borislav)
- remove `sync_` prefix of `sync_task_work` (per Borislav)
- comments flags and description of `task_work` (per Borislav)

changes since v10:
- rebase to v6.8-rc2

changes since v9:
- split patch 2 to address exactly one issue in one patch (per Borislav)
- rewrite commit log according to template (per Borislav)
- pick up reviewed-by tag of patch 1 from James Morse
- alloc and free twcb through gen_pool_{alloc,free} (per James)
- rewrite cover letter

changes since v8:
- remove the bug fix tag of patch 2 (per Jarkko Sakkinen)
- remove the declaration of memory_failure_queue_kick (per Naoya Horiguchi)
- rewrite the return value comments of memory_failure (per Naoya Horiguchi)

changes since v7:
- rebase to Linux v6.6-rc2 (no code changed)
- rewrite the cover letter to explain the motivation of this patchset

changes since v6:
- add more explicit error message suggested by Xiaofei
- pick up reviewed-by tag from Xiaofei
- pick up internal reviewed-by tag from Baolin

changes since v5 by addressing comments from Kefeng:
- document return value of memory_failure()
- drop redundant comments in call site of memory_failure()
- make ghes_do_proc void and handle abnormal case within it
- pick up reviewed-by tag from Kefeng Wang

changes since v4 by addressing comments from Xiaofei:
- do a force kill only for abnormal sync errors

changes since v3 by addressing comments from Xiaofei:
- do a force kill for abnormal memory failure error such as invalid PA,
  unexpected severity, OOM, etc
- pick up tested-by tag from Ma Wupeng

changes since v2 by addressing comments from Naoya:
- rename mce_task_work to sync_task_work
- drop ACPI_HEST_NOTIFY_MCE case in is_hest_sync_notify()
- add steps to reproduce this problem in cover letter

changes since v1:
- synchronous events by notify type
- Link: https://lore.kernel.org/lkml/20221206153354.92394-3-xueshuai@linux.alibaba.com/

## Cover Letter

There are two major types of uncorrected recoverable (UCR) errors:

- Synchronous error: The error is detected and raised at the point of
  consumption in the execution flow, e.g. when a CPU tries to access
  a poisoned cache line. The CPU takes a synchronous error exception
  such as Synchronous External Abort (SEA) on Arm64 or Machine Check
  Exception (MCE) on X86. The OS is required to take action (for example,
  offline the failing page or kill the failing thread) to recover from
  this uncorrectable error.

- Asynchronous error: The error is detected outside of processor
  execution context, e.g. by a background scrubber. Some data in memory
  are corrupted, but the data have not been consumed. Action by the OS
  to recover from this uncorrectable error is optional.

Currently, both synchronous and asynchronous errors use
memory_failure_queue() to schedule memory_failure() to execute in kworker
context. As a result, when a user-space process accesses poisoned data,
a data abort is taken and memory_failure() is executed in the kworker
context, which:

- sends the wrong si_code with the SIGBUS signal in early_kill mode, and
- cannot kill the user-space process in some cases, resulting in an
  infinite loop of synchronous errors

Issue 1: send wrong si_code in early_kill mode

Since commit a70297d22132 ("ACPI: APEI: set memory failure flags as
MF_ACTION_REQUIRED on synchronous events"), the flag MF_ACTION_REQUIRED
can be used to determine whether a synchronous exception has occurred on
the ARM64 platform. When a synchronous exception is detected, the kernel
is expected to terminate the current process, which has accessed the
poisoned page. This is done by sending a SIGBUS signal with the error
code BUS_MCEERR_AR, indicating an action-required machine check error on
read.

However, when kill_proc() is called to terminate the processes that have
the poisoned page mapped, it sends the incorrect SIGBUS error code
BUS_MCEERR_AO because the context in which it operates is not the one
where the error was triggered.
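Why the execution context matters can be illustrated with a small user-space model. This is a sketch, not kernel code: it assumes that the action-required code is chosen only when MF_ACTION_REQUIRED is set and the signalled task is the same task that is running memory_failure(), which is how the paragraph above describes the behaviour of kill_proc(). The names `model_task` and `model_pick_si_code` and the flag value are hypothetical stand-ins.

```c
#include <stdbool.h>

#define MODEL_MF_ACTION_REQUIRED 0x1 /* stand-in value, not the kernel's */
#define MODEL_BUS_MCEERR_AR 4        /* si_code seen after the series */
#define MODEL_BUS_MCEERR_AO 5        /* si_code seen before the series */

/* Minimal stand-in for a task: only its identity matters here. */
struct model_task {
	int pid;
};

/*
 * Pick the si_code for the SIGBUS sent to 'target' while the recovery
 * code runs in 'current_task'. AR is only possible when the signalled
 * task is also the one executing the recovery.
 */
static int model_pick_si_code(int flags, const struct model_task *target,
			      const struct model_task *current_task)
{
	bool action_required = (flags & MODEL_MF_ACTION_REQUIRED) != 0;

	if (action_required && target->pid == current_task->pid)
		return MODEL_BUS_MCEERR_AR;
	return MODEL_BUS_MCEERR_AO;
}
```

In this model, running the recovery from a kworker (a different task than the one that consumed the poison) always yields the action-optional code, even with the flag set. Running it as task work in the consuming process yields the action-required code.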
To reproduce this problem:

    # STEP1: enable early kill mode
    #sysctl -w vm.memory_failure_early_kill=1
    vm.memory_failure_early_kill = 1

    # STEP2: inject an UCE error and consume it to trigger a synchronous error
    #einj_mem_uc single
    0: single vaddr = 0xffffb0d75400 paddr = 4092d55b400
    injecting ...
    triggering ...
    signal 7 code 5 addr 0xffffb0d75000
    page not present
    Test passed

The si_code (code 5) reported by einj_mem_uc indicates a BUS_MCEERR_AO
error, which is wrong for a synchronous error.

To fix it, queue memory_failure() as a task_work so that it runs in
the context of the process that is actually consuming the poisoned data.

After this patch set:

    # STEP1: enable early kill mode
    #sysctl -w vm.memory_failure_early_kill=1
    vm.memory_failure_early_kill = 1

    # STEP2: inject an UCE error and consume it to trigger a synchronous error
    #einj_mem_uc single
    0: single vaddr = 0xffffb0d75400 paddr = 4092d55b400
    injecting ...
    triggering ...
    signal 7 code 4 addr 0xffffb0d75000
    page not present
    Test passed

The si_code (code 4) reported by einj_mem_uc indicates a BUS_MCEERR_AR
error, as expected.

Issue 2: a synchronous error infinite loop due to memory_failure() failed

If a user-space process, e.g. devmem, accesses a poisoned page that
already has the HWPoison flag set, kill_accessing_process() is called to
send SIGBUS to the current process with error info. Because
memory_failure() is executed in the kworker context, it does nothing
but return EFAULT. So devmem accesses the poisoned page and triggers an
exception again, resulting in an infinite loop of synchronous errors.
Such a loop may cause platform firmware to exceed some threshold and
reboot when Linux could have recovered from this error.

To reproduce this problem:

    # STEP 1: inject an UCE error, and kernel will set HWPoison flag for related page
    #einj_mem_uc single
    0: single vaddr = 0xffffb0d75400 paddr = 4092d55b400
    injecting ...
    triggering ...
    signal 7 code 4 addr 0xffffb0d75000
    page not present
    Test passed

    # STEP 2: access the same page and it will trigger a synchronous error infinite loop
    devmem 0x4092d55b400

To fix it, if memory_failure() fails, force kill the current process.

Issue 3: a synchronous error infinite loop due to no memory_failure() queued

No memory_failure() work is queued unless all of the precondition checks
below pass:

- `if (!(mem_err->validation_bits & CPER_MEM_VALID_PA))` in ghes_handle_memory_failure()
- `if (flags == -1)` in ghes_handle_memory_failure()
- `if (!IS_ENABLED(CONFIG_ACPI_APEI_MEMORY_FAILURE))` in ghes_do_memory_failure()
- `if (!pfn_valid(pfn) && !arch_is_platform_page(physical_addr))` in ghes_do_memory_failure()

If any precondition is not met, the user-space process will trigger SEA
again. This loop can potentially exceed the platform firmware threshold
or even trigger a kernel hard lockup, leading to a system reboot.

To fix it, if no memory_failure() work is queued, force kill the current
process.

The handling of memory errors triggered in kernel mode [5] also relies on
this patchset to kill the failing thread.

Lv Ying and XiuQi from Huawei also proposed to address a similar
problem [2][4]. Thanks to them for the discussion.
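The Issue 3 checks and the proposed fallback can be summarised in a user-space model. This is a sketch under stated assumptions: the struct fields mirror the four preconditions listed above, while `model_event`, `model_can_queue`, `model_handle` and the outcome names are hypothetical and do not exist in ghes.c. Leaving asynchronous errors alone when nothing is queued is also an assumption, based on the issue being described only for synchronous errors.

```c
#include <stdbool.h>

/* One reported memory error, reduced to the facts the checks look at. */
struct model_event {
	bool sync;            /* raised by a synchronous notification such as SEA */
	bool pa_valid;        /* CPER_MEM_VALID_PA is set in validation_bits */
	int flags;            /* -1 means no memory_failure() action was selected */
	bool apei_mf_enabled; /* CONFIG_ACPI_APEI_MEMORY_FAILURE is enabled */
	bool pfn_usable;      /* pfn_valid() or arch_is_platform_page() holds */
};

enum model_outcome {
	MODEL_QUEUED,     /* memory_failure() work was queued */
	MODEL_FORCE_KILL, /* nothing queued for a sync error: kill current task */
	MODEL_NO_ACTION   /* nothing queued for an async error (assumption) */
};

/* All four preconditions must hold before any work is queued. */
static bool model_can_queue(const struct model_event *ev)
{
	return ev->pa_valid && ev->flags != -1 &&
	       ev->apei_mf_enabled && ev->pfn_usable;
}

static enum model_outcome model_handle(const struct model_event *ev)
{
	if (model_can_queue(ev))
		return MODEL_QUEUED;
	/*
	 * Without this fallback, a synchronous error with nothing queued
	 * would return to the faulting instruction and trigger again.
	 */
	if (ev->sync)
		return MODEL_FORCE_KILL;
	return MODEL_NO_ACTION;
}
```

The point of the model is that a synchronous error has exactly two exits, queued recovery or a forced kill, so the faulting instruction is never simply retried.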
[1] Add ARMv8 RAS virtualization support in QEMU https://patchew.org/QEMU/20200512030609.19593-1-gengdongjiu@huawei.com/
[2] https://lore.kernel.org/lkml/20221205115111.131568-3-lvying6@huawei.com/
[3] https://lkml.kernel.org/r/20220914064935.7851-1-xueshuai@linux.alibaba.com
[4] https://lore.kernel.org/lkml/20221209095407.383211-1-lvying6@huawei.com/
[5] https://patchwork.kernel.org/project/linux-arm-kernel/cover/20240528085915.1955987-1-tongtiangen@huawei.com/

Shuai Xue (3):
  ACPI: APEI: send SIGBUS to current task if synchronous memory error
    not recovered
  mm: memory-failure: move return value documentation to function
    declaration
  ACPI: APEI: handle synchronous exceptions in task work

 arch/x86/kernel/cpu/mce/core.c |  7 ---
 drivers/acpi/apei/ghes.c       | 86 +++++++++++++++++++++-------------
 include/acpi/ghes.h            |  3 --
 include/linux/mm.h             |  1 -
 mm/memory-failure.c            | 22 +++------
 5 files changed, 60 insertions(+), 59 deletions(-)
Hi, all,

Gentle ping.

Best Regards,
Shuai

On 2024/9/2 11:00, Shuai Xue wrote:
> [...]
## Changes Log

changes since v12:
- tweak error message for force kill (per Jarkko)
- fix comments style (per Jarkko)
- fix commit log typo (per Jarkko)

changes since v11:
- rebase to Linux 6.11-rc6
- fix grammar and typo in commit log (per Borislav)
- remove `sync_` prefix of `sync_task_work` (per Borislav)
- comments flags and description of `task_work` (per Borislav)

changes since v10:
- rebase to v6.8-rc2

changes since v9:
- split patch 2 to address exactly one issue in one patch (per Borislav)
- rewrite commit log according to template (per Borislav)
- pick up reviewed-by tag of patch 1 from James Morse
- alloc and free twcb through gen_pool_{alloc,free} (per James)
- rewrite cover letter

changes since v8:
- remove the bug fix tag of patch 2 (per Jarkko Sakkinen)
- remove the declaration of memory_failure_queue_kick (per Naoya Horiguchi)
- rewrite the return value comments of memory_failure (per Naoya Horiguchi)

changes since v7:
- rebase to Linux v6.6-rc2 (no code changed)
- rewrite the cover letter to explain the motivation of this patchset

changes since v6:
- add more explicit error message suggested by Xiaofei
- pick up reviewed-by tag from Xiaofei
- pick up internal reviewed-by tag from Baolin

changes since v5 by addressing comments from Kefeng:
- document return value of memory_failure()
- drop redundant comments in call site of memory_failure()
- make ghes_do_proc void and handle abnormal case within it
- pick up reviewed-by tag from Kefeng Wang

changes since v4 by addressing comments from Xiaofei:
- do a force kill only for abnormal sync errors

changes since v3 by addressing comments from Xiaofei:
- do a force kill for abnormal memory failure error such as invalid PA,
  unexpected severity, OOM, etc
- pick up tested-by tag from Ma Wupeng

changes since v2 by addressing comments from Naoya:
- rename mce_task_work to sync_task_work
- drop ACPI_HEST_NOTIFY_MCE case in is_hest_sync_notify()
- add steps to reproduce this problem in cover letter

changes since v1:
- synchronous events by notify type
- Link: https://lore.kernel.org/lkml/20221206153354.92394-3-xueshuai@linux.alibaba.com/

## Cover Letter

There are two major types of uncorrected recoverable (UCR) errors:

- Synchronous error: The error is detected and raised at the point of
  consumption in the execution flow, e.g. when a CPU tries to access
  a poisoned cache line. The CPU takes a synchronous error exception
  such as Synchronous External Abort (SEA) on Arm64 or Machine Check
  Exception (MCE) on X86. The OS is required to take action (for example,
  offline the failing page or kill the failing thread) to recover from
  this uncorrectable error.

- Asynchronous error: The error is detected outside of processor
  execution context, e.g. by a background scrubber. Some data in memory
  are corrupted, but the data have not been consumed. Action by the OS
  to recover from this uncorrectable error is optional.

Currently, both synchronous and asynchronous errors use
memory_failure_queue() to schedule memory_failure() to execute in kworker
context. As a result, when a user-space process accesses poisoned data,
a data abort is taken and memory_failure() is executed in the kworker
context, which:

- sends the wrong si_code with the SIGBUS signal in early_kill mode, and
- cannot kill the user-space process in some cases, resulting in an
  infinite loop of synchronous errors

Issue 1: send wrong si_code in early_kill mode

Since commit a70297d22132 ("ACPI: APEI: set memory failure flags as
MF_ACTION_REQUIRED on synchronous events"), the flag MF_ACTION_REQUIRED
can be used to determine whether a synchronous exception has occurred on
the ARM64 platform. When a synchronous exception is detected, the kernel
is expected to terminate the current process, which has accessed the
poisoned page. This is done by sending a SIGBUS signal with the error
code BUS_MCEERR_AR, indicating an action-required machine check error on
read.
However, when kill_proc() is called to terminate the processes that have
the poisoned page mapped, it sends the incorrect SIGBUS error code
BUS_MCEERR_AO because the context in which it operates is not the one
where the error was triggered.

To reproduce this problem:

    # STEP1: enable early kill mode
    #sysctl -w vm.memory_failure_early_kill=1
    vm.memory_failure_early_kill = 1

    # STEP2: inject an UCE error and consume it to trigger a synchronous error
    #einj_mem_uc single
    0: single vaddr = 0xffffb0d75400 paddr = 4092d55b400
    injecting ...
    triggering ...
    signal 7 code 5 addr 0xffffb0d75000
    page not present
    Test passed

The si_code (code 5) reported by einj_mem_uc indicates a BUS_MCEERR_AO
error, which is wrong for a synchronous error.

To fix it, queue memory_failure() as a task_work so that it runs in
the context of the process that is actually consuming the poisoned data.

After this patch set:

    # STEP1: enable early kill mode
    #sysctl -w vm.memory_failure_early_kill=1
    vm.memory_failure_early_kill = 1

    # STEP2: inject an UCE error and consume it to trigger a synchronous error
    #einj_mem_uc single
    0: single vaddr = 0xffffb0d75400 paddr = 4092d55b400
    injecting ...
    triggering ...
    signal 7 code 4 addr 0xffffb0d75000
    page not present
    Test passed

The si_code (code 4) reported by einj_mem_uc indicates a BUS_MCEERR_AR
error, as expected.

Issue 2: a synchronous error infinite loop due to memory_failure() failed

If a user-space process, e.g. devmem, accesses a poisoned page that
already has the HWPoison flag set, kill_accessing_process() is called to
send SIGBUS to the current process with error info. Because
memory_failure() is executed in the kworker context, it does nothing
but return EFAULT. So devmem accesses the poisoned page and triggers an
exception again, resulting in an infinite loop of synchronous errors.
Such a loop may cause platform firmware to exceed some threshold and
reboot when Linux could have recovered from this error.
To reproduce this problem: # STEP 1: inject an UCE error, and kernel will set HWPosion flag for related page #einj_mem_uc single 0: single vaddr = 0xffffb0d75400 paddr = 4092d55b400 injecting ... triggering ... signal 7 code 4 addr 0xffffb0d75000 page not present Test passed # STEP 2: access the same page and it will trigger a synchronous error infinite loop devmem 0x4092d55b400 To fix it, if memory_failure() failed, perform a force kill to current process. Issue 3: a synchronous error infinite loop due to no memory_failure() queued No memory_failure() work is queued unless all bellow preconditions check passed: - `if (!(mem_err->validation_bits & CPER_MEM_VALID_PA))` in ghes_handle_memory_failure() - `if (flags == -1)` in ghes_handle_memory_failure() - `if (!IS_ENABLED(CONFIG_ACPI_APEI_MEMORY_FAILURE))` in ghes_do_memory_failure() - `if (!pfn_valid(pfn) && !arch_is_platform_page(physical_addr)) ` in ghes_do_memory_failure() If the preconditions are not passed, the user-space process will trigger SEA again. This loop can potentially exceed the platform firmware threshold or even trigger a kernel hard lockup, leading to a system reboot. To fix it, if no memory_failure() queued, perform a force kill to current process. And the the memory errors triggered in kernel-mode[5], also relies on this patchset to kill the failure thread. Lv Ying and XiuQi from Huawei also proposed to address similar problem[2][4]. Acknowledge to discussion with them. 
[1] Add ARMv8 RAS virtualization support in QEMU https://patchew.org/QEMU/20200512030609.19593-1-gengdongjiu@huawei.com/ [2] https://lore.kernel.org/lkml/20221205115111.131568-3-lvying6@huawei.com/ [3] https://lkml.kernel.org/r/20220914064935.7851-1-xueshuai@linux.alibaba.com [4] https://lore.kernel.org/lkml/20221209095407.383211-1-lvying6@huawei.com/ [5] https://patchwork.kernel.org/project/linux-arm-kernel/cover/20240528085915.1955987-1-tongtiangen@huawei.com/ Shuai Xue (3): ACPI: APEI: send SIGBUS to current task if synchronous memory error not recovered mm: memory-failure: move return value documentation to function declaration ACPI: APEI: handle synchronous exceptions in task work arch/x86/kernel/cpu/mce/core.c | 7 --- drivers/acpi/apei/ghes.c | 86 +++++++++++++++++++++------------- include/acpi/ghes.h | 3 -- include/linux/mm.h | 1 - mm/memory-failure.c | 22 +++------ 5 files changed, 60 insertions(+), 59 deletions(-)
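Issue 1 in the cover letter turns on the si_code delivered with SIGBUS. As a minimal user-space sketch (not part of the patch set), the code below shows how a test program could tell the two codes apart. The names `mce_code_name`, `sigbus_handler` and `install_sigbus_handler` are hypothetical, and the fallback values 4 and 5 are taken from the "code 4" and "code 5" lines in the transcripts above.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <stddef.h>
#include <string.h>
#include <unistd.h>

/* Fallbacks for libc headers that do not expose these codes. */
#ifndef BUS_MCEERR_AR
#define BUS_MCEERR_AR 4	/* action required: poison was consumed */
#endif
#ifndef BUS_MCEERR_AO
#define BUS_MCEERR_AO 5	/* action optional: poison found, not consumed */
#endif

/* Hypothetical helper: name the SIGBUS si_code of a memory error. */
static const char *mce_code_name(int si_code)
{
	switch (si_code) {
	case BUS_MCEERR_AR:
		return "BUS_MCEERR_AR";
	case BUS_MCEERR_AO:
		return "BUS_MCEERR_AO";
	default:
		return "other";
	}
}

/* Hypothetical handler: report which code arrived, then exit. */
static void sigbus_handler(int sig, siginfo_t *info, void *ucontext)
{
	const char *name = mce_code_name(info->si_code);

	(void)sig;
	(void)ucontext;
	/* write() is async-signal-safe, printf() is not. */
	if (write(STDERR_FILENO, name, strlen(name)) < 0 ||
	    write(STDERR_FILENO, "\n", 1) < 0)
		_exit(2);
	_exit(1);
}

/* Install the handler; returns 0 on success, -1 on failure. */
static int install_sigbus_handler(void)
{
	struct sigaction sa;

	memset(&sa, 0, sizeof(sa));
	sa.sa_sigaction = sigbus_handler;
	sa.sa_flags = SA_SIGINFO;	/* needed to receive siginfo_t */
	sigemptyset(&sa.sa_mask);
	return sigaction(SIGBUS, &sa, NULL);
}
```

A program would call `install_sigbus_handler()` before touching the injected address. With early kill enabled, the cover letter expects the consuming process to see BUS_MCEERR_AR once memory_failure() runs in that process's own context.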
## Changes Log

changes since v13:
- add reviewed-by tag from Jarkko
- rename task_work to ghes_task_work (per Jarkko)

changes since v12:
- tweak error message for force kill (per Jarkko)
- fix comments style (per Jarkko)
- fix commit log typo (per Jarkko)

changes since v11:
- rebase to Linux 6.11-rc6
- fix grammar and typo in commit log (per Borislav)
- remove `sync_` prefix of `sync_task_work` (per Borislav)
- comment the flags and description of `task_work` (per Borislav)

changes since v10:
- rebase to v6.8-rc2

changes since v9:
- split patch 2 to address exactly one issue in one patch (per Borislav)
- rewrite commit log according to template (per Borislav)
- pickup reviewed-by tag of patch 1 from James Morse
- alloc and free twcb through gen_pool_{alloc, free} (per James)
- rewrite cover letter

changes since v8:
- remove the bug fix tag of patch 2 (per Jarkko Sakkinen)
- remove the declaration of memory_failure_queue_kick (per Naoya Horiguchi)
- rewrite the return value comments of memory_failure (per Naoya Horiguchi)

changes since v7:
- rebase to Linux v6.6-rc2 (no code changed)
- rewrite the cover letter to explain the motivation of this patchset

changes since v6:
- add more explicit error message suggested by Xiaofei
- pick up reviewed-by tag from Xiaofei
- pick up internal reviewed-by tag from Baolin

changes since v5 by addressing comments from Kefeng:
- document return value of memory_failure()
- drop redundant comments in call site of memory_failure()
- make ghes_do_proc void and handle abnormal case within it
- pick up reviewed-by tag from Kefeng Wang

changes since v4 by addressing comments from Xiaofei:
- do a force kill only for abnormal sync errors

changes since v3 by addressing comments from Xiaofei:
- do a force kill for abnormal memory failure error such as invalid PA, unexpected severity, OOM, etc
- pick up tested-by tag from Ma Wupeng

changes since v2 by addressing comments from Naoya:
- rename mce_task_work to sync_task_work
- drop ACPI_HEST_NOTIFY_MCE case in is_hest_sync_notify()
- add steps to reproduce this problem in cover letter

changes since v1:
- synchronous events by notify type
- Link: https://lore.kernel.org/lkml/20221206153354.92394-3-xueshuai@linux.alibaba.com/
## Changes Log

changes since v14:
- add reviewed-by tags from Jarkko and Jonathan
- remove local variable and use twcb->pfn

changes since v13:
- add reviewed-by tag from Jarkko
- rename task_work to ghes_task_work (per Jarkko)

changes since v12:
- tweak error message for force kill (per Jarkko)
- fix comments style (per Jarkko)
- fix commit log typo (per Jarkko)

changes since v11:
- rebase to Linux 6.11-rc6
- fix grammar and typo in commit log (per Borislav)
- remove `sync_` prefix of `sync_task_work` (per Borislav)
- comment the flags and description of `task_work` (per Borislav)

changes since v10:
- rebase to v6.8-rc2

changes since v9:
- split patch 2 to address exactly one issue in one patch (per Borislav)
- rewrite commit log according to template (per Borislav)
- pickup reviewed-by tag of patch 1 from James Morse
- alloc and free twcb through gen_pool_{alloc, free} (per James)
- rewrite cover letter

changes since v8:
- remove the bug fix tag of patch 2 (per Jarkko Sakkinen)
- remove the declaration of memory_failure_queue_kick (per Naoya Horiguchi)
- rewrite the return value comments of memory_failure (per Naoya Horiguchi)

changes since v7:
- rebase to Linux v6.6-rc2 (no code changed)
- rewrite the cover letter to explain the motivation of this patchset

changes since v6:
- add more explicit error message suggested by Xiaofei
- pick up reviewed-by tag from Xiaofei
- pick up internal reviewed-by tag from Baolin

changes since v5 by addressing comments from Kefeng:
- document return value of memory_failure()
- drop redundant comments in call site of memory_failure()
- make ghes_do_proc void and handle abnormal case within it
- pick up reviewed-by tag from Kefeng Wang

changes since v4 by addressing comments from Xiaofei:
- do a force kill only for abnormal sync errors

changes since v3 by addressing comments from Xiaofei:
- do a force kill for abnormal memory failure error such as invalid PA, unexpected severity, OOM, etc
- pick up tested-by tag from Ma Wupeng

changes since v2 by addressing comments from Naoya:
- rename mce_task_work to sync_task_work
- drop ACPI_HEST_NOTIFY_MCE case in is_hest_sync_notify()
- add steps to reproduce this problem in cover letter

changes since v1:
- synchronous events by notify type
- Link: https://lore.kernel.org/lkml/20221206153354.92394-3-xueshuai@linux.alibaba.com/
## Changes Log

changes since v15:
- add HW_ERR and GHES_PFX prefix per Yazen

changes since v14:
- add reviewed-by tags from Jarkko and Jonathan
- remove local variable and use twcb->pfn

changes since v13:
- add reviewed-by tag from Jarkko
- rename task_work to ghes_task_work (per Jarkko)

changes since v12:
- tweak error message for force kill (per Jarkko)
- fix comments style (per Jarkko)
- fix commit log typo (per Jarkko)

changes since v11:
- rebase to Linux 6.11-rc6
- fix grammar and typo in commit log (per Borislav)
- remove `sync_` prefix of `sync_task_work` (per Borislav)
- comment the flags and description of `task_work` (per Borislav)

changes since v10:
- rebase to v6.8-rc2

changes since v9:
- split patch 2 to address exactly one issue in one patch (per Borislav)
- rewrite commit log according to template (per Borislav)
- pickup reviewed-by tag of patch 1 from James Morse
- alloc and free twcb through gen_pool_{alloc, free} (per James)
- rewrite cover letter

changes since v8:
- remove the bug fix tag of patch 2 (per Jarkko Sakkinen)
- remove the declaration of memory_failure_queue_kick (per Naoya Horiguchi)
- rewrite the return value comments of memory_failure (per Naoya Horiguchi)

changes since v7:
- rebase to Linux v6.6-rc2 (no code changed)
- rewrite the cover letter to explain the motivation of this patchset

changes since v6:
- add more explicit error message suggested by Xiaofei
- pick up reviewed-by tag from Xiaofei
- pick up internal reviewed-by tag from Baolin

changes since v5 by addressing comments from Kefeng:
- document return value of memory_failure()
- drop redundant comments in call site of memory_failure()
- make ghes_do_proc void and handle abnormal case within it
- pick up reviewed-by tag from Kefeng Wang

changes since v4 by addressing comments from Xiaofei:
- do a force kill only for abnormal sync errors

changes since v3 by addressing comments from Xiaofei:
- do a force kill for abnormal memory failure error such as invalid PA, unexpected severity, OOM, etc
- pick up tested-by tag from Ma Wupeng

changes since v2 by addressing comments from Naoya:
- rename mce_task_work to sync_task_work
- drop ACPI_HEST_NOTIFY_MCE case in is_hest_sync_notify()
- add steps to reproduce this problem in cover letter

changes since v1:
- synchronous events by notify type
- Link: https://lore.kernel.org/lkml/20221206153354.92394-3-xueshuai@linux.alibaba.com/
diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
index 80ad530583c9..6c03059cbfc6 100644
--- a/drivers/acpi/apei/ghes.c
+++ b/drivers/acpi/apei/ghes.c
@@ -474,8 +474,14 @@ static bool ghes_handle_memory_failure(struct acpi_hest_generic_data *gdata,
 	if (sec_sev == GHES_SEV_CORRECTED &&
 	    (gdata->flags & CPER_SEC_ERROR_THRESHOLD_EXCEEDED))
 		flags = MF_SOFT_OFFLINE;
-	if (sev == GHES_SEV_RECOVERABLE && sec_sev == GHES_SEV_RECOVERABLE)
-		flags = 0;
+	if (sev == GHES_SEV_RECOVERABLE && sec_sev == GHES_SEV_RECOVERABLE) {
+		if (mem_err->validation_bits & CPER_MEM_VALID_ERROR_TYPE)
+			flags = mem_err->error_type == CPER_MEM_SCRUB_UC ?
+					0 :
+					MF_ACTION_REQUIRED;
+		else
+			flags = MF_ACTION_REQUIRED;
+	}

 	if (flags != -1)
 		return ghes_do_memory_failure(mem_err->physical_addr, flags);
diff --git a/include/linux/cper.h b/include/linux/cper.h
index eacb7dd7b3af..b77ab7636614 100644
--- a/include/linux/cper.h
+++ b/include/linux/cper.h
@@ -235,6 +235,9 @@ enum {
 #define CPER_MEM_VALID_BANK_ADDRESS		0x100000
 #define CPER_MEM_VALID_CHIP_ID			0x200000

+#define CPER_MEM_SCRUB_CE			13
+#define CPER_MEM_SCRUB_UC			14
+
 #define CPER_MEM_EXT_ROW_MASK			0x3
 #define CPER_MEM_EXT_ROW_SHIFT			16
There are two major types of uncorrected error (UC):

- Action Required: The error is detected and the processor has already consumed the memory. The OS must take action (for example, offline the failed page or kill the failing thread) to recover from this uncorrectable error.

- Action Optional: The error is detected outside processor execution context. Some data in memory is corrupted, but it has not been consumed. The OS may optionally take action to recover from this uncorrectable error.

On X86 platforms, these two types are easily distinguished based on the MCA bank. On arm64 platforms, however, the memory failure flags for all UCs whose severity is GHES_SEV_RECOVERABLE are currently set to 0, a.k.a. Action Optional.

If a UC is detected by a background scrubber, it is obviously an Action Optional error. Other errors should conservatively be regarded as Action Required.

cper_sec_mem_err::error_type identifies the type of error that occurred if CPER_MEM_VALID_ERROR_TYPE is set. So, set the memory failure flags to 0 for Scrub Uncorrected Error (type 14). Otherwise, set the memory failure flags to MF_ACTION_REQUIRED.

Signed-off-by: Shuai Xue <xueshuai@linux.alibaba.com>
---
 drivers/acpi/apei/ghes.c | 10 ++++++++--
 include/linux/cper.h     |  3 +++
 2 files changed, 11 insertions(+), 2 deletions(-)
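The flag selection described in this commit message can be modeled outside the kernel. The sketch below is only a user-space illustration of that rule: `struct mem_err_model` and `ucr_memory_failure_flags()` are hypothetical names, and the numeric values of CPER_MEM_VALID_ERROR_TYPE and MF_ACTION_REQUIRED are assumptions that should be checked against include/linux/cper.h and include/linux/mm.h in the tree being patched.

```c
#include <stdint.h>

/*
 * Illustration-only copies of kernel constants. The first and last values
 * are assumed, not taken from the patch; CPER_MEM_SCRUB_UC is the value
 * the patch adds.
 */
#define CPER_MEM_VALID_ERROR_TYPE	0x4000
#define CPER_MEM_SCRUB_UC		14
#define MF_ACTION_REQUIRED		(1 << 1)

/* Hypothetical stand-in for the two cper_sec_mem_err fields the patch reads. */
struct mem_err_model {
	uint64_t validation_bits;
	uint8_t error_type;
};

/*
 * Hypothetical helper mirroring the new branch taken when both severities
 * are GHES_SEV_RECOVERABLE.
 */
static int ucr_memory_failure_flags(const struct mem_err_model *mem_err)
{
	/* Only a validly reported scrub UC is treated as Action Optional. */
	if ((mem_err->validation_bits & CPER_MEM_VALID_ERROR_TYPE) &&
	    mem_err->error_type == CPER_MEM_SCRUB_UC)
		return 0;

	/* Unknown type or any other type: conservatively Action Required. */
	return MF_ACTION_REQUIRED;
}
```

The single condition is equivalent to the nested if/else in the diff: the only path that yields 0 is a valid error_type equal to Scrub Uncorrected Error, and an error_type field that is not marked valid is ignored.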