Message ID | 20250211060200.33845-5-xueshuai@linux.alibaba.com (mailing list archive) |
---|---|
State | New |
Headers | show |
Series | fmm/hwpoison: Fix regressions in memory failure handling | expand |
On 2025/2/11 14:02, Shuai Xue wrote: > When an uncorrected memory error is consumed there is a race between > the CMCI from the memory controller reporting an uncorrected error > with a UCNA signature, and the core reporting and SRAR signature > machine check when the data is about to be consumed. > > If the CMCI wins that race, the page is marked poisoned when > uc_decode_notifier() calls memory_failure(). For dirty pages, > memory_failure() invokes try_to_unmap() with the TTU_HWPOISON flag, > converting the PTE to a hwpoison entry. However, for clean pages, the > TTU_HWPOISON flag is cleared, leaving the PTE unchanged and not converted > to a hwpoison entry. Consequently, for an unmapped dirty page, the PTE is > marked as a hwpoison entry allowing kill_accessing_process() to: > > - call walk_page_range() and return 1 > - call kill_proc() to make sure a SIGBUS is sent > - return -EHWPOISON to indicate that SIGBUS is already sent to the process > and kill_me_maybe() doesn't have to send it again. > > Conversely, for clean pages where PTE entries are not marked as hwpoison, > kill_accessing_process() returns -EFAULT, causing kill_me_maybe() to send a > SIGBUS. > > Console log looks like this: > > Memory failure: 0x827ca68: corrupted page was clean: dropped without side effects > Memory failure: 0x827ca68: recovery action for clean LRU page: Recovered > Memory failure: 0x827ca68: already hardware poisoned > mce: Memory error not recovered > > To fix it, return -EHWPOISON if no hwpoison PTE entry is found, preventing > an unnecessary SIGBUS. Thanks for your patch. > > Fixes: 046545a661af ("mm/hwpoison: fix error page recovered but reported "not recovered"") > Signed-off-by: Shuai Xue <xueshuai@linux.alibaba.com> > --- > mm/memory-failure.c | 5 ++--- > 1 file changed, 2 insertions(+), 3 deletions(-) > > diff --git a/mm/memory-failure.c b/mm/memory-failure.c > index 995a15eb67e2..f9a6b136a6f0 100644 > --- a/mm/memory-failure.c > +++ b/mm/memory-failure.c > @@ -883,10 +883,9 @@ static int kill_accessing_process(struct task_struct *p, unsigned long pfn, > (void *)&priv); > if (ret == 1 && priv.tk.addr) > kill_proc(&priv.tk, pfn, flags); > - else > - ret = 0; > mmap_read_unlock(p->mm); > - return ret > 0 ? -EHWPOISON : -EFAULT; > + > + return ret >= 0 ? -EHWPOISON : -EFAULT; IIUC, kill_accessing_process() is supposed to return -EHWPOISON to notify that SIGBUS is already sent to the process and kill_me_maybe() doesn't have to send it again. But with your change, kill_accessing_process() will return -EHWPOISON even if SIGBUS is not sent. Does this break the semantics of -EHWPOISON? BTW I scanned the code of walk_page_range(). It seems with implementation of hwpoison_walk_ops walk_page_range() will only return 0 or 1, i.e. always >= 0. So kill_accessing_process() will always return -EHWPOISON if this patch is applied. Correct me if I miss something. Thanks. .
在 2025/2/12 16:09, Miaohe Lin 写道: > On 2025/2/11 14:02, Shuai Xue wrote: >> When an uncorrected memory error is consumed there is a race between >> the CMCI from the memory controller reporting an uncorrected error >> with a UCNA signature, and the core reporting and SRAR signature >> machine check when the data is about to be consumed. >> >> If the CMCI wins that race, the page is marked poisoned when >> uc_decode_notifier() calls memory_failure(). For dirty pages, >> memory_failure() invokes try_to_unmap() with the TTU_HWPOISON flag, >> converting the PTE to a hwpoison entry. However, for clean pages, the >> TTU_HWPOISON flag is cleared, leaving the PTE unchanged and not converted >> to a hwpoison entry. Consequently, for an unmapped dirty page, the PTE is >> marked as a hwpoison entry allowing kill_accessing_process() to: >> >> - call walk_page_range() and return 1 >> - call kill_proc() to make sure a SIGBUS is sent >> - return -EHWPOISON to indicate that SIGBUS is already sent to the process >> and kill_me_maybe() doesn't have to send it again. >> >> Conversely, for clean pages where PTE entries are not marked as hwpoison, >> kill_accessing_process() returns -EFAULT, causing kill_me_maybe() to send a >> SIGBUS. >> >> Console log looks like this: >> >> Memory failure: 0x827ca68: corrupted page was clean: dropped without side effects >> Memory failure: 0x827ca68: recovery action for clean LRU page: Recovered >> Memory failure: 0x827ca68: already hardware poisoned >> mce: Memory error not recovered >> >> To fix it, return -EHWPOISON if no hwpoison PTE entry is found, preventing >> an unnecessary SIGBUS. > > Thanks for your patch. > >> >> Fixes: 046545a661af ("mm/hwpoison: fix error page recovered but reported "not recovered"") >> Signed-off-by: Shuai Xue <xueshuai@linux.alibaba.com> >> --- >> mm/memory-failure.c | 5 ++--- >> 1 file changed, 2 insertions(+), 3 deletions(-) >> >> diff --git a/mm/memory-failure.c b/mm/memory-failure.c >> index 995a15eb67e2..f9a6b136a6f0 100644 >> --- a/mm/memory-failure.c >> +++ b/mm/memory-failure.c >> @@ -883,10 +883,9 @@ static int kill_accessing_process(struct task_struct *p, unsigned long pfn, >> (void *)&priv); >> if (ret == 1 && priv.tk.addr) >> kill_proc(&priv.tk, pfn, flags); >> - else >> - ret = 0; >> mmap_read_unlock(p->mm); >> - return ret > 0 ? -EHWPOISON : -EFAULT; >> + >> + return ret >= 0 ? -EHWPOISON : -EFAULT; > > IIUC, kill_accessing_process() is supposed to return -EHWPOISON to notify that SIGBUS is already > sent to the process and kill_me_maybe() doesn't have to send it again. But with your change, > kill_accessing_process() will return -EHWPOISON even if SIGBUS is not sent. Does this break > the semantics of -EHWPOISON? Yes, from the comment of kill_me_maybe(), * -EHWPOISON from memory_failure() means that it already sent SIGBUS * to the current process with the proper error info, * -EOPNOTSUPP means hwpoison_filter() filtered the error event, this patch break the comment. But the defination of EHWPOISON is quite different from the comment. #define EHWPOISON 133 /* Memory page has hardware error */ As for this issue, returning 0 or EHWPOISON can both prevent a SIGBUS signal from being sent in kill_me_maybe(). Which way do you prefer? > > BTW I scanned the code of walk_page_range(). It seems with implementation of hwpoison_walk_ops > walk_page_range() will only return 0 or 1, i.e. always >= 0. So kill_accessing_process() will always > return -EHWPOISON if this patch is applied. > > Correct me if I miss something. Yes, you are right. Let's count the cases one by one: 1. clean page: try_to_remap(!TTU_HWPOISON), walk_page_range() will return 0 and we should not send sigbus in kill_me_maybe(). 2. dirty page: 2.1 MCE wins race CMCI:w/o Action Require MCE: w/ Action Require TestSetPageHWPoison TestSetPageHWPoison return -EHWPOISON try_to_unmap(TTU_HWPOISON) kill_proc in hwpoison_user_mappings() If MCE wins the race, because the flag of memory_fialure() called by CMCI is not set as MF_ACTION_REQUIRED, everything goes well, kill_proc() will send SIGBUS in hwpoison_user_mappings(). 2.2 CMCI win CMCI:w/o Action Require MCE: w/ Action Require TestSetPageHWPoison try_to_unmap(TTU_HWPOISON) walk_page_range() return 1 due to hwpoison PTE entry kill_proc in kill_accessing_process() If the CMCI wins the race, we need to kill the process in kill_accessing_process(). And if try_to_remap() success, everything goes well, kill_proc() will send SIGBUS in kill_accessing_process(). But if try_to_remap() fails, the PTE entry will not be marked as hwpoison, and walk_page_range() return 0 as case 1 clean page, NO SIGBUS will be sent. In summary, hwpoison_walk_ops cannot distinguish between try_to_unmap failing and causing the PTE entry not to be set to hwpoison, and a clean page that originally does not have the PTE entry set to hwpoison. +naoya for orginal patch intend. Thanks. Best Regard, Shuai
On 2025/2/12 21:55, Shuai Xue wrote: > > > 在 2025/2/12 16:09, Miaohe Lin 写道: >> On 2025/2/11 14:02, Shuai Xue wrote: >>> When an uncorrected memory error is consumed there is a race between >>> the CMCI from the memory controller reporting an uncorrected error >>> with a UCNA signature, and the core reporting and SRAR signature >>> machine check when the data is about to be consumed. >>> >>> If the CMCI wins that race, the page is marked poisoned when >>> uc_decode_notifier() calls memory_failure(). For dirty pages, >>> memory_failure() invokes try_to_unmap() with the TTU_HWPOISON flag, >>> converting the PTE to a hwpoison entry. However, for clean pages, the >>> TTU_HWPOISON flag is cleared, leaving the PTE unchanged and not converted >>> to a hwpoison entry. Consequently, for an unmapped dirty page, the PTE is >>> marked as a hwpoison entry allowing kill_accessing_process() to: >>> >>> - call walk_page_range() and return 1 >>> - call kill_proc() to make sure a SIGBUS is sent >>> - return -EHWPOISON to indicate that SIGBUS is already sent to the process >>> and kill_me_maybe() doesn't have to send it again. >>> >>> Conversely, for clean pages where PTE entries are not marked as hwpoison, >>> kill_accessing_process() returns -EFAULT, causing kill_me_maybe() to send a >>> SIGBUS. >>> >>> Console log looks like this: >>> >>> Memory failure: 0x827ca68: corrupted page was clean: dropped without side effects >>> Memory failure: 0x827ca68: recovery action for clean LRU page: Recovered >>> Memory failure: 0x827ca68: already hardware poisoned >>> mce: Memory error not recovered >>> >>> To fix it, return -EHWPOISON if no hwpoison PTE entry is found, preventing >>> an unnecessary SIGBUS. >> >> Thanks for your patch. >> >>> >>> Fixes: 046545a661af ("mm/hwpoison: fix error page recovered but reported "not recovered"") >>> Signed-off-by: Shuai Xue <xueshuai@linux.alibaba.com> >>> --- >>> mm/memory-failure.c | 5 ++--- >>> 1 file changed, 2 insertions(+), 3 deletions(-) >>> >>> diff --git a/mm/memory-failure.c b/mm/memory-failure.c >>> index 995a15eb67e2..f9a6b136a6f0 100644 >>> --- a/mm/memory-failure.c >>> +++ b/mm/memory-failure.c >>> @@ -883,10 +883,9 @@ static int kill_accessing_process(struct task_struct *p, unsigned long pfn, >>> (void *)&priv); >>> if (ret == 1 && priv.tk.addr) >>> kill_proc(&priv.tk, pfn, flags); >>> - else >>> - ret = 0; >>> mmap_read_unlock(p->mm); >>> - return ret > 0 ? -EHWPOISON : -EFAULT; >>> + >>> + return ret >= 0 ? -EHWPOISON : -EFAULT; >> >> IIUC, kill_accessing_process() is supposed to return -EHWPOISON to notify that SIGBUS is already >> sent to the process and kill_me_maybe() doesn't have to send it again. But with your change, >> kill_accessing_process() will return -EHWPOISON even if SIGBUS is not sent. Does this break >> the semantics of -EHWPOISON? > > Yes, from the comment of kill_me_maybe(), > > * -EHWPOISON from memory_failure() means that it already sent SIGBUS > * to the current process with the proper error info, > * -EOPNOTSUPP means hwpoison_filter() filtered the error event, > > this patch break the comment. > > But the defination of EHWPOISON is quite different from the comment. > > #define EHWPOISON 133 /* Memory page has hardware error */ > > As for this issue, returning 0 or EHWPOISON can both prevent a SIGBUS signal > from being sent in kill_me_maybe(). > > Which way do you prefer? > >> >> BTW I scanned the code of walk_page_range(). It seems with implementation of hwpoison_walk_ops >> walk_page_range() will only return 0 or 1, i.e. always >= 0. So kill_accessing_process() will always >> return -EHWPOISON if this patch is applied. >> >> Correct me if I miss something. > > Yes, you are right. Let's count the cases one by one: > > 1. clean page: try_to_remap(!TTU_HWPOISON), walk_page_range() will return 0 and Do you mean try_to_unmap? > we should not send sigbus in kill_me_maybe(). > > 2. dirty page: > 2.1 MCE wins race > CMCI:w/o Action Require MCE: w/ Action Require > TestSetPageHWPoison > TestSetPageHWPoison > return -EHWPOISON > try_to_unmap(TTU_HWPOISON) > kill_proc in hwpoison_user_mappings() > > If MCE wins the race, because the flag of memory_fialure() called by CMCI is > not set as MF_ACTION_REQUIRED, everything goes well, kill_proc() will send > SIGBUS in hwpoison_user_mappings(). > > 2.2 CMCI win > CMCI:w/o Action Require MCE: w/ Action Require > TestSetPageHWPoison > try_to_unmap(TTU_HWPOISON) > walk_page_range() return 1 due to hwpoison PTE entry > kill_proc in kill_accessing_process() > > If the CMCI wins the race, we need to kill the process in > kill_accessing_process(). And if try_to_remap() success, everything goes well, > kill_proc() will send SIGBUS in kill_accessing_process(). > > But if try_to_remap() fails, the PTE entry will not be marked as hwpoison, and > walk_page_range() return 0 as case 1 clean page, NO SIGBUS will be sent. If try_to_unmap() fails, the PTE entry will still point to the dirty page. Then in check_hwpoisoned_entry(), we will have pfn == poisoned_pfn. So walk_page_range() will return 1 in this case. Or am I miss something? > > In summary, hwpoison_walk_ops cannot distinguish between try_to_unmap failing > and causing the PTE entry not to be set to hwpoison, and a clean page that > originally does not have the PTE entry set to hwpoison. Is it possible current process is not the one accessing the hwpoisoned page? E.g. memory_failure is deferred and called from kworker context or something like that. If it's possible, this is another scene needs to be considered. Thanks. .
在 2025/2/13 11:20, Miaohe Lin 写道: > On 2025/2/12 21:55, Shuai Xue wrote: >> >> >> 在 2025/2/12 16:09, Miaohe Lin 写道: >>> On 2025/2/11 14:02, Shuai Xue wrote: >>>> When an uncorrected memory error is consumed there is a race between >>>> the CMCI from the memory controller reporting an uncorrected error >>>> with a UCNA signature, and the core reporting and SRAR signature >>>> machine check when the data is about to be consumed. >>>> >>>> If the CMCI wins that race, the page is marked poisoned when >>>> uc_decode_notifier() calls memory_failure(). For dirty pages, >>>> memory_failure() invokes try_to_unmap() with the TTU_HWPOISON flag, >>>> converting the PTE to a hwpoison entry. However, for clean pages, the >>>> TTU_HWPOISON flag is cleared, leaving the PTE unchanged and not converted >>>> to a hwpoison entry. Consequently, for an unmapped dirty page, the PTE is >>>> marked as a hwpoison entry allowing kill_accessing_process() to: >>>> >>>> - call walk_page_range() and return 1 >>>> - call kill_proc() to make sure a SIGBUS is sent >>>> - return -EHWPOISON to indicate that SIGBUS is already sent to the process >>>> and kill_me_maybe() doesn't have to send it again. >>>> >>>> Conversely, for clean pages where PTE entries are not marked as hwpoison, >>>> kill_accessing_process() returns -EFAULT, causing kill_me_maybe() to send a >>>> SIGBUS. >>>> >>>> Console log looks like this: >>>> >>>> Memory failure: 0x827ca68: corrupted page was clean: dropped without side effects >>>> Memory failure: 0x827ca68: recovery action for clean LRU page: Recovered >>>> Memory failure: 0x827ca68: already hardware poisoned >>>> mce: Memory error not recovered >>>> >>>> To fix it, return -EHWPOISON if no hwpoison PTE entry is found, preventing >>>> an unnecessary SIGBUS. >>> >>> Thanks for your patch. >>> >>>> >>>> Fixes: 046545a661af ("mm/hwpoison: fix error page recovered but reported "not recovered"") >>>> Signed-off-by: Shuai Xue <xueshuai@linux.alibaba.com> >>>> --- >>>> mm/memory-failure.c | 5 ++--- >>>> 1 file changed, 2 insertions(+), 3 deletions(-) >>>> >>>> diff --git a/mm/memory-failure.c b/mm/memory-failure.c >>>> index 995a15eb67e2..f9a6b136a6f0 100644 >>>> --- a/mm/memory-failure.c >>>> +++ b/mm/memory-failure.c >>>> @@ -883,10 +883,9 @@ static int kill_accessing_process(struct task_struct *p, unsigned long pfn, >>>> (void *)&priv); >>>> if (ret == 1 && priv.tk.addr) >>>> kill_proc(&priv.tk, pfn, flags); >>>> - else >>>> - ret = 0; >>>> mmap_read_unlock(p->mm); >>>> - return ret > 0 ? -EHWPOISON : -EFAULT; >>>> + >>>> + return ret >= 0 ? -EHWPOISON : -EFAULT; >>> >>> IIUC, kill_accessing_process() is supposed to return -EHWPOISON to notify that SIGBUS is already >>> sent to the process and kill_me_maybe() doesn't have to send it again. But with your change, >>> kill_accessing_process() will return -EHWPOISON even if SIGBUS is not sent. Does this break >>> the semantics of -EHWPOISON? >> >> Yes, from the comment of kill_me_maybe(), >> >> * -EHWPOISON from memory_failure() means that it already sent SIGBUS >> * to the current process with the proper error info, >> * -EOPNOTSUPP means hwpoison_filter() filtered the error event, >> >> this patch break the comment. >> >> But the defination of EHWPOISON is quite different from the comment. >> >> #define EHWPOISON 133 /* Memory page has hardware error */ >> >> As for this issue, returning 0 or EHWPOISON can both prevent a SIGBUS signal >> from being sent in kill_me_maybe(). >> >> Which way do you prefer? >> >>> >>> BTW I scanned the code of walk_page_range(). It seems with implementation of hwpoison_walk_ops >>> walk_page_range() will only return 0 or 1, i.e. always >= 0. So kill_accessing_process() will always >>> return -EHWPOISON if this patch is applied. >>> >>> Correct me if I miss something. >> >> Yes, you are right. Let's count the cases one by one: >> >> 1. clean page: try_to_remap(!TTU_HWPOISON), walk_page_range() will return 0 and > > Do you mean try_to_unmap? Yes, sorry for the typo. > >> we should not send sigbus in kill_me_maybe(). >> >> 2. dirty page: >> 2.1 MCE wins race >> CMCI:w/o Action Require MCE: w/ Action Require >> TestSetPageHWPoison >> TestSetPageHWPoison >> return -EHWPOISON >> try_to_unmap(TTU_HWPOISON) >> kill_proc in hwpoison_user_mappings() >> >> If MCE wins the race, because the flag of memory_fialure() called by CMCI is >> not set as MF_ACTION_REQUIRED, everything goes well, kill_proc() will send >> SIGBUS in hwpoison_user_mappings(). >> >> 2.2 CMCI win >> CMCI:w/o Action Require MCE: w/ Action Require >> TestSetPageHWPoison >> try_to_unmap(TTU_HWPOISON) >> walk_page_range() return 1 due to hwpoison PTE entry >> kill_proc in kill_accessing_process() >> >> If the CMCI wins the race, we need to kill the process in >> kill_accessing_process(). And if try_to_remap() success, everything goes well, >> kill_proc() will send SIGBUS in kill_accessing_process(). >> >> But if try_to_remap() fails, the PTE entry will not be marked as hwpoison, and >> walk_page_range() return 0 as case 1 clean page, NO SIGBUS will be sent. > > If try_to_unmap() fails, the PTE entry will still point to the dirty page. Then in > check_hwpoisoned_entry(), we will have pfn == poisoned_pfn. So walk_page_range() > will return 1 in this case. Or am I miss something? > You’re right; I overlooked the pte_present() branch. Therefore, in the walk_page_range() function: - It returns 0 when the poison page is a clean page. - It returns 1 when CMCI wins, regardless of whether try_to_unmap succeeds or fails. Then the patch will be like: @@ -883,10 +883,9 @@ static int kill_accessing_process(struct task_struct *p, unsigned long pfn, (void *)&priv); if (ret == 1 && priv.tk.addr) kill_proc(&priv.tk, pfn, flags); - else - ret = 0; mmap_read_unlock(p->mm); - return ret > 0 ? -EHWPOISON : -EFAULT; + + return ret > 0 ? -EHWPOISON : 0; Here, returning 0 indicates that memory_failure() successfully handled the error by dropping the clean page. >> >> In summary, hwpoison_walk_ops cannot distinguish between try_to_unmap failing >> and causing the PTE entry not to be set to hwpoison, and a clean page that >> originally does not have the PTE entry set to hwpoison. > > Is it possible current process is not the one accessing the hwpoisoned page? E.g. memory_failure > is deferred and called from kworker context or something like that. If it's possible, this is > another scene needs to be considered. Yes, it possibale. But kill_accessing_process() will only be called with MF_ACTION_REQUIRED. MF_ACTION_REQUIRED indates that current process is exactly the one accesing the poison data. Fox x86 platform, GHES driver may queue a kwoker to defer memory_failure() with flag=0. So kill_accessing_process() will not be called in such case. Thanks. Shuai
On 2025/2/13 14:59, Shuai Xue wrote: > > > 在 2025/2/13 11:20, Miaohe Lin 写道: >> On 2025/2/12 21:55, Shuai Xue wrote: >>> >>> >>> 在 2025/2/12 16:09, Miaohe Lin 写道: >>>> On 2025/2/11 14:02, Shuai Xue wrote: >>>>> When an uncorrected memory error is consumed there is a race between >>>>> the CMCI from the memory controller reporting an uncorrected error >>>>> with a UCNA signature, and the core reporting and SRAR signature >>>>> machine check when the data is about to be consumed. >>>>> >>>>> If the CMCI wins that race, the page is marked poisoned when >>>>> uc_decode_notifier() calls memory_failure(). For dirty pages, >>>>> memory_failure() invokes try_to_unmap() with the TTU_HWPOISON flag, >>>>> converting the PTE to a hwpoison entry. However, for clean pages, the >>>>> TTU_HWPOISON flag is cleared, leaving the PTE unchanged and not converted >>>>> to a hwpoison entry. Consequently, for an unmapped dirty page, the PTE is >>>>> marked as a hwpoison entry allowing kill_accessing_process() to: >>>>> >>>>> - call walk_page_range() and return 1 >>>>> - call kill_proc() to make sure a SIGBUS is sent >>>>> - return -EHWPOISON to indicate that SIGBUS is already sent to the process >>>>> and kill_me_maybe() doesn't have to send it again. >>>>> >>>>> Conversely, for clean pages where PTE entries are not marked as hwpoison, >>>>> kill_accessing_process() returns -EFAULT, causing kill_me_maybe() to send a >>>>> SIGBUS. >>>>> >>>>> Console log looks like this: >>>>> >>>>> Memory failure: 0x827ca68: corrupted page was clean: dropped without side effects >>>>> Memory failure: 0x827ca68: recovery action for clean LRU page: Recovered >>>>> Memory failure: 0x827ca68: already hardware poisoned >>>>> mce: Memory error not recovered >>>>> >>>>> To fix it, return -EHWPOISON if no hwpoison PTE entry is found, preventing >>>>> an unnecessary SIGBUS. >>>> >>>> Thanks for your patch. >>>> >>>>> >>>>> Fixes: 046545a661af ("mm/hwpoison: fix error page recovered but reported "not recovered"") >>>>> Signed-off-by: Shuai Xue <xueshuai@linux.alibaba.com> >>>>> --- >>>>> mm/memory-failure.c | 5 ++--- >>>>> 1 file changed, 2 insertions(+), 3 deletions(-) >>>>> >>>>> diff --git a/mm/memory-failure.c b/mm/memory-failure.c >>>>> index 995a15eb67e2..f9a6b136a6f0 100644 >>>>> --- a/mm/memory-failure.c >>>>> +++ b/mm/memory-failure.c >>>>> @@ -883,10 +883,9 @@ static int kill_accessing_process(struct task_struct *p, unsigned long pfn, >>>>> (void *)&priv); >>>>> if (ret == 1 && priv.tk.addr) >>>>> kill_proc(&priv.tk, pfn, flags); >>>>> - else >>>>> - ret = 0; >>>>> mmap_read_unlock(p->mm); >>>>> - return ret > 0 ? -EHWPOISON : -EFAULT; >>>>> + >>>>> + return ret >= 0 ? -EHWPOISON : -EFAULT; >>>> >>>> IIUC, kill_accessing_process() is supposed to return -EHWPOISON to notify that SIGBUS is already >>>> sent to the process and kill_me_maybe() doesn't have to send it again. But with your change, >>>> kill_accessing_process() will return -EHWPOISON even if SIGBUS is not sent. Does this break >>>> the semantics of -EHWPOISON? >>> >>> Yes, from the comment of kill_me_maybe(), >>> >>> * -EHWPOISON from memory_failure() means that it already sent SIGBUS >>> * to the current process with the proper error info, >>> * -EOPNOTSUPP means hwpoison_filter() filtered the error event, >>> >>> this patch break the comment. >>> >>> But the defination of EHWPOISON is quite different from the comment. >>> >>> #define EHWPOISON 133 /* Memory page has hardware error */ >>> >>> As for this issue, returning 0 or EHWPOISON can both prevent a SIGBUS signal >>> from being sent in kill_me_maybe(). >>> >>> Which way do you prefer? >>> >>>> >>>> BTW I scanned the code of walk_page_range(). It seems with implementation of hwpoison_walk_ops >>>> walk_page_range() will only return 0 or 1, i.e. always >= 0. So kill_accessing_process() will always >>>> return -EHWPOISON if this patch is applied. >>>> >>>> Correct me if I miss something. >>> >>> Yes, you are right. Let's count the cases one by one: >>> >>> 1. clean page: try_to_remap(!TTU_HWPOISON), walk_page_range() will return 0 and >> >> Do you mean try_to_unmap? > > Yes, sorry for the typo. >> >>> we should not send sigbus in kill_me_maybe(). >>> >>> 2. dirty page: >>> 2.1 MCE wins race >>> CMCI:w/o Action Require MCE: w/ Action Require >>> TestSetPageHWPoison >>> TestSetPageHWPoison >>> return -EHWPOISON >>> try_to_unmap(TTU_HWPOISON) >>> kill_proc in hwpoison_user_mappings() >>> >>> If MCE wins the race, because the flag of memory_fialure() called by CMCI is >>> not set as MF_ACTION_REQUIRED, everything goes well, kill_proc() will send >>> SIGBUS in hwpoison_user_mappings(). >>> >>> 2.2 CMCI win >>> CMCI:w/o Action Require MCE: w/ Action Require >>> TestSetPageHWPoison >>> try_to_unmap(TTU_HWPOISON) >>> walk_page_range() return 1 due to hwpoison PTE entry >>> kill_proc in kill_accessing_process() >>> >>> If the CMCI wins the race, we need to kill the process in >>> kill_accessing_process(). And if try_to_remap() success, everything goes well, >>> kill_proc() will send SIGBUS in kill_accessing_process(). >>> >>> But if try_to_remap() fails, the PTE entry will not be marked as hwpoison, and >>> walk_page_range() return 0 as case 1 clean page, NO SIGBUS will be sent. >> >> If try_to_unmap() fails, the PTE entry will still point to the dirty page. Then in >> check_hwpoisoned_entry(), we will have pfn == poisoned_pfn. So walk_page_range() >> will return 1 in this case. Or am I miss something? >> > > You’re right; I overlooked the pte_present() branch. > > Therefore, in the walk_page_range() function: > - It returns 0 when the poison page is a clean page. > - It returns 1 when CMCI wins, regardless of whether try_to_unmap succeeds > or fails. > > Then the patch will be like: > > @@ -883,10 +883,9 @@ static int kill_accessing_process(struct task_struct *p, unsigned long pfn, > (void *)&priv); > if (ret == 1 && priv.tk.addr) > kill_proc(&priv.tk, pfn, flags); > - else > - ret = 0; > mmap_read_unlock(p->mm); > - return ret > 0 ? -EHWPOISON : -EFAULT; > + > + return ret > 0 ? -EHWPOISON : 0; > > Here, returning 0 indicates that memory_failure() successfully handled the > error by dropping the clean page. I'm not sure whether there's another scene that can make walk_page_range() returns 0. But if the only reason for walk_page_range() returning 0 is the poison page is a clean page and it's dropped, then this modification should be appropriate. With this change, the callers never send SIGBUS now. They might need to be changed too. Thanks. . > > >>> >>> In summary, hwpoison_walk_ops cannot distinguish between try_to_unmap failing >>> and causing the PTE entry not to be set to hwpoison, and a clean page that >>> originally does not have the PTE entry set to hwpoison. >> >> Is it possible current process is not the one accessing the hwpoisoned page? E.g. memory_failure >> is deferred and called from kworker context or something like that. If it's possible, this is >> another scene needs to be considered. > > Yes, it possibale. > > But kill_accessing_process() will only be called with MF_ACTION_REQUIRED. > MF_ACTION_REQUIRED indates that current process is exactly the one accesing the > poison data. > > Fox x86 platform, GHES driver may queue a kwoker to defer memory_failure() with > flag=0. So kill_accessing_process() will not be called in such case. > > Thanks. > Shuai > .
在 2025/2/14 14:54, Miaohe Lin 写道: > On 2025/2/13 14:59, Shuai Xue wrote: >> >> >> 在 2025/2/13 11:20, Miaohe Lin 写道: >>> On 2025/2/12 21:55, Shuai Xue wrote: >>>> >>>> >>>> 在 2025/2/12 16:09, Miaohe Lin 写道: >>>>> On 2025/2/11 14:02, Shuai Xue wrote: >>>>>> When an uncorrected memory error is consumed there is a race between >>>>>> the CMCI from the memory controller reporting an uncorrected error >>>>>> with a UCNA signature, and the core reporting and SRAR signature >>>>>> machine check when the data is about to be consumed. >>>>>> >>>>>> If the CMCI wins that race, the page is marked poisoned when >>>>>> uc_decode_notifier() calls memory_failure(). For dirty pages, >>>>>> memory_failure() invokes try_to_unmap() with the TTU_HWPOISON flag, >>>>>> converting the PTE to a hwpoison entry. However, for clean pages, the >>>>>> TTU_HWPOISON flag is cleared, leaving the PTE unchanged and not converted >>>>>> to a hwpoison entry. Consequently, for an unmapped dirty page, the PTE is >>>>>> marked as a hwpoison entry allowing kill_accessing_process() to: >>>>>> >>>>>> - call walk_page_range() and return 1 >>>>>> - call kill_proc() to make sure a SIGBUS is sent >>>>>> - return -EHWPOISON to indicate that SIGBUS is already sent to the process >>>>>> and kill_me_maybe() doesn't have to send it again. >>>>>> >>>>>> Conversely, for clean pages where PTE entries are not marked as hwpoison, >>>>>> kill_accessing_process() returns -EFAULT, causing kill_me_maybe() to send a >>>>>> SIGBUS. >>>>>> >>>>>> Console log looks like this: >>>>>> >>>>>> Memory failure: 0x827ca68: corrupted page was clean: dropped without side effects >>>>>> Memory failure: 0x827ca68: recovery action for clean LRU page: Recovered >>>>>> Memory failure: 0x827ca68: already hardware poisoned >>>>>> mce: Memory error not recovered >>>>>> >>>>>> To fix it, return -EHWPOISON if no hwpoison PTE entry is found, preventing >>>>>> an unnecessary SIGBUS. >>>>> >>>>> Thanks for your patch. >>>>> >>>>>> >>>>>> Fixes: 046545a661af ("mm/hwpoison: fix error page recovered but reported "not recovered"") >>>>>> Signed-off-by: Shuai Xue <xueshuai@linux.alibaba.com> >>>>>> --- >>>>>> mm/memory-failure.c | 5 ++--- >>>>>> 1 file changed, 2 insertions(+), 3 deletions(-) >>>>>> >>>>>> diff --git a/mm/memory-failure.c b/mm/memory-failure.c >>>>>> index 995a15eb67e2..f9a6b136a6f0 100644 >>>>>> --- a/mm/memory-failure.c >>>>>> +++ b/mm/memory-failure.c >>>>>> @@ -883,10 +883,9 @@ static int kill_accessing_process(struct task_struct *p, unsigned long pfn, >>>>>> (void *)&priv); >>>>>> if (ret == 1 && priv.tk.addr) >>>>>> kill_proc(&priv.tk, pfn, flags); >>>>>> - else >>>>>> - ret = 0; >>>>>> mmap_read_unlock(p->mm); >>>>>> - return ret > 0 ? -EHWPOISON : -EFAULT; >>>>>> + >>>>>> + return ret >= 0 ? -EHWPOISON : -EFAULT; >>>>> >>>>> IIUC, kill_accessing_process() is supposed to return -EHWPOISON to notify that SIGBUS is already >>>>> sent to the process and kill_me_maybe() doesn't have to send it again. But with your change, >>>>> kill_accessing_process() will return -EHWPOISON even if SIGBUS is not sent. Does this break >>>>> the semantics of -EHWPOISON? >>>> >>>> Yes, from the comment of kill_me_maybe(), >>>> >>>> * -EHWPOISON from memory_failure() means that it already sent SIGBUS >>>> * to the current process with the proper error info, >>>> * -EOPNOTSUPP means hwpoison_filter() filtered the error event, >>>> >>>> this patch break the comment. >>>> >>>> But the defination of EHWPOISON is quite different from the comment. >>>> >>>> #define EHWPOISON 133 /* Memory page has hardware error */ >>>> >>>> As for this issue, returning 0 or EHWPOISON can both prevent a SIGBUS signal >>>> from being sent in kill_me_maybe(). >>>> >>>> Which way do you prefer? >>>> >>>>> >>>>> BTW I scanned the code of walk_page_range(). It seems with implementation of hwpoison_walk_ops >>>>> walk_page_range() will only return 0 or 1, i.e. always >= 0. So kill_accessing_process() will always >>>>> return -EHWPOISON if this patch is applied. >>>>> >>>>> Correct me if I miss something. >>>> >>>> Yes, you are right. Let's count the cases one by one: >>>> >>>> 1. clean page: try_to_remap(!TTU_HWPOISON), walk_page_range() will return 0 and >>> >>> Do you mean try_to_unmap? >> >> Yes, sorry for the typo. >>> >>>> we should not send sigbus in kill_me_maybe(). >>>> >>>> 2. dirty page: >>>> 2.1 MCE wins race >>>> CMCI:w/o Action Require MCE: w/ Action Require >>>> TestSetPageHWPoison >>>> TestSetPageHWPoison >>>> return -EHWPOISON >>>> try_to_unmap(TTU_HWPOISON) >>>> kill_proc in hwpoison_user_mappings() >>>> >>>> If MCE wins the race, because the flag of memory_fialure() called by CMCI is >>>> not set as MF_ACTION_REQUIRED, everything goes well, kill_proc() will send >>>> SIGBUS in hwpoison_user_mappings(). >>>> >>>> 2.2 CMCI win >>>> CMCI:w/o Action Require MCE: w/ Action Require >>>> TestSetPageHWPoison >>>> try_to_unmap(TTU_HWPOISON) >>>> walk_page_range() return 1 due to hwpoison PTE entry >>>> kill_proc in kill_accessing_process() >>>> >>>> If the CMCI wins the race, we need to kill the process in >>>> kill_accessing_process(). And if try_to_remap() success, everything goes well, >>>> kill_proc() will send SIGBUS in kill_accessing_process(). >>>> >>>> But if try_to_remap() fails, the PTE entry will not be marked as hwpoison, and >>>> walk_page_range() return 0 as case 1 clean page, NO SIGBUS will be sent. >>> >>> If try_to_unmap() fails, the PTE entry will still point to the dirty page. Then in >>> check_hwpoisoned_entry(), we will have pfn == poisoned_pfn. So walk_page_range() >>> will return 1 in this case. Or am I miss something? >>> >> >> You’re right; I overlooked the pte_present() branch. >> >> Therefore, in the walk_page_range() function: >> - It returns 0 when the poison page is a clean page. >> - It returns 1 when CMCI wins, regardless of whether try_to_unmap succeeds >> or fails. >> >> Then the patch will be like: >> >> @@ -883,10 +883,9 @@ static int kill_accessing_process(struct task_struct *p, unsigned long pfn, >> (void *)&priv); >> if (ret == 1 && priv.tk.addr) >> kill_proc(&priv.tk, pfn, flags); >> - else >> - ret = 0; >> mmap_read_unlock(p->mm); >> - return ret > 0 ? -EHWPOISON : -EFAULT; >> + >> + return ret > 0 ? -EHWPOISON : 0; >> >> Here, returning 0 indicates that memory_failure() successfully handled the >> error by dropping the clean page. > > I'm not sure whether there's another scene that can make walk_page_range() returns 0. But if the > only reason for walk_page_range() returning 0 is the poison page is a clean page and it's dropped, > then this modification should be appropriate. With this change, the callers never send SIGBUS now. > They might need to be changed too. Yes, if memory_failure() successfully handled the error, the caller should be nothing. Thanks. Shuai
> > Then the patch will be like: > > > > @@ -883,10 +883,9 @@ static int kill_accessing_process(struct task_struct *p, unsigned long pfn, > > (void *)&priv); > > if (ret == 1 && priv.tk.addr) > > kill_proc(&priv.tk, pfn, flags); > > - else > > - ret = 0; > > mmap_read_unlock(p->mm); > > - return ret > 0 ? -EHWPOISON : -EFAULT; > > + > > + return ret > 0 ? -EHWPOISON : 0; > > > > Here, returning 0 indicates that memory_failure() successfully handled the > > error by dropping the clean page. > > I'm not sure whether there's another scene that can make walk_page_range() returns 0. But if the > only reason for walk_page_range() returning 0 is the poison page is a clean page and it's dropped, > then this modification should be appropriate. With this change, the callers never send SIGBUS now. > They might need to be changed too. Note there shouldn't be a SIGBUS when the action was "dropping a clean page". Full recovery is possible in this case (user process takes #PF, Linux allocates a new page and fills by reading from storage). -Tony
diff --git a/mm/memory-failure.c b/mm/memory-failure.c index 995a15eb67e2..f9a6b136a6f0 100644 --- a/mm/memory-failure.c +++ b/mm/memory-failure.c @@ -883,10 +883,9 @@ static int kill_accessing_process(struct task_struct *p, unsigned long pfn, (void *)&priv); if (ret == 1 && priv.tk.addr) kill_proc(&priv.tk, pfn, flags); - else - ret = 0; mmap_read_unlock(p->mm); - return ret > 0 ? -EHWPOISON : -EFAULT; + + return ret >= 0 ? -EHWPOISON : -EFAULT; } /*
When an uncorrected memory error is consumed there is a race between the CMCI from the memory controller reporting an uncorrected error with a UCNA signature, and the core reporting and SRAR signature machine check when the data is about to be consumed. If the CMCI wins that race, the page is marked poisoned when uc_decode_notifier() calls memory_failure(). For dirty pages, memory_failure() invokes try_to_unmap() with the TTU_HWPOISON flag, converting the PTE to a hwpoison entry. However, for clean pages, the TTU_HWPOISON flag is cleared, leaving the PTE unchanged and not converted to a hwpoison entry. Consequently, for an unmapped dirty page, the PTE is marked as a hwpoison entry allowing kill_accessing_process() to: - call walk_page_range() and return 1 - call kill_proc() to make sure a SIGBUS is sent - return -EHWPOISON to indicate that SIGBUS is already sent to the process and kill_me_maybe() doesn't have to send it again. Conversely, for clean pages where PTE entries are not marked as hwpoison, kill_accessing_process() returns -EFAULT, causing kill_me_maybe() to send a SIGBUS. Console log looks like this: Memory failure: 0x827ca68: corrupted page was clean: dropped without side effects Memory failure: 0x827ca68: recovery action for clean LRU page: Recovered Memory failure: 0x827ca68: already hardware poisoned mce: Memory error not recovered To fix it, return -EHWPOISON if no hwpoison PTE entry is found, preventing an unnecessary SIGBUS. Fixes: 046545a661af ("mm/hwpoison: fix error page recovered but reported "not recovered"") Signed-off-by: Shuai Xue <xueshuai@linux.alibaba.com> --- mm/memory-failure.c | 5 ++--- 1 file changed, 2 insertions(+), 3 deletions(-)