Message ID | 20221019170835.155381-1-tony.luck@intel.com (mailing list archive)
---|---
State | New
Series | [v2] mm, hwpoison: Try to recover from copy-on write faults
Tony Luck wrote: > If the kernel is copying a page as the result of a copy-on-write > fault and runs into an uncorrectable error, Linux will crash because > it does not have recovery code for this case where poison is consumed > by the kernel. > > It is easy to set up a test case. Just inject an error into a private > page, fork(2), and have the child process write to the page. > > I wrapped that neatly into a test at: > > git://git.kernel.org/pub/scm/linux/kernel/git/aegl/ras-tools.git > > just enable ACPI error injection and run: > > # ./einj_mem-uc -f copy-on-write > > Add a new copy_user_highpage_mc() function that uses copy_mc_to_kernel() > on architectures where that is available (currently x86 and powerpc). > When an error is detected during the page copy, return VM_FAULT_HWPOISON > to caller of wp_page_copy(). This propagates up the call stack. Both x86 > and powerpc have code in their fault handler to deal with this code by > sending a SIGBUS to the application. > > Note that this patch avoids a system crash and signals the process that > triggered the copy-on-write action. It does not take any action for the > memory error that is still in the shared page. To handle that a call to > memory_failure() is needed. But this cannot be done from wp_page_copy() > because it holds mmap_lock(). Perhaps the architecture fault handlers > can deal with this loose end in a subsequent patch? > > On Intel/x86 this loose end will often be handled automatically because > the memory controller provides an additional notification of the h/w > poison in memory, the handler for this will call memory_failure(). This > isn't a 100% solution. If there are multiple errors, not all may be > logged in this way. > > Signed-off-by: Tony Luck <tony.luck@intel.com> Just some minor comments below, but you can add: Reviewed-by: Dan Williams <dan.j.williams@intel.com> > > --- > Changes in V2: > Naoya Horiguchi: > 1) Use -EHWPOISON error code instead of minus one. 
> 2) Poison path needs also to deal with old_page > Tony Luck: > Rewrote commit message > Added some powerpc folks to Cc: list > --- > include/linux/highmem.h | 19 +++++++++++++++++++ > mm/memory.c | 28 +++++++++++++++++++--------- > 2 files changed, 38 insertions(+), 9 deletions(-) > > diff --git a/include/linux/highmem.h b/include/linux/highmem.h > index e9912da5441b..5967541fbf0e 100644 > --- a/include/linux/highmem.h > +++ b/include/linux/highmem.h > @@ -319,6 +319,25 @@ static inline void copy_user_highpage(struct page *to, struct page *from, > > #endif > > +static inline int copy_user_highpage_mc(struct page *to, struct page *from, > + unsigned long vaddr, struct vm_area_struct *vma) > +{ > + unsigned long ret = 0; > +#ifdef copy_mc_to_kernel > + char *vfrom, *vto; > + > + vfrom = kmap_local_page(from); > + vto = kmap_local_page(to); > + ret = copy_mc_to_kernel(vto, vfrom, PAGE_SIZE); > + kunmap_local(vto); > + kunmap_local(vfrom); > +#else > + copy_user_highpage(to, from, vaddr, vma); > +#endif > + > + return ret; > +} > + There is likely some small benefit of doing this the idiomatic way and let grep see that there are multiple definitions of copy_user_highpage_mc() with an organization like: #ifdef copy_mc_to_kernel static inline int copy_user_highpage_mc(struct page *to, struct page *from, unsigned long vaddr, struct vm_area_struct *vma) { unsigned long ret = 0; char *vfrom, *vto; vfrom = kmap_local_page(from); vto = kmap_local_page(to); ret = copy_mc_to_kernel(vto, vfrom, PAGE_SIZE); kunmap_local(vto); kunmap_local(vfrom); return ret; } #else static inline int copy_user_highpage_mc(struct page *to, struct page *from, unsigned long vaddr, struct vm_area_struct *vma) { copy_user_highpage(to, from, vaddr, vma); return 0; } #endif Per the copy_mc* discussion with Linus I would have called this function copy_mc_to_user_highpage() to clarify that hwpoison is handled from the source buffer of the copy. 
> #ifndef __HAVE_ARCH_COPY_HIGHPAGE > > static inline void copy_highpage(struct page *to, struct page *from) > diff --git a/mm/memory.c b/mm/memory.c > index f88c351aecd4..a32556c9b689 100644 > --- a/mm/memory.c > +++ b/mm/memory.c > @@ -2848,8 +2848,14 @@ static inline int pte_unmap_same(struct vm_fault *vmf) > return same; > } > > -static inline bool __wp_page_copy_user(struct page *dst, struct page *src, > - struct vm_fault *vmf) > +/* > + * Return: > + * -EHWPOISON: copy failed due to hwpoison in source page > + * 0: copied failed (some other reason) > + * 1: copied succeeded > + */ > +static inline int __wp_page_copy_user(struct page *dst, struct page *src, > + struct vm_fault *vmf) > { > bool ret; > void *kaddr; > @@ -2860,8 +2866,9 @@ static inline bool __wp_page_copy_user(struct page *dst, struct page *src, > unsigned long addr = vmf->address; > > if (likely(src)) { > - copy_user_highpage(dst, src, addr, vma); > - return true; > + if (copy_user_highpage_mc(dst, src, addr, vma)) > + return -EHWPOISON; Given there is no use case for the residue value returned by copy_mc_to_kernel() perhaps just return EHWPOISON directly from copyuser_highpage_mc() in the short-copy case? > + return 1; > } > > /* > @@ -2888,7 +2895,7 @@ static inline bool __wp_page_copy_user(struct page *dst, struct page *src, > * and update local tlb only > */ > update_mmu_tlb(vma, addr, vmf->pte); > - ret = false; > + ret = 0; What do you think about just making these 'false' cases also return a negative errno? (rationale below...) 
> goto pte_unlock; > } > > @@ -2913,7 +2920,7 @@ static inline bool __wp_page_copy_user(struct page *dst, struct page *src, > if (!likely(pte_same(*vmf->pte, vmf->orig_pte))) { > /* The PTE changed under us, update local tlb */ > update_mmu_tlb(vma, addr, vmf->pte); > - ret = false; > + ret = 0; > goto pte_unlock; > } > > @@ -2932,7 +2939,7 @@ static inline bool __wp_page_copy_user(struct page *dst, struct page *src, > } > } > > - ret = true; > + ret = 1; > > pte_unlock: > if (locked) > @@ -3104,6 +3111,7 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf) > pte_t entry; > int page_copied = 0; > struct mmu_notifier_range range; > + int ret; > > delayacct_wpcopy_start(); > > @@ -3121,19 +3129,21 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf) > if (!new_page) > goto oom; > > - if (!__wp_page_copy_user(new_page, old_page, vmf)) { > + ret = __wp_page_copy_user(new_page, old_page, vmf); > + if (ret <= 0) { ...this would become a typical '0 == success' and 'negative errno == failure', where all but EHWPOISON are retried. > /* > * COW failed, if the fault was solved by other, > * it's fine. If not, userspace would re-fault on > * the same address and we will handle the fault > * from the second attempt. > + * The -EHWPOISON case will not be retried. > */ > put_page(new_page); > if (old_page) > put_page(old_page); > > delayacct_wpcopy_end(); > - return 0; > + return ret == -EHWPOISON ? VM_FAULT_HWPOISON : 0;
> Given there is no use case for the residue value returned by
> copy_mc_to_kernel() perhaps just return EHWPOISON directly from
> copyuser_highpage_mc() in the short-copy case?

I don't think it hurts to keep the return value as a residue count. It
isn't making that code any more complex and could be useful someday.

Other feedback looks good, and I have applied it ready for the next version.

Thanks for the review.

-Tony
在 2022/10/20 AM1:08, Tony Luck 写道: > If the kernel is copying a page as the result of a copy-on-write > fault and runs into an uncorrectable error, Linux will crash because > it does not have recovery code for this case where poison is consumed > by the kernel. > > It is easy to set up a test case. Just inject an error into a private > page, fork(2), and have the child process write to the page. > > I wrapped that neatly into a test at: > > git://git.kernel.org/pub/scm/linux/kernel/git/aegl/ras-tools.git > > just enable ACPI error injection and run: > > # ./einj_mem-uc -f copy-on-write > > Add a new copy_user_highpage_mc() function that uses copy_mc_to_kernel() > on architectures where that is available (currently x86 and powerpc). > When an error is detected during the page copy, return VM_FAULT_HWPOISON > to caller of wp_page_copy(). This propagates up the call stack. Both x86 > and powerpc have code in their fault handler to deal with this code by > sending a SIGBUS to the application. Does it send SIGBUS to only child process or both parent and child process? > > Note that this patch avoids a system crash and signals the process that > triggered the copy-on-write action. It does not take any action for the > memory error that is still in the shared page. To handle that a call to > memory_failure() is needed. If the error page is not poisoned, should the return value of wp_page_copy be VM_FAULT_HWPOISON or VM_FAULT_SIGBUS? When is_hwpoison_entry(entry) or PageHWPoison(page) is true, do_swap_page return VM_FAULT_HWPOISON to caller. And when is_swapin_error_entry is true, do_swap_page return VM_FAULT_SIGBUS. Thanks. Best Regards, Shuai > But this cannot be done from wp_page_copy() > because it holds mmap_lock(). Perhaps the architecture fault handlers > can deal with this loose end in a subsequent patch? 
> > On Intel/x86 this loose end will often be handled automatically because > the memory controller provides an additional notification of the h/w > poison in memory, the handler for this will call memory_failure(). This > isn't a 100% solution. If there are multiple errors, not all may be > logged in this way. > > Signed-off-by: Tony Luck <tony.luck@intel.com> > > --- > Changes in V2: > Naoya Horiguchi: > 1) Use -EHWPOISON error code instead of minus one. > 2) Poison path needs also to deal with old_page > Tony Luck: > Rewrote commit message > Added some powerpc folks to Cc: list > --- > include/linux/highmem.h | 19 +++++++++++++++++++ > mm/memory.c | 28 +++++++++++++++++++--------- > 2 files changed, 38 insertions(+), 9 deletions(-) > > diff --git a/include/linux/highmem.h b/include/linux/highmem.h > index e9912da5441b..5967541fbf0e 100644 > --- a/include/linux/highmem.h > +++ b/include/linux/highmem.h > @@ -319,6 +319,25 @@ static inline void copy_user_highpage(struct page *to, struct page *from, > > #endif > > +static inline int copy_user_highpage_mc(struct page *to, struct page *from, > + unsigned long vaddr, struct vm_area_struct *vma) > +{ > + unsigned long ret = 0; > +#ifdef copy_mc_to_kernel > + char *vfrom, *vto; > + > + vfrom = kmap_local_page(from); > + vto = kmap_local_page(to); > + ret = copy_mc_to_kernel(vto, vfrom, PAGE_SIZE); > + kunmap_local(vto); > + kunmap_local(vfrom); > +#else > + copy_user_highpage(to, from, vaddr, vma); > +#endif > + > + return ret; > +} > + > #ifndef __HAVE_ARCH_COPY_HIGHPAGE > > static inline void copy_highpage(struct page *to, struct page *from) > diff --git a/mm/memory.c b/mm/memory.c > index f88c351aecd4..a32556c9b689 100644 > --- a/mm/memory.c > +++ b/mm/memory.c > @@ -2848,8 +2848,14 @@ static inline int pte_unmap_same(struct vm_fault *vmf) > return same; > } > > -static inline bool __wp_page_copy_user(struct page *dst, struct page *src, > - struct vm_fault *vmf) > +/* > + * Return: > + * -EHWPOISON: copy failed due 
to hwpoison in source page > + * 0: copied failed (some other reason) > + * 1: copied succeeded > + */ > +static inline int __wp_page_copy_user(struct page *dst, struct page *src, > + struct vm_fault *vmf) > { > bool ret; > void *kaddr; > @@ -2860,8 +2866,9 @@ static inline bool __wp_page_copy_user(struct page *dst, struct page *src, > unsigned long addr = vmf->address; > > if (likely(src)) { > - copy_user_highpage(dst, src, addr, vma); > - return true; > + if (copy_user_highpage_mc(dst, src, addr, vma)) > + return -EHWPOISON; > + return 1; > } > > /* > @@ -2888,7 +2895,7 @@ static inline bool __wp_page_copy_user(struct page *dst, struct page *src, > * and update local tlb only > */ > update_mmu_tlb(vma, addr, vmf->pte); > - ret = false; > + ret = 0; > goto pte_unlock; > } > > @@ -2913,7 +2920,7 @@ static inline bool __wp_page_copy_user(struct page *dst, struct page *src, > if (!likely(pte_same(*vmf->pte, vmf->orig_pte))) { > /* The PTE changed under us, update local tlb */ > update_mmu_tlb(vma, addr, vmf->pte); > - ret = false; > + ret = 0; > goto pte_unlock; > } > > @@ -2932,7 +2939,7 @@ static inline bool __wp_page_copy_user(struct page *dst, struct page *src, > } > } > > - ret = true; > + ret = 1; > > pte_unlock: > if (locked) > @@ -3104,6 +3111,7 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf) > pte_t entry; > int page_copied = 0; > struct mmu_notifier_range range; > + int ret; > > delayacct_wpcopy_start(); > > @@ -3121,19 +3129,21 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf) > if (!new_page) > goto oom; > > - if (!__wp_page_copy_user(new_page, old_page, vmf)) { > + ret = __wp_page_copy_user(new_page, old_page, vmf); > + if (ret <= 0) { > /* > * COW failed, if the fault was solved by other, > * it's fine. If not, userspace would re-fault on > * the same address and we will handle the fault > * from the second attempt. > + * The -EHWPOISON case will not be retried. 
> */ > put_page(new_page); > if (old_page) > put_page(old_page); > > delayacct_wpcopy_end(); > - return 0; > + return ret == -EHWPOISON ? VM_FAULT_HWPOISON : 0; > } > kmsan_copy_page_meta(new_page, old_page); > }
On Thu, Oct 20, 2022 at 09:57:04AM +0800, Shuai Xue wrote: > > > 在 2022/10/20 AM1:08, Tony Luck 写道: > > If the kernel is copying a page as the result of a copy-on-write > > fault and runs into an uncorrectable error, Linux will crash because > > it does not have recovery code for this case where poison is consumed > > by the kernel. > > > > It is easy to set up a test case. Just inject an error into a private > > page, fork(2), and have the child process write to the page. > > > > I wrapped that neatly into a test at: > > > > git://git.kernel.org/pub/scm/linux/kernel/git/aegl/ras-tools.git > > > > just enable ACPI error injection and run: > > > > # ./einj_mem-uc -f copy-on-write > > > > Add a new copy_user_highpage_mc() function that uses copy_mc_to_kernel() > > on architectures where that is available (currently x86 and powerpc). > > When an error is detected during the page copy, return VM_FAULT_HWPOISON > > to caller of wp_page_copy(). This propagates up the call stack. Both x86 > > and powerpc have code in their fault handler to deal with this code by > > sending a SIGBUS to the application. > > Does it send SIGBUS to only child process or both parent and child process? This only sends a SIGBUS to the process that wrote the page (typically the child, but also possible that the parent is the one that does the write that causes the COW). > > > > Note that this patch avoids a system crash and signals the process that > > triggered the copy-on-write action. It does not take any action for the > > memory error that is still in the shared page. To handle that a call to > > memory_failure() is needed. > > If the error page is not poisoned, should the return value of wp_page_copy > be VM_FAULT_HWPOISON or VM_FAULT_SIGBUS? When is_hwpoison_entry(entry) or > PageHWPoison(page) is true, do_swap_page return VM_FAULT_HWPOISON to caller. > And when is_swapin_error_entry is true, do_swap_page return VM_FAULT_SIGBUS. 
The page has uncorrected data in it, but this patch doesn't mark it
as poisoned. Returning VM_FAULT_SIGBUS would send an "ordinary" SIGBUS
that doesn't include the BUS_MCEERR_AR and "lsb" information. It would
also skip the:

  "MCE: Killing %s:%d due to hardware memory corruption fault at %lx\n"

console message. So it might result in confusion and attempts to debug a
s/w problem with the application instead of blaming the death on a bad
DIMM.

> > But this cannot be done from wp_page_copy()
> > because it holds mmap_lock(). Perhaps the architecture fault handlers
> > can deal with this loose end in a subsequent patch?

I started looking at this for x86 ... but I have changed my mind
about this being a good place for a fix. When control returns back
to the architecture fault handler it no longer has easy access to
the physical page frame number. It has the virtual address, so it
could descend back into some new mm/memory.c function to get the
physical address ... but that seems silly.

I'm experimenting with using schedule_work() to handle the call to
memory_failure() (echoing what the machine check handler does using
task_work_add() to avoid the same problem of not being able to directly
call memory_failure()).

So far it seems to be working. Patch below (goes on top of the original
patch ... well, on top of the internal version with mods based on
feedback from Dan Williams ... but it should show the general idea).

With this patch applied the page does get unmapped from all users.
Other tasks that shared the page will get a SIGBUS if they attempt
to access it later (from the page fault handler because of
is_hwpoison_entry(), as you mention above).

-Tony

From d3879e83bf91cd6c61e12d32d3e15eb6ef069204 Mon Sep 17 00:00:00 2001
From: Tony Luck <tony.luck@intel.com>
Date: Thu, 20 Oct 2022 09:57:28 -0700
Subject: [PATCH] mm, hwpoison: Call memory_failure() for source page of COW
 failure

Cannot call memory_failure() directly from the fault handler because
mmap_lock (and others) are held.

It is important, but not urgent, to mark the source page as h/w poisoned
and unmap it from other tasks.

Use schedule_work() to queue a request to call memory_failure() for the
page with the error.

Signed-off-by: Tony Luck <tony.luck@intel.com>
---
 mm/memory.c | 35 ++++++++++++++++++++++++++++++++++-
 1 file changed, 34 insertions(+), 1 deletion(-)

diff --git a/mm/memory.c b/mm/memory.c
index b6056eef2f72..4a1304cf1f4e 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2848,6 +2848,37 @@ static inline int pte_unmap_same(struct vm_fault *vmf)
 	return same;
 }
 
+#ifdef CONFIG_MEMORY_FAILURE
+struct pfn_work {
+	struct work_struct work;
+	unsigned long pfn;
+};
+
+static void do_sched_memory_failure(struct work_struct *w)
+{
+	struct pfn_work *p = container_of(w, struct pfn_work, work);
+
+	memory_failure(p->pfn, 0);
+	kfree(p);
+}
+
+static void sched_memory_failure(unsigned long pfn)
+{
+	struct pfn_work *p;
+
+	p = kmalloc(sizeof *p, GFP_KERNEL);
+	if (!p)
+		return;
+	INIT_WORK(&p->work, do_sched_memory_failure);
+	p->pfn = pfn;
+	schedule_work(&p->work);
+}
+#else
+static void sched_memory_failure(unsigned long pfn)
+{
+}
+#endif
+
 /*
  * Return:
  *	0:		copied succeeded
@@ -2866,8 +2897,10 @@ static inline int __wp_page_copy_user(struct page *dst, struct page *src,
 	unsigned long addr = vmf->address;
 
 	if (likely(src)) {
-		if (copy_mc_user_highpage(dst, src, addr, vma))
+		if (copy_mc_user_highpage(dst, src, addr, vma)) {
+			sched_memory_failure(page_to_pfn(src));
 			return -EHWPOISON;
+		}
 		return 0;
 	}
On 2022/10/21 4:05, Tony Luck wrote: > On Thu, Oct 20, 2022 at 09:57:04AM +0800, Shuai Xue wrote: >> >> >> 在 2022/10/20 AM1:08, Tony Luck 写道: >>> If the kernel is copying a page as the result of a copy-on-write >>> fault and runs into an uncorrectable error, Linux will crash because >>> it does not have recovery code for this case where poison is consumed >>> by the kernel. >>> >>> It is easy to set up a test case. Just inject an error into a private >>> page, fork(2), and have the child process write to the page. >>> >>> I wrapped that neatly into a test at: >>> >>> git://git.kernel.org/pub/scm/linux/kernel/git/aegl/ras-tools.git >>> >>> just enable ACPI error injection and run: >>> >>> # ./einj_mem-uc -f copy-on-write >>> >>> Add a new copy_user_highpage_mc() function that uses copy_mc_to_kernel() >>> on architectures where that is available (currently x86 and powerpc). >>> When an error is detected during the page copy, return VM_FAULT_HWPOISON >>> to caller of wp_page_copy(). This propagates up the call stack. Both x86 >>> and powerpc have code in their fault handler to deal with this code by >>> sending a SIGBUS to the application. >> >> Does it send SIGBUS to only child process or both parent and child process? > > This only sends a SIGBUS to the process that wrote the page (typically > the child, but also possible that the parent is the one that does the > write that causes the COW). > >>> >>> Note that this patch avoids a system crash and signals the process that >>> triggered the copy-on-write action. It does not take any action for the >>> memory error that is still in the shared page. To handle that a call to >>> memory_failure() is needed. >> >> If the error page is not poisoned, should the return value of wp_page_copy >> be VM_FAULT_HWPOISON or VM_FAULT_SIGBUS? When is_hwpoison_entry(entry) or >> PageHWPoison(page) is true, do_swap_page return VM_FAULT_HWPOISON to caller. >> And when is_swapin_error_entry is true, do_swap_page return VM_FAULT_SIGBUS. 
> > The page has uncorrected data in it, but this patch doesn't mark it > as poisoned. Returning VM_FAULT_SIGBUS would send an "ordinary" SIGBUS > that doesn't include the BUS_MCEERR_AR and "lsb" information. It would > also skip the: > > "MCE: Killing %s:%d due to hardware memory corruption fault at %lx\n" > > console message. So might result in confusion and attepmts to debug a > s/w problem with the application instead of blaming the death on a bad > DIMM. > >>> But this cannot be done from wp_page_copy() >>> because it holds mmap_lock(). Perhaps the architecture fault handlers >>> can deal with this loose end in a subsequent patch? > > I started looking at this for x86 ... but I have changed my mind > about this being a good place for a fix. When control returns back > to the architecture fault handler it no longer has easy access to > the physical page frame number. It has the virtual address, so it > could descend back into somee new mm/memory.c function to get the > physical address ... but that seems silly. > > I'm experimenting with using sched_work() to handle the call to > memory_failure() (echoing what the machine check handler does using > task_work)_add() to avoid the same problem of not being able to directly > call memory_failure()). > > So far it seems to be working. Patch below (goes on top of original > patch ... well on top of the internal version with mods based on > feedback from Dan Williams ... but should show the general idea) > > With this patch applied the page does get unmapped from all users. > Other tasks that shared the page will get a SIGBUS if they attempt > to access it later (from the page fault handler because of > is_hwpoison_entry() as you mention above. 
> > -Tony > >>From d3879e83bf91cd6c61e12d32d3e15eb6ef069204 Mon Sep 17 00:00:00 2001 > From: Tony Luck <tony.luck@intel.com> > Date: Thu, 20 Oct 2022 09:57:28 -0700 > Subject: [PATCH] mm, hwpoison: Call memory_failure() for source page of COW > failure > > Cannot call memory_failure() directly from the fault handler because > mmap_lock (and others) are held. > > It is important, but not urgent, to mark the source page as h/w poisoned > and unmap it from other tasks. > > Use schedule_work() to queue a request to call memory_failure() for the > page with the error. > > Signed-off-by: Tony Luck <tony.luck@intel.com> > --- > mm/memory.c | 35 ++++++++++++++++++++++++++++++++++- > 1 file changed, 34 insertions(+), 1 deletion(-) > > diff --git a/mm/memory.c b/mm/memory.c > index b6056eef2f72..4a1304cf1f4e 100644 > --- a/mm/memory.c > +++ b/mm/memory.c > @@ -2848,6 +2848,37 @@ static inline int pte_unmap_same(struct vm_fault *vmf) > return same; > } > > +#ifdef CONFIG_MEMORY_FAILURE > +struct pfn_work { > + struct work_struct work; > + unsigned long pfn; > +}; > + > +static void do_sched_memory_failure(struct work_struct *w) > +{ > + struct pfn_work *p = container_of(w, struct pfn_work, work); > + > + memory_failure(p->pfn, 0); > + kfree(p); > +} > + > +static void sched_memory_failure(unsigned long pfn) > +{ > + struct pfn_work *p; > + > + p = kmalloc(sizeof *p, GFP_KERNEL); > + if (!p) > + return; > + INIT_WORK(&p->work, do_sched_memory_failure); > + p->pfn = pfn; > + schedule_work(&p->work); There is already memory_failure_queue() that can do this. Can we use it directly? 
Thanks, Miaohe Lin > +} > +#else > +static void sched_memory_failure(unsigned long pfn) > +{ > +} > +#endif > + > /* > * Return: > * 0: copied succeeded > @@ -2866,8 +2897,10 @@ static inline int __wp_page_copy_user(struct page *dst, struct page *src, > unsigned long addr = vmf->address; > > if (likely(src)) { > - if (copy_mc_user_highpage(dst, src, addr, vma)) > + if (copy_mc_user_highpage(dst, src, addr, vma)) { > + sched_memory_failure(page_to_pfn(src)); > return -EHWPOISON; > + } > return 0; > } > >
在 2022/10/21 AM4:05, Tony Luck 写道: > On Thu, Oct 20, 2022 at 09:57:04AM +0800, Shuai Xue wrote: >> >> >> 在 2022/10/20 AM1:08, Tony Luck 写道: >>> If the kernel is copying a page as the result of a copy-on-write >>> fault and runs into an uncorrectable error, Linux will crash because >>> it does not have recovery code for this case where poison is consumed >>> by the kernel. >>> >>> It is easy to set up a test case. Just inject an error into a private >>> page, fork(2), and have the child process write to the page. >>> >>> I wrapped that neatly into a test at: >>> >>> git://git.kernel.org/pub/scm/linux/kernel/git/aegl/ras-tools.git >>> >>> just enable ACPI error injection and run: >>> >>> # ./einj_mem-uc -f copy-on-write >>> >>> Add a new copy_user_highpage_mc() function that uses copy_mc_to_kernel() >>> on architectures where that is available (currently x86 and powerpc). >>> When an error is detected during the page copy, return VM_FAULT_HWPOISON >>> to caller of wp_page_copy(). This propagates up the call stack. Both x86 >>> and powerpc have code in their fault handler to deal with this code by >>> sending a SIGBUS to the application. >> >> Does it send SIGBUS to only child process or both parent and child process? > > This only sends a SIGBUS to the process that wrote the page (typically > the child, but also possible that the parent is the one that does the > write that causes the COW). Thanks for your explanation. > >>> >>> Note that this patch avoids a system crash and signals the process that >>> triggered the copy-on-write action. It does not take any action for the >>> memory error that is still in the shared page. To handle that a call to >>> memory_failure() is needed. >> >> If the error page is not poisoned, should the return value of wp_page_copy >> be VM_FAULT_HWPOISON or VM_FAULT_SIGBUS? When is_hwpoison_entry(entry) or >> PageHWPoison(page) is true, do_swap_page return VM_FAULT_HWPOISON to caller. 
>> And when is_swapin_error_entry is true, do_swap_page return VM_FAULT_SIGBUS. > > The page has uncorrected data in it, but this patch doesn't mark it > as poisoned. Returning VM_FAULT_SIGBUS would send an "ordinary" SIGBUS > that doesn't include the BUS_MCEERR_AR and "lsb" information. It would > also skip the: > > "MCE: Killing %s:%d due to hardware memory corruption fault at %lx\n" > > console message. So might result in confusion and attepmts to debug a > s/w problem with the application instead of blaming the death on a bad > DIMM. I see your point. Thank you. > >>> But this cannot be done from wp_page_copy() >>> because it holds mmap_lock(). Perhaps the architecture fault handlers >>> can deal with this loose end in a subsequent patch? > > I started looking at this for x86 ... but I have changed my mind > about this being a good place for a fix. When control returns back > to the architecture fault handler it no longer has easy access to > the physical page frame number. It has the virtual address, so it > could descend back into somee new mm/memory.c function to get the > physical address ... but that seems silly. > > I'm experimenting with using sched_work() to handle the call to > memory_failure() (echoing what the machine check handler does using > task_work)_add() to avoid the same problem of not being able to directly > call memory_failure()). Work queues permit work to be deferred outside of the interrupt context into the kernel process context. If we return to user-space before the queued memory_failure() work is processed, we will take the fault again, as we discussed recently. commit 7f17b4a121d0d ACPI: APEI: Kick the memory_failure() queue for synchronous errors commit 415fed694fe11 ACPI: APEI: do not add task_work to kernel thread to avoid memory leak So, in my opinion, we should add memory failure as a task work, like do_machine_check does, e.g. queue_task_work(&m, msg, kill_me_maybe); > > So far it seems to be working. 
Patch below (goes on top of original > patch ... well on top of the internal version with mods based on > feedback from Dan Williams ... but should show the general idea) > > With this patch applied the page does get unmapped from all users. > Other tasks that shared the page will get a SIGBUS if they attempt > to access it later (from the page fault handler because of > is_hwpoison_entry() as you mention above. > > -Tony > > From d3879e83bf91cd6c61e12d32d3e15eb6ef069204 Mon Sep 17 00:00:00 2001 > From: Tony Luck <tony.luck@intel.com> > Date: Thu, 20 Oct 2022 09:57:28 -0700 > Subject: [PATCH] mm, hwpoison: Call memory_failure() for source page of COW > failure > > Cannot call memory_failure() directly from the fault handler because > mmap_lock (and others) are held. > > It is important, but not urgent, to mark the source page as h/w poisoned > and unmap it from other tasks. > > Use schedule_work() to queue a request to call memory_failure() for the > page with the error. > > Signed-off-by: Tony Luck <tony.luck@intel.com> > --- > mm/memory.c | 35 ++++++++++++++++++++++++++++++++++- > 1 file changed, 34 insertions(+), 1 deletion(-) > > diff --git a/mm/memory.c b/mm/memory.c > index b6056eef2f72..4a1304cf1f4e 100644 > --- a/mm/memory.c > +++ b/mm/memory.c > @@ -2848,6 +2848,37 @@ static inline int pte_unmap_same(struct vm_fault *vmf) > return same; > } > > +#ifdef CONFIG_MEMORY_FAILURE > +struct pfn_work { > + struct work_struct work; > + unsigned long pfn; > +}; > + > +static void do_sched_memory_failure(struct work_struct *w) > +{ > + struct pfn_work *p = container_of(w, struct pfn_work, work); > + > + memory_failure(p->pfn, 0); > + kfree(p); > +} > + > +static void sched_memory_failure(unsigned long pfn) > +{ > + struct pfn_work *p; > + > + p = kmalloc(sizeof *p, GFP_KERNEL); > + if (!p) > + return; > + INIT_WORK(&p->work, do_sched_memory_failure); > + p->pfn = pfn; > + schedule_work(&p->work); > +} I think there is already a function to do such work in 
mm/memory-failure.c. void memory_failure_queue(unsigned long pfn, int flags) Best Regards, Shuai > +#else > +static void sched_memory_failure(unsigned long pfn) > +{ > +} > +#endif > + > /* > * Return: > * 0: copied succeeded > @@ -2866,8 +2897,10 @@ static inline int __wp_page_copy_user(struct page *dst, struct page *src, > unsigned long addr = vmf->address; > > if (likely(src)) { > - if (copy_mc_user_highpage(dst, src, addr, vma)) > + if (copy_mc_user_highpage(dst, src, addr, vma)) { > + sched_memory_failure(page_to_pfn(src)); > return -EHWPOISON; > + } > return 0; > } >
>> +	INIT_WORK(&p->work, do_sched_memory_failure);
>> +	p->pfn = pfn;
>> +	schedule_work(&p->work);
>
> There is already memory_failure_queue() that can do this. Can we use it directly?

Miaohe Lin,

Yes, we can use that. A thousand thanks for pointing it out. I just tried
it, and it works perfectly.

I think I'll need to add an empty stub version for the
CONFIG_MEMORY_FAILURE=n build. But that's trivial.

-Tony
On Fri, Oct 21, 2022 at 09:52:01AM +0800, Shuai Xue wrote:
>
>
> On 2022/10/21 4:05 AM, Tony Luck wrote:
> > On Thu, Oct 20, 2022 at 09:57:04AM +0800, Shuai Xue wrote:
> >>
> >>
> >> On 2022/10/20 1:08 AM, Tony Luck wrote:
>
> > I'm experimenting with using schedule_work() to handle the call to
> > memory_failure() (echoing what the machine check handler does using
> > task_work_add() to avoid the same problem of not being able to directly
> > call memory_failure()).
>
> Work queues permit work to be deferred outside of the interrupt context
> into the kernel process context. If we return to user-space before the
> queued memory_failure() work is processed, we will take the fault again,
> as we discussed recently.
>
> commit 7f17b4a121d0d ACPI: APEI: Kick the memory_failure() queue for synchronous errors
> commit 415fed694fe11 ACPI: APEI: do not add task_work to kernel thread to avoid memory leak
>
> So, in my opinion, we should add memory failure as a task work, like
> do_machine_check does, e.g.
>
> 	queue_task_work(&m, msg, kill_me_maybe);

Maybe ... but this case isn't pending back to a user instruction
that is trying to READ the poison memory address. The task is just
trying to WRITE to any address within the page.

So this is much more like a patrol scrub error found asynchronously
by the memory controller (in this case found asynchronously by the
Linux page copy function). So I don't feel that it's really the
responsibility of the current task.

When we do return to user mode the task is going to be busy servicing
a SIGBUS ... so shouldn't try to touch the poison page before the
memory_failure() called by the worker thread cleans things up.

> > +	INIT_WORK(&p->work, do_sched_memory_failure);
> > +	p->pfn = pfn;
> > +	schedule_work(&p->work);
> > +}
>
> I think there is already a function to do such work in mm/memory-failure.c.
>
> 	void memory_failure_queue(unsigned long pfn, int flags)

Also pointed out by Miaohe Lin <linmiaohe@huawei.com> ... this does
exactly what I want, and is working well in tests so far. So perhaps a
cleaner solution than making the kill_me_maybe() function globally
visible.

-Tony
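[Editor's note] For readers following the thread, the shape Tony describes — letting the existing memory_failure_queue() helper defer the cleanup instead of open-coding a work item — might look like the kernel-style sketch below. This is a sketch only, not part of the posted v2 patch; the exact call site and the `0` flags argument are assumptions.

```c
/* Sketch only -- not part of the v2 patch under review. */
static inline int copy_user_highpage_mc(struct page *to, struct page *from,
					unsigned long vaddr,
					struct vm_area_struct *vma)
{
	char *vfrom = kmap_local_page(from);
	char *vto = kmap_local_page(to);
	int ret = 0;

	if (copy_mc_to_kernel(vto, vfrom, PAGE_SIZE)) {
		/*
		 * Defer handling of the poisoned source page:
		 * memory_failure_queue() already schedules a call to
		 * memory_failure(pfn, flags) from process context, so
		 * no new work item or globally visible kill_me_maybe()
		 * is needed.
		 */
		memory_failure_queue(page_to_pfn(from), 0);
		ret = -EHWPOISON;
	}
	kunmap_local(vto);
	kunmap_local(vfrom);
	return ret;
}
```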
From: Tony Luck
> Sent: 21 October 2022 05:08
....
> When we do return to user mode the task is going to be busy servicing
> a SIGBUS ... so shouldn't try to touch the poison page before the
> memory_failure() called by the worker thread cleans things up.

What about an RT process on a busy system?
The worker threads are pretty low priority.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)
>> When we do return to user mode the task is going to be busy servicing
>> a SIGBUS ... so shouldn't try to touch the poison page before the
>> memory_failure() called by the worker thread cleans things up.
>
> What about an RT process on a busy system?
> The worker threads are pretty low priority.

Most tasks don't have a SIGBUS handler ... so they just die without any
possibility of accessing the poison.

If this task DOES have a SIGBUS handler, and that for some bizarre reason
just does a "return" so the task jumps back to the instruction that caused
the COW, then there is a 63/64 likelihood that it is touching a different
cache line from the poisoned one.

In the 1/64 case ... it's probably a simple store (since there was a COW,
we know it was trying to modify the page) ... so it won't generate another
machine check (those only happen for reads).

But maybe it is some RMW instruction ... then, if all the above options
didn't happen ... we could get another machine check from the same
address. But then we just follow the usual recovery path.

-Tony
On 2022/10/21 12:08 PM, Tony Luck wrote:
> On Fri, Oct 21, 2022 at 09:52:01AM +0800, Shuai Xue wrote:
>>
>>
>> On 2022/10/21 4:05 AM, Tony Luck wrote:
>>> On Thu, Oct 20, 2022 at 09:57:04AM +0800, Shuai Xue wrote:
>>>>
>>>>
>>>> On 2022/10/20 1:08 AM, Tony Luck wrote:
>
>>> I'm experimenting with using schedule_work() to handle the call to
>>> memory_failure() (echoing what the machine check handler does using
>>> task_work_add() to avoid the same problem of not being able to directly
>>> call memory_failure()).
>>
>> Work queues permit work to be deferred outside of the interrupt context
>> into the kernel process context. If we return to user-space before the
>> queued memory_failure() work is processed, we will take the fault again,
>> as we discussed recently.
>>
>> commit 7f17b4a121d0d ACPI: APEI: Kick the memory_failure() queue for synchronous errors
>> commit 415fed694fe11 ACPI: APEI: do not add task_work to kernel thread to avoid memory leak
>>
>> So, in my opinion, we should add memory failure as a task work, like
>> do_machine_check does, e.g.
>>
>> 	queue_task_work(&m, msg, kill_me_maybe);
>
> Maybe ... but this case isn't pending back to a user instruction
> that is trying to READ the poison memory address. The task is just
> trying to WRITE to any address within the page.

Aha, I see the difference. Thank you. But I still have a question on
this. Let us discuss it in your reply to David Laight.

Best Regards,
Shuai

>
> So this is much more like a patrol scrub error found asynchronously
> by the memory controller (in this case found asynchronously by the
> Linux page copy function). So I don't feel that it's really the
> responsibility of the current task.
>
> When we do return to user mode the task is going to be busy servicing
> a SIGBUS ... so shouldn't try to touch the poison page before the
> memory_failure() called by the worker thread cleans things up.
>
>>> +	INIT_WORK(&p->work, do_sched_memory_failure);
>>> +	p->pfn = pfn;
>>> +	schedule_work(&p->work);
>>> +}
>>
>> I think there is already a function to do such work in mm/memory-failure.c.
>>
>> 	void memory_failure_queue(unsigned long pfn, int flags)
>
> Also pointed out by Miaohe Lin <linmiaohe@huawei.com> ... this does
> exactly what I want, and is working well in tests so far. So perhaps
> a cleaner solution than making the kill_me_maybe() function globally
> visible.
>
> -Tony
On 2022/10/21 12:41 PM, Luck, Tony wrote:
>>> When we do return to user mode the task is going to be busy servicing
>>> a SIGBUS ... so shouldn't try to touch the poison page before the
>>> memory_failure() called by the worker thread cleans things up.
>>
>> What about an RT process on a busy system?
>> The worker threads are pretty low priority.
>
> Most tasks don't have a SIGBUS handler ... so they just die without any
> possibility of accessing the poison.
>
> If this task DOES have a SIGBUS handler, and that for some bizarre reason
> just does a "return" so the task jumps back to the instruction that caused
> the COW, then there is a 63/64 likelihood that it is touching a different
> cache line from the poisoned one.
>
> In the 1/64 case ... it's probably a simple store (since there was a COW,
> we know it was trying to modify the page) ... so it won't generate another
> machine check (those only happen for reads).
>
> But maybe it is some RMW instruction ... then, if all the above options
> didn't happen ... we could get another machine check from the same
> address. But then we just follow the usual recovery path.
>
> -Tony

Let's assume the instruction that caused the COW is in the 63/64 case,
i.e. it is writing a different cache line from the poisoned one. But the
new_page allocated in COW is dropped, right? So it might page fault
again?

Best Regards,
Shuai
>> But maybe it is some RMW instruction ... then, if all the above options
>> didn't happen ... we could get another machine check from the same
>> address. But then we just follow the usual recovery path.

> Let's assume the instruction that caused the COW is in the 63/64 case,
> i.e. it is writing a different cache line from the poisoned one. But the
> new_page allocated in COW is dropped, right? So it might page fault
> again?

It can, but this should be no surprise to a user that has a signal handler
for a h/w event (SIGBUS, SIGSEGV, SIGILL) that does nothing to address the
problem, but simply returns to re-execute the same instruction that caused
the original trap.

There may be badly written signal handlers that do this. But they just
cause pain for themselves. Linux can keep taking the traps and fixing
things up and sending a new signal over and over.

In this case that loop may involve taking the machine check again, so some
extra pain for the kernel, but recoverable machine checks on Intel/x86
switched from broadcast to delivery to just the logical CPU that tried to
consume the poison a few generations back. So only a bit more painful than
a repeated page fault.

-Tony
On 2022/10/22 12:30 AM, Luck, Tony wrote:
>>> But maybe it is some RMW instruction ... then, if all the above options
>>> didn't happen ... we could get another machine check from the same
>>> address. But then we just follow the usual recovery path.
>
>> Let's assume the instruction that caused the COW is in the 63/64 case,
>> i.e. it is writing a different cache line from the poisoned one. But the
>> new_page allocated in COW is dropped, right? So it might page fault
>> again?
>
> It can, but this should be no surprise to a user that has a signal handler for
> a h/w event (SIGBUS, SIGSEGV, SIGILL) that does nothing to address the
> problem, but simply returns to re-execute the same instruction that caused
> the original trap.
>
> There may be badly written signal handlers that do this. But they just cause
> pain for themselves. Linux can keep taking the traps and fixing things up and
> sending a new signal over and over.
>
> In this case that loop may involve taking the machine check again, so some
> extra pain for the kernel, but recoverable machine checks on Intel/x86 switched
> from broadcast to delivery to just the logical CPU that tried to consume the poison
> a few generations back. So only a bit more painful than a repeated page fault.
>
> -Tony

I see, thanks for your patient explanation :)

Best Regards,
Shuai
diff --git a/include/linux/highmem.h b/include/linux/highmem.h
index e9912da5441b..5967541fbf0e 100644
--- a/include/linux/highmem.h
+++ b/include/linux/highmem.h
@@ -319,6 +319,25 @@ static inline void copy_user_highpage(struct page *to, struct page *from,
 
 #endif
 
+static inline int copy_user_highpage_mc(struct page *to, struct page *from,
+					unsigned long vaddr, struct vm_area_struct *vma)
+{
+	unsigned long ret = 0;
+#ifdef copy_mc_to_kernel
+	char *vfrom, *vto;
+
+	vfrom = kmap_local_page(from);
+	vto = kmap_local_page(to);
+	ret = copy_mc_to_kernel(vto, vfrom, PAGE_SIZE);
+	kunmap_local(vto);
+	kunmap_local(vfrom);
+#else
+	copy_user_highpage(to, from, vaddr, vma);
+#endif
+
+	return ret;
+}
+
 #ifndef __HAVE_ARCH_COPY_HIGHPAGE
 
 static inline void copy_highpage(struct page *to, struct page *from)
diff --git a/mm/memory.c b/mm/memory.c
index f88c351aecd4..a32556c9b689 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2848,8 +2848,14 @@ static inline int pte_unmap_same(struct vm_fault *vmf)
 	return same;
 }
 
-static inline bool __wp_page_copy_user(struct page *dst, struct page *src,
-				       struct vm_fault *vmf)
+/*
+ * Return:
+ *	-EHWPOISON:	copy failed due to hwpoison in source page
+ *	0:		copied failed (some other reason)
+ *	1:		copied succeeded
+ */
+static inline int __wp_page_copy_user(struct page *dst, struct page *src,
+				      struct vm_fault *vmf)
 {
 	bool ret;
 	void *kaddr;
@@ -2860,8 +2866,9 @@ static inline bool __wp_page_copy_user(struct page *dst, struct page *src,
 	unsigned long addr = vmf->address;
 
 	if (likely(src)) {
-		copy_user_highpage(dst, src, addr, vma);
-		return true;
+		if (copy_user_highpage_mc(dst, src, addr, vma))
+			return -EHWPOISON;
+		return 1;
 	}
 
 	/*
@@ -2888,7 +2895,7 @@ static inline bool __wp_page_copy_user(struct page *dst, struct page *src,
 		 * and update local tlb only
 		 */
 		update_mmu_tlb(vma, addr, vmf->pte);
-		ret = false;
+		ret = 0;
 		goto pte_unlock;
 	}
 
@@ -2913,7 +2920,7 @@ static inline bool __wp_page_copy_user(struct page *dst, struct page *src,
 		if (!likely(pte_same(*vmf->pte, vmf->orig_pte))) {
 			/* The PTE changed under us, update local tlb */
 			update_mmu_tlb(vma, addr, vmf->pte);
-			ret = false;
+			ret = 0;
 			goto pte_unlock;
 		}
 
@@ -2932,7 +2939,7 @@ static inline bool __wp_page_copy_user(struct page *dst, struct page *src,
 		}
 	}
 
-	ret = true;
+	ret = 1;
 
 pte_unlock:
 	if (locked)
@@ -3104,6 +3111,7 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf)
 	pte_t entry;
 	int page_copied = 0;
 	struct mmu_notifier_range range;
+	int ret;
 
 	delayacct_wpcopy_start();
 
@@ -3121,19 +3129,21 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf)
 	if (!new_page)
 		goto oom;
 
-	if (!__wp_page_copy_user(new_page, old_page, vmf)) {
+	ret = __wp_page_copy_user(new_page, old_page, vmf);
+	if (ret <= 0) {
 		/*
 		 * COW failed, if the fault was solved by other,
 		 * it's fine. If not, userspace would re-fault on
 		 * the same address and we will handle the fault
 		 * from the second attempt.
+		 * The -EHWPOISON case will not be retried.
 		 */
 		put_page(new_page);
 		if (old_page)
 			put_page(old_page);
 
 		delayacct_wpcopy_end();
-		return 0;
+		return ret == -EHWPOISON ? VM_FAULT_HWPOISON : 0;
 	}
 	kmsan_copy_page_meta(new_page, old_page);
 }
If the kernel is copying a page as the result of a copy-on-write
fault and runs into an uncorrectable error, Linux will crash because
it does not have recovery code for this case where poison is consumed
by the kernel.

It is easy to set up a test case. Just inject an error into a private
page, fork(2), and have the child process write to the page.

I wrapped that neatly into a test at:

  git://git.kernel.org/pub/scm/linux/kernel/git/aegl/ras-tools.git

just enable ACPI error injection and run:

  # ./einj_mem-uc -f copy-on-write

Add a new copy_user_highpage_mc() function that uses copy_mc_to_kernel()
on architectures where that is available (currently x86 and powerpc).
When an error is detected during the page copy, return VM_FAULT_HWPOISON
to the caller of wp_page_copy(). This propagates up the call stack. Both
x86 and powerpc have code in their fault handlers to deal with this code
by sending a SIGBUS to the application.

Note that this patch avoids a system crash and signals the process that
triggered the copy-on-write action. It does not take any action for the
memory error that is still in the shared page. To handle that a call to
memory_failure() is needed. But this cannot be done from wp_page_copy()
because it holds mmap_lock(). Perhaps the architecture fault handlers
can deal with this loose end in a subsequent patch?

On Intel/x86 this loose end will often be handled automatically because
the memory controller provides an additional notification of the h/w
poison in memory, the handler for this will call memory_failure(). This
isn't a 100% solution. If there are multiple errors, not all may be
logged in this way.

Signed-off-by: Tony Luck <tony.luck@intel.com>

---
Changes in V2:
   Naoya Horiguchi:
	1) Use -EHWPOISON error code instead of minus one.
	2) Poison path needs also to deal with old_page
   Tony Luck:
	Rewrote commit message
	Added some powerpc folks to Cc: list
---
 include/linux/highmem.h | 19 +++++++++++++++++++
 mm/memory.c             | 28 +++++++++++++++++++---------
 2 files changed, 38 insertions(+), 9 deletions(-)