Message ID | 20221019170835.155381-1-tony.luck@intel.com (mailing list archive)
---|---
State | New
Series | [v2] mm, hwpoison: Try to recover from copy-on write faults
Tony Luck wrote: > If the kernel is copying a page as the result of a copy-on-write > fault and runs into an uncorrectable error, Linux will crash because > it does not have recovery code for this case where poison is consumed > by the kernel. > > It is easy to set up a test case. Just inject an error into a private > page, fork(2), and have the child process write to the page. > > I wrapped that neatly into a test at: > > git://git.kernel.org/pub/scm/linux/kernel/git/aegl/ras-tools.git > > just enable ACPI error injection and run: > > # ./einj_mem-uc -f copy-on-write > > Add a new copy_user_highpage_mc() function that uses copy_mc_to_kernel() > on architectures where that is available (currently x86 and powerpc). > When an error is detected during the page copy, return VM_FAULT_HWPOISON > to caller of wp_page_copy(). This propagates up the call stack. Both x86 > and powerpc have code in their fault handler to deal with this code by > sending a SIGBUS to the application. > > Note that this patch avoids a system crash and signals the process that > triggered the copy-on-write action. It does not take any action for the > memory error that is still in the shared page. To handle that a call to > memory_failure() is needed. But this cannot be done from wp_page_copy() > because it holds mmap_lock(). Perhaps the architecture fault handlers > can deal with this loose end in a subsequent patch? > > On Intel/x86 this loose end will often be handled automatically because > the memory controller provides an additional notification of the h/w > poison in memory, the handler for this will call memory_failure(). This > isn't a 100% solution. If there are multiple errors, not all may be > logged in this way. > > Signed-off-by: Tony Luck <tony.luck@intel.com> Just some minor comments below, but you can add: Reviewed-by: Dan Williams <dan.j.williams@intel.com> > > --- > Changes in V2: > Naoya Horiguchi: > 1) Use -EHWPOISON error code instead of minus one. 
> 2) Poison path needs also to deal with old_page > Tony Luck: > Rewrote commit message > Added some powerpc folks to Cc: list > --- > include/linux/highmem.h | 19 +++++++++++++++++++ > mm/memory.c | 28 +++++++++++++++++++--------- > 2 files changed, 38 insertions(+), 9 deletions(-) > > diff --git a/include/linux/highmem.h b/include/linux/highmem.h > index e9912da5441b..5967541fbf0e 100644 > --- a/include/linux/highmem.h > +++ b/include/linux/highmem.h > @@ -319,6 +319,25 @@ static inline void copy_user_highpage(struct page *to, struct page *from, > > #endif > > +static inline int copy_user_highpage_mc(struct page *to, struct page *from, > + unsigned long vaddr, struct vm_area_struct *vma) > +{ > + unsigned long ret = 0; > +#ifdef copy_mc_to_kernel > + char *vfrom, *vto; > + > + vfrom = kmap_local_page(from); > + vto = kmap_local_page(to); > + ret = copy_mc_to_kernel(vto, vfrom, PAGE_SIZE); > + kunmap_local(vto); > + kunmap_local(vfrom); > +#else > + copy_user_highpage(to, from, vaddr, vma); > +#endif > + > + return ret; > +} > + There is likely some small benefit of doing this the idiomatic way and let grep see that there are multiple definitions of copy_user_highpage_mc() with an organization like: #ifdef copy_mc_to_kernel static inline int copy_user_highpage_mc(struct page *to, struct page *from, unsigned long vaddr, struct vm_area_struct *vma) { unsigned long ret = 0; char *vfrom, *vto; vfrom = kmap_local_page(from); vto = kmap_local_page(to); ret = copy_mc_to_kernel(vto, vfrom, PAGE_SIZE); kunmap_local(vto); kunmap_local(vfrom); return ret; } #else static inline int copy_user_highpage_mc(struct page *to, struct page *from, unsigned long vaddr, struct vm_area_struct *vma) { copy_user_highpage(to, from, vaddr, vma); return 0; } #endif Per the copy_mc* discussion with Linus I would have called this function copy_mc_to_user_highpage() to clarify that hwpoison is handled from the source buffer of the copy. 
> #ifndef __HAVE_ARCH_COPY_HIGHPAGE > > static inline void copy_highpage(struct page *to, struct page *from) > diff --git a/mm/memory.c b/mm/memory.c > index f88c351aecd4..a32556c9b689 100644 > --- a/mm/memory.c > +++ b/mm/memory.c > @@ -2848,8 +2848,14 @@ static inline int pte_unmap_same(struct vm_fault *vmf) > return same; > } > > -static inline bool __wp_page_copy_user(struct page *dst, struct page *src, > - struct vm_fault *vmf) > +/* > + * Return: > + * -EHWPOISON: copy failed due to hwpoison in source page > + * 0: copied failed (some other reason) > + * 1: copied succeeded > + */ > +static inline int __wp_page_copy_user(struct page *dst, struct page *src, > + struct vm_fault *vmf) > { > bool ret; > void *kaddr; > @@ -2860,8 +2866,9 @@ static inline bool __wp_page_copy_user(struct page *dst, struct page *src, > unsigned long addr = vmf->address; > > if (likely(src)) { > - copy_user_highpage(dst, src, addr, vma); > - return true; > + if (copy_user_highpage_mc(dst, src, addr, vma)) > + return -EHWPOISON; Given there is no use case for the residue value returned by copy_mc_to_kernel() perhaps just return EHWPOISON directly from copyuser_highpage_mc() in the short-copy case? > + return 1; > } > > /* > @@ -2888,7 +2895,7 @@ static inline bool __wp_page_copy_user(struct page *dst, struct page *src, > * and update local tlb only > */ > update_mmu_tlb(vma, addr, vmf->pte); > - ret = false; > + ret = 0; What do you think about just making these 'false' cases also return a negative errno? (rationale below...) 
> goto pte_unlock; > } > > @@ -2913,7 +2920,7 @@ static inline bool __wp_page_copy_user(struct page *dst, struct page *src, > if (!likely(pte_same(*vmf->pte, vmf->orig_pte))) { > /* The PTE changed under us, update local tlb */ > update_mmu_tlb(vma, addr, vmf->pte); > - ret = false; > + ret = 0; > goto pte_unlock; > } > > @@ -2932,7 +2939,7 @@ static inline bool __wp_page_copy_user(struct page *dst, struct page *src, > } > } > > - ret = true; > + ret = 1; > > pte_unlock: > if (locked) > @@ -3104,6 +3111,7 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf) > pte_t entry; > int page_copied = 0; > struct mmu_notifier_range range; > + int ret; > > delayacct_wpcopy_start(); > > @@ -3121,19 +3129,21 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf) > if (!new_page) > goto oom; > > - if (!__wp_page_copy_user(new_page, old_page, vmf)) { > + ret = __wp_page_copy_user(new_page, old_page, vmf); > + if (ret <= 0) { ...this would become a typical '0 == success' and 'negative errno == failure', where all but EHWPOISON are retried. > /* > * COW failed, if the fault was solved by other, > * it's fine. If not, userspace would re-fault on > * the same address and we will handle the fault > * from the second attempt. > + * The -EHWPOISON case will not be retried. > */ > put_page(new_page); > if (old_page) > put_page(old_page); > > delayacct_wpcopy_end(); > - return 0; > + return ret == -EHWPOISON ? VM_FAULT_HWPOISON : 0;
> Given there is no use case for the residue value returned by
> copy_mc_to_kernel() perhaps just return EHWPOISON directly from
> copyuser_highpage_mc() in the short-copy case?

I don't think it hurts to keep the return value as a residue count. It
isn't making that code any more complex and could be useful someday.

Other feedback looks good, and I have applied it ready for the next version.

Thanks for the review.

-Tony
在 2022/10/20 AM1:08, Tony Luck 写道: > If the kernel is copying a page as the result of a copy-on-write > fault and runs into an uncorrectable error, Linux will crash because > it does not have recovery code for this case where poison is consumed > by the kernel. > > It is easy to set up a test case. Just inject an error into a private > page, fork(2), and have the child process write to the page. > > I wrapped that neatly into a test at: > > git://git.kernel.org/pub/scm/linux/kernel/git/aegl/ras-tools.git > > just enable ACPI error injection and run: > > # ./einj_mem-uc -f copy-on-write > > Add a new copy_user_highpage_mc() function that uses copy_mc_to_kernel() > on architectures where that is available (currently x86 and powerpc). > When an error is detected during the page copy, return VM_FAULT_HWPOISON > to caller of wp_page_copy(). This propagates up the call stack. Both x86 > and powerpc have code in their fault handler to deal with this code by > sending a SIGBUS to the application. Does it send SIGBUS to only child process or both parent and child process? > > Note that this patch avoids a system crash and signals the process that > triggered the copy-on-write action. It does not take any action for the > memory error that is still in the shared page. To handle that a call to > memory_failure() is needed. If the error page is not poisoned, should the return value of wp_page_copy be VM_FAULT_HWPOISON or VM_FAULT_SIGBUS? When is_hwpoison_entry(entry) or PageHWPoison(page) is true, do_swap_page return VM_FAULT_HWPOISON to caller. And when is_swapin_error_entry is true, do_swap_page return VM_FAULT_SIGBUS. Thanks. Best Regards, Shuai > But this cannot be done from wp_page_copy() > because it holds mmap_lock(). Perhaps the architecture fault handlers > can deal with this loose end in a subsequent patch? 
> > On Intel/x86 this loose end will often be handled automatically because > the memory controller provides an additional notification of the h/w > poison in memory, the handler for this will call memory_failure(). This > isn't a 100% solution. If there are multiple errors, not all may be > logged in this way. > > Signed-off-by: Tony Luck <tony.luck@intel.com> > > --- > Changes in V2: > Naoya Horiguchi: > 1) Use -EHWPOISON error code instead of minus one. > 2) Poison path needs also to deal with old_page > Tony Luck: > Rewrote commit message > Added some powerpc folks to Cc: list > --- > include/linux/highmem.h | 19 +++++++++++++++++++ > mm/memory.c | 28 +++++++++++++++++++--------- > 2 files changed, 38 insertions(+), 9 deletions(-) > > diff --git a/include/linux/highmem.h b/include/linux/highmem.h > index e9912da5441b..5967541fbf0e 100644 > --- a/include/linux/highmem.h > +++ b/include/linux/highmem.h > @@ -319,6 +319,25 @@ static inline void copy_user_highpage(struct page *to, struct page *from, > > #endif > > +static inline int copy_user_highpage_mc(struct page *to, struct page *from, > + unsigned long vaddr, struct vm_area_struct *vma) > +{ > + unsigned long ret = 0; > +#ifdef copy_mc_to_kernel > + char *vfrom, *vto; > + > + vfrom = kmap_local_page(from); > + vto = kmap_local_page(to); > + ret = copy_mc_to_kernel(vto, vfrom, PAGE_SIZE); > + kunmap_local(vto); > + kunmap_local(vfrom); > +#else > + copy_user_highpage(to, from, vaddr, vma); > +#endif > + > + return ret; > +} > + > #ifndef __HAVE_ARCH_COPY_HIGHPAGE > > static inline void copy_highpage(struct page *to, struct page *from) > diff --git a/mm/memory.c b/mm/memory.c > index f88c351aecd4..a32556c9b689 100644 > --- a/mm/memory.c > +++ b/mm/memory.c > @@ -2848,8 +2848,14 @@ static inline int pte_unmap_same(struct vm_fault *vmf) > return same; > } > > -static inline bool __wp_page_copy_user(struct page *dst, struct page *src, > - struct vm_fault *vmf) > +/* > + * Return: > + * -EHWPOISON: copy failed due 
to hwpoison in source page > + * 0: copied failed (some other reason) > + * 1: copied succeeded > + */ > +static inline int __wp_page_copy_user(struct page *dst, struct page *src, > + struct vm_fault *vmf) > { > bool ret; > void *kaddr; > @@ -2860,8 +2866,9 @@ static inline bool __wp_page_copy_user(struct page *dst, struct page *src, > unsigned long addr = vmf->address; > > if (likely(src)) { > - copy_user_highpage(dst, src, addr, vma); > - return true; > + if (copy_user_highpage_mc(dst, src, addr, vma)) > + return -EHWPOISON; > + return 1; > } > > /* > @@ -2888,7 +2895,7 @@ static inline bool __wp_page_copy_user(struct page *dst, struct page *src, > * and update local tlb only > */ > update_mmu_tlb(vma, addr, vmf->pte); > - ret = false; > + ret = 0; > goto pte_unlock; > } > > @@ -2913,7 +2920,7 @@ static inline bool __wp_page_copy_user(struct page *dst, struct page *src, > if (!likely(pte_same(*vmf->pte, vmf->orig_pte))) { > /* The PTE changed under us, update local tlb */ > update_mmu_tlb(vma, addr, vmf->pte); > - ret = false; > + ret = 0; > goto pte_unlock; > } > > @@ -2932,7 +2939,7 @@ static inline bool __wp_page_copy_user(struct page *dst, struct page *src, > } > } > > - ret = true; > + ret = 1; > > pte_unlock: > if (locked) > @@ -3104,6 +3111,7 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf) > pte_t entry; > int page_copied = 0; > struct mmu_notifier_range range; > + int ret; > > delayacct_wpcopy_start(); > > @@ -3121,19 +3129,21 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf) > if (!new_page) > goto oom; > > - if (!__wp_page_copy_user(new_page, old_page, vmf)) { > + ret = __wp_page_copy_user(new_page, old_page, vmf); > + if (ret <= 0) { > /* > * COW failed, if the fault was solved by other, > * it's fine. If not, userspace would re-fault on > * the same address and we will handle the fault > * from the second attempt. > + * The -EHWPOISON case will not be retried. 
> */ > put_page(new_page); > if (old_page) > put_page(old_page); > > delayacct_wpcopy_end(); > - return 0; > + return ret == -EHWPOISON ? VM_FAULT_HWPOISON : 0; > } > kmsan_copy_page_meta(new_page, old_page); > }
On Thu, Oct 20, 2022 at 09:57:04AM +0800, Shuai Xue wrote: > > > 在 2022/10/20 AM1:08, Tony Luck 写道: > > If the kernel is copying a page as the result of a copy-on-write > > fault and runs into an uncorrectable error, Linux will crash because > > it does not have recovery code for this case where poison is consumed > > by the kernel. > > > > It is easy to set up a test case. Just inject an error into a private > > page, fork(2), and have the child process write to the page. > > > > I wrapped that neatly into a test at: > > > > git://git.kernel.org/pub/scm/linux/kernel/git/aegl/ras-tools.git > > > > just enable ACPI error injection and run: > > > > # ./einj_mem-uc -f copy-on-write > > > > Add a new copy_user_highpage_mc() function that uses copy_mc_to_kernel() > > on architectures where that is available (currently x86 and powerpc). > > When an error is detected during the page copy, return VM_FAULT_HWPOISON > > to caller of wp_page_copy(). This propagates up the call stack. Both x86 > > and powerpc have code in their fault handler to deal with this code by > > sending a SIGBUS to the application. > > Does it send SIGBUS to only child process or both parent and child process? This only sends a SIGBUS to the process that wrote the page (typically the child, but also possible that the parent is the one that does the write that causes the COW). > > > > Note that this patch avoids a system crash and signals the process that > > triggered the copy-on-write action. It does not take any action for the > > memory error that is still in the shared page. To handle that a call to > > memory_failure() is needed. > > If the error page is not poisoned, should the return value of wp_page_copy > be VM_FAULT_HWPOISON or VM_FAULT_SIGBUS? When is_hwpoison_entry(entry) or > PageHWPoison(page) is true, do_swap_page return VM_FAULT_HWPOISON to caller. > And when is_swapin_error_entry is true, do_swap_page return VM_FAULT_SIGBUS. 
The page has uncorrected data in it, but this patch doesn't mark it
as poisoned. Returning VM_FAULT_SIGBUS would send an "ordinary" SIGBUS
that doesn't include the BUS_MCEERR_AR and "lsb" information. It would
also skip the:

  "MCE: Killing %s:%d due to hardware memory corruption fault at %lx\n"

console message. So it might result in confusion and attempts to debug a
s/w problem with the application instead of blaming the death on a bad
DIMM.

> > But this cannot be done from wp_page_copy()
> > because it holds mmap_lock(). Perhaps the architecture fault handlers
> > can deal with this loose end in a subsequent patch?

I started looking at this for x86 ... but I have changed my mind
about this being a good place for a fix. When control returns back
to the architecture fault handler it no longer has easy access to
the physical page frame number. It has the virtual address, so it
could descend back into some new mm/memory.c function to get the
physical address ... but that seems silly.

I'm experimenting with using schedule_work() to handle the call to
memory_failure() (echoing what the machine check handler does using
task_work_add() to avoid the same problem of not being able to directly
call memory_failure()).

So far it seems to be working. Patch below (goes on top of the original
patch ... well, on top of the internal version with mods based on
feedback from Dan Williams ... but it should show the general idea).

With this patch applied the page does get unmapped from all users.
Other tasks that shared the page will get a SIGBUS if they attempt
to access it later (from the page fault handler because of
is_hwpoison_entry(), as you mention above).

-Tony

From d3879e83bf91cd6c61e12d32d3e15eb6ef069204 Mon Sep 17 00:00:00 2001
From: Tony Luck <tony.luck@intel.com>
Date: Thu, 20 Oct 2022 09:57:28 -0700
Subject: [PATCH] mm, hwpoison: Call memory_failure() for source page of COW
 failure

Cannot call memory_failure() directly from the fault handler because
mmap_lock (and others) are held.

It is important, but not urgent, to mark the source page as h/w poisoned
and unmap it from other tasks.

Use schedule_work() to queue a request to call memory_failure() for the
page with the error.

Signed-off-by: Tony Luck <tony.luck@intel.com>
---
 mm/memory.c | 35 ++++++++++++++++++++++++++++++++++-
 1 file changed, 34 insertions(+), 1 deletion(-)

diff --git a/mm/memory.c b/mm/memory.c
index b6056eef2f72..4a1304cf1f4e 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2848,6 +2848,37 @@ static inline int pte_unmap_same(struct vm_fault *vmf)
 	return same;
 }
 
+#ifdef CONFIG_MEMORY_FAILURE
+struct pfn_work {
+	struct work_struct work;
+	unsigned long pfn;
+};
+
+static void do_sched_memory_failure(struct work_struct *w)
+{
+	struct pfn_work *p = container_of(w, struct pfn_work, work);
+
+	memory_failure(p->pfn, 0);
+	kfree(p);
+}
+
+static void sched_memory_failure(unsigned long pfn)
+{
+	struct pfn_work *p;
+
+	p = kmalloc(sizeof *p, GFP_KERNEL);
+	if (!p)
+		return;
+	INIT_WORK(&p->work, do_sched_memory_failure);
+	p->pfn = pfn;
+	schedule_work(&p->work);
+}
+#else
+static void sched_memory_failure(unsigned long pfn)
+{
+}
+#endif
+
 /*
  * Return:
  *	0:		copied succeeded
@@ -2866,8 +2897,10 @@ static inline int __wp_page_copy_user(struct page *dst, struct page *src,
 	unsigned long addr = vmf->address;
 
 	if (likely(src)) {
-		if (copy_mc_user_highpage(dst, src, addr, vma))
+		if (copy_mc_user_highpage(dst, src, addr, vma)) {
+			sched_memory_failure(page_to_pfn(src));
 			return -EHWPOISON;
+		}
 		return 0;
 	}
On 2022/10/21 4:05, Tony Luck wrote: > On Thu, Oct 20, 2022 at 09:57:04AM +0800, Shuai Xue wrote: >> >> >> 在 2022/10/20 AM1:08, Tony Luck 写道: >>> If the kernel is copying a page as the result of a copy-on-write >>> fault and runs into an uncorrectable error, Linux will crash because >>> it does not have recovery code for this case where poison is consumed >>> by the kernel. >>> >>> It is easy to set up a test case. Just inject an error into a private >>> page, fork(2), and have the child process write to the page. >>> >>> I wrapped that neatly into a test at: >>> >>> git://git.kernel.org/pub/scm/linux/kernel/git/aegl/ras-tools.git >>> >>> just enable ACPI error injection and run: >>> >>> # ./einj_mem-uc -f copy-on-write >>> >>> Add a new copy_user_highpage_mc() function that uses copy_mc_to_kernel() >>> on architectures where that is available (currently x86 and powerpc). >>> When an error is detected during the page copy, return VM_FAULT_HWPOISON >>> to caller of wp_page_copy(). This propagates up the call stack. Both x86 >>> and powerpc have code in their fault handler to deal with this code by >>> sending a SIGBUS to the application. >> >> Does it send SIGBUS to only child process or both parent and child process? > > This only sends a SIGBUS to the process that wrote the page (typically > the child, but also possible that the parent is the one that does the > write that causes the COW). > >>> >>> Note that this patch avoids a system crash and signals the process that >>> triggered the copy-on-write action. It does not take any action for the >>> memory error that is still in the shared page. To handle that a call to >>> memory_failure() is needed. >> >> If the error page is not poisoned, should the return value of wp_page_copy >> be VM_FAULT_HWPOISON or VM_FAULT_SIGBUS? When is_hwpoison_entry(entry) or >> PageHWPoison(page) is true, do_swap_page return VM_FAULT_HWPOISON to caller. >> And when is_swapin_error_entry is true, do_swap_page return VM_FAULT_SIGBUS. 
> > The page has uncorrected data in it, but this patch doesn't mark it > as poisoned. Returning VM_FAULT_SIGBUS would send an "ordinary" SIGBUS > that doesn't include the BUS_MCEERR_AR and "lsb" information. It would > also skip the: > > "MCE: Killing %s:%d due to hardware memory corruption fault at %lx\n" > > console message. So might result in confusion and attepmts to debug a > s/w problem with the application instead of blaming the death on a bad > DIMM. > >>> But this cannot be done from wp_page_copy() >>> because it holds mmap_lock(). Perhaps the architecture fault handlers >>> can deal with this loose end in a subsequent patch? > > I started looking at this for x86 ... but I have changed my mind > about this being a good place for a fix. When control returns back > to the architecture fault handler it no longer has easy access to > the physical page frame number. It has the virtual address, so it > could descend back into somee new mm/memory.c function to get the > physical address ... but that seems silly. > > I'm experimenting with using sched_work() to handle the call to > memory_failure() (echoing what the machine check handler does using > task_work)_add() to avoid the same problem of not being able to directly > call memory_failure()). > > So far it seems to be working. Patch below (goes on top of original > patch ... well on top of the internal version with mods based on > feedback from Dan Williams ... but should show the general idea) > > With this patch applied the page does get unmapped from all users. > Other tasks that shared the page will get a SIGBUS if they attempt > to access it later (from the page fault handler because of > is_hwpoison_entry() as you mention above. 
> > -Tony > >>From d3879e83bf91cd6c61e12d32d3e15eb6ef069204 Mon Sep 17 00:00:00 2001 > From: Tony Luck <tony.luck@intel.com> > Date: Thu, 20 Oct 2022 09:57:28 -0700 > Subject: [PATCH] mm, hwpoison: Call memory_failure() for source page of COW > failure > > Cannot call memory_failure() directly from the fault handler because > mmap_lock (and others) are held. > > It is important, but not urgent, to mark the source page as h/w poisoned > and unmap it from other tasks. > > Use schedule_work() to queue a request to call memory_failure() for the > page with the error. > > Signed-off-by: Tony Luck <tony.luck@intel.com> > --- > mm/memory.c | 35 ++++++++++++++++++++++++++++++++++- > 1 file changed, 34 insertions(+), 1 deletion(-) > > diff --git a/mm/memory.c b/mm/memory.c > index b6056eef2f72..4a1304cf1f4e 100644 > --- a/mm/memory.c > +++ b/mm/memory.c > @@ -2848,6 +2848,37 @@ static inline int pte_unmap_same(struct vm_fault *vmf) > return same; > } > > +#ifdef CONFIG_MEMORY_FAILURE > +struct pfn_work { > + struct work_struct work; > + unsigned long pfn; > +}; > + > +static void do_sched_memory_failure(struct work_struct *w) > +{ > + struct pfn_work *p = container_of(w, struct pfn_work, work); > + > + memory_failure(p->pfn, 0); > + kfree(p); > +} > + > +static void sched_memory_failure(unsigned long pfn) > +{ > + struct pfn_work *p; > + > + p = kmalloc(sizeof *p, GFP_KERNEL); > + if (!p) > + return; > + INIT_WORK(&p->work, do_sched_memory_failure); > + p->pfn = pfn; > + schedule_work(&p->work); There is already memory_failure_queue() that can do this. Can we use it directly? 
Thanks, Miaohe Lin > +} > +#else > +static void sched_memory_failure(unsigned long pfn) > +{ > +} > +#endif > + > /* > * Return: > * 0: copied succeeded > @@ -2866,8 +2897,10 @@ static inline int __wp_page_copy_user(struct page *dst, struct page *src, > unsigned long addr = vmf->address; > > if (likely(src)) { > - if (copy_mc_user_highpage(dst, src, addr, vma)) > + if (copy_mc_user_highpage(dst, src, addr, vma)) { > + sched_memory_failure(page_to_pfn(src)); > return -EHWPOISON; > + } > return 0; > } > >
在 2022/10/21 AM4:05, Tony Luck 写道: > On Thu, Oct 20, 2022 at 09:57:04AM +0800, Shuai Xue wrote: >> >> >> 在 2022/10/20 AM1:08, Tony Luck 写道: >>> If the kernel is copying a page as the result of a copy-on-write >>> fault and runs into an uncorrectable error, Linux will crash because >>> it does not have recovery code for this case where poison is consumed >>> by the kernel. >>> >>> It is easy to set up a test case. Just inject an error into a private >>> page, fork(2), and have the child process write to the page. >>> >>> I wrapped that neatly into a test at: >>> >>> git://git.kernel.org/pub/scm/linux/kernel/git/aegl/ras-tools.git >>> >>> just enable ACPI error injection and run: >>> >>> # ./einj_mem-uc -f copy-on-write >>> >>> Add a new copy_user_highpage_mc() function that uses copy_mc_to_kernel() >>> on architectures where that is available (currently x86 and powerpc). >>> When an error is detected during the page copy, return VM_FAULT_HWPOISON >>> to caller of wp_page_copy(). This propagates up the call stack. Both x86 >>> and powerpc have code in their fault handler to deal with this code by >>> sending a SIGBUS to the application. >> >> Does it send SIGBUS to only child process or both parent and child process? > > This only sends a SIGBUS to the process that wrote the page (typically > the child, but also possible that the parent is the one that does the > write that causes the COW). Thanks for your explanation. > >>> >>> Note that this patch avoids a system crash and signals the process that >>> triggered the copy-on-write action. It does not take any action for the >>> memory error that is still in the shared page. To handle that a call to >>> memory_failure() is needed. >> >> If the error page is not poisoned, should the return value of wp_page_copy >> be VM_FAULT_HWPOISON or VM_FAULT_SIGBUS? When is_hwpoison_entry(entry) or >> PageHWPoison(page) is true, do_swap_page return VM_FAULT_HWPOISON to caller. 
>> And when is_swapin_error_entry is true, do_swap_page return VM_FAULT_SIGBUS. > > The page has uncorrected data in it, but this patch doesn't mark it > as poisoned. Returning VM_FAULT_SIGBUS would send an "ordinary" SIGBUS > that doesn't include the BUS_MCEERR_AR and "lsb" information. It would > also skip the: > > "MCE: Killing %s:%d due to hardware memory corruption fault at %lx\n" > > console message. So might result in confusion and attepmts to debug a > s/w problem with the application instead of blaming the death on a bad > DIMM. I see your point. Thank you. > >>> But this cannot be done from wp_page_copy() >>> because it holds mmap_lock(). Perhaps the architecture fault handlers >>> can deal with this loose end in a subsequent patch? > > I started looking at this for x86 ... but I have changed my mind > about this being a good place for a fix. When control returns back > to the architecture fault handler it no longer has easy access to > the physical page frame number. It has the virtual address, so it > could descend back into somee new mm/memory.c function to get the > physical address ... but that seems silly. > > I'm experimenting with using sched_work() to handle the call to > memory_failure() (echoing what the machine check handler does using > task_work)_add() to avoid the same problem of not being able to directly > call memory_failure()). Work queues permit work to be deferred outside of the interrupt context into the kernel process context. If we return to user-space before the queued memory_failure() work is processed, we will take the fault again, as we discussed recently. commit 7f17b4a121d0d ACPI: APEI: Kick the memory_failure() queue for synchronous errors commit 415fed694fe11 ACPI: APEI: do not add task_work to kernel thread to avoid memory leak So, in my opinion, we should add memory failure as a task work, like do_machine_check does, e.g. queue_task_work(&m, msg, kill_me_maybe); > > So far it seems to be working. 
Patch below (goes on top of original > patch ... well on top of the internal version with mods based on > feedback from Dan Williams ... but should show the general idea) > > With this patch applied the page does get unmapped from all users. > Other tasks that shared the page will get a SIGBUS if they attempt > to access it later (from the page fault handler because of > is_hwpoison_entry() as you mention above. > > -Tony > > From d3879e83bf91cd6c61e12d32d3e15eb6ef069204 Mon Sep 17 00:00:00 2001 > From: Tony Luck <tony.luck@intel.com> > Date: Thu, 20 Oct 2022 09:57:28 -0700 > Subject: [PATCH] mm, hwpoison: Call memory_failure() for source page of COW > failure > > Cannot call memory_failure() directly from the fault handler because > mmap_lock (and others) are held. > > It is important, but not urgent, to mark the source page as h/w poisoned > and unmap it from other tasks. > > Use schedule_work() to queue a request to call memory_failure() for the > page with the error. > > Signed-off-by: Tony Luck <tony.luck@intel.com> > --- > mm/memory.c | 35 ++++++++++++++++++++++++++++++++++- > 1 file changed, 34 insertions(+), 1 deletion(-) > > diff --git a/mm/memory.c b/mm/memory.c > index b6056eef2f72..4a1304cf1f4e 100644 > --- a/mm/memory.c > +++ b/mm/memory.c > @@ -2848,6 +2848,37 @@ static inline int pte_unmap_same(struct vm_fault *vmf) > return same; > } > > +#ifdef CONFIG_MEMORY_FAILURE > +struct pfn_work { > + struct work_struct work; > + unsigned long pfn; > +}; > + > +static void do_sched_memory_failure(struct work_struct *w) > +{ > + struct pfn_work *p = container_of(w, struct pfn_work, work); > + > + memory_failure(p->pfn, 0); > + kfree(p); > +} > + > +static void sched_memory_failure(unsigned long pfn) > +{ > + struct pfn_work *p; > + > + p = kmalloc(sizeof *p, GFP_KERNEL); > + if (!p) > + return; > + INIT_WORK(&p->work, do_sched_memory_failure); > + p->pfn = pfn; > + schedule_work(&p->work); > +} I think there is already a function to do such work in 
mm/memory-failure.c. void memory_failure_queue(unsigned long pfn, int flags) Best Regards, Shuai > +#else > +static void sched_memory_failure(unsigned long pfn) > +{ > +} > +#endif > + > /* > * Return: > * 0: copied succeeded > @@ -2866,8 +2897,10 @@ static inline int __wp_page_copy_user(struct page *dst, struct page *src, > unsigned long addr = vmf->address; > > if (likely(src)) { > - if (copy_mc_user_highpage(dst, src, addr, vma)) > + if (copy_mc_user_highpage(dst, src, addr, vma)) { > + sched_memory_failure(page_to_pfn(src)); > return -EHWPOISON; > + } > return 0; > } >
>> +	INIT_WORK(&p->work, do_sched_memory_failure);
>> +	p->pfn = pfn;
>> +	schedule_work(&p->work);
>
> There is already memory_failure_queue() that can do this. Can we use it directly?

Miaohe Lin,

Yes, we can use that. A thousand thanks for pointing it out. I just tried
it, and it works perfectly.

I think I'll need to add an empty stub version for the
CONFIG_MEMORY_FAILURE=n build. But that's trivial.

-Tony
On Fri, Oct 21, 2022 at 09:52:01AM +0800, Shuai Xue wrote:
>
>
> On 2022/10/21 4:05 AM, Tony Luck wrote:
> > On Thu, Oct 20, 2022 at 09:57:04AM +0800, Shuai Xue wrote:
> >>
> >>
> >> On 2022/10/20 1:08 AM, Tony Luck wrote:
>
> > I'm experimenting with using schedule_work() to handle the call to
> > memory_failure() (echoing what the machine check handler does using
> > task_work_add() to avoid the same problem of not being able to directly
> > call memory_failure()).
>
> Work queues permit work to be deferred outside of the interrupt context
> into the kernel process context. If we return to user-space before the
> queued memory_failure() work is processed, we will take the fault again,
> as we discussed recently.
>
> commit 7f17b4a121d0d ACPI: APEI: Kick the memory_failure() queue for synchronous errors
> commit 415fed694fe11 ACPI: APEI: do not add task_work to kernel thread to avoid memory leak
>
> So, in my opinion, we should add memory failure as a task work, like
> do_machine_check does, e.g.
>
> 	queue_task_work(&m, msg, kill_me_maybe);

Maybe ... but this case isn't pending back to a user instruction
that is trying to READ the poison memory address. The task is just
trying to WRITE to any address within the page.

So this is much more like a patrol scrub error found asynchronously
by the memory controller (in this case found asynchronously by the
Linux page copy function). So I don't feel that it's really the
responsibility of the current task.

When we do return to user mode the task is going to be busy servicing
a SIGBUS ... so shouldn't try to touch the poison page before the
memory_failure() called by the worker thread cleans things up.

> > +	INIT_WORK(&p->work, do_sched_memory_failure);
> > +	p->pfn = pfn;
> > +	schedule_work(&p->work);
> > +}
>
> I think there is already a function to do such work in mm/memory-failure.c.
>
> 	void memory_failure_queue(unsigned long pfn, int flags)

Also pointed out by Miaohe Lin <linmiaohe@huawei.com> ... this does
exactly what I want, and is working well in tests so far. So perhaps a
cleaner solution than making the kill_me_maybe() function globally
visible.

-Tony
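[Editor's note] For readers following the thread, the shape Tony describes — letting the existing memory_failure_queue() helper defer the cleanup instead of open-coding a work item — might look like the kernel-style sketch below. This is a sketch only, not part of the posted v2 patch; the exact call site and the `0` flags argument are assumptions.

```c
/* Sketch only -- not part of the v2 patch under review. */
static inline int copy_user_highpage_mc(struct page *to, struct page *from,
					unsigned long vaddr,
					struct vm_area_struct *vma)
{
	char *vfrom = kmap_local_page(from);
	char *vto = kmap_local_page(to);
	int ret = 0;

	if (copy_mc_to_kernel(vto, vfrom, PAGE_SIZE)) {
		/*
		 * Defer handling of the poisoned source page:
		 * memory_failure_queue() already schedules a call to
		 * memory_failure(pfn, flags) from process context, so
		 * no new work item or globally visible kill_me_maybe()
		 * is needed.
		 */
		memory_failure_queue(page_to_pfn(from), 0);
		ret = -EHWPOISON;
	}
	kunmap_local(vto);
	kunmap_local(vfrom);
	return ret;
}
```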
From: Tony Luck
> Sent: 21 October 2022 05:08
....
> When we do return to user mode the task is going to be busy servicing
> a SIGBUS ... so shouldn't try to touch the poison page before the
> memory_failure() called by the worker thread cleans things up.

What about an RT process on a busy system?
The worker threads are pretty low priority.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)
>> When we do return to user mode the task is going to be busy servicing
>> a SIGBUS ... so shouldn't try to touch the poison page before the
>> memory_failure() called by the worker thread cleans things up.
>
> What about an RT process on a busy system?
> The worker threads are pretty low priority.

Most tasks don't have a SIGBUS handler ... so they just die without any
possibility of accessing the poison.

If this task DOES have a SIGBUS handler, and that for some bizarre reason
just does a "return" so the task jumps back to the instruction that caused
the COW, then there is a 63/64 likelihood that it is touching a different
cache line from the poisoned one.

In the 1/64 case ... it's probably a simple store (since there was a COW,
we know it was trying to modify the page) ... so it won't generate another
machine check (those only happen for reads).

But maybe it is some RMW instruction ... then, if all the above options
didn't happen ... we could get another machine check from the same
address. But then we just follow the usual recovery path.

-Tony
On 2022/10/21 12:08 PM, Tony Luck wrote:
> On Fri, Oct 21, 2022 at 09:52:01AM +0800, Shuai Xue wrote:
>>
>>
>> On 2022/10/21 4:05 AM, Tony Luck wrote:
>>> On Thu, Oct 20, 2022 at 09:57:04AM +0800, Shuai Xue wrote:
>>>>
>>>>
>>>> On 2022/10/20 1:08 AM, Tony Luck wrote:
>
>>> I'm experimenting with using schedule_work() to handle the call to
>>> memory_failure() (echoing what the machine check handler does using
>>> task_work_add() to avoid the same problem of not being able to directly
>>> call memory_failure()).
>>
>> Work queues permit work to be deferred outside of the interrupt context
>> into the kernel process context. If we return to user-space before the
>> queued memory_failure() work is processed, we will take the fault again,
>> as we discussed recently.
>>
>> commit 7f17b4a121d0d ACPI: APEI: Kick the memory_failure() queue for synchronous errors
>> commit 415fed694fe11 ACPI: APEI: do not add task_work to kernel thread to avoid memory leak
>>
>> So, in my opinion, we should add memory failure as a task work, like
>> do_machine_check does, e.g.
>>
>> 	queue_task_work(&m, msg, kill_me_maybe);
>
> Maybe ... but this case isn't pending back to a user instruction
> that is trying to READ the poison memory address. The task is just
> trying to WRITE to any address within the page.

Aha, I see the difference. Thank you. But I still have a question on
this. Let us discuss it in your reply to David Laight.

Best Regards,
Shuai

>
> So this is much more like a patrol scrub error found asynchronously
> by the memory controller (in this case found asynchronously by the
> Linux page copy function). So I don't feel that it's really the
> responsibility of the current task.
>
> When we do return to user mode the task is going to be busy servicing
> a SIGBUS ... so shouldn't try to touch the poison page before the
> memory_failure() called by the worker thread cleans things up.
>
>>> +	INIT_WORK(&p->work, do_sched_memory_failure);
>>> +	p->pfn = pfn;
>>> +	schedule_work(&p->work);
>>> +}
>>
>> I think there is already a function to do such work in mm/memory-failure.c.
>>
>> 	void memory_failure_queue(unsigned long pfn, int flags)
>
> Also pointed out by Miaohe Lin <linmiaohe@huawei.com> ... this does
> exactly what I want, and is working well in tests so far. So perhaps
> a cleaner solution than making the kill_me_maybe() function globally
> visible.
>
> -Tony
On 2022/10/21 12:41 PM, Luck, Tony wrote:
>>> When we do return to user mode the task is going to be busy servicing
>>> a SIGBUS ... so shouldn't try to touch the poison page before the
>>> memory_failure() called by the worker thread cleans things up.
>>
>> What about an RT process on a busy system?
>> The worker threads are pretty low priority.
>
> Most tasks don't have a SIGBUS handler ... so they just die without any
> possibility of accessing the poison.
>
> If this task DOES have a SIGBUS handler, and that for some bizarre reason
> just does a "return" so the task jumps back to the instruction that caused
> the COW, then there is a 63/64 likelihood that it is touching a different
> cache line from the poisoned one.
>
> In the 1/64 case ... it's probably a simple store (since there was a COW,
> we know it was trying to modify the page) ... so it won't generate another
> machine check (those only happen for reads).
>
> But maybe it is some RMW instruction ... then, if all the above options
> didn't happen ... we could get another machine check from the same
> address. But then we just follow the usual recovery path.
>
> -Tony

Let's assume the instruction that caused the COW is in the 63/64 case,
i.e. it is writing a different cache line from the poisoned one. But the
new_page allocated in COW is dropped, right? So it might page fault
again?

Best Regards,
Shuai
>> But maybe it is some RMW instruction ... then, if all the above options
>> didn't happen ... we could get another machine check from the same
>> address. But then we just follow the usual recovery path.

> Let's assume the instruction that caused the COW is in the 63/64 case,
> i.e. it is writing a different cache line from the poisoned one. But the
> new_page allocated in COW is dropped, right? So it might page fault
> again?

It can, but this should be no surprise to a user that has a signal handler
for a h/w event (SIGBUS, SIGSEGV, SIGILL) that does nothing to address the
problem, but simply returns to re-execute the same instruction that caused
the original trap.

There may be badly written signal handlers that do this. But they just
cause pain for themselves. Linux can keep taking the traps and fixing
things up and sending a new signal over and over.

In this case that loop may involve taking the machine check again, so some
extra pain for the kernel, but recoverable machine checks on Intel/x86
switched from broadcast to delivery to just the logical CPU that tried to
consume the poison a few generations back. So only a bit more painful than
a repeated page fault.

-Tony
On 2022/10/22 12:30 AM, Luck, Tony wrote:
>>> But maybe it is some RMW instruction ... then, if all the above options
>>> didn't happen ... we could get another machine check from the same
>>> address. But then we just follow the usual recovery path.
>
>> Let's assume the instruction that caused the COW is in the 63/64 case,
>> i.e. it is writing a different cache line from the poisoned one. But the
>> new_page allocated in COW is dropped, right? So it might page fault
>> again?
>
> It can, but this should be no surprise to a user that has a signal handler for
> a h/w event (SIGBUS, SIGSEGV, SIGILL) that does nothing to address the
> problem, but simply returns to re-execute the same instruction that caused
> the original trap.
>
> There may be badly written signal handlers that do this. But they just cause
> pain for themselves. Linux can keep taking the traps and fixing things up and
> sending a new signal over and over.
>
> In this case that loop may involve taking the machine check again, so some
> extra pain for the kernel, but recoverable machine checks on Intel/x86 switched
> from broadcast to delivery to just the logical CPU that tried to consume the poison
> a few generations back. So only a bit more painful than a repeated page fault.
>
> -Tony

I see, thanks for your patient explanation :)

Best Regards,
Shuai
diff --git a/include/linux/highmem.h b/include/linux/highmem.h
index e9912da5441b..5967541fbf0e 100644
--- a/include/linux/highmem.h
+++ b/include/linux/highmem.h
@@ -319,6 +319,25 @@ static inline void copy_user_highpage(struct page *to, struct page *from,
 
 #endif
 
+static inline int copy_user_highpage_mc(struct page *to, struct page *from,
+					unsigned long vaddr, struct vm_area_struct *vma)
+{
+	unsigned long ret = 0;
+#ifdef copy_mc_to_kernel
+	char *vfrom, *vto;
+
+	vfrom = kmap_local_page(from);
+	vto = kmap_local_page(to);
+	ret = copy_mc_to_kernel(vto, vfrom, PAGE_SIZE);
+	kunmap_local(vto);
+	kunmap_local(vfrom);
+#else
+	copy_user_highpage(to, from, vaddr, vma);
+#endif
+
+	return ret;
+}
+
 #ifndef __HAVE_ARCH_COPY_HIGHPAGE
 
 static inline void copy_highpage(struct page *to, struct page *from)
diff --git a/mm/memory.c b/mm/memory.c
index f88c351aecd4..a32556c9b689 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2848,8 +2848,14 @@ static inline int pte_unmap_same(struct vm_fault *vmf)
 	return same;
 }
 
-static inline bool __wp_page_copy_user(struct page *dst, struct page *src,
-				       struct vm_fault *vmf)
+/*
+ * Return:
+ *	-EHWPOISON:	copy failed due to hwpoison in source page
+ *	0:		copied failed (some other reason)
+ *	1:		copied succeeded
+ */
+static inline int __wp_page_copy_user(struct page *dst, struct page *src,
+				      struct vm_fault *vmf)
 {
 	bool ret;
 	void *kaddr;
@@ -2860,8 +2866,9 @@ static inline bool __wp_page_copy_user(struct page *dst, struct page *src,
 	unsigned long addr = vmf->address;
 
 	if (likely(src)) {
-		copy_user_highpage(dst, src, addr, vma);
-		return true;
+		if (copy_user_highpage_mc(dst, src, addr, vma))
+			return -EHWPOISON;
+		return 1;
 	}
 
 	/*
@@ -2888,7 +2895,7 @@ static inline bool __wp_page_copy_user(struct page *dst, struct page *src,
 		 * and update local tlb only
 		 */
 		update_mmu_tlb(vma, addr, vmf->pte);
-		ret = false;
+		ret = 0;
 		goto pte_unlock;
 	}
 
@@ -2913,7 +2920,7 @@ static inline bool __wp_page_copy_user(struct page *dst, struct page *src,
 		if (!likely(pte_same(*vmf->pte, vmf->orig_pte))) {
 			/* The PTE changed under us, update local tlb */
 			update_mmu_tlb(vma, addr, vmf->pte);
-			ret = false;
+			ret = 0;
 			goto pte_unlock;
 		}
 
@@ -2932,7 +2939,7 @@ static inline bool __wp_page_copy_user(struct page *dst, struct page *src,
 		}
 	}
 
-	ret = true;
+	ret = 1;
 
 pte_unlock:
 	if (locked)
@@ -3104,6 +3111,7 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf)
 	pte_t entry;
 	int page_copied = 0;
 	struct mmu_notifier_range range;
+	int ret;
 
 	delayacct_wpcopy_start();
 
@@ -3121,19 +3129,21 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf)
 	if (!new_page)
 		goto oom;
 
-	if (!__wp_page_copy_user(new_page, old_page, vmf)) {
+	ret = __wp_page_copy_user(new_page, old_page, vmf);
+	if (ret <= 0) {
 		/*
 		 * COW failed, if the fault was solved by other,
 		 * it's fine. If not, userspace would re-fault on
 		 * the same address and we will handle the fault
 		 * from the second attempt.
+		 * The -EHWPOISON case will not be retried.
 		 */
 		put_page(new_page);
 		if (old_page)
 			put_page(old_page);
 
 		delayacct_wpcopy_end();
-		return 0;
+		return ret == -EHWPOISON ? VM_FAULT_HWPOISON : 0;
 	}
 	kmsan_copy_page_meta(new_page, old_page);
 }
If the kernel is copying a page as the result of a copy-on-write
fault and runs into an uncorrectable error, Linux will crash because
it does not have recovery code for this case where poison is consumed
by the kernel.

It is easy to set up a test case. Just inject an error into a private
page, fork(2), and have the child process write to the page.

I wrapped that neatly into a test at:

  git://git.kernel.org/pub/scm/linux/kernel/git/aegl/ras-tools.git

just enable ACPI error injection and run:

  # ./einj_mem-uc -f copy-on-write

Add a new copy_user_highpage_mc() function that uses copy_mc_to_kernel()
on architectures where that is available (currently x86 and powerpc).
When an error is detected during the page copy, return VM_FAULT_HWPOISON
to the caller of wp_page_copy(). This propagates up the call stack. Both
x86 and powerpc have code in their fault handlers to deal with this code
by sending a SIGBUS to the application.

Note that this patch avoids a system crash and signals the process that
triggered the copy-on-write action. It does not take any action for the
memory error that is still in the shared page. To handle that a call to
memory_failure() is needed. But this cannot be done from wp_page_copy()
because it holds mmap_lock(). Perhaps the architecture fault handlers
can deal with this loose end in a subsequent patch?

On Intel/x86 this loose end will often be handled automatically because
the memory controller provides an additional notification of the h/w
poison in memory, the handler for this will call memory_failure(). This
isn't a 100% solution. If there are multiple errors, not all may be
logged in this way.

Signed-off-by: Tony Luck <tony.luck@intel.com>

---
Changes in V2:
   Naoya Horiguchi:
	1) Use -EHWPOISON error code instead of minus one.
	2) Poison path needs also to deal with old_page
   Tony Luck:
	Rewrote commit message
	Added some powerpc folks to Cc: list
---
 include/linux/highmem.h | 19 +++++++++++++++++++
 mm/memory.c             | 28 +++++++++++++++++++---------
 2 files changed, 38 insertions(+), 9 deletions(-)