[v7] mm/gup: check page hwpoison status for memory recovery failures.

Message ID 20210406104123.451ee3c3@alex-virtual-machine
State New, archived

Commit Message

yaoaili [么爱利] April 6, 2021, 2:41 a.m. UTC
When we call get_user_pages() to pin a user page in memory, the page
may be hwpoisoned. Currently we only handle the normal case, in which
the memory recovery job has finished correctly: the hwpoison page is
not returned to callers. But in other cases, such as when memory
recovery fails and the related pte of the user process has not been
correctly invalidated, we still return the hwpoison page, and callers
may touch it and cause a panic.

In gup.c, for a normal page, follow_page_mask() returns the related
page pointer; in the other hwpoison case, where the pte is invalid, it
returns NULL, which is then handled in the if (!page) branch. This
patch filters out hwpoison pages in follow_page_mask() and returns an
error code for the recovery-failure cases.

We check the page hwpoison status as early as possible, skip the
normal procedure that would otherwise follow, and try not to grab the
related pages.

Changes since v6:
- Fix wrong page pointer check in follow_trans_huge_pmd();

Signed-off-by: Aili Yao <yaoaili@kingsoft.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: stable@vger.kernel.org
---
 mm/gup.c         | 27 +++++++++++++++++++++++----
 mm/huge_memory.c | 11 ++++++++---
 mm/hugetlb.c     |  8 +++++++-
 mm/internal.h    | 13 +++++++++++++
 4 files changed, 51 insertions(+), 8 deletions(-)
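
For background, hwpoison can be injected from userspace with
madvise(MADV_HWPOISON), which needs CAP_SYS_ADMIN and a kernel built
with CONFIG_MEMORY_FAILURE. The sketch below only exercises the normal,
handled recovery path; reproducing the recovery-failure window this
patch targets generally requires hardware-level error injection, so
treat it as an illustration rather than a test for the patch:

#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	long pagesz = sysconf(_SC_PAGESIZE);
	/* Map one anonymous page and touch it so a physical page backs it. */
	char *p = mmap(NULL, pagesz, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (p == MAP_FAILED) {
		perror("mmap");
		return 1;
	}
	memset(p, 0x5a, pagesz);

	/* Software poison injection: the kernel runs memory_failure() on
	 * the backing page as if a real uncorrected error had hit it. */
	if (madvise(p, pagesz, MADV_HWPOISON) != 0) {
		perror("madvise(MADV_HWPOISON)");
		return 1;
	}

	/* On successful recovery the pte is invalidated and any further
	 * access raises SIGBUS; the failure case this patch handles is
	 * when that invalidation did not happen. */
	printf("poison injected at %p\n", (void *)p);
	return 0;
}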

Comments

HORIGUCHI NAOYA(堀口 直也) April 7, 2021, 1:54 a.m. UTC | #1
On Tue, Apr 06, 2021 at 10:41:23AM +0800, Aili Yao wrote:
> When we call get_user_pages() to pin a user page in memory, the page
> may be hwpoisoned. [...]

Thank you for the work.

Looking through this patch, the internals of follow_page_mask() are
very complicated, so it's not easy to make it hwpoison-aware.
Now I'm unsure that this is the best approach.
What I actually imagined is something like the below (which is totally
untested, and I'm sorry about my previous misleading comments):

diff --git a/mm/gup.c b/mm/gup.c
index e40579624f10..a60a08fc7668 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -1090,6 +1090,11 @@ static long __get_user_pages(struct mm_struct *mm,
 		} else if (IS_ERR(page)) {
 			ret = PTR_ERR(page);
 			goto out;
+		} else if (gup_flags & FOLL_HWPOISON && PageHWPoison(page)) {
+			if (gup_flags & FOLL_GET)
+				put_page(page);
+			ret = -EHWPOISON;
+			goto out;
 		}
 		if (pages) {
 			pages[i] = page;
@@ -1532,7 +1537,7 @@ struct page *get_dump_page(unsigned long addr)
 	if (mmap_read_lock_killable(mm))
 		return NULL;
 	ret = __get_user_pages_locked(mm, addr, 1, &page, NULL, &locked,
-				      FOLL_FORCE | FOLL_DUMP | FOLL_GET);
+				      FOLL_FORCE | FOLL_DUMP | FOLL_GET | FOLL_HWPOISON);
 	if (locked)
 		mmap_read_unlock(mm);
 	return (ret == 1) ? page : NULL;
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index a86a58ef132d..03c3d3225c0d 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -4949,6 +4949,14 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
 			continue;
 		}
 
+		if (flags & FOLL_HWPOISON && PageHWPoison(page)) {
+			vaddr += huge_page_size(h);
+			remainder -= pages_per_huge_page(h);
+			i += pages_per_huge_page(h);
+			spin_unlock(ptl);
+			continue;
+		}
+
 		refs = min3(pages_per_huge_page(h) - pfn_offset,
 			    (vma->vm_end - vaddr) >> PAGE_SHIFT, remainder);
 

We can surely say that this change only affects get_user_pages() callers
with FOLL_HWPOISON set, so this should pinpoint the current problem only.
A side note is that the above change to follow_hugetlb_page() has room
for refactoring to reduce duplicated code, as sketched below.
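
One way that duplication could be factored out is a small helper along
the lines of the sketch below (untested; the helper name is made up
here, not an existing kernel function):

/* Hypothetical helper: advance the walk past one whole huge page. */
static void hugetlb_skip_huge_page(struct hstate *h, unsigned long *vaddr,
				   unsigned long *remainder, long *i)
{
	*vaddr += huge_page_size(h);
	*remainder -= pages_per_huge_page(h);
	*i += pages_per_huge_page(h);
}

Both the existing skip paths in follow_hugetlb_page() and the new
hwpoison check could then share it.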

Could you try to test and complete it?

Thanks,
Naoya Horiguchi
yaoaili [么爱利] April 7, 2021, 7:48 a.m. UTC | #2
On Wed, 7 Apr 2021 01:54:28 +0000
HORIGUCHI NAOYA(堀口 直也) <naoya.horiguchi@nec.com> wrote:

> On Tue, Apr 06, 2021 at 10:41:23AM +0800, Aili Yao wrote:
> > When we call get_user_pages() to pin a user page in memory, the page
> > may be hwpoisoned. [...]
> 
> Thank you for the work.
> 
> Looking through this patch, the internals of follow_page_mask() are
> very complicated, so it's not easy to make it hwpoison-aware.
> Now I'm unsure that this is the best approach.
> What I actually imagined is something like the below (which is totally
> untested, and I'm sorry about my previous misleading comments):
> 
> [...]
> 
> We can surely say that this change only affects get_user_pages() callers
> with FOLL_HWPOISON set, so this should pinpoint the current problem only.
> A side note is that the above change to follow_hugetlb_page() has room
> for refactoring to reduce duplicated code.
> 
> Could you try to test and complete it?

Got it, I will try to complete it and test it.

For the code:

> @@ -4949,6 +4949,14 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
>  			continue;
>  		}
>  
> +		if (flags & FOLL_HWPOISON && PageHWPoison(page)) {
> +			vaddr += huge_page_size(h);
> +			remainder -= pages_per_huge_page(h);
> +			i += pages_per_huge_page(h);
> +			spin_unlock(ptl);
> +			continue;
> +		}
> +

I am wondering whether we still need to continue the loop in
follow_hugetlb_page(). This function seems mainly to prepare the vmas
and grab the hugepages; if we meet one hwpoison hugetlb page, we will
check it after follow_page_mask() returns, then quit the whole loop and
return the number of pages or an error code, so the vmas after the
hwpoison one will not be needed. Wouldn't an early exit, as sketched
below, be enough?
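
For illustration, that early exit might look like the sketch below
(untested; it assumes the function's existing err/remainder convention,
where the final return value is i ? i : err):

		if (flags & FOLL_HWPOISON && PageHWPoison(page)) {
			/* Stop the walk at the first poisoned hugepage
			 * instead of skipping it and continuing. */
			spin_unlock(ptl);
			err = -EHWPOISON;
			remainder = 0;
			break;
		}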
yaoaili [么爱利] May 10, 2021, 3:13 a.m. UTC | #3
On Tue, 6 Apr 2021 10:41:23 +0800
Aili Yao <yaoaili@kingsoft.com> wrote:

> When we call get_user_pages() to pin a user page in memory, the page
> may be hwpoisoned. [...]

After further consideration, I still insist that we should take more
care with the case where the kernel touches a user hwpoison page.

Usually, when a process touches a hwpoison page, it will be killed, and
no other bad consequence will follow. But when the kernel touches a
user process's hwpoison page, that is a more serious issue than the
user process's own access; neither an error return nor a NULL return
seems sufficient or reasonable to me.

There are some doubtful points:
1) Should the user process be killed first?
2) If an error is returned, can the process handle it correctly? It seems there is no way the process can manage that.
3) If NULL is returned and the job continues, will data integrity be guaranteed? I doubt it.

There have been many arguments about the CopyIn patch from Intel over
whether the process should be killed or not, and this gup module seems
more complicated and raises the same question. When an error or NULL is
returned from gup, in some scenarios this will make the hardware issue
even more complicated.

Yes, there is a FOLL_HWPOISON flag, but it seems designed to select
which return values the caller can accept; it is not designed to decide
whether the kernel should panic or not.
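
To illustrate that opt-in semantic, a GUP caller that can accept a
poison report would look roughly like the sketch below (untested; the
surrounding error handling is made up for illustration):

	struct page *page;
	long ret;

	/* FOLL_HWPOISON: the caller declares it can handle -EHWPOISON
	 * instead of a generic -EFAULT for poisoned memory. */
	ret = get_user_pages(addr, 1, FOLL_WRITE | FOLL_HWPOISON, &page, NULL);
	if (ret == -EHWPOISON)
		return -EIO;	/* fail cleanly, never touch the page */
	if (ret != 1)
		return ret < 0 ? ret : -EFAULT;

	/* ... use the page, then drop the reference ... */
	put_page(page);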

Currently I have no scenario other than the coredump case, and I
haven't found a graceful way to fix this leak.

Maybe, for some scenarios, doing kernel recovery on behalf of the user
process is wrong and not deserved.

I have a lot of other work to do and not much time to continue on this
thread, and I stand by my original post.

There should be a better option for this issue; when it is ready, this
patch can and should be reverted.

Thanks!
Aili Yao

Patch

diff --git a/mm/gup.c b/mm/gup.c
index e40579624f10..88a93b89c03e 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -433,6 +433,9 @@  static struct page *follow_page_pte(struct vm_area_struct *vma,
 			page = ERR_PTR(ret);
 			goto out;
 		}
+	} else if (PageHWPoison(page)) {
+		page = ERR_PTR(-EHWPOISON);
+		goto out;
 	}
 
 	if (flags & FOLL_SPLIT && PageTransCompound(page)) {
@@ -540,8 +543,13 @@  static struct page *follow_pmd_mask(struct vm_area_struct *vma,
 		page = follow_huge_pd(vma, address,
 				      __hugepd(pmd_val(pmdval)), flags,
 				      PMD_SHIFT);
-		if (page)
-			return page;
+		if (page) {
+			struct page *p = check_page_hwpoison(page);
+
+			if (p == ERR_PTR(-EHWPOISON) && flags & FOLL_GET)
+				put_page(page);
+			return p;
+		}
 		return no_page_table(vma, flags);
 	}
 retry:
@@ -643,7 +651,7 @@  static struct page *follow_pud_mask(struct vm_area_struct *vma,
 	if (pud_huge(*pud) && is_vm_hugetlb_page(vma)) {
 		page = follow_huge_pud(mm, address, pud, flags);
 		if (page)
-			return page;
+			return check_page_hwpoison(page);
 		return no_page_table(vma, flags);
 	}
 	if (is_hugepd(__hugepd(pud_val(*pud)))) {
@@ -652,6 +660,11 @@  static struct page *follow_pud_mask(struct vm_area_struct *vma,
 				      PUD_SHIFT);
-		if (page)
-			return page;
+		if (page) {
+			struct page *p = check_page_hwpoison(page);
+
+			if (p == ERR_PTR(-EHWPOISON) && flags & FOLL_GET)
+				put_page(page);
+			return p;
+		}
 		return no_page_table(vma, flags);
 	}
 	if (pud_devmap(*pud)) {
@@ -1087,10 +1100,14 @@  static long __get_user_pages(struct mm_struct *mm,
 			 * struct page.
 			 */
 			goto next_page;
-		} else if (IS_ERR(page)) {
+		} else if (PTR_ERR(page) == -EHWPOISON) {
+			ret = (foll_flags & FOLL_HWPOISON) ? -EHWPOISON : -EFAULT;
+			goto out;
+		} else if (IS_ERR(page)) {
 			ret = PTR_ERR(page);
 			goto out;
 		}
+
 		if (pages) {
 			pages[i] = page;
 			flush_anon_page(vma, page, start);
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index ae907a9c2050..56ff2e83b67c 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1349,6 +1349,7 @@  struct page *follow_trans_huge_pmd(struct vm_area_struct *vma,
 {
 	struct mm_struct *mm = vma->vm_mm;
 	struct page *page = NULL;
+	struct page *tail = NULL;
 
 	assert_spin_locked(pmd_lockptr(mm, pmd));
 
@@ -1366,6 +1367,11 @@  struct page *follow_trans_huge_pmd(struct vm_area_struct *vma,
 	page = pmd_page(*pmd);
 	VM_BUG_ON_PAGE(!PageHead(page) && !is_zone_device_page(page), page);
 
+	tail = page + ((addr & ~HPAGE_PMD_MASK) >> PAGE_SHIFT);
+
+	if (PageHWPoison(tail))
+		return ERR_PTR(-EHWPOISON);
+
 	if (!try_grab_page(page, flags))
 		return ERR_PTR(-ENOMEM);
 
@@ -1405,11 +1411,10 @@  struct page *follow_trans_huge_pmd(struct vm_area_struct *vma,
 		unlock_page(page);
 	}
 skip_mlock:
-	page += (addr & ~HPAGE_PMD_MASK) >> PAGE_SHIFT;
-	VM_BUG_ON_PAGE(!PageCompound(page) && !is_zone_device_page(page), page);
+	VM_BUG_ON_PAGE(!PageCompound(tail) && !is_zone_device_page(tail), tail);
 
 out:
-	return page;
+	return tail;
 }
 
 /* NUMA hinting page fault entry point for trans huge pmds */
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index a86a58ef132d..8b50f7eaa159 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -4958,7 +4958,8 @@  long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
 					     likely(pages) ? pages + i : NULL,
 					     vmas ? vmas + i : NULL);
 
-		if (pages) {
+		/* We will filter out the hwpoison page, so don't try to grab it */
+		if (pages && !PageHWPoison(page)) {
 			/*
 			 * try_grab_compound_head() should always succeed here,
 			 * because: a) we hold the ptl lock, and b) we've just
@@ -5581,6 +5582,11 @@  follow_huge_pmd(struct mm_struct *mm, unsigned long address,
 	pte = huge_ptep_get((pte_t *)pmd);
 	if (pte_present(pte)) {
 		page = pmd_page(*pmd) + ((address & ~PMD_MASK) >> PAGE_SHIFT);
+		/* if hwpoison, we don't grab it */
+		if (PageHWPoison(compound_head(page))) {
+			page = ERR_PTR(-EHWPOISON);
+			goto out;
+		}
 		/*
 		 * try_grab_page() should always succeed here, because: a) we
 		 * hold the pmd (ptl) lock, and b) we've just checked that the
diff --git a/mm/internal.h b/mm/internal.h
index 1432feec62df..049b310bc79a 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -97,6 +97,19 @@  static inline void set_page_refcounted(struct page *page)
 	set_page_count(page, 1);
 }
 
+/*
+ * Check the hwpoison status of any page type; if poisoned, return an ERR_PTR.
+ */
+static inline struct page *check_page_hwpoison(struct page *page)
+{
+	if (PageHWPoison(page))
+		return ERR_PTR(-EHWPOISON);
+	else if (PageHuge(page) && PageHWPoison(compound_head(page)))
+		return ERR_PTR(-EHWPOISON);
+
+	return page;
+}
+
 extern unsigned long highest_memmap_pfn;
 
 /*