
[v2,1/5] mm: filemap: check if THP has hwpoisoned subpage for PMD page fault

Message ID 20210923032830.314328-2-shy828301@gmail.com (mailing list archive)
State New
Series Solve silent data loss caused by poisoned page cache (shmem/tmpfs)

Commit Message

Yang Shi Sept. 23, 2021, 3:28 a.m. UTC
When handling a shmem page fault, a THP with a corrupted subpage could be PMD
mapped if certain conditions are satisfied.  But the kernel is supposed to
send SIGBUS when trying to map a hwpoisoned page.

There are two paths which may do PMD map: fault around and regular fault.

Before commit f9ce0be71d1f ("mm: Cleanup faultaround and finish_fault() codepaths")
the fault around path was even worse: the THP could be PMD mapped as long as
the VMA fit, regardless of which subpage was accessed and corrupted.  After
this commit the THP can still be PMD mapped as long as the head page is not
corrupted.

In the regular fault path the THP could be PMD mapped as long as the corrupted
page is not accessed and the VMA fits.

This loophole could be fixed by iterating every subpage to check whether any
of them is hwpoisoned, but that is somewhat costly in the page fault path.

So introduce a new page flag, called HasHWPoisoned, on the first tail page.  It
indicates that the THP has hwpoisoned subpage(s).  It is set if any subpage of
the THP is found hwpoisoned by memory failure and cleared when the THP is freed
or split.

Cc: <stable@vger.kernel.org>
Suggested-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Yang Shi <shy828301@gmail.com>
---
 include/linux/page-flags.h | 19 +++++++++++++++++++
 mm/filemap.c               | 15 +++++++++------
 mm/huge_memory.c           |  2 ++
 mm/memory-failure.c        |  4 ++++
 mm/memory.c                |  9 +++++++++
 mm/page_alloc.c            |  4 +++-
 6 files changed, 46 insertions(+), 7 deletions(-)
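
For illustration only (not part of the patch): the per-subpage scan this flag
avoids in the fault path would look roughly like the loop below, versus the
single flag test do_set_pmd() gains in the hunk further down.

	/* hypothetical fragment in do_set_pmd() context, illustration only */
	int i;

	/* the O(HPAGE_PMD_NR) scan of every subpage this patch avoids */
	for (i = 0; i < HPAGE_PMD_NR; i++)
		if (PageHWPoison(page + i))
			return VM_FAULT_FALLBACK;

	/* with the new flag, a single test on the compound page suffices */
	if (unlikely(PageHasHWPoisoned(page)))
		return VM_FAULT_FALLBACK;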

Comments

Kirill A. Shutemov Sept. 23, 2021, 2:39 p.m. UTC | #1
On Wed, Sep 22, 2021 at 08:28:26PM -0700, Yang Shi wrote:
> When handling shmem page fault the THP with corrupted subpage could be PMD
> mapped if certain conditions are satisfied.  But kernel is supposed to
> send SIGBUS when trying to map hwpoisoned page.
> 
> There are two paths which may do PMD map: fault around and regular fault.
> 
> Before commit f9ce0be71d1f ("mm: Cleanup faultaround and finish_fault() codepaths")
> the thing was even worse in fault around path.  The THP could be PMD mapped as
> long as the VMA fits regardless what subpage is accessed and corrupted.  After
> this commit as long as head page is not corrupted the THP could be PMD mapped.
> 
> In the regulat fault path the THP could be PMD mapped as long as the corrupted

s/regulat/regular/

> page is not accessed and the VMA fits.
> 
> This loophole could be fixed by iterating every subpage to check if any
> of them is hwpoisoned or not, but it is somewhat costly in page fault path.
> 
> So introduce a new page flag called HasHWPoisoned on the first tail page.  It
> indicates the THP has hwpoisoned subpage(s).  It is set if any subpage of THP
> is found hwpoisoned by memory failure and cleared when the THP is freed or
> split.
> 
> Cc: <stable@vger.kernel.org>
> Suggested-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Signed-off-by: Yang Shi <shy828301@gmail.com>
> ---

...

> diff --git a/mm/filemap.c b/mm/filemap.c
> index dae481293b5d..740b7afe159a 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -3195,12 +3195,14 @@ static bool filemap_map_pmd(struct vm_fault *vmf, struct page *page)
>  	}
>  
>  	if (pmd_none(*vmf->pmd) && PageTransHuge(page)) {
> -	    vm_fault_t ret = do_set_pmd(vmf, page);
> -	    if (!ret) {
> -		    /* The page is mapped successfully, reference consumed. */
> -		    unlock_page(page);
> -		    return true;
> -	    }
> +		vm_fault_t ret = do_set_pmd(vmf, page);
> +		if (ret == VM_FAULT_FALLBACK)
> +			goto out;

Hm.. What? I don't get it. Who will establish page table in the pmd then?

> +		if (!ret) {
> +			/* The page is mapped successfully, reference consumed. */
> +			unlock_page(page);
> +			return true;
> +		}
>  	}
>  
>  	if (pmd_none(*vmf->pmd)) {
> @@ -3220,6 +3222,7 @@ static bool filemap_map_pmd(struct vm_fault *vmf, struct page *page)
>  		return true;
>  	}
>  
> +out:
>  	return false;
>  }
>  
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 5e9ef0fc261e..0574b1613714 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -2426,6 +2426,8 @@ static void __split_huge_page(struct page *page, struct list_head *list,
>  	/* lock lru list/PageCompound, ref frozen by page_ref_freeze */
>  	lruvec = lock_page_lruvec(head);
>  
> +	ClearPageHasHWPoisoned(head);
> +

Do we serialize the new flag with lock_page() or what? I mean what
prevents the flag being set again after this point, but before
ClearPageCompound()?
Yang Shi Sept. 23, 2021, 5:15 p.m. UTC | #2
On Thu, Sep 23, 2021 at 7:39 AM Kirill A. Shutemov <kirill@shutemov.name> wrote:
>
> On Wed, Sep 22, 2021 at 08:28:26PM -0700, Yang Shi wrote:
> > When handling shmem page fault the THP with corrupted subpage could be PMD
> > mapped if certain conditions are satisfied.  But kernel is supposed to
> > send SIGBUS when trying to map hwpoisoned page.
> >
> > There are two paths which may do PMD map: fault around and regular fault.
> >
> > Before commit f9ce0be71d1f ("mm: Cleanup faultaround and finish_fault() codepaths")
> > the thing was even worse in fault around path.  The THP could be PMD mapped as
> > long as the VMA fits regardless what subpage is accessed and corrupted.  After
> > this commit as long as head page is not corrupted the THP could be PMD mapped.
> >
> > In the regulat fault path the THP could be PMD mapped as long as the corrupted
>
> s/regulat/regular/
>
> > page is not accessed and the VMA fits.
> >
> > This loophole could be fixed by iterating every subpage to check if any
> > of them is hwpoisoned or not, but it is somewhat costly in page fault path.
> >
> > So introduce a new page flag called HasHWPoisoned on the first tail page.  It
> > indicates the THP has hwpoisoned subpage(s).  It is set if any subpage of THP
> > is found hwpoisoned by memory failure and cleared when the THP is freed or
> > split.
> >
> > Cc: <stable@vger.kernel.org>
> > Suggested-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> > Signed-off-by: Yang Shi <shy828301@gmail.com>
> > ---
>
> ...
>
> > diff --git a/mm/filemap.c b/mm/filemap.c
> > index dae481293b5d..740b7afe159a 100644
> > --- a/mm/filemap.c
> > +++ b/mm/filemap.c
> > @@ -3195,12 +3195,14 @@ static bool filemap_map_pmd(struct vm_fault *vmf, struct page *page)
> >       }
> >
> >       if (pmd_none(*vmf->pmd) && PageTransHuge(page)) {
> > -         vm_fault_t ret = do_set_pmd(vmf, page);
> > -         if (!ret) {
> > -                 /* The page is mapped successfully, reference consumed. */
> > -                 unlock_page(page);
> > -                 return true;
> > -         }
> > +             vm_fault_t ret = do_set_pmd(vmf, page);
> > +             if (ret == VM_FAULT_FALLBACK)
> > +                     goto out;
>
> Hm.. What? I don't get it. Who will establish page table in the pmd then?

Aha, yeah. It should jump to the below PMD populate section. Will fix
it in the next version.
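
Roughly like this (untested sketch; just dropping the goto so that a non-zero
return from do_set_pmd(), e.g. VM_FAULT_FALLBACK, falls through to the
pmd_none() populate branch below):

	if (pmd_none(*vmf->pmd) && PageTransHuge(page)) {
		vm_fault_t ret = do_set_pmd(vmf, page);
		if (!ret) {
			/* The page is mapped successfully, reference consumed. */
			unlock_page(page);
			return true;
		}
		/*
		 * do_set_pmd() refused to PMD map the page (e.g. because a
		 * subpage is hwpoisoned); fall through so a page table is
		 * still installed below and the THP can be PTE mapped.
		 */
	}

	if (pmd_none(*vmf->pmd)) {
		/* existing page table populate section, unchanged */
		...
	}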

>
> > +             if (!ret) {
> > +                     /* The page is mapped successfully, reference consumed. */
> > +                     unlock_page(page);
> > +                     return true;
> > +             }
> >       }
> >
> >       if (pmd_none(*vmf->pmd)) {
> > @@ -3220,6 +3222,7 @@ static bool filemap_map_pmd(struct vm_fault *vmf, struct page *page)
> >               return true;
> >       }
> >
> > +out:
> >       return false;
> >  }
> >
> > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> > index 5e9ef0fc261e..0574b1613714 100644
> > --- a/mm/huge_memory.c
> > +++ b/mm/huge_memory.c
> > @@ -2426,6 +2426,8 @@ static void __split_huge_page(struct page *page, struct list_head *list,
> >       /* lock lru list/PageCompound, ref frozen by page_ref_freeze */
> >       lruvec = lock_page_lruvec(head);
> >
> > +     ClearPageHasHWPoisoned(head);
> > +
>
> Do we serialize the new flag with lock_page() or what? I mean what
> prevents the flag being set again after this point, but before
> ClearPageCompound()?

No, not in this patch. But I think we could use the refcount. THP split
would freeze the refcount and the split is guaranteed to succeed after
that point, so the refcount can be checked in memory failure. The
SetPageHasHWPoisoned() call could be moved to __get_hwpoison_page(),
after get_page_unless_zero() bumps the refcount successfully. If the
refcount is zero it means the THP is under split or being freed; we
don't care about these two cases.

The THP might be mapped before this flag is set, but the process will
be killed later, so it seems fine.
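
Roughly (untested sketch; the exact spot inside __get_hwpoison_page() is an
assumption and would depend on how patch #3 refactors it):

	struct page *head = compound_head(page);

	if (get_page_unless_zero(head)) {
		/*
		 * Holding this reference keeps a concurrent THP split from
		 * freezing the refcount, so the split cannot clear the flag
		 * under us.
		 */
		if (PageTransHuge(head))
			SetPageHasHWPoisoned(head);
		...
	}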

>
> --
>  Kirill A. Shutemov
Yang Shi Sept. 23, 2021, 8:39 p.m. UTC | #3
On Thu, Sep 23, 2021 at 10:15 AM Yang Shi <shy828301@gmail.com> wrote:
>
> On Thu, Sep 23, 2021 at 7:39 AM Kirill A. Shutemov <kirill@shutemov.name> wrote:
> >
> > On Wed, Sep 22, 2021 at 08:28:26PM -0700, Yang Shi wrote:
> > > When handling shmem page fault the THP with corrupted subpage could be PMD
> > > mapped if certain conditions are satisfied.  But kernel is supposed to
> > > send SIGBUS when trying to map hwpoisoned page.
> > >
> > > There are two paths which may do PMD map: fault around and regular fault.
> > >
> > > Before commit f9ce0be71d1f ("mm: Cleanup faultaround and finish_fault() codepaths")
> > > the thing was even worse in fault around path.  The THP could be PMD mapped as
> > > long as the VMA fits regardless what subpage is accessed and corrupted.  After
> > > this commit as long as head page is not corrupted the THP could be PMD mapped.
> > >
> > > In the regulat fault path the THP could be PMD mapped as long as the corrupted
> >
> > s/regulat/regular/
> >
> > > page is not accessed and the VMA fits.
> > >
> > > This loophole could be fixed by iterating every subpage to check if any
> > > of them is hwpoisoned or not, but it is somewhat costly in page fault path.
> > >
> > > So introduce a new page flag called HasHWPoisoned on the first tail page.  It
> > > indicates the THP has hwpoisoned subpage(s).  It is set if any subpage of THP
> > > is found hwpoisoned by memory failure and cleared when the THP is freed or
> > > split.
> > >
> > > Cc: <stable@vger.kernel.org>
> > > Suggested-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> > > Signed-off-by: Yang Shi <shy828301@gmail.com>
> > > ---
> >
> > ...
> >
> > > diff --git a/mm/filemap.c b/mm/filemap.c
> > > index dae481293b5d..740b7afe159a 100644
> > > --- a/mm/filemap.c
> > > +++ b/mm/filemap.c
> > > @@ -3195,12 +3195,14 @@ static bool filemap_map_pmd(struct vm_fault *vmf, struct page *page)
> > >       }
> > >
> > >       if (pmd_none(*vmf->pmd) && PageTransHuge(page)) {
> > > -         vm_fault_t ret = do_set_pmd(vmf, page);
> > > -         if (!ret) {
> > > -                 /* The page is mapped successfully, reference consumed. */
> > > -                 unlock_page(page);
> > > -                 return true;
> > > -         }
> > > +             vm_fault_t ret = do_set_pmd(vmf, page);
> > > +             if (ret == VM_FAULT_FALLBACK)
> > > +                     goto out;
> >
> > Hm.. What? I don't get it. Who will establish page table in the pmd then?
>
> Aha, yeah. It should jump to the below PMD populate section. Will fix
> it in the next version.
>
> >
> > > +             if (!ret) {
> > > +                     /* The page is mapped successfully, reference consumed. */
> > > +                     unlock_page(page);
> > > +                     return true;
> > > +             }
> > >       }
> > >
> > >       if (pmd_none(*vmf->pmd)) {
> > > @@ -3220,6 +3222,7 @@ static bool filemap_map_pmd(struct vm_fault *vmf, struct page *page)
> > >               return true;
> > >       }
> > >
> > > +out:
> > >       return false;
> > >  }
> > >
> > > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> > > index 5e9ef0fc261e..0574b1613714 100644
> > > --- a/mm/huge_memory.c
> > > +++ b/mm/huge_memory.c
> > > @@ -2426,6 +2426,8 @@ static void __split_huge_page(struct page *page, struct list_head *list,
> > >       /* lock lru list/PageCompound, ref frozen by page_ref_freeze */
> > >       lruvec = lock_page_lruvec(head);
> > >
> > > +     ClearPageHasHWPoisoned(head);
> > > +
> >
> > Do we serialize the new flag with lock_page() or what? I mean what
> > prevents the flag being set again after this point, but before
> > ClearPageCompound()?
>
> No, not in this patch. But I think we could use refcount. THP split
> would freeze refcount and the split is guaranteed to succeed after
> that point, so refcount can be checked in memory failure. The
> SetPageHasHWPoisoned() call could be moved to __get_hwpoison_page()
> when get_unless_page_zero() bumps the refcount successfully. If the
> refcount is zero it means the THP is under split or being freed, we
> don't care about these two cases.

Setting the flag in __get_hwpoison_page() would make this patch depend
on patch #3. However, this patch probably will be backported to older
versions. To ease the backport, I'd like to have the refcount check in
the same place where THP is checked. So, something like "if
(PageTransHuge(hpage) && page_count(hpage) != 0)".

Then the call to set the flag could be moved to __get_hwpoison_page()
in the following patch (after patch #3). Does this sound good to you?

>
> The THP might be mapped before this flag is set, but the process will
> be killed later, so it seems fine.
>
> >
> > --
> >  Kirill A. Shutemov
Kirill A. Shutemov Sept. 24, 2021, 9:26 a.m. UTC | #4
On Thu, Sep 23, 2021 at 01:39:49PM -0700, Yang Shi wrote:
> On Thu, Sep 23, 2021 at 10:15 AM Yang Shi <shy828301@gmail.com> wrote:
> >
> > On Thu, Sep 23, 2021 at 7:39 AM Kirill A. Shutemov <kirill@shutemov.name> wrote:
> > >
> > > On Wed, Sep 22, 2021 at 08:28:26PM -0700, Yang Shi wrote:
> > > > When handling shmem page fault the THP with corrupted subpage could be PMD
> > > > mapped if certain conditions are satisfied.  But kernel is supposed to
> > > > send SIGBUS when trying to map hwpoisoned page.
> > > >
> > > > There are two paths which may do PMD map: fault around and regular fault.
> > > >
> > > > Before commit f9ce0be71d1f ("mm: Cleanup faultaround and finish_fault() codepaths")
> > > > the thing was even worse in fault around path.  The THP could be PMD mapped as
> > > > long as the VMA fits regardless what subpage is accessed and corrupted.  After
> > > > this commit as long as head page is not corrupted the THP could be PMD mapped.
> > > >
> > > > In the regulat fault path the THP could be PMD mapped as long as the corrupted
> > >
> > > s/regulat/regular/
> > >
> > > > page is not accessed and the VMA fits.
> > > >
> > > > This loophole could be fixed by iterating every subpage to check if any
> > > > of them is hwpoisoned or not, but it is somewhat costly in page fault path.
> > > >
> > > > So introduce a new page flag called HasHWPoisoned on the first tail page.  It
> > > > indicates the THP has hwpoisoned subpage(s).  It is set if any subpage of THP
> > > > is found hwpoisoned by memory failure and cleared when the THP is freed or
> > > > split.
> > > >
> > > > Cc: <stable@vger.kernel.org>
> > > > Suggested-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> > > > Signed-off-by: Yang Shi <shy828301@gmail.com>
> > > > ---
> > >
> > > ...
> > >
> > > > diff --git a/mm/filemap.c b/mm/filemap.c
> > > > index dae481293b5d..740b7afe159a 100644
> > > > --- a/mm/filemap.c
> > > > +++ b/mm/filemap.c
> > > > @@ -3195,12 +3195,14 @@ static bool filemap_map_pmd(struct vm_fault *vmf, struct page *page)
> > > >       }
> > > >
> > > >       if (pmd_none(*vmf->pmd) && PageTransHuge(page)) {
> > > > -         vm_fault_t ret = do_set_pmd(vmf, page);
> > > > -         if (!ret) {
> > > > -                 /* The page is mapped successfully, reference consumed. */
> > > > -                 unlock_page(page);
> > > > -                 return true;
> > > > -         }
> > > > +             vm_fault_t ret = do_set_pmd(vmf, page);
> > > > +             if (ret == VM_FAULT_FALLBACK)
> > > > +                     goto out;
> > >
> > > Hm.. What? I don't get it. Who will establish page table in the pmd then?
> >
> > Aha, yeah. It should jump to the below PMD populate section. Will fix
> > it in the next version.
> >
> > >
> > > > +             if (!ret) {
> > > > +                     /* The page is mapped successfully, reference consumed. */
> > > > +                     unlock_page(page);
> > > > +                     return true;
> > > > +             }
> > > >       }
> > > >
> > > >       if (pmd_none(*vmf->pmd)) {
> > > > @@ -3220,6 +3222,7 @@ static bool filemap_map_pmd(struct vm_fault *vmf, struct page *page)
> > > >               return true;
> > > >       }
> > > >
> > > > +out:
> > > >       return false;
> > > >  }
> > > >
> > > > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> > > > index 5e9ef0fc261e..0574b1613714 100644
> > > > --- a/mm/huge_memory.c
> > > > +++ b/mm/huge_memory.c
> > > > @@ -2426,6 +2426,8 @@ static void __split_huge_page(struct page *page, struct list_head *list,
> > > >       /* lock lru list/PageCompound, ref frozen by page_ref_freeze */
> > > >       lruvec = lock_page_lruvec(head);
> > > >
> > > > +     ClearPageHasHWPoisoned(head);
> > > > +
> > >
> > > Do we serialize the new flag with lock_page() or what? I mean what
> > > prevents the flag being set again after this point, but before
> > > ClearPageCompound()?
> >
> > No, not in this patch. But I think we could use refcount. THP split
> > would freeze refcount and the split is guaranteed to succeed after
> > that point, so refcount can be checked in memory failure. The
> > SetPageHasHWPoisoned() call could be moved to __get_hwpoison_page()
> > when get_unless_page_zero() bumps the refcount successfully. If the
> > refcount is zero it means the THP is under split or being freed, we
> > don't care about these two cases.
> 
> Setting the flag in __get_hwpoison_page() would make this patch depend
> on patch #3. However, this patch probably will be backported to older
> versions. To ease the backport, I'd like to have the refcount check in
> the same place where THP is checked. So, something like "if
> (PageTransHuge(hpage) && page_count(hpage) != 0)".
> 
> Then the call to set the flag could be moved to __get_hwpoison_page()
> in the following patch (after patch #3). Does this sound good to you?

Could you show the code? I'm not sure I follow. The page_count(hpage) check
looks racy to me. What if a split happens just after the check?
Yang Shi Sept. 24, 2021, 4:44 p.m. UTC | #5
On Fri, Sep 24, 2021 at 2:26 AM Kirill A. Shutemov <kirill@shutemov.name> wrote:
>
> On Thu, Sep 23, 2021 at 01:39:49PM -0700, Yang Shi wrote:
> > On Thu, Sep 23, 2021 at 10:15 AM Yang Shi <shy828301@gmail.com> wrote:
> > >
> > > On Thu, Sep 23, 2021 at 7:39 AM Kirill A. Shutemov <kirill@shutemov.name> wrote:
> > > >
> > > > On Wed, Sep 22, 2021 at 08:28:26PM -0700, Yang Shi wrote:
> > > > > When handling shmem page fault the THP with corrupted subpage could be PMD
> > > > > mapped if certain conditions are satisfied.  But kernel is supposed to
> > > > > send SIGBUS when trying to map hwpoisoned page.
> > > > >
> > > > > There are two paths which may do PMD map: fault around and regular fault.
> > > > >
> > > > > Before commit f9ce0be71d1f ("mm: Cleanup faultaround and finish_fault() codepaths")
> > > > > the thing was even worse in fault around path.  The THP could be PMD mapped as
> > > > > long as the VMA fits regardless what subpage is accessed and corrupted.  After
> > > > > this commit as long as head page is not corrupted the THP could be PMD mapped.
> > > > >
> > > > > In the regulat fault path the THP could be PMD mapped as long as the corrupted
> > > >
> > > > s/regulat/regular/
> > > >
> > > > > page is not accessed and the VMA fits.
> > > > >
> > > > > This loophole could be fixed by iterating every subpage to check if any
> > > > > of them is hwpoisoned or not, but it is somewhat costly in page fault path.
> > > > >
> > > > > So introduce a new page flag called HasHWPoisoned on the first tail page.  It
> > > > > indicates the THP has hwpoisoned subpage(s).  It is set if any subpage of THP
> > > > > is found hwpoisoned by memory failure and cleared when the THP is freed or
> > > > > split.
> > > > >
> > > > > Cc: <stable@vger.kernel.org>
> > > > > Suggested-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> > > > > Signed-off-by: Yang Shi <shy828301@gmail.com>
> > > > > ---
> > > >
> > > > ...
> > > >
> > > > > diff --git a/mm/filemap.c b/mm/filemap.c
> > > > > index dae481293b5d..740b7afe159a 100644
> > > > > --- a/mm/filemap.c
> > > > > +++ b/mm/filemap.c
> > > > > @@ -3195,12 +3195,14 @@ static bool filemap_map_pmd(struct vm_fault *vmf, struct page *page)
> > > > >       }
> > > > >
> > > > >       if (pmd_none(*vmf->pmd) && PageTransHuge(page)) {
> > > > > -         vm_fault_t ret = do_set_pmd(vmf, page);
> > > > > -         if (!ret) {
> > > > > -                 /* The page is mapped successfully, reference consumed. */
> > > > > -                 unlock_page(page);
> > > > > -                 return true;
> > > > > -         }
> > > > > +             vm_fault_t ret = do_set_pmd(vmf, page);
> > > > > +             if (ret == VM_FAULT_FALLBACK)
> > > > > +                     goto out;
> > > >
> > > > Hm.. What? I don't get it. Who will establish page table in the pmd then?
> > >
> > > Aha, yeah. It should jump to the below PMD populate section. Will fix
> > > it in the next version.
> > >
> > > >
> > > > > +             if (!ret) {
> > > > > +                     /* The page is mapped successfully, reference consumed. */
> > > > > +                     unlock_page(page);
> > > > > +                     return true;
> > > > > +             }
> > > > >       }
> > > > >
> > > > >       if (pmd_none(*vmf->pmd)) {
> > > > > @@ -3220,6 +3222,7 @@ static bool filemap_map_pmd(struct vm_fault *vmf, struct page *page)
> > > > >               return true;
> > > > >       }
> > > > >
> > > > > +out:
> > > > >       return false;
> > > > >  }
> > > > >
> > > > > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> > > > > index 5e9ef0fc261e..0574b1613714 100644
> > > > > --- a/mm/huge_memory.c
> > > > > +++ b/mm/huge_memory.c
> > > > > @@ -2426,6 +2426,8 @@ static void __split_huge_page(struct page *page, struct list_head *list,
> > > > >       /* lock lru list/PageCompound, ref frozen by page_ref_freeze */
> > > > >       lruvec = lock_page_lruvec(head);
> > > > >
> > > > > +     ClearPageHasHWPoisoned(head);
> > > > > +
> > > >
> > > > Do we serialize the new flag with lock_page() or what? I mean what
> > > > prevents the flag being set again after this point, but before
> > > > ClearPageCompound()?
> > >
> > > No, not in this patch. But I think we could use refcount. THP split
> > > would freeze refcount and the split is guaranteed to succeed after
> > > that point, so refcount can be checked in memory failure. The
> > > SetPageHasHWPoisoned() call could be moved to __get_hwpoison_page()
> > > when get_unless_page_zero() bumps the refcount successfully. If the
> > > refcount is zero it means the THP is under split or being freed, we
> > > don't care about these two cases.
> >
> > Setting the flag in __get_hwpoison_page() would make this patch depend
> > on patch #3. However, this patch probably will be backported to older
> > versions. To ease the backport, I'd like to have the refcount check in
> > the same place where THP is checked. So, something like "if
> > (PageTransHuge(hpage) && page_count(hpage) != 0)".
> >
> > Then the call to set the flag could be moved to __get_hwpoison_page()
> > in the following patch (after patch #3). Does this sound good to you?
>
> Could you show the code I'm not sure I follow. page_count(hpage) check
> looks racy to me. What if split happens just after the check?

Yes, it is racy. The flag has to be set after get_page_unless_zero().
Did some archeology, it seems patch #3 is also applicable to v4.9+.
So, the simplest way may be to have both patch #3 and this patch
backported to stable.



>
> --
>  Kirill A. Shutemov

Patch

diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index a558d67ee86f..a357b41b3057 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -171,6 +171,11 @@  enum pageflags {
 	/* Compound pages. Stored in first tail page's flags */
 	PG_double_map = PG_workingset,
 
+#ifdef CONFIG_MEMORY_FAILURE
+	/* Compound pages. Stored in first tail page's flags */
+	PG_has_hwpoisoned = PG_mappedtodisk,
+#endif
+
 	/* non-lru isolated movable page */
 	PG_isolated = PG_reclaim,
 
@@ -668,6 +673,20 @@  PAGEFLAG_FALSE(DoubleMap)
 	TESTSCFLAG_FALSE(DoubleMap)
 #endif
 
+#if defined(CONFIG_MEMORY_FAILURE) && defined(CONFIG_TRANSPARENT_HUGEPAGE)
+/*
+ * PageHasHWPoisoned indicates that at least one subpage is hwpoisoned in the
+ * compound page.
+ *
+ * Set by the hwpoison handler.  Cleared when the THP is freed or split.
+ */
+PAGEFLAG(HasHWPoisoned, has_hwpoisoned, PF_SECOND)
+	TESTSCFLAG(HasHWPoisoned, has_hwpoisoned, PF_SECOND)
+#else
+PAGEFLAG_FALSE(HasHWPoisoned)
+	TESTSCFLAG_FALSE(HasHWPoisoned)
+#endif
+
 /*
  * Check if a page is currently marked HWPoisoned. Note that this check is
  * best effort only and inherently racy: there is no way to synchronize with
diff --git a/mm/filemap.c b/mm/filemap.c
index dae481293b5d..740b7afe159a 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -3195,12 +3195,14 @@  static bool filemap_map_pmd(struct vm_fault *vmf, struct page *page)
 	}
 
 	if (pmd_none(*vmf->pmd) && PageTransHuge(page)) {
-	    vm_fault_t ret = do_set_pmd(vmf, page);
-	    if (!ret) {
-		    /* The page is mapped successfully, reference consumed. */
-		    unlock_page(page);
-		    return true;
-	    }
+		vm_fault_t ret = do_set_pmd(vmf, page);
+		if (ret == VM_FAULT_FALLBACK)
+			goto out;
+		if (!ret) {
+			/* The page is mapped successfully, reference consumed. */
+			unlock_page(page);
+			return true;
+		}
 	}
 
 	if (pmd_none(*vmf->pmd)) {
@@ -3220,6 +3222,7 @@  static bool filemap_map_pmd(struct vm_fault *vmf, struct page *page)
 		return true;
 	}
 
+out:
 	return false;
 }
 
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 5e9ef0fc261e..0574b1613714 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2426,6 +2426,8 @@  static void __split_huge_page(struct page *page, struct list_head *list,
 	/* lock lru list/PageCompound, ref frozen by page_ref_freeze */
 	lruvec = lock_page_lruvec(head);
 
+	ClearPageHasHWPoisoned(head);
+
 	for (i = nr - 1; i >= 1; i--) {
 		__split_huge_page_tail(head, i, lruvec, list);
 		/* Some pages can be beyond EOF: drop them from page cache */
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 54879c339024..93ae0ce90ab8 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1663,6 +1663,10 @@  int memory_failure(unsigned long pfn, int flags)
 	}
 
 	orig_head = hpage = compound_head(p);
+
+	if (PageTransHuge(hpage))
+		SetPageHasHWPoisoned(orig_head);
+
 	num_poisoned_pages_inc();
 
 	/*
diff --git a/mm/memory.c b/mm/memory.c
index 25fc46e87214..738f4e1df81e 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3905,6 +3905,15 @@  vm_fault_t do_set_pmd(struct vm_fault *vmf, struct page *page)
 	if (compound_order(page) != HPAGE_PMD_ORDER)
 		return ret;
 
+	/*
+	 * Just back off if any subpage of a THP is corrupted, otherwise
+	 * the corrupted page may be mapped by PMD silently to escape the
+	 * check.  Such a THP can only be PTE mapped.  Access to the
+	 * corrupted subpage should trigger SIGBUS as expected.
+	 */
+	if (unlikely(PageHasHWPoisoned(page)))
+		return ret;
+
 	/*
 	 * Archs like ppc64 need additional space to store information
 	 * related to pte entry. Use the preallocated table for that.
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index b37435c274cf..7f37652f0287 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1312,8 +1312,10 @@  static __always_inline bool free_pages_prepare(struct page *page,
 
 		VM_BUG_ON_PAGE(compound && compound_order(page) != order, page);
 
-		if (compound)
+		if (compound) {
 			ClearPageDoubleMap(page);
+			ClearPageHasHWPoisoned(page);
+		}
 		for (i = 1; i < (1 << order); i++) {
 			if (compound)
 				bad += free_tail_pages_check(page, page + i);