diff mbox series

[v3,2/5] mm: filemap: check if THP has hwpoisoned subpage for PMD page fault

Message ID 20210930215311.240774-3-shy828301@gmail.com (mailing list archive)
State New, archived
Headers show
Series Solve silent data loss caused by poisoned page cache (shmem/tmpfs) | expand

Commit Message

Yang Shi Sept. 30, 2021, 9:53 p.m. UTC
When handling shmem page fault the THP with corrupted subpage could be PMD
mapped if certain conditions are satisfied.  But kernel is supposed to
send SIGBUS when trying to map hwpoisoned page.

There are two paths which may do PMD map: fault around and regular fault.

Before commit f9ce0be71d1f ("mm: Cleanup faultaround and finish_fault() codepaths")
the thing was even worse in fault around path.  The THP could be PMD mapped as
long as the VMA fits regardless what subpage is accessed and corrupted.  After
this commit as long as head page is not corrupted the THP could be PMD mapped.

In the regular fault path the THP could be PMD mapped as long as the corrupted
page is not accessed and the VMA fits.

This loophole could be fixed by iterating every subpage to check if any
of them is hwpoisoned or not, but it is somewhat costly in page fault path.

So introduce a new page flag called HasHWPoisoned on the first tail page.  It
indicates the THP has hwpoisoned subpage(s).  It is set if any subpage of THP
is found hwpoisoned by memory failure and cleared when the THP is freed or
split.

Fixes: 800d8c63b2e9 ("shmem: add huge pages support")
Cc: <stable@vger.kernel.org>
Suggested-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Yang Shi <shy828301@gmail.com>
---
 include/linux/page-flags.h | 19 +++++++++++++++++++
 mm/filemap.c               | 12 ++++++------
 mm/huge_memory.c           |  2 ++
 mm/memory-failure.c        |  6 +++++-
 mm/memory.c                |  9 +++++++++
 mm/page_alloc.c            |  4 +++-
 6 files changed, 44 insertions(+), 8 deletions(-)

Comments

Naoya Horiguchi Oct. 1, 2021, 7:23 a.m. UTC | #1
On Thu, Sep 30, 2021 at 02:53:08PM -0700, Yang Shi wrote:
> When handling shmem page fault the THP with corrupted subpage could be PMD
> mapped if certain conditions are satisfied.  But kernel is supposed to
> send SIGBUS when trying to map hwpoisoned page.
> 
> There are two paths which may do PMD map: fault around and regular fault.
> 
> Before commit f9ce0be71d1f ("mm: Cleanup faultaround and finish_fault() codepaths")
> the thing was even worse in fault around path.  The THP could be PMD mapped as
> long as the VMA fits regardless what subpage is accessed and corrupted.  After
> this commit as long as head page is not corrupted the THP could be PMD mapped.
> 
> In the regular fault path the THP could be PMD mapped as long as the corrupted
> page is not accessed and the VMA fits.
> 
> This loophole could be fixed by iterating every subpage to check if any
> of them is hwpoisoned or not, but it is somewhat costly in page fault path.
> 
> So introduce a new page flag called HasHWPoisoned on the first tail page.  It
> indicates the THP has hwpoisoned subpage(s).  It is set if any subpage of THP
> is found hwpoisoned by memory failure and cleared when the THP is freed or
> split.
> 
> Fixes: 800d8c63b2e9 ("shmem: add huge pages support")
> Cc: <stable@vger.kernel.org>
> Suggested-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Signed-off-by: Yang Shi <shy828301@gmail.com>
...
> @@ -668,6 +673,20 @@ PAGEFLAG_FALSE(DoubleMap)
>  	TESTSCFLAG_FALSE(DoubleMap)
>  #endif
>  
> +#if defined(CONFIG_MEMORY_FAILURE) && defined(CONFIG_TRANSPARENT_HUGEPAGE)
> +/*
> + * PageHasPoisoned indicates that at least on subpage is hwpoisoned in the

Maybe you meant as follow?

+ * PageHasHWPoisoned indicates that at least one subpage is hwpoisoned in the

Thanks,
Naoya Horiguchi
Yang Shi Oct. 1, 2021, 9:07 p.m. UTC | #2
On Fri, Oct 1, 2021 at 12:23 AM Naoya Horiguchi
<naoya.horiguchi@linux.dev> wrote:
>
> On Thu, Sep 30, 2021 at 02:53:08PM -0700, Yang Shi wrote:
> > When handling shmem page fault the THP with corrupted subpage could be PMD
> > mapped if certain conditions are satisfied.  But kernel is supposed to
> > send SIGBUS when trying to map hwpoisoned page.
> >
> > There are two paths which may do PMD map: fault around and regular fault.
> >
> > Before commit f9ce0be71d1f ("mm: Cleanup faultaround and finish_fault() codepaths")
> > the thing was even worse in fault around path.  The THP could be PMD mapped as
> > long as the VMA fits regardless what subpage is accessed and corrupted.  After
> > this commit as long as head page is not corrupted the THP could be PMD mapped.
> >
> > In the regular fault path the THP could be PMD mapped as long as the corrupted
> > page is not accessed and the VMA fits.
> >
> > This loophole could be fixed by iterating every subpage to check if any
> > of them is hwpoisoned or not, but it is somewhat costly in page fault path.
> >
> > So introduce a new page flag called HasHWPoisoned on the first tail page.  It
> > indicates the THP has hwpoisoned subpage(s).  It is set if any subpage of THP
> > is found hwpoisoned by memory failure and cleared when the THP is freed or
> > split.
> >
> > Fixes: 800d8c63b2e9 ("shmem: add huge pages support")
> > Cc: <stable@vger.kernel.org>
> > Suggested-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> > Signed-off-by: Yang Shi <shy828301@gmail.com>
> ...
> > @@ -668,6 +673,20 @@ PAGEFLAG_FALSE(DoubleMap)
> >       TESTSCFLAG_FALSE(DoubleMap)
> >  #endif
> >
> > +#if defined(CONFIG_MEMORY_FAILURE) && defined(CONFIG_TRANSPARENT_HUGEPAGE)
> > +/*
> > + * PageHasPoisoned indicates that at least on subpage is hwpoisoned in the
>
> Maybe you meant as follow?
>
> + * PageHasHWPoisoned indicates that at least one subpage is hwpoisoned in the

Yeah, thanks for catching it. It is a typo because the flag was called
PageHasPoisoned. But "poisoned" seems ambiguous for some cases since,
for example, some memory sanitizers use "poisoned", so I renamed it to
PageHasHWPoisoned to make it less ambiguous.

>
> Thanks,
> Naoya Horiguchi
Kirill A . Shutemov Oct. 4, 2021, 2:06 p.m. UTC | #3
On Thu, Sep 30, 2021 at 02:53:08PM -0700, Yang Shi wrote:
> diff --git a/mm/filemap.c b/mm/filemap.c
> index dae481293b5d..2acc2b977f66 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -3195,12 +3195,12 @@ static bool filemap_map_pmd(struct vm_fault *vmf, struct page *page)
>  	}
>  
>  	if (pmd_none(*vmf->pmd) && PageTransHuge(page)) {
> -	    vm_fault_t ret = do_set_pmd(vmf, page);
> -	    if (!ret) {
> -		    /* The page is mapped successfully, reference consumed. */
> -		    unlock_page(page);
> -		    return true;
> -	    }
> +		vm_fault_t ret = do_set_pmd(vmf, page);
> +		if (!ret) {
> +			/* The page is mapped successfully, reference consumed. */
> +			unlock_page(page);
> +			return true;
> +		}
>  	}
>  
>  	if (pmd_none(*vmf->pmd)) {

Hm. Is it unrelated whitespace fix?
Yang Shi Oct. 4, 2021, 6:17 p.m. UTC | #4
On Mon, Oct 4, 2021 at 7:06 AM Kirill A. Shutemov <kirill@shutemov.name> wrote:
>
> On Thu, Sep 30, 2021 at 02:53:08PM -0700, Yang Shi wrote:
> > diff --git a/mm/filemap.c b/mm/filemap.c
> > index dae481293b5d..2acc2b977f66 100644
> > --- a/mm/filemap.c
> > +++ b/mm/filemap.c
> > @@ -3195,12 +3195,12 @@ static bool filemap_map_pmd(struct vm_fault *vmf, struct page *page)
> >       }
> >
> >       if (pmd_none(*vmf->pmd) && PageTransHuge(page)) {
> > -         vm_fault_t ret = do_set_pmd(vmf, page);
> > -         if (!ret) {
> > -                 /* The page is mapped successfully, reference consumed. */
> > -                 unlock_page(page);
> > -                 return true;
> > -         }
> > +             vm_fault_t ret = do_set_pmd(vmf, page);
> > +             if (!ret) {
> > +                     /* The page is mapped successfully, reference consumed. */
> > +                     unlock_page(page);
> > +                     return true;
> > +             }
> >       }
> >
> >       if (pmd_none(*vmf->pmd)) {
>
> Hm. Is it unrelated whitespace fix?

It is a coding style clean up. I thought it may be overkilling to have
a separate patch. Do you prefer separate one?

>
> --
>  Kirill A. Shutemov
Kirill A . Shutemov Oct. 4, 2021, 7:41 p.m. UTC | #5
On Mon, Oct 04, 2021 at 11:17:29AM -0700, Yang Shi wrote:
> On Mon, Oct 4, 2021 at 7:06 AM Kirill A. Shutemov <kirill@shutemov.name> wrote:
> >
> > On Thu, Sep 30, 2021 at 02:53:08PM -0700, Yang Shi wrote:
> > > diff --git a/mm/filemap.c b/mm/filemap.c
> > > index dae481293b5d..2acc2b977f66 100644
> > > --- a/mm/filemap.c
> > > +++ b/mm/filemap.c
> > > @@ -3195,12 +3195,12 @@ static bool filemap_map_pmd(struct vm_fault *vmf, struct page *page)
> > >       }
> > >
> > >       if (pmd_none(*vmf->pmd) && PageTransHuge(page)) {
> > > -         vm_fault_t ret = do_set_pmd(vmf, page);
> > > -         if (!ret) {
> > > -                 /* The page is mapped successfully, reference consumed. */
> > > -                 unlock_page(page);
> > > -                 return true;
> > > -         }
> > > +             vm_fault_t ret = do_set_pmd(vmf, page);
> > > +             if (!ret) {
> > > +                     /* The page is mapped successfully, reference consumed. */
> > > +                     unlock_page(page);
> > > +                     return true;
> > > +             }
> > >       }
> > >
> > >       if (pmd_none(*vmf->pmd)) {
> >
> > Hm. Is it unrelated whitespace fix?
> 
> It is a coding style clean up. I thought it may be overkilling to have
> a separate patch. Do you prefer separate one?

Maybe. I tried to find what changed here. It's confusing.
Yang Shi Oct. 4, 2021, 8:13 p.m. UTC | #6
On Mon, Oct 4, 2021 at 12:41 PM Kirill A. Shutemov <kirill@shutemov.name> wrote:
>
> On Mon, Oct 04, 2021 at 11:17:29AM -0700, Yang Shi wrote:
> > On Mon, Oct 4, 2021 at 7:06 AM Kirill A. Shutemov <kirill@shutemov.name> wrote:
> > >
> > > On Thu, Sep 30, 2021 at 02:53:08PM -0700, Yang Shi wrote:
> > > > diff --git a/mm/filemap.c b/mm/filemap.c
> > > > index dae481293b5d..2acc2b977f66 100644
> > > > --- a/mm/filemap.c
> > > > +++ b/mm/filemap.c
> > > > @@ -3195,12 +3195,12 @@ static bool filemap_map_pmd(struct vm_fault *vmf, struct page *page)
> > > >       }
> > > >
> > > >       if (pmd_none(*vmf->pmd) && PageTransHuge(page)) {
> > > > -         vm_fault_t ret = do_set_pmd(vmf, page);
> > > > -         if (!ret) {
> > > > -                 /* The page is mapped successfully, reference consumed. */
> > > > -                 unlock_page(page);
> > > > -                 return true;
> > > > -         }
> > > > +             vm_fault_t ret = do_set_pmd(vmf, page);
> > > > +             if (!ret) {
> > > > +                     /* The page is mapped successfully, reference consumed. */
> > > > +                     unlock_page(page);
> > > > +                     return true;
> > > > +             }
> > > >       }
> > > >
> > > >       if (pmd_none(*vmf->pmd)) {
> > >
> > > Hm. Is it unrelated whitespace fix?
> >
> > It is a coding style clean up. I thought it may be overkilling to have
> > a separate patch. Do you prefer separate one?
>
> Maybe. I tried to find what changed here. It's confusing.

Yeah, maybe. Anyway I will separate the real big fix and the cleanup
into two patches. This may be helpful for backporting too.

>
> --
>  Kirill A. Shutemov
Peter Xu Oct. 6, 2021, 7:54 p.m. UTC | #7
On Mon, Oct 04, 2021 at 01:13:07PM -0700, Yang Shi wrote:
> On Mon, Oct 4, 2021 at 12:41 PM Kirill A. Shutemov <kirill@shutemov.name> wrote:
> >
> > On Mon, Oct 04, 2021 at 11:17:29AM -0700, Yang Shi wrote:
> > > On Mon, Oct 4, 2021 at 7:06 AM Kirill A. Shutemov <kirill@shutemov.name> wrote:
> > > >
> > > > On Thu, Sep 30, 2021 at 02:53:08PM -0700, Yang Shi wrote:
> > > > > diff --git a/mm/filemap.c b/mm/filemap.c
> > > > > index dae481293b5d..2acc2b977f66 100644
> > > > > --- a/mm/filemap.c
> > > > > +++ b/mm/filemap.c
> > > > > @@ -3195,12 +3195,12 @@ static bool filemap_map_pmd(struct vm_fault *vmf, struct page *page)
> > > > >       }
> > > > >
> > > > >       if (pmd_none(*vmf->pmd) && PageTransHuge(page)) {
> > > > > -         vm_fault_t ret = do_set_pmd(vmf, page);
> > > > > -         if (!ret) {
> > > > > -                 /* The page is mapped successfully, reference consumed. */
> > > > > -                 unlock_page(page);
> > > > > -                 return true;
> > > > > -         }
> > > > > +             vm_fault_t ret = do_set_pmd(vmf, page);
> > > > > +             if (!ret) {
> > > > > +                     /* The page is mapped successfully, reference consumed. */
> > > > > +                     unlock_page(page);
> > > > > +                     return true;
> > > > > +             }
> > > > >       }
> > > > >
> > > > >       if (pmd_none(*vmf->pmd)) {
> > > >
> > > > Hm. Is it unrelated whitespace fix?
> > >
> > > It is a coding style clean up. I thought it may be overkilling to have
> > > a separate patch. Do you prefer separate one?
> >
> > Maybe. I tried to find what changed here. It's confusing.
> 
> Yeah, maybe. Anyway I will separate the real big fix and the cleanup
> into two patches. This may be helpful for backporting too.

Or maybe we just don't touch it until there's need for a functional change?  I
feel it a pity to lose the git blame info for reindent-only patches, but no
strong opinion, because I know many people don't think the same and I'm fine
with either ways.

Another side note: perhaps a comment above pageflags enum on PG_has_hwpoisoned
would be nice?  I saw that we've got a bunch of those already.

Thanks,
Peter Xu Oct. 6, 2021, 8:15 p.m. UTC | #8
On Thu, Sep 30, 2021 at 02:53:08PM -0700, Yang Shi wrote:
> @@ -1148,8 +1148,12 @@ static int __get_hwpoison_page(struct page *page)
>  		return -EBUSY;
>  
>  	if (get_page_unless_zero(head)) {
> -		if (head == compound_head(page))
> +		if (head == compound_head(page)) {
> +			if (PageTransHuge(head))
> +				SetPageHasHWPoisoned(head);
> +
>  			return 1;
> +		}
>  
>  		pr_info("Memory failure: %#lx cannot catch tail\n",
>  			page_to_pfn(page));

Sorry for the late comments.

I'm wondering whether it's ideal to set this bit here, as get_hwpoison_page()
sounds like a pure helper to get a refcount out of a sane hwpoisoned page.  I'm
afraid there can be side effect that we set this without being noticed, so I'm
also wondering we should keep it in memory_failure().

Quotting comments for get_hwpoison_page():

 * get_hwpoison_page() takes a page refcount of an error page to handle memory
 * error on it, after checking that the error page is in a well-defined state
 * (defined as a page-type we can successfully handle the memor error on it,
 * such as LRU page and hugetlb page).

For example, I see that both unpoison_memory() and soft_offline_page() will
call it too, does it mean that we'll also set the bits e.g. even when we want
to inject an unpoison event too?

Thanks,
Peter Xu Oct. 6, 2021, 8:18 p.m. UTC | #9
On Thu, Sep 30, 2021 at 02:53:08PM -0700, Yang Shi wrote:
> +#ifdef CONFIG_MEMORY_FAILURE
> +	/* Compound pages. Stored in first tail page's flags */
> +	PG_has_hwpoisoned = PG_mappedtodisk,
> +#endif

Sorry one more comment I missed: would PG_hwpoisoned_subpage better?  It's just
that "has_hwpoison" can be directly read as "this page has been hwpoisoned",
which sounds too close to PG_hwpoisoned.  No strong opinion either.  Thanks,
Yang Shi Oct. 6, 2021, 11:41 p.m. UTC | #10
On Wed, Oct 6, 2021 at 12:54 PM Peter Xu <peterx@redhat.com> wrote:
>
> On Mon, Oct 04, 2021 at 01:13:07PM -0700, Yang Shi wrote:
> > On Mon, Oct 4, 2021 at 12:41 PM Kirill A. Shutemov <kirill@shutemov.name> wrote:
> > >
> > > On Mon, Oct 04, 2021 at 11:17:29AM -0700, Yang Shi wrote:
> > > > On Mon, Oct 4, 2021 at 7:06 AM Kirill A. Shutemov <kirill@shutemov.name> wrote:
> > > > >
> > > > > On Thu, Sep 30, 2021 at 02:53:08PM -0700, Yang Shi wrote:
> > > > > > diff --git a/mm/filemap.c b/mm/filemap.c
> > > > > > index dae481293b5d..2acc2b977f66 100644
> > > > > > --- a/mm/filemap.c
> > > > > > +++ b/mm/filemap.c
> > > > > > @@ -3195,12 +3195,12 @@ static bool filemap_map_pmd(struct vm_fault *vmf, struct page *page)
> > > > > >       }
> > > > > >
> > > > > >       if (pmd_none(*vmf->pmd) && PageTransHuge(page)) {
> > > > > > -         vm_fault_t ret = do_set_pmd(vmf, page);
> > > > > > -         if (!ret) {
> > > > > > -                 /* The page is mapped successfully, reference consumed. */
> > > > > > -                 unlock_page(page);
> > > > > > -                 return true;
> > > > > > -         }
> > > > > > +             vm_fault_t ret = do_set_pmd(vmf, page);
> > > > > > +             if (!ret) {
> > > > > > +                     /* The page is mapped successfully, reference consumed. */
> > > > > > +                     unlock_page(page);
> > > > > > +                     return true;
> > > > > > +             }
> > > > > >       }
> > > > > >
> > > > > >       if (pmd_none(*vmf->pmd)) {
> > > > >
> > > > > Hm. Is it unrelated whitespace fix?
> > > >
> > > > It is a coding style clean up. I thought it may be overkilling to have
> > > > a separate patch. Do you prefer separate one?
> > >
> > > Maybe. I tried to find what changed here. It's confusing.
> >
> > Yeah, maybe. Anyway I will separate the real big fix and the cleanup
> > into two patches. This may be helpful for backporting too.
>
> Or maybe we just don't touch it until there's need for a functional change?  I
> feel it a pity to lose the git blame info for reindent-only patches, but no
> strong opinion, because I know many people don't think the same and I'm fine
> with either ways.

TBH I really don't think keeping old "git blame" info should be an
excuse to avoid any coding style cleanup.

>
> Another side note: perhaps a comment above pageflags enum on PG_has_hwpoisoned
> would be nice?  I saw that we've got a bunch of those already.

I was thinking about that, but it seems PG_double_map doesn't have
comment there either so I didn't add.

>
> Thanks,
>
> --
> Peter Xu
>
Yang Shi Oct. 6, 2021, 11:57 p.m. UTC | #11
On Wed, Oct 6, 2021 at 1:15 PM Peter Xu <peterx@redhat.com> wrote:
>
> On Thu, Sep 30, 2021 at 02:53:08PM -0700, Yang Shi wrote:
> > @@ -1148,8 +1148,12 @@ static int __get_hwpoison_page(struct page *page)
> >               return -EBUSY;
> >
> >       if (get_page_unless_zero(head)) {
> > -             if (head == compound_head(page))
> > +             if (head == compound_head(page)) {
> > +                     if (PageTransHuge(head))
> > +                             SetPageHasHWPoisoned(head);
> > +
> >                       return 1;
> > +             }
> >
> >               pr_info("Memory failure: %#lx cannot catch tail\n",
> >                       page_to_pfn(page));
>
> Sorry for the late comments.
>
> I'm wondering whether it's ideal to set this bit here, as get_hwpoison_page()
> sounds like a pure helper to get a refcount out of a sane hwpoisoned page.  I'm
> afraid there can be side effect that we set this without being noticed, so I'm
> also wondering we should keep it in memory_failure().
>
> Quotting comments for get_hwpoison_page():
>
>  * get_hwpoison_page() takes a page refcount of an error page to handle memory
>  * error on it, after checking that the error page is in a well-defined state
>  * (defined as a page-type we can successfully handle the memor error on it,
>  * such as LRU page and hugetlb page).
>
> For example, I see that both unpoison_memory() and soft_offline_page() will
> call it too, does it mean that we'll also set the bits e.g. even when we want
> to inject an unpoison event too?

unpoison_memory() should be not a problem since it will just bail out
once THP is met as the comment says:

/*
* unpoison_memory() can encounter thp only when the thp is being
* worked by memory_failure() and the page lock is not held yet.
* In such case, we yield to memory_failure() and make unpoison fail.
*/


And I think we should set the flag for soft offline too, right? The
soft offline does set the hwpoison flag for the corrupted sub page and
doesn't split file THP, so it should be captured by page fault as
well. And yes for poison injection.

But your comment reminds me that get_hwpoison_page() is just called
when !MF_COUNT_INCREASED, so it means MADV_HWPOISON still could
escape. This needs to be covered too.

BTW, I did the test with MADV_HWPOISON, but I didn't test this change
(moving flag set after get_page_unless_zero()) since I thought it was
just a trivial change and did overlook this case.

>
> Thanks,
>
> --
> Peter Xu
>
Yang Shi Oct. 7, 2021, 2:49 a.m. UTC | #12
On Wed, Oct 6, 2021 at 1:18 PM Peter Xu <peterx@redhat.com> wrote:
>
> On Thu, Sep 30, 2021 at 02:53:08PM -0700, Yang Shi wrote:
> > +#ifdef CONFIG_MEMORY_FAILURE
> > +     /* Compound pages. Stored in first tail page's flags */
> > +     PG_has_hwpoisoned = PG_mappedtodisk,
> > +#endif
>
> Sorry one more comment I missed: would PG_hwpoisoned_subpage better?  It's just
> that "has_hwpoison" can be directly read as "this page has been hwpoisoned",
> which sounds too close to PG_hwpoisoned.  No strong opinion either.  Thanks,

TBH, I don't see too much difference IMHO. Someone may interpret it to
"this" subpage is hwpoisoned rather than at least one subpage of THP
is hwpoisoned.

>
> --
> Peter Xu
>
Peter Xu Oct. 7, 2021, 4:06 p.m. UTC | #13
On Wed, Oct 06, 2021 at 04:57:38PM -0700, Yang Shi wrote:
> > For example, I see that both unpoison_memory() and soft_offline_page() will
> > call it too, does it mean that we'll also set the bits e.g. even when we want
> > to inject an unpoison event too?
> 
> unpoison_memory() should be not a problem since it will just bail out
> once THP is met as the comment says:
> 
> /*
> * unpoison_memory() can encounter thp only when the thp is being
> * worked by memory_failure() and the page lock is not held yet.
> * In such case, we yield to memory_failure() and make unpoison fail.
> */

But I still think setting the subpage-hwpoison bit hides too deep there, it'll
be great we can keep get_hwpoison_page() as simple as a safe version of getting
the refcount of the page we want.  Or we'd still better touch up the comment
above get_hwpoison_page() to show that side effect.

> 
> 
> And I think we should set the flag for soft offline too, right? The

I'm not familiar with either memory failure or soft offline, so far it looks
right to me.  However..

> soft offline does set the hwpoison flag for the corrupted sub page and
> doesn't split file THP,

.. I believe this will become not true after your patch 5, right?

> so it should be captured by page fault as well. And yes for poison injection.

One more thing: besides thp split and page free, do we need to conditionally
drop the HasHwpoisoned bit when received an unpoison event?

If my understanding is correct, we may need to scan all the subpages there, to
make sure HasHwpoisoned bit reflects the latest status for the thp in question.

> 
> But your comment reminds me that get_hwpoison_page() is just called
> when !MF_COUNT_INCREASED, so it means MADV_HWPOISON still could
> escape. This needs to be covered too.

Right, maybe that's also a clue that we shouldn't set the new page flag within
get_hwpoison_page(), since get_hwpoison_page() is actually well coupled with
MF_COUNT_INCREASED and all of them are only about refcounting of the pages.
Peter Xu Oct. 7, 2021, 4:14 p.m. UTC | #14
On Wed, Oct 06, 2021 at 04:41:35PM -0700, Yang Shi wrote:
> > Or maybe we just don't touch it until there's need for a functional change?  I
> > feel it a pity to lose the git blame info for reindent-only patches, but no
> > strong opinion, because I know many people don't think the same and I'm fine
> > with either ways.
> 
> TBH I really don't think keeping old "git blame" info should be an
> excuse to avoid any coding style cleanup.

Sure.

> 
> >
> > Another side note: perhaps a comment above pageflags enum on PG_has_hwpoisoned
> > would be nice?  I saw that we've got a bunch of those already.
> 
> I was thinking about that, but it seems PG_double_map doesn't have
> comment there either so I didn't add.

IMHO that means we may just need even more documentations? :)

I won't ask for documenting doublemap bit in this series, but I just don't
think it's a good excuse to not provide documentations if we still can.
Especially to me PageHasHwpoisoned looks really so like PageHwpoisoned, so
it'll be still very nice to have some good document along with the patch it's
introduced.

Thanks,
Yang Shi Oct. 7, 2021, 6:19 p.m. UTC | #15
On Thu, Oct 7, 2021 at 9:06 AM Peter Xu <peterx@redhat.com> wrote:
>
> On Wed, Oct 06, 2021 at 04:57:38PM -0700, Yang Shi wrote:
> > > For example, I see that both unpoison_memory() and soft_offline_page() will
> > > call it too, does it mean that we'll also set the bits e.g. even when we want
> > > to inject an unpoison event too?
> >
> > unpoison_memory() should be not a problem since it will just bail out
> > once THP is met as the comment says:
> >
> > /*
> > * unpoison_memory() can encounter thp only when the thp is being
> > * worked by memory_failure() and the page lock is not held yet.
> > * In such case, we yield to memory_failure() and make unpoison fail.
> > */
>
> But I still think setting the subpage-hwpoison bit hides too deep there, it'll
> be great we can keep get_hwpoison_page() as simple as a safe version of getting
> the refcount of the page we want.  Or we'd still better touch up the comment
> above get_hwpoison_page() to show that side effect.
>
> >
> >
> > And I think we should set the flag for soft offline too, right? The
>
> I'm not familiar with either memory failure or soft offline, so far it looks
> right to me.  However..
>
> > soft offline does set the hwpoison flag for the corrupted sub page and
> > doesn't split file THP,
>
> .. I believe this will become not true after your patch 5, right?

But THP split may fail, right?

>
> > so it should be captured by page fault as well. And yes for poison injection.
>
> One more thing: besides thp split and page free, do we need to conditionally
> drop the HasHwpoisoned bit when received an unpoison event?

It seems not to me, as the above comment from unpoison_memory() says
unpoison can encounter thp only when the thp is being worked by
memory_failure() and the page lock is not held yet. So it just bails
out.

In addition, unpoison just works for software injected errors, not
real hardware failure.

>
> If my understanding is correct, we may need to scan all the subpages there, to
> make sure HasHwpoisoned bit reflects the latest status for the thp in question.
>
> >
> > But your comment reminds me that get_hwpoison_page() is just called
> > when !MF_COUNT_INCREASED, so it means MADV_HWPOISON still could
> > escape. This needs to be covered too.
>
> Right, maybe that's also a clue that we shouldn't set the new page flag within
> get_hwpoison_page(), since get_hwpoison_page() is actually well coupled with
> MF_COUNT_INCREASED and all of them are only about refcounting of the pages.

Yeah, maybe, as long as there is not early bail out in some error
handling paths.

>
> --
> Peter Xu
>
Yang Shi Oct. 7, 2021, 6:28 p.m. UTC | #16
On Thu, Oct 7, 2021 at 9:14 AM Peter Xu <peterx@redhat.com> wrote:
>
> On Wed, Oct 06, 2021 at 04:41:35PM -0700, Yang Shi wrote:
> > > Or maybe we just don't touch it until there's need for a functional change?  I
> > > feel it a pity to lose the git blame info for reindent-only patches, but no
> > > strong opinion, because I know many people don't think the same and I'm fine
> > > with either ways.
> >
> > TBH I really don't think keeping old "git blame" info should be an
> > excuse to avoid any coding style cleanup.
>
> Sure.
>
> >
> > >
> > > Another side note: perhaps a comment above pageflags enum on PG_has_hwpoisoned
> > > would be nice?  I saw that we've got a bunch of those already.
> >
> > I was thinking about that, but it seems PG_double_map doesn't have
> > comment there either so I didn't add.
>
> IMHO that means we may just need even more documentations? :)
>
> I won't ask for documenting doublemap bit in this series, but I just don't
> think it's a good excuse to not provide documentations if we still can.
> Especially to me PageHasHwpoisoned looks really so like PageHwpoisoned, so
> it'll be still very nice to have some good document along with the patch it's
> introduced.

OK, I could add more comments for this flag in the enum. It should be
just a duplicate of the comment right before the PAGEFLAG definition.

>
> Thanks,
>
> --
> Peter Xu
>
Yang Shi Oct. 7, 2021, 8:27 p.m. UTC | #17
On Thu, Oct 7, 2021 at 11:19 AM Yang Shi <shy828301@gmail.com> wrote:
>
> On Thu, Oct 7, 2021 at 9:06 AM Peter Xu <peterx@redhat.com> wrote:
> >
> > On Wed, Oct 06, 2021 at 04:57:38PM -0700, Yang Shi wrote:
> > > > For example, I see that both unpoison_memory() and soft_offline_page() will
> > > > call it too, does it mean that we'll also set the bits e.g. even when we want
> > > > to inject an unpoison event too?
> > >
> > > unpoison_memory() should be not a problem since it will just bail out
> > > once THP is met as the comment says:
> > >
> > > /*
> > > * unpoison_memory() can encounter thp only when the thp is being
> > > * worked by memory_failure() and the page lock is not held yet.
> > > * In such case, we yield to memory_failure() and make unpoison fail.
> > > */
> >
> > But I still think setting the subpage-hwpoison bit hides too deep there, it'll
> > be great we can keep get_hwpoison_page() as simple as a safe version of getting
> > the refcount of the page we want.  Or we'd still better touch up the comment
> > above get_hwpoison_page() to show that side effect.
> >
> > >
> > >
> > > And I think we should set the flag for soft offline too, right? The
> >
> > I'm not familiar with either memory failure or soft offline, so far it looks
> > right to me.  However..
> >
> > > soft offline does set the hwpoison flag for the corrupted sub page and
> > > doesn't split file THP,
> >
> > .. I believe this will become not true after your patch 5, right?
>
> But THP split may fail, right?
>
> >
> > > so it should be captured by page fault as well. And yes for poison injection.
> >
> > One more thing: besides thp split and page free, do we need to conditionally
> > drop the HasHwpoisoned bit when received an unpoison event?
>
> It seems not to me, as the above comment from unpoison_memory() says
> unpoison can encounter thp only when the thp is being worked by
> memory_failure() and the page lock is not held yet. So it just bails
> out.
>
> In addition, unpoison just works for software injected errors, not
> real hardware failure.
>
> >
> > If my understanding is correct, we may need to scan all the subpages there, to
> > make sure HasHwpoisoned bit reflects the latest status for the thp in question.
> >
> > >
> > > But your comment reminds me that get_hwpoison_page() is just called
> > > when !MF_COUNT_INCREASED, so it means MADV_HWPOISON still could
> > > escape. This needs to be covered too.
> >
> > Right, maybe that's also a clue that we shouldn't set the new page flag within
> > get_hwpoison_page(), since get_hwpoison_page() is actually well coupled with
> > MF_COUNT_INCREASED and all of them are only about refcounting of the pages.
>
> Yeah, maybe, as long as there is not early bail out in some error
> handling paths.

It seems fine to move setting the flag out of get_hwpoison_page() to
right before splitting THP so that both MF_COUNT_INCREASED and
!MF_COUNT_INCREASED could be covered.


>
> >
> > --
> > Peter Xu
> >
Yang Shi Oct. 7, 2021, 9:28 p.m. UTC | #18
On Wed, Oct 6, 2021 at 4:57 PM Yang Shi <shy828301@gmail.com> wrote:
>
> On Wed, Oct 6, 2021 at 1:15 PM Peter Xu <peterx@redhat.com> wrote:
> >
> > On Thu, Sep 30, 2021 at 02:53:08PM -0700, Yang Shi wrote:
> > > @@ -1148,8 +1148,12 @@ static int __get_hwpoison_page(struct page *page)
> > >               return -EBUSY;
> > >
> > >       if (get_page_unless_zero(head)) {
> > > -             if (head == compound_head(page))
> > > +             if (head == compound_head(page)) {
> > > +                     if (PageTransHuge(head))
> > > +                             SetPageHasHWPoisoned(head);
> > > +
> > >                       return 1;
> > > +             }
> > >
> > >               pr_info("Memory failure: %#lx cannot catch tail\n",
> > >                       page_to_pfn(page));
> >
> > Sorry for the late comments.
> >
> > I'm wondering whether it's ideal to set this bit here, as get_hwpoison_page()
> > sounds like a pure helper to get a refcount out of a sane hwpoisoned page.  I'm
> > afraid there can be side effect that we set this without being noticed, so I'm
> > also wondering we should keep it in memory_failure().
> >
> > Quotting comments for get_hwpoison_page():
> >
> >  * get_hwpoison_page() takes a page refcount of an error page to handle memory
> >  * error on it, after checking that the error page is in a well-defined state
> >  * (defined as a page-type we can successfully handle the memor error on it,
> >  * such as LRU page and hugetlb page).
> >
> > For example, I see that both unpoison_memory() and soft_offline_page() will
> > call it too, does it mean that we'll also set the bits e.g. even when we want
> > to inject an unpoison event too?
>
> unpoison_memory() should be not a problem since it will just bail out
> once THP is met as the comment says:
>
> /*
> * unpoison_memory() can encounter thp only when the thp is being
> * worked by memory_failure() and the page lock is not held yet.
> * In such case, we yield to memory_failure() and make unpoison fail.
> */
>
>
> And I think we should set the flag for soft offline too, right? The
> soft offline does set the hwpoison flag for the corrupted sub page and
> doesn't split file THP, so it should be captured by page fault as
> well. And yes for poison injection.

Err... I must be blind. The soft offline does *NOT* set hwpoison flag
for any page. So your comment does stand. The flag should be set
outside get_hwpoison_page().

>
> But your comment reminds me that get_hwpoison_page() is just called
> when !MF_COUNT_INCREASED, so it means MADV_HWPOISON still could
> escape. This needs to be covered too.
>
> BTW, I did the test with MADV_HWPOISON, but I didn't test this change
> (moving flag set after get_page_unless_zero()) since I thought it was
> just a trivial change and did overlook this case.
>
> >
> > Thanks,
> >
> > --
> > Peter Xu
> >
Kirill A . Shutemov Oct. 8, 2021, 9:35 a.m. UTC | #19
On Wed, Oct 06, 2021 at 03:54:16PM -0400, Peter Xu wrote:
> On Mon, Oct 04, 2021 at 01:13:07PM -0700, Yang Shi wrote:
> > On Mon, Oct 4, 2021 at 12:41 PM Kirill A. Shutemov <kirill@shutemov.name> wrote:
> > >
> > > On Mon, Oct 04, 2021 at 11:17:29AM -0700, Yang Shi wrote:
> > > > On Mon, Oct 4, 2021 at 7:06 AM Kirill A. Shutemov <kirill@shutemov.name> wrote:
> > > > >
> > > > > On Thu, Sep 30, 2021 at 02:53:08PM -0700, Yang Shi wrote:
> > > > > > diff --git a/mm/filemap.c b/mm/filemap.c
> > > > > > index dae481293b5d..2acc2b977f66 100644
> > > > > > --- a/mm/filemap.c
> > > > > > +++ b/mm/filemap.c
> > > > > > @@ -3195,12 +3195,12 @@ static bool filemap_map_pmd(struct vm_fault *vmf, struct page *page)
> > > > > >       }
> > > > > >
> > > > > >       if (pmd_none(*vmf->pmd) && PageTransHuge(page)) {
> > > > > > -         vm_fault_t ret = do_set_pmd(vmf, page);
> > > > > > -         if (!ret) {
> > > > > > -                 /* The page is mapped successfully, reference consumed. */
> > > > > > -                 unlock_page(page);
> > > > > > -                 return true;
> > > > > > -         }
> > > > > > +             vm_fault_t ret = do_set_pmd(vmf, page);
> > > > > > +             if (!ret) {
> > > > > > +                     /* The page is mapped successfully, reference consumed. */
> > > > > > +                     unlock_page(page);
> > > > > > +                     return true;
> > > > > > +             }
> > > > > >       }
> > > > > >
> > > > > >       if (pmd_none(*vmf->pmd)) {
> > > > >
> > > > > Hm. Is it unrelated whitespace fix?
> > > >
> > > > It is a coding style clean up. I thought it may be overkilling to have
> > > > a separate patch. Do you prefer separate one?
> > >
> > > Maybe. I tried to find what changed here. It's confusing.
> > 
> > Yeah, maybe. Anyway I will separate the real big fix and the cleanup
> > into two patches. This may be helpful for backporting too.
> 
> Or maybe we just don't touch it until there's need for a functional change?  I
> feel it a pity to lose the git blame info for reindent-only patches,

JFYI, git blame -w ignores whitespace changes :P
Peter Xu Oct. 11, 2021, 10:57 p.m. UTC | #20
On Fri, Oct 08, 2021 at 12:35:29PM +0300, Kirill A. Shutemov wrote:
> JFYI, git blame -w ignores whitespace changes :P

Thanks, Kirill. :)

I must confess in reality I didn't encounter white-space changes a lot, but
mostly on moving the code around either e.g. by putting things into, or out of,
"if/for" blocks, or moving code between files.

I used git-blame a lot not to looking for people to blame but to dig history of
code changes, not sure about how others do that.  So maybe it's just that I
didn't do it right beforfe, and I'll be more than glad to learn if there's more
tricks like the "-w" one (which I don't know before..).
Peter Xu Oct. 12, 2021, 12:55 a.m. UTC | #21
On Thu, Oct 07, 2021 at 02:28:35PM -0700, Yang Shi wrote:
> On Wed, Oct 6, 2021 at 4:57 PM Yang Shi <shy828301@gmail.com> wrote:
> >
> > On Wed, Oct 6, 2021 at 1:15 PM Peter Xu <peterx@redhat.com> wrote:
> > >
> > > On Thu, Sep 30, 2021 at 02:53:08PM -0700, Yang Shi wrote:
> > > > @@ -1148,8 +1148,12 @@ static int __get_hwpoison_page(struct page *page)
> > > >               return -EBUSY;
> > > >
> > > >       if (get_page_unless_zero(head)) {
> > > > -             if (head == compound_head(page))
> > > > +             if (head == compound_head(page)) {
> > > > +                     if (PageTransHuge(head))
> > > > +                             SetPageHasHWPoisoned(head);
> > > > +
> > > >                       return 1;
> > > > +             }
> > > >
> > > >               pr_info("Memory failure: %#lx cannot catch tail\n",
> > > >                       page_to_pfn(page));
> > >
> > > Sorry for the late comments.
> > >
> > > I'm wondering whether it's ideal to set this bit here, as get_hwpoison_page()
> > > sounds like a pure helper to get a refcount out of a sane hwpoisoned page.  I'm
> > > afraid there can be side effect that we set this without being noticed, so I'm
> > > also wondering we should keep it in memory_failure().
> > >
> > > Quotting comments for get_hwpoison_page():
> > >
> > >  * get_hwpoison_page() takes a page refcount of an error page to handle memory
> > >  * error on it, after checking that the error page is in a well-defined state
> > >  * (defined as a page-type we can successfully handle the memor error on it,
> > >  * such as LRU page and hugetlb page).
> > >
> > > For example, I see that both unpoison_memory() and soft_offline_page() will
> > > call it too, does it mean that we'll also set the bits e.g. even when we want
> > > to inject an unpoison event too?
> >
> > unpoison_memory() should be not a problem since it will just bail out
> > once THP is met as the comment says:
> >
> > /*
> > * unpoison_memory() can encounter thp only when the thp is being
> > * worked by memory_failure() and the page lock is not held yet.
> > * In such case, we yield to memory_failure() and make unpoison fail.
> > */
> >
> >
> > And I think we should set the flag for soft offline too, right? The
> > soft offline does set the hwpoison flag for the corrupted sub page and
> > doesn't split file THP, so it should be captured by page fault as
> > well. And yes for poison injection.
> 
> Err... I must be blind. The soft offline does *NOT* set hwpoison flag
> for any page. So your comment does stand. The flag should be set
> outside get_hwpoison_page().

I saw that page_handle_poison() sets it, so perhaps we do need to take care of
soft offline?  Though I still think the extra bit should be set outside of the
get_hwpoison_page() function.

Another thing is I noticed soft_offline_in_use_page() will still ignore file
backed split.  I'm not sure whether it means we'd better also handle that case
as well, so shmem thp can be split there too?
Peter Xu Oct. 12, 2021, 1:44 a.m. UTC | #22
On Mon, Oct 11, 2021 at 08:55:26PM -0400, Peter Xu wrote:
> Another thing is I noticed soft_offline_in_use_page() will still ignore file
> backed split.  I'm not sure whether it means we'd better also handle that case
> as well, so shmem thp can be split there too?

Please ignore this paragraph - I somehow read "!PageHuge(page)" as
"PageAnon(page)"...  So I think patch 5 handles soft offline too.
Yang Shi Oct. 12, 2021, 6:02 p.m. UTC | #23
On Mon, Oct 11, 2021 at 6:44 PM Peter Xu <peterx@redhat.com> wrote:
>
> On Mon, Oct 11, 2021 at 08:55:26PM -0400, Peter Xu wrote:
> > Another thing is I noticed soft_offline_in_use_page() will still ignore file
> > backed split.  I'm not sure whether it means we'd better also handle that case
> > as well, so shmem thp can be split there too?
>
> Please ignore this paragraph - I somehow read "!PageHuge(page)" as
> "PageAnon(page)"...  So I think patch 5 handles soft offline too.

Yes, exactly. And even though the split is failed (or file THP didn't
get split before patch 5/5), soft offline would just return -EBUSY
instead of calling __soft_offline_page->page_handle_poison(). So
page_handle_poison() should not see THP at all.

>
> --
> Peter Xu
>
Peter Xu Oct. 12, 2021, 10:10 p.m. UTC | #24
On Tue, Oct 12, 2021 at 11:02:09AM -0700, Yang Shi wrote:
> On Mon, Oct 11, 2021 at 6:44 PM Peter Xu <peterx@redhat.com> wrote:
> >
> > On Mon, Oct 11, 2021 at 08:55:26PM -0400, Peter Xu wrote:
> > > Another thing is I noticed soft_offline_in_use_page() will still ignore file
> > > backed split.  I'm not sure whether it means we'd better also handle that case
> > > as well, so shmem thp can be split there too?
> >
> > Please ignore this paragraph - I somehow read "!PageHuge(page)" as
> > "PageAnon(page)"...  So I think patch 5 handles soft offline too.
> 
> Yes, exactly. And even though the split is failed (or file THP didn't
> get split before patch 5/5), soft offline would just return -EBUSY
> instead of calling __soft_offline_page->page_handle_poison(). So
> page_handle_poison() should not see THP at all.

I see, so I'm trying to summarize myself on what I see now with the new logic..

I think the offline code handles hwpoison differently as it sets PageHWPoison
at the end of the process, IOW if anything failed during the offline process
the hwpoison bit is not set.

That's different from how the memory failure path is handling this, as in that
case the hwpoison bit on the subpage is set firstly, e.g. before split thp.  I
believe that's also why memory failure requires the extra sub-page-hwpoison bit
while offline code shouldn't need to: because for soft offline split happens
before setting hwpoison so we just won't ever see a "poisoned file thp", while
for memory failure it could happen, and the sub-page-hwpoison will be a temp
bit anyway only exist for a very short period right after we set hwpoison on
the small page but before we split the thp.

Am I right above?

I feel like __soft_offline_page() still has some code that assumes "thp can be
there", e.g. iiuc after your change to allow file thp split, "hpage" will
always be the same as "page" then in that function, and isolate_page() does not
need to pass in a pagelist pointer too as it'll always be handling a small page
anyway.  But maybe they're fine to be there for now as they'll just work as
before, I think, so just raise it up.
Yang Shi Oct. 13, 2021, 2:48 a.m. UTC | #25
On Tue, Oct 12, 2021 at 3:10 PM Peter Xu <peterx@redhat.com> wrote:
>
> On Tue, Oct 12, 2021 at 11:02:09AM -0700, Yang Shi wrote:
> > On Mon, Oct 11, 2021 at 6:44 PM Peter Xu <peterx@redhat.com> wrote:
> > >
> > > On Mon, Oct 11, 2021 at 08:55:26PM -0400, Peter Xu wrote:
> > > > Another thing is I noticed soft_offline_in_use_page() will still ignore file
> > > > backed split.  I'm not sure whether it means we'd better also handle that case
> > > > as well, so shmem thp can be split there too?
> > >
> > > Please ignore this paragraph - I somehow read "!PageHuge(page)" as
> > > "PageAnon(page)"...  So I think patch 5 handles soft offline too.
> >
> > Yes, exactly. And even though the split is failed (or file THP didn't
> > get split before patch 5/5), soft offline would just return -EBUSY
> > instead of calling __soft_offline_page->page_handle_poison(). So
> > page_handle_poison() should not see THP at all.
>
> I see, so I'm trying to summarize myself on what I see now with the new logic..
>
> I think the offline code handles hwpoison differently as it sets PageHWPoison
> at the end of the process, IOW if anything failed during the offline process
> the hwpoison bit is not set.
>
> That's different from how the memory failure path is handling this, as in that
> case the hwpoison bit on the subpage is set firstly, e.g. before split thp.  I
> believe that's also why memory failure requires the extra sub-page-hwpoison bit
> while offline code shouldn't need to: because for soft offline split happens
> before setting hwpoison so we just won't ever see a "poisoned file thp", while
> for memory failure it could happen, and the sub-page-hwpoison will be a temp
> bit anyway only exist for a very short period right after we set hwpoison on
> the small page but before we split the thp.
>
> Am I right above?

Yeah, you are right. I noticed this too, only successfully migrated
page is marked as hwpoison. But TBH I'm not sure why it does this way.
Naoya may know. Anyway, THP doesn't get migrated if it can't be split,
so PageHasHWPoisoned doesn't apply, right?

>
> I feel like __soft_offline_page() still has some code that assumes "thp can be
> there", e.g. iiuc after your change to allow file thp split, "hpage" will
> always be the same as "page" then in that function, and isolate_page() does not
> need to pass in a pagelist pointer too as it'll always be handling a small page
> anyway.  But maybe they're fine to be there for now as they'll just work as
> before, I think, so just raise it up.

That compound_head() call seems to be for hugetlb since isolating
hugetlb needs to pass in the head page IIUC. For the pagelist, I think
it is just because migrate_pages() requires a list as the second
parameter.

>
> --
> Peter Xu
>
Peter Xu Oct. 13, 2021, 3:01 a.m. UTC | #26
On Tue, Oct 12, 2021 at 07:48:39PM -0700, Yang Shi wrote:
> On Tue, Oct 12, 2021 at 3:10 PM Peter Xu <peterx@redhat.com> wrote:
> >
> > On Tue, Oct 12, 2021 at 11:02:09AM -0700, Yang Shi wrote:
> > > On Mon, Oct 11, 2021 at 6:44 PM Peter Xu <peterx@redhat.com> wrote:
> > > >
> > > > On Mon, Oct 11, 2021 at 08:55:26PM -0400, Peter Xu wrote:
> > > > > Another thing is I noticed soft_offline_in_use_page() will still ignore file
> > > > > backed split.  I'm not sure whether it means we'd better also handle that case
> > > > > as well, so shmem thp can be split there too?
> > > >
> > > > Please ignore this paragraph - I somehow read "!PageHuge(page)" as
> > > > "PageAnon(page)"...  So I think patch 5 handles soft offline too.
> > >
> > > Yes, exactly. And even though the split is failed (or file THP didn't
> > > get split before patch 5/5), soft offline would just return -EBUSY
> > > instead of calling __soft_offline_page->page_handle_poison(). So
> > > page_handle_poison() should not see THP at all.
> >
> > I see, so I'm trying to summarize myself on what I see now with the new logic..
> >
> > I think the offline code handles hwpoison differently as it sets PageHWPoison
> > at the end of the process, IOW if anything failed during the offline process
> > the hwpoison bit is not set.
> >
> > That's different from how the memory failure path is handling this, as in that
> > case the hwpoison bit on the subpage is set firstly, e.g. before split thp.  I
> > believe that's also why memory failure requires the extra sub-page-hwpoison bit
> > while offline code shouldn't need to: because for soft offline split happens
> > before setting hwpoison so we just won't ever see a "poisoned file thp", while
> > for memory failure it could happen, and the sub-page-hwpoison will be a temp
> > bit anyway only exist for a very short period right after we set hwpoison on
> > the small page but before we split the thp.
> >
> > Am I right above?
> 
> Yeah, you are right. I noticed this too, only successfully migrated
> page is marked as hwpoison. But TBH I'm not sure why it does this way.

My wild guess is that unlike memory failures, soft offline is best-effort. Say,
the data on the page is still consistent, so even if offline failed for some
reason we shouldn't stop the program from execution.  That's not true for
memory failures via MCEs, afaict, as the execution could read/write wrong data
and that'll be a serious mistake, so we set hwpoison 1st there first before
doing anything else, making sure "this page is broken" message delivered and
user app won't run with risk.

But yeah it'll be great if Naoya could help confirm that.

> Naoya may know. Anyway, THP doesn't get migrated if it can't be split,
> so PageHasHWPoisoned doesn't apply, right?

Right, that matches my current understanding of the code, so the extra bit is
perhaps not useful for soft offline case.

But this also reminded me that shouldn't we be with the page lock already
during the process of "setting hwpoison-subpage bit, split thp, clear
hwpoison-subpage bit"?  If it's only the small window that needs protection,
while when looking up the shmem pagecache we always need to take the page lock
too, then it seems already safe even without the extra bit?  Hmm?

> 
> >
> > I feel like __soft_offline_page() still has some code that assumes "thp can be
> > there", e.g. iiuc after your change to allow file thp split, "hpage" will
> > always be the same as "page" then in that function, and isolate_page() does not
> > need to pass in a pagelist pointer too as it'll always be handling a small page
> > anyway.  But maybe they're fine to be there for now as they'll just work as
> > before, I think, so just raise it up.
> 
> That compound_head() call seems to be for hugetlb since isolating
> hugetlb needs to pass in the head page IIUC. For the pagelist, I think
> it is just because migrate_pages() requires a list as the second
> parameter.

Fair enough.

Thanks,
Yang Shi Oct. 13, 2021, 3:27 a.m. UTC | #27
On Tue, Oct 12, 2021 at 8:01 PM Peter Xu <peterx@redhat.com> wrote:
>
> On Tue, Oct 12, 2021 at 07:48:39PM -0700, Yang Shi wrote:
> > On Tue, Oct 12, 2021 at 3:10 PM Peter Xu <peterx@redhat.com> wrote:
> > >
> > > On Tue, Oct 12, 2021 at 11:02:09AM -0700, Yang Shi wrote:
> > > > On Mon, Oct 11, 2021 at 6:44 PM Peter Xu <peterx@redhat.com> wrote:
> > > > >
> > > > > On Mon, Oct 11, 2021 at 08:55:26PM -0400, Peter Xu wrote:
> > > > > > Another thing is I noticed soft_offline_in_use_page() will still ignore file
> > > > > > backed split.  I'm not sure whether it means we'd better also handle that case
> > > > > > as well, so shmem thp can be split there too?
> > > > >
> > > > > Please ignore this paragraph - I somehow read "!PageHuge(page)" as
> > > > > "PageAnon(page)"...  So I think patch 5 handles soft offline too.
> > > >
> > > > Yes, exactly. And even though the split is failed (or file THP didn't
> > > > get split before patch 5/5), soft offline would just return -EBUSY
> > > > instead of calling __soft_offline_page->page_handle_poison(). So
> > > > page_handle_poison() should not see THP at all.
> > >
> > > I see, so I'm trying to summarize myself on what I see now with the new logic..
> > >
> > > I think the offline code handles hwpoison differently as it sets PageHWPoison
> > > at the end of the process, IOW if anything failed during the offline process
> > > the hwpoison bit is not set.
> > >
> > > That's different from how the memory failure path is handling this, as in that
> > > case the hwpoison bit on the subpage is set firstly, e.g. before split thp.  I
> > > believe that's also why memory failure requires the extra sub-page-hwpoison bit
> > > while offline code shouldn't need to: because for soft offline split happens
> > > before setting hwpoison so we just won't ever see a "poisoned file thp", while
> > > for memory failure it could happen, and the sub-page-hwpoison will be a temp
> > > bit anyway only exist for a very short period right after we set hwpoison on
> > > the small page but before we split the thp.
> > >
> > > Am I right above?
> >
> > Yeah, you are right. I noticed this too, only successfully migrated
> > page is marked as hwpoison. But TBH I'm not sure why it does this way.
>
> My wild guess is that unlike memory failures, soft offline is best-effort. Say,
> the data on the page is still consistent, so even if offline failed for some
> reason we shouldn't stop the program from execution.  That's not true for
> memory failures via MCEs, afaict, as the execution could read/write wrong data
> and that'll be a serious mistake, so we set hwpoison 1st there first before
> doing anything else, making sure "this page is broken" message delivered and
> user app won't run with risk.

Makes sense to me.

>
> But yeah it'll be great if Naoya could help confirm that.
>
> > Naoya may know. Anyway, THP doesn't get migrated if it can't be split,
> > so PageHasHWPoisoned doesn't apply, right?
>
> Right, that matches my current understanding of the code, so the extra bit is
> perhaps not useful for soft offline case.
>
> But this also reminded me that shouldn't we be with the page lock already
> during the process of "setting hwpoison-subpage bit, split thp, clear
> hwpoison-subpage bit"?  If it's only the small window that needs protection,
> while when looking up the shmem pagecache we always need to take the page lock
> too, then it seems already safe even without the extra bit?  Hmm?

I don't quite get your point. Do you mean memory_failure()? If so the
answer is no, outside the page lock. And the window may be indefinite
since file THP doesn't get split before this series and the split may
fail even after this series.

>
> >
> > >
> > > I feel like __soft_offline_page() still has some code that assumes "thp can be
> > > there", e.g. iiuc after your change to allow file thp split, "hpage" will
> > > always be the same as "page" then in that function, and isolate_page() does not
> > > need to pass in a pagelist pointer too as it'll always be handling a small page
> > > anyway.  But maybe they're fine to be there for now as they'll just work as
> > > before, I think, so just raise it up.
> >
> > That compound_head() call seems to be for hugetlb since isolating
> > hugetlb needs to pass in the head page IIUC. For the pagelist, I think
> > it is just because migrate_pages() requires a list as the second
> > parameter.
>
> Fair enough.
>
> Thanks,
>
> --
> Peter Xu
>
Peter Xu Oct. 13, 2021, 3:41 a.m. UTC | #28
On Tue, Oct 12, 2021 at 08:27:06PM -0700, Yang Shi wrote:
> > But this also reminded me that shouldn't we be with the page lock already
> > during the process of "setting hwpoison-subpage bit, split thp, clear
> > hwpoison-subpage bit"?  If it's only the small window that needs protection,
> > while when looking up the shmem pagecache we always need to take the page lock
> > too, then it seems already safe even without the extra bit?  Hmm?
> 
> I don't quite get your point. Do you mean memory_failure()? If so the
> answer is no, outside the page lock. And the window may be indefinite
> since file THP doesn't get split before this series and the split may
> fail even after this series.

What I meant is that we could extend the page_lock in try_to_split_thp_page()
to cover setting hwpoison-subpage too (and it of course covers the clearing if
thp split succeeded, as that's part of the split process).  But yeah it's a
good point that the split may fail, so the extra bit seems still necessary.

Maybe that'll be something worth mentioning in the commit message too?  The
commit message described very well on the overhead of looping over 512 pages,
however the reader can easily overlook the real reason for needing this bit -
IMHO it's really for the thp split failure case, as we could also mention that
if thp split won't fail, page lock should be suffice (imho).  We could also
mention about why soft offline does not need that extra bit, which seems not
obvious as well, so imho good material for commit messages.

Sorry to have asked for a lot of commit message changes; I hope they make sense.

Thanks,
Yang Shi Oct. 13, 2021, 9:42 p.m. UTC | #29
On Tue, Oct 12, 2021 at 8:41 PM Peter Xu <peterx@redhat.com> wrote:
>
> On Tue, Oct 12, 2021 at 08:27:06PM -0700, Yang Shi wrote:
> > > But this also reminded me that shouldn't we be with the page lock already
> > > during the process of "setting hwpoison-subpage bit, split thp, clear
> > > hwpoison-subpage bit"?  If it's only the small window that needs protection,
> > > while when looking up the shmem pagecache we always need to take the page lock
> > > too, then it seems already safe even without the extra bit?  Hmm?
> >
> > I don't quite get your point. Do you mean memory_failure()? If so the
> > answer is no, outside the page lock. And the window may be indefinite
> > since file THP doesn't get split before this series and the split may
> > fail even after this series.
>
> What I meant is that we could extend the page_lock in try_to_split_thp_page()
> to cover setting hwpoison-subpage too (and it of course covers the clearing if
> thp split succeeded, as that's part of the split process).  But yeah it's a
> good point that the split may fail, so the extra bit seems still necessary.
>
> Maybe that'll be something worth mentioning in the commit message too?  The
> commit message described very well on the overhead of looping over 512 pages,
> however the reader can easily overlook the real reason for needing this bit -
> IMHO it's really for the thp split failure case, as we could also mention that
> if thp split won't fail, page lock should be suffice (imho).  We could also

Not only for THP split failure case. Before this series, shmem THP
does't get split at all. And this commit is supposed to be backported
to the older versions, so saying "page lock is sufficient" is not
precise and confusing.

> mention about why soft offline does not need that extra bit, which seems not
> obvious as well, so imho good material for commit messages.

It would be nice to mention soft offline case.

>
> Sorry to have asked for a lot of commit message changes; I hope they make sense.

Thanks a lot for all the great suggestions.

>
> Thanks,
>
> --
> Peter Xu
>
Peter Xu Oct. 13, 2021, 11:13 p.m. UTC | #30
On Wed, Oct 13, 2021 at 02:42:42PM -0700, Yang Shi wrote:
> On Tue, Oct 12, 2021 at 8:41 PM Peter Xu <peterx@redhat.com> wrote:
> >
> > On Tue, Oct 12, 2021 at 08:27:06PM -0700, Yang Shi wrote:
> > > > But this also reminded me that shouldn't we be with the page lock already
> > > > during the process of "setting hwpoison-subpage bit, split thp, clear
> > > > hwpoison-subpage bit"?  If it's only the small window that needs protection,
> > > > while when looking up the shmem pagecache we always need to take the page lock
> > > > too, then it seems already safe even without the extra bit?  Hmm?
> > >
> > > I don't quite get your point. Do you mean memory_failure()? If so the
> > > answer is no, outside the page lock. And the window may be indefinite
> > > since file THP doesn't get split before this series and the split may
> > > fail even after this series.
> >
> > What I meant is that we could extend the page_lock in try_to_split_thp_page()
> > to cover setting hwpoison-subpage too (and it of course covers the clearing if
> > thp split succeeded, as that's part of the split process).  But yeah it's a
> > good point that the split may fail, so the extra bit seems still necessary.
> >
> > Maybe that'll be something worth mentioning in the commit message too?  The
> > commit message described very well on the overhead of looping over 512 pages,
> > however the reader can easily overlook the real reason for needing this bit -
> > IMHO it's really for the thp split failure case, as we could also mention that
> > if thp split won't fail, page lock should be suffice (imho).  We could also
> 
> Not only for THP split failure case. Before this series, shmem THP
> does't get split at all. And this commit is supposed to be backported
> to the older versions, so saying "page lock is sufficient" is not
> precise and confusing.

Sure, please feel free to use any wording you prefer as long as the other side
of the lock besides the performance impact could be mentioned.  Thanks,
Naoya Horiguchi Oct. 14, 2021, 6:54 a.m. UTC | #31
On Wed, Oct 13, 2021 at 11:01:33AM +0800, Peter Xu wrote:
> On Tue, Oct 12, 2021 at 07:48:39PM -0700, Yang Shi wrote:
> > On Tue, Oct 12, 2021 at 3:10 PM Peter Xu <peterx@redhat.com> wrote:
> > >
> > > On Tue, Oct 12, 2021 at 11:02:09AM -0700, Yang Shi wrote:
> > > > On Mon, Oct 11, 2021 at 6:44 PM Peter Xu <peterx@redhat.com> wrote:
> > > > >
> > > > > On Mon, Oct 11, 2021 at 08:55:26PM -0400, Peter Xu wrote:
> > > > > > Another thing is I noticed soft_offline_in_use_page() will still ignore file
> > > > > > backed split.  I'm not sure whether it means we'd better also handle that case
> > > > > > as well, so shmem thp can be split there too?
> > > > >
> > > > > Please ignore this paragraph - I somehow read "!PageHuge(page)" as
> > > > > "PageAnon(page)"...  So I think patch 5 handles soft offline too.
> > > >
> > > > Yes, exactly. And even though the split is failed (or file THP didn't
> > > > get split before patch 5/5), soft offline would just return -EBUSY
> > > > instead of calling __soft_offline_page->page_handle_poison(). So
> > > > page_handle_poison() should not see THP at all.
> > >
> > > I see, so I'm trying to summarize myself on what I see now with the new logic..
> > >
> > > I think the offline code handles hwpoison differently as it sets PageHWPoison
> > > at the end of the process, IOW if anything failed during the offline process
> > > the hwpoison bit is not set.
> > >
> > > That's different from how the memory failure path is handling this, as in that
> > > case the hwpoison bit on the subpage is set firstly, e.g. before split thp.  I
> > > believe that's also why memory failure requires the extra sub-page-hwpoison bit
> > > while offline code shouldn't need to: because for soft offline split happens
> > > before setting hwpoison so we just won't ever see a "poisoned file thp", while
> > > for memory failure it could happen, and the sub-page-hwpoison will be a temp
> > > bit anyway only exist for a very short period right after we set hwpoison on
> > > the small page but before we split the thp.
> > >
> > > Am I right above?
> > 
> > Yeah, you are right. I noticed this too, only successfully migrated
> > page is marked as hwpoison. But TBH I'm not sure why it does this way.
> 
> My wild guess is that unlike memory failures, soft offline is best-effort. Say,
> the data on the page is still consistent, so even if offline failed for some
> reason we shouldn't stop the program from execution.  That's not true for
> memory failures via MCEs, afaict, as the execution could read/write wrong data
> and that'll be a serious mistake, so we set hwpoison 1st there first before
> doing anything else, making sure "this page is broken" message delivered and
> user app won't run with risk.
> 
> But yeah it'll be great if Naoya could help confirm that.

Yes, these descriptions are totally correct, how PG_hwpoison flag is set
is different between hwpoison/soft-offline mechanisms from the beginning.

Thanks,
Naoya Horiguchi
Naresh Kamboju Nov. 1, 2021, 7:05 p.m. UTC | #32
Hi Yang,

On Fri, 1 Oct 2021 at 03:23, Yang Shi <shy828301@gmail.com> wrote:
>
> When handling shmem page fault the THP with corrupted subpage could be PMD
> mapped if certain conditions are satisfied.  But kernel is supposed to
> send SIGBUS when trying to map hwpoisoned page.
>
> There are two paths which may do PMD map: fault around and regular fault.
>
> Before commit f9ce0be71d1f ("mm: Cleanup faultaround and finish_fault() codepaths")
> the thing was even worse in fault around path.  The THP could be PMD mapped as
> long as the VMA fits regardless what subpage is accessed and corrupted.  After
> this commit as long as head page is not corrupted the THP could be PMD mapped.
>
> In the regular fault path the THP could be PMD mapped as long as the corrupted
> page is not accessed and the VMA fits.
>
> This loophole could be fixed by iterating every subpage to check if any
> of them is hwpoisoned or not, but it is somewhat costly in page fault path.
>
> So introduce a new page flag called HasHWPoisoned on the first tail page.  It
> indicates the THP has hwpoisoned subpage(s).  It is set if any subpage of THP
> is found hwpoisoned by memory failure and cleared when the THP is freed or
> split.
>
> Fixes: 800d8c63b2e9 ("shmem: add huge pages support")
> Cc: <stable@vger.kernel.org>
> Suggested-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Signed-off-by: Yang Shi <shy828301@gmail.com>
> ---
>  include/linux/page-flags.h | 19 +++++++++++++++++++
>  mm/filemap.c               | 12 ++++++------
>  mm/huge_memory.c           |  2 ++
>  mm/memory-failure.c        |  6 +++++-
>  mm/memory.c                |  9 +++++++++
>  mm/page_alloc.c            |  4 +++-
>  6 files changed, 44 insertions(+), 8 deletions(-)

When CONFIG_MEMORY_FAILURE not set
we get these build failures.

Regression found on x86_64 and i386 gcc-11 builds
Following build warnings / errors reported on Linux mainline master.

metadata:
    git_describe: v5.15-559-g19901165d90f
    git_repo: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
    git_short_log: 19901165d90f (\"Merge tag
'for-5.16/inode-sync-2021-10-29' of git://git.kernel.dk/linux-block\")
    target_arch: x86_64
    toolchain: gcc-11


In file included from include/linux/mmzone.h:22,
                 from include/linux/gfp.h:6,
                 from include/linux/slab.h:15,
                 from include/linux/crypto.h:20,
                 from arch/x86/kernel/asm-offsets.c:9:
include/linux/page-flags.h:806:29: error: macro "PAGEFLAG_FALSE"
requires 2 arguments, but only 1 given
  806 | PAGEFLAG_FALSE(HasHWPoisoned)
      |                             ^
include/linux/page-flags.h:411: note: macro "PAGEFLAG_FALSE" defined here
  411 | #define PAGEFLAG_FALSE(uname, lname) TESTPAGEFLAG_FALSE(uname,
lname)   \
      |
include/linux/page-flags.h:807:39: error: macro "TESTSCFLAG_FALSE"
requires 2 arguments, but only 1 given
  807 |         TESTSCFLAG_FALSE(HasHWPoisoned)
      |                                       ^
include/linux/page-flags.h:414: note: macro "TESTSCFLAG_FALSE" defined here
  414 | #define TESTSCFLAG_FALSE(uname, lname)
         \
      |
include/linux/page-flags.h:806:1: error: unknown type name 'PAGEFLAG_FALSE'
  806 | PAGEFLAG_FALSE(HasHWPoisoned)
      | ^~~~~~~~~~~~~~
include/linux/page-flags.h:807:25: error: expected ';' before 'static'
  807 |         TESTSCFLAG_FALSE(HasHWPoisoned)
      |                         ^
      |                         ;
......
  815 | static inline bool is_page_hwpoison(struct page *page)
      | ~~~~~~
make[2]: *** [scripts/Makefile.build:121: arch/x86/kernel/asm-offsets.s] Error 1

Reported-by: Linux Kernel Functional Testing <lkft@linaro.org>

build link:
-----------
https://builds.tuxbuild.com/20KPBpXK6K0bKSIKAIKfwlBq7O4/build.log

build config:
-------------
https://builds.tuxbuild.com/20KPBpXK6K0bKSIKAIKfwlBq7O4/config

# To install tuxmake on your system globally
# sudo pip3 install -U tuxmake

tuxmake --runtime podman --target-arch x86_64 --toolchain gcc-11
--kconfig defconfig --kconfig-add
https://builds.tuxbuild.com/20KPBpXK6K0bKSIKAIKfwlBq7O4/config

link:
https://builds.tuxbuild.com/20KPBpXK6K0bKSIKAIKfwlBq7O4/tuxmake_reproducer.sh

--
Linaro LKFT
https://lkft.linaro.org
Yang Shi Nov. 1, 2021, 7:26 p.m. UTC | #33
On Mon, Nov 1, 2021 at 12:05 PM Naresh Kamboju
<naresh.kamboju@linaro.org> wrote:
>
> Hi Yang,
>
> On Fri, 1 Oct 2021 at 03:23, Yang Shi <shy828301@gmail.com> wrote:
> >
> > When handling shmem page fault the THP with corrupted subpage could be PMD
> > mapped if certain conditions are satisfied.  But kernel is supposed to
> > send SIGBUS when trying to map hwpoisoned page.
> >
> > There are two paths which may do PMD map: fault around and regular fault.
> >
> > Before commit f9ce0be71d1f ("mm: Cleanup faultaround and finish_fault() codepaths")
> > the thing was even worse in fault around path.  The THP could be PMD mapped as
> > long as the VMA fits regardless what subpage is accessed and corrupted.  After
> > this commit as long as head page is not corrupted the THP could be PMD mapped.
> >
> > In the regular fault path the THP could be PMD mapped as long as the corrupted
> > page is not accessed and the VMA fits.
> >
> > This loophole could be fixed by iterating every subpage to check if any
> > of them is hwpoisoned or not, but it is somewhat costly in page fault path.
> >
> > So introduce a new page flag called HasHWPoisoned on the first tail page.  It
> > indicates the THP has hwpoisoned subpage(s).  It is set if any subpage of THP
> > is found hwpoisoned by memory failure and cleared when the THP is freed or
> > split.
> >
> > Fixes: 800d8c63b2e9 ("shmem: add huge pages support")
> > Cc: <stable@vger.kernel.org>
> > Suggested-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> > Signed-off-by: Yang Shi <shy828301@gmail.com>
> > ---
> >  include/linux/page-flags.h | 19 +++++++++++++++++++
> >  mm/filemap.c               | 12 ++++++------
> >  mm/huge_memory.c           |  2 ++
> >  mm/memory-failure.c        |  6 +++++-
> >  mm/memory.c                |  9 +++++++++
> >  mm/page_alloc.c            |  4 +++-
> >  6 files changed, 44 insertions(+), 8 deletions(-)
>
> When CONFIG_MEMORY_FAILURE not set
> we get these build failures.

Thanks for catching this. It is because Willy's page folio series
changed the definition of PAGEFLAG_FALSE macro. But patch was new in
5.15-rc7, so his series doesn't cover this.

The below patch should be able to fix it:

diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index d8623d6e1141..981341a3c3c4 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -803,8 +803,8 @@ PAGEFLAG_FALSE(DoubleMap, double_map)
 PAGEFLAG(HasHWPoisoned, has_hwpoisoned, PF_SECOND)
        TESTSCFLAG(HasHWPoisoned, has_hwpoisoned, PF_SECOND)
 #else
-PAGEFLAG_FALSE(HasHWPoisoned)
-       TESTSCFLAG_FALSE(HasHWPoisoned)
+PAGEFLAG_FALSE(HasHWPoisoned, has_hwpoisoned)
+       TESTSCFLAG_FALSE(HasHWPoisoned, has_hwpoisoned)
 #endif

 /*

I will prepare a formal patch for 5.16.

>
> Regression found on x86_64 and i386 gcc-11 builds
> Following build warnings / errors reported on Linux mainline master.
>
> metadata:
>     git_describe: v5.15-559-g19901165d90f
>     git_repo: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
>     git_short_log: 19901165d90f (\"Merge tag
> 'for-5.16/inode-sync-2021-10-29' of git://git.kernel.dk/linux-block\")
>     target_arch: x86_64
>     toolchain: gcc-11
>
>
> In file included from include/linux/mmzone.h:22,
>                  from include/linux/gfp.h:6,
>                  from include/linux/slab.h:15,
>                  from include/linux/crypto.h:20,
>                  from arch/x86/kernel/asm-offsets.c:9:
> include/linux/page-flags.h:806:29: error: macro "PAGEFLAG_FALSE"
> requires 2 arguments, but only 1 given
>   806 | PAGEFLAG_FALSE(HasHWPoisoned)
>       |                             ^
> include/linux/page-flags.h:411: note: macro "PAGEFLAG_FALSE" defined here
>   411 | #define PAGEFLAG_FALSE(uname, lname) TESTPAGEFLAG_FALSE(uname,
> lname)   \
>       |
> include/linux/page-flags.h:807:39: error: macro "TESTSCFLAG_FALSE"
> requires 2 arguments, but only 1 given
>   807 |         TESTSCFLAG_FALSE(HasHWPoisoned)
>       |                                       ^
> include/linux/page-flags.h:414: note: macro "TESTSCFLAG_FALSE" defined here
>   414 | #define TESTSCFLAG_FALSE(uname, lname)
>          \
>       |
> include/linux/page-flags.h:806:1: error: unknown type name 'PAGEFLAG_FALSE'
>   806 | PAGEFLAG_FALSE(HasHWPoisoned)
>       | ^~~~~~~~~~~~~~
> include/linux/page-flags.h:807:25: error: expected ';' before 'static'
>   807 |         TESTSCFLAG_FALSE(HasHWPoisoned)
>       |                         ^
>       |                         ;
> ......
>   815 | static inline bool is_page_hwpoison(struct page *page)
>       | ~~~~~~
> make[2]: *** [scripts/Makefile.build:121: arch/x86/kernel/asm-offsets.s] Error 1
>
> Reported-by: Linux Kernel Functional Testing <lkft@linaro.org>
>
> build link:
> -----------
> https://builds.tuxbuild.com/20KPBpXK6K0bKSIKAIKfwlBq7O4/build.log
>
> build config:
> -------------
> https://builds.tuxbuild.com/20KPBpXK6K0bKSIKAIKfwlBq7O4/config
>
> # To install tuxmake on your system globally
> # sudo pip3 install -U tuxmake
>
> tuxmake --runtime podman --target-arch x86_64 --toolchain gcc-11
> --kconfig defconfig --kconfig-add
> https://builds.tuxbuild.com/20KPBpXK6K0bKSIKAIKfwlBq7O4/config
>
> link:
> https://builds.tuxbuild.com/20KPBpXK6K0bKSIKAIKfwlBq7O4/tuxmake_reproducer.sh
>
> --
> Linaro LKFT
> https://lkft.linaro.org
diff mbox series

Patch

diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index a558d67ee86f..a357b41b3057 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -171,6 +171,11 @@  enum pageflags {
 	/* Compound pages. Stored in first tail page's flags */
 	PG_double_map = PG_workingset,
 
+#ifdef CONFIG_MEMORY_FAILURE
+	/* Compound pages. Stored in first tail page's flags */
+	PG_has_hwpoisoned = PG_mappedtodisk,
+#endif
+
 	/* non-lru isolated movable page */
 	PG_isolated = PG_reclaim,
 
@@ -668,6 +673,20 @@  PAGEFLAG_FALSE(DoubleMap)
 	TESTSCFLAG_FALSE(DoubleMap)
 #endif
 
+#if defined(CONFIG_MEMORY_FAILURE) && defined(CONFIG_TRANSPARENT_HUGEPAGE)
+/*
+ * PageHasPoisoned indicates that at least on subpage is hwpoisoned in the
+ * compound page.
+ *
+ * This flag is set by hwpoison handler.  Cleared by THP split or free page.
+ */
+PAGEFLAG(HasHWPoisoned, has_hwpoisoned, PF_SECOND)
+	TESTSCFLAG(HasHWPoisoned, has_hwpoisoned, PF_SECOND)
+#else
+PAGEFLAG_FALSE(HasHWPoisoned)
+	TESTSCFLAG_FALSE(HasHWPoisoned)
+#endif
+
 /*
  * Check if a page is currently marked HWPoisoned. Note that this check is
  * best effort only and inherently racy: there is no way to synchronize with
diff --git a/mm/filemap.c b/mm/filemap.c
index dae481293b5d..2acc2b977f66 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -3195,12 +3195,12 @@  static bool filemap_map_pmd(struct vm_fault *vmf, struct page *page)
 	}
 
 	if (pmd_none(*vmf->pmd) && PageTransHuge(page)) {
-	    vm_fault_t ret = do_set_pmd(vmf, page);
-	    if (!ret) {
-		    /* The page is mapped successfully, reference consumed. */
-		    unlock_page(page);
-		    return true;
-	    }
+		vm_fault_t ret = do_set_pmd(vmf, page);
+		if (!ret) {
+			/* The page is mapped successfully, reference consumed. */
+			unlock_page(page);
+			return true;
+		}
 	}
 
 	if (pmd_none(*vmf->pmd)) {
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 5e9ef0fc261e..0574b1613714 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2426,6 +2426,8 @@  static void __split_huge_page(struct page *page, struct list_head *list,
 	/* lock lru list/PageCompound, ref frozen by page_ref_freeze */
 	lruvec = lock_page_lruvec(head);
 
+	ClearPageHasHWPoisoned(head);
+
 	for (i = nr - 1; i >= 1; i--) {
 		__split_huge_page_tail(head, i, lruvec, list);
 		/* Some pages can be beyond EOF: drop them from page cache */
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index ed28eba50f98..a79a38374a14 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1148,8 +1148,12 @@  static int __get_hwpoison_page(struct page *page)
 		return -EBUSY;
 
 	if (get_page_unless_zero(head)) {
-		if (head == compound_head(page))
+		if (head == compound_head(page)) {
+			if (PageTransHuge(head))
+				SetPageHasHWPoisoned(head);
+
 			return 1;
+		}
 
 		pr_info("Memory failure: %#lx cannot catch tail\n",
 			page_to_pfn(page));
diff --git a/mm/memory.c b/mm/memory.c
index adf9b9ef8277..c52be6d6b605 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3906,6 +3906,15 @@  vm_fault_t do_set_pmd(struct vm_fault *vmf, struct page *page)
 	if (compound_order(page) != HPAGE_PMD_ORDER)
 		return ret;
 
+	/*
+	 * Just backoff if any subpage of a THP is corrupted otherwise
+	 * the corrupted page may mapped by PMD silently to escape the
+	 * check.  This kind of THP just can be PTE mapped.  Access to
+	 * the corrupted subpage should trigger SIGBUS as expected.
+	 */
+	if (unlikely(PageHasHWPoisoned(page)))
+		return ret;
+
 	/*
 	 * Archs like ppc64 need additional space to store information
 	 * related to pte entry. Use the preallocated table for that.
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index b37435c274cf..7f37652f0287 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1312,8 +1312,10 @@  static __always_inline bool free_pages_prepare(struct page *page,
 
 		VM_BUG_ON_PAGE(compound && compound_order(page) != order, page);
 
-		if (compound)
+		if (compound) {
 			ClearPageDoubleMap(page);
+			ClearPageHasHWPoisoned(page);
+		}
 		for (i = 1; i < (1 << order); i++) {
 			if (compound)
 				bad += free_tail_pages_check(page, page + i);