
mm: hwpoison: deal with page cache THP

Message ID 20210824221322.7663-1-shy828301@gmail.com (mailing list archive)
State New
Series mm: hwpoison: deal with page cache THP

Commit Message

Yang Shi Aug. 24, 2021, 10:13 p.m. UTC
Currently hwpoison doesn't handle page cache THP; it just gives up and
returns an error.  This is simply because hwpoison THP support was added
before page cache THP was supported.

Handling page cache THP is simple: they can be offlined by splitting the
THP, just like anonymous THP.

The question is how to distinguish them from THPs that are being allocated
or freed, which hwpoison can't handle properly.  page->mapping seems to be
a good indicator: both anonymous pages and file pages have it populated,
but it won't be populated until the page is added to the rmap or the page
cache, in other words, instantiated.  If page->mapping is populated, the
page is definitely not being allocated or freed.

The later get_page_unless_zero() serializes against the page free paths.

Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Hugh Dickins <hughd@google.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Signed-off-by: Yang Shi <shy828301@gmail.com>
---
 mm/memory-failure.c | 21 ++++++++++++---------
 1 file changed, 12 insertions(+), 9 deletions(-)

Comments

HORIGUCHI NAOYA(堀口 直也) Aug. 26, 2021, 6:17 a.m. UTC | #1
On Tue, Aug 24, 2021 at 03:13:22PM -0700, Yang Shi wrote:
> Currently hwpoison doesn't handle page cache THP, just give up and return
> error.  It is just because the hwpoison THP support was added before
> page cache THP was supported.
> 
> Handling page cache THP is simple, they could be offlined by splitting THP,
> just like anonymous THP.

I think that this patch is not enough to contain an error, because the page
table entries pointing to subpages of a shmem THP are removed during THP
splitting.  The processes using the file then allocate another (zeroed)
page for the poisoned address, which results in silent data loss.

According to the comment in unmap_page() at mm/huge_memory.c, file pages are
supposed to be faulted back on demand:

  static void unmap_page(struct page *page)
  ...
          /*
           * Anon pages need migration entries to preserve them, but file
           * pages can simply be left unmapped, then faulted back on demand.
           * If that is ever changed (perhaps for mlock), update remap_page().
           */
          if (PageAnon(page))
                  try_to_migrate(page, ttu_flags);
          else
                  try_to_unmap(page, ttu_flags | TTU_IGNORE_MLOCK);

, but I think that faulting back during memory error handling would be
hard, because the handler does not have any direct information about the
processes mapping an error page.  memory_failure() uses try_to_unmap() to
find them, but splitting a shmem THP makes this impossible because it
removes the related page table entries.

There was a discussion about another approach: keeping error pages in the
page cache for filesystems without backing storage.
https://lore.kernel.org/lkml/alpine.LSU.2.11.2103111312310.7859@eggly.anvils/
This approach seems less complicated to me, but one concern is that it
changes the user-visible behavior of memory errors.  Keeping error pages
in the page cache means that the errors persist until the next system
reboot, so we might need to define a way to clear the errors in order to
continue using the affected file.  The current implementation just sends
SIGBUS to the mapping processes (at least once) and then forgets about the
error, so it has no such issue.

Another possible solution might be to send SIGBUS immediately when a
memory error happens on a shmem THP.  We can find all the mapping
processes before splitting the THP, so send SIGBUS first, then split it
and contain the error page.  This is not elegant (it gives up any optional
actions), but at least we avoid the silent data loss.

- Naoya Horiguchi

> 
> The question is how to distinguish them with allocating and freeing THP
> which can't be handled by hwpoison properly.  It seems page->mapping is a
> good indicator, both anonymous page and file page have it populated, but
> it won't be populated until the page is added to rmap or page cache, in
> other word, instantiated.  If page->mapping is populated it is
> definitely not in allocating or freeing.
> 
> The later get_page_unless_zero() could serialize against page free
> paths.
> 
> Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
> Cc: Oscar Salvador <osalvador@suse.de>
> Cc: Hugh Dickins <hughd@google.com>
> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
> Signed-off-by: Yang Shi <shy828301@gmail.com>
> ---
>  mm/memory-failure.c | 21 ++++++++++++---------
>  1 file changed, 12 insertions(+), 9 deletions(-)
> 
> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> index 60df8fcd0444..caa0b0c1f5b8 100644
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -1149,13 +1149,16 @@ static int __get_hwpoison_page(struct page *page)
>  
>  	if (PageTransHuge(head)) {
>  		/*
> -		 * Non anonymous thp exists only in allocation/free time. We
> -		 * can't handle such a case correctly, so let's give it up.
> -		 * This should be better than triggering BUG_ON when kernel
> -		 * tries to touch the "partially handled" page.
> +		 * We can't handle allocating or freeing THPs, so let's give
> +		 * it up. This should be better than triggering BUG_ON when
> +		 * kernel tries to touch the "partially handled" page.
> +		 *
> +		 * page->mapping won't be initialized until the page is added
> +		 * to rmap or page cache.  Use this as an indicator for if
> +		 * this is an instantiated page.
>  		 */
> -		if (!PageAnon(head)) {
> -			pr_err("Memory failure: %#lx: non anonymous thp\n",
> +		if (!head->mapping) {
> +			pr_err("Memory failure: %#lx: non instantiated thp\n",
>  				page_to_pfn(page));
>  			return 0;
>  		}
> @@ -1414,12 +1417,12 @@ static int identify_page_state(unsigned long pfn, struct page *p,
>  static int try_to_split_thp_page(struct page *page, const char *msg)
>  {
>  	lock_page(page);
> -	if (!PageAnon(page) || unlikely(split_huge_page(page))) {
> +	if (!page->mapping || unlikely(split_huge_page(page))) {
>  		unsigned long pfn = page_to_pfn(page);
>  
>  		unlock_page(page);
> -		if (!PageAnon(page))
> -			pr_info("%s: %#lx: non anonymous thp\n", msg, pfn);
> +		if (!page->mapping)
> +			pr_info("%s: %#lx: not instantiated thp\n", msg, pfn);
>  		else
>  			pr_info("%s: %#lx: thp split failed\n", msg, pfn);
>  		put_page(page);
> -- 
> 2.26.2
>
Yang Shi Aug. 26, 2021, 8:03 p.m. UTC | #2
On Wed, Aug 25, 2021 at 11:17 PM HORIGUCHI NAOYA(堀口 直也)
<naoya.horiguchi@nec.com> wrote:
>
> On Tue, Aug 24, 2021 at 03:13:22PM -0700, Yang Shi wrote:
> > Currently hwpoison doesn't handle page cache THP, just give up and return
> > error.  It is just because the hwpoison THP support was added before
> > page cache THP was supported.
> >
> > Handling page cache THP is simple, they could be offlined by splitting THP,
> > just like anonymous THP.
>
> I think that this patch is not enough to contain an error because page table
> entries pointing to subpages in shmem thp are removed during thp splitting.
> Then the processes using the file newly allocates another (zeroed) page for
> the poisoned address, which results in slient data lost.
>
> According to the comment in unmap_page() at mm/huge_memory.c, file pages are
> supposed to be faulted back on demand:
>
>   static void unmap_page(struct page *page)
>   ...
>           /*
>            * Anon pages need migration entries to preserve them, but file
>            * pages can simply be left unmapped, then faulted back on demand.
>            * If that is ever changed (perhaps for mlock), update remap_page().
>            */
>           if (PageAnon(page))
>                   try_to_migrate(page, ttu_flags);
>           else
>                   try_to_unmap(page, ttu_flags | TTU_IGNORE_MLOCK);
>
> , but I don't think that faulting back during memory error handling might be
> hard because it does not have any direct information about mapping processes
> of an error page.  memory_failure() uses try_to_unmap() to find them but
> splitting of shmem thp makes this impossible, because it removes the related
> page table entries.
>
> There was a discussion about another approach of keeping error pages in page
> cache for filesystem without backend storage.
> https://lore.kernel.org/lkml/alpine.LSU.2.11.2103111312310.7859@eggly.anvils/
> This approach seems to me less complicated, but one concern is that this
> change affects user-visible behavior of memory errors.  Keeping error pages
> in page cache means that the errors are persistent until next system reboot,
> so we might need to define the way to clear the errors to continue to use
> the error file.  Current implementation is just to send SIGBUS to the
> mapping processes (at least once), then forget about the error, so there is
> no such issue.
>
> Another thought of possible solution might be to send SIGBUS immediately when
> a memory error happens on a shmem thp. We can find all the mapping processes
> before splitting shmem thp, so send SIGBUS first, then split it and contain
> the error page.  This is not elegant (giving up any optional actions) but
> anyway we can avoid the silent data lost.

Thanks a lot. I apologize, I didn't notice you had already posted a similar
patch before.

Yes, I think I focused too much on the soft offline part and missed the
uncorrected error part, and I admit I underestimated the problem.

I think Hugh's suggestion makes sense if we treat tmpfs as a regular
filesystem (just memory backed). AFAIK, some filesystems, e.g. btrfs, may
verify a checksum after reading from a storage block and return an error
if the checksum doesn't match, since that may indicate a hardware failure
on disk. The syscalls then return an error, or the page fault raises
SIGBUS.

So in the shmem/tmpfs case, if a hwpoisoned page is met, just return an
error (-EIO or whatever) for syscalls, or SIGBUS for page faults. That
aligns with the behavior of other filesystems. It is definitely the
applications' responsibility to check the return value of read/write
syscalls.

>
> - Naoya Horiguchi
>
> >
> > The question is how to distinguish them with allocating and freeing THP
> > which can't be handled by hwpoison properly.  It seems page->mapping is a
> > good indicator, both anonymous page and file page have it populated, but
> > it won't be populated until the page is added to rmap or page cache, in
> > other word, instantiated.  If page->mapping is populated it is
> > definitely not in allocating or freeing.
> >
> > The later get_page_unless_zero() could serialize against page free
> > paths.
> >
> > Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
> > Cc: Oscar Salvador <osalvador@suse.de>
> > Cc: Hugh Dickins <hughd@google.com>
> > Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
> > Signed-off-by: Yang Shi <shy828301@gmail.com>
> > ---
> >  mm/memory-failure.c | 21 ++++++++++++---------
> >  1 file changed, 12 insertions(+), 9 deletions(-)
> >
> > diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> > index 60df8fcd0444..caa0b0c1f5b8 100644
> > --- a/mm/memory-failure.c
> > +++ b/mm/memory-failure.c
> > @@ -1149,13 +1149,16 @@ static int __get_hwpoison_page(struct page *page)
> >
> >       if (PageTransHuge(head)) {
> >               /*
> > -              * Non anonymous thp exists only in allocation/free time. We
> > -              * can't handle such a case correctly, so let's give it up.
> > -              * This should be better than triggering BUG_ON when kernel
> > -              * tries to touch the "partially handled" page.
> > +              * We can't handle allocating or freeing THPs, so let's give
> > +              * it up. This should be better than triggering BUG_ON when
> > +              * kernel tries to touch the "partially handled" page.
> > +              *
> > +              * page->mapping won't be initialized until the page is added
> > +              * to rmap or page cache.  Use this as an indicator for if
> > +              * this is an instantiated page.
> >                */
> > -             if (!PageAnon(head)) {
> > -                     pr_err("Memory failure: %#lx: non anonymous thp\n",
> > +             if (!head->mapping) {
> > +                     pr_err("Memory failure: %#lx: non instantiated thp\n",
> >                               page_to_pfn(page));
> >                       return 0;
> >               }
> > @@ -1414,12 +1417,12 @@ static int identify_page_state(unsigned long pfn, struct page *p,
> >  static int try_to_split_thp_page(struct page *page, const char *msg)
> >  {
> >       lock_page(page);
> > -     if (!PageAnon(page) || unlikely(split_huge_page(page))) {
> > +     if (!page->mapping || unlikely(split_huge_page(page))) {
> >               unsigned long pfn = page_to_pfn(page);
> >
> >               unlock_page(page);
> > -             if (!PageAnon(page))
> > -                     pr_info("%s: %#lx: non anonymous thp\n", msg, pfn);
> > +             if (!page->mapping)
> > +                     pr_info("%s: %#lx: not instantiated thp\n", msg, pfn);
> >               else
> >                       pr_info("%s: %#lx: thp split failed\n", msg, pfn);
> >               put_page(page);
> > --
> > 2.26.2
> >
Yang Shi Aug. 26, 2021, 10:03 p.m. UTC | #3
On Thu, Aug 26, 2021 at 1:03 PM Yang Shi <shy828301@gmail.com> wrote:
>
> On Wed, Aug 25, 2021 at 11:17 PM HORIGUCHI NAOYA(堀口 直也)
> <naoya.horiguchi@nec.com> wrote:
> >
> > On Tue, Aug 24, 2021 at 03:13:22PM -0700, Yang Shi wrote:
> > > Currently hwpoison doesn't handle page cache THP, just give up and return
> > > error.  It is just because the hwpoison THP support was added before
> > > page cache THP was supported.
> > >
> > > Handling page cache THP is simple, they could be offlined by splitting THP,
> > > just like anonymous THP.
> >
> > I think that this patch is not enough to contain an error because page table
> > entries pointing to subpages in shmem thp are removed during thp splitting.
> > Then the processes using the file newly allocates another (zeroed) page for
> > the poisoned address, which results in slient data lost.
> >
> > According to the comment in unmap_page() at mm/huge_memory.c, file pages are
> > supposed to be faulted back on demand:
> >
> >   static void unmap_page(struct page *page)
> >   ...
> >           /*
> >            * Anon pages need migration entries to preserve them, but file
> >            * pages can simply be left unmapped, then faulted back on demand.
> >            * If that is ever changed (perhaps for mlock), update remap_page().
> >            */
> >           if (PageAnon(page))
> >                   try_to_migrate(page, ttu_flags);
> >           else
> >                   try_to_unmap(page, ttu_flags | TTU_IGNORE_MLOCK);
> >
> > , but I don't think that faulting back during memory error handling might be
> > hard because it does not have any direct information about mapping processes
> > of an error page.  memory_failure() uses try_to_unmap() to find them but
> > splitting of shmem thp makes this impossible, because it removes the related
> > page table entries.
> >
> > There was a discussion about another approach of keeping error pages in page
> > cache for filesystem without backend storage.
> > https://lore.kernel.org/lkml/alpine.LSU.2.11.2103111312310.7859@eggly.anvils/
> > This approach seems to me less complicated, but one concern is that this
> > change affects user-visible behavior of memory errors.  Keeping error pages
> > in page cache means that the errors are persistent until next system reboot,
> > so we might need to define the way to clear the errors to continue to use
> > the error file.  Current implementation is just to send SIGBUS to the
> > mapping processes (at least once), then forget about the error, so there is
> > no such issue.
> >
> > Another thought of possible solution might be to send SIGBUS immediately when
> > a memory error happens on a shmem thp. We can find all the mapping processes
> > before splitting shmem thp, so send SIGBUS first, then split it and contain
> > the error page.  This is not elegant (giving up any optional actions) but
> > anyway we can avoid the silent data lost.
>
> Thanks a lot. I apologize I didn't notice you already posted a similar
> patch before.
>
> Yes, I think I focused on the soft offline part too much and missed
> the uncorrected error part and I admit I did underestimate the
> problem.
>
> I think Hugh's suggestion makes sense if we treat tmpfs as a regular
> filesystem (just memory backed). AFAIK, some filesystem, e.g. btrfs,
> may do checksum after reading from storage block then return an error
> if checksum is not right since it may indicate hardware failure on
> disk. Then the syscalls or page fault return error or SIGBUS.
>
> So in shmem/tmpfs case, if hwpoisoned page is met, just return error
> (-EIO or whatever) for syscall or SIGBUS for page fault. It does align
> with the behavior of other filesystems. It is definitely applications'
> responsibility to check the return value of read/write syscalls.

BTW, IIUC dirty regular page cache pages (storage backed) would be left in
the page cache too, while clean page cache pages would be truncated since
they can simply be reread from storage, right?

>
> >
> > - Naoya Horiguchi
> >
> > >
> > > The question is how to distinguish them with allocating and freeing THP
> > > which can't be handled by hwpoison properly.  It seems page->mapping is a
> > > good indicator, both anonymous page and file page have it populated, but
> > > it won't be populated until the page is added to rmap or page cache, in
> > > other word, instantiated.  If page->mapping is populated it is
> > > definitely not in allocating or freeing.
> > >
> > > The later get_page_unless_zero() could serialize against page free
> > > paths.
> > >
> > > Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
> > > Cc: Oscar Salvador <osalvador@suse.de>
> > > Cc: Hugh Dickins <hughd@google.com>
> > > Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
> > > Signed-off-by: Yang Shi <shy828301@gmail.com>
> > > ---
> > >  mm/memory-failure.c | 21 ++++++++++++---------
> > >  1 file changed, 12 insertions(+), 9 deletions(-)
> > >
> > > diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> > > index 60df8fcd0444..caa0b0c1f5b8 100644
> > > --- a/mm/memory-failure.c
> > > +++ b/mm/memory-failure.c
> > > @@ -1149,13 +1149,16 @@ static int __get_hwpoison_page(struct page *page)
> > >
> > >       if (PageTransHuge(head)) {
> > >               /*
> > > -              * Non anonymous thp exists only in allocation/free time. We
> > > -              * can't handle such a case correctly, so let's give it up.
> > > -              * This should be better than triggering BUG_ON when kernel
> > > -              * tries to touch the "partially handled" page.
> > > +              * We can't handle allocating or freeing THPs, so let's give
> > > +              * it up. This should be better than triggering BUG_ON when
> > > +              * kernel tries to touch the "partially handled" page.
> > > +              *
> > > +              * page->mapping won't be initialized until the page is added
> > > +              * to rmap or page cache.  Use this as an indicator for if
> > > +              * this is an instantiated page.
> > >                */
> > > -             if (!PageAnon(head)) {
> > > -                     pr_err("Memory failure: %#lx: non anonymous thp\n",
> > > +             if (!head->mapping) {
> > > +                     pr_err("Memory failure: %#lx: non instantiated thp\n",
> > >                               page_to_pfn(page));
> > >                       return 0;
> > >               }
> > > @@ -1414,12 +1417,12 @@ static int identify_page_state(unsigned long pfn, struct page *p,
> > >  static int try_to_split_thp_page(struct page *page, const char *msg)
> > >  {
> > >       lock_page(page);
> > > -     if (!PageAnon(page) || unlikely(split_huge_page(page))) {
> > > +     if (!page->mapping || unlikely(split_huge_page(page))) {
> > >               unsigned long pfn = page_to_pfn(page);
> > >
> > >               unlock_page(page);
> > > -             if (!PageAnon(page))
> > > -                     pr_info("%s: %#lx: non anonymous thp\n", msg, pfn);
> > > +             if (!page->mapping)
> > > +                     pr_info("%s: %#lx: not instantiated thp\n", msg, pfn);
> > >               else
> > >                       pr_info("%s: %#lx: thp split failed\n", msg, pfn);
> > >               put_page(page);
> > > --
> > > 2.26.2
> > >
HORIGUCHI NAOYA(堀口 直也) Aug. 27, 2021, 3:57 a.m. UTC | #4
On Thu, Aug 26, 2021 at 03:03:57PM -0700, Yang Shi wrote:
> On Thu, Aug 26, 2021 at 1:03 PM Yang Shi <shy828301@gmail.com> wrote:
> >
> > On Wed, Aug 25, 2021 at 11:17 PM HORIGUCHI NAOYA(堀口 直也)
> > <naoya.horiguchi@nec.com> wrote:
> > >
> > > On Tue, Aug 24, 2021 at 03:13:22PM -0700, Yang Shi wrote:
...
> > >
> > > There was a discussion about another approach of keeping error pages in page
> > > cache for filesystem without backend storage.
> > > https://lore.kernel.org/lkml/alpine.LSU.2.11.2103111312310.7859@eggly.anvils/
> > > This approach seems to me less complicated, but one concern is that this
> > > change affects user-visible behavior of memory errors.  Keeping error pages
> > > in page cache means that the errors are persistent until next system reboot,
> > > so we might need to define the way to clear the errors to continue to use
> > > the error file.  Current implementation is just to send SIGBUS to the
> > > mapping processes (at least once), then forget about the error, so there is
> > > no such issue.
> > >
> > > Another thought of possible solution might be to send SIGBUS immediately when
> > > a memory error happens on a shmem thp. We can find all the mapping processes
> > > before splitting shmem thp, so send SIGBUS first, then split it and contain
> > > the error page.  This is not elegant (giving up any optional actions) but
> > > anyway we can avoid the silent data lost.
> >
> > Thanks a lot. I apologize I didn't notice you already posted a similar
> > patch before.
> >
> > Yes, I think I focused on the soft offline part too much and missed
> > the uncorrected error part and I admit I did underestimate the
> > problem.
> >
> > I think Hugh's suggestion makes sense if we treat tmpfs as a regular
> > filesystem (just memory backed). AFAIK, some filesystem, e.g. btrfs,
> > may do checksum after reading from storage block then return an error
> > if checksum is not right since it may indicate hardware failure on
> > disk. Then the syscalls or page fault return error or SIGBUS.
> >
> > So in shmem/tmpfs case, if hwpoisoned page is met, just return error
> > (-EIO or whatever) for syscall or SIGBUS for page fault. It does align
> > with the behavior of other filesystems. It is definitely applications'
> > responsibility to check the return value of read/write syscalls.
> 
> BTW, IIUC the dirty regular page cache (storage backed) would be left
> in the page cache too, the clean page cache would be truncated since
> they can be just reread from storage, right?

A dirty page cache page is also removed on error (me_pagecache_dirty()
falls through to me_pagecache_clean(), then truncate_error_page() is
called).  The main purpose of this is to separate the error page from
existing data structures, to minimize the risk of later accesses (maybe by
race or bug).  But we can change this behavior for specific filesystems by
updating the error_remove_page() callback in address_space_operations.

Honestly, it seems to me that how dirty data is lost does not depend on
the filesystem, and I'm still not sure that this is really the right
approach for the current issue.

Thanks,
Naoya Horiguchi
Yang Shi Aug. 27, 2021, 5:02 a.m. UTC | #5
On Thu, Aug 26, 2021 at 8:57 PM HORIGUCHI NAOYA(堀口 直也)
<naoya.horiguchi@nec.com> wrote:
>
> On Thu, Aug 26, 2021 at 03:03:57PM -0700, Yang Shi wrote:
> > On Thu, Aug 26, 2021 at 1:03 PM Yang Shi <shy828301@gmail.com> wrote:
> > >
> > > On Wed, Aug 25, 2021 at 11:17 PM HORIGUCHI NAOYA(堀口 直也)
> > > <naoya.horiguchi@nec.com> wrote:
> > > >
> > > > On Tue, Aug 24, 2021 at 03:13:22PM -0700, Yang Shi wrote:
> ...
> > > >
> > > > There was a discussion about another approach of keeping error pages in page
> > > > cache for filesystem without backend storage.
> > > > https://lore.kernel.org/lkml/alpine.LSU.2.11.2103111312310.7859@eggly.anvils/
> > > > This approach seems to me less complicated, but one concern is that this
> > > > change affects user-visible behavior of memory errors.  Keeping error pages
> > > > in page cache means that the errors are persistent until next system reboot,
> > > > so we might need to define the way to clear the errors to continue to use
> > > > the error file.  Current implementation is just to send SIGBUS to the
> > > > mapping processes (at least once), then forget about the error, so there is
> > > > no such issue.
> > > >
> > > > Another thought of possible solution might be to send SIGBUS immediately when
> > > > a memory error happens on a shmem thp. We can find all the mapping processes
> > > > before splitting shmem thp, so send SIGBUS first, then split it and contain
> > > > the error page.  This is not elegant (giving up any optional actions) but
> > > > anyway we can avoid the silent data lost.
> > >
> > > Thanks a lot. I apologize I didn't notice you already posted a similar
> > > patch before.
> > >
> > > Yes, I think I focused on the soft offline part too much and missed
> > > the uncorrected error part and I admit I did underestimate the
> > > problem.
> > >
> > > I think Hugh's suggestion makes sense if we treat tmpfs as a regular
> > > filesystem (just memory backed). AFAIK, some filesystem, e.g. btrfs,
> > > may do checksum after reading from storage block then return an error
> > > if checksum is not right since it may indicate hardware failure on
> > > disk. Then the syscalls or page fault return error or SIGBUS.
> > >
> > > So in shmem/tmpfs case, if hwpoisoned page is met, just return error
> > > (-EIO or whatever) for syscall or SIGBUS for page fault. It does align
> > > with the behavior of other filesystems. It is definitely applications'
> > > responsibility to check the return value of read/write syscalls.
> >
> > BTW, IIUC the dirty regular page cache (storage backed) would be left
> > in the page cache too, the clean page cache would be truncated since
> > they can be just reread from storage, right?
>
> A dirty page cache is also removed on error (me_pagecache_dirty() falls
> through me_pagecache_clean(), then truncate_error_page() is called).
> The main purpose of this is to separate off the error page from exising
> data structures to minimize the risk of later accesses (maybe by race or bug).
> But we can change this behavior for specific file systems by updating
> error_remove_page() callbacks in address_space_operation.

Yeah, if the fs's error_remove_page() is defined. It seems the filesystems
that define error_remove_page() all just use generic_error_remove_page(),
except hugetlbfs. And the generic implementation just clears the dirty
flag and removes the page from the page cache.

If error_remove_page() is not defined, the page would stay in the page
cache, since invalidate_inode_page() can't remove a dirty page.

>
> Honestly, it seems to me that how dirty data is lost does not depend on
> file system, and I'm still not sure that this is really a right approach
> for the current issue.

IMHO the biggest problem is that applications may silently see
obsolete/inconsistent data, right? Keeping the corrupted page in the page
cache would make it possible to notify applications that they are
accessing inconsistent data.

>
> Thanks,
> Naoya Horiguchi
Yang Shi Aug. 30, 2021, 11:44 p.m. UTC | #6
On Thu, Aug 26, 2021 at 10:02 PM Yang Shi <shy828301@gmail.com> wrote:
>
> On Thu, Aug 26, 2021 at 8:57 PM HORIGUCHI NAOYA(堀口 直也)
> <naoya.horiguchi@nec.com> wrote:
> >
> > On Thu, Aug 26, 2021 at 03:03:57PM -0700, Yang Shi wrote:
> > > On Thu, Aug 26, 2021 at 1:03 PM Yang Shi <shy828301@gmail.com> wrote:
> > > >
> > > > On Wed, Aug 25, 2021 at 11:17 PM HORIGUCHI NAOYA(堀口 直也)
> > > > <naoya.horiguchi@nec.com> wrote:
> > > > >
> > > > > On Tue, Aug 24, 2021 at 03:13:22PM -0700, Yang Shi wrote:
> > ...
> > > > >
> > > > > There was a discussion about another approach of keeping error pages in page
> > > > > cache for filesystem without backend storage.
> > > > > https://lore.kernel.org/lkml/alpine.LSU.2.11.2103111312310.7859@eggly.anvils/
> > > > > This approach seems to me less complicated, but one concern is that this
> > > > > change affects user-visible behavior of memory errors.  Keeping error pages
> > > > > in page cache means that the errors are persistent until next system reboot,
> > > > > so we might need to define the way to clear the errors to continue to use
> > > > > the error file.  Current implementation is just to send SIGBUS to the
> > > > > mapping processes (at least once), then forget about the error, so there is
> > > > > no such issue.
> > > > >
> > > > > Another thought of possible solution might be to send SIGBUS immediately when
> > > > > a memory error happens on a shmem thp. We can find all the mapping processes
> > > > > before splitting shmem thp, so send SIGBUS first, then split it and contain
> > > > > the error page.  This is not elegant (giving up any optional actions) but
> > > > > anyway we can avoid the silent data lost.
> > > >
> > > > Thanks a lot. I apologize I didn't notice you already posted a similar
> > > > patch before.
> > > >
> > > > Yes, I think I focused on the soft offline part too much and missed
> > > > the uncorrected error part and I admit I did underestimate the
> > > > problem.
> > > >
> > > > I think Hugh's suggestion makes sense if we treat tmpfs as a regular
> > > > filesystem (just memory backed). AFAIK, some filesystem, e.g. btrfs,
> > > > may do checksum after reading from storage block then return an error
> > > > if checksum is not right since it may indicate hardware failure on
> > > > disk. Then the syscalls or page fault return error or SIGBUS.
> > > >
> > > > So in shmem/tmpfs case, if hwpoisoned page is met, just return error
> > > > (-EIO or whatever) for syscall or SIGBUS for page fault. It does align
> > > > with the behavior of other filesystems. It is definitely applications'
> > > > responsibility to check the return value of read/write syscalls.
> > >
> > > BTW, IIUC the dirty regular page cache (storage backed) would be left
> > > in the page cache too, the clean page cache would be truncated since
> > > they can be just reread from storage, right?
> >
> > A dirty page cache is also removed on error (me_pagecache_dirty() falls
> > through me_pagecache_clean(), then truncate_error_page() is called).
> > The main purpose of this is to separate off the error page from exising
> > data structures to minimize the risk of later accesses (maybe by race or bug).
> > But we can change this behavior for specific file systems by updating
> > error_remove_page() callbacks in address_space_operation.
>
> Yeah, if fs's error_remove_page() is defined. It seems the filesystems
> which have error_remove_page() defined just use generic_remove_page()
> except hugetlbfs. And the generic implementation just clears the dirty
> flag and removes the page from page cache.
>
> If error_remove_page() is not defined, the page would stay in page
> cache since invalidate_inode_page() can't remove dirty page.
>
> >
> > Honestly, it seems to me that how dirty data is lost does not depend on
> > file system, and I'm still not sure that this is really a right approach
> > for the current issue.
>
> IMHO the biggest problem is that applications may see
> obsolete/inconsistent data silently, right? Actually keeping the
> corrupted page in page cache should be able to notify applications
> that they are accessing inconsistent data.

The removal-from-page-cache behavior may be much worse for shmem/tmpfs,
since it actually removes the whole data blocks for the file. The user
will read back all zeroes for the corrupted blocks, without any
notification.

The more I stare at the code and run tests, the more I think we
should keep the corrupted page in the page cache and notify the users.

It seems easier for read-only filesystems: just remove the page from
the page cache, since the data can always be re-read from disk. This is
also the current behavior.

For shmem, the page could be kept in the page cache with the dirty flag
set, since it won't be written back.

For regular filesystems that do writeback, things are a little
more complicated, since we need to prevent writeback by clearing the
dirty flag. Besides writeback, we also need to distinguish a cache drop
from truncation/hole punch/unlink. We don't want a cache drop
(e.g. echo 1 > /proc/sys/vm/drop_caches) to drop the corrupted page, but
truncate/hole punch/unlink can safely remove the page, since the
underlying data blocks will be gone too.

Thanks to the refcount pin taken by memory failure, a cache drop can't
drop the page, since it checks whether the refcount matches the
expected value. Truncate/hole punch/unlink don't check the refcount, so
they can proceed. But inode eviction (the slab shrinking path) may call
truncate, so IIUC the corrupted page may still be removed from the page
cache while the underlying data blocks still exist. There might be
other paths in filesystems that truncate the page cache while the
underlying data blocks are still present.

The read/write syscalls also need to check the hwpoison flag. I'm not
sure whether I've missed other syscalls.

I'm not a filesystem expert, so I may be missing something else, but I
believe most cases should be covered.

I'd like to start with shmem/tmpfs, since it is relatively easy and
also could unblock shmem THP hwpoison support. Any comment is
welcome.

>
> >
> > Thanks,
> > Naoya Horiguchi
HORIGUCHI NAOYA(堀口 直也) Sept. 2, 2021, 3:07 a.m. UTC | #7
On Mon, Aug 30, 2021 at 04:44:06PM -0700, Yang Shi wrote:
> On Thu, Aug 26, 2021 at 10:02 PM Yang Shi <shy828301@gmail.com> wrote:
> >
> > On Thu, Aug 26, 2021 at 8:57 PM HORIGUCHI NAOYA(堀口 直也)
> > <naoya.horiguchi@nec.com> wrote:
> > >
> > > On Thu, Aug 26, 2021 at 03:03:57PM -0700, Yang Shi wrote:
> > > > On Thu, Aug 26, 2021 at 1:03 PM Yang Shi <shy828301@gmail.com> wrote:
> > > > >
> > > > > On Wed, Aug 25, 2021 at 11:17 PM HORIGUCHI NAOYA(堀口 直也)
> > > > > <naoya.horiguchi@nec.com> wrote:
> > > > > >
> > > > > > On Tue, Aug 24, 2021 at 03:13:22PM -0700, Yang Shi wrote:
> > > ...
> > > > > >
> > > > > > There was a discussion about another approach of keeping error pages in page
> > > > > > cache for filesystem without backend storage.
> > > > > > https://lore.kernel.org/lkml/alpine.LSU.2.11.2103111312310.7859@eggly.anvils/
> > > > > > This approach seems to me less complicated, but one concern is that this
> > > > > > change affects user-visible behavior of memory errors.  Keeping error pages
> > > > > > in page cache means that the errors are persistent until next system reboot,
> > > > > > so we might need to define the way to clear the errors to continue to use
> > > > > > the error file.  Current implementation is just to send SIGBUS to the
> > > > > > mapping processes (at least once), then forget about the error, so there is
> > > > > > no such issue.
> > > > > >
> > > > > > Another thought of possible solution might be to send SIGBUS immediately when
> > > > > > a memory error happens on a shmem thp. We can find all the mapping processes
> > > > > > before splitting shmem thp, so send SIGBUS first, then split it and contain
> > > > > > the error page.  This is not elegant (giving up any optional actions) but
> > > > > > anyway we can avoid the silent data lost.
> > > > >
> > > > > Thanks a lot. I apologize I didn't notice you already posted a similar
> > > > > patch before.
> > > > >
> > > > > Yes, I think I focused on the soft offline part too much and missed
> > > > > the uncorrected error part and I admit I did underestimate the
> > > > > problem.
> > > > >
> > > > > I think Hugh's suggestion makes sense if we treat tmpfs as a regular
> > > > > filesystem (just memory backed). AFAIK, some filesystem, e.g. btrfs,
> > > > > may do checksum after reading from storage block then return an error
> > > > > if checksum is not right since it may indicate hardware failure on
> > > > > disk. Then the syscalls or page fault return error or SIGBUS.
> > > > >
> > > > > So in shmem/tmpfs case, if hwpoisoned page is met, just return error
> > > > > (-EIO or whatever) for syscall or SIGBUS for page fault. It does align
> > > > > with the behavior of other filesystems. It is definitely applications'
> > > > > responsibility to check the return value of read/write syscalls.
> > > >
> > > > BTW, IIUC the dirty regular page cache (storage backed) would be left
> > > > in the page cache too, the clean page cache would be truncated since
> > > > they can be just reread from storage, right?
> > >
> > > A dirty page cache is also removed on error (me_pagecache_dirty() falls
> > > through me_pagecache_clean(), then truncate_error_page() is called).
> > > The main purpose of this is to separate off the error page from exising
> > > data structures to minimize the risk of later accesses (maybe by race or bug).
> > > But we can change this behavior for specific file systems by updating
> > > error_remove_page() callbacks in address_space_operation.
> >
> > Yeah, if fs's error_remove_page() is defined. It seems the filesystems
> > which have error_remove_page() defined just use generic_remove_page()
> > except hugetlbfs. And the generic implementation just clears the dirty
> > flag and removes the page from page cache.
> >
> > If error_remove_page() is not defined, the page would stay in page
> > cache since invalidate_inode_page() can't remove dirty page.
> >
> > >
> > > Honestly, it seems to me that how dirty data is lost does not depend on
> > > file system, and I'm still not sure that this is really a right approach
> > > for the current issue.
> >
> > IMHO the biggest problem is that applications may see
> > obsolete/inconsistent data silently, right? Actually keeping the
> > corrupted page in page cache should be able to notify applications
> > that they are accessing inconsistent data.
> 
> The removal from page cache behavior may be much worse for shmem/tmpfs
> since it actually removes the whole data blocks for the file. The user
> will get all zero if the corrupted blocks are read without any
> notification.
> 
> The more I stared at the code and had tests done, the more I think we
> should keep the corrupted page in page cache and notify the users.
> 
> It seems easier for readonly filesystem. Just remove the page from
> page cache since it always could read data from disk. This is also the
> current behavior.
> 
> For shmem, the page could be kept in page cache with dirty flag set
> since it won't be written back.
> 
> For regular filesystems that could do writeback, things are a little
> bit more complicated since we need to prevent from writing back by
> clearing dirty flag. Other than writeback we also need to distinguish
> cache drop from truncation/hole punch/unlink. We don't want cache drop
> (e.g. echo 1 > /proc/sys/vm/drop_caches) drop corrupted page. But
> truncate/hole punch/unlink should be fine to remove the page since the
> underlying data blocks will be gone too.
> 
> Thanks to the refcount pin done by memory failure, cache drop can't
> drop the page since it checks if the refcount is expected or not.
> Truncate/hole punch/unlink doesn't check refcount so they could
> proceed. But inode evict (slab shrinking path) may call truncate, so
> the corrupted page may still be removed from page cache when the
> underlying data blocks still exist IIUC. There might be other paths in
> filesystems to have page cache truncate but the underlying data blocks
> are still present.
> 
> The read/write syscalls also need check hwpoisoned flag. I'm not sure
> if I miss other syscalls or not.

Thanks for the analysis. I think that filesystems supporting writeback might
call filemap_write_and_wait() when evicting an inode, which can report the
error via the AS_EIO flag in struct address_space.  But that does nothing for
no-writeback filesystems like shmem, so we need better error reporting for them.

> 
> I'm not a filesystem expert so I'm not sure if I'm missing something
> else or not. But I'm supposed the most should be covered.
> 
> I'd like to start with shmem/tmpfs since it is relatively easier and
> this also could unblock shmem THP hwpoison support. Any comment is
> welcome.

I think that keeping corrupted pages in the page cache can be a better solution,
so if you plan to write a patchset, I'd be glad to review/test it.  Starting with
shmem/tmpfs sounds good to me.

Thanks,
Naoya Horiguchi
Yang Shi Sept. 2, 2021, 6:32 p.m. UTC | #8
On Wed, Sep 1, 2021 at 8:07 PM HORIGUCHI NAOYA(堀口 直也)
<naoya.horiguchi@nec.com> wrote:
>
> On Mon, Aug 30, 2021 at 04:44:06PM -0700, Yang Shi wrote:
> > On Thu, Aug 26, 2021 at 10:02 PM Yang Shi <shy828301@gmail.com> wrote:
> > >
> > > On Thu, Aug 26, 2021 at 8:57 PM HORIGUCHI NAOYA(堀口 直也)
> > > <naoya.horiguchi@nec.com> wrote:
> > > >
> > > > On Thu, Aug 26, 2021 at 03:03:57PM -0700, Yang Shi wrote:
> > > > > On Thu, Aug 26, 2021 at 1:03 PM Yang Shi <shy828301@gmail.com> wrote:
> > > > > >
> > > > > > On Wed, Aug 25, 2021 at 11:17 PM HORIGUCHI NAOYA(堀口 直也)
> > > > > > <naoya.horiguchi@nec.com> wrote:
> > > > > > >
> > > > > > > On Tue, Aug 24, 2021 at 03:13:22PM -0700, Yang Shi wrote:
> > > > ...
> > > > > > >
> > > > > > > There was a discussion about another approach of keeping error pages in page
> > > > > > > cache for filesystem without backend storage.
> > > > > > > https://lore.kernel.org/lkml/alpine.LSU.2.11.2103111312310.7859@eggly.anvils/
> > > > > > > This approach seems to me less complicated, but one concern is that this
> > > > > > > change affects user-visible behavior of memory errors.  Keeping error pages
> > > > > > > in page cache means that the errors are persistent until next system reboot,
> > > > > > > so we might need to define the way to clear the errors to continue to use
> > > > > > > the error file.  Current implementation is just to send SIGBUS to the
> > > > > > > mapping processes (at least once), then forget about the error, so there is
> > > > > > > no such issue.
> > > > > > >
> > > > > > > Another thought of possible solution might be to send SIGBUS immediately when
> > > > > > > a memory error happens on a shmem thp. We can find all the mapping processes
> > > > > > > before splitting shmem thp, so send SIGBUS first, then split it and contain
> > > > > > > the error page.  This is not elegant (giving up any optional actions) but
> > > > > > > anyway we can avoid the silent data lost.
> > > > > >
> > > > > > Thanks a lot. I apologize I didn't notice you already posted a similar
> > > > > > patch before.
> > > > > >
> > > > > > Yes, I think I focused on the soft offline part too much and missed
> > > > > > the uncorrected error part and I admit I did underestimate the
> > > > > > problem.
> > > > > >
> > > > > > I think Hugh's suggestion makes sense if we treat tmpfs as a regular
> > > > > > filesystem (just memory backed). AFAIK, some filesystem, e.g. btrfs,
> > > > > > may do checksum after reading from storage block then return an error
> > > > > > if checksum is not right since it may indicate hardware failure on
> > > > > > disk. Then the syscalls or page fault return error or SIGBUS.
> > > > > >
> > > > > > So in shmem/tmpfs case, if hwpoisoned page is met, just return error
> > > > > > (-EIO or whatever) for syscall or SIGBUS for page fault. It does align
> > > > > > with the behavior of other filesystems. It is definitely applications'
> > > > > > responsibility to check the return value of read/write syscalls.
> > > > >
> > > > > BTW, IIUC the dirty regular page cache (storage backed) would be left
> > > > > in the page cache too, the clean page cache would be truncated since
> > > > > they can be just reread from storage, right?
> > > >
> > > > A dirty page cache is also removed on error (me_pagecache_dirty() falls
> > > > through me_pagecache_clean(), then truncate_error_page() is called).
> > > > The main purpose of this is to separate off the error page from exising
> > > > data structures to minimize the risk of later accesses (maybe by race or bug).
> > > > But we can change this behavior for specific file systems by updating
> > > > error_remove_page() callbacks in address_space_operation.
> > >
> > > Yeah, if fs's error_remove_page() is defined. It seems the filesystems
> > > which have error_remove_page() defined just use generic_remove_page()
> > > except hugetlbfs. And the generic implementation just clears the dirty
> > > flag and removes the page from page cache.
> > >
> > > If error_remove_page() is not defined, the page would stay in page
> > > cache since invalidate_inode_page() can't remove dirty page.
> > >
> > > >
> > > > Honestly, it seems to me that how dirty data is lost does not depend on
> > > > file system, and I'm still not sure that this is really a right approach
> > > > for the current issue.
> > >
> > > IMHO the biggest problem is that applications may see
> > > obsolete/inconsistent data silently, right? Actually keeping the
> > > corrupted page in page cache should be able to notify applications
> > > that they are accessing inconsistent data.
> >
> > The removal from page cache behavior may be much worse for shmem/tmpfs
> > since it actually removes the whole data blocks for the file. The user
> > will get all zero if the corrupted blocks are read without any
> > notification.
> >
> > The more I stared at the code and had tests done, the more I think we
> > should keep the corrupted page in page cache and notify the users.
> >
> > It seems easier for readonly filesystem. Just remove the page from
> > page cache since it always could read data from disk. This is also the
> > current behavior.
> >
> > For shmem, the page could be kept in page cache with dirty flag set
> > since it won't be written back.
> >
> > For regular filesystems that could do writeback, things are a little
> > bit more complicated since we need to prevent from writing back by
> > clearing dirty flag. Other than writeback we also need to distinguish
> > cache drop from truncation/hole punch/unlink. We don't want cache drop
> > (e.g. echo 1 > /proc/sys/vm/drop_caches) drop corrupted page. But
> > truncate/hole punch/unlink should be fine to remove the page since the
> > underlying data blocks will be gone too.
> >
> > Thanks to the refcount pin done by memory failure, cache drop can't
> > drop the page since it checks if the refcount is expected or not.
> > Truncate/hole punch/unlink doesn't check refcount so they could
> > proceed. But inode evict (slab shrinking path) may call truncate, so
> > the corrupted page may still be removed from page cache when the
> > underlying data blocks still exist IIUC. There might be other paths in
> > filesystems to have page cache truncate but the underlying data blocks
> > are still present.
> >
> > The read/write syscalls also need check hwpoisoned flag. I'm not sure
> > if I miss other syscalls or not.
>
> Thanks for analyzing, I think that filesystems supporting writeback might
> call filemap_write_and_wait() when evicting inode, which can notify error
> by AS_EIO flag in struct address_space.  But it does nothing for no-writeback
> filesystems like shmem.  So we need better error report for them.

AFAIK the address_space error only works for fsync(). Anyway, I could be wrong.

I think clearing the dirty flag might be the easiest way. It seems
unnecessary to notify the users at writeback time, since most writeback
happens asynchronously; they should be notified when the page is
accessed, e.g. on read/write and page fault.

I did some further investigation and got a clearer picture for
writeback filesystems:
1. The page should not be written back: clearing the dirty flag
prevents writeback.
2. The page should not be dropped (it shows up as a clean page): the
refcount pin from hwpoison prevents invalidation (called by
cache drop, inode cache shrinking, etc.), but it doesn't prevent
invalidation in the DIO path (easy to deal with).
3. The page should still be able to get truncated/hole punched/unlinked:
this works as it is.
4. Users should be notified when the page is accessed, e.g. via
read/write, page fault and other paths: this is the hard part.

The hardest part is #4, since there are too many paths in filesystems
that do *NOT* check whether a page is poisoned, e.g. read/write,
compression (btrfs, f2fs), etc. A couple of ways to handle it off the
top of my head:
1. Check the hwpoison flag on every path: the most straightforward way,
but a lot of work.
2. Return NULL for a poisoned page from the page cache lookup: most
callsites already check for NULL, so this should be the least work, I
think. But the error handling in filesystems would just return -ENOMEM,
and that error code would obviously confuse users.
3. To improve on #2, we could return an error pointer, e.g. ERR_PTR(-EIO),
but this would involve a significant amount of code change as well, since
all the paths would need to check whether the pointer is an ERR or not.
4. #3 may break truncate, so maybe we need to convert the hwpoison page to
an exceptional entry? I'm not sure whether I'm over-engineering this.
5. We also could define a new FGP flag to return the poisoned page, NULL,
or an error pointer. This would also need significant code change, since
a lot of callsites would need to be examined.

Most filesystems use the generic page fault handler (filemap_fault);
just a few have their own implementation, e.g. ceph, DAX, etc.

I don't know which way we should go; I need more advice.

>
> >
> > I'm not a filesystem expert so I'm not sure if I'm missing something
> > else or not. But I'm supposed the most should be covered.
> >
> > I'd like to start with shmem/tmpfs since it is relatively easier and
> > this also could unblock shmem THP hwpoison support. Any comment is
> > welcome.
>
> I think that keeping corrupted pages in page cache can be a better solution.
> So if you plan to write a patchset, I'm glad to review/test it.  Starting with
> shmem/tmpfs sounds nice to me.

Thank you very much. Yes, I'm working on it, and I already have some
patches working (they just check whether the page is hwpoisoned on
every path, approach #1 listed above). I'll clean up the patches and
send them out to gather some feedback.

>
> Thanks,
> Naoya Horiguchi
HORIGUCHI NAOYA(堀口 直也) Sept. 3, 2021, 11:53 a.m. UTC | #9
On Thu, Sep 02, 2021 at 11:32:49AM -0700, Yang Shi wrote:
> On Wed, Sep 1, 2021 at 8:07 PM HORIGUCHI NAOYA(堀口 直也)
> <naoya.horiguchi@nec.com> wrote:
> >
> > On Mon, Aug 30, 2021 at 04:44:06PM -0700, Yang Shi wrote:
> > > On Thu, Aug 26, 2021 at 10:02 PM Yang Shi <shy828301@gmail.com> wrote:
> > > >
> > > > On Thu, Aug 26, 2021 at 8:57 PM HORIGUCHI NAOYA(堀口 直也)
> > > > <naoya.horiguchi@nec.com> wrote:
> > > > >
> > > > > On Thu, Aug 26, 2021 at 03:03:57PM -0700, Yang Shi wrote:
> > > > > > On Thu, Aug 26, 2021 at 1:03 PM Yang Shi <shy828301@gmail.com> wrote:
> > > > > > >
> > > > > > > On Wed, Aug 25, 2021 at 11:17 PM HORIGUCHI NAOYA(堀口 直也)
> > > > > > > <naoya.horiguchi@nec.com> wrote:
> > > > > > > >
> > > > > > > > On Tue, Aug 24, 2021 at 03:13:22PM -0700, Yang Shi wrote:
> > > > > ...
> > > > > > > >
> > > > > > > > There was a discussion about another approach of keeping error pages in page
> > > > > > > > cache for filesystem without backend storage.
> > > > > > > > https://lore.kernel.org/lkml/alpine.LSU.2.11.2103111312310.7859@eggly.anvils/
> > > > > > > > This approach seems to me less complicated, but one concern is that this
> > > > > > > > change affects user-visible behavior of memory errors.  Keeping error pages
> > > > > > > > in page cache means that the errors are persistent until next system reboot,
> > > > > > > > so we might need to define the way to clear the errors to continue to use
> > > > > > > > the error file.  Current implementation is just to send SIGBUS to the
> > > > > > > > mapping processes (at least once), then forget about the error, so there is
> > > > > > > > no such issue.
> > > > > > > >
> > > > > > > > Another thought of possible solution might be to send SIGBUS immediately when
> > > > > > > > a memory error happens on a shmem thp. We can find all the mapping processes
> > > > > > > > before splitting shmem thp, so send SIGBUS first, then split it and contain
> > > > > > > > the error page.  This is not elegant (giving up any optional actions) but
> > > > > > > > anyway we can avoid the silent data lost.
> > > > > > >
> > > > > > > Thanks a lot. I apologize I didn't notice you already posted a similar
> > > > > > > patch before.
> > > > > > >
> > > > > > > Yes, I think I focused on the soft offline part too much and missed
> > > > > > > the uncorrected error part and I admit I did underestimate the
> > > > > > > problem.
> > > > > > >
> > > > > > > I think Hugh's suggestion makes sense if we treat tmpfs as a regular
> > > > > > > filesystem (just memory backed). AFAIK, some filesystem, e.g. btrfs,
> > > > > > > may do checksum after reading from storage block then return an error
> > > > > > > if checksum is not right since it may indicate hardware failure on
> > > > > > > disk. Then the syscalls or page fault return error or SIGBUS.
> > > > > > >
> > > > > > > So in shmem/tmpfs case, if hwpoisoned page is met, just return error
> > > > > > > (-EIO or whatever) for syscall or SIGBUS for page fault. It does align
> > > > > > > with the behavior of other filesystems. It is definitely applications'
> > > > > > > responsibility to check the return value of read/write syscalls.
> > > > > >
> > > > > > BTW, IIUC the dirty regular page cache (storage backed) would be left
> > > > > > in the page cache too, the clean page cache would be truncated since
> > > > > > they can be just reread from storage, right?
> > > > >
> > > > > A dirty page cache is also removed on error (me_pagecache_dirty() falls
> > > > > through me_pagecache_clean(), then truncate_error_page() is called).
> > > > > The main purpose of this is to separate off the error page from exising
> > > > > data structures to minimize the risk of later accesses (maybe by race or bug).
> > > > > But we can change this behavior for specific file systems by updating
> > > > > error_remove_page() callbacks in address_space_operation.
> > > >
> > > > Yeah, if fs's error_remove_page() is defined. It seems the filesystems
> > > > which have error_remove_page() defined just use generic_remove_page()
> > > > except hugetlbfs. And the generic implementation just clears the dirty
> > > > flag and removes the page from page cache.
> > > >
> > > > If error_remove_page() is not defined, the page would stay in page
> > > > cache since invalidate_inode_page() can't remove dirty page.
> > > >
> > > > >
> > > > > Honestly, it seems to me that how dirty data is lost does not depend on
> > > > > file system, and I'm still not sure that this is really a right approach
> > > > > for the current issue.
> > > >
> > > > IMHO the biggest problem is that applications may see
> > > > obsolete/inconsistent data silently, right? Actually keeping the
> > > > corrupted page in page cache should be able to notify applications
> > > > that they are accessing inconsistent data.
> > >
> > > The removal from page cache behavior may be much worse for shmem/tmpfs
> > > since it actually removes the whole data blocks for the file. The user
> > > will get all zero if the corrupted blocks are read without any
> > > notification.
> > >
> > > The more I stared at the code and had tests done, the more I think we
> > > should keep the corrupted page in page cache and notify the users.
> > >
> > > It seems easier for readonly filesystem. Just remove the page from
> > > page cache since it always could read data from disk. This is also the
> > > current behavior.
> > >
> > > For shmem, the page could be kept in page cache with dirty flag set
> > > since it won't be written back.
> > >
> > > For regular filesystems that could do writeback, things are a little
> > > bit more complicated since we need to prevent from writing back by
> > > clearing dirty flag. Other than writeback we also need to distinguish
> > > cache drop from truncation/hole punch/unlink. We don't want cache drop
> > > (e.g. echo 1 > /proc/sys/vm/drop_caches) drop corrupted page. But
> > > truncate/hole punch/unlink should be fine to remove the page since the
> > > underlying data blocks will be gone too.
> > >
> > > Thanks to the refcount pin done by memory failure, cache drop can't
> > > drop the page since it checks if the refcount is expected or not.
> > > Truncate/hole punch/unlink doesn't check refcount so they could
> > > proceed. But inode evict (slab shrinking path) may call truncate, so
> > > the corrupted page may still be removed from page cache when the
> > > underlying data blocks still exist IIUC. There might be other paths in
> > > filesystems to have page cache truncate but the underlying data blocks
> > > are still present.
> > >
> > > The read/write syscalls also need check hwpoisoned flag. I'm not sure
> > > if I miss other syscalls or not.
> >
> > Thanks for analyzing, I think that filesystems supporting writeback might
> > call filemap_write_and_wait() when evicting inode, which can notify error
> > by AS_EIO flag in struct address_space.  But it does nothing for no-writeback
> > filesystems like shmem.  So we need better error report for them.
> 
> AFAIK the address_space error just works for fsync. Anyway I could be wrong.
> 
> I think clearing the dirty flag might be the easiest way? It seems
> unnecessary to notify the users when writing back since the most write
> back happens asynchronously. They should be notified when the page is
> accessed, e.g. read/write and page fault.
> 
> I did some further investigation and got a clearer picture for
> writeback filesystem:
> 1. The page should be not written back: clearing dirty flag could
> prevent from writeback
> 2. The page should be not dropped (it shows as a clean page): the
> refcount pin from hwpoison could prevent from invalidating (called by
> cache drop, inode cache shrinking, etc), but it doesn't avoid
> invalidation in DIO path (easy to deal with)
> 3. The page should be able to get truncated/hole punched/unlinked: it
> works as it is
> 4. Notify users when the page is accessed, e.g. read/write, page fault
> and other paths: this is hard
> 
> The hardest part is #4. Since there are too many paths in filesystems
> that do *NOT* check if page is poisoned or not, e.g. read/write,
> compression (btrfs, f2fs), etc. A couple of ways to handle it off the
> top of my head:
> 1. Check hwpoison flag for every path, the most straightforward way,
> but a lot work
> 2. Return NULL for poisoned page from page cache lookup, the most
> callsites check if NULL is returned, this should have least work I
> think. But the error handling in filesystems just return -ENOMEM, the
> error code will incur confusion to the users obviously.
> 3. To improve #2, we could return error pointer, e.g. ERR_PTR(-EIO),
> but this will involve significant amount of code change as well since
> all the paths need check if the pointer is ERR or not.

I think approach #3 sounds good for now. It seems to me that these
statements are about general ways to handle error pages for all page cache
users, in which case the amount of code change is a big problem; but when
focusing on shmem/tmpfs, could the amount of change be more manageable, or
is it still large?

> 4. #3 may break truncate, so maybe need convert hwpoison page to
> exceptional entry? I'm not sure if I'm over-engineered or not.

I'm thinking of taking a refcount on the inode associated with the error
pages to prevent truncation via inode eviction, which might be worth considering.

We somehow need to keep the error information in memory, and it should be
kept from being removed, except perhaps on system reboot or an explicit
removal operation by the file's user.  So it would be helpful if we can
handle inode eviction both from slab shrinking and from explicit calls by users.

> 5. We also could define a new FGP flag to return poisoned page, NULL
> or error pointer. This also should need significant code change since
> a lt callsites need to be contemplated.

Could you explain a little more about which callers should use the flag?

> 
> Most filesystems use generic page fault handler (filemap_fault), just
> very few have their own implementation, e.g. ceph, DAX, etc.
> 
> I don't know which way we should go, should need more advice.

I think that DAX doesn't use the page cache, so it needs special
handling as a separate issue anyway.

Thanks,
Naoya Horiguchi

> 
> >
> > >
> > > I'm not a filesystem expert so I'm not sure if I'm missing something
> > > else or not. But I'm supposed the most should be covered.
> > >
> > > I'd like to start with shmem/tmpfs since it is relatively easier and
> > > this also could unblock shmem THP hwpoison support. Any comment is
> > > welcome.
> >
> > I think that keeping corrupted pages in page cache can be a better solution.
> > So if you plan to write a patchset, I'm glad to review/test it.  Starting with
> > shmem/tmpfs sounds nice to me.
> 
> Thank you very much. Yes, I'm working on it. And I already had some
> patches work (just check if page is hwpoisoned in every path, #1
> approach listed above). I could clean up the patches and send out to
> gather some feedback.
Yang Shi Sept. 3, 2021, 6:01 p.m. UTC | #10
On Fri, Sep 3, 2021 at 4:53 AM HORIGUCHI NAOYA(堀口 直也)
<naoya.horiguchi@nec.com> wrote:
>
> On Thu, Sep 02, 2021 at 11:32:49AM -0700, Yang Shi wrote:
> > On Wed, Sep 1, 2021 at 8:07 PM HORIGUCHI NAOYA(堀口 直也)
> > <naoya.horiguchi@nec.com> wrote:
> > >
> > > [...]
> > > Thanks for analyzing, I think that filesystems supporting writeback might
> > > call filemap_write_and_wait() when evicting inode, which can notify error
> > > by AS_EIO flag in struct address_space.  But it does nothing for no-writeback
> > > filesystems like shmem.  So we need better error report for them.
> >
> > AFAIK the address_space error just works for fsync. Anyway I could be wrong.
> >
> > I think clearing the dirty flag might be the easiest way? It seems
> > unnecessary to notify the users when writing back since the most write
> > back happens asynchronously. They should be notified when the page is
> > accessed, e.g. read/write and page fault.
> >
> > I did some further investigation and got a clearer picture for
> > writeback filesystem:
> > 1. The page should be not written back: clearing dirty flag could
> > prevent from writeback
> > 2. The page should be not dropped (it shows as a clean page): the
> > refcount pin from hwpoison could prevent from invalidating (called by
> > cache drop, inode cache shrinking, etc), but it doesn't avoid
> > invalidation in DIO path (easy to deal with)
> > 3. The page should be able to get truncated/hole punched/unlinked: it
> > works as it is
> > 4. Notify users when the page is accessed, e.g. read/write, page fault
> > and other paths: this is hard
> >
> > The hardest part is #4. Since there are too many paths in filesystems
> > that do *NOT* check if page is poisoned or not, e.g. read/write,
> > compression (btrfs, f2fs), etc. A couple of ways to handle it off the
> > top of my head:
> > 1. Check hwpoison flag for every path, the most straightforward way,
> > but a lot work
> > 2. Return NULL for poisoned page from page cache lookup, the most
> > callsites check if NULL is returned, this should have least work I
> > think. But the error handling in filesystems just return -ENOMEM, the
> > error code will incur confusion to the users obviously.
> > 3. To improve #2, we could return error pointer, e.g. ERR_PTR(-EIO),
> > but this will involve significant amount of code change as well since
> > all the paths need check if the pointer is ERR or not.
>
> I think the approach #3 sounds good for now, it seems to me that these
> statements are about general ways to handle error pages on all page cache
> users, so then the amount of code changes is a big problem, but when
> focusing on shmem/tmpfs, could the amount of changes be more handlable, or
> still large?

Yeah, I agree #3 makes more sense. Just return an error when a
corrupted page is found. I think this is the right semantics.

The amount of work for shmem should not be big.

For the other writeback filesystems, we could convert them to check
whether the page is an error pointer one at a time. Once all
filesystems have been converted, we change the hwpoison code to keep
the corrupted page in page cache, so that we don't have to do
everything in one gigantic patchset. This approach should be more
manageable and review-friendly IMHO.

>
> > 4. #3 may break truncate, so maybe need convert hwpoison page to
> > exceptional entry? I'm not sure if I'm over-engineered or not.
>
> I think of taking refcount of inode associated with error pages to prevent
> the truncation via inode eviction, which might be worth considering.

memory_failure() already does that via get_hwpoison_page(), right?

>
> We somehow need keep error information on memory, which should be kept from
> being removed, maybe except system reboot and explicit removal operation
> from the file's user.  So it might be helpful if we can deal with
> inode_evict from slab shrinking and from explicit calls from users.

The refcount pin from get_hwpoison_page() could prevent the page from
being released by inode eviction from the slab shrinking path, since
invalidate_mapping_pages() is used, which does need to freeze the
refcount.

In the beginning I thought the problem was: if we return an error
pointer for truncate/invalidate, how can we convert it back to a
regular page pointer, since truncate still needs to remove the page
from page cache? But further investigation shows they call
find_get_entry(), which just iterates the xarray and returns the page.
So it seems we don't need to do anything; returning the poisoned page
is fine for the invalidate/truncate paths.

There are some other callers of find_get_entry() that may need
attention. A quick look shows they could back off sooner if a poisoned
page is found.

>
> > 5. We also could define a new FGP flag to return poisoned page, NULL
> > or error pointer. This also should need significant code change since
> > a lt callsites need to be contemplated.
>
> Could you explain a little more about which callers should use the flag?

Just to solve the above invalidate/truncate problem, and because page
fault doesn't expect an error pointer. But it seems the
invalidate/truncate paths don't matter after all. Page fault should be
the only user, since page fault may need to unlock the page if a
poisoned page is returned.

>
> >
> > Most filesystems use generic page fault handler (filemap_fault), just
> > very few have their own implementation, e.g. ceph, DAX, etc.
> >
> > I don't know which way we should go, should need more advice.
>
> I think that DAX doesn't use page cache, so I think that it anyway needs
> special handling as a separate issue.

Yes, DAX doesn't use page cache, but the filesystem's data blocks
actually are memory. It seems the PMEM driver already has some
mechanism to handle poisoned pages. I don't think we need to handle it
right now.

>
> Thanks,
> Naoya Horiguchi
>
> >
> > >
> > > >
> > > > I'm not a filesystem expert so I'm not sure if I'm missing something
> > > > else or not. But I'm supposed the most should be covered.
> > > >
> > > > I'd like to start with shmem/tmpfs since it is relatively easier and
> > > > this also could unblock shmem THP hwpoison support. Any comment is
> > > > welcome.
> > >
> > > I think that keeping corrupted pages in page cache can be a better solution.
> > > So if you plan to write a patchset, I'm glad to review/test it.  Starting with
> > > shmem/tmpfs sounds nice to me.
> >
Thank you very much. Yes, I'm working on it. And I already have some
patches working (just checking whether the page is hwpoisoned in every
path, approach #1 listed above). I could clean up the patches and send
them out to gather some feedback.
Yang Shi Sept. 4, 2021, 12:03 a.m. UTC | #11
On Fri, Sep 3, 2021 at 11:01 AM Yang Shi <shy828301@gmail.com> wrote:
>
> On Fri, Sep 3, 2021 at 4:53 AM HORIGUCHI NAOYA(堀口 直也)
> <naoya.horiguchi@nec.com> wrote:
> >
> > On Thu, Sep 02, 2021 at 11:32:49AM -0700, Yang Shi wrote:
> [...]
> >
> > > 5. We also could define a new FGP flag to return poisoned page, NULL
> > > or error pointer. This also should need significant code change since
> > > a lt callsites need to be contemplated.
> >
> > Could you explain a little more about which callers should use the flag?
>
> Just to solve the above invalidate/truncate problem and page fault
> doesn't expect an error pointer. But it seems the above
> invalidate/truncate paths don't matter. Page fault should be the only
> user since page fault may need unlock the page if poisoned page is
> returned.

It seems the page fault path can check IS_ERR(page) and just return
VM_FAULT_HWPOISON. But I found a couple of places in shmem which want
the head page returned so they can handle the subpage, or which just
want the page without caring about its contents. They should ignore
hwpoison, so I guess we'd better have an FGP flag for such cases.

> [...]
Yang Shi Sept. 7, 2021, 9:34 p.m. UTC | #12
On Fri, Sep 3, 2021 at 5:03 PM Yang Shi <shy828301@gmail.com> wrote:
>
> On Fri, Sep 3, 2021 at 11:01 AM Yang Shi <shy828301@gmail.com> wrote:
> >
> > On Fri, Sep 3, 2021 at 4:53 AM HORIGUCHI NAOYA(堀口 直也)
> > <naoya.horiguchi@nec.com> wrote:
> > >
> > > On Thu, Sep 02, 2021 at 11:32:49AM -0700, Yang Shi wrote:
> > > > On Wed, Sep 1, 2021 at 8:07 PM HORIGUCHI NAOYA(堀口 直也)
> > > > <naoya.horiguchi@nec.com> wrote:
> > > > >
> > > > > On Mon, Aug 30, 2021 at 04:44:06PM -0700, Yang Shi wrote:
> > > > > > On Thu, Aug 26, 2021 at 10:02 PM Yang Shi <shy828301@gmail.com> wrote:
> > > > > > >
> > > > > > > On Thu, Aug 26, 2021 at 8:57 PM HORIGUCHI NAOYA(堀口 直也)
> > > > > > > <naoya.horiguchi@nec.com> wrote:
> > > > > > > >
> > > > > > > > On Thu, Aug 26, 2021 at 03:03:57PM -0700, Yang Shi wrote:
> > > > > > > > > On Thu, Aug 26, 2021 at 1:03 PM Yang Shi <shy828301@gmail.com> wrote:
> > > > > > > > > >
> > > > > > > > > > On Wed, Aug 25, 2021 at 11:17 PM HORIGUCHI NAOYA(堀口 直也)
> > > > > > > > > > <naoya.horiguchi@nec.com> wrote:
> > > > > > > > > > >
> > > > > > > > > > > On Tue, Aug 24, 2021 at 03:13:22PM -0700, Yang Shi wrote:
> > > > > > > > ...
> > > > > > > > > > >
> > > > > > > > > > > There was a discussion about another approach of keeping error pages in page
> > > > > > > > > > > cache for filesystem without backend storage.
> > > > > > > > > > > https://lore.kernel.org/lkml/alpine.LSU.2.11.2103111312310.7859@eggly.anvils/
> > > > > > > > > > > This approach seems to me less complicated, but one concern is that this
> > > > > > > > > > > change affects user-visible behavior of memory errors.  Keeping error pages
> > > > > > > > > > > in page cache means that the errors are persistent until next system reboot,
> > > > > > > > > > > so we might need to define the way to clear the errors to continue to use
> > > > > > > > > > > the error file.  Current implementation is just to send SIGBUS to the
> > > > > > > > > > > mapping processes (at least once), then forget about the error, so there is
> > > > > > > > > > > no such issue.
> > > > > > > > > > >
> > > > > > > > > > > Another thought of possible solution might be to send SIGBUS immediately when
> > > > > > > > > > > a memory error happens on a shmem thp. We can find all the mapping processes
> > > > > > > > > > > before splitting shmem thp, so send SIGBUS first, then split it and contain
> > > > > > > > > > > the error page.  This is not elegant (giving up any optional actions) but
> > > > > > > > > > > anyway we can avoid the silent data lost.
> > > > > > > > > >
> > > > > > > > > > Thanks a lot. I apologize I didn't notice you already posted a similar
> > > > > > > > > > patch before.
> > > > > > > > > >
> > > > > > > > > > Yes, I think I focused on the soft offline part too much and missed
> > > > > > > > > > the uncorrected error part and I admit I did underestimate the
> > > > > > > > > > problem.
> > > > > > > > > >
> > > > > > > > > > I think Hugh's suggestion makes sense if we treat tmpfs as a regular
> > > > > > > > > > filesystem (just memory backed). AFAIK, some filesystem, e.g. btrfs,
> > > > > > > > > > may do checksum after reading from storage block then return an error
> > > > > > > > > > if checksum is not right since it may indicate hardware failure on
> > > > > > > > > > disk. Then the syscalls or page fault return error or SIGBUS.
> > > > > > > > > >
> > > > > > > > > > So in shmem/tmpfs case, if hwpoisoned page is met, just return error
> > > > > > > > > > (-EIO or whatever) for syscall or SIGBUS for page fault. It does align
> > > > > > > > > > with the behavior of other filesystems. It is definitely applications'
> > > > > > > > > > responsibility to check the return value of read/write syscalls.
> > > > > > > > >
> > > > > > > > > BTW, IIUC the dirty regular page cache (storage backed) would be left
> > > > > > > > > in the page cache too, the clean page cache would be truncated since
> > > > > > > > > they can be just reread from storage, right?
> > > > > > > >
> > > > > > > > A dirty page cache is also removed on error (me_pagecache_dirty() falls
> > > > > > > > through me_pagecache_clean(), then truncate_error_page() is called).
> > > > > > > > The main purpose of this is to separate off the error page from exising
> > > > > > > > data structures to minimize the risk of later accesses (maybe by race or bug).
> > > > > > > > But we can change this behavior for specific file systems by updating
> > > > > > > > error_remove_page() callbacks in address_space_operation.
> > > > > > >
> > > > > > > Yeah, if fs's error_remove_page() is defined. It seems the filesystems
> > > > > > > which have error_remove_page() defined just use generic_remove_page()
> > > > > > > except hugetlbfs. And the generic implementation just clears the dirty
> > > > > > > flag and removes the page from page cache.
> > > > > > >
> > > > > > > If error_remove_page() is not defined, the page would stay in page
> > > > > > > cache since invalidate_inode_page() can't remove dirty page.
> > > > > > >
> > > > > > > >
> > > > > > > > Honestly, it seems to me that how dirty data is lost does not depend on
> > > > > > > > file system, and I'm still not sure that this is really a right approach
> > > > > > > > for the current issue.
> > > > > > >
> > > > > > > IMHO the biggest problem is that applications may see
> > > > > > > obsolete/inconsistent data silently, right? Actually keeping the
> > > > > > > corrupted page in page cache should be able to notify applications
> > > > > > > that they are accessing inconsistent data.
> > > > > >
> > > > > > The removal from page cache behavior may be much worse for shmem/tmpfs
> > > > > > since it actually removes the whole data blocks for the file. The user
> > > > > > will get all zero if the corrupted blocks are read without any
> > > > > > notification.
> > > > > >
> > > > > > The more I stared at the code and had tests done, the more I think we
> > > > > > should keep the corrupted page in page cache and notify the users.
> > > > > >
> > > > > > It seems easier for readonly filesystem. Just remove the page from
> > > > > > page cache since it always could read data from disk. This is also the
> > > > > > current behavior.
> > > > > >
> > > > > > For shmem, the page could be kept in page cache with dirty flag set
> > > > > > since it won't be written back.
> > > > > >
> > > > > > For regular filesystems that could do writeback, things are a little
> > > > > > bit more complicated since we need to prevent from writing back by
> > > > > > clearing dirty flag. Other than writeback we also need to distinguish
> > > > > > cache drop from truncation/hole punch/unlink. We don't want cache drop
> > > > > > (e.g. echo 1 > /proc/sys/vm/drop_caches) to drop the corrupted page. But
> > > > > > truncate/hole punch/unlink should be fine to remove the page since the
> > > > > > underlying data blocks will be gone too.
> > > > > >
> > > > > > Thanks to the refcount pin done by memory failure, cache drop can't
> > > > > > drop the page since it checks if the refcount is expected or not.
> > > > > > Truncate/hole punch/unlink doesn't check refcount so they could
> > > > > > proceed. But inode evict (slab shrinking path) may call truncate, so
> > > > > > the corrupted page may still be removed from page cache when the
> > > > > > underlying data blocks still exist IIUC. There might be other paths in
> > > > > > filesystems to have page cache truncate but the underlying data blocks
> > > > > > are still present.
> > > > > >
> > > > > > The read/write syscalls also need to check the hwpoisoned flag. I'm not
> > > > > > sure if I'm missing other syscalls.
> > > > >
> > > > > Thanks for analyzing, I think that filesystems supporting writeback might
> > > > > call filemap_write_and_wait() when evicting inode, which can report the error
> > > > > via the AS_EIO flag in struct address_space.  But it does nothing for no-writeback
> > > > > filesystems like shmem.  So we need better error report for them.
> > > >
> > > > AFAIK the address_space error just works for fsync. Anyway I could be wrong.
> > > >
> > > > I think clearing the dirty flag might be the easiest way? It seems
> > > > unnecessary to notify the users when writing back since most writeback
> > > > happens asynchronously. They should be notified when the page is
> > > > accessed, e.g. read/write and page fault.
> > > >
> > > > I did some further investigation and got a clearer picture for
> > > > writeback filesystem:
> > > > 1. The page should not be written back: clearing the dirty flag could
> > > > prevent from writeback
> > > > 2. The page should not be dropped (it shows as a clean page): the
> > > > refcount pin from hwpoison could prevent from invalidating (called by
> > > > cache drop, inode cache shrinking, etc), but it doesn't avoid
> > > > invalidation in DIO path (easy to deal with)
> > > > 3. The page should be able to get truncated/hole punched/unlinked: it
> > > > works as it is
> > > > 4. Notify users when the page is accessed, e.g. read/write, page fault
> > > > and other paths: this is hard
> > > >
> > > > The hardest part is #4. Since there are too many paths in filesystems
> > > > that do *NOT* check if page is poisoned or not, e.g. read/write,
> > > > compression (btrfs, f2fs), etc. A couple of ways to handle it off the
> > > > top of my head:
> > > > 1. Check hwpoison flag for every path, the most straightforward way,
> > > > but a lot of work
> > > > 2. Return NULL for a poisoned page from page cache lookup; most
> > > > callsites already check for NULL, so this should need the least work,
> > > > I think. But the error handling in filesystems just returns -ENOMEM,
> > > > and that error code will obviously confuse users.
> > > > 3. To improve #2, we could return error pointer, e.g. ERR_PTR(-EIO),
> > > > but this will involve significant amount of code change as well since
> > > > all the paths need check if the pointer is ERR or not.
> > >
> > > I think the approach #3 sounds good for now, it seems to me that these
> > > statements are about general ways to handle error pages on all page cache
> > > users, so then the amount of code changes is a big problem, but when
> > > focusing on shmem/tmpfs, could the amount of changes be more manageable, or
> > > still large?
> >
> > Yeah, I agree #3 makes more sense. Just return an error when finding
> > out corrupted page. I think this is the right semantic.

I had a prototype patchset for #3, but now I have to consider stepping
back to rethink which way we should go. I actually did a patchset for
#1 too. By comparing the two patchsets and some deeper investigation
during preparing the two patchsets, I realized #3 may not be the best
way.

For #3, ERR_PTR will be returned, so all the callers need to check the
return value, otherwise an invalid pointer may be dereferenced. But not
all callers really care about the content of the page; for example,
partial truncate just sets the truncated range in one page to 0.
So for such paths it needs additional modification if ERR_PTR is
returned. And if the callers have their own way to handle the
problematic pages we need to add a new FGP flag to tell FGP functions
to return the pointer to the page.

But we don't need to do anything for such paths if we go with #1. For
most paths (e.g. read/write) both ways need to check the validity of
the page by checking ERR_PTR or the page flag. But a lot of extra code
could be saved for the paths which don't care about the actual data
for #1.

> >
> > The amount of work for shmem should not be big.
> >
> > For other writeback filesystems we could make the filesystems check if
> > the page is error pointer or not one at a time. Once all filesystems
> > have been done, we change hwpoison code to keep corrupted page in page
> > cache so that we don't have to do all the things altogether in one
> > gigantic patchset. This approach should be more manageable and review
> > friendly IMHO.
> >
> > >
> > > > 4. #3 may break truncate, so maybe need convert hwpoison page to
> > > > exceptional entry? I'm not sure if I'm over-engineered or not.
> > >
> > > I think of taking refcount of inode associated with error pages to prevent
> > > the truncation via inode eviction, which might be worth considering.
> >
> > The memory_failure already does it by get_hwpoison_page(), right?
> >
> > >
> > > We somehow need keep error information on memory, which should be kept from
> > > being removed, maybe except system reboot and explicit removal operation
> > > from the file's user.  So it might be helpful if we can deal with
> > > inode_evict from slab shrinking and from explicit calls from users.
> >
> > The refcount pin from get_hwpoison_page() could prevent the page from
> > being released by inode eviction from slab shrinking path since
> > invalidate_mapping_pages() is used which does need to freeze refcount.
> >
> > In the beginning I thought the problem is if we return error pointer
> > for truncate/invalidate, how can we convert it to regular page pointer
> > since truncate still need remove the page from page cache. But the
> > further investigation shows they call find_get_entry() which just
> > iterates xarray to return page. So it seems we don't need to do
> > anything. It doesn't matter to return poisoned page for
> > invalidate/truncate paths.
> >
> > There are some other callers of find_get_entry() may need attention. A
> > quick look shows they could backoff sooner if poisoned page is found.
> >
> > >
> > > > 5. We also could define a new FGP flag to return poisoned page, NULL
> > > > or error pointer. This also should need significant code change since
> > > > a lot of callsites need to be contemplated.
> > >
> > > Could you explain a little more about which callers should use the flag?
> >
> > Just to solve the above invalidate/truncate problem and page fault
> > doesn't expect an error pointer. But it seems the above
> > invalidate/truncate paths don't matter. Page fault should be the only
> > user since page fault may need unlock the page if poisoned page is
> > returned.
>
> It seems page fault check IS_ERR(page) then just return
> VM_FAULT_HWPOISON. But I found a couple of places in shmem which want
> to return head page then handle subpage or just return the page but
> don't care the content of the page. They should ignore hwpoison. So I
> guess we'd better to have a FGP flag for such cases.
>
> >
> > >
> > > >
> > > > Most filesystems use generic page fault handler (filemap_fault), just
> > > > very few have their own implementation, e.g. ceph, DAX, etc.
> > > >
> > > > I don't know which way we should go, should need more advice.
> > >
> > > I think that DAX doesn't use page cache, so I think that it anyway needs
> > > special handling as a separate issue.
> >
> > Yes, DAX doesn't use page cache, but the data blocks of filesystem
> > actually are memory. It seems PMEM driver already has some mechanism
> > to handle poisoned page. I don't think we need handle it right now.
> >
> > >
> > > Thanks,
> > > Naoya Horiguchi
> > >
> > > >
> > > > >
> > > > > >
> > > > > > I'm not a filesystem expert so I'm not sure if I'm missing something
> > > > > > else or not. But I'm supposed the most should be covered.
> > > > > >
> > > > > > I'd like to start with shmem/tmpfs since it is relatively easier and
> > > > > > this also could unblock shmem THP hwpoison support. Any comment is
> > > > > > welcome.
> > > > >
> > > > > I think that keeping corrupted pages in page cache can be a better solution.
> > > > > So if you plan to write a patchset, I'm glad to review/test it.  Starting with
> > > > > shmem/tmpfs sounds nice to me.
> > > >
> > > > Thank you very much. Yes, I'm working on it. And I already have some
> > > > working patches (just check if the page is hwpoisoned in every path, the #1
> > > > approach listed above). I could clean up the patches and send out to
> > > > gather some feedback.
HORIGUCHI NAOYA(堀口 直也) Sept. 8, 2021, 2:50 a.m. UTC | #13
On Tue, Sep 07, 2021 at 02:34:24PM -0700, Yang Shi wrote:
> On Fri, Sep 3, 2021 at 5:03 PM Yang Shi <shy828301@gmail.com> wrote:
> >
> > On Fri, Sep 3, 2021 at 11:01 AM Yang Shi <shy828301@gmail.com> wrote:
...
> > > > >
> > > > > AFAIK the address_space error just works for fsync. Anyway I could be wrong.
> > > > >
> > > > > I think clearing the dirty flag might be the easiest way? It seems
> > > > > unnecessary to notify the users when writing back since most writeback
> > > > > happens asynchronously. They should be notified when the page is
> > > > > accessed, e.g. read/write and page fault.
> > > > >
> > > > > I did some further investigation and got a clearer picture for
> > > > > writeback filesystem:
> > > > > 1. The page should not be written back: clearing the dirty flag could
> > > > > prevent from writeback
> > > > > 2. The page should not be dropped (it shows as a clean page): the
> > > > > refcount pin from hwpoison could prevent from invalidating (called by
> > > > > cache drop, inode cache shrinking, etc), but it doesn't avoid
> > > > > invalidation in DIO path (easy to deal with)
> > > > > 3. The page should be able to get truncated/hole punched/unlinked: it
> > > > > works as it is
> > > > > 4. Notify users when the page is accessed, e.g. read/write, page fault
> > > > > and other paths: this is hard
> > > > >
> > > > > The hardest part is #4. Since there are too many paths in filesystems
> > > > > that do *NOT* check if page is poisoned or not, e.g. read/write,
> > > > > compression (btrfs, f2fs), etc. A couple of ways to handle it off the
> > > > > top of my head:
> > > > > 1. Check hwpoison flag for every path, the most straightforward way,
> > > > > but a lot of work
> > > > > 2. Return NULL for a poisoned page from page cache lookup; most
> > > > > callsites already check for NULL, so this should need the least work,
> > > > > I think. But the error handling in filesystems just returns -ENOMEM,
> > > > > and that error code will obviously confuse users.
> > > > > 3. To improve #2, we could return error pointer, e.g. ERR_PTR(-EIO),
> > > > > but this will involve significant amount of code change as well since
> > > > > all the paths need check if the pointer is ERR or not.
> > > >
> > > > I think the approach #3 sounds good for now, it seems to me that these
> > > > statements are about general ways to handle error pages on all page cache
> > > > users, so then the amount of code changes is a big problem, but when
> > > > focusing on shmem/tmpfs, could the amount of changes be more manageable, or
> > > > still large?
> > >
> > > Yeah, I agree #3 makes more sense. Just return an error when finding
> > > out corrupted page. I think this is the right semantic.
> 
> I had a prototype patchset for #3, but now I have to consider stepping
> back to rethink which way we should go. I actually did a patchset for
> #1 too. By comparing the two patchsets and some deeper investigation
> during preparing the two patchsets, I realized #3 may not be the best
> way.
> 
> For #3, ERR_PTR will be returned, so all the callers need to check the
> return value, otherwise an invalid pointer may be dereferenced. But not
> all callers really care about the content of the page; for example,
> partial truncate just sets the truncated range in one page to 0.
> So for such paths it needs additional modification if ERR_PTR is
> returned. And if the callers have their own way to handle the
> problematic pages we need to add a new FGP flag to tell FGP functions
> to return the pointer to the page.
> 
> But we don't need to do anything for such paths if we go with #1. For
> most paths (e.g. read/write) both ways need to check the validity of
> the page by checking ERR_PTR or the page flag. But a lot of extra code
> could be saved for the paths which don't care about the actual data
> for #1.

OK, so it's not clear which is better. Maybe we need discussion over
patches...

...

> > >
> > > >
> > > > > 5. We also could define a new FGP flag to return poisoned page, NULL
> > > > > or error pointer. This also should need significant code change since
> > > > > a lot of callsites need to be contemplated.
> > > >
> > > > Could you explain a little more about which callers should use the flag?
> > >
> > > Just to solve the above invalidate/truncate problem and page fault
> > > doesn't expect an error pointer. But it seems the above
> > > invalidate/truncate paths don't matter. Page fault should be the only
> > > user since page fault may need unlock the page if poisoned page is
> > > returned.

Originally, PG_hwpoison is detected in the page fault handler for file-backed pages
like below,

  static vm_fault_t __do_fault(struct vm_fault *vmf)
  ...
          ret = vma->vm_ops->fault(vmf);
          if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY |
                              VM_FAULT_DONE_COW)))
                  return ret;
  
          if (unlikely(PageHWPoison(vmf->page))) {
                  if (ret & VM_FAULT_LOCKED)
                          unlock_page(vmf->page);
                  put_page(vmf->page);
                  vmf->page = NULL;
                  return VM_FAULT_HWPOISON;
          }

so it seems to me that if a filesystem switches to the new scheme of keeping
error pages in page cache, ->fault() callback for the filesystem needs to be
aware of hwpoisoned pages inside it and returns VM_FAULT_HWPOISON when it
detects an error page (maybe by detecting pagecache_get_page() to return
PTR_ERR(-EIO or -EHWPOISON) or some other ways).  In the other filesystems
with the old scheme, fault() callback does not return VM_FAULT_HWPOISON so
that page fault handler falls back to the generic PageHWPoison check above.
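The dispatch described above can be sketched as a small userspace model. The types, the VM_FAULT_HWPOISON value, and model_fault() are all illustrative (only the shape of the check matches the __do_fault() snippet quoted earlier), not the real mm API:

```c
/*
 * Userspace model of the check quoted from __do_fault(): a ->fault()
 * implementation aware of the new scheme reports VM_FAULT_HWPOISON
 * instead of handing a poisoned page back to the fault handler.
 * Struct layouts and the flag value are illustrative only.
 */
#include <stddef.h>

#define VM_FAULT_HWPOISON 0x10  /* value chosen for the model, not the kernel's */

struct page { int hwpoison; };
struct vm_fault { struct page *page; };

static int model_fault(struct vm_fault *vmf, struct page *found)
{
	/* With the new scheme, the page cache lookup may return a poisoned page. */
	if (found && found->hwpoison) {
		vmf->page = NULL;          /* don't expose the page, as __do_fault() does */
		return VM_FAULT_HWPOISON;  /* caller turns this into SIGBUS semantics */
	}
	vmf->page = found;
	return 0;
}
```

Filesystems on the old scheme simply never take the poisoned branch, so the generic PageHWPoison check in __do_fault() keeps covering them.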

> >
> > It seems page fault check IS_ERR(page) then just return
> > VM_FAULT_HWPOISON. But I found a couple of places in shmem which want
> > to return head page then handle subpage or just return the page but
> > don't care the content of the page. They should ignore hwpoison. So I
> > guess we'd better to have a FGP flag for such cases.

At least currently thp are supposed to be split before error handling.
We could loosen this assumption to support hwpoison on a subpage of thp,
but I'm not sure whether that has some benefit.
And also this discussion makes me aware that we need to have a way to
prevent hwpoisoned pages from being used in thp collapse operations.

Thanks,
Naoya Horiguchi
Yang Shi Sept. 8, 2021, 3:14 a.m. UTC | #14
On Tue, Sep 7, 2021 at 7:50 PM HORIGUCHI NAOYA(堀口 直也)
<naoya.horiguchi@nec.com> wrote:
>
> On Tue, Sep 07, 2021 at 02:34:24PM -0700, Yang Shi wrote:
> > On Fri, Sep 3, 2021 at 5:03 PM Yang Shi <shy828301@gmail.com> wrote:
> > >
> > > On Fri, Sep 3, 2021 at 11:01 AM Yang Shi <shy828301@gmail.com> wrote:
> ...
> > > > > >
> > > > > > AFAIK the address_space error just works for fsync. Anyway I could be wrong.
> > > > > >
> > > > > > I think clearing the dirty flag might be the easiest way? It seems
> > > > > > unnecessary to notify the users when writing back since most writeback
> > > > > > happens asynchronously. They should be notified when the page is
> > > > > > accessed, e.g. read/write and page fault.
> > > > > >
> > > > > > I did some further investigation and got a clearer picture for
> > > > > > writeback filesystem:
> > > > > > 1. The page should not be written back: clearing the dirty flag could
> > > > > > prevent from writeback
> > > > > > 2. The page should not be dropped (it shows as a clean page): the
> > > > > > refcount pin from hwpoison could prevent from invalidating (called by
> > > > > > cache drop, inode cache shrinking, etc), but it doesn't avoid
> > > > > > invalidation in DIO path (easy to deal with)
> > > > > > 3. The page should be able to get truncated/hole punched/unlinked: it
> > > > > > works as it is
> > > > > > 4. Notify users when the page is accessed, e.g. read/write, page fault
> > > > > > and other paths: this is hard
> > > > > >
> > > > > > The hardest part is #4. Since there are too many paths in filesystems
> > > > > > that do *NOT* check if page is poisoned or not, e.g. read/write,
> > > > > > compression (btrfs, f2fs), etc. A couple of ways to handle it off the
> > > > > > top of my head:
> > > > > > 1. Check hwpoison flag for every path, the most straightforward way,
> > > > > > but a lot of work
> > > > > > 2. Return NULL for a poisoned page from page cache lookup; most
> > > > > > callsites already check for NULL, so this should need the least work,
> > > > > > I think. But the error handling in filesystems just returns -ENOMEM,
> > > > > > and that error code will obviously confuse users.
> > > > > > 3. To improve #2, we could return error pointer, e.g. ERR_PTR(-EIO),
> > > > > > but this will involve significant amount of code change as well since
> > > > > > all the paths need check if the pointer is ERR or not.
> > > > >
> > > > > I think the approach #3 sounds good for now, it seems to me that these
> > > > > statements are about general ways to handle error pages on all page cache
> > > > > users, so then the amount of code changes is a big problem, but when
> > > > > focusing on shmem/tmpfs, could the amount of changes be more manageable, or
> > > > > still large?
> > > >
> > > > Yeah, I agree #3 makes more sense. Just return an error when finding
> > > > out corrupted page. I think this is the right semantic.
> >
> > I had a prototype patchset for #3, but now I have to consider stepping
> > back to rethink which way we should go. I actually did a patchset for
> > #1 too. By comparing the two patchsets and some deeper investigation
> > during preparing the two patchsets, I realized #3 may not be the best
> > way.
> >
> > For #3, ERR_PTR will be returned, so all the callers need to check the
> > return value, otherwise an invalid pointer may be dereferenced. But not
> > all callers really care about the content of the page; for example,
> > partial truncate just sets the truncated range in one page to 0.
> > So for such paths it needs additional modification if ERR_PTR is
> > returned. And if the callers have their own way to handle the
> > problematic pages we need to add a new FGP flag to tell FGP functions
> > to return the pointer to the page.
> >
> > But we don't need to do anything for such paths if we go with #1. For
> > most paths (e.g. read/write) both ways need to check the validity of
> > the page by checking ERR_PTR or the page flag. But a lot of extra code
> > could be saved for the paths which don't care about the actual data
> > for #1.
>
> OK, so it's not clear which is better. Maybe we need discussion over
> patches...

Yeah, the conclusion is #1 would need fewer changes than #3.

>
> ...
>
> > > >
> > > > >
> > > > > > 5. We also could define a new FGP flag to return poisoned page, NULL
> > > > > > or error pointer. This also should need significant code change since
> > > > > > a lot of callsites need to be contemplated.
> > > > >
> > > > > Could you explain a little more about which callers should use the flag?
> > > >
> > > > Just to solve the above invalidate/truncate problem and page fault
> > > > doesn't expect an error pointer. But it seems the above
> > > > invalidate/truncate paths don't matter. Page fault should be the only
> > > > user since page fault may need unlock the page if poisoned page is
> > > > returned.
>
> Originally, PG_hwpoison is detected in the page fault handler for file-backed pages
> like below,
>
>   static vm_fault_t __do_fault(struct vm_fault *vmf)
>   ...
>           ret = vma->vm_ops->fault(vmf);
>           if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY |
>                               VM_FAULT_DONE_COW)))
>                   return ret;
>
>           if (unlikely(PageHWPoison(vmf->page))) {
>                   if (ret & VM_FAULT_LOCKED)
>                           unlock_page(vmf->page);
>                   put_page(vmf->page);
>                   vmf->page = NULL;
>                   return VM_FAULT_HWPOISON;
>           }
>
> so it seems to me that if a filesystem switches to the new scheme of keeping
> error pages in page cache, ->fault() callback for the filesystem needs to be
> aware of hwpoisoned pages inside it and returns VM_FAULT_HWPOISON when it
> detects an error page (maybe by detecting pagecache_get_page() to return
> PTR_ERR(-EIO or -EHWPOISON) or some other ways).  In the other filesystems
> with the old scheme, fault() callback does not return VM_FAULT_HWPOISON so
> that page fault handler falls back to the generic PageHWPoison check above.

Yes, exactly. If we return ERR_PTR we need to modify the above code
accordingly. But if we return the page, no change is required.

>
> > >
> > > It seems page fault check IS_ERR(page) then just return
> > > VM_FAULT_HWPOISON. But I found a couple of places in shmem which want
> > > to return head page then handle subpage or just return the page but
> > > don't care the content of the page. They should ignore hwpoison. So I
> > > guess we'd better to have a FGP flag for such cases.
>
> At least currently thp are supposed to be split before error handling.

Not for file THP.

> We could loosen this assumption to support hwpoison on a subpage of thp,
> but I'm not sure whether that has some benefit.

I don't quite get your point. Currently the hwpoison flag is set on
the specific subpage instead of the head page.

> And also this discussion makes me aware that we need to have a way to
> prevent hwpoisoned pages from being used in thp collapse operations.

Actually not. The refcount from hwpoison could prevent it from
collapsing. But if we return ERR_PTR (#3) we need extra code in
khugepaged to handle it.

So I realized #1 would require fewer changes. And leaving the page
state check to callers seems reasonable since the callers usually check
other states, e.g. uptodate.
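Approach #1 amounts to the pattern below: the lookup hands back the page as-is, and each caller that cares tests the poison flag right next to the uptodate check it already does. This is a userspace sketch; the flag bits and page_ok_for_read() are made-up names loosely modeled on the kernel's page flags, not real API:

```c
/*
 * Userspace model of approach #1: no ERR_PTR at the lookup; callers
 * that care check the poison flag alongside the uptodate check they
 * already perform.  Flag values and names are illustrative only.
 */
#include <errno.h>

#define PG_uptodate (1u << 0)
#define PG_hwpoison (1u << 1)

struct page { unsigned int flags; };

static int page_ok_for_read(const struct page *page)
{
	if (page->flags & PG_hwpoison)
		return -EIO;    /* poisoned: report it, don't hand out stale/zero data */
	if (!(page->flags & PG_uptodate))
		return -EAGAIN; /* not read in yet: caller must wait or retry */
	return 0;           /* safe to copy data out */
}
```

Callers that only manipulate page cache state (truncate, invalidate) simply skip this check, which is where the "fewer changes" argument for #1 comes from.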


>
> Thanks,
> Naoya Horiguchi
HORIGUCHI NAOYA(堀口 直也) Sept. 8, 2021, 4:25 a.m. UTC | #15
On Tue, Sep 07, 2021 at 08:14:26PM -0700, Yang Shi wrote:
...
> > > > > > > 5. We also could define a new FGP flag to return poisoned page, NULL
> > > > > > > or error pointer. This also should need significant code change since
> > > > > > > a lot of callsites need to be contemplated.
> > > > > >
> > > > > > Could you explain a little more about which callers should use the flag?
> > > > >
> > > > > Just to solve the above invalidate/truncate problem and page fault
> > > > > doesn't expect an error pointer. But it seems the above
> > > > > invalidate/truncate paths don't matter. Page fault should be the only
> > > > > user since page fault may need unlock the page if poisoned page is
> > > > > returned.
> >
> > Originally, PG_hwpoison is detected in the page fault handler for file-backed pages
> > like below,
> >
> >   static vm_fault_t __do_fault(struct vm_fault *vmf)
> >   ...
> >           ret = vma->vm_ops->fault(vmf);
> >           if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY |
> >                               VM_FAULT_DONE_COW)))
> >                   return ret;
> >
> >           if (unlikely(PageHWPoison(vmf->page))) {
> >                   if (ret & VM_FAULT_LOCKED)
> >                           unlock_page(vmf->page);
> >                   put_page(vmf->page);
> >                   vmf->page = NULL;
> >                   return VM_FAULT_HWPOISON;
> >           }
> >
> > so it seems to me that if a filesystem switches to the new scheme of keeping
> > error pages in page cache, ->fault() callback for the filesystem needs to be
> > aware of hwpoisoned pages inside it and returns VM_FAULT_HWPOISON when it
> > detects an error page (maybe by detecting pagecache_get_page() to return
> > PTR_ERR(-EIO or -EHWPOISON) or some other ways).  In the other filesystems
> > with the old scheme, fault() callback does not return VM_FAULT_HWPOISON so
> > that page fault handler falls back to the generic PageHWPoison check above.
> 
> Yes, exactly. If we return ERR_PTR we need to modify the above code
> accordingly. But if we return the page, no change is required.
> 
> >
> > > >
> > > > It seems page fault check IS_ERR(page) then just return
> > > > VM_FAULT_HWPOISON. But I found a couple of places in shmem which want
> > > > to return head page then handle subpage or just return the page but
> > > > don't care the content of the page. They should ignore hwpoison. So I
> > > > guess we'd better to have a FGP flag for such cases.
> >
> > At least currently thp are supposed to be split before error handling.
> 
> Not for file THP.
> 
> > We could loosen this assumption to support hwpoison on a subpage of thp,
> > but I'm not sure whether that has some benefit.
> 
> I don't quite get your point. Currently the hwpoison flag is set on
> the specific subpage instead of the head page.

Sorry, I missed the case when thp split fails; then the thp has the hwpoison
flag set on a subpage.  I only thought of the successful case, where the
resulting error page is no longer a subpage.

> 
> > And also this discussion makes me aware that we need to have a way to
> > prevent hwpoisoned pages from being used in thp collapse operations.
> 
> Actually not. The refcount from hwpoison could prevent it from
> collapsing. But if we return ERR_PTR (#3) we need extra code in
> khugepaged to handle it.

OK, so we already prevent it.

Thanks,
Naoya Horiguchi

> 
> So I realized #1 would require fewer changes. And leaving the page
> state check to callers seems reasonable since the callers usually check
> other states, e.g. uptodate.
Yang Shi Sept. 9, 2021, 11:07 p.m. UTC | #16
On Tue, Sep 7, 2021 at 9:25 PM HORIGUCHI NAOYA(堀口 直也)
<naoya.horiguchi@nec.com> wrote:
>
> On Tue, Sep 07, 2021 at 08:14:26PM -0700, Yang Shi wrote:
> ...
> > > > > > > > 5. We also could define a new FGP flag to return poisoned page, NULL
> > > > > > > > or error pointer. This also should need significant code change since
> > > > > > > > a lot of callsites need to be contemplated.
> > > > > > >
> > > > > > > Could you explain a little more about which callers should use the flag?
> > > > > >
> > > > > > Just to solve the above invalidate/truncate problem and page fault
> > > > > > doesn't expect an error pointer. But it seems the above
> > > > > > invalidate/truncate paths don't matter. Page fault should be the only
> > > > > > user since page fault may need unlock the page if poisoned page is
> > > > > > returned.
> > >
> > > Originally, PG_hwpoison is detected in the page fault handler for file-backed pages
> > > like below,
> > >
> > >   static vm_fault_t __do_fault(struct vm_fault *vmf)
> > >   ...
> > >           ret = vma->vm_ops->fault(vmf);
> > >           if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY |
> > >                               VM_FAULT_DONE_COW)))
> > >                   return ret;
> > >
> > >           if (unlikely(PageHWPoison(vmf->page))) {
> > >                   if (ret & VM_FAULT_LOCKED)
> > >                           unlock_page(vmf->page);
> > >                   put_page(vmf->page);
> > >                   vmf->page = NULL;
> > >                   return VM_FAULT_HWPOISON;
> > >           }
> > >
> > > so it seems to me that if a filesystem switches to the new scheme of keeping
> > > error pages in page cache, ->fault() callback for the filesystem needs to be
> > > aware of hwpoisoned pages inside it and returns VM_FAULT_HWPOISON when it
> > > detects an error page (maybe by detecting pagecache_get_page() to return
> > > PTR_ERR(-EIO or -EHWPOISON) or some other ways).  In the other filesystems
> > > with the old scheme, fault() callback does not return VM_FAULT_HWPOISON so
> > > that page fault handler falls back to the generic PageHWPoison check above.
> >
> > Yes, exactly. If we return ERR_PTR we need to modify the above code
> > accordingly. But if we return the page, no change is required.
> >
> > >
> > > > >
> > > > > It seems page fault check IS_ERR(page) then just return
> > > > > VM_FAULT_HWPOISON. But I found a couple of places in shmem which want
> > > > > to return head page then handle subpage or just return the page but
> > > > > don't care the content of the page. They should ignore hwpoison. So I
> > > > > guess we'd better to have a FGP flag for such cases.
> > >
> > > At least currently thp are supposed to be split before error handling.
> >
> > Not for file THP.
> >
> > > We could loosen this assumption to support hwpoison on a subpage of thp,
> > > but I'm not sure whether that has some benefit.
> >
> > I don't quite get your point. Currently the hwpoison flag is set on
> > the specific subpage instead of the head page.
>
> Sorry, I missed the case when thp split fails; then the thp has the hwpoison
> flag set on a subpage.  I only thought of the successful case, where the
> resulting error page is no longer a subpage.
>
> >
> > > And also this discussion makes me aware that we need to have a way to
> > > prevent hwpoisoned pages from being used in thp collapse operations.
> >
> > Actually not. The refcount from hwpoison could prevent it from
> > collapsing. But if we return ERR_PTR (#3) we need extra code in
> > khugepaged to handle it.
>
> OK, so we already prevent it.

I just realized I was wrong about the refcount. The poisoned page has been
isolated from LRU before khugepaged finds it. So the refcount from
isolation is not incremented at all. The refcount seen by khugepaged
is 3: hwpoison, page cache and khugepaged. It actually meets the
expectation.

khugepaged does check whether the page is an LRU page, but without
holding the page lock, so it may race with hwpoison. After taking the
page lock, it doesn't check the LRU flag anymore. So I think we need
to add the check.
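The missing check is the classic lock-then-recheck pattern: the unlocked LRU test can race with hwpoison isolating the page, so it has to be repeated under the page lock. A userspace sketch of the pattern (try_collapse() and the flag fields are hypothetical, standing in for khugepaged's candidate check, not real mm code):

```c
/*
 * Userspace model of the lock-then-recheck pattern proposed for
 * khugepaged: the unlocked pre-check can race with hwpoison, so the
 * decisive test happens again under the page lock.  All names and
 * fields here are illustrative.
 */
struct page {
	int lru;       /* models PageLRU */
	int hwpoison;  /* models PageHWPoison */
	int locked;    /* models the page lock */
};

static int try_collapse(struct page *p)
{
	if (!p->lru)
		return -1;      /* unlocked pre-check: cheap, but racy */

	p->locked = 1;          /* lock_page() */
	/* Recheck under the lock: hwpoison may have isolated the page. */
	int ok = p->lru && !p->hwpoison;
	p->locked = 0;          /* unlock_page() */

	return ok ? 0 : -1;     /* -1: skip this candidate */
}
```

The point is only the structure: any state inspected before lock_page() must be treated as a hint and re-validated afterwards.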

>
> Thanks,
> Naoya Horiguchi
>
> >
> > So I realized #1 would require fewer changes. And leaving the page
> > state check to callers seems reasonable since the callers usually check
> > other states, e.g. uptodate.
diff mbox series

Patch

diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 60df8fcd0444..caa0b0c1f5b8 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1149,13 +1149,16 @@  static int __get_hwpoison_page(struct page *page)
 
 	if (PageTransHuge(head)) {
 		/*
-		 * Non anonymous thp exists only in allocation/free time. We
-		 * can't handle such a case correctly, so let's give it up.
-		 * This should be better than triggering BUG_ON when kernel
-		 * tries to touch the "partially handled" page.
+		 * We can't handle allocating or freeing THPs, so let's give
+		 * it up. This should be better than triggering BUG_ON when
+		 * kernel tries to touch the "partially handled" page.
+		 *
+		 * page->mapping won't be initialized until the page is added
+		 * to rmap or page cache.  Use this as an indicator for if
+		 * this is an instantiated page.
 		 */
-		if (!PageAnon(head)) {
-			pr_err("Memory failure: %#lx: non anonymous thp\n",
+		if (!head->mapping) {
+			pr_err("Memory failure: %#lx: non instantiated thp\n",
 				page_to_pfn(page));
 			return 0;
 		}
@@ -1414,12 +1417,12 @@  static int identify_page_state(unsigned long pfn, struct page *p,
 static int try_to_split_thp_page(struct page *page, const char *msg)
 {
 	lock_page(page);
-	if (!PageAnon(page) || unlikely(split_huge_page(page))) {
+	if (!page->mapping || unlikely(split_huge_page(page))) {
 		unsigned long pfn = page_to_pfn(page);
 
 		unlock_page(page);
-		if (!PageAnon(page))
-			pr_info("%s: %#lx: non anonymous thp\n", msg, pfn);
+		if (!page->mapping)
+			pr_info("%s: %#lx: not instantiated thp\n", msg, pfn);
 		else
 			pr_info("%s: %#lx: thp split failed\n", msg, pfn);
 		put_page(page);