mbox series

[RFC,v3,0/5] Solve silent data loss caused by poisoned page cache (shmem/tmpfs)

Message ID 20210930215311.240774-1-shy828301@gmail.com (mailing list archive)
Headers show
Series Solve silent data loss caused by poisoned page cache (shmem/tmpfs) | expand

Message

Yang Shi Sept. 30, 2021, 9:53 p.m. UTC
When discussing the patch that splits page cache THP in order to offline the
poisoned page, Noaya mentioned there is a bigger problem [1] that prevents this
from working since the page cache page will be truncated if uncorrectable
errors happen.  By looking this deeper it turns out this approach (truncating
poisoned page) may incur silent data loss for all non-readonly filesystems if
the page is dirty.  It may be worse for in-memory filesystem, e.g. shmem/tmpfs
since the data blocks are actually gone.

To solve this problem we could keep the poisoned dirty page in page cache then
notify the users on any later access, e.g. page fault, read/write, etc.  The
clean page could be truncated as is since they can be reread from disk later on.

The consequence is the filesystems may find poisoned page and manipulate it as
healthy page since all the filesystems actually don't check if the page is
poisoned or not in all the relevant paths except page fault.  In general, we
need make the filesystems be aware of poisoned page before we could keep the
poisoned page in page cache in order to solve the data loss problem.

To make filesystems be aware of poisoned page we should consider:
- The page should be not written back: clearing dirty flag could prevent from
  writeback.
- The page should not be dropped (it shows as a clean page) by drop caches or
  other callers: the refcount pin from hwpoison could prevent from invalidating
  (called by cache drop, inode cache shrinking, etc), but it doesn't avoid
  invalidation in DIO path.
- The page should be able to get truncated/hole punched/unlinked: it works as it
  is.
- Notify users when the page is accessed, e.g. read/write, page fault and other
  paths (compression, encryption, etc).

The scope of the last one is huge since almost all filesystems need do it once
a page is returned from page cache lookup.  There are a couple of options to
do it:

1. Check hwpoison flag for every path, the most straightforward way.
2. Return NULL for poisoned page from page cache lookup, the most callsites
   check if NULL is returned, this should have least work I think.  But the
   error handling in filesystems just return -ENOMEM, the error code will incur
   confusion to the users obviously.
3. To improve #2, we could return error pointer, e.g. ERR_PTR(-EIO), but this
   will involve significant amount of code change as well since all the paths
   need check if the pointer is ERR or not just like option #1.

I did prototype for both #1 and #3, but it seems #3 may require more changes
than #1.  For #3 ERR_PTR will be returned so all the callers need to check the
return value otherwise invalid pointer may be dereferenced, but not all callers
really care about the content of the page, for example, partial truncate which
just sets the truncated range in one page to 0.  So for such paths it needs
additional modification if ERR_PTR is returned.  And if the callers have their
own way to handle the problematic pages we need to add a new FGP flag to tell
FGP functions to return the pointer to the page.

It may happen very rarely, but once it happens the consequence (data corruption)
could be very bad and it is very hard to debug.  It seems this problem had been
slightly discussed before, but seems no action was taken at that time. [2]

As the aforementioned investigation, it needs huge amount of work to solve
the potential data loss for all filesystems.  But it is much easier for
in-memory filesystems and such filesystems actually suffer more than others
since even the data blocks are gone due to truncating.  So this patchset starts
from shmem/tmpfs by taking option #1.

Patch #1: cleanup, depended by patch #2
Patch #2: fix THP with hwpoisoned subpage(s) PMD map bug
Patch #2: refactor and preparation.
Patch #4: keep the poisoned page in page cache and handle such case for all
          the paths.
Patch #5: the previous patches unblock page cache THP split, so this patch
          add page cache THP split support.

I didn't receive too many comments for patch #3 ~ #5, so may consider separate
the bug fixes (patch #1 and #2) from others to make them merged sooner.  This
version still includes all 5 patches.


Changelog
v2 --> v3:
  * Incorporated the comments from Kirill.
  * Reordered the series to reflect the right dependency (patch #3 from v2
    is patch #1 in this revision, patch #1 from v2 is patch #2 in this
    revision).
  * After the reorder, patch #2 depends on patch #1 and both need to be
    backported to -stable.
v1 --> v2:
  * Incorporated the suggestion from Kirill to use a new page flag to
    indicate there is hwpoisoned subpage(s) in a THP. (patch #1)
  * Dropped patch #2 of v1.
  * Refctored the page refcount check logic of hwpoison per Naoya. (patch #2)
  * Removed unnecessary THP check per Naoya. (patch #3)
  * Incorporated the other comments for shmem from Naoya. (patch #4)


Yang Shi (5):
      mm: hwpoison: remove the unnecessary THP check
      mm: filemap: check if THP has hwpoisoned subpage for PMD page fault
      mm: hwpoison: refactor refcount check handling
      mm: shmem: don't truncate page if memory failure happens
      mm: hwpoison: handle non-anonymous THP correctly

 include/linux/page-flags.h |  19 +++++++++++
 mm/filemap.c               |  12 +++----
 mm/huge_memory.c           |   2 ++
 mm/memory-failure.c        | 129 +++++++++++++++++++++++++++++++++++++++++++--------------------------
 mm/memory.c                |   9 +++++
 mm/page_alloc.c            |   4 ++-
 mm/shmem.c                 |  31 +++++++++++++++--
 mm/userfaultfd.c           |   5 +++
 8 files changed, 153 insertions(+), 58 deletions(-)


[1] https://lore.kernel.org/linux-mm/CAHbLzkqNPBh_sK09qfr4yu4WTFOzRy+MKj+PA7iG-adzi9zGsg@mail.gmail.com/T/#m0e959283380156f1d064456af01ae51fdff91265
[2] https://lore.kernel.org/lkml/20210318183350.GT3420@casper.infradead.org/

Comments

Peter Xu Oct. 13, 2021, 2:40 a.m. UTC | #1
On Thu, Sep 30, 2021 at 02:53:06PM -0700, Yang Shi wrote:
> Yang Shi (5):
>       mm: hwpoison: remove the unnecessary THP check
>       mm: filemap: check if THP has hwpoisoned subpage for PMD page fault
>       mm: hwpoison: refactor refcount check handling
>       mm: shmem: don't truncate page if memory failure happens
>       mm: hwpoison: handle non-anonymous THP correctly

Today I just noticed one more thing: unpoison path has (unpoison_memory):

	if (page_mapping(page)) {
		unpoison_pr_info("Unpoison: the hwpoison page has non-NULL mapping %#lx\n",
				 pfn, &unpoison_rs);
		return 0;
	}

I _think_ it was used to make sure we ignore page that was not successfully
poisoned/offlined before (for anonymous), so raising this question up on
whether we should make sure e.g. shmem hwpoisoned pages still can be unpoisoned
for debugging purposes.
Yang Shi Oct. 13, 2021, 3:09 a.m. UTC | #2
On Tue, Oct 12, 2021 at 7:41 PM Peter Xu <peterx@redhat.com> wrote:
>
> On Thu, Sep 30, 2021 at 02:53:06PM -0700, Yang Shi wrote:
> > Yang Shi (5):
> >       mm: hwpoison: remove the unnecessary THP check
> >       mm: filemap: check if THP has hwpoisoned subpage for PMD page fault
> >       mm: hwpoison: refactor refcount check handling
> >       mm: shmem: don't truncate page if memory failure happens
> >       mm: hwpoison: handle non-anonymous THP correctly
>
> Today I just noticed one more thing: unpoison path has (unpoison_memory):
>
>         if (page_mapping(page)) {
>                 unpoison_pr_info("Unpoison: the hwpoison page has non-NULL mapping %#lx\n",
>                                  pfn, &unpoison_rs);
>                 return 0;
>         }
>
> I _think_ it was used to make sure we ignore page that was not successfully
> poisoned/offlined before (for anonymous), so raising this question up on
> whether we should make sure e.g. shmem hwpoisoned pages still can be unpoisoned
> for debugging purposes.

Yes, not only mapping, the refcount check is not right if page cache
page is kept in page cache instead of being truncated after this
series. But actually unpoison has been broken since commit
0ed950d1f28142ccd9a9453c60df87853530d778 ("mm,hwpoison: make
get_hwpoison_page() call get_any_page()"). And Naoya said in the
commit "unpoison_memory() is also unchanged because it's broken and
need thorough fixes (will be done later)."

I do have some fixes in my tree to unblock tests and fix unpoison for
this series (just make it work for testing). Naoya may have some ideas
in mind and it is just a debugging feature so I don't think it must be
fixed in this series. It could be done later. I could add a TODO
section in the cover letter to make this more clear.

>
> --
> Peter Xu
>
Peter Xu Oct. 13, 2021, 3:24 a.m. UTC | #3
On Tue, Oct 12, 2021 at 08:09:24PM -0700, Yang Shi wrote:
> On Tue, Oct 12, 2021 at 7:41 PM Peter Xu <peterx@redhat.com> wrote:
> >
> > On Thu, Sep 30, 2021 at 02:53:06PM -0700, Yang Shi wrote:
> > > Yang Shi (5):
> > >       mm: hwpoison: remove the unnecessary THP check
> > >       mm: filemap: check if THP has hwpoisoned subpage for PMD page fault
> > >       mm: hwpoison: refactor refcount check handling
> > >       mm: shmem: don't truncate page if memory failure happens
> > >       mm: hwpoison: handle non-anonymous THP correctly
> >
> > Today I just noticed one more thing: unpoison path has (unpoison_memory):
> >
> >         if (page_mapping(page)) {
> >                 unpoison_pr_info("Unpoison: the hwpoison page has non-NULL mapping %#lx\n",
> >                                  pfn, &unpoison_rs);
> >                 return 0;
> >         }
> >
> > I _think_ it was used to make sure we ignore page that was not successfully
> > poisoned/offlined before (for anonymous), so raising this question up on
> > whether we should make sure e.g. shmem hwpoisoned pages still can be unpoisoned
> > for debugging purposes.
> 
> Yes, not only mapping, the refcount check is not right if page cache
> page is kept in page cache instead of being truncated after this
> series. But actually unpoison has been broken since commit
> 0ed950d1f28142ccd9a9453c60df87853530d778 ("mm,hwpoison: make
> get_hwpoison_page() call get_any_page()"). And Naoya said in the
> commit "unpoison_memory() is also unchanged because it's broken and
> need thorough fixes (will be done later)."
> 
> I do have some fixes in my tree to unblock tests and fix unpoison for
> this series (just make it work for testing). Naoya may have some ideas
> in mind and it is just a debugging feature so I don't think it must be
> fixed in this series. It could be done later. I could add a TODO
> section in the cover letter to make this more clear.

I see, that sounds good enough to me.  Thanks,
Naoya Horiguchi Oct. 14, 2021, 6:54 a.m. UTC | #4
On Tue, Oct 12, 2021 at 08:09:24PM -0700, Yang Shi wrote:
> On Tue, Oct 12, 2021 at 7:41 PM Peter Xu <peterx@redhat.com> wrote:
> >
> > On Thu, Sep 30, 2021 at 02:53:06PM -0700, Yang Shi wrote:
> > > Yang Shi (5):
> > >       mm: hwpoison: remove the unnecessary THP check
> > >       mm: filemap: check if THP has hwpoisoned subpage for PMD page fault
> > >       mm: hwpoison: refactor refcount check handling
> > >       mm: shmem: don't truncate page if memory failure happens
> > >       mm: hwpoison: handle non-anonymous THP correctly
> >
> > Today I just noticed one more thing: unpoison path has (unpoison_memory):
> >
> >         if (page_mapping(page)) {
> >                 unpoison_pr_info("Unpoison: the hwpoison page has non-NULL mapping %#lx\n",
> >                                  pfn, &unpoison_rs);
> >                 return 0;
> >         }
> >
> > I _think_ it was used to make sure we ignore page that was not successfully
> > poisoned/offlined before (for anonymous), so raising this question up on
> > whether we should make sure e.g. shmem hwpoisoned pages still can be unpoisoned
> > for debugging purposes.
> 
> Yes, not only mapping, the refcount check is not right if page cache
> page is kept in page cache instead of being truncated after this
> series. But actually unpoison has been broken since commit
> 0ed950d1f28142ccd9a9453c60df87853530d778 ("mm,hwpoison: make
> get_hwpoison_page() call get_any_page()"). And Naoya said in the
> commit "unpoison_memory() is also unchanged because it's broken and
> need thorough fixes (will be done later)."
> 

Yeah, I have a patch but still not make it merged to upstream.
Sorry about that...

> I do have some fixes in my tree to unblock tests and fix unpoison for
> this series (just make it work for testing). Naoya may have some ideas
> in mind and it is just a debugging feature so I don't think it must be
> fixed in this series. It could be done later. I could add a TODO
> section in the cover letter to make this more clear.

Yes, it's fine to me that you leave this unsolved in this patchset.

Thanks,
Naoya Horiguchi