[v3,01/11] pagemap: Introduce ->memory_failure()

Message ID 20210208105530.3072869-2-ruansy.fnst@cn.fujitsu.com (mailing list archive)
State Superseded
Series fsdax: introduce fs query to support reflink

Commit Message

Ruan Shiyang Feb. 8, 2021, 10:55 a.m. UTC
When a memory failure occurs, we call this function, which is implemented
by each kind of device.  For the fsdax case, the pmem device driver
implements it.  The pmem device driver finds the block device on which the
error page is located and tries to get the filesystem on this block device,
finally calling the filesystem handler to deal with the error.  The
filesystem will try to recover the corrupted data if possible.

Signed-off-by: Shiyang Ruan <ruansy.fnst@cn.fujitsu.com>
---
 include/linux/memremap.h | 8 ++++++++
 1 file changed, 8 insertions(+)
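
A rough sketch of the flow the commit message describes, on the pmem side
(the helper pmem_corrupted_range() is a hypothetical stand-in; the actual
support code lands in later patches of this series):

/*
 * Sketch only: pmem_pagemap_memory_failure() is not part of this patch,
 * and pmem_corrupted_range() is a stand-in for the later plumbing.
 */
static int pmem_pagemap_memory_failure(struct dev_pagemap *pgmap,
		unsigned long pfn, int flags)
{
	struct pmem_device *pmem =
		container_of(pgmap, struct pmem_device, pgmap);
	loff_t offset = PFN_PHYS(pfn) - pmem->phys_addr - pmem->data_offset;

	/*
	 * Hand the device-relative offset to the filesystem sitting on
	 * top of this pmem block device; it maps the range back to files.
	 */
	return pmem_corrupted_range(pmem, offset, PAGE_SIZE, flags);
}

static const struct dev_pagemap_ops fsdax_pagemap_ops = {
	.memory_failure		= pmem_pagemap_memory_failure,
};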

Comments

Christoph Hellwig Feb. 10, 2021, 1:20 p.m. UTC | #1
On Mon, Feb 08, 2021 at 06:55:20PM +0800, Shiyang Ruan wrote:
> When a memory failure occurs, we call this function, which is implemented
> by each kind of device.  For the fsdax case, the pmem device driver
> implements it.  The pmem device driver finds the block device on which the
> error page is located and tries to get the filesystem on this block device,
> finally calling the filesystem handler to deal with the error.  The
> filesystem will try to recover the corrupted data if possible.

I'm not sure adding just a method without any of the support code
is a useful patch on its own.
Dan Williams March 6, 2021, 8:36 p.m. UTC | #2
On Mon, Feb 8, 2021 at 2:55 AM Shiyang Ruan <ruansy.fnst@cn.fujitsu.com> wrote:
>
> When a memory failure occurs, we call this function, which is implemented
> by each kind of device.  For the fsdax case, the pmem device driver
> implements it.  The pmem device driver finds the block device on which the
> error page is located and tries to get the filesystem on this block device,
> finally calling the filesystem handler to deal with the error.  The
> filesystem will try to recover the corrupted data if possible.
>
> Signed-off-by: Shiyang Ruan <ruansy.fnst@cn.fujitsu.com>
> ---
>  include/linux/memremap.h | 8 ++++++++
>  1 file changed, 8 insertions(+)
>
> diff --git a/include/linux/memremap.h b/include/linux/memremap.h
> index 79c49e7f5c30..0bcf2b1e20bd 100644
> --- a/include/linux/memremap.h
> +++ b/include/linux/memremap.h
> @@ -87,6 +87,14 @@ struct dev_pagemap_ops {
>          * the page back to a CPU accessible page.
>          */
>         vm_fault_t (*migrate_to_ram)(struct vm_fault *vmf);
> +
> +       /*
> +        * Handle a memory failure that happens on one page.  Notify the
> +        * processes that are using this page, and try to recover the data
> +        * on this page if necessary.
> +        */
> +       int (*memory_failure)(struct dev_pagemap *pgmap, unsigned long pfn,
> +                             int flags);
>  };

After the conversation with Dave I don't see the point of this. If
there is a memory_failure() on a page, why not just call
memory_failure()? That already knows how to find the inode and the
filesystem can be notified from there.

Although memory_failure() is inefficient for large range failures, I'm
not seeing a better option, so I'm going to test calling
memory_failure() over a large range whenever an in-use dax-device is
hot-removed.
Shiyang Ruan March 8, 2021, 3:38 a.m. UTC | #3
> On Mon, Feb 8, 2021 at 2:55 AM Shiyang Ruan <ruansy.fnst@cn.fujitsu.com> wrote:
> >
> > When a memory failure occurs, we call this function, which is implemented
> > by each kind of device.  For the fsdax case, the pmem device driver
> > implements it.  The pmem device driver finds the block device on which the
> > error page is located and tries to get the filesystem on this block device,
> > finally calling the filesystem handler to deal with the error.  The
> > filesystem will try to recover the corrupted data if possible.
> >
> > Signed-off-by: Shiyang Ruan <ruansy.fnst@cn.fujitsu.com>
> > ---
> >  include/linux/memremap.h | 8 ++++++++
> >  1 file changed, 8 insertions(+)
> >
> > diff --git a/include/linux/memremap.h b/include/linux/memremap.h
> > index 79c49e7f5c30..0bcf2b1e20bd 100644
> > --- a/include/linux/memremap.h
> > +++ b/include/linux/memremap.h
> > @@ -87,6 +87,14 @@ struct dev_pagemap_ops {
> >          * the page back to a CPU accessible page.
> >          */
> >         vm_fault_t (*migrate_to_ram)(struct vm_fault *vmf);
> > +
> > +       /*
> > +        * Handle a memory failure that happens on one page.  Notify the
> > +        * processes that are using this page, and try to recover the data
> > +        * on this page if necessary.
> > +        */
> > +       int (*memory_failure)(struct dev_pagemap *pgmap, unsigned long pfn,
> > +                             int flags);
> >  };
> 
> After the conversation with Dave I don't see the point of this. If
> there is a memory_failure() on a page, why not just call
> memory_failure()? That already knows how to find the inode and the
> filesystem can be notified from there.

We want memory_failure() to support reflinked files.  In this case, we are
not able to track multiple files from a page (the broken page) because
page->mapping and page->index can only track one file.  Thus, I introduce
this ->memory_failure(), implemented in the pmem driver, to call
->corrupted_range() level by level up the stack, and finally find the files
that are using (mmapping) this page.

> 
> Although memory_failure() is inefficient for large range failures, I'm
> not seeing a better option, so I'm going to test calling
> memory_failure() over a large range whenever an in-use dax-device is
> hot-removed.
> 

I have not tested this for large-range failure yet...  I am not sure whether
it works well.  But because of the complex tracking method, I think it would
be even more inefficient in this case than before.


--
Thanks,
Ruan Shiyang.
Dan Williams March 8, 2021, 5:23 a.m. UTC | #4
On Sun, Mar 7, 2021 at 7:38 PM ruansy.fnst@fujitsu.com
<ruansy.fnst@fujitsu.com> wrote:
>
> > On Mon, Feb 8, 2021 at 2:55 AM Shiyang Ruan <ruansy.fnst@cn.fujitsu.com> wrote:
> > >
> > > When a memory failure occurs, we call this function, which is implemented
> > > by each kind of device.  For the fsdax case, the pmem device driver
> > > implements it.  The pmem device driver finds the block device on which the
> > > error page is located and tries to get the filesystem on this block device,
> > > finally calling the filesystem handler to deal with the error.  The
> > > filesystem will try to recover the corrupted data if possible.
> > >
> > > Signed-off-by: Shiyang Ruan <ruansy.fnst@cn.fujitsu.com>
> > > ---
> > >  include/linux/memremap.h | 8 ++++++++
> > >  1 file changed, 8 insertions(+)
> > >
> > > diff --git a/include/linux/memremap.h b/include/linux/memremap.h
> > > index 79c49e7f5c30..0bcf2b1e20bd 100644
> > > --- a/include/linux/memremap.h
> > > +++ b/include/linux/memremap.h
> > > @@ -87,6 +87,14 @@ struct dev_pagemap_ops {
> > >          * the page back to a CPU accessible page.
> > >          */
> > >         vm_fault_t (*migrate_to_ram)(struct vm_fault *vmf);
> > > +
> > > +       /*
> > > +        * Handle the memory failure happens on one page.  Notify the processes
> > > +        * Handle a memory failure that happens on one page.  Notify the
> > > +        * processes that are using this page, and try to recover the data
> > > +        * on this page if necessary.
> > > +       int (*memory_failure)(struct dev_pagemap *pgmap, unsigned long pfn,
> > > +                             int flags);
> > >  };
> >
> > After the conversation with Dave I don't see the point of this. If
> > there is a memory_failure() on a page, why not just call
> > memory_failure()? That already knows how to find the inode and the
> > filesystem can be notified from there.
>
> We want memory_failure() to support reflinked files.  In this case, we are
> not able to track multiple files from a page (the broken page) because
> page->mapping and page->index can only track one file.  Thus, I introduce
> this ->memory_failure(), implemented in the pmem driver, to call
> ->corrupted_range() level by level up the stack, and finally find the files
> that are using (mmapping) this page.
>

I know the motivation, but this implementation seems backwards. It's
already the case that memory_failure() looks up the address_space
associated with a mapping. From there I would expect a new 'struct
address_space_operations' op to let the fs handle the case when there
are multiple address_spaces associated with a given file.
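
For illustration, the op being suggested might look like the sketch below;
the name and signature are hypothetical, not an existing kernel interface:

struct address_space_operations {
	/* ... existing ops ... */

	/*
	 * Hypothetical: called by memory_failure() when the poisoned page
	 * may be shared (reflinked) across several files; the fs expands
	 * the one known (mapping, index) into every affected address_space.
	 */
	int (*notify_failure)(struct address_space *mapping, pgoff_t index,
			      int flags);
};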
Shiyang Ruan March 8, 2021, 11:34 a.m. UTC | #5
> > > >  1 file changed, 8 insertions(+)
> > > >
> > > > diff --git a/include/linux/memremap.h b/include/linux/memremap.h
> > > > index 79c49e7f5c30..0bcf2b1e20bd 100644
> > > > --- a/include/linux/memremap.h
> > > > +++ b/include/linux/memremap.h
> > > > @@ -87,6 +87,14 @@ struct dev_pagemap_ops {
> > > >          * the page back to a CPU accessible page.
> > > >          */
> > > >         vm_fault_t (*migrate_to_ram)(struct vm_fault *vmf);
> > > > +
> > > > +       /*
> > > > +        * Handle a memory failure that happens on one page.  Notify the
> > > > +        * processes that are using this page, and try to recover the data
> > > > +        * on this page if necessary.
> > > > +        */
> > > > +       int (*memory_failure)(struct dev_pagemap *pgmap, unsigned long pfn,
> > > > +                             int flags);
> > > >  };
> > >
> > > After the conversation with Dave I don't see the point of this. If
> > > there is a memory_failure() on a page, why not just call
> > > memory_failure()? That already knows how to find the inode and the
> > > filesystem can be notified from there.
> >
> > We want memory_failure() to support reflinked files.  In this case, we are
> > not able to track multiple files from a page (the broken page) because
> > page->mapping and page->index can only track one file.  Thus, I introduce
> > this ->memory_failure(), implemented in the pmem driver, to call
> > ->corrupted_range() level by level up the stack, and finally find the files
> > that are using (mmapping) this page.
> >
> 
> I know the motivation, but this implementation seems backwards. It's
> already the case that memory_failure() looks up the address_space
> associated with a mapping. From there I would expect a new 'struct
> address_space_operations' op to let the fs handle the case when there
> are multiple address_spaces associated with a given file.
> 

Let me think about it.  In this way, we
    1. associate file mapping with dax page in dax page fault;
    2. iterate over reflinked files to send the `kill processes` signal
          via the new address_space_operation;
    3. re-associate with another reflinked file's mapping when unmapping
        (rmap query in the filesystem to get the other file).

It does not handle dax pages that are not in use, because their ->mapping is
not associated with any file.  I didn't think this through until reading your
conversation.  Here is my understanding: this case should be handled by the
badblock mechanism in the pmem driver.  This badblock mechanism will call
->corrupted_range() to tell the filesystem to repair the data if possible.

So, we split it into two parts.  And dax device and block device won't be
mixed up again.  Is my understanding right?

But the solution above solves hwpoison on one or a couple of pages, which
happens rarely (I think).  Does the 'pmem remove' operation cause hwpoison
too, calling memory_failure() so many times?  I haven't understood this yet.


--
Thanks,
Ruan Shiyang.
Dan Williams March 8, 2021, 6:01 p.m. UTC | #6
On Mon, Mar 8, 2021 at 3:34 AM ruansy.fnst@fujitsu.com
<ruansy.fnst@fujitsu.com> wrote:
> > > > >  1 file changed, 8 insertions(+)
> > > > >
> > > > > diff --git a/include/linux/memremap.h b/include/linux/memremap.h
> > > > > index 79c49e7f5c30..0bcf2b1e20bd 100644
> > > > > --- a/include/linux/memremap.h
> > > > > +++ b/include/linux/memremap.h
> > > > > @@ -87,6 +87,14 @@ struct dev_pagemap_ops {
> > > > >          * the page back to a CPU accessible page.
> > > > >          */
> > > > >         vm_fault_t (*migrate_to_ram)(struct vm_fault *vmf);
> > > > > +
> > > > > +       /*
> > > > > +        * Handle a memory failure that happens on one page.  Notify the
> > > > > +        * processes that are using this page, and try to recover the data
> > > > > +        * on this page if necessary.
> > > > > +        */
> > > > > +       int (*memory_failure)(struct dev_pagemap *pgmap, unsigned long pfn,
> > > > > +                             int flags);
> > > > >  };
> > > >
> > > > After the conversation with Dave I don't see the point of this. If
> > > > there is a memory_failure() on a page, why not just call
> > > > memory_failure()? That already knows how to find the inode and the
> > > > filesystem can be notified from there.
> > >
> > > We want memory_failure() to support reflinked files.  In this case, we are
> > > not able to track multiple files from a page (the broken page) because
> > > page->mapping and page->index can only track one file.  Thus, I introduce
> > > this ->memory_failure(), implemented in the pmem driver, to call
> > > ->corrupted_range() level by level up the stack, and finally find the files
> > > that are using (mmapping) this page.
> > >
> >
> > I know the motivation, but this implementation seems backwards. It's
> > already the case that memory_failure() looks up the address_space
> > associated with a mapping. From there I would expect a new 'struct
> > address_space_operations' op to let the fs handle the case when there
> > are multiple address_spaces associated with a given file.
> >
>
> Let me think about it.  In this way, we
>     1. associate file mapping with dax page in dax page fault;

I think this needs to be a new type of association that proxies the
representation of the reflink across all involved address_spaces.

>     2. iterate over reflinked files to send the `kill processes` signal
>           via the new address_space_operation;
>     3. re-associate with another reflinked file's mapping when unmapping
>         (rmap query in the filesystem to get the other file).

Perhaps the proxy object is reference-counted per reflink. It seems
error-prone to keep changing the association of the pfn while the
reflink is intact.

> It does not handle dax pages that are not in use, because their ->mapping is
> not associated with any file.  I didn't think this through until reading your
> conversation.  Here is my understanding: this case should be handled by the
> badblock mechanism in the pmem driver.  This badblock mechanism will call
> ->corrupted_range() to tell the filesystem to repair the data if possible.

There are 2 types of notifications. There are badblocks discovered by
the driver (see notify_pmem()) and there are memory_failure() events
signalled by the CPU machine-check handler, or the platform BIOS. In
the case of badblocks, that information needs to be considered by the
fs block allocator to avoid / try to repair badblocks on allocation,
and to allow listing damaged files that need repair. The
memory_failure() notification needs immediate handling to tear down
mappings to that pfn and signal processes that have consumed it with
SIGBUS-action-required. Processes that have the poison mapped, but
have not consumed it, receive SIGBUS-action-optional.
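
The two severities map onto the existing SIGBUS mceerr codes; here is a
simplified sketch of the dispatch, loosely modeled on (not copied from)
kill_proc() in mm/memory-failure.c:

static void sigbus_sketch(struct task_struct *t, void __user *addr,
			  short lsb, int flags)
{
	if ((flags & MF_ACTION_REQUIRED) && t == current)
		/* the task consumed the poison: action required, right now */
		force_sig_mceerr(BUS_MCEERR_AR, addr, lsb);
	else
		/* poison is mapped but not yet consumed: action optional */
		send_sig_mceerr(BUS_MCEERR_AO, addr, lsb, t);
}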

> So, we split it into two parts.  And dax device and block device won't be mixed
> up again.   Is my understanding right?

Right, it's only the filesystem that knows that the block_device and
the dax_device alias data at the same logical offset. The requirements
for sector error handling and page error handling are separate like
block_device_operations and dax_operations.

> But the solution above solves hwpoison on one or a couple of pages, which
> happens rarely (I think).  Does the 'pmem remove' operation cause hwpoison
> too, calling memory_failure() so many times?  I haven't understood this yet.

I'm working on a patch here to call memory_failure() on a wide range
for the surprise remove of a dax_device while a filesystem might be
mounted. It won't be efficient, but there is no other way to notify
the kernel that it needs to immediately stop referencing a page.
Shiyang Ruan March 12, 2021, 10:18 a.m. UTC | #7
> -----Original Message-----
> From: Dan Williams <dan.j.williams@intel.com>
> Subject: Re: [PATCH v3 01/11] pagemap: Introduce ->memory_failure()
> 
> On Mon, Mar 8, 2021 at 3:34 AM ruansy.fnst@fujitsu.com
> <ruansy.fnst@fujitsu.com> wrote:
> > > > > >  1 file changed, 8 insertions(+)
> > > > > >
> > > > > > diff --git a/include/linux/memremap.h b/include/linux/memremap.h
> > > > > > index 79c49e7f5c30..0bcf2b1e20bd 100644
> > > > > > --- a/include/linux/memremap.h
> > > > > > +++ b/include/linux/memremap.h
> > > > > > @@ -87,6 +87,14 @@ struct dev_pagemap_ops {
> > > > > >          * the page back to a CPU accessible page.
> > > > > >          */
> > > > > >         vm_fault_t (*migrate_to_ram)(struct vm_fault *vmf);
> > > > > > +
> > > > > > +       /*
> > > > > > +        * Handle a memory failure that happens on one page.  Notify
> > > > > > +        * the processes that are using this page, and try to recover
> > > > > > +        * the data on this page if necessary.
> > > > > > +        */
> > > > > > +       int (*memory_failure)(struct dev_pagemap *pgmap, unsigned long pfn,
> > > > > > +                             int flags);
> > > > > >  };
> > > > >
> > > > > After the conversation with Dave I don't see the point of this.
> > > > > If there is a memory_failure() on a page, why not just call
> > > > > memory_failure()? That already knows how to find the inode and
> > > > > the filesystem can be notified from there.
> > > >
> > > > We want memory_failure() to support reflinked files.  In this case,
> > > > we are not able to track multiple files from a page (the broken page)
> > > > because page->mapping and page->index can only track one file.  Thus,
> > > > I introduce this ->memory_failure(), implemented in the pmem driver,
> > > > to call ->corrupted_range() level by level up the stack, and finally
> > > > find the files that are using (mmapping) this page.
> > > >
> > >
> > > I know the motivation, but this implementation seems backwards. It's
> > > already the case that memory_failure() looks up the address_space
> > > associated with a mapping. From there I would expect a new 'struct
> > > address_space_operations' op to let the fs handle the case when
> > > there are multiple address_spaces associated with a given file.
> > >
> >
> > Let me think about it.  In this way, we
> >     1. associate file mapping with dax page in dax page fault;
> 
> I think this needs to be a new type of association that proxies the representation
> of the reflink across all involved address_spaces.
> 
> >     2. iterate over reflinked files to send the `kill processes` signal
> >           via the new address_space_operation;
> >     3. re-associate with another reflinked file's mapping when unmapping
> >         (rmap query in the filesystem to get the other file).
> 
> Perhaps the proxy object is reference-counted per reflink. It seems
> error-prone to keep changing the association of the pfn while the
> reflink is intact.
Hi, Dan

I think my early RFC patchset was implemented in this way:
 - Create a per-page 'dax-rmap tree' to store each reflinked file's (mapping, offset) pair when handling a dax page fault.
 - Mount this tree on page->zone_device_data, which is not used in fsdax, so that we can easily iterate over reflinked file mappings in memory_failure().
In my understanding, the dax-rmap tree is the proxy object you mentioned.  If so, I have to say this method was rejected, because it causes huge overhead in the case where every dax page has its own dax-rmap tree.
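
A rough sketch of that rejected scheme (names hypothetical), which shows
where the per-page overhead comes from:

struct dax_rmap_entry {
	struct address_space *mapping;	/* one reflinked file's mapping */
	pgoff_t index;			/* page offset within that file */
	struct list_head list;
};

/* On dax page fault: remember every (mapping, index) that maps this page. */
static int dax_rmap_add(struct page *page, struct address_space *mapping,
			pgoff_t index)
{
	/* assumes the list head was allocated when the page was first used */
	struct list_head *head = page->zone_device_data; /* unused by fsdax */
	struct dax_rmap_entry *e = kmalloc(sizeof(*e), GFP_KERNEL);

	if (!e)
		return -ENOMEM;
	e->mapping = mapping;
	e->index = index;
	list_add(&e->list, head);	/* one allocation per mapping, per page */
	return 0;
}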


--
Thanks,
Ruan Shiyang.
> 
> > It does not handle dax pages that are not in use, because their ->mapping
> > is not associated with any file.  I didn't think this through until reading
> > your conversation.  Here is my understanding: this case should be handled
> > by the badblock mechanism in the pmem driver.  This badblock mechanism will
> > call ->corrupted_range() to tell the filesystem to repair the data if
> > possible.
> 
> There are 2 types of notifications. There are badblocks discovered by the
> driver (see notify_pmem()) and there are memory_failure() events signalled
> by the CPU machine-check handler, or the platform BIOS. In the case of
> badblocks, that information needs to be considered by the fs block allocator
> to avoid / try to repair badblocks on allocation, and to allow listing
> damaged files that need repair. The memory_failure() notification needs
> immediate handling to tear down mappings to that pfn and signal processes
> that have consumed it with SIGBUS-action-required. Processes that have the
> poison mapped, but have not consumed it, receive SIGBUS-action-optional.
> 
> > So, we split it into two parts.  And dax device and block device won't be
> > mixed up again.  Is my understanding right?
> 
> Right, it's only the filesystem that knows that the block_device and the
> dax_device alias data at the same logical offset. The requirements for sector
> error handling and page error handling are separate like
> block_device_operations and dax_operations.
> 
> > But the solution above solves hwpoison on one or a couple of pages, which
> > happens rarely (I think).  Does the 'pmem remove' operation cause hwpoison
> > too, calling memory_failure() so many times?  I haven't understood this yet.
> 
> I'm working on a patch here to call memory_failure() on a wide range for the
> surprise remove of a dax_device while a filesystem might be mounted. It won't
> be efficient, but there is no other way to notify the kernel that it needs to
> immediately stop referencing a page.
Shiyang Ruan March 19, 2021, 2:17 a.m. UTC | #8
> -----Original Message-----
> From: ruansy.fnst@fujitsu.com <ruansy.fnst@fujitsu.com>
> Subject: RE: [PATCH v3 01/11] pagemap: Introduce ->memory_failure()
> > > > > >
> > > > > > After the conversation with Dave I don't see the point of this.
> > > > > > If there is a memory_failure() on a page, why not just call
> > > > > > memory_failure()? That already knows how to find the inode and
> > > > > > the filesystem can be notified from there.
> > > > >
> > > > > We want memory_failure() to support reflinked files.  In this case,
> > > > > we are not able to track multiple files from a page (the broken page)
> > > > > because page->mapping and page->index can only track one file.  Thus,
> > > > > I introduce this ->memory_failure(), implemented in the pmem driver,
> > > > > to call ->corrupted_range() level by level up the stack, and finally
> > > > > find the files that are using (mmapping) this page.
> > > > >
> > > >
> > > > I know the motivation, but this implementation seems backwards.
> > > > It's already the case that memory_failure() looks up the
> > > > address_space associated with a mapping. From there I would expect
> > > > a new 'struct address_space_operations' op to let the fs handle
> > > > the case when there are multiple address_spaces associated with a
> > > > given file.
> > > >
> > >
> > > Let me think about it.  In this way, we
> > >     1. associate file mapping with dax page in dax page fault;
> >
> > I think this needs to be a new type of association that proxies the
> > representation of the reflink across all involved address_spaces.
> >
> > >     2. iterate over reflinked files to send the `kill processes` signal
> > >           via the new address_space_operation;
> > >     3. re-associate with another reflinked file's mapping when unmapping
> > >         (rmap query in the filesystem to get the other file).
> >
> > Perhaps the proxy object is reference-counted per reflink. It seems
> > error-prone to keep changing the association of the pfn while the
> > reflink is intact.
> Hi, Dan
> 
> I think my early RFC patchset was implemented in this way:
>  - Create a per-page 'dax-rmap tree' to store each reflinked file's (mapping,
> offset) pair when handling a dax page fault.
>  - Mount this tree on page->zone_device_data, which is not used in fsdax, so
> that we can easily iterate over reflinked file mappings in memory_failure().
> In my understanding, the dax-rmap tree is the proxy object you mentioned.  If
> so, I have to say this method was rejected, because it causes huge overhead
> in the case where every dax page has its own dax-rmap tree.
> 

Hi, Dan

What do you think about this?  I am still confused.  Could you give me some advice?


--
Thanks,
Ruan Shiyang.

> 
> --
> Thanks,
> Ruan Shiyang.
> >
> > > It does not handle dax pages that are not in use, because their
> > > ->mapping is not associated with any file.  I didn't think this through
> > > until reading your conversation.  Here is my understanding: this case
> > > should be handled by the badblock mechanism in the pmem driver.  This
> > > badblock mechanism will call ->corrupted_range() to tell the filesystem
> > > to repair the data if possible.
> >
> > There are 2 types of notifications. There are badblocks discovered by
> > the driver (see notify_pmem()) and there are memory_failure() events
> > signalled by the CPU machine-check handler, or the platform BIOS. In the
> > case of badblocks, that information needs to be considered by the fs block
> > allocator to avoid / try to repair badblocks on allocation, and to allow
> > listing damaged files that need repair. The memory_failure() notification
> > needs immediate handling to tear down mappings to that pfn and signal
> > processes that have consumed it with SIGBUS-action-required. Processes
> > that have the poison mapped, but have not consumed it, receive
> > SIGBUS-action-optional.
> >
> > > So, we split it into two parts.  And dax device and block device won't
> > > be mixed up again.  Is my understanding right?
> >
> > Right, it's only the filesystem that knows that the block_device and
> > the dax_device alias data at the same logical offset. The requirements
> > for sector error handling and page error handling are separate like
> > block_device_operations and dax_operations.
> >
> > > But the solution above solves hwpoison on one or a couple of pages,
> > > which happens rarely (I think).  Does the 'pmem remove' operation cause
> > > hwpoison too, calling memory_failure() so many times?  I haven't
> > > understood this yet.
> >
> > I'm working on a patch here to call memory_failure() on a wide range
> > for the surprise remove of a dax_device while a filesystem might be
> > mounted. It won't be efficient, but there is no other way to notify
> > the kernel that it needs to immediately stop referencing a page.
Dan Williams March 24, 2021, 2:19 a.m. UTC | #9
On Thu, Mar 18, 2021 at 7:18 PM ruansy.fnst@fujitsu.com
<ruansy.fnst@fujitsu.com> wrote:
>
>
>
> > -----Original Message-----
> > From: ruansy.fnst@fujitsu.com <ruansy.fnst@fujitsu.com>
> > Subject: RE: [PATCH v3 01/11] pagemap: Introduce ->memory_failure()
> > > > > > >
> > > > > > > After the conversation with Dave I don't see the point of this.
> > > > > > > If there is a memory_failure() on a page, why not just call
> > > > > > > memory_failure()? That already knows how to find the inode and
> > > > > > > the filesystem can be notified from there.
> > > > > >
> > > > > > We want memory_failure() to support reflinked files.  In this
> > > > > > case, we are not able to track multiple files from a page (the
> > > > > > broken page) because page->mapping and page->index can only track
> > > > > > one file.  Thus, I introduce this ->memory_failure(), implemented
> > > > > > in the pmem driver, to call ->corrupted_range() level by level up
> > > > > > the stack, and finally find the files that are using (mmapping)
> > > > > > this page.
> > > > > >
> > > > >
> > > > > I know the motivation, but this implementation seems backwards.
> > > > > It's already the case that memory_failure() looks up the
> > > > > address_space associated with a mapping. From there I would expect
> > > > > a new 'struct address_space_operations' op to let the fs handle
> > > > > the case when there are multiple address_spaces associated with a
> > > > > given file.
> > > > >
> > > >
> > > > Let me think about it.  In this way, we
> > > >     1. associate file mapping with dax page in dax page fault;
> > >
> > > I think this needs to be a new type of association that proxies the
> > > representation of the reflink across all involved address_spaces.
> > >
> > > >     2. iterate over reflinked files to send the `kill processes` signal
> > > >           via the new address_space_operation;
> > > >     3. re-associate with another reflinked file's mapping when unmapping
> > > >         (rmap query in the filesystem to get the other file).
> > >
> > > Perhaps the proxy object is reference-counted per reflink. It seems
> > > error-prone to keep changing the association of the pfn while the
> > > reflink is intact.
> > Hi, Dan
> >
> > I think my early RFC patchset was implemented in this way:
> >  - Create a per-page 'dax-rmap tree' to store each reflinked file's
> > (mapping, offset) pair when handling a dax page fault.
> >  - Mount this tree on page->zone_device_data, which is not used in fsdax,
> > so that we can easily iterate over reflinked file mappings in
> > memory_failure().
> > In my understanding, the dax-rmap tree is the proxy object you mentioned.
> > If so, I have to say this method was rejected, because it causes huge
> > overhead in the case where every dax page has its own dax-rmap tree.
> >
>
> Hi, Dan
>
> What do you think about this?  I am still confused.  Could you give me some advice?

So I think the primary driver of this functionality is dax-devices and
the architectural model for memory failure where several architectures
and error handlers know how to route pfn failure to the
memory_failure() frontend.

Compare that to block-devices where sector failure has no similar
framework, and despite some initial interest about reusing 'struct
badblocks' for this type of scenario there has been no real uptake to
expand 'struct badblocks' outside of the pmem driver.

I think the work you have done for ->corrupted_range() just needs to
be repurposed away from a block-device operation to dax-device
infrastructure. Christoph's pushback on extending
block_device_operations makes sense to me because there is likely no
other user of this facility than the pmem driver, and the pmem driver
only needs it for the vestigial reason that filesystems mount on
block-devices and not dax-devices.

Recently Dave drove home the point that a filesystem can't do anything
with pfns, it needs LBAs. A dax-device does not have LBA's, but it
does operate on the concept of device-relative offsets. The filesystem
is allowed to assume that dax-device:PFN[device_byte_offset >>
PAGE_SHIFT] aliases the same data as the associated
block-device:LBA[device_byte_offset >> SECTOR_SHIFT]. He also
reiterated that this interface should be range based, which you
already had, but I did not include in my attempt to communicate the
mass failure of an entire surprise-removed device.

So I think the path forward is:

- teach memory_failure() to allow for ranged failures

- let interested drivers register for memory failure events via a
blocking_notifier_head

- teach memory_failure() to optionally let the notifier chain claim
the event vs its current default of walking page->mapping

- teach the pmem driver to register for memory_failure() events and
filter the ones that apply to pfns that the driver owns

- drop the nfit driver's usage of the mce notifier chain since
memory_failure() is a superset of what the mce notifier communicates

- augment the pmem driver's view of badblocks that it gets from
address range scrub with ones it gets from memory_failure() events

- when pmem handles a memory_failure() event or an address range scrub
event fire a new event on a new per-dax-device blocking_notifier_head
indicating the dax-relative offset ranges of the translated PFNs. This
notification can optionally indicate failure, offline (for removal),
and online (for repaired ranges).

- teach dm to receive dax-device notifier events from its leaf devices
and then translate them into dax-device notifications relative to the
dm-device offset. This would seem to be a straightforward conversion
of what you have done with ->corrupted_range()

- teach filesystems to register for dax-device notifiers

With all of that in place an interested filesystem can take ownership
of a memory failure that impacts a range of pfns it is responsible for
via a dax-device, but it also allows a not interested filesystem to
default to standard single-pfn-at-a-time error handling and
assumptions about page->mapping only referring to a single address
space.

This obviously does not solve Dave's desire to get this type of error
reporting on block_devices, but I think there's nothing stopping a
parallel notifier chain from being created for block-devices, but
that's orthogonal to requirements and capabilities provided by
dax-devices.
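
To make the notifier steps above concrete, a sketch with hypothetical names
(no such chain or payload exists in the kernel today):

static BLOCKING_NOTIFIER_HEAD(mf_notifier_list);	/* hypothetical */

struct mf_event {			/* hypothetical payload */
	unsigned long pfn;
	unsigned long nr_pages;
	int flags;
};

static int pmem_mf_notify(struct notifier_block *nb, unsigned long action,
			  void *data)
{
	struct mf_event *ev = data;

	/* Only claim pfns that this pmem device actually owns. */
	if (!pmem_owns_pfns(ev->pfn, ev->nr_pages))	/* hypothetical */
		return NOTIFY_DONE;

	/*
	 * Translate the pfns to dax-device-relative offsets and fire the
	 * per-dax-device chain so dm and the filesystem can react.
	 */
	return NOTIFY_OK;
}

static struct notifier_block pmem_mf_nb = {
	.notifier_call = pmem_mf_notify,
};

/*
 * Driver probe would then do:
 *	blocking_notifier_chain_register(&mf_notifier_list, &pmem_mf_nb);
 * and memory_failure() would walk the chain before falling back to its
 * default page->mapping handling.
 */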
Christoph Hellwig March 24, 2021, 7:47 a.m. UTC | #10
On Tue, Mar 23, 2021 at 07:19:28PM -0700, Dan Williams wrote:
> So I think the path forward is:
> 
> - teach memory_failure() to allow for ranged failures
> 
> - let interested drivers register for memory failure events via a
> blocking_notifier_head

Eww.  As I said I think the right way is that the file system (or
other consumer) can register a set of callbacks for opening the device.
I have a series I need to finish and send out to do that for block
devices.  We probably also need the concept of a holder for the dax
device to make it work nicely, as otherwise we're going to have a bit
of a mess.
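
A sketch of what that holder-based registration could look like
(illustrative names only, not Christoph's actual series):

struct dax_holder_operations {
	/* the holder (fs) translates offsets back to files and acts */
	int (*notify_failure)(struct dax_device *dax_dev, u64 offset,
			      u64 len, int mf_flags);
};

/* The filesystem claims the dax_device at mount time... */
int dax_set_holder(struct dax_device *dax_dev, void *holder,
		   const struct dax_holder_operations *ops);

/* ...and the pmem driver routes pfn failures through the holder. */
int dax_notify_failure(struct dax_device *dax_dev, u64 offset, u64 len,
		       int mf_flags);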

> This obviously does not solve Dave's desire to get this type of error
> reporting on block_devices, but I think there's nothing stopping a
> parallel notifier chain from being created for block-devices, but
> that's orthogonal to requirements and capabilities provided by
> dax-devices.

FYI, my series could easily accommodate that if we ever get a block
driver that actually could report such errors.
Dan Williams March 24, 2021, 4:37 p.m. UTC | #11
On Wed, Mar 24, 2021 at 12:48 AM Christoph Hellwig <hch@lst.de> wrote:
>
> On Tue, Mar 23, 2021 at 07:19:28PM -0700, Dan Williams wrote:
> > So I think the path forward is:
> >
> > - teach memory_failure() to allow for ranged failures
> >
> > - let interested drivers register for memory failure events via a
> > blocking_notifier_head
>
> Eww.  As I said I think the right way is that the file system (or
> other consumer) can register a set of callbacks for opening the device.

How does that solve the problem of the driver being notified of all
pfn failure events? Today pmem only finds out about the ones that are
notified via native x86 machine check error handling via a notifier
(yes "firmware-first" error handling fails to do the right thing for
the pmem driver), or the ones that are eventually reported via address
range scrub, but only for the nvdimms that implement range scrubbing.
memory_failure() seems a reasonable catch all point to route pfn
failure events, in an arch independent way, to interested drivers.

I'm fine swapping out dax_device blocking_notifier chains for your
proposal, but that does not address all the proposed reworks in my
list which are:

- delete "drivers/acpi/nfit/mce.c"

- teach memory_failure() to be able to communicate range failure

- enable memory_failure() to defer to a filesystem that can say
"critical metadata is impacted, no point in trying to do file-by-file
isolation, bring the whole fs down".

> I have a series I need to finish and send out to do that for block
> devices.  We probably also need the concept of a holder for the dax
> device to make it work nicely, as otherwise we're going to have a bit
> of a mess.

Ok, I'll take a look at adding a holder.

>
> > This obviously does not solve Dave's desire to get this type of error
> > reporting on block_devices, but I think there's nothing stopping a
> > parallel notifier chain from being created for block-devices, but
> > that's orthogonal to requirements and capabilities provided by
> > dax-devices.
>
> FYI, my series could easily accomodate that if we ever get a block
> driver that actually could report such errors.

Sure, whatever we land for a dax_device could easily be adopted for a
block device.
Christoph Hellwig March 24, 2021, 5:39 p.m. UTC | #12
On Wed, Mar 24, 2021 at 09:37:01AM -0700, Dan Williams wrote:
> > Eww.  As I said I think the right way is that the file system (or
> > other consumer) can register a set of callbacks for opening the device.
> 
> How does that solve the problem of the driver being notified of all
> pfn failure events?

Ok, I probably just showed I need to spend more time looking at
your proposal vs the actual code...

Don't we have a proper way for one of the nvdimm layers to own a
specific memory range and call directly into that instead of going
through a notifier?

> Today pmem only finds out about the ones that are
> notified via native x86 machine check error handling via a notifier
> (yes "firmware-first" error handling fails to do the right thing for
> the pmem driver),

Did any kind of firmware-first error handling ever get anything
right?  I wish people would have learned that by now.

> or the ones that are eventually reported via address
> range scrub, but only for the nvdimms that implement range scrubbing.
> memory_failure() seems a reasonable catch all point to route pfn
> failure events, in an arch independent way, to interested drivers.

Yeah.

> I'm fine swapping out dax_device blocking_notifier chains for your
> proposal, but that does not address all the proposed reworks in my
> list which are:
> 
> - delete "drivers/acpi/nfit/mce.c"
> 
> - teach memory_failure() to be able to communicate range failure
> 
> - enable memory_failure() to defer to a filesystem that can say
> "critical metadata is impacted, no point in trying to do file-by-file
> isolation, bring the whole fs down".

This all sounds sensible.
Dan Williams March 24, 2021, 6 p.m. UTC | #13
On Wed, Mar 24, 2021 at 10:39 AM Christoph Hellwig <hch@lst.de> wrote:
>
> On Wed, Mar 24, 2021 at 09:37:01AM -0700, Dan Williams wrote:
> > > Eww.  As I said I think the right way is that the file system (or
> > > other consumer) can register a set of callbacks for opening the device.
> >
> > How does that solve the problem of the driver being notified of all
> > pfn failure events?
>
> Ok, I probably just showed I need to spend more time looking at
> your proposal vs the actual code..
>
> Don't we have a proper way how one of the nvdimm layers own a
> spefific memory range and call directly into that instead of through
> a notifier?

So that could be a new dev_pagemap operation as Ruan has here. I was
thinking that other agents would be interested in non-dev_pagemap
managed ranges, but we could leave that for later and just make the
current pgmap->memory_failure() callback proposal range based.

>
> > Today pmem only finds out about the ones that are
> > notified via native x86 machine check error handling via a notifier
> > (yes "firmware-first" error handling fails to do the right thing for
> > the pmem driver),
>
> Did any kind of firmware-first error handling ever get anything
> right?  I wish people would have learned that by now.

Part of me wants to say that if you use firmware-first you get to keep
the pieces, but it's not always the end user's choice as far as I
understand.

> > or the ones that are eventually reported via address
> > range scrub, but only for the nvdimms that implement range scrubbing.
> > memory_failure() seems a reasonable catch all point to route pfn
> > failure events, in an arch independent way, to interested drivers.
>
> Yeah.
>
> > I'm fine swapping out dax_device blocking_notifier chains for your
> > proposal, but that does not address all the proposed reworks in my
> > list which are:
> >
> > - delete "drivers/acpi/nfit/mce.c"
> >
> > - teach memory_failure() to be able to communicate range failure
> >
> > - enable memory_failure() to defer to a filesystem that can say
> > "critical metadata is impacted, no point in trying to do file-by-file
> > isolation, bring the whole fs down".
>
> This all sounds sensible.

Ok, Ruan, I think this means rework your dev_pagemap_ops callback to
be range based. Add a holder concept for dax_devices and then layer
that on Christoph's eventual dax_device callback mechanism that a
dax_device holder can register.
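
A range-based variant of the callback might then look like this
(the added parameter names are illustrative):

struct dev_pagemap_ops {
	/* ... */
	int (*memory_failure)(struct dev_pagemap *pgmap, unsigned long pfn,
			      unsigned long nr_pages, int mf_flags);
};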

Patch

diff --git a/include/linux/memremap.h b/include/linux/memremap.h
index 79c49e7f5c30..0bcf2b1e20bd 100644
--- a/include/linux/memremap.h
+++ b/include/linux/memremap.h
@@ -87,6 +87,14 @@  struct dev_pagemap_ops {
 	 * the page back to a CPU accessible page.
 	 */
 	vm_fault_t (*migrate_to_ram)(struct vm_fault *vmf);
+
+	/*
+	 * Handle a memory failure that happens on one page.  Notify the
+	 * processes that are using this page, and try to recover the data
+	 * on this page if necessary.
+	 */
+	int (*memory_failure)(struct dev_pagemap *pgmap, unsigned long pfn,
+			      int flags);
 };
 
 #define PGMAP_ALTMAP_VALID	(1 << 0)