diff mbox series

[RESEND,v6,1/9] pagemap: Introduce ->memory_failure()

Message ID 20210730100158.3117319-2-ruansy.fnst@fujitsu.com (mailing list archive)
State New, archived
Headers show
Series fsdax: introduce fs query to support reflink | expand

Commit Message

Shiyang Ruan July 30, 2021, 10:01 a.m. UTC
When memory-failure occurs, we call this function which is implemented
by each kind of devices.  For the fsdax case, pmem device driver
implements it.  Pmem device driver will find out the filesystem in which
the corrupted page located in.  And finally call filesystem handler to
deal with this error.

The filesystem will try to recover the corrupted data if necessary.

Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com>
---
 include/linux/memremap.h | 9 +++++++++
 1 file changed, 9 insertions(+)

Comments

Jane Chu Aug. 6, 2021, 1:17 a.m. UTC | #1
The filesystem part of the pmem failure handling is at minimum built
on PAGE_SIZE granularity - an inheritance from general memory_failure 
handling.  However, with Intel's DCPMEM technology, the error blast
radius is no more than 256bytes, and might get smaller with future
hardware generation, also advanced atomic 64B write to clear the poison.
But I don't see any of that could be incorporated in, given that the
filesystem is notified a corruption with pfn, rather than an exact
address.

So I guess this question is also for Dan: how to avoid unnecessarily
repairing a PMD range for a 256B corrupt range going forward?

thanks,
-jane


On 7/30/2021 3:01 AM, Shiyang Ruan wrote:
> When memory-failure occurs, we call this function which is implemented
> by each kind of devices.  For the fsdax case, pmem device driver
> implements it.  Pmem device driver will find out the filesystem in which
> the corrupted page located in.  And finally call filesystem handler to
> deal with this error.
> 
> The filesystem will try to recover the corrupted data if necessary.
Jane Chu Aug. 16, 2021, 5:20 p.m. UTC | #2
Hi, ShiYang,

So I applied the v6 patch series to my 5.14-rc3 as it's what you 
indicated is what v6 was based at, and injected a hardware poison.

I'm seeing the same problem that was reported a while ago after the
poison was consumed - in the SIGBUS payload, the si_addr is missing:

** SIGBUS(7): canjmp=1, whichstep=0, **
** si_addr(0x(nil)), si_lsb(0xC), si_code(0x4, BUS_MCEERR_AR) **

The si_addr ought to be 0x7f6568000000 - the vaddr of the first page
in this case.

Something is not right...

thanks,
-jane


On 8/5/2021 6:17 PM, Jane Chu wrote:
> The filesystem part of the pmem failure handling is at minimum built
> on PAGE_SIZE granularity - an inheritance from general memory_failure 
> handling.  However, with Intel's DCPMEM technology, the error blast
> radius is no more than 256bytes, and might get smaller with future
> hardware generation, also advanced atomic 64B write to clear the poison.
> But I don't see any of that could be incorporated in, given that the
> filesystem is notified a corruption with pfn, rather than an exact
> address.
> 
> So I guess this question is also for Dan: how to avoid unnecessarily
> repairing a PMD range for a 256B corrupt range going forward?
> 
> thanks,
> -jane
> 
> 
> On 7/30/2021 3:01 AM, Shiyang Ruan wrote:
>> When memory-failure occurs, we call this function which is implemented
>> by each kind of devices.  For the fsdax case, pmem device driver
>> implements it.  Pmem device driver will find out the filesystem in which
>> the corrupted page located in.  And finally call filesystem handler to
>> deal with this error.
>>
>> The filesystem will try to recover the corrupted data if necessary.
>
Shiyang Ruan Aug. 17, 2021, 1:44 a.m. UTC | #3
> -----Original Message-----
> From: Jane Chu <jane.chu@oracle.com>
> Subject: Re: [PATCH RESEND v6 1/9] pagemap: Introduce ->memory_failure()
> 
> Hi, ShiYang,
> 
> So I applied the v6 patch series to my 5.14-rc3 as it's what you indicated is what
> v6 was based at, and injected a hardware poison.
> 
> I'm seeing the same problem that was reported a while ago after the poison
> was consumed - in the SIGBUS payload, the si_addr is missing:
> 
> ** SIGBUS(7): canjmp=1, whichstep=0, **
> ** si_addr(0x(nil)), si_lsb(0xC), si_code(0x4, BUS_MCEERR_AR) **
> 
> The si_addr ought to be 0x7f6568000000 - the vaddr of the first page in this
> case.
> 
> Something is not right...

Hi Jane,

Sorry for late reply.  Thanks for testing.  This address should have been reported in my code. I'll check why it's finally nil.


--
Thanks.
Ruan.

> 
> thanks,
> -jane
> 
> 
> On 8/5/2021 6:17 PM, Jane Chu wrote:
> > The filesystem part of the pmem failure handling is at minimum built
> > on PAGE_SIZE granularity - an inheritance from general memory_failure
> > handling.  However, with Intel's DCPMEM technology, the error blast
> > radius is no more than 256bytes, and might get smaller with future
> > hardware generation, also advanced atomic 64B write to clear the poison.
> > But I don't see any of that could be incorporated in, given that the
> > filesystem is notified a corruption with pfn, rather than an exact
> > address.
> >
> > So I guess this question is also for Dan: how to avoid unnecessarily
> > repairing a PMD range for a 256B corrupt range going forward?
> >
> > thanks,
> > -jane
> >
> >
> > On 7/30/2021 3:01 AM, Shiyang Ruan wrote:
> >> When memory-failure occurs, we call this function which is
> >> implemented by each kind of devices.  For the fsdax case, pmem device
> >> driver implements it.  Pmem device driver will find out the
> >> filesystem in which the corrupted page located in.  And finally call
> >> filesystem handler to deal with this error.
> >>
> >> The filesystem will try to recover the corrupted data if necessary.
> >
Jane Chu Aug. 18, 2021, 5:43 a.m. UTC | #4
More information -

On 8/16/2021 10:20 AM, Jane Chu wrote:
> Hi, ShiYang,
> 
> So I applied the v6 patch series to my 5.14-rc3 as it's what you 
> indicated is what v6 was based at, and injected a hardware poison.
> 
> I'm seeing the same problem that was reported a while ago after the
> poison was consumed - in the SIGBUS payload, the si_addr is missing:
> 
> ** SIGBUS(7): canjmp=1, whichstep=0, **
> ** si_addr(0x(nil)), si_lsb(0xC), si_code(0x4, BUS_MCEERR_AR) **
> 
> The si_addr ought to be 0x7f6568000000 - the vaddr of the first page
> in this case.

The failure came from here :

[PATCH RESEND v6 6/9] xfs: Implement ->notify_failure() for XFS

+static int
+xfs_dax_notify_failure(
...
+	if (!xfs_sb_version_hasrmapbt(&mp->m_sb)) {
+		xfs_warn(mp, "notify_failure() needs rmapbt enabled!");
+		return -EOPNOTSUPP;
+	}

I am not familiar with XFS, but I have a few questions I hope to get 
answers -

1) What does it take and cost to make
    xfs_sb_version_hasrmapbt(&mp->m_sb) to return true?

2) For a running environment that fails the above check, is it
    okay to leave the poison handle in limbo and why?

3) If the above regression is not acceptable, any potential remedy?

thanks!
-jane


> 
> Something is not right...
> 
> thanks,
> -jane
> 
> 
> On 8/5/2021 6:17 PM, Jane Chu wrote:
>> The filesystem part of the pmem failure handling is at minimum built
>> on PAGE_SIZE granularity - an inheritance from general memory_failure 
>> handling.  However, with Intel's DCPMEM technology, the error blast
>> radius is no more than 256bytes, and might get smaller with future
>> hardware generation, also advanced atomic 64B write to clear the poison.
>> But I don't see any of that could be incorporated in, given that the
>> filesystem is notified a corruption with pfn, rather than an exact
>> address.
>>
>> So I guess this question is also for Dan: how to avoid unnecessarily
>> repairing a PMD range for a 256B corrupt range going forward?
>>
>> thanks,
>> -jane
>>
>>
>> On 7/30/2021 3:01 AM, Shiyang Ruan wrote:
>>> When memory-failure occurs, we call this function which is implemented
>>> by each kind of devices.  For the fsdax case, pmem device driver
>>> implements it.  Pmem device driver will find out the filesystem in which
>>> the corrupted page located in.  And finally call filesystem handler to
>>> deal with this error.
>>>
>>> The filesystem will try to recover the corrupted data if necessary.
>>
>
Jane Chu Aug. 18, 2021, 6:08 a.m. UTC | #5
On 8/17/2021 10:43 PM, Jane Chu wrote:
> More information -
> 
> On 8/16/2021 10:20 AM, Jane Chu wrote:
>> Hi, ShiYang,
>>
>> So I applied the v6 patch series to my 5.14-rc3 as it's what you 
>> indicated is what v6 was based at, and injected a hardware poison.
>>
>> I'm seeing the same problem that was reported a while ago after the
>> poison was consumed - in the SIGBUS payload, the si_addr is missing:
>>
>> ** SIGBUS(7): canjmp=1, whichstep=0, **
>> ** si_addr(0x(nil)), si_lsb(0xC), si_code(0x4, BUS_MCEERR_AR) **
>>
>> The si_addr ought to be 0x7f6568000000 - the vaddr of the first page
>> in this case.
> 
> The failure came from here :
> 
> [PATCH RESEND v6 6/9] xfs: Implement ->notify_failure() for XFS
> 
> +static int
> +xfs_dax_notify_failure(
> ...
> +    if (!xfs_sb_version_hasrmapbt(&mp->m_sb)) {
> +        xfs_warn(mp, "notify_failure() needs rmapbt enabled!");
> +        return -EOPNOTSUPP;
> +    }
> 
> I am not familiar with XFS, but I have a few questions I hope to get 
> answers -
> 
> 1) What does it take and cost to make
>     xfs_sb_version_hasrmapbt(&mp->m_sb) to return true?
> 
> 2) For a running environment that fails the above check, is it
>     okay to leave the poison handle in limbo and why?
> 
> 3) If the above regression is not acceptable, any potential remedy?

How about moving the check to prior to the notifier registration?
And register only if the check is passed?  This seems better
than an alternative which is to fall back to the legacy memory_failure
handling in case the filesystem returns -EOPNOTSUPP.

thanks,
-jane

> 
> thanks!
> -jane
> 
> 
>>
>> Something is not right...
>>
>> thanks,
>> -jane
>>
>>
>> On 8/5/2021 6:17 PM, Jane Chu wrote:
>>> The filesystem part of the pmem failure handling is at minimum built
>>> on PAGE_SIZE granularity - an inheritance from general memory_failure 
>>> handling.  However, with Intel's DCPMEM technology, the error blast
>>> radius is no more than 256bytes, and might get smaller with future
>>> hardware generation, also advanced atomic 64B write to clear the poison.
>>> But I don't see any of that could be incorporated in, given that the
>>> filesystem is notified a corruption with pfn, rather than an exact
>>> address.
>>>
>>> So I guess this question is also for Dan: how to avoid unnecessarily
>>> repairing a PMD range for a 256B corrupt range going forward?
>>>
>>> thanks,
>>> -jane
>>>
>>>
>>> On 7/30/2021 3:01 AM, Shiyang Ruan wrote:
>>>> When memory-failure occurs, we call this function which is implemented
>>>> by each kind of devices.  For the fsdax case, pmem device driver
>>>> implements it.  Pmem device driver will find out the filesystem in 
>>>> which
>>>> the corrupted page located in.  And finally call filesystem handler to
>>>> deal with this error.
>>>>
>>>> The filesystem will try to recover the corrupted data if necessary.
>>>
>>
>
Shiyang Ruan Aug. 18, 2021, 7:52 a.m. UTC | #6
> -----Original Message-----
> From: Jane Chu <jane.chu@oracle.com>
> Subject: Re: [PATCH RESEND v6 1/9] pagemap: Introduce ->memory_failure()
> 
> 
> On 8/17/2021 10:43 PM, Jane Chu wrote:
> > More information -
> >
> > On 8/16/2021 10:20 AM, Jane Chu wrote:
> >> Hi, ShiYang,
> >>
> >> So I applied the v6 patch series to my 5.14-rc3 as it's what you
> >> indicated is what v6 was based at, and injected a hardware poison.
> >>
> >> I'm seeing the same problem that was reported a while ago after the
> >> poison was consumed - in the SIGBUS payload, the si_addr is missing:
> >>
> >> ** SIGBUS(7): canjmp=1, whichstep=0, **
> >> ** si_addr(0x(nil)), si_lsb(0xC), si_code(0x4, BUS_MCEERR_AR) **
> >>
> >> The si_addr ought to be 0x7f6568000000 - the vaddr of the first page
> >> in this case.
> >
> > The failure came from here :
> >
> > [PATCH RESEND v6 6/9] xfs: Implement ->notify_failure() for XFS
> >
> > +static int
> > +xfs_dax_notify_failure(
> > ...
> > +    if (!xfs_sb_version_hasrmapbt(&mp->m_sb)) {
> > +        xfs_warn(mp, "notify_failure() needs rmapbt enabled!");
> > +        return -EOPNOTSUPP;
> > +    }
> >
> > I am not familiar with XFS, but I have a few questions I hope to get
> > answers -
> >
> > 1) What does it take and cost to make
> >     xfs_sb_version_hasrmapbt(&mp->m_sb) to return true?

Enable rmpabt feature when making xfs filesystem
   `mkfs.xfs -m rmapbt=1 /path/to/device`
BTW, reflink is enabled by default.

> >
> > 2) For a running environment that fails the above check, is it
> >     okay to leave the poison handle in limbo and why?
It will fall back to the old handler.  I think you have already known it.

> >
> > 3) If the above regression is not acceptable, any potential remedy?
> 
> How about moving the check to prior to the notifier registration?
> And register only if the check is passed?  This seems better than an
> alternative which is to fall back to the legacy memory_failure handling in case
> the filesystem returns -EOPNOTSUPP.

Sounds like a nice solution.  I think I can add an is_notify_supported() interface in dax_holder_ops and check it when register dax_holder.

--
Thanks,
Ruan.
> 
> thanks,
> -jane
> 
> >
> > thanks!
> > -jane
> >
> >
> >>
> >> Something is not right...
> >>
> >> thanks,
> >> -jane
> >>
> >>
> >> On 8/5/2021 6:17 PM, Jane Chu wrote:
> >>> The filesystem part of the pmem failure handling is at minimum built
> >>> on PAGE_SIZE granularity - an inheritance from general
> >>> memory_failure handling.  However, with Intel's DCPMEM technology,
> >>> the error blast radius is no more than 256bytes, and might get
> >>> smaller with future hardware generation, also advanced atomic 64B write
> to clear the poison.
> >>> But I don't see any of that could be incorporated in, given that the
> >>> filesystem is notified a corruption with pfn, rather than an exact
> >>> address.
> >>>
> >>> So I guess this question is also for Dan: how to avoid unnecessarily
> >>> repairing a PMD range for a 256B corrupt range going forward?
> >>>
> >>> thanks,
> >>> -jane
> >>>
> >>>
> >>> On 7/30/2021 3:01 AM, Shiyang Ruan wrote:
> >>>> When memory-failure occurs, we call this function which is
> >>>> implemented by each kind of devices.  For the fsdax case, pmem
> >>>> device driver implements it.  Pmem device driver will find out the
> >>>> filesystem in which the corrupted page located in.  And finally
> >>>> call filesystem handler to deal with this error.
> >>>>
> >>>> The filesystem will try to recover the corrupted data if necessary.
> >>>
> >>
> >
Darrick J. Wong Aug. 18, 2021, 3:52 p.m. UTC | #7
On Tue, Aug 17, 2021 at 11:08:40PM -0700, Jane Chu wrote:
> 
> On 8/17/2021 10:43 PM, Jane Chu wrote:
> > More information -
> > 
> > On 8/16/2021 10:20 AM, Jane Chu wrote:
> > > Hi, ShiYang,
> > > 
> > > So I applied the v6 patch series to my 5.14-rc3 as it's what you
> > > indicated is what v6 was based at, and injected a hardware poison.
> > > 
> > > I'm seeing the same problem that was reported a while ago after the
> > > poison was consumed - in the SIGBUS payload, the si_addr is missing:
> > > 
> > > ** SIGBUS(7): canjmp=1, whichstep=0, **
> > > ** si_addr(0x(nil)), si_lsb(0xC), si_code(0x4, BUS_MCEERR_AR) **
> > > 
> > > The si_addr ought to be 0x7f6568000000 - the vaddr of the first page
> > > in this case.
> > 
> > The failure came from here :
> > 
> > [PATCH RESEND v6 6/9] xfs: Implement ->notify_failure() for XFS
> > 
> > +static int
> > +xfs_dax_notify_failure(
> > ...
> > +    if (!xfs_sb_version_hasrmapbt(&mp->m_sb)) {
> > +        xfs_warn(mp, "notify_failure() needs rmapbt enabled!");
> > +        return -EOPNOTSUPP;
> > +    }
> > 
> > I am not familiar with XFS, but I have a few questions I hope to get
> > answers -
> > 
> > 1) What does it take and cost to make
> >     xfs_sb_version_hasrmapbt(&mp->m_sb) to return true?

mkfs.xfs -m rmapbt=1

> > 2) For a running environment that fails the above check, is it
> >     okay to leave the poison handle in limbo and why?
> > 
> > 3) If the above regression is not acceptable, any potential remedy?
> 
> How about moving the check to prior to the notifier registration?
> And register only if the check is passed?  This seems better
> than an alternative which is to fall back to the legacy memory_failure
> handling in case the filesystem returns -EOPNOTSUPP.

"return -EOPNOTSUPP;" is the branching point where a future patch could
probe the (DRAM) buffer cache to bwrite the contents to restore the pmem
contents.  Right now the focus should be on landing the core code
changes without drawing any more NAKs from Dan.

--D

> thanks,
> -jane
> 
> > 
> > thanks!
> > -jane
> > 
> > 
> > > 
> > > Something is not right...
> > > 
> > > thanks,
> > > -jane
> > > 
> > > 
> > > On 8/5/2021 6:17 PM, Jane Chu wrote:
> > > > The filesystem part of the pmem failure handling is at minimum built
> > > > on PAGE_SIZE granularity - an inheritance from general
> > > > memory_failure handling.  However, with Intel's DCPMEM
> > > > technology, the error blast
> > > > radius is no more than 256bytes, and might get smaller with future
> > > > hardware generation, also advanced atomic 64B write to clear the poison.
> > > > But I don't see any of that could be incorporated in, given that the
> > > > filesystem is notified a corruption with pfn, rather than an exact
> > > > address.
> > > > 
> > > > So I guess this question is also for Dan: how to avoid unnecessarily
> > > > repairing a PMD range for a 256B corrupt range going forward?
> > > > 
> > > > thanks,
> > > > -jane
> > > > 
> > > > 
> > > > On 7/30/2021 3:01 AM, Shiyang Ruan wrote:
> > > > > When memory-failure occurs, we call this function which is implemented
> > > > > by each kind of devices.  For the fsdax case, pmem device driver
> > > > > implements it.  Pmem device driver will find out the
> > > > > filesystem in which
> > > > > the corrupted page located in.  And finally call filesystem handler to
> > > > > deal with this error.
> > > > > 
> > > > > The filesystem will try to recover the corrupted data if necessary.
> > > > 
> > > 
> >
Dan Williams Aug. 18, 2021, 5:10 p.m. UTC | #8
On Wed, Aug 18, 2021 at 12:52 AM ruansy.fnst@fujitsu.com
<ruansy.fnst@fujitsu.com> wrote:
>
>
>
> > -----Original Message-----
> > From: Jane Chu <jane.chu@oracle.com>
> > Subject: Re: [PATCH RESEND v6 1/9] pagemap: Introduce ->memory_failure()
> >
> >
> > On 8/17/2021 10:43 PM, Jane Chu wrote:
> > > More information -
> > >
> > > On 8/16/2021 10:20 AM, Jane Chu wrote:
> > >> Hi, ShiYang,
> > >>
> > >> So I applied the v6 patch series to my 5.14-rc3 as it's what you
> > >> indicated is what v6 was based at, and injected a hardware poison.
> > >>
> > >> I'm seeing the same problem that was reported a while ago after the
> > >> poison was consumed - in the SIGBUS payload, the si_addr is missing:
> > >>
> > >> ** SIGBUS(7): canjmp=1, whichstep=0, **
> > >> ** si_addr(0x(nil)), si_lsb(0xC), si_code(0x4, BUS_MCEERR_AR) **
> > >>
> > >> The si_addr ought to be 0x7f6568000000 - the vaddr of the first page
> > >> in this case.
> > >
> > > The failure came from here :
> > >
> > > [PATCH RESEND v6 6/9] xfs: Implement ->notify_failure() for XFS
> > >
> > > +static int
> > > +xfs_dax_notify_failure(
> > > ...
> > > +    if (!xfs_sb_version_hasrmapbt(&mp->m_sb)) {
> > > +        xfs_warn(mp, "notify_failure() needs rmapbt enabled!");
> > > +        return -EOPNOTSUPP;
> > > +    }
> > >
> > > I am not familiar with XFS, but I have a few questions I hope to get
> > > answers -
> > >
> > > 1) What does it take and cost to make
> > >     xfs_sb_version_hasrmapbt(&mp->m_sb) to return true?
>
> Enable rmpabt feature when making xfs filesystem
>    `mkfs.xfs -m rmapbt=1 /path/to/device`
> BTW, reflink is enabled by default.
>
> > >
> > > 2) For a running environment that fails the above check, is it
> > >     okay to leave the poison handle in limbo and why?
> It will fall back to the old handler.  I think you have already known it.
>
> > >
> > > 3) If the above regression is not acceptable, any potential remedy?
> >
> > How about moving the check to prior to the notifier registration?
> > And register only if the check is passed?  This seems better than an
> > alternative which is to fall back to the legacy memory_failure handling in case
> > the filesystem returns -EOPNOTSUPP.
>
> Sounds like a nice solution.  I think I can add an is_notify_supported() interface in dax_holder_ops and check it when register dax_holder.

Shouldn't the fs avoid registering a memory failure handler if it is
not prepared to take over? For example, shouldn't this case behave
identically to ext4 that will not even register a callback?
Jane Chu Aug. 19, 2021, 7:18 a.m. UTC | #9
Hi, Shiyang,

 >  > > 1) What does it take and cost to make
 >  > >     xfs_sb_version_hasrmapbt(&mp->m_sb) to return true?
 >
 > Enable rmpabt feature when making xfs filesystem
 >     `mkfs.xfs -m rmapbt=1 /path/to/device`
 > BTW, reflink is enabled by default.

Thanks!  I tried
mkfs.xfs -d agcount=2,extszinherit=512,su=2m,sw=1 -m reflink=0 -m 
rmapbt=1 -f /dev/pmem0

Again, injected a HW poison to the first page in a dax-file, had
the poison consumed and received a SIGBUS. The result is better -

** SIGBUS(7): canjmp=1, whichstep=0, **
** si_addr(0x0x7ff2d8800000), si_lsb(0x15), si_code(0x4, BUS_MCEERR_AR) **

The SIGBUS payload looks correct.

However, "dmesg" has 2048 lines on sending SIGBUS, one per 512bytes -

[ 7003.482326] Memory failure: 0x1850600: Sending SIGBUS to 
fsdax_poison_v1:4109 due to hardware memory corruption
[ 7003.507956] Memory failure: 0x1850800: Sending SIGBUS to 
fsdax_poison_v1:4109 due to hardware memory corruption
[ 7003.531681] Memory failure: 0x1850a00: Sending SIGBUS to 
fsdax_poison_v1:4109 due to hardware memory corruption
[ 7003.554190] Memory failure: 0x1850c00: Sending SIGBUS to 
fsdax_poison_v1:4109 due to hardware memory corruption
[ 7003.575831] Memory failure: 0x1850e00: Sending SIGBUS to 
fsdax_poison_v1:4109 due to hardware memory corruption
[ 7003.596796] Memory failure: 0x1851000: Sending SIGBUS to 
fsdax_poison_v1:4109 due to hardware memory corruption
....
[ 7045.738270] Memory failure: 0x194fe00: Sending SIGBUS to 
fsdax_poison_v1:4109 due to hardware memory corruption
[ 7045.758885] Memory failure: 0x1950000: Sending SIGBUS to 
fsdax_poison_v1:4109 due to hardware memory corruption
[ 7045.779495] Memory failure: 0x1950200: Sending SIGBUS to 
fsdax_poison_v1:4109 due to hardware memory corruption
[ 7045.800106] Memory failure: 0x1950400: Sending SIGBUS to 
fsdax_poison_v1:4109 due to hardware memory corruption

That's too much for a single process dealing with a single
poison in a PMD page. If nothing else, given an .si_addr_lsb being 0x15,
it doesn't make sense to send a SIGBUS per 512B block.

Could you determine the user process' mapping size from the filesystem,
and take that as a hint to determine how many iterations to call
mf_dax_kill_procs() ?

thanks!
-jane
Jane Chu Aug. 19, 2021, 8:11 a.m. UTC | #10
Sorry, correction in line.

On 8/19/2021 12:18 AM, Jane Chu wrote:
> Hi, Shiyang,
> 
>  >  > > 1) What does it take and cost to make
>  >  > >     xfs_sb_version_hasrmapbt(&mp->m_sb) to return true?
>  >
>  > Enable rmpabt feature when making xfs filesystem
>  >     `mkfs.xfs -m rmapbt=1 /path/to/device`
>  > BTW, reflink is enabled by default.
> 
> Thanks!  I tried
> mkfs.xfs -d agcount=2,extszinherit=512,su=2m,sw=1 -m reflink=0 -m 
> rmapbt=1 -f /dev/pmem0
> 
> Again, injected a HW poison to the first page in a dax-file, had
> the poison consumed and received a SIGBUS. The result is better -
> 
> ** SIGBUS(7): canjmp=1, whichstep=0, **
> ** si_addr(0x0x7ff2d8800000), si_lsb(0x15), si_code(0x4, BUS_MCEERR_AR) **
> 
> The SIGBUS payload looks correct.
> 
> However, "dmesg" has 2048 lines on sending SIGBUS, one per 512bytes -

Actually that's one per 2MB, even though the poison is located
in pfn 0x1850600 only.

> 
> [ 7003.482326] Memory failure: 0x1850600: Sending SIGBUS to 
> fsdax_poison_v1:4109 due to hardware memory corruption
> [ 7003.507956] Memory failure: 0x1850800: Sending SIGBUS to 
> fsdax_poison_v1:4109 due to hardware memory corruption
> [ 7003.531681] Memory failure: 0x1850a00: Sending SIGBUS to 
> fsdax_poison_v1:4109 due to hardware memory corruption
> [ 7003.554190] Memory failure: 0x1850c00: Sending SIGBUS to 
> fsdax_poison_v1:4109 due to hardware memory corruption
> [ 7003.575831] Memory failure: 0x1850e00: Sending SIGBUS to 
> fsdax_poison_v1:4109 due to hardware memory corruption
> [ 7003.596796] Memory failure: 0x1851000: Sending SIGBUS to 
> fsdax_poison_v1:4109 due to hardware memory corruption
> ....
> [ 7045.738270] Memory failure: 0x194fe00: Sending SIGBUS to 
> fsdax_poison_v1:4109 due to hardware memory corruption
> [ 7045.758885] Memory failure: 0x1950000: Sending SIGBUS to 
> fsdax_poison_v1:4109 due to hardware memory corruption
> [ 7045.779495] Memory failure: 0x1950200: Sending SIGBUS to 
> fsdax_poison_v1:4109 due to hardware memory corruption
> [ 7045.800106] Memory failure: 0x1950400: Sending SIGBUS to 
> fsdax_poison_v1:4109 due to hardware memory corruption
> 
> That's too much for a single process dealing with a single
> poison in a PMD page. If nothing else, given an .si_addr_lsb being 0x15,
> it doesn't make sense to send a SIGBUS per 512B block.
> 
> Could you determine the user process' mapping size from the filesystem,
> and take that as a hint to determine how many iterations to call
> mf_dax_kill_procs() ?

Sorry, scratch the 512byte stuff... the filesystem has been
notified the length of the poison blast radius, could it take clue
from that?

thanks,
-jane

> 
> thanks!
> -jane
> 
> 
>
Shiyang Ruan Aug. 19, 2021, 9:10 a.m. UTC | #11
> From: Jane Chu <jane.chu@oracle.com>
> Subject: Re: [PATCH RESEND v6 1/9] pagemap: Introduce ->memory_failure()
> 
> Sorry, correction in line.
> 
> On 8/19/2021 12:18 AM, Jane Chu wrote:
> > Hi, Shiyang,
> >
> >  >  > > 1) What does it take and cost to make  >  > >
> > xfs_sb_version_hasrmapbt(&mp->m_sb) to return true?
> >  >
> >  > Enable rmpabt feature when making xfs filesystem  >     `mkfs.xfs
> > -m rmapbt=1 /path/to/device`  > BTW, reflink is enabled by default.
> >
> > Thanks!  I tried
> > mkfs.xfs -d agcount=2,extszinherit=512,su=2m,sw=1 -m reflink=0 -m
> > rmapbt=1 -f /dev/pmem0
> >
> > Again, injected a HW poison to the first page in a dax-file, had the
> > poison consumed and received a SIGBUS. The result is better -
> >
> > ** SIGBUS(7): canjmp=1, whichstep=0, **
> > ** si_addr(0x0x7ff2d8800000), si_lsb(0x15), si_code(0x4,
> > BUS_MCEERR_AR) **
> >
> > The SIGBUS payload looks correct.
> >
> > However, "dmesg" has 2048 lines on sending SIGBUS, one per 512bytes -
> 
> Actually that's one per 2MB, even though the poison is located in pfn 0x1850600
> only.
> 
> >
> > [ 7003.482326] Memory failure: 0x1850600: Sending SIGBUS to
> > fsdax_poison_v1:4109 due to hardware memory corruption [ 7003.507956]
> > Memory failure: 0x1850800: Sending SIGBUS to
> > fsdax_poison_v1:4109 due to hardware memory corruption [ 7003.531681]
> > Memory failure: 0x1850a00: Sending SIGBUS to
> > fsdax_poison_v1:4109 due to hardware memory corruption [ 7003.554190]
> > Memory failure: 0x1850c00: Sending SIGBUS to
> > fsdax_poison_v1:4109 due to hardware memory corruption [ 7003.575831]
> > Memory failure: 0x1850e00: Sending SIGBUS to
> > fsdax_poison_v1:4109 due to hardware memory corruption [ 7003.596796]
> > Memory failure: 0x1851000: Sending SIGBUS to
> > fsdax_poison_v1:4109 due to hardware memory corruption ....
> > [ 7045.738270] Memory failure: 0x194fe00: Sending SIGBUS to
> > fsdax_poison_v1:4109 due to hardware memory corruption [ 7045.758885]
> > Memory failure: 0x1950000: Sending SIGBUS to
> > fsdax_poison_v1:4109 due to hardware memory corruption [ 7045.779495]
> > Memory failure: 0x1950200: Sending SIGBUS to
> > fsdax_poison_v1:4109 due to hardware memory corruption [ 7045.800106]
> > Memory failure: 0x1950400: Sending SIGBUS to
> > fsdax_poison_v1:4109 due to hardware memory corruption
> >
> > That's too much for a single process dealing with a single poison in a
> > PMD page. If nothing else, given an .si_addr_lsb being 0x15, it
> > doesn't make sense to send a SIGBUS per 512B block.
> >
> > Could you determine the user process' mapping size from the
> > filesystem, and take that as a hint to determine how many iterations
> > to call
> > mf_dax_kill_procs() ?
> 
> Sorry, scratch the 512byte stuff... the filesystem has been notified the length of
> the poison blast radius, could it take clue from that?

I think this is caused by a mistake I made in the 6th patch: xfs handler iterates the file range in block size(4k here) even though it is a PMD page. That's why so many message shows when poison on a PMD page.  I'll fix it in next version.


--
Thanks,
Ruan.

> 
> thanks,
> -jane
> 
> >
> > thanks!
> > -jane
> >
> >
> >
Jane Chu Aug. 19, 2021, 8:50 p.m. UTC | #12
On 8/19/2021 2:10 AM, ruansy.fnst@fujitsu.com wrote:
>> From: Jane Chu <jane.chu@oracle.com>
>> Subject: Re: [PATCH RESEND v6 1/9] pagemap: Introduce ->memory_failure()
>>
>> Sorry, correction in line.
>>
>> On 8/19/2021 12:18 AM, Jane Chu wrote:
>>> Hi, Shiyang,
>>>
>>>   >  > > 1) What does it take and cost to make  >  > >
>>> xfs_sb_version_hasrmapbt(&mp->m_sb) to return true?
>>>   >
>>>   > Enable rmpabt feature when making xfs filesystem  >     `mkfs.xfs
>>> -m rmapbt=1 /path/to/device`  > BTW, reflink is enabled by default.
>>>
>>> Thanks!  I tried
>>> mkfs.xfs -d agcount=2,extszinherit=512,su=2m,sw=1 -m reflink=0 -m
>>> rmapbt=1 -f /dev/pmem0
>>>
>>> Again, injected a HW poison to the first page in a dax-file, had the
>>> poison consumed and received a SIGBUS. The result is better -
>>>
>>> ** SIGBUS(7): canjmp=1, whichstep=0, **
>>> ** si_addr(0x0x7ff2d8800000), si_lsb(0x15), si_code(0x4,
>>> BUS_MCEERR_AR) **
>>>
>>> The SIGBUS payload looks correct.
>>>
>>> However, "dmesg" has 2048 lines on sending SIGBUS, one per 512bytes -
>>
>> Actually that's one per 2MB, even though the poison is located in pfn 0x1850600
>> only.
>>
>>>
>>> [ 7003.482326] Memory failure: 0x1850600: Sending SIGBUS to
>>> fsdax_poison_v1:4109 due to hardware memory corruption [ 7003.507956]
>>> Memory failure: 0x1850800: Sending SIGBUS to
>>> fsdax_poison_v1:4109 due to hardware memory corruption [ 7003.531681]
>>> Memory failure: 0x1850a00: Sending SIGBUS to
>>> fsdax_poison_v1:4109 due to hardware memory corruption [ 7003.554190]
>>> Memory failure: 0x1850c00: Sending SIGBUS to
>>> fsdax_poison_v1:4109 due to hardware memory corruption [ 7003.575831]
>>> Memory failure: 0x1850e00: Sending SIGBUS to
>>> fsdax_poison_v1:4109 due to hardware memory corruption [ 7003.596796]
>>> Memory failure: 0x1851000: Sending SIGBUS to
>>> fsdax_poison_v1:4109 due to hardware memory corruption ....
>>> [ 7045.738270] Memory failure: 0x194fe00: Sending SIGBUS to
>>> fsdax_poison_v1:4109 due to hardware memory corruption [ 7045.758885]
>>> Memory failure: 0x1950000: Sending SIGBUS to
>>> fsdax_poison_v1:4109 due to hardware memory corruption [ 7045.779495]
>>> Memory failure: 0x1950200: Sending SIGBUS to
>>> fsdax_poison_v1:4109 due to hardware memory corruption [ 7045.800106]
>>> Memory failure: 0x1950400: Sending SIGBUS to
>>> fsdax_poison_v1:4109 due to hardware memory corruption
>>>
>>> That's too much for a single process dealing with a single poison in a
>>> PMD page. If nothing else, given an .si_addr_lsb being 0x15, it
>>> doesn't make sense to send a SIGBUS per 512B block.
>>>
>>> Could you determine the user process' mapping size from the
>>> filesystem, and take that as a hint to determine how many iterations
>>> to call
>>> mf_dax_kill_procs() ?
>>
>> Sorry, scratch the 512byte stuff... the filesystem has been notified the length of
>> the poison blast radius, could it take clue from that?
> 
> I think this is caused by a mistake I made in the 6th patch: xfs handler iterates the file range in block size(4k here) even though it is a PMD page. That's why so many message shows when poison on a PMD page.  I'll fix it in next version.
> 

Sorry, just to clarify, it looks like XFS has iterated through out the
entire file in 2MiB stride.  The test file size is 4GiB, that explains
'dmesg' showing 2048 line about sending SIGBUS.

thanks,
-jane


> 
> --
> Thanks,
> Ruan.
> 
>>
>> thanks,
>> -jane
>>
>>>
>>> thanks!
>>> -jane
>>>
>>>
>>>
Dan Williams Aug. 20, 2021, 4:07 p.m. UTC | #13
On Fri, Jul 30, 2021 at 3:02 AM Shiyang Ruan <ruansy.fnst@fujitsu.com> wrote:
>
> When memory-failure occurs, we call this function which is implemented
> by each kind of devices.  For the fsdax case, pmem device driver
> implements it.  Pmem device driver will find out the filesystem in which
> the corrupted page located in.  And finally call filesystem handler to
> deal with this error.
>
> The filesystem will try to recover the corrupted data if necessary.

This patch looks good to me, but I would fold it into the patch that
first populates ->memory_failure().
Christoph Hellwig Aug. 23, 2021, 1:21 p.m. UTC | #14
On Wed, Aug 18, 2021 at 10:10:51AM -0700, Dan Williams wrote:
> > Sounds like a nice solution.  I think I can add an is_notify_supported() interface in dax_holder_ops and check it when register dax_holder.
> 
> Shouldn't the fs avoid registering a memory failure handler if it is
> not prepared to take over? For example, shouldn't this case behave
> identically to ext4 that will not even register a callback?

Yes.
diff mbox series

Patch

diff --git a/include/linux/memremap.h b/include/linux/memremap.h
index c0e9d35889e8..ed6980087db5 100644
--- a/include/linux/memremap.h
+++ b/include/linux/memremap.h
@@ -87,6 +87,15 @@  struct dev_pagemap_ops {
 	 * the page back to a CPU accessible page.
 	 */
 	vm_fault_t (*migrate_to_ram)(struct vm_fault *vmf);
+
+	/*
+	 * Handle the memory failure happens on a range of pfns.  Notify the
+	 * processes who are using these pfns, and try to recover the data on
+	 * them if necessary.  The flag is finally passed to the recover
+	 * function through the whole notify routine.
+	 */
+	int (*memory_failure)(struct dev_pagemap *pgmap, unsigned long pfn,
+			      unsigned long nr_pfns, int flags);
 };
 
 #define PGMAP_ALTMAP_VALID	(1 << 0)