
[0/6] dax poison recovery with RWF_RECOVERY_DATA flag

Message ID 20211021001059.438843-1-jane.chu@oracle.com (mailing list archive)
Series dax poison recovery with RWF_RECOVERY_DATA flag

Message

Jane Chu Oct. 21, 2021, 12:10 a.m. UTC
This patch set is a follow-up to the conversation below:
  https://lore.kernel.org/linux-xfs/20210927210750.GH1756565@dread.disaster.area/
where Dave Chinner proposed a single flag RWF_RECOVER_DATA to
be added to the system calls preadv2() and pwritev2() for the
purpose of user directed recovery from data loss due to poison
in a dax range.

The idea is that when a dax range is poisoned, this flag gives
preadv2() permission to fetch as much clean data as possible,
and gives pwritev2() permission to attempt to clear the poison
in the range and then write the good data over it.

This feature may be deployed by a user process' signal handler
in response to SIGBUS with BUS_MCEERR_AR or BUS_ADRERR when
'si_addr' points to a dax range, or simply when the caller is
not sure that a range is poison-free. This approach does not
fragment the dax backend: if the poison cannot be cleared, the
pwritev2() fails so that the situation is explicit to the user.
The flag is not recommended for the normal, non-recovery code
path due to the potential performance impact of poison checking.
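
As a rough sketch of the intended usage (illustration only: the
RWF_ value below is a placeholder for what this series proposes,
and error handling is kept minimal):

#define _GNU_SOURCE
#include <sys/types.h>
#include <sys/uio.h>

/* Placeholder for the flag this series proposes; value is illustrative. */
#ifndef RWF_RECOVERY_DATA
#define RWF_RECOVERY_DATA	0x00000020
#endif

/*
 * Salvage what is readable from a poisoned dax range, then try to
 * clear the poison and store known-good data over it.
 */
static int recover_dax_range(int fd, off_t off, void *salvage, size_t len,
			     const void *good, size_t good_len)
{
	struct iovec riov = { .iov_base = salvage, .iov_len = len };
	struct iovec wiov = { .iov_base = (void *)good, .iov_len = good_len };

	/* fetch as much clean data as possible; a short read is expected */
	(void)preadv2(fd, &riov, 1, off, RWF_RECOVERY_DATA);

	/* clear poison and write the good data; fails if poison remains */
	if (pwritev2(fd, &wiov, 1, off, RWF_RECOVERY_DATA) != (ssize_t)good_len)
		return -1;
	return 0;
}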

Also, this approach is an alternative to the existing
punch-a-hole-followed-by-pwrite approach which does not clear
poison, but instead, allocates a discontiguous clean backend
range to satisfy the pwrite().
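
For comparison, that existing alternative looks roughly like this
(a sketch using only existing syscalls, nothing from this series):

#define _GNU_SOURCE
#include <sys/types.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Punch a hole over the poisoned range (the bad backend blocks are
 * unmapped, not cleared), then rewrite so the filesystem allocates a
 * clean, possibly discontiguous, replacement extent.
 */
static int punch_and_rewrite(int fd, off_t off, const void *good, size_t len)
{
	if (fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE, off, len))
		return -1;
	if (pwrite(fd, good, len, off) != (ssize_t)len)
		return -1;
	return 0;
}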

Jane Chu (6):
  dax: introduce RWF_RECOVERY_DATA flag to preadv2() and pwritev2()
  dax: prepare dax_direct_access() API with DAXDEV_F_RECOVERY flag
  pmem: pmem_dax_direct_access() to honor the DAXDEV_F_RECOVERY flag
  dm,dax,pmem: prepare dax_copy_to/from_iter() APIs with
    DAXDEV_F_RECOVERY
  dax,pmem: Add data recovery feature to pmem_copy_to/from_iter()
  dm: Ensure dm honors DAXDEV_F_RECOVERY flag on dax only

 drivers/dax/super.c             | 19 ++++---
 drivers/md/dm-linear.c          | 12 ++---
 drivers/md/dm-log-writes.c      | 17 ++++---
 drivers/md/dm-stripe.c          | 12 ++---
 drivers/md/dm-target.c          |  2 +-
 drivers/md/dm-writecache.c      |  4 +-
 drivers/md/dm.c                 | 16 +++---
 drivers/nvdimm/pmem.c           | 88 ++++++++++++++++++++++++++++-----
 drivers/nvdimm/pmem.h           |  2 +-
 drivers/s390/block/dcssblk.c    | 13 +++--
 fs/dax.c                        | 24 ++++++---
 fs/fuse/dax.c                   |  2 +-
 fs/fuse/virtio_fs.c             | 12 ++---
 include/linux/dax.h             | 15 +++---
 include/linux/device-mapper.h   |  4 +-
 include/linux/fs.h              |  1 +
 include/linux/iomap.h           |  1 +
 include/uapi/linux/fs.h         |  5 +-
 tools/testing/nvdimm/pmem-dax.c |  2 +-
 19 files changed, 173 insertions(+), 78 deletions(-)

Comments

Christoph Hellwig Oct. 21, 2021, 11:31 a.m. UTC | #1
Looking over the series I have serious doubts that overloading the
slow path clear poison operation over the fast path read/write
path is such a great idea.
Jane Chu Oct. 22, 2021, 1:37 a.m. UTC | #2
On 10/21/2021 4:31 AM, Christoph Hellwig wrote:
> Looking over the series I have serious doubts that overloading the
> slow path clear poison operation over the fast path read/write
> path is such a great idea.
> 

Understood, sounds like a concern on principle. But it seems to me
that the task of recovery overlaps with the normal write operation
on the write part. Without overloading some write operation for
'recovery', I guess we'll need to come up with a new userland
command coupled with a new dax API ->clear_poison and propagate the
new API support to each dm targets that support dax which, again,
is an idea that sounds too bulky if I recall Dan's earlier rejection
correctly.

It is in my plan though, to provide pwritev2(2) and preadv2(2) man pages
with description about the RWF_RECOVERY_DATA flag and specifically not
recommending the flag for normal read/write operation - due to potential
performance impact from searching badblocks in the range.

That said, the badblock searching is turned on only if the pmem device
contains badblocks (i.e. bb->count > 0); otherwise, the performance
impact is next to none. And once the badblock search starts,
it is a binary search over the user-provided range. The unwanted impact
occurs when the pmem device contains badblocks
that do not fall in the user-specified range, and in that case, the
search would end in O(1).
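
For reference, the check is essentially the following (a simplified
sketch modeled on the existing is_bad_pmem()-style helper in
drivers/nvdimm, not the exact code in this series):

#include <linux/badblocks.h>
#include <linux/blkdev.h>

/*
 * Skipped entirely on a clean device, otherwise a binary search of
 * the badblocks list against the requested range.  'len' is in bytes.
 */
static bool range_has_poison(struct badblocks *bb, sector_t sector,
			     unsigned int len)
{
	sector_t first_bad;
	int num_bad;

	if (!bb->count)		/* no recorded badblocks: no search at all */
		return false;

	return badblocks_check(bb, sector, len >> SECTOR_SHIFT,
			       &first_bad, &num_bad) != 0;
}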

Thanks!
-jane
Darrick J. Wong Oct. 22, 2021, 1:58 a.m. UTC | #3
On Fri, Oct 22, 2021 at 01:37:28AM +0000, Jane Chu wrote:
> On 10/21/2021 4:31 AM, Christoph Hellwig wrote:
> > Looking over the series I have serious doubts that overloading the
> > slow path clear poison operation over the fast path read/write
> > path is such a great idea.

Why would data recovery after a media error ever be considered a
fast/hot path?  A normal read/write to a fsdax file would not pass the
flag, which skips the poison checking with whatever MCE consequences
that has, right?

pwritev2(..., RWF_RECOVER_DATA) should be infrequent enough that
carefully stepping around dax_direct_access only has to be faster than
restoring from backup, I hope?

> Understood, sounds like a concern on principle. But it seems to me
> that the task of recovery overlaps with the normal write operation
> on the write part. Without overloading some write operation for
> 'recovery', I guess we'll need to come up with a new userland
> command coupled with a new dax API ->clear_poison and propagate the
> new API support to each dm targets that support dax which, again,
> is an idea that sounds too bulky if I recall Dan's earlier rejection
> correctly.
> 
> It is in my plan though, to provide pwritev2(2) and preadv2(2) man pages
> with description about the RWF_RECOVERY_DATA flag and specifically not
> recommending the flag for normal read/write operation - due to potential
> performance impact from searching badblocks in the range.

Yes, this will help much. :)

> That said, the badblock searching is turned on only if the pmem device
> contains badblocks (i.e. bb->count > 0); otherwise, the performance
> impact is next to none. And once the badblock search starts,
> it is a binary search over the user-provided range. The unwanted impact
> occurs when the pmem device contains badblocks
> that do not fall in the user-specified range, and in that case, the
> search would end in O(1).

(I wonder about improving badblocks to be less sector-oriented and not
have that weird 16-records-max limit, but that can be a later
optimization.)

--D

> Thanks!
> -jane
> 
>
Christoph Hellwig Oct. 22, 2021, 5:36 a.m. UTC | #4
On Fri, Oct 22, 2021 at 01:37:28AM +0000, Jane Chu wrote:
> On 10/21/2021 4:31 AM, Christoph Hellwig wrote:
> > Looking over the series I have serious doubts that overloading the
> > slow path clear poison operation over the fast path read/write
> > path is such a great idea.
> > 
> 
> Understood, sounds like a concern on principle. But it seems to me
> that the task of recovery overlaps with the normal write operation
> on the write part. Without overloading some write operation for
> 'recovery', I guess we'll need to come up with a new userland
> command coupled with a new dax API ->clear_poison and propagate the
> new API support to each dm targets that support dax which, again,
> is an idea that sounds too bulky if I recall Dan's earlier rejection
> correctly.

When I wrote the above I mostly thought about the in-kernel API, that
is use a separate method.  But reading your mail and thinking about
this a bit more I'm actually less and less sure that overloading
pwritev2 and preadv2 with this at the syscall level makes sense either.
read/write are our I/O fast path.  We really should not overload the
core of the VFS with error recovery for a broken hardware interface.
Christoph Hellwig Oct. 22, 2021, 5:38 a.m. UTC | #5
On Thu, Oct 21, 2021 at 06:58:17PM -0700, Darrick J. Wong wrote:
> On Fri, Oct 22, 2021 at 01:37:28AM +0000, Jane Chu wrote:
> > On 10/21/2021 4:31 AM, Christoph Hellwig wrote:
> > > Looking over the series I have serious doubts that overloading the
> > > slow path clear poison operation over the fast path read/write
> > > path is such a great idea.
> 
> Why would data recovery after a media error ever be considered a
> fast/hot path?

Not sure what you're replying to (the text is from me, the mail you are
replying to is from Jane), but my point is that the read/write
hot path should not be overloaded with data recovery.

> A normal read/write to a fsdax file would not pass the
> flag, which skips the poison checking with whatever MCE consequences
> that has, right?

Exactly!

> pwritev2(..., RWF_RECOVER_DATA) should be infrequent enough that
> carefully stepping around dax_direct_access only has to be faster than
> restoring from backup, I hope?

Yes.  And thus be kept as separate as possible.
Jane Chu Oct. 22, 2021, 8:52 p.m. UTC | #6
On 10/21/2021 10:36 PM, Christoph Hellwig wrote:
> On Fri, Oct 22, 2021 at 01:37:28AM +0000, Jane Chu wrote:
>> On 10/21/2021 4:31 AM, Christoph Hellwig wrote:
>>> Looking over the series I have serious doubts that overloading the
>>> slow path clear poison operation over the fast path read/write
>>> path is such a great idea.
>>>
>>
>> Understood, sounds like a concern on principle. But it seems to me
>> that the task of recovery overlaps with the normal write operation
>> on the write part. Without overloading some write operation for
>> 'recovery', I guess we'll need to come up with a new userland
>> command coupled with a new dax API ->clear_poison and propagate the
>> new API support to each dm targets that support dax which, again,
>> is an idea that sounds too bulky if I recall Dan's earlier rejection
>> correctly.
> 
> When I wrote the above I mostly thought about the in-kernel API, that
> is use a separate method.  But reading your mail and thinking about
> this a bit more I'm actually less and less sure that overloading
> pwritev2 and preadv2 with this at the syscall level makes sense either.
> read/write are our I/O fast path.  We really should not overload the
> core of the VFS with error recovery for a broken hardware interface.
> 

Thanks - I try to be honest.  As far as I can tell, the argument
about the flag is a philosophical argument between two views.
One view assumes design based on perfect hardware, and media error
belongs to the category of brokenness. Another view sees media
error as a build-in hardware component and make design to include
dealing with such errors.

Back when I was fresh out of school, a senior engineer explained
to me that media errors can be caused by cosmic rays hitting the
media at any frequency and at any time.  The argument is that media
errors within a certain range are a fact of the product, and to me,
it argues for building normal software components with
errors in mind from start.  I guess I'm trying to articulate why
it is acceptable to include the RWF_DATA_RECOVERY flag to the
existing RWF_ flags. - this way, pwritev2 remain fast on fast path,
and its slow path (w/ error clearing) is faster than other alternative.
Other alternative being 1 system call to clear the poison, and
another system call to run the fast pwrite for recovery, what
happens if something happened in between?

thanks!
-jane
Christoph Hellwig Oct. 27, 2021, 6:49 a.m. UTC | #7
On Fri, Oct 22, 2021 at 08:52:55PM +0000, Jane Chu wrote:
> Thanks - I try to be honest.  As far as I can tell, the argument
> about the flag is a philosophical argument between two views.
> One view assumes design based on perfect hardware, and media error
> belongs to the category of brokenness. Another view sees media
> error as a build-in hardware component and make design to include
> dealing with such errors.

No, I don't think so.  Bit errors do happen in all media, which is
why devices are built to handle them.  It is just the Intel-style
pmem interface to handle them which is completely broken.  

> errors in mind from start.  I guess I'm trying to articulate why
> it is acceptable to include the RWF_DATA_RECOVERY flag to the
> existing RWF_ flags. - this way, pwritev2 remain fast on fast path,
> and its slow path (w/ error clearing) is faster than other alternative.
> Other alternative being 1 system call to clear the poison, and
> another system call to run the fast pwrite for recovery, what
> happens if something happened in between?

Well, my point is doing recovery from bit errors is by definition not
the fast path.  Which is why I'd rather keep it away from the pmem
read/write fast path, which also happens to be the (much more important)
non-pmem read/write path.
Darrick J. Wong Oct. 28, 2021, 12:24 a.m. UTC | #8
On Tue, Oct 26, 2021 at 11:49:59PM -0700, Christoph Hellwig wrote:
> On Fri, Oct 22, 2021 at 08:52:55PM +0000, Jane Chu wrote:
> > Thanks - I try to be honest.  As far as I can tell, the argument
> > about the flag is a philosophical argument between two views.
> > One view assumes design based on perfect hardware, and media error
> > belongs to the category of brokenness. Another view sees media
> > error as a build-in hardware component and make design to include
> > dealing with such errors.
> 
> No, I don't think so.  Bit errors do happen in all media, which is
> why devices are built to handle them.  It is just the Intel-style
> pmem interface to handle them which is completely broken.  

Yeah, I agree, this takes me back to learning how to use DISKEDIT to
work around a hole punched in a file (with a pen!) in the 1980s...

...so would you happen to know if anyone's working on solving this
problem for us by putting the memory controller in charge of dealing
with media errors?

> > errors in mind from start.  I guess I'm trying to articulate why
> > it is acceptable to include the RWF_DATA_RECOVERY flag to the
> > existing RWF_ flags. - this way, pwritev2 remain fast on fast path,
> > and its slow path (w/ error clearing) is faster than other alternative.
> > Other alternative being 1 system call to clear the poison, and
> > another system call to run the fast pwrite for recovery, what
> > happens if something happened in between?
> 
> Well, my point is doing recovery from bit errors is by definition not
> the fast path.  Which is why I'd rather keep it away from the pmem
> read/write fast path, which also happens to be the (much more important)
> non-pmem read/write path.

The trouble is, we really /do/ want to be able to (re)write the failed
area, and we probably want to try to read whatever we can.  Those are
reads and writes, not {pre,f}allocation activities.  This is where Dave
and I arrived at a month ago.

Unless you'd be ok with a second IO path for recovery where we're
allowed to be slow?  That would probably have the same user interface
flag, just a different path into the pmem driver.

Ha, how about a int fd2 = recoveryfd(fd); call where you'd get whatever
speshul options (retry raid mirrors!  scrape the film off the disk if
you have to!) you want that can take forever, leaving the fast paths
alone?

(Ok, that wasn't entirely serious...)

--D
Dave Chinner Oct. 28, 2021, 10:59 p.m. UTC | #9
On Wed, Oct 27, 2021 at 05:24:51PM -0700, Darrick J. Wong wrote:
> On Tue, Oct 26, 2021 at 11:49:59PM -0700, Christoph Hellwig wrote:
> > On Fri, Oct 22, 2021 at 08:52:55PM +0000, Jane Chu wrote:
> > > Thanks - I try to be honest.  As far as I can tell, the argument
> > > about the flag is a philosophical argument between two views.
> > > One view assumes design based on perfect hardware, and media error
> > > belongs to the category of brokenness. Another view sees media
> > > error as a build-in hardware component and make design to include
> > > dealing with such errors.
> > 
> > No, I don't think so.  Bit errors do happen in all media, which is
> > why devices are built to handle them.  It is just the Intel-style
> > pmem interface to handle them which is completely broken.  
> 
> Yeah, I agree, this takes me back to learning how to use DISKEDIT to
> work around a hole punched in a file (with a pen!) in the 1980s...
> 
> ...so would you happen to know if anyone's working on solving this
> problem for us by putting the memory controller in charge of dealing
> with media errors?
> 
> > > errors in mind from start.  I guess I'm trying to articulate why
> > > it is acceptable to include the RWF_DATA_RECOVERY flag to the
> > > existing RWF_ flags. - this way, pwritev2 remain fast on fast path,
> > > and its slow path (w/ error clearing) is faster than other alternative.
> > > Other alternative being 1 system call to clear the poison, and
> > > another system call to run the fast pwrite for recovery, what
> > > happens if something happened in between?
> > 
> > Well, my point is doing recovery from bit errors is by definition not
> > the fast path.  Which is why I'd rather keep it away from the pmem
> > read/write fast path, which also happens to be the (much more important)
> > non-pmem read/write path.
> 
> The trouble is, we really /do/ want to be able to (re)write the failed
> area, and we probably want to try to read whatever we can.  Those are
> reads and writes, not {pre,f}allocation activities.  This is where Dave
> and I arrived at a month ago.
> 
> Unless you'd be ok with a second IO path for recovery where we're
> allowed to be slow?  That would probably have the same user interface
> flag, just a different path into the pmem driver.

I just don't see how 4 single line branches to propagate RWF_RECOVERY
down to the hardware is in any way an imposition on the fast path.
It's no different for passing RWF_HIPRI down to the hardware *in the
fast path* so that the IO runs the hardware in polling mode because
it's faster for some hardware.

IOWs, saying that we shouldn't implement RWF_RECOVERY because it
adds a handful of branches to the fast path is like saying that we
shouldn't implement RWF_HIPRI because it slows down the fast path
for non-polled IO....

Just factor the actual recovery operations out into a separate
function like:

static void noinline
pmem_media_recovery(...)
{
}

pmem_copy_from_iter()
{
	if (unlikely(flag & RECOVERY))
		pmem_media_recovery(...);
	return _copy_from_iter_flushcache(addr, bytes, i);
}
....

And there's basically zero overhead in the fast paths for normal
data IO operations, whilst supporting simple, easy to use data
recovery IO operations for regions that have bad media...

> Ha, how about a int fd2 = recoveryfd(fd); call where you'd get whatever
> speshul options (retry raid mirrors!  scrape the film off the disk if
> you have to!) you want that can take forever, leaving the fast paths
> alone?

Why wouldn't we just pass RWF_RECOVERY down to a REQ_RECOVERY bio
flag and have raid devices use that to trigger scraping whatever
they can if there are errors? The io path through the VFS and
filesystem to get the scraped data out to the user *is exactly the
same*, so we're going to have to plumb this functionality into fast
paths *somewhere along the line*.
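
i.e. the plumbing in question is just one branch per layer, something
like the sketch below (IOCB_RECOVERY and REQ_RECOVERY are hypothetical
placeholder names and bit values, purely for illustration; only the
RWF_ flag is what is actually being proposed):

#include <linux/bio.h>
#include <linux/fs.h>

/* hypothetical placeholder bits, for illustration only */
#define RWF_RECOVERY	((__force rwf_t)0x00000020)
#define IOCB_RECOVERY	(1 << 30)
#define REQ_RECOVERY	(1U << 31)

/* syscall flag -> kiocb flag: one branch */
static inline void kiocb_set_recovery(struct kiocb *ki, rwf_t flags)
{
	if (flags & RWF_RECOVERY)
		ki->ki_flags |= IOCB_RECOVERY;
}

/* kiocb flag -> bio flag: one branch; a raid target can key off it */
static inline void bio_set_recovery(struct bio *bio, int ki_flags)
{
	if (ki_flags & IOCB_RECOVERY)
		bio->bi_opf |= REQ_RECOVERY;
}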

Really, I think this whole "flag propagation is too much overhead
for the fast path" argument is completely invalid - if 4 conditional
branches is too much overhead to add to the fast path, then we can't
add *anything ever again* to the IO path because it has too much
overhead and impact on the fast path.

Cheers,

Dave.
Pavel Begunkov Oct. 29, 2021, 11:46 a.m. UTC | #10
On 10/28/21 23:59, Dave Chinner wrote:
[...]
>>> Well, my point is doing recovery from bit errors is by definition not
>>> the fast path.  Which is why I'd rather keep it away from the pmem
>>> read/write fast path, which also happens to be the (much more important)
>>> non-pmem read/write path.
>>
>> The trouble is, we really /do/ want to be able to (re)write the failed
>> area, and we probably want to try to read whatever we can.  Those are
>> reads and writes, not {pre,f}allocation activities.  This is where Dave
>> and I arrived at a month ago.
>>
>> Unless you'd be ok with a second IO path for recovery where we're
>> allowed to be slow?  That would probably have the same user interface
>> flag, just a different path into the pmem driver.
> 
> I just don't see how 4 single line branches to propagate RWF_RECOVERY
> down to the hardware is in any way an imposition on the fast path.
> It's no different for passing RWF_HIPRI down to the hardware *in the
> fast path* so that the IO runs the hardware in polling mode because
> it's faster for some hardware.

Not particularly about this flag, but it is expensive. Surely looks
cheap when it's just one feature, but there are dozens of them with
limited applicability, default config kernels are already sluggish
when it comes to really fast devices and it's not getting better.
Also, pretty often every of them will add a bunch of extra checks
to fix something of whatever it would be.

So let's add a bit of pragmatism to the picture, if there is just one
user of a feature but it adds overhead for millions of machines that
won't ever use it, it's expensive.

This one doesn't spill yet into paths I care about, but in general
it'd be great if we start thinking more about such stuff instead of
throwing yet another if into the path, e.g. by shifting the overhead
from linear to a constant for cases that don't use it, for instance
with callbacks or bit masks.

> IOWs, saying that we shouldn't implement RWF_RECOVERY because it
> adds a handful of branches to the fast path is like saying that we
> shouldn't implement RWF_HIPRI because it slows down the fast path
> for non-polled IO....
> 
> Just factor the actual recovery operations out into a separate
> function like:
Darrick J. Wong Oct. 29, 2021, 4:57 p.m. UTC | #11
On Fri, Oct 29, 2021 at 12:46:14PM +0100, Pavel Begunkov wrote:
> On 10/28/21 23:59, Dave Chinner wrote:
> [...]
> > > > Well, my point is doing recovery from bit errors is by definition not
> > > > the fast path.  Which is why I'd rather keep it away from the pmem
> > > > read/write fast path, which also happens to be the (much more important)
> > > > non-pmem read/write path.
> > > 
> > > The trouble is, we really /do/ want to be able to (re)write the failed
> > > area, and we probably want to try to read whatever we can.  Those are
> > > reads and writes, not {pre,f}allocation activities.  This is where Dave
> > > and I arrived at a month ago.
> > > 
> > > Unless you'd be ok with a second IO path for recovery where we're
> > > allowed to be slow?  That would probably have the same user interface
> > > flag, just a different path into the pmem driver.
> > 
> > I just don't see how 4 single line branches to propagate RWF_RECOVERY
> > down to the hardware is in any way an imposition on the fast path.
> > It's no different for passing RWF_HIPRI down to the hardware *in the
> > fast path* so that the IO runs the hardware in polling mode because
> > it's faster for some hardware.
> 
> Not particularly about this flag, but it is expensive. Surely looks
> cheap when it's just one feature, but there are dozens of them with
> limited applicability, default config kernels are already sluggish
> when it comes to really fast devices and it's not getting better.
> Also, pretty often every of them will add a bunch of extra checks
> to fix something of whatever it would be.

So we can't have data recovery because moving fast is the only goal?

That's so meta.

--D

> So let's add a bit of pragmatism to the picture, if there is just one
> user of a feature but it adds overhead for millions of machines that
> won't ever use it, it's expensive.
> 
> This one doesn't spill yet into paths I care about, but in general
> it'd be great if we start thinking more about such stuff instead of
> throwing yet another if into the path, e.g. by shifting the overhead
> from linear to a constant for cases that don't use it, for instance
> with callbacks or bit masks.
> 
> > IOWs, saying that we shouldn't implement RWF_RECOVERY because it
> > adds a handful of branches 	 the fast path is like saying that we
> > shouldn't implement RWF_HIPRI because it slows down the fast path
> > for non-polled IO....
> > 
> > Just factor the actual recovery operations out into a separate
> > function like:
> 
> -- 
> Pavel Begunkov
Jane Chu Oct. 29, 2021, 6:53 p.m. UTC | #12
On 10/29/2021 4:46 AM, Pavel Begunkov wrote:
> On 10/28/21 23:59, Dave Chinner wrote:
> [...]
>>>> Well, my point is doing recovery from bit errors is by definition not
>>>> the fast path.  Which is why I'd rather keep it away from the pmem
>>>> read/write fast path, which also happens to be the (much more 
>>>> important)
>>>> non-pmem read/write path.
>>>
>>> The trouble is, we really /do/ want to be able to (re)write the failed
>>> area, and we probably want to try to read whatever we can.  Those are
>>> reads and writes, not {pre,f}allocation activities.  This is where Dave
>>> and I arrived at a month ago.
>>>
>>> Unless you'd be ok with a second IO path for recovery where we're
>>> allowed to be slow?  That would probably have the same user interface
>>> flag, just a different path into the pmem driver.
>>
>> I just don't see how 4 single line branches to propagate RWF_RECOVERY
>> down to the hardware is in any way an imposition on the fast path.
>> It's no different for passing RWF_HIPRI down to the hardware *in the
>> fast path* so that the IO runs the hardware in polling mode because
>> it's faster for some hardware.
> 
> Not particularly about this flag, but it is expensive. Surely looks
> cheap when it's just one feature, but there are dozens of them with
> limited applicability, default config kernels are already sluggish
> when it comes to really fast devices and it's not getting better.
> Also, pretty often every of them will add a bunch of extra checks
> to fix something of whatever it would be.
> 
> So let's add a bit of pragmatism to the picture, if there is just one
> user of a feature but it adds overhead for millions of machines that
> won't ever use it, it's expensive.
> 
> This one doesn't spill yet into paths I care about, but in general
> it'd be great if we start thinking more about such stuff instead of
> throwing yet another if into the path, e.g. by shifting the overhead
> from linear to a constant for cases that don't use it, for instance
> with callbacks or bit masks.

May I ask what solution you would propose for pmem recovery that satisfies
the requirement of binding poison-clearing and write in one operation?

thanks!
-jane


> 
>> IOWs, saying that we shouldn't implement RWF_RECOVERY because it
>> adds a handful of branches to the fast path is like saying that we
>> shouldn't implement RWF_HIPRI because it slows down the fast path
>> for non-polled IO....
>>
>> Just factor the actual recovery operations out into a separate
>> function like:
>
Pavel Begunkov Oct. 29, 2021, 7:23 p.m. UTC | #13
On 10/29/21 17:57, Darrick J. Wong wrote:
> On Fri, Oct 29, 2021 at 12:46:14PM +0100, Pavel Begunkov wrote:
>> On 10/28/21 23:59, Dave Chinner wrote:
>> [...]
>>>>> Well, my point is doing recovery from bit errors is by definition not
>>>>> the fast path.  Which is why I'd rather keep it away from the pmem
>>>>> read/write fast path, which also happens to be the (much more important)
>>>>> non-pmem read/write path.
>>>>
>>>> The trouble is, we really /do/ want to be able to (re)write the failed
>>>> area, and we probably want to try to read whatever we can.  Those are
>>>> reads and writes, not {pre,f}allocation activities.  This is where Dave
>>>> and I arrived at a month ago.
>>>>
>>>> Unless you'd be ok with a second IO path for recovery where we're
>>>> allowed to be slow?  That would probably have the same user interface
>>>> flag, just a different path into the pmem driver.
>>>
>>> I just don't see how 4 single line branches to propagate RWF_RECOVERY
>>> down to the hardware is in any way an imposition on the fast path.
>>> It's no different for passing RWF_HIPRI down to the hardware *in the
>>> fast path* so that the IO runs the hardware in polling mode because
>>> it's faster for some hardware.
>>
>> Not particularly about this flag, but it is expensive. Surely looks
>> cheap when it's just one feature, but there are dozens of them with
>> limited applicability, default config kernels are already sluggish
>> when it comes to really fast devices and it's not getting better.
>> Also, pretty often every of them will add a bunch of extra checks
>> to fix something of whatever it would be.
> 
> So we can't have data recovery because moving fast is the only goal?

That's not what was said and you missed the point, which was in
the rest of the message.

> 
> That's so meta.
> 
> --D
> 
>> So let's add a bit of pragmatism to the picture, if there is just one
>> user of a feature but it adds overhead for millions of machines that
>> won't ever use it, it's expensive.
>>
>> This one doesn't spill yet into paths I care about, but in general
>> it'd be great if we start thinking more about such stuff instead of
>> throwing yet another if into the path, e.g. by shifting the overhead
>> from linear to a constant for cases that don't use it, for instance
>> with callbacks or bit masks.
>>
>>> IOWs, saying that we shouldn't implement RWF_RECOVERY because it
> > > adds a handful of branches to the fast path is like saying that we
>>> shouldn't implement RWF_HIPRI because it slows down the fast path
>>> for non-polled IO....
>>>
>>> Just factor the actual recovery operations out into a separate
>>> function like:
>>
>> -- 
>> Pavel Begunkov
Darrick J. Wong Oct. 29, 2021, 8:08 p.m. UTC | #14
On Fri, Oct 29, 2021 at 08:23:53PM +0100, Pavel Begunkov wrote:
> On 10/29/21 17:57, Darrick J. Wong wrote:
> > On Fri, Oct 29, 2021 at 12:46:14PM +0100, Pavel Begunkov wrote:
> > > On 10/28/21 23:59, Dave Chinner wrote:
> > > [...]
> > > > > > Well, my point is doing recovery from bit errors is by definition not
> > > > > > the fast path.  Which is why I'd rather keep it away from the pmem
> > > > > > read/write fast path, which also happens to be the (much more important)
> > > > > > non-pmem read/write path.
> > > > > 
> > > > > The trouble is, we really /do/ want to be able to (re)write the failed
> > > > > area, and we probably want to try to read whatever we can.  Those are
> > > > > reads and writes, not {pre,f}allocation activities.  This is where Dave
> > > > > and I arrived at a month ago.
> > > > > 
> > > > > Unless you'd be ok with a second IO path for recovery where we're
> > > > > allowed to be slow?  That would probably have the same user interface
> > > > > flag, just a different path into the pmem driver.
> > > > 
> > > > I just don't see how 4 single line branches to propagate RWF_RECOVERY
> > > > down to the hardware is in any way an imposition on the fast path.
> > > > It's no different for passing RWF_HIPRI down to the hardware *in the
> > > > fast path* so that the IO runs the hardware in polling mode because
> > > > it's faster for some hardware.
> > > 
> > > Not particularly about this flag, but it is expensive. Surely looks
> > > cheap when it's just one feature, but there are dozens of them with
> > > limited applicability, default config kernels are already sluggish
> > > when it comes to really fast devices and it's not getting better.
> > > Also, pretty often every of them will add a bunch of extra checks
> > > to fix something of whatever it would be.
> > 
> > So we can't have data recovery because moving fast is the only goal?
> 
> That's not what was said and you missed the point, which was in
> the rest of the message.

...whatever point you were trying to make was so vague that it was
totally uninformative and I completely missed it.

What does "callbacks or bit masks" mean, then, specifically?  How
*exactly* would you solve the problem that Jane is seeking to solve by
using callbacks?

Actually, you know what?  I'm so fed up with every single DAX
conversation turning into a ****storm of people saying NO NO NO NO NO NO
NO NO to everything proposed that I'm actually going to respond to
whatever I think your point is, and you can defend whatever I come up
with.

> > 
> > That's so meta.
> > 
> > --D
> > 
> > > So let's add a bit of pragmatism to the picture, if there is just one
> > > user of a feature but it adds overhead for millions of machines that
> > > won't ever use it, it's expensive.

Errors are infrequent, and since everything is cloud-based and disposable
now, we can replace error handling with BUG_ON().  This will reduce code
complexity, which will reduce code size, and improve icache usage.  Win!

> > > This one doesn't spill yet into paths I care about,

...so you sail in and say 'no' even though you don't yet care...

> > > but in general
> > > it'd be great if we start thinking more about such stuff instead of
> > > throwing yet another if into the path, e.g. by shifting the overhead
> > > from linear to a constant for cases that don't use it, for instance
> > > with callbacks

Ok so after userspace calls into pread to access a DAX file, hits the
poisoned memory line and the machinecheck fires, what then?  I guess we
just have to figure out how to get from the MCA handler (assuming the
machine doesn't just reboot instantly) all the way back into memcpy?
Ok, you're in charge of figuring that out because I don't know how to do
that.

Notably, RWF_DATA_RECOVERY is the flag that we're calling *from* a
callback that happens after memory controller realizes it's lost
something, kicks a notification to the OS kernel through ACPI, and the
kernel signal userspace to do something about it.  Yeah, that's dumb
since spinning rust already does all this for us, but that's pmem.

> > > or bit masks.

WTF does this even mean?

--D

> > > 
> > > > IOWs, saying that we shouldn't implement RWF_RECOVERY because it
> > > > adds a handful of branches to the fast path is like saying that we
> > > > shouldn't implement RWF_HIPRI because it slows down the fast path
> > > > for non-polled IO....
> > > > 
> > > > Just factor the actual recovery operations out into a separate
> > > > function like:
> > > 
> > > -- 
> > > Pavel Begunkov
> 
> -- 
> Pavel Begunkov
Dave Chinner Oct. 29, 2021, 10:32 p.m. UTC | #15
On Fri, Oct 29, 2021 at 12:46:14PM +0100, Pavel Begunkov wrote:
> On 10/28/21 23:59, Dave Chinner wrote:
> [...]
> > > > Well, my point is doing recovery from bit errors is by definition not
> > > > the fast path.  Which is why I'd rather keep it away from the pmem
> > > > read/write fast path, which also happens to be the (much more important)
> > > > non-pmem read/write path.
> > > 
> > > The trouble is, we really /do/ want to be able to (re)write the failed
> > > area, and we probably want to try to read whatever we can.  Those are
> > > reads and writes, not {pre,f}allocation activities.  This is where Dave
> > > and I arrived at a month ago.
> > > 
> > > Unless you'd be ok with a second IO path for recovery where we're
> > > allowed to be slow?  That would probably have the same user interface
> > > flag, just a different path into the pmem driver.
> > 
> > I just don't see how 4 single line branches to propagate RWF_RECOVERY
> > down to the hardware is in any way an imposition on the fast path.
> > It's no different for passing RWF_HIPRI down to the hardware *in the
> > fast path* so that the IO runs the hardware in polling mode because
> > it's faster for some hardware.
> 
> Not particularly about this flag, but it is expensive. Surely looks
> cheap when it's just one feature, but there are dozens of them with
> limited applicability, default config kernels are already sluggish
> when it comes to really fast devices and it's not getting better.
> Also, pretty often every of them will add a bunch of extra checks
> to fix something of whatever it would be.
> 
> So let's add a bit of pragmatism to the picture, if there is just one
> user of a feature but it adds overhead for millions of machines that
> won't ever use it, it's expensive.

Yup, you just described RWF_HIPRI! Seriously, Pavel, did you read
past this?  I'll quote what I said again, because I've already
addressed this argument to point out how silly it is:

> > IOWs, saying that we shouldn't implement RWF_RECOVERY because it
> > adds a handful of branches to the fast path is like saying that we
> > shouldn't implement RWF_HIPRI because it slows down the fast path
> > for non-polled IO....

 RWF_HIPRI functionality represents a *tiny* niche in the wider
Linux ecosystem, so by your reasoning it is too expensive to
implement because millions (billions!) of machines don't need or use
it. Do you now see how silly your argument is?

Seriously, this "optimise the IO fast path at the cost of everything
else" craziness has gotten out of hand. Nobody in the filesystem or
application world cares if you can do 10M IOPS per core when all the
CPU is doing is sitting in a tight loop inside the kernel repeatedly
overwriting data in the same memory buffers, essentially tossing the
old data away without ever accessing it or doing anything with
it.  Such speed racer games are *completely meaningless* as an
optimisation goal - it's what we've called "benchmarketing" for a
couple of decades now.

If all we focus on is bragging rights because "bigger number is
always better", then we'll end up with iand IO path that looks like
the awful mess that the fs/direct-io.c turned into. That ended up
being hyper-optimised for CPU performance right down to single
instructions and cacheline load orders, such that the code became
extremely fragile and completely unmaintainable.

We ended up *reimplementing the direct IO code from scratch* so that
XFS could build and submit direct IO smarter and faster because it
simply couldn't be done to the old code.  That's how iomap came
about, and without *any optimisation at all* iomap was 20-30% faster
than the old, hyper-optimised fs/direct-io.c code.  IOWs, we always
knew we could do direct IO faster than fs/direct-io.c, but we
couldn't make the fs/direct-io.c faster because the
hyper-optimisation of the code paths made it impossible to modify
and maintain.

The current approach of hyper-optimising the IO path for maximum
per-core IOPS at the expense of everything else has been proven in
the past to be exactly the wrong approach to be taking for IO path
development. Yes, we need to be concerned about performance and work
to improve it, but we should not be doing that at the cost of
everything else that the IO stack needs to be able to do.

Fundamentally, optimisation is something we do *after* we provide
the functionality that is required; using "fast path optimisation"
as a blunt force implement to prevent new features from being
implemented is just ...  obnoxious.

> This one doesn't spill yet into paths I care about, but in general
> it'd be great if we start thinking more about such stuff instead of
> throwing yet another if into the path, e.g. by shifting the overhead
> from linear to a constant for cases that don't use it, for instance
> with callbacks or bit masks.

This is orthogonal to providing data recovery functionality.
If the claims that flag propagation is too expensive are true, then
fixing this problem will also improve RWF_HIPRI performance
regardless of whether RWF_DATA_RECOVERY exists or not...

IOWs, *if* there is a fast path performance degradation as a result
of flag propagation, then *go measure it* and show us how much
impact it has on _real world applications_.  *Show us the numbers*
and document how much each additional flag propagation actually
costs so we can talk about whether it is acceptable, mitigation
strategies and/or alternative implementations.  Flag propagation
overhead is just not a valid reason for preventing us adding new
flags to the IO path. Fix the flag propagation overhead if it's a
problem for you, don't use it as an excuse for preventing people
from adding new functionality that uses flag propagation...

Cheers,

Dave.
Pavel Begunkov Oct. 31, 2021, 1:19 p.m. UTC | #16
On 10/29/21 23:32, Dave Chinner wrote:
> On Fri, Oct 29, 2021 at 12:46:14PM +0100, Pavel Begunkov wrote:
>> On 10/28/21 23:59, Dave Chinner wrote:
>> [...]
>>>>> Well, my point is doing recovery from bit errors is by definition not
>>>>> the fast path.  Which is why I'd rather keep it away from the pmem
>>>>> read/write fast path, which also happens to be the (much more important)
>>>>> non-pmem read/write path.
>>>>
>>>> The trouble is, we really /do/ want to be able to (re)write the failed
>>>> area, and we probably want to try to read whatever we can.  Those are
>>>> reads and writes, not {pre,f}allocation activities.  This is where Dave
>>>> and I arrived at a month ago.
>>>>
>>>> Unless you'd be ok with a second IO path for recovery where we're
>>>> allowed to be slow?  That would probably have the same user interface
>>>> flag, just a different path into the pmem driver.
>>>
>>> I just don't see how 4 single line branches to propagate RWF_RECOVERY
>>> down to the hardware is in any way an imposition on the fast path.
>>> It's no different for passing RWF_HIPRI down to the hardware *in the
>>> fast path* so that the IO runs the hardware in polling mode because
>>> it's faster for some hardware.
>>
>> Not particularly about this flag, but it is expensive. Surely looks
>> cheap when it's just one feature, but there are dozens of them with
>> limited applicability, default config kernels are already sluggish
>> when it comes to really fast devices and it's not getting better.
>> Also, pretty often every of them will add a bunch of extra checks
>> to fix something of whatever it would be.
>>
>> So let's add a bit of pragmatism to the picture, if there is just one
>> user of a feature but it adds overhead for millions of machines that
>> won't ever use it, it's expensive.
> 
> Yup, you just described RWF_HIPRI! Seriously, Pavel, did you read
> past this?  I'll quote what I said again, because I've already
> addressed this argument to point out how silly it is:

And you almost got to the initial point in your penultimate paragraph.
A single if for a single flag is not an issue; the problem is when
there are dozens of them and the overhead for it is not isolated, so
the kernel has to jump through dozens of those.

And just to be clear, I'll outline it again: that's a general problem.
I have no stake in the layers touched here, and it's up to the relevant
people, obviously. I would have expected the flag to be rejected in
other places too, but haven't found that; then again, I may have missed
something.


>>> IOWs, saying that we shouldn't implement RWF_RECOVERY because it
>>> adds a handful of branches to the fast path is like saying that we
>>> shouldn't implement RWF_HIPRI because it slows down the fast path
>>> for non-polled IO....
> 
>   RWF_HIPRI functionality represents a *tiny* niche in the wider
> Linux ecosystem, so by your reasoning it is too expensive to
> implement because millions (billions!) of machines don't need or use
> it. Do you now see how silly your argument is?
> 
> Seriously, this "optimise the IO fast path at the cost of everything
> else" craziness has gotten out of hand. Nobody in the filesystem or
> application world cares if you can do 10M IOPS per core when all the
> CPU is doing is sitting in a tight loop inside the kernel repeatedly
> overwriting data in the same memory buffers, essentially tossing the
> old data away without ever accessing it or doing anything with
> it.  Such speed racer games are *completely meaningless* as an
> optimisation goal - it's what we've called "benchmarketing" for a
> couple of decades now.

The 10M you mentioned is just a way to measure; there is nothing wrong
with it. And considering that there are enough users considering
or already switching to SPDK because of performance, the approach
is not wrong. And it applies not only to IO polling; normal irq IO
suffers from the same problems.

A related story is that this number is for a pretty reduced config;
it'll go down with a more default-ish kernel.

> If all we focus on is bragging rights because "bigger number is
> always better", then we'll end up with iand IO path that looks like
> the awful mess that the fs/direct-io.c turned into. That ended up
> being hyper-optimised for CPU performance right down to single
> instructions and cacheline load orders, such that the code became
> extremely fragile and completely unmaintainable.
> 
> We ended up *reimplementing the direct IO code from scratch* so that
> XFS could build and submit direct IO smarter and faster because it
> simply couldn't be done to the old code.  That's how iomap came
> about, and without *any optimisation at all* iomap was 20-30% faster
> than the old, hyper-optimised fs/direct-io.c code.  IOWs, we always
> knew we could do direct IO faster than fs/direct-io.c, but we
> couldn't make the fs/direct-io.c faster because the
> hyper-optimisation of the code paths made it impossible to modify
> and maintain.
> 
> The current approach of hyper-optimising the IO path for maximum
> per-core IOPS at the expense of everything else has been proven in
> the past to be exactly the wrong approach to be taking for IO path
> development. Yes, we need to be concerned about performance and work
> to improve it, but we should not be doing that at the cost of
> everything else that the IO stack needs to be able to do.

And iomap is great; what you described is a good, typical example
of unmaintainable code. I may be getting wrong what exactly you refer
to, but I don't see maintainability being disregarded.

It is even more interesting to note that, more often than not, extra
features (and flags) hurt the maintainability of the kernel, but then
the other benefits (hopefully) outweigh that.

> Fundamentally, optimisation is something we do *after* we provide
> the functionality that is required; using "fast path optimisation"
> as a blunt force implement to prevent new features from being
> implemented is just ...  obnoxious.
> 
>> This one doesn't spill yet into paths I care about, but in general
>> it'd be great if we start thinking more about such stuff instead of
>> throwing yet another if into the path, e.g. by shifting the overhead
>> from linear to a constant for cases that don't use it, for instance
>> with callbacks or bit masks.
> 
> This is orthogonal to providing data recovery functionality.
> If the claims that flag propagation is too expensive are true, then
> fixing this problem will also improve RWF_HIPRI performance
> regardless of whether RWF_DATA_RECOVERY exists or not...
> 
> IOWs, *if* there is a fast path performance degradation as a result
> of flag propagation, then *go measure it* and show us how much
> impact it has on _real world applications_.  *Show us the numbers*
> and document how much each additional flag propagation actually
> costs so we can talk about whether it is acceptable, mitigation
> strategies and/or alternative implementations.  Flag propagation
> overhead is just not a valid reason for preventing us adding new
> flags to the IO path. Fix the flag propagation overhead if it's a
> problem for you, don't use it as an excuse for preventing people
> from adding new functionality that uses flag propagation...
Pavel Begunkov Oct. 31, 2021, 1:27 p.m. UTC | #17
On 10/29/21 21:08, Darrick J. Wong wrote:
> On Fri, Oct 29, 2021 at 08:23:53PM +0100, Pavel Begunkov wrote:
>> On 10/29/21 17:57, Darrick J. Wong wrote:
>>> On Fri, Oct 29, 2021 at 12:46:14PM +0100, Pavel Begunkov wrote:
>>>> On 10/28/21 23:59, Dave Chinner wrote:
>>>> [...]
>>>>>>> Well, my point is doing recovery from bit errors is by definition not
>>>>>>> the fast path.  Which is why I'd rather keep it away from the pmem
>>>>>>> read/write fast path, which also happens to be the (much more important)
>>>>>>> non-pmem read/write path.
>>>>>>
>>>>>> The trouble is, we really /do/ want to be able to (re)write the failed
>>>>>> area, and we probably want to try to read whatever we can.  Those are
>>>>>> reads and writes, not {pre,f}allocation activities.  This is where Dave
>>>>>> and I arrived at a month ago.
>>>>>>
>>>>>> Unless you'd be ok with a second IO path for recovery where we're
>>>>>> allowed to be slow?  That would probably have the same user interface
>>>>>> flag, just a different path into the pmem driver.
>>>>>
>>>>> I just don't see how 4 single line branches to propagate RWF_RECOVERY
>>>>> down to the hardware is in any way an imposition on the fast path.
>>>>> It's no different for passing RWF_HIPRI down to the hardware *in the
>>>>> fast path* so that the IO runs the hardware in polling mode because
>>>>> it's faster for some hardware.
>>>>
>>>> Not particularly about this flag, but it is expensive. Surely looks
>>>> cheap when it's just one feature, but there are dozens of them with
>>>> limited applicability, default config kernels are already sluggish
>>>> when it comes to really fast devices and it's not getting better.
>>>> Also, pretty often every of them will add a bunch of extra checks
>>>> to fix something of whatever it would be.
>>>
>>> So we can't have data recovery because moving fast is the only goal?
>>
>> That's not what was said and you missed the point, which was in
>> the rest of the message.
> 
> ...whatever point you were trying to make was so vague that it was
> totally uninformative and I completely missed it.
> 
> What does "callbacks or bit masks" mean, then, specifically?  How
> *exactly* would you solve the problem that Jane is seeking to solve by
> using callbacks?
> 
> Actually, you know what?  I'm so fed up with every single DAX
> conversation turning into a ****storm of people saying NO NO NO NO NO NO
> NO NO to everything proposed that I'm actually going to respond to
> whatever I think your point is, and you can defend whatever I come up
> with.

Interesting. I don't want to break it to you, but nobody is going to
defend against interpretations you made up out of thin air. I think
there is one thing we can relate on: I wonder as well what the bloody
hell that opus of yours was.


> 
>>>
>>> That's so meta.
>>>
>>> --D
>>>
>>>> So let's add a bit of pragmatism to the picture, if there is just one
>>>> user of a feature but it adds overhead for millions of machines that
>>>> won't ever use it, it's expensive.
> 
> Errors are infrequent, and since everything is cloud-based and disposable
> now, we can replace error handling with BUG_ON().  This will reduce code
> complexity, which will reduce code size, and improve icache usage.  Win!
> 
>>>> This one doesn't spill yet into paths I care about,
> 
> ...so you sail in and say 'no' even though you don't yet care...
> 
>>>> but in general
>>>> it'd be great if we start thinking more about such stuff instead of
>>>> throwing yet another if into the path, e.g. by shifting the overhead
>>>> from linear to a constant for cases that don't use it, for instance
>>>> with callbacks
> 
> Ok so after userspace calls into pread to access a DAX file, hits the
> poisoned memory line and the machinecheck fires, what then?  I guess we
> just have to figure out how to get from the MCA handler (assuming the
> machine doesn't just reboot instantly) all the way back into memcpy?
> Ok, you're in charge of figuring that out because I don't know how to do
> that.
> 
> Notably, RWF_DATA_RECOVERY is the flag that we're calling *from* a
> callback that happens after memory controller realizes it's lost
> something, kicks a notification to the OS kernel through ACPI, and the
> kernel signal userspace to do something about it.  Yeah, that's dumb
> since spinning rust already does all this for us, but that's pmem.
> 
>>>> or bit masks.
> 
> WTF does this even mean?
> 
> --D
> 
>>>>
>>>>> IOWs, saying that we shouldn't implement RWF_RECOVERY because it
>>>>> adds a handful of branches to the fast path is like saying that we
>>>>> shouldn't implement RWF_HIPRI because it slows down the fast path
>>>>> for non-polled IO....
>>>>>
>>>>> Just factor the actual recovery operations out into a separate
>>>>> function like:
Matthew Wilcox Nov. 1, 2021, 2:31 a.m. UTC | #18
On Sun, Oct 31, 2021 at 01:19:48PM +0000, Pavel Begunkov wrote:
> On 10/29/21 23:32, Dave Chinner wrote:
> > Yup, you just described RWF_HIPRI! Seriously, Pavel, did you read
> > past this?  I'll quote what I said again, because I've already
> > addressed this argument to point out how silly it is:
> 
> And you almost got to the initial point in your penultimate paragraph.
> A single if for a single flag is not an issue; the problem is when
> there are dozens of them and the overhead for it is not isolated, so
> the kernel has to jump through dozens of those.

This argument can be used to reject *ANY* new feature.  For example, by
using your argument, we should have rejected the addition of IOCB_WAITQ
because it penalises the vast majority of IOs which do not use it.

But we didn't.  Because we see that while it may not be of use to US
today, it's a generally useful feature for Linux to support.  You say
yourself that this feature doesn't slow down your use case, so why are
you spending so much time and energy annoying the people who actually
want to use it?

Seriously.  Stop arguing about something you actually don't care about.
You're just making Linux less fun to work on.
Christoph Hellwig Nov. 2, 2021, 6:18 a.m. UTC | #19
On Wed, Oct 27, 2021 at 05:24:51PM -0700, Darrick J. Wong wrote:
> ...so would you happen to know if anyone's working on solving this
> problem for us by putting the memory controller in charge of dealing
> with media errors?

The only one who could know is Intel..

> The trouble is, we really /do/ want to be able to (re)write the failed
> area, and we probably want to try to read whatever we can.  Those are
> reads and writes, not {pre,f}allocation activities.  This is where Dave
> and I arrived at a month ago.
> 
> Unless you'd be ok with a second IO path for recovery where we're
> allowed to be slow?  That would probably have the same user interface
> flag, just a different path into the pmem driver.

Which is fine with me.  If you look at the API here we do have the
RWF_ API, which then maps to the IOMAP API, which maps to the DAX_
API, which then gets special casing over three methods.

And while Pavel pointed out that he and Jens are now optimizing for
single branches like this, I think this actually is silly, and it is
not my point.

The point is that the DAX in-kernel API is a mess, and before we make
it even worse we need to sort it first.  What is directly relevant
here is that the copy_from_iter and copy_to_iter APIs do not make
sense.  Most of the DAX API is based around getting a memory mapping
using ->direct_access, it is just the read/write path which is a slow
path that actually uses this.  I have a very WIP patch series to try
to sort this out here:

http://git.infradead.org/users/hch/misc.git/shortlog/refs/heads/dax-devirtualize

But back to this series.  The basic DAX model is that the caller gets a
memory mapping and just works on that, maybe calling a sync after a write
in a few cases.  So any kind of recovery really needs to be able to
work with that model as going forward the copy_to/from_iter path will
be used less and less.  i.e. file systems can and should use
direct_access directly instead of using the block layer implementation
in the pmem driver.  As an example the dm-writecache driver, the pending
bcache nvdimm support and the (horrible and out of tree) nova file system
won't even use this path.  We need to find a way to support recovery
for them.  And overloading it onto the read/write path, which is not
the main path for DAX but is the absolute fast path for 99% of the
kernel users, is a horrible idea.
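
As a sketch of that model (signatures approximate the current
include/linux/dax.h; the -EIO behaviour on poison shown here is an
assumption and is exactly what is under discussion):

#include <linux/dax.h>
#include <linux/pfn_t.h>
#include <linux/string.h>

/*
 * Direct-map model: the caller obtains a kernel mapping of the pmem
 * pages via dax_direct_access() and works on it directly, bypassing
 * the copy_to/from_iter read/write path.  Assumes len fits within the
 * single page mapped here.
 */
static long write_via_direct_map(struct dax_device *dax_dev, pgoff_t pgoff,
				 const void *src, size_t len)
{
	void *kaddr;
	pfn_t pfn;
	long nr;

	nr = dax_direct_access(dax_dev, pgoff, 1, &kaddr, &pfn);
	if (nr < 0)
		return nr;	/* e.g. -EIO when the range carries poison */

	memcpy_flushcache(kaddr, src, len);	/* caller handles any sync */
	return 0;
}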

So how can we work around the horrible nvdimm design for data recovery
in a way that:

   a) actually works with the intended direct memory map use case
   b) doesn't really affect the normal kernel too much

?
Dan Williams Nov. 2, 2021, 4:03 p.m. UTC | #20
On Tue, Oct 26, 2021 at 11:50 PM Christoph Hellwig <hch@infradead.org> wrote:
>
> On Fri, Oct 22, 2021 at 08:52:55PM +0000, Jane Chu wrote:
> > Thanks - I try to be honest.  As far as I can tell, the argument
> > about the flag is a philosophical argument between two views.
> > One view assumes design based on perfect hardware, and media error
> > belongs to the category of brokenness. Another view sees media
> > error as a build-in hardware component and make design to include
> > dealing with such errors.
>
> No, I don't think so.  Bit errors do happen in all media, which is
> why devices are built to handle them.  It is just the Intel-style
> pmem interface to handle them which is completely broken.

No, any media can report checksum / parity errors. NVME also seems to
do a poor job with multi-bit ECC errors consumed from DRAM. There is
nothing "pmem" or "Intel" specific here.

> > errors in mind from start.  I guess I'm trying to articulate why
> > it is acceptable to include the RWF_DATA_RECOVERY flag to the
> > existing RWF_ flags. - this way, pwritev2 remain fast on fast path,
> > and its slow path (w/ error clearing) is faster than other alternative.
> > Other alternative being 1 system call to clear the poison, and
> > another system call to run the fast pwrite for recovery, what
> > happens if something happened in between?
>
> Well, my point is doing recovery from bit errors is by definition not
> the fast path.  Which is why I'd rather keep it away from the pmem
> read/write fast path, which also happens to be the (much more important)
> non-pmem read/write path.

I would expect this interface to be useful outside of pmem as a
"failfast" or "try harder to recover" flag for reading over media
errors.
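
From userspace, that read side would look something like the sketch
below.  RWF_RECOVERY_DATA is the flag this series proposes; it is not in
any released uapi header, so the value below is only a placeholder and
the real one has to come from the series' uapi patch:

  #define _GNU_SOURCE
  #include <sys/uio.h>

  #ifndef RWF_RECOVERY_DATA
  #define RWF_RECOVERY_DATA 0x20  /* placeholder value, not authoritative */
  #endif

  /* salvage whatever is readable from a range that may contain poison */
  static ssize_t recover_read(int fd, void *buf, size_t len, off_t off)
  {
          struct iovec iov = { .iov_base = buf, .iov_len = len };
          ssize_t n;

          /* fast path first ... */
          n = preadv2(fd, &iov, 1, off, 0);
          if (n >= 0)
                  return n;

          /* ... then ask the driver to try harder around the poison */
          return preadv2(fd, &iov, 1, off, RWF_RECOVERY_DATA);
  }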
Dan Williams Nov. 2, 2021, 4:12 p.m. UTC | #21
On Wed, Oct 27, 2021 at 5:25 PM Darrick J. Wong <djwong@kernel.org> wrote:
>
> On Tue, Oct 26, 2021 at 11:49:59PM -0700, Christoph Hellwig wrote:
> > On Fri, Oct 22, 2021 at 08:52:55PM +0000, Jane Chu wrote:
> > > Thanks - I try to be honest.  As far as I can tell, the argument
> > > about the flag is a philosophical argument between two views.
> > > One view assumes design based on perfect hardware, and media error
> > > belongs to the category of brokenness. Another view sees media
> > > error as a build-in hardware component and make design to include
> > > dealing with such errors.
> >
> > No, I don't think so.  Bit errors do happen in all media, which is
> > why devices are built to handle them.  It is just the Intel-style
> > pmem interface to handle them which is completely broken.
>
> Yeah, I agree, this takes me back to learning how to use DISKEDIT to
> work around a hole punched in a file (with a pen!) in the 1980s...
>
> ...so would you happen to know if anyone's working on solving this
> problem for us by putting the memory controller in charge of dealing
> with media errors?

What are you guys going on about? ECC memory corrects single-bit
errors in the background; multi-bit errors cause the memory controller
to signal that data is gone. This is how ECC memory has worked since
forever. Typically the kernel's memory-failure path is just throwing
away pages that signal data loss. Throwing away pmem pages is harder
because unlike DRAM the physical address of the page matters to upper
layers.

>
> > > errors in mind from start.  I guess I'm trying to articulate why
> > > it is acceptable to include the RWF_DATA_RECOVERY flag to the
> > > existing RWF_ flags. - this way, pwritev2 remain fast on fast path,
> > > and its slow path (w/ error clearing) is faster than other alternative.
> > > Other alternative being 1 system call to clear the poison, and
> > > another system call to run the fast pwrite for recovery, what
> > > happens if something happened in between?
> >
> > Well, my point is doing recovery from bit errors is by definition not
> > the fast path.  Which is why I'd rather keep it away from the pmem
> > read/write fast path, which also happens to be the (much more important)
> > non-pmem read/write path.
>
> The trouble is, we really /do/ want to be able to (re)write the failed
> area, and we probably want to try to read whatever we can.  Those are
> reads and writes, not {pre,f}allocation activities.  This is where Dave
> and I arrived at a month ago.
>
> Unless you'd be ok with a second IO path for recovery where we're
> allowed to be slow?  That would probably have the same user interface
> flag, just a different path into the pmem driver.
>
> Ha, how about a int fd2 = recoveryfd(fd); call where you'd get whatever
> speshul options (retry raid mirrors!  scrape the film off the disk if
> you have to!) you want that can take forever, leaving the fast paths
> alone?

I am still failing to see the technical argument for why
RWF_RECOVER_DATA significantly impacts the fast path, and why you
think this is somehow specific to pmem. In fact the pmem effort is
doing the responsible thing and trying to plumb this path while other
storage drivers just seem to be pretending that memory errors never
happen.
Dan Williams Nov. 2, 2021, 7:57 p.m. UTC | #22
On Mon, Nov 1, 2021 at 11:19 PM Christoph Hellwig <hch@infradead.org> wrote:
>
> On Wed, Oct 27, 2021 at 05:24:51PM -0700, Darrick J. Wong wrote:
> > ...so would you happen to know if anyone's working on solving this
> > problem for us by putting the memory controller in charge of dealing
> > with media errors?
>
> The only one who could know is Intel..
>
> > The trouble is, we really /do/ want to be able to (re)write the failed
> > area, and we probably want to try to read whatever we can.  Those are
> > reads and writes, not {pre,f}allocation activities.  This is where Dave
> > and I arrived at a month ago.
> >
> > Unless you'd be ok with a second IO path for recovery where we're
> > allowed to be slow?  That would probably have the same user interface
> > flag, just a different path into the pmem driver.
>
> Which is fine with me.  If you look at the API here we do have the
> RWF_ API, which them maps to the IOMAP API, which maps to the DAX_
> API which then gets special casing over three methods.
>
> And while Pavel pointed out that he and Jens are now optimizing for
> single branches like this.  I think this actually is silly and it is
> not my point.
>
> The point is that the DAX in-kernel API is a mess, and before we make
> it even worse we need to sort it first.  What is directly relevant
> here is that the copy_from_iter and copy_to_iter APIs do not make
> sense.  Most of the DAX API is based around getting a memory mapping
> using ->direct_access, it is just the read/write path which is a slow
> path that actually uses this.  I have a very WIP patch series to try
> to sort this out here:
>
> http://git.infradead.org/users/hch/misc.git/shortlog/refs/heads/dax-devirtualize
>
> But back to this series.  The basic DAX model is that the callers gets a
> memory mapping an just works on that, maybe calling a sync after a write
> in a few cases.  So any kind of recovery really needs to be able to
> work with that model as going forward the copy_to/from_iter path will
> be used less and less.  i.e. file systems can and should use
> direct_access directly instead of using the block layer implementation
> in the pmem driver.  As an example the dm-writecache driver, the pending
> bcache nvdimm support and the (horribly and out of tree) nova file systems
> won't even use this path.  We need to find a way to support recovery
> for them.  And overloading it over the read/write path which is not
> the main path for DAX, but the absolutely fast path for 99% of the
> kernel users is a horrible idea.
>
> So how can we work around the horrible nvdimm design for data recovery
> in a way that:
>
>    a) actually works with the intended direct memory map use case
>    b) doesn't really affect the normal kernel too much
>
> ?

Ok, now I see where you are going, but I don't see line of sight to
something better than RWF_RECOVER_DATA.

This goes back to one of the original DAX concerns of wanting a kernel
library for coordinating PMEM mmap I/O vs leaving userspace to wrap
PMEM semantics on top of a DAX mapping. The problem is that mmap-I/O
has this error-handling-API issue whether it is a DAX mapping or not.
I.e. a memory failure in page cache is going to signal the process the
same way and it will need to fall back to something other than mmap
I/O to make forward progress. This is not a PMEM, Intel or even x86
problem, it's a generic CONFIG_ARCH_SUPPORTS_MEMORY_FAILURE problem.

CONFIG_ARCH_SUPPORTS_MEMORY_FAILURE implies that processes will
receive SIGBUS + BUS_MCEERR_A{R,O} when memory failure is signalled
and then rely on readv(2)/writev(2) to recover. Do you see a readily
available way to improve upon that model without CPU instruction
changes? Even with CPU instructions changes, do you think it could
improve much upon the model of interrupting the process when a load
instruction aborts?
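
(For reference, the userspace side of that model is: the SIGBUS handler
records si_addr/si_code, and the application then repairs that page
through the syscall path rather than through the mapping, e.g. with a
pwritev2() carrying the flag proposed by this series.  A minimal sketch
of the handler half, names invented, per-thread state omitted:)

  #define _GNU_SOURCE
  #include <signal.h>
  #include <stddef.h>

  static void *poisoned_addr;     /* real code wants per-thread state */

  static void bus_handler(int sig, siginfo_t *si, void *ucontext)
  {
          if (si->si_code == BUS_MCEERR_AR || si->si_code == BUS_MCEERR_AO)
                  poisoned_addr = si->si_addr;
          /* real code would siglongjmp() out of the faulting access here */
  }

  static void install_sigbus_handler(void)
  {
          struct sigaction act = {
                  .sa_sigaction = bus_handler,
                  .sa_flags = SA_SIGINFO,
          };

          sigaction(SIGBUS, &act, NULL);
  }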

I do agree with you that DAX needs to separate itself from block, but
I don't think it follows that DAX also needs to separate itself from
readv/writev for when a kernel slow-path needs to get involved because
mmap I/O (just CPU instructions) does not have the proper semantics.
Even if you got one of the ARCH_SUPPORTS_MEMORY_FAILURE to implement
those semantics in new / augmented CPU instructions you will likely
not get all of them to move and certainly not in any near term
timeframe, so the kernel path will be around indefinitely.

Meanwhile, I think RWF_RECOVER_DATA is generically useful for other
storage besides PMEM and helps storage-drivers do better than large
blast radius "I/O error" completions with no other recourse.
Christoph Hellwig Nov. 3, 2021, 4:53 p.m. UTC | #23
On Tue, Nov 02, 2021 at 09:03:55AM -0700, Dan Williams wrote:
> > why devices are built to handle them.  It is just the Intel-style
> > pmem interface to handle them which is completely broken.
> 
> No, any media can report checksum / parity errors. NVME also seems to
> do a poor job with multi-bit ECC errors consumed from DRAM. There is
> nothing "pmem" or "Intel" specific here.

If you do get data corruption from NVMe (which yes can happen despite
the typical very good UBER rate) you just write over it again.  You
don't need to magically whack the underlying device.  Same for hard
drives.

> > Well, my point is doing recovery from bit errors is by definition not
> > the fast path.  Which is why I'd rather keep it away from the pmem
> > read/write fast path, which also happens to be the (much more important)
> > non-pmem read/write path.
> 
> I would expect this interface to be useful outside of pmem as a
> "failfast" or "try harder to recover" flag for reading over media
> errors.

Maybe we need to sit down and define useful semantics then?  The problem
on the write side isn't really that the behavior with the flag is
undefined, it is more that writes without the flag have horrible
semantics if they don't just clear the error.
Christoph Hellwig Nov. 3, 2021, 4:58 p.m. UTC | #24
On Tue, Nov 02, 2021 at 12:57:10PM -0700, Dan Williams wrote:
> This goes back to one of the original DAX concerns of wanting a kernel
> library for coordinating PMEM mmap I/O vs leaving userspace to wrap
> PMEM semantics on top of a DAX mapping. The problem is that mmap-I/O
> has this error-handling-API issue whether it is a DAX mapping or not.

Semantics of writes through shared mmaps are a nightmare.  Agreed,
including agreeing that this is neither new nor pmem specific.  But
it also has absolutely nothing to do with the new RWF_ flag.

> CONFIG_ARCH_SUPPORTS_MEMORY_FAILURE implies that processes will
> receive SIGBUS + BUS_MCEERR_A{R,O} when memory failure is signalled
> and then rely on readv(2)/writev(2) to recover. Do you see a readily
> available way to improve upon that model without CPU instruction
> changes? Even with CPU instructions changes, do you think it could
> improve much upon the model of interrupting the process when a load
> instruction aborts?

The "only" think we need is something like the exception table we
use in the kernel for the uaccess helpers (and the new _nofault
kernel access helper).  But I suspect refitting that into userspace
environments is probably non-trivial.

> I do agree with you that DAX needs to separate itself from block, but
> I don't think it follows that DAX also needs to separate itself from
> readv/writev for when a kernel slow-path needs to get involved because
> mmap I/O (just CPU instructions) does not have the proper semantics.
> Even if you got one of the ARCH_SUPPORTS_MEMORY_FAILURE to implement
> those semantics in new / augmented CPU instructions you will likely
> not get all of them to move and certainly not in any near term
> timeframe, so the kernel path will be around indefinitely.

I think you misunderstood me.  I don't think pmem needs to be
decoupled from the read/write path.  But I'm very skeptical of adding
a new flag to the common read/write path for the special workaround
that a plain old write will not actually clear errors unlike every
other store interface.

> Meanwhile, I think RWF_RECOVER_DATA is generically useful for other
> storage besides PMEM and helps storage-drivers do better than large
> blast radius "I/O error" completions with no other recourse.

How?
Jane Chu Nov. 3, 2021, 6:09 p.m. UTC | #25
On 11/1/2021 11:18 PM, Christoph Hellwig wrote:
> On Wed, Oct 27, 2021 at 05:24:51PM -0700, Darrick J. Wong wrote:
>> ...so would you happen to know if anyone's working on solving this
>> problem for us by putting the memory controller in charge of dealing
>> with media errors?
> 
> The only one who could know is Intel..
> 
>> The trouble is, we really /do/ want to be able to (re)write the failed
>> area, and we probably want to try to read whatever we can.  Those are
>> reads and writes, not {pre,f}allocation activities.  This is where Dave
>> and I arrived at a month ago.
>>
>> Unless you'd be ok with a second IO path for recovery where we're
>> allowed to be slow?  That would probably have the same user interface
>> flag, just a different path into the pmem driver.
> 
> Which is fine with me.  If you look at the API here we do have the
> RWF_ API, which them maps to the IOMAP API, which maps to the DAX_
> API which then gets special casing over three methods.
> 
> And while Pavel pointed out that he and Jens are now optimizing for
> single branches like this.  I think this actually is silly and it is
> not my point.
> 
> The point is that the DAX in-kernel API is a mess, and before we make
> it even worse we need to sort it first.  What is directly relevant
> here is that the copy_from_iter and copy_to_iter APIs do not make
> sense.  Most of the DAX API is based around getting a memory mapping
> using ->direct_access, it is just the read/write path which is a slow
> path that actually uses this.  I have a very WIP patch series to try
> to sort this out here:
> 
> http://git.infradead.org/users/hch/misc.git/shortlog/refs/heads/dax-devirtualize
> 
> But back to this series.  The basic DAX model is that the callers gets a
> memory mapping an just works on that, maybe calling a sync after a write
> in a few cases.  So any kind of recovery really needs to be able to
> work with that model as going forward the copy_to/from_iter path will
> be used less and less.  i.e. file systems can and should use
> direct_access directly instead of using the block layer implementation
> in the pmem driver.  As an example the dm-writecache driver, the pending
> bcache nvdimm support and the (horribly and out of tree) nova file systems
> won't even use this path.  We need to find a way to support recovery
> for them.  And overloading it over the read/write path which is not
> the main path for DAX, but the absolutely fast path for 99% of the
> kernel users is a horrible idea.
> 
> So how can we work around the horrible nvdimm design for data recovery
> in a way that:
> 
>     a) actually works with the intended direct memory map use case
>     b) doesn't really affect the normal kernel too much
> 
> ?
> 

This is clearer.  I've looked at your 'dax-devirtualize' patch, which
removes pmem_copy_to/from_iter, and as you mentioned before,
a separate API for poison-clearing is needed. So how about I go ahead
and rebase my earlier patch

https://lore.kernel.org/lkml/20210914233132.3680546-2-jane.chu@oracle.com/
on 'dax-devirtualize' and provide dm support for clear-poison?
That way, the non-dax 99% of the pwrite use-cases aren't impacted at all
and we resolve the urgent pmem poison-clearing issue?

Dan, are you okay with this?  I am getting pressure from our customers
who are basically stuck at the moment.

thanks!
-jane
Dan Williams Nov. 3, 2021, 8:33 p.m. UTC | #26
On Wed, Nov 3, 2021 at 9:58 AM Christoph Hellwig <hch@infradead.org> wrote:
>
> On Tue, Nov 02, 2021 at 12:57:10PM -0700, Dan Williams wrote:
> > This goes back to one of the original DAX concerns of wanting a kernel
> > library for coordinating PMEM mmap I/O vs leaving userspace to wrap
> > PMEM semantics on top of a DAX mapping. The problem is that mmap-I/O
> > has this error-handling-API issue whether it is a DAX mapping or not.
>
> Semantics of writes through shared mmaps are a nightmare.  Agreed,
> including agreeing that this is neither new nor pmem specific.  But
> it also has absolutely nothing to do with the new RWF_ flag.

Ok.

> > CONFIG_ARCH_SUPPORTS_MEMORY_FAILURE implies that processes will
> > receive SIGBUS + BUS_MCEERR_A{R,O} when memory failure is signalled
> > and then rely on readv(2)/writev(2) to recover. Do you see a readily
> > available way to improve upon that model without CPU instruction
> > changes? Even with CPU instructions changes, do you think it could
> > improve much upon the model of interrupting the process when a load
> > instruction aborts?
>
> The "only" think we need is something like the exception table we
> use in the kernel for the uaccess helpers (and the new _nofault
> kernel access helper).  But I suspect refitting that into userspace
> environments is probably non-trivial.

Is the exception table requirement not already fulfilled by:

sigaction(SIGBUS, &act, 0);
...
if (sigsetjmp(sj_env, 1)) {
...

...but yes, that's awkward when all you want is an error return from a
copy operation.
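
Spelled out, the closest userspace analog of copy_mc_to_kernel() today
ends up looking something like this sketch (helper name is made up;
per-thread jmp_buf handling and signal-safety details omitted):

  #define _GNU_SOURCE
  #include <setjmp.h>
  #include <signal.h>
  #include <string.h>

  static sigjmp_buf sj_env;

  static void sigbus_jump(int sig, siginfo_t *si, void *ucontext)
  {
          siglongjmp(sj_env, 1);
  }

  /* returns 0 on success, -1 if the copy consumed poison */
  static int copy_mc_user(void *dst, const void *src, size_t len)
  {
          struct sigaction act = {
                  .sa_sigaction = sigbus_jump,
                  .sa_flags = SA_SIGINFO,
          }, old;

          sigaction(SIGBUS, &act, &old);
          if (sigsetjmp(sj_env, 1)) {
                  sigaction(SIGBUS, &old, NULL);
                  return -1;      /* got here via SIGBUS on the poisoned load */
          }
          memcpy(dst, src, len);
          sigaction(SIGBUS, &old, NULL);
          return 0;
  }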

For _nofault I'll note that on the kernel side Linus was explicit
about not mixing fault handling and memory error exception handling in
the same accessor. That's why copy_mc_to_kernel() and
copy_{to,from}_kernel_nofault() are distinct. I only say that to probe
deeper about what a "copy_mc()" looks like in userspace? Perhaps an
interface to suppress SIGBUS generation and register a ring buffer
that gets filled with error-event records encountered over a given
MMAP I/O code sequence?

> > I do agree with you that DAX needs to separate itself from block, but
> > I don't think it follows that DAX also needs to separate itself from
> > readv/writev for when a kernel slow-path needs to get involved because
> > mmap I/O (just CPU instructions) does not have the proper semantics.
> > Even if you got one of the ARCH_SUPPORTS_MEMORY_FAILURE to implement
> > those semantics in new / augmented CPU instructions you will likely
> > not get all of them to move and certainly not in any near term
> > timeframe, so the kernel path will be around indefinitely.
>
> I think you misunderstood me.  I don't think pmem needs to be
> decoupled from the read/write path.  But I'm very skeptical of adding
> a new flag to the common read/write path for the special workaround
> that a plain old write will not actually clear errors unlike every
> other store interfac.

Ah, ok, yes, I agree with you there that needing to redirect writes to
a platform firmware call to clear errors, and to notify the device that
its error-list has changed, is exceedingly awkward. That said, even if
the device-side error-list auto-updated on write (like the promise of
MOVDIR64B) there's still the question about when to do management on
the software error lists in the driver and/or filesystem. I.e. given
that XFS at least wants to be aware of the error lists for block
allocation and "list errors" type features. More below...

> > Meanwhile, I think RWF_RECOVER_DATA is generically useful for other
> > storage besides PMEM and helps storage-drivers do better than large
> > blast radius "I/O error" completions with no other recourse.
>
> How?

Hasn't this been a perennial topic at LSF/MM, i.e. how to get an
interface for the filesystem to request "try harder" to return data?
If the device has a recovery slow-path, or error tracking granularity
is smaller than the I/O size, then RWF_RECOVER_DATA gives the
device/driver leeway to do better than the typical fast path. For
writes though, I can only come up with the use case of this being a
signal to the driver to take the opportunity to do error-list
management relative to the incoming write data.

However, if signaling that "now is the time to update error-lists" is
the requirement, I imagine the @kaddr returned from
dax_direct_access() could be made to point to an unmapped address
representing the poisoned page. Then, arrange for a pmem-driver fault
handler to emulate the copy operation and do the slow path updates
that would otherwise have been gated by RWF_RECOVER_DATA.

Although, I'm not excited about teaching every PMEM arch's fault
handler about this new source of kernel faults. Other ideas?
RWF_RECOVER_DATA still seems the most viable / cleanest option, but
I'm willing to do what it takes to move this error management
capability forward.
Dan Williams Nov. 4, 2021, 6:21 a.m. UTC | #27
On Wed, Nov 3, 2021 at 11:10 AM Jane Chu <jane.chu@oracle.com> wrote:
>
> On 11/1/2021 11:18 PM, Christoph Hellwig wrote:
> > On Wed, Oct 27, 2021 at 05:24:51PM -0700, Darrick J. Wong wrote:
> >> ...so would you happen to know if anyone's working on solving this
> >> problem for us by putting the memory controller in charge of dealing
> >> with media errors?
> >
> > The only one who could know is Intel..
> >
> >> The trouble is, we really /do/ want to be able to (re)write the failed
> >> area, and we probably want to try to read whatever we can.  Those are
> >> reads and writes, not {pre,f}allocation activities.  This is where Dave
> >> and I arrived at a month ago.
> >>
> >> Unless you'd be ok with a second IO path for recovery where we're
> >> allowed to be slow?  That would probably have the same user interface
> >> flag, just a different path into the pmem driver.
> >
> > Which is fine with me.  If you look at the API here we do have the
> > RWF_ API, which them maps to the IOMAP API, which maps to the DAX_
> > API which then gets special casing over three methods.
> >
> > And while Pavel pointed out that he and Jens are now optimizing for
> > single branches like this.  I think this actually is silly and it is
> > not my point.
> >
> > The point is that the DAX in-kernel API is a mess, and before we make
> > it even worse we need to sort it first.  What is directly relevant
> > here is that the copy_from_iter and copy_to_iter APIs do not make
> > sense.  Most of the DAX API is based around getting a memory mapping
> > using ->direct_access, it is just the read/write path which is a slow
> > path that actually uses this.  I have a very WIP patch series to try
> > to sort this out here:
> >
> > http://git.infradead.org/users/hch/misc.git/shortlog/refs/heads/dax-devirtualize
> >
> > But back to this series.  The basic DAX model is that the callers gets a
> > memory mapping an just works on that, maybe calling a sync after a write
> > in a few cases.  So any kind of recovery really needs to be able to
> > work with that model as going forward the copy_to/from_iter path will
> > be used less and less.  i.e. file systems can and should use
> > direct_access directly instead of using the block layer implementation
> > in the pmem driver.  As an example the dm-writecache driver, the pending
> > bcache nvdimm support and the (horribly and out of tree) nova file systems
> > won't even use this path.  We need to find a way to support recovery
> > for them.  And overloading it over the read/write path which is not
> > the main path for DAX, but the absolutely fast path for 99% of the
> > kernel users is a horrible idea.
> >
> > So how can we work around the horrible nvdimm design for data recovery
> > in a way that:
> >
> >     a) actually works with the intended direct memory map use case
> >     b) doesn't really affect the normal kernel too much
> >
> > ?
> >
>
> This is clearer, I've looked at your 'dax-devirtualize' patch which
> removes pmem_copy_to/from_iter, and as you mentioned before,
> a separate API for poison-clearing is needed. So how about I go ahead
> rebase my earlier patch
>
> https://lore.kernel.org/lkml/20210914233132.3680546-2-jane.chu@oracle.com/
> on 'dax-devirtualize', provide dm support for clear-poison?
> That way, the non-dax 99% of the pwrite use-cases aren't impacted at all
> and we resolve the urgent pmem poison-clearing issue?
>
> Dan, are you okay with this?  I am getting pressure from our customers
> who are basically stuck at the moment.

The concern I have with dax_clear_poison() is that it precludes atomic
error clearing. Also, as Boris and I discussed, poisoned pages should
be marked NP (not present) rather than UC (uncacheable) [1]. With
those 2 properties combined I think that wants a custom pmem fault
handler that knows how to carefully write to pmem pages with poison
present, rather than an additional explicit dax-operation. That also
meets Christoph's requirement of "works with the intended direct
memory map use case".

[1]: https://lore.kernel.org/r/CAPcyv4hrXPb1tASBZUg-GgdVs0OOFKXMXLiHmktg_kFi7YBMyQ@mail.gmail.com
Christoph Hellwig Nov. 4, 2021, 8:21 a.m. UTC | #28
On Wed, Nov 03, 2021 at 06:09:57PM +0000, Jane Chu wrote:
> This is clearer, I've looked at your 'dax-devirtualize' patch which 
> removes pmem_copy_to/from_iter, and as you mentioned before,
> a separate API for poison-clearing is needed. So how about I go ahead
> rebase my earlier patch
>  
> https://lore.kernel.org/lkml/20210914233132.3680546-2-jane.chu@oracle.com/
> on 'dax-devirtualize', provide dm support for clear-poison?
> That way, the non-dax 99% of the pwrite use-cases aren't impacted at all
> and we resolve the urgent pmem poison-clearing issue?

FYI, I really do like the in-kernel interface in that series.  And
now that we discussed all the effects I also think that automatically
doing the clear on EIO is a good idea.

I'll go back to the series and will reply with a few nitpicks.
Christoph Hellwig Nov. 4, 2021, 8:30 a.m. UTC | #29
On Wed, Nov 03, 2021 at 01:33:58PM -0700, Dan Williams wrote:
> Is the exception table requirement not already fulfilled by:
> 
> sigaction(SIGBUS, &act, 0);
> ...
> if (sigsetjmp(sj_env, 1)) {
> ...
> 
> ...but yes, that's awkward when all you want is an error return from a
> copy operation.

Yeah.  The nice thing about the kernel uaccess / _nofault helpers is
that they allow normal error handling flows.

> For _nofault I'll note that on the kernel side Linus was explicit
> about not mixing fault handling and memory error exception handling in
> the same accessor. That's why copy_mc_to_kernel() and
> copy_{to,from}_kernel_nofault() are distinct.

I've always wondered why we need all this mess.  But if the head
penguin wants it..

> I only say that to probe
> deeper about what a "copy_mc()" looks like in userspace? Perhaps an
> interface to suppress SIGBUS generation and register a ring buffer
> that gets filled with error-event records encountered over a given
> MMAP I/O code sequence?

Well, the equivalent to the kernel uaccess model would be to register
a SIGBUS handler that uses an exception table like the kernel, and then
if you use the right helpers to load/store they can return errors.

The other option would be something more like the Windows Structured
Exception Handling:

https://docs.microsoft.com/en-us/cpp/cpp/structured-exception-handling-c-cpp?view=msvc-160


> > I think you misunderstood me.  I don't think pmem needs to be
> > decoupled from the read/write path.  But I'm very skeptical of adding
> > a new flag to the common read/write path for the special workaround
> > that a plain old write will not actually clear errors unlike every
> > other store interfac.
> 
> Ah, ok, yes, I agree with you there that needing to redirect writes to
> a platform firmware call to clear errors, and notify the device that
> its error-list has changed is exceedingly awkward.

Yes.  And that is the big difference to every other modern storage
device.  SSDs always write out of place and will just not reuse bad
blocks, and all drives built in the last 25-30 years will also do
internal bad block remapping.

> That said, even if
> the device-side error-list auto-updated on write (like the promise of
> MOVDIR64B) there's still the question about when to do management on
> the software error lists in the driver and/or filesytem. I.e. given
> that XFS at least wants to be aware of the error lists for block
> allocation and "list errors" type features. More below...

Well, the whole problem is that we should not have to manage this at
all, and this is where I blame Intel.  There is no good reason to not
slightly overprovision the nvdimms and just do internal bad page
remapping like every other modern storage device.

> Hasn't this been a perennial topic at LSF/MM, i.e. how to get an
> interface for the filesystem to request "try harder" to return data?

Trying harder to _get_ data or to _store_ data?  All the discussion
here seems to be about actually writing data.

> If the device has a recovery slow-path, or error tracking granularity
> is smaller than the I/O size, then RWF_RECOVER_DATA gives the
> device/driver leeway to do better than the typical fast path. For
> writes though, I can only come up with the use case of this being a
> signal to the driver to take the opportunity to do error-list
> management relative to the incoming write data.

Well, while devices have data recovery slow paths, they tend to use those
automatically.  What I'm actually seeing in practice is flags in the
storage interfaces to skip this slow path recovery because the higher
level software would prefer to instead recover e.g. from remote copies.

> However, if signaling that "now is the time to update error-lists" is
> the requirement, I imagine the @kaddr returned from
> dax_direct_access() could be made to point to an unmapped address
> representing the poisoned page. Then, arrange for a pmem-driver fault
> handler to emulate the copy operation and do the slow path updates
> that would otherwise have been gated by RWF_RECOVER_DATA.

That does sound like a much better interface than most of what we've
discussed so far. 

> Although, I'm not excited about teaching every PMEM arch's fault
> handler about this new source of kernel faults. Other ideas?
> RWF_RECOVER_DATA still seems the most viable / cleanest option, but
> I'm willing to do what it takes to move this error management
> capability forward.

Out of the low-intrusiveness options, Jane's previous series
to automatically retry after calling a clear_poison operation seems
like the best idea so far.  We just need to also think about what
we want to do for direct users of ->direct_access that do not use
the mcsafe iov_iter helpers.
Christoph Hellwig Nov. 4, 2021, 8:36 a.m. UTC | #30
On Wed, Nov 03, 2021 at 11:21:39PM -0700, Dan Williams wrote:
> The concern I have with dax_clear_poison() is that it precludes atomic
> error clearing.

atomic as in clear poison and write the actual data?  Yes, that would
be useful, but it is not how the Intel pmem support actually works, right?

> Also, as Boris and I discussed, poisoned pages should
> be marked NP (not present) rather than UC (uncacheable) [1].

This would not really have an effect on the series, right?  But yes,
that seems like the right thing to do.

> With
> those 2 properties combined I think that wants a custom pmem fault
> handler that knows how to carefully write to pmem pages with poison
> present, rather than an additional explicit dax-operation. That also
> meets Christoph's requirement of "works with the intended direct
> memory map use case".

So we have 3 kinds of accesses to DAX memory:

 (1) user space mmap direct access. 
 (2) iov_iter based access (could be from kernel or userspace)
 (3) open coded kernel access using ->direct_access

One thing I noticed:  (2) could also work with kernel memory or pages,
but that doesn't use MC-safe access.  Which seems like a major issue
independent of this discussion.

I suspect all kernel access could work fine with a copy_mc_to_kernel
helper as long as everyone actually uses it, which will cover the
missing required bits of (2) and (3) together with something like the
->clear_poison series from Jane. We just need to think hard what we
want to do for userspace mmap access.
Matthew Wilcox Nov. 4, 2021, 12:29 p.m. UTC | #31
On Thu, Nov 04, 2021 at 01:30:48AM -0700, Christoph Hellwig wrote:
> Well, the whole problem is that we should not have to manage this at
> all, and this is where I blame Intel.  There is no good reason to not
> slightly overprovision the nvdimms and just do internal bad page
> remapping like every other modern storage device.

What makes you think they don't?

The problem is persuading the CPU to do writes without doing reads.
That's where magic instructions or out of band IO is needed.
Dan Williams Nov. 4, 2021, 4:08 p.m. UTC | #32
On Thu, Nov 4, 2021 at 1:36 AM Christoph Hellwig <hch@infradead.org> wrote:
>
> On Wed, Nov 03, 2021 at 11:21:39PM -0700, Dan Williams wrote:
> > The concern I have with dax_clear_poison() is that it precludes atomic
> > error clearing.
>
> atomic as in clear poison and write the actual data?  Yes, that would
> be useful, but it is not how the Intel pmem support actually works, right?

Yes, atomic clear+write new data. The ability to atomically clear requires
either a CPU with the ability to overwrite cachelines without doing a
RMW cycle (MOVDIR64B), or it requires a device with a suitable
slow-path mailbox command like the one defined for CXL devices (see
section 8.2.9.5.4.3 Clear Poison in CXL 2.0).
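
(For reference, MOVDIR64B is a single 64-byte direct store, so a full
cacheline can be overwritten without the read-modify-write that pulls
poison back in.  A userspace-flavored sketch via the compiler intrinsic,
which needs -mmovdir64b and a CPU with the instruction; the kernel has
its own movdir64b() wrapper for this:)

  #include <immintrin.h>

  /*
   * Overwrite one full cacheline without a RMW cycle.
   * dst must be 64-byte aligned, src points at 64 bytes of new data.
   */
  static void write_cacheline_direct(void *dst, const void *src)
  {
          _movdir64b(dst, src);
  }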

I don't know why you think these devices don't perform wear-leveling
with spare blocks?

> > Also, as Boris and I discussed, poisoned pages should
> > be marked NP (not present) rather than UC (uncacheable) [1].
>
> This would not really have an affect on the series, right?  But yes,
> that seems like the right thing to do.

It would, because the implementations would need to be careful to clear
poison in an entire page before any of it could be accessed. With an
enlightened write-path RWF flag or custom fault handler it could do
sub-page overwrites of poison. Not that I think the driver should
optimize for multiple failed cachelines in a page, but it does mean
dax_clear_poison() fails in more theoretical scenarios.

> > With
> > those 2 properties combined I think that wants a custom pmem fault
> > handler that knows how to carefully write to pmem pages with poison
> > present, rather than an additional explicit dax-operation. That also
> > meets Christoph's requirement of "works with the intended direct
> > memory map use case".
>
> So we have 3 kinds of accesses to DAX memory:
>
>  (1) user space mmap direct access.
>  (2) iov_iter based access (could be from kernel or userspace)
>  (3) open coded kernel access using ->direct_access
>
> One thing I noticed:  (2) could also work with kernel memory or pages,
> but that doesn't use MC safe access.

Yes, but after the fight to even get copy_mc_to_kernel() to exist for
pmem_copy_to_iter() I did not have the nerve to push for wider usage.

> Which seems like a major independent
> of this discussion.
>
> I suspect all kernel access could work fine with a copy_mc_to_kernel
> helper as long as everyone actually uses it,

All kernel accesses do use it. They either route to
pmem_copy_to_iter(), or like dm-writecache, call it directly. Do you
see a kernel path that does not use that helper?

> missing required bits of (2) and (3) together with something like the
> ->clear_poison series from Jane. We just need to think hard what we
> want to do for userspace mmap access.

dax_clear_poison() is at least ready to go today and does not preclude
adding the atomic and finer grained support later.
Dan Williams Nov. 4, 2021, 4:24 p.m. UTC | #33
On Thu, Nov 4, 2021 at 1:30 AM Christoph Hellwig <hch@infradead.org> wrote:
>
> On Wed, Nov 03, 2021 at 01:33:58PM -0700, Dan Williams wrote:
> > Is the exception table requirement not already fulfilled by:
> >
> > sigaction(SIGBUS, &act, 0);
> > ...
> > if (sigsetjmp(sj_env, 1)) {
> > ...
> >
> > ...but yes, that's awkward when all you want is an error return from a
> > copy operation.
>
> Yeah.  The nice thing about the kernel uaccess / _nofault helpers is
> that they allow normal error handling flows.
>
> > For _nofault I'll note that on the kernel side Linus was explicit
> > about not mixing fault handling and memory error exception handling in
> > the same accessor. That's why copy_mc_to_kernel() and
> > copy_{to,from}_kernel_nofault() are distinct.
>
> I've always wondered why we need all this mess.  But if the head
> penguin wants it..
>
> > I only say that to probe
> > deeper about what a "copy_mc()" looks like in userspace? Perhaps an
> > interface to suppress SIGBUS generation and register a ring buffer
> > that gets filled with error-event records encountered over a given
> > MMAP I/O code sequence?
>
> Well, the equivalent to the kernel uaccess model would be to register
> a SIGBUS handler that uses an exception table like the kernel, and then
> if you use the right helpers to load/store they can return errors.
>
> The other option would be something more like the Windows Structured
> Exception Handling:
>
> https://docs.microsoft.com/en-us/cpp/cpp/structured-exception-handling-c-cpp?view=msvc-160

Yes, try-catch blocks for PMEM are what is needed, if they're not there
already in PMDK. Searching for SIGBUS in PMDK finds things like:

https://github.com/pmem/pmdk/blob/26449a51a86861db17feabdfefaa5ee4d5afabc9/src/libpmem2/mcsafe_ops_posix.c

>
>
> > > I think you misunderstood me.  I don't think pmem needs to be
> > > decoupled from the read/write path.  But I'm very skeptical of adding
> > > a new flag to the common read/write path for the special workaround
> > > that a plain old write will not actually clear errors unlike every
> > > other store interfac.
> >
> > Ah, ok, yes, I agree with you there that needing to redirect writes to
> > a platform firmware call to clear errors, and notify the device that
> > its error-list has changed is exceedingly awkward.
>
> Yes.  And that is the big difference to every other modern storage
> device.  SSDs always write out of place and will just not reuse bad
> blocks, and all drivers built in the last 25-30 years will also do
> internal bad block remapping.
>

No, the big difference with every other modern storage device is
access to byte-addressable storage. Storage devices get to "cheat"
with guaranteed minimum 512-byte accesses. So you can arrange for
writes to always be large enough to scrub the ECC bits along with the
data. For PMEM and byte-granularity DAX accesses the "sector size" is
a cacheline and it needed a new CPU instruction before software could
atomically update data + ECC. Otherwise, with sub-cacheline accesses,
a RMW cycle can't always be avoided. Such a cycle pulls poison from
the device on the read and pushes it back out to the media on the
cacheline writeback.

> > That said, even if
> > the device-side error-list auto-updated on write (like the promise of
> > MOVDIR64B) there's still the question about when to do management on
> > the software error lists in the driver and/or filesytem. I.e. given
> > that XFS at least wants to be aware of the error lists for block
> > allocation and "list errors" type features. More below...
>
> Well, the whole problem is that we should not have to manage this at
> all, and this is where I blame Intel.  There is no good reason to not
> slightly overprovision the nvdimms and just do internal bad page
> remapping like every other modern storage device.

I don't understand what overprovisioning has to do with better error
management? No other storage device has seen fit to be as transparent
with communicating the error list and offering ways to proactively
scrub it. Dave and Darrick rightly saw this and said "hey, the FS
could do a much better job for the user if it knew about this error
list". So I don't get what this argument about spare blocks has to do
with what XFS wants? I.e. an rmap facility to communicate files that
have been clobbered by cosmic rays and other calamities.

> > Hasn't this been a perennial topic at LSF/MM, i.e. how to get an
> > interface for the filesystem to request "try harder" to return data?
>
> Trying harder to _get_ data or to _store_ data?  All the discussion
> here seems to be able about actually writing data.
>
> > If the device has a recovery slow-path, or error tracking granularity
> > is smaller than the I/O size, then RWF_RECOVER_DATA gives the
> > device/driver leeway to do better than the typical fast path. For
> > writes though, I can only come up with the use case of this being a
> > signal to the driver to take the opportunity to do error-list
> > management relative to the incoming write data.
>
> Well, while devices have data recovery slow path they tend to use those
> automatically.  What I'm actually seeing in practice is flags in the
> storage interfaces to skip this slow path recovery because the higher
> level software would prefer to instead recover e.g. from remote copies.

Ok.

> > However, if signaling that "now is the time to update error-lists" is
> > the requirement, I imagine the @kaddr returned from
> > dax_direct_access() could be made to point to an unmapped address
> > representing the poisoned page. Then, arrange for a pmem-driver fault
> > handler to emulate the copy operation and do the slow path updates
> > that would otherwise have been gated by RWF_RECOVER_DATA.
>
> That does sound like a much better interface than most of what we've
> discussed so far.
>
> > Although, I'm not excited about teaching every PMEM arch's fault
> > handler about this new source of kernel faults. Other ideas?
> > RWF_RECOVER_DATA still seems the most viable / cleanest option, but
> > I'm willing to do what it takes to move this error management
> > capability forward.
>
> So far out of the low instrusiveness options Janes' previous series
> to automatically retry after calling a clear_poison operation seems
> like the best idea so far.  We just need to also think about what
> we want to do for direct users of ->direct_access that do not use
> the mcsafe iov_iter helpers.

Those exist? Even dm-writecache uses copy_mc_to_kernel().
Christoph Hellwig Nov. 4, 2021, 5:43 p.m. UTC | #34
On Thu, Nov 04, 2021 at 09:24:15AM -0700, Dan Williams wrote:
> No, the big difference with every other modern storage device is
> access to byte-addressable storage. Storage devices get to "cheat"
> with guaranteed minimum 512-byte accesses. So you can arrange for
> writes to always be large enough to scrub the ECC bits along with the
> data. For PMEM and byte-granularity DAX accesses the "sector size" is
> a cacheline and it needed a new CPU instruction before software could
> atomically update data + ECC. Otherwise, with sub-cacheline accesses,
> a RMW cycle can't always be avoided. Such a cycle pulls poison from
> the device on the read and pushes it back out to the media on the
> cacheline writeback.

Indeed.  The fake byte addressability is indeed the problem, and the
fix is to not do that, at least on the second attempt.

> I don't understand what overprovisioning has to do with better error
> management? No other storage device has seen fit to be as transparent
> with communicating the error list and offering ways to proactively
> scrub it. Dave and Darrick rightly saw this and said "hey, the FS
> could do a much better job for the user if it knew about this error
> list". So I don't get what this argument about spare blocks has to do
> with what XFS wants? I.e. an rmap facility to communicate files that
> have been clobbered by cosmic rays and other calamities.

Well, the answer for other interfaces (at least at the gold plated
cost option) is so strong internal CRCs that user visible bits clobbered
by cosmic rays don't realistically happen.  But it is a problem with the
cheaper ones, and at least SCSI and NVMe offer the error list through
the Get LBA status command (and I bet ATA too, but I haven't looked into
that).  Oddly enough there has never been much interest from the
fs community for those.

> > So far out of the low instrusiveness options Janes' previous series
> > to automatically retry after calling a clear_poison operation seems
> > like the best idea so far.  We just need to also think about what
> > we want to do for direct users of ->direct_access that do not use
> > the mcsafe iov_iter helpers.
> 
> Those exist? Even dm-writecache uses copy_mc_to_kernel().

I'm sorry, I have completely missed that it has been added.  And it's
been in for a whole year..
Christoph Hellwig Nov. 4, 2021, 5:46 p.m. UTC | #35
On Thu, Nov 04, 2021 at 09:08:41AM -0700, Dan Williams wrote:
> Yes, atomic clear+write new data. The ability to atomic clear requires
> either a CPU with the ability to overwrite cachelines without doing a
> RMW cycle (MOVDIR64B), or it requires a device with a suitable
> slow-path mailbox command like the one defined for CXL devices (see
> section 8.2.9.5.4.3 Clear Poison in CXL 2.0).
> 
> I don't know why you think these devices don't perform wear-leveling
> with spare blocks?

Because the interface looks so broken.  But yes, apparently it's not
the media management that is broken but just the interface that fakes
up byte level access.

> All kernel accesses do use it. They either route to
> pmem_copy_to_iter(), or like dm-writecache, call it directly. Do you
> see a kernel path that does not use that helper?

No, sorry.  My knowledge is out of date.
(nova does, but it is out of tree, and the lack of using
copy_mc_to_kernel is the least of its problems)
Dan Williams Nov. 4, 2021, 5:50 p.m. UTC | #36
On Thu, Nov 4, 2021 at 10:43 AM Christoph Hellwig <hch@infradead.org> wrote:
>
> On Thu, Nov 04, 2021 at 09:24:15AM -0700, Dan Williams wrote:
> > No, the big difference with every other modern storage device is
> > access to byte-addressable storage. Storage devices get to "cheat"
> > with guaranteed minimum 512-byte accesses. So you can arrange for
> > writes to always be large enough to scrub the ECC bits along with the
> > data. For PMEM and byte-granularity DAX accesses the "sector size" is
> > a cacheline and it needed a new CPU instruction before software could
> > atomically update data + ECC. Otherwise, with sub-cacheline accesses,
> > a RMW cycle can't always be avoided. Such a cycle pulls poison from
> > the device on the read and pushes it back out to the media on the
> > cacheline writeback.
>
> Indeed.  The fake byte addressability is indeed the problem, and the
> fix is to not do that, at least on the second attempt.

Fair enough.

> > I don't understand what overprovisioning has to do with better error
> > management? No other storage device has seen fit to be as transparent
> > with communicating the error list and offering ways to proactively
> > scrub it. Dave and Darrick rightly saw this and said "hey, the FS
> > could do a much better job for the user if it knew about this error
> > list". So I don't get what this argument about spare blocks has to do
> > with what XFS wants? I.e. an rmap facility to communicate files that
> > have been clobbered by cosmic rays and other calamities.
>
> Well, the answer for other interfaces (at least at the gold plated
> cost option) is so strong internal CRCs that user visible bits clobbered
> by cosmic rays don't realisticly happen.  But it is a problem with the
> cheaper ones, and at least SCSI and NVMe offer the error list through
> the Get LBA status command (and I bet ATA too, but I haven't looked into
> that).  Oddly enough there has never been much interested from the
> fs community for those.

It seems the entanglement with 'struct page', error handling, and
reflink made people take notice. Hopefully someone could follow the
same plumbing we're doing for pmem to offer error-rmap help for NVME
badblocks.
Matthew Wilcox Nov. 4, 2021, 6:05 p.m. UTC | #37
On Thu, Nov 04, 2021 at 10:43:23AM -0700, Christoph Hellwig wrote:
> Well, the answer for other interfaces (at least at the gold plated
> cost option) is so strong internal CRCs that user visible bits clobbered
> by cosmic rays don't realisticly happen.  But it is a problem with the
> cheaper ones, and at least SCSI and NVMe offer the error list through
> the Get LBA status command (and I bet ATA too, but I haven't looked into
> that).  Oddly enough there has never been much interested from the
> fs community for those.

"don't realistically happen" is different when you're talking about
"doesn't happen within the warranty period of my laptop's SSD" and
"doesn't happen on my fleet of 10k servers before they're taken out of
service".  There's also a big difference in speeds between an NVMe drive
(7GB/s) and a memory device (20-50GB/s).  The UBER being talked about
when I was still at Intel was similar to / slightly better than DRAM,
but that's still several failures per year across an entire data centre
that's using pmem flat-out.
Jane Chu Nov. 4, 2021, 6:33 p.m. UTC | #38
Thanks for the enlightening discussion here, it's so helpful!

Please allow me to recap what I've caught up on so far -

1. recovery write at page boundary due to NP setting in poisoned
    page to prevent undesirable prefetching
2. single interface to perform 3 tasks:
      { clear-poison, update error-list, write }
    such as an API in pmem driver.
    For CPUs that support MOVDIR64B, the 'clear-poison' and 'write'
    tasks can be combined (would need something different from the
    existing _copy_mcsafe though) and 'update error-list' follows
    closely behind;
    For CPUs that rely on a firmware call to clear poison, the existing
    pmem_clear_poison() can be used, followed by the 'write' task
    (a rough driver-level sketch follows this list).
3. if the user didn't supply the RWF_RECOVERY_DATA flag, then dax recovery
    would be automatic for a write if the range is page aligned;
    otherwise, the write fails with EIO as usual.
    Also, the user mustn't have punched out the poisoned page, in which
    case poison repair would be a lot more complicated.
4. desirable to fetch as much data as possible from a poisoned range.
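
The driver-level sketch of the 3-in-1 interface in item 2, for the
firmware-clear case (names and exact prototypes are mine, not from any
patch; pmem_clear_poison() stands in for the existing static helper in
drivers/nvdimm/pmem.c, which also updates the badblocks list):

  /* sketch: { clear-poison, update error-list, write } in one call */
  static size_t pmem_recovery_write(struct pmem_device *pmem,
                                    phys_addr_t pmem_off, void *pmem_addr,
                                    const void *buf, size_t len)
  {
          /* 1 + 2: clear poison via firmware, which refreshes the bb list */
          if (pmem_clear_poison(pmem, pmem_off, len) != BLK_STS_OK)
                  return 0;

          /* 3: store the good data and flush it past the CPU caches */
          memcpy_flushcache(pmem_addr, buf, len);
          return len;
  }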

If this understanding is in the right direction, then I'd like to
propose the following changes to
   dax_direct_access(), dax_copy_to/from_iter(), pmem_copy_to/from_iter()
   and the dm layer copy_to/from_iter, dax_iomap_iter().

1. dax_iomap_iter() relies on dax_direct_access() to decide whether there
    is likely a media error: if the API without DAX_F_RECOVERY returns
    -EIO, then switch to recovery-read/write code.  In recovery code,
    supply DAX_F_RECOVERY to dax_direct_access() in order to obtain
    'kaddr', and then call dax_copy_to/from_iter() with DAX_F_RECOVERY
    (see the rough sketch after item 2 below).
2. the _copy_to/from_iter implementation would be largely the same
    as in my recent patch, but some changes in Christoph's
    'dax-devirtualize' may be kept, such as DAX_F_VIRTUAL; obviously
    virtual devices don't have the ability to clear poison, so no need
    to complicate them.  And this also means that not every endpoint
    dax device has to provide dax_op.copy_to/from_iter, they may use the
    default.
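
The fallback in item 1, written out as a fragment meant to live inside
dax_iomap_iter().  Pseudo-ish sketch only: the extra flags argument to
dax_direct_access() and dax_copy_from_iter() is the proposed change, not
the current prototype, and the locals are the ones already in that
function:

  unsigned int dax_flags = 0;

  map_len = dax_direct_access(dax_dev, pgoff, nr_pages, 0, &kaddr, NULL);
  if (map_len == -EIO) {
          /* poison in the range: remap with recovery semantics */
          dax_flags = DAX_F_RECOVERY;
          map_len = dax_direct_access(dax_dev, pgoff, nr_pages,
                                      dax_flags, &kaddr, NULL);
  }
  if (map_len < 0)
          return map_len;

  /* with DAX_F_RECOVERY the driver may clear poison on the store side */
  xfer = dax_copy_from_iter(dax_dev, pgoff, kaddr, map_len, iter,
                            dax_flags);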

I'm not sure about nova and others: if they use a different 'write' path
other than via iomap, does that mean there will be a need for a new set
of dax_op for their read/write?  The 3-in-1 binding would always be
required though. Maybe that'll be an ongoing discussion?

Comments? Suggestions?

Thanks!
-jane
Dan Williams Nov. 4, 2021, 7 p.m. UTC | #39
On Thu, Nov 4, 2021 at 11:34 AM Jane Chu <jane.chu@oracle.com> wrote:
>
> Thanks for the enlightening discussion here, it's so helpful!
>
> Please allow me to recap what I've caught up so far -
>
> 1. recovery write at page boundary due to NP setting in poisoned
>     page to prevent undesirable prefetching
> 2. single interface to perform 3 tasks:
>       { clear-poison, update error-list, write }
>     such as an API in pmem driver.
>     For CPUs that support MOVEDIR64B, the 'clear-poison' and 'write'
>     task can be combined (would need something different from the
>     existing _copy_mcsafe though) and 'update error-list' follows
>     closely behind;
>     For CPUs that rely on firmware call to clear posion, the existing
>     pmem_clear_poison() can be used, followed by the 'write' task.
> 3. if user isn't given RWF_RECOVERY_FLAG flag, then dax recovery
>     would be automatic for a write if range is page aligned;
>     otherwise, the write fails with EIO as usual.
>     Also, user mustn't have punched out the poisoned page in which
>     case poison repairing will be a lot more complicated.
> 4. desirable to fetch as much data as possible from a poisoned range.
>
> If this understanding is in the right direction, then I'd like to
> propose below changes to
>    dax_direct_access(), dax_copy_to/from_iter(), pmem_copy_to/from_iter()
>    and the dm layer copy_to/from_iter, dax_iomap_iter().
>
> 1. dax_iomap_iter() rely on dax_direct_access() to decide whether there
>     is likely media error: if the API without DAX_F_RECOVERY returns
>     -EIO, then switch to recovery-read/write code.  In recovery code,
>     supply DAX_F_RECOVERY to dax_direct_access() in order to obtain
>     'kaddr', and then call dax_copy_to/from_iter() with DAX_F_RECOVERY.

I like it. It allows for an atomic write+clear implementation on
capable platforms and coordinates with potentially unmapped pages. The
best of both worlds from the dax_clear_poison() proposal and my "take
a fault and do a slow-path copy".

> 2. the _copy_to/from_iter implementation would be largely the same
>     as in my recent patch, but some changes in Christoph's
>     'dax-devirtualize' maybe kept, such as DAX_F_VIRTUAL, obviously
>     virtual devices don't have the ability to clear poison, so no need
>     to complicate them.  And this also means that not every endpoint
>     dax device has to provide dax_op.copy_to/from_iter, they may use the
>     default.

Did I miss this series or are you talking about this one?
https://lore.kernel.org/all/20211018044054.1779424-1-hch@lst.de/

> I'm not sure about nova and others, if they use different 'write' other
> than via iomap, does that mean there will be need for a new set of
> dax_op for their read/write?

No, they're out-of-tree they'll adjust to the same interface that xfs
and ext4 are using when/if they go upstream.

> the 3-in-1 binding would always be
> required though. Maybe that'll be an ongoing discussion?

Yeah, let's cross that bridge when we come to it.

> Comments? Suggestions?

It sounds great to me!
Jane Chu Nov. 4, 2021, 8:27 p.m. UTC | #40
On 11/4/2021 12:00 PM, Dan Williams wrote:

>>
>> If this understanding is in the right direction, then I'd like to
>> propose below changes to
>>     dax_direct_access(), dax_copy_to/from_iter(), pmem_copy_to/from_iter()
>>     and the dm layer copy_to/from_iter, dax_iomap_iter().
>>
>> 1. dax_iomap_iter() rely on dax_direct_access() to decide whether there
>>      is likely media error: if the API without DAX_F_RECOVERY returns
>>      -EIO, then switch to recovery-read/write code.  In recovery code,
>>      supply DAX_F_RECOVERY to dax_direct_access() in order to obtain
>>      'kaddr', and then call dax_copy_to/from_iter() with DAX_F_RECOVERY.
> 
> I like it. It allows for an atomic write+clear implementation on
> capable platforms and coordinates with potentially unmapped pages. The
> best of both worlds from the dax_clear_poison() proposal and my "take
> a fault and do a slow-path copy".
> 
>> 2. the _copy_to/from_iter implementation would be largely the same
>>      as in my recent patch, but some changes in Christoph's
>>      'dax-devirtualize' maybe kept, such as DAX_F_VIRTUAL, obviously
>>      virtual devices don't have the ability to clear poison, so no need
>>      to complicate them.  And this also means that not every endpoint
>>      dax device has to provide dax_op.copy_to/from_iter, they may use the
>>      default.
> 
> Did I miss this series or are you talking about this one?
> https://lore.kernel.org/all/20211018044054.1779424-1-hch@lst.de/

I was referring to
 
http://git.infradead.org/users/hch/misc.git/shortlog/refs/heads/dax-devirtualize
which has not come out yet.  I said early on that I'll rebase on it,
but it looks like we still need pmem_copy_to/from_iter(), so.

> 
>> I'm not sure about nova and others, if they use different 'write' other
>> than via iomap, does that mean there will be need for a new set of
>> dax_op for their read/write?
> 
> No, they're out-of-tree they'll adjust to the same interface that xfs
> and ext4 are using when/if they go upstream.
> 
>> the 3-in-1 binding would always be
>> required though. Maybe that'll be an ongoing discussion?
> 
> Yeah, let's cross that bridge when we come to it.
> 
>> Comments? Suggestions?
> 
> It sounds great to me!
> 

Thanks!  I'll send out an updated patchset when it's ready.

-jane
Dan Williams Nov. 5, 2021, 12:46 a.m. UTC | #41
On Thu, Nov 4, 2021 at 1:27 PM Jane Chu <jane.chu@oracle.com> wrote:
>
> On 11/4/2021 12:00 PM, Dan Williams wrote:
>
> >>
> >> If this understanding is in the right direction, then I'd like to
> >> propose below changes to
> >>     dax_direct_access(), dax_copy_to/from_iter(), pmem_copy_to/from_iter()
> >>     and the dm layer copy_to/from_iter, dax_iomap_iter().
> >>
> >> 1. dax_iomap_iter() rely on dax_direct_access() to decide whether there
> >>      is likely media error: if the API without DAX_F_RECOVERY returns
> >>      -EIO, then switch to recovery-read/write code.  In recovery code,
> >>      supply DAX_F_RECOVERY to dax_direct_access() in order to obtain
> >>      'kaddr', and then call dax_copy_to/from_iter() with DAX_F_RECOVERY.
> >
> > I like it. It allows for an atomic write+clear implementation on
> > capable platforms and coordinates with potentially unmapped pages. The
> > best of both worlds from the dax_clear_poison() proposal and my "take
> > a fault and do a slow-path copy".
> >
> >> 2. the _copy_to/from_iter implementation would be largely the same
> >>      as in my recent patch, but some changes in Christoph's
> >>      'dax-devirtualize' maybe kept, such as DAX_F_VIRTUAL, obviously
> >>      virtual devices don't have the ability to clear poison, so no need
> >>      to complicate them.  And this also means that not every endpoint
> >>      dax device has to provide dax_op.copy_to/from_iter, they may use the
> >>      default.
> >
> > Did I miss this series or are you talking about this one?
> > https://lore.kernel.org/all/20211018044054.1779424-1-hch@lst.de/
>
> I was referring to
>
> http://git.infradead.org/users/hch/misc.git/shortlog/refs/heads/dax-devirtualize
> that has not come out yet, I said early on that I'll rebase on it,
> but looks like we still need pmem_copy_to/from_iter(), so.

Yeah, since the block-layer divorce gets rid of the old poison-clearing
path, we're back to pmem_copy_to_iter() (or something like it) needing
to pick up the slack for poison clearing. I do agree it would be nice
to clean up all the unnecessary boilerplate, but the error-list
coordination requires a driver-specific callback. At least the
DAX_F_VIRTUAL flag can eliminate the virtiofs and fuse callbacks.
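
I.e. the write side of that driver callback ends up looking something
like the below.  Sketch only: is_bad_pmem() and
_copy_from_iter_flushcache() exist today, but the flags argument and
the pmem_clear_poison_range() helper are placeholders for whatever the
series settles on.

static size_t pmem_copy_from_iter(struct dax_device *dax_dev,
		pgoff_t pgoff, void *addr, size_t bytes,
		struct iov_iter *i, unsigned long flags)
{
	struct pmem_device *pmem = dax_get_private(dax_dev);
	sector_t sector = PFN_PHYS(pgoff) >> SECTOR_SHIFT;

	/* only pay for the badblocks lookup on the recovery path */
	if ((flags & DAX_F_RECOVERY) &&
	    unlikely(is_bad_pmem(&pmem->bb, sector, bytes)))
		pmem_clear_poison_range(pmem, sector, bytes);

	return _copy_from_iter_flushcache(addr, bytes, i);
}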
Dan Williams Nov. 5, 2021, 1:35 a.m. UTC | #42
On Thu, Nov 4, 2021 at 5:46 PM Dan Williams <dan.j.williams@intel.com> wrote:
>
> On Thu, Nov 4, 2021 at 1:27 PM Jane Chu <jane.chu@oracle.com> wrote:
> >
> > On 11/4/2021 12:00 PM, Dan Williams wrote:
> >
> > >>
> > >> If this understanding is in the right direction, then I'd like to
> > >> propose below changes to
> > >>     dax_direct_access(), dax_copy_to/from_iter(), pmem_copy_to/from_iter()
> > >>     and the dm layer copy_to/from_iter, dax_iomap_iter().
> > >>
> > >> 1. dax_iomap_iter() rely on dax_direct_access() to decide whether there
> > >>      is likely media error: if the API without DAX_F_RECOVERY returns
> > >>      -EIO, then switch to recovery-read/write code.  In recovery code,
> > >>      supply DAX_F_RECOVERY to dax_direct_access() in order to obtain
> > >>      'kaddr', and then call dax_copy_to/from_iter() with DAX_F_RECOVERY.
> > >
> > > I like it. It allows for an atomic write+clear implementation on
> > > capable platforms and coordinates with potentially unmapped pages. The
> > > best of both worlds from the dax_clear_poison() proposal and my "take
> > > a fault and do a slow-path copy".
> > >
> > >> 2. the _copy_to/from_iter implementation would be largely the same
> > >>      as in my recent patch, but some changes in Christoph's
> > >>      'dax-devirtualize' maybe kept, such as DAX_F_VIRTUAL, obviously
> > >>      virtual devices don't have the ability to clear poison, so no need
> > >>      to complicate them.  And this also means that not every endpoint
> > >>      dax device has to provide dax_op.copy_to/from_iter, they may use the
> > >>      default.
> > >
> > > Did I miss this series or are you talking about this one?
> > > https://lore.kernel.org/all/20211018044054.1779424-1-hch@lst.de/
> >
> > I was referring to
> >
> > http://git.infradead.org/users/hch/misc.git/shortlog/refs/heads/dax-devirtualize
> > that has not come out yet, I said early on that I'll rebase on it,
> > but looks like we still need pmem_copy_to/from_iter(), so.
>
> Yeah, since the block-layer divorce gets rid of the old poison
> clearing path, then we're back to pmem_copy_to_iter() (or something
> like it) needing to pick up the slack for poison clearing. I do agree
> it would be nice to clean up all the unnecessary boilerplate, but the
> error-list coordination requires a driver specific callback. At least
> the DAX_F_VIRTUAL flag can eliminate the virtiofs and fuse callbacks.

Although, if error management is generically implemented relative to a
'struct dax_device', then there wouldn't be a need to call all the way
back into the driver, and the cleanup could still move forward.
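
Roughly: a single generic copy path in drivers/dax/super.c plus a
narrow, optional error-clearing op, rather than per-driver copy
callbacks.  Very rough sketch, and ->clear_poison() is a made-up op
name:

size_t dax_copy_from_iter(struct dax_device *dax_dev, pgoff_t pgoff,
		void *addr, size_t bytes, struct iov_iter *i,
		unsigned long flags)
{
	/* virtual / error-free devices simply don't provide the op */
	if ((flags & DAX_F_RECOVERY) && dax_dev->ops->clear_poison)
		dax_dev->ops->clear_poison(dax_dev, pgoff, bytes);

	return _copy_from_iter_flushcache(addr, bytes, i);
}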
Christoph Hellwig Nov. 5, 2021, 5:56 a.m. UTC | #43
On Thu, Nov 04, 2021 at 12:00:12PM -0700, Dan Williams wrote:
> > 1. dax_iomap_iter() rely on dax_direct_access() to decide whether there
> >     is likely media error: if the API without DAX_F_RECOVERY returns
> >     -EIO, then switch to recovery-read/write code.  In recovery code,
> >     supply DAX_F_RECOVERY to dax_direct_access() in order to obtain
> >     'kaddr', and then call dax_copy_to/from_iter() with DAX_F_RECOVERY.
> 
> I like it. It allows for an atomic write+clear implementation on
> capable platforms and coordinates with potentially unmapped pages. The
> best of both worlds from the dax_clear_poison() proposal and my "take
> a fault and do a slow-path copy".

Fine with me as well.

> 
> > 2. the _copy_to/from_iter implementation would be largely the same
> >     as in my recent patch, but some changes in Christoph's
> >     'dax-devirtualize' maybe kept, such as DAX_F_VIRTUAL, obviously
> >     virtual devices don't have the ability to clear poison, so no need
> >     to complicate them.  And this also means that not every endpoint
> >     dax device has to provide dax_op.copy_to/from_iter, they may use the
> >     default.
> 
> Did I miss this series or are you talking about this one?
> https://lore.kernel.org/all/20211018044054.1779424-1-hch@lst.de/

Yes.  This is an early RFC, but I plan to finish this up and submit
it after the updated decouple series. 

> 
> > I'm not sure about nova and others, if they use different 'write' other
> > than via iomap, does that mean there will be need for a new set of
> > dax_op for their read/write?
> 
> No, they're out-of-tree they'll adjust to the same interface that xfs
> and ext4 are using when/if they go upstream.

Yepp.
Lukas Straub Nov. 6, 2021, 7:41 a.m. UTC | #44
On Tue, 2 Nov 2021 09:03:55 -0700
Dan Williams <dan.j.williams@intel.com> wrote:

> On Tue, Oct 26, 2021 at 11:50 PM Christoph Hellwig <hch@infradead.org> wrote:
> >
> > On Fri, Oct 22, 2021 at 08:52:55PM +0000, Jane Chu wrote:  
> > > Thanks - I try to be honest.  As far as I can tell, the argument
> > > about the flag is a philosophical argument between two views.
> > > One view assumes design based on perfect hardware, and media error
> > > belongs to the category of brokenness. Another view sees media
> > > error as a build-in hardware component and make design to include
> > > dealing with such errors.  
> >
> > No, I don't think so.  Bit errors do happen in all media, which is
> > why devices are built to handle them.  It is just the Intel-style
> > pmem interface to handle them which is completely broken.  
> 
> No, any media can report checksum / parity errors. NVME also seems to
> do a poor job with multi-bit ECC errors consumed from DRAM. There is
> nothing "pmem" or "Intel" specific here.
> 
> > > errors in mind from start.  I guess I'm trying to articulate why
> > > it is acceptable to include the RWF_DATA_RECOVERY flag to the
> > > existing RWF_ flags. - this way, pwritev2 remain fast on fast path,
> > > and its slow path (w/ error clearing) is faster than other alternative.
> > > Other alternative being 1 system call to clear the poison, and
> > > another system call to run the fast pwrite for recovery, what
> > > happens if something happened in between?  
> >
> > Well, my point is doing recovery from bit errors is by definition not
> > the fast path.  Which is why I'd rather keep it away from the pmem
> > read/write fast path, which also happens to be the (much more important)
> > non-pmem read/write path.  
> 
> I would expect this interface to be useful outside of pmem as a
> "failfast" or "try harder to recover" flag for reading over media
> errors.

Yeah, I think this flag could also be useful for non-raid btrfs.

If you have an extent that is shared between multiple snapshots and
its data is corrupted (without the disk returning an I/O error), btrfs
won't be able to fix the corruption without RAID and will always
return an I/O error when accessing the affected range (due to the
checksum mismatch).

Of course you could just overwrite the range in the file with good
data, but that would only fix the file you are operating on; snapshots
will still reference the corrupted data.

With this flag, a read could just return the corrupted data without an
I/O error, and a write could write directly to the on-disk data to fix
up the corruption everywhere. btrfs could also check that the newly
written data actually matches the checksum.
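
From the application's point of view the repair would then be the
usual read-repair-write pattern with the new flag, roughly like below
(sketch; RWF_RECOVERY_DATA is the flag proposed by this series, so
this only builds against patched uapi headers):

#define _GNU_SOURCE
#include <sys/types.h>
#include <sys/uio.h>

/* salvage what is readable, reconstruct the rest (from a backup,
 * replica, ...), then write it back with the recovery flag so the
 * kernel is allowed to repair the bad range in place */
static ssize_t repair_range(int fd, void *buf, size_t len, off_t off)
{
	struct iovec iov = { .iov_base = buf, .iov_len = len };

	preadv2(fd, &iov, 1, off, RWF_RECOVERY_DATA);
	/* ... fix up buf here ... */
	return pwritev2(fd, &iov, 1, off, RWF_RECOVERY_DATA);
}
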
However, in this btrfs use case the process still needs CAP_SYS_ADMIN
or similar, since it is easy to create crc32 collisions, so an
attacker could write to a file they have no permission for if that
file shares an extent with one where they do have write permission.

Regards,
Lukas Straub
--