[PATCHSET,RFC,v2,jane,0/5] vfs: enable userspace to reset damaged file storage

Message ID 163192864476.417973.143014658064006895.stgit@magnolia (mailing list archive)

Message

Darrick J. Wong Sept. 18, 2021, 1:30 a.m. UTC
Hi all,

Jane Chu has taken an interest in trying to fix the pmem poison recovery
story on Linux.  Since I sort of had a half-baked patchset that seems to
contain some elements of what the reviewers of her patchset wanted, I'm
releasing this reworked version to see if it has any traction.

Our current "advice" to people using persistent memory and FSDAX who
wish to recover upon receipt of a media error (aka 'hwpoison') event
from ACPI is to punch-hole that part of the file and then pwrite it,
which will magically cause the pmem to be reinitialized and the poison
to be cleared.
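
For reference, the "punch and write" recovery described above can be
sketched from userspace roughly as follows.  This is only an
illustration, not code from the patchset: punch_and_rewrite() is a
hypothetical helper name, and it assumes the caller already knows the
affected byte range and has replacement data in hand.  fallocate(2) and
pwrite(2) are the standard Linux calls.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper (not part of the patchset): punch out
 * [off, off + len) and then rewrite it from buf.
 * Returns 0 on success, -1 with errno set on failure.
 */
static int punch_and_rewrite(int fd, off_t off, size_t len, const void *buf)
{
	const char *p = buf;
	size_t done = 0;

	assert(buf != NULL && len > 0);

	/* Free the blocks backing the range.  PUNCH_HOLE requires KEEP_SIZE. */
	if (fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
		      off, (off_t)len) < 0)
		return -1;

	/* Writing into the hole makes the filesystem allocate blocks again. */
	while (done < len) {
		ssize_t n = pwrite(fd, p + done, len - done, off + (off_t)done);

		if (n < 0) {
			if (errno == EINTR)
				continue;
			return -1;
		}
		if (n == 0) {
			/* No progress; report an error rather than spin. */
			errno = EIO;
			return -1;
		}
		done += (size_t)n;
	}
	return 0;
}
```

Nothing in this sequence lets the caller say which blocks the filesystem
should hand back, which is the gap the following paragraphs discuss.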

Punching doesn't make any sense at all -- the (re)allocation on pwrite
does not permit the caller to specify where to find blocks, which means
that we might not get the same pmem back.  This pushes the user farther
away from the goal of reinitializing poisoned memory and leads to
complaints about unnecessary file fragmentation.

AFAICT, the only reason why the "punch and write" dance works at all is
that XFS and ext4 currently call blkdev_issue_zeroout when
allocating pmem ahead of a write call.  Even a regular overwrite won't
clear the poison, because dax_direct_access is smart enough to bail out
on poisoned pmem, but not smart enough to clear it.  To be fair, that
function maps pages and has no idea what kinds of reads and writes the
caller might want to perform.

Therefore, clean up this whole mess by creating a dax_zeroinit_range
function that callers can use on poisoned persistent memory to reset the
contents of the persistent memory to a known state (all zeroes) and
clear any poison state that might be lingering in the memory
controllers.  Create a new fallocate mode to trigger this functionality,
then wire up XFS and ext4 to use it.  For good measure, wire it up to
traditional storage if the storage has a fast way to zero LBA contents,
since we assume that those LBAs won't hit old media errors.

v2: change the name to zeroinit, add an explicit fallocate mode, and
    support regular block devices for non-dax files

If you're going to start using this mess, you probably ought to just
pull from my git trees, which are linked below.

This is an extraordinary way to destroy everything.  Enjoy!
Comments and questions are, as always, welcome.

--D

kernel git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=zero-initialize-pmem-5.16
---
 fs/dax.c                    |   93 +++++++++++++++++++++++++++++++++++++++++++
 fs/ext4/extents.c           |   93 +++++++++++++++++++++++++++++++++++++++++++
 fs/iomap/direct-io.c        |   75 +++++++++++++++++++++++++++++++++++
 fs/open.c                   |    5 ++
 fs/xfs/xfs_bmap_util.c      |   22 ++++++++++
 fs/xfs/xfs_bmap_util.h      |    2 +
 fs/xfs/xfs_file.c           |   11 ++++-
 fs/xfs/xfs_trace.h          |    1 
 include/linux/dax.h         |    7 +++
 include/linux/falloc.h      |    1 
 include/linux/iomap.h       |    3 +
 include/trace/events/ext4.h |    7 +++
 include/uapi/linux/falloc.h |    9 ++++
 13 files changed, 325 insertions(+), 4 deletions(-)

Comments

Dan Williams Sept. 18, 2021, 6:05 p.m. UTC | #1
On Fri, Sep 17, 2021 at 6:30 PM Darrick J. Wong <djwong@kernel.org> wrote:
>
> Hi all,
>
> Jane Chu has taken an interest in trying to fix the pmem poison recovery
> story on Linux.  Since I sort of had a half-baked patchset that seems to
> contain some elements of what the reviewers of her patchset wanted, I'm
> releasing this reworked version to see if it has any traction.
>
> Our current "advice" to people using persistent memory and FSDAX who
> wish to recover upon receipt of a media error (aka 'hwpoison') event
> from ACPI is to punch-hole that part of the file and then pwrite it,
> which will magically cause the pmem to be reinitialized and the poison
> to be cleared.
>
> Punching doesn't make any sense at all -- the (re)allocation on pwrite
> does not permit the caller to specify where to find blocks, which means
> that we might not get the same pmem back.

Not sure this is a driving concern. If you get the same pmem back it
will have gone through a poison clearing cycle when it gets
reallocated. Also, once the filesystem gets notified of error
locations through Ruan's series the FS can avoid allocating blocks
where poison clearing failed.

> This pushes the user farther
> away from the goal of reinitializing poisoned memory and leads to
> complaints about unnecessary file fragmentation.

Fragmentation though is a valid concern.

>
> AFAICT, the only reason why the "punch and write" dance works at all is
> that XFS and ext4 currently call blkdev_issue_zeroout when
> allocating pmem ahead of a write call.  Even a regular overwrite won't
> clear the poison, because dax_direct_access is smart enough to bail out
> on poisoned pmem, but not smart enough to clear it.

Alignment constraints were the entanglement that kept DAX from poison
clearing. This is similar to the dance you need to do to get a disk to
remap a bad block, which needs an O_DIRECT write. It was also deemed
messy to keep overloading writes this way.

> To be fair, that
> function maps pages and has no idea what kinds of reads and writes the
> caller might want to perform.
>
> Therefore, clean up this whole mess by creating a dax_zeroinit_range
> function that callers can use on poisoned persistent memory to reset the
> contents of the persistent memory to a known state (all zeroes) and
> clear any poison state that might be lingering in the memory
> controllers.  Create a new fallocate mode to trigger this functionality,
> then wire up XFS and ext4 to use it.  For good measure, wire it up to
> traditional storage if the storage has a fast way to zero LBA contents,
> since we assume that those LBAs won't hit old media errors.

Sounds good, I'll take a look at the rest.

Darrick J. Wong Sept. 23, 2021, 12:51 a.m. UTC | #2
I withdraw the patchset due to spiralling complexity and the need to get
back to the other hojillion things.

Dave now suggests creating a RWF_CLEAR_POISON/IOCB_CLEAR_POISON flag to
clear poison ahead of userspace using a regular write to reset the
storage.  Honestly I only thought of zero writes because I thought that
would lead to the least amount of back and forth, which was clearly
wrong.

Jane, could you take over and try that?

--D

Dan Williams Sept. 23, 2021, 1:17 a.m. UTC | #3
On Wed, Sep 22, 2021 at 5:52 PM Darrick J. Wong <djwong@kernel.org> wrote:
>
> I withdraw the patchset due to spiralling complexity and the need to get
> back to the other hojillion things.
>
> Dave now suggests creating a RWF_CLEAR_POISON/IOCB_CLEAR_POISON flag to
> clear poison ahead of userspace using a regular write to reset the
> storage.  Honestly I only thought of zero writes because I thought that
> would lead to the least amount of back and forth, which was clearly
> wrong.
>

Sorry Darrick, you really tried to do right by the feedback.

> Jane, could you take over and try that?

Let's just solve the immediate problem for DAX in the easiest way
possible, which is to take Jane's original patches but use the existing
dax_zero_page_range() operation in the DAX pwrite() failure path. The
only change would be to make sure that the pwrite() is page aligned,
and punt on the ability to clear poison at sub-page granularity for
now.
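
To make the page-alignment constraint concrete, here is a minimal
userspace sketch of the arithmetic involved.  It is an assumption about
how an application might prepare such a write, not code from Jane's
patches or from the kernel: struct byte_range, page_align_range() and
is_page_aligned() are hypothetical names, and the page size is passed in
explicitly (a real caller would likely query it with
sysconf(_SC_PAGESIZE)).

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical type: a byte range within a file. */
struct byte_range {
	uint64_t start;
	uint64_t len;
};

/* Nonzero if both the offset and the length are multiples of page_size. */
static int is_page_aligned(uint64_t off, uint64_t len, uint64_t page_size)
{
	return ((off | len) & (page_size - 1)) == 0;
}

/*
 * Expand [off, off + len) outward to the enclosing page boundaries.
 * page_size must be a power of two; overflow of off + len is not handled.
 */
static struct byte_range page_align_range(uint64_t off, uint64_t len,
					  uint64_t page_size)
{
	uint64_t mask = page_size - 1;
	uint64_t start, end;

	assert(page_size != 0 && (page_size & mask) == 0);

	start = off & ~mask;                 /* round down */
	end = (off + len + mask) & ~mask;    /* round up */

	return (struct byte_range){ .start = start, .len = end - start };
}
```

Under this scheme a caller that wants to repair a few bytes would have
to rewrite every page the range touches, since sub-page clearing is
explicitly left out for now.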