
[1/5] dax: prepare pmem for use by zero-initializing contents and clearing poisons

Message ID 163192865031.417973.8372869475521627214.stgit@magnolia (mailing list archive)
State New, archived
Series vfs: enable userspace to reset damaged file storage

Commit Message

Darrick J. Wong Sept. 18, 2021, 1:30 a.m. UTC
From: Darrick J. Wong <djwong@kernel.org>

Our current "advice" to people using persistent memory and FSDAX who
wish to recover upon receipt of a media error (aka 'hwpoison') event
from ACPI is to punch-hole that part of the file and then pwrite it,
which will magically cause the pmem to be reinitialized and the poison
to be cleared.

Punching doesn't make any sense at all -- the (re)allocation on pwrite
does not permit the caller to specify where to find blocks, which means
that we might not get the same pmem back.  This pushes the user farther
away from the goal of reinitializing poisoned memory and leads to
complaints about unnecessary file fragmentation.

AFAICT, the only reason why the "punch and write" dance works at all is
that XFS and ext4 currently call blkdev_issue_zeroout when
allocating pmem ahead of a write call.  Even a regular overwrite won't
clear the poison, because dax_direct_access is smart enough to bail out
on poisoned pmem, but not smart enough to clear it.  To be fair, that
function maps pages and has no idea what kinds of reads and writes the
caller might want to perform.

Therefore, create a dax_zeroinit_range function that filesystems can use to
reset the pmem contents to zero and clear hardware media error flags.
This uses the dax page zeroing helper function, which should ensure that
subsequent accesses will not trip over any pre-existing media errors.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/dax.c            |   93 +++++++++++++++++++++++++++++++++++++++++++++++++++
 include/linux/dax.h |    7 ++++
 2 files changed, 100 insertions(+)
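
To illustrate the intended use: a filesystem could wire the new helper into
its zero-initialization path roughly as follows.  This is only a sketch;
xfs_file_zeroinit_range() is a hypothetical name, though xfs_read_iomap_ops
is the existing XFS read mapping table.

/*
 * Hypothetical caller sketch (not part of this series).
 * dax_zeroinit_range() walks the file's mappings via iomap and zeroes
 * each mapped extent with dax_zero_page_range(), which also clears any
 * hardware media error ("poison") state on pmem.
 */
static int
xfs_file_zeroinit_range(struct inode *inode, loff_t pos, u64 len)
{
	if (!IS_DAX(inode))
		return -EOPNOTSUPP;
	return dax_zeroinit_range(inode, pos, len, &xfs_read_iomap_ops);
}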

Comments

Ritesh Harjani Sept. 18, 2021, 4:54 p.m. UTC | #1
On 21/09/17 06:30PM, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
>
> Our current "advice" to people using persistent memory and FSDAX who
> wish to recover upon receipt of a media error (aka 'hwpoison') event
> from ACPI is to punch-hole that part of the file and then pwrite it,
> which will magically cause the pmem to be reinitialized and the poison
> to be cleared.
>
> Punching doesn't make any sense at all -- the (re)allocation on pwrite
> does not permit the caller to specify where to find blocks, which means
> that we might not get the same pmem back.  This pushes the user farther
> away from the goal of reinitializing poisoned memory and leads to
> complaints about unnecessary file fragmentation.
>
> AFAICT, the only reason why the "punch and write" dance works at all is
> that XFS and ext4 currently call blkdev_issue_zeroout when
> allocating pmem ahead of a write call.  Even a regular overwrite won't
> clear the poison, because dax_direct_access is smart enough to bail out
> on poisoned pmem, but not smart enough to clear it.  To be fair, that
> function maps pages and has no idea what kinds of reads and writes the
> caller might want to perform.
>
> Therefore, create a dax_zeroinit_range function that filesystems can use to
> reset the pmem contents to zero and clear hardware media error flags.
> This uses the dax page zeroing helper function, which should ensure that
> subsequent accesses will not trip over any pre-existing media errors.

Thanks Darrick for such a clear explanation of the problem and your solution.
As I see from this thread [1], it looks like we are heading in this direction,
so I thought, why not review this RFC patch series :)

[1]: https://lore.kernel.org/all/CAPcyv4iAr_Vwwgqw+4wz0RQUXhUUJGGz7_T+p+W6tC4T+k+zNw@mail.gmail.com/

>
> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> ---
>  fs/dax.c            |   93 +++++++++++++++++++++++++++++++++++++++++++++++++++
>  include/linux/dax.h |    7 ++++
>  2 files changed, 100 insertions(+)
>
>
> diff --git a/fs/dax.c b/fs/dax.c
> index 4e3e5a283a91..765b80d08605 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -1714,3 +1714,96 @@ vm_fault_t dax_finish_sync_fault(struct vm_fault *vmf,
>  	return dax_insert_pfn_mkwrite(vmf, pfn, order);
>  }
>  EXPORT_SYMBOL_GPL(dax_finish_sync_fault);
> +
> +static loff_t
> +dax_zeroinit_iter(struct iomap_iter *iter)
> +{
> +	struct iomap *iomap = &iter->iomap;
> +	const struct iomap *srcmap = iomap_iter_srcmap(iter);
> +	const u64 start = iomap->addr + iter->pos - iomap->offset;
> +	const u64 nr_bytes = iomap_length(iter);
> +	u64 start_page = start >> PAGE_SHIFT;
> +	u64 nr_pages = nr_bytes >> PAGE_SHIFT;
> +	int ret;
> +
> +	if (!iomap->dax_dev)
> +		return -ECANCELED;
> +
> +	/*
> +	 * The physical extent must be page aligned because that's what the dax
> +	 * function requires.
> +	 */
> +	if (!PAGE_ALIGNED(start | nr_bytes))
> +		return -ECANCELED;
> +
> +	/*
> +	 * The dax function, by using pgoff_t, is stuck with unsigned long, so
> +	 * we must check for overflows.
> +	 */
> +	if (start_page >= ULONG_MAX || start_page + nr_pages > ULONG_MAX)
> +		return -ECANCELED;
> +
> +	/* Must be able to zero storage directly without fs intervention. */
> +	if (iomap->flags & IOMAP_F_SHARED)
> +		return -ECANCELED;
> +	if (srcmap != iomap)
> +		return -ECANCELED;
> +
> +	switch (iomap->type) {
> +	case IOMAP_MAPPED:
> +		while (nr_pages > 0) {
> +			/* XXX function only supports one page at a time?! */
> +			ret = dax_zero_page_range(iomap->dax_dev, start_page,
> +					1);
> +			if (ret)
> +				return ret;
> +			start_page++;
> +			nr_pages--;
> +		}
> +
> +		fallthrough;
> +	case IOMAP_UNWRITTEN:
> +		return nr_bytes;
> +	}
> +
> +	/* Reject holes, inline data, or delalloc extents. */
> +	return -ECANCELED;

We reject holes here, but the other vfs plumbing patch [2] mentions
"Holes and unwritten extents are left untouched."
Shouldn't we just return nr_bytes for the IOMAP_HOLE case as well?

[2]: "vfs: add a zero-initialization mode to fallocate"

Although I am not an expert in this area, the rest of the patch looks
very well crafted to me. Thanks again for such details :)

-ritesh

>
> +}
> +
> +/*
> + * Initialize storage mapped to a DAX-mode file to a known value and ensure the
> + * media are ready to accept read and write commands.  This requires the use of
> + * the dax layer's zero page range function to write zeroes to a pmem region
> + * and to reset any hardware media error state.
> + *
> + * The physical extents must be aligned to page size.  The file must be backed
> + * by a pmem device.  The extents returned must not require copy on write (or
> + * any other mapping interventions from the filesystem) and must be contiguous.
> + * @done will be set to true if the reset succeeded.
> + *
> + * Returns 0 if the zero initialization succeeded, -ECANCELED if the storage
> + * mappings do not support zero initialization, -EOPNOTSUPP if the device does
> + * not support it, or the usual negative errno.
> + */
> +int
> +dax_zeroinit_range(struct inode *inode, loff_t pos, u64 len,
> +		   const struct iomap_ops *ops)
> +{
> +	struct iomap_iter iter = {
> +		.inode		= inode,
> +		.pos		= pos,
> +		.len		= len,
> +		.flags		= IOMAP_REPORT,
> +	};
> +	int ret;
> +
> +	if (!IS_DAX(inode))
> +		return -EINVAL;
> +	if (pos + len > i_size_read(inode))
> +		return -EINVAL;
> +
> +	while ((ret = iomap_iter(&iter, ops)) > 0)
> +		iter.processed = dax_zeroinit_iter(&iter);
> +	return ret;
> +}
> +EXPORT_SYMBOL_GPL(dax_zeroinit_range);
> diff --git a/include/linux/dax.h b/include/linux/dax.h
> index 2619d94c308d..3c873f7c35ba 100644
> --- a/include/linux/dax.h
> +++ b/include/linux/dax.h
> @@ -129,6 +129,8 @@ struct page *dax_layout_busy_page(struct address_space *mapping);
>  struct page *dax_layout_busy_page_range(struct address_space *mapping, loff_t start, loff_t end);
>  dax_entry_t dax_lock_page(struct page *page);
>  void dax_unlock_page(struct page *page, dax_entry_t cookie);
> +int dax_zeroinit_range(struct inode *inode, loff_t pos, u64 len,
> +			const struct iomap_ops *ops);
>  #else
>  #define generic_fsdax_supported		NULL
>
> @@ -174,6 +176,11 @@ static inline dax_entry_t dax_lock_page(struct page *page)
>  static inline void dax_unlock_page(struct page *page, dax_entry_t cookie)
>  {
>  }
> +static inline int dax_zeroinit_range(struct inode *inode, loff_t pos, u64 len,
> +		const struct iomap_ops *ops)
> +{
> +	return -EOPNOTSUPP;
> +}
>  #endif
>
>  #if IS_ENABLED(CONFIG_DAX)
>
Darrick J. Wong Sept. 20, 2021, 5:22 p.m. UTC | #2
On Sat, Sep 18, 2021 at 10:24:08PM +0530, riteshh wrote:
> On 21/09/17 06:30PM, Darrick J. Wong wrote:
> > From: Darrick J. Wong <djwong@kernel.org>
> >
> > Our current "advice" to people using persistent memory and FSDAX who
> > wish to recover upon receipt of a media error (aka 'hwpoison') event
> > from ACPI is to punch-hole that part of the file and then pwrite it,
> > which will magically cause the pmem to be reinitialized and the poison
> > to be cleared.
> >
> > Punching doesn't make any sense at all -- the (re)allocation on pwrite
> > does not permit the caller to specify where to find blocks, which means
> > that we might not get the same pmem back.  This pushes the user farther
> > away from the goal of reinitializing poisoned memory and leads to
> > complaints about unnecessary file fragmentation.
> >
> > AFAICT, the only reason why the "punch and write" dance works at all is
> > that XFS and ext4 currently call blkdev_issue_zeroout when
> > allocating pmem ahead of a write call.  Even a regular overwrite won't
> > clear the poison, because dax_direct_access is smart enough to bail out
> > on poisoned pmem, but not smart enough to clear it.  To be fair, that
> > function maps pages and has no idea what kinds of reads and writes the
> > caller might want to perform.
> >
> > Therefore, create a dax_zeroinit_range function that filesystems can use to
> > reset the pmem contents to zero and clear hardware media error flags.
> > This uses the dax page zeroing helper function, which should ensure that
> > subsequent accesses will not trip over any pre-existing media errors.
> 
> Thanks Darrick for such a clear explanation of the problem and your solution.
> As I see from this thread [1], it looks like we are heading in this direction,
> so I thought, why not review this RFC patch series :)
> 
> [1]: https://lore.kernel.org/all/CAPcyv4iAr_Vwwgqw+4wz0RQUXhUUJGGz7_T+p+W6tC4T+k+zNw@mail.gmail.com/
> 
> >
> > Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> > ---
> >  fs/dax.c            |   93 +++++++++++++++++++++++++++++++++++++++++++++++++++
> >  include/linux/dax.h |    7 ++++
> >  2 files changed, 100 insertions(+)
> >
> >
> > diff --git a/fs/dax.c b/fs/dax.c
> > index 4e3e5a283a91..765b80d08605 100644
> > --- a/fs/dax.c
> > +++ b/fs/dax.c
> > @@ -1714,3 +1714,96 @@ vm_fault_t dax_finish_sync_fault(struct vm_fault *vmf,
> >  	return dax_insert_pfn_mkwrite(vmf, pfn, order);
> >  }
> >  EXPORT_SYMBOL_GPL(dax_finish_sync_fault);
> > +
> > +static loff_t
> > +dax_zeroinit_iter(struct iomap_iter *iter)
> > +{
> > +	struct iomap *iomap = &iter->iomap;
> > +	const struct iomap *srcmap = iomap_iter_srcmap(iter);
> > +	const u64 start = iomap->addr + iter->pos - iomap->offset;
> > +	const u64 nr_bytes = iomap_length(iter);
> > +	u64 start_page = start >> PAGE_SHIFT;
> > +	u64 nr_pages = nr_bytes >> PAGE_SHIFT;
> > +	int ret;
> > +
> > +	if (!iomap->dax_dev)
> > +		return -ECANCELED;
> > +
> > +	/*
> > +	 * The physical extent must be page aligned because that's what the dax
> > +	 * function requires.
> > +	 */
> > +	if (!PAGE_ALIGNED(start | nr_bytes))
> > +		return -ECANCELED;
> > +
> > +	/*
> > +	 * The dax function, by using pgoff_t, is stuck with unsigned long, so
> > +	 * we must check for overflows.
> > +	 */
> > +	if (start_page >= ULONG_MAX || start_page + nr_pages > ULONG_MAX)
> > +		return -ECANCELED;
> > +
> > +	/* Must be able to zero storage directly without fs intervention. */
> > +	if (iomap->flags & IOMAP_F_SHARED)
> > +		return -ECANCELED;
> > +	if (srcmap != iomap)
> > +		return -ECANCELED;
> > +
> > +	switch (iomap->type) {
> > +	case IOMAP_MAPPED:
> > +		while (nr_pages > 0) {
> > +			/* XXX function only supports one page at a time?! */
> > +			ret = dax_zero_page_range(iomap->dax_dev, start_page,
> > +					1);
> > +			if (ret)
> > +				return ret;
> > +			start_page++;
> > +			nr_pages--;
> > +		}
> > +
> > +		fallthrough;
> > +	case IOMAP_UNWRITTEN:
> > +		return nr_bytes;
> > +	}
> > +
> > +	/* Reject holes, inline data, or delalloc extents. */
> > +	return -ECANCELED;
> 
> We reject holes here, but the other vfs plumbing patch [2] mentions
> "Holes and unwritten extents are left untouched."
> Shouldn't we just return nr_bytes for the IOMAP_HOLE case as well?

I'm not entirely sure what we should do for holes and unwritten extents,
as you can tell from the gross inconsistency between the comment and the
code. :/

On block devices, I think we rely on the behavior that writing to disk
will clear the device's error state (via LBA remapping or some other
strategy).  I think this means iomap_zeroinit can skip unwritten extents
because reads and read faults will be satisfied from the zero page and
writeback (or direct writes) will trigger the drive firmware.

On FSDAX devices, reads are fulfilled by zeroing the user buffer, and
read faults with the (dax) zero page.  Writes and write faults won't
clear the poison (unlike disks).  So I guess this means that
dax_zeroinit *does* have to act on unwritten areas.

Ok.  I'll make those changes.
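
Concretely, something like this for the switch (an illustrative sketch of
the change described above, not final code):

	switch (iomap->type) {
	case IOMAP_MAPPED:
	case IOMAP_UNWRITTEN:
		/* Zero both mapped and unwritten extents on FSDAX,
		 * since writes alone will not clear the poison. */
		while (nr_pages > 0) {
			ret = dax_zero_page_range(iomap->dax_dev,
					start_page, 1);
			if (ret)
				return ret;
			start_page++;
			nr_pages--;
		}
		return nr_bytes;
	}

	/* Reject holes, inline data, or delalloc extents. */
	return -ECANCELED;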

As for holes -- on the one hand, one could argue that zero-initializing
a hole makes no sense and should be an error.  OTOH one could make an
equally compelling argument that it's merely a nop.  Thoughts?

--D

> [2]: "vfs: add a zero-initialization mode to fallocate"
> 
> Although I am not an expert in this area, the rest of the patch looks
> very well crafted to me. Thanks again for such details :)
> 
> -ritesh
> 
> >
> > +}
> > +
> > +/*
> > + * Initialize storage mapped to a DAX-mode file to a known value and ensure the
> > + * media are ready to accept read and write commands.  This requires the use of
> > + * the dax layer's zero page range function to write zeroes to a pmem region
> > + * and to reset any hardware media error state.
> > + *
> > + * The physical extents must be aligned to page size.  The file must be backed
> > + * by a pmem device.  The extents returned must not require copy on write (or
> > + * any other mapping interventions from the filesystem) and must be contiguous.
> > + * @done will be set to true if the reset succeeded.
> > + *
> > + * Returns 0 if the zero initialization succeeded, -ECANCELED if the storage
> > + * mappings do not support zero initialization, -EOPNOTSUPP if the device does
> > + * not support it, or the usual negative errno.
> > + */
> > +int
> > +dax_zeroinit_range(struct inode *inode, loff_t pos, u64 len,
> > +		   const struct iomap_ops *ops)
> > +{
> > +	struct iomap_iter iter = {
> > +		.inode		= inode,
> > +		.pos		= pos,
> > +		.len		= len,
> > +		.flags		= IOMAP_REPORT,
> > +	};
> > +	int ret;
> > +
> > +	if (!IS_DAX(inode))
> > +		return -EINVAL;
> > +	if (pos + len > i_size_read(inode))
> > +		return -EINVAL;
> > +
> > +	while ((ret = iomap_iter(&iter, ops)) > 0)
> > +		iter.processed = dax_zeroinit_iter(&iter);
> > +	return ret;
> > +}
> > +EXPORT_SYMBOL_GPL(dax_zeroinit_range);
> > diff --git a/include/linux/dax.h b/include/linux/dax.h
> > index 2619d94c308d..3c873f7c35ba 100644
> > --- a/include/linux/dax.h
> > +++ b/include/linux/dax.h
> > @@ -129,6 +129,8 @@ struct page *dax_layout_busy_page(struct address_space *mapping);
> >  struct page *dax_layout_busy_page_range(struct address_space *mapping, loff_t start, loff_t end);
> >  dax_entry_t dax_lock_page(struct page *page);
> >  void dax_unlock_page(struct page *page, dax_entry_t cookie);
> > +int dax_zeroinit_range(struct inode *inode, loff_t pos, u64 len,
> > +			const struct iomap_ops *ops);
> >  #else
> >  #define generic_fsdax_supported		NULL
> >
> > @@ -174,6 +176,11 @@ static inline dax_entry_t dax_lock_page(struct page *page)
> >  static inline void dax_unlock_page(struct page *page, dax_entry_t cookie)
> >  {
> >  }
> > +static inline int dax_zeroinit_range(struct inode *inode, loff_t pos, u64 len,
> > +		const struct iomap_ops *ops)
> > +{
> > +	return -EOPNOTSUPP;
> > +}
> >  #endif
> >
> >  #if IS_ENABLED(CONFIG_DAX)
> >
Ritesh Harjani Sept. 21, 2021, 4:07 a.m. UTC | #3
On 21/09/20 10:22AM, Darrick J. Wong wrote:
> On Sat, Sep 18, 2021 at 10:24:08PM +0530, riteshh wrote:
> > On 21/09/17 06:30PM, Darrick J. Wong wrote:
> > > From: Darrick J. Wong <djwong@kernel.org>
> > >
> > > Our current "advice" to people using persistent memory and FSDAX who
> > > wish to recover upon receipt of a media error (aka 'hwpoison') event
> > > from ACPI is to punch-hole that part of the file and then pwrite it,
> > > which will magically cause the pmem to be reinitialized and the poison
> > > to be cleared.
> > >
> > > Punching doesn't make any sense at all -- the (re)allocation on pwrite
> > > does not permit the caller to specify where to find blocks, which means
> > > that we might not get the same pmem back.  This pushes the user farther
> > > away from the goal of reinitializing poisoned memory and leads to
> > > complaints about unnecessary file fragmentation.
> > >
> > > AFAICT, the only reason why the "punch and write" dance works at all is
> > > that XFS and ext4 currently call blkdev_issue_zeroout when
> > > allocating pmem ahead of a write call.  Even a regular overwrite won't
> > > clear the poison, because dax_direct_access is smart enough to bail out
> > > on poisoned pmem, but not smart enough to clear it.  To be fair, that
> > > function maps pages and has no idea what kinds of reads and writes the
> > > caller might want to perform.
> > >
> > > Therefore, create a dax_zeroinit_range function that filesystems can use to
> > > reset the pmem contents to zero and clear hardware media error flags.
> > > This uses the dax page zeroing helper function, which should ensure that
> > > subsequent accesses will not trip over any pre-existing media errors.
> >
> > Thanks Darrick for such a clear explanation of the problem and your solution.
> > As I see from this thread [1], it looks like we are heading in this direction,
> > so I thought, why not review this RFC patch series :)
> >
> > [1]: https://lore.kernel.org/all/CAPcyv4iAr_Vwwgqw+4wz0RQUXhUUJGGz7_T+p+W6tC4T+k+zNw@mail.gmail.com/
> >
> > >
> > > Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> > > ---
> > >  fs/dax.c            |   93 +++++++++++++++++++++++++++++++++++++++++++++++++++
> > >  include/linux/dax.h |    7 ++++
> > >  2 files changed, 100 insertions(+)
> > >
> > >
> > > diff --git a/fs/dax.c b/fs/dax.c
> > > index 4e3e5a283a91..765b80d08605 100644
> > > --- a/fs/dax.c
> > > +++ b/fs/dax.c
> > > @@ -1714,3 +1714,96 @@ vm_fault_t dax_finish_sync_fault(struct vm_fault *vmf,
> > >  	return dax_insert_pfn_mkwrite(vmf, pfn, order);
> > >  }
> > >  EXPORT_SYMBOL_GPL(dax_finish_sync_fault);
> > > +
> > > +static loff_t
> > > +dax_zeroinit_iter(struct iomap_iter *iter)
> > > +{
> > > +	struct iomap *iomap = &iter->iomap;
> > > +	const struct iomap *srcmap = iomap_iter_srcmap(iter);
> > > +	const u64 start = iomap->addr + iter->pos - iomap->offset;
> > > +	const u64 nr_bytes = iomap_length(iter);
> > > +	u64 start_page = start >> PAGE_SHIFT;
> > > +	u64 nr_pages = nr_bytes >> PAGE_SHIFT;
> > > +	int ret;
> > > +
> > > +	if (!iomap->dax_dev)
> > > +		return -ECANCELED;
> > > +
> > > +	/*
> > > +	 * The physical extent must be page aligned because that's what the dax
> > > +	 * function requires.
> > > +	 */
> > > +	if (!PAGE_ALIGNED(start | nr_bytes))
> > > +		return -ECANCELED;
> > > +
> > > +	/*
> > > +	 * The dax function, by using pgoff_t, is stuck with unsigned long, so
> > > +	 * we must check for overflows.
> > > +	 */
> > > +	if (start_page >= ULONG_MAX || start_page + nr_pages > ULONG_MAX)
> > > +		return -ECANCELED;
> > > +
> > > +	/* Must be able to zero storage directly without fs intervention. */
> > > +	if (iomap->flags & IOMAP_F_SHARED)
> > > +		return -ECANCELED;
> > > +	if (srcmap != iomap)
> > > +		return -ECANCELED;
> > > +
> > > +	switch (iomap->type) {
> > > +	case IOMAP_MAPPED:
> > > +		while (nr_pages > 0) {
> > > +			/* XXX function only supports one page at a time?! */
> > > +			ret = dax_zero_page_range(iomap->dax_dev, start_page,
> > > +					1);
> > > +			if (ret)
> > > +				return ret;
> > > +			start_page++;
> > > +			nr_pages--;
> > > +		}
> > > +
> > > +		fallthrough;
> > > +	case IOMAP_UNWRITTEN:
> > > +		return nr_bytes;
> > > +	}
> > > +
> > > +	/* Reject holes, inline data, or delalloc extents. */
> > > +	return -ECANCELED;
> >
> > We reject holes here, but the other vfs plumbing patch [2] mentions
> > "Holes and unwritten extents are left untouched."
> > Shouldn't we just return nr_bytes for the IOMAP_HOLE case as well?
>
> I'm not entirely sure what we should do for holes and unwritten extents,
> as you can tell from the gross inconsistency between the comment and the
> code. :/
>
> On block devices, I think we rely on the behavior that writing to disk
> will clear the device's error state (via LBA remapping or some other
> strategy).  I think this means iomap_zeroinit can skip unwritten extents
> because reads and read faults will be satisfied from the zero page and
> writeback (or direct writes) will trigger the drive firmware.
>
> On FSDAX devices, reads are fulfilled by zeroing the user buffer, and
> read faults with the (dax) zero page.  Writes and write faults won't
> clear the poison (unlike disks).  So I guess this means that
> dax_zeroinit *does* have to act on unwritten areas.
>
> Ok.  I'll make those changes.

Yes, I guess unwritten extents still have extent blocks allocated, generally
with a bit set to mark them as unwritten. So there could still be a need to
clear the poison for these in the DAX case.

>
> As for holes -- on the one hand, one could argue that zero-initializing
> a hole makes no sense and should be an error.  OTOH one could make an
> equally compelling argument that it's merely a nop.  Thoughts?

So in the case of holes, consider this scenario (please correct me if any of
my understanding below is wrong/incomplete).
If we have a large hole and someone tries to write to that area:
1. Then, from what I understood from the code, the FS will try to allocate
   some disk blocks (could these blocks be marked with HWpoison, since the FS
   has no way of knowing?).
2. If yes, then after allocating those blocks dax_direct_access will fail (as
   you had mentioned above). But it won't clear the HWpoison.
3. Then the user will again have to clear it using this API. But that is only
   for a given extent, which is some part of the hole which the FS allocated.
Now the above could be repeated until the entire hole range is covered.
Is the above understanding correct?

If yes, then maybe it all depends on what sort of guarantee the API is
providing. If using the API on the given range guarantees that the entire
file range will not have any blocks with HWpoison, then I guess we may have
to cover the IOMAP_HOLE case as well?
If not, then maybe we could explicitly mention this in the API documentation.

Please help correct any of the above that does not make sense. It will help
me understand this use case better.

-ritesh

>
> --D
>
> > [2]: "vfs: add a zero-initialization mode to fallocate"
> >
> > Although I am not an expert in this area, the rest of the patch looks
> > very well crafted to me. Thanks again for such details :)
> >
> > -ritesh
> >
> > >
> > > +}
> > > +
> > > +/*
> > > + * Initialize storage mapped to a DAX-mode file to a known value and ensure the
> > > + * media are ready to accept read and write commands.  This requires the use of
> > > + * the dax layer's zero page range function to write zeroes to a pmem region
> > > + * and to reset any hardware media error state.
> > > + *
> > > + * The physical extents must be aligned to page size.  The file must be backed
> > > + * by a pmem device.  The extents returned must not require copy on write (or
> > > + * any other mapping interventions from the filesystem) and must be contiguous.
> > > + * @done will be set to true if the reset succeeded.
> > > + *
> > > + * Returns 0 if the zero initialization succeeded, -ECANCELED if the storage
> > > + * mappings do not support zero initialization, -EOPNOTSUPP if the device does
> > > + * not support it, or the usual negative errno.
> > > + */
> > > +int
> > > +dax_zeroinit_range(struct inode *inode, loff_t pos, u64 len,
> > > +		   const struct iomap_ops *ops)
> > > +{
> > > +	struct iomap_iter iter = {
> > > +		.inode		= inode,
> > > +		.pos		= pos,
> > > +		.len		= len,
> > > +		.flags		= IOMAP_REPORT,
> > > +	};
> > > +	int ret;
> > > +
> > > +	if (!IS_DAX(inode))
> > > +		return -EINVAL;
> > > +	if (pos + len > i_size_read(inode))
> > > +		return -EINVAL;
> > > +
> > > +	while ((ret = iomap_iter(&iter, ops)) > 0)
> > > +		iter.processed = dax_zeroinit_iter(&iter);
> > > +	return ret;
> > > +}
> > > +EXPORT_SYMBOL_GPL(dax_zeroinit_range);
> > > diff --git a/include/linux/dax.h b/include/linux/dax.h
> > > index 2619d94c308d..3c873f7c35ba 100644
> > > --- a/include/linux/dax.h
> > > +++ b/include/linux/dax.h
> > > @@ -129,6 +129,8 @@ struct page *dax_layout_busy_page(struct address_space *mapping);
> > >  struct page *dax_layout_busy_page_range(struct address_space *mapping, loff_t start, loff_t end);
> > >  dax_entry_t dax_lock_page(struct page *page);
> > >  void dax_unlock_page(struct page *page, dax_entry_t cookie);
> > > +int dax_zeroinit_range(struct inode *inode, loff_t pos, u64 len,
> > > +			const struct iomap_ops *ops);
> > >  #else
> > >  #define generic_fsdax_supported		NULL
> > >
> > > @@ -174,6 +176,11 @@ static inline dax_entry_t dax_lock_page(struct page *page)
> > >  static inline void dax_unlock_page(struct page *page, dax_entry_t cookie)
> > >  {
> > >  }
> > > +static inline int dax_zeroinit_range(struct inode *inode, loff_t pos, u64 len,
> > > +		const struct iomap_ops *ops)
> > > +{
> > > +	return -EOPNOTSUPP;
> > > +}
> > >  #endif
> > >
> > >  #if IS_ENABLED(CONFIG_DAX)
> > >
Christoph Hellwig Sept. 21, 2021, 8:34 a.m. UTC | #4
On Fri, Sep 17, 2021 at 06:30:50PM -0700, Darrick J. Wong wrote:
> +	case IOMAP_MAPPED:
> +		while (nr_pages > 0) {
> +			/* XXX function only supports one page at a time?! */
> +			ret = dax_zero_page_range(iomap->dax_dev, start_page,
> +					1);

Yes.  Given that it will have to kmap every page, that kinda makes sense.
Unlike a nr_pages argument which needs to be 1, which is just silly.
That being said, it would make sense to move the trivial loop over the
pages into the methods to reduce the indirect function call overhead.
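
With the loop pushed down into the ->zero_page_range implementations, the
caller would collapse to a single call per extent, roughly (a sketch):

	case IOMAP_MAPPED:
		/* One indirect call per extent; the dax driver's
		 * ->zero_page_range method loops over pages itself. */
		ret = dax_zero_page_range(iomap->dax_dev, start_page,
				nr_pages);
		if (ret)
			return ret;
		fallthrough;
	case IOMAP_UNWRITTEN:
		return nr_bytes;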
Darrick J. Wong Sept. 22, 2021, 6:10 p.m. UTC | #5
On Tue, Sep 21, 2021 at 09:34:55AM +0100, Christoph Hellwig wrote:
> On Fri, Sep 17, 2021 at 06:30:50PM -0700, Darrick J. Wong wrote:
> > +	case IOMAP_MAPPED:
> > +		while (nr_pages > 0) {
> > +			/* XXX function only supports one page at a time?! */
> > +			ret = dax_zero_page_range(iomap->dax_dev, start_page,
> > +					1);
> 
> Yes.  Given that it will have to kmap every page, that kinda makes sense.
> Unlike a nr_pages argument which needs to be 1, which is just silly.
> That being said, it would make sense to move the trivial loop over the
> pages into the methods to reduce the indirect function call overhead.

Done.  AFAICT all the implementations *except* nvdimm can handle more
than one page.

--D
Darrick J. Wong Sept. 22, 2021, 6:26 p.m. UTC | #6
On Tue, Sep 21, 2021 at 09:37:08AM +0530, riteshh wrote:
> On 21/09/20 10:22AM, Darrick J. Wong wrote:
> > On Sat, Sep 18, 2021 at 10:24:08PM +0530, riteshh wrote:
> > > On 21/09/17 06:30PM, Darrick J. Wong wrote:
> > > > From: Darrick J. Wong <djwong@kernel.org>
> > > >
> > > > Our current "advice" to people using persistent memory and FSDAX who
> > > > wish to recover upon receipt of a media error (aka 'hwpoison') event
> > > > from ACPI is to punch-hole that part of the file and then pwrite it,
> > > > which will magically cause the pmem to be reinitialized and the poison
> > > > to be cleared.
> > > >
> > > > Punching doesn't make any sense at all -- the (re)allocation on pwrite
> > > > does not permit the caller to specify where to find blocks, which means
> > > > that we might not get the same pmem back.  This pushes the user farther
> > > > away from the goal of reinitializing poisoned memory and leads to
> > > > complaints about unnecessary file fragmentation.
> > > >
> > > > AFAICT, the only reason why the "punch and write" dance works at all is
> > > > that XFS and ext4 currently call blkdev_issue_zeroout when
> > > > allocating pmem ahead of a write call.  Even a regular overwrite won't
> > > > clear the poison, because dax_direct_access is smart enough to bail out
> > > > on poisoned pmem, but not smart enough to clear it.  To be fair, that
> > > > function maps pages and has no idea what kinds of reads and writes the
> > > > caller might want to perform.
> > > >
> > > > Therefore, create a dax_zeroinit_range function that filesystems can use to
> > > > reset the pmem contents to zero and clear hardware media error flags.
> > > > This uses the dax page zeroing helper function, which should ensure that
> > > > subsequent accesses will not trip over any pre-existing media errors.
> > >
> > > Thanks Darrick for such a clear explanation of the problem and your solution.
> > > As I see from this thread [1], it looks like we are heading in this direction,
> > > so I thought, why not review this RFC patch series :)
> > >
> > > [1]: https://lore.kernel.org/all/CAPcyv4iAr_Vwwgqw+4wz0RQUXhUUJGGz7_T+p+W6tC4T+k+zNw@mail.gmail.com/
> > >
> > > >
> > > > Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> > > > ---
> > > >  fs/dax.c            |   93 +++++++++++++++++++++++++++++++++++++++++++++++++++
> > > >  include/linux/dax.h |    7 ++++
> > > >  2 files changed, 100 insertions(+)
> > > >
> > > >
> > > > diff --git a/fs/dax.c b/fs/dax.c
> > > > index 4e3e5a283a91..765b80d08605 100644
> > > > --- a/fs/dax.c
> > > > +++ b/fs/dax.c
> > > > @@ -1714,3 +1714,96 @@ vm_fault_t dax_finish_sync_fault(struct vm_fault *vmf,
> > > >  	return dax_insert_pfn_mkwrite(vmf, pfn, order);
> > > >  }
> > > >  EXPORT_SYMBOL_GPL(dax_finish_sync_fault);
> > > > +
> > > > +static loff_t
> > > > +dax_zeroinit_iter(struct iomap_iter *iter)
> > > > +{
> > > > +	struct iomap *iomap = &iter->iomap;
> > > > +	const struct iomap *srcmap = iomap_iter_srcmap(iter);
> > > > +	const u64 start = iomap->addr + iter->pos - iomap->offset;
> > > > +	const u64 nr_bytes = iomap_length(iter);
> > > > +	u64 start_page = start >> PAGE_SHIFT;
> > > > +	u64 nr_pages = nr_bytes >> PAGE_SHIFT;
> > > > +	int ret;
> > > > +
> > > > +	if (!iomap->dax_dev)
> > > > +		return -ECANCELED;
> > > > +
> > > > +	/*
> > > > +	 * The physical extent must be page aligned because that's what the dax
> > > > +	 * function requires.
> > > > +	 */
> > > > +	if (!PAGE_ALIGNED(start | nr_bytes))
> > > > +		return -ECANCELED;
> > > > +
> > > > +	/*
> > > > +	 * The dax function, by using pgoff_t, is stuck with unsigned long, so
> > > > +	 * we must check for overflows.
> > > > +	 */
> > > > +	if (start_page >= ULONG_MAX || start_page + nr_pages > ULONG_MAX)
> > > > +		return -ECANCELED;
> > > > +
> > > > +	/* Must be able to zero storage directly without fs intervention. */
> > > > +	if (iomap->flags & IOMAP_F_SHARED)
> > > > +		return -ECANCELED;
> > > > +	if (srcmap != iomap)
> > > > +		return -ECANCELED;
> > > > +
> > > > +	switch (iomap->type) {
> > > > +	case IOMAP_MAPPED:
> > > > +		while (nr_pages > 0) {
> > > > +			/* XXX function only supports one page at a time?! */
> > > > +			ret = dax_zero_page_range(iomap->dax_dev, start_page,
> > > > +					1);
> > > > +			if (ret)
> > > > +				return ret;
> > > > +			start_page++;
> > > > +			nr_pages--;
> > > > +		}
> > > > +
> > > > +		fallthrough;
> > > > +	case IOMAP_UNWRITTEN:
> > > > +		return nr_bytes;
> > > > +	}
> > > > +
> > > > +	/* Reject holes, inline data, or delalloc extents. */
> > > > +	return -ECANCELED;
> > >
> > > We reject holes here, but the other vfs plumbing patch [2] mentions
> > > "Holes and unwritten extents are left untouched."
> > > Shouldn't we just return nr_bytes for the IOMAP_HOLE case as well?
> >
> > I'm not entirely sure what we should do for holes and unwritten extents,
> > as you can tell from the gross inconsistency between the comment and the
> > code. :/
> >
> > On block devices, I think we rely on the behavior that writing to disk
> > will clear the device's error state (via LBA remapping or some other
> > strategy).  I think this means iomap_zeroinit can skip unwritten extents
> > because reads and read faults will be satisfied from the zero page and
> > writeback (or direct writes) will trigger the drive firmware.
> >
> > On FSDAX devices, reads are fulfilled by zeroing the user buffer, and
> > read faults with the (dax) zero page.  Writes and write faults won't
> > clear the poison (unlike disks).  So I guess this means that
> > dax_zeroinit *does* have to act on unwritten areas.

I was confused when I wrote this -- before writing, dax filesystems are
required to allocate written zeroed extents and/or zero unwritten
extents and mark them written.  So we don't actually need to zero
unwritten extents.

> >
> > Ok.  I'll make those changes.
> 
> > Yes, I guess unwritten extents still have extent blocks allocated, generally
> > with a bit set to mark them as unwritten. So there could still be a need to
> > clear the poison for these in the DAX case.
> 
> >
> > As for holes -- on the one hand, one could argue that zero-initializing
> > a hole makes no sense and should be an error.  OTOH one could make an
> > equally compelling argument that it's merely a nop.  Thoughts?
> 
> So in the case of holes, consider this scenario (please correct me if any of
> my understanding below is wrong/incomplete).
> If we have a large hole and someone tries to write to that area:
> 1. Then, from what I understood from the code, the FS will try to allocate
>    some disk blocks (could these blocks be marked with HWpoison, since the FS
>    has no way of knowing?).

Freshly allocated extents are zeroed via blkdev_issue_zeroout before
being mapped into the file, which will clear the poison.  That last bit
is the only reason why the punch-and-rewrite dance ever worked at all.

We'll have to make sure this poison clearing still happens even after
the dax/block layer divorce.
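
For reference, that allocation-time zeroing is roughly this path in XFS
today (paraphrased from fs/xfs/xfs_bmap_util.c; details vary by kernel
version):

int
xfs_zero_extent(
	struct xfs_inode	*ip,
	xfs_fsblock_t		start_fsb,
	xfs_off_t		count_fsb)
{
	struct xfs_mount	*mp = ip->i_mount;
	struct xfs_buftarg	*target = xfs_inode_buftarg(ip);
	xfs_daddr_t		sector = xfs_fsb_to_db(ip, start_fsb);
	sector_t		block = XFS_BB_TO_FSBT(mp, sector);

	/* Zeroing the fresh blocks is what clears any old poison. */
	return blkdev_issue_zeroout(target->bt_bdev,
			block << (mp->m_super->s_blocksize_bits - 9),
			count_fsb << (mp->m_super->s_blocksize_bits - 9),
			GFP_NOFS, 0);
}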

> 2. If yes, then after allocating those blocks dax_direct_access will fail (as
>    you had mentioned above). But it won't clear the HWpoison.
> 3. Then the user will again have to clear it using this API. But that is only
>    for a given extent, which is some part of the hole which the FS allocated.
> Now the above could be repeated until the entire hole range is covered.
> Is the above understanding correct?
> 
> If yes, then maybe it all depends on what sort of guarantee the API is
> providing. If using the API on the given range guarantees that the entire
> file range will not have any blocks with HWpoison, then I guess we may have
> to cover the IOMAP_HOLE case as well?
> If not, then maybe we could explicitly mention this in the API documentation.

Ok.  The short version is that zeroinit will ensure that subsequent
reads, writes, or faults to allocated file ranges won't have problems
with pre-existing poison flags.  If the user wants to fill sparse holes,
they can do that with a separate fallocate call.
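
In other words, a recovery tool would do something like this (a userspace
sketch; it assumes the new fallocate mode is named FALLOC_FL_ZEROINIT_RANGE
as in patch 2 of this series, which could still change):

#define _GNU_SOURCE
#include <fcntl.h>
#include <linux/falloc.h>

/* Zero-initialize an already-allocated, possibly poisoned range in
 * place, without punching or reallocating blocks. */
static int reset_poisoned_range(int fd, off_t offset, off_t len)
{
	return fallocate(fd, FALLOC_FL_ZEROINIT_RANGE, offset, len);
}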

--D

> Please help correct any of the above that does not make sense. It will help
> me understand this use case better.
> 
> -ritesh
> 
> >
> > --D
> >
> > > [2]: "vfs: add a zero-initialization mode to fallocate"
> > >
> > > Although I am not an expert in this area, the rest of the patch looks
> > > very well crafted to me. Thanks again for such details :)
> > >
> > > -ritesh
> > >
> > > >
> > > > +}
> > > > +
> > > > +/*
> > > > + * Initialize storage mapped to a DAX-mode file to a known value and ensure the
> > > > + * media are ready to accept read and write commands.  This requires the use of
> > > > + * the dax layer's zero page range function to write zeroes to a pmem region
> > > > + * and to reset any hardware media error state.
> > > > + *
> > > > + * The physical extents must be aligned to page size.  The file must be backed
> > > > + * by a pmem device.  The extents returned must not require copy on write (or
> > > > + * any other mapping interventions from the filesystem) and must be contiguous.
> > > > + * @done will be set to true if the reset succeeded.
> > > > + *
> > > > + * Returns 0 if the zero initialization succeeded, -ECANCELED if the storage
> > > > + * mappings do not support zero initialization, -EOPNOTSUPP if the device does
> > > > + * not support it, or the usual negative errno.
> > > > + */
> > > > +int
> > > > +dax_zeroinit_range(struct inode *inode, loff_t pos, u64 len,
> > > > +		   const struct iomap_ops *ops)
> > > > +{
> > > > +	struct iomap_iter iter = {
> > > > +		.inode		= inode,
> > > > +		.pos		= pos,
> > > > +		.len		= len,
> > > > +		.flags		= IOMAP_REPORT,
> > > > +	};
> > > > +	int ret;
> > > > +
> > > > +	if (!IS_DAX(inode))
> > > > +		return -EINVAL;
> > > > +	if (pos + len > i_size_read(inode))
> > > > +		return -EINVAL;
> > > > +
> > > > +	while ((ret = iomap_iter(&iter, ops)) > 0)
> > > > +		iter.processed = dax_zeroinit_iter(&iter);
> > > > +	return ret;
> > > > +}
> > > > +EXPORT_SYMBOL_GPL(dax_zeroinit_range);
> > > > diff --git a/include/linux/dax.h b/include/linux/dax.h
> > > > index 2619d94c308d..3c873f7c35ba 100644
> > > > --- a/include/linux/dax.h
> > > > +++ b/include/linux/dax.h
> > > > @@ -129,6 +129,8 @@ struct page *dax_layout_busy_page(struct address_space *mapping);
> > > >  struct page *dax_layout_busy_page_range(struct address_space *mapping, loff_t start, loff_t end);
> > > >  dax_entry_t dax_lock_page(struct page *page);
> > > >  void dax_unlock_page(struct page *page, dax_entry_t cookie);
> > > > +int dax_zeroinit_range(struct inode *inode, loff_t pos, u64 len,
> > > > +			const struct iomap_ops *ops);
> > > >  #else
> > > >  #define generic_fsdax_supported		NULL
> > > >
> > > > @@ -174,6 +176,11 @@ static inline dax_entry_t dax_lock_page(struct page *page)
> > > >  static inline void dax_unlock_page(struct page *page, dax_entry_t cookie)
> > > >  {
> > > >  }
> > > > +static inline int dax_zeroinit_range(struct inode *inode, loff_t pos, u64 len,
> > > > +		const struct iomap_ops *ops)
> > > > +{
> > > > +	return -EOPNOTSUPP;
> > > > +}
> > > >  #endif
> > > >
> > > >  #if IS_ENABLED(CONFIG_DAX)
> > > >
Ritesh Harjani Sept. 22, 2021, 7:47 p.m. UTC | #7
On 21/09/22 11:26AM, Darrick J. Wong wrote:
> On Tue, Sep 21, 2021 at 09:37:08AM +0530, riteshh wrote:
> > On 21/09/20 10:22AM, Darrick J. Wong wrote:
> > > On Sat, Sep 18, 2021 at 10:24:08PM +0530, riteshh wrote:
> > > > On 21/09/17 06:30PM, Darrick J. Wong wrote:
> > > > > From: Darrick J. Wong <djwong@kernel.org>
> > > > >
> > > > > Our current "advice" to people using persistent memory and FSDAX who
> > > > > wish to recover upon receipt of a media error (aka 'hwpoison') event
> > > > > from ACPI is to punch-hole that part of the file and then pwrite it,
> > > > > which will magically cause the pmem to be reinitialized and the poison
> > > > > to be cleared.
> > > > >
> > > > > Punching doesn't make any sense at all -- the (re)allocation on pwrite
> > > > > does not permit the caller to specify where to find blocks, which means
> > > > > that we might not get the same pmem back.  This pushes the user farther
> > > > > away from the goal of reinitializing poisoned memory and leads to
> > > > > complaints about unnecessary file fragmentation.
> > > > >
> > > > > AFAICT, the only reason why the "punch and write" dance works at all is
> > > > > that XFS and ext4 currently call blkdev_issue_zeroout when
> > > > > allocating pmem ahead of a write call.  Even a regular overwrite won't
> > > > > clear the poison, because dax_direct_access is smart enough to bail out
> > > > > on poisoned pmem, but not smart enough to clear it.  To be fair, that
> > > > > function maps pages and has no idea what kinds of reads and writes the
> > > > > caller might want to perform.
> > > > >
> > > > > Therefore, create a dax_zeroinit_range function that filesystems can use to
> > > > > reset the pmem contents to zero and clear hardware media error flags.
> > > > > This uses the dax page zeroing helper function, which should ensure that
> > > > > subsequent accesses will not trip over any pre-existing media errors.
> > > >
> > > > Thanks Darrick for such a clear explanation of the problem and your solution.
> > > > As I see from this thread [1], it looks like we are heading in this direction,
> > > > so I thought, why not review this RFC patch series :)
> > > >
> > > > [1]: https://lore.kernel.org/all/CAPcyv4iAr_Vwwgqw+4wz0RQUXhUUJGGz7_T+p+W6tC4T+k+zNw@mail.gmail.com/
> > > >
> > > > >
> > > > > Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> > > > > ---
> > > > >  fs/dax.c            |   93 +++++++++++++++++++++++++++++++++++++++++++++++++++
> > > > >  include/linux/dax.h |    7 ++++
> > > > >  2 files changed, 100 insertions(+)
> > > > >
> > > > >
> > > > > diff --git a/fs/dax.c b/fs/dax.c
> > > > > index 4e3e5a283a91..765b80d08605 100644
> > > > > --- a/fs/dax.c
> > > > > +++ b/fs/dax.c
> > > > > @@ -1714,3 +1714,96 @@ vm_fault_t dax_finish_sync_fault(struct vm_fault *vmf,
> > > > >  	return dax_insert_pfn_mkwrite(vmf, pfn, order);
> > > > >  }
> > > > >  EXPORT_SYMBOL_GPL(dax_finish_sync_fault);
> > > > > +
> > > > > +static loff_t
> > > > > +dax_zeroinit_iter(struct iomap_iter *iter)
> > > > > +{
> > > > > +	struct iomap *iomap = &iter->iomap;
> > > > > +	const struct iomap *srcmap = iomap_iter_srcmap(iter);
> > > > > +	const u64 start = iomap->addr + iter->pos - iomap->offset;
> > > > > +	const u64 nr_bytes = iomap_length(iter);
> > > > > +	u64 start_page = start >> PAGE_SHIFT;
> > > > > +	u64 nr_pages = nr_bytes >> PAGE_SHIFT;
> > > > > +	int ret;
> > > > > +
> > > > > +	if (!iomap->dax_dev)
> > > > > +		return -ECANCELED;
> > > > > +
> > > > > +	/*
> > > > > +	 * The physical extent must be page aligned because that's what the dax
> > > > > +	 * function requires.
> > > > > +	 */
> > > > > +	if (!PAGE_ALIGNED(start | nr_bytes))
> > > > > +		return -ECANCELED;
> > > > > +
> > > > > +	/*
> > > > > +	 * The dax function, by using pgoff_t, is stuck with unsigned long, so
> > > > > +	 * we must check for overflows.
> > > > > +	 */
> > > > > +	if (start_page >= ULONG_MAX || start_page + nr_pages > ULONG_MAX)
> > > > > +		return -ECANCELED;
> > > > > +
> > > > > +	/* Must be able to zero storage directly without fs intervention. */
> > > > > +	if (iomap->flags & IOMAP_F_SHARED)
> > > > > +		return -ECANCELED;
> > > > > +	if (srcmap != iomap)
> > > > > +		return -ECANCELED;
> > > > > +
> > > > > +	switch (iomap->type) {
> > > > > +	case IOMAP_MAPPED:
> > > > > +		while (nr_pages > 0) {
> > > > > +			/* XXX function only supports one page at a time?! */
> > > > > +			ret = dax_zero_page_range(iomap->dax_dev, start_page,
> > > > > +					1);
> > > > > +			if (ret)
> > > > > +				return ret;
> > > > > +			start_page++;
> > > > > +			nr_pages--;
> > > > > +		}
> > > > > +
> > > > > +		fallthrough;
> > > > > +	case IOMAP_UNWRITTEN:
> > > > > +		return nr_bytes;
> > > > > +	}
> > > > > +
> > > > > +	/* Reject holes, inline data, or delalloc extents. */
> > > > > +	return -ECANCELED;
> > > >
> > > > We reject holes here, but the other vfs plumbing patch [2] mentions
> > > > "Holes and unwritten extents are left untouched."
> > > > Shouldn't we just return nr_bytes for the IOMAP_HOLE case as well?
> > >
> > > I'm not entirely sure what we should do for holes and unwritten extents,
> > > as you can tell from the gross inconsistency between the comment and the
> > > code. :/
> > >
> > > On block devices, I think we rely on the behavior that writing to disk
> > > will clear the device's error state (via LBA remapping or some other
> > > strategy).  I think this means iomap_zeroinit can skip unwritten extents
> > > because reads and read faults will be satisfied from the zero page and
> > > writeback (or direct writes) will trigger the drive firmware.
> > >
> > > On FSDAX devices, reads are fulfilled by zeroing the user buffer, and
> > > read faults with the (dax) zero page.  Writes and write faults won't
> > > clear the poison (unlike disks).  So I guess this means that
> > > dax_zeroinit *does* have to act on unwritten areas.
>
> I was confused when I wrote this -- before writing, dax filesystems are
> required to allocate written zeroed extents and/or zero unwritten
> extents and mark them written.  So we don't actually need to zero
> unwritten extents.

Oh yes, thanks for catching that.

ext4 sets this flag for a DAX inode in ext4_iomap_alloc():
ext4_iomap_begin() -> ext4_iomap_alloc() ->
	if (IS_DAX(inode))
		m_flags = EXT4_GET_BLOCKS_CREATE_ZERO;

Also,
#define EXT4_GET_BLOCKS_CREATE_ZERO		(EXT4_GET_BLOCKS_CREATE | EXT4_GET_BLOCKS_ZERO)

If EXT4_GET_BLOCKS_ZERO is set, then we call:
ext4_map_blocks -> ext4_issue_zeroout() -> blkdev_issue_zeroout()

>
> > >
> > > Ok.  I'll make those changes.
> >
> > Yes, I guess unwritten extents still have extent blocks allocated, generally
> > with a bit set to mark them as unwritten. So there could still be a need to
> > clear the poison for these in the DAX case.
> >
> > >
> > > As for holes -- on the one hand, one could argue that zero-initializing
> > > a hole makes no sense and should be an error.  OTOH one could make an
> > > equally compelling argument that it's merely a nop.  Thoughts?
> >
> > So in the case of holes, consider this scenario (please correct me if any of
> > my understanding below is wrong/incomplete).
> > If we have a large hole and someone tries to write to that area:
> > 1. Then, from what I understood from the code, the FS will try to allocate
> >    some disk blocks (could these blocks be marked with HWpoison, since the FS
> >    has no way of knowing?).
>
> Freshly allocated extents are zeroed via blkdev_issue_zeroout before
> being mapped into the file, which will clear the poison.  That last bit
> is the only reason why the punch-and-rewrite dance ever worked at all.

Frankly speaking, this discussion actually got me thinking: what is so special
about punch-and-write (w/o making any assumptions) that it clears the poison?

What about a punch hole alone?
So I guess this too can clear the HWpoison on the next write (fault ->
allocate blocks for DAX -> blkdev_issue_zeroout()), but the problem with this
approach, as you mentioned, is that it may cause file fragmentation. This is
because we may not get the same block back on the next write after a punch
hole.

Looking at the code of the pmem driver, it is actually pmem_do_write() which
clears the pmem HWpoison.

And I guess the offset and size should be page-aligned, because that is when
dax_iomap_zero() calls dax_zero_page_range(), which can clear the HWpoison.
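
For context, that driver path looks roughly like this (paraphrased from
drivers/nvdimm/pmem.c; details vary by kernel version):

static blk_status_t pmem_do_write(struct pmem_device *pmem,
		struct page *page, unsigned int page_off,
		sector_t sector, unsigned int len)
{
	blk_status_t rc = BLK_STS_OK;
	bool bad_pmem = false;
	phys_addr_t pmem_off = sector * 512 + pmem->data_offset;
	void *pmem_addr = pmem->virt_addr + pmem_off;

	if (unlikely(is_bad_pmem(&pmem->bb, sector, len)))
		bad_pmem = true;

	/*
	 * Write the data, then clear any known poison and write again
	 * so the newly cleared media ends up holding the new contents.
	 */
	flush_dcache_page(page);
	write_pmem(pmem_addr, page, page_off, len);
	if (unlikely(bad_pmem)) {
		rc = pmem_clear_poison(pmem, pmem_off, len);
		write_pmem(pmem_addr, page, page_off, len);
	}

	return rc;
}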

>
> We'll have to make sure this poison clearing still happens even after
> the dax/block layer divorce.
>
> > 2. If yes, then after allocating those blocks dax_direct_access will fail (as
> >    you had mentioned above). But it won't clear the HWpoison.
> > 3. Then the user will again have to clear it using this API. But that is only
> >    for a given extent, which is some part of the hole which the FS allocated.
> > Now the above could be repeated until the entire hole range is covered.
> > Is the above understanding correct?
> >
> > If yes, then maybe it all depends on what sort of guarantee the API is
> > providing. If using the API on the given range guarantees that the entire
> > file range will not have any blocks with HWpoison, then I guess we may have
> > to cover the IOMAP_HOLE case as well?
> > If not, then maybe we could explicitly mention this in the API documentation.
>
> Ok.  The short version is that zeroinit will ensure that subsequent
> reads, writes, or faults to allocated file ranges won't have problems
> with pre-existing poison flags.  If the user wants to fill sparse holes,
> they can do that with a separate fallocate call.

Yes, it is now clear to me.
Thanks for taking the time to reply.

-ritesh

>
> --D
>
> > Please help correct any of the above that does not make sense. It will help
> > me understand this use case better.
> >
> > -ritesh
> >
> > >
> > > --D
> > >
> > > > [2]: "vfs: add a zero-initialization mode to fallocate"
> > > >
> > > > Although I am not an expert in this area, the rest of the patch looks
> > > > very well crafted to me. Thanks again for such details :)
> > > >
> > > > -ritesh
> > > >
> > > > >
> > > > > +}
> > > > > +
> > > > > +/*
> > > > > + * Initialize storage mapped to a DAX-mode file to a known value and ensure the
> > > > > + * media are ready to accept read and write commands.  This requires the use of
> > > > > + * the dax layer's zero page range function to write zeroes to a pmem region
> > > > > + * and to reset any hardware media error state.
> > > > > + *
> > > > > + * The physical extents must be aligned to page size.  The file must be backed
> > > > > + * by a pmem device.  The extents returned must not require copy on write (or
> > > > > + * any other mapping interventions from the filesystem) and must be contiguous.
> > > > > + * @done will be set to true if the reset succeeded.
> > > > > + *
> > > > > + * Returns 0 if the zero initialization succeeded, -ECANCELED if the storage
> > > > > + * mappings do not support zero initialization, -EOPNOTSUPP if the device does
> > > > > + * not support it, or the usual negative errno.
> > > > > + */
> > > > > +int
> > > > > +dax_zeroinit_range(struct inode *inode, loff_t pos, u64 len,
> > > > > +		   const struct iomap_ops *ops)
> > > > > +{
> > > > > +	struct iomap_iter iter = {
> > > > > +		.inode		= inode,
> > > > > +		.pos		= pos,
> > > > > +		.len		= len,
> > > > > +		.flags		= IOMAP_REPORT,
> > > > > +	};
> > > > > +	int ret;
> > > > > +
> > > > > +	if (!IS_DAX(inode))
> > > > > +		return -EINVAL;
> > > > > +	if (pos + len > i_size_read(inode))
> > > > > +		return -EINVAL;
> > > > > +
> > > > > +	while ((ret = iomap_iter(&iter, ops)) > 0)
> > > > > +		iter.processed = dax_zeroinit_iter(&iter);
> > > > > +	return ret;
> > > > > +}
> > > > > +EXPORT_SYMBOL_GPL(dax_zeroinit_range);
> > > > > diff --git a/include/linux/dax.h b/include/linux/dax.h
> > > > > index 2619d94c308d..3c873f7c35ba 100644
> > > > > --- a/include/linux/dax.h
> > > > > +++ b/include/linux/dax.h
> > > > > @@ -129,6 +129,8 @@ struct page *dax_layout_busy_page(struct address_space *mapping);
> > > > >  struct page *dax_layout_busy_page_range(struct address_space *mapping, loff_t start, loff_t end);
> > > > >  dax_entry_t dax_lock_page(struct page *page);
> > > > >  void dax_unlock_page(struct page *page, dax_entry_t cookie);
> > > > > +int dax_zeroinit_range(struct inode *inode, loff_t pos, u64 len,
> > > > > +			const struct iomap_ops *ops);
> > > > >  #else
> > > > >  #define generic_fsdax_supported		NULL
> > > > >
> > > > > @@ -174,6 +176,11 @@ static inline dax_entry_t dax_lock_page(struct page *page)
> > > > >  static inline void dax_unlock_page(struct page *page, dax_entry_t cookie)
> > > > >  {
> > > > >  }
> > > > > +static inline int dax_zeroinit_range(struct inode *inode, loff_t pos, u64 len,
> > > > > +		const struct iomap_ops *ops)
> > > > > +{
> > > > > +	return -EOPNOTSUPP;
> > > > > +}
> > > > >  #endif
> > > > >
> > > > >  #if IS_ENABLED(CONFIG_DAX)
> > > > >
Dan Williams Sept. 22, 2021, 8:26 p.m. UTC | #8
On Wed, Sep 22, 2021 at 11:27 AM Darrick J. Wong <djwong@kernel.org> wrote:
>
> On Tue, Sep 21, 2021 at 09:37:08AM +0530, riteshh wrote:
> > On 21/09/20 10:22AM, Darrick J. Wong wrote:
> > > On Sat, Sep 18, 2021 at 10:24:08PM +0530, riteshh wrote:
> > > > On 21/09/17 06:30PM, Darrick J. Wong wrote:
> > > > > From: Darrick J. Wong <djwong@kernel.org>
> > > > >
> > > > > Our current "advice" to people using persistent memory and FSDAX who
> > > > > wish to recover upon receipt of a media error (aka 'hwpoison') event
> > > > > from ACPI is to punch-hole that part of the file and then pwrite it,
> > > > > which will magically cause the pmem to be reinitialized and the poison
> > > > > to be cleared.
> > > > >
> > > > > Punching doesn't make any sense at all -- the (re)allocation on pwrite
> > > > > does not permit the caller to specify where to find blocks, which means
> > > > > that we might not get the same pmem back.  This pushes the user farther
> > > > > away from the goal of reinitializing poisoned memory and leads to
> > > > > complaints about unnecessary file fragmentation.
> > > > >
> > > > > AFAICT, the only reason why the "punch and write" dance works at all is
> > > > > that XFS and ext4 currently call blkdev_issue_zeroout when
> > > > > allocating pmem ahead of a write call.  Even a regular overwrite won't
> > > > > clear the poison, because dax_direct_access is smart enough to bail out
> > > > > on poisoned pmem, but not smart enough to clear it.  To be fair, that
> > > > > function maps pages and has no idea what kinds of reads and writes the
> > > > > caller might want to perform.
> > > > >
> > > > > Therefore, create a dax_zeroinit_range function that filesystems can use to
> > > > > reset the pmem contents to zero and clear hardware media error flags.
> > > > > This uses the dax page zeroing helper function, which should ensure that
> > > > > subsequent accesses will not trip over any pre-existing media errors.
> > > >
> > > > Thanks Darrick for such a clear explanation of the problem and your solution.
> > > > As I see from this thread [1], it looks like we are heading in this direction,
> > > > so I thought, why not review this RFC patch series :)
> > > >
> > > > [1]: https://lore.kernel.org/all/CAPcyv4iAr_Vwwgqw+4wz0RQUXhUUJGGz7_T+p+W6tC4T+k+zNw@mail.gmail.com/
> > > >
> > > > >
> > > > > Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> > > > > ---
> > > > >  fs/dax.c            |   93 +++++++++++++++++++++++++++++++++++++++++++++++++++
> > > > >  include/linux/dax.h |    7 ++++
> > > > >  2 files changed, 100 insertions(+)
> > > > >
> > > > >
> > > > > diff --git a/fs/dax.c b/fs/dax.c
> > > > > index 4e3e5a283a91..765b80d08605 100644
> > > > > --- a/fs/dax.c
> > > > > +++ b/fs/dax.c
> > > > > @@ -1714,3 +1714,96 @@ vm_fault_t dax_finish_sync_fault(struct vm_fault *vmf,
> > > > >         return dax_insert_pfn_mkwrite(vmf, pfn, order);
> > > > >  }
> > > > >  EXPORT_SYMBOL_GPL(dax_finish_sync_fault);
> > > > > +
> > > > > +static loff_t
> > > > > +dax_zeroinit_iter(struct iomap_iter *iter)
> > > > > +{
> > > > > +       struct iomap *iomap = &iter->iomap;
> > > > > +       const struct iomap *srcmap = iomap_iter_srcmap(iter);
> > > > > +       const u64 start = iomap->addr + iter->pos - iomap->offset;
> > > > > +       const u64 nr_bytes = iomap_length(iter);
> > > > > +       u64 start_page = start >> PAGE_SHIFT;
> > > > > +       u64 nr_pages = nr_bytes >> PAGE_SHIFT;
> > > > > +       int ret;
> > > > > +
> > > > > +       if (!iomap->dax_dev)
> > > > > +               return -ECANCELED;
> > > > > +
> > > > > +       /*
> > > > > +        * The physical extent must be page aligned because that's what the dax
> > > > > +        * function requires.
> > > > > +        */
> > > > > +       if (!PAGE_ALIGNED(start | nr_bytes))
> > > > > +               return -ECANCELED;
> > > > > +
> > > > > +       /*
> > > > > +        * The dax function, by using pgoff_t, is stuck with unsigned long, so
> > > > > +        * we must check for overflows.
> > > > > +        */
> > > > > +       if (start_page >= ULONG_MAX || start_page + nr_pages > ULONG_MAX)
> > > > > +               return -ECANCELED;
> > > > > +
> > > > > +       /* Must be able to zero storage directly without fs intervention. */
> > > > > +       if (iomap->flags & IOMAP_F_SHARED)
> > > > > +               return -ECANCELED;
> > > > > +       if (srcmap != iomap)
> > > > > +               return -ECANCELED;
> > > > > +
> > > > > +       switch (iomap->type) {
> > > > > +       case IOMAP_MAPPED:
> > > > > +               while (nr_pages > 0) {
> > > > > +                       /* XXX function only supports one page at a time?! */
> > > > > +                       ret = dax_zero_page_range(iomap->dax_dev, start_page,
> > > > > +                                       1);
> > > > > +                       if (ret)
> > > > > +                               return ret;
> > > > > +                       start_page++;
> > > > > +                       nr_pages--;
> > > > > +               }
> > > > > +
> > > > > +               fallthrough;
> > > > > +       case IOMAP_UNWRITTEN:
> > > > > +               return nr_bytes;
> > > > > +       }
> > > > > +
> > > > > +       /* Reject holes, inline data, or delalloc extents. */
> > > > > +       return -ECANCELED;
> > > >
> > > > We reject holes here, but the other vfs plumbing patch [2] mentions
> > > > "Holes and unwritten extents are left untouched."
> > > > Shouldn't we just return nr_bytes for the IOMAP_HOLE case as well?
> > >
> > > I'm not entirely sure what we should do for holes and unwritten extents,
> > > as you can tell from the gross inconsistency between the comment and the
> > > code. :/
> > >
> > > On block devices, I think we rely on the behavior that writing to disk
> > > will clear the device's error state (via LBA remapping or some other
> > > strategy).  I think this means iomap_zeroinit can skip unwritten extents
> > > because reads and read faults will be satisfied from the zero page and
> > > writeback (or direct writes) will trigger the drive firmware.
> > >
> > > On FSDAX devices, reads are fulfilled by zeroing the user buffer, and
> > > read faults with the (dax) zero page.  Writes and write faults won't
> > > clear the poison (unlike disks).  So I guess this means that
> > > dax_zeroinit *does* have to act on unwritten areas.
>
> I was confused when I wrote this -- before writing, dax filesystems are
> required to allocate written zeroed extents and/or zero unwritten
> extents and mark them written.  So we don't actually need to zero
> unwritten extents.
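
I.e. the iter only needs to act on written (IOMAP_MAPPED) extents.  An
annotated sketch of the type handling (same logic as the hunk below,
with comments added here for clarity):

	switch (iomap->type) {
	case IOMAP_MAPPED:
		/* Written pmem may carry poison; zero it to clear the flags. */
		while (nr_pages > 0) {
			/* dax_zero_page_range currently handles one page per call */
			ret = dax_zero_page_range(iomap->dax_dev, start_page, 1);
			if (ret)
				return ret;
			start_page++;
			nr_pages--;
		}
		fallthrough;
	case IOMAP_UNWRITTEN:
		/* The write path zeroes these before use; nothing to clear. */
		return nr_bytes;
	}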
>
> > >
> > > Ok.  I'll make those changes.
> >
> > Yes, I guess unwritten extents still have extent blocks allocated, generally
> > with a bit set to mark them as unwritten. So there could still be a need to
> > clear the poison for these in the DAX case.
> >
> > >
> > > As for holes -- on the one hand, one could argue that zero-initializing
> > > a hole makes no sense and should be an error.  OTOH one could make an
> > > equally compelling argument that it's merely a nop.  Thoughts?
> >
> > So in the case of holes, consider this scenario (please correct me if any of my
> > understanding below is wrong/incomplete).
> > Suppose we have a large hole and someone tries to write to that area:
> > 1. From what I understood from the code, the FS will try to allocate some disk
> >    blocks (could these blocks already be marked with HWpoison, since the FS has
> >    no way of knowing?).
>
> Freshly allocated extents are zeroed via blkdev_issue_zeroout before
> being mapped into the file, which will clear the poison.  That last bit
> is the only reason why the punch-and-rewrite dance ever worked at all.
>
> We'll have to make sure this poison clearing still happens even after
> the dax/block layer divorce.
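
For reference, that allocation-time zeroing is a plain call into the
block layer.  A minimal sketch of the idea (the helper name is made up
for illustration; this is not the actual xfs/ext4 code):

	/*
	 * Zero a freshly allocated extent through the block layer.  On pmem
	 * this reaches the driver's zeroing path, which can clear poison.
	 */
	static int zero_new_extent(struct block_device *bdev, sector_t sector,
				   sector_t nr_sects)
	{
		/* NOUNMAP: write real zeroes rather than discarding blocks */
		return blkdev_issue_zeroout(bdev, sector, nr_sects, GFP_KERNEL,
					    BLKDEV_ZERO_NOUNMAP);
	}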
>
> > 2. If yes, then after allocating those blocks dax_direct_access will fail (as
> >    you mentioned above), but it won't clear the HWpoison.
> > 3. Then the user will again have to clear it using this API, but only for the
> >    given extent, which is just one part of the hole the FS allocated.
> > The above could be repeated until the entire hole range is covered.
> > Is that understanding correct?
> >
> > If yes, then maybe it all depends on what sort of guarantee the API provides.
> > If using the API on a given range guarantees that the entire file range will
> > not have any blocks with HWpoison, then I guess we may have to cover the
> > IOMAP_HOLE case as well?
> > If not, then maybe we should explicitly mention this in the API documentation.
>
> Ok.  The short version is that zeroinit will ensure that subsequent
> reads, writes, or faults to allocated file ranges won't have problems
> with pre-existing poison flags.

s/won't/likely won't/

s/pre-existing/known pre-existing/

I.e. the guarantee of this interface is that it will have tried its
best to clean up media errors. That may only be possible for pmem if
latent poison has previously been reported to the driver, and if the
driver supports error clearing. For example, if you're using
memmap=ss!nn to emulate pmem with a DRAM range containing poison, I
expect this routine will succeed even though no poison is corrected.
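
For completeness, the intended consumer is the new fallocate mode added
later in this series.  A rough sketch of a userspace recovery tool
(assuming the FALLOC_FL_ZEROINIT_RANGE flag from this series; the value
below is a placeholder, take the real one from the uapi patch):

	#define _GNU_SOURCE
	#include <fcntl.h>
	#include <stdio.h>
	#include <stdlib.h>
	#include <unistd.h>

	#ifndef FALLOC_FL_ZEROINIT_RANGE
	#define FALLOC_FL_ZEROINIT_RANGE 0x80	/* placeholder value */
	#endif

	/* Reset a poisoned range: contents become zeroes, media errors cleared. */
	int main(int argc, char **argv)
	{
		int fd;

		if (argc != 4) {
			fprintf(stderr, "usage: %s file offset length\n", argv[0]);
			return 1;
		}
		fd = open(argv[1], O_RDWR);
		if (fd < 0) {
			perror(argv[1]);
			return 1;
		}
		if (fallocate(fd, FALLOC_FL_ZEROINIT_RANGE,
			      strtoll(argv[2], NULL, 0),
			      strtoll(argv[3], NULL, 0)) < 0) {
			perror("fallocate");
			return 1;
		}
		close(fd);
		return 0;
	}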

Patch

diff --git a/fs/dax.c b/fs/dax.c
index 4e3e5a283a91..765b80d08605 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -1714,3 +1714,96 @@  vm_fault_t dax_finish_sync_fault(struct vm_fault *vmf,
 	return dax_insert_pfn_mkwrite(vmf, pfn, order);
 }
 EXPORT_SYMBOL_GPL(dax_finish_sync_fault);
+
+static loff_t
+dax_zeroinit_iter(struct iomap_iter *iter)
+{
+	struct iomap *iomap = &iter->iomap;
+	const struct iomap *srcmap = iomap_iter_srcmap(iter);
+	const u64 start = iomap->addr + iter->pos - iomap->offset;
+	const u64 nr_bytes = iomap_length(iter);
+	u64 start_page = start >> PAGE_SHIFT;
+	u64 nr_pages = nr_bytes >> PAGE_SHIFT;
+	int ret;
+
+	if (!iomap->dax_dev)
+		return -ECANCELED;
+
+	/*
+	 * The physical extent must be page aligned because that's what the dax
+	 * function requires.
+	 */
+	if (!PAGE_ALIGNED(start | nr_bytes))
+		return -ECANCELED;
+
+	/*
+	 * The dax function, by using pgoff_t, is stuck with unsigned long, so
+	 * we must check for overflows.
+	 */
+	if (start_page >= ULONG_MAX || start_page + nr_pages > ULONG_MAX)
+		return -ECANCELED;
+
+	/* Must be able to zero storage directly without fs intervention. */
+	if (iomap->flags & IOMAP_F_SHARED)
+		return -ECANCELED;
+	if (srcmap != iomap)
+		return -ECANCELED;
+
+	switch (iomap->type) {
+	case IOMAP_MAPPED:
+		while (nr_pages > 0) {
+			/* XXX function only supports one page at a time?! */
+			ret = dax_zero_page_range(iomap->dax_dev, start_page,
+					1);
+			if (ret)
+				return ret;
+			start_page++;
+			nr_pages--;
+		}
+
+		fallthrough;
+	case IOMAP_UNWRITTEN:
+		return nr_bytes;
+	}
+
+	/* Reject holes, inline data, or delalloc extents. */
+	return -ECANCELED;
+}
+
+/*
+ * Initialize storage mapped to a DAX-mode file to a known value and ensure the
+ * media are ready to accept read and write commands.  This requires the use of
+ * the dax layer's zero page range function to write zeroes to a pmem region
+ * and to reset any hardware media error state.
+ *
+ * The physical extents must be aligned to page size.  The file must be backed
+ * by a pmem device.  The extents returned must not require copy on write (or
+ * any other mapping interventions from the filesystem) and must be contiguous.
+ * @done will be set to true if the reset succeeded.
+ *
+ * Returns 0 if the zero initialization succeeded, -ECANCELED if the storage
+ * mappings do not support zero initialization, -EOPNOTSUPP if the device does
+ * not support it, or the usual negative errno.
+ */
+int
+dax_zeroinit_range(struct inode *inode, loff_t pos, u64 len,
+		   const struct iomap_ops *ops)
+{
+	struct iomap_iter iter = {
+		.inode		= inode,
+		.pos		= pos,
+		.len		= len,
+		.flags		= IOMAP_REPORT,
+	};
+	int ret;
+
+	if (!IS_DAX(inode))
+		return -EINVAL;
+	if (pos + len > i_size_read(inode))
+		return -EINVAL;
+
+	while ((ret = iomap_iter(&iter, ops)) > 0)
+		iter.processed = dax_zeroinit_iter(&iter);
+	return ret;
+}
+EXPORT_SYMBOL_GPL(dax_zeroinit_range);
diff --git a/include/linux/dax.h b/include/linux/dax.h
index 2619d94c308d..3c873f7c35ba 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -129,6 +129,8 @@  struct page *dax_layout_busy_page(struct address_space *mapping);
 struct page *dax_layout_busy_page_range(struct address_space *mapping, loff_t start, loff_t end);
 dax_entry_t dax_lock_page(struct page *page);
 void dax_unlock_page(struct page *page, dax_entry_t cookie);
+int dax_zeroinit_range(struct inode *inode, loff_t pos, u64 len,
+			const struct iomap_ops *ops);
 #else
 #define generic_fsdax_supported		NULL
 
@@ -174,6 +176,11 @@  static inline dax_entry_t dax_lock_page(struct page *page)
 static inline void dax_unlock_page(struct page *page, dax_entry_t cookie)
 {
 }
+static inline int dax_zeroinit_range(struct inode *inode, loff_t pos, u64 len,
+		const struct iomap_ops *ops)
+{
+	return -EOPNOTSUPP;
+}
 #endif
 
 #if IS_ENABLED(CONFIG_DAX)
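
For illustration, a filesystem would call the new helper from whatever
path services the userspace request.  A minimal sketch (the function
name and example_iomap_ops are stand-ins, not code from this series):

	/* The filesystem's existing read-mapping iomap ops. */
	extern const struct iomap_ops example_iomap_ops;

	/* Hypothetical glue from a file-based request to dax_zeroinit_range(). */
	static int example_zeroinit_range(struct file *file, loff_t offset,
					  loff_t len)
	{
		struct inode *inode = file_inode(file);

		if (!IS_DAX(inode))
			return -EOPNOTSUPP;

		return dax_zeroinit_range(inode, offset, len,
					  &example_iomap_ops);
	}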