
[3/5] vfs: add a zero-initialization mode to fallocate

Message ID 163192866125.417973.7293598039998376121.stgit@magnolia (mailing list archive)
State New, archived
Series vfs: enable userspace to reset damaged file storage

Commit Message

Darrick J. Wong Sept. 18, 2021, 1:31 a.m. UTC
From: Darrick J. Wong <djwong@kernel.org>

Add a new mode to fallocate to zero-initialize all the storage backing a
file.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/open.c                   |    5 +++++
 include/linux/falloc.h      |    1 +
 include/uapi/linux/falloc.h |    9 +++++++++
 3 files changed, 15 insertions(+)

Comments

Ritesh Harjani Sept. 18, 2021, 4:58 p.m. UTC | #1
On 21/09/17 06:31PM, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
>
> Add a new mode to fallocate to zero-initialize all the storage backing a
> file.

This patch looks pretty straightforward to me.

Thanks
-ritesh

>
> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> ---
>  fs/open.c                   |    5 +++++
>  include/linux/falloc.h      |    1 +
>  include/uapi/linux/falloc.h |    9 +++++++++
>  3 files changed, 15 insertions(+)
>
>
> diff --git a/fs/open.c b/fs/open.c
> index daa324606a41..230220b8f67a 100644
> --- a/fs/open.c
> +++ b/fs/open.c
> @@ -256,6 +256,11 @@ int vfs_fallocate(struct file *file, int mode, loff_t offset, loff_t len)
>  	    (mode & ~FALLOC_FL_INSERT_RANGE))
>  		return -EINVAL;
>
> +	/* Zeroinit should only be used by itself and keep size must be set. */
> +	if ((mode & FALLOC_FL_ZEROINIT_RANGE) &&
> +	    (mode != (FALLOC_FL_ZEROINIT_RANGE | FALLOC_FL_KEEP_SIZE)))
> +		return -EINVAL;
> +
>  	/* Unshare range should only be used with allocate mode. */
>  	if ((mode & FALLOC_FL_UNSHARE_RANGE) &&
>  	    (mode & ~(FALLOC_FL_UNSHARE_RANGE | FALLOC_FL_KEEP_SIZE)))
> diff --git a/include/linux/falloc.h b/include/linux/falloc.h
> index f3f0b97b1675..4597b416667b 100644
> --- a/include/linux/falloc.h
> +++ b/include/linux/falloc.h
> @@ -29,6 +29,7 @@ struct space_resv {
>  					 FALLOC_FL_PUNCH_HOLE |		\
>  					 FALLOC_FL_COLLAPSE_RANGE |	\
>  					 FALLOC_FL_ZERO_RANGE |		\
> +					 FALLOC_FL_ZEROINIT_RANGE |	\
>  					 FALLOC_FL_INSERT_RANGE |	\
>  					 FALLOC_FL_UNSHARE_RANGE)
>
> diff --git a/include/uapi/linux/falloc.h b/include/uapi/linux/falloc.h
> index 51398fa57f6c..8144403b6102 100644
> --- a/include/uapi/linux/falloc.h
> +++ b/include/uapi/linux/falloc.h
> @@ -77,4 +77,13 @@
>   */
>  #define FALLOC_FL_UNSHARE_RANGE		0x40
>
> +/*
> + * FALLOC_FL_ZEROINIT_RANGE is used to reinitialize storage backing a file by
> + * writing zeros to it.  Subsequent reads and writes should not fail due to any
> + * previous media errors.  Blocks must not be shared or require copy on
> + * write.  Holes and unwritten extents are left untouched.  This mode must be
> + * used with FALLOC_FL_KEEP_SIZE.
> + */
> +#define FALLOC_FL_ZEROINIT_RANGE	0x80
> +
>  #endif /* _UAPI_FALLOC_H_ */
>
Eric Biggers Sept. 20, 2021, 5:52 p.m. UTC | #2
On Fri, Sep 17, 2021 at 06:31:01PM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
> 
> Add a new mode to fallocate to zero-initialize all the storage backing a
> file.
> 
> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> ---
>  fs/open.c                   |    5 +++++
>  include/linux/falloc.h      |    1 +
>  include/uapi/linux/falloc.h |    9 +++++++++
>  3 files changed, 15 insertions(+)
> 
> 
> diff --git a/fs/open.c b/fs/open.c
> index daa324606a41..230220b8f67a 100644
> --- a/fs/open.c
> +++ b/fs/open.c
> @@ -256,6 +256,11 @@ int vfs_fallocate(struct file *file, int mode, loff_t offset, loff_t len)
>  	    (mode & ~FALLOC_FL_INSERT_RANGE))
>  		return -EINVAL;
>  
> +	/* Zeroinit should only be used by itself and keep size must be set. */
> +	if ((mode & FALLOC_FL_ZEROINIT_RANGE) &&
> +	    (mode != (FALLOC_FL_ZEROINIT_RANGE | FALLOC_FL_KEEP_SIZE)))
> +		return -EINVAL;
> +
>  	/* Unshare range should only be used with allocate mode. */
>  	if ((mode & FALLOC_FL_UNSHARE_RANGE) &&
>  	    (mode & ~(FALLOC_FL_UNSHARE_RANGE | FALLOC_FL_KEEP_SIZE)))
> diff --git a/include/linux/falloc.h b/include/linux/falloc.h
> index f3f0b97b1675..4597b416667b 100644
> --- a/include/linux/falloc.h
> +++ b/include/linux/falloc.h
> @@ -29,6 +29,7 @@ struct space_resv {
>  					 FALLOC_FL_PUNCH_HOLE |		\
>  					 FALLOC_FL_COLLAPSE_RANGE |	\
>  					 FALLOC_FL_ZERO_RANGE |		\
> +					 FALLOC_FL_ZEROINIT_RANGE |	\
>  					 FALLOC_FL_INSERT_RANGE |	\
>  					 FALLOC_FL_UNSHARE_RANGE)
>  
> diff --git a/include/uapi/linux/falloc.h b/include/uapi/linux/falloc.h
> index 51398fa57f6c..8144403b6102 100644
> --- a/include/uapi/linux/falloc.h
> +++ b/include/uapi/linux/falloc.h
> @@ -77,4 +77,13 @@
>   */
>  #define FALLOC_FL_UNSHARE_RANGE		0x40
>  
> +/*
> + * FALLOC_FL_ZEROINIT_RANGE is used to reinitialize storage backing a file by
> + * writing zeros to it.  Subsequent reads and writes should not fail due to any
> + * previous media errors.  Blocks must not be shared or require copy on
> + * write.  Holes and unwritten extents are left untouched.  This mode must be
> + * used with FALLOC_FL_KEEP_SIZE.
> + */
> +#define FALLOC_FL_ZEROINIT_RANGE	0x80
> +

How does this differ from ZERO_RANGE?  Especially when the above says that
ZEROINIT_RANGE leaves unwritten extents untouched.  That implies that unwritten
extents are still "allowed".  If that's the case, why not just use ZERO_RANGE?

- Eric
Darrick J. Wong Sept. 20, 2021, 6:06 p.m. UTC | #3
On Mon, Sep 20, 2021 at 10:52:09AM -0700, Eric Biggers wrote:
> On Fri, Sep 17, 2021 at 06:31:01PM -0700, Darrick J. Wong wrote:
> > From: Darrick J. Wong <djwong@kernel.org>
> > 
> > Add a new mode to fallocate to zero-initialize all the storage backing a
> > file.
> > 
> > Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> > ---
> >  fs/open.c                   |    5 +++++
> >  include/linux/falloc.h      |    1 +
> >  include/uapi/linux/falloc.h |    9 +++++++++
> >  3 files changed, 15 insertions(+)
> > 
> > 
> > diff --git a/fs/open.c b/fs/open.c
> > index daa324606a41..230220b8f67a 100644
> > --- a/fs/open.c
> > +++ b/fs/open.c
> > @@ -256,6 +256,11 @@ int vfs_fallocate(struct file *file, int mode, loff_t offset, loff_t len)
> >  	    (mode & ~FALLOC_FL_INSERT_RANGE))
> >  		return -EINVAL;
> >  
> > +	/* Zeroinit should only be used by itself and keep size must be set. */
> > +	if ((mode & FALLOC_FL_ZEROINIT_RANGE) &&
> > +	    (mode != (FALLOC_FL_ZEROINIT_RANGE | FALLOC_FL_KEEP_SIZE)))
> > +		return -EINVAL;
> > +
> >  	/* Unshare range should only be used with allocate mode. */
> >  	if ((mode & FALLOC_FL_UNSHARE_RANGE) &&
> >  	    (mode & ~(FALLOC_FL_UNSHARE_RANGE | FALLOC_FL_KEEP_SIZE)))
> > diff --git a/include/linux/falloc.h b/include/linux/falloc.h
> > index f3f0b97b1675..4597b416667b 100644
> > --- a/include/linux/falloc.h
> > +++ b/include/linux/falloc.h
> > @@ -29,6 +29,7 @@ struct space_resv {
> >  					 FALLOC_FL_PUNCH_HOLE |		\
> >  					 FALLOC_FL_COLLAPSE_RANGE |	\
> >  					 FALLOC_FL_ZERO_RANGE |		\
> > +					 FALLOC_FL_ZEROINIT_RANGE |	\
> >  					 FALLOC_FL_INSERT_RANGE |	\
> >  					 FALLOC_FL_UNSHARE_RANGE)
> >  
> > diff --git a/include/uapi/linux/falloc.h b/include/uapi/linux/falloc.h
> > index 51398fa57f6c..8144403b6102 100644
> > --- a/include/uapi/linux/falloc.h
> > +++ b/include/uapi/linux/falloc.h
> > @@ -77,4 +77,13 @@
> >   */
> >  #define FALLOC_FL_UNSHARE_RANGE		0x40
> >  
> > +/*
> > + * FALLOC_FL_ZEROINIT_RANGE is used to reinitialize storage backing a file by
> > + * writing zeros to it.  Subsequent reads and writes should not fail due to any
> > + * previous media errors.  Blocks must not be shared or require copy on
> > + * write.  Holes and unwritten extents are left untouched.  This mode must be
> > + * used with FALLOC_FL_KEEP_SIZE.
> > + */
> > +#define FALLOC_FL_ZEROINIT_RANGE	0x80
> > +
> 
> How does this differ from ZERO_RANGE?  Especially when the above says that
> ZEROINIT_RANGE leaves unwritten extents untouched.  That implies that unwritten
> extents are still "allowed".  If that's the case, why not just use ZERO_RANGE?

(Note: I changed the second to last sentence to read "Holes are ignored,
and inline data regions are not supported.")

ZERO_RANGE only guarantees that a subsequent read returns all zeroes,
which means that implementations are allowed to play games with the file
mappings to make that happen with as little work as possible.

ZEROINIT_RANGE implies that the filesystem doesn't change the mappings
at all and actually writes/resets the mapped storage, though I didn't
want to exclude the possibility that the filesystem could change the
mapping as a last resort to guarantee that "subsequent reads and writes
should not fail" if the media write fails.

I /could/ encode all that in the definition, but that feels like
overspecifying the implementation.

--D

> - Eric
Dave Chinner Sept. 21, 2021, 12:44 a.m. UTC | #4
On Fri, Sep 17, 2021 at 06:31:01PM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
> 
> Add a new mode to fallocate to zero-initialize all the storage backing a
> file.
> 
> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> ---
>  fs/open.c                   |    5 +++++
>  include/linux/falloc.h      |    1 +
>  include/uapi/linux/falloc.h |    9 +++++++++
>  3 files changed, 15 insertions(+)
> 
> 
> diff --git a/fs/open.c b/fs/open.c
> index daa324606a41..230220b8f67a 100644
> --- a/fs/open.c
> +++ b/fs/open.c
> @@ -256,6 +256,11 @@ int vfs_fallocate(struct file *file, int mode, loff_t offset, loff_t len)
>  	    (mode & ~FALLOC_FL_INSERT_RANGE))
>  		return -EINVAL;
>  
> +	/* Zeroinit should only be used by itself and keep size must be set. */
> +	if ((mode & FALLOC_FL_ZEROINIT_RANGE) &&
> +	    (mode != (FALLOC_FL_ZEROINIT_RANGE | FALLOC_FL_KEEP_SIZE)))
> +		return -EINVAL;
> +
>  	/* Unshare range should only be used with allocate mode. */
>  	if ((mode & FALLOC_FL_UNSHARE_RANGE) &&
>  	    (mode & ~(FALLOC_FL_UNSHARE_RANGE | FALLOC_FL_KEEP_SIZE)))
> diff --git a/include/linux/falloc.h b/include/linux/falloc.h
> index f3f0b97b1675..4597b416667b 100644
> --- a/include/linux/falloc.h
> +++ b/include/linux/falloc.h
> @@ -29,6 +29,7 @@ struct space_resv {
>  					 FALLOC_FL_PUNCH_HOLE |		\
>  					 FALLOC_FL_COLLAPSE_RANGE |	\
>  					 FALLOC_FL_ZERO_RANGE |		\
> +					 FALLOC_FL_ZEROINIT_RANGE |	\
>  					 FALLOC_FL_INSERT_RANGE |	\
>  					 FALLOC_FL_UNSHARE_RANGE)
>  
> diff --git a/include/uapi/linux/falloc.h b/include/uapi/linux/falloc.h
> index 51398fa57f6c..8144403b6102 100644
> --- a/include/uapi/linux/falloc.h
> +++ b/include/uapi/linux/falloc.h
> @@ -77,4 +77,13 @@
>   */
>  #define FALLOC_FL_UNSHARE_RANGE		0x40
>  
> +/*
> + * FALLOC_FL_ZEROINIT_RANGE is used to reinitialize storage backing a file by
> + * writing zeros to it.  Subsequent reads and writes should not fail due to any
> + * previous media errors.  Blocks must not be shared or require copy on
> + * write.  Holes and unwritten extents are left untouched.  This mode must be
> + * used with FALLOC_FL_KEEP_SIZE.
> + */
> +#define FALLOC_FL_ZEROINIT_RANGE	0x80

Hmmmm.

I think this wants to be a behavioural modifier for existing
operations rather than an operation unto itself. i.e. similar to how
KEEP_SIZE modifies ALLOC behaviour but doesn't fundamentally alter
the guarantees ALLOC provides userspace.

In this case, the change of behaviour over ZERO_RANGE is that we
want physical zeros to be written instead of the filesystem
optimising away the physical zeros by manipulating the layout
of the file.

There's been requests in the past for a way to make ALLOC also
behave like this - in the case that users want fully allocated space
to be preallocated so their applications don't take unwritten extent
conversion penalties on first writes. Databases are an example here,
where setup of a new WAL file isn't performance critical, but writes
to the WAL are and the WAL files are write-once. Hence they always
take unwritten conversion penalties and the only way around that is
to physically zero the files before use...

So it seems to me what we actually need here is a "write zeroes"
modifier to fallocate() operations to tell the filesystem that the
application really wants it to write zeroes over that range, not
just guarantee space has been physically allocated....

Then we have an API that looks like:

	ALLOC		- allocate space efficiently
	ALLOC | INIT	- allocate space by writing zeros to it
	ZERO		- zero data and preallocate space efficiently
	ZERO | INIT	- zero range by writing zeros to it

Which seems to cater for all the cases I know of where physically
writing zeros instead of allocating unwritten extents is the
preferred behaviour of fallocate()....

Cheers,

Dave.
Christoph Hellwig Sept. 21, 2021, 8:31 a.m. UTC | #5
On Tue, Sep 21, 2021 at 10:44:31AM +1000, Dave Chinner wrote:
> I think this wants to be a behavioural modifier for existing
> operations rather than an operation unto itself. i.e. similar to how
> KEEP_SIZE modifies ALLOC behaviour but doesn't fundamentally alter
> the guarantees ALLOC provides userspace.
> 
> In this case, the change of behaviour over ZERO_RANGE is that we
> want physical zeros to be written instead of the filesystem
> optimising away the physical zeros by manipulating the layout
> of the file.

Yes.

> Then we have an API that looks like:
> 
> 	ALLOC		- allocate space efficiently
> 	ALLOC | INIT	- allocate space by writing zeros to it
> 	ZERO		- zero data and preallocate space efficiently
> 	ZERO | INIT	- zero range by writing zeros to it
> 
> Which seems to cater for all the cases I know of where physically
> writing zeros instead of allocating unwritten extents is the
> preferred behaviour of fallocate()....

Agreed.  I'm not sure INIT is really the right name, but I can't come
up with a better idea offhand.
Dan Williams Sept. 22, 2021, 2:16 a.m. UTC | #6
On Tue, Sep 21, 2021 at 1:32 AM Christoph Hellwig <hch@infradead.org> wrote:
>
> On Tue, Sep 21, 2021 at 10:44:31AM +1000, Dave Chinner wrote:
> > I think this wants to be a behavioural modifier for existing
> > operations rather than an operation unto itself. i.e. similar to how
> > KEEP_SIZE modifies ALLOC behaviour but doesn't fundamentally alter
> > the guarantees ALLOC provides userspace.
> >
> > In this case, the change of behaviour over ZERO_RANGE is that we
> > want physical zeros to be written instead of the filesystem
> > optimising away the physical zeros by manipulating the layout
> > of the file.
>
> Yes.
>
> > Then we have an API that looks like:
> >
> >       ALLOC           - allocate space efficiently
> >       ALLOC | INIT    - allocate space by writing zeros to it
> >       ZERO            - zero data and preallocate space efficiently
> >       ZERO | INIT     - zero range by writing zeros to it
> >
> > Which seems to cater for all the cases I know of where physically
> > writing zeros instead of allocating unwritten extents is the
> > preferred behaviour of fallocate()....
>
> Agreed.  I'm not sure INIT is really the right name, but I can't come
> up with a better idea offhand.

FUA? As in, this is a forced-unit-access zeroing all the way to media
bypassing any mechanisms to emulate zero-filled payloads on future
reads.
Darrick J. Wong Sept. 22, 2021, 2:38 a.m. UTC | #7
On Tue, Sep 21, 2021 at 07:16:26PM -0700, Dan Williams wrote:
> On Tue, Sep 21, 2021 at 1:32 AM Christoph Hellwig <hch@infradead.org> wrote:
> >
> > On Tue, Sep 21, 2021 at 10:44:31AM +1000, Dave Chinner wrote:
> > > I think this wants to be a behavioural modifier for existing
> > > operations rather than an operation unto itself. i.e. similar to how
> > > KEEP_SIZE modifies ALLOC behaviour but doesn't fundamentally alter
> > > the guarantees ALLOC provides userspace.
> > >
> > > In this case, the change of behaviour over ZERO_RANGE is that we
> > > want physical zeros to be written instead of the filesystem
> > > optimising away the physical zeros by manipulating the layout
> > > of the file.
> >
> > Yes.
> >
> > > Then we have an API that looks like:
> > >
> > >       ALLOC           - allocate space efficiently
> > >       ALLOC | INIT    - allocate space by writing zeros to it
> > >       ZERO            - zero data and preallocate space efficiently
> > >       ZERO | INIT     - zero range by writing zeros to it
> > >
> > > Which seems to cater for all the cases I know of where physically
> > > writing zeros instead of allocating unwritten extents is the
> > > preferred behaviour of fallocate()....
> >
> > Agreed.  I'm not sure INIT is really the right name, but I can't come
> > up with a better idea offhand.
> 
> FUA? As in, this is a forced-unit-access zeroing all the way to media
> bypassing any mechanisms to emulate zero-filled payloads on future
> reads.

FALLOC_FL_ZERO_EXISTING, because you want to zero the storage that
already exists at that file range?

--D
Dave Chinner Sept. 22, 2021, 3:59 a.m. UTC | #8
On Tue, Sep 21, 2021 at 07:38:01PM -0700, Darrick J. Wong wrote:
> On Tue, Sep 21, 2021 at 07:16:26PM -0700, Dan Williams wrote:
> > On Tue, Sep 21, 2021 at 1:32 AM Christoph Hellwig <hch@infradead.org> wrote:
> > >
> > > On Tue, Sep 21, 2021 at 10:44:31AM +1000, Dave Chinner wrote:
> > > > I think this wants to be a behavioural modifier for existing
> > > > operations rather than an operation unto itself. i.e. similar to how
> > > > KEEP_SIZE modifies ALLOC behaviour but doesn't fundamentally alter
> > > > the guarantees ALLOC provides userspace.
> > > >
> > > > In this case, the change of behaviour over ZERO_RANGE is that we
> > > > want physical zeros to be written instead of the filesystem
> > > > optimising away the physical zeros by manipulating the layout
> > > > of the file.
> > >
> > > Yes.
> > >
> > > > Then we have an API that looks like:
> > > >
> > > >       ALLOC           - allocate space efficiently
> > > >       ALLOC | INIT    - allocate space by writing zeros to it
> > > >       ZERO            - zero data and preallocate space efficiently
> > > >       ZERO | INIT     - zero range by writing zeros to it
> > > >
> > > > Which seems to cater for all the cases I know of where physically
> > > > writing zeros instead of allocating unwritten extents is the
> > > > preferred behaviour of fallocate()....
> > >
> > > Agreed.  I'm not sure INIT is really the right name, but I can't come
> > > up with a better idea offhand.
> > 
> > FUA? As in, this is a forced-unit-access zeroing all the way to media
> > bypassing any mechanisms to emulate zero-filled payloads on future
> > reads.

Yes, that's the semantic we want, but FUA already defines specific
data integrity behaviour in the storage stack w.r.t. volatile
caches.

Also, FUA is associated with devices - it's low level storage jargon
and so is not really appropriate to call a user interface operation
FUA where users have no idea what a "unit" or "access" actually
means.

Hence we should not overload this name with some other operation
that does not have (and should not have) explicit data integrity
requirements. That will just cause confusion for everyone.

> FALLOC_FL_ZERO_EXISTING, because you want to zero the storage that
> already exists at that file range?

IMO that doesn't work as a behavioural modifier for ALLOC because
the ALLOC semantics are explicitly "don't touch existing user
data"...

Cheers,

Dave.
Darrick J. Wong Sept. 22, 2021, 4:13 a.m. UTC | #9
On Wed, Sep 22, 2021 at 01:59:07PM +1000, Dave Chinner wrote:
> On Tue, Sep 21, 2021 at 07:38:01PM -0700, Darrick J. Wong wrote:
> > On Tue, Sep 21, 2021 at 07:16:26PM -0700, Dan Williams wrote:
> > > On Tue, Sep 21, 2021 at 1:32 AM Christoph Hellwig <hch@infradead.org> wrote:
> > > >
> > > > On Tue, Sep 21, 2021 at 10:44:31AM +1000, Dave Chinner wrote:
> > > > > I think this wants to be a behavioural modifier for existing
> > > > > operations rather than an operation unto itself. i.e. similar to how
> > > > > KEEP_SIZE modifies ALLOC behaviour but doesn't fundamentally alter
> > > > > the guarantees ALLOC provides userspace.
> > > > >
> > > > > In this case, the change of behaviour over ZERO_RANGE is that we
> > > > > want physical zeros to be written instead of the filesystem
> > > > > optimising away the physical zeros by manipulating the layout
> > > > > of the file.
> > > >
> > > > Yes.
> > > >
> > > > > Then we have an API that looks like:
> > > > >
> > > > >       ALLOC           - allocate space efficiently
> > > > >       ALLOC | INIT    - allocate space by writing zeros to it
> > > > >       ZERO            - zero data and preallocate space efficiently
> > > > >       ZERO | INIT     - zero range by writing zeros to it
> > > > >
> > > > > Which seems to cater for all the cases I know of where physically
> > > > > writing zeros instead of allocating unwritten extents is the
> > > > > preferred behaviour of fallocate()....
> > > >
> > > > Agreed.  I'm not sure INIT is really the right name, but I can't come
> > > > up with a better idea offhand.
> > > 
> > > FUA? As in, this is a forced-unit-access zeroing all the way to media
> > > bypassing any mechanisms to emulate zero-filled payloads on future
> > > reads.
> 
> Yes, that's the semantic we want, but FUA already defines specific
> data integrity behaviour in the storage stack w.r.t. volatile
> caches.
> 
> Also, FUA is associated with devices - it's low level storage jargon
> and so is not really appropriate to call a user interface operation
> FUA where users have no idea what a "unit" or "access" actually
> means.
> 
> Hence we should not overload this name with some other operation
> that does not have (and should not have) explicit data integrity
> requirements. That will just cause confusion for everyone.
> 
> > FALLOC_FL_ZERO_EXISTING, because you want to zero the storage that
> > already exists at that file range?
> 
> IMO that doesn't work as a behavioural modifier for ALLOC because
> the ALLOC semantics are explicitly "don't touch existing user
> data"...

Well since you can't preallocate /and/ zerorange at the same time...

/* For FALLOC_FL_ZERO_RANGE, write zeroes to pre-existing mapped storage. */
#define FALLOC_FL_ZERO_EXISTING		(0x80)

/* For preallocation, allocate written extents and set the contents to
 * zeroes. */
#define FALLOC_FL_ALLOC_WRITE_ZEROES	(0x80)

--D

> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
Ritesh Harjani Sept. 22, 2021, 5:28 a.m. UTC | #10
On 21/09/21 10:44AM, Dave Chinner wrote:
> On Fri, Sep 17, 2021 at 06:31:01PM -0700, Darrick J. Wong wrote:
> > From: Darrick J. Wong <djwong@kernel.org>
> >
> > Add a new mode to fallocate to zero-initialize all the storage backing a
> > file.
> >
> > Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> > ---
> >  fs/open.c                   |    5 +++++
> >  include/linux/falloc.h      |    1 +
> >  include/uapi/linux/falloc.h |    9 +++++++++
> >  3 files changed, 15 insertions(+)
> >
> >
> > diff --git a/fs/open.c b/fs/open.c
> > index daa324606a41..230220b8f67a 100644
> > --- a/fs/open.c
> > +++ b/fs/open.c
> > @@ -256,6 +256,11 @@ int vfs_fallocate(struct file *file, int mode, loff_t offset, loff_t len)
> >  	    (mode & ~FALLOC_FL_INSERT_RANGE))
> >  		return -EINVAL;
> >
> > +	/* Zeroinit should only be used by itself and keep size must be set. */
> > +	if ((mode & FALLOC_FL_ZEROINIT_RANGE) &&
> > +	    (mode != (FALLOC_FL_ZEROINIT_RANGE | FALLOC_FL_KEEP_SIZE)))
> > +		return -EINVAL;
> > +
> >  	/* Unshare range should only be used with allocate mode. */
> >  	if ((mode & FALLOC_FL_UNSHARE_RANGE) &&
> >  	    (mode & ~(FALLOC_FL_UNSHARE_RANGE | FALLOC_FL_KEEP_SIZE)))
> > diff --git a/include/linux/falloc.h b/include/linux/falloc.h
> > index f3f0b97b1675..4597b416667b 100644
> > --- a/include/linux/falloc.h
> > +++ b/include/linux/falloc.h
> > @@ -29,6 +29,7 @@ struct space_resv {
> >  					 FALLOC_FL_PUNCH_HOLE |		\
> >  					 FALLOC_FL_COLLAPSE_RANGE |	\
> >  					 FALLOC_FL_ZERO_RANGE |		\
> > +					 FALLOC_FL_ZEROINIT_RANGE |	\
> >  					 FALLOC_FL_INSERT_RANGE |	\
> >  					 FALLOC_FL_UNSHARE_RANGE)
> >
> > diff --git a/include/uapi/linux/falloc.h b/include/uapi/linux/falloc.h
> > index 51398fa57f6c..8144403b6102 100644
> > --- a/include/uapi/linux/falloc.h
> > +++ b/include/uapi/linux/falloc.h
> > @@ -77,4 +77,13 @@
> >   */
> >  #define FALLOC_FL_UNSHARE_RANGE		0x40
> >
> > +/*
> > + * FALLOC_FL_ZEROINIT_RANGE is used to reinitialize storage backing a file by
> > + * writing zeros to it.  Subsequent reads and writes should not fail due to any
> > + * previous media errors.  Blocks must not be shared or require copy on
> > + * write.  Holes and unwritten extents are left untouched.  This mode must be
> > + * used with FALLOC_FL_KEEP_SIZE.
> > + */
> > +#define FALLOC_FL_ZEROINIT_RANGE	0x80
>
> Hmmmm.
>
> I think this wants to be a behavioural modifier for existing
> operations rather than an operation unto itself. i.e. similar to how
> KEEP_SIZE modifies ALLOC behaviour but doesn't fundamentally alter
> the guarantees ALLOC provides userspace.
>
> In this case, the change of behaviour over ZERO_RANGE is that we
> want physical zeros to be written instead of the filesystem
> optimising away the physical zeros by manipulating the layout
> of the file.
>
> There's been requests in the past for a way to make ALLOC also
> behave like this - in the case that users want fully allocated space
> to be preallocated so their applications don't take unwritten extent
> conversion penalties on first writes. Databases are an example here,
> where setup of a new WAL file isn't performance critical, but writes
> to the WAL are and the WAL files are write-once. Hence they always
> take unwritten conversion penalties and the only way around that is
> to physically zero the files before use...
>
> So it seems to me what we actually need here is a "write zeroes"
> modifier to fallocate() operations to tell the filesystem that the
> application really wants it to write zeroes over that range, not
> just guarantee space has been physically allocated....
>
> Then we have an API that looks like:
>
> 	ALLOC		- allocate space efficiently
> 	ALLOC | INIT	- allocate space by writing zeros to it
> 	ZERO		- zero data and preallocate space efficiently
> 	ZERO | INIT	- zero range by writing zeros to it
>
> Which seems to cater for all the cases I know of where physically
> writing zeros instead of allocating unwritten extents is the
> preferred behaviour of fallocate()....
>

If that's the case we can just have FALLOC_FL_ZEROWRITE_RANGE?
Where FALLOC_FL_ZERO_RANGE & FALLOC_FL_ZEROWRITE_RANGE are mutually exclusive.

AFAIU,
/* FALLOC_FL_ZERO_RANGE may optimize the underlying blocks with unwritten
 * extents if the filesystem allows so, but with FALLOC_FL_ZEROWRITE_RANGE,
 * the underlying blocks are guaranteed to be written with zeros.
 * In case of hole it will be preallocated with written extents and will be
 * initialized with zeroes. If FALLOC_FL_KEEP_SIZE is specified then the
 * inode size will remain the same.
 *
 * Essentially similar to FALLOC_FL_ZERO_RANGE but with guarantees that
 * underlying storage has written extents initialized with zeroes.
 */
#define FALLOC_FL_ZEROWRITE_RANGE		0x80

Does that make sense?

-ritesh
Dave Chinner Sept. 22, 2021, 5:49 a.m. UTC | #11
On Tue, Sep 21, 2021 at 09:13:54PM -0700, Darrick J. Wong wrote:
> On Wed, Sep 22, 2021 at 01:59:07PM +1000, Dave Chinner wrote:
> > On Tue, Sep 21, 2021 at 07:38:01PM -0700, Darrick J. Wong wrote:
> > > On Tue, Sep 21, 2021 at 07:16:26PM -0700, Dan Williams wrote:
> > > > On Tue, Sep 21, 2021 at 1:32 AM Christoph Hellwig <hch@infradead.org> wrote:
> > > > >
> > > > > On Tue, Sep 21, 2021 at 10:44:31AM +1000, Dave Chinner wrote:
> > > > > > I think this wants to be a behavioural modifier for existing
> > > > > > operations rather than an operation unto itself. i.e. similar to how
> > > > > > KEEP_SIZE modifies ALLOC behaviour but doesn't fundamentally alter
> > > > > > the guarantees ALLOC provides userspace.
> > > > > >
> > > > > > In this case, the change of behaviour over ZERO_RANGE is that we
> > > > > > want physical zeros to be written instead of the filesystem
> > > > > > optimising away the physical zeros by manipulating the layout
> > > > > > of the file.
> > > > >
> > > > > Yes.
> > > > >
> > > > > > Then we have an API that looks like:
> > > > > >
> > > > > >       ALLOC           - allocate space efficiently
> > > > > >       ALLOC | INIT    - allocate space by writing zeros to it
> > > > > >       ZERO            - zero data and preallocate space efficiently
> > > > > >       ZERO | INIT     - zero range by writing zeros to it
> > > > > >
> > > > > > Which seems to cater for all the cases I know of where physically
> > > > > > writing zeros instead of allocating unwritten extents is the
> > > > > > preferred behaviour of fallocate()....
> > > > >
> > > > > Agreed.  I'm not sure INIT is really the right name, but I can't come
> > > > > up with a better idea offhand.
> > > > 
> > > > FUA? As in, this is a forced-unit-access zeroing all the way to media
> > > > bypassing any mechanisms to emulate zero-filled payloads on future
> > > > reads.
> > 
> > Yes, that's the semantic we want, but FUA already defines specific
> > data integrity behaviour in the storage stack w.r.t. volatile
> > caches.
> > 
> > Also, FUA is associated with devices - it's low level storage jargon
> > and so is not really appropriate to call a user interface operation
> > FUA where users have no idea what a "unit" or "access" actually
> > means.
> > 
> > Hence we should not overload this name with some other operation
> > that does not have (and should not have) explicit data integrity
> > requirements. That will just cause confusion for everyone.
> > 
> > > FALLOC_FL_ZERO_EXISTING, because you want to zero the storage that
> > > already exists at that file range?
> > 
> > IMO that doesn't work as a behavioural modifier for ALLOC because
> > the ALLOC semantics are explicitly "don't touch existing user
> > data"...
> 
> Well since you can't preallocate /and/ zerorange at the same time...
> 
> /* For FALLOC_FL_ZERO_RANGE, write zeroes to pre-existing mapped storage. */
> #define FALLOC_FL_ZERO_EXISTING		(0x80)

Except we also want the newly allocated regions (i.e. where holes
were) in that range to have zeroes written to them as
well, yes? Otherwise we end up with a combination of unwritten
extents and physical zeroes, and you can't use
ZERORANGE|EXISTING as a replacement for PUNCH + ALLOC|INIT.

/*
 * For preallocation and zeroing operations, force the filesystem to
 * write zeroes rather than use unwritten extents to indicate the
 * range contains zeroes.
 *
 * For filesystems that support unwritten extents, this trades off
 * slow fallocate performance for faster first write performance as
 * unwritten extent conversion on the first write to each block in
 * the range is not needed.
 *
 * Care is required when using FALLOC_FL_ALLOC_INIT_DATA as it will
 * be much slower overall for large ranges and/or slow storage
 * compared to using unwritten extents.
 */
#define FALLOC_FL_ALLOC_INIT_DATA	(1 << 7)

Cheers,

Dave.
Darrick J. Wong Sept. 22, 2021, 9:27 p.m. UTC | #12
On Wed, Sep 22, 2021 at 03:49:31PM +1000, Dave Chinner wrote:
> On Tue, Sep 21, 2021 at 09:13:54PM -0700, Darrick J. Wong wrote:
> > On Wed, Sep 22, 2021 at 01:59:07PM +1000, Dave Chinner wrote:
> > > On Tue, Sep 21, 2021 at 07:38:01PM -0700, Darrick J. Wong wrote:
> > > > On Tue, Sep 21, 2021 at 07:16:26PM -0700, Dan Williams wrote:
> > > > > On Tue, Sep 21, 2021 at 1:32 AM Christoph Hellwig <hch@infradead.org> wrote:
> > > > > >
> > > > > > On Tue, Sep 21, 2021 at 10:44:31AM +1000, Dave Chinner wrote:
> > > > > > > I think this wants to be a behavioural modifier for existing
> > > > > > > operations rather than an operation unto itself. i.e. similar to how
> > > > > > > KEEP_SIZE modifies ALLOC behaviour but doesn't fundamentally alter
> > > > > > > the guarantees ALLOC provides userspace.
> > > > > > >
> > > > > > > In this case, the change of behaviour over ZERO_RANGE is that we
> > > > > > > want physical zeros to be written instead of the filesystem
> > > > > > > optimising away the physical zeros by manipulating the layout
> > > > > > > of the file.
> > > > > >
> > > > > > Yes.
> > > > > >
> > > > > > > Then we have an API that looks like:
> > > > > > >
> > > > > > >       ALLOC           - allocate space efficiently
> > > > > > >       ALLOC | INIT    - allocate space by writing zeros to it
> > > > > > >       ZERO            - zero data and preallocate space efficiently
> > > > > > >       ZERO | INIT     - zero range by writing zeros to it
> > > > > > >
> > > > > > > Which seems to cater for all the cases I know of where physically
> > > > > > > writing zeros instead of allocating unwritten extents is the
> > > > > > > preferred behaviour of fallocate()....
> > > > > >
> > > > > > Agreed.  I'm not sure INIT is really the right name, but I can't come
> > > > > > up with a better idea offhand.
> > > > > 
> > > > > FUA? As in, this is a forced-unit-access zeroing all the way to media
> > > > > bypassing any mechanisms to emulate zero-filled payloads on future
> > > > > reads.
> > > 
> > > Yes, that's the semantic we want, but FUA already defines specific
> > > data integrity behaviour in the storage stack w.r.t. volatile
> > > caches.
> > > 
> > > Also, FUA is associated with devices - it's low level storage jargon
> > > and so is not really appropriate to call a user interface operation
> > > FUA where users have no idea what a "unit" or "access" actually
> > > means.
> > > 
> > > Hence we should not overload this name with some other operation
> > > that does not have (and should not have) explicit data integrity
> > > requirements. That will just cause confusion for everyone.
> > > 
> > > > FALLOC_FL_ZERO_EXISTING, because you want to zero the storage that
> > > > already exists at that file range?
> > > 
> > > IMO that doesn't work as a behavioural modifier for ALLOC because
> > > the ALLOC semantics are explicitly "don't touch existing user
> > > data"...
> > 
> > Well since you can't preallocate /and/ zerorange at the same time...
> > 
> > /* For FALLOC_FL_ZERO_RANGE, write zeroes to pre-existing mapped storage. */
> > #define FALLOC_FL_ZERO_EXISTING		(0x80)
> 
> Except we also want the newly allocated regions (i.e. where holes
> were) in that range being zeroed to have zeroes written to them as
> well, yes? Otherwise we end up with a combination of unwritten
> extents and physical zeroes, and you can't use
> ZERORANGE|EXISTING as a replacement for PUNCH + ALLOC|INIT
> 
> /*
>  * For preallocation and zeroing operations, force the filesystem to
>  * write zeroes rather than use unwritten extents to indicate the
>  * range contains zeroes.
>  *
>  * For filesystems that support unwritten extents, this trades off
>  * slow fallocate performance for faster first write performance as
>  * unwritten extent conversion on the first write to each block in
>  * the range is not needed.
>  *
>  * Care is required when using FALLOC_FL_ALLOC_INIT_DATA as it will
>  * be much slower overall for large ranges and/or slow storage
>  * compared to using unwritten extents.
>  */
> #define FALLOC_FL_ALLOC_INIT_DATA	(1 << 7)

I prefer FALLOC_FL_ZEROINIT_DATA here, because in the ZERO|INIT case
we're not allocating any new space, merely rewriting existing storage.
I also want to expand the description slightly:

/*
 * For preallocation, force the filesystem to write zeroes rather than
 * use unwritten extents to indicate the range contains zeroes.  For
 * zeroing operations, force the filesystem to write zeroes to existing
 * written extents.

--D

> 
> Cheers,
> 
> Dave.
> 
> -- 
> Dave Chinner
> david@fromorbit.com
Darrick J. Wong Sept. 23, 2021, 12:02 a.m. UTC | #13
On Wed, Sep 22, 2021 at 02:27:25PM -0700, Darrick J. Wong wrote:
> On Wed, Sep 22, 2021 at 03:49:31PM +1000, Dave Chinner wrote:
> > On Tue, Sep 21, 2021 at 09:13:54PM -0700, Darrick J. Wong wrote:
> > > On Wed, Sep 22, 2021 at 01:59:07PM +1000, Dave Chinner wrote:
> > > > On Tue, Sep 21, 2021 at 07:38:01PM -0700, Darrick J. Wong wrote:
> > > > > On Tue, Sep 21, 2021 at 07:16:26PM -0700, Dan Williams wrote:
> > > > > > On Tue, Sep 21, 2021 at 1:32 AM Christoph Hellwig <hch@infradead.org> wrote:
> > > > > > >
> > > > > > > On Tue, Sep 21, 2021 at 10:44:31AM +1000, Dave Chinner wrote:
> > > > > > > > I think this wants to be a behavioural modifier for existing
> > > > > > > > operations rather than an operation unto itself. i.e. similar to how
> > > > > > > > KEEP_SIZE modifies ALLOC behaviour but doesn't fundamentally alter
> > > > > > > > the guarantees ALLOC provides userspace.
> > > > > > > >
> > > > > > > > In this case, the change of behaviour over ZERO_RANGE is that we
> > > > > > > > want physical zeros to be written instead of the filesystem
> > > > > > > > optimising away the physical zeros by manipulating the layout
> > > > > > > > of the file.
> > > > > > >
> > > > > > > Yes.
> > > > > > >
> > > > > > > > Then we have an API that looks like:
> > > > > > > >
> > > > > > > >       ALLOC           - allocate space efficiently
> > > > > > > >       ALLOC | INIT    - allocate space by writing zeros to it
> > > > > > > >       ZERO            - zero data and preallocate space efficiently
> > > > > > > >       ZERO | INIT     - zero range by writing zeros to it
> > > > > > > >
> > > > > > > > Which seems to cater for all the cases I know of where physically
> > > > > > > > writing zeros instead of allocating unwritten extents is the
> > > > > > > > preferred behaviour of fallocate()....
> > > > > > >
> > > > > > > Agreed.  I'm not sure INIT is really the right name, but I can't come
> > > > > > > up with a better idea offhand.
> > > > > > 
> > > > > > FUA? As in, this is a forced-unit-access zeroing all the way to media
> > > > > > bypassing any mechanisms to emulate zero-filled payloads on future
> > > > > > reads.
> > > > 
> > > > Yes, that's the semantic we want, but FUA already defines specific
> > > > data integrity behaviour in the storage stack w.r.t. volatile
> > > > caches.
> > > > 
> > > > Also, FUA is associated with devices - it's low level storage jargon
> > > > and so is not really appropriate to call a user interface operation
> > > > FUA where users have no idea what a "unit" or "access" actually
> > > > means.
> > > > 
> > > > Hence we should not overload this name with some other operation
> > > > that does not have (and should not have) explicit data integrity
> > > > requirements. That will just cause confusion for everyone.
> > > > 
> > > > > FALLOC_FL_ZERO_EXISTING, because you want to zero the storage that
> > > > > already exists at that file range?
> > > > 
> > > > IMO that doesn't work as a behavioural modifier for ALLOC because
> > > > the ALLOC semantics are explicitly "don't touch existing user
> > > > data"...
> > > 
> > > Well since you can't preallocate /and/ zerorange at the same time...
> > > 
> > > /* For FALLOC_FL_ZERO_RANGE, write zeroes to pre-existing mapped storage. */
> > > #define FALLOC_FL_ZERO_EXISTING		(0x80)
> > 
> > Except we also want the newly allocated regions (i.e. where holes
> > were) in that range being zeroed to have zeroes written to them as
> > well, yes? Otherwise we end up with a combination of unwritten
> > extents and physical zeroes, and you can't use
> > ZERORANGE|EXISTING as a replacement for PUNCH + ALLOC|INIT

Ooookay.  This is drifting further from the original problem of wanting
to write a buffer of zeroes to an already-mapped extent.

What if part of the region is shared?  If the goal is to make a read
return zeroes, then the shared extent must be punched, right?  Should
the new region be preallocated?

What if part of the region is unwritten?  Should zeroing convert that to
written at the same time?  This isn't required to solve the problem, but
"force the filesystem to write zeroes" implies that's required.  Should
preallocation start converting unwritten extents too?

What if part of the region is sparse?  Preallocation should allocate a
written extent, but that wasn't the problem I was focusing on.  Should
zeroing preallocate a written extent?  This also isn't required to solve
my problem, but this extension of the API definition implies this too.

What if part of the region is delalloc?  Should preallocation allocate a
written extent here too?  Should zeroing?

For ALLOC|INITDATA, I think it suffices to map new written extents into
holes with BMAPI_ZERO and do no more work than that.

For ZERO|INITDATA, I /think/ I can solve all of the above by writing
zeroes to the page cache and then switching to regular preallocation to
fill the holes.

--D

> > 
> > /*
> >  * For preallocation and zeroing operations, force the filesystem to
> >  * write zeroes rather than use unwritten extents to indicate the
> >  * range contains zeroes.
> >  *
> >  * For filesystems that support unwritten extents, this trades off
> >  * slow fallocate performance for faster first write performance as
> >  * unwritten extent conversion on the first write to each block in
> >  * the range is not needed.
> >  *
> >  * Care is required when using FALLOC_FL_ALLOC_INIT_DATA as it will
> >  * be much slower overall for large ranges and/or slow storage
> >  * compared to using unwritten extents.
> >  */
> > #define FALLOC_FL_ALLOC_INIT_DATA	(1 << 7)
> 
> I prefer FALLOC_FL_ZEROINIT_DATA here, because in the ZERO|INIT case
> we're not allocating any new space, merely rewriting existing storage.
> I also want to expand the description slightly:
> 
> /*
>  * For preallocation, force the filesystem to write zeroes rather than
>  * use unwritten extents to indicate the range contains zeroes.  For
>  * zeroing operations, force the filesystem to write zeroes to existing
>  * written extents.
> 
> --D
> 
> > 
> > Cheers,
> > 
> > Dave.
> > 
> > -- 
> > Dave Chinner
> > david@fromorbit.com
Darrick J. Wong Sept. 23, 2021, 12:44 a.m. UTC | #14
On Wed, Sep 22, 2021 at 05:02:55PM -0700, Darrick J. Wong wrote:
> On Wed, Sep 22, 2021 at 02:27:25PM -0700, Darrick J. Wong wrote:
> > On Wed, Sep 22, 2021 at 03:49:31PM +1000, Dave Chinner wrote:
> > > On Tue, Sep 21, 2021 at 09:13:54PM -0700, Darrick J. Wong wrote:
> > > > On Wed, Sep 22, 2021 at 01:59:07PM +1000, Dave Chinner wrote:
> > > > > On Tue, Sep 21, 2021 at 07:38:01PM -0700, Darrick J. Wong wrote:
> > > > > > On Tue, Sep 21, 2021 at 07:16:26PM -0700, Dan Williams wrote:
> > > > > > > On Tue, Sep 21, 2021 at 1:32 AM Christoph Hellwig <hch@infradead.org> wrote:
> > > > > > > >
> > > > > > > > On Tue, Sep 21, 2021 at 10:44:31AM +1000, Dave Chinner wrote:
> > > > > > > > > I think this wants to be a behavioural modifier for existing
> > > > > > > > > operations rather than an operation unto itself. i.e. similar to how
> > > > > > > > > KEEP_SIZE modifies ALLOC behaviour but doesn't fundamentally alter
> > > > > > > > > the guarantees ALLOC provides userspace.
> > > > > > > > >
> > > > > > > > > In this case, the change of behaviour over ZERO_RANGE is that we
> > > > > > > > > want physical zeros to be written instead of the filesystem
> > > > > > > > > optimising away the physical zeros by manipulating the layout
> > > > > > > > > of the file.
> > > > > > > >
> > > > > > > > Yes.
> > > > > > > >
> > > > > > > > > Then we have an API that looks like:
> > > > > > > > >
> > > > > > > > >       ALLOC           - allocate space efficiently
> > > > > > > > >       ALLOC | INIT    - allocate space by writing zeros to it
> > > > > > > > >       ZERO            - zero data and preallocate space efficiently
> > > > > > > > >       ZERO | INIT     - zero range by writing zeros to it
> > > > > > > > >
> > > > > > > > > Which seems to cater for all the cases I know of where physically
> > > > > > > > > writing zeros instead of allocating unwritten extents is the
> > > > > > > > > preferred behaviour of fallocate()....
> > > > > > > >
> > > > > > > > Agreed.  I'm not sure INIT is really the right name, but I can't come
> > > > > > > > up with a better idea offhand.
> > > > > > > 
> > > > > > > FUA? As in, this is a forced-unit-access zeroing all the way to media
> > > > > > > bypassing any mechanisms to emulate zero-filled payloads on future
> > > > > > > reads.
> > > > > 
> > > > > Yes, that's the semantic we want, but FUA already defines specific
> > > > > data integrity behaviour in the storage stack w.r.t. volatile
> > > > > caches.
> > > > > 
> > > > > Also, FUA is associated with devices - it's low level storage jargon
> > > > > and so is not really appropriate to call a user interface operation
> > > > > FUA where users have no idea what a "unit" or "access" actually
> > > > > means.
> > > > > 
> > > > > Hence we should not overload this name with some other operation
> > > > > that does not have (and should not have) explicit data integrity
> > > > > requirements. That will just cause confusion for everyone.
> > > > > 
> > > > > > FALLOC_FL_ZERO_EXISTING, because you want to zero the storage that
> > > > > > already exists at that file range?
> > > > > 
> > > > > IMO that doesn't work as a behavioural modifier for ALLOC because
> > > > > the ALLOC semantics are explicitly "don't touch existing user
> > > > > data"...
> > > > 
> > > > Well since you can't preallocate /and/ zerorange at the same time...
> > > > 
> > > > /* For FALLOC_FL_ZERO_RANGE, write zeroes to pre-existing mapped storage. */
> > > > #define FALLOC_FL_ZERO_EXISTING		(0x80)
> > > 
> > > Except we also want the newly allocated regions (i.e. where holes
> > > were) in that range being zeroed to have zeroes written to them as
> > > well, yes? Otherwise we end up with a combination of unwritten
> > > extents and physical zeroes, and you can't use
> > > ZERORANGE|EXISTING as a replacement for PUNCH + ALLOC|INIT
> 
> Ooookay.  This is drifting further from the original problem of wanting
> to write a buffer of zeroes to an already-mapped extent.
> 
> What if part of the region is shared?  If the goal is to make a read
> return zeroes, then the shared extent must be punched, right?  Should
> the new region be preallocated?
> 
> What if part of the region is unwritten?  Should zeroing convert that to
> written at the same time?  This isn't required to solve the problem, but
> "force the filesystem to write zeroes" implies that's required.  Should
> preallocation start converting unwritten extents too?
> 
> What if part of the region is sparse?  Preallocation should allocate a
> written extent, but that wasn't the problem I was focusing on.  Should
> zeroing preallocate a written extent?  This also isn't required to solve
> my problem, but this extension of the API definition implies this too.
> 
> What if part of the region is delalloc?  Should preallocation allocate a
> written extent here too?  Should zeroing?
> 
> For ALLOC|INITDATA, I think it suffices to map new written extents into
> holes with BMAPI_ZERO and do no more work than that.
> 
> For ZERO|INITDATA, I /think/ I can solve all of the above by writing
> zeroes to the page cache and then switching to regular preallocation to
> fill the holes.

No, because that will flood the page cache with zeroed pages, which will
blow out the cache and is only marginally better than pwrite, which has
already been denied.  That's right, that's why we wanted all these
fancier functions anyway.

Ok so first we flush the page cache, then we walk the extent map to
unmap the shared extents and convert the unwritten extents to written
extents.  Then we can use blkdev_issue_zeroout and dax_zero_page_range
to convert the written extents to ...

That's too many times through the file mappings.

First flush the page cache, then loop through the mappings via
xfs_bmapi_read.  If the mapping is sparse, replace it with a
written/zeroed extent.  If the extent is unwritten, convert it to a
written/zeroed extent.  If the mapping is a shared, punch it and replace
it with a written/zeroed extent.  If the mapping is written, go call the
fancy functions directly, no iomappings needed or required.

And now I have to go do the same for ext4, since we have no shared code
anymore.  Crazylevel now approaching 130%...

--D

> 
> --D
> 
> > > 
> > > /*
> > >  * For preallocation and zeroing operations, force the filesystem to
> > >  * write zeroes rather than use unwritten extents to indicate the
> > >  * range contains zeroes.
> > >  *
> > >  * For filesystems that support unwritten extents, this trades off
> > >  * slow fallocate performance for faster first write performance as
> > >  * unwritten extent conversion on the first write to each block in
> > >  * the range is not needed.
> > >  *
> > >  * Care is required when using FALLOC_FL_ALLOC_INIT_DATA as it will
> > >  * be much slower overall for large ranges and/or slow storage
> > >  * compared to using unwritten extents.
> > >  */
> > > #define FALLOC_FL_ALLOC_INIT_DATA	(1 << 7)
> > 
> > I prefer FALLOC_FL_ZEROINIT_DATA here, because in the ZERO|INIT case
> > we're not allocating any new space, merely rewriting existing storage.
> > I also want to expand the description slightly:
> > 
> > /*
> >  * For preallocation, force the filesystem to write zeroes rather than
> >  * use unwritten extents to indicate the range contains zeroes.  For
> >  * zeroing operations, force the filesystem to write zeroes to existing
> >  * written extents.
> > 
> > --D
> > 
> > > 
> > > Cheers,
> > > 
> > > Dave.
> > > 
> > > -- 
> > > Dave Chinner
> > > david@fromorbit.com
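The single-pass walk sketched in #14 amounts to something like the following pseudocode. The function names are illustrative only, loosely modelled on XFS internals; nothing here is a real patch:

```
/* pseudocode - illustrative only, not a real kernel change */
zeroinit_range(inode, off, len):
	filemap_write_and_wait_range(mapping, off, off + len - 1)
	truncate_pagecache_range(inode, off, off + len - 1)
	for each mapping m returned by xfs_bmapi_read(inode, off, len):
		if m is a hole or delalloc:
			allocate a written extent with the blocks zeroed
		else if m is unwritten:
			convert to written and zero the blocks
		else if m is shared:
			punch out the shared blocks, then allocate a
			written, zeroed replacement extent
		else:	/* ordinary written extent */
			blkdev_issue_zeroout() or dax_zero_page_range()
```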
Dave Chinner Sept. 23, 2021, 1:42 a.m. UTC | #15
On Wed, Sep 22, 2021 at 05:02:55PM -0700, Darrick J. Wong wrote:
> On Wed, Sep 22, 2021 at 02:27:25PM -0700, Darrick J. Wong wrote:
> > On Wed, Sep 22, 2021 at 03:49:31PM +1000, Dave Chinner wrote:
> > > On Tue, Sep 21, 2021 at 09:13:54PM -0700, Darrick J. Wong wrote:
> > > > On Wed, Sep 22, 2021 at 01:59:07PM +1000, Dave Chinner wrote:
> > > > > On Tue, Sep 21, 2021 at 07:38:01PM -0700, Darrick J. Wong wrote:
> > > > > > On Tue, Sep 21, 2021 at 07:16:26PM -0700, Dan Williams wrote:
> > > > > > > On Tue, Sep 21, 2021 at 1:32 AM Christoph Hellwig <hch@infradead.org> wrote:
> > > > > > > >
> > > > > > > > On Tue, Sep 21, 2021 at 10:44:31AM +1000, Dave Chinner wrote:
> > > > > > > > > I think this wants to be a behavioural modifier for existing
> > > > > > > > > operations rather than an operation unto itself. i.e. similar to how
> > > > > > > > > KEEP_SIZE modifies ALLOC behaviour but doesn't fundamentally alter
> > > > > > > > > the guarantees ALLOC provides userspace.
> > > > > > > > >
> > > > > > > > > In this case, the change of behaviour over ZERO_RANGE is that we
> > > > > > > > > want physical zeros to be written instead of the filesystem
> > > > > > > > > optimising away the physical zeros by manipulating the layout
> > > > > > > > > of the file.
> > > > > > > >
> > > > > > > > Yes.
> > > > > > > >
> > > > > > > > > Then we have an API that looks like:
> > > > > > > > >
> > > > > > > > >       ALLOC           - allocate space efficiently
> > > > > > > > >       ALLOC | INIT    - allocate space by writing zeros to it
> > > > > > > > >       ZERO            - zero data and preallocate space efficiently
> > > > > > > > >       ZERO | INIT     - zero range by writing zeros to it
> > > > > > > > >
> > > > > > > > > Which seems to cater for all the cases I know of where physically
> > > > > > > > > writing zeros instead of allocating unwritten extents is the
> > > > > > > > > preferred behaviour of fallocate()....
> > > > > > > >
> > > > > > > > Agreed.  I'm not sure INIT is really the right name, but I can't come
> > > > > > > > up with a better idea offhand.
> > > > > > > 
> > > > > > > FUA? As in, this is a forced-unit-access zeroing all the way to media
> > > > > > > bypassing any mechanisms to emulate zero-filled payloads on future
> > > > > > > reads.
> > > > > 
> > > > > Yes, that's the semantic we want, but FUA already defines specific
> > > > > data integrity behaviour in the storage stack w.r.t. volatile
> > > > > caches.
> > > > > 
> > > > > Also, FUA is associated with devices - it's low level storage jargon
> > > > > and so is not really appropriate to call a user interface operation
> > > > > FUA where users have no idea what a "unit" or "access" actually
> > > > > means.
> > > > > 
> > > > > Hence we should not overload this name with some other operation
> > > > > that does not have (and should not have) explicit data integrity
> > > > > requirements. That will just cause confusion for everyone.
> > > > > 
> > > > > > FALLOC_FL_ZERO_EXISTING, because you want to zero the storage that
> > > > > > already exists at that file range?
> > > > > 
> > > > > IMO that doesn't work as a behavioural modifier for ALLOC because
> > > > > the ALLOC semantics are explicitly "don't touch existing user
> > > > > data"...
> > > > 
> > > > Well since you can't preallocate /and/ zerorange at the same time...
> > > > 
> > > > /* For FALLOC_FL_ZERO_RANGE, write zeroes to pre-existing mapped storage. */
> > > > #define FALLOC_FL_ZERO_EXISTING		(0x80)
> > > 
> > > Except we also want the newly allocated regions (i.e. where holes
> > > were) in that range being zeroed to have zeroes written to them as
> > > well, yes? Otherwise we end up with a combination of unwritten
> > > extents and physical zeroes, and you can't use
> > > ZERORANGE|EXISTING as a replacement for PUNCH + ALLOC|INIT
> 
> Ooookay.  This is drifting further from the original problem of wanting
> to write a buffer of zeroes to an already-mapped extent.

Yup, that's because fallocate() is a multiplexed set of generic
operations. Adding one-off functionality for a specific use case
forces us to break the problem down into how it maps, as a generic
operation, onto the rest of the functionality that fallocate()
provides.

Please don't get frustrated here - the discussion needs to be had as
to why fallocate() is the wrong place to be managing *storage
hardware state* - the API and scope creep once the operation is
broken down to a generic behaviour demonstrate this clearly.

> What if part of the region is shared?  If the goal is to make a read
> return zeroes, then the shared extent must be punched, right?  Should
> the new region be preallocated?

[....]

These are implementation details once the API has been defined.

And, yes, I agree, it's nuts that just adding "DAX clear poison"
effectively degenerates into "rewrite the entire fallocate()
implementation in XFS", but that's what adding "write physical
zeroes rather than minimising IO overhead" semantics to generic
fallocate() operations results in.

Fundamentally, fallocate() is not for managing storage hardware
state. It's for optimising away bulk data operations by replacing
them with fast filesystem metadata manipulations and/or hardware
acceleration of the bulk data operation.

So, for clearing pmem hardware poison, we already know that it
requires writing zeros to the poisoned range, and we know where in
the file those zeroes need to be written. So what exactly is the
problem with doing this:

	iov.iov_base = zero_buf;
	iov.iov_len = zero_len;
	pwritev2(fd, &iov, 1, zero_offset, RWF_SYNC | RWF_CLEAR_HWERRORS);

Where RWF_CLEAR_HWERRORS tells the low level hardware IO code to
clear error states before performing the write of the user data
provided?

The API doesn't get much simpler or explict, and the
RWF_CLEAR_HWERRORS flag is trivial to propagate all the way down
into the DAX IO code where it can be performed appropriately before
performing the write of the new user data into that range. And it's
way simpler than plumbing this sort of things into fallocate().


We need an API that provides a one-off, single data write semantic
to be defined and implemented to manage pmem hardware state. We
already have that in pwritev2().

Hence this discussion leads me to conclude that fallocate() simply
isn't the right interface to clear storage hardware poison state and
it's much simpler for everyone - kernel and userspace - to provide a
pwritev2(RWF_CLEAR_HWERROR) flag to directly instruct the IO path to
clear hardware error state before issuing this user write to the
hardware.

Cheers,

Dave.
Dan Williams Sept. 23, 2021, 2:43 a.m. UTC | #16
On Wed, Sep 22, 2021 at 6:42 PM Dave Chinner <david@fromorbit.com> wrote:
[..]
> Hence this discussion leads me to conclude that fallocate() simply
> isn't the right interface to clear storage hardware poison state and
> it's much simpler for everyone - kernel and userspace - to provide a
> pwritev2(RWF_CLEAR_HWERROR) flag to directly instruct the IO path to
> clear hardware error state before issuing this user write to the
> hardware.

That flag would slot in nicely in dax_iomap_iter() as the gate for
whether dax_direct_access() should allow mapping over error ranges,
and then as a flag to dax_copy_from_iter() to indicate that it should
compare the incoming write to known poison and clear it before
proceeding.

I like the distinction, because there's a chance the application did
not know that the page had experienced data loss and might want the
error behavior. The other service the driver could offer with this
flag is to do a precise check of the incoming write to make sure it
overlaps known poison and then repair the entire page. Repairing whole
pages makes for a cleaner implementation of the code that tries to
keep poison out of the CPU speculation path, {set,clear}_mce_nospec().
Dan Williams Sept. 23, 2021, 5:42 a.m. UTC | #17
On Wed, Sep 22, 2021 at 7:43 PM Dan Williams <dan.j.williams@intel.com> wrote:
>
> On Wed, Sep 22, 2021 at 6:42 PM Dave Chinner <david@fromorbit.com> wrote:
> [..]
> > Hence this discussion leads me to conclude that fallocate() simply
> > isn't the right interface to clear storage hardware poison state and
> > it's much simpler for everyone - kernel and userspace - to provide a
> > pwritev2(RWF_CLEAR_HWERROR) flag to directly instruct the IO path to
> > clear hardware error state before issuing this user write to the
> > hardware.
>
> That flag would slot in nicely in dax_iomap_iter() as the gate for
> whether dax_direct_access() should allow mapping over error ranges,
> and then as a flag to dax_copy_from_iter() to indicate that it should
> compare the incoming write to known poison and clear it before
> proceeding.
>
> I like the distinction, because there's a chance the application did
> not know that the page had experienced data loss and might want the
> error behavior. The other service the driver could offer with this
> flag is to do a precise check of the incoming write to make sure it
> overlaps known poison and then repair the entire page. Repairing whole
> pages makes for a cleaner implementation of the code that tries to
> keep poison out of the CPU speculation path, {set,clear}_mce_nospec().

This flag could also be useful for preadv2() as there is currently no
way to read the good data in a PMEM page with poison via DAX. So the
flag would tell dax_direct_access() to again proceed in the face of
errors, but then the driver's dax_copy_to_iter() operation could
either read up to the precise byte offset of the error in the page, or
autoreplace error data with zero's to try to maximize data recovery.
Dave Chinner Sept. 23, 2021, 10:54 p.m. UTC | #18
On Wed, Sep 22, 2021 at 10:42:11PM -0700, Dan Williams wrote:
> On Wed, Sep 22, 2021 at 7:43 PM Dan Williams <dan.j.williams@intel.com> wrote:
> >
> > On Wed, Sep 22, 2021 at 6:42 PM Dave Chinner <david@fromorbit.com> wrote:
> > [..]
> > > Hence this discussion leads me to conclude that fallocate() simply
> > > isn't the right interface to clear storage hardware poison state and
> > > it's much simpler for everyone - kernel and userspace - to provide a
> > > pwritev2(RWF_CLEAR_HWERROR) flag to directly instruct the IO path to
> > > clear hardware error state before issuing this user write to the
> > > hardware.
> >
> > That flag would slot in nicely in dax_iomap_iter() as the gate for
> > whether dax_direct_access() should allow mapping over error ranges,
> > and then as a flag to dax_copy_from_iter() to indicate that it should
> > compare the incoming write to known poison and clear it before
> > proceeding.
> >
> > I like the distinction, because there's a chance the application did
> > not know that the page had experienced data loss and might want the
> > error behavior. The other service the driver could offer with this
> > flag is to do a precise check of the incoming write to make sure it
> > overlaps known poison and then repair the entire page. Repairing whole
> > pages makes for a cleaner implementation of the code that tries to
> > keep poison out of the CPU speculation path, {set,clear}_mce_nospec().
> 
> This flag could also be useful for preadv2() as there is currently no
> way to read the good data in a PMEM page with poison via DAX. So the
> flag would tell dax_direct_access() to again proceed in the face of
> errors, but then the driver's dax_copy_to_iter() operation could
> either read up to the precise byte offset of the error in the page, or
> autoreplace error data with zero's to try to maximize data recovery.

Yes, it could. I like the idea - say RWF_IGNORE_HWERROR - to read
everything that can be read from the bad range because it's the
other half of the problem RWF_RESET_HWERROR is trying to address.
That is, the operation we want to perform on a range with an error
state is -data recovery-, not "reinitialisation". Data recovery
requires two steps:

- "try to recover the data from the bad storage"; and
- "reinitialise the data and clear the error state"

These naturally map to read() and write() operations, not
fallocate(). With RWF flags they become explicit data recovery
operations, unlike fallocate() which needs to imply that "writing
zeroes" == "reset hardware error state". While that reset method
may be true for a specific pmem hardware implementation it is not a
requirement for all storage hardware. It's most definitely not a
requirement for future storage hardware, either.

It also means that applications have no choice in what data they can
use to reinitialise the damaged range with because fallocate() only
supports writing zeroes. If we've recovered data via a read() as you
suggest we could, then we can rebuild the data from other redundant
information and immediately write that back to the storage, hence
repairing the fault.

That, in turn, allows the filesystem to turn the RWF_RESET_HWERROR
write into an exclusive operation and hence allow the
reinitialisation with the recovered/repaired state to run atomically
w.r.t. all other filesystem operations.  i.e. the reset write
completes the recovery operation instead of requiring separate
"reset" and "write recovered data into zeroed range" steps that
cannot be executed atomically by userspace...

Cheers,

Dave.
Dan Williams Sept. 24, 2021, 1:18 a.m. UTC | #19
On Thu, Sep 23, 2021 at 3:54 PM Dave Chinner <david@fromorbit.com> wrote:
>
> On Wed, Sep 22, 2021 at 10:42:11PM -0700, Dan Williams wrote:
> > On Wed, Sep 22, 2021 at 7:43 PM Dan Williams <dan.j.williams@intel.com> wrote:
> > >
> > > On Wed, Sep 22, 2021 at 6:42 PM Dave Chinner <david@fromorbit.com> wrote:
> > > [..]
> > > > Hence this discussion leads me to conclude that fallocate() simply
> > > > isn't the right interface to clear storage hardware poison state and
> > > > it's much simpler for everyone - kernel and userspace - to provide a
> > > > pwritev2(RWF_CLEAR_HWERROR) flag to directly instruct the IO path to
> > > > clear hardware error state before issuing this user write to the
> > > > hardware.
> > >
> > > That flag would slot in nicely in dax_iomap_iter() as the gate for
> > > whether dax_direct_access() should allow mapping over error ranges,
> > > and then as a flag to dax_copy_from_iter() to indicate that it should
> > > compare the incoming write to known poison and clear it before
> > > proceeding.
> > >
> > > I like the distinction, because there's a chance the application did
> > > not know that the page had experienced data loss and might want the
> > > error behavior. The other service the driver could offer with this
> > > flag is to do a precise check of the incoming write to make sure it
> > > overlaps known poison and then repair the entire page. Repairing whole
> > > pages makes for a cleaner implementation of the code that tries to
> > > keep poison out of the CPU speculation path, {set,clear}_mce_nospec().
> >
> > This flag could also be useful for preadv2() as there is currently no
> > way to read the good data in a PMEM page with poison via DAX. So the
> > flag would tell dax_direct_access() to again proceed in the face of
> > errors, but then the driver's dax_copy_to_iter() operation could
> > either read up to the precise byte offset of the error in the page, or
> > autoreplace error data with zero's to try to maximize data recovery.
>
> Yes, it could. I like the idea - say RWF_IGNORE_HWERROR - to read
> everything that can be read from the bad range because it's the
> other half of the problem RWF_RESET_HWERROR is trying to address.
> That is, the operation we want to perform on a range with an error
> state is -data recovery-, not "reinitialisation". Data recovery
> requires two steps:
>
> - "try to recover the data from the bad storage"; and
> - "reinitialise the data and clear the error state"
>
> These naturally map to read() and write() operations, not
> fallocate(). With RWF flags they become explicit data recovery
> operations, unlike fallocate() which needs to imply that "writing
> zeroes" == "reset hardware error state". While that reset method
> may be true for a specific pmem hardware implementation it is not a
> requirement for all storage hardware. It's most definitely not a
> requirement for future storage hardware, either.
>
> It also means that applications have no choice in what data they can
> use to reinitialise the damaged range with because fallocate() only
> supports writing zeroes. If we've recovered data via a read() as you
> suggest we could, then we can rebuild the data from other redundant
> information and immediately write that back to the storage, hence
> repairing the fault.
>
> That, in turn, allows the filesystem to turn the RWF_RESET_HWERROR
> write into an exclusive operation and hence allow the
> reinitialisation with the recovered/repaired state to run atomically
> w.r.t. all other filesystem operations.  i.e. the reset write
> completes the recovery operation instead of requiring separate
> "reset" and "write recovered data into zeroed range" steps that
> cannot be executed atomically by userspace...

/me nods

Jane, want to take a run at patches for this ^^^?
Jane Chu Sept. 24, 2021, 1:21 a.m. UTC | #20
On 9/23/2021 6:18 PM, Dan Williams wrote:
> On Thu, Sep 23, 2021 at 3:54 PM Dave Chinner <david@fromorbit.com> wrote:
>>
>> On Wed, Sep 22, 2021 at 10:42:11PM -0700, Dan Williams wrote:
>>> On Wed, Sep 22, 2021 at 7:43 PM Dan Williams <dan.j.williams@intel.com> wrote:
>>>>
>>>> On Wed, Sep 22, 2021 at 6:42 PM Dave Chinner <david@fromorbit.com> wrote:
>>>> [..]
>>>>> Hence this discussion leads me to conclude that fallocate() simply
>>>>> isn't the right interface to clear storage hardware poison state and
>>>>> it's much simpler for everyone - kernel and userspace - to provide a
>>>>> pwritev2(RWF_CLEAR_HWERROR) flag to directly instruct the IO path to
>>>>> clear hardware error state before issuing this user write to the
>>>>> hardware.
>>>>
>>>> That flag would slot in nicely in dax_iomap_iter() as the gate for
>>>> whether dax_direct_access() should allow mapping over error ranges,
>>>> and then as a flag to dax_copy_from_iter() to indicate that it should
>>>> compare the incoming write to known poison and clear it before
>>>> proceeding.
>>>>
>>>> I like the distinction, because there's a chance the application did
>>>> not know that the page had experienced data loss and might want the
>>>> error behavior. The other service the driver could offer with this
>>>> flag is to do a precise check of the incoming write to make sure it
>>>> overlaps known poison and then repair the entire page. Repairing whole
>>>> pages makes for a cleaner implementation of the code that tries to
>>>> keep poison out of the CPU speculation path, {set,clear}_mce_nospec().
>>>
>>> This flag could also be useful for preadv2() as there is currently no
>>> way to read the good data in a PMEM page with poison via DAX. So the
>>> flag would tell dax_direct_access() to again proceed in the face of
>>> errors, but then the driver's dax_copy_to_iter() operation could
>>> either read up to the precise byte offset of the error in the page, or
>>> autoreplace error data with zero's to try to maximize data recovery.
>>
>> Yes, it could. I like the idea - say RWF_IGNORE_HWERROR - to read
>> everything that can be read from the bad range because it's the
>> other half of the problem RWF_RESET_HWERROR is trying to address.
>> That is, the operation we want to perform on a range with an error
>> state is -data recovery-, not "reinitialisation". Data recovery
>> requires two steps:
>>
>> - "try to recover the data from the bad storage"; and
>> - "reinitialise the data and clear the error state"
>>
>> These naturally map to read() and write() operations, not
>> fallocate(). With RWF flags they become explicit data recovery
>> operations, unlike fallocate() which needs to imply that "writing
>> zeroes" == "reset hardware error state". While that reset method
>> may be true for a specific pmem hardware implementation it is not a
>> requirement for all storage hardware. It's most definitely not a
>> requirement for future storage hardware, either.
>>
>> It also means that applications have no choice in what data they can
>> use to reinitialise the damaged range with because fallocate() only
>> supports writing zeroes. If we've recovered data via a read() as you
>> suggest we could, then we can rebuild the data from other redundant
>> information and immediately write that back to the storage, hence
>> repairing the fault.
>>
>> That, in turn, allows the filesystem to turn the RWF_RESET_HWERROR
>> write into an exclusive operation and hence allow the
>> reinitialisation with the recovered/repaired state to run atomically
>> w.r.t. all other filesystem operations.  i.e. the reset write
>> completes the recovery operation instead of requiring separate
>> "reset" and "write recovered data into zeroed range" steps that
>> cannot be executed atomically by userspace...
> 
> /me nods
> 
> Jane, want to take a run at patches for this ^^^?
> 

Sure, I'll give it a try.

Thank you all for the discussions!

-jane
Darrick J. Wong Sept. 24, 2021, 1:35 a.m. UTC | #21
On Thu, Sep 23, 2021 at 06:21:19PM -0700, Jane Chu wrote:
> 
> On 9/23/2021 6:18 PM, Dan Williams wrote:
> > On Thu, Sep 23, 2021 at 3:54 PM Dave Chinner <david@fromorbit.com> wrote:
> > > 
> > > On Wed, Sep 22, 2021 at 10:42:11PM -0700, Dan Williams wrote:
> > > > On Wed, Sep 22, 2021 at 7:43 PM Dan Williams <dan.j.williams@intel.com> wrote:
> > > > > 
> > > > > On Wed, Sep 22, 2021 at 6:42 PM Dave Chinner <david@fromorbit.com> wrote:
> > > > > [..]
> > > > > > Hence this discussion leads me to conclude that fallocate() simply
> > > > > > isn't the right interface to clear storage hardware poison state and
> > > > > > it's much simpler for everyone - kernel and userspace - to provide a
> > > > > > pwritev2(RWF_CLEAR_HWERROR) flag to directly instruct the IO path to
> > > > > > clear hardware error state before issuing this user write to the
> > > > > > hardware.
> > > > > 
> > > > > That flag would slot in nicely in dax_iomap_iter() as the gate for
> > > > > whether dax_direct_access() should allow mapping over error ranges,
> > > > > and then as a flag to dax_copy_from_iter() to indicate that it should
> > > > > compare the incoming write to known poison and clear it before
> > > > > proceeding.
> > > > > 
> > > > > I like the distinction, because there's a chance the application did
> > > > > not know that the page had experienced data loss and might want the
> > > > > error behavior. The other service the driver could offer with this
> > > > > flag is to do a precise check of the incoming write to make sure it
> > > > > overlaps known poison and then repair the entire page. Repairing whole
> > > > > pages makes for a cleaner implementation of the code that tries to
> > > > > keep poison out of the CPU speculation path, {set,clear}_mce_nospec().
> > > > 
> > > > This flag could also be useful for preadv2() as there is currently no
> > > > way to read the good data in a PMEM page with poison via DAX. So the
> > > > flag would tell dax_direct_access() to again proceed in the face of
> > > > errors, but then the driver's dax_copy_to_iter() operation could
> > > > either read up to the precise byte offset of the error in the page, or
> > > > autoreplace error data with zero's to try to maximize data recovery.
> > > 
> > > Yes, it could. I like the idea - say RWF_IGNORE_HWERROR - to read
> > > everything that can be read from the bad range because it's the
> > > other half of the problem RWF_RESET_HWERROR is trying to address.
> > > That is, the operation we want to perform on a range with an error
> > > state is -data recovery-, not "reinitialisation". Data recovery
> > > requires two steps:
> > > 
> > > - "try to recover the data from the bad storage"; and
> > > - "reinitialise the data and clear the error state"
> > > 
> > > These naturally map to read() and write() operations, not
> > > fallocate(). With RWF flags they become explicit data recovery
> > > operations, unlike fallocate() which needs to imply that "writing
> > > zeroes" == "reset hardware error state". While that reset method
> > > may be true for a specific pmem hardware implementation it is not a
> > > requirement for all storage hardware. It's most definitely not a
> > > requirement for future storage hardware, either.
> > > 
> > > It also means that applications have no choice in what data they can
> > > use to reinitialise the damaged range with because fallocate() only
> > > supports writing zeroes. If we've recovered data via a read() as you
> > > suggest we could, then we can rebuild the data from other redundant
> > > information and immediately write that back to the storage, hence
> > > repairing the fault.
> > > 
> > > That, in turn, allows the filesystem to turn the RWF_RESET_HWERROR
> > > write into an exclusive operation and hence allow the
> > > reinitialisation with the recovered/repaired state to run atomically
> > > w.r.t. all other filesystem operations.  i.e. the reset write
> > > completes the recovery operation instead of requiring separate
> > > "reset" and "write recovered data into zeroed range" steps that
> > > cannot be executed atomically by userspace...
> > 
> > /me nods
> > 
> > Jane, want to take a run at patches for this ^^^?
> > 
> 
> Sure, I'll give it a try.
> 
> Thank you all for the discussions!

Cool, thank you!

--D

> 
> -jane
> 
>
Dave Chinner Sept. 27, 2021, 9:07 p.m. UTC | #22
On Thu, Sep 23, 2021 at 06:35:16PM -0700, Darrick J. Wong wrote:
> On Thu, Sep 23, 2021 at 06:21:19PM -0700, Jane Chu wrote:
> > 
> > On 9/23/2021 6:18 PM, Dan Williams wrote:
> > > On Thu, Sep 23, 2021 at 3:54 PM Dave Chinner <david@fromorbit.com> wrote:
> > > > 
> > > > On Wed, Sep 22, 2021 at 10:42:11PM -0700, Dan Williams wrote:
> > > > > On Wed, Sep 22, 2021 at 7:43 PM Dan Williams <dan.j.williams@intel.com> wrote:
> > > > > > 
> > > > > > On Wed, Sep 22, 2021 at 6:42 PM Dave Chinner <david@fromorbit.com> wrote:
> > > > > > [..]
> > > > > > > Hence this discussion leads me to conclude that fallocate() simply
> > > > > > > isn't the right interface to clear storage hardware poison state and
> > > > > > > it's much simpler for everyone - kernel and userspace - to provide a
> > > > > > > pwritev2(RWF_CLEAR_HWERROR) flag to directly instruct the IO path to
> > > > > > > clear hardware error state before issuing this user write to the
> > > > > > > hardware.
> > > > > > 
> > > > > > That flag would slot in nicely in dax_iomap_iter() as the gate for
> > > > > > whether dax_direct_access() should allow mapping over error ranges,
> > > > > > and then as a flag to dax_copy_from_iter() to indicate that it should
> > > > > > compare the incoming write to known poison and clear it before
> > > > > > proceeding.
> > > > > > 
> > > > > > I like the distinction, because there's a chance the application did
> > > > > > not know that the page had experienced data loss and might want the
> > > > > > error behavior. The other service the driver could offer with this
> > > > > > flag is to do a precise check of the incoming write to make sure it
> > > > > > overlaps known poison and then repair the entire page. Repairing whole
> > > > > > pages makes for a cleaner implementation of the code that tries to
> > > > > > keep poison out of the CPU speculation path, {set,clear}_mce_nospec().
> > > > > 
> > > > > This flag could also be useful for preadv2() as there is currently no
> > > > > way to read the good data in a PMEM page with poison via DAX. So the
> > > > > flag would tell dax_direct_access() to again proceed in the face of
> > > > > errors, but then the driver's dax_copy_to_iter() operation could
> > > > > either read up to the precise byte offset of the error in the page, or
> > > > > autoreplace error data with zero's to try to maximize data recovery.
> > > > 
> > > > Yes, it could. I like the idea - say RWF_IGNORE_HWERROR - to read
> > > > everything that can be read from the bad range because it's the
> > > > other half of the problem RWF_RESET_HWERROR is trying to address.
> > > > That is, the operation we want to perform on a range with an error
> > > > state is -data recovery-, not "reinitialisation". Data recovery
> > > > requires two steps:
> > > > 
> > > > - "try to recover the data from the bad storage"; and
> > > > - "reinitialise the data and clear the error state"
> > > > 
> > > > These naturally map to read() and write() operations, not
> > > > fallocate(). With RWF flags they become explicit data recovery
> > > > operations, unlike fallocate() which needs to imply that "writing
> > > > zeroes" == "reset hardware error state". While that reset method
> > > > may be true for a specific pmem hardware implementation it is not a
> > > > requirement for all storage hardware. It's most definitely not a
> > > > requirement for future storage hardware, either.
> > > > 
> > > > It also means that applications have no choice in what data they can
> > > > use to reinitialise the damaged range with because fallocate() only
> > > > supports writing zeroes. If we've recovered data via a read() as you
> > > > suggest we could, then we can rebuild the data from other redundant
> > > > information and immediately write that back to the storage, hence
> > > > repairing the fault.
> > > > 
> > > > That, in turn, allows the filesystem to turn the RWF_RESET_HWERROR
> > > > write into an exclusive operation and hence allow the
> > > > reinitialisation with the recovered/repaired state to run atomically
> > > > w.r.t. all other filesystem operations.  i.e. the reset write
> > > > completes the recovery operation instead of requiring separate
> > > > "reset" and "write recovered data into zeroed range" steps that
> > > > cannot be executed atomically by userspace...
> > > 
> > > /me nods
> > > 
> > > Jane, want to take a run at patches for this ^^^?
> > > 
> > 
> > Sure, I'll give it a try.
> > 
> > Thank you all for the discussions!
> 
> Cool, thank you!

I'd like to propose a slight modification to the API: a single RWF
flag called RWF_RECOVER_DATA. On read, this means the storage tries
to read all the data it can from the range, and for the parts it
can't read data from (cachelines, sectors, whatever) it returns as
zeroes.

On write, this means the errors over the range get cleared and the
user data provided gets written over the top of whatever was there.
Filesystems should perform this as an exclusive operation to that
range of the file.

That way we only need one IOCB_RECOVERY flag, and for communicating
with lower storage layers (e.g. dm/md raid and/or hardware) only one
REQ_RECOVERY flag is needed in the bio.

Thoughts?

Cheers,

Dave.
Jane Chu Sept. 27, 2021, 9:57 p.m. UTC | #23
On 9/27/2021 2:07 PM, Dave Chinner wrote:
> On Thu, Sep 23, 2021 at 06:35:16PM -0700, Darrick J. Wong wrote:
>> On Thu, Sep 23, 2021 at 06:21:19PM -0700, Jane Chu wrote:
>>>
>>> On 9/23/2021 6:18 PM, Dan Williams wrote:
>>>> On Thu, Sep 23, 2021 at 3:54 PM Dave Chinner <david@fromorbit.com> wrote:
>>>>>
>>>>> On Wed, Sep 22, 2021 at 10:42:11PM -0700, Dan Williams wrote:
>>>>>> On Wed, Sep 22, 2021 at 7:43 PM Dan Williams <dan.j.williams@intel.com> wrote:
>>>>>>>
>>>>>>> On Wed, Sep 22, 2021 at 6:42 PM Dave Chinner <david@fromorbit.com> wrote:
>>>>>>> [..]
>>>>>>>> Hence this discussion leads me to conclude that fallocate() simply
>>>>>>>> isn't the right interface to clear storage hardware poison state and
>>>>>>>> it's much simpler for everyone - kernel and userspace - to provide a
>>>>>>>> pwritev2(RWF_CLEAR_HWERROR) flag to directly instruct the IO path to
>>>>>>>> clear hardware error state before issuing this user write to the
>>>>>>>> hardware.
>>>>>>>
>>>>>>> That flag would slot in nicely in dax_iomap_iter() as the gate for
>>>>>>> whether dax_direct_access() should allow mapping over error ranges,
>>>>>>> and then as a flag to dax_copy_from_iter() to indicate that it should
>>>>>>> compare the incoming write to known poison and clear it before
>>>>>>> proceeding.
>>>>>>>
>>>>>>> I like the distinction, because there's a chance the application did
>>>>>>> not know that the page had experienced data loss and might want the
>>>>>>> error behavior. The other service the driver could offer with this
>>>>>>> flag is to do a precise check of the incoming write to make sure it
>>>>>>> overlaps known poison and then repair the entire page. Repairing whole
>>>>>>> pages makes for a cleaner implementation of the code that tries to
>>>>>>> keep poison out of the CPU speculation path, {set,clear}_mce_nospec().
>>>>>>
>>>>>> This flag could also be useful for preadv2() as there is currently no
>>>>>> way to read the good data in a PMEM page with poison via DAX. So the
>>>>>> flag would tell dax_direct_access() to again proceed in the face of
>>>>>> errors, but then the driver's dax_copy_to_iter() operation could
>>>>>> either read up to the precise byte offset of the error in the page, or
>>>>>> autoreplace error data with zero's to try to maximize data recovery.
>>>>>
>>>>> Yes, it could. I like the idea - say RWF_IGNORE_HWERROR - to read
>>>>> everything that can be read from the bad range because it's the
>>>>> other half of the problem RWF_RESET_HWERROR is trying to address.
>>>>> That is, the operation we want to perform on a range with an error
>>>>> state is -data recovery-, not "reinitialisation". Data recovery
>>>>> requires two steps:
>>>>>
>>>>> - "try to recover the data from the bad storage"; and
>>>>> - "reinitialise the data and clear the error state"
>>>>>
>>>>> These naturally map to read() and write() operations, not
>>>>> fallocate(). With RWF flags they become explicit data recovery
>>>>> operations, unlike fallocate() which needs to imply that "writing
>>>>> zeroes" == "reset hardware error state". While that reset method
>>>>> may be true for a specific pmem hardware implementation it is not a
>>>>> requirement for all storage hardware. It's most definitely not a
>>>>> requirement for future storage hardware, either.
>>>>>
>>>>> It also means that applications have no choice in what data they can
>>>>> use to reinitialise the damaged range with because fallocate() only
>>>>> supports writing zeroes. If we've recovered data via a read() as you
>>>>> suggest we could, then we can rebuild the data from other redundant
>>>>> information and immediately write that back to the storage, hence
>>>>> repairing the fault.
>>>>>
>>>>> That, in turn, allows the filesystem to turn the RWF_RESET_HWERROR
>>>>> write into an exclusive operation and hence allow the
>>>>> reinitialisation with the recovered/repaired state to run atomically
>>>>> w.r.t. all other filesystem operations.  i.e. the reset write
>>>>> completes the recovery operation instead of requiring separate
>>>>> "reset" and "write recovered data into zeroed range" steps that
>>>>> cannot be executed atomically by userspace...
>>>>
>>>> /me nods
>>>>
>>>> Jane, want to take a run at patches for this ^^^?
>>>>
>>>
>>> Sure, I'll give it a try.
>>>
>>> Thank you all for the discussions!
>>
>> Cool, thank you!
> 
> I'd like to propose a slight modification to the API: a single RWF
> flag called RWF_RECOVER_DATA. On read, this means the storage tries
> to read all the data it can from the range, and for the parts it
> can't read data from (cachelines, sectors, whatever) it returns as
> zeroes.
> 
> On write, this means the errors over the range get cleared and the
> user data provided gets written over the top of whatever was there.
> Filesystems should perform this as an exclusive operation to that
> range of the file.
> 
> That way we only need one IOCB_RECOVERY flag, and for communicating
> with lower storage layers (e.g. dm/md raid and/or hardware) only one
> REQ_RECOVERY flag is needed in the bio.
> 
> Thoughts?

Sounds cleaner.  Dan, your thoughts?

thanks,
-jane

> 
> Cheers,
> 
> Dave.
>
Dan Williams Sept. 28, 2021, 12:08 a.m. UTC | #24
On Mon, Sep 27, 2021 at 2:58 PM Jane Chu <jane.chu@oracle.com> wrote:
>
> On 9/27/2021 2:07 PM, Dave Chinner wrote:
> > On Thu, Sep 23, 2021 at 06:35:16PM -0700, Darrick J. Wong wrote:
> >> On Thu, Sep 23, 2021 at 06:21:19PM -0700, Jane Chu wrote:
> >>>
> >>> On 9/23/2021 6:18 PM, Dan Williams wrote:
> >>>> On Thu, Sep 23, 2021 at 3:54 PM Dave Chinner <david@fromorbit.com> wrote:
> >>>>>
> >>>>> On Wed, Sep 22, 2021 at 10:42:11PM -0700, Dan Williams wrote:
> >>>>>> On Wed, Sep 22, 2021 at 7:43 PM Dan Williams <dan.j.williams@intel.com> wrote:
> >>>>>>>
> >>>>>>> On Wed, Sep 22, 2021 at 6:42 PM Dave Chinner <david@fromorbit.com> wrote:
> >>>>>>> [..]
> >>>>>>>> Hence this discussion leads me to conclude that fallocate() simply
> >>>>>>>> isn't the right interface to clear storage hardware poison state and
> >>>>>>>> it's much simpler for everyone - kernel and userspace - to provide a
> >>>>>>>> pwritev2(RWF_CLEAR_HWERROR) flag to directly instruct the IO path to
> >>>>>>>> clear hardware error state before issuing this user write to the
> >>>>>>>> hardware.
> >>>>>>>
> >>>>>>> That flag would slot in nicely in dax_iomap_iter() as the gate for
> >>>>>>> whether dax_direct_access() should allow mapping over error ranges,
> >>>>>>> and then as a flag to dax_copy_from_iter() to indicate that it should
> >>>>>>> compare the incoming write to known poison and clear it before
> >>>>>>> proceeding.
> >>>>>>>
> >>>>>>> I like the distinction, because there's a chance the application did
> >>>>>>> not know that the page had experienced data loss and might want the
> >>>>>>> error behavior. The other service the driver could offer with this
> >>>>>>> flag is to do a precise check of the incoming write to make sure it
> >>>>>>> overlaps known poison and then repair the entire page. Repairing whole
> >>>>>>> pages makes for a cleaner implementation of the code that tries to
> >>>>>>> keep poison out of the CPU speculation path, {set,clear}_mce_nospec().
> >>>>>>
> >>>>>> This flag could also be useful for preadv2() as there is currently no
> >>>>>> way to read the good data in a PMEM page with poison via DAX. So the
> >>>>>> flag would tell dax_direct_access() to again proceed in the face of
> >>>>>> errors, but then the driver's dax_copy_to_iter() operation could
> >>>>>> either read up to the precise byte offset of the error in the page, or
> >>>>>> autoreplace error data with zero's to try to maximize data recovery.
> >>>>>
> >>>>> Yes, it could. I like the idea - say RWF_IGNORE_HWERROR - to read
> >>>>> everything that can be read from the bad range because it's the
> >>>>> other half of the problem RWF_RESET_HWERROR is trying to address.
> >>>>> That is, the operation we want to perform on a range with an error
> >>>>> state is -data recovery-, not "reinitialisation". Data recovery
> >>>>> requires two steps:
> >>>>>
> >>>>> - "try to recover the data from the bad storage"; and
> >>>>> - "reinitialise the data and clear the error state"
> >>>>>
> >>>>> These naturally map to read() and write() operations, not
> >>>>> fallocate(). With RWF flags they become explicit data recovery
> >>>>> operations, unlike fallocate() which needs to imply that "writing
> >>>>> zeroes" == "reset hardware error state". While that reset method
> >>>>> may be true for a specific pmem hardware implementation it is not a
> >>>>> requirement for all storage hardware. It's most definitely not a
> >>>>> requirement for future storage hardware, either.
> >>>>>
> >>>>> It also means that applications have no choice in what data they can
> >>>>> use to reinitialise the damaged range with because fallocate() only
> >>>>> supports writing zeroes. If we've recovered data via a read() as you
> >>>>> suggest we could, then we can rebuild the data from other redundant
> >>>>> information and immediately write that back to the storage, hence
> >>>>> repairing the fault.
> >>>>>
> >>>>> That, in turn, allows the filesystem to turn the RWF_RESET_HWERROR
> >>>>> write into an exclusive operation and hence allow the
> >>>>> reinitialisation with the recovered/repaired state to run atomically
> >>>>> w.r.t. all other filesystem operations.  i.e. the reset write
> >>>>> completes the recovery operation instead of requiring separate
> >>>>> "reset" and "write recovered data into zeroed range" steps that
> >>>>> cannot be executed atomically by userspace...
> >>>>
> >>>> /me nods
> >>>>
> >>>> Jane, want to take a run at patches for this ^^^?
> >>>>
> >>>
> >>> Sure, I'll give it a try.
> >>>
> >>> Thank you all for the discussions!
> >>
> >> Cool, thank you!
> >
> > I'd like to propose a slight modification to the API: a single RWF
> > flag called RWF_RECOVER_DATA. On read, this means the storage tries
> > to read all the data it can from the range, and for the parts it
> > can't read data from (cachelines, sectors, whatever) it returns as
> > zeroes.
> >
> > On write, this means the errors over the range get cleared and the
> > user data provided gets written over the top of whatever was there.
> > Filesystems should perform this as an exclusive operation to that
> > range of the file.
> >
> > That way we only need one IOCB_RECOVERY flag, and for communicating
> > with lower storage layers (e.g. dm/md raid and/or hardware) only one
> > REQ_RECOVERY flag is needed in the bio.
> >
> > Thoughts?
>
> Sounds cleaner.  Dan, your thoughts?

I like it. I was struggling with a way to unify the flag names, and
"recovery" is a good term for not surprising the caller with zeros.
I.e. don't use this flow to avoid errors, use this flow to maximize
data recovery.

Patch

diff --git a/fs/open.c b/fs/open.c
index daa324606a41..230220b8f67a 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -256,6 +256,11 @@  int vfs_fallocate(struct file *file, int mode, loff_t offset, loff_t len)
 	    (mode & ~FALLOC_FL_INSERT_RANGE))
 		return -EINVAL;
 
+	/* Zeroinit should only be used by itself and keep size must be set. */
+	if ((mode & FALLOC_FL_ZEROINIT_RANGE) &&
+	    (mode != (FALLOC_FL_ZEROINIT_RANGE | FALLOC_FL_KEEP_SIZE)))
+		return -EINVAL;
+
 	/* Unshare range should only be used with allocate mode. */
 	if ((mode & FALLOC_FL_UNSHARE_RANGE) &&
 	    (mode & ~(FALLOC_FL_UNSHARE_RANGE | FALLOC_FL_KEEP_SIZE)))
diff --git a/include/linux/falloc.h b/include/linux/falloc.h
index f3f0b97b1675..4597b416667b 100644
--- a/include/linux/falloc.h
+++ b/include/linux/falloc.h
@@ -29,6 +29,7 @@  struct space_resv {
 					 FALLOC_FL_PUNCH_HOLE |		\
 					 FALLOC_FL_COLLAPSE_RANGE |	\
 					 FALLOC_FL_ZERO_RANGE |		\
+					 FALLOC_FL_ZEROINIT_RANGE |	\
 					 FALLOC_FL_INSERT_RANGE |	\
 					 FALLOC_FL_UNSHARE_RANGE)
 
diff --git a/include/uapi/linux/falloc.h b/include/uapi/linux/falloc.h
index 51398fa57f6c..8144403b6102 100644
--- a/include/uapi/linux/falloc.h
+++ b/include/uapi/linux/falloc.h
@@ -77,4 +77,13 @@ 
  */
 #define FALLOC_FL_UNSHARE_RANGE		0x40
 
+/*
+ * FALLOC_FL_ZEROINIT_RANGE is used to reinitialize storage backing a file by
+ * writing zeros to it.  Subsequent reads and writes should not fail due to any
+ * previous media errors.  Blocks must not be shared or require copy on
+ * write.  Holes and unwritten extents are left untouched.  This mode must be
+ * used with FALLOC_FL_KEEP_SIZE.
+ */
+#define FALLOC_FL_ZEROINIT_RANGE	0x80
+
 #endif /* _UAPI_FALLOC_H_ */