Message ID: 72734501cc1d9e08117c215ed60f7b38e3665f14.1605007036.git.naohiro.aota@wdc.com
State:      New, archived
Series:     btrfs: zoned block device support
> 		struct iomap_dio *dio, struct iomap *iomap)
> @@ -278,6 +306,13 @@ iomap_dio_bio_actor(struct inode *inode, loff_t pos, loff_t length,
>  		bio->bi_private = dio;
>  		bio->bi_end_io = iomap_dio_bio_end_io;
>
> +		/*
> +		 * Set the operation flags early so that bio_iov_iter_get_pages
> +		 * can set up the page vector appropriately for a ZONE_APPEND
> +		 * operation.
> +		 */
> +		bio->bi_opf = iomap_dio_bio_opflags(dio, iomap, use_fua);

We could just calculate the flags once before the loop if we touch this
anyway.  But otherwise this looks ok to me.
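As an illustration of the suggestion above (a minimal sketch, not the posted
patch), the flag calculation could be hoisted out of the bio submission loop
in iomap_dio_bio_actor():

	/*
	 * Sketch only: neither iomap->flags nor use_fua changes between
	 * loop iterations, and clearing IOMAP_DIO_WRITE_FUA inside the
	 * helper is idempotent, so the flags can be computed once up front.
	 */
	unsigned int bio_opf = iomap_dio_bio_opflags(dio, iomap, use_fua);

	do {
		/* ... allocate the bio and fill in the usual fields ... */
		bio->bi_private = dio;
		bio->bi_end_io = iomap_dio_bio_end_io;
		bio->bi_opf = bio_opf;

		ret = bio_iov_iter_get_pages(bio, dio->submit.iter);
		/* ... */
	} while (nr_pages);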
On Tue, Nov 10, 2020 at 08:26:05PM +0900, Naohiro Aota wrote:
> A ZONE_APPEND bio must follow hardware restrictions (e.g. not exceeding
> max_zone_append_sectors) not to be split. bio_iov_iter_get_pages builds
> such restricted bio using __bio_iov_append_get_pages if bio_op(bio) ==
> REQ_OP_ZONE_APPEND.
>
> To utilize it, we need to set the bio_op before calling
> bio_iov_iter_get_pages(). This commit introduces IOMAP_F_ZONE_APPEND, so
> that iomap user can set the flag to indicate they want REQ_OP_ZONE_APPEND
> and restricted bio.
>
> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
> ---
>  fs/iomap/direct-io.c  | 41 +++++++++++++++++++++++++++++++++++------
>  include/linux/iomap.h |  1 +
>  2 files changed, 36 insertions(+), 6 deletions(-)
>
> diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
> index c1aafb2ab990..f04572a55a09 100644
> --- a/fs/iomap/direct-io.c
> +++ b/fs/iomap/direct-io.c
> @@ -200,6 +200,34 @@ iomap_dio_zero(struct iomap_dio *dio, struct iomap *iomap, loff_t pos,
>  	iomap_dio_submit_bio(dio, iomap, bio, pos);
>  }
>
> +/*
> + * Figure out the bio's operation flags from the dio request, the
> + * mapping, and whether or not we want FUA. Note that we can end up
> + * clearing the WRITE_FUA flag in the dio request.
> + */
> +static inline unsigned int
> +iomap_dio_bio_opflags(struct iomap_dio *dio, struct iomap *iomap, bool use_fua)

Hmm, just to check my understanding of what iomap has to do to support
all this:

When we're wanting to use a ZONE_APPEND command, the @iomap structure
has to have IOMAP_F_ZONE_APPEND set in iomap->flags, iomap->type is set
to IOMAP_MAPPED, but what should iomap->addr be set to?

I gather from what I see in zonefs and the relevant NVME proposal that
iomap->addr should be set to the (byte) address of the zone we want to
append to?  And if we do that, then bio->bi_iter.bi_sector will be set
to sector address of iomap->addr, right?

(I got lost trying to figure out how btrfs sets ->addr for appends.)

Then when the IO completes, the block layer sets bio->bi_iter.bi_sector
to wherever the drive told it that it actually wrote the bio, right?

If that's true, then that implies that need_zeroout must always be false
for an append operation, right?  Does that also mean that the directio
request has to be aligned to an fs block and not just the sector size?

Can userspace send a directio append that crosses a zone boundary?  If
so, what happens if a direct append to a lower address fails but a
direct append to a higher address succeeds?

> +{
> +	unsigned int opflags = REQ_SYNC | REQ_IDLE;
> +
> +	if (!(dio->flags & IOMAP_DIO_WRITE)) {
> +		WARN_ON_ONCE(iomap->flags & IOMAP_F_ZONE_APPEND);
> +		return REQ_OP_READ;
> +	}
> +
> +	if (iomap->flags & IOMAP_F_ZONE_APPEND)
> +		opflags |= REQ_OP_ZONE_APPEND;
> +	else
> +		opflags |= REQ_OP_WRITE;
> +
> +	if (use_fua)
> +		opflags |= REQ_FUA;
> +	else
> +		dio->flags &= ~IOMAP_DIO_WRITE_FUA;
> +
> +	return opflags;
> +}
> +
>  static loff_t
>  iomap_dio_bio_actor(struct inode *inode, loff_t pos, loff_t length,
>  		struct iomap_dio *dio, struct iomap *iomap)
> @@ -278,6 +306,13 @@ iomap_dio_bio_actor(struct inode *inode, loff_t pos, loff_t length,
>  		bio->bi_private = dio;
>  		bio->bi_end_io = iomap_dio_bio_end_io;
>
> +		/*
> +		 * Set the operation flags early so that bio_iov_iter_get_pages
> +		 * can set up the page vector appropriately for a ZONE_APPEND
> +		 * operation.
> +		 */
> +		bio->bi_opf = iomap_dio_bio_opflags(dio, iomap, use_fua);

I'm also vaguely wondering how to communicate the write location back to
the filesystem when the bio completes?  btrfs handles the bio completion
completely so it doesn't have a problem, but for other filesystems
(cough future xfs cough) either we'd have to add a new callback for
append operations; or I guess everyone could hook the bio endio.

Admittedly that's not really your problem, and for all I know hch is
already working on this.

--D

> +
>  		ret = bio_iov_iter_get_pages(bio, dio->submit.iter);
>  		if (unlikely(ret)) {
>  			/*
> @@ -292,14 +327,8 @@ iomap_dio_bio_actor(struct inode *inode, loff_t pos, loff_t length,
>
>  		n = bio->bi_iter.bi_size;
>  		if (dio->flags & IOMAP_DIO_WRITE) {
> -			bio->bi_opf = REQ_OP_WRITE | REQ_SYNC | REQ_IDLE;
> -			if (use_fua)
> -				bio->bi_opf |= REQ_FUA;
> -			else
> -				dio->flags &= ~IOMAP_DIO_WRITE_FUA;
>  			task_io_account_write(n);
>  		} else {
> -			bio->bi_opf = REQ_OP_READ;
>  			if (dio->flags & IOMAP_DIO_DIRTY)
>  				bio_set_pages_dirty(bio);
>  		}
> diff --git a/include/linux/iomap.h b/include/linux/iomap.h
> index 4d1d3c3469e9..1bccd1880d0d 100644
> --- a/include/linux/iomap.h
> +++ b/include/linux/iomap.h
> @@ -54,6 +54,7 @@ struct vm_fault;
>  #define IOMAP_F_SHARED		0x04
>  #define IOMAP_F_MERGED		0x08
>  #define IOMAP_F_BUFFER_HEAD	0x10
> +#define IOMAP_F_ZONE_APPEND	0x20
>
>  /*
>   * Flags set by the core iomap code during operations:
> --
> 2.27.0
>
On Tue, Nov 10, 2020 at 10:55:06AM -0800, Darrick J. Wong wrote:
> On Tue, Nov 10, 2020 at 08:26:05PM +0900, Naohiro Aota wrote:
> > A ZONE_APPEND bio must follow hardware restrictions (e.g. not exceeding
> > max_zone_append_sectors) not to be split. bio_iov_iter_get_pages builds
> > such restricted bio using __bio_iov_append_get_pages if bio_op(bio) ==
> > REQ_OP_ZONE_APPEND.
> >
> > To utilize it, we need to set the bio_op before calling
> > bio_iov_iter_get_pages(). This commit introduces IOMAP_F_ZONE_APPEND, so
> > that iomap user can set the flag to indicate they want REQ_OP_ZONE_APPEND
> > and restricted bio.
> >
> > Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
> > ---
> >  fs/iomap/direct-io.c  | 41 +++++++++++++++++++++++++++++++++++------
> >  include/linux/iomap.h |  1 +
> >  2 files changed, 36 insertions(+), 6 deletions(-)
> >
> > diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
> > index c1aafb2ab990..f04572a55a09 100644
> > --- a/fs/iomap/direct-io.c
> > +++ b/fs/iomap/direct-io.c
> > @@ -200,6 +200,34 @@ iomap_dio_zero(struct iomap_dio *dio, struct iomap *iomap, loff_t pos,
> >  	iomap_dio_submit_bio(dio, iomap, bio, pos);
> >  }
> >
> > +/*
> > + * Figure out the bio's operation flags from the dio request, the
> > + * mapping, and whether or not we want FUA. Note that we can end up
> > + * clearing the WRITE_FUA flag in the dio request.
> > + */
> > +static inline unsigned int
> > +iomap_dio_bio_opflags(struct iomap_dio *dio, struct iomap *iomap, bool use_fua)
>
> Hmm, just to check my understanding of what iomap has to do to support
> all this:
>
> When we're wanting to use a ZONE_APPEND command, the @iomap structure
> has to have IOMAP_F_ZONE_APPEND set in iomap->flags, iomap->type is set
> to IOMAP_MAPPED, but what should iomap->addr be set to?
>
> I gather from what I see in zonefs and the relevant NVME proposal that
> iomap->addr should be set to the (byte) address of the zone we want to
> append to?  And if we do that, then bio->bi_iter.bi_sector will be set
> to sector address of iomap->addr, right?
>
> (I got lost trying to figure out how btrfs sets ->addr for appends.)
>
> Then when the IO completes, the block layer sets bio->bi_iter.bi_sector
> to wherever the drive told it that it actually wrote the bio, right?
>
> If that's true, then that implies that need_zeroout must always be false
> for an append operation, right?  Does that also mean that the directio
> request has to be aligned to an fs block and not just the sector size?
>
> Can userspace send a directio append that crosses a zone boundary?  If
> so, what happens if a direct append to a lower address fails but a
> direct append to a higher address succeeds?

Bleh, vim tomfoolery == missing sentence.  Change the above paragraph
to read:

Can userspace send a directio append that crosses a zone boundary?  Can
we issue multiple bios for a single append write?  What happens if a
direct append to a lower address fails but a direct append to a higher
address succeeds?

--D

> > +{
> > +	unsigned int opflags = REQ_SYNC | REQ_IDLE;
> > +
> > +	if (!(dio->flags & IOMAP_DIO_WRITE)) {
> > +		WARN_ON_ONCE(iomap->flags & IOMAP_F_ZONE_APPEND);
> > +		return REQ_OP_READ;
> > +	}
> > +
> > +	if (iomap->flags & IOMAP_F_ZONE_APPEND)
> > +		opflags |= REQ_OP_ZONE_APPEND;
> > +	else
> > +		opflags |= REQ_OP_WRITE;
> > +
> > +	if (use_fua)
> > +		opflags |= REQ_FUA;
> > +	else
> > +		dio->flags &= ~IOMAP_DIO_WRITE_FUA;
> > +
> > +	return opflags;
> > +}
> > +
> >  static loff_t
> >  iomap_dio_bio_actor(struct inode *inode, loff_t pos, loff_t length,
> >  		struct iomap_dio *dio, struct iomap *iomap)
> > @@ -278,6 +306,13 @@ iomap_dio_bio_actor(struct inode *inode, loff_t pos, loff_t length,
> >  		bio->bi_private = dio;
> >  		bio->bi_end_io = iomap_dio_bio_end_io;
> >
> > +		/*
> > +		 * Set the operation flags early so that bio_iov_iter_get_pages
> > +		 * can set up the page vector appropriately for a ZONE_APPEND
> > +		 * operation.
> > +		 */
> > +		bio->bi_opf = iomap_dio_bio_opflags(dio, iomap, use_fua);
>
> I'm also vaguely wondering how to communicate the write location back to
> the filesystem when the bio completes?  btrfs handles the bio completion
> completely so it doesn't have a problem, but for other filesystems
> (cough future xfs cough) either we'd have to add a new callback for
> append operations; or I guess everyone could hook the bio endio.
>
> Admittedly that's not really your problem, and for all I know hch is
> already working on this.
>
> --D
>
> > +
> >  		ret = bio_iov_iter_get_pages(bio, dio->submit.iter);
> >  		if (unlikely(ret)) {
> >  			/*
> > @@ -292,14 +327,8 @@ iomap_dio_bio_actor(struct inode *inode, loff_t pos, loff_t length,
> >
> >  		n = bio->bi_iter.bi_size;
> >  		if (dio->flags & IOMAP_DIO_WRITE) {
> > -			bio->bi_opf = REQ_OP_WRITE | REQ_SYNC | REQ_IDLE;
> > -			if (use_fua)
> > -				bio->bi_opf |= REQ_FUA;
> > -			else
> > -				dio->flags &= ~IOMAP_DIO_WRITE_FUA;
> >  			task_io_account_write(n);
> >  		} else {
> > -			bio->bi_opf = REQ_OP_READ;
> >  			if (dio->flags & IOMAP_DIO_DIRTY)
> >  				bio_set_pages_dirty(bio);
> >  		}
> > diff --git a/include/linux/iomap.h b/include/linux/iomap.h
> > index 4d1d3c3469e9..1bccd1880d0d 100644
> > --- a/include/linux/iomap.h
> > +++ b/include/linux/iomap.h
> > @@ -54,6 +54,7 @@ struct vm_fault;
> >  #define IOMAP_F_SHARED		0x04
> >  #define IOMAP_F_MERGED		0x08
> >  #define IOMAP_F_BUFFER_HEAD	0x10
> > +#define IOMAP_F_ZONE_APPEND	0x20
> >
> >  /*
> >   * Flags set by the core iomap code during operations:
> > --
> > 2.27.0
> >
On Tue, Nov 10, 2020 at 10:55:06AM -0800, Darrick J. Wong wrote:
> When we're wanting to use a ZONE_APPEND command, the @iomap structure
> has to have IOMAP_F_ZONE_APPEND set in iomap->flags, iomap->type is set
> to IOMAP_MAPPED, but what should iomap->addr be set to?
>
> I gather from what I see in zonefs and the relevant NVME proposal that
> iomap->addr should be set to the (byte) address of the zone we want to
> append to?  And if we do that, then bio->bi_iter.bi_sector will be set
> to sector address of iomap->addr, right?

Yes.

> Then when the IO completes, the block layer sets bio->bi_iter.bi_sector
> to wherever the drive told it that it actually wrote the bio, right?

Yes.

> If that's true, then that implies that need_zeroout must always be false
> for an append operation, right?  Does that also mean that the directio
> request has to be aligned to an fs block and not just the sector size?

I think so, yes.

> Can userspace send a directio append that crosses a zone boundary?  If
> so, what happens if a direct append to a lower address fails but a
> direct append to a higher address succeeds?

Userspace doesn't know about zone boundaries.  It can send I/O larger
than a zone, but the file system has to split it into multiple I/Os,
just like when it has to cross an AG boundary in XFS.

> I'm also vaguely wondering how to communicate the write location back to
> the filesystem when the bio completes?  btrfs handles the bio completion
> completely so it doesn't have a problem, but for other filesystems
> (cough future xfs cough) either we'd have to add a new callback for
> append operations; or I guess everyone could hook the bio endio.
>
> Admittedly that's not really your problem, and for all I know hch is
> already working on this.

I think any non-trivial file system needs to override the bio completion
handler for writes anyway, so this seems reasonable.  It might be worth
documenting, though.
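To make the completion side concrete, here is a minimal sketch of the
"hook the bio endio" approach discussed above. All myfs_* names are
hypothetical; the only fact taken from the thread is that the block layer
rewrites bio->bi_iter.bi_sector to the actual write location when a zone
append completes.

	/*
	 * Sketch only: a hypothetical filesystem's end_io hook for a
	 * REQ_OP_ZONE_APPEND bio.
	 */
	static void myfs_dio_end_io(struct bio *bio)
	{
		struct myfs_dio *mdio = bio->bi_private;	/* hypothetical */

		/*
		 * For REQ_OP_ZONE_APPEND the block layer has rewritten
		 * bio->bi_iter.bi_sector to the sector the device actually
		 * wrote to, so the filesystem can record the real mapping.
		 */
		if (bio_op(bio) == REQ_OP_ZONE_APPEND)
			mdio->written_addr = bio->bi_iter.bi_sector << SECTOR_SHIFT;

		if (bio->bi_status)
			mdio->error = blk_status_to_errno(bio->bi_status);

		myfs_dio_complete(mdio);	/* hypothetical */
		bio_put(bio);
	}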
On Tue, Nov 10, 2020 at 08:26:05PM +0900, Naohiro Aota wrote:
> A ZONE_APPEND bio must follow hardware restrictions (e.g. not exceeding
> max_zone_append_sectors) not to be split. bio_iov_iter_get_pages builds
> such restricted bio using __bio_iov_append_get_pages if bio_op(bio) ==
> REQ_OP_ZONE_APPEND.
>
> To utilize it, we need to set the bio_op before calling
> bio_iov_iter_get_pages(). This commit introduces IOMAP_F_ZONE_APPEND, so
> that iomap user can set the flag to indicate they want REQ_OP_ZONE_APPEND
> and restricted bio.
>
> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>

Christoph's answers seem reasonable to me, so

Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>

Er... do you want me to take this one via the iomap tree?

--D

> ---
>  fs/iomap/direct-io.c  | 41 +++++++++++++++++++++++++++++++++++------
>  include/linux/iomap.h |  1 +
>  2 files changed, 36 insertions(+), 6 deletions(-)
>
> diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
> index c1aafb2ab990..f04572a55a09 100644
> --- a/fs/iomap/direct-io.c
> +++ b/fs/iomap/direct-io.c
> @@ -200,6 +200,34 @@ iomap_dio_zero(struct iomap_dio *dio, struct iomap *iomap, loff_t pos,
>  	iomap_dio_submit_bio(dio, iomap, bio, pos);
>  }
>
> +/*
> + * Figure out the bio's operation flags from the dio request, the
> + * mapping, and whether or not we want FUA. Note that we can end up
> + * clearing the WRITE_FUA flag in the dio request.
> + */
> +static inline unsigned int
> +iomap_dio_bio_opflags(struct iomap_dio *dio, struct iomap *iomap, bool use_fua)
> +{
> +	unsigned int opflags = REQ_SYNC | REQ_IDLE;
> +
> +	if (!(dio->flags & IOMAP_DIO_WRITE)) {
> +		WARN_ON_ONCE(iomap->flags & IOMAP_F_ZONE_APPEND);
> +		return REQ_OP_READ;
> +	}
> +
> +	if (iomap->flags & IOMAP_F_ZONE_APPEND)
> +		opflags |= REQ_OP_ZONE_APPEND;
> +	else
> +		opflags |= REQ_OP_WRITE;
> +
> +	if (use_fua)
> +		opflags |= REQ_FUA;
> +	else
> +		dio->flags &= ~IOMAP_DIO_WRITE_FUA;
> +
> +	return opflags;
> +}
> +
>  static loff_t
>  iomap_dio_bio_actor(struct inode *inode, loff_t pos, loff_t length,
>  		struct iomap_dio *dio, struct iomap *iomap)
> @@ -278,6 +306,13 @@ iomap_dio_bio_actor(struct inode *inode, loff_t pos, loff_t length,
>  		bio->bi_private = dio;
>  		bio->bi_end_io = iomap_dio_bio_end_io;
>
> +		/*
> +		 * Set the operation flags early so that bio_iov_iter_get_pages
> +		 * can set up the page vector appropriately for a ZONE_APPEND
> +		 * operation.
> +		 */
> +		bio->bi_opf = iomap_dio_bio_opflags(dio, iomap, use_fua);
> +
>  		ret = bio_iov_iter_get_pages(bio, dio->submit.iter);
>  		if (unlikely(ret)) {
>  			/*
> @@ -292,14 +327,8 @@ iomap_dio_bio_actor(struct inode *inode, loff_t pos, loff_t length,
>
>  		n = bio->bi_iter.bi_size;
>  		if (dio->flags & IOMAP_DIO_WRITE) {
> -			bio->bi_opf = REQ_OP_WRITE | REQ_SYNC | REQ_IDLE;
> -			if (use_fua)
> -				bio->bi_opf |= REQ_FUA;
> -			else
> -				dio->flags &= ~IOMAP_DIO_WRITE_FUA;
>  			task_io_account_write(n);
>  		} else {
> -			bio->bi_opf = REQ_OP_READ;
>  			if (dio->flags & IOMAP_DIO_DIRTY)
>  				bio_set_pages_dirty(bio);
>  		}
> diff --git a/include/linux/iomap.h b/include/linux/iomap.h
> index 4d1d3c3469e9..1bccd1880d0d 100644
> --- a/include/linux/iomap.h
> +++ b/include/linux/iomap.h
> @@ -54,6 +54,7 @@ struct vm_fault;
>  #define IOMAP_F_SHARED		0x04
>  #define IOMAP_F_MERGED		0x08
>  #define IOMAP_F_BUFFER_HEAD	0x10
> +#define IOMAP_F_ZONE_APPEND	0x20
>
>  /*
>   * Flags set by the core iomap code during operations:
> --
> 2.27.0
>
On 30/11/2020 19:16, Darrick J. Wong wrote:
> On Tue, Nov 10, 2020 at 08:26:05PM +0900, Naohiro Aota wrote:
>> A ZONE_APPEND bio must follow hardware restrictions (e.g. not exceeding
>> max_zone_append_sectors) not to be split. bio_iov_iter_get_pages builds
>> such restricted bio using __bio_iov_append_get_pages if bio_op(bio) ==
>> REQ_OP_ZONE_APPEND.
>>
>> To utilize it, we need to set the bio_op before calling
>> bio_iov_iter_get_pages(). This commit introduces IOMAP_F_ZONE_APPEND, so
>> that iomap user can set the flag to indicate they want REQ_OP_ZONE_APPEND
>> and restricted bio.
>>
>> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
>
> Christoph's answers seem reasonable to me, so
>
> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
>
> Er... do you want me to take this one via the iomap tree?

Ahm, probably yes, but we have an update addressing Christoph's comments
coming with the next version of the series.
Btw, another thing I noticed:

when using io_uring to submit a write to btrfs that ends up using Zone
Append we'll hit the

	if (WARN_ON_ONCE(is_bvec))
		return -EINVAL;

case in bio_iov_iter_get_pages with the changes in this series.
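For context: the is_bvec case corresponds to an ITER_BVEC iterator, which
io_uring produces for registered (fixed) buffers; plain reads and writes go
through an iovec iterator instead. A minimal userspace sketch (an
illustration, not from this thread; the file path is hypothetical) that
exercises that path via liburing:

	/* Sketch: O_DIRECT fixed-buffer write via liburing. */
	#define _GNU_SOURCE
	#include <fcntl.h>
	#include <stdlib.h>
	#include <string.h>
	#include <sys/uio.h>
	#include <liburing.h>

	int main(void)
	{
		struct io_uring ring;
		struct io_uring_sqe *sqe;
		struct io_uring_cqe *cqe;
		struct iovec iov;
		void *buf;
		int fd;

		if (posix_memalign(&buf, 4096, 1 << 20))
			return 1;
		memset(buf, 0xab, 1 << 20);
		iov.iov_base = buf;
		iov.iov_len = 1 << 20;

		fd = open("/mnt/test/io_uring", O_WRONLY | O_CREAT | O_DIRECT, 0644);
		if (fd < 0)
			return 1;

		io_uring_queue_init(8, &ring, 0);
		/* Registering the buffer is what makes the kernel see an
		 * ITER_BVEC iterator for the write below. */
		io_uring_register_buffers(&ring, &iov, 1);

		sqe = io_uring_get_sqe(&ring);
		io_uring_prep_write_fixed(sqe, fd, buf, 1 << 20, 0, 0);
		io_uring_submit(&ring);
		io_uring_wait_cqe(&ring, &cqe);
		io_uring_cqe_seen(&ring, cqe);
		io_uring_queue_exit(&ring);
		return 0;
	}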
On 09/12/2020 10:34, Christoph Hellwig wrote:
> Btw, another thing I noticed:
>
> when using io_uring to submit a write to btrfs that ends up using Zone
> Append we'll hit the
>
> 	if (WARN_ON_ONCE(is_bvec))
> 		return -EINVAL;
>
> case in bio_iov_iter_get_pages with the changes in this series.

Yes this warning is totally bogus. It was in there from the beginning
of the zone-append series and I have no idea why I didn't kill it.

IIRC Chaitanya had a patch in his nvmet zoned series removing it.
On Wed, Dec 09, 2020 at 10:08:53AM +0000, Johannes Thumshirn wrote:
> On 09/12/2020 10:34, Christoph Hellwig wrote:
> > Btw, another thing I noticed:
> >
> > when using io_uring to submit a write to btrfs that ends up using Zone
> > Append we'll hit the
> >
> > 	if (WARN_ON_ONCE(is_bvec))
> > 		return -EINVAL;
> >
> > case in bio_iov_iter_get_pages with the changes in this series.
>
> Yes this warning is totally bogus. It was in there from the beginning of the
> zone-append series and I have no idea why I didn't kill it.
>
> IIRC Chaitanya had a patch in his nvmet zoned series removing it.

Yes, but it is wrong.  What we need is a version of
__bio_iov_bvec_add_pages that takes the hardware limits into account.
On 09/12/2020 11:10, hch@infradead.org wrote:
> On Wed, Dec 09, 2020 at 10:08:53AM +0000, Johannes Thumshirn wrote:
>> On 09/12/2020 10:34, Christoph Hellwig wrote:
>>> Btw, another thing I noticed:
>>>
>>> when using io_uring to submit a write to btrfs that ends up using Zone
>>> Append we'll hit the
>>>
>>> 	if (WARN_ON_ONCE(is_bvec))
>>> 		return -EINVAL;
>>>
>>> case in bio_iov_iter_get_pages with the changes in this series.
>>
>> Yes this warning is totally bogus. It was in there from the beginning of the
>> zone-append series and I have no idea why I didn't kill it.
>>
>> IIRC Chaitanya had a patch in his nvmet zoned series removing it.
>
> Yes, but it is wrong.  What we need is a version of
> __bio_iov_bvec_add_pages that takes the hardware limits into account.

Ah now I understand the situation, I'm on it.
On 09/12/2020 11:18, Johannes Thumshirn wrote:
> On 09/12/2020 11:10, hch@infradead.org wrote:
>> On Wed, Dec 09, 2020 at 10:08:53AM +0000, Johannes Thumshirn wrote:
>>> On 09/12/2020 10:34, Christoph Hellwig wrote:
>>>> Btw, another thing I noticed:
>>>>
>>>> when using io_uring to submit a write to btrfs that ends up using Zone
>>>> Append we'll hit the
>>>>
>>>> 	if (WARN_ON_ONCE(is_bvec))
>>>> 		return -EINVAL;
>>>>
>>>> case in bio_iov_iter_get_pages with the changes in this series.
>>>
>>> Yes this warning is totally bogus. It was in there from the beginning of the
>>> zone-append series and I have no idea why I didn't kill it.
>>>
>>> IIRC Chaitanya had a patch in his nvmet zoned series removing it.
>>
>> Yes, but it is wrong.  What we need is a version of
>> __bio_iov_bvec_add_pages that takes the hardware limits into account.
>
> Ah now I understand the situation, I'm on it.

OK got something, just need to test it.
On 09/12/2020 14:41, Johannes Thumshirn wrote:
> On 09/12/2020 11:18, Johannes Thumshirn wrote:
>> On 09/12/2020 11:10, hch@infradead.org wrote:
>>> On Wed, Dec 09, 2020 at 10:08:53AM +0000, Johannes Thumshirn wrote:
>>>> On 09/12/2020 10:34, Christoph Hellwig wrote:
>>>>> Btw, another thing I noticed:
>>>>>
>>>>> when using io_uring to submit a write to btrfs that ends up using Zone
>>>>> Append we'll hit the
>>>>>
>>>>> 	if (WARN_ON_ONCE(is_bvec))
>>>>> 		return -EINVAL;
>>>>>
>>>>> case in bio_iov_iter_get_pages with the changes in this series.
>>>>
>>>> Yes this warning is totally bogus. It was in there from the beginning of the
>>>> zone-append series and I have no idea why I didn't kill it.
>>>>
>>>> IIRC Chaitanya had a patch in his nvmet zoned series removing it.
>>>
>>> Yes, but it is wrong.  What we need is a version of
>>> __bio_iov_bvec_add_pages that takes the hardware limits into account.
>>
>> Ah now I understand the situation, I'm on it.
>
> OK got something, just need to test it.

I just ran tests with my solution and to verify it worked as expected I
ran the test without it. Interestingly the WARN_ON() didn't trigger for
me. Here's the fio command line I've used:

fio --ioengine io_uring --rw readwrite --bs 1M --size 1G --time_based \
	--runtime 1m --filename /mnt/test/io_uring --name io_uring-test \
	--direct 1 --numjobs $NPROC

I did verify it's using zone append though. What did you use to trigger
the warning?
Johannes/Christoph,

On 12/10/20 23:30, Johannes Thumshirn wrote:
> On 09/12/2020 14:41, Johannes Thumshirn wrote:
>> On 09/12/2020 11:18, Johannes Thumshirn wrote:
>>> On 09/12/2020 11:10, hch@infradead.org wrote:
>>>> On Wed, Dec 09, 2020 at 10:08:53AM +0000, Johannes Thumshirn wrote:
>>>>> On 09/12/2020 10:34, Christoph Hellwig wrote:
>>>>>> Btw, another thing I noticed:
>>>>>>
>>>>>> when using io_uring to submit a write to btrfs that ends up using Zone
>>>>>> Append we'll hit the
>>>>>>
>>>>>> 	if (WARN_ON_ONCE(is_bvec))
>>>>>> 		return -EINVAL;
>>>>>>
>>>>>> case in bio_iov_iter_get_pages with the changes in this series.
>>>>> Yes this warning is totally bogus. It was in there from the beginning of the
>>>>> zone-append series and I have no idea why I didn't kill it.
>>>>>
>>>>> IIRC Chaitanya had a patch in his nvmet zoned series removing it.

Even though that patch is not needed, I've tested building bios from
bvecs with bio_iov_iter_get_pages() using the NVMeOF backend; I can
still send that patch, please confirm.
On 11/12/2020 22:24, Chaitanya Kulkarni wrote:
> Johannes/Christoph,
>
> On 12/10/20 23:30, Johannes Thumshirn wrote:
>> On 09/12/2020 14:41, Johannes Thumshirn wrote:
>>> On 09/12/2020 11:18, Johannes Thumshirn wrote:
>>>> On 09/12/2020 11:10, hch@infradead.org wrote:
>>>>> On Wed, Dec 09, 2020 at 10:08:53AM +0000, Johannes Thumshirn wrote:
>>>>>> On 09/12/2020 10:34, Christoph Hellwig wrote:
>>>>>>> Btw, another thing I noticed:
>>>>>>>
>>>>>>> when using io_uring to submit a write to btrfs that ends up using Zone
>>>>>>> Append we'll hit the
>>>>>>>
>>>>>>> 	if (WARN_ON_ONCE(is_bvec))
>>>>>>> 		return -EINVAL;
>>>>>>>
>>>>>>> case in bio_iov_iter_get_pages with the changes in this series.
>>>>>> Yes this warning is totally bogus. It was in there from the beginning of the
>>>>>> zone-append series and I have no idea why I didn't kill it.
>>>>>>
>>>>>> IIRC Chaitanya had a patch in his nvmet zoned series removing it.
>
> Even though that patch is not needed, I've tested building bios from
> bvecs with bio_iov_iter_get_pages() using the NVMeOF backend; I can
> still send that patch, please confirm.

I have the following locally, but my tests fail to exercise it:

commit ea93c91a70204a23ebf9e22b19fbf8add644e12e
Author: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Date:   Wed Dec 9 23:26:38 2020 +0900

    block: provide __bio_iov_bvec_add_append_pages

    Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>

diff --git a/block/bio.c b/block/bio.c
index 5c6982902330..dc8178ca796f 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -992,6 +992,31 @@ void bio_release_pages(struct bio *bio, bool mark_dirty)
 }
 EXPORT_SYMBOL_GPL(bio_release_pages);
 
+static int __bio_iov_bvec_add_append_pages(struct bio *bio,
+					   struct iov_iter *iter)
+{
+	const struct bio_vec *bv = iter->bvec;
+	struct request_queue *q = bio->bi_disk->queue;
+	unsigned int max_append_sectors = queue_max_zone_append_sectors(q);
+	unsigned int len;
+	size_t size;
+
+	if (WARN_ON_ONCE(!max_append_sectors))
+		return -EINVAL;
+
+	if (WARN_ON_ONCE(iter->iov_offset > bv->bv_len))
+		return -EINVAL;
+
+	len = min_t(size_t, bv->bv_len - iter->iov_offset, iter->count);
+	size = bio_add_hw_page(q, bio, bv->bv_page, len,
+			       bv->bv_offset + iter->iov_offset,
+			       max_append_sectors, false);
+	if (unlikely(size != len))
+		return -EINVAL;
+	iov_iter_advance(iter, size);
+	return 0;
+}
+
 static int __bio_iov_bvec_add_pages(struct bio *bio, struct iov_iter *iter)
 {
 	const struct bio_vec *bv = iter->bvec;
@@ -1142,9 +1167,10 @@ int bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
 
 	do {
 		if (bio_op(bio) == REQ_OP_ZONE_APPEND) {
-			if (WARN_ON_ONCE(is_bvec))
-				return -EINVAL;
-			ret = __bio_iov_append_get_pages(bio, iter);
+			if (is_bvec)
+				ret = __bio_iov_bvec_add_append_pages(bio, iter);
+			else
+				ret = __bio_iov_append_get_pages(bio, iter);
 		} else {
 			if (is_bvec)
 				ret = __bio_iov_bvec_add_pages(bio, iter);

It's basically __bio_iov_bvec_add_pages() respecting the
max_zone_append_sectors queue limit.
diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
index c1aafb2ab990..f04572a55a09 100644
--- a/fs/iomap/direct-io.c
+++ b/fs/iomap/direct-io.c
@@ -200,6 +200,34 @@ iomap_dio_zero(struct iomap_dio *dio, struct iomap *iomap, loff_t pos,
 	iomap_dio_submit_bio(dio, iomap, bio, pos);
 }
 
+/*
+ * Figure out the bio's operation flags from the dio request, the
+ * mapping, and whether or not we want FUA. Note that we can end up
+ * clearing the WRITE_FUA flag in the dio request.
+ */
+static inline unsigned int
+iomap_dio_bio_opflags(struct iomap_dio *dio, struct iomap *iomap, bool use_fua)
+{
+	unsigned int opflags = REQ_SYNC | REQ_IDLE;
+
+	if (!(dio->flags & IOMAP_DIO_WRITE)) {
+		WARN_ON_ONCE(iomap->flags & IOMAP_F_ZONE_APPEND);
+		return REQ_OP_READ;
+	}
+
+	if (iomap->flags & IOMAP_F_ZONE_APPEND)
+		opflags |= REQ_OP_ZONE_APPEND;
+	else
+		opflags |= REQ_OP_WRITE;
+
+	if (use_fua)
+		opflags |= REQ_FUA;
+	else
+		dio->flags &= ~IOMAP_DIO_WRITE_FUA;
+
+	return opflags;
+}
+
 static loff_t
 iomap_dio_bio_actor(struct inode *inode, loff_t pos, loff_t length,
 		struct iomap_dio *dio, struct iomap *iomap)
@@ -278,6 +306,13 @@ iomap_dio_bio_actor(struct inode *inode, loff_t pos, loff_t length,
 		bio->bi_private = dio;
 		bio->bi_end_io = iomap_dio_bio_end_io;
 
+		/*
+		 * Set the operation flags early so that bio_iov_iter_get_pages
+		 * can set up the page vector appropriately for a ZONE_APPEND
+		 * operation.
+		 */
+		bio->bi_opf = iomap_dio_bio_opflags(dio, iomap, use_fua);
+
 		ret = bio_iov_iter_get_pages(bio, dio->submit.iter);
 		if (unlikely(ret)) {
 			/*
@@ -292,14 +327,8 @@ iomap_dio_bio_actor(struct inode *inode, loff_t pos, loff_t length,
 
 		n = bio->bi_iter.bi_size;
 		if (dio->flags & IOMAP_DIO_WRITE) {
-			bio->bi_opf = REQ_OP_WRITE | REQ_SYNC | REQ_IDLE;
-			if (use_fua)
-				bio->bi_opf |= REQ_FUA;
-			else
-				dio->flags &= ~IOMAP_DIO_WRITE_FUA;
 			task_io_account_write(n);
 		} else {
-			bio->bi_opf = REQ_OP_READ;
 			if (dio->flags & IOMAP_DIO_DIRTY)
 				bio_set_pages_dirty(bio);
 		}
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index 4d1d3c3469e9..1bccd1880d0d 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -54,6 +54,7 @@ struct vm_fault;
 #define IOMAP_F_SHARED		0x04
 #define IOMAP_F_MERGED		0x08
 #define IOMAP_F_BUFFER_HEAD	0x10
+#define IOMAP_F_ZONE_APPEND	0x20
 
 /*
  * Flags set by the core iomap code during operations:
A ZONE_APPEND bio must follow hardware restrictions (e.g. not exceeding
max_zone_append_sectors) not to be split. bio_iov_iter_get_pages builds
such a restricted bio using __bio_iov_append_get_pages if bio_op(bio) ==
REQ_OP_ZONE_APPEND.

To utilize it, we need to set the bio_op before calling
bio_iov_iter_get_pages(). This commit introduces IOMAP_F_ZONE_APPEND, so
that iomap users can set the flag to indicate they want REQ_OP_ZONE_APPEND
and a restricted bio.

Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
---
 fs/iomap/direct-io.c  | 41 +++++++++++++++++++++++++++++++++++------
 include/linux/iomap.h |  1 +
 2 files changed, 36 insertions(+), 6 deletions(-)
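To show how an iomap user would set the new flag, here is a hedged sketch
of a ->iomap_begin handler for a zoned filesystem. All myfs_* names are
hypothetical; btrfs's actual implementation appears elsewhere in this
series. The only semantics taken from the thread are that iomap->addr
holds the byte address of the target zone and that the device reports the
actual write location at completion.

	/*
	 * Sketch only: a hypothetical filesystem's ->iomap_begin for a
	 * direct write into a sequential zone. Setting IOMAP_F_ZONE_APPEND
	 * makes the iomap dio code issue REQ_OP_ZONE_APPEND, so
	 * bio_iov_iter_get_pages() builds a bio that respects
	 * max_zone_append_sectors and must not be split.
	 */
	static int myfs_dio_iomap_begin(struct inode *inode, loff_t offset,
					loff_t length, unsigned flags,
					struct iomap *iomap, struct iomap *srcmap)
	{
		struct myfs_zone *zone = myfs_pick_zone(inode, offset);	/* hypothetical */

		iomap->type = IOMAP_MAPPED;
		iomap->flags = IOMAP_F_ZONE_APPEND;
		/* Byte address of the zone start (zone->start is in sectors);
		 * the device returns the actual write location on completion. */
		iomap->addr = (loff_t)zone->start << SECTOR_SHIFT;
		iomap->offset = offset;
		/* Cap the mapping so a single append never crosses the zone. */
		iomap->length = min_t(loff_t, length,
				      myfs_zone_remaining(zone));	/* hypothetical */
		iomap->bdev = inode->i_sb->s_bdev;
		return 0;
	}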