Message ID: 72734501cc1d9e08117c215ed60f7b38e3665f14.1605007036.git.naohiro.aota@wdc.com
State:      New, archived
Series:     btrfs: zoned block device support
> 		struct iomap_dio *dio, struct iomap *iomap)
> @@ -278,6 +306,13 @@ iomap_dio_bio_actor(struct inode *inode, loff_t pos, loff_t length,
>  		bio->bi_private = dio;
>  		bio->bi_end_io = iomap_dio_bio_end_io;
>
> +		/*
> +		 * Set the operation flags early so that bio_iov_iter_get_pages
> +		 * can set up the page vector appropriately for a ZONE_APPEND
> +		 * operation.
> +		 */
> +		bio->bi_opf = iomap_dio_bio_opflags(dio, iomap, use_fua);

We could just calculate the flags once before the loop if we touch this
anyway.  But otherwise this looks ok to me.
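As an illustration of the suggestion above (a minimal sketch, not the posted
patch), the flag calculation could be hoisted out of the bio submission loop
in iomap_dio_bio_actor():

	/*
	 * Sketch only: neither iomap->flags nor use_fua changes between
	 * loop iterations, and clearing IOMAP_DIO_WRITE_FUA inside the
	 * helper is idempotent, so the flags can be computed once up front.
	 */
	unsigned int bio_opf = iomap_dio_bio_opflags(dio, iomap, use_fua);

	do {
		/* ... allocate the bio and fill in the usual fields ... */
		bio->bi_private = dio;
		bio->bi_end_io = iomap_dio_bio_end_io;
		bio->bi_opf = bio_opf;

		ret = bio_iov_iter_get_pages(bio, dio->submit.iter);
		/* ... */
	} while (nr_pages);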
On Tue, Nov 10, 2020 at 08:26:05PM +0900, Naohiro Aota wrote:
> A ZONE_APPEND bio must follow hardware restrictions (e.g. not exceeding
> max_zone_append_sectors) not to be split. bio_iov_iter_get_pages builds
> such restricted bio using __bio_iov_append_get_pages if bio_op(bio) ==
> REQ_OP_ZONE_APPEND.
>
> To utilize it, we need to set the bio_op before calling
> bio_iov_iter_get_pages(). This commit introduces IOMAP_F_ZONE_APPEND, so
> that iomap user can set the flag to indicate they want REQ_OP_ZONE_APPEND
> and restricted bio.
>
> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
> ---
>  fs/iomap/direct-io.c  | 41 +++++++++++++++++++++++++++++++++++------
>  include/linux/iomap.h |  1 +
>  2 files changed, 36 insertions(+), 6 deletions(-)
>
> diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
> index c1aafb2ab990..f04572a55a09 100644
> --- a/fs/iomap/direct-io.c
> +++ b/fs/iomap/direct-io.c
> @@ -200,6 +200,34 @@ iomap_dio_zero(struct iomap_dio *dio, struct iomap *iomap, loff_t pos,
>  	iomap_dio_submit_bio(dio, iomap, bio, pos);
>  }
>
> +/*
> + * Figure out the bio's operation flags from the dio request, the
> + * mapping, and whether or not we want FUA. Note that we can end up
> + * clearing the WRITE_FUA flag in the dio request.
> + */
> +static inline unsigned int
> +iomap_dio_bio_opflags(struct iomap_dio *dio, struct iomap *iomap, bool use_fua)

Hmm, just to check my understanding of what iomap has to do to support
all this:

When we're wanting to use a ZONE_APPEND command, the @iomap structure
has to have IOMAP_F_ZONE_APPEND set in iomap->flags, iomap->type is set
to IOMAP_MAPPED, but what should iomap->addr be set to?

I gather from what I see in zonefs and the relevant NVME proposal that
iomap->addr should be set to the (byte) address of the zone we want to
append to?  And if we do that, then bio->bi_iter.bi_sector will be set
to sector address of iomap->addr, right?

(I got lost trying to figure out how btrfs sets ->addr for appends.)

Then when the IO completes, the block layer sets bio->bi_iter.bi_sector
to wherever the drive told it that it actually wrote the bio, right?

If that's true, then that implies that need_zeroout must always be false
for an append operation, right?  Does that also mean that the directio
request has to be aligned to an fs block and not just the sector size?

Can userspace send a directio append that crosses a zone boundary?  If
so, what happens if a direct append to a lower address fails but a
direct append to a higher address succeeds?

> +{
> +	unsigned int opflags = REQ_SYNC | REQ_IDLE;
> +
> +	if (!(dio->flags & IOMAP_DIO_WRITE)) {
> +		WARN_ON_ONCE(iomap->flags & IOMAP_F_ZONE_APPEND);
> +		return REQ_OP_READ;
> +	}
> +
> +	if (iomap->flags & IOMAP_F_ZONE_APPEND)
> +		opflags |= REQ_OP_ZONE_APPEND;
> +	else
> +		opflags |= REQ_OP_WRITE;
> +
> +	if (use_fua)
> +		opflags |= REQ_FUA;
> +	else
> +		dio->flags &= ~IOMAP_DIO_WRITE_FUA;
> +
> +	return opflags;
> +}
> +
>  static loff_t
>  iomap_dio_bio_actor(struct inode *inode, loff_t pos, loff_t length,
>  		struct iomap_dio *dio, struct iomap *iomap)
> @@ -278,6 +306,13 @@ iomap_dio_bio_actor(struct inode *inode, loff_t pos, loff_t length,
>  		bio->bi_private = dio;
>  		bio->bi_end_io = iomap_dio_bio_end_io;
>
> +		/*
> +		 * Set the operation flags early so that bio_iov_iter_get_pages
> +		 * can set up the page vector appropriately for a ZONE_APPEND
> +		 * operation.
> +		 */
> +		bio->bi_opf = iomap_dio_bio_opflags(dio, iomap, use_fua);

I'm also vaguely wondering how to communicate the write location back to
the filesystem when the bio completes?  btrfs handles the bio completion
completely so it doesn't have a problem, but for other filesystems
(cough future xfs cough) either we'd have to add a new callback for
append operations; or I guess everyone could hook the bio endio.

Admittedly that's not really your problem, and for all I know hch is
already working on this.

--D

> +
>  		ret = bio_iov_iter_get_pages(bio, dio->submit.iter);
>  		if (unlikely(ret)) {
>  			/*
> @@ -292,14 +327,8 @@ iomap_dio_bio_actor(struct inode *inode, loff_t pos, loff_t length,
>
>  		n = bio->bi_iter.bi_size;
>  		if (dio->flags & IOMAP_DIO_WRITE) {
> -			bio->bi_opf = REQ_OP_WRITE | REQ_SYNC | REQ_IDLE;
> -			if (use_fua)
> -				bio->bi_opf |= REQ_FUA;
> -			else
> -				dio->flags &= ~IOMAP_DIO_WRITE_FUA;
>  			task_io_account_write(n);
>  		} else {
> -			bio->bi_opf = REQ_OP_READ;
>  			if (dio->flags & IOMAP_DIO_DIRTY)
>  				bio_set_pages_dirty(bio);
>  		}
> diff --git a/include/linux/iomap.h b/include/linux/iomap.h
> index 4d1d3c3469e9..1bccd1880d0d 100644
> --- a/include/linux/iomap.h
> +++ b/include/linux/iomap.h
> @@ -54,6 +54,7 @@ struct vm_fault;
>  #define IOMAP_F_SHARED		0x04
>  #define IOMAP_F_MERGED		0x08
>  #define IOMAP_F_BUFFER_HEAD	0x10
> +#define IOMAP_F_ZONE_APPEND	0x20
>
>  /*
>   * Flags set by the core iomap code during operations:
> --
> 2.27.0
>
On Tue, Nov 10, 2020 at 10:55:06AM -0800, Darrick J. Wong wrote:
> On Tue, Nov 10, 2020 at 08:26:05PM +0900, Naohiro Aota wrote:
> > A ZONE_APPEND bio must follow hardware restrictions (e.g. not exceeding
> > max_zone_append_sectors) not to be split. bio_iov_iter_get_pages builds
> > such restricted bio using __bio_iov_append_get_pages if bio_op(bio) ==
> > REQ_OP_ZONE_APPEND.
> >
> > To utilize it, we need to set the bio_op before calling
> > bio_iov_iter_get_pages(). This commit introduces IOMAP_F_ZONE_APPEND, so
> > that iomap user can set the flag to indicate they want REQ_OP_ZONE_APPEND
> > and restricted bio.
> >
> > Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
> > ---
> >  fs/iomap/direct-io.c  | 41 +++++++++++++++++++++++++++++++++++------
> >  include/linux/iomap.h |  1 +
> >  2 files changed, 36 insertions(+), 6 deletions(-)
> >
> > diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
> > index c1aafb2ab990..f04572a55a09 100644
> > --- a/fs/iomap/direct-io.c
> > +++ b/fs/iomap/direct-io.c
> > @@ -200,6 +200,34 @@ iomap_dio_zero(struct iomap_dio *dio, struct iomap *iomap, loff_t pos,
> >  	iomap_dio_submit_bio(dio, iomap, bio, pos);
> >  }
> >
> > +/*
> > + * Figure out the bio's operation flags from the dio request, the
> > + * mapping, and whether or not we want FUA. Note that we can end up
> > + * clearing the WRITE_FUA flag in the dio request.
> > + */
> > +static inline unsigned int
> > +iomap_dio_bio_opflags(struct iomap_dio *dio, struct iomap *iomap, bool use_fua)
>
> Hmm, just to check my understanding of what iomap has to do to support
> all this:
>
> When we're wanting to use a ZONE_APPEND command, the @iomap structure
> has to have IOMAP_F_ZONE_APPEND set in iomap->flags, iomap->type is set
> to IOMAP_MAPPED, but what should iomap->addr be set to?
>
> I gather from what I see in zonefs and the relevant NVME proposal that
> iomap->addr should be set to the (byte) address of the zone we want to
> append to?  And if we do that, then bio->bi_iter.bi_sector will be set
> to sector address of iomap->addr, right?
>
> (I got lost trying to figure out how btrfs sets ->addr for appends.)
>
> Then when the IO completes, the block layer sets bio->bi_iter.bi_sector
> to wherever the drive told it that it actually wrote the bio, right?
>
> If that's true, then that implies that need_zeroout must always be false
> for an append operation, right?  Does that also mean that the directio
> request has to be aligned to an fs block and not just the sector size?
>
> Can userspace send a directio append that crosses a zone boundary?  If
> so, what happens if a direct append to a lower address fails but a
> direct append to a higher address succeeds?

Bleh, vim tomfoolery == missing sentence.  Change the above paragraph
to read:

Can userspace send a directio append that crosses a zone boundary?  Can
we issue multiple bios for a single append write?  What happens if a
direct append to a lower address fails but a direct append to a higher
address succeeds?

--D

> > +{
> > +	unsigned int opflags = REQ_SYNC | REQ_IDLE;
> > +
> > +	if (!(dio->flags & IOMAP_DIO_WRITE)) {
> > +		WARN_ON_ONCE(iomap->flags & IOMAP_F_ZONE_APPEND);
> > +		return REQ_OP_READ;
> > +	}
> > +
> > +	if (iomap->flags & IOMAP_F_ZONE_APPEND)
> > +		opflags |= REQ_OP_ZONE_APPEND;
> > +	else
> > +		opflags |= REQ_OP_WRITE;
> > +
> > +	if (use_fua)
> > +		opflags |= REQ_FUA;
> > +	else
> > +		dio->flags &= ~IOMAP_DIO_WRITE_FUA;
> > +
> > +	return opflags;
> > +}
> > +
> >  static loff_t
> >  iomap_dio_bio_actor(struct inode *inode, loff_t pos, loff_t length,
> >  		struct iomap_dio *dio, struct iomap *iomap)
> > @@ -278,6 +306,13 @@ iomap_dio_bio_actor(struct inode *inode, loff_t pos, loff_t length,
> >  		bio->bi_private = dio;
> >  		bio->bi_end_io = iomap_dio_bio_end_io;
> >
> > +		/*
> > +		 * Set the operation flags early so that bio_iov_iter_get_pages
> > +		 * can set up the page vector appropriately for a ZONE_APPEND
> > +		 * operation.
> > +		 */
> > +		bio->bi_opf = iomap_dio_bio_opflags(dio, iomap, use_fua);
>
> I'm also vaguely wondering how to communicate the write location back to
> the filesystem when the bio completes?  btrfs handles the bio completion
> completely so it doesn't have a problem, but for other filesystems
> (cough future xfs cough) either we'd have to add a new callback for
> append operations; or I guess everyone could hook the bio endio.
>
> Admittedly that's not really your problem, and for all I know hch is
> already working on this.
>
> --D
>
> > +
> >  		ret = bio_iov_iter_get_pages(bio, dio->submit.iter);
> >  		if (unlikely(ret)) {
> >  			/*
> > @@ -292,14 +327,8 @@ iomap_dio_bio_actor(struct inode *inode, loff_t pos, loff_t length,
> >
> >  		n = bio->bi_iter.bi_size;
> >  		if (dio->flags & IOMAP_DIO_WRITE) {
> > -			bio->bi_opf = REQ_OP_WRITE | REQ_SYNC | REQ_IDLE;
> > -			if (use_fua)
> > -				bio->bi_opf |= REQ_FUA;
> > -			else
> > -				dio->flags &= ~IOMAP_DIO_WRITE_FUA;
> >  			task_io_account_write(n);
> >  		} else {
> > -			bio->bi_opf = REQ_OP_READ;
> >  			if (dio->flags & IOMAP_DIO_DIRTY)
> >  				bio_set_pages_dirty(bio);
> >  		}
> > diff --git a/include/linux/iomap.h b/include/linux/iomap.h
> > index 4d1d3c3469e9..1bccd1880d0d 100644
> > --- a/include/linux/iomap.h
> > +++ b/include/linux/iomap.h
> > @@ -54,6 +54,7 @@ struct vm_fault;
> >  #define IOMAP_F_SHARED		0x04
> >  #define IOMAP_F_MERGED		0x08
> >  #define IOMAP_F_BUFFER_HEAD	0x10
> > +#define IOMAP_F_ZONE_APPEND	0x20
> >
> >  /*
> >   * Flags set by the core iomap code during operations:
> > --
> > 2.27.0
> >
On Tue, Nov 10, 2020 at 10:55:06AM -0800, Darrick J. Wong wrote:
> When we're wanting to use a ZONE_APPEND command, the @iomap structure
> has to have IOMAP_F_ZONE_APPEND set in iomap->flags, iomap->type is set
> to IOMAP_MAPPED, but what should iomap->addr be set to?
>
> I gather from what I see in zonefs and the relevant NVME proposal that
> iomap->addr should be set to the (byte) address of the zone we want to
> append to?  And if we do that, then bio->bi_iter.bi_sector will be set
> to sector address of iomap->addr, right?

Yes.

> Then when the IO completes, the block layer sets bio->bi_iter.bi_sector
> to wherever the drive told it that it actually wrote the bio, right?

Yes.

> If that's true, then that implies that need_zeroout must always be false
> for an append operation, right?  Does that also mean that the directio
> request has to be aligned to an fs block and not just the sector size?

I think so, yes.

> Can userspace send a directio append that crosses a zone boundary?  If
> so, what happens if a direct append to a lower address fails but a
> direct append to a higher address succeeds?

Userspace doesn't know about zone boundaries.  It can send I/O larger
than a zone, but the file system has to split it into multiple I/Os,
just like when it has to cross an AG boundary in XFS.

> I'm also vaguely wondering how to communicate the write location back to
> the filesystem when the bio completes?  btrfs handles the bio completion
> completely so it doesn't have a problem, but for other filesystems
> (cough future xfs cough) either we'd have to add a new callback for
> append operations; or I guess everyone could hook the bio endio.
>
> Admittedly that's not really your problem, and for all I know hch is
> already working on this.

I think any non-trivial file system needs to override the bio completion
handler for writes anyway, so this seems reasonable.  It might be worth
documenting, though.
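To make the completion side concrete, here is a minimal sketch of the
"hook the bio endio" approach discussed above. All myfs_* names are
hypothetical; the only fact taken from the thread is that the block layer
rewrites bio->bi_iter.bi_sector to the actual write location when a zone
append completes.

	/*
	 * Sketch only: a hypothetical filesystem's end_io hook for a
	 * REQ_OP_ZONE_APPEND bio.
	 */
	static void myfs_dio_end_io(struct bio *bio)
	{
		struct myfs_dio *mdio = bio->bi_private;	/* hypothetical */

		/*
		 * For REQ_OP_ZONE_APPEND the block layer has rewritten
		 * bio->bi_iter.bi_sector to the sector the device actually
		 * wrote to, so the filesystem can record the real mapping.
		 */
		if (bio_op(bio) == REQ_OP_ZONE_APPEND)
			mdio->written_addr = bio->bi_iter.bi_sector << SECTOR_SHIFT;

		if (bio->bi_status)
			mdio->error = blk_status_to_errno(bio->bi_status);

		myfs_dio_complete(mdio);	/* hypothetical */
		bio_put(bio);
	}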
On Tue, Nov 10, 2020 at 08:26:05PM +0900, Naohiro Aota wrote:
> A ZONE_APPEND bio must follow hardware restrictions (e.g. not exceeding
> max_zone_append_sectors) not to be split. bio_iov_iter_get_pages builds
> such restricted bio using __bio_iov_append_get_pages if bio_op(bio) ==
> REQ_OP_ZONE_APPEND.
>
> To utilize it, we need to set the bio_op before calling
> bio_iov_iter_get_pages(). This commit introduces IOMAP_F_ZONE_APPEND, so
> that iomap user can set the flag to indicate they want REQ_OP_ZONE_APPEND
> and restricted bio.
>
> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>

Christoph's answers seem reasonable to me, so

Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>

Er... do you want me to take this one via the iomap tree?

--D

> ---
>  fs/iomap/direct-io.c  | 41 +++++++++++++++++++++++++++++++++++------
>  include/linux/iomap.h |  1 +
>  2 files changed, 36 insertions(+), 6 deletions(-)
>
> diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
> index c1aafb2ab990..f04572a55a09 100644
> --- a/fs/iomap/direct-io.c
> +++ b/fs/iomap/direct-io.c
> @@ -200,6 +200,34 @@ iomap_dio_zero(struct iomap_dio *dio, struct iomap *iomap, loff_t pos,
>  	iomap_dio_submit_bio(dio, iomap, bio, pos);
>  }
>
> +/*
> + * Figure out the bio's operation flags from the dio request, the
> + * mapping, and whether or not we want FUA. Note that we can end up
> + * clearing the WRITE_FUA flag in the dio request.
> + */
> +static inline unsigned int
> +iomap_dio_bio_opflags(struct iomap_dio *dio, struct iomap *iomap, bool use_fua)
> +{
> +	unsigned int opflags = REQ_SYNC | REQ_IDLE;
> +
> +	if (!(dio->flags & IOMAP_DIO_WRITE)) {
> +		WARN_ON_ONCE(iomap->flags & IOMAP_F_ZONE_APPEND);
> +		return REQ_OP_READ;
> +	}
> +
> +	if (iomap->flags & IOMAP_F_ZONE_APPEND)
> +		opflags |= REQ_OP_ZONE_APPEND;
> +	else
> +		opflags |= REQ_OP_WRITE;
> +
> +	if (use_fua)
> +		opflags |= REQ_FUA;
> +	else
> +		dio->flags &= ~IOMAP_DIO_WRITE_FUA;
> +
> +	return opflags;
> +}
> +
>  static loff_t
>  iomap_dio_bio_actor(struct inode *inode, loff_t pos, loff_t length,
>  		struct iomap_dio *dio, struct iomap *iomap)
> @@ -278,6 +306,13 @@ iomap_dio_bio_actor(struct inode *inode, loff_t pos, loff_t length,
>  		bio->bi_private = dio;
>  		bio->bi_end_io = iomap_dio_bio_end_io;
>
> +		/*
> +		 * Set the operation flags early so that bio_iov_iter_get_pages
> +		 * can set up the page vector appropriately for a ZONE_APPEND
> +		 * operation.
> +		 */
> +		bio->bi_opf = iomap_dio_bio_opflags(dio, iomap, use_fua);
> +
>  		ret = bio_iov_iter_get_pages(bio, dio->submit.iter);
>  		if (unlikely(ret)) {
>  			/*
> @@ -292,14 +327,8 @@ iomap_dio_bio_actor(struct inode *inode, loff_t pos, loff_t length,
>
>  		n = bio->bi_iter.bi_size;
>  		if (dio->flags & IOMAP_DIO_WRITE) {
> -			bio->bi_opf = REQ_OP_WRITE | REQ_SYNC | REQ_IDLE;
> -			if (use_fua)
> -				bio->bi_opf |= REQ_FUA;
> -			else
> -				dio->flags &= ~IOMAP_DIO_WRITE_FUA;
>  			task_io_account_write(n);
>  		} else {
> -			bio->bi_opf = REQ_OP_READ;
>  			if (dio->flags & IOMAP_DIO_DIRTY)
>  				bio_set_pages_dirty(bio);
>  		}
> diff --git a/include/linux/iomap.h b/include/linux/iomap.h
> index 4d1d3c3469e9..1bccd1880d0d 100644
> --- a/include/linux/iomap.h
> +++ b/include/linux/iomap.h
> @@ -54,6 +54,7 @@ struct vm_fault;
>  #define IOMAP_F_SHARED		0x04
>  #define IOMAP_F_MERGED		0x08
>  #define IOMAP_F_BUFFER_HEAD	0x10
> +#define IOMAP_F_ZONE_APPEND	0x20
>
>  /*
>   * Flags set by the core iomap code during operations:
> --
> 2.27.0
>
On 30/11/2020 19:16, Darrick J. Wong wrote:
> On Tue, Nov 10, 2020 at 08:26:05PM +0900, Naohiro Aota wrote:
>> A ZONE_APPEND bio must follow hardware restrictions (e.g. not exceeding
>> max_zone_append_sectors) not to be split. bio_iov_iter_get_pages builds
>> such restricted bio using __bio_iov_append_get_pages if bio_op(bio) ==
>> REQ_OP_ZONE_APPEND.
>>
>> To utilize it, we need to set the bio_op before calling
>> bio_iov_iter_get_pages(). This commit introduces IOMAP_F_ZONE_APPEND, so
>> that iomap user can set the flag to indicate they want REQ_OP_ZONE_APPEND
>> and restricted bio.
>>
>> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
>
> Christoph's answers seem reasonable to me, so
>
> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
>
> Er... do you want me to take this one via the iomap tree?

Ahm, probably yes, but we have an update addressing Christoph's comments
coming with the next version of the series.
Btw, another thing I noticed:

when using io_uring to submit a write to btrfs that ends up using Zone
Append we'll hit the

	if (WARN_ON_ONCE(is_bvec))
		return -EINVAL;

case in bio_iov_iter_get_pages with the changes in this series.
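For context: the is_bvec case corresponds to an ITER_BVEC iterator, which
io_uring produces for registered (fixed) buffers; plain reads and writes go
through an iovec iterator instead. A minimal userspace sketch (an
illustration, not from this thread; the file path is hypothetical) that
exercises that path via liburing:

	/* Sketch: O_DIRECT fixed-buffer write via liburing. */
	#define _GNU_SOURCE
	#include <fcntl.h>
	#include <stdlib.h>
	#include <string.h>
	#include <sys/uio.h>
	#include <liburing.h>

	int main(void)
	{
		struct io_uring ring;
		struct io_uring_sqe *sqe;
		struct io_uring_cqe *cqe;
		struct iovec iov;
		void *buf;
		int fd;

		if (posix_memalign(&buf, 4096, 1 << 20))
			return 1;
		memset(buf, 0xab, 1 << 20);
		iov.iov_base = buf;
		iov.iov_len = 1 << 20;

		fd = open("/mnt/test/io_uring", O_WRONLY | O_CREAT | O_DIRECT, 0644);
		if (fd < 0)
			return 1;

		io_uring_queue_init(8, &ring, 0);
		/* Registering the buffer is what makes the kernel see an
		 * ITER_BVEC iterator for the write below. */
		io_uring_register_buffers(&ring, &iov, 1);

		sqe = io_uring_get_sqe(&ring);
		io_uring_prep_write_fixed(sqe, fd, buf, 1 << 20, 0, 0);
		io_uring_submit(&ring);
		io_uring_wait_cqe(&ring, &cqe);
		io_uring_cqe_seen(&ring, cqe);
		io_uring_queue_exit(&ring);
		return 0;
	}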
On 09/12/2020 10:34, Christoph Hellwig wrote:
> Btw, another thing I noticed:
>
> when using io_uring to submit a write to btrfs that ends up using Zone
> Append we'll hit the
>
> 	if (WARN_ON_ONCE(is_bvec))
> 		return -EINVAL;
>
> case in bio_iov_iter_get_pages with the changes in this series.

Yes this warning is totally bogus. It was in there from the beginning
of the zone-append series and I have no idea why I didn't kill it.

IIRC Chaitanya had a patch in his nvmet zoned series removing it.
On Wed, Dec 09, 2020 at 10:08:53AM +0000, Johannes Thumshirn wrote:
> On 09/12/2020 10:34, Christoph Hellwig wrote:
> > Btw, another thing I noticed:
> >
> > when using io_uring to submit a write to btrfs that ends up using Zone
> > Append we'll hit the
> >
> > 	if (WARN_ON_ONCE(is_bvec))
> > 		return -EINVAL;
> >
> > case in bio_iov_iter_get_pages with the changes in this series.
>
> Yes this warning is totally bogus. It was in there from the beginning of the
> zone-append series and I have no idea why I didn't kill it.
>
> IIRC Chaitanya had a patch in his nvmet zoned series removing it.

Yes, but it is wrong.  What we need is a version of
__bio_iov_bvec_add_pages that takes the hardware limits into account.
On 09/12/2020 11:10, hch@infradead.org wrote:
> On Wed, Dec 09, 2020 at 10:08:53AM +0000, Johannes Thumshirn wrote:
>> On 09/12/2020 10:34, Christoph Hellwig wrote:
>>> Btw, another thing I noticed:
>>>
>>> when using io_uring to submit a write to btrfs that ends up using Zone
>>> Append we'll hit the
>>>
>>> 	if (WARN_ON_ONCE(is_bvec))
>>> 		return -EINVAL;
>>>
>>> case in bio_iov_iter_get_pages with the changes in this series.
>>
>> Yes this warning is totally bogus. It was in there from the beginning of the
>> zone-append series and I have no idea why I didn't kill it.
>>
>> IIRC Chaitanya had a patch in his nvmet zoned series removing it.
>
> Yes, but it is wrong.  What we need is a version of
> __bio_iov_bvec_add_pages that takes the hardware limits into account.

Ah now I understand the situation, I'm on it.
On 09/12/2020 11:18, Johannes Thumshirn wrote:
> On 09/12/2020 11:10, hch@infradead.org wrote:
>> On Wed, Dec 09, 2020 at 10:08:53AM +0000, Johannes Thumshirn wrote:
>>> On 09/12/2020 10:34, Christoph Hellwig wrote:
>>>> Btw, another thing I noticed:
>>>>
>>>> when using io_uring to submit a write to btrfs that ends up using Zone
>>>> Append we'll hit the
>>>>
>>>> 	if (WARN_ON_ONCE(is_bvec))
>>>> 		return -EINVAL;
>>>>
>>>> case in bio_iov_iter_get_pages with the changes in this series.
>>>
>>> Yes this warning is totally bogus. It was in there from the beginning of the
>>> zone-append series and I have no idea why I didn't kill it.
>>>
>>> IIRC Chaitanya had a patch in his nvmet zoned series removing it.
>>
>> Yes, but it is wrong.  What we need is a version of
>> __bio_iov_bvec_add_pages that takes the hardware limits into account.
>
> Ah now I understand the situation, I'm on it.

OK got something, just need to test it.
On 09/12/2020 14:41, Johannes Thumshirn wrote:
> On 09/12/2020 11:18, Johannes Thumshirn wrote:
>> On 09/12/2020 11:10, hch@infradead.org wrote:
>>> On Wed, Dec 09, 2020 at 10:08:53AM +0000, Johannes Thumshirn wrote:
>>>> On 09/12/2020 10:34, Christoph Hellwig wrote:
>>>>> Btw, another thing I noticed:
>>>>>
>>>>> when using io_uring to submit a write to btrfs that ends up using Zone
>>>>> Append we'll hit the
>>>>>
>>>>> 	if (WARN_ON_ONCE(is_bvec))
>>>>> 		return -EINVAL;
>>>>>
>>>>> case in bio_iov_iter_get_pages with the changes in this series.
>>>>
>>>> Yes this warning is totally bogus. It was in there from the beginning of the
>>>> zone-append series and I have no idea why I didn't kill it.
>>>>
>>>> IIRC Chaitanya had a patch in his nvmet zoned series removing it.
>>>
>>> Yes, but it is wrong.  What we need is a version of
>>> __bio_iov_bvec_add_pages that takes the hardware limits into account.
>>
>> Ah now I understand the situation, I'm on it.
>
> OK got something, just need to test it.

I just ran tests with my solution and to verify it worked as expected I
ran the test without it. Interestingly the WARN_ON() didn't trigger for
me. Here's the fio command line I've used:

fio --ioengine io_uring --rw readwrite --bs 1M --size 1G --time_based \
	--runtime 1m --filename /mnt/test/io_uring --name io_uring-test \
	--direct 1 --numjobs $NPROC

I did verify it's using zone append though. What did you use to trigger
the warning?
Johannes/Christoph,

On 12/10/20 23:30, Johannes Thumshirn wrote:
> On 09/12/2020 14:41, Johannes Thumshirn wrote:
>> On 09/12/2020 11:18, Johannes Thumshirn wrote:
>>> On 09/12/2020 11:10, hch@infradead.org wrote:
>>>> On Wed, Dec 09, 2020 at 10:08:53AM +0000, Johannes Thumshirn wrote:
>>>>> On 09/12/2020 10:34, Christoph Hellwig wrote:
>>>>>> Btw, another thing I noticed:
>>>>>>
>>>>>> when using io_uring to submit a write to btrfs that ends up using Zone
>>>>>> Append we'll hit the
>>>>>>
>>>>>> 	if (WARN_ON_ONCE(is_bvec))
>>>>>> 		return -EINVAL;
>>>>>>
>>>>>> case in bio_iov_iter_get_pages with the changes in this series.
>>>>> Yes this warning is totally bogus. It was in there from the beginning of the
>>>>> zone-append series and I have no idea why I didn't kill it.
>>>>>
>>>>> IIRC Chaitanya had a patch in his nvmet zoned series removing it.

Even though that patch is not needed, I've tested building bios from
bvecs with bio_iov_iter_get_pages() using the NVMeOF backend; I can
still send that patch, please confirm.
On 11/12/2020 22:24, Chaitanya Kulkarni wrote:
> Johannes/Christoph,
>
> On 12/10/20 23:30, Johannes Thumshirn wrote:
>> On 09/12/2020 14:41, Johannes Thumshirn wrote:
>>> On 09/12/2020 11:18, Johannes Thumshirn wrote:
>>>> On 09/12/2020 11:10, hch@infradead.org wrote:
>>>>> On Wed, Dec 09, 2020 at 10:08:53AM +0000, Johannes Thumshirn wrote:
>>>>>> On 09/12/2020 10:34, Christoph Hellwig wrote:
>>>>>>> Btw, another thing I noticed:
>>>>>>>
>>>>>>> when using io_uring to submit a write to btrfs that ends up using Zone
>>>>>>> Append we'll hit the
>>>>>>>
>>>>>>> 	if (WARN_ON_ONCE(is_bvec))
>>>>>>> 		return -EINVAL;
>>>>>>>
>>>>>>> case in bio_iov_iter_get_pages with the changes in this series.
>>>>>> Yes this warning is totally bogus. It was in there from the beginning of the
>>>>>> zone-append series and I have no idea why I didn't kill it.
>>>>>>
>>>>>> IIRC Chaitanya had a patch in his nvmet zoned series removing it.
>
> Even though that patch is not needed, I've tested building bios from
> bvecs with bio_iov_iter_get_pages() using the NVMeOF backend; I can
> still send that patch, please confirm.

I have the following locally, but my tests fail to exercise it:

commit ea93c91a70204a23ebf9e22b19fbf8add644e12e
Author: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Date:   Wed Dec 9 23:26:38 2020 +0900

    block: provide __bio_iov_bvec_add_append_pages

    Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>

diff --git a/block/bio.c b/block/bio.c
index 5c6982902330..dc8178ca796f 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -992,6 +992,31 @@ void bio_release_pages(struct bio *bio, bool mark_dirty)
 }
 EXPORT_SYMBOL_GPL(bio_release_pages);
 
+static int __bio_iov_bvec_add_append_pages(struct bio *bio,
+					   struct iov_iter *iter)
+{
+	const struct bio_vec *bv = iter->bvec;
+	struct request_queue *q = bio->bi_disk->queue;
+	unsigned int max_append_sectors = queue_max_zone_append_sectors(q);
+	unsigned int len;
+	size_t size;
+
+	if (WARN_ON_ONCE(!max_append_sectors))
+		return -EINVAL;
+
+	if (WARN_ON_ONCE(iter->iov_offset > bv->bv_len))
+		return -EINVAL;
+
+	len = min_t(size_t, bv->bv_len - iter->iov_offset, iter->count);
+	size = bio_add_hw_page(q, bio, bv->bv_page, len,
+			       bv->bv_offset + iter->iov_offset,
+			       max_append_sectors, false);
+	if (unlikely(size != len))
+		return -EINVAL;
+	iov_iter_advance(iter, size);
+	return 0;
+}
+
 static int __bio_iov_bvec_add_pages(struct bio *bio, struct iov_iter *iter)
 {
 	const struct bio_vec *bv = iter->bvec;
@@ -1142,9 +1167,10 @@ int bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
 
 	do {
 		if (bio_op(bio) == REQ_OP_ZONE_APPEND) {
-			if (WARN_ON_ONCE(is_bvec))
-				return -EINVAL;
-			ret = __bio_iov_append_get_pages(bio, iter);
+			if (is_bvec)
+				ret = __bio_iov_bvec_add_append_pages(bio, iter);
+			else
+				ret = __bio_iov_append_get_pages(bio, iter);
 		} else {
 			if (is_bvec)
 				ret = __bio_iov_bvec_add_pages(bio, iter);

It's basically __bio_iov_bvec_add_pages() respecting the
max_zone_append_sectors queue limit.
diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
index c1aafb2ab990..f04572a55a09 100644
--- a/fs/iomap/direct-io.c
+++ b/fs/iomap/direct-io.c
@@ -200,6 +200,34 @@ iomap_dio_zero(struct iomap_dio *dio, struct iomap *iomap, loff_t pos,
 	iomap_dio_submit_bio(dio, iomap, bio, pos);
 }
 
+/*
+ * Figure out the bio's operation flags from the dio request, the
+ * mapping, and whether or not we want FUA. Note that we can end up
+ * clearing the WRITE_FUA flag in the dio request.
+ */
+static inline unsigned int
+iomap_dio_bio_opflags(struct iomap_dio *dio, struct iomap *iomap, bool use_fua)
+{
+	unsigned int opflags = REQ_SYNC | REQ_IDLE;
+
+	if (!(dio->flags & IOMAP_DIO_WRITE)) {
+		WARN_ON_ONCE(iomap->flags & IOMAP_F_ZONE_APPEND);
+		return REQ_OP_READ;
+	}
+
+	if (iomap->flags & IOMAP_F_ZONE_APPEND)
+		opflags |= REQ_OP_ZONE_APPEND;
+	else
+		opflags |= REQ_OP_WRITE;
+
+	if (use_fua)
+		opflags |= REQ_FUA;
+	else
+		dio->flags &= ~IOMAP_DIO_WRITE_FUA;
+
+	return opflags;
+}
+
 static loff_t
 iomap_dio_bio_actor(struct inode *inode, loff_t pos, loff_t length,
 		struct iomap_dio *dio, struct iomap *iomap)
@@ -278,6 +306,13 @@ iomap_dio_bio_actor(struct inode *inode, loff_t pos, loff_t length,
 		bio->bi_private = dio;
 		bio->bi_end_io = iomap_dio_bio_end_io;
 
+		/*
+		 * Set the operation flags early so that bio_iov_iter_get_pages
+		 * can set up the page vector appropriately for a ZONE_APPEND
+		 * operation.
+		 */
+		bio->bi_opf = iomap_dio_bio_opflags(dio, iomap, use_fua);
+
 		ret = bio_iov_iter_get_pages(bio, dio->submit.iter);
 		if (unlikely(ret)) {
 			/*
@@ -292,14 +327,8 @@ iomap_dio_bio_actor(struct inode *inode, loff_t pos, loff_t length,
 
 		n = bio->bi_iter.bi_size;
 		if (dio->flags & IOMAP_DIO_WRITE) {
-			bio->bi_opf = REQ_OP_WRITE | REQ_SYNC | REQ_IDLE;
-			if (use_fua)
-				bio->bi_opf |= REQ_FUA;
-			else
-				dio->flags &= ~IOMAP_DIO_WRITE_FUA;
 			task_io_account_write(n);
 		} else {
-			bio->bi_opf = REQ_OP_READ;
 			if (dio->flags & IOMAP_DIO_DIRTY)
 				bio_set_pages_dirty(bio);
 		}
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index 4d1d3c3469e9..1bccd1880d0d 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -54,6 +54,7 @@ struct vm_fault;
 #define IOMAP_F_SHARED		0x04
 #define IOMAP_F_MERGED		0x08
 #define IOMAP_F_BUFFER_HEAD	0x10
+#define IOMAP_F_ZONE_APPEND	0x20
 
 /*
  * Flags set by the core iomap code during operations:
A ZONE_APPEND bio must follow hardware restrictions (e.g. not exceeding
max_zone_append_sectors) not to be split. bio_iov_iter_get_pages builds
such a restricted bio using __bio_iov_append_get_pages if bio_op(bio) ==
REQ_OP_ZONE_APPEND.

To utilize it, we need to set the bio_op before calling
bio_iov_iter_get_pages(). This commit introduces IOMAP_F_ZONE_APPEND, so
that iomap users can set the flag to indicate they want REQ_OP_ZONE_APPEND
and a restricted bio.

Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
---
 fs/iomap/direct-io.c  | 41 +++++++++++++++++++++++++++++++++++------
 include/linux/iomap.h |  1 +
 2 files changed, 36 insertions(+), 6 deletions(-)
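To show how an iomap user would set the new flag, here is a hedged sketch
of a ->iomap_begin handler for a zoned filesystem. All myfs_* names are
hypothetical; btrfs's actual implementation appears elsewhere in this
series. The only semantics taken from the thread are that iomap->addr
holds the byte address of the target zone and that the device reports the
actual write location at completion.

	/*
	 * Sketch only: a hypothetical filesystem's ->iomap_begin for a
	 * direct write into a sequential zone. Setting IOMAP_F_ZONE_APPEND
	 * makes the iomap dio code issue REQ_OP_ZONE_APPEND, so
	 * bio_iov_iter_get_pages() builds a bio that respects
	 * max_zone_append_sectors and must not be split.
	 */
	static int myfs_dio_iomap_begin(struct inode *inode, loff_t offset,
					loff_t length, unsigned flags,
					struct iomap *iomap, struct iomap *srcmap)
	{
		struct myfs_zone *zone = myfs_pick_zone(inode, offset);	/* hypothetical */

		iomap->type = IOMAP_MAPPED;
		iomap->flags = IOMAP_F_ZONE_APPEND;
		/* Byte address of the zone start (zone->start is in sectors);
		 * the device returns the actual write location on completion. */
		iomap->addr = (loff_t)zone->start << SECTOR_SHIFT;
		iomap->offset = offset;
		/* Cap the mapping so a single append never crosses the zone. */
		iomap->length = min_t(loff_t, length,
				      myfs_zone_remaining(zone));	/* hypothetical */
		iomap->bdev = inode->i_sb->s_bdev;
		return 0;
	}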