Message ID | c465430b0802ced71d22f548587f2e06951b3cd5.1725621577.git.asml.silence@gmail.com |
---|---|
State | New |
Series | implement async block discards and other ops via io_uring |
On Fri, Sep 06, 2024 at 11:57:25PM +0100, Pavel Begunkov wrote:
> Add a command that writes the zero page to the drive. Apart from passing
> the zero page instead of actual data it uses the normal write path and
> doesn't do any further acceleration, nor does it require any special
> hardware support. The intended use is to have a fallback when
> BLOCK_URING_CMD_WRITE_ZEROES is not supported.

That's just a horrible API. The user should not have to care if the
kernel is using different kinds of implementations.

On 9/10/24 09:02, Christoph Hellwig wrote:
> On Fri, Sep 06, 2024 at 11:57:25PM +0100, Pavel Begunkov wrote:
>> Add a command that writes the zero page to the drive. Apart from passing
>> the zero page instead of actual data it uses the normal write path and
>> doesn't do any further acceleration, nor does it require any special
>> hardware support. The intended use is to have a fallback when
>> BLOCK_URING_CMD_WRITE_ZEROES is not supported.
>
> That's just a horrible API. The user should not have to care if the
> kernel is using different kinds of implementations.

It's rather not a good api when instead of issuing a presumably low
overhead fast command the user expects sending a good bunch of actual
writes with different performance characteristics. In my experience,
such fallbacks cause more pain when a more explicit approach is
possible. And let me note that it's already exposed via fallocate, even
though in a bit different way.

On Tue, Sep 10, 2024 at 01:17:48PM +0100, Pavel Begunkov wrote:
> > > Add a command that writes the zero page to the drive. Apart from passing
> > > the zero page instead of actual data it uses the normal write path and
> > > doesn't do any further acceleration, nor does it require any special
> > > hardware support. The intended use is to have a fallback when
> > > BLOCK_URING_CMD_WRITE_ZEROES is not supported.
> >
> > That's just a horrible API. The user should not have to care if the
> > kernel is using different kinds of implementations.
>
> It's rather not a good api when instead of issuing a presumably low
> overhead fast command the user expects sending a good bunch of actual
> writes with different performance characteristics.

The normal use case (at least the ones I've been involved with) are
simply zero these blocks or the entire device, and please do it as
good as you can. Needing asynchronous error handling in userspace
for that is extremely counter productive.

> In my experience,
> such fallbacks cause more pain when a more explicit approach is
> possible. And let me note that it's already exposed via fallocate, even
> though in a bit different way.

Do you mean the FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE case in
blkdev_fallocate? As far as I can tell this is actually a really bad
example, as even a hardware offloaded write zeroes can and often does
write physical zeroes to the media, and does so from a firmware path
that is often slower than the kernel loop.

But if you have an actual use case where you want to send a write zeroes
command but never a loop of writes, it would be good to document that
and add a flag for it. And if we don't have that case it would still
be good to have a reserved flags field to add it later if needed.

Btw, do you have API documentation (e.g. in the form of a man page)
for these new calls somewhere?

On 9/10/24 15:20, Christoph Hellwig wrote:
> On Tue, Sep 10, 2024 at 01:17:48PM +0100, Pavel Begunkov wrote:
>>>> Add a command that writes the zero page to the drive. Apart from passing
>>>> the zero page instead of actual data it uses the normal write path and
>>>> doesn't do any further acceleration, nor does it require any special
>>>> hardware support. The intended use is to have a fallback when
>>>> BLOCK_URING_CMD_WRITE_ZEROES is not supported.
>>>
>>> That's just a horrible API. The user should not have to care if the
>>> kernel is using different kinds of implementations.
>>
>> It's rather not a good api when instead of issuing a presumably low
>> overhead fast command the user expects sending a good bunch of actual
>> writes with different performance characteristics.
>
> The normal use case (at least the ones I've been involved with) are
> simply zero these blocks or the entire device, and please do it as
> good as you can. Needing asynchronous error handling in userspace
> for that is extremely counter productive.

If we expect any error handling from the user space at all (we do),
it'll have to be asynchronous, it's async commands and io_uring.
Asking the user to reissue a command in some form is normal.

>> In my experience,
>> such fallbacks cause more pain when a more explicit approach is
>> possible. And let me note that it's already exposed via fallocate, even
>> though in a bit different way.
>
> Do you mean the FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE case in
> blkdev_fallocate? As far as I can tell this is actually a really bad
> example, as even a hardware offloaded write zeroes can and often does
> write physical zeroes to the media, and does so from a firmware path
> that is often slower than the kernel loop.

That's a shame, I agree, which is why I call it "presumably" faster,
but that actually gives more reasons why you might want this cmd
separately from write zeroes, considering the user might know
its hardware and the kernel doesn't try to choose which approach
is faster.

> But if you have an actual use case where you want to send a write zeroes
> command but never a loop of writes, it would be good to document that
> and add a flag for it. And if we don't have that case it would still

Users who know more about hw and e.g. prefer writes with 0 page as
per above. Users with lots of devices who care about pcie / memory
bandwidth, there is enough of those, they might want to do
something different like adjusting algorithms and throttling.
Better/easier testing, though of lesser importance.

Those I made up just now on the spot, but the reporter did
specifically ask about some way to differentiate fallbacks.

> be good to have a reserved flags field to add it later if needed.

	if (unlikely(sqe->ioprio || sqe->__pad1 || sqe->len ||
		     sqe->rw_flags || sqe->file_index))
		return -EINVAL;

There is a good bunch of sqe fields that can be used for that later.

> Btw, do you have API documentation (e.g. in the form of a man page)
> for these new calls somewhere?

Mentioned in the cover:

tests and docs:
https://github.com/isilence/liburing.git discard-cmd

man page specifically:
https://github.com/isilence/liburing/commit/a6fa2bc2400bf7fcb80496e322b5db4c8b3191f0

I'll send them once the kernel is set in place.

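The reserved-field argument above can be illustrated with a small standalone sketch. It is not the kernel code: `struct demo_sqe` and `demo_check_reserved()` are hypothetical stand-ins that only mirror the field names from the quoted check, and the real `struct io_uring_sqe` has a different layout. The idea is that a command which rejects non-zero values in unused fields today can later assign meaning to one of them (for example a flags word) without breaking existing callers.

```c
#include <errno.h>
#include <stdint.h>

/*
 * Hypothetical stand-in for the SQE fields named in the thread.
 * The real struct io_uring_sqe is laid out differently.
 */
struct demo_sqe {
	uint16_t ioprio;
	uint16_t pad1;		/* mirrors sqe->__pad1 */
	uint32_t len;
	uint32_t rw_flags;
	uint32_t file_index;
};

/*
 * Reject any request that sets a field this command does not define yet.
 * Because non-zero values fail with -EINVAL now, a later version can start
 * interpreting one of these fields without changing behaviour for callers
 * that already pass zeroes.
 */
static int demo_check_reserved(const struct demo_sqe *sqe)
{
	if (sqe->ioprio || sqe->pad1 || sqe->len ||
	    sqe->rw_flags || sqe->file_index)
		return -EINVAL;
	return 0;
}
```

A caller that zero-initialises the structure passes the check, while one that sets, say, `rw_flags` gets `-EINVAL` until that field is given a defined meaning.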
On Tue, Sep 10, 2024 at 09:10:34PM +0100, Pavel Begunkov wrote:
> If we expect any error handling from the user space at all (we do),
> it'll have to be asynchronous, it's async commands and io_uring.
> Asking the user to reissue a command in some form is normal.

The point is that pretty much all other errors are fatal, while this
is a "not supported" error for which we have a guaranteed to work kernel
fallback. Kicking it off requires a bit of work, but I'd rather have
that in one place rather than applications that work on some hardware
and not others.

> That's a shame, I agree, which is why I call it "presumably" faster,
> but that actually gives more reasons why you might want this cmd
> separately from write zeroes, considering the user might know
> its hardware and the kernel doesn't try to choose which approach
> is faster.

But the kernel is the right place to make that decision, even if we
aren't very smart about it right now. Fanning that out to every
single application is a bad idea.

> Users who know more about hw and e.g. prefer writes with 0 page as
> per above. Users with lots of devices who care about pcie / memory
> bandwidth, there is enough of those, they might want to do
> something different like adjusting algorithms and throttling.
> Better/easier testing, though of lesser importance.
>
> Those I made up just now on the spot, but the reporter did
> specifically ask about some way to differentiate fallbacks.

Well, an optional nofallback flag would be in line with how we do
that. Do you have the original report to share somewhere?

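As a rough model of the alternative suggested in this reply, the sketch below puts the fallback decision in one place and lets the caller opt out with a flag. Everything here is hypothetical and for illustration only: `DEMO_ZERO_NOFALLBACK`, `enum demo_path` and `demo_write_zeroes()` are not part of the posted patch; the name is merely chosen to resemble the in-kernel `BLKDEV_ZERO_NOFALLBACK` flag.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical opt-out flag; not part of the posted patch. */
#define DEMO_ZERO_NOFALLBACK	(1u << 0)

/* Which path the model picked, recorded for illustration. */
enum demo_path {
	DEMO_PATH_NONE,
	DEMO_PATH_OFFLOAD,	/* hardware write zeroes command */
	DEMO_PATH_ZERO_PAGES,	/* regular writes of the zero page */
};

/*
 * Single decision point: use the offload when the device supports it,
 * otherwise fall back to zero-page writes unless the caller asked for
 * no fallback, in which case report -EOPNOTSUPP.
 */
static int demo_write_zeroes(bool hw_supports_write_zeroes,
			     unsigned int flags, enum demo_path *path)
{
	*path = DEMO_PATH_NONE;

	if (hw_supports_write_zeroes) {
		*path = DEMO_PATH_OFFLOAD;
		return 0;
	}
	if (flags & DEMO_ZERO_NOFALLBACK)
		return -EOPNOTSUPP;

	*path = DEMO_PATH_ZERO_PAGES;
	return 0;
}
```

With this shape, applications that just want the range zeroed never see the "not supported" error, while those that want to tell the two paths apart pass the flag and handle `-EOPNOTSUPP` themselves.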
On 9/12/24 10:26, Christoph Hellwig wrote:
> On Tue, Sep 10, 2024 at 09:10:34PM +0100, Pavel Begunkov wrote:
>> If we expect any error handling from the user space at all (we do),
>> it'll have to be asynchronous, it's async commands and io_uring.
>> Asking the user to reissue a command in some form is normal.
>
> The point is that pretty much all other errors are fatal, while this
> is a "not supported" error for which we have a guaranteed to work kernel

Yes, and there will be an error indicating that it's not
supported, just like it'll return an error if these io_uring commands
are not supported by a given kernel.

> fallback. Kicking it off requires a bit of work, but I'd rather have
> that in one place rather than applications that work on some hardware
> and not others.

There is nothing new in features that might be unsupported,
because of hardware or otherwise, it's giving control to the
userspace.

>> That's a shame, I agree, which is why I call it "presumably" faster,
>> but that actually gives more reasons why you might want this cmd
>> separately from write zeroes, considering the user might know
>> its hardware and the kernel doesn't try to choose which approach
>> is faster.
>
> But the kernel is the right place to make that decision, even if we
> aren't very smart about it right now. Fanning that out to every
> single application is a bad idea.

Apart that it will never happen

>> Users who know more about hw and e.g. prefer writes with 0 page as
>> per above. Users with lots of devices who care about pcie / memory
>> bandwidth, there is enough of those, they might want to do
>> something different like adjusting algorithms and throttling.
>> Better/easier testing, though of lesser importance.
>>
>> Those I made up just now on the spot, but the reporter did
>> specifically ask about some way to differentiate fallbacks.
>
> Well, an optional nofallback flag would be in line with how we do
> that. Do you have the original report to share somewhere?

Following with another flag "please do fallback", at which point it doesn't make any sense when that can be done in userspace.
diff --git a/block/ioctl.c b/block/ioctl.c
index ef4b2a90ad79..3cb479192023 100644
--- a/block/ioctl.c
+++ b/block/ioctl.c
@@ -774,7 +774,8 @@ static void bio_cmd_bio_end_io(struct bio *bio)
 
 static int blkdev_cmd_write_zeroes(struct io_uring_cmd *cmd,
 				   struct block_device *bdev,
-				   uint64_t start, uint64_t len, bool nowait)
+				   uint64_t start, uint64_t len,
+				   bool nowait, bool zero_pages)
 {
 
 	sector_t bs_mask = (bdev_logical_block_size(bdev) >> SECTOR_SHIFT) - 1;
@@ -793,6 +794,20 @@ static int blkdev_cmd_write_zeroes(struct io_uring_cmd *cmd,
 	if (err)
 		return err;
 
+	if (zero_pages) {
+		struct blk_iou_cmd *bic = io_uring_cmd_to_pdu(cmd,
+						struct blk_iou_cmd);
+
+		err = blkdev_issue_zero_pages_bio(bdev, sector, nr_sects,
+						  gfp, &prev,
+						  BLKDEV_ZERO_PAGES_NOWAIT);
+		if (!prev)
+			return -EAGAIN;
+		if (err)
+			bic->res = err;
+		goto out_submit;
+	}
+
 	if (!limit)
 		return -EOPNOTSUPP;
 	/*
@@ -826,7 +841,7 @@ static int blkdev_cmd_write_zeroes(struct io_uring_cmd *cmd,
 	}
 	if (!prev)
 		return -EAGAIN;
-
+out_submit:
 	prev->bi_private = cmd;
 	prev->bi_end_io = bio_cmd_bio_end_io;
 	submit_bio(prev);
@@ -904,7 +919,10 @@ int blkdev_uring_cmd(struct io_uring_cmd *cmd, unsigned int issue_flags)
 		return blkdev_cmd_discard(cmd, bdev, start, len, bic->nowait);
 	case BLOCK_URING_CMD_WRITE_ZEROES:
 		return blkdev_cmd_write_zeroes(cmd, bdev, start, len,
-					       bic->nowait);
+					       bic->nowait, false);
+	case BLOCK_URING_CMD_WRITE_ZERO_PAGE:
+		return blkdev_cmd_write_zeroes(cmd, bdev, start, len,
+					       bic->nowait, true);
 	}
 	return -EINVAL;
 }
diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
index 68b0fccebf92..f4337b87d846 100644
--- a/include/uapi/linux/fs.h
+++ b/include/uapi/linux/fs.h
@@ -210,6 +210,7 @@ struct fsxattr {
 
 #define BLOCK_URING_CMD_DISCARD			_IO(0x12,137)
 #define BLOCK_URING_CMD_WRITE_ZEROES		_IO(0x12,138)
+#define BLOCK_URING_CMD_WRITE_ZERO_PAGE		_IO(0x12,139)
 
 #define BMAP_IOCTL 1		/* obsolete - kept for compatibility */
 #define FIBMAP	   _IO(0x00,1)	/* bmap access */

Add a command that writes the zero page to the drive. Apart from passing
the zero page instead of actual data it uses the normal write path and
doesn't do any further acceleration, nor does it require any special
hardware support. The intended use is to have a fallback when
BLOCK_URING_CMD_WRITE_ZEROES is not supported.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
---
 block/ioctl.c           | 24 +++++++++++++++++++++---
 include/uapi/linux/fs.h |  1 +
 2 files changed, 22 insertions(+), 3 deletions(-)
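
The intended use described in the commit message, trying the offloaded command first and reissuing as a zero-page write when it is unsupported, might look roughly like the sketch below from the application side. The command numbers are taken from the `include/uapi/linux/fs.h` hunk of this patch. The rest is assumed for illustration: `demo_submit_fn`, `demo_zero_range()` and the fake device are hypothetical, and a real caller would submit through io_uring and read the result from the completion instead of a synchronous callback.

```c
#include <errno.h>
#include <stdint.h>
#include <sys/ioctl.h>

/* Command numbers as defined in this patch's uapi hunk. */
#ifndef BLOCK_URING_CMD_WRITE_ZEROES
#define BLOCK_URING_CMD_WRITE_ZEROES	_IO(0x12, 138)
#endif
#ifndef BLOCK_URING_CMD_WRITE_ZERO_PAGE
#define BLOCK_URING_CMD_WRITE_ZERO_PAGE	_IO(0x12, 139)
#endif

/*
 * Hypothetical callback: submits one command for [start, start + len)
 * and returns its completion result (0 or a negative errno).
 */
typedef int (*demo_submit_fn)(unsigned long cmd, uint64_t start,
			      uint64_t len, void *ctx);

/* Try the offloaded command; reissue as zero-page writes if unsupported. */
static int demo_zero_range(demo_submit_fn submit, void *ctx,
			   uint64_t start, uint64_t len)
{
	int ret = submit(BLOCK_URING_CMD_WRITE_ZEROES, start, len, ctx);

	if (ret != -EOPNOTSUPP)
		return ret;
	return submit(BLOCK_URING_CMD_WRITE_ZERO_PAGE, start, len, ctx);
}

/* Fake device used only to exercise the sketch. */
struct demo_fake_dev {
	int offload_supported;	/* does it accept WRITE_ZEROES? */
	int calls;		/* commands submitted so far */
	unsigned long last_cmd;	/* most recent command seen */
};

static int demo_fake_submit(unsigned long cmd, uint64_t start,
			    uint64_t len, void *ctx)
{
	struct demo_fake_dev *dev = ctx;

	(void)start;
	(void)len;
	dev->calls++;
	dev->last_cmd = cmd;
	if (cmd == BLOCK_URING_CMD_WRITE_ZEROES && !dev->offload_supported)
		return -EOPNOTSUPP;
	return 0;
}
```

On a device without write zeroes support this issues two commands and ends with `BLOCK_URING_CMD_WRITE_ZERO_PAGE`; on a device with support it issues only the first. Other errors are returned unchanged, so only the "not supported" case triggers a reissue.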