Message ID | 6ecd7ab3386f63f1656dc766c1b5b038ff5353c2.1723601134.git.asml.silence@gmail.com (mailing list archive) |
---|---|
State | New |
Headers | show |
Series | implement asynchronous BLKDISCARD via io_uring | expand |
On Wed, Aug 14, 2024 at 6:46 PM Pavel Begunkov <asml.silence@gmail.com> wrote: > > Add ->uring_cmd callback for block device files and use it to implement > asynchronous discard. Normally, it first tries to execute the command > from non-blocking context, which we limit to a single bio because > otherwise one of sub-bios may need to wait for other bios, and we don't > want to deal with partial IO. If non-blocking attempt fails, we'll retry > it in a blocking context. > > Suggested-by: Conrad Meyer <conradmeyer@meta.com> > Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> > --- > block/blk.h | 1 + > block/fops.c | 2 + > block/ioctl.c | 94 +++++++++++++++++++++++++++++++++++++++++ > include/uapi/linux/fs.h | 2 + > 4 files changed, 99 insertions(+) > > diff --git a/block/blk.h b/block/blk.h > index e180863f918b..5178c5ba6852 100644 > --- a/block/blk.h > +++ b/block/blk.h > @@ -571,6 +571,7 @@ blk_mode_t file_to_blk_mode(struct file *file); > int truncate_bdev_range(struct block_device *bdev, blk_mode_t mode, > loff_t lstart, loff_t lend); > long blkdev_ioctl(struct file *file, unsigned cmd, unsigned long arg); > +int blkdev_uring_cmd(struct io_uring_cmd *cmd, unsigned int issue_flags); > long compat_blkdev_ioctl(struct file *file, unsigned cmd, unsigned long arg); > > extern const struct address_space_operations def_blk_aops; > diff --git a/block/fops.c b/block/fops.c > index 9825c1713a49..8154b10b5abf 100644 > --- a/block/fops.c > +++ b/block/fops.c > @@ -17,6 +17,7 @@ > #include <linux/fs.h> > #include <linux/iomap.h> > #include <linux/module.h> > +#include <linux/io_uring/cmd.h> > #include "blk.h" > > static inline struct inode *bdev_file_inode(struct file *file) > @@ -873,6 +874,7 @@ const struct file_operations def_blk_fops = { > .splice_read = filemap_splice_read, > .splice_write = iter_file_splice_write, > .fallocate = blkdev_fallocate, > + .uring_cmd = blkdev_uring_cmd, Just be curious, we have IORING_OP_FALLOCATE already for sending discard to block device, why is .uring_cmd added for this purpose? Thanks,
On 8/14/24 7:42 PM, Ming Lei wrote: > On Wed, Aug 14, 2024 at 6:46?PM Pavel Begunkov <asml.silence@gmail.com> wrote: >> >> Add ->uring_cmd callback for block device files and use it to implement >> asynchronous discard. Normally, it first tries to execute the command >> from non-blocking context, which we limit to a single bio because >> otherwise one of sub-bios may need to wait for other bios, and we don't >> want to deal with partial IO. If non-blocking attempt fails, we'll retry >> it in a blocking context. >> >> Suggested-by: Conrad Meyer <conradmeyer@meta.com> >> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> >> --- >> block/blk.h | 1 + >> block/fops.c | 2 + >> block/ioctl.c | 94 +++++++++++++++++++++++++++++++++++++++++ >> include/uapi/linux/fs.h | 2 + >> 4 files changed, 99 insertions(+) >> >> diff --git a/block/blk.h b/block/blk.h >> index e180863f918b..5178c5ba6852 100644 >> --- a/block/blk.h >> +++ b/block/blk.h >> @@ -571,6 +571,7 @@ blk_mode_t file_to_blk_mode(struct file *file); >> int truncate_bdev_range(struct block_device *bdev, blk_mode_t mode, >> loff_t lstart, loff_t lend); >> long blkdev_ioctl(struct file *file, unsigned cmd, unsigned long arg); >> +int blkdev_uring_cmd(struct io_uring_cmd *cmd, unsigned int issue_flags); >> long compat_blkdev_ioctl(struct file *file, unsigned cmd, unsigned long arg); >> >> extern const struct address_space_operations def_blk_aops; >> diff --git a/block/fops.c b/block/fops.c >> index 9825c1713a49..8154b10b5abf 100644 >> --- a/block/fops.c >> +++ b/block/fops.c >> @@ -17,6 +17,7 @@ >> #include <linux/fs.h> >> #include <linux/iomap.h> >> #include <linux/module.h> >> +#include <linux/io_uring/cmd.h> >> #include "blk.h" >> >> static inline struct inode *bdev_file_inode(struct file *file) >> @@ -873,6 +874,7 @@ const struct file_operations def_blk_fops = { >> .splice_read = filemap_splice_read, >> .splice_write = iter_file_splice_write, >> .fallocate = blkdev_fallocate, >> + .uring_cmd = blkdev_uring_cmd, > > Just be curious, we have IORING_OP_FALLOCATE already for sending > discard to block device, why is .uring_cmd added for this purpose? I think wiring up a bdev uring_cmd makes sense, because: 1) The existing FALLOCATE op is using vfs_fallocate, which is inherently sync and hence always punted to io-wq. 2) There will most certainly be other async ops that would be interesting to add, at which point we'd need it anyway. 3) It arguably makes more sense to have a direct discard op than use fallocate for this, if working on a raw bdev. And probably others...
On 8/14/24 4:45 AM, Pavel Begunkov wrote: > diff --git a/block/ioctl.c b/block/ioctl.c > index c7a3e6c6f5fa..f7f9c4c6d6b5 100644 > --- a/block/ioctl.c > +++ b/block/ioctl.c > @@ -11,6 +11,8 @@ > #include <linux/blktrace_api.h> > #include <linux/pr.h> > #include <linux/uaccess.h> > +#include <linux/pagemap.h> > +#include <linux/io_uring/cmd.h> > #include "blk.h" > > static int blkpg_do_ioctl(struct block_device *bdev, > @@ -744,4 +746,96 @@ long compat_blkdev_ioctl(struct file *file, unsigned cmd, unsigned long arg) > > return ret; > } > + > +struct blk_cmd { > + blk_status_t status; > + bool nowait; > +}; > + > +static void blk_cmd_complete(struct io_uring_cmd *cmd, unsigned int issue_flags) > +{ > + struct blk_cmd *bc = io_uring_cmd_to_pdu(cmd, struct blk_cmd); > + int res = blk_status_to_errno(bc->status); > + > + if (res == -EAGAIN && bc->nowait) > + io_uring_cmd_issue_blocking(cmd); > + else > + io_uring_cmd_done(cmd, res, 0, issue_flags); > +} > + > +static void bio_cmd_end(struct bio *bio) > +{ > + struct io_uring_cmd *cmd = bio->bi_private; > + struct blk_cmd *bc = io_uring_cmd_to_pdu(cmd, struct blk_cmd); > + > + if (unlikely(bio->bi_status) && !bc->status) > + bc->status = bio->bi_status; > + > + io_uring_cmd_do_in_task_lazy(cmd, blk_cmd_complete); > + bio_put(bio); > +} > + > +static int blkdev_cmd_discard(struct io_uring_cmd *cmd, > + struct block_device *bdev, > + uint64_t start, uint64_t len, bool nowait) > +{ > + sector_t sector = start >> SECTOR_SHIFT; > + sector_t nr_sects = len >> SECTOR_SHIFT; > + struct bio *prev = NULL, *bio; > + int err; > + > + err = blk_validate_discard(bdev, file_to_blk_mode(cmd->file), > + start, len); > + if (err) > + return err; > + err = filemap_invalidate_pages(bdev->bd_mapping, start, > + start + len - 1, nowait); > + if (err) > + return err; > + > + while ((bio = blk_alloc_discard_bio(bdev, §or, &nr_sects, > + GFP_KERNEL))) { > + if (nowait) { > + if (unlikely(nr_sects)) { > + bio_put(bio); > + return -EAGAIN; > + } > + bio->bi_opf |= REQ_NOWAIT; > + } > + prev = bio_chain_and_submit(prev, bio); > + } > + if (!prev) > + return -EFAULT; > + > + prev->bi_private = cmd; > + prev->bi_end_io = bio_cmd_end; > + submit_bio(prev); > + return -EIOCBQUEUED; > +} > + > +int blkdev_uring_cmd(struct io_uring_cmd *cmd, unsigned int issue_flags) > +{ > + struct block_device *bdev = I_BDEV(cmd->file->f_mapping->host); > + struct blk_cmd *bc = io_uring_cmd_to_pdu(cmd, struct blk_cmd); > + const struct io_uring_sqe *sqe = cmd->sqe; > + u32 cmd_op = cmd->cmd_op; > + uint64_t start, len; > + > + if (unlikely(sqe->ioprio || sqe->__pad1 || sqe->len || > + sqe->rw_flags || sqe->file_index)) > + return -EINVAL; > + > + bc->status = BLK_STS_OK; > + bc->nowait = issue_flags & IO_URING_F_NONBLOCK; > + > + start = READ_ONCE(sqe->addr); > + len = READ_ONCE(sqe->addr3); > + > + switch (cmd_op) { > + case BLOCK_URING_CMD_DISCARD: > + return blkdev_cmd_discard(cmd, bdev, start, len, bc->nowait); > + } > + return -EINVAL; > +} This is inside the CONFIG_COMPAT section, which doesn't seem right.
On 8/15/24 15:33, Jens Axboe wrote: > On 8/14/24 7:42 PM, Ming Lei wrote: >> On Wed, Aug 14, 2024 at 6:46?PM Pavel Begunkov <asml.silence@gmail.com> wrote: >>> >>> Add ->uring_cmd callback for block device files and use it to implement >>> asynchronous discard. Normally, it first tries to execute the command >>> from non-blocking context, which we limit to a single bio because >>> otherwise one of sub-bios may need to wait for other bios, and we don't >>> want to deal with partial IO. If non-blocking attempt fails, we'll retry >>> it in a blocking context. >>> >>> Suggested-by: Conrad Meyer <conradmeyer@meta.com> >>> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> >>> --- >>> block/blk.h | 1 + >>> block/fops.c | 2 + >>> block/ioctl.c | 94 +++++++++++++++++++++++++++++++++++++++++ >>> include/uapi/linux/fs.h | 2 + >>> 4 files changed, 99 insertions(+) >>> >>> diff --git a/block/blk.h b/block/blk.h >>> index e180863f918b..5178c5ba6852 100644 >>> --- a/block/blk.h >>> +++ b/block/blk.h >>> @@ -571,6 +571,7 @@ blk_mode_t file_to_blk_mode(struct file *file); >>> int truncate_bdev_range(struct block_device *bdev, blk_mode_t mode, >>> loff_t lstart, loff_t lend); >>> long blkdev_ioctl(struct file *file, unsigned cmd, unsigned long arg); >>> +int blkdev_uring_cmd(struct io_uring_cmd *cmd, unsigned int issue_flags); >>> long compat_blkdev_ioctl(struct file *file, unsigned cmd, unsigned long arg); >>> >>> extern const struct address_space_operations def_blk_aops; >>> diff --git a/block/fops.c b/block/fops.c >>> index 9825c1713a49..8154b10b5abf 100644 >>> --- a/block/fops.c >>> +++ b/block/fops.c >>> @@ -17,6 +17,7 @@ >>> #include <linux/fs.h> >>> #include <linux/iomap.h> >>> #include <linux/module.h> >>> +#include <linux/io_uring/cmd.h> >>> #include "blk.h" >>> >>> static inline struct inode *bdev_file_inode(struct file *file) >>> @@ -873,6 +874,7 @@ const struct file_operations def_blk_fops = { >>> .splice_read = filemap_splice_read, >>> .splice_write = iter_file_splice_write, >>> .fallocate = blkdev_fallocate, >>> + .uring_cmd = blkdev_uring_cmd, >> >> Just be curious, we have IORING_OP_FALLOCATE already for sending >> discard to block device, why is .uring_cmd added for this purpose? Which is a good question, I haven't thought about it, but I tend to agree with Jens. Because vfs_fallocate is created synchronous IORING_OP_FALLOCATE is slow for anything but pretty large requests. Probably can be patched up, which would involve changing the fops->fallocate protot, but I'm not sure async there makes sense outside of bdev (?), and cmd approach is simpler, can be made somewhat more efficient (1 less layer in the way), and it's not really something completely new since we have it in ioctl. > I think wiring up a bdev uring_cmd makes sense, because: > > 1) The existing FALLOCATE op is using vfs_fallocate, which is inherently > sync and hence always punted to io-wq. > > 2) There will most certainly be other async ops that would be > interesting to add, at which point we'd need it anyway. > > 3) It arguably makes more sense to have a direct discard op than use > fallocate for this, if working on a raw bdev. > > And probably others... >
On Thu, Aug 15, 2024 at 06:11:13PM +0100, Pavel Begunkov wrote: > On 8/15/24 15:33, Jens Axboe wrote: > > On 8/14/24 7:42 PM, Ming Lei wrote: > > > On Wed, Aug 14, 2024 at 6:46?PM Pavel Begunkov <asml.silence@gmail.com> wrote: > > > > > > > > Add ->uring_cmd callback for block device files and use it to implement > > > > asynchronous discard. Normally, it first tries to execute the command > > > > from non-blocking context, which we limit to a single bio because > > > > otherwise one of sub-bios may need to wait for other bios, and we don't > > > > want to deal with partial IO. If non-blocking attempt fails, we'll retry > > > > it in a blocking context. > > > > > > > > Suggested-by: Conrad Meyer <conradmeyer@meta.com> > > > > Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> > > > > --- > > > > block/blk.h | 1 + > > > > block/fops.c | 2 + > > > > block/ioctl.c | 94 +++++++++++++++++++++++++++++++++++++++++ > > > > include/uapi/linux/fs.h | 2 + > > > > 4 files changed, 99 insertions(+) > > > > > > > > diff --git a/block/blk.h b/block/blk.h > > > > index e180863f918b..5178c5ba6852 100644 > > > > --- a/block/blk.h > > > > +++ b/block/blk.h > > > > @@ -571,6 +571,7 @@ blk_mode_t file_to_blk_mode(struct file *file); > > > > int truncate_bdev_range(struct block_device *bdev, blk_mode_t mode, > > > > loff_t lstart, loff_t lend); > > > > long blkdev_ioctl(struct file *file, unsigned cmd, unsigned long arg); > > > > +int blkdev_uring_cmd(struct io_uring_cmd *cmd, unsigned int issue_flags); > > > > long compat_blkdev_ioctl(struct file *file, unsigned cmd, unsigned long arg); > > > > > > > > extern const struct address_space_operations def_blk_aops; > > > > diff --git a/block/fops.c b/block/fops.c > > > > index 9825c1713a49..8154b10b5abf 100644 > > > > --- a/block/fops.c > > > > +++ b/block/fops.c > > > > @@ -17,6 +17,7 @@ > > > > #include <linux/fs.h> > > > > #include <linux/iomap.h> > > > > #include <linux/module.h> > > > > +#include <linux/io_uring/cmd.h> > > > > #include "blk.h" > > > > > > > > static inline struct inode *bdev_file_inode(struct file *file) > > > > @@ -873,6 +874,7 @@ const struct file_operations def_blk_fops = { > > > > .splice_read = filemap_splice_read, > > > > .splice_write = iter_file_splice_write, > > > > .fallocate = blkdev_fallocate, > > > > + .uring_cmd = blkdev_uring_cmd, > > > > > > Just be curious, we have IORING_OP_FALLOCATE already for sending > > > discard to block device, why is .uring_cmd added for this purpose? > > Which is a good question, I haven't thought about it, but I tend to > agree with Jens. Because vfs_fallocate is created synchronous > IORING_OP_FALLOCATE is slow for anything but pretty large requests. > Probably can be patched up, which would involve changing the > fops->fallocate protot, but I'm not sure async there makes sense > outside of bdev (?), and cmd approach is simpler, can be made > somewhat more efficient (1 less layer in the way), and it's not > really something completely new since we have it in ioctl. Yeah, we have ioctl(DISCARD), which acquires filemap_invalidate_lock, same with blkdev_fallocate(). But this patch drops this exclusive lock, so it becomes async friendly, but may cause stale page cache. However, if the lock is required, it can't be efficient anymore and io-wq may be inevitable, :-) Thanks, Ming
On 8/15/24 5:44 PM, Ming Lei wrote: > On Thu, Aug 15, 2024 at 06:11:13PM +0100, Pavel Begunkov wrote: >> On 8/15/24 15:33, Jens Axboe wrote: >>> On 8/14/24 7:42 PM, Ming Lei wrote: >>>> On Wed, Aug 14, 2024 at 6:46?PM Pavel Begunkov <asml.silence@gmail.com> wrote: >>>>> >>>>> Add ->uring_cmd callback for block device files and use it to implement >>>>> asynchronous discard. Normally, it first tries to execute the command >>>>> from non-blocking context, which we limit to a single bio because >>>>> otherwise one of sub-bios may need to wait for other bios, and we don't >>>>> want to deal with partial IO. If non-blocking attempt fails, we'll retry >>>>> it in a blocking context. >>>>> >>>>> Suggested-by: Conrad Meyer <conradmeyer@meta.com> >>>>> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> >>>>> --- >>>>> block/blk.h | 1 + >>>>> block/fops.c | 2 + >>>>> block/ioctl.c | 94 +++++++++++++++++++++++++++++++++++++++++ >>>>> include/uapi/linux/fs.h | 2 + >>>>> 4 files changed, 99 insertions(+) >>>>> >>>>> diff --git a/block/blk.h b/block/blk.h >>>>> index e180863f918b..5178c5ba6852 100644 >>>>> --- a/block/blk.h >>>>> +++ b/block/blk.h >>>>> @@ -571,6 +571,7 @@ blk_mode_t file_to_blk_mode(struct file *file); >>>>> int truncate_bdev_range(struct block_device *bdev, blk_mode_t mode, >>>>> loff_t lstart, loff_t lend); >>>>> long blkdev_ioctl(struct file *file, unsigned cmd, unsigned long arg); >>>>> +int blkdev_uring_cmd(struct io_uring_cmd *cmd, unsigned int issue_flags); >>>>> long compat_blkdev_ioctl(struct file *file, unsigned cmd, unsigned long arg); >>>>> >>>>> extern const struct address_space_operations def_blk_aops; >>>>> diff --git a/block/fops.c b/block/fops.c >>>>> index 9825c1713a49..8154b10b5abf 100644 >>>>> --- a/block/fops.c >>>>> +++ b/block/fops.c >>>>> @@ -17,6 +17,7 @@ >>>>> #include <linux/fs.h> >>>>> #include <linux/iomap.h> >>>>> #include <linux/module.h> >>>>> +#include <linux/io_uring/cmd.h> >>>>> #include "blk.h" >>>>> >>>>> static inline struct inode *bdev_file_inode(struct file *file) >>>>> @@ -873,6 +874,7 @@ const struct file_operations def_blk_fops = { >>>>> .splice_read = filemap_splice_read, >>>>> .splice_write = iter_file_splice_write, >>>>> .fallocate = blkdev_fallocate, >>>>> + .uring_cmd = blkdev_uring_cmd, >>>> >>>> Just be curious, we have IORING_OP_FALLOCATE already for sending >>>> discard to block device, why is .uring_cmd added for this purpose? >> >> Which is a good question, I haven't thought about it, but I tend to >> agree with Jens. Because vfs_fallocate is created synchronous >> IORING_OP_FALLOCATE is slow for anything but pretty large requests. >> Probably can be patched up, which would involve changing the >> fops->fallocate protot, but I'm not sure async there makes sense >> outside of bdev (?), and cmd approach is simpler, can be made >> somewhat more efficient (1 less layer in the way), and it's not >> really something completely new since we have it in ioctl. > > Yeah, we have ioctl(DISCARD), which acquires filemap_invalidate_lock, > same with blkdev_fallocate(). > > But this patch drops this exclusive lock, so it becomes async friendly, > but may cause stale page cache. However, if the lock is required, it can't > be efficient anymore and io-wq may be inevitable, :-) If you want to grab the lock, you can still opportunistically grab it. For (by far) the common case, you'll get it, and you can still do it inline. Really not that unusual.
On Thu, Aug 15, 2024 at 07:24:16PM -0600, Jens Axboe wrote: > On 8/15/24 5:44 PM, Ming Lei wrote: > > On Thu, Aug 15, 2024 at 06:11:13PM +0100, Pavel Begunkov wrote: > >> On 8/15/24 15:33, Jens Axboe wrote: > >>> On 8/14/24 7:42 PM, Ming Lei wrote: > >>>> On Wed, Aug 14, 2024 at 6:46?PM Pavel Begunkov <asml.silence@gmail.com> wrote: > >>>>> > >>>>> Add ->uring_cmd callback for block device files and use it to implement > >>>>> asynchronous discard. Normally, it first tries to execute the command > >>>>> from non-blocking context, which we limit to a single bio because > >>>>> otherwise one of sub-bios may need to wait for other bios, and we don't > >>>>> want to deal with partial IO. If non-blocking attempt fails, we'll retry > >>>>> it in a blocking context. > >>>>> > >>>>> Suggested-by: Conrad Meyer <conradmeyer@meta.com> > >>>>> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> > >>>>> --- > >>>>> block/blk.h | 1 + > >>>>> block/fops.c | 2 + > >>>>> block/ioctl.c | 94 +++++++++++++++++++++++++++++++++++++++++ > >>>>> include/uapi/linux/fs.h | 2 + > >>>>> 4 files changed, 99 insertions(+) > >>>>> > >>>>> diff --git a/block/blk.h b/block/blk.h > >>>>> index e180863f918b..5178c5ba6852 100644 > >>>>> --- a/block/blk.h > >>>>> +++ b/block/blk.h > >>>>> @@ -571,6 +571,7 @@ blk_mode_t file_to_blk_mode(struct file *file); > >>>>> int truncate_bdev_range(struct block_device *bdev, blk_mode_t mode, > >>>>> loff_t lstart, loff_t lend); > >>>>> long blkdev_ioctl(struct file *file, unsigned cmd, unsigned long arg); > >>>>> +int blkdev_uring_cmd(struct io_uring_cmd *cmd, unsigned int issue_flags); > >>>>> long compat_blkdev_ioctl(struct file *file, unsigned cmd, unsigned long arg); > >>>>> > >>>>> extern const struct address_space_operations def_blk_aops; > >>>>> diff --git a/block/fops.c b/block/fops.c > >>>>> index 9825c1713a49..8154b10b5abf 100644 > >>>>> --- a/block/fops.c > >>>>> +++ b/block/fops.c > >>>>> @@ -17,6 +17,7 @@ > >>>>> #include <linux/fs.h> > >>>>> #include <linux/iomap.h> > >>>>> #include <linux/module.h> > >>>>> +#include <linux/io_uring/cmd.h> > >>>>> #include "blk.h" > >>>>> > >>>>> static inline struct inode *bdev_file_inode(struct file *file) > >>>>> @@ -873,6 +874,7 @@ const struct file_operations def_blk_fops = { > >>>>> .splice_read = filemap_splice_read, > >>>>> .splice_write = iter_file_splice_write, > >>>>> .fallocate = blkdev_fallocate, > >>>>> + .uring_cmd = blkdev_uring_cmd, > >>>> > >>>> Just be curious, we have IORING_OP_FALLOCATE already for sending > >>>> discard to block device, why is .uring_cmd added for this purpose? > >> > >> Which is a good question, I haven't thought about it, but I tend to > >> agree with Jens. Because vfs_fallocate is created synchronous > >> IORING_OP_FALLOCATE is slow for anything but pretty large requests. > >> Probably can be patched up, which would involve changing the > >> fops->fallocate protot, but I'm not sure async there makes sense > >> outside of bdev (?), and cmd approach is simpler, can be made > >> somewhat more efficient (1 less layer in the way), and it's not > >> really something completely new since we have it in ioctl. > > > > Yeah, we have ioctl(DISCARD), which acquires filemap_invalidate_lock, > > same with blkdev_fallocate(). > > > > But this patch drops this exclusive lock, so it becomes async friendly, > > but may cause stale page cache. However, if the lock is required, it can't > > be efficient anymore and io-wq may be inevitable, :-) > > If you want to grab the lock, you can still opportunistically grab it. > For (by far) the common case, you'll get it, and you can still do it > inline. If the lock is grabbed in the whole cmd lifetime, it is basically one sync interface cause there is at most one async discard cmd in-flight for each device. Meantime the handling has to move to io-wq for avoiding to block current context, the interface becomes same with IORING_OP_FALLOCATE? thanks, Ming
On 8/16/24 02:45, Ming Lei wrote: > On Thu, Aug 15, 2024 at 07:24:16PM -0600, Jens Axboe wrote: >> On 8/15/24 5:44 PM, Ming Lei wrote: >>> On Thu, Aug 15, 2024 at 06:11:13PM +0100, Pavel Begunkov wrote: >>>> On 8/15/24 15:33, Jens Axboe wrote: >>>>> On 8/14/24 7:42 PM, Ming Lei wrote: >>>>>> On Wed, Aug 14, 2024 at 6:46?PM Pavel Begunkov <asml.silence@gmail.com> wrote: >>>>>>> >>>>>>> Add ->uring_cmd callback for block device files and use it to implement >>>>>>> asynchronous discard. Normally, it first tries to execute the command >>>>>>> from non-blocking context, which we limit to a single bio because >>>>>>> otherwise one of sub-bios may need to wait for other bios, and we don't >>>>>>> want to deal with partial IO. If non-blocking attempt fails, we'll retry >>>>>>> it in a blocking context. >>>>>>> >>>>>>> Suggested-by: Conrad Meyer <conradmeyer@meta.com> >>>>>>> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> >>>>>>> --- >>>>>>> block/blk.h | 1 + >>>>>>> block/fops.c | 2 + >>>>>>> block/ioctl.c | 94 +++++++++++++++++++++++++++++++++++++++++ >>>>>>> include/uapi/linux/fs.h | 2 + >>>>>>> 4 files changed, 99 insertions(+) >>>>>>> >>>>>>> diff --git a/block/blk.h b/block/blk.h >>>>>>> index e180863f918b..5178c5ba6852 100644 >>>>>>> --- a/block/blk.h >>>>>>> +++ b/block/blk.h >>>>>>> @@ -571,6 +571,7 @@ blk_mode_t file_to_blk_mode(struct file *file); >>>>>>> int truncate_bdev_range(struct block_device *bdev, blk_mode_t mode, >>>>>>> loff_t lstart, loff_t lend); >>>>>>> long blkdev_ioctl(struct file *file, unsigned cmd, unsigned long arg); >>>>>>> +int blkdev_uring_cmd(struct io_uring_cmd *cmd, unsigned int issue_flags); >>>>>>> long compat_blkdev_ioctl(struct file *file, unsigned cmd, unsigned long arg); >>>>>>> >>>>>>> extern const struct address_space_operations def_blk_aops; >>>>>>> diff --git a/block/fops.c b/block/fops.c >>>>>>> index 9825c1713a49..8154b10b5abf 100644 >>>>>>> --- a/block/fops.c >>>>>>> +++ b/block/fops.c >>>>>>> @@ -17,6 +17,7 @@ >>>>>>> #include <linux/fs.h> >>>>>>> #include <linux/iomap.h> >>>>>>> #include <linux/module.h> >>>>>>> +#include <linux/io_uring/cmd.h> >>>>>>> #include "blk.h" >>>>>>> >>>>>>> static inline struct inode *bdev_file_inode(struct file *file) >>>>>>> @@ -873,6 +874,7 @@ const struct file_operations def_blk_fops = { >>>>>>> .splice_read = filemap_splice_read, >>>>>>> .splice_write = iter_file_splice_write, >>>>>>> .fallocate = blkdev_fallocate, >>>>>>> + .uring_cmd = blkdev_uring_cmd, >>>>>> >>>>>> Just be curious, we have IORING_OP_FALLOCATE already for sending >>>>>> discard to block device, why is .uring_cmd added for this purpose? >>>> >>>> Which is a good question, I haven't thought about it, but I tend to >>>> agree with Jens. Because vfs_fallocate is created synchronous >>>> IORING_OP_FALLOCATE is slow for anything but pretty large requests. >>>> Probably can be patched up, which would involve changing the >>>> fops->fallocate protot, but I'm not sure async there makes sense >>>> outside of bdev (?), and cmd approach is simpler, can be made >>>> somewhat more efficient (1 less layer in the way), and it's not >>>> really something completely new since we have it in ioctl. >>> >>> Yeah, we have ioctl(DISCARD), which acquires filemap_invalidate_lock, >>> same with blkdev_fallocate(). >>> >>> But this patch drops this exclusive lock, so it becomes async friendly, >>> but may cause stale page cache. However, if the lock is required, it can't >>> be efficient anymore and io-wq may be inevitable, :-) >> >> If you want to grab the lock, you can still opportunistically grab it. >> For (by far) the common case, you'll get it, and you can still do it >> inline. > > If the lock is grabbed in the whole cmd lifetime, it is basically one sync > interface cause there is at most one async discard cmd in-flight for each > device. > > Meantime the handling has to move to io-wq for avoiding to block current > context, the interface becomes same with IORING_OP_FALLOCATE? Right, and agree that we can't trylock because we'd need to keep it locked until IO completes, at least the sync versions does that. But I think *invalidate_pages() in the patch should be enough. That's what the write path does, so it shouldn't cause any problem to the kernel. As for user space, that'd be more relaxed than the ioctl, just as writes are, so nothing new to the user. I hope someone with better filemap understanding can confirm it (or not).
On Fri, Aug 16, 2024 at 02:59:49AM +0100, Pavel Begunkov wrote: > On 8/16/24 02:45, Ming Lei wrote: > > On Thu, Aug 15, 2024 at 07:24:16PM -0600, Jens Axboe wrote: > > > On 8/15/24 5:44 PM, Ming Lei wrote: > > > > On Thu, Aug 15, 2024 at 06:11:13PM +0100, Pavel Begunkov wrote: > > > > > On 8/15/24 15:33, Jens Axboe wrote: > > > > > > On 8/14/24 7:42 PM, Ming Lei wrote: > > > > > > > On Wed, Aug 14, 2024 at 6:46?PM Pavel Begunkov <asml.silence@gmail.com> wrote: > > > > > > > > > > > > > > > > Add ->uring_cmd callback for block device files and use it to implement > > > > > > > > asynchronous discard. Normally, it first tries to execute the command > > > > > > > > from non-blocking context, which we limit to a single bio because > > > > > > > > otherwise one of sub-bios may need to wait for other bios, and we don't > > > > > > > > want to deal with partial IO. If non-blocking attempt fails, we'll retry > > > > > > > > it in a blocking context. > > > > > > > > > > > > > > > > Suggested-by: Conrad Meyer <conradmeyer@meta.com> > > > > > > > > Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> > > > > > > > > --- > > > > > > > > block/blk.h | 1 + > > > > > > > > block/fops.c | 2 + > > > > > > > > block/ioctl.c | 94 +++++++++++++++++++++++++++++++++++++++++ > > > > > > > > include/uapi/linux/fs.h | 2 + > > > > > > > > 4 files changed, 99 insertions(+) > > > > > > > > > > > > > > > > diff --git a/block/blk.h b/block/blk.h > > > > > > > > index e180863f918b..5178c5ba6852 100644 > > > > > > > > --- a/block/blk.h > > > > > > > > +++ b/block/blk.h > > > > > > > > @@ -571,6 +571,7 @@ blk_mode_t file_to_blk_mode(struct file *file); > > > > > > > > int truncate_bdev_range(struct block_device *bdev, blk_mode_t mode, > > > > > > > > loff_t lstart, loff_t lend); > > > > > > > > long blkdev_ioctl(struct file *file, unsigned cmd, unsigned long arg); > > > > > > > > +int blkdev_uring_cmd(struct io_uring_cmd *cmd, unsigned int issue_flags); > > > > > > > > long compat_blkdev_ioctl(struct file *file, unsigned cmd, unsigned long arg); > > > > > > > > > > > > > > > > extern const struct address_space_operations def_blk_aops; > > > > > > > > diff --git a/block/fops.c b/block/fops.c > > > > > > > > index 9825c1713a49..8154b10b5abf 100644 > > > > > > > > --- a/block/fops.c > > > > > > > > +++ b/block/fops.c > > > > > > > > @@ -17,6 +17,7 @@ > > > > > > > > #include <linux/fs.h> > > > > > > > > #include <linux/iomap.h> > > > > > > > > #include <linux/module.h> > > > > > > > > +#include <linux/io_uring/cmd.h> > > > > > > > > #include "blk.h" > > > > > > > > > > > > > > > > static inline struct inode *bdev_file_inode(struct file *file) > > > > > > > > @@ -873,6 +874,7 @@ const struct file_operations def_blk_fops = { > > > > > > > > .splice_read = filemap_splice_read, > > > > > > > > .splice_write = iter_file_splice_write, > > > > > > > > .fallocate = blkdev_fallocate, > > > > > > > > + .uring_cmd = blkdev_uring_cmd, > > > > > > > > > > > > > > Just be curious, we have IORING_OP_FALLOCATE already for sending > > > > > > > discard to block device, why is .uring_cmd added for this purpose? > > > > > > > > > > Which is a good question, I haven't thought about it, but I tend to > > > > > agree with Jens. Because vfs_fallocate is created synchronous > > > > > IORING_OP_FALLOCATE is slow for anything but pretty large requests. > > > > > Probably can be patched up, which would involve changing the > > > > > fops->fallocate protot, but I'm not sure async there makes sense > > > > > outside of bdev (?), and cmd approach is simpler, can be made > > > > > somewhat more efficient (1 less layer in the way), and it's not > > > > > really something completely new since we have it in ioctl. > > > > > > > > Yeah, we have ioctl(DISCARD), which acquires filemap_invalidate_lock, > > > > same with blkdev_fallocate(). > > > > > > > > But this patch drops this exclusive lock, so it becomes async friendly, > > > > but may cause stale page cache. However, if the lock is required, it can't > > > > be efficient anymore and io-wq may be inevitable, :-) > > > > > > If you want to grab the lock, you can still opportunistically grab it. > > > For (by far) the common case, you'll get it, and you can still do it > > > inline. > > > > If the lock is grabbed in the whole cmd lifetime, it is basically one sync > > interface cause there is at most one async discard cmd in-flight for each > > device. > > > > Meantime the handling has to move to io-wq for avoiding to block current > > context, the interface becomes same with IORING_OP_FALLOCATE? > > Right, and agree that we can't trylock because we'd need to keep it > locked until IO completes, at least the sync versions does that. > > But I think *invalidate_pages() in the patch should be enough. That's > what the write path does, so it shouldn't cause any problem to the > kernel. As for user space, that'd be more relaxed than the ioctl, > just as writes are, so nothing new to the user. I hope someone with > better filemap understanding can confirm it (or not). I may not be familiar with filemap enough, but looks *invalidate_pages() is only for removing pages from the page cache range, and the lock is added for preventing new page cache read from being started, so stale data read can be avoided when DISCARD is in-progress. Thanks, Ming
On 8/16/24 03:08, Ming Lei wrote: > On Fri, Aug 16, 2024 at 02:59:49AM +0100, Pavel Begunkov wrote: >> On 8/16/24 02:45, Ming Lei wrote: >>> On Thu, Aug 15, 2024 at 07:24:16PM -0600, Jens Axboe wrote: >>>> On 8/15/24 5:44 PM, Ming Lei wrote: >>>>> On Thu, Aug 15, 2024 at 06:11:13PM +0100, Pavel Begunkov wrote: >>>>>> On 8/15/24 15:33, Jens Axboe wrote: >>>>>>> On 8/14/24 7:42 PM, Ming Lei wrote: >>>>>>>> On Wed, Aug 14, 2024 at 6:46?PM Pavel Begunkov <asml.silence@gmail.com> wrote: >>>>>>>>> >>>>>>>>> Add ->uring_cmd callback for block device files and use it to implement >>>>>>>>> asynchronous discard. Normally, it first tries to execute the command >>>>>>>>> from non-blocking context, which we limit to a single bio because >>>>>>>>> otherwise one of sub-bios may need to wait for other bios, and we don't >>>>>>>>> want to deal with partial IO. If non-blocking attempt fails, we'll retry >>>>>>>>> it in a blocking context. >>>>>>>>> >>>>>>>>> Suggested-by: Conrad Meyer <conradmeyer@meta.com> >>>>>>>>> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> >>>>>>>>> --- >>>>>>>>> block/blk.h | 1 + >>>>>>>>> block/fops.c | 2 + >>>>>>>>> block/ioctl.c | 94 +++++++++++++++++++++++++++++++++++++++++ >>>>>>>>> include/uapi/linux/fs.h | 2 + >>>>>>>>> 4 files changed, 99 insertions(+) >>>>>>>>> >>>>>>>>> diff --git a/block/blk.h b/block/blk.h >>>>>>>>> index e180863f918b..5178c5ba6852 100644 >>>>>>>>> --- a/block/blk.h >>>>>>>>> +++ b/block/blk.h >>>>>>>>> @@ -571,6 +571,7 @@ blk_mode_t file_to_blk_mode(struct file *file); >>>>>>>>> int truncate_bdev_range(struct block_device *bdev, blk_mode_t mode, >>>>>>>>> loff_t lstart, loff_t lend); >>>>>>>>> long blkdev_ioctl(struct file *file, unsigned cmd, unsigned long arg); >>>>>>>>> +int blkdev_uring_cmd(struct io_uring_cmd *cmd, unsigned int issue_flags); >>>>>>>>> long compat_blkdev_ioctl(struct file *file, unsigned cmd, unsigned long arg); >>>>>>>>> >>>>>>>>> extern const struct address_space_operations def_blk_aops; >>>>>>>>> diff --git a/block/fops.c b/block/fops.c >>>>>>>>> index 9825c1713a49..8154b10b5abf 100644 >>>>>>>>> --- a/block/fops.c >>>>>>>>> +++ b/block/fops.c >>>>>>>>> @@ -17,6 +17,7 @@ >>>>>>>>> #include <linux/fs.h> >>>>>>>>> #include <linux/iomap.h> >>>>>>>>> #include <linux/module.h> >>>>>>>>> +#include <linux/io_uring/cmd.h> >>>>>>>>> #include "blk.h" >>>>>>>>> >>>>>>>>> static inline struct inode *bdev_file_inode(struct file *file) >>>>>>>>> @@ -873,6 +874,7 @@ const struct file_operations def_blk_fops = { >>>>>>>>> .splice_read = filemap_splice_read, >>>>>>>>> .splice_write = iter_file_splice_write, >>>>>>>>> .fallocate = blkdev_fallocate, >>>>>>>>> + .uring_cmd = blkdev_uring_cmd, >>>>>>>> >>>>>>>> Just be curious, we have IORING_OP_FALLOCATE already for sending >>>>>>>> discard to block device, why is .uring_cmd added for this purpose? >>>>>> >>>>>> Which is a good question, I haven't thought about it, but I tend to >>>>>> agree with Jens. Because vfs_fallocate is created synchronous >>>>>> IORING_OP_FALLOCATE is slow for anything but pretty large requests. >>>>>> Probably can be patched up, which would involve changing the >>>>>> fops->fallocate protot, but I'm not sure async there makes sense >>>>>> outside of bdev (?), and cmd approach is simpler, can be made >>>>>> somewhat more efficient (1 less layer in the way), and it's not >>>>>> really something completely new since we have it in ioctl. >>>>> >>>>> Yeah, we have ioctl(DISCARD), which acquires filemap_invalidate_lock, >>>>> same with blkdev_fallocate(). >>>>> >>>>> But this patch drops this exclusive lock, so it becomes async friendly, >>>>> but may cause stale page cache. However, if the lock is required, it can't >>>>> be efficient anymore and io-wq may be inevitable, :-) >>>> >>>> If you want to grab the lock, you can still opportunistically grab it. >>>> For (by far) the common case, you'll get it, and you can still do it >>>> inline. >>> >>> If the lock is grabbed in the whole cmd lifetime, it is basically one sync >>> interface cause there is at most one async discard cmd in-flight for each >>> device. >>> >>> Meantime the handling has to move to io-wq for avoiding to block current >>> context, the interface becomes same with IORING_OP_FALLOCATE? >> >> Right, and agree that we can't trylock because we'd need to keep it >> locked until IO completes, at least the sync versions does that. >> >> But I think *invalidate_pages() in the patch should be enough. That's >> what the write path does, so it shouldn't cause any problem to the >> kernel. As for user space, that'd be more relaxed than the ioctl, >> just as writes are, so nothing new to the user. I hope someone with >> better filemap understanding can confirm it (or not). > > I may not be familiar with filemap enough, but looks *invalidate_pages() > is only for removing pages from the page cache range, and the lock is added > for preventing new page cache read from being started, so stale data read > can be avoided when DISCARD is in-progress. Sounds like it, but the point is it's the same data race for the user as if it would've had a write in progress.
On 8/15/24 7:45 PM, Ming Lei wrote: > On Thu, Aug 15, 2024 at 07:24:16PM -0600, Jens Axboe wrote: >> On 8/15/24 5:44 PM, Ming Lei wrote: >>> On Thu, Aug 15, 2024 at 06:11:13PM +0100, Pavel Begunkov wrote: >>>> On 8/15/24 15:33, Jens Axboe wrote: >>>>> On 8/14/24 7:42 PM, Ming Lei wrote: >>>>>> On Wed, Aug 14, 2024 at 6:46?PM Pavel Begunkov <asml.silence@gmail.com> wrote: >>>>>>> >>>>>>> Add ->uring_cmd callback for block device files and use it to implement >>>>>>> asynchronous discard. Normally, it first tries to execute the command >>>>>>> from non-blocking context, which we limit to a single bio because >>>>>>> otherwise one of sub-bios may need to wait for other bios, and we don't >>>>>>> want to deal with partial IO. If non-blocking attempt fails, we'll retry >>>>>>> it in a blocking context. >>>>>>> >>>>>>> Suggested-by: Conrad Meyer <conradmeyer@meta.com> >>>>>>> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> >>>>>>> --- >>>>>>> block/blk.h | 1 + >>>>>>> block/fops.c | 2 + >>>>>>> block/ioctl.c | 94 +++++++++++++++++++++++++++++++++++++++++ >>>>>>> include/uapi/linux/fs.h | 2 + >>>>>>> 4 files changed, 99 insertions(+) >>>>>>> >>>>>>> diff --git a/block/blk.h b/block/blk.h >>>>>>> index e180863f918b..5178c5ba6852 100644 >>>>>>> --- a/block/blk.h >>>>>>> +++ b/block/blk.h >>>>>>> @@ -571,6 +571,7 @@ blk_mode_t file_to_blk_mode(struct file *file); >>>>>>> int truncate_bdev_range(struct block_device *bdev, blk_mode_t mode, >>>>>>> loff_t lstart, loff_t lend); >>>>>>> long blkdev_ioctl(struct file *file, unsigned cmd, unsigned long arg); >>>>>>> +int blkdev_uring_cmd(struct io_uring_cmd *cmd, unsigned int issue_flags); >>>>>>> long compat_blkdev_ioctl(struct file *file, unsigned cmd, unsigned long arg); >>>>>>> >>>>>>> extern const struct address_space_operations def_blk_aops; >>>>>>> diff --git a/block/fops.c b/block/fops.c >>>>>>> index 9825c1713a49..8154b10b5abf 100644 >>>>>>> --- a/block/fops.c >>>>>>> +++ b/block/fops.c >>>>>>> @@ -17,6 +17,7 @@ >>>>>>> #include <linux/fs.h> >>>>>>> #include <linux/iomap.h> >>>>>>> #include <linux/module.h> >>>>>>> +#include <linux/io_uring/cmd.h> >>>>>>> #include "blk.h" >>>>>>> >>>>>>> static inline struct inode *bdev_file_inode(struct file *file) >>>>>>> @@ -873,6 +874,7 @@ const struct file_operations def_blk_fops = { >>>>>>> .splice_read = filemap_splice_read, >>>>>>> .splice_write = iter_file_splice_write, >>>>>>> .fallocate = blkdev_fallocate, >>>>>>> + .uring_cmd = blkdev_uring_cmd, >>>>>> >>>>>> Just be curious, we have IORING_OP_FALLOCATE already for sending >>>>>> discard to block device, why is .uring_cmd added for this purpose? >>>> >>>> Which is a good question, I haven't thought about it, but I tend to >>>> agree with Jens. Because vfs_fallocate is created synchronous >>>> IORING_OP_FALLOCATE is slow for anything but pretty large requests. >>>> Probably can be patched up, which would involve changing the >>>> fops->fallocate protot, but I'm not sure async there makes sense >>>> outside of bdev (?), and cmd approach is simpler, can be made >>>> somewhat more efficient (1 less layer in the way), and it's not >>>> really something completely new since we have it in ioctl. >>> >>> Yeah, we have ioctl(DISCARD), which acquires filemap_invalidate_lock, >>> same with blkdev_fallocate(). >>> >>> But this patch drops this exclusive lock, so it becomes async friendly, >>> but may cause stale page cache. However, if the lock is required, it can't >>> be efficient anymore and io-wq may be inevitable, :-) >> >> If you want to grab the lock, you can still opportunistically grab it. >> For (by far) the common case, you'll get it, and you can still do it >> inline. > > If the lock is grabbed in the whole cmd lifetime, it is basically one sync > interface cause there is at most one async discard cmd in-flight for each > device. Oh for sure, you could not do that anyway as you'd be holding a lock across the syscall boundary, which isn't allowed. > Meantime the handling has to move to io-wq for avoiding to block current > context, the interface becomes same with IORING_OP_FALLOCATE? I think the current truncate is overkill, we should be able to get by without. And no, I will not entertain an option that's "oh just punt it to io-wq".
On 8/15/24 8:16 PM, Pavel Begunkov wrote: > On 8/16/24 03:08, Ming Lei wrote: >> On Fri, Aug 16, 2024 at 02:59:49AM +0100, Pavel Begunkov wrote: >>> On 8/16/24 02:45, Ming Lei wrote: >>>> On Thu, Aug 15, 2024 at 07:24:16PM -0600, Jens Axboe wrote: >>>>> On 8/15/24 5:44 PM, Ming Lei wrote: >>>>>> On Thu, Aug 15, 2024 at 06:11:13PM +0100, Pavel Begunkov wrote: >>>>>>> On 8/15/24 15:33, Jens Axboe wrote: >>>>>>>> On 8/14/24 7:42 PM, Ming Lei wrote: >>>>>>>>> On Wed, Aug 14, 2024 at 6:46?PM Pavel Begunkov <asml.silence@gmail.com> wrote: >>>>>>>>>> >>>>>>>>>> Add ->uring_cmd callback for block device files and use it to implement >>>>>>>>>> asynchronous discard. Normally, it first tries to execute the command >>>>>>>>>> from non-blocking context, which we limit to a single bio because >>>>>>>>>> otherwise one of sub-bios may need to wait for other bios, and we don't >>>>>>>>>> want to deal with partial IO. If non-blocking attempt fails, we'll retry >>>>>>>>>> it in a blocking context. >>>>>>>>>> >>>>>>>>>> Suggested-by: Conrad Meyer <conradmeyer@meta.com> >>>>>>>>>> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> >>>>>>>>>> --- >>>>>>>>>> block/blk.h | 1 + >>>>>>>>>> block/fops.c | 2 + >>>>>>>>>> block/ioctl.c | 94 +++++++++++++++++++++++++++++++++++++++++ >>>>>>>>>> include/uapi/linux/fs.h | 2 + >>>>>>>>>> 4 files changed, 99 insertions(+) >>>>>>>>>> >>>>>>>>>> diff --git a/block/blk.h b/block/blk.h >>>>>>>>>> index e180863f918b..5178c5ba6852 100644 >>>>>>>>>> --- a/block/blk.h >>>>>>>>>> +++ b/block/blk.h >>>>>>>>>> @@ -571,6 +571,7 @@ blk_mode_t file_to_blk_mode(struct file *file); >>>>>>>>>> int truncate_bdev_range(struct block_device *bdev, blk_mode_t mode, >>>>>>>>>> loff_t lstart, loff_t lend); >>>>>>>>>> long blkdev_ioctl(struct file *file, unsigned cmd, unsigned long arg); >>>>>>>>>> +int blkdev_uring_cmd(struct io_uring_cmd *cmd, unsigned int issue_flags); >>>>>>>>>> long compat_blkdev_ioctl(struct file *file, unsigned cmd, unsigned long arg); >>>>>>>>>> >>>>>>>>>> extern const struct address_space_operations def_blk_aops; >>>>>>>>>> diff --git a/block/fops.c b/block/fops.c >>>>>>>>>> index 9825c1713a49..8154b10b5abf 100644 >>>>>>>>>> --- a/block/fops.c >>>>>>>>>> +++ b/block/fops.c >>>>>>>>>> @@ -17,6 +17,7 @@ >>>>>>>>>> #include <linux/fs.h> >>>>>>>>>> #include <linux/iomap.h> >>>>>>>>>> #include <linux/module.h> >>>>>>>>>> +#include <linux/io_uring/cmd.h> >>>>>>>>>> #include "blk.h" >>>>>>>>>> >>>>>>>>>> static inline struct inode *bdev_file_inode(struct file *file) >>>>>>>>>> @@ -873,6 +874,7 @@ const struct file_operations def_blk_fops = { >>>>>>>>>> .splice_read = filemap_splice_read, >>>>>>>>>> .splice_write = iter_file_splice_write, >>>>>>>>>> .fallocate = blkdev_fallocate, >>>>>>>>>> + .uring_cmd = blkdev_uring_cmd, >>>>>>>>> >>>>>>>>> Just be curious, we have IORING_OP_FALLOCATE already for sending >>>>>>>>> discard to block device, why is .uring_cmd added for this purpose? >>>>>>> >>>>>>> Which is a good question, I haven't thought about it, but I tend to >>>>>>> agree with Jens. Because vfs_fallocate is created synchronous >>>>>>> IORING_OP_FALLOCATE is slow for anything but pretty large requests. >>>>>>> Probably can be patched up, which would involve changing the >>>>>>> fops->fallocate protot, but I'm not sure async there makes sense >>>>>>> outside of bdev (?), and cmd approach is simpler, can be made >>>>>>> somewhat more efficient (1 less layer in the way), and it's not >>>>>>> really something completely new since we have it in ioctl. >>>>>> >>>>>> Yeah, we have ioctl(DISCARD), which acquires filemap_invalidate_lock, >>>>>> same with blkdev_fallocate(). >>>>>> >>>>>> But this patch drops this exclusive lock, so it becomes async friendly, >>>>>> but may cause stale page cache. However, if the lock is required, it can't >>>>>> be efficient anymore and io-wq may be inevitable, :-) >>>>> >>>>> If you want to grab the lock, you can still opportunistically grab it. >>>>> For (by far) the common case, you'll get it, and you can still do it >>>>> inline. >>>> >>>> If the lock is grabbed in the whole cmd lifetime, it is basically one sync >>>> interface cause there is at most one async discard cmd in-flight for each >>>> device. >>>> >>>> Meantime the handling has to move to io-wq for avoiding to block current >>>> context, the interface becomes same with IORING_OP_FALLOCATE? >>> >>> Right, and agree that we can't trylock because we'd need to keep it >>> locked until IO completes, at least the sync versions does that. >>> >>> But I think *invalidate_pages() in the patch should be enough. That's >>> what the write path does, so it shouldn't cause any problem to the >>> kernel. As for user space, that'd be more relaxed than the ioctl, >>> just as writes are, so nothing new to the user. I hope someone with >>> better filemap understanding can confirm it (or not). >> >> I may not be familiar with filemap enough, but looks *invalidate_pages() >> is only for removing pages from the page cache range, and the lock is added >> for preventing new page cache read from being started, so stale data read >> can be avoided when DISCARD is in-progress. > > Sounds like it, but the point is it's the same data race for the > user as if it would've had a write in progress. Right, which is why it should not matter. I think it's pretty silly to take the sync implementation as gospel here, assuming that the original author knew what they were doing in full detail. It just needs proper documenting.
On Mon, Aug 19, 2024 at 02:01:21PM -0600, Jens Axboe wrote: > On 8/15/24 7:45 PM, Ming Lei wrote: > > On Thu, Aug 15, 2024 at 07:24:16PM -0600, Jens Axboe wrote: > >> On 8/15/24 5:44 PM, Ming Lei wrote: > >>> On Thu, Aug 15, 2024 at 06:11:13PM +0100, Pavel Begunkov wrote: > >>>> On 8/15/24 15:33, Jens Axboe wrote: > >>>>> On 8/14/24 7:42 PM, Ming Lei wrote: > >>>>>> On Wed, Aug 14, 2024 at 6:46?PM Pavel Begunkov <asml.silence@gmail.com> wrote: > >>>>>>> > >>>>>>> Add ->uring_cmd callback for block device files and use it to implement > >>>>>>> asynchronous discard. Normally, it first tries to execute the command > >>>>>>> from non-blocking context, which we limit to a single bio because > >>>>>>> otherwise one of sub-bios may need to wait for other bios, and we don't > >>>>>>> want to deal with partial IO. If non-blocking attempt fails, we'll retry > >>>>>>> it in a blocking context. > >>>>>>> > >>>>>>> Suggested-by: Conrad Meyer <conradmeyer@meta.com> > >>>>>>> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> > >>>>>>> --- > >>>>>>> block/blk.h | 1 + > >>>>>>> block/fops.c | 2 + > >>>>>>> block/ioctl.c | 94 +++++++++++++++++++++++++++++++++++++++++ > >>>>>>> include/uapi/linux/fs.h | 2 + > >>>>>>> 4 files changed, 99 insertions(+) > >>>>>>> > >>>>>>> diff --git a/block/blk.h b/block/blk.h > >>>>>>> index e180863f918b..5178c5ba6852 100644 > >>>>>>> --- a/block/blk.h > >>>>>>> +++ b/block/blk.h > >>>>>>> @@ -571,6 +571,7 @@ blk_mode_t file_to_blk_mode(struct file *file); > >>>>>>> int truncate_bdev_range(struct block_device *bdev, blk_mode_t mode, > >>>>>>> loff_t lstart, loff_t lend); > >>>>>>> long blkdev_ioctl(struct file *file, unsigned cmd, unsigned long arg); > >>>>>>> +int blkdev_uring_cmd(struct io_uring_cmd *cmd, unsigned int issue_flags); > >>>>>>> long compat_blkdev_ioctl(struct file *file, unsigned cmd, unsigned long arg); > >>>>>>> > >>>>>>> extern const struct address_space_operations def_blk_aops; > >>>>>>> diff --git a/block/fops.c b/block/fops.c > >>>>>>> index 9825c1713a49..8154b10b5abf 100644 > >>>>>>> --- a/block/fops.c > >>>>>>> +++ b/block/fops.c > >>>>>>> @@ -17,6 +17,7 @@ > >>>>>>> #include <linux/fs.h> > >>>>>>> #include <linux/iomap.h> > >>>>>>> #include <linux/module.h> > >>>>>>> +#include <linux/io_uring/cmd.h> > >>>>>>> #include "blk.h" > >>>>>>> > >>>>>>> static inline struct inode *bdev_file_inode(struct file *file) > >>>>>>> @@ -873,6 +874,7 @@ const struct file_operations def_blk_fops = { > >>>>>>> .splice_read = filemap_splice_read, > >>>>>>> .splice_write = iter_file_splice_write, > >>>>>>> .fallocate = blkdev_fallocate, > >>>>>>> + .uring_cmd = blkdev_uring_cmd, > >>>>>> > >>>>>> Just be curious, we have IORING_OP_FALLOCATE already for sending > >>>>>> discard to block device, why is .uring_cmd added for this purpose? > >>>> > >>>> Which is a good question, I haven't thought about it, but I tend to > >>>> agree with Jens. Because vfs_fallocate is created synchronous > >>>> IORING_OP_FALLOCATE is slow for anything but pretty large requests. > >>>> Probably can be patched up, which would involve changing the > >>>> fops->fallocate protot, but I'm not sure async there makes sense > >>>> outside of bdev (?), and cmd approach is simpler, can be made > >>>> somewhat more efficient (1 less layer in the way), and it's not > >>>> really something completely new since we have it in ioctl. > >>> > >>> Yeah, we have ioctl(DISCARD), which acquires filemap_invalidate_lock, > >>> same with blkdev_fallocate(). > >>> > >>> But this patch drops this exclusive lock, so it becomes async friendly, > >>> but may cause stale page cache. However, if the lock is required, it can't > >>> be efficient anymore and io-wq may be inevitable, :-) > >> > >> If you want to grab the lock, you can still opportunistically grab it. > >> For (by far) the common case, you'll get it, and you can still do it > >> inline. > > > > If the lock is grabbed in the whole cmd lifetime, it is basically one sync > > interface cause there is at most one async discard cmd in-flight for each > > device. > > Oh for sure, you could not do that anyway as you'd be holding a lock > across the syscall boundary, which isn't allowed. Indeed. > > > Meantime the handling has to move to io-wq for avoiding to block current > > context, the interface becomes same with IORING_OP_FALLOCATE? > > I think the current truncate is overkill, we should be able to get by > without. And no, I will not entertain an option that's "oh just punt it > to io-wq". BTW, the truncate is added by 351499a172c0 ("block: Invalidate cache on discard v2"), and block/009 serves as regression test for covering page cache coherency and discard. Here the issue is actually related with the exclusive lock of filemap_invalidate_lock(). IMO, it is reasonable to prevent page read during discard for not polluting page cache. block/009 may fail too without the lock. It is just that concurrent discards can't be allowed any more by down_write() of rw_semaphore, and block device is really capable of doing that. It can be thought as one regression of 7607c44c157d ("block: Hold invalidate_lock in BLKDISCARD ioctl"). Cc Jan Kara and Shin'ichiro Kawasaki. Thanks, Ming
On 8/19/24 8:36 PM, Ming Lei wrote: > On Mon, Aug 19, 2024 at 02:01:21PM -0600, Jens Axboe wrote: >> On 8/15/24 7:45 PM, Ming Lei wrote: >>> On Thu, Aug 15, 2024 at 07:24:16PM -0600, Jens Axboe wrote: >>>> On 8/15/24 5:44 PM, Ming Lei wrote: >>>>> On Thu, Aug 15, 2024 at 06:11:13PM +0100, Pavel Begunkov wrote: >>>>>> On 8/15/24 15:33, Jens Axboe wrote: >>>>>>> On 8/14/24 7:42 PM, Ming Lei wrote: >>>>>>>> On Wed, Aug 14, 2024 at 6:46?PM Pavel Begunkov <asml.silence@gmail.com> wrote: >>>>>>>>> >>>>>>>>> Add ->uring_cmd callback for block device files and use it to implement >>>>>>>>> asynchronous discard. Normally, it first tries to execute the command >>>>>>>>> from non-blocking context, which we limit to a single bio because >>>>>>>>> otherwise one of sub-bios may need to wait for other bios, and we don't >>>>>>>>> want to deal with partial IO. If non-blocking attempt fails, we'll retry >>>>>>>>> it in a blocking context. >>>>>>>>> >>>>>>>>> Suggested-by: Conrad Meyer <conradmeyer@meta.com> >>>>>>>>> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> >>>>>>>>> --- >>>>>>>>> block/blk.h | 1 + >>>>>>>>> block/fops.c | 2 + >>>>>>>>> block/ioctl.c | 94 +++++++++++++++++++++++++++++++++++++++++ >>>>>>>>> include/uapi/linux/fs.h | 2 + >>>>>>>>> 4 files changed, 99 insertions(+) >>>>>>>>> >>>>>>>>> diff --git a/block/blk.h b/block/blk.h >>>>>>>>> index e180863f918b..5178c5ba6852 100644 >>>>>>>>> --- a/block/blk.h >>>>>>>>> +++ b/block/blk.h >>>>>>>>> @@ -571,6 +571,7 @@ blk_mode_t file_to_blk_mode(struct file *file); >>>>>>>>> int truncate_bdev_range(struct block_device *bdev, blk_mode_t mode, >>>>>>>>> loff_t lstart, loff_t lend); >>>>>>>>> long blkdev_ioctl(struct file *file, unsigned cmd, unsigned long arg); >>>>>>>>> +int blkdev_uring_cmd(struct io_uring_cmd *cmd, unsigned int issue_flags); >>>>>>>>> long compat_blkdev_ioctl(struct file *file, unsigned cmd, unsigned long arg); >>>>>>>>> >>>>>>>>> extern const struct address_space_operations def_blk_aops; >>>>>>>>> diff --git a/block/fops.c b/block/fops.c >>>>>>>>> index 9825c1713a49..8154b10b5abf 100644 >>>>>>>>> --- a/block/fops.c >>>>>>>>> +++ b/block/fops.c >>>>>>>>> @@ -17,6 +17,7 @@ >>>>>>>>> #include <linux/fs.h> >>>>>>>>> #include <linux/iomap.h> >>>>>>>>> #include <linux/module.h> >>>>>>>>> +#include <linux/io_uring/cmd.h> >>>>>>>>> #include "blk.h" >>>>>>>>> >>>>>>>>> static inline struct inode *bdev_file_inode(struct file *file) >>>>>>>>> @@ -873,6 +874,7 @@ const struct file_operations def_blk_fops = { >>>>>>>>> .splice_read = filemap_splice_read, >>>>>>>>> .splice_write = iter_file_splice_write, >>>>>>>>> .fallocate = blkdev_fallocate, >>>>>>>>> + .uring_cmd = blkdev_uring_cmd, >>>>>>>> >>>>>>>> Just be curious, we have IORING_OP_FALLOCATE already for sending >>>>>>>> discard to block device, why is .uring_cmd added for this purpose? >>>>>> >>>>>> Which is a good question, I haven't thought about it, but I tend to >>>>>> agree with Jens. Because vfs_fallocate is created synchronous >>>>>> IORING_OP_FALLOCATE is slow for anything but pretty large requests. >>>>>> Probably can be patched up, which would involve changing the >>>>>> fops->fallocate protot, but I'm not sure async there makes sense >>>>>> outside of bdev (?), and cmd approach is simpler, can be made >>>>>> somewhat more efficient (1 less layer in the way), and it's not >>>>>> really something completely new since we have it in ioctl. >>>>> >>>>> Yeah, we have ioctl(DISCARD), which acquires filemap_invalidate_lock, >>>>> same with blkdev_fallocate(). >>>>> >>>>> But this patch drops this exclusive lock, so it becomes async friendly, >>>>> but may cause stale page cache. However, if the lock is required, it can't >>>>> be efficient anymore and io-wq may be inevitable, :-) >>>> >>>> If you want to grab the lock, you can still opportunistically grab it. >>>> For (by far) the common case, you'll get it, and you can still do it >>>> inline. >>> >>> If the lock is grabbed in the whole cmd lifetime, it is basically one sync >>> interface cause there is at most one async discard cmd in-flight for each >>> device. >> >> Oh for sure, you could not do that anyway as you'd be holding a lock >> across the syscall boundary, which isn't allowed. > > Indeed. > >> >>> Meantime the handling has to move to io-wq for avoiding to block current >>> context, the interface becomes same with IORING_OP_FALLOCATE? >> >> I think the current truncate is overkill, we should be able to get by >> without. And no, I will not entertain an option that's "oh just punt it >> to io-wq". > > BTW, the truncate is added by 351499a172c0 ("block: Invalidate cache on discard v2"), > and block/009 serves as regression test for covering page cache > coherency and discard. > > Here the issue is actually related with the exclusive lock of > filemap_invalidate_lock(). IMO, it is reasonable to prevent page read during > discard for not polluting page cache. block/009 may fail too without the lock. > > It is just that concurrent discards can't be allowed any more by > down_write() of rw_semaphore, and block device is really capable of doing > that. It can be thought as one regression of 7607c44c157d ("block: Hold invalidate_lock in > BLKDISCARD ioctl"). > > Cc Jan Kara and Shin'ichiro Kawasaki. Honestly I just think that's nonsense. It's like mixing direct and buffered writes. Can you get corruption? Yes you most certainly can. There should be no reason why we can't run discards without providing page cache coherency. The sync interface attempts to do that, but that doesn't mean that an async (or a different sync one, if that made sense) should. If you do discards to the same range as you're doing buffered IO, you get to keep both potentially pieces. Fact is that most folks are doing dio for performant IO exactly because buffered writes tend to be horrible, and you could certainly use that with async discards and have the application manage it just fine. So I really think any attempts to provide page cache synchronization for this is futile. And the existing sync one looks pretty abysmal, but it doesn't really matter as it's a sync interfce. If one were to do something about it for an async interface, then just pretend it's dio and increment i_dio_count.
On 8/20/24 17:30, Jens Axboe wrote: > On 8/19/24 8:36 PM, Ming Lei wrote: >> On Mon, Aug 19, 2024 at 02:01:21PM -0600, Jens Axboe wrote: >>> On 8/15/24 7:45 PM, Ming Lei wrote: ... >>>> Meantime the handling has to move to io-wq for avoiding to block current >>>> context, the interface becomes same with IORING_OP_FALLOCATE? >>> >>> I think the current truncate is overkill, we should be able to get by >>> without. And no, I will not entertain an option that's "oh just punt it >>> to io-wq". >> >> BTW, the truncate is added by 351499a172c0 ("block: Invalidate cache on discard v2"), >> and block/009 serves as regression test for covering page cache >> coherency and discard. >> >> Here the issue is actually related with the exclusive lock of >> filemap_invalidate_lock(). IMO, it is reasonable to prevent page read during >> discard for not polluting page cache. block/009 may fail too without the lock. >> >> It is just that concurrent discards can't be allowed any more by >> down_write() of rw_semaphore, and block device is really capable of doing >> that. It can be thought as one regression of 7607c44c157d ("block: Hold invalidate_lock in >> BLKDISCARD ioctl"). >> >> Cc Jan Kara and Shin'ichiro Kawasaki. > > Honestly I just think that's nonsense. It's like mixing direct and > buffered writes. Can you get corruption? Yes you most certainly can. > There should be no reason why we can't run discards without providing > page cache coherency. The sync interface attempts to do that, but that > doesn't mean that an async (or a different sync one, if that made sense) > should. I don't see it as a problem either, it's a new interface, just need to be upfront on what guarantees it provides (one more reason why not fallocate), I'll elaborate on it in the commit message and so. I think a reasonable thing to do is to have one rule for all write-like operations starting from plain writes, which is currently allowing races to happen and shift it to the user. Purely in theory we can get inventive with likes of range lock trees, but that's unwarranted for all sorts of reasons. > If you do discards to the same range as you're doing buffered IO, you > get to keep both potentially pieces. Fact is that most folks are doing > dio for performant IO exactly because buffered writes tend to be > horrible, and you could certainly use that with async discards and have > the application manage it just fine. > > So I really think any attempts to provide page cache synchronization for > this is futile. And the existing sync one looks pretty abysmal, but it > doesn't really matter as it's a sync interfce. If one were to do It should be a pain for sync as well, you can't even spin another process and parallelise this way.
On Tue, Aug 20, 2024 at 06:19:00PM +0100, Pavel Begunkov wrote: > On 8/20/24 17:30, Jens Axboe wrote: > > On 8/19/24 8:36 PM, Ming Lei wrote: > > > On Mon, Aug 19, 2024 at 02:01:21PM -0600, Jens Axboe wrote: > > > > On 8/15/24 7:45 PM, Ming Lei wrote: > ... > > > > > Meantime the handling has to move to io-wq for avoiding to block current > > > > > context, the interface becomes same with IORING_OP_FALLOCATE? > > > > > > > > I think the current truncate is overkill, we should be able to get by > > > > without. And no, I will not entertain an option that's "oh just punt it > > > > to io-wq". > > > > > > BTW, the truncate is added by 351499a172c0 ("block: Invalidate cache on discard v2"), > > > and block/009 serves as regression test for covering page cache > > > coherency and discard. > > > > > > Here the issue is actually related with the exclusive lock of > > > filemap_invalidate_lock(). IMO, it is reasonable to prevent page read during > > > discard for not polluting page cache. block/009 may fail too without the lock. > > > > > > It is just that concurrent discards can't be allowed any more by > > > down_write() of rw_semaphore, and block device is really capable of doing > > > that. It can be thought as one regression of 7607c44c157d ("block: Hold invalidate_lock in > > > BLKDISCARD ioctl"). > > > > > > Cc Jan Kara and Shin'ichiro Kawasaki. > > > > Honestly I just think that's nonsense. It's like mixing direct and > > buffered writes. Can you get corruption? Yes you most certainly can. > > There should be no reason why we can't run discards without providing > > page cache coherency. The sync interface attempts to do that, but that > > doesn't mean that an async (or a different sync one, if that made sense) > > should. > > I don't see it as a problem either, it's a new interface, just need > to be upfront on what guarantees it provides (one more reason why > not fallocate), I'll elaborate on it in the commit message and so. Fair enough. > > I think a reasonable thing to do is to have one rule for all write-like > operations starting from plain writes, which is currently allowing races > to happen and shift it to the user. Purely in theory we can get inventive > with likes of range lock trees, but that's unwarranted for all sorts of > reasons. > > > If you do discards to the same range as you're doing buffered IO, you > > get to keep both potentially pieces. Fact is that most folks are doing > > dio for performant IO exactly because buffered writes tend to be > > horrible, and you could certainly use that with async discards and have > > the application manage it just fine. > > > > So I really think any attempts to provide page cache synchronization for > > this is futile. And the existing sync one looks pretty abysmal, but it > > doesn't really matter as it's a sync interfce. If one were to do > > It should be a pain for sync as well, you can't even spin another process > and parallelise this way. Yes, this way has degraded some sync discard workloads perf a lot. Thanks, Ming
diff --git a/block/blk.h b/block/blk.h index e180863f918b..5178c5ba6852 100644 --- a/block/blk.h +++ b/block/blk.h @@ -571,6 +571,7 @@ blk_mode_t file_to_blk_mode(struct file *file); int truncate_bdev_range(struct block_device *bdev, blk_mode_t mode, loff_t lstart, loff_t lend); long blkdev_ioctl(struct file *file, unsigned cmd, unsigned long arg); +int blkdev_uring_cmd(struct io_uring_cmd *cmd, unsigned int issue_flags); long compat_blkdev_ioctl(struct file *file, unsigned cmd, unsigned long arg); extern const struct address_space_operations def_blk_aops; diff --git a/block/fops.c b/block/fops.c index 9825c1713a49..8154b10b5abf 100644 --- a/block/fops.c +++ b/block/fops.c @@ -17,6 +17,7 @@ #include <linux/fs.h> #include <linux/iomap.h> #include <linux/module.h> +#include <linux/io_uring/cmd.h> #include "blk.h" static inline struct inode *bdev_file_inode(struct file *file) @@ -873,6 +874,7 @@ const struct file_operations def_blk_fops = { .splice_read = filemap_splice_read, .splice_write = iter_file_splice_write, .fallocate = blkdev_fallocate, + .uring_cmd = blkdev_uring_cmd, .fop_flags = FOP_BUFFER_RASYNC, }; diff --git a/block/ioctl.c b/block/ioctl.c index c7a3e6c6f5fa..f7f9c4c6d6b5 100644 --- a/block/ioctl.c +++ b/block/ioctl.c @@ -11,6 +11,8 @@ #include <linux/blktrace_api.h> #include <linux/pr.h> #include <linux/uaccess.h> +#include <linux/pagemap.h> +#include <linux/io_uring/cmd.h> #include "blk.h" static int blkpg_do_ioctl(struct block_device *bdev, @@ -744,4 +746,96 @@ long compat_blkdev_ioctl(struct file *file, unsigned cmd, unsigned long arg) return ret; } + +struct blk_cmd { + blk_status_t status; + bool nowait; +}; + +static void blk_cmd_complete(struct io_uring_cmd *cmd, unsigned int issue_flags) +{ + struct blk_cmd *bc = io_uring_cmd_to_pdu(cmd, struct blk_cmd); + int res = blk_status_to_errno(bc->status); + + if (res == -EAGAIN && bc->nowait) + io_uring_cmd_issue_blocking(cmd); + else + io_uring_cmd_done(cmd, res, 0, issue_flags); +} + +static void bio_cmd_end(struct bio *bio) +{ + struct io_uring_cmd *cmd = bio->bi_private; + struct blk_cmd *bc = io_uring_cmd_to_pdu(cmd, struct blk_cmd); + + if (unlikely(bio->bi_status) && !bc->status) + bc->status = bio->bi_status; + + io_uring_cmd_do_in_task_lazy(cmd, blk_cmd_complete); + bio_put(bio); +} + +static int blkdev_cmd_discard(struct io_uring_cmd *cmd, + struct block_device *bdev, + uint64_t start, uint64_t len, bool nowait) +{ + sector_t sector = start >> SECTOR_SHIFT; + sector_t nr_sects = len >> SECTOR_SHIFT; + struct bio *prev = NULL, *bio; + int err; + + err = blk_validate_discard(bdev, file_to_blk_mode(cmd->file), + start, len); + if (err) + return err; + err = filemap_invalidate_pages(bdev->bd_mapping, start, + start + len - 1, nowait); + if (err) + return err; + + while ((bio = blk_alloc_discard_bio(bdev, §or, &nr_sects, + GFP_KERNEL))) { + if (nowait) { + if (unlikely(nr_sects)) { + bio_put(bio); + return -EAGAIN; + } + bio->bi_opf |= REQ_NOWAIT; + } + prev = bio_chain_and_submit(prev, bio); + } + if (!prev) + return -EFAULT; + + prev->bi_private = cmd; + prev->bi_end_io = bio_cmd_end; + submit_bio(prev); + return -EIOCBQUEUED; +} + +int blkdev_uring_cmd(struct io_uring_cmd *cmd, unsigned int issue_flags) +{ + struct block_device *bdev = I_BDEV(cmd->file->f_mapping->host); + struct blk_cmd *bc = io_uring_cmd_to_pdu(cmd, struct blk_cmd); + const struct io_uring_sqe *sqe = cmd->sqe; + u32 cmd_op = cmd->cmd_op; + uint64_t start, len; + + if (unlikely(sqe->ioprio || sqe->__pad1 || sqe->len || + sqe->rw_flags || sqe->file_index)) + return -EINVAL; + + bc->status = BLK_STS_OK; + bc->nowait = issue_flags & IO_URING_F_NONBLOCK; + + start = READ_ONCE(sqe->addr); + len = READ_ONCE(sqe->addr3); + + switch (cmd_op) { + case BLOCK_URING_CMD_DISCARD: + return blkdev_cmd_discard(cmd, bdev, start, len, bc->nowait); + } + return -EINVAL; +} + #endif diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h index 753971770733..0016e38ed33c 100644 --- a/include/uapi/linux/fs.h +++ b/include/uapi/linux/fs.h @@ -208,6 +208,8 @@ struct fsxattr { * (see uapi/linux/blkzoned.h) */ +#define BLOCK_URING_CMD_DISCARD 0 + #define BMAP_IOCTL 1 /* obsolete - kept for compatibility */ #define FIBMAP _IO(0x00,1) /* bmap access */ #define FIGETBSZ _IO(0x00,2) /* get the block size used for bmap */
Add ->uring_cmd callback for block device files and use it to implement asynchronous discard. Normally, it first tries to execute the command from non-blocking context, which we limit to a single bio because otherwise one of sub-bios may need to wait for other bios, and we don't want to deal with partial IO. If non-blocking attempt fails, we'll retry it in a blocking context. Suggested-by: Conrad Meyer <conradmeyer@meta.com> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> --- block/blk.h | 1 + block/fops.c | 2 + block/ioctl.c | 94 +++++++++++++++++++++++++++++++++++++++++ include/uapi/linux/fs.h | 2 + 4 files changed, 99 insertions(+)