| Message ID | e39a9aabe503bbd7f2b7454327d3e6a6620deccf.1724297388.git.asml.silence@gmail.com (mailing list archive) |
|---|---|
| State | New |
| Series | implement async block discards/etc. via io_uring |
On Thu, Aug 22, 2024 at 04:35:55AM +0100, Pavel Begunkov wrote:
> io_uring allows to implement custom file specific operations via
> fops->uring_cmd callback. Use it to wire up asynchronous discard
> commands. Normally, first it tries to do a non-blocking issue, and if
> that fails we'd retry from a blocking context by returning -EAGAIN to
> core io_uring.
>
> Note, unlike ioctl(BLKDISCARD) with stronger guarantees against races,
> we only do a best effort attempt to invalidate page cache, and it can
> race with any writes and reads and leave page cache stale. It's the
> same kind of races we allow to direct writes.

Can you please write up a man page for this that clearly documents the
expected semantics?

> +static void bio_cmd_end(struct bio *bio)

This is a really weird function name. blk_cmd_end_io or
blk_cmd_bio_end_io would be the usual choices.

> +	while ((bio = blk_alloc_discard_bio(bdev, &sector, &nr_sects,
> +			GFP_KERNEL))) {

GFP_KERNEL can and often will block. You'll probably want a GFP_NOWAIT
allocation here for the nowait case.

> +		if (nowait) {
> +			/*
> +			 * Don't allow multi-bio non-blocking submissions as
> +			 * subsequent bios may fail but we won't get direct
> +			 * feedback about that. Normally, the caller should
> +			 * retry from a blocking context.
> +			 */
> +			if (unlikely(nr_sects)) {
> +				bio_put(bio);
> +				return -EAGAIN;
> +			}
> +			bio->bi_opf |= REQ_NOWAIT;
> +		}

And this really looks weird. It first allocates a bio, potentially
blocking, and then gives up? I think you're much better off with
something like:

	if (nowait) {
		if (nr_sects > bio_discard_limit(bdev, sector))
			return -EAGAIN;
		bio = blk_alloc_discard_bio(bdev, &sector, &nr_sects,
				GFP_NOWAIT);
		if (!bio)
			return -EAGAIN;
		bio->bi_opf |= REQ_NOWAIT;
		goto submit;
	}

	/* submission loop here */

> diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
> index 753971770733..0016e38ed33c 100644
> --- a/include/uapi/linux/fs.h
> +++ b/include/uapi/linux/fs.h
> @@ -208,6 +208,8 @@ struct fsxattr {
>   *	(see uapi/linux/blkzoned.h)
>   */
>
> +#define BLOCK_URING_CMD_DISCARD	0

Is fs.h the right place for this?

Curious: how do we deal with conflicting uring cmds on different
devices, and how do we probe for them? The NVMe uring_cmds use the
ioctl-style _IO* encoding, which at least helps a bit with that and
seems like a good idea. Maybe someone needs to write up a few loose
rules on uring commands?
On 8/22/24 07:46, Christoph Hellwig wrote:
> On Thu, Aug 22, 2024 at 04:35:55AM +0100, Pavel Begunkov wrote:
>> io_uring allows to implement custom file specific operations via
>> fops->uring_cmd callback. Use it to wire up asynchronous discard
>> commands. Normally, first it tries to do a non-blocking issue, and if
>> that fails we'd retry from a blocking context by returning -EAGAIN to
>> core io_uring.
>>
>> Note, unlike ioctl(BLKDISCARD) with stronger guarantees against races,
>> we only do a best effort attempt to invalidate page cache, and it can
>> race with any writes and reads and leave page cache stale. It's the
>> same kind of races we allow to direct writes.
>
> Can you please write up a man page for this that clearly documents the
> expected semantics?

Do we have it documented anywhere how O_DIRECT writes interact
with page cache, so I can refer to it?

>> +static void bio_cmd_end(struct bio *bio)
>
> This is a really weird function name. blk_cmd_end_io or
> blk_cmd_bio_end_io would be the usual choices.

Will change with other cosmetics.

>> +	while ((bio = blk_alloc_discard_bio(bdev, &sector, &nr_sects,
>> +			GFP_KERNEL))) {
>
> GFP_KERNEL can and often will block. You'll probably want a GFP_NOWAIT
> allocation here for the nowait case.

I can change it for clarity, but I don't think it's much of a concern,
since the read/write path and, I'm pretty sure, a bunch of other
places never cared about it. It does the main thing, propagating it
down e.g. for tag allocation.

>> +		if (nowait) {
>> +			/*
>> +			 * Don't allow multi-bio non-blocking submissions as
>> +			 * subsequent bios may fail but we won't get direct
>> +			 * feedback about that. Normally, the caller should
>> +			 * retry from a blocking context.
>> +			 */
>> +			if (unlikely(nr_sects)) {
>> +				bio_put(bio);
>> +				return -EAGAIN;
>> +			}
>> +			bio->bi_opf |= REQ_NOWAIT;
>> +		}
>
> And this really looks weird. It first allocates a bio, potentially

That's what the write path does.

> blocking, and then gives up? I think you're much better off with
> something like:

I'd rather avoid calling bio_discard_limit() an extra time, it does
too much stuff inside, when the expected case is a single bio, and
for multi-bio that overhead would really matter. Maybe I should
uniline blk_alloc_discard_bio() and dedup it with write zeroes, I'll
see if it can be done after other write zeroes changes.

>
> 	if (nowait) {
> 		if (nr_sects > bio_discard_limit(bdev, sector))
> 			return -EAGAIN;
> 		bio = blk_alloc_discard_bio(bdev, &sector, &nr_sects,
> 				GFP_NOWAIT);
> 		if (!bio)
> 			return -EAGAIN;
> 		bio->bi_opf |= REQ_NOWAIT;
> 		goto submit;
> 	}
>
> 	/* submission loop here */
>
>> diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
>> index 753971770733..0016e38ed33c 100644
>> --- a/include/uapi/linux/fs.h
>> +++ b/include/uapi/linux/fs.h
>> @@ -208,6 +208,8 @@ struct fsxattr {
>>   *	(see uapi/linux/blkzoned.h)
>>   */
>>
>> +#define BLOCK_URING_CMD_DISCARD	0
>
> Is fs.h the right place for this?

Arguable, but I can move it to io_uring, makes things simpler for me.

> Curious: how do we deal with conflicting uring cmds on different
> devices, and how do we probe for them? The NVMe uring_cmds use the
> ioctl-style _IO* encoding, which at least helps a bit with that and
> seems like a good idea. Maybe someone needs to write up a few loose
> rules on uring commands?

My concern is that we're sacrificing compiler optimisations
(well, jump tables are disabled IIRC) for something that doesn't even
guarantee uniqueness. I'd like to see some degree of reflection,
like user querying a file class in terms of what operations it
supports, but that's beyond the scope of the series.
On Thu, Aug 22, 2024 at 02:07:16PM +0100, Pavel Begunkov wrote:
>>> Note, unlike ioctl(BLKDISCARD) with stronger guarantees against races,
>>> we only do a best effort attempt to invalidate page cache, and it can
>>> race with any writes and reads and leave page cache stale. It's the
>>> same kind of races we allow to direct writes.
>>
>> Can you please write up a man page for this that clearly documents the
>> expected semantics?
>
> Do we have it documented anywhere how O_DIRECT writes interact
> with page cache, so I can refer to it?

I can't find a good writeup. Adding Dave as he tends to do long
emails on topics like this, so he might have one hiding somewhere.

>> GFP_KERNEL can and often will block. You'll probably want a GFP_NOWAIT
>> allocation here for the nowait case.
>
> I can change it for clarity, but I don't think it's much of a concern,
> since the read/write path and, I'm pretty sure, a bunch of other
> places never cared about it. It does the main thing, propagating it
> down e.g. for tag allocation.

True, we're only doing the nowait allocation for larger data
structures. Which is a bit odd indeed.

> I'd rather avoid calling bio_discard_limit() an extra time, it does
> too much stuff inside, when the expected case is a single bio, and
> for multi-bio that overhead would really matter.

Compared to a memory allocation it's not really doing all that much.
In the long run we really should move splitting discard bios down
the stack like we do for normal I/O anyway.

> Maybe I should uniline blk_alloc_discard_bio() and dedup it with

uniline? I read that as uninline, but as it's not inline I don't
understand what you mean either.

>>> +#define BLOCK_URING_CMD_DISCARD	0
>>
>> Is fs.h the right place for this?
>
> Arguable, but I can move it to io_uring, makes things simpler
> for me.

I would have expected a uapi/linux/blkdev.h for it (and I'm kinda
surprised we don't have that yet).

>> Curious: how do we deal with conflicting uring cmds on different
>> devices, and how do we probe for them? The NVMe uring_cmds use the
>> ioctl-style _IO* encoding, which at least helps a bit with that and
>> seems like a good idea. Maybe someone needs to write up a few loose
>> rules on uring commands?
>
> My concern is that we're sacrificing compiler optimisations
> (well, jump tables are disabled IIRC) for something that doesn't even
> guarantee uniqueness. I'd like to see some degree of reflection,
> like user querying a file class in terms of what operations it
> supports, but that's beyond the scope of the series.

We can't guarantee uniqueness, but between the class, the direction,
and the argument size we get a pretty good one. There is a reason
pretty much all ioctls added in the last 25 years are using this
scheme.
On 8/23/24 12:59, Christoph Hellwig wrote:
> On Thu, Aug 22, 2024 at 02:07:16PM +0100, Pavel Begunkov wrote:
>>>> Note, unlike ioctl(BLKDISCARD) with stronger guarantees against races,
>>>> we only do a best effort attempt to invalidate page cache, and it can
>>>> race with any writes and reads and leave page cache stale. It's the
>>>> same kind of races we allow to direct writes.
>>>
>>> Can you please write up a man page for this that clearly documents the
>>> expected semantics?
>>
>> Do we have it documented anywhere how O_DIRECT writes interact
>> with page cache, so I can refer to it?
>
> I can't find a good writeup. Adding Dave as he tends to do long
> emails on topics like this, so he might have one hiding somewhere.
>
>>> GFP_KERNEL can and often will block. You'll probably want a GFP_NOWAIT
>>> allocation here for the nowait case.
>>
>> I can change it for clarity, but I don't think it's much of a concern,
>> since the read/write path and, I'm pretty sure, a bunch of other
>> places never cared about it. It does the main thing, propagating it
>> down e.g. for tag allocation.
>
> True, we're only doing the nowait allocation for larger data
> structures. Which is a bit odd indeed.

That's widespread; last time I looked into it, no amount of patching
saved io_uring tasks from being killed by the oom reaper under memory
pressure.

>> I'd rather avoid calling bio_discard_limit() an extra time, it does
>> too much stuff inside, when the expected case is a single bio, and
>> for multi-bio that overhead would really matter.
>
> Compared to a memory allocation it's not really doing all that much.
> In the long run we really should move splitting discard bios down
> the stack like we do for normal I/O anyway.
>
>> Maybe I should uniline blk_alloc_discard_bio() and dedup it with
>
> uniline? I read that as uninline, but as it's not inline I don't
> understand what you mean either.

"Hand code", if you wish, but you can just ignore it.

>>>> +#define BLOCK_URING_CMD_DISCARD	0
>>>
>>> Is fs.h the right place for this?
>>
>> Arguable, but I can move it to io_uring, makes things simpler
>> for me.
>
> I would have expected a uapi/linux/blkdev.h for it (and I'm kinda
> surprised we don't have that yet).

I think that would be overkill, we don't need it for just these
commands, and it only adds pain with probing the header with
autotools or so. If there is a future vision for it, I'd say we can
drop a patch on top.

>>> Curious: how do we deal with conflicting uring cmds on different
>>> devices, and how do we probe for them? The NVMe uring_cmds use the
>>> ioctl-style _IO* encoding, which at least helps a bit with that and
>>> seems like a good idea. Maybe someone needs to write up a few loose
>>> rules on uring commands?
>>
>> My concern is that we're sacrificing compiler optimisations
>> (well, jump tables are disabled IIRC) for something that doesn't even
>> guarantee uniqueness. I'd like to see some degree of reflection,
>> like user querying a file class in terms of what operations it
>> supports, but that's beyond the scope of the series.
>
> We can't guarantee uniqueness, but between the class, the direction,
> and the argument size we get a pretty good one. There is a reason
> pretty much all ioctls added in the last 25 years are using this
> scheme.

Which is likely because some people insisted on it, and not because
the scheme is so great that everyone became acolytes. Not to mention
only 256 possible "types" and the endless mess of sharing them and
trying to find a range to use. I'll convert to have less headache,
but either way we're just propagating the problem into the future.
diff --git a/block/blk.h b/block/blk.h
index e180863f918b..5178c5ba6852 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -571,6 +571,7 @@ blk_mode_t file_to_blk_mode(struct file *file);
 int truncate_bdev_range(struct block_device *bdev, blk_mode_t mode,
 		loff_t lstart, loff_t lend);
 long blkdev_ioctl(struct file *file, unsigned cmd, unsigned long arg);
+int blkdev_uring_cmd(struct io_uring_cmd *cmd, unsigned int issue_flags);
 long compat_blkdev_ioctl(struct file *file, unsigned cmd, unsigned long arg);
 
 extern const struct address_space_operations def_blk_aops;
diff --git a/block/fops.c b/block/fops.c
index 9825c1713a49..8154b10b5abf 100644
--- a/block/fops.c
+++ b/block/fops.c
@@ -17,6 +17,7 @@
 #include <linux/fs.h>
 #include <linux/iomap.h>
 #include <linux/module.h>
+#include <linux/io_uring/cmd.h>
 #include "blk.h"
 
 static inline struct inode *bdev_file_inode(struct file *file)
@@ -873,6 +874,7 @@ const struct file_operations def_blk_fops = {
 	.splice_read	= filemap_splice_read,
 	.splice_write	= iter_file_splice_write,
 	.fallocate	= blkdev_fallocate,
+	.uring_cmd	= blkdev_uring_cmd,
 	.fop_flags	= FOP_BUFFER_RASYNC,
 };
diff --git a/block/ioctl.c b/block/ioctl.c
index 8df0bc8002f5..a9aaa7cb7f73 100644
--- a/block/ioctl.c
+++ b/block/ioctl.c
@@ -11,6 +11,8 @@
 #include <linux/blktrace_api.h>
 #include <linux/pr.h>
 #include <linux/uaccess.h>
+#include <linux/pagemap.h>
+#include <linux/io_uring/cmd.h>
 #include "blk.h"
 
 static int blkpg_do_ioctl(struct block_device *bdev,
@@ -745,3 +747,102 @@ long compat_blkdev_ioctl(struct file *file, unsigned cmd, unsigned long arg)
 	return ret;
 }
 #endif
+
+struct blk_cmd {
+	blk_status_t status;
+	bool nowait;
+};
+
+static void blk_cmd_complete(struct io_uring_cmd *cmd, unsigned int issue_flags)
+{
+	struct blk_cmd *bc = io_uring_cmd_to_pdu(cmd, struct blk_cmd);
+	int res = blk_status_to_errno(bc->status);
+
+	if (res == -EAGAIN && bc->nowait)
+		io_uring_cmd_issue_blocking(cmd);
+	else
+		io_uring_cmd_done(cmd, res, 0, issue_flags);
+}
+
+static void bio_cmd_end(struct bio *bio)
+{
+	struct io_uring_cmd *cmd = bio->bi_private;
+	struct blk_cmd *bc = io_uring_cmd_to_pdu(cmd, struct blk_cmd);
+
+	if (unlikely(bio->bi_status) && !bc->status)
+		bc->status = bio->bi_status;
+
+	io_uring_cmd_do_in_task_lazy(cmd, blk_cmd_complete);
+	bio_put(bio);
+}
+
+static int blkdev_cmd_discard(struct io_uring_cmd *cmd,
+			      struct block_device *bdev,
+			      uint64_t start, uint64_t len, bool nowait)
+{
+	sector_t sector = start >> SECTOR_SHIFT;
+	sector_t nr_sects = len >> SECTOR_SHIFT;
+	struct bio *prev = NULL, *bio;
+	int err;
+
+	if (!bdev_max_discard_sectors(bdev))
+		return -EOPNOTSUPP;
+
+	err = blk_validate_write(bdev, file_to_blk_mode(cmd->file), start, len);
+	if (err)
+		return err;
+	err = filemap_invalidate_pages(bdev->bd_mapping, start,
+				       start + len - 1, nowait);
+	if (err)
+		return err;
+
+	while ((bio = blk_alloc_discard_bio(bdev, &sector, &nr_sects,
+					    GFP_KERNEL))) {
+		if (nowait) {
+			/*
+			 * Don't allow multi-bio non-blocking submissions as
+			 * subsequent bios may fail but we won't get direct
+			 * feedback about that. Normally, the caller should
+			 * retry from a blocking context.
+			 */
+			if (unlikely(nr_sects)) {
+				bio_put(bio);
+				return -EAGAIN;
+			}
+			bio->bi_opf |= REQ_NOWAIT;
+		}
+		prev = bio_chain_and_submit(prev, bio);
+	}
+	if (!prev)
+		return -EFAULT;
+
+	prev->bi_private = cmd;
+	prev->bi_end_io = bio_cmd_end;
+	submit_bio(prev);
+	return -EIOCBQUEUED;
+}
+
+int blkdev_uring_cmd(struct io_uring_cmd *cmd, unsigned int issue_flags)
+{
+	struct block_device *bdev = I_BDEV(cmd->file->f_mapping->host);
+	struct blk_cmd *bc = io_uring_cmd_to_pdu(cmd, struct blk_cmd);
+	const struct io_uring_sqe *sqe = cmd->sqe;
+	u32 cmd_op = cmd->cmd_op;
+	uint64_t start, len;
+
+	if (unlikely(sqe->ioprio || sqe->__pad1 || sqe->len ||
+		     sqe->rw_flags || sqe->file_index))
+		return -EINVAL;
+
+	bc->status = BLK_STS_OK;
+	bc->nowait = issue_flags & IO_URING_F_NONBLOCK;
+
+	start = READ_ONCE(sqe->addr);
+	len = READ_ONCE(sqe->addr3);
+
+	switch (cmd_op) {
+	case BLOCK_URING_CMD_DISCARD:
+		return blkdev_cmd_discard(cmd, bdev, start, len, bc->nowait);
+	}
+	return -EINVAL;
+}
diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
index 753971770733..0016e38ed33c 100644
--- a/include/uapi/linux/fs.h
+++ b/include/uapi/linux/fs.h
@@ -208,6 +208,8 @@ struct fsxattr {
  *	(see uapi/linux/blkzoned.h)
  */
 
+#define BLOCK_URING_CMD_DISCARD	0
+
 #define BMAP_IOCTL 1		/* obsolete - kept for compatibility */
 #define FIBMAP	   _IO(0x00,1)	/* bmap access */
 #define FIGETBSZ   _IO(0x00,2)	/* get the block size used for bmap */
io_uring allows to implement custom file specific operations via
fops->uring_cmd callback. Use it to wire up asynchronous discard
commands. Normally, first it tries to do a non-blocking issue, and if
that fails we'd retry from a blocking context by returning -EAGAIN to
core io_uring.

Note, unlike ioctl(BLKDISCARD) with stronger guarantees against races,
we only do a best effort attempt to invalidate page cache, and it can
race with any writes and reads and leave page cache stale. It's the
same kind of races we allow to direct writes.

Suggested-by: Conrad Meyer <conradmeyer@meta.com>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
---
 block/blk.h             |   1 +
 block/fops.c            |   2 +
 block/ioctl.c           | 101 ++++++++++++++++++++++++++++++++++++++++
 include/uapi/linux/fs.h |   2 +
 4 files changed, 106 insertions(+)