Message ID | 43cb1fb7-b30b-8df1-bba6-e50797d680c6@kernel.dk (mailing list archive)
---|---
State | New
Series | io_uring/rw: transform single vector readv/writev into ubuf
On Fri, Mar 24, 2023 at 08:35:38AM -0600, Jens Axboe wrote:
> It's very common to have applications that use vectored reads or writes,
> even if they only pass in a single segment. Obviously they should be
> using read/write at that point, but...

Yeah, it is like fixing application issue in kernel side, :-)

> Vectored IO comes with the downside of needing to retain iovec state,
> and hence they require an allocation and state copy if they end up
> getting deferred. Additionally, they also require extra cleanup when
> completed, as the allocated state memory has to be freed.
>
> Automatically transform single segment IORING_OP_{READV,WRITEV} into
> IORING_OP_{READ,WRITE}, and hence into an ITER_UBUF. Outside of being
> more efficient if needing deferral, ITER_UBUF is also more efficient
> for normal processing compared to ITER_IOVEC, as it doesn't require
> iteration. The latter is apparent when running peak testing, where
> using IORING_OP_READV to randomly read 24 drives previously scored:
>
> IOPS=72.54M, BW=35.42GiB/s, IOS/call=32/31
> IOPS=73.35M, BW=35.81GiB/s, IOS/call=32/31
> IOPS=72.71M, BW=35.50GiB/s, IOS/call=32/31
> IOPS=73.29M, BW=35.78GiB/s, IOS/call=32/32
> IOPS=73.45M, BW=35.86GiB/s, IOS/call=32/32
> IOPS=73.19M, BW=35.74GiB/s, IOS/call=31/32
> IOPS=72.89M, BW=35.59GiB/s, IOS/call=32/31
> IOPS=73.07M, BW=35.68GiB/s, IOS/call=32/32
>
> and after the change we get:
>
> IOPS=77.31M, BW=37.75GiB/s, IOS/call=32/31
> IOPS=77.32M, BW=37.75GiB/s, IOS/call=32/32
> IOPS=77.45M, BW=37.81GiB/s, IOS/call=31/31
> IOPS=77.47M, BW=37.83GiB/s, IOS/call=32/32
> IOPS=77.14M, BW=37.67GiB/s, IOS/call=32/32
> IOPS=77.14M, BW=37.66GiB/s, IOS/call=31/31
> IOPS=77.37M, BW=37.78GiB/s, IOS/call=32/32
> IOPS=77.25M, BW=37.72GiB/s, IOS/call=32/32
>
> which is a nice win as well.
>
> Signed-off-by: Jens Axboe <axboe@kernel.dk>
>
> ---
>
> diff --git a/io_uring/rw.c b/io_uring/rw.c
> index 4c233910e200..5c998754cb17 100644
> --- a/io_uring/rw.c
> +++ b/io_uring/rw.c
> @@ -402,7 +402,22 @@ static struct iovec *__io_import_iovec(int ddir, struct io_kiocb *req,
>  			      req->ctx->compat);
>  	if (unlikely(ret < 0))
>  		return ERR_PTR(ret);
> -	return iovec;
> +	if (iter->nr_segs != 1)
> +		return iovec;
> +	/*
> +	 * Convert to non-vectored request if we have a single segment. If we
> +	 * need to defer the request, then we no longer have to allocate and
> +	 * maintain a struct io_async_rw. Additionally, we won't have cleanup
> +	 * to do at completion time
> +	 */
> +	rw->addr = (unsigned long) iter->iov[0].iov_base;
> +	rw->len = iter->iov[0].iov_len;
> +	iov_iter_ubuf(iter, ddir, iter->iov[0].iov_base, rw->len);
> +	/* readv -> read distance is the same as writev -> write */
> +	BUILD_BUG_ON((IORING_OP_READ - IORING_OP_READV) !=
> +		     (IORING_OP_WRITE - IORING_OP_WRITEV));
> +	req->opcode += (IORING_OP_READ - IORING_OP_READV);

It is a bit fragile to change ->opcode, which may need matched
callbacks for the two OPs, also cause inconsistent opcode in traces.

I am wondering why not play the magic in io_prep_rw() from beginning?

Thanks,
Ming
On 3/24/23 4:41 PM, Ming Lei wrote:
> On Fri, Mar 24, 2023 at 08:35:38AM -0600, Jens Axboe wrote:
>> It's very common to have applications that use vectored reads or writes,
>> even if they only pass in a single segment. Obviously they should be
>> using read/write at that point, but...
>
> Yeah, it is like fixing application issue in kernel side, :-)

It totally is, the same thing happens all of the time for readv as well.
No amount of informing or documenting will ever fix that...

Also see:

https://lore.kernel.org/linux-fsdevel/20230324204443.45950-1-axboe@kernel.dk/

with which I think I'll change this one to just be:

	if (iter->iter_type == ITER_UBUF) {
		rw->addr = iter->ubuf;
		rw->len = iter->count;
		/* readv -> read distance is the same as writev -> write */
		BUILD_BUG_ON((IORING_OP_READ - IORING_OP_READV) !=
			     (IORING_OP_WRITE - IORING_OP_WRITEV));
		req->opcode += (IORING_OP_READ - IORING_OP_READV);
	}

instead.

We could also just skip it completely and just have liburing do the
right thing if io_uring_prep_readv/writev is called with nr_segs == 1.
Just turn it into a READ/WRITE at that point. If we do that, and with
the above generic change, it's probably Good Enough. If you use
READV/WRITEV and you're using the raw interface, then you're on your
own.

>> +	rw->addr = (unsigned long) iter->iov[0].iov_base;
>> +	rw->len = iter->iov[0].iov_len;
>> +	iov_iter_ubuf(iter, ddir, iter->iov[0].iov_base, rw->len);
>> +	/* readv -> read distance is the same as writev -> write */
>> +	BUILD_BUG_ON((IORING_OP_READ - IORING_OP_READV) !=
>> +		     (IORING_OP_WRITE - IORING_OP_WRITEV));
>> +	req->opcode += (IORING_OP_READ - IORING_OP_READV);
>
> It is a bit fragile to change ->opcode, which may need matched
> callbacks for the two OPs, also cause inconsistent opcode in traces.
>
> I am wondering why not play the magic in io_prep_rw() from beginning?

It has to be done when importing the vec, we cannot really do it in
prep... Well we could, but that'd be adding a bunch more code and
duplicating part of the vec import.
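As a rough illustration of the liburing-side option mentioned above, a caller-side helper along these lines (hypothetical, not an existing liburing API) would turn a single-segment readv into a plain read before the request ever reaches the kernel:

#include <liburing.h>
#include <sys/uio.h>

/*
 * Hypothetical sketch: if only a single segment is passed, prep a plain
 * READ instead of a READV so the kernel never sees a vectored request.
 */
static inline void prep_readv_or_read(struct io_uring_sqe *sqe, int fd,
				      const struct iovec *iovecs,
				      unsigned nr_vecs, __u64 offset)
{
	if (nr_vecs == 1)
		io_uring_prep_read(sqe, fd, iovecs[0].iov_base,
				   iovecs[0].iov_len, offset);
	else
		io_uring_prep_readv(sqe, fd, iovecs, nr_vecs, offset);
}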
On Fri, Mar 24, 2023 at 08:35:38AM -0600, Jens Axboe wrote:
> @@ -402,7 +402,22 @@ static struct iovec *__io_import_iovec(int ddir, struct io_kiocb *req,
>  			      req->ctx->compat);
>  	if (unlikely(ret < 0))
>  		return ERR_PTR(ret);
> -	return iovec;
> +	if (iter->nr_segs != 1)
> +		return iovec;
> +	/*
> +	 * Convert to non-vectored request if we have a single segment. If we
> +	 * need to defer the request, then we no longer have to allocate and
> +	 * maintain a struct io_async_rw. Additionally, we won't have cleanup
> +	 * to do at completion time
> +	 */
> +	rw->addr = (unsigned long) iter->iov[0].iov_base;
> +	rw->len = iter->iov[0].iov_len;
> +	iov_iter_ubuf(iter, ddir, iter->iov[0].iov_base, rw->len);
> +	/* readv -> read distance is the same as writev -> write */
> +	BUILD_BUG_ON((IORING_OP_READ - IORING_OP_READV) !=
> +		     (IORING_OP_WRITE - IORING_OP_WRITEV));
> +	req->opcode += (IORING_OP_READ - IORING_OP_READV);
> +	return NULL;
>  }

This may break anyone using io_uring with those bizarre drivers that have
entirely different readv semantics from normal read. I think we can safely
say no one's using io_uring for such interfaces, so probably a moot point.
If you wanted to be extra cautious though, you may want to skip this
transformation if the file->f_op implements both .read+.read_iter and
.write+.write_iter.
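One way to read the extra-cautious check suggested above is a guard like the following, a hypothetical helper that is not part of the posted patch, which would skip the conversion for files whose f_op provide both the plain and the iter-based handler for the direction in question:

/*
 * Hypothetical guard, not in the posted patch: a file that implements both
 * ->read and ->read_iter (or both ->write and ->write_iter) could in theory
 * give readv/writev different semantics than read/write, so skip the
 * single-segment conversion for it.
 */
static bool io_file_may_alias_vectored_semantics(struct file *file, int ddir)
{
	const struct file_operations *fop = file->f_op;

	if (ddir == READ)
		return fop->read && fop->read_iter;
	return fop->write && fop->write_iter;
}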
On Fri, Mar 24, 2023 at 05:06:00PM -0600, Jens Axboe wrote:
> On 3/24/23 4:41 PM, Ming Lei wrote:
> > On Fri, Mar 24, 2023 at 08:35:38AM -0600, Jens Axboe wrote:
> >> It's very common to have applications that use vectored reads or writes,
> >> even if they only pass in a single segment. Obviously they should be
> >> using read/write at that point, but...
> >
> > Yeah, it is like fixing application issue in kernel side, :-)
>
> It totally is, the same thing happens all of the time for readv as well.
> No amount of informing or documenting will ever fix that...
>
> Also see:
>
> https://lore.kernel.org/linux-fsdevel/20230324204443.45950-1-axboe@kernel.dk/
>
> with which I think I'll change this one to just be:
>
> 	if (iter->iter_type == ITER_UBUF) {
> 		rw->addr = iter->ubuf;
> 		rw->len = iter->count;
> 		/* readv -> read distance is the same as writev -> write */
> 		BUILD_BUG_ON((IORING_OP_READ - IORING_OP_READV) !=
> 			     (IORING_OP_WRITE - IORING_OP_WRITEV));
> 		req->opcode += (IORING_OP_READ - IORING_OP_READV);
> 	}
>
> instead.
>
> We could also just skip it completely and just have liburing do the
> right thing if io_uring_prep_readv/writev is called with nr_segs == 1.
> Just turn it into a READ/WRITE at that point. If we do that, and with
> the above generic change, it's probably Good Enough. If you use
> READV/WRITEV and you're using the raw interface, then you're on your
> own.
>
> >> +	rw->addr = (unsigned long) iter->iov[0].iov_base;
> >> +	rw->len = iter->iov[0].iov_len;
> >> +	iov_iter_ubuf(iter, ddir, iter->iov[0].iov_base, rw->len);
> >> +	/* readv -> read distance is the same as writev -> write */
> >> +	BUILD_BUG_ON((IORING_OP_READ - IORING_OP_READV) !=
> >> +		     (IORING_OP_WRITE - IORING_OP_WRITEV));
> >> +	req->opcode += (IORING_OP_READ - IORING_OP_READV);
> >
> > It is a bit fragile to change ->opcode, which may need matched
> > callbacks for the two OPs, also cause inconsistent opcode in traces.
> >
> > I am wondering why not play the magic in io_prep_rw() from beginning?
>
> It has to be done when importing the vec, we cannot really do it in
> prep... Well we could, but that'd be adding a bunch more code and
> duplicating part of the vec import.

I meant something like the following (un-tested), which at least
guarantees that op_code and rw->addr/len are finalized since ->prep().

diff --git a/io_uring/rw.c b/io_uring/rw.c
index 0c292ef9a40f..4bf4c3effdac 100644
--- a/io_uring/rw.c
+++ b/io_uring/rw.c
@@ -120,6 +120,25 @@ int io_prep_rw(struct io_kiocb *req, const struct io_uring_sqe *sqe)
 		return ret;
 	}
 
+	if ((req->opcode == IORING_OP_READV || req->opcode == IORING_OP_WRITEV) &&
+	    rw->len == 1) {
+		struct iovec iov;
+		struct iovec *iovp;
+
+		iovp = iovec_from_user(u64_to_user_ptr(rw->addr), 1, 1, &iov,
+				       req->ctx->compat);
+		if (IS_ERR(iovp))
+			return PTR_ERR(iovp);
+
+		rw->addr = (unsigned long) iovp->iov_base;
+		rw->len = iovp->iov_len;
+
+		/* readv -> read distance is the same as writev -> write */
+		BUILD_BUG_ON((IORING_OP_READ - IORING_OP_READV) !=
+			     (IORING_OP_WRITE - IORING_OP_WRITEV));
+		req->opcode += (IORING_OP_READ - IORING_OP_READV);
+	}
+
 	return 0;
 }

Thanks,
Ming
On Fri, Mar 24, 2023 at 05:54:36PM -0600, Keith Busch wrote:
> On Fri, Mar 24, 2023 at 08:35:38AM -0600, Jens Axboe wrote:
> > @@ -402,7 +402,22 @@ static struct iovec *__io_import_iovec(int ddir, struct io_kiocb *req,
> >  			      req->ctx->compat);
> >  	if (unlikely(ret < 0))
> >  		return ERR_PTR(ret);
> > -	return iovec;
> > +	if (iter->nr_segs != 1)
> > +		return iovec;
> > +	/*
> > +	 * Convert to non-vectored request if we have a single segment. If we
> > +	 * need to defer the request, then we no longer have to allocate and
> > +	 * maintain a struct io_async_rw. Additionally, we won't have cleanup
> > +	 * to do at completion time
> > +	 */
> > +	rw->addr = (unsigned long) iter->iov[0].iov_base;
> > +	rw->len = iter->iov[0].iov_len;
> > +	iov_iter_ubuf(iter, ddir, iter->iov[0].iov_base, rw->len);
> > +	/* readv -> read distance is the same as writev -> write */
> > +	BUILD_BUG_ON((IORING_OP_READ - IORING_OP_READV) !=
> > +		     (IORING_OP_WRITE - IORING_OP_WRITEV));
> > +	req->opcode += (IORING_OP_READ - IORING_OP_READV);
> > +	return NULL;
> >  }
>
> This may break anyone using io_uring with those bizarre drivers that have
> entirely different readv semantics from normal read. I think we can safely
> say no one's using io_uring for such interfaces, so probably a moot point.
> If you wanted to be extra cautious though, you may want to skip this
> transformation if the file->f_op implements both .read+.read_iter and
> .write+.write_iter.

Sorry, scratch that; those drivers are already broken if they implement both
with different semantics, so we can definitely say none of those are using
io_uring. This new patch should be safe then.
On 3/25/23 00:24, Ming Lei wrote:
> On Fri, Mar 24, 2023 at 05:06:00PM -0600, Jens Axboe wrote:
>> On 3/24/23 4:41 PM, Ming Lei wrote:
>>> On Fri, Mar 24, 2023 at 08:35:38AM -0600, Jens Axboe wrote:
>>>> It's very common to have applications that use vectored reads or writes,
>>>> even if they only pass in a single segment. Obviously they should be
>>>> using read/write at that point, but...
>>>
>>> Yeah, it is like fixing application issue in kernel side, :-)
>>
>> It totally is, the same thing happens all of the time for readv as well.
>> No amount of informing or documenting will ever fix that...
>>
>> Also see:
>>
>> https://lore.kernel.org/linux-fsdevel/20230324204443.45950-1-axboe@kernel.dk/
>>
>> with which I think I'll change this one to just be:
>>
>> 	if (iter->iter_type == ITER_UBUF) {
>> 		rw->addr = iter->ubuf;
>> 		rw->len = iter->count;
>> 		/* readv -> read distance is the same as writev -> write */
>> 		BUILD_BUG_ON((IORING_OP_READ - IORING_OP_READV) !=
>> 			     (IORING_OP_WRITE - IORING_OP_WRITEV));
>> 		req->opcode += (IORING_OP_READ - IORING_OP_READV);
>> 	}
>>
>> instead.
>>
>> We could also just skip it completely and just have liburing do the
>> right thing if io_uring_prep_readv/writev is called with nr_segs == 1.
>> Just turn it into a READ/WRITE at that point. If we do that, and with
>> the above generic change, it's probably Good Enough. If you use
>> READV/WRITEV and you're using the raw interface, then you're on your
>> own.

I like this option, but sendmsg and recvmsg probably do need the same
fix up, which is more justified as they can't get converted to
send/recv like this.

Another option is to internally detangle opcodes from iter types.

import() {
	if (req->op == READV)
		import_iovec();
	else
		import_buf();
}

would get replaced with:

prep() {
	if (req->op == READV)
		req->flags |= REQ_F_IOVEC;
}

import() {
	if (req->flags & REQ_F_IOVEC)
		import_iovec();
	else
		import_buf();
}

>>>> +	rw->addr = (unsigned long) iter->iov[0].iov_base;
>>>> +	rw->len = iter->iov[0].iov_len;
>>>> +	iov_iter_ubuf(iter, ddir, iter->iov[0].iov_base, rw->len);
>>>> +	/* readv -> read distance is the same as writev -> write */
>>>> +	BUILD_BUG_ON((IORING_OP_READ - IORING_OP_READV) !=
>>>> +		     (IORING_OP_WRITE - IORING_OP_WRITEV));
>>>> +	req->opcode += (IORING_OP_READ - IORING_OP_READV);
>>>
>>> It is a bit fragile to change ->opcode, which may need matched
>>> callbacks for the two OPs, also cause inconsistent opcode in traces.
>>>
>>> I am wondering why not play the magic in io_prep_rw() from beginning?
>>
>> It has to be done when importing the vec, we cannot really do it in
>> prep... Well we could, but that'd be adding a bunch more code and
>> duplicating part of the vec import.
>
> I meant something like the following (un-tested), which at least
> guarantees that op_code and rw->addr/len are finalized since ->prep().

It sounds like a better approach. With opcode machinations it's easy to
forget about some kind of state that could be fatal. Take IOSQE_ASYNC for
example. The core code will allocate async_data and do
io_readv_prep_async() -> import_iovec(), which inside changes the opcode.
It'll be a problem if io_readv_prep_async() forgets that it might be a
different opcode with a slightly different req layout, or even a
non-vectored read would do something weird with ->async_data or mishandle
REQ_F_NEED_CLEANUP.

fwiw, needs compat handling, i.e. leave as iovec if compat.

> diff --git a/io_uring/rw.c b/io_uring/rw.c
> index 0c292ef9a40f..4bf4c3effdac 100644
> --- a/io_uring/rw.c
> +++ b/io_uring/rw.c
> @@ -120,6 +120,25 @@ int io_prep_rw(struct io_kiocb *req, const struct io_uring_sqe *sqe)
>  		return ret;
>  	}
>  
> +	if ((req->opcode == IORING_OP_READV || req->opcode == IORING_OP_WRITEV) &&
> +	    rw->len == 1) {
> +		struct iovec iov;
> +		struct iovec *iovp;
> +
> +		iovp = iovec_from_user(u64_to_user_ptr(rw->addr), 1, 1, &iov,
> +				       req->ctx->compat);
> +		if (IS_ERR(iovp))
> +			return PTR_ERR(iovp);
> +
> +		rw->addr = (unsigned long) iovp->iov_base;
> +		rw->len = iovp->iov_len;
> +
> +		/* readv -> read distance is the same as writev -> write */
> +		BUILD_BUG_ON((IORING_OP_READ - IORING_OP_READV) !=
> +			     (IORING_OP_WRITE - IORING_OP_WRITEV));
> +		req->opcode += (IORING_OP_READ - IORING_OP_READV);
> +	}
> +
>  	return 0;
>  }
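For concreteness, the flag-based dispatch sketched in the reply above could look roughly like this in the import path; REQ_F_IOVEC and io_import_vec() are hypothetical names used only for illustration:

/*
 * Rough expansion of the REQ_F_IOVEC idea: the opcode stays READV/WRITEV,
 * and a flag set at prep time picks the import path instead.
 */
static int io_import_rw(int ddir, struct io_kiocb *req, struct io_rw *rw,
			struct iov_iter *iter)
{
	if (req->flags & REQ_F_IOVEC)
		return io_import_vec(ddir, req, rw, iter);

	/* single linear buffer: cheap ITER_UBUF, no per-request iovec state */
	iov_iter_ubuf(iter, ddir, u64_to_user_ptr(rw->addr), rw->len);
	return 0;
}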
diff --git a/io_uring/rw.c b/io_uring/rw.c
index 4c233910e200..5c998754cb17 100644
--- a/io_uring/rw.c
+++ b/io_uring/rw.c
@@ -402,7 +402,22 @@ static struct iovec *__io_import_iovec(int ddir, struct io_kiocb *req,
 			      req->ctx->compat);
 	if (unlikely(ret < 0))
 		return ERR_PTR(ret);
-	return iovec;
+	if (iter->nr_segs != 1)
+		return iovec;
+	/*
+	 * Convert to non-vectored request if we have a single segment. If we
+	 * need to defer the request, then we no longer have to allocate and
+	 * maintain a struct io_async_rw. Additionally, we won't have cleanup
+	 * to do at completion time
+	 */
+	rw->addr = (unsigned long) iter->iov[0].iov_base;
+	rw->len = iter->iov[0].iov_len;
+	iov_iter_ubuf(iter, ddir, iter->iov[0].iov_base, rw->len);
+	/* readv -> read distance is the same as writev -> write */
+	BUILD_BUG_ON((IORING_OP_READ - IORING_OP_READV) !=
+		     (IORING_OP_WRITE - IORING_OP_WRITEV));
+	req->opcode += (IORING_OP_READ - IORING_OP_READV);
+	return NULL;
 }
 
 static inline int io_import_iovec(int rw, struct io_kiocb *req,
It's very common to have applications that use vectored reads or writes,
even if they only pass in a single segment. Obviously they should be
using read/write at that point, but...

Vectored IO comes with the downside of needing to retain iovec state,
and hence they require an allocation and state copy if they end up
getting deferred. Additionally, they also require extra cleanup when
completed, as the allocated state memory has to be freed.

Automatically transform single segment IORING_OP_{READV,WRITEV} into
IORING_OP_{READ,WRITE}, and hence into an ITER_UBUF. Outside of being
more efficient if needing deferral, ITER_UBUF is also more efficient
for normal processing compared to ITER_IOVEC, as it doesn't require
iteration. The latter is apparent when running peak testing, where
using IORING_OP_READV to randomly read 24 drives previously scored:

IOPS=72.54M, BW=35.42GiB/s, IOS/call=32/31
IOPS=73.35M, BW=35.81GiB/s, IOS/call=32/31
IOPS=72.71M, BW=35.50GiB/s, IOS/call=32/31
IOPS=73.29M, BW=35.78GiB/s, IOS/call=32/32
IOPS=73.45M, BW=35.86GiB/s, IOS/call=32/32
IOPS=73.19M, BW=35.74GiB/s, IOS/call=31/32
IOPS=72.89M, BW=35.59GiB/s, IOS/call=32/31
IOPS=73.07M, BW=35.68GiB/s, IOS/call=32/32

and after the change we get:

IOPS=77.31M, BW=37.75GiB/s, IOS/call=32/31
IOPS=77.32M, BW=37.75GiB/s, IOS/call=32/32
IOPS=77.45M, BW=37.81GiB/s, IOS/call=31/31
IOPS=77.47M, BW=37.83GiB/s, IOS/call=32/32
IOPS=77.14M, BW=37.67GiB/s, IOS/call=32/32
IOPS=77.14M, BW=37.66GiB/s, IOS/call=31/31
IOPS=77.37M, BW=37.78GiB/s, IOS/call=32/32
IOPS=77.25M, BW=37.72GiB/s, IOS/call=32/32

which is a nice win as well.

Signed-off-by: Jens Axboe <axboe@kernel.dk>

---
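To make the target use case concrete, the following illustrative userspace snippet (standard liburing, nothing specific to this patch) shows the kind of submission that benefits: a vectored read that only ever carries one segment, which the kernel can now treat as a plain read:

#include <liburing.h>
#include <sys/uio.h>

/*
 * Illustrative only: a single-segment IORING_OP_READV submission, the exact
 * pattern the patch transparently converts to IORING_OP_READ/ITER_UBUF.
 */
static int queue_single_vec_read(struct io_uring *ring, int fd,
				 void *buf, size_t len)
{
	struct iovec iov = { .iov_base = buf, .iov_len = len };
	struct io_uring_sqe *sqe = io_uring_get_sqe(ring);

	if (!sqe)
		return -EAGAIN;
	io_uring_prep_readv(sqe, fd, &iov, 1, 0);
	return io_uring_submit(ring);	/* iovec is imported at submit time */
}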