diff mbox series

[5/9] io_uring: support SQE group

Message ID 20240408010322.4104395-6-ming.lei@redhat.com (mailing list archive)
State New, archived
Headers show
Series io_uring: support sqe group and provide group kbuf | expand

Commit Message

Ming Lei April 8, 2024, 1:03 a.m. UTC
SQE group is defined as one chain of SQEs starting with the first sqe that
has IOSQE_EXT_SQE_GROUP set, and ending with the first subsequent sqe that
doesn't have it set, and it is similar with chain of linked sqes.

The 1st SQE is group leader, and the other SQEs are group member. The group
leader is always freed after all members are completed. Group members
aren't submitted until the group leader is completed, and there isn't any
dependency among group members, and IOSQE_IO_LINK can't be set for group
members, same with IOSQE_IO_DRAIN.

Typically the group leader provides or makes resource, and the other members
consume the resource, such as scenario of multiple backup, the 1st SQE is to
read data from source file into fixed buffer, the other SQEs write data from
the same buffer into other destination files. SQE group provides very
efficient way to complete this task: 1) fs write SQEs and fs read SQE can be
submitted in single syscall, no need to submit fs read SQE first, and wait
until read SQE is completed, 2) no need to link all write SQEs together, then
write SQEs can be submitted to files concurrently. Meantime application is
simplified a lot in this way.

Another use case is to for supporting generic device zero copy:

- the lead SQE is for providing device buffer, which is owned by device or
  kernel, can't be cross userspace, otherwise easy to cause leak for devil
  application or panic

- member SQEs reads or writes concurrently against the buffer provided by lead
  SQE

Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 include/linux/io_uring_types.h |  10 ++
 include/uapi/linux/io_uring.h  |   4 +
 io_uring/io_uring.c            | 218 ++++++++++++++++++++++++++++++---
 io_uring/io_uring.h            |  17 ++-
 4 files changed, 231 insertions(+), 18 deletions(-)

Comments

Jens Axboe April 22, 2024, 6:27 p.m. UTC | #1
On 4/7/24 7:03 PM, Ming Lei wrote:
> SQE group is defined as one chain of SQEs starting with the first sqe that
> has IOSQE_EXT_SQE_GROUP set, and ending with the first subsequent sqe that
> doesn't have it set, and it is similar with chain of linked sqes.
> 
> The 1st SQE is group leader, and the other SQEs are group member. The group
> leader is always freed after all members are completed. Group members
> aren't submitted until the group leader is completed, and there isn't any
> dependency among group members, and IOSQE_IO_LINK can't be set for group
> members, same with IOSQE_IO_DRAIN.
> 
> Typically the group leader provides or makes resource, and the other members
> consume the resource, such as scenario of multiple backup, the 1st SQE is to
> read data from source file into fixed buffer, the other SQEs write data from
> the same buffer into other destination files. SQE group provides very
> efficient way to complete this task: 1) fs write SQEs and fs read SQE can be
> submitted in single syscall, no need to submit fs read SQE first, and wait
> until read SQE is completed, 2) no need to link all write SQEs together, then
> write SQEs can be submitted to files concurrently. Meantime application is
> simplified a lot in this way.
> 
> Another use case is to for supporting generic device zero copy:
> 
> - the lead SQE is for providing device buffer, which is owned by device or
>   kernel, can't be cross userspace, otherwise easy to cause leak for devil
>   application or panic
> 
> - member SQEs reads or writes concurrently against the buffer provided by lead
>   SQE

In concept, this looks very similar to "sqe bundles" that I played with
in the past:

https://git.kernel.dk/cgit/linux/log/?h=io_uring-bundle

Didn't look too closely yet at the implementation, but in spirit it's
about the same in that the first entry is processed first, and there's
no ordering implied between the test of the members of the bundle /
group.

I do think that's a flexible thing to support, particularly if:

1) We can do it more efficiently than links, which are pretty horrible.
2) It enables new worthwhile use cases
3) It's done cleanly 
4) It's easily understandable and easy to document, so that users will
   actually understand what this is and what use cases it enable. Part
   of that is actually naming, it should be readily apparent what a
   group is, what the lead is, and what the members are. Using your
   terminology here, definitely worth spending some time on that to get
   it just right and self evident.
Kevin Wolf April 23, 2024, 1:08 p.m. UTC | #2
Am 22.04.2024 um 20:27 hat Jens Axboe geschrieben:
> On 4/7/24 7:03 PM, Ming Lei wrote:
> > SQE group is defined as one chain of SQEs starting with the first sqe that
> > has IOSQE_EXT_SQE_GROUP set, and ending with the first subsequent sqe that
> > doesn't have it set, and it is similar with chain of linked sqes.
> > 
> > The 1st SQE is group leader, and the other SQEs are group member. The group
> > leader is always freed after all members are completed. Group members
> > aren't submitted until the group leader is completed, and there isn't any
> > dependency among group members, and IOSQE_IO_LINK can't be set for group
> > members, same with IOSQE_IO_DRAIN.
> > 
> > Typically the group leader provides or makes resource, and the other members
> > consume the resource, such as scenario of multiple backup, the 1st SQE is to
> > read data from source file into fixed buffer, the other SQEs write data from
> > the same buffer into other destination files. SQE group provides very
> > efficient way to complete this task: 1) fs write SQEs and fs read SQE can be
> > submitted in single syscall, no need to submit fs read SQE first, and wait
> > until read SQE is completed, 2) no need to link all write SQEs together, then
> > write SQEs can be submitted to files concurrently. Meantime application is
> > simplified a lot in this way.
> > 
> > Another use case is to for supporting generic device zero copy:
> > 
> > - the lead SQE is for providing device buffer, which is owned by device or
> >   kernel, can't be cross userspace, otherwise easy to cause leak for devil
> >   application or panic
> > 
> > - member SQEs reads or writes concurrently against the buffer provided by lead
> >   SQE
> 
> In concept, this looks very similar to "sqe bundles" that I played with
> in the past:
> 
> https://git.kernel.dk/cgit/linux/log/?h=io_uring-bundle
> 
> Didn't look too closely yet at the implementation, but in spirit it's
> about the same in that the first entry is processed first, and there's
> no ordering implied between the test of the members of the bundle /
> group.

When I first read this patch, I wondered if it wouldn't make sense to
allow linking a group with subsequent requests, e.g. first having a few
requests that run in parallel and once all of them have completed
continue with the next linked one sequentially.

For SQE bundles, you reused the LINK flag, which doesn't easily allow
this. Ming's patch uses a new flag for groups, so the interface would be
more obvious, you simply set the LINK flag on the last member of the
group (or on the leader, doesn't really matter). Of course, this doesn't
mean it has to be implemented now, but there is a clear way forward if
it's wanted.

The part that looks a bit arbitrary in Ming's patch is that the group
leader is always completed before the rest starts. It makes perfect
sense in the context that this series is really after (enabling zero
copy for ublk), but it doesn't really allow the case you mention in the
SQE bundle commit message, running everything in parallel and getting a
single CQE for the whole group.

I suppose you could hack around the sequential nature of the first
request by using an extra NOP as the group leader - which isn't any
worse than having an IORING_OP_BUNDLE really, just looks a bit odd - but
the group completion would still be missing. (Of course, removing the
sequential first operation would mean that ublk wouldn't have the buffer
ready any more when the other requests try to use it, so that would
defeat the purpose of the series...)

I wonder if we can still combine both approaches and create some
generally useful infrastructure and not something where it's visible
that it was designed mostly for ublk's special case and other use cases
just happened to be enabled as a side effect.

Kevin
Ming Lei April 24, 2024, 12:46 a.m. UTC | #3
On Mon, Apr 22, 2024 at 12:27:28PM -0600, Jens Axboe wrote:
> On 4/7/24 7:03 PM, Ming Lei wrote:
> > SQE group is defined as one chain of SQEs starting with the first sqe that
> > has IOSQE_EXT_SQE_GROUP set, and ending with the first subsequent sqe that
> > doesn't have it set, and it is similar with chain of linked sqes.
> > 
> > The 1st SQE is group leader, and the other SQEs are group member. The group
> > leader is always freed after all members are completed. Group members
> > aren't submitted until the group leader is completed, and there isn't any
> > dependency among group members, and IOSQE_IO_LINK can't be set for group
> > members, same with IOSQE_IO_DRAIN.
> > 
> > Typically the group leader provides or makes resource, and the other members
> > consume the resource, such as scenario of multiple backup, the 1st SQE is to
> > read data from source file into fixed buffer, the other SQEs write data from
> > the same buffer into other destination files. SQE group provides very
> > efficient way to complete this task: 1) fs write SQEs and fs read SQE can be
> > submitted in single syscall, no need to submit fs read SQE first, and wait
> > until read SQE is completed, 2) no need to link all write SQEs together, then
> > write SQEs can be submitted to files concurrently. Meantime application is
> > simplified a lot in this way.
> > 
> > Another use case is to for supporting generic device zero copy:
> > 
> > - the lead SQE is for providing device buffer, which is owned by device or
> >   kernel, can't be cross userspace, otherwise easy to cause leak for devil
> >   application or panic
> > 
> > - member SQEs reads or writes concurrently against the buffer provided by lead
> >   SQE
> 
> In concept, this looks very similar to "sqe bundles" that I played with
> in the past:
> 
> https://git.kernel.dk/cgit/linux/log/?h=io_uring-bundle

Indeed, so looks it is something which io_uring needs.

> 
> Didn't look too closely yet at the implementation, but in spirit it's
> about the same in that the first entry is processed first, and there's
> no ordering implied between the test of the members of the bundle /
> group.

Yeah.

> 
> I do think that's a flexible thing to support, particularly if:
> 
> 1) We can do it more efficiently than links, which are pretty horrible.

Agree, link is hard to use in async/.await of modern language per my
experience.

Also sqe group won't break link, and the group is thought as a whole
wrt. linking.

> 2) It enables new worthwhile use cases
> 3) It's done cleanly 
> 4) It's easily understandable and easy to document, so that users will
>    actually understand what this is and what use cases it enable. Part
>    of that is actually naming, it should be readily apparent what a
>    group is, what the lead is, and what the members are. Using your
>    terminology here, definitely worth spending some time on that to get
>    it just right and self evident.

All are nice suggestions, and I will follow above and make them in V2.


Thanks,
Ming
Ming Lei April 24, 2024, 1:39 a.m. UTC | #4
On Tue, Apr 23, 2024 at 03:08:55PM +0200, Kevin Wolf wrote:
> Am 22.04.2024 um 20:27 hat Jens Axboe geschrieben:
> > On 4/7/24 7:03 PM, Ming Lei wrote:
> > > SQE group is defined as one chain of SQEs starting with the first sqe that
> > > has IOSQE_EXT_SQE_GROUP set, and ending with the first subsequent sqe that
> > > doesn't have it set, and it is similar with chain of linked sqes.
> > > 
> > > The 1st SQE is group leader, and the other SQEs are group member. The group
> > > leader is always freed after all members are completed. Group members
> > > aren't submitted until the group leader is completed, and there isn't any
> > > dependency among group members, and IOSQE_IO_LINK can't be set for group
> > > members, same with IOSQE_IO_DRAIN.
> > > 
> > > Typically the group leader provides or makes resource, and the other members
> > > consume the resource, such as scenario of multiple backup, the 1st SQE is to
> > > read data from source file into fixed buffer, the other SQEs write data from
> > > the same buffer into other destination files. SQE group provides very
> > > efficient way to complete this task: 1) fs write SQEs and fs read SQE can be
> > > submitted in single syscall, no need to submit fs read SQE first, and wait
> > > until read SQE is completed, 2) no need to link all write SQEs together, then
> > > write SQEs can be submitted to files concurrently. Meantime application is
> > > simplified a lot in this way.
> > > 
> > > Another use case is to for supporting generic device zero copy:
> > > 
> > > - the lead SQE is for providing device buffer, which is owned by device or
> > >   kernel, can't be cross userspace, otherwise easy to cause leak for devil
> > >   application or panic
> > > 
> > > - member SQEs reads or writes concurrently against the buffer provided by lead
> > >   SQE
> > 
> > In concept, this looks very similar to "sqe bundles" that I played with
> > in the past:
> > 
> > https://git.kernel.dk/cgit/linux/log/?h=io_uring-bundle
> > 
> > Didn't look too closely yet at the implementation, but in spirit it's
> > about the same in that the first entry is processed first, and there's
> > no ordering implied between the test of the members of the bundle /
> > group.
> 
> When I first read this patch, I wondered if it wouldn't make sense to
> allow linking a group with subsequent requests, e.g. first having a few
> requests that run in parallel and once all of them have completed
> continue with the next linked one sequentially.
> 
> For SQE bundles, you reused the LINK flag, which doesn't easily allow
> this. Ming's patch uses a new flag for groups, so the interface would be
> more obvious, you simply set the LINK flag on the last member of the
> group (or on the leader, doesn't really matter). Of course, this doesn't
> mean it has to be implemented now, but there is a clear way forward if
> it's wanted.

Reusing LINK for bundle breaks existed link chains(BUNDLE linked to existed
link chain), so I think it may not work.

The link rule is explicit for sqe group:

- only group leader can set link flag, which is applied on the whole
group: the next sqe in the link chain won't be started until the
previous linked sqe group is completed

- link flag can't be set for group members

Also sqe group doesn't limit async for both group leader and member.

sqe group vs link & async is covered in the last liburing test code.

> 
> The part that looks a bit arbitrary in Ming's patch is that the group
> leader is always completed before the rest starts. It makes perfect
> sense in the context that this series is really after (enabling zero
> copy for ublk), but it doesn't really allow the case you mention in the
> SQE bundle commit message, running everything in parallel and getting a
> single CQE for the whole group.

I think it should be easy to cover bundle in this way, such as add one new
op IORING_OP_BUNDLE as Jens did, and implement the single CQE for whole group/bundle.

> 
> I suppose you could hack around the sequential nature of the first
> request by using an extra NOP as the group leader - which isn't any
> worse than having an IORING_OP_BUNDLE really, just looks a bit odd - but
> the group completion would still be missing. (Of course, removing the
> sequential first operation would mean that ublk wouldn't have the buffer
> ready any more when the other requests try to use it, so that would
> defeat the purpose of the series...)
> 
> I wonder if we can still combine both approaches and create some
> generally useful infrastructure and not something where it's visible
> that it was designed mostly for ublk's special case and other use cases
> just happened to be enabled as a side effect.

sqe group is actually one generic interface, please see the multiple copy(
copy one file to multiple destinations in single syscall for one range) example
in the last patch, and it can support generic device zero copy: any device internal
buffer can be linked with io_uring operations in this way, which can't
be done by traditional splice/pipe.

I guess it can be used in network Rx zero copy too, but may depend on actual
network Rx use case.



Thanks,
Ming
Kevin Wolf April 25, 2024, 9:27 a.m. UTC | #5
Am 24.04.2024 um 03:39 hat Ming Lei geschrieben:
> On Tue, Apr 23, 2024 at 03:08:55PM +0200, Kevin Wolf wrote:
> > Am 22.04.2024 um 20:27 hat Jens Axboe geschrieben:
> > > On 4/7/24 7:03 PM, Ming Lei wrote:
> > > > SQE group is defined as one chain of SQEs starting with the first sqe that
> > > > has IOSQE_EXT_SQE_GROUP set, and ending with the first subsequent sqe that
> > > > doesn't have it set, and it is similar with chain of linked sqes.
> > > > 
> > > > The 1st SQE is group leader, and the other SQEs are group member. The group
> > > > leader is always freed after all members are completed. Group members
> > > > aren't submitted until the group leader is completed, and there isn't any
> > > > dependency among group members, and IOSQE_IO_LINK can't be set for group
> > > > members, same with IOSQE_IO_DRAIN.
> > > > 
> > > > Typically the group leader provides or makes resource, and the other members
> > > > consume the resource, such as scenario of multiple backup, the 1st SQE is to
> > > > read data from source file into fixed buffer, the other SQEs write data from
> > > > the same buffer into other destination files. SQE group provides very
> > > > efficient way to complete this task: 1) fs write SQEs and fs read SQE can be
> > > > submitted in single syscall, no need to submit fs read SQE first, and wait
> > > > until read SQE is completed, 2) no need to link all write SQEs together, then
> > > > write SQEs can be submitted to files concurrently. Meantime application is
> > > > simplified a lot in this way.
> > > > 
> > > > Another use case is to for supporting generic device zero copy:
> > > > 
> > > > - the lead SQE is for providing device buffer, which is owned by device or
> > > >   kernel, can't be cross userspace, otherwise easy to cause leak for devil
> > > >   application or panic
> > > > 
> > > > - member SQEs reads or writes concurrently against the buffer provided by lead
> > > >   SQE
> > > 
> > > In concept, this looks very similar to "sqe bundles" that I played with
> > > in the past:
> > > 
> > > https://git.kernel.dk/cgit/linux/log/?h=io_uring-bundle
> > > 
> > > Didn't look too closely yet at the implementation, but in spirit it's
> > > about the same in that the first entry is processed first, and there's
> > > no ordering implied between the test of the members of the bundle /
> > > group.
> > 
> > When I first read this patch, I wondered if it wouldn't make sense to
> > allow linking a group with subsequent requests, e.g. first having a few
> > requests that run in parallel and once all of them have completed
> > continue with the next linked one sequentially.
> > 
> > For SQE bundles, you reused the LINK flag, which doesn't easily allow
> > this. Ming's patch uses a new flag for groups, so the interface would be
> > more obvious, you simply set the LINK flag on the last member of the
> > group (or on the leader, doesn't really matter). Of course, this doesn't
> > mean it has to be implemented now, but there is a clear way forward if
> > it's wanted.
> 
> Reusing LINK for bundle breaks existed link chains (BUNDLE linked to
> existed link chain), so I think it may not work.

You can always extend things *somehow*, but it wouldn't fit very
naturally. That's why I feel your approach on this detail is a little
better.

> The link rule is explicit for sqe group:
> 
> - only group leader can set link flag, which is applied on the whole
> group: the next sqe in the link chain won't be started until the
> previous linked sqe group is completed
> 
> - link flag can't be set for group members
> 
> Also sqe group doesn't limit async for both group leader and member.
> 
> sqe group vs link & async is covered in the last liburing test code.

Oh right, I didn't actually notice that you already implement what I
proposed!

I was expecting the flag on the last SQE and I saw in the code that this
isn't allowed, but I completely missed your comment that explicitly
states that it's the group leader that gets the link flag. Of course,
this is just as good.

> > The part that looks a bit arbitrary in Ming's patch is that the group
> > leader is always completed before the rest starts. It makes perfect
> > sense in the context that this series is really after (enabling zero
> > copy for ublk), but it doesn't really allow the case you mention in the
> > SQE bundle commit message, running everything in parallel and getting a
> > single CQE for the whole group.
> 
> I think it should be easy to cover bundle in this way, such as add one
> new op IORING_OP_BUNDLE as Jens did, and implement the single CQE for
> whole group/bundle.

This requires an extra SQE compared to just creating the group with
flags, but I suppose this is not a big problem. An alternative might be
sending the CQE for the group leader only after the whole group has
completed if we're okay with userspace never knowing when the leader
itself completed.

However, assuming an IORING_OP_BUNDLE command, if this command only
completes after the whole group, doesn't that conflict with the
principle that all other commands are only started after the first one
has completed?

Maybe we shouldn't wait for the whole group leader request to complete,
but just give the group leader a chance to prepare the group before all
requests in the group (including the leader itself) are run in parallel.
Maybe io_issue_sqe() could just start the rest of the group somewhere
after calling def->issue() for the leader. Then you can't prepare the
group buffer asynchronously, but I don't think this is needed, right?

Your example with one read followed by multiple writes would then have
to be written slightly differently: First the read outside of the group,
linked to a group of writes. I honestly think this makes more sense as
an interface, too, because then links are for sequential things and
groups are (only) for parallel things. This feels clearer than having
both a sequential and a parallel element in groups.

> > I suppose you could hack around the sequential nature of the first
> > request by using an extra NOP as the group leader - which isn't any
> > worse than having an IORING_OP_BUNDLE really, just looks a bit odd - but
> > the group completion would still be missing. (Of course, removing the
> > sequential first operation would mean that ublk wouldn't have the buffer
> > ready any more when the other requests try to use it, so that would
> > defeat the purpose of the series...)
> > 
> > I wonder if we can still combine both approaches and create some
> > generally useful infrastructure and not something where it's visible
> > that it was designed mostly for ublk's special case and other use cases
> > just happened to be enabled as a side effect.
> 
> sqe group is actually one generic interface, please see the multiple
> copy( copy one file to multiple destinations in single syscall for one
> range) example in the last patch

Yes, that's an example that happens to work well with the model that you
derived from ublk.

If you have the opposite case, reading a buffer that is spread across
multiple files and then writing it to one target (i.e. first step
parallel, second step sequential), you can't represent this well
currently. You could work around it by having a NOP leader, but that's
not very elegant.

This asymmetry suggests that it's not the perfect interface yet.

If the whole group runs in parallel instead, including the leader, then
both examples become symmetrical. You have a group for the parallel I/O
and a linked single request for the other operation.

Or if both steps are parallel, you can just have two linked groups.

> and it can support generic device zero copy: any device internal
> buffer can be linked with io_uring operations in this way, which can't
> be done by traditional splice/pipe.

Is this actually implemented or is it just a potential direction for the
future?

> I guess it can be used in network Rx zero copy too, but may depend on
> actual network Rx use case.

Kevin
Ming Lei April 26, 2024, 7:53 a.m. UTC | #6
On Thu, Apr 25, 2024 at 11:27:32AM +0200, Kevin Wolf wrote:
> Am 24.04.2024 um 03:39 hat Ming Lei geschrieben:
> > On Tue, Apr 23, 2024 at 03:08:55PM +0200, Kevin Wolf wrote:
> > > Am 22.04.2024 um 20:27 hat Jens Axboe geschrieben:
> > > > On 4/7/24 7:03 PM, Ming Lei wrote:
> > > > > SQE group is defined as one chain of SQEs starting with the first sqe that
> > > > > has IOSQE_EXT_SQE_GROUP set, and ending with the first subsequent sqe that
> > > > > doesn't have it set, and it is similar with chain of linked sqes.
> > > > > 
> > > > > The 1st SQE is group leader, and the other SQEs are group member. The group
> > > > > leader is always freed after all members are completed. Group members
> > > > > aren't submitted until the group leader is completed, and there isn't any
> > > > > dependency among group members, and IOSQE_IO_LINK can't be set for group
> > > > > members, same with IOSQE_IO_DRAIN.
> > > > > 
> > > > > Typically the group leader provides or makes resource, and the other members
> > > > > consume the resource, such as scenario of multiple backup, the 1st SQE is to
> > > > > read data from source file into fixed buffer, the other SQEs write data from
> > > > > the same buffer into other destination files. SQE group provides very
> > > > > efficient way to complete this task: 1) fs write SQEs and fs read SQE can be
> > > > > submitted in single syscall, no need to submit fs read SQE first, and wait
> > > > > until read SQE is completed, 2) no need to link all write SQEs together, then
> > > > > write SQEs can be submitted to files concurrently. Meantime application is
> > > > > simplified a lot in this way.
> > > > > 
> > > > > Another use case is to for supporting generic device zero copy:
> > > > > 
> > > > > - the lead SQE is for providing device buffer, which is owned by device or
> > > > >   kernel, can't be cross userspace, otherwise easy to cause leak for devil
> > > > >   application or panic
> > > > > 
> > > > > - member SQEs reads or writes concurrently against the buffer provided by lead
> > > > >   SQE
> > > > 
> > > > In concept, this looks very similar to "sqe bundles" that I played with
> > > > in the past:
> > > > 
> > > > https://git.kernel.dk/cgit/linux/log/?h=io_uring-bundle
> > > > 
> > > > Didn't look too closely yet at the implementation, but in spirit it's
> > > > about the same in that the first entry is processed first, and there's
> > > > no ordering implied between the test of the members of the bundle /
> > > > group.
> > > 
> > > When I first read this patch, I wondered if it wouldn't make sense to
> > > allow linking a group with subsequent requests, e.g. first having a few
> > > requests that run in parallel and once all of them have completed
> > > continue with the next linked one sequentially.
> > > 
> > > For SQE bundles, you reused the LINK flag, which doesn't easily allow
> > > this. Ming's patch uses a new flag for groups, so the interface would be
> > > more obvious, you simply set the LINK flag on the last member of the
> > > group (or on the leader, doesn't really matter). Of course, this doesn't
> > > mean it has to be implemented now, but there is a clear way forward if
> > > it's wanted.
> > 
> > Reusing LINK for bundle breaks existed link chains (BUNDLE linked to
> > existed link chain), so I think it may not work.
> 
> You can always extend things *somehow*, but it wouldn't fit very
> naturally. That's why I feel your approach on this detail is a little
> better.

Linking group in traditionally way is real use case, please see
ublk-nbd's zero copy implementation.

https://github.com/ublk-org/ublksrv/blob/group-provide-buf/nbd/tgt_nbd.cpp

> 
> > The link rule is explicit for sqe group:
> > 
> > - only group leader can set link flag, which is applied on the whole
> > group: the next sqe in the link chain won't be started until the
> > previous linked sqe group is completed
> > 
> > - link flag can't be set for group members
> > 
> > Also sqe group doesn't limit async for both group leader and member.
> > 
> > sqe group vs link & async is covered in the last liburing test code.
> 
> Oh right, I didn't actually notice that you already implement what I
> proposed!
> 
> I was expecting the flag on the last SQE and I saw in the code that this
> isn't allowed, but I completely missed your comment that explicitly
> states that it's the group leader that gets the link flag. Of course,
> this is just as good.
> 
> > > The part that looks a bit arbitrary in Ming's patch is that the group
> > > leader is always completed before the rest starts. It makes perfect
> > > sense in the context that this series is really after (enabling zero
> > > copy for ublk), but it doesn't really allow the case you mention in the
> > > SQE bundle commit message, running everything in parallel and getting a
> > > single CQE for the whole group.
> > 
> > I think it should be easy to cover bundle in this way, such as add one
> > new op IORING_OP_BUNDLE as Jens did, and implement the single CQE for
> > whole group/bundle.
> 
> This requires an extra SQE compared to just creating the group with
> flags, but I suppose this is not a big problem. An alternative might be
> sending the CQE for the group leader only after the whole group has
> completed if we're okay with userspace never knowing when the leader
> itself completed.
> 
> However, assuming an IORING_OP_BUNDLE command, if this command only
> completes after the whole group, doesn't that conflict with the

Here the completion means committing CQE to userspace ring.

> principle that all other commands are only started after the first one
> has completed?

I meant IORING_OP_BUNDLE is the group leader, and the first one is the
the leader.

The member requests won't be started until the leader is completed, and
here the completion means that the request is completed from subsystem
(FS, netowork, ...), so there isn't conflict, but yes, we need to
describe the whole ideas/terms more carefully.

> 
> Maybe we shouldn't wait for the whole group leader request to complete,
> but just give the group leader a chance to prepare the group before all
> requests in the group (including the leader itself) are run in parallel.
> Maybe io_issue_sqe() could just start the rest of the group somewhere
> after calling def->issue() for the leader. Then you can't prepare the
> group buffer asynchronously, but I don't think this is needed, right?

That isn't true, if leader request is one network RX, we need to wait
until the recv is done, then the following member requests can be
started for consuming the received data.

Same example with the multiple copy one in last patch.

> 
> Your example with one read followed by multiple writes would then have
> to be written slightly differently: First the read outside of the group,
> linked to a group of writes. I honestly think this makes more sense as
> an interface, too, because then links are for sequential things and
> groups are (only) for parallel things. This feels clearer than having
> both a sequential and a parallel element in groups.

Group also implements 1:N dependency, in which N members depends on single
group leader, meantime there isn't any dependency among each members. That
is something the current io_uring is missing.

> 
> > > I suppose you could hack around the sequential nature of the first
> > > request by using an extra NOP as the group leader - which isn't any
> > > worse than having an IORING_OP_BUNDLE really, just looks a bit odd - but
> > > the group completion would still be missing. (Of course, removing the
> > > sequential first operation would mean that ublk wouldn't have the buffer
> > > ready any more when the other requests try to use it, so that would
> > > defeat the purpose of the series...)
> > > 
> > > I wonder if we can still combine both approaches and create some
> > > generally useful infrastructure and not something where it's visible
> > > that it was designed mostly for ublk's special case and other use cases
> > > just happened to be enabled as a side effect.
> > 
> > sqe group is actually one generic interface, please see the multiple
> > copy( copy one file to multiple destinations in single syscall for one
> > range) example in the last patch
> 
> Yes, that's an example that happens to work well with the model that you
> derived from ublk.

Not only for ublk and device zero copy, it also have the multiple copy example.

> 
> If you have the opposite case, reading a buffer that is spread across
> multiple files and then writing it to one target (i.e. first step
> parallel, second step sequential), you can't represent this well
> currently. You could work around it by having a NOP leader, but that's
> not very elegant.

Yeah, linking the group(nop & reads) with the following write does
work for the above copy case, :-)

> 
> This asymmetry suggests that it's not the perfect interface yet.

1:N dependency requires the asymmetry, an nothing in this world is perfect, :-)
But we can try to make it better.

> 
> If the whole group runs in parallel instead, including the leader, then
> both examples become symmetrical. You have a group for the parallel I/O
> and a linked single request for the other operation.
> 
> Or if both steps are parallel, you can just have two linked groups.

I think sqe group can be extended to this way easily by one new flag if
there is such real use case. We still can use leader's link flag for
same purpose, an only advance the linking chain until the whole group
is done. 

Then sqe group just degrades to single link group without 1:N dependency
covered and leader is just for providing group link flag, looks it can be
well defined and documented, and could be useful too, IMO.

> 
> > and it can support generic device zero copy: any device internal
> > buffer can be linked with io_uring operations in this way, which can't
> > be done by traditional splice/pipe.
> 
> Is this actually implemented or is it just a potential direction for the
> future?

It is potential direction since sqe group & provide buffer provides one
generic framework to export device internal buffer for further consuming
in zero copy & non-mmap way.


Thanks,
Ming
Kevin Wolf April 26, 2024, 5:05 p.m. UTC | #7
Am 26.04.2024 um 09:53 hat Ming Lei geschrieben:
> On Thu, Apr 25, 2024 at 11:27:32AM +0200, Kevin Wolf wrote:
> > Am 24.04.2024 um 03:39 hat Ming Lei geschrieben:
> > > On Tue, Apr 23, 2024 at 03:08:55PM +0200, Kevin Wolf wrote:
> > > > When I first read this patch, I wondered if it wouldn't make sense to
> > > > allow linking a group with subsequent requests, e.g. first having a few
> > > > requests that run in parallel and once all of them have completed
> > > > continue with the next linked one sequentially.
> > > > 
> > > > For SQE bundles, you reused the LINK flag, which doesn't easily allow
> > > > this. Ming's patch uses a new flag for groups, so the interface would be
> > > > more obvious, you simply set the LINK flag on the last member of the
> > > > group (or on the leader, doesn't really matter). Of course, this doesn't
> > > > mean it has to be implemented now, but there is a clear way forward if
> > > > it's wanted.
> > > 
> > > Reusing LINK for bundle breaks existed link chains (BUNDLE linked to
> > > existed link chain), so I think it may not work.
> > 
> > You can always extend things *somehow*, but it wouldn't fit very
> > naturally. That's why I feel your approach on this detail is a little
> > better.
> 
> Linking group in traditionally way is real use case, please see
> ublk-nbd's zero copy implementation.
> 
> https://github.com/ublk-org/ublksrv/blob/group-provide-buf/nbd/tgt_nbd.cpp

I'm not sure what you're trying to argue, I agreed with you twice?

I don't think Jens's bundles break existing users of links because the
special meaning is only triggered with IORING_OP_BUNDLE. But obviously
they don't allow linking a bundle with something else after it, which
feels more limiting than necessary.

> > > The link rule is explicit for sqe group:
> > > 
> > > - only group leader can set link flag, which is applied on the whole
> > > group: the next sqe in the link chain won't be started until the
> > > previous linked sqe group is completed
> > > 
> > > - link flag can't be set for group members
> > > 
> > > Also sqe group doesn't limit async for both group leader and member.
> > > 
> > > sqe group vs link & async is covered in the last liburing test code.
> > 
> > Oh right, I didn't actually notice that you already implement what I
> > proposed!
> > 
> > I was expecting the flag on the last SQE and I saw in the code that this
> > isn't allowed, but I completely missed your comment that explicitly
> > states that it's the group leader that gets the link flag. Of course,
> > this is just as good.
> > 
> > > > The part that looks a bit arbitrary in Ming's patch is that the group
> > > > leader is always completed before the rest starts. It makes perfect
> > > > sense in the context that this series is really after (enabling zero
> > > > copy for ublk), but it doesn't really allow the case you mention in the
> > > > SQE bundle commit message, running everything in parallel and getting a
> > > > single CQE for the whole group.
> > > 
> > > I think it should be easy to cover bundle in this way, such as add one
> > > new op IORING_OP_BUNDLE as Jens did, and implement the single CQE for
> > > whole group/bundle.
> > 
> > This requires an extra SQE compared to just creating the group with
> > flags, but I suppose this is not a big problem. An alternative might be
> > sending the CQE for the group leader only after the whole group has
> > completed if we're okay with userspace never knowing when the leader
> > itself completed.
> > 
> > However, assuming an IORING_OP_BUNDLE command, if this command only
> > completes after the whole group, doesn't that conflict with the
> 
> Here the completion means committing CQE to userspace ring.
> 
> > principle that all other commands are only started after the first one
> > has completed?
> 
> I meant IORING_OP_BUNDLE is the group leader, and the first one is the
> the leader.
> 
> The member requests won't be started until the leader is completed, and
> here the completion means that the request is completed from subsystem
> (FS, netowork, ...), so there isn't conflict, but yes, we need to
> describe the whole ideas/terms more carefully.

Is there precedence for requests that are completed, but don't result in
a CQE immediately? But yes, it's the same as I had in mind above when I
was talking about completing the leader only after the whole group has
completed.

> > Maybe we shouldn't wait for the whole group leader request to complete,
> > but just give the group leader a chance to prepare the group before all
> > requests in the group (including the leader itself) are run in parallel.
> > Maybe io_issue_sqe() could just start the rest of the group somewhere
> > after calling def->issue() for the leader. Then you can't prepare the
> > group buffer asynchronously, but I don't think this is needed, right?
> 
> That isn't true, if leader request is one network RX, we need to wait
> until the recv is done, then the following member requests can be
> started for consuming the received data.
> 
> Same example with the multiple copy one in last patch.

I don't see a group kernel buffer in the last patch at all? It seems to
use userspace buffers. In which case the read doesn't have to be part of
the group at all: You can have a read and link that with a group of
writes. Then you have the clear semantics of link = sequential,
group = parallel again.

But let's assume that the read actually did provide a group buffer.

What this example showed me is that grouping requests for parallel
submission is logically independent from grouping requests for sharing a
buffer. For full flexibility, they would probably have to be separate
concepts. You could then have the same setup as before (read linked to a
group of writes), but still share a group kernel buffer for the whole
sequence.

However, it's not clear if that the full flexibility is needed, and it
would probably complicate the implementation a bit.

> > Your example with one read followed by multiple writes would then have
> > to be written slightly differently: First the read outside of the group,
> > linked to a group of writes. I honestly think this makes more sense as
> > an interface, too, because then links are for sequential things and
> > groups are (only) for parallel things. This feels clearer than having
> > both a sequential and a parallel element in groups.
> 
> Group also implements 1:N dependency, in which N members depends on
> single group leader, meantime there isn't any dependency among each
> members. That is something the current io_uring is missing.

Dependencies are currently expressed with links, which is why I felt
that it would be good to use them in this case, too. Groups that only
include parallel requests and can be part of a link chain even provide
N:M dependencies, so are even more powerful than the fixed 1:N of your
groups.

The only thing that doesn't work as nicely then is sharing the buffer as
long as it's not treated as a separate concept.

> > > > I suppose you could hack around the sequential nature of the first
> > > > request by using an extra NOP as the group leader - which isn't any
> > > > worse than having an IORING_OP_BUNDLE really, just looks a bit odd - but
> > > > the group completion would still be missing. (Of course, removing the
> > > > sequential first operation would mean that ublk wouldn't have the buffer
> > > > ready any more when the other requests try to use it, so that would
> > > > defeat the purpose of the series...)
> > > > 
> > > > I wonder if we can still combine both approaches and create some
> > > > generally useful infrastructure and not something where it's visible
> > > > that it was designed mostly for ublk's special case and other use cases
> > > > just happened to be enabled as a side effect.
> > > 
> > > sqe group is actually one generic interface, please see the multiple
> > > copy( copy one file to multiple destinations in single syscall for one
> > > range) example in the last patch
> > 
> > Yes, that's an example that happens to work well with the model that you
> > derived from ublk.
> 
> Not only for ublk and device zero copy, it also have the multiple copy example.

This is what I replied to. Yes, it's an example where the model works
fine. This is not evidence that the model is as generic as it could be,
just that it's an example that fits it.

> > If you have the opposite case, reading a buffer that is spread across
> > multiple files and then writing it to one target (i.e. first step
> > parallel, second step sequential), you can't represent this well
> > currently. You could work around it by having a NOP leader, but that's
> > not very elegant.
> 
> Yeah, linking the group(nop & reads) with the following write does
> work for the above copy case, :-)
> 
> > 
> > This asymmetry suggests that it's not the perfect interface yet.
> 
> 1:N dependency requires the asymmetry, an nothing in this world is perfect, :-)
> But we can try to make it better.

The asymmetry doesn't contribute anything to the 1:N dependency. As
discussed above, normal links combined with fully parallel (and
therefore symmetrical) groups provide this functionality, too.

The only real reason I see for justifying it is the group kernel buffer.

> > If the whole group runs in parallel instead, including the leader, then
> > both examples become symmetrical. You have a group for the parallel I/O
> > and a linked single request for the other operation.
> > 
> > Or if both steps are parallel, you can just have two linked groups.
> 
> I think sqe group can be extended to this way easily by one new flag if
> there is such real use case. We still can use leader's link flag for
> same purpose, an only advance the linking chain until the whole group
> is done. 
> 
> Then sqe group just degrades to single link group without 1:N dependency
> covered and leader is just for providing group link flag, looks it can be
> well defined and documented, and could be useful too, IMO.

If you're willing to introduce a second flag, then I'd consider using
that flag to define buffer sharing groups independently of groups for
parallel execution.

I think it's usually preferable to build the semantics you need by
combining flags that provide independent building blocks with a
straightforward meaning than having a single complex building block and
then flags that modify the way the complex concept works.

> > > and it can support generic device zero copy: any device internal
> > > buffer can be linked with io_uring operations in this way, which can't
> > > be done by traditional splice/pipe.
> > 
> > Is this actually implemented or is it just a potential direction for the
> > future?
> 
> It is potential direction since sqe group & provide buffer provides one
> generic framework to export device internal buffer for further consuming
> in zero copy & non-mmap way.

I see. This contributes a bit to the impression that much of the design
is driven by ublk alone, because it's the only thing that seems to make
use of group buffers so far.

Anyway, I'm just throwing in a few thoughts and ideas from outside. In
the end, Jens and you need to agree on something.

Kevin
Ming Lei April 29, 2024, 3:34 a.m. UTC | #8
On Fri, Apr 26, 2024 at 07:05:17PM +0200, Kevin Wolf wrote:
> Am 26.04.2024 um 09:53 hat Ming Lei geschrieben:
> > On Thu, Apr 25, 2024 at 11:27:32AM +0200, Kevin Wolf wrote:
> > > Am 24.04.2024 um 03:39 hat Ming Lei geschrieben:
> > > > On Tue, Apr 23, 2024 at 03:08:55PM +0200, Kevin Wolf wrote:
> > > > > When I first read this patch, I wondered if it wouldn't make sense to
> > > > > allow linking a group with subsequent requests, e.g. first having a few
> > > > > requests that run in parallel and once all of them have completed
> > > > > continue with the next linked one sequentially.
> > > > > 
> > > > > For SQE bundles, you reused the LINK flag, which doesn't easily allow
> > > > > this. Ming's patch uses a new flag for groups, so the interface would be
> > > > > more obvious, you simply set the LINK flag on the last member of the
> > > > > group (or on the leader, doesn't really matter). Of course, this doesn't
> > > > > mean it has to be implemented now, but there is a clear way forward if
> > > > > it's wanted.
> > > > 
> > > > Reusing LINK for bundle breaks existed link chains (BUNDLE linked to
> > > > existed link chain), so I think it may not work.
> > > 
> > > You can always extend things *somehow*, but it wouldn't fit very
> > > naturally. That's why I feel your approach on this detail is a little
> > > better.
> > 
> > Linking group in traditionally way is real use case, please see
> > ublk-nbd's zero copy implementation.
> > 
> > https://github.com/ublk-org/ublksrv/blob/group-provide-buf/nbd/tgt_nbd.cpp
> 
> I'm not sure what you're trying to argue, I agreed with you twice?
> 
> I don't think Jens's bundles break existing users of links because the
> special meaning is only triggered with IORING_OP_BUNDLE. But obviously
> they don't allow linking a bundle with something else after it, which
> feels more limiting than necessary.

OK.

> 
> > > > The link rule is explicit for sqe group:
> > > > 
> > > > - only group leader can set link flag, which is applied on the whole
> > > > group: the next sqe in the link chain won't be started until the
> > > > previous linked sqe group is completed
> > > > 
> > > > - link flag can't be set for group members
> > > > 
> > > > Also sqe group doesn't limit async for both group leader and member.
> > > > 
> > > > sqe group vs link & async is covered in the last liburing test code.
> > > 
> > > Oh right, I didn't actually notice that you already implement what I
> > > proposed!
> > > 
> > > I was expecting the flag on the last SQE and I saw in the code that this
> > > isn't allowed, but I completely missed your comment that explicitly
> > > states that it's the group leader that gets the link flag. Of course,
> > > this is just as good.
> > > 
> > > > > The part that looks a bit arbitrary in Ming's patch is that the group
> > > > > leader is always completed before the rest starts. It makes perfect
> > > > > sense in the context that this series is really after (enabling zero
> > > > > copy for ublk), but it doesn't really allow the case you mention in the
> > > > > SQE bundle commit message, running everything in parallel and getting a
> > > > > single CQE for the whole group.
> > > > 
> > > > I think it should be easy to cover bundle in this way, such as add one
> > > > new op IORING_OP_BUNDLE as Jens did, and implement the single CQE for
> > > > whole group/bundle.
> > > 
> > > This requires an extra SQE compared to just creating the group with
> > > flags, but I suppose this is not a big problem. An alternative might be
> > > sending the CQE for the group leader only after the whole group has
> > > completed if we're okay with userspace never knowing when the leader
> > > itself completed.
> > > 
> > > However, assuming an IORING_OP_BUNDLE command, if this command only
> > > completes after the whole group, doesn't that conflict with the
> > 
> > Here the completion means committing CQE to userspace ring.
> > 
> > > principle that all other commands are only started after the first one
> > > has completed?
> > 
> > I meant IORING_OP_BUNDLE is the group leader, and the first one is the
> > the leader.
> > 
> > The member requests won't be started until the leader is completed, and
> > here the completion means that the request is completed from subsystem
> > (FS, netowork, ...), so there isn't conflict, but yes, we need to
> > describe the whole ideas/terms more carefully.
> 
> Is there precedence for requests that are completed, but don't result in
> a CQE immediately? But yes, it's the same as I had in mind above when I
> was talking about completing the leader only after the whole group has
> completed.
> 
> > > Maybe we shouldn't wait for the whole group leader request to complete,
> > > but just give the group leader a chance to prepare the group before all
> > > requests in the group (including the leader itself) are run in parallel.
> > > Maybe io_issue_sqe() could just start the rest of the group somewhere
> > > after calling def->issue() for the leader. Then you can't prepare the
> > > group buffer asynchronously, but I don't think this is needed, right?
> > 
> > That isn't true, if leader request is one network RX, we need to wait
> > until the recv is done, then the following member requests can be
> > started for consuming the received data.
> > 
> > Same example with the multiple copy one in last patch.
> 
> I don't see a group kernel buffer in the last patch at all? It seems to
> use userspace buffers. In which case the read doesn't have to be part of
> the group at all: You can have a read and link that with a group of
> writes. Then you have the clear semantics of link = sequential,
> group = parallel again.
> 
> But let's assume that the read actually did provide a group buffer.
> 
> What this example showed me is that grouping requests for parallel
> submission is logically independent from grouping requests for sharing a
> buffer. For full flexibility, they would probably have to be separate
> concepts. You could then have the same setup as before (read linked to a
> group of writes), but still share a group kernel buffer for the whole
> sequence.

kernel buffer lifetime is aligned with the whole group lifetime, that is why
the leader can't be moved out of group. Otherwise the two groups are
actually same, but still lots of things are in common, such as, how to handle
link for whole group, and the implementation should share most of
code, same with the group concept.

> 
> However, it's not clear if that the full flexibility is needed, and it
> would probably complicate the implementation a bit.
> 
> > > Your example with one read followed by multiple writes would then have
> > > to be written slightly differently: First the read outside of the group,
> > > linked to a group of writes. I honestly think this makes more sense as
> > > an interface, too, because then links are for sequential things and
> > > groups are (only) for parallel things. This feels clearer than having
> > > both a sequential and a parallel element in groups.
> > 
> > Group also implements 1:N dependency, in which N members depends on
> > single group leader, meantime there isn't any dependency among each
> > members. That is something the current io_uring is missing.
> 
> Dependencies are currently expressed with links, which is why I felt
> that it would be good to use them in this case, too. Groups that only
> include parallel requests and can be part of a link chain even provide
> N:M dependencies, so are even more powerful than the fixed 1:N of your
> groups.
> 
> The only thing that doesn't work as nicely then is sharing the buffer as
> long as it's not treated as a separate concept.

Right, kernel buffer lifetime is really one hard problem, aligning it
with group lifetime simplifies a lot for zero copy problem.

> 
> > > > > I suppose you could hack around the sequential nature of the first
> > > > > request by using an extra NOP as the group leader - which isn't any
> > > > > worse than having an IORING_OP_BUNDLE really, just looks a bit odd - but
> > > > > the group completion would still be missing. (Of course, removing the
> > > > > sequential first operation would mean that ublk wouldn't have the buffer
> > > > > ready any more when the other requests try to use it, so that would
> > > > > defeat the purpose of the series...)
> > > > > 
> > > > > I wonder if we can still combine both approaches and create some
> > > > > generally useful infrastructure and not something where it's visible
> > > > > that it was designed mostly for ublk's special case and other use cases
> > > > > just happened to be enabled as a side effect.
> > > > 
> > > > sqe group is actually one generic interface, please see the multiple
> > > > copy( copy one file to multiple destinations in single syscall for one
> > > > range) example in the last patch
> > > 
> > > Yes, that's an example that happens to work well with the model that you
> > > derived from ublk.
> > 
> > Not only for ublk and device zero copy, it also have the multiple copy example.
> 
> This is what I replied to. Yes, it's an example where the model works
> fine. This is not evidence that the model is as generic as it could be,
> just that it's an example that fits it.
> 
> > > If you have the opposite case, reading a buffer that is spread across
> > > multiple files and then writing it to one target (i.e. first step
> > > parallel, second step sequential), you can't represent this well
> > > currently. You could work around it by having a NOP leader, but that's
> > > not very elegant.
> > 
> > Yeah, linking the group(nop & reads) with the following write does
> > work for the above copy case, :-)
> > 
> > > 
> > > This asymmetry suggests that it's not the perfect interface yet.
> > 
> > 1:N dependency requires the asymmetry, an nothing in this world is perfect, :-)
> > But we can try to make it better.
> 
> The asymmetry doesn't contribute anything to the 1:N dependency. As
> discussed above, normal links combined with fully parallel (and
> therefore symmetrical) groups provide this functionality, too.
> 
> The only real reason I see for justifying it is the group kernel buffer.

Yes, but that is also exactly what the N depends on.

> 
> > > If the whole group runs in parallel instead, including the leader, then
> > > both examples become symmetrical. You have a group for the parallel I/O
> > > and a linked single request for the other operation.
> > > 
> > > Or if both steps are parallel, you can just have two linked groups.
> > 
> > I think sqe group can be extended to this way easily by one new flag if
> > there is such real use case. We still can use leader's link flag for
> > same purpose, an only advance the linking chain until the whole group
> > is done. 
> > 
> > Then sqe group just degrades to single link group without 1:N dependency
> > covered and leader is just for providing group link flag, looks it can be
> > well defined and documented, and could be useful too, IMO.
> 
> If you're willing to introduce a second flag, then I'd consider using
> that flag to define buffer sharing groups independently of groups for
> parallel execution.
> 
> I think it's usually preferable to build the semantics you need by
> combining flags that provide independent building blocks with a
> straightforward meaning than having a single complex building block and
> then flags that modify the way the complex concept works.

This way is fine.

> 
> > > > and it can support generic device zero copy: any device internal
> > > > buffer can be linked with io_uring operations in this way, which can't
> > > > be done by traditional splice/pipe.
> > > 
> > > Is this actually implemented or is it just a potential direction for the
> > > future?
> > 
> > It is potential direction since sqe group & provide buffer provides one
> > generic framework to export device internal buffer for further consuming
> > in zero copy & non-mmap way.
> 
> I see. This contributes a bit to the impression that much of the design
> is driven by ublk alone, because it's the only thing that seems to make
> use of group buffers so far.
> 
> Anyway, I'm just throwing in a few thoughts and ideas from outside. In
> the end, Jens and you need to agree on something.

Your thoughts and ideas are really helpful, thanks!


Thanks,
Ming
Pavel Begunkov April 29, 2024, 3:32 p.m. UTC | #9
On 4/23/24 14:08, Kevin Wolf wrote:
> Am 22.04.2024 um 20:27 hat Jens Axboe geschrieben:
>> On 4/7/24 7:03 PM, Ming Lei wrote:
>>> SQE group is defined as one chain of SQEs starting with the first sqe that
>>> has IOSQE_EXT_SQE_GROUP set, and ending with the first subsequent sqe that
>>> doesn't have it set, and it is similar with chain of linked sqes.
>>>
>>> The 1st SQE is group leader, and the other SQEs are group member. The group
>>> leader is always freed after all members are completed. Group members
>>> aren't submitted until the group leader is completed, and there isn't any
>>> dependency among group members, and IOSQE_IO_LINK can't be set for group
>>> members, same with IOSQE_IO_DRAIN.
>>>
>>> Typically the group leader provides or makes resource, and the other members
>>> consume the resource, such as scenario of multiple backup, the 1st SQE is to
>>> read data from source file into fixed buffer, the other SQEs write data from
>>> the same buffer into other destination files. SQE group provides very
>>> efficient way to complete this task: 1) fs write SQEs and fs read SQE can be
>>> submitted in single syscall, no need to submit fs read SQE first, and wait
>>> until read SQE is completed, 2) no need to link all write SQEs together, then
>>> write SQEs can be submitted to files concurrently. Meantime application is
>>> simplified a lot in this way.
>>>
>>> Another use case is to for supporting generic device zero copy:
>>>
>>> - the lead SQE is for providing device buffer, which is owned by device or
>>>    kernel, can't be cross userspace, otherwise easy to cause leak for devil
>>>    application or panic
>>>
>>> - member SQEs reads or writes concurrently against the buffer provided by lead
>>>    SQE
>>
>> In concept, this looks very similar to "sqe bundles" that I played with
>> in the past:
>>
>> https://git.kernel.dk/cgit/linux/log/?h=io_uring-bundle
>>
>> Didn't look too closely yet at the implementation, but in spirit it's
>> about the same in that the first entry is processed first, and there's
>> no ordering implied between the test of the members of the bundle /
>> group.
> 
> When I first read this patch, I wondered if it wouldn't make sense to
> allow linking a group with subsequent requests, e.g. first having a few
> requests that run in parallel and once all of them have completed
> continue with the next linked one sequentially.
> 
> For SQE bundles, you reused the LINK flag, which doesn't easily allow
> this. Ming's patch uses a new flag for groups, so the interface would be
> more obvious, you simply set the LINK flag on the last member of the
> group (or on the leader, doesn't really matter). Of course, this doesn't
> mean it has to be implemented now, but there is a clear way forward if
> it's wanted.

Putting zc aside, links, graphs, groups, it all sounds interesting in
concept but let's not fool anyone, all the different ordering
relationships between requests proved to be a bad idea.

I can complaint for long, error handling is miserable, user handling
resubmitting a part of a link is horrible, the concept of errors is
hard coded (time to appreciate "beautifulness" of IOSQE_IO_HARDLINK
and the MSG_WAITALL workaround). The handling and workarounds are
leaking into generic paths, e.g. we can't init files when it's the most
convenient. For cancellation we're walking links, which need more care
than just looking at a request (is cancellation by user_data of a
"linked" to a group request even supported?). The list goes on

And what does it achieve? The infra has matured since early days,
it saves user-kernel transitions at best but not context switching
overhead, and not even that if you do wait(1) and happen to catch
middle CQEs. And it disables LAZY_WAKE, so CQ side batching with
timers and what not is effectively useless with links.

So, please, please! instead of trying to invent a new uber scheme
of request linking, which surely wouldn't step on same problems
over and over again, and would definitely be destined to overshadow
all previous attempts and finally conquer the world, let's rather
focus on minimasing the damage from this patchset's zero copy if
it's going to be taken.

Piggy backing bits on top of links should be just fine. May help
to save space in io_kiocb by unionising with links. And half
of the overhead (including completely destroying all inlining
in the submission patch) can be mitigated by folding it into
REQ_F_LINK handling and generally borowing the code structure
for it in the submission path.


> The part that looks a bit arbitrary in Ming's patch is that the group
> leader is always completed before the rest starts. It makes perfect
> sense in the context that this series is really after (enabling zero
> copy for ublk), but it doesn't really allow the case you mention in the
> SQE bundle commit message, running everything in parallel and getting a
> single CQE for the whole group.
> 
> I suppose you could hack around the sequential nature of the first
> request by using an extra NOP as the group leader - which isn't any
> worse than having an IORING_OP_BUNDLE really, just looks a bit odd - but
> the group completion would still be missing. (Of course, removing the
> sequential first operation would mean that ublk wouldn't have the buffer
> ready any more when the other requests try to use it, so that would
> defeat the purpose of the series...)
> 
> I wonder if we can still combine both approaches and create some
> generally useful infrastructure and not something where it's visible
> that it was designed mostly for ublk's special case and other use cases
> just happened to be enabled as a side effect.
Pavel Begunkov April 29, 2024, 3:48 p.m. UTC | #10
On 4/24/24 02:39, Ming Lei wrote:
> On Tue, Apr 23, 2024 at 03:08:55PM +0200, Kevin Wolf wrote:
>> Am 22.04.2024 um 20:27 hat Jens Axboe geschrieben:
>>> On 4/7/24 7:03 PM, Ming Lei wrote:
>>>> SQE group is defined as one chain of SQEs starting with the first sqe that
>>>> has IOSQE_EXT_SQE_GROUP set, and ending with the first subsequent sqe that
>>>> doesn't have it set, and it is similar with chain of linked sqes.
>>>>
>>>> The 1st SQE is group leader, and the other SQEs are group member. The group
>>>> leader is always freed after all members are completed. Group members
>>>> aren't submitted until the group leader is completed, and there isn't any
>>>> dependency among group members, and IOSQE_IO_LINK can't be set for group
>>>> members, same with IOSQE_IO_DRAIN.
>>>>
>>>> Typically the group leader provides or makes resource, and the other members
>>>> consume the resource, such as scenario of multiple backup, the 1st SQE is to
>>>> read data from source file into fixed buffer, the other SQEs write data from
>>>> the same buffer into other destination files. SQE group provides very
>>>> efficient way to complete this task: 1) fs write SQEs and fs read SQE can be
>>>> submitted in single syscall, no need to submit fs read SQE first, and wait
>>>> until read SQE is completed, 2) no need to link all write SQEs together, then
>>>> write SQEs can be submitted to files concurrently. Meantime application is
>>>> simplified a lot in this way.
>>>>
>>>> Another use case is to for supporting generic device zero copy:
>>>>
>>>> - the lead SQE is for providing device buffer, which is owned by device or
>>>>    kernel, can't be cross userspace, otherwise easy to cause leak for devil
>>>>    application or panic
>>>>
>>>> - member SQEs reads or writes concurrently against the buffer provided by lead
>>>>    SQE
>>>
>>> In concept, this looks very similar to "sqe bundles" that I played with
>>> in the past:
>>>
>>> https://git.kernel.dk/cgit/linux/log/?h=io_uring-bundle
>>>
>>> Didn't look too closely yet at the implementation, but in spirit it's
>>> about the same in that the first entry is processed first, and there's
>>> no ordering implied between the test of the members of the bundle /
>>> group.
>>
>> When I first read this patch, I wondered if it wouldn't make sense to
>> allow linking a group with subsequent requests, e.g. first having a few
>> requests that run in parallel and once all of them have completed
>> continue with the next linked one sequentially.
>>
>> For SQE bundles, you reused the LINK flag, which doesn't easily allow
>> this. Ming's patch uses a new flag for groups, so the interface would be
>> more obvious, you simply set the LINK flag on the last member of the
>> group (or on the leader, doesn't really matter). Of course, this doesn't
>> mean it has to be implemented now, but there is a clear way forward if
>> it's wanted.
> 
> Reusing LINK for bundle breaks existed link chains(BUNDLE linked to existed
> link chain), so I think it may not work.
> 
> The link rule is explicit for sqe group:
> 
> - only group leader can set link flag, which is applied on the whole
> group: the next sqe in the link chain won't be started until the
> previous linked sqe group is completed
> 
> - link flag can't be set for group members
> 
> Also sqe group doesn't limit async for both group leader and member.
> 
> sqe group vs link & async is covered in the last liburing test code.
> 
>>
>> The part that looks a bit arbitrary in Ming's patch is that the group
>> leader is always completed before the rest starts. It makes perfect
>> sense in the context that this series is really after (enabling zero
>> copy for ublk), but it doesn't really allow the case you mention in the
>> SQE bundle commit message, running everything in parallel and getting a
>> single CQE for the whole group.
> 
> I think it should be easy to cover bundle in this way, such as add one new
> op IORING_OP_BUNDLE as Jens did, and implement the single CQE for whole group/bundle.
> 
>>
>> I suppose you could hack around the sequential nature of the first
>> request by using an extra NOP as the group leader - which isn't any
>> worse than having an IORING_OP_BUNDLE really, just looks a bit odd - but
>> the group completion would still be missing. (Of course, removing the
>> sequential first operation would mean that ublk wouldn't have the buffer
>> ready any more when the other requests try to use it, so that would
>> defeat the purpose of the series...)
>>
>> I wonder if we can still combine both approaches and create some
>> generally useful infrastructure and not something where it's visible
>> that it was designed mostly for ublk's special case and other use cases
>> just happened to be enabled as a side effect.
> 
> sqe group is actually one generic interface, please see the multiple copy(
> copy one file to multiple destinations in single syscall for one range) example
> in the last patch, and it can support generic device zero copy: any device internal
> buffer can be linked with io_uring operations in this way, which can't
> be done by traditional splice/pipe.
> 
> I guess it can be used in network Rx zero copy too, but may depend on actual
> network Rx use case.

I doubt. With storage same data can be read twice. Socket recv consumes
data. Locking a buffer over the duration of another IO doesn't really sound
plausible, same we returning a buffer back. It'd be different if you can
read the buffer into the userspace if something goes wrong, but perhaps
you remember the fused discussion.
Ming Lei April 30, 2024, 3:03 a.m. UTC | #11
On Mon, Apr 29, 2024 at 04:32:35PM +0100, Pavel Begunkov wrote:
> On 4/23/24 14:08, Kevin Wolf wrote:
> > Am 22.04.2024 um 20:27 hat Jens Axboe geschrieben:
> > > On 4/7/24 7:03 PM, Ming Lei wrote:
> > > > SQE group is defined as one chain of SQEs starting with the first sqe that
> > > > has IOSQE_EXT_SQE_GROUP set, and ending with the first subsequent sqe that
> > > > doesn't have it set, and it is similar with chain of linked sqes.
> > > > 
> > > > The 1st SQE is group leader, and the other SQEs are group member. The group
> > > > leader is always freed after all members are completed. Group members
> > > > aren't submitted until the group leader is completed, and there isn't any
> > > > dependency among group members, and IOSQE_IO_LINK can't be set for group
> > > > members, same with IOSQE_IO_DRAIN.
> > > > 
> > > > Typically the group leader provides or makes resource, and the other members
> > > > consume the resource, such as scenario of multiple backup, the 1st SQE is to
> > > > read data from source file into fixed buffer, the other SQEs write data from
> > > > the same buffer into other destination files. SQE group provides very
> > > > efficient way to complete this task: 1) fs write SQEs and fs read SQE can be
> > > > submitted in single syscall, no need to submit fs read SQE first, and wait
> > > > until read SQE is completed, 2) no need to link all write SQEs together, then
> > > > write SQEs can be submitted to files concurrently. Meantime application is
> > > > simplified a lot in this way.
> > > > 
> > > > Another use case is to for supporting generic device zero copy:
> > > > 
> > > > - the lead SQE is for providing device buffer, which is owned by device or
> > > >    kernel, can't be cross userspace, otherwise easy to cause leak for devil
> > > >    application or panic
> > > > 
> > > > - member SQEs reads or writes concurrently against the buffer provided by lead
> > > >    SQE
> > > 
> > > In concept, this looks very similar to "sqe bundles" that I played with
> > > in the past:
> > > 
> > > https://git.kernel.dk/cgit/linux/log/?h=io_uring-bundle
> > > 
> > > Didn't look too closely yet at the implementation, but in spirit it's
> > > about the same in that the first entry is processed first, and there's
> > > no ordering implied between the test of the members of the bundle /
> > > group.
> > 
> > When I first read this patch, I wondered if it wouldn't make sense to
> > allow linking a group with subsequent requests, e.g. first having a few
> > requests that run in parallel and once all of them have completed
> > continue with the next linked one sequentially.
> > 
> > For SQE bundles, you reused the LINK flag, which doesn't easily allow
> > this. Ming's patch uses a new flag for groups, so the interface would be
> > more obvious, you simply set the LINK flag on the last member of the
> > group (or on the leader, doesn't really matter). Of course, this doesn't
> > mean it has to be implemented now, but there is a clear way forward if
> > it's wanted.
> 
> Putting zc aside, links, graphs, groups, it all sounds interesting in
> concept but let's not fool anyone, all the different ordering
> relationships between requests proved to be a bad idea.

As Jens mentioned, sqe group is very similar with bundle:

https://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux-block.git/log/?h=io_uring-bundle

which is really something io_uring is missing.

> 
> I can complaint for long, error handling is miserable, user handling
> resubmitting a part of a link is horrible, the concept of errors is
> hard coded (time to appreciate "beautifulness" of IOSQE_IO_HARDLINK
> and the MSG_WAITALL workaround). The handling and workarounds are
> leaking into generic paths, e.g. we can't init files when it's the most
> convenient. For cancellation we're walking links, which need more care
> than just looking at a request (is cancellation by user_data of a
> "linked" to a group request even supported?). The list goes on

Only the group leader is linked, if the group leader is canceled, all
requests in the whole group will be canceled.

But yes, cancelling by user_data for group members can't be supported,
and it can be documented clearly, since user still can cancel the whole
group with group leader's user_data.

> 
> And what does it achieve? The infra has matured since early days,
> it saves user-kernel transitions at best but not context switching
> overhead, and not even that if you do wait(1) and happen to catch
> middle CQEs. And it disables LAZY_WAKE, so CQ side batching with
> timers and what not is effectively useless with links.

Not only the context switch, it supports 1:N or N:M dependency which
is missing in io_uring, but also makes async application easier to write by
saving extra context switches, which just adds extra intermediate states for
application.

> 
> So, please, please! instead of trying to invent a new uber scheme
> of request linking, which surely wouldn't step on same problems
> over and over again, and would definitely be destined to overshadow
> all previous attempts and finally conquer the world, let's rather
> focus on minimasing the damage from this patchset's zero copy if
> it's going to be taken.

One key problem for zero copy is lifetime of the kernel buffer, which
can't cross OPs, that is why sqe group is introduced, for aligning
kernel buffer lifetime with the group.


Thanks, 
Ming
Ming Lei April 30, 2024, 3:07 a.m. UTC | #12
On Mon, Apr 29, 2024 at 04:48:37PM +0100, Pavel Begunkov wrote:
> On 4/24/24 02:39, Ming Lei wrote:
> > On Tue, Apr 23, 2024 at 03:08:55PM +0200, Kevin Wolf wrote:
> > > Am 22.04.2024 um 20:27 hat Jens Axboe geschrieben:
> > > > On 4/7/24 7:03 PM, Ming Lei wrote:
> > > > > SQE group is defined as one chain of SQEs starting with the first sqe that
> > > > > has IOSQE_EXT_SQE_GROUP set, and ending with the first subsequent sqe that
> > > > > doesn't have it set, and it is similar with chain of linked sqes.
> > > > > 
> > > > > The 1st SQE is group leader, and the other SQEs are group member. The group
> > > > > leader is always freed after all members are completed. Group members
> > > > > aren't submitted until the group leader is completed, and there isn't any
> > > > > dependency among group members, and IOSQE_IO_LINK can't be set for group
> > > > > members, same with IOSQE_IO_DRAIN.
> > > > > 
> > > > > Typically the group leader provides or makes resource, and the other members
> > > > > consume the resource, such as scenario of multiple backup, the 1st SQE is to
> > > > > read data from source file into fixed buffer, the other SQEs write data from
> > > > > the same buffer into other destination files. SQE group provides very
> > > > > efficient way to complete this task: 1) fs write SQEs and fs read SQE can be
> > > > > submitted in single syscall, no need to submit fs read SQE first, and wait
> > > > > until read SQE is completed, 2) no need to link all write SQEs together, then
> > > > > write SQEs can be submitted to files concurrently. Meantime application is
> > > > > simplified a lot in this way.
> > > > > 
> > > > > Another use case is to for supporting generic device zero copy:
> > > > > 
> > > > > - the lead SQE is for providing device buffer, which is owned by device or
> > > > >    kernel, can't be cross userspace, otherwise easy to cause leak for devil
> > > > >    application or panic
> > > > > 
> > > > > - member SQEs reads or writes concurrently against the buffer provided by lead
> > > > >    SQE
> > > > 
> > > > In concept, this looks very similar to "sqe bundles" that I played with
> > > > in the past:
> > > > 
> > > > https://git.kernel.dk/cgit/linux/log/?h=io_uring-bundle
> > > > 
> > > > Didn't look too closely yet at the implementation, but in spirit it's
> > > > about the same in that the first entry is processed first, and there's
> > > > no ordering implied between the test of the members of the bundle /
> > > > group.
> > > 
> > > When I first read this patch, I wondered if it wouldn't make sense to
> > > allow linking a group with subsequent requests, e.g. first having a few
> > > requests that run in parallel and once all of them have completed
> > > continue with the next linked one sequentially.
> > > 
> > > For SQE bundles, you reused the LINK flag, which doesn't easily allow
> > > this. Ming's patch uses a new flag for groups, so the interface would be
> > > more obvious, you simply set the LINK flag on the last member of the
> > > group (or on the leader, doesn't really matter). Of course, this doesn't
> > > mean it has to be implemented now, but there is a clear way forward if
> > > it's wanted.
> > 
> > Reusing LINK for bundle breaks existed link chains(BUNDLE linked to existed
> > link chain), so I think it may not work.
> > 
> > The link rule is explicit for sqe group:
> > 
> > - only group leader can set link flag, which is applied on the whole
> > group: the next sqe in the link chain won't be started until the
> > previous linked sqe group is completed
> > 
> > - link flag can't be set for group members
> > 
> > Also sqe group doesn't limit async for both group leader and member.
> > 
> > sqe group vs link & async is covered in the last liburing test code.
> > 
> > > 
> > > The part that looks a bit arbitrary in Ming's patch is that the group
> > > leader is always completed before the rest starts. It makes perfect
> > > sense in the context that this series is really after (enabling zero
> > > copy for ublk), but it doesn't really allow the case you mention in the
> > > SQE bundle commit message, running everything in parallel and getting a
> > > single CQE for the whole group.
> > 
> > I think it should be easy to cover bundle in this way, such as add one new
> > op IORING_OP_BUNDLE as Jens did, and implement the single CQE for whole group/bundle.
> > 
> > > 
> > > I suppose you could hack around the sequential nature of the first
> > > request by using an extra NOP as the group leader - which isn't any
> > > worse than having an IORING_OP_BUNDLE really, just looks a bit odd - but
> > > the group completion would still be missing. (Of course, removing the
> > > sequential first operation would mean that ublk wouldn't have the buffer
> > > ready any more when the other requests try to use it, so that would
> > > defeat the purpose of the series...)
> > > 
> > > I wonder if we can still combine both approaches and create some
> > > generally useful infrastructure and not something where it's visible
> > > that it was designed mostly for ublk's special case and other use cases
> > > just happened to be enabled as a side effect.
> > 
> > sqe group is actually one generic interface, please see the multiple copy(
> > copy one file to multiple destinations in single syscall for one range) example
> > in the last patch, and it can support generic device zero copy: any device internal
> > buffer can be linked with io_uring operations in this way, which can't
> > be done by traditional splice/pipe.
> > 
> > I guess it can be used in network Rx zero copy too, but may depend on actual
> > network Rx use case.
> 
> I doubt. With storage same data can be read twice. Socket recv consumes

No, we don't depend on read twice of storage.

> data. Locking a buffer over the duration of another IO doesn't really sound
> plausible, same we returning a buffer back. It'd be different if you can
> read the buffer into the userspace if something goes wrong, but perhaps
> you remember the fused discussion.

As I mentioned, it depends on actual use case. If the received data can
be consumed by following io_uring OPs, such as be sent to network or
FS, this way works just fine, but if data needs to live longer or
further processing, the data can still be saved to userpsace by io_uring
OPs linked to the sqe group.


Thanks,
Ming
Pavel Begunkov April 30, 2024, 12:27 p.m. UTC | #13
On 4/30/24 04:03, Ming Lei wrote:
> On Mon, Apr 29, 2024 at 04:32:35PM +0100, Pavel Begunkov wrote:
>> On 4/23/24 14:08, Kevin Wolf wrote:
>>> Am 22.04.2024 um 20:27 hat Jens Axboe geschrieben:
>>>> On 4/7/24 7:03 PM, Ming Lei wrote:
>>>>> SQE group is defined as one chain of SQEs starting with the first sqe that
>>>>> has IOSQE_EXT_SQE_GROUP set, and ending with the first subsequent sqe that
>>>>> doesn't have it set, and it is similar with chain of linked sqes.
>>>>>
>>>>> The 1st SQE is group leader, and the other SQEs are group member. The group
>>>>> leader is always freed after all members are completed. Group members
>>>>> aren't submitted until the group leader is completed, and there isn't any
>>>>> dependency among group members, and IOSQE_IO_LINK can't be set for group
>>>>> members, same with IOSQE_IO_DRAIN.
>>>>>
>>>>> Typically the group leader provides or makes resource, and the other members
>>>>> consume the resource, such as scenario of multiple backup, the 1st SQE is to
>>>>> read data from source file into fixed buffer, the other SQEs write data from
>>>>> the same buffer into other destination files. SQE group provides very
>>>>> efficient way to complete this task: 1) fs write SQEs and fs read SQE can be
>>>>> submitted in single syscall, no need to submit fs read SQE first, and wait
>>>>> until read SQE is completed, 2) no need to link all write SQEs together, then
>>>>> write SQEs can be submitted to files concurrently. Meantime application is
>>>>> simplified a lot in this way.
>>>>>
>>>>> Another use case is to for supporting generic device zero copy:
>>>>>
>>>>> - the lead SQE is for providing device buffer, which is owned by device or
>>>>>     kernel, can't be cross userspace, otherwise easy to cause leak for devil
>>>>>     application or panic
>>>>>
>>>>> - member SQEs reads or writes concurrently against the buffer provided by lead
>>>>>     SQE
>>>>
>>>> In concept, this looks very similar to "sqe bundles" that I played with
>>>> in the past:
>>>>
>>>> https://git.kernel.dk/cgit/linux/log/?h=io_uring-bundle
>>>>
>>>> Didn't look too closely yet at the implementation, but in spirit it's
>>>> about the same in that the first entry is processed first, and there's
>>>> no ordering implied between the test of the members of the bundle /
>>>> group.
>>>
>>> When I first read this patch, I wondered if it wouldn't make sense to
>>> allow linking a group with subsequent requests, e.g. first having a few
>>> requests that run in parallel and once all of them have completed
>>> continue with the next linked one sequentially.
>>>
>>> For SQE bundles, you reused the LINK flag, which doesn't easily allow
>>> this. Ming's patch uses a new flag for groups, so the interface would be
>>> more obvious, you simply set the LINK flag on the last member of the
>>> group (or on the leader, doesn't really matter). Of course, this doesn't
>>> mean it has to be implemented now, but there is a clear way forward if
>>> it's wanted.
>>
>> Putting zc aside, links, graphs, groups, it all sounds interesting in
>> concept but let's not fool anyone, all the different ordering
>> relationships between requests proved to be a bad idea.
> 
> As Jens mentioned, sqe group is very similar with bundle:
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux-block.git/log/?h=io_uring-bundle
> 
> which is really something io_uring is missing.

One could've said same about links, retrospectively I argue that it
was a mistake, so I pretty much doubt arguments like "io_uring is
missing it". Another thing is that zero copy, which is not possible
to implement by returning to the userspace.

>> I can complaint for long, error handling is miserable, user handling
>> resubmitting a part of a link is horrible, the concept of errors is
>> hard coded (time to appreciate "beautifulness" of IOSQE_IO_HARDLINK
>> and the MSG_WAITALL workaround). The handling and workarounds are
>> leaking into generic paths, e.g. we can't init files when it's the most
>> convenient. For cancellation we're walking links, which need more care
>> than just looking at a request (is cancellation by user_data of a
>> "linked" to a group request even supported?). The list goes on
> 
> Only the group leader is linked, if the group leader is canceled, all
> requests in the whole group will be canceled.
> 
> But yes, cancelling by user_data for group members can't be supported,
> and it can be documented clearly, since user still can cancel the whole
> group with group leader's user_data.

Which means it'd break the case REQ_F_INFLIGHT covers, and you need
to disallow linking REQ_F_INFLIGHT marked requests.

>> And what does it achieve? The infra has matured since early days,
>> it saves user-kernel transitions at best but not context switching
>> overhead, and not even that if you do wait(1) and happen to catch
>> middle CQEs. And it disables LAZY_WAKE, so CQ side batching with
>> timers and what not is effectively useless with links.
> 
> Not only the context switch, it supports 1:N or N:M dependency which

I completely missed, how N:M is supported? That starting to sound
terrifying.

> is missing in io_uring, but also makes async application easier to write by
> saving extra context switches, which just adds extra intermediate states for
> application.

You're still executing requests (i.e. ->issue) primarily from the
submitter task context, they would still fly back to the task and
wake it up. You may save something by completing all of them
together via that refcounting, but you might just as well try to
batch CQ, which is a more generic issue. It's not clear what
context switches you save then.

As for simplicity, using the link example and considering error
handling, it only complicates it. In case of an error you need to
figure out a middle req failed, collect all failed CQEs linked to
it and automatically cancelled (unless SKIP_COMPLETE is used), and
then resubmit the failed. That's great your reads are idempotent
and presumably you don't have to resubmit half a link, but in the
grand picture of things it's rather one of use cases where a generic
feature can be used.

>> So, please, please! instead of trying to invent a new uber scheme
>> of request linking, which surely wouldn't step on same problems
>> over and over again, and would definitely be destined to overshadow
>> all previous attempts and finally conquer the world, let's rather
>> focus on minimasing the damage from this patchset's zero copy if
>> it's going to be taken.
> 
> One key problem for zero copy is lifetime of the kernel buffer, which
> can't cross OPs, that is why sqe group is introduced, for aligning
> kernel buffer lifetime with the group.

Right, which is why I'm saying if we're leaving groups with zero
copy, let's rather try to make them simple and not intrusive as
much as possible, instead of creating an unsupportable overarching
beast out of it, which would fail as a generic feature.
Ming Lei April 30, 2024, 3 p.m. UTC | #14
On Tue, Apr 30, 2024 at 01:27:10PM +0100, Pavel Begunkov wrote:
> On 4/30/24 04:03, Ming Lei wrote:
> > On Mon, Apr 29, 2024 at 04:32:35PM +0100, Pavel Begunkov wrote:
> > > On 4/23/24 14:08, Kevin Wolf wrote:
> > > > Am 22.04.2024 um 20:27 hat Jens Axboe geschrieben:
> > > > > On 4/7/24 7:03 PM, Ming Lei wrote:
> > > > > > SQE group is defined as one chain of SQEs starting with the first sqe that
> > > > > > has IOSQE_EXT_SQE_GROUP set, and ending with the first subsequent sqe that
> > > > > > doesn't have it set, and it is similar with chain of linked sqes.
> > > > > > 
> > > > > > The 1st SQE is group leader, and the other SQEs are group member. The group
> > > > > > leader is always freed after all members are completed. Group members
> > > > > > aren't submitted until the group leader is completed, and there isn't any
> > > > > > dependency among group members, and IOSQE_IO_LINK can't be set for group
> > > > > > members, same with IOSQE_IO_DRAIN.
> > > > > > 
> > > > > > Typically the group leader provides or makes resource, and the other members
> > > > > > consume the resource, such as scenario of multiple backup, the 1st SQE is to
> > > > > > read data from source file into fixed buffer, the other SQEs write data from
> > > > > > the same buffer into other destination files. SQE group provides very
> > > > > > efficient way to complete this task: 1) fs write SQEs and fs read SQE can be
> > > > > > submitted in single syscall, no need to submit fs read SQE first, and wait
> > > > > > until read SQE is completed, 2) no need to link all write SQEs together, then
> > > > > > write SQEs can be submitted to files concurrently. Meantime application is
> > > > > > simplified a lot in this way.
> > > > > > 
> > > > > > Another use case is to for supporting generic device zero copy:
> > > > > > 
> > > > > > - the lead SQE is for providing device buffer, which is owned by device or
> > > > > >     kernel, can't be cross userspace, otherwise easy to cause leak for devil
> > > > > >     application or panic
> > > > > > 
> > > > > > - member SQEs reads or writes concurrently against the buffer provided by lead
> > > > > >     SQE
> > > > > 
> > > > > In concept, this looks very similar to "sqe bundles" that I played with
> > > > > in the past:
> > > > > 
> > > > > https://git.kernel.dk/cgit/linux/log/?h=io_uring-bundle
> > > > > 
> > > > > Didn't look too closely yet at the implementation, but in spirit it's
> > > > > about the same in that the first entry is processed first, and there's
> > > > > no ordering implied between the test of the members of the bundle /
> > > > > group.
> > > > 
> > > > When I first read this patch, I wondered if it wouldn't make sense to
> > > > allow linking a group with subsequent requests, e.g. first having a few
> > > > requests that run in parallel and once all of them have completed
> > > > continue with the next linked one sequentially.
> > > > 
> > > > For SQE bundles, you reused the LINK flag, which doesn't easily allow
> > > > this. Ming's patch uses a new flag for groups, so the interface would be
> > > > more obvious, you simply set the LINK flag on the last member of the
> > > > group (or on the leader, doesn't really matter). Of course, this doesn't
> > > > mean it has to be implemented now, but there is a clear way forward if
> > > > it's wanted.
> > > 
> > > Putting zc aside, links, graphs, groups, it all sounds interesting in
> > > concept but let's not fool anyone, all the different ordering
> > > relationships between requests proved to be a bad idea.
> > 
> > As Jens mentioned, sqe group is very similar with bundle:
> > 
> > https://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux-block.git/log/?h=io_uring-bundle
> > 
> > which is really something io_uring is missing.
> 
> One could've said same about links, retrospectively I argue that it
> was a mistake, so I pretty much doubt arguments like "io_uring is
> missing it". Another thing is that zero copy, which is not possible
> to implement by returning to the userspace.
> 
> > > I can complaint for long, error handling is miserable, user handling
> > > resubmitting a part of a link is horrible, the concept of errors is
> > > hard coded (time to appreciate "beautifulness" of IOSQE_IO_HARDLINK
> > > and the MSG_WAITALL workaround). The handling and workarounds are
> > > leaking into generic paths, e.g. we can't init files when it's the most
> > > convenient. For cancellation we're walking links, which need more care
> > > than just looking at a request (is cancellation by user_data of a
> > > "linked" to a group request even supported?). The list goes on
> > 
> > Only the group leader is linked, if the group leader is canceled, all
> > requests in the whole group will be canceled.
> > 
> > But yes, cancelling by user_data for group members can't be supported,
> > and it can be documented clearly, since user still can cancel the whole
> > group with group leader's user_data.
> 
> Which means it'd break the case REQ_F_INFLIGHT covers, and you need
> to disallow linking REQ_F_INFLIGHT marked requests.

Both io_match_linked() and io_match_task() only iterates over req's
link chain, and only the group leader can appear in this link chain,
which is exactly the usual handling.

So care to explain it a bit what the real link issue is about sqe group?

> 
> > > And what does it achieve? The infra has matured since early days,
> > > it saves user-kernel transitions at best but not context switching
> > > overhead, and not even that if you do wait(1) and happen to catch
> > > middle CQEs. And it disables LAZY_WAKE, so CQ side batching with
> > > timers and what not is effectively useless with links.
> > 
> > Not only the context switch, it supports 1:N or N:M dependency which
> 
> I completely missed, how N:M is supported? That starting to sound
> terrifying.

N:M is actually from Kevin's idea.

sqe group can be made to be more flexible by:

    Inside the group, all SQEs are submitted in parallel, so there isn't any
    dependency among SQEs in one group.
    
    The 1st SQE is group leader, and the other SQEs are group member. The whole
    group share single IOSQE_IO_LINK and IOSQE_IO_DRAIN from group leader, and
    the two flags can't be set for group members.
    
    When the group is in one link chain, this group isn't submitted until
    the previous SQE or group is completed. And the following SQE or group
    can't be started if this group isn't completed.
    
    When IOSQE_IO_DRAIN is set for group leader, all requests in this group
    and previous requests submitted are drained. Given IOSQE_IO_DRAIN can
    be set for group leader only, we respect IO_DRAIN for SQE group by
    always completing group leader as the last on in the group.
    
    SQE group provides flexible way to support N:M dependency, such as:
    
    - group A is chained with group B together by IOSQE_IO_LINK
    - group A has N SQEs
    - group B has M SQEs
    
    then M SQEs in group B depend on N SQEs in group A.


> 
> > is missing in io_uring, but also makes async application easier to write by
> > saving extra context switches, which just adds extra intermediate states for
> > application.
> 
> You're still executing requests (i.e. ->issue) primarily from the
> submitter task context, they would still fly back to the task and
> wake it up. You may save something by completing all of them
> together via that refcounting, but you might just as well try to
> batch CQ, which is a more generic issue. It's not clear what
> context switches you save then.

Wrt. the above N:M example, one io_uring_enter() is enough, and
it can't be done in single context switch without sqe group, please
see the liburing test code:

https://lore.kernel.org/io-uring/ZiHA+pN28hRdprhX@fedora/T/#ma755c500eab0b7dc8c1473448dd98f093097e066

> 
> As for simplicity, using the link example and considering error
> handling, it only complicates it. In case of an error you need to
> figure out a middle req failed, collect all failed CQEs linked to
> it and automatically cancelled (unless SKIP_COMPLETE is used), and
> then resubmit the failed. That's great your reads are idempotent
> and presumably you don't have to resubmit half a link, but in the
> grand picture of things it's rather one of use cases where a generic
> feature can be used.

SQE group doesn't change the current link implementation, and N:M
dependency is built over IOSQE_IO_LINK actually.

> 
> > > So, please, please! instead of trying to invent a new uber scheme
> > > of request linking, which surely wouldn't step on same problems
> > > over and over again, and would definitely be destined to overshadow
> > > all previous attempts and finally conquer the world, let's rather
> > > focus on minimasing the damage from this patchset's zero copy if
> > > it's going to be taken.
> > 
> > One key problem for zero copy is lifetime of the kernel buffer, which
> > can't cross OPs, that is why sqe group is introduced, for aligning
> > kernel buffer lifetime with the group.
> 
> Right, which is why I'm saying if we're leaving groups with zero
> copy, let's rather try to make them simple and not intrusive as
> much as possible, instead of creating an unsupportable overarching
> beast out of it, which would fail as a generic feature.

Then it degraded to the original fused command, :-)

Thanks,
Ming
Pavel Begunkov May 2, 2024, 2:09 p.m. UTC | #15
On 4/30/24 16:00, Ming Lei wrote:
> On Tue, Apr 30, 2024 at 01:27:10PM +0100, Pavel Begunkov wrote:
>> On 4/30/24 04:03, Ming Lei wrote:
>>> On Mon, Apr 29, 2024 at 04:32:35PM +0100, Pavel Begunkov wrote:
>>>> On 4/23/24 14:08, Kevin Wolf wrote:
>>>>> Am 22.04.2024 um 20:27 hat Jens Axboe geschrieben:
>>>>>> On 4/7/24 7:03 PM, Ming Lei wrote:
>>>>>>> SQE group is defined as one chain of SQEs starting with the first sqe that
>>>>>>> has IOSQE_EXT_SQE_GROUP set, and ending with the first subsequent sqe that
>>>>>>> doesn't have it set, and it is similar with chain of linked sqes.
>>>>>>>
>>>>>>> The 1st SQE is group leader, and the other SQEs are group member. The group
>>>>>>> leader is always freed after all members are completed. Group members
>>>>>>> aren't submitted until the group leader is completed, and there isn't any
>>>>>>> dependency among group members, and IOSQE_IO_LINK can't be set for group
>>>>>>> members, same with IOSQE_IO_DRAIN.
>>>>>>>
>>>>>>> Typically the group leader provides or makes resource, and the other members
>>>>>>> consume the resource, such as scenario of multiple backup, the 1st SQE is to
>>>>>>> read data from source file into fixed buffer, the other SQEs write data from
>>>>>>> the same buffer into other destination files. SQE group provides very
>>>>>>> efficient way to complete this task: 1) fs write SQEs and fs read SQE can be
>>>>>>> submitted in single syscall, no need to submit fs read SQE first, and wait
>>>>>>> until read SQE is completed, 2) no need to link all write SQEs together, then
>>>>>>> write SQEs can be submitted to files concurrently. Meantime application is
>>>>>>> simplified a lot in this way.
>>>>>>>
>>>>>>> Another use case is to for supporting generic device zero copy:
>>>>>>>
>>>>>>> - the lead SQE is for providing device buffer, which is owned by device or
>>>>>>>      kernel, can't be cross userspace, otherwise easy to cause leak for devil
>>>>>>>      application or panic
>>>>>>>
>>>>>>> - member SQEs reads or writes concurrently against the buffer provided by lead
>>>>>>>      SQE
>>>>>>
>>>>>> In concept, this looks very similar to "sqe bundles" that I played with
>>>>>> in the past:
>>>>>>
>>>>>> https://git.kernel.dk/cgit/linux/log/?h=io_uring-bundle
>>>>>>
>>>>>> Didn't look too closely yet at the implementation, but in spirit it's
>>>>>> about the same in that the first entry is processed first, and there's
>>>>>> no ordering implied between the test of the members of the bundle /
>>>>>> group.
>>>>>
>>>>> When I first read this patch, I wondered if it wouldn't make sense to
>>>>> allow linking a group with subsequent requests, e.g. first having a few
>>>>> requests that run in parallel and once all of them have completed
>>>>> continue with the next linked one sequentially.
>>>>>
>>>>> For SQE bundles, you reused the LINK flag, which doesn't easily allow
>>>>> this. Ming's patch uses a new flag for groups, so the interface would be
>>>>> more obvious, you simply set the LINK flag on the last member of the
>>>>> group (or on the leader, doesn't really matter). Of course, this doesn't
>>>>> mean it has to be implemented now, but there is a clear way forward if
>>>>> it's wanted.
>>>>
>>>> Putting zc aside, links, graphs, groups, it all sounds interesting in
>>>> concept but let's not fool anyone, all the different ordering
>>>> relationships between requests proved to be a bad idea.
>>>
>>> As Jens mentioned, sqe group is very similar with bundle:
>>>
>>> https://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux-block.git/log/?h=io_uring-bundle
>>>
>>> which is really something io_uring is missing.
>>
>> One could've said same about links, retrospectively I argue that it
>> was a mistake, so I pretty much doubt arguments like "io_uring is
>> missing it". Another thing is that zero copy, which is not possible
>> to implement by returning to the userspace.
>>
>>>> I can complaint for long, error handling is miserable, user handling
>>>> resubmitting a part of a link is horrible, the concept of errors is
>>>> hard coded (time to appreciate "beautifulness" of IOSQE_IO_HARDLINK
>>>> and the MSG_WAITALL workaround). The handling and workarounds are
>>>> leaking into generic paths, e.g. we can't init files when it's the most
>>>> convenient. For cancellation we're walking links, which need more care
>>>> than just looking at a request (is cancellation by user_data of a
>>>> "linked" to a group request even supported?). The list goes on
>>>
>>> Only the group leader is linked, if the group leader is canceled, all
>>> requests in the whole group will be canceled.
>>>
>>> But yes, cancelling by user_data for group members can't be supported,
>>> and it can be documented clearly, since user still can cancel the whole
>>> group with group leader's user_data.
>>
>> Which means it'd break the case REQ_F_INFLIGHT covers, and you need
>> to disallow linking REQ_F_INFLIGHT marked requests.
> 
> Both io_match_linked() and io_match_task() only iterates over req's
> link chain, and only the group leader can appear in this link chain,
> which is exactly the usual handling.
> 
> So care to explain it a bit what the real link issue is about sqe group?

Because of ref deps when a task exits it has to cancel REQ_F_INFLIGHT
requests, and therefore they should be discoverable. The flag is only
for POLL_ADD requests polling io_uring fds, should be good enough if
you disallow such requests from grouping.

>>>> And what does it achieve? The infra has matured since early days,
>>>> it saves user-kernel transitions at best but not context switching
>>>> overhead, and not even that if you do wait(1) and happen to catch
>>>> middle CQEs. And it disables LAZY_WAKE, so CQ side batching with
>>>> timers and what not is effectively useless with links.
>>>
>>> Not only the context switch, it supports 1:N or N:M dependency which
>>
>> I completely missed, how N:M is supported? That starting to sound
>> terrifying.
> 
> N:M is actually from Kevin's idea.
> 
> sqe group can be made to be more flexible by:
> 
>      Inside the group, all SQEs are submitted in parallel, so there isn't any
>      dependency among SQEs in one group.
>      
>      The 1st SQE is group leader, and the other SQEs are group member. The whole
>      group share single IOSQE_IO_LINK and IOSQE_IO_DRAIN from group leader, and
>      the two flags can't be set for group members.
>      
>      When the group is in one link chain, this group isn't submitted until
>      the previous SQE or group is completed. And the following SQE or group
>      can't be started if this group isn't completed.
>      
>      When IOSQE_IO_DRAIN is set for group leader, all requests in this group
>      and previous requests submitted are drained. Given IOSQE_IO_DRAIN can
>      be set for group leader only, we respect IO_DRAIN for SQE group by
>      always completing group leader as the last on in the group.
>      
>      SQE group provides flexible way to support N:M dependency, such as:
>      
>      - group A is chained with group B together by IOSQE_IO_LINK
>      - group A has N SQEs
>      - group B has M SQEs
>      
>      then M SQEs in group B depend on N SQEs in group A.

In other words, linking groups together with basically no extra rules.
Fwiw, sounds generic, but if there are complications with IOSQE_IO_DRAIN
that I don't immediately see, it'd be more reasonable to just disable it.

>>> is missing in io_uring, but also makes async application easier to write by
>>> saving extra context switches, which just adds extra intermediate states for
>>> application.
>>
>> You're still executing requests (i.e. ->issue) primarily from the
>> submitter task context, they would still fly back to the task and
>> wake it up. You may save something by completing all of them
>> together via that refcounting, but you might just as well try to
>> batch CQ, which is a more generic issue. It's not clear what
>> context switches you save then.
> 
> Wrt. the above N:M example, one io_uring_enter() is enough, and
> it can't be done in single context switch without sqe group, please
> see the liburing test code:
> 
> https://lore.kernel.org/io-uring/ZiHA+pN28hRdprhX@fedora/T/#ma755c500eab0b7dc8c1473448dd98f093097e066
> 
>>
>> As for simplicity, using the link example and considering error
>> handling, it only complicates it. In case of an error you need to
>> figure out a middle req failed, collect all failed CQEs linked to
>> it and automatically cancelled (unless SKIP_COMPLETE is used), and
>> then resubmit the failed. That's great your reads are idempotent
>> and presumably you don't have to resubmit half a link, but in the
>> grand picture of things it's rather one of use cases where a generic
>> feature can be used.
> 
> SQE group doesn't change the current link implementation, and N:M
> dependency is built over IOSQE_IO_LINK actually.
> 
>>
>>>> So, please, please! instead of trying to invent a new uber scheme
>>>> of request linking, which surely wouldn't step on same problems
>>>> over and over again, and would definitely be destined to overshadow
>>>> all previous attempts and finally conquer the world, let's rather
>>>> focus on minimasing the damage from this patchset's zero copy if
>>>> it's going to be taken.
>>>
>>> One key problem for zero copy is lifetime of the kernel buffer, which
>>> can't cross OPs, that is why sqe group is introduced, for aligning
>>> kernel buffer lifetime with the group.
>>
>> Right, which is why I'm saying if we're leaving groups with zero
>> copy, let's rather try to make them simple and not intrusive as
>> much as possible, instead of creating an unsupportable overarching
>> beast out of it, which would fail as a generic feature.
> 
> Then it degraded to the original fused command, :-)

I'm not arguing about restricting it to 1 request in a group apart
from the master/leader/etc., if that's what you mean. The argument
is rather to limit the overhead and abstraction leakage into hot
paths.

For example that N:M, all that sounds great on paper until it sees
the harsh reality. And instead of linking groups together, you
can perfectly fine be sending one group at a time and queuing the
next group from the userspace when the previous completes. And
that would save space in io_kiocb, maybe(?) a flag, maybe something
else.

Regardless of interoperability with links, I'd also prefer it to
be folded into the link code structure and state machine, so
it's not all over in the submission/completion paths adding
overhead for one narrow feature, including un-inlining the
entire submission path.

E.g. there should be no separate hand coded duplicated SQE
/ request assembling like in io_init_req_group(). Instead
it should be able to consume SQEs and requests it's given
from submit_sqes(), and e.g. process grouping in
io_submit_sqe() around the place requests are linked.
Pavel Begunkov May 2, 2024, 2:28 p.m. UTC | #16
On 4/30/24 16:00, Ming Lei wrote:
> On Tue, Apr 30, 2024 at 01:27:10PM +0100, Pavel Begunkov wrote:
...
>>>> And what does it achieve? The infra has matured since early days,
>>>> it saves user-kernel transitions at best but not context switching
>>>> overhead, and not even that if you do wait(1) and happen to catch
>>>> middle CQEs. And it disables LAZY_WAKE, so CQ side batching with
>>>> timers and what not is effectively useless with links.
>>>
>>> Not only the context switch, it supports 1:N or N:M dependency which
>>
>> I completely missed, how N:M is supported? That starting to sound
>> terrifying.
> 
> N:M is actually from Kevin's idea.
> 
> sqe group can be made to be more flexible by:
> 
>      Inside the group, all SQEs are submitted in parallel, so there isn't any
>      dependency among SQEs in one group.
>      
>      The 1st SQE is group leader, and the other SQEs are group member. The whole
>      group share single IOSQE_IO_LINK and IOSQE_IO_DRAIN from group leader, and
>      the two flags can't be set for group members.
>      
>      When the group is in one link chain, this group isn't submitted until
>      the previous SQE or group is completed. And the following SQE or group
>      can't be started if this group isn't completed.
>      
>      When IOSQE_IO_DRAIN is set for group leader, all requests in this group
>      and previous requests submitted are drained. Given IOSQE_IO_DRAIN can
>      be set for group leader only, we respect IO_DRAIN for SQE group by
>      always completing group leader as the last on in the group.
>      
>      SQE group provides flexible way to support N:M dependency, such as:
>      
>      - group A is chained with group B together by IOSQE_IO_LINK
>      - group A has N SQEs
>      - group B has M SQEs
>      
>      then M SQEs in group B depend on N SQEs in group A.
> 
> 
>>
>>> is missing in io_uring, but also makes async application easier to write by
>>> saving extra context switches, which just adds extra intermediate states for
>>> application.
>>
>> You're still executing requests (i.e. ->issue) primarily from the
>> submitter task context, they would still fly back to the task and
>> wake it up. You may save something by completing all of them
>> together via that refcounting, but you might just as well try to
>> batch CQ, which is a more generic issue. It's not clear what
>> context switches you save then.
> 
> Wrt. the above N:M example, one io_uring_enter() is enough, and
> it can't be done in single context switch without sqe group, please
> see the liburing test code:

Do you mean doing all that in a single system call? The main
performance problem for io_uring is waiting, i.e. schedule()ing
the task out and in, that's what I meant by context switching.
Ming Lei May 4, 2024, 1:56 a.m. UTC | #17
On Thu, May 02, 2024 at 03:09:13PM +0100, Pavel Begunkov wrote:
> On 4/30/24 16:00, Ming Lei wrote:
> > On Tue, Apr 30, 2024 at 01:27:10PM +0100, Pavel Begunkov wrote:
> > > On 4/30/24 04:03, Ming Lei wrote:
> > > > On Mon, Apr 29, 2024 at 04:32:35PM +0100, Pavel Begunkov wrote:
> > > > > On 4/23/24 14:08, Kevin Wolf wrote:
> > > > > > Am 22.04.2024 um 20:27 hat Jens Axboe geschrieben:
> > > > > > > On 4/7/24 7:03 PM, Ming Lei wrote:
> > > > > > > > SQE group is defined as one chain of SQEs starting with the first sqe that
> > > > > > > > has IOSQE_EXT_SQE_GROUP set, and ending with the first subsequent sqe that
> > > > > > > > doesn't have it set, and it is similar with chain of linked sqes.
> > > > > > > > 
> > > > > > > > The 1st SQE is group leader, and the other SQEs are group member. The group
> > > > > > > > leader is always freed after all members are completed. Group members
> > > > > > > > aren't submitted until the group leader is completed, and there isn't any
> > > > > > > > dependency among group members, and IOSQE_IO_LINK can't be set for group
> > > > > > > > members, same with IOSQE_IO_DRAIN.
> > > > > > > > 
> > > > > > > > Typically the group leader provides or makes resource, and the other members
> > > > > > > > consume the resource, such as scenario of multiple backup, the 1st SQE is to
> > > > > > > > read data from source file into fixed buffer, the other SQEs write data from
> > > > > > > > the same buffer into other destination files. SQE group provides very
> > > > > > > > efficient way to complete this task: 1) fs write SQEs and fs read SQE can be
> > > > > > > > submitted in single syscall, no need to submit fs read SQE first, and wait
> > > > > > > > until read SQE is completed, 2) no need to link all write SQEs together, then
> > > > > > > > write SQEs can be submitted to files concurrently. Meantime application is
> > > > > > > > simplified a lot in this way.
> > > > > > > > 
> > > > > > > > Another use case is to for supporting generic device zero copy:
> > > > > > > > 
> > > > > > > > - the lead SQE is for providing device buffer, which is owned by device or
> > > > > > > >      kernel, can't be cross userspace, otherwise easy to cause leak for devil
> > > > > > > >      application or panic
> > > > > > > > 
> > > > > > > > - member SQEs reads or writes concurrently against the buffer provided by lead
> > > > > > > >      SQE
> > > > > > > 
> > > > > > > In concept, this looks very similar to "sqe bundles" that I played with
> > > > > > > in the past:
> > > > > > > 
> > > > > > > https://git.kernel.dk/cgit/linux/log/?h=io_uring-bundle
> > > > > > > 
> > > > > > > Didn't look too closely yet at the implementation, but in spirit it's
> > > > > > > about the same in that the first entry is processed first, and there's
> > > > > > > no ordering implied between the test of the members of the bundle /
> > > > > > > group.
> > > > > > 
> > > > > > When I first read this patch, I wondered if it wouldn't make sense to
> > > > > > allow linking a group with subsequent requests, e.g. first having a few
> > > > > > requests that run in parallel and once all of them have completed
> > > > > > continue with the next linked one sequentially.
> > > > > > 
> > > > > > For SQE bundles, you reused the LINK flag, which doesn't easily allow
> > > > > > this. Ming's patch uses a new flag for groups, so the interface would be
> > > > > > more obvious, you simply set the LINK flag on the last member of the
> > > > > > group (or on the leader, doesn't really matter). Of course, this doesn't
> > > > > > mean it has to be implemented now, but there is a clear way forward if
> > > > > > it's wanted.
> > > > > 
> > > > > Putting zc aside, links, graphs, groups, it all sounds interesting in
> > > > > concept but let's not fool anyone, all the different ordering
> > > > > relationships between requests proved to be a bad idea.
> > > > 
> > > > As Jens mentioned, sqe group is very similar with bundle:
> > > > 
> > > > https://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux-block.git/log/?h=io_uring-bundle
> > > > 
> > > > which is really something io_uring is missing.
> > > 
> > > One could've said same about links, retrospectively I argue that it
> > > was a mistake, so I pretty much doubt arguments like "io_uring is
> > > missing it". Another thing is that zero copy, which is not possible
> > > to implement by returning to the userspace.
> > > 
> > > > > I can complaint for long, error handling is miserable, user handling
> > > > > resubmitting a part of a link is horrible, the concept of errors is
> > > > > hard coded (time to appreciate "beautifulness" of IOSQE_IO_HARDLINK
> > > > > and the MSG_WAITALL workaround). The handling and workarounds are
> > > > > leaking into generic paths, e.g. we can't init files when it's the most
> > > > > convenient. For cancellation we're walking links, which need more care
> > > > > than just looking at a request (is cancellation by user_data of a
> > > > > "linked" to a group request even supported?). The list goes on
> > > > 
> > > > Only the group leader is linked, if the group leader is canceled, all
> > > > requests in the whole group will be canceled.
> > > > 
> > > > But yes, cancelling by user_data for group members can't be supported,
> > > > and it can be documented clearly, since user still can cancel the whole
> > > > group with group leader's user_data.
> > > 
> > > Which means it'd break the case REQ_F_INFLIGHT covers, and you need
> > > to disallow linking REQ_F_INFLIGHT marked requests.
> > 
> > Both io_match_linked() and io_match_task() only iterates over req's
> > link chain, and only the group leader can appear in this link chain,
> > which is exactly the usual handling.
> > 
> > So care to explain it a bit what the real link issue is about sqe group?
> 
> Because of ref deps when a task exits it has to cancel REQ_F_INFLIGHT
> requests, and therefore they should be discoverable. The flag is only
> for POLL_ADD requests polling io_uring fds, should be good enough if
> you disallow such requests from grouping.

The group leader is always marked as REQ_F_INFLIGHT and discoverable,
and the cancel code path cancels group leader request which will
guarantees that all group members are canceled, so I think this change
isn't needed.

But io_get_sequence() needs to take it into account, such as:

@@ -1680,8 +1846,12 @@ static u32 io_get_sequence(struct io_kiocb *req)
        struct io_kiocb *cur;

        /* need original cached_sq_head, but it was increased for each req */
-       io_for_each_link(cur, req)
-               seq--;
+       io_for_each_link(cur, req) {
+               if (req_is_group_lead(cur))
+                       seq -= atomic_read(&cur->grp_refs);
+               else
+                       seq--;
+       }

> 
> > > > > And what does it achieve? The infra has matured since early days,
> > > > > it saves user-kernel transitions at best but not context switching
> > > > > overhead, and not even that if you do wait(1) and happen to catch
> > > > > middle CQEs. And it disables LAZY_WAKE, so CQ side batching with
> > > > > timers and what not is effectively useless with links.
> > > > 
> > > > Not only the context switch, it supports 1:N or N:M dependency which
> > > 
> > > I completely missed, how N:M is supported? That starting to sound
> > > terrifying.
> > 
> > N:M is actually from Kevin's idea.
> > 
> > sqe group can be made to be more flexible by:
> > 
> >      Inside the group, all SQEs are submitted in parallel, so there isn't any
> >      dependency among SQEs in one group.
> >      The 1st SQE is group leader, and the other SQEs are group member. The whole
> >      group share single IOSQE_IO_LINK and IOSQE_IO_DRAIN from group leader, and
> >      the two flags can't be set for group members.
> >      When the group is in one link chain, this group isn't submitted until
> >      the previous SQE or group is completed. And the following SQE or group
> >      can't be started if this group isn't completed.
> >      When IOSQE_IO_DRAIN is set for group leader, all requests in this group
> >      and previous requests submitted are drained. Given IOSQE_IO_DRAIN can
> >      be set for group leader only, we respect IO_DRAIN for SQE group by
> >      always completing group leader as the last on in the group.
> >      SQE group provides flexible way to support N:M dependency, such as:
> >      - group A is chained with group B together by IOSQE_IO_LINK
> >      - group A has N SQEs
> >      - group B has M SQEs
> >      then M SQEs in group B depend on N SQEs in group A.
> 
> In other words, linking groups together with basically no extra rules.
> Fwiw, sounds generic, but if there are complications with IOSQE_IO_DRAIN
> that I don't immediately see, it'd be more reasonable to just disable it.

The only change on IO_DRAIN is on io_get_sequence() for taking leader->grp_refs
into account, and leader->grp_refs covers all requests in this group.

And my local patchset passes all related sqe->flags combination(DRAIN, LINKING,
ASYNC) on both single group or linked groups, meantime with extra change
of sharing same FORCE_ASYNC for both leader and members.

> 
> > > > is missing in io_uring, but also makes async application easier to write by
> > > > saving extra context switches, which just adds extra intermediate states for
> > > > application.
> > > 
> > > You're still executing requests (i.e. ->issue) primarily from the
> > > submitter task context, they would still fly back to the task and
> > > wake it up. You may save something by completing all of them
> > > together via that refcounting, but you might just as well try to
> > > batch CQ, which is a more generic issue. It's not clear what
> > > context switches you save then.
> > 
> > Wrt. the above N:M example, one io_uring_enter() is enough, and
> > it can't be done in single context switch without sqe group, please
> > see the liburing test code:
> > 
> > https://lore.kernel.org/io-uring/ZiHA+pN28hRdprhX@fedora/T/#ma755c500eab0b7dc8c1473448dd98f093097e066
> > 
> > > 
> > > As for simplicity, using the link example and considering error
> > > handling, it only complicates it. In case of an error you need to
> > > figure out a middle req failed, collect all failed CQEs linked to
> > > it and automatically cancelled (unless SKIP_COMPLETE is used), and
> > > then resubmit the failed. That's great your reads are idempotent
> > > and presumably you don't have to resubmit half a link, but in the
> > > grand picture of things it's rather one of use cases where a generic
> > > feature can be used.
> > 
> > SQE group doesn't change the current link implementation, and N:M
> > dependency is built over IOSQE_IO_LINK actually.
> > 
> > > 
> > > > > So, please, please! instead of trying to invent a new uber scheme
> > > > > of request linking, which surely wouldn't step on same problems
> > > > > over and over again, and would definitely be destined to overshadow
> > > > > all previous attempts and finally conquer the world, let's rather
> > > > > focus on minimasing the damage from this patchset's zero copy if
> > > > > it's going to be taken.
> > > > 
> > > > One key problem for zero copy is lifetime of the kernel buffer, which
> > > > can't cross OPs, that is why sqe group is introduced, for aligning
> > > > kernel buffer lifetime with the group.
> > > 
> > > Right, which is why I'm saying if we're leaving groups with zero
> > > copy, let's rather try to make them simple and not intrusive as
> > > much as possible, instead of creating an unsupportable overarching
> > > beast out of it, which would fail as a generic feature.
> > 
> > Then it degraded to the original fused command, :-)
> 
> I'm not arguing about restricting it to 1 request in a group apart
> from the master/leader/etc., if that's what you mean. The argument
> is rather to limit the overhead and abstraction leakage into hot
> paths.
> 
> For example that N:M, all that sounds great on paper until it sees
> the harsh reality. And instead of linking groups together, you
> can perfectly fine be sending one group at a time and queuing the
> next group from the userspace when the previous completes. And
> that would save space in io_kiocb, maybe(?) a flag, maybe something
> else.

Yes, it can be done with userspace, with cost of one extra io_uring_enter()
for handling two depended groups in userspace.

io_uring excellent performance is attributed to great batch processing,
if group linking is done in per-IO level, this way could degrade
performance a lot. And we know io_uring may perform worse in QD=1.

And the introduced 12 bytes doesn't add extra l1 cacheline, and won't
be touched in code path without using group.

> 
> Regardless of interoperability with links, I'd also prefer it to
> be folded into the link code structure and state machine, so
> it's not all over in the submission/completion paths adding
> overhead for one narrow feature, including un-inlining the
> entire submission path.

I can work toward this direction if the idea of sqe group may be accepted
but the biggest blocker could be where to allocate the extra SQE_GROUP
flag.

> 
> E.g. there should be no separate hand coded duplicated SQE
> / request assembling like in io_init_req_group(). Instead
> it should be able to consume SQEs and requests it's given
> from submit_sqes(), and e.g. process grouping in
> io_submit_sqe() around the place requests are linked.

Yeah, IOSQE_IO_LINK sort of handling could be needed for processing
requests in group.


Thanks,
Ming
diff mbox series

Patch

diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
index 67347e5d06ec..7ce4a2d4a8b8 100644
--- a/include/linux/io_uring_types.h
+++ b/include/linux/io_uring_types.h
@@ -444,6 +444,7 @@  enum {
 	REQ_F_BUFFER_SELECT_BIT	= IOSQE_BUFFER_SELECT_BIT,
 	REQ_F_CQE_SKIP_BIT	= IOSQE_CQE_SKIP_SUCCESS_BIT,
 	REQ_F_SQE_EXT_FLAGS_BIT	= IOSQE_HAS_EXT_FLAGS_BIT,
+	REQ_F_SQE_GROUP_BIT	= 8 + IOSQE_EXT_SQE_GROUP_BIT,
 
 	/* first 2 bytes are taken by user flags, shift it to not overlap */
 	REQ_F_FAIL_BIT		= 16,
@@ -474,6 +475,7 @@  enum {
 	REQ_F_CAN_POLL_BIT,
 	REQ_F_BL_EMPTY_BIT,
 	REQ_F_BL_NO_RECYCLE_BIT,
+	REQ_F_SQE_GROUP_LEAD_BIT,
 
 	/* not a real bit, just to check we're not overflowing the space */
 	__REQ_F_LAST_BIT,
@@ -497,6 +499,8 @@  enum {
 	REQ_F_BUFFER_SELECT	= IO_REQ_FLAG(REQ_F_BUFFER_SELECT_BIT),
 	/* IOSQE_CQE_SKIP_SUCCESS */
 	REQ_F_CQE_SKIP		= IO_REQ_FLAG(REQ_F_CQE_SKIP_BIT),
+	/* IOSQE_EXT_SQE_GROUP */
+	REQ_F_SQE_GROUP		= IO_REQ_FLAG(REQ_F_SQE_GROUP_BIT),
 
 	/* fail rest of links */
 	REQ_F_FAIL		= IO_REQ_FLAG(REQ_F_FAIL_BIT),
@@ -552,6 +556,8 @@  enum {
 	REQ_F_BL_EMPTY		= IO_REQ_FLAG(REQ_F_BL_EMPTY_BIT),
 	/* don't recycle provided buffers for this request */
 	REQ_F_BL_NO_RECYCLE	= IO_REQ_FLAG(REQ_F_BL_NO_RECYCLE_BIT),
+	/* sqe group lead */
+	REQ_F_SQE_GROUP_LEAD	= IO_REQ_FLAG(REQ_F_SQE_GROUP_LEAD_BIT),
 };
 
 typedef void (*io_req_tw_func_t)(struct io_kiocb *req, struct io_tw_state *ts);
@@ -665,6 +671,10 @@  struct io_kiocb {
 		u64			extra1;
 		u64			extra2;
 	} big_cqe;
+
+	/* all SQE group members linked here for group lead */
+	struct io_kiocb			*grp_link;
+	atomic_t			grp_refs;
 };
 
 struct io_overflow_cqe {
diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index 4847d7cf1ac9..c0d34f2a2c17 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -128,6 +128,8 @@  enum io_uring_sqe_flags_bit {
 	IOSQE_BUFFER_SELECT_BIT,
 	IOSQE_CQE_SKIP_SUCCESS_BIT,
 	IOSQE_HAS_EXT_FLAGS_BIT,
+
+	IOSQE_EXT_SQE_GROUP_BIT = 0,
 };
 
 /*
@@ -152,6 +154,8 @@  enum io_uring_sqe_flags_bit {
  * sqe->uring_cmd_flags for IORING_URING_CMD.
  */
 #define IOSQE_HAS_EXT_FLAGS	(1U << IOSQE_HAS_EXT_FLAGS_BIT)
+/* defines sqe group */
+#define IOSQE_EXT_SQE_GROUP	(1U << IOSQE_EXT_SQE_GROUP_BIT)
 
 /*
  * io_uring_setup() flags
diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index 4969d21ea2f8..0f41b26723a8 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -147,6 +147,10 @@  static bool io_uring_try_cancel_requests(struct io_ring_ctx *ctx,
 					 bool cancel_all);
 
 static void io_queue_sqe(struct io_kiocb *req);
+static bool io_get_sqe(struct io_ring_ctx *ctx,
+		const struct io_uring_sqe **sqe);
+static int io_init_req(struct io_ring_ctx *ctx, struct io_kiocb *req,
+		       const struct io_uring_sqe *sqe);
 
 struct kmem_cache *req_cachep;
 static struct workqueue_struct *iou_wq __ro_after_init;
@@ -925,10 +929,189 @@  bool io_req_post_cqe(struct io_kiocb *req, s32 res, u32 cflags)
 	return posted;
 }
 
+static void __io_req_failed(struct io_kiocb *req, s32 res, bool skip_cqe)
+{
+	const struct io_cold_def *def = &io_cold_defs[req->opcode];
+
+	lockdep_assert_held(&req->ctx->uring_lock);
+
+	if (skip_cqe)
+		req->flags |= REQ_F_FAIL | REQ_F_CQE_SKIP;
+	else
+		req_set_fail(req);
+	io_req_set_res(req, res, io_put_kbuf(req, IO_URING_F_UNLOCKED));
+	if (def->fail)
+		def->fail(req);
+}
+
+void io_req_defer_failed(struct io_kiocb *req, s32 res)
+	__must_hold(&ctx->uring_lock)
+{
+	__io_req_failed(req, res, false);
+	io_req_complete_defer(req);
+}
+
+static void io_req_defer_failed_sliently(struct io_kiocb *req, s32 res)
+	__must_hold(&ctx->uring_lock)
+{
+	__io_req_failed(req, res, true);
+	io_req_complete_defer(req);
+}
+
+/*
+ * Called after member req is completed, and return the lead request for
+ * caller to fill cqe & free it really
+ */
+static inline struct io_kiocb *io_complete_group_member(struct io_kiocb *req)
+{
+	struct io_kiocb *lead = req->grp_link;
+
+	req->grp_link = NULL;
+	if (lead && atomic_dec_and_test(&lead->grp_refs)) {
+		req->grp_link = NULL;
+		lead->flags &= ~REQ_F_SQE_GROUP_LEAD;
+		return lead;
+	}
+
+	return NULL;
+}
+
+/*
+ * Called after lead req is completed and before posting cqe and freeing,
+ * for issuing member requests
+ */
+void io_complete_group_lead(struct io_kiocb *req, unsigned issue_flags)
+{
+	struct io_kiocb *member = req->grp_link;
+
+	if (!member)
+		return;
+
+	while (member) {
+		struct io_kiocb *next = member->grp_link;
+
+		member->grp_link = req;
+		if (unlikely(req->flags & REQ_F_FAIL)) {
+			/*
+			 * Now group lead is failed, so simply fail members
+			 * with -EIO, and the application can figure out
+			 * the reason from lead's cqe->res
+			 */
+			__io_req_failed(member, -EIO, false);
+
+			if (issue_flags & IO_URING_F_COMPLETE_DEFER)
+				io_req_complete_defer(member);
+			else {
+				member->io_task_work.func = io_req_task_complete;
+				io_req_task_work_add(member);
+			}
+		} else {
+			trace_io_uring_submit_req(member);
+			if ((issue_flags & IO_URING_F_COMPLETE_DEFER) &&
+					!(member->flags & REQ_F_FORCE_ASYNC))
+				io_queue_sqe(member);
+			else
+				io_req_task_queue(member);
+		}
+		member = next;
+	}
+	req->grp_link = NULL;
+}
+
+static bool sqe_is_group_member(const struct io_uring_sqe *sqe)
+{
+	return (READ_ONCE(sqe->flags) & IOSQE_HAS_EXT_FLAGS) &&
+		(READ_ONCE(sqe->ext_flags) & IOSQE_EXT_SQE_GROUP);
+}
+
+/*
+ * Initialize the whole SQE group.
+ *
+ * Walk every member in this group even though failure happens. If the lead
+ * request is failed, CQE is only posted for lead, otherwise, CQE is posted
+ * for every member request. Member requests aren't issued until the lead
+ * is completed, and the lead request won't be freed until all member
+ * requests are completed.
+ *
+ * The whole group shares the link flag in group lead, and other members
+ * aren't allowed to set any LINK flag. And only the lead request may
+ * appear in the submission link list.
+ */
+static int io_init_req_group(struct io_ring_ctx *ctx, struct io_kiocb *lead,
+		int *nr, int lead_ret)
+	__must_hold(&ctx->uring_lock)
+{
+	bool more = true;
+	int cnt = 0;
+
+	lead->grp_link = NULL;
+	do {
+		const struct io_uring_sqe *sqe;
+		struct io_kiocb *req = NULL;
+		int ret = -ENOMEM;
+
+		io_alloc_req(ctx, &req);
+
+		if (unlikely(!io_get_sqe(ctx, &sqe))) {
+			if (req)
+				io_req_add_to_cache(req, ctx);
+			break;
+		}
+
+		/* one group ends with !IOSQE_EXT_SQE_GROUP */
+		if (!sqe_is_group_member(sqe))
+			more = false;
+
+		if (req) {
+			ret = io_init_req(ctx, req, sqe);
+			/*
+			 * Both IO_LINK and IO_DRAIN aren't allowed for
+			 * group member, and the boundary has to be in
+			 * the lead sqe, so the whole group shares
+			 * same IO_LINK and IO_DRAIN.
+			 */
+			if (!ret && (req->flags & (IO_REQ_LINK_FLAGS |
+							REQ_F_IO_DRAIN)))
+				ret = -EINVAL;
+			if (!more)
+				req->flags |= REQ_F_SQE_GROUP;
+			if (unlikely(ret)) {
+				/*
+				 * The lead will be failed, so don't post
+				 * CQE for any member
+				 */
+				io_req_defer_failed_sliently(req, ret);
+			} else {
+				req->grp_link = lead->grp_link;
+				lead->grp_link = req;
+			}
+		}
+		cnt += 1;
+		if (ret)
+			lead_ret = ret;
+	} while (more);
+
+	/* Mark lead if we get members, otherwise degrade it to normal req */
+	if (cnt > 0) {
+		lead->flags |= REQ_F_SQE_GROUP_LEAD;
+		atomic_set(&lead->grp_refs, cnt);
+		*nr += cnt;
+	} else {
+		lead->flags &= ~REQ_F_SQE_GROUP;
+	}
+
+	return lead_ret;
+}
+
 static void io_req_complete_post(struct io_kiocb *req, unsigned issue_flags)
 {
 	struct io_ring_ctx *ctx = req->ctx;
 
+	if (req_is_group_lead(req)) {
+		io_complete_group_lead(req, issue_flags);
+		return;
+	}
+
 	/*
 	 * All execution paths but io-wq use the deferred completions by
 	 * passing IO_URING_F_COMPLETE_DEFER and thus should not end up here.
@@ -960,20 +1143,6 @@  static void io_req_complete_post(struct io_kiocb *req, unsigned issue_flags)
 	req_ref_put(req);
 }
 
-void io_req_defer_failed(struct io_kiocb *req, s32 res)
-	__must_hold(&ctx->uring_lock)
-{
-	const struct io_cold_def *def = &io_cold_defs[req->opcode];
-
-	lockdep_assert_held(&req->ctx->uring_lock);
-
-	req_set_fail(req);
-	io_req_set_res(req, res, io_put_kbuf(req, IO_URING_F_UNLOCKED));
-	if (def->fail)
-		def->fail(req);
-	io_req_complete_defer(req);
-}
-
 /*
  * Don't initialise the fields below on every allocation, but do that in
  * advance and keep them valid across allocations.
@@ -1459,7 +1628,8 @@  static void io_free_batch_list(struct io_ring_ctx *ctx,
 }
 
 static inline void io_fill_cqe_lists(struct io_ring_ctx *ctx,
-				     struct io_wq_work_list *list)
+				     struct io_wq_work_list *list,
+				     struct io_wq_work_list *grp)
 {
 	struct io_wq_work_node *node;
 
@@ -1477,6 +1647,13 @@  static inline void io_fill_cqe_lists(struct io_ring_ctx *ctx,
 				io_req_cqe_overflow(req);
 			}
 		}
+
+		if (grp && req_is_group_member(req)) {
+			struct io_kiocb *lead = io_complete_group_member(req);
+
+			if (lead)
+				wq_list_add_head(&lead->comp_list, grp);
+		}
 	}
 }
 
@@ -1484,14 +1661,19 @@  void __io_submit_flush_completions(struct io_ring_ctx *ctx)
 	__must_hold(&ctx->uring_lock)
 {
 	struct io_submit_state *state = &ctx->submit_state;
+	struct io_wq_work_list list = {NULL, NULL};
 
 	__io_cq_lock(ctx);
-	io_fill_cqe_lists(ctx, &state->compl_reqs);
+	io_fill_cqe_lists(ctx, &state->compl_reqs, &list);
+	if (!wq_list_empty(&list))
+		io_fill_cqe_lists(ctx, &list, NULL);
 	__io_cq_unlock_post(ctx);
 
-	if (!wq_list_empty(&ctx->submit_state.compl_reqs)) {
+	if (!wq_list_empty(&state->compl_reqs)) {
 		io_free_batch_list(ctx, state->compl_reqs.first);
 		INIT_WQ_LIST(&state->compl_reqs);
+		if (!wq_list_empty(&list))
+			io_free_batch_list(ctx, list.first);
 	}
 	ctx->submit_state.cq_flush = false;
 }
@@ -2212,6 +2394,8 @@  static inline int io_submit_sqe(struct io_ring_ctx *ctx, struct io_kiocb *req,
 
 	*nr = 1;
 	ret = io_init_req(ctx, req, sqe);
+	if (req->flags & REQ_F_SQE_GROUP)
+		ret = io_init_req_group(ctx, req, nr, ret);
 	if (unlikely(ret))
 		return io_submit_fail_init(sqe, req, ret);
 
diff --git a/io_uring/io_uring.h b/io_uring/io_uring.h
index 1eb65324792a..99eeb4eee219 100644
--- a/io_uring/io_uring.h
+++ b/io_uring/io_uring.h
@@ -104,6 +104,7 @@  bool __io_alloc_req_refill(struct io_ring_ctx *ctx);
 
 bool io_match_task_safe(struct io_kiocb *head, struct task_struct *task,
 			bool cancel_all);
+void io_complete_group_lead(struct io_kiocb *req, unsigned int issue_flags);
 
 enum {
 	IO_EVENTFD_OP_SIGNAL_BIT,
@@ -113,6 +114,16 @@  enum {
 void io_eventfd_ops(struct rcu_head *rcu);
 void io_activate_pollwq(struct io_ring_ctx *ctx);
 
+static inline bool req_is_group_lead(struct io_kiocb *req)
+{
+	return req->flags & REQ_F_SQE_GROUP_LEAD;
+}
+
+static inline bool req_is_group_member(struct io_kiocb *req)
+{
+	return !req_is_group_lead(req) && (req->flags & REQ_F_SQE_GROUP);
+}
+
 static inline void io_lockdep_assert_cq_locked(struct io_ring_ctx *ctx)
 {
 #if defined(CONFIG_PROVE_LOCKING)
@@ -355,7 +366,11 @@  static inline void io_req_complete_defer(struct io_kiocb *req)
 
 	lockdep_assert_held(&req->ctx->uring_lock);
 
-	wq_list_add_tail(&req->comp_list, &state->compl_reqs);
+	if (unlikely(req_is_group_lead(req)))
+		io_complete_group_lead(req, IO_URING_F_COMPLETE_DEFER |
+				IO_URING_F_NONBLOCK);
+	else
+		wq_list_add_tail(&req->comp_list, &state->compl_reqs);
 }
 
 static inline void io_commit_cqring_flush(struct io_ring_ctx *ctx)