diff mbox series

[2/9] io_uring: support user sqe ext flags

Message ID 20240408010322.4104395-3-ming.lei@redhat.com (mailing list archive)
State New, archived
Headers show
Series io_uring: support sqe group and provide group kbuf | expand

Commit Message

Ming Lei April 8, 2024, 1:03 a.m. UTC
sqe->flags is u8, and now we have used 7 bits, so take the last one for
extending purpose.

If bit7(IOSQE_HAS_EXT_FLAGS_BIT) is 1, it means this sqe carries ext flags
from the last byte(.ext_flags), or bit23~bit16 of sqe->uring_cmd_flags for
IORING_OP_URING_CMD.

io_slot_flags() return value is converted to `ULL` because the affected bits
are beyond 32bit now.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 include/linux/io_uring_types.h |  6 ++++--
 include/uapi/linux/io_uring.h  | 13 +++++++++++++
 io_uring/filetable.h           |  2 +-
 io_uring/io_uring.c            | 14 +++++++++++++-
 io_uring/uring_cmd.c           |  3 ++-
 5 files changed, 33 insertions(+), 5 deletions(-)

Comments

Jens Axboe April 22, 2024, 6:16 p.m. UTC | #1
On 4/7/24 7:03 PM, Ming Lei wrote:
> sqe->flags is u8, and now we have used 7 bits, so take the last one for
> extending purpose.
> 
> If bit7(IOSQE_HAS_EXT_FLAGS_BIT) is 1, it means this sqe carries ext flags
> from the last byte(.ext_flags), or bit23~bit16 of sqe->uring_cmd_flags for
> IORING_OP_URING_CMD.
> 
> io_slot_flags() return value is converted to `ULL` because the affected bits
> are beyond 32bit now.

If we're extending flags, which is something we arguably need to do at
some point, I think we should have them be generic and not spread out.
If uring_cmd needs specific flags and don't have them, then we should
add it just for that.
Ming Lei April 23, 2024, 1:57 p.m. UTC | #2
On Mon, Apr 22, 2024 at 12:16:12PM -0600, Jens Axboe wrote:
> On 4/7/24 7:03 PM, Ming Lei wrote:
> > sqe->flags is u8, and now we have used 7 bits, so take the last one for
> > extending purpose.
> > 
> > If bit7(IOSQE_HAS_EXT_FLAGS_BIT) is 1, it means this sqe carries ext flags
> > from the last byte(.ext_flags), or bit23~bit16 of sqe->uring_cmd_flags for
> > IORING_OP_URING_CMD.
> > 
> > io_slot_flags() return value is converted to `ULL` because the affected bits
> > are beyond 32bit now.
> 
> If we're extending flags, which is something we arguably need to do at
> some point, I think we should have them be generic and not spread out.

Sorry, maybe I don't get your idea, and the ext_flag itself is always
initialized in io_init_req(), like normal sqe->flags, same with its
usage.

> If uring_cmd needs specific flags and don't have them, then we should
> add it just for that.

The only difference is that bit23~bit16 of sqe->uring_cmd_flags is
borrowed for uring_cmd's ext flags, because sqe byte0~47 have been taken,
and can't be reused for generic flag. If we want to use byte48~63, it has
to be overlapped with uring_cmd's payload, and it is one generic sqe
flag, which is applied on uring_cmd too.

That is the only way I thought of, or any other suggestion for extending sqe
flags generically?


Thanks,
Ming
Pavel Begunkov April 29, 2024, 3:24 p.m. UTC | #3
On 4/23/24 14:57, Ming Lei wrote:
> On Mon, Apr 22, 2024 at 12:16:12PM -0600, Jens Axboe wrote:
>> On 4/7/24 7:03 PM, Ming Lei wrote:
>>> sqe->flags is u8, and now we have used 7 bits, so take the last one for
>>> extending purpose.
>>>
>>> If bit7(IOSQE_HAS_EXT_FLAGS_BIT) is 1, it means this sqe carries ext flags
>>> from the last byte(.ext_flags), or bit23~bit16 of sqe->uring_cmd_flags for
>>> IORING_OP_URING_CMD.
>>>
>>> io_slot_flags() return value is converted to `ULL` because the affected bits
>>> are beyond 32bit now.
>>
>> If we're extending flags, which is something we arguably need to do at
>> some point, I think we should have them be generic and not spread out.
> 
> Sorry, maybe I don't get your idea, and the ext_flag itself is always
> initialized in io_init_req(), like normal sqe->flags, same with its
> usage.
> 
>> If uring_cmd needs specific flags and don't have them, then we should
>> add it just for that.
> 
> The only difference is that bit23~bit16 of sqe->uring_cmd_flags is
> borrowed for uring_cmd's ext flags, because sqe byte0~47 have been taken,
> and can't be reused for generic flag. If we want to use byte48~63, it has
> to be overlapped with uring_cmd's payload, and it is one generic sqe
> flag, which is applied on uring_cmd too.

Which is exactly the mess nobody would want to see. And I'd also
argue 8 extra bits is not enough anyway, otherwise the history will
repeat itself pretty soon

> That is the only way I thought of, or any other suggestion for extending sqe
> flags generically?

idea 1: just use the last bit. When we need another one it'd be time
to think about a long overdue SQE layout v2, this way we can try
to make flags u32 and clean up other problems.

idea 2: the group assembling flag can move into cmds. Very roughly:

io_cmd_init() {
	ublk_cmd_init();
}

ublk_cmd_init() {
	io_uring_start_grouping(ctx, cmd);
}

io_uring_start_grouping(ctx, cmd) {
	ctx->grouping = true;
	ctx->group_head = cmd->req;
}

submit_sqe() {
	if (ctx->grouping) {
		link_to_group(req, ctx->group_head);
		if (!(req->flags & REQ_F_LINK))
			ctx->grouping = false;
	}
}
Ming Lei April 30, 2024, 3:43 a.m. UTC | #4
On Mon, Apr 29, 2024 at 04:24:54PM +0100, Pavel Begunkov wrote:
> On 4/23/24 14:57, Ming Lei wrote:
> > On Mon, Apr 22, 2024 at 12:16:12PM -0600, Jens Axboe wrote:
> > > On 4/7/24 7:03 PM, Ming Lei wrote:
> > > > sqe->flags is u8, and now we have used 7 bits, so take the last one for
> > > > extending purpose.
> > > > 
> > > > If bit7(IOSQE_HAS_EXT_FLAGS_BIT) is 1, it means this sqe carries ext flags
> > > > from the last byte(.ext_flags), or bit23~bit16 of sqe->uring_cmd_flags for
> > > > IORING_OP_URING_CMD.
> > > > 
> > > > io_slot_flags() return value is converted to `ULL` because the affected bits
> > > > are beyond 32bit now.
> > > 
> > > If we're extending flags, which is something we arguably need to do at
> > > some point, I think we should have them be generic and not spread out.
> > 
> > Sorry, maybe I don't get your idea, and the ext_flag itself is always
> > initialized in io_init_req(), like normal sqe->flags, same with its
> > usage.
> > 
> > > If uring_cmd needs specific flags and don't have them, then we should
> > > add it just for that.
> > 
> > The only difference is that bit23~bit16 of sqe->uring_cmd_flags is
> > borrowed for uring_cmd's ext flags, because sqe byte0~47 have been taken,
> > and can't be reused for generic flag. If we want to use byte48~63, it has
> > to be overlapped with uring_cmd's payload, and it is one generic sqe
> > flag, which is applied on uring_cmd too.
> 
> Which is exactly the mess nobody would want to see. And I'd also

The trouble is introduced by supporting uring_cmd, and solving it by setting
ext flags for uring_cmd specially by liburing helper is still reasonable or
understandable, IMO.

> argue 8 extra bits is not enough anyway, otherwise the history will
> repeat itself pretty soon

It is started with 8 bits, now doubled when io_uring is basically
mature, even though history might repeat, it will take much longer time

> 
> > That is the only way I thought of, or any other suggestion for extending sqe
> > flags generically?
> 
> idea 1: just use the last bit. When we need another one it'd be time
> to think about a long overdue SQE layout v2, this way we can try
> to make flags u32 and clean up other problems.

It looks over-kill to invent SQE v2 just for solving the trouble in
uring_cmd, and supporting two layouts can be new trouble for io_uring.

Also I doubt the problem can be solved in layout v2:

- 64 byte is small enough to support everything, same for v2

- uring_cmd has only 16 bytes payload, taking any byte from
the payload may cause trouble for drivers

- the only possible change could still be to suppress bytes for OP
specific flags, but it might cause trouble for some OPs, such as
network.

> 
> idea 2: the group assembling flag can move into cmds. Very roughly:
> 
> io_cmd_init() {
> 	ublk_cmd_init();
> }
> 
> ublk_cmd_init() {
> 	io_uring_start_grouping(ctx, cmd);
> }
> 
> io_uring_start_grouping(ctx, cmd) {
> 	ctx->grouping = true;
> 	ctx->group_head = cmd->req;
> }

How can you know one group is starting without any flag? Or you still
suggest the approach taken in fused command?

> 
> submit_sqe() {
> 	if (ctx->grouping) {
> 		link_to_group(req, ctx->group_head);
> 		if (!(req->flags & REQ_F_LINK))
> 			ctx->grouping = false;
> 	}
> }

The group needs to be linked to existed link chain, so reusing REQ_F_LINK may
not doable.


Thanks,
Ming
Pavel Begunkov April 30, 2024, noon UTC | #5
On 4/30/24 04:43, Ming Lei wrote:
> On Mon, Apr 29, 2024 at 04:24:54PM +0100, Pavel Begunkov wrote:
>> On 4/23/24 14:57, Ming Lei wrote:
>>> On Mon, Apr 22, 2024 at 12:16:12PM -0600, Jens Axboe wrote:
>>>> On 4/7/24 7:03 PM, Ming Lei wrote:
>>>>> sqe->flags is u8, and now we have used 7 bits, so take the last one for
>>>>> extending purpose.
>>>>>
>>>>> If bit7(IOSQE_HAS_EXT_FLAGS_BIT) is 1, it means this sqe carries ext flags
>>>>> from the last byte(.ext_flags), or bit23~bit16 of sqe->uring_cmd_flags for
>>>>> IORING_OP_URING_CMD.
>>>>>
>>>>> io_slot_flags() return value is converted to `ULL` because the affected bits
>>>>> are beyond 32bit now.
>>>>
>>>> If we're extending flags, which is something we arguably need to do at
>>>> some point, I think we should have them be generic and not spread out.
>>>
>>> Sorry, maybe I don't get your idea, and the ext_flag itself is always
>>> initialized in io_init_req(), like normal sqe->flags, same with its
>>> usage.
>>>
>>>> If uring_cmd needs specific flags and don't have them, then we should
>>>> add it just for that.
>>>
>>> The only difference is that bit23~bit16 of sqe->uring_cmd_flags is
>>> borrowed for uring_cmd's ext flags, because sqe byte0~47 have been taken,
>>> and can't be reused for generic flag. If we want to use byte48~63, it has
>>> to be overlapped with uring_cmd's payload, and it is one generic sqe
>>> flag, which is applied on uring_cmd too.
>>
>> Which is exactly the mess nobody would want to see. And I'd also
> 
> The trouble is introduced by supporting uring_cmd, and solving it by setting
> ext flags for uring_cmd specially by liburing helper is still reasonable or
> understandable, IMO.
> 
>> argue 8 extra bits is not enough anyway, otherwise the history will
>> repeat itself pretty soon
> 
> It is started with 8 bits, now doubled when io_uring is basically
> mature, even though history might repeat, it will take much longer time

You're mistaken, only 7 bits are taken not because there haven't been
ideas and need to use them, but because we're out of space and we've
been saving it for something that might be absolutely necessary.

POLL_FIRST IMHO should've been a generic feature, but it worked around
being a send/recv specific flag, same goes for the use of registered
buffers, not to mention ideas for which we haven't had enough flag space.

>>> That is the only way I thought of, or any other suggestion for extending sqe
>>> flags generically?
>>
>> idea 1: just use the last bit. When we need another one it'd be time
>> to think about a long overdue SQE layout v2, this way we can try
>> to make flags u32 and clean up other problems.
> 
> It looks over-kill to invent SQE v2 just for solving the trouble in
> uring_cmd, and supporting two layouts can be new trouble for io_uring.

Sounds too uring_cmd centric, it's not specifically for uring_cmd, it's
just one of reasons. As for overkill, that's why I'm not telling you
to change the layour, but suggesting to take the last bit for the
group flag and leave future problems for the future.


> Also I doubt the problem can be solved in layout v2:
> 
> - 64 byte is small enough to support everything, same for v2
> 
> - uring_cmd has only 16 bytes payload, taking any byte from
> the payload may cause trouble for drivers
> 
> - the only possible change could still be to suppress bytes for OP
> specific flags, but it might cause trouble for some OPs, such as
> network.

Look up sqe's __pad1, for example


>> idea 2: the group assembling flag can move into cmds. Very roughly:
>>
>> io_cmd_init() {
>> 	ublk_cmd_init();
>> }
>>
>> ublk_cmd_init() {
>> 	io_uring_start_grouping(ctx, cmd);
>> }
>>
>> io_uring_start_grouping(ctx, cmd) {
>> 	ctx->grouping = true;
>> 	ctx->group_head = cmd->req;
>> }
> 
> How can you know one group is starting without any flag? Or you still
> suggest the approach taken in fused command?

That would be ublk's business, e.g. ublk or cmds specific flag


>> submit_sqe() {
>> 	if (ctx->grouping) {
>> 		link_to_group(req, ctx->group_head);
>> 		if (!(req->flags & REQ_F_LINK))
>> 			ctx->grouping = false;
>> 	}
>> }
> 
> The group needs to be linked to existed link chain, so reusing REQ_F_LINK may
> not doable.

Would it break zero copy feature if you cant?
Ming Lei April 30, 2024, 12:56 p.m. UTC | #6
On Tue, Apr 30, 2024 at 01:00:30PM +0100, Pavel Begunkov wrote:
> On 4/30/24 04:43, Ming Lei wrote:
> > On Mon, Apr 29, 2024 at 04:24:54PM +0100, Pavel Begunkov wrote:
> > > On 4/23/24 14:57, Ming Lei wrote:
> > > > On Mon, Apr 22, 2024 at 12:16:12PM -0600, Jens Axboe wrote:
> > > > > On 4/7/24 7:03 PM, Ming Lei wrote:
> > > > > > sqe->flags is u8, and now we have used 7 bits, so take the last one for
> > > > > > extending purpose.
> > > > > > 
> > > > > > If bit7(IOSQE_HAS_EXT_FLAGS_BIT) is 1, it means this sqe carries ext flags
> > > > > > from the last byte(.ext_flags), or bit23~bit16 of sqe->uring_cmd_flags for
> > > > > > IORING_OP_URING_CMD.
> > > > > > 
> > > > > > io_slot_flags() return value is converted to `ULL` because the affected bits
> > > > > > are beyond 32bit now.
> > > > > 
> > > > > If we're extending flags, which is something we arguably need to do at
> > > > > some point, I think we should have them be generic and not spread out.
> > > > 
> > > > Sorry, maybe I don't get your idea, and the ext_flag itself is always
> > > > initialized in io_init_req(), like normal sqe->flags, same with its
> > > > usage.
> > > > 
> > > > > If uring_cmd needs specific flags and don't have them, then we should
> > > > > add it just for that.
> > > > 
> > > > The only difference is that bit23~bit16 of sqe->uring_cmd_flags is
> > > > borrowed for uring_cmd's ext flags, because sqe byte0~47 have been taken,
> > > > and can't be reused for generic flag. If we want to use byte48~63, it has
> > > > to be overlapped with uring_cmd's payload, and it is one generic sqe
> > > > flag, which is applied on uring_cmd too.
> > > 
> > > Which is exactly the mess nobody would want to see. And I'd also
> > 
> > The trouble is introduced by supporting uring_cmd, and solving it by setting
> > ext flags for uring_cmd specially by liburing helper is still reasonable or
> > understandable, IMO.
> > 
> > > argue 8 extra bits is not enough anyway, otherwise the history will
> > > repeat itself pretty soon
> > 
> > It is started with 8 bits, now doubled when io_uring is basically
> > mature, even though history might repeat, it will take much longer time
> 
> You're mistaken, only 7 bits are taken not because there haven't been
> ideas and need to use them, but because we're out of space and we've
> been saving it for something that might be absolutely necessary.
> 
> POLL_FIRST IMHO should've been a generic feature, but it worked around
> being a send/recv specific flag, same goes for the use of registered
> buffers, not to mention ideas for which we haven't had enough flag space.

OK, but I am wondering why not extend flags a bit so that io_uring can
become extendable, just like this patch.

> 
> > > > That is the only way I thought of, or any other suggestion for extending sqe
> > > > flags generically?
> > > 
> > > idea 1: just use the last bit. When we need another one it'd be time
> > > to think about a long overdue SQE layout v2, this way we can try
> > > to make flags u32 and clean up other problems.
> > 
> > It looks over-kill to invent SQE v2 just for solving the trouble in
> > uring_cmd, and supporting two layouts can be new trouble for io_uring.
> 
> Sounds too uring_cmd centric, it's not specifically for uring_cmd, it's
> just one of reasons. As for overkill, that's why I'm not telling you
> to change the layour, but suggesting to take the last bit for the
> group flag and leave future problems for the future.

You mentioned 8bit flag is designed from beginning just for saving
space, so SQE V2 may not help us at all.

If the last bit can be reserved for extend flag, it is still possible
to extend sqe flags a bit, such as this patch. Otherwise, we just lose
chance to extend sqe flags in future.

Jens, can you share your idea/option wrt. extending sqe flags?

> 
> 
> > Also I doubt the problem can be solved in layout v2:
> > 
> > - 64 byte is small enough to support everything, same for v2
> > 
> > - uring_cmd has only 16 bytes payload, taking any byte from
> > the payload may cause trouble for drivers
> > 
> > - the only possible change could still be to suppress bytes for OP
> > specific flags, but it might cause trouble for some OPs, such as
> > network.
> 
> Look up sqe's __pad1, for example

Suppose it is just for uring_cmd, '__pad1' is shared with cmd_op, which is aligned
with ioctl cmd and is supposed to be 32bit.

Same with 'off' which is used in rw at least, if sqe group is to be
generic flag.

> 
> 
> > > idea 2: the group assembling flag can move into cmds. Very roughly:
> > > 
> > > io_cmd_init() {
> > > 	ublk_cmd_init();
> > > }
> > > 
> > > ublk_cmd_init() {
> > > 	io_uring_start_grouping(ctx, cmd);
> > > }
> > > 
> > > io_uring_start_grouping(ctx, cmd) {
> > > 	ctx->grouping = true;
> > > 	ctx->group_head = cmd->req;
> > > }
> > 
> > How can you know one group is starting without any flag? Or you still
> > suggest the approach taken in fused command?
> 
> That would be ublk's business, e.g. ublk or cmds specific flag

Then it becomes dedicated fused command actually, and last year's main
concern is that the approach isn't generic.

> 
> 
> > > submit_sqe() {
> > > 	if (ctx->grouping) {
> > > 		link_to_group(req, ctx->group_head);
> > > 		if (!(req->flags & REQ_F_LINK))
> > > 			ctx->grouping = false;
> > > 	}
> > > }
> > 
> > The group needs to be linked to existed link chain, so reusing REQ_F_LINK may
> > not doable.
> 
> Would it break zero copy feature if you cant?

The whole sqe group needs to be linked to existed link chain, so we
can't reuse REQ_F_LINK here.

Thanks,
Ming
Pavel Begunkov April 30, 2024, 2:10 p.m. UTC | #7
On 4/30/24 13:56, Ming Lei wrote:
> On Tue, Apr 30, 2024 at 01:00:30PM +0100, Pavel Begunkov wrote:
>> On 4/30/24 04:43, Ming Lei wrote:
>>> On Mon, Apr 29, 2024 at 04:24:54PM +0100, Pavel Begunkov wrote:
>>>> On 4/23/24 14:57, Ming Lei wrote:
>>>>> On Mon, Apr 22, 2024 at 12:16:12PM -0600, Jens Axboe wrote:
>>>>>> On 4/7/24 7:03 PM, Ming Lei wrote:
>>>>>>> sqe->flags is u8, and now we have used 7 bits, so take the last one for
>>>>>>> extending purpose.
>>>>>>>
>>>>>>> If bit7(IOSQE_HAS_EXT_FLAGS_BIT) is 1, it means this sqe carries ext flags
>>>>>>> from the last byte(.ext_flags), or bit23~bit16 of sqe->uring_cmd_flags for
>>>>>>> IORING_OP_URING_CMD.
>>>>>>>
>>>>>>> io_slot_flags() return value is converted to `ULL` because the affected bits
>>>>>>> are beyond 32bit now.
>>>>>>
>>>>>> If we're extending flags, which is something we arguably need to do at
>>>>>> some point, I think we should have them be generic and not spread out.
>>>>>
>>>>> Sorry, maybe I don't get your idea, and the ext_flag itself is always
>>>>> initialized in io_init_req(), like normal sqe->flags, same with its
>>>>> usage.
>>>>>
>>>>>> If uring_cmd needs specific flags and don't have them, then we should
>>>>>> add it just for that.
>>>>>
>>>>> The only difference is that bit23~bit16 of sqe->uring_cmd_flags is
>>>>> borrowed for uring_cmd's ext flags, because sqe byte0~47 have been taken,
>>>>> and can't be reused for generic flag. If we want to use byte48~63, it has
>>>>> to be overlapped with uring_cmd's payload, and it is one generic sqe
>>>>> flag, which is applied on uring_cmd too.
>>>>
>>>> Which is exactly the mess nobody would want to see. And I'd also
>>>
>>> The trouble is introduced by supporting uring_cmd, and solving it by setting
>>> ext flags for uring_cmd specially by liburing helper is still reasonable or
>>> understandable, IMO.
>>>
>>>> argue 8 extra bits is not enough anyway, otherwise the history will
>>>> repeat itself pretty soon
>>>
>>> It is started with 8 bits, now doubled when io_uring is basically
>>> mature, even though history might repeat, it will take much longer time
>>
>> You're mistaken, only 7 bits are taken not because there haven't been
>> ideas and need to use them, but because we're out of space and we've
>> been saving it for something that might be absolutely necessary.
>>
>> POLL_FIRST IMHO should've been a generic feature, but it worked around
>> being a send/recv specific flag, same goes for the use of registered
>> buffers, not to mention ideas for which we haven't had enough flag space.
> 
> OK, but I am wondering why not extend flags a bit so that io_uring can
> become extendable, just like this patch.

That would be great if can be done cleanly. Even having it
non contig with the first 8bits is fine, but not conditional
depending on opcode is too much.


>>>>> That is the only way I thought of, or any other suggestion for extending sqe
>>>>> flags generically?
>>>>
>>>> idea 1: just use the last bit. When we need another one it'd be time
>>>> to think about a long overdue SQE layout v2, this way we can try
>>>> to make flags u32 and clean up other problems.
>>>
>>> It looks over-kill to invent SQE v2 just for solving the trouble in
>>> uring_cmd, and supporting two layouts can be new trouble for io_uring.
>>
>> Sounds too uring_cmd centric, it's not specifically for uring_cmd, it's
>> just one of reasons. As for overkill, that's why I'm not telling you
>> to change the layour, but suggesting to take the last bit for the
>> group flag and leave future problems for the future.
> 
> You mentioned 8bit flag is designed from beginning just for saving
> space, so SQE V2 may not help us at all.

Not sure what you mean. Retrospectively speaking, u8 for flags was
an oversight


> If the last bit can be reserved for extend flag, it is still possible
> to extend sqe flags a bit, such as this patch. Otherwise, we just lose
> chance to extend sqe flags in future.

That's why I mentioned SQE layout v2, i.e. a ctx flag which reshuffles
sqe fields in a better way. Surely there will be a lot of headache with
such a migration, but you can make flags a u32 then if you find space
and wouldn't even need and extending flag.


> Jens, can you share your idea/option wrt. extending sqe flags?
> 
>>
>>
>>> Also I doubt the problem can be solved in layout v2:
>>>
>>> - 64 byte is small enough to support everything, same for v2
>>>
>>> - uring_cmd has only 16 bytes payload, taking any byte from
>>> the payload may cause trouble for drivers
>>>
>>> - the only possible change could still be to suppress bytes for OP
>>> specific flags, but it might cause trouble for some OPs, such as
>>> network.
>>
>> Look up sqe's __pad1, for example
> 
> Suppose it is just for uring_cmd, '__pad1' is shared with cmd_op, which is aligned
> with ioctl cmd and is supposed to be 32bit.

It's not shared with cmd_op, it's in a struct with it, unless you
use a u32 part of ->addr2/off, it's just that, a completely
unnecessary created padding. There was also another field left,
at least in case for nvme.

> Same with 'off' which is used in rw at least, if sqe group is to be
> generic flag.
> 
>>
>>
>>>> idea 2: the group assembling flag can move into cmds. Very roughly:
>>>>
>>>> io_cmd_init() {
>>>> 	ublk_cmd_init();
>>>> }
>>>>
>>>> ublk_cmd_init() {
>>>> 	io_uring_start_grouping(ctx, cmd);
>>>> }
>>>>
>>>> io_uring_start_grouping(ctx, cmd) {
>>>> 	ctx->grouping = true;
>>>> 	ctx->group_head = cmd->req;
>>>> }
>>>
>>> How can you know one group is starting without any flag? Or you still
>>> suggest the approach taken in fused command?
>>
>> That would be ublk's business, e.g. ublk or cmds specific flag
> 
> Then it becomes dedicated fused command actually, and last year's main
> concern is that the approach isn't generic.

My concern is anything leaking into hot paths, even if it's a
generic feature (and I wouldn't call it that). The question is
rather at what degree. I wouldn't call groups in isolation
without zc exciting, and making it to look like a generic feature
just for the sake of it might even be worse than having it opcode
specific.

Regardless, this approach doesn't forbid some other opcode from
doing ctx->grouping = true based on some other opcode specific
flag, doesn't necessarily binds it to cmds/ublk.

>>>> submit_sqe() {
>>>> 	if (ctx->grouping) {
>>>> 		link_to_group(req, ctx->group_head);
>>>> 		if (!(req->flags & REQ_F_LINK))
>>>> 			ctx->grouping = false;
>>>> 	}
>>>> }
>>>
>>> The group needs to be linked to existed link chain, so reusing REQ_F_LINK may
>>> not doable.
>>
>> Would it break zero copy feature if you cant?
> 
> The whole sqe group needs to be linked to existed link chain, so we
> can't reuse REQ_F_LINK here.

Why though? You're passing a buffer from the head to all group-linked
requests, how do normal links come into the picture?
Ming Lei April 30, 2024, 3:46 p.m. UTC | #8
On Tue, Apr 30, 2024 at 03:10:01PM +0100, Pavel Begunkov wrote:
> On 4/30/24 13:56, Ming Lei wrote:
> > On Tue, Apr 30, 2024 at 01:00:30PM +0100, Pavel Begunkov wrote:
> > > On 4/30/24 04:43, Ming Lei wrote:
> > > > On Mon, Apr 29, 2024 at 04:24:54PM +0100, Pavel Begunkov wrote:
> > > > > On 4/23/24 14:57, Ming Lei wrote:
> > > > > > On Mon, Apr 22, 2024 at 12:16:12PM -0600, Jens Axboe wrote:
> > > > > > > On 4/7/24 7:03 PM, Ming Lei wrote:
> > > > > > > > sqe->flags is u8, and now we have used 7 bits, so take the last one for
> > > > > > > > extending purpose.
> > > > > > > > 
> > > > > > > > If bit7(IOSQE_HAS_EXT_FLAGS_BIT) is 1, it means this sqe carries ext flags
> > > > > > > > from the last byte(.ext_flags), or bit23~bit16 of sqe->uring_cmd_flags for
> > > > > > > > IORING_OP_URING_CMD.
> > > > > > > > 
> > > > > > > > io_slot_flags() return value is converted to `ULL` because the affected bits
> > > > > > > > are beyond 32bit now.
> > > > > > > 
> > > > > > > If we're extending flags, which is something we arguably need to do at
> > > > > > > some point, I think we should have them be generic and not spread out.
> > > > > > 
> > > > > > Sorry, maybe I don't get your idea, and the ext_flag itself is always
> > > > > > initialized in io_init_req(), like normal sqe->flags, same with its
> > > > > > usage.
> > > > > > 
> > > > > > > If uring_cmd needs specific flags and don't have them, then we should
> > > > > > > add it just for that.
> > > > > > 
> > > > > > The only difference is that bit23~bit16 of sqe->uring_cmd_flags is
> > > > > > borrowed for uring_cmd's ext flags, because sqe byte0~47 have been taken,
> > > > > > and can't be reused for generic flag. If we want to use byte48~63, it has
> > > > > > to be overlapped with uring_cmd's payload, and it is one generic sqe
> > > > > > flag, which is applied on uring_cmd too.
> > > > > 
> > > > > Which is exactly the mess nobody would want to see. And I'd also
> > > > 
> > > > The trouble is introduced by supporting uring_cmd, and solving it by setting
> > > > ext flags for uring_cmd specially by liburing helper is still reasonable or
> > > > understandable, IMO.
> > > > 
> > > > > argue 8 extra bits is not enough anyway, otherwise the history will
> > > > > repeat itself pretty soon
> > > > 
> > > > It is started with 8 bits, now doubled when io_uring is basically
> > > > mature, even though history might repeat, it will take much longer time
> > > 
> > > You're mistaken, only 7 bits are taken not because there haven't been
> > > ideas and need to use them, but because we're out of space and we've
> > > been saving it for something that might be absolutely necessary.
> > > 
> > > POLL_FIRST IMHO should've been a generic feature, but it worked around
> > > being a send/recv specific flag, same goes for the use of registered
> > > buffers, not to mention ideas for which we haven't had enough flag space.
> > 
> > OK, but I am wondering why not extend flags a bit so that io_uring can
> > become extendable, just like this patch.
> 
> That would be great if can be done cleanly. Even having it
> non contig with the first 8bits is fine, but not conditional
> depending on opcode is too much.

byte56~63 is used for uring_cmd payload, and it can't be done without
depending on uring_cmd op.

The patch is simple, and this usage can be well-documented. In
userspace, just one special helper is needed for setting uring_cmd
ext_flags only.

Except for this simple way, I don't see other approaches to extend sqe flags.

> 
> 
> > > > > > That is the only way I thought of, or any other suggestion for extending sqe
> > > > > > flags generically?
> > > > > 
> > > > > idea 1: just use the last bit. When we need another one it'd be time
> > > > > to think about a long overdue SQE layout v2, this way we can try
> > > > > to make flags u32 and clean up other problems.
> > > > 
> > > > It looks over-kill to invent SQE v2 just for solving the trouble in
> > > > uring_cmd, and supporting two layouts can be new trouble for io_uring.
> > > 
> > > Sounds too uring_cmd centric, it's not specifically for uring_cmd, it's
> > > just one of reasons. As for overkill, that's why I'm not telling you
> > > to change the layour, but suggesting to take the last bit for the
> > > group flag and leave future problems for the future.
> > 
> > You mentioned 8bit flag is designed from beginning just for saving
> > space, so SQE V2 may not help us at all.
> 
> Not sure what you mean. Retrospectively speaking, u8 for flags was
> an oversight

You mentioned that:

	You're mistaken, only 7 bits are taken not because there haven't been
	ideas and need to use them, but because we're out of space and we've
	been saving it for something that might be absolutely necessary.

Nothing is changed since then, so where to find more free space from
64 bytes for sqe flags now?

> 
> 
> > If the last bit can be reserved for extend flag, it is still possible
> > to extend sqe flags a bit, such as this patch. Otherwise, we just lose
> > chance to extend sqe flags in future.
> 
> That's why I mentioned SQE layout v2, i.e. a ctx flag which reshuffles
> sqe fields in a better way. Surely there will be a lot of headache with
> such a migration, but you can make flags a u32 then if you find space
> and wouldn't even need and extending flag.

It is one hard problem, and it may not be answered in short time, cause all
use cases need to be covered, meantime 3 extra bytes are saved from the
reshuffling, with alignment respected meantime.

Also it isn't worth of layout v2 just for extending sqe flags.

> 
> 
> > Jens, can you share your idea/option wrt. extending sqe flags?
> > 
> > > 
> > > 
> > > > Also I doubt the problem can be solved in layout v2:
> > > > 
> > > > - 64 byte is small enough to support everything, same for v2
> > > > 
> > > > - uring_cmd has only 16 bytes payload, taking any byte from
> > > > the payload may cause trouble for drivers
> > > > 
> > > > - the only possible change could still be to suppress bytes for OP
> > > > specific flags, but it might cause trouble for some OPs, such as
> > > > network.
> > > 
> > > Look up sqe's __pad1, for example
> > 
> > Suppose it is just for uring_cmd, '__pad1' is shared with cmd_op, which is aligned
> > with ioctl cmd and is supposed to be 32bit.
> 
> It's not shared with cmd_op, it's in a struct with it, unless you
> use a u32 part of ->addr2/off, it's just that, a completely
> unnecessary created padding. There was also another field left,
> at least in case for nvme.

OK, __pad1 is available for uring_cmd, and it could be better to use
__pad1 for uring_cmd ext flags, but it still depends on uring_cmd, and
now ext_flags can be u16 or more, :-)

Thanks for sharing this point.

> 
> > Same with 'off' which is used in rw at least, if sqe group is to be
> > generic flag.
> > 
> > > 
> > > 
> > > > > idea 2: the group assembling flag can move into cmds. Very roughly:
> > > > > 
> > > > > io_cmd_init() {
> > > > > 	ublk_cmd_init();
> > > > > }
> > > > > 
> > > > > ublk_cmd_init() {
> > > > > 	io_uring_start_grouping(ctx, cmd);
> > > > > }
> > > > > 
> > > > > io_uring_start_grouping(ctx, cmd) {
> > > > > 	ctx->grouping = true;
> > > > > 	ctx->group_head = cmd->req;
> > > > > }
> > > > 
> > > > How can you know one group is starting without any flag? Or you still
> > > > suggest the approach taken in fused command?
> > > 
> > > That would be ublk's business, e.g. ublk or cmds specific flag
> > 
> > Then it becomes dedicated fused command actually, and last year's main
> > concern is that the approach isn't generic.
> 
> My concern is anything leaking into hot paths, even if it's a
> generic feature (and I wouldn't call it that). The question is
> rather at what degree. I wouldn't call groups in isolation
> without zc exciting, and making it to look like a generic feature
> just for the sake of it might even be worse than having it opcode
> specific.
> 
> Regardless, this approach doesn't forbid some other opcode from
> doing ctx->grouping = true based on some other opcode specific
> flag, doesn't necessarily binds it to cmds/ublk.

Yes.

> 
> > > > > submit_sqe() {
> > > > > 	if (ctx->grouping) {
> > > > > 		link_to_group(req, ctx->group_head);
> > > > > 		if (!(req->flags & REQ_F_LINK))
> > > > > 			ctx->grouping = false;
> > > > > 	}
> > > > > }
> > > > 
> > > > The group needs to be linked to existed link chain, so reusing REQ_F_LINK may
> > > > not doable.
> > > 
> > > Would it break zero copy feature if you cant?
> > 
> > The whole sqe group needs to be linked to existed link chain, so we
> > can't reuse REQ_F_LINK here.
> 
> Why though? You're passing a buffer from the head to all group-linked
> requests, how do normal links come into the picture?

For example of ublk-nbd, tcp send requests have to be linked, and each
send request belongs to one group in case of zero copy.

Meantime linking is one generic feature, and it can be used everywhere, so
not good to disallow it in application

Thanks,
Ming
Pavel Begunkov May 2, 2024, 2:22 p.m. UTC | #9
On 4/30/24 16:46, Ming Lei wrote:
> On Tue, Apr 30, 2024 at 03:10:01PM +0100, Pavel Begunkov wrote:
>> On 4/30/24 13:56, Ming Lei wrote:
>>> On Tue, Apr 30, 2024 at 01:00:30PM +0100, Pavel Begunkov wrote:
>>>> On 4/30/24 04:43, Ming Lei wrote:
>>>>> On Mon, Apr 29, 2024 at 04:24:54PM +0100, Pavel Begunkov wrote:
>>>>>> On 4/23/24 14:57, Ming Lei wrote:
>>>>>>> On Mon, Apr 22, 2024 at 12:16:12PM -0600, Jens Axboe wrote:
>>>>>>>> On 4/7/24 7:03 PM, Ming Lei wrote:
>>>>>>>>> sqe->flags is u8, and now we have used 7 bits, so take the last one for
>>>>>>>>> extending purpose.
>>>>>>>>>
>>>>>>>>> If bit7(IOSQE_HAS_EXT_FLAGS_BIT) is 1, it means this sqe carries ext flags
>>>>>>>>> from the last byte(.ext_flags), or bit23~bit16 of sqe->uring_cmd_flags for
>>>>>>>>> IORING_OP_URING_CMD.
>>>>>>>>>
>>>>>>>>> io_slot_flags() return value is converted to `ULL` because the affected bits
>>>>>>>>> are beyond 32bit now.
>>>>>>>>
>>>>>>>> If we're extending flags, which is something we arguably need to do at
>>>>>>>> some point, I think we should have them be generic and not spread out.
>>>>>>>
>>>>>>> Sorry, maybe I don't get your idea, and the ext_flag itself is always
>>>>>>> initialized in io_init_req(), like normal sqe->flags, same with its
>>>>>>> usage.
>>>>>>>
>>>>>>>> If uring_cmd needs specific flags and don't have them, then we should
>>>>>>>> add it just for that.
>>>>>>>
>>>>>>> The only difference is that bit23~bit16 of sqe->uring_cmd_flags is
>>>>>>> borrowed for uring_cmd's ext flags, because sqe byte0~47 have been taken,
>>>>>>> and can't be reused for generic flag. If we want to use byte48~63, it has
>>>>>>> to be overlapped with uring_cmd's payload, and it is one generic sqe
>>>>>>> flag, which is applied on uring_cmd too.
>>>>>>
>>>>>> Which is exactly the mess nobody would want to see. And I'd also
>>>>>
>>>>> The trouble is introduced by supporting uring_cmd, and solving it by setting
>>>>> ext flags for uring_cmd specially by liburing helper is still reasonable or
>>>>> understandable, IMO.
>>>>>
>>>>>> argue 8 extra bits is not enough anyway, otherwise the history will
>>>>>> repeat itself pretty soon
>>>>>
>>>>> It is started with 8 bits, now doubled when io_uring is basically
>>>>> mature, even though history might repeat, it will take much longer time
>>>>
>>>> You're mistaken, only 7 bits are taken not because there haven't been
>>>> ideas and need to use them, but because we're out of space and we've
>>>> been saving it for something that might be absolutely necessary.
>>>>
>>>> POLL_FIRST IMHO should've been a generic feature, but it worked around
>>>> being a send/recv specific flag, same goes for the use of registered
>>>> buffers, not to mention ideas for which we haven't had enough flag space.
>>>
>>> OK, but I am wondering why not extend flags a bit so that io_uring can
>>> become extendable, just like this patch.
>>
>> That would be great if can be done cleanly. Even having it
>> non contig with the first 8bits is fine, but not conditional
>> depending on opcode is too much.
> 
> byte56~63 is used for uring_cmd payload, and it can't be done without
> depending on uring_cmd op.
> 
> The patch is simple, and this usage can be well-documented. In
> userspace, just one special helper is needed for setting uring_cmd
> ext_flags only.

One simple helper here, one simple helper there, one line in man
in some other place, in the end it'll turn to be a horrible mess.

It's not even a question when we'd see people asking "I used
set_ext_flags but why it doesn't work" from people missing a
separate cmd flag. Or just to be sure they would call both,
such things happen even with more straightforward APIs, and it's
just one problem.

> Except for this simple way, I don't see other approaches to extend sqe flags.

Well, that's why I described below how exactly it can be done
cleanly in a long run.

>>>>>>> That is the only way I thought of, or any other suggestion for extending sqe
>>>>>>> flags generically?
>>>>>>
>>>>>> idea 1: just use the last bit. When we need another one it'd be time
>>>>>> to think about a long overdue SQE layout v2, this way we can try
>>>>>> to make flags u32 and clean up other problems.
>>>>>
>>>>> It looks over-kill to invent SQE v2 just for solving the trouble in
>>>>> uring_cmd, and supporting two layouts can be new trouble for io_uring.
>>>>
>>>> Sounds too uring_cmd centric, it's not specifically for uring_cmd, it's
>>>> just one of reasons. As for overkill, that's why I'm not telling you
>>>> to change the layour, but suggesting to take the last bit for the
>>>> group flag and leave future problems for the future.
>>>
>>> You mentioned 8bit flag is designed from beginning just for saving
>>> space, so SQE V2 may not help us at all.
>>
>> Not sure what you mean. Retrospectively speaking, u8 for flags was
>> an oversight
> 
> You mentioned that:
> 
> 	You're mistaken, only 7 bits are taken not because there haven't been
> 	ideas and need to use them, but because we're out of space and we've
> 	been saving it for something that might be absolutely necessary.
> 
> Nothing is changed since then, so where to find more free space from
> 64 bytes for sqe flags now?

ditto

>>> If the last bit can be reserved for extend flag, it is still possible
>>> to extend sqe flags a bit, such as this patch. Otherwise, we just lose
>>> chance to extend sqe flags in future.
>>
>> That's why I mentioned SQE layout v2, i.e. a ctx flag which reshuffles
>> sqe fields in a better way. Surely there will be a lot of headache with
>> such a migration, but you can make flags a u32 then if you find space
>> and wouldn't even need and extending flag.
> 
> It is one hard problem, and it may not be answered in short time, cause all
> use cases need to be covered, meantime 3 extra bytes are saved from the
> reshuffling, with alignment respected meantime.
> 
> Also it isn't worth of layout v2 just for extending sqe flags.

Not just, by opting to the pain of migration to a new SQE layout
we can revise more API decisions. It's a separate topic for
discussion, and the latter it's done the better because we'd
collect more design mistakes that can be fixed.

Reiterating again, you're not blocked by it.

>>> Jens, can you share your idea/option wrt. extending sqe flags?
>>>
>>>>
>>>>
>>>>> Also I doubt the problem can be solved in layout v2:
>>>>>
>>>>> - 64 byte is small enough to support everything, same for v2
>>>>>
>>>>> - uring_cmd has only 16 bytes payload, taking any byte from
>>>>> the payload may cause trouble for drivers
>>>>>
>>>>> - the only possible change could still be to suppress bytes for OP
>>>>> specific flags, but it might cause trouble for some OPs, such as
>>>>> network.
>>>>
>>>> Look up sqe's __pad1, for example
>>>
>>> Suppose it is just for uring_cmd, '__pad1' is shared with cmd_op, which is aligned
>>> with ioctl cmd and is supposed to be 32bit.
>>
>> It's not shared with cmd_op, it's in a struct with it, unless you
>> use a u32 part of ->addr2/off, it's just that, a completely
>> unnecessary created padding. There was also another field left,
>> at least in case for nvme.
> 
> OK, __pad1 is available for uring_cmd, and it could be better to use
> __pad1 for uring_cmd ext flags, but it still depends on uring_cmd, and
> now ext_flags can be u16 or more, :-)
> 
> Thanks for sharing this point.
> 
>>
>>> Same with 'off' which is used in rw at least, if sqe group is to be
>>> generic flag.
>>>
>>>>
>>>>
>>>>>> idea 2: the group assembling flag can move into cmds. Very roughly:
>>>>>>
>>>>>> io_cmd_init() {
>>>>>> 	ublk_cmd_init();
>>>>>> }
>>>>>>
>>>>>> ublk_cmd_init() {
>>>>>> 	io_uring_start_grouping(ctx, cmd);
>>>>>> }
>>>>>>
>>>>>> io_uring_start_grouping(ctx, cmd) {
>>>>>> 	ctx->grouping = true;
>>>>>> 	ctx->group_head = cmd->req;
>>>>>> }
>>>>>
>>>>> How can you know one group is starting without any flag? Or you still
>>>>> suggest the approach taken in fused command?
>>>>
>>>> That would be ublk's business, e.g. ublk or cmds specific flag
>>>
>>> Then it becomes dedicated fused command actually, and last year's main
>>> concern is that the approach isn't generic.
>>
>> My concern is anything leaking into hot paths, even if it's a
>> generic feature (and I wouldn't call it that). The question is
>> rather at what degree. I wouldn't call groups in isolation
>> without zc exciting, and making it to look like a generic feature
>> just for the sake of it might even be worse than having it opcode
>> specific.
>>
>> Regardless, this approach doesn't forbid some other opcode from
>> doing ctx->grouping = true based on some other opcode specific
>> flag, doesn't necessarily binds it to cmds/ublk.
> 
> Yes.
> 
>>
>>>>>> submit_sqe() {
>>>>>> 	if (ctx->grouping) {
>>>>>> 		link_to_group(req, ctx->group_head);
>>>>>> 		if (!(req->flags & REQ_F_LINK))
>>>>>> 			ctx->grouping = false;
>>>>>> 	}
>>>>>> }
>>>>>
>>>>> The group needs to be linked to existed link chain, so reusing REQ_F_LINK may
>>>>> not doable.
>>>>
>>>> Would it break zero copy feature if you cant?
>>>
>>> The whole sqe group needs to be linked to existed link chain, so we
>>> can't reuse REQ_F_LINK here.
>>
>> Why though? You're passing a buffer from the head to all group-linked
>> requests, how do normal links come into the picture?
> 
> For example of ublk-nbd, tcp send requests have to be linked, and each
> send request belongs to one group in case of zero copy.

"Linked" like in "add to a group"? Or in terms of traditional
IOSQE_IO_LINK links? Because if it's about groups you don't need
IOSQE_IO_LINK. And if it's IOSQE_IO_LINK linking, then same
is supposedly can be done in userspace.

> Meantime linking is one generic feature, and it can be used everywhere, so
> not good to disallow it in application
Ming Lei May 4, 2024, 1:19 a.m. UTC | #10
On Thu, May 02, 2024 at 03:22:10PM +0100, Pavel Begunkov wrote:
> On 4/30/24 16:46, Ming Lei wrote:
> > On Tue, Apr 30, 2024 at 03:10:01PM +0100, Pavel Begunkov wrote:
> > > On 4/30/24 13:56, Ming Lei wrote:
> > > > On Tue, Apr 30, 2024 at 01:00:30PM +0100, Pavel Begunkov wrote:
> > > > > On 4/30/24 04:43, Ming Lei wrote:
> > > > > > On Mon, Apr 29, 2024 at 04:24:54PM +0100, Pavel Begunkov wrote:
> > > > > > > On 4/23/24 14:57, Ming Lei wrote:
> > > > > > > > On Mon, Apr 22, 2024 at 12:16:12PM -0600, Jens Axboe wrote:
> > > > > > > > > On 4/7/24 7:03 PM, Ming Lei wrote:
> > > > > > > > > > sqe->flags is u8, and now we have used 7 bits, so take the last one for
> > > > > > > > > > extending purpose.
> > > > > > > > > > 
> > > > > > > > > > If bit7(IOSQE_HAS_EXT_FLAGS_BIT) is 1, it means this sqe carries ext flags
> > > > > > > > > > from the last byte(.ext_flags), or bit23~bit16 of sqe->uring_cmd_flags for
> > > > > > > > > > IORING_OP_URING_CMD.
> > > > > > > > > > 
> > > > > > > > > > io_slot_flags() return value is converted to `ULL` because the affected bits
> > > > > > > > > > are beyond 32bit now.
> > > > > > > > > 
> > > > > > > > > If we're extending flags, which is something we arguably need to do at
> > > > > > > > > some point, I think we should have them be generic and not spread out.
> > > > > > > > 
> > > > > > > > Sorry, maybe I don't get your idea, and the ext_flag itself is always
> > > > > > > > initialized in io_init_req(), like normal sqe->flags, same with its
> > > > > > > > usage.
> > > > > > > > 
> > > > > > > > > If uring_cmd needs specific flags and don't have them, then we should
> > > > > > > > > add it just for that.
> > > > > > > > 
> > > > > > > > The only difference is that bit23~bit16 of sqe->uring_cmd_flags is
> > > > > > > > borrowed for uring_cmd's ext flags, because sqe byte0~47 have been taken,
> > > > > > > > and can't be reused for generic flag. If we want to use byte48~63, it has
> > > > > > > > to be overlapped with uring_cmd's payload, and it is one generic sqe
> > > > > > > > flag, which is applied on uring_cmd too.
> > > > > > > 
> > > > > > > Which is exactly the mess nobody would want to see. And I'd also
> > > > > > 
> > > > > > The trouble is introduced by supporting uring_cmd, and solving it by setting
> > > > > > ext flags for uring_cmd specially by liburing helper is still reasonable or
> > > > > > understandable, IMO.
> > > > > > 
> > > > > > > argue 8 extra bits is not enough anyway, otherwise the history will
> > > > > > > repeat itself pretty soon
> > > > > > 
> > > > > > It is started with 8 bits, now doubled when io_uring is basically
> > > > > > mature, even though history might repeat, it will take much longer time
> > > > > 
> > > > > You're mistaken, only 7 bits are taken not because there haven't been
> > > > > ideas and need to use them, but because we're out of space and we've
> > > > > been saving it for something that might be absolutely necessary.
> > > > > 
> > > > > POLL_FIRST IMHO should've been a generic feature, but it worked around
> > > > > being a send/recv specific flag, same goes for the use of registered
> > > > > buffers, not to mention ideas for which we haven't had enough flag space.
> > > > 
> > > > OK, but I am wondering why not extend flags a bit so that io_uring can
> > > > become extendable, just like this patch.
> > > 
> > > That would be great if can be done cleanly. Even having it
> > > non contig with the first 8bits is fine, but not conditional
> > > depending on opcode is too much.
> > 
> > byte56~63 is used for uring_cmd payload, and it can't be done without
> > depending on uring_cmd op.
> > 
> > The patch is simple, and this usage can be well-documented. In
> > userspace, just one special helper is needed for setting uring_cmd
> > ext_flags only.
> 
> One simple helper here, one simple helper there, one line in man
> in some other place, in the end it'll turn to be a horrible mess.
> 
> It's not even a question when we'd see people asking "I used
> set_ext_flags but why it doesn't work" from people missing a
> separate cmd flag. Or just to be sure they would call both,
> such things happen even with more straightforward APIs, and it's
> just one problem.
> 
> > Except for this simple way, I don't see other approaches to extend sqe flags.
> 
> Well, that's why I described below how exactly it can be done
> cleanly in a long run.

OK, then looks io_uring isn't extendable wrt. sqe->flags in long run.

> 
> > > > > > > > That is the only way I thought of, or any other suggestion for extending sqe
> > > > > > > > flags generically?
> > > > > > > 
> > > > > > > idea 1: just use the last bit. When we need another one it'd be time
> > > > > > > to think about a long overdue SQE layout v2, this way we can try
> > > > > > > to make flags u32 and clean up other problems.
> > > > > > 
> > > > > > It looks over-kill to invent SQE v2 just for solving the trouble in
> > > > > > uring_cmd, and supporting two layouts can be new trouble for io_uring.
> > > > > 
> > > > > Sounds too uring_cmd centric, it's not specifically for uring_cmd, it's
> > > > > just one of reasons. As for overkill, that's why I'm not telling you
> > > > > to change the layour, but suggesting to take the last bit for the
> > > > > group flag and leave future problems for the future.
> > > > 
> > > > You mentioned 8bit flag is designed from beginning just for saving
> > > > space, so SQE V2 may not help us at all.
> > > 
> > > Not sure what you mean. Retrospectively speaking, u8 for flags was
> > > an oversight
> > 
> > You mentioned that:
> > 
> > 	You're mistaken, only 7 bits are taken not because there haven't been
> > 	ideas and need to use them, but because we're out of space and we've
> > 	been saving it for something that might be absolutely necessary.
> > 
> > Nothing is changed since then, so where to find more free space from
> > 64 bytes for sqe flags now?
> 
> ditto
> 
> > > > If the last bit can be reserved for extend flag, it is still possible
> > > > to extend sqe flags a bit, such as this patch. Otherwise, we just lose
> > > > chance to extend sqe flags in future.
> > > 
> > > That's why I mentioned SQE layout v2, i.e. a ctx flag which reshuffles
> > > sqe fields in a better way. Surely there will be a lot of headache with
> > > such a migration, but you can make flags a u32 then if you find space
> > > and wouldn't even need and extending flag.
> > 
> > It is one hard problem, and it may not be answered in short time, cause all
> > use cases need to be covered, meantime 3 extra bytes are saved from the
> > reshuffling, with alignment respected meantime.
> > 
> > Also it isn't worth of layout v2 just for extending sqe flags.
> 
> Not just, by opting to the pain of migration to a new SQE layout
> we can revise more API decisions. It's a separate topic for
> discussion, and the latter it's done the better because we'd
> collect more design mistakes that can be fixed.
> 
> Reiterating again, you're not blocked by it.
> 
> > > > Jens, can you share your idea/option wrt. extending sqe flags?
> > > > 
> > > > > 
> > > > > 
> > > > > > Also I doubt the problem can be solved in layout v2:
> > > > > > 
> > > > > > - 64 byte is small enough to support everything, same for v2
> > > > > > 
> > > > > > - uring_cmd has only 16 bytes payload, taking any byte from
> > > > > > the payload may cause trouble for drivers
> > > > > > 
> > > > > > - the only possible change could still be to suppress bytes for OP
> > > > > > specific flags, but it might cause trouble for some OPs, such as
> > > > > > network.
> > > > > 
> > > > > Look up sqe's __pad1, for example
> > > > 
> > > > Suppose it is just for uring_cmd, '__pad1' is shared with cmd_op, which is aligned
> > > > with ioctl cmd and is supposed to be 32bit.
> > > 
> > > It's not shared with cmd_op, it's in a struct with it, unless you
> > > use a u32 part of ->addr2/off, it's just that, a completely
> > > unnecessary created padding. There was also another field left,
> > > at least in case for nvme.
> > 
> > OK, __pad1 is available for uring_cmd, and it could be better to use
> > __pad1 for uring_cmd ext flags, but it still depends on uring_cmd, and
> > now ext_flags can be u16 or more, :-)
> > 
> > Thanks for sharing this point.
> > 
> > > 
> > > > Same with 'off' which is used in rw at least, if sqe group is to be
> > > > generic flag.
> > > > 
> > > > > 
> > > > > 
> > > > > > > idea 2: the group assembling flag can move into cmds. Very roughly:
> > > > > > > 
> > > > > > > io_cmd_init() {
> > > > > > > 	ublk_cmd_init();
> > > > > > > }
> > > > > > > 
> > > > > > > ublk_cmd_init() {
> > > > > > > 	io_uring_start_grouping(ctx, cmd);
> > > > > > > }
> > > > > > > 
> > > > > > > io_uring_start_grouping(ctx, cmd) {
> > > > > > > 	ctx->grouping = true;
> > > > > > > 	ctx->group_head = cmd->req;
> > > > > > > }
> > > > > > 
> > > > > > How can you know one group is starting without any flag? Or you still
> > > > > > suggest the approach taken in fused command?
> > > > > 
> > > > > That would be ublk's business, e.g. ublk or cmds specific flag
> > > > 
> > > > Then it becomes dedicated fused command actually, and last year's main
> > > > concern is that the approach isn't generic.
> > > 
> > > My concern is anything leaking into hot paths, even if it's a
> > > generic feature (and I wouldn't call it that). The question is
> > > rather at what degree. I wouldn't call groups in isolation
> > > without zc exciting, and making it to look like a generic feature
> > > just for the sake of it might even be worse than having it opcode
> > > specific.
> > > 
> > > Regardless, this approach doesn't forbid some other opcode from
> > > doing ctx->grouping = true based on some other opcode specific
> > > flag, doesn't necessarily binds it to cmds/ublk.
> > 
> > Yes.
> > 
> > > 
> > > > > > > submit_sqe() {
> > > > > > > 	if (ctx->grouping) {
> > > > > > > 		link_to_group(req, ctx->group_head);
> > > > > > > 		if (!(req->flags & REQ_F_LINK))
> > > > > > > 			ctx->grouping = false;
> > > > > > > 	}
> > > > > > > }
> > > > > > 
> > > > > > The group needs to be linked to existed link chain, so reusing REQ_F_LINK may
> > > > > > not doable.
> > > > > 
> > > > > Would it break zero copy feature if you cant?
> > > > 
> > > > The whole sqe group needs to be linked to existed link chain, so we
> > > > can't reuse REQ_F_LINK here.
> > > 
> > > Why though? You're passing a buffer from the head to all group-linked
> > > requests, how do normal links come into the picture?
> > 
> > For example of ublk-nbd, tcp send requests have to be linked, and each
> > send request belongs to one group in case of zero copy.
> 
> "Linked" like in "add to a group"? Or in terms of traditional
> IOSQE_IO_LINK links? Because if it's about groups you don't need

It is IOSQE_IO_LINK.

Here one group is just for handling zero copy IO, such as handling one
ublk block request, but there can be lots of such groups, each group
needs to be linked with another group, because it is tcp send.

There are more such examples, such as normal IO depending on meta IO by
IOSQE_IO_LINK.

You are suggesting to reuse IOSQE_IO_LINK, then this inter-group linking
or meta IO linking with normal group is simply non working.

> IOSQE_IO_LINK. And if it's IOSQE_IO_LINK linking, then same
> is supposedly can be done in userspace.

Yes, it can work by forcing to replace IOSQE_IO_LINK with one extra
io_uring_enter() in userspace, together with extra overhead, and more complexity.

That is why I think disallowing generic feature of IOSQE_IO_LINK isn't one
wise option. Is there any other OP or example which doesn't support IOSQE_IO_LINK?

Thanks,
Ming
diff mbox series

Patch

diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
index 3e72fa52f1e3..67347e5d06ec 100644
--- a/include/linux/io_uring_types.h
+++ b/include/linux/io_uring_types.h
@@ -435,6 +435,7 @@  struct io_tw_state {
 };
 
 enum {
+	/* 1st byte is from sqe->flags, and 2nd is from sqe ext_flags */
 	REQ_F_FIXED_FILE_BIT	= IOSQE_FIXED_FILE_BIT,
 	REQ_F_IO_DRAIN_BIT	= IOSQE_IO_DRAIN_BIT,
 	REQ_F_LINK_BIT		= IOSQE_IO_LINK_BIT,
@@ -442,9 +443,10 @@  enum {
 	REQ_F_FORCE_ASYNC_BIT	= IOSQE_ASYNC_BIT,
 	REQ_F_BUFFER_SELECT_BIT	= IOSQE_BUFFER_SELECT_BIT,
 	REQ_F_CQE_SKIP_BIT	= IOSQE_CQE_SKIP_SUCCESS_BIT,
+	REQ_F_SQE_EXT_FLAGS_BIT	= IOSQE_HAS_EXT_FLAGS_BIT,
 
-	/* first byte is taken by user flags, shift it to not overlap */
-	REQ_F_FAIL_BIT		= 8,
+	/* first 2 bytes are taken by user flags, shift it to not overlap */
+	REQ_F_FAIL_BIT		= 16,
 	REQ_F_INFLIGHT_BIT,
 	REQ_F_CUR_POS_BIT,
 	REQ_F_NOWAIT_BIT,
diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index a7f847543a7f..4847d7cf1ac9 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -98,6 +98,10 @@  struct io_uring_sqe {
 			__u64	__pad2[1];
 		};
 		__u64	optval;
+		struct {
+			__u8	__pad4[15];
+			__u8	ext_flags;
+		};
 		/*
 		 * If the ring is initialized with IORING_SETUP_SQE128, then
 		 * this field is used for 80 bytes of arbitrary command data
@@ -123,6 +127,7 @@  enum io_uring_sqe_flags_bit {
 	IOSQE_ASYNC_BIT,
 	IOSQE_BUFFER_SELECT_BIT,
 	IOSQE_CQE_SKIP_SUCCESS_BIT,
+	IOSQE_HAS_EXT_FLAGS_BIT,
 };
 
 /*
@@ -142,6 +147,11 @@  enum io_uring_sqe_flags_bit {
 #define IOSQE_BUFFER_SELECT	(1U << IOSQE_BUFFER_SELECT_BIT)
 /* don't post CQE if request succeeded */
 #define IOSQE_CQE_SKIP_SUCCESS	(1U << IOSQE_CQE_SKIP_SUCCESS_BIT)
+/*
+ * sqe ext flags carried in the last byte, or bit23~bit16 of
+ * sqe->uring_cmd_flags for IORING_URING_CMD.
+ */
+#define IOSQE_HAS_EXT_FLAGS	(1U << IOSQE_HAS_EXT_FLAGS_BIT)
 
 /*
  * io_uring_setup() flags
@@ -263,11 +273,14 @@  enum io_uring_op {
 
 /*
  * sqe->uring_cmd_flags		top 8bits aren't available for userspace
+ * bit31 ~ bit24		kernel internal usage
+ * bit23 ~ bit16		sqe ext flags
  * IORING_URING_CMD_FIXED	use registered buffer; pass this flag
  *				along with setting sqe->buf_index.
  */
 #define IORING_URING_CMD_FIXED	(1U << 0)
 #define IORING_URING_CMD_MASK	IORING_URING_CMD_FIXED
+#define IORING_URING_CMD_EXT_MASK	0x00ff0000
 
 
 /*
diff --git a/io_uring/filetable.h b/io_uring/filetable.h
index b2435c4dca1f..d25247c9b9f5 100644
--- a/io_uring/filetable.h
+++ b/io_uring/filetable.h
@@ -43,7 +43,7 @@  io_fixed_file_slot(struct io_file_table *table, unsigned i)
 #define FFS_ISREG		0x2UL
 #define FFS_MASK		~(FFS_NOWAIT|FFS_ISREG)
 
-static inline unsigned int io_slot_flags(struct io_fixed_file *slot)
+static inline unsigned long io_slot_flags(struct io_fixed_file *slot)
 {
 	return (slot->file_ptr & ~FFS_MASK) << REQ_F_SUPPORT_NOWAIT_BIT;
 }
diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index 8df9ad010803..6d4def11aebf 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -109,7 +109,8 @@ 
 			  IOSQE_IO_HARDLINK | IOSQE_ASYNC)
 
 #define SQE_VALID_FLAGS	(SQE_COMMON_FLAGS | IOSQE_BUFFER_SELECT | \
-			IOSQE_IO_DRAIN | IOSQE_CQE_SKIP_SUCCESS)
+			IOSQE_IO_DRAIN | IOSQE_CQE_SKIP_SUCCESS | \
+			IOSQE_HAS_EXT_FLAGS)
 
 #define IO_REQ_CLEAN_FLAGS (REQ_F_BUFFER_SELECTED | REQ_F_NEED_CLEANUP | \
 				REQ_F_POLLED | REQ_F_INFLIGHT | REQ_F_CREDS | \
@@ -2080,6 +2081,17 @@  static int io_init_req(struct io_ring_ctx *ctx, struct io_kiocb *req,
 		/* enforce forwards compatibility on users */
 		if (sqe_flags & ~SQE_VALID_FLAGS)
 			return io_init_fail_req(req, -EINVAL);
+		if (sqe_flags & IOSQE_HAS_EXT_FLAGS) {
+			u32 sqe_ext_flags;
+
+			if (opcode != IORING_OP_URING_CMD)
+				sqe_ext_flags = READ_ONCE(sqe->ext_flags);
+			else
+				sqe_ext_flags = (READ_ONCE(sqe->uring_cmd_flags)
+					& IORING_URING_CMD_EXT_MASK) >> 16;
+			req->flags |= sqe_ext_flags << 8;
+		}
+
 		if (sqe_flags & IOSQE_BUFFER_SELECT) {
 			if (!def->buffer_select)
 				return io_init_fail_req(req, -EOPNOTSUPP);
diff --git a/io_uring/uring_cmd.c b/io_uring/uring_cmd.c
index 334d31dd6628..43b71f29e7b3 100644
--- a/io_uring/uring_cmd.c
+++ b/io_uring/uring_cmd.c
@@ -202,7 +202,8 @@  int io_uring_cmd_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
 	if (sqe->__pad1)
 		return -EINVAL;
 
-	ioucmd->flags = READ_ONCE(sqe->uring_cmd_flags);
+	ioucmd->flags = READ_ONCE(sqe->uring_cmd_flags) &
+		~IORING_URING_CMD_EXT_MASK;
 	if (ioucmd->flags & ~IORING_URING_CMD_MASK)
 		return -EINVAL;