
[v2,0/4] open/accept directly into io_uring fixed file table

Message ID: cover.1628871893.git.asml.silence@gmail.com

Pavel Begunkov Aug. 13, 2021, 4:43 p.m. UTC
Add an optional feature to open/accept directly into io_uring's fixed
file table, bypassing the normal file table. Same behaviour as the
snippet below, but in one operation:

sqe = prep_[open,accept](...);
cqe = submit_and_wait(sqe);
// error handling
io_uring_register_files_update(uring_idx, (fd = cqe->res));
// optionally
close((fd = cqe->res));

The idea is pretty old, and was brought up and implemented a year ago
by Josh Triplett, though it hasn't seen the light of day for various reasons.

Tested on basic cases; liburing patches will be sent out later.

A copy-paste from 2/2 describing the user API and some notes:

The behaviour is controlled by setting sqe->file_index, where 0 implies
the old behaviour. If a non-zero value is specified, it will behave as
described and place the file into fixed file slot sqe->file_index - 1.
The file table should already be created, and the slot should be valid
and empty, otherwise the operation will fail.

Note 1: we can't use IOSQE_FIXED_FILE to switch between modes, because
accept takes a file, and it already uses the flag with a different
meaning.

Note 2: it's a u16, while in theory the limit for fixed file tables
might be increased in the future. If that ever happens, we'll work
around it then, e.g. by making ioprio represent the upper 16 bits.
The layout for open is tight enough already.

since RFC:
 - added attribution
 - updated descriptions
 - rebased

Pavel Begunkov (4):
  net: add accept helper not installing fd
  io_uring: openat directly into fixed fd table
  io_uring: hand code io_accept() fd installing
  io_uring: accept directly into fixed file table

 fs/io_uring.c                 | 114 +++++++++++++++++++++++++++++-----
 include/linux/socket.h        |   3 +
 include/uapi/linux/io_uring.h |   2 +
 net/socket.c                  |  71 +++++++++++----------
 4 files changed, 139 insertions(+), 51 deletions(-)

Comments

Josh Triplett Aug. 13, 2021, 7 p.m. UTC | #1
On Fri, Aug 13, 2021 at 05:43:09PM +0100, Pavel Begunkov wrote:
> Add an optional feature to open/accept directly into io_uring's fixed
> file table, bypassing the normal file table. Same behaviour as the
> snippet below, but in one operation:
> 
> sqe = prep_[open,accept](...);
> cqe = submit_and_wait(sqe);
> // error handling
> io_uring_register_files_update(uring_idx, (fd = cqe->res));
> // optionally
> close((fd = cqe->res));
> 
> The idea is pretty old, and was brought up and implemented a year ago
> by Josh Triplett, though it hasn't seen the light of day for various reasons.

Thank you for working to get this over the finish line!

> Tested on basic cases, will be sent out as liburing patches later.
> 
> A copy paste from 2/2 describing user API and some notes:
> 
> The behaviour is controlled by setting sqe->file_index, where 0 implies
> the old behaviour. If a non-zero value is specified, it will behave as
> described and place the file into fixed file slot sqe->file_index - 1.
> The file table should already be created, and the slot should be valid
> and empty, otherwise the operation will fail.
> 
> Note 1: we can't use IOSQE_FIXED_FILE to switch between modes, because
> accept takes a file, and it already uses the flag with a different
> meaning.
> 
> Note 2: it's a u16, while in theory the limit for fixed file tables
> might be increased in the future. If that ever happens, we'll work
> around it then, e.g. by making ioprio represent the upper 16 bits.
> The layout for open is tight enough already.

Rather than using sqe->file_index - 1, which feels like an error-prone
interface, I think it makes sense to use a dedicated flag for this, like
IOSQE_OPEN_FIXED. That flag could work for any open-like operation,
including open, accept, and in the future many other operations such as
memfd_create. (Imagine using a single ring submission to open a memfd,
write a buffer into it, seal it, send it over a UNIX socket, and then
close it.)

The only downside is that you'll need to reject that flag in all
non-open operations. One way to unify that code might be to add a flag
in io_op_def for open-like operations, and then check in common code for
the case of non-open-like operations passing IOSQE_OPEN_FIXED.

Also, rather than using a 16-bit index for the fixed file table and
potentially requiring expansion into a different field in the future,
what about overlapping it with the nofile field in the open and accept
requests? If they're not opening a normal file descriptor, they don't
need nofile. And in the original sqe, you can then overlap it with a
32-bit field like splice_fd_in.

EEXIST seems like the wrong error-code to use if the index is already in
use; open can already return EEXIST if you pass O_EXCL. How about EBADF,
or better yet EBADSLT which is unlikely to be returned for any other
reason?

- Josh Triplett
Pavel Begunkov Aug. 14, 2021, 12:50 p.m. UTC | #2
On 8/13/21 8:00 PM, Josh Triplett wrote:
> On Fri, Aug 13, 2021 at 05:43:09PM +0100, Pavel Begunkov wrote:
>> Add an optional feature to open/accept directly into io_uring's fixed
>> file table, bypassing the normal file table. Same behaviour as the
>> snippet below, but in one operation:
>>
>> sqe = prep_[open,accept](...);
>> cqe = submit_and_wait(sqe);
>> // error handling
>> io_uring_register_files_update(uring_idx, (fd = cqe->res));
>> // optionally
>> close((fd = cqe->res));
>>
>> The idea is pretty old, and was brought up and implemented a year ago
>> by Josh Triplett, though it hasn't seen the light of day for various reasons.
> 
> Thank you for working to get this over the finish line!
> 
>> Tested on basic cases, will be sent out as liburing patches later.
>>
>> A copy paste from 2/2 describing user API and some notes:
>>
>> The behaviour is controlled by setting sqe->file_index, where 0 implies
>> the old behaviour. If a non-zero value is specified, it will behave as
>> described and place the file into fixed file slot sqe->file_index - 1.
>> The file table should already be created, and the slot should be valid
>> and empty, otherwise the operation will fail.
>>
>> Note 1: we can't use IOSQE_FIXED_FILE to switch between modes, because
>> accept takes a file, and it already uses the flag with a different
>> meaning.
>>
>> Note 2: it's a u16, while in theory the limit for fixed file tables
>> might be increased in the future. If that ever happens, we'll work
>> around it then, e.g. by making ioprio represent the upper 16 bits.
>> The layout for open is tight enough already.
> 
> Rather than using sqe->file_index - 1, which feels like an error-prone
> interface, I think it makes sense to use a dedicated flag for this, like
> IOSQE_OPEN_FIXED. That flag could work for any open-like operation,
> including open, accept, and in the future many other operations such as
> memfd_create. (Imagine using a single ring submission to open a memfd,
> write a buffer into it, seal it, send it over a UNIX socket, and then
> close it.)
> 
> The only downside is that you'll need to reject that flag in all
> non-open operations. One way to unify that code might be to add a flag
> in io_op_def for open-like operations, and then check in common code for
> the case of non-open-like operations passing IOSQE_OPEN_FIXED.

io_uring is really thin, and so I absolutely don't want any extra
overhead in the generic path, IOW anything affecting
reads/writes/sends/recvs.

The other reason is that there are only 2 bits left in sqe->flags,
and we may want to use them for something better, considering that
it's only open/accept and not much else.

I agree that it feels error-prone, but at least it can be wrapped
nicely enough in liburing, e.g.

void io_uring_prep_openat_direct(struct io_uring_sqe *sqe, int dfd,
				 const char *path, int flags,
				 mode_t mode, int slot_idx);


> Also, rather than using a 16-bit index for the fixed file table and
> potentially requiring expansion into a different field in the future,
> what about overlapping it with the nofile field in the open and accept
> requests? If they're not opening a normal file descriptor, they don't
> need nofile. And in the original sqe, you can then overlap it with a
> 32-bit field like splice_fd_in.

There is no nofile in SQEs, though

req->open.nofile = rlimit(RLIMIT_NOFILE);
 
> EEXIST seems like the wrong error-code to use if the index is already in
> use; open can already return EEXIST if you pass O_EXCL. How about EBADF,
> or better yet EBADSLT which is unlikely to be returned for any other
> reason?

Sure, sounds better indeed!
Jens Axboe Aug. 14, 2021, 11:03 p.m. UTC | #3
On 8/14/21 6:50 AM, Pavel Begunkov wrote:
> On 8/13/21 8:00 PM, Josh Triplett wrote:
>> On Fri, Aug 13, 2021 at 05:43:09PM +0100, Pavel Begunkov wrote:
>>> Add an optional feature to open/accept directly into io_uring's fixed
>>> file table, bypassing the normal file table. Same behaviour as the
>>> snippet below, but in one operation:
>>>
>>> sqe = prep_[open,accept](...);
>>> cqe = submit_and_wait(sqe);
>>> // error handling
>>> io_uring_register_files_update(uring_idx, (fd = cqe->res));
>>> // optionally
>>> close((fd = cqe->res));
>>>
>>> The idea is pretty old, and was brought up and implemented a year ago
>>> by Josh Triplett, though it hasn't seen the light of day for various reasons.
>>
>> Thank you for working to get this over the finish line!
>>
>>> Tested on basic cases, will be sent out as liburing patches later.
>>>
>>> A copy paste from 2/2 describing user API and some notes:
>>>
>>> The behaviour is controlled by setting sqe->file_index, where 0 implies
>>> the old behaviour. If a non-zero value is specified, it will behave as
>>> described and place the file into fixed file slot sqe->file_index - 1.
>>> The file table should already be created, and the slot should be valid
>>> and empty, otherwise the operation will fail.
>>>
>>> Note 1: we can't use IOSQE_FIXED_FILE to switch between modes, because
>>> accept takes a file, and it already uses the flag with a different
>>> meaning.
>>>
>>> Note 2: it's a u16, while in theory the limit for fixed file tables
>>> might be increased in the future. If that ever happens, we'll work
>>> around it then, e.g. by making ioprio represent the upper 16 bits.
>>> The layout for open is tight enough already.
>>
>> Rather than using sqe->file_index - 1, which feels like an error-prone
>> interface, I think it makes sense to use a dedicated flag for this, like
>> IOSQE_OPEN_FIXED. That flag could work for any open-like operation,
>> including open, accept, and in the future many other operations such as
>> memfd_create. (Imagine using a single ring submission to open a memfd,
>> write a buffer into it, seal it, send it over a UNIX socket, and then
>> close it.)
>>
>> The only downside is that you'll need to reject that flag in all
>> non-open operations. One way to unify that code might be to add a flag
>> in io_op_def for open-like operations, and then check in common code for
>> the case of non-open-like operations passing IOSQE_OPEN_FIXED.
> 
> io_uring is really thin, and so I absolutely don't want any extra
> overhead in the generic path, IOW anything affecting
> reads/writes/sends/recvs.
> 
> The other reason is that there are only 2 bits left in sqe->flags,
> and we may want to use them for something better, considering that
> it's only open/accept and not much else.
> 
> I agree that it feels error-prone, but at least it can be wrapped
> nicely enough in liburing, e.g.
> 
> void io_uring_prep_openat_direct(struct io_uring_sqe *sqe, int dfd,
> 				 const char *path, int flags,
> 				 mode_t mode, int slot_idx);
> 
> 
>> Also, rather than using a 16-bit index for the fixed file table and
>> potentially requiring expansion into a different field in the future,
>> what about overlapping it with the nofile field in the open and accept
>> requests? If they're not opening a normal file descriptor, they don't
>> need nofile. And in the original sqe, you can then overlap it with a
>> 32-bit field like splice_fd_in.
> 
> There is no nofile in SQEs, though
> 
> req->open.nofile = rlimit(RLIMIT_NOFILE);

What's the plan in terms of limiting the amount of direct descriptors
(for lack of a better word)? That seems like an important aspect that
should get sorted out upfront.

Do we include the regular file table max_fds count for creating a new
direct descriptor, and limit to RLIMIT_NOFILE? That would seem logical,
but then that also implies that the regular file table should include
the ctx (potentially several) direct descriptors. And the latter is much
worse.

Maybe we have a way to size the direct table, which will consume entries
from the same pool that the regular file table does? That would then
work both ways, and could potentially just be done dynamically similarly
to how we expand the regular file table when we exceed its current size.

Anyway, just throwing a few ideas out there, with the intent to spark a
bit of discussion on this topic. I really like the direct descriptors,
it'll be a lot more efficient for certain use cases.
Josh Triplett Aug. 15, 2021, 3:31 a.m. UTC | #4
On Sat, Aug 14, 2021 at 01:50:24PM +0100, Pavel Begunkov wrote:
> On 8/13/21 8:00 PM, Josh Triplett wrote:
> > Rather than using sqe->file_index - 1, which feels like an error-prone
> > interface, I think it makes sense to use a dedicated flag for this, like
> > IOSQE_OPEN_FIXED. That flag could work for any open-like operation,
> > including open, accept, and in the future many other operations such as
> > memfd_create. (Imagine using a single ring submission to open a memfd,
> > write a buffer into it, seal it, send it over a UNIX socket, and then
> > close it.)
> > 
> > The only downside is that you'll need to reject that flag in all
> > non-open operations. One way to unify that code might be to add a flag
> > in io_op_def for open-like operations, and then check in common code for
> > the case of non-open-like operations passing IOSQE_OPEN_FIXED.
> 
> io_uring is really thin, and so I absolutely don't want any extra
> overhead in the generic path, IOW anything affecting
> reads/writes/sends/recvs.

There are already several checks for valid flags in io_init_req. For
instance:
        if ((sqe_flags & IOSQE_BUFFER_SELECT) &&
            !io_op_defs[req->opcode].buffer_select)
                return -EOPNOTSUPP;
It'd be trivial to make io_op_defs have a "valid flags" byte, and one
bitwise op tells you if any invalid flags were passed. *Zero* additional
overhead for other operations.

Alternatively, since there are so few operations that open a file
descriptor, you could just add a separate opcode for those few
operations. That still seems preferable to overloading a 16-bit index
field for this.

With this new mechanism, I think we're going to want to support more
than 65535 fixed-file entries. I can easily imagine wanting to handle
hundreds of thousands of files or sockets this way.

> The other reason is that there are only 2 bits left in sqe->flags,
> and we may want to use them for something better, considering that
> it's only open/accept and not much else.

pipe, dup3, socket, socketpair, pidfds (via either pidfd_open or a
ring-based spawn mechanism), epoll_create, inotify, fanotify, signalfd,
timerfd, eventfd, memfd_create, userfaultfd, open_tree, fsopen, fsmount,
memfd_secret.

Of those, I personally would *love* to have at least pipe, socket,
pidfd, memfd_create, and fsopen/fsmount/open_tree, plus some manner of
dup-like operation for moving things between the fixed-file table and
file descriptors.

I think this is valuable and versatile enough to merit a flag. It would
also be entirely reasonable to create separate operations for these. But
either way, I don't think this should just be determined by whether a
16-bit index is non-zero.

> I agree that it feels error-prone, but at least it can be wrapped
> nicely enough in liburing, e.g.
> 
> void io_uring_prep_openat_direct(struct io_uring_sqe *sqe, int dfd,
> 				 const char *path, int flags,
> 				 mode_t mode, int slot_idx);

That wrapper wouldn't be able to handle more than a 16-bit slot index
though.

> > Also, rather than using a 16-bit index for the fixed file table and
> > potentially requiring expansion into a different field in the future,
> > what about overlapping it with the nofile field in the open and accept
> > requests? If they're not opening a normal file descriptor, they don't
> > need nofile. And in the original sqe, you can then overlap it with a
> > 32-bit field like splice_fd_in.
> 
> There is no nofile in SQEs, though
> 
> req->open.nofile = rlimit(RLIMIT_NOFILE);

nofile isn't needed for opening into the fixed-file table, so it could
be omitted in that case, and another field unioned with it. That would
allow passing a 32-bit fixed-file index into open and accept without
growing the size of their structures. I think, with this new capability,
we're going to want a large number of fixed files available.

In the SQE, you could overlap it with the splice_fd_in field, which
isn't needed by any calls other than splice.

- Josh Triplett
Josh Triplett Aug. 15, 2021, 3:42 a.m. UTC | #5
On Sat, Aug 14, 2021 at 05:03:44PM -0600, Jens Axboe wrote:
> What's the plan in terms of limiting the amount of direct descriptors
> (for lack of a better word)? That seems like an important aspect that
> should get sorted out upfront.
[...]
> Maybe we have a way to size the direct table, which will consume entries
> from the same pool that the regular file table does? That would then
> work both ways, and could potentially just be done dynamically similarly
> to how we expand the regular file table when we exceed its current size.

I think we'll want a way to size the direct table regardless, so that
it's pre-allocated and doesn't need to be resized when an index is used.
Then, we could do one of two equally easy things, depending on what
policy we want to set:

- Deduct the full size of the fixed-file table from the allowed number
  of files the process can have open. So, if RLIMIT_NOFILE is 1048576,
  and you pre-allocate 1000000 entries in the fixed-file table, you can
  have no more than 48576 file descriptors open. Stricter, but
  potentially problematic: a program *might* expect that it can
  dup2(some_fd, nofile - 1) successfully.

- Use RLIMIT_NOFILE as the maximum size of the fixed-file table. There's
  precedent for this: we already use RLIMIT_NOFILE as the maximum number
  of file descriptors you can have in flight over UNIX sockets.

I personally would favor the latter; it seems simple and
straightforward.
Pavel Begunkov Aug. 15, 2021, 10:48 a.m. UTC | #6
On 8/15/21 4:31 AM, Josh Triplett wrote:
> On Sat, Aug 14, 2021 at 01:50:24PM +0100, Pavel Begunkov wrote:
>> On 8/13/21 8:00 PM, Josh Triplett wrote:
>>> Rather than using sqe->file_index - 1, which feels like an error-prone
>>> interface, I think it makes sense to use a dedicated flag for this, like
>>> IOSQE_OPEN_FIXED. That flag could work for any open-like operation,
>>> including open, accept, and in the future many other operations such as
>>> memfd_create. (Imagine using a single ring submission to open a memfd,
>>> write a buffer into it, seal it, send it over a UNIX socket, and then
>>> close it.)
>>>
>>> The only downside is that you'll need to reject that flag in all
>>> non-open operations. One way to unify that code might be to add a flag
>>> in io_op_def for open-like operations, and then check in common code for
>>> the case of non-open-like operations passing IOSQE_OPEN_FIXED.
>>
>> io_uring is really thin, and so I absolutely don't want any extra
>> overhead in the generic path, IOW anything affecting
>> reads/writes/sends/recvs.
> 
> There are already several checks for valid flags in io_init_req. For
> instance:

Yes, it's horrible and I don't want to make it any worse.

>         if ((sqe_flags & IOSQE_BUFFER_SELECT) &&
>             !io_op_defs[req->opcode].buffer_select)
>                 return -EOPNOTSUPP;
> It'd be trivial to make io_op_defs have a "valid flags" byte, and one
> bitwise op tells you if any invalid flags were passed. *Zero* additional
> overhead for other operations.

Good point

> Alternatively, since there are so few operations that open a file
> descriptor, you could just add a separate opcode for those few
> operations. That still seems preferable to overloading a 16-bit index
> field for this.

I don't think so

> With this new mechanism, I think we're going to want to support more
> than 65535 fixed-file entries. I can easily imagine wanting to handle
> hundreds of thousands of files or sockets this way.

Maybe. What's curious is that the feature doesn't really change
anything in this regard, yet I haven't heard people asking for
larger tables.

>> The other reason is that there are only 2 bits left in sqe->flags,
>> and we may want to use them for something better, considering that
>> it's only open/accept and not much else.
> 
> pipe, dup3, socket, socketpair, pidfds (via either pidfd_open or a
> ring-based spawn mechanism), epoll_create, inotify, fanotify, signalfd,
> timerfd, eventfd, memfd_create, userfaultfd, open_tree, fsopen, fsmount,
> memfd_secret.

We could argue about whether many of those should be in io_uring,
and whether there are many benefits to having them async and so on.
It would be a different story if the whole ecosystem were io_uring
centric, but that's speculation.

> Of those, I personally would *love* to have at least pipe, socket,
> pidfd, memfd_create, and fsopen/fsmount/open_tree, plus some manner of
> dup-like operation for moving things between the fixed-file table and
> file descriptors.
> 
> I think this is valuable and versatile enough to merit a flag. It would
> also be entirely reasonable to create separate operations for these. But
> either way, I don't think this should just be determined by whether a
> 16-bit index is non-zero.
> 
>> I agree that it feels error-prone, but at least it can be wrapped
>> nicely enough in liburing, e.g.
>>
>> void io_uring_prep_openat_direct(struct io_uring_sqe *sqe, int dfd,
>> 				 const char *path, int flags,
>> 				 mode_t mode, int slot_idx);
> 
> That wrapper wouldn't be able to handle more than a 16-bit slot index
> though.

It would. Note, the index is "int" there, so if it doesn't fit
into a u16, we can fail the request. And do a conversion if required.

>>> Also, rather than using a 16-bit index for the fixed file table and
>>> potentially requiring expansion into a different field in the future,
>>> what about overlapping it with the nofile field in the open and accept
>>> requests? If they're not opening a normal file descriptor, they don't
>>> need nofile. And in the original sqe, you can then overlap it with a
>>> 32-bit field like splice_fd_in.
>>
>> There is no nofile in SQEs, though
>>
>> req->open.nofile = rlimit(RLIMIT_NOFILE);
> 
> nofile isn't needed for opening into the fixed-file table, so it could
> be omitted in that case, and another field unioned with it.

There is no problem placing it internally. Moreover, it's at the
moment uniformly placed inside io_kiocb, but with nofile we'd need
to find a place on a per-op basis.

Not that any of it matters, it's just bike-shedding.

> allow passing a 32-bit fixed-file index into open and accept without
> growing the size of their structures. I think, with this new capability,
> we're going to want a large number of fixed files available.
> 
> In the SQE, you could overlap it with the splice_fd_in field, which
> isn't needed by any calls other than splice.

But that doesn't mean it won't be used, as has happened with pretty
much every other field in the SQE. So it rather depends on what
packing is wanted. And reusing the almost never used ->buf_index
(and potentially ->ioprio) sounds reasonable.
Pavel Begunkov Aug. 15, 2021, 1 p.m. UTC | #7
On 8/15/21 12:03 AM, Jens Axboe wrote:
> On 8/14/21 6:50 AM, Pavel Begunkov wrote:
>> On 8/13/21 8:00 PM, Josh Triplett wrote:
>>> On Fri, Aug 13, 2021 at 05:43:09PM +0100, Pavel Begunkov wrote:
>>>> Add an optional feature to open/accept directly into io_uring's fixed
>>>> file table, bypassing the normal file table. Same behaviour as the
>>>> snippet below, but in one operation:
>>>>
>>>> sqe = prep_[open,accept](...);
>>>> cqe = submit_and_wait(sqe);
>>>> // error handling
>>>> io_uring_register_files_update(uring_idx, (fd = cqe->res));
>>>> // optionally
>>>> close((fd = cqe->res));
>>>>
>>>> The idea is pretty old, and was brought up and implemented a year ago
>>>> by Josh Triplett, though it hasn't seen the light of day for various reasons.
>>>
>>> Thank you for working to get this over the finish line!
>>>
>>>> Tested on basic cases, will be sent out as liburing patches later.
>>>>
>>>> A copy paste from 2/2 describing user API and some notes:
>>>>
>>>> The behaviour is controlled by setting sqe->file_index, where 0 implies
>>>> the old behaviour. If a non-zero value is specified, it will behave as
>>>> described and place the file into fixed file slot sqe->file_index - 1.
>>>> The file table should already be created, and the slot should be valid
>>>> and empty, otherwise the operation will fail.
>>>>
>>>> Note 1: we can't use IOSQE_FIXED_FILE to switch between modes, because
>>>> accept takes a file, and it already uses the flag with a different
>>>> meaning.
>>>>
>>>> Note 2: it's a u16, while in theory the limit for fixed file tables
>>>> might be increased in the future. If that ever happens, we'll work
>>>> around it then, e.g. by making ioprio represent the upper 16 bits.
>>>> The layout for open is tight enough already.
>>>
>>> Rather than using sqe->file_index - 1, which feels like an error-prone
>>> interface, I think it makes sense to use a dedicated flag for this, like
>>> IOSQE_OPEN_FIXED. That flag could work for any open-like operation,
>>> including open, accept, and in the future many other operations such as
>>> memfd_create. (Imagine using a single ring submission to open a memfd,
>>> write a buffer into it, seal it, send it over a UNIX socket, and then
>>> close it.)
>>>
>>> The only downside is that you'll need to reject that flag in all
>>> non-open operations. One way to unify that code might be to add a flag
>>> in io_op_def for open-like operations, and then check in common code for
>>> the case of non-open-like operations passing IOSQE_OPEN_FIXED.
>>
>> io_uring is really thin, and so I absolutely don't want any extra
>> overhead in the generic path, IOW anything affecting
>> reads/writes/sends/recvs.
>>
>> The other reason is that there are only 2 bits left in sqe->flags,
>> and we may want to use them for something better, considering that
>> it's only open/accept and not much else.
>>
>> I agree that it feels error-prone, but at least it can be wrapped
>> nicely enough in liburing, e.g.
>>
>> void io_uring_prep_openat_direct(struct io_uring_sqe *sqe, int dfd,
>> 				 const char *path, int flags,
>> 				 mode_t mode, int slot_idx);
>>
>>
>>> Also, rather than using a 16-bit index for the fixed file table and
>>> potentially requiring expansion into a different field in the future,
>>> what about overlapping it with the nofile field in the open and accept
>>> requests? If they're not opening a normal file descriptor, they don't
>>> need nofile. And in the original sqe, you can then overlap it with a
>>> 32-bit field like splice_fd_in.
>>
>> There is no nofile in SQEs, though
>>
>> req->open.nofile = rlimit(RLIMIT_NOFILE);
> 
> What's the plan in terms of limiting the amount of direct descriptors
> (for lack of a better word)? That seems like an important aspect that
> should get sorted out upfront.

As was brought up before, I agree that it has to be solved. However,
I don't think it should hold up this feature, as the same state can
perfectly well be achieved without it:

fd = open();
io_uring_register(fd);
close(fd);

> Do we include the regular file table max_fds count for creating a new
> direct descriptor, and limit to RLIMIT_NOFILE? That would seem logical,
> but then that also implies that the regular file table should include
> the ctx (potentially several) direct descriptors. And the latter is much
> worse.

To which object are we binding the accounting? To the task that
created the ring? I'd be afraid of the following case then:

fork(NO_FDTABLE_SHARE, callback -> {
	ring = create_io_uring();
	io_uring_register_fds(&ring);
	pass_ring_to_parent(ring);
	// e.g. via socket or so.
	exit();
});

Restricting based on the user may be a better option, but that's
not without problems either.

Another option, which is too ugly to exist but has to be mentioned,
is to count the number of tasks and io_urings together. Maybe that
can spark some better idea.

Also, do we have anything related in cgroups/namespaces?

> Maybe we have a way to size the direct table, which will consume entries
> from the same pool that the regular file table does? That would then
> work both ways, and could potentially just be done dynamically similarly
> to how we expand the regular file table when we exceed its current size.
> 
> Anyway, just throwing a few ideas out there, with the intent to spark a
> bit of discussion on this topic. I really like the direct descriptors,
> it'll be a lot more efficient for certain use cases.
>
Pavel Begunkov Aug. 15, 2021, 2:23 p.m. UTC | #8
On 8/15/21 11:48 AM, Pavel Begunkov wrote:
> On 8/15/21 4:31 AM, Josh Triplett wrote:
>> On Sat, Aug 14, 2021 at 01:50:24PM +0100, Pavel Begunkov wrote:
>>> On 8/13/21 8:00 PM, Josh Triplett wrote:
>>>> Rather than using sqe->file_index - 1, which feels like an error-prone
>>>> interface, I think it makes sense to use a dedicated flag for this, like
>>>> IOSQE_OPEN_FIXED. That flag could work for any open-like operation,
>>>> including open, accept, and in the future many other operations such as
>>>> memfd_create. (Imagine using a single ring submission to open a memfd,
>>>> write a buffer into it, seal it, send it over a UNIX socket, and then
>>>> close it.)
>>>>
>>>> The only downside is that you'll need to reject that flag in all
>>>> non-open operations. One way to unify that code might be to add a flag
>>>> in io_op_def for open-like operations, and then check in common code for
>>>> the case of non-open-like operations passing IOSQE_OPEN_FIXED.
>>>
>>> io_uring is really thin, and so I absolutely don't want any extra
>>> overhead in the generic path, IOW anything affecting
>>> reads/writes/sends/recvs.
>>
>> There are already several checks for valid flags in io_init_req. For
>> instance:
> 
> Yes, it's horrible and I don't want to make it any worse.
> 
>>         if ((sqe_flags & IOSQE_BUFFER_SELECT) &&
>>             !io_op_defs[req->opcode].buffer_select)
>>                 return -EOPNOTSUPP;
>> It'd be trivial to make io_op_defs have a "valid flags" byte, and one
>> bitwise op tells you if any invalid flags were passed. *Zero* additional
>> overhead for other operations.
> 
> Good point
> 
>> Alternatively, since there are so few operations that open a file
>> descriptor, you could just add a separate opcode for those few
>> operations. That still seems preferable to overloading a 16-bit index
>> field for this.
> 
> I don't think so
> 
>> With this new mechanism, I think we're going to want to support more
>> than 65535 fixed-file entries. I can easily imagine wanting to handle
>> hundreds of thousands of files or sockets this way.
> 
> Maybe. What's curious is that the feature doesn't really change
> anything in this regard, yet I haven't heard people asking for
> larger tables.
> 
>>> The other reason is that there are only 2 bits left in sqe->flags,
>>> and we may want to use them for something better, considering that
>>> it's only open/accept and not much else.
>>
>> pipe, dup3, socket, socketpair, pidfds (via either pidfd_open or a
>> ring-based spawn mechanism), epoll_create, inotify, fanotify, signalfd,
>> timerfd, eventfd, memfd_create, userfaultfd, open_tree, fsopen, fsmount,
>> memfd_secret.
> 
> We could argue about many of those, whether they should be in io_uring
> and whether there is much benefit to having them async. It would
> be another story if the whole ecosystem were io_uring centric, but
> that's speculation.
> 
>> Of those, I personally would *love* to have at least pipe, socket,
>> pidfd, memfd_create, and fsopen/fsmount/open_tree, plus some manner of
>> dup-like operation for moving things between the fixed-file table and
>> file descriptors.
>>
>> I think this is valuable and versatile enough to merit a flag. It would
>> also be entirely reasonable to create separate operations for these. But
>> either way, I don't think this should just be determined by whether a
>> 16-bit index is non-zero.
>>
>>> I agree that it feels error-prone, but at least it can be wrapped
>>> nicely enough in liburing, e.g.
>>>
>>> void io_uring_prep_openat_direct(struct io_uring_sqe *sqe, int dfd,
>>> 				 const char *path, int flags,
>>> 				 mode_t mode, int slot_idx);
>>
>> That wrapper wouldn't be able to handle more than a 16-bit slot index
>> though.
> 
> It would. Note that the index is "int" there, so if it doesn't fit
> into u16, we can fail the request, and do a conversion if required.
> 
>>>> Also, rather than using a 16-bit index for the fixed file table and
>>>> potentially requiring expansion into a different field in the future,
>>>> what about overlapping it with the nofile field in the open and accept
>>>> requests? If they're not opening a normal file descriptor, they don't
>>>> need nofile. And in the original sqe, you can then overlap it with a
>>>> 32-bit field like splice_fd_in.
>>>
>>> There is no nofile in SQEs, though
>>>
>>> req->open.nofile = rlimit(RLIMIT_NOFILE);
>>
>> nofile isn't needed for opening into the fixed-file table, so it could
>> be omitted in that case, and another field unioned with it.
> 
> There is no problem placing it internally. Moreover, at the moment
> it's uniformly placed inside io_kiocb, but unioned with nofile we'd
> need to find a place on a per-op basis.
> 
> Not that any of it matters, it's just bike-shedding.
> 
>> That would allow passing a 32-bit fixed-file index into open and
>> accept without growing the size of their structures. I think, with
>> this new capability, we're going to want a large number of fixed
>> files available.
>>
>> In the SQE, you could overlap it with the splice_fd_in field, which
>> isn't needed by any calls other than splice.
> 
> But that doesn't mean it won't be used, as has happened with pretty
> much every other field in the SQE. So it rather depends on what
> packing is wanted, and reusing the almost never used ->buf_index
> (and potentially ->ioprio) sounds reasonable.

Aliasing with ->splice_fd_in does look better indeed (apart from it
inherently not being checked, but meh?). But I still don't think
sqe->flags is a good option to use, so we still need some way
to switch between modes.

It could be sqe->rw_flags, as was once done with SPLICE_F_FD_IN_FIXED,
but IMHO that's an ugly, hackish way. I still lean towards the
0 vs >0 encoding.
Jens Axboe Aug. 15, 2021, 3:05 p.m. UTC | #9
On 8/14/21 9:42 PM, Josh Triplett wrote:
> On Sat, Aug 14, 2021 at 05:03:44PM -0600, Jens Axboe wrote:
>> What's the plan in terms of limiting the amount of direct descriptors
>> (for lack of a better word)? That seems like an important aspect that
>> should get sorted out upfront.
> [...]
>> Maybe we have a way to size the direct table, which will consume entries
>> from the same pool that the regular file table does? That would then
>> work both ways, and could potentially just be done dynamically similarly
>> to how we expand the regular file table when we exceed its current size.
> 
> I think we'll want a way to size the direct table regardless, so that
> it's pre-allocated and doesn't need to be resized when an index is used.

But how do you size it then? I can see this being used into the hundreds
of thousands of fds easily, and right now the table is just an array
(though split into segments, avoiding huge allocs).

> Then, we could do one of two equally easy things, depending on what
> policy we want to set:
> 
> - Deduct the full size of the fixed-file table from the allowed number
>   of files the process can have open. So, if RLIMIT_NOFILE is 1048576,
>   and you pre-allocate 1000000 entries in the fixed-file table, you can
>   have no more than 48576 file descriptors open. Stricter, but
>   potentially problematic: a program *might* expect that it can
>   dup2(some_fd, nofile - 1) successfully.
> 
> - Use RLIMIT_NOFILE as the maximum size of the fixed-file table. There's
>   precedent for this: we already use RLIMIT_NOFILE as the maximum number
>   of file descriptors you can have in flight over UNIX sockets.
> 
> I personally would favor the latter; it seems simple and
> straightforward.

I strongly prefer the latter too, and hopefully that's palatable since
the default limits are quite low anyway. And, as you say, it already is
done for inflight fds as well.
Jens Axboe Aug. 15, 2021, 3:12 p.m. UTC | #10
On 8/15/21 9:05 AM, Jens Axboe wrote:
> On 8/14/21 9:42 PM, Josh Triplett wrote:
>> On Sat, Aug 14, 2021 at 05:03:44PM -0600, Jens Axboe wrote:
>>> What's the plan in terms of limiting the amount of direct descriptors
>>> (for lack of a better word)? That seems like an important aspect that
>>> should get sorted out upfront.
>> [...]
>>> Maybe we have a way to size the direct table, which will consume entries
>>> from the same pool that the regular file table does? That would then
>>> work both ways, and could potentially just be done dynamically similarly
>>> to how we expand the regular file table when we exceed its current size.
>>
>> I think we'll want a way to size the direct table regardless, so that
>> it's pre-allocated and doesn't need to be resized when an index is used.
> 
> But how do you size it then? I can see this being used into the hundreds
> of thousands of fds easily, and right now the table is just an array
> (though split into segments, avoiding huge allocs).

I guess that will just naturally follow from registering an empty set
of a given size initially. So it should not actually be a problem...
Stefan Metzmacher Aug. 16, 2021, 3:45 p.m. UTC | #11
Hi Pavel,

> The behaviour is controlled by setting sqe->file_index, where 0 implies
> the old behaviour. If a non-zero value is specified, then it will
> behave as described and place the file into the fixed file slot
> sqe->file_index - 1. A file table should already be created, and the
> slot should be valid and empty, otherwise the operation will fail.
> 
> Note 1: we can't use IOSQE_FIXED_FILE to switch between modes, because
> accept takes a file, and it already uses the flag with a different
> meaning.

Would it be hard to support IOSQE_FIXED_FILE for the dirfd of openat*, renameat, unlinkat, statx?
(And mkdirat, linkat, symlinkat when they arrive)
renameat and linkat might be trickier as they take two dirfds, but it
would make the feature more complete and useful.

metze
Pavel Begunkov Aug. 17, 2021, 9:33 a.m. UTC | #12
On 8/16/21 4:45 PM, Stefan Metzmacher wrote:
> Hi Pavel,
> 
>> The behaviour is controlled by setting sqe->file_index, where 0 implies
>> the old behaviour. If a non-zero value is specified, then it will
>> behave as described and place the file into the fixed file slot
>> sqe->file_index - 1. A file table should already be created, and the
>> slot should be valid and empty, otherwise the operation will fail.
>>
>> Note 1: we can't use IOSQE_FIXED_FILE to switch between modes, because
>> accept takes a file, and it already uses the flag with a different
>> meaning.
> 
> Would it be hard to support IOSQE_FIXED_FILE for the dirfd of openat*, renameat, unlinkat, statx?
> (And mkdirat, linkat, symlinkat when they arrive)
> renameat and linkat might be trickier as they take two dirfds, but it
> would make the feature more complete and useful.

Good idea. There is nothing blocking it on the io_uring side, but
the fs part may get ugly, e.g. too intrusive. We definitely need
to take a look.
Jens Axboe Aug. 17, 2021, 2:57 p.m. UTC | #13
On 8/17/21 3:33 AM, Pavel Begunkov wrote:
> On 8/16/21 4:45 PM, Stefan Metzmacher wrote:
>> Hi Pavel,
>>
>>> The behaviour is controlled by setting sqe->file_index, where 0 implies
>>> the old behaviour. If a non-zero value is specified, then it will
>>> behave as described and place the file into the fixed file slot
>>> sqe->file_index - 1. A file table should already be created, and the
>>> slot should be valid and empty, otherwise the operation will fail.
>>>
>>> Note 1: we can't use IOSQE_FIXED_FILE to switch between modes, because
>>> accept takes a file, and it already uses the flag with a different
>>> meaning.
>>
>> Would it be hard to support IOSQE_FIXED_FILE for the dirfd of openat*, renameat, unlinkat, statx?
>> (And mkdirat, linkat, symlinkat when they arrive)
>> renameat and linkat might be trickier as they take two dirfds, but it
>> would make the feature more complete and useful.
> 
> Good idea. There is nothing blocking on the io_uring side, but
> the fs part may get ugly, e.g. too intrusive. We definitely need
> to take a look

Indeed, the io_uring side is trivial, but the VFS interface would
require a lot of manhandling... That's why I didn't add support for
fixed files originally.