Message ID | cover.1628871893.git.asml.silence@gmail.com (mailing list archive)
---|---
Series | open/accept directly into io_uring fixed file table
On Fri, Aug 13, 2021 at 05:43:09PM +0100, Pavel Begunkov wrote:
> Add an optional feature to open/accept directly into io_uring's fixed
> file table, bypassing the normal file table. Same behaviour as the
> snippet below, but in one operation:
>
> sqe = prep_[open,accept](...);
> cqe = submit_and_wait(sqe);
> // error handling
> io_uring_register_files_update(uring_idx, (fd = cqe->res));
> // optionally
> close((fd = cqe->res));
>
> The idea is pretty old, and was brought up and implemented a year ago
> by Josh Triplett, though it hasn't seen the light for some reason.

Thank you for working to get this over the finish line!

> Tested on basic cases, will be sent out as liburing patches later.
>
> A copy-paste from 2/2 describing the user API and some notes:
>
> The behaviour is controlled by setting sqe->file_index, where 0 implies
> the old behaviour. If a non-zero value is specified, the request will
> behave as described and place the file into the fixed file slot
> sqe->file_index - 1. A file table should already be created, and the
> slot should be valid and empty, otherwise the operation will fail.
>
> Note 1: we can't use IOSQE_FIXED_FILE to switch between modes, because
> accept takes a file, and it already uses the flag with a different
> meaning.
>
> Note 2: it's u16, whereas in theory the limit for fixed file tables
> might get increased in the future. If that ever happens, we can work
> around it later, e.g. by making ioprio represent the upper 16 bits.
> The layout for open is already tight enough.

Rather than using sqe->file_index - 1, which feels like an error-prone
interface, I think it makes sense to use a dedicated flag for this, like
IOSQE_OPEN_FIXED. That flag could work for any open-like operation,
including open, accept, and in the future many other operations such as
memfd_create. (Imagine using a single ring submission to open a memfd,
write a buffer into it, seal it, send it over a UNIX socket, and then
close it.)

The only downside is that you'll need to reject that flag in all
non-open operations. One way to unify that code might be to add a flag
in io_op_def for open-like operations, and then check in common code for
the case of non-open-like operations passing IOSQE_OPEN_FIXED.

Also, rather than using a 16-bit index for the fixed file table and
potentially requiring expansion into a different field in the future,
what about overlapping it with the nofile field in the open and accept
requests? If they're not opening a normal file descriptor, they don't
need nofile. And in the original sqe, you can then overlap it with a
32-bit field like splice_fd_in.

EEXIST seems like the wrong error code to use if the index is already in
use; open can already return EEXIST if you pass O_EXCL. How about EBADF,
or better yet EBADSLT, which is unlikely to be returned for any other
reason?

- Josh Triplett
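For illustration, a minimal sketch of the one-step flow this series enables, assuming the sqe->file_index semantics from the cover letter (0 keeps the old behaviour, N targets fixed slot N - 1); the helper name and slot handling are illustrative, not part of the posted patches:

    #include <fcntl.h>
    #include <liburing.h>

    /* Open `path` straight into fixed-file slot `slot` with a single SQE,
     * instead of open + io_uring_register_files_update + close. */
    static int open_into_slot(struct io_uring *ring, const char *path, int slot)
    {
            struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
            struct io_uring_cqe *cqe;
            int ret;

            io_uring_prep_openat(sqe, AT_FDCWD, path, O_RDONLY, 0);
            sqe->file_index = slot + 1;     /* non-zero selects the fixed table */

            io_uring_submit(ring);
            ret = io_uring_wait_cqe(ring, &cqe);
            if (ret)
                    return ret;
            ret = cqe->res;                 /* negative errno on failure */
            io_uring_cqe_seen(ring, cqe);
            return ret;
    }

The resulting file never occupies a regular descriptor; later requests reference it with IOSQE_FIXED_FILE and the slot index.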
On 8/13/21 8:00 PM, Josh Triplett wrote:
> On Fri, Aug 13, 2021 at 05:43:09PM +0100, Pavel Begunkov wrote:
>> Add an optional feature to open/accept directly into io_uring's fixed
>> file table, bypassing the normal file table.
>> [...]
>
> Rather than using sqe->file_index - 1, which feels like an error-prone
> interface, I think it makes sense to use a dedicated flag for this, like
> IOSQE_OPEN_FIXED. That flag could work for any open-like operation,
> including open, accept, and in the future many other operations such as
> memfd_create. (Imagine using a single ring submission to open a memfd,
> write a buffer into it, seal it, send it over a UNIX socket, and then
> close it.)
>
> The only downside is that you'll need to reject that flag in all
> non-open operations. One way to unify that code might be to add a flag
> in io_op_def for open-like operations, and then check in common code for
> the case of non-open-like operations passing IOSQE_OPEN_FIXED.

io_uring is really thin, and so I absolutely don't want any extra
overhead in the generic path, IOW anything affecting
reads/writes/sends/recvs.

The other reason is that there are only 2 bits left in sqe->flags,
and we may use them for something better, considering that it's
only open/accept and not much else like this.

I agree that it feels error-prone, but at least it can be wrapped
nicely enough in liburing, e.g.

void io_uring_prep_openat_direct(struct io_uring_sqe *sqe, int dfd,
                                 const char *path, int flags,
                                 mode_t mode, int slot_idx);

> Also, rather than using a 16-bit index for the fixed file table and
> potentially requiring expansion into a different field in the future,
> what about overlapping it with the nofile field in the open and accept
> requests? If they're not opening a normal file descriptor, they don't
> need nofile. And in the original sqe, you can then overlap it with a
> 32-bit field like splice_fd_in.

There is no nofile in SQEs, though:

req->open.nofile = rlimit(RLIMIT_NOFILE);

> EEXIST seems like the wrong error code to use if the index is already
> in use; open can already return EEXIST if you pass O_EXCL. How about
> EBADF, or better yet EBADSLT, which is unlikely to be returned for any
> other reason?

Sure, sounds better indeed!
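A minimal sketch of how that liburing wrapper might be implemented, assuming the sqe->file_index convention from the series (the stored value is the zero-based slot plus one); this mirrors the signature proposed above and is not a committed API:

    /* Sketch: prepare a direct open into fixed-file slot `slot_idx`.
     * The +1 encoding (0 = "use the normal fd table") is hidden here. */
    static inline void io_uring_prep_openat_direct(struct io_uring_sqe *sqe,
                                                   int dfd, const char *path,
                                                   int flags, mode_t mode,
                                                   int slot_idx)
    {
            io_uring_prep_openat(sqe, dfd, path, flags, mode);
            sqe->file_index = slot_idx + 1;
    }

Keeping the off-by-one encoding inside the wrapper is exactly the point being made: applications only ever see a plain zero-based slot index.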
On 8/14/21 6:50 AM, Pavel Begunkov wrote:
> On 8/13/21 8:00 PM, Josh Triplett wrote:
>> On Fri, Aug 13, 2021 at 05:43:09PM +0100, Pavel Begunkov wrote:
>>> Add an optional feature to open/accept directly into io_uring's fixed
>>> file table, bypassing the normal file table.
>>> [...]
>>
>> Also, rather than using a 16-bit index for the fixed file table and
>> potentially requiring expansion into a different field in the future,
>> what about overlapping it with the nofile field in the open and accept
>> requests? If they're not opening a normal file descriptor, they don't
>> need nofile. And in the original sqe, you can then overlap it with a
>> 32-bit field like splice_fd_in.
>
> There is no nofile in SQEs, though:
>
> req->open.nofile = rlimit(RLIMIT_NOFILE);

What's the plan in terms of limiting the amount of direct descriptors
(for lack of a better word)? That seems like an important aspect that
should get sorted out upfront.

Do we include the regular file table max_fds count when creating a new
direct descriptor, and limit to RLIMIT_NOFILE? That would seem logical,
but it also implies that the regular file table should include the ctx
(potentially several) direct descriptors. And the latter is much worse.

Maybe we have a way to size the direct table, which will consume entries
from the same pool that the regular file table does? That would then
work both ways, and could potentially just be done dynamically, similarly
to how we expand the regular file table when we exceed its current size.

Anyway, just throwing a few ideas out there, with the intent to spark a
bit of discussion on this topic. I really like the direct descriptors;
they'll be a lot more efficient for certain use cases.
On Sat, Aug 14, 2021 at 01:50:24PM +0100, Pavel Begunkov wrote:
> On 8/13/21 8:00 PM, Josh Triplett wrote:
>> Rather than using sqe->file_index - 1, which feels like an error-prone
>> interface, I think it makes sense to use a dedicated flag for this,
>> like IOSQE_OPEN_FIXED.
>> [...]
>>
>> The only downside is that you'll need to reject that flag in all
>> non-open operations. One way to unify that code might be to add a flag
>> in io_op_def for open-like operations, and then check in common code
>> for the case of non-open-like operations passing IOSQE_OPEN_FIXED.
>
> io_uring is really thin, and so I absolutely don't want any extra
> overhead in the generic path, IOW anything affecting
> reads/writes/sends/recvs.

There are already several checks for valid flags in io_init_req. For
instance:

	if ((sqe_flags & IOSQE_BUFFER_SELECT) &&
	    !io_op_defs[req->opcode].buffer_select)
		return -EOPNOTSUPP;

It'd be trivial to make io_op_defs have a "valid flags" byte, and one
bitwise op tells you if any invalid flags were passed. *Zero* additional
overhead for other operations.

Alternatively, since there are so few operations that open a file
descriptor, you could just add a separate opcode for those few
operations. That still seems preferable to overloading a 16-bit index
field for this.

With this new mechanism, I think we're going to want to support more
than 65535 fixed-file entries. I can easily imagine wanting to handle
hundreds of thousands of files or sockets this way.

> The other reason is that there are only 2 bits left in sqe->flags,
> and we may use them for something better, considering that it's
> only open/accept and not much else like this.

pipe, dup3, socket, socketpair, pidfds (via either pidfd_open or a
ring-based spawn mechanism), epoll_create, inotify, fanotify, signalfd,
timerfd, eventfd, memfd_create, userfaultfd, open_tree, fsopen, fsmount,
memfd_secret.

Of those, I personally would *love* to have at least pipe, socket,
pidfd, memfd_create, and fsopen/fsmount/open_tree, plus some manner of
dup-like operation for moving things between the fixed-file table and
file descriptors.

I think this is valuable and versatile enough to merit a flag. It would
also be entirely reasonable to create separate operations for these. But
either way, I don't think this should just be determined by whether a
16-bit index is non-zero.

> I agree that it feels error-prone, but at least it can be wrapped
> nicely enough in liburing, e.g.
>
> void io_uring_prep_openat_direct(struct io_uring_sqe *sqe, int dfd,
>                                  const char *path, int flags,
>                                  mode_t mode, int slot_idx);

That wrapper wouldn't be able to handle more than a 16-bit slot index
though.

>> Also, rather than using a 16-bit index for the fixed file table and
>> potentially requiring expansion into a different field in the future,
>> what about overlapping it with the nofile field in the open and accept
>> requests?
>> [...]
>
> There is no nofile in SQEs, though:
>
> req->open.nofile = rlimit(RLIMIT_NOFILE);

nofile isn't needed for opening into the fixed-file table, so it could
be omitted in that case, and another field unioned with it. That would
allow passing a 32-bit fixed-file index into open and accept without
growing the size of their structures. I think, with this new capability,
we're going to want a large number of fixed files available.

In the SQE, you could overlap it with the splice_fd_in field, which
isn't needed by any calls other than splice.

- Josh Triplett
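A sketch of that "valid flags" idea, assuming a hypothetical IOSQE_OPEN_FIXED bit and an added .valid_flags member in io_op_def; all names beyond the existing IOSQE_* constants are illustrative, not from the kernel tree:

    /* Per-opcode mask of accepted IOSQE_* bits (sketch). */
    struct io_op_def {
            /* ... existing per-op bitfields ... */
            u8 valid_flags;
    };

    static const struct io_op_def io_op_defs[] = {
            [IORING_OP_OPENAT] = {
                    .valid_flags = IOSQE_IO_LINK | IOSQE_IO_HARDLINK |
                                   IOSQE_ASYNC |
                                   IOSQE_OPEN_FIXED,    /* hypothetical */
            },
            /* ... */
    };

    /* In io_init_req(): one mask test per request, regardless of how
     * many flags exist, and nothing extra for non-open opcodes. */
    if (sqe_flags & ~io_op_defs[req->opcode].valid_flags)
            return -EINVAL;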
On Sat, Aug 14, 2021 at 05:03:44PM -0600, Jens Axboe wrote:
> What's the plan in terms of limiting the amount of direct descriptors
> (for lack of a better word)? That seems like an important aspect that
> should get sorted out upfront.
[...]
> Maybe we have a way to size the direct table, which will consume entries
> from the same pool that the regular file table does? That would then
> work both ways, and could potentially just be done dynamically, similarly
> to how we expand the regular file table when we exceed its current size.

I think we'll want a way to size the direct table regardless, so that
it's pre-allocated and doesn't need to be resized when an index is used.
Then, we could do one of two equally easy things, depending on what
policy we want to set:

- Deduct the full size of the fixed-file table from the allowed number
  of files the process can have open. So, if RLIMIT_NOFILE is 1048576,
  and you pre-allocate 1000000 entries in the fixed-file table, you can
  have no more than 48576 file descriptors open. Stricter, but
  potentially problematic: a program *might* expect that it can
  dup2(some_fd, nofile - 1) successfully.

- Use RLIMIT_NOFILE as the maximum size of the fixed-file table. There's
  precedent for this: we already use RLIMIT_NOFILE as the maximum number
  of file descriptors you can have in flight over UNIX sockets.

I personally would favor the latter; it seems simple and
straightforward.
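The second option amounts to a single check at table-registration time. A sketch, assuming the check would live in io_sqe_files_register(); the exact placement and error code follow the suggestion above but are a guess:

    /* Cap the fixed-file table at the caller's open-files rlimit, the
     * same limit already applied to fds in flight over UNIX sockets. */
    if (nr_args > rlimit(RLIMIT_NOFILE))
            return -EMFILE;

Because the table is sized once up front, the per-request submission path never sees this check.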
On 8/15/21 4:31 AM, Josh Triplett wrote:
> On Sat, Aug 14, 2021 at 01:50:24PM +0100, Pavel Begunkov wrote:
>> On 8/13/21 8:00 PM, Josh Triplett wrote:
>>> Rather than using sqe->file_index - 1, which feels like an
>>> error-prone interface, I think it makes sense to use a dedicated
>>> flag for this, like IOSQE_OPEN_FIXED.
>>> [...]
>>
>> io_uring is really thin, and so I absolutely don't want any extra
>> overhead in the generic path, IOW anything affecting
>> reads/writes/sends/recvs.
>
> There are already several checks for valid flags in io_init_req. For
> instance:

Yes, it's horrible and I don't want to make it any worse.

> 	if ((sqe_flags & IOSQE_BUFFER_SELECT) &&
> 	    !io_op_defs[req->opcode].buffer_select)
> 		return -EOPNOTSUPP;
>
> It'd be trivial to make io_op_defs have a "valid flags" byte, and one
> bitwise op tells you if any invalid flags were passed. *Zero* additional
> overhead for other operations.

Good point.

> Alternatively, since there are so few operations that open a file
> descriptor, you could just add a separate opcode for those few
> operations. That still seems preferable to overloading a 16-bit index
> field for this.

I don't think so.

> With this new mechanism, I think we're going to want to support more
> than 65535 fixed-file entries. I can easily imagine wanting to handle
> hundreds of thousands of files or sockets this way.

Maybe. What I'm curious about is that the feature doesn't really change
anything in this regard, yet it seems I haven't heard people asking for
larger tables.

>> The other reason is that there are only 2 bits left in sqe->flags,
>> and we may use them for something better, considering that it's
>> only open/accept and not much else like this.
>
> pipe, dup3, socket, socketpair, pidfds (via either pidfd_open or a
> ring-based spawn mechanism), epoll_create, inotify, fanotify, signalfd,
> timerfd, eventfd, memfd_create, userfaultfd, open_tree, fsopen, fsmount,
> memfd_secret.

We could argue for many of those whether they should be in io_uring at
all, and whether there are many benefits to having them async and so on.
It would be another story if the whole ecosystem were io_uring centric,
but that's speculation.

> Of those, I personally would *love* to have at least pipe, socket,
> pidfd, memfd_create, and fsopen/fsmount/open_tree, plus some manner of
> dup-like operation for moving things between the fixed-file table and
> file descriptors.
>
> I think this is valuable and versatile enough to merit a flag. It would
> also be entirely reasonable to create separate operations for these. But
> either way, I don't think this should just be determined by whether a
> 16-bit index is non-zero.
>
>> I agree that it feels error-prone, but at least it can be wrapped
>> nicely enough in liburing, e.g.
>>
>> void io_uring_prep_openat_direct(struct io_uring_sqe *sqe, int dfd,
>>                                  const char *path, int flags,
>>                                  mode_t mode, int slot_idx);
>
> That wrapper wouldn't be able to handle more than a 16-bit slot index
> though.

It would. Note, the index is "int" there, so if it doesn't fit into u16,
we can fail it. And do a conversion if required.

>>> Also, rather than using a 16-bit index for the fixed file table and
>>> potentially requiring expansion into a different field in the future,
>>> what about overlapping it with the nofile field in the open and
>>> accept requests?
>>> [...]
>>
>> There is no nofile in SQEs, though:
>>
>> req->open.nofile = rlimit(RLIMIT_NOFILE);
>
> nofile isn't needed for opening into the fixed-file table, so it could
> be omitted in that case, and another field unioned with it.

There is no problem placing it internally. Moreover, at the moment it's
uniformly placed inside io_kiocb, whereas with nofile we'd need to find
a place on a per-op basis. Not like any of it matters; it's just
bike-shedding.

> That would allow passing a 32-bit fixed-file index into open and accept
> without growing the size of their structures. I think, with this new
> capability, we're going to want a large number of fixed files
> available.
>
> In the SQE, you could overlap it with the splice_fd_in field, which
> isn't needed by any calls other than splice.

But that doesn't mean it won't be used, as has happened with pretty much
every other field in the SQE. So it rather depends on what packing is
wanted. And reusing the almost never used ->buf_index (and potentially
->ioprio) sounds reasonable.
On 8/15/21 12:03 AM, Jens Axboe wrote:
> On 8/14/21 6:50 AM, Pavel Begunkov wrote:
>> On 8/13/21 8:00 PM, Josh Triplett wrote:
>>> [...]
>>
>> There is no nofile in SQEs, though:
>>
>> req->open.nofile = rlimit(RLIMIT_NOFILE);
>
> What's the plan in terms of limiting the amount of direct descriptors
> (for lack of a better word)? That seems like an important aspect that
> should get sorted out upfront.

As was brought up before, I agree that it has to be solved. However, I
don't think it holds up this feature, as the same problem can perfectly
well be created without it:

fd = open();
io_uring_register(fd);
close(fd);

> Do we include the regular file table max_fds count when creating a new
> direct descriptor, and limit to RLIMIT_NOFILE? That would seem logical,
> but it also implies that the regular file table should include the ctx
> (potentially several) direct descriptors. And the latter is much worse.

To which object are we binding the counting? To the task that created
the ring? I'd be afraid of the following case then:

fork(NO_FDTABLE_SHARE, callback -> {
    ring = create_io_uring();
    io_uring_register_fds(&ring);
    pass_ring_to_parent(ring); // e.g. via socket or so
    exit();
});

Restricting based on the user may be a better option, but it's not
without problems either. Another option, which is too ugly to exist but
has to be mentioned, is to count the number of tasks and io_urings
together. Maybe it can spark some better idea. Also, do we have anything
related in cgroups/namespaces?

> Maybe we have a way to size the direct table, which will consume entries
> from the same pool that the regular file table does? That would then
> work both ways, and could potentially just be done dynamically, similarly
> to how we expand the regular file table when we exceed its current size.
>
> Anyway, just throwing a few ideas out there, with the intent to spark a
> bit of discussion on this topic. I really like the direct descriptors;
> they'll be a lot more efficient for certain use cases.
On 8/15/21 11:48 AM, Pavel Begunkov wrote:
> On 8/15/21 4:31 AM, Josh Triplett wrote:
>> [...]
>>
>> In the SQE, you could overlap it with the splice_fd_in field, which
>> isn't needed by any calls other than splice.
>
> But that doesn't mean it won't be used, as has happened with pretty
> much every other field in the SQE. So it rather depends on what packing
> is wanted. And reusing the almost never used ->buf_index (and
> potentially ->ioprio) sounds reasonable.

Aliasing with ->splice_fd_in looks better indeed (apart from it
inherently not being checked, but meh?). But I still don't think it's a
good option to use sqe->flags, and so we still need some way to switch
between modes. It could be sqe->rw_flags, as was once done with
SPLICE_F_FD_IN_FIXED, but that's IMHO an ugly, hackish way. I still lean
towards the 0 vs >0 encoding.
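For reference, the 0 vs >0 encoding with a 32-bit index could look like the following fragment of struct io_uring_sqe, overlapping splice_fd_in as suggested above; this is a sketch of the option under discussion, not a settled uapi layout:

    union {
            __s32   splice_fd_in;   /* only IORING_OP_SPLICE reads this */
            __u32   file_index;     /* 0 = normal fd table,
                                       N = fixed-file slot N - 1 */
    };

Since splice is the only consumer of splice_fd_in, the union costs nothing today; the open question raised above is whether some future opcode will want both fields at once.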
On 8/14/21 9:42 PM, Josh Triplett wrote:
> On Sat, Aug 14, 2021 at 05:03:44PM -0600, Jens Axboe wrote:
>> What's the plan in terms of limiting the amount of direct descriptors
>> (for lack of a better word)? That seems like an important aspect that
>> should get sorted out upfront.
> [...]
>> Maybe we have a way to size the direct table, which will consume entries
>> from the same pool that the regular file table does? That would then
>> work both ways, and could potentially just be done dynamically, similarly
>> to how we expand the regular file table when we exceed its current size.
>
> I think we'll want a way to size the direct table regardless, so that
> it's pre-allocated and doesn't need to be resized when an index is used.

But how do you size it then? I can see this being used into the hundreds
of thousands of fds easily, and right now the table is just an array
(though split into segments, avoiding huge allocs).

> Then, we could do one of two equally easy things, depending on what
> policy we want to set:
>
> - Deduct the full size of the fixed-file table from the allowed number
>   of files the process can have open. So, if RLIMIT_NOFILE is 1048576,
>   and you pre-allocate 1000000 entries in the fixed-file table, you can
>   have no more than 48576 file descriptors open. Stricter, but
>   potentially problematic: a program *might* expect that it can
>   dup2(some_fd, nofile - 1) successfully.
>
> - Use RLIMIT_NOFILE as the maximum size of the fixed-file table. There's
>   precedent for this: we already use RLIMIT_NOFILE as the maximum number
>   of file descriptors you can have in flight over UNIX sockets.
>
> I personally would favor the latter; it seems simple and
> straightforward.

I strongly prefer the latter too, and hopefully that's palatable since
the default limits are quite low anyway. And, as you say, it's already
done for in-flight fds as well.
On 8/15/21 9:05 AM, Jens Axboe wrote:
> On 8/14/21 9:42 PM, Josh Triplett wrote:
>> [...]
>>
>> I think we'll want a way to size the direct table regardless, so that
>> it's pre-allocated and doesn't need to be resized when an index is
>> used.
>
> But how do you size it then? I can see this being used into the
> hundreds of thousands of fds easily, and right now the table is just an
> array (though split into segments, avoiding huge allocs).

I guess that will just naturally follow from registering an empty set of
a given size initially. So it should not actually be a problem...
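Registering an empty set of a given size is already expressible by passing -1 entries, which the file-registration path treats as sparse slots; a sketch, assuming that behaviour carries over unchanged to direct descriptors:

    #include <string.h>
    #include <liburing.h>

    #define TABLE_SIZE 1024         /* illustrative size */

    /* Pre-size the fixed-file table: every slot starts empty and can be
     * filled later by open/accept-direct or a files_update request. */
    static int presize_table(struct io_uring *ring)
    {
            int fds[TABLE_SIZE];

            memset(fds, -1, sizeof(fds));   /* -1 marks a sparse slot */
            return io_uring_register_files(ring, fds, TABLE_SIZE);
    }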
Hi Pavel,

> The behaviour is controlled by setting sqe->file_index, where 0 implies
> the old behaviour. If a non-zero value is specified, the request will
> behave as described and place the file into the fixed file slot
> sqe->file_index - 1. A file table should already be created, and the
> slot should be valid and empty, otherwise the operation will fail.
>
> Note 1: we can't use IOSQE_FIXED_FILE to switch between modes, because
> accept takes a file, and it already uses the flag with a different
> meaning.

Would it be hard to support IOSQE_FIXED_FILE for the dirfd of openat*,
renameat, unlinkat, and statx? (And mkdirat, linkat, and symlinkat when
they arrive.) renameat and linkat might be trickier as they take two
dirfds, but it would make the feature more complete and useful.

metze
On 8/16/21 4:45 PM, Stefan Metzmacher wrote:
> Hi Pavel,
>
>> The behaviour is controlled by setting sqe->file_index, where 0
>> implies the old behaviour.
>> [...]
>
> Would it be hard to support IOSQE_FIXED_FILE for the dirfd of openat*,
> renameat, unlinkat, and statx? (And mkdirat, linkat, and symlinkat when
> they arrive.) renameat and linkat might be trickier as they take two
> dirfds, but it would make the feature more complete and useful.

Good idea. There is nothing blocking it on the io_uring side, but the fs
part may get ugly, e.g. too intrusive. We definitely need to take a
look.
On 8/17/21 3:33 AM, Pavel Begunkov wrote:
> On 8/16/21 4:45 PM, Stefan Metzmacher wrote:
>> Hi Pavel,
>>
>> [...]
>>
>> Would it be hard to support IOSQE_FIXED_FILE for the dirfd of openat*,
>> renameat, unlinkat, and statx? (And mkdirat, linkat, and symlinkat
>> when they arrive.) renameat and linkat might be trickier as they take
>> two dirfds, but it would make the feature more complete and useful.
>
> Good idea. There is nothing blocking it on the io_uring side, but the
> fs part may get ugly, e.g. too intrusive. We definitely need to take a
> look.

Indeed, the io_uring side is trivial, but the VFS interface would
require a lot of manhandling... That's why I didn't add support for
fixed files originally.