Message ID | 20230406144330.1932798-1-leitao@debian.org (mailing list archive) |
---|---|
Headers | show |
Series | add initial io_uring_cmd support for sockets | expand |
On Thu, Apr 6, 2023 at 10:45 AM Breno Leitao <leitao@debian.org> wrote: > > From: Breno Leitao <leit@fb.com> > > This patchset creates the initial plumbing for a io_uring command for > sockets. > > For now, create two uring commands for sockets, SOCKET_URING_OP_SIOCOUTQ > and SOCKET_URING_OP_SIOCINQ. They are similar to ioctl operations > SIOCOUTQ and SIOCINQ. In fact, the code on the protocol side itself is > heavily based on the ioctl operations. This duplicates all the existing ioctl logic of each protocol. Can this just call the existing proto_ops.ioctl internally and translate from/to io_uring format as needed?
On Thu, Apr 06, 2023 at 11:34:28AM -0400, Willem de Bruijn wrote: > On Thu, Apr 6, 2023 at 10:45 AM Breno Leitao <leitao@debian.org> wrote: > > > > From: Breno Leitao <leit@fb.com> > > > > This patchset creates the initial plumbing for a io_uring command for > > sockets. > > > > For now, create two uring commands for sockets, SOCKET_URING_OP_SIOCOUTQ > > and SOCKET_URING_OP_SIOCINQ. They are similar to ioctl operations > > SIOCOUTQ and SIOCINQ. In fact, the code on the protocol side itself is > > heavily based on the ioctl operations. > > This duplicates all the existing ioctl logic of each protocol. > > Can this just call the existing proto_ops.ioctl internally and translate from/to > io_uring format as needed? This is doable, and we have two options in this case: 1) Create a ioctl core function that does not call `put_user()`, and call it from both the `udp_ioctl` and `udp_uring_cmd`, doing the proper translations. Something as: int udp_ioctl_core(struct sock *sk, int cmd, unsigned long arg) { int amount; switch (cmd) { case SIOCOUTQ: { amount = sk_wmem_alloc_get(sk); break; } case SIOCINQ: { amount = max_t(int, 0, first_packet_length(sk)); break; } default: return -ENOIOCTLCMD; } return amount; } int udp_ioctl(struct sock *sk, int cmd, unsigned long arg) { int amount = udp_ioctl_core(sk, cmd, arg); return put_user(amount, (int __user *)arg); } EXPORT_SYMBOL(udp_ioctl); 2) Create a function for each "case entry". This seems a bit silly for UDP, but it makes more sense for other protocols. The code will look something like: int udp_ioctl(struct sock *sk, int cmd, unsigned long arg) { switch (cmd) { case SIOCOUTQ: { int amount = udp_ioctl_siocoutq(); return put_user(amount, (int __user *)arg); } ... } What is the best approach?
On Thu, Apr 06, 2023 at 07:43:26AM -0700, Breno Leitao wrote: > This patchset creates the initial plumbing for a io_uring command for > sockets. > > For now, create two uring commands for sockets, SOCKET_URING_OP_SIOCOUTQ > and SOCKET_URING_OP_SIOCINQ. They are similar to ioctl operations > SIOCOUTQ and SIOCINQ. In fact, the code on the protocol side itself is > heavily based on the ioctl operations. Do you have asynchronous operations in mind for a future patch? The io_uring command infrastructure makes more sense for operations that return EIOCBQUEUED, otherwise it doesn't have much benefit over ioctl. > In order to test this code, I created a liburing test, which is > currently located at [1], and I will create a pull request once we are > good with this patch. > > I've also run test/io_uring_passthrough to make sure the first patch > didn't regressed the NVME passthrough path. > > This patchset is a RFC for two different reasons: > * It changes slighlty on how IO uring command operates. I.e, we are > now passing the whole SQE to the io_uring_cmd callback (instead of > an opaque buffer). This seems to be more palatable instead of > creating some custom structure just to fit small parameters, as in > SOCKET_URING_OP_SIOC{IN,OUT}Q. Is this OK? I think I'm missing something from this series. Where is the io_uring_cmd change to point to the sqe?
On 4/6/23 10:41?AM, Keith Busch wrote: > On Thu, Apr 06, 2023 at 07:43:26AM -0700, Breno Leitao wrote: >> This patchset creates the initial plumbing for a io_uring command for >> sockets. >> >> For now, create two uring commands for sockets, SOCKET_URING_OP_SIOCOUTQ >> and SOCKET_URING_OP_SIOCINQ. They are similar to ioctl operations >> SIOCOUTQ and SIOCINQ. In fact, the code on the protocol side itself is >> heavily based on the ioctl operations. > > Do you have asynchronous operations in mind for a future patch? The > io_uring command infrastructure makes more sense for operations that > return EIOCBQUEUED, otherwise it doesn't have much benefit over ioctl. Basically nothing returns EIOCBQUEUED, it's mostly sync/poll driven on the networking side. The primary use case for this is with direct descriptors, as you can't do get/setsockopt with those. And that means you'd then need to instantiate a regular descriptor first and then register it, rather than keep it all direct from the start.
Hello Keith, On Thu, Apr 06, 2023 at 10:41:52AM -0600, Keith Busch wrote: > On Thu, Apr 06, 2023 at 07:43:26AM -0700, Breno Leitao wrote: > > This patchset creates the initial plumbing for a io_uring command for > > sockets. > > > > For now, create two uring commands for sockets, SOCKET_URING_OP_SIOCOUTQ > > and SOCKET_URING_OP_SIOCINQ. They are similar to ioctl operations > > SIOCOUTQ and SIOCINQ. In fact, the code on the protocol side itself is > > heavily based on the ioctl operations. > > Do you have asynchronous operations in mind for a future patch? The io_uring > command infrastructure makes more sense for operations that return EIOCBQUEUED, > otherwise it doesn't have much benefit over ioctl. I think this brings value even for synchronos operations, such as, you can just keep using io_uring operations on network operations, other than, using some io_uring operations and then doing a regular ioctl(2). So, it improves the user experience. The other benefit is calling several operations at a single io_uring submit. It means you can save several syscalls and getting the same work done. > > > In order to test this code, I created a liburing test, which is > > currently located at [1], and I will create a pull request once we are > > good with this patch. > > > > I've also run test/io_uring_passthrough to make sure the first patch > > didn't regressed the NVME passthrough path. > > > > This patchset is a RFC for two different reasons: > > * It changes slighlty on how IO uring command operates. I.e, we are > > now passing the whole SQE to the io_uring_cmd callback (instead of > > an opaque buffer). This seems to be more palatable instead of > > creating some custom structure just to fit small parameters, as in > > SOCKET_URING_OP_SIOC{IN,OUT}Q. Is this OK? > > I think I'm missing something from this series. Where is the io_uring_cmd > change to point to the sqe? My bad, the patch was not part of the patchset. I've just submitted it under the same RFC cover letter now. Here is the link, if it helps: https://lkml.org/lkml/2023/4/6/990
On Thu, Apr 6, 2023 at 11:59 AM Breno Leitao <leitao@debian.org> wrote: > > On Thu, Apr 06, 2023 at 11:34:28AM -0400, Willem de Bruijn wrote: > > On Thu, Apr 6, 2023 at 10:45 AM Breno Leitao <leitao@debian.org> wrote: > > > > > > From: Breno Leitao <leit@fb.com> > > > > > > This patchset creates the initial plumbing for a io_uring command for > > > sockets. > > > > > > For now, create two uring commands for sockets, SOCKET_URING_OP_SIOCOUTQ > > > and SOCKET_URING_OP_SIOCINQ. They are similar to ioctl operations > > > SIOCOUTQ and SIOCINQ. In fact, the code on the protocol side itself is > > > heavily based on the ioctl operations. > > > > This duplicates all the existing ioctl logic of each protocol. > > > > Can this just call the existing proto_ops.ioctl internally and translate from/to > > io_uring format as needed? > > This is doable, and we have two options in this case: > > 1) Create a ioctl core function that does not call `put_user()`, and > call it from both the `udp_ioctl` and `udp_uring_cmd`, doing the proper > translations. Something as: > > int udp_ioctl_core(struct sock *sk, int cmd, unsigned long arg) > { > int amount; > switch (cmd) { > case SIOCOUTQ: { > amount = sk_wmem_alloc_get(sk); > break; > } > case SIOCINQ: { > amount = max_t(int, 0, first_packet_length(sk)); > break; > } > default: > return -ENOIOCTLCMD; > } > return amount; > } > > int udp_ioctl(struct sock *sk, int cmd, unsigned long arg) > { > int amount = udp_ioctl_core(sk, cmd, arg); > > return put_user(amount, (int __user *)arg); > } > EXPORT_SYMBOL(udp_ioctl); > > > 2) Create a function for each "case entry". This seems a bit silly for > UDP, but it makes more sense for other protocols. The code will look > something like: > > int udp_ioctl(struct sock *sk, int cmd, unsigned long arg) > { > switch (cmd) { > case SIOCOUTQ: > { > int amount = udp_ioctl_siocoutq(); > return put_user(amount, (int __user *)arg); > } > ... > } > > What is the best approach? A, the issue is that sock->ops->ioctl directly call put_user. I was thinking just having sock_uring_cmd call sock->ops->ioctl, like sock_do_ioctl. But that would require those callbacks to return a negative error or positive integer, rather than calling put_user. And then move the put_user to sock_do_ioctl. Such a change is at least as much code change as your series. Though without the ending up with code duplication. It also works only if all ioctls only put_user of integer size. That's true for TCP, UDP and RAW, but not sure if true more broadly. Another approach may be to pass another argument to the ioctl callbacks, whether to call put_user or return the integer and let the caller take care of the output to user. This could possibly be embedded in the a high-order bit of the cmd, so that it fails on ioctl callbacks that do not support this mode. Of the two approaches you suggest, I find the first preferable.
On 4/6/23 12:16 PM, Willem de Bruijn wrote: > On Thu, Apr 6, 2023 at 11:59 AM Breno Leitao <leitao@debian.org> wrote: >> >> On Thu, Apr 06, 2023 at 11:34:28AM -0400, Willem de Bruijn wrote: >>> On Thu, Apr 6, 2023 at 10:45 AM Breno Leitao <leitao@debian.org> wrote: >>>> >>>> From: Breno Leitao <leit@fb.com> >>>> >>>> This patchset creates the initial plumbing for a io_uring command for >>>> sockets. >>>> >>>> For now, create two uring commands for sockets, SOCKET_URING_OP_SIOCOUTQ >>>> and SOCKET_URING_OP_SIOCINQ. They are similar to ioctl operations >>>> SIOCOUTQ and SIOCINQ. In fact, the code on the protocol side itself is >>>> heavily based on the ioctl operations. >>> >>> This duplicates all the existing ioctl logic of each protocol. >>> >>> Can this just call the existing proto_ops.ioctl internally and translate from/to >>> io_uring format as needed? >> >> This is doable, and we have two options in this case: >> >> 1) Create a ioctl core function that does not call `put_user()`, and >> call it from both the `udp_ioctl` and `udp_uring_cmd`, doing the proper >> translations. Something as: >> >> int udp_ioctl_core(struct sock *sk, int cmd, unsigned long arg) >> { >> int amount; >> switch (cmd) { >> case SIOCOUTQ: { >> amount = sk_wmem_alloc_get(sk); >> break; >> } >> case SIOCINQ: { >> amount = max_t(int, 0, first_packet_length(sk)); >> break; >> } >> default: >> return -ENOIOCTLCMD; >> } >> return amount; >> } >> >> int udp_ioctl(struct sock *sk, int cmd, unsigned long arg) >> { >> int amount = udp_ioctl_core(sk, cmd, arg); >> >> return put_user(amount, (int __user *)arg); >> } >> EXPORT_SYMBOL(udp_ioctl); >> >> >> 2) Create a function for each "case entry". This seems a bit silly for >> UDP, but it makes more sense for other protocols. The code will look >> something like: >> >> int udp_ioctl(struct sock *sk, int cmd, unsigned long arg) >> { >> switch (cmd) { >> case SIOCOUTQ: >> { >> int amount = udp_ioctl_siocoutq(); >> return put_user(amount, (int __user *)arg); >> } >> ... >> } >> >> What is the best approach? > > A, the issue is that sock->ops->ioctl directly call put_user. > > I was thinking just having sock_uring_cmd call sock->ops->ioctl, like > sock_do_ioctl. > > But that would require those callbacks to return a negative error or > positive integer, rather than calling put_user. And then move the > put_user to sock_do_ioctl. Such a change is at least as much code > change as your series. Though without the ending up with code > duplication. It also works only if all ioctls only put_user of integer > size. That's true for TCP, UDP and RAW, but not sure if true more > broadly. > > Another approach may be to pass another argument to the ioctl > callbacks, whether to call put_user or return the integer and let the > caller take care of the output to user. This could possibly be > embedded in the a high-order bit of the cmd, so that it fails on ioctl > callbacks that do not support this mode. > > Of the two approaches you suggest, I find the first preferable. The first approach sounds better to me and it would be good to avoid io_uring details in the networking code (ie., cmd->sqe->cmd_op).
On Thu, Apr 06, 2023 at 08:46:38PM -0600, David Ahern wrote: > On 4/6/23 12:16 PM, Willem de Bruijn wrote: > > On Thu, Apr 6, 2023 at 11:59 AM Breno Leitao <leitao@debian.org> wrote: > >> > >> On Thu, Apr 06, 2023 at 11:34:28AM -0400, Willem de Bruijn wrote: > >>> On Thu, Apr 6, 2023 at 10:45 AM Breno Leitao <leitao@debian.org> wrote: > >>>> > >>>> From: Breno Leitao <leit@fb.com> > >>>> > >>>> This patchset creates the initial plumbing for a io_uring command for > >>>> sockets. > >>>> > >>>> For now, create two uring commands for sockets, SOCKET_URING_OP_SIOCOUTQ > >>>> and SOCKET_URING_OP_SIOCINQ. They are similar to ioctl operations > >>>> SIOCOUTQ and SIOCINQ. In fact, the code on the protocol side itself is > >>>> heavily based on the ioctl operations. > >>> > >>> This duplicates all the existing ioctl logic of each protocol. > >>> > >>> Can this just call the existing proto_ops.ioctl internally and translate from/to > >>> io_uring format as needed? > >> > >> This is doable, and we have two options in this case: > >> > >> 1) Create a ioctl core function that does not call `put_user()`, and > >> call it from both the `udp_ioctl` and `udp_uring_cmd`, doing the proper > >> translations. Something as: > >> > >> int udp_ioctl_core(struct sock *sk, int cmd, unsigned long arg) > >> { > >> int amount; > >> switch (cmd) { > >> case SIOCOUTQ: { > >> amount = sk_wmem_alloc_get(sk); > >> break; > >> } > >> case SIOCINQ: { > >> amount = max_t(int, 0, first_packet_length(sk)); > >> break; > >> } > >> default: > >> return -ENOIOCTLCMD; > >> } > >> return amount; > >> } > >> > >> int udp_ioctl(struct sock *sk, int cmd, unsigned long arg) > >> { > >> int amount = udp_ioctl_core(sk, cmd, arg); > >> > >> return put_user(amount, (int __user *)arg); > >> } > >> EXPORT_SYMBOL(udp_ioctl); > >> > >> > >> 2) Create a function for each "case entry". This seems a bit silly for > >> UDP, but it makes more sense for other protocols. The code will look > >> something like: > >> > >> int udp_ioctl(struct sock *sk, int cmd, unsigned long arg) > >> { > >> switch (cmd) { > >> case SIOCOUTQ: > >> { > >> int amount = udp_ioctl_siocoutq(); > >> return put_user(amount, (int __user *)arg); > >> } > >> ... > >> } > >> > >> What is the best approach? > > > > A, the issue is that sock->ops->ioctl directly call put_user. > > > > I was thinking just having sock_uring_cmd call sock->ops->ioctl, like > > sock_do_ioctl. > > > > But that would require those callbacks to return a negative error or > > positive integer, rather than calling put_user. And then move the > > put_user to sock_do_ioctl. Such a change is at least as much code > > change as your series. Though without the ending up with code > > duplication. It also works only if all ioctls only put_user of integer > > size. That's true for TCP, UDP and RAW, but not sure if true more > > broadly. > > > > Another approach may be to pass another argument to the ioctl > > callbacks, whether to call put_user or return the integer and let the > > caller take care of the output to user. This could possibly be > > embedded in the a high-order bit of the cmd, so that it fails on ioctl > > callbacks that do not support this mode. > > > > Of the two approaches you suggest, I find the first preferable. > > The first approach sounds better to me and it would be good to avoid > io_uring details in the networking code (ie., cmd->sqe->cmd_op). I am not sure if avoiding io_uring details in network code is possible. The "struct proto"->uring_cmd callback implementation (tcp_uring_cmd() in the TCP case) could be somewhere else, such as in the io_uring/ directory, but, I think it might be cleaner if these implementations are closer to function assignment (in the network subsystem). And this function (tcp_uring_cmd() for instance) is the one that I am planning to map io_uring CMDs to ioctls. Such as SOCKET_URING_OP_SIOCINQ -> SIOCINQ. Please let me know if you have any other idea in mind.
On 4/11/23 6:00 AM, Breno Leitao wrote: > I am not sure if avoiding io_uring details in network code is possible. > > The "struct proto"->uring_cmd callback implementation (tcp_uring_cmd() > in the TCP case) could be somewhere else, such as in the io_uring/ > directory, but, I think it might be cleaner if these implementations are > closer to function assignment (in the network subsystem). > > And this function (tcp_uring_cmd() for instance) is the one that I am > planning to map io_uring CMDs to ioctls. Such as SOCKET_URING_OP_SIOCINQ > -> SIOCINQ. > > Please let me know if you have any other idea in mind. I am not convinced that this io_uring_cmd is needed. This is one in-kernel subsystem calling into another, and there are APIs for that. All of this set is ioctl based and as Willem noted a little refactoring separates the get_user/put_user out so that in-kernel can call can be made with existing ops.
On 4/11/23 8:36?AM, David Ahern wrote: > On 4/11/23 6:00 AM, Breno Leitao wrote: >> I am not sure if avoiding io_uring details in network code is possible. >> >> The "struct proto"->uring_cmd callback implementation (tcp_uring_cmd() >> in the TCP case) could be somewhere else, such as in the io_uring/ >> directory, but, I think it might be cleaner if these implementations are >> closer to function assignment (in the network subsystem). >> >> And this function (tcp_uring_cmd() for instance) is the one that I am >> planning to map io_uring CMDs to ioctls. Such as SOCKET_URING_OP_SIOCINQ >> -> SIOCINQ. >> >> Please let me know if you have any other idea in mind. > > I am not convinced that this io_uring_cmd is needed. This is one > in-kernel subsystem calling into another, and there are APIs for that. > All of this set is ioctl based and as Willem noted a little refactoring > separates the get_user/put_user out so that in-kernel can call can be > made with existing ops. How do you want to wire it up then? We can't use fops->unlocked_ioctl() obviously, and we already have ->uring_cmd() for this purpose. I do think the right thing to do is have a common helper that returns whatever value you want (or sets it), and split the ioctl parts into a wrapper around that that simply copies in/out as needed. Then ->uring_cmd() could call that, or you could some exported function that does supports that. This works for the basic cases, though I do suspect we'll want to go down the ->uring_cmd() at some point for more advanced cases or cases that cannot sanely be done in an ioctl fashion.
Jens Axboe wrote: > On 4/11/23 8:36?AM, David Ahern wrote: > > On 4/11/23 6:00 AM, Breno Leitao wrote: > >> I am not sure if avoiding io_uring details in network code is possible. > >> > >> The "struct proto"->uring_cmd callback implementation (tcp_uring_cmd() > >> in the TCP case) could be somewhere else, such as in the io_uring/ > >> directory, but, I think it might be cleaner if these implementations are > >> closer to function assignment (in the network subsystem). > >> > >> And this function (tcp_uring_cmd() for instance) is the one that I am > >> planning to map io_uring CMDs to ioctls. Such as SOCKET_URING_OP_SIOCINQ > >> -> SIOCINQ. > >> > >> Please let me know if you have any other idea in mind. > > > > I am not convinced that this io_uring_cmd is needed. This is one > > in-kernel subsystem calling into another, and there are APIs for that. > > All of this set is ioctl based and as Willem noted a little refactoring > > separates the get_user/put_user out so that in-kernel can call can be > > made with existing ops. > > How do you want to wire it up then? We can't use fops->unlocked_ioctl() > obviously, and we already have ->uring_cmd() for this purpose. Does this suggestion not work? > > I was thinking just having sock_uring_cmd call sock->ops->ioctl, like > > sock_do_ioctl. > I do think the right thing to do is have a common helper that returns > whatever value you want (or sets it), and split the ioctl parts into a > wrapper around that that simply copies in/out as needed. Then > ->uring_cmd() could call that, or you could some exported function that > does supports that. > > This works for the basic cases, though I do suspect we'll want to go > down the ->uring_cmd() at some point for more advanced cases or cases > that cannot sanely be done in an ioctl fashion. Right now the two examples are ioctls that return an integer. Do you already have other calls in mind? That would help estimate whether ->uring_cmd() indeed will be needed and we might as well do it now.
On 4/11/23 8:51?AM, Willem de Bruijn wrote: > Jens Axboe wrote: >> On 4/11/23 8:36?AM, David Ahern wrote: >>> On 4/11/23 6:00 AM, Breno Leitao wrote: >>>> I am not sure if avoiding io_uring details in network code is possible. >>>> >>>> The "struct proto"->uring_cmd callback implementation (tcp_uring_cmd() >>>> in the TCP case) could be somewhere else, such as in the io_uring/ >>>> directory, but, I think it might be cleaner if these implementations are >>>> closer to function assignment (in the network subsystem). >>>> >>>> And this function (tcp_uring_cmd() for instance) is the one that I am >>>> planning to map io_uring CMDs to ioctls. Such as SOCKET_URING_OP_SIOCINQ >>>> -> SIOCINQ. >>>> >>>> Please let me know if you have any other idea in mind. >>> >>> I am not convinced that this io_uring_cmd is needed. This is one >>> in-kernel subsystem calling into another, and there are APIs for that. >>> All of this set is ioctl based and as Willem noted a little refactoring >>> separates the get_user/put_user out so that in-kernel can call can be >>> made with existing ops. >> >> How do you want to wire it up then? We can't use fops->unlocked_ioctl() >> obviously, and we already have ->uring_cmd() for this purpose. > > Does this suggestion not work? Not sure I follow, what suggestion? >>> I was thinking just having sock_uring_cmd call sock->ops->ioctl, like >>> sock_do_ioctl. > >> I do think the right thing to do is have a common helper that returns >> whatever value you want (or sets it), and split the ioctl parts into a >> wrapper around that that simply copies in/out as needed. Then >> ->uring_cmd() could call that, or you could some exported function that >> does supports that. >> >> This works for the basic cases, though I do suspect we'll want to go >> down the ->uring_cmd() at some point for more advanced cases or cases >> that cannot sanely be done in an ioctl fashion. > > Right now the two examples are ioctls that return an integer. Do you > already have other calls in mind? That would help estimate whether > ->uring_cmd() indeed will be needed and we might as well do it now. Right, it's a proof of concept. But we'd want to support anything that setsockopt/getsockopt would do. This is necessary so that direct descriptors (eg ones that describe a struct file that isn't in the process file table or have a regular fd) can be used for anything that a regular file can. Beyond that, perhaps various things necessary for efficient zero copy rx. I do think we can make the ->uring_cmd() hookup a bit more palatable in terms of API. It really should be just a sub-opcode and then arguments to support that. The grunt of the work is really refactoring the ioctl and set/getsockopt bits so that they can be called in-kernel rather than assuming copy in/out is needed. Once that is done, the actual uring_cmd hookup should be simple and trivial.
Jens Axboe wrote: > On 4/11/23 8:51?AM, Willem de Bruijn wrote: > > Jens Axboe wrote: > >> On 4/11/23 8:36?AM, David Ahern wrote: > >>> On 4/11/23 6:00 AM, Breno Leitao wrote: > >>>> I am not sure if avoiding io_uring details in network code is possible. > >>>> > >>>> The "struct proto"->uring_cmd callback implementation (tcp_uring_cmd() > >>>> in the TCP case) could be somewhere else, such as in the io_uring/ > >>>> directory, but, I think it might be cleaner if these implementations are > >>>> closer to function assignment (in the network subsystem). > >>>> > >>>> And this function (tcp_uring_cmd() for instance) is the one that I am > >>>> planning to map io_uring CMDs to ioctls. Such as SOCKET_URING_OP_SIOCINQ > >>>> -> SIOCINQ. > >>>> > >>>> Please let me know if you have any other idea in mind. > >>> > >>> I am not convinced that this io_uring_cmd is needed. This is one > >>> in-kernel subsystem calling into another, and there are APIs for that. > >>> All of this set is ioctl based and as Willem noted a little refactoring > >>> separates the get_user/put_user out so that in-kernel can call can be > >>> made with existing ops. > >> > >> How do you want to wire it up then? We can't use fops->unlocked_ioctl() > >> obviously, and we already have ->uring_cmd() for this purpose. > > > > Does this suggestion not work? > > Not sure I follow, what suggestion? > This quote from earlier in the thread: I was thinking just having sock_uring_cmd call sock->ops->ioctl, like sock_do_ioctl. > > > >> I do think the right thing to do is have a common helper that returns > >> whatever value you want (or sets it), and split the ioctl parts into a > >> wrapper around that that simply copies in/out as needed. Then > >> ->uring_cmd() could call that, or you could some exported function that > >> does supports that. > >> > >> This works for the basic cases, though I do suspect we'll want to go > >> down the ->uring_cmd() at some point for more advanced cases or cases > >> that cannot sanely be done in an ioctl fashion. > > > > Right now the two examples are ioctls that return an integer. Do you > > already have other calls in mind? That would help estimate whether > > ->uring_cmd() indeed will be needed and we might as well do it now. > > Right, it's a proof of concept. But we'd want to support anything that > setsockopt/getsockopt would do. This is necessary so that direct > descriptors (eg ones that describe a struct file that isn't in the > process file table or have a regular fd) can be used for anything that a > regular file can. Beyond that, perhaps various things necessary for > efficient zero copy rx. > > I do think we can make the ->uring_cmd() hookup a bit more palatable in > terms of API. It really should be just a sub-opcode and then arguments > to support that. The grunt of the work is really refactoring the ioctl > and set/getsockopt bits so that they can be called in-kernel rather than > assuming copy in/out is needed. Once that is done, the actual uring_cmd > hookup should be simple and trivial. That sounds like what I proposed above. That suggestion was only for the narrow case where ioctls return an integer. The general approach has to handle any put_user. Though my initial skim of TCP, UDP and RAW did not bring up any other forms. getsockopt indeed has plenty of examples, such as receive zerocopy.
On 4/11/23 9:00?AM, Willem de Bruijn wrote: > Jens Axboe wrote: >> On 4/11/23 8:51?AM, Willem de Bruijn wrote: >>> Jens Axboe wrote: >>>> On 4/11/23 8:36?AM, David Ahern wrote: >>>>> On 4/11/23 6:00 AM, Breno Leitao wrote: >>>>>> I am not sure if avoiding io_uring details in network code is possible. >>>>>> >>>>>> The "struct proto"->uring_cmd callback implementation (tcp_uring_cmd() >>>>>> in the TCP case) could be somewhere else, such as in the io_uring/ >>>>>> directory, but, I think it might be cleaner if these implementations are >>>>>> closer to function assignment (in the network subsystem). >>>>>> >>>>>> And this function (tcp_uring_cmd() for instance) is the one that I am >>>>>> planning to map io_uring CMDs to ioctls. Such as SOCKET_URING_OP_SIOCINQ >>>>>> -> SIOCINQ. >>>>>> >>>>>> Please let me know if you have any other idea in mind. >>>>> >>>>> I am not convinced that this io_uring_cmd is needed. This is one >>>>> in-kernel subsystem calling into another, and there are APIs for that. >>>>> All of this set is ioctl based and as Willem noted a little refactoring >>>>> separates the get_user/put_user out so that in-kernel can call can be >>>>> made with existing ops. >>>> >>>> How do you want to wire it up then? We can't use fops->unlocked_ioctl() >>>> obviously, and we already have ->uring_cmd() for this purpose. >>> >>> Does this suggestion not work? >> >> Not sure I follow, what suggestion? >> > > This quote from earlier in the thread: > > I was thinking just having sock_uring_cmd call sock->ops->ioctl, like > sock_do_ioctl. But that doesn't work, because sock->ops->ioctl() assumes the arg is memory in userspace. Or do you mean change all of the sock->ops->ioctl() to pass in on-stack memory (or similar) and have it work with a kernel address? >>>> I do think the right thing to do is have a common helper that returns >>>> whatever value you want (or sets it), and split the ioctl parts into a >>>> wrapper around that that simply copies in/out as needed. Then >>>> ->uring_cmd() could call that, or you could some exported function that >>>> does supports that. >>>> >>>> This works for the basic cases, though I do suspect we'll want to go >>>> down the ->uring_cmd() at some point for more advanced cases or cases >>>> that cannot sanely be done in an ioctl fashion. >>> >>> Right now the two examples are ioctls that return an integer. Do you >>> already have other calls in mind? That would help estimate whether >>> ->uring_cmd() indeed will be needed and we might as well do it now. >> >> Right, it's a proof of concept. But we'd want to support anything that >> setsockopt/getsockopt would do. This is necessary so that direct >> descriptors (eg ones that describe a struct file that isn't in the >> process file table or have a regular fd) can be used for anything that a >> regular file can. Beyond that, perhaps various things necessary for >> efficient zero copy rx. >> >> I do think we can make the ->uring_cmd() hookup a bit more palatable in >> terms of API. It really should be just a sub-opcode and then arguments >> to support that. The grunt of the work is really refactoring the ioctl >> and set/getsockopt bits so that they can be called in-kernel rather than >> assuming copy in/out is needed. Once that is done, the actual uring_cmd >> hookup should be simple and trivial. > > That sounds like what I proposed above. That suggestion was only for > the narrow case where ioctls return an integer. The general approach > has to handle any put_user. Right > Though my initial skim of TCP, UDP and RAW did not bring up any other > forms. > > getsockopt indeed has plenty of examples, such as receive zerocopy.
On 4/11/23 8:41 AM, Jens Axboe wrote: > On 4/11/23 8:36?AM, David Ahern wrote: >> On 4/11/23 6:00 AM, Breno Leitao wrote: >>> I am not sure if avoiding io_uring details in network code is possible. >>> >>> The "struct proto"->uring_cmd callback implementation (tcp_uring_cmd() >>> in the TCP case) could be somewhere else, such as in the io_uring/ >>> directory, but, I think it might be cleaner if these implementations are >>> closer to function assignment (in the network subsystem). >>> >>> And this function (tcp_uring_cmd() for instance) is the one that I am >>> planning to map io_uring CMDs to ioctls. Such as SOCKET_URING_OP_SIOCINQ >>> -> SIOCINQ. >>> >>> Please let me know if you have any other idea in mind. >> >> I am not convinced that this io_uring_cmd is needed. This is one >> in-kernel subsystem calling into another, and there are APIs for that. >> All of this set is ioctl based and as Willem noted a little refactoring >> separates the get_user/put_user out so that in-kernel can call can be >> made with existing ops. > > How do you want to wire it up then? We can't use fops->unlocked_ioctl() > obviously, and we already have ->uring_cmd() for this purpose. > > I do think the right thing to do is have a common helper that returns > whatever value you want (or sets it), and split the ioctl parts into a > wrapper around that that simply copies in/out as needed. Then > ->uring_cmd() could call that, or you could some exported function that > does supports that. > > This works for the basic cases, though I do suspect we'll want to go > down the ->uring_cmd() at some point for more advanced cases or cases > that cannot sanely be done in an ioctl fashion. > My meta point is that there are uapis today to return this information to applications (and I suspect this is just the start of more networking changes - both data retrieval and adjusting settings). io_uring is wanting to do this on behalf of the application without a syscall. That makes io_uring yet another subsystem / component managing a socket. Any change to the networking stack required by io_uring should be usable by all other in-kernel socket owners or managers. ie., there is no reason for io_uring specific code here.
On 4/11/23 9:10?AM, David Ahern wrote: > On 4/11/23 8:41 AM, Jens Axboe wrote: >> On 4/11/23 8:36?AM, David Ahern wrote: >>> On 4/11/23 6:00 AM, Breno Leitao wrote: >>>> I am not sure if avoiding io_uring details in network code is possible. >>>> >>>> The "struct proto"->uring_cmd callback implementation (tcp_uring_cmd() >>>> in the TCP case) could be somewhere else, such as in the io_uring/ >>>> directory, but, I think it might be cleaner if these implementations are >>>> closer to function assignment (in the network subsystem). >>>> >>>> And this function (tcp_uring_cmd() for instance) is the one that I am >>>> planning to map io_uring CMDs to ioctls. Such as SOCKET_URING_OP_SIOCINQ >>>> -> SIOCINQ. >>>> >>>> Please let me know if you have any other idea in mind. >>> >>> I am not convinced that this io_uring_cmd is needed. This is one >>> in-kernel subsystem calling into another, and there are APIs for that. >>> All of this set is ioctl based and as Willem noted a little refactoring >>> separates the get_user/put_user out so that in-kernel can call can be >>> made with existing ops. >> >> How do you want to wire it up then? We can't use fops->unlocked_ioctl() >> obviously, and we already have ->uring_cmd() for this purpose. >> >> I do think the right thing to do is have a common helper that returns >> whatever value you want (or sets it), and split the ioctl parts into a >> wrapper around that that simply copies in/out as needed. Then >> ->uring_cmd() could call that, or you could some exported function that >> does supports that. >> >> This works for the basic cases, though I do suspect we'll want to go >> down the ->uring_cmd() at some point for more advanced cases or cases >> that cannot sanely be done in an ioctl fashion. >> > > My meta point is that there are uapis today to return this information > to applications (and I suspect this is just the start of more networking > changes - both data retrieval and adjusting settings). io_uring is > wanting to do this on behalf of the application without a syscall. That > makes io_uring yet another subsystem / component managing a socket. Any > change to the networking stack required by io_uring should be usable by > all other in-kernel socket owners or managers. ie., there is no reason > for io_uring specific code here. I think we are in violent agreement here, what I'm describing is exactly that - it'd make ioctl/{set,get}sockopt call into the same helpers that ->uring_cmd() would, with the only difference being that the former would need copy in/out and the latter would not. But let me just stress that for direct descriptors, we cannot currently call ioctl or set/getsockopt. This means we have to instantiate a regular descriptor first, do those things, then register it to never use the regular file descriptor again. That's wasteful, and this is what we want to enable (direct use of ioctl set/getsockopt WITHOUT a normal file descriptor). It's not just for "oh it'd be handy to also do this from io_uring" even if that would be a worthwhile goal in itself.
Jens Axboe wrote: > On 4/11/23 9:00?AM, Willem de Bruijn wrote: > > Jens Axboe wrote: > >> On 4/11/23 8:51?AM, Willem de Bruijn wrote: > >>> Jens Axboe wrote: > >>>> On 4/11/23 8:36?AM, David Ahern wrote: > >>>>> On 4/11/23 6:00 AM, Breno Leitao wrote: > >>>>>> I am not sure if avoiding io_uring details in network code is possible. > >>>>>> > >>>>>> The "struct proto"->uring_cmd callback implementation (tcp_uring_cmd() > >>>>>> in the TCP case) could be somewhere else, such as in the io_uring/ > >>>>>> directory, but, I think it might be cleaner if these implementations are > >>>>>> closer to function assignment (in the network subsystem). > >>>>>> > >>>>>> And this function (tcp_uring_cmd() for instance) is the one that I am > >>>>>> planning to map io_uring CMDs to ioctls. Such as SOCKET_URING_OP_SIOCINQ > >>>>>> -> SIOCINQ. > >>>>>> > >>>>>> Please let me know if you have any other idea in mind. > >>>>> > >>>>> I am not convinced that this io_uring_cmd is needed. This is one > >>>>> in-kernel subsystem calling into another, and there are APIs for that. > >>>>> All of this set is ioctl based and as Willem noted a little refactoring > >>>>> separates the get_user/put_user out so that in-kernel can call can be > >>>>> made with existing ops. > >>>> > >>>> How do you want to wire it up then? We can't use fops->unlocked_ioctl() > >>>> obviously, and we already have ->uring_cmd() for this purpose. > >>> > >>> Does this suggestion not work? > >> > >> Not sure I follow, what suggestion? > >> > > > > This quote from earlier in the thread: > > > > I was thinking just having sock_uring_cmd call sock->ops->ioctl, like > > sock_do_ioctl. > > But that doesn't work, because sock->ops->ioctl() assumes the arg is > memory in userspace. Or do you mean change all of the sock->ops->ioctl() > to pass in on-stack memory (or similar) and have it work with a kernel > address? That was what I suggested indeed. It's about as much code change as this patch series. But it avoids the code duplication.
On 4/11/23 9:17 AM, Jens Axboe wrote: > On 4/11/23 9:10?AM, David Ahern wrote: >> On 4/11/23 8:41 AM, Jens Axboe wrote: >>> On 4/11/23 8:36?AM, David Ahern wrote: >>>> On 4/11/23 6:00 AM, Breno Leitao wrote: >>>>> I am not sure if avoiding io_uring details in network code is possible. >>>>> >>>>> The "struct proto"->uring_cmd callback implementation (tcp_uring_cmd() >>>>> in the TCP case) could be somewhere else, such as in the io_uring/ >>>>> directory, but, I think it might be cleaner if these implementations are >>>>> closer to function assignment (in the network subsystem). >>>>> >>>>> And this function (tcp_uring_cmd() for instance) is the one that I am >>>>> planning to map io_uring CMDs to ioctls. Such as SOCKET_URING_OP_SIOCINQ >>>>> -> SIOCINQ. >>>>> >>>>> Please let me know if you have any other idea in mind. >>>> >>>> I am not convinced that this io_uring_cmd is needed. This is one >>>> in-kernel subsystem calling into another, and there are APIs for that. >>>> All of this set is ioctl based and as Willem noted a little refactoring >>>> separates the get_user/put_user out so that in-kernel can call can be >>>> made with existing ops. >>> >>> How do you want to wire it up then? We can't use fops->unlocked_ioctl() >>> obviously, and we already have ->uring_cmd() for this purpose. >>> >>> I do think the right thing to do is have a common helper that returns >>> whatever value you want (or sets it), and split the ioctl parts into a >>> wrapper around that that simply copies in/out as needed. Then >>> ->uring_cmd() could call that, or you could some exported function that >>> does supports that. >>> >>> This works for the basic cases, though I do suspect we'll want to go >>> down the ->uring_cmd() at some point for more advanced cases or cases >>> that cannot sanely be done in an ioctl fashion. >>> >> >> My meta point is that there are uapis today to return this information >> to applications (and I suspect this is just the start of more networking >> changes - both data retrieval and adjusting settings). io_uring is >> wanting to do this on behalf of the application without a syscall. That >> makes io_uring yet another subsystem / component managing a socket. Any >> change to the networking stack required by io_uring should be usable by >> all other in-kernel socket owners or managers. ie., there is no reason >> for io_uring specific code here. > > I think we are in violent agreement here, what I'm describing is exactly > that - it'd make ioctl/{set,get}sockopt call into the same helpers that > ->uring_cmd() would, with the only difference being that the former > would need copy in/out and the latter would not. > > But let me just stress that for direct descriptors, we cannot currently > call ioctl or set/getsockopt. This means we have to instantiate a > regular descriptor first, do those things, then register it to never use > the regular file descriptor again. That's wasteful, and this is what we > want to enable (direct use of ioctl set/getsockopt WITHOUT a normal file > descriptor). It's not just for "oh it'd be handy to also do this from > io_uring" even if that would be a worthwhile goal in itself. > Christoph's patch set a few years back that removed set_fs broke the ability to do in-kernel ioctl and {s,g}setsockopt calls. I did not follow that change; was it a deliberate intent to not allow these in-kernel calls vs wanting to remove the set_fs? e.g., can we add a kioctl variant for in-kernel use of the APIs?
On 4/11/23 9:24?AM, Willem de Bruijn wrote: > Jens Axboe wrote: >> On 4/11/23 9:00?AM, Willem de Bruijn wrote: >>> Jens Axboe wrote: >>>> On 4/11/23 8:51?AM, Willem de Bruijn wrote: >>>>> Jens Axboe wrote: >>>>>> On 4/11/23 8:36?AM, David Ahern wrote: >>>>>>> On 4/11/23 6:00 AM, Breno Leitao wrote: >>>>>>>> I am not sure if avoiding io_uring details in network code is possible. >>>>>>>> >>>>>>>> The "struct proto"->uring_cmd callback implementation (tcp_uring_cmd() >>>>>>>> in the TCP case) could be somewhere else, such as in the io_uring/ >>>>>>>> directory, but, I think it might be cleaner if these implementations are >>>>>>>> closer to function assignment (in the network subsystem). >>>>>>>> >>>>>>>> And this function (tcp_uring_cmd() for instance) is the one that I am >>>>>>>> planning to map io_uring CMDs to ioctls. Such as SOCKET_URING_OP_SIOCINQ >>>>>>>> -> SIOCINQ. >>>>>>>> >>>>>>>> Please let me know if you have any other idea in mind. >>>>>>> >>>>>>> I am not convinced that this io_uring_cmd is needed. This is one >>>>>>> in-kernel subsystem calling into another, and there are APIs for that. >>>>>>> All of this set is ioctl based and as Willem noted a little refactoring >>>>>>> separates the get_user/put_user out so that in-kernel can call can be >>>>>>> made with existing ops. >>>>>> >>>>>> How do you want to wire it up then? We can't use fops->unlocked_ioctl() >>>>>> obviously, and we already have ->uring_cmd() for this purpose. >>>>> >>>>> Does this suggestion not work? >>>> >>>> Not sure I follow, what suggestion? >>>> >>> >>> This quote from earlier in the thread: >>> >>> I was thinking just having sock_uring_cmd call sock->ops->ioctl, like >>> sock_do_ioctl. >> >> But that doesn't work, because sock->ops->ioctl() assumes the arg is >> memory in userspace. Or do you mean change all of the sock->ops->ioctl() >> to pass in on-stack memory (or similar) and have it work with a kernel >> address? > > That was what I suggested indeed. > > It's about as much code change as this patch series. But it avoids > the code duplication. Breno, want to tackle that as a prep patch first? Should make the functional changes afterwards much more straightforward, and will allow support for anything really.
On 4/11/23 9:27?AM, David Ahern wrote: > On 4/11/23 9:17 AM, Jens Axboe wrote: >> On 4/11/23 9:10?AM, David Ahern wrote: >>> On 4/11/23 8:41 AM, Jens Axboe wrote: >>>> On 4/11/23 8:36?AM, David Ahern wrote: >>>>> On 4/11/23 6:00 AM, Breno Leitao wrote: >>>>>> I am not sure if avoiding io_uring details in network code is possible. >>>>>> >>>>>> The "struct proto"->uring_cmd callback implementation (tcp_uring_cmd() >>>>>> in the TCP case) could be somewhere else, such as in the io_uring/ >>>>>> directory, but, I think it might be cleaner if these implementations are >>>>>> closer to function assignment (in the network subsystem). >>>>>> >>>>>> And this function (tcp_uring_cmd() for instance) is the one that I am >>>>>> planning to map io_uring CMDs to ioctls. Such as SOCKET_URING_OP_SIOCINQ >>>>>> -> SIOCINQ. >>>>>> >>>>>> Please let me know if you have any other idea in mind. >>>>> >>>>> I am not convinced that this io_uring_cmd is needed. This is one >>>>> in-kernel subsystem calling into another, and there are APIs for that. >>>>> All of this set is ioctl based and as Willem noted a little refactoring >>>>> separates the get_user/put_user out so that in-kernel can call can be >>>>> made with existing ops. >>>> >>>> How do you want to wire it up then? We can't use fops->unlocked_ioctl() >>>> obviously, and we already have ->uring_cmd() for this purpose. >>>> >>>> I do think the right thing to do is have a common helper that returns >>>> whatever value you want (or sets it), and split the ioctl parts into a >>>> wrapper around that that simply copies in/out as needed. Then >>>> ->uring_cmd() could call that, or you could some exported function that >>>> does supports that. >>>> >>>> This works for the basic cases, though I do suspect we'll want to go >>>> down the ->uring_cmd() at some point for more advanced cases or cases >>>> that cannot sanely be done in an ioctl fashion. >>>> >>> >>> My meta point is that there are uapis today to return this information >>> to applications (and I suspect this is just the start of more networking >>> changes - both data retrieval and adjusting settings). io_uring is >>> wanting to do this on behalf of the application without a syscall. That >>> makes io_uring yet another subsystem / component managing a socket. Any >>> change to the networking stack required by io_uring should be usable by >>> all other in-kernel socket owners or managers. ie., there is no reason >>> for io_uring specific code here. >> >> I think we are in violent agreement here, what I'm describing is exactly >> that - it'd make ioctl/{set,get}sockopt call into the same helpers that >> ->uring_cmd() would, with the only difference being that the former >> would need copy in/out and the latter would not. >> >> But let me just stress that for direct descriptors, we cannot currently >> call ioctl or set/getsockopt. This means we have to instantiate a >> regular descriptor first, do those things, then register it to never use >> the regular file descriptor again. That's wasteful, and this is what we >> want to enable (direct use of ioctl set/getsockopt WITHOUT a normal file >> descriptor). It's not just for "oh it'd be handy to also do this from >> io_uring" even if that would be a worthwhile goal in itself. >> > > Christoph's patch set a few years back that removed set_fs broke the > ability to do in-kernel ioctl and {s,g}setsockopt calls. I did not > follow that change; was it a deliberate intent to not allow these > in-kernel calls vs wanting to remove the set_fs? e.g., can we add a > kioctl variant for in-kernel use of the APIs? I think it'd be much better to cleanly split it out rather than try and hack around it.
From: David Ahern > Sent: 11 April 2023 16:28 .... > Christoph's patch set a few years back that removed set_fs broke the > ability to do in-kernel ioctl and {s,g}setsockopt calls. I did not > follow that change; was it a deliberate intent to not allow these > in-kernel calls vs wanting to remove the set_fs? e.g., can we add a > kioctl variant for in-kernel use of the APIs? I think that was a side effect, and with no in-tree in-kernel users (apart from limited calls in bpf) it was deemed acceptable. (It is a PITA for any code trying to use SCTP in kernel.) One problem is that not all sockopt calls pass the correct length. And some of them can have very long buffers. Not to mention the ones that are read-modify-write. A plausible solution is to pass a 'fat pointer' that contains some, or all, of: - A userspace buffer pointer. - A kernel buffer pointer. - The length supplied by the user. - The length of the kernel buffer. = The number of bytes to copy on completion. For simple user requests the syscall entry/exit code would copy the data to a short on-stack buffer. Kernel users just pass the kernel address. Odd requests can just use the user pointer. Probably needs accessors that add in an offset. It might also be that some of the problematic sockopt were in decnet - now removed. David - Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK Registration No: 1397386 (Wales)
On Tue, Apr 11, 2023 at 09:28:29AM -0600, Jens Axboe wrote: > On 4/11/23 9:24?AM, Willem de Bruijn wrote: > > Jens Axboe wrote: > >> On 4/11/23 9:00?AM, Willem de Bruijn wrote: > >> But that doesn't work, because sock->ops->ioctl() assumes the arg is > >> memory in userspace. Or do you mean change all of the sock->ops->ioctl() > >> to pass in on-stack memory (or similar) and have it work with a kernel > >> address? > > > > That was what I suggested indeed. > > > > It's about as much code change as this patch series. But it avoids > > the code duplication. > > Breno, want to tackle that as a prep patch first? Should make the > functional changes afterwards much more straightforward, and will allow > support for anything really. Absolutely. I just want to make sure that I got the proper approach that we agreed here. Let me explain what I understood taking TCP as an example: 1) Rename tcp_ioctl() to something as _tcp_ioctl() where the 'arg' argument is now just a kernel memory (located in the stack frame from the callee). 2) Recreate "tcp_ioctl()" that will basically allocate a 'arg' in the stack and call _tcp_ioctl() passing that 'arg' argument. At the bottom of this (tcp_ioctl() function) function, call `put_user(in_kernel_arg, userspace_arg) 3) Repeat it for the 20 protocols that implement ioctl: ag "struct proto .* = {" -A 20 net/ | grep \.ioctl net/dccp/ipv6.c .ioctl = dccp_ioctl, net/dccp/ipv4.c .ioctl = dccp_ioctl, net/ieee802154/socket.c .ioctl = dgram_ioctl, net/ipv4/udplite.c .ioctl = udp_ioctl, net/ipv4/raw.c .ioctl = raw_ioctl, net/ipv4/udp.c .ioctl = udp_ioctl, net/ipv4/tcp_ipv4.c .ioctl = tcp_ioctl, net/ipv6/raw.c .ioctl = rawv6_ioctl, net/ipv6/tcp_ipv6.c .ioctl = tcp_ioctl, net/ipv6/udp.c .ioctl = udp_ioctl, net/ipv6/udplite.c .ioctl = udp_ioctl, net/l2tp/l2tp_ip6.c .ioctl = l2tp_ioctl, net/l2tp/l2tp_ip.c .ioctl = l2tp_ioctl, net/phonet/datagram.: .ioctl = pn_ioctl, net/phonet/pep.c .ioctl = pep_ioctl, net/rds/af_rds.c .ioctl = rds_ioctl, net/sctp/socket.c .ioctl = sctp_ioctl, net/sctp/socket.c .ioctl = sctp_ioctl, net/xdp/xsk.c .ioctl = sock_no_ioctl, net/mptcp/protocol.c .ioctl = mptcp_ioctl, Am I missing something? Thanks!
Breno Leitao wrote: > On Tue, Apr 11, 2023 at 09:28:29AM -0600, Jens Axboe wrote: > > On 4/11/23 9:24?AM, Willem de Bruijn wrote: > > > Jens Axboe wrote: > > >> On 4/11/23 9:00?AM, Willem de Bruijn wrote: > > >> But that doesn't work, because sock->ops->ioctl() assumes the arg is > > >> memory in userspace. Or do you mean change all of the sock->ops->ioctl() > > >> to pass in on-stack memory (or similar) and have it work with a kernel > > >> address? > > > > > > That was what I suggested indeed. > > > > > > It's about as much code change as this patch series. But it avoids > > > the code duplication. > > > > Breno, want to tackle that as a prep patch first? Should make the > > functional changes afterwards much more straightforward, and will allow > > support for anything really. > > Absolutely. I just want to make sure that I got the proper approach that > we agreed here. > > Let me explain what I understood taking TCP as an example: > > 1) Rename tcp_ioctl() to something as _tcp_ioctl() where the 'arg' > argument is now just a kernel memory (located in the stack frame from the > callee). > > 2) Recreate "tcp_ioctl()" that will basically allocate a 'arg' in the > stack and call _tcp_ioctl() passing that 'arg' argument. At the bottom of > this (tcp_ioctl() function) function, call `put_user(in_kernel_arg, userspace_arg) > > 3) Repeat it for the 20 protocols that implement ioctl: > > ag "struct proto .* = {" -A 20 net/ | grep \.ioctl > net/dccp/ipv6.c .ioctl = dccp_ioctl, > net/dccp/ipv4.c .ioctl = dccp_ioctl, > net/ieee802154/socket.c .ioctl = dgram_ioctl, > net/ipv4/udplite.c .ioctl = udp_ioctl, > net/ipv4/raw.c .ioctl = raw_ioctl, > net/ipv4/udp.c .ioctl = udp_ioctl, > net/ipv4/tcp_ipv4.c .ioctl = tcp_ioctl, > net/ipv6/raw.c .ioctl = rawv6_ioctl, > net/ipv6/tcp_ipv6.c .ioctl = tcp_ioctl, > net/ipv6/udp.c .ioctl = udp_ioctl, > net/ipv6/udplite.c .ioctl = udp_ioctl, > net/l2tp/l2tp_ip6.c .ioctl = l2tp_ioctl, > net/l2tp/l2tp_ip.c .ioctl = l2tp_ioctl, > net/phonet/datagram.: .ioctl = pn_ioctl, > net/phonet/pep.c .ioctl = pep_ioctl, > net/rds/af_rds.c .ioctl = rds_ioctl, > net/sctp/socket.c .ioctl = sctp_ioctl, > net/sctp/socket.c .ioctl = sctp_ioctl, > net/xdp/xsk.c .ioctl = sock_no_ioctl, > net/mptcp/protocol.c .ioctl = mptcp_ioctl, > > Am I missing something? The suggestion is to convert all to take kernel memory and do the put_cmsg in the caller of .ioctl. Rather than create a wrapper for each individual instance and add a separate .iouring_cmd for each. "change all of the sock->ops->ioctl() to pass in on-stack memory (or similar) and have it work with a kernel address"
On Wed, Apr 12, 2023 at 10:28:41AM -0400, Willem de Bruijn wrote: > Breno Leitao wrote: > > On Tue, Apr 11, 2023 at 09:28:29AM -0600, Jens Axboe wrote: > > > On 4/11/23 9:24?AM, Willem de Bruijn wrote: > > > > Jens Axboe wrote: > > > >> On 4/11/23 9:00?AM, Willem de Bruijn wrote: > > > >> But that doesn't work, because sock->ops->ioctl() assumes the arg is > > > >> memory in userspace. Or do you mean change all of the sock->ops->ioctl() > > > >> to pass in on-stack memory (or similar) and have it work with a kernel > > > >> address? > > > > > > > > That was what I suggested indeed. > > > > > > > > It's about as much code change as this patch series. But it avoids > > > > the code duplication. > > > > > > Breno, want to tackle that as a prep patch first? Should make the > > > functional changes afterwards much more straightforward, and will allow > > > support for anything really. > > > > Absolutely. I just want to make sure that I got the proper approach that > > we agreed here. > > > > Let me explain what I understood taking TCP as an example: > > > > 1) Rename tcp_ioctl() to something as _tcp_ioctl() where the 'arg' > > argument is now just a kernel memory (located in the stack frame from the > > callee). > > > > 2) Recreate "tcp_ioctl()" that will basically allocate a 'arg' in the > > stack and call _tcp_ioctl() passing that 'arg' argument. At the bottom of > > this (tcp_ioctl() function) function, call `put_user(in_kernel_arg, userspace_arg) > > > > 3) Repeat it for the 20 protocols that implement ioctl: > > > > ag "struct proto .* = {" -A 20 net/ | grep \.ioctl > > net/dccp/ipv6.c .ioctl = dccp_ioctl, > > net/dccp/ipv4.c .ioctl = dccp_ioctl, > > net/ieee802154/socket.c .ioctl = dgram_ioctl, > > net/ipv4/udplite.c .ioctl = udp_ioctl, > > net/ipv4/raw.c .ioctl = raw_ioctl, > > net/ipv4/udp.c .ioctl = udp_ioctl, > > net/ipv4/tcp_ipv4.c .ioctl = tcp_ioctl, > > net/ipv6/raw.c .ioctl = rawv6_ioctl, > > net/ipv6/tcp_ipv6.c .ioctl = tcp_ioctl, > > net/ipv6/udp.c .ioctl = udp_ioctl, > > net/ipv6/udplite.c .ioctl = udp_ioctl, > > net/l2tp/l2tp_ip6.c .ioctl = l2tp_ioctl, > > net/l2tp/l2tp_ip.c .ioctl = l2tp_ioctl, > > net/phonet/datagram.: .ioctl = pn_ioctl, > > net/phonet/pep.c .ioctl = pep_ioctl, > > net/rds/af_rds.c .ioctl = rds_ioctl, > > net/sctp/socket.c .ioctl = sctp_ioctl, > > net/sctp/socket.c .ioctl = sctp_ioctl, > > net/xdp/xsk.c .ioctl = sock_no_ioctl, > > net/mptcp/protocol.c .ioctl = mptcp_ioctl, > > > > Am I missing something? > > The suggestion is to convert all to take kernel memory and do the > put_cmsg in the caller of .ioctl. Rather than create a wrapper for > each individual instance and add a separate .iouring_cmd for each. > > "change all of the sock->ops->ioctl() to pass in on-stack memory > (or similar) and have it work with a kernel address" is it possible to do it for cases where we don't know what is the size of the buffer? For instance the raw_ioctl()/rawv6_ioctl() case. The "arg" argument is used in different ways (one for input and one for output): 1) If cmd == SIOCOUTQ or SIOCINQ, then the return value will be returned to userspace: put_user(amount, (int __user *)arg) 2) For default cmd, ipmr_ioctl() is called, which reads from the `arg` parameter: copy_from_user(&vr, arg, sizeof(vr) How to handle these contradictory behaviour ahead of time (at callee time, where the buffers will be prepared)? Thank you!
Breno Leitao wrote: > On Wed, Apr 12, 2023 at 10:28:41AM -0400, Willem de Bruijn wrote: > > Breno Leitao wrote: > > > On Tue, Apr 11, 2023 at 09:28:29AM -0600, Jens Axboe wrote: > > > > On 4/11/23 9:24?AM, Willem de Bruijn wrote: > > > > > Jens Axboe wrote: > > > > >> On 4/11/23 9:00?AM, Willem de Bruijn wrote: > > > > >> But that doesn't work, because sock->ops->ioctl() assumes the arg is > > > > >> memory in userspace. Or do you mean change all of the sock->ops->ioctl() > > > > >> to pass in on-stack memory (or similar) and have it work with a kernel > > > > >> address? > > > > > > > > > > That was what I suggested indeed. > > > > > > > > > > It's about as much code change as this patch series. But it avoids > > > > > the code duplication. > > > > > > > > Breno, want to tackle that as a prep patch first? Should make the > > > > functional changes afterwards much more straightforward, and will allow > > > > support for anything really. > > > > > > Absolutely. I just want to make sure that I got the proper approach that > > > we agreed here. > > > > > > Let me explain what I understood taking TCP as an example: > > > > > > 1) Rename tcp_ioctl() to something as _tcp_ioctl() where the 'arg' > > > argument is now just a kernel memory (located in the stack frame from the > > > callee). > > > > > > 2) Recreate "tcp_ioctl()" that will basically allocate a 'arg' in the > > > stack and call _tcp_ioctl() passing that 'arg' argument. At the bottom of > > > this (tcp_ioctl() function) function, call `put_user(in_kernel_arg, userspace_arg) > > > > > > 3) Repeat it for the 20 protocols that implement ioctl: > > > > > > ag "struct proto .* = {" -A 20 net/ | grep \.ioctl > > > net/dccp/ipv6.c .ioctl = dccp_ioctl, > > > net/dccp/ipv4.c .ioctl = dccp_ioctl, > > > net/ieee802154/socket.c .ioctl = dgram_ioctl, > > > net/ipv4/udplite.c .ioctl = udp_ioctl, > > > net/ipv4/raw.c .ioctl = raw_ioctl, > > > net/ipv4/udp.c .ioctl = udp_ioctl, > > > net/ipv4/tcp_ipv4.c .ioctl = tcp_ioctl, > > > net/ipv6/raw.c .ioctl = rawv6_ioctl, > > > net/ipv6/tcp_ipv6.c .ioctl = tcp_ioctl, > > > net/ipv6/udp.c .ioctl = udp_ioctl, > > > net/ipv6/udplite.c .ioctl = udp_ioctl, > > > net/l2tp/l2tp_ip6.c .ioctl = l2tp_ioctl, > > > net/l2tp/l2tp_ip.c .ioctl = l2tp_ioctl, > > > net/phonet/datagram.: .ioctl = pn_ioctl, > > > net/phonet/pep.c .ioctl = pep_ioctl, > > > net/rds/af_rds.c .ioctl = rds_ioctl, > > > net/sctp/socket.c .ioctl = sctp_ioctl, > > > net/sctp/socket.c .ioctl = sctp_ioctl, > > > net/xdp/xsk.c .ioctl = sock_no_ioctl, > > > net/mptcp/protocol.c .ioctl = mptcp_ioctl, > > > > > > Am I missing something? > > > > The suggestion is to convert all to take kernel memory and do the > > put_cmsg in the caller of .ioctl. Rather than create a wrapper for > > each individual instance and add a separate .iouring_cmd for each. > > > > "change all of the sock->ops->ioctl() to pass in on-stack memory > > (or similar) and have it work with a kernel address" > > is it possible to do it for cases where we don't know what is the size > of the buffer? > > For instance the raw_ioctl()/rawv6_ioctl() case. The "arg" argument is > used in different ways (one for input and one for output): > > 1) If cmd == SIOCOUTQ or SIOCINQ, then the return value will be > returned to userspace: > put_user(amount, (int __user *)arg) > > 2) For default cmd, ipmr_ioctl() is called, which reads from the `arg` > parameter: > copy_from_user(&vr, arg, sizeof(vr) > > How to handle these contradictory behaviour ahead of time (at callee > time, where the buffers will be prepared)? > > Thank you! Ah you found a counter-example to the simple pattern of put_user. The answer perhaps depends on how many such counter-examples you encounter in the list you gave. If this is the only one, exceptions in the wrapper are reasonable. Not if there are many. Is the intent for io_uring to support all cases eventually? The current patch series only targeted more common fast path operations. Probably also relevant is whether/how the approach can be extended to [gs]etsockopt, as that was another example given, with the same challenge.
On Thu, 13 Apr 2023 10:24:31 -0400 Willem de Bruijn wrote: > Probably also relevant is whether/how the approach can be extended > to [gs]etsockopt, as that was another example given, with the same > challenge. I had the same thought, given BPF filtering/integration with *etsockopt is repeatedly giving us grief. The only lesson from that I can think of is that we should perhaps suffer thru the one-by-one conversions for a while. Pulling the cases we inspected out into common code, rather than hope we can cover everything in one fell swoop.
From: Willem de Bruijn > Sent: 13 April 2023 15:25 ... > > For instance the raw_ioctl()/rawv6_ioctl() case. The "arg" argument is > > used in different ways (one for input and one for output): > > > > 1) If cmd == SIOCOUTQ or SIOCINQ, then the return value will be > > returned to userspace: > > put_user(amount, (int __user *)arg) There is always the option of defining alternate ioctl 'cmd' codes that user IOR() and IOW() and requiring that io_uring applications use the alternate forms. Then have two 'ioctl' functions with a new one for IOR() type commands and the existing one for compatibility that might just do a translation (or return a translated command to avoid extra stack use). You may still want to pass through both the kernel and user (if a user request) buffer addresses to allow for those broken requests where the buffer direction bits are wrong. David - Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK Registration No: 1397386 (Wales)
On Thu, Apr 13, 2023 at 10:24:31AM -0400, Willem de Bruijn wrote: > > How to handle these contradictory behaviour ahead of time (at callee > > time, where the buffers will be prepared)? > > Ah you found a counter-example to the simple pattern of put_user. > > The answer perhaps depends on how many such counter-examples you > encounter in the list you gave. If this is the only one, exceptions > in the wrapper are reasonable. Not if there are many. Hello Williem, I spend sometime dealing with it, and the best way for me to figure out how much work this is, was implementing a PoC. You can find a basic PoC in the link below. It is not 100% complete (still need to convert 4 simple ioctls), but, it deals with the most complicated cases. The missing parts are straighforward if we are OK with this approach. https://github.com/leitao/linux/commits/ioctl_refactor Details ======= 1) Change the ioctl callback to use kernel memory arguments. This changes a lot of files but most of them are trivial. This is the new ioctl callback: struct proto { int (*ioctl)(struct sock *sk, int cmd, - unsigned long arg); + int *karg); You can see the full changeset in the following commit (which is the last in the tree above) https://github.com/leitao/linux/commit/ad78da14601b078c4b6a9f63a86032467ab59bf7 2) Create a wrapper (sock_skprot_ioctl()) that should be called instead of sk->sk_prot->ioctl(). For every exception, calls a specific function for the exception (basically ipmr_ioctl and ipmr_ioctl) (see more on 3) This is the commit https://github.com/leitao/linux/commit/511592e549c39ef0de19efa2eb4382cac5786227 3) There are two exceptions, they are ip{6}mr_ioctl() and pn_ioctl(). ip{6}mr is the hardest one, and I implemented the exception flow for it. You could find ipmr changes here: https://github.com/leitao/linux/commit/659a76dc0547ab2170023f31e20115520ebe33d9 Is this what you had in mind? Thank you!
Breno Leitao wrote: > On Thu, Apr 13, 2023 at 10:24:31AM -0400, Willem de Bruijn wrote: > > > How to handle these contradictory behaviour ahead of time (at callee > > > time, where the buffers will be prepared)? > > > > Ah you found a counter-example to the simple pattern of put_user. > > > > The answer perhaps depends on how many such counter-examples you > > encounter in the list you gave. If this is the only one, exceptions > > in the wrapper are reasonable. Not if there are many. > > > Hello Williem, > > I spend sometime dealing with it, and the best way for me to figure out > how much work this is, was implementing a PoC. You can find a basic PoC > in the link below. It is not 100% complete (still need to convert 4 > simple ioctls), but, it deals with the most complicated cases. The > missing parts are straighforward if we are OK with this approach. > > https://github.com/leitao/linux/commits/ioctl_refactor > > Details > ======= > > 1) Change the ioctl callback to use kernel memory arguments. This > changes a lot of files but most of them are trivial. This is the new > ioctl callback: > > struct proto { > > int (*ioctl)(struct sock *sk, int cmd, > - unsigned long arg); > + int *karg); > > You can see the full changeset in the following commit (which is > the last in the tree above) > https://github.com/leitao/linux/commit/ad78da14601b078c4b6a9f63a86032467ab59bf7 > > 2) Create a wrapper (sock_skprot_ioctl()) that should be called instead > of sk->sk_prot->ioctl(). For every exception, calls a specific function > for the exception (basically ipmr_ioctl and ipmr_ioctl) (see more on 3) > > This is the commit https://github.com/leitao/linux/commit/511592e549c39ef0de19efa2eb4382cac5786227 > > 3) There are two exceptions, they are ip{6}mr_ioctl() and pn_ioctl(). > ip{6}mr is the hardest one, and I implemented the exception flow for it. > > You could find ipmr changes here: > https://github.com/leitao/linux/commit/659a76dc0547ab2170023f31e20115520ebe33d9 > > Is this what you had in mind? > > Thank you! Thanks for the series, Breno. Yes, this looks very much what I hoped for. The series shows two cases of ioctls: getters that return an int, and combined getter/setters that take a struct of a certain size and return the exact same. I would deduplicate the four ipmr/ip6mr cases that constitute the second type, by having a single helper for this type. sock_skprot_ioctl_struct, which takes an argument for the struct size to copy in/out. Did this series cover all proto ioctls, or is this still a subset just for demonstration purposes -- and might there still be other types lurking elsewhere? If this is all, this looks like a reasonable amount of code churn to me. Three small points * please keep the __user annotation. Use make C=2 when unsure to warn about mismatched annotation * minor: special case the ipmr (type 2) ioctls in sock_skprot_ioctl and treat the "return int" (type 1) ioctls as the default case. * introduce code in a patch together with its use-case, so no separate patches for sock_skprot_ioctl and sock_skprot_ioctl_ipmr. Either one patch, or two, for each type of conversion.
On Tue, Apr 18, 2023 at 03:41:24PM -0400, Willem de Bruijn wrote: > Breno Leitao wrote: > > On Thu, Apr 13, 2023 at 10:24:31AM -0400, Willem de Bruijn wrote: > > > > How to handle these contradictory behaviour ahead of time (at callee > > > > time, where the buffers will be prepared)? > > > > > > Ah you found a counter-example to the simple pattern of put_user. > > > > > > The answer perhaps depends on how many such counter-examples you > > > encounter in the list you gave. If this is the only one, exceptions > > > in the wrapper are reasonable. Not if there are many. > > > > > > Hello Williem, > > > > I spend sometime dealing with it, and the best way for me to figure out > > how much work this is, was implementing a PoC. You can find a basic PoC > > in the link below. It is not 100% complete (still need to convert 4 > > simple ioctls), but, it deals with the most complicated cases. The > > missing parts are straighforward if we are OK with this approach. > > > > https://github.com/leitao/linux/commits/ioctl_refactor > > > > Details > > ======= > > > > 1) Change the ioctl callback to use kernel memory arguments. This > > changes a lot of files but most of them are trivial. This is the new > > ioctl callback: > > > > struct proto { > > > > int (*ioctl)(struct sock *sk, int cmd, > > - unsigned long arg); > > + int *karg); > > > > You can see the full changeset in the following commit (which is > > the last in the tree above) > > https://github.com/leitao/linux/commit/ad78da14601b078c4b6a9f63a86032467ab59bf7 > > > > 2) Create a wrapper (sock_skprot_ioctl()) that should be called instead > > of sk->sk_prot->ioctl(). For every exception, calls a specific function > > for the exception (basically ipmr_ioctl and ipmr_ioctl) (see more on 3) > > > > This is the commit https://github.com/leitao/linux/commit/511592e549c39ef0de19efa2eb4382cac5786227 > > > > 3) There are two exceptions, they are ip{6}mr_ioctl() and pn_ioctl(). > > ip{6}mr is the hardest one, and I implemented the exception flow for it. > > > > You could find ipmr changes here: > > https://github.com/leitao/linux/commit/659a76dc0547ab2170023f31e20115520ebe33d9 > > > > Is this what you had in mind? > > > > Thank you! > > Thanks for the series, Breno. Yes, this looks very much what I hoped for. Awesome. Thanks. > The series shows two cases of ioctls: getters that return an int, and > combined getter/setters that take a struct of a certain size and > return the exact same. > > I would deduplicate the four ipmr/ip6mr cases that constitute the second > type, by having a single helper for this type. sock_skprot_ioctl_struct, > which takes an argument for the struct size to copy in/out. Ok, that is a good advice. Thanks! > Did this series cover all proto ioctls, or is this still a subset just > for demonstration purposes -- and might there still be other types > lurking elsewhere? It does not cover all the cases. I would say it cover 80% of the cases, and the hardest cases. These are the missing cases, and what they do: * pn_ioctl (getters/setter that reads/return an int) * l2tp_ioctl (getters that return an int) * dgram_ioctl (getters that return an int) * sctp_ioctl (getters that return an int) * mptcp_ioctl (getters that return an int) * dccp_ioctl (getters that return an int) * dgram_ioctl (getters that return an int) * pep_ioctl (getters that return an int) Here is what I am using to get the full list: # ag --no-filename -A 20 "struct proto \w* = {" | grep .ioctl | cut -d "=" -f 2 | tr -d '\n' dccp_ioctl, dccp_ioctl, dgram_ioctl, tcp_ioctl, raw_ioctl, udp_ioctl, udp_ioctl, udp_ioctl, tcp_ioctl, l2tp_ioctl, rawv6_ioctl, l2tp_ioctl, mptcp_ioctl, pep_ioctl, pn_ioctl, rds_ioctl, sctp_ioctl, sctp_ioctl, sock_no_ioctl > If this is all, this looks like a reasonable amount of code churn to me. Should I proceed and create a final patch? I don't see a way to break up the last patch, which changes the API , in smaller patches. I.e., the last patch will be huge, right? > Three small points > > * please keep the __user annotation. Use make C=2 when unsure to warn > about mismatched annotation ack! > * minor: special case the ipmr (type 2) ioctls in sock_skprot_ioctl > and treat the "return int" (type 1) ioctls as the default case. ack! > * introduce code in a patch together with its use-case, so no separate > patches for sock_skprot_ioctl and sock_skprot_ioctl_ipmr. Either one > patch, or two, for each type of conversion. I am not sure how to change the ABI (struct proto) without doing all the protocol changes in the same patch. Otherwise compilation will be broken between the patch that changes the "struct proto" and the patch that changes the _ioctl for protocol X. I mean, is it possible to break up changing "struct proto" and the affected protocols? Thank you for the review and suggestions! PS: I will take some days off next week, and I am planning to send the final patch when I come back.
Breno Leitao wrote: > On Tue, Apr 18, 2023 at 03:41:24PM -0400, Willem de Bruijn wrote: > > Breno Leitao wrote: > > > On Thu, Apr 13, 2023 at 10:24:31AM -0400, Willem de Bruijn wrote: > > > > > How to handle these contradictory behaviour ahead of time (at callee > > > > > time, where the buffers will be prepared)? > > > > > > > > Ah you found a counter-example to the simple pattern of put_user. > > > > > > > > The answer perhaps depends on how many such counter-examples you > > > > encounter in the list you gave. If this is the only one, exceptions > > > > in the wrapper are reasonable. Not if there are many. > > > > > > > > > Hello Williem, > > > > > > I spend sometime dealing with it, and the best way for me to figure out > > > how much work this is, was implementing a PoC. You can find a basic PoC > > > in the link below. It is not 100% complete (still need to convert 4 > > > simple ioctls), but, it deals with the most complicated cases. The > > > missing parts are straighforward if we are OK with this approach. > > > > > > https://github.com/leitao/linux/commits/ioctl_refactor > > > > > > Details > > > ======= > > > > > > 1) Change the ioctl callback to use kernel memory arguments. This > > > changes a lot of files but most of them are trivial. This is the new > > > ioctl callback: > > > > > > struct proto { > > > > > > int (*ioctl)(struct sock *sk, int cmd, > > > - unsigned long arg); > > > + int *karg); > > > > > > You can see the full changeset in the following commit (which is > > > the last in the tree above) > > > https://github.com/leitao/linux/commit/ad78da14601b078c4b6a9f63a86032467ab59bf7 > > > > > > 2) Create a wrapper (sock_skprot_ioctl()) that should be called instead > > > of sk->sk_prot->ioctl(). For every exception, calls a specific function > > > for the exception (basically ipmr_ioctl and ipmr_ioctl) (see more on 3) > > > > > > This is the commit https://github.com/leitao/linux/commit/511592e549c39ef0de19efa2eb4382cac5786227 > > > > > > 3) There are two exceptions, they are ip{6}mr_ioctl() and pn_ioctl(). > > > ip{6}mr is the hardest one, and I implemented the exception flow for it. > > > > > > You could find ipmr changes here: > > > https://github.com/leitao/linux/commit/659a76dc0547ab2170023f31e20115520ebe33d9 > > > > > > Is this what you had in mind? > > > > > > Thank you! > > > > Thanks for the series, Breno. Yes, this looks very much what I hoped for. > > Awesome. Thanks. > > > The series shows two cases of ioctls: getters that return an int, and > > combined getter/setters that take a struct of a certain size and > > return the exact same. > > > > I would deduplicate the four ipmr/ip6mr cases that constitute the second > > type, by having a single helper for this type. sock_skprot_ioctl_struct, > > which takes an argument for the struct size to copy in/out. > > Ok, that is a good advice. Thanks! > > > Did this series cover all proto ioctls, or is this still a subset just > > for demonstration purposes -- and might there still be other types > > lurking elsewhere? > > It does not cover all the cases. I would say it cover 80% of the cases, > and the hardest cases. These are the missing cases, and what they do: > > * pn_ioctl (getters/setter that reads/return an int) > * l2tp_ioctl (getters that return an int) > * dgram_ioctl (getters that return an int) > * sctp_ioctl (getters that return an int) > * mptcp_ioctl (getters that return an int) > * dccp_ioctl (getters that return an int) > * dgram_ioctl (getters that return an int) > * pep_ioctl (getters that return an int) Thanks for the thorough review. So we have io_struct, io_int and o_int variants only. And the io_int can use the proposed io_struct helper that takes an explicit length to copy in and out. > Here is what I am using to get the full list: > # ag --no-filename -A 20 "struct proto \w* = {" | grep .ioctl | cut -d "=" -f 2 | tr -d '\n' > > dccp_ioctl, dccp_ioctl, dgram_ioctl, tcp_ioctl, raw_ioctl, udp_ioctl, > udp_ioctl, udp_ioctl, tcp_ioctl, l2tp_ioctl, rawv6_ioctl, l2tp_ioctl, > mptcp_ioctl, pep_ioctl, pn_ioctl, rds_ioctl, sctp_ioctl, sctp_ioctl, > sock_no_ioctl > > > If this is all, this looks like a reasonable amount of code churn to me. > > Should I proceed and create a final patch? I don't see a way to break up > the last patch, which changes the API , in smaller patches. I.e., the > last patch will be huge, right? Good point. So be it, then. > > Three small points > > > > * please keep the __user annotation. Use make C=2 when unsure to warn > > about mismatched annotation > > ack! > > > * minor: special case the ipmr (type 2) ioctls in sock_skprot_ioctl > > and treat the "return int" (type 1) ioctls as the default case. > > ack! > > > * introduce code in a patch together with its use-case, so no separate > > patches for sock_skprot_ioctl and sock_skprot_ioctl_ipmr. Either one > > patch, or two, for each type of conversion. > > I am not sure how to change the ABI (struct proto) without doing all the > protocol changes in the same patch. Otherwise compilation will be broken between > the patch that changes the "struct proto" and the patch that changes the > _ioctl for protocol X. I mean, is it possible to break up changing > "struct proto" and the affected protocols? > > Thank you for the review and suggestions! > > PS: I will take some days off next week, and I am planning to send the > final patch when I come back.
From: Breno Leitao <leit@fb.com> This patchset creates the initial plumbing for a io_uring command for sockets. For now, create two uring commands for sockets, SOCKET_URING_OP_SIOCOUTQ and SOCKET_URING_OP_SIOCINQ. They are similar to ioctl operations SIOCOUTQ and SIOCINQ. In fact, the code on the protocol side itself is heavily based on the ioctl operations. In order to test this code, I created a liburing test, which is currently located at [1], and I will create a pull request once we are good with this patch. I've also run test/io_uring_passthrough to make sure the first patch didn't regressed the NVME passthrough path. This patchset is a RFC for two different reasons: * It changes slighlty on how IO uring command operates. I.e, we are now passing the whole SQE to the io_uring_cmd callback (instead of an opaque buffer). This seems to be more palatable instead of creating some custom structure just to fit small parameters, as in SOCKET_URING_OP_SIOC{IN,OUT}Q. Is this OK? * Pavel has some ideas about the SQE->cmd_op field, so, we can start discussing it here. This work is heavily inspired by Jens Axboe's initial implementation. [1] https://github.com/leitao/liburing/blob/master/test/socket-io-cmd.c Breno Leitao (4): net: wire up support for file_operations->uring_cmd() net: add uring_cmd callback to UDP net: add uring_cmd callback to TCP net: add uring_cmd callback to raw "protocol" include/linux/net.h | 2 ++ include/net/raw.h | 3 +++ include/net/sock.h | 6 ++++++ include/net/tcp.h | 2 ++ include/net/udp.h | 2 ++ include/uapi/linux/net.h | 5 +++++ net/core/sock.c | 17 +++++++++++++++-- net/dccp/ipv4.c | 1 + net/ipv4/af_inet.c | 3 +++ net/ipv4/raw.c | 26 ++++++++++++++++++++++++++ net/ipv4/tcp.c | 34 ++++++++++++++++++++++++++++++++++ net/ipv4/tcp_ipv4.c | 1 + net/ipv4/udp.c | 18 ++++++++++++++++++ net/l2tp/l2tp_ip.c | 1 + net/mptcp/protocol.c | 1 + net/sctp/protocol.c | 1 + net/socket.c | 13 +++++++++++++ 17 files changed, 134 insertions(+), 2 deletions(-)