Message ID | cover.1743449872.git.metze@samba.org (mailing list archive) |
---|---|
Headers | show |
Series | net/io_uring: pass a kernel pointer via optlen_t to proto[_ops].getsockopt() | expand |
On 03/31, Stefan Metzmacher wrote: > The motivation for this is to remove the SOL_SOCKET limitation > from io_uring_cmd_getsockopt(). > > The reason for this limitation is that io_uring_cmd_getsockopt() > passes a kernel pointer as optlen to do_sock_getsockopt() > and can't reach the ops->getsockopt() path. > > The first idea would be to change the optval and optlen arguments > to the protocol specific hooks also to sockptr_t, as that > is already used for setsockopt() and also by do_sock_getsockopt() > sk_getsockopt() and BPF_CGROUP_RUN_PROG_GETSOCKOPT(). > > But as Linus don't like 'sockptr_t' I used a different approach. > > @Linus, would that optlen_t approach fit better for you? [..] > Instead of passing the optlen as user or kernel pointer, > we only ever pass a kernel pointer and do the > translation from/to userspace in do_sock_getsockopt(). At this point why not just fully embrace iov_iter? You have the size now + the user (or kernel) pointer. Might as well do s/sockptr_t/iov_iter/ conversion?
Hi Stefan, Thank you for your modifications, that's great! Our CI did some validations and here is its report: - KVM Validation: normal: Success! ✅ - KVM Validation: debug: Critical: Global Timeout ❌ - KVM Validation: btf-normal (only bpftest_all): Success! ✅ - KVM Validation: btf-debug (only bpftest_all): Unstable: 1 failed test(s): bpftest_test_progs_mptcp
Hi Stefan, Thank you for your modifications, that's great! Our CI did some validations and here is its report: - KVM Validation: normal: Unstable: 1 failed test(s): packetdrill_sockopts
Am 31.03.25 um 23:04 schrieb Stanislav Fomichev: > On 03/31, Stefan Metzmacher wrote: >> The motivation for this is to remove the SOL_SOCKET limitation >> from io_uring_cmd_getsockopt(). >> >> The reason for this limitation is that io_uring_cmd_getsockopt() >> passes a kernel pointer as optlen to do_sock_getsockopt() >> and can't reach the ops->getsockopt() path. >> >> The first idea would be to change the optval and optlen arguments >> to the protocol specific hooks also to sockptr_t, as that >> is already used for setsockopt() and also by do_sock_getsockopt() >> sk_getsockopt() and BPF_CGROUP_RUN_PROG_GETSOCKOPT(). >> >> But as Linus don't like 'sockptr_t' I used a different approach. >> >> @Linus, would that optlen_t approach fit better for you? > > [..] > >> Instead of passing the optlen as user or kernel pointer, >> we only ever pass a kernel pointer and do the >> translation from/to userspace in do_sock_getsockopt(). > > At this point why not just fully embrace iov_iter? You have the size > now + the user (or kernel) pointer. Might as well do > s/sockptr_t/iov_iter/ conversion? I think that would only be possible if we introduce proto[_ops].getsockopt_iter() and then convert the implementations step by step. Doing it all in one go has a lot of potential to break the uapi. I could try to convert things like socket, ip and tcp myself, but the rest needs to be converted by the maintainer of the specific protocol, as it needs to be tested. As there are crazy things happening in the existing implementations, e.g. some getsockopt() implementations use optval as in and out buffer. I first tried to convert both optval and optlen of getsockopt to sockptr_t, and that showed that touching the optval part starts to get complex very soon, see https://git.samba.org/?p=metze/linux/wip.git;a=commitdiff;h=141912166473bf8843ec6ace76dc9c6945adafd1 (note it didn't converted everything, I gave up after hitting sctp_getsockopt_peer_addrs and sctp_getsockopt_local_addrs. sctp_getsockopt_context, sctp_getsockopt_maxseg, sctp_getsockopt_associnfo and maybe more are the ones also doing both copy_from_user and copy_to_user on optval) I come also across one implementation that returned -ERANGE because *optlen was too short and put the required length into *optlen, which means the returned *optlen is larger than the optval buffer given from userspace. Because of all these strange things I tried to do a minimal change in order to get rid of the io_uring limitation and only converted optlen and leave optval as is. In order to have a patchset that has a low risk to cause regressions. But as alternative introducing a prototype like this: int (*getsockopt_iter)(struct socket *sock, int level, int optname, struct iov_iter *optval_iter); That returns a non-negative value which can be placed into *optlen or negative value as error and *optlen will not be changed on error. optval_iter will get direction ITER_DEST, so it can only be written to. Implementations could then opt in for the new interface and allow do_sock_getsockopt() work also for the io_uring case, while all others would still get -EOPNOTSUPP. So what should be the way to go? metze
Am 01.04.25 um 10:19 schrieb Stefan Metzmacher: > Am 31.03.25 um 23:04 schrieb Stanislav Fomichev: >> On 03/31, Stefan Metzmacher wrote: >>> The motivation for this is to remove the SOL_SOCKET limitation >>> from io_uring_cmd_getsockopt(). >>> >>> The reason for this limitation is that io_uring_cmd_getsockopt() >>> passes a kernel pointer as optlen to do_sock_getsockopt() >>> and can't reach the ops->getsockopt() path. >>> >>> The first idea would be to change the optval and optlen arguments >>> to the protocol specific hooks also to sockptr_t, as that >>> is already used for setsockopt() and also by do_sock_getsockopt() >>> sk_getsockopt() and BPF_CGROUP_RUN_PROG_GETSOCKOPT(). >>> >>> But as Linus don't like 'sockptr_t' I used a different approach. >>> >>> @Linus, would that optlen_t approach fit better for you? >> >> [..] >> >>> Instead of passing the optlen as user or kernel pointer, >>> we only ever pass a kernel pointer and do the >>> translation from/to userspace in do_sock_getsockopt(). >> >> At this point why not just fully embrace iov_iter? You have the size >> now + the user (or kernel) pointer. Might as well do >> s/sockptr_t/iov_iter/ conversion? > > I think that would only be possible if we introduce > proto[_ops].getsockopt_iter() and then convert the implementations > step by step. Doing it all in one go has a lot of potential to break > the uapi. I could try to convert things like socket, ip and tcp myself, but > the rest needs to be converted by the maintainer of the specific protocol, > as it needs to be tested. As there are crazy things happening in the existing > implementations, e.g. some getsockopt() implementations use optval as in and out > buffer. > > I first tried to convert both optval and optlen of getsockopt to sockptr_t, > and that showed that touching the optval part starts to get complex very soon, > see https://git.samba.org/?p=metze/linux/wip.git;a=commitdiff;h=141912166473bf8843ec6ace76dc9c6945adafd1 > (note it didn't converted everything, I gave up after hitting > sctp_getsockopt_peer_addrs and sctp_getsockopt_local_addrs. > sctp_getsockopt_context, sctp_getsockopt_maxseg, sctp_getsockopt_associnfo and maybe > more are the ones also doing both copy_from_user and copy_to_user on optval) > > I come also across one implementation that returned -ERANGE because *optlen was > too short and put the required length into *optlen, which means the returned > *optlen is larger than the optval buffer given from userspace. > > Because of all these strange things I tried to do a minimal change > in order to get rid of the io_uring limitation and only converted > optlen and leave optval as is. > > In order to have a patchset that has a low risk to cause regressions. > > But as alternative introducing a prototype like this: > > int (*getsockopt_iter)(struct socket *sock, int level, int optname, > struct iov_iter *optval_iter); > > That returns a non-negative value which can be placed into *optlen > or negative value as error and *optlen will not be changed on error. > optval_iter will get direction ITER_DEST, so it can only be written to. > > Implementations could then opt in for the new interface and > allow do_sock_getsockopt() work also for the io_uring case, > while all others would still get -EOPNOTSUPP. > > So what should be the way to go? Ok, I've added the infrastructure for getsockopt_iter, see below, but the first part I wanted to convert was tcp_ao_copy_mkts_to_user() and that also reads from userspace before writing. So we could go with the optlen_t approach, or we need logic for ITER_BOTH or pass two iov_iters one with ITER_SRC and one with ITER_DEST... So who wants to decide? Thanks! metze --- include/linux/net.h | 4 +++ include/net/sock.h | 64 +++++++++++++++++++++++++++++++++++++++++++++ net/core/sock.c | 12 +++++++-- net/socket.c | 12 +++++++-- 4 files changed, 88 insertions(+), 4 deletions(-) diff --git a/include/linux/net.h b/include/linux/net.h index 0ff950eecc6b..ceb9f9ed84b9 100644 --- a/include/linux/net.h +++ b/include/linux/net.h @@ -194,6 +194,10 @@ struct proto_ops { unsigned int optlen); int (*getsockopt)(struct socket *sock, int level, int optname, char __user *optval, int __user *optlen); + int (*getsockopt_iter)(struct socket *sock, + int level, + int optname, + struct iov_iter *optval_iter); void (*show_fdinfo)(struct seq_file *m, struct socket *sock); int (*sendmsg) (struct socket *sock, struct msghdr *m, size_t total_len); diff --git a/include/net/sock.h b/include/net/sock.h index 8daf1b3b12c6..e741b219056e 100644 --- a/include/net/sock.h +++ b/include/net/sock.h @@ -1249,6 +1249,11 @@ struct proto { int (*getsockopt)(struct sock *sk, int level, int optname, char __user *optval, int __user *option); + int (*getsockopt_iter)(struct sock *sk, + int level, + int optname, + struct iov_iter *optval_iter); + void (*keepalive)(struct sock *sk, int valbool); #ifdef CONFIG_COMPAT int (*compat_ioctl)(struct sock *sk, @@ -1781,6 +1786,65 @@ int do_sock_setsockopt(struct socket *sock, bool compat, int level, int do_sock_getsockopt(struct socket *sock, bool compat, int level, int optname, sockptr_t optval, sockptr_t optlen); +#define __generic_wrap_getsockopt_iter(__s, __level, \ + __optname, __optval, __optlen, \ + __getsockopt_iter) \ +do { \ + struct iov_iter optval_iter; \ + struct kvec optval_kvec; \ + int len; \ + int err; \ + \ + if (unlikely(__getsockopt_iter == NULL)) \ + return -EOPNOTSUPP; \ + \ + if (copy_from_sockptr(&len, __optlen, sizeof(len))) \ + return -EFAULT; \ + \ + if (len < 0) \ + return -EINVAL; \ + \ + if (__optval.is_kernel) { \ + if (__optval.kernel == NULL && len != 0) \ + return -EFAULT; \ + \ + optval_kvec = (struct kvec) { \ + .iov_base = __optval.kernel, \ + .iov_len = len, \ + }; \ + \ + iov_iter_kvec(&optval_iter, ITER_DEST, \ + &optval_kvec, 1, optval_kvec.iov_len); \ + } else { \ + if (import_ubuf(ITER_DEST, __optval.user, len, &optval_iter)) \ + return -EFAULT; \ + } \ + \ + err = getsockopt_iter(__s, __level, __optname, &optval_iter); \ + if (unlikely(err < 0)) \ + return err; \ + \ + len = err; \ + if (copy_to_sockptr(__optlen, &len, sizeof(len))) \ + return -EFAULT; \ + \ + return 0; \ +} while (0) + +static __always_inline +int sk_wrap_getsockopt_iter(struct sock *sk, int level, int optname, sockptr_t optval, sockptr_t optlen, + int (*getsockopt_iter)(struct sock *sk, int level, int optname, struct iov_iter *optval_iter)) +{ + __generic_wrap_getsockopt_iter(sk, level, optname, optval, optlen, getsockopt_iter); +} + +static __always_inline +int sock_wrap_getsockopt_iter(struct socket *sock, int level, int optname, sockptr_t optval, sockptr_t optlen, + int (*getsockopt_iter)(struct socket *sock, int level, int optname, struct iov_iter *optval_iter)) +{ + __generic_wrap_getsockopt_iter(sock, level, optname, optval, optlen, getsockopt_iter); +} + int sk_getsockopt(struct sock *sk, int level, int optname, sockptr_t optval, sockptr_t optlen); int sock_gettstamp(struct socket *sock, void __user *userstamp, diff --git a/net/core/sock.c b/net/core/sock.c index 323892066def..61625060e724 100644 --- a/net/core/sock.c +++ b/net/core/sock.c @@ -3857,9 +3857,17 @@ int sock_common_getsockopt(struct socket *sock, int level, int optname, char __user *optval, int __user *optlen) { struct sock *sk = sock->sk; - /* IPV6_ADDRFORM can change sk->sk_prot under us. */ - return READ_ONCE(sk->sk_prot)->getsockopt(sk, level, optname, optval, optlen); + struct proto *prot = READ_ONCE(sk->sk_prot); + + if (prot->getsockopt_iter) { + return sk_wrap_getsockopt_iter(sk, level, optname, + USER_SOCKPTR(optval), + USER_SOCKPTR(optlen), + prot->getsockopt_iter); + } + + return prot->getsockopt(sk, level, optname, optval, optlen); } EXPORT_SYMBOL(sock_common_getsockopt); diff --git a/net/socket.c b/net/socket.c index 9a0e720f0859..792cfd272611 100644 --- a/net/socket.c +++ b/net/socket.c @@ -2335,6 +2335,7 @@ int do_sock_getsockopt(struct socket *sock, bool compat, int level, { int max_optlen __maybe_unused = 0; const struct proto_ops *ops; + const struct proto *prot; int err; err = security_socket_getsockopt(sock, level, optname); @@ -2345,12 +2346,19 @@ int do_sock_getsockopt(struct socket *sock, bool compat, int level, copy_from_sockptr(&max_optlen, optlen, sizeof(int)); ops = READ_ONCE(sock->ops); + prot = READ_ONCE(sock->sk->sk_prot); if (level == SOL_SOCKET) { err = sk_getsockopt(sock->sk, level, optname, optval, optlen); - } else if (unlikely(!ops->getsockopt)) { + } else if (ops->getsockopt_iter) { + err = sock_wrap_getsockopt_iter(sock, level, optname, optval, optlen, + ops->getsockopt_iter); + } else if (ops->getsockopt == sock_common_getsockopt && prot->getsockopt_iter) { + err = sk_wrap_getsockopt_iter(sock->sk, level, optname, optval, optlen, + prot->getsockopt_iter); + } else if (unlikely(!ops->getsockopt || optlen.is_kernel)) { err = -EOPNOTSUPP; } else { - if (WARN_ONCE(optval.is_kernel || optlen.is_kernel, + if (WARN_ONCE(optval.is_kernel, "Invalid argument type")) return -EOPNOTSUPP;
Am 01.04.25 um 15:37 schrieb Stefan Metzmacher: > Am 01.04.25 um 10:19 schrieb Stefan Metzmacher: >> Am 31.03.25 um 23:04 schrieb Stanislav Fomichev: >>> On 03/31, Stefan Metzmacher wrote: >>>> The motivation for this is to remove the SOL_SOCKET limitation >>>> from io_uring_cmd_getsockopt(). >>>> >>>> The reason for this limitation is that io_uring_cmd_getsockopt() >>>> passes a kernel pointer as optlen to do_sock_getsockopt() >>>> and can't reach the ops->getsockopt() path. >>>> >>>> The first idea would be to change the optval and optlen arguments >>>> to the protocol specific hooks also to sockptr_t, as that >>>> is already used for setsockopt() and also by do_sock_getsockopt() >>>> sk_getsockopt() and BPF_CGROUP_RUN_PROG_GETSOCKOPT(). >>>> >>>> But as Linus don't like 'sockptr_t' I used a different approach. >>>> >>>> @Linus, would that optlen_t approach fit better for you? >>> >>> [..] >>> >>>> Instead of passing the optlen as user or kernel pointer, >>>> we only ever pass a kernel pointer and do the >>>> translation from/to userspace in do_sock_getsockopt(). >>> >>> At this point why not just fully embrace iov_iter? You have the size >>> now + the user (or kernel) pointer. Might as well do >>> s/sockptr_t/iov_iter/ conversion? >> >> I think that would only be possible if we introduce >> proto[_ops].getsockopt_iter() and then convert the implementations >> step by step. Doing it all in one go has a lot of potential to break >> the uapi. I could try to convert things like socket, ip and tcp myself, but >> the rest needs to be converted by the maintainer of the specific protocol, >> as it needs to be tested. As there are crazy things happening in the existing >> implementations, e.g. some getsockopt() implementations use optval as in and out >> buffer. >> >> I first tried to convert both optval and optlen of getsockopt to sockptr_t, >> and that showed that touching the optval part starts to get complex very soon, >> see https://git.samba.org/?p=metze/linux/wip.git;a=commitdiff;h=141912166473bf8843ec6ace76dc9c6945adafd1 >> (note it didn't converted everything, I gave up after hitting >> sctp_getsockopt_peer_addrs and sctp_getsockopt_local_addrs. >> sctp_getsockopt_context, sctp_getsockopt_maxseg, sctp_getsockopt_associnfo and maybe >> more are the ones also doing both copy_from_user and copy_to_user on optval) >> >> I come also across one implementation that returned -ERANGE because *optlen was >> too short and put the required length into *optlen, which means the returned >> *optlen is larger than the optval buffer given from userspace. >> >> Because of all these strange things I tried to do a minimal change >> in order to get rid of the io_uring limitation and only converted >> optlen and leave optval as is. >> >> In order to have a patchset that has a low risk to cause regressions. >> >> But as alternative introducing a prototype like this: >> >> int (*getsockopt_iter)(struct socket *sock, int level, int optname, >> struct iov_iter *optval_iter); >> >> That returns a non-negative value which can be placed into *optlen >> or negative value as error and *optlen will not be changed on error. >> optval_iter will get direction ITER_DEST, so it can only be written to. >> >> Implementations could then opt in for the new interface and >> allow do_sock_getsockopt() work also for the io_uring case, >> while all others would still get -EOPNOTSUPP. >> >> So what should be the way to go? > > Ok, I've added the infrastructure for getsockopt_iter, see below, > but the first part I wanted to convert was > tcp_ao_copy_mkts_to_user() and that also reads from userspace before > writing. > > So we could go with the optlen_t approach, or we need > logic for ITER_BOTH or pass two iov_iters one with ITER_SRC and one > with ITER_DEST... > > So who wants to decide? I just noticed that it's even possible in same cases to pass in a short buffer to optval, but have a longer value in optlen, hci_sock_getsockopt() with SOL_BLUETOOTH completely ignores optlen. This makes it really hard to believe that trying to use iov_iter for this is a good idea :-( Any ideas beside just going with optlen_t? metze
On Tue, Apr 01, 2025 at 03:48:58PM +0200, Stefan Metzmacher wrote: > Am 01.04.25 um 15:37 schrieb Stefan Metzmacher: > > Am 01.04.25 um 10:19 schrieb Stefan Metzmacher: > > > Am 31.03.25 um 23:04 schrieb Stanislav Fomichev: > > > > On 03/31, Stefan Metzmacher wrote: > > > > > The motivation for this is to remove the SOL_SOCKET limitation > > > > > from io_uring_cmd_getsockopt(). > > > > > > > > > > The reason for this limitation is that io_uring_cmd_getsockopt() > > > > > passes a kernel pointer as optlen to do_sock_getsockopt() > > > > > and can't reach the ops->getsockopt() path. > > > > > > > > > > The first idea would be to change the optval and optlen arguments > > > > > to the protocol specific hooks also to sockptr_t, as that > > > > > is already used for setsockopt() and also by do_sock_getsockopt() > > > > > sk_getsockopt() and BPF_CGROUP_RUN_PROG_GETSOCKOPT(). > > > > > > > > > > But as Linus don't like 'sockptr_t' I used a different approach. > > > > > > > > > > @Linus, would that optlen_t approach fit better for you? > > > > > > > > [..] > > > > > > > > > Instead of passing the optlen as user or kernel pointer, > > > > > we only ever pass a kernel pointer and do the > > > > > translation from/to userspace in do_sock_getsockopt(). > > > > > > > > At this point why not just fully embrace iov_iter? You have the size > > > > now + the user (or kernel) pointer. Might as well do > > > > s/sockptr_t/iov_iter/ conversion? > > > > > > I think that would only be possible if we introduce > > > proto[_ops].getsockopt_iter() and then convert the implementations > > > step by step. Doing it all in one go has a lot of potential to break > > > the uapi. I could try to convert things like socket, ip and tcp myself, but > > > the rest needs to be converted by the maintainer of the specific protocol, > > > as it needs to be tested. As there are crazy things happening in the existing > > > implementations, e.g. some getsockopt() implementations use optval as in and out > > > buffer. > > > > > > I first tried to convert both optval and optlen of getsockopt to sockptr_t, > > > and that showed that touching the optval part starts to get complex very soon, > > > see https://git.samba.org/?p=metze/linux/wip.git;a=commitdiff;h=141912166473bf8843ec6ace76dc9c6945adafd1 > > > (note it didn't converted everything, I gave up after hitting > > > sctp_getsockopt_peer_addrs and sctp_getsockopt_local_addrs. > > > sctp_getsockopt_context, sctp_getsockopt_maxseg, sctp_getsockopt_associnfo and maybe > > > more are the ones also doing both copy_from_user and copy_to_user on optval) > > > > > > I come also across one implementation that returned -ERANGE because *optlen was > > > too short and put the required length into *optlen, which means the returned > > > *optlen is larger than the optval buffer given from userspace. > > > > > > Because of all these strange things I tried to do a minimal change > > > in order to get rid of the io_uring limitation and only converted > > > optlen and leave optval as is. > > > > > > In order to have a patchset that has a low risk to cause regressions. > > > > > > But as alternative introducing a prototype like this: > > > > > > int (*getsockopt_iter)(struct socket *sock, int level, int optname, > > > struct iov_iter *optval_iter); > > > > > > That returns a non-negative value which can be placed into *optlen > > > or negative value as error and *optlen will not be changed on error. > > > optval_iter will get direction ITER_DEST, so it can only be written to. > > > > > > Implementations could then opt in for the new interface and > > > allow do_sock_getsockopt() work also for the io_uring case, > > > while all others would still get -EOPNOTSUPP. > > > > > > So what should be the way to go? > > > > Ok, I've added the infrastructure for getsockopt_iter, see below, > > but the first part I wanted to convert was > > tcp_ao_copy_mkts_to_user() and that also reads from userspace before > > writing. > > > > So we could go with the optlen_t approach, or we need > > logic for ITER_BOTH or pass two iov_iters one with ITER_SRC and one > > with ITER_DEST... > > > > So who wants to decide? > > I just noticed that it's even possible in same cases > to pass in a short buffer to optval, but have a longer value in optlen, > hci_sock_getsockopt() with SOL_BLUETOOTH completely ignores optlen. > > This makes it really hard to believe that trying to use iov_iter for this > is a good idea :-( That was my finding as well a while ago, when I was planning to get the __user pointers converted to iov_iter. There are some weird ways of using optlen and optval, which makes them non-trivial to covert to iov_iter.
On 04/01, Breno Leitao wrote: > On Tue, Apr 01, 2025 at 03:48:58PM +0200, Stefan Metzmacher wrote: > > Am 01.04.25 um 15:37 schrieb Stefan Metzmacher: > > > Am 01.04.25 um 10:19 schrieb Stefan Metzmacher: > > > > Am 31.03.25 um 23:04 schrieb Stanislav Fomichev: > > > > > On 03/31, Stefan Metzmacher wrote: > > > > > > The motivation for this is to remove the SOL_SOCKET limitation > > > > > > from io_uring_cmd_getsockopt(). > > > > > > > > > > > > The reason for this limitation is that io_uring_cmd_getsockopt() > > > > > > passes a kernel pointer as optlen to do_sock_getsockopt() > > > > > > and can't reach the ops->getsockopt() path. > > > > > > > > > > > > The first idea would be to change the optval and optlen arguments > > > > > > to the protocol specific hooks also to sockptr_t, as that > > > > > > is already used for setsockopt() and also by do_sock_getsockopt() > > > > > > sk_getsockopt() and BPF_CGROUP_RUN_PROG_GETSOCKOPT(). > > > > > > > > > > > > But as Linus don't like 'sockptr_t' I used a different approach. > > > > > > > > > > > > @Linus, would that optlen_t approach fit better for you? > > > > > > > > > > [..] > > > > > > > > > > > Instead of passing the optlen as user or kernel pointer, > > > > > > we only ever pass a kernel pointer and do the > > > > > > translation from/to userspace in do_sock_getsockopt(). > > > > > > > > > > At this point why not just fully embrace iov_iter? You have the size > > > > > now + the user (or kernel) pointer. Might as well do > > > > > s/sockptr_t/iov_iter/ conversion? > > > > > > > > I think that would only be possible if we introduce > > > > proto[_ops].getsockopt_iter() and then convert the implementations > > > > step by step. Doing it all in one go has a lot of potential to break > > > > the uapi. I could try to convert things like socket, ip and tcp myself, but > > > > the rest needs to be converted by the maintainer of the specific protocol, > > > > as it needs to be tested. As there are crazy things happening in the existing > > > > implementations, e.g. some getsockopt() implementations use optval as in and out > > > > buffer. > > > > > > > > I first tried to convert both optval and optlen of getsockopt to sockptr_t, > > > > and that showed that touching the optval part starts to get complex very soon, > > > > see https://git.samba.org/?p=metze/linux/wip.git;a=commitdiff;h=141912166473bf8843ec6ace76dc9c6945adafd1 > > > > (note it didn't converted everything, I gave up after hitting > > > > sctp_getsockopt_peer_addrs and sctp_getsockopt_local_addrs. > > > > sctp_getsockopt_context, sctp_getsockopt_maxseg, sctp_getsockopt_associnfo and maybe > > > > more are the ones also doing both copy_from_user and copy_to_user on optval) > > > > > > > > I come also across one implementation that returned -ERANGE because *optlen was > > > > too short and put the required length into *optlen, which means the returned > > > > *optlen is larger than the optval buffer given from userspace. > > > > > > > > Because of all these strange things I tried to do a minimal change > > > > in order to get rid of the io_uring limitation and only converted > > > > optlen and leave optval as is. > > > > > > > > In order to have a patchset that has a low risk to cause regressions. > > > > > > > > But as alternative introducing a prototype like this: > > > > > > > > int (*getsockopt_iter)(struct socket *sock, int level, int optname, > > > > struct iov_iter *optval_iter); > > > > > > > > That returns a non-negative value which can be placed into *optlen > > > > or negative value as error and *optlen will not be changed on error. > > > > optval_iter will get direction ITER_DEST, so it can only be written to. > > > > > > > > Implementations could then opt in for the new interface and > > > > allow do_sock_getsockopt() work also for the io_uring case, > > > > while all others would still get -EOPNOTSUPP. > > > > > > > > So what should be the way to go? > > > > > > Ok, I've added the infrastructure for getsockopt_iter, see below, > > > but the first part I wanted to convert was > > > tcp_ao_copy_mkts_to_user() and that also reads from userspace before > > > writing. > > > > > > So we could go with the optlen_t approach, or we need > > > logic for ITER_BOTH or pass two iov_iters one with ITER_SRC and one > > > with ITER_DEST... > > > > > > So who wants to decide? > > > > I just noticed that it's even possible in same cases > > to pass in a short buffer to optval, but have a longer value in optlen, > > hci_sock_getsockopt() with SOL_BLUETOOTH completely ignores optlen. > > > > This makes it really hard to believe that trying to use iov_iter for this > > is a good idea :-( > > That was my finding as well a while ago, when I was planning to get the > __user pointers converted to iov_iter. There are some weird ways of > using optlen and optval, which makes them non-trivial to covert to > iov_iter. Can we ignore all non-ip/tcp/udp cases for now? This should cover +90% of useful socket opts. See if there are any obvious problems with them and if not, try converting. The rest we can cover separately when/if needed.
Am 01.04.25 um 17:45 schrieb Stanislav Fomichev: > On 04/01, Breno Leitao wrote: >> On Tue, Apr 01, 2025 at 03:48:58PM +0200, Stefan Metzmacher wrote: >>> Am 01.04.25 um 15:37 schrieb Stefan Metzmacher: >>>> Am 01.04.25 um 10:19 schrieb Stefan Metzmacher: >>>>> Am 31.03.25 um 23:04 schrieb Stanislav Fomichev: >>>>>> On 03/31, Stefan Metzmacher wrote: >>>>>>> The motivation for this is to remove the SOL_SOCKET limitation >>>>>>> from io_uring_cmd_getsockopt(). >>>>>>> >>>>>>> The reason for this limitation is that io_uring_cmd_getsockopt() >>>>>>> passes a kernel pointer as optlen to do_sock_getsockopt() >>>>>>> and can't reach the ops->getsockopt() path. >>>>>>> >>>>>>> The first idea would be to change the optval and optlen arguments >>>>>>> to the protocol specific hooks also to sockptr_t, as that >>>>>>> is already used for setsockopt() and also by do_sock_getsockopt() >>>>>>> sk_getsockopt() and BPF_CGROUP_RUN_PROG_GETSOCKOPT(). >>>>>>> >>>>>>> But as Linus don't like 'sockptr_t' I used a different approach. >>>>>>> >>>>>>> @Linus, would that optlen_t approach fit better for you? >>>>>> >>>>>> [..] >>>>>> >>>>>>> Instead of passing the optlen as user or kernel pointer, >>>>>>> we only ever pass a kernel pointer and do the >>>>>>> translation from/to userspace in do_sock_getsockopt(). >>>>>> >>>>>> At this point why not just fully embrace iov_iter? You have the size >>>>>> now + the user (or kernel) pointer. Might as well do >>>>>> s/sockptr_t/iov_iter/ conversion? >>>>> >>>>> I think that would only be possible if we introduce >>>>> proto[_ops].getsockopt_iter() and then convert the implementations >>>>> step by step. Doing it all in one go has a lot of potential to break >>>>> the uapi. I could try to convert things like socket, ip and tcp myself, but >>>>> the rest needs to be converted by the maintainer of the specific protocol, >>>>> as it needs to be tested. As there are crazy things happening in the existing >>>>> implementations, e.g. some getsockopt() implementations use optval as in and out >>>>> buffer. >>>>> >>>>> I first tried to convert both optval and optlen of getsockopt to sockptr_t, >>>>> and that showed that touching the optval part starts to get complex very soon, >>>>> see https://git.samba.org/?p=metze/linux/wip.git;a=commitdiff;h=141912166473bf8843ec6ace76dc9c6945adafd1 >>>>> (note it didn't converted everything, I gave up after hitting >>>>> sctp_getsockopt_peer_addrs and sctp_getsockopt_local_addrs. >>>>> sctp_getsockopt_context, sctp_getsockopt_maxseg, sctp_getsockopt_associnfo and maybe >>>>> more are the ones also doing both copy_from_user and copy_to_user on optval) >>>>> >>>>> I come also across one implementation that returned -ERANGE because *optlen was >>>>> too short and put the required length into *optlen, which means the returned >>>>> *optlen is larger than the optval buffer given from userspace. >>>>> >>>>> Because of all these strange things I tried to do a minimal change >>>>> in order to get rid of the io_uring limitation and only converted >>>>> optlen and leave optval as is. >>>>> >>>>> In order to have a patchset that has a low risk to cause regressions. >>>>> >>>>> But as alternative introducing a prototype like this: >>>>> >>>>> int (*getsockopt_iter)(struct socket *sock, int level, int optname, >>>>> struct iov_iter *optval_iter); >>>>> >>>>> That returns a non-negative value which can be placed into *optlen >>>>> or negative value as error and *optlen will not be changed on error. >>>>> optval_iter will get direction ITER_DEST, so it can only be written to. >>>>> >>>>> Implementations could then opt in for the new interface and >>>>> allow do_sock_getsockopt() work also for the io_uring case, >>>>> while all others would still get -EOPNOTSUPP. >>>>> >>>>> So what should be the way to go? >>>> >>>> Ok, I've added the infrastructure for getsockopt_iter, see below, >>>> but the first part I wanted to convert was >>>> tcp_ao_copy_mkts_to_user() and that also reads from userspace before >>>> writing. >>>> >>>> So we could go with the optlen_t approach, or we need >>>> logic for ITER_BOTH or pass two iov_iters one with ITER_SRC and one >>>> with ITER_DEST... >>>> >>>> So who wants to decide? >>> >>> I just noticed that it's even possible in same cases >>> to pass in a short buffer to optval, but have a longer value in optlen, >>> hci_sock_getsockopt() with SOL_BLUETOOTH completely ignores optlen. >>> >>> This makes it really hard to believe that trying to use iov_iter for this >>> is a good idea :-( >> >> That was my finding as well a while ago, when I was planning to get the >> __user pointers converted to iov_iter. There are some weird ways of >> using optlen and optval, which makes them non-trivial to covert to >> iov_iter. > > Can we ignore all non-ip/tcp/udp cases for now? This should cover +90% > of useful socket opts. See if there are any obvious problems with them > and if not, try converting. The rest we can cover separately when/if > needed. That's what I tried, but it fails with tcp_getsockopt -> do_tcp_getsockopt -> tcp_ao_get_mkts -> tcp_ao_copy_mkts_to_user -> copy_struct_from_sockptr tcp_ao_get_sock_info -> copy_struct_from_sockptr That's not possible with a ITER_DEST iov_iter. metze
On 04/01, Stefan Metzmacher wrote: > Am 01.04.25 um 17:45 schrieb Stanislav Fomichev: > > On 04/01, Breno Leitao wrote: > > > On Tue, Apr 01, 2025 at 03:48:58PM +0200, Stefan Metzmacher wrote: > > > > Am 01.04.25 um 15:37 schrieb Stefan Metzmacher: > > > > > Am 01.04.25 um 10:19 schrieb Stefan Metzmacher: > > > > > > Am 31.03.25 um 23:04 schrieb Stanislav Fomichev: > > > > > > > On 03/31, Stefan Metzmacher wrote: > > > > > > > > The motivation for this is to remove the SOL_SOCKET limitation > > > > > > > > from io_uring_cmd_getsockopt(). > > > > > > > > > > > > > > > > The reason for this limitation is that io_uring_cmd_getsockopt() > > > > > > > > passes a kernel pointer as optlen to do_sock_getsockopt() > > > > > > > > and can't reach the ops->getsockopt() path. > > > > > > > > > > > > > > > > The first idea would be to change the optval and optlen arguments > > > > > > > > to the protocol specific hooks also to sockptr_t, as that > > > > > > > > is already used for setsockopt() and also by do_sock_getsockopt() > > > > > > > > sk_getsockopt() and BPF_CGROUP_RUN_PROG_GETSOCKOPT(). > > > > > > > > > > > > > > > > But as Linus don't like 'sockptr_t' I used a different approach. > > > > > > > > > > > > > > > > @Linus, would that optlen_t approach fit better for you? > > > > > > > > > > > > > > [..] > > > > > > > > > > > > > > > Instead of passing the optlen as user or kernel pointer, > > > > > > > > we only ever pass a kernel pointer and do the > > > > > > > > translation from/to userspace in do_sock_getsockopt(). > > > > > > > > > > > > > > At this point why not just fully embrace iov_iter? You have the size > > > > > > > now + the user (or kernel) pointer. Might as well do > > > > > > > s/sockptr_t/iov_iter/ conversion? > > > > > > > > > > > > I think that would only be possible if we introduce > > > > > > proto[_ops].getsockopt_iter() and then convert the implementations > > > > > > step by step. Doing it all in one go has a lot of potential to break > > > > > > the uapi. I could try to convert things like socket, ip and tcp myself, but > > > > > > the rest needs to be converted by the maintainer of the specific protocol, > > > > > > as it needs to be tested. As there are crazy things happening in the existing > > > > > > implementations, e.g. some getsockopt() implementations use optval as in and out > > > > > > buffer. > > > > > > > > > > > > I first tried to convert both optval and optlen of getsockopt to sockptr_t, > > > > > > and that showed that touching the optval part starts to get complex very soon, > > > > > > see https://git.samba.org/?p=metze/linux/wip.git;a=commitdiff;h=141912166473bf8843ec6ace76dc9c6945adafd1 > > > > > > (note it didn't converted everything, I gave up after hitting > > > > > > sctp_getsockopt_peer_addrs and sctp_getsockopt_local_addrs. > > > > > > sctp_getsockopt_context, sctp_getsockopt_maxseg, sctp_getsockopt_associnfo and maybe > > > > > > more are the ones also doing both copy_from_user and copy_to_user on optval) > > > > > > > > > > > > I come also across one implementation that returned -ERANGE because *optlen was > > > > > > too short and put the required length into *optlen, which means the returned > > > > > > *optlen is larger than the optval buffer given from userspace. > > > > > > > > > > > > Because of all these strange things I tried to do a minimal change > > > > > > in order to get rid of the io_uring limitation and only converted > > > > > > optlen and leave optval as is. > > > > > > > > > > > > In order to have a patchset that has a low risk to cause regressions. > > > > > > > > > > > > But as alternative introducing a prototype like this: > > > > > > > > > > > > int (*getsockopt_iter)(struct socket *sock, int level, int optname, > > > > > > struct iov_iter *optval_iter); > > > > > > > > > > > > That returns a non-negative value which can be placed into *optlen > > > > > > or negative value as error and *optlen will not be changed on error. > > > > > > optval_iter will get direction ITER_DEST, so it can only be written to. > > > > > > > > > > > > Implementations could then opt in for the new interface and > > > > > > allow do_sock_getsockopt() work also for the io_uring case, > > > > > > while all others would still get -EOPNOTSUPP. > > > > > > > > > > > > So what should be the way to go? > > > > > > > > > > Ok, I've added the infrastructure for getsockopt_iter, see below, > > > > > but the first part I wanted to convert was > > > > > tcp_ao_copy_mkts_to_user() and that also reads from userspace before > > > > > writing. > > > > > > > > > > So we could go with the optlen_t approach, or we need > > > > > logic for ITER_BOTH or pass two iov_iters one with ITER_SRC and one > > > > > with ITER_DEST... > > > > > > > > > > So who wants to decide? > > > > > > > > I just noticed that it's even possible in same cases > > > > to pass in a short buffer to optval, but have a longer value in optlen, > > > > hci_sock_getsockopt() with SOL_BLUETOOTH completely ignores optlen. > > > > > > > > This makes it really hard to believe that trying to use iov_iter for this > > > > is a good idea :-( > > > > > > That was my finding as well a while ago, when I was planning to get the > > > __user pointers converted to iov_iter. There are some weird ways of > > > using optlen and optval, which makes them non-trivial to covert to > > > iov_iter. > > > > Can we ignore all non-ip/tcp/udp cases for now? This should cover +90% > > of useful socket opts. See if there are any obvious problems with them > > and if not, try converting. The rest we can cover separately when/if > > needed. > > That's what I tried, but it fails with > tcp_getsockopt -> > do_tcp_getsockopt -> > tcp_ao_get_mkts -> > tcp_ao_copy_mkts_to_user -> > copy_struct_from_sockptr > tcp_ao_get_sock_info -> > copy_struct_from_sockptr > > That's not possible with a ITER_DEST iov_iter. > > metze Can we create two iterators over the same memory? One for ITER_SOURCE and another for ITER_DEST. And then make getsockopt_iter accept optval_in and optval_out. We can also use optval_out position (iov_offset) as optlen output value. Don't see why it won't work, but I agree that's gonna be a messy conversion so let's see if someone else has better suggestions.
Am 02.04.25 um 00:04 schrieb Stanislav Fomichev: > On 04/01, Stefan Metzmacher wrote: >> Am 01.04.25 um 17:45 schrieb Stanislav Fomichev: >>> On 04/01, Breno Leitao wrote: >>>> On Tue, Apr 01, 2025 at 03:48:58PM +0200, Stefan Metzmacher wrote: >>>>> Am 01.04.25 um 15:37 schrieb Stefan Metzmacher: >>>>>> Am 01.04.25 um 10:19 schrieb Stefan Metzmacher: >>>>>>> Am 31.03.25 um 23:04 schrieb Stanislav Fomichev: >>>>>>>> On 03/31, Stefan Metzmacher wrote: >>>>>>>>> The motivation for this is to remove the SOL_SOCKET limitation >>>>>>>>> from io_uring_cmd_getsockopt(). >>>>>>>>> >>>>>>>>> The reason for this limitation is that io_uring_cmd_getsockopt() >>>>>>>>> passes a kernel pointer as optlen to do_sock_getsockopt() >>>>>>>>> and can't reach the ops->getsockopt() path. >>>>>>>>> >>>>>>>>> The first idea would be to change the optval and optlen arguments >>>>>>>>> to the protocol specific hooks also to sockptr_t, as that >>>>>>>>> is already used for setsockopt() and also by do_sock_getsockopt() >>>>>>>>> sk_getsockopt() and BPF_CGROUP_RUN_PROG_GETSOCKOPT(). >>>>>>>>> >>>>>>>>> But as Linus don't like 'sockptr_t' I used a different approach. >>>>>>>>> >>>>>>>>> @Linus, would that optlen_t approach fit better for you? >>>>>>>> >>>>>>>> [..] >>>>>>>> >>>>>>>>> Instead of passing the optlen as user or kernel pointer, >>>>>>>>> we only ever pass a kernel pointer and do the >>>>>>>>> translation from/to userspace in do_sock_getsockopt(). >>>>>>>> >>>>>>>> At this point why not just fully embrace iov_iter? You have the size >>>>>>>> now + the user (or kernel) pointer. Might as well do >>>>>>>> s/sockptr_t/iov_iter/ conversion? >>>>>>> >>>>>>> I think that would only be possible if we introduce >>>>>>> proto[_ops].getsockopt_iter() and then convert the implementations >>>>>>> step by step. Doing it all in one go has a lot of potential to break >>>>>>> the uapi. I could try to convert things like socket, ip and tcp myself, but >>>>>>> the rest needs to be converted by the maintainer of the specific protocol, >>>>>>> as it needs to be tested. As there are crazy things happening in the existing >>>>>>> implementations, e.g. some getsockopt() implementations use optval as in and out >>>>>>> buffer. >>>>>>> >>>>>>> I first tried to convert both optval and optlen of getsockopt to sockptr_t, >>>>>>> and that showed that touching the optval part starts to get complex very soon, >>>>>>> see https://git.samba.org/?p=metze/linux/wip.git;a=commitdiff;h=141912166473bf8843ec6ace76dc9c6945adafd1 >>>>>>> (note it didn't converted everything, I gave up after hitting >>>>>>> sctp_getsockopt_peer_addrs and sctp_getsockopt_local_addrs. >>>>>>> sctp_getsockopt_context, sctp_getsockopt_maxseg, sctp_getsockopt_associnfo and maybe >>>>>>> more are the ones also doing both copy_from_user and copy_to_user on optval) >>>>>>> >>>>>>> I come also across one implementation that returned -ERANGE because *optlen was >>>>>>> too short and put the required length into *optlen, which means the returned >>>>>>> *optlen is larger than the optval buffer given from userspace. >>>>>>> >>>>>>> Because of all these strange things I tried to do a minimal change >>>>>>> in order to get rid of the io_uring limitation and only converted >>>>>>> optlen and leave optval as is. >>>>>>> >>>>>>> In order to have a patchset that has a low risk to cause regressions. >>>>>>> >>>>>>> But as alternative introducing a prototype like this: >>>>>>> >>>>>>> int (*getsockopt_iter)(struct socket *sock, int level, int optname, >>>>>>> struct iov_iter *optval_iter); >>>>>>> >>>>>>> That returns a non-negative value which can be placed into *optlen >>>>>>> or negative value as error and *optlen will not be changed on error. >>>>>>> optval_iter will get direction ITER_DEST, so it can only be written to. >>>>>>> >>>>>>> Implementations could then opt in for the new interface and >>>>>>> allow do_sock_getsockopt() work also for the io_uring case, >>>>>>> while all others would still get -EOPNOTSUPP. >>>>>>> >>>>>>> So what should be the way to go? >>>>>> >>>>>> Ok, I've added the infrastructure for getsockopt_iter, see below, >>>>>> but the first part I wanted to convert was >>>>>> tcp_ao_copy_mkts_to_user() and that also reads from userspace before >>>>>> writing. >>>>>> >>>>>> So we could go with the optlen_t approach, or we need >>>>>> logic for ITER_BOTH or pass two iov_iters one with ITER_SRC and one >>>>>> with ITER_DEST... >>>>>> >>>>>> So who wants to decide? >>>>> >>>>> I just noticed that it's even possible in same cases >>>>> to pass in a short buffer to optval, but have a longer value in optlen, >>>>> hci_sock_getsockopt() with SOL_BLUETOOTH completely ignores optlen. >>>>> >>>>> This makes it really hard to believe that trying to use iov_iter for this >>>>> is a good idea :-( >>>> >>>> That was my finding as well a while ago, when I was planning to get the >>>> __user pointers converted to iov_iter. There are some weird ways of >>>> using optlen and optval, which makes them non-trivial to covert to >>>> iov_iter. >>> >>> Can we ignore all non-ip/tcp/udp cases for now? This should cover +90% >>> of useful socket opts. See if there are any obvious problems with them >>> and if not, try converting. The rest we can cover separately when/if >>> needed. >> >> That's what I tried, but it fails with >> tcp_getsockopt -> >> do_tcp_getsockopt -> >> tcp_ao_get_mkts -> >> tcp_ao_copy_mkts_to_user -> >> copy_struct_from_sockptr >> tcp_ao_get_sock_info -> >> copy_struct_from_sockptr >> >> That's not possible with a ITER_DEST iov_iter. >> >> metze > > Can we create two iterators over the same memory? One for ITER_SOURCE and > another for ITER_DEST. And then make getsockopt_iter accept optval_in and > optval_out. We can also use optval_out position (iov_offset) as optlen output > value. Don't see why it won't work, but I agree that's gonna be a messy > conversion so let's see if someone else has better suggestions. Yes, that might work, but it would be good to get some feedback if this would be the way to go: int (*getsockopt_iter)(struct socket *sock, int level, int optname, struct iov_iter *optval_in, struct iov_iter *optval_out); And *optlen = optval_out->iov_offset; Any objection or better ideas? Linus would that be what you had in mind? Thanks! metze
" On Mon, 31 Mar 2025 at 13:11, Stefan Metzmacher <metze@samba.org> wrote: > > But as Linus don't like 'sockptr_t' I used a different approach. So the sockptr_t thing has already happened. I hate it, and I think it's ugly as hell, but it is what it is. I think it's a complete hack and having that "kernel or user" pointer flag is disgusting. Making things worse, the naming is disgusting too, talking about some random "socket pointer", when it has absolutely nothing to do with socket, and isn't even a pointer. It's something else. It's literally called "socket" not because it has anything to do with sockets, but because it's a socket-specific hack that isn't acceptable anywhere else in the kernel. So that "socket" part of the name is literally shorthand for "only sockets are disgusting enough to use this, and nobody else should ever touch this crap". At least so far that part has mostly worked, even if there's some "sockptr_t" use in the crypto code. I didn't look closer, because I didn't want to lose my lunch. I don't understand why the networking code uses that thing. If you have a "fat pointer", you should damn well make it have the size of the area too, and do things *right*. Instead of doing what sockptr_t does, which is a complete hack to just pass a kernel/user flag, and then passes the length *separately* because the socket code couldn't be arsed to do the right thing. So I do still think "sockptr_t" should die. As Stanislav says, if you actually want that "user or kernel" thing, just use an "iov_iter". No, an "iov_iter" isn't exactly a pretty thing either, but at least it's the standard way to say "this pointer can have multiple different kinds of sources". And it keeps the size of the thing it points to around, so it's at least a fat pointer with proper ranges, even if it isn't exactly "type safe" (yes, it's type safe in the sense that it stays as a "iov_iter", but it's still basically a "random pointer"). > @Linus, would that optlen_t approach fit better for you? The optlen_t thing is slightly better mainly because it's more type-safe. At least it's not a "random misnamed user-or-kernel-pointer" thing where the name is about how nothing else is so broken as to use it. So it's better because it's more limited, and it's better in that at least it has a type-safe pointer rather than a "void *" with no size or type associated with it. That said, I don't think it's exactly great. It's just another case of "networking can't just do it right, and uses a random hack with special flag values". So I do think that it would be better to actually get rid of "sockptr_t optval, unsigned int optlen" ENTIRELY, and replace that with iov_iter and just make networking bite the bullet and do the RightThing(tm). In fact, to make it *really* typesafe, it might be a good idea to wrap the iov_iter in another struct, something like typedef struct sockopt { struct iov_iter iter; } sockopt_t; and make the networking functions make the typing very clear, and end up with an interface something like int do_tcp_setsockopt(struct sock *sk, int level, int optname, sockopt_t *val); where that "sockopt_t *val" replaces not just the "sockptr_t optval", but also the "unsigned int optlen" thing. And no, I didn't look at how much churn that would be. Probably a lot. Maybe more than people are willing to do - even if I think some of it could be automated with coccinelle or whatever. Linus
On Wed, 2 Apr 2025 00:53:58 +0200 Stefan Metzmacher <metze@samba.org> wrote: > Am 02.04.25 um 00:04 schrieb Stanislav Fomichev: > > On 04/01, Stefan Metzmacher wrote: > >> Am 01.04.25 um 17:45 schrieb Stanislav Fomichev: > >>> On 04/01, Breno Leitao wrote: > >>>> On Tue, Apr 01, 2025 at 03:48:58PM +0200, Stefan Metzmacher wrote: > >>>>> Am 01.04.25 um 15:37 schrieb Stefan Metzmacher: > >>>>>> Am 01.04.25 um 10:19 schrieb Stefan Metzmacher: > >>>>>>> Am 31.03.25 um 23:04 schrieb Stanislav Fomichev: > >>>>>>>> On 03/31, Stefan Metzmacher wrote: > >>>>>>>>> The motivation for this is to remove the SOL_SOCKET limitation > >>>>>>>>> from io_uring_cmd_getsockopt(). > >>>>>>>>> > >>>>>>>>> The reason for this limitation is that io_uring_cmd_getsockopt() > >>>>>>>>> passes a kernel pointer as optlen to do_sock_getsockopt() > >>>>>>>>> and can't reach the ops->getsockopt() path. > >>>>>>>>> > >>>>>>>>> The first idea would be to change the optval and optlen arguments > >>>>>>>>> to the protocol specific hooks also to sockptr_t, as that > >>>>>>>>> is already used for setsockopt() and also by do_sock_getsockopt() > >>>>>>>>> sk_getsockopt() and BPF_CGROUP_RUN_PROG_GETSOCKOPT(). > >>>>>>>>> > >>>>>>>>> But as Linus don't like 'sockptr_t' I used a different approach. > >>>>>>>>> > >>>>>>>>> @Linus, would that optlen_t approach fit better for you? > >>>>>>>> > >>>>>>>> [..] > >>>>>>>> > >>>>>>>>> Instead of passing the optlen as user or kernel pointer, > >>>>>>>>> we only ever pass a kernel pointer and do the > >>>>>>>>> translation from/to userspace in do_sock_getsockopt(). > >>>>>>>> > >>>>>>>> At this point why not just fully embrace iov_iter? You have the size > >>>>>>>> now + the user (or kernel) pointer. Might as well do > >>>>>>>> s/sockptr_t/iov_iter/ conversion? > >>>>>>> > >>>>>>> I think that would only be possible if we introduce > >>>>>>> proto[_ops].getsockopt_iter() and then convert the implementations > >>>>>>> step by step. Doing it all in one go has a lot of potential to break > >>>>>>> the uapi. I could try to convert things like socket, ip and tcp myself, but > >>>>>>> the rest needs to be converted by the maintainer of the specific protocol, > >>>>>>> as it needs to be tested. As there are crazy things happening in the existing > >>>>>>> implementations, e.g. some getsockopt() implementations use optval as in and out > >>>>>>> buffer. > >>>>>>> > >>>>>>> I first tried to convert both optval and optlen of getsockopt to sockptr_t, > >>>>>>> and that showed that touching the optval part starts to get complex very soon, > >>>>>>> see https://git.samba.org/?p=metze/linux/wip.git;a=commitdiff;h=141912166473bf8843ec6ace76dc9c6945adafd1 > >>>>>>> (note it didn't converted everything, I gave up after hitting > >>>>>>> sctp_getsockopt_peer_addrs and sctp_getsockopt_local_addrs. > >>>>>>> sctp_getsockopt_context, sctp_getsockopt_maxseg, sctp_getsockopt_associnfo and maybe > >>>>>>> more are the ones also doing both copy_from_user and copy_to_user on optval) > >>>>>>> > >>>>>>> I come also across one implementation that returned -ERANGE because *optlen was > >>>>>>> too short and put the required length into *optlen, which means the returned > >>>>>>> *optlen is larger than the optval buffer given from userspace. > >>>>>>> > >>>>>>> Because of all these strange things I tried to do a minimal change > >>>>>>> in order to get rid of the io_uring limitation and only converted > >>>>>>> optlen and leave optval as is. > >>>>>>> > >>>>>>> In order to have a patchset that has a low risk to cause regressions. > >>>>>>> > >>>>>>> But as alternative introducing a prototype like this: > >>>>>>> > >>>>>>> int (*getsockopt_iter)(struct socket *sock, int level, int optname, > >>>>>>> struct iov_iter *optval_iter); > >>>>>>> > >>>>>>> That returns a non-negative value which can be placed into *optlen > >>>>>>> or negative value as error and *optlen will not be changed on error. > >>>>>>> optval_iter will get direction ITER_DEST, so it can only be written to. > >>>>>>> > >>>>>>> Implementations could then opt in for the new interface and > >>>>>>> allow do_sock_getsockopt() work also for the io_uring case, > >>>>>>> while all others would still get -EOPNOTSUPP. > >>>>>>> > >>>>>>> So what should be the way to go? > >>>>>> > >>>>>> Ok, I've added the infrastructure for getsockopt_iter, see below, > >>>>>> but the first part I wanted to convert was > >>>>>> tcp_ao_copy_mkts_to_user() and that also reads from userspace before > >>>>>> writing. > >>>>>> > >>>>>> So we could go with the optlen_t approach, or we need > >>>>>> logic for ITER_BOTH or pass two iov_iters one with ITER_SRC and one > >>>>>> with ITER_DEST... > >>>>>> > >>>>>> So who wants to decide? > >>>>> > >>>>> I just noticed that it's even possible in same cases > >>>>> to pass in a short buffer to optval, but have a longer value in optlen, > >>>>> hci_sock_getsockopt() with SOL_BLUETOOTH completely ignores optlen. > >>>>> > >>>>> This makes it really hard to believe that trying to use iov_iter for this > >>>>> is a good idea :-( > >>>> > >>>> That was my finding as well a while ago, when I was planning to get the > >>>> __user pointers converted to iov_iter. There are some weird ways of > >>>> using optlen and optval, which makes them non-trivial to covert to > >>>> iov_iter. > >>> > >>> Can we ignore all non-ip/tcp/udp cases for now? This should cover +90% > >>> of useful socket opts. See if there are any obvious problems with them > >>> and if not, try converting. The rest we can cover separately when/if > >>> needed. > >> > >> That's what I tried, but it fails with > >> tcp_getsockopt -> > >> do_tcp_getsockopt -> > >> tcp_ao_get_mkts -> > >> tcp_ao_copy_mkts_to_user -> > >> copy_struct_from_sockptr > >> tcp_ao_get_sock_info -> > >> copy_struct_from_sockptr > >> > >> That's not possible with a ITER_DEST iov_iter. > >> > >> metze > > > > Can we create two iterators over the same memory? One for ITER_SOURCE and > > another for ITER_DEST. And then make getsockopt_iter accept optval_in and > > optval_out. We can also use optval_out position (iov_offset) as optlen output > > value. Don't see why it won't work, but I agree that's gonna be a messy > > conversion so let's see if someone else has better suggestions. > > Yes, that might work, but it would be good to get some feedback > if this would be the way to go: > > int (*getsockopt_iter)(struct socket *sock, > int level, int optname, > struct iov_iter *optval_in, > struct iov_iter *optval_out); > > And *optlen = optval_out->iov_offset; > > Any objection or better ideas? Linus would that be what you had in mind? I'd worry about performance - yes I know 'iter' are used elsewhere but... Also look at the SCTP code. How do you handle code that wants to return an updated length (often longer than the one provided) and an error code (eg ERRSIZE or similar). There is also a very strange use (I think it is a sockopt rather than an ioctl) where the buffer length the application provides is only that of the header. The actual buffer length is contained in the header. The return length is the amount written into the full buffer. David
On Tue, 1 Apr 2025 17:40:19 -0700 Linus Torvalds <torvalds@linux-foundation.org> wrote: > " > > On Mon, 31 Mar 2025 at 13:11, Stefan Metzmacher <metze@samba.org> wrote: > > > > But as Linus don't like 'sockptr_t' I used a different approach. > > So the sockptr_t thing has already happened. I hate it, and I think > it's ugly as hell, but it is what it is. > > I think it's a complete hack and having that "kernel or user" pointer > flag is disgusting. I have proposed a patch which replaced it with a structure. That showed up some really hacky code in IIRC io_uring. Using sockptr_t for the buffer was one thing, the generic code can't copy the buffer to/from user because code lies about the length. But using for the length is just brain-dead. That is fixed size and can be copied from/to user by the wrapper. The code bloat reduction will be significant. David
On 04/02, David Laight wrote: > On Wed, 2 Apr 2025 00:53:58 +0200 > Stefan Metzmacher <metze@samba.org> wrote: > > > Am 02.04.25 um 00:04 schrieb Stanislav Fomichev: > > > On 04/01, Stefan Metzmacher wrote: > > >> Am 01.04.25 um 17:45 schrieb Stanislav Fomichev: > > >>> On 04/01, Breno Leitao wrote: > > >>>> On Tue, Apr 01, 2025 at 03:48:58PM +0200, Stefan Metzmacher wrote: > > >>>>> Am 01.04.25 um 15:37 schrieb Stefan Metzmacher: > > >>>>>> Am 01.04.25 um 10:19 schrieb Stefan Metzmacher: > > >>>>>>> Am 31.03.25 um 23:04 schrieb Stanislav Fomichev: > > >>>>>>>> On 03/31, Stefan Metzmacher wrote: > > >>>>>>>>> The motivation for this is to remove the SOL_SOCKET limitation > > >>>>>>>>> from io_uring_cmd_getsockopt(). > > >>>>>>>>> > > >>>>>>>>> The reason for this limitation is that io_uring_cmd_getsockopt() > > >>>>>>>>> passes a kernel pointer as optlen to do_sock_getsockopt() > > >>>>>>>>> and can't reach the ops->getsockopt() path. > > >>>>>>>>> > > >>>>>>>>> The first idea would be to change the optval and optlen arguments > > >>>>>>>>> to the protocol specific hooks also to sockptr_t, as that > > >>>>>>>>> is already used for setsockopt() and also by do_sock_getsockopt() > > >>>>>>>>> sk_getsockopt() and BPF_CGROUP_RUN_PROG_GETSOCKOPT(). > > >>>>>>>>> > > >>>>>>>>> But as Linus don't like 'sockptr_t' I used a different approach. > > >>>>>>>>> > > >>>>>>>>> @Linus, would that optlen_t approach fit better for you? > > >>>>>>>> > > >>>>>>>> [..] > > >>>>>>>> > > >>>>>>>>> Instead of passing the optlen as user or kernel pointer, > > >>>>>>>>> we only ever pass a kernel pointer and do the > > >>>>>>>>> translation from/to userspace in do_sock_getsockopt(). > > >>>>>>>> > > >>>>>>>> At this point why not just fully embrace iov_iter? You have the size > > >>>>>>>> now + the user (or kernel) pointer. Might as well do > > >>>>>>>> s/sockptr_t/iov_iter/ conversion? > > >>>>>>> > > >>>>>>> I think that would only be possible if we introduce > > >>>>>>> proto[_ops].getsockopt_iter() and then convert the implementations > > >>>>>>> step by step. Doing it all in one go has a lot of potential to break > > >>>>>>> the uapi. I could try to convert things like socket, ip and tcp myself, but > > >>>>>>> the rest needs to be converted by the maintainer of the specific protocol, > > >>>>>>> as it needs to be tested. As there are crazy things happening in the existing > > >>>>>>> implementations, e.g. some getsockopt() implementations use optval as in and out > > >>>>>>> buffer. > > >>>>>>> > > >>>>>>> I first tried to convert both optval and optlen of getsockopt to sockptr_t, > > >>>>>>> and that showed that touching the optval part starts to get complex very soon, > > >>>>>>> see https://git.samba.org/?p=metze/linux/wip.git;a=commitdiff;h=141912166473bf8843ec6ace76dc9c6945adafd1 > > >>>>>>> (note it didn't converted everything, I gave up after hitting > > >>>>>>> sctp_getsockopt_peer_addrs and sctp_getsockopt_local_addrs. > > >>>>>>> sctp_getsockopt_context, sctp_getsockopt_maxseg, sctp_getsockopt_associnfo and maybe > > >>>>>>> more are the ones also doing both copy_from_user and copy_to_user on optval) > > >>>>>>> > > >>>>>>> I come also across one implementation that returned -ERANGE because *optlen was > > >>>>>>> too short and put the required length into *optlen, which means the returned > > >>>>>>> *optlen is larger than the optval buffer given from userspace. > > >>>>>>> > > >>>>>>> Because of all these strange things I tried to do a minimal change > > >>>>>>> in order to get rid of the io_uring limitation and only converted > > >>>>>>> optlen and leave optval as is. > > >>>>>>> > > >>>>>>> In order to have a patchset that has a low risk to cause regressions. > > >>>>>>> > > >>>>>>> But as alternative introducing a prototype like this: > > >>>>>>> > > >>>>>>> int (*getsockopt_iter)(struct socket *sock, int level, int optname, > > >>>>>>> struct iov_iter *optval_iter); > > >>>>>>> > > >>>>>>> That returns a non-negative value which can be placed into *optlen > > >>>>>>> or negative value as error and *optlen will not be changed on error. > > >>>>>>> optval_iter will get direction ITER_DEST, so it can only be written to. > > >>>>>>> > > >>>>>>> Implementations could then opt in for the new interface and > > >>>>>>> allow do_sock_getsockopt() work also for the io_uring case, > > >>>>>>> while all others would still get -EOPNOTSUPP. > > >>>>>>> > > >>>>>>> So what should be the way to go? > > >>>>>> > > >>>>>> Ok, I've added the infrastructure for getsockopt_iter, see below, > > >>>>>> but the first part I wanted to convert was > > >>>>>> tcp_ao_copy_mkts_to_user() and that also reads from userspace before > > >>>>>> writing. > > >>>>>> > > >>>>>> So we could go with the optlen_t approach, or we need > > >>>>>> logic for ITER_BOTH or pass two iov_iters one with ITER_SRC and one > > >>>>>> with ITER_DEST... > > >>>>>> > > >>>>>> So who wants to decide? > > >>>>> > > >>>>> I just noticed that it's even possible in same cases > > >>>>> to pass in a short buffer to optval, but have a longer value in optlen, > > >>>>> hci_sock_getsockopt() with SOL_BLUETOOTH completely ignores optlen. > > >>>>> > > >>>>> This makes it really hard to believe that trying to use iov_iter for this > > >>>>> is a good idea :-( > > >>>> > > >>>> That was my finding as well a while ago, when I was planning to get the > > >>>> __user pointers converted to iov_iter. There are some weird ways of > > >>>> using optlen and optval, which makes them non-trivial to covert to > > >>>> iov_iter. > > >>> > > >>> Can we ignore all non-ip/tcp/udp cases for now? This should cover +90% > > >>> of useful socket opts. See if there are any obvious problems with them > > >>> and if not, try converting. The rest we can cover separately when/if > > >>> needed. > > >> > > >> That's what I tried, but it fails with > > >> tcp_getsockopt -> > > >> do_tcp_getsockopt -> > > >> tcp_ao_get_mkts -> > > >> tcp_ao_copy_mkts_to_user -> > > >> copy_struct_from_sockptr > > >> tcp_ao_get_sock_info -> > > >> copy_struct_from_sockptr > > >> > > >> That's not possible with a ITER_DEST iov_iter. > > >> > > >> metze > > > > > > Can we create two iterators over the same memory? One for ITER_SOURCE and > > > another for ITER_DEST. And then make getsockopt_iter accept optval_in and > > > optval_out. We can also use optval_out position (iov_offset) as optlen output > > > value. Don't see why it won't work, but I agree that's gonna be a messy > > > conversion so let's see if someone else has better suggestions. > > > > Yes, that might work, but it would be good to get some feedback > > if this would be the way to go: > > > > int (*getsockopt_iter)(struct socket *sock, > > int level, int optname, > > struct iov_iter *optval_in, > > struct iov_iter *optval_out); > > > > And *optlen = optval_out->iov_offset; > > > > Any objection or better ideas? Linus would that be what you had in mind? > > I'd worry about performance - yes I know 'iter' are used elsewhere but... > Also look at the SCTP code. Performance usually does not matter for set/getsockopts, there are a few exceptions that I know (TCP_ZEROCOPY_RECEIVE) and maybe recent devmem sockopts; we can special-case these if needed, or keep sockptr_t, idk. I'm skeptical we can convert everything though, that's why the suggestion to start with sk/ip/tcp/udp. > How do you handle code that wants to return an updated length (often longer > than the one provided) and an error code (eg ERRSIZE or similar). > > There is also a very strange use (I think it is a sockopt rather than an ioctl) > where the buffer length the application provides is only that of the header. > The actual buffer length is contained in the header. > The return length is the amount written into the full buffer. Let's discuss these special cases as they come up? Worst case these places can always re-init iov_iter with a comment on why it is ok. But I do agree in general that there are a few places that do wild stuff.
On Wed, 2 Apr 2025 07:19:46 -0700 Stanislav Fomichev <stfomichev@gmail.com> wrote: > On 04/02, David Laight wrote: > > On Wed, 2 Apr 2025 00:53:58 +0200 > > Stefan Metzmacher <metze@samba.org> wrote: > > > > > Am 02.04.25 um 00:04 schrieb Stanislav Fomichev: > > > > On 04/01, Stefan Metzmacher wrote: > > > >> Am 01.04.25 um 17:45 schrieb Stanislav Fomichev: > > > >>> On 04/01, Breno Leitao wrote: > > > >>>> On Tue, Apr 01, 2025 at 03:48:58PM +0200, Stefan Metzmacher wrote: > > > >>>>> Am 01.04.25 um 15:37 schrieb Stefan Metzmacher: > > > >>>>>> Am 01.04.25 um 10:19 schrieb Stefan Metzmacher: > > > >>>>>>> Am 31.03.25 um 23:04 schrieb Stanislav Fomichev: > > > >>>>>>>> On 03/31, Stefan Metzmacher wrote: > > > >>>>>>>>> The motivation for this is to remove the SOL_SOCKET limitation > > > >>>>>>>>> from io_uring_cmd_getsockopt(). > > > >>>>>>>>> > > > >>>>>>>>> The reason for this limitation is that io_uring_cmd_getsockopt() > > > >>>>>>>>> passes a kernel pointer as optlen to do_sock_getsockopt() > > > >>>>>>>>> and can't reach the ops->getsockopt() path. > > > >>>>>>>>> > > > >>>>>>>>> The first idea would be to change the optval and optlen arguments > > > >>>>>>>>> to the protocol specific hooks also to sockptr_t, as that > > > >>>>>>>>> is already used for setsockopt() and also by do_sock_getsockopt() > > > >>>>>>>>> sk_getsockopt() and BPF_CGROUP_RUN_PROG_GETSOCKOPT(). > > > >>>>>>>>> > > > >>>>>>>>> But as Linus don't like 'sockptr_t' I used a different approach. > > > >>>>>>>>> > > > >>>>>>>>> @Linus, would that optlen_t approach fit better for you? > > > >>>>>>>> > > > >>>>>>>> [..] > > > >>>>>>>> > > > >>>>>>>>> Instead of passing the optlen as user or kernel pointer, > > > >>>>>>>>> we only ever pass a kernel pointer and do the > > > >>>>>>>>> translation from/to userspace in do_sock_getsockopt(). > > > >>>>>>>> > > > >>>>>>>> At this point why not just fully embrace iov_iter? You have the size > > > >>>>>>>> now + the user (or kernel) pointer. Might as well do > > > >>>>>>>> s/sockptr_t/iov_iter/ conversion? > > > >>>>>>> > > > >>>>>>> I think that would only be possible if we introduce > > > >>>>>>> proto[_ops].getsockopt_iter() and then convert the implementations > > > >>>>>>> step by step. Doing it all in one go has a lot of potential to break > > > >>>>>>> the uapi. I could try to convert things like socket, ip and tcp myself, but > > > >>>>>>> the rest needs to be converted by the maintainer of the specific protocol, > > > >>>>>>> as it needs to be tested. As there are crazy things happening in the existing > > > >>>>>>> implementations, e.g. some getsockopt() implementations use optval as in and out > > > >>>>>>> buffer. > > > >>>>>>> > > > >>>>>>> I first tried to convert both optval and optlen of getsockopt to sockptr_t, > > > >>>>>>> and that showed that touching the optval part starts to get complex very soon, > > > >>>>>>> see https://git.samba.org/?p=metze/linux/wip.git;a=commitdiff;h=141912166473bf8843ec6ace76dc9c6945adafd1 > > > >>>>>>> (note it didn't converted everything, I gave up after hitting > > > >>>>>>> sctp_getsockopt_peer_addrs and sctp_getsockopt_local_addrs. > > > >>>>>>> sctp_getsockopt_context, sctp_getsockopt_maxseg, sctp_getsockopt_associnfo and maybe > > > >>>>>>> more are the ones also doing both copy_from_user and copy_to_user on optval) > > > >>>>>>> > > > >>>>>>> I come also across one implementation that returned -ERANGE because *optlen was > > > >>>>>>> too short and put the required length into *optlen, which means the returned > > > >>>>>>> *optlen is larger than the optval buffer given from userspace. > > > >>>>>>> > > > >>>>>>> Because of all these strange things I tried to do a minimal change > > > >>>>>>> in order to get rid of the io_uring limitation and only converted > > > >>>>>>> optlen and leave optval as is. > > > >>>>>>> > > > >>>>>>> In order to have a patchset that has a low risk to cause regressions. > > > >>>>>>> > > > >>>>>>> But as alternative introducing a prototype like this: > > > >>>>>>> > > > >>>>>>> int (*getsockopt_iter)(struct socket *sock, int level, int optname, > > > >>>>>>> struct iov_iter *optval_iter); > > > >>>>>>> > > > >>>>>>> That returns a non-negative value which can be placed into *optlen > > > >>>>>>> or negative value as error and *optlen will not be changed on error. > > > >>>>>>> optval_iter will get direction ITER_DEST, so it can only be written to. > > > >>>>>>> > > > >>>>>>> Implementations could then opt in for the new interface and > > > >>>>>>> allow do_sock_getsockopt() work also for the io_uring case, > > > >>>>>>> while all others would still get -EOPNOTSUPP. > > > >>>>>>> > > > >>>>>>> So what should be the way to go? > > > >>>>>> > > > >>>>>> Ok, I've added the infrastructure for getsockopt_iter, see below, > > > >>>>>> but the first part I wanted to convert was > > > >>>>>> tcp_ao_copy_mkts_to_user() and that also reads from userspace before > > > >>>>>> writing. > > > >>>>>> > > > >>>>>> So we could go with the optlen_t approach, or we need > > > >>>>>> logic for ITER_BOTH or pass two iov_iters one with ITER_SRC and one > > > >>>>>> with ITER_DEST... > > > >>>>>> > > > >>>>>> So who wants to decide? > > > >>>>> > > > >>>>> I just noticed that it's even possible in same cases > > > >>>>> to pass in a short buffer to optval, but have a longer value in optlen, > > > >>>>> hci_sock_getsockopt() with SOL_BLUETOOTH completely ignores optlen. > > > >>>>> > > > >>>>> This makes it really hard to believe that trying to use iov_iter for this > > > >>>>> is a good idea :-( > > > >>>> > > > >>>> That was my finding as well a while ago, when I was planning to get the > > > >>>> __user pointers converted to iov_iter. There are some weird ways of > > > >>>> using optlen and optval, which makes them non-trivial to covert to > > > >>>> iov_iter. > > > >>> > > > >>> Can we ignore all non-ip/tcp/udp cases for now? This should cover +90% > > > >>> of useful socket opts. See if there are any obvious problems with them > > > >>> and if not, try converting. The rest we can cover separately when/if > > > >>> needed. > > > >> > > > >> That's what I tried, but it fails with > > > >> tcp_getsockopt -> > > > >> do_tcp_getsockopt -> > > > >> tcp_ao_get_mkts -> > > > >> tcp_ao_copy_mkts_to_user -> > > > >> copy_struct_from_sockptr > > > >> tcp_ao_get_sock_info -> > > > >> copy_struct_from_sockptr > > > >> > > > >> That's not possible with a ITER_DEST iov_iter. > > > >> > > > >> metze > > > > > > > > Can we create two iterators over the same memory? One for ITER_SOURCE and > > > > another for ITER_DEST. And then make getsockopt_iter accept optval_in and > > > > optval_out. We can also use optval_out position (iov_offset) as optlen output > > > > value. Don't see why it won't work, but I agree that's gonna be a messy > > > > conversion so let's see if someone else has better suggestions. > > > > > > Yes, that might work, but it would be good to get some feedback > > > if this would be the way to go: > > > > > > int (*getsockopt_iter)(struct socket *sock, > > > int level, int optname, > > > struct iov_iter *optval_in, > > > struct iov_iter *optval_out); > > > > > > And *optlen = optval_out->iov_offset; > > > > > > Any objection or better ideas? Linus would that be what you had in mind? > > > > I'd worry about performance - yes I know 'iter' are used elsewhere but... > > Also look at the SCTP code. > > Performance usually does not matter for set/getsockopts, there > are a few exceptions that I know (TCP_ZEROCOPY_RECEIVE) That might be the one that is really horrid and completely abuses the 'length' parameter. > and maybe recent > devmem sockopts; we can special-case these if needed, or keep sockptr_t, > idk. I'm skeptical we can convert everything though, that's why the > suggestion to start with sk/ip/tcp/udp. > > > How do you handle code that wants to return an updated length (often longer > > than the one provided) and an error code (eg ERRSIZE or similar). > > > > There is also a very strange use (I think it is a sockopt rather than an ioctl) > > where the buffer length the application provides is only that of the header. > > The actual buffer length is contained in the header. > > The return length is the amount written into the full buffer. > > Let's discuss these special cases as they come up? Worst case these > places can always re-init iov_iter with a comment on why it is ok. > But I do agree in general that there are a few places that do wild > stuff. The problem is that the generic code has to deal with all the 'wild stuff'. It is also common to do non-sequential accesses - so iov_iter doesn't match at all. There also isn't a requirement for scatter-gather. For 'normal' getsockopt (and setsockopt) with short lengths it actually makes sense for the syscall wrapper to do the user copies. But it would need to pass the user ptr+len as well as the kernel ptr+len to give the required flexibilty. Then you have to work out whether the final copy to user is needed or not. (not that hard, but it all adds complication). David
On Wed, 2 Apr 2025 at 13:46, David Laight <david.laight.linux@gmail.com> wrote: > > The problem is that the generic code has to deal with all the 'wild stuff'. > It is also common to do non-sequential accesses - so iov_iter doesn't match > at all. > There also isn't a requirement for scatter-gather. Note that the generic code has special cases for the simple stuff, which is all that the sockopt code would need. Now, that's _particularly_ true for the "single user address range" thing, where there's a special ITER_UBUF thing. We don't actually have a "single kernel range" version of that, but ITER_KVEC is simple to use, and the sockopt code could say "I only ever look at the first buffer". It's ok to just not handle all the cases, and you don't *have* to use the generic "copy_from_iter()" routines if you don't want to. In fact, I would expect that something like sockopt generally wouldn't want to use the normal iter copying routines, since those are basically all geared towards "copy and update the iter". Linus
On 04/02, David Laight wrote: > On Wed, 2 Apr 2025 07:19:46 -0700 > Stanislav Fomichev <stfomichev@gmail.com> wrote: > > > On 04/02, David Laight wrote: > > > On Wed, 2 Apr 2025 00:53:58 +0200 > > > Stefan Metzmacher <metze@samba.org> wrote: > > > > > > > Am 02.04.25 um 00:04 schrieb Stanislav Fomichev: > > > > > On 04/01, Stefan Metzmacher wrote: > > > > >> Am 01.04.25 um 17:45 schrieb Stanislav Fomichev: > > > > >>> On 04/01, Breno Leitao wrote: > > > > >>>> On Tue, Apr 01, 2025 at 03:48:58PM +0200, Stefan Metzmacher wrote: > > > > >>>>> Am 01.04.25 um 15:37 schrieb Stefan Metzmacher: > > > > >>>>>> Am 01.04.25 um 10:19 schrieb Stefan Metzmacher: > > > > >>>>>>> Am 31.03.25 um 23:04 schrieb Stanislav Fomichev: > > > > >>>>>>>> On 03/31, Stefan Metzmacher wrote: > > > > >>>>>>>>> The motivation for this is to remove the SOL_SOCKET limitation > > > > >>>>>>>>> from io_uring_cmd_getsockopt(). > > > > >>>>>>>>> > > > > >>>>>>>>> The reason for this limitation is that io_uring_cmd_getsockopt() > > > > >>>>>>>>> passes a kernel pointer as optlen to do_sock_getsockopt() > > > > >>>>>>>>> and can't reach the ops->getsockopt() path. > > > > >>>>>>>>> > > > > >>>>>>>>> The first idea would be to change the optval and optlen arguments > > > > >>>>>>>>> to the protocol specific hooks also to sockptr_t, as that > > > > >>>>>>>>> is already used for setsockopt() and also by do_sock_getsockopt() > > > > >>>>>>>>> sk_getsockopt() and BPF_CGROUP_RUN_PROG_GETSOCKOPT(). > > > > >>>>>>>>> > > > > >>>>>>>>> But as Linus don't like 'sockptr_t' I used a different approach. > > > > >>>>>>>>> > > > > >>>>>>>>> @Linus, would that optlen_t approach fit better for you? > > > > >>>>>>>> > > > > >>>>>>>> [..] > > > > >>>>>>>> > > > > >>>>>>>>> Instead of passing the optlen as user or kernel pointer, > > > > >>>>>>>>> we only ever pass a kernel pointer and do the > > > > >>>>>>>>> translation from/to userspace in do_sock_getsockopt(). > > > > >>>>>>>> > > > > >>>>>>>> At this point why not just fully embrace iov_iter? You have the size > > > > >>>>>>>> now + the user (or kernel) pointer. Might as well do > > > > >>>>>>>> s/sockptr_t/iov_iter/ conversion? > > > > >>>>>>> > > > > >>>>>>> I think that would only be possible if we introduce > > > > >>>>>>> proto[_ops].getsockopt_iter() and then convert the implementations > > > > >>>>>>> step by step. Doing it all in one go has a lot of potential to break > > > > >>>>>>> the uapi. I could try to convert things like socket, ip and tcp myself, but > > > > >>>>>>> the rest needs to be converted by the maintainer of the specific protocol, > > > > >>>>>>> as it needs to be tested. As there are crazy things happening in the existing > > > > >>>>>>> implementations, e.g. some getsockopt() implementations use optval as in and out > > > > >>>>>>> buffer. > > > > >>>>>>> > > > > >>>>>>> I first tried to convert both optval and optlen of getsockopt to sockptr_t, > > > > >>>>>>> and that showed that touching the optval part starts to get complex very soon, > > > > >>>>>>> see https://git.samba.org/?p=metze/linux/wip.git;a=commitdiff;h=141912166473bf8843ec6ace76dc9c6945adafd1 > > > > >>>>>>> (note it didn't converted everything, I gave up after hitting > > > > >>>>>>> sctp_getsockopt_peer_addrs and sctp_getsockopt_local_addrs. > > > > >>>>>>> sctp_getsockopt_context, sctp_getsockopt_maxseg, sctp_getsockopt_associnfo and maybe > > > > >>>>>>> more are the ones also doing both copy_from_user and copy_to_user on optval) > > > > >>>>>>> > > > > >>>>>>> I come also across one implementation that returned -ERANGE because *optlen was > > > > >>>>>>> too short and put the required length into *optlen, which means the returned > > > > >>>>>>> *optlen is larger than the optval buffer given from userspace. > > > > >>>>>>> > > > > >>>>>>> Because of all these strange things I tried to do a minimal change > > > > >>>>>>> in order to get rid of the io_uring limitation and only converted > > > > >>>>>>> optlen and leave optval as is. > > > > >>>>>>> > > > > >>>>>>> In order to have a patchset that has a low risk to cause regressions. > > > > >>>>>>> > > > > >>>>>>> But as alternative introducing a prototype like this: > > > > >>>>>>> > > > > >>>>>>> int (*getsockopt_iter)(struct socket *sock, int level, int optname, > > > > >>>>>>> struct iov_iter *optval_iter); > > > > >>>>>>> > > > > >>>>>>> That returns a non-negative value which can be placed into *optlen > > > > >>>>>>> or negative value as error and *optlen will not be changed on error. > > > > >>>>>>> optval_iter will get direction ITER_DEST, so it can only be written to. > > > > >>>>>>> > > > > >>>>>>> Implementations could then opt in for the new interface and > > > > >>>>>>> allow do_sock_getsockopt() work also for the io_uring case, > > > > >>>>>>> while all others would still get -EOPNOTSUPP. > > > > >>>>>>> > > > > >>>>>>> So what should be the way to go? > > > > >>>>>> > > > > >>>>>> Ok, I've added the infrastructure for getsockopt_iter, see below, > > > > >>>>>> but the first part I wanted to convert was > > > > >>>>>> tcp_ao_copy_mkts_to_user() and that also reads from userspace before > > > > >>>>>> writing. > > > > >>>>>> > > > > >>>>>> So we could go with the optlen_t approach, or we need > > > > >>>>>> logic for ITER_BOTH or pass two iov_iters one with ITER_SRC and one > > > > >>>>>> with ITER_DEST... > > > > >>>>>> > > > > >>>>>> So who wants to decide? > > > > >>>>> > > > > >>>>> I just noticed that it's even possible in same cases > > > > >>>>> to pass in a short buffer to optval, but have a longer value in optlen, > > > > >>>>> hci_sock_getsockopt() with SOL_BLUETOOTH completely ignores optlen. > > > > >>>>> > > > > >>>>> This makes it really hard to believe that trying to use iov_iter for this > > > > >>>>> is a good idea :-( > > > > >>>> > > > > >>>> That was my finding as well a while ago, when I was planning to get the > > > > >>>> __user pointers converted to iov_iter. There are some weird ways of > > > > >>>> using optlen and optval, which makes them non-trivial to covert to > > > > >>>> iov_iter. > > > > >>> > > > > >>> Can we ignore all non-ip/tcp/udp cases for now? This should cover +90% > > > > >>> of useful socket opts. See if there are any obvious problems with them > > > > >>> and if not, try converting. The rest we can cover separately when/if > > > > >>> needed. > > > > >> > > > > >> That's what I tried, but it fails with > > > > >> tcp_getsockopt -> > > > > >> do_tcp_getsockopt -> > > > > >> tcp_ao_get_mkts -> > > > > >> tcp_ao_copy_mkts_to_user -> > > > > >> copy_struct_from_sockptr > > > > >> tcp_ao_get_sock_info -> > > > > >> copy_struct_from_sockptr > > > > >> > > > > >> That's not possible with a ITER_DEST iov_iter. > > > > >> > > > > >> metze > > > > > > > > > > Can we create two iterators over the same memory? One for ITER_SOURCE and > > > > > another for ITER_DEST. And then make getsockopt_iter accept optval_in and > > > > > optval_out. We can also use optval_out position (iov_offset) as optlen output > > > > > value. Don't see why it won't work, but I agree that's gonna be a messy > > > > > conversion so let's see if someone else has better suggestions. > > > > > > > > Yes, that might work, but it would be good to get some feedback > > > > if this would be the way to go: > > > > > > > > int (*getsockopt_iter)(struct socket *sock, > > > > int level, int optname, > > > > struct iov_iter *optval_in, > > > > struct iov_iter *optval_out); > > > > > > > > And *optlen = optval_out->iov_offset; > > > > > > > > Any objection or better ideas? Linus would that be what you had in mind? > > > > > > I'd worry about performance - yes I know 'iter' are used elsewhere but... > > > Also look at the SCTP code. > > > > Performance usually does not matter for set/getsockopts, there > > are a few exceptions that I know (TCP_ZEROCOPY_RECEIVE) > > That might be the one that is really horrid and completely abuses > the 'length' parameter. It is reading and writing, yes, but it's not a huge problem. And it does enforce the optlen (to copy back the same amount of bytes). It's not that bad, it's just an example of where we need to be extra careful. > > and maybe recent > > devmem sockopts; we can special-case these if needed, or keep sockptr_t, > > idk. I'm skeptical we can convert everything though, that's why the > > suggestion to start with sk/ip/tcp/udp. > > > > > How do you handle code that wants to return an updated length (often longer > > > than the one provided) and an error code (eg ERRSIZE or similar). > > > > > > There is also a very strange use (I think it is a sockopt rather than an ioctl) > > > where the buffer length the application provides is only that of the header. > > > The actual buffer length is contained in the header. > > > The return length is the amount written into the full buffer. > > > > Let's discuss these special cases as they come up? Worst case these > > places can always re-init iov_iter with a comment on why it is ok. > > But I do agree in general that there are a few places that do wild > > stuff. > > The problem is that the generic code has to deal with all the 'wild stuff'. getsockopt_iter will have optval_in for the minority of socket options (like TCP_ZEROCOPY_RECEIVE) that want to read user's value as well as optval_out. The latter is what the majority of socket options will use to write their value. That doesn't seem too complicated to handle? > It is also common to do non-sequential accesses - so iov_iter doesn't match > at all. I disagree that it's 'common'. Searching for copy_from_sockptr_offset returns a few cases and they are mostly using read-with-offset because there is no sequential read (iterator) semantics with sockptr_t. > There also isn't a requirement for scatter-gather. > > For 'normal' getsockopt (and setsockopt) with short lengths it actually makes > sense for the syscall wrapper to do the user copies. > But it would need to pass the user ptr+len as well as the kernel ptr+len > to give the required flexibilty. > Then you have to work out whether the final copy to user is needed or not. > (not that hard, but it all adds complication). Not sure I understand what's the problem. The user vs kernel part will be abstracted by iov_iter. The callers will have to write the optlen back. And there are two call sites we care about: io_uring and regular system call. What's your suggestion? Maybe I'm missing something. Do you prefer get_optlen/put_optlen?
On Wed, 2 Apr 2025 14:21:35 -0700 Stanislav Fomichev <stfomichev@gmail.com> wrote: > On 04/02, David Laight wrote: > > On Wed, 2 Apr 2025 07:19:46 -0700 > > Stanislav Fomichev <stfomichev@gmail.com> wrote: > > > > > On 04/02, David Laight wrote: > > > > On Wed, 2 Apr 2025 00:53:58 +0200 > > > > Stefan Metzmacher <metze@samba.org> wrote: > > > > > > > > > Am 02.04.25 um 00:04 schrieb Stanislav Fomichev: > > > > > > On 04/01, Stefan Metzmacher wrote: > > > > > >> Am 01.04.25 um 17:45 schrieb Stanislav Fomichev: > > > > > >>> On 04/01, Breno Leitao wrote: > > > > > >>>> On Tue, Apr 01, 2025 at 03:48:58PM +0200, Stefan Metzmacher wrote: > > > > > >>>>> Am 01.04.25 um 15:37 schrieb Stefan Metzmacher: > > > > > >>>>>> Am 01.04.25 um 10:19 schrieb Stefan Metzmacher: > > > > > >>>>>>> Am 31.03.25 um 23:04 schrieb Stanislav Fomichev: > > > > > >>>>>>>> On 03/31, Stefan Metzmacher wrote: > > > > > >>>>>>>>> The motivation for this is to remove the SOL_SOCKET limitation > > > > > >>>>>>>>> from io_uring_cmd_getsockopt(). > > > > > >>>>>>>>> > > > > > >>>>>>>>> The reason for this limitation is that io_uring_cmd_getsockopt() > > > > > >>>>>>>>> passes a kernel pointer as optlen to do_sock_getsockopt() > > > > > >>>>>>>>> and can't reach the ops->getsockopt() path. > > > > > >>>>>>>>> > > > > > >>>>>>>>> The first idea would be to change the optval and optlen arguments > > > > > >>>>>>>>> to the protocol specific hooks also to sockptr_t, as that > > > > > >>>>>>>>> is already used for setsockopt() and also by do_sock_getsockopt() > > > > > >>>>>>>>> sk_getsockopt() and BPF_CGROUP_RUN_PROG_GETSOCKOPT(). > > > > > >>>>>>>>> > > > > > >>>>>>>>> But as Linus don't like 'sockptr_t' I used a different approach. > > > > > >>>>>>>>> > > > > > >>>>>>>>> @Linus, would that optlen_t approach fit better for you? > > > > > >>>>>>>> > > > > > >>>>>>>> [..] > > > > > >>>>>>>> > > > > > >>>>>>>>> Instead of passing the optlen as user or kernel pointer, > > > > > >>>>>>>>> we only ever pass a kernel pointer and do the > > > > > >>>>>>>>> translation from/to userspace in do_sock_getsockopt(). > > > > > >>>>>>>> > > > > > >>>>>>>> At this point why not just fully embrace iov_iter? You have the size > > > > > >>>>>>>> now + the user (or kernel) pointer. Might as well do > > > > > >>>>>>>> s/sockptr_t/iov_iter/ conversion? > > > > > >>>>>>> > > > > > >>>>>>> I think that would only be possible if we introduce > > > > > >>>>>>> proto[_ops].getsockopt_iter() and then convert the implementations > > > > > >>>>>>> step by step. Doing it all in one go has a lot of potential to break > > > > > >>>>>>> the uapi. I could try to convert things like socket, ip and tcp myself, but > > > > > >>>>>>> the rest needs to be converted by the maintainer of the specific protocol, > > > > > >>>>>>> as it needs to be tested. As there are crazy things happening in the existing > > > > > >>>>>>> implementations, e.g. some getsockopt() implementations use optval as in and out > > > > > >>>>>>> buffer. > > > > > >>>>>>> > > > > > >>>>>>> I first tried to convert both optval and optlen of getsockopt to sockptr_t, > > > > > >>>>>>> and that showed that touching the optval part starts to get complex very soon, > > > > > >>>>>>> see https://git.samba.org/?p=metze/linux/wip.git;a=commitdiff;h=141912166473bf8843ec6ace76dc9c6945adafd1 > > > > > >>>>>>> (note it didn't converted everything, I gave up after hitting > > > > > >>>>>>> sctp_getsockopt_peer_addrs and sctp_getsockopt_local_addrs. > > > > > >>>>>>> sctp_getsockopt_context, sctp_getsockopt_maxseg, sctp_getsockopt_associnfo and maybe > > > > > >>>>>>> more are the ones also doing both copy_from_user and copy_to_user on optval) > > > > > >>>>>>> > > > > > >>>>>>> I come also across one implementation that returned -ERANGE because *optlen was > > > > > >>>>>>> too short and put the required length into *optlen, which means the returned > > > > > >>>>>>> *optlen is larger than the optval buffer given from userspace. > > > > > >>>>>>> > > > > > >>>>>>> Because of all these strange things I tried to do a minimal change > > > > > >>>>>>> in order to get rid of the io_uring limitation and only converted > > > > > >>>>>>> optlen and leave optval as is. > > > > > >>>>>>> > > > > > >>>>>>> In order to have a patchset that has a low risk to cause regressions. > > > > > >>>>>>> > > > > > >>>>>>> But as alternative introducing a prototype like this: > > > > > >>>>>>> > > > > > >>>>>>> int (*getsockopt_iter)(struct socket *sock, int level, int optname, > > > > > >>>>>>> struct iov_iter *optval_iter); > > > > > >>>>>>> > > > > > >>>>>>> That returns a non-negative value which can be placed into *optlen > > > > > >>>>>>> or negative value as error and *optlen will not be changed on error. > > > > > >>>>>>> optval_iter will get direction ITER_DEST, so it can only be written to. > > > > > >>>>>>> > > > > > >>>>>>> Implementations could then opt in for the new interface and > > > > > >>>>>>> allow do_sock_getsockopt() work also for the io_uring case, > > > > > >>>>>>> while all others would still get -EOPNOTSUPP. > > > > > >>>>>>> > > > > > >>>>>>> So what should be the way to go? > > > > > >>>>>> > > > > > >>>>>> Ok, I've added the infrastructure for getsockopt_iter, see below, > > > > > >>>>>> but the first part I wanted to convert was > > > > > >>>>>> tcp_ao_copy_mkts_to_user() and that also reads from userspace before > > > > > >>>>>> writing. > > > > > >>>>>> > > > > > >>>>>> So we could go with the optlen_t approach, or we need > > > > > >>>>>> logic for ITER_BOTH or pass two iov_iters one with ITER_SRC and one > > > > > >>>>>> with ITER_DEST... > > > > > >>>>>> > > > > > >>>>>> So who wants to decide? > > > > > >>>>> > > > > > >>>>> I just noticed that it's even possible in same cases > > > > > >>>>> to pass in a short buffer to optval, but have a longer value in optlen, > > > > > >>>>> hci_sock_getsockopt() with SOL_BLUETOOTH completely ignores optlen. > > > > > >>>>> > > > > > >>>>> This makes it really hard to believe that trying to use iov_iter for this > > > > > >>>>> is a good idea :-( > > > > > >>>> > > > > > >>>> That was my finding as well a while ago, when I was planning to get the > > > > > >>>> __user pointers converted to iov_iter. There are some weird ways of > > > > > >>>> using optlen and optval, which makes them non-trivial to covert to > > > > > >>>> iov_iter. > > > > > >>> > > > > > >>> Can we ignore all non-ip/tcp/udp cases for now? This should cover +90% > > > > > >>> of useful socket opts. See if there are any obvious problems with them > > > > > >>> and if not, try converting. The rest we can cover separately when/if > > > > > >>> needed. > > > > > >> > > > > > >> That's what I tried, but it fails with > > > > > >> tcp_getsockopt -> > > > > > >> do_tcp_getsockopt -> > > > > > >> tcp_ao_get_mkts -> > > > > > >> tcp_ao_copy_mkts_to_user -> > > > > > >> copy_struct_from_sockptr > > > > > >> tcp_ao_get_sock_info -> > > > > > >> copy_struct_from_sockptr > > > > > >> > > > > > >> That's not possible with a ITER_DEST iov_iter. > > > > > >> > > > > > >> metze > > > > > > > > > > > > Can we create two iterators over the same memory? One for ITER_SOURCE and > > > > > > another for ITER_DEST. And then make getsockopt_iter accept optval_in and > > > > > > optval_out. We can also use optval_out position (iov_offset) as optlen output > > > > > > value. Don't see why it won't work, but I agree that's gonna be a messy > > > > > > conversion so let's see if someone else has better suggestions. > > > > > > > > > > Yes, that might work, but it would be good to get some feedback > > > > > if this would be the way to go: > > > > > > > > > > int (*getsockopt_iter)(struct socket *sock, > > > > > int level, int optname, > > > > > struct iov_iter *optval_in, > > > > > struct iov_iter *optval_out); > > > > > > > > > > And *optlen = optval_out->iov_offset; > > > > > > > > > > Any objection or better ideas? Linus would that be what you had in mind? > > > > > > > > I'd worry about performance - yes I know 'iter' are used elsewhere but... > > > > Also look at the SCTP code. > > > > > > Performance usually does not matter for set/getsockopts, there > > > are a few exceptions that I know (TCP_ZEROCOPY_RECEIVE) > > > > That might be the one that is really horrid and completely abuses > > the 'length' parameter. > > It is reading and writing, yes, but it's not a huge problem. And it > does enforce the optlen (to copy back the same amount of bytes). It's > not that bad, it's just an example of where we need to be extra > careful. > > > > and maybe recent > > > devmem sockopts; we can special-case these if needed, or keep sockptr_t, > > > idk. I'm skeptical we can convert everything though, that's why the > > > suggestion to start with sk/ip/tcp/udp. > > > > > > > How do you handle code that wants to return an updated length (often longer > > > > than the one provided) and an error code (eg ERRSIZE or similar). > > > > > > > > There is also a very strange use (I think it is a sockopt rather than an ioctl) > > > > where the buffer length the application provides is only that of the header. > > > > The actual buffer length is contained in the header. > > > > The return length is the amount written into the full buffer. > > > > > > Let's discuss these special cases as they come up? Worst case these > > > places can always re-init iov_iter with a comment on why it is ok. > > > But I do agree in general that there are a few places that do wild > > > stuff. > > > > The problem is that the generic code has to deal with all the 'wild stuff'. > > getsockopt_iter will have optval_in for the minority of socket options > (like TCP_ZEROCOPY_RECEIVE) that want to read user's value as well > as optval_out. The latter is what the majority of socket options > will use to write their value. That doesn't seem too complicated to > handle? > > > It is also common to do non-sequential accesses - so iov_iter doesn't match > > at all. > > I disagree that it's 'common'. Searching for copy_from_sockptr_offset > returns a few cases and they are mostly using read-with-offset because > there is no sequential read (iterator) semantics with sockptr_t. > > > There also isn't a requirement for scatter-gather. > > > > For 'normal' getsockopt (and setsockopt) with short lengths it actually makes > > sense for the syscall wrapper to do the user copies. > > But it would need to pass the user ptr+len as well as the kernel ptr+len > > to give the required flexibilty. > > Then you have to work out whether the final copy to user is needed or not. > > (not that hard, but it all adds complication). > > Not sure I understand what's the problem. The user vs kernel part will > be abstracted by iov_iter. The callers will have to write the optlen > back. And there are two call sites we care about: io_uring and regular > system call. What's your suggestion? Maybe I'm missing something. Do you > prefer get_optlen/put_optlen? I think the final aim should be to pass the user supplied length to the per-protocol code and have it return the length/error to be passed back to the user. But in a lot of cases the syscall wrapper can do the buffer copies (as well as the length copies). That would be restricted to short length (on stack). So code that needed a long buffer (like some of the sctp options) would need to directly access the user buffer (or a long buffer provided by an in-kernel user). But you'll find code that reads/writes well beyond the apparent size of the user buffer. (And not just code that accesses 4 bytes without checking the length). David
On 04/02, David Laight wrote: > On Wed, 2 Apr 2025 14:21:35 -0700 > Stanislav Fomichev <stfomichev@gmail.com> wrote: > > > On 04/02, David Laight wrote: > > > On Wed, 2 Apr 2025 07:19:46 -0700 > > > Stanislav Fomichev <stfomichev@gmail.com> wrote: > > > > > > > On 04/02, David Laight wrote: > > > > > On Wed, 2 Apr 2025 00:53:58 +0200 > > > > > Stefan Metzmacher <metze@samba.org> wrote: > > > > > > > > > > > Am 02.04.25 um 00:04 schrieb Stanislav Fomichev: > > > > > > > On 04/01, Stefan Metzmacher wrote: > > > > > > >> Am 01.04.25 um 17:45 schrieb Stanislav Fomichev: > > > > > > >>> On 04/01, Breno Leitao wrote: > > > > > > >>>> On Tue, Apr 01, 2025 at 03:48:58PM +0200, Stefan Metzmacher wrote: > > > > > > >>>>> Am 01.04.25 um 15:37 schrieb Stefan Metzmacher: > > > > > > >>>>>> Am 01.04.25 um 10:19 schrieb Stefan Metzmacher: > > > > > > >>>>>>> Am 31.03.25 um 23:04 schrieb Stanislav Fomichev: > > > > > > >>>>>>>> On 03/31, Stefan Metzmacher wrote: > > > > > > >>>>>>>>> The motivation for this is to remove the SOL_SOCKET limitation > > > > > > >>>>>>>>> from io_uring_cmd_getsockopt(). > > > > > > >>>>>>>>> > > > > > > >>>>>>>>> The reason for this limitation is that io_uring_cmd_getsockopt() > > > > > > >>>>>>>>> passes a kernel pointer as optlen to do_sock_getsockopt() > > > > > > >>>>>>>>> and can't reach the ops->getsockopt() path. > > > > > > >>>>>>>>> > > > > > > >>>>>>>>> The first idea would be to change the optval and optlen arguments > > > > > > >>>>>>>>> to the protocol specific hooks also to sockptr_t, as that > > > > > > >>>>>>>>> is already used for setsockopt() and also by do_sock_getsockopt() > > > > > > >>>>>>>>> sk_getsockopt() and BPF_CGROUP_RUN_PROG_GETSOCKOPT(). > > > > > > >>>>>>>>> > > > > > > >>>>>>>>> But as Linus don't like 'sockptr_t' I used a different approach. > > > > > > >>>>>>>>> > > > > > > >>>>>>>>> @Linus, would that optlen_t approach fit better for you? > > > > > > >>>>>>>> > > > > > > >>>>>>>> [..] > > > > > > >>>>>>>> > > > > > > >>>>>>>>> Instead of passing the optlen as user or kernel pointer, > > > > > > >>>>>>>>> we only ever pass a kernel pointer and do the > > > > > > >>>>>>>>> translation from/to userspace in do_sock_getsockopt(). > > > > > > >>>>>>>> > > > > > > >>>>>>>> At this point why not just fully embrace iov_iter? You have the size > > > > > > >>>>>>>> now + the user (or kernel) pointer. Might as well do > > > > > > >>>>>>>> s/sockptr_t/iov_iter/ conversion? > > > > > > >>>>>>> > > > > > > >>>>>>> I think that would only be possible if we introduce > > > > > > >>>>>>> proto[_ops].getsockopt_iter() and then convert the implementations > > > > > > >>>>>>> step by step. Doing it all in one go has a lot of potential to break > > > > > > >>>>>>> the uapi. I could try to convert things like socket, ip and tcp myself, but > > > > > > >>>>>>> the rest needs to be converted by the maintainer of the specific protocol, > > > > > > >>>>>>> as it needs to be tested. As there are crazy things happening in the existing > > > > > > >>>>>>> implementations, e.g. some getsockopt() implementations use optval as in and out > > > > > > >>>>>>> buffer. > > > > > > >>>>>>> > > > > > > >>>>>>> I first tried to convert both optval and optlen of getsockopt to sockptr_t, > > > > > > >>>>>>> and that showed that touching the optval part starts to get complex very soon, > > > > > > >>>>>>> see https://git.samba.org/?p=metze/linux/wip.git;a=commitdiff;h=141912166473bf8843ec6ace76dc9c6945adafd1 > > > > > > >>>>>>> (note it didn't converted everything, I gave up after hitting > > > > > > >>>>>>> sctp_getsockopt_peer_addrs and sctp_getsockopt_local_addrs. > > > > > > >>>>>>> sctp_getsockopt_context, sctp_getsockopt_maxseg, sctp_getsockopt_associnfo and maybe > > > > > > >>>>>>> more are the ones also doing both copy_from_user and copy_to_user on optval) > > > > > > >>>>>>> > > > > > > >>>>>>> I come also across one implementation that returned -ERANGE because *optlen was > > > > > > >>>>>>> too short and put the required length into *optlen, which means the returned > > > > > > >>>>>>> *optlen is larger than the optval buffer given from userspace. > > > > > > >>>>>>> > > > > > > >>>>>>> Because of all these strange things I tried to do a minimal change > > > > > > >>>>>>> in order to get rid of the io_uring limitation and only converted > > > > > > >>>>>>> optlen and leave optval as is. > > > > > > >>>>>>> > > > > > > >>>>>>> In order to have a patchset that has a low risk to cause regressions. > > > > > > >>>>>>> > > > > > > >>>>>>> But as alternative introducing a prototype like this: > > > > > > >>>>>>> > > > > > > >>>>>>> int (*getsockopt_iter)(struct socket *sock, int level, int optname, > > > > > > >>>>>>> struct iov_iter *optval_iter); > > > > > > >>>>>>> > > > > > > >>>>>>> That returns a non-negative value which can be placed into *optlen > > > > > > >>>>>>> or negative value as error and *optlen will not be changed on error. > > > > > > >>>>>>> optval_iter will get direction ITER_DEST, so it can only be written to. > > > > > > >>>>>>> > > > > > > >>>>>>> Implementations could then opt in for the new interface and > > > > > > >>>>>>> allow do_sock_getsockopt() work also for the io_uring case, > > > > > > >>>>>>> while all others would still get -EOPNOTSUPP. > > > > > > >>>>>>> > > > > > > >>>>>>> So what should be the way to go? > > > > > > >>>>>> > > > > > > >>>>>> Ok, I've added the infrastructure for getsockopt_iter, see below, > > > > > > >>>>>> but the first part I wanted to convert was > > > > > > >>>>>> tcp_ao_copy_mkts_to_user() and that also reads from userspace before > > > > > > >>>>>> writing. > > > > > > >>>>>> > > > > > > >>>>>> So we could go with the optlen_t approach, or we need > > > > > > >>>>>> logic for ITER_BOTH or pass two iov_iters one with ITER_SRC and one > > > > > > >>>>>> with ITER_DEST... > > > > > > >>>>>> > > > > > > >>>>>> So who wants to decide? > > > > > > >>>>> > > > > > > >>>>> I just noticed that it's even possible in same cases > > > > > > >>>>> to pass in a short buffer to optval, but have a longer value in optlen, > > > > > > >>>>> hci_sock_getsockopt() with SOL_BLUETOOTH completely ignores optlen. > > > > > > >>>>> > > > > > > >>>>> This makes it really hard to believe that trying to use iov_iter for this > > > > > > >>>>> is a good idea :-( > > > > > > >>>> > > > > > > >>>> That was my finding as well a while ago, when I was planning to get the > > > > > > >>>> __user pointers converted to iov_iter. There are some weird ways of > > > > > > >>>> using optlen and optval, which makes them non-trivial to covert to > > > > > > >>>> iov_iter. > > > > > > >>> > > > > > > >>> Can we ignore all non-ip/tcp/udp cases for now? This should cover +90% > > > > > > >>> of useful socket opts. See if there are any obvious problems with them > > > > > > >>> and if not, try converting. The rest we can cover separately when/if > > > > > > >>> needed. > > > > > > >> > > > > > > >> That's what I tried, but it fails with > > > > > > >> tcp_getsockopt -> > > > > > > >> do_tcp_getsockopt -> > > > > > > >> tcp_ao_get_mkts -> > > > > > > >> tcp_ao_copy_mkts_to_user -> > > > > > > >> copy_struct_from_sockptr > > > > > > >> tcp_ao_get_sock_info -> > > > > > > >> copy_struct_from_sockptr > > > > > > >> > > > > > > >> That's not possible with a ITER_DEST iov_iter. > > > > > > >> > > > > > > >> metze > > > > > > > > > > > > > > Can we create two iterators over the same memory? One for ITER_SOURCE and > > > > > > > another for ITER_DEST. And then make getsockopt_iter accept optval_in and > > > > > > > optval_out. We can also use optval_out position (iov_offset) as optlen output > > > > > > > value. Don't see why it won't work, but I agree that's gonna be a messy > > > > > > > conversion so let's see if someone else has better suggestions. > > > > > > > > > > > > Yes, that might work, but it would be good to get some feedback > > > > > > if this would be the way to go: > > > > > > > > > > > > int (*getsockopt_iter)(struct socket *sock, > > > > > > int level, int optname, > > > > > > struct iov_iter *optval_in, > > > > > > struct iov_iter *optval_out); > > > > > > > > > > > > And *optlen = optval_out->iov_offset; > > > > > > > > > > > > Any objection or better ideas? Linus would that be what you had in mind? > > > > > > > > > > I'd worry about performance - yes I know 'iter' are used elsewhere but... > > > > > Also look at the SCTP code. > > > > > > > > Performance usually does not matter for set/getsockopts, there > > > > are a few exceptions that I know (TCP_ZEROCOPY_RECEIVE) > > > > > > That might be the one that is really horrid and completely abuses > > > the 'length' parameter. > > > > It is reading and writing, yes, but it's not a huge problem. And it > > does enforce the optlen (to copy back the same amount of bytes). It's > > not that bad, it's just an example of where we need to be extra > > careful. > > > > > > and maybe recent > > > > devmem sockopts; we can special-case these if needed, or keep sockptr_t, > > > > idk. I'm skeptical we can convert everything though, that's why the > > > > suggestion to start with sk/ip/tcp/udp. > > > > > > > > > How do you handle code that wants to return an updated length (often longer > > > > > than the one provided) and an error code (eg ERRSIZE or similar). > > > > > > > > > > There is also a very strange use (I think it is a sockopt rather than an ioctl) > > > > > where the buffer length the application provides is only that of the header. > > > > > The actual buffer length is contained in the header. > > > > > The return length is the amount written into the full buffer. > > > > > > > > Let's discuss these special cases as they come up? Worst case these > > > > places can always re-init iov_iter with a comment on why it is ok. > > > > But I do agree in general that there are a few places that do wild > > > > stuff. > > > > > > The problem is that the generic code has to deal with all the 'wild stuff'. > > > > getsockopt_iter will have optval_in for the minority of socket options > > (like TCP_ZEROCOPY_RECEIVE) that want to read user's value as well > > as optval_out. The latter is what the majority of socket options > > will use to write their value. That doesn't seem too complicated to > > handle? > > > > > It is also common to do non-sequential accesses - so iov_iter doesn't match > > > at all. > > > > I disagree that it's 'common'. Searching for copy_from_sockptr_offset > > returns a few cases and they are mostly using read-with-offset because > > there is no sequential read (iterator) semantics with sockptr_t. > > > > > There also isn't a requirement for scatter-gather. > > > > > > For 'normal' getsockopt (and setsockopt) with short lengths it actually makes > > > sense for the syscall wrapper to do the user copies. > > > But it would need to pass the user ptr+len as well as the kernel ptr+len > > > to give the required flexibilty. > > > Then you have to work out whether the final copy to user is needed or not. > > > (not that hard, but it all adds complication). > > > > Not sure I understand what's the problem. The user vs kernel part will > > be abstracted by iov_iter. The callers will have to write the optlen > > back. And there are two call sites we care about: io_uring and regular > > system call. What's your suggestion? Maybe I'm missing something. Do you > > prefer get_optlen/put_optlen? > > I think the final aim should be to pass the user supplied length to the > per-protocol code and have it return the length/error to be passed back to the > user. Like what Stefan's patch 3 is doing? Or you're suggesting to change getsockopt handlers to handle length more explicitly? If we were to proceed with sockptr to iov_iter conversion we'll have to do it anyway (or pass the length as the size of iov_iter). > But in a lot of cases the syscall wrapper can do the buffer copies (as well > as the length copies). > That would be restricted to short length (on stack). > So code that needed a long buffer (like some of the sctp options) > would need to directly access the user buffer (or a long buffer provided > by an in-kernel user). This sounds similar to what we did with bpf hooks - copy (head of) the buffer and run bpf program on top of it. I remember iptables setsockopt begin problematic because of its huge size.. It is an option, yes (to convert protocol handler to kernel memory mostly). > But you'll find code that reads/writes well beyond the apparent size of > the user buffer. > (And not just code that accesses 4 bytes without checking the length). With can start with getsockopt_iter + sk_getsockopt to see if there are any issues with that approach. If not, adding ip/tcp/udp to the mix should be doable. We can explain and comment on special cases if needed. When other protocols are needed from io_uring, we can convert more. But at least the new code will use the correct abstractions.