mbox series

[RFC,0/4] net/io_uring: pass a kernel pointer via optlen_t to proto[_ops].getsockopt()

Message ID cover.1743449872.git.metze@samba.org (mailing list archive)
Headers show
Series net/io_uring: pass a kernel pointer via optlen_t to proto[_ops].getsockopt() | expand

Message

Stefan Metzmacher March 31, 2025, 8:10 p.m. UTC
The motivation for this is to remove the SOL_SOCKET limitation
from io_uring_cmd_getsockopt().

The reason for this limitation is that io_uring_cmd_getsockopt()
passes a kernel pointer as optlen to do_sock_getsockopt()
and can't reach the ops->getsockopt() path.

The first idea would be to change the optval and optlen arguments
to the protocol specific hooks also to sockptr_t, as that
is already used for setsockopt() and also by do_sock_getsockopt()
sk_getsockopt() and BPF_CGROUP_RUN_PROG_GETSOCKOPT().

But as Linus don't like 'sockptr_t' I used a different approach.

@Linus, would that optlen_t approach fit better for you?

Instead of passing the optlen as user or kernel pointer,
we only ever pass a kernel pointer and do the
translation from/to userspace in do_sock_getsockopt().

The simple solution would be to just remove the
'__user' from the int *optlen argument, but it
seems the compiler doesn't complain about
'__user' vs. without it, so instead I used
a helper struct in order to make sure everything
compiles with a typesafe change.

The patchset does the transformation in 3
easy to review steps:

1/4: introduces get_optlen(len, optlen) and put_optlen(len, optlen) helpers
     on top of the existing get_user(len, optlen) and put_user(len, optlen)
     usages.

2/4: introduces a simple optlen_t that just contains 'int __user *up;'
     that makes sure get_optlen and put_optlen get a typesafe optlen argument
     and they are the only functions looking at optlen.
     (The existing sockptr_t optlen code gets OPTLEN_SOCKPTR(optlen) passed)

3/4: The changes do_sock_getsockopt() to pass a kernel pointer instead
     of a __user pointer via optlen_t. This is a bit tricky as
     directly failing the copy_from_sockptr(&koptlen, optlen, sizeof(koptlen)
     with -EFAULT might change the uapi, as some getsockopt() hooks
     doesn't even touch optlen at all. And userspace could do something
     like this:

        feature_x_supported = true;
        ret = getsockopt(fd, level, optname, NULL, NULL);
        if (ret == -1 && errno == ENOTSUPP) {
            feature_x_supported = false;
        }

     And this should not give -EFAULT after the changes,
     so optlen.kp is passed down as NULL, so that -EFAULT is
     deferred to get_optlen() and put_optlen().

4/4: Removes the SOL_SOCKET restriction for io-uring.

This patchset doesn't touch any existing getsockopt() that
was already converted to sockptr_t optlen, that's something
for a later cleanup.

Link: https://lore.kernel.org/io-uring/86b1dce5-4bb4-4a0b-9cff-e72f488bf57d@samba.org/T/#t
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Pavel Begunkov <asml.silence@gmail.com>
Cc: Breno Leitao <leitao@debian.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Karsten Keil <isdn@linux-pingi.de>
Cc: Ayush Sawal <ayush.sawal@chelsio.com>
Cc: Andrew Lunn <andrew+netdev@lunn.ch>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Paolo Abeni <pabeni@redhat.com>
Cc: Simon Horman <horms@kernel.org>
Cc: Kuniyuki Iwashima <kuniyu@amazon.com>
Cc: Willem de Bruijn <willemb@google.com>
Cc: David Ahern <dsahern@kernel.org>
Cc: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Cc: Xin Long <lucien.xin@gmail.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Joerg Reuter <jreuter@yaina.de>
Cc: Marcel Holtmann <marcel@holtmann.org>
Cc: Johan Hedberg <johan.hedberg@gmail.com>
Cc: Luiz Augusto von Dentz <luiz.dentz@gmail.com>
Cc: Oliver Hartkopp <socketcan@hartkopp.net>
Cc: Marc Kleine-Budde <mkl@pengutronix.de>
Cc: Robin van der Gracht <robin@protonic.nl>
Cc: Oleksij Rempel <o.rempel@pengutronix.de>
Cc: kernel@pengutronix.de
Cc: Alexander Aring <alex.aring@gmail.com>
Cc: Stefan Schmidt <stefan@datenfreihafen.org>
Cc: Miquel Raynal <miquel.raynal@bootlin.com>
Cc: Alexandra Winter <wintera@linux.ibm.com>
Cc: Thorsten Winkler <twinkler@linux.ibm.com>
Cc: James Chapman <jchapman@katalix.com>
Cc: Jeremy Kerr <jk@codeconstruct.com.au>
Cc: Matt Johnston <matt@codeconstruct.com.au>
Cc: Matthieu Baerts <matttbe@kernel.org>
Cc: Mat Martineau <martineau@kernel.org>
Cc: Geliang Tang <geliang@kernel.org>
Cc: Krzysztof Kozlowski <krzk@kernel.org>
Cc: Remi Denis-Courmont <courmisch@gmail.com>
Cc: Allison Henderson <allison.henderson@oracle.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Marc Dionne <marc.dionne@auristor.com>
Cc: Wenjia Zhang <wenjia@linux.ibm.com>
Cc: Jan Karcher <jaka@linux.ibm.com>
Cc: "D. Wythe" <alibuda@linux.alibaba.com>
Cc: Tony Lu <tonylu@linux.alibaba.com>
Cc: Wen Gu <guwen@linux.alibaba.com>
Cc: Jon Maloy <jmaloy@redhat.com>
Cc: Boris Pismenny <borisp@nvidia.com>
Cc: John Fastabend <john.fastabend@gmail.com>
Cc: Stefano Garzarella <sgarzare@redhat.com>
Cc: Martin Schiller <ms@dev.tdt.de>
Cc: "Björn Töpel" <bjorn@kernel.org>
Cc: Magnus Karlsson <magnus.karlsson@intel.com>
Cc: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Cc: Jonathan Lemon <jonathan.lemon@gmail.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Jesper Dangaard Brouer <hawk@kernel.org>
CC: Stefan Metzmacher <metze@samba.org>
Cc: netdev@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-sctp@vger.kernel.org
Cc: linux-hams@vger.kernel.org
Cc: linux-bluetooth@vger.kernel.org
Cc: linux-can@vger.kernel.org
Cc: dccp@vger.kernel.org
Cc: linux-wpan@vger.kernel.org
Cc: linux-s390@vger.kernel.org
Cc: mptcp@lists.linux.dev
Cc: linux-rdma@vger.kernel.org
Cc: rds-devel@oss.oracle.com
Cc: linux-afs@lists.infradead.org
Cc: tipc-discussion@lists.sourceforge.net
Cc: virtualization@lists.linux.dev
Cc: linux-x25@vger.kernel.org
Cc: bpf@vger.kernel.org
Cc: isdn4linux@listserv.isdn4linux.de
Cc: io-uring@vger.kernel.org

Stefan Metzmacher (4):
  net: introduce get_optlen() and put_optlen() helpers
  net: pass 'optlen_t' to proto[ops].getsockopt() hooks
  net: pass a kernel pointer via 'optlen_t' to proto[ops].getsockopt()
    hooks
  io_uring: let io_uring_cmd_getsockopt() allow level other than
    SOL_SOCKET

 drivers/isdn/mISDN/socket.c                   |   4 +-
 .../chelsio/inline_crypto/chtls/chtls_main.c  |   4 +-
 include/linux/net.h                           |   2 +-
 include/linux/sockptr.h                       |  41 ++++
 include/net/inet_connection_sock.h            |   2 +-
 include/net/ip.h                              |   2 +-
 include/net/ipv6.h                            |   2 +-
 include/net/sctp/structs.h                    |   2 +-
 include/net/sock.h                            |   4 +-
 include/net/tcp.h                             |   2 +-
 include/net/udp.h                             |   2 +-
 io_uring/uring_cmd.c                          |   3 -
 net/atm/common.c                              |   4 +-
 net/atm/common.h                              |   2 +-
 net/atm/pvc.c                                 |   2 +-
 net/atm/svc.c                                 |   4 +-
 net/ax25/af_ax25.c                            |   6 +-
 net/bluetooth/hci_sock.c                      |   6 +-
 net/bluetooth/iso.c                           |   6 +-
 net/bluetooth/l2cap_sock.c                    |   8 +-
 net/bluetooth/rfcomm/sock.c                   |   8 +-
 net/bluetooth/sco.c                           |  10 +-
 net/can/isotp.c                               |   6 +-
 net/can/j1939/socket.c                        |   6 +-
 net/can/raw.c                                 |  14 +-
 net/core/sock.c                               |   2 +-
 net/dccp/ccid.c                               |   4 +-
 net/dccp/ccid.h                               |  10 +-
 net/dccp/ccids/ccid3.c                        |   8 +-
 net/dccp/dccp.h                               |   2 +-
 net/dccp/proto.c                              |  12 +-
 net/ieee802154/socket.c                       |   8 +-
 net/ipv4/ip_sockglue.c                        |   8 +-
 net/ipv4/raw.c                                |  10 +-
 net/ipv4/tcp.c                                |   4 +-
 net/ipv4/udp.c                                |   8 +-
 net/ipv4/udp_impl.h                           |   2 +-
 net/ipv6/ipv6_sockglue.c                      |   8 +-
 net/ipv6/raw.c                                |  14 +-
 net/ipv6/udp.c                                |   2 +-
 net/ipv6/udp_impl.h                           |   2 +-
 net/iucv/af_iucv.c                            |   6 +-
 net/kcm/kcmsock.c                             |   6 +-
 net/l2tp/l2tp_ppp.c                           |   6 +-
 net/llc/af_llc.c                              |   6 +-
 net/mctp/af_mctp.c                            |   4 +-
 net/mptcp/protocol.h                          |   2 +-
 net/mptcp/sockopt.c                           |  48 ++--
 net/netlink/af_netlink.c                      |   8 +-
 net/netrom/af_netrom.c                        |   6 +-
 net/nfc/llcp_sock.c                           |   6 +-
 net/packet/af_packet.c                        |   6 +-
 net/phonet/pep.c                              |   6 +-
 net/rds/af_rds.c                              |   8 +-
 net/rds/info.c                                |   6 +-
 net/rds/info.h                                |   2 +-
 net/rose/af_rose.c                            |   6 +-
 net/rxrpc/af_rxrpc.c                          |   6 +-
 net/sctp/socket.c                             | 220 +++++++++---------
 net/smc/af_smc.c                              |   8 +-
 net/smc/smc.h                                 |   2 +-
 net/socket.c                                  |  34 ++-
 net/tipc/socket.c                             |   8 +-
 net/tls/tls_main.c                            |  18 +-
 net/vmw_vsock/af_vsock.c                      |   6 +-
 net/x25/af_x25.c                              |   6 +-
 net/xdp/xsk.c                                 |  10 +-
 67 files changed, 387 insertions(+), 319 deletions(-)

Comments

Stanislav Fomichev March 31, 2025, 9:04 p.m. UTC | #1
On 03/31, Stefan Metzmacher wrote:
> The motivation for this is to remove the SOL_SOCKET limitation
> from io_uring_cmd_getsockopt().
> 
> The reason for this limitation is that io_uring_cmd_getsockopt()
> passes a kernel pointer as optlen to do_sock_getsockopt()
> and can't reach the ops->getsockopt() path.
> 
> The first idea would be to change the optval and optlen arguments
> to the protocol specific hooks also to sockptr_t, as that
> is already used for setsockopt() and also by do_sock_getsockopt()
> sk_getsockopt() and BPF_CGROUP_RUN_PROG_GETSOCKOPT().
> 
> But as Linus don't like 'sockptr_t' I used a different approach.
> 
> @Linus, would that optlen_t approach fit better for you?

[..]

> Instead of passing the optlen as user or kernel pointer,
> we only ever pass a kernel pointer and do the
> translation from/to userspace in do_sock_getsockopt().

At this point why not just fully embrace iov_iter? You have the size
now + the user (or kernel) pointer. Might as well do
s/sockptr_t/iov_iter/ conversion?
Stefan Metzmacher April 1, 2025, 8:19 a.m. UTC | #2
Am 31.03.25 um 23:04 schrieb Stanislav Fomichev:
> On 03/31, Stefan Metzmacher wrote:
>> The motivation for this is to remove the SOL_SOCKET limitation
>> from io_uring_cmd_getsockopt().
>>
>> The reason for this limitation is that io_uring_cmd_getsockopt()
>> passes a kernel pointer as optlen to do_sock_getsockopt()
>> and can't reach the ops->getsockopt() path.
>>
>> The first idea would be to change the optval and optlen arguments
>> to the protocol specific hooks also to sockptr_t, as that
>> is already used for setsockopt() and also by do_sock_getsockopt()
>> sk_getsockopt() and BPF_CGROUP_RUN_PROG_GETSOCKOPT().
>>
>> But as Linus don't like 'sockptr_t' I used a different approach.
>>
>> @Linus, would that optlen_t approach fit better for you?
> 
> [..]
> 
>> Instead of passing the optlen as user or kernel pointer,
>> we only ever pass a kernel pointer and do the
>> translation from/to userspace in do_sock_getsockopt().
> 
> At this point why not just fully embrace iov_iter? You have the size
> now + the user (or kernel) pointer. Might as well do
> s/sockptr_t/iov_iter/ conversion?

I think that would only be possible if we introduce
proto[_ops].getsockopt_iter() and then convert the implementations
step by step. Doing it all in one go has a lot of potential to break
the uapi. I could try to convert things like socket, ip and tcp myself, but
the rest needs to be converted by the maintainer of the specific protocol,
as it needs to be tested. As there are crazy things happening in the existing
implementations, e.g. some getsockopt() implementations use optval as in and out
buffer.

I first tried to convert both optval and optlen of getsockopt to sockptr_t,
and that showed that touching the optval part starts to get complex very soon,
see https://git.samba.org/?p=metze/linux/wip.git;a=commitdiff;h=141912166473bf8843ec6ace76dc9c6945adafd1
(note it didn't converted everything, I gave up after hitting
sctp_getsockopt_peer_addrs and sctp_getsockopt_local_addrs.
sctp_getsockopt_context, sctp_getsockopt_maxseg, sctp_getsockopt_associnfo and maybe
more are the ones also doing both copy_from_user and copy_to_user on optval)

I come also across one implementation that returned -ERANGE because *optlen was
too short and put the required length into *optlen, which means the returned
*optlen is larger than the optval buffer given from userspace.

Because of all these strange things I tried to do a minimal change
in order to get rid of the io_uring limitation and only converted
optlen and leave optval as is.

In order to have a patchset that has a low risk to cause regressions.

But as alternative introducing a prototype like this:

         int (*getsockopt_iter)(struct socket *sock, int level, int optname,
                                struct iov_iter *optval_iter);

That returns a non-negative value which can be placed into *optlen
or negative value as error and *optlen will not be changed on error.
optval_iter will get direction ITER_DEST, so it can only be written to.

Implementations could then opt in for the new interface and
allow do_sock_getsockopt() work also for the io_uring case,
while all others would still get -EOPNOTSUPP.

So what should be the way to go?

metze
Stefan Metzmacher April 1, 2025, 1:37 p.m. UTC | #3
Am 01.04.25 um 10:19 schrieb Stefan Metzmacher:
> Am 31.03.25 um 23:04 schrieb Stanislav Fomichev:
>> On 03/31, Stefan Metzmacher wrote:
>>> The motivation for this is to remove the SOL_SOCKET limitation
>>> from io_uring_cmd_getsockopt().
>>>
>>> The reason for this limitation is that io_uring_cmd_getsockopt()
>>> passes a kernel pointer as optlen to do_sock_getsockopt()
>>> and can't reach the ops->getsockopt() path.
>>>
>>> The first idea would be to change the optval and optlen arguments
>>> to the protocol specific hooks also to sockptr_t, as that
>>> is already used for setsockopt() and also by do_sock_getsockopt()
>>> sk_getsockopt() and BPF_CGROUP_RUN_PROG_GETSOCKOPT().
>>>
>>> But as Linus don't like 'sockptr_t' I used a different approach.
>>>
>>> @Linus, would that optlen_t approach fit better for you?
>>
>> [..]
>>
>>> Instead of passing the optlen as user or kernel pointer,
>>> we only ever pass a kernel pointer and do the
>>> translation from/to userspace in do_sock_getsockopt().
>>
>> At this point why not just fully embrace iov_iter? You have the size
>> now + the user (or kernel) pointer. Might as well do
>> s/sockptr_t/iov_iter/ conversion?
> 
> I think that would only be possible if we introduce
> proto[_ops].getsockopt_iter() and then convert the implementations
> step by step. Doing it all in one go has a lot of potential to break
> the uapi. I could try to convert things like socket, ip and tcp myself, but
> the rest needs to be converted by the maintainer of the specific protocol,
> as it needs to be tested. As there are crazy things happening in the existing
> implementations, e.g. some getsockopt() implementations use optval as in and out
> buffer.
> 
> I first tried to convert both optval and optlen of getsockopt to sockptr_t,
> and that showed that touching the optval part starts to get complex very soon,
> see https://git.samba.org/?p=metze/linux/wip.git;a=commitdiff;h=141912166473bf8843ec6ace76dc9c6945adafd1
> (note it didn't converted everything, I gave up after hitting
> sctp_getsockopt_peer_addrs and sctp_getsockopt_local_addrs.
> sctp_getsockopt_context, sctp_getsockopt_maxseg, sctp_getsockopt_associnfo and maybe
> more are the ones also doing both copy_from_user and copy_to_user on optval)
> 
> I come also across one implementation that returned -ERANGE because *optlen was
> too short and put the required length into *optlen, which means the returned
> *optlen is larger than the optval buffer given from userspace.
> 
> Because of all these strange things I tried to do a minimal change
> in order to get rid of the io_uring limitation and only converted
> optlen and leave optval as is.
> 
> In order to have a patchset that has a low risk to cause regressions.
> 
> But as alternative introducing a prototype like this:
> 
>          int (*getsockopt_iter)(struct socket *sock, int level, int optname,
>                                 struct iov_iter *optval_iter);
> 
> That returns a non-negative value which can be placed into *optlen
> or negative value as error and *optlen will not be changed on error.
> optval_iter will get direction ITER_DEST, so it can only be written to.
> 
> Implementations could then opt in for the new interface and
> allow do_sock_getsockopt() work also for the io_uring case,
> while all others would still get -EOPNOTSUPP.
> 
> So what should be the way to go?

Ok, I've added the infrastructure for getsockopt_iter, see below,
but the first part I wanted to convert was
tcp_ao_copy_mkts_to_user() and that also reads from userspace before
writing.

So we could go with the optlen_t approach, or we need
logic for ITER_BOTH or pass two iov_iters one with ITER_SRC and one
with ITER_DEST...

So who wants to decide?

Thanks!
metze
---
  include/linux/net.h |  4 +++
  include/net/sock.h  | 64 +++++++++++++++++++++++++++++++++++++++++++++
  net/core/sock.c     | 12 +++++++--
  net/socket.c        | 12 +++++++--
  4 files changed, 88 insertions(+), 4 deletions(-)

diff --git a/include/linux/net.h b/include/linux/net.h
index 0ff950eecc6b..ceb9f9ed84b9 100644
--- a/include/linux/net.h
+++ b/include/linux/net.h
@@ -194,6 +194,10 @@ struct proto_ops {
  				      unsigned int optlen);
  	int		(*getsockopt)(struct socket *sock, int level,
  				      int optname, char __user *optval, int __user *optlen);
+	int		(*getsockopt_iter)(struct socket *sock,
+					   int level,
+					   int optname,
+					   struct iov_iter *optval_iter);
  	void		(*show_fdinfo)(struct seq_file *m, struct socket *sock);
  	int		(*sendmsg)   (struct socket *sock, struct msghdr *m,
  				      size_t total_len);
diff --git a/include/net/sock.h b/include/net/sock.h
index 8daf1b3b12c6..e741b219056e 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -1249,6 +1249,11 @@ struct proto {
  	int			(*getsockopt)(struct sock *sk, int level,
  					int optname, char __user *optval,
  					int __user *option);
+	int			(*getsockopt_iter)(struct sock *sk,
+						   int level,
+						   int optname,
+						   struct iov_iter *optval_iter);
+
  	void			(*keepalive)(struct sock *sk, int valbool);
  #ifdef CONFIG_COMPAT
  	int			(*compat_ioctl)(struct sock *sk,
@@ -1781,6 +1786,65 @@ int do_sock_setsockopt(struct socket *sock, bool compat, int level,
  int do_sock_getsockopt(struct socket *sock, bool compat, int level,
  		       int optname, sockptr_t optval, sockptr_t optlen);

+#define __generic_wrap_getsockopt_iter(__s, __level,				\
+				       __optname, __optval, __optlen, 		\
+				       __getsockopt_iter) 			\
+do {										\
+	struct iov_iter optval_iter;						\
+	struct kvec optval_kvec;						\
+	int len;								\
+	int err;								\
+										\
+	if (unlikely(__getsockopt_iter == NULL))				\
+		return -EOPNOTSUPP;						\
+										\
+	if (copy_from_sockptr(&len, __optlen, sizeof(len)))			\
+		return -EFAULT;							\
+										\
+	if (len < 0)								\
+		return -EINVAL;							\
+										\
+	if (__optval.is_kernel) {						\
+		if (__optval.kernel == NULL && len != 0)			\
+			return -EFAULT;						\
+										\
+		optval_kvec = (struct kvec) {					\
+			.iov_base = __optval.kernel,				\
+			.iov_len = len,						\
+		};								\
+										\
+		iov_iter_kvec(&optval_iter, ITER_DEST,				\
+			      &optval_kvec, 1, optval_kvec.iov_len);		\
+	} else {								\
+		if (import_ubuf(ITER_DEST, __optval.user, len, &optval_iter))	\
+			return -EFAULT;						\
+	}									\
+										\
+	err = getsockopt_iter(__s, __level, __optname, &optval_iter);		\
+	if (unlikely(err < 0))							\
+		return err;							\
+										\
+	len = err;								\
+	if (copy_to_sockptr(__optlen, &len, sizeof(len)))			\
+		return -EFAULT;							\
+										\
+	return 0;								\
+} while (0)
+
+static __always_inline
+int sk_wrap_getsockopt_iter(struct sock *sk, int level, int optname, sockptr_t optval, sockptr_t optlen,
+	    int (*getsockopt_iter)(struct sock *sk, int level, int optname, struct iov_iter *optval_iter))
+{
+	__generic_wrap_getsockopt_iter(sk, level, optname, optval, optlen, getsockopt_iter);
+}
+
+static __always_inline
+int sock_wrap_getsockopt_iter(struct socket *sock, int level, int optname, sockptr_t optval, sockptr_t optlen,
+	    int (*getsockopt_iter)(struct socket *sock, int level, int optname, struct iov_iter *optval_iter))
+{
+	__generic_wrap_getsockopt_iter(sock, level, optname, optval, optlen, getsockopt_iter);
+}
+
  int sk_getsockopt(struct sock *sk, int level, int optname,
  		  sockptr_t optval, sockptr_t optlen);
  int sock_gettstamp(struct socket *sock, void __user *userstamp,
diff --git a/net/core/sock.c b/net/core/sock.c
index 323892066def..61625060e724 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -3857,9 +3857,17 @@ int sock_common_getsockopt(struct socket *sock, int level, int optname,
  			   char __user *optval, int __user *optlen)
  {
  	struct sock *sk = sock->sk;
-
  	/* IPV6_ADDRFORM can change sk->sk_prot under us. */
-	return READ_ONCE(sk->sk_prot)->getsockopt(sk, level, optname, optval, optlen);
+	struct proto *prot = READ_ONCE(sk->sk_prot);
+
+	if (prot->getsockopt_iter) {
+		return sk_wrap_getsockopt_iter(sk, level, optname,
+					       USER_SOCKPTR(optval),
+					       USER_SOCKPTR(optlen),
+					       prot->getsockopt_iter);
+	}
+
+	return prot->getsockopt(sk, level, optname, optval, optlen);
  }
  EXPORT_SYMBOL(sock_common_getsockopt);

diff --git a/net/socket.c b/net/socket.c
index 9a0e720f0859..792cfd272611 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -2335,6 +2335,7 @@ int do_sock_getsockopt(struct socket *sock, bool compat, int level,
  {
  	int max_optlen __maybe_unused = 0;
  	const struct proto_ops *ops;
+	const struct proto *prot;
  	int err;

  	err = security_socket_getsockopt(sock, level, optname);
@@ -2345,12 +2346,19 @@ int do_sock_getsockopt(struct socket *sock, bool compat, int level,
  		copy_from_sockptr(&max_optlen, optlen, sizeof(int));

  	ops = READ_ONCE(sock->ops);
+	prot = READ_ONCE(sock->sk->sk_prot);
  	if (level == SOL_SOCKET) {
  		err = sk_getsockopt(sock->sk, level, optname, optval, optlen);
-	} else if (unlikely(!ops->getsockopt)) {
+	} else if (ops->getsockopt_iter) {
+		err = sock_wrap_getsockopt_iter(sock, level, optname, optval, optlen,
+					        ops->getsockopt_iter);
+	} else if (ops->getsockopt == sock_common_getsockopt && prot->getsockopt_iter) {
+		err = sk_wrap_getsockopt_iter(sock->sk, level, optname, optval, optlen,
+					      prot->getsockopt_iter);
+	} else if (unlikely(!ops->getsockopt || optlen.is_kernel)) {
  		err = -EOPNOTSUPP;
  	} else {
-		if (WARN_ONCE(optval.is_kernel || optlen.is_kernel,
+		if (WARN_ONCE(optval.is_kernel,
  			      "Invalid argument type"))
  			return -EOPNOTSUPP;
Stefan Metzmacher April 1, 2025, 1:48 p.m. UTC | #4
Am 01.04.25 um 15:37 schrieb Stefan Metzmacher:
> Am 01.04.25 um 10:19 schrieb Stefan Metzmacher:
>> Am 31.03.25 um 23:04 schrieb Stanislav Fomichev:
>>> On 03/31, Stefan Metzmacher wrote:
>>>> The motivation for this is to remove the SOL_SOCKET limitation
>>>> from io_uring_cmd_getsockopt().
>>>>
>>>> The reason for this limitation is that io_uring_cmd_getsockopt()
>>>> passes a kernel pointer as optlen to do_sock_getsockopt()
>>>> and can't reach the ops->getsockopt() path.
>>>>
>>>> The first idea would be to change the optval and optlen arguments
>>>> to the protocol specific hooks also to sockptr_t, as that
>>>> is already used for setsockopt() and also by do_sock_getsockopt()
>>>> sk_getsockopt() and BPF_CGROUP_RUN_PROG_GETSOCKOPT().
>>>>
>>>> But as Linus don't like 'sockptr_t' I used a different approach.
>>>>
>>>> @Linus, would that optlen_t approach fit better for you?
>>>
>>> [..]
>>>
>>>> Instead of passing the optlen as user or kernel pointer,
>>>> we only ever pass a kernel pointer and do the
>>>> translation from/to userspace in do_sock_getsockopt().
>>>
>>> At this point why not just fully embrace iov_iter? You have the size
>>> now + the user (or kernel) pointer. Might as well do
>>> s/sockptr_t/iov_iter/ conversion?
>>
>> I think that would only be possible if we introduce
>> proto[_ops].getsockopt_iter() and then convert the implementations
>> step by step. Doing it all in one go has a lot of potential to break
>> the uapi. I could try to convert things like socket, ip and tcp myself, but
>> the rest needs to be converted by the maintainer of the specific protocol,
>> as it needs to be tested. As there are crazy things happening in the existing
>> implementations, e.g. some getsockopt() implementations use optval as in and out
>> buffer.
>>
>> I first tried to convert both optval and optlen of getsockopt to sockptr_t,
>> and that showed that touching the optval part starts to get complex very soon,
>> see https://git.samba.org/?p=metze/linux/wip.git;a=commitdiff;h=141912166473bf8843ec6ace76dc9c6945adafd1
>> (note it didn't converted everything, I gave up after hitting
>> sctp_getsockopt_peer_addrs and sctp_getsockopt_local_addrs.
>> sctp_getsockopt_context, sctp_getsockopt_maxseg, sctp_getsockopt_associnfo and maybe
>> more are the ones also doing both copy_from_user and copy_to_user on optval)
>>
>> I come also across one implementation that returned -ERANGE because *optlen was
>> too short and put the required length into *optlen, which means the returned
>> *optlen is larger than the optval buffer given from userspace.
>>
>> Because of all these strange things I tried to do a minimal change
>> in order to get rid of the io_uring limitation and only converted
>> optlen and leave optval as is.
>>
>> In order to have a patchset that has a low risk to cause regressions.
>>
>> But as alternative introducing a prototype like this:
>>
>>          int (*getsockopt_iter)(struct socket *sock, int level, int optname,
>>                                 struct iov_iter *optval_iter);
>>
>> That returns a non-negative value which can be placed into *optlen
>> or negative value as error and *optlen will not be changed on error.
>> optval_iter will get direction ITER_DEST, so it can only be written to.
>>
>> Implementations could then opt in for the new interface and
>> allow do_sock_getsockopt() work also for the io_uring case,
>> while all others would still get -EOPNOTSUPP.
>>
>> So what should be the way to go?
> 
> Ok, I've added the infrastructure for getsockopt_iter, see below,
> but the first part I wanted to convert was
> tcp_ao_copy_mkts_to_user() and that also reads from userspace before
> writing.
> 
> So we could go with the optlen_t approach, or we need
> logic for ITER_BOTH or pass two iov_iters one with ITER_SRC and one
> with ITER_DEST...
> 
> So who wants to decide?

I just noticed that it's even possible in same cases
to pass in a short buffer to optval, but have a longer value in optlen,
hci_sock_getsockopt() with SOL_BLUETOOTH completely ignores optlen.

This makes it really hard to believe that trying to use iov_iter for this
is a good idea :-(

Any ideas beside just going with optlen_t?

metze
Breno Leitao April 1, 2025, 3:35 p.m. UTC | #5
On Tue, Apr 01, 2025 at 03:48:58PM +0200, Stefan Metzmacher wrote:
> Am 01.04.25 um 15:37 schrieb Stefan Metzmacher:
> > Am 01.04.25 um 10:19 schrieb Stefan Metzmacher:
> > > Am 31.03.25 um 23:04 schrieb Stanislav Fomichev:
> > > > On 03/31, Stefan Metzmacher wrote:
> > > > > The motivation for this is to remove the SOL_SOCKET limitation
> > > > > from io_uring_cmd_getsockopt().
> > > > > 
> > > > > The reason for this limitation is that io_uring_cmd_getsockopt()
> > > > > passes a kernel pointer as optlen to do_sock_getsockopt()
> > > > > and can't reach the ops->getsockopt() path.
> > > > > 
> > > > > The first idea would be to change the optval and optlen arguments
> > > > > to the protocol specific hooks also to sockptr_t, as that
> > > > > is already used for setsockopt() and also by do_sock_getsockopt()
> > > > > sk_getsockopt() and BPF_CGROUP_RUN_PROG_GETSOCKOPT().
> > > > > 
> > > > > But as Linus don't like 'sockptr_t' I used a different approach.
> > > > > 
> > > > > @Linus, would that optlen_t approach fit better for you?
> > > > 
> > > > [..]
> > > > 
> > > > > Instead of passing the optlen as user or kernel pointer,
> > > > > we only ever pass a kernel pointer and do the
> > > > > translation from/to userspace in do_sock_getsockopt().
> > > > 
> > > > At this point why not just fully embrace iov_iter? You have the size
> > > > now + the user (or kernel) pointer. Might as well do
> > > > s/sockptr_t/iov_iter/ conversion?
> > > 
> > > I think that would only be possible if we introduce
> > > proto[_ops].getsockopt_iter() and then convert the implementations
> > > step by step. Doing it all in one go has a lot of potential to break
> > > the uapi. I could try to convert things like socket, ip and tcp myself, but
> > > the rest needs to be converted by the maintainer of the specific protocol,
> > > as it needs to be tested. As there are crazy things happening in the existing
> > > implementations, e.g. some getsockopt() implementations use optval as in and out
> > > buffer.
> > > 
> > > I first tried to convert both optval and optlen of getsockopt to sockptr_t,
> > > and that showed that touching the optval part starts to get complex very soon,
> > > see https://git.samba.org/?p=metze/linux/wip.git;a=commitdiff;h=141912166473bf8843ec6ace76dc9c6945adafd1
> > > (note it didn't converted everything, I gave up after hitting
> > > sctp_getsockopt_peer_addrs and sctp_getsockopt_local_addrs.
> > > sctp_getsockopt_context, sctp_getsockopt_maxseg, sctp_getsockopt_associnfo and maybe
> > > more are the ones also doing both copy_from_user and copy_to_user on optval)
> > > 
> > > I come also across one implementation that returned -ERANGE because *optlen was
> > > too short and put the required length into *optlen, which means the returned
> > > *optlen is larger than the optval buffer given from userspace.
> > > 
> > > Because of all these strange things I tried to do a minimal change
> > > in order to get rid of the io_uring limitation and only converted
> > > optlen and leave optval as is.
> > > 
> > > In order to have a patchset that has a low risk to cause regressions.
> > > 
> > > But as alternative introducing a prototype like this:
> > > 
> > >          int (*getsockopt_iter)(struct socket *sock, int level, int optname,
> > >                                 struct iov_iter *optval_iter);
> > > 
> > > That returns a non-negative value which can be placed into *optlen
> > > or negative value as error and *optlen will not be changed on error.
> > > optval_iter will get direction ITER_DEST, so it can only be written to.
> > > 
> > > Implementations could then opt in for the new interface and
> > > allow do_sock_getsockopt() work also for the io_uring case,
> > > while all others would still get -EOPNOTSUPP.
> > > 
> > > So what should be the way to go?
> > 
> > Ok, I've added the infrastructure for getsockopt_iter, see below,
> > but the first part I wanted to convert was
> > tcp_ao_copy_mkts_to_user() and that also reads from userspace before
> > writing.
> > 
> > So we could go with the optlen_t approach, or we need
> > logic for ITER_BOTH or pass two iov_iters one with ITER_SRC and one
> > with ITER_DEST...
> > 
> > So who wants to decide?
> 
> I just noticed that it's even possible in same cases
> to pass in a short buffer to optval, but have a longer value in optlen,
> hci_sock_getsockopt() with SOL_BLUETOOTH completely ignores optlen.
> 
> This makes it really hard to believe that trying to use iov_iter for this
> is a good idea :-(

That was my finding as well a while ago, when I was planning to get the
__user pointers converted to iov_iter. There are some weird ways of
using optlen and optval, which makes them non-trivial to covert to
iov_iter.
Stanislav Fomichev April 1, 2025, 3:45 p.m. UTC | #6
On 04/01, Breno Leitao wrote:
> On Tue, Apr 01, 2025 at 03:48:58PM +0200, Stefan Metzmacher wrote:
> > Am 01.04.25 um 15:37 schrieb Stefan Metzmacher:
> > > Am 01.04.25 um 10:19 schrieb Stefan Metzmacher:
> > > > Am 31.03.25 um 23:04 schrieb Stanislav Fomichev:
> > > > > On 03/31, Stefan Metzmacher wrote:
> > > > > > The motivation for this is to remove the SOL_SOCKET limitation
> > > > > > from io_uring_cmd_getsockopt().
> > > > > > 
> > > > > > The reason for this limitation is that io_uring_cmd_getsockopt()
> > > > > > passes a kernel pointer as optlen to do_sock_getsockopt()
> > > > > > and can't reach the ops->getsockopt() path.
> > > > > > 
> > > > > > The first idea would be to change the optval and optlen arguments
> > > > > > to the protocol specific hooks also to sockptr_t, as that
> > > > > > is already used for setsockopt() and also by do_sock_getsockopt()
> > > > > > sk_getsockopt() and BPF_CGROUP_RUN_PROG_GETSOCKOPT().
> > > > > > 
> > > > > > But as Linus don't like 'sockptr_t' I used a different approach.
> > > > > > 
> > > > > > @Linus, would that optlen_t approach fit better for you?
> > > > > 
> > > > > [..]
> > > > > 
> > > > > > Instead of passing the optlen as user or kernel pointer,
> > > > > > we only ever pass a kernel pointer and do the
> > > > > > translation from/to userspace in do_sock_getsockopt().
> > > > > 
> > > > > At this point why not just fully embrace iov_iter? You have the size
> > > > > now + the user (or kernel) pointer. Might as well do
> > > > > s/sockptr_t/iov_iter/ conversion?
> > > > 
> > > > I think that would only be possible if we introduce
> > > > proto[_ops].getsockopt_iter() and then convert the implementations
> > > > step by step. Doing it all in one go has a lot of potential to break
> > > > the uapi. I could try to convert things like socket, ip and tcp myself, but
> > > > the rest needs to be converted by the maintainer of the specific protocol,
> > > > as it needs to be tested. As there are crazy things happening in the existing
> > > > implementations, e.g. some getsockopt() implementations use optval as in and out
> > > > buffer.
> > > > 
> > > > I first tried to convert both optval and optlen of getsockopt to sockptr_t,
> > > > and that showed that touching the optval part starts to get complex very soon,
> > > > see https://git.samba.org/?p=metze/linux/wip.git;a=commitdiff;h=141912166473bf8843ec6ace76dc9c6945adafd1
> > > > (note it didn't converted everything, I gave up after hitting
> > > > sctp_getsockopt_peer_addrs and sctp_getsockopt_local_addrs.
> > > > sctp_getsockopt_context, sctp_getsockopt_maxseg, sctp_getsockopt_associnfo and maybe
> > > > more are the ones also doing both copy_from_user and copy_to_user on optval)
> > > > 
> > > > I come also across one implementation that returned -ERANGE because *optlen was
> > > > too short and put the required length into *optlen, which means the returned
> > > > *optlen is larger than the optval buffer given from userspace.
> > > > 
> > > > Because of all these strange things I tried to do a minimal change
> > > > in order to get rid of the io_uring limitation and only converted
> > > > optlen and leave optval as is.
> > > > 
> > > > In order to have a patchset that has a low risk to cause regressions.
> > > > 
> > > > But as alternative introducing a prototype like this:
> > > > 
> > > >          int (*getsockopt_iter)(struct socket *sock, int level, int optname,
> > > >                                 struct iov_iter *optval_iter);
> > > > 
> > > > That returns a non-negative value which can be placed into *optlen
> > > > or negative value as error and *optlen will not be changed on error.
> > > > optval_iter will get direction ITER_DEST, so it can only be written to.
> > > > 
> > > > Implementations could then opt in for the new interface and
> > > > allow do_sock_getsockopt() work also for the io_uring case,
> > > > while all others would still get -EOPNOTSUPP.
> > > > 
> > > > So what should be the way to go?
> > > 
> > > Ok, I've added the infrastructure for getsockopt_iter, see below,
> > > but the first part I wanted to convert was
> > > tcp_ao_copy_mkts_to_user() and that also reads from userspace before
> > > writing.
> > > 
> > > So we could go with the optlen_t approach, or we need
> > > logic for ITER_BOTH or pass two iov_iters one with ITER_SRC and one
> > > with ITER_DEST...
> > > 
> > > So who wants to decide?
> > 
> > I just noticed that it's even possible in same cases
> > to pass in a short buffer to optval, but have a longer value in optlen,
> > hci_sock_getsockopt() with SOL_BLUETOOTH completely ignores optlen.
> > 
> > This makes it really hard to believe that trying to use iov_iter for this
> > is a good idea :-(
> 
> That was my finding as well a while ago, when I was planning to get the
> __user pointers converted to iov_iter. There are some weird ways of
> using optlen and optval, which makes them non-trivial to covert to
> iov_iter.

Can we ignore all non-ip/tcp/udp cases for now? This should cover +90%
of useful socket opts. See if there are any obvious problems with them
and if not, try converting. The rest we can cover separately when/if
needed.
Stefan Metzmacher April 1, 2025, 9:20 p.m. UTC | #7
Am 01.04.25 um 17:45 schrieb Stanislav Fomichev:
> On 04/01, Breno Leitao wrote:
>> On Tue, Apr 01, 2025 at 03:48:58PM +0200, Stefan Metzmacher wrote:
>>> Am 01.04.25 um 15:37 schrieb Stefan Metzmacher:
>>>> Am 01.04.25 um 10:19 schrieb Stefan Metzmacher:
>>>>> Am 31.03.25 um 23:04 schrieb Stanislav Fomichev:
>>>>>> On 03/31, Stefan Metzmacher wrote:
>>>>>>> The motivation for this is to remove the SOL_SOCKET limitation
>>>>>>> from io_uring_cmd_getsockopt().
>>>>>>>
>>>>>>> The reason for this limitation is that io_uring_cmd_getsockopt()
>>>>>>> passes a kernel pointer as optlen to do_sock_getsockopt()
>>>>>>> and can't reach the ops->getsockopt() path.
>>>>>>>
>>>>>>> The first idea would be to change the optval and optlen arguments
>>>>>>> to the protocol specific hooks also to sockptr_t, as that
>>>>>>> is already used for setsockopt() and also by do_sock_getsockopt()
>>>>>>> sk_getsockopt() and BPF_CGROUP_RUN_PROG_GETSOCKOPT().
>>>>>>>
>>>>>>> But as Linus don't like 'sockptr_t' I used a different approach.
>>>>>>>
>>>>>>> @Linus, would that optlen_t approach fit better for you?
>>>>>>
>>>>>> [..]
>>>>>>
>>>>>>> Instead of passing the optlen as user or kernel pointer,
>>>>>>> we only ever pass a kernel pointer and do the
>>>>>>> translation from/to userspace in do_sock_getsockopt().
>>>>>>
>>>>>> At this point why not just fully embrace iov_iter? You have the size
>>>>>> now + the user (or kernel) pointer. Might as well do
>>>>>> s/sockptr_t/iov_iter/ conversion?
>>>>>
>>>>> I think that would only be possible if we introduce
>>>>> proto[_ops].getsockopt_iter() and then convert the implementations
>>>>> step by step. Doing it all in one go has a lot of potential to break
>>>>> the uapi. I could try to convert things like socket, ip and tcp myself, but
>>>>> the rest needs to be converted by the maintainer of the specific protocol,
>>>>> as it needs to be tested. As there are crazy things happening in the existing
>>>>> implementations, e.g. some getsockopt() implementations use optval as in and out
>>>>> buffer.
>>>>>
>>>>> I first tried to convert both optval and optlen of getsockopt to sockptr_t,
>>>>> and that showed that touching the optval part starts to get complex very soon,
>>>>> see https://git.samba.org/?p=metze/linux/wip.git;a=commitdiff;h=141912166473bf8843ec6ace76dc9c6945adafd1
>>>>> (note it didn't converted everything, I gave up after hitting
>>>>> sctp_getsockopt_peer_addrs and sctp_getsockopt_local_addrs.
>>>>> sctp_getsockopt_context, sctp_getsockopt_maxseg, sctp_getsockopt_associnfo and maybe
>>>>> more are the ones also doing both copy_from_user and copy_to_user on optval)
>>>>>
>>>>> I come also across one implementation that returned -ERANGE because *optlen was
>>>>> too short and put the required length into *optlen, which means the returned
>>>>> *optlen is larger than the optval buffer given from userspace.
>>>>>
>>>>> Because of all these strange things I tried to do a minimal change
>>>>> in order to get rid of the io_uring limitation and only converted
>>>>> optlen and leave optval as is.
>>>>>
>>>>> In order to have a patchset that has a low risk to cause regressions.
>>>>>
>>>>> But as alternative introducing a prototype like this:
>>>>>
>>>>>           int (*getsockopt_iter)(struct socket *sock, int level, int optname,
>>>>>                                  struct iov_iter *optval_iter);
>>>>>
>>>>> That returns a non-negative value which can be placed into *optlen
>>>>> or negative value as error and *optlen will not be changed on error.
>>>>> optval_iter will get direction ITER_DEST, so it can only be written to.
>>>>>
>>>>> Implementations could then opt in for the new interface and
>>>>> allow do_sock_getsockopt() work also for the io_uring case,
>>>>> while all others would still get -EOPNOTSUPP.
>>>>>
>>>>> So what should be the way to go?
>>>>
>>>> Ok, I've added the infrastructure for getsockopt_iter, see below,
>>>> but the first part I wanted to convert was
>>>> tcp_ao_copy_mkts_to_user() and that also reads from userspace before
>>>> writing.
>>>>
>>>> So we could go with the optlen_t approach, or we need
>>>> logic for ITER_BOTH or pass two iov_iters one with ITER_SRC and one
>>>> with ITER_DEST...
>>>>
>>>> So who wants to decide?
>>>
>>> I just noticed that it's even possible in same cases
>>> to pass in a short buffer to optval, but have a longer value in optlen,
>>> hci_sock_getsockopt() with SOL_BLUETOOTH completely ignores optlen.
>>>
>>> This makes it really hard to believe that trying to use iov_iter for this
>>> is a good idea :-(
>>
>> That was my finding as well a while ago, when I was planning to get the
>> __user pointers converted to iov_iter. There are some weird ways of
>> using optlen and optval, which makes them non-trivial to covert to
>> iov_iter.
> 
> Can we ignore all non-ip/tcp/udp cases for now? This should cover +90%
> of useful socket opts. See if there are any obvious problems with them
> and if not, try converting. The rest we can cover separately when/if
> needed.

That's what I tried, but it fails with
tcp_getsockopt ->
    do_tcp_getsockopt ->
      tcp_ao_get_mkts ->
         tcp_ao_copy_mkts_to_user ->
            copy_struct_from_sockptr
      tcp_ao_get_sock_info ->
         copy_struct_from_sockptr

That's not possible with a ITER_DEST iov_iter.

metze
Stanislav Fomichev April 1, 2025, 10:04 p.m. UTC | #8
On 04/01, Stefan Metzmacher wrote:
> Am 01.04.25 um 17:45 schrieb Stanislav Fomichev:
> > On 04/01, Breno Leitao wrote:
> > > On Tue, Apr 01, 2025 at 03:48:58PM +0200, Stefan Metzmacher wrote:
> > > > Am 01.04.25 um 15:37 schrieb Stefan Metzmacher:
> > > > > Am 01.04.25 um 10:19 schrieb Stefan Metzmacher:
> > > > > > Am 31.03.25 um 23:04 schrieb Stanislav Fomichev:
> > > > > > > On 03/31, Stefan Metzmacher wrote:
> > > > > > > > The motivation for this is to remove the SOL_SOCKET limitation
> > > > > > > > from io_uring_cmd_getsockopt().
> > > > > > > > 
> > > > > > > > The reason for this limitation is that io_uring_cmd_getsockopt()
> > > > > > > > passes a kernel pointer as optlen to do_sock_getsockopt()
> > > > > > > > and can't reach the ops->getsockopt() path.
> > > > > > > > 
> > > > > > > > The first idea would be to change the optval and optlen arguments
> > > > > > > > to the protocol specific hooks also to sockptr_t, as that
> > > > > > > > is already used for setsockopt() and also by do_sock_getsockopt()
> > > > > > > > sk_getsockopt() and BPF_CGROUP_RUN_PROG_GETSOCKOPT().
> > > > > > > > 
> > > > > > > > But as Linus don't like 'sockptr_t' I used a different approach.
> > > > > > > > 
> > > > > > > > @Linus, would that optlen_t approach fit better for you?
> > > > > > > 
> > > > > > > [..]
> > > > > > > 
> > > > > > > > Instead of passing the optlen as user or kernel pointer,
> > > > > > > > we only ever pass a kernel pointer and do the
> > > > > > > > translation from/to userspace in do_sock_getsockopt().
> > > > > > > 
> > > > > > > At this point why not just fully embrace iov_iter? You have the size
> > > > > > > now + the user (or kernel) pointer. Might as well do
> > > > > > > s/sockptr_t/iov_iter/ conversion?
> > > > > > 
> > > > > > I think that would only be possible if we introduce
> > > > > > proto[_ops].getsockopt_iter() and then convert the implementations
> > > > > > step by step. Doing it all in one go has a lot of potential to break
> > > > > > the uapi. I could try to convert things like socket, ip and tcp myself, but
> > > > > > the rest needs to be converted by the maintainer of the specific protocol,
> > > > > > as it needs to be tested. As there are crazy things happening in the existing
> > > > > > implementations, e.g. some getsockopt() implementations use optval as in and out
> > > > > > buffer.
> > > > > > 
> > > > > > I first tried to convert both optval and optlen of getsockopt to sockptr_t,
> > > > > > and that showed that touching the optval part starts to get complex very soon,
> > > > > > see https://git.samba.org/?p=metze/linux/wip.git;a=commitdiff;h=141912166473bf8843ec6ace76dc9c6945adafd1
> > > > > > (note it didn't converted everything, I gave up after hitting
> > > > > > sctp_getsockopt_peer_addrs and sctp_getsockopt_local_addrs.
> > > > > > sctp_getsockopt_context, sctp_getsockopt_maxseg, sctp_getsockopt_associnfo and maybe
> > > > > > more are the ones also doing both copy_from_user and copy_to_user on optval)
> > > > > > 
> > > > > > I come also across one implementation that returned -ERANGE because *optlen was
> > > > > > too short and put the required length into *optlen, which means the returned
> > > > > > *optlen is larger than the optval buffer given from userspace.
> > > > > > 
> > > > > > Because of all these strange things I tried to do a minimal change
> > > > > > in order to get rid of the io_uring limitation and only converted
> > > > > > optlen and leave optval as is.
> > > > > > 
> > > > > > In order to have a patchset that has a low risk to cause regressions.
> > > > > > 
> > > > > > But as alternative introducing a prototype like this:
> > > > > > 
> > > > > >           int (*getsockopt_iter)(struct socket *sock, int level, int optname,
> > > > > >                                  struct iov_iter *optval_iter);
> > > > > > 
> > > > > > That returns a non-negative value which can be placed into *optlen
> > > > > > or negative value as error and *optlen will not be changed on error.
> > > > > > optval_iter will get direction ITER_DEST, so it can only be written to.
> > > > > > 
> > > > > > Implementations could then opt in for the new interface and
> > > > > > allow do_sock_getsockopt() work also for the io_uring case,
> > > > > > while all others would still get -EOPNOTSUPP.
> > > > > > 
> > > > > > So what should be the way to go?
> > > > > 
> > > > > Ok, I've added the infrastructure for getsockopt_iter, see below,
> > > > > but the first part I wanted to convert was
> > > > > tcp_ao_copy_mkts_to_user() and that also reads from userspace before
> > > > > writing.
> > > > > 
> > > > > So we could go with the optlen_t approach, or we need
> > > > > logic for ITER_BOTH or pass two iov_iters one with ITER_SRC and one
> > > > > with ITER_DEST...
> > > > > 
> > > > > So who wants to decide?
> > > > 
> > > > I just noticed that it's even possible in same cases
> > > > to pass in a short buffer to optval, but have a longer value in optlen,
> > > > hci_sock_getsockopt() with SOL_BLUETOOTH completely ignores optlen.
> > > > 
> > > > This makes it really hard to believe that trying to use iov_iter for this
> > > > is a good idea :-(
> > > 
> > > That was my finding as well a while ago, when I was planning to get the
> > > __user pointers converted to iov_iter. There are some weird ways of
> > > using optlen and optval, which makes them non-trivial to covert to
> > > iov_iter.
> > 
> > Can we ignore all non-ip/tcp/udp cases for now? This should cover +90%
> > of useful socket opts. See if there are any obvious problems with them
> > and if not, try converting. The rest we can cover separately when/if
> > needed.
> 
> That's what I tried, but it fails with
> tcp_getsockopt ->
>    do_tcp_getsockopt ->
>      tcp_ao_get_mkts ->
>         tcp_ao_copy_mkts_to_user ->
>            copy_struct_from_sockptr
>      tcp_ao_get_sock_info ->
>         copy_struct_from_sockptr
> 
> That's not possible with a ITER_DEST iov_iter.
> 
> metze

Can we create two iterators over the same memory? One for ITER_SOURCE and
another for ITER_DEST. And then make getsockopt_iter accept optval_in and
optval_out. We can also use optval_out position (iov_offset) as optlen output
value. Don't see why it won't work, but I agree that's gonna be a messy
conversion so let's see if someone else has better suggestions.
Stefan Metzmacher April 1, 2025, 10:53 p.m. UTC | #9
Am 02.04.25 um 00:04 schrieb Stanislav Fomichev:
> On 04/01, Stefan Metzmacher wrote:
>> Am 01.04.25 um 17:45 schrieb Stanislav Fomichev:
>>> On 04/01, Breno Leitao wrote:
>>>> On Tue, Apr 01, 2025 at 03:48:58PM +0200, Stefan Metzmacher wrote:
>>>>> Am 01.04.25 um 15:37 schrieb Stefan Metzmacher:
>>>>>> Am 01.04.25 um 10:19 schrieb Stefan Metzmacher:
>>>>>>> Am 31.03.25 um 23:04 schrieb Stanislav Fomichev:
>>>>>>>> On 03/31, Stefan Metzmacher wrote:
>>>>>>>>> The motivation for this is to remove the SOL_SOCKET limitation
>>>>>>>>> from io_uring_cmd_getsockopt().
>>>>>>>>>
>>>>>>>>> The reason for this limitation is that io_uring_cmd_getsockopt()
>>>>>>>>> passes a kernel pointer as optlen to do_sock_getsockopt()
>>>>>>>>> and can't reach the ops->getsockopt() path.
>>>>>>>>>
>>>>>>>>> The first idea would be to change the optval and optlen arguments
>>>>>>>>> to the protocol specific hooks also to sockptr_t, as that
>>>>>>>>> is already used for setsockopt() and also by do_sock_getsockopt()
>>>>>>>>> sk_getsockopt() and BPF_CGROUP_RUN_PROG_GETSOCKOPT().
>>>>>>>>>
>>>>>>>>> But as Linus don't like 'sockptr_t' I used a different approach.
>>>>>>>>>
>>>>>>>>> @Linus, would that optlen_t approach fit better for you?
>>>>>>>>
>>>>>>>> [..]
>>>>>>>>
>>>>>>>>> Instead of passing the optlen as user or kernel pointer,
>>>>>>>>> we only ever pass a kernel pointer and do the
>>>>>>>>> translation from/to userspace in do_sock_getsockopt().
>>>>>>>>
>>>>>>>> At this point why not just fully embrace iov_iter? You have the size
>>>>>>>> now + the user (or kernel) pointer. Might as well do
>>>>>>>> s/sockptr_t/iov_iter/ conversion?
>>>>>>>
>>>>>>> I think that would only be possible if we introduce
>>>>>>> proto[_ops].getsockopt_iter() and then convert the implementations
>>>>>>> step by step. Doing it all in one go has a lot of potential to break
>>>>>>> the uapi. I could try to convert things like socket, ip and tcp myself, but
>>>>>>> the rest needs to be converted by the maintainer of the specific protocol,
>>>>>>> as it needs to be tested. As there are crazy things happening in the existing
>>>>>>> implementations, e.g. some getsockopt() implementations use optval as in and out
>>>>>>> buffer.
>>>>>>>
>>>>>>> I first tried to convert both optval and optlen of getsockopt to sockptr_t,
>>>>>>> and that showed that touching the optval part starts to get complex very soon,
>>>>>>> see https://git.samba.org/?p=metze/linux/wip.git;a=commitdiff;h=141912166473bf8843ec6ace76dc9c6945adafd1
>>>>>>> (note it didn't converted everything, I gave up after hitting
>>>>>>> sctp_getsockopt_peer_addrs and sctp_getsockopt_local_addrs.
>>>>>>> sctp_getsockopt_context, sctp_getsockopt_maxseg, sctp_getsockopt_associnfo and maybe
>>>>>>> more are the ones also doing both copy_from_user and copy_to_user on optval)
>>>>>>>
>>>>>>> I come also across one implementation that returned -ERANGE because *optlen was
>>>>>>> too short and put the required length into *optlen, which means the returned
>>>>>>> *optlen is larger than the optval buffer given from userspace.
>>>>>>>
>>>>>>> Because of all these strange things I tried to do a minimal change
>>>>>>> in order to get rid of the io_uring limitation and only converted
>>>>>>> optlen and leave optval as is.
>>>>>>>
>>>>>>> In order to have a patchset that has a low risk to cause regressions.
>>>>>>>
>>>>>>> But as alternative introducing a prototype like this:
>>>>>>>
>>>>>>>            int (*getsockopt_iter)(struct socket *sock, int level, int optname,
>>>>>>>                                   struct iov_iter *optval_iter);
>>>>>>>
>>>>>>> That returns a non-negative value which can be placed into *optlen
>>>>>>> or negative value as error and *optlen will not be changed on error.
>>>>>>> optval_iter will get direction ITER_DEST, so it can only be written to.
>>>>>>>
>>>>>>> Implementations could then opt in for the new interface and
>>>>>>> allow do_sock_getsockopt() work also for the io_uring case,
>>>>>>> while all others would still get -EOPNOTSUPP.
>>>>>>>
>>>>>>> So what should be the way to go?
>>>>>>
>>>>>> Ok, I've added the infrastructure for getsockopt_iter, see below,
>>>>>> but the first part I wanted to convert was
>>>>>> tcp_ao_copy_mkts_to_user() and that also reads from userspace before
>>>>>> writing.
>>>>>>
>>>>>> So we could go with the optlen_t approach, or we need
>>>>>> logic for ITER_BOTH or pass two iov_iters one with ITER_SRC and one
>>>>>> with ITER_DEST...
>>>>>>
>>>>>> So who wants to decide?
>>>>>
>>>>> I just noticed that it's even possible in same cases
>>>>> to pass in a short buffer to optval, but have a longer value in optlen,
>>>>> hci_sock_getsockopt() with SOL_BLUETOOTH completely ignores optlen.
>>>>>
>>>>> This makes it really hard to believe that trying to use iov_iter for this
>>>>> is a good idea :-(
>>>>
>>>> That was my finding as well a while ago, when I was planning to get the
>>>> __user pointers converted to iov_iter. There are some weird ways of
>>>> using optlen and optval, which makes them non-trivial to covert to
>>>> iov_iter.
>>>
>>> Can we ignore all non-ip/tcp/udp cases for now? This should cover +90%
>>> of useful socket opts. See if there are any obvious problems with them
>>> and if not, try converting. The rest we can cover separately when/if
>>> needed.
>>
>> That's what I tried, but it fails with
>> tcp_getsockopt ->
>>     do_tcp_getsockopt ->
>>       tcp_ao_get_mkts ->
>>          tcp_ao_copy_mkts_to_user ->
>>             copy_struct_from_sockptr
>>       tcp_ao_get_sock_info ->
>>          copy_struct_from_sockptr
>>
>> That's not possible with a ITER_DEST iov_iter.
>>
>> metze
> 
> Can we create two iterators over the same memory? One for ITER_SOURCE and
> another for ITER_DEST. And then make getsockopt_iter accept optval_in and
> optval_out. We can also use optval_out position (iov_offset) as optlen output
> value. Don't see why it won't work, but I agree that's gonna be a messy
> conversion so let's see if someone else has better suggestions.

Yes, that might work, but it would be good to get some feedback
if this would be the way to go:

           int (*getsockopt_iter)(struct socket *sock,
				 int level, int optname,
				 struct iov_iter *optval_in,
				 struct iov_iter *optval_out);

And *optlen = optval_out->iov_offset;

Any objection or better ideas? Linus would that be what you had in mind?

Thanks!
metze
Linus Torvalds April 2, 2025, 12:40 a.m. UTC | #10
"

On Mon, 31 Mar 2025 at 13:11, Stefan Metzmacher <metze@samba.org> wrote:
>
> But as Linus don't like 'sockptr_t' I used a different approach.

So the sockptr_t thing has already happened. I hate it, and I think
it's ugly as hell, but it is what it is.

I think it's a complete hack and having that "kernel or user" pointer
flag is disgusting.

Making things worse, the naming is disgusting too, talking about some
random "socket pointer", when it has absolutely nothing to do with
socket, and isn't even a pointer. It's something else.

It's literally called "socket" not because it has anything to do with
sockets, but because it's a socket-specific hack that isn't acceptable
anywhere else in the kernel.

So that "socket" part of the name is literally shorthand for "only
sockets are disgusting enough to use this, and nobody else should ever
touch this crap".

At least so far that part has mostly worked, even if there's some
"sockptr_t" use in the crypto code. I didn't look closer, because I
didn't want to lose my lunch.

I don't understand why the networking code uses that thing.

If you have a "fat pointer", you should damn well make it have the
size of the area too, and do things *right*.

Instead of doing what sockptr_t does, which is a complete hack to just
pass a kernel/user flag, and then passes the length *separately*
because the socket code couldn't be arsed to do the right thing.

So I do still think "sockptr_t" should die.

As Stanislav says, if you actually want that "user or kernel" thing,
just use an "iov_iter".

No, an "iov_iter" isn't exactly a pretty thing either, but at least
it's the standard way to say "this pointer can have multiple different
kinds of sources".

And it keeps the size of the thing it points to around, so it's at
least a fat pointer with proper ranges, even if it isn't exactly "type
safe" (yes, it's type safe in the sense that it stays as a "iov_iter",
but it's still basically a "random pointer").

> @Linus, would that optlen_t approach fit better for you?

The optlen_t thing is slightly better mainly because it's more
type-safe. At least it's not a "random misnamed
user-or-kernel-pointer" thing where the name is about how nothing else
is so broken as to use it.

So it's better because it's more limited, and it's better in that at
least it has a type-safe pointer rather than a "void *" with no size
or type associated with it.

That said, I don't think it's exactly great.

It's just another case of "networking can't just do it right, and uses
a random hack with special flag values".

So I do think that it would be better to actually get rid of
"sockptr_t optval, unsigned int optlen" ENTIRELY, and replace that
with iov_iter and just make networking bite the bullet and do the
RightThing(tm).

In fact, to make it *really* typesafe, it might be a good idea to wrap
the iov_iter in another struct, something like

   typedef struct sockopt {
        struct iov_iter iter;
   } sockopt_t;

and make the networking functions make the typing very clear, and end
up with an interface something like

   int do_tcp_setsockopt(struct sock *sk,
                     int level, int optname,
                     sockopt_t *val);

where that "sockopt_t *val" replaces not just the "sockptr_t optval",
but also the "unsigned int optlen" thing.

And no, I didn't look at how much churn that would be. Probably a lot.
Maybe more than people are willing to do - even if I think some of it
could be automated with coccinelle or whatever.

                Linus
David Laight April 2, 2025, 12:29 p.m. UTC | #11
On Wed, 2 Apr 2025 00:53:58 +0200
Stefan Metzmacher <metze@samba.org> wrote:

> Am 02.04.25 um 00:04 schrieb Stanislav Fomichev:
> > On 04/01, Stefan Metzmacher wrote:  
> >> Am 01.04.25 um 17:45 schrieb Stanislav Fomichev:  
> >>> On 04/01, Breno Leitao wrote:  
> >>>> On Tue, Apr 01, 2025 at 03:48:58PM +0200, Stefan Metzmacher wrote:  
> >>>>> Am 01.04.25 um 15:37 schrieb Stefan Metzmacher:  
> >>>>>> Am 01.04.25 um 10:19 schrieb Stefan Metzmacher:  
> >>>>>>> Am 31.03.25 um 23:04 schrieb Stanislav Fomichev:  
> >>>>>>>> On 03/31, Stefan Metzmacher wrote:  
> >>>>>>>>> The motivation for this is to remove the SOL_SOCKET limitation
> >>>>>>>>> from io_uring_cmd_getsockopt().
> >>>>>>>>>
> >>>>>>>>> The reason for this limitation is that io_uring_cmd_getsockopt()
> >>>>>>>>> passes a kernel pointer as optlen to do_sock_getsockopt()
> >>>>>>>>> and can't reach the ops->getsockopt() path.
> >>>>>>>>>
> >>>>>>>>> The first idea would be to change the optval and optlen arguments
> >>>>>>>>> to the protocol specific hooks also to sockptr_t, as that
> >>>>>>>>> is already used for setsockopt() and also by do_sock_getsockopt()
> >>>>>>>>> sk_getsockopt() and BPF_CGROUP_RUN_PROG_GETSOCKOPT().
> >>>>>>>>>
> >>>>>>>>> But as Linus don't like 'sockptr_t' I used a different approach.
> >>>>>>>>>
> >>>>>>>>> @Linus, would that optlen_t approach fit better for you?  
> >>>>>>>>
> >>>>>>>> [..]
> >>>>>>>>  
> >>>>>>>>> Instead of passing the optlen as user or kernel pointer,
> >>>>>>>>> we only ever pass a kernel pointer and do the
> >>>>>>>>> translation from/to userspace in do_sock_getsockopt().  
> >>>>>>>>
> >>>>>>>> At this point why not just fully embrace iov_iter? You have the size
> >>>>>>>> now + the user (or kernel) pointer. Might as well do
> >>>>>>>> s/sockptr_t/iov_iter/ conversion?  
> >>>>>>>
> >>>>>>> I think that would only be possible if we introduce
> >>>>>>> proto[_ops].getsockopt_iter() and then convert the implementations
> >>>>>>> step by step. Doing it all in one go has a lot of potential to break
> >>>>>>> the uapi. I could try to convert things like socket, ip and tcp myself, but
> >>>>>>> the rest needs to be converted by the maintainer of the specific protocol,
> >>>>>>> as it needs to be tested. As there are crazy things happening in the existing
> >>>>>>> implementations, e.g. some getsockopt() implementations use optval as in and out
> >>>>>>> buffer.
> >>>>>>>
> >>>>>>> I first tried to convert both optval and optlen of getsockopt to sockptr_t,
> >>>>>>> and that showed that touching the optval part starts to get complex very soon,
> >>>>>>> see https://git.samba.org/?p=metze/linux/wip.git;a=commitdiff;h=141912166473bf8843ec6ace76dc9c6945adafd1
> >>>>>>> (note it didn't converted everything, I gave up after hitting
> >>>>>>> sctp_getsockopt_peer_addrs and sctp_getsockopt_local_addrs.
> >>>>>>> sctp_getsockopt_context, sctp_getsockopt_maxseg, sctp_getsockopt_associnfo and maybe
> >>>>>>> more are the ones also doing both copy_from_user and copy_to_user on optval)
> >>>>>>>
> >>>>>>> I come also across one implementation that returned -ERANGE because *optlen was
> >>>>>>> too short and put the required length into *optlen, which means the returned
> >>>>>>> *optlen is larger than the optval buffer given from userspace.
> >>>>>>>
> >>>>>>> Because of all these strange things I tried to do a minimal change
> >>>>>>> in order to get rid of the io_uring limitation and only converted
> >>>>>>> optlen and leave optval as is.
> >>>>>>>
> >>>>>>> In order to have a patchset that has a low risk to cause regressions.
> >>>>>>>
> >>>>>>> But as alternative introducing a prototype like this:
> >>>>>>>
> >>>>>>>            int (*getsockopt_iter)(struct socket *sock, int level, int optname,
> >>>>>>>                                   struct iov_iter *optval_iter);
> >>>>>>>
> >>>>>>> That returns a non-negative value which can be placed into *optlen
> >>>>>>> or negative value as error and *optlen will not be changed on error.
> >>>>>>> optval_iter will get direction ITER_DEST, so it can only be written to.
> >>>>>>>
> >>>>>>> Implementations could then opt in for the new interface and
> >>>>>>> allow do_sock_getsockopt() work also for the io_uring case,
> >>>>>>> while all others would still get -EOPNOTSUPP.
> >>>>>>>
> >>>>>>> So what should be the way to go?  
> >>>>>>
> >>>>>> Ok, I've added the infrastructure for getsockopt_iter, see below,
> >>>>>> but the first part I wanted to convert was
> >>>>>> tcp_ao_copy_mkts_to_user() and that also reads from userspace before
> >>>>>> writing.
> >>>>>>
> >>>>>> So we could go with the optlen_t approach, or we need
> >>>>>> logic for ITER_BOTH or pass two iov_iters one with ITER_SRC and one
> >>>>>> with ITER_DEST...
> >>>>>>
> >>>>>> So who wants to decide?  
> >>>>>
> >>>>> I just noticed that it's even possible in same cases
> >>>>> to pass in a short buffer to optval, but have a longer value in optlen,
> >>>>> hci_sock_getsockopt() with SOL_BLUETOOTH completely ignores optlen.
> >>>>>
> >>>>> This makes it really hard to believe that trying to use iov_iter for this
> >>>>> is a good idea :-(  
> >>>>
> >>>> That was my finding as well a while ago, when I was planning to get the
> >>>> __user pointers converted to iov_iter. There are some weird ways of
> >>>> using optlen and optval, which makes them non-trivial to covert to
> >>>> iov_iter.  
> >>>
> >>> Can we ignore all non-ip/tcp/udp cases for now? This should cover +90%
> >>> of useful socket opts. See if there are any obvious problems with them
> >>> and if not, try converting. The rest we can cover separately when/if
> >>> needed.  
> >>
> >> That's what I tried, but it fails with
> >> tcp_getsockopt ->
> >>     do_tcp_getsockopt ->
> >>       tcp_ao_get_mkts ->
> >>          tcp_ao_copy_mkts_to_user ->
> >>             copy_struct_from_sockptr
> >>       tcp_ao_get_sock_info ->
> >>          copy_struct_from_sockptr
> >>
> >> That's not possible with a ITER_DEST iov_iter.
> >>
> >> metze  
> > 
> > Can we create two iterators over the same memory? One for ITER_SOURCE and
> > another for ITER_DEST. And then make getsockopt_iter accept optval_in and
> > optval_out. We can also use optval_out position (iov_offset) as optlen output
> > value. Don't see why it won't work, but I agree that's gonna be a messy
> > conversion so let's see if someone else has better suggestions.  
> 
> Yes, that might work, but it would be good to get some feedback
> if this would be the way to go:
> 
>            int (*getsockopt_iter)(struct socket *sock,
> 				 int level, int optname,
> 				 struct iov_iter *optval_in,
> 				 struct iov_iter *optval_out);
> 
> And *optlen = optval_out->iov_offset;
> 
> Any objection or better ideas? Linus would that be what you had in mind?

I'd worry about performance - yes I know 'iter' are used elsewhere but...
Also look at the SCTP code.

How do you handle code that wants to return an updated length (often longer
than the one provided) and an error code (eg ERRSIZE or similar).

There is also a very strange use (I think it is a sockopt rather than an ioctl)
where the buffer length the application provides is only that of the header.
The actual buffer length is contained in the header.
The return length is the amount written into the full buffer.

	David
David Laight April 2, 2025, 12:35 p.m. UTC | #12
On Tue, 1 Apr 2025 17:40:19 -0700
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> "
> 
> On Mon, 31 Mar 2025 at 13:11, Stefan Metzmacher <metze@samba.org> wrote:
> >
> > But as Linus don't like 'sockptr_t' I used a different approach.  
> 
> So the sockptr_t thing has already happened. I hate it, and I think
> it's ugly as hell, but it is what it is.
> 
> I think it's a complete hack and having that "kernel or user" pointer
> flag is disgusting.

I have proposed a patch which replaced it with a structure.
That showed up some really hacky code in IIRC io_uring.

Using sockptr_t for the buffer was one thing, the generic code
can't copy the buffer to/from user because code lies about the length.

But using for the length is just brain-dead.
That is fixed size and can be copied from/to user by the wrapper.
The code bloat reduction will be significant.

	David
Stanislav Fomichev April 2, 2025, 2:19 p.m. UTC | #13
On 04/02, David Laight wrote:
> On Wed, 2 Apr 2025 00:53:58 +0200
> Stefan Metzmacher <metze@samba.org> wrote:
> 
> > Am 02.04.25 um 00:04 schrieb Stanislav Fomichev:
> > > On 04/01, Stefan Metzmacher wrote:  
> > >> Am 01.04.25 um 17:45 schrieb Stanislav Fomichev:  
> > >>> On 04/01, Breno Leitao wrote:  
> > >>>> On Tue, Apr 01, 2025 at 03:48:58PM +0200, Stefan Metzmacher wrote:  
> > >>>>> Am 01.04.25 um 15:37 schrieb Stefan Metzmacher:  
> > >>>>>> Am 01.04.25 um 10:19 schrieb Stefan Metzmacher:  
> > >>>>>>> Am 31.03.25 um 23:04 schrieb Stanislav Fomichev:  
> > >>>>>>>> On 03/31, Stefan Metzmacher wrote:  
> > >>>>>>>>> The motivation for this is to remove the SOL_SOCKET limitation
> > >>>>>>>>> from io_uring_cmd_getsockopt().
> > >>>>>>>>>
> > >>>>>>>>> The reason for this limitation is that io_uring_cmd_getsockopt()
> > >>>>>>>>> passes a kernel pointer as optlen to do_sock_getsockopt()
> > >>>>>>>>> and can't reach the ops->getsockopt() path.
> > >>>>>>>>>
> > >>>>>>>>> The first idea would be to change the optval and optlen arguments
> > >>>>>>>>> to the protocol specific hooks also to sockptr_t, as that
> > >>>>>>>>> is already used for setsockopt() and also by do_sock_getsockopt()
> > >>>>>>>>> sk_getsockopt() and BPF_CGROUP_RUN_PROG_GETSOCKOPT().
> > >>>>>>>>>
> > >>>>>>>>> But as Linus don't like 'sockptr_t' I used a different approach.
> > >>>>>>>>>
> > >>>>>>>>> @Linus, would that optlen_t approach fit better for you?  
> > >>>>>>>>
> > >>>>>>>> [..]
> > >>>>>>>>  
> > >>>>>>>>> Instead of passing the optlen as user or kernel pointer,
> > >>>>>>>>> we only ever pass a kernel pointer and do the
> > >>>>>>>>> translation from/to userspace in do_sock_getsockopt().  
> > >>>>>>>>
> > >>>>>>>> At this point why not just fully embrace iov_iter? You have the size
> > >>>>>>>> now + the user (or kernel) pointer. Might as well do
> > >>>>>>>> s/sockptr_t/iov_iter/ conversion?  
> > >>>>>>>
> > >>>>>>> I think that would only be possible if we introduce
> > >>>>>>> proto[_ops].getsockopt_iter() and then convert the implementations
> > >>>>>>> step by step. Doing it all in one go has a lot of potential to break
> > >>>>>>> the uapi. I could try to convert things like socket, ip and tcp myself, but
> > >>>>>>> the rest needs to be converted by the maintainer of the specific protocol,
> > >>>>>>> as it needs to be tested. As there are crazy things happening in the existing
> > >>>>>>> implementations, e.g. some getsockopt() implementations use optval as in and out
> > >>>>>>> buffer.
> > >>>>>>>
> > >>>>>>> I first tried to convert both optval and optlen of getsockopt to sockptr_t,
> > >>>>>>> and that showed that touching the optval part starts to get complex very soon,
> > >>>>>>> see https://git.samba.org/?p=metze/linux/wip.git;a=commitdiff;h=141912166473bf8843ec6ace76dc9c6945adafd1
> > >>>>>>> (note it didn't converted everything, I gave up after hitting
> > >>>>>>> sctp_getsockopt_peer_addrs and sctp_getsockopt_local_addrs.
> > >>>>>>> sctp_getsockopt_context, sctp_getsockopt_maxseg, sctp_getsockopt_associnfo and maybe
> > >>>>>>> more are the ones also doing both copy_from_user and copy_to_user on optval)
> > >>>>>>>
> > >>>>>>> I come also across one implementation that returned -ERANGE because *optlen was
> > >>>>>>> too short and put the required length into *optlen, which means the returned
> > >>>>>>> *optlen is larger than the optval buffer given from userspace.
> > >>>>>>>
> > >>>>>>> Because of all these strange things I tried to do a minimal change
> > >>>>>>> in order to get rid of the io_uring limitation and only converted
> > >>>>>>> optlen and leave optval as is.
> > >>>>>>>
> > >>>>>>> In order to have a patchset that has a low risk to cause regressions.
> > >>>>>>>
> > >>>>>>> But as alternative introducing a prototype like this:
> > >>>>>>>
> > >>>>>>>            int (*getsockopt_iter)(struct socket *sock, int level, int optname,
> > >>>>>>>                                   struct iov_iter *optval_iter);
> > >>>>>>>
> > >>>>>>> That returns a non-negative value which can be placed into *optlen
> > >>>>>>> or negative value as error and *optlen will not be changed on error.
> > >>>>>>> optval_iter will get direction ITER_DEST, so it can only be written to.
> > >>>>>>>
> > >>>>>>> Implementations could then opt in for the new interface and
> > >>>>>>> allow do_sock_getsockopt() work also for the io_uring case,
> > >>>>>>> while all others would still get -EOPNOTSUPP.
> > >>>>>>>
> > >>>>>>> So what should be the way to go?  
> > >>>>>>
> > >>>>>> Ok, I've added the infrastructure for getsockopt_iter, see below,
> > >>>>>> but the first part I wanted to convert was
> > >>>>>> tcp_ao_copy_mkts_to_user() and that also reads from userspace before
> > >>>>>> writing.
> > >>>>>>
> > >>>>>> So we could go with the optlen_t approach, or we need
> > >>>>>> logic for ITER_BOTH or pass two iov_iters one with ITER_SRC and one
> > >>>>>> with ITER_DEST...
> > >>>>>>
> > >>>>>> So who wants to decide?  
> > >>>>>
> > >>>>> I just noticed that it's even possible in same cases
> > >>>>> to pass in a short buffer to optval, but have a longer value in optlen,
> > >>>>> hci_sock_getsockopt() with SOL_BLUETOOTH completely ignores optlen.
> > >>>>>
> > >>>>> This makes it really hard to believe that trying to use iov_iter for this
> > >>>>> is a good idea :-(  
> > >>>>
> > >>>> That was my finding as well a while ago, when I was planning to get the
> > >>>> __user pointers converted to iov_iter. There are some weird ways of
> > >>>> using optlen and optval, which makes them non-trivial to covert to
> > >>>> iov_iter.  
> > >>>
> > >>> Can we ignore all non-ip/tcp/udp cases for now? This should cover +90%
> > >>> of useful socket opts. See if there are any obvious problems with them
> > >>> and if not, try converting. The rest we can cover separately when/if
> > >>> needed.  
> > >>
> > >> That's what I tried, but it fails with
> > >> tcp_getsockopt ->
> > >>     do_tcp_getsockopt ->
> > >>       tcp_ao_get_mkts ->
> > >>          tcp_ao_copy_mkts_to_user ->
> > >>             copy_struct_from_sockptr
> > >>       tcp_ao_get_sock_info ->
> > >>          copy_struct_from_sockptr
> > >>
> > >> That's not possible with a ITER_DEST iov_iter.
> > >>
> > >> metze  
> > > 
> > > Can we create two iterators over the same memory? One for ITER_SOURCE and
> > > another for ITER_DEST. And then make getsockopt_iter accept optval_in and
> > > optval_out. We can also use optval_out position (iov_offset) as optlen output
> > > value. Don't see why it won't work, but I agree that's gonna be a messy
> > > conversion so let's see if someone else has better suggestions.  
> > 
> > Yes, that might work, but it would be good to get some feedback
> > if this would be the way to go:
> > 
> >            int (*getsockopt_iter)(struct socket *sock,
> > 				 int level, int optname,
> > 				 struct iov_iter *optval_in,
> > 				 struct iov_iter *optval_out);
> > 
> > And *optlen = optval_out->iov_offset;
> > 
> > Any objection or better ideas? Linus would that be what you had in mind?
> 
> I'd worry about performance - yes I know 'iter' are used elsewhere but...
> Also look at the SCTP code.

Performance usually does not matter for set/getsockopts, there
are a few exceptions that I know (TCP_ZEROCOPY_RECEIVE) and maybe recent
devmem sockopts; we can special-case these if needed, or keep sockptr_t,
idk. I'm skeptical we can convert everything though, that's why the
suggestion to start with sk/ip/tcp/udp.

> How do you handle code that wants to return an updated length (often longer
> than the one provided) and an error code (eg ERRSIZE or similar).
>
> There is also a very strange use (I think it is a sockopt rather than an ioctl)
> where the buffer length the application provides is only that of the header.
> The actual buffer length is contained in the header.
> The return length is the amount written into the full buffer.

Let's discuss these special cases as they come up? Worst case these
places can always re-init iov_iter with a comment on why it is ok.
But I do agree in general that there are a few places that do wild
stuff.
David Laight April 2, 2025, 8:46 p.m. UTC | #14
On Wed, 2 Apr 2025 07:19:46 -0700
Stanislav Fomichev <stfomichev@gmail.com> wrote:

> On 04/02, David Laight wrote:
> > On Wed, 2 Apr 2025 00:53:58 +0200
> > Stefan Metzmacher <metze@samba.org> wrote:
> >   
> > > Am 02.04.25 um 00:04 schrieb Stanislav Fomichev:  
> > > > On 04/01, Stefan Metzmacher wrote:    
> > > >> Am 01.04.25 um 17:45 schrieb Stanislav Fomichev:    
> > > >>> On 04/01, Breno Leitao wrote:    
> > > >>>> On Tue, Apr 01, 2025 at 03:48:58PM +0200, Stefan Metzmacher wrote:    
> > > >>>>> Am 01.04.25 um 15:37 schrieb Stefan Metzmacher:    
> > > >>>>>> Am 01.04.25 um 10:19 schrieb Stefan Metzmacher:    
> > > >>>>>>> Am 31.03.25 um 23:04 schrieb Stanislav Fomichev:    
> > > >>>>>>>> On 03/31, Stefan Metzmacher wrote:    
> > > >>>>>>>>> The motivation for this is to remove the SOL_SOCKET limitation
> > > >>>>>>>>> from io_uring_cmd_getsockopt().
> > > >>>>>>>>>
> > > >>>>>>>>> The reason for this limitation is that io_uring_cmd_getsockopt()
> > > >>>>>>>>> passes a kernel pointer as optlen to do_sock_getsockopt()
> > > >>>>>>>>> and can't reach the ops->getsockopt() path.
> > > >>>>>>>>>
> > > >>>>>>>>> The first idea would be to change the optval and optlen arguments
> > > >>>>>>>>> to the protocol specific hooks also to sockptr_t, as that
> > > >>>>>>>>> is already used for setsockopt() and also by do_sock_getsockopt()
> > > >>>>>>>>> sk_getsockopt() and BPF_CGROUP_RUN_PROG_GETSOCKOPT().
> > > >>>>>>>>>
> > > >>>>>>>>> But as Linus don't like 'sockptr_t' I used a different approach.
> > > >>>>>>>>>
> > > >>>>>>>>> @Linus, would that optlen_t approach fit better for you?    
> > > >>>>>>>>
> > > >>>>>>>> [..]
> > > >>>>>>>>    
> > > >>>>>>>>> Instead of passing the optlen as user or kernel pointer,
> > > >>>>>>>>> we only ever pass a kernel pointer and do the
> > > >>>>>>>>> translation from/to userspace in do_sock_getsockopt().    
> > > >>>>>>>>
> > > >>>>>>>> At this point why not just fully embrace iov_iter? You have the size
> > > >>>>>>>> now + the user (or kernel) pointer. Might as well do
> > > >>>>>>>> s/sockptr_t/iov_iter/ conversion?    
> > > >>>>>>>
> > > >>>>>>> I think that would only be possible if we introduce
> > > >>>>>>> proto[_ops].getsockopt_iter() and then convert the implementations
> > > >>>>>>> step by step. Doing it all in one go has a lot of potential to break
> > > >>>>>>> the uapi. I could try to convert things like socket, ip and tcp myself, but
> > > >>>>>>> the rest needs to be converted by the maintainer of the specific protocol,
> > > >>>>>>> as it needs to be tested. As there are crazy things happening in the existing
> > > >>>>>>> implementations, e.g. some getsockopt() implementations use optval as in and out
> > > >>>>>>> buffer.
> > > >>>>>>>
> > > >>>>>>> I first tried to convert both optval and optlen of getsockopt to sockptr_t,
> > > >>>>>>> and that showed that touching the optval part starts to get complex very soon,
> > > >>>>>>> see https://git.samba.org/?p=metze/linux/wip.git;a=commitdiff;h=141912166473bf8843ec6ace76dc9c6945adafd1
> > > >>>>>>> (note it didn't converted everything, I gave up after hitting
> > > >>>>>>> sctp_getsockopt_peer_addrs and sctp_getsockopt_local_addrs.
> > > >>>>>>> sctp_getsockopt_context, sctp_getsockopt_maxseg, sctp_getsockopt_associnfo and maybe
> > > >>>>>>> more are the ones also doing both copy_from_user and copy_to_user on optval)
> > > >>>>>>>
> > > >>>>>>> I come also across one implementation that returned -ERANGE because *optlen was
> > > >>>>>>> too short and put the required length into *optlen, which means the returned
> > > >>>>>>> *optlen is larger than the optval buffer given from userspace.
> > > >>>>>>>
> > > >>>>>>> Because of all these strange things I tried to do a minimal change
> > > >>>>>>> in order to get rid of the io_uring limitation and only converted
> > > >>>>>>> optlen and leave optval as is.
> > > >>>>>>>
> > > >>>>>>> In order to have a patchset that has a low risk to cause regressions.
> > > >>>>>>>
> > > >>>>>>> But as alternative introducing a prototype like this:
> > > >>>>>>>
> > > >>>>>>>            int (*getsockopt_iter)(struct socket *sock, int level, int optname,
> > > >>>>>>>                                   struct iov_iter *optval_iter);
> > > >>>>>>>
> > > >>>>>>> That returns a non-negative value which can be placed into *optlen
> > > >>>>>>> or negative value as error and *optlen will not be changed on error.
> > > >>>>>>> optval_iter will get direction ITER_DEST, so it can only be written to.
> > > >>>>>>>
> > > >>>>>>> Implementations could then opt in for the new interface and
> > > >>>>>>> allow do_sock_getsockopt() work also for the io_uring case,
> > > >>>>>>> while all others would still get -EOPNOTSUPP.
> > > >>>>>>>
> > > >>>>>>> So what should be the way to go?    
> > > >>>>>>
> > > >>>>>> Ok, I've added the infrastructure for getsockopt_iter, see below,
> > > >>>>>> but the first part I wanted to convert was
> > > >>>>>> tcp_ao_copy_mkts_to_user() and that also reads from userspace before
> > > >>>>>> writing.
> > > >>>>>>
> > > >>>>>> So we could go with the optlen_t approach, or we need
> > > >>>>>> logic for ITER_BOTH or pass two iov_iters one with ITER_SRC and one
> > > >>>>>> with ITER_DEST...
> > > >>>>>>
> > > >>>>>> So who wants to decide?    
> > > >>>>>
> > > >>>>> I just noticed that it's even possible in same cases
> > > >>>>> to pass in a short buffer to optval, but have a longer value in optlen,
> > > >>>>> hci_sock_getsockopt() with SOL_BLUETOOTH completely ignores optlen.
> > > >>>>>
> > > >>>>> This makes it really hard to believe that trying to use iov_iter for this
> > > >>>>> is a good idea :-(    
> > > >>>>
> > > >>>> That was my finding as well a while ago, when I was planning to get the
> > > >>>> __user pointers converted to iov_iter. There are some weird ways of
> > > >>>> using optlen and optval, which makes them non-trivial to covert to
> > > >>>> iov_iter.    
> > > >>>
> > > >>> Can we ignore all non-ip/tcp/udp cases for now? This should cover +90%
> > > >>> of useful socket opts. See if there are any obvious problems with them
> > > >>> and if not, try converting. The rest we can cover separately when/if
> > > >>> needed.    
> > > >>
> > > >> That's what I tried, but it fails with
> > > >> tcp_getsockopt ->
> > > >>     do_tcp_getsockopt ->
> > > >>       tcp_ao_get_mkts ->
> > > >>          tcp_ao_copy_mkts_to_user ->
> > > >>             copy_struct_from_sockptr
> > > >>       tcp_ao_get_sock_info ->
> > > >>          copy_struct_from_sockptr
> > > >>
> > > >> That's not possible with a ITER_DEST iov_iter.
> > > >>
> > > >> metze    
> > > > 
> > > > Can we create two iterators over the same memory? One for ITER_SOURCE and
> > > > another for ITER_DEST. And then make getsockopt_iter accept optval_in and
> > > > optval_out. We can also use optval_out position (iov_offset) as optlen output
> > > > value. Don't see why it won't work, but I agree that's gonna be a messy
> > > > conversion so let's see if someone else has better suggestions.    
> > > 
> > > Yes, that might work, but it would be good to get some feedback
> > > if this would be the way to go:
> > > 
> > >            int (*getsockopt_iter)(struct socket *sock,
> > > 				 int level, int optname,
> > > 				 struct iov_iter *optval_in,
> > > 				 struct iov_iter *optval_out);
> > > 
> > > And *optlen = optval_out->iov_offset;
> > > 
> > > Any objection or better ideas? Linus would that be what you had in mind?  
> > 
> > I'd worry about performance - yes I know 'iter' are used elsewhere but...
> > Also look at the SCTP code.  
> 
> Performance usually does not matter for set/getsockopts, there
> are a few exceptions that I know (TCP_ZEROCOPY_RECEIVE)

That might be the one that is really horrid and completely abuses
the 'length' parameter.

> and maybe recent
> devmem sockopts; we can special-case these if needed, or keep sockptr_t,
> idk. I'm skeptical we can convert everything though, that's why the
> suggestion to start with sk/ip/tcp/udp.
> 
> > How do you handle code that wants to return an updated length (often longer
> > than the one provided) and an error code (eg ERRSIZE or similar).
> >
> > There is also a very strange use (I think it is a sockopt rather than an ioctl)
> > where the buffer length the application provides is only that of the header.
> > The actual buffer length is contained in the header.
> > The return length is the amount written into the full buffer.  
> 
> Let's discuss these special cases as they come up? Worst case these
> places can always re-init iov_iter with a comment on why it is ok.
> But I do agree in general that there are a few places that do wild
> stuff.

The problem is that the generic code has to deal with all the 'wild stuff'.
It is also common to do non-sequential accesses - so iov_iter doesn't match
at all.
There also isn't a requirement for scatter-gather.

For 'normal' getsockopt (and setsockopt) with short lengths it actually makes
sense for the syscall wrapper to do the user copies.
But it would need to pass the user ptr+len as well as the kernel ptr+len
to give the required flexibilty.
Then you have to work out whether the final copy to user is needed or not.
(not that hard, but it all adds complication).

	David
Linus Torvalds April 2, 2025, 9:07 p.m. UTC | #15
On Wed, 2 Apr 2025 at 13:46, David Laight <david.laight.linux@gmail.com> wrote:
>
> The problem is that the generic code has to deal with all the 'wild stuff'.
> It is also common to do non-sequential accesses - so iov_iter doesn't match
> at all.
> There also isn't a requirement for scatter-gather.

Note that the generic code has special cases for the simple stuff,
which is all that the sockopt code would need.

Now, that's _particularly_ true for the "single user address range"
thing, where there's a special ITER_UBUF thing.

We don't actually have a "single kernel range" version of that, but
ITER_KVEC is simple to use, and the sockopt code could say "I only
ever look at the first buffer".

It's ok to just not handle all the cases, and you don't *have* to use
the generic "copy_from_iter()" routines if you don't want to.

In fact, I would expect that something like sockopt generally wouldn't
want to use the normal iter copying routines, since those are
basically all geared towards "copy and update the iter".

           Linus
Stanislav Fomichev April 2, 2025, 9:21 p.m. UTC | #16
On 04/02, David Laight wrote:
> On Wed, 2 Apr 2025 07:19:46 -0700
> Stanislav Fomichev <stfomichev@gmail.com> wrote:
> 
> > On 04/02, David Laight wrote:
> > > On Wed, 2 Apr 2025 00:53:58 +0200
> > > Stefan Metzmacher <metze@samba.org> wrote:
> > >   
> > > > Am 02.04.25 um 00:04 schrieb Stanislav Fomichev:  
> > > > > On 04/01, Stefan Metzmacher wrote:    
> > > > >> Am 01.04.25 um 17:45 schrieb Stanislav Fomichev:    
> > > > >>> On 04/01, Breno Leitao wrote:    
> > > > >>>> On Tue, Apr 01, 2025 at 03:48:58PM +0200, Stefan Metzmacher wrote:    
> > > > >>>>> Am 01.04.25 um 15:37 schrieb Stefan Metzmacher:    
> > > > >>>>>> Am 01.04.25 um 10:19 schrieb Stefan Metzmacher:    
> > > > >>>>>>> Am 31.03.25 um 23:04 schrieb Stanislav Fomichev:    
> > > > >>>>>>>> On 03/31, Stefan Metzmacher wrote:    
> > > > >>>>>>>>> The motivation for this is to remove the SOL_SOCKET limitation
> > > > >>>>>>>>> from io_uring_cmd_getsockopt().
> > > > >>>>>>>>>
> > > > >>>>>>>>> The reason for this limitation is that io_uring_cmd_getsockopt()
> > > > >>>>>>>>> passes a kernel pointer as optlen to do_sock_getsockopt()
> > > > >>>>>>>>> and can't reach the ops->getsockopt() path.
> > > > >>>>>>>>>
> > > > >>>>>>>>> The first idea would be to change the optval and optlen arguments
> > > > >>>>>>>>> to the protocol specific hooks also to sockptr_t, as that
> > > > >>>>>>>>> is already used for setsockopt() and also by do_sock_getsockopt()
> > > > >>>>>>>>> sk_getsockopt() and BPF_CGROUP_RUN_PROG_GETSOCKOPT().
> > > > >>>>>>>>>
> > > > >>>>>>>>> But as Linus don't like 'sockptr_t' I used a different approach.
> > > > >>>>>>>>>
> > > > >>>>>>>>> @Linus, would that optlen_t approach fit better for you?    
> > > > >>>>>>>>
> > > > >>>>>>>> [..]
> > > > >>>>>>>>    
> > > > >>>>>>>>> Instead of passing the optlen as user or kernel pointer,
> > > > >>>>>>>>> we only ever pass a kernel pointer and do the
> > > > >>>>>>>>> translation from/to userspace in do_sock_getsockopt().    
> > > > >>>>>>>>
> > > > >>>>>>>> At this point why not just fully embrace iov_iter? You have the size
> > > > >>>>>>>> now + the user (or kernel) pointer. Might as well do
> > > > >>>>>>>> s/sockptr_t/iov_iter/ conversion?    
> > > > >>>>>>>
> > > > >>>>>>> I think that would only be possible if we introduce
> > > > >>>>>>> proto[_ops].getsockopt_iter() and then convert the implementations
> > > > >>>>>>> step by step. Doing it all in one go has a lot of potential to break
> > > > >>>>>>> the uapi. I could try to convert things like socket, ip and tcp myself, but
> > > > >>>>>>> the rest needs to be converted by the maintainer of the specific protocol,
> > > > >>>>>>> as it needs to be tested. As there are crazy things happening in the existing
> > > > >>>>>>> implementations, e.g. some getsockopt() implementations use optval as in and out
> > > > >>>>>>> buffer.
> > > > >>>>>>>
> > > > >>>>>>> I first tried to convert both optval and optlen of getsockopt to sockptr_t,
> > > > >>>>>>> and that showed that touching the optval part starts to get complex very soon,
> > > > >>>>>>> see https://git.samba.org/?p=metze/linux/wip.git;a=commitdiff;h=141912166473bf8843ec6ace76dc9c6945adafd1
> > > > >>>>>>> (note it didn't converted everything, I gave up after hitting
> > > > >>>>>>> sctp_getsockopt_peer_addrs and sctp_getsockopt_local_addrs.
> > > > >>>>>>> sctp_getsockopt_context, sctp_getsockopt_maxseg, sctp_getsockopt_associnfo and maybe
> > > > >>>>>>> more are the ones also doing both copy_from_user and copy_to_user on optval)
> > > > >>>>>>>
> > > > >>>>>>> I come also across one implementation that returned -ERANGE because *optlen was
> > > > >>>>>>> too short and put the required length into *optlen, which means the returned
> > > > >>>>>>> *optlen is larger than the optval buffer given from userspace.
> > > > >>>>>>>
> > > > >>>>>>> Because of all these strange things I tried to do a minimal change
> > > > >>>>>>> in order to get rid of the io_uring limitation and only converted
> > > > >>>>>>> optlen and leave optval as is.
> > > > >>>>>>>
> > > > >>>>>>> In order to have a patchset that has a low risk to cause regressions.
> > > > >>>>>>>
> > > > >>>>>>> But as alternative introducing a prototype like this:
> > > > >>>>>>>
> > > > >>>>>>>            int (*getsockopt_iter)(struct socket *sock, int level, int optname,
> > > > >>>>>>>                                   struct iov_iter *optval_iter);
> > > > >>>>>>>
> > > > >>>>>>> That returns a non-negative value which can be placed into *optlen
> > > > >>>>>>> or negative value as error and *optlen will not be changed on error.
> > > > >>>>>>> optval_iter will get direction ITER_DEST, so it can only be written to.
> > > > >>>>>>>
> > > > >>>>>>> Implementations could then opt in for the new interface and
> > > > >>>>>>> allow do_sock_getsockopt() work also for the io_uring case,
> > > > >>>>>>> while all others would still get -EOPNOTSUPP.
> > > > >>>>>>>
> > > > >>>>>>> So what should be the way to go?    
> > > > >>>>>>
> > > > >>>>>> Ok, I've added the infrastructure for getsockopt_iter, see below,
> > > > >>>>>> but the first part I wanted to convert was
> > > > >>>>>> tcp_ao_copy_mkts_to_user() and that also reads from userspace before
> > > > >>>>>> writing.
> > > > >>>>>>
> > > > >>>>>> So we could go with the optlen_t approach, or we need
> > > > >>>>>> logic for ITER_BOTH or pass two iov_iters one with ITER_SRC and one
> > > > >>>>>> with ITER_DEST...
> > > > >>>>>>
> > > > >>>>>> So who wants to decide?    
> > > > >>>>>
> > > > >>>>> I just noticed that it's even possible in same cases
> > > > >>>>> to pass in a short buffer to optval, but have a longer value in optlen,
> > > > >>>>> hci_sock_getsockopt() with SOL_BLUETOOTH completely ignores optlen.
> > > > >>>>>
> > > > >>>>> This makes it really hard to believe that trying to use iov_iter for this
> > > > >>>>> is a good idea :-(    
> > > > >>>>
> > > > >>>> That was my finding as well a while ago, when I was planning to get the
> > > > >>>> __user pointers converted to iov_iter. There are some weird ways of
> > > > >>>> using optlen and optval, which makes them non-trivial to covert to
> > > > >>>> iov_iter.    
> > > > >>>
> > > > >>> Can we ignore all non-ip/tcp/udp cases for now? This should cover +90%
> > > > >>> of useful socket opts. See if there are any obvious problems with them
> > > > >>> and if not, try converting. The rest we can cover separately when/if
> > > > >>> needed.    
> > > > >>
> > > > >> That's what I tried, but it fails with
> > > > >> tcp_getsockopt ->
> > > > >>     do_tcp_getsockopt ->
> > > > >>       tcp_ao_get_mkts ->
> > > > >>          tcp_ao_copy_mkts_to_user ->
> > > > >>             copy_struct_from_sockptr
> > > > >>       tcp_ao_get_sock_info ->
> > > > >>          copy_struct_from_sockptr
> > > > >>
> > > > >> That's not possible with a ITER_DEST iov_iter.
> > > > >>
> > > > >> metze    
> > > > > 
> > > > > Can we create two iterators over the same memory? One for ITER_SOURCE and
> > > > > another for ITER_DEST. And then make getsockopt_iter accept optval_in and
> > > > > optval_out. We can also use optval_out position (iov_offset) as optlen output
> > > > > value. Don't see why it won't work, but I agree that's gonna be a messy
> > > > > conversion so let's see if someone else has better suggestions.    
> > > > 
> > > > Yes, that might work, but it would be good to get some feedback
> > > > if this would be the way to go:
> > > > 
> > > >            int (*getsockopt_iter)(struct socket *sock,
> > > > 				 int level, int optname,
> > > > 				 struct iov_iter *optval_in,
> > > > 				 struct iov_iter *optval_out);
> > > > 
> > > > And *optlen = optval_out->iov_offset;
> > > > 
> > > > Any objection or better ideas? Linus would that be what you had in mind?  
> > > 
> > > I'd worry about performance - yes I know 'iter' are used elsewhere but...
> > > Also look at the SCTP code.  
> > 
> > Performance usually does not matter for set/getsockopts, there
> > are a few exceptions that I know (TCP_ZEROCOPY_RECEIVE)
> 
> That might be the one that is really horrid and completely abuses
> the 'length' parameter.

It is reading and writing, yes, but it's not a huge problem. And it
does enforce the optlen (to copy back the same amount of bytes). It's
not that bad, it's just an example of where we need to be extra
careful.

> > and maybe recent
> > devmem sockopts; we can special-case these if needed, or keep sockptr_t,
> > idk. I'm skeptical we can convert everything though, that's why the
> > suggestion to start with sk/ip/tcp/udp.
> > 
> > > How do you handle code that wants to return an updated length (often longer
> > > than the one provided) and an error code (eg ERRSIZE or similar).
> > >
> > > There is also a very strange use (I think it is a sockopt rather than an ioctl)
> > > where the buffer length the application provides is only that of the header.
> > > The actual buffer length is contained in the header.
> > > The return length is the amount written into the full buffer.  
> > 
> > Let's discuss these special cases as they come up? Worst case these
> > places can always re-init iov_iter with a comment on why it is ok.
> > But I do agree in general that there are a few places that do wild
> > stuff.
> 
> The problem is that the generic code has to deal with all the 'wild stuff'.

getsockopt_iter will have optval_in for the minority of socket options
(like TCP_ZEROCOPY_RECEIVE) that want to read user's value as well
as optval_out. The latter is what the majority of socket options
will use to write their value. That doesn't seem too complicated to
handle?

> It is also common to do non-sequential accesses - so iov_iter doesn't match
> at all.

I disagree that it's 'common'. Searching for copy_from_sockptr_offset
returns a few cases and they are mostly using read-with-offset because
there is no sequential read (iterator) semantics with sockptr_t.

> There also isn't a requirement for scatter-gather.
> 
> For 'normal' getsockopt (and setsockopt) with short lengths it actually makes
> sense for the syscall wrapper to do the user copies.
> But it would need to pass the user ptr+len as well as the kernel ptr+len
> to give the required flexibilty.
> Then you have to work out whether the final copy to user is needed or not.
> (not that hard, but it all adds complication).

Not sure I understand what's the problem. The user vs kernel part will
be abstracted by iov_iter. The callers will have to write the optlen
back. And there are two call sites we care about: io_uring and regular
system call. What's your suggestion? Maybe I'm missing something. Do you
prefer get_optlen/put_optlen?
David Laight April 2, 2025, 10:38 p.m. UTC | #17
On Wed, 2 Apr 2025 14:21:35 -0700
Stanislav Fomichev <stfomichev@gmail.com> wrote:

> On 04/02, David Laight wrote:
> > On Wed, 2 Apr 2025 07:19:46 -0700
> > Stanislav Fomichev <stfomichev@gmail.com> wrote:
> >   
> > > On 04/02, David Laight wrote:  
> > > > On Wed, 2 Apr 2025 00:53:58 +0200
> > > > Stefan Metzmacher <metze@samba.org> wrote:
> > > >     
> > > > > Am 02.04.25 um 00:04 schrieb Stanislav Fomichev:    
> > > > > > On 04/01, Stefan Metzmacher wrote:      
> > > > > >> Am 01.04.25 um 17:45 schrieb Stanislav Fomichev:      
> > > > > >>> On 04/01, Breno Leitao wrote:      
> > > > > >>>> On Tue, Apr 01, 2025 at 03:48:58PM +0200, Stefan Metzmacher wrote:      
> > > > > >>>>> Am 01.04.25 um 15:37 schrieb Stefan Metzmacher:      
> > > > > >>>>>> Am 01.04.25 um 10:19 schrieb Stefan Metzmacher:      
> > > > > >>>>>>> Am 31.03.25 um 23:04 schrieb Stanislav Fomichev:      
> > > > > >>>>>>>> On 03/31, Stefan Metzmacher wrote:      
> > > > > >>>>>>>>> The motivation for this is to remove the SOL_SOCKET limitation
> > > > > >>>>>>>>> from io_uring_cmd_getsockopt().
> > > > > >>>>>>>>>
> > > > > >>>>>>>>> The reason for this limitation is that io_uring_cmd_getsockopt()
> > > > > >>>>>>>>> passes a kernel pointer as optlen to do_sock_getsockopt()
> > > > > >>>>>>>>> and can't reach the ops->getsockopt() path.
> > > > > >>>>>>>>>
> > > > > >>>>>>>>> The first idea would be to change the optval and optlen arguments
> > > > > >>>>>>>>> to the protocol specific hooks also to sockptr_t, as that
> > > > > >>>>>>>>> is already used for setsockopt() and also by do_sock_getsockopt()
> > > > > >>>>>>>>> sk_getsockopt() and BPF_CGROUP_RUN_PROG_GETSOCKOPT().
> > > > > >>>>>>>>>
> > > > > >>>>>>>>> But as Linus don't like 'sockptr_t' I used a different approach.
> > > > > >>>>>>>>>
> > > > > >>>>>>>>> @Linus, would that optlen_t approach fit better for you?      
> > > > > >>>>>>>>
> > > > > >>>>>>>> [..]
> > > > > >>>>>>>>      
> > > > > >>>>>>>>> Instead of passing the optlen as user or kernel pointer,
> > > > > >>>>>>>>> we only ever pass a kernel pointer and do the
> > > > > >>>>>>>>> translation from/to userspace in do_sock_getsockopt().      
> > > > > >>>>>>>>
> > > > > >>>>>>>> At this point why not just fully embrace iov_iter? You have the size
> > > > > >>>>>>>> now + the user (or kernel) pointer. Might as well do
> > > > > >>>>>>>> s/sockptr_t/iov_iter/ conversion?      
> > > > > >>>>>>>
> > > > > >>>>>>> I think that would only be possible if we introduce
> > > > > >>>>>>> proto[_ops].getsockopt_iter() and then convert the implementations
> > > > > >>>>>>> step by step. Doing it all in one go has a lot of potential to break
> > > > > >>>>>>> the uapi. I could try to convert things like socket, ip and tcp myself, but
> > > > > >>>>>>> the rest needs to be converted by the maintainer of the specific protocol,
> > > > > >>>>>>> as it needs to be tested. As there are crazy things happening in the existing
> > > > > >>>>>>> implementations, e.g. some getsockopt() implementations use optval as in and out
> > > > > >>>>>>> buffer.
> > > > > >>>>>>>
> > > > > >>>>>>> I first tried to convert both optval and optlen of getsockopt to sockptr_t,
> > > > > >>>>>>> and that showed that touching the optval part starts to get complex very soon,
> > > > > >>>>>>> see https://git.samba.org/?p=metze/linux/wip.git;a=commitdiff;h=141912166473bf8843ec6ace76dc9c6945adafd1
> > > > > >>>>>>> (note it didn't converted everything, I gave up after hitting
> > > > > >>>>>>> sctp_getsockopt_peer_addrs and sctp_getsockopt_local_addrs.
> > > > > >>>>>>> sctp_getsockopt_context, sctp_getsockopt_maxseg, sctp_getsockopt_associnfo and maybe
> > > > > >>>>>>> more are the ones also doing both copy_from_user and copy_to_user on optval)
> > > > > >>>>>>>
> > > > > >>>>>>> I come also across one implementation that returned -ERANGE because *optlen was
> > > > > >>>>>>> too short and put the required length into *optlen, which means the returned
> > > > > >>>>>>> *optlen is larger than the optval buffer given from userspace.
> > > > > >>>>>>>
> > > > > >>>>>>> Because of all these strange things I tried to do a minimal change
> > > > > >>>>>>> in order to get rid of the io_uring limitation and only converted
> > > > > >>>>>>> optlen and leave optval as is.
> > > > > >>>>>>>
> > > > > >>>>>>> In order to have a patchset that has a low risk to cause regressions.
> > > > > >>>>>>>
> > > > > >>>>>>> But as alternative introducing a prototype like this:
> > > > > >>>>>>>
> > > > > >>>>>>>            int (*getsockopt_iter)(struct socket *sock, int level, int optname,
> > > > > >>>>>>>                                   struct iov_iter *optval_iter);
> > > > > >>>>>>>
> > > > > >>>>>>> That returns a non-negative value which can be placed into *optlen
> > > > > >>>>>>> or negative value as error and *optlen will not be changed on error.
> > > > > >>>>>>> optval_iter will get direction ITER_DEST, so it can only be written to.
> > > > > >>>>>>>
> > > > > >>>>>>> Implementations could then opt in for the new interface and
> > > > > >>>>>>> allow do_sock_getsockopt() work also for the io_uring case,
> > > > > >>>>>>> while all others would still get -EOPNOTSUPP.
> > > > > >>>>>>>
> > > > > >>>>>>> So what should be the way to go?      
> > > > > >>>>>>
> > > > > >>>>>> Ok, I've added the infrastructure for getsockopt_iter, see below,
> > > > > >>>>>> but the first part I wanted to convert was
> > > > > >>>>>> tcp_ao_copy_mkts_to_user() and that also reads from userspace before
> > > > > >>>>>> writing.
> > > > > >>>>>>
> > > > > >>>>>> So we could go with the optlen_t approach, or we need
> > > > > >>>>>> logic for ITER_BOTH or pass two iov_iters one with ITER_SRC and one
> > > > > >>>>>> with ITER_DEST...
> > > > > >>>>>>
> > > > > >>>>>> So who wants to decide?      
> > > > > >>>>>
> > > > > >>>>> I just noticed that it's even possible in same cases
> > > > > >>>>> to pass in a short buffer to optval, but have a longer value in optlen,
> > > > > >>>>> hci_sock_getsockopt() with SOL_BLUETOOTH completely ignores optlen.
> > > > > >>>>>
> > > > > >>>>> This makes it really hard to believe that trying to use iov_iter for this
> > > > > >>>>> is a good idea :-(      
> > > > > >>>>
> > > > > >>>> That was my finding as well a while ago, when I was planning to get the
> > > > > >>>> __user pointers converted to iov_iter. There are some weird ways of
> > > > > >>>> using optlen and optval, which makes them non-trivial to covert to
> > > > > >>>> iov_iter.      
> > > > > >>>
> > > > > >>> Can we ignore all non-ip/tcp/udp cases for now? This should cover +90%
> > > > > >>> of useful socket opts. See if there are any obvious problems with them
> > > > > >>> and if not, try converting. The rest we can cover separately when/if
> > > > > >>> needed.      
> > > > > >>
> > > > > >> That's what I tried, but it fails with
> > > > > >> tcp_getsockopt ->
> > > > > >>     do_tcp_getsockopt ->
> > > > > >>       tcp_ao_get_mkts ->
> > > > > >>          tcp_ao_copy_mkts_to_user ->
> > > > > >>             copy_struct_from_sockptr
> > > > > >>       tcp_ao_get_sock_info ->
> > > > > >>          copy_struct_from_sockptr
> > > > > >>
> > > > > >> That's not possible with a ITER_DEST iov_iter.
> > > > > >>
> > > > > >> metze      
> > > > > > 
> > > > > > Can we create two iterators over the same memory? One for ITER_SOURCE and
> > > > > > another for ITER_DEST. And then make getsockopt_iter accept optval_in and
> > > > > > optval_out. We can also use optval_out position (iov_offset) as optlen output
> > > > > > value. Don't see why it won't work, but I agree that's gonna be a messy
> > > > > > conversion so let's see if someone else has better suggestions.      
> > > > > 
> > > > > Yes, that might work, but it would be good to get some feedback
> > > > > if this would be the way to go:
> > > > > 
> > > > >            int (*getsockopt_iter)(struct socket *sock,
> > > > > 				 int level, int optname,
> > > > > 				 struct iov_iter *optval_in,
> > > > > 				 struct iov_iter *optval_out);
> > > > > 
> > > > > And *optlen = optval_out->iov_offset;
> > > > > 
> > > > > Any objection or better ideas? Linus would that be what you had in mind?    
> > > > 
> > > > I'd worry about performance - yes I know 'iter' are used elsewhere but...
> > > > Also look at the SCTP code.    
> > > 
> > > Performance usually does not matter for set/getsockopts, there
> > > are a few exceptions that I know (TCP_ZEROCOPY_RECEIVE)  
> > 
> > That might be the one that is really horrid and completely abuses
> > the 'length' parameter.  
> 
> It is reading and writing, yes, but it's not a huge problem. And it
> does enforce the optlen (to copy back the same amount of bytes). It's
> not that bad, it's just an example of where we need to be extra
> careful.
> 
> > > and maybe recent
> > > devmem sockopts; we can special-case these if needed, or keep sockptr_t,
> > > idk. I'm skeptical we can convert everything though, that's why the
> > > suggestion to start with sk/ip/tcp/udp.
> > >   
> > > > How do you handle code that wants to return an updated length (often longer
> > > > than the one provided) and an error code (eg ERRSIZE or similar).
> > > >
> > > > There is also a very strange use (I think it is a sockopt rather than an ioctl)
> > > > where the buffer length the application provides is only that of the header.
> > > > The actual buffer length is contained in the header.
> > > > The return length is the amount written into the full buffer.    
> > > 
> > > Let's discuss these special cases as they come up? Worst case these
> > > places can always re-init iov_iter with a comment on why it is ok.
> > > But I do agree in general that there are a few places that do wild
> > > stuff.  
> > 
> > The problem is that the generic code has to deal with all the 'wild stuff'.  
> 
> getsockopt_iter will have optval_in for the minority of socket options
> (like TCP_ZEROCOPY_RECEIVE) that want to read user's value as well
> as optval_out. The latter is what the majority of socket options
> will use to write their value. That doesn't seem too complicated to
> handle?
> 
> > It is also common to do non-sequential accesses - so iov_iter doesn't match
> > at all.  
> 
> I disagree that it's 'common'. Searching for copy_from_sockptr_offset
> returns a few cases and they are mostly using read-with-offset because
> there is no sequential read (iterator) semantics with sockptr_t.
> 
> > There also isn't a requirement for scatter-gather.
> > 
> > For 'normal' getsockopt (and setsockopt) with short lengths it actually makes
> > sense for the syscall wrapper to do the user copies.
> > But it would need to pass the user ptr+len as well as the kernel ptr+len
> > to give the required flexibilty.
> > Then you have to work out whether the final copy to user is needed or not.
> > (not that hard, but it all adds complication).  
> 
> Not sure I understand what's the problem. The user vs kernel part will
> be abstracted by iov_iter. The callers will have to write the optlen
> back. And there are two call sites we care about: io_uring and regular
> system call. What's your suggestion? Maybe I'm missing something. Do you
> prefer get_optlen/put_optlen?

I think the final aim should be to pass the user supplied length to the
per-protocol code and have it return the length/error to be passed back to the
user.

But in a lot of cases the syscall wrapper can do the buffer copies (as well
as the length copies).
That would be restricted to short length (on stack).
So code that needed a long buffer (like some of the sctp options)
would need to directly access the user buffer (or a long buffer provided
by an in-kernel user).

But you'll find code that reads/writes well beyond the apparent size of
the user buffer.
(And not just code that accesses 4 bytes without checking the length).

	David
Stanislav Fomichev April 2, 2025, 11:39 p.m. UTC | #18
On 04/02, David Laight wrote:
> On Wed, 2 Apr 2025 14:21:35 -0700
> Stanislav Fomichev <stfomichev@gmail.com> wrote:
> 
> > On 04/02, David Laight wrote:
> > > On Wed, 2 Apr 2025 07:19:46 -0700
> > > Stanislav Fomichev <stfomichev@gmail.com> wrote:
> > >   
> > > > On 04/02, David Laight wrote:  
> > > > > On Wed, 2 Apr 2025 00:53:58 +0200
> > > > > Stefan Metzmacher <metze@samba.org> wrote:
> > > > >     
> > > > > > Am 02.04.25 um 00:04 schrieb Stanislav Fomichev:    
> > > > > > > On 04/01, Stefan Metzmacher wrote:      
> > > > > > >> Am 01.04.25 um 17:45 schrieb Stanislav Fomichev:      
> > > > > > >>> On 04/01, Breno Leitao wrote:      
> > > > > > >>>> On Tue, Apr 01, 2025 at 03:48:58PM +0200, Stefan Metzmacher wrote:      
> > > > > > >>>>> Am 01.04.25 um 15:37 schrieb Stefan Metzmacher:      
> > > > > > >>>>>> Am 01.04.25 um 10:19 schrieb Stefan Metzmacher:      
> > > > > > >>>>>>> Am 31.03.25 um 23:04 schrieb Stanislav Fomichev:      
> > > > > > >>>>>>>> On 03/31, Stefan Metzmacher wrote:      
> > > > > > >>>>>>>>> The motivation for this is to remove the SOL_SOCKET limitation
> > > > > > >>>>>>>>> from io_uring_cmd_getsockopt().
> > > > > > >>>>>>>>>
> > > > > > >>>>>>>>> The reason for this limitation is that io_uring_cmd_getsockopt()
> > > > > > >>>>>>>>> passes a kernel pointer as optlen to do_sock_getsockopt()
> > > > > > >>>>>>>>> and can't reach the ops->getsockopt() path.
> > > > > > >>>>>>>>>
> > > > > > >>>>>>>>> The first idea would be to change the optval and optlen arguments
> > > > > > >>>>>>>>> to the protocol specific hooks also to sockptr_t, as that
> > > > > > >>>>>>>>> is already used for setsockopt() and also by do_sock_getsockopt()
> > > > > > >>>>>>>>> sk_getsockopt() and BPF_CGROUP_RUN_PROG_GETSOCKOPT().
> > > > > > >>>>>>>>>
> > > > > > >>>>>>>>> But as Linus don't like 'sockptr_t' I used a different approach.
> > > > > > >>>>>>>>>
> > > > > > >>>>>>>>> @Linus, would that optlen_t approach fit better for you?      
> > > > > > >>>>>>>>
> > > > > > >>>>>>>> [..]
> > > > > > >>>>>>>>      
> > > > > > >>>>>>>>> Instead of passing the optlen as user or kernel pointer,
> > > > > > >>>>>>>>> we only ever pass a kernel pointer and do the
> > > > > > >>>>>>>>> translation from/to userspace in do_sock_getsockopt().      
> > > > > > >>>>>>>>
> > > > > > >>>>>>>> At this point why not just fully embrace iov_iter? You have the size
> > > > > > >>>>>>>> now + the user (or kernel) pointer. Might as well do
> > > > > > >>>>>>>> s/sockptr_t/iov_iter/ conversion?      
> > > > > > >>>>>>>
> > > > > > >>>>>>> I think that would only be possible if we introduce
> > > > > > >>>>>>> proto[_ops].getsockopt_iter() and then convert the implementations
> > > > > > >>>>>>> step by step. Doing it all in one go has a lot of potential to break
> > > > > > >>>>>>> the uapi. I could try to convert things like socket, ip and tcp myself, but
> > > > > > >>>>>>> the rest needs to be converted by the maintainer of the specific protocol,
> > > > > > >>>>>>> as it needs to be tested. As there are crazy things happening in the existing
> > > > > > >>>>>>> implementations, e.g. some getsockopt() implementations use optval as in and out
> > > > > > >>>>>>> buffer.
> > > > > > >>>>>>>
> > > > > > >>>>>>> I first tried to convert both optval and optlen of getsockopt to sockptr_t,
> > > > > > >>>>>>> and that showed that touching the optval part starts to get complex very soon,
> > > > > > >>>>>>> see https://git.samba.org/?p=metze/linux/wip.git;a=commitdiff;h=141912166473bf8843ec6ace76dc9c6945adafd1
> > > > > > >>>>>>> (note it didn't converted everything, I gave up after hitting
> > > > > > >>>>>>> sctp_getsockopt_peer_addrs and sctp_getsockopt_local_addrs.
> > > > > > >>>>>>> sctp_getsockopt_context, sctp_getsockopt_maxseg, sctp_getsockopt_associnfo and maybe
> > > > > > >>>>>>> more are the ones also doing both copy_from_user and copy_to_user on optval)
> > > > > > >>>>>>>
> > > > > > >>>>>>> I come also across one implementation that returned -ERANGE because *optlen was
> > > > > > >>>>>>> too short and put the required length into *optlen, which means the returned
> > > > > > >>>>>>> *optlen is larger than the optval buffer given from userspace.
> > > > > > >>>>>>>
> > > > > > >>>>>>> Because of all these strange things I tried to do a minimal change
> > > > > > >>>>>>> in order to get rid of the io_uring limitation and only converted
> > > > > > >>>>>>> optlen and leave optval as is.
> > > > > > >>>>>>>
> > > > > > >>>>>>> In order to have a patchset that has a low risk to cause regressions.
> > > > > > >>>>>>>
> > > > > > >>>>>>> But as alternative introducing a prototype like this:
> > > > > > >>>>>>>
> > > > > > >>>>>>>            int (*getsockopt_iter)(struct socket *sock, int level, int optname,
> > > > > > >>>>>>>                                   struct iov_iter *optval_iter);
> > > > > > >>>>>>>
> > > > > > >>>>>>> That returns a non-negative value which can be placed into *optlen
> > > > > > >>>>>>> or negative value as error and *optlen will not be changed on error.
> > > > > > >>>>>>> optval_iter will get direction ITER_DEST, so it can only be written to.
> > > > > > >>>>>>>
> > > > > > >>>>>>> Implementations could then opt in for the new interface and
> > > > > > >>>>>>> allow do_sock_getsockopt() work also for the io_uring case,
> > > > > > >>>>>>> while all others would still get -EOPNOTSUPP.
> > > > > > >>>>>>>
> > > > > > >>>>>>> So what should be the way to go?      
> > > > > > >>>>>>
> > > > > > >>>>>> Ok, I've added the infrastructure for getsockopt_iter, see below,
> > > > > > >>>>>> but the first part I wanted to convert was
> > > > > > >>>>>> tcp_ao_copy_mkts_to_user() and that also reads from userspace before
> > > > > > >>>>>> writing.
> > > > > > >>>>>>
> > > > > > >>>>>> So we could go with the optlen_t approach, or we need
> > > > > > >>>>>> logic for ITER_BOTH or pass two iov_iters one with ITER_SRC and one
> > > > > > >>>>>> with ITER_DEST...
> > > > > > >>>>>>
> > > > > > >>>>>> So who wants to decide?      
> > > > > > >>>>>
> > > > > > >>>>> I just noticed that it's even possible in same cases
> > > > > > >>>>> to pass in a short buffer to optval, but have a longer value in optlen,
> > > > > > >>>>> hci_sock_getsockopt() with SOL_BLUETOOTH completely ignores optlen.
> > > > > > >>>>>
> > > > > > >>>>> This makes it really hard to believe that trying to use iov_iter for this
> > > > > > >>>>> is a good idea :-(      
> > > > > > >>>>
> > > > > > >>>> That was my finding as well a while ago, when I was planning to get the
> > > > > > >>>> __user pointers converted to iov_iter. There are some weird ways of
> > > > > > >>>> using optlen and optval, which makes them non-trivial to covert to
> > > > > > >>>> iov_iter.      
> > > > > > >>>
> > > > > > >>> Can we ignore all non-ip/tcp/udp cases for now? This should cover +90%
> > > > > > >>> of useful socket opts. See if there are any obvious problems with them
> > > > > > >>> and if not, try converting. The rest we can cover separately when/if
> > > > > > >>> needed.      
> > > > > > >>
> > > > > > >> That's what I tried, but it fails with
> > > > > > >> tcp_getsockopt ->
> > > > > > >>     do_tcp_getsockopt ->
> > > > > > >>       tcp_ao_get_mkts ->
> > > > > > >>          tcp_ao_copy_mkts_to_user ->
> > > > > > >>             copy_struct_from_sockptr
> > > > > > >>       tcp_ao_get_sock_info ->
> > > > > > >>          copy_struct_from_sockptr
> > > > > > >>
> > > > > > >> That's not possible with a ITER_DEST iov_iter.
> > > > > > >>
> > > > > > >> metze      
> > > > > > > 
> > > > > > > Can we create two iterators over the same memory? One for ITER_SOURCE and
> > > > > > > another for ITER_DEST. And then make getsockopt_iter accept optval_in and
> > > > > > > optval_out. We can also use optval_out position (iov_offset) as optlen output
> > > > > > > value. Don't see why it won't work, but I agree that's gonna be a messy
> > > > > > > conversion so let's see if someone else has better suggestions.      
> > > > > > 
> > > > > > Yes, that might work, but it would be good to get some feedback
> > > > > > if this would be the way to go:
> > > > > > 
> > > > > >            int (*getsockopt_iter)(struct socket *sock,
> > > > > > 				 int level, int optname,
> > > > > > 				 struct iov_iter *optval_in,
> > > > > > 				 struct iov_iter *optval_out);
> > > > > > 
> > > > > > And *optlen = optval_out->iov_offset;
> > > > > > 
> > > > > > Any objection or better ideas? Linus would that be what you had in mind?    
> > > > > 
> > > > > I'd worry about performance - yes I know 'iter' are used elsewhere but...
> > > > > Also look at the SCTP code.    
> > > > 
> > > > Performance usually does not matter for set/getsockopts, there
> > > > are a few exceptions that I know (TCP_ZEROCOPY_RECEIVE)  
> > > 
> > > That might be the one that is really horrid and completely abuses
> > > the 'length' parameter.  
> > 
> > It is reading and writing, yes, but it's not a huge problem. And it
> > does enforce the optlen (to copy back the same amount of bytes). It's
> > not that bad, it's just an example of where we need to be extra
> > careful.
> > 
> > > > and maybe recent
> > > > devmem sockopts; we can special-case these if needed, or keep sockptr_t,
> > > > idk. I'm skeptical we can convert everything though, that's why the
> > > > suggestion to start with sk/ip/tcp/udp.
> > > >   
> > > > > How do you handle code that wants to return an updated length (often longer
> > > > > than the one provided) and an error code (eg ERRSIZE or similar).
> > > > >
> > > > > There is also a very strange use (I think it is a sockopt rather than an ioctl)
> > > > > where the buffer length the application provides is only that of the header.
> > > > > The actual buffer length is contained in the header.
> > > > > The return length is the amount written into the full buffer.    
> > > > 
> > > > Let's discuss these special cases as they come up? Worst case these
> > > > places can always re-init iov_iter with a comment on why it is ok.
> > > > But I do agree in general that there are a few places that do wild
> > > > stuff.  
> > > 
> > > The problem is that the generic code has to deal with all the 'wild stuff'.  
> > 
> > getsockopt_iter will have optval_in for the minority of socket options
> > (like TCP_ZEROCOPY_RECEIVE) that want to read user's value as well
> > as optval_out. The latter is what the majority of socket options
> > will use to write their value. That doesn't seem too complicated to
> > handle?
> > 
> > > It is also common to do non-sequential accesses - so iov_iter doesn't match
> > > at all.  
> > 
> > I disagree that it's 'common'. Searching for copy_from_sockptr_offset
> > returns a few cases and they are mostly using read-with-offset because
> > there is no sequential read (iterator) semantics with sockptr_t.
> > 
> > > There also isn't a requirement for scatter-gather.
> > > 
> > > For 'normal' getsockopt (and setsockopt) with short lengths it actually makes
> > > sense for the syscall wrapper to do the user copies.
> > > But it would need to pass the user ptr+len as well as the kernel ptr+len
> > > to give the required flexibilty.
> > > Then you have to work out whether the final copy to user is needed or not.
> > > (not that hard, but it all adds complication).  
> > 
> > Not sure I understand what's the problem. The user vs kernel part will
> > be abstracted by iov_iter. The callers will have to write the optlen
> > back. And there are two call sites we care about: io_uring and regular
> > system call. What's your suggestion? Maybe I'm missing something. Do you
> > prefer get_optlen/put_optlen?
> 
> I think the final aim should be to pass the user supplied length to the
> per-protocol code and have it return the length/error to be passed back to the
> user.

Like what Stefan's patch 3 is doing? Or you're suggesting to change
getsockopt handlers to handle length more explicitly? If we were
to proceed with sockptr to iov_iter conversion we'll have to do it anyway
(or pass the length as the size of iov_iter).

> But in a lot of cases the syscall wrapper can do the buffer copies (as well
> as the length copies).
> That would be restricted to short length (on stack).
> So code that needed a long buffer (like some of the sctp options)
> would need to directly access the user buffer (or a long buffer provided
> by an in-kernel user).

This sounds similar to what we did with bpf hooks - copy (head of) the
buffer and run bpf program on top of it. I remember iptables setsockopt
begin problematic because of its huge size.. It is an option, yes (to
convert protocol handler to kernel memory mostly).

> But you'll find code that reads/writes well beyond the apparent size of
> the user buffer.
> (And not just code that accesses 4 bytes without checking the length).

With can start with getsockopt_iter + sk_getsockopt to see if there are any
issues with that approach. If not, adding ip/tcp/udp to the mix should be doable.
We can explain and comment on special cases if needed. When other protocols
are needed from io_uring, we can convert more. But at least the new code
will use the correct abstractions.