diff mbox series

[RFC,3/4] net: pass a kernel pointer via 'optlen_t' to proto[ops].getsockopt() hooks

Message ID d482e207223f434f0d306d3158b2142dceac4631.1743449872.git.metze@samba.org (mailing list archive)
State New
Headers show
Series net/io_uring: pass a kernel pointer via optlen_t to proto[_ops].getsockopt() | expand

Commit Message

Stefan Metzmacher March 31, 2025, 8:10 p.m. UTC
The motivation for this is to remove the SOL_SOCKET limitation
from io_uring_cmd_getsockopt().

The reason for this limitation is that io_uring_cmd_getsockopt()
passes a kernel pointer.

The first idea would be to change the optval and optlen arguments
to the protocol specific hooks also to sockptr_t, as that
is already used for setsockopt() and also by do_sock_getsockopt()
sk_getsockopt() and BPF_CGROUP_RUN_PROG_GETSOCKOPT().

But as Linus don't like 'sockptr_t' I used a different approach.

Instead of passing the optlen as user or kernel pointer,
we only ever pass a kernel pointer and do the
translation from/to userspace in do_sock_getsockopt().

The simple solution would be to just remove the
'__user' from the int *optlen argument, but it
seems the compiler doesn't complain about
'__user' vs. without it, so instead I used
a helper struct in order to make sure everything
compiles with a typesafe change.

That together with get_optlen() and put_optlen() helper
macros make it relatively easy to review and check the
behaviour is most likely unchanged.

In order to avoid uapi changes regarding different error
code orders regarding -EFAULT, the real -EFAULT handling
is deferred to get_optlen() and put_optlen().

This allows io_uring_cmd_getsockopt() to remove the
SOL_SOCKET limitation.

Removing 'sockptr_t optlen' from existing code
is for patch for another day.

Link: https://lore.kernel.org/io-uring/86b1dce5-4bb4-4a0b-9cff-e72f488bf57d@samba.org/T/#t
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Pavel Begunkov <asml.silence@gmail.com>
Cc: Breno Leitao <leitao@debian.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Karsten Keil <isdn@linux-pingi.de>
Cc: Ayush Sawal <ayush.sawal@chelsio.com>
Cc: Andrew Lunn <andrew+netdev@lunn.ch>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Paolo Abeni <pabeni@redhat.com>
Cc: Simon Horman <horms@kernel.org>
Cc: Kuniyuki Iwashima <kuniyu@amazon.com>
Cc: Willem de Bruijn <willemb@google.com>
Cc: David Ahern <dsahern@kernel.org>
Cc: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Cc: Xin Long <lucien.xin@gmail.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Joerg Reuter <jreuter@yaina.de>
Cc: Marcel Holtmann <marcel@holtmann.org>
Cc: Johan Hedberg <johan.hedberg@gmail.com>
Cc: Luiz Augusto von Dentz <luiz.dentz@gmail.com>
Cc: Oliver Hartkopp <socketcan@hartkopp.net>
Cc: Marc Kleine-Budde <mkl@pengutronix.de>
Cc: Robin van der Gracht <robin@protonic.nl>
Cc: Oleksij Rempel <o.rempel@pengutronix.de>
Cc: kernel@pengutronix.de
Cc: Alexander Aring <alex.aring@gmail.com>
Cc: Stefan Schmidt <stefan@datenfreihafen.org>
Cc: Miquel Raynal <miquel.raynal@bootlin.com>
Cc: Alexandra Winter <wintera@linux.ibm.com>
Cc: Thorsten Winkler <twinkler@linux.ibm.com>
Cc: James Chapman <jchapman@katalix.com>
Cc: Jeremy Kerr <jk@codeconstruct.com.au>
Cc: Matt Johnston <matt@codeconstruct.com.au>
Cc: Matthieu Baerts <matttbe@kernel.org>
Cc: Mat Martineau <martineau@kernel.org>
Cc: Geliang Tang <geliang@kernel.org>
Cc: Krzysztof Kozlowski <krzk@kernel.org>
Cc: Remi Denis-Courmont <courmisch@gmail.com>
Cc: Allison Henderson <allison.henderson@oracle.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Marc Dionne <marc.dionne@auristor.com>
Cc: Wenjia Zhang <wenjia@linux.ibm.com>
Cc: Jan Karcher <jaka@linux.ibm.com>
Cc: "D. Wythe" <alibuda@linux.alibaba.com>
Cc: Tony Lu <tonylu@linux.alibaba.com>
Cc: Wen Gu <guwen@linux.alibaba.com>
Cc: Jon Maloy <jmaloy@redhat.com>
Cc: Boris Pismenny <borisp@nvidia.com>
Cc: John Fastabend <john.fastabend@gmail.com>
Cc: Stefano Garzarella <sgarzare@redhat.com>
Cc: Martin Schiller <ms@dev.tdt.de>
Cc: "Björn Töpel" <bjorn@kernel.org>
Cc: Magnus Karlsson <magnus.karlsson@intel.com>
Cc: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Cc: Jonathan Lemon <jonathan.lemon@gmail.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Jesper Dangaard Brouer <hawk@kernel.org>
CC: Stefan Metzmacher <metze@samba.org>
Cc: netdev@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-sctp@vger.kernel.org
Cc: linux-hams@vger.kernel.org
Cc: linux-bluetooth@vger.kernel.org
Cc: linux-can@vger.kernel.org
Cc: dccp@vger.kernel.org
Cc: linux-wpan@vger.kernel.org
Cc: linux-s390@vger.kernel.org
Cc: mptcp@lists.linux.dev
Cc: linux-rdma@vger.kernel.org
Cc: rds-devel@oss.oracle.com
Cc: linux-afs@lists.infradead.org
Cc: tipc-discussion@lists.sourceforge.net
Cc: virtualization@lists.linux.dev
Cc: linux-x25@vger.kernel.org
Cc: bpf@vger.kernel.org
Cc: isdn4linux@listserv.isdn4linux.de
Cc: io-uring@vger.kernel.org
Signed-off-by: Stefan Metzmacher <metze@samba.org>
---
 include/linux/sockptr.h | 20 +++++++++++++++-----
 net/socket.c            | 31 +++++++++++++++++++++++++++++--
 2 files changed, 44 insertions(+), 7 deletions(-)

Comments

David Laight March 31, 2025, 9:49 p.m. UTC | #1
On Mon, 31 Mar 2025 22:10:55 +0200
Stefan Metzmacher <metze@samba.org> wrote:

> The motivation for this is to remove the SOL_SOCKET limitation
> from io_uring_cmd_getsockopt().
> 
> The reason for this limitation is that io_uring_cmd_getsockopt()
> passes a kernel pointer.
> 
> The first idea would be to change the optval and optlen arguments
> to the protocol specific hooks also to sockptr_t, as that
> is already used for setsockopt() and also by do_sock_getsockopt()
> sk_getsockopt() and BPF_CGROUP_RUN_PROG_GETSOCKOPT().
> 
> But as Linus don't like 'sockptr_t' I used a different approach.
> 
> Instead of passing the optlen as user or kernel pointer,
> we only ever pass a kernel pointer and do the
> translation from/to userspace in do_sock_getsockopt().
> 
> The simple solution would be to just remove the
> '__user' from the int *optlen argument, but it
> seems the compiler doesn't complain about
> '__user' vs. without it, so instead I used
> a helper struct in order to make sure everything
> compiles with a typesafe change.
> 
> That together with get_optlen() and put_optlen() helper
> macros make it relatively easy to review and check the
> behaviour is most likely unchanged.

I've looked into this before (and fallen down the patch rabbit hole).

I think the best (final) solution is to pass a validated non-negative
'optlen' into all getsockopt() functions and to have them usually return
either -errno or the modified length.
This simplifies 99% of the functions.

The problem case is functions that want to update the length and return
an error.
By best solution is to support return values of -errno << 20 | length
(as well as -errno and length).

There end up being some slight behaviour changes.
- Some code tries to 'undo' actions if the length can't be updated.
  I'm sure this is unnecessary and the recovery path is untested and
  could be buggy. Provided the kernel data is consistent there is
  no point trying to get code to recover from EFAULT.
  The 'length' has been read - so would also need to be readonly
  or unmapped by a second thread!
- A lot of getsockopt functions actually treat a negative length as 4.
  I think this 'bug' needs to preserved to avoid breaking applications.

The changes are mechanical but very widespread.

They also give the option of not writing back the length if unchanged.

	David
Stefan Metzmacher April 1, 2025, 8:24 a.m. UTC | #2
Am 31.03.25 um 23:49 schrieb David Laight:
> On Mon, 31 Mar 2025 22:10:55 +0200
> Stefan Metzmacher <metze@samba.org> wrote:
> 
>> The motivation for this is to remove the SOL_SOCKET limitation
>> from io_uring_cmd_getsockopt().
>>
>> The reason for this limitation is that io_uring_cmd_getsockopt()
>> passes a kernel pointer.
>>
>> The first idea would be to change the optval and optlen arguments
>> to the protocol specific hooks also to sockptr_t, as that
>> is already used for setsockopt() and also by do_sock_getsockopt()
>> sk_getsockopt() and BPF_CGROUP_RUN_PROG_GETSOCKOPT().
>>
>> But as Linus don't like 'sockptr_t' I used a different approach.
>>
>> Instead of passing the optlen as user or kernel pointer,
>> we only ever pass a kernel pointer and do the
>> translation from/to userspace in do_sock_getsockopt().
>>
>> The simple solution would be to just remove the
>> '__user' from the int *optlen argument, but it
>> seems the compiler doesn't complain about
>> '__user' vs. without it, so instead I used
>> a helper struct in order to make sure everything
>> compiles with a typesafe change.
>>
>> That together with get_optlen() and put_optlen() helper
>> macros make it relatively easy to review and check the
>> behaviour is most likely unchanged.
> 
> I've looked into this before (and fallen down the patch rabbit hole).

Yes, if you want to change the logic at the same time as
changing the kind of argument variable, then it get messy
quite fast.

> I think the best (final) solution is to pass a validated non-negative
> 'optlen' into all getsockopt() functions and to have them usually return
> either -errno or the modified length.
> This simplifies 99% of the functions.

Yes, maybe not 99%, but a lot.

> The problem case is functions that want to update the length and return
> an error.
> By best solution is to support return values of -errno << 20 | length
> (as well as -errno and length).
> 
> There end up being some slight behaviour changes.
> - Some code tries to 'undo' actions if the length can't be updated.
>    I'm sure this is unnecessary and the recovery path is untested and
>    could be buggy. Provided the kernel data is consistent there is
>    no point trying to get code to recover from EFAULT.
>    The 'length' has been read - so would also need to be readonly
>    or unmapped by a second thread!
> - A lot of getsockopt functions actually treat a negative length as 4.
>    I think this 'bug' needs to preserved to avoid breaking applications.
> 
> The changes are mechanical but very widespread.
> 
> They also give the option of not writing back the length if unchanged.

See my other mail regarding proto[_ops].getsockopt_iter(),
where implementation could be converted step by step.

But we may still need to keep the current  proto[ops].getsockopt()
as proto[ops].getsockopt_legacy() in order to keep the
insane uapi semantics alive.

metze
diff mbox series

Patch

diff --git a/include/linux/sockptr.h b/include/linux/sockptr.h
index 1baf66f26f4f..06ec7fd73028 100644
--- a/include/linux/sockptr.h
+++ b/include/linux/sockptr.h
@@ -170,20 +170,25 @@  static inline int check_zeroed_sockptr(sockptr_t src, size_t offset,
 }
 
 typedef struct {
-	int __user *up;
+	int *kp;
 } optlen_t;
 
 #define __check_optlen_t(__optlen)				\
 ({								\
 	optlen_t *__ptr __maybe_unused = &__optlen; \
-	BUILD_BUG_ON(sizeof(*((__ptr)->up)) != sizeof(int));	\
+	BUILD_BUG_ON(sizeof(*((__ptr)->kp)) != sizeof(int));	\
 })
 
 #define get_optlen(__val, __optlen)				\
 ({								\
 	long __err;						\
 	__check_optlen_t(__optlen);				\
-	__err = get_user(__val, __optlen.up);			\
+	if ((__optlen).kp != NULL) {				\
+		(__val) = *((__optlen).kp);			\
+		__err = 0;					\
+	} else {						\
+		__err = -EFAULT;				\
+	}							\
 	__err;							\
 })
 
@@ -191,13 +196,18 @@  typedef struct {
 ({								\
 	long __err;						\
 	__check_optlen_t(__optlen);				\
-	__err = put_user(__val, __optlen.up);			\
+	if ((__optlen).kp != NULL) {				\
+		*((__optlen).kp) = (__val);			\
+		__err = 0;					\
+	} else {						\
+		__err = -EFAULT;				\
+	}							\
 	__err;							\
 })
 
 static inline sockptr_t OPTLEN_SOCKPTR(optlen_t optlen)
 {
-	return (sockptr_t) { .user = optlen.up, };
+	return (sockptr_t) { .kernel = optlen.kp, .is_kernel = true };
 }
 
 #endif /* _LINUX_SOCKPTR_H */
diff --git a/net/socket.c b/net/socket.c
index fa2de12c10e6..81e5c9767bbc 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -2350,15 +2350,42 @@  int do_sock_getsockopt(struct socket *sock, bool compat, int level,
 	} else if (unlikely(!ops->getsockopt)) {
 		err = -EOPNOTSUPP;
 	} else {
-		optlen_t _optlen = { .up = NULL, };
+		optlen_t _optlen = { .kp = NULL, };
+		int koptlen;
 
 		if (WARN_ONCE(optval.is_kernel,
 			      "Invalid argument type"))
 			return -EOPNOTSUPP;
 
-		_optlen.up = optlen.user;
+		if (optlen.is_kernel) {
+			_optlen.kp = optlen.kernel;
+		} else if (optlen.user != NULL) {
+			/*
+			 * If optlen.user is NULL,
+			 * we pass _optlen.kp = NULL
+			 * in order to avoid breaking
+			 * any uapi for getsockopt()
+			 * implementations that ignore
+			 * the optlen pointer completely
+			 * or do any level and optname
+			 * checking before hitting a
+			 * potential -EFAULT condition.
+			 *
+			 * Also when optlen.user is not NULL,
+			 * but copy_from_sockptr() causes -EFAULT,
+			 * we'll pass optlen.kp = NULL in order
+			 * to defer a possible -EFAULT return
+			 * to the caller to get_optlen() and put_optlen().
+			 */
+			if (copy_from_sockptr(&koptlen, optlen, sizeof(koptlen)) == 0)
+				_optlen.kp = &koptlen;
+		}
 		err = ops->getsockopt(sock, level, optname, optval.user,
 				      _optlen);
+		if (err != -EFAULT && _optlen.kp == &koptlen) {
+			if (copy_to_sockptr(optlen, &koptlen, sizeof(koptlen)))
+				return -EFAULT;
+		}
 	}
 
 	if (!compat)