diff mbox series

[v2,net,01/15] af_unix: Set sk->sk_state under unix_state_lock() for truly disconnected peer.

Message ID 20240604165241.44758-2-kuniyu@amazon.com (mailing list archive)
State Accepted
Commit 26bfb8b57063f52b867f9b6c8d1742fcb5bd656c
Delegated to: Netdev Maintainers
Headers show
Series af_unix: Fix lockless access of sk->sk_state and other fields. | expand

Checks

Context Check Description
netdev/series_format success Posting correctly formatted
netdev/tree_selection success Clearly marked for net
netdev/ynl success Generated files up to date; no warnings/errors; no diff in generated;
netdev/fixes_present success Fixes tag present in non-next series
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 906 this patch: 906
netdev/build_tools success No tools touched, skip
netdev/cc_maintainers fail 1 blamed authors not CCed: ast@kernel.org; 1 maintainers not CCed: ast@kernel.org
netdev/build_clang success Errors and warnings before: 904 this patch: 904
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/deprecated_api success None detected
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes success Fixes tag looks correct
netdev/build_allmodconfig_warn success Errors and warnings before: 910 this patch: 910
netdev/checkpatch success total: 0 errors, 0 warnings, 0 checks, 23 lines checked
netdev/build_clang_rust success No Rust files in patch. Skipping build
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/source_inline success Was 0 now: 0
netdev/contest success net-next-2024-06-05--00-00 (tests: 1045)

Commit Message

Kuniyuki Iwashima June 4, 2024, 4:52 p.m. UTC
When a SOCK_DGRAM socket connect()s to another socket, both sockets'
sk->sk_state are changed to TCP_ESTABLISHED so that we can register them
in BPF SOCKMAP.

When the socket disconnects from the peer by connect(AF_UNSPEC), the state
is set back to TCP_CLOSE.

Then, the peer's state is also set to TCP_CLOSE, but the update is done
locklessly and unconditionally.

Let's say socket A connect()ed to B, B connect()ed to C, and A disconnects
from B.

After the first two connect()s, all three sockets' sk->sk_state are
TCP_ESTABLISHED:

  $ ss -xa
  Netid State  Recv-Q Send-Q  Local Address:Port  Peer Address:PortProcess
  u_dgr ESTAB  0      0       @A 641              * 642
  u_dgr ESTAB  0      0       @B 642              * 643
  u_dgr ESTAB  0      0       @C 643              * 0

And after the disconnect, B's state is TCP_CLOSE even though it's still
connected to C, whose state remains TCP_ESTABLISHED.

  $ ss -xa
  Netid State  Recv-Q Send-Q  Local Address:Port  Peer Address:PortProcess
  u_dgr UNCONN 0      0       @A 641              * 0
  u_dgr UNCONN 0      0       @B 642              * 643
  u_dgr ESTAB  0      0       @C 643              * 0

In this case, we cannot register B to SOCKMAP.
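
For reference, a minimal userspace sketch of the scenario (the abstract
names "A"/"B"/"C" and getchar() pauses are illustrative, and error
handling is omitted for brevity):

#include <stddef.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>

static int dgram_sock(const char *name)
{
	struct sockaddr_un addr = { .sun_family = AF_UNIX };
	int fd = socket(AF_UNIX, SOCK_DGRAM, 0);

	memcpy(addr.sun_path + 1, name, strlen(name));	/* abstract name */
	bind(fd, (struct sockaddr *)&addr,
	     offsetof(struct sockaddr_un, sun_path) + 1 + strlen(name));
	return fd;
}

static void dgram_connect(int fd, const char *name)
{
	struct sockaddr_un addr = { .sun_family = AF_UNIX };

	memcpy(addr.sun_path + 1, name, strlen(name));
	connect(fd, (struct sockaddr *)&addr,
		offsetof(struct sockaddr_un, sun_path) + 1 + strlen(name));
}

int main(void)
{
	struct sockaddr unspec = { .sa_family = AF_UNSPEC };
	int a = dgram_sock("A");
	int b = dgram_sock("B");
	int c = dgram_sock("C");

	(void)b;
	(void)c;

	dgram_connect(a, "B");			/* A -> B */
	dgram_connect(b, "C");			/* B -> C */
	getchar();	/* `ss -xa`: A, B, and C all show ESTAB */

	connect(a, &unspec, sizeof(unspec));	/* A disconnects from B */
	getchar();	/* before the fix: B shows UNCONN despite B -> C */

	return 0;
}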

So, when a socket disconnects from its peer, we should not set the peer's
state to TCP_CLOSE if the peer is still connected to another socket, and
this check must be done under unix_state_lock().

Note that we use WRITE_ONCE() for sk->sk_state as there are many lockless
readers.  These data-races will be fixed in the following patches.

Fixes: 83301b5367a9 ("af_unix: Set TCP_ESTABLISHED for datagram sockets too")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
---
CC: Cong Wang <cong.wang@bytedance.com>
---
 net/unix/af_unix.c | 10 ++++++++--
 1 file changed, 8 insertions(+), 2 deletions(-)

Comments

Michal Luczaj June 9, 2024, 11:28 a.m. UTC | #1
On 6/4/24 18:52, Kuniyuki Iwashima wrote:
> When a SOCK_DGRAM socket connect()s to another socket, the both sockets'
> sk->sk_state are changed to TCP_ESTABLISHED so that we can register them
> to BPF SOCKMAP. (...)

Speaking of af_unix and sockmap, SOCK_STREAM has a tiny window for
bpf(BPF_MAP_UPDATE_ELEM) and unix_stream_connect() to race: when
sock_map_sk_state_allowed() passes (sk_state == TCP_ESTABLISHED), but
unix_peer(sk) in unix_stream_bpf_update_proto() _still_ returns NULL:

	T0 bpf				T1 connect
	======				==========

				WRITE_ONCE(sk->sk_state, TCP_ESTABLISHED)
sock_map_sk_state_allowed(sk)
...
sk_pair = unix_peer(sk)
sock_hold(sk_pair)
				sock_hold(newsk)
				smp_mb__after_atomic()
				unix_peer(sk) = newsk
				unix_state_unlock(sk)

With mdelay(1) stuffed in unix_stream_connect():

[  902.277593] BUG: kernel NULL pointer dereference, address: 0000000000000080
[  902.277633] #PF: supervisor write access in kernel mode
[  902.277661] #PF: error_code(0x0002) - not-present page
[  902.277688] PGD 107191067 P4D 107191067 PUD 10f63c067 PMD 0
[  902.277716] Oops: Oops: 0002 [#23] PREEMPT SMP NOPTI
[  902.277742] CPU: 2 PID: 1505 Comm: a.out Tainted: G      D            6.10.0-rc1+ #130
[  902.277769] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Arch Linux 1.16.3-1-1 04/01/2014
[  902.277793] RIP: 0010:unix_stream_bpf_update_proto+0xa1/0x150

Setting TCP_ESTABLISHED _after_ unix_peer() fixes the issue, so how about
something like

@@ -1631,12 +1631,13 @@ static int unix_stream_connect(struct socket *sock, struct sockaddr *uaddr,
        /* Set credentials */
        copy_peercred(sk, other);

-       sock->state     = SS_CONNECTED;
-       WRITE_ONCE(sk->sk_state, TCP_ESTABLISHED);
        sock_hold(newsk);
+       smp_mb__after_atomic(); /* sock_hold() does an atomic_inc() */
+       WRITE_ONCE(unix_peer(sk), newsk);
+       smp_wmb(); /* ensure peer is set before sk_state */

-       smp_mb__after_atomic(); /* sock_hold() does an atomic_inc() */
-       unix_peer(sk)   = newsk;
+       sock->state = SS_CONNECTED;
+       WRITE_ONCE(sk->sk_state, TCP_ESTABLISHED);

        unix_state_unlock(sk);

@@ -180,7 +180,8 @@ int unix_stream_bpf_update_proto(struct sock *sk, struct sk_psock *psock, bool r
         * be a single matching destroy operation.
         */
        if (!psock->sk_pair) {
-               sk_pair = unix_peer(sk);
+               smp_rmb();
+               sk_pair = READ_ONCE(unix_peer(sk));
                sock_hold(sk_pair);
                psock->sk_pair = sk_pair;
        }

This should keep things ordered and lockless... I hope.

Alternatively, maybe it would be better just to make BPF respect the unix
state lock?

@@ -180,6 +180,8 @@ int unix_stream_bpf_update_proto(struct sock *sk, struct sk_psock *psock, bool r
 	 * be a single matching destroy operation.
 	 */
 	if (!psock->sk_pair) {
+               unix_state_lock(sk);
                sk_pair = unix_peer(sk);
+               unix_state_unlock(sk);
 		sock_hold(sk_pair);
 		psock->sk_pair = sk_pair;

What do you think?

Thanks,
Michal
Kuniyuki Iwashima June 9, 2024, 7:53 p.m. UTC | #2
From: Michal Luczaj <mhal@rbox.co>
Date: Sun, 9 Jun 2024 13:28:34 +0200
> On 6/4/24 18:52, Kuniyuki Iwashima wrote:
> > When a SOCK_DGRAM socket connect()s to another socket, the both sockets'
> > sk->sk_state are changed to TCP_ESTABLISHED so that we can register them
> > to BPF SOCKMAP. (...)
> 
> Speaking of af_unix and sockmap, SOCK_STREAM has a tiny window for
> bpf(BPF_MAP_UPDATE_ELEM) and unix_stream_connect() to race: when
> sock_map_sk_state_allowed() passes (sk_state == TCP_ESTABLISHED), but
> unix_peer(sk) in unix_stream_bpf_update_proto() _still_ returns NULL:
> 
> 	T0 bpf				T1 connect
> 	======				==========
> 
> 				WRITE_ONCE(sk->sk_state, TCP_ESTABLISHED)
> sock_map_sk_state_allowed(sk)
> ...
> sk_pair = unix_peer(sk)
> sock_hold(sk_pair)
> 				sock_hold(newsk)
> 				smp_mb__after_atomic()
> 				unix_peer(sk) = newsk
> 				unix_state_unlock(sk)
> 
> With mdelay(1) stuffed in unix_stream_connect():
> 
> [  902.277593] BUG: kernel NULL pointer dereference, address: 0000000000000080
> [  902.277633] #PF: supervisor write access in kernel mode
> [  902.277661] #PF: error_code(0x0002) - not-present page
> [  902.277688] PGD 107191067 P4D 107191067 PUD 10f63c067 PMD 0
> [  902.277716] Oops: Oops: 0002 [#23] PREEMPT SMP NOPTI
> [  902.277742] CPU: 2 PID: 1505 Comm: a.out Tainted: G      D            6.10.0-rc1+ #130
> [  902.277769] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Arch Linux 1.16.3-1-1 04/01/2014
> [  902.277793] RIP: 0010:unix_stream_bpf_update_proto+0xa1/0x150
> 
> Setting TCP_ESTABLISHED _after_ unix_peer() fixes the issue, so how about
> something like
> 
> @@ -1631,12 +1631,13 @@ static int unix_stream_connect(struct socket *sock, struct sockaddr *uaddr,
>         /* Set credentials */
>         copy_peercred(sk, other);
> 
> -       sock->state     = SS_CONNECTED;
> -       WRITE_ONCE(sk->sk_state, TCP_ESTABLISHED);
>         sock_hold(newsk);
> +       smp_mb__after_atomic(); /* sock_hold() does an atomic_inc() */
> +       WRITE_ONCE(unix_peer(sk), newsk);
> +       smp_wmb(); /* ensure peer is set before sk_state */
> 
> -       smp_mb__after_atomic(); /* sock_hold() does an atomic_inc() */
> -       unix_peer(sk)   = newsk;
> +       sock->state = SS_CONNECTED;
> +       WRITE_ONCE(sk->sk_state, TCP_ESTABLISHED);
> 
>         unix_state_unlock(sk);
> 
> @@ -180,7 +180,8 @@ int unix_stream_bpf_update_proto(struct sock *sk, struct sk_psock *psock, bool r
>          * be a single matching destroy operation.
>          */
>         if (!psock->sk_pair) {
> -               sk_pair = unix_peer(sk);
> +               smp_rmb();
> +               sk_pair = READ_ONCE(unix_peer(sk));
>                 sock_hold(sk_pair);
>                 psock->sk_pair = sk_pair;
>         }
> 
> This should keep things ordered and lockless... I hope.

sock_map_update_elem() assumes that the socket is protected
by lock_sock(), but AF_UNIX uses it only for the general path.

So, I think we should fix sock_map_sk_state_allowed() and
then use smp_store_release()/smp_load_acquire() rather than
smp_[rw]mb() for unix_peer(sk).

Could you test this with the mdelay(1) change?

Note that we need not touch sock->state.  I have a patch for net-next
that removes all uses of sock->state from AF_UNIX, as we don't rely on
it; even unix_seq_show() depends on sk->sk_state.

---8<---
diff --git a/net/core/sock_map.c b/net/core/sock_map.c
index d3dbb92153f2..67794d2c7498 100644
--- a/net/core/sock_map.c
+++ b/net/core/sock_map.c
@@ -549,7 +549,7 @@ static bool sock_map_sk_state_allowed(const struct sock *sk)
 	if (sk_is_tcp(sk))
 		return (1 << sk->sk_state) & (TCPF_ESTABLISHED | TCPF_LISTEN);
 	if (sk_is_stream_unix(sk))
-		return (1 << sk->sk_state) & TCPF_ESTABLISHED;
+		return (1 << READ_ONCE(sk->sk_state)) & TCPF_ESTABLISHED;
 	return true;
 }
 
diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index 80846279de9f..a558745c7d76 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -1632,11 +1632,11 @@ static int unix_stream_connect(struct socket *sock, struct sockaddr *uaddr,
 	copy_peercred(sk, other);
 
 	sock->state	= SS_CONNECTED;
-	WRITE_ONCE(sk->sk_state, TCP_ESTABLISHED);
 	sock_hold(newsk);
 
 	smp_mb__after_atomic();	/* sock_hold() does an atomic_inc() */
-	unix_peer(sk)	= newsk;
+	smp_store_release(&unix_peer(sk), newsk);
+	WRITE_ONCE(sk->sk_state, TCP_ESTABLISHED);
 
 	unix_state_unlock(sk);
 
diff --git a/net/unix/unix_bpf.c b/net/unix/unix_bpf.c
index bd84785bf8d6..6d9ae8e63901 100644
--- a/net/unix/unix_bpf.c
+++ b/net/unix/unix_bpf.c
@@ -180,7 +180,7 @@ int unix_stream_bpf_update_proto(struct sock *sk, struct sk_psock *psock, bool r
 	 * be a single matching destroy operation.
 	 */
 	if (!psock->sk_pair) {
-		sk_pair = unix_peer(sk);
+		sk_pair = smp_load_acquire(&unix_peer(sk));
 		sock_hold(sk_pair);
 		psock->sk_pair = sk_pair;
 	}
---8<---


> 
> Alternatively, maybe it would be better just to make BPF respect the unix
> state lock?
> 
> @@ -180,6 +180,8 @@ int unix_stream_bpf_update_proto(struct sock *sk, struct sk_psock *psock, bool r
>  	 * be a single matching destroy operation.
>  	 */
>  	if (!psock->sk_pair) {
> +               unix_state_lock(sk);
>                 sk_pair = unix_peer(sk);
> +               unix_state_unlock(sk);
>  		sock_hold(sk_pair);
>  		psock->sk_pair = sk_pair;
> 
> What do you think?

If we go this way, I'd change it like this:

---8<---
diff --git a/net/unix/unix_bpf.c b/net/unix/unix_bpf.c
index bd84785bf8d6..1db42cfee70d 100644
--- a/net/unix/unix_bpf.c
+++ b/net/unix/unix_bpf.c
@@ -159,8 +159,6 @@ int unix_dgram_bpf_update_proto(struct sock *sk, struct sk_psock *psock, bool re
 
 int unix_stream_bpf_update_proto(struct sock *sk, struct sk_psock *psock, bool restore)
 {
-	struct sock *sk_pair;
-
 	/* Restore does not decrement the sk_pair reference yet because we must
 	 * keep the a reference to the socket until after an RCU grace period
 	 * and any pending sends have completed.
@@ -180,9 +178,9 @@ int unix_stream_bpf_update_proto(struct sock *sk, struct sk_psock *psock, bool r
 	 * be a single matching destroy operation.
 	 */
 	if (!psock->sk_pair) {
-		sk_pair = unix_peer(sk);
-		sock_hold(sk_pair);
-		psock->sk_pair = sk_pair;
+		psock->sk_pair = unix_peer_get(sk);
+		if (WARN_ON_ONCE(!psock->sk_pair))
+			return -EINVAL;
 	}
 
 	unix_stream_bpf_check_needs_rebuild(psock->sk_proto);
---8<---


And the _last_ option would be..., no :)

---8<---
diff --git a/include/net/af_unix.h b/include/net/af_unix.h
index b6eedf7650da..c7e31bc3e95e 100644
--- a/include/net/af_unix.h
+++ b/include/net/af_unix.h
@@ -94,8 +94,8 @@ struct unix_sock {
 #define unix_sk(ptr) container_of_const(ptr, struct unix_sock, sk)
 #define unix_peer(sk) (unix_sk(sk)->peer)
 
-#define unix_state_lock(s)	spin_lock(&unix_sk(s)->lock)
-#define unix_state_unlock(s)	spin_unlock(&unix_sk(s)->lock)
+#define unix_state_lock(s)	lock_sock(s)
+#define unix_state_unlock(s)	release_sock(s)
 enum unix_socket_lock_class {
 	U_LOCK_NORMAL,
 	U_LOCK_SECOND,	/* for double locking, see unix_state_double_lock(). */
@@ -108,7 +108,7 @@ enum unix_socket_lock_class {
 static inline void unix_state_lock_nested(struct sock *sk,
 				   enum unix_socket_lock_class subclass)
 {
-	spin_lock_nested(&unix_sk(sk)->lock, subclass);
+	lock_sock_nested(sk, subclass);
 }
 
 #define peer_wait peer_wq.wait
---8<---
Kuniyuki Iwashima June 9, 2024, 9:03 p.m. UTC | #3
From: Kuniyuki Iwashima <kuniyu@amazon.com>
Date: Sun, 9 Jun 2024 12:53:20 -0700
> From: Michal Luczaj <mhal@rbox.co>
> Date: Sun, 9 Jun 2024 13:28:34 +0200
> > On 6/4/24 18:52, Kuniyuki Iwashima wrote:
> > > When a SOCK_DGRAM socket connect()s to another socket, the both sockets'
> > > sk->sk_state are changed to TCP_ESTABLISHED so that we can register them
> > > to BPF SOCKMAP. (...)
> > 
> > Speaking of af_unix and sockmap, SOCK_STREAM has a tiny window for
> > bpf(BPF_MAP_UPDATE_ELEM) and unix_stream_connect() to race: when
> > sock_map_sk_state_allowed() passes (sk_state == TCP_ESTABLISHED), but
> > unix_peer(sk) in unix_stream_bpf_update_proto() _still_ returns NULL:
> > 
> > 	T0 bpf				T1 connect
> > 	======				==========
> > 
> > 				WRITE_ONCE(sk->sk_state, TCP_ESTABLISHED)
> > sock_map_sk_state_allowed(sk)
> > ...
> > sk_pair = unix_peer(sk)
> > sock_hold(sk_pair)
> > 				sock_hold(newsk)
> > 				smp_mb__after_atomic()
> > 				unix_peer(sk) = newsk
> > 				unix_state_unlock(sk)
> > 
> > With mdelay(1) stuffed in unix_stream_connect():
> > 
> > [  902.277593] BUG: kernel NULL pointer dereference, address: 0000000000000080
> > [  902.277633] #PF: supervisor write access in kernel mode
> > [  902.277661] #PF: error_code(0x0002) - not-present page
> > [  902.277688] PGD 107191067 P4D 107191067 PUD 10f63c067 PMD 0
> > [  902.277716] Oops: Oops: 0002 [#23] PREEMPT SMP NOPTI
> > [  902.277742] CPU: 2 PID: 1505 Comm: a.out Tainted: G      D            6.10.0-rc1+ #130
> > [  902.277769] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Arch Linux 1.16.3-1-1 04/01/2014
> > [  902.277793] RIP: 0010:unix_stream_bpf_update_proto+0xa1/0x150
> > 
> > Setting TCP_ESTABLISHED _after_ unix_peer() fixes the issue, so how about
> > something like
> > 
> > @@ -1631,12 +1631,13 @@ static int unix_stream_connect(struct socket *sock, struct sockaddr *uaddr,
> >         /* Set credentials */
> >         copy_peercred(sk, other);
> > 
> > -       sock->state     = SS_CONNECTED;
> > -       WRITE_ONCE(sk->sk_state, TCP_ESTABLISHED);
> >         sock_hold(newsk);
> > +       smp_mb__after_atomic(); /* sock_hold() does an atomic_inc() */
> > +       WRITE_ONCE(unix_peer(sk), newsk);
> > +       smp_wmb(); /* ensure peer is set before sk_state */
> > 
> > -       smp_mb__after_atomic(); /* sock_hold() does an atomic_inc() */
> > -       unix_peer(sk)   = newsk;
> > +       sock->state = SS_CONNECTED;
> > +       WRITE_ONCE(sk->sk_state, TCP_ESTABLISHED);
> > 
> >         unix_state_unlock(sk);
> > 
> > @@ -180,7 +180,8 @@ int unix_stream_bpf_update_proto(struct sock *sk, struct sk_psock *psock, bool r
> >          * be a single matching destroy operation.
> >          */
> >         if (!psock->sk_pair) {
> > -               sk_pair = unix_peer(sk);
> > +               smp_rmb();
> > +               sk_pair = READ_ONCE(unix_peer(sk));
> >                 sock_hold(sk_pair);
> >                 psock->sk_pair = sk_pair;
> >         }
> > 
> > This should keep things ordered and lockless... I hope.
> 
> sock_map_update_elem() assumes that the socket is protected
> by lock_sock(), but AF_UNIX uses it only for the general path.
> 
> So, I think we should fix sock_map_sk_state_allowed() and
> then use smp_store_release()/smp_load_acquire() rather than
> smp_[rw]mb() for unix_peer(sk).

Sorry, I think I was wrong: we can't use smp_store_release() and
smp_load_acquire(); smp_[rw]mb() is needed.

Given that we avoided adding code to the hot path in the original fix
8866730aed510 [0], I'd prefer adding unix_state_lock() to the SOCKMAP
path again.

[0]: https://lore.kernel.org/bpf/6545bc9f7e443_3358c208ae@john.notmuch/

---8<---
diff --git a/net/core/sock_map.c b/net/core/sock_map.c
index d3dbb92153f2..67794d2c7498 100644
--- a/net/core/sock_map.c
+++ b/net/core/sock_map.c
@@ -549,7 +549,7 @@ static bool sock_map_sk_state_allowed(const struct sock *sk)
 	if (sk_is_tcp(sk))
 		return (1 << sk->sk_state) & (TCPF_ESTABLISHED | TCPF_LISTEN);
 	if (sk_is_stream_unix(sk))
-		return (1 << sk->sk_state) & TCPF_ESTABLISHED;
+		return (1 << READ_ONCE(sk->sk_state)) & TCPF_ESTABLISHED;
 	return true;
 }
 
diff --git a/net/unix/unix_bpf.c b/net/unix/unix_bpf.c
index bd84785bf8d6..1db42cfee70d 100644
--- a/net/unix/unix_bpf.c
+++ b/net/unix/unix_bpf.c
@@ -159,8 +159,6 @@ int unix_dgram_bpf_update_proto(struct sock *sk, struct sk_psock *psock, bool re
 
 int unix_stream_bpf_update_proto(struct sock *sk, struct sk_psock *psock, bool restore)
 {
-	struct sock *sk_pair;
-
 	/* Restore does not decrement the sk_pair reference yet because we must
 	 * keep the a reference to the socket until after an RCU grace period
 	 * and any pending sends have completed.
@@ -180,9 +178,9 @@ int unix_stream_bpf_update_proto(struct sock *sk, struct sk_psock *psock, bool r
 	 * be a single matching destroy operation.
 	 */
 	if (!psock->sk_pair) {
-		sk_pair = unix_peer(sk);
-		sock_hold(sk_pair);
-		psock->sk_pair = sk_pair;
+		psock->sk_pair = unix_peer_get(sk);
+		if (WARN_ON_ONCE(!psock->sk_pair))
+			return -EINVAL;
 	}
 
 	unix_stream_bpf_check_needs_rebuild(psock->sk_proto);
---8<---
Michal Luczaj June 10, 2024, 12:55 p.m. UTC | #4
On 6/9/24 23:03, Kuniyuki Iwashima wrote:
> (...)
> Sorry, I think I was wrong and we can't use smp_store_release()
> and smp_load_acquire(), and smp_[rw]mb() is needed.
> 
> Given we avoid adding code in the hotpath in the original fix
> 8866730aed510 [0], I prefer adding unix_state_lock() in the SOCKMAP
> path again.
>
> [0]: https://lore.kernel.org/bpf/6545bc9f7e443_3358c208ae@john.notmuch/

You're saying smp_wmb() in connect() is too much for the hot path, do I
understand correctly?

> ---8<---
> diff --git a/net/core/sock_map.c b/net/core/sock_map.c
> index d3dbb92153f2..67794d2c7498 100644
> --- a/net/core/sock_map.c
> +++ b/net/core/sock_map.c
> @@ -549,7 +549,7 @@ static bool sock_map_sk_state_allowed(const struct sock *sk)
>  	if (sk_is_tcp(sk))
>  		return (1 << sk->sk_state) & (TCPF_ESTABLISHED | TCPF_LISTEN);
>  	if (sk_is_stream_unix(sk))
> -		return (1 << sk->sk_state) & TCPF_ESTABLISHED;
> +		return (1 << READ_ONCE(sk->sk_state)) & TCPF_ESTABLISHED;
>  	return true;
>  }
>  
> diff --git a/net/unix/unix_bpf.c b/net/unix/unix_bpf.c
> index bd84785bf8d6..1db42cfee70d 100644
> --- a/net/unix/unix_bpf.c
> +++ b/net/unix/unix_bpf.c
> @@ -159,8 +159,6 @@ int unix_dgram_bpf_update_proto(struct sock *sk, struct sk_psock *psock, bool re
>  
>  int unix_stream_bpf_update_proto(struct sock *sk, struct sk_psock *psock, bool restore)
>  {
> -	struct sock *sk_pair;
> -
>  	/* Restore does not decrement the sk_pair reference yet because we must
>  	 * keep the a reference to the socket until after an RCU grace period
>  	 * and any pending sends have completed.
> @@ -180,9 +178,9 @@ int unix_stream_bpf_update_proto(struct sock *sk, struct sk_psock *psock, bool r
>  	 * be a single matching destroy operation.
>  	 */
>  	if (!psock->sk_pair) {
> -		sk_pair = unix_peer(sk);
> -		sock_hold(sk_pair);
> -		psock->sk_pair = sk_pair;
> +		psock->sk_pair = unix_peer_get(sk);
> +		if (WARN_ON_ONCE(!psock->sk_pair))
> +			return -EINVAL;
>  	}
>  
>  	unix_stream_bpf_check_needs_rebuild(psock->sk_proto);
> ---8<---

FWIW, we've already passed sock_map_sk_state_allowed(), so the critical
section can be empty:

	if (!psock->sk_pair) {
		unix_state_lock(sk);
		unix_state_unlock(sk);
		sk_pair = READ_ONCE(unix_peer(sk));
		...
	}

Thanks,
Michal
Kuniyuki Iwashima June 10, 2024, 5:49 p.m. UTC | #5
From: Michal Luczaj <mhal@rbox.co>
Date: Mon, 10 Jun 2024 14:55:08 +0200
> On 6/9/24 23:03, Kuniyuki Iwashima wrote:
> > (...)
> > Sorry, I think I was wrong and we can't use smp_store_release()
> > and smp_load_acquire(), and smp_[rw]mb() is needed.
> > 
> > Given we avoid adding code in the hotpath in the original fix
> > 8866730aed510 [0], I prefer adding unix_state_lock() in the SOCKMAP
> > path again.
> >
> > [0]: https://lore.kernel.org/bpf/6545bc9f7e443_3358c208ae@john.notmuch/
> 
> You're saying smp_wmb() in connect() is too much for the hot path, do I
> understand correctly?

Yes, and now I think WARN_ON_ONCE() would be enough, because it's unlikely
that the delay happens between the two store ops while a concurrent bpf()
is in progress.

If syzkaller manages to hit this on a vanilla kernel, we can revisit.

Then we could probably just do s/WARN_ON_ONCE/unlikely/, because users
who call bpf() in such a way know that the state was TCP_CLOSE while
calling bpf().

---8<---
diff --git a/net/unix/unix_bpf.c b/net/unix/unix_bpf.c
index bd84785bf8d6..46dc747349f2 100644
--- a/net/unix/unix_bpf.c
+++ b/net/unix/unix_bpf.c
@@ -181,6 +181,9 @@ int unix_stream_bpf_update_proto(struct sock *sk, struct sk_psock *psock, bool r
 	 */
 	if (!psock->sk_pair) {
 		sk_pair = unix_peer(sk);
+		if (WARN_ON_ONCE(!sk_pair))
+			return -EINVAL;
+
 		sock_hold(sk_pair);
 		psock->sk_pair = sk_pair;
 	}
---8<---
Michal Luczaj June 16, 2024, 11:28 p.m. UTC | #6
On 6/10/24 19:49, Kuniyuki Iwashima wrote:
> From: Michal Luczaj <mhal@rbox.co>
> Date: Mon, 10 Jun 2024 14:55:08 +0200
>> On 6/9/24 23:03, Kuniyuki Iwashima wrote:
>>> (...)
>>> Sorry, I think I was wrong and we can't use smp_store_release()
>>> and smp_load_acquire(), and smp_[rw]mb() is needed.
>>>
>>> Given we avoid adding code in the hotpath in the original fix
>>> 8866730aed510 [0], I prefer adding unix_state_lock() in the SOCKMAP
>>> path again.
>>>
>>> [0]: https://lore.kernel.org/bpf/6545bc9f7e443_3358c208ae@john.notmuch/
>>
>> You're saying smp_wmb() in connect() is too much for the hot path, do I
>> understand correctly?
> 
> Yes, and now I think WARN_ON_ONCE() would be enough because it's unlikely
> that the delay happens between the two store ops and concurrent bpf()
> is in progress.
> 
> If syzkaller was able to hit this on vanilla kernel, we can revisit.
> 
> Then, probably we could just do s/WARN_ON_ONCE/unlikely/ because users
> who call bpf() in such a way know that the state was TCP_CLOSE while
> calling bpf().
> 
> ---8<---
> diff --git a/net/unix/unix_bpf.c b/net/unix/unix_bpf.c
> index bd84785bf8d6..46dc747349f2 100644
> --- a/net/unix/unix_bpf.c
> +++ b/net/unix/unix_bpf.c
> @@ -181,6 +181,9 @@ int unix_stream_bpf_update_proto(struct sock *sk, struct sk_psock *psock, bool r
>  	 */
>  	if (!psock->sk_pair) {
>  		sk_pair = unix_peer(sk);
> +		if (WARN_ON_ONCE(!sk_pair))
> +			return -EINVAL;
> +
>  		sock_hold(sk_pair);
>  		psock->sk_pair = sk_pair;
>  	}
> ---8<---

Oh. That's a peculiar approach :) But, hey, it's your call.

Another AF_UNIX sockmap issue involves OOB. When an OOB packet is sent, the
skb is added to the recv queue, but u->oob_skb is also set. Here's the
problem: when this skb goes through bpf_sk_redirect_map() and is moved
between socks, oob_skb remains set on the original sock.

[   23.688994] WARNING: CPU: 2 PID: 993 at net/unix/garbage.c:351 unix_collect_queue+0x6c/0xb0
[   23.689019] CPU: 2 PID: 993 Comm: kworker/u32:13 Not tainted 6.10.0-rc2+ #137
[   23.689021] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Arch Linux 1.16.3-1-1 04/01/2014
[   23.689024] Workqueue: events_unbound __unix_gc
[   23.689027] RIP: 0010:unix_collect_queue+0x6c/0xb0

I wanted to write a patch, but then I realized I'm not sure what the
expected behaviour is. Should the oob_skb setting follow the skb to its new
sock, or should it be dropped (similarly to what happens today with
scm_fp_list, i.e. redirect strips inflights)?
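
For context, a minimal sk_skb verdict sketch of the redirect path in
question (the map name, key, and program layout are illustrative, not
from an actual reproducer):

// SPDX-License-Identifier: GPL-2.0
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

struct {
	__uint(type, BPF_MAP_TYPE_SOCKMAP);
	__uint(max_entries, 2);
	__type(key, __u32);
	__type(value, __u64);
} sock_map SEC(".maps");

SEC("sk_skb/verdict")
int oob_redirect(struct __sk_buff *skb)
{
	/* Redirect every skb to the sock in slot 1.  If the skb carries
	 * the OOB byte, u->oob_skb on the original sock keeps pointing
	 * at it after the move.
	 */
	return bpf_sk_redirect_map(skb, &sock_map, 1, 0);
}

char _license[] SEC("license") = "GPL";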
Kuniyuki Iwashima June 17, 2024, 6:21 p.m. UTC | #7
From: Michal Luczaj <mhal@rbox.co>
Date: Mon, 17 Jun 2024 01:28:52 +0200
> On 6/10/24 19:49, Kuniyuki Iwashima wrote:
> > From: Michal Luczaj <mhal@rbox.co>
> > Date: Mon, 10 Jun 2024 14:55:08 +0200
> >> On 6/9/24 23:03, Kuniyuki Iwashima wrote:
> >>> (...)
> >>> Sorry, I think I was wrong and we can't use smp_store_release()
> >>> and smp_load_acquire(), and smp_[rw]mb() is needed.
> >>>
> >>> Given we avoid adding code in the hotpath in the original fix
> >>> 8866730aed510 [0], I prefer adding unix_state_lock() in the SOCKMAP
> >>> path again.
> >>>
> >>> [0]: https://lore.kernel.org/bpf/6545bc9f7e443_3358c208ae@john.notmuch/
> >>
> >> You're saying smp_wmb() in connect() is too much for the hot path, do I
> >> understand correctly?
> > 
> > Yes, and now I think WARN_ON_ONCE() would be enough because it's unlikely
> > that the delay happens between the two store ops and concurrent bpf()
> > is in progress.
> > 
> > If syzkaller was able to hit this on vanilla kernel, we can revisit.
> > 
> > Then, probably we could just do s/WARN_ON_ONCE/unlikely/ because users
> > who call bpf() in such a way know that the state was TCP_CLOSE while
> > calling bpf().
> > 
> > ---8<---
> > diff --git a/net/unix/unix_bpf.c b/net/unix/unix_bpf.c
> > index bd84785bf8d6..46dc747349f2 100644
> > --- a/net/unix/unix_bpf.c
> > +++ b/net/unix/unix_bpf.c
> > @@ -181,6 +181,9 @@ int unix_stream_bpf_update_proto(struct sock *sk, struct sk_psock *psock, bool r
> >  	 */
> >  	if (!psock->sk_pair) {
> >  		sk_pair = unix_peer(sk);
> > +		if (WARN_ON_ONCE(!sk_pair))
> > +			return -EINVAL;
> > +
> >  		sock_hold(sk_pair);
> >  		psock->sk_pair = sk_pair;
> >  	}
> > ---8<---
> 
> Oh. That's a peculiar approach :) But, hey, it's your call.
> 
> Another AF_UNIX sockmap issue is with OOB. When OOB packet is sent, skb is
> added to recv queue, but also u->oob_skb is set. Here's the problem: when
> this skb goes through bpf_sk_redirect_map() and is moved between socks,
> oob_skb remains set on the original sock.

Good catch!

> 
> [   23.688994] WARNING: CPU: 2 PID: 993 at net/unix/garbage.c:351 unix_collect_queue+0x6c/0xb0
> [   23.689019] CPU: 2 PID: 993 Comm: kworker/u32:13 Not tainted 6.10.0-rc2+ #137
> [   23.689021] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Arch Linux 1.16.3-1-1 04/01/2014
> [   23.689024] Workqueue: events_unbound __unix_gc
> [   23.689027] RIP: 0010:unix_collect_queue+0x6c/0xb0
> 
> I wanted to write a patch, but then I realized I'm not sure what's the
> expected behaviour. Should the oob_skb setting follow to the skb's new sock
> or should it be dropped (similarly to what is happening today with
> scm_fp_list, i.e. redirect strips inflights)?

The former would require large refactoring, as we'd need to check whether
the redirect happens with BPF_F_INGRESS, whether the redirected sk is also
SOCK_STREAM, etc.

So, I'd go with the latter.  Probably we can check in unix_stream_read_skb()
whether the skb is u->oob_skb, drop the OOB data and retry the next skb,
and forbid MSG_OOB in unix_bpf_recvmsg().

Both features were merged in 5.15, with OOB landing a month later than
SOCKMAP, so the Fixes tag would be 314001f0bf927 again; with a redirected
OOB msg, ioctl(SIOCATMARK) (and epoll(EPOLLPRI), after d9a232d435dcc was
backported to all stable trees) lies.

Thanks!
Michal Luczaj June 19, 2024, 6:14 p.m. UTC | #8
On 6/17/24 20:21, Kuniyuki Iwashima wrote:
> From: Michal Luczaj <mhal@rbox.co>
> Date: Mon, 17 Jun 2024 01:28:52 +0200
>> (...)
>> Another AF_UNIX sockmap issue is with OOB. When OOB packet is sent, skb is
>> added to recv queue, but also u->oob_skb is set. Here's the problem: when
>> this skb goes through bpf_sk_redirect_map() and is moved between socks,
>> oob_skb remains set on the original sock.
> 
> Good catch!
> 
>>
>> [   23.688994] WARNING: CPU: 2 PID: 993 at net/unix/garbage.c:351 unix_collect_queue+0x6c/0xb0
>> [   23.689019] CPU: 2 PID: 993 Comm: kworker/u32:13 Not tainted 6.10.0-rc2+ #137
>> [   23.689021] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Arch Linux 1.16.3-1-1 04/01/2014
>> [   23.689024] Workqueue: events_unbound __unix_gc
>> [   23.689027] RIP: 0010:unix_collect_queue+0x6c/0xb0
>>
>> I wanted to write a patch, but then I realized I'm not sure what's the
>> expected behaviour. Should the oob_skb setting follow to the skb's new sock
>> or should it be dropped (similarly to what is happening today with
>> scm_fp_list, i.e. redirect strips inflights)?
> 
> The former will require large refactoring as we need to check if the
> redirect happens for BPF_F_INGRESS and if the redirected sk is also
> SOCK_STREAM etc.
> 
> So, I'd go with the latter.  Probably we can check if skb is u->oob_skb
> and drop OOB data and retry next in unix_stream_read_skb(), and forbid
> MSG_OOB in unix_bpf_recvmsg().
> (...)

Yeah, sounds reasonable. I'm just not sure I understand the retry part. For
each skb_queue_tail() there's one ->sk_data_ready() (which does
->read_skb()). Why bother with a retry?

This is what I was thinking:

 static int unix_stream_read_skb(struct sock *sk, skb_read_actor_t recv_actor)
 {
+	struct unix_sock *u = unix_sk(sk);
+	struct sk_buff *skb;
+	int err;
+
 	if (unlikely(READ_ONCE(sk->sk_state) != TCP_ESTABLISHED))
 		return -ENOTCONN;
 
-	return unix_read_skb(sk, recv_actor);
+	mutex_lock(&u->iolock);
+	skb = skb_recv_datagram(sk, MSG_DONTWAIT, &err);
+
+#if IS_ENABLED(CONFIG_AF_UNIX_OOB)
+	if (skb) {
+		bool drop = false;
+
+		spin_lock(&sk->sk_receive_queue.lock);
+		if (skb == u->oob_skb) {
+			WRITE_ONCE(u->oob_skb, NULL);
+			drop = true;
+		}
+		spin_unlock(&sk->sk_receive_queue.lock);
+
+		if (drop) {
+			WARN_ON_ONCE(skb_unref(skb));
+			kfree_skb(skb);
+			skb = NULL;
+			err = 0;
+		}
+	}
+#endif
+
+	mutex_unlock(&u->iolock);
+	return skb ? recv_actor(sk, skb) : err;
 }
Kuniyuki Iwashima June 19, 2024, 7:19 p.m. UTC | #9
From: Michal Luczaj <mhal@rbox.co>
Date: Wed, 19 Jun 2024 20:14:48 +0200
> On 6/17/24 20:21, Kuniyuki Iwashima wrote:
> > From: Michal Luczaj <mhal@rbox.co>
> > Date: Mon, 17 Jun 2024 01:28:52 +0200
> >> (...)
> >> Another AF_UNIX sockmap issue is with OOB. When OOB packet is sent, skb is
> >> added to recv queue, but also u->oob_skb is set. Here's the problem: when
> >> this skb goes through bpf_sk_redirect_map() and is moved between socks,
> >> oob_skb remains set on the original sock.
> > 
> > Good catch!
> > 
> >>
> >> [   23.688994] WARNING: CPU: 2 PID: 993 at net/unix/garbage.c:351 unix_collect_queue+0x6c/0xb0
> >> [   23.689019] CPU: 2 PID: 993 Comm: kworker/u32:13 Not tainted 6.10.0-rc2+ #137
> >> [   23.689021] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Arch Linux 1.16.3-1-1 04/01/2014
> >> [   23.689024] Workqueue: events_unbound __unix_gc
> >> [   23.689027] RIP: 0010:unix_collect_queue+0x6c/0xb0
> >>
> >> I wanted to write a patch, but then I realized I'm not sure what's the
> >> expected behaviour. Should the oob_skb setting follow to the skb's new sock
> >> or should it be dropped (similarly to what is happening today with
> >> scm_fp_list, i.e. redirect strips inflights)?
> > 
> > The former will require large refactoring as we need to check if the
> > redirect happens for BPF_F_INGRESS and if the redirected sk is also
> > SOCK_STREAM etc.
> > 
> > So, I'd go with the latter.  Probably we can check if skb is u->oob_skb
> > and drop OOB data and retry next in unix_stream_read_skb(), and forbid
> > MSG_OOB in unix_bpf_recvmsg().
> > (...)
> 
> Yeah, sounds reasonable. I'm just not sure I understand the retry part. For
> each skb_queue_tail() there's one ->sk_data_ready() (which does
> ->read_skb()). Why bother with a retry?

Exactly.


> 
> This is what I was thinking:
>

When you post it, please make sure to CC bpf@ and sockmap maintainers too.
Michal Luczaj June 20, 2024, 8:35 p.m. UTC | #10
On 6/19/24 21:19, Kuniyuki Iwashima wrote:
> From: Michal Luczaj <mhal@rbox.co>
> Date: Wed, 19 Jun 2024 20:14:48 +0200
>> On 6/17/24 20:21, Kuniyuki Iwashima wrote:
>>> From: Michal Luczaj <mhal@rbox.co>
>>> Date: Mon, 17 Jun 2024 01:28:52 +0200
>>>> (...)
>>>> Another AF_UNIX sockmap issue is with OOB. When OOB packet is sent, skb is
>>>> added to recv queue, but also u->oob_skb is set. Here's the problem: when
>>>> this skb goes through bpf_sk_redirect_map() and is moved between socks,
>>>> oob_skb remains set on the original sock.
>>>
>>> Good catch!
>>>
>>>>
>>>> [   23.688994] WARNING: CPU: 2 PID: 993 at net/unix/garbage.c:351 unix_collect_queue+0x6c/0xb0
>>>> [   23.689019] CPU: 2 PID: 993 Comm: kworker/u32:13 Not tainted 6.10.0-rc2+ #137
>>>> [   23.689021] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Arch Linux 1.16.3-1-1 04/01/2014
>>>> [   23.689024] Workqueue: events_unbound __unix_gc
>>>> [   23.689027] RIP: 0010:unix_collect_queue+0x6c/0xb0
>>>>
>>>> I wanted to write a patch, but then I realized I'm not sure what's the
>>>> expected behaviour. Should the oob_skb setting follow to the skb's new sock
>>>> or should it be dropped (similarly to what is happening today with
>>>> scm_fp_list, i.e. redirect strips inflights)?
>>>
>>> The former will require large refactoring as we need to check if the
>>> redirect happens for BPF_F_INGRESS and if the redirected sk is also
>>> SOCK_STREAM etc.
>>>
>>> So, I'd go with the latter.  Probably we can check if skb is u->oob_skb
>>> and drop OOB data and retry next in unix_stream_read_skb(), and forbid
>>> MSG_OOB in unix_bpf_recvmsg().
>>> (...)
>>
>> Yeah, sounds reasonable. I'm just not sure I understand the retry part. For
>> each skb_queue_tail() there's one ->sk_data_ready() (which does
>> ->read_skb()). Why bother with a retry?
> 
> Exactly.
> 
> 
>>
>> This is what I was thinking:
>>
> 
> When you post it, please make sure to CC bpf@ and sockmap maintainers too.

Done: https://lore.kernel.org/netdev/20240620203009.2610301-1-mhal@rbox.co/
Thanks!

In fact, should I try to document those not-so-obvious OOB/sockmap
interactions? And speaking of documentation, an astute reader noted that
`man unix` is lying:

  Sockets API
       ...
       UNIX domain sockets do not support the transmission of out-of-band
       data (the MSG_OOB flag for send(2) and recv(2)).

  NOTES
       ...
       UNIX domain stream sockets do not support the notion of out-of-band
       data.
Kuniyuki Iwashima June 20, 2024, 9:56 p.m. UTC | #11
From: Michal Luczaj <mhal@rbox.co>
Date: Thu, 20 Jun 2024 22:35:55 +0200
> In fact, should I try to document those not-so-obvious OOB/sockmap
> interaction? And speaking of documentation, an astute reader noted that
> `man unix` is lying:

At least I wouldn't update the man page until we can say AF_UNIX MSG_OOB
handling is stable enough; the behaviour is not compliant with TCP now.

While thoroughly rewriting the oob test, I found a few more weird
behaviours, and patches will follow.

For example:

  >>> from socket import *
  >>> c1, c2 = socketpair(AF_UNIX)
  >>> c1.send(b'hello', MSG_OOB)
  5
  >>> c1.send(b'world')
  5
  >>> c2.recv(10)
  b'hell'
  >>> c2.recv(1, MSG_OOB)
  b'o'
  >>> c2.setblocking(False)  # This causes -EAGAIN even with available data
  >>> c2.recv(5)
  Traceback (most recent call last):
    File "<stdin>", line 1, in <module>
  BlockingIOError: [Errno 11] Resource temporarily unavailable
  >>> c2.recv(5)
  b'world'


And we need

---8<---
diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index 5e695a9a609c..2875a7ce1887 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -2614,9 +2614,20 @@ static struct sk_buff *manage_oob(struct sk_buff *skb, struct sock *sk,
 	struct unix_sock *u = unix_sk(sk);
 
 	if (!unix_skb_len(skb) && !(flags & MSG_PEEK)) {
-		skb_unlink(skb, &sk->sk_receive_queue);
-		consume_skb(skb);
-		skb = NULL;
+		struct sk_buff *unlinked_skb = skb;
+
+		spin_lock(&sk->sk_receive_queue.lock);
+
+		__skb_unlink(skb, &sk->sk_receive_queue);
+
+		if (copied)
+			skb = NULL;
+		else
+			skb = skb_peek(&sk->sk_receive_queue);
+
+		spin_unlock(&sk->sk_receive_queue.lock);
+
+		consume_skb(unlinked_skb);
 	} else {
 		struct sk_buff *unlinked_skb = NULL;
 
---8<---
Michal Luczaj June 22, 2024, 10:43 p.m. UTC | #12
On 6/20/24 23:56, Kuniyuki Iwashima wrote:
> From: Michal Luczaj <mhal@rbox.co>
> Date: Thu, 20 Jun 2024 22:35:55 +0200
>> In fact, should I try to document those not-so-obvious OOB/sockmap
>> interaction? And speaking of documentation, an astute reader noted that
>> `man unix` is lying:
> 
> At least I wouldn't update man until we can say AF_UNIX MSG_OOB handling
> is stable enough; the behaviour is not compliant with TCP now.

Sure, I get it.

> (...)
> And we need
> 
> ---8<---
> diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
> index 5e695a9a609c..2875a7ce1887 100644
> --- a/net/unix/af_unix.c
> +++ b/net/unix/af_unix.c
> @@ -2614,9 +2614,20 @@ static struct sk_buff *manage_oob(struct sk_buff *skb, struct sock *sk,
>  	struct unix_sock *u = unix_sk(sk);
>  
>  	if (!unix_skb_len(skb) && !(flags & MSG_PEEK)) {
> -		skb_unlink(skb, &sk->sk_receive_queue);
> -		consume_skb(skb);
> -		skb = NULL;
> +		struct sk_buff *unlinked_skb = skb;
> +
> +		spin_lock(&sk->sk_receive_queue.lock);
> +
> +		__skb_unlink(skb, &sk->sk_receive_queue);
> +
> +		if (copied)
> +			skb = NULL;
> +		else
> +			skb = skb_peek(&sk->sk_receive_queue);
> +
> +		spin_unlock(&sk->sk_receive_queue.lock);
> +
> +		consume_skb(unlinked_skb);
>  	} else {
>  		struct sk_buff *unlinked_skb = NULL;
>  
> ---8<---

I gotta ask, is there a reason for unlinking an already consumed
('consumed' as in 'unix_skb_len(skb) == 0') skb so late, in manage_oob()?
IOW, can't it be unlinked immediately once it's consumed in
unix_stream_recv_urg()? I suppose that would simplify things.
Kuniyuki Iwashima June 23, 2024, 5:19 a.m. UTC | #13
From: Michal Luczaj <mhal@rbox.co>
Date: Sun, 23 Jun 2024 00:43:27 +0200
> On 6/20/24 23:56, Kuniyuki Iwashima wrote:
> > From: Michal Luczaj <mhal@rbox.co>
> > Date: Thu, 20 Jun 2024 22:35:55 +0200
> >> In fact, should I try to document those not-so-obvious OOB/sockmap
> >> interaction? And speaking of documentation, an astute reader noted that
> >> `man unix` is lying:
> > 
> > At least I wouldn't update man until we can say AF_UNIX MSG_OOB handling
> > is stable enough; the behaviour is not compliant with TCP now.
> 
> Sure, I get it.
> 
> > (...)
> > And we need
> > 
> > ---8<---
> > diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
> > index 5e695a9a609c..2875a7ce1887 100644
> > --- a/net/unix/af_unix.c
> > +++ b/net/unix/af_unix.c
> > @@ -2614,9 +2614,20 @@ static struct sk_buff *manage_oob(struct sk_buff *skb, struct sock *sk,
> >  	struct unix_sock *u = unix_sk(sk);
> >  
> >  	if (!unix_skb_len(skb) && !(flags & MSG_PEEK)) {
> > -		skb_unlink(skb, &sk->sk_receive_queue);
> > -		consume_skb(skb);
> > -		skb = NULL;
> > +		struct sk_buff *unlinked_skb = skb;
> > +
> > +		spin_lock(&sk->sk_receive_queue.lock);
> > +
> > +		__skb_unlink(skb, &sk->sk_receive_queue);
> > +
> > +		if (copied)
> > +			skb = NULL;
> > +		else
> > +			skb = skb_peek(&sk->sk_receive_queue);
> > +
> > +		spin_unlock(&sk->sk_receive_queue.lock);
> > +
> > +		consume_skb(unlinked_skb);
> >  	} else {
> >  		struct sk_buff *unlinked_skb = NULL;
> >  
> > ---8<---
> 
> I gotta ask, is there a reason for unlinking an already consumed
> ('consumed' as in 'unix_skb_len(skb) == 0') skb so late, in manage_oob()?
> IOW, can't it be unlinked immediately once it's consumed in
> unix_stream_recv_urg()? I suppose that would simplify things.

I also thought that before, but we can't do that.

Even after reading OOB data, we need to remember the position
and break recv() at that point.  That's why the skb is unlinked
in manage_oob() rather than unix_stream_recv_urg().
Michal Luczaj June 26, 2024, 10:48 a.m. UTC | #14
On 6/23/24 07:19, Kuniyuki Iwashima wrote:
> From: Michal Luczaj <mhal@rbox.co>
> Date: Sun, 23 Jun 2024 00:43:27 +0200
>> I gotta ask, is there a reason for unlinking an already consumed
>> ('consumed' as in 'unix_skb_len(skb) == 0') skb so late, in manage_oob()?
>> IOW, can't it be unlinked immediately once it's consumed in
>> unix_stream_recv_urg()? I suppose that would simplify things.
> 
> I also thought that before, but we can't do that.
> 
> Even after reading OOB data, we need to remember the position
> and break recv() at that point.  That's why the skb is unlinked
> in manage_oob() rather than unix_stream_recv_urg().

Ahh, I see. Thanks for explaining.

One more thing about unix sockmap. AF_UNIX SOCK_DGRAM supports 0-length
packets. But sockmap doesn't handle that; once a 0-length skb/msg is in the
psock queue, unix_bpf_recvmsg() starts throwing -EFAULT. Sockmap'ed AF_INET
SOCK_DGRAM does the same, so is this a bug or a feature?
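
A minimal sketch of the behaviour in question (assuming no sockmap
attached, where the empty datagram is delivered normally; with sv[1]
sockmap'ed, unix_bpf_recvmsg() reportedly returns -EFAULT instead):

#include <stdio.h>
#include <sys/socket.h>

int main(void)
{
	int sv[2];
	char buf[8];

	socketpair(AF_UNIX, SOCK_DGRAM, 0, sv);
	send(sv[0], "", 0, 0);			/* 0-length packet */

	/* Prints 0 here; the psock path fails on the same packet. */
	printf("recv: %zd\n", recv(sv[1], buf, sizeof(buf), 0));
	return 0;
}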
Kuniyuki Iwashima June 26, 2024, 9:56 p.m. UTC | #15
From: Michal Luczaj <mhal@rbox.co>
Date: Wed, 26 Jun 2024 12:48:27 +0200
> On 6/23/24 07:19, Kuniyuki Iwashima wrote:
> > From: Michal Luczaj <mhal@rbox.co>
> > Date: Sun, 23 Jun 2024 00:43:27 +0200
> >> I gotta ask, is there a reason for unlinking an already consumed
> >> ('consumed' as in 'unix_skb_len(skb) == 0') skb so late, in manage_oob()?
> >> IOW, can't it be unlinked immediately once it's consumed in
> >> unix_stream_recv_urg()? I suppose that would simplify things.
> > 
> > I also thought that before, but we can't do that.
> > 
> > Even after reading OOB data, we need to remember the position
> > and break recv() at that point.  That's why the skb is unlinked
> > in manage_oob() rather than unix_stream_recv_urg().
> 
> Ahh, I see. Thanks for explaining.
> 
> One more thing about unix sockmap. AF_UNIX SOCK_DGRAM supports 0-length
> packets. But sockmap doesn't handle that; once a 0-length skb/msg is in the
> psock queue, unix_bpf_recvmsg() starts throwing -EFAULT. Sockmap'ed AF_INET
> SOCK_DGRAM does the same, so is this a bug or a feature?

I guess it's a kind of safeguard.

The retval 0 has a special meaning for SOCK_STREAM: EOF/shutdown().
If we passed a 0-byte dgram through to a SOCK_STREAM sk, the application
would be confused, as if the original peer had disconnected.

At least -EFAULT avoids such confusion, so that only the true peer can
trigger a 0-byte return via the saved ->recvmsg().

So, the requirement would be similar to scm handling: we need to
recognize the sockmap verdict and destination to support the full
feature set.
diff mbox series

Patch

diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index 25b49efc0926..b162164b7a42 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -570,7 +570,6 @@  static void unix_dgram_disconnected(struct sock *sk, struct sock *other)
 			sk_error_report(other);
 		}
 	}
-	other->sk_state = TCP_CLOSE;
 }
 
 static void unix_sock_destructor(struct sock *sk)
@@ -1424,8 +1423,15 @@  static int unix_dgram_connect(struct socket *sock, struct sockaddr *addr,
 
 		unix_state_double_unlock(sk, other);
 
-		if (other != old_peer)
+		if (other != old_peer) {
 			unix_dgram_disconnected(sk, old_peer);
+
+			unix_state_lock(old_peer);
+			if (!unix_peer(old_peer))
+				WRITE_ONCE(old_peer->sk_state, TCP_CLOSE);
+			unix_state_unlock(old_peer);
+		}
+
 		sock_put(old_peer);
 	} else {
 		unix_peer(sk) = other;