| Message ID | 20240604165241.44758-2-kuniyu@amazon.com (mailing list archive) |
| --- | --- |
| State | Accepted |
| Commit | 26bfb8b57063f52b867f9b6c8d1742fcb5bd656c |
| Delegated to | Netdev Maintainers |
| Series | af_unix: Fix lockless access of sk->sk_state and other fields. |
On 6/4/24 18:52, Kuniyuki Iwashima wrote:
> When a SOCK_DGRAM socket connect()s to another socket, both sockets'
> sk->sk_state are changed to TCP_ESTABLISHED so that we can register them
> to BPF SOCKMAP.
(...)

Speaking of af_unix and sockmap, SOCK_STREAM has a tiny window for
bpf(BPF_MAP_UPDATE_ELEM) and unix_stream_connect() to race: when
sock_map_sk_state_allowed() passes (sk_state == TCP_ESTABLISHED), but
unix_peer(sk) in unix_stream_bpf_update_proto() _still_ returns NULL:

T0 bpf                          T1 connect
======                          ==========

                                WRITE_ONCE(sk->sk_state, TCP_ESTABLISHED)
sock_map_sk_state_allowed(sk)
...
sk_pair = unix_peer(sk)
sock_hold(sk_pair)
                                sock_hold(newsk)
                                smp_mb__after_atomic()
                                unix_peer(sk) = newsk
                                unix_state_unlock(sk)

With mdelay(1) stuffed in unix_stream_connect():

[  902.277593] BUG: kernel NULL pointer dereference, address: 0000000000000080
[  902.277633] #PF: supervisor write access in kernel mode
[  902.277661] #PF: error_code(0x0002) - not-present page
[  902.277688] PGD 107191067 P4D 107191067 PUD 10f63c067 PMD 0
[  902.277716] Oops: Oops: 0002 [#23] PREEMPT SMP NOPTI
[  902.277742] CPU: 2 PID: 1505 Comm: a.out Tainted: G      D            6.10.0-rc1+ #130
[  902.277769] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Arch Linux 1.16.3-1-1 04/01/2014
[  902.277793] RIP: 0010:unix_stream_bpf_update_proto+0xa1/0x150

Setting TCP_ESTABLISHED _after_ unix_peer() fixes the issue, so how about
something like

@@ -1631,12 +1631,13 @@ static int unix_stream_connect(struct socket *sock, struct sockaddr *uaddr,
 	/* Set credentials */
 	copy_peercred(sk, other);
 
-	sock->state = SS_CONNECTED;
-	WRITE_ONCE(sk->sk_state, TCP_ESTABLISHED);
 	sock_hold(newsk);
+	smp_mb__after_atomic(); /* sock_hold() does an atomic_inc() */
+	WRITE_ONCE(unix_peer(sk), newsk);
+	smp_wmb(); /* ensure peer is set before sk_state */
 
-	smp_mb__after_atomic(); /* sock_hold() does an atomic_inc() */
-	unix_peer(sk) = newsk;
+	sock->state = SS_CONNECTED;
+	WRITE_ONCE(sk->sk_state, TCP_ESTABLISHED);
 
 	unix_state_unlock(sk);

@@ -180,7 +180,8 @@ int unix_stream_bpf_update_proto(struct sock *sk, struct sk_psock *psock, bool r
 	 * be a single matching destroy operation.
 	 */
 	if (!psock->sk_pair) {
-		sk_pair = unix_peer(sk);
+		smp_rmb();
+		sk_pair = READ_ONCE(unix_peer(sk));
 		sock_hold(sk_pair);
 		psock->sk_pair = sk_pair;
 	}

This should keep things ordered and lockless... I hope.

Alternatively, maybe it would be better just to make BPF respect the unix
state lock?

@@ -180,6 +180,8 @@ int unix_stream_bpf_update_proto(struct sock *sk, struct sk_psock *psock, bool r
 	 * be a single matching destroy operation.
 	 */
 	if (!psock->sk_pair) {
+		unix_state_lock(sk);
 		sk_pair = unix_peer(sk);
+		unix_state_unlock(sk);
 		sock_hold(sk_pair);
 		psock->sk_pair = sk_pair;

What do you think?

Thanks,
Michal
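[The smp_wmb()/smp_rmb() pairing proposed above is the classic message-passing
pattern: the writer publishes the payload (peer) before the flag (sk_state),
and the reader checks the flag before the payload. A minimal userspace model
in C11 atomics — an illustrative sketch only, not kernel code, with made-up
names:]

#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

static int peer_obj;            /* stands in for the peer socket */
static _Atomic(int *) peer;     /* ~ unix_peer(sk)               */
static atomic_int state;        /* ~ sk->sk_state                */

/* T1 connect: publish the peer, then the state. */
static void *writer(void *arg)
{
	atomic_store_explicit(&peer, &peer_obj, memory_order_relaxed);
	atomic_thread_fence(memory_order_release);      /* ~ smp_wmb() */
	atomic_store_explicit(&state, 1, memory_order_relaxed);
	return NULL;
}

/* T0 bpf: check the state, then read the peer. */
static void *reader(void *arg)
{
	while (!atomic_load_explicit(&state, memory_order_relaxed))
		;                                       /* ~ sock_map_sk_state_allowed() */
	atomic_thread_fence(memory_order_acquire);      /* ~ smp_rmb() */
	/* With both fences in place, peer cannot be NULL here anymore. */
	assert(atomic_load_explicit(&peer, memory_order_relaxed) != NULL);
	return NULL;
}

int main(void)
{
	pthread_t t0, t1;

	pthread_create(&t0, NULL, reader, NULL);
	pthread_create(&t1, NULL, writer, NULL);
	pthread_join(t0, NULL);
	pthread_join(t1, NULL);
	return 0;
}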
From: Michal Luczaj <mhal@rbox.co>
Date: Sun, 9 Jun 2024 13:28:34 +0200
> Speaking of af_unix and sockmap, SOCK_STREAM has a tiny window for
> bpf(BPF_MAP_UPDATE_ELEM) and unix_stream_connect() to race: when
> sock_map_sk_state_allowed() passes (sk_state == TCP_ESTABLISHED), but
> unix_peer(sk) in unix_stream_bpf_update_proto() _still_ returns NULL:
> (...)
>
> Setting TCP_ESTABLISHED _after_ unix_peer() fixes the issue, so how about
> something like
> (...)
>
> This should keep things ordered and lockless... I hope.

sock_map_update_elem() assumes that the socket is protected
by lock_sock(), but AF_UNIX uses it only for the general path.

So, I think we should fix sock_map_sk_state_allowed() and
then use smp_store_release()/smp_load_acquire() rather than
smp_[rw]mb() for unix_peer(sk).

Could you test this with the mdelay(1) change?

Note that we need not touch sock->state.  I have a patch for
net-next that removes sock->state uses completely from AF_UNIX
as we don't use it.  Even unix_seq_show() depends on sk->sk_state.

---8<---
diff --git a/net/core/sock_map.c b/net/core/sock_map.c
index d3dbb92153f2..67794d2c7498 100644
--- a/net/core/sock_map.c
+++ b/net/core/sock_map.c
@@ -549,7 +549,7 @@ static bool sock_map_sk_state_allowed(const struct sock *sk)
 	if (sk_is_tcp(sk))
 		return (1 << sk->sk_state) & (TCPF_ESTABLISHED | TCPF_LISTEN);
 	if (sk_is_stream_unix(sk))
-		return (1 << sk->sk_state) & TCPF_ESTABLISHED;
+		return (1 << READ_ONCE(sk->sk_state)) & TCPF_ESTABLISHED;
 	return true;
 }
 
diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index 80846279de9f..a558745c7d76 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -1632,11 +1632,11 @@ static int unix_stream_connect(struct socket *sock, struct sockaddr *uaddr,
 	copy_peercred(sk, other);
 
 	sock->state = SS_CONNECTED;
-	WRITE_ONCE(sk->sk_state, TCP_ESTABLISHED);
 	sock_hold(newsk);
 
 	smp_mb__after_atomic(); /* sock_hold() does an atomic_inc() */
-	unix_peer(sk) = newsk;
+	smp_store_release(&unix_peer(sk), newsk);
+	WRITE_ONCE(sk->sk_state, TCP_ESTABLISHED);
 
 	unix_state_unlock(sk);
 
diff --git a/net/unix/unix_bpf.c b/net/unix/unix_bpf.c
index bd84785bf8d6..6d9ae8e63901 100644
--- a/net/unix/unix_bpf.c
+++ b/net/unix/unix_bpf.c
@@ -180,7 +180,7 @@ int unix_stream_bpf_update_proto(struct sock *sk, struct sk_psock *psock, bool r
 	 * be a single matching destroy operation.
 	 */
 	if (!psock->sk_pair) {
-		sk_pair = unix_peer(sk);
+		sk_pair = smp_load_acquire(&unix_peer(sk));
 		sock_hold(sk_pair);
 		psock->sk_pair = sk_pair;
 	}
---8<---

> Alternatively, maybe it would be better just to make BPF respect the unix
> state lock?
> (...)
>
> What do you think?

If we'd go this way, I'd change like this:

---8<---
diff --git a/net/unix/unix_bpf.c b/net/unix/unix_bpf.c
index bd84785bf8d6..1db42cfee70d 100644
--- a/net/unix/unix_bpf.c
+++ b/net/unix/unix_bpf.c
@@ -159,8 +159,6 @@ int unix_dgram_bpf_update_proto(struct sock *sk, struct sk_psock *psock, bool re
 
 int unix_stream_bpf_update_proto(struct sock *sk, struct sk_psock *psock, bool restore)
 {
-	struct sock *sk_pair;
-
 	/* Restore does not decrement the sk_pair reference yet because we must
 	 * keep the a reference to the socket until after an RCU grace period
 	 * and any pending sends have completed.
@@ -180,9 +178,9 @@ int unix_stream_bpf_update_proto(struct sock *sk, struct sk_psock *psock, bool r
 	 * be a single matching destroy operation.
 	 */
 	if (!psock->sk_pair) {
-		sk_pair = unix_peer(sk);
-		sock_hold(sk_pair);
-		psock->sk_pair = sk_pair;
+		psock->sk_pair = unix_peer_get(sk);
+		if (WARN_ON_ONCE(!psock->sk_pair))
+			return -EINVAL;
 	}
 
 	unix_stream_bpf_check_needs_rebuild(psock->sk_proto);
---8<---

And the _last_ option would be..., no :)

---8<---
diff --git a/include/net/af_unix.h b/include/net/af_unix.h
index b6eedf7650da..c7e31bc3e95e 100644
--- a/include/net/af_unix.h
+++ b/include/net/af_unix.h
@@ -94,8 +94,8 @@ struct unix_sock {
 #define unix_sk(ptr) container_of_const(ptr, struct unix_sock, sk)
 #define unix_peer(sk) (unix_sk(sk)->peer)
 
-#define unix_state_lock(s)	spin_lock(&unix_sk(s)->lock)
-#define unix_state_unlock(s)	spin_unlock(&unix_sk(s)->lock)
+#define unix_state_lock(s)	lock_sock(s)
+#define unix_state_unlock(s)	release_sock(s)
 
 enum unix_socket_lock_class {
 	U_LOCK_NORMAL,
 	U_LOCK_SECOND,	/* for double locking, see unix_state_double_lock(). */
@@ -108,7 +108,7 @@ enum unix_socket_lock_class {
 static inline void unix_state_lock_nested(struct sock *sk,
 				   enum unix_socket_lock_class subclass)
 {
-	spin_lock_nested(&unix_sk(sk)->lock, subclass);
+	lock_sock_nested(sk, subclass);
 }
 
 #define peer_wait peer_wq.wait
---8<---
From: Kuniyuki Iwashima <kuniyu@amazon.com>
Date: Sun, 9 Jun 2024 12:53:20 -0700
> From: Michal Luczaj <mhal@rbox.co>
> Date: Sun, 9 Jun 2024 13:28:34 +0200
> > Speaking of af_unix and sockmap, SOCK_STREAM has a tiny window for
> > bpf(BPF_MAP_UPDATE_ELEM) and unix_stream_connect() to race:
> > (...)
>
> sock_map_update_elem() assumes that the socket is protected
> by lock_sock(), but AF_UNIX uses it only for the general path.
>
> So, I think we should fix sock_map_sk_state_allowed() and
> then use smp_store_release()/smp_load_acquire() rather than
> smp_[rw]mb() for unix_peer(sk).

Sorry, I think I was wrong and we can't use smp_store_release()
and smp_load_acquire(), and smp_[rw]mb() is needed.  (The reader
checks sk->sk_state first and only then reads unix_peer(sk), so
the release/acquire pair would have to be on sk_state, not on
unix_peer(sk).)

Given we avoid adding code in the hotpath in the original fix
8866730aed510 [0], I prefer adding unix_state_lock() in the SOCKMAP
path again.

[0]: https://lore.kernel.org/bpf/6545bc9f7e443_3358c208ae@john.notmuch/

---8<---
diff --git a/net/core/sock_map.c b/net/core/sock_map.c
index d3dbb92153f2..67794d2c7498 100644
--- a/net/core/sock_map.c
+++ b/net/core/sock_map.c
@@ -549,7 +549,7 @@ static bool sock_map_sk_state_allowed(const struct sock *sk)
 	if (sk_is_tcp(sk))
 		return (1 << sk->sk_state) & (TCPF_ESTABLISHED | TCPF_LISTEN);
 	if (sk_is_stream_unix(sk))
-		return (1 << sk->sk_state) & TCPF_ESTABLISHED;
+		return (1 << READ_ONCE(sk->sk_state)) & TCPF_ESTABLISHED;
 	return true;
 }
 
diff --git a/net/unix/unix_bpf.c b/net/unix/unix_bpf.c
index bd84785bf8d6..1db42cfee70d 100644
--- a/net/unix/unix_bpf.c
+++ b/net/unix/unix_bpf.c
@@ -159,8 +159,6 @@ int unix_dgram_bpf_update_proto(struct sock *sk, struct sk_psock *psock, bool re
 
 int unix_stream_bpf_update_proto(struct sock *sk, struct sk_psock *psock, bool restore)
 {
-	struct sock *sk_pair;
-
 	/* Restore does not decrement the sk_pair reference yet because we must
 	 * keep the a reference to the socket until after an RCU grace period
 	 * and any pending sends have completed.
@@ -180,9 +178,9 @@ int unix_stream_bpf_update_proto(struct sock *sk, struct sk_psock *psock, bool r
 	 * be a single matching destroy operation.
 	 */
 	if (!psock->sk_pair) {
-		sk_pair = unix_peer(sk);
-		sock_hold(sk_pair);
-		psock->sk_pair = sk_pair;
+		psock->sk_pair = unix_peer_get(sk);
+		if (WARN_ON_ONCE(!psock->sk_pair))
+			return -EINVAL;
 	}
 
 	unix_stream_bpf_check_needs_rebuild(psock->sk_proto);
---8<---
On 6/9/24 23:03, Kuniyuki Iwashima wrote:
> (...)
> Sorry, I think I was wrong and we can't use smp_store_release()
> and smp_load_acquire(), and smp_[rw]mb() is needed.
>
> Given we avoid adding code in the hotpath in the original fix
> 8866730aed510 [0], I prefer adding unix_state_lock() in the SOCKMAP
> path again.
>
> [0]: https://lore.kernel.org/bpf/6545bc9f7e443_3358c208ae@john.notmuch/

You're saying smp_wmb() in connect() is too much for the hot path, do I
understand correctly?

> ---8<---
> diff --git a/net/core/sock_map.c b/net/core/sock_map.c
> (...)
> ---8<---

FWIW, we've passed sock_map_sk_state_allowed(), so critical section can
be empty:

	if (!psock->sk_pair) {
		unix_state_lock(sk);
		unix_state_unlock(sk);

		sk_pair = READ_ONCE(unix_peer(sk));
		...
	}

Thanks,
Michal
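[The empty critical section works because connect() stores both unix_peer(sk)
and sk->sk_state while holding the same lock: once the reader has observed
TCP_ESTABLISHED, merely passing through the lock guarantees the whole writer
critical section, peer store included, is visible. A userspace model of the
pattern — a hedged sketch with made-up names, not kernel code:]

#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static int peer_obj;
static int *peer;               /* written only under the lock          */
static atomic_int state;        /* published inside the critical section */

/* connect() side: set peer and state under the lock. */
static void *writer(void *arg)
{
	pthread_mutex_lock(&lock);
	peer = &peer_obj;
	atomic_store_explicit(&state, 1, memory_order_relaxed);
	pthread_mutex_unlock(&lock);
	return NULL;
}

/* sockmap side: lockless state check, then an empty critical section. */
static void *reader(void *arg)
{
	while (!atomic_load_explicit(&state, memory_order_relaxed))
		;			/* ~ sock_map_sk_state_allowed() */

	pthread_mutex_lock(&lock);	/* waits out the writer's section */
	pthread_mutex_unlock(&lock);

	assert(peer != NULL);		/* the peer store is visible now */
	return NULL;
}

int main(void)
{
	pthread_t t0, t1;

	pthread_create(&t0, NULL, reader, NULL);
	pthread_create(&t1, NULL, writer, NULL);
	pthread_join(t0, NULL);
	pthread_join(t1, NULL);
	return 0;
}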
From: Michal Luczaj <mhal@rbox.co>
Date: Mon, 10 Jun 2024 14:55:08 +0200
> On 6/9/24 23:03, Kuniyuki Iwashima wrote:
> > (...)
> > Given we avoid adding code in the hotpath in the original fix
> > 8866730aed510 [0], I prefer adding unix_state_lock() in the SOCKMAP
> > path again.
>
> You're saying smp_wmb() in connect() is too much for the hot path, do I
> understand correctly?

Yes, and now I think WARN_ON_ONCE() would be enough because it's unlikely
that the delay happens between the two store ops and concurrent bpf()
is in progress.

If syzkaller was able to hit this on vanilla kernel, we can revisit.

Then, probably we could just do s/WARN_ON_ONCE/unlikely/ because users
who call bpf() in such a way know that the state was TCP_CLOSE while
calling bpf().

---8<---
diff --git a/net/unix/unix_bpf.c b/net/unix/unix_bpf.c
index bd84785bf8d6..46dc747349f2 100644
--- a/net/unix/unix_bpf.c
+++ b/net/unix/unix_bpf.c
@@ -181,6 +181,9 @@ int unix_stream_bpf_update_proto(struct sock *sk, struct sk_psock *psock, bool r
 	 */
 	if (!psock->sk_pair) {
 		sk_pair = unix_peer(sk);
+		if (WARN_ON_ONCE(!sk_pair))
+			return -EINVAL;
+
 		sock_hold(sk_pair);
 		psock->sk_pair = sk_pair;
 	}
---8<---
On 6/10/24 19:49, Kuniyuki Iwashima wrote:
> (...)
> Yes, and now I think WARN_ON_ONCE() would be enough because it's unlikely
> that the delay happens between the two store ops and concurrent bpf()
> is in progress.
>
> If syzkaller was able to hit this on vanilla kernel, we can revisit.

Oh. That's a peculiar approach :) But, hey, it's your call.

Another AF_UNIX sockmap issue is with OOB. When an OOB packet is sent, skb is
added to the recv queue, but also u->oob_skb is set. Here's the problem: when
this skb goes through bpf_sk_redirect_map() and is moved between socks,
oob_skb remains set on the original sock.

[   23.688994] WARNING: CPU: 2 PID: 993 at net/unix/garbage.c:351 unix_collect_queue+0x6c/0xb0
[   23.689019] CPU: 2 PID: 993 Comm: kworker/u32:13 Not tainted 6.10.0-rc2+ #137
[   23.689021] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Arch Linux 1.16.3-1-1 04/01/2014
[   23.689024] Workqueue: events_unbound __unix_gc
[   23.689027] RIP: 0010:unix_collect_queue+0x6c/0xb0

I wanted to write a patch, but then I realized I'm not sure what's the
expected behaviour. Should the oob_skb setting follow to the skb's new sock
or should it be dropped (similarly to what is happening today with
scm_fp_list, i.e. redirect strips inflights)?
From: Michal Luczaj <mhal@rbox.co>
Date: Mon, 17 Jun 2024 01:28:52 +0200
> Another AF_UNIX sockmap issue is with OOB. When an OOB packet is sent, skb is
> added to the recv queue, but also u->oob_skb is set. Here's the problem: when
> this skb goes through bpf_sk_redirect_map() and is moved between socks,
> oob_skb remains set on the original sock.

Good catch!

> [   23.688994] WARNING: CPU: 2 PID: 993 at net/unix/garbage.c:351 unix_collect_queue+0x6c/0xb0
> (...)
>
> I wanted to write a patch, but then I realized I'm not sure what's the
> expected behaviour. Should the oob_skb setting follow to the skb's new sock
> or should it be dropped (similarly to what is happening today with
> scm_fp_list, i.e. redirect strips inflights)?

The former will require large refactoring as we need to check if the
redirect happens for BPF_F_INGRESS and if the redirected sk is also
SOCK_STREAM etc.

So, I'd go with the latter. Probably we can check if skb is u->oob_skb
and drop OOB data and retry next in unix_stream_read_skb(), and forbid
MSG_OOB in unix_bpf_recvmsg().

Both features were merged in 5.15 and OOB was a month later than SOCKMAP,
so the Fixes tag would be 314001f0bf927 again, where ioctl(SIOCATMARK)
(and epoll(EPOLLPRI) after d9a232d435dcc was backported to all stable)
is lying due to redirected OOB msg.

Thanks!
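[For context, this is roughly how AF_UNIX MSG_OOB and SIOCATMARK behave on an
ordinary, non-redirected stream pair; a hedged sketch against a >= 5.15 kernel
with CONFIG_AF_UNIX_OOB, error handling elided, expected outputs in comments:]

#include <assert.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <sys/socket.h>

int main(void)
{
	char buf[8];
	int sk[2], atmark = 0;

	assert(!socketpair(AF_UNIX, SOCK_STREAM, 0, sk));

	send(sk[0], "hell", 4, 0);
	send(sk[0], "o", 1, MSG_OOB);		/* sets u->oob_skb on sk[1] */

	recv(sk[1], buf, sizeof(buf), 0);	/* reads "hell", stops at the mark */

	ioctl(sk[1], SIOCATMARK, &atmark);
	printf("at mark: %d\n", atmark);	/* 1: next byte is the OOB byte */

	recv(sk[1], buf, 1, MSG_OOB);		/* reads "o", consumes oob_skb */
	return 0;
}

[A BPF verdict program redirecting sk[1]'s queue to a third socket would move
the skb but not the u->oob_skb mark, which is the inconsistency above.]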
On 6/17/24 20:21, Kuniyuki Iwashima wrote:
> (...)
> So, I'd go with the latter. Probably we can check if skb is u->oob_skb
> and drop OOB data and retry next in unix_stream_read_skb(), and forbid
> MSG_OOB in unix_bpf_recvmsg().

Yeah, sounds reasonable. I'm just not sure I understand the retry part. For
each skb_queue_tail() there's one ->sk_data_ready() (which does
->read_skb()). Why bother with a retry?

This is what I was thinking:

 static int unix_stream_read_skb(struct sock *sk, skb_read_actor_t recv_actor)
 {
+	struct unix_sock *u = unix_sk(sk);
+	struct sk_buff *skb;
+	int err;
+
 	if (unlikely(READ_ONCE(sk->sk_state) != TCP_ESTABLISHED))
 		return -ENOTCONN;
 
-	return unix_read_skb(sk, recv_actor);
+	mutex_lock(&u->iolock);
+	skb = skb_recv_datagram(sk, MSG_DONTWAIT, &err);
+
+#if IS_ENABLED(CONFIG_AF_UNIX_OOB)
+	if (skb) {
+		bool drop = false;
+
+		spin_lock(&sk->sk_receive_queue.lock);
+		if (skb == u->oob_skb) {
+			WRITE_ONCE(u->oob_skb, NULL);
+			drop = true;
+		}
+		spin_unlock(&sk->sk_receive_queue.lock);
+
+		if (drop) {
+			WARN_ON_ONCE(skb_unref(skb));
+			kfree_skb(skb);
+			skb = NULL;
+			err = 0;
+		}
+	}
+#endif
+
+	mutex_unlock(&u->iolock);
+
+	return skb ? recv_actor(sk, skb) : err;
 }
From: Michal Luczaj <mhal@rbox.co>
Date: Wed, 19 Jun 2024 20:14:48 +0200
> On 6/17/24 20:21, Kuniyuki Iwashima wrote:
> > So, I'd go with the latter. Probably we can check if skb is u->oob_skb
> > and drop OOB data and retry next in unix_stream_read_skb(), and forbid
> > MSG_OOB in unix_bpf_recvmsg().
> > (...)
>
> Yeah, sounds reasonable. I'm just not sure I understand the retry part. For
> each skb_queue_tail() there's one ->sk_data_ready() (which does
> ->read_skb()). Why bother with a retry?

Exactly.

> This is what I was thinking:
> (...)

When you post it, please make sure to CC bpf@ and sockmap maintainers too.
On 6/19/24 21:19, Kuniyuki Iwashima wrote:
> > This is what I was thinking:
> > (...)
>
> When you post it, please make sure to CC bpf@ and sockmap maintainers too.

Done: https://lore.kernel.org/netdev/20240620203009.2610301-1-mhal@rbox.co/
Thanks!

In fact, should I try to document those not-so-obvious OOB/sockmap
interactions? And speaking of documentation, an astute reader noted that
`man unix` is lying:

	Sockets API
		...
		UNIX domain sockets do not support the transmission of
		out-of-band data (the MSG_OOB flag for send(2) and recv(2)).

	NOTES
		...
		UNIX domain stream sockets do not support the notion of
		out-of-band data.
From: Michal Luczaj <mhal@rbox.co>
Date: Thu, 20 Jun 2024 22:35:55 +0200
> In fact, should I try to document those not-so-obvious OOB/sockmap
> interactions? And speaking of documentation, an astute reader noted that
> `man unix` is lying:

At least I wouldn't update man until we can say AF_UNIX MSG_OOB handling
is stable enough; the behaviour is not compliant with TCP now.

While rewriting the oob test thoroughly, I found a few more weird
behaviours, and patches will follow. For example:

>>> from socket import *
>>> c1, c2 = socketpair(AF_UNIX)
>>> c1.send(b'hello', MSG_OOB)
5
>>> c1.send(b'world')
5
>>> c2.recv(10)
b'hell'
>>> c2.recv(1, MSG_OOB)
b'o'
>>> c2.setblocking(False)  # This causes -EAGAIN even with available data
>>> c2.recv(5)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
BlockingIOError: [Errno 11] Resource temporarily unavailable
>>> c2.recv(5)
b'world'

And we need

---8<---
diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index 5e695a9a609c..2875a7ce1887 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -2614,9 +2614,20 @@ static struct sk_buff *manage_oob(struct sk_buff *skb, struct sock *sk,
 	struct unix_sock *u = unix_sk(sk);
 
 	if (!unix_skb_len(skb) && !(flags & MSG_PEEK)) {
-		skb_unlink(skb, &sk->sk_receive_queue);
-		consume_skb(skb);
-		skb = NULL;
+		struct sk_buff *unlinked_skb = skb;
+
+		spin_lock(&sk->sk_receive_queue.lock);
+
+		__skb_unlink(skb, &sk->sk_receive_queue);
+
+		if (copied)
+			skb = NULL;
+		else
+			skb = skb_peek(&sk->sk_receive_queue);
+
+		spin_unlock(&sk->sk_receive_queue.lock);
+
+		consume_skb(unlinked_skb);
 	} else {
 		struct sk_buff *unlinked_skb = NULL;
---8<---
On 6/20/24 23:56, Kuniyuki Iwashima wrote:
> From: Michal Luczaj <mhal@rbox.co>
> Date: Thu, 20 Jun 2024 22:35:55 +0200
> > In fact, should I try to document those not-so-obvious OOB/sockmap
> > interactions? (...)
>
> At least I wouldn't update man until we can say AF_UNIX MSG_OOB handling
> is stable enough; the behaviour is not compliant with TCP now.

Sure, I get it.

> (...)
> And we need
> (...)

I gotta ask, is there a reason for unlinking an already consumed
('consumed' as in 'unix_skb_len(skb) == 0') skb so late, in manage_oob()?
IOW, can't it be unlinked immediately once it's consumed in
unix_stream_recv_urg()? I suppose that would simplify things.
From: Michal Luczaj <mhal@rbox.co>
Date: Sun, 23 Jun 2024 00:43:27 +0200
> I gotta ask, is there a reason for unlinking an already consumed
> ('consumed' as in 'unix_skb_len(skb) == 0') skb so late, in manage_oob()?
> IOW, can't it be unlinked immediately once it's consumed in
> unix_stream_recv_urg()? I suppose that would simplify things.

I also thought that before, but we can't do that.

Even after reading OOB data, we need to remember the position
and break recv() at that point. That's why the skb is unlinked
in manage_oob() rather than unix_stream_recv_urg().
On 6/23/24 07:19, Kuniyuki Iwashima wrote:
> From: Michal Luczaj <mhal@rbox.co>
> Date: Sun, 23 Jun 2024 00:43:27 +0200
>> I gotta ask, is there a reason for unlinking an already consumed
>> ('consumed' as in 'unix_skb_len(skb) == 0') skb so late, in manage_oob()?
>> IOW, can't it be unlinked immediately once it's consumed in
>> unix_stream_recv_urg()? I suppose that would simplify things.
>
> I also thought that before, but we can't do that.
>
> Even after reading OOB data, we need to remember the position
> and break recv() at that point. That's why the skb is unlinked
> in manage_oob() rather than unix_stream_recv_urg().

Ahh, I see. Thanks for explaining.

One more thing about unix sockmap. AF_UNIX SOCK_DGRAM supports 0-length
packets. But sockmap doesn't handle that; once a 0-length skb/msg is in the
psock queue, unix_bpf_recvmsg() starts throwing -EFAULT. Sockmap'ed AF_INET
SOCK_DGRAM does the same, so is this a bug or a feature?
From: Michal Luczaj <mhal@rbox.co>
Date: Wed, 26 Jun 2024 12:48:27 +0200
> One more thing about unix sockmap. AF_UNIX SOCK_DGRAM supports 0-length
> packets. But sockmap doesn't handle that; once a 0-length skb/msg is in the
> psock queue, unix_bpf_recvmsg() starts throwing -EFAULT. Sockmap'ed AF_INET
> SOCK_DGRAM does the same, so is this a bug or a feature?

I guess it's a kind of safeguard. The retval 0 has special meaning for
SOCK_STREAM as EOF/shutdown(). If we bypass a 0-byte dgram to a SOCK_STREAM
sk, the application will be confused as if the original peer has
disconnected. At least, -EFAULT avoids such confusion so that only the
true peer can trigger 0-byte via the saved ->recvmsg().

So, the requirement would be similar to scm handling; we need to
recognize the sockmap verdict and destination to support full features.
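[A plain, non-sockmap AF_UNIX datagram pair shows why the 0 return value is
ambiguous once it crosses into stream semantics; a minimal sketch:]

#include <assert.h>
#include <stdio.h>
#include <sys/socket.h>

int main(void)
{
	char buf[8];
	int sk[2];
	ssize_t n;

	assert(!socketpair(AF_UNIX, SOCK_DGRAM, 0, sk));

	/* 0-length datagrams are legal on AF_UNIX SOCK_DGRAM... */
	assert(send(sk[0], "", 0, 0) == 0);

	/* ...and recv() returns 0 for them, which on a SOCK_STREAM
	 * socket would instead mean EOF/shutdown() by the peer. */
	n = recv(sk[1], buf, sizeof(buf), 0);
	printf("recv: %zd\n", n);	/* 0, but the peer is still there */
	return 0;
}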
When a SOCK_DGRAM socket connect()s to another socket, both sockets'
sk->sk_state are changed to TCP_ESTABLISHED so that we can register
them to BPF SOCKMAP.

When the socket disconnects from the peer by connect(AF_UNSPEC), the
state is set back to TCP_CLOSE. Then, the peer's state is also set to
TCP_CLOSE, but the update is done locklessly and unconditionally.

Let's say socket A connect()ed to B, B connect()ed to C, and A
disconnects from B.

After the first two connect()s, all three sockets' sk->sk_state are
TCP_ESTABLISHED:

  $ ss -xa
  Netid State  Recv-Q Send-Q  Local Address:Port  Peer Address:PortProcess
  u_dgr ESTAB  0      0                  @A 641              * 642
  u_dgr ESTAB  0      0                  @B 642              * 643
  u_dgr ESTAB  0      0                  @C 643              * 0

And after the disconnect, B's state is TCP_CLOSE even though it's still
connected to C and C's state is TCP_ESTABLISHED.

  $ ss -xa
  Netid State  Recv-Q Send-Q  Local Address:Port  Peer Address:PortProcess
  u_dgr UNCONN 0      0                  @A 641              * 0
  u_dgr UNCONN 0      0                  @B 642              * 643
  u_dgr ESTAB  0      0                  @C 643              * 0

In this case, we cannot register B to SOCKMAP.

So, when a socket disconnects from the peer, we should not set TCP_CLOSE
to the peer if the peer is connected to yet another socket, and this
must be done under unix_state_lock().

Note that we use WRITE_ONCE() for sk->sk_state as there are many
lockless readers. These data-races will be fixed in the following
patches.

Fixes: 83301b5367a9 ("af_unix: Set TCP_ESTABLISHED for datagram sockets too")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
---
CC: Cong Wang <cong.wang@bytedance.com>
---
 net/unix/af_unix.c | 10 ++++++++--
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index 25b49efc0926..b162164b7a42 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -570,7 +570,6 @@ static void unix_dgram_disconnected(struct sock *sk, struct sock *other)
 			sk_error_report(other);
 		}
 	}
-	other->sk_state = TCP_CLOSE;
 }
 
 static void unix_sock_destructor(struct sock *sk)
@@ -1424,8 +1423,15 @@ static int unix_dgram_connect(struct socket *sock, struct sockaddr *addr,
 
 		unix_state_double_unlock(sk, other);
 
-		if (other != old_peer)
+		if (other != old_peer) {
 			unix_dgram_disconnected(sk, old_peer);
+
+			unix_state_lock(old_peer);
+			if (!unix_peer(old_peer))
+				WRITE_ONCE(old_peer->sk_state, TCP_CLOSE);
+			unix_state_unlock(old_peer);
+		}
+
 		sock_put(old_peer);
 	} else {
 		unix_peer(sk) = other;
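[A reproducer sketch for the A -> B -> C scenario in the commit message,
using abstract socket names so the sockets show up as @A/@B/@C in "ss -xa";
an illustration under those assumptions, not part of the patch:]

#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

/* Bind an AF_UNIX SOCK_DGRAM socket to an abstract name. */
static int dgram_bound(const char *name)
{
	struct sockaddr_un addr = { .sun_family = AF_UNIX };
	int sk = socket(AF_UNIX, SOCK_DGRAM, 0);

	assert(sk >= 0);
	addr.sun_path[0] = '\0';			/* abstract namespace */
	strcpy(addr.sun_path + 1, name);
	assert(!bind(sk, (struct sockaddr *)&addr,
		     offsetof(struct sockaddr_un, sun_path) + 1 + strlen(name)));
	return sk;
}

static void dgram_connect(int sk, const char *name)
{
	struct sockaddr_un addr = { .sun_family = AF_UNIX };

	addr.sun_path[0] = '\0';
	strcpy(addr.sun_path + 1, name);
	assert(!connect(sk, (struct sockaddr *)&addr,
			offsetof(struct sockaddr_un, sun_path) + 1 + strlen(name)));
}

int main(void)
{
	struct sockaddr unspec = { .sa_family = AF_UNSPEC };
	int a = dgram_bound("A"), b = dgram_bound("B"), c = dgram_bound("C");

	dgram_connect(a, "B");		/* A -> B: both ESTAB */
	dgram_connect(b, "C");		/* B -> C: both ESTAB */

	/* A disconnects; before the fix this also flipped B to TCP_CLOSE
	 * even though B is still connected to C ("ss -xa" shows UNCONN). */
	assert(!connect(a, &unspec, sizeof(unspec)));

	pause();			/* inspect with: ss -xa */
	return 0;
}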