[bpf-next,v6,2/8] bpf: reject unhashed sockets in bpf_sk_assign

Message ID 20230720-so-reuseport-v6-2-7021b683cdae@isovalent.com (mailing list archive)
State Accepted
Commit 67312adc96b5a585970d03b62412847afe2c6b01
Series Add SO_REUSEPORT support for TC bpf_sk_assign

Commit Message

Lorenz Bauer July 20, 2023, 3:30 p.m. UTC
The semantics for bpf_sk_assign are as follows:

    sk = some_lookup_func()
    bpf_sk_assign(skb, sk)
    bpf_sk_release(sk)

That is, the sk is not consumed by bpf_sk_assign. The function
therefore needs to make sure that sk lives long enough to be
consumed by __inet_lookup_skb. The path through the stack for a
TCPv4 packet is roughly:

  netif_receive_skb_core: takes RCU read lock
    __netif_receive_skb_core:
      sch_handle_ingress:
        tcf_classify:
          bpf_sk_assign()
      deliver_ptype_list_skb:
        deliver_skb:
          ip_packet_type->func == ip_rcv:
            ip_rcv_core:
            ip_rcv_finish_core:
              dst_input:
                ip_local_deliver:
                  ip_local_deliver_finish:
                    ip_protocol_deliver_rcu:
                      tcp_v4_rcv:
                        __inet_lookup_skb:
                          skb_steal_sock
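
Concretely, a TC classifier following this contract might look like the
sketch below. This is illustration only, not code from the patch: the
program and section names, the tuple values and the port constant are
placeholders, and error handling is kept to a minimum.

    // SPDX-License-Identifier: GPL-2.0
    /* Minimal sketch of the lookup/assign/release contract. */
    #include <linux/bpf.h>
    #include <linux/pkt_cls.h>
    #include <bpf/bpf_helpers.h>
    #include <bpf/bpf_endian.h>

    SEC("tc")
    int steer(struct __sk_buff *skb)
    {
            struct bpf_sock_tuple tuple = {
                    .ipv4.daddr = bpf_htonl(0x7f000001), /* 127.0.0.1, placeholder */
                    .ipv4.dport = bpf_htons(4321),       /* placeholder port */
            };
            struct bpf_sock *sk;
            long err;

            /* sk = some_lookup_func() */
            sk = bpf_sk_lookup_tcp(skb, &tuple, sizeof(tuple.ipv4),
                                   BPF_F_CURRENT_NETNS, 0);
            if (!sk)
                    return TC_ACT_OK;

            /* bpf_sk_assign() does not consume the reference... */
            err = bpf_sk_assign(skb, sk, 0);

            /* ...so it must be released explicitly. */
            bpf_sk_release(sk);

            return err ? TC_ACT_SHOT : TC_ACT_OK;
    }

    char _license[] SEC("license") = "GPL";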

The existing helper takes advantage of the fact that everything
happens in the same RCU critical section: for sockets with
SOCK_RCU_FREE set, bpf_sk_assign never takes a reference.
skb_steal_sock then checks SOCK_RCU_FREE again and calls sock_put()
if necessary.
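
For reference, the consuming side looks roughly like this; a simplified
paraphrase of sk_is_refcounted() and skb_steal_sock() as they stood
around this time, not a verbatim excerpt from include/net/sock.h:

    /* A full socket with SOCK_RCU_FREE set is protected by the RCU
     * read section alone, so no reference needs to be held for it. */
    static inline bool sk_is_refcounted(struct sock *sk)
    {
            /* Only full sockets have sk->sk_flags. */
            return !sk_fullsock(sk) || !sock_flag(sk, SOCK_RCU_FREE);
    }

    static inline struct sock *skb_steal_sock(struct sk_buff *skb,
                                              bool *refcounted)
    {
            if (skb->sk) {
                    struct sock *sk = skb->sk;

                    *refcounted = true;
                    /* skb_sk_is_prefetched() is true for skbs that went
                     * through bpf_sk_assign(). */
                    if (skb_sk_is_prefetched(skb))
                            *refcounted = sk_is_refcounted(sk);
                    skb->destructor = NULL;
                    skb->sk = NULL;
                    return sk;
            }
            *refcounted = false;
            return NULL;
    }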

This approach assumes that SOCK_RCU_FREE is never set on a sk
between bpf_sk_assign and skb_steal_sock, but this invariant is
violated by unhashed UDP sockets. A new UDP socket is created
in TCP_CLOSE state but without SOCK_RCU_FREE set. That flag is only
set in udp_lib_get_port(), which is called when the socket is bound.
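
Paraphrasing udp_lib_get_port() in net/ipv4/udp.c, the relevant call is
simply:

    /* In udp_lib_get_port(), just before the socket is hashed
     * (paraphrased, not a verbatim excerpt): */
    sock_set_flag(sk, SOCK_RCU_FREE);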

When bpf_sk_assign was added it wasn't possible to access unhashed
UDP sockets from BPF, so this wasn't a problem. This changed
in commit 0c48eefae712 ("sock_map: Lift socket state restriction
for datagram sockets"), but the helper wasn't adjusted accordingly.
The following sequence of events will therefore lead to a refcount
leak:

1. Add socket(AF_INET, SOCK_DGRAM) to a sockmap.
2. Pull socket out of sockmap and bpf_sk_assign it. Since
   SOCK_RCU_FREE is not set, we increment the refcount.
3. bind() or connect() the socket, setting SOCK_RCU_FREE.
4. skb_steal_sock will now set refcounted = false due to
   SOCK_RCU_FREE.
5. tcp_v4_rcv() skips sock_put().
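
The userspace half of that sequence might look like the sketch below.
This is illustration only: trigger_leak and map_fd are hypothetical
names, map_fd is assumed to refer to an already loaded
BPF_MAP_TYPE_SOCKMAP (created with 8-byte values) with a TC program
attached that looks the socket up and bpf_sk_assign()s it, and setup
and error handling are omitted.

    #include <arpa/inet.h>
    #include <sys/socket.h>
    #include <bpf/bpf.h>

    static void trigger_leak(int map_fd)
    {
            /* Step 1: a fresh UDP socket is in TCP_CLOSE and does not
             * have SOCK_RCU_FREE set yet. */
            int s = socket(AF_INET, SOCK_DGRAM, 0);
            __u32 key = 0;
            __u64 val = s;

            bpf_map_update_elem(map_fd, &key, &val, BPF_ANY);

            /* Step 2 happens in the TC program while traffic flows:
             * bpf_sk_assign() takes a reference on the socket. */

            /* Step 3: binding sets SOCK_RCU_FREE via udp_lib_get_port(). */
            struct sockaddr_in addr = {
                    .sin_family = AF_INET,
                    .sin_addr.s_addr = htonl(INADDR_LOOPBACK),
            };
            bind(s, (struct sockaddr *)&addr, sizeof(addr));

            /* Steps 4 and 5 happen in the kernel: skb_steal_sock() now
             * sees SOCK_RCU_FREE, so the reference taken in step 2 is
             * never dropped. */
    }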

Fix the problem by rejecting unhashed sockets in bpf_sk_assign().
This matches the behaviour of __inet_lookup_skb, which is what
bpf_sk_assign() is ultimately meant to emulate.
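
For completeness, sk_unhashed() is just a check of the socket's main
hash node; paraphrasing include/net/sock.h:

    static inline bool sk_unhashed(const struct sock *sk)
    {
            return hlist_unhashed(&sk->sk_node);
    }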

Fixes: cf7fbe660f2d ("bpf: Add socket assign support")
Cc: Joe Stringer <joe@cilium.io>
Signed-off-by: Lorenz Bauer <lmb@isovalent.com>
---
 net/core/filter.c | 2 ++
 1 file changed, 2 insertions(+)

Comments

Kuniyuki Iwashima July 20, 2023, 9:16 p.m. UTC | #1
From: Lorenz Bauer <lmb@isovalent.com>
Date: Thu, 20 Jul 2023 17:30:06 +0200
> [...]
> Fix the problem by rejecting unhashed sockets in bpf_sk_assign().
> This matches the behaviour of __inet_lookup_skb which is ultimately
> the goal of bpf_sk_assign().
> 
> Fixes: cf7fbe660f2d ("bpf: Add socket assign support")

Should this be 0c48eefae712 then?


> Cc: Joe Stringer <joe@cilium.io>
> Signed-off-by: Lorenz Bauer <lmb@isovalent.com>

The change itself looks good.

Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>

Thanks!


Lorenz Bauer July 24, 2023, 8:01 a.m. UTC | #2
On Thu, Jul 20, 2023 at 11:17 PM Kuniyuki Iwashima <kuniyu@amazon.com> wrote:

> > Fix the problem by rejecting unhashed sockets in bpf_sk_assign().
> > This matches the behaviour of __inet_lookup_skb which is ultimately
> > the goal of bpf_sk_assign().
> >
> > Fixes: cf7fbe660f2d ("bpf: Add socket assign support")
>
> > Should this be 0c48eefae712 then?

I think it makes sense to target the commit that added the original
helper, since we really should have done the unhashed check back then.
Relying on unhashed sockets not being reachable from BPF is too subtle.

Patch

diff --git a/net/core/filter.c b/net/core/filter.c
index 797e8f039696..b5b51ef48c5f 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -7353,6 +7353,8 @@ BPF_CALL_3(bpf_sk_assign, struct sk_buff *, skb, struct sock *, sk, u64, flags)
 		return -ENETUNREACH;
 	if (unlikely(sk_fullsock(sk) && sk->sk_reuseport))
 		return -ESOCKTNOSUPPORT;
+	if (sk_unhashed(sk))
+		return -EOPNOTSUPP;
 	if (sk_is_refcounted(sk) &&
 	    unlikely(!refcount_inc_not_zero(&sk->sk_refcnt)))
 		return -ENOENT;