
[v3,bpf-next,10/11] bpf: tcp: Support arbitrary SYN Cookie.

Message ID: 20231121184245.69569-11-kuniyu@amazon.com (mailing list archive)
State: Superseded
Delegated to: BPF
Series: bpf: tcp: Support arbitrary SYN Cookie at TC.

Checks

Context Check Description
bpf/vmtest-bpf-next-VM_Test-16 success Logs for x86_64-gcc / build / build for x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-17 success Logs for x86_64-gcc / test (test_maps, false, 360) / test_maps on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-18 success Logs for x86_64-gcc / test (test_progs, false, 360) / test_progs on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-19 success Logs for x86_64-gcc / test (test_progs_no_alu32, false, 360) / test_progs_no_alu32 on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-20 success Logs for x86_64-gcc / test (test_progs_no_alu32_parallel, true, 30) / test_progs_no_alu32_parallel on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-21 success Logs for x86_64-gcc / test (test_progs_parallel, true, 30) / test_progs_parallel on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-22 success Logs for x86_64-gcc / test (test_verifier, false, 360) / test_verifier on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-23 success Logs for x86_64-gcc / veristat / veristat on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-24 success Logs for x86_64-llvm-16 / build / build for x86_64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-25 success Logs for x86_64-llvm-16 / test (test_maps, false, 360) / test_maps on x86_64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-26 success Logs for x86_64-llvm-16 / test (test_progs, false, 360) / test_progs on x86_64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-27 success Logs for x86_64-llvm-16 / test (test_progs_no_alu32, false, 360) / test_progs_no_alu32 on x86_64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-28 success Logs for x86_64-llvm-16 / test (test_verifier, false, 360) / test_verifier on x86_64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-29 success Logs for x86_64-llvm-16 / veristat
bpf/vmtest-bpf-next-PR fail PR summary
netdev/series_format success Posting correctly formatted
netdev/codegen success Generated files up to date
netdev/tree_selection success Clearly marked for bpf-next, async
netdev/fixes_present success Fixes tag not required for -next series
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 2134 this patch: 2134
netdev/cc_maintainers success CCed 16 of 16 maintainers
netdev/build_clang success Errors and warnings before: 1278 this patch: 1278
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/deprecated_api success None detected
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes success No Fixes tag
netdev/build_allmodconfig_warn success Errors and warnings before: 2187 this patch: 2187
netdev/checkpatch warning WARNING: line length of 99 exceeds 80 columns
netdev/build_clang_rust success No Rust files in patch. Skipping build
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/source_inline success Was 0 now: 0
bpf/vmtest-bpf-next-VM_Test-0 success Logs for Lint
bpf/vmtest-bpf-next-VM_Test-2 success Logs for Validate matrix.py
bpf/vmtest-bpf-next-VM_Test-1 success Logs for ShellCheck
bpf/vmtest-bpf-next-VM_Test-4 success Logs for aarch64-gcc / test
bpf/vmtest-bpf-next-VM_Test-3 fail Logs for aarch64-gcc / build / build for aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-5 success Logs for aarch64-gcc / veristat
bpf/vmtest-bpf-next-VM_Test-7 success Logs for s390x-gcc / test
bpf/vmtest-bpf-next-VM_Test-6 fail Logs for s390x-gcc / build / build for s390x with gcc
bpf/vmtest-bpf-next-VM_Test-8 success Logs for s390x-gcc / veristat
bpf/vmtest-bpf-next-VM_Test-9 success Logs for set-matrix
bpf/vmtest-bpf-next-VM_Test-10 fail Logs for x86_64-gcc / build / build for x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-11 success Logs for x86_64-gcc / test
bpf/vmtest-bpf-next-VM_Test-12 success Logs for x86_64-gcc / veristat
bpf/vmtest-bpf-next-VM_Test-13 fail Logs for x86_64-llvm-16 / build / build for x86_64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-14 success Logs for x86_64-llvm-16 / test
bpf/vmtest-bpf-next-VM_Test-15 success Logs for x86_64-llvm-16 / veristat

Commit Message

Kuniyuki Iwashima Nov. 21, 2023, 6:42 p.m. UTC
This patch adds a new kfunc available at the TC hook to support arbitrary
SYN Cookies.

The basic usage is as follows:

    struct tcp_cookie_attributes attr = {
        .tcp_opt = {
            .mss_clamp = mss,
            .wscale_ok = wscale_ok,
            .snd_wscale = snd_wscale, /* <= TCP_MAX_WSCALE (14) */
            .tstamp_ok = tstamp_ok,
            .sack_ok = sack_ok,
        },
        .ecn_ok = ecn_ok,
        .usec_ts_ok = usec_ts_ok,
    };

    skc = bpf_skc_lookup_tcp(...);
    sk = (struct sock *)bpf_skc_to_tcp_sock(skc);
    bpf_sk_assign_tcp_reqsk(skb, sk, &attr, sizeof(attr));
    bpf_sk_release(skc);

bpf_sk_assign_tcp_reqsk() takes an skb, a listener sk, and a struct
tcp_cookie_attributes, then allocates a reqsk and configures it.  Finally,
it links the reqsk with the skb and the listener.
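
A more complete TC program might look like the following minimal sketch
(illustrative only: the kfunc prototype matches this patch, but the program
name, the example attribute values, and the header parsing and cookie
decoding steps are placeholders, and struct tcp_cookie_attributes is
assumed to be visible via a vmlinux.h generated from a kernel with this
series applied):

    #include "vmlinux.h"
    #include <bpf/bpf_helpers.h>

    /* Prototype of the kfunc added by this patch (BPF_PROG_TYPE_SCHED_CLS). */
    extern int bpf_sk_assign_tcp_reqsk(struct __sk_buff *skb, struct sock *sk,
                                       struct tcp_cookie_attributes *attr,
                                       int attr__sz) __ksym;

    SEC("tc")
    int assign_tcp_reqsk(struct __sk_buff *skb)
    {
        struct tcp_cookie_attributes attr = {};
        struct bpf_sock_tuple tuple = {};
        struct bpf_sock *skc;
        struct sock *sk;

        /* Parse the IP/TCP headers, check that this is the last ACK of
         * the 3WHS, decode the custom cookie into attr, and fill
         * tuple.ipv4.  All of that is the program's own logic and is
         * omitted here.
         */
        attr.tcp_opt.mss_clamp = 1460;  /* example values from the cookie */
        attr.tcp_opt.wscale_ok = 1;
        attr.tcp_opt.snd_wscale = 7;    /* <= TCP_MAX_WSCALE (14) */

        skc = bpf_skc_lookup_tcp(skb, &tuple, sizeof(tuple.ipv4),
                                 BPF_F_CURRENT_NETNS, 0);
        if (!skc)
            return 2;    /* TC_ACT_SHOT */

        sk = (struct sock *)bpf_skc_to_tcp_sock(skc);
        if (sk)
            bpf_sk_assign_tcp_reqsk(skb, sk, &attr, sizeof(attr));

        bpf_sk_release(skc);
        return 0;    /* TC_ACT_OK: the packet continues into the stack */
    }

    char _license[] SEC("license") = "GPL";

When the kfunc succeeds, the program lets the packet continue into the
stack, where the assigned reqsk is picked up as described below.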

The notable thing here is that we do not hold a refcnt for either the reqsk
or the listener.  To differentiate this case, we mark reqsk->syncookie,
which is only used on the TX path for now.  So, if reqsk->syncookie is 1 on
the RX path, it means that the reqsk was allocated by the kfunc.

When the skb is freed, sock_pfree() checks if reqsk->syncookie is 1, and
in that case, we set reqsk->rsk_listener to NULL before calling
reqsk_free(), as the reqsk does not hold a refcnt on the listener.

When the TCP stack looks up a socket from the skb, we return
inet_reqsk(skb->sk)->rsk_listener in inet6?_steal_sock().  However,
we do not clear skb->sk and skb->destructor so that we can carry
the reqsk to cookie_v[46]_check().

The refcnt of the reqsk is finally set to 1 in tcp_get_cookie_sock() after
a full sk is created.

Note that we can extend struct tcp_cookie_attributes in the future when we
add a new attribute that is determined during the 3WHS.

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
---
 include/net/inet6_hashtables.h | 16 +++++-
 include/net/inet_hashtables.h  | 16 +++++-
 include/net/tcp.h              |  6 +++
 net/core/filter.c              | 98 +++++++++++++++++++++++++++++++++-
 net/core/sock.c                | 14 ++++-
 5 files changed, 144 insertions(+), 6 deletions(-)

Comments

Martin KaFai Lau Nov. 22, 2023, 11:19 p.m. UTC | #1
On 11/21/23 10:42 AM, Kuniyuki Iwashima wrote:
> diff --git a/include/net/inet6_hashtables.h b/include/net/inet6_hashtables.h
> index 533a7337865a..9a67f47a5e64 100644
> --- a/include/net/inet6_hashtables.h
> +++ b/include/net/inet6_hashtables.h
> @@ -116,9 +116,23 @@ struct sock *inet6_steal_sock(struct net *net, struct sk_buff *skb, int doff,
>   	if (!sk)
>   		return NULL;
>   
> -	if (!prefetched || !sk_fullsock(sk))
> +	if (!prefetched)
>   		return sk;
>   
> +	if (sk->sk_state == TCP_NEW_SYN_RECV) {
> +#if IS_ENABLED(CONFIG_SYN_COOKIE)
> +		if (inet_reqsk(sk)->syncookie) {
> +			*refcounted = false;
> +			skb->sk = sk;
> +			skb->destructor = sock_pfree;

Instead of re-initializing skb->sk and skb->destructor, can skb_steal_sock()
avoid resetting them to NULL in the first place and return the rsk_listener
instead? btw, can inet_reqsk(sk)->rsk_listener be set to NULL after this
point?

Besides, it is essentially assigning the incoming request to a listening sk.
Does it need to call inet6_lookup_reuseport() a few lines below to avoid
skipping the bpf reuseport selection that was fixed in commit 9c02bec95954
("bpf, net: Support SO_REUSEPORT sockets with bpf_sk_assign")?

> +			return inet_reqsk(sk)->rsk_listener;
> +		}
> +#endif
> +		return sk;
> +	} else if (sk->sk_state == TCP_TIME_WAIT) {
> +		return sk;
> +	}
> +
>   	if (sk->sk_protocol == IPPROTO_TCP) {
>   		if (sk->sk_state != TCP_LISTEN)
>   			return sk;
> diff --git a/include/net/inet_hashtables.h b/include/net/inet_hashtables.h
> index 3ecfeadbfa06..36609656a047 100644
> --- a/include/net/inet_hashtables.h
> +++ b/include/net/inet_hashtables.h
> @@ -462,9 +462,23 @@ struct sock *inet_steal_sock(struct net *net, struct sk_buff *skb, int doff,
>   	if (!sk)
>   		return NULL;
>   
> -	if (!prefetched || !sk_fullsock(sk))
> +	if (!prefetched)
>   		return sk;
>   
> +	if (sk->sk_state == TCP_NEW_SYN_RECV) {
> +#if IS_ENABLED(CONFIG_SYN_COOKIE)
> +		if (inet_reqsk(sk)->syncookie) {
> +			*refcounted = false;
> +			skb->sk = sk;
> +			skb->destructor = sock_pfree;
> +			return inet_reqsk(sk)->rsk_listener;
> +		}
> +#endif
> +		return sk;
> +	} else if (sk->sk_state == TCP_TIME_WAIT) {
> +		return sk;
> +	}
> +
>   	if (sk->sk_protocol == IPPROTO_TCP) {
>   		if (sk->sk_state != TCP_LISTEN)
Kuniyuki Iwashima Nov. 23, 2023, 12:31 a.m. UTC | #2
From: Martin KaFai Lau <martin.lau@linux.dev>
Date: Wed, 22 Nov 2023 15:19:29 -0800
> On 11/21/23 10:42 AM, Kuniyuki Iwashima wrote:
> > diff --git a/include/net/inet6_hashtables.h b/include/net/inet6_hashtables.h
> > index 533a7337865a..9a67f47a5e64 100644
> > --- a/include/net/inet6_hashtables.h
> > +++ b/include/net/inet6_hashtables.h
> > @@ -116,9 +116,23 @@ struct sock *inet6_steal_sock(struct net *net, struct sk_buff *skb, int doff,
> >   	if (!sk)
> >   		return NULL;
> >   
> > -	if (!prefetched || !sk_fullsock(sk))
> > +	if (!prefetched)
> >   		return sk;
> >   
> > +	if (sk->sk_state == TCP_NEW_SYN_RECV) {
> > +#if IS_ENABLED(CONFIG_SYN_COOKIE)
> > +		if (inet_reqsk(sk)->syncookie) {
> > +			*refcounted = false;
> > +			skb->sk = sk;
> > +			skb->destructor = sock_pfree;
> 
> Instead of re-initializing skb->sk and skb->destructor, can skb_steal_sock()
> avoid resetting them to NULL in the first place and return the rsk_listener
> instead?

Yes, but we need to move skb_steal_sock() to request_sock.h or include it just
before skb_steal_sock() in sock.h like below.  When I included request_sock.h
at the top of sock.h, there were many build errors.


> btw, can inet_reqsk(sk)->rsk_listener be set to NULL after 
> this point?

Except for sock_pfree(), we do not set it to NULL until cookie_bpf_check().


> Besides, it is essentially assigning the incoming request to a listening sk.
> Does it need to call inet6_lookup_reuseport() a few lines below to avoid
> skipping the bpf reuseport selection that was fixed in commit 9c02bec95954
> ("bpf, net: Support SO_REUSEPORT sockets with bpf_sk_assign")?

Ah, good point.  I assumed bpf_sk_lookup() would do the random pick, but we
need to call it just in case the sk was picked from a BPF map.

As you suggested, if we return rsk_listener from skb_steal_sock(), we can
reuse the reuseport_lookup call in inet6_steal_sock().

Thanks!


---8<---
diff --git a/include/net/sock.h b/include/net/sock.h
index 1d6931caf0c3..83efbe0e7c3b 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -2838,6 +2838,8 @@ sk_is_refcounted(struct sock *sk)
        return !sk_fullsock(sk) || !sock_flag(sk, SOCK_RCU_FREE);
 }
 
+#include <net/request_sock.h>
+
 /**
  * skb_steal_sock - steal a socket from an sk_buff
  * @skb: sk_buff to steal the socket from
@@ -2847,20 +2849,38 @@ sk_is_refcounted(struct sock *sk)
 static inline struct sock *
 skb_steal_sock(struct sk_buff *skb, bool *refcounted, bool *prefetched)
 {
-       if (skb->sk) {
-               struct sock *sk = skb->sk;
+       struct sock *sk = skb->sk;
 
+       if (!sk) {
+               *prefetched = false;
+               *refcounted = false;
+               return NULL;
+       }
+
+       *prefetched = skb_sk_is_prefetched(skb);
+       if (!*prefetched) {
                *refcounted = true;
-               *prefetched = skb_sk_is_prefetched(skb);
-               if (*prefetched)
-                       *refcounted = sk_is_refcounted(sk);
-               skb->destructor = NULL;
-               skb->sk = NULL;
-               return sk;
+               goto out;
        }
-       *prefetched = false;
-       *refcounted = false;
-       return NULL;
+
+       switch (sk->sk_state) {
+       case TCP_NEW_SYN_RECV:
+               if (inet_reqsk(sk)->syncookie) {
+                       *refcounted = false;
+                       return inet_reqsk(sk)->rsk_listener;
+               }
+               fallthrough;
+       case TCP_TIME_WAIT:
+               *refcounted = true;
+               break;
+       default:
+               *refcounted = !sock_flag(sk, SOCK_RCU_FREE);
+       }
+
+out:
+       skb->destructor = NULL;
+       skb->sk = NULL;
+       return sk;
 }
 
 /* Checks if this SKB belongs to an HW offloaded socket
---8<---
Martin KaFai Lau Nov. 27, 2023, 11:04 p.m. UTC | #3
On 11/22/23 4:31 PM, Kuniyuki Iwashima wrote:
> From: Martin KaFai Lau <martin.lau@linux.dev>
> Date: Wed, 22 Nov 2023 15:19:29 -0800
>> On 11/21/23 10:42 AM, Kuniyuki Iwashima wrote:
>>> diff --git a/include/net/inet6_hashtables.h b/include/net/inet6_hashtables.h
>>> index 533a7337865a..9a67f47a5e64 100644
>>> --- a/include/net/inet6_hashtables.h
>>> +++ b/include/net/inet6_hashtables.h
>>> @@ -116,9 +116,23 @@ struct sock *inet6_steal_sock(struct net *net, struct sk_buff *skb, int doff,
>>>    	if (!sk)
>>>    		return NULL;
>>>    
>>> -	if (!prefetched || !sk_fullsock(sk))
>>> +	if (!prefetched)
>>>    		return sk;
>>>    
>>> +	if (sk->sk_state == TCP_NEW_SYN_RECV) {
>>> +#if IS_ENABLED(CONFIG_SYN_COOKIE)
>>> +		if (inet_reqsk(sk)->syncookie) {
>>> +			*refcounted = false;
>>> +			skb->sk = sk;
>>> +			skb->destructor = sock_pfree;
>>
>> Instead of re-initializing skb->sk and skb->destructor, can skb_steal_sock()
>> avoid resetting them to NULL in the first place and return the rsk_listener
>> instead?
> 
> Yes, but we need to move skb_steal_sock() to request_sock.h or include it just

Moving it seems better than including a header in the middle. Not sure if 
inet_sock.h or request_sock.h is a better target.


> before skb_steal_sock() in sock.h like below.  When I included request_sock.h
> at the top of sock.h, there were many build errors.

Patch

diff --git a/include/net/inet6_hashtables.h b/include/net/inet6_hashtables.h
index 533a7337865a..9a67f47a5e64 100644
--- a/include/net/inet6_hashtables.h
+++ b/include/net/inet6_hashtables.h
@@ -116,9 +116,23 @@  struct sock *inet6_steal_sock(struct net *net, struct sk_buff *skb, int doff,
 	if (!sk)
 		return NULL;
 
-	if (!prefetched || !sk_fullsock(sk))
+	if (!prefetched)
 		return sk;
 
+	if (sk->sk_state == TCP_NEW_SYN_RECV) {
+#if IS_ENABLED(CONFIG_SYN_COOKIE)
+		if (inet_reqsk(sk)->syncookie) {
+			*refcounted = false;
+			skb->sk = sk;
+			skb->destructor = sock_pfree;
+			return inet_reqsk(sk)->rsk_listener;
+		}
+#endif
+		return sk;
+	} else if (sk->sk_state == TCP_TIME_WAIT) {
+		return sk;
+	}
+
 	if (sk->sk_protocol == IPPROTO_TCP) {
 		if (sk->sk_state != TCP_LISTEN)
 			return sk;
diff --git a/include/net/inet_hashtables.h b/include/net/inet_hashtables.h
index 3ecfeadbfa06..36609656a047 100644
--- a/include/net/inet_hashtables.h
+++ b/include/net/inet_hashtables.h
@@ -462,9 +462,23 @@  struct sock *inet_steal_sock(struct net *net, struct sk_buff *skb, int doff,
 	if (!sk)
 		return NULL;
 
-	if (!prefetched || !sk_fullsock(sk))
+	if (!prefetched)
 		return sk;
 
+	if (sk->sk_state == TCP_NEW_SYN_RECV) {
+#if IS_ENABLED(CONFIG_SYN_COOKIE)
+		if (inet_reqsk(sk)->syncookie) {
+			*refcounted = false;
+			skb->sk = sk;
+			skb->destructor = sock_pfree;
+			return inet_reqsk(sk)->rsk_listener;
+		}
+#endif
+		return sk;
+	} else if (sk->sk_state == TCP_TIME_WAIT) {
+		return sk;
+	}
+
 	if (sk->sk_protocol == IPPROTO_TCP) {
 		if (sk->sk_state != TCP_LISTEN)
 			return sk;
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 842791997f30..373afcfaefa6 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -591,6 +591,12 @@  static inline bool cookie_ecn_ok(const struct net *net, const struct dst_entry *
 }
 
 #if IS_ENABLED(CONFIG_BPF)
+struct tcp_cookie_attributes {
+	struct tcp_options_received tcp_opt;
+	bool ecn_ok;
+	bool usec_ts_ok;
+} __packed;
+
 static inline bool cookie_bpf_ok(struct sk_buff *skb)
 {
 	return skb->sk;
diff --git a/net/core/filter.c b/net/core/filter.c
index d64baa7ac6cd..7beba469e8a7 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -11807,6 +11807,90 @@  __bpf_kfunc int bpf_sock_addr_set_sun_path(struct bpf_sock_addr_kern *sa_kern,
 
 	return 0;
 }
+
+#if IS_ENABLED(CONFIG_SYN_COOKIE)
+__bpf_kfunc int bpf_sk_assign_tcp_reqsk(struct sk_buff *skb, struct sock *sk,
+					struct tcp_cookie_attributes *attr,
+					int attr__sz)
+{
+	const struct request_sock_ops *ops;
+	struct inet_request_sock *ireq;
+	struct tcp_request_sock *treq;
+	struct request_sock *req;
+	__u16 min_mss;
+
+	if (attr__sz != sizeof(*attr))
+		return -EINVAL;
+
+	if (!sk)
+		return -EINVAL;
+
+	if (!skb_at_tc_ingress(skb))
+		return -EINVAL;
+
+	if (dev_net(skb->dev) != sock_net(sk))
+		return -ENETUNREACH;
+
+	switch (skb->protocol) {
+	case htons(ETH_P_IP):
+		ops = &tcp_request_sock_ops;
+		min_mss = 536;
+		break;
+#if IS_BUILTIN(CONFIG_IPV6)
+	case htons(ETH_P_IPV6):
+		ops = &tcp6_request_sock_ops;
+		min_mss = IPV6_MIN_MTU - 60;
+		break;
+#endif
+	default:
+		return -EINVAL;
+	}
+
+	if (sk->sk_type != SOCK_STREAM || sk->sk_state != TCP_LISTEN)
+		return -EINVAL;
+
+	if (attr->tcp_opt.mss_clamp < min_mss) {
+		__NET_INC_STATS(sock_net(sk), LINUX_MIB_SYNCOOKIESFAILED);
+		return -EINVAL;
+	}
+
+	if (attr->tcp_opt.wscale_ok &&
+	    attr->tcp_opt.snd_wscale > TCP_MAX_WSCALE) {
+		__NET_INC_STATS(sock_net(sk), LINUX_MIB_SYNCOOKIESFAILED);
+		return -EINVAL;
+	}
+
+	if (sk_is_mptcp(sk))
+		req = mptcp_subflow_reqsk_alloc(ops, sk, false);
+	else
+		req = inet_reqsk_alloc(ops, sk, false);
+
+	if (!req)
+		return -ENOMEM;
+
+	ireq = inet_rsk(req);
+	treq = tcp_rsk(req);
+
+	req->syncookie = 1;
+	req->rsk_listener = sk;
+	req->mss = attr->tcp_opt.mss_clamp;
+
+	ireq->snd_wscale = attr->tcp_opt.snd_wscale;
+	ireq->wscale_ok = attr->tcp_opt.wscale_ok;
+	ireq->tstamp_ok	= attr->tcp_opt.tstamp_ok;
+	ireq->sack_ok = attr->tcp_opt.sack_ok;
+	ireq->ecn_ok = attr->ecn_ok;
+
+	treq->req_usec_ts = attr->usec_ts_ok;
+
+	skb_orphan(skb);
+	skb->sk = req_to_sk(req);
+	skb->destructor = sock_pfree;
+
+	return 0;
+}
+#endif
+
 __bpf_kfunc_end_defs();
 
 int bpf_dynptr_from_skb_rdonly(struct sk_buff *skb, u64 flags,
@@ -11835,6 +11919,10 @@  BTF_SET8_START(bpf_kfunc_check_set_sock_addr)
 BTF_ID_FLAGS(func, bpf_sock_addr_set_sun_path)
 BTF_SET8_END(bpf_kfunc_check_set_sock_addr)
 
+BTF_SET8_START(bpf_kfunc_check_set_tcp_reqsk)
+BTF_ID_FLAGS(func, bpf_sk_assign_tcp_reqsk)
+BTF_SET8_END(bpf_kfunc_check_set_tcp_reqsk)
+
 static const struct btf_kfunc_id_set bpf_kfunc_set_skb = {
 	.owner = THIS_MODULE,
 	.set = &bpf_kfunc_check_set_skb,
@@ -11850,6 +11938,11 @@  static const struct btf_kfunc_id_set bpf_kfunc_set_sock_addr = {
 	.set = &bpf_kfunc_check_set_sock_addr,
 };
 
+static const struct btf_kfunc_id_set bpf_kfunc_set_tcp_reqsk = {
+	.owner = THIS_MODULE,
+	.set = &bpf_kfunc_check_set_tcp_reqsk,
+};
+
 static int __init bpf_kfunc_init(void)
 {
 	int ret;
@@ -11865,8 +11958,9 @@  static int __init bpf_kfunc_init(void)
 	ret = ret ?: register_btf_kfunc_id_set(BPF_PROG_TYPE_LWT_SEG6LOCAL, &bpf_kfunc_set_skb);
 	ret = ret ?: register_btf_kfunc_id_set(BPF_PROG_TYPE_NETFILTER, &bpf_kfunc_set_skb);
 	ret = ret ?: register_btf_kfunc_id_set(BPF_PROG_TYPE_XDP, &bpf_kfunc_set_xdp);
-	return ret ?: register_btf_kfunc_id_set(BPF_PROG_TYPE_CGROUP_SOCK_ADDR,
-						&bpf_kfunc_set_sock_addr);
+	ret = ret ?: register_btf_kfunc_id_set(BPF_PROG_TYPE_CGROUP_SOCK_ADDR,
+					       &bpf_kfunc_set_sock_addr);
+	return ret ?: register_btf_kfunc_id_set(BPF_PROG_TYPE_SCHED_CLS, &bpf_kfunc_set_tcp_reqsk);
 }
 late_initcall(bpf_kfunc_init);
 
diff --git a/net/core/sock.c b/net/core/sock.c
index fef349dd72fa..998950e97dfe 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -2579,8 +2579,18 @@  EXPORT_SYMBOL(sock_efree);
 #ifdef CONFIG_INET
 void sock_pfree(struct sk_buff *skb)
 {
-	if (sk_is_refcounted(skb->sk))
-		sock_gen_put(skb->sk);
+	struct sock *sk = skb->sk;
+
+	if (!sk_is_refcounted(sk))
+		return;
+
+	if (sk->sk_state == TCP_NEW_SYN_RECV && inet_reqsk(sk)->syncookie) {
+		inet_reqsk(sk)->rsk_listener = NULL;
+		reqsk_free(inet_reqsk(sk));
+		return;
+	}
+
+	sock_gen_put(sk);
 }
 EXPORT_SYMBOL(sock_pfree);
 #endif /* CONFIG_INET */
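
For completeness, attaching such a program at TC ingress can be sketched
with libbpf as follows (the skeleton name syncookie_kern, the program name
assign_tcp_reqsk, and the interface eth0 are illustrative;
bpf_tc_hook_create() and bpf_tc_attach() are standard libbpf APIs):

    #include <net/if.h>
    #include <bpf/libbpf.h>
    #include "syncookie_kern.skel.h"    /* hypothetical skeleton */

    int main(void)
    {
        DECLARE_LIBBPF_OPTS(bpf_tc_hook, hook,
                            .ifindex = if_nametoindex("eth0"),
                            .attach_point = BPF_TC_INGRESS);
        DECLARE_LIBBPF_OPTS(bpf_tc_opts, opts);
        struct syncookie_kern *skel;
        int err;

        skel = syncookie_kern__open_and_load();
        if (!skel)
            return 1;

        /* Create the clsact qdisc if it does not exist yet. */
        err = bpf_tc_hook_create(&hook);
        if (err && err != -EEXIST)
            goto out;

        opts.prog_fd = bpf_program__fd(skel->progs.assign_tcp_reqsk);
        err = bpf_tc_attach(&hook, &opts);
    out:
        syncookie_kern__destroy(skel);
        return err ? 1 : 0;
    }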