[bpf,v2,01/12] bpf: sockmap, pass skb ownership through read_skb

Message ID: 20230327175446.98151-2-john.fastabend@gmail.com
State: Superseded
Delegated to: BPF
Series: bpf sockmap fixes

Checks

Context Check Description
bpf/vmtest-bpf-PR fail merge-conflict
netdev/tree_selection success Clearly marked for bpf, async
netdev/apply fail Patch does not apply to bpf
bpf/vmtest-bpf-VM_Test-1 success Logs for ShellCheck
bpf/vmtest-bpf-VM_Test-2 success Logs for build for aarch64 with gcc
bpf/vmtest-bpf-VM_Test-3 success Logs for build for aarch64 with llvm-16
bpf/vmtest-bpf-VM_Test-4 success Logs for build for s390x with gcc
bpf/vmtest-bpf-VM_Test-5 success Logs for build for x86_64 with gcc
bpf/vmtest-bpf-VM_Test-6 success Logs for build for x86_64 with llvm-16
bpf/vmtest-bpf-VM_Test-7 success Logs for set-matrix
bpf/vmtest-bpf-VM_Test-8 success Logs for test_maps on aarch64 with gcc
bpf/vmtest-bpf-VM_Test-9 success Logs for test_maps on aarch64 with llvm-16
bpf/vmtest-bpf-VM_Test-10 success Logs for test_maps on s390x with gcc
bpf/vmtest-bpf-VM_Test-11 success Logs for test_maps on x86_64 with gcc
bpf/vmtest-bpf-VM_Test-12 success Logs for test_maps on x86_64 with llvm-16
bpf/vmtest-bpf-VM_Test-13 fail Logs for test_progs on aarch64 with gcc
bpf/vmtest-bpf-VM_Test-14 fail Logs for test_progs on aarch64 with llvm-16
bpf/vmtest-bpf-VM_Test-15 fail Logs for test_progs on s390x with gcc
bpf/vmtest-bpf-VM_Test-16 fail Logs for test_progs on x86_64 with gcc
bpf/vmtest-bpf-VM_Test-17 fail Logs for test_progs on x86_64 with llvm-16
bpf/vmtest-bpf-VM_Test-18 fail Logs for test_progs_no_alu32 on aarch64 with gcc
bpf/vmtest-bpf-VM_Test-19 fail Logs for test_progs_no_alu32 on aarch64 with llvm-16
bpf/vmtest-bpf-VM_Test-20 fail Logs for test_progs_no_alu32 on s390x with gcc
bpf/vmtest-bpf-VM_Test-21 fail Logs for test_progs_no_alu32 on x86_64 with gcc
bpf/vmtest-bpf-VM_Test-22 fail Logs for test_progs_no_alu32 on x86_64 with llvm-16
bpf/vmtest-bpf-VM_Test-23 success Logs for test_progs_no_alu32_parallel on aarch64 with gcc
bpf/vmtest-bpf-VM_Test-24 success Logs for test_progs_no_alu32_parallel on aarch64 with llvm-16
bpf/vmtest-bpf-VM_Test-25 success Logs for test_progs_no_alu32_parallel on x86_64 with gcc
bpf/vmtest-bpf-VM_Test-26 success Logs for test_progs_no_alu32_parallel on x86_64 with llvm-16
bpf/vmtest-bpf-VM_Test-27 success Logs for test_progs_parallel on aarch64 with gcc
bpf/vmtest-bpf-VM_Test-28 success Logs for test_progs_parallel on aarch64 with llvm-16
bpf/vmtest-bpf-VM_Test-29 success Logs for test_progs_parallel on x86_64 with gcc
bpf/vmtest-bpf-VM_Test-30 success Logs for test_progs_parallel on x86_64 with llvm-16
bpf/vmtest-bpf-VM_Test-31 success Logs for test_verifier on aarch64 with gcc
bpf/vmtest-bpf-VM_Test-32 success Logs for test_verifier on aarch64 with llvm-16
bpf/vmtest-bpf-VM_Test-33 success Logs for test_verifier on s390x with gcc
bpf/vmtest-bpf-VM_Test-34 success Logs for test_verifier on x86_64 with gcc
bpf/vmtest-bpf-VM_Test-35 success Logs for test_verifier on x86_64 with llvm-16

Commit Message

John Fastabend March 27, 2023, 5:54 p.m. UTC
The read_skb hook currently calls consume_skb(), which means that if the
recv_actor program wants to keep using the skb it has to increment the
refcount so that consume_skb() doesn't free the sk_buff.

This is problematic because in some error cases under memory pressure
we may need to linearize the sk_buff from sk_psock_skb_ingress_enqueue().
Then we get this:

 skb_linearize()
   __pskb_pull_tail()
     pskb_expand_head()
       BUG_ON(skb_shared(skb))

Because we incremented the users refcount in sk_psock_verdict_recv(),
the skb is shared (refcnt > 1) and we trip the BUG_ON().

To fix this, simply pass ownership of the sk_buff through the read_skb
call. Then we can drop the consume from the read_skb handlers and let
the verdict recv do any required kfree.
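
The ownership change described above can be illustrated with a small
userspace sketch. This is not kernel code: toy_skb, toy_get(), toy_put(),
toy_expand_head() and the actor/read_skb functions are hypothetical names
that only model the reference counting, and the BUG_ON() is modeled as a
-1 return so both paths can be compared.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Toy stand-in for struct sk_buff: only the reference count is modeled. */
struct toy_skb {
	int users;
};

static struct toy_skb *toy_alloc(void)
{
	struct toy_skb *skb = malloc(sizeof(*skb));

	if (skb)
		skb->users = 1;	/* the allocator hands out the only reference */
	return skb;
}

/* Take an extra reference (the role skb_get() plays in the old code). */
static void toy_get(struct toy_skb *skb)
{
	skb->users++;
}

/* Drop one reference; free on the last one. Returns true if freed. */
static bool toy_put(struct toy_skb *skb)
{
	assert(skb->users > 0);
	if (--skb->users > 0)
		return false;
	free(skb);
	return true;
}

/* Models the shared check: the real code does BUG_ON(skb_shared(skb)). */
static int toy_expand_head(const struct toy_skb *skb)
{
	return skb->users > 1 ? -1 : 0;
}

/* Old scheme: the actor takes its own reference to outlive the caller's put. */
static int actor_old(struct toy_skb *skb)
{
	int ret;

	toy_get(skb);			/* users is now 2 */
	ret = toy_expand_head(skb);	/* shared, so this fails */
	toy_put(skb);			/* actor is done with its reference */
	return ret;
}

static int read_skb_old(struct toy_skb *skb)
{
	int ret = actor_old(skb);

	toy_put(skb);			/* handler consumes the skb itself */
	return ret;
}

/* New scheme: ownership passes to the actor, which does the final put. */
static int actor_new(struct toy_skb *skb)
{
	int ret = toy_expand_head(skb);	/* users is 1, so this succeeds */

	toy_put(skb);
	return ret;
}

static int read_skb_new(struct toy_skb *skb)
{
	return actor_new(skb);		/* no put here: the actor owns the skb */
}
```

In this model read_skb_old() returns -1 because the buffer is shared when
it needs to be expanded, while read_skb_new() returns 0 and the buffer is
still freed exactly once.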

The bug was found while testing in our CI, which runs in VMs that hit
memory constraints rather regularly. William tested the TCP read_skb
handlers.

[  106.536188] ------------[ cut here ]------------
[  106.536197] kernel BUG at net/core/skbuff.c:1693!
[  106.536479] invalid opcode: 0000 [#1] PREEMPT SMP PTI
[  106.536726] CPU: 3 PID: 1495 Comm: curl Not tainted 5.19.0-rc5 #1
[  106.537023] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ArchLinux 1.16.0-1 04/01/2014
[  106.537467] RIP: 0010:pskb_expand_head+0x269/0x330
[  106.538585] RSP: 0018:ffffc90000138b68 EFLAGS: 00010202
[  106.538839] RAX: 000000000000003f RBX: ffff8881048940e8 RCX: 0000000000000a20
[  106.539186] RDX: 0000000000000002 RSI: 0000000000000000 RDI: ffff8881048940e8
[  106.539529] RBP: ffffc90000138be8 R08: 00000000e161fd1a R09: 0000000000000000
[  106.539877] R10: 0000000000000018 R11: 0000000000000000 R12: ffff8881048940e8
[  106.540222] R13: 0000000000000003 R14: 0000000000000000 R15: ffff8881048940e8
[  106.540568] FS:  00007f277dde9f00(0000) GS:ffff88813bd80000(0000) knlGS:0000000000000000
[  106.540954] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  106.541227] CR2: 00007f277eeede64 CR3: 000000000ad3e000 CR4: 00000000000006e0
[  106.541569] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  106.541915] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  106.542255] Call Trace:
[  106.542383]  <IRQ>
[  106.542487]  __pskb_pull_tail+0x4b/0x3e0
[  106.542681]  skb_ensure_writable+0x85/0xa0
[  106.542882]  sk_skb_pull_data+0x18/0x20
[  106.543084]  bpf_prog_b517a65a242018b0_bpf_skskb_http_verdict+0x3a9/0x4aa9
[  106.543536]  ? migrate_disable+0x66/0x80
[  106.543871]  sk_psock_verdict_recv+0xe2/0x310
[  106.544258]  ? sk_psock_write_space+0x1f0/0x1f0
[  106.544561]  tcp_read_skb+0x7b/0x120
[  106.544740]  tcp_data_queue+0x904/0xee0
[  106.544931]  tcp_rcv_established+0x212/0x7c0
[  106.545142]  tcp_v4_do_rcv+0x174/0x2a0
[  106.545326]  tcp_v4_rcv+0xe70/0xf60
[  106.545500]  ip_protocol_deliver_rcu+0x48/0x290
[  106.545744]  ip_local_deliver_finish+0xa7/0x150

Fixes: 04919bed948dc ("tcp: Introduce tcp_read_skb()")
Reported-by: William Findlay <will@isovalent.com>
Tested-by: William Findlay <will@isovalent.com>
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
---
 net/core/skmsg.c   | 2 --
 net/ipv4/tcp.c     | 1 -
 net/ipv4/udp.c     | 5 +----
 net/unix/af_unix.c | 5 +----
 4 files changed, 2 insertions(+), 11 deletions(-)

Comments

Jakub Sitnicki March 28, 2023, 10:42 a.m. UTC | #1

Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
Patch

diff --git a/net/core/skmsg.c b/net/core/skmsg.c
index 53d0251788aa..2b6d9519ff29 100644
--- a/net/core/skmsg.c
+++ b/net/core/skmsg.c
@@ -1180,8 +1180,6 @@  static int sk_psock_verdict_recv(struct sock *sk, struct sk_buff *skb)
 	int ret = __SK_DROP;
 	int len = skb->len;
 
-	skb_get(skb);
-
 	rcu_read_lock();
 	psock = sk_psock(sk);
 	if (unlikely(!psock)) {
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 33f559f491c8..6572962b0237 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -1770,7 +1770,6 @@  int tcp_read_skb(struct sock *sk, skb_read_actor_t recv_actor)
 		WARN_ON_ONCE(!skb_set_owner_sk_safe(skb, sk));
 		tcp_flags = TCP_SKB_CB(skb)->tcp_flags;
 		used = recv_actor(sk, skb);
-		consume_skb(skb);
 		if (used < 0) {
 			if (!copied)
 				copied = used;
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 9592fe3e444a..04e8c6385246 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -1832,10 +1832,7 @@  int udp_read_skb(struct sock *sk, skb_read_actor_t recv_actor)
 	}
 
 	WARN_ON_ONCE(!skb_set_owner_sk_safe(skb, sk));
-	copied = recv_actor(sk, skb);
-	kfree_skb(skb);
-
-	return copied;
+	return recv_actor(sk, skb);
 }
 EXPORT_SYMBOL(udp_read_skb);
 
diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index f0c2293f1d3b..a5dd2ee0cfed 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -2554,10 +2554,7 @@  static int unix_read_skb(struct sock *sk, skb_read_actor_t recv_actor)
 	if (!skb)
 		return err;
 
-	copied = recv_actor(sk, skb);
-	kfree_skb(skb);
-
-	return copied;
+	return recv_actor(sk, skb);
 }
 
 /*