From patchwork Tue May 2 15:51:47 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: John Fastabend X-Patchwork-Id: 13229105 X-Patchwork-Delegate: bpf@iogearbox.net Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id E4973C7EE23 for ; Tue, 2 May 2023 15:52:08 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S234614AbjEBPwH (ORCPT ); Tue, 2 May 2023 11:52:07 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:58658 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S234616AbjEBPwG (ORCPT ); Tue, 2 May 2023 11:52:06 -0400 Received: from mail-pl1-x62d.google.com (mail-pl1-x62d.google.com [IPv6:2607:f8b0:4864:20::62d]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 3DCB826BC; Tue, 2 May 2023 08:52:05 -0700 (PDT) Received: by mail-pl1-x62d.google.com with SMTP id d9443c01a7336-1aaff9c93a5so14333455ad.2; Tue, 02 May 2023 08:52:05 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20221208; t=1683042724; x=1685634724; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:from:to:cc:subject:date :message-id:reply-to; bh=huaGQx94IhqjlpBZz9Dh/6dEAcqnZNQFnDotU/ZG3V0=; b=n541Rnzyhy8z1tOKr+QV2e9Yu+x5CbcnPpIxxTh1muXrKaPAk5+kPBsD2eKRffL6K0 M9WrwXFLGy/UV7KfaCQBi05OF3S619xETNA+m2Y0SafL5cfPdX6p1A1QlDxnaTc1d1PW vA66taARSRJgFZ6YK4RJqhGjBK0gf4SAE7PBGLXgwXujzz5K37TsknOmkubXCoCLi/as VvVkJC361fIb+ufnC7h7zsBRb58gFDH8ejiI5EylNk1oybDWIt4+xknjQJpudK27E3Ta TcfXOZprykUUEvPAzHCnE4ITdfBsRyw/Jj0zdl2nG2mNluF0LmLHnEB9Wi/ZN6FIuIw0 ajcg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20221208; t=1683042724; x=1685634724; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=huaGQx94IhqjlpBZz9Dh/6dEAcqnZNQFnDotU/ZG3V0=; b=DvhU1fqNgwLatBBbgNDv+aQrWvrnEVqzcbN3tJiw4Edc1rQAbhM26MS8S4FKh/u0EF mf0YdwJR/CIR+wr97IRwF5aueRyskkPtcmWkifIuWpS0mhOfKfWeB+c7++T9XzQ3+JKR ux0yn6II1vsrDkDX6TytW1OE5woXb3p+6cLW7hu0HHzDNgpxXZLvjVERoIP7UyDBsxid 9nuwtx8dzyBEXuAc/wKVZrhbL/3x3OGSoGrXg6+UfyDchdwk3iY/sIbqVfCtPumqOpaY UiXBuzLAcJwyq5PYyp2jNW0F4M3SHjmpzWHnQOwnlWOWCIfjQIChJeawFuiYfkKHuTFe +m5Q== X-Gm-Message-State: AC+VfDyoziAhFfPQsTZ67VvIQ87QXCA5bzy/DUkaUShtHxJ0yE2pu5Oe 62rMAki8U//xPvex5iHXqh7yM9Kn7xg= X-Google-Smtp-Source: ACHHUZ5XQU6CLiuHg1sSuBBPJjmQKUwdxiTHglEaGOuWNeSzydf/i3oIjZ6Llfm8W4LAD1iTSMKA9A== X-Received: by 2002:a17:902:a704:b0:1a6:51a6:ca76 with SMTP id w4-20020a170902a70400b001a651a6ca76mr16738577plq.11.1683042724537; Tue, 02 May 2023 08:52:04 -0700 (PDT) Received: from john.lan ([2605:59c8:148:ba10:62ab:a7fd:a4e3:bd70]) by smtp.gmail.com with ESMTPSA id o3-20020a170902778300b001a1a07d04e6sm19917212pll.77.2023.05.02.08.52.03 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Tue, 02 May 2023 08:52:04 -0700 (PDT) From: John Fastabend To: jakub@cloudflare.com, daniel@iogearbox.net, lmb@isovalent.com, edumazet@google.com Cc: john.fastabend@gmail.com, bpf@vger.kernel.org, netdev@vger.kernel.org, ast@kernel.org, andrii@kernel.org, will@isovalent.com Subject: [PATCH bpf v7 01/13] bpf: sockmap, pass skb ownership through read_skb Date: Tue, 2 May 2023 08:51:47 -0700 Message-Id: <20230502155159.305437-2-john.fastabend@gmail.com> X-Mailer: git-send-email 2.33.0 In-Reply-To: <20230502155159.305437-1-john.fastabend@gmail.com> References: <20230502155159.305437-1-john.fastabend@gmail.com> MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: bpf@vger.kernel.org X-Patchwork-Delegate: bpf@iogearbox.net The read_skb hook calls consume_skb() now, but this means that if the recv_actor program wants to use the skb it needs to inc the ref cnt so that the consume_skb() doesn't kfree the sk_buff. This is problematic because in some error cases under memory pressure we may need to linearize the sk_buff from sk_psock_skb_ingress_enqueue(). Then we get this, skb_linearize() __pskb_pull_tail() pskb_expand_head() BUG_ON(skb_shared(skb)) Because we incremented users refcnt from sk_psock_verdict_recv() we hit the bug on with refcnt > 1 and trip it. To fix lets simply pass ownership of the sk_buff through the skb_read call. Then we can drop the consume from read_skb handlers and assume the verdict recv does any required kfree. Bug found while testing in our CI which runs in VMs that hit memory constraints rather regularly. William tested TCP read_skb handlers. [ 106.536188] ------------[ cut here ]------------ [ 106.536197] kernel BUG at net/core/skbuff.c:1693! [ 106.536479] invalid opcode: 0000 [#1] PREEMPT SMP PTI [ 106.536726] CPU: 3 PID: 1495 Comm: curl Not tainted 5.19.0-rc5 #1 [ 106.537023] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ArchLinux 1.16.0-1 04/01/2014 [ 106.537467] RIP: 0010:pskb_expand_head+0x269/0x330 [ 106.538585] RSP: 0018:ffffc90000138b68 EFLAGS: 00010202 [ 106.538839] RAX: 000000000000003f RBX: ffff8881048940e8 RCX: 0000000000000a20 [ 106.539186] RDX: 0000000000000002 RSI: 0000000000000000 RDI: ffff8881048940e8 [ 106.539529] RBP: ffffc90000138be8 R08: 00000000e161fd1a R09: 0000000000000000 [ 106.539877] R10: 0000000000000018 R11: 0000000000000000 R12: ffff8881048940e8 [ 106.540222] R13: 0000000000000003 R14: 0000000000000000 R15: ffff8881048940e8 [ 106.540568] FS: 00007f277dde9f00(0000) GS:ffff88813bd80000(0000) knlGS:0000000000000000 [ 106.540954] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 106.541227] CR2: 00007f277eeede64 CR3: 000000000ad3e000 CR4: 00000000000006e0 [ 106.541569] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 106.541915] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 106.542255] Call Trace: [ 106.542383] [ 106.542487] __pskb_pull_tail+0x4b/0x3e0 [ 106.542681] skb_ensure_writable+0x85/0xa0 [ 106.542882] sk_skb_pull_data+0x18/0x20 [ 106.543084] bpf_prog_b517a65a242018b0_bpf_skskb_http_verdict+0x3a9/0x4aa9 [ 106.543536] ? migrate_disable+0x66/0x80 [ 106.543871] sk_psock_verdict_recv+0xe2/0x310 [ 106.544258] ? sk_psock_write_space+0x1f0/0x1f0 [ 106.544561] tcp_read_skb+0x7b/0x120 [ 106.544740] tcp_data_queue+0x904/0xee0 [ 106.544931] tcp_rcv_established+0x212/0x7c0 [ 106.545142] tcp_v4_do_rcv+0x174/0x2a0 [ 106.545326] tcp_v4_rcv+0xe70/0xf60 [ 106.545500] ip_protocol_deliver_rcu+0x48/0x290 [ 106.545744] ip_local_deliver_finish+0xa7/0x150 Fixes: 04919bed948dc ("tcp: Introduce tcp_read_skb()") Reported-by: William Findlay Tested-by: William Findlay Reviewed-by: Jakub Sitnicki Signed-off-by: John Fastabend --- net/core/skmsg.c | 2 -- net/ipv4/tcp.c | 1 - net/ipv4/udp.c | 7 ++----- net/unix/af_unix.c | 7 ++----- 4 files changed, 4 insertions(+), 13 deletions(-) diff --git a/net/core/skmsg.c b/net/core/skmsg.c index f81883759d38..4a3dc8d27295 100644 --- a/net/core/skmsg.c +++ b/net/core/skmsg.c @@ -1183,8 +1183,6 @@ static int sk_psock_verdict_recv(struct sock *sk, struct sk_buff *skb) int ret = __SK_DROP; int len = skb->len; - skb_get(skb); - rcu_read_lock(); psock = sk_psock(sk); if (unlikely(!psock)) { diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c index 288693981b00..1be305e3d3c7 100644 --- a/net/ipv4/tcp.c +++ b/net/ipv4/tcp.c @@ -1770,7 +1770,6 @@ int tcp_read_skb(struct sock *sk, skb_read_actor_t recv_actor) WARN_ON_ONCE(!skb_set_owner_sk_safe(skb, sk)); tcp_flags = TCP_SKB_CB(skb)->tcp_flags; used = recv_actor(sk, skb); - consume_skb(skb); if (used < 0) { if (!copied) copied = used; diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c index c605d171eb2d..8aaae82e78ae 100644 --- a/net/ipv4/udp.c +++ b/net/ipv4/udp.c @@ -1813,7 +1813,7 @@ EXPORT_SYMBOL(__skb_recv_udp); int udp_read_skb(struct sock *sk, skb_read_actor_t recv_actor) { struct sk_buff *skb; - int err, copied; + int err; try_again: skb = skb_recv_udp(sk, MSG_DONTWAIT, &err); @@ -1832,10 +1832,7 @@ int udp_read_skb(struct sock *sk, skb_read_actor_t recv_actor) } WARN_ON_ONCE(!skb_set_owner_sk_safe(skb, sk)); - copied = recv_actor(sk, skb); - kfree_skb(skb); - - return copied; + return recv_actor(sk, skb); } EXPORT_SYMBOL(udp_read_skb); diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c index 0b0f18ecce44..bb34852bf947 100644 --- a/net/unix/af_unix.c +++ b/net/unix/af_unix.c @@ -2553,7 +2553,7 @@ static int unix_read_skb(struct sock *sk, skb_read_actor_t recv_actor) { struct unix_sock *u = unix_sk(sk); struct sk_buff *skb; - int err, copied; + int err; mutex_lock(&u->iolock); skb = skb_recv_datagram(sk, MSG_DONTWAIT, &err); @@ -2561,10 +2561,7 @@ static int unix_read_skb(struct sock *sk, skb_read_actor_t recv_actor) if (!skb) return err; - copied = recv_actor(sk, skb); - kfree_skb(skb); - - return copied; + return recv_actor(sk, skb); } /*