From patchwork Mon Oct 11 19:16:44 2021
X-Patchwork-Submitter: John Fastabend
X-Patchwork-Id: 12550953
X-Patchwork-Delegate: bpf@iogearbox.net
X-Mailing-List: netdev@vger.kernel.org
From: John Fastabend
To: bpf@vger.kernel.org, netdev@vger.kernel.org
Cc: john.fastabend@gmail.com, daniel@iogearbox.net, joamaki@gmail.com, xiyou.wangcong@gmail.com
Subject: [PATCH bpf 1/4] bpf, sockmap: Remove unhash handler for BPF sockmap usage
Date: Mon, 11 Oct 2021 12:16:44 -0700
Message-Id: <20211011191647.418704-2-john.fastabend@gmail.com>
In-Reply-To: <20211011191647.418704-1-john.fastabend@gmail.com>
References: <20211011191647.418704-1-john.fastabend@gmail.com>

We do not need to handle unhash from the BPF side; we can simply wait for
the close to happen. The original concern was that a socket could
transition from ESTABLISHED state to a new state while the BPF hook was
still attached. But we convinced ourselves this is no longer possible, and
we also improved BPF sockmap to handle listen sockets, so this is no
longer a problem.

More importantly, there are cases where unhash is called while data is
still in the receive queue. The BPF unhash logic will flush this data,
which is wrong. To be correct it should keep the data in the receive queue
and allow the receiving application to continue reading it. This may
happen when tcp_abort is run on the socket, for example. Instead of
complicating the unhash logic, moving all of this handling into the
tcp_close hook solves it.

Fixes: 51199405f9672 ("bpf: skb_verdict, support SK_PASS on RX BPF path")
Signed-off-by: John Fastabend
---
 net/ipv4/tcp_bpf.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/net/ipv4/tcp_bpf.c b/net/ipv4/tcp_bpf.c
index d3e9386b493e..35dcfb04f53d 100644
--- a/net/ipv4/tcp_bpf.c
+++ b/net/ipv4/tcp_bpf.c
@@ -476,7 +476,6 @@ static void tcp_bpf_rebuild_protos(struct proto prot[TCP_BPF_NUM_CFGS],
                                    struct proto *base)
 {
         prot[TCP_BPF_BASE] = *base;
-        prot[TCP_BPF_BASE].unhash = sock_map_unhash;
         prot[TCP_BPF_BASE].close = sock_map_close;
         prot[TCP_BPF_BASE].recvmsg = tcp_bpf_recvmsg;
         prot[TCP_BPF_BASE].stream_memory_read = tcp_bpf_stream_read;

From patchwork Mon Oct 11 19:16:45 2021
X-Patchwork-Submitter: John Fastabend
X-Patchwork-Id: 12550955
X-Patchwork-Delegate: bpf@iogearbox.net
X-Mailing-List: netdev@vger.kernel.org
From: John Fastabend
To: bpf@vger.kernel.org, netdev@vger.kernel.org
Cc: john.fastabend@gmail.com, daniel@iogearbox.net, joamaki@gmail.com, xiyou.wangcong@gmail.com
Subject: [PATCH bpf 2/4] bpf, sockmap: Fix race in ingress receive verdict with redirect to self
Date: Mon, 11 Oct 2021 12:16:45 -0700
Message-Id: <20211011191647.418704-3-john.fastabend@gmail.com>
In-Reply-To: <20211011191647.418704-1-john.fastabend@gmail.com>
References: <20211011191647.418704-1-john.fastabend@gmail.com>

A socket in a sockmap may have different combinations of programs attached
depending on configuration. There can be no programs, in which case the
socket acts as a sink only. There can be a TX program only, in which case
a BPF program is attached to the sending side but no RX program is
attached. There can be an RX program only, where sends have no BPF program
attached but receives are hooked with BPF. And finally, both TX and RX
programs may be attached. This gives us the permutations: None, Tx, Rx,
and TxRx.

To date most of our use cases have been the Tx case, used as a fast
datapath to copy directly between a local application and a userspace
proxy, or the Rx and TxRx cases for applications operating an in-kernel
proxy. The traffic in the first case, where we hook two applications
together over loopback, looks like this:

  AppA  redirect  AppB
   Tx <----------> Rx
   |                |
   +                +
  TCP <--> lo <--> TCP

In this case all traffic from AppA (after the 3-way handshake) is copied
into the AppB ingress queue and no traffic ever sits on the TCP
receive_queue.

In the second case the application never receives traffic on the actual
user space socket, except in some rare error cases. Instead the send
happens in the kernel.

    AppProxy       socket pool
        sk0 ------------->{sk1,sk2, skn}
         ^                     |
         |                     |
         |                     v
      ingress              lb egress
        TCP                  TCP

Here, because traffic is never read off the socket with the userspace
recv() APIs, there is only ever one reader on the sk receive_queue:
namely, the BPF programs.

However, we've started to introduce a third configuration where the BPF
program on receive should process the data, but then the normal case is
to push the data into the receive queue of AppB:
        AppB          recv()    (userspace)
     -------------------------
        tcp_bpf_recvmsg()       (kernel)
          |             |
          |             |
          |             |
    ingress_msgQ        |
          |             |
        RX_BPF          |
          |             |
          v             v
        sk->receive_queue

This is different from the App{A,B} redirect case because traffic is
first received on the sk->receive_queue.

Now for the issue. The tcp_bpf_recvmsg() handler first checks the
ingress_msg queue for any data handled by the BPF RX program and returned
with the PASS code, i.e. data that was enqueued on the ingress msg queue.
Then, if no data exists on that queue, it checks the socket receive queue.
Unfortunately, this is the same receive_queue the BPF program is reading
data off of, so we get a race. It's possible for the recvmsg() hook to
pull data off the receive_queue before the BPF hook has a chance to read
it. It typically happens when an application is banging on recv() and
getting EAGAINs, until it manages to race with the RX BPF program.

To fix this, note that before this patch, at attach time when the socket
is loaded into the map, we check whether it needs a TX program or just the
base set of proto BPF hooks, and then we use the above general RX hook
regardless of whether a BPF program is attached on RX or not. This patch
extends that check to handle all the cases enumerated above: none, Tx, Rx,
and TxRx. And to fix the above race, when an RX program is attached we use
a new hook that is nearly identical to the old one, except that we no
longer let the recv() call skip the RX BPF program. Now only the BPF
program pulls data from sk->receive_queue, and recv() only pulls data from
the ingress msgQ after BPF program handling.

With this resolved, our AppB from above has been up and running for many
hours without detecting any errors. We verify this by correlating counters
in the RX BPF events and in AppB to ensure data never skips the BPF
program. Selftests were not able to detect this because we only run them
for a short period of time on well-ordered send/recvs, so we don't get any
of the noise we see in real application environments.

Fixes: 51199405f9672 ("bpf: skb_verdict, support SK_PASS on RX BPF path")
Signed-off-by: John Fastabend
Acked-by: Jakub Sitnicki
---
 net/ipv4/tcp_bpf.c | 47 ++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 47 insertions(+)
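
For illustration only (not part of this series): a minimal stream verdict
program of the kind that selects the new Rx/TxRx configurations could look
like the sketch below. The map layout, section name, and the unconditional
SK_PASS verdict are placeholder assumptions, not code from this patch.

// SPDX-License-Identifier: GPL-2.0
/* Minimal sk_skb stream verdict sketch. Returning SK_PASS queues the skb
 * on the local socket's ingress msg queue; with this patch, having a
 * verdict program like this attached is what selects the new
 * TCP_BPF_RX/TCP_BPF_TXRX recvmsg hook, so recv() only reads what the
 * program passed and never races with it on sk->receive_queue.
 */
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

struct {
        __uint(type, BPF_MAP_TYPE_SOCKMAP);
        __uint(max_entries, 16);
        __type(key, __u32);
        __type(value, __u64);
} sock_map SEC(".maps");

SEC("sk_skb/stream_verdict")
int stream_verdict_pass(struct __sk_buff *skb)
{
        /* Pass everything to the local socket's ingress path. */
        return SK_PASS;
}

char _license[] SEC("license") = "GPL";

A loader would typically attach this with
bpf_prog_attach(prog_fd, map_fd, BPF_SK_SKB_STREAM_VERDICT, 0) and then
insert established sockets into sock_map.
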
diff --git a/net/ipv4/tcp_bpf.c b/net/ipv4/tcp_bpf.c
index 35dcfb04f53d..0cc420c0e259 100644
--- a/net/ipv4/tcp_bpf.c
+++ b/net/ipv4/tcp_bpf.c
@@ -185,6 +185,41 @@ static int tcp_msg_wait_data(struct sock *sk, struct sk_psock *psock,
         return ret;
 }
 
+static int tcp_bpf_recvmsg_parser(struct sock *sk,
+                                  struct msghdr *msg,
+                                  size_t len,
+                                  int nonblock,
+                                  int flags,
+                                  int *addr_len)
+{
+        struct sk_psock *psock;
+        int copied;
+
+        if (unlikely(flags & MSG_ERRQUEUE))
+                return inet_recv_error(sk, msg, len, addr_len);
+
+        psock = sk_psock_get(sk);
+        if (unlikely(!psock))
+                return tcp_recvmsg(sk, msg, len, nonblock, flags, addr_len);
+
+        lock_sock(sk);
+msg_bytes_ready:
+        copied = sk_msg_recvmsg(sk, psock, msg, len, flags);
+        if (!copied) {
+                long timeo;
+                int data;
+
+                timeo = sock_rcvtimeo(sk, nonblock);
+                data = tcp_msg_wait_data(sk, psock, timeo);
+                if (data && !sk_psock_queue_empty(psock))
+                        goto msg_bytes_ready;
+                copied = -EAGAIN;
+        }
+        release_sock(sk);
+        sk_psock_put(sk, psock);
+        return copied;
+}
+
 static int tcp_bpf_recvmsg(struct sock *sk, struct msghdr *msg, size_t len,
                            int nonblock, int flags, int *addr_len)
 {
@@ -465,6 +500,8 @@ enum {
 enum {
         TCP_BPF_BASE,
         TCP_BPF_TX,
+        TCP_BPF_RX,
+        TCP_BPF_TXRX,
         TCP_BPF_NUM_CFGS,
 };
 
@@ -483,6 +520,12 @@ static void tcp_bpf_rebuild_protos(struct proto prot[TCP_BPF_NUM_CFGS],
         prot[TCP_BPF_TX] = prot[TCP_BPF_BASE];
         prot[TCP_BPF_TX].sendmsg = tcp_bpf_sendmsg;
         prot[TCP_BPF_TX].sendpage = tcp_bpf_sendpage;
+
+        prot[TCP_BPF_RX] = prot[TCP_BPF_BASE];
+        prot[TCP_BPF_RX].recvmsg = tcp_bpf_recvmsg_parser;
+
+        prot[TCP_BPF_TXRX] = prot[TCP_BPF_TX];
+        prot[TCP_BPF_TXRX].recvmsg = tcp_bpf_recvmsg_parser;
 }
 
 static void tcp_bpf_check_v6_needs_rebuild(struct proto *ops)
@@ -520,6 +563,10 @@ int tcp_bpf_update_proto(struct sock *sk, struct sk_psock *psock, bool restore)
         int family = sk->sk_family == AF_INET6 ? TCP_BPF_IPV6 : TCP_BPF_IPV4;
         int config = psock->progs.msg_parser ? TCP_BPF_TX : TCP_BPF_BASE;
 
+        if (psock->progs.stream_verdict || psock->progs.skb_verdict) {
+                config = (config == TCP_BPF_TX) ? TCP_BPF_TXRX : TCP_BPF_RX;
+        }
+
         if (restore) {
                 if (inet_csk_has_ulp(sk)) {
                         /* TLS does not have an unhash proto in SW cases,

From patchwork Mon Oct 11 19:16:46 2021
X-Patchwork-Submitter: John Fastabend
X-Patchwork-Id: 12550957
X-Patchwork-Delegate: bpf@iogearbox.net
X-Mailing-List: netdev@vger.kernel.org
From: John Fastabend
To: bpf@vger.kernel.org, netdev@vger.kernel.org
Cc: john.fastabend@gmail.com, daniel@iogearbox.net, joamaki@gmail.com, xiyou.wangcong@gmail.com
Subject: [PATCH bpf 3/4] bpf: sockmap, strparser, and tls are reusing qdisc_skb_cb and colliding
Date: Mon, 11 Oct 2021 12:16:46 -0700
Message-Id: <20211011191647.418704-4-john.fastabend@gmail.com>
In-Reply-To: <20211011191647.418704-1-john.fastabend@gmail.com>
References: <20211011191647.418704-1-john.fastabend@gmail.com>

Strparser is reusing the qdisc_skb_cb struct to stash the skb message
handling progress, e.g. the offset and length of the skb. First, this is
poorly named and inherits a struct from qdisc that doesn't reflect the
actual usage of cb[] at this layer. But, more importantly, strparser is
using the following to access its metadata:

  (struct _strp_msg *)((void *)skb->cb + offsetof(struct qdisc_skb_cb, data))

where _strp_msg is defined as:

  struct _strp_msg {
          struct strp_msg            strp;                 /*     0     8 */
          int                        accum_len;            /*     8     4 */

          /* size: 12, cachelines: 1, members: 2 */
          /* last cacheline: 12 bytes */
  };

So we use 12 bytes of the ->data[] field in the struct. However, in BPF
code running the parser and verdict programs, the user has read access
into the data[] array as well. It's not too problematic, but we should not
be exposing internal state to BPF programs. If it's really needed, the
probe_read() APIs can be used to read kernel memory. And I don't believe
moving this around poses any cb[] API breakage, because programs can't
depend on cb[] across layers.

In order to fix another issue with a ctx rewrite, we need to stash a temp
variable somewhere. To make this work cleanly, this patch builds a cb
struct for sk_skb types, called sk_skb_cb. Then we can use it consistently
across the strparser and sockmap code. Additionally, we can start allowing
->cb[] write access after this.

Fixes: 604326b41a6fb ("bpf, sockmap: convert to generic sk_msg interface")
Signed-off-by: John Fastabend
Reviewed-by: Jakub Sitnicki
---
 include/net/strparser.h   | 16 +++++++++++++++-
 net/core/filter.c         | 22 ++++++++++++++++++++++
 net/strparser/strparser.c | 10 +---------
 3 files changed, 38 insertions(+), 10 deletions(-)
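
For illustration only (not part of this patch): the standalone sketch
below mirrors the new structs in userspace to check the layout assumptions
the change relies on, namely that strp_msg stays at offset 0 of _strp_msg
for the strp_msg() cast, that the BPF-visible data[] area does not overlap
the strparser state, and that the whole cb struct fits in the 48-byte
skb->cb[] area. The mirrored definitions and the SKB_CB_LEN constant are
local stand-ins, not the kernel headers.

/* Compile-time layout check using userspace mirrors of the structs above. */
#include <stddef.h>
#include <stdio.h>

struct strp_msg { int full_len; int offset; };

struct _strp_msg {
        struct strp_msg strp;   /* must stay first for the strp_msg() cast */
        int accum_len;
};

struct sk_skb_cb {
#define SK_SKB_CB_PRIV_LEN 20
        unsigned char data[SK_SKB_CB_PRIV_LEN];
        struct _strp_msg strp;
};

#define SKB_CB_LEN 48   /* sizeof(((struct sk_buff *)0)->cb) upstream */

_Static_assert(offsetof(struct _strp_msg, strp) == 0,
               "strp_msg must be first in _strp_msg");
_Static_assert(offsetof(struct sk_skb_cb, strp) >= SK_SKB_CB_PRIV_LEN,
               "BPF-visible data[] must not overlap strparser state");
_Static_assert(sizeof(struct sk_skb_cb) <= SKB_CB_LEN,
               "sk_skb_cb must fit in skb->cb[]");

int main(void)
{
        printf("sk_skb_cb uses %zu of %d cb bytes\n",
               sizeof(struct sk_skb_cb), SKB_CB_LEN);
        return 0;
}
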
diff --git a/include/net/strparser.h b/include/net/strparser.h
index 1d20b98493a1..bec1439bd3be 100644
--- a/include/net/strparser.h
+++ b/include/net/strparser.h
@@ -54,10 +54,24 @@ struct strp_msg {
         int offset;
 };
 
+struct _strp_msg {
+        /* Internal cb structure. struct strp_msg must be first for passing
+         * to upper layer.
+         */
+        struct strp_msg strp;
+        int accum_len;
+};
+
+struct sk_skb_cb {
+#define SK_SKB_CB_PRIV_LEN 20
+        unsigned char data[SK_SKB_CB_PRIV_LEN];
+        struct _strp_msg strp;
+};
+
 static inline struct strp_msg *strp_msg(struct sk_buff *skb)
 {
         return (struct strp_msg *)((void *)skb->cb +
-                offsetof(struct qdisc_skb_cb, data));
+                offsetof(struct sk_skb_cb, strp));
 }
 
 /* Structure for an attached lower socket */
diff --git a/net/core/filter.c b/net/core/filter.c
index 2e32cee2c469..23a9bf92b5bb 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -9761,11 +9761,33 @@ static u32 sk_skb_convert_ctx_access(enum bpf_access_type type,
                                      struct bpf_prog *prog, u32 *target_size)
 {
         struct bpf_insn *insn = insn_buf;
+        int off;
 
         switch (si->off) {
         case offsetof(struct __sk_buff, data_end):
                 insn = bpf_convert_data_end_access(si, insn);
                 break;
+        case offsetof(struct __sk_buff, cb[0]) ...
+             offsetofend(struct __sk_buff, cb[4]) - 1:
+                BUILD_BUG_ON(sizeof_field(struct sk_skb_cb, data) < 20);
+                BUILD_BUG_ON((offsetof(struct sk_buff, cb) +
+                              offsetof(struct sk_skb_cb, data)) %
+                             sizeof(__u64));
+
+                prog->cb_access = 1;
+                off = si->off;
+                off -= offsetof(struct __sk_buff, cb[0]);
+                off += offsetof(struct sk_buff, cb);
+                off += offsetof(struct sk_skb_cb, data);
+                if (type == BPF_WRITE)
+                        *insn++ = BPF_STX_MEM(BPF_SIZE(si->code), si->dst_reg,
+                                              si->src_reg, off);
+                else
+                        *insn++ = BPF_LDX_MEM(BPF_SIZE(si->code), si->dst_reg,
+                                              si->src_reg, off);
+                break;
+
+
         default:
                 return bpf_convert_ctx_access(type, si, insn_buf, prog,
                                               target_size);
diff --git a/net/strparser/strparser.c b/net/strparser/strparser.c
index 9c0343568d2a..1a72c67afed5 100644
--- a/net/strparser/strparser.c
+++ b/net/strparser/strparser.c
@@ -27,18 +27,10 @@
 
 static struct workqueue_struct *strp_wq;
 
-struct _strp_msg {
-        /* Internal cb structure. struct strp_msg must be first for passing
-         * to upper layer.
-         */
-        struct strp_msg strp;
-        int accum_len;
-};
-
 static inline struct _strp_msg *_strp_msg(struct sk_buff *skb)
 {
         return (struct _strp_msg *)((void *)skb->cb +
-                offsetof(struct qdisc_skb_cb, data));
+                offsetof(struct sk_skb_cb, strp));
 }
 
 /* Lower lock held */

From patchwork Mon Oct 11 19:16:47 2021
X-Patchwork-Submitter: John Fastabend
X-Patchwork-Id: 12550959
X-Patchwork-Delegate: bpf@iogearbox.net
X-Mailing-List: netdev@vger.kernel.org
From: John Fastabend
To: bpf@vger.kernel.org, netdev@vger.kernel.org
Cc: john.fastabend@gmail.com, daniel@iogearbox.net, joamaki@gmail.com, xiyou.wangcong@gmail.com
Subject: [PATCH bpf 4/4] bpf, sockmap: sk_skb data_end access incorrect when src_reg = dst_reg
Date: Mon, 11 Oct 2021 12:16:47 -0700
Message-Id: <20211011191647.418704-5-john.fastabend@gmail.com>
In-Reply-To: <20211011191647.418704-1-john.fastabend@gmail.com>
References: <20211011191647.418704-1-john.fastabend@gmail.com>

From: Jussi Maki

The current conversion of skb->data_end reads like this:

  ; data_end = (void*)(long)skb->data_end;
   559: (79) r1 = *(u64 *)(r2 +200)   ; r1 = skb->data
   560: (61) r11 = *(u32 *)(r2 +112)  ; r11 = skb->len
   561: (0f) r1 += r11
   562: (61) r11 = *(u32 *)(r2 +116)
   563: (1f) r1 -= r11

But similar to the case in commit ("bpf: sock_ops sk access may stomp
registers when dst_reg = src_reg"), the code will read an incorrect
skb->len when src == dst. In this case we end up generating this xlated
code:

  ; data_end = (void*)(long)skb->data_end;
   559: (79) r1 = *(u64 *)(r1 +200)   ; r1 = skb->data
   560: (61) r11 = *(u32 *)(r1 +112)  ; r11 = (skb->data)->len
   561: (0f) r1 += r11
   562: (61) r11 = *(u32 *)(r1 +116)
   563: (1f) r1 -= r11

where line 560 is reading 4 bytes at (skb->data + 112) instead of the
intended skb->len. Here the skb pointer in r1 gets set to skb->data, and
the later dereference for skb->len ends up following skb->data instead of
skb.

This fixes the issue similarly to the patch mentioned above, by creating
an additional temporary variable and using it to store the register when
dst_reg == src_reg. We name the variable temp_reg and place it in the cb
context for sk_skb programs. Then we restore from the temp to ensure
nothing is lost.

Fixes: 16137b09a66f2 ("bpf: Compute data_end dynamically with JIT code")
Signed-off-by: Jussi Maki
Signed-off-by: John Fastabend
Reviewed-by: Jakub Sitnicki
---
 include/net/strparser.h |  4 ++++
 net/core/filter.c       | 36 ++++++++++++++++++++++++++++++------
 2 files changed, 34 insertions(+), 6 deletions(-)
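
For illustration only (not part of this patch): the small userspace sketch
below mirrors the scratch-register selection rule used in
bpf_convert_data_end_access() below, so it can be sanity-checked in
isolation. The helper name and the plain-int register numbers are
stand-ins for the kernel's BPF_REG_* constants.

/* Illustrative model of the scratch-register choice: start at R9 and
 * step down until the pick collides with neither src_reg nor dst_reg.
 */
#include <assert.h>
#include <stdio.h>

static int pick_scratch_reg(int src_reg, int dst_reg)
{
        int reg = 9;                    /* BPF_REG_9 */

        if (src_reg == reg || dst_reg == reg)
                reg--;
        if (src_reg == reg || dst_reg == reg)
                reg--;
        return reg;
}

int main(void)
{
        /* src == dst == R1: R9 is free, use it. */
        assert(pick_scratch_reg(1, 1) == 9);
        /* src == dst == R9: fall back to R8. */
        assert(pick_scratch_reg(9, 9) == 8);
        printf("scratch register selection ok\n");
        return 0;
}
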
diff --git a/include/net/strparser.h b/include/net/strparser.h
index bec1439bd3be..732b7097d78e 100644
--- a/include/net/strparser.h
+++ b/include/net/strparser.h
@@ -66,6 +66,10 @@ struct sk_skb_cb {
 #define SK_SKB_CB_PRIV_LEN 20
         unsigned char data[SK_SKB_CB_PRIV_LEN];
         struct _strp_msg strp;
+        /* temp_reg is a temporary register used for bpf_convert_data_end_access
+         * when dst_reg == src_reg.
+         */
+        u64 temp_reg;
 };
 
 static inline struct strp_msg *strp_msg(struct sk_buff *skb)
diff --git a/net/core/filter.c b/net/core/filter.c
index 23a9bf92b5bb..f4a63af45f00 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -9735,22 +9735,46 @@ static u32 sock_ops_convert_ctx_access(enum bpf_access_type type,
 static struct bpf_insn *bpf_convert_data_end_access(const struct bpf_insn *si,
                                                     struct bpf_insn *insn)
 {
-        /* si->dst_reg = skb->data */
+        int reg;
+        int temp_reg_off = offsetof(struct sk_buff, cb) +
+                           offsetof(struct sk_skb_cb, temp_reg);
+
+        if (si->src_reg == si->dst_reg) {
+                /* We need an extra register, choose and save a register. */
+                reg = BPF_REG_9;
+                if (si->src_reg == reg || si->dst_reg == reg)
+                        reg--;
+                if (si->src_reg == reg || si->dst_reg == reg)
+                        reg--;
+                *insn++ = BPF_STX_MEM(BPF_DW, si->src_reg, reg, temp_reg_off);
+        } else {
+                reg = si->dst_reg;
+        }
+
+        /* reg = skb->data */
         *insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(struct sk_buff, data),
-                              si->dst_reg, si->src_reg,
+                              reg, si->src_reg,
                               offsetof(struct sk_buff, data));
         /* AX = skb->len */
         *insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(struct sk_buff, len),
                               BPF_REG_AX, si->src_reg,
                               offsetof(struct sk_buff, len));
-        /* si->dst_reg = skb->data + skb->len */
-        *insn++ = BPF_ALU64_REG(BPF_ADD, si->dst_reg, BPF_REG_AX);
+        /* reg = skb->data + skb->len */
+        *insn++ = BPF_ALU64_REG(BPF_ADD, reg, BPF_REG_AX);
         /* AX = skb->data_len */
         *insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(struct sk_buff, data_len),
                               BPF_REG_AX, si->src_reg,
                               offsetof(struct sk_buff, data_len));
-        /* si->dst_reg = skb->data + skb->len - skb->data_len */
-        *insn++ = BPF_ALU64_REG(BPF_SUB, si->dst_reg, BPF_REG_AX);
+
+        /* reg = skb->data + skb->len - skb->data_len */
+        *insn++ = BPF_ALU64_REG(BPF_SUB, reg, BPF_REG_AX);
+
+        if (si->src_reg == si->dst_reg) {
+                /* Restore the saved register */
+                *insn++ = BPF_MOV64_REG(BPF_REG_AX, si->src_reg);
+                *insn++ = BPF_MOV64_REG(si->dst_reg, reg);
+                *insn++ = BPF_LDX_MEM(BPF_DW, reg, BPF_REG_AX, temp_reg_off);
+        }
         return insn;
 }