From patchwork Mon Apr 3 20:01:33 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: John Fastabend X-Patchwork-Id: 13198664 X-Patchwork-Delegate: bpf@iogearbox.net Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 07DCEC76196 for ; Mon, 3 Apr 2023 20:02:17 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S233030AbjDCUCP (ORCPT ); Mon, 3 Apr 2023 16:02:15 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:41572 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S232934AbjDCUB5 (ORCPT ); Mon, 3 Apr 2023 16:01:57 -0400 Received: from mail-pj1-x1034.google.com (mail-pj1-x1034.google.com [IPv6:2607:f8b0:4864:20::1034]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 5751A273D; Mon, 3 Apr 2023 13:01:54 -0700 (PDT) Received: by mail-pj1-x1034.google.com with SMTP id h12-20020a17090aea8c00b0023d1311fab3so31727655pjz.1; Mon, 03 Apr 2023 13:01:54 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20210112; t=1680552114; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:from:to:cc:subject:date :message-id:reply-to; bh=/8gAHxQelt3CaAtcU0Ygo0YY4U4/7bAPQhKs4mLaCww=; b=QjaGkhbKCvdNaTTTCXkvpwfuRlL2jEShFQTVJDf0q9wJKeJ6BlXmbTpb86FkRDTyK4 45VxQV32QAi4j5iw1wTXblx2MDOeCq6WnddWQe2n/GCkSqhJph3Q4W6huOt21yTOgldB 7O95ZuCk+DZj7JzCAIo1sL+fOkomT1roy1iYHSaL2O0jOoku/65ZBOsK9nNqLjajh73p B7ehiudkG/G401HTyQrH3zz+j3DY02IxYTyotKNdB1SZYFB4J2c6qKmjNCFcbtvn52lA /SMB0WNv9bB2b/ywitHJsAy2x9FRZ4xjZGcmDoPUc0SjRQ0ZNHABJMJQqp781oSFv5vR YBAg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; t=1680552114; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=/8gAHxQelt3CaAtcU0Ygo0YY4U4/7bAPQhKs4mLaCww=; b=3AjcEDFLji66ZBU63rjA67EnEz6B62kyu5vGVSrYsAQflWJnwswCXlO8z10bKO37Xc 8q4OJleb8auvRYca/Ieqc4LgBtR9Yr+7GsHOEcWUxJVi55kVtK4eKz0Fat2HtzBxGO4X 3Ph4TE9PVeE5SdcWPOjBzZZdlmRuCsQBWhsFbcVqXXelqGcdnLTtLgdSgjKF0IoxPlIN K1BvrAItkNWaL7k8758jVlrl8IIR+0iJFImXnFamF+zQY4PeqwDQWstIWyJ6KDkDGLUS tGhhcqyjz4ewTo6akHcpmiREbODVSItPB0/71n+jcEnUMkTKxhkrsG6lcO6XlejXZbFM Xr6w== X-Gm-Message-State: AAQBX9feer9mSgYoEOC4z7aU60SNYFgUYt4rFq2PnuGmjURVnqzqNmAf yRTUygiUABvhaF/Ilzg8iPkc/P8qI9Ze1A== X-Google-Smtp-Source: AKy350Z2lMiUr+1iMyu1yFrllIUwa7mt4jpqhmJ+sT/QsP4ZyH1mXZ0ppgyT1VqxmLSitXfYShRAzg== X-Received: by 2002:a17:902:ea0e:b0:19c:3d78:6a54 with SMTP id s14-20020a170902ea0e00b0019c3d786a54mr186502plg.14.1680552113811; Mon, 03 Apr 2023 13:01:53 -0700 (PDT) Received: from localhost.localdomain ([2605:59c8:4c5:7110:3da7:5d97:f465:5e01]) by smtp.gmail.com with ESMTPSA id t18-20020a1709028c9200b0019c2b1c4db1sm6948835plo.239.2023.04.03.13.01.51 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Mon, 03 Apr 2023 13:01:53 -0700 (PDT) From: John Fastabend To: cong.wang@bytedance.com, jakub@cloudflare.com, daniel@iogearbox.net, lmb@isovalent.com, edumazet@google.com Cc: john.fastabend@gmail.com, bpf@vger.kernel.org, netdev@vger.kernel.org, ast@kernel.org, andrii@kernel.org, will@isovalent.com Subject: [PATCH bpf v3 07/12] bpf: sockmap incorrectly handling copied_seq Date: Mon, 3 Apr 2023 13:01:33 -0700 Message-Id: <20230403200138.937569-8-john.fastabend@gmail.com> X-Mailer: git-send-email 2.33.0 In-Reply-To: <20230403200138.937569-1-john.fastabend@gmail.com> References: <20230403200138.937569-1-john.fastabend@gmail.com> MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: bpf@vger.kernel.org X-Patchwork-Delegate: bpf@iogearbox.net The read_skb() logic is incrementing the tcp->copied_seq which is used for among other things calculating how many outstanding bytes can be read by the application. This results in application errors, if the application does an ioctl(FIONREAD) we return zero because this is calculated from the copied_seq value. To fix this we move tcp->copied_seq accounting into the recv handler so that we update these when the recvmsg() hook is called and data is in fact copied into user buffers. This gives an accurate FIONREAD value as expected and improves ACK handling. Before we were calling the tcp_rcv_space_adjust() which would update 'number of bytes copied to user in last RTT' which is wrong for programs returning SK_PASS. The bytes are only copied to the user when recvmsg is handled. Doing the fix for recvmsg is straightforward, but fixing redirect and SK_DROP pkts is a bit tricker. Build a tcp_psock_eat() helper and then call this from skmsg handlers. This fixes another issue where a broken socket with a BPF program doing a resubmit could hang the receiver. This happened because although read_skb() consumed the skb through sock_drop() it did not update the copied_seq. Now if a single reccv socket is redirecting to many sockets (for example for lb) the receiver sk will be hung even though we might expect it to continue. The hang comes from not updating the copied_seq numbers and memory pressure resulting from that. We have a slight layer problem of calling tcp_eat_skb even if its not a TCP socket. To fix we could refactor and create per type receiver handlers. I decided this is more work than we want in the fix and we already have some small tweaks depending on caller that use the helper skb_bpf_strparser(). So we extend that a bit and always set the strparser bit when it is in use and then we can gate the seq_copied updates on this. Fixes: 04919bed948dc ("tcp: Introduce tcp_read_skb()") Signed-off-by: John Fastabend --- include/net/tcp.h | 3 +++ net/core/skmsg.c | 7 +++++-- net/ipv4/tcp.c | 10 +--------- net/ipv4/tcp_bpf.c | 28 +++++++++++++++++++++++++++- 4 files changed, 36 insertions(+), 12 deletions(-) diff --git a/include/net/tcp.h b/include/net/tcp.h index db9f828e9d1e..674044b8bdaf 100644 --- a/include/net/tcp.h +++ b/include/net/tcp.h @@ -1467,6 +1467,8 @@ static inline void tcp_adjust_rcv_ssthresh(struct sock *sk) } void tcp_cleanup_rbuf(struct sock *sk, int copied); +void __tcp_cleanup_rbuf(struct sock *sk, int copied); + /* We provision sk_rcvbuf around 200% of sk_rcvlowat. * If 87.5 % (7/8) of the space has been consumed, we want to override @@ -2321,6 +2323,7 @@ struct sk_psock; struct proto *tcp_bpf_get_proto(struct sock *sk, struct sk_psock *psock); int tcp_bpf_update_proto(struct sock *sk, struct sk_psock *psock, bool restore); void tcp_bpf_clone(const struct sock *sk, struct sock *newsk); +void tcp_eat_skb(struct sock *sk, struct sk_buff *skb); #endif /* CONFIG_BPF_SYSCALL */ int tcp_bpf_sendmsg_redir(struct sock *sk, bool ingress, diff --git a/net/core/skmsg.c b/net/core/skmsg.c index a2e83d2aacf8..69983f40fbec 100644 --- a/net/core/skmsg.c +++ b/net/core/skmsg.c @@ -1053,11 +1053,14 @@ static int sk_psock_verdict_apply(struct sk_psock *psock, struct sk_buff *skb, mutex_unlock(&psock->work_mutex); break; case __SK_REDIRECT: + tcp_eat_skb(psock->sk, skb); err = sk_psock_skb_redirect(psock, skb); break; case __SK_DROP: default: out_free: + tcp_eat_skb(psock->sk, skb); + skb_bpf_redirect_clear(skb); sock_drop(psock->sk, skb); } @@ -1102,8 +1105,7 @@ static void sk_psock_strp_read(struct strparser *strp, struct sk_buff *skb) skb_dst_drop(skb); skb_bpf_redirect_clear(skb); ret = bpf_prog_run_pin_on_cpu(prog, skb); - if (ret == SK_PASS) - skb_bpf_set_strparser(skb); + skb_bpf_set_strparser(skb); ret = sk_psock_map_verd(ret, skb_bpf_redirect_fetch(skb)); skb->sk = NULL; } @@ -1211,6 +1213,7 @@ static int sk_psock_verdict_recv(struct sock *sk, struct sk_buff *skb) psock = sk_psock(sk); if (unlikely(!psock)) { len = 0; + tcp_eat_skb(sk, skb); sock_drop(sk, skb); goto out; } diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c index 1be305e3d3c7..5610f8341b38 100644 --- a/net/ipv4/tcp.c +++ b/net/ipv4/tcp.c @@ -1568,7 +1568,7 @@ static int tcp_peek_sndq(struct sock *sk, struct msghdr *msg, int len) * calculation of whether or not we must ACK for the sake of * a window update. */ -static void __tcp_cleanup_rbuf(struct sock *sk, int copied) +void __tcp_cleanup_rbuf(struct sock *sk, int copied) { struct tcp_sock *tp = tcp_sk(sk); bool time_to_ack = false; @@ -1783,14 +1783,6 @@ int tcp_read_skb(struct sock *sk, skb_read_actor_t recv_actor) break; } } - WRITE_ONCE(tp->copied_seq, seq); - - tcp_rcv_space_adjust(sk); - - /* Clean up data we have read: This will do ACK frames. */ - if (copied > 0) - __tcp_cleanup_rbuf(sk, copied); - return copied; } EXPORT_SYMBOL(tcp_read_skb); diff --git a/net/ipv4/tcp_bpf.c b/net/ipv4/tcp_bpf.c index ae6c7130551c..9e94864ce130 100644 --- a/net/ipv4/tcp_bpf.c +++ b/net/ipv4/tcp_bpf.c @@ -11,6 +11,24 @@ #include #include +void tcp_eat_skb(struct sock *sk, struct sk_buff *skb) +{ + struct tcp_sock *tcp; + int copied; + + if (!skb || !skb->len || !sk_is_tcp(sk)) + return; + + if (skb_bpf_strparser(skb)) + return; + + tcp = tcp_sk(sk); + copied = tcp->copied_seq + skb->len; + WRITE_ONCE(tcp->copied_seq, copied); + tcp_rcv_space_adjust(sk); + __tcp_cleanup_rbuf(sk, skb->len); +} + static int bpf_tcp_ingress(struct sock *sk, struct sk_psock *psock, struct sk_msg *msg, u32 apply_bytes, int flags) { @@ -198,8 +216,10 @@ static int tcp_bpf_recvmsg_parser(struct sock *sk, int flags, int *addr_len) { + struct tcp_sock *tcp = tcp_sk(sk); + u32 seq = tcp->copied_seq; struct sk_psock *psock; - int copied; + int copied = 0; if (unlikely(flags & MSG_ERRQUEUE)) return inet_recv_error(sk, msg, len, addr_len); @@ -244,9 +264,11 @@ static int tcp_bpf_recvmsg_parser(struct sock *sk, if (is_fin) { copied = 0; + seq++; goto out; } } + seq += copied; if (!copied) { long timeo; int data; @@ -284,6 +306,10 @@ static int tcp_bpf_recvmsg_parser(struct sock *sk, copied = -EAGAIN; } out: + WRITE_ONCE(tcp->copied_seq, seq); + tcp_rcv_space_adjust(sk); + if (copied > 0) + __tcp_cleanup_rbuf(sk, copied); release_sock(sk); sk_psock_put(sk, psock); return copied;