
[net-next] tcp: avoid premature drops in tcp_add_backlog()

Message ID 20240423125620.3309458-1-edumazet@google.com (mailing list archive)
State Accepted
Commit ec00ed472bdb7d0af840da68c8c11bff9f4d9caa
Delegated to: Netdev Maintainers
Series [net-next] tcp: avoid premature drops in tcp_add_backlog()

Checks

Context Check Description
netdev/series_format success Single patches do not need cover letters
netdev/tree_selection success Clearly marked for net-next
netdev/ynl success Generated files up to date; no warnings/errors; no diff in generated;
netdev/fixes_present success Fixes tag not required for -next series
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 945 this patch: 945
netdev/build_tools success No tools touched, skip
netdev/cc_maintainers warning 1 maintainers not CCed: dsahern@kernel.org
netdev/build_clang success Errors and warnings before: 938 this patch: 938
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/deprecated_api success None detected
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes success Fixes tag looks correct
netdev/build_allmodconfig_warn success Errors and warnings before: 956 this patch: 956
netdev/checkpatch success total: 0 errors, 0 warnings, 0 checks, 37 lines checked
netdev/build_clang_rust success No Rust files in patch. Skipping build
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/source_inline success Was 0 now: 0
netdev/contest success net-next-2024-04-24--15-00 (tests: 995)

Commit Message

Eric Dumazet April 23, 2024, 12:56 p.m. UTC
While testing TCP performance with the latest trees,
I saw suspect SOCKET_BACKLOG drops.

tcp_add_backlog() computes its limit with:

    limit = (u32)READ_ONCE(sk->sk_rcvbuf) +
            (u32)(READ_ONCE(sk->sk_sndbuf) >> 1);
    limit += 64 * 1024;

This does not take into account that sk->sk_backlog.len
is reset only at the very end of __release_sock().

Both sk->sk_backlog.len and sk->sk_rmem_alloc could reach
sk_rcvbuf in normal conditions.

We should double the sk->sk_rcvbuf contribution in the formula
to absorb bubbles in the backlog, which happen more often
for very fast flows.

This change maintains decent protection against abuses.

Fixes: c377411f2494 ("net: sk_add_backlog() take rmem_alloc into account")
Signed-off-by: Eric Dumazet <edumazet@google.com>
---
 net/ipv4/tcp_ipv4.c | 13 +++++++++++--
 1 file changed, 11 insertions(+), 2 deletions(-)
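
To make the scenario above concrete, here is a small stand-alone sketch of
the check this limit feeds into: the skb is dropped once the queued backlog
plus receive-queue memory exceed the limit. The function name, plain-integer
parameters and sample buffer sizes are hypothetical stand-ins for
sk->sk_backlog.len, sk->sk_rmem_alloc and the socket buffer fields; it
illustrates the arithmetic in the commit message, not the kernel code itself.

    #include <stdint.h>
    #include <stdio.h>

    /* Hypothetical stand-in for the admission check behind sk_add_backlog():
     * drop when queued backlog bytes plus receive-queue memory already
     * exceed the limit computed by tcp_add_backlog().
     */
    static int backlog_full(uint64_t backlog_len, uint64_t rmem_alloc,
                            uint64_t limit)
    {
            return backlog_len + rmem_alloc > limit;
    }

    int main(void)
    {
            uint64_t rcvbuf = 4 << 20, sndbuf = 2 << 20;    /* sample sizes */
            uint64_t old_lim = rcvbuf + (sndbuf >> 1) + 64 * 1024;
            uint64_t new_lim = (rcvbuf << 1) + (sndbuf >> 1) + 64 * 1024;

            /* Both the backlog and sk_rmem_alloc can legitimately reach sk_rcvbuf. */
            printf("old formula drops: %d\n", backlog_full(rcvbuf, rcvbuf, old_lim));
            printf("new formula drops: %d\n", backlog_full(rcvbuf, rcvbuf, new_lim));
            return 0;
    }

With both values at sk_rcvbuf, the old formula rejects the packet (4 MB + 4 MB
exceeds 4 MB + 1 MB + 64 KB), while the doubled sk_rcvbuf contribution absorbs it.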

Comments

Jakub Kicinski April 25, 2024, 6:46 p.m. UTC | #1
On Tue, 23 Apr 2024 12:56:20 +0000 Eric Dumazet wrote:
> Subject: [PATCH net-next] tcp: avoid premature drops in tcp_add_backlog()

This is intentionally for net-next?
Eric Dumazet April 25, 2024, 7:02 p.m. UTC | #2
On Thu, Apr 25, 2024 at 8:46 PM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Tue, 23 Apr 2024 12:56:20 +0000 Eric Dumazet wrote:
> > Subject: [PATCH net-next] tcp: avoid premature drops in tcp_add_backlog()
>
> This is intentionally for net-next?

Yes, this has been broken for a long time.

We can soak this a bit, then it will reach stable trees when this
reaches Linus' tree?
patchwork-bot+netdevbpf@kernel.org April 25, 2024, 7:30 p.m. UTC | #3
Hello:

This patch was applied to netdev/net-next.git (main)
by Jakub Kicinski <kuba@kernel.org>:

On Tue, 23 Apr 2024 12:56:20 +0000 you wrote:
> While testing TCP performance with the latest trees,
> I saw suspect SOCKET_BACKLOG drops.
> 
> tcp_add_backlog() computes its limit with:
> 
>     limit = (u32)READ_ONCE(sk->sk_rcvbuf) +
>             (u32)(READ_ONCE(sk->sk_sndbuf) >> 1);
>     limit += 64 * 1024;
> 
> [...]

Here is the summary with links:
  - [net-next] tcp: avoid premature drops in tcp_add_backlog()
    https://git.kernel.org/netdev/net-next/c/ec00ed472bdb

You are awesome, thank you!

Patch

diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 88c83ac4212957f19efad0f967952d2502bdbc7f..e06f0cd04f7eee2b00fcaebe17cbd23c26f1d28f 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -1995,7 +1995,7 @@  int tcp_v4_early_demux(struct sk_buff *skb)
 bool tcp_add_backlog(struct sock *sk, struct sk_buff *skb,
 		     enum skb_drop_reason *reason)
 {
-	u32 limit, tail_gso_size, tail_gso_segs;
+	u32 tail_gso_size, tail_gso_segs;
 	struct skb_shared_info *shinfo;
 	const struct tcphdr *th;
 	struct tcphdr *thtail;
@@ -2004,6 +2004,7 @@  bool tcp_add_backlog(struct sock *sk, struct sk_buff *skb,
 	bool fragstolen;
 	u32 gso_segs;
 	u32 gso_size;
+	u64 limit;
 	int delta;
 
 	/* In case all data was pulled from skb frags (in __pskb_pull_tail()),
@@ -2099,7 +2100,13 @@  bool tcp_add_backlog(struct sock *sk, struct sk_buff *skb,
 	__skb_push(skb, hdrlen);
 
 no_coalesce:
-	limit = (u32)READ_ONCE(sk->sk_rcvbuf) + (u32)(READ_ONCE(sk->sk_sndbuf) >> 1);
+	/* sk->sk_backlog.len is reset only at the end of __release_sock().
+	 * Both sk->sk_backlog.len and sk->sk_rmem_alloc could reach
+	 * sk_rcvbuf in normal conditions.
+	 */
+	limit = ((u64)READ_ONCE(sk->sk_rcvbuf)) << 1;
+
+	limit += ((u32)READ_ONCE(sk->sk_sndbuf)) >> 1;
 
 	/* Only socket owner can try to collapse/prune rx queues
 	 * to reduce memory overhead, so add a little headroom here.
@@ -2107,6 +2114,8 @@  bool tcp_add_backlog(struct sock *sk, struct sk_buff *skb,
 	 */
 	limit += 64 * 1024;
 
+	limit = min_t(u64, limit, UINT_MAX);
+
 	if (unlikely(sk_add_backlog(sk, skb, limit))) {
 		bh_unlock_sock(sk);
 		*reason = SKB_DROP_REASON_SOCKET_BACKLOG;
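
A note on the min_t() clamp in the hunk above: doubling sk->sk_rcvbuf, which
can itself approach INT_MAX, no longer fits reliably in 32 bits, so the
computation is widened to u64 and then clamped to UINT_MAX before being
passed down (the clamp in the patch shows the consumer expects a 32-bit
value). The helper name and sample values below are hypothetical; this is a
minimal user-space sketch of that widened arithmetic, not the patch code.

    #include <stdint.h>
    #include <stdio.h>

    /* Sketch of the widened limit computation, standing in for the
     * READ_ONCE(sk->sk_rcvbuf) / READ_ONCE(sk->sk_sndbuf) reads in the patch.
     */
    static uint32_t backlog_limit(uint32_t rcvbuf, uint32_t sndbuf)
    {
            uint64_t limit = (uint64_t)rcvbuf << 1;         /* doubled rcvbuf share */

            limit += sndbuf >> 1;
            limit += 64 * 1024;                             /* headroom */

            /* Clamp so the value still fits a 32-bit limit argument. */
            return limit > UINT32_MAX ? UINT32_MAX : (uint32_t)limit;
    }

    int main(void)
    {
            /* With sk_rcvbuf near INT_MAX, 2 * rcvbuf alone would wrap a u32. */
            printf("limit = %u\n", (unsigned int)backlog_limit(0x7fffffffu, 1u << 20));
            return 0;
    }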