[net] tcp: fix potential xmit stalls caused by TCP_NOTSENT_LOWAT

Message ID 20220425003407.3002429-1-eric.dumazet@gmail.com (mailing list archive)
State Accepted
Commit 4bfe744ff1644fbc0a991a2677dc874475dd6776
Delegated to: Netdev Maintainers
Series [net] tcp: fix potential xmit stalls caused by TCP_NOTSENT_LOWAT

Checks

Context Check Description
netdev/tree_selection success Clearly marked for net, async
netdev/fixes_present success Fixes tag present in non-next series
netdev/subject_prefix success Link
netdev/cover_letter success Single patches do not need cover letters
netdev/patch_count success Link
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 1106 this patch: 1106
netdev/cc_maintainers warning 2 maintainers not CCed: yoshfuji@linux-ipv6.org dsahern@kernel.org
netdev/build_clang success Errors and warnings before: 133 this patch: 133
netdev/module_param success Was 0 now: 0
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/verify_fixes success Fixes tag looks correct
netdev/build_allmodconfig_warn success Errors and warnings before: 1111 this patch: 1111
netdev/checkpatch success total: 0 errors, 0 warnings, 0 checks, 32 lines checked
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/source_inline success Was 0 now: 0

Commit Message

Eric Dumazet April 25, 2022, 12:34 a.m. UTC
From: Eric Dumazet <edumazet@google.com>

I had this bug sitting for too long in my pile, it is time to fix it.

Thanks to Doug Porter for reminding me of it!

We had various attempts in the past, including commit
0cbe6a8f089e ("tcp: remove SOCK_QUEUE_SHRUNK"),
but the issue is that TCP stack currently only generates
EPOLLOUT from input path, when tp->snd_una has advanced
and skb(s) cleaned from rtx queue.

If a flow has a big RTT, and/or receives SACKs, it is possible
that the notsent part (tp->write_seq - tp->snd_nxt) reaches 0
and no more data can be sent until tp->snd_una finally advances.

What is needed is to also check if POLLOUT needs to be generated
whenever tp->snd_nxt is advanced, from output path.

This bug triggers more often after an idle period, as
we do not receive an ACK for at least one RTT. tcp_notsent_lowat
could be a fraction of what the CWND and pacing rate would allow
us to send during this RTT.

In a followup patch, I will remove the bogus call
to tcp_chrono_stop(sk, TCP_CHRONO_SNDBUF_LIMITED)
from tcp_check_space(). The fact that we have decided to generate
an EPOLLOUT does not mean the application has immediately
refilled the transmit queue. This optimistic call
might have been the reason the bug seemed not too serious.
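
To make the condition concrete, a small illustrative C model of the
notsent check (the function below is a stand-in, not the kernel's
tcp_stream_memory_free() implementation):

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Illustrative model only, not kernel code: "notsent" is the data the
 * application has queued but the stack has not yet sent, i.e.
 * tp->write_seq - tp->snd_nxt.  The socket should be reported writeable
 * (EPOLLOUT) once this amount drops below the configured watermark.
 */
static bool notsent_below_lowat(uint32_t write_seq, uint32_t snd_nxt,
				uint32_t lowat)
{
	uint32_t notsent = write_seq - snd_nxt;

	return lowat == 0 || notsent < lowat;
}

int main(void)
{
	/* Stall scenario: everything queued has been sent, so notsent == 0
	 * and the condition holds -- but before this patch it was only
	 * re-evaluated when an ACK advanced tp->snd_una, up to one RTT later.
	 */
	printf("writeable: %d\n",
	       notsent_below_lowat(10000000, 10000000, 500000));
	return 0;
}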

Tested:

200 ms rtt, 1% packet loss, 32 MB tcp_rmem[2] and tcp_wmem[2]

$ echo 500000 >/proc/sys/net/ipv4/tcp_notsent_lowat
$ cat bench_rr.sh
SUM=0
for i in {1..10}
do
 V=`netperf -H remote_host -l30 -t TCP_RR -- -r 10000000,10000 -o LOCAL_BYTES_SENT | egrep -v "MIGRATED|Bytes"`
 echo $V
 SUM=$(($SUM + $V))
done
echo SUM=$SUM

Before patch:
$ bench_rr.sh
130000000
80000000
140000000
140000000
140000000
140000000
130000000
40000000
90000000
110000000
SUM=1140000000

After patch:
$ bench_rr.sh
430000000
590000000
530000000
450000000
450000000
350000000
450000000
490000000
480000000
460000000
SUM=4680000000  # This is 410% of the value before the patch.

Fixes: c9bee3b7fdec ("tcp: TCP_NOTSENT_LOWAT socket option")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Doug Porter <dsp@fb.com>
Cc: Soheil Hassas Yeganeh <soheil@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
---
 include/net/tcp.h     |  1 +
 net/ipv4/tcp_input.c  | 12 +++++++++++-
 net/ipv4/tcp_output.c |  1 +
 3 files changed, 13 insertions(+), 1 deletion(-)

Comments

Soheil Hassas Yeganeh April 25, 2022, 4:06 a.m. UTC | #1
On Sun, Apr 24, 2022 at 8:34 PM Eric Dumazet <eric.dumazet@gmail.com> wrote:
>
> From: Eric Dumazet <edumazet@google.com>
>
> I had this bug sitting for too long in my pile, it is time to fix it.
>
> [...]

Acked-by: Soheil Hassas Yeganeh <soheil@google.com>

Thank you for the fix!

patchwork-bot+netdevbpf@kernel.org April 25, 2022, 11:10 a.m. UTC | #2
Hello:

This patch was applied to netdev/net.git (master)
by David S. Miller <davem@davemloft.net>:

On Sun, 24 Apr 2022 17:34:07 -0700 you wrote:
> From: Eric Dumazet <edumazet@google.com>
> 
> I had this bug sitting for too long in my pile, it is time to fix it.
> 
> Thanks to Doug Porter for reminding me of it!
> 
> We had various attempts in the past, including commit
> 0cbe6a8f089e ("tcp: remove SOCK_QUEUE_SHRUNK"),
> but the issue is that TCP stack currently only generates
> EPOLLOUT from input path, when tp->snd_una has advanced
> and skb(s) cleaned from rtx queue.
> 
> [...]

Here is the summary with links:
  - [net] tcp: fix potential xmit stalls caused by TCP_NOTSENT_LOWAT
    https://git.kernel.org/netdev/net/c/4bfe744ff164

You are awesome, thank you!
Dave Taht April 25, 2022, 1:16 p.m. UTC | #3
On Sun, Apr 24, 2022 at 7:00 PM Eric Dumazet <eric.dumazet@gmail.com> wrote:
>
> From: Eric Dumazet <edumazet@google.com>
>
> I had this bug sitting for too long in my pile, it is time to fix it.
>
> [...]

Thx! We have been having very good results with TCP_NOTSENT_LOWAT set
to 32k or less behind an apache traffic server... and had some really
puzzling ones at geosync RTTs. Now I gotta go retest.

Side question: Is there a guide/set of recommendations to setting this
value more appropriately, under what circumstances? Could it
autoconfigure?

net.ipv4.tcp_notsent_lowat = 32768
Eric Dumazet April 25, 2022, 3:05 p.m. UTC | #4
On Mon, Apr 25, 2022 at 6:16 AM Dave Taht <dave.taht@gmail.com> wrote:
>

>
> Thx! We have been having very good results with TCP_NOTSENT_LOWAT set
> to 32k or less behind an apache traffic server... and had some really
> puzzling ones at geosync RTTs. Now I gotta go retest.
>
> Side question: Is there a guide/set of recommendations to setting this
> value more appropriately, under what circumstances? Could it
> autoconfigure?

It is a tradeoff between memory usage in the kernel (storing data in
the socket transmit queue),
and the number of times the application is woken up to let it push
another chunk of data.

32k really means: no more than one skb at a time. (An skb can
usually store about 64KB of payload.)

At Google we have been using 2MB, but I suspect this was to work
around the bug I just fixed.
We probably could use a smaller value like 1MB, leading to one
EPOLLOUT for 1/2 MB.

Precise values also depend on how much work is needed in
tcp_sendmsg(), zerocopy can play a role here.

Autoconfigure? Not sure how. Once you have provisioned a server for a
given workload,
you are all set. No need to tune the value as a function of the load.

>
> net.ipv4.tcp_notsent_lowat = 32768

This probably has caused many stalls on long rtt / lossy links...
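
To illustrate the application-side pattern discussed above, a minimal
sketch (assumptions: a Linux toolchain where netinet/tcp.h defines
TCP_NOTSENT_LOWAT, an already-connected socket and an existing epoll
instance; error handling is reduced to the bare minimum and the code is
not from this thread):

#include <stddef.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <sys/epoll.h>
#include <netinet/in.h>
#include <netinet/tcp.h>

/* Cap the not-yet-sent backlog (here 1 MB, one of the values discussed
 * above) and register interest in EPOLLOUT so the application is woken
 * when the stack is ready for the next chunk.
 */
int arm_notsent_lowat(int fd, int epfd)
{
	int lowat = 1 << 20;
	struct epoll_event ev = { .events = EPOLLOUT, .data.fd = fd };

	if (setsockopt(fd, IPPROTO_TCP, TCP_NOTSENT_LOWAT,
		       &lowat, sizeof(lowat)) < 0)
		return -1;
	return epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev);
}

/* Push a buffer, writing another chunk each time EPOLLOUT fires. */
void stream_out(int fd, int epfd, const char *buf, size_t len)
{
	struct epoll_event ev;
	ssize_t n;

	while (len > 0) {
		/* Woken once the notsent amount drops below the watermark;
		 * the xmit-stall bug could delay this wake-up by up to
		 * one RTT.
		 */
		if (epoll_wait(epfd, &ev, 1, -1) <= 0)
			break;
		n = send(fd, buf, len, MSG_NOSIGNAL);
		if (n <= 0)
			break;
		buf += n;
		len -= (size_t)n;
	}
}

Real code would typically drop EPOLLOUT interest (or use EPOLLONESHOT)
once it has nothing left to queue, since level-triggered EPOLLOUT keeps
firing while the socket stays writeable.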

>
> --
> FQ World Domination pending: https://blog.cerowrt.org/post/state_of_fq_codel/
> Dave Täht CEO, TekLibre, LLC

Patch

diff --git a/include/net/tcp.h b/include/net/tcp.h
index 9987b3fba9f202632916cc439af9d17f1e68bcd3..cc1295037533a7741e454f7c040f77a21deae02b 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -621,6 +621,7 @@  void tcp_synack_rtt_meas(struct sock *sk, struct request_sock *req);
 void tcp_reset(struct sock *sk, struct sk_buff *skb);
 void tcp_skb_mark_lost_uncond_verify(struct tcp_sock *tp, struct sk_buff *skb);
 void tcp_fin(struct sock *sk);
+void tcp_check_space(struct sock *sk);
 
 /* tcp_timer.c */
 void tcp_init_xmit_timers(struct sock *);
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 2088f93fa37b5fb9110e7933242a27bd4009990e..48f6075228600896daa6507c4cd06acfc851a0fa 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -5454,7 +5454,17 @@  static void tcp_new_space(struct sock *sk)
 	INDIRECT_CALL_1(sk->sk_write_space, sk_stream_write_space, sk);
 }
 
-static void tcp_check_space(struct sock *sk)
+/* Caller made space either from:
+ * 1) Freeing skbs in rtx queues (after tp->snd_una has advanced)
+ * 2) Sent skbs from output queue (and thus advancing tp->snd_nxt)
+ *
+ * We might be able to generate EPOLLOUT to the application if:
+ * 1) Space consumed in output/rtx queues is below sk->sk_sndbuf/2
+ * 2) notsent amount (tp->write_seq - tp->snd_nxt) became
+ *    small enough that tcp_stream_memory_free() decides it
+ *    is time to generate EPOLLOUT.
+ */
+void tcp_check_space(struct sock *sk)
 {
 	/* pairs with tcp_poll() */
 	smp_mb();
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 9ede847f4199844c5884e3f62ea450562072a0a7..1ca2f28c9981018e6cfaee3435d711467af6048d 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -82,6 +82,7 @@  static void tcp_event_new_data_sent(struct sock *sk, struct sk_buff *skb)
 
 	NET_ADD_STATS(sock_net(sk), LINUX_MIB_TCPORIGDATASENT,
 		      tcp_skb_pcount(skb));
+	tcp_check_space(sk);
 }
 
 /* SND.NXT, if window was not shrunk or the amount of shrunk was less than one
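
Read alongside the comment block added above, a simplified userspace
model of the two conditions (illustrative only; the struct and helper
below are stand-ins rather than the kernel's struct sock / tcp_sock
bookkeeping):

#include <stdbool.h>
#include <stdint.h>

/* Stand-in socket state, not the kernel's. */
struct sock_model {
	uint32_t sndbuf;	/* sk->sk_sndbuf                         */
	uint32_t queued;	/* bytes held in the output + rtx queues */
	uint32_t write_seq;	/* tp->write_seq                         */
	uint32_t snd_nxt;	/* tp->snd_nxt                           */
	uint32_t lowat;		/* TCP_NOTSENT_LOWAT, 0 means "not set"  */
};

/* Roughly when EPOLLOUT may be generated, per the comment above:
 * 1) less than half of the send buffer is consumed by queued skbs, and
 * 2) the notsent amount has dropped below the configured watermark.
 */
static bool may_generate_epollout(const struct sock_model *sk)
{
	uint32_t notsent = sk->write_seq - sk->snd_nxt;

	return sk->queued < sk->sndbuf / 2 &&
	       (sk->lowat == 0 || notsent < sk->lowat);
}

int main(void)
{
	/* Numbers loosely matching the test setup above: 32 MB send buffer,
	 * a 500000-byte notsent watermark, and everything queued already
	 * sent (notsent == 0), so EPOLLOUT may be generated.
	 */
	struct sock_model sk = {
		.sndbuf = 32u << 20, .queued = 4u << 20,
		.write_seq = 10u << 20, .snd_nxt = 10u << 20,
		.lowat = 500000,
	};
	return may_generate_epollout(&sk) ? 0 : 1;
}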