[net,2/2] tcp: fix delayed ACKs for MSS boundary condition

Message ID	20230927151501.1549078-2-ncardwell.sw@gmail.com (mailing list archive)
State	Superseded
Delegated to:	Netdev Maintainers
Headers	Received: from lindbergh.monkeyblade.net (lindbergh.monkeyblade.net [23.128.96.19])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
	(No client certificate requested)
	by smtp.subspace.kernel.org (Postfix) with ESMTPS id 1541F38FB6
	for <netdev@vger.kernel.org>; Wed, 27 Sep 2023 15:15:08 +0000 (UTC)
	From: Neal Cardwell <ncardwell.sw@gmail.com>
	To: David Miller <davem@davemloft.net>, Jakub Kicinski <kuba@kernel.org>, Eric Dumazet <edumazet@google.com>
	Cc: netdev@vger.kernel.org, Neal Cardwell <ncardwell@google.com>, Yuchung Cheng <ycheng@google.com>
	Subject: [PATCH net 2/2] tcp: fix delayed ACKs for MSS boundary condition
	Date: Wed, 27 Sep 2023 11:15:01 -0400
	Message-ID: <20230927151501.1549078-2-ncardwell.sw@gmail.com>
	In-Reply-To: <20230927151501.1549078-1-ncardwell.sw@gmail.com>
	References: <20230927151501.1549078-1-ncardwell.sw@gmail.com>
	Precedence: bulk
	MIME-Version: 1.0
	Content-Transfer-Encoding: 8bit
Series	[net,1/2] tcp: fix quick-ack counting to count actual ACKs of new data
	[net,2/2] tcp: fix delayed ACKs for MSS boundary condition

Context	Check	Description
netdev/series_format	success	Single patches do not need cover letters
netdev/tree_selection	success	Clearly marked for net, async
netdev/fixes_present	success	Fixes tag present in non-next series
netdev/header_inline	success	No static functions without inline keyword in header files
netdev/build_32bit	success	Errors and warnings before: 1343 this patch: 1343
netdev/cc_maintainers	warning	2 maintainers not CCed: pabeni@redhat.com dsahern@kernel.org
netdev/build_clang	success	Errors and warnings before: 1364 this patch: 1364
netdev/verify_signedoff	success	Signed-off-by tag matches author and committer
netdev/deprecated_api	success	None detected
netdev/check_selftest	success	No net selftest shell script
netdev/verify_fixes	success	Fixes tag looks correct
netdev/build_allmodconfig_warn	success	Errors and warnings before: 1366 this patch: 1366
netdev/checkpatch	success	total: 0 errors, 0 warnings, 0 checks, 19 lines checked
netdev/kdoc	success	Errors and warnings before: 0 this patch: 0
netdev/source_inline	success	Was 0 now: 0

Neal Cardwell Sept. 27, 2023, 3:15 p.m. UTC

From: Neal Cardwell <ncardwell@google.com>

This commit fixes poor delayed ACK behavior that can cause poor TCP
latency in a particular boundary condition: when an application makes
a TCP socket write that is an exact multiple of the MSS size.

The problem is that there is a painful boundary discontinuity in the
current delayed ACK behavior. With the current delayed ACK behavior,
we have:

(1) If an app reads > 1*MSS data, tcp_cleanup_rbuf() ACKs immediately
    because of:

     tp->rcv_nxt - tp->rcv_wup > icsk->icsk_ack.rcv_mss ||

(2) If an app reads < 1*MSS data and either (a) app is not ping-pong or
    (b) we received two packets <1*MSS, then tcp_cleanup_rbuf() ACKs
    immediately because of:

     ((icsk->icsk_ack.pending & ICSK_ACK_PUSHED2) ||
      ((icsk->icsk_ack.pending & ICSK_ACK_PUSHED) &&
       !inet_csk_in_pingpong_mode(sk))) &&

(3) *However*: if an app reads exactly 1*MSS of data,
    tcp_cleanup_rbuf() does not send an immediate ACK. This is true
    even if the app is not ping-pong and the 1*MSS of data had the PSH
    bit set, suggesting the sending application completed an
    application write.

Thus if the app is not ping-pong, we have this painful case where
>1*MSS gets an immediate ACK, and <1*MSS gets an immediate ACK, but a
write whose last skb is an exact multiple of 1*MSS can get a 40ms
delayed ACK. This means that any app that transfers data in one
direction and takes care to align write size or packet size with MSS
can suffer this problem. With receive zero copy making 4KB MSS values
more common, it is becoming more common to have application writes
naturally align with MSS, and more applications are likely to
encounter this delayed ACK problem.
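
The discontinuity described above can be reproduced with a small model. This is a minimal sketch, not kernel code: the struct, the `MODEL_ACK_*` flag values, and the function names are hypothetical, and the `copied > 0` condition and the fully drained receive queue are assumptions about the surrounding logic in tcp_cleanup_rbuf() rather than something quoted in this message. The model treats the unACKed bytes as full-MSS segments plus an optional short tail, ignores any ACKs sent while the data was arriving, lets the app read everything, and reports whether an immediate ACK would be sent.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical flag values for this model only. */
enum {
	MODEL_ACK_PUSHED  = 1 << 0,	/* one sub-MSS segment was received */
	MODEL_ACK_PUSHED2 = 1 << 1,	/* a second sub-MSS segment was received */
};

struct model_sock {
	uint32_t rcv_nxt;	/* next sequence number expected */
	uint32_t rcv_wup;	/* rcv_nxt when the last ACK was sent */
	uint32_t rcv_mss;	/* receiver's estimate of the sender's MSS */
	unsigned int pending;	/* MODEL_ACK_* bits */
	bool pingpong;		/* request/response traffic pattern */
};

/* Pre-patch decision after the app has read `copied` bytes and
 * (assumed) drained the receive queue.
 */
bool model_ack_now(const struct model_sock *s, uint32_t copied)
{
	if (s->rcv_nxt - s->rcv_wup > s->rcv_mss)	/* case (1) */
		return true;
	return copied > 0 &&				/* case (2) */
	       ((s->pending & MODEL_ACK_PUSHED2) ||
		((s->pending & MODEL_ACK_PUSHED) && !s->pingpong));
}

/* Receive `bytes` as full-MSS segments plus a short tail, then read it all.
 * `mss` must be non-zero.
 */
bool model_read_all(uint32_t mss, uint32_t bytes, bool pingpong)
{
	struct model_sock s = { .rcv_nxt = 1000, .rcv_wup = 1000,
				.rcv_mss = mss, .pingpong = pingpong };

	s.rcv_nxt += bytes;
	if (bytes % mss)	/* only a sub-MSS tail sets the PUSHED flag */
		s.pending |= MODEL_ACK_PUSHED;
	return model_ack_now(&s, bytes);
}
```

With mss = 1460 and a receiver that is not ping-pong, model_read_all() returns true for 1459 and 1461 bytes but false for exactly 1460 bytes, which is case (3).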

The fix in this commit is to refine the delayed ACK heuristics with a
simple check: immediately ACK a received 1*MSS skb with PSH bit set if
the app reads all data. Why? If an skb has a len of exactly 1*MSS and
has the PSH bit set then it is likely the end of an application
write. So more data may not be arriving soon, and yet the data sender
may be waiting for an ACK if cwnd-bound or using TX zero copy. Thus we
set ICSK_ACK_PUSHED in this case so that tcp_cleanup_rbuf() will send
an ACK immediately if the app reads all of the data and is not
ping-pong. Note that this logic is also executed for the case where
len > MSS, but in that case this logic does not matter (and does not
hurt) because tcp_cleanup_rbuf() will always ACK immediately if the
app reads data and there is more than an MSS of unACKed data.

Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Reviewed-by: Yuchung Cheng <ycheng@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
---
 net/ipv4/tcp_input.c | 13 +++++++++++++
 1 file changed, 13 insertions(+)
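
The fix described in the commit message can be sketched as a small user-space model. This is an illustration under stated assumptions, not the patch itself: the `FIX_ACK_*` values, both function names, and the `patched` switch are hypothetical, the `len >= rcv_mss` branch condition is inferred from the remark that the new logic also runs when len > MSS, and the handling of sub-MSS segments (first sets PUSHED, second sets PUSHED2) is an assumption based on case (2) above.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical flag values for this model only. */
enum {
	FIX_ACK_PUSHED  = 1 << 0,
	FIX_ACK_PUSHED2 = 1 << 1,
};

/* Return the pending flags after one segment of `len` bytes arrives. */
unsigned int fix_measure(unsigned int pending, uint32_t rcv_mss,
			 uint32_t len, bool psh, bool patched)
{
	if (len >= rcv_mss) {
		/* New check: a full-sized segment carrying PSH is likely
		 * the end of an application write.
		 */
		if (patched && psh)
			pending |= FIX_ACK_PUSHED;
	} else {
		/* Assumed existing behavior for sub-MSS segments. */
		if (pending & FIX_ACK_PUSHED)
			pending |= FIX_ACK_PUSHED2;
		pending |= FIX_ACK_PUSHED;
	}
	return pending;
}

/* Decision once the app has read all queued data. */
bool fix_ack_on_full_read(unsigned int pending, uint32_t unacked,
			  uint32_t rcv_mss, bool pingpong)
{
	if (unacked > rcv_mss)
		return true;
	return (pending & FIX_ACK_PUSHED2) ||
	       ((pending & FIX_ACK_PUSHED) && !pingpong);
}
```

In this model a single 1460-byte segment with PSH set, read in full by a receiver that is not ping-pong, gets a delayed ACK with `patched = false` and an immediate ACK with `patched = true`. A ping-pong receiver, or a segment without PSH, still gets the delayed ACK.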

xin.guo Sept. 28, 2023, 8:53 a.m. UTC | #1

Hi Neal,
I cannot understand the phrases "if an app reads > 1*MSS data", "If an app
reads < 1*MSS data" and "if an app reads exactly 1*MSS of data" in the
commit message.
In my view, they should read: "if an app reads and received data > 1*MSS",
"If an app reads and received data < 1*MSS" and "if an app reads and
received data exactly 1*MSS".

Regards
Guo Xin

On Wed, Sep 27, 2023 at 23:15 Neal Cardwell <ncardwell.sw@gmail.com> wrote:
> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> index 06fe1cf645d5a..8afb0950a6979 100644
> --- a/net/ipv4/tcp_input.c
> +++ b/net/ipv4/tcp_input.c
> @@ -253,6 +253,19 @@ static void tcp_measure_rcv_mss(struct sock *sk, const struct sk_buff *skb)
>                 if (unlikely(len > icsk->icsk_ack.rcv_mss +
>                                    MAX_TCP_OPTION_SPACE))
>                         tcp_gro_dev_warn(sk, skb, len);
> +               /* If the skb has a len of exactly 1*MSS and has the PSH bit
> +                * set then it is likely the end of an application write. So
> +                * more data may not be arriving soon, and yet the data sender
> +                * may be waiting for an ACK if cwnd-bound or using TX zero
> +                * copy. So we set ICSK_ACK_PUSHED here so that
> +                * tcp_cleanup_rbuf() will send an ACK immediately if the app
> +                * reads all of the data and is not ping-pong. If len > MSS
> +                * then this logic does not matter (and does not hurt) because
> +                * tcp_cleanup_rbuf() will always ACK immediately if the app
> +                * reads data and there is more than an MSS of unACKed data.
> +                */
> +               if (TCP_SKB_CB(skb)->tcp_flags & TCPHDR_PSH)
> +                       icsk->icsk_ack.pending |= ICSK_ACK_PUSHED;
>         } else {
>                 /* Otherwise, we make more careful check taking into account,
>                  * that SACKs block is variable.
> --
> 2.42.0.515.g380fc7ccd1-goog
>
>

Neal Cardwell Sept. 28, 2023, 2:38 p.m. UTC | #2

On Thu, Sep 28, 2023, 4:53 AM Xin Guo <guoxin0309@gmail.com> wrote:
>
> Hi Neal,
> I cannot understand the phrases "if an app reads > 1*MSS data", "If an app
> reads < 1*MSS data" and "if an app reads exactly 1*MSS of data" in the
> commit message.
> In my view, they should read: "if an app reads and received data > 1*MSS",
> "If an app reads and received data < 1*MSS" and "if an app reads and
> received data exactly 1*MSS".

AFAICT your suggestion for tweaking the commit message - "if an app
reads and received" - would be redundant.  Our proposed phrase, "if an
app reads", is sufficient, because a read of a certain amount of data
automatically implies that the data has been received. That is, the
"and received" part is implied already. After all, how would an app
read data if it has not been received? :-)

best regards,
neal

xin.guo Sept. 28, 2023, 3:56 p.m. UTC | #3

Neal,
Thanks for your explanation.
1) When I read the patch, I could not understand "if an app reads >1*MSS data",
because in my view "the app reads" means the length of data copied
from sk_receive_queue to the user-space buffer
(in tcp_recvmsg_locked(), for example) when an app reads data from a socket.
But in "tp->rcv_nxt - tp->rcv_wup > icsk->icsk_ack.rcv_mss ||",
"tp->rcv_nxt - tp->rcv_wup" means the length of data the kernel has
received for the sk since the last ACK,
which is not always the length of data copied to the user-space buffer.

2) When we receive two small packets (<1*MSS) in the kernel for the
sk, the total length of the two packets may be > 1*MSS.

Regards
Guo Xin
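
Both points above can be checked with plain sequence arithmetic. A minimal sketch, assuming 32-bit wrapping sequence numbers; the helper name `seq_unacked` and the sample numbers are hypothetical and not taken from the kernel:

```c
#include <stdint.h>

/* Bytes received since rcv_wup was last recorded. Unsigned subtraction
 * keeps the result correct when the sequence space wraps.
 */
uint32_t seq_unacked(uint32_t rcv_nxt, uint32_t rcv_wup)
{
	return rcv_nxt - rcv_wup;
}
```

For example, with an MSS of 1460, two 800-byte packets received after rcv_wup = 1000 give seq_unacked(2600, 1000) == 1600. That exceeds 1*MSS even though each packet is smaller than 1*MSS, and it does not depend on how many bytes the app has copied out so far.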

Neal Cardwell Oct. 1, 2023, 3:19 p.m. UTC | #4

On Thu, Sep 28, 2023 at 11:56 AM Xin Guo <guoxin0309@gmail.com> wrote:
>
> Neal,
> Thanks for your explanation.
> 1) When I read the patch, I could not understand "if an app reads >1*MSS data",
> because in my view "the app reads" means the length of data copied
> from sk_receive_queue to the user-space buffer
> (in tcp_recvmsg_locked(), for example) when an app reads data from a socket.
> But in "tp->rcv_nxt - tp->rcv_wup > icsk->icsk_ack.rcv_mss ||",
> "tp->rcv_nxt - tp->rcv_wup" means the length of data the kernel has
> received for the sk since the last ACK,
> which is not always the length of data copied to the user-space buffer.
>
> 2) When we receive two small packets (<1*MSS) in the kernel for the
> sk, the total length of the two packets may be > 1*MSS.

Thanks for clarifying. Those are good points; the commit message could
and should be more precise when describing the existing logic in
tcp_cleanup_rbuf(). I have posted a v2 series with a more precise
commit message:
  https://patchwork.kernel.org/project/netdevbpf/patch/20231001151239.1866845-2-ncardwell.sw@gmail.com/

best regards,
neal

xin.guo Oct. 2, 2023, 5:33 a.m. UTC | #5

Thanks,
the commit message in v2 is very good.

Regards
Guo Xin
