From patchwork Sun Oct 1 15:12:38 2023

From: Neal Cardwell
To: David Miller, Jakub Kicinski, Eric Dumazet
Cc: netdev@vger.kernel.org, Neal Cardwell, Yuchung Cheng
Subject: [PATCH v2 net 1/2] tcp: fix quick-ack counting to count actual ACKs of new data
Date: Sun, 1 Oct 2023 11:12:38 -0400
Message-ID: <20231001151239.1866845-1-ncardwell.sw@gmail.com>
From: Neal Cardwell

This commit fixes quick-ack counting so that it only considers that a
quick-ack has been provided if we are sending an ACK that newly
acknowledges data.

The code was erroneously using the number of data segments in outgoing
skbs when deciding how many quick-ack credits to remove. This logic
does not make sense, and could cause poor performance in
request-response workloads, like RPC traffic, where requests or
responses can be multi-segment skbs.

When a TCP connection decides to send N quick-acks, that is to
accelerate the cwnd growth of the congestion control module
controlling the remote endpoint of the TCP connection. That quick-ack
decision is purely about the incoming data and outgoing ACKs. It has
nothing to do with the outgoing data or the size of outgoing data. And
in particular, an ACK only serves the intended purpose of allowing the
remote congestion control to grow the congestion window quickly if the
ACK is ACKing or SACKing new data.

The fix is simple: only count packets as serving the goal of the
quickack mechanism if they are ACKing/SACKing new data. We can tell
whether this is the case by checking inet_csk_ack_scheduled(), since
we schedule an ACK exactly when we are ACKing/SACKing new data.

Fixes: fc6415bcb0f5 ("[TCP]: Fix quick-ack decrementing with TSO.")
Signed-off-by: Neal Cardwell
Reviewed-by: Yuchung Cheng
Reviewed-by: Eric Dumazet
---
v2: No change in this patch.

 include/net/tcp.h     | 6 ++++--
 net/ipv4/tcp_output.c | 7 +++----
 2 files changed, 7 insertions(+), 6 deletions(-)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index 91688d0dadcd6..7b1a720691aec 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -348,12 +348,14 @@ ssize_t tcp_splice_read(struct socket *sk, loff_t *ppos,
 struct sk_buff *tcp_stream_alloc_skb(struct sock *sk, gfp_t gfp,
 				     bool force_schedule);
 
-static inline void tcp_dec_quickack_mode(struct sock *sk,
-					 const unsigned int pkts)
+static inline void tcp_dec_quickack_mode(struct sock *sk)
 {
 	struct inet_connection_sock *icsk = inet_csk(sk);
 
 	if (icsk->icsk_ack.quick) {
+		/* How many ACKs S/ACKing new data have we sent? */
+		const unsigned int pkts = inet_csk_ack_scheduled(sk) ? 1 : 0;
+
 		if (pkts >= icsk->icsk_ack.quick) {
 			icsk->icsk_ack.quick = 0;
 			/* Leaving quickack mode we deflate ATO. */
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index ccfc8bbf74558..aa0fc8c766e50 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -177,8 +177,7 @@ static void tcp_event_data_sent(struct tcp_sock *tp,
 }
 
 /* Account for an ACK we sent. */
-static inline void tcp_event_ack_sent(struct sock *sk, unsigned int pkts,
-				      u32 rcv_nxt)
+static inline void tcp_event_ack_sent(struct sock *sk, u32 rcv_nxt)
 {
 	struct tcp_sock *tp = tcp_sk(sk);
 
@@ -192,7 +191,7 @@ static inline void tcp_event_ack_sent(struct sock *sk, unsigned int pkts,
 
 	if (unlikely(rcv_nxt != tp->rcv_nxt))
 		return;  /* Special ACK sent by DCTCP to reflect ECN */
-	tcp_dec_quickack_mode(sk, pkts);
+	tcp_dec_quickack_mode(sk);
 	inet_csk_clear_xmit_timer(sk, ICSK_TIME_DACK);
 }
 
@@ -1387,7 +1386,7 @@ static int __tcp_transmit_skb(struct sock *sk, struct sk_buff *skb,
 			   sk, skb);
 
 	if (likely(tcb->tcp_flags & TCPHDR_ACK))
-		tcp_event_ack_sent(sk, tcp_skb_pcount(skb), rcv_nxt);
+		tcp_event_ack_sent(sk, rcv_nxt);
 
 	if (skb->len != tcp_header_size) {
 		tcp_event_data_sent(tp, sk);
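To make the accounting change concrete outside of the kernel, here is
a minimal standalone C sketch comparing the old and new credit
accounting. It is a model only: the names and parameters are
illustrative, not kernel APIs ("credits" models icsk->icsk_ack.quick,
"pcount" models tcp_skb_pcount(skb), and "ack_scheduled" models
inet_csk_ack_scheduled(sk)).

    #include <stdbool.h>

    /* Old accounting: every data segment in the outgoing skb spent a
     * quick-ack credit, so a 4-segment RPC response could drain 4
     * credits even if it ACKed no new data.
     */
    static unsigned int quickack_credits_old(unsigned int credits,
                                             unsigned int pcount)
    {
        return pcount >= credits ? 0 : credits - pcount;
    }

    /* New accounting: spend at most one credit, and only when the
     * outgoing packet actually ACKs/SACKs new data.
     */
    static unsigned int quickack_credits_new(unsigned int credits,
                                             bool ack_scheduled)
    {
        unsigned int pkts = ack_scheduled ? 1 : 0;

        return pkts >= credits ? 0 : credits - pkts;
    }

For example, with 8 quick-ack credits outstanding, a 4-segment
response that ACKs no new data previously left only 4 credits; with
the fix it leaves all 8, so the receiver keeps quick-acking incoming
data as intended.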
From patchwork Sun Oct 1 15:12:39 2023
From: Neal Cardwell
To: David Miller, Jakub Kicinski, Eric Dumazet
Cc: netdev@vger.kernel.org, Neal Cardwell, Yuchung Cheng, Xin Guo
Subject: [PATCH v2 net 2/2] tcp: fix delayed ACKs for MSS boundary condition
Date: Sun, 1 Oct 2023 11:12:39 -0400
Message-ID: <20231001151239.1866845-2-ncardwell.sw@gmail.com>
In-Reply-To: <20231001151239.1866845-1-ncardwell.sw@gmail.com>
References: <20231001151239.1866845-1-ncardwell.sw@gmail.com>

From: Neal Cardwell

This commit fixes poor delayed ACK behavior that can cause poor TCP
latency in a particular boundary condition: when an application makes
a TCP socket write that is an exact multiple of the MSS size.

The problem is that there is a painful boundary discontinuity in the
current delayed ACK behavior. With the current delayed ACK behavior,
we have:

(1) If an app reads data when > 1*MSS is unacknowledged, then
    tcp_cleanup_rbuf() ACKs immediately because of:

     tp->rcv_nxt - tp->rcv_wup > icsk->icsk_ack.rcv_mss ||

(2) If an app reads all received data, and the packets were < 1*MSS,
    and either (a) the app is not ping-pong or (b) we received two
    packets < 1*MSS, then tcp_cleanup_rbuf() ACKs immediately because
    of:

     ((icsk->icsk_ack.pending & ICSK_ACK_PUSHED2) ||
      ((icsk->icsk_ack.pending & ICSK_ACK_PUSHED) &&
       !inet_csk_in_pingpong_mode(sk))) &&

(3) *However*: if an app reads exactly 1*MSS of data,
    tcp_cleanup_rbuf() does not send an immediate ACK. This is true
    even if the app is not ping-pong and the 1*MSS of data had the PSH
    bit set, suggesting the sending application completed an
    application write.

Thus if the app is not ping-pong, we have this painful case where
>1*MSS gets an immediate ACK, and <1*MSS gets an immediate ACK, but a
write whose last skb is an exact multiple of 1*MSS can get a 40ms
delayed ACK. This means that any app that transfers data in one
direction and takes care to align write size or packet size with MSS
can suffer this problem. With receive zero copy making 4KB MSS values
more common, it is becoming more common to have application writes
naturally align with MSS, and more applications are likely to
encounter this delayed ACK problem.
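The decision logic above can be modeled compactly. The following is a
standalone C sketch, not kernel code: "unacked" models
tp->rcv_nxt - tp->rcv_wup, and the boolean flags model the
ICSK_ACK_PUSHED/ICSK_ACK_PUSHED2 bits and
inet_csk_in_pingpong_mode(sk), assuming the app has just read all
received data.

    #include <stdbool.h>

    /* Returns true if tcp_cleanup_rbuf() would ACK immediately. */
    static bool ack_now(unsigned int unacked, unsigned int rcv_mss,
                        bool pushed, bool pushed2, bool pingpong)
    {
        if (unacked > rcv_mss)                  /* case (1) */
            return true;
        if (pushed2 || (pushed && !pingpong))   /* case (2) */
            return true;
        return false;   /* case (3): exactly 1*MSS falls through,
                         * so the ACK can be delayed ~40ms */
    }

The hole is visible directly: ack_now(rcv_mss, rcv_mss, false, false,
false) returns false, while any unacked value above rcv_mss returns
true.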
The fix in this commit is to refine the delayed ACK heuristics with a
simple check: immediately ACK a received 1*MSS skb with the PSH bit
set if the app reads all data. Why? If an skb has a len of exactly
1*MSS and has the PSH bit set then it is likely the end of an
application write. So more data may not be arriving soon, and yet the
data sender may be waiting for an ACK if cwnd-bound or using TX zero
copy. Thus we set ICSK_ACK_PUSHED in this case so that
tcp_cleanup_rbuf() will send an ACK immediately if the app reads all
of the data and is not ping-pong.

Note that this logic is also executed for the case where len > MSS,
but in that case this logic does not matter (and does not hurt)
because tcp_cleanup_rbuf() will always ACK immediately if the app
reads data and there is more than an MSS of unACKed data.

Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Signed-off-by: Neal Cardwell
Reviewed-by: Yuchung Cheng
Reviewed-by: Eric Dumazet
Cc: Xin Guo
---
v2: Fix the details of the description in the commit message of the
    current tcp_cleanup_rbuf() logic, thanks to Xin Guo.

 net/ipv4/tcp_input.c | 13 +++++++++++++
 1 file changed, 13 insertions(+)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 06fe1cf645d5a..8afb0950a6979 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -253,6 +253,19 @@ static void tcp_measure_rcv_mss(struct sock *sk, const struct sk_buff *skb)
 		if (unlikely(len > icsk->icsk_ack.rcv_mss +
 			     MAX_TCP_OPTION_SPACE))
 			tcp_gro_dev_warn(sk, skb, len);
+		/* If the skb has a len of exactly 1*MSS and has the PSH bit
+		 * set then it is likely the end of an application write. So
+		 * more data may not be arriving soon, and yet the data sender
+		 * may be waiting for an ACK if cwnd-bound or using TX zero
+		 * copy. So we set ICSK_ACK_PUSHED here so that
+		 * tcp_cleanup_rbuf() will send an ACK immediately if the app
+		 * reads all of the data and is not ping-pong. If len > MSS
+		 * then this logic does not matter (and does not hurt) because
+		 * tcp_cleanup_rbuf() will always ACK immediately if the app
+		 * reads data and there is more than an MSS of unACKed data.
+		 */
+		if (TCP_SKB_CB(skb)->tcp_flags & TCPHDR_PSH)
+			icsk->icsk_ack.pending |= ICSK_ACK_PUSHED;
 	} else {
 		/* Otherwise, we make more careful check taking into account,
 		 * that SACKs block is variable.
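To illustrate the end-to-end effect, here is a small self-contained C
sketch under the same modeling assumptions as the sketch earlier in
this message (illustrative names, not kernel APIs). It shows that an
exact-1*MSS read that previously fell into the delayed-ACK hole is
ACKed immediately once the PSH bit on the 1*MSS skb sets the pushed
hint, as this patch does.

    #include <stdbool.h>
    #include <stdio.h>

    /* Model of the tcp_cleanup_rbuf() decision for a non-ping-pong
     * app that has read all data; "pushed" models ICSK_ACK_PUSHED.
     */
    static bool ack_now(unsigned int unacked, unsigned int rcv_mss,
                        bool pushed)
    {
        return unacked > rcv_mss || pushed;
    }

    int main(void)
    {
        unsigned int rcv_mss = 4096;  /* e.g. 4KB MSS with RX zero copy */

        /* App reads exactly 1*MSS; the skb carried the PSH bit. */
        bool before = ack_now(rcv_mss, rcv_mss, false); /* old: hint never set */
        bool after  = ack_now(rcv_mss, rcv_mss, true);  /* new: PSH sets hint */

        printf("before fix: %s\n", before ? "ACK now" : "delay up to 40ms");
        printf("after fix:  %s\n", after  ? "ACK now" : "delay up to 40ms");
        return 0;
    }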