From patchwork Wed Sep 27 15:15:00 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Neal Cardwell X-Patchwork-Id: 13400995 X-Patchwork-Delegate: kuba@kernel.org Received: from lindbergh.monkeyblade.net (lindbergh.monkeyblade.net [23.128.96.19]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id AE9621170F for ; Wed, 27 Sep 2023 15:15:08 +0000 (UTC) Received: from mail-qk1-x730.google.com (mail-qk1-x730.google.com [IPv6:2607:f8b0:4864:20::730]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id A1852F4 for ; Wed, 27 Sep 2023 08:15:06 -0700 (PDT) Received: by mail-qk1-x730.google.com with SMTP id af79cd13be357-775751c35d4so84464785a.0 for ; Wed, 27 Sep 2023 08:15:06 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20230601; t=1695827706; x=1696432506; darn=vger.kernel.org; h=content-transfer-encoding:mime-version:message-id:date:subject:cc :to:from:from:to:cc:subject:date:message-id:reply-to; bh=sD4WKXBQaVFamFapX0kob4ZJEemipK0YXHyWxjLGdU4=; b=IK04HQR9nPzzL0o2BZWm6IgetezMJ/wWi1qQu5Rda28BcXHpQ9LP38rmEkcnjIc+an nappOqCK3c/cB1EFMuIGJnAZNpZr/r6yLdZFSX1nCPdSjt3uONtFDCeiJRlNUAp//Rh6 PMBlPVnfg9QfUvhC4SrBDNdAh0NV14drp74H2JqlyLbLUMwYZHhv1GghnSod7zsAr46c WYdiu+FVh1y9sXAOS9epD15ucUJQqgt8cRJS64L5g2Uss6ATiFi1muqqqwLkWQQnjrM1 NnG05U9RijJa70W0+XZjCv7ajnSwMD7gbgdnM3/IO0OhVDyJuEqsMIZXnSTd0gdXYgoA EDsg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1695827706; x=1696432506; h=content-transfer-encoding:mime-version:message-id:date:subject:cc :to:from:x-gm-message-state:from:to:cc:subject:date:message-id :reply-to; bh=sD4WKXBQaVFamFapX0kob4ZJEemipK0YXHyWxjLGdU4=; b=BLx7hFTn52mJPAZV8dzZWxQszPCPD5ROeXqtE7wVe/j6QFVr1CPVyiD1WKS3LthNx3 WqzI6T7C/9lIUcBHBtUguv8VS+9bvnUButmvwGOCIk2dd8cDGqFwwykuSeT/9qJmkOvJ gN8xYOIV/oN/kVTmG7JY3T2VRpan6ITtFWpjzdtD5dHpGuV0BQe0/a8qw7SpnnxuAdUK YZj5DE5ZbXnPHUBqD3vsigNanHNy7bGUVjH8q38eBhZ4btQ6uX97ZjxmSB6zrDBq3uPE QfrpJcRL1sfdlkuleF21+ESSkaWzpshJx5RKZgj8K2QJ1203kO4pz3/167Mz7949g3T7 dfHA== X-Gm-Message-State: AOJu0YwDPXMW0+eVBISSsbfn1R5ndbu9dmbmZq5W583ERbReU54eALQu wT9mHG1RtF6srhWf1VKqEgz9svqLbC2yVg== X-Google-Smtp-Source: AGHT+IEpE+3lGHPW3t6feYkUFsQRq8OJ/mPNANGy1A6+PE7yfB0s3EBd/FxEkcS8hLN2j+hGy3pmXQ== X-Received: by 2002:a05:620a:f0b:b0:76e:f3a3:4ac6 with SMTP id v11-20020a05620a0f0b00b0076ef3a34ac6mr2079595qkl.35.1695827705627; Wed, 27 Sep 2023 08:15:05 -0700 (PDT) Received: from soy.nyc.corp.google.com ([2620:0:1003:416:3cb2:ba69:4389:2a97]) by smtp.gmail.com with ESMTPSA id r25-20020a05620a03d900b00767e2668536sm5533069qkm.17.2023.09.27.08.15.04 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Wed, 27 Sep 2023 08:15:05 -0700 (PDT) From: Neal Cardwell To: David Miller , Jakub Kicinski , Eric Dumazet Cc: netdev@vger.kernel.org, Neal Cardwell , Yuchung Cheng Subject: [PATCH net 1/2] tcp: fix quick-ack counting to count actual ACKs of new data Date: Wed, 27 Sep 2023 11:15:00 -0400 Message-ID: <20230927151501.1549078-1-ncardwell.sw@gmail.com> X-Mailer: git-send-email 2.42.0.515.g380fc7ccd1-goog Precedence: bulk X-Mailing-List: netdev@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 X-Spam-Status: No, score=-2.1 required=5.0 tests=BAYES_00,DKIM_SIGNED, DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,FREEMAIL_FROM, RCVD_IN_DNSWL_BLOCKED,SPF_HELO_NONE,SPF_PASS autolearn=ham autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on lindbergh.monkeyblade.net X-Patchwork-Delegate: kuba@kernel.org From: Neal Cardwell This commit fixes quick-ack counting so that it only considers that a quick-ack has been provided if we are sending an ACK that newly acknowledges data. The code was erroneously using the number of data segments in outgoing skbs when deciding how many quick-ack credits to remove. This logic does not make sense, and could cause poor performance in request-response workloads, like RPC traffic, where requests or responses can be multi-segment skbs. When a TCP connection decides to send N quick-acks, that is to accelerate the cwnd growth of the congestion control module controlling the remote endpoint of the TCP connection. That quick-ack decision is purely about the incoming data and outgoing ACKs. It has nothing to do with the outgoing data or the size of outgoing data. And in particular, an ACK only serves the intended purpose of allowing the remote congestion control to grow the congestion window quickly if the ACK is ACKing or SACKing new data. The fix is simple: only count packets as serving the goal of the quickack mechanism if they are ACKing/SACKing new data. We can tell whether this is the case by checking inet_csk_ack_scheduled(), since we schedule an ACK exactly when we are ACKing/SACKing new data. Fixes: fc6415bcb0f5 ("[TCP]: Fix quick-ack decrementing with TSO.") Signed-off-by: Neal Cardwell Reviewed-by: Yuchung Cheng Reviewed-by: Eric Dumazet --- include/net/tcp.h | 6 ++++-- net/ipv4/tcp_output.c | 7 +++---- 2 files changed, 7 insertions(+), 6 deletions(-) diff --git a/include/net/tcp.h b/include/net/tcp.h index 91688d0dadcd6..7b1a720691aec 100644 --- a/include/net/tcp.h +++ b/include/net/tcp.h @@ -348,12 +348,14 @@ ssize_t tcp_splice_read(struct socket *sk, loff_t *ppos, struct sk_buff *tcp_stream_alloc_skb(struct sock *sk, gfp_t gfp, bool force_schedule); -static inline void tcp_dec_quickack_mode(struct sock *sk, - const unsigned int pkts) +static inline void tcp_dec_quickack_mode(struct sock *sk) { struct inet_connection_sock *icsk = inet_csk(sk); if (icsk->icsk_ack.quick) { + /* How many ACKs S/ACKing new data have we sent? */ + const unsigned int pkts = inet_csk_ack_scheduled(sk) ? 1 : 0; + if (pkts >= icsk->icsk_ack.quick) { icsk->icsk_ack.quick = 0; /* Leaving quickack mode we deflate ATO. */ diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c index ccfc8bbf74558..aa0fc8c766e50 100644 --- a/net/ipv4/tcp_output.c +++ b/net/ipv4/tcp_output.c @@ -177,8 +177,7 @@ static void tcp_event_data_sent(struct tcp_sock *tp, } /* Account for an ACK we sent. */ -static inline void tcp_event_ack_sent(struct sock *sk, unsigned int pkts, - u32 rcv_nxt) +static inline void tcp_event_ack_sent(struct sock *sk, u32 rcv_nxt) { struct tcp_sock *tp = tcp_sk(sk); @@ -192,7 +191,7 @@ static inline void tcp_event_ack_sent(struct sock *sk, unsigned int pkts, if (unlikely(rcv_nxt != tp->rcv_nxt)) return; /* Special ACK sent by DCTCP to reflect ECN */ - tcp_dec_quickack_mode(sk, pkts); + tcp_dec_quickack_mode(sk); inet_csk_clear_xmit_timer(sk, ICSK_TIME_DACK); } @@ -1387,7 +1386,7 @@ static int __tcp_transmit_skb(struct sock *sk, struct sk_buff *skb, sk, skb); if (likely(tcb->tcp_flags & TCPHDR_ACK)) - tcp_event_ack_sent(sk, tcp_skb_pcount(skb), rcv_nxt); + tcp_event_ack_sent(sk, rcv_nxt); if (skb->len != tcp_header_size) { tcp_event_data_sent(tp, sk); From patchwork Wed Sep 27 15:15:01 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Neal Cardwell X-Patchwork-Id: 13400996 X-Patchwork-Delegate: kuba@kernel.org Received: from lindbergh.monkeyblade.net (lindbergh.monkeyblade.net [23.128.96.19]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 1541F38FB6 for ; Wed, 27 Sep 2023 15:15:08 +0000 (UTC) Received: from mail-qk1-x730.google.com (mail-qk1-x730.google.com [IPv6:2607:f8b0:4864:20::730]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 40439F5 for ; Wed, 27 Sep 2023 08:15:07 -0700 (PDT) Received: by mail-qk1-x730.google.com with SMTP id af79cd13be357-77574c6cab0so94307485a.3 for ; Wed, 27 Sep 2023 08:15:07 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20230601; t=1695827706; x=1696432506; darn=vger.kernel.org; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:from:to:cc:subject:date :message-id:reply-to; bh=ORIn1A7s/0tWSsc9QSeckLC4AflhU8YOvNqZo+tbNd0=; b=KvKy8dVq/fzvp+MImLwVnmPRR4S9mcnDotWa54pgLhl/CBF9ROSaO2zLMR9HIq6RIB mStBpzPBVX7De/nITI6bBTSU5YVEafZ2pUhrSTHPB6DpiwChTVkX44njzGgFUF3bwsUh v4heq2mMYyRo/mcbu4EuSLBMx/X3SCnunh9oIlVB+7am1vtJBTriTfPSIn4h+n54c04d izXNcwBC72hVwLRMRtyZbg1kmYwK3s+SWAOQVlFmq00os9je3rcd4hVxyXKL8bzaJhtG h661wa7zLsS3feA1Qaz2GyguUXvoPCswyHppqtS+XV0hkCcKZC54uqiL7WtHaAlIHsnw IwXA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1695827706; x=1696432506; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=ORIn1A7s/0tWSsc9QSeckLC4AflhU8YOvNqZo+tbNd0=; b=wSV+9yvZUgOxum+aX4VX5oyU9Mm0eu7SLs/SfQWe/Jhl8iw946DM/slF/KGn28Cj3Z aXGpYgp1V2EF/z8TCCfhevGJs+6AS3mwpohE64/af6jUITsKLuuqPBCx1M12TIdaw1RW e148yMLqDQchTrIuzf8HQBrTRofDFyuwUKdMPUCE7ZTnYIxNtaKX9KcFDe1q+1oT9nU1 gQPYoOiMSeMUL3qBA2gfVJYUZFUrghiHvbwRrqqjjiIb7teUXKvYzg2J6t+zRVEBpgGn 4H8JFn1CuD7MT7TQMB4cCneeV3jWJjZcUnorCxwi5b0k09eoqgg4yZODjxPxUnj8gn9B LEDg== X-Gm-Message-State: AOJu0YweDQJ6t+fMB2TVgaf82ZzM+CEmyva8bdBUurJ0eETXetNe0ZFP MqRF4uywilRh4x4/zDYmWQo= X-Google-Smtp-Source: AGHT+IFfW98Wrz1BIDURkn1XB/nO0ZKUTNCJ5e7ptWA4YnKtWYCZB7CXFPWK9648tBAd8xyVRnA2yw== X-Received: by 2002:a05:620a:2456:b0:76f:19fd:5063 with SMTP id h22-20020a05620a245600b0076f19fd5063mr2208738qkn.34.1695827706230; Wed, 27 Sep 2023 08:15:06 -0700 (PDT) Received: from soy.nyc.corp.google.com ([2620:0:1003:416:3cb2:ba69:4389:2a97]) by smtp.gmail.com with ESMTPSA id r25-20020a05620a03d900b00767e2668536sm5533069qkm.17.2023.09.27.08.15.05 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Wed, 27 Sep 2023 08:15:05 -0700 (PDT) From: Neal Cardwell To: David Miller , Jakub Kicinski , Eric Dumazet Cc: netdev@vger.kernel.org, Neal Cardwell , Yuchung Cheng Subject: [PATCH net 2/2] tcp: fix delayed ACKs for MSS boundary condition Date: Wed, 27 Sep 2023 11:15:01 -0400 Message-ID: <20230927151501.1549078-2-ncardwell.sw@gmail.com> X-Mailer: git-send-email 2.42.0.515.g380fc7ccd1-goog In-Reply-To: <20230927151501.1549078-1-ncardwell.sw@gmail.com> References: <20230927151501.1549078-1-ncardwell.sw@gmail.com> Precedence: bulk X-Mailing-List: netdev@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 X-Spam-Status: No, score=-2.1 required=5.0 tests=BAYES_00,DKIM_SIGNED, DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,FREEMAIL_FROM, RCVD_IN_DNSWL_BLOCKED,SPF_HELO_NONE,SPF_PASS autolearn=ham autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on lindbergh.monkeyblade.net X-Patchwork-Delegate: kuba@kernel.org From: Neal Cardwell This commit fixes poor delayed ACK behavior that can cause poor TCP latency in a particular boundary condition: when an application makes a TCP socket write that is an exact multiple of the MSS size. The problem is that there is painful boundary discontinuity in the current delayed ACK behavior. With the current delayed ACK behavior, we have: (1) If an app reads > 1*MSS data, tcp_cleanup_rbuf() ACKs immediately because of: tp->rcv_nxt - tp->rcv_wup > icsk->icsk_ack.rcv_mss || (2) If an app reads < 1*MSS data and either (a) app is not ping-pong or (b) we received two packets <1*MSS, then tcp_cleanup_rbuf() ACKs immediately beecause of: ((icsk->icsk_ack.pending & ICSK_ACK_PUSHED2) || ((icsk->icsk_ack.pending & ICSK_ACK_PUSHED) && !inet_csk_in_pingpong_mode(sk))) && (3) *However*: if an app reads exactly 1*MSS of data, tcp_cleanup_rbuf() does not send an immediate ACK. This is true even if the app is not ping-pong and the 1*MSS of data had the PSH bit set, suggesting the sending application completed an application write. Thus if the app is not ping-pong, we have this painful case where >1*MSS gets an immediate ACK, and <1*MSS gets an immediate ACK, but a write whose last skb is an exact multiple of 1*MSS can get a 40ms delayed ACK. This means that any app that transfers data in one direction and takes care to align write size or packet size with MSS can suffer this problem. With receive zero copy making 4KB MSS values more common, it is becoming more common to have application writes naturally align with MSS, and more applications are likely to encounter this delayed ACK problem. The fix in this commit is to refine the delayed ACK heuristics with a simple check: immediately ACK a received 1*MSS skb with PSH bit set if the app reads all data. Why? If an skb has a len of exactly 1*MSS and has the PSH bit set then it is likely the end of an application write. So more data may not be arriving soon, and yet the data sender may be waiting for an ACK if cwnd-bound or using TX zero copy. Thus we set ICSK_ACK_PUSHED in this case so that tcp_cleanup_rbuf() will send an ACK immediately if the app reads all of the data and is not ping-pong. Note that this logic is also executed for the case where len > MSS, but in that case this logic does not matter (and does not hurt) because tcp_cleanup_rbuf() will always ACK immediately if the app reads data and there is more than an MSS of unACKed data. Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Signed-off-by: Neal Cardwell Reviewed-by: Yuchung Cheng Reviewed-by: Eric Dumazet --- net/ipv4/tcp_input.c | 13 +++++++++++++ 1 file changed, 13 insertions(+) diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c index 06fe1cf645d5a..8afb0950a6979 100644 --- a/net/ipv4/tcp_input.c +++ b/net/ipv4/tcp_input.c @@ -253,6 +253,19 @@ static void tcp_measure_rcv_mss(struct sock *sk, const struct sk_buff *skb) if (unlikely(len > icsk->icsk_ack.rcv_mss + MAX_TCP_OPTION_SPACE)) tcp_gro_dev_warn(sk, skb, len); + /* If the skb has a len of exactly 1*MSS and has the PSH bit + * set then it is likely the end of an application write. So + * more data may not be arriving soon, and yet the data sender + * may be waiting for an ACK if cwnd-bound or using TX zero + * copy. So we set ICSK_ACK_PUSHED here so that + * tcp_cleanup_rbuf() will send an ACK immediately if the app + * reads all of the data and is not ping-pong. If len > MSS + * then this logic does not matter (and does not hurt) because + * tcp_cleanup_rbuf() will always ACK immediately if the app + * reads data and there is more than an MSS of unACKed data. + */ + if (TCP_SKB_CB(skb)->tcp_flags & TCPHDR_PSH) + icsk->icsk_ack.pending |= ICSK_ACK_PUSHED; } else { /* Otherwise, we make more careful check taking into account, * that SACKs block is variable.