diff mbox series

[net-next,v2] tcp: make TCP Fast Open retransmission ignore Tx status

Message ID 20210302175259.971778-1-kuba@kernel.org (mailing list archive)
State Changes Requested
Delegated to: Netdev Maintainers
Headers show
Series [net-next,v2] tcp: make TCP Fast Open retransmission ignore Tx status | expand

Checks

Context Check Description
netdev/cover_letter success Link
netdev/fixes_present success Link
netdev/patch_count success Link
netdev/tree_selection success Clearly marked for net-next
netdev/subject_prefix success Link
netdev/cc_maintainers warning 2 maintainers not CCed: yoshfuji@linux-ipv6.org dsahern@kernel.org
netdev/source_inline success Was 0 now: 0
netdev/verify_signedoff success Link
netdev/module_param success Was 0 now: 0
netdev/build_32bit success Errors and warnings before: 2 this patch: 2
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/verify_fixes success Link
netdev/checkpatch success total: 0 errors, 0 warnings, 0 checks, 24 lines checked
netdev/build_allmodconfig_warn success Errors and warnings before: 2 this patch: 2
netdev/header_inline success Link
netdev/stable success Stable not CCed

Commit Message

Jakub Kicinski March 2, 2021, 5:52 p.m. UTC
When receiver does not accept TCP Fast Open it will only ack
the SYN, and not the data. We detect this and immediately queue
the data for (re)transmission in tcp_rcv_fastopen_synack().

In DC networks with very low RTT and without RFS the SYN-ACK
may arrive before NIC driver reported Tx completion on
the original SYN. In which case skb_still_in_host_queue()
returns true and sender will need to wait for the retransmission
timer to fire milliseconds later.

Work around this issue by passing negative segment count to
__tcp_retransmit_skb() as suggested by Eric.

The condition triggers more often when Tx coalescing is configured
higher than Rx coalescing on the underlying NIC, but it does happen
even with relatively moderate and even settings (e.g. 33us).

Note that DC machines usually run configured to always accept
TCP FastOpen data so the problem may not be very common.

Suggested-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Neil Spring <ntspring@fb.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
---
 net/ipv4/tcp_input.c  | 3 ++-
 net/ipv4/tcp_output.c | 8 ++++++--
 2 files changed, 8 insertions(+), 3 deletions(-)

Comments

Eric Dumazet March 2, 2021, 6:01 p.m. UTC | #1
On Tue, Mar 2, 2021 at 7:00 PM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Tue,  2 Mar 2021 09:52:59 -0800 Jakub Kicinski wrote:
> > When receiver does not accept TCP Fast Open it will only ack
> > the SYN, and not the data. We detect this and immediately queue
> > the data for (re)transmission in tcp_rcv_fastopen_synack().
> >
> > In DC networks with very low RTT and without RFS the SYN-ACK
> > may arrive before NIC driver reported Tx completion on
> > the original SYN. In which case skb_still_in_host_queue()
> > returns true and sender will need to wait for the retransmission
> > timer to fire milliseconds later.
> >
> > Work around this issue by passing negative segment count to
> > __tcp_retransmit_skb() as suggested by Eric.
> >
> > The condition triggers more often when Tx coalescing is configured
> > higher than Rx coalescing on the underlying NIC, but it does happen
> > even with relatively moderate and even settings (e.g. 33us).
> >
> > Note that DC machines usually run configured to always accept
> > TCP FastOpen data so the problem may not be very common.
> >
> > Suggested-by: Eric Dumazet <edumazet@google.com>
> > Signed-off-by: Neil Spring <ntspring@fb.com>
> > Signed-off-by: Jakub Kicinski <kuba@kernel.org>
>
> .. and now I realized net-next is closed. I'll keep an eye on patchwork
> and resend as needed, sorry.

Ah, just open it ;)
diff mbox series

Patch

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 69a545db80d2..fb453a4799be 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -5994,8 +5994,9 @@  static bool tcp_rcv_fastopen_synack(struct sock *sk, struct sk_buff *synack,
 			tp->fastopen_client_fail = TFO_SYN_RETRANSMITTED;
 		else
 			tp->fastopen_client_fail = TFO_DATA_NOT_ACKED;
+		/* segs = -1 to bypass skb_still_in_host_queue() check */
 		skb_rbtree_walk_from(data) {
-			if (__tcp_retransmit_skb(sk, data, 1))
+			if (__tcp_retransmit_skb(sk, data, -1))
 				break;
 		}
 		tcp_rearm_rto(sk);
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index fbf140a770d8..1d1489e59697 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -3155,8 +3155,12 @@  int __tcp_retransmit_skb(struct sock *sk, struct sk_buff *skb, int segs)
 		  sk->sk_sndbuf))
 		return -EAGAIN;
 
-	if (skb_still_in_host_queue(sk, skb))
-		return -EBUSY;
+	if (segs > 0) {
+		if (skb_still_in_host_queue(sk, skb))
+			return -EBUSY;
+	} else {
+		segs = -segs;
+	}
 
 	if (before(TCP_SKB_CB(skb)->seq, tp->snd_una)) {
 		if (unlikely(before(TCP_SKB_CB(skb)->end_seq, tp->snd_una))) {