Message ID | 20240718-upstream-net-next-20240716-tcp-3rd-ack-consume-sk_socket-v2-1-d653f85639f6@kernel.org (mailing list archive) |
---|---|
State | Changes Requested |
Delegated to: | Netdev Maintainers |
Headers | show |
Series | tcp: restrict crossed SYN specific actions to SYN-ACK | expand |
On Thu, Jul 18, 2024 at 12:34 PM Matthieu Baerts (NGI0) <matttbe@kernel.org> wrote: > > The 'Fixes' commit recently changed the behaviour of TCP by skipping the > processing of the 3rd ACK when a sk->sk_socket is set. The goal was to > skip tcp_ack_snd_check() in tcp_rcv_state_process() not to send an > unnecessary ACK in case of simultaneous connect(). Unfortunately, that > had an impact on TFO and MPTCP. > > I started to look at the impact on MPTCP, because the MPTCP CI found > some issues with the MPTCP Packetdrill tests [1]. Then Paolo suggested > me to look at the impact on TFO with "plain" TCP. > > For MPTCP, when receiving the 3rd ACK of a request adding a new path > (MP_JOIN), sk->sk_socket will be set, and point to the MPTCP sock that > has been created when the MPTCP connection got established before with > the first path. The newly added 'goto' will then skip the processing of > the segment text (step 7) and not go through tcp_data_queue() where the > MPTCP options are validated, and some actions are triggered, e.g. > sending the MPJ 4th ACK [2] as demonstrated by the new errors when > running a packetdrill test [3] establishing a second subflow. > > This doesn't fully break MPTCP, mainly the 4th MPJ ACK that will be > delayed. Still, we don't want to have this behaviour as it delays the > switch to the fully established mode, and invalid MPTCP options in this > 3rd ACK will not be caught any more. This modification also affects the > MPTCP + TFO feature as well, and being the reason why the selftests > started to be unstable the last few days [4]. > > For TFO, the existing 'basic-cookie-not-reqd' test [5] was no longer > passing: if the 3rd ACK contains data, and the connection is accept()ed > before receiving them, these data would no longer be processed, and thus > not ACKed. > > One last thing about MPTCP, in case of simultaneous connect(), a > fallback to TCP will be done, which seems fine: > > `../common/defaults.sh` > > 0 socket(..., SOCK_STREAM|SOCK_NONBLOCK, IPPROTO_MPTCP) = 3 > +0 connect(3, ..., ...) = -1 EINPROGRESS (Operation now in progress) > > +0 > S 0:0(0) <mss 1460, sackOK, TS val 100 ecr 0, nop, wscale 8, mpcapable v1 flags[flag_h] nokey> > +0 < S 0:0(0) win 1000 <mss 1460, sackOK, TS val 407 ecr 0, nop, wscale 8, mpcapable v1 flags[flag_h] nokey> > +0 > S. 0:0(0) ack 1 <mss 1460, sackOK, TS val 330 ecr 0, nop, wscale 8, mpcapable v1 flags[flag_h] nokey> > +0 < S. 0:0(0) ack 1 win 65535 <mss 1460, sackOK, TS val 700 ecr 100, nop, wscale 8, mpcapable v1 flags[flag_h] key[skey=2]> > > +0 write(3, ..., 100) = 100 > +0 > . 1:1(0) ack 1 <nop, nop, TS val 845707014 ecr 700, nop, nop, sack 0:1> > +0 > P. 1:101(100) ack 1 <nop, nop, TS val 845958933 ecr 700> > > Simultaneous SYN-data crossing is also not supported by TFO, see [6]. > > Link: https://github.com/multipath-tcp/mptcp_net-next/actions/runs/9936227696 [1] > Link: https://datatracker.ietf.org/doc/html/rfc8684#fig_tokens [2] > Link: https://github.com/multipath-tcp/packetdrill/blob/mptcp-net-next/gtests/net/mptcp/syscalls/accept.pkt#L28 [3] > Link: https://netdev.bots.linux.dev/contest.html?executor=vmksft-mptcp-dbg&test=mptcp-connect-sh [4] > Link: https://github.com/google/packetdrill/blob/master/gtests/net/tcp/fastopen/server/basic-cookie-not-reqd.pkt#L21 [5] > Link: https://github.com/google/packetdrill/blob/master/gtests/net/tcp/fastopen/client/simultaneous-fast-open.pkt [6] > Fixes: 23e89e8ee7be ("tcp: Don't drop SYN+ACK for simultaneous connect().") > Suggested-by: Paolo Abeni <pabeni@redhat.com> > Suggested-by: Kuniyuki Iwashima <kuniyu@amazon.com> > Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> > --- > Notes: > - We could also drop this 'goto consume', and send the unnecessary ACK > in this simultaneous connect case, which doesn't seem to be a "real" > case, more something for fuzzers. But that's not what the RFC 9293 > recommends to do. > - v2: > - Check if the SYN bit is set instead of looking for TFO and MPTCP > specific attributes, as suggested by Kuniyuki. > - Updated the comment above > - Please note that the v2 has been sent mainly to satisfy the CI (to > be able to catch new bugs with MPTCP), and because the suggestion > from Kuniyuki looks better. It has not been sent to urge TCP > maintainers to review it quicker than it should, please take your > time and enjoy netdev.conf :) > --- > net/ipv4/tcp_input.c | 7 ++++++- > 1 file changed, 6 insertions(+), 1 deletion(-) > > diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c > index ff9ab3d01ced..bfe1bc69dc3e 100644 > --- a/net/ipv4/tcp_input.c > +++ b/net/ipv4/tcp_input.c > @@ -6820,7 +6820,12 @@ tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb) > if (sk->sk_shutdown & SEND_SHUTDOWN) > tcp_shutdown(sk, SEND_SHUTDOWN); > > - if (sk->sk_socket) > + /* For crossed SYN cases, not to send an unnecessary ACK. > + * Note that sk->sk_socket can be assigned in other cases, e.g. > + * with TFO (if accept()'ed before the 3rd ACK) and MPTCP (MPJ: > + * sk_socket is the parent MPTCP sock). > + */ > + if (sk->sk_socket && th->syn) > goto consume; I think we should simply remove this part completely, because we should send an ack anyway. diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c index ff9ab3d01ced89570903d3a9f649a637c5e07a90..91357d4713182078debd746a224046cba80ea3ce 100644 --- a/net/ipv4/tcp_input.c +++ b/net/ipv4/tcp_input.c @@ -6820,8 +6820,6 @@ tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb) if (sk->sk_shutdown & SEND_SHUTDOWN) tcp_shutdown(sk, SEND_SHUTDOWN); - if (sk->sk_socket) - goto consume; break; case TCP_FIN_WAIT1: { I have a failing packetdrill test after Kuniyuki patch : // // Test the simultaneous open scenario that both end sends // SYN/data. Although we don't support that the connection should // still be established. // `../../common/defaults.sh ../../common/set_sysctls.py /proc/sys/net/ipv4/tcp_timestamps=0` // Cache warmup: send a Fast Open cookie request 0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3 +0 fcntl(3, F_SETFL, O_RDWR|O_NONBLOCK) = 0 +0 sendto(3, ..., 0, MSG_FASTOPEN, ..., ...) = -1 EINPROGRESS (Operation is now in progress) +0 > S 0:0(0) <mss 1460,nop,nop,sackOK,nop,wscale 8,FO,nop,nop> +.01 < S. 123:123(0) ack 1 win 14600 <mss 1460,nop,nop,sackOK,nop,wscale 6,FO abcd1234,nop,nop> +0 > . 1:1(0) ack 1 +.01 close(3) = 0 +0 > F. 1:1(0) ack 1 +.01 < F. 1:1(0) ack 2 win 92 +0 > . 2:2(0) ack 2 // // Test: simulatenous fast open // +.01 socket(..., SOCK_STREAM, IPPROTO_TCP) = 4 +0 fcntl(4, F_SETFL, O_RDWR|O_NONBLOCK) = 0 +0 sendto(4, ..., 1000, MSG_FASTOPEN, ..., ...) = 1000 +0 > S 0:1000(1000) <mss 1460,nop,nop,sackOK,nop,wscale 8,FO abcd1234,nop,nop> // Simul. SYN-data crossing: we don't support that yet so ack only remote ISN +.005 < S 1234:1734(500) win 14600 <mss 1040,nop,nop,sackOK,nop,wscale 6,FO 87654321,nop,nop> +0 > S. 0:0(0) ack 1235 <mss 1460,nop,nop,sackOK,nop,wscale 8> // SYN data is never retried. +.045 < S. 1234:1234(0) ack 1001 win 14600 <mss 940,nop,nop,sackOK,nop,wscale 6,FO 12345678,nop,nop> +0 > . 1001:1001(0) ack 1 // The other end retries +.1 < P. 1:501(500) ack 1000 win 257 +0 > . 1001:1001(0) ack 501 +0 read(4, ..., 4096) = 500 +0 close(4) = 0 +0 > F. 1001:1001(0) ack 501 +.05 < F. 501:501(0) ack 1002 win 257 +0 > . 1002:1002(0) ack 502 `/tmp/sysctl_restore_${PPID}.sh`
Hi Eric, +cc Neal -cc Jerry (NoSuchUser) On 23/07/2024 16:37, Eric Dumazet wrote: > On Thu, Jul 18, 2024 at 12:34 PM Matthieu Baerts (NGI0) > <matttbe@kernel.org> wrote: >> >> The 'Fixes' commit recently changed the behaviour of TCP by skipping the >> processing of the 3rd ACK when a sk->sk_socket is set. The goal was to >> skip tcp_ack_snd_check() in tcp_rcv_state_process() not to send an >> unnecessary ACK in case of simultaneous connect(). Unfortunately, that >> had an impact on TFO and MPTCP. >> >> I started to look at the impact on MPTCP, because the MPTCP CI found >> some issues with the MPTCP Packetdrill tests [1]. Then Paolo suggested >> me to look at the impact on TFO with "plain" TCP. >> >> For MPTCP, when receiving the 3rd ACK of a request adding a new path >> (MP_JOIN), sk->sk_socket will be set, and point to the MPTCP sock that >> has been created when the MPTCP connection got established before with >> the first path. The newly added 'goto' will then skip the processing of >> the segment text (step 7) and not go through tcp_data_queue() where the >> MPTCP options are validated, and some actions are triggered, e.g. >> sending the MPJ 4th ACK [2] as demonstrated by the new errors when >> running a packetdrill test [3] establishing a second subflow. >> >> This doesn't fully break MPTCP, mainly the 4th MPJ ACK that will be >> delayed. Still, we don't want to have this behaviour as it delays the >> switch to the fully established mode, and invalid MPTCP options in this >> 3rd ACK will not be caught any more. This modification also affects the >> MPTCP + TFO feature as well, and being the reason why the selftests >> started to be unstable the last few days [4]. >> >> For TFO, the existing 'basic-cookie-not-reqd' test [5] was no longer >> passing: if the 3rd ACK contains data, and the connection is accept()ed >> before receiving them, these data would no longer be processed, and thus >> not ACKed. >> >> One last thing about MPTCP, in case of simultaneous connect(), a >> fallback to TCP will be done, which seems fine: >> >> `../common/defaults.sh` >> >> 0 socket(..., SOCK_STREAM|SOCK_NONBLOCK, IPPROTO_MPTCP) = 3 >> +0 connect(3, ..., ...) = -1 EINPROGRESS (Operation now in progress) >> >> +0 > S 0:0(0) <mss 1460, sackOK, TS val 100 ecr 0, nop, wscale 8, mpcapable v1 flags[flag_h] nokey> >> +0 < S 0:0(0) win 1000 <mss 1460, sackOK, TS val 407 ecr 0, nop, wscale 8, mpcapable v1 flags[flag_h] nokey> >> +0 > S. 0:0(0) ack 1 <mss 1460, sackOK, TS val 330 ecr 0, nop, wscale 8, mpcapable v1 flags[flag_h] nokey> >> +0 < S. 0:0(0) ack 1 win 65535 <mss 1460, sackOK, TS val 700 ecr 100, nop, wscale 8, mpcapable v1 flags[flag_h] key[skey=2]> >> >> +0 write(3, ..., 100) = 100 >> +0 > . 1:1(0) ack 1 <nop, nop, TS val 845707014 ecr 700, nop, nop, sack 0:1> >> +0 > P. 1:101(100) ack 1 <nop, nop, TS val 845958933 ecr 700> >> >> Simultaneous SYN-data crossing is also not supported by TFO, see [6]. >> >> Link: https://github.com/multipath-tcp/mptcp_net-next/actions/runs/9936227696 [1] >> Link: https://datatracker.ietf.org/doc/html/rfc8684#fig_tokens [2] >> Link: https://github.com/multipath-tcp/packetdrill/blob/mptcp-net-next/gtests/net/mptcp/syscalls/accept.pkt#L28 [3] >> Link: https://netdev.bots.linux.dev/contest.html?executor=vmksft-mptcp-dbg&test=mptcp-connect-sh [4] >> Link: https://github.com/google/packetdrill/blob/master/gtests/net/tcp/fastopen/server/basic-cookie-not-reqd.pkt#L21 [5] >> Link: https://github.com/google/packetdrill/blob/master/gtests/net/tcp/fastopen/client/simultaneous-fast-open.pkt [6] >> Fixes: 23e89e8ee7be ("tcp: Don't drop SYN+ACK for simultaneous connect().") >> Suggested-by: Paolo Abeni <pabeni@redhat.com> >> Suggested-by: Kuniyuki Iwashima <kuniyu@amazon.com> >> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> >> --- >> Notes: >> - We could also drop this 'goto consume', and send the unnecessary ACK >> in this simultaneous connect case, which doesn't seem to be a "real" >> case, more something for fuzzers. But that's not what the RFC 9293 >> recommends to do. >> - v2: >> - Check if the SYN bit is set instead of looking for TFO and MPTCP >> specific attributes, as suggested by Kuniyuki. >> - Updated the comment above >> - Please note that the v2 has been sent mainly to satisfy the CI (to >> be able to catch new bugs with MPTCP), and because the suggestion >> from Kuniyuki looks better. It has not been sent to urge TCP >> maintainers to review it quicker than it should, please take your >> time and enjoy netdev.conf :) >> --- >> net/ipv4/tcp_input.c | 7 ++++++- >> 1 file changed, 6 insertions(+), 1 deletion(-) >> >> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c >> index ff9ab3d01ced..bfe1bc69dc3e 100644 >> --- a/net/ipv4/tcp_input.c >> +++ b/net/ipv4/tcp_input.c >> @@ -6820,7 +6820,12 @@ tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb) >> if (sk->sk_shutdown & SEND_SHUTDOWN) >> tcp_shutdown(sk, SEND_SHUTDOWN); >> >> - if (sk->sk_socket) >> + /* For crossed SYN cases, not to send an unnecessary ACK. >> + * Note that sk->sk_socket can be assigned in other cases, e.g. >> + * with TFO (if accept()'ed before the 3rd ACK) and MPTCP (MPJ: >> + * sk_socket is the parent MPTCP sock). >> + */ >> + if (sk->sk_socket && th->syn) >> goto consume; > > I think we should simply remove this part completely, because we > should send an ack anyway. Thank you for having looked, and ran the full packetdrill test suite! > > diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c > index ff9ab3d01ced89570903d3a9f649a637c5e07a90..91357d4713182078debd746a224046cba80ea3ce > 100644 > --- a/net/ipv4/tcp_input.c > +++ b/net/ipv4/tcp_input.c > @@ -6820,8 +6820,6 @@ tcp_rcv_state_process(struct sock *sk, struct > sk_buff *skb) > if (sk->sk_shutdown & SEND_SHUTDOWN) > tcp_shutdown(sk, SEND_SHUTDOWN); > > - if (sk->sk_socket) > - goto consume; > break; > > case TCP_FIN_WAIT1: { > > > I have a failing packetdrill test after Kuniyuki patch : > > > > // > // Test the simultaneous open scenario that both end sends > // SYN/data. Although we don't support that the connection should > // still be established. > // > `../../common/defaults.sh > ../../common/set_sysctls.py /proc/sys/net/ipv4/tcp_timestamps=0` > > // Cache warmup: send a Fast Open cookie request > 0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3 > +0 fcntl(3, F_SETFL, O_RDWR|O_NONBLOCK) = 0 > +0 sendto(3, ..., 0, MSG_FASTOPEN, ..., ...) = -1 EINPROGRESS > (Operation is now in progress) > +0 > S 0:0(0) <mss 1460,nop,nop,sackOK,nop,wscale 8,FO,nop,nop> > +.01 < S. 123:123(0) ack 1 win 14600 <mss > 1460,nop,nop,sackOK,nop,wscale 6,FO abcd1234,nop,nop> > +0 > . 1:1(0) ack 1 > +.01 close(3) = 0 > +0 > F. 1:1(0) ack 1 > +.01 < F. 1:1(0) ack 2 win 92 > +0 > . 2:2(0) ack 2 > > > // > // Test: simulatenous fast open > // > +.01 socket(..., SOCK_STREAM, IPPROTO_TCP) = 4 > +0 fcntl(4, F_SETFL, O_RDWR|O_NONBLOCK) = 0 > +0 sendto(4, ..., 1000, MSG_FASTOPEN, ..., ...) = 1000 > +0 > S 0:1000(1000) <mss 1460,nop,nop,sackOK,nop,wscale 8,FO > abcd1234,nop,nop> > // Simul. SYN-data crossing: we don't support that yet so ack only remote ISN > +.005 < S 1234:1734(500) win 14600 <mss 1040,nop,nop,sackOK,nop,wscale > 6,FO 87654321,nop,nop> > +0 > S. 0:0(0) ack 1235 <mss 1460,nop,nop,sackOK,nop,wscale 8> > > // SYN data is never retried. > +.045 < S. 1234:1234(0) ack 1001 win 14600 <mss > 940,nop,nop,sackOK,nop,wscale 6,FO 12345678,nop,nop> > +0 > . 1001:1001(0) ack 1 I recently sent a PR -- already applied -- to Neal to remove this line: https://github.com/google/packetdrill/pull/86 I thought it was the intension of Kuniyuki's patch not to send this ACK in this case to follow the RFC 9293's recommendation. This TFO test looks a bit similar to the example from Kuniyuki's patch: --------------- 8< --------------- 0 socket(..., SOCK_STREAM|SOCK_NONBLOCK, IPPROTO_TCP) = 3 +0 connect(3, ..., ...) = -1 EINPROGRESS (Operation now in progress) +0 > S 0:0(0) <mss 1460,sackOK,TS val 1000 ecr 0,nop,wscale 8> +0 < S 0:0(0) win 1000 <mss 1000> +0 > S. 0:0(0) ack 1 <mss 1460,sackOK,TS val 3308134035 ecr 0,nop,wscale 8> +0 < S. 0:0(0) ack 1 win 1000 /* No ACK here */ +0 write(3, ..., 100) = 100 +0 > P. 1:101(100) ack 1 --------------- 8< --------------- But maybe here that should be different for TFO? For my case with MPTCP (and TFO), it is fine to drop this 'goto consume' but I don't know how "strict" we want to be regarding the RFC and this marginal case. > // The other end retries > +.1 < P. 1:501(500) ack 1000 win 257 > +0 > . 1001:1001(0) ack 501 > +0 read(4, ..., 4096) = 500 > +0 close(4) = 0 > +0 > F. 1001:1001(0) ack 501 > +.05 < F. 501:501(0) ack 1002 win 257 > +0 > . 1002:1002(0) ack 502 > > `/tmp/sysctl_restore_${PPID}.sh` > Cheers, Matt
On Tue, Jul 23, 2024 at 4:58 PM Matthieu Baerts <matttbe@kernel.org> wrote: > > Hi Eric, > > +cc Neal > -cc Jerry (NoSuchUser) > > On 23/07/2024 16:37, Eric Dumazet wrote: > > On Thu, Jul 18, 2024 at 12:34 PM Matthieu Baerts (NGI0) > > <matttbe@kernel.org> wrote: > >> > >> The 'Fixes' commit recently changed the behaviour of TCP by skipping the > >> processing of the 3rd ACK when a sk->sk_socket is set. The goal was to > >> skip tcp_ack_snd_check() in tcp_rcv_state_process() not to send an > >> unnecessary ACK in case of simultaneous connect(). Unfortunately, that > >> had an impact on TFO and MPTCP. > >> > >> I started to look at the impact on MPTCP, because the MPTCP CI found > >> some issues with the MPTCP Packetdrill tests [1]. Then Paolo suggested > >> me to look at the impact on TFO with "plain" TCP. > >> > >> For MPTCP, when receiving the 3rd ACK of a request adding a new path > >> (MP_JOIN), sk->sk_socket will be set, and point to the MPTCP sock that > >> has been created when the MPTCP connection got established before with > >> the first path. The newly added 'goto' will then skip the processing of > >> the segment text (step 7) and not go through tcp_data_queue() where the > >> MPTCP options are validated, and some actions are triggered, e.g. > >> sending the MPJ 4th ACK [2] as demonstrated by the new errors when > >> running a packetdrill test [3] establishing a second subflow. > >> > >> This doesn't fully break MPTCP, mainly the 4th MPJ ACK that will be > >> delayed. Still, we don't want to have this behaviour as it delays the > >> switch to the fully established mode, and invalid MPTCP options in this > >> 3rd ACK will not be caught any more. This modification also affects the > >> MPTCP + TFO feature as well, and being the reason why the selftests > >> started to be unstable the last few days [4]. > >> > >> For TFO, the existing 'basic-cookie-not-reqd' test [5] was no longer > >> passing: if the 3rd ACK contains data, and the connection is accept()ed > >> before receiving them, these data would no longer be processed, and thus > >> not ACKed. > >> > >> One last thing about MPTCP, in case of simultaneous connect(), a > >> fallback to TCP will be done, which seems fine: > >> > >> `../common/defaults.sh` > >> > >> 0 socket(..., SOCK_STREAM|SOCK_NONBLOCK, IPPROTO_MPTCP) = 3 > >> +0 connect(3, ..., ...) = -1 EINPROGRESS (Operation now in progress) > >> > >> +0 > S 0:0(0) <mss 1460, sackOK, TS val 100 ecr 0, nop, wscale 8, mpcapable v1 flags[flag_h] nokey> > >> +0 < S 0:0(0) win 1000 <mss 1460, sackOK, TS val 407 ecr 0, nop, wscale 8, mpcapable v1 flags[flag_h] nokey> > >> +0 > S. 0:0(0) ack 1 <mss 1460, sackOK, TS val 330 ecr 0, nop, wscale 8, mpcapable v1 flags[flag_h] nokey> > >> +0 < S. 0:0(0) ack 1 win 65535 <mss 1460, sackOK, TS val 700 ecr 100, nop, wscale 8, mpcapable v1 flags[flag_h] key[skey=2]> > >> > >> +0 write(3, ..., 100) = 100 > >> +0 > . 1:1(0) ack 1 <nop, nop, TS val 845707014 ecr 700, nop, nop, sack 0:1> > >> +0 > P. 1:101(100) ack 1 <nop, nop, TS val 845958933 ecr 700> > >> > >> Simultaneous SYN-data crossing is also not supported by TFO, see [6]. > >> > >> Link: https://github.com/multipath-tcp/mptcp_net-next/actions/runs/9936227696 [1] > >> Link: https://datatracker.ietf.org/doc/html/rfc8684#fig_tokens [2] > >> Link: https://github.com/multipath-tcp/packetdrill/blob/mptcp-net-next/gtests/net/mptcp/syscalls/accept.pkt#L28 [3] > >> Link: https://netdev.bots.linux.dev/contest.html?executor=vmksft-mptcp-dbg&test=mptcp-connect-sh [4] > >> Link: https://github.com/google/packetdrill/blob/master/gtests/net/tcp/fastopen/server/basic-cookie-not-reqd.pkt#L21 [5] > >> Link: https://github.com/google/packetdrill/blob/master/gtests/net/tcp/fastopen/client/simultaneous-fast-open.pkt [6] > >> Fixes: 23e89e8ee7be ("tcp: Don't drop SYN+ACK for simultaneous connect().") > >> Suggested-by: Paolo Abeni <pabeni@redhat.com> > >> Suggested-by: Kuniyuki Iwashima <kuniyu@amazon.com> > >> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> > >> --- > >> Notes: > >> - We could also drop this 'goto consume', and send the unnecessary ACK > >> in this simultaneous connect case, which doesn't seem to be a "real" > >> case, more something for fuzzers. But that's not what the RFC 9293 > >> recommends to do. > >> - v2: > >> - Check if the SYN bit is set instead of looking for TFO and MPTCP > >> specific attributes, as suggested by Kuniyuki. > >> - Updated the comment above > >> - Please note that the v2 has been sent mainly to satisfy the CI (to > >> be able to catch new bugs with MPTCP), and because the suggestion > >> from Kuniyuki looks better. It has not been sent to urge TCP > >> maintainers to review it quicker than it should, please take your > >> time and enjoy netdev.conf :) > >> --- > >> net/ipv4/tcp_input.c | 7 ++++++- > >> 1 file changed, 6 insertions(+), 1 deletion(-) > >> > >> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c > >> index ff9ab3d01ced..bfe1bc69dc3e 100644 > >> --- a/net/ipv4/tcp_input.c > >> +++ b/net/ipv4/tcp_input.c > >> @@ -6820,7 +6820,12 @@ tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb) > >> if (sk->sk_shutdown & SEND_SHUTDOWN) > >> tcp_shutdown(sk, SEND_SHUTDOWN); > >> > >> - if (sk->sk_socket) > >> + /* For crossed SYN cases, not to send an unnecessary ACK. > >> + * Note that sk->sk_socket can be assigned in other cases, e.g. > >> + * with TFO (if accept()'ed before the 3rd ACK) and MPTCP (MPJ: > >> + * sk_socket is the parent MPTCP sock). > >> + */ > >> + if (sk->sk_socket && th->syn) > >> goto consume; > > > > I think we should simply remove this part completely, because we > > should send an ack anyway. > > Thank you for having looked, and ran the full packetdrill test suite! > > > > > diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c > > index ff9ab3d01ced89570903d3a9f649a637c5e07a90..91357d4713182078debd746a224046cba80ea3ce > > 100644 > > --- a/net/ipv4/tcp_input.c > > +++ b/net/ipv4/tcp_input.c > > @@ -6820,8 +6820,6 @@ tcp_rcv_state_process(struct sock *sk, struct > > sk_buff *skb) > > if (sk->sk_shutdown & SEND_SHUTDOWN) > > tcp_shutdown(sk, SEND_SHUTDOWN); > > > > - if (sk->sk_socket) > > - goto consume; > > break; > > > > case TCP_FIN_WAIT1: { > > > > > > I have a failing packetdrill test after Kuniyuki patch : > > > > > > > > // > > // Test the simultaneous open scenario that both end sends > > // SYN/data. Although we don't support that the connection should > > // still be established. > > // > > `../../common/defaults.sh > > ../../common/set_sysctls.py /proc/sys/net/ipv4/tcp_timestamps=0` > > > > // Cache warmup: send a Fast Open cookie request > > 0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3 > > +0 fcntl(3, F_SETFL, O_RDWR|O_NONBLOCK) = 0 > > +0 sendto(3, ..., 0, MSG_FASTOPEN, ..., ...) = -1 EINPROGRESS > > (Operation is now in progress) > > +0 > S 0:0(0) <mss 1460,nop,nop,sackOK,nop,wscale 8,FO,nop,nop> > > +.01 < S. 123:123(0) ack 1 win 14600 <mss > > 1460,nop,nop,sackOK,nop,wscale 6,FO abcd1234,nop,nop> > > +0 > . 1:1(0) ack 1 > > +.01 close(3) = 0 > > +0 > F. 1:1(0) ack 1 > > +.01 < F. 1:1(0) ack 2 win 92 > > +0 > . 2:2(0) ack 2 > > > > > > // > > // Test: simulatenous fast open > > // > > +.01 socket(..., SOCK_STREAM, IPPROTO_TCP) = 4 > > +0 fcntl(4, F_SETFL, O_RDWR|O_NONBLOCK) = 0 > > +0 sendto(4, ..., 1000, MSG_FASTOPEN, ..., ...) = 1000 > > +0 > S 0:1000(1000) <mss 1460,nop,nop,sackOK,nop,wscale 8,FO > > abcd1234,nop,nop> > > // Simul. SYN-data crossing: we don't support that yet so ack only remote ISN > > +.005 < S 1234:1734(500) win 14600 <mss 1040,nop,nop,sackOK,nop,wscale > > 6,FO 87654321,nop,nop> > > +0 > S. 0:0(0) ack 1235 <mss 1460,nop,nop,sackOK,nop,wscale 8> > > > > // SYN data is never retried. > > +.045 < S. 1234:1234(0) ack 1001 win 14600 <mss > > 940,nop,nop,sackOK,nop,wscale 6,FO 12345678,nop,nop> > > +0 > . 1001:1001(0) ack 1 > > I recently sent a PR -- already applied -- to Neal to remove this line: > > https://github.com/google/packetdrill/pull/86 > > I thought it was the intension of Kuniyuki's patch not to send this ACK > in this case to follow the RFC 9293's recommendation. This TFO test > looks a bit similar to the example from Kuniyuki's patch: > > > --------------- 8< --------------- > 0 socket(..., SOCK_STREAM|SOCK_NONBLOCK, IPPROTO_TCP) = 3 > +0 connect(3, ..., ...) = -1 EINPROGRESS (Operation now in progress) > > +0 > S 0:0(0) <mss 1460,sackOK,TS val 1000 ecr 0,nop,wscale 8> > +0 < S 0:0(0) win 1000 <mss 1000> > +0 > S. 0:0(0) ack 1 <mss 1460,sackOK,TS val 3308134035 ecr 0,nop,wscale 8> > +0 < S. 0:0(0) ack 1 win 1000 > > /* No ACK here */ > > +0 write(3, ..., 100) = 100 > +0 > P. 1:101(100) ack 1 > --------------- 8< --------------- > > > > But maybe here that should be different for TFO? > > For my case with MPTCP (and TFO), it is fine to drop this 'goto consume' > but I don't know how "strict" we want to be regarding the RFC and this > marginal case. Problem of this 'goto consume' is that we are not properly sending a DUPACK in this case. +.01 socket(..., SOCK_STREAM, IPPROTO_TCP) = 4 +0 fcntl(4, F_SETFL, O_RDWR|O_NONBLOCK) = 0 +0 sendto(4, ..., 1000, MSG_FASTOPEN, ..., ...) = 1000 +0 > S 0:1000(1000) <mss 1460,nop,nop,sackOK,nop,wscale 8,FO abcd1234,nop,nop> // Simul. SYN-data crossing: we don't support that yet so ack only remote ISN +.005 < S 1234:1734(500) win 14600 <mss 1040,nop,nop,sackOK,nop,wscale 6,FO 87654321,nop,nop> +0 > S. 0:0(0) ack 1235 <mss 1460,nop,nop,sackOK,nop,wscale 8> +.045 < S. 1234:1234(0) ack 1001 win 14600 <mss 940,nop,nop,sackOK,nop,wscale 6,FO 12345678,nop,nop> +0 > . 1001:1001(0) ack 1 <nop,nop,sack 0:1> // See here Not sending a dupack seems wrong and could hurt. I had reservations about this part, if you look back to the discussion. This is why Kuniyuki added in his changelog : Note that tcp_ack_snd_check() in tcp_rcv_state_process() is skipped not to send an unnecessary ACK, but this could be a bit risky for net.git, so this targets for net-next.
Hi Eric, On 23/07/2024 17:38, Eric Dumazet wrote: > On Tue, Jul 23, 2024 at 4:58 PM Matthieu Baerts <matttbe@kernel.org> wrote: >> >> Hi Eric, >> >> +cc Neal >> -cc Jerry (NoSuchUser) >> >> On 23/07/2024 16:37, Eric Dumazet wrote: >>> On Thu, Jul 18, 2024 at 12:34 PM Matthieu Baerts (NGI0) >>> <matttbe@kernel.org> wrote: >>>> >>>> The 'Fixes' commit recently changed the behaviour of TCP by skipping the >>>> processing of the 3rd ACK when a sk->sk_socket is set. The goal was to >>>> skip tcp_ack_snd_check() in tcp_rcv_state_process() not to send an >>>> unnecessary ACK in case of simultaneous connect(). Unfortunately, that >>>> had an impact on TFO and MPTCP. >>>> >>>> I started to look at the impact on MPTCP, because the MPTCP CI found >>>> some issues with the MPTCP Packetdrill tests [1]. Then Paolo suggested >>>> me to look at the impact on TFO with "plain" TCP. >>>> >>>> For MPTCP, when receiving the 3rd ACK of a request adding a new path >>>> (MP_JOIN), sk->sk_socket will be set, and point to the MPTCP sock that >>>> has been created when the MPTCP connection got established before with >>>> the first path. The newly added 'goto' will then skip the processing of >>>> the segment text (step 7) and not go through tcp_data_queue() where the >>>> MPTCP options are validated, and some actions are triggered, e.g. >>>> sending the MPJ 4th ACK [2] as demonstrated by the new errors when >>>> running a packetdrill test [3] establishing a second subflow. >>>> >>>> This doesn't fully break MPTCP, mainly the 4th MPJ ACK that will be >>>> delayed. Still, we don't want to have this behaviour as it delays the >>>> switch to the fully established mode, and invalid MPTCP options in this >>>> 3rd ACK will not be caught any more. This modification also affects the >>>> MPTCP + TFO feature as well, and being the reason why the selftests >>>> started to be unstable the last few days [4]. >>>> >>>> For TFO, the existing 'basic-cookie-not-reqd' test [5] was no longer >>>> passing: if the 3rd ACK contains data, and the connection is accept()ed >>>> before receiving them, these data would no longer be processed, and thus >>>> not ACKed. >>>> >>>> One last thing about MPTCP, in case of simultaneous connect(), a >>>> fallback to TCP will be done, which seems fine: >>>> >>>> `../common/defaults.sh` >>>> >>>> 0 socket(..., SOCK_STREAM|SOCK_NONBLOCK, IPPROTO_MPTCP) = 3 >>>> +0 connect(3, ..., ...) = -1 EINPROGRESS (Operation now in progress) >>>> >>>> +0 > S 0:0(0) <mss 1460, sackOK, TS val 100 ecr 0, nop, wscale 8, mpcapable v1 flags[flag_h] nokey> >>>> +0 < S 0:0(0) win 1000 <mss 1460, sackOK, TS val 407 ecr 0, nop, wscale 8, mpcapable v1 flags[flag_h] nokey> >>>> +0 > S. 0:0(0) ack 1 <mss 1460, sackOK, TS val 330 ecr 0, nop, wscale 8, mpcapable v1 flags[flag_h] nokey> >>>> +0 < S. 0:0(0) ack 1 win 65535 <mss 1460, sackOK, TS val 700 ecr 100, nop, wscale 8, mpcapable v1 flags[flag_h] key[skey=2]> >>>> >>>> +0 write(3, ..., 100) = 100 >>>> +0 > . 1:1(0) ack 1 <nop, nop, TS val 845707014 ecr 700, nop, nop, sack 0:1> >>>> +0 > P. 1:101(100) ack 1 <nop, nop, TS val 845958933 ecr 700> >>>> >>>> Simultaneous SYN-data crossing is also not supported by TFO, see [6]. >>>> >>>> Link: https://github.com/multipath-tcp/mptcp_net-next/actions/runs/9936227696 [1] >>>> Link: https://datatracker.ietf.org/doc/html/rfc8684#fig_tokens [2] >>>> Link: https://github.com/multipath-tcp/packetdrill/blob/mptcp-net-next/gtests/net/mptcp/syscalls/accept.pkt#L28 [3] >>>> Link: https://netdev.bots.linux.dev/contest.html?executor=vmksft-mptcp-dbg&test=mptcp-connect-sh [4] >>>> Link: https://github.com/google/packetdrill/blob/master/gtests/net/tcp/fastopen/server/basic-cookie-not-reqd.pkt#L21 [5] >>>> Link: https://github.com/google/packetdrill/blob/master/gtests/net/tcp/fastopen/client/simultaneous-fast-open.pkt [6] >>>> Fixes: 23e89e8ee7be ("tcp: Don't drop SYN+ACK for simultaneous connect().") >>>> Suggested-by: Paolo Abeni <pabeni@redhat.com> >>>> Suggested-by: Kuniyuki Iwashima <kuniyu@amazon.com> >>>> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> >>>> --- >>>> Notes: >>>> - We could also drop this 'goto consume', and send the unnecessary ACK >>>> in this simultaneous connect case, which doesn't seem to be a "real" >>>> case, more something for fuzzers. But that's not what the RFC 9293 >>>> recommends to do. >>>> - v2: >>>> - Check if the SYN bit is set instead of looking for TFO and MPTCP >>>> specific attributes, as suggested by Kuniyuki. >>>> - Updated the comment above >>>> - Please note that the v2 has been sent mainly to satisfy the CI (to >>>> be able to catch new bugs with MPTCP), and because the suggestion >>>> from Kuniyuki looks better. It has not been sent to urge TCP >>>> maintainers to review it quicker than it should, please take your >>>> time and enjoy netdev.conf :) >>>> --- >>>> net/ipv4/tcp_input.c | 7 ++++++- >>>> 1 file changed, 6 insertions(+), 1 deletion(-) >>>> >>>> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c >>>> index ff9ab3d01ced..bfe1bc69dc3e 100644 >>>> --- a/net/ipv4/tcp_input.c >>>> +++ b/net/ipv4/tcp_input.c >>>> @@ -6820,7 +6820,12 @@ tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb) >>>> if (sk->sk_shutdown & SEND_SHUTDOWN) >>>> tcp_shutdown(sk, SEND_SHUTDOWN); >>>> >>>> - if (sk->sk_socket) >>>> + /* For crossed SYN cases, not to send an unnecessary ACK. >>>> + * Note that sk->sk_socket can be assigned in other cases, e.g. >>>> + * with TFO (if accept()'ed before the 3rd ACK) and MPTCP (MPJ: >>>> + * sk_socket is the parent MPTCP sock). >>>> + */ >>>> + if (sk->sk_socket && th->syn) >>>> goto consume; >>> >>> I think we should simply remove this part completely, because we >>> should send an ack anyway. >> >> Thank you for having looked, and ran the full packetdrill test suite! >> >>> >>> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c >>> index ff9ab3d01ced89570903d3a9f649a637c5e07a90..91357d4713182078debd746a224046cba80ea3ce >>> 100644 >>> --- a/net/ipv4/tcp_input.c >>> +++ b/net/ipv4/tcp_input.c >>> @@ -6820,8 +6820,6 @@ tcp_rcv_state_process(struct sock *sk, struct >>> sk_buff *skb) >>> if (sk->sk_shutdown & SEND_SHUTDOWN) >>> tcp_shutdown(sk, SEND_SHUTDOWN); >>> >>> - if (sk->sk_socket) >>> - goto consume; >>> break; >>> >>> case TCP_FIN_WAIT1: { >>> >>> >>> I have a failing packetdrill test after Kuniyuki patch : >>> >>> >>> >>> // >>> // Test the simultaneous open scenario that both end sends >>> // SYN/data. Although we don't support that the connection should >>> // still be established. >>> // >>> `../../common/defaults.sh >>> ../../common/set_sysctls.py /proc/sys/net/ipv4/tcp_timestamps=0` >>> >>> // Cache warmup: send a Fast Open cookie request >>> 0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3 >>> +0 fcntl(3, F_SETFL, O_RDWR|O_NONBLOCK) = 0 >>> +0 sendto(3, ..., 0, MSG_FASTOPEN, ..., ...) = -1 EINPROGRESS >>> (Operation is now in progress) >>> +0 > S 0:0(0) <mss 1460,nop,nop,sackOK,nop,wscale 8,FO,nop,nop> >>> +.01 < S. 123:123(0) ack 1 win 14600 <mss >>> 1460,nop,nop,sackOK,nop,wscale 6,FO abcd1234,nop,nop> >>> +0 > . 1:1(0) ack 1 >>> +.01 close(3) = 0 >>> +0 > F. 1:1(0) ack 1 >>> +.01 < F. 1:1(0) ack 2 win 92 >>> +0 > . 2:2(0) ack 2 >>> >>> >>> // >>> // Test: simulatenous fast open >>> // >>> +.01 socket(..., SOCK_STREAM, IPPROTO_TCP) = 4 >>> +0 fcntl(4, F_SETFL, O_RDWR|O_NONBLOCK) = 0 >>> +0 sendto(4, ..., 1000, MSG_FASTOPEN, ..., ...) = 1000 >>> +0 > S 0:1000(1000) <mss 1460,nop,nop,sackOK,nop,wscale 8,FO >>> abcd1234,nop,nop> >>> // Simul. SYN-data crossing: we don't support that yet so ack only remote ISN >>> +.005 < S 1234:1734(500) win 14600 <mss 1040,nop,nop,sackOK,nop,wscale >>> 6,FO 87654321,nop,nop> >>> +0 > S. 0:0(0) ack 1235 <mss 1460,nop,nop,sackOK,nop,wscale 8> >>> >>> // SYN data is never retried. >>> +.045 < S. 1234:1234(0) ack 1001 win 14600 <mss >>> 940,nop,nop,sackOK,nop,wscale 6,FO 12345678,nop,nop> >>> +0 > . 1001:1001(0) ack 1 >> >> I recently sent a PR -- already applied -- to Neal to remove this line: >> >> https://github.com/google/packetdrill/pull/86 >> >> I thought it was the intension of Kuniyuki's patch not to send this ACK >> in this case to follow the RFC 9293's recommendation. This TFO test >> looks a bit similar to the example from Kuniyuki's patch: >> >> >> --------------- 8< --------------- >> 0 socket(..., SOCK_STREAM|SOCK_NONBLOCK, IPPROTO_TCP) = 3 >> +0 connect(3, ..., ...) = -1 EINPROGRESS (Operation now in progress) >> >> +0 > S 0:0(0) <mss 1460,sackOK,TS val 1000 ecr 0,nop,wscale 8> >> +0 < S 0:0(0) win 1000 <mss 1000> >> +0 > S. 0:0(0) ack 1 <mss 1460,sackOK,TS val 3308134035 ecr 0,nop,wscale 8> >> +0 < S. 0:0(0) ack 1 win 1000 >> >> /* No ACK here */ >> >> +0 write(3, ..., 100) = 100 >> +0 > P. 1:101(100) ack 1 >> --------------- 8< --------------- >> >> >> >> But maybe here that should be different for TFO? >> >> For my case with MPTCP (and TFO), it is fine to drop this 'goto consume' >> but I don't know how "strict" we want to be regarding the RFC and this >> marginal case. > > Problem of this 'goto consume' is that we are not properly sending a > DUPACK in this case. > > +.01 socket(..., SOCK_STREAM, IPPROTO_TCP) = 4 > +0 fcntl(4, F_SETFL, O_RDWR|O_NONBLOCK) = 0 > +0 sendto(4, ..., 1000, MSG_FASTOPEN, ..., ...) = 1000 > +0 > S 0:1000(1000) <mss 1460,nop,nop,sackOK,nop,wscale 8,FO > abcd1234,nop,nop> > // Simul. SYN-data crossing: we don't support that yet so ack only remote ISN > +.005 < S 1234:1734(500) win 14600 <mss 1040,nop,nop,sackOK,nop,wscale > 6,FO 87654321,nop,nop> > +0 > S. 0:0(0) ack 1235 <mss 1460,nop,nop,sackOK,nop,wscale 8> > > +.045 < S. 1234:1234(0) ack 1001 win 14600 <mss > 940,nop,nop,sackOK,nop,wscale 6,FO 12345678,nop,nop> > +0 > . 1001:1001(0) ack 1 <nop,nop,sack 0:1> // See here I'm sorry, but is it normal to have 'ack 1' with 'sack 0:1' here? > Not sending a dupack seems wrong and could hurt. Indeed, I thought the RFC 9293 was not allowing that, but I didn't see anything forbidding this DUPACK: https://www.rfc-editor.org/rfc/rfc9293.html#section-3.5-7 > I had reservations about this part, if you look back to the discussion. > > This is why Kuniyuki added in his changelog : > > Note that tcp_ack_snd_check() in tcp_rcv_state_process() is skipped not to > send an unnecessary ACK, but this could be a bit risky for net.git, so this > targets for net-next. I understand, I can send a v3 dropping this part, and not including patch 2/2 for -net. I can also send a PR to Neal re-adding the ACK with 'sack' (if it is normal). Cheers, Matt
On Tue, Jul 23, 2024 at 6:08 PM Matthieu Baerts <matttbe@kernel.org> wrote: > > Hi Eric, > > On 23/07/2024 17:38, Eric Dumazet wrote: > > On Tue, Jul 23, 2024 at 4:58 PM Matthieu Baerts <matttbe@kernel.org> wrote: > >> > >> Hi Eric, > >> > >> +cc Neal > >> -cc Jerry (NoSuchUser) > >> > >> On 23/07/2024 16:37, Eric Dumazet wrote: > >>> On Thu, Jul 18, 2024 at 12:34 PM Matthieu Baerts (NGI0) > >>> <matttbe@kernel.org> wrote: > >>>> > >>>> The 'Fixes' commit recently changed the behaviour of TCP by skipping the > >>>> processing of the 3rd ACK when a sk->sk_socket is set. The goal was to > >>>> skip tcp_ack_snd_check() in tcp_rcv_state_process() not to send an > >>>> unnecessary ACK in case of simultaneous connect(). Unfortunately, that > >>>> had an impact on TFO and MPTCP. > >>>> > >>>> I started to look at the impact on MPTCP, because the MPTCP CI found > >>>> some issues with the MPTCP Packetdrill tests [1]. Then Paolo suggested > >>>> me to look at the impact on TFO with "plain" TCP. > >>>> > >>>> For MPTCP, when receiving the 3rd ACK of a request adding a new path > >>>> (MP_JOIN), sk->sk_socket will be set, and point to the MPTCP sock that > >>>> has been created when the MPTCP connection got established before with > >>>> the first path. The newly added 'goto' will then skip the processing of > >>>> the segment text (step 7) and not go through tcp_data_queue() where the > >>>> MPTCP options are validated, and some actions are triggered, e.g. > >>>> sending the MPJ 4th ACK [2] as demonstrated by the new errors when > >>>> running a packetdrill test [3] establishing a second subflow. > >>>> > >>>> This doesn't fully break MPTCP, mainly the 4th MPJ ACK that will be > >>>> delayed. Still, we don't want to have this behaviour as it delays the > >>>> switch to the fully established mode, and invalid MPTCP options in this > >>>> 3rd ACK will not be caught any more. This modification also affects the > >>>> MPTCP + TFO feature as well, and being the reason why the selftests > >>>> started to be unstable the last few days [4]. > >>>> > >>>> For TFO, the existing 'basic-cookie-not-reqd' test [5] was no longer > >>>> passing: if the 3rd ACK contains data, and the connection is accept()ed > >>>> before receiving them, these data would no longer be processed, and thus > >>>> not ACKed. > >>>> > >>>> One last thing about MPTCP, in case of simultaneous connect(), a > >>>> fallback to TCP will be done, which seems fine: > >>>> > >>>> `../common/defaults.sh` > >>>> > >>>> 0 socket(..., SOCK_STREAM|SOCK_NONBLOCK, IPPROTO_MPTCP) = 3 > >>>> +0 connect(3, ..., ...) = -1 EINPROGRESS (Operation now in progress) > >>>> > >>>> +0 > S 0:0(0) <mss 1460, sackOK, TS val 100 ecr 0, nop, wscale 8, mpcapable v1 flags[flag_h] nokey> > >>>> +0 < S 0:0(0) win 1000 <mss 1460, sackOK, TS val 407 ecr 0, nop, wscale 8, mpcapable v1 flags[flag_h] nokey> > >>>> +0 > S. 0:0(0) ack 1 <mss 1460, sackOK, TS val 330 ecr 0, nop, wscale 8, mpcapable v1 flags[flag_h] nokey> > >>>> +0 < S. 0:0(0) ack 1 win 65535 <mss 1460, sackOK, TS val 700 ecr 100, nop, wscale 8, mpcapable v1 flags[flag_h] key[skey=2]> > >>>> > >>>> +0 write(3, ..., 100) = 100 > >>>> +0 > . 1:1(0) ack 1 <nop, nop, TS val 845707014 ecr 700, nop, nop, sack 0:1> > >>>> +0 > P. 1:101(100) ack 1 <nop, nop, TS val 845958933 ecr 700> > >>>> > >>>> Simultaneous SYN-data crossing is also not supported by TFO, see [6]. > >>>> > >>>> Link: https://github.com/multipath-tcp/mptcp_net-next/actions/runs/9936227696 [1] > >>>> Link: https://datatracker.ietf.org/doc/html/rfc8684#fig_tokens [2] > >>>> Link: https://github.com/multipath-tcp/packetdrill/blob/mptcp-net-next/gtests/net/mptcp/syscalls/accept.pkt#L28 [3] > >>>> Link: https://netdev.bots.linux.dev/contest.html?executor=vmksft-mptcp-dbg&test=mptcp-connect-sh [4] > >>>> Link: https://github.com/google/packetdrill/blob/master/gtests/net/tcp/fastopen/server/basic-cookie-not-reqd.pkt#L21 [5] > >>>> Link: https://github.com/google/packetdrill/blob/master/gtests/net/tcp/fastopen/client/simultaneous-fast-open.pkt [6] > >>>> Fixes: 23e89e8ee7be ("tcp: Don't drop SYN+ACK for simultaneous connect().") > >>>> Suggested-by: Paolo Abeni <pabeni@redhat.com> > >>>> Suggested-by: Kuniyuki Iwashima <kuniyu@amazon.com> > >>>> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> > >>>> --- > >>>> Notes: > >>>> - We could also drop this 'goto consume', and send the unnecessary ACK > >>>> in this simultaneous connect case, which doesn't seem to be a "real" > >>>> case, more something for fuzzers. But that's not what the RFC 9293 > >>>> recommends to do. > >>>> - v2: > >>>> - Check if the SYN bit is set instead of looking for TFO and MPTCP > >>>> specific attributes, as suggested by Kuniyuki. > >>>> - Updated the comment above > >>>> - Please note that the v2 has been sent mainly to satisfy the CI (to > >>>> be able to catch new bugs with MPTCP), and because the suggestion > >>>> from Kuniyuki looks better. It has not been sent to urge TCP > >>>> maintainers to review it quicker than it should, please take your > >>>> time and enjoy netdev.conf :) > >>>> --- > >>>> net/ipv4/tcp_input.c | 7 ++++++- > >>>> 1 file changed, 6 insertions(+), 1 deletion(-) > >>>> > >>>> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c > >>>> index ff9ab3d01ced..bfe1bc69dc3e 100644 > >>>> --- a/net/ipv4/tcp_input.c > >>>> +++ b/net/ipv4/tcp_input.c > >>>> @@ -6820,7 +6820,12 @@ tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb) > >>>> if (sk->sk_shutdown & SEND_SHUTDOWN) > >>>> tcp_shutdown(sk, SEND_SHUTDOWN); > >>>> > >>>> - if (sk->sk_socket) > >>>> + /* For crossed SYN cases, not to send an unnecessary ACK. > >>>> + * Note that sk->sk_socket can be assigned in other cases, e.g. > >>>> + * with TFO (if accept()'ed before the 3rd ACK) and MPTCP (MPJ: > >>>> + * sk_socket is the parent MPTCP sock). > >>>> + */ > >>>> + if (sk->sk_socket && th->syn) > >>>> goto consume; > >>> > >>> I think we should simply remove this part completely, because we > >>> should send an ack anyway. > >> > >> Thank you for having looked, and ran the full packetdrill test suite! > >> > >>> > >>> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c > >>> index ff9ab3d01ced89570903d3a9f649a637c5e07a90..91357d4713182078debd746a224046cba80ea3ce > >>> 100644 > >>> --- a/net/ipv4/tcp_input.c > >>> +++ b/net/ipv4/tcp_input.c > >>> @@ -6820,8 +6820,6 @@ tcp_rcv_state_process(struct sock *sk, struct > >>> sk_buff *skb) > >>> if (sk->sk_shutdown & SEND_SHUTDOWN) > >>> tcp_shutdown(sk, SEND_SHUTDOWN); > >>> > >>> - if (sk->sk_socket) > >>> - goto consume; > >>> break; > >>> > >>> case TCP_FIN_WAIT1: { > >>> > >>> > >>> I have a failing packetdrill test after Kuniyuki patch : > >>> > >>> > >>> > >>> // > >>> // Test the simultaneous open scenario that both end sends > >>> // SYN/data. Although we don't support that the connection should > >>> // still be established. > >>> // > >>> `../../common/defaults.sh > >>> ../../common/set_sysctls.py /proc/sys/net/ipv4/tcp_timestamps=0` > >>> > >>> // Cache warmup: send a Fast Open cookie request > >>> 0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3 > >>> +0 fcntl(3, F_SETFL, O_RDWR|O_NONBLOCK) = 0 > >>> +0 sendto(3, ..., 0, MSG_FASTOPEN, ..., ...) = -1 EINPROGRESS > >>> (Operation is now in progress) > >>> +0 > S 0:0(0) <mss 1460,nop,nop,sackOK,nop,wscale 8,FO,nop,nop> > >>> +.01 < S. 123:123(0) ack 1 win 14600 <mss > >>> 1460,nop,nop,sackOK,nop,wscale 6,FO abcd1234,nop,nop> > >>> +0 > . 1:1(0) ack 1 > >>> +.01 close(3) = 0 > >>> +0 > F. 1:1(0) ack 1 > >>> +.01 < F. 1:1(0) ack 2 win 92 > >>> +0 > . 2:2(0) ack 2 > >>> > >>> > >>> // > >>> // Test: simulatenous fast open > >>> // > >>> +.01 socket(..., SOCK_STREAM, IPPROTO_TCP) = 4 > >>> +0 fcntl(4, F_SETFL, O_RDWR|O_NONBLOCK) = 0 > >>> +0 sendto(4, ..., 1000, MSG_FASTOPEN, ..., ...) = 1000 > >>> +0 > S 0:1000(1000) <mss 1460,nop,nop,sackOK,nop,wscale 8,FO > >>> abcd1234,nop,nop> > >>> // Simul. SYN-data crossing: we don't support that yet so ack only remote ISN > >>> +.005 < S 1234:1734(500) win 14600 <mss 1040,nop,nop,sackOK,nop,wscale > >>> 6,FO 87654321,nop,nop> > >>> +0 > S. 0:0(0) ack 1235 <mss 1460,nop,nop,sackOK,nop,wscale 8> > >>> > >>> // SYN data is never retried. > >>> +.045 < S. 1234:1234(0) ack 1001 win 14600 <mss > >>> 940,nop,nop,sackOK,nop,wscale 6,FO 12345678,nop,nop> > >>> +0 > . 1001:1001(0) ack 1 > >> > >> I recently sent a PR -- already applied -- to Neal to remove this line: > >> > >> https://github.com/google/packetdrill/pull/86 > >> > >> I thought it was the intension of Kuniyuki's patch not to send this ACK > >> in this case to follow the RFC 9293's recommendation. This TFO test > >> looks a bit similar to the example from Kuniyuki's patch: > >> > >> > >> --------------- 8< --------------- > >> 0 socket(..., SOCK_STREAM|SOCK_NONBLOCK, IPPROTO_TCP) = 3 > >> +0 connect(3, ..., ...) = -1 EINPROGRESS (Operation now in progress) > >> > >> +0 > S 0:0(0) <mss 1460,sackOK,TS val 1000 ecr 0,nop,wscale 8> > >> +0 < S 0:0(0) win 1000 <mss 1000> > >> +0 > S. 0:0(0) ack 1 <mss 1460,sackOK,TS val 3308134035 ecr 0,nop,wscale 8> > >> +0 < S. 0:0(0) ack 1 win 1000 > >> > >> /* No ACK here */ > >> > >> +0 write(3, ..., 100) = 100 > >> +0 > P. 1:101(100) ack 1 > >> --------------- 8< --------------- > >> > >> > >> > >> But maybe here that should be different for TFO? > >> > >> For my case with MPTCP (and TFO), it is fine to drop this 'goto consume' > >> but I don't know how "strict" we want to be regarding the RFC and this > >> marginal case. > > > > Problem of this 'goto consume' is that we are not properly sending a > > DUPACK in this case. > > > > +.01 socket(..., SOCK_STREAM, IPPROTO_TCP) = 4 > > +0 fcntl(4, F_SETFL, O_RDWR|O_NONBLOCK) = 0 > > +0 sendto(4, ..., 1000, MSG_FASTOPEN, ..., ...) = 1000 > > +0 > S 0:1000(1000) <mss 1460,nop,nop,sackOK,nop,wscale 8,FO > > abcd1234,nop,nop> > > // Simul. SYN-data crossing: we don't support that yet so ack only remote ISN > > +.005 < S 1234:1734(500) win 14600 <mss 1040,nop,nop,sackOK,nop,wscale > > 6,FO 87654321,nop,nop> > > +0 > S. 0:0(0) ack 1235 <mss 1460,nop,nop,sackOK,nop,wscale 8> > > > > +.045 < S. 1234:1234(0) ack 1001 win 14600 <mss > > 940,nop,nop,sackOK,nop,wscale 6,FO 12345678,nop,nop> > > +0 > . 1001:1001(0) ack 1 <nop,nop,sack 0:1> // See here > > I'm sorry, but is it normal to have 'ack 1' with 'sack 0:1' here? It is normal, because the SYN was already received/processed. sack 0:1 represents this SYN sequence.
Hi Eric, On 23/07/2024 18:42, Eric Dumazet wrote: > On Tue, Jul 23, 2024 at 6:08 PM Matthieu Baerts <matttbe@kernel.org> wrote: >> >> Hi Eric, >> >> On 23/07/2024 17:38, Eric Dumazet wrote: >>> On Tue, Jul 23, 2024 at 4:58 PM Matthieu Baerts <matttbe@kernel.org> wrote: >>>> >>>> Hi Eric, >>>> >>>> +cc Neal >>>> -cc Jerry (NoSuchUser) >>>> >>>> On 23/07/2024 16:37, Eric Dumazet wrote: >>>>> On Thu, Jul 18, 2024 at 12:34 PM Matthieu Baerts (NGI0) >>>>> <matttbe@kernel.org> wrote: >>>>>> >>>>>> The 'Fixes' commit recently changed the behaviour of TCP by skipping the >>>>>> processing of the 3rd ACK when a sk->sk_socket is set. The goal was to >>>>>> skip tcp_ack_snd_check() in tcp_rcv_state_process() not to send an >>>>>> unnecessary ACK in case of simultaneous connect(). Unfortunately, that >>>>>> had an impact on TFO and MPTCP. >>>>>> >>>>>> I started to look at the impact on MPTCP, because the MPTCP CI found >>>>>> some issues with the MPTCP Packetdrill tests [1]. Then Paolo suggested >>>>>> me to look at the impact on TFO with "plain" TCP. >>>>>> >>>>>> For MPTCP, when receiving the 3rd ACK of a request adding a new path >>>>>> (MP_JOIN), sk->sk_socket will be set, and point to the MPTCP sock that >>>>>> has been created when the MPTCP connection got established before with >>>>>> the first path. The newly added 'goto' will then skip the processing of >>>>>> the segment text (step 7) and not go through tcp_data_queue() where the >>>>>> MPTCP options are validated, and some actions are triggered, e.g. >>>>>> sending the MPJ 4th ACK [2] as demonstrated by the new errors when >>>>>> running a packetdrill test [3] establishing a second subflow. >>>>>> >>>>>> This doesn't fully break MPTCP, mainly the 4th MPJ ACK that will be >>>>>> delayed. Still, we don't want to have this behaviour as it delays the >>>>>> switch to the fully established mode, and invalid MPTCP options in this >>>>>> 3rd ACK will not be caught any more. This modification also affects the >>>>>> MPTCP + TFO feature as well, and being the reason why the selftests >>>>>> started to be unstable the last few days [4]. >>>>>> >>>>>> For TFO, the existing 'basic-cookie-not-reqd' test [5] was no longer >>>>>> passing: if the 3rd ACK contains data, and the connection is accept()ed >>>>>> before receiving them, these data would no longer be processed, and thus >>>>>> not ACKed. >>>>>> >>>>>> One last thing about MPTCP, in case of simultaneous connect(), a >>>>>> fallback to TCP will be done, which seems fine: >>>>>> >>>>>> `../common/defaults.sh` >>>>>> >>>>>> 0 socket(..., SOCK_STREAM|SOCK_NONBLOCK, IPPROTO_MPTCP) = 3 >>>>>> +0 connect(3, ..., ...) = -1 EINPROGRESS (Operation now in progress) >>>>>> >>>>>> +0 > S 0:0(0) <mss 1460, sackOK, TS val 100 ecr 0, nop, wscale 8, mpcapable v1 flags[flag_h] nokey> >>>>>> +0 < S 0:0(0) win 1000 <mss 1460, sackOK, TS val 407 ecr 0, nop, wscale 8, mpcapable v1 flags[flag_h] nokey> >>>>>> +0 > S. 0:0(0) ack 1 <mss 1460, sackOK, TS val 330 ecr 0, nop, wscale 8, mpcapable v1 flags[flag_h] nokey> >>>>>> +0 < S. 0:0(0) ack 1 win 65535 <mss 1460, sackOK, TS val 700 ecr 100, nop, wscale 8, mpcapable v1 flags[flag_h] key[skey=2]> >>>>>> >>>>>> +0 write(3, ..., 100) = 100 >>>>>> +0 > . 1:1(0) ack 1 <nop, nop, TS val 845707014 ecr 700, nop, nop, sack 0:1> >>>>>> +0 > P. 1:101(100) ack 1 <nop, nop, TS val 845958933 ecr 700> >>>>>> >>>>>> Simultaneous SYN-data crossing is also not supported by TFO, see [6]. >>>>>> >>>>>> Link: https://github.com/multipath-tcp/mptcp_net-next/actions/runs/9936227696 [1] >>>>>> Link: https://datatracker.ietf.org/doc/html/rfc8684#fig_tokens [2] >>>>>> Link: https://github.com/multipath-tcp/packetdrill/blob/mptcp-net-next/gtests/net/mptcp/syscalls/accept.pkt#L28 [3] >>>>>> Link: https://netdev.bots.linux.dev/contest.html?executor=vmksft-mptcp-dbg&test=mptcp-connect-sh [4] >>>>>> Link: https://github.com/google/packetdrill/blob/master/gtests/net/tcp/fastopen/server/basic-cookie-not-reqd.pkt#L21 [5] >>>>>> Link: https://github.com/google/packetdrill/blob/master/gtests/net/tcp/fastopen/client/simultaneous-fast-open.pkt [6] >>>>>> Fixes: 23e89e8ee7be ("tcp: Don't drop SYN+ACK for simultaneous connect().") >>>>>> Suggested-by: Paolo Abeni <pabeni@redhat.com> >>>>>> Suggested-by: Kuniyuki Iwashima <kuniyu@amazon.com> >>>>>> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> >>>>>> --- >>>>>> Notes: >>>>>> - We could also drop this 'goto consume', and send the unnecessary ACK >>>>>> in this simultaneous connect case, which doesn't seem to be a "real" >>>>>> case, more something for fuzzers. But that's not what the RFC 9293 >>>>>> recommends to do. >>>>>> - v2: >>>>>> - Check if the SYN bit is set instead of looking for TFO and MPTCP >>>>>> specific attributes, as suggested by Kuniyuki. >>>>>> - Updated the comment above >>>>>> - Please note that the v2 has been sent mainly to satisfy the CI (to >>>>>> be able to catch new bugs with MPTCP), and because the suggestion >>>>>> from Kuniyuki looks better. It has not been sent to urge TCP >>>>>> maintainers to review it quicker than it should, please take your >>>>>> time and enjoy netdev.conf :) >>>>>> --- >>>>>> net/ipv4/tcp_input.c | 7 ++++++- >>>>>> 1 file changed, 6 insertions(+), 1 deletion(-) >>>>>> >>>>>> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c >>>>>> index ff9ab3d01ced..bfe1bc69dc3e 100644 >>>>>> --- a/net/ipv4/tcp_input.c >>>>>> +++ b/net/ipv4/tcp_input.c >>>>>> @@ -6820,7 +6820,12 @@ tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb) >>>>>> if (sk->sk_shutdown & SEND_SHUTDOWN) >>>>>> tcp_shutdown(sk, SEND_SHUTDOWN); >>>>>> >>>>>> - if (sk->sk_socket) >>>>>> + /* For crossed SYN cases, not to send an unnecessary ACK. >>>>>> + * Note that sk->sk_socket can be assigned in other cases, e.g. >>>>>> + * with TFO (if accept()'ed before the 3rd ACK) and MPTCP (MPJ: >>>>>> + * sk_socket is the parent MPTCP sock). >>>>>> + */ >>>>>> + if (sk->sk_socket && th->syn) >>>>>> goto consume; >>>>> >>>>> I think we should simply remove this part completely, because we >>>>> should send an ack anyway. >>>> >>>> Thank you for having looked, and ran the full packetdrill test suite! >>>> >>>>> >>>>> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c >>>>> index ff9ab3d01ced89570903d3a9f649a637c5e07a90..91357d4713182078debd746a224046cba80ea3ce >>>>> 100644 >>>>> --- a/net/ipv4/tcp_input.c >>>>> +++ b/net/ipv4/tcp_input.c >>>>> @@ -6820,8 +6820,6 @@ tcp_rcv_state_process(struct sock *sk, struct >>>>> sk_buff *skb) >>>>> if (sk->sk_shutdown & SEND_SHUTDOWN) >>>>> tcp_shutdown(sk, SEND_SHUTDOWN); >>>>> >>>>> - if (sk->sk_socket) >>>>> - goto consume; >>>>> break; >>>>> >>>>> case TCP_FIN_WAIT1: { >>>>> >>>>> >>>>> I have a failing packetdrill test after Kuniyuki patch : >>>>> >>>>> >>>>> >>>>> // >>>>> // Test the simultaneous open scenario that both end sends >>>>> // SYN/data. Although we don't support that the connection should >>>>> // still be established. >>>>> // >>>>> `../../common/defaults.sh >>>>> ../../common/set_sysctls.py /proc/sys/net/ipv4/tcp_timestamps=0` >>>>> >>>>> // Cache warmup: send a Fast Open cookie request >>>>> 0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3 >>>>> +0 fcntl(3, F_SETFL, O_RDWR|O_NONBLOCK) = 0 >>>>> +0 sendto(3, ..., 0, MSG_FASTOPEN, ..., ...) = -1 EINPROGRESS >>>>> (Operation is now in progress) >>>>> +0 > S 0:0(0) <mss 1460,nop,nop,sackOK,nop,wscale 8,FO,nop,nop> >>>>> +.01 < S. 123:123(0) ack 1 win 14600 <mss >>>>> 1460,nop,nop,sackOK,nop,wscale 6,FO abcd1234,nop,nop> >>>>> +0 > . 1:1(0) ack 1 >>>>> +.01 close(3) = 0 >>>>> +0 > F. 1:1(0) ack 1 >>>>> +.01 < F. 1:1(0) ack 2 win 92 >>>>> +0 > . 2:2(0) ack 2 >>>>> >>>>> >>>>> // >>>>> // Test: simulatenous fast open >>>>> // >>>>> +.01 socket(..., SOCK_STREAM, IPPROTO_TCP) = 4 >>>>> +0 fcntl(4, F_SETFL, O_RDWR|O_NONBLOCK) = 0 >>>>> +0 sendto(4, ..., 1000, MSG_FASTOPEN, ..., ...) = 1000 >>>>> +0 > S 0:1000(1000) <mss 1460,nop,nop,sackOK,nop,wscale 8,FO >>>>> abcd1234,nop,nop> >>>>> // Simul. SYN-data crossing: we don't support that yet so ack only remote ISN >>>>> +.005 < S 1234:1734(500) win 14600 <mss 1040,nop,nop,sackOK,nop,wscale >>>>> 6,FO 87654321,nop,nop> >>>>> +0 > S. 0:0(0) ack 1235 <mss 1460,nop,nop,sackOK,nop,wscale 8> >>>>> >>>>> // SYN data is never retried. >>>>> +.045 < S. 1234:1234(0) ack 1001 win 14600 <mss >>>>> 940,nop,nop,sackOK,nop,wscale 6,FO 12345678,nop,nop> >>>>> +0 > . 1001:1001(0) ack 1 >>>> >>>> I recently sent a PR -- already applied -- to Neal to remove this line: >>>> >>>> https://github.com/google/packetdrill/pull/86 >>>> >>>> I thought it was the intension of Kuniyuki's patch not to send this ACK >>>> in this case to follow the RFC 9293's recommendation. This TFO test >>>> looks a bit similar to the example from Kuniyuki's patch: >>>> >>>> >>>> --------------- 8< --------------- >>>> 0 socket(..., SOCK_STREAM|SOCK_NONBLOCK, IPPROTO_TCP) = 3 >>>> +0 connect(3, ..., ...) = -1 EINPROGRESS (Operation now in progress) >>>> >>>> +0 > S 0:0(0) <mss 1460,sackOK,TS val 1000 ecr 0,nop,wscale 8> >>>> +0 < S 0:0(0) win 1000 <mss 1000> >>>> +0 > S. 0:0(0) ack 1 <mss 1460,sackOK,TS val 3308134035 ecr 0,nop,wscale 8> >>>> +0 < S. 0:0(0) ack 1 win 1000 >>>> >>>> /* No ACK here */ >>>> >>>> +0 write(3, ..., 100) = 100 >>>> +0 > P. 1:101(100) ack 1 >>>> --------------- 8< --------------- >>>> >>>> >>>> >>>> But maybe here that should be different for TFO? >>>> >>>> For my case with MPTCP (and TFO), it is fine to drop this 'goto consume' >>>> but I don't know how "strict" we want to be regarding the RFC and this >>>> marginal case. >>> >>> Problem of this 'goto consume' is that we are not properly sending a >>> DUPACK in this case. >>> >>> +.01 socket(..., SOCK_STREAM, IPPROTO_TCP) = 4 >>> +0 fcntl(4, F_SETFL, O_RDWR|O_NONBLOCK) = 0 >>> +0 sendto(4, ..., 1000, MSG_FASTOPEN, ..., ...) = 1000 >>> +0 > S 0:1000(1000) <mss 1460,nop,nop,sackOK,nop,wscale 8,FO >>> abcd1234,nop,nop> >>> // Simul. SYN-data crossing: we don't support that yet so ack only remote ISN >>> +.005 < S 1234:1734(500) win 14600 <mss 1040,nop,nop,sackOK,nop,wscale >>> 6,FO 87654321,nop,nop> >>> +0 > S. 0:0(0) ack 1235 <mss 1460,nop,nop,sackOK,nop,wscale 8> >>> >>> +.045 < S. 1234:1234(0) ack 1001 win 14600 <mss >>> 940,nop,nop,sackOK,nop,wscale 6,FO 12345678,nop,nop> >>> +0 > . 1001:1001(0) ack 1 <nop,nop,sack 0:1> // See here >> >> I'm sorry, but is it normal to have 'ack 1' with 'sack 0:1' here? > > It is normal, because the SYN was already received/processed. > > sack 0:1 represents this SYN sequence. Thank you for your reply! Maybe it is just me, but does it not look strange to have the SACK covering a segment (0:1) that is before the ACK (1)? 'ack 1' and 'sack 0:1' seem to cover the same block, no? Before Kuniyuki's patch, this 'sack 0:1' was not present. Cheers, Matt
From: Eric Dumazet <edumazet@google.com> Date: Tue, 23 Jul 2024 17:38:27 +0200 [...] > > > // > > > // Test the simultaneous open scenario that both end sends > > > // SYN/data. Although we don't support that the connection should > > > // still be established. > > > // > > > `../../common/defaults.sh > > > ../../common/set_sysctls.py /proc/sys/net/ipv4/tcp_timestamps=0` > > > > > > // Cache warmup: send a Fast Open cookie request > > > 0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3 > > > +0 fcntl(3, F_SETFL, O_RDWR|O_NONBLOCK) = 0 > > > +0 sendto(3, ..., 0, MSG_FASTOPEN, ..., ...) = -1 EINPROGRESS > > > (Operation is now in progress) > > > +0 > S 0:0(0) <mss 1460,nop,nop,sackOK,nop,wscale 8,FO,nop,nop> > > > +.01 < S. 123:123(0) ack 1 win 14600 <mss > > > 1460,nop,nop,sackOK,nop,wscale 6,FO abcd1234,nop,nop> > > > +0 > . 1:1(0) ack 1 > > > +.01 close(3) = 0 > > > +0 > F. 1:1(0) ack 1 > > > +.01 < F. 1:1(0) ack 2 win 92 > > > +0 > . 2:2(0) ack 2 > > > > > > > > > // > > > // Test: simulatenous fast open > > > // > > > +.01 socket(..., SOCK_STREAM, IPPROTO_TCP) = 4 > > > +0 fcntl(4, F_SETFL, O_RDWR|O_NONBLOCK) = 0 > > > +0 sendto(4, ..., 1000, MSG_FASTOPEN, ..., ...) = 1000 > > > +0 > S 0:1000(1000) <mss 1460,nop,nop,sackOK,nop,wscale 8,FO > > > abcd1234,nop,nop> > > > // Simul. SYN-data crossing: we don't support that yet so ack only remote ISN > > > +.005 < S 1234:1734(500) win 14600 <mss 1040,nop,nop,sackOK,nop,wscale > > > 6,FO 87654321,nop,nop> > > > +0 > S. 0:0(0) ack 1235 <mss 1460,nop,nop,sackOK,nop,wscale 8> > > > > > > // SYN data is never retried. > > > +.045 < S. 1234:1234(0) ack 1001 win 14600 <mss > > > 940,nop,nop,sackOK,nop,wscale 6,FO 12345678,nop,nop> > > > +0 > . 1001:1001(0) ack 1 > > > > I recently sent a PR -- already applied -- to Neal to remove this line: > > > > https://github.com/google/packetdrill/pull/86 > > > > I thought it was the intension of Kuniyuki's patch not to send this ACK > > in this case to follow the RFC 9293's recommendation. This TFO test > > looks a bit similar to the example from Kuniyuki's patch: > > > > > > --------------- 8< --------------- > > 0 socket(..., SOCK_STREAM|SOCK_NONBLOCK, IPPROTO_TCP) = 3 > > +0 connect(3, ..., ...) = -1 EINPROGRESS (Operation now in progress) > > > > +0 > S 0:0(0) <mss 1460,sackOK,TS val 1000 ecr 0,nop,wscale 8> > > +0 < S 0:0(0) win 1000 <mss 1000> > > +0 > S. 0:0(0) ack 1 <mss 1460,sackOK,TS val 3308134035 ecr 0,nop,wscale 8> > > +0 < S. 0:0(0) ack 1 win 1000 > > > > /* No ACK here */ > > > > +0 write(3, ..., 100) = 100 > > +0 > P. 1:101(100) ack 1 > > --------------- 8< --------------- > > > > > > > > But maybe here that should be different for TFO? > > > > For my case with MPTCP (and TFO), it is fine to drop this 'goto consume' > > but I don't know how "strict" we want to be regarding the RFC and this > > marginal case. > > Problem of this 'goto consume' is that we are not properly sending a > DUPACK in this case. > > +.01 socket(..., SOCK_STREAM, IPPROTO_TCP) = 4 > +0 fcntl(4, F_SETFL, O_RDWR|O_NONBLOCK) = 0 > +0 sendto(4, ..., 1000, MSG_FASTOPEN, ..., ...) = 1000 > +0 > S 0:1000(1000) <mss 1460,nop,nop,sackOK,nop,wscale 8,FO > abcd1234,nop,nop> > // Simul. SYN-data crossing: we don't support that yet so ack only remote ISN > +.005 < S 1234:1734(500) win 14600 <mss 1040,nop,nop,sackOK,nop,wscale > 6,FO 87654321,nop,nop> > +0 > S. 0:0(0) ack 1235 <mss 1460,nop,nop,sackOK,nop,wscale 8> > > +.045 < S. 1234:1234(0) ack 1001 win 14600 <mss > 940,nop,nop,sackOK,nop,wscale 6,FO 12345678,nop,nop> > +0 > . 1001:1001(0) ack 1 <nop,nop,sack 0:1> // See here > > Not sending a dupack seems wrong and could hurt. I think the situation where we should send ACK after simultaneous SYN+ACK would be: ---8<--- +.01 socket(..., SOCK_STREAM, IPPROTO_TCP) = 4 +0 fcntl(4, F_SETFL, O_RDWR|O_NONBLOCK) = 0 +0 sendto(4, ..., 1000, MSG_FASTOPEN, ..., ...) = 1000 +0 > S 0:1000(1000) <mss 1440,nop,nop,sackOK,nop,wscale 8,FO abcd1234,nop,nop> // Simul. SYN-data crossing: we don't support that yet so ack only remote ISN +.005 < S 1234:1734(500) win 14400 <mss 1040,nop,nop,sackOK,nop,wscale 6,FO 87654321,nop,nop> +0 > S. 0:0(0) ack 1235 <mss 1440,nop,nop,sackOK,nop,wscale 8> // SYN data is not ACKed too. +.045 < S. 1234:1234(0) ack 1 win 14400 <mss 940,nop,nop,sackOK,nop,wscale 6,FO 12345678,nop,nop> +0 > . 1:1001(1000) ack 1 ---8<--- When the first data in SYN is not ACKed, it must be retransmitted, and it can be done just after SYN+ACK is received, which is skipped by 'goto consume'. Retransmitting data in SYN is not supported though. ---8<--- sendto syscall: 1721769223.194675 outbound sniffed packet: 0.040288 S 833802090:833803090(1000) win 65535 <mss 1440,nop,nop,sackOK,nop,wscale 8,FO abcd1234,nop,nop> inbound injected packet: 0.045323 S 1234:1734(500) win 14400 <mss 1040,nop,nop,sackOK,nop,wscale 6,FO 87654321,nop,nop> outbound sniffed packet: 0.045355 S. 833802090:833802090(0) ack 1235 win 65535 <mss 1440,nop,nop,sackOK,nop,wscale 8> inbound injected packet: 0.090429 S. 1234:1234(0) ack 833802091 win 14400 <mss 940,nop,nop,sackOK,nop,wscale 6,FO 12345678,nop,nop> outbound sniffed packet: 0.090460 . 833803091:833803091(0) ack 1235 win 1052 outbound sniffed packet: 1.051776 S. 833802090:833802090(0) ack 1235 win 65535 <mss 1440,nop,nop,sackOK,nop,wscale 8> simul2.pkt:38: error handling packet: live packet field tcp_data_offset: expected: 5 (0x5) vs actual: 8 (0x8) script packet: 0.090462 . 1:1001(1000) ack 1 actual #0 packet: 0.090460 . 1001:1001(0) ack 1 win 1052 actual #1 packet: 1.051776 S. 0:0(0) ack 1235 win 65535 <mss 1440,nop,nop,sackOK,nop,wscale 8> ---8<--- Except the case, I think the dupack is not needed in theory. But I understand the dupack could help the other quickly retransmit the not-yet-ACKed data in SYN instead of waiting for a timer as expected in the comment. ---8<--- // The other end retries +.1 < P. 1:501(500) ack 1000 win 257 ---8<---
From: Matthieu Baerts <matttbe@kernel.org> Date: Tue, 23 Jul 2024 21:09:40 +0200 [...] > >>> Problem of this 'goto consume' is that we are not properly sending a > >>> DUPACK in this case. > >>> > >>> +.01 socket(..., SOCK_STREAM, IPPROTO_TCP) = 4 > >>> +0 fcntl(4, F_SETFL, O_RDWR|O_NONBLOCK) = 0 > >>> +0 sendto(4, ..., 1000, MSG_FASTOPEN, ..., ...) = 1000 > >>> +0 > S 0:1000(1000) <mss 1460,nop,nop,sackOK,nop,wscale 8,FO > >>> abcd1234,nop,nop> > >>> // Simul. SYN-data crossing: we don't support that yet so ack only remote ISN > >>> +.005 < S 1234:1734(500) win 14600 <mss 1040,nop,nop,sackOK,nop,wscale > >>> 6,FO 87654321,nop,nop> > >>> +0 > S. 0:0(0) ack 1235 <mss 1460,nop,nop,sackOK,nop,wscale 8> > >>> > >>> +.045 < S. 1234:1234(0) ack 1001 win 14600 <mss > >>> 940,nop,nop,sackOK,nop,wscale 6,FO 12345678,nop,nop> > >>> +0 > . 1001:1001(0) ack 1 <nop,nop,sack 0:1> // See here > >> > >> I'm sorry, but is it normal to have 'ack 1' with 'sack 0:1' here? > > > > It is normal, because the SYN was already received/processed. > > > > sack 0:1 represents this SYN sequence. > > Thank you for your reply! > > Maybe it is just me, but does it not look strange to have the SACK > covering a segment (0:1) that is before the ACK (1)? > > 'ack 1' and 'sack 0:1' seem to cover the same block, no? > Before Kuniyuki's patch, this 'sack 0:1' was not present. This looks a bit strange to me too, but I guess this is also not forbidden ?
On Tue, Jul 23, 2024 at 3:09 PM Matthieu Baerts <matttbe@kernel.org> wrote: > > Hi Eric, > > On 23/07/2024 18:42, Eric Dumazet wrote: > > On Tue, Jul 23, 2024 at 6:08 PM Matthieu Baerts <matttbe@kernel.org> wrote: > >> > >> Hi Eric, > >> > >> On 23/07/2024 17:38, Eric Dumazet wrote: > >>> On Tue, Jul 23, 2024 at 4:58 PM Matthieu Baerts <matttbe@kernel.org> wrote: > >>>> > >>>> Hi Eric, > >>>> > >>>> +cc Neal > >>>> -cc Jerry (NoSuchUser) > >>>> > >>>> On 23/07/2024 16:37, Eric Dumazet wrote: > >>>>> On Thu, Jul 18, 2024 at 12:34 PM Matthieu Baerts (NGI0) > >>>>> <matttbe@kernel.org> wrote: > >>>>>> ... > >>> +.045 < S. 1234:1234(0) ack 1001 win 14600 <mss > >>> 940,nop,nop,sackOK,nop,wscale 6,FO 12345678,nop,nop> > >>> +0 > . 1001:1001(0) ack 1 <nop,nop,sack 0:1> // See here > >> > >> I'm sorry, but is it normal to have 'ack 1' with 'sack 0:1' here? > > > > It is normal, because the SYN was already received/processed. > > > > sack 0:1 represents this SYN sequence. > > Thank you for your reply! > > Maybe it is just me, but does it not look strange to have the SACK > covering a segment (0:1) that is before the ACK (1)? > > 'ack 1' and 'sack 0:1' seem to cover the same block, no? > Before Kuniyuki's patch, this 'sack 0:1' was not present. A SACK that covers a sequence range that was already cumulatively or selectively acknowledged is legal and useful, and is called a Duplicate Selective Acknowledgement (DSACK or D-SACK). A DSACK indicates that a receiver received duplicate data. That can be very useful in allowing a data sender to determine that a retransmission was not needed (spurious). If a data sender receives DSACKs for all retransmitted data in a loss detection episode then it knows all retransmissions were spurious, and thus it can "undo" its congestion control reaction to that estimated loss, since the congestion control algorithm was responding to an incorrect loss signal. This can be very helpful for performance in the presence of varying delays or reordering, which can cause spurious loss detection episodes.. See: https://datatracker.ietf.org/doc/html/rfc2883 An Extension to the Selective Acknowledgement (SACK) Option for TCP https://www.rfc-editor.org/rfc/rfc3708.html "Using TCP Duplicate Selective Acknowledgement (DSACKs) and Stream Control Transmission Protocol (SCTP) Duplicate Transmission Sequence Numbers (TSNs) to Detect Spurious Retransmissions" neal
Hi Neal, On 24/07/2024 00:01, Neal Cardwell wrote: > On Tue, Jul 23, 2024 at 3:09 PM Matthieu Baerts <matttbe@kernel.org> wrote: >> >> Hi Eric, >> >> On 23/07/2024 18:42, Eric Dumazet wrote: >>> On Tue, Jul 23, 2024 at 6:08 PM Matthieu Baerts <matttbe@kernel.org> wrote: >>>> >>>> Hi Eric, >>>> >>>> On 23/07/2024 17:38, Eric Dumazet wrote: >>>>> On Tue, Jul 23, 2024 at 4:58 PM Matthieu Baerts <matttbe@kernel.org> wrote: >>>>>> >>>>>> Hi Eric, >>>>>> >>>>>> +cc Neal >>>>>> -cc Jerry (NoSuchUser) >>>>>> >>>>>> On 23/07/2024 16:37, Eric Dumazet wrote: >>>>>>> On Thu, Jul 18, 2024 at 12:34 PM Matthieu Baerts (NGI0) >>>>>>> <matttbe@kernel.org> wrote: >>>>>>>> > ... >>>>> +.045 < S. 1234:1234(0) ack 1001 win 14600 <mss >>>>> 940,nop,nop,sackOK,nop,wscale 6,FO 12345678,nop,nop> >>>>> +0 > . 1001:1001(0) ack 1 <nop,nop,sack 0:1> // See here >>>> >>>> I'm sorry, but is it normal to have 'ack 1' with 'sack 0:1' here? >>> >>> It is normal, because the SYN was already received/processed. >>> >>> sack 0:1 represents this SYN sequence. >> >> Thank you for your reply! >> >> Maybe it is just me, but does it not look strange to have the SACK >> covering a segment (0:1) that is before the ACK (1)? >> >> 'ack 1' and 'sack 0:1' seem to cover the same block, no? >> Before Kuniyuki's patch, this 'sack 0:1' was not present. > > A SACK that covers a sequence range that was already cumulatively or > selectively acknowledged is legal and useful, and is called a > Duplicate Selective Acknowledgement (DSACK or D-SACK). > > A DSACK indicates that a receiver received duplicate data. That can be > very useful in allowing a data sender to determine that a > retransmission was not needed (spurious). If a data sender receives > DSACKs for all retransmitted data in a loss detection episode then it > knows all retransmissions were spurious, and thus it can "undo" its > congestion control reaction to that estimated loss, since the > congestion control algorithm was responding to an incorrect loss > signal. This can be very helpful for performance in the presence of > varying delays or reordering, which can cause spurious loss detection > episodes.. > > See: > > https://datatracker.ietf.org/doc/html/rfc2883 > An Extension to the Selective Acknowledgement (SACK) Option for TCP > > https://www.rfc-editor.org/rfc/rfc3708.html > "Using TCP Duplicate Selective Acknowledgement (DSACKs) and Stream > Control Transmission Protocol (SCTP) Duplicate Transmission Sequence > Numbers (TSNs) to Detect Spurious Retransmissions" Thank you for the great explanations! I was not expecting this in a 3rd ACK :) I don't know if it will help the congestion control algorithm in this case, but I guess we don't need to worry about this marginal case. I will then send the v3, and the Packetdrill modification. Cheers, Matt
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c index ff9ab3d01ced..bfe1bc69dc3e 100644 --- a/net/ipv4/tcp_input.c +++ b/net/ipv4/tcp_input.c @@ -6820,7 +6820,12 @@ tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb) if (sk->sk_shutdown & SEND_SHUTDOWN) tcp_shutdown(sk, SEND_SHUTDOWN); - if (sk->sk_socket) + /* For crossed SYN cases, not to send an unnecessary ACK. + * Note that sk->sk_socket can be assigned in other cases, e.g. + * with TFO (if accept()'ed before the 3rd ACK) and MPTCP (MPJ: + * sk_socket is the parent MPTCP sock). + */ + if (sk->sk_socket && th->syn) goto consume; break;
The 'Fixes' commit recently changed the behaviour of TCP by skipping the processing of the 3rd ACK when a sk->sk_socket is set. The goal was to skip tcp_ack_snd_check() in tcp_rcv_state_process() not to send an unnecessary ACK in case of simultaneous connect(). Unfortunately, that had an impact on TFO and MPTCP. I started to look at the impact on MPTCP, because the MPTCP CI found some issues with the MPTCP Packetdrill tests [1]. Then Paolo suggested me to look at the impact on TFO with "plain" TCP. For MPTCP, when receiving the 3rd ACK of a request adding a new path (MP_JOIN), sk->sk_socket will be set, and point to the MPTCP sock that has been created when the MPTCP connection got established before with the first path. The newly added 'goto' will then skip the processing of the segment text (step 7) and not go through tcp_data_queue() where the MPTCP options are validated, and some actions are triggered, e.g. sending the MPJ 4th ACK [2] as demonstrated by the new errors when running a packetdrill test [3] establishing a second subflow. This doesn't fully break MPTCP, mainly the 4th MPJ ACK that will be delayed. Still, we don't want to have this behaviour as it delays the switch to the fully established mode, and invalid MPTCP options in this 3rd ACK will not be caught any more. This modification also affects the MPTCP + TFO feature as well, and being the reason why the selftests started to be unstable the last few days [4]. For TFO, the existing 'basic-cookie-not-reqd' test [5] was no longer passing: if the 3rd ACK contains data, and the connection is accept()ed before receiving them, these data would no longer be processed, and thus not ACKed. One last thing about MPTCP, in case of simultaneous connect(), a fallback to TCP will be done, which seems fine: `../common/defaults.sh` 0 socket(..., SOCK_STREAM|SOCK_NONBLOCK, IPPROTO_MPTCP) = 3 +0 connect(3, ..., ...) = -1 EINPROGRESS (Operation now in progress) +0 > S 0:0(0) <mss 1460, sackOK, TS val 100 ecr 0, nop, wscale 8, mpcapable v1 flags[flag_h] nokey> +0 < S 0:0(0) win 1000 <mss 1460, sackOK, TS val 407 ecr 0, nop, wscale 8, mpcapable v1 flags[flag_h] nokey> +0 > S. 0:0(0) ack 1 <mss 1460, sackOK, TS val 330 ecr 0, nop, wscale 8, mpcapable v1 flags[flag_h] nokey> +0 < S. 0:0(0) ack 1 win 65535 <mss 1460, sackOK, TS val 700 ecr 100, nop, wscale 8, mpcapable v1 flags[flag_h] key[skey=2]> +0 write(3, ..., 100) = 100 +0 > . 1:1(0) ack 1 <nop, nop, TS val 845707014 ecr 700, nop, nop, sack 0:1> +0 > P. 1:101(100) ack 1 <nop, nop, TS val 845958933 ecr 700> Simultaneous SYN-data crossing is also not supported by TFO, see [6]. Link: https://github.com/multipath-tcp/mptcp_net-next/actions/runs/9936227696 [1] Link: https://datatracker.ietf.org/doc/html/rfc8684#fig_tokens [2] Link: https://github.com/multipath-tcp/packetdrill/blob/mptcp-net-next/gtests/net/mptcp/syscalls/accept.pkt#L28 [3] Link: https://netdev.bots.linux.dev/contest.html?executor=vmksft-mptcp-dbg&test=mptcp-connect-sh [4] Link: https://github.com/google/packetdrill/blob/master/gtests/net/tcp/fastopen/server/basic-cookie-not-reqd.pkt#L21 [5] Link: https://github.com/google/packetdrill/blob/master/gtests/net/tcp/fastopen/client/simultaneous-fast-open.pkt [6] Fixes: 23e89e8ee7be ("tcp: Don't drop SYN+ACK for simultaneous connect().") Suggested-by: Paolo Abeni <pabeni@redhat.com> Suggested-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> --- Notes: - We could also drop this 'goto consume', and send the unnecessary ACK in this simultaneous connect case, which doesn't seem to be a "real" case, more something for fuzzers. But that's not what the RFC 9293 recommends to do. - v2: - Check if the SYN bit is set instead of looking for TFO and MPTCP specific attributes, as suggested by Kuniyuki. - Updated the comment above - Please note that the v2 has been sent mainly to satisfy the CI (to be able to catch new bugs with MPTCP), and because the suggestion from Kuniyuki looks better. It has not been sent to urge TCP maintainers to review it quicker than it should, please take your time and enjoy netdev.conf :) --- net/ipv4/tcp_input.c | 7 ++++++- 1 file changed, 6 insertions(+), 1 deletion(-)