diff mbox series

[net,v2,1/2] tcp: process the 3rd ACK with sk_socket for TFO/MPTCP

Message ID 20240718-upstream-net-next-20240716-tcp-3rd-ack-consume-sk_socket-v2-1-d653f85639f6@kernel.org (mailing list archive)
State Changes Requested
Delegated to: Netdev Maintainers
Headers show
Series tcp: restrict crossed SYN specific actions to SYN-ACK | expand

Checks

Context Check Description
netdev/series_format success Posting correctly formatted
netdev/tree_selection success Clearly marked for net
netdev/ynl success Generated files up to date; no warnings/errors; no diff in generated;
netdev/fixes_present success Fixes tag present in non-next series
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 661 this patch: 661
netdev/build_tools success No tools touched, skip
netdev/cc_maintainers success CCed 6 of 6 maintainers
netdev/build_clang success Errors and warnings before: 662 this patch: 662
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/deprecated_api success None detected
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes success Fixes tag looks correct
netdev/build_allmodconfig_warn success Errors and warnings before: 663 this patch: 663
netdev/checkpatch success total: 0 errors, 0 warnings, 0 checks, 13 lines checked
netdev/build_clang_rust success No Rust files in patch. Skipping build
netdev/kdoc success Errors and warnings before: 1 this patch: 1
netdev/source_inline success Was 0 now: 0
netdev/contest success net-next-2024-07-19--12-00 (tests: 699)

Commit Message

Matthieu Baerts July 18, 2024, 10:33 a.m. UTC
The 'Fixes' commit recently changed the behaviour of TCP by skipping the
processing of the 3rd ACK when a sk->sk_socket is set. The goal was to
skip tcp_ack_snd_check() in tcp_rcv_state_process() not to send an
unnecessary ACK in case of simultaneous connect(). Unfortunately, that
had an impact on TFO and MPTCP.

I started to look at the impact on MPTCP, because the MPTCP CI found
some issues with the MPTCP Packetdrill tests [1]. Then Paolo suggested
me to look at the impact on TFO with "plain" TCP.

For MPTCP, when receiving the 3rd ACK of a request adding a new path
(MP_JOIN), sk->sk_socket will be set, and point to the MPTCP sock that
has been created when the MPTCP connection got established before with
the first path. The newly added 'goto' will then skip the processing of
the segment text (step 7) and not go through tcp_data_queue() where the
MPTCP options are validated, and some actions are triggered, e.g.
sending the MPJ 4th ACK [2] as demonstrated by the new errors when
running a packetdrill test [3] establishing a second subflow.

This doesn't fully break MPTCP, mainly the 4th MPJ ACK that will be
delayed. Still, we don't want to have this behaviour as it delays the
switch to the fully established mode, and invalid MPTCP options in this
3rd ACK will not be caught any more. This modification also affects the
MPTCP + TFO feature as well, and being the reason why the selftests
started to be unstable the last few days [4].

For TFO, the existing 'basic-cookie-not-reqd' test [5] was no longer
passing: if the 3rd ACK contains data, and the connection is accept()ed
before receiving them, these data would no longer be processed, and thus
not ACKed.

One last thing about MPTCP, in case of simultaneous connect(), a
fallback to TCP will be done, which seems fine:

  `../common/defaults.sh`

   0 socket(..., SOCK_STREAM|SOCK_NONBLOCK, IPPROTO_MPTCP) = 3
  +0 connect(3, ..., ...) = -1 EINPROGRESS (Operation now in progress)

  +0 > S  0:0(0)                 <mss 1460, sackOK, TS val 100 ecr 0,   nop, wscale 8, mpcapable v1 flags[flag_h] nokey>
  +0 < S  0:0(0) win 1000        <mss 1460, sackOK, TS val 407 ecr 0,   nop, wscale 8, mpcapable v1 flags[flag_h] nokey>
  +0 > S. 0:0(0) ack 1           <mss 1460, sackOK, TS val 330 ecr 0,   nop, wscale 8, mpcapable v1 flags[flag_h] nokey>
  +0 < S. 0:0(0) ack 1 win 65535 <mss 1460, sackOK, TS val 700 ecr 100, nop, wscale 8, mpcapable v1 flags[flag_h] key[skey=2]>

  +0 write(3, ..., 100) = 100
  +0 >  . 1:1(0)     ack 1 <nop, nop, TS val 845707014 ecr 700, nop, nop, sack 0:1>
  +0 > P. 1:101(100) ack 1 <nop, nop, TS val 845958933 ecr 700>

Simultaneous SYN-data crossing is also not supported by TFO, see [6].

Link: https://github.com/multipath-tcp/mptcp_net-next/actions/runs/9936227696 [1]
Link: https://datatracker.ietf.org/doc/html/rfc8684#fig_tokens [2]
Link: https://github.com/multipath-tcp/packetdrill/blob/mptcp-net-next/gtests/net/mptcp/syscalls/accept.pkt#L28 [3]
Link: https://netdev.bots.linux.dev/contest.html?executor=vmksft-mptcp-dbg&test=mptcp-connect-sh [4]
Link: https://github.com/google/packetdrill/blob/master/gtests/net/tcp/fastopen/server/basic-cookie-not-reqd.pkt#L21 [5]
Link: https://github.com/google/packetdrill/blob/master/gtests/net/tcp/fastopen/client/simultaneous-fast-open.pkt [6]
Fixes: 23e89e8ee7be ("tcp: Don't drop SYN+ACK for simultaneous connect().")
Suggested-by: Paolo Abeni <pabeni@redhat.com>
Suggested-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
---
Notes:
 - We could also drop this 'goto consume', and send the unnecessary ACK
   in this simultaneous connect case, which doesn't seem to be a "real"
   case, more something for fuzzers. But that's not what the RFC 9293
   recommends to do.
 - v2:
   - Check if the SYN bit is set instead of looking for TFO and MPTCP
     specific attributes, as suggested by Kuniyuki.
   - Updated the comment above
   - Please note that the v2 has been sent mainly to satisfy the CI (to
     be able to catch new bugs with MPTCP), and because the suggestion
     from Kuniyuki looks better. It has not been sent to urge TCP
     maintainers to review it quicker than it should, please take your
     time and enjoy netdev.conf :)
---
 net/ipv4/tcp_input.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

Comments

Eric Dumazet July 23, 2024, 2:37 p.m. UTC | #1
On Thu, Jul 18, 2024 at 12:34 PM Matthieu Baerts (NGI0)
<matttbe@kernel.org> wrote:
>
> The 'Fixes' commit recently changed the behaviour of TCP by skipping the
> processing of the 3rd ACK when a sk->sk_socket is set. The goal was to
> skip tcp_ack_snd_check() in tcp_rcv_state_process() not to send an
> unnecessary ACK in case of simultaneous connect(). Unfortunately, that
> had an impact on TFO and MPTCP.
>
> I started to look at the impact on MPTCP, because the MPTCP CI found
> some issues with the MPTCP Packetdrill tests [1]. Then Paolo suggested
> me to look at the impact on TFO with "plain" TCP.
>
> For MPTCP, when receiving the 3rd ACK of a request adding a new path
> (MP_JOIN), sk->sk_socket will be set, and point to the MPTCP sock that
> has been created when the MPTCP connection got established before with
> the first path. The newly added 'goto' will then skip the processing of
> the segment text (step 7) and not go through tcp_data_queue() where the
> MPTCP options are validated, and some actions are triggered, e.g.
> sending the MPJ 4th ACK [2] as demonstrated by the new errors when
> running a packetdrill test [3] establishing a second subflow.
>
> This doesn't fully break MPTCP, mainly the 4th MPJ ACK that will be
> delayed. Still, we don't want to have this behaviour as it delays the
> switch to the fully established mode, and invalid MPTCP options in this
> 3rd ACK will not be caught any more. This modification also affects the
> MPTCP + TFO feature as well, and being the reason why the selftests
> started to be unstable the last few days [4].
>
> For TFO, the existing 'basic-cookie-not-reqd' test [5] was no longer
> passing: if the 3rd ACK contains data, and the connection is accept()ed
> before receiving them, these data would no longer be processed, and thus
> not ACKed.
>
> One last thing about MPTCP, in case of simultaneous connect(), a
> fallback to TCP will be done, which seems fine:
>
>   `../common/defaults.sh`
>
>    0 socket(..., SOCK_STREAM|SOCK_NONBLOCK, IPPROTO_MPTCP) = 3
>   +0 connect(3, ..., ...) = -1 EINPROGRESS (Operation now in progress)
>
>   +0 > S  0:0(0)                 <mss 1460, sackOK, TS val 100 ecr 0,   nop, wscale 8, mpcapable v1 flags[flag_h] nokey>
>   +0 < S  0:0(0) win 1000        <mss 1460, sackOK, TS val 407 ecr 0,   nop, wscale 8, mpcapable v1 flags[flag_h] nokey>
>   +0 > S. 0:0(0) ack 1           <mss 1460, sackOK, TS val 330 ecr 0,   nop, wscale 8, mpcapable v1 flags[flag_h] nokey>
>   +0 < S. 0:0(0) ack 1 win 65535 <mss 1460, sackOK, TS val 700 ecr 100, nop, wscale 8, mpcapable v1 flags[flag_h] key[skey=2]>
>
>   +0 write(3, ..., 100) = 100
>   +0 >  . 1:1(0)     ack 1 <nop, nop, TS val 845707014 ecr 700, nop, nop, sack 0:1>
>   +0 > P. 1:101(100) ack 1 <nop, nop, TS val 845958933 ecr 700>
>
> Simultaneous SYN-data crossing is also not supported by TFO, see [6].
>
> Link: https://github.com/multipath-tcp/mptcp_net-next/actions/runs/9936227696 [1]
> Link: https://datatracker.ietf.org/doc/html/rfc8684#fig_tokens [2]
> Link: https://github.com/multipath-tcp/packetdrill/blob/mptcp-net-next/gtests/net/mptcp/syscalls/accept.pkt#L28 [3]
> Link: https://netdev.bots.linux.dev/contest.html?executor=vmksft-mptcp-dbg&test=mptcp-connect-sh [4]
> Link: https://github.com/google/packetdrill/blob/master/gtests/net/tcp/fastopen/server/basic-cookie-not-reqd.pkt#L21 [5]
> Link: https://github.com/google/packetdrill/blob/master/gtests/net/tcp/fastopen/client/simultaneous-fast-open.pkt [6]
> Fixes: 23e89e8ee7be ("tcp: Don't drop SYN+ACK for simultaneous connect().")
> Suggested-by: Paolo Abeni <pabeni@redhat.com>
> Suggested-by: Kuniyuki Iwashima <kuniyu@amazon.com>
> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
> ---
> Notes:
>  - We could also drop this 'goto consume', and send the unnecessary ACK
>    in this simultaneous connect case, which doesn't seem to be a "real"
>    case, more something for fuzzers. But that's not what the RFC 9293
>    recommends to do.
>  - v2:
>    - Check if the SYN bit is set instead of looking for TFO and MPTCP
>      specific attributes, as suggested by Kuniyuki.
>    - Updated the comment above
>    - Please note that the v2 has been sent mainly to satisfy the CI (to
>      be able to catch new bugs with MPTCP), and because the suggestion
>      from Kuniyuki looks better. It has not been sent to urge TCP
>      maintainers to review it quicker than it should, please take your
>      time and enjoy netdev.conf :)
> ---
>  net/ipv4/tcp_input.c | 7 ++++++-
>  1 file changed, 6 insertions(+), 1 deletion(-)
>
> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> index ff9ab3d01ced..bfe1bc69dc3e 100644
> --- a/net/ipv4/tcp_input.c
> +++ b/net/ipv4/tcp_input.c
> @@ -6820,7 +6820,12 @@ tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb)
>                 if (sk->sk_shutdown & SEND_SHUTDOWN)
>                         tcp_shutdown(sk, SEND_SHUTDOWN);
>
> -               if (sk->sk_socket)
> +               /* For crossed SYN cases, not to send an unnecessary ACK.
> +                * Note that sk->sk_socket can be assigned in other cases, e.g.
> +                * with TFO (if accept()'ed before the 3rd ACK) and MPTCP (MPJ:
> +                * sk_socket is the parent MPTCP sock).
> +                */
> +               if (sk->sk_socket && th->syn)
>                         goto consume;

I think we should simply remove this part completely, because we
should send an ack anyway.

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index ff9ab3d01ced89570903d3a9f649a637c5e07a90..91357d4713182078debd746a224046cba80ea3ce
100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -6820,8 +6820,6 @@ tcp_rcv_state_process(struct sock *sk, struct
sk_buff *skb)
                if (sk->sk_shutdown & SEND_SHUTDOWN)
                        tcp_shutdown(sk, SEND_SHUTDOWN);

-               if (sk->sk_socket)
-                       goto consume;
                break;

        case TCP_FIN_WAIT1: {


I have a failing packetdrill test after  Kuniyuki  patch :



//
// Test the simultaneous open scenario that both end sends
// SYN/data. Although we don't support that the connection should
// still be established.
//
`../../common/defaults.sh
 ../../common/set_sysctls.py /proc/sys/net/ipv4/tcp_timestamps=0`

// Cache warmup: send a Fast Open cookie request
    0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
   +0 fcntl(3, F_SETFL, O_RDWR|O_NONBLOCK) = 0
   +0 sendto(3, ..., 0, MSG_FASTOPEN, ..., ...) = -1 EINPROGRESS
(Operation is now in progress)
   +0 > S 0:0(0) <mss 1460,nop,nop,sackOK,nop,wscale 8,FO,nop,nop>
 +.01 < S. 123:123(0) ack 1 win 14600 <mss
1460,nop,nop,sackOK,nop,wscale 6,FO abcd1234,nop,nop>
   +0 > . 1:1(0) ack 1
 +.01 close(3) = 0
   +0 > F. 1:1(0) ack 1
 +.01 < F. 1:1(0) ack 2 win 92
   +0 > .  2:2(0) ack 2


//
// Test: simulatenous fast open
//
 +.01 socket(..., SOCK_STREAM, IPPROTO_TCP) = 4
   +0 fcntl(4, F_SETFL, O_RDWR|O_NONBLOCK) = 0
   +0 sendto(4, ..., 1000, MSG_FASTOPEN, ..., ...) = 1000
   +0 > S 0:1000(1000) <mss 1460,nop,nop,sackOK,nop,wscale 8,FO
abcd1234,nop,nop>
// Simul. SYN-data crossing: we don't support that yet so ack only remote ISN
+.005 < S 1234:1734(500) win 14600 <mss 1040,nop,nop,sackOK,nop,wscale
6,FO 87654321,nop,nop>
   +0 > S. 0:0(0) ack 1235 <mss 1460,nop,nop,sackOK,nop,wscale 8>

// SYN data is never retried.
+.045 < S. 1234:1234(0) ack 1001 win 14600 <mss
940,nop,nop,sackOK,nop,wscale 6,FO 12345678,nop,nop>
   +0 > . 1001:1001(0) ack 1
// The other end retries
  +.1 < P. 1:501(500) ack 1000 win 257
   +0 > . 1001:1001(0) ack 501
   +0 read(4, ..., 4096) = 500
   +0 close(4) = 0
   +0 > F. 1001:1001(0) ack 501
 +.05 < F. 501:501(0) ack 1002 win 257
   +0 > . 1002:1002(0) ack 502

`/tmp/sysctl_restore_${PPID}.sh`
Matthieu Baerts July 23, 2024, 2:58 p.m. UTC | #2
Hi Eric,

+cc Neal
-cc Jerry (NoSuchUser)

On 23/07/2024 16:37, Eric Dumazet wrote:
> On Thu, Jul 18, 2024 at 12:34 PM Matthieu Baerts (NGI0)
> <matttbe@kernel.org> wrote:
>>
>> The 'Fixes' commit recently changed the behaviour of TCP by skipping the
>> processing of the 3rd ACK when a sk->sk_socket is set. The goal was to
>> skip tcp_ack_snd_check() in tcp_rcv_state_process() not to send an
>> unnecessary ACK in case of simultaneous connect(). Unfortunately, that
>> had an impact on TFO and MPTCP.
>>
>> I started to look at the impact on MPTCP, because the MPTCP CI found
>> some issues with the MPTCP Packetdrill tests [1]. Then Paolo suggested
>> me to look at the impact on TFO with "plain" TCP.
>>
>> For MPTCP, when receiving the 3rd ACK of a request adding a new path
>> (MP_JOIN), sk->sk_socket will be set, and point to the MPTCP sock that
>> has been created when the MPTCP connection got established before with
>> the first path. The newly added 'goto' will then skip the processing of
>> the segment text (step 7) and not go through tcp_data_queue() where the
>> MPTCP options are validated, and some actions are triggered, e.g.
>> sending the MPJ 4th ACK [2] as demonstrated by the new errors when
>> running a packetdrill test [3] establishing a second subflow.
>>
>> This doesn't fully break MPTCP, mainly the 4th MPJ ACK that will be
>> delayed. Still, we don't want to have this behaviour as it delays the
>> switch to the fully established mode, and invalid MPTCP options in this
>> 3rd ACK will not be caught any more. This modification also affects the
>> MPTCP + TFO feature as well, and being the reason why the selftests
>> started to be unstable the last few days [4].
>>
>> For TFO, the existing 'basic-cookie-not-reqd' test [5] was no longer
>> passing: if the 3rd ACK contains data, and the connection is accept()ed
>> before receiving them, these data would no longer be processed, and thus
>> not ACKed.
>>
>> One last thing about MPTCP, in case of simultaneous connect(), a
>> fallback to TCP will be done, which seems fine:
>>
>>   `../common/defaults.sh`
>>
>>    0 socket(..., SOCK_STREAM|SOCK_NONBLOCK, IPPROTO_MPTCP) = 3
>>   +0 connect(3, ..., ...) = -1 EINPROGRESS (Operation now in progress)
>>
>>   +0 > S  0:0(0)                 <mss 1460, sackOK, TS val 100 ecr 0,   nop, wscale 8, mpcapable v1 flags[flag_h] nokey>
>>   +0 < S  0:0(0) win 1000        <mss 1460, sackOK, TS val 407 ecr 0,   nop, wscale 8, mpcapable v1 flags[flag_h] nokey>
>>   +0 > S. 0:0(0) ack 1           <mss 1460, sackOK, TS val 330 ecr 0,   nop, wscale 8, mpcapable v1 flags[flag_h] nokey>
>>   +0 < S. 0:0(0) ack 1 win 65535 <mss 1460, sackOK, TS val 700 ecr 100, nop, wscale 8, mpcapable v1 flags[flag_h] key[skey=2]>
>>
>>   +0 write(3, ..., 100) = 100
>>   +0 >  . 1:1(0)     ack 1 <nop, nop, TS val 845707014 ecr 700, nop, nop, sack 0:1>
>>   +0 > P. 1:101(100) ack 1 <nop, nop, TS val 845958933 ecr 700>
>>
>> Simultaneous SYN-data crossing is also not supported by TFO, see [6].
>>
>> Link: https://github.com/multipath-tcp/mptcp_net-next/actions/runs/9936227696 [1]
>> Link: https://datatracker.ietf.org/doc/html/rfc8684#fig_tokens [2]
>> Link: https://github.com/multipath-tcp/packetdrill/blob/mptcp-net-next/gtests/net/mptcp/syscalls/accept.pkt#L28 [3]
>> Link: https://netdev.bots.linux.dev/contest.html?executor=vmksft-mptcp-dbg&test=mptcp-connect-sh [4]
>> Link: https://github.com/google/packetdrill/blob/master/gtests/net/tcp/fastopen/server/basic-cookie-not-reqd.pkt#L21 [5]
>> Link: https://github.com/google/packetdrill/blob/master/gtests/net/tcp/fastopen/client/simultaneous-fast-open.pkt [6]
>> Fixes: 23e89e8ee7be ("tcp: Don't drop SYN+ACK for simultaneous connect().")
>> Suggested-by: Paolo Abeni <pabeni@redhat.com>
>> Suggested-by: Kuniyuki Iwashima <kuniyu@amazon.com>
>> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
>> ---
>> Notes:
>>  - We could also drop this 'goto consume', and send the unnecessary ACK
>>    in this simultaneous connect case, which doesn't seem to be a "real"
>>    case, more something for fuzzers. But that's not what the RFC 9293
>>    recommends to do.
>>  - v2:
>>    - Check if the SYN bit is set instead of looking for TFO and MPTCP
>>      specific attributes, as suggested by Kuniyuki.
>>    - Updated the comment above
>>    - Please note that the v2 has been sent mainly to satisfy the CI (to
>>      be able to catch new bugs with MPTCP), and because the suggestion
>>      from Kuniyuki looks better. It has not been sent to urge TCP
>>      maintainers to review it quicker than it should, please take your
>>      time and enjoy netdev.conf :)
>> ---
>>  net/ipv4/tcp_input.c | 7 ++++++-
>>  1 file changed, 6 insertions(+), 1 deletion(-)
>>
>> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
>> index ff9ab3d01ced..bfe1bc69dc3e 100644
>> --- a/net/ipv4/tcp_input.c
>> +++ b/net/ipv4/tcp_input.c
>> @@ -6820,7 +6820,12 @@ tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb)
>>                 if (sk->sk_shutdown & SEND_SHUTDOWN)
>>                         tcp_shutdown(sk, SEND_SHUTDOWN);
>>
>> -               if (sk->sk_socket)
>> +               /* For crossed SYN cases, not to send an unnecessary ACK.
>> +                * Note that sk->sk_socket can be assigned in other cases, e.g.
>> +                * with TFO (if accept()'ed before the 3rd ACK) and MPTCP (MPJ:
>> +                * sk_socket is the parent MPTCP sock).
>> +                */
>> +               if (sk->sk_socket && th->syn)
>>                         goto consume;
> 
> I think we should simply remove this part completely, because we
> should send an ack anyway.

Thank you for having looked, and ran the full packetdrill test suite!

> 
> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> index ff9ab3d01ced89570903d3a9f649a637c5e07a90..91357d4713182078debd746a224046cba80ea3ce
> 100644
> --- a/net/ipv4/tcp_input.c
> +++ b/net/ipv4/tcp_input.c
> @@ -6820,8 +6820,6 @@ tcp_rcv_state_process(struct sock *sk, struct
> sk_buff *skb)
>                 if (sk->sk_shutdown & SEND_SHUTDOWN)
>                         tcp_shutdown(sk, SEND_SHUTDOWN);
> 
> -               if (sk->sk_socket)
> -                       goto consume;
>                 break;
> 
>         case TCP_FIN_WAIT1: {
> 
> 
> I have a failing packetdrill test after  Kuniyuki  patch :
> 
> 
> 
> //
> // Test the simultaneous open scenario that both end sends
> // SYN/data. Although we don't support that the connection should
> // still be established.
> //
> `../../common/defaults.sh
>  ../../common/set_sysctls.py /proc/sys/net/ipv4/tcp_timestamps=0`
> 
> // Cache warmup: send a Fast Open cookie request
>     0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
>    +0 fcntl(3, F_SETFL, O_RDWR|O_NONBLOCK) = 0
>    +0 sendto(3, ..., 0, MSG_FASTOPEN, ..., ...) = -1 EINPROGRESS
> (Operation is now in progress)
>    +0 > S 0:0(0) <mss 1460,nop,nop,sackOK,nop,wscale 8,FO,nop,nop>
>  +.01 < S. 123:123(0) ack 1 win 14600 <mss
> 1460,nop,nop,sackOK,nop,wscale 6,FO abcd1234,nop,nop>
>    +0 > . 1:1(0) ack 1
>  +.01 close(3) = 0
>    +0 > F. 1:1(0) ack 1
>  +.01 < F. 1:1(0) ack 2 win 92
>    +0 > .  2:2(0) ack 2
> 
> 
> //
> // Test: simulatenous fast open
> //
>  +.01 socket(..., SOCK_STREAM, IPPROTO_TCP) = 4
>    +0 fcntl(4, F_SETFL, O_RDWR|O_NONBLOCK) = 0
>    +0 sendto(4, ..., 1000, MSG_FASTOPEN, ..., ...) = 1000
>    +0 > S 0:1000(1000) <mss 1460,nop,nop,sackOK,nop,wscale 8,FO
> abcd1234,nop,nop>
> // Simul. SYN-data crossing: we don't support that yet so ack only remote ISN
> +.005 < S 1234:1734(500) win 14600 <mss 1040,nop,nop,sackOK,nop,wscale
> 6,FO 87654321,nop,nop>
>    +0 > S. 0:0(0) ack 1235 <mss 1460,nop,nop,sackOK,nop,wscale 8>
> 
> // SYN data is never retried.
> +.045 < S. 1234:1234(0) ack 1001 win 14600 <mss
> 940,nop,nop,sackOK,nop,wscale 6,FO 12345678,nop,nop>
>    +0 > . 1001:1001(0) ack 1

I recently sent a PR -- already applied -- to Neal to remove this line:

  https://github.com/google/packetdrill/pull/86

I thought it was the intension of Kuniyuki's patch not to send this ACK
in this case to follow the RFC 9293's recommendation. This TFO test
looks a bit similar to the example from Kuniyuki's patch:


--------------- 8< ---------------
 0 socket(..., SOCK_STREAM|SOCK_NONBLOCK, IPPROTO_TCP) = 3
+0 connect(3, ..., ...) = -1 EINPROGRESS (Operation now in progress)

+0 > S  0:0(0) <mss 1460,sackOK,TS val 1000 ecr 0,nop,wscale 8>
+0 < S  0:0(0) win 1000 <mss 1000>
+0 > S. 0:0(0) ack 1 <mss 1460,sackOK,TS val 3308134035 ecr 0,nop,wscale 8>
+0 < S. 0:0(0) ack 1 win 1000

  /* No ACK here */

+0 write(3, ..., 100) = 100
+0 > P. 1:101(100) ack 1
--------------- 8< ---------------



But maybe here that should be different for TFO?

For my case with MPTCP (and TFO), it is fine to drop this 'goto consume'
but I don't know how "strict" we want to be regarding the RFC and this
marginal case.


> // The other end retries
>   +.1 < P. 1:501(500) ack 1000 win 257
>    +0 > . 1001:1001(0) ack 501
>    +0 read(4, ..., 4096) = 500
>    +0 close(4) = 0
>    +0 > F. 1001:1001(0) ack 501
>  +.05 < F. 501:501(0) ack 1002 win 257
>    +0 > . 1002:1002(0) ack 502
> 
> `/tmp/sysctl_restore_${PPID}.sh`
> 

Cheers,
Matt
Eric Dumazet July 23, 2024, 3:38 p.m. UTC | #3
On Tue, Jul 23, 2024 at 4:58 PM Matthieu Baerts <matttbe@kernel.org> wrote:
>
> Hi Eric,
>
> +cc Neal
> -cc Jerry (NoSuchUser)
>
> On 23/07/2024 16:37, Eric Dumazet wrote:
> > On Thu, Jul 18, 2024 at 12:34 PM Matthieu Baerts (NGI0)
> > <matttbe@kernel.org> wrote:
> >>
> >> The 'Fixes' commit recently changed the behaviour of TCP by skipping the
> >> processing of the 3rd ACK when a sk->sk_socket is set. The goal was to
> >> skip tcp_ack_snd_check() in tcp_rcv_state_process() not to send an
> >> unnecessary ACK in case of simultaneous connect(). Unfortunately, that
> >> had an impact on TFO and MPTCP.
> >>
> >> I started to look at the impact on MPTCP, because the MPTCP CI found
> >> some issues with the MPTCP Packetdrill tests [1]. Then Paolo suggested
> >> me to look at the impact on TFO with "plain" TCP.
> >>
> >> For MPTCP, when receiving the 3rd ACK of a request adding a new path
> >> (MP_JOIN), sk->sk_socket will be set, and point to the MPTCP sock that
> >> has been created when the MPTCP connection got established before with
> >> the first path. The newly added 'goto' will then skip the processing of
> >> the segment text (step 7) and not go through tcp_data_queue() where the
> >> MPTCP options are validated, and some actions are triggered, e.g.
> >> sending the MPJ 4th ACK [2] as demonstrated by the new errors when
> >> running a packetdrill test [3] establishing a second subflow.
> >>
> >> This doesn't fully break MPTCP, mainly the 4th MPJ ACK that will be
> >> delayed. Still, we don't want to have this behaviour as it delays the
> >> switch to the fully established mode, and invalid MPTCP options in this
> >> 3rd ACK will not be caught any more. This modification also affects the
> >> MPTCP + TFO feature as well, and being the reason why the selftests
> >> started to be unstable the last few days [4].
> >>
> >> For TFO, the existing 'basic-cookie-not-reqd' test [5] was no longer
> >> passing: if the 3rd ACK contains data, and the connection is accept()ed
> >> before receiving them, these data would no longer be processed, and thus
> >> not ACKed.
> >>
> >> One last thing about MPTCP, in case of simultaneous connect(), a
> >> fallback to TCP will be done, which seems fine:
> >>
> >>   `../common/defaults.sh`
> >>
> >>    0 socket(..., SOCK_STREAM|SOCK_NONBLOCK, IPPROTO_MPTCP) = 3
> >>   +0 connect(3, ..., ...) = -1 EINPROGRESS (Operation now in progress)
> >>
> >>   +0 > S  0:0(0)                 <mss 1460, sackOK, TS val 100 ecr 0,   nop, wscale 8, mpcapable v1 flags[flag_h] nokey>
> >>   +0 < S  0:0(0) win 1000        <mss 1460, sackOK, TS val 407 ecr 0,   nop, wscale 8, mpcapable v1 flags[flag_h] nokey>
> >>   +0 > S. 0:0(0) ack 1           <mss 1460, sackOK, TS val 330 ecr 0,   nop, wscale 8, mpcapable v1 flags[flag_h] nokey>
> >>   +0 < S. 0:0(0) ack 1 win 65535 <mss 1460, sackOK, TS val 700 ecr 100, nop, wscale 8, mpcapable v1 flags[flag_h] key[skey=2]>
> >>
> >>   +0 write(3, ..., 100) = 100
> >>   +0 >  . 1:1(0)     ack 1 <nop, nop, TS val 845707014 ecr 700, nop, nop, sack 0:1>
> >>   +0 > P. 1:101(100) ack 1 <nop, nop, TS val 845958933 ecr 700>
> >>
> >> Simultaneous SYN-data crossing is also not supported by TFO, see [6].
> >>
> >> Link: https://github.com/multipath-tcp/mptcp_net-next/actions/runs/9936227696 [1]
> >> Link: https://datatracker.ietf.org/doc/html/rfc8684#fig_tokens [2]
> >> Link: https://github.com/multipath-tcp/packetdrill/blob/mptcp-net-next/gtests/net/mptcp/syscalls/accept.pkt#L28 [3]
> >> Link: https://netdev.bots.linux.dev/contest.html?executor=vmksft-mptcp-dbg&test=mptcp-connect-sh [4]
> >> Link: https://github.com/google/packetdrill/blob/master/gtests/net/tcp/fastopen/server/basic-cookie-not-reqd.pkt#L21 [5]
> >> Link: https://github.com/google/packetdrill/blob/master/gtests/net/tcp/fastopen/client/simultaneous-fast-open.pkt [6]
> >> Fixes: 23e89e8ee7be ("tcp: Don't drop SYN+ACK for simultaneous connect().")
> >> Suggested-by: Paolo Abeni <pabeni@redhat.com>
> >> Suggested-by: Kuniyuki Iwashima <kuniyu@amazon.com>
> >> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
> >> ---
> >> Notes:
> >>  - We could also drop this 'goto consume', and send the unnecessary ACK
> >>    in this simultaneous connect case, which doesn't seem to be a "real"
> >>    case, more something for fuzzers. But that's not what the RFC 9293
> >>    recommends to do.
> >>  - v2:
> >>    - Check if the SYN bit is set instead of looking for TFO and MPTCP
> >>      specific attributes, as suggested by Kuniyuki.
> >>    - Updated the comment above
> >>    - Please note that the v2 has been sent mainly to satisfy the CI (to
> >>      be able to catch new bugs with MPTCP), and because the suggestion
> >>      from Kuniyuki looks better. It has not been sent to urge TCP
> >>      maintainers to review it quicker than it should, please take your
> >>      time and enjoy netdev.conf :)
> >> ---
> >>  net/ipv4/tcp_input.c | 7 ++++++-
> >>  1 file changed, 6 insertions(+), 1 deletion(-)
> >>
> >> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> >> index ff9ab3d01ced..bfe1bc69dc3e 100644
> >> --- a/net/ipv4/tcp_input.c
> >> +++ b/net/ipv4/tcp_input.c
> >> @@ -6820,7 +6820,12 @@ tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb)
> >>                 if (sk->sk_shutdown & SEND_SHUTDOWN)
> >>                         tcp_shutdown(sk, SEND_SHUTDOWN);
> >>
> >> -               if (sk->sk_socket)
> >> +               /* For crossed SYN cases, not to send an unnecessary ACK.
> >> +                * Note that sk->sk_socket can be assigned in other cases, e.g.
> >> +                * with TFO (if accept()'ed before the 3rd ACK) and MPTCP (MPJ:
> >> +                * sk_socket is the parent MPTCP sock).
> >> +                */
> >> +               if (sk->sk_socket && th->syn)
> >>                         goto consume;
> >
> > I think we should simply remove this part completely, because we
> > should send an ack anyway.
>
> Thank you for having looked, and ran the full packetdrill test suite!
>
> >
> > diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> > index ff9ab3d01ced89570903d3a9f649a637c5e07a90..91357d4713182078debd746a224046cba80ea3ce
> > 100644
> > --- a/net/ipv4/tcp_input.c
> > +++ b/net/ipv4/tcp_input.c
> > @@ -6820,8 +6820,6 @@ tcp_rcv_state_process(struct sock *sk, struct
> > sk_buff *skb)
> >                 if (sk->sk_shutdown & SEND_SHUTDOWN)
> >                         tcp_shutdown(sk, SEND_SHUTDOWN);
> >
> > -               if (sk->sk_socket)
> > -                       goto consume;
> >                 break;
> >
> >         case TCP_FIN_WAIT1: {
> >
> >
> > I have a failing packetdrill test after  Kuniyuki  patch :
> >
> >
> >
> > //
> > // Test the simultaneous open scenario that both end sends
> > // SYN/data. Although we don't support that the connection should
> > // still be established.
> > //
> > `../../common/defaults.sh
> >  ../../common/set_sysctls.py /proc/sys/net/ipv4/tcp_timestamps=0`
> >
> > // Cache warmup: send a Fast Open cookie request
> >     0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
> >    +0 fcntl(3, F_SETFL, O_RDWR|O_NONBLOCK) = 0
> >    +0 sendto(3, ..., 0, MSG_FASTOPEN, ..., ...) = -1 EINPROGRESS
> > (Operation is now in progress)
> >    +0 > S 0:0(0) <mss 1460,nop,nop,sackOK,nop,wscale 8,FO,nop,nop>
> >  +.01 < S. 123:123(0) ack 1 win 14600 <mss
> > 1460,nop,nop,sackOK,nop,wscale 6,FO abcd1234,nop,nop>
> >    +0 > . 1:1(0) ack 1
> >  +.01 close(3) = 0
> >    +0 > F. 1:1(0) ack 1
> >  +.01 < F. 1:1(0) ack 2 win 92
> >    +0 > .  2:2(0) ack 2
> >
> >
> > //
> > // Test: simulatenous fast open
> > //
> >  +.01 socket(..., SOCK_STREAM, IPPROTO_TCP) = 4
> >    +0 fcntl(4, F_SETFL, O_RDWR|O_NONBLOCK) = 0
> >    +0 sendto(4, ..., 1000, MSG_FASTOPEN, ..., ...) = 1000
> >    +0 > S 0:1000(1000) <mss 1460,nop,nop,sackOK,nop,wscale 8,FO
> > abcd1234,nop,nop>
> > // Simul. SYN-data crossing: we don't support that yet so ack only remote ISN
> > +.005 < S 1234:1734(500) win 14600 <mss 1040,nop,nop,sackOK,nop,wscale
> > 6,FO 87654321,nop,nop>
> >    +0 > S. 0:0(0) ack 1235 <mss 1460,nop,nop,sackOK,nop,wscale 8>
> >
> > // SYN data is never retried.
> > +.045 < S. 1234:1234(0) ack 1001 win 14600 <mss
> > 940,nop,nop,sackOK,nop,wscale 6,FO 12345678,nop,nop>
> >    +0 > . 1001:1001(0) ack 1
>
> I recently sent a PR -- already applied -- to Neal to remove this line:
>
>   https://github.com/google/packetdrill/pull/86
>
> I thought it was the intension of Kuniyuki's patch not to send this ACK
> in this case to follow the RFC 9293's recommendation. This TFO test
> looks a bit similar to the example from Kuniyuki's patch:
>
>
> --------------- 8< ---------------
>  0 socket(..., SOCK_STREAM|SOCK_NONBLOCK, IPPROTO_TCP) = 3
> +0 connect(3, ..., ...) = -1 EINPROGRESS (Operation now in progress)
>
> +0 > S  0:0(0) <mss 1460,sackOK,TS val 1000 ecr 0,nop,wscale 8>
> +0 < S  0:0(0) win 1000 <mss 1000>
> +0 > S. 0:0(0) ack 1 <mss 1460,sackOK,TS val 3308134035 ecr 0,nop,wscale 8>
> +0 < S. 0:0(0) ack 1 win 1000
>
>   /* No ACK here */
>
> +0 write(3, ..., 100) = 100
> +0 > P. 1:101(100) ack 1
> --------------- 8< ---------------
>
>
>
> But maybe here that should be different for TFO?
>
> For my case with MPTCP (and TFO), it is fine to drop this 'goto consume'
> but I don't know how "strict" we want to be regarding the RFC and this
> marginal case.

Problem of this 'goto consume' is that we are not properly sending a
DUPACK in this case.

 +.01 socket(..., SOCK_STREAM, IPPROTO_TCP) = 4
   +0 fcntl(4, F_SETFL, O_RDWR|O_NONBLOCK) = 0
   +0 sendto(4, ..., 1000, MSG_FASTOPEN, ..., ...) = 1000
   +0 > S 0:1000(1000) <mss 1460,nop,nop,sackOK,nop,wscale 8,FO
abcd1234,nop,nop>
// Simul. SYN-data crossing: we don't support that yet so ack only remote ISN
+.005 < S 1234:1734(500) win 14600 <mss 1040,nop,nop,sackOK,nop,wscale
6,FO 87654321,nop,nop>
   +0 > S. 0:0(0) ack 1235 <mss 1460,nop,nop,sackOK,nop,wscale 8>

+.045 < S. 1234:1234(0) ack 1001 win 14600 <mss
940,nop,nop,sackOK,nop,wscale 6,FO 12345678,nop,nop>
   +0 > . 1001:1001(0) ack 1 <nop,nop,sack 0:1>  // See here

Not sending a dupack seems wrong and could hurt.

I had reservations about this part, if you look back to the discussion.

This is why Kuniyuki added in his changelog :

    Note that tcp_ack_snd_check() in tcp_rcv_state_process() is skipped not to
    send an unnecessary ACK, but this could be a bit risky for net.git, so this
    targets for net-next.
Matthieu Baerts July 23, 2024, 4:08 p.m. UTC | #4
Hi Eric,

On 23/07/2024 17:38, Eric Dumazet wrote:
> On Tue, Jul 23, 2024 at 4:58 PM Matthieu Baerts <matttbe@kernel.org> wrote:
>>
>> Hi Eric,
>>
>> +cc Neal
>> -cc Jerry (NoSuchUser)
>>
>> On 23/07/2024 16:37, Eric Dumazet wrote:
>>> On Thu, Jul 18, 2024 at 12:34 PM Matthieu Baerts (NGI0)
>>> <matttbe@kernel.org> wrote:
>>>>
>>>> The 'Fixes' commit recently changed the behaviour of TCP by skipping the
>>>> processing of the 3rd ACK when a sk->sk_socket is set. The goal was to
>>>> skip tcp_ack_snd_check() in tcp_rcv_state_process() not to send an
>>>> unnecessary ACK in case of simultaneous connect(). Unfortunately, that
>>>> had an impact on TFO and MPTCP.
>>>>
>>>> I started to look at the impact on MPTCP, because the MPTCP CI found
>>>> some issues with the MPTCP Packetdrill tests [1]. Then Paolo suggested
>>>> me to look at the impact on TFO with "plain" TCP.
>>>>
>>>> For MPTCP, when receiving the 3rd ACK of a request adding a new path
>>>> (MP_JOIN), sk->sk_socket will be set, and point to the MPTCP sock that
>>>> has been created when the MPTCP connection got established before with
>>>> the first path. The newly added 'goto' will then skip the processing of
>>>> the segment text (step 7) and not go through tcp_data_queue() where the
>>>> MPTCP options are validated, and some actions are triggered, e.g.
>>>> sending the MPJ 4th ACK [2] as demonstrated by the new errors when
>>>> running a packetdrill test [3] establishing a second subflow.
>>>>
>>>> This doesn't fully break MPTCP, mainly the 4th MPJ ACK that will be
>>>> delayed. Still, we don't want to have this behaviour as it delays the
>>>> switch to the fully established mode, and invalid MPTCP options in this
>>>> 3rd ACK will not be caught any more. This modification also affects the
>>>> MPTCP + TFO feature as well, and being the reason why the selftests
>>>> started to be unstable the last few days [4].
>>>>
>>>> For TFO, the existing 'basic-cookie-not-reqd' test [5] was no longer
>>>> passing: if the 3rd ACK contains data, and the connection is accept()ed
>>>> before receiving them, these data would no longer be processed, and thus
>>>> not ACKed.
>>>>
>>>> One last thing about MPTCP, in case of simultaneous connect(), a
>>>> fallback to TCP will be done, which seems fine:
>>>>
>>>>   `../common/defaults.sh`
>>>>
>>>>    0 socket(..., SOCK_STREAM|SOCK_NONBLOCK, IPPROTO_MPTCP) = 3
>>>>   +0 connect(3, ..., ...) = -1 EINPROGRESS (Operation now in progress)
>>>>
>>>>   +0 > S  0:0(0)                 <mss 1460, sackOK, TS val 100 ecr 0,   nop, wscale 8, mpcapable v1 flags[flag_h] nokey>
>>>>   +0 < S  0:0(0) win 1000        <mss 1460, sackOK, TS val 407 ecr 0,   nop, wscale 8, mpcapable v1 flags[flag_h] nokey>
>>>>   +0 > S. 0:0(0) ack 1           <mss 1460, sackOK, TS val 330 ecr 0,   nop, wscale 8, mpcapable v1 flags[flag_h] nokey>
>>>>   +0 < S. 0:0(0) ack 1 win 65535 <mss 1460, sackOK, TS val 700 ecr 100, nop, wscale 8, mpcapable v1 flags[flag_h] key[skey=2]>
>>>>
>>>>   +0 write(3, ..., 100) = 100
>>>>   +0 >  . 1:1(0)     ack 1 <nop, nop, TS val 845707014 ecr 700, nop, nop, sack 0:1>
>>>>   +0 > P. 1:101(100) ack 1 <nop, nop, TS val 845958933 ecr 700>
>>>>
>>>> Simultaneous SYN-data crossing is also not supported by TFO, see [6].
>>>>
>>>> Link: https://github.com/multipath-tcp/mptcp_net-next/actions/runs/9936227696 [1]
>>>> Link: https://datatracker.ietf.org/doc/html/rfc8684#fig_tokens [2]
>>>> Link: https://github.com/multipath-tcp/packetdrill/blob/mptcp-net-next/gtests/net/mptcp/syscalls/accept.pkt#L28 [3]
>>>> Link: https://netdev.bots.linux.dev/contest.html?executor=vmksft-mptcp-dbg&test=mptcp-connect-sh [4]
>>>> Link: https://github.com/google/packetdrill/blob/master/gtests/net/tcp/fastopen/server/basic-cookie-not-reqd.pkt#L21 [5]
>>>> Link: https://github.com/google/packetdrill/blob/master/gtests/net/tcp/fastopen/client/simultaneous-fast-open.pkt [6]
>>>> Fixes: 23e89e8ee7be ("tcp: Don't drop SYN+ACK for simultaneous connect().")
>>>> Suggested-by: Paolo Abeni <pabeni@redhat.com>
>>>> Suggested-by: Kuniyuki Iwashima <kuniyu@amazon.com>
>>>> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
>>>> ---
>>>> Notes:
>>>>  - We could also drop this 'goto consume', and send the unnecessary ACK
>>>>    in this simultaneous connect case, which doesn't seem to be a "real"
>>>>    case, more something for fuzzers. But that's not what the RFC 9293
>>>>    recommends to do.
>>>>  - v2:
>>>>    - Check if the SYN bit is set instead of looking for TFO and MPTCP
>>>>      specific attributes, as suggested by Kuniyuki.
>>>>    - Updated the comment above
>>>>    - Please note that the v2 has been sent mainly to satisfy the CI (to
>>>>      be able to catch new bugs with MPTCP), and because the suggestion
>>>>      from Kuniyuki looks better. It has not been sent to urge TCP
>>>>      maintainers to review it quicker than it should, please take your
>>>>      time and enjoy netdev.conf :)
>>>> ---
>>>>  net/ipv4/tcp_input.c | 7 ++++++-
>>>>  1 file changed, 6 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
>>>> index ff9ab3d01ced..bfe1bc69dc3e 100644
>>>> --- a/net/ipv4/tcp_input.c
>>>> +++ b/net/ipv4/tcp_input.c
>>>> @@ -6820,7 +6820,12 @@ tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb)
>>>>                 if (sk->sk_shutdown & SEND_SHUTDOWN)
>>>>                         tcp_shutdown(sk, SEND_SHUTDOWN);
>>>>
>>>> -               if (sk->sk_socket)
>>>> +               /* For crossed SYN cases, not to send an unnecessary ACK.
>>>> +                * Note that sk->sk_socket can be assigned in other cases, e.g.
>>>> +                * with TFO (if accept()'ed before the 3rd ACK) and MPTCP (MPJ:
>>>> +                * sk_socket is the parent MPTCP sock).
>>>> +                */
>>>> +               if (sk->sk_socket && th->syn)
>>>>                         goto consume;
>>>
>>> I think we should simply remove this part completely, because we
>>> should send an ack anyway.
>>
>> Thank you for having looked, and ran the full packetdrill test suite!
>>
>>>
>>> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
>>> index ff9ab3d01ced89570903d3a9f649a637c5e07a90..91357d4713182078debd746a224046cba80ea3ce
>>> 100644
>>> --- a/net/ipv4/tcp_input.c
>>> +++ b/net/ipv4/tcp_input.c
>>> @@ -6820,8 +6820,6 @@ tcp_rcv_state_process(struct sock *sk, struct
>>> sk_buff *skb)
>>>                 if (sk->sk_shutdown & SEND_SHUTDOWN)
>>>                         tcp_shutdown(sk, SEND_SHUTDOWN);
>>>
>>> -               if (sk->sk_socket)
>>> -                       goto consume;
>>>                 break;
>>>
>>>         case TCP_FIN_WAIT1: {
>>>
>>>
>>> I have a failing packetdrill test after  Kuniyuki  patch :
>>>
>>>
>>>
>>> //
>>> // Test the simultaneous open scenario that both end sends
>>> // SYN/data. Although we don't support that the connection should
>>> // still be established.
>>> //
>>> `../../common/defaults.sh
>>>  ../../common/set_sysctls.py /proc/sys/net/ipv4/tcp_timestamps=0`
>>>
>>> // Cache warmup: send a Fast Open cookie request
>>>     0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
>>>    +0 fcntl(3, F_SETFL, O_RDWR|O_NONBLOCK) = 0
>>>    +0 sendto(3, ..., 0, MSG_FASTOPEN, ..., ...) = -1 EINPROGRESS
>>> (Operation is now in progress)
>>>    +0 > S 0:0(0) <mss 1460,nop,nop,sackOK,nop,wscale 8,FO,nop,nop>
>>>  +.01 < S. 123:123(0) ack 1 win 14600 <mss
>>> 1460,nop,nop,sackOK,nop,wscale 6,FO abcd1234,nop,nop>
>>>    +0 > . 1:1(0) ack 1
>>>  +.01 close(3) = 0
>>>    +0 > F. 1:1(0) ack 1
>>>  +.01 < F. 1:1(0) ack 2 win 92
>>>    +0 > .  2:2(0) ack 2
>>>
>>>
>>> //
>>> // Test: simulatenous fast open
>>> //
>>>  +.01 socket(..., SOCK_STREAM, IPPROTO_TCP) = 4
>>>    +0 fcntl(4, F_SETFL, O_RDWR|O_NONBLOCK) = 0
>>>    +0 sendto(4, ..., 1000, MSG_FASTOPEN, ..., ...) = 1000
>>>    +0 > S 0:1000(1000) <mss 1460,nop,nop,sackOK,nop,wscale 8,FO
>>> abcd1234,nop,nop>
>>> // Simul. SYN-data crossing: we don't support that yet so ack only remote ISN
>>> +.005 < S 1234:1734(500) win 14600 <mss 1040,nop,nop,sackOK,nop,wscale
>>> 6,FO 87654321,nop,nop>
>>>    +0 > S. 0:0(0) ack 1235 <mss 1460,nop,nop,sackOK,nop,wscale 8>
>>>
>>> // SYN data is never retried.
>>> +.045 < S. 1234:1234(0) ack 1001 win 14600 <mss
>>> 940,nop,nop,sackOK,nop,wscale 6,FO 12345678,nop,nop>
>>>    +0 > . 1001:1001(0) ack 1
>>
>> I recently sent a PR -- already applied -- to Neal to remove this line:
>>
>>   https://github.com/google/packetdrill/pull/86
>>
>> I thought it was the intension of Kuniyuki's patch not to send this ACK
>> in this case to follow the RFC 9293's recommendation. This TFO test
>> looks a bit similar to the example from Kuniyuki's patch:
>>
>>
>> --------------- 8< ---------------
>>  0 socket(..., SOCK_STREAM|SOCK_NONBLOCK, IPPROTO_TCP) = 3
>> +0 connect(3, ..., ...) = -1 EINPROGRESS (Operation now in progress)
>>
>> +0 > S  0:0(0) <mss 1460,sackOK,TS val 1000 ecr 0,nop,wscale 8>
>> +0 < S  0:0(0) win 1000 <mss 1000>
>> +0 > S. 0:0(0) ack 1 <mss 1460,sackOK,TS val 3308134035 ecr 0,nop,wscale 8>
>> +0 < S. 0:0(0) ack 1 win 1000
>>
>>   /* No ACK here */
>>
>> +0 write(3, ..., 100) = 100
>> +0 > P. 1:101(100) ack 1
>> --------------- 8< ---------------
>>
>>
>>
>> But maybe here that should be different for TFO?
>>
>> For my case with MPTCP (and TFO), it is fine to drop this 'goto consume'
>> but I don't know how "strict" we want to be regarding the RFC and this
>> marginal case.
> 
> Problem of this 'goto consume' is that we are not properly sending a
> DUPACK in this case.
> 
>  +.01 socket(..., SOCK_STREAM, IPPROTO_TCP) = 4
>    +0 fcntl(4, F_SETFL, O_RDWR|O_NONBLOCK) = 0
>    +0 sendto(4, ..., 1000, MSG_FASTOPEN, ..., ...) = 1000
>    +0 > S 0:1000(1000) <mss 1460,nop,nop,sackOK,nop,wscale 8,FO
> abcd1234,nop,nop>
> // Simul. SYN-data crossing: we don't support that yet so ack only remote ISN
> +.005 < S 1234:1734(500) win 14600 <mss 1040,nop,nop,sackOK,nop,wscale
> 6,FO 87654321,nop,nop>
>    +0 > S. 0:0(0) ack 1235 <mss 1460,nop,nop,sackOK,nop,wscale 8>
> 
> +.045 < S. 1234:1234(0) ack 1001 win 14600 <mss
> 940,nop,nop,sackOK,nop,wscale 6,FO 12345678,nop,nop>
>    +0 > . 1001:1001(0) ack 1 <nop,nop,sack 0:1>  // See here

I'm sorry, but is it normal to have 'ack 1' with 'sack 0:1' here?

> Not sending a dupack seems wrong and could hurt.

Indeed, I thought the RFC 9293 was not allowing that, but I didn't see
anything forbidding this DUPACK:

  https://www.rfc-editor.org/rfc/rfc9293.html#section-3.5-7

> I had reservations about this part, if you look back to the discussion.
> 
> This is why Kuniyuki added in his changelog :
> 
>     Note that tcp_ack_snd_check() in tcp_rcv_state_process() is skipped not to
>     send an unnecessary ACK, but this could be a bit risky for net.git, so this
>     targets for net-next.

I understand, I can send a v3 dropping this part, and not including
patch 2/2 for -net. I can also send a PR to Neal re-adding the ACK with
'sack' (if it is normal).

Cheers,
Matt
Eric Dumazet July 23, 2024, 4:42 p.m. UTC | #5
On Tue, Jul 23, 2024 at 6:08 PM Matthieu Baerts <matttbe@kernel.org> wrote:
>
> Hi Eric,
>
> On 23/07/2024 17:38, Eric Dumazet wrote:
> > On Tue, Jul 23, 2024 at 4:58 PM Matthieu Baerts <matttbe@kernel.org> wrote:
> >>
> >> Hi Eric,
> >>
> >> +cc Neal
> >> -cc Jerry (NoSuchUser)
> >>
> >> On 23/07/2024 16:37, Eric Dumazet wrote:
> >>> On Thu, Jul 18, 2024 at 12:34 PM Matthieu Baerts (NGI0)
> >>> <matttbe@kernel.org> wrote:
> >>>>
> >>>> The 'Fixes' commit recently changed the behaviour of TCP by skipping the
> >>>> processing of the 3rd ACK when a sk->sk_socket is set. The goal was to
> >>>> skip tcp_ack_snd_check() in tcp_rcv_state_process() not to send an
> >>>> unnecessary ACK in case of simultaneous connect(). Unfortunately, that
> >>>> had an impact on TFO and MPTCP.
> >>>>
> >>>> I started to look at the impact on MPTCP, because the MPTCP CI found
> >>>> some issues with the MPTCP Packetdrill tests [1]. Then Paolo suggested
> >>>> me to look at the impact on TFO with "plain" TCP.
> >>>>
> >>>> For MPTCP, when receiving the 3rd ACK of a request adding a new path
> >>>> (MP_JOIN), sk->sk_socket will be set, and point to the MPTCP sock that
> >>>> has been created when the MPTCP connection got established before with
> >>>> the first path. The newly added 'goto' will then skip the processing of
> >>>> the segment text (step 7) and not go through tcp_data_queue() where the
> >>>> MPTCP options are validated, and some actions are triggered, e.g.
> >>>> sending the MPJ 4th ACK [2] as demonstrated by the new errors when
> >>>> running a packetdrill test [3] establishing a second subflow.
> >>>>
> >>>> This doesn't fully break MPTCP, mainly the 4th MPJ ACK that will be
> >>>> delayed. Still, we don't want to have this behaviour as it delays the
> >>>> switch to the fully established mode, and invalid MPTCP options in this
> >>>> 3rd ACK will not be caught any more. This modification also affects the
> >>>> MPTCP + TFO feature as well, and being the reason why the selftests
> >>>> started to be unstable the last few days [4].
> >>>>
> >>>> For TFO, the existing 'basic-cookie-not-reqd' test [5] was no longer
> >>>> passing: if the 3rd ACK contains data, and the connection is accept()ed
> >>>> before receiving them, these data would no longer be processed, and thus
> >>>> not ACKed.
> >>>>
> >>>> One last thing about MPTCP, in case of simultaneous connect(), a
> >>>> fallback to TCP will be done, which seems fine:
> >>>>
> >>>>   `../common/defaults.sh`
> >>>>
> >>>>    0 socket(..., SOCK_STREAM|SOCK_NONBLOCK, IPPROTO_MPTCP) = 3
> >>>>   +0 connect(3, ..., ...) = -1 EINPROGRESS (Operation now in progress)
> >>>>
> >>>>   +0 > S  0:0(0)                 <mss 1460, sackOK, TS val 100 ecr 0,   nop, wscale 8, mpcapable v1 flags[flag_h] nokey>
> >>>>   +0 < S  0:0(0) win 1000        <mss 1460, sackOK, TS val 407 ecr 0,   nop, wscale 8, mpcapable v1 flags[flag_h] nokey>
> >>>>   +0 > S. 0:0(0) ack 1           <mss 1460, sackOK, TS val 330 ecr 0,   nop, wscale 8, mpcapable v1 flags[flag_h] nokey>
> >>>>   +0 < S. 0:0(0) ack 1 win 65535 <mss 1460, sackOK, TS val 700 ecr 100, nop, wscale 8, mpcapable v1 flags[flag_h] key[skey=2]>
> >>>>
> >>>>   +0 write(3, ..., 100) = 100
> >>>>   +0 >  . 1:1(0)     ack 1 <nop, nop, TS val 845707014 ecr 700, nop, nop, sack 0:1>
> >>>>   +0 > P. 1:101(100) ack 1 <nop, nop, TS val 845958933 ecr 700>
> >>>>
> >>>> Simultaneous SYN-data crossing is also not supported by TFO, see [6].
> >>>>
> >>>> Link: https://github.com/multipath-tcp/mptcp_net-next/actions/runs/9936227696 [1]
> >>>> Link: https://datatracker.ietf.org/doc/html/rfc8684#fig_tokens [2]
> >>>> Link: https://github.com/multipath-tcp/packetdrill/blob/mptcp-net-next/gtests/net/mptcp/syscalls/accept.pkt#L28 [3]
> >>>> Link: https://netdev.bots.linux.dev/contest.html?executor=vmksft-mptcp-dbg&test=mptcp-connect-sh [4]
> >>>> Link: https://github.com/google/packetdrill/blob/master/gtests/net/tcp/fastopen/server/basic-cookie-not-reqd.pkt#L21 [5]
> >>>> Link: https://github.com/google/packetdrill/blob/master/gtests/net/tcp/fastopen/client/simultaneous-fast-open.pkt [6]
> >>>> Fixes: 23e89e8ee7be ("tcp: Don't drop SYN+ACK for simultaneous connect().")
> >>>> Suggested-by: Paolo Abeni <pabeni@redhat.com>
> >>>> Suggested-by: Kuniyuki Iwashima <kuniyu@amazon.com>
> >>>> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
> >>>> ---
> >>>> Notes:
> >>>>  - We could also drop this 'goto consume', and send the unnecessary ACK
> >>>>    in this simultaneous connect case, which doesn't seem to be a "real"
> >>>>    case, more something for fuzzers. But that's not what the RFC 9293
> >>>>    recommends to do.
> >>>>  - v2:
> >>>>    - Check if the SYN bit is set instead of looking for TFO and MPTCP
> >>>>      specific attributes, as suggested by Kuniyuki.
> >>>>    - Updated the comment above
> >>>>    - Please note that the v2 has been sent mainly to satisfy the CI (to
> >>>>      be able to catch new bugs with MPTCP), and because the suggestion
> >>>>      from Kuniyuki looks better. It has not been sent to urge TCP
> >>>>      maintainers to review it quicker than it should, please take your
> >>>>      time and enjoy netdev.conf :)
> >>>> ---
> >>>>  net/ipv4/tcp_input.c | 7 ++++++-
> >>>>  1 file changed, 6 insertions(+), 1 deletion(-)
> >>>>
> >>>> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> >>>> index ff9ab3d01ced..bfe1bc69dc3e 100644
> >>>> --- a/net/ipv4/tcp_input.c
> >>>> +++ b/net/ipv4/tcp_input.c
> >>>> @@ -6820,7 +6820,12 @@ tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb)
> >>>>                 if (sk->sk_shutdown & SEND_SHUTDOWN)
> >>>>                         tcp_shutdown(sk, SEND_SHUTDOWN);
> >>>>
> >>>> -               if (sk->sk_socket)
> >>>> +               /* For crossed SYN cases, not to send an unnecessary ACK.
> >>>> +                * Note that sk->sk_socket can be assigned in other cases, e.g.
> >>>> +                * with TFO (if accept()'ed before the 3rd ACK) and MPTCP (MPJ:
> >>>> +                * sk_socket is the parent MPTCP sock).
> >>>> +                */
> >>>> +               if (sk->sk_socket && th->syn)
> >>>>                         goto consume;
> >>>
> >>> I think we should simply remove this part completely, because we
> >>> should send an ack anyway.
> >>
> >> Thank you for having looked, and ran the full packetdrill test suite!
> >>
> >>>
> >>> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> >>> index ff9ab3d01ced89570903d3a9f649a637c5e07a90..91357d4713182078debd746a224046cba80ea3ce
> >>> 100644
> >>> --- a/net/ipv4/tcp_input.c
> >>> +++ b/net/ipv4/tcp_input.c
> >>> @@ -6820,8 +6820,6 @@ tcp_rcv_state_process(struct sock *sk, struct
> >>> sk_buff *skb)
> >>>                 if (sk->sk_shutdown & SEND_SHUTDOWN)
> >>>                         tcp_shutdown(sk, SEND_SHUTDOWN);
> >>>
> >>> -               if (sk->sk_socket)
> >>> -                       goto consume;
> >>>                 break;
> >>>
> >>>         case TCP_FIN_WAIT1: {
> >>>
> >>>
> >>> I have a failing packetdrill test after  Kuniyuki  patch :
> >>>
> >>>
> >>>
> >>> //
> >>> // Test the simultaneous open scenario that both end sends
> >>> // SYN/data. Although we don't support that the connection should
> >>> // still be established.
> >>> //
> >>> `../../common/defaults.sh
> >>>  ../../common/set_sysctls.py /proc/sys/net/ipv4/tcp_timestamps=0`
> >>>
> >>> // Cache warmup: send a Fast Open cookie request
> >>>     0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
> >>>    +0 fcntl(3, F_SETFL, O_RDWR|O_NONBLOCK) = 0
> >>>    +0 sendto(3, ..., 0, MSG_FASTOPEN, ..., ...) = -1 EINPROGRESS
> >>> (Operation is now in progress)
> >>>    +0 > S 0:0(0) <mss 1460,nop,nop,sackOK,nop,wscale 8,FO,nop,nop>
> >>>  +.01 < S. 123:123(0) ack 1 win 14600 <mss
> >>> 1460,nop,nop,sackOK,nop,wscale 6,FO abcd1234,nop,nop>
> >>>    +0 > . 1:1(0) ack 1
> >>>  +.01 close(3) = 0
> >>>    +0 > F. 1:1(0) ack 1
> >>>  +.01 < F. 1:1(0) ack 2 win 92
> >>>    +0 > .  2:2(0) ack 2
> >>>
> >>>
> >>> //
> >>> // Test: simulatenous fast open
> >>> //
> >>>  +.01 socket(..., SOCK_STREAM, IPPROTO_TCP) = 4
> >>>    +0 fcntl(4, F_SETFL, O_RDWR|O_NONBLOCK) = 0
> >>>    +0 sendto(4, ..., 1000, MSG_FASTOPEN, ..., ...) = 1000
> >>>    +0 > S 0:1000(1000) <mss 1460,nop,nop,sackOK,nop,wscale 8,FO
> >>> abcd1234,nop,nop>
> >>> // Simul. SYN-data crossing: we don't support that yet so ack only remote ISN
> >>> +.005 < S 1234:1734(500) win 14600 <mss 1040,nop,nop,sackOK,nop,wscale
> >>> 6,FO 87654321,nop,nop>
> >>>    +0 > S. 0:0(0) ack 1235 <mss 1460,nop,nop,sackOK,nop,wscale 8>
> >>>
> >>> // SYN data is never retried.
> >>> +.045 < S. 1234:1234(0) ack 1001 win 14600 <mss
> >>> 940,nop,nop,sackOK,nop,wscale 6,FO 12345678,nop,nop>
> >>>    +0 > . 1001:1001(0) ack 1
> >>
> >> I recently sent a PR -- already applied -- to Neal to remove this line:
> >>
> >>   https://github.com/google/packetdrill/pull/86
> >>
> >> I thought it was the intension of Kuniyuki's patch not to send this ACK
> >> in this case to follow the RFC 9293's recommendation. This TFO test
> >> looks a bit similar to the example from Kuniyuki's patch:
> >>
> >>
> >> --------------- 8< ---------------
> >>  0 socket(..., SOCK_STREAM|SOCK_NONBLOCK, IPPROTO_TCP) = 3
> >> +0 connect(3, ..., ...) = -1 EINPROGRESS (Operation now in progress)
> >>
> >> +0 > S  0:0(0) <mss 1460,sackOK,TS val 1000 ecr 0,nop,wscale 8>
> >> +0 < S  0:0(0) win 1000 <mss 1000>
> >> +0 > S. 0:0(0) ack 1 <mss 1460,sackOK,TS val 3308134035 ecr 0,nop,wscale 8>
> >> +0 < S. 0:0(0) ack 1 win 1000
> >>
> >>   /* No ACK here */
> >>
> >> +0 write(3, ..., 100) = 100
> >> +0 > P. 1:101(100) ack 1
> >> --------------- 8< ---------------
> >>
> >>
> >>
> >> But maybe here that should be different for TFO?
> >>
> >> For my case with MPTCP (and TFO), it is fine to drop this 'goto consume'
> >> but I don't know how "strict" we want to be regarding the RFC and this
> >> marginal case.
> >
> > Problem of this 'goto consume' is that we are not properly sending a
> > DUPACK in this case.
> >
> >  +.01 socket(..., SOCK_STREAM, IPPROTO_TCP) = 4
> >    +0 fcntl(4, F_SETFL, O_RDWR|O_NONBLOCK) = 0
> >    +0 sendto(4, ..., 1000, MSG_FASTOPEN, ..., ...) = 1000
> >    +0 > S 0:1000(1000) <mss 1460,nop,nop,sackOK,nop,wscale 8,FO
> > abcd1234,nop,nop>
> > // Simul. SYN-data crossing: we don't support that yet so ack only remote ISN
> > +.005 < S 1234:1734(500) win 14600 <mss 1040,nop,nop,sackOK,nop,wscale
> > 6,FO 87654321,nop,nop>
> >    +0 > S. 0:0(0) ack 1235 <mss 1460,nop,nop,sackOK,nop,wscale 8>
> >
> > +.045 < S. 1234:1234(0) ack 1001 win 14600 <mss
> > 940,nop,nop,sackOK,nop,wscale 6,FO 12345678,nop,nop>
> >    +0 > . 1001:1001(0) ack 1 <nop,nop,sack 0:1>  // See here
>
> I'm sorry, but is it normal to have 'ack 1' with 'sack 0:1' here?

It is normal, because the SYN was already received/processed.

sack 0:1 represents this SYN sequence.
Matthieu Baerts July 23, 2024, 7:09 p.m. UTC | #6
Hi Eric,

On 23/07/2024 18:42, Eric Dumazet wrote:
> On Tue, Jul 23, 2024 at 6:08 PM Matthieu Baerts <matttbe@kernel.org> wrote:
>>
>> Hi Eric,
>>
>> On 23/07/2024 17:38, Eric Dumazet wrote:
>>> On Tue, Jul 23, 2024 at 4:58 PM Matthieu Baerts <matttbe@kernel.org> wrote:
>>>>
>>>> Hi Eric,
>>>>
>>>> +cc Neal
>>>> -cc Jerry (NoSuchUser)
>>>>
>>>> On 23/07/2024 16:37, Eric Dumazet wrote:
>>>>> On Thu, Jul 18, 2024 at 12:34 PM Matthieu Baerts (NGI0)
>>>>> <matttbe@kernel.org> wrote:
>>>>>>
>>>>>> The 'Fixes' commit recently changed the behaviour of TCP by skipping the
>>>>>> processing of the 3rd ACK when a sk->sk_socket is set. The goal was to
>>>>>> skip tcp_ack_snd_check() in tcp_rcv_state_process() not to send an
>>>>>> unnecessary ACK in case of simultaneous connect(). Unfortunately, that
>>>>>> had an impact on TFO and MPTCP.
>>>>>>
>>>>>> I started to look at the impact on MPTCP, because the MPTCP CI found
>>>>>> some issues with the MPTCP Packetdrill tests [1]. Then Paolo suggested
>>>>>> me to look at the impact on TFO with "plain" TCP.
>>>>>>
>>>>>> For MPTCP, when receiving the 3rd ACK of a request adding a new path
>>>>>> (MP_JOIN), sk->sk_socket will be set, and point to the MPTCP sock that
>>>>>> has been created when the MPTCP connection got established before with
>>>>>> the first path. The newly added 'goto' will then skip the processing of
>>>>>> the segment text (step 7) and not go through tcp_data_queue() where the
>>>>>> MPTCP options are validated, and some actions are triggered, e.g.
>>>>>> sending the MPJ 4th ACK [2] as demonstrated by the new errors when
>>>>>> running a packetdrill test [3] establishing a second subflow.
>>>>>>
>>>>>> This doesn't fully break MPTCP, mainly the 4th MPJ ACK that will be
>>>>>> delayed. Still, we don't want to have this behaviour as it delays the
>>>>>> switch to the fully established mode, and invalid MPTCP options in this
>>>>>> 3rd ACK will not be caught any more. This modification also affects the
>>>>>> MPTCP + TFO feature as well, and being the reason why the selftests
>>>>>> started to be unstable the last few days [4].
>>>>>>
>>>>>> For TFO, the existing 'basic-cookie-not-reqd' test [5] was no longer
>>>>>> passing: if the 3rd ACK contains data, and the connection is accept()ed
>>>>>> before receiving them, these data would no longer be processed, and thus
>>>>>> not ACKed.
>>>>>>
>>>>>> One last thing about MPTCP, in case of simultaneous connect(), a
>>>>>> fallback to TCP will be done, which seems fine:
>>>>>>
>>>>>>   `../common/defaults.sh`
>>>>>>
>>>>>>    0 socket(..., SOCK_STREAM|SOCK_NONBLOCK, IPPROTO_MPTCP) = 3
>>>>>>   +0 connect(3, ..., ...) = -1 EINPROGRESS (Operation now in progress)
>>>>>>
>>>>>>   +0 > S  0:0(0)                 <mss 1460, sackOK, TS val 100 ecr 0,   nop, wscale 8, mpcapable v1 flags[flag_h] nokey>
>>>>>>   +0 < S  0:0(0) win 1000        <mss 1460, sackOK, TS val 407 ecr 0,   nop, wscale 8, mpcapable v1 flags[flag_h] nokey>
>>>>>>   +0 > S. 0:0(0) ack 1           <mss 1460, sackOK, TS val 330 ecr 0,   nop, wscale 8, mpcapable v1 flags[flag_h] nokey>
>>>>>>   +0 < S. 0:0(0) ack 1 win 65535 <mss 1460, sackOK, TS val 700 ecr 100, nop, wscale 8, mpcapable v1 flags[flag_h] key[skey=2]>
>>>>>>
>>>>>>   +0 write(3, ..., 100) = 100
>>>>>>   +0 >  . 1:1(0)     ack 1 <nop, nop, TS val 845707014 ecr 700, nop, nop, sack 0:1>
>>>>>>   +0 > P. 1:101(100) ack 1 <nop, nop, TS val 845958933 ecr 700>
>>>>>>
>>>>>> Simultaneous SYN-data crossing is also not supported by TFO, see [6].
>>>>>>
>>>>>> Link: https://github.com/multipath-tcp/mptcp_net-next/actions/runs/9936227696 [1]
>>>>>> Link: https://datatracker.ietf.org/doc/html/rfc8684#fig_tokens [2]
>>>>>> Link: https://github.com/multipath-tcp/packetdrill/blob/mptcp-net-next/gtests/net/mptcp/syscalls/accept.pkt#L28 [3]
>>>>>> Link: https://netdev.bots.linux.dev/contest.html?executor=vmksft-mptcp-dbg&test=mptcp-connect-sh [4]
>>>>>> Link: https://github.com/google/packetdrill/blob/master/gtests/net/tcp/fastopen/server/basic-cookie-not-reqd.pkt#L21 [5]
>>>>>> Link: https://github.com/google/packetdrill/blob/master/gtests/net/tcp/fastopen/client/simultaneous-fast-open.pkt [6]
>>>>>> Fixes: 23e89e8ee7be ("tcp: Don't drop SYN+ACK for simultaneous connect().")
>>>>>> Suggested-by: Paolo Abeni <pabeni@redhat.com>
>>>>>> Suggested-by: Kuniyuki Iwashima <kuniyu@amazon.com>
>>>>>> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
>>>>>> ---
>>>>>> Notes:
>>>>>>  - We could also drop this 'goto consume', and send the unnecessary ACK
>>>>>>    in this simultaneous connect case, which doesn't seem to be a "real"
>>>>>>    case, more something for fuzzers. But that's not what the RFC 9293
>>>>>>    recommends to do.
>>>>>>  - v2:
>>>>>>    - Check if the SYN bit is set instead of looking for TFO and MPTCP
>>>>>>      specific attributes, as suggested by Kuniyuki.
>>>>>>    - Updated the comment above
>>>>>>    - Please note that the v2 has been sent mainly to satisfy the CI (to
>>>>>>      be able to catch new bugs with MPTCP), and because the suggestion
>>>>>>      from Kuniyuki looks better. It has not been sent to urge TCP
>>>>>>      maintainers to review it quicker than it should, please take your
>>>>>>      time and enjoy netdev.conf :)
>>>>>> ---
>>>>>>  net/ipv4/tcp_input.c | 7 ++++++-
>>>>>>  1 file changed, 6 insertions(+), 1 deletion(-)
>>>>>>
>>>>>> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
>>>>>> index ff9ab3d01ced..bfe1bc69dc3e 100644
>>>>>> --- a/net/ipv4/tcp_input.c
>>>>>> +++ b/net/ipv4/tcp_input.c
>>>>>> @@ -6820,7 +6820,12 @@ tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb)
>>>>>>                 if (sk->sk_shutdown & SEND_SHUTDOWN)
>>>>>>                         tcp_shutdown(sk, SEND_SHUTDOWN);
>>>>>>
>>>>>> -               if (sk->sk_socket)
>>>>>> +               /* For crossed SYN cases, not to send an unnecessary ACK.
>>>>>> +                * Note that sk->sk_socket can be assigned in other cases, e.g.
>>>>>> +                * with TFO (if accept()'ed before the 3rd ACK) and MPTCP (MPJ:
>>>>>> +                * sk_socket is the parent MPTCP sock).
>>>>>> +                */
>>>>>> +               if (sk->sk_socket && th->syn)
>>>>>>                         goto consume;
>>>>>
>>>>> I think we should simply remove this part completely, because we
>>>>> should send an ack anyway.
>>>>
>>>> Thank you for having looked, and ran the full packetdrill test suite!
>>>>
>>>>>
>>>>> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
>>>>> index ff9ab3d01ced89570903d3a9f649a637c5e07a90..91357d4713182078debd746a224046cba80ea3ce
>>>>> 100644
>>>>> --- a/net/ipv4/tcp_input.c
>>>>> +++ b/net/ipv4/tcp_input.c
>>>>> @@ -6820,8 +6820,6 @@ tcp_rcv_state_process(struct sock *sk, struct
>>>>> sk_buff *skb)
>>>>>                 if (sk->sk_shutdown & SEND_SHUTDOWN)
>>>>>                         tcp_shutdown(sk, SEND_SHUTDOWN);
>>>>>
>>>>> -               if (sk->sk_socket)
>>>>> -                       goto consume;
>>>>>                 break;
>>>>>
>>>>>         case TCP_FIN_WAIT1: {
>>>>>
>>>>>
>>>>> I have a failing packetdrill test after  Kuniyuki  patch :
>>>>>
>>>>>
>>>>>
>>>>> //
>>>>> // Test the simultaneous open scenario that both end sends
>>>>> // SYN/data. Although we don't support that the connection should
>>>>> // still be established.
>>>>> //
>>>>> `../../common/defaults.sh
>>>>>  ../../common/set_sysctls.py /proc/sys/net/ipv4/tcp_timestamps=0`
>>>>>
>>>>> // Cache warmup: send a Fast Open cookie request
>>>>>     0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
>>>>>    +0 fcntl(3, F_SETFL, O_RDWR|O_NONBLOCK) = 0
>>>>>    +0 sendto(3, ..., 0, MSG_FASTOPEN, ..., ...) = -1 EINPROGRESS
>>>>> (Operation is now in progress)
>>>>>    +0 > S 0:0(0) <mss 1460,nop,nop,sackOK,nop,wscale 8,FO,nop,nop>
>>>>>  +.01 < S. 123:123(0) ack 1 win 14600 <mss
>>>>> 1460,nop,nop,sackOK,nop,wscale 6,FO abcd1234,nop,nop>
>>>>>    +0 > . 1:1(0) ack 1
>>>>>  +.01 close(3) = 0
>>>>>    +0 > F. 1:1(0) ack 1
>>>>>  +.01 < F. 1:1(0) ack 2 win 92
>>>>>    +0 > .  2:2(0) ack 2
>>>>>
>>>>>
>>>>> //
>>>>> // Test: simulatenous fast open
>>>>> //
>>>>>  +.01 socket(..., SOCK_STREAM, IPPROTO_TCP) = 4
>>>>>    +0 fcntl(4, F_SETFL, O_RDWR|O_NONBLOCK) = 0
>>>>>    +0 sendto(4, ..., 1000, MSG_FASTOPEN, ..., ...) = 1000
>>>>>    +0 > S 0:1000(1000) <mss 1460,nop,nop,sackOK,nop,wscale 8,FO
>>>>> abcd1234,nop,nop>
>>>>> // Simul. SYN-data crossing: we don't support that yet so ack only remote ISN
>>>>> +.005 < S 1234:1734(500) win 14600 <mss 1040,nop,nop,sackOK,nop,wscale
>>>>> 6,FO 87654321,nop,nop>
>>>>>    +0 > S. 0:0(0) ack 1235 <mss 1460,nop,nop,sackOK,nop,wscale 8>
>>>>>
>>>>> // SYN data is never retried.
>>>>> +.045 < S. 1234:1234(0) ack 1001 win 14600 <mss
>>>>> 940,nop,nop,sackOK,nop,wscale 6,FO 12345678,nop,nop>
>>>>>    +0 > . 1001:1001(0) ack 1
>>>>
>>>> I recently sent a PR -- already applied -- to Neal to remove this line:
>>>>
>>>>   https://github.com/google/packetdrill/pull/86
>>>>
>>>> I thought it was the intension of Kuniyuki's patch not to send this ACK
>>>> in this case to follow the RFC 9293's recommendation. This TFO test
>>>> looks a bit similar to the example from Kuniyuki's patch:
>>>>
>>>>
>>>> --------------- 8< ---------------
>>>>  0 socket(..., SOCK_STREAM|SOCK_NONBLOCK, IPPROTO_TCP) = 3
>>>> +0 connect(3, ..., ...) = -1 EINPROGRESS (Operation now in progress)
>>>>
>>>> +0 > S  0:0(0) <mss 1460,sackOK,TS val 1000 ecr 0,nop,wscale 8>
>>>> +0 < S  0:0(0) win 1000 <mss 1000>
>>>> +0 > S. 0:0(0) ack 1 <mss 1460,sackOK,TS val 3308134035 ecr 0,nop,wscale 8>
>>>> +0 < S. 0:0(0) ack 1 win 1000
>>>>
>>>>   /* No ACK here */
>>>>
>>>> +0 write(3, ..., 100) = 100
>>>> +0 > P. 1:101(100) ack 1
>>>> --------------- 8< ---------------
>>>>
>>>>
>>>>
>>>> But maybe here that should be different for TFO?
>>>>
>>>> For my case with MPTCP (and TFO), it is fine to drop this 'goto consume'
>>>> but I don't know how "strict" we want to be regarding the RFC and this
>>>> marginal case.
>>>
>>> Problem of this 'goto consume' is that we are not properly sending a
>>> DUPACK in this case.
>>>
>>>  +.01 socket(..., SOCK_STREAM, IPPROTO_TCP) = 4
>>>    +0 fcntl(4, F_SETFL, O_RDWR|O_NONBLOCK) = 0
>>>    +0 sendto(4, ..., 1000, MSG_FASTOPEN, ..., ...) = 1000
>>>    +0 > S 0:1000(1000) <mss 1460,nop,nop,sackOK,nop,wscale 8,FO
>>> abcd1234,nop,nop>
>>> // Simul. SYN-data crossing: we don't support that yet so ack only remote ISN
>>> +.005 < S 1234:1734(500) win 14600 <mss 1040,nop,nop,sackOK,nop,wscale
>>> 6,FO 87654321,nop,nop>
>>>    +0 > S. 0:0(0) ack 1235 <mss 1460,nop,nop,sackOK,nop,wscale 8>
>>>
>>> +.045 < S. 1234:1234(0) ack 1001 win 14600 <mss
>>> 940,nop,nop,sackOK,nop,wscale 6,FO 12345678,nop,nop>
>>>    +0 > . 1001:1001(0) ack 1 <nop,nop,sack 0:1>  // See here
>>
>> I'm sorry, but is it normal to have 'ack 1' with 'sack 0:1' here?
> 
> It is normal, because the SYN was already received/processed.
> 
> sack 0:1 represents this SYN sequence.

Thank you for your reply!

Maybe it is just me, but does it not look strange to have the SACK
covering a segment (0:1) that is before the ACK (1)?

'ack 1' and 'sack 0:1' seem to cover the same block, no?
Before Kuniyuki's patch, this 'sack 0:1' was not present.

Cheers,
Matt
Kuniyuki Iwashima July 23, 2024, 9:27 p.m. UTC | #7
From: Eric Dumazet <edumazet@google.com>
Date: Tue, 23 Jul 2024 17:38:27 +0200
[...]
> > > //
> > > // Test the simultaneous open scenario that both end sends
> > > // SYN/data. Although we don't support that the connection should
> > > // still be established.
> > > //
> > > `../../common/defaults.sh
> > >  ../../common/set_sysctls.py /proc/sys/net/ipv4/tcp_timestamps=0`
> > >
> > > // Cache warmup: send a Fast Open cookie request
> > >     0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
> > >    +0 fcntl(3, F_SETFL, O_RDWR|O_NONBLOCK) = 0
> > >    +0 sendto(3, ..., 0, MSG_FASTOPEN, ..., ...) = -1 EINPROGRESS
> > > (Operation is now in progress)
> > >    +0 > S 0:0(0) <mss 1460,nop,nop,sackOK,nop,wscale 8,FO,nop,nop>
> > >  +.01 < S. 123:123(0) ack 1 win 14600 <mss
> > > 1460,nop,nop,sackOK,nop,wscale 6,FO abcd1234,nop,nop>
> > >    +0 > . 1:1(0) ack 1
> > >  +.01 close(3) = 0
> > >    +0 > F. 1:1(0) ack 1
> > >  +.01 < F. 1:1(0) ack 2 win 92
> > >    +0 > .  2:2(0) ack 2
> > >
> > >
> > > //
> > > // Test: simulatenous fast open
> > > //
> > >  +.01 socket(..., SOCK_STREAM, IPPROTO_TCP) = 4
> > >    +0 fcntl(4, F_SETFL, O_RDWR|O_NONBLOCK) = 0
> > >    +0 sendto(4, ..., 1000, MSG_FASTOPEN, ..., ...) = 1000
> > >    +0 > S 0:1000(1000) <mss 1460,nop,nop,sackOK,nop,wscale 8,FO
> > > abcd1234,nop,nop>
> > > // Simul. SYN-data crossing: we don't support that yet so ack only remote ISN
> > > +.005 < S 1234:1734(500) win 14600 <mss 1040,nop,nop,sackOK,nop,wscale
> > > 6,FO 87654321,nop,nop>
> > >    +0 > S. 0:0(0) ack 1235 <mss 1460,nop,nop,sackOK,nop,wscale 8>
> > >
> > > // SYN data is never retried.
> > > +.045 < S. 1234:1234(0) ack 1001 win 14600 <mss
> > > 940,nop,nop,sackOK,nop,wscale 6,FO 12345678,nop,nop>
> > >    +0 > . 1001:1001(0) ack 1
> >
> > I recently sent a PR -- already applied -- to Neal to remove this line:
> >
> >   https://github.com/google/packetdrill/pull/86
> >
> > I thought it was the intension of Kuniyuki's patch not to send this ACK
> > in this case to follow the RFC 9293's recommendation. This TFO test
> > looks a bit similar to the example from Kuniyuki's patch:
> >
> >
> > --------------- 8< ---------------
> >  0 socket(..., SOCK_STREAM|SOCK_NONBLOCK, IPPROTO_TCP) = 3
> > +0 connect(3, ..., ...) = -1 EINPROGRESS (Operation now in progress)
> >
> > +0 > S  0:0(0) <mss 1460,sackOK,TS val 1000 ecr 0,nop,wscale 8>
> > +0 < S  0:0(0) win 1000 <mss 1000>
> > +0 > S. 0:0(0) ack 1 <mss 1460,sackOK,TS val 3308134035 ecr 0,nop,wscale 8>
> > +0 < S. 0:0(0) ack 1 win 1000
> >
> >   /* No ACK here */
> >
> > +0 write(3, ..., 100) = 100
> > +0 > P. 1:101(100) ack 1
> > --------------- 8< ---------------
> >
> >
> >
> > But maybe here that should be different for TFO?
> >
> > For my case with MPTCP (and TFO), it is fine to drop this 'goto consume'
> > but I don't know how "strict" we want to be regarding the RFC and this
> > marginal case.
> 
> Problem of this 'goto consume' is that we are not properly sending a
> DUPACK in this case.
> 
>  +.01 socket(..., SOCK_STREAM, IPPROTO_TCP) = 4
>    +0 fcntl(4, F_SETFL, O_RDWR|O_NONBLOCK) = 0
>    +0 sendto(4, ..., 1000, MSG_FASTOPEN, ..., ...) = 1000
>    +0 > S 0:1000(1000) <mss 1460,nop,nop,sackOK,nop,wscale 8,FO
> abcd1234,nop,nop>
> // Simul. SYN-data crossing: we don't support that yet so ack only remote ISN
> +.005 < S 1234:1734(500) win 14600 <mss 1040,nop,nop,sackOK,nop,wscale
> 6,FO 87654321,nop,nop>
>    +0 > S. 0:0(0) ack 1235 <mss 1460,nop,nop,sackOK,nop,wscale 8>
> 
> +.045 < S. 1234:1234(0) ack 1001 win 14600 <mss
> 940,nop,nop,sackOK,nop,wscale 6,FO 12345678,nop,nop>
>    +0 > . 1001:1001(0) ack 1 <nop,nop,sack 0:1>  // See here
> 
> Not sending a dupack seems wrong and could hurt.

I think the situation where we should send ACK after simultaneous
SYN+ACK would be:

---8<---
 +.01 socket(..., SOCK_STREAM, IPPROTO_TCP) = 4
   +0 fcntl(4, F_SETFL, O_RDWR|O_NONBLOCK) = 0
   +0 sendto(4, ..., 1000, MSG_FASTOPEN, ..., ...) = 1000
   +0 > S 0:1000(1000) <mss 1440,nop,nop,sackOK,nop,wscale 8,FO abcd1234,nop,nop>
// Simul. SYN-data crossing: we don't support that yet so ack only remote ISN
+.005 < S 1234:1734(500) win 14400 <mss 1040,nop,nop,sackOK,nop,wscale 6,FO 87654321,nop,nop>
   +0 > S. 0:0(0) ack 1235 <mss 1440,nop,nop,sackOK,nop,wscale 8>

// SYN data is not ACKed too.
+.045 < S. 1234:1234(0) ack 1 win 14400 <mss 940,nop,nop,sackOK,nop,wscale 6,FO 12345678,nop,nop>

   +0 > . 1:1001(1000) ack 1
---8<---

When the first data in SYN is not ACKed, it must be retransmitted,
and it can be done just after SYN+ACK is received, which is skipped
by 'goto consume'.

Retransmitting data in SYN is not supported though.

---8<---
sendto syscall: 1721769223.194675
outbound sniffed packet:  0.040288 S 833802090:833803090(1000) win 65535 <mss 1440,nop,nop,sackOK,nop,wscale 8,FO abcd1234,nop,nop>
inbound injected packet:  0.045323 S 1234:1734(500) win 14400 <mss 1040,nop,nop,sackOK,nop,wscale 6,FO 87654321,nop,nop>
outbound sniffed packet:  0.045355 S. 833802090:833802090(0) ack 1235 win 65535 <mss 1440,nop,nop,sackOK,nop,wscale 8>
inbound injected packet:  0.090429 S. 1234:1234(0) ack 833802091 win 14400 <mss 940,nop,nop,sackOK,nop,wscale 6,FO 12345678,nop,nop>
outbound sniffed packet:  0.090460 . 833803091:833803091(0) ack 1235 win 1052 
outbound sniffed packet:  1.051776 S. 833802090:833802090(0) ack 1235 win 65535 <mss 1440,nop,nop,sackOK,nop,wscale 8>
simul2.pkt:38: error handling packet: live packet field tcp_data_offset: expected: 5 (0x5) vs actual: 8 (0x8)
   script packet:  0.090462 . 1:1001(1000) ack 1 
actual #0 packet:  0.090460 . 1001:1001(0) ack 1 win 1052 
actual #1 packet:  1.051776 S. 0:0(0) ack 1235 win 65535 <mss 1440,nop,nop,sackOK,nop,wscale 8>
---8<---

Except the case, I think the dupack is not needed in theory.

But I understand the dupack could help the other quickly retransmit the
not-yet-ACKed data in SYN instead of waiting for a timer as expected in
the comment.

---8<---
// The other end retries
  +.1 < P. 1:501(500) ack 1000 win 257
---8<---
Kuniyuki Iwashima July 23, 2024, 9:41 p.m. UTC | #8
From: Matthieu Baerts <matttbe@kernel.org>
Date: Tue, 23 Jul 2024 21:09:40 +0200
[...]
> >>> Problem of this 'goto consume' is that we are not properly sending a
> >>> DUPACK in this case.
> >>>
> >>>  +.01 socket(..., SOCK_STREAM, IPPROTO_TCP) = 4
> >>>    +0 fcntl(4, F_SETFL, O_RDWR|O_NONBLOCK) = 0
> >>>    +0 sendto(4, ..., 1000, MSG_FASTOPEN, ..., ...) = 1000
> >>>    +0 > S 0:1000(1000) <mss 1460,nop,nop,sackOK,nop,wscale 8,FO
> >>> abcd1234,nop,nop>
> >>> // Simul. SYN-data crossing: we don't support that yet so ack only remote ISN
> >>> +.005 < S 1234:1734(500) win 14600 <mss 1040,nop,nop,sackOK,nop,wscale
> >>> 6,FO 87654321,nop,nop>
> >>>    +0 > S. 0:0(0) ack 1235 <mss 1460,nop,nop,sackOK,nop,wscale 8>
> >>>
> >>> +.045 < S. 1234:1234(0) ack 1001 win 14600 <mss
> >>> 940,nop,nop,sackOK,nop,wscale 6,FO 12345678,nop,nop>
> >>>    +0 > . 1001:1001(0) ack 1 <nop,nop,sack 0:1>  // See here
> >>
> >> I'm sorry, but is it normal to have 'ack 1' with 'sack 0:1' here?
> > 
> > It is normal, because the SYN was already received/processed.
> > 
> > sack 0:1 represents this SYN sequence.
> 
> Thank you for your reply!
> 
> Maybe it is just me, but does it not look strange to have the SACK
> covering a segment (0:1) that is before the ACK (1)?
> 
> 'ack 1' and 'sack 0:1' seem to cover the same block, no?
> Before Kuniyuki's patch, this 'sack 0:1' was not present.

This looks a bit strange to me too, but I guess this is also not
forbidden ?
Neal Cardwell July 23, 2024, 10:01 p.m. UTC | #9
On Tue, Jul 23, 2024 at 3:09 PM Matthieu Baerts <matttbe@kernel.org> wrote:
>
> Hi Eric,
>
> On 23/07/2024 18:42, Eric Dumazet wrote:
> > On Tue, Jul 23, 2024 at 6:08 PM Matthieu Baerts <matttbe@kernel.org> wrote:
> >>
> >> Hi Eric,
> >>
> >> On 23/07/2024 17:38, Eric Dumazet wrote:
> >>> On Tue, Jul 23, 2024 at 4:58 PM Matthieu Baerts <matttbe@kernel.org> wrote:
> >>>>
> >>>> Hi Eric,
> >>>>
> >>>> +cc Neal
> >>>> -cc Jerry (NoSuchUser)
> >>>>
> >>>> On 23/07/2024 16:37, Eric Dumazet wrote:
> >>>>> On Thu, Jul 18, 2024 at 12:34 PM Matthieu Baerts (NGI0)
> >>>>> <matttbe@kernel.org> wrote:
> >>>>>>
...
> >>> +.045 < S. 1234:1234(0) ack 1001 win 14600 <mss
> >>> 940,nop,nop,sackOK,nop,wscale 6,FO 12345678,nop,nop>
> >>>    +0 > . 1001:1001(0) ack 1 <nop,nop,sack 0:1>  // See here
> >>
> >> I'm sorry, but is it normal to have 'ack 1' with 'sack 0:1' here?
> >
> > It is normal, because the SYN was already received/processed.
> >
> > sack 0:1 represents this SYN sequence.
>
> Thank you for your reply!
>
> Maybe it is just me, but does it not look strange to have the SACK
> covering a segment (0:1) that is before the ACK (1)?
>
> 'ack 1' and 'sack 0:1' seem to cover the same block, no?
> Before Kuniyuki's patch, this 'sack 0:1' was not present.

A SACK that covers a sequence range that was already cumulatively or
selectively acknowledged is legal and useful, and is called a
Duplicate Selective Acknowledgement (DSACK or D-SACK).

A DSACK indicates that a receiver received duplicate data. That can be
very useful in allowing a data sender to determine that a
retransmission was not needed (spurious). If a data sender receives
DSACKs for all retransmitted data in a loss detection episode then it
knows all retransmissions were spurious, and thus it can "undo" its
congestion control reaction to that estimated loss, since the
congestion control algorithm was responding to an incorrect loss
signal. This can be very helpful for performance in the presence of
varying delays or reordering, which can cause spurious loss detection
episodes..

See:

https://datatracker.ietf.org/doc/html/rfc2883
An Extension to the Selective Acknowledgement (SACK) Option for TCP

https://www.rfc-editor.org/rfc/rfc3708.html
"Using TCP Duplicate Selective Acknowledgement (DSACKs) and Stream
Control Transmission Protocol (SCTP) Duplicate  Transmission Sequence
Numbers (TSNs) to Detect Spurious Retransmissions"

neal
Matthieu Baerts July 24, 2024, 8:15 a.m. UTC | #10
Hi Neal,

On 24/07/2024 00:01, Neal Cardwell wrote:
> On Tue, Jul 23, 2024 at 3:09 PM Matthieu Baerts <matttbe@kernel.org> wrote:
>>
>> Hi Eric,
>>
>> On 23/07/2024 18:42, Eric Dumazet wrote:
>>> On Tue, Jul 23, 2024 at 6:08 PM Matthieu Baerts <matttbe@kernel.org> wrote:
>>>>
>>>> Hi Eric,
>>>>
>>>> On 23/07/2024 17:38, Eric Dumazet wrote:
>>>>> On Tue, Jul 23, 2024 at 4:58 PM Matthieu Baerts <matttbe@kernel.org> wrote:
>>>>>>
>>>>>> Hi Eric,
>>>>>>
>>>>>> +cc Neal
>>>>>> -cc Jerry (NoSuchUser)
>>>>>>
>>>>>> On 23/07/2024 16:37, Eric Dumazet wrote:
>>>>>>> On Thu, Jul 18, 2024 at 12:34 PM Matthieu Baerts (NGI0)
>>>>>>> <matttbe@kernel.org> wrote:
>>>>>>>>
> ...
>>>>> +.045 < S. 1234:1234(0) ack 1001 win 14600 <mss
>>>>> 940,nop,nop,sackOK,nop,wscale 6,FO 12345678,nop,nop>
>>>>>    +0 > . 1001:1001(0) ack 1 <nop,nop,sack 0:1>  // See here
>>>>
>>>> I'm sorry, but is it normal to have 'ack 1' with 'sack 0:1' here?
>>>
>>> It is normal, because the SYN was already received/processed.
>>>
>>> sack 0:1 represents this SYN sequence.
>>
>> Thank you for your reply!
>>
>> Maybe it is just me, but does it not look strange to have the SACK
>> covering a segment (0:1) that is before the ACK (1)?
>>
>> 'ack 1' and 'sack 0:1' seem to cover the same block, no?
>> Before Kuniyuki's patch, this 'sack 0:1' was not present.
> 
> A SACK that covers a sequence range that was already cumulatively or
> selectively acknowledged is legal and useful, and is called a
> Duplicate Selective Acknowledgement (DSACK or D-SACK).
> 
> A DSACK indicates that a receiver received duplicate data. That can be
> very useful in allowing a data sender to determine that a
> retransmission was not needed (spurious). If a data sender receives
> DSACKs for all retransmitted data in a loss detection episode then it
> knows all retransmissions were spurious, and thus it can "undo" its
> congestion control reaction to that estimated loss, since the
> congestion control algorithm was responding to an incorrect loss
> signal. This can be very helpful for performance in the presence of
> varying delays or reordering, which can cause spurious loss detection
> episodes..
> 
> See:
> 
> https://datatracker.ietf.org/doc/html/rfc2883
> An Extension to the Selective Acknowledgement (SACK) Option for TCP
> 
> https://www.rfc-editor.org/rfc/rfc3708.html
> "Using TCP Duplicate Selective Acknowledgement (DSACKs) and Stream
> Control Transmission Protocol (SCTP) Duplicate  Transmission Sequence
> Numbers (TSNs) to Detect Spurious Retransmissions"

Thank you for the great explanations!

I was not expecting this in a 3rd ACK :)
I don't know if it will help the congestion control algorithm in this
case, but I guess we don't need to worry about this marginal case.

I will then send the v3, and the Packetdrill modification.

Cheers,
Matt
diff mbox series

Patch

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index ff9ab3d01ced..bfe1bc69dc3e 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -6820,7 +6820,12 @@  tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb)
 		if (sk->sk_shutdown & SEND_SHUTDOWN)
 			tcp_shutdown(sk, SEND_SHUTDOWN);
 
-		if (sk->sk_socket)
+		/* For crossed SYN cases, not to send an unnecessary ACK.
+		 * Note that sk->sk_socket can be assigned in other cases, e.g.
+		 * with TFO (if accept()'ed before the 3rd ACK) and MPTCP (MPJ:
+		 * sk_socket is the parent MPTCP sock).
+		 */
+		if (sk->sk_socket && th->syn)
 			goto consume;
 		break;