
[net,v2] Fix race for duplicate reqsk on identical SYN

Message ID 20240614102628.446642-1-luoxuanqiang@kylinos.cn (mailing list archive)
State Superseded
Delegated to: Netdev Maintainers

Checks

Context Check Description
netdev/series_format success Single patches do not need cover letters
netdev/tree_selection success Clearly marked for net
netdev/ynl success Generated files up to date; no warnings/errors; no diff in generated;
netdev/fixes_present fail Series targets non-next tree, but doesn't contain any Fixes tags
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 2876 this patch: 2876
netdev/build_tools success Errors and warnings before: 0 this patch: 0
netdev/cc_maintainers success CCed 6 of 6 maintainers
netdev/build_clang success Errors and warnings before: 904 this patch: 904
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/deprecated_api success None detected
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes success No Fixes tag
netdev/build_allmodconfig_warn success Errors and warnings before: 3059 this patch: 3059
netdev/checkpatch success total: 0 errors, 0 warnings, 0 checks, 74 lines checked
netdev/build_clang_rust success No Rust files in patch. Skipping build
netdev/kdoc success Errors and warnings before: 4 this patch: 4
netdev/source_inline success Was 0 now: 0
netdev/contest success net-next-2024-06-14--21-00 (tests: 650)

Commit Message

luoxuanqiang June 14, 2024, 10:26 a.m. UTC
When bonding is configured in BOND_MODE_BROADCAST mode, two identical
SYN packets can be received at the same time and processed on different
CPUs. This can end up creating, for the same sk (sock), two different
reqsk (request_sock) in tcp_conn_request().

Each of these reqsk then responds with its own SYNACK, and since the
generation of the seq (ISN) incorporates a timestamp, the two SYNACK
packets end up with different seq values.

The consequence is that when the client receives the earlier SYNACK and
replies with an ACK to it, we reset (RST) the connection.
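
Roughly, the race looks like this (illustrative timeline; it assumes
both SYN packets reach tcp_conn_request() before either reqsk becomes
visible in the ehash table):

    CPU 0                                 CPU 1
    tcp_conn_request()
      inet_reqsk_alloc()  -> reqsk A
                                          tcp_conn_request()
                                            inet_reqsk_alloc()  -> reqsk B
      inet_csk_reqsk_queue_hash_add()
        (inserts reqsk A into ehash)
                                          inet_csk_reqsk_queue_hash_add()
                                            (inserts reqsk B, a duplicate)
      send_synack() with ISN1
                                          send_synack() with ISN2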

========================================================================

This behavior is consistently reproducible in my local setup,
which comprises:

                  | NETA1 ------ NETB1 |
PC_A --- bond --- |                    | --- bond --- PC_B
                  | NETA2 ------ NETB2 |

- PC_A is the Server and has two network cards, NETA1 and NETA2. I have
  bonded these two cards using BOND_MODE_BROADCAST mode and configured
  them to be handled by different CPUs.

- PC_B is the Client, also equipped with two network cards, NETB1 and
  NETB2, which are also bonded and configured in BOND_MODE_BROADCAST mode.

If the client attempts a TCP connection to the server, it might encounter
a failure. Capturing packets from the server side reveals:

10.10.10.10.45182 > localhost: Flags [S], seq 320236027,
10.10.10.10.45182 > localhost: Flags [S], seq 320236027,
localhost > 10.10.10.10.45182: Flags [S.], seq 2967855116,
localhost > 10.10.10.10.45182: Flags [S.], seq 2967855123, <==
10.10.10.10.45182 > localhost: Flags [.], ack 4294967290,
10.10.10.10.45182 > localhost: Flags [.], ack 4294967290,
localhost > 10.10.10.10.45182: Flags [R], seq 2967855117, <==
localhost > 10.10.10.10.45182: Flags [R], seq 2967855117,

Two SYNACKs with different seq numbers are sent by localhost,
resulting in an anomaly.

========================================================================

The proposed solution is as follows:
In tcp_conn_request(), while inserting the reqsk into the ehash table,
also check whether an entry for the same connection already exists. If
one is found, skip the insertion and release the duplicate reqsk.

In addition, in reqsk_queue_hash_req(), the req->rsk_timer is now
started only after a successful insertion.

Signed-off-by: luoxuanqiang <luoxuanqiang@kylinos.cn>
---
 include/net/inet_connection_sock.h |  2 +-
 net/dccp/ipv4.c                    |  2 +-
 net/dccp/ipv6.c                    |  2 +-
 net/ipv4/inet_connection_sock.c    | 15 +++++++++++----
 net/ipv4/tcp_input.c               | 11 ++++++++++-
 5 files changed, 24 insertions(+), 8 deletions(-)

Comments

Florian Westphal June 14, 2024, 10:54 a.m. UTC | #1
luoxuanqiang <luoxuanqiang@kylinos.cn> wrote:
>  include/net/inet_connection_sock.h |  2 +-
>  net/dccp/ipv4.c                    |  2 +-
>  net/dccp/ipv6.c                    |  2 +-
>  net/ipv4/inet_connection_sock.c    | 15 +++++++++++----
>  net/ipv4/tcp_input.c               | 11 ++++++++++-
>  5 files changed, 24 insertions(+), 8 deletions(-)
> 
> diff --git a/include/net/inet_connection_sock.h b/include/net/inet_connection_sock.h
> index 7d6b1254c92d..8773d161d184 100644
> --- a/include/net/inet_connection_sock.h
> +++ b/include/net/inet_connection_sock.h
> @@ -264,7 +264,7 @@ struct sock *inet_csk_reqsk_queue_add(struct sock *sk,
>  				      struct request_sock *req,
>  				      struct sock *child);
>  void inet_csk_reqsk_queue_hash_add(struct sock *sk, struct request_sock *req,
> -				   unsigned long timeout);
> +				   unsigned long timeout, bool *found_dup_sk);

Nit:

I think it would be preferable to change retval to bool rather than
bool *found_dup_sk extra arg, so one can do

bool inet_csk_reqsk_queue_hash_add(struct sock *sk, struct request_sock *req,
  				   unsigned long timeout)
{
	if (!reqsk_queue_hash_req(req, timeout))
		return false;

i.e. let retval indicate whether reqsk was inserted or not.
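
For illustration only, the whole thing could then look roughly like this
(untested sketch; the tcp_conn_request() hunk below is hypothetical):

bool inet_csk_reqsk_queue_hash_add(struct sock *sk, struct request_sock *req,
				   unsigned long timeout)
{
	if (!reqsk_queue_hash_req(req, timeout))
		return false;

	inet_csk_reqsk_queue_added(sk);
	return true;
}

and on the caller side in tcp_conn_request():

		if (!want_cookie) {
			req->timeout = tcp_timeout_init((struct sock *)req);
			if (!inet_csk_reqsk_queue_hash_add(sk, req, req->timeout)) {
				/* an identical reqsk already won the race */
				reqsk_free(req);
				return 0;
			}
		}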

Patch looks good to me otherwise.
luoxuanqiang June 14, 2024, 12:42 p.m. UTC | #2
On 2024/6/14 18:54, Florian Westphal wrote:
> luoxuanqiang <luoxuanqiang@kylinos.cn> wrote:
>>   include/net/inet_connection_sock.h |  2 +-
>>   net/dccp/ipv4.c                    |  2 +-
>>   net/dccp/ipv6.c                    |  2 +-
>>   net/ipv4/inet_connection_sock.c    | 15 +++++++++++----
>>   net/ipv4/tcp_input.c               | 11 ++++++++++-
>>   5 files changed, 24 insertions(+), 8 deletions(-)
>>
>> diff --git a/include/net/inet_connection_sock.h b/include/net/inet_connection_sock.h
>> index 7d6b1254c92d..8773d161d184 100644
>> --- a/include/net/inet_connection_sock.h
>> +++ b/include/net/inet_connection_sock.h
>> @@ -264,7 +264,7 @@ struct sock *inet_csk_reqsk_queue_add(struct sock *sk,
>>   				      struct request_sock *req,
>>   				      struct sock *child);
>>   void inet_csk_reqsk_queue_hash_add(struct sock *sk, struct request_sock *req,
>> -				   unsigned long timeout);
>> +				   unsigned long timeout, bool *found_dup_sk);
> Nit:
>
> I think it would be preferable to change retval to bool rather than
> bool *found_dup_sk extra arg, so one can do
>
> bool inet_csk_reqsk_queue_hash_add(struct sock *sk, struct request_sock *req,
>    				   unsigned long timeout)
> {
> 	if (!reqsk_queue_hash_req(req, timeout))
> 		return false;
>
> i.e. let retval indicate whether reqsk was inserted or not.
>
> Patch looks good to me otherwise.

Thank you for your confirmation!

Regarding your suggestion, I had considered it before,
but besides tcp_conn_request() calling inet_csk_reqsk_queue_hash_add(),
dccp_v4(v6)_conn_request() also calls it. However, there is no
consideration for a failed insertion within that function, so it's
reasonable to let the caller decide whether to check for duplicate
reqsk.

The purpose of my modification this time is solely to confirm if a
reqsk for the same connection has already been inserted into the ehash.
If the insertion fails, inet_ehash_insert() will handle the
non-insertion gracefully, and I only need to release the duplicate
reqsk. I believe this change is minimal and effective.

Those are my considerations.
Kuniyuki Iwashima June 14, 2024, 10:24 p.m. UTC | #3
From: luoxuanqiang <luoxuanqiang@kylinos.cn>
Date: Fri, 14 Jun 2024 20:42:07 +0800
> On 2024/6/14 18:54, Florian Westphal wrote:
> > luoxuanqiang <luoxuanqiang@kylinos.cn> wrote:
> >>   include/net/inet_connection_sock.h |  2 +-
> >>   net/dccp/ipv4.c                    |  2 +-
> >>   net/dccp/ipv6.c                    |  2 +-
> >>   net/ipv4/inet_connection_sock.c    | 15 +++++++++++----
> >>   net/ipv4/tcp_input.c               | 11 ++++++++++-
> >>   5 files changed, 24 insertions(+), 8 deletions(-)
> >>
> >> diff --git a/include/net/inet_connection_sock.h b/include/net/inet_connection_sock.h
> >> index 7d6b1254c92d..8773d161d184 100644
> >> --- a/include/net/inet_connection_sock.h
> >> +++ b/include/net/inet_connection_sock.h
> >> @@ -264,7 +264,7 @@ struct sock *inet_csk_reqsk_queue_add(struct sock *sk,
> >>   				      struct request_sock *req,
> >>   				      struct sock *child);
> >>   void inet_csk_reqsk_queue_hash_add(struct sock *sk, struct request_sock *req,
> >> -				   unsigned long timeout);
> >> +				   unsigned long timeout, bool *found_dup_sk);
> > Nit:
> >
> > I think it would be preferable to change retval to bool rather than
> > bool *found_dup_sk extra arg, so one can do

+1


> >
> > bool inet_csk_reqsk_queue_hash_add(struct sock *sk, struct request_sock *req,
> >    				   unsigned long timeout)
> > {
> > 	if (!reqsk_queue_hash_req(req, timeout))
> > 		return false;
> >
> > i.e. let retval indicate whether reqsk was inserted or not.
> >
> > Patch looks good to me otherwise.
> 
> Thank you for your confirmation!
> 
> Regarding your suggestion, I had considered it before,
> but besides tcp_conn_request() calling inet_csk_reqsk_queue_hash_add(),
> dccp_v4(v6)_conn_request() also calls it. However, there is no
> consideration for a failed insertion within that function, so it's
> reasonable to let the caller decide whether to check for duplicate
> reqsk.

I guess you followed 01770a1661657 where found_dup_sk was introduced,
but note that the commit is specific to TCP SYN Cookie and TCP Fast Open
and DCCP is not related.

Then, own_req is common to TCP and DCCP, so found_dup_sk was added as an
additional argument.

However, another similar commit 5e0724d027f05 actually added own_req check
in DCCP path.

I personally would'nt care if DCCP was not changed to handle such a
failure because DCCP will be removed next year, but I still prefer
Florian's suggestion.

Thanks
Eric Dumazet June 15, 2024, 6:40 a.m. UTC | #4
On Sat, Jun 15, 2024 at 12:24 AM Kuniyuki Iwashima <kuniyu@amazon.com> wrote:
>
> From: luoxuanqiang <luoxuanqiang@kylinos.cn>
> Date: Fri, 14 Jun 2024 20:42:07 +0800
> > On 2024/6/14 18:54, Florian Westphal wrote:
> > > luoxuanqiang <luoxuanqiang@kylinos.cn> wrote:
> > >>   include/net/inet_connection_sock.h |  2 +-
> > >>   net/dccp/ipv4.c                    |  2 +-
> > >>   net/dccp/ipv6.c                    |  2 +-
> > >>   net/ipv4/inet_connection_sock.c    | 15 +++++++++++----
> > >>   net/ipv4/tcp_input.c               | 11 ++++++++++-
> > >>   5 files changed, 24 insertions(+), 8 deletions(-)
> > >>
> > >> diff --git a/include/net/inet_connection_sock.h b/include/net/inet_connection_sock.h
> > >> index 7d6b1254c92d..8773d161d184 100644
> > >> --- a/include/net/inet_connection_sock.h
> > >> +++ b/include/net/inet_connection_sock.h
> > >> @@ -264,7 +264,7 @@ struct sock *inet_csk_reqsk_queue_add(struct sock *sk,
> > >>                                  struct request_sock *req,
> > >>                                  struct sock *child);
> > >>   void inet_csk_reqsk_queue_hash_add(struct sock *sk, struct request_sock *req,
> > >> -                             unsigned long timeout);
> > >> +                             unsigned long timeout, bool *found_dup_sk);
> > > Nit:
> > >
> > > I think it would be preferable to change retval to bool rather than
> > > bool *found_dup_sk extra arg, so one can do
>
> +1
>
>
> > >
> > > bool inet_csk_reqsk_queue_hash_add(struct sock *sk, struct request_sock *req,
> > >                                unsigned long timeout)
> > > {
> > >     if (!reqsk_queue_hash_req(req, timeout))
> > >             return false;
> > >
> > > i.e. let retval indicate whether reqsk was inserted or not.
> > >
> > > Patch looks good to me otherwise.
> >
> > Thank you for your confirmation!
> >
> > Regarding your suggestion, I had considered it before,
> > but besides tcp_conn_request() calling inet_csk_reqsk_queue_hash_add(),
> > dccp_v4(v6)_conn_request() also calls it. However, there is no
> > consideration for a failed insertion within that function, so it's
> > reasonable to let the caller decide whether to check for duplicate
> > reqsk.
>
> I guess you followed 01770a1661657 where found_dup_sk was introduced,
> but note that the commit is specific to TCP SYN Cookie and TCP Fast Open
> and DCCP is not related.
>
> Then, own_req is common to TCP and DCCP, so found_dup_sk was added as an
> additional argument.
>
> However, another similar commit 5e0724d027f05 actually added own_req check
> in DCCP path.
>
> I personally wouldn't care if DCCP was not changed to handle such a
> failure because DCCP will be removed next year, but I still prefer
> Florian's suggestion.
>

Other things to consider:

- I presume this patch targets net tree, and luoxuanqiang needs the
fix to reach stable trees.

- This means a Fixes: tag is needed

- This also means that we should favor a patch with no or trivial
conflicts for stable backports.

Should the patch target the net-next tree, then the requirements can
be different.
alexandre.ferrieux@orange.com June 16, 2024, 11:45 p.m. UTC | #5
On 14/06/2024 12:26, luoxuanqiang wrote:
> When bonding is configured in BOND_MODE_BROADCAST mode, if two identical
> SYN packets are received at the same time and processed on different CPUs,
> it can potentially create the same sk (sock) but two different reqsk
> (request_sock) in tcp_conn_request().
>
> These two different reqsk will respond with two SYNACK packets, and since
> the generation of the seq (ISN) incorporates a timestamp, the final two
> SYNACK packets will have different seq values.
>
> The consequence is that when the Client receives and replies with an ACK
> to the earlier SYNACK packet, we will reset(RST) it.
>
> ========================================================================
This is close, but not identical, to a race we observed on a *single* CPU with
the TPROXY iptables target, in the following situation:

  - two identical SYNs, sent one second apart from the same client socket,
    arrive back-to-back on the interface (due to network jitter)

  - they happen to be handled in the same batch of packets within one
    softirq run of name_your_nic_poll()

  - there, two loops run sequentially: one for netfilter (doing TPROXY), one
    for the network stack (doing TCP processing)

  - the first generates two distinct contexts for the two SYNs

  - the second respects these contexts and never gets a chance to merge them

The result is exactly as you describe, but in this case there is no need
for bonding, and everything happens on one single CPU, which is pretty
ironic for a race.
My uneducated feeling is that the two loops are the cause of a simulated
parallelism, yielding the race. If each packet of the batch was handled
"to completion" (full netfilter handling followed immediately by full network
stack ingestion), the problem would not exist.
luoxuanqiang June 17, 2024, 2:01 a.m. UTC | #6
On 2024/6/15 14:40, Eric Dumazet wrote:
> On Sat, Jun 15, 2024 at 12:24 AM Kuniyuki Iwashima <kuniyu@amazon.com> wrote:
>> From: luoxuanqiang <luoxuanqiang@kylinos.cn>
>> Date: Fri, 14 Jun 2024 20:42:07 +0800
>>> On 2024/6/14 18:54, Florian Westphal wrote:
>>>> luoxuanqiang <luoxuanqiang@kylinos.cn> wrote:
>>>>>    include/net/inet_connection_sock.h |  2 +-
>>>>>    net/dccp/ipv4.c                    |  2 +-
>>>>>    net/dccp/ipv6.c                    |  2 +-
>>>>>    net/ipv4/inet_connection_sock.c    | 15 +++++++++++----
>>>>>    net/ipv4/tcp_input.c               | 11 ++++++++++-
>>>>>    5 files changed, 24 insertions(+), 8 deletions(-)
>>>>>
>>>>> diff --git a/include/net/inet_connection_sock.h b/include/net/inet_connection_sock.h
>>>>> index 7d6b1254c92d..8773d161d184 100644
>>>>> --- a/include/net/inet_connection_sock.h
>>>>> +++ b/include/net/inet_connection_sock.h
>>>>> @@ -264,7 +264,7 @@ struct sock *inet_csk_reqsk_queue_add(struct sock *sk,
>>>>>                                   struct request_sock *req,
>>>>>                                   struct sock *child);
>>>>>    void inet_csk_reqsk_queue_hash_add(struct sock *sk, struct request_sock *req,
>>>>> -                             unsigned long timeout);
>>>>> +                             unsigned long timeout, bool *found_dup_sk);
>>>> Nit:
>>>>
>>>> I think it would be preferable to change retval to bool rather than
>>>> bool *found_dup_sk extra arg, so one can do
>> +1
>>
>>
>>>> bool inet_csk_reqsk_queue_hash_add(struct sock *sk, struct request_sock *req,
>>>>                                 unsigned long timeout)
>>>> {
>>>>      if (!reqsk_queue_hash_req(req, timeout))
>>>>              return false;
>>>>
>>>> i.e. let retval indicate whether reqsk was inserted or not.
>>>>
>>>> Patch looks good to me otherwise.
>>> Thank you for your confirmation!
>>>
>>> Regarding your suggestion, I had considered it before,
>>> but besides tcp_conn_request() calling inet_csk_reqsk_queue_hash_add(),
>>> dccp_v4(v6)_conn_request() also calls it. However, there is no
>>> consideration for a failed insertion within that function, so it's
>>> reasonable to let the caller decide whether to check for duplicate
>>> reqsk.
>> I guess you followed 01770a1661657 where found_dup_sk was introduced,
>> but note that the commit is specific to TCP SYN Cookie and TCP Fast Open
>> and DCCP is not related.
>>
>> Then, own_req is common to TCP and DCCP, so found_dup_sk was added as an
>> additional argument.
>>
>> However, another similar commit 5e0724d027f05 actually added own_req check
>> in DCCP path.
>>
>> I personally wouldn't care if DCCP was not changed to handle such a
>> failure because DCCP will be removed next year, but I still prefer
>> Florian's suggestion.
>>
> Other things to consider :
>
> - I presume this patch targets net tree, and luoxuanqiang needs the
> fix to reach stable trees.
>
> - This means a Fixes: tag is needed
>
> - This also means that we should favor a patch with no or trivial
> conflicts for stable backports.
>
> Should the patch target the net-next tree, then the requirements can
> be different.

Hello Eric and Kuniyuki,

Thank you for the information!

I've tested kernel versions 4.19 and 6.10, and both show a similar
issue (I suspect this problem has been around for quite some time). My
intention is to get a fix into the more stable branches as soon as
possible to cover a wider range of kernels. As Eric mentioned, I want
to minimize conflicts, so I plan to keep the original DCCP logic intact
and follow the found_dup_sk check from 01770a1661657. For DCCP, if
insertion into the ehash fails, we might also need to consider handling
rsk_refcnt, as tcp_conn_request() requires rsk_refcnt to be 0 to release
the reqsk.

Of course, if DCCP is going to be removed from net-next, I agree with
Kuniyuki's and Florian's suggestions and will put together a better
version of the patch.

BRs!
luoxuanqiang June 17, 2024, 2:53 a.m. UTC | #7
On 2024/6/17 07:45, alexandre.ferrieux@orange.com wrote:
> On 14/06/2024 12:26, luoxuanqiang wrote:
>> When bonding is configured in BOND_MODE_BROADCAST mode, if two identical
>> SYN packets are received at the same time and processed on different 
>> CPUs,
>> it can potentially create the same sk (sock) but two different reqsk
>> (request_sock) in tcp_conn_request().
>>
>> These two different reqsk will respond with two SYNACK packets, and 
>> since
>> the generation of the seq (ISN) incorporates a timestamp, the final two
>> SYNACK packets will have different seq values.
>>
>> The consequence is that when the Client receives and replies with an ACK
>> to the earlier SYNACK packet, we will reset(RST) it.
>>
>> ========================================================================
> This is close, but not identical, to a race we observed on a *single* 
> CPU with
> the TPROXY iptables target, in the following situation:
>
>  - two identical SYNs, sent one second apart from the same client socket,
>    arrive back-to-back on the interface (due to network jitter)
>
>  - they happen to be handled in the same batch of packet from one softirq
>    name_your_nic_poll()
>
>  - there, two loops run sequentially: one for netfilter (doing 
> TPROXY), one
>    for the network stack (doing TCP processing)
>
>  - the first generates two distinct contexts for the two SYNs
>
>  - the second respects these contexts and never gets a chance to merge 
> them
>
> The result is exactly as you describe, but in this case there is no 
> need for bonding,
> and everything happens in one single CPU, which is pretty ironic for a 
> race.
> My uneducated feeling is that the two loops are the cause of a simulated
> parallelism, yielding the race. If each packet of the batch was handled
> "to completion" (full netfilter handling followed immediately by full 
> network
> stack ingestion), the problem would not exist.

Based on your explanation, I believe a similar issue can occur on a
single CPU if two SYN packets are processed closely enough. However,
apart from using bonding mode 3 (broadcast) and having them processed
on different CPUs to make it easier to reproduce, I haven't found a
good way to replicate it.

Could you please provide a more practical example or detailed test
steps to help me understand the reproduction scenario you mentioned?
Thank you very much!
luoxuanqiang June 17, 2024, 8:07 a.m. UTC | #8
On 2024/6/15 14:40, Eric Dumazet wrote:
> On Sat, Jun 15, 2024 at 12:24 AM Kuniyuki Iwashima <kuniyu@amazon.com> wrote:
> Hello Eric and Kuniyuki,
>> From: luoxuanqiang <luoxuanqiang@kylinos.cn>
>> Date: Fri, 14 Jun 2024 20:42:07 +0800
>>> On 2024/6/14 18:54, Florian Westphal wrote:
>>>> luoxuanqiang <luoxuanqiang@kylinos.cn> wrote:
>>>>>    include/net/inet_connection_sock.h |  2 +-
>>>>>    net/dccp/ipv4.c                    |  2 +-
>>>>>    net/dccp/ipv6.c                    |  2 +-
>>>>>    net/ipv4/inet_connection_sock.c    | 15 +++++++++++----
>>>>>    net/ipv4/tcp_input.c               | 11 ++++++++++-
>>>>>    5 files changed, 24 insertions(+), 8 deletions(-)
>>>>>
>>>>> diff --git a/include/net/inet_connection_sock.h b/include/net/inet_connection_sock.h
>>>>> index 7d6b1254c92d..8773d161d184 100644
>>>>> --- a/include/net/inet_connection_sock.h
>>>>> +++ b/include/net/inet_connection_sock.h
>>>>> @@ -264,7 +264,7 @@ struct sock *inet_csk_reqsk_queue_add(struct sock *sk,
>>>>>                                   struct request_sock *req,
>>>>>                                   struct sock *child);
>>>>>    void inet_csk_reqsk_queue_hash_add(struct sock *sk, struct request_sock *req,
>>>>> -                             unsigned long timeout);
>>>>> +                             unsigned long timeout, bool *found_dup_sk);
>>>> Nit:
>>>>
>>>> I think it would be preferable to change retval to bool rather than
>>>> bool *found_dup_sk extra arg, so one can do
>> +1
>>
>>
>>>> bool inet_csk_reqsk_queue_hash_add(struct sock *sk, struct request_sock *req,
>>>>                                 unsigned long timeout)
>>>> {
>>>>      if (!reqsk_queue_hash_req(req, timeout))
>>>>              return false;
>>>>
>>>> i.e. let retval indicate whether reqsk was inserted or not.
>>>>
>>>> Patch looks good to me otherwise.
>>> Thank you for your confirmation!
>>>
>>> Regarding your suggestion, I had considered it before,
>>> but besides tcp_conn_request() calling inet_csk_reqsk_queue_hash_add(),
>>> dccp_v4(v6)_conn_request() also calls it. However, there is no
>>> consideration for a failed insertion within that function, so it's
>>> reasonable to let the caller decide whether to check for duplicate
>>> reqsk.
>> I guess you followed 01770a1661657 where found_dup_sk was introduced,
>> but note that the commit is specific to TCP SYN Cookie and TCP Fast Open
>> and DCCP is not related.
>>
>> Then, own_req is common to TCP and DCCP, so found_dup_sk was added as an
>> additional argument.
>>
>> However, another similar commit 5e0724d027f05 actually added own_req check
>> in DCCP path.
>>
>> I personally wouldn't care if DCCP was not changed to handle such a
>> failure because DCCP will be removed next year, but I still prefer
>> Florian's suggestion.
>>
> Other things to consider :
>
> - I presume this patch targets net tree, and luoxuanqiang needs the
> fix to reach stable trees.
>
> - This means a Fixes: tag is needed
>
> - This also means that we should favor a patch with no or trivial
> conflicts for stable backports.
>
> Should the patch target the net-next tree, then the requirements can
> be different.

Hi Kuniyuki and Florian,

I've created version 3 based on your suggestions, but I've kept the use
of 'found_dup_sk' since we need to pass NULL in DCCP to maintain its
logic unchanged. Could you please review this update and let me know if
it's okay? Thank you!

BRs!
alexandre.ferrieux@orange.com June 17, 2024, 2:44 p.m. UTC | #9
On 17/06/2024 04:53, luoxuanqiang wrote:
> 
> On 2024/6/17 07:45, alexandre.ferrieux@orange.com wrote:
>> On 14/06/2024 12:26, luoxuanqiang wrote:
>>> When bonding is configured in BOND_MODE_BROADCAST mode, if two identical
>>> SYN packets are received at the same time and processed on different CPUs,
>>> it can potentially create the same sk (sock) but two different reqsk
>>> (request_sock) in tcp_conn_request().
>>>
>>> These two different reqsk will respond with two SYNACK packets, and since
>>> the generation of the seq (ISN) incorporates a timestamp, the final two
>>> SYNACK packets will have different seq values.
>>>
>>> The consequence is that when the Client receives and replies with an ACK
>>> to the earlier SYNACK packet, we will reset(RST) it.
>>>
>>> ========================================================================
>> This is close, but not identical, to a race we observed on a *single* CPU with
>> the TPROXY iptables target, in the following situation:
>>
>>  - two identical SYNs, sent one second apart from the same client socket,
>>    arrive back-to-back on the interface (due to network jitter)
>>
>>  - they happen to be handled in the same batch of packet from one softirq
>>    name_your_nic_poll()
>>
>>  - there, two loops run sequentially: one for netfilter (doing TPROXY), one
>>    for the network stack (doing TCP processing)
>>
>>  - the first generates two distinct contexts for the two SYNs
>>
>>  - the second respects these contexts and never gets a chance to merge them
>>
>> The result is exactly as you describe, but in this case there is no need for 
>> bonding,
>> and everything happens in one single CPU, which is pretty ironic for a race.
>> My uneducated feeling is that the two loops are the cause of a simulated
>> parallelism, yielding the race. If each packet of the batch was handled
>> "to completion" (full netfilter handling followed immediately by full network
>> stack ingestion), the problem would not exist.
> 
> Based on your explanation, I believe a
> similar issue can occur on a single CPU if two SYN packets are processed
>   closely enough. However, apart from using bond3 mode and having them
> processed on different CPUs to facilitate reproducibility, I haven't
> found a good way to replicate it.
> 
> Could you please provide a more practical example or detailed test
> steps to help me understand the reproduction scenario you mentioned?
> Thank you very much!

To reproduce in my case, I just need the two SYNs to arrive back-to-back in the 
ingress buffer and get in the same softirq run. To reach this goal easily, you 
can set the interrupt coalescence to a large value (like several milliseconds), 
and on the emitter side, send them in rapid sequence from userland. If that's 
not enough, you can just send one and duplicate it with TEE.

Then, if the packets are naturally aimed at the host (normal INPUT chain), I 
can't see the problem (as could be expected as 99.9999% of webservers do just 
this). Quite clearly, tcp_v4_rcv() does a good job in this case and is able to 
link the second packet to the context (reqsk?) of the first one.

But, if packets are "in transit", in a transparent proxying context, with the 
TPROXY target doing a redirection to the local listener, the race happens 
deterministically: I've even managed to squeeze 6 or 7 duplicate packets in the 
softirq run, and all of them get a different ISN !!!

In summary, the minimal setup is just:

    - ethtool -C $ITF rx-usecs 30000
    - a listener bound on port $PO
    - iptables -t mangle -A PREROUTING -i $ITF -p tcp -j TPROXY --on-port $PO --on-ip 0.0.0.0


And to get to the specifics, I have the impression that in ip_sublist_rcv(),
the "two-pass" method of calling first NF_HOOK_LIST(), then 
ip_list_rcv_finish(), gets confused as the first pass (NF_HOOK_LIST calling 
TPROXY) does a "half-job" of attaching a context, making all of them different, 
while the second pass retrieves these contexts and doesn't try to "merge" them 
when needed:

   static void ip_sublist_rcv(...)
   {
     NF_HOOK_LIST(NFPROTO_IPV4, NF_INET_PRE_ROUTING, net, NULL,
                  head, dev, NULL, ip_rcv_finish);
     ip_list_rcv_finish(net, NULL, head);
   }

My naive impression is that reducing the "batch" size to 1 would do the job.
In other words, "run each packet to completion", netfilter *and* ip_*_rcv.
But I lack the vision of the big picture leading to the current choice.
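
To make that concrete, here is a very rough and untested sketch of such
a per-packet variant (the function name is made up; everything else is
simplified from the ipv4 list-receive path):

static void ip_sublist_rcv_one_by_one(struct list_head *head,
				      struct net_device *dev,
				      struct net *net)
{
	struct sk_buff *skb, *next;

	list_for_each_entry_safe(skb, next, head, list) {
		skb_list_del_init(skb);
		/* Full netfilter traversal for this one packet, followed
		 * immediately by stack ingestion via ip_rcv_finish(), so
		 * the second SYN already sees whatever the first inserted.
		 */
		NF_HOOK(NFPROTO_IPV4, NF_INET_PRE_ROUTING, net, NULL,
			skb, dev, NULL, ip_rcv_finish);
	}
}
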
Thanks in advance for shedding light on this :)

-Alex
alexandre.ferrieux@orange.com Aug. 25, 2024, 12:24 p.m. UTC | #10
On 17/06/2024 16:44, Alexandre Ferrieux wrote:
> On 17/06/2024 04:53, luoxuanqiang wrote:
>>
>> On 2024/6/17 07:45, alexandre.ferrieux@orange.com wrote:
>>> On 14/06/2024 12:26, luoxuanqiang wrote:
>>>> When bonding is configured in BOND_MODE_BROADCAST mode, if two identical
>>>> SYN packets are received at the same time and processed on different CPUs,
>>>> it can potentially create the same sk (sock) but two different reqsk
>>>> (request_sock) in tcp_conn_request().
>>>>
>>>> These two different reqsk will respond with two SYNACK packets, and since
>>>> the generation of the seq (ISN) incorporates a timestamp, the final two
>>>> SYNACK packets will have different seq values.
>>>>
>>>> The consequence is that when the Client receives and replies with an ACK
>>>> to the earlier SYNACK packet, we will reset(RST) it.
>>>>
>>>> ========================================================================
>>> This is close, but not identical, to a race we observed on a *single* CPU with
>>> the TPROXY iptables target, in the following situation:
>>>
>>>  - two identical SYNs, sent one second apart from the same client socket,
>>>    arrive back-to-back on the interface (due to network jitter)
>>>
>>>  - they happen to be handled in the same batch of packet from one softirq
>>>    name_your_nic_poll()
>>>
>>>  - there, two loops run sequentially: one for netfilter (doing TPROXY), one
>>>    for the network stack (doing TCP processing)
>>>
>>>  - the first generates two distinct contexts for the two SYNs
>>>
>>>  - the second respects these contexts and never gets a chance to merge them
>>>
>>> The result is exactly as you describe, but in this case there is no need for 
>>> bonding,
>>> and everything happens in one single CPU, which is pretty ironic for a race.
>>> My uneducated feeling is that the two loops are the cause of a simulated
>>> parallelism, yielding the race. If each packet of the batch was handled
>>> "to completion" (full netfilter handling followed immediately by full network
>>> stack ingestion), the problem would not exist.
>>
>> Based on your explanation, I believe a
>> similar issue can occur on a single CPU if two SYN packets are processed
>>   closely enough. However, apart from using bond3 mode and having them
>> processed on different CPUs to facilitate reproducibility, I haven't
>> found a good way to replicate it.
>>
>> Could you please provide a more practical example or detailed test
>> steps to help me understand the reproduction scenario you mentioned?
>> Thank you very much!
> 
> To reproduce in my case, I just need the two SYNs to arrive back-to-back in the 
> ingress buffer and get in the same softirq run. To reach this goal easily, you 
> can set the interrupt coalescence to a large value (like several milliseconds), 
> and on the emitter side, send them in rapid sequence from userland. If that's 
> not enough, you can just send one and duplicate it with TEE.

Good news: as I suspected, your fix (ff46e3b44219, shipped in 6.10) DOES
solve my problem too!

As a consequence, this means the single-CPU scenario was exposed too, through 
netfilter's peculiar "breadth-first" iteration approach. This gives extra weight 
to the importance of your work.

TL;DR: thanks, kudos, congrats, and thanks !!!

Patch

diff --git a/include/net/inet_connection_sock.h b/include/net/inet_connection_sock.h
index 7d6b1254c92d..8773d161d184 100644
--- a/include/net/inet_connection_sock.h
+++ b/include/net/inet_connection_sock.h
@@ -264,7 +264,7 @@  struct sock *inet_csk_reqsk_queue_add(struct sock *sk,
 				      struct request_sock *req,
 				      struct sock *child);
 void inet_csk_reqsk_queue_hash_add(struct sock *sk, struct request_sock *req,
-				   unsigned long timeout);
+				   unsigned long timeout, bool *found_dup_sk);
 struct sock *inet_csk_complete_hashdance(struct sock *sk, struct sock *child,
 					 struct request_sock *req,
 					 bool own_req);
diff --git a/net/dccp/ipv4.c b/net/dccp/ipv4.c
index ff41bd6f99c3..13aafdeb9205 100644
--- a/net/dccp/ipv4.c
+++ b/net/dccp/ipv4.c
@@ -657,7 +657,7 @@  int dccp_v4_conn_request(struct sock *sk, struct sk_buff *skb)
 	if (dccp_v4_send_response(sk, req))
 		goto drop_and_free;
 
-	inet_csk_reqsk_queue_hash_add(sk, req, DCCP_TIMEOUT_INIT);
+	inet_csk_reqsk_queue_hash_add(sk, req, DCCP_TIMEOUT_INIT, NULL);
 	reqsk_put(req);
 	return 0;
 
diff --git a/net/dccp/ipv6.c b/net/dccp/ipv6.c
index 85f4b8fdbe5e..493cdb12ce2b 100644
--- a/net/dccp/ipv6.c
+++ b/net/dccp/ipv6.c
@@ -400,7 +400,7 @@  static int dccp_v6_conn_request(struct sock *sk, struct sk_buff *skb)
 	if (dccp_v6_send_response(sk, req))
 		goto drop_and_free;
 
-	inet_csk_reqsk_queue_hash_add(sk, req, DCCP_TIMEOUT_INIT);
+	inet_csk_reqsk_queue_hash_add(sk, req, DCCP_TIMEOUT_INIT, NULL);
 	reqsk_put(req);
 	return 0;
 
diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
index d81f74ce0f02..045d0701acfd 100644
--- a/net/ipv4/inet_connection_sock.c
+++ b/net/ipv4/inet_connection_sock.c
@@ -1123,12 +1123,16 @@  static void reqsk_timer_handler(struct timer_list *t)
 }
 
 static void reqsk_queue_hash_req(struct request_sock *req,
-				 unsigned long timeout)
+				 unsigned long timeout, bool *found_dup_sk)
 {
+	inet_ehash_insert(req_to_sk(req), NULL, found_dup_sk);
+	if (found_dup_sk && *found_dup_sk)
+		return;
+
+	/* The timer needs to be set up after a successful insertion. */
 	timer_setup(&req->rsk_timer, reqsk_timer_handler, TIMER_PINNED);
 	mod_timer(&req->rsk_timer, jiffies + timeout);
 
-	inet_ehash_insert(req_to_sk(req), NULL, NULL);
 	/* before letting lookups find us, make sure all req fields
 	 * are committed to memory and refcnt initialized.
 	 */
@@ -1137,9 +1141,12 @@  static void reqsk_queue_hash_req(struct request_sock *req,
 }
 
 void inet_csk_reqsk_queue_hash_add(struct sock *sk, struct request_sock *req,
-				   unsigned long timeout)
+				   unsigned long timeout, bool *found_dup_sk)
 {
-	reqsk_queue_hash_req(req, timeout);
+	reqsk_queue_hash_req(req, timeout, found_dup_sk);
+	if (found_dup_sk && *found_dup_sk)
+		return;
+
 	inet_csk_reqsk_queue_added(sk);
 }
 EXPORT_SYMBOL_GPL(inet_csk_reqsk_queue_hash_add);
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 9c04a9c8be9d..49876477c2b9 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -7255,8 +7255,17 @@  int tcp_conn_request(struct request_sock_ops *rsk_ops,
 	} else {
 		tcp_rsk(req)->tfo_listener = false;
 		if (!want_cookie) {
+			bool found_dup_sk = false;
+
 			req->timeout = tcp_timeout_init((struct sock *)req);
-			inet_csk_reqsk_queue_hash_add(sk, req, req->timeout);
+			inet_csk_reqsk_queue_hash_add(sk, req, req->timeout,
+						      &found_dup_sk);
+
+			if (unlikely(found_dup_sk)) {
+				reqsk_free(req);
+				return 0;
+			}
+
 		}
 		af_ops->send_synack(sk, dst, &fl, req, &foc,
 				    !want_cookie ? TCP_SYNACK_NORMAL :