Message ID | 20240614102628.446642-1-luoxuanqiang@kylinos.cn (mailing list archive)
---|---
State | Superseded
Delegated to: | Netdev Maintainers
Series | [net,v2] Fix race for duplicate reqsk on identical SYN
luoxuanqiang <luoxuanqiang@kylinos.cn> wrote:
>  include/net/inet_connection_sock.h |  2 +-
>  net/dccp/ipv4.c                    |  2 +-
>  net/dccp/ipv6.c                    |  2 +-
>  net/ipv4/inet_connection_sock.c    | 15 +++++++++++----
>  net/ipv4/tcp_input.c               | 11 ++++++++++-
>  5 files changed, 24 insertions(+), 8 deletions(-)
>
> diff --git a/include/net/inet_connection_sock.h b/include/net/inet_connection_sock.h
> index 7d6b1254c92d..8773d161d184 100644
> --- a/include/net/inet_connection_sock.h
> +++ b/include/net/inet_connection_sock.h
> @@ -264,7 +264,7 @@ struct sock *inet_csk_reqsk_queue_add(struct sock *sk,
>  				      struct request_sock *req,
>  				      struct sock *child);
>  void inet_csk_reqsk_queue_hash_add(struct sock *sk, struct request_sock *req,
> -				   unsigned long timeout);
> +				   unsigned long timeout, bool *found_dup_sk);

Nit:

I think it would be preferable to change the retval to bool rather than
adding the extra bool *found_dup_sk argument, so one can do

bool inet_csk_reqsk_queue_hash_add(struct sock *sk, struct request_sock *req,
				   unsigned long timeout)
{
	if (!reqsk_queue_hash_req(req, timeout))
		return false;

i.e. let the retval indicate whether the reqsk was inserted or not.

Patch looks good to me otherwise.
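As a minimal userspace sketch of the return-value shape Florian suggests (all structures and the hash here are toy stand-ins, not the kernel's API):

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for the kernel's request_sock; only the lookup key. */
struct request_sock { unsigned int hash_key; };

#define EHASH_SIZE 16
static struct request_sock *ehash[EHASH_SIZE];

/* Models inet_ehash_insert(): returns false when an entry with the
 * same key (a duplicate reqsk for the same 4-tuple) is already hashed. */
static bool reqsk_queue_hash_req(struct request_sock *req)
{
	unsigned int slot = req->hash_key % EHASH_SIZE;

	if (ehash[slot])	/* duplicate already present: do not insert */
		return false;
	ehash[slot] = req;	/* insert; arming rsk_timer would follow here */
	return true;
}

/* Florian's suggested shape: the return value says whether the reqsk
 * was inserted, instead of a bool *found_dup_sk out-argument. */
static bool inet_csk_reqsk_queue_hash_add(struct request_sock *req)
{
	if (!reqsk_queue_hash_req(req))
		return false;
	/* inet_csk_reqsk_queue_added(sk) would go here */
	return true;
}
```

With this shape, a caller such as tcp_conn_request() can simply free the duplicate reqsk when the call returns false, and callers that do not care (the DCCP paths) can ignore the return value.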
On 2024/6/14 18:54, Florian Westphal wrote:
> I think it would be preferable to change the retval to bool rather than
> adding the extra bool *found_dup_sk argument [...]
> i.e. let the retval indicate whether the reqsk was inserted or not.
>
> Patch looks good to me otherwise.

Thank you for your confirmation!

Regarding your suggestion, I had considered it before, but besides
tcp_conn_request() calling inet_csk_reqsk_queue_hash_add(),
dccp_v4(v6)_conn_request() also calls it. However, that function does
not handle a failed insertion, so it's reasonable to let the caller
decide whether to check for a duplicate reqsk.

The purpose of my modification this time is solely to confirm whether a
reqsk for the same connection has already been inserted into the ehash.
If the insertion fails, inet_ehash_insert() will handle the
non-insertion gracefully, and I only need to release the duplicate
reqsk. I believe this change is minimal and effective.

Those are my considerations.
From: luoxuanqiang <luoxuanqiang@kylinos.cn>
Date: Fri, 14 Jun 2024 20:42:07 +0800
> On 2024/6/14 18:54, Florian Westphal wrote:
>> I think it would be preferable to change the retval to bool rather
>> than adding the extra bool *found_dup_sk argument, so one can do

+1

>> i.e. let the retval indicate whether the reqsk was inserted or not.
>>
>> Patch looks good to me otherwise.
>
> Regarding your suggestion, I had considered it before, but besides
> tcp_conn_request() calling inet_csk_reqsk_queue_hash_add(),
> dccp_v4(v6)_conn_request() also calls it. However, that function does
> not handle a failed insertion, so it's reasonable to let the caller
> decide whether to check for a duplicate reqsk.

I guess you followed 01770a1661657, where found_dup_sk was introduced,
but note that the commit is specific to TCP SYN cookies and TCP Fast
Open, and DCCP is not related.

Then, own_req is common to TCP and DCCP, so found_dup_sk was added as an
additional argument. However, another similar commit, 5e0724d027f05,
actually added the own_req check in the DCCP path.

I personally wouldn't care if DCCP were not changed to handle such a
failure, because DCCP will be removed next year, but I still prefer
Florian's suggestion.

Thanks
On Sat, Jun 15, 2024 at 12:24 AM Kuniyuki Iwashima <kuniyu@amazon.com> wrote:
> I guess you followed 01770a1661657, where found_dup_sk was introduced,
> but note that the commit is specific to TCP SYN cookies and TCP Fast
> Open, and DCCP is not related.
> [...]
> I personally wouldn't care if DCCP were not changed to handle such a
> failure, because DCCP will be removed next year, but I still prefer
> Florian's suggestion.

Other things to consider:

- I presume this patch targets the net tree, and luoxuanqiang needs the
  fix to reach the stable trees.

- This means a Fixes: tag is needed.

- This also means that we should favor a patch with no or trivial
  conflicts for stable backports.

Should the patch target the net-next tree, then the requirements can be
different.
On 14/06/2024 12:26, luoxuanqiang wrote:
> When bonding is configured in BOND_MODE_BROADCAST mode, if two
> identical SYN packets are received at the same time and processed on
> different CPUs, it can potentially create the same sk (sock) but two
> different reqsk (request_sock) in tcp_conn_request().
>
> These two different reqsk will respond with two SYNACK packets, and
> since the generation of the seq (ISN) incorporates a timestamp, the
> final two SYNACK packets will have different seq values.
>
> The consequence is that when the client receives and replies with an
> ACK to the earlier SYNACK packet, we will reset (RST) it.

This is close, but not identical, to a race we observed on a *single*
CPU with the TPROXY iptables target, in the following situation:

- two identical SYNs, sent one second apart from the same client socket,
  arrive back-to-back on the interface (due to network jitter);

- they happen to be handled in the same batch of packets from one
  softirq name_your_nic_poll();

- there, two loops run sequentially: one for netfilter (doing TPROXY),
  one for the network stack (doing TCP processing);

- the first generates two distinct contexts for the two SYNs;

- the second respects these contexts and never gets a chance to merge
  them.

The result is exactly as you describe, but in this case there is no need
for bonding, and everything happens on one single CPU, which is pretty
ironic for a race. My uneducated feeling is that the two loops are the
cause of a simulated parallelism, yielding the race. If each packet of
the batch was handled "to completion" (full netfilter handling followed
immediately by full network stack ingestion), the problem would not
exist.
On 2024/6/15 14:40, Eric Dumazet wrote:
> Other things to consider:
>
> - I presume this patch targets the net tree, and luoxuanqiang needs
>   the fix to reach the stable trees.
>
> - This means a Fixes: tag is needed.
>
> - This also means that we should favor a patch with no or trivial
>   conflicts for stable backports.
>
> Should the patch target the net-next tree, then the requirements can
> be different.

Hello Eric and Kuniyuki,

Thank you for the information! I've tested kernel versions 4.19 and
6.10, and they both have similar issues (I suspect this problem has been
around for quite some time). My intention is to propose a fix to the
more stable branches as soon as possible, to cover a wider range. As
Eric mentioned, I hope to minimize conflicts, so I expect to keep the
original DCCP logic intact and refer to the check for found_dup_sk in
01770a1661657.

For DCCP, if insertion into the ehash fails, we might also need to
consider handling rsk_refcnt, as tcp_conn_request() requires rsk_refcnt
to be 0 to release the reqsk. Of course, if DCCP will be removed from
net-next, I agree with Kuniyuki's and Florian's suggestions and will
work out a better commit.

BRs!
On 2024/6/17 07:45, alexandre.ferrieux@orange.com wrote:
> This is close, but not identical, to a race we observed on a *single*
> CPU with the TPROXY iptables target [...]
>
> The result is exactly as you describe, but in this case there is no
> need for bonding, and everything happens on one single CPU, which is
> pretty ironic for a race. My uneducated feeling is that the two loops
> are the cause of a simulated parallelism, yielding the race. If each
> packet of the batch was handled "to completion" (full netfilter
> handling followed immediately by full network stack ingestion), the
> problem would not exist.

Based on your explanation, I believe a similar issue can occur on a
single CPU if two SYN packets are processed closely enough. However,
apart from using bonding mode 3 (broadcast) and having them processed on
different CPUs to facilitate reproducibility, I haven't found a good way
to replicate it.

Could you please provide a more practical example or detailed test steps
to help me understand the reproduction scenario you mentioned? Thank you
very much!
On 2024/6/15 14:40, Eric Dumazet wrote:
> Other things to consider:
>
> - I presume this patch targets the net tree, and luoxuanqiang needs
>   the fix to reach the stable trees.
> [...]
> Should the patch target the net-next tree, then the requirements can
> be different.

Hi Kuniyuki and Florian,

I've created version 3 based on your suggestions, but I've kept the use
of 'found_dup_sk', since we need to pass NULL in DCCP to keep its logic
unchanged. Could you please review this update and let me know if it's
okay? Thank you!

BRs!
On 17/06/2024 04:53, luoxuanqiang wrote:
> Based on your explanation, I believe a similar issue can occur on a
> single CPU if two SYN packets are processed closely enough. [...]
>
> Could you please provide a more practical example or detailed test
> steps to help me understand the reproduction scenario you mentioned?

To reproduce in my case, I just need the two SYNs to arrive back-to-back
in the ingress buffer and get into the same softirq run. To reach this
goal easily, you can set the interrupt coalescence to a large value
(like several milliseconds) and, on the emitter side, send them in rapid
sequence from userland. If that's not enough, you can just send one and
duplicate it with TEE.

Then, if the packets are naturally aimed at the host (normal INPUT
chain), I can't see the problem (as could be expected, since 99.9999% of
webservers do just this). Quite clearly, tcp_v4_rcv() does a good job in
this case and is able to link the second packet to the context (reqsk?)
of the first one.

But if packets are "in transit", in a transparent-proxying context, with
the TPROXY target doing a redirection to the local listener, the race
happens deterministically: I've even managed to squeeze 6 or 7 duplicate
packets into the softirq run, and all of them get a different ISN!

In summary, the minimal setup is just:

- ethtool -C $ITF rx-usecs 30000
- a listener bound on port $PO
- iptables -t mangle -A PREROUTING -i $ITF -p tcp -j TPROXY --on-port $PO --on-ip 0.0.0.0

And to get to the specifics, I have the impression that in
ip_sublist_rcv(), the "two-pass" method of calling first NF_HOOK_LIST(),
then ip_list_rcv_finish(), gets confused, as the first pass
(NF_HOOK_LIST calling TPROXY) does a "half-job" of attaching a context,
making all of them different, while the second pass retrieves these
contexts and doesn't try to "merge" them when needed:

static void ip_sublist_rcv(...)
{
	NF_HOOK_LIST(NFPROTO_IPV4, NF_INET_PRE_ROUTING, net, NULL,
		     head, dev, NULL, ip_rcv_finish);
	ip_list_rcv_finish(net, NULL, head);
}

My naive impression is that reducing the "batch" size to 1 would do the
job. In other words, "run each packet to completion", netfilter *and*
ip_*_rcv. But I lack the vision of the big picture leading to the
current choice.

Thanks in advance for shedding light on this :)

-Alex
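A toy userspace model of this "two-pass vs run-to-completion" point (all names here are invented; the real NF_HOOK_LIST()/ip_list_rcv_finish() machinery is far more involved):

```c
/* Toy model: a batch processed in two sequential passes attaches a
 * fresh context to each duplicate SYN, while run-to-completion lets
 * the second SYN find the context the first one created. */
struct pkt { int flow; int ctx; };

static int next_ctx;

/* Pass-1 model: TPROXY-style hook attaching a context per packet,
 * with no memory of earlier packets in the same batch. */
static void hook_attach_ctx(struct pkt *p) { p->ctx = ++next_ctx; }

/* Pass-2 model: the stack honors whatever context is attached. */
static int stack_ingest(const struct pkt *p) { return p->ctx; }

/* "Breadth-first" two-pass batch, like ip_sublist_rcv(): first the
 * whole netfilter pass, then the whole stack pass. */
static void process_batch_two_pass(struct pkt *batch, int n, int *out)
{
	for (int i = 0; i < n; i++)	/* NF_HOOK_LIST() over the batch */
		hook_attach_ctx(&batch[i]);
	for (int i = 0; i < n; i++)	/* ip_list_rcv_finish() afterwards */
		out[i] = stack_ingest(&batch[i]);
}

/* Run-to-completion: each packet is fully ingested before the hook
 * sees the next one, so a duplicate can reuse the existing context. */
static void process_batch_to_completion(struct pkt *batch, int n, int *out)
{
	int seen_flow = -1, seen_ctx = 0;

	for (int i = 0; i < n; i++) {
		if (batch[i].flow == seen_flow) {
			batch[i].ctx = seen_ctx; /* duplicate SYN: reuse */
		} else {
			hook_attach_ctx(&batch[i]);
			seen_flow = batch[i].flow;
			seen_ctx = batch[i].ctx;
		}
		out[i] = stack_ingest(&batch[i]);
	}
}
```

In the two-pass variant, two SYNs of the same flow in one batch end up with distinct contexts (distinct reqsks, hence distinct ISNs); in the run-to-completion variant they share one, which is the intuition behind "reducing the batch size to 1".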
On 17/06/2024 16:44, Alexandre Ferrieux wrote:
> To reproduce in my case, I just need the two SYNs to arrive
> back-to-back in the ingress buffer and get into the same softirq run.
> To reach this goal easily, you can set the interrupt coalescence to a
> large value (like several milliseconds) and, on the emitter side, send
> them in rapid sequence from userland. If that's not enough, you can
> just send one and duplicate it with TEE.

Good news: as I suspected, your fix (ff46e3b44219, shipped in 6.10) DOES
solve my problem too!

As a consequence, this means the single-CPU scenario was exposed too,
through netfilter's peculiar "breadth-first" iteration approach. This
gives extra weight to the importance of your work.

TL;DR: thanks, kudos, congrats, and thanks !!!
diff --git a/include/net/inet_connection_sock.h b/include/net/inet_connection_sock.h
index 7d6b1254c92d..8773d161d184 100644
--- a/include/net/inet_connection_sock.h
+++ b/include/net/inet_connection_sock.h
@@ -264,7 +264,7 @@ struct sock *inet_csk_reqsk_queue_add(struct sock *sk,
 				      struct request_sock *req,
 				      struct sock *child);
 void inet_csk_reqsk_queue_hash_add(struct sock *sk, struct request_sock *req,
-				   unsigned long timeout);
+				   unsigned long timeout, bool *found_dup_sk);
 struct sock *inet_csk_complete_hashdance(struct sock *sk, struct sock *child,
 					 struct request_sock *req,
 					 bool own_req);
diff --git a/net/dccp/ipv4.c b/net/dccp/ipv4.c
index ff41bd6f99c3..13aafdeb9205 100644
--- a/net/dccp/ipv4.c
+++ b/net/dccp/ipv4.c
@@ -657,7 +657,7 @@ int dccp_v4_conn_request(struct sock *sk, struct sk_buff *skb)
 	if (dccp_v4_send_response(sk, req))
 		goto drop_and_free;
 
-	inet_csk_reqsk_queue_hash_add(sk, req, DCCP_TIMEOUT_INIT);
+	inet_csk_reqsk_queue_hash_add(sk, req, DCCP_TIMEOUT_INIT, NULL);
 	reqsk_put(req);
 	return 0;
diff --git a/net/dccp/ipv6.c b/net/dccp/ipv6.c
index 85f4b8fdbe5e..493cdb12ce2b 100644
--- a/net/dccp/ipv6.c
+++ b/net/dccp/ipv6.c
@@ -400,7 +400,7 @@ static int dccp_v6_conn_request(struct sock *sk, struct sk_buff *skb)
 	if (dccp_v6_send_response(sk, req))
 		goto drop_and_free;
 
-	inet_csk_reqsk_queue_hash_add(sk, req, DCCP_TIMEOUT_INIT);
+	inet_csk_reqsk_queue_hash_add(sk, req, DCCP_TIMEOUT_INIT, NULL);
 	reqsk_put(req);
 	return 0;
diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
index d81f74ce0f02..045d0701acfd 100644
--- a/net/ipv4/inet_connection_sock.c
+++ b/net/ipv4/inet_connection_sock.c
@@ -1123,12 +1123,16 @@ static void reqsk_timer_handler(struct timer_list *t)
 }
 
 static void reqsk_queue_hash_req(struct request_sock *req,
-				 unsigned long timeout)
+				 unsigned long timeout, bool *found_dup_sk)
 {
+	inet_ehash_insert(req_to_sk(req), NULL, found_dup_sk);
+	if (found_dup_sk && *found_dup_sk)
+		return;
+
+	/* The timer needs to be setup after a successful insertion. */
 	timer_setup(&req->rsk_timer, reqsk_timer_handler, TIMER_PINNED);
 	mod_timer(&req->rsk_timer, jiffies + timeout);
 
-	inet_ehash_insert(req_to_sk(req), NULL, NULL);
 	/* before letting lookups find us, make sure all req fields
 	 * are committed to memory and refcnt initialized.
 	 */
@@ -1137,9 +1141,12 @@ static void reqsk_queue_hash_req(struct request_sock *req,
 }
 
 void inet_csk_reqsk_queue_hash_add(struct sock *sk, struct request_sock *req,
-				   unsigned long timeout)
+				   unsigned long timeout, bool *found_dup_sk)
 {
-	reqsk_queue_hash_req(req, timeout);
+	reqsk_queue_hash_req(req, timeout, found_dup_sk);
+	if (found_dup_sk && *found_dup_sk)
+		return;
+
 	inet_csk_reqsk_queue_added(sk);
 }
 EXPORT_SYMBOL_GPL(inet_csk_reqsk_queue_hash_add);
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 9c04a9c8be9d..49876477c2b9 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -7255,8 +7255,17 @@ int tcp_conn_request(struct request_sock_ops *rsk_ops,
 	} else {
 		tcp_rsk(req)->tfo_listener = false;
 		if (!want_cookie) {
+			bool found_dup_sk = false;
+
 			req->timeout = tcp_timeout_init((struct sock *)req);
-			inet_csk_reqsk_queue_hash_add(sk, req, req->timeout);
+			inet_csk_reqsk_queue_hash_add(sk, req, req->timeout,
+						      &found_dup_sk);
+
+			if (unlikely(found_dup_sk)) {
+				reqsk_free(req);
+				return 0;
+			}
 		}
 	}
 	af_ops->send_synack(sk, dst, &fl, req, &foc,
 			    !want_cookie ? TCP_SYNACK_NORMAL :
When bonding is configured in BOND_MODE_BROADCAST mode, if two identical
SYN packets are received at the same time and processed on different
CPUs, it can potentially create the same sk (sock) but two different
reqsk (request_sock) in tcp_conn_request().

These two different reqsk will respond with two SYNACK packets, and
since the generation of the seq (ISN) incorporates a timestamp, the
final two SYNACK packets will have different seq values.

The consequence is that when the client receives and replies with an ACK
to the earlier SYNACK packet, we will reset (RST) it.

========================================================================
This behavior is consistently reproducible in my local setup, which
comprises:

                  | NETA1 ------ NETB1 |
  PC_A --- bond --|                    |-- bond --- PC_B
                  | NETA2 ------ NETB2 |

- PC_A is the server and has two network cards, NETA1 and NETA2. I have
  bonded these two cards using BOND_MODE_BROADCAST mode and configured
  them to be handled by different CPUs.

- PC_B is the client, also equipped with two network cards, NETB1 and
  NETB2, which are also bonded and configured in BOND_MODE_BROADCAST
  mode.

If the client attempts a TCP connection to the server, it might
encounter a failure. Capturing packets from the server side reveals:

10.10.10.10.45182 > localhost: Flags [S], seq 320236027,
10.10.10.10.45182 > localhost: Flags [S], seq 320236027,
localhost > 10.10.10.10.45182: Flags [S.], seq 2967855116,
localhost > 10.10.10.10.45182: Flags [S.], seq 2967855123, <==
10.10.10.10.45182 > localhost: Flags [.], ack 4294967290,
10.10.10.10.45182 > localhost: Flags [.], ack 4294967290,
localhost > 10.10.10.10.45182: Flags [R], seq 2967855117, <==
localhost > 10.10.10.10.45182: Flags [R], seq 2967855117,

Two SYNACKs with different seq numbers are sent by localhost, resulting
in an anomaly.
========================================================================
The attempted solution is as follows:

In tcp_conn_request(), while inserting the reqsk into the ehash table,
also check whether an entry already exists. If one is found, avoid the
reinsertion and release the duplicate.

Simultaneously, in reqsk_queue_hash_req(), the start of req->rsk_timer
is adjusted to be after a successful insertion.

Signed-off-by: luoxuanqiang <luoxuanqiang@kylinos.cn>
---
 include/net/inet_connection_sock.h |  2 +-
 net/dccp/ipv4.c                    |  2 +-
 net/dccp/ipv6.c                    |  2 +-
 net/ipv4/inet_connection_sock.c    | 15 +++++++++++----
 net/ipv4/tcp_input.c               | 11 ++++++++++-
 5 files changed, 24 insertions(+), 8 deletions(-)
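A toy model of why the two reqsks answer with different seq values: the real secure_tcp_seq() mixes a keyed siphash of the 4-tuple with a clock sample, so two reqsks created for the SAME connection at different clock ticks still get different ISNs. The hash below is only an illustrative stand-in, not the kernel's.

```c
#include <stdint.h>

/* Illustrative stand-in for secure_tcp_seq(): a (non-cryptographic)
 * mix of the 4-tuple, plus a clock sample. The clock term is what
 * makes two duplicate reqsks emit SYNACKs with different seq. */
static uint32_t toy_isn(uint32_t saddr, uint32_t daddr,
			uint16_t sport, uint16_t dport,
			uint32_t clock_sample)
{
	uint32_t h = saddr * 2654435761u ^ daddr * 40503u
		   ^ ((uint32_t)sport << 16 | dport);

	return h + clock_sample; /* timestamp offset, as in the capture */
}
```

In the capture above, the two SYNACK seqs differ by exactly the clock delta between the two reqsk creations (2967855116 vs 2967855123), which this model reproduces: same 4-tuple, different clock sample, different ISN.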