Message ID | 20230117175340.91712-1-kerneljasonxing@gmail.com (mailing list archive) |
---|---|
State | Superseded |
Delegated to: | Netdev Maintainers |
Series | [v6,net] tcp: avoid the lookup process failing to get sk in ehash table |
From: Jason Xing <kerneljasonxing@gmail.com>
Date: Wed, 18 Jan 2023 01:53:40 +0800
> From: Jason Xing <kernelxing@tencent.com>
>
> While one cpu is working on looking up the right socket from ehash
> table, another cpu is done deleting the request socket and is about
> to add (or is adding) the big socket to the table. It means that
> we could miss both of them, even though it has little chance.
>
> Let me draw a call trace map of the server side.
>    CPU 0                           CPU 1
>    -----                           -----
> tcp_v4_rcv()                  syn_recv_sock()
>                             inet_ehash_insert()
>                             -> sk_nulls_del_node_init_rcu(osk)
> __inet_lookup_established()
>                             -> __sk_nulls_add_node_rcu(sk, list)
>
> Notice that CPU 0 is receiving the data after the final ack
> during the 3-way handshake and CPU 1 is still handling the final ack.
>
> Why could this be a real problem?
> This case happens only when the final ack and the first data
> are received by different CPUs. Then the server receiving data with
> ACK flag tries to search one proper established socket from ehash
> table, but apparently it fails as my map shows above. After that,
> the server fetches a listener socket and then sends a RST because
> it finds an ACK flag in the skb (data), which obeys the RST definition
> in RFC 793.
>
> Besides, Eric pointed out there's one more race condition where it
> handles tw socket hashdance. Only by adding to the tail of the list
> before deleting the old one can we avoid the race if the reader has
> already begun the bucket traversal and it would possibly miss the head.
>
> Many thanks to Eric for great help from beginning to end.
>
> Fixes: 5e0724d027f0 ("tcp/dccp: fix hashdance race for passive sessions")
> Suggested-by: Eric Dumazet <edumazet@google.com>
> Signed-off-by: Jason Xing <kernelxing@tencent.com>
> Reviewed-by: Eric Dumazet <edumazet@google.com>
> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
> Link: https://lore.kernel.org/lkml/20230112065336.41034-1-kerneljasonxing@gmail.com/
> ---
> v3,4,5,6:
> 1) nit: adjust the coding style.
>
> v2:
> 1) add the sk node into the tail of list to prevent the race.
> 2) fix the race condition when handling time-wait socket hashdance.
> ---
>  net/ipv4/inet_hashtables.c    | 17 +++++++++++++++--
>  net/ipv4/inet_timewait_sock.c | 12 ++++++------
>  2 files changed, 21 insertions(+), 8 deletions(-)
>
> diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c
> index 24a38b56fab9..f58d73888638 100644
> --- a/net/ipv4/inet_hashtables.c
> +++ b/net/ipv4/inet_hashtables.c
> @@ -650,8 +650,20 @@ bool inet_ehash_insert(struct sock *sk, struct sock *osk, bool *found_dup_sk)
>  	spin_lock(lock);
>  	if (osk) {
>  		WARN_ON_ONCE(sk->sk_hash != osk->sk_hash);
> -		ret = sk_nulls_del_node_init_rcu(osk);
> -	} else if (found_dup_sk) {
> +		ret = sk_hashed(osk);
> +		if (ret) {
> +			/* Before deleting the node, we insert a new one to make
> +			 * sure that the look-up-sk process would not miss either
> +			 * of them and that at least one node would exist in ehash
> +			 * table all the time. Otherwise there's a tiny chance
> +			 * that lookup process could find nothing in ehash table.
> +			 */
> +			__sk_nulls_add_node_tail_rcu(sk, list);
> +			sk_nulls_del_node_init_rcu(osk);
> +		}
> +		goto unlock;
> +	}
> +	if (found_dup_sk) {
>  		*found_dup_sk = inet_ehash_lookup_by_sk(sk, list);
>  		if (*found_dup_sk)
>  			ret = false;
> @@ -660,6 +672,7 @@ bool inet_ehash_insert(struct sock *sk, struct sock *osk, bool *found_dup_sk)
>  	if (ret)
>  		__sk_nulls_add_node_rcu(sk, list);
>
> +unlock:
>  	spin_unlock(lock);
>
>  	return ret;
> diff --git a/net/ipv4/inet_timewait_sock.c b/net/ipv4/inet_timewait_sock.c
> index 1d77d992e6e7..b66f2dea5a78 100644
> --- a/net/ipv4/inet_timewait_sock.c
> +++ b/net/ipv4/inet_timewait_sock.c
> @@ -91,20 +91,20 @@ void inet_twsk_put(struct inet_timewait_sock *tw)
>  }
>  EXPORT_SYMBOL_GPL(inet_twsk_put);
>
> -static void inet_twsk_add_node_rcu(struct inet_timewait_sock *tw,
> -				   struct hlist_nulls_head *list)
> +static void inet_twsk_add_node_tail_rcu(struct inet_timewait_sock *tw,
> +					struct hlist_nulls_head *list)
>  {
> -	hlist_nulls_add_head_rcu(&tw->tw_node, list);
> +	hlist_nulls_add_tail_rcu(&tw->tw_node, list);
>  }
>
>  static void inet_twsk_add_bind_node(struct inet_timewait_sock *tw,
> -				    struct hlist_head *list)
> +					struct hlist_head *list)
>  {
>  	hlist_add_head(&tw->tw_bind_node, list);
>  }
>
>  static void inet_twsk_add_bind2_node(struct inet_timewait_sock *tw,
> -				     struct hlist_head *list)
> +					struct hlist_head *list)
>  {
>  	hlist_add_head(&tw->tw_bind2_node, list);
>  }

You need not change inet_twsk_add_bind_node() and
inet_twsk_add_bind2_node().

Thanks,
Kuniyuki

> @@ -147,7 +147,7 @@ void inet_twsk_hashdance(struct inet_timewait_sock *tw, struct sock *sk,
>
>  	spin_lock(lock);
>
> -	inet_twsk_add_node_rcu(tw, &ehead->chain);
> +	inet_twsk_add_node_tail_rcu(tw, &ehead->chain);
>
>  	/* Step 3: Remove SK from hash chain */
>  	if (__sk_nulls_del_node_init_rcu(sk))
> --
> 2.37.3
On Wed, Jan 18, 2023 at 2:42 AM Kuniyuki Iwashima <kuniyu@amazon.com> wrote:
>
> From: Jason Xing <kerneljasonxing@gmail.com>
> Date: Wed, 18 Jan 2023 01:53:40 +0800
> > From: Jason Xing <kernelxing@tencent.com>
> >
> > While one cpu is working on looking up the right socket from ehash
> > table, another cpu is done deleting the request socket and is about
> > to add (or is adding) the big socket to the table. It means that
> > we could miss both of them, even though it has little chance.
> >
> > Let me draw a call trace map of the server side.
> >    CPU 0                           CPU 1
> >    -----                           -----
> > tcp_v4_rcv()                  syn_recv_sock()
> >                             inet_ehash_insert()
> >                             -> sk_nulls_del_node_init_rcu(osk)
> > __inet_lookup_established()
> >                             -> __sk_nulls_add_node_rcu(sk, list)
> >
> > Notice that CPU 0 is receiving the data after the final ack
> > during the 3-way handshake and CPU 1 is still handling the final ack.
> >
> > Why could this be a real problem?
> > This case happens only when the final ack and the first data
> > are received by different CPUs. Then the server receiving data with
> > ACK flag tries to search one proper established socket from ehash
> > table, but apparently it fails as my map shows above. After that,
> > the server fetches a listener socket and then sends a RST because
> > it finds an ACK flag in the skb (data), which obeys the RST definition
> > in RFC 793.
> >
> > Besides, Eric pointed out there's one more race condition where it
> > handles tw socket hashdance. Only by adding to the tail of the list
> > before deleting the old one can we avoid the race if the reader has
> > already begun the bucket traversal and it would possibly miss the head.
> >
> > Many thanks to Eric for great help from beginning to end.
> >
> > Fixes: 5e0724d027f0 ("tcp/dccp: fix hashdance race for passive sessions")
> > Suggested-by: Eric Dumazet <edumazet@google.com>
> > Signed-off-by: Jason Xing <kernelxing@tencent.com>
> > Reviewed-by: Eric Dumazet <edumazet@google.com>
> > Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
> > Link: https://lore.kernel.org/lkml/20230112065336.41034-1-kerneljasonxing@gmail.com/
> > ---
> > v3,4,5,6:
> > 1) nit: adjust the coding style.
> >
> > v2:
> > 1) add the sk node into the tail of list to prevent the race.
> > 2) fix the race condition when handling time-wait socket hashdance.
> > ---
> >  net/ipv4/inet_hashtables.c    | 17 +++++++++++++++--
> >  net/ipv4/inet_timewait_sock.c | 12 ++++++------
> >  2 files changed, 21 insertions(+), 8 deletions(-)
> >
> > diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c
> > index 24a38b56fab9..f58d73888638 100644
> > --- a/net/ipv4/inet_hashtables.c
> > +++ b/net/ipv4/inet_hashtables.c
> > @@ -650,8 +650,20 @@ bool inet_ehash_insert(struct sock *sk, struct sock *osk, bool *found_dup_sk)
> >  	spin_lock(lock);
> >  	if (osk) {
> >  		WARN_ON_ONCE(sk->sk_hash != osk->sk_hash);
> > -		ret = sk_nulls_del_node_init_rcu(osk);
> > -	} else if (found_dup_sk) {
> > +		ret = sk_hashed(osk);
> > +		if (ret) {
> > +			/* Before deleting the node, we insert a new one to make
> > +			 * sure that the look-up-sk process would not miss either
> > +			 * of them and that at least one node would exist in ehash
> > +			 * table all the time. Otherwise there's a tiny chance
> > +			 * that lookup process could find nothing in ehash table.
> > +			 */
> > +			__sk_nulls_add_node_tail_rcu(sk, list);
> > +			sk_nulls_del_node_init_rcu(osk);
> > +		}
> > +		goto unlock;
> > +	}
> > +	if (found_dup_sk) {
> >  		*found_dup_sk = inet_ehash_lookup_by_sk(sk, list);
> >  		if (*found_dup_sk)
> >  			ret = false;
> > @@ -660,6 +672,7 @@ bool inet_ehash_insert(struct sock *sk, struct sock *osk, bool *found_dup_sk)
> >  	if (ret)
> >  		__sk_nulls_add_node_rcu(sk, list);
> >
> > +unlock:
> >  	spin_unlock(lock);
> >
> >  	return ret;
> > diff --git a/net/ipv4/inet_timewait_sock.c b/net/ipv4/inet_timewait_sock.c
> > index 1d77d992e6e7..b66f2dea5a78 100644
> > --- a/net/ipv4/inet_timewait_sock.c
> > +++ b/net/ipv4/inet_timewait_sock.c
> > @@ -91,20 +91,20 @@ void inet_twsk_put(struct inet_timewait_sock *tw)
> >  }
> >  EXPORT_SYMBOL_GPL(inet_twsk_put);
> >
> > -static void inet_twsk_add_node_rcu(struct inet_timewait_sock *tw,
> > -				   struct hlist_nulls_head *list)
> > +static void inet_twsk_add_node_tail_rcu(struct inet_timewait_sock *tw,
> > +					struct hlist_nulls_head *list)
> >  {
> > -	hlist_nulls_add_head_rcu(&tw->tw_node, list);
> > +	hlist_nulls_add_tail_rcu(&tw->tw_node, list);
> >  }
> >
> >  static void inet_twsk_add_bind_node(struct inet_timewait_sock *tw,
> > -				    struct hlist_head *list)
> > +					struct hlist_head *list)
> >  {
> >  	hlist_add_head(&tw->tw_bind_node, list);
> >  }
> >
> >  static void inet_twsk_add_bind2_node(struct inet_timewait_sock *tw,
> > -				     struct hlist_head *list)
> > +					struct hlist_head *list)
> >  {
> >  	hlist_add_head(&tw->tw_bind2_node, list);
> >  }
>
> You need not change inet_twsk_add_bind_node() and
> inet_twsk_add_bind2_node().

I'll drop them and then send a v7 patch.

Thanks,
Jason

>
> Thanks,
> Kuniyuki
>
> > @@ -147,7 +147,7 @@ void inet_twsk_hashdance(struct inet_timewait_sock *tw, struct sock *sk,
> >
> >  	spin_lock(lock);
> >
> > -	inet_twsk_add_node_rcu(tw, &ehead->chain);
> > +	inet_twsk_add_node_tail_rcu(tw, &ehead->chain);
> >
> >  	/* Step 3: Remove SK from hash chain */
> >  	if (__sk_nulls_del_node_init_rcu(sk))
> > --
> > 2.37.3
diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c
index 24a38b56fab9..f58d73888638 100644
--- a/net/ipv4/inet_hashtables.c
+++ b/net/ipv4/inet_hashtables.c
@@ -650,8 +650,20 @@ bool inet_ehash_insert(struct sock *sk, struct sock *osk, bool *found_dup_sk)
 	spin_lock(lock);
 	if (osk) {
 		WARN_ON_ONCE(sk->sk_hash != osk->sk_hash);
-		ret = sk_nulls_del_node_init_rcu(osk);
-	} else if (found_dup_sk) {
+		ret = sk_hashed(osk);
+		if (ret) {
+			/* Before deleting the node, we insert a new one to make
+			 * sure that the look-up-sk process would not miss either
+			 * of them and that at least one node would exist in ehash
+			 * table all the time. Otherwise there's a tiny chance
+			 * that lookup process could find nothing in ehash table.
+			 */
+			__sk_nulls_add_node_tail_rcu(sk, list);
+			sk_nulls_del_node_init_rcu(osk);
+		}
+		goto unlock;
+	}
+	if (found_dup_sk) {
 		*found_dup_sk = inet_ehash_lookup_by_sk(sk, list);
 		if (*found_dup_sk)
 			ret = false;
@@ -660,6 +672,7 @@ bool inet_ehash_insert(struct sock *sk, struct sock *osk, bool *found_dup_sk)
 	if (ret)
 		__sk_nulls_add_node_rcu(sk, list);
 
+unlock:
 	spin_unlock(lock);
 
 	return ret;
diff --git a/net/ipv4/inet_timewait_sock.c b/net/ipv4/inet_timewait_sock.c
index 1d77d992e6e7..b66f2dea5a78 100644
--- a/net/ipv4/inet_timewait_sock.c
+++ b/net/ipv4/inet_timewait_sock.c
@@ -91,20 +91,20 @@ void inet_twsk_put(struct inet_timewait_sock *tw)
 }
 EXPORT_SYMBOL_GPL(inet_twsk_put);
 
-static void inet_twsk_add_node_rcu(struct inet_timewait_sock *tw,
-				   struct hlist_nulls_head *list)
+static void inet_twsk_add_node_tail_rcu(struct inet_timewait_sock *tw,
+					struct hlist_nulls_head *list)
 {
-	hlist_nulls_add_head_rcu(&tw->tw_node, list);
+	hlist_nulls_add_tail_rcu(&tw->tw_node, list);
 }
 
 static void inet_twsk_add_bind_node(struct inet_timewait_sock *tw,
-				    struct hlist_head *list)
+					struct hlist_head *list)
 {
 	hlist_add_head(&tw->tw_bind_node, list);
 }
 
 static void inet_twsk_add_bind2_node(struct inet_timewait_sock *tw,
-				     struct hlist_head *list)
+					struct hlist_head *list)
 {
 	hlist_add_head(&tw->tw_bind2_node, list);
 }
@@ -147,7 +147,7 @@ void inet_twsk_hashdance(struct inet_timewait_sock *tw, struct sock *sk,
 
 	spin_lock(lock);
 
-	inet_twsk_add_node_rcu(tw, &ehead->chain);
+	inet_twsk_add_node_tail_rcu(tw, &ehead->chain);
 
 	/* Step 3: Remove SK from hash chain */
 	if (__sk_nulls_del_node_init_rcu(sk))