Message ID | 20231115210509.481514-2-vschneid@redhat.com (mailing list archive) |
---|---|
State | Changes Requested |
Delegated to: | Netdev Maintainers |
Headers | show |
Series | tcp/dcpp: tw_timer tweaks for nohz_full and PREEMPT_RT | expand |
On Wed, Nov 15, 2023 at 10:05 PM Valentin Schneider <vschneid@redhat.com> wrote: > > The TCP timewait timer is proving to be problematic for setups where scheduler > CPU isolation is achieved at runtime via cpusets (as opposed to statically via > isolcpus=domains). > > What happens there is a CPU goes through tcp_time_wait(), arming the time_wait > timer, then gets isolated. TCP_TIMEWAIT_LEN later, the timer fires, causing > interference for the now-isolated CPU. This is conceptually similar to the issue > described in > e02b93124855 ("workqueue: Unbind kworkers before sending them to exit()") > > Keep softirqs disabled, but make the timer un-pinned and arm it *after* the > hashdance. > > This introduces the following (non-fatal) race: > > CPU0 CPU1 > allocates a tw > insert it in hash table > finds the TW and removes it > (timer cancel does nothing) > arms a TW timer, lasting > > This partially reverts > ed2e92394589 ("tcp/dccp: fix timewait races in timer handling") > and > ec94c2696f0b ("tcp/dccp: avoid one atomic operation for timewait hashdance") > > This also reinstores a comment from > ec94c2696f0b ("tcp/dccp: avoid one atomic operation for timewait hashdance") > as inet_twsk_hashdance() had a "Step 1" and "Step 3" comment, but the "Step > 2" had gone missing. > > Link: https://lore.kernel.org/all/ZPhpfMjSiHVjQkTk@localhost.localdomain/ > Signed-off-by: Valentin Schneider <vschneid@redhat.com> > --- > net/dccp/minisocks.c | 16 +++++++--------- > net/ipv4/inet_timewait_sock.c | 20 +++++++++++++++----- > net/ipv4/tcp_minisocks.c | 16 +++++++--------- > 3 files changed, 29 insertions(+), 23 deletions(-) > > diff --git a/net/dccp/minisocks.c b/net/dccp/minisocks.c > index 64d805b27adde..2f0fad4255e36 100644 > --- a/net/dccp/minisocks.c > +++ b/net/dccp/minisocks.c > @@ -53,16 +53,14 @@ void dccp_time_wait(struct sock *sk, int state, int timeo) > if (state == DCCP_TIME_WAIT) > timeo = DCCP_TIMEWAIT_LEN; > > - /* tw_timer is pinned, so we need to make sure BH are disabled > - * in following section, otherwise timer handler could run before > - * we complete the initialization. > - */ > - local_bh_disable(); > - inet_twsk_schedule(tw, timeo); > - /* Linkage updates. > - * Note that access to tw after this point is illegal. > - */ > + local_bh_disable(); > + > + // Linkage updates > inet_twsk_hashdance(tw, sk, &dccp_hashinfo); > + inet_twsk_schedule(tw, timeo); We could arm a timer there, while another thread/cpu found the TW in the ehash table. > + // Access to tw after this point is illegal. > + inet_twsk_put(tw); This would eventually call inet_twsk_free() while the timer is armed. I think more work is needed. Perhaps make sure that a live timer owns a reference on tw->tw_refcnt (This is not the case atm)
Hi Eric, On Mon, 20 Nov 2023 at 18:56, Eric Dumazet <edumazet@google.com> wrote: > > On Wed, Nov 15, 2023 at 10:05 PM Valentin Schneider <vschneid@redhat.com> wrote: > > > > @@ -53,16 +53,14 @@ void dccp_time_wait(struct sock *sk, int state, int timeo) > > if (state == DCCP_TIME_WAIT) > > timeo = DCCP_TIMEWAIT_LEN; > > > > - /* tw_timer is pinned, so we need to make sure BH are disabled > > - * in following section, otherwise timer handler could run before > > - * we complete the initialization. > > - */ > > - local_bh_disable(); > > - inet_twsk_schedule(tw, timeo); > > - /* Linkage updates. > > - * Note that access to tw after this point is illegal. > > - */ > > + local_bh_disable(); > > + > > + // Linkage updates > > inet_twsk_hashdance(tw, sk, &dccp_hashinfo); > > + inet_twsk_schedule(tw, timeo); > > We could arm a timer there, while another thread/cpu found the TW in > the ehash table. > > > > > + // Access to tw after this point is illegal. > > + inet_twsk_put(tw); > > This would eventually call inet_twsk_free() while the timer is armed. > > I think more work is needed. > > Perhaps make sure that a live timer owns a reference on tw->tw_refcnt > (This is not the case atm) > I thought that was already the case, per inet_twsk_hashdance(): /* tw_refcnt is set to 3 because we have : * - one reference for bhash chain. * - one reference for ehash chain. * - one reference for timer. and tw_timer_handler() `\ inet_twsk_kill() `\ inet_twsk_put() So AFAICT, after we go through the hashdance, there's a reference on tw_refcnt held by the tw_timer. inet_twsk_deschedule_put() can race with arming the timer, but it only calls inet_twsk_kill() if the timer was already armed & has been deleted, so there's no risk of calling it twice... If I got it right :-)
On Thu, Nov 23, 2023 at 3:34 PM Valentin Schneider <vschneid@redhat.com> wrote: > > Hi Eric, > > On Mon, 20 Nov 2023 at 18:56, Eric Dumazet <edumazet@google.com> wrote: > > > > On Wed, Nov 15, 2023 at 10:05 PM Valentin Schneider <vschneid@redhat.com> wrote: > > > > > > @@ -53,16 +53,14 @@ void dccp_time_wait(struct sock *sk, int state, int timeo) > > > if (state == DCCP_TIME_WAIT) > > > timeo = DCCP_TIMEWAIT_LEN; > > > > > > - /* tw_timer is pinned, so we need to make sure BH are disabled > > > - * in following section, otherwise timer handler could run before > > > - * we complete the initialization. > > > - */ > > > - local_bh_disable(); > > > - inet_twsk_schedule(tw, timeo); > > > - /* Linkage updates. > > > - * Note that access to tw after this point is illegal. > > > - */ > > > + local_bh_disable(); > > > + > > > + // Linkage updates > > > inet_twsk_hashdance(tw, sk, &dccp_hashinfo); > > > + inet_twsk_schedule(tw, timeo); > > > > We could arm a timer there, while another thread/cpu found the TW in > > the ehash table. > > > > > > > > > + // Access to tw after this point is illegal. > > > + inet_twsk_put(tw); > > > > This would eventually call inet_twsk_free() while the timer is armed. > > > > I think more work is needed. > > > > Perhaps make sure that a live timer owns a reference on tw->tw_refcnt > > (This is not the case atm) > > > > I thought that was already the case, per inet_twsk_hashdance(): > > /* tw_refcnt is set to 3 because we have : > * - one reference for bhash chain. > * - one reference for ehash chain. > * - one reference for timer. > > and > > tw_timer_handler() > `\ > inet_twsk_kill() > `\ > inet_twsk_put() > > So AFAICT, after we go through the hashdance, there's a reference on > tw_refcnt held by the tw_timer. > inet_twsk_deschedule_put() can race with arming the timer, but it only > calls inet_twsk_kill() if the timer > was already armed & has been deleted, so there's no risk of calling it > twice... If I got it right :-) > Again, I think you missed some details. I am OOO for a few days, I do not have time to elaborate. You will need to properly track active timer by elevating tw->tw_refcnt, or I guarantee something wrong will happen.
On Thu, 23 Nov 2023 at 17:32, Eric Dumazet <edumazet@google.com> wrote: > > On Thu, Nov 23, 2023 at 3:34 PM Valentin Schneider <vschneid@redhat.com> wrote: > > I thought that was already the case, per inet_twsk_hashdance(): > > > > /* tw_refcnt is set to 3 because we have : > > * - one reference for bhash chain. > > * - one reference for ehash chain. > > * - one reference for timer. > > > > and > > > > tw_timer_handler() > > `\ > > inet_twsk_kill() > > `\ > > inet_twsk_put() > > > > So AFAICT, after we go through the hashdance, there's a reference on > > tw_refcnt held by the tw_timer. > > inet_twsk_deschedule_put() can race with arming the timer, but it only > > calls inet_twsk_kill() if the timer > > was already armed & has been deleted, so there's no risk of calling it > > twice... If I got it right :-) > > > > Again, I think you missed some details. > > I am OOO for a few days, I do not have time to elaborate. > > You will need to properly track active timer by elevating > tw->tw_refcnt, or I guarantee something wrong will happen. > Gotcha, let me dig into this then!
On 23/11/23 17:32, Eric Dumazet wrote: > Again, I think you missed some details. > > I am OOO for a few days, I do not have time to elaborate. > > You will need to properly track active timer by elevating > tw->tw_refcnt, or I guarantee something wrong will happen. I apologize if I'm being thick skulled, I've been staring at the code and tracing on live systems and I still can't see the issue with refcounts. The patch has the hashdance set the refcount to 4: * - one reference for bhash chain. * - one reference for ehash chain. * - one reference for timer. * - one reference for ourself (our caller will release it). AFAICT, finding & using the socket via the ehash table comes with a refcount_inc (e.g. __inet_lookup_established()). Worst case, the lookup happens after programming the timer, and we get a inet_twsk_deschedule_put(). This reduces the refcount by: 3 via inet_twsk_kill(): 1 (sk_nulls_del_node_init_rcu()) 1 (inet_twsk_bind_unhash()) 1 (inet_twsk_put()) 1 via inet_twsk_put() IOW 4 total. So we can have: tcp_time_wait() inet_twsk_hashdance(); // refcount = 4 inet_twsk_schedule(); // timer armed tcp_v4_rcv() sk = __inet_lookup_skb(); // refcount = 5 (+1) inet_twsk_deschedule_put(inet_twsk(sk)); inet_twsk_kill(tw) // refcount = 2 (-3) inet_twsk_put(tw) // refcount = 1 (-1) inet_twsk_put(tw) // refcount = 0 (-1) __inet_hash_connect() can invoke inet_twsk_bind_unhash() by itself before calling inet_twsk_deschedule_put(), but that just means it won't be done by the latter, so the total refcount delta remains the same. Thinking about it differently, this would mean that currently (without the patch) another CPU can bring the refcount to 0 without disarming the timer, because the patch is making the initial value one higher. What am I missing?
Hi, On Thu, 2023-11-23 at 17:32 +0100, Eric Dumazet wrote: > On Thu, Nov 23, 2023 at 3:34 PM Valentin Schneider <vschneid@redhat.com> wrote: > > So AFAICT, after we go through the hashdance, there's a reference on > > tw_refcnt held by the tw_timer. > > inet_twsk_deschedule_put() can race with arming the timer, but it only > > calls inet_twsk_kill() if the timer > > was already armed & has been deleted, so there's no risk of calling it > > twice... If I got it right :-) > > Again, I think you missed some details. > > I am OOO for a few days, I do not have time to elaborate. > > You will need to properly track active timer by elevating > tw->tw_refcnt, or I guarantee something wrong will happen. I'm sorry to bring this up again, but I tried to understand what is missing in Valentin's patch and I could not find it. Direct link to the patch, just in case the thread has been lost: https://lore.kernel.org/lkml/20231115210509.481514-2-vschneid@redhat.com/ The patch raises the initial tw->tw_refcnt to 4, so it tracks (in advance) the reference for the tw_timer. AFAICS the patch is still prone to the race you mentioned on the RFC: CPU0: allocates a tw, insert it in hash table CPU1: finds the TW and removes it (timer cancel does nothing) CPU0: arms a TW timer, lasting but I understood such race is acceptable. Could you please shed some light? Many thanks, Paolo
diff --git a/net/dccp/minisocks.c b/net/dccp/minisocks.c index 64d805b27adde..2f0fad4255e36 100644 --- a/net/dccp/minisocks.c +++ b/net/dccp/minisocks.c @@ -53,16 +53,14 @@ void dccp_time_wait(struct sock *sk, int state, int timeo) if (state == DCCP_TIME_WAIT) timeo = DCCP_TIMEWAIT_LEN; - /* tw_timer is pinned, so we need to make sure BH are disabled - * in following section, otherwise timer handler could run before - * we complete the initialization. - */ - local_bh_disable(); - inet_twsk_schedule(tw, timeo); - /* Linkage updates. - * Note that access to tw after this point is illegal. - */ + local_bh_disable(); + + // Linkage updates inet_twsk_hashdance(tw, sk, &dccp_hashinfo); + inet_twsk_schedule(tw, timeo); + // Access to tw after this point is illegal. + inet_twsk_put(tw); + local_bh_enable(); } else { /* Sorry, if we're out of memory, just CLOSE this diff --git a/net/ipv4/inet_timewait_sock.c b/net/ipv4/inet_timewait_sock.c index dd37a5bf68811..f9b2bbedf1cfc 100644 --- a/net/ipv4/inet_timewait_sock.c +++ b/net/ipv4/inet_timewait_sock.c @@ -144,6 +144,7 @@ void inet_twsk_hashdance(struct inet_timewait_sock *tw, struct sock *sk, spin_lock(lock); + /* Step 2: Hash TW into tcp ehash chain */ inet_twsk_add_node_rcu(tw, &ehead->chain); /* Step 3: Remove SK from hash chain */ @@ -152,16 +153,15 @@ void inet_twsk_hashdance(struct inet_timewait_sock *tw, struct sock *sk, spin_unlock(lock); - /* tw_refcnt is set to 3 because we have : + /* tw_refcnt is set to 4 because we have : * - one reference for bhash chain. * - one reference for ehash chain. * - one reference for timer. + * - one reference for ourself (our caller will release it). * We can use atomic_set() because prior spin_lock()/spin_unlock() * committed into memory all tw fields. - * Also note that after this point, we lost our implicit reference - * so we are not allowed to use tw anymore. */ - refcount_set(&tw->tw_refcnt, 3); + refcount_set(&tw->tw_refcnt, 4); } EXPORT_SYMBOL_GPL(inet_twsk_hashdance); @@ -207,7 +207,7 @@ struct inet_timewait_sock *inet_twsk_alloc(const struct sock *sk, tw->tw_prot = sk->sk_prot_creator; atomic64_set(&tw->tw_cookie, atomic64_read(&sk->sk_cookie)); twsk_net_set(tw, sock_net(sk)); - timer_setup(&tw->tw_timer, tw_timer_handler, TIMER_PINNED); + timer_setup(&tw->tw_timer, tw_timer_handler, 0); /* * Because we use RCU lookups, we should not set tw_refcnt * to a non null value before everything is setup for this @@ -232,6 +232,16 @@ EXPORT_SYMBOL_GPL(inet_twsk_alloc); */ void inet_twsk_deschedule_put(struct inet_timewait_sock *tw) { + /* This can race with tcp_time_wait() and dccp_time_wait(), as the timer + * is armed /after/ adding it to the hashtables. + * + * If this is interleaved between inet_twsk_hashdance() and inet_twsk_put(), + * then this is a no-op: the timer will still end up armed. + * + * Conversely, if this successfully deletes the timer, then we know we + * have already gone through {tcp,dcpp}_time_wait(), and we can safely + * call inet_twsk_kill(). + */ if (del_timer_sync(&tw->tw_timer)) inet_twsk_kill(tw); inet_twsk_put(tw); diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c index a9807eeb311ca..48eb0310fe837 100644 --- a/net/ipv4/tcp_minisocks.c +++ b/net/ipv4/tcp_minisocks.c @@ -338,16 +338,14 @@ void tcp_time_wait(struct sock *sk, int state, int timeo) if (state == TCP_TIME_WAIT) timeo = TCP_TIMEWAIT_LEN; - /* tw_timer is pinned, so we need to make sure BH are disabled - * in following section, otherwise timer handler could run before - * we complete the initialization. - */ - local_bh_disable(); - inet_twsk_schedule(tw, timeo); - /* Linkage updates. - * Note that access to tw after this point is illegal. - */ + local_bh_disable(); + + // Linkage updates. inet_twsk_hashdance(tw, sk, net->ipv4.tcp_death_row.hashinfo); + inet_twsk_schedule(tw, timeo); + // Access to tw after this point is illegal. + inet_twsk_put(tw); + local_bh_enable(); } else { /* Sorry, if we're out of memory, just CLOSE this
The TCP timewait timer is proving to be problematic for setups where scheduler CPU isolation is achieved at runtime via cpusets (as opposed to statically via isolcpus=domains). What happens there is a CPU goes through tcp_time_wait(), arming the time_wait timer, then gets isolated. TCP_TIMEWAIT_LEN later, the timer fires, causing interference for the now-isolated CPU. This is conceptually similar to the issue described in e02b93124855 ("workqueue: Unbind kworkers before sending them to exit()") Keep softirqs disabled, but make the timer un-pinned and arm it *after* the hashdance. This introduces the following (non-fatal) race: CPU0 CPU1 allocates a tw insert it in hash table finds the TW and removes it (timer cancel does nothing) arms a TW timer, lasting This partially reverts ed2e92394589 ("tcp/dccp: fix timewait races in timer handling") and ec94c2696f0b ("tcp/dccp: avoid one atomic operation for timewait hashdance") This also reinstores a comment from ec94c2696f0b ("tcp/dccp: avoid one atomic operation for timewait hashdance") as inet_twsk_hashdance() had a "Step 1" and "Step 3" comment, but the "Step 2" had gone missing. Link: https://lore.kernel.org/all/ZPhpfMjSiHVjQkTk@localhost.localdomain/ Signed-off-by: Valentin Schneider <vschneid@redhat.com> --- net/dccp/minisocks.c | 16 +++++++--------- net/ipv4/inet_timewait_sock.c | 20 +++++++++++++++----- net/ipv4/tcp_minisocks.c | 16 +++++++--------- 3 files changed, 29 insertions(+), 23 deletions(-)