[v5,0/2] tcp/dccp: Un-pin tw_timer

Message ID 20240415113436.3261042-1-vschneid@redhat.com (mailing list archive)

Message

Valentin Schneider April 15, 2024, 11:34 a.m. UTC
Hi,

This is v5 of the series in which the tw_timer is un-pinned to get rid of
interference in isolated-CPU setups.

The first patch is a new one stemming from Jakub's bug report. It's there
mainly to make reviewing a bit easier, but as it changes behaviour it should
be squashed into the second one.

Revisions
=========

v4 -> v5
++++++++

o Rebased against latest Linus' tree
o Converted tw_timer into a delayed work following Jakub's bug report on v4
  http://lore.kernel.org/r/20240411100536.224fa1e7@kernel.org

v3 -> v4
++++++++

o Rebased against latest Linus' tree
o Added ehash lock usage to serialize scheduling vs descheduling of the tw_timer
  (Paolo)

v2 -> v3
++++++++

o Dropped bh_disable patch
o Rebased against latest Linus' tree

RFCv1 -> v2
+++++++++++

o Added comment in inet_twsk_deschedule_put() to highlight the race
o Added bh_disable patch

Valentin Schneider (2):
  SQUASH: tcp/dccp: Convert timewait timer into a delayed_work
  tcp/dccp: Un-pin tw_timer

 include/net/inet_timewait_sock.h              |  8 +-
 net/dccp/minisocks.c                          |  9 +--
 net/ipv4/inet_diag.c                          |  2 +-
 net/ipv4/inet_timewait_sock.c                 | 73 +++++++++++++------
 net/ipv4/tcp_ipv4.c                           |  2 +-
 net/ipv4/tcp_minisocks.c                      |  9 +--
 net/ipv6/tcp_ipv6.c                           |  2 +-
 .../selftests/bpf/progs/bpf_iter_tcp4.c       |  2 +-
 .../selftests/bpf/progs/bpf_iter_tcp6.c       |  2 +-
 9 files changed, 69 insertions(+), 40 deletions(-)

--
2.43.0

Comments

Eric Dumazet April 15, 2024, 12:35 p.m. UTC | #1
On Mon, Apr 15, 2024 at 1:34 PM Valentin Schneider <vschneid@redhat.com> wrote:
>
> Hi,
>
> This is v5 of the series in which the tw_timer is un-pinned to get rid of
> interference in isolated-CPU setups.
>
> The first patch is a new one stemming from Jakub's bug report. It's there
> mainly to make reviewing a bit easier, but as it changes behaviour it should
> be squashed into the second one.
>
> Revisions
> =========
>
> v4 -> v5
> ++++++++
>
> o Rebased against latest Linus' tree
> o Converted tw_timer into a delayed work following Jakub's bug report on v4
>   http://lore.kernel.org/r/20240411100536.224fa1e7@kernel.org

What was the issue again?

Please explain precisely why it was fundamentally tied to the use of
timers (and why it was not possible to fix the issue without adding
work queues and more dependencies to the TCP stack).
Valentin Schneider April 15, 2024, 2:33 p.m. UTC | #2
On 15/04/24 14:35, Eric Dumazet wrote:
> On Mon, Apr 15, 2024 at 1:34 PM Valentin Schneider <vschneid@redhat.com> wrote:
>>
>> Hi,
>>
>> This is v5 of the series in which the tw_timer is un-pinned to get rid of
>> interference in isolated-CPU setups.
>>
>> The first patch is a new one stemming from Jakub's bug report. It's there
>> mainly to make reviewing a bit easier, but as it changes behaviour it should
>> be squashed into the second one.
>>
>> Revisions
>> =========
>>
>> v4 -> v5
>> ++++++++
>>
>> o Rebased against latest Linus' tree
>> o Converted tw_timer into a delayed work following Jakub's bug report on v4
>>   http://lore.kernel.org/r/20240411100536.224fa1e7@kernel.org
>
> What was the issue again?
>
> Please explain precisely why it was fundamentally tied to the use of
> timers (and why it was not possible to fix the issue without adding
> work queues and more dependencies to the TCP stack).

In v4 I added the use of the ehash lock to serialize arming the timewait
timer vs destroying it (inet_twsk_schedule() vs inet_twsk_deschedule_put()).

Unfortunately, holding a lock both in a timer callback and in the context
in which it is destroyed is invalid. AIUI the issue is as follows:

  CPUx                        CPUy
  spin_lock(foo);
                              <timer fires>
                              call_timer_fn()
                                spin_lock(foo) // blocks
  timer_shutdown_sync()
    __timer_delete_sync()
      __try_to_del_timer_sync() // looped as long as timer is running
                       <deadlock>

In our case, we had in v4:

  inet_twsk_deschedule_put()
    spin_lock(ehash_lock);
                                          tw_timer_handler()
                                            inet_twsk_kill()
                                              spin_lock(ehash_lock);
                                              __inet_twsk_kill();
    timer_shutdown_sync(&tw->tw_timer);

The fix here is to move the timer deletion to a non-timer
context. Workqueues fit the bill, and as the tw_timer_handler() would just queue
a work item, I converted it to a delayed_work.
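To illustrate why this breaks the cycle, here is a minimal userspace Python
sketch of the fixed scheme. Names like ehash_lock and tw_timer_handler()
mirror the kernel code, but the threading/queue machinery is purely a
stand-in for the kernel's timer and workqueue infrastructure, not real
kernel behaviour: the timer callback only queues a work item and never takes
the lock, so cancelling the timer while holding the lock can no longer
deadlock.

```python
import threading
import time
import queue

ehash_lock = threading.Lock()
work_q = queue.Queue()
state = []

def kill_worker():
    # Workqueue stand-in: the inet_twsk_kill() part now runs in process
    # context, so it can take ehash_lock without involving the timer.
    while work_q.get() is not None:
        with ehash_lock:
            state.append("killed")

def tw_timer_handler():
    # The timer callback only queues the work item; it never touches
    # ehash_lock, which is what breaks the v4 deadlock cycle.
    work_q.put("kill")

worker = threading.Thread(target=kill_worker)
worker.start()

tw_timer = threading.Timer(0.01, tw_timer_handler)

with ehash_lock:          # inet_twsk_deschedule_put() critical section
    tw_timer.start()
    time.sleep(0.2)       # let the timer fire while we hold the lock
    tw_timer.cancel()     # safe: the callback never contends for ehash_lock
work_q.put(None)          # stop the worker
worker.join()
print(state)
```

In the v4 scheme the callback itself took ehash_lock, so the cancel under
the lock would have waited forever on a callback that was itself waiting on
the lock; with the work-item indirection both sides complete.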
Eric Dumazet April 16, 2024, 3:01 p.m. UTC | #3
On Mon, Apr 15, 2024 at 4:33 PM Valentin Schneider <vschneid@redhat.com> wrote:
>
> On 15/04/24 14:35, Eric Dumazet wrote:
> > On Mon, Apr 15, 2024 at 1:34 PM Valentin Schneider <vschneid@redhat.com> wrote:
> >>
> >> Hi,
> >>
> >> This is v5 of the series in which the tw_timer is un-pinned to get rid of
> >> interference in isolated-CPU setups.
> >>
> >> The first patch is a new one stemming from Jakub's bug report. It's there
> >> mainly to make reviewing a bit easier, but as it changes behaviour it should
> >> be squashed into the second one.
> >>
> >> Revisions
> >> =========
> >>
> >> v4 -> v5
> >> ++++++++
> >>
> >> o Rebased against latest Linus' tree
> >> o Converted tw_timer into a delayed work following Jakub's bug report on v4
> >>   http://lore.kernel.org/r/20240411100536.224fa1e7@kernel.org
> >
> > What was the issue again?
> >
> > Please explain precisely why it was fundamentally tied to the use of
> > timers (and why it was not possible to fix the issue without adding
> > work queues and more dependencies to the TCP stack).
>
> In v4 I added the use of the ehash lock to serialize arming the timewait
> timer vs destroying it (inet_twsk_schedule() vs inet_twsk_deschedule_put()).
>
> Unfortunately, holding a lock both in a timer callback and in the context
> in which it is destroyed is invalid. AIUI the issue is as follows:
>
>   CPUx                        CPUy
>   spin_lock(foo);
>                               <timer fires>
>                               call_timer_fn()
>                                 spin_lock(foo) // blocks
>   timer_shutdown_sync()
>     __timer_delete_sync()
>       __try_to_del_timer_sync() // looped as long as timer is running
>                        <deadlock>
>
> In our case, we had in v4:
>
>   inet_twsk_deschedule_put()
>     spin_lock(ehash_lock);
>                                           tw_timer_handler()
>                                             inet_twsk_kill()
>                                               spin_lock(ehash_lock);
>                                               __inet_twsk_kill();
>     timer_shutdown_sync(&tw->tw_timer);
>
> The fix here is to move the timer deletion to a non-timer
> context. Workqueues fit the bill, and as the tw_timer_handler() would just queue
> a work item, I converted it to a delayed_work.

I do not like this delayed work approach.

Adding more dependencies to the TCP stack is not very nice from a
maintainer's point of view.

Why couldn't you call timer_shutdown_sync() before grabbing the lock?
Valentin Schneider April 22, 2024, 2:31 p.m. UTC | #4
Apologies for the delayed reply, I was away for most of last week.

On 16/04/24 17:01, Eric Dumazet wrote:
> On Mon, Apr 15, 2024 at 4:33 PM Valentin Schneider <vschneid@redhat.com> wrote:
>>
>> On 15/04/24 14:35, Eric Dumazet wrote:
>> > On Mon, Apr 15, 2024 at 1:34 PM Valentin Schneider <vschneid@redhat.com> wrote:
>> >> v4 -> v5
>> >> ++++++++
>> >>
>> >> o Rebased against latest Linus' tree
>> >> o Converted tw_timer into a delayed work following Jakub's bug report on v4
>> >>   http://lore.kernel.org/r/20240411100536.224fa1e7@kernel.org
>> >
>> > What was the issue again?
>> >
>> > Please explain precisely why it was fundamentally tied to the use of
>> > timers (and why it was not possible to fix the issue without adding
>> > work queues and more dependencies to the TCP stack).
>>
>> In v4 I added the use of the ehash lock to serialize arming the timewait
>> timer vs destroying it (inet_twsk_schedule() vs inet_twsk_deschedule_put()).
>>
>> Unfortunately, holding a lock both in a timer callback and in the context
>> in which it is destroyed is invalid. AIUI the issue is as follows:
>>
>>   CPUx                        CPUy
>>   spin_lock(foo);
>>                               <timer fires>
>>                               call_timer_fn()
>>                                 spin_lock(foo) // blocks
>>   timer_shutdown_sync()
>>     __timer_delete_sync()
>>       __try_to_del_timer_sync() // looped as long as timer is running
>>                        <deadlock>
>>
>> In our case, we had in v4:
>>
>>   inet_twsk_deschedule_put()
>>     spin_lock(ehash_lock);
>>                                           tw_timer_handler()
>>                                             inet_twsk_kill()
>>                                               spin_lock(ehash_lock);
>>                                               __inet_twsk_kill();
>>     timer_shutdown_sync(&tw->tw_timer);
>>
>> The fix here is to move the timer deletion to a non-timer
>> context. Workqueues fit the bill, and as the tw_timer_handler() would just queue
>> a work item, I converted it to a delayed_work.
>
> I do not like this delayed work approach.
>
> Adding more dependencies to the TCP stack is not very nice from a
> maintainer's point of view.
>
> Why couldn't you call timer_shutdown_sync() before grabbing the lock?

We need the timer_shutdown_sync() and mod_timer() of tw->tw_timer to be
serialized in some way. If they aren't, we have the following race:

                             tcp_time_wait()
                               inet_twsk_hashdance()
  inet_twsk_deschedule_put()
    // Returns 0 because not pending, but prevents future arming
    timer_shutdown_sync()
                               inet_twsk_schedule()
                                 // Returns 0 as if the timer had been successfully armed
                                 mod_timer()

This means inet_twsk_deschedule_put() doesn't end up calling
inet_twsk_kill() (because the timer wasn't pending when it was shut down),
but inet_twsk_schedule() doesn't arm it either, despite the hashdance()
having updated the refcounts.

If we leave the deschedule as a del_timer_sync(), the timer ends up armed
in inet_twsk_schedule(), but that means waiting for the timer to fire to
clean up the resources despite having called inet_twsk_deschedule_put().
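To make the failure mode concrete, here is a small sequential Python replay
of the interleaving above. MockTimer is a deliberately simplified model of
the timer semantics described in this thread (shutdown reports whether the
timer was pending and forbids any future arming); it is an illustration, not
the real kernel API:

```python
class MockTimer:
    """Simplified model: shutdown_sync() reports whether the timer was
    pending and forbids any future arming; mod_timer() silently fails
    once the timer has been shut down."""
    def __init__(self):
        self.pending = False
        self.shut_down = False

    def shutdown_sync(self):
        was_pending = self.pending
        self.pending = False
        self.shut_down = True
        return was_pending

    def mod_timer(self, timeout):
        if self.shut_down:
            return False      # the timer will never fire
        self.pending = True
        return True

tw_timer = MockTimer()

# Replay of the racy interleaving: inet_twsk_hashdance() has run and
# published the socket, then inet_twsk_deschedule_put() wins the race...
will_kill = tw_timer.shutdown_sync()   # False: timer not yet armed, so
                                       # no inet_twsk_kill() on this path
# ...and inet_twsk_schedule() loses it:
armed = tw_timer.mod_timer(60)         # False: arming is now forbidden

# Both sides believe the other will clean up, so nobody ever runs
# inet_twsk_kill() and the refcounts taken by the hashdance are leaked.
print(will_kill, armed)
```

This is why the shutdown and the arming must be serialized under a common
lock, which in turn is what forces the timer deletion out of timer context.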
Valentin Schneider May 21, 2024, 9:03 a.m. UTC | #5
Hi,

On 22/04/24 16:31, Valentin Schneider wrote:
> Apologies for the delayed reply, I was away for most of last week.
>
> On 16/04/24 17:01, Eric Dumazet wrote:
>> On Mon, Apr 15, 2024 at 4:33 PM Valentin Schneider <vschneid@redhat.com> wrote:
>>>
>>> On 15/04/24 14:35, Eric Dumazet wrote:
>>> > On Mon, Apr 15, 2024 at 1:34 PM Valentin Schneider <vschneid@redhat.com> wrote:
>>> >> v4 -> v5
>>> >> ++++++++
>>> >>
>>> >> o Rebased against latest Linus' tree
>>> >> o Converted tw_timer into a delayed work following Jakub's bug report on v4
>>> >>   http://lore.kernel.org/r/20240411100536.224fa1e7@kernel.org
>>> >
>>> > What was the issue again ?
>>> >
>>> > Please explain precisely why it was fundamentally tied to the use of
>>> > timers (and this was not possible to fix the issue without
>>> > adding work queues and more dependencies to TCP stack)
>>>
>>> In v4 I added the use of the ehash lock to serialize arming the timewait
>>> timer vs destroying it (inet_twsk_schedule() vs inet_twsk_deschedule_put()).
>>>
>>> Unfortunately, holding a lock both in a timer callback and in the context
>>> in which it is destroyed is invalid. AIUI the issue is as follows:
>>>
>>>   CPUx                        CPUy
>>>   spin_lock(foo);
>>>                               <timer fires>
>>>                               call_timer_fn()
>>>                                 spin_lock(foo) // blocks
>>>   timer_shutdown_sync()
>>>     __timer_delete_sync()
>>>       __try_to_del_timer_sync() // looped as long as timer is running
>>>                        <deadlock>
>>>
>>> In our case, we had in v4:
>>>
>>>   inet_twsk_deschedule_put()
>>>     spin_lock(ehash_lock);
>>>                                           tw_timer_handler()
>>>                                             inet_twsk_kill()
>>>                                               spin_lock(ehash_lock);
>>>                                               __inet_twsk_kill();
>>>     timer_shutdown_sync(&tw->tw_timer);
>>>
>>> The fix here is to move the timer deletion to a non-timer
>>> context. Workqueues fit the bill, and as the tw_timer_handler() would just queue
>>> a work item, I converted it to a delayed_work.

Does this explanation make sense? This is the reasoning that drove me to
involve workqueues. I'm open to suggestions on alternative approaches.