
[v2,net-next,00/15] locking: Introduce nested-BH locking.

Message ID 20240503182957.1042122-1-bigeasy@linutronix.de

Message

Sebastian Andrzej Siewior May 3, 2024, 6:25 p.m. UTC
Disabling bottom halves acts as a per-CPU BKL. On PREEMPT_RT, code within
a local_bh_disable() section remains preemptible. As a result,
high-priority tasks (or threaded interrupts) will be blocked by
long-running lower-priority tasks (or threaded interrupts), which
includes softirq sections.

The proposed way out is to introduce explicit per-CPU locks for
resources which are protected by local_bh_disable() and use those only
on PREEMPT_RT so there is no additional overhead for !PREEMPT_RT builds.

The series introduces the infrastructure and converts large parts of
networking, which is the largest stakeholder here. Once this is done,
the per-CPU lock from local_bh_disable() on PREEMPT_RT can be lifted.
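
The conversion pattern then looks roughly like this (a simplified
sketch; the struct and field names below are made up for illustration
and are not taken from the patches):

| struct frag_cache {                           /* example name only */
|         local_lock_t    bh_lock;
|         struct sk_buff  *skb;
| };
| static DEFINE_PER_CPU(struct frag_cache, frag_cache) = {
|         .bh_lock = INIT_LOCAL_LOCK(bh_lock),
| };
|
| local_bh_disable();
| local_lock_nested_bh(&frag_cache.bh_lock);    /* only annotations on !RT,
|                                                * a per-CPU lock on RT */
| /* ... access this_cpu_ptr(&frag_cache) ... */
| local_unlock_nested_bh(&frag_cache.bh_lock);
| local_bh_enable();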

v1…v2 https://lore.kernel.org/all/20231215171020.687342-1-bigeasy@linutronix.de/:
- Jakub complained about touching networking drivers to make the
  additional locking work. Alexei complained about the additional
  locking within the XDP/eBPF case.
  This led to a change in how the per-CPU variables are accessed for the
  XDP/eBPF case. On PREEMPT_RT the variables are now stored on the stack
  and the pointer to the structure is saved in the task_struct, while
  keeping everything unchanged for !RT (see the sketch after this list).
  This was proposed as an RFC in
        v1: https://lore.kernel.org/all/20240213145923.2552753-1-bigeasy@linutronix.de/

  and then updated

        v2: https://lore.kernel.org/all/20240229183109.646865-1-bigeasy@linutronix.de/
          - Renamed the container struct from xdp_storage to bpf_net_context.
            Suggested by Toke Høiland-Jørgensen.
          - Use the container struct also on !PREEMPT_RT builds. Store the
            pointer to the on-stack struct in a per-CPU variable. Suggested by
            Toke Høiland-Jørgensen.

  This reduces the initial queue from 24 to 15 patches.

- There were complaints about scoped_guard(), which shifts the whole
  block and makes it harder to review because the whole block gets
  removed and added again. Its usage has been replaced with
  local_lock_nested_bh() and its unlock counterpart.
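
For the XDP/eBPF change from the first item, the flow is roughly the
following (an illustrative sketch; the helper names are placeholders
derived from the bpf_net_context naming and may not match the patches
exactly):

| struct bpf_net_context ctx;          /* scratch space, now on the stack */
|
| /* Publish the context: via a pointer in task_struct on PREEMPT_RT,
|  * via a per-CPU pointer on !PREEMPT_RT. */
| bpf_net_ctx_set(&ctx);
|
| /* ... run the XDP/eBPF program, which looks up the context ... */
|
| bpf_net_ctx_clear(&ctx);             /* un-publish before returning */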

Sebastian

Comments

Paolo Abeni May 6, 2024, 8:43 a.m. UTC | #1
On Fri, 2024-05-03 at 20:25 +0200, Sebastian Andrzej Siewior wrote:
> Disabling bottom halves acts as a per-CPU BKL. On PREEMPT_RT, code within
> a local_bh_disable() section remains preemptible. As a result,
> high-priority tasks (or threaded interrupts) will be blocked by
> long-running lower-priority tasks (or threaded interrupts), which
> includes softirq sections.
> 
> The proposed way out is to introduce explicit per-CPU locks for
> resources which are protected by local_bh_disable() and use those only
> on PREEMPT_RT so there is no additional overhead for !PREEMPT_RT builds.

Let me rephrase to check I understood the plan correctly.

The idea is to pair 'bare' local_bh_{disable,enable} with a local lock
and later make local_bh_{disable,enable} no-ops (on RT).

By 'bare' I mean not followed by a spin_lock() - which is enough to
ensure mutual exclusion vs BH on RT builds - am I correct?
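
To make it concrete, I mean the difference between these two patterns
(made-up example):

| /* 'bare': disabling BH is itself the serialization */
| local_bh_disable();
| /* ... touch per-CPU state ... */
| local_bh_enable();
|
| /* not 'bare': the spin_lock() already gives mutual exclusion on RT */
| local_bh_disable();
| spin_lock(&some_lock);
| /* ... */
| spin_unlock(&some_lock);
| local_bh_enable();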

> The series introduces the infrastructure and converts large parts of
> networking, which is the largest stakeholder here. Once this is done,
> the per-CPU lock from local_bh_disable() on PREEMPT_RT can be lifted.

AFAICS there are a bunch of local_bh_* call-sites under 'net' matching
the above description and not addressed here. Is this series supposed
to cover 'net' fully?

Could you please include the diffstat for the whole series? I
think/hope it will help catch the full picture more easily.

Note that some callers use local_bh_disable() with no additional lock
and no specific struct to protect, but to enforce explicit
serialization vs BH for a bunch of operations, e.g. the
local_bh_disable() in inet_twsk_purge().

I guess such call sites should be handled, too?

Thanks!

Paolo
Sebastian Andrzej Siewior May 6, 2024, 9:38 a.m. UTC | #2
On 2024-05-06 10:43:49 [+0200], Paolo Abeni wrote:
> On Fri, 2024-05-03 at 20:25 +0200, Sebastian Andrzej Siewior wrote:
> > Disabling bottom halves acts as a per-CPU BKL. On PREEMPT_RT, code within
> > a local_bh_disable() section remains preemptible. As a result,
> > high-priority tasks (or threaded interrupts) will be blocked by
> > long-running lower-priority tasks (or threaded interrupts), which
> > includes softirq sections.
> > 
> > The proposed way out is to introduce explicit per-CPU locks for
> > resources which are protected by local_bh_disable() and use those only
> > on PREEMPT_RT so there is no additional overhead for !PREEMPT_RT builds.
> 
> Let me rephrase to check I understood the plan correctly.
> 
> The idea is to pair 'bare' local_bh_{disable,enable} with a local lock
> and later make local_bh_{disable,enable} no-ops (on RT).
> 
> By 'bare' I mean not followed by a spin_lock() - which is enough to
> ensure mutual exclusion vs BH on RT builds - am I correct?

I might have misunderstood your rephrasing. But to make it clear:
| $ git grep -p local_lock\( kernel/softirq.c
| kernel/softirq.c=void __local_bh_disable_ip(unsigned long ip, unsigned int cnt)
| kernel/softirq.c:                       local_lock(&softirq_ctrl.lock);

this is what I want to remove. This is upstream RT only (not RT queue
only). !RT builds are not affected by this change.
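
For context, the PREEMPT_RT variant looks roughly like this (a heavily
simplified sketch of kernel/softirq.c; accounting and lockdep details
omitted):

| struct softirq_ctrl {
|         local_lock_t    lock;
|         int             cnt;
| };
| static DEFINE_PER_CPU(struct softirq_ctrl, softirq_ctrl) = {
|         .lock = INIT_LOCAL_LOCK(softirq_ctrl.lock),
| };
|
| void __local_bh_disable_ip(unsigned long ip, unsigned int cnt)
| {
|         /* First entry of this task into a BH-disabled section? */
|         if (!current->softirq_disable_cnt && preemptible())
|                 local_lock(&softirq_ctrl.lock);  /* <- this goes away */
|         /* ... update softirq_ctrl.cnt and the preempt counters ... */
| }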

> > The series introduces the infrastructure and converts large parts of
> > networking, which is the largest stakeholder here. Once this is done,
> > the per-CPU lock from local_bh_disable() on PREEMPT_RT can be lifted.
> 
> AFAICS there are a bunch of local_bh_* call-sites under 'net' matching
> the above description and not addressed here. Is this series supposed
> to cover 'net' fully?

The net subsystem has not been fully audited, but the major parts have
been. I checked global per-CPU variables, but there might be dynamic
ones, and new ones might have appeared in the meantime. There are
two things which are not yet fixed that I am aware of:
- tw_timer timer
  https://lore.kernel.org/all/20240415113436.3261042-1-vschneid@redhat.com/T/#u

- can gw
  https://lore.kernel.org/linux-can/20231031112349.y0aLoBrz@linutronix.de/
  https://lore.kernel.org/all/20231221123703.8170-1-socketcan@hartkopp.net/T/#u

That means those two need to be fixed first before the local_lock() can
disappear from local_bh_disable()/enable(). Also, the whole tree
should be checked.

> Could you please include the diffstat for the whole series? I
> think/hope it will help catch the full picture more easily.

total over the series:

| include/linux/filter.h              | 134 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-------
| include/linux/local_lock.h          |  21 +++++++++++++++++++++
| include/linux/local_lock_internal.h |  31 +++++++++++++++++++++++++++++++
| include/linux/lockdep.h             |   3 +++
| include/linux/netdevice.h           |  12 ++++++++++++
| include/linux/sched.h               |   9 ++++++++-
| include/net/seg6_local.h            |   1 +
| include/net/sock.h                  |   5 +++++
| kernel/bpf/cpumap.c                 |  27 +++++++++++----------------
| kernel/bpf/devmap.c                 |  16 ++++++++--------
| kernel/fork.c                       |   3 +++
| kernel/locking/spinlock.c           |   8 ++++++++
| net/bpf/test_run.c                  |  11 ++++++++++-
| net/bridge/br_netfilter_hooks.c     |   7 ++++++-
| net/core/dev.c                      |  39 +++++++++++++++++++++++++++++++++------
| net/core/dev.h                      |  20 ++++++++++++++++++++
| net/core/filter.c                   | 107 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++---------------------------------------
| net/core/lwt_bpf.c                  |   9 +++++----
| net/core/skbuff.c                   |  25 ++++++++++++++++---------
| net/ipv4/tcp_ipv4.c                 |  15 +++++++++++----
| net/ipv4/tcp_sigpool.c              |  17 +++++++++++++----
| net/ipv6/seg6_local.c               |  22 ++++++++++++++--------
| net/xdp/xsk.c                       |  19 +++++++++++--------
| 23 files changed, 445 insertions(+), 116 deletions(-)

> Note that some callers use local_bh_disable() with no additional lock
> and no specific struct to protect, but to enforce explicit
> serialization vs BH for a bunch of operations, e.g. the
> local_bh_disable() in inet_twsk_purge().
> 
> I guess such call sites should be handled, too?

Yes, but I didn't find much. inet_twsk_purge() is the first item on my
list. On RT, spin_lock() vs spin_lock_bh() usage does not deadlock and
can be mixed.

The only resources that can be protected by disabling BH are per-CPU
resources: either explicitly defined (such as napi_alloc_cache) or
implicit via other means of per-CPU usage such as a CPU-bound timer,
worker, …. Protecting global variables by disabling BH is broken on SMP
(see the CAN gw example), so I am not too worried about those.
Unless you are aware of a category I did not think of.
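
To illustrate the broken case (a made-up example, not a citation from
the tree):

| static int some_global_state;        /* global, not per-CPU */
|
| static void broken_update(void)
| {
|         local_bh_disable();           /* stops BH on THIS CPU only */
|         some_global_state++;          /* two CPUs can still race here */
|         local_bh_enable();
| }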

> Thanks!
> 
> Paolo

Sebastian
Paolo Abeni May 6, 2024, 2:12 p.m. UTC | #3
On Mon, 2024-05-06 at 11:38 +0200, Sebastian Andrzej Siewior wrote:
> On 2024-05-06 10:43:49 [+0200], Paolo Abeni wrote:
> > On Fri, 2024-05-03 at 20:25 +0200, Sebastian Andrzej Siewior wrote:
> > > Disabling bottom halves acts as a per-CPU BKL. On PREEMPT_RT, code within
> > > a local_bh_disable() section remains preemptible. As a result,
> > > high-priority tasks (or threaded interrupts) will be blocked by
> > > long-running lower-priority tasks (or threaded interrupts), which
> > > includes softirq sections.
> > > 
> > > The proposed way out is to introduce explicit per-CPU locks for
> > > resources which are protected by local_bh_disable() and use those only
> > > on PREEMPT_RT so there is no additional overhead for !PREEMPT_RT builds.
> > 
> > Let me rephrase to check I understood the plan correctly.
> > 
> > The idea is to pair 'bare' local_bh_{disable,enable} with a local lock
> > and later make local_bh_{disable,enable} no-ops (on RT).
> > 
> > By 'bare' I mean not followed by a spin_lock() - which is enough to
> > ensure mutual exclusion vs BH on RT builds - am I correct?
> 
> I might have I misunderstood your rephrase. But to make it clear:
> > $ git grep -p local_lock\( kernel/softirq.c
> > kernel/softirq.c=void __local_bh_disable_ip(unsigned long ip, unsigned int cnt)
> > kernel/softirq.c:                       local_lock(&softirq_ctrl.lock);
> 
> this is what I want to remove. This is upstream RT only (not RT queue
> only). !RT builds are not affected by this change.

I was trying to describe the places that need the additional
local_lock(), but I think we are on the same page WRT the overall
semantics.

> > 
> > Note that some callers use local_bh_disable() with no additional lock
> > and no specific struct to protect, but to enforce explicit
> > serialization vs BH for a bunch of operations, e.g. the
> > local_bh_disable() in inet_twsk_purge().
> > 
> > I guess such call sites should be handled, too?
> 
> Yes, but I didn't find much. inet_twsk_purge() is the first item on my
> list. On RT, spin_lock() vs spin_lock_bh() usage does not deadlock and
> can be mixed.
> 
> The only resources that can be protected by disabling BH are per-CPU
> resources: either explicitly defined (such as napi_alloc_cache) or
> implicit via other means of per-CPU usage such as a CPU-bound timer,
> worker, …. Protecting global variables by disabling BH is broken on SMP
> (see the CAN gw example), so I am not too worried about those.
> Unless you are aware of a category I did not think of.

I think sometimes the stack could call local_bh_enable() a while after
the paired spin-lock release, to enforce some serialization - similar to
what inet_twsk_purge() is doing - but I can't point to any specific
line off the top of my head.

A possible side-effect you should/could observe in the final tree is
more pressure on the process scheduler, as something like:

local_bh_disable()

<spinlock lock unlock>

<again spinlock lock unlock>

local_bh_enable()

could result in more invocations of the scheduler, right?

Cheers,

Paolo
Sebastian Andrzej Siewior May 6, 2024, 2:43 p.m. UTC | #4
On 2024-05-06 16:12:00 [+0200], Paolo Abeni wrote:
> 
> I think sometimes the stack could call local_bh_enable() a while after
> the paired spin-lock release, to enforce some serialization - similar to
> what inet_twsk_purge() is doing - but I can't point to any specific
> line off the top of my head.

I *think* inet_twsk_purge() is special because the timer is pinned
and the local_bh_disable() call ensures that the timer does not fire.
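
I.e. something along these lines (illustrative only):

| /* The timer is armed pinned to this CPU, so its callback runs in
|  * softirq context on this CPU. Disabling BH here keeps that callback
|  * from running concurrently without taking an explicit lock. */
| local_bh_disable();
| /* ... walk/modify the state the pinned timer callback also touches ... */
| local_bh_enable();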

> A possible side-effect you should/could observe in the final tree is
> more pressure on the process scheduler, as something like:
> 
> local_bh_disable()
> 
> <spinlock lock unlock>
> 
> <again spinlock lock unlock>
> 
> local_bh_enable()
> 
> could result in more invocations of the scheduler, right?

Yes, to some degree.
On PREEMPT_RT, "spinlock lock" does not disable preemption, so the section
remains preemptible. A task with elevated priority (SCHED_RR/FIFO/DL)
remains on the CPU unless preempted by a task with higher priority,
regardless of the locks.

A SCHED_OTHER task can be preempted by another SCHED_OTHER task even
with an acquired spinlock_t. This can be bad performance-wise if the
other SCHED_OTHER task preempts the lock owner and blocks on the same
lock. To cope with this we had something called PREEMPT_LAZY (now
PREEMPT_AUTO) in the RT queue to avoid preemption among SCHED_OTHER
tasks as long as a spinlock_t (or another lock that spins on !RT) is
acquired.
By removing the lock from local_bh_disable() we lose that "please don't
preempt me" behaviour from your scenario above across the BH-disabled
section for SCHED_OTHER tasks. Nothing changes for tasks with elevated
priority.

> Cheers,
> 
> Paolo

Sebastian