Message ID: 20240503182957.1042122-1-bigeasy@linutronix.de (mailing list archive)
Series: locking: Introduce nested-BH locking.
On Fri, 2024-05-03 at 20:25 +0200, Sebastian Andrzej Siewior wrote:
> Disabling bottom halves acts as a per-CPU BKL. On PREEMPT_RT, code
> within a local_bh_disable() section remains preemptible. As a result,
> high-priority tasks (or threaded interrupts) will be blocked by
> long-running lower-priority tasks (or threaded interrupts), which
> includes softirq sections.
>
> The proposed way out is to introduce explicit per-CPU locks for
> resources which are protected by local_bh_disable() and use those only
> on PREEMPT_RT so there is no additional overhead for !PREEMPT_RT
> builds.

Let me rephrase to check I understood the plan correctly.

The idea is to pair 'bare' local_bh_{disable,enable} with a local lock
and later make local_bh_{disable,enable} no-ops (on RT).

With 'bare' I mean not followed by a spin_lock() - which is enough to
ensure mutual exclusion vs BH on an RT build - am I correct?

> The series introduces the infrastructure and converts large parts of
> networking, which is the largest stakeholder here. Once this is done,
> the per-CPU lock from local_bh_disable() on PREEMPT_RT can be lifted.

AFAICS there are a bunch of local_bh_* call-sites under 'net' matching
the above description and not addressed here. Is this series supposed
to cover 'net' fully?

Could you please include the diffstat for the whole series? I
think/hope it will help in catching the full picture more easily.

Note that some callers use local_bh_disable() with no additional lock,
and there is no specific struct to protect, but they enforce explicit
serialization vs BH for a bunch of operations, e.g. the
local_bh_disable() in inet_twsk_purge().

I guess such call sites should be handled, too?

Thanks!

Paolo
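To make the 'bare' vs non-bare distinction concrete, a minimal sketch
(illustrative only; 'demo_cache', 'demo_lock' and the two functions are
made-up names, not taken from the series):

	/* Case 1: 'bare' -- local_bh_disable() is the only protection
	 * for the per-CPU data. On RT this section is preemptible, so
	 * it needs to be paired with an explicit per-CPU local lock. */
	struct demo_cache {
		unsigned long hits;
	};
	static DEFINE_PER_CPU(struct demo_cache, demo_cache);

	static void demo_bare(void)
	{
		local_bh_disable();
		this_cpu_ptr(&demo_cache)->hits++; /* relies solely on BH exclusion */
		local_bh_enable();
	}

	/* Case 2: not 'bare' -- a spin_lock() follows. On RT the
	 * spinlock_t already provides the mutual exclusion, so no
	 * extra local lock is needed here. */
	static DEFINE_SPINLOCK(demo_lock);

	static void demo_locked(void)
	{
		local_bh_disable();
		spin_lock(&demo_lock);
		/* ... protected by demo_lock on both RT and !RT ... */
		spin_unlock(&demo_lock);
		local_bh_enable();
	}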
On 2024-05-06 10:43:49 [+0200], Paolo Abeni wrote:
> On Fri, 2024-05-03 at 20:25 +0200, Sebastian Andrzej Siewior wrote:
> > Disabling bottom halves acts as a per-CPU BKL. On PREEMPT_RT, code
> > within a local_bh_disable() section remains preemptible. As a result,
> > high-priority tasks (or threaded interrupts) will be blocked by
> > long-running lower-priority tasks (or threaded interrupts), which
> > includes softirq sections.
> >
> > The proposed way out is to introduce explicit per-CPU locks for
> > resources which are protected by local_bh_disable() and use those only
> > on PREEMPT_RT so there is no additional overhead for !PREEMPT_RT
> > builds.
>
> Let me rephrase to check I understood the plan correctly.
>
> The idea is to pair 'bare' local_bh_{disable,enable} with a local lock
> and later make local_bh_{disable,enable} no-ops (on RT).
>
> With 'bare' I mean not followed by a spin_lock() - which is enough to
> ensure mutual exclusion vs BH on an RT build - am I correct?

I might have misunderstood your rephrase. But to make it clear:

| $ git grep -p local_lock\( kernel/softirq.c
| kernel/softirq.c=void __local_bh_disable_ip(unsigned long ip, unsigned int cnt)
| kernel/softirq.c:               local_lock(&softirq_ctrl.lock);

this is what I want to remove. This is upstream RT only (not RT-queue
only). !RT builds are not affected by this change.

> > The series introduces the infrastructure and converts large parts of
> > networking, which is the largest stakeholder here. Once this is done,
> > the per-CPU lock from local_bh_disable() on PREEMPT_RT can be lifted.
>
> AFAICS there are a bunch of local_bh_* call-sites under 'net' matching
> the above description and not addressed here. Is this series supposed
> to cover 'net' fully?

The net subsystem has not been fully audited but the major parts have
been. I checked global per-CPU variables but there might be dynamic
ones. Also new ones might have appeared in the meantime.
There are two things which are not fixed yet that I am aware of:
- tw_timer timer
  https://lore.kernel.org/all/20240415113436.3261042-1-vschneid@redhat.com/T/#u
- can gw
  https://lore.kernel.org/linux-can/20231031112349.y0aLoBrz@linutronix.de/
  https://lore.kernel.org/all/20231221123703.8170-1-socketcan@hartkopp.net/T/#u

That means those two need to be fixed first before the local_lock() can
disappear from local_bh_disable()/enable(). Also the whole tree should
be checked.

> Could you please include the diffstat for the whole series? I
> think/hope it will help in catching the full picture more easily.
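For context, a heavily condensed sketch of what that PREEMPT_RT code
path does (not the verbatim implementation; the real function also
handles the nesting counters, RCU and lockdep):

	/* Condensed sketch of the PREEMPT_RT path in kernel/softirq.c. */
	void __local_bh_disable_ip(unsigned long ip, unsigned int cnt)
	{
		/* First BH-disable for this task in preemptible context:
		 * take the per-CPU lock. This lock is what turns
		 * BH-disabled regions into a per-CPU "BKL" on RT, and it
		 * is what the series wants to remove. */
		if (!current->softirq_disable_cnt && preemptible())
			local_lock(&softirq_ctrl.lock);
		/* ... per-task and per-CPU softirq accounting ... */
	}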
total over the series:

 include/linux/filter.h              | 134 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-------
 include/linux/local_lock.h          |  21 +++++++++++++++++++++
 include/linux/local_lock_internal.h |  31 +++++++++++++++++++++++++++++++
 include/linux/lockdep.h             |   3 +++
 include/linux/netdevice.h           |  12 ++++++++++++
 include/linux/sched.h               |   9 ++++++++-
 include/net/seg6_local.h            |   1 +
 include/net/sock.h                  |   5 +++++
 kernel/bpf/cpumap.c                 |  27 +++++++++++----------------
 kernel/bpf/devmap.c                 |  16 ++++++++--------
 kernel/fork.c                       |   3 +++
 kernel/locking/spinlock.c           |   8 ++++++++
 net/bpf/test_run.c                  |  11 ++++++++++-
 net/bridge/br_netfilter_hooks.c     |   7 ++++++-
 net/core/dev.c                      |  39 +++++++++++++++++++++++++++++++++------
 net/core/dev.h                      |  20 ++++++++++++++++++++
 net/core/filter.c                   | 107 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++---------------------------------------
 net/core/lwt_bpf.c                  |   9 +++++----
 net/core/skbuff.c                   |  25 ++++++++++++++++---------
 net/ipv4/tcp_ipv4.c                 |  15 +++++++++++----
 net/ipv4/tcp_sigpool.c              |  17 +++++++++++++----
 net/ipv6/seg6_local.c               |  22 ++++++++++++++--------
 net/xdp/xsk.c                       |  19 +++++++++++--------
 23 files changed, 445 insertions(+), 116 deletions(-)

> Note that some callers use local_bh_disable() with no additional lock,
> and there is no specific struct to protect, but they enforce explicit
> serialization vs BH for a bunch of operations, e.g. the
> local_bh_disable() in inet_twsk_purge().
>
> I guess such call sites should be handled, too?

Yes, but I didn't find much. inet_twsk_purge() is the first item on my
list. On RT, spin_lock() vs spin_lock_bh() usage does not deadlock and
can be mixed.

The only resources that can be protected by disabling BH are per-CPU
resources: either explicitly defined (such as napi_alloc_cache) or
implicit through other means of per-CPU usage, such as a CPU-bound
timer, worker, …. Protecting global variables by disabling BH is broken
on SMP (see the CAN gw example) so I am not too worried about those.
Unless you are aware of a category I did not think of.

> Thanks!
>
> Paolo

Sebastian
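As an illustration of the conversion pattern for such explicitly
defined per-CPU resources, a minimal sketch (assuming the
local_lock_nested_bh()/local_unlock_nested_bh() API this series
introduces; 'frag_cache' and 'demo_alloc' are made-up names, loosely
modelled on the napi_alloc_cache case):

	/* Sketch only; names are illustrative. */
	struct frag_cache {
		local_lock_t	bh_lock;
		unsigned int	offset;
	};

	static DEFINE_PER_CPU(struct frag_cache, frag_cache) = {
		.bh_lock = INIT_LOCAL_LOCK(bh_lock),
	};

	static unsigned int demo_alloc(unsigned int size)
	{
		struct frag_cache *fc;
		unsigned int off;

		local_bh_disable();	/* still the BH barrier on !RT */
		/* On RT this takes the per-resource lock; on !RT it is
		 * essentially a lockdep annotation, since BH is already
		 * disabled. */
		local_lock_nested_bh(&frag_cache.bh_lock);
		fc = this_cpu_ptr(&frag_cache);
		off = fc->offset;
		fc->offset += size;	/* carve out of the per-CPU cache */
		local_unlock_nested_bh(&frag_cache.bh_lock);
		local_bh_enable();
		return off;
	}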
On Mon, 2024-05-06 at 11:38 +0200, Sebastian Andrzej Siewior wrote:
> On 2024-05-06 10:43:49 [+0200], Paolo Abeni wrote:
> > On Fri, 2024-05-03 at 20:25 +0200, Sebastian Andrzej Siewior wrote:
> > > Disabling bottom halves acts as a per-CPU BKL. On PREEMPT_RT, code
> > > within a local_bh_disable() section remains preemptible. As a result,
> > > high-priority tasks (or threaded interrupts) will be blocked by
> > > long-running lower-priority tasks (or threaded interrupts), which
> > > includes softirq sections.
> > >
> > > The proposed way out is to introduce explicit per-CPU locks for
> > > resources which are protected by local_bh_disable() and use those only
> > > on PREEMPT_RT so there is no additional overhead for !PREEMPT_RT
> > > builds.
> >
> > Let me rephrase to check I understood the plan correctly.
> >
> > The idea is to pair 'bare' local_bh_{disable,enable} with a local lock
> > and later make local_bh_{disable,enable} no-ops (on RT).
> >
> > With 'bare' I mean not followed by a spin_lock() - which is enough to
> > ensure mutual exclusion vs BH on an RT build - am I correct?
>
> I might have misunderstood your rephrase. But to make it clear:
>
> | $ git grep -p local_lock\( kernel/softirq.c
> | kernel/softirq.c=void __local_bh_disable_ip(unsigned long ip, unsigned int cnt)
> | kernel/softirq.c:               local_lock(&softirq_ctrl.lock);
>
> this is what I want to remove. This is upstream RT only (not RT-queue
> only). !RT builds are not affected by this change.

I was trying to describe the places that need the additional
local_lock(), but I think we are on the same page WRT the overall
semantic.

> > Note that some callers use local_bh_disable() with no additional lock,
> > and there is no specific struct to protect, but they enforce explicit
> > serialization vs BH for a bunch of operations, e.g. the
> > local_bh_disable() in inet_twsk_purge().
> >
> > I guess such call sites should be handled, too?
>
> Yes, but I didn't find much. inet_twsk_purge() is the first item on my
> list. On RT, spin_lock() vs spin_lock_bh() usage does not deadlock and
> can be mixed.
>
> The only resources that can be protected by disabling BH are per-CPU
> resources: either explicitly defined (such as napi_alloc_cache) or
> implicit through other means of per-CPU usage, such as a CPU-bound
> timer, worker, …. Protecting global variables by disabling BH is broken
> on SMP (see the CAN gw example) so I am not too worried about those.
> Unless you are aware of a category I did not think of.

I think sometimes the stack could call local_bh_enable() a while after
the paired spin lock release, to enforce some serialization - alike
what inet_twsk_purge() is doing - but I can't point to any specific
line off the top of my head.

A possible side effect you should/could observe in the final tree is
more pressure on the process scheduler, as something alike:

	local_bh_disable()

	<spinlock lock unlock>

	<again spinlock lock unlock>

	local_bh_enable()

could result in more invocations of the scheduler, right?

Cheers,

Paolo
On 2024-05-06 16:12:00 [+0200], Paolo Abeni wrote:
>
> I think sometimes the stack could call local_bh_enable() a while after
> the paired spin lock release, to enforce some serialization - alike
> what inet_twsk_purge() is doing - but I can't point to any specific
> line off the top of my head.

I *think* inet_twsk_purge() is special because the timer is pinned and
the bh_disable call ensures that the timer does not fire.

> A possible side effect you should/could observe in the final tree is
> more pressure on the process scheduler, as something alike:
>
> 	local_bh_disable()
>
> 	<spinlock lock unlock>
>
> 	<again spinlock lock unlock>
>
> 	local_bh_enable()
>
> could result in more invocations of the scheduler, right?

Yes, to some degree. On PREEMPT_RT, "spinlock lock" does not disable
preemption, so the section remains preemptible. A task with elevated
priority (SCHED_RR/FIFO/DL) remains on the CPU unless preempted by a
task with higher priority, regardless of the locks. A SCHED_OTHER task
can be preempted by another SCHED_OTHER task even with an acquired
spinlock_t. This can be bad performance-wise if this other SCHED_OTHER
task preempts the lock owner and then blocks on the same lock.

To cope with this we had something called PREEMPT_LAZY (now
PREEMPT_AUTO) in the RT queue to avoid preemption between SCHED_OTHER
tasks as long as a spinlock_t (or another lock that spins on !RT) is
acquired. By removing the lock from local_bh_disable() we lose that
"please don't preempt me" behaviour across the BH-disabled section in
your scenario above for SCHED_OTHER tasks. Nothing changes for tasks
with elevated priority.

> Cheers,
>
> Paolo

Sebastian
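A toy sketch of the scenario under discussion (illustrative only;
'demo_lock', 'demo_count' and 'demo_section' are made-up names):

	/* On PREEMPT_RT, spinlock_t is a sleeping lock and the section
	 * between lock/unlock stays preemptible. Once the per-CPU lock
	 * is gone from local_bh_disable(), the whole section below is
	 * preemptible for SCHED_OTHER tasks. */
	static DEFINE_SPINLOCK(demo_lock);
	static unsigned long demo_count;

	static void demo_section(void)
	{
		local_bh_disable();	/* no longer takes a per-CPU lock on RT */

		spin_lock(&demo_lock);	/* preemptible on RT while held */
		demo_count++;
		spin_unlock(&demo_lock);
		/* Preemption point on RT: another SCHED_OTHER task may run
		 * here and, if it contends on demo_lock, extra scheduler
		 * invocations follow. */
		spin_lock(&demo_lock);
		demo_count++;
		spin_unlock(&demo_lock);

		local_bh_enable();
	}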