Message ID: 20250107085559.3081563-1-houtao@huaweicloud.com (mailing list archive)
Series: Free htab element out of bucket lock
On 1/7/2025 4:55 PM, Hou Tao wrote:
> From: Hou Tao <houtao1@huawei.com>
>
> Hi,
>
> The patch set continues the previous work [1] to move all the freeing
> of htab elements out of the bucket lock. One motivation for the patch
> set is the locking problem reported by Sebastian [2]: freeing a
> bpf_timer under PREEMPT_RT may acquire a spin-lock (namely
> softirq_expiry_lock), but the freeing procedure for a htab element
> already holds a raw-spin-lock (namely the bucket lock), so it
> triggers the warning "BUG: scheduling while atomic", as demonstrated
> by the selftests patch. Another motivation is to reduce the locked
> scope of the bucket lock.
>
> The patch set is structured as follows:
>
> * Patch #1 moves the element freeing out of the lock for
>   htab_lru_map_delete_node()
> * Patches #2~#3 move the element freeing out of the lock for
>   __htab_map_lookup_and_delete_elem()
> * Patches #4~#6 move the element freeing out of the lock for
>   htab_map_update_elem()
> * Patch #7 adds a selftest for the locking problem
>
> The changes for htab_map_update_elem() require some explanation. The
> reason the previous work [1] couldn't move the element freeing out of
> the bucket lock for a preallocated hash table is the ->extra_elems
> optimization: when alloc_htab_elem() returns, the old element has
> already been stashed in the per-cpu ->extra_elems. To handle that,
> patches #4~#6 break the reuse of ->extra_elems and the refill of
> ->extra_elems into two independent steps: do the reuse with the
> bucket lock held, and do the refill after unlocking the bucket lock.
> The downside is that concurrent updates on the same CPU may need to
> pop a free element from the per-cpu list instead of reusing
> ->extra_elems directly, but I think such cases will be rare.

Er, the break between the reuse and the refill of ->extra_elems is
buggy. It occasionally fails the htab_update/concurrent_update selftest
in BPF CI. It may also return an unexpected E2BIG error when the map is
full and there are concurrent overwrite procedures on the same CPU. I
need to figure out another way to handle the lock problem.

> Please see individual patches for more details. Comments are always
> welcome.
>
> [1]: https://lore.kernel.org/bpf/20241106063542.357743-1-houtao@huaweicloud.com
> [2]: https://lore.kernel.org/bpf/20241106084527.4gPrMnHt@linutronix.de
> [...]
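For concreteness, with the split attempted in patches #4~#6 the update
path for a preallocated htab looks roughly like this
(htab_prealloc_elem_alloc() is a hypothetical name for the factored-out
allocation, not the actual patch):

	ret = htab_lock_bucket(htab, b, hash, &flags);
	if (ret)
		return ret;

	l_old = lookup_elem_raw(head, hash, key, key_size);

	/* Reuse step, with the bucket lock held: when overwriting,
	 * take the per-cpu ->extra_elems slot; otherwise pop a free
	 * element from the per-cpu freelist. */
	l_new = htab_prealloc_elem_alloc(htab, l_old);
	hlist_nulls_add_head_rcu(&l_new->hash_node, head);
	if (l_old)
		hlist_nulls_del_rcu(&l_old->hash_node);

	htab_unlock_bucket(htab, b, hash, flags);

	/* Refill step, after unlocking: free the old element's special
	 * fields (bpf_timer et al. may take a spinlock_t on PREEMPT_RT)
	 * and stash the element back into ->extra_elems. */
	if (l_old)
		free_htab_elem(htab, l_old);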
Hi,

On 1/7/2025 8:09 PM, Hou Tao wrote:
> On 1/7/2025 4:55 PM, Hou Tao wrote:
>> From: Hou Tao <houtao1@huawei.com>
>>
>> [...]
>>
>> The changes for htab_map_update_elem() require some explanation. The
>> reason the previous work [1] couldn't move the element freeing out
>> of the bucket lock for a preallocated hash table is the
>> ->extra_elems optimization: when alloc_htab_elem() returns, the old
>> element has already been stashed in the per-cpu ->extra_elems. To
>> handle that, patches #4~#6 break the reuse of ->extra_elems and the
>> refill of ->extra_elems into two independent steps: do the reuse
>> with the bucket lock held, and do the refill after unlocking the
>> bucket lock. The downside is that concurrent updates on the same CPU
>> may need to pop a free element from the per-cpu list instead of
>> reusing ->extra_elems directly, but I think such cases will be rare.
> Er, the break between the reuse and the refill of ->extra_elems is
> buggy. It occasionally fails the htab_update/concurrent_update
> selftest in BPF CI. It may also return an unexpected E2BIG error when
> the map is full and there are concurrent overwrite procedures on the
> same CPU. I need to figure out another way to handle the lock
> problem.

Another approach would be to ensure that the reuse of extra_elems and
the freeing of the special fields in the extra_elems run sequentially.
However, considering that the reuse of extra_elems may happen in IRQ
context, I still cannot figure out a better way to handle the lock
problem. Instead, I have decided to change the implementation of
bpf_timer_delete_work() to fix the lock problem: it will call
hrtimer_try_to_cancel() first, and if that returns -1,
bpf_timer_delete_work() will queue a work item to try to cancel the
timer again.
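Something roughly like this (the cb.delete_work/cb.rcu field names
follow the current bpf_hrtimer layout; illustrative, not the final
patch):

	static void bpf_timer_delete_work(struct work_struct *work)
	{
		struct bpf_hrtimer *t = container_of(work, struct bpf_hrtimer,
						     cb.delete_work);

		/* hrtimer_try_to_cancel() never sleeps; it returns -1
		 * when the timer callback is currently running. */
		if (hrtimer_try_to_cancel(&t->timer) >= 0) {
			kfree_rcu(t, cb.rcu);
			return;
		}

		/* The callback is running: queue the work again and
		 * retry the cancellation later instead of blocking on
		 * softirq_expiry_lock. */
		queue_work(system_unbound_wq, &t->cb.delete_work);
	}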
@Sebastian
Is it possible to change softirq_expiry_lock to a raw-spin-lock
instead?
On 2025-01-08 09:24:02 [+0800], Hou Tao wrote:
> @Sebastian
> Is it possible to change softirq_expiry_lock to a raw-spin-lock
> instead?

No. The point is to PI-boost the timer task by the task that is
canceling the timer. This situation is possible if the timer task got
preempted by the canceling task: both can't be migrated to another CPU,
and if the canceling task has the higher priority, it will continue to
spin and live-lock the system.
Making the expiry lock raw would also force every timer to run with
interrupts disabled, which would not allow acquiring any spinlock_t
locks.
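The cancel side is roughly this (a trimmed sketch of
hrtimer_cancel_wait_running() in kernel/time/hrtimer.c):

	void hrtimer_cancel_wait_running(const struct hrtimer *timer)
	{
		struct hrtimer_clock_base *base = READ_ONCE(timer->base);

		/* Timers expiring in hard-irq context just spin. */
		if (!timer->is_soft || is_migration_base(base)) {
			cpu_relax();
			return;
		}

		/* Mark the base as contended and grab the expiry lock,
		 * which the softirq holds across the timer callback. On
		 * PREEMPT_RT the spinlock_t PI-boosts the softirq thread
		 * running the callback instead of spinning against it. */
		atomic_inc(&base->cpu_base->timer_waiters);
		spin_lock_bh(&base->cpu_base->softirq_expiry_lock);
		atomic_dec(&base->cpu_base->timer_waiters);
		spin_unlock_bh(&base->cpu_base->softirq_expiry_lock);
	}

Sebastian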
Hi,

On 1/8/2025 3:29 PM, Sebastian Andrzej Siewior wrote:
> On 2025-01-08 09:24:02 [+0800], Hou Tao wrote:
>> @Sebastian
>> Is it possible to change softirq_expiry_lock to a raw-spin-lock
>> instead?
> No. The point is to PI-boost the timer task by the task that is
> canceling the timer. This situation is possible if the timer task got
> preempted by the canceling task: both can't be migrated to another
> CPU, and if the canceling task has the higher priority, it will
> continue to spin and live-lock the system.
> Making the expiry lock raw would also force every timer to run with
> interrupts disabled, which would not allow acquiring any spinlock_t
> locks.

Thanks for the explanation. However, I still cannot understand why
making the expiry lock raw would force every timer to run with
interrupts disabled. In my simple understanding,
hrtimer_cpu_base_lock_expiry() doesn't disable IRQs. Do you mean that
if the expiry lock were changed to raw, it would also need to disable
IRQs to prevent something from happening? Also, does a raw spinlock
have the PI-boost functionality?

> Sebastian
On 2025-01-08 17:06:06 [+0800], Hou Tao wrote:
> Hi,

Hi,

> Thanks for the explanation. However, I still cannot understand why
> making the expiry lock raw would force every timer to run with
> interrupts disabled.

I'm sorry. Not disabled interrupts but preemption.
hrtimer_run_softirq() acquires the lock via
hrtimer_cpu_base_lock_expiry() and holds it during
__hrtimer_run_queues() -> __run_hrtimer(). Only hrtimer_cpu_base::lock
is dropped before the timer is invoked. If preemption is disabled, you
can not acquire a spinlock_t.

> In my simple understanding, hrtimer_cpu_base_lock_expiry() doesn't
> disable IRQs. Do you mean that if the expiry lock were changed to
> raw, it would also need to disable IRQs to prevent something from
> happening?

No, see the above, it should clear things up.

> Also, does a raw spinlock have the PI-boost functionality?

No. Blocking on a raw_spinlock_t means spinning until it is your turn
to acquire the lock. A spinlock_t becomes a rt_mutex on PREEMPT_RT, and
blocking on the lock means: give the current priority to the lock owner
(= priority inheritance) and go to sleep until the lock becomes
available.
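The softirq side looks roughly like this (a trimmed sketch of
hrtimer_run_softirq() in kernel/time/hrtimer.c):

	static void hrtimer_run_softirq(struct softirq_action *h)
	{
		struct hrtimer_cpu_base *cpu_base = this_cpu_ptr(&hrtimer_bases);
		unsigned long flags;
		ktime_t now;

		/* Held across every soft timer callback in this run. */
		hrtimer_cpu_base_lock_expiry(cpu_base);
		raw_spin_lock_irqsave(&cpu_base->lock, flags);

		now = hrtimer_update_base(cpu_base);
		/* cpu_base->lock is dropped around each callback; the
		 * expiry lock is not. */
		__hrtimer_run_queues(cpu_base, now, flags, HRTIMER_ACTIVE_SOFT);

		cpu_base->softirq_activated = 0;
		hrtimer_update_softirq_timer(cpu_base, true);

		raw_spin_unlock_irqrestore(&cpu_base->lock, flags);
		hrtimer_cpu_base_unlock_expiry(cpu_base);
	}

Sebastian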
From: Hou Tao <houtao1@huawei.com>

Hi,

The patch set continues the previous work [1] to move all the freeing
of htab elements out of the bucket lock. One motivation for the patch
set is the locking problem reported by Sebastian [2]: freeing a
bpf_timer under PREEMPT_RT may acquire a spin-lock (namely
softirq_expiry_lock), but the freeing procedure for a htab element
already holds a raw-spin-lock (namely the bucket lock), so it triggers
the warning "BUG: scheduling while atomic", as demonstrated by the
selftests patch. Another motivation is to reduce the locked scope of
the bucket lock.

The patch set is structured as follows:

* Patch #1 moves the element freeing out of the lock for
  htab_lru_map_delete_node()
* Patches #2~#3 move the element freeing out of the lock for
  __htab_map_lookup_and_delete_elem()
* Patches #4~#6 move the element freeing out of the lock for
  htab_map_update_elem()
* Patch #7 adds a selftest for the locking problem

The changes for htab_map_update_elem() require some explanation. The
reason the previous work [1] couldn't move the element freeing out of
the bucket lock for a preallocated hash table is the ->extra_elems
optimization: when alloc_htab_elem() returns, the old element has
already been stashed in the per-cpu ->extra_elems. To handle that,
patches #4~#6 break the reuse of ->extra_elems and the refill of
->extra_elems into two independent steps: do the reuse with the bucket
lock held, and do the refill after unlocking the bucket lock. The
downside is that concurrent updates on the same CPU may need to pop a
free element from the per-cpu list instead of reusing ->extra_elems
directly, but I think such cases will be rare.

Please see individual patches for more details. Comments are always
welcome.

[1]: https://lore.kernel.org/bpf/20241106063542.357743-1-houtao@huaweicloud.com
[2]: https://lore.kernel.org/bpf/20241106084527.4gPrMnHt@linutronix.de

Hou Tao (7):
  bpf: Free special fields after unlock in htab_lru_map_delete_node()
  bpf: Bail out early in __htab_map_lookup_and_delete_elem()
  bpf: Free element after unlock in __htab_map_lookup_and_delete_elem()
  bpf: Support refilling extra_elems in free_htab_elem()
  bpf: Factor out the element allocation for pre-allocated htab
  bpf: Free element after unlock for pre-allocated htab
  selftests/bpf: Add test case for the freeing of bpf_timer

 kernel/bpf/hashtab.c                           | 170 ++++++++++--------
 .../selftests/bpf/prog_tests/free_timer.c      | 165 +++++++++++++++++
 .../testing/selftests/bpf/progs/free_timer.c   |  71 ++++++++
 3 files changed, 332 insertions(+), 74 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/free_timer.c
 create mode 100644 tools/testing/selftests/bpf/progs/free_timer.c
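To illustrate the common pattern of the series, after patch #1
htab_lru_map_delete_node() frees the special fields only once the
bucket lock has been dropped, roughly (a sketch, not the exact diff):

	static bool htab_lru_map_delete_node(void *arg, struct bpf_lru_node *node)
	{
		struct bpf_htab *htab = arg;
		struct htab_elem *l = NULL, *tgt_l;
		struct hlist_nulls_node *n;
		struct hlist_nulls_head *head;
		unsigned long flags;
		struct bucket *b;
		int ret;

		tgt_l = container_of(node, struct htab_elem, lru_node);
		b = __select_bucket(htab, tgt_l->hash);
		head = &b->head;

		ret = htab_lock_bucket(htab, b, tgt_l->hash, &flags);
		if (ret)
			return false;

		hlist_nulls_for_each_entry_rcu(l, n, head, hash_node)
			if (l == tgt_l) {
				hlist_nulls_del_rcu(&l->hash_node);
				break;
			}

		htab_unlock_bucket(htab, b, tgt_l->hash, flags);

		/* Moved out of the bucket lock: freeing bpf_timer et al.
		 * may take a spinlock_t on PREEMPT_RT. */
		if (l == tgt_l)
			check_and_free_fields(htab, l);

		return l == tgt_l;
	}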