[bpf-next,0/7] Free htab element out of bucket lock

Message ID 20250107085559.3081563-1-houtao@huaweicloud.com (mailing list archive)

Message

Hou Tao Jan. 7, 2025, 8:55 a.m. UTC
From: Hou Tao <houtao1@huawei.com>

Hi,

The patch set continues the previous work [1] to move all the freeing
of htab elements out of the bucket lock. One motivation for the patch
set is the locking problem reported by Sebastian [2]: freeing a
bpf_timer under PREEMPT_RT may acquire a spinlock (namely
softirq_expiry_lock). However, the freeing procedure for an htab
element already holds a raw spinlock (namely the bucket lock), so the
freeing triggers the warning "BUG: scheduling while atomic", as
demonstrated by the selftest patch. Another motivation is to shrink
the scope in which the bucket lock is held.
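
For illustration, the problematic nesting looks roughly like this (a
simplified sketch reconstructed from the description above; the real
code in kernel/bpf/hashtab.c takes the bucket lock through
htab_lock_bucket()):

  raw_spin_lock_irqsave(&b->raw_lock, flags); /* bucket lock: raw, non-sleeping */
  l = lookup_elem_raw(head, hash, key, key_size);
  if (l)
          hlist_nulls_del_rcu(&l->hash_node);
  /*
   * Freeing the element may reach bpf_timer_cancel_and_free(), which
   * on PREEMPT_RT can take softirq_expiry_lock - a sleeping
   * spinlock_t - while we are still atomic, hence the
   * "BUG: scheduling while atomic" splat.
   */
  free_htab_elem(htab, l);
  raw_spin_unlock_irqrestore(&b->raw_lock, flags);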

The patch set is structured as follows:

* Patch #1 moves the element freeing out of lock for
  htab_lru_map_delete_node()
* Patches #2-#3 move the element freeing out of lock for
  __htab_map_lookup_and_delete_elem()
* Patches #4-#6 move the element freeing out of lock for
  htab_map_update_elem()
* Patch #7 adds a selftest for the locking problem

The changes for htab_map_update_elem() require some explanation. The
reason the previous work [1] could not move the element freeing out of
the bucket lock for preallocated hash tables is the ->extra_elems
optimization: when alloc_htab_elem() returns, the old element being
overwritten has already been stashed in the per-cpu ->extra_elems. To
handle that, patches #4-#6 break the reuse of ->extra_elems and the
refill of ->extra_elems into two independent steps: reuse the element
with the bucket lock held, then refill ->extra_elems after unlocking
the bucket lock. The downside is that concurrent updates on the same
CPU may need to pop a free element from the per-cpu free list instead
of reusing ->extra_elems directly, but I expect such cases to be rare.
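
A rough sketch of the intended two-step scheme (illustrative only: the
variable names l_old/l_new/b/head and the freelist fallback in step 2
are simplifications, not the exact code from the patches):

  struct htab_elem **pl;

  /* step 1: reuse, with the bucket lock held */
  raw_spin_lock_irqsave(&b->raw_lock, flags);
  pl = this_cpu_ptr(htab->extra_elems);
  l_new = *pl;                            /* reuse the stashed element */
  *pl = NULL;                             /* but do NOT restash l_old yet */
  copy_map_value(&htab->map, l_new->key + round_up(key_size, 8), value);
  hlist_nulls_add_head_rcu(&l_new->hash_node, head);
  hlist_nulls_del_rcu(&l_old->hash_node); /* unlink only, no freeing */
  raw_spin_unlock_irqrestore(&b->raw_lock, flags);

  /* step 2: after unlock, free the special fields (e.g. bpf_timer)
   * and refill ->extra_elems with the old element
   */
  bpf_obj_free_fields(htab->map.record, l_old->key + round_up(key_size, 8));
  pl = this_cpu_ptr(htab->extra_elems);
  if (!*pl)
          *pl = l_old;                    /* refill */
  else
          pcpu_freelist_push(&htab->freelist, &l_old->fnode);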

Please see individual patches for more details. Comments are always
welcome.

[1]: https://lore.kernel.org/bpf/20241106063542.357743-1-houtao@huaweicloud.com
[2]: https://lore.kernel.org/bpf/20241106084527.4gPrMnHt@linutronix.de

Hou Tao (7):
  bpf: Free special fields after unlock in htab_lru_map_delete_node()
  bpf: Bail out early in __htab_map_lookup_and_delete_elem()
  bpf: Free element after unlock in __htab_map_lookup_and_delete_elem()
  bpf: Support refilling extra_elems in free_htab_elem()
  bpf: Factor out the element allocation for pre-allocated htab
  bpf: Free element after unlock for pre-allocated htab
  selftests/bpf: Add test case for the freeing of bpf_timer

 kernel/bpf/hashtab.c                          | 170 ++++++++++--------
 .../selftests/bpf/prog_tests/free_timer.c     | 165 +++++++++++++++++
 .../testing/selftests/bpf/progs/free_timer.c  |  71 ++++++++
 3 files changed, 332 insertions(+), 74 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/free_timer.c
 create mode 100644 tools/testing/selftests/bpf/progs/free_timer.c

Comments

Hou Tao Jan. 7, 2025, 12:09 p.m. UTC | #1
On 1/7/2025 4:55 PM, Hou Tao wrote:
> [...]
>
> The changes for htab_map_update_elem() require some explanation. The
> reason the previous work [1] could not move the element freeing out of
> the bucket lock for preallocated hash tables is the ->extra_elems
> optimization: when alloc_htab_elem() returns, the old element being
> overwritten has already been stashed in the per-cpu ->extra_elems. To
> handle that, patches #4-#6 break the reuse of ->extra_elems and the
> refill of ->extra_elems into two independent steps: reuse the element
> with the bucket lock held, then refill ->extra_elems after unlocking
> the bucket lock. The downside is that concurrent updates on the same
> CPU may need to pop a free element from the per-cpu free list instead
> of reusing ->extra_elems directly, but I expect such cases to be rare.

Er, breaking the reuse and refill of ->extra_elems into two steps is
buggy. It occasionally fails the htab_update/concurrent_update test in
BPF CI. It may also return an unexpected E2BIG error when the map is
full and there are concurrent overwrite procedures on the same CPU,
because an overwrite that could previously always reuse ->extra_elems
may now find the per-cpu free list empty. I need to figure out another
way to handle the locking problem.
Hou Tao Jan. 8, 2025, 1:24 a.m. UTC | #2
Hi,

On 1/7/2025 8:09 PM, Hou Tao wrote:
>
> On 1/7/2025 4:55 PM, Hou Tao wrote:
>> [...]
> Er, breaking the reuse and refill of ->extra_elems into two steps is
> buggy. It occasionally fails the htab_update/concurrent_update test in
> BPF CI. It may also return an unexpected E2BIG error when the map is
> full and there are concurrent overwrite procedures on the same CPU,
> because an overwrite that could previously always reuse ->extra_elems
> may now find the per-cpu free list empty. I need to figure out another
> way to handle the locking problem.

Another approach is to ensure that the reuse of extra_elems and the
freeing of the special fields in the extra_elems run sequentially.
However, considering that the reuse of extra_elems may happen in IRQ
context, I still cannot figure out a better way to handle the locking
problem that way. I have decided instead to change the implementation
of bpf_timer_delete_work() to fix it: it will try
hrtimer_try_to_cancel() first; if that returns -1 (meaning the timer
callback is currently running), bpf_timer_delete_work() will queue the
work again to retry the cancellation.
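
Roughly like this (an illustrative sketch of the idea only; the field
names follow the current bpf_timer_delete_work() in
kernel/bpf/helpers.c, but this is not a final patch):

  static void bpf_timer_delete_work(struct work_struct *work)
  {
          struct bpf_hrtimer *t = container_of(work, struct bpf_hrtimer,
                                               cb.delete_work);

          /* Non-blocking cancel: returns -1 only when the timer
           * callback is currently running.
           */
          if (hrtimer_try_to_cancel(&t->timer) >= 0) {
                  kfree_rcu(t, cb.rcu);
                  return;
          }

          /* The callback is running: requeue the work and retry the
           * cancellation later instead of blocking on
           * softirq_expiry_lock as hrtimer_cancel() would on
           * PREEMPT_RT.
           */
          schedule_work(&t->cb.delete_work);
  }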

@Sebastian
Would it be possible to change softirq_expiry_lock to a raw spinlock
instead?
Sebastian Andrzej Siewior Jan. 8, 2025, 7:29 a.m. UTC | #3
On 2025-01-08 09:24:02 [+0800], Hou Tao wrote:
> @Sebastian
> Would it be possible to change softirq_expiry_lock to a raw spinlock
> instead?

No. The point is to let the task that is canceling the timer PI-boost
the timer task. The timer task may get preempted by the canceling task
- neither can migrate to another CPU, and if the canceling task has the
higher priority it would otherwise continue to spin and live-lock the
system.
Making the expiry lock raw would also force every timer to run with
disabled interrupts, which would not allow acquiring any spinlock_t
locks.

Sebastian
Hou Tao Jan. 8, 2025, 9:06 a.m. UTC | #4
Hi,

On 1/8/2025 3:29 PM, Sebastian Andrzej Siewior wrote:
> On 2025-01-08 09:24:02 [+0800], Hou Tao wrote:
>> @Sebastian
>> Would it be possible to change softirq_expiry_lock to a raw spinlock
>> instead?
> No. The point is to let the task that is canceling the timer PI-boost
> the timer task. The timer task may get preempted by the canceling task
> - neither can migrate to another CPU, and if the canceling task has
> the higher priority it would otherwise continue to spin and live-lock
> the system.
> Making the expiry lock raw would also force every timer to run with
> disabled interrupts, which would not allow acquiring any spinlock_t
> locks.

Thanks for the explanation. However, I still cannot understand why
making the expiry lock raw would force every timer to run with
interrupts disabled. In my simple understanding,
hrtimer_cpu_base_lock_expiry() doesn't disable IRQs. Do you mean that
if the expiry lock were changed to a raw spinlock, it would also need
to disable IRQs to prevent something from happening, right? Also, does
a raw spinlock have the PI-boost functionality?
Sebastian Andrzej Siewior Jan. 8, 2025, 9:16 a.m. UTC | #5
On 2025-01-08 17:06:06 [+0800], Hou Tao wrote:
> Hi,
Hi,

> On 1/8/2025 3:29 PM, Sebastian Andrzej Siewior wrote:
> > On 2025-01-08 09:24:02 [+0800], Hou Tao wrote:
> >> @Sebastian
> >> Would it be possible to change softirq_expiry_lock to a raw spinlock
> >> instead?
> > No. The point is to let the task that is canceling the timer PI-boost
> > the timer task. The timer task may get preempted by the canceling
> > task - neither can migrate to another CPU, and if the canceling task
> > has the higher priority it would otherwise continue to spin and
> > live-lock the system.
> > Making the expiry lock raw would also force every timer to run with
> > disabled interrupts, which would not allow acquiring any spinlock_t
> > locks.
> 
> Thanks for the explanation. However, I still cannot understand why
> making the expiry lock raw would force every timer to run with
> interrupts disabled.

I'm sorry. Not disabled interrupts but preemption. hrtimer_run_softirq()
acquires the lock via hrtimer_cpu_base_lock_expiry() and holds it
throughout __hrtimer_run_queues() -> __run_hrtimer(). Only
hrtimer_cpu_base::lock is dropped before the timer is invoked.
With preemption disabled you cannot acquire a spinlock_t.
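
For reference, the flow looks roughly like this (simplified from
kernel/time/hrtimer.c; a sketch, not the verbatim source):

  static void hrtimer_run_softirq(struct softirq_action *h)
  {
          struct hrtimer_cpu_base *cpu_base = this_cpu_ptr(&hrtimer_bases);
          unsigned long flags;
          ktime_t now;

          hrtimer_cpu_base_lock_expiry(cpu_base);  /* softirq_expiry_lock */
          raw_spin_lock_irqsave(&cpu_base->lock, flags);
          now = hrtimer_update_base(cpu_base);
          /* __run_hrtimer() drops cpu_base->lock around each timer
           * callback, but the expiry lock stays held the whole time.
           * If it were a raw_spinlock_t, every callback would run with
           * preemption disabled and could not take a spinlock_t.
           */
          __hrtimer_run_queues(cpu_base, now, flags, HRTIMER_ACTIVE_SOFT);
          raw_spin_unlock_irqrestore(&cpu_base->lock, flags);
          hrtimer_cpu_base_unlock_expiry(cpu_base);
  }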

> In my simple understanding, hrtimer_cpu_base_lock_expiry() doesn't
> disable IRQs. Do you mean that if the expiry lock were changed to a
> raw spinlock, it would also need to disable IRQs to prevent something
> from happening, right?

No, see above; that should clear things up.

> Also, does a raw spinlock have the PI-boost functionality?

No. Blocking on a raw_spinlock_t means spinning until it is the task's
turn to acquire the lock.
A spinlock_t becomes an rt_mutex on PREEMPT_RT, and blocking on it
means: lend the current task's priority to the lock owner (= priority
inheritance) and go to sleep until the lock becomes available.

Sebastian