
[bpf,1/2] bpf: Wait for busy refill_work when destroying bpf memory allocator

Message ID 20221019115539.983394-2-houtao@huaweicloud.com (mailing list archive)
State Changes Requested
Delegated to: BPF
Series Wait for busy refill_work when destroying bpf memory allocator

Checks

Context Check Description
netdev/tree_selection success Clearly marked for bpf
netdev/fixes_present success Fixes tag present in non-next series
netdev/subject_prefix success Link
netdev/cover_letter success Series has a cover letter
netdev/patch_count success Link
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 5 this patch: 5
netdev/cc_maintainers fail 1 blamed authors not CCed: memxor@gmail.com; 1 maintainers not CCed: memxor@gmail.com
netdev/build_clang success Errors and warnings before: 5 this patch: 5
netdev/module_param success Was 0 now: 0
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes success Fixes tag looks correct
netdev/build_allmodconfig_warn success Errors and warnings before: 5 this patch: 5
netdev/checkpatch success total: 0 errors, 0 warnings, 0 checks, 23 lines checked
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/source_inline success Was 0 now: 0
bpf/vmtest-bpf-PR fail PR summary
bpf/vmtest-bpf-VM_Test-4 success Logs for llvm-toolchain
bpf/vmtest-bpf-VM_Test-5 success Logs for set-matrix
bpf/vmtest-bpf-VM_Test-2 success Logs for build for x86_64 with gcc
bpf/vmtest-bpf-VM_Test-3 success Logs for build for x86_64 with llvm-16
bpf/vmtest-bpf-VM_Test-1 success Logs for build for s390x with gcc
bpf/vmtest-bpf-VM_Test-16 success Logs for test_verifier on x86_64 with gcc
bpf/vmtest-bpf-VM_Test-17 success Logs for test_verifier on x86_64 with llvm-16
bpf/vmtest-bpf-VM_Test-6 success Logs for test_maps on s390x with gcc
bpf/vmtest-bpf-VM_Test-7 success Logs for test_maps on x86_64 with gcc
bpf/vmtest-bpf-VM_Test-8 success Logs for test_maps on x86_64 with llvm-16
bpf/vmtest-bpf-VM_Test-10 fail Logs for test_progs on x86_64 with gcc
bpf/vmtest-bpf-VM_Test-11 fail Logs for test_progs on x86_64 with llvm-16
bpf/vmtest-bpf-VM_Test-12 success Logs for test_progs_no_alu32 on s390x with gcc
bpf/vmtest-bpf-VM_Test-13 fail Logs for test_progs_no_alu32 on x86_64 with gcc
bpf/vmtest-bpf-VM_Test-14 fail Logs for test_progs_no_alu32 on x86_64 with llvm-16
bpf/vmtest-bpf-VM_Test-9 success Logs for test_progs on s390x with gcc
bpf/vmtest-bpf-VM_Test-15 success Logs for test_verifier on s390x with gcc

Commit Message

Hou Tao Oct. 19, 2022, 11:55 a.m. UTC
From: Hou Tao <houtao1@huawei.com>

A busy irq work is an unfinished irq work: it is either still pending or
currently running. When destroying the bpf memory allocator, refill_work
may be busy on a PREEMPT_RT kernel, where irq work is invoked in a
per-CPU RT kthread. The same is possible on a kernel where
arch_irq_work_has_interrupt() is false (e.g. a 1-CPU arm32 host), where
irq work is invoked from the timer interrupt.
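
For reference, "busy" here corresponds to the IRQ_WORK_BUSY flag bit,
which stays set from the moment the work is claimed until its callback
has finished; the check (from include/linux/irq_work.h, as of roughly
v6.0) is:

static inline bool irq_work_is_busy(struct irq_work *work)
{
	return atomic_read(&work->node.a_flags) & IRQ_WORK_BUSY;
}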

A busy refill_work leads to several issues. The obvious one is that the
irq work and the memory-draining code will operate concurrently on
free_by_rcu and free_list. Another is that call_rcu_in_progress is not a
reliable indicator of a pending RCU callback, because do_call_rcu() may
not have been invoked by the irq work yet. A third is a use-after-free
if the irq work is freed before its callback has been invoked, as shown
below:

 BUG: kernel NULL pointer dereference, address: 0000000000000000
 #PF: supervisor instruction fetch in kernel mode
 #PF: error_code(0x0010) - not-present page
 PGD 12ab94067 P4D 12ab94067 PUD 1796b4067 PMD 0
 Oops: 0010 [#1] PREEMPT_RT SMP
 CPU: 5 PID: 64 Comm: irq_work/5 Not tainted 6.0.0-rt11+ #1
 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996)
 RIP: 0010:0x0
 Code: Unable to access opcode bytes at 0xffffffffffffffd6.
 RSP: 0018:ffffadc080293e78 EFLAGS: 00010286
 RAX: 0000000000000000 RBX: ffffcdc07fb6a388 RCX: ffffa05000a2e000
 RDX: ffffa05000a2e000 RSI: ffffffff96cc9827 RDI: ffffcdc07fb6a388
 ......
 Call Trace:
  <TASK>
  irq_work_single+0x24/0x60
  irq_work_run_list+0x24/0x30
  run_irq_workd+0x23/0x30
  smpboot_thread_fn+0x203/0x300
  kthread+0x126/0x150
  ret_from_fork+0x1f/0x30
  </TASK>

Considering the ease of concurrency handling, and that the wait time of
irq_work_sync() under PREEMPT_RT is short (when running two test_maps
instances on a PREEMPT_RT kernel on a 72-CPU host, the maximum wait time
is about 8ms and the 99th percentile is 10us), just wait for the busy
refill_work to complete before draining and freeing memory.

Fixes: 7c8199e24fa0 ("bpf: Introduce any context BPF specific memory allocator.")
Signed-off-by: Hou Tao <houtao1@huawei.com>
---
 kernel/bpf/memalloc.c | 11 +++++++++++
 1 file changed, 11 insertions(+)

Comments

Stanislav Fomichev Oct. 19, 2022, 6:38 p.m. UTC | #1
On 10/19, Hou Tao wrote:
> From: Hou Tao <houtao1@huawei.com>
SNIP
> diff --git a/kernel/bpf/memalloc.c b/kernel/bpf/memalloc.c
> index 94f0f63443a6..48e606aaacf0 100644
> --- a/kernel/bpf/memalloc.c
> +++ b/kernel/bpf/memalloc.c
> @@ -497,6 +497,16 @@ void bpf_mem_alloc_destroy(struct bpf_mem_alloc *ma)
>   		rcu_in_progress = 0;
>   		for_each_possible_cpu(cpu) {
>   			c = per_cpu_ptr(ma->cache, cpu);
> +			/*
> +			 * refill_work may be unfinished for PREEMPT_RT kernel
> +			 * in which irq work is invoked in a per-CPU RT thread.
> +			 * It is also possible for kernel with
> +			 * arch_irq_work_has_interrupt() being false and irq
> +			 * work is inovked in timer interrupt. So wait for the
> +			 * completion of irq work to ease the handling of
> +			 * concurrency.
> +			 */
> +			irq_work_sync(&c->refill_work);

Does it make sense to guard these with "IS_ENABLED(CONFIG_PREEMPT_RT)" ?
We do have a bunch of them sprinkled already to run alloc/free with
irqs disabled.

I was also trying to see if adding local_irq_save inside drain_mem_cache
to pair with the ones from refill might work, but waiting for irq to
finish seems easier...

Maybe also move both of these into some new "static void irq_work_wait" helper
to make it clear that the PREEMPT_RT comment applies to both of them?

Or maybe that helper should do 'for_each_possible_cpu(cpu) irq_work_sync(&c->refill_work);'
in the PREEMPT_RT case so we don't have to call it twice?
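
A minimal sketch of the kind of helper being suggested (hypothetical, not
part of the patch; the reply below explains why gating on PREEMPT_RT alone
would not be enough):

static void irq_work_wait(struct bpf_mem_cache *c)
{
	/* Hypothetical: only wait where refill_work can still be running
	 * in a per-CPU kthread. See the reply below for why the
	 * !arch_irq_work_has_interrupt() case also matters.
	 */
	if (IS_ENABLED(CONFIG_PREEMPT_RT))
		irq_work_sync(&c->refill_work);
}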

>   			drain_mem_cache(c);
>   			rcu_in_progress += atomic_read(&c->call_rcu_in_progress);
>   		}
> @@ -511,6 +521,7 @@ void bpf_mem_alloc_destroy(struct bpf_mem_alloc *ma)
>   			cc = per_cpu_ptr(ma->caches, cpu);
>   			for (i = 0; i < NUM_CACHES; i++) {
>   				c = &cc->cache[i];
> +				irq_work_sync(&c->refill_work);
>   				drain_mem_cache(c);
>   				rcu_in_progress += atomic_read(&c->call_rcu_in_progress);
>   			}
> --
> 2.29.2
Hou Tao Oct. 20, 2022, 1:07 a.m. UTC | #2
Hi,

On 10/20/2022 2:38 AM, sdf@google.com wrote:
> On 10/19, Hou Tao wrote:
>> From: Hou Tao <houtao1@huawei.com>
>
SNIP
>
>> diff --git a/kernel/bpf/memalloc.c b/kernel/bpf/memalloc.c
>> index 94f0f63443a6..48e606aaacf0 100644
>> --- a/kernel/bpf/memalloc.c
>> +++ b/kernel/bpf/memalloc.c
>> @@ -497,6 +497,16 @@ void bpf_mem_alloc_destroy(struct bpf_mem_alloc *ma)
>>           rcu_in_progress = 0;
>>           for_each_possible_cpu(cpu) {
>>               c = per_cpu_ptr(ma->cache, cpu);
>> +            /*
>> +             * refill_work may be unfinished for PREEMPT_RT kernel
>> +             * in which irq work is invoked in a per-CPU RT thread.
>> +             * It is also possible for kernel with
>> +             * arch_irq_work_has_interrupt() being false and irq
>> +             * work is inovked in timer interrupt. So wait for the
>> +             * completion of irq work to ease the handling of
>> +             * concurrency.
>> +             */
>> +            irq_work_sync(&c->refill_work);
>
> Does it make sense to guard these with "IS_ENABLED(CONFIG_PREEMPT_RT)" ?
> We do have a bunch of them sprinkled already to run alloc/free with
> irqs disabled.
No. As said in the commit message and the comments, irq_work_sync() is needed
both for PREEMPT_RT kernels and for kernels where arch_irq_work_has_interrupt()
is false. For all other kernels, irq_work_sync() incurs no overhead: it is just
a simple memory read through irq_work_is_busy() and nothing else, because on
those kernels the irq work must already have completed by the time
bpf_mem_alloc_destroy() is invoked.

void irq_work_sync(struct irq_work *work)
{
        /* RT and !arch_irq_work_has_interrupt() paths removed for brevity */
        while (irq_work_is_busy(work))
                cpu_relax();
}
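
For reference, the full function in kernel/irq_work.c (as of roughly v6.0)
reads as follows; on PREEMPT_RT, and on kernels without the arch irq-work
interrupt, it sleeps on an rcuwait instead of spinning:

void irq_work_sync(struct irq_work *work)
{
	lockdep_assert_irqs_enabled();
	might_sleep();

	if ((IS_ENABLED(CONFIG_PREEMPT_RT) && !irq_work_is_hard(work)) ||
	    !arch_irq_work_has_interrupt()) {
		rcuwait_wait_event(&work->irqwait, !irq_work_is_busy(work),
				   TASK_UNINTERRUPTIBLE);
		return;
	}

	while (irq_work_is_busy(work))
		cpu_relax();
}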

>
> I was also trying to see if adding local_irq_save inside drain_mem_cache
> to pair with the ones from refill might work, but waiting for irq to
> finish seems easier...
Disabling hard irqs would also work, but irq_work_sync() would still be needed
to ensure the irq work has completed before its memory is freed.
>
> Maybe also move both of these in some new "static void irq_work_wait"
> to make it clear that the PREEMT_RT comment applies to both of them?
>
> Or maybe that helper should do 'for_each_possible_cpu(cpu)
> irq_work_sync(&c->refill_work);'
> in the PREEMPT_RT case so we don't have to call it twice?
drain_mem_cache() is also time consuming sometimes, so I think it is better to
interleave irq_work_sync() and drain_mem_cache() to reduce the total waiting time.
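
A sketch of the trade-off (hypothetical shapes, not code from the patch):

	/* Two passes: all the irq_work_sync() waits happen back to back,
	 * with nothing useful done while waiting.
	 */
	for_each_possible_cpu(cpu)
		irq_work_sync(&per_cpu_ptr(ma->cache, cpu)->refill_work);
	for_each_possible_cpu(cpu)
		drain_mem_cache(per_cpu_ptr(ma->cache, cpu));

	/* Interleaved (as in the patch): while CPU N is being drained,
	 * the refill_work of later CPUs keeps making progress, so their
	 * irq_work_sync() calls typically return immediately.
	 */
	for_each_possible_cpu(cpu) {
		c = per_cpu_ptr(ma->cache, cpu);
		irq_work_sync(&c->refill_work);
		drain_mem_cache(c);
	}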

>
>>               drain_mem_cache(c);
>>               rcu_in_progress += atomic_read(&c->call_rcu_in_progress);
>>           }
>> @@ -511,6 +521,7 @@ void bpf_mem_alloc_destroy(struct bpf_mem_alloc *ma)
>>               cc = per_cpu_ptr(ma->caches, cpu);
>>               for (i = 0; i < NUM_CACHES; i++) {
>>                   c = &cc->cache[i];
>> +                irq_work_sync(&c->refill_work);
>>                   drain_mem_cache(c);
>>                   rcu_in_progress += atomic_read(&c->call_rcu_in_progress);
>>               }
>> -- 
>> 2.29.2
>
> .
Stanislav Fomichev Oct. 20, 2022, 5:49 p.m. UTC | #3
On Wed, Oct 19, 2022 at 6:08 PM Hou Tao <houtao@huaweicloud.com> wrote:
>
> Hi,
>
> On 10/20/2022 2:38 AM, sdf@google.com wrote:
> > On 10/19, Hou Tao wrote:
> >> From: Hou Tao <houtao1@huawei.com>
> >
SNIP
> >
> >> diff --git a/kernel/bpf/memalloc.c b/kernel/bpf/memalloc.c
> >> index 94f0f63443a6..48e606aaacf0 100644
> >> --- a/kernel/bpf/memalloc.c
> >> +++ b/kernel/bpf/memalloc.c
> >> @@ -497,6 +497,16 @@ void bpf_mem_alloc_destroy(struct bpf_mem_alloc *ma)
> >>           rcu_in_progress = 0;
> >>           for_each_possible_cpu(cpu) {
> >>               c = per_cpu_ptr(ma->cache, cpu);
> >> +            /*
> >> +             * refill_work may be unfinished for PREEMPT_RT kernel
> >> +             * in which irq work is invoked in a per-CPU RT thread.
> >> +             * It is also possible for kernel with
> >> +             * arch_irq_work_has_interrupt() being false and irq
> >> +             * work is inovked in timer interrupt. So wait for the
> >> +             * completion of irq work to ease the handling of
> >> +             * concurrency.
> >> +             */
> >> +            irq_work_sync(&c->refill_work);
> >
> > Does it make sense to guard these with "IS_ENABLED(CONFIG_PREEMPT_RT)" ?
> > We do have a bunch of them sprinkled already to run alloc/free with
> > irqs disabled.
> No. As said in the commit message and the comments, irq_work_sync() is needed
> both for PREEMPT_RT kernels and for kernels where arch_irq_work_has_interrupt()
> is false. For all other kernels, irq_work_sync() incurs no overhead: it is just
> a simple memory read through irq_work_is_busy() and nothing else, because on
> those kernels the irq work must already have completed by the time
> bpf_mem_alloc_destroy() is invoked.
>
> void irq_work_sync(struct irq_work *work)
> {
>         /* RT and !arch_irq_work_has_interrupt() paths removed for brevity */
>         while (irq_work_is_busy(work))
>                 cpu_relax();
> }

I see, thanks for clarifying! I was so carried away with that
PREEMPT_RT that I missed the fact that arch_irq_work_has_interrupt is
a separate thing. Agreed that doing irq_work_sync won't hurt in a
non-preempt/non-has_interrupt case.

In this case, can you still do a respin and fix the spelling issue in
the comment? You can slap my acked-by for the v2:

Acked-by: Stanislav Fomichev <sdf@google.com>

s/work is inovked in timer interrupt. So wait for the/... invoked .../

> >
> > I was also trying to see if adding local_irq_save inside drain_mem_cache
> > to pair with the ones from refill might work, but waiting for irq to
> > finish seems easier...
> Disabling hard irqs would also work, but irq_work_sync() would still be needed
> to ensure the irq work has completed before its memory is freed.
> >
> > Maybe also move both of these into some new "static void irq_work_wait" helper
> > to make it clear that the PREEMPT_RT comment applies to both of them?
> >
> > Or maybe that helper should do 'for_each_possible_cpu(cpu) irq_work_sync(&c->refill_work);'
> > in the PREEMPT_RT case so we don't have to call it twice?
> drain_mem_cache() is also time consuming sometimes, so I think it is better to
> interleave irq_work_sync() and drain_mem_cache() to reduce the total waiting time.
>
> >
> >>               drain_mem_cache(c);
> >>               rcu_in_progress += atomic_read(&c->call_rcu_in_progress);
> >>           }
> >> @@ -511,6 +521,7 @@ void bpf_mem_alloc_destroy(struct bpf_mem_alloc *ma)
> >>               cc = per_cpu_ptr(ma->caches, cpu);
> >>               for (i = 0; i < NUM_CACHES; i++) {
> >>                   c = &cc->cache[i];
> >> +                irq_work_sync(&c->refill_work);
> >>                   drain_mem_cache(c);
> >>                   rcu_in_progress += atomic_read(&c->call_rcu_in_progress);
> >>               }
> >> --
> >> 2.29.2
> >
> > .
>
Hou Tao Oct. 21, 2022, 1:06 a.m. UTC | #4
Hi,

On 10/21/2022 1:49 AM, Stanislav Fomichev wrote:
> On Wed, Oct 19, 2022 at 6:08 PM Hou Tao <houtao@huaweicloud.com> wrote:
>> Hi,
>>
>> On 10/20/2022 2:38 AM, sdf@google.com wrote:
>>> On 10/19, Hou Tao wrote:
>>>> From: Hou Tao <houtao1@huawei.com>
SNIP
>>>> diff --git a/kernel/bpf/memalloc.c b/kernel/bpf/memalloc.c
>>>> index 94f0f63443a6..48e606aaacf0 100644
>>>> --- a/kernel/bpf/memalloc.c
>>>> +++ b/kernel/bpf/memalloc.c
>>>> @@ -497,6 +497,16 @@ void bpf_mem_alloc_destroy(struct bpf_mem_alloc *ma)
>>>>           rcu_in_progress = 0;
>>>>           for_each_possible_cpu(cpu) {
>>>>               c = per_cpu_ptr(ma->cache, cpu);
>>>> +            /*
>>>> +             * refill_work may be unfinished for PREEMPT_RT kernel
>>>> +             * in which irq work is invoked in a per-CPU RT thread.
>>>> +             * It is also possible for kernel with
>>>> +             * arch_irq_work_has_interrupt() being false and irq
>>>> +             * work is inovked in timer interrupt. So wait for the
>>>> +             * completion of irq work to ease the handling of
>>>> +             * concurrency.
>>>> +             */
>>>> +            irq_work_sync(&c->refill_work);
>>> Does it make sense to guard these with "IS_ENABLED(CONFIG_PREEMPT_RT)" ?
>>> We do have a bunch of them sprinkled already to run alloc/free with
>>> irqs disabled.
>> No. As said in the commit message and the comments, irq_work_sync() is needed
>> both for PREEMPT_RT kernels and for kernels where arch_irq_work_has_interrupt()
>> is false. For all other kernels, irq_work_sync() incurs no overhead: it is just
>> a simple memory read through irq_work_is_busy() and nothing else, because on
>> those kernels the irq work must already have completed by the time
>> bpf_mem_alloc_destroy() is invoked.
>>
>> void irq_work_sync(struct irq_work *work)
>> {
>>         /* RT and !arch_irq_work_has_interrupt() paths removed for brevity */
>>         while (irq_work_is_busy(work))
>>                 cpu_relax();
>> }
> I see, thanks for clarifying! I was so carried away with that
> PREEMPT_RT that I missed the fact that arch_irq_work_has_interrupt is
> a separate thing. Agreed that doing irq_work_sync won't hurt in a
> non-preempt/non-has_interrupt case.
>
> In this case, can you still do a respin and fix the spelling issue in
> the comment? You can slap my acked-by for the v2:
>
> Acked-by: Stanislav Fomichev <sdf@google.com>
>
> s/work is inovked in timer interrupt. So wait for the/... invoked .../
Thanks. Will update the commit message and the comments in v2 to fix the typos
and add a note that there is no overhead on non-PREEMPT_RT kernels with
arch_irq_work_has_interrupt().
>
>>> I was also trying to see if adding local_irq_save inside drain_mem_cache
>>> to pair with the ones from refill might work, but waiting for irq to
>>> finish seems easier...
>> Disabling hard irqs would also work, but irq_work_sync() would still be needed
>> to ensure the irq work has completed before its memory is freed.
>>> Maybe also move both of these into some new "static void irq_work_wait" helper
>>> to make it clear that the PREEMPT_RT comment applies to both of them?
>>>
>>> Or maybe that helper should do 'for_each_possible_cpu(cpu) irq_work_sync(&c->refill_work);'
>>> in the PREEMPT_RT case so we don't have to call it twice?
>> drain_mem_cache() is also time consuming sometimes, so I think it is better to
>> interleave irq_work_sync() and drain_mem_cache() to reduce the total waiting time.
>>
>>>>               drain_mem_cache(c);
>>>>               rcu_in_progress += atomic_read(&c->call_rcu_in_progress);
>>>>           }
>>>> @@ -511,6 +521,7 @@ void bpf_mem_alloc_destroy(struct bpf_mem_alloc *ma)
>>>>               cc = per_cpu_ptr(ma->caches, cpu);
>>>>               for (i = 0; i < NUM_CACHES; i++) {
>>>>                   c = &cc->cache[i];
>>>> +                irq_work_sync(&c->refill_work);
>>>>                   drain_mem_cache(c);
>>>>                   rcu_in_progress += atomic_read(&c->call_rcu_in_progress);
>>>>               }
>>>> --
>>>> 2.29.2
>>> .
> .

Patch

diff --git a/kernel/bpf/memalloc.c b/kernel/bpf/memalloc.c
index 94f0f63443a6..48e606aaacf0 100644
--- a/kernel/bpf/memalloc.c
+++ b/kernel/bpf/memalloc.c
@@ -497,6 +497,16 @@  void bpf_mem_alloc_destroy(struct bpf_mem_alloc *ma)
 		rcu_in_progress = 0;
 		for_each_possible_cpu(cpu) {
 			c = per_cpu_ptr(ma->cache, cpu);
+			/*
+			 * refill_work may be unfinished for PREEMPT_RT kernel
+			 * in which irq work is invoked in a per-CPU RT thread.
+			 * It is also possible for kernel with
+			 * arch_irq_work_has_interrupt() being false and irq
+			 * work is inovked in timer interrupt. So wait for the
+			 * completion of irq work to ease the handling of
+			 * concurrency.
+			 */
+			irq_work_sync(&c->refill_work);
 			drain_mem_cache(c);
 			rcu_in_progress += atomic_read(&c->call_rcu_in_progress);
 		}
@@ -511,6 +521,7 @@  void bpf_mem_alloc_destroy(struct bpf_mem_alloc *ma)
 			cc = per_cpu_ptr(ma->caches, cpu);
 			for (i = 0; i < NUM_CACHES; i++) {
 				c = &cc->cache[i];
+				irq_work_sync(&c->refill_work);
 				drain_mem_cache(c);
 				rcu_in_progress += atomic_read(&c->call_rcu_in_progress);
 			}