
[v4,bpf-next,09/14] bpf: Allow reuse from waiting_for_gp_ttrace list.

Message ID 20230706033447.54696-10-alexei.starovoitov@gmail.com (mailing list archive)
State Accepted
Commit 04fabf00b4d3aff5d010ecb617001814e409e24a
Delegated to: BPF
Series bpf: Introduce bpf_mem_cache_free_rcu().

Checks

Context Check Description
netdev/series_format success Posting correctly formatted
netdev/tree_selection success Clearly marked for bpf-next
netdev/fixes_present success Fixes tag not required for -next series
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 13 this patch: 13
netdev/cc_maintainers warning 8 maintainers not CCed: yhs@fb.com kpsingh@kernel.org martin.lau@linux.dev john.fastabend@gmail.com song@kernel.org sdf@google.com jolsa@kernel.org haoluo@google.com
netdev/build_clang fail Errors and warnings before: 18 this patch: 18
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/deprecated_api success None detected
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes success No Fixes tag
netdev/build_allmodconfig_warn success Errors and warnings before: 13 this patch: 13
netdev/checkpatch success total: 0 errors, 0 warnings, 0 checks, 28 lines checked
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/source_inline success Was 0 now: 0
bpf/vmtest-bpf-next-PR success PR summary
bpf/vmtest-bpf-next-VM_Test-1 success Logs for ShellCheck
bpf/vmtest-bpf-next-VM_Test-6 success Logs for set-matrix
bpf/vmtest-bpf-next-VM_Test-2 success Logs for build for aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-4 success Logs for build for x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-5 success Logs for build for x86_64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-3 success Logs for build for s390x with gcc
bpf/vmtest-bpf-next-VM_Test-7 success Logs for test_maps on aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-9 success Logs for test_maps on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-10 success Logs for test_maps on x86_64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-19 success Logs for test_progs_no_alu32_parallel on aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-20 success Logs for test_progs_no_alu32_parallel on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-21 success Logs for test_progs_no_alu32_parallel on x86_64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-22 success Logs for test_progs_parallel on aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-23 success Logs for test_progs_parallel on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-24 success Logs for test_progs_parallel on x86_64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-25 success Logs for test_verifier on aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-26 success Logs for test_verifier on s390x with gcc
bpf/vmtest-bpf-next-VM_Test-27 success Logs for test_verifier on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-28 success Logs for test_verifier on x86_64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-29 success Logs for veristat
bpf/vmtest-bpf-next-VM_Test-11 success Logs for test_progs on aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-13 success Logs for test_progs on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-14 success Logs for test_progs on x86_64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-15 success Logs for test_progs_no_alu32 on aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-17 success Logs for test_progs_no_alu32 on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-18 success Logs for test_progs_no_alu32 on x86_64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-16 success Logs for test_progs_no_alu32 on s390x with gcc
bpf/vmtest-bpf-next-VM_Test-12 success Logs for test_progs on s390x with gcc
bpf/vmtest-bpf-next-VM_Test-8 success Logs for test_maps on s390x with gcc

Commit Message

Alexei Starovoitov July 6, 2023, 3:34 a.m. UTC
From: Alexei Starovoitov <ast@kernel.org>

alloc_bulk() can reuse elements from free_by_rcu_ttrace.
Let it reuse from waiting_for_gp_ttrace as well to avoid unnecessary kmalloc().

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
 kernel/bpf/memalloc.c | 16 ++++++++++------
 1 file changed, 10 insertions(+), 6 deletions(-)
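
With this change alloc_bulk() refills its per-CPU free list from three sources in order: elements parked on free_by_rcu_ttrace, elements still sitting on waiting_for_gp_ttrace, and only then the slab allocator. A simplified sketch of that order (memcg setup, the locking inside add_obj_to_free_list() and other details follow the existing kernel/bpf/memalloc.c conventions and are elided here):

static void alloc_bulk(struct bpf_mem_cache *c, int cnt, int node)
{
	void *obj;
	int i;

	/* 1. Reuse elements parked on free_by_rcu_ttrace, i.e. elements
	 *    not yet handed to call_rcu_tasks_trace().
	 */
	for (i = 0; i < cnt; i++) {
		obj = llist_del_first(&c->free_by_rcu_ttrace);
		if (!obj)
			break;
		add_obj_to_free_list(c, obj);
	}
	if (i >= cnt)
		return;

	/* 2. New in this patch: also reuse elements whose tasks-trace RCU
	 *    grace period is still pending on waiting_for_gp_ttrace.
	 */
	for (; i < cnt; i++) {
		obj = llist_del_first(&c->waiting_for_gp_ttrace);
		if (!obj)
			break;
		add_obj_to_free_list(c, obj);
	}
	if (i >= cnt)
		return;

	/* 3. Only then allocate fresh objects from slab. */
	for (; i < cnt; i++) {
		obj = __alloc(c, node, GFP_NOWAIT | __GFP_NOWARN | __GFP_ACCOUNT);
		if (!obj)
			break;
		add_obj_to_free_list(c, obj);
	}
}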

Comments

Hou Tao July 7, 2023, 2:07 a.m. UTC | #1
Hi,

On 7/6/2023 11:34 AM, Alexei Starovoitov wrote:
> From: Alexei Starovoitov <ast@kernel.org>
>
> alloc_bulk() can reuse elements from free_by_rcu_ttrace.
> Let it reuse from waiting_for_gp_ttrace as well to avoid unnecessary kmalloc().
>
> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
> ---
>  kernel/bpf/memalloc.c | 16 ++++++++++------
>  1 file changed, 10 insertions(+), 6 deletions(-)
>
> diff --git a/kernel/bpf/memalloc.c b/kernel/bpf/memalloc.c
> index 9986c6b7df4d..e5a87f6cf2cc 100644
> --- a/kernel/bpf/memalloc.c
> +++ b/kernel/bpf/memalloc.c
> @@ -212,6 +212,15 @@ static void alloc_bulk(struct bpf_mem_cache *c, int cnt, int node)
>  	if (i >= cnt)
>  		return;
>  
> +	for (; i < cnt; i++) {
> +		obj = llist_del_first(&c->waiting_for_gp_ttrace);
> +		if (!obj)
> +			break;
> +		add_obj_to_free_list(c, obj);
> +	}
> +	if (i >= cnt)
> +		return;

I still think using llist_del_first() here is not safe as reported in
[1]. Not sure whether or not invoking enque_to_free() firstly for
free_llist_extra will close the race completely. Will check later.

[1]:
https://lore.kernel.org/rcu/957dd5cd-0855-1197-7045-4cb1590bd753@huaweicloud.com/
> +
>  	memcg = get_memcg(c);
>  	old_memcg = set_active_memcg(memcg);
>  	for (; i < cnt; i++) {
> @@ -295,12 +304,7 @@ static void do_call_rcu_ttrace(struct bpf_mem_cache *c)
>  
>  	WARN_ON_ONCE(!llist_empty(&c->waiting_for_gp_ttrace));
>  	llist_for_each_safe(llnode, t, llist_del_all(&c->free_by_rcu_ttrace))
> -		/* There is no concurrent __llist_add(waiting_for_gp_ttrace) access.
> -		 * It doesn't race with llist_del_all either.
> -		 * But there could be two concurrent llist_del_all(waiting_for_gp_ttrace):
> -		 * from __free_rcu() and from drain_mem_cache().
> -		 */
> -		__llist_add(llnode, &c->waiting_for_gp_ttrace);
> +		llist_add(llnode, &c->waiting_for_gp_ttrace);
>  
>  	if (unlikely(READ_ONCE(c->draining))) {
>  		__free_rcu(&c->rcu_ttrace);
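
The unsafety Hou Tao refers to is the ABA window in the lockless pop. A simplified rendition of llist_del_first() (the real implementation lives in lib/llist.c) shows where that window opens; this is also why the llist API expects external serialization when llist_del_first() can run concurrently with llist_del_all() plus re-insertion of the same nodes:

struct llist_node *llist_del_first(struct llist_head *head)
{
	struct llist_node *entry, *next;

	entry = smp_load_acquire(&head->first);
	do {
		if (!entry)
			return NULL;
		/* In the race, "entry" may already have been taken by
		 * llist_del_all(), freed and reused, so this read can see
		 * stale memory.
		 */
		next = READ_ONCE(entry->next);
		/* try_cmpxchg() compares only pointer values: it can succeed
		 * even though the node was removed and re-added in the
		 * meantime, installing the stale "next" read above (ABA).
		 * On failure it reloads the current head into "entry".
		 */
	} while (!try_cmpxchg(&head->first, &entry, next));

	return entry;
}
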
Alexei Starovoitov July 7, 2023, 2:12 a.m. UTC | #2
On Thu, Jul 6, 2023 at 7:07 PM Hou Tao <houtao@huaweicloud.com> wrote:
>
> Hi,
>
> On 7/6/2023 11:34 AM, Alexei Starovoitov wrote:
> > From: Alexei Starovoitov <ast@kernel.org>
> >
> > alloc_bulk() can reuse elements from free_by_rcu_ttrace.
> > Let it reuse from waiting_for_gp_ttrace as well to avoid unnecessary kmalloc().
> >
> > Signed-off-by: Alexei Starovoitov <ast@kernel.org>
> > ---
> >  kernel/bpf/memalloc.c | 16 ++++++++++------
> >  1 file changed, 10 insertions(+), 6 deletions(-)
> >
> > diff --git a/kernel/bpf/memalloc.c b/kernel/bpf/memalloc.c
> > index 9986c6b7df4d..e5a87f6cf2cc 100644
> > --- a/kernel/bpf/memalloc.c
> > +++ b/kernel/bpf/memalloc.c
> > @@ -212,6 +212,15 @@ static void alloc_bulk(struct bpf_mem_cache *c, int cnt, int node)
> >       if (i >= cnt)
> >               return;
> >
> > +     for (; i < cnt; i++) {
> > +             obj = llist_del_first(&c->waiting_for_gp_ttrace);
> > +             if (!obj)
> > +                     break;
> > +             add_obj_to_free_list(c, obj);
> > +     }
> > +     if (i >= cnt)
> > +             return;
>
> I still think using llist_del_first() here is not safe as reported in
> [1]. Not sure whether or not invoking enque_to_free() firstly for
> free_llist_extra will close the race completely. Will check later.

lol. see my reply a second ago in the other thread.

and it's not just waiting_for_gp_ttrace. free_by_rcu_ttrace is similar.
Hou Tao July 7, 2023, 3:38 a.m. UTC | #3
Hi,

On 7/7/2023 10:12 AM, Alexei Starovoitov wrote:
> On Thu, Jul 6, 2023 at 7:07 PM Hou Tao <houtao@huaweicloud.com> wrote:
>> Hi,
>>
>> On 7/6/2023 11:34 AM, Alexei Starovoitov wrote:
>>> From: Alexei Starovoitov <ast@kernel.org>
>>>
>>> alloc_bulk() can reuse elements from free_by_rcu_ttrace.
>>> Let it reuse from waiting_for_gp_ttrace as well to avoid unnecessary kmalloc().
>>>
>>> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
>>> ---
>>>  kernel/bpf/memalloc.c | 16 ++++++++++------
>>>  1 file changed, 10 insertions(+), 6 deletions(-)
>>>
>>> diff --git a/kernel/bpf/memalloc.c b/kernel/bpf/memalloc.c
>>> index 9986c6b7df4d..e5a87f6cf2cc 100644
>>> --- a/kernel/bpf/memalloc.c
>>> +++ b/kernel/bpf/memalloc.c
>>> @@ -212,6 +212,15 @@ static void alloc_bulk(struct bpf_mem_cache *c, int cnt, int node)
>>>       if (i >= cnt)
>>>               return;
>>>
>>> +     for (; i < cnt; i++) {
>>> +             obj = llist_del_first(&c->waiting_for_gp_ttrace);
>>> +             if (!obj)
>>> +                     break;
>>> +             add_obj_to_free_list(c, obj);
>>> +     }
>>> +     if (i >= cnt)
>>> +             return;
>> I still think using llist_del_first() here is not safe as reported in
>> [1]. Not sure whether or not invoking enque_to_free() firstly for
>> free_llist_extra will close the race completely. Will check later.
> lol. see my reply a second ago in the other thread.
>
> and it's not just waiting_for_gp_ttrace. free_by_rcu_ttrace is similar.

I think free_by_rcu_ttrace is different, because the reuse is only
possible after one tasks trace RCU grace period as shown below, and the
concurrent llist_del_first() must have been completed when the head is
reused and re-added into free_by_rcu_ttrace again.

// c0->free_by_rcu_ttrace
A -> B -> C -> nil
   
P1:
alloc_bulk()
    llist_del_first(&c->free_by_rcu_ttrace)
        entry = A
        next = B

P2:
do_call_rcu_ttrace()
    // c->free_by_rcu_ttrace->first = NULL
    llist_del_all(&c->free_by_rcu_ttrace)
        move to c->waiting_for_gp_ttrace

P1:
llist_del_first()
    return NULL

// A is only reusable after one task trace RCU grace
// llist_del_first() must have been completed
__free_rcu_tasks_trace
    free_all(llist_del_all(&c->waiting_for_gp_ttrace))
Alexei Starovoitov July 7, 2023, 4:16 a.m. UTC | #4
On Thu, Jul 6, 2023 at 8:39 PM Hou Tao <houtao@huaweicloud.com> wrote:
>
> Hi,
>
> On 7/7/2023 10:12 AM, Alexei Starovoitov wrote:
> > On Thu, Jul 6, 2023 at 7:07 PM Hou Tao <houtao@huaweicloud.com> wrote:
> >> Hi,
> >>
> >> On 7/6/2023 11:34 AM, Alexei Starovoitov wrote:
> >>> From: Alexei Starovoitov <ast@kernel.org>
> >>>
> >>> alloc_bulk() can reuse elements from free_by_rcu_ttrace.
> >>> Let it reuse from waiting_for_gp_ttrace as well to avoid unnecessary kmalloc().
> >>>
> >>> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
> >>> ---
> >>>  kernel/bpf/memalloc.c | 16 ++++++++++------
> >>>  1 file changed, 10 insertions(+), 6 deletions(-)
> >>>
> >>> diff --git a/kernel/bpf/memalloc.c b/kernel/bpf/memalloc.c
> >>> index 9986c6b7df4d..e5a87f6cf2cc 100644
> >>> --- a/kernel/bpf/memalloc.c
> >>> +++ b/kernel/bpf/memalloc.c
> >>> @@ -212,6 +212,15 @@ static void alloc_bulk(struct bpf_mem_cache *c, int cnt, int node)
> >>>       if (i >= cnt)
> >>>               return;
> >>>
> >>> +     for (; i < cnt; i++) {
> >>> +             obj = llist_del_first(&c->waiting_for_gp_ttrace);
> >>> +             if (!obj)
> >>> +                     break;
> >>> +             add_obj_to_free_list(c, obj);
> >>> +     }
> >>> +     if (i >= cnt)
> >>> +             return;
> >> I still think using llist_del_first() here is not safe as reported in
> >> [1]. Not sure whether or not invoking enque_to_free() firstly for
> >> free_llist_extra will close the race completely. Will check later.
> > lol. see my reply a second ago in the other thread.
> >
> > and it's not just waiting_for_gp_ttrace. free_by_rcu_ttrace is similar.
>
> I think free_by_rcu_ttrace is different, because the reuse is only
> possible after one tasks trace RCU grace period as shown below, and the
> concurrent llist_del_first() must have been completed when the head is
> reused and re-added into free_by_rcu_ttrace again.
>
> // c0->free_by_rcu_ttrace
> A -> B -> C -> nil
>
> P1:
> alloc_bulk()
>     llist_del_first(&c->free_by_rcu_ttrace)
>         entry = A
>         next = B
>
> P2:
> do_call_rcu_ttrace()
>     // c->free_by_rcu_ttrace->first = NULL
>     llist_del_all(&c->free_by_rcu_ttrace)
>         move to c->waiting_for_gp_ttrace
>
> P1:
> llist_del_first()
>     return NULL
>
> // A is only reusable after one task trace RCU grace
> // llist_del_first() must have been completed

"must have been completed" ?

I guess you're assuming that alloc_bulk() from irq_work
is running within rcu_tasks_trace critical section,
so __free_rcu_tasks_trace() callback will execute after
irq work completed?
I don't think that's the case.
In vCPU P1 is stopped for looong time by host,
P2 can execute __free_rcu_tasks_trace (or P3, since
tasks trace callbacks execute in a kthread that is not bound
to any cpu).
__free_rcu_tasks_trace() will free it into slab.
Then kmalloc the same obj and eventually put it back into
free_by_rcu_ttrace.

Since you believe that waiting_for_gp_ttrace ABA is possible
here it's the same probability. imo both lower than a bit flip due
to cosmic rays which is actually observable in practice.

> __free_rcu_tasks_trace
>     free_all(llist_del_all(&c->waiting_for_gp_ttrace))
>
>
Hou Tao July 7, 2023, 4:37 a.m. UTC | #5
Hi,

On 7/7/2023 12:16 PM, Alexei Starovoitov wrote:
> On Thu, Jul 6, 2023 at 8:39 PM Hou Tao <houtao@huaweicloud.com> wrote:
>> Hi,
>>
>> On 7/7/2023 10:12 AM, Alexei Starovoitov wrote:
>>> On Thu, Jul 6, 2023 at 7:07 PM Hou Tao <houtao@huaweicloud.com> wrote:
>>>> Hi,
>>>>
>>>> On 7/6/2023 11:34 AM, Alexei Starovoitov wrote:
>>>>
SNIP
>>> and it's not just waiting_for_gp_ttrace. free_by_rcu_ttrace is similar.
>> I think free_by_rcu_ttrace is different, because the reuse is only
>> possible after one tasks trace RCU grace period as shown below, and the
>> concurrent llist_del_first() must have been completed when the head is
>> reused and re-added into free_by_rcu_ttrace again.
>>
>> // c0->free_by_rcu_ttrace
>> A -> B -> C -> nil
>>
>> P1:
>> alloc_bulk()
>>     llist_del_first(&c->free_by_rcu_ttrace)
>>         entry = A
>>         next = B
>>
>> P2:
>> do_call_rcu_ttrace()
>>     // c->free_by_rcu_ttrace->first = NULL
>>     llist_del_all(&c->free_by_rcu_ttrace)
>>         move to c->waiting_for_gp_ttrace
>>
>> P1:
>> llist_del_first()
>>     return NULL
>>
>> // A is only reusable after one task trace RCU grace
>> // llist_del_first() must have been completed
> "must have been completed" ?
>
> I guess you're assuming that alloc_bulk() from irq_work
> is running within rcu_tasks_trace critical section,
> so __free_rcu_tasks_trace() callback will execute after
> irq work completed?
> I don't think that's the case.

Yes. The following is my original thoughts. Correct me if I was wrong:

1. llist_del_first() must be running concurrently with llist_del_all().
If llist_del_first() runs after llist_del_all(), it will return NULL
directly.
2. call_rcu_tasks_trace() must happen after llist_del_all(), else the
elements in free_by_rcu_ttrace will not be freed back to slab.
3. call_rcu_tasks_trace() will wait for one tasks trace RCU grace period
to call __free_rcu_tasks_trace()
4. llist_del_first() is running in an irq-disabled context, so the
tasks trace RCU grace period will wait for the end of llist_del_first()

It seems you thought step 4) is not true, right ?
> In vCPU P1 is stopped for looong time by host,
> P2 can execute __free_rcu_tasks_trace (or P3, since
> tasks trace callbacks execute in a kthread that is not bound
> to any cpu).
> __free_rcu_tasks_trace() will free it into slab.
> Then kmalloc the same obj and eventually put it back into
> free_by_rcu_ttrace.
>
> Since you believe that waiting_for_gp_ttrace ABA is possible
> here it's the same probability. imo both lower than a bit flip due
> to cosmic rays which is actually observable in practice.
>
>> __free_rcu_tasks_trace
>>     free_all(llist_del_all(&c->waiting_for_gp_ttrace))
>>
>>
> .
Alexei Starovoitov July 7, 2023, 4:11 p.m. UTC | #6
On Thu, Jul 6, 2023 at 9:37 PM Hou Tao <houtao@huaweicloud.com> wrote:
>
> Hi,
>
> On 7/7/2023 12:16 PM, Alexei Starovoitov wrote:
> > On Thu, Jul 6, 2023 at 8:39 PM Hou Tao <houtao@huaweicloud.com> wrote:
> >> Hi,
> >>
> >> On 7/7/2023 10:12 AM, Alexei Starovoitov wrote:
> >>> On Thu, Jul 6, 2023 at 7:07 PM Hou Tao <houtao@huaweicloud.com> wrote:
> >>>> Hi,
> >>>>
> >>>> On 7/6/2023 11:34 AM, Alexei Starovoitov wrote:
> >>>>
> SNIP
> >>> and it's not just waiting_for_gp_ttrace. free_by_rcu_ttrace is similar.
> >> I think free_by_rcu_ttrace is different, because the reuse is only
> >> possible after one tasks trace RCU grace period as shown below, and the
> >> concurrent llist_del_first() must have been completed when the head is
> >> reused and re-added into free_by_rcu_ttrace again.
> >>
> >> // c0->free_by_rcu_ttrace
> >> A -> B -> C -> nil
> >>
> >> P1:
> >> alloc_bulk()
> >>     llist_del_first(&c->free_by_rcu_ttrace)
> >>         entry = A
> >>         next = B
> >>
> >> P2:
> >> do_call_rcu_ttrace()
> >>     // c->free_by_rcu_ttrace->first = NULL
> >>     llist_del_all(&c->free_by_rcu_ttrace)
> >>         move to c->waiting_for_gp_ttrace
> >>
> >> P1:
> >> llist_del_first()
> >>     return NULL
> >>
> >> // A is only reusable after one task trace RCU grace
> >> // llist_del_first() must have been completed
> > "must have been completed" ?
> >
> > I guess you're assuming that alloc_bulk() from irq_work
> > is running within rcu_tasks_trace critical section,
> > so __free_rcu_tasks_trace() callback will execute after
> > irq work completed?
> > I don't think that's the case.
>
> Yes. The following is my original thoughts. Correct me if I was wrong:
>
> 1. llist_del_first() must be running concurrently with llist_del_all().
> If llist_del_first() runs after llist_del_all(), it will return NULL
> directly.
> 2. call_rcu_tasks_trace() must happen after llist_del_all(), else the
> elements in free_by_rcu_ttrace will not be freed back to slab.
> 3. call_rcu_tasks_trace() will wait for one tasks trace RCU grace period
> to call __free_rcu_tasks_trace()
> 4. llist_del_first() is running in an irq-disabled context, so the
> tasks trace RCU grace period will wait for the end of llist_del_first()
>
> It seems you thought step 4) is not true, right ?

Yes. I think so. For two reasons:

1.
I believe irq disabled region isn't considered equivalent
to rcu_read_lock_trace() region.

Paul,
could you clarify ?

2.
Even if 1 is incorrect, in RT llist_del_first() from alloc_bulk()
runs "in a per-CPU thread in preemptible context."
See irq_work_run_list.
Paul E. McKenney July 7, 2023, 5:47 p.m. UTC | #7
On Fri, Jul 07, 2023 at 09:11:22AM -0700, Alexei Starovoitov wrote:
> On Thu, Jul 6, 2023 at 9:37 PM Hou Tao <houtao@huaweicloud.com> wrote:
> >
> > Hi,
> >
> > On 7/7/2023 12:16 PM, Alexei Starovoitov wrote:
> > > On Thu, Jul 6, 2023 at 8:39 PM Hou Tao <houtao@huaweicloud.com> wrote:
> > >> Hi,
> > >>
> > >> On 7/7/2023 10:12 AM, Alexei Starovoitov wrote:
> > >>> On Thu, Jul 6, 2023 at 7:07 PM Hou Tao <houtao@huaweicloud.com> wrote:
> > >>>> Hi,
> > >>>>
> > >>>> On 7/6/2023 11:34 AM, Alexei Starovoitov wrote:
> > >>>>
> > SNIP
> > >>> and it's not just waiting_for_gp_ttrace. free_by_rcu_ttrace is similar.
> > >> I think free_by_rcu_ttrace is different, because the reuse is only
> > >> possible after one tasks trace RCU grace period as shown below, and the
> > >> concurrent llist_del_first() must have been completed when the head is
> > >> reused and re-added into free_by_rcu_ttrace again.
> > >>
> > >> // c0->free_by_rcu_ttrace
> > >> A -> B -> C -> nil
> > >>
> > >> P1:
> > >> alloc_bulk()
> > >>     llist_del_first(&c->free_by_rcu_ttrace)
> > >>         entry = A
> > >>         next = B
> > >>
> > >> P2:
> > >> do_call_rcu_ttrace()
> > >>     // c->free_by_rcu_ttrace->first = NULL
> > >>     llist_del_all(&c->free_by_rcu_ttrace)
> > >>         move to c->waiting_for_gp_ttrace
> > >>
> > >> P1:
> > >> llist_del_first()
> > >>     return NULL
> > >>
> > >> // A is only reusable after one task trace RCU grace
> > >> // llist_del_first() must have been completed
> > > "must have been completed" ?
> > >
> > > I guess you're assuming that alloc_bulk() from irq_work
> > > is running within rcu_tasks_trace critical section,
> > > so __free_rcu_tasks_trace() callback will execute after
> > > irq work completed?
> > > I don't think that's the case.
> >
> > Yes. The following is my original thoughts. Correct me if I was wrong:
> >
> > 1. llist_del_first() must be running concurrently with llist_del_all().
> > If llist_del_first() runs after llist_del_all(), it will return NULL
> > directly.
> > 2. call_rcu_tasks_trace() must happen after llist_del_all(), else the
> > elements in free_by_rcu_ttrace will not be freed back to slab.
> > 3. call_rcu_tasks_trace() will wait for one tasks trace RCU grace period
> > to call __free_rcu_tasks_trace()
> > 4. llist_del_first() is running in an irq-disabled context, so the
> > tasks trace RCU grace period will wait for the end of llist_del_first()
> >
> > It seems you thought step 4) is not true, right ?
> 
> Yes. I think so. For two reasons:
> 
> 1.
> I believe irq disabled region isn't considered equivalent
> to rcu_read_lock_trace() region.
> 
> Paul,
> could you clarify ?

You are correct, Alexei.  Unlike vanilla RCU, RCU Tasks Trace does not
count irq-disabled regions of code as readers.

But why not just put an rcu_read_lock_trace() and a matching
rcu_read_unlock_trace() within that irq-disabled region of code?

For completeness, if it were not for CONFIG_TASKS_TRACE_RCU_READ_MB,
Hou Tao would be correct from a strict current-implementation
viewpoint.  The reason is that, given the current implementation in
CONFIG_TASKS_TRACE_RCU_READ_MB=n kernels, a task must either block or
take an IPI in order for the grace-period machinery to realize that this
task is done with all prior readers.

However, we need to account for the possibility of IPI-free
implementations, for example, if the real-time guys decide to start
making heavy use of BPF sleepable programs.  They would then insist on
getting rid of those IPIs for CONFIG_PREEMPT_RT=y kernels.  At which
point, irq-disabled regions of code will absolutely not act as
RCU tasks trace readers.

Again, why not just put an rcu_read_lock_trace() and a matching
rcu_read_unlock_trace() within that irq-disabled region of code?

Or maybe there is a better workaround.

> 2.
> Even if 1 is incorrect, in RT llist_del_first() from alloc_bulk()
> runs "in a per-CPU thread in preemptible context."
> See irq_work_run_list.

Agreed, under RT, "interrupt handlers" often run in task context.

						Thanx, Paul
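
Concretely, the rcu_read_lock_trace()/rcu_read_unlock_trace() bracketing Paul suggests would look something like the sketch below. This is a hypothetical fragment for illustration only, not a patch from this series, and the helper name is made up:

static int reuse_from_free_by_rcu_ttrace(struct bpf_mem_cache *c, int i, int cnt)
{
	void *obj;

	/* Explicit RCU Tasks Trace reader (declared in <linux/rcupdate_trace.h>):
	 * a tasks-trace grace period, and therefore __free_rcu_tasks_trace(),
	 * cannot complete while llist_del_first() is still dereferencing an
	 * element taken from this list.
	 */
	rcu_read_lock_trace();
	for (; i < cnt; i++) {
		obj = llist_del_first(&c->free_by_rcu_ttrace);
		if (!obj)
			break;
		add_obj_to_free_list(c, obj);
	}
	rcu_read_unlock_trace();
	return i;
}
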
Joel Fernandes July 7, 2023, 10:22 p.m. UTC | #8
On Fri, Jul 7, 2023 at 1:47 PM Paul E. McKenney <paulmck@kernel.org> wrote:
>
> On Fri, Jul 07, 2023 at 09:11:22AM -0700, Alexei Starovoitov wrote:
> > On Thu, Jul 6, 2023 at 9:37 PM Hou Tao <houtao@huaweicloud.com> wrote:
> > >
> > > Hi,
> > >
> > > On 7/7/2023 12:16 PM, Alexei Starovoitov wrote:
> > > > On Thu, Jul 6, 2023 at 8:39 PM Hou Tao <houtao@huaweicloud.com> wrote:
> > > >> Hi,
> > > >>
> > > >> On 7/7/2023 10:12 AM, Alexei Starovoitov wrote:
> > > >>> On Thu, Jul 6, 2023 at 7:07 PM Hou Tao <houtao@huaweicloud.com> wrote:
> > > >>>> Hi,
> > > >>>>
> > > >>>> On 7/6/2023 11:34 AM, Alexei Starovoitov wrote:
> > > >>>>
> > > SNIP
> > > >>> and it's not just waiting_for_gp_ttrace. free_by_rcu_ttrace is similar.
> > > >> I think free_by_rcu_ttrace is different, because the reuse is only
> > > >> possible after one tasks trace RCU grace period as shown below, and the
> > > >> concurrent llist_del_first() must have been completed when the head is
> > > >> reused and re-added into free_by_rcu_ttrace again.
> > > >>
> > > >> // c0->free_by_rcu_ttrace
> > > >> A -> B -> C -> nil
> > > >>
> > > >> P1:
> > > >> alloc_bulk()
> > > >>     llist_del_first(&c->free_by_rcu_ttrace)
> > > >>         entry = A
> > > >>         next = B
> > > >>
> > > >> P2:
> > > >> do_call_rcu_ttrace()
> > > >>     // c->free_by_rcu_ttrace->first = NULL
> > > >>     llist_del_all(&c->free_by_rcu_ttrace)
> > > >>         move to c->waiting_for_gp_ttrace
> > > >>
> > > >> P1:
> > > >> llist_del_first()
> > > >>     return NULL
> > > >>
> > > >> // A is only reusable after one task trace RCU grace
> > > >> // llist_del_first() must have been completed
> > > > "must have been completed" ?
> > > >
> > > > I guess you're assuming that alloc_bulk() from irq_work
> > > > is running within rcu_tasks_trace critical section,
> > > > so __free_rcu_tasks_trace() callback will execute after
> > > > irq work completed?
> > > > I don't think that's the case.
> > >
> > > Yes. The following is my original thoughts. Correct me if I was wrong:
> > >
> > > 1. llist_del_first() must be running concurrently with llist_del_all().
> > > If llist_del_first() runs after llist_del_all(), it will return NULL
> > > directly.
> > > 2. call_rcu_tasks_trace() must happen after llist_del_all(), else the
> > > elements in free_by_rcu_ttrace will not be freed back to slab.
> > > 3. call_rcu_tasks_trace() will wait for one tasks trace RCU grace period
> > > to call __free_rcu_tasks_trace()
> > > 4. llist_del_first() is running in an irq-disabled context, so the
> > > tasks trace RCU grace period will wait for the end of llist_del_first()
> > >
> > > It seems you thought step 4) is not true, right ?
> >
> > Yes. I think so. For two reasons:
> >
> > 1.
> > I believe irq disabled region isn't considered equivalent
> > to rcu_read_lock_trace() region.
> >
> > Paul,
> > could you clarify ?
>
> You are correct, Alexei.  Unlike vanilla RCU, RCU Tasks Trace does not
> count irq-disabled regions of code as readers.
>
> But why not just put an rcu_read_lock_trace() and a matching
> rcu_read_unlock_trace() within that irq-disabled region of code?
>
> For completeness, if it were not for CONFIG_TASKS_TRACE_RCU_READ_MB,
> Hou Tao would be correct from a strict current-implementation
> viewpoint.  The reason is that, given the current implementation in
> CONFIG_TASKS_TRACE_RCU_READ_MB=n kernels, a task must either block or
> take an IPI in order for the grace-period machinery to realize that this
> task is done with all prior readers.
>
> However, we need to account for the possibility of IPI-free
> implementations, for example, if the real-time guys decide to start
> making heavy use of BPF sleepable programs.  They would then insist on
> getting rid of those IPIs for CONFIG_PREEMPT_RT=y kernels.  At which
> point, irq-disabled regions of code will absolutely not act as
> RCU tasks trace readers.
>
> Again, why not just put an rcu_read_lock_trace() and a matching
> rcu_read_unlock_trace() within that irq-disabled region of code?

If I remember correctly, the general guidance is to always put an
explicit marker if it is in an RCU-reader, instead of relying on
implementation details. So the suggestion to put the marker instead of
relying on IRQ disabling does align with that.

Thanks.
Hou Tao July 8, 2023, 7:03 a.m. UTC | #9
Hi,

On 7/8/2023 1:47 AM, Paul E. McKenney wrote:
> On Fri, Jul 07, 2023 at 09:11:22AM -0700, Alexei Starovoitov wrote:
>> On Thu, Jul 6, 2023 at 9:37 PM Hou Tao <houtao@huaweicloud.com> wrote:
SNIP
>>> I guess you're assuming that alloc_bulk() from irq_work
>>> is running within rcu_tasks_trace critical section,
>>> so __free_rcu_tasks_trace() callback will execute after
>>> irq work completed?
>>> I don't think that's the case.
>>> Yes. The following is my original thoughts. Correct me if I was wrong:
>>>
>>> 1. llist_del_first() must be running concurrently with llist_del_all().
>>> If llist_del_first() runs after llist_del_all(), it will return NULL
>>> directly.
>>> 2. call_rcu_tasks_trace() must happen after llist_del_all(), else the
>>> elements in free_by_rcu_ttrace will not be freed back to slab.
>>> 3. call_rcu_tasks_trace() will wait for one tasks trace RCU grace period
>>> to call __free_rcu_tasks_trace()
>>> 4. llist_del_first() is running in an irq-disabled context, so the
>>> tasks trace RCU grace period will wait for the end of llist_del_first()
>>>
>>> It seems you thought step 4) is not true, right ?
>> Yes. I think so. For two reasons:
>>
>> 1.
>> I believe irq disabled region isn't considered equivalent
>> to rcu_read_lock_trace() region.
>>
>> Paul,
>> could you clarify ?
> You are correct, Alexei.  Unlike vanilla RCU, RCU Tasks Trace does not
> count irq-disabled regions of code as readers.

I see. But I still have one question: considering that in current
implementation one Tasks Trace RCU grace period implies one vanilla RCU
grace period (aka rcu_trace_implies_rcu_gp), so in my naive
understanding of RCU, does that mean __free_rcu_tasks_trace() will be
invoked after the expiration of current Task Trace RCU grace period,
right ? And does it also mean __free_rcu_tasks_trace() will be invoked
after the expiration of current vanilla RCU grace period, right ? If
these two conditions above are true, does it mean
__free_rcu_tasks_trace() will wait for the irq-disabled code region?
> But why not just put an rcu_read_lock_trace() and a matching
> rcu_read_unlock_trace() within that irq-disabled region of code?
>
> For completeness, if it were not for CONFIG_TASKS_TRACE_RCU_READ_MB,
> Hou Tao would be correct from a strict current-implementation
> viewpoint.  The reason is that, given the current implementation in
> CONFIG_TASKS_TRACE_RCU_READ_MB=n kernels, a task must either block or
> take an IPI in order for the grace-period machinery to realize that this
> task is done with all prior readers.

Thanks for the detailed explanation.
> However, we need to account for the possibility of IPI-free
> implementations, for example, if the real-time guys decide to start
> making heavy use of BPF sleepable programs.  They would then insist on
> getting rid of those IPIs for CONFIG_PREEMPT_RT=y kernels.  At which
> point, irq-disabled regions of code will absolutely not act as
> RCU tasks trace readers.
>
> Again, why not just put an rcu_read_lock_trace() and a matching
> rcu_read_unlock_trace() within that irq-disabled region of code?
>
> Or maybe there is a better workaround.

Yes. I think we could use rcu_read_{lock,unlock}_trace to fix the ABA
problem for free_by_rcu_ttrace.
>
>> 2.
>> Even if 1 is incorrect, in RT llist_del_first() from alloc_bulk()
>> runs "in a per-CPU thread in preemptible context."
>> See irq_work_run_list.
> Agreed, under RT, "interrupt handlers" often run in task context.

Yes, I missed that. I misread alloc_bulk(), and it seems it only does
inc_active() for c->free_llist.
> 						Thanx, Paul
Paul E. McKenney July 10, 2023, 4:45 a.m. UTC | #10
On Sat, Jul 08, 2023 at 03:03:40PM +0800, Hou Tao wrote:
> Hi,
> 
> On 7/8/2023 1:47 AM, Paul E. McKenney wrote:
> > On Fri, Jul 07, 2023 at 09:11:22AM -0700, Alexei Starovoitov wrote:
> >> On Thu, Jul 6, 2023 at 9:37 PM Hou Tao <houtao@huaweicloud.com> wrote:
> SNIP
> >>> I guess you're assuming that alloc_bulk() from irq_work
> >>> is running within rcu_tasks_trace critical section,
> >>> so __free_rcu_tasks_trace() callback will execute after
> >>> irq work completed?
> >>> I don't think that's the case.
> >>> Yes. The following is my original thoughts. Correct me if I was wrong:
> >>>
> >>> 1. llist_del_first() must be running concurrently with llist_del_all().
> >>> If llist_del_first() runs after llist_del_all(), it will return NULL
> >>> directly.
> >>> 2. call_rcu_tasks_trace() must happen after llist_del_all(), else the
> >>> elements in free_by_rcu_ttrace will not be freed back to slab.
> >>> 3. call_rcu_tasks_trace() will wait for one tasks trace RCU grace period
> >>> to call __free_rcu_tasks_trace()
> >>> 4. llist_del_first() is running in an irq-disabled context, so the
> >>> tasks trace RCU grace period will wait for the end of llist_del_first()
> >>>
> >>> It seems you thought step 4) is not true, right ?
> >> Yes. I think so. For two reasons:
> >>
> >> 1.
> >> I believe irq disabled region isn't considered equivalent
> >> to rcu_read_lock_trace() region.
> >>
> >> Paul,
> >> could you clarify ?
> > You are correct, Alexei.  Unlike vanilla RCU, RCU Tasks Trace does not
> > count irq-disabled regions of code as readers.
> 
> I see. But I still have one question: considering that in current
> implementation one Tasks Trace RCU grace period implies one vanilla RCU
> grace period (aka rcu_trace_implies_rcu_gp), so in my naive
> understanding of RCU, does that mean __free_rcu_tasks_trace() will be
> invoked after the expiration of current Task Trace RCU grace period,
> right ? And does it also mean __free_rcu_tasks_trace() will be invoked
> after the expiration of current vanilla RCU grace period, right ? If
> these two conditions above are true, does it mean
> __free_rcu_tasks_trace() will wait for the irq-disabled code region?

First, good show digging into the code!

However, this is guaranteed only if rcu_trace_implies_rcu_gp(), which
right now happens to return the constant true.  In other words, that is
an accident of the current implementation.  To rely on this, you must
check the return value of rcu_trace_implies_rcu_gp() and then have some
alternative code that does not rely on that synchronize_rcu().
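
The fallback Paul is asking for looks roughly like the pattern the allocator already uses when it needs a vanilla RCU grace period after the tasks-trace one (a sketch of the shape of __free_rcu_tasks_trace() in kernel/bpf/memalloc.c, not quoted verbatim):

static void __free_rcu_tasks_trace(struct rcu_head *head)
{
	/* If the tasks-trace grace period also implied a vanilla RCU grace
	 * period, the objects can be freed right away ...
	 */
	if (rcu_trace_implies_rcu_gp())
		__free_rcu(head);
	else
		/* ... otherwise chain an explicit vanilla RCU grace period. */
		call_rcu(head, __free_rcu);
}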

> > But why not just put an rcu_read_lock_trace() and a matching
> > rcu_read_unlock_trace() within that irq-disabled region of code?
> >
> > For completeness, if it were not for CONFIG_TASKS_TRACE_RCU_READ_MB,
> > Hou Tao would be correct from a strict current-implementation
> > viewpoint.  The reason is that, given the current implementation in
> > CONFIG_TASKS_TRACE_RCU_READ_MB=n kernels, a task must either block or
> > take an IPI in order for the grace-period machinery to realize that this
> > task is done with all prior readers.
> 
> Thanks for the detailed explanation.
> 
> > However, we need to account for the possibility of IPI-free
> > implementations, for example, if the real-time guys decide to start
> > making heavy use of BPF sleepable programs.  They would then insist on
> > getting rid of those IPIs for CONFIG_PREEMPT_RT=y kernels.  At which
> > point, irq-disabled regions of code will absolutely not act as
> > RCU tasks trace readers.
> >
> > Again, why not just put an rcu_read_lock_trace() and a matching
> > rcu_read_unlock_trace() within that irq-disabled region of code?
> >
> > Or maybe there is a better workaround.
> 
> Yes. I think we could use rcu_read_{lock,unlock}_trace to fix the ABA
> problem for free_by_rcu_ttrace.

That sounds good to me!

							Thanx, Paul

> >> 2.
> >> Even if 1 is incorrect, in RT llist_del_first() from alloc_bulk()
> >> runs "in a per-CPU thread in preemptible context."
> >> See irq_work_run_list.
> > Agreed, under RT, "interrupt handlers" often run in task context.
> 
> Yes, I missed that. I misread alloc_bulk(), and it seems it only does
> inc_active() for c->free_llist.
> > 						Thanx, Paul
>

Patch

diff --git a/kernel/bpf/memalloc.c b/kernel/bpf/memalloc.c
index 9986c6b7df4d..e5a87f6cf2cc 100644
--- a/kernel/bpf/memalloc.c
+++ b/kernel/bpf/memalloc.c
@@ -212,6 +212,15 @@  static void alloc_bulk(struct bpf_mem_cache *c, int cnt, int node)
 	if (i >= cnt)
 		return;
 
+	for (; i < cnt; i++) {
+		obj = llist_del_first(&c->waiting_for_gp_ttrace);
+		if (!obj)
+			break;
+		add_obj_to_free_list(c, obj);
+	}
+	if (i >= cnt)
+		return;
+
 	memcg = get_memcg(c);
 	old_memcg = set_active_memcg(memcg);
 	for (; i < cnt; i++) {
@@ -295,12 +304,7 @@  static void do_call_rcu_ttrace(struct bpf_mem_cache *c)
 
 	WARN_ON_ONCE(!llist_empty(&c->waiting_for_gp_ttrace));
 	llist_for_each_safe(llnode, t, llist_del_all(&c->free_by_rcu_ttrace))
-		/* There is no concurrent __llist_add(waiting_for_gp_ttrace) access.
-		 * It doesn't race with llist_del_all either.
-		 * But there could be two concurrent llist_del_all(waiting_for_gp_ttrace):
-		 * from __free_rcu() and from drain_mem_cache().
-		 */
-		__llist_add(llnode, &c->waiting_for_gp_ttrace);
+		llist_add(llnode, &c->waiting_for_gp_ttrace);
 
 	if (unlikely(READ_ONCE(c->draining))) {
 		__free_rcu(&c->rcu_ttrace);