
[v2,bpf-next,09/13] bpf: Allow reuse from waiting_for_gp_ttrace list.

Message ID 20230624031333.96597-10-alexei.starovoitov@gmail.com (mailing list archive)
State Superseded
Delegated to: BPF
Series bpf: Introduce bpf_mem_cache_free_rcu().

Checks

Context Check Description
bpf/vmtest-bpf-next-PR success PR summary
bpf/vmtest-bpf-next-VM_Test-30 success Logs for test_verifier on x86_64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-31 success Logs for veristat
netdev/series_format success Posting correctly formatted
netdev/tree_selection success Clearly marked for bpf-next
netdev/fixes_present success Fixes tag not required for -next series
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 13 this patch: 13
netdev/cc_maintainers warning 8 maintainers not CCed: yhs@fb.com kpsingh@kernel.org martin.lau@linux.dev john.fastabend@gmail.com song@kernel.org sdf@google.com jolsa@kernel.org haoluo@google.com
netdev/build_clang success Errors and warnings before: 8 this patch: 8
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/deprecated_api success None detected
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes success No Fixes tag
netdev/build_allmodconfig_warn success Errors and warnings before: 13 this patch: 13
netdev/checkpatch success total: 0 errors, 0 warnings, 0 checks, 15 lines checked
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/source_inline success Was 0 now: 0
bpf/vmtest-bpf-next-VM_Test-9 success Logs for test_maps on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-11 success Logs for test_progs on aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-13 success Logs for test_progs on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-15 success Logs for test_progs_no_alu32 on aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-17 success Logs for test_progs_no_alu32 on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-19 success Logs for test_progs_no_alu32_parallel on aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-20 success Logs for test_progs_no_alu32_parallel on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-22 success Logs for test_progs_parallel on aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-23 success Logs for test_progs_parallel on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-25 success Logs for test_verifier on aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-27 success Logs for test_verifier on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-28 success Logs for test_verifier on x86_64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-29 success Logs for veristat
bpf/vmtest-bpf-next-VM_Test-10 success Logs for test_maps on x86_64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-14 success Logs for test_progs on x86_64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-18 success Logs for test_progs_no_alu32 on x86_64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-21 success Logs for test_progs_no_alu32_parallel on x86_64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-24 success Logs for test_progs_parallel on x86_64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-26 success Logs for test_verifier on s390x with gcc
bpf/vmtest-bpf-next-VM_Test-16 success Logs for test_progs_no_alu32 on s390x with gcc
bpf/vmtest-bpf-next-VM_Test-12 success Logs for test_progs on s390x with gcc
bpf/vmtest-bpf-next-VM_Test-1 success Logs for ${{ matrix.test }} on ${{ matrix.arch }} with ${{ matrix.toolchain_full }}
bpf/vmtest-bpf-next-VM_Test-2 success Logs for ShellCheck
bpf/vmtest-bpf-next-VM_Test-3 success Logs for build for aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-4 fail Logs for build for s390x with gcc
bpf/vmtest-bpf-next-VM_Test-5 success Logs for build for x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-6 success Logs for build for x86_64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-7 success Logs for set-matrix
bpf/vmtest-bpf-next-VM_Test-8 success Logs for veristat

Commit Message

Alexei Starovoitov June 24, 2023, 3:13 a.m. UTC
From: Alexei Starovoitov <ast@kernel.org>

alloc_bulk() can reuse elements from free_by_rcu_ttrace.
Let it reuse from waiting_for_gp_ttrace as well to avoid unnecessary kmalloc().

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
 kernel/bpf/memalloc.c | 9 +++++++++
 1 file changed, 9 insertions(+)
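
For readers skimming the thread below, this is roughly the refill order in alloc_bulk() once the patch is applied. It is a sketch reconstructed from the diff at the bottom of this page and the commit message; code outside the new hunk (the free_by_rcu_ttrace loop and the allocation fallback) is paraphrased, not verbatim source.

/* Sketch of alloc_bulk() after this patch (reconstructed; only the
 * middle loop is taken verbatim from the diff below). */
static void alloc_bulk(struct bpf_mem_cache *c, int cnt, int node)
{
	void *obj;
	int i = 0;

	/* 1) Pre-existing: reuse elements whose tasks-trace RCU grace
	 *    period has already elapsed. */
	for (; i < cnt; i++) {
		obj = __llist_del_first(&c->free_by_rcu_ttrace);
		if (!obj)
			break;
		add_obj_to_free_list(c, obj);
	}
	if (i >= cnt)
		return;

	/* 2) New in this patch: also reuse elements that are still
	 *    waiting for their grace period, avoiding kmalloc(). */
	for (; i < cnt; i++) {
		obj = llist_del_first(&c->waiting_for_gp_ttrace);
		if (!obj)
			break;
		add_obj_to_free_list(c, obj);
	}
	if (i >= cnt)
		return;

	/* 3) Pre-existing fallback: allocate the remaining elements with
	 *    kmalloc under the cache's memcg (get_memcg() etc.). */
}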

Comments

Hou Tao June 26, 2023, 3:30 a.m. UTC | #1
Hi,

On 6/24/2023 11:13 AM, Alexei Starovoitov wrote:
> From: Alexei Starovoitov <ast@kernel.org>
>
> alloc_bulk() can reuse elements from free_by_rcu_ttrace.
> Let it reuse from waiting_for_gp_ttrace as well to avoid unnecessary kmalloc().
>
> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
> ---
>  kernel/bpf/memalloc.c | 9 +++++++++
>  1 file changed, 9 insertions(+)
>
> diff --git a/kernel/bpf/memalloc.c b/kernel/bpf/memalloc.c
> index 692a9a30c1dc..666917c16e87 100644
> --- a/kernel/bpf/memalloc.c
> +++ b/kernel/bpf/memalloc.c
> @@ -203,6 +203,15 @@ static void alloc_bulk(struct bpf_mem_cache *c, int cnt, int node)
>  	if (i >= cnt)
>  		return;
>  
> +	for (; i < cnt; i++) {
> +		obj = llist_del_first(&c->waiting_for_gp_ttrace);
After allowing reuse of elements from waiting_for_gp_ttrace, there may
be concurrent llist_del_first() and llist_del_all() as shown below, and
llist_del_first() is not safe here because the elements freed from free_rcu()
could be reused immediately and head->first may be added back to
c0->waiting_for_gp_ttrace by another process.

// c0
alloc_bulk()
    llist_del_first(&c->waiting_for_gp_ttrace)

// c1->tgt = c0
free_rcu()
    llist_del_all(&c->waiting_for_gp_ttrace)

> +		if (!obj)
> +			break;
> +		add_obj_to_free_list(c, obj);
> +	}
> +	if (i >= cnt)
> +		return;
> +
>  	memcg = get_memcg(c);
>  	old_memcg = set_active_memcg(memcg);
>  	for (; i < cnt; i++) {
Alexei Starovoitov June 26, 2023, 4:42 a.m. UTC | #2
On Sun, Jun 25, 2023 at 8:30 PM Hou Tao <houtao@huaweicloud.com> wrote:
>
> Hi,
>
> On 6/24/2023 11:13 AM, Alexei Starovoitov wrote:
> > From: Alexei Starovoitov <ast@kernel.org>
> >
> > alloc_bulk() can reuse elements from free_by_rcu_ttrace.
> > Let it reuse from waiting_for_gp_ttrace as well to avoid unnecessary kmalloc().
> >
> > Signed-off-by: Alexei Starovoitov <ast@kernel.org>
> > ---
> >  kernel/bpf/memalloc.c | 9 +++++++++
> >  1 file changed, 9 insertions(+)
> >
> > diff --git a/kernel/bpf/memalloc.c b/kernel/bpf/memalloc.c
> > index 692a9a30c1dc..666917c16e87 100644
> > --- a/kernel/bpf/memalloc.c
> > +++ b/kernel/bpf/memalloc.c
> > @@ -203,6 +203,15 @@ static void alloc_bulk(struct bpf_mem_cache *c, int cnt, int node)
> >       if (i >= cnt)
> >               return;
> >
> > +     for (; i < cnt; i++) {
> > +             obj = llist_del_first(&c->waiting_for_gp_ttrace);
> After allowing to reuse elements from waiting_for_gp_ttrace, there may
> be concurrent llist_del_first() and llist_del_all() as shown below and
> llist_del_first() is not safe because the elements freed from free_rcu()
> could be reused immediately and head->first may be added back to
> c0->waiting_for_gp_ttrace by other process.
>
> // c0
> alloc_bulk()
>     llist_del_first(&c->waiting_for_gp_ttrace)
>
> // c1->tgt = c0
> free_rcu()
>     llist_del_all(&c->waiting_for_gp_ttrace)

I'm still thinking about how to fix the other issues you've reported,
but this one, I believe, is fine.
Are you basing 'not safe' on a comment?
Why is xchg(&head->first, NULL); on one cpu and
try_cmpxchg(&head->first, &entry, next);
on another unsafe?
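
For context, the comment being referenced is the locking table in include/linux/llist.h, which says a lock is needed when one consumer uses llist_del_first() while other consumers use llist_del_first() or llist_del_all(). The two primitives in question reduce to roughly the following; this is a simplified sketch modelled on the kernel implementation, not the exact source.

/* Simplified sketch of the two llist primitives being compared
 * (modelled on include/linux/llist.h; details elided). */

/* llist_del_all(): atomically detach the whole list. */
struct llist_node *llist_del_all_sketch(struct llist_head *head)
{
	return xchg(&head->first, NULL);
}

/* llist_del_first(): pop one node with a cmpxchg loop.  The cmpxchg
 * only checks that ->first still equals the sampled 'entry', which is
 * why reuse and re-insertion of 'entry' in the meantime (ABA) can make
 * it succeed with a stale 'next'. */
struct llist_node *llist_del_first_sketch(struct llist_head *head)
{
	struct llist_node *entry, *next;

	entry = smp_load_acquire(&head->first);
	do {
		if (!entry)
			return NULL;
		next = READ_ONCE(entry->next);
	} while (!try_cmpxchg(&head->first, &entry, next));

	return entry;
}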
Hou Tao June 26, 2023, 7:16 a.m. UTC | #3
Hi,

On 6/26/2023 12:42 PM, Alexei Starovoitov wrote:
> On Sun, Jun 25, 2023 at 8:30 PM Hou Tao <houtao@huaweicloud.com> wrote:
>> Hi,
>>
>> On 6/24/2023 11:13 AM, Alexei Starovoitov wrote:
>>> From: Alexei Starovoitov <ast@kernel.org>
>>>
>>> alloc_bulk() can reuse elements from free_by_rcu_ttrace.
>>> Let it reuse from waiting_for_gp_ttrace as well to avoid unnecessary kmalloc().
>>>
>>> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
>>> ---
>>>  kernel/bpf/memalloc.c | 9 +++++++++
>>>  1 file changed, 9 insertions(+)
>>>
>>> diff --git a/kernel/bpf/memalloc.c b/kernel/bpf/memalloc.c
>>> index 692a9a30c1dc..666917c16e87 100644
>>> --- a/kernel/bpf/memalloc.c
>>> +++ b/kernel/bpf/memalloc.c
>>> @@ -203,6 +203,15 @@ static void alloc_bulk(struct bpf_mem_cache *c, int cnt, int node)
>>>       if (i >= cnt)
>>>               return;
>>>
>>> +     for (; i < cnt; i++) {
>>> +             obj = llist_del_first(&c->waiting_for_gp_ttrace);
>> After allowing to reuse elements from waiting_for_gp_ttrace, there may
>> be concurrent llist_del_first() and llist_del_all() as shown below and
>> llist_del_first() is not safe because the elements freed from free_rcu()
>> could be reused immediately and head->first may be added back to
>> c0->waiting_for_gp_ttrace by other process.
>>
>> // c0
>> alloc_bulk()
>>     llist_del_first(&c->waiting_for_gp_ttrace)
>>
>> // c1->tgt = c0
>> free_rcu()
>>     llist_del_all(&c->waiting_for_gp_ttrace)
> I'm still thinking about how to fix the other issues you've reported,
> but this one, I believe, is fine.
> Are you basing 'not safe' on a comment?
> Why xchg(&head->first, NULL); on one cpu and
> try_cmpxchg(&head->first, &entry, next);
> is unsafe?
No, I didn't reason it only based on the comment in llist.h. The reason
I thought it was not safe is that llist_del_first() may have an ABA problem,
as you pointed out earlier: after __free_rcu(), the elements on
waiting_for_gp_ttrace become available immediately, and
waiting_for_gp_ttrace->first may be reused and then added back to
waiting_for_gp_ttrace->first again, as shown below.

// c0->waiting_for_gp_ttrace 
A -> B -> C -> nil 
 
P1: 
alloc_bulk(c0) 
    llist_del_first(&c0->waiting_for_gp_ttrace) 
        entry = A 
        next = B 
 
    P2: __free_rcu 
        // A/B/C is freed back to slab 
        // c0->waiting_for_gp_ttrace->first = NULL 
        free_all(llist_del_all(c0)) 
 
    P3: alloc_bulk(c1) 
        // got A 
        __kmalloc() 
 
    // free A (from c1), ..., last free X (allocated from c0) 
    P3: unit_free(c1)
        // the last freed element X is from c0
        c1->tgt = c0 
        c1->free_llist->first -> X -> Y -> ... -> A 
    P3: free_bulk(c1) 
        enque_to_free(c0) 
            c0->free_by_rcu_ttrace->first -> A -> ... -> Y -> X 
        __llist_add_batch(c0->waiting_for_gp_ttrace) 
            c0->waiting_for_gp_ttrace = A -> ... -> Y -> X 

P1: 
    // A is added back as first again
    // but llist_del_first() didn't know
    try_cmpxchg(&c0->waiting_for_gp_ttrace->first, A, B)
    // c0->waiting_for_gp_ttrace is corrupted
    c0->waiting_for_gp_ttrace->first = B
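
Condensing the trace above: this is the classic ABA hazard of a cmpxchg-based pop. The final compare only checks that ->first still equals the sampled pointer A, not that the list is otherwise unchanged, so it succeeds even though B and C were freed long ago. The pointer names below are the hypothetical ones from the trace.

	/* P1 sampled entry = A, next = B while the list was A -> B -> C.
	 * The list was then drained, A was freed, reallocated and pushed
	 * again, so ->first == A once more while B is stale. */
	try_cmpxchg(&c0->waiting_for_gp_ttrace.first, &entry /* == A */, B);
	/* Succeeds, leaving ->first pointing at the stale node B. */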
Alexei Starovoitov June 28, 2023, 12:59 a.m. UTC | #4
On 6/26/23 12:16 AM, Hou Tao wrote:
> Hi,
> 
> On 6/26/2023 12:42 PM, Alexei Starovoitov wrote:
>> On Sun, Jun 25, 2023 at 8:30 PM Hou Tao <houtao@huaweicloud.com> wrote:
>>> Hi,
>>>
>>> On 6/24/2023 11:13 AM, Alexei Starovoitov wrote:
>>>> From: Alexei Starovoitov <ast@kernel.org>
>>>>
>>>> alloc_bulk() can reuse elements from free_by_rcu_ttrace.
>>>> Let it reuse from waiting_for_gp_ttrace as well to avoid unnecessary kmalloc().
>>>>
>>>> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
>>>> ---
>>>>   kernel/bpf/memalloc.c | 9 +++++++++
>>>>   1 file changed, 9 insertions(+)
>>>>
>>>> diff --git a/kernel/bpf/memalloc.c b/kernel/bpf/memalloc.c
>>>> index 692a9a30c1dc..666917c16e87 100644
>>>> --- a/kernel/bpf/memalloc.c
>>>> +++ b/kernel/bpf/memalloc.c
>>>> @@ -203,6 +203,15 @@ static void alloc_bulk(struct bpf_mem_cache *c, int cnt, int node)
>>>>        if (i >= cnt)
>>>>                return;
>>>>
>>>> +     for (; i < cnt; i++) {
>>>> +             obj = llist_del_first(&c->waiting_for_gp_ttrace);
>>> After allowing to reuse elements from waiting_for_gp_ttrace, there may
>>> be concurrent llist_del_first() and llist_del_all() as shown below and
>>> llist_del_first() is not safe because the elements freed from free_rcu()
>>> could be reused immediately and head->first may be added back to
>>> c0->waiting_for_gp_ttrace by other process.
>>>
>>> // c0
>>> alloc_bulk()
>>>      llist_del_first(&c->waiting_for_gp_ttrace)
>>>
>>> // c1->tgt = c0
>>> free_rcu()
>>>      llist_del_all(&c->waiting_for_gp_ttrace)
>> I'm still thinking about how to fix the other issues you've reported,
>> but this one, I believe, is fine.
>> Are you basing 'not safe' on a comment?
>> Why xchg(&head->first, NULL); on one cpu and
>> try_cmpxchg(&head->first, &entry, next);
>> is unsafe?
> No, I didn't reason it only based on the comment in llist.h. The reason
> I though it was not safe because llist_del_first() may have ABA problem
> as pointed by you early, because after __free_rcu(), the elements on
> waiting_for_gp_ttrace would be available immediately and
> waiting_for_gp_ttrace->first may be reused and then added back to
> waiting_for_gp_ttrace->first again as shown below.
> 
> // c0->waiting_for_gp_ttrace
> A -> B -> C -> nil
>   
> P1:
> alloc_bulk(c0)
>      llist_del_first(&c0->waiting_for_gp_ttrace)
>          entry = A
>          next = B
>   
>      P2: __free_rcu
>          // A/B/C is freed back to slab
>          // c0->waiting_for_gp_ttrace->first = NULL
>          free_all(llist_del_all(c0))
>   
>      P3: alloc_bulk(c1)
>          // got A
>          __kmalloc()
>   
>      // free A (from c1), ..., last free X (allocated from c0)
>      P3: unit_free(c1)
>          // the last freed element X is from c0
>          c1->tgt = c0
>          c1->free_llist->first -> X -> Y -> ... -> A
>      P3: free_bulk(c1)
>          enque_to_free(c0)
>              c0->free_by_rcu_ttrace->first -> A -> ... -> Y -> X
>          __llist_add_batch(c0->waiting_for_gp_ttrace)
>              c0->waiting_for_gp_ttrace = A -> ... -> Y -> X

In theory that's possible, but for this to happen one cpu needs
to be a thousand times slower than all others, and since there is no
preemption in llist_del_first() I don't think we need to worry about it.
Also, with removal of the _tail optimization the above
llist_add_batch(waiting_for_gp_ttrace)
will become a loop, so the reused element will be at the very end
instead of the top; one cpu would then have to be a million times slower, which is not realistic.

> P1:
>      // A is added back as first again
>      // but llist_del_first() didn't know
>      try_cmpxchg(&c0->waiting_for_gp_ttrace->first, A, B)
>      // c0->waiting_for_gp_ttrace is corrupted
>      c0->waiting_for_gp_ttrace->first = B
>
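A note on the "_tail optimization" argument above: llist_add() always makes the pushed node the new ->first, while llist_add_batch() splices a pre-built chain in its existing order. With a per-element loop, which element ends up at the top of waiting_for_gp_ttrace therefore depends entirely on the order the source list is walked, which is the point Hou Tao's follow-up (quoted in the next message) turns on. A hypothetical helper to make the ordering concrete; move_one_by_one() is not an existing kernel function:

/* Hypothetical illustration only.  Each llist_add() pushes to the
 * head, so the last node popped from 'src' becomes dst->first and the
 * first node popped ends up at the very end of 'dst'. */
static void move_one_by_one(struct llist_head *src, struct llist_head *dst)
{
	struct llist_node *n;

	while ((n = __llist_del_first(src)))
		llist_add(n, dst);
}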
Paul E. McKenney June 28, 2023, 5:38 p.m. UTC | #5
On Wed, Jun 28, 2023 at 04:09:14PM +0800, Hou Tao wrote:
> Hi,
> 
> On 6/28/2023 8:59 AM, Alexei Starovoitov wrote:
> > On 6/26/23 12:16 AM, Hou Tao wrote:
> >> Hi,
> >>
> >> On 6/26/2023 12:42 PM, Alexei Starovoitov wrote:
> >>> On Sun, Jun 25, 2023 at 8:30 PM Hou Tao <houtao@huaweicloud.com> wrote:
> >>>> Hi,
> >>>>
> >>>> On 6/24/2023 11:13 AM, Alexei Starovoitov wrote:
> >>>>> From: Alexei Starovoitov <ast@kernel.org>
> >>>>>
> >>>>> alloc_bulk() can reuse elements from free_by_rcu_ttrace.
> >>>>> Let it reuse from waiting_for_gp_ttrace as well to avoid
> >>>>> unnecessary kmalloc().
> >>>>>
> >>>>> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
> >>>>> ---
> >>>>>   kernel/bpf/memalloc.c | 9 +++++++++
> >>>>>   1 file changed, 9 insertions(+)
> >>>>>
> SNIP
> >>        // free A (from c1), ..., last free X (allocated from c0)
> >>      P3: unit_free(c1)
> >>          // the last freed element X is from c0
> >>          c1->tgt = c0
> >>          c1->free_llist->first -> X -> Y -> ... -> A
> >>      P3: free_bulk(c1)
> >>          enque_to_free(c0)
> >>              c0->free_by_rcu_ttrace->first -> A -> ... -> Y -> X
> >>          __llist_add_batch(c0->waiting_for_gp_ttrace)
> >>              c0->waiting_for_gp_ttrace = A -> ... -> Y -> X
> >
> > In theory that's possible, but for this to happen one cpu needs
> > to be thousand times slower than all others and since there is no
> > preemption in llist_del_first I don't think we need to worry about it.
> 
> Not sure whether or not such a case will be possible in a VM; after all,
> the CPU X is just a thread in the host and it may be preempted at any time
> and for any duration.

vCPU preemption can happen even with guest-OS interrupts disabled, and
such preemption can persist for hundreds of milliseconds, or even for
several seconds.  So admittedly quite rare, but also quite possible.

							Thanx, Paul

> > Also with removal of _tail optimization the above
> > llist_add_batch(waiting_for_gp_ttrace)
> > will become a loop, so reused element will be at the very end
> > instead of top, so one cpu to million times slower which is not
> > realistic.
> 
> It is still possible that A will be added back as
> waiting_for_gp_ttrace->first after switching to llist_add(), as shown
> below. My question is: how much is the benefit of reusing from
> waiting_for_gp_ttrace?
> 
>     // free A (from c1), ..., last free X (allocated from c0) 
>     P3: unit_free(c1)
>         // the last freed element X is allocated from c0
>         c1->tgt = c0
>         c1->free_llist->first -> A -> ... -> Y
>         c1->free_llist_extra -> X
> 
>     P3: free_bulk(c1)
>         enque_to_free(c0) 
>             c0->free_by_rcu_ttrace->first -> Y -> ... A
>             c0->free_by_rcu_ttrace->first -> X -> Y -> ... A
> 
>         llist_add(c0->waiting_for_gp_ttrace)
>             c0->waiting_for_gp_ttrace = A -> .. -> Y -> X
> 
> >
> >> P1:
> >>      // A is added back as first again
> >>      // but llist_del_first() didn't know
> >>      try_cmpxchg(&c0->waiting_for_gp_ttrace->first, A, B)
> >>      // c0->waiting_for_gp_ttrace is corrupted
> >>      c0->waiting_for_gp_ttrace->first = B
> >>
>

Patch

diff --git a/kernel/bpf/memalloc.c b/kernel/bpf/memalloc.c
index 692a9a30c1dc..666917c16e87 100644
--- a/kernel/bpf/memalloc.c
+++ b/kernel/bpf/memalloc.c
@@ -203,6 +203,15 @@  static void alloc_bulk(struct bpf_mem_cache *c, int cnt, int node)
 	if (i >= cnt)
 		return;
 
+	for (; i < cnt; i++) {
+		obj = llist_del_first(&c->waiting_for_gp_ttrace);
+		if (!obj)
+			break;
+		add_obj_to_free_list(c, obj);
+	}
+	if (i >= cnt)
+		return;
+
 	memcg = get_memcg(c);
 	old_memcg = set_active_memcg(memcg);
 	for (; i < cnt; i++) {