[bpf-next,1/2] bpf: Reuse freed element in free_by_rcu during allocation

Message ID 20221206042946.686847-2-houtao@huaweicloud.com (mailing list archive)
State Superseded
Delegated to: BPF
Series Misc optimizations for bpf mem allocator

Commit Message

Hou Tao Dec. 6, 2022, 4:29 a.m. UTC
From: Hou Tao <houtao1@huawei.com>

When there are batched freeing operations on a specific CPU, part of
the freed elements ((high_watermark - low_watermark) / 2 + 1) will be
moved to the waiting_for_gp list and the remaining part will be left in
the free_by_rcu list, waiting for the expiration of the RCU-tasks-trace
grace period and the next invocation of free_bulk().

So instead of invoking __alloc_percpu_gfp() or kmalloc_node() to
allocate a new object, in alloc_bulk() just check whether there is a
freed element in the free_by_rcu list and reuse it if available.

Signed-off-by: Hou Tao <houtao1@huawei.com>
---
 kernel/bpf/memalloc.c | 21 ++++++++++++++++++---
 1 file changed, 18 insertions(+), 3 deletions(-)
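
For orientation, the per-CPU cache that alloc_bulk() and free_bulk() operate on
looks roughly like this (an abridged sketch based on kernel/bpf/memalloc.c
around this series; only the fields discussed in this thread are shown, and the
comments are illustrative rather than the kernel's own):

/* Abridged sketch of the per-CPU cache; unrelated fields are omitted. */
struct bpf_mem_cache {
	struct llist_head free_llist;	/* free objects ready for unit_alloc() */
	local_t active;			/* guards free_llist against reentrancy */
	int free_cnt;			/* number of objects on free_llist */
	int low_watermark, high_watermark, batch;

	struct rcu_head rcu;
	struct llist_head free_by_rcu;	  /* freed, not yet handed to RCU */
	struct llist_head waiting_for_gp; /* freed, RCU-tasks-trace GP pending */
	atomic_t call_rcu_in_progress;
};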

Comments

Yonghong Song Dec. 7, 2022, 1:52 a.m. UTC | #1
On 12/5/22 8:29 PM, Hou Tao wrote:
> From: Hou Tao <houtao1@huawei.com>
> 
> When there are batched freeing operations on a specific CPU, part of
> the freed elements ((high_watermark - lower_watermark) / 2 + 1) will
> be moved to waiting_for_gp list and the remaining part will be left in
> free_by_rcu list and waits for the expiration of RCU-tasks-trace grace
> period and the next invocation of free_bulk().

The change below LGTM. However, the above description does not seem precise.
IIUC, the free_by_rcu list => waiting_for_gp move is controlled by whether
call_rcu_in_progress is true or not. If it is true, the free_by_rcu list
will remain intact and will not be moved into the waiting_for_gp list.
So it is not 'the remaining part will be left in free_by_rcu'.
It is all elements in free_by_rcu to waiting_for_gp or none.
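
Roughly, that gating looks like this (a simplified sketch of do_call_rcu()
reconstructed from the behaviour described above, not a verbatim copy of the
kernel source):

static void do_call_rcu(struct bpf_mem_cache *c)
{
	struct llist_node *llnode, *t;

	/* A previous RCU-tasks-trace callback is still pending:
	 * leave free_by_rcu untouched, nothing is moved at all.
	 */
	if (atomic_xchg(&c->call_rcu_in_progress, 1))
		return;

	/* Otherwise move *all* elements from free_by_rcu to waiting_for_gp. */
	WRITE_ONCE(c->waiting_for_gp.first, NULL);
	llist_for_each_safe(llnode, t, __llist_del_all(&c->free_by_rcu))
		__llist_add(llnode, &c->waiting_for_gp);

	/* The callback frees waiting_for_gp once the grace period expires. */
	call_rcu_tasks_trace(&c->rcu, __free_rcu_tasks_trace);
}

So the move from free_by_rcu to waiting_for_gp is strictly all-or-nothing,
gated by call_rcu_in_progress.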

> 
> So instead of invoking __alloc_percpu_gfp() or kmalloc_node() to
> allocate a new object, in alloc_bulk() just check whether or not there
> is freed element in free_by_rcu and reuse it if available.
> 
> Signed-off-by: Hou Tao <houtao1@huawei.com>

LGTM except the above suggestions.

Acked-by: Yonghong Song <yhs@fb.com>

> ---
>   kernel/bpf/memalloc.c | 21 ++++++++++++++++++---
>   1 file changed, 18 insertions(+), 3 deletions(-)
> 
> diff --git a/kernel/bpf/memalloc.c b/kernel/bpf/memalloc.c
> index 8f0d65f2474a..7daf147bc8f6 100644
> --- a/kernel/bpf/memalloc.c
> +++ b/kernel/bpf/memalloc.c
> @@ -171,9 +171,24 @@ static void alloc_bulk(struct bpf_mem_cache *c, int cnt, int node)
>   	memcg = get_memcg(c);
>   	old_memcg = set_active_memcg(memcg);
>   	for (i = 0; i < cnt; i++) {
> -		obj = __alloc(c, node);
> -		if (!obj)
> -			break;
> +		/*
> +		 * free_by_rcu is only manipulated by irq work refill_work().
> +		 * IRQ works on the same CPU are called sequentially, so it is
> +		 * safe to use __llist_del_first() here. If alloc_bulk() is
> +		 * invoked by the initial prefill, there will be no running
> +		 * irq work, so __llist_del_first() is fine as well.
> +		 *
> +		 * In most cases, objects on free_by_rcu are from the same CPU.
> +		 * If some objects come from other CPUs, it doesn't incur any
> +		 * harm because NUMA_NO_NODE means the preference for current
> +		 * numa node and it is not a guarantee.
> +		 */
> +		obj = __llist_del_first(&c->free_by_rcu);
> +		if (!obj) {
> +			obj = __alloc(c, node);
> +			if (!obj)
> +				break;
> +		}
>   		if (IS_ENABLED(CONFIG_PREEMPT_RT))
>   			/* In RT irq_work runs in per-cpu kthread, so disable
>   			 * interrupts to avoid preemption and interrupts and
Hou Tao Dec. 7, 2022, 2:20 a.m. UTC | #2
Hi,

On 12/7/2022 9:52 AM, Yonghong Song wrote:
>
>
> On 12/5/22 8:29 PM, Hou Tao wrote:
>> From: Hou Tao <houtao1@huawei.com>
>>
>> When there are batched freeing operations on a specific CPU, part of
>> the freed elements ((high_watermark - lower_watermark) / 2 + 1) will
>> be moved to waiting_for_gp list and the remaining part will be left in
>> free_by_rcu list and waits for the expiration of RCU-tasks-trace grace
>> period and the next invocation of free_bulk().
>
> The change below LGTM. However, the above description seems not precise.
> IIUC, free_by_rcu list => waiting_for_gp is controlled by whether
> call_rcu_in_progress is true or not. If it is true, free_by_rcu list
> will remain intact and not moving into waiting_for_gp list.
> So it is not 'the remaining part will be left in free_by_rcu'.
> It is all elements in free_by_rcu to waiting_for_gp or none.
Thanks for the review and the suggestions. I tried to say that moving from
free_by_rcu to waiting_for_gp is slow, and there can be many free elements
stacked on the free_by_rcu list. So how about the following rephrasing, or do
you still prefer "It is all elements in free_by_rcu to waiting_for_gp or none."?

When there are batched freeing operations on a specific CPU, part of the freed
elements ((high_watermark - low_watermark) / 2 + 1) will be moved to the
waiting_for_gp list and the remaining part will be left in the free_by_rcu list.
These elements in the free_by_rcu list will be moved into the waiting_for_gp
list after one RCU-tasks-trace grace period and another invocation of
free_bulk(), so there may be many free elements stacked on the free_by_rcu list.
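
For reference, the path that feeds free_by_rcu looks roughly like this (an
abridged sketch of free_bulk(): the 'active' counter protection, PREEMPT_RT
handling and free_llist_extra draining are left out, and the exact batch
condition is reconstructed from the watermark arithmetic above rather than
quoted verbatim):

/* Runs from the irq_work refill_work() when free_cnt exceeds high_watermark. */
static void free_bulk(struct bpf_mem_cache *c)
{
	struct llist_node *llnode;
	int cnt;

	do {
		llnode = __llist_del_first(&c->free_llist);
		cnt = llnode ? --c->free_cnt : 0;
		if (llnode)
			/* Freed elements first accumulate on free_by_rcu. */
			__llist_add(llnode, &c->free_by_rcu);
	} while (cnt > (c->high_watermark + c->low_watermark) / 2);

	/* Moving free_by_rcu -> waiting_for_gp only happens here, and only
	 * when no earlier RCU-tasks-trace grace period is still in flight.
	 */
	do_call_rcu(c);
}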

>>
>> So instead of invoking __alloc_percpu_gfp() or kmalloc_node() to
>> allocate a new object, in alloc_bulk() just check whether or not there
>> is freed element in free_by_rcu and reuse it if available.
>>
>> Signed-off-by: Hou Tao <houtao1@huawei.com>
>
> LGTM except the above suggestions.
>
> Acked-by: Yonghong Song <yhs@fb.com>
>
>> ---
>>   kernel/bpf/memalloc.c | 21 ++++++++++++++++++---
>>   1 file changed, 18 insertions(+), 3 deletions(-)
>>
>> diff --git a/kernel/bpf/memalloc.c b/kernel/bpf/memalloc.c
>> index 8f0d65f2474a..7daf147bc8f6 100644
>> --- a/kernel/bpf/memalloc.c
>> +++ b/kernel/bpf/memalloc.c
>> @@ -171,9 +171,24 @@ static void alloc_bulk(struct bpf_mem_cache *c, int cnt,
>> int node)
>>       memcg = get_memcg(c);
>>       old_memcg = set_active_memcg(memcg);
>>       for (i = 0; i < cnt; i++) {
>> -        obj = __alloc(c, node);
>> -        if (!obj)
>> -            break;
>> +        /*
>> +         * free_by_rcu is only manipulated by irq work refill_work().
>> +         * IRQ works on the same CPU are called sequentially, so it is
>> +         * safe to use __llist_del_first() here. If alloc_bulk() is
>> +         * invoked by the initial prefill, there will be no running
>> +         * irq work, so __llist_del_first() is fine as well.
>> +         *
>> +         * In most cases, objects on free_by_rcu are from the same CPU.
>> +         * If some objects come from other CPUs, it doesn't incur any
>> +         * harm because NUMA_NO_NODE means the preference for current
>> +         * numa node and it is not a guarantee.
>> +         */
>> +        obj = __llist_del_first(&c->free_by_rcu);
>> +        if (!obj) {
>> +            obj = __alloc(c, node);
>> +            if (!obj)
>> +                break;
>> +        }
>>           if (IS_ENABLED(CONFIG_PREEMPT_RT))
>>               /* In RT irq_work runs in per-cpu kthread, so disable
>>                * interrupts to avoid preemption and interrupts and
Alexei Starovoitov Dec. 7, 2022, 2:58 a.m. UTC | #3
On Tue, Dec 6, 2022 at 6:20 PM Hou Tao <houtao@huaweicloud.com> wrote:
>
> Hi,
>
> On 12/7/2022 9:52 AM, Yonghong Song wrote:
> >
> >
> > On 12/5/22 8:29 PM, Hou Tao wrote:
> >> From: Hou Tao <houtao1@huawei.com>
> >>
> >> When there are batched freeing operations on a specific CPU, part of
> >> the freed elements ((high_watermark - lower_watermark) / 2 + 1) will
> >> be moved to waiting_for_gp list and the remaining part will be left in
> >> free_by_rcu list and waits for the expiration of RCU-tasks-trace grace
> >> period and the next invocation of free_bulk().
> >
> > The change below LGTM. However, the above description seems not precise.
> > IIUC, free_by_rcu list => waiting_for_gp is controlled by whether
> > call_rcu_in_progress is true or not. If it is true, free_by_rcu list
> > will remain intact and not moving into waiting_for_gp list.
> > So it is not 'the remaining part will be left in free_by_rcu'.
> > It is all elements in free_by_rcu to waiting_for_gp or none.
> Thanks for the review and the suggestions. I tried to say that moving from
> free_by_rcu to waiting_for_gp is slow, and there can be many free elements being
> stacked on free_by_rcu list. So how about the following rephrasing or do you
> still prefer "It is all elements in free_by_rcu to waiting_for_gp or none."  ?
>
> When there are batched freeing operations on a specific CPU, part of the freed
> elements ((high_watermark - lower_watermark) / 2 + 1) will be moved to
> waiting_for_gp list  and the remaining part will be left in free_by_rcu list.

I agree with Yonghong.
The above sentence is not just imprecise.
The 'elements moved to waiting_for_gp list' part is not correct.
The elements are never moved into it directly,
only via the free_by_rcu list.

All or none also matters.
Hou Tao Dec. 8, 2022, 1:49 a.m. UTC | #4
Hi,

On 12/7/2022 10:58 AM, Alexei Starovoitov wrote:
> On Tue, Dec 6, 2022 at 6:20 PM Hou Tao <houtao@huaweicloud.com> wrote:
>> Hi,
>>
>> On 12/7/2022 9:52 AM, Yonghong Song wrote:
>>>
>>> On 12/5/22 8:29 PM, Hou Tao wrote:
>>>> From: Hou Tao <houtao1@huawei.com>
>>>>
>>>> When there are batched freeing operations on a specific CPU, part of
>>>> the freed elements ((high_watermark - lower_watermark) / 2 + 1) will
>>>> be moved to waiting_for_gp list and the remaining part will be left in
>>>> free_by_rcu list and waits for the expiration of RCU-tasks-trace grace
>>>> period and the next invocation of free_bulk().
>>> The change below LGTM. However, the above description seems not precise.
>>> IIUC, free_by_rcu list => waiting_for_gp is controlled by whether
>>> call_rcu_in_progress is true or not. If it is true, free_by_rcu list
>>> will remain intact and not moving into waiting_for_gp list.
>>> So it is not 'the remaining part will be left in free_by_rcu'.
>>> It is all elements in free_by_rcu to waiting_for_gp or none.
>> Thanks for the review and the suggestions. I tried to say that moving from
>> free_by_rcu to waiting_for_gp is slow, and there can be many free elements being
>> stacked on free_by_rcu list. So how about the following rephrasing or do you
>> still prefer "It is all elements in free_by_rcu to waiting_for_gp or none."  ?
>>
>> When there are batched freeing operations on a specific CPU, part of the freed
>> elements ((high_watermark - lower_watermark) / 2 + 1) will be moved to
>> waiting_for_gp list  and the remaining part will be left in free_by_rcu list.
> I agree with Yonghong.
> The above sentence is not just not precise.
> 'elements moved to waiting_for_gp list' part is not correct.
> The elements never moved into it directly.
> Only via free_by_rcu list.
Yes.
>
> All or none also matters.
I am still confused about the "all or none" part. Does it mean that all
elements in the free_by_rcu list will be moved into waiting_for_gp, or that
none will be moved if call_rcu_in_progress is true?

So how about the following rephrasing?

When there are batched freeing operations on a specific CPU, part of the freed
elements ((high_watermark - low_watermark) / 2 + 1) will be indirectly moved
into the waiting_for_gp list through the free_by_rcu list. After
call_rcu_in_progress becomes false again, the remaining elements in the
free_by_rcu list will be moved to the waiting_for_gp list by the next
invocation of free_bulk(). However, if the expiration of the RCU-tasks-trace
grace period is relatively slow, no element in the free_by_rcu list will be
moved.

So instead of invoking __alloc_percpu_gfp() or kmalloc_node() to allocate a new
object, in alloc_bulk() just check whether there is a freed element in the
free_by_rcu list and reuse it if available.

Patch

diff --git a/kernel/bpf/memalloc.c b/kernel/bpf/memalloc.c
index 8f0d65f2474a..7daf147bc8f6 100644
--- a/kernel/bpf/memalloc.c
+++ b/kernel/bpf/memalloc.c
@@ -171,9 +171,24 @@  static void alloc_bulk(struct bpf_mem_cache *c, int cnt, int node)
 	memcg = get_memcg(c);
 	old_memcg = set_active_memcg(memcg);
 	for (i = 0; i < cnt; i++) {
-		obj = __alloc(c, node);
-		if (!obj)
-			break;
+		/*
+		 * free_by_rcu is only manipulated by irq work refill_work().
+		 * IRQ works on the same CPU are called sequentially, so it is
+		 * safe to use __llist_del_first() here. If alloc_bulk() is
+		 * invoked by the initial prefill, there will be no running
+		 * irq work, so __llist_del_first() is fine as well.
+		 *
+		 * In most cases, objects on free_by_rcu are from the same CPU.
+		 * If some objects come from other CPUs, it doesn't incur any
+		 * harm because NUMA_NO_NODE means the preference for current
+		 * numa node and it is not a guarantee.
+		 */
+		obj = __llist_del_first(&c->free_by_rcu);
+		if (!obj) {
+			obj = __alloc(c, node);
+			if (!obj)
+				break;
+		}
 		if (IS_ENABLED(CONFIG_PREEMPT_RT))
 			/* In RT irq_work runs in per-cpu kthread, so disable
 			 * interrupts to avoid preemption and interrupts and