[bpf-next,v6,0/8] bpf: Reduce memory usage for bpf_global_percpu_ma

Message ID 20231222031729.1287957-1-yonghong.song@linux.dev (mailing list archive)

Yonghong Song Dec. 22, 2023, 3:17 a.m. UTC
Currently, when a bpf program intends to allocate memory for a percpu kptr,
the verifier calls bpf_mem_alloc_init() to prefill all supported unit
sizes, which makes memory consumption very large on systems with many
cpus. For example, on a 128-cpu system, the total memory consumption
with the initial prefill is ~175MB. Things get worse on systems
with even more cpus.

Patch 1 avoids unnecessary extra percpu memory allocation.
Patch 2 adds objcg to bpf_mem_alloc at init stage so objcg can be
associated with root cgroup and objcg can be passed to later
bpf_mem_alloc_percpu_unit_init().
Patch 3 addresses the memory consumption issue by not prefilling
all unit sizes, i.e. only prefilling the user specified size.
Patch 4 further reduces memory consumption by limiting the
number of prefill entries for percpu memory allocation.
Patch 5 uses much smaller low/high watermarks for percpu allocation
to reduce memory consumption.
Patch 6 rejects percpu memory allocation with bpf_global_percpu_ma
when allocation size is greater than 512 bytes.
Patch 7 fixes the test_bpf_ma test to cope with the 512-byte limit from Patch 6.
Patch 8 adds a test to show the verification failure log message.

Changelogs:
  v5 -> v6:
    . Change bpf_mem_alloc_percpu_init() to take objcg as one of its parameters.
      For bpf_global_percpu_ma, the objcg is NULL, corresponding to the root memcg.
  v4 -> v5:
    . Do not initialize bpf_global_percpu_ma at init stage; instead,
      initialize it when the verifier knows it is going to be used
      by a bpf prog.
    . Use much smaller low/high watermarks for percpu allocation.
  v3 -> v4:
    . Add objcg to bpf_mem_alloc during init stage.
    . Initialize objcg at init stage but use it in bpf_mem_alloc_percpu_unit_init().
    . Remove check_obj_size() in bpf_mem_alloc_percpu_unit_init().
  v2 -> v3:
    . Clear the bpf_mem_cache if prefill fails.
    . Change test_bpf_ma percpu allocation tests to use bucket_size
      as allocation size instead of bucket_size - 8.
    . Remove __GFP_ZERO flag from __alloc_percpu_gfp() call.
  v1 -> v2:
    . Avoid unnecessary extra percpu memory allocation.
    . Add a separate function to do bpf_global_percpu_ma initialization.
    . Promote function static 'sizes' array to file static.
    . Add comments to explain why only one item is refilled for percpu alloc.

Yonghong Song (8):
  bpf: Avoid unnecessary extra percpu memory allocation
  bpf: Add objcg to bpf_mem_alloc
  bpf: Allow per unit prefill for non-fix-size percpu memory allocator
  bpf: Refill only one percpu element in memalloc
  bpf: Use smaller low/high marks for percpu allocation
  bpf: Limit up to 512 bytes for bpf_global_percpu_ma allocation
  selftests/bpf: Cope with 512 bytes limit with bpf_global_percpu_ma
  selftests/bpf: Add a selftest with > 512-byte percpu allocation size

 include/linux/bpf_mem_alloc.h                 |  8 ++
 kernel/bpf/memalloc.c                         | 93 ++++++++++++++++---
 kernel/bpf/verifier.c                         | 45 ++++++---
 .../selftests/bpf/prog_tests/test_bpf_ma.c    | 20 ++--
 .../selftests/bpf/progs/percpu_alloc_fail.c   | 18 ++++
 .../testing/selftests/bpf/progs/test_bpf_ma.c | 66 ++++++-------
 6 files changed, 184 insertions(+), 66 deletions(-)

Comments

Hou Tao Dec. 22, 2023, 9:35 a.m. UTC | #1
Hi,

On 12/22/2023 11:17 AM, Yonghong Song wrote:
> Currently when a bpf program intends to allocate memory for percpu kptr,
> the verifier will call bpf_mem_alloc_init() to prefill all supported
> unit sizes and this caused memory consumption very big for large number
> of cpus. For example, for 128-cpu system, the total memory consumption
> with initial prefill is ~175MB. Things will become worse for systems
> with even more cpus.
>
> Patch 1 avoids unnecessary extra percpu memory allocation.
> Patch 2 adds objcg to bpf_mem_alloc at init stage so objcg can be
> associated with root cgroup and objcg can be passed to later
> bpf_mem_alloc_percpu_unit_init().
> Patch 3 addresses memory consumption issue by avoiding to prefill
> with all unit sizes, i.e. only prefilling with user specified size.
> Patch 4 further reduces memory consumption by limiting the
> number of prefill entries for percpu memory allocation.
> Patch 5 has much smaller low/high watermarks for percpu allocation
> to reduce memory consumption.
> Patch 6 rejects percpu memory allocation with bpf_global_percpu_ma
> when allocation size is greater than 512 bytes.
> Patch 7 fixed test_bpf_ma test due to Patch 5.
> Patch 8 added one test to show the verification failure log message.

FYI, after applying the patch set, the memory consumption in the bpf memory
benchmark [1] on an 8-CPU VM decreases a lot:

Before the patch set:

$ for i in 1 4 8; do ./bench -w3 -d10 bpf_ma -p${i} -a --percpu; done | grep Summary
Summary: per-prod alloc   14.16 ± 0.59M/s free   36.18 ± 0.39M/s, total memory usage  183.71 ± 10.38MiB
Summary: per-prod alloc   12.35 ± 1.10M/s free   35.79 ± 0.51M/s, total memory usage  744.52 ± 11.64MiB
Summary: per-prod alloc   11.15 ± 0.20M/s free   35.72 ± 0.27M/s, total memory usage 2545.98 ± 537.57MiB

After the patch set:

$ for i in 1 4 8; do ./bench -w3 -d10 bpf_ma -p${i} -a --percpu; done | grep Summary
Summary: per-prod alloc    0.86 ± 0.00M/s free   37.29 ± 0.11M/s, total memory usage    0.00 ± 0.00MiB
Summary: per-prod alloc    0.85 ± 0.00M/s free   36.70 ± 0.24M/s, total memory usage    0.00 ± 0.00MiB
Summary: per-prod alloc    0.84 ± 0.00M/s free   37.21 ± 0.17M/s, total memory usage    0.00 ± 0.00MiB

However, the allocation performance also degrades a lot (from 11-14M/s
to about 0.85M/s per producer). It seems to be due to patch 5
(bpf: Use smaller low/high marks for percpu allocation): c->batch is
now 1, so each allocation needs one run of irq_work.

[1]:
https://lore.kernel.org/bpf/20231221141501.3588586-1-houtao@huaweicloud.com/
Yonghong Song Dec. 28, 2023, 12:49 a.m. UTC | #2
On 12/22/23 1:35 AM, Hou Tao wrote:
> Hi,
>
> On 12/22/2023 11:17 AM, Yonghong Song wrote:
>> Currently when a bpf program intends to allocate memory for percpu kptr,
>> the verifier will call bpf_mem_alloc_init() to prefill all supported
>> unit sizes and this caused memory consumption very big for large number
>> of cpus. For example, for 128-cpu system, the total memory consumption
>> with initial prefill is ~175MB. Things will become worse for systems
>> with even more cpus.
>>
>> Patch 1 avoids unnecessary extra percpu memory allocation.
>> Patch 2 adds objcg to bpf_mem_alloc at init stage so objcg can be
>> associated with root cgroup and objcg can be passed to later
>> bpf_mem_alloc_percpu_unit_init().
>> Patch 3 addresses memory consumption issue by avoiding to prefill
>> with all unit sizes, i.e. only prefilling with user specified size.
>> Patch 4 further reduces memory consumption by limiting the
>> number of prefill entries for percpu memory allocation.
>> Patch 5 has much smaller low/high watermarks for percpu allocation
>> to reduce memory consumption.
>> Patch 6 rejects percpu memory allocation with bpf_global_percpu_ma
>> when allocation size is greater than 512 bytes.
>> Patch 7 fixed test_bpf_ma test due to Patch 5.
>> Patch 8 added one test to show the verification failure log message.
> FYI. After applying the patch set, the memory consumption in bpf memory
> benchmark [1] on 8-CPU VM decreases a lot:
>
> Before the patch set:
>
> $ for i in 1 4 8; do ./bench -w3 -d10 bpf_ma -p${i} -a --percpu; done |
> grep Summary
> Summary: per-prod alloc   14.16 ± 0.59M/s free   36.18 ± 0.39M/s, total
> memory usage  183.71 ± 10.38MiB
> Summary: per-prod alloc   12.35 ± 1.10M/s free   35.79 ± 0.51M/s, total
> memory usage  744.52 ± 11.64MiB
> Summary: per-prod alloc   11.15 ± 0.20M/s free   35.72 ± 0.27M/s, total
> memory usage 2545.98 ± 537.57MiB
>
> After the patch set:
>
> $ for i in 1 4 8; do ./bench -w3 -d10 bpf_ma -p${i} -a --percpu; done |
> grep Summary
> Summary: per-prod alloc    0.86 ± 0.00M/s free   37.29 ± 0.11M/s, total
> memory usage    0.00 ± 0.00MiB
> Summary: per-prod alloc    0.85 ± 0.00M/s free   36.70 ± 0.24M/s, total
> memory usage    0.00 ± 0.00MiB
> Summary: per-prod alloc    0.84 ± 0.00M/s free   37.21 ± 0.17M/s, total
> memory usage    0.00 ± 0.00MiB
>
> However the allocation performance also degrades a lot. It seems it is
> due to patch 5 (bpf: Use smaller low/high marks for percpu allocation),
> because c->batch is 1 now, so each allocation needs one run of irq_work.

Thanks for benchmarking! With the low watermark set to 1 and c->batch
set to 1 as well, there will be more overhead due to irq_work. In practice,
I expect we should not see a lot of such percpu map element updates or
percpu kptr allocations.

>
> [1]:
> https://lore.kernel.org/bpf/20231221141501.3588586-1-houtao@huaweicloud.com/
patchwork-bot+netdevbpf@kernel.org Jan. 4, 2024, 5:20 a.m. UTC | #3
Hello:

This series was applied to bpf/bpf-next.git (master)
by Alexei Starovoitov <ast@kernel.org>:

On Thu, 21 Dec 2023 19:17:29 -0800 you wrote:
> Currently when a bpf program intends to allocate memory for percpu kptr,
> the verifier will call bpf_mem_alloc_init() to prefill all supported
> unit sizes and this caused memory consumption very big for large number
> of cpus. For example, for 128-cpu system, the total memory consumption
> with initial prefill is ~175MB. Things will become worse for systems
> with even more cpus.
> 
> [...]

Here is the summary with links:
  - [bpf-next,v6,1/8] bpf: Avoid unnecessary extra percpu memory allocation
    https://git.kernel.org/bpf/bpf-next/c/9beda16c257d
  - [bpf-next,v6,2/8] bpf: Add objcg to bpf_mem_alloc
    https://git.kernel.org/bpf/bpf-next/c/9fc8e802048a
  - [bpf-next,v6,3/8] bpf: Allow per unit prefill for non-fix-size percpu memory allocator
    https://git.kernel.org/bpf/bpf-next/c/c39aa3b289e9
  - [bpf-next,v6,4/8] bpf: Refill only one percpu element in memalloc
    https://git.kernel.org/bpf/bpf-next/c/5b95e638f134
  - [bpf-next,v6,5/8] bpf: Use smaller low/high marks for percpu allocation
    https://git.kernel.org/bpf/bpf-next/c/0e2ba9f96f9b
  - [bpf-next,v6,6/8] bpf: Limit up to 512 bytes for bpf_global_percpu_ma allocation
    https://git.kernel.org/bpf/bpf-next/c/5c1a37653260
  - [bpf-next,v6,7/8] selftests/bpf: Cope with 512 bytes limit with bpf_global_percpu_ma
    https://git.kernel.org/bpf/bpf-next/c/21f5a801c171
  - [bpf-next,v6,8/8] selftests/bpf: Add a selftest with > 512-byte percpu allocation size
    https://git.kernel.org/bpf/bpf-next/c/adc8c4549d9e

You are awesome, thank you!