Message ID | 20231222031729.1287957-1-yonghong.song@linux.dev (mailing list archive) |
---|---|
Series | bpf: Reduce memory usage for bpf_global_percpu_ma |
Hi,

On 12/22/2023 11:17 AM, Yonghong Song wrote:
> Currently when a bpf program intends to allocate memory for percpu kptr,
> the verifier will call bpf_mem_alloc_init() to prefill all supported
> unit sizes and this caused memory consumption very big for large number
> of cpus. For example, for 128-cpu system, the total memory consumption
> with initial prefill is ~175MB. Things will become worse for systems
> with even more cpus.
>
> Patch 1 avoids unnecessary extra percpu memory allocation.
> Patch 2 adds objcg to bpf_mem_alloc at init stage so objcg can be
>   associated with root cgroup and objcg can be passed to later
>   bpf_mem_alloc_percpu_unit_init().
> Patch 3 addresses memory consumption issue by avoiding to prefill
>   with all unit sizes, i.e. only prefilling with user specified size.
> Patch 4 further reduces memory consumption by limiting the
>   number of prefill entries for percpu memory allocation.
> Patch 5 has much smaller low/high watermarks for percpu allocation
>   to reduce memory consumption.
> Patch 6 rejects percpu memory allocation with bpf_global_percpu_ma
>   when allocation size is greater than 512 bytes.
> Patch 7 fixed test_bpf_ma test due to Patch 5.
> Patch 8 added one test to show the verification failure log message.

FYI. After applying the patch set, the memory consumption in bpf memory
benchmark [1] on 8-CPU VM decreases a lot:

Before the patch set:

$ for i in 1 4 8; do ./bench -w3 -d10 bpf_ma -p${i} -a --percpu; done | grep Summary
Summary: per-prod alloc 14.16 ± 0.59M/s free 36.18 ± 0.39M/s, total memory usage 183.71 ± 10.38MiB
Summary: per-prod alloc 12.35 ± 1.10M/s free 35.79 ± 0.51M/s, total memory usage 744.52 ± 11.64MiB
Summary: per-prod alloc 11.15 ± 0.20M/s free 35.72 ± 0.27M/s, total memory usage 2545.98 ± 537.57MiB

After the patch set:

$ for i in 1 4 8; do ./bench -w3 -d10 bpf_ma -p${i} -a --percpu; done | grep Summary
Summary: per-prod alloc 0.86 ± 0.00M/s free 37.29 ± 0.11M/s, total memory usage 0.00 ± 0.00MiB
Summary: per-prod alloc 0.85 ± 0.00M/s free 36.70 ± 0.24M/s, total memory usage 0.00 ± 0.00MiB
Summary: per-prod alloc 0.84 ± 0.00M/s free 37.21 ± 0.17M/s, total memory usage 0.00 ± 0.00MiB

However the allocation performance also degrades a lot. It seems it is
due to patch 5 (bpf: Use smaller low/high marks for percpu allocation),
because c->batch is 1 now, so each allocation needs one run of irq_work.

[1]: https://lore.kernel.org/bpf/20231221141501.3588586-1-houtao@huaweicloud.com/

> Changelogs:
>   v5 -> v6:
>     . Change bpf_mem_alloc_percpu_init() to add objcg as one of parameters.
>       For bpf_global_percpu_ma, the objcg is NULL, corresponding root memcg.
>   v4 -> v5:
>     . Do not do bpf_global_percpu_ma initialization at init stage, instead
>       doing initialization when the verifier knows it is going to be used
>       by bpf prog.
>     . Using much smaller low/high watermarks for percpu allocation.
>   v3 -> v4:
>     . Add objcg to bpf_mem_alloc during init stage.
>     . Initialize objcg at init stage but use it in bpf_mem_alloc_percpu_unit_init().
>     . Remove check_obj_size() in bpf_mem_alloc_percpu_unit_init().
>   v2 -> v3:
>     . Clear the bpf_mem_cache if prefill fails.
>     . Change test_bpf_ma percpu allocation tests to use bucket_size
>       as allocation size instead of bucket_size - 8.
>     . Remove __GFP_ZERO flag from __alloc_percpu_gfp() call.
>   v1 -> v2:
>     . Avoid unnecessary extra percpu memory allocation.
>     . Add a separate function to do bpf_global_percpu_ma initialization
>     . promote.
>     . Promote function static 'sizes' array to file static.
>     . Add comments to explain to refill only one item for percpu alloc.
>
> Yonghong Song (8):
>   bpf: Avoid unnecessary extra percpu memory allocation
>   bpf: Add objcg to bpf_mem_alloc
>   bpf: Allow per unit prefill for non-fix-size percpu memory allocator
>   bpf: Refill only one percpu element in memalloc
>   bpf: Use smaller low/high marks for percpu allocation
>   bpf: Limit up to 512 bytes for bpf_global_percpu_ma allocation
>   selftests/bpf: Cope with 512 bytes limit with bpf_global_percpu_ma
>   selftests/bpf: Add a selftest with > 512-byte percpu allocation size
>
>  include/linux/bpf_mem_alloc.h                      |  8 ++
>  kernel/bpf/memalloc.c                              | 93 ++++++++++++++++---
>  kernel/bpf/verifier.c                              | 45 ++++++---
>  .../selftests/bpf/prog_tests/test_bpf_ma.c         | 20 ++--
>  .../selftests/bpf/progs/percpu_alloc_fail.c        | 18 ++++
>  .../testing/selftests/bpf/progs/test_bpf_ma.c      | 66 ++++++-------
>  6 files changed, 184 insertions(+), 66 deletions(-)
>
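
Editorial note: the ~175MB figure quoted from the cover letter can be roughly reproduced with a back-of-envelope model. The sketch below is not taken from the thread. It assumes the allocator's size classes are 16 through 4096 bytes as listed in `SIZE_CLASSES`, that prefill places 4 objects in each cache for units of 256 bytes or less and 1 object otherwise, and that every percpu object occupies its unit size on every CPU. All three are assumptions recalled from `kernel/bpf/memalloc.c` of that era, and the function names are hypothetical. Per-object headers and rounding are ignored.

```python
# Back-of-envelope estimate only. The size classes and prefill counts below
# are assumptions, not values stated in this thread.
SIZE_CLASSES = [16, 32, 64, 96, 128, 192, 256, 512, 1024, 2048, 4096]


def prefill_count(unit_size):
    # Assumed: small units are prefilled with 4 objects, larger ones with 1.
    return 4 if unit_size <= 256 else 1


def estimate_prefill_bytes(num_cpus):
    # Bytes that one per-CPU cache pins on a single CPU, over all size classes.
    per_cache_per_cpu = sum(prefill_count(s) * s for s in SIZE_CLASSES)
    # Every CPU owns its own cache, and every percpu object spans all CPUs,
    # so the total grows with the square of the CPU count.
    return per_cache_per_cpu * num_cpus * num_cpus


if __name__ == "__main__":
    for cpus in (8, 128, 256):
        total = estimate_prefill_bytes(cpus)
        print(f"{cpus:4d} CPUs: {total / 1e6:8.1f} MB")
```

Under these assumptions the model gives about 177 MB for 128 CPUs, in the same range as the quoted ~175MB. The quadratic growth in CPU count is why the cover letter expects larger systems to fare worse.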
On 12/22/23 1:35 AM, Hou Tao wrote:
> Hi,
>
> On 12/22/2023 11:17 AM, Yonghong Song wrote:
>> Currently when a bpf program intends to allocate memory for percpu kptr,
>> the verifier will call bpf_mem_alloc_init() to prefill all supported
>> unit sizes and this caused memory consumption very big for large number
>> of cpus. For example, for 128-cpu system, the total memory consumption
>> with initial prefill is ~175MB. Things will become worse for systems
>> with even more cpus.
>>
>> Patch 1 avoids unnecessary extra percpu memory allocation.
>> Patch 2 adds objcg to bpf_mem_alloc at init stage so objcg can be
>>   associated with root cgroup and objcg can be passed to later
>>   bpf_mem_alloc_percpu_unit_init().
>> Patch 3 addresses memory consumption issue by avoiding to prefill
>>   with all unit sizes, i.e. only prefilling with user specified size.
>> Patch 4 further reduces memory consumption by limiting the
>>   number of prefill entries for percpu memory allocation.
>> Patch 5 has much smaller low/high watermarks for percpu allocation
>>   to reduce memory consumption.
>> Patch 6 rejects percpu memory allocation with bpf_global_percpu_ma
>>   when allocation size is greater than 512 bytes.
>> Patch 7 fixed test_bpf_ma test due to Patch 5.
>> Patch 8 added one test to show the verification failure log message.
> FYI. After applying the patch set, the memory consumption in bpf memory
> benchmark [1] on 8-CPU VM decreases a lot:
>
> Before the patch set:
>
> $ for i in 1 4 8; do ./bench -w3 -d10 bpf_ma -p${i} -a --percpu; done |
> grep Summary
> Summary: per-prod alloc 14.16 ± 0.59M/s free 36.18 ± 0.39M/s, total
> memory usage 183.71 ± 10.38MiB
> Summary: per-prod alloc 12.35 ± 1.10M/s free 35.79 ± 0.51M/s, total
> memory usage 744.52 ± 11.64MiB
> Summary: per-prod alloc 11.15 ± 0.20M/s free 35.72 ± 0.27M/s, total
> memory usage 2545.98 ± 537.57MiB
>
> After the patch set:
>
> $ for i in 1 4 8; do ./bench -w3 -d10 bpf_ma -p${i} -a --percpu; done |
> grep Summary
> Summary: per-prod alloc 0.86 ± 0.00M/s free 37.29 ± 0.11M/s, total
> memory usage 0.00 ± 0.00MiB
> Summary: per-prod alloc 0.85 ± 0.00M/s free 36.70 ± 0.24M/s, total
> memory usage 0.00 ± 0.00MiB
> Summary: per-prod alloc 0.84 ± 0.00M/s free 37.21 ± 0.17M/s, total
> memory usage 0.00 ± 0.00MiB
>
> However the allocation performance also degrades a lot. It seems it is
> due to patch 5 (bpf: Use smaller low/high marks for percpu allocation),
> because c->batch is 1 now, so each allocation needs one run of irq_work.

Thanks for benchmarking! With the low watermark set to 1 and c->batch
set to 1 as well, there will be more overhead due to irq_work.
In practice, I expect we should not see a lot of such percpu map
element updates or percpu kptr allocations.

>
> [1]:
> https://lore.kernel.org/bpf/20231221141501.3588586-1-houtao@huaweicloud.com/
>> Changelogs:
>>   v5 -> v6:
>>     . Change bpf_mem_alloc_percpu_init() to add objcg as one of parameters.
>>       For bpf_global_percpu_ma, the objcg is NULL, corresponding root memcg.
>>   v4 -> v5:
>>     . Do not do bpf_global_percpu_ma initialization at init stage, instead
>>       doing initialization when the verifier knows it is going to be used
>>       by bpf prog.
>>     . Using much smaller low/high watermarks for percpu allocation.
>>   v3 -> v4:
>>     . Add objcg to bpf_mem_alloc during init stage.
>>     . Initialize objcg at init stage but use it in bpf_mem_alloc_percpu_unit_init().
>>     . Remove check_obj_size() in bpf_mem_alloc_percpu_unit_init().
>>   v2 -> v3:
>>     . Clear the bpf_mem_cache if prefill fails.
>>     . Change test_bpf_ma percpu allocation tests to use bucket_size
>>       as allocation size instead of bucket_size - 8.
>>     . Remove __GFP_ZERO flag from __alloc_percpu_gfp() call.
>>   v1 -> v2:
>>     . Avoid unnecessary extra percpu memory allocation.
>>     . Add a separate function to do bpf_global_percpu_ma initialization
>>     . promote.
>>     . Promote function static 'sizes' array to file static.
>>     . Add comments to explain to refill only one item for percpu alloc.
>>
>> Yonghong Song (8):
>>   bpf: Avoid unnecessary extra percpu memory allocation
>>   bpf: Add objcg to bpf_mem_alloc
>>   bpf: Allow per unit prefill for non-fix-size percpu memory allocator
>>   bpf: Refill only one percpu element in memalloc
>>   bpf: Use smaller low/high marks for percpu allocation
>>   bpf: Limit up to 512 bytes for bpf_global_percpu_ma allocation
>>   selftests/bpf: Cope with 512 bytes limit with bpf_global_percpu_ma
>>   selftests/bpf: Add a selftest with > 512-byte percpu allocation size
>>
>>  include/linux/bpf_mem_alloc.h                      |  8 ++
>>  kernel/bpf/memalloc.c                              | 93 ++++++++++++++++---
>>  kernel/bpf/verifier.c                              | 45 ++++++---
>>  .../selftests/bpf/prog_tests/test_bpf_ma.c         | 20 ++--
>>  .../selftests/bpf/progs/percpu_alloc_fail.c        | 18 ++++
>>  .../testing/selftests/bpf/progs/test_bpf_ma.c      | 66 ++++++-------
>>  6 files changed, 184 insertions(+), 66 deletions(-)
>>
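
Editorial note: the trade-off discussed in this exchange (a batch of 1 means one irq_work run per allocation) can be illustrated with a toy model. The sketch below is not kernel code. It assumes an allocation-only workload, a refill that completes immediately after the allocation that triggered it, and a refill rule of "add `batch` objects whenever the free count drops below the low watermark". The function name is hypothetical, and the larger values (low watermark 32, batch 48) are illustrative assumptions rather than numbers from the thread. Only the low watermark of 1 and batch of 1 come from the messages above.

```python
def count_refill_runs(n_allocs, low_watermark, batch):
    """Count simulated refill (irq_work-like) runs for n_allocs allocations.

    Toy model: the cache starts with low_watermark free objects, each
    allocation takes one, and a refill adding `batch` objects runs as soon
    as the free count falls below low_watermark.
    """
    free_cnt = low_watermark
    runs = 0
    for _ in range(n_allocs):
        free_cnt -= 1                  # hand one object to the caller
        if free_cnt < low_watermark:   # below the low mark: request a refill
            free_cnt += batch          # the refill adds one batch of objects
            runs += 1
    return runs


if __name__ == "__main__":
    n = 4800
    # Illustrative larger settings (assumed, not from the thread).
    print("low=32, batch=48:", count_refill_runs(n, 32, 48), "refill runs")
    # Settings described in the thread for percpu allocation after patch 5.
    print("low=1,  batch=1: ", count_refill_runs(n, 1, 1), "refill runs")
```

In this model the refill count is roughly the number of allocations divided by the batch size, so a batch of 1 pays the refill cost on every allocation. That matches the direction of the benchmark result, though the model says nothing about the absolute throughput numbers.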
Hello:

This series was applied to bpf/bpf-next.git (master)
by Alexei Starovoitov <ast@kernel.org>:

On Thu, 21 Dec 2023 19:17:29 -0800 you wrote:
> Currently when a bpf program intends to allocate memory for percpu kptr,
> the verifier will call bpf_mem_alloc_init() to prefill all supported
> unit sizes and this caused memory consumption very big for large number
> of cpus. For example, for 128-cpu system, the total memory consumption
> with initial prefill is ~175MB. Things will become worse for systems
> with even more cpus.
>
> [...]

Here is the summary with links:
  - [bpf-next,v6,1/8] bpf: Avoid unnecessary extra percpu memory allocation
    https://git.kernel.org/bpf/bpf-next/c/9beda16c257d
  - [bpf-next,v6,2/8] bpf: Add objcg to bpf_mem_alloc
    https://git.kernel.org/bpf/bpf-next/c/9fc8e802048a
  - [bpf-next,v6,3/8] bpf: Allow per unit prefill for non-fix-size percpu memory allocator
    https://git.kernel.org/bpf/bpf-next/c/c39aa3b289e9
  - [bpf-next,v6,4/8] bpf: Refill only one percpu element in memalloc
    https://git.kernel.org/bpf/bpf-next/c/5b95e638f134
  - [bpf-next,v6,5/8] bpf: Use smaller low/high marks for percpu allocation
    https://git.kernel.org/bpf/bpf-next/c/0e2ba9f96f9b
  - [bpf-next,v6,6/8] bpf: Limit up to 512 bytes for bpf_global_percpu_ma allocation
    https://git.kernel.org/bpf/bpf-next/c/5c1a37653260
  - [bpf-next,v6,7/8] selftests/bpf: Cope with 512 bytes limit with bpf_global_percpu_ma
    https://git.kernel.org/bpf/bpf-next/c/21f5a801c171
  - [bpf-next,v6,8/8] selftests/bpf: Add a selftest with > 512-byte percpu allocation size
    https://git.kernel.org/bpf/bpf-next/c/adc8c4549d9e

You are awesome, thank you!