diff mbox series

[bpf-next] bpf/test_run: increase Page Pool's ptr_ring size in live frames mode

Message ID 20240214153838.4159970-1-aleksander.lobakin@intel.com (mailing list archive)
State Changes Requested
Delegated to: BPF
Headers show
Series [bpf-next] bpf/test_run: increase Page Pool's ptr_ring size in live frames mode | expand

Checks

Context Check Description
netdev/series_format success Single patches do not need cover letters
netdev/tree_selection success Clearly marked for bpf-next
netdev/ynl success Generated files up to date; no warnings/errors; no diff in generated;
netdev/fixes_present success Fixes tag not required for -next series
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 1067 this patch: 1067
netdev/build_tools success No tools touched, skip
netdev/cc_maintainers warning 11 maintainers not CCed: pabeni@redhat.com jolsa@kernel.org john.fastabend@gmail.com edumazet@google.com yonghong.song@linux.dev song@kernel.org sdf@google.com hawk@kernel.org eddyz87@gmail.com kpsingh@kernel.org haoluo@google.com
netdev/build_clang success Errors and warnings before: 1066 this patch: 1066
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/deprecated_api success None detected
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes success No Fixes tag
netdev/build_allmodconfig_warn success Errors and warnings before: 1084 this patch: 1084
netdev/checkpatch success total: 0 errors, 0 warnings, 0 checks, 7 lines checked
netdev/build_clang_rust success No Rust files in patch. Skipping build
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/source_inline success Was 0 now: 0
bpf/vmtest-bpf-next-PR fail PR summary
bpf/vmtest-bpf-next-VM_Test-13 success Logs for set-matrix
bpf/vmtest-bpf-next-VM_Test-16 success Logs for s390x-gcc / test (test_verifier, false, 360) / test_verifier on s390x with gcc
bpf/vmtest-bpf-next-VM_Test-36 success Logs for x86_64-llvm-18 / build-release / build for x86_64 with llvm-18 and -O2 optimization
bpf/vmtest-bpf-next-VM_Test-25 success Logs for x86_64-gcc / test (test_progs_parallel, true, 30) / test_progs_parallel on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-37 success Logs for x86_64-llvm-18 / test (test_maps, false, 360) / test_maps on x86_64 with llvm-18
bpf/vmtest-bpf-next-VM_Test-24 success Logs for x86_64-gcc / test (test_progs_no_alu32_parallel, true, 30) / test_progs_no_alu32_parallel on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-23 success Logs for x86_64-gcc / test (test_progs_no_alu32, false, 360) / test_progs_no_alu32 on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-22 success Logs for x86_64-gcc / test (test_progs, false, 360) / test_progs on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-29 success Logs for x86_64-llvm-17 / build-release / build for x86_64 with llvm-17 and -O2 optimization
bpf/vmtest-bpf-next-VM_Test-31 success Logs for x86_64-llvm-17 / test (test_progs, false, 360) / test_progs on x86_64 with llvm-17
bpf/vmtest-bpf-next-VM_Test-32 success Logs for x86_64-llvm-17 / test (test_progs_no_alu32, false, 360) / test_progs_no_alu32 on x86_64 with llvm-17
bpf/vmtest-bpf-next-VM_Test-38 success Logs for x86_64-llvm-18 / test (test_progs, false, 360) / test_progs on x86_64 with llvm-18
bpf/vmtest-bpf-next-VM_Test-39 success Logs for x86_64-llvm-18 / test (test_progs_cpuv4, false, 360) / test_progs_cpuv4 on x86_64 with llvm-18
bpf/vmtest-bpf-next-VM_Test-40 success Logs for x86_64-llvm-18 / test (test_progs_no_alu32, false, 360) / test_progs_no_alu32 on x86_64 with llvm-18
bpf/vmtest-bpf-next-VM_Test-15 success Logs for s390x-gcc / test (test_progs_no_alu32, false, 360) / test_progs_no_alu32 on s390x with gcc
bpf/vmtest-bpf-next-VM_Test-14 success Logs for s390x-gcc / test (test_progs, false, 360) / test_progs on s390x with gcc
bpf/vmtest-bpf-next-VM_Test-0 success Logs for Lint
bpf/vmtest-bpf-next-VM_Test-2 success Logs for Unittests
bpf/vmtest-bpf-next-VM_Test-1 success Logs for ShellCheck
bpf/vmtest-bpf-next-VM_Test-3 success Logs for Validate matrix.py
bpf/vmtest-bpf-next-VM_Test-7 success Logs for s390x-gcc / build-release
bpf/vmtest-bpf-next-VM_Test-5 success Logs for aarch64-gcc / build-release
bpf/vmtest-bpf-next-VM_Test-8 success Logs for set-matrix
bpf/vmtest-bpf-next-VM_Test-4 success Logs for aarch64-gcc / build / build for aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-10 success Logs for aarch64-gcc / veristat
bpf/vmtest-bpf-next-VM_Test-9 success Logs for aarch64-gcc / test (test_verifier, false, 360) / test_verifier on aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-12 success Logs for s390x-gcc / build-release
bpf/vmtest-bpf-next-VM_Test-11 success Logs for s390x-gcc / build / build for s390x with gcc
bpf/vmtest-bpf-next-VM_Test-17 success Logs for s390x-gcc / veristat
bpf/vmtest-bpf-next-VM_Test-20 success Logs for x86_64-gcc / build-release
bpf/vmtest-bpf-next-VM_Test-18 success Logs for set-matrix
bpf/vmtest-bpf-next-VM_Test-19 success Logs for x86_64-gcc / build / build for x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-28 success Logs for x86_64-llvm-17 / build / build for x86_64 with llvm-17
bpf/vmtest-bpf-next-VM_Test-34 success Logs for x86_64-llvm-17 / veristat
bpf/vmtest-bpf-next-VM_Test-35 success Logs for x86_64-llvm-18 / build / build for x86_64 with llvm-18
bpf/vmtest-bpf-next-VM_Test-42 success Logs for x86_64-llvm-18 / veristat
bpf/vmtest-bpf-next-VM_Test-6 success Logs for aarch64-gcc / test (test_maps, false, 360) / test_maps on aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-21 success Logs for x86_64-gcc / test (test_maps, false, 360) / test_maps on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-26 success Logs for x86_64-gcc / test (test_verifier, false, 360) / test_verifier on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-27 success Logs for x86_64-gcc / veristat / veristat on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-30 success Logs for x86_64-llvm-17 / test (test_maps, false, 360) / test_maps on x86_64 with llvm-17
bpf/vmtest-bpf-next-VM_Test-33 success Logs for x86_64-llvm-17 / test (test_verifier, false, 360) / test_verifier on x86_64 with llvm-17
bpf/vmtest-bpf-next-VM_Test-41 success Logs for x86_64-llvm-18 / test (test_verifier, false, 360) / test_verifier on x86_64 with llvm-18

Commit Message

Alexander Lobakin Feb. 14, 2024, 3:38 p.m. UTC
Currently, when running xdp-trafficgen, test_run creates page_pools with
the ptr_ring size of %NAPI_POLL_WEIGHT (64).
This might work fine if XDP Tx queues are polled with the budget
limitation. However, we often clear them with no limitation to ensure
maximum free space when sending.
For example, in ice and idpf (upcoming), we use "lazy" cleaning, i.e. we
clean XDP Tx queue only when the free space there is less than 1/4 of
the queue size. Let's take the ring size of 512 just as an example. 3/4
of the ring is 384 and often times, when we're entering the cleaning
function, we have this whole amount ready (or 256 or 192, doesn't
matter).
Then we're calling xdp_return_frame_bulk() and after 64th frame,
page_pool_put_page_bulk() starts returning pages to the page allocator
due to that the ptr_ring is already full. put_page(), alloc_page() et at
starts consuming a ton of CPU time and leading the board of the perf top
output.

Let's not limit ptr_ring to 64 for no real reason and allow more pages
to be recycled. Just don't put anything to page_pool_params::size and
let the Page Pool core pick the default of 1024 entries (I don't believe
there are real use cases to clean more than that amount of descriptors).
After the change, the MM layer disappears from the perf top output and
all pages get recycled to the PP. On my test setup on idpf with the
default ring size (512), this gives +80% of Tx performance with no
visible memory consumption increase.

Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
---
 net/bpf/test_run.c | 1 -
 1 file changed, 1 deletion(-)

Comments

Toke Høiland-Jørgensen Feb. 14, 2024, 4:16 p.m. UTC | #1
Alexander Lobakin <aleksander.lobakin@intel.com> writes:

> Currently, when running xdp-trafficgen, test_run creates page_pools with
> the ptr_ring size of %NAPI_POLL_WEIGHT (64).
> This might work fine if XDP Tx queues are polled with the budget
> limitation. However, we often clear them with no limitation to ensure
> maximum free space when sending.
> For example, in ice and idpf (upcoming), we use "lazy" cleaning, i.e. we
> clean XDP Tx queue only when the free space there is less than 1/4 of
> the queue size. Let's take the ring size of 512 just as an example. 3/4
> of the ring is 384 and often times, when we're entering the cleaning
> function, we have this whole amount ready (or 256 or 192, doesn't
> matter).
> Then we're calling xdp_return_frame_bulk() and after 64th frame,
> page_pool_put_page_bulk() starts returning pages to the page allocator
> due to that the ptr_ring is already full. put_page(), alloc_page() et at
> starts consuming a ton of CPU time and leading the board of the perf top
> output.
>
> Let's not limit ptr_ring to 64 for no real reason and allow more pages
> to be recycled. Just don't put anything to page_pool_params::size and
> let the Page Pool core pick the default of 1024 entries (I don't believe
> there are real use cases to clean more than that amount of descriptors).
> After the change, the MM layer disappears from the perf top output and
> all pages get recycled to the PP. On my test setup on idpf with the
> default ring size (512), this gives +80% of Tx performance with no
> visible memory consumption increase.
>
> Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>

Hmm, so my original idea with keeping this low was to avoid having a lot
of large rings lying around if it is used by multiple processes at once.
But we need to move away from the per-syscall allocation anyway, and
with Lorenzo's patches introducing a global system page pool we have an
avenue for that. So in the meantime, I have no objection to this...

Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Toke Høiland-Jørgensen Feb. 14, 2024, 11:02 p.m. UTC | #2
Toke Høiland-Jørgensen <toke@redhat.com> writes:

> Alexander Lobakin <aleksander.lobakin@intel.com> writes:
>
>> Currently, when running xdp-trafficgen, test_run creates page_pools with
>> the ptr_ring size of %NAPI_POLL_WEIGHT (64).
>> This might work fine if XDP Tx queues are polled with the budget
>> limitation. However, we often clear them with no limitation to ensure
>> maximum free space when sending.
>> For example, in ice and idpf (upcoming), we use "lazy" cleaning, i.e. we
>> clean XDP Tx queue only when the free space there is less than 1/4 of
>> the queue size. Let's take the ring size of 512 just as an example. 3/4
>> of the ring is 384 and often times, when we're entering the cleaning
>> function, we have this whole amount ready (or 256 or 192, doesn't
>> matter).
>> Then we're calling xdp_return_frame_bulk() and after 64th frame,
>> page_pool_put_page_bulk() starts returning pages to the page allocator
>> due to that the ptr_ring is already full. put_page(), alloc_page() et at
>> starts consuming a ton of CPU time and leading the board of the perf top
>> output.
>>
>> Let's not limit ptr_ring to 64 for no real reason and allow more pages
>> to be recycled. Just don't put anything to page_pool_params::size and
>> let the Page Pool core pick the default of 1024 entries (I don't believe
>> there are real use cases to clean more than that amount of descriptors).
>> After the change, the MM layer disappears from the perf top output and
>> all pages get recycled to the PP. On my test setup on idpf with the
>> default ring size (512), this gives +80% of Tx performance with no
>> visible memory consumption increase.
>>
>> Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
>
> Hmm, so my original idea with keeping this low was to avoid having a lot
> of large rings lying around if it is used by multiple processes at once.
> But we need to move away from the per-syscall allocation anyway, and
> with Lorenzo's patches introducing a global system page pool we have an
> avenue for that. So in the meantime, I have no objection to this...
>
> Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>

Actually, since Lorenzo's patches already landed in net-next, let's just
move to using those straight away. I'll send a patch for this tomorrow :)

-Toke
Alexander Lobakin Feb. 15, 2024, 11:57 a.m. UTC | #3
From: Toke Høiland-Jørgensen <toke@redhat.com>
Date: Thu, 15 Feb 2024 00:02:27 +0100

> Toke Høiland-Jørgensen <toke@redhat.com> writes:
> 
>> Alexander Lobakin <aleksander.lobakin@intel.com> writes:
>>
>>> Currently, when running xdp-trafficgen, test_run creates page_pools with
>>> the ptr_ring size of %NAPI_POLL_WEIGHT (64).
>>> This might work fine if XDP Tx queues are polled with the budget
>>> limitation. However, we often clear them with no limitation to ensure
>>> maximum free space when sending.
>>> For example, in ice and idpf (upcoming), we use "lazy" cleaning, i.e. we
>>> clean XDP Tx queue only when the free space there is less than 1/4 of
>>> the queue size. Let's take the ring size of 512 just as an example. 3/4
>>> of the ring is 384 and often times, when we're entering the cleaning
>>> function, we have this whole amount ready (or 256 or 192, doesn't
>>> matter).
>>> Then we're calling xdp_return_frame_bulk() and after 64th frame,
>>> page_pool_put_page_bulk() starts returning pages to the page allocator
>>> due to that the ptr_ring is already full. put_page(), alloc_page() et at
>>> starts consuming a ton of CPU time and leading the board of the perf top
>>> output.
>>>
>>> Let's not limit ptr_ring to 64 for no real reason and allow more pages
>>> to be recycled. Just don't put anything to page_pool_params::size and
>>> let the Page Pool core pick the default of 1024 entries (I don't believe
>>> there are real use cases to clean more than that amount of descriptors).
>>> After the change, the MM layer disappears from the perf top output and
>>> all pages get recycled to the PP. On my test setup on idpf with the
>>> default ring size (512), this gives +80% of Tx performance with no
>>> visible memory consumption increase.
>>>
>>> Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
>>
>> Hmm, so my original idea with keeping this low was to avoid having a lot
>> of large rings lying around if it is used by multiple processes at once.
>> But we need to move away from the per-syscall allocation anyway, and
>> with Lorenzo's patches introducing a global system page pool we have an
>> avenue for that. So in the meantime, I have no objection to this...
>>
>> Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
> 
> Actually, since Lorenzo's patches already landed in net-next, let's just
> move to using those straight away. I'll send a patch for this tomorrow :)

Keep in mind that system page_pools do direct recycling based on cpuid
and for now, memory leaks are possible. Pls see my patch[0] for the
details :D

> 
> -Toke
> 

[0]
https://lore.kernel.org/netdev/20240215113905.96817-1-aleksander.lobakin@intel.com

Olek
diff mbox series

Patch

diff --git a/net/bpf/test_run.c b/net/bpf/test_run.c
index dfd919374017..1ad4f1ddcb88 100644
--- a/net/bpf/test_run.c
+++ b/net/bpf/test_run.c
@@ -163,7 +163,6 @@  static int xdp_test_run_setup(struct xdp_test_data *xdp, struct xdp_buff *orig_c
 	struct page_pool_params pp_params = {
 		.order = 0,
 		.flags = 0,
-		.pool_size = xdp->batch_size,
 		.nid = NUMA_NO_NODE,
 		.init_callback = xdp_test_run_init_page,
 		.init_arg = xdp,