[RFC,bpf-next,v3,6/6] selftests/bpf: Add benchmark for bpf memory allocator

From: Hou Tao <houtao1@huawei.com>

The benchmark can be used to compare the performance of hash map
operations and the memory usage between different flavors of bpf memory
allocator (e.g., no bpf ma vs bpf ma vs reuse-after-gp bpf ma). It can
also be used to check the performance improvement or the memory saving
provided by an optimization.

The benchmark creates a non-preallocated hash map which uses the bpf
memory allocator and shows the operation performance and the memory
usage of the hash map under different use cases:
(1) no_op
Only create the hash map; no operations are performed on it. It is
used as the baseline. When each CPU completes the iteration of
nr_entries / nr_threads elements in the hash map, the loop count is
increased.
(2) overwrite
Each CPU overwrites a nonoverlapping part of the hash map. When each CPU
completes overwriting nr_entries / nr_threads elements in the hash map,
the loop count is increased.
(3) batch_add_batch_del
Each CPU adds then deletes a nonoverlapping part of the hash map in
batch. When each CPU has added and deleted nr_entries / nr_threads
elements in the hash map, the loop count is increased.
(4) add_del_on_diff_cpu
Each pair of CPUs adds and deletes a nonoverlapping part of the map
cooperatively. When each CPU has added and deleted
nr_entries / nr_threads * 2 elements in the hash map, the loop count is
increased twice.
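
The nr_entries / nr_threads split used by the use cases above can be
illustrated with a little shell arithmetic. This is only a sketch:
key_range is a hypothetical helper, the contiguous per-producer key
blocks are an assumption made for illustration, and 16384 and 8 are
example values rather than the nr_entries the benchmark derives from its
options.

```shell
#!/bin/sh
# Hypothetical helper: print the first and last key owned by one producer,
# assuming keys 0..nr_entries-1 are split into contiguous, equal blocks.
key_range() {
	kr_entries=$1
	kr_threads=$2
	kr_tid=$3
	kr_per_thread=$((kr_entries / kr_threads))
	kr_first=$((kr_tid * kr_per_thread))
	kr_last=$((kr_first + kr_per_thread - 1))
	echo "$kr_first $kr_last"
}

# Example values only: 16384 entries shared by 8 producers.
i=0
while [ "$i" -lt 8 ]; do
	echo "producer $i: keys $(key_range 16384 8 "$i")"
	i=$((i + 1))
done
```

Because the ranges do not overlap, no two producers touch the same key,
which matches the "nonoverlapping part" wording in use cases (2) to (4).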

The following are the benchmark results comparing different flavors of
bpf memory allocator. These tests were conducted on a KVM guest with
8 CPUs and 16 GB memory. The command line below was used to run all of
the following benchmarks:

  ./bench htab-mem --use-case $name --max-entries 16384 ${OPTS} \
          --full 50 -d 9 --producers=8 --prod-affinity=0-7
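
As a sketch, the command line above can be wrapped in a loop over the
four use cases. BENCH and DRY_RUN are hypothetical knobs added for this
illustration and are not options of the benchmark; DRY_RUN defaults to 1
so that the commands are only printed.

```shell
#!/bin/sh
# BENCH and DRY_RUN are hypothetical knobs for this sketch only.
BENCH=${BENCH:-./bench}
DRY_RUN=${DRY_RUN:-1}

run_case() {
	uc=$1
	shift
	# The remaining arguments carry the optional ${OPTS}, e.g. --preallocated.
	set -- "$BENCH" htab-mem --use-case "$uc" --max-entries 16384 "$@" \
		--full 50 -d 9 --producers=8 --prod-affinity=0-7
	if [ "$DRY_RUN" = 1 ]; then
		echo "$@"	# only print the command
	else
		"$@"		# actually run the benchmark
	fi
}

for name in no_op overwrite batch_add_batch_del add_del_on_diff_cpu; do
	# OPTS is left unquoted on purpose so that it splits into words.
	run_case "$name" $OPTS
done
```

Running it with DRY_RUN=0 from the selftests/bpf directory would execute
the four benchmarks back to back with the same options.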

These results show:
* the preallocated case has both better performance and better memory
  efficiency.
* the normal bpf memory allocator doesn't handle add_del_on_diff_cpu
  very well. The larger memory usage is due to the slow RCU tasks trace
  grace period.
* free-after-rcu-gp has lower memory usage than reuse-after-rcu-gp, but
  its performance is worse than reuse-after-rcu-gp. Both
  free-after-rcu-gp and reuse-after-rcu-gp have larger memory usage than
  the normal bpf memory allocator due to the delay of reuse.
* for extra call_rcu + bpf memory allocator, the memory usage is much
  larger than that of free-after-rcu-gp and reuse-after-rcu-gp.
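
To put a number on "much larger", the peak memory of the overwrite case
can be compared across flavors. The sketch below only divides figures
copied from the result tables in this message; ratio is a hypothetical
helper.

```shell
#!/bin/sh
# Hypothetical helper: print a/b rounded to one decimal place.
ratio() {
	awk -v a="$1" -v b="$2" 'BEGIN { printf "%.1f\n", a / b }'
}

# Peak memory (MiB) of the overwrite use case, copied from the tables.
extra_call_rcu=676.56
reuse_after_gp=31.76
free_after_gp=21.85

echo "extra call_rcu vs reuse-after-rcu-gp: $(ratio "$extra_call_rcu" "$reuse_after_gp")x"
echo "extra call_rcu vs free-after-rcu-gp:  $(ratio "$extra_call_rcu" "$free_after_gp")x"
```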

(1) non-preallocated + no bpf memory allocator (v6.0.19)
use kmalloc() + call_rcu

| name                | loop (k/s) | average memory (MiB) | peak memory (MiB) |
| --                  | --         | --                   | --                |
| no_op               | 1214.42    | 0.92                 | 0.92              |
| overwrite           | 3.21       | 40.47                | 67.98             |
| batch_add_batch_del | 2.32       | 24.31                | 49.33             |
| add_del_on_diff_cpu | 2.92       | 4.03                 | 6.00              |

(2) preallocated
OPTS=--preallocated

| name                | loop (k/s) | average memory (MiB) | peak memory (MiB) |
| --                  | --         | --                   | --                |
| no_op               | 1156.59    | 1.88                 | 1.88              |
| overwrite           | 36.19      | 1.88                 | 1.88              |
| batch_add_batch_del | 22.27      | 1.88                 | 1.88              |
| add_del_on_diff_cpu | 4.68       | 1.95                 | 2.05              |

(3) normal bpf memory allocator
echo 0 > /sys/module/hashtab/parameters/reuse_flag
echo 0 > /sys/module/hashtab/parameters/delayed_free

| name                | loop (k/s) | average memory (MiB) | peak memory (MiB) |
| --                  | --         | --                   | --                |
| no_op               | 1273.55    | 0.98                 | 0.98              |
| overwrite           | 26.57      | 2.59                 | 2.74              |
| batch_add_batch_del | 11.13      | 2.59                 | 2.99              |
| add_del_on_diff_cpu | 3.72       | 15.15                | 26.04             |

(4) reuse-after-rcu-gp bpf memory allocator
echo 2 > /sys/module/hashtab/parameters/reuse_flag
echo 0 > /sys/module/hashtab/parameters/delayed_free

| name                | loop (k/s) | average memory (MiB) | peak memory (MiB) |
| --                  | --         | --                   | --                |
| no_op               | 1199.16    | 0.97                 | 0.99              |
| overwrite           | 16.37      | 24.01                | 31.76             |
| batch_add_batch_del | 9.61       | 16.71                | 19.95             |
| add_del_on_diff_cpu | 3.62       | 22.93                | 37.02             |

(5) free-after-rcu-gp bpf memory allocator
echo 4 > /sys/module/hashtab/parameters/reuse_flag
echo 0 > /sys/module/hashtab/parameters/delayed_free

| name                | loop (k/s) | average memory (MiB) | peak memory (MiB) |
| --                  | --         | --                   | --                |
| no_op               | 1274.59    | 0.99                 | 0.99              |
| overwrite           | 11.02      | 13.48                | 21.85             |
| batch_add_batch_del | 7.43       | 10.58                | 16.14             |
| add_del_on_diff_cpu | 3.15       | 6.36                 | 9.65              |

(6) extra call_rcu + bpf memory allocator
echo 0 > /sys/module/hashtab/parameters/reuse_flag
echo 1 > /sys/module/hashtab/parameters/delayed_free

| name                | loop (k/s) | average memory (MiB) | peak memory (MiB) |
| --                  | --         | --                   | --                |
| no_op               | 1276.85    | 0.99                 | 0.99              |
| overwrite           | 12.57      | 372.01               | 676.56            |
| batch_add_batch_del | 9.31       | 276.14               | 431.04            |
| add_del_on_diff_cpu | 3.29       | 18.73                | 35.13             |
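
The module parameter settings listed for flavors (3) to (6) can be
collected in one helper. This is a sketch only: set_flavor and the
flavor names it accepts are hypothetical, PARAM_DIR is overridable just
so the helper can be tried without the patched kernel, and the two
parameters are the ones written by the echo commands above.

```shell
#!/bin/sh
# PARAM_DIR is overridable only for trying the sketch outside the test kernel.
PARAM_DIR=${PARAM_DIR:-/sys/module/hashtab/parameters}

# Hypothetical helper: map a flavor name to the two parameter values
# used in sections (3) to (6) and write them.
set_flavor() {
	case "$1" in
	normal)             reuse_flag=0 delayed_free=0 ;;
	reuse-after-rcu-gp) reuse_flag=2 delayed_free=0 ;;
	free-after-rcu-gp)  reuse_flag=4 delayed_free=0 ;;
	extra-call-rcu)     reuse_flag=0 delayed_free=1 ;;
	*) echo "unknown flavor: $1" >&2; return 1 ;;
	esac
	echo "$reuse_flag" > "$PARAM_DIR/reuse_flag" &&
	echo "$delayed_free" > "$PARAM_DIR/delayed_free"
}
```

For example, "set_flavor free-after-rcu-gp" would select the settings of
section (5) before running the benchmark command line.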

Signed-off-by: Hou Tao <houtao1@huawei.com>
---
 tools/testing/selftests/bpf/Makefile          |   3 +
 tools/testing/selftests/bpf/bench.c           |   4 +
 .../selftests/bpf/benchs/bench_htab_mem.c     | 352 ++++++++++++++++++
 .../bpf/benchs/run_bench_htab_mem.sh          |  64 ++++
 .../selftests/bpf/progs/htab_mem_bench.c      | 135 +++++++
 5 files changed, 558 insertions(+)
 create mode 100644 tools/testing/selftests/bpf/benchs/bench_htab_mem.c
 create mode 100755 tools/testing/selftests/bpf/benchs/run_bench_htab_mem.sh
 create mode 100644 tools/testing/selftests/bpf/progs/htab_mem_bench.c

Message ID	20230429101215.111262-7-houtao@huaweicloud.com (mailing list archive)
State	RFC
Delegated to:	BPF
Headers
  Return-Path: <bpf-owner@vger.kernel.org>
  From: Hou Tao <houtao@huaweicloud.com>
  To: bpf@vger.kernel.org, Martin KaFai Lau <martin.lau@linux.dev>, Alexei Starovoitov <alexei.starovoitov@gmail.com>
  Cc: Andrii Nakryiko <andrii@kernel.org>, Song Liu <song@kernel.org>, Hao Luo <haoluo@google.com>, Yonghong Song <yhs@fb.com>, Daniel Borkmann <daniel@iogearbox.net>, KP Singh <kpsingh@kernel.org>, Stanislav Fomichev <sdf@google.com>, Jiri Olsa <jolsa@kernel.org>, John Fastabend <john.fastabend@gmail.com>, "Paul E . McKenney" <paulmck@kernel.org>, rcu@vger.kernel.org, houtao1@huawei.com
  Subject: [RFC bpf-next v3 6/6] selftests/bpf: Add benchmark for bpf memory allocator
  Date: Sat, 29 Apr 2023 18:12:15 +0800
  Message-Id: <20230429101215.111262-7-houtao@huaweicloud.com>
  In-Reply-To: <20230429101215.111262-1-houtao@huaweicloud.com>
  References: <20230429101215.111262-1-houtao@huaweicloud.com>
  MIME-Version: 1.0
  Content-Type: text/plain; charset=UTF-8
  Content-Transfer-Encoding: 8bit
  Precedence: bulk
Series	Handle immediate reuse in bpf memory allocator
  [RFC,bpf-next,v3,0/6] Handle immediate reuse in bpf memory allocator
  [RFC,bpf-next,v3,1/6] bpf: Factor out a common helper free_all()
  [RFC,bpf-next,v3,2/6] bpf: Pass bitwise flags to bpf_mem_alloc_init()
  [RFC,bpf-next,v3,3/6] bpf: Introduce BPF_MA_REUSE_AFTER_RCU_GP
  [RFC,bpf-next,v3,4/6] bpf: Introduce BPF_MA_FREE_AFTER_RCU_GP
  [RFC,bpf-next,v3,5/6] bpf: Add two module parameters in htab for memory benchmark
  [RFC,bpf-next,v3,6/6] selftests/bpf: Add benchmark for bpf memory allocator

Context	Check	Description
netdev/series_format	success	Posting correctly formatted
netdev/tree_selection	success	Clearly marked for bpf-next, async
netdev/fixes_present	success	Fixes tag not required for -next series
netdev/header_inline	success	No static functions without inline keyword in header files
netdev/build_32bit	success	Errors and warnings before: 8 this patch: 8
netdev/cc_maintainers	warning	3 maintainers not CCed: mykolal@fb.com shuah@kernel.org linux-kselftest@vger.kernel.org
netdev/build_clang	success	Errors and warnings before: 8 this patch: 8
netdev/verify_signedoff	success	Signed-off-by tag matches author and committer
netdev/deprecated_api	success	None detected
netdev/check_selftest	success	net selftest script(s) already in Makefile
netdev/verify_fixes	success	No Fixes tag
netdev/build_allmodconfig_warn	success	Errors and warnings before: 8 this patch: 8
netdev/checkpatch	warning	WARNING: added, moved or deleted file(s), does MAINTAINERS need updating? WARNING: externs should be avoided in .c files WARNING: line length of 100 exceeds 80 columns WARNING: line length of 81 exceeds 80 columns WARNING: line length of 83 exceeds 80 columns WARNING: line length of 84 exceeds 80 columns WARNING: line length of 86 exceeds 80 columns WARNING: line length of 88 exceeds 80 columns WARNING: line length of 90 exceeds 80 columns WARNING: line length of 94 exceeds 80 columns WARNING: line length of 95 exceeds 80 columns WARNING: quoted string split across lines
netdev/kdoc	success	Errors and warnings before: 0 this patch: 0
netdev/source_inline	success	Was 0 now: 0