[bpf-next,v5] selftests/bpf: Add benchmark for bpf memory allocator

From: Hou Tao <houtao1@huawei.com>

The benchmark can be used to compare the performance of hash map
operations and the memory usage between different flavors of bpf memory
allocator (e.g., no bpf ma vs bpf ma vs reuse-after-gp bpf ma). It can
also be used to check the performance improvement or the memory saving
provided by an optimization.

The benchmark creates a non-preallocated hash map which uses the bpf
memory allocator and shows the operation performance and the memory
usage of the hash map under different use cases:
(1) no_op
Only creates the hash map; no operations are performed on it. It is
used as the baseline. Each time a CPU completes an iteration over 64
elements in the hash map, it increases the loop count.
(2) overwrite
Each CPU overwrites a non-overlapping part of the hash map. Each time a
CPU completes overwriting 64 elements in the hash map, it increases the
loop count.
(3) batch_add_batch_del
Each CPU adds and then deletes a non-overlapping part of the hash map in
batch. Each time a CPU adds and deletes 64 elements in the hash map, it
increases the loop count.
(4) add_del_on_diff_cpu
Each pair of CPUs cooperatively adds and deletes a non-overlapping part
of the hash map. Each time a pair adds and deletes 64 elements in the
hash map, the pair increases the loop count.

The following are the benchmark results when comparing different
flavors of bpf memory allocator. These tests were conducted on a KVM
guest with 8 CPUs and 16 GB memory. The command line below was used to
run all of the following benchmarks:

  ./bench htab-mem --use-case $name --max-entries 16384 ${OPTS} \
          --full 50 -d 10 --producers=8 --prod-affinity=0-7
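
As a convenience, the command line above can be wrapped in a small loop
that covers all four use cases. This is only a sketch: BENCH, RUN and
build_cmd are hypothetical names introduced here, and only the options
already shown above are used. By default it prints the commands instead
of running them.

```shell
#!/bin/sh
# Sketch only: BENCH, RUN and build_cmd are hypothetical helper names.
# The benchmark options are copied from the command line above.
BENCH=${BENCH:-./bench}
OPTS=${OPTS:-}		# e.g. OPTS=--preallocated

# Print the full command line for one use case ($1).
build_cmd() {
	echo "$BENCH htab-mem --use-case $1 --max-entries 16384 ${OPTS:+$OPTS }--full 50 -d 10 --producers=8 --prod-affinity=0-7"
}

for name in no_op overwrite batch_add_batch_del add_del_on_diff_cpu; do
	cmd=$(build_cmd "$name")
	if [ "${RUN:-0}" = 1 ]; then
		$cmd		# run the benchmark (needs a built ./bench)
	else
		echo "$cmd"	# dry run: only show the command
	fi
done
```

Running it with RUN=1 and OPTS=--preallocated would execute the
preallocated variant of each use case in turn.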

These results show:
* The preallocated case has both better performance and better memory
  efficiency.
* The normal bpf memory allocator doesn't handle add_del_on_diff_cpu
  very well. The large memory usage is due to the slow RCU tasks trace
  grace period.

(1) non-preallocated + no bpf memory allocator (v6.0.19)
Uses kmalloc() + call_rcu()

| name               | loop (k/s)| average memory (MiB)| peak memory (MiB)|
| --                 | --        | --                  | --               |
| no_op              | 681.40    | 0.87                | 1.00             |
| overwrite          | 8.56      | 38.86               | 88.42            |
| batch_add_batch_del| 6.74      | 41.28               | 69.70            |
| add_del_on_diff_cpu| 4.68      | 3.43                | 5.70             |

(2) preallocated
OPTS=--preallocated

| name               | loop (k/s)| average memory (MiB)| peak memory (MiB)|
| --                 | --        | --                  | --               |
| no_op              | 673.95    | 1.98                | 1.98             |
| overwrite          | 114.63    | 1.99                | 1.99             |
| batch_add_batch_del| 78.34     | 2.04                | 2.06             |
| add_del_on_diff_cpu| 6.41      | 2.23                | 2.54             |

(3) normal bpf memory allocator

| name               | loop (k/s)| average memory (MiB)| peak memory (MiB)|
| --                 | --        | --                  | --               |
| no_op              | 656.20    | 0.99                | 0.99             |
| overwrite          | 81.21     | 1.10                | 2.49             |
| batch_add_batch_del| 18.40     | 2.13                | 2.62             |
| add_del_on_diff_cpu| 5.38      | 10.40               | 18.05            |

Signed-off-by: Hou Tao <houtao1@huawei.com>
---
v5:
 * send the benchmark patch alone (suggested by Alexei)
 * limit the max number of touched elements per-bpf-program call to 64 (from Alexei)
 * show per-producer performance (from Alexei)
 * handle the return value of read() (from BPF CI)
 * do cleanup_cgroup_environment() in htab_mem_report_final()

v4: https://lore.kernel.org/bpf/20230606035310.4026145-1-houtao@huaweicloud.com/

 tools/testing/selftests/bpf/Makefile          |   3 +
 tools/testing/selftests/bpf/bench.c           |   4 +
 .../selftests/bpf/benchs/bench_htab_mem.c     | 367 ++++++++++++++++++
 .../bpf/benchs/run_bench_htab_mem.sh          |  42 ++
 .../selftests/bpf/progs/htab_mem_bench.c      | 132 +++++++
 5 files changed, 548 insertions(+)
 create mode 100644 tools/testing/selftests/bpf/benchs/bench_htab_mem.c
 create mode 100755 tools/testing/selftests/bpf/benchs/run_bench_htab_mem.sh
 create mode 100644 tools/testing/selftests/bpf/progs/htab_mem_bench.c

Message ID	20230609024030.2585058-1-houtao@huaweicloud.com (mailing list archive)
State	Superseded
Headers
  Return-Path: <rcu-owner@vger.kernel.org>
  From: Hou Tao <houtao@huaweicloud.com>
  To: bpf@vger.kernel.org, Martin KaFai Lau <martin.lau@linux.dev>,
      Alexei Starovoitov <alexei.starovoitov@gmail.com>
  Cc: Andrii Nakryiko <andrii@kernel.org>, Song Liu <song@kernel.org>,
      Hao Luo <haoluo@google.com>, Yonghong Song <yhs@fb.com>,
      Daniel Borkmann <daniel@iogearbox.net>, KP Singh <kpsingh@kernel.org>,
      Stanislav Fomichev <sdf@google.com>, Jiri Olsa <jolsa@kernel.org>,
      John Fastabend <john.fastabend@gmail.com>,
      "Paul E . McKenney" <paulmck@kernel.org>, rcu@vger.kernel.org,
      houtao1@huawei.com
  Subject: [PATCH bpf-next v5] selftests/bpf: Add benchmark for bpf memory allocator
  Date: Fri, 9 Jun 2023 10:40:30 +0800
  Message-Id: <20230609024030.2585058-1-houtao@huaweicloud.com>
  MIME-Version: 1.0
  Content-Type: text/plain; charset=UTF-8
  Content-Transfer-Encoding: 8bit
  Precedence: bulk