[RFC,bpf-next,v2,4/4] bpf: Introduce BPF_MA_REUSE_AFTER_RCU_GP

From: Hou Tao <houtao1@huawei.com>

Currently, objects freed to the bpf memory allocator may be reused
immediately by a new allocation. This introduces a use-after-bpf-ma-free
problem for non-preallocated hash maps and can make the lookup procedure
return incorrect results. Immediate reuse also makes it more difficult
to introduce new use cases (e.g. qp-trie).

So introduce BPF_MA_REUSE_AFTER_RCU_GP to solve these problems. With
BPF_MA_REUSE_AFTER_RCU_GP, freed objects are reused only after one RCU
grace period and may be returned to the slab system after another
RCU-tasks-trace grace period. bpf programs which care about the reuse
problem can use bpf_rcu_read_{lock,unlock}() to access these freed
objects safely. For programs which don't care, use-after-bpf-ma-free
remains safe because these objects have not yet been freed by the bpf
memory allocator.
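
The reuse rule above can be illustrated with a small user-space model.
This is only a sketch under stated assumptions, not the code in
kernel/bpf/memalloc.c: a grace period is modelled as a plain counter,
and every name (toy_obj, toy_free, toy_gp_elapsed, toy_alloc) is
hypothetical. It shows only that a freed object is not handed out again
until one simulated grace period has elapsed.

```c
#include <assert.h>
#include <stddef.h>

/* Toy user-space model, NOT kernel code. All names are hypothetical. */
enum toy_state { TOY_IN_USE, TOY_WAIT_RCU_GP, TOY_REUSABLE };

struct toy_obj {
	enum toy_state state;
	unsigned long freed_gp;	/* grace-period count seen at free time */
};

static unsigned long toy_gp_seq;	/* simulated completed grace periods */

/* Free: record when the object was freed, but do not allow reuse yet. */
static void toy_free(struct toy_obj *obj)
{
	assert(obj->state == TOY_IN_USE);	/* catch double free in the model */
	obj->state = TOY_WAIT_RCU_GP;
	obj->freed_gp = toy_gp_seq;
}

/* Pretend one RCU grace period has elapsed and promote waiting objects. */
static void toy_gp_elapsed(struct toy_obj *objs, size_t n)
{
	size_t i;

	toy_gp_seq++;
	for (i = 0; i < n; i++) {
		if (objs[i].state == TOY_WAIT_RCU_GP &&
		    objs[i].freed_gp < toy_gp_seq)
			objs[i].state = TOY_REUSABLE;
	}
}

/* Alloc: hand out only objects whose grace period has already elapsed. */
static struct toy_obj *toy_alloc(struct toy_obj *objs, size_t n)
{
	size_t i;

	for (i = 0; i < n; i++) {
		if (objs[i].state == TOY_REUSABLE) {
			objs[i].state = TOY_IN_USE;
			return &objs[i];
		}
	}
	return NULL;	/* nothing reusable yet */
}
```

In this model a reader that started before toy_free() can never observe
the object being handed to a new owner until toy_gp_elapsed() has run,
which is the property that bpf_rcu_read_{lock,unlock}() is meant to
give bpf programs.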

To make freed elements reusable quickly, BPF_MA_REUSE_AFTER_RCU_GP
dynamically allocates memory to create many inflight RCU callbacks which
mark these freed elements as reusable. The memory used for
bpf_reuse_batch is freed when these RCU callbacks complete. When no
memory is available, synchronize_rcu_expedited() is used instead to make
these freed elements reusable. To reduce the risk of OOM, part of the
reusable memory is freed after an RCU-tasks-trace grace period. Until
that memory is actually freed, it remains available for reuse.
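
The batching and the out-of-memory fallback described above can be
sketched in user space as follows. This is an illustrative model under
stated assumptions, not the patch's implementation: toy_batch stands in
for bpf_reuse_batch, toy_gp_complete() stands in for the RCU callbacks
running after a grace period, and the simulate_oom flag is a
hypothetical test knob. The synchronous branch only models the effect
of waiting for a grace period (as synchronize_rcu_expedited() does in
the kernel) by marking the elements reusable immediately.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Toy user-space model, NOT kernel code. All names are hypothetical. */
struct toy_batch {
	struct toy_batch *next;
	size_t nr;		/* freed elements tracked by this batch */
};

struct toy_ma {
	struct toy_batch *inflight;	/* batches waiting for a grace period */
	size_t waiting;			/* elements not yet reusable */
	size_t reusable;		/* elements that may be handed out again */
	bool simulate_oom;		/* test knob: batch allocation fails */
	unsigned int sync_fallbacks;	/* times the synchronous path ran */
};

/* Model of the inflight callbacks firing once a grace period has passed. */
static void toy_gp_complete(struct toy_ma *ma)
{
	while (ma->inflight) {
		struct toy_batch *b = ma->inflight;

		ma->inflight = b->next;
		assert(ma->waiting >= b->nr);
		ma->waiting -= b->nr;
		ma->reusable += b->nr;
		free(b);	/* batch memory is released with its callback */
	}
}

/* Free nr elements: queue a batch, or fall back to a synchronous wait. */
static void toy_free_elems(struct toy_ma *ma, size_t nr)
{
	struct toy_batch *b = ma->simulate_oom ? NULL : malloc(sizeof(*b));

	if (!b) {
		/* No memory for a batch: model a blocking grace-period wait. */
		ma->reusable += nr;
		ma->sync_fallbacks++;
		return;
	}
	b->nr = nr;
	b->next = ma->inflight;
	ma->inflight = b;
	ma->waiting += nr;
}
```

The model makes the trade-off visible: the asynchronous path costs one
small allocation per batch but never blocks the caller, while the
fallback path needs no memory but would block for a grace period.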

The following benchmark results compare the different flavors of the
bpf memory allocator. These results show:
* The performance of reuse-after-rcu-gp bpf ma is better than no bpf ma.
  Its memory usage is also better than no bpf ma except for the
  add_del_on_diff_cpu case.
* The memory usage of reuse-after-rcu-gp bpf ma increases a lot compared
  with normal bpf ma.
* The memory usage of free-after-rcu-gp bpf ma is better than
  reuse-after-rcu-gp bpf ma, but its performance is worse than
  reuse-after-rcu-gp because it doesn't do reuse.

(1) no bpf memory allocator (v6.0.19)
| name                | loop (k/s) | average memory (MiB) | peak memory (MiB) |
| --                  | --         | --                   | --                |
| no_op               | 1187       | 1.05                 | 1.05              |
| overwrite           | 3.74       | 32.52                | 84.18             |
| batch_add_batch_del | 2.23       | 26.38                | 48.75             |
| add_del_on_diff_cpu | 3.92       | 33.72                | 48.96             |

(2) normal bpf memory allocator
| name                | loop (k/s) | average memory (MiB) | peak memory (MiB) |
| --                  | --         | --                   | --                |
| no_op               | 1187       | 0.96                 | 1.00              |
| overwrite           | 27.12      | 2.5                  | 2.99              |
| batch_add_batch_del | 8.9        | 2.77                 | 3.24              |
| add_del_on_diff_cpu | 11.30      | 218.54               | 440.37            |

(3) reuse-after-rcu-gp bpf memory allocator
| name                | loop (k/s) | average memory (MiB) | peak memory (MiB) |
| --                  | --         | --                   | --                |
| no_op               | 1276       | 0.96                 | 1.00              |
| overwrite           | 15.66      | 25.00                | 33.07             |
| batch_add_batch_del | 10.32      | 18.84                | 22.64             |
| add_del_on_diff_cpu | 13.00      | 550.50               | 748.74            |

(4) free-after-rcu-gp bpf memory allocator (free directly through call_rcu)
| name                | loop (k/s) | average memory (MiB) | peak memory (MiB) |
| --                  | --         | --                   | --                |
| no_op               | 1263       | 0.96                 | 1.00              |
| overwrite           | 10.73      | 12.33                | 20.32             |
| batch_add_batch_del | 7.02       | 9.45                 | 14.07             |
| add_del_on_diff_cpu | 8.99       | 131.64               | 204.42            |

Signed-off-by: Hou Tao <houtao1@huawei.com>
---
 include/linux/bpf_mem_alloc.h |   1 +
 kernel/bpf/memalloc.c         | 353 +++++++++++++++++++++++++++++++---
 2 files changed, 326 insertions(+), 28 deletions(-)

Message ID	20230408141846.1878768-5-houtao@huaweicloud.com (mailing list archive)
State	RFC
Delegated to:	BPF
Headers	Return-Path: <bpf-owner@vger.kernel.org>
	From: Hou Tao <houtao@huaweicloud.com>
	To: bpf@vger.kernel.org, Martin KaFai Lau <martin.lau@linux.dev>, Alexei Starovoitov <ast@kernel.org>
	Cc: Andrii Nakryiko <andrii@kernel.org>, Song Liu <song@kernel.org>, Hao Luo <haoluo@google.com>, Yonghong Song <yhs@fb.com>, Daniel Borkmann <daniel@iogearbox.net>, KP Singh <kpsingh@kernel.org>, Stanislav Fomichev <sdf@google.com>, Jiri Olsa <jolsa@kernel.org>, John Fastabend <john.fastabend@gmail.com>, "Paul E . McKenney" <paulmck@kernel.org>, rcu@vger.kernel.org, houtao1@huawei.com
	Subject: [RFC bpf-next v2 4/4] bpf: Introduce BPF_MA_REUSE_AFTER_RCU_GP
	Date: Sat, 8 Apr 2023 22:18:46 +0800
	Message-Id: <20230408141846.1878768-5-houtao@huaweicloud.com>
	In-Reply-To: <20230408141846.1878768-1-houtao@huaweicloud.com>
	References: <20230408141846.1878768-1-houtao@huaweicloud.com>
	MIME-Version: 1.0
	Content-Transfer-Encoding: 8bit
	Precedence: bulk
Series	Introduce BPF_MA_REUSE_AFTER_RCU_GP
	[RFC,bpf-next,v2,0/4] Introduce BPF_MA_REUSE_AFTER_RCU_GP
	[RFC,bpf-next,v2,1/4] selftests/bpf: Add benchmark for bpf memory allocator
	[RFC,bpf-next,v2,2/4] bpf: Factor out a common helper free_all()
	[RFC,bpf-next,v2,3/4] bpf: Pass bitwise flags to bpf_mem_alloc_init()
	[RFC,bpf-next,v2,4/4] bpf: Introduce BPF_MA_REUSE_AFTER_RCU_GP

Context	Check	Description
bpf/vmtest-bpf-next-PR	fail	PR summary
bpf/vmtest-bpf-next-VM_Test-1	success	Logs for ${{ matrix.test }} on ${{ matrix.arch }} with ${{ matrix.toolchain_full }}
bpf/vmtest-bpf-next-VM_Test-2	success	Logs for ShellCheck
bpf/vmtest-bpf-next-VM_Test-3	fail	Logs for build for aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-4	success	Logs for build for aarch64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-5	fail	Logs for build for s390x with gcc
bpf/vmtest-bpf-next-VM_Test-6	fail	Logs for build for x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-7	success	Logs for build for x86_64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-8	success	Logs for set-matrix
netdev/series_format	success	Posting correctly formatted
netdev/tree_selection	success	Clearly marked for bpf-next, async
netdev/fixes_present	success	Fixes tag not required for -next series
netdev/header_inline	success	No static functions without inline keyword in header files
netdev/build_32bit	success	Errors and warnings before: 188 this patch: 188
netdev/cc_maintainers	success	CCed 12 of 12 maintainers
netdev/build_clang	success	Errors and warnings before: 30 this patch: 30
netdev/verify_signedoff	success	Signed-off-by tag matches author and committer
netdev/deprecated_api	success	None detected
netdev/check_selftest	success	No net selftest shell script
netdev/verify_fixes	success	No Fixes tag
netdev/build_allmodconfig_warn	success	Errors and warnings before: 188 this patch: 188
netdev/checkpatch	warning	WARNING: Do not crash the kernel unless it is absolutely unavoidable--use WARN_ON_ONCE() plus recovery code (if feasible) instead of BUG() or variants WARNING: line length of 81 exceeds 80 columns WARNING: line length of 87 exceeds 80 columns WARNING: line length of 88 exceeds 80 columns WARNING: line length of 89 exceeds 80 columns WARNING: line length of 90 exceeds 80 columns WARNING: line length of 92 exceeds 80 columns WARNING: line length of 95 exceeds 80 columns
netdev/kdoc	success	Errors and warnings before: 0 this patch: 0
netdev/source_inline	fail	Was 0 now: 1
