From patchwork Tue Nov 12 16:39:02 2024
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Yonghong Song
X-Patchwork-Id: 13872498
From: Yonghong Song
To: bpf@vger.kernel.org
Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
 kernel-team@fb.com, Martin KaFai Lau, Tejun Heo
Subject: [PATCH bpf-next v12 0/7] bpf: Support private stack for bpf progs
Date: Tue, 12 Nov 2024 08:39:02 -0800
Message-ID: <20241112163902.2223011-1-yonghong.song@linux.dev>
X-Mailer: git-send-email 2.43.5
X-Mailing-List: bpf@vger.kernel.org
X-Patchwork-Delegate: bpf@iogearbox.net

The main motivation for private stack comes from nested scheduler in
sched-ext from Tejun. The basic idea is that
 - each cgroup will have its own associated bpf program,
 - the bpf program of a parent cgroup will call the bpf programs
   of its immediate child cgroups.

Let us say we have the following cgroup hierarchy:
  root_cg (prog0):
    cg1 (prog1):
      cg11 (prog11):
        cg111 (prog111)
        cg112 (prog112)
      cg12 (prog12):
        cg121 (prog121)
        cg122 (prog122)
    cg2 (prog2):
      cg21 (prog21)
      cg22 (prog22)
      cg23 (prog23)

In the above example, prog0 will call a kfunc which will call prog1 and
prog2 to get sched info for cg1 and cg2, and then the information is
summarized and sent back to prog0. Similarly, prog11 and prog12 will be
invoked in the kfunc and the result will be summarized and sent back to
prog1, etc.

The following illustrates a possible call sequence:
   ... -> bpf prog A -> kfunc -> ops. (bpf prog B) ...

Currently, for each thread, the x86 kernel allocates a 16KB stack. Each
bpf program (including its subprograms) has a maximum stack size of 512B
to avoid potential stack overflow. Nested bpf programs further increase
the risk of stack overflow. To avoid potential stack overflow caused by
bpf programs, this patch set adds private stack support, where bpf
program stack space is allocated at jit time. Using private stack for
bpf progs can reduce or avoid potential kernel stack overflow.

Currently private stack is applied to tracing programs like
kprobe/uprobe, perf_event, tracepoint, raw tracepoint and struct_ops
progs. Tracing progs enable private stack if any subprog stack size is
more than a threshold (i.e. 64B).
Struct-ops progs enable private stack based on the particular struct_ops
implementation, which can enable private stack before verification at
the per-insn level. Otherwise, struct-ops progs get the same treatment
as tracing progs w.r.t. when to enable private stack.

For all these progs, the kernel will do a recursion check (no nesting
per prog per cpu) to ensure that the private stack won't be overwritten.
The bpf_prog_aux struct has a callback func recursion_detected() which
can be implemented by a kernel subsystem to synchronously detect
recursion, report an error, etc.

Only the x86_64 arch supports private stack now. It can be extended to
other archs later. Please see each individual patch for details.

Change logs:
  v11 -> v12:
    - v11 link: https://lore.kernel.org/bpf/20241109025312.148539-1-yonghong.song@linux.dev/
    - Fix a bug where the allocated percpu space is less than the actual
      private stack.
    - Add guard memory (before and after the actual prog stack) to detect
      potential underflow/overflow.
  v10 -> v11:
    - v10 link: https://lore.kernel.org/bpf/20241107024138.3355687-1-yonghong.song@linux.dev/
    - Use two bool variables, priv_stack_requested (used by struct-ops
      only) and jits_use_priv_stack, in order to make the code cleaner.
    - Set env->prog->aux->jits_use_priv_stack to true if any subprog uses
      private stack. This is for the struct-ops use case to kick in
      recursion protection.
  v9 -> v10:
    - v9 link: https://lore.kernel.org/bpf/20241104193455.3241859-1-yonghong.song@linux.dev/
    - Simplify handling of async cbs by making async cb related progs use
      the normal kernel stack.
    - Do percpu allocation in jit instead of verifier.
  v8 -> v9:
    - v8 link: https://lore.kernel.org/bpf/20241101030950.2677215-1-yonghong.song@linux.dev/
    - Use enum to express priv stack mode.
    - Use bits in bpf_subprog_info struct to do subprog recursion check
      between main/async and async subprogs.
    - Fix potential memory leak.
    - Rename recursion detection func from recursion_skipped() to
      recursion_detected().
  v7 -> v8:
    - v7 link: https://lore.kernel.org/bpf/20241029221637.264348-1-yonghong.song@linux.dev/
    - Add recursion_skipped() callback func to bpf_prog->aux structure
      such that if a recursion miss happened and
      bpf_prog->aux->recursion_skipped is not NULL, the callback fn will
      be called so the subsystem can take proper action based on its
      own design.
  v6 -> v7:
    - v6 link: https://lore.kernel.org/bpf/20241020191341.2104841-1-yonghong.song@linux.dev/
    - Go back to doing private stack allocation per prog instead of per
      subtree. This can simplify implementation and avoid verifier
      complexity.
    - Handle potential nested subprog run if async callback exists.
    - Use struct_ops->check_member() callback to set whether a particular
      struct-ops prog wants private stack or not.
  v5 -> v6:
    - v5 link: https://lore.kernel.org/bpf/20241017223138.3175885-1-yonghong.song@linux.dev/
    - Instead of using (or not using) private stack at struct_ops level,
      each prog in struct_ops can decide whether to use private stack
      or not.
  v4 -> v5:
    - v4 link: https://lore.kernel.org/bpf/20241010175552.1895980-1-yonghong.song@linux.dev/
    - Remove bpf_prog_call() related implementation.
    - Allow (opt-in) private stack for sched-ext progs.
  v3 -> v4:
    - v3 link: https://lore.kernel.org/bpf/20240926234506.1769256-1-yonghong.song@linux.dev/
      There is a long discussion in the above v3 link trying to allow
      private stack to be used by kernel functions in order to simplify
      implementation. But unfortunately we haven't found a workable
      solution yet, so we return to the approach where private stack is
      only used by bpf programs.
    - Add bpf_prog_call() kfunc.
  v2 -> v3:
    - Instead of per-subprog private stack allocation, allocate private
      stacks at main prog or callback entry prog. Subprogs that are not
      main or callback progs will increment the inherited stack pointer
      to be their frame pointer.
    - Private stack allows each prog's max stack size to be 512 bytes,
      instead of the whole prog hierarchy being limited to 512 bytes.
    - Add some tests.

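For readers new to the series, the enablement rule for tracing progs,
the guard memory added in v12 and the per-prog recursion check can be
illustrated with a small user-space C sketch. This is not the kernel
implementation: the real code allocates percpu memory at jit time and
emits x86_64 instructions. All names below (wants_priv_stack,
struct priv_stack, GUARD_SIZE, GUARD_BYTE, ...) are hypothetical, and
the guard size and fill pattern are assumptions chosen for the example.
Only the 64B threshold is taken from the description above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define STACK_THRESHOLD 64   /* threshold quoted in the cover letter */
#define GUARD_SIZE      16   /* assumption: guard bytes on each side */
#define GUARD_BYTE      0xEB /* assumption: arbitrary fill pattern */

/* Tracing-prog rule: use private stack if any subprog stack size
 * is more than the threshold. */
static bool wants_priv_stack(const unsigned *subprog_depths, size_t n)
{
	for (size_t i = 0; i < n; i++)
		if (subprog_depths[i] > STACK_THRESHOLD)
			return true;
	return false;
}

/* Hypothetical model of one prog's private stack on one cpu. */
struct priv_stack {
	uint8_t *mem;   /* layout: guard | stack | guard */
	size_t depth;   /* usable stack bytes */
	bool active;    /* set while the prog runs on this cpu */
};

/* Allocate the stack plus guard memory before and after it. */
static int priv_stack_alloc(struct priv_stack *ps, size_t depth)
{
	ps->mem = malloc(depth + 2 * GUARD_SIZE);
	if (!ps->mem)
		return -1;
	memset(ps->mem, GUARD_BYTE, GUARD_SIZE);
	memset(ps->mem + GUARD_SIZE, 0, depth);
	memset(ps->mem + GUARD_SIZE + depth, GUARD_BYTE, GUARD_SIZE);
	ps->depth = depth;
	ps->active = false;
	return 0;
}

/* Return true if neither guard region has been written to. */
static bool priv_stack_guards_intact(const struct priv_stack *ps)
{
	assert(ps && ps->mem);
	for (size_t i = 0; i < GUARD_SIZE; i++) {
		if (ps->mem[i] != GUARD_BYTE)
			return false;
		if (ps->mem[GUARD_SIZE + ps->depth + i] != GUARD_BYTE)
			return false;
	}
	return true;
}

/* Recursion check: refuse to run the prog again while it is active,
 * so the private stack is never reused by a nested invocation. */
static bool priv_stack_enter(struct priv_stack *ps)
{
	if (ps->active)
		return false; /* recursion detected */
	ps->active = true;
	return true;
}

static void priv_stack_exit(struct priv_stack *ps)
{
	ps->active = false;
}
```

In this model a caller would check wants_priv_stack() once, allocate
with priv_stack_alloc(), wrap each run in priv_stack_enter() and
priv_stack_exit(), and verify priv_stack_guards_intact() when the
stack is freed.
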
Yonghong Song (7):
  bpf: Find eligible subprogs for private stack support
  bpf: Enable private stack for eligible subprogs
  bpf, x86: Avoid repeated usage of bpf_prog->aux->stack_depth
  bpf, x86: Support private stack in jit
  selftests/bpf: Add tracing prog private stack tests
  bpf: Support private stack for struct_ops progs
  selftests/bpf: Add struct_ops prog private stack tests

 arch/x86/net/bpf_jit_comp.c                   | 147 +++++++++-
 include/linux/bpf.h                           |   4 +
 include/linux/bpf_verifier.h                  |   8 +
 include/linux/filter.h                        |   1 +
 kernel/bpf/core.c                             |   5 +
 kernel/bpf/trampoline.c                       |   4 +
 kernel/bpf/verifier.c                         | 112 +++++++-
 .../selftests/bpf/bpf_testmod/bpf_testmod.c   | 104 +++++++
 .../selftests/bpf/bpf_testmod/bpf_testmod.h   |   5 +
 .../bpf/prog_tests/struct_ops_private_stack.c | 106 +++++++
 .../selftests/bpf/prog_tests/verifier.c       |   2 +
 .../bpf/progs/struct_ops_private_stack.c      |  62 ++++
 .../bpf/progs/struct_ops_private_stack_fail.c |  62 ++++
 .../progs/struct_ops_private_stack_recur.c    |  50 ++++
 .../bpf/progs/verifier_private_stack.c        | 272 ++++++++++++++++++
 15 files changed, 930 insertions(+), 14 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/struct_ops_private_stack.c
 create mode 100644 tools/testing/selftests/bpf/progs/struct_ops_private_stack.c
 create mode 100644 tools/testing/selftests/bpf/progs/struct_ops_private_stack_fail.c
 create mode 100644 tools/testing/selftests/bpf/progs/struct_ops_private_stack_recur.c
 create mode 100644 tools/testing/selftests/bpf/progs/verifier_private_stack.c