
[bpf-next,v2,1/2] bpf: Support private stack for bpf progs

Message ID 20240718205158.3651529-1-yonghong.song@linux.dev (mailing list archive)
State Changes Requested
Delegated to: BPF
Series [bpf-next,v2,1/2] bpf: Support private stack for bpf progs

Checks

Context Check Description
netdev/series_format success Single patches do not need cover letters
netdev/tree_selection success Clearly marked for bpf-next, async
netdev/ynl success Generated files up to date; no warnings/errors; no diff in generated;
netdev/fixes_present success Fixes tag not required for -next series
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 898 this patch: 898
netdev/build_tools success Errors and warnings before: 0 this patch: 0
netdev/cc_maintainers warning 16 maintainers not CCed: kpsingh@kernel.org dave.hansen@linux.intel.com haoluo@google.com dsahern@kernel.org bp@alien8.de john.fastabend@gmail.com hpa@zytor.com song@kernel.org mingo@redhat.com x86@kernel.org netdev@vger.kernel.org jolsa@kernel.org martin.lau@linux.dev eddyz87@gmail.com tglx@linutronix.de sdf@fomichev.me
netdev/build_clang success Errors and warnings before: 971 this patch: 971
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/deprecated_api success None detected
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes success No Fixes tag
netdev/build_allmodconfig_warn fail Errors and warnings before: 7630 this patch: 7631
netdev/checkpatch warning CHECK: No space is necessary after a cast WARNING: line length of 91 exceeds 80 columns WARNING: line length of 94 exceeds 80 columns
netdev/build_clang_rust success No Rust files in patch. Skipping build
netdev/kdoc success Errors and warnings before: 6 this patch: 6
netdev/source_inline success Was 0 now: 0
bpf/vmtest-bpf-next-VM_Test-1 success Logs for ShellCheck
bpf/vmtest-bpf-next-VM_Test-0 success Logs for Lint
bpf/vmtest-bpf-next-VM_Test-2 success Logs for Unittests
bpf/vmtest-bpf-next-VM_Test-3 success Logs for Validate matrix.py
bpf/vmtest-bpf-next-VM_Test-5 success Logs for aarch64-gcc / build-release
bpf/vmtest-bpf-next-VM_Test-4 success Logs for aarch64-gcc / build / build for aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-10 success Logs for aarch64-gcc / veristat
bpf/vmtest-bpf-next-VM_Test-12 success Logs for s390x-gcc / build-release
bpf/vmtest-bpf-next-VM_Test-9 success Logs for aarch64-gcc / test (test_verifier, false, 360) / test_verifier on aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-6 success Logs for aarch64-gcc / test (test_maps, false, 360) / test_maps on aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-11 success Logs for s390x-gcc / build / build for s390x with gcc
bpf/vmtest-bpf-next-VM_Test-13 fail Logs for s390x-gcc / test (test_maps, false, 360) / test_maps on s390x with gcc
bpf/vmtest-bpf-next-VM_Test-16 success Logs for s390x-gcc / test (test_verifier, false, 360) / test_verifier on s390x with gcc
bpf/vmtest-bpf-next-VM_Test-17 success Logs for s390x-gcc / veristat
bpf/vmtest-bpf-next-VM_Test-18 success Logs for set-matrix
bpf/vmtest-bpf-next-VM_Test-19 success Logs for x86_64-gcc / build / build for x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-20 success Logs for x86_64-gcc / build-release
bpf/vmtest-bpf-next-VM_Test-29 success Logs for x86_64-llvm-17 / build-release / build for x86_64 with llvm-17-O2
bpf/vmtest-bpf-next-VM_Test-28 success Logs for x86_64-llvm-17 / build / build for x86_64 with llvm-17
bpf/vmtest-bpf-next-VM_Test-34 success Logs for x86_64-llvm-17 / veristat
bpf/vmtest-bpf-next-VM_Test-36 success Logs for x86_64-llvm-18 / build-release / build for x86_64 with llvm-18-O2
bpf/vmtest-bpf-next-VM_Test-35 success Logs for x86_64-llvm-18 / build / build for x86_64 with llvm-18
bpf/vmtest-bpf-next-VM_Test-42 success Logs for x86_64-llvm-18 / veristat
bpf/vmtest-bpf-next-VM_Test-8 success Logs for aarch64-gcc / test (test_progs_no_alu32, false, 360) / test_progs_no_alu32 on aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-7 success Logs for aarch64-gcc / test (test_progs, false, 360) / test_progs on aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-14 success Logs for s390x-gcc / test (test_progs, false, 360) / test_progs on s390x with gcc
bpf/vmtest-bpf-next-VM_Test-21 success Logs for x86_64-gcc / test (test_maps, false, 360) / test_maps on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-23 success Logs for x86_64-gcc / test (test_progs_no_alu32, false, 360) / test_progs_no_alu32 on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-22 success Logs for x86_64-gcc / test (test_progs, false, 360) / test_progs on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-24 success Logs for x86_64-gcc / test (test_progs_no_alu32_parallel, true, 30) / test_progs_no_alu32_parallel on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-25 success Logs for x86_64-gcc / test (test_progs_parallel, true, 30) / test_progs_parallel on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-26 success Logs for x86_64-gcc / test (test_verifier, false, 360) / test_verifier on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-27 success Logs for x86_64-gcc / veristat / veristat on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-30 success Logs for x86_64-llvm-17 / test (test_maps, false, 360) / test_maps on x86_64 with llvm-17
bpf/vmtest-bpf-next-VM_Test-33 success Logs for x86_64-llvm-17 / test (test_verifier, false, 360) / test_verifier on x86_64 with llvm-17
bpf/vmtest-bpf-next-VM_Test-37 success Logs for x86_64-llvm-18 / test (test_maps, false, 360) / test_maps on x86_64 with llvm-18
bpf/vmtest-bpf-next-VM_Test-38 success Logs for x86_64-llvm-18 / test (test_progs, false, 360) / test_progs on x86_64 with llvm-18
bpf/vmtest-bpf-next-VM_Test-39 success Logs for x86_64-llvm-18 / test (test_progs_cpuv4, false, 360) / test_progs_cpuv4 on x86_64 with llvm-18
bpf/vmtest-bpf-next-VM_Test-40 success Logs for x86_64-llvm-18 / test (test_progs_no_alu32, false, 360) / test_progs_no_alu32 on x86_64 with llvm-18
bpf/vmtest-bpf-next-VM_Test-41 success Logs for x86_64-llvm-18 / test (test_verifier, false, 360) / test_verifier on x86_64 with llvm-18
bpf/vmtest-bpf-next-VM_Test-31 success Logs for x86_64-llvm-17 / test (test_progs, false, 360) / test_progs on x86_64 with llvm-17
bpf/vmtest-bpf-next-VM_Test-32 success Logs for x86_64-llvm-17 / test (test_progs_no_alu32, false, 360) / test_progs_no_alu32 on x86_64 with llvm-17
bpf/vmtest-bpf-next-PR fail PR summary
bpf/vmtest-bpf-next-VM_Test-15 success Logs for s390x-gcc / test (test_progs_no_alu32, false, 360) / test_progs_no_alu32 on s390x with gcc

Commit Message

Yonghong Song July 18, 2024, 8:51 p.m. UTC
The main motivation for the private stack comes from the nested
scheduler in sched-ext from Tejun. The basic idea is that
 - each cgroup will have its own associated bpf program, and
 - the bpf program of a parent cgroup will call the bpf programs
   of its immediate child cgroups.

Let us say we have the following cgroup hierarchy:
  root_cg (prog0):
    cg1 (prog1):
      cg11 (prog11):
        cg111 (prog111)
        cg112 (prog112)
      cg12 (prog12):
        cg121 (prog121)
        cg122 (prog122)
    cg2 (prog2):
      cg21 (prog21)
      cg22 (prog22)
      cg23 (prog23)

In the above example, prog0 will call a kfunc which calls prog1 and
prog2 to get sched info for cg1 and cg2; the information is then
summarized and sent back to prog0. Similarly, prog11 and prog12 will
be invoked from the kfunc and their results will be summarized and
sent back to prog1, etc.

Currently, the x86 kernel allocates an 8KB stack for each thread.
Each bpf program (including its subprograms) is limited to a
maximum stack size of 512B to avoid potential stack overflow,
and nested bpf programs increase the risk of overflowing that
stack. To avoid potential stack overflow caused by bpf programs,
this patch implements a private stack so that bpf program stack
space is allocated dynamically when the program is jited.
The private stack is applied to tracing programs like
kprobe/uprobe, perf_event, tracepoint, raw tracepoint and
tracing.

But more than one instance of the same bpf program may
run in the system. To keep things simple, a percpu private
stack is allocated for each program, so if the same program
is running on different cpus concurrently, we won't have
any issue. Note that the kernel already has logic to prevent
recursion of the same bpf program on the same cpu
(kprobe, fentry, etc.).

The patch implements a percpu private stack based approach
for the x86 arch (a rough sketch of the mechanism follows the list):
  - The stack size will be 0 and all stack accesses go to
    jit-time allocated percpu storage.
  - At the beginning of the jited code, r9 is used to hold the
    percpu private stack pointer.
  - Each rbp in the bpf asm insns is replaced by r9.
  - For each call, push r9 before the call and pop r9
    after the call to preserve the r9 value.
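
As a rough sketch of the mechanism (not the exact code in this patch:
alloc_private_stack() and the headers below are illustrative
assumptions), the jit-time allocation and the runtime r9 setup look
conceptually like this:

    #include <linux/gfp.h>
    #include <linux/percpu.h>
    #include <linux/types.h>

    /* illustrative sketch only: one private stack region per CPU,
     * sized from the verifier-computed stack depth
     */
    static void __percpu *alloc_private_stack(u32 stack_depth)
    {
            return __alloc_percpu_gfp(stack_depth, 8, GFP_KERNEL);
    }

    /* The jited prologue then resolves this CPU's copy of that region:
     *
     *   mov r9d, <percpu base of the allocation>  ; 0x8c1c860 in the
     *                                             ; listing below
     *   add r9,  QWORD PTR gs:<this_cpu_off>      ; per-CPU offset
     *
     * i.e. the asm form of this_cpu_ptr() on x86-64, so r9 ends up
     * pointing at this CPU's private stack and replaces rbp for all
     * stack accesses.
     */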

Compared to the previous RFC patch [1], this patch adds
some conditions to enable the private stack, e.g.,
verifier-calculated stack size, prog type, etc. The new patch
also adds a performance test to compare private stack
vs. no private stack.

The following is a code example to illustrate the idea
for the selftest cgroup_skb_sk_lookup:

   the existing code                        the private-stack approach code
   endbr64                                  endbr64
   nop    DWORD PTR [rax+rax*1+0x0]         nop    DWORD PTR [rax+rax*1+0x0]
   xchg   ax,ax                             xchg   ax,ax
   push   rbp                               push   rbp
   mov    rbp,rsp                           mov    rbp,rsp
   endbr64                                  endbr64
   sub    rsp,0x68
   push   rbx                               push   rbx
   ...                                      ...
   ...                                      mov    r9d,0x8c1c860
   ...                                      add    r9,QWORD PTR gs:0x21a00
   ...                                      ...
   mov    rdx,rbp                           mov    rdx, r9
   add    rdx,0xffffffffffffffb4            add    rdx,0xffffffffffffffb4
   ...                                      ...
   mov    ecx,0x28                          mov    ecx,0x28
                                            push   r9
   call   0xffffffffe305e474                call   0xffffffffe305e524
                                            pop    r9
   mov    rdi,rax                           mov    rdi,rax
   ...                                      ...
   movzx  rdi,BYTE PTR [rbp-0x46]           movzx  rdi,BYTE PTR [r9-0x46]
   ...                                      ...

So the number of insns is increased by 1 + num_of_calls * 2:
the 'sub rsp,0x68' is replaced by the two-insn r9 setup (net +1),
and each call gains a push r9/pop r9 pair. Here the number of
calls is the number of calls in the final jited binary. Compared
to the function call itself, the push/pop overhead should be
minimal in most common cases.

Our original use case is the sched-ext nested scheduler. That
integration will be done in the future.

  [1] https://lore.kernel.org/bpf/707970c5-6bba-450a-be08-adf24d8b9276@linux.dev/T/

Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
---
 arch/x86/net/bpf_jit_comp.c | 63 ++++++++++++++++++++++++++++++++++---
 include/linux/bpf.h         |  2 ++
 kernel/bpf/core.c           | 20 ++++++++++++
 kernel/bpf/syscall.c        |  1 +
 4 files changed, 82 insertions(+), 4 deletions(-)

Comments

Andrii Nakryiko July 20, 2024, 3:28 a.m. UTC | #1
On Thu, Jul 18, 2024 at 1:52 PM Yonghong Song <yonghong.song@linux.dev> wrote:
>
> The main motivation for private stack comes from nested
> scheduler in sched-ext from Tejun. The basic idea is that
>  - each cgroup will its own associated bpf program,
>  - bpf program with parent cgroup will call bpf programs
>    in immediate child cgroups.
>
> Let us say we have the following cgroup hierarchy:
>   root_cg (prog0):
>     cg1 (prog1):
>       cg11 (prog11):
>         cg111 (prog111)
>         cg112 (prog112)
>       cg12 (prog12):
>         cg121 (prog121)
>         cg122 (prog122)
>     cg2 (prog2):
>       cg21 (prog21)
>       cg22 (prog22)
>       cg23 (prog23)
>
> In the above example, prog0 will call a kfunc which will
> call prog1 and prog2 to get sched info for cg1 and cg2 and
> then the information is summarized and sent back to prog0.
> Similarly, prog11 and prog12 will be invoked in the kfunc
> and the result will be summarized and sent back to prog1, etc.
>
> Currently, for each thread, the x86 kernel allocate 8KB stack.
> The each bpf program (including its subprograms) has maximum
> 512B stack size to avoid potential stack overflow.
> And nested bpf programs increase the risk of stack overflow.
> To avoid potential stack overflow caused by bpf programs,
> this patch implemented a private stack so bpf program stack
> space is allocated dynamically when the program is jited.
> Such private stack is applied to tracing programs like
> kprobe/uprobe, perf_event, tracepoint, raw tracepoint and
> tracing.
>
> But more than one instance of the same bpf program may
> run in the system. To make things simple, percpu private
> stack is allocated for each program, so if the same program
> is running on different cpus concurrently, we won't have
> any issue. Note that the kernel already have logic to prevent
> the recursion for the same bpf program on the same cpu
> (kprobe, fentry, etc.).
>
> The patch implemented a percpu private stack based approach
> for x86 arch.
>   - The stack size will be 0 and any stack access is from
>     jit-time allocated percpu storage.
>   - In the beginning of jit, r9 is used to save percpu
>     private stack pointer.
>   - Each rbp in the bpf asm insn is replaced by r9.
>   - For each call, push r9 before the call and pop r9
>     after the call to preserve r9 value.
>
> Compared to previous RFC patch [1], this patch added
> some conditions to enable private stack, e.g., verifier
> calculated stack size, prog type, etc. The new patch
> also added a performance test to compare private stack
> vs. no private stack.
>
> The following are some code example to illustrate the idea
> for selftest cgroup_skb_sk_lookup:
>
>    the existing code                        the private-stack approach code
>    endbr64                                  endbr64
>    nop    DWORD PTR [rax+rax*1+0x0]         nop    DWORD PTR [rax+rax*1+0x0]
>    xchg   ax,ax                             xchg   ax,ax
>    push   rbp                               push   rbp
>    mov    rbp,rsp                           mov    rbp,rsp
>    endbr64                                  endbr64
>    sub    rsp,0x68
>    push   rbx                               push   rbx
>    ...                                      ...
>    ...                                      mov    r9d,0x8c1c860
>    ...                                      add    r9,QWORD PTR gs:0x21a00
>    ...                                      ...
>    mov    rdx,rbp                           mov    rdx, r9
>    add    rdx,0xffffffffffffffb4            rdx,0xffffffffffffffb4
>    ...                                      ...
>    mov    ecx,0x28                          mov    ecx,0x28
>                                             push   r9
>    call   0xffffffffe305e474                call   0xffffffffe305e524
>                                             pop    r9
>    mov    rdi,rax                           mov    rdi,rax
>    ...                                      ...
>    movzx  rdi,BYTE PTR [rbp-0x46]           movzx  rdi,BYTE PTR [r9-0x46]
>    ...                                      ...
>

Eduard nerd-sniped me today with this a bit... :)

I have a few questions and suggestions.

So it seems like each *subprogram* (not the entire BPF program) gets
its own per-CPU private stack allocation. Is that intentional? That
seems a bit unnecessary. It also prevents any sort of actual
recursion. Not sure if it's possible to write a recursive BPF subprogram
today (the verifier seems to reject obvious limited recursion cases), but
still, eventually we might need/want to support that, and this will be
just another hurdle to overcome (so it's best to avoid adding it in
the first place).

I'm sure Eduard is going to try something like below and it will
probably break badly (I haven't tried, sorry):

int entry(void *ctx);

struct {
        __uint(type, BPF_MAP_TYPE_PROG_ARRAY);
        __uint(max_entries, 1);
        __uint(key_size, sizeof(__u32));
        __array(values, int (void *));
} prog_array_init SEC(".maps") = {
        .values = {
                [0] = (void *)&entry,
        },
};

static __noinline int subprog1(void *ctx)
{
    <some state on the stack>

    /* here entry will replace subprog1, and so we'll have
     * entry -> entry -> entry -> ..... <tail call limit> -> subprog1
     */
    bpf_tail_call(ctx, &prog_array_init, 0);

    return 0;
}


SEC("raw_tp/sys_enter")
int entry(void *ctx)
{
     <some state on the stack>

     return subprog1(ctx);
}

And we effectively have limited recursion where the entry's stack
state is clobbered, no?

So it seems like we need to support recursion.


So, the question I have is: why not do the following:
a) only set up r9 *once* in the entry program's prologue (before the tail
call jump target)
b) before each call, adjust r9 by the current prog/subprog's
maximum *own* stack, something like:

push r9;
r9 += 128; // 128 is subprog's stack usage
call <some-subprog>
pop r9;

The idea being that on a tail call or in a subprog call we assume r9 is
already pointing to the right place. We can probably also figure out
how to avoid the push/pop of r9 if we make sure that the subprogram always
restores r9 (taking tail calls into account and all that, of course)?

Is this feasible?
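
For illustration, here is a minimal user-space toy of that convention
(not JIT code; the frame sizes, the names and the grow-down direction
are made-up assumptions). The callee simply inherits the frame pointer,
while the caller carves out its own frame around the call and restores
the pointer afterwards, so each nesting level lands in a distinct
region of the same private stack:

    #include <stdio.h>

    /* toy "per-CPU private stack" and the pointer playing the role of r9 */
    static long priv_stack[64];
    static char *fp = (char *)&priv_stack[64];   /* frames grow down */

    static void subprog(void)
    {
            /* callee inherits fp and addresses its frame at negative offsets */
            int *slot = (int *)(fp - sizeof(int));

            *slot = 42;
            printf("subprog slot %p = %d\n", (void *)slot, *slot);
    }

    static void entry(void)
    {
            char *saved = fp;                       /* "push r9" */
            int *slot = (int *)(fp - sizeof(int));

            *slot = 7;                              /* entry's own stack state */
            fp -= 16;                               /* adjust by entry's *own*
                                                     * stack usage before the call */
            subprog();
            fp = saved;                             /* "pop r9" */
            printf("entry slot still %d\n", *slot); /* prints 7, not clobbered */
    }

    int main(void)
    {
            entry();
            return 0;
    }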

Another question I have is whether it would be possible to just plain
set rbp to private stack and keep using rbp in such a way that stack
traces are preserved? I.e., save the return address on the private stack so
unwinders can correctly jump back to the kernel's stack?

How stupid is what I propose above?


> So the number of insns is increased by 1 + num_of_calls * 2.
> Here the number of calls are those calls in the final jited binary.
> Comparing function call itself, the push/pop overhead should be
> minimum in most common cases.
>
> Our original use case is for sched-ext nested scheduler. This will be done
> in the future.
>
>   [1] https://lore.kernel.org/bpf/707970c5-6bba-450a-be08-adf24d8b9276@linux.dev/T/
>
> Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
> ---
>  arch/x86/net/bpf_jit_comp.c | 63 ++++++++++++++++++++++++++++++++++---
>  include/linux/bpf.h         |  2 ++
>  kernel/bpf/core.c           | 20 ++++++++++++
>  kernel/bpf/syscall.c        |  1 +
>  4 files changed, 82 insertions(+), 4 deletions(-)
>

[...]
Eduard Zingerman July 22, 2024, 3:33 a.m. UTC | #2
Hi Yonghong,

In general I think that changes in this patch are logical and make sense.
I have a suggestion regarding testing JIT related changes.

We currently lack a convenient way to verify jit behaviour modulo
runtime tests. I think we should have a capability to write tests like below:

    SEC("tp")
    __jited_x86("f:	endbr64")
    __jited_x86("13:	movabs $0x.*,%r9")
    __jited_x86("1d:	add    %gs:0x.*,%r9")
    __jited_x86("26:	mov    $0x1,%edi")
    __jited_x86("2b:	mov    %rdi,-0x8(%r9)")
    __jited_x86("2f:	mov    -0x8(%r9),%rdi")
    __jited_x86("33:	xor    %eax,%eax")
    __jited_x86("35:	lock xchg %rax,-0x8(%r9)")
    __jited_x86("3a:	lock xadd %rax,-0x8(%r9)")
    __naked void stack_access_insns(void)
    {
    	asm volatile (
    	"r1 = 1;"
    	"*(u64 *)(r10 - 8) = r1;"
    	"r1 = *(u64 *)(r10 - 8);"
    	"r0 = 0;"
    	"r0 = xchg_64(r10 - 8, r0);"
    	"r0 = atomic_fetch_add((u64 *)(r10 - 8), r0);"
    	"exit;"
    	::: __clobber_all);
    }

In the following branch I explored a way to add such capability:
https://github.com/eddyz87/bpf/tree/yhs-private-stack-plus-jit-testing

Besides testing the exact translation, such tests also provide a good
starting point for people trying to figure out how some jit features work.

The below two commits are the gist of the feature:
8f9361be2fb3 ("selftests/bpf: __jited_x86 test tag to check x86 assembly after jit")
0156b148b5b4 ("selftests/bpf: utility function to get program disassembly after jit")

For "0156b148b5b4" I opted to do a popen() call and execute bpftool process,
an alternative would be to:
a. either link tools/bpf/bpftool/jit_disasm.c as a part of the
   test_progs executable;
b. call libbfd (binutils dis-assembler) directly from the bpftool.

Currently bpftool can use two dis-assemblers: libbfd and llvm library,
depending on the build environment. For CI builds libbfd is used.
I don't know if llvm and libbfd always produce same output for
identical binary code. Imo, if people are Ok with adding libbfd
dependency to test_progs, option (b) is the best. If folks on the
mailing list agree with this, I can work on updating the patches.
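
For reference, a popen()-based helper can be as simple as the following
(a sketch only; get_jited_disasm() and its error handling are
illustrative, not the code in the branch above):

    #include <errno.h>
    #include <stdio.h>

    /* dump the jited code of a loaded program via bpftool */
    static int get_jited_disasm(int prog_id, char *buf, size_t buf_sz)
    {
            char cmd[64];
            size_t n;
            FILE *f;

            snprintf(cmd, sizeof(cmd), "bpftool prog dump jited id %d", prog_id);
            f = popen(cmd, "r");
            if (!f)
                    return -errno;
            n = fread(buf, 1, buf_sz - 1, f);
            buf[n] = '\0';
            return pclose(f) == 0 ? 0 : -EINVAL;
    }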

-------------

Aside from testing, I agree with Andrii regarding rbp usage;
it seems like it should be possible to do the following in the prologue:

    movabs $0x...,%rsp
    add %gs:0x...,%rsp
    push %rbp

and there would be no need to modify the translation for instructions
accessing r10, plus the debugger stack unwinding logic should still work?
Or am I mistaken?

Thanks,
Eduard
Yonghong Song July 22, 2024, 4:43 p.m. UTC | #3
On 7/19/24 8:28 PM, Andrii Nakryiko wrote:
> On Thu, Jul 18, 2024 at 1:52 PM Yonghong Song <yonghong.song@linux.dev> wrote:
>> The main motivation for private stack comes from nested
>> scheduler in sched-ext from Tejun. The basic idea is that
>>   - each cgroup will its own associated bpf program,
>>   - bpf program with parent cgroup will call bpf programs
>>     in immediate child cgroups.
>>
>> Let us say we have the following cgroup hierarchy:
>>    root_cg (prog0):
>>      cg1 (prog1):
>>        cg11 (prog11):
>>          cg111 (prog111)
>>          cg112 (prog112)
>>        cg12 (prog12):
>>          cg121 (prog121)
>>          cg122 (prog122)
>>      cg2 (prog2):
>>        cg21 (prog21)
>>        cg22 (prog22)
>>        cg23 (prog23)
>>
>> In the above example, prog0 will call a kfunc which will
>> call prog1 and prog2 to get sched info for cg1 and cg2 and
>> then the information is summarized and sent back to prog0.
>> Similarly, prog11 and prog12 will be invoked in the kfunc
>> and the result will be summarized and sent back to prog1, etc.
>>
>> Currently, for each thread, the x86 kernel allocate 8KB stack.
>> The each bpf program (including its subprograms) has maximum
>> 512B stack size to avoid potential stack overflow.
>> And nested bpf programs increase the risk of stack overflow.
>> To avoid potential stack overflow caused by bpf programs,
>> this patch implemented a private stack so bpf program stack
>> space is allocated dynamically when the program is jited.
>> Such private stack is applied to tracing programs like
>> kprobe/uprobe, perf_event, tracepoint, raw tracepoint and
>> tracing.
>>
>> But more than one instance of the same bpf program may
>> run in the system. To make things simple, percpu private
>> stack is allocated for each program, so if the same program
>> is running on different cpus concurrently, we won't have
>> any issue. Note that the kernel already have logic to prevent
>> the recursion for the same bpf program on the same cpu
>> (kprobe, fentry, etc.).
>>
>> The patch implemented a percpu private stack based approach
>> for x86 arch.
>>    - The stack size will be 0 and any stack access is from
>>      jit-time allocated percpu storage.
>>    - In the beginning of jit, r9 is used to save percpu
>>      private stack pointer.
>>    - Each rbp in the bpf asm insn is replaced by r9.
>>    - For each call, push r9 before the call and pop r9
>>      after the call to preserve r9 value.
>>
>> Compared to previous RFC patch [1], this patch added
>> some conditions to enable private stack, e.g., verifier
>> calculated stack size, prog type, etc. The new patch
>> also added a performance test to compare private stack
>> vs. no private stack.
>>
>> The following are some code example to illustrate the idea
>> for selftest cgroup_skb_sk_lookup:
>>
>>     the existing code                        the private-stack approach code
>>     endbr64                                  endbr64
>>     nop    DWORD PTR [rax+rax*1+0x0]         nop    DWORD PTR [rax+rax*1+0x0]
>>     xchg   ax,ax                             xchg   ax,ax
>>     push   rbp                               push   rbp
>>     mov    rbp,rsp                           mov    rbp,rsp
>>     endbr64                                  endbr64
>>     sub    rsp,0x68
>>     push   rbx                               push   rbx
>>     ...                                      ...
>>     ...                                      mov    r9d,0x8c1c860
>>     ...                                      add    r9,QWORD PTR gs:0x21a00
>>     ...                                      ...
>>     mov    rdx,rbp                           mov    rdx, r9
>>     add    rdx,0xffffffffffffffb4            rdx,0xffffffffffffffb4
>>     ...                                      ...
>>     mov    ecx,0x28                          mov    ecx,0x28
>>                                              push   r9
>>     call   0xffffffffe305e474                call   0xffffffffe305e524
>>                                              pop    r9
>>     mov    rdi,rax                           mov    rdi,rax
>>     ...                                      ...
>>     movzx  rdi,BYTE PTR [rbp-0x46]           movzx  rdi,BYTE PTR [r9-0x46]
>>     ...                                      ...
>>
> Eduard nerd-sniped me today with this a bit... :)
>
> I have a few questions and suggestions.
>
> So it seems like each *subprogram* (not the entire BPF program) gets
> its own per-CPU private stack allocation. Is that intentional? That

Currently yes. The reason is the same prog could be run on different
cpus at the same time.

> seems a bit unnecessary. It also prevents any sort of actual
> recursion. Not sure if it's possible to write recursive BPF subprogram
> today, verifier seems to reject obvious limited recursion cases, but
> still, eventually we might need/want to support that, and this will be
> just another hurdle to overcome (so it's best to avoid adding it in
> the first place).
>
> I'm sure Eduard is going to try something like below and it will
> probably break badly (I haven't tried, sorry):
>
> int entry(void *ctx);
>
> struct {
>          __uint(type, BPF_MAP_TYPE_PROG_ARRAY);
>          __uint(max_entries, 1);
>          __uint(key_size, sizeof(__u32));
>          __array(values, int (void *));
> } prog_array_init SEC(".maps") = {
>          .values = {
>                  [0] = (void *)&entry,
>          },
> };
>
> static __noinline int subprog1(void)
> {
>      <some state on the stack>
>
>      /* here entry will replace subprog1, and so we'll have
>       * entry -> entry -> entry -> ..... <tail call limit> -> subprog1
>       */
>      bpf_tail_call(ctx, &prog_array_init, 0);
>
>      return 0;
> }
>
>
> SEC("raw_tp/sys_enter")
> int entry(void *ctx)
> {
>       <some state on the stack>
>
>       subprog1();
> }
>
> And we effectively have limited recursion where the entry's stack
> state is clobbered, no?
>
> So it seems like we need to support recursion.
>
>
> So, the question I have is. Why not do the following:
> a) only setup r9 *once* in entry program's prologue (before tail call
> jump target)
> b) before each call we can adjust r9 with current prog/subprog's
> maximum *own* stack, something like:
>
> push r9;
> r9 += 128; // 128 is subprog's stack usage
> call <some-subprog>
> pop r9;
>
> The idea being that on tail call or in subprog call we assume r9 is
> already pointing to the right place. We can probably also figure out
> how to avoid push/pop r9 if we make sure that subprogram always
> restores r9 (taking tail calls into account and all that, of course)?
>
> Is this feasible?

This is possible. I actually hacked up such an idea quickly. The basic
idea is to push the frame pointer as an additional argument to the bpf
static sub-prog. This is a little bit complicated. It will probably
save some stack space, but I am not sure how much.

>
> Another question I have is whether it would be possible to just plain
> set rbp to private stack and keep using rbp in such a way that stack
> traces are preserved? I.e., save return address on private stack to
> unwinders can correctly jump back to kernel's stack?

I also tried this approach earlier. But it is very tricky: we need to
modify rbp/rsp and additional jit code will be added.
If an interrupt happens, we will not be able to get a reliable stack trace.

>
> How stupid is what I propose above?
>
>
>> So the number of insns is increased by 1 + num_of_calls * 2.
>> Here the number of calls are those calls in the final jited binary.
>> Comparing function call itself, the push/pop overhead should be
>> minimum in most common cases.
>>
>> Our original use case is for sched-ext nested scheduler. This will be done
>> in the future.
>>
>>    [1] https://lore.kernel.org/bpf/707970c5-6bba-450a-be08-adf24d8b9276@linux.dev/T/
>>
>> Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
>> ---
>>   arch/x86/net/bpf_jit_comp.c | 63 ++++++++++++++++++++++++++++++++++---
>>   include/linux/bpf.h         |  2 ++
>>   kernel/bpf/core.c           | 20 ++++++++++++
>>   kernel/bpf/syscall.c        |  1 +
>>   4 files changed, 82 insertions(+), 4 deletions(-)
>>
> [...]
Yonghong Song July 22, 2024, 4:54 p.m. UTC | #4
On 7/21/24 8:33 PM, Eduard Zingerman wrote:
> Hi Yonghong,
>
> In general I think that changes in this patch are logical and make sense.
> I have a suggestion regarding testing JIT related changes.
>
> We currently lack a convenient way to verify jit behaviour modulo
> runtime tests. I think we should have a capability to write tests like below:
>
>      SEC("tp")
>      __jited_x86("f:	endbr64")
>      __jited_x86("13:	movabs $0x.*,%r9")
>      __jited_x86("1d:	add    %gs:0x.*,%r9")
>      __jited_x86("26:	mov    $0x1,%edi")
>      __jited_x86("2b:	mov    %rdi,-0x8(%r9)")
>      __jited_x86("2f:	mov    -0x8(%r9),%rdi")
>      __jited_x86("33:	xor    %eax,%eax")
>      __jited_x86("35:	lock xchg %rax,-0x8(%r9)")
>      __jited_x86("3a:	lock xadd %rax,-0x8(%r9)")
>      __naked void stack_access_insns(void)
>      {
>      	asm volatile (
>      	"r1 = 1;"
>      	"*(u64 *)(r10 - 8) = r1;"
>      	"r1 = *(u64 *)(r10 - 8);"
>      	"r0 = 0;"
>      	"r0 = xchg_64(r10 - 8, r0);"
>      	"r0 = atomic_fetch_add((u64 *)(r10 - 8), r0);"
>      	"exit;"
>      	::: __clobber_all);
>      }
>
> In the following branch I explored a way to add such capability:
> https://github.com/eddyz87/bpf/tree/yhs-private-stack-plus-jit-testing
>
> Beside testing exact translation, such tests also provide good
> starting point for people trying to figure out how some jit features work.
>
> The below two commits are the gist of the feature:
> 8f9361be2fb3 ("selftests/bpf: __jited_x86 test tag to check x86 assembly after jit")
> 0156b148b5b4 ("selftests/bpf: utility function to get program disassembly after jit")
>
> For "0156b148b5b4" I opted to do a popen() call and execute bpftool process,
> an alternative would be to:
> a. either link tools/bpf/bpftool/jit_disasm.c as a part of the
>     test_progs executable;
> b. call libbfd (binutils dis-assembler) directly from the bpftool.
>
> Currently bpftool can use two dis-assemblers: libbfd and llvm library,
> depending on the build environment. For CI builds libbfd is used.
> I don't know if llvm and libbfd always produce same output for
> identical binary code. Imo, if people are Ok with adding libbfd
> dependency to test_progs, option (b) is the best. If folks on the
> mailing list agree with this, I can work on updating the patches.

I think this is a good idea in the long term.
Let me try with your patch.

>
> -------------
>
> Aside from testing I agree with Andrii regarding rbp usage,
> it seems like it should be possible to do the following in prologue:
>
>      movabs $0x...,%rsp
>      add %gs:0x...,%rsp
>      push %rbp
>
> and there would be no need to modify translation for instructions
> accessing r10, plus debugger stack unrolling logic should still work?.
> Or am I mistaken?

This may not work. The 'push %rbp' does not change the %rbp value, which is
still the original %rbp.

>
> Thanks,
> Eduard
Alexei Starovoitov July 22, 2024, 5:51 p.m. UTC | #5
On Sun, Jul 21, 2024 at 8:33 PM Eduard Zingerman <eddyz87@gmail.com> wrote:
>

> Aside from testing I agree with Andrii regarding rbp usage,
> it seems like it should be possible to do the following in prologue:
>
>     movabs $0x...,%rsp
>     add %gs:0x...,%rsp
>     push %rbp
>
> and there would be no need to modify translation for instructions
> accessing r10, plus debugger stack unrolling logic should still work?.
> Or am I mistaken?

It's not that simple.
The above sequence violates -mno-red-zone.
The part of the fix may look like:
movabs .., rax
add %gs.., rax
mov rbp, qword ptr [rax - ...]

mov rax, rsp
mov rax, rbp
sub rsp, ...

It's probably correct from the -mno-red-zone pov and
the end result is maybe correct for stack unwind,
but if an irq happens in the middle it won't crash,
yet the unwind will not work.
The main reason to use r9 is to have a valid unwind
at any point of the prog.
Eduard Zingerman July 22, 2024, 5:53 p.m. UTC | #6
On Mon, 2024-07-22 at 09:54 -0700, Yonghong Song wrote:

[...]

> > For "0156b148b5b4" I opted to do a popen() call and execute bpftool process,
> > an alternative would be to:
> > a. either link tools/bpf/bpftool/jit_disasm.c as a part of the
> >     test_progs executable;
> > b. call libbfd (binutils dis-assembler) directly from the bpftool.
> > 
> > Currently bpftool can use two dis-assemblers: libbfd and llvm library,
> > depending on the build environment. For CI builds libbfd is used.
> > I don't know if llvm and libbfd always produce same output for
> > identical binary code. Imo, if people are Ok with adding libbfd
> > dependency to test_progs, option (b) is the best. If folks on the
> > mailing list agree with this, I can work on updating the patches.
> 
> I think this is a good idea in the long time.
> Let me try with your patch.

What do you think about direct dependency on libbfd for test_progs,
should I update the disassembly function or popen'ing bpftool is fine?
I'd prefer libbfd dependency, tbh.

[...]
Eduard Zingerman July 22, 2024, 6:22 p.m. UTC | #7
On Mon, 2024-07-22 at 10:51 -0700, Alexei Starovoitov wrote:

[...]

> It's not that simple.
> Above sequence violates -mno-red-zone.
> The part of the fix may look like:
> movabs .., rax
> add %gs.., rax
> mov rbp, qword ptr [rax - ...]
> 
> mov rax, rsp
> mox rax, rbp
> sub rsp, ...
> 
> it's probably correct from mno-red-zone pov and
> end result is maybe correct for stack unwind,
> but if irq happens in the middle it won't crash,
> but unwind will not work.
> The main reason to use r9 is to have valid unwind
> at any point of the prog.

Oh, I see, bad things would happen if this sequence:

      movabs $0x...,%rsp
      add %gs:0x...,%rsp

would be split by an interrupt.
However, I don't understand why 'push %rbp' violates the red zone.
In any case, the interrupt argument is sufficient,
thank you for explaining.
Alexei Starovoitov July 22, 2024, 8:08 p.m. UTC | #8
On Mon, Jul 22, 2024 at 11:22 AM Eduard Zingerman <eddyz87@gmail.com> wrote:
>
> On Mon, 2024-07-22 at 10:51 -0700, Alexei Starovoitov wrote:
>
> [...]
>
> > It's not that simple.
> > Above sequence violates -mno-red-zone.
> > The part of the fix may look like:
> > movabs .., rax
> > add %gs.., rax
> > mov rbp, qword ptr [rax - ...]
> >
> > mov rax, rsp
> > mox rax, rbp
> > sub rsp, ...
> >
> > it's probably correct from mno-red-zone pov and
> > end result is maybe correct for stack unwind,
> > but if irq happens in the middle it won't crash,
> > but unwind will not work.
> > The main reason to use r9 is to have valid unwind
> > at any point of the prog.
>
> Oh, I see, bad things would happen if this sequence:
>
>       movabs $0x...,%rsp
>       add %gs:0x...,%rsp
>
> Would be split by an interrupt.
> However, I don't understand why 'push %rbp' violates red zone.
> In any case, the interrupt argument is sufficient,
> thank you for explaining.

'push rbp' itself doesn't break the red zone.
If we do:
movabs .., rax
add %gs.., rax
mov rax, rsp
push rbp


at the time of setting rsp the unwind is broken.
Hence the idea to mov rbp, qword ptr [rax - ...]
before setting rsp.
We have to maintain an unwindable stack at all times.
But if we overwrite rsp, it means everything will go into
this new memory.
We'd need another 16k of space in there for everything
that the bpf prog can call, since kernel code will be using
that area from the moment of the switch.
At the end we'd have to restore the original stack somehow.
Instead of a single 'leave' insn, the sequence has to preserve
unwinding. It all looks very tricky and fragile.
Andrii Nakryiko July 22, 2024, 8:57 p.m. UTC | #9
On Fri, Jul 19, 2024 at 8:28 PM Andrii Nakryiko
<andrii.nakryiko@gmail.com> wrote:
>
> On Thu, Jul 18, 2024 at 1:52 PM Yonghong Song <yonghong.song@linux.dev> wrote:
> >
> > The main motivation for private stack comes from nested
> > scheduler in sched-ext from Tejun. The basic idea is that
> >  - each cgroup will its own associated bpf program,
> >  - bpf program with parent cgroup will call bpf programs
> >    in immediate child cgroups.
> >
> > Let us say we have the following cgroup hierarchy:
> >   root_cg (prog0):
> >     cg1 (prog1):
> >       cg11 (prog11):
> >         cg111 (prog111)
> >         cg112 (prog112)
> >       cg12 (prog12):
> >         cg121 (prog121)
> >         cg122 (prog122)
> >     cg2 (prog2):
> >       cg21 (prog21)
> >       cg22 (prog22)
> >       cg23 (prog23)
> >
> > In the above example, prog0 will call a kfunc which will
> > call prog1 and prog2 to get sched info for cg1 and cg2 and
> > then the information is summarized and sent back to prog0.
> > Similarly, prog11 and prog12 will be invoked in the kfunc
> > and the result will be summarized and sent back to prog1, etc.
> >
> > Currently, for each thread, the x86 kernel allocate 8KB stack.
> > The each bpf program (including its subprograms) has maximum
> > 512B stack size to avoid potential stack overflow.
> > And nested bpf programs increase the risk of stack overflow.
> > To avoid potential stack overflow caused by bpf programs,
> > this patch implemented a private stack so bpf program stack
> > space is allocated dynamically when the program is jited.
> > Such private stack is applied to tracing programs like
> > kprobe/uprobe, perf_event, tracepoint, raw tracepoint and
> > tracing.
> >
> > But more than one instance of the same bpf program may
> > run in the system. To make things simple, percpu private
> > stack is allocated for each program, so if the same program
> > is running on different cpus concurrently, we won't have
> > any issue. Note that the kernel already have logic to prevent
> > the recursion for the same bpf program on the same cpu
> > (kprobe, fentry, etc.).
> >
> > The patch implemented a percpu private stack based approach
> > for x86 arch.
> >   - The stack size will be 0 and any stack access is from
> >     jit-time allocated percpu storage.
> >   - In the beginning of jit, r9 is used to save percpu
> >     private stack pointer.
> >   - Each rbp in the bpf asm insn is replaced by r9.
> >   - For each call, push r9 before the call and pop r9
> >     after the call to preserve r9 value.
> >
> > Compared to previous RFC patch [1], this patch added
> > some conditions to enable private stack, e.g., verifier
> > calculated stack size, prog type, etc. The new patch
> > also added a performance test to compare private stack
> > vs. no private stack.
> >
> > The following are some code example to illustrate the idea
> > for selftest cgroup_skb_sk_lookup:
> >
> >    the existing code                        the private-stack approach code
> >    endbr64                                  endbr64
> >    nop    DWORD PTR [rax+rax*1+0x0]         nop    DWORD PTR [rax+rax*1+0x0]
> >    xchg   ax,ax                             xchg   ax,ax
> >    push   rbp                               push   rbp
> >    mov    rbp,rsp                           mov    rbp,rsp
> >    endbr64                                  endbr64
> >    sub    rsp,0x68
> >    push   rbx                               push   rbx
> >    ...                                      ...
> >    ...                                      mov    r9d,0x8c1c860
> >    ...                                      add    r9,QWORD PTR gs:0x21a00
> >    ...                                      ...
> >    mov    rdx,rbp                           mov    rdx, r9
> >    add    rdx,0xffffffffffffffb4            rdx,0xffffffffffffffb4
> >    ...                                      ...
> >    mov    ecx,0x28                          mov    ecx,0x28
> >                                             push   r9
> >    call   0xffffffffe305e474                call   0xffffffffe305e524
> >                                             pop    r9
> >    mov    rdi,rax                           mov    rdi,rax
> >    ...                                      ...
> >    movzx  rdi,BYTE PTR [rbp-0x46]           movzx  rdi,BYTE PTR [r9-0x46]
> >    ...                                      ...
> >
>
> Eduard nerd-sniped me today with this a bit... :)
>
> I have a few questions and suggestions.
>
> So it seems like each *subprogram* (not the entire BPF program) gets
> its own per-CPU private stack allocation. Is that intentional? That
> seems a bit unnecessary. It also prevents any sort of actual
> recursion. Not sure if it's possible to write recursive BPF subprogram
> today, verifier seems to reject obvious limited recursion cases, but
> still, eventually we might need/want to support that, and this will be
> just another hurdle to overcome (so it's best to avoid adding it in
> the first place).
>
> I'm sure Eduard is going to try something like below and it will
> probably break badly (I haven't tried, sorry):
>
> int entry(void *ctx);
>
> struct {
>         __uint(type, BPF_MAP_TYPE_PROG_ARRAY);
>         __uint(max_entries, 1);
>         __uint(key_size, sizeof(__u32));
>         __array(values, int (void *));
> } prog_array_init SEC(".maps") = {
>         .values = {
>                 [0] = (void *)&entry,
>         },
> };
>
> static __noinline int subprog1(void)
> {
>     <some state on the stack>
>
>     /* here entry will replace subprog1, and so we'll have
>      * entry -> entry -> entry -> ..... <tail call limit> -> subprog1
>      */
>     bpf_tail_call(ctx, &prog_array_init, 0);
>
>     return 0;
> }
>
>
> SEC("raw_tp/sys_enter")
> int entry(void *ctx)
> {
>      <some state on the stack>
>
>      subprog1();
> }
>
> And we effectively have limited recursion where the entry's stack
> state is clobbered, no?
>
> So it seems like we need to support recursion.
>

How come everyone just completely ignored the main point of my entire
email and a real problem that has to be solved?...

Anyway, I did write the program below:

$ cat minimal.bpf.c
// SPDX-License-Identifier: GPL-2.0 OR BSD-3-Clause
/* Copyright (c) 2020 Facebook */
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

char LICENSE[] SEC("license") = "Dual BSD/GPL";

int my_pid = 0;

int handle_tp(void *ctx);

struct {
        __uint(type, BPF_MAP_TYPE_PROG_ARRAY);
        __uint(max_entries, 1);
        __uint(key_size, sizeof(__u32));
        __array(values, int (void *));
} prog_array_init SEC(".maps") = {
        .values = {
                [0] = (void *)&handle_tp,
        },
};

static __noinline int subprog(void *ctx)
{
        static int cnt;

        cnt++;

        bpf_printk("SUBPROG - BEFORE %d", cnt);

        bpf_tail_call(ctx, &prog_array_init, 0);

        bpf_printk("SUBPROG - AFTER %d", cnt);

        return 0;
}

SEC("tp/syscalls/sys_enter_write")
int handle_tp(void *ctx)
{
        static int cnt;
        int pid = bpf_get_current_pid_tgid() >> 32;

        if (pid != my_pid)
                return 0;

        cnt++;

        bpf_printk("ENTRY - BEFORE %d", cnt);

        subprog(ctx);

        bpf_printk("ENTRY - AFTER %d", cnt);

        return 0;
}


And I triggered one write syscall, getting the log below. You can see
that only the subprog is replaced (we only get "SUBPROG - AFTER 34" due
to the tail call limit). And we do indeed get lots of entry program
recursion.

My main point is that we *need to support recursion*. rbp is a distraction, sorry.

$ sudo cat /sys/kernel/tracing/trace_pipe
         minimal-1219321 [025] ....1 8119446.322300: bpf_trace_printk:
ENTRY - BEFORE 1
         minimal-1219321 [025] ....1 8119446.322303: bpf_trace_printk:
SUBPROG - BEFORE 1
         minimal-1219321 [025] ....1 8119446.322304: bpf_trace_printk:
ENTRY - BEFORE 2
         minimal-1219321 [025] ....1 8119446.322304: bpf_trace_printk:
SUBPROG - BEFORE 2
         minimal-1219321 [025] ....1 8119446.322304: bpf_trace_printk:
ENTRY - BEFORE 3
         minimal-1219321 [025] ....1 8119446.322305: bpf_trace_printk:
SUBPROG - BEFORE 3
         minimal-1219321 [025] ....1 8119446.322305: bpf_trace_printk:
ENTRY - BEFORE 4
         minimal-1219321 [025] ....1 8119446.322305: bpf_trace_printk:
SUBPROG - BEFORE 4
         minimal-1219321 [025] ....1 8119446.322305: bpf_trace_printk:
ENTRY - BEFORE 5
         minimal-1219321 [025] ....1 8119446.322306: bpf_trace_printk:
SUBPROG - BEFORE 5
         minimal-1219321 [025] ....1 8119446.322306: bpf_trace_printk:
ENTRY - BEFORE 6
         minimal-1219321 [025] ....1 8119446.322306: bpf_trace_printk:
SUBPROG - BEFORE 6
         minimal-1219321 [025] ....1 8119446.322307: bpf_trace_printk:
ENTRY - BEFORE 7
         minimal-1219321 [025] ....1 8119446.322307: bpf_trace_printk:
SUBPROG - BEFORE 7
         minimal-1219321 [025] ....1 8119446.322307: bpf_trace_printk:
ENTRY - BEFORE 8
         minimal-1219321 [025] ....1 8119446.322307: bpf_trace_printk:
SUBPROG - BEFORE 8
         minimal-1219321 [025] ....1 8119446.322308: bpf_trace_printk:
ENTRY - BEFORE 9
         minimal-1219321 [025] ....1 8119446.322308: bpf_trace_printk:
SUBPROG - BEFORE 9
         minimal-1219321 [025] ....1 8119446.322308: bpf_trace_printk:
ENTRY - BEFORE 10
         minimal-1219321 [025] ....1 8119446.322308: bpf_trace_printk:
SUBPROG - BEFORE 10
         minimal-1219321 [025] ....1 8119446.322309: bpf_trace_printk:
ENTRY - BEFORE 11
         minimal-1219321 [025] ....1 8119446.322309: bpf_trace_printk:
SUBPROG - BEFORE 11
         minimal-1219321 [025] ....1 8119446.322309: bpf_trace_printk:
ENTRY - BEFORE 12
         minimal-1219321 [025] ....1 8119446.322309: bpf_trace_printk:
SUBPROG - BEFORE 12
         minimal-1219321 [025] ....1 8119446.322310: bpf_trace_printk:
ENTRY - BEFORE 13
         minimal-1219321 [025] ....1 8119446.322310: bpf_trace_printk:
SUBPROG - BEFORE 13
         minimal-1219321 [025] ....1 8119446.322310: bpf_trace_printk:
ENTRY - BEFORE 14
         minimal-1219321 [025] ....1 8119446.322312: bpf_trace_printk:
SUBPROG - BEFORE 14
         minimal-1219321 [025] ....1 8119446.322313: bpf_trace_printk:
ENTRY - BEFORE 15
         minimal-1219321 [025] ....1 8119446.322313: bpf_trace_printk:
SUBPROG - BEFORE 15
         minimal-1219321 [025] ....1 8119446.322313: bpf_trace_printk:
ENTRY - BEFORE 16
         minimal-1219321 [025] ....1 8119446.322313: bpf_trace_printk:
SUBPROG - BEFORE 16
         minimal-1219321 [025] ....1 8119446.322314: bpf_trace_printk:
ENTRY - BEFORE 17
         minimal-1219321 [025] ....1 8119446.322314: bpf_trace_printk:
SUBPROG - BEFORE 17
         minimal-1219321 [025] ....1 8119446.322314: bpf_trace_printk:
ENTRY - BEFORE 18
         minimal-1219321 [025] ....1 8119446.322314: bpf_trace_printk:
SUBPROG - BEFORE 18
         minimal-1219321 [025] ....1 8119446.322315: bpf_trace_printk:
ENTRY - BEFORE 19
         minimal-1219321 [025] ....1 8119446.322315: bpf_trace_printk:
SUBPROG - BEFORE 19
         minimal-1219321 [025] ....1 8119446.322315: bpf_trace_printk:
ENTRY - BEFORE 20
         minimal-1219321 [025] ....1 8119446.322315: bpf_trace_printk:
SUBPROG - BEFORE 20
         minimal-1219321 [025] ....1 8119446.322316: bpf_trace_printk:
ENTRY - BEFORE 21
         minimal-1219321 [025] ....1 8119446.322316: bpf_trace_printk:
SUBPROG - BEFORE 21
         minimal-1219321 [025] ....1 8119446.322316: bpf_trace_printk:
ENTRY - BEFORE 22
         minimal-1219321 [025] ....1 8119446.322316: bpf_trace_printk:
SUBPROG - BEFORE 22
         minimal-1219321 [025] ....1 8119446.322316: bpf_trace_printk:
ENTRY - BEFORE 23
         minimal-1219321 [025] ....1 8119446.322317: bpf_trace_printk:
SUBPROG - BEFORE 23
         minimal-1219321 [025] ....1 8119446.322318: bpf_trace_printk:
ENTRY - BEFORE 24
         minimal-1219321 [025] ....1 8119446.322318: bpf_trace_printk:
SUBPROG - BEFORE 24
         minimal-1219321 [025] ....1 8119446.322319: bpf_trace_printk:
ENTRY - BEFORE 25
         minimal-1219321 [025] ....1 8119446.322319: bpf_trace_printk:
SUBPROG - BEFORE 25
         minimal-1219321 [025] ....1 8119446.322319: bpf_trace_printk:
ENTRY - BEFORE 26
         minimal-1219321 [025] ....1 8119446.322320: bpf_trace_printk:
SUBPROG - BEFORE 26
         minimal-1219321 [025] ....1 8119446.322321: bpf_trace_printk:
ENTRY - BEFORE 27
         minimal-1219321 [025] ....1 8119446.322321: bpf_trace_printk:
SUBPROG - BEFORE 27
         minimal-1219321 [025] ....1 8119446.322321: bpf_trace_printk:
ENTRY - BEFORE 28
         minimal-1219321 [025] ....1 8119446.322322: bpf_trace_printk:
SUBPROG - BEFORE 28
         minimal-1219321 [025] ....1 8119446.322322: bpf_trace_printk:
ENTRY - BEFORE 29
         minimal-1219321 [025] ....1 8119446.322323: bpf_trace_printk:
SUBPROG - BEFORE 29
         minimal-1219321 [025] ....1 8119446.322323: bpf_trace_printk:
ENTRY - BEFORE 30
         minimal-1219321 [025] ....1 8119446.322324: bpf_trace_printk:
SUBPROG - BEFORE 30
         minimal-1219321 [025] ....1 8119446.322324: bpf_trace_printk:
ENTRY - BEFORE 31
         minimal-1219321 [025] ....1 8119446.322324: bpf_trace_printk:
SUBPROG - BEFORE 31
         minimal-1219321 [025] ....1 8119446.322324: bpf_trace_printk:
ENTRY - BEFORE 32
         minimal-1219321 [025] ....1 8119446.322325: bpf_trace_printk:
SUBPROG - BEFORE 32
         minimal-1219321 [025] ....1 8119446.322325: bpf_trace_printk:
ENTRY - BEFORE 33
         minimal-1219321 [025] ....1 8119446.322326: bpf_trace_printk:
SUBPROG - BEFORE 33
         minimal-1219321 [025] ....1 8119446.322327: bpf_trace_printk:
ENTRY - BEFORE 34
         minimal-1219321 [025] ....1 8119446.322327: bpf_trace_printk:
SUBPROG - BEFORE 34
         minimal-1219321 [025] ....1 8119446.322327: bpf_trace_printk:
SUBPROG - AFTER 34
         minimal-1219321 [025] ....1 8119446.322328: bpf_trace_printk:
ENTRY - AFTER 34
         minimal-1219321 [025] ....1 8119446.322328: bpf_trace_printk:
ENTRY - AFTER 34
         minimal-1219321 [025] ....1 8119446.322328: bpf_trace_printk:
ENTRY - AFTER 34
         minimal-1219321 [025] ....1 8119446.322328: bpf_trace_printk:
ENTRY - AFTER 34
         minimal-1219321 [025] ....1 8119446.322329: bpf_trace_printk:
ENTRY - AFTER 34
         minimal-1219321 [025] ....1 8119446.322329: bpf_trace_printk:
ENTRY - AFTER 34
         minimal-1219321 [025] ....1 8119446.322329: bpf_trace_printk:
ENTRY - AFTER 34
         minimal-1219321 [025] ....1 8119446.322329: bpf_trace_printk:
ENTRY - AFTER 34
         minimal-1219321 [025] ....1 8119446.322329: bpf_trace_printk:
ENTRY - AFTER 34
         minimal-1219321 [025] ....1 8119446.322330: bpf_trace_printk:
ENTRY - AFTER 34
         minimal-1219321 [025] ....1 8119446.322330: bpf_trace_printk:
ENTRY - AFTER 34
         minimal-1219321 [025] ....1 8119446.322330: bpf_trace_printk:
ENTRY - AFTER 34
         minimal-1219321 [025] ....1 8119446.322330: bpf_trace_printk:
ENTRY - AFTER 34
         minimal-1219321 [025] ....1 8119446.322331: bpf_trace_printk:
ENTRY - AFTER 34
         minimal-1219321 [025] ....1 8119446.322331: bpf_trace_printk:
ENTRY - AFTER 34
         minimal-1219321 [025] ....1 8119446.322331: bpf_trace_printk:
ENTRY - AFTER 34
         minimal-1219321 [025] ....1 8119446.322331: bpf_trace_printk:
ENTRY - AFTER 34
         minimal-1219321 [025] ....1 8119446.322331: bpf_trace_printk:
ENTRY - AFTER 34
         minimal-1219321 [025] ....1 8119446.322332: bpf_trace_printk:
ENTRY - AFTER 34
         minimal-1219321 [025] ....1 8119446.322332: bpf_trace_printk:
ENTRY - AFTER 34
         minimal-1219321 [025] ....1 8119446.322332: bpf_trace_printk:
ENTRY - AFTER 34
         minimal-1219321 [025] ....1 8119446.322332: bpf_trace_printk:
ENTRY - AFTER 34
         minimal-1219321 [025] ....1 8119446.322332: bpf_trace_printk:
ENTRY - AFTER 34
         minimal-1219321 [025] ....1 8119446.322333: bpf_trace_printk:
ENTRY - AFTER 34
         minimal-1219321 [025] ....1 8119446.322333: bpf_trace_printk:
ENTRY - AFTER 34
         minimal-1219321 [025] ....1 8119446.322333: bpf_trace_printk:
ENTRY - AFTER 34
         minimal-1219321 [025] ....1 8119446.322333: bpf_trace_printk:
ENTRY - AFTER 34
         minimal-1219321 [025] ....1 8119446.322334: bpf_trace_printk:
ENTRY - AFTER 34
         minimal-1219321 [025] ....1 8119446.322334: bpf_trace_printk:
ENTRY - AFTER 34
         minimal-1219321 [025] ....1 8119446.322334: bpf_trace_printk:
ENTRY - AFTER 34
         minimal-1219321 [025] ....1 8119446.322334: bpf_trace_printk:
ENTRY - AFTER 34
         minimal-1219321 [025] ....1 8119446.322334: bpf_trace_printk:
ENTRY - AFTER 34
         minimal-1219321 [025] ....1 8119446.322335: bpf_trace_printk:
ENTRY - AFTER 34
         minimal-1219321 [025] ....1 8119446.322335: bpf_trace_printk:
ENTRY - AFTER 34


>
> So, the question I have is. Why not do the following:
> a) only setup r9 *once* in entry program's prologue (before tail call
> jump target)
> b) before each call we can adjust r9 with current prog/subprog's
> maximum *own* stack, something like:
>
> push r9;
> r9 += 128; // 128 is subprog's stack usage
> call <some-subprog>
> pop r9;
>
> The idea being that on tail call or in subprog call we assume r9 is
> already pointing to the right place. We can probably also figure out
> how to avoid push/pop r9 if we make sure that subprogram always
> restores r9 (taking tail calls into account and all that, of course)?
>
> Is this feasible?
>
> Another question I have is whether it would be possible to just plain
> set rbp to private stack and keep using rbp in such a way that stack
> traces are preserved? I.e., save return address on private stack to
> unwinders can correctly jump back to kernel's stack?
>
> How stupid is what I propose above?
>
>
> > So the number of insns is increased by 1 + num_of_calls * 2.
> > Here the number of calls are those calls in the final jited binary.
> > Comparing function call itself, the push/pop overhead should be
> > minimum in most common cases.
> >
> > Our original use case is for sched-ext nested scheduler. This will be done
> > in the future.
> >
> >   [1] https://lore.kernel.org/bpf/707970c5-6bba-450a-be08-adf24d8b9276@linux.dev/T/
> >
> > Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
> > ---
> >  arch/x86/net/bpf_jit_comp.c | 63 ++++++++++++++++++++++++++++++++++---
> >  include/linux/bpf.h         |  2 ++
> >  kernel/bpf/core.c           | 20 ++++++++++++
> >  kernel/bpf/syscall.c        |  1 +
> >  4 files changed, 82 insertions(+), 4 deletions(-)
> >
>
> [...]
Alexei Starovoitov July 23, 2024, 1:05 a.m. UTC | #10
On Mon, Jul 22, 2024 at 1:58 PM Andrii Nakryiko
<andrii.nakryiko@gmail.com> wrote:
>
> On Fri, Jul 19, 2024 at 8:28 PM Andrii Nakryiko
> <andrii.nakryiko@gmail.com> wrote:
> >
> > On Thu, Jul 18, 2024 at 1:52 PM Yonghong Song <yonghong.song@linux.dev> wrote:
> > >
> > > The main motivation for private stack comes from nested
> > > scheduler in sched-ext from Tejun. The basic idea is that
> > >  - each cgroup will its own associated bpf program,
> > >  - bpf program with parent cgroup will call bpf programs
> > >    in immediate child cgroups.
> > >
> > > Let us say we have the following cgroup hierarchy:
> > >   root_cg (prog0):
> > >     cg1 (prog1):
> > >       cg11 (prog11):
> > >         cg111 (prog111)
> > >         cg112 (prog112)
> > >       cg12 (prog12):
> > >         cg121 (prog121)
> > >         cg122 (prog122)
> > >     cg2 (prog2):
> > >       cg21 (prog21)
> > >       cg22 (prog22)
> > >       cg23 (prog23)
> > >
> > > In the above example, prog0 will call a kfunc which will
> > > call prog1 and prog2 to get sched info for cg1 and cg2 and
> > > then the information is summarized and sent back to prog0.
> > > Similarly, prog11 and prog12 will be invoked in the kfunc
> > > and the result will be summarized and sent back to prog1, etc.
> > >
> > > Currently, for each thread, the x86 kernel allocate 8KB stack.
> > > The each bpf program (including its subprograms) has maximum
> > > 512B stack size to avoid potential stack overflow.
> > > And nested bpf programs increase the risk of stack overflow.
> > > To avoid potential stack overflow caused by bpf programs,
> > > this patch implemented a private stack so bpf program stack
> > > space is allocated dynamically when the program is jited.
> > > Such private stack is applied to tracing programs like
> > > kprobe/uprobe, perf_event, tracepoint, raw tracepoint and
> > > tracing.
> > >
> > > But more than one instance of the same bpf program may
> > > run in the system. To make things simple, percpu private
> > > stack is allocated for each program, so if the same program
> > > is running on different cpus concurrently, we won't have
> > > any issue. Note that the kernel already have logic to prevent
> > > the recursion for the same bpf program on the same cpu
> > > (kprobe, fentry, etc.).
> > >
> > > The patch implemented a percpu private stack based approach
> > > for x86 arch.
> > >   - The stack size will be 0 and any stack access is from
> > >     jit-time allocated percpu storage.
> > >   - In the beginning of jit, r9 is used to save percpu
> > >     private stack pointer.
> > >   - Each rbp in the bpf asm insn is replaced by r9.
> > >   - For each call, push r9 before the call and pop r9
> > >     after the call to preserve r9 value.
> > >
> > > Compared to previous RFC patch [1], this patch added
> > > some conditions to enable private stack, e.g., verifier
> > > calculated stack size, prog type, etc. The new patch
> > > also added a performance test to compare private stack
> > > vs. no private stack.
> > >
> > > The following are some code example to illustrate the idea
> > > for selftest cgroup_skb_sk_lookup:
> > >
> > >    the existing code                        the private-stack approach code
> > >    endbr64                                  endbr64
> > >    nop    DWORD PTR [rax+rax*1+0x0]         nop    DWORD PTR [rax+rax*1+0x0]
> > >    xchg   ax,ax                             xchg   ax,ax
> > >    push   rbp                               push   rbp
> > >    mov    rbp,rsp                           mov    rbp,rsp
> > >    endbr64                                  endbr64
> > >    sub    rsp,0x68
> > >    push   rbx                               push   rbx
> > >    ...                                      ...
> > >    ...                                      mov    r9d,0x8c1c860
> > >    ...                                      add    r9,QWORD PTR gs:0x21a00
> > >    ...                                      ...
> > >    mov    rdx,rbp                           mov    rdx, r9
> > >    add    rdx,0xffffffffffffffb4            rdx,0xffffffffffffffb4
> > >    ...                                      ...
> > >    mov    ecx,0x28                          mov    ecx,0x28
> > >                                             push   r9
> > >    call   0xffffffffe305e474                call   0xffffffffe305e524
> > >                                             pop    r9
> > >    mov    rdi,rax                           mov    rdi,rax
> > >    ...                                      ...
> > >    movzx  rdi,BYTE PTR [rbp-0x46]           movzx  rdi,BYTE PTR [r9-0x46]
> > >    ...                                      ...
> > >
> >
> > Eduard nerd-sniped me today with this a bit... :)
> >
> > I have a few questions and suggestions.
> >
> > So it seems like each *subprogram* (not the entire BPF program) gets
> > its own per-CPU private stack allocation. Is that intentional? That
> > seems a bit unnecessary. It also prevents any sort of actual
> > recursion. Not sure if it's possible to write recursive BPF subprogram
> > today, verifier seems to reject obvious limited recursion cases, but
> > still, eventually we might need/want to support that, and this will be
> > just another hurdle to overcome (so it's best to avoid adding it in
> > the first place).
> >
> > I'm sure Eduard is going to try something like below and it will
> > probably break badly (I haven't tried, sorry):
> >
> > int entry(void *ctx);
> >
> > struct {
> >         __uint(type, BPF_MAP_TYPE_PROG_ARRAY);
> >         __uint(max_entries, 1);
> >         __uint(key_size, sizeof(__u32));
> >         __array(values, int (void *));
> > } prog_array_init SEC(".maps") = {
> >         .values = {
> >                 [0] = (void *)&entry,
> >         },
> > };
> >
> > static __noinline int subprog1(void)
> > {
> >     <some state on the stack>
> >
> >     /* here entry will replace subprog1, and so we'll have
> >      * entry -> entry -> entry -> ..... <tail call limit> -> subprog1
> >      */
> >     bpf_tail_call(ctx, &prog_array_init, 0);
> >
> >     return 0;
> > }
> >
> >
> > SEC("raw_tp/sys_enter")
> > int entry(void *ctx)
> > {
> >      <some state on the stack>
> >
> >      subprog1();
> > }
> >
> > And we effectively have limited recursion where the entry's stack
> > state is clobbered, no?
> >
> > So it seems like we need to support recursion.
> >
>
> How come everyone just completely ignored the main point of my entire
> email and a real problem that has to be solved?...
>
> Anyways, I did write a below program:
>
> $ cat minimal.bpf.c
> // SPDX-License-Identifier: GPL-2.0 OR BSD-3-Clause
> /* Copyright (c) 2020 Facebook */
> #include <linux/bpf.h>
> #include <bpf/bpf_helpers.h>
>
> char LICENSE[] SEC("license") = "Dual BSD/GPL";
>
> int my_pid = 0;
>
> int handle_tp(void *ctx);
>
> struct {
>         __uint(type, BPF_MAP_TYPE_PROG_ARRAY);
>         __uint(max_entries, 1);
>         __uint(key_size, sizeof(__u32));
>         __array(values, int (void *));
> } prog_array_init SEC(".maps") = {
>         .values = {
>                 [0] = (void *)&handle_tp,
>         },
> };
>
> static __noinline int subprog(void *ctx)
> {
>         static int cnt;
>
>         cnt++;
>
>         bpf_printk("SUBPROG - BEFORE %d", cnt);
>
>         bpf_tail_call(ctx, &prog_array_init, 0);
>
>         bpf_printk("SUBPROG - AFTER %d", cnt);
>
>     return 0;
> }
>
> SEC("tp/syscalls/sys_enter_write")
> int handle_tp(void *ctx)
> {
>         static int cnt;
>         int pid = bpf_get_current_pid_tgid() >> 32;
>
>         if (pid != my_pid)
>                 return 0;
>
>         cnt++;
>
>         bpf_printk("ENTRY - BEFORE %d", cnt);
>
>         subprog(ctx);
>
>         bpf_printk("ENTRY - AFTER %d", cnt);
>
>         return 0;
> }
>
>
> And triggered one write syscall, getting the log above. You can see
> that only subprogs are replaced (we only get "SUBPROG - AFTER 34" due
> to the tail call limit). And we do indeed get lots of entry program
> recurrence.
>
> We *need to support recursion* is my main point.

Not quite.
It's not a recursion. The stack collapsed/gone/wiped out before tail_call.
static int cnt counts stuff because it's static.

So we don't need to support recursion with private stack,
but tail_calls and private stack are buggy indeed.

emit_bpf_tail_call*() shouldn't be adjusting 'rsp' when the private
stack is used.
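
As a rough sketch of that point (illustrative only, not the actual
arch/x86/net/bpf_jit_comp.c code): a program on a private stack never
subtracted its BPF stack from rsp in the prologue, so the tail-call path
has nothing to undo, i.e. the fixup amount has to be forced to 0:

#include <stdint.h>

/* Illustrative helper: how much to add back to rsp before jumping to the
 * tail-call target. With a private stack the prologue did no
 * "sub rsp, stack_depth", so the answer is 0.
 */
static uint32_t tail_call_rsp_fixup(uint32_t stack_depth, int uses_priv_stack)
{
        return uses_priv_stack ? 0 : stack_depth;
}

(The follow-ups below note that the current patch already passes
stack_depth == 0 to the tail-call emitters when the private stack is used.)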
Andrii Nakryiko July 23, 2024, 3:26 a.m. UTC | #11
On Mon, Jul 22, 2024 at 6:06 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Mon, Jul 22, 2024 at 1:58 PM Andrii Nakryiko
> <andrii.nakryiko@gmail.com> wrote:
> >
> > On Fri, Jul 19, 2024 at 8:28 PM Andrii Nakryiko
> > <andrii.nakryiko@gmail.com> wrote:
> > >
> > > On Thu, Jul 18, 2024 at 1:52 PM Yonghong Song <yonghong.song@linux.dev> wrote:
> > > >
> > > > The main motivation for private stack comes from nested
> > > > scheduler in sched-ext from Tejun. The basic idea is that
> > > >  - each cgroup will its own associated bpf program,
> > > >  - bpf program with parent cgroup will call bpf programs
> > > >    in immediate child cgroups.
> > > >
> > > > Let us say we have the following cgroup hierarchy:
> > > >   root_cg (prog0):
> > > >     cg1 (prog1):
> > > >       cg11 (prog11):
> > > >         cg111 (prog111)
> > > >         cg112 (prog112)
> > > >       cg12 (prog12):
> > > >         cg121 (prog121)
> > > >         cg122 (prog122)
> > > >     cg2 (prog2):
> > > >       cg21 (prog21)
> > > >       cg22 (prog22)
> > > >       cg23 (prog23)
> > > >
> > > > In the above example, prog0 will call a kfunc which will
> > > > call prog1 and prog2 to get sched info for cg1 and cg2 and
> > > > then the information is summarized and sent back to prog0.
> > > > Similarly, prog11 and prog12 will be invoked in the kfunc
> > > > and the result will be summarized and sent back to prog1, etc.
> > > >
> > > > Currently, for each thread, the x86 kernel allocate 8KB stack.
> > > > The each bpf program (including its subprograms) has maximum
> > > > 512B stack size to avoid potential stack overflow.
> > > > And nested bpf programs increase the risk of stack overflow.
> > > > To avoid potential stack overflow caused by bpf programs,
> > > > this patch implemented a private stack so bpf program stack
> > > > space is allocated dynamically when the program is jited.
> > > > Such private stack is applied to tracing programs like
> > > > kprobe/uprobe, perf_event, tracepoint, raw tracepoint and
> > > > tracing.
> > > >
> > > > But more than one instance of the same bpf program may
> > > > run in the system. To make things simple, percpu private
> > > > stack is allocated for each program, so if the same program
> > > > is running on different cpus concurrently, we won't have
> > > > any issue. Note that the kernel already have logic to prevent
> > > > the recursion for the same bpf program on the same cpu
> > > > (kprobe, fentry, etc.).
> > > >
> > > > The patch implemented a percpu private stack based approach
> > > > for x86 arch.
> > > >   - The stack size will be 0 and any stack access is from
> > > >     jit-time allocated percpu storage.
> > > >   - In the beginning of jit, r9 is used to save percpu
> > > >     private stack pointer.
> > > >   - Each rbp in the bpf asm insn is replaced by r9.
> > > >   - For each call, push r9 before the call and pop r9
> > > >     after the call to preserve r9 value.
> > > >
> > > > Compared to previous RFC patch [1], this patch added
> > > > some conditions to enable private stack, e.g., verifier
> > > > calculated stack size, prog type, etc. The new patch
> > > > also added a performance test to compare private stack
> > > > vs. no private stack.
> > > >
> > > > The following are some code example to illustrate the idea
> > > > for selftest cgroup_skb_sk_lookup:
> > > >
> > > >    the existing code                        the private-stack approach code
> > > >    endbr64                                  endbr64
> > > >    nop    DWORD PTR [rax+rax*1+0x0]         nop    DWORD PTR [rax+rax*1+0x0]
> > > >    xchg   ax,ax                             xchg   ax,ax
> > > >    push   rbp                               push   rbp
> > > >    mov    rbp,rsp                           mov    rbp,rsp
> > > >    endbr64                                  endbr64
> > > >    sub    rsp,0x68
> > > >    push   rbx                               push   rbx
> > > >    ...                                      ...
> > > >    ...                                      mov    r9d,0x8c1c860
> > > >    ...                                      add    r9,QWORD PTR gs:0x21a00
> > > >    ...                                      ...
> > > >    mov    rdx,rbp                           mov    rdx, r9
> > > >    add    rdx,0xffffffffffffffb4            rdx,0xffffffffffffffb4
> > > >    ...                                      ...
> > > >    mov    ecx,0x28                          mov    ecx,0x28
> > > >                                             push   r9
> > > >    call   0xffffffffe305e474                call   0xffffffffe305e524
> > > >                                             pop    r9
> > > >    mov    rdi,rax                           mov    rdi,rax
> > > >    ...                                      ...
> > > >    movzx  rdi,BYTE PTR [rbp-0x46]           movzx  rdi,BYTE PTR [r9-0x46]
> > > >    ...                                      ...
> > > >
> > >
> > > Eduard nerd-sniped me today with this a bit... :)
> > >
> > > I have a few questions and suggestions.
> > >
> > > So it seems like each *subprogram* (not the entire BPF program) gets
> > > its own per-CPU private stack allocation. Is that intentional? That
> > > seems a bit unnecessary. It also prevents any sort of actual
> > > recursion. Not sure if it's possible to write recursive BPF subprogram
> > > today, verifier seems to reject obvious limited recursion cases, but
> > > still, eventually we might need/want to support that, and this will be
> > > just another hurdle to overcome (so it's best to avoid adding it in
> > > the first place).
> > >
> > > I'm sure Eduard is going to try something like below and it will
> > > probably break badly (I haven't tried, sorry):
> > >
> > > int entry(void *ctx);
> > >
> > > struct {
> > >         __uint(type, BPF_MAP_TYPE_PROG_ARRAY);
> > >         __uint(max_entries, 1);
> > >         __uint(key_size, sizeof(__u32));
> > >         __array(values, int (void *));
> > > } prog_array_init SEC(".maps") = {
> > >         .values = {
> > >                 [0] = (void *)&entry,
> > >         },
> > > };
> > >
> > > static __noinline int subprog1(void)
> > > {
> > >     <some state on the stack>
> > >
> > >     /* here entry will replace subprog1, and so we'll have
> > >      * entry -> entry -> entry -> ..... <tail call limit> -> subprog1
> > >      */
> > >     bpf_tail_call(ctx, &prog_array_init, 0);
> > >
> > >     return 0;
> > > }
> > >
> > >
> > > SEC("raw_tp/sys_enter")
> > > int entry(void *ctx)
> > > {
> > >      <some state on the stack>
> > >
> > >      subprog1();
> > > }
> > >
> > > And we effectively have limited recursion where the entry's stack
> > > state is clobbered, no?
> > >
> > > So it seems like we need to support recursion.
> > >
> >
> > How come everyone just completely ignored the main point of my entire
> > email and a real problem that has to be solved?...
> >
> > Anyways, I did write a below program:
> >
> > $ cat minimal.bpf.c
> > // SPDX-License-Identifier: GPL-2.0 OR BSD-3-Clause
> > /* Copyright (c) 2020 Facebook */
> > #include <linux/bpf.h>
> > #include <bpf/bpf_helpers.h>
> >
> > char LICENSE[] SEC("license") = "Dual BSD/GPL";
> >
> > int my_pid = 0;
> >
> > int handle_tp(void *ctx);
> >
> > struct {
> >         __uint(type, BPF_MAP_TYPE_PROG_ARRAY);
> >         __uint(max_entries, 1);
> >         __uint(key_size, sizeof(__u32));
> >         __array(values, int (void *));
> > } prog_array_init SEC(".maps") = {
> >         .values = {
> >                 [0] = (void *)&handle_tp,
> >         },
> > };
> >
> > static __noinline int subprog(void *ctx)
> > {
> >         static int cnt;
> >
> >         cnt++;
> >
> >         bpf_printk("SUBPROG - BEFORE %d", cnt);
> >
> >         bpf_tail_call(ctx, &prog_array_init, 0);
> >
> >         bpf_printk("SUBPROG - AFTER %d", cnt);
> >
> >     return 0;
> > }
> >
> > SEC("tp/syscalls/sys_enter_write")
> > int handle_tp(void *ctx)
> > {
> >         static int cnt;
> >         int pid = bpf_get_current_pid_tgid() >> 32;
> >
> >         if (pid != my_pid)
> >                 return 0;
> >
> >         cnt++;
> >
> >         bpf_printk("ENTRY - BEFORE %d", cnt);
> >
> >         subprog(ctx);
> >
> >         bpf_printk("ENTRY - AFTER %d", cnt);
> >
> >         return 0;
> > }
> >
> >
> > And triggered one write syscall, getting the log above. You can see
> > that only subprogs are replaced (we only get "SUBPROG - AFTER 34" due
> > to the tail call limit). And we do indeed get lots of entry program
> > recurrence.
> >
> > We *need to support recursion* is my main point.
>
> Not quite.
> It's not a recursion. The stack collapsed/gone/wiped out before tail_call.

Only of subprog(), not of handle_tp(). See all those "ENTRY - AFTER"
messages. We do return to all the nested handle_tp() calls and
continue just fine.

I put the log into [0] for a bit easier visual inspection.

  [0] https://gist.github.com/anakryiko/6ccdfc62188f8ad4991641fb637d954c

> static int cnt counts stuff because it's static.
>
> So we don't need to support recursion with private stack,
> but tail_calls and private stack are buggy indeed.
>
> emit_bpf_tail_call*() shouldn't be adjusting 'rsp' when the private
> stack is used.
Yonghong Song July 23, 2024, 5:30 a.m. UTC | #12
On 7/22/24 6:05 PM, Alexei Starovoitov wrote:
> On Mon, Jul 22, 2024 at 1:58 PM Andrii Nakryiko
> <andrii.nakryiko@gmail.com> wrote:
>> On Fri, Jul 19, 2024 at 8:28 PM Andrii Nakryiko
>> <andrii.nakryiko@gmail.com> wrote:
>>> On Thu, Jul 18, 2024 at 1:52 PM Yonghong Song <yonghong.song@linux.dev> wrote:
>>>> The main motivation for private stack comes from nested
>>>> scheduler in sched-ext from Tejun. The basic idea is that
>>>>   - each cgroup will its own associated bpf program,
>>>>   - bpf program with parent cgroup will call bpf programs
>>>>     in immediate child cgroups.
>>>>
>>>> Let us say we have the following cgroup hierarchy:
>>>>    root_cg (prog0):
>>>>      cg1 (prog1):
>>>>        cg11 (prog11):
>>>>          cg111 (prog111)
>>>>          cg112 (prog112)
>>>>        cg12 (prog12):
>>>>          cg121 (prog121)
>>>>          cg122 (prog122)
>>>>      cg2 (prog2):
>>>>        cg21 (prog21)
>>>>        cg22 (prog22)
>>>>        cg23 (prog23)
>>>>
>>>> In the above example, prog0 will call a kfunc which will
>>>> call prog1 and prog2 to get sched info for cg1 and cg2 and
>>>> then the information is summarized and sent back to prog0.
>>>> Similarly, prog11 and prog12 will be invoked in the kfunc
>>>> and the result will be summarized and sent back to prog1, etc.
>>>>
>>>> Currently, for each thread, the x86 kernel allocate 8KB stack.
>>>> The each bpf program (including its subprograms) has maximum
>>>> 512B stack size to avoid potential stack overflow.
>>>> And nested bpf programs increase the risk of stack overflow.
>>>> To avoid potential stack overflow caused by bpf programs,
>>>> this patch implemented a private stack so bpf program stack
>>>> space is allocated dynamically when the program is jited.
>>>> Such private stack is applied to tracing programs like
>>>> kprobe/uprobe, perf_event, tracepoint, raw tracepoint and
>>>> tracing.
>>>>
>>>> But more than one instance of the same bpf program may
>>>> run in the system. To make things simple, percpu private
>>>> stack is allocated for each program, so if the same program
>>>> is running on different cpus concurrently, we won't have
>>>> any issue. Note that the kernel already have logic to prevent
>>>> the recursion for the same bpf program on the same cpu
>>>> (kprobe, fentry, etc.).
>>>>
>>>> The patch implemented a percpu private stack based approach
>>>> for x86 arch.
>>>>    - The stack size will be 0 and any stack access is from
>>>>      jit-time allocated percpu storage.
>>>>    - In the beginning of jit, r9 is used to save percpu
>>>>      private stack pointer.
>>>>    - Each rbp in the bpf asm insn is replaced by r9.
>>>>    - For each call, push r9 before the call and pop r9
>>>>      after the call to preserve r9 value.
>>>>
>>>> Compared to previous RFC patch [1], this patch added
>>>> some conditions to enable private stack, e.g., verifier
>>>> calculated stack size, prog type, etc. The new patch
>>>> also added a performance test to compare private stack
>>>> vs. no private stack.
>>>>
>>>> The following are some code example to illustrate the idea
>>>> for selftest cgroup_skb_sk_lookup:
>>>>
>>>>     the existing code                        the private-stack approach code
>>>>     endbr64                                  endbr64
>>>>     nop    DWORD PTR [rax+rax*1+0x0]         nop    DWORD PTR [rax+rax*1+0x0]
>>>>     xchg   ax,ax                             xchg   ax,ax
>>>>     push   rbp                               push   rbp
>>>>     mov    rbp,rsp                           mov    rbp,rsp
>>>>     endbr64                                  endbr64
>>>>     sub    rsp,0x68
>>>>     push   rbx                               push   rbx
>>>>     ...                                      ...
>>>>     ...                                      mov    r9d,0x8c1c860
>>>>     ...                                      add    r9,QWORD PTR gs:0x21a00
>>>>     ...                                      ...
>>>>     mov    rdx,rbp                           mov    rdx, r9
>>>>     add    rdx,0xffffffffffffffb4            rdx,0xffffffffffffffb4
>>>>     ...                                      ...
>>>>     mov    ecx,0x28                          mov    ecx,0x28
>>>>                                              push   r9
>>>>     call   0xffffffffe305e474                call   0xffffffffe305e524
>>>>                                              pop    r9
>>>>     mov    rdi,rax                           mov    rdi,rax
>>>>     ...                                      ...
>>>>     movzx  rdi,BYTE PTR [rbp-0x46]           movzx  rdi,BYTE PTR [r9-0x46]
>>>>     ...                                      ...
>>>>
>>> Eduard nerd-sniped me today with this a bit... :)
>>>
>>> I have a few questions and suggestions.
>>>
>>> So it seems like each *subprogram* (not the entire BPF program) gets
>>> its own per-CPU private stack allocation. Is that intentional? That
>>> seems a bit unnecessary. It also prevents any sort of actual
>>> recursion. Not sure if it's possible to write recursive BPF subprogram
>>> today, verifier seems to reject obvious limited recursion cases, but
>>> still, eventually we might need/want to support that, and this will be
>>> just another hurdle to overcome (so it's best to avoid adding it in
>>> the first place).
>>>
>>> I'm sure Eduard is going to try something like below and it will
>>> probably break badly (I haven't tried, sorry):
>>>
>>> int entry(void *ctx);
>>>
>>> struct {
>>>          __uint(type, BPF_MAP_TYPE_PROG_ARRAY);
>>>          __uint(max_entries, 1);
>>>          __uint(key_size, sizeof(__u32));
>>>          __array(values, int (void *));
>>> } prog_array_init SEC(".maps") = {
>>>          .values = {
>>>                  [0] = (void *)&entry,
>>>          },
>>> };
>>>
>>> static __noinline int subprog1(void)
>>> {
>>>      <some state on the stack>
>>>
>>>      /* here entry will replace subprog1, and so we'll have
>>>       * entry -> entry -> entry -> ..... <tail call limit> -> subprog1
>>>       */
>>>      bpf_tail_call(ctx, &prog_array_init, 0);
>>>
>>>      return 0;
>>> }
>>>
>>>
>>> SEC("raw_tp/sys_enter")
>>> int entry(void *ctx)
>>> {
>>>       <some state on the stack>
>>>
>>>       subprog1();
>>> }
>>>
>>> And we effectively have limited recursion where the entry's stack
>>> state is clobbered, no?
>>>
>>> So it seems like we need to support recursion.
>>>
>> How come everyone just completely ignored the main point of my entire
>> email and a real problem that has to be solved?...
>>
>> Anyways, I did write a below program:
>>
>> $ cat minimal.bpf.c
>> // SPDX-License-Identifier: GPL-2.0 OR BSD-3-Clause
>> /* Copyright (c) 2020 Facebook */
>> #include <linux/bpf.h>
>> #include <bpf/bpf_helpers.h>
>>
>> char LICENSE[] SEC("license") = "Dual BSD/GPL";
>>
>> int my_pid = 0;
>>
>> int handle_tp(void *ctx);
>>
>> struct {
>>          __uint(type, BPF_MAP_TYPE_PROG_ARRAY);
>>          __uint(max_entries, 1);
>>          __uint(key_size, sizeof(__u32));
>>          __array(values, int (void *));
>> } prog_array_init SEC(".maps") = {
>>          .values = {
>>                  [0] = (void *)&handle_tp,
>>          },
>> };
>>
>> static __noinline int subprog(void *ctx)
>> {
>>          static int cnt;
>>
>>          cnt++;
>>
>>          bpf_printk("SUBPROG - BEFORE %d", cnt);
>>
>>          bpf_tail_call(ctx, &prog_array_init, 0);
>>
>>          bpf_printk("SUBPROG - AFTER %d", cnt);
>>
>>      return 0;
>> }
>>
>> SEC("tp/syscalls/sys_enter_write")
>> int handle_tp(void *ctx)
>> {
>>          static int cnt;
>>          int pid = bpf_get_current_pid_tgid() >> 32;
>>
>>          if (pid != my_pid)
>>                  return 0;
>>
>>          cnt++;
>>
>>          bpf_printk("ENTRY - BEFORE %d", cnt);
>>
>>          subprog(ctx);
>>
>>          bpf_printk("ENTRY - AFTER %d", cnt);
>>
>>          return 0;
>> }
>>
>>
>> And triggered one write syscall, getting the log above. You can see
>> that only subprogs are replaced (we only get "SUBPROG - AFTER 34" due
>> to the tail call limit). And we do indeed get lots of entry program
>> recurrence.
>>
>> We *need to support recursion* is my main point.
> Not quite.
> It's not a recursion. The stack collapsed/gone/wiped out before tail_call.
> static int cnt counts stuff because it's static.
>
> So we don't need to support recursion with private stack,
> but tail_calls and private stack are buggy indeed.
>
> emit_bpf_tail_call*() shouldn't be adjusting 'rsp' when the private
> stack is used.

Right, stack_depth argument in emit_bpf_tail_call_direct()/emit_bpf_tail_call_indirect()
should be 0 if private stack is used. Will fix in next revision.
Yonghong Song July 23, 2024, 7:02 a.m. UTC | #13
On 7/22/24 10:30 PM, Yonghong Song wrote:
>
> On 7/22/24 6:05 PM, Alexei Starovoitov wrote:
>> On Mon, Jul 22, 2024 at 1:58 PM Andrii Nakryiko
>> <andrii.nakryiko@gmail.com> wrote:
>>> On Fri, Jul 19, 2024 at 8:28 PM Andrii Nakryiko
>>> <andrii.nakryiko@gmail.com> wrote:
>>>> On Thu, Jul 18, 2024 at 1:52 PM Yonghong Song 
>>>> <yonghong.song@linux.dev> wrote:
>>>>> The main motivation for private stack comes from nested
>>>>> scheduler in sched-ext from Tejun. The basic idea is that
>>>>>   - each cgroup will its own associated bpf program,
>>>>>   - bpf program with parent cgroup will call bpf programs
>>>>>     in immediate child cgroups.
>>>>>
>>>>> Let us say we have the following cgroup hierarchy:
>>>>>    root_cg (prog0):
>>>>>      cg1 (prog1):
>>>>>        cg11 (prog11):
>>>>>          cg111 (prog111)
>>>>>          cg112 (prog112)
>>>>>        cg12 (prog12):
>>>>>          cg121 (prog121)
>>>>>          cg122 (prog122)
>>>>>      cg2 (prog2):
>>>>>        cg21 (prog21)
>>>>>        cg22 (prog22)
>>>>>        cg23 (prog23)
>>>>>
>>>>> In the above example, prog0 will call a kfunc which will
>>>>> call prog1 and prog2 to get sched info for cg1 and cg2 and
>>>>> then the information is summarized and sent back to prog0.
>>>>> Similarly, prog11 and prog12 will be invoked in the kfunc
>>>>> and the result will be summarized and sent back to prog1, etc.
>>>>>
>>>>> Currently, for each thread, the x86 kernel allocate 8KB stack.
>>>>> The each bpf program (including its subprograms) has maximum
>>>>> 512B stack size to avoid potential stack overflow.
>>>>> And nested bpf programs increase the risk of stack overflow.
>>>>> To avoid potential stack overflow caused by bpf programs,
>>>>> this patch implemented a private stack so bpf program stack
>>>>> space is allocated dynamically when the program is jited.
>>>>> Such private stack is applied to tracing programs like
>>>>> kprobe/uprobe, perf_event, tracepoint, raw tracepoint and
>>>>> tracing.
>>>>>
>>>>> But more than one instance of the same bpf program may
>>>>> run in the system. To make things simple, percpu private
>>>>> stack is allocated for each program, so if the same program
>>>>> is running on different cpus concurrently, we won't have
>>>>> any issue. Note that the kernel already have logic to prevent
>>>>> the recursion for the same bpf program on the same cpu
>>>>> (kprobe, fentry, etc.).
>>>>>
>>>>> The patch implemented a percpu private stack based approach
>>>>> for x86 arch.
>>>>>    - The stack size will be 0 and any stack access is from
>>>>>      jit-time allocated percpu storage.
>>>>>    - In the beginning of jit, r9 is used to save percpu
>>>>>      private stack pointer.
>>>>>    - Each rbp in the bpf asm insn is replaced by r9.
>>>>>    - For each call, push r9 before the call and pop r9
>>>>>      after the call to preserve r9 value.
>>>>>
>>>>> Compared to previous RFC patch [1], this patch added
>>>>> some conditions to enable private stack, e.g., verifier
>>>>> calculated stack size, prog type, etc. The new patch
>>>>> also added a performance test to compare private stack
>>>>> vs. no private stack.
>>>>>
>>>>> The following are some code example to illustrate the idea
>>>>> for selftest cgroup_skb_sk_lookup:
>>>>>
>>>>>     the existing code                        the private-stack approach code
>>>>>     endbr64                                  endbr64
>>>>>     nop    DWORD PTR [rax+rax*1+0x0]         nop    DWORD PTR [rax+rax*1+0x0]
>>>>>     xchg   ax,ax                             xchg   ax,ax
>>>>>     push   rbp                               push   rbp
>>>>>     mov    rbp,rsp                           mov    rbp,rsp
>>>>>     endbr64                                  endbr64
>>>>>     sub    rsp,0x68
>>>>>     push   rbx                               push   rbx
>>>>>     ...                                      ...
>>>>>     ...                                      mov    r9d,0x8c1c860
>>>>>     ...                                      add    r9,QWORD PTR gs:0x21a00
>>>>>     ...                                      ...
>>>>>     mov    rdx,rbp                           mov    rdx, r9
>>>>>     add    rdx,0xffffffffffffffb4            rdx,0xffffffffffffffb4
>>>>>     ...                                      ...
>>>>>     mov    ecx,0x28                          mov    ecx,0x28
>>>>>                                              push   r9
>>>>>     call   0xffffffffe305e474                call   0xffffffffe305e524
>>>>>                                              pop    r9
>>>>>     mov    rdi,rax                           mov    rdi,rax
>>>>>     ...                                      ...
>>>>>     movzx  rdi,BYTE PTR [rbp-0x46]           movzx  rdi,BYTE PTR [r9-0x46]
>>>>>     ...                                      ...
>>>>>
>>>> Eduard nerd-sniped me today with this a bit... :)
>>>>
>>>> I have a few questions and suggestions.
>>>>
>>>> So it seems like each *subprogram* (not the entire BPF program) gets
>>>> its own per-CPU private stack allocation. Is that intentional? That
>>>> seems a bit unnecessary. It also prevents any sort of actual
>>>> recursion. Not sure if it's possible to write recursive BPF subprogram
>>>> today, verifier seems to reject obvious limited recursion cases, but
>>>> still, eventually we might need/want to support that, and this will be
>>>> just another hurdle to overcome (so it's best to avoid adding it in
>>>> the first place).
>>>>
>>>> I'm sure Eduard is going to try something like below and it will
>>>> probably break badly (I haven't tried, sorry):
>>>>
>>>> int entry(void *ctx);
>>>>
>>>> struct {
>>>>          __uint(type, BPF_MAP_TYPE_PROG_ARRAY);
>>>>          __uint(max_entries, 1);
>>>>          __uint(key_size, sizeof(__u32));
>>>>          __array(values, int (void *));
>>>> } prog_array_init SEC(".maps") = {
>>>>          .values = {
>>>>                  [0] = (void *)&entry,
>>>>          },
>>>> };
>>>>
>>>> static __noinline int subprog1(void)
>>>> {
>>>>      <some state on the stack>
>>>>
>>>>      /* here entry will replace subprog1, and so we'll have
>>>>       * entry -> entry -> entry -> ..... <tail call limit> -> subprog1
>>>>       */
>>>>      bpf_tail_call(ctx, &prog_array_init, 0);
>>>>
>>>>      return 0;
>>>> }
>>>>
>>>>
>>>> SEC("raw_tp/sys_enter")
>>>> int entry(void *ctx)
>>>> {
>>>>       <some state on the stack>
>>>>
>>>>       subprog1();
>>>> }
>>>>
>>>> And we effectively have limited recursion where the entry's stack
>>>> state is clobbered, no?
>>>>
>>>> So it seems like we need to support recursion.
>>>>
>>> How come everyone just completely ignored the main point of my entire
>>> email and a real problem that has to be solved?...
>>>
>>> Anyways, I did write a below program:
>>>
>>> $ cat minimal.bpf.c
>>> // SPDX-License-Identifier: GPL-2.0 OR BSD-3-Clause
>>> /* Copyright (c) 2020 Facebook */
>>> #include <linux/bpf.h>
>>> #include <bpf/bpf_helpers.h>
>>>
>>> char LICENSE[] SEC("license") = "Dual BSD/GPL";
>>>
>>> int my_pid = 0;
>>>
>>> int handle_tp(void *ctx);
>>>
>>> struct {
>>>          __uint(type, BPF_MAP_TYPE_PROG_ARRAY);
>>>          __uint(max_entries, 1);
>>>          __uint(key_size, sizeof(__u32));
>>>          __array(values, int (void *));
>>> } prog_array_init SEC(".maps") = {
>>>          .values = {
>>>                  [0] = (void *)&handle_tp,
>>>          },
>>> };
>>>
>>> static __noinline int subprog(void *ctx)
>>> {
>>>          static int cnt;
>>>
>>>          cnt++;
>>>
>>>          bpf_printk("SUBPROG - BEFORE %d", cnt);
>>>
>>>          bpf_tail_call(ctx, &prog_array_init, 0);
>>>
>>>          bpf_printk("SUBPROG - AFTER %d", cnt);
>>>
>>>      return 0;
>>> }
>>>
>>> SEC("tp/syscalls/sys_enter_write")
>>> int handle_tp(void *ctx)
>>> {
>>>          static int cnt;
>>>          int pid = bpf_get_current_pid_tgid() >> 32;
>>>
>>>          if (pid != my_pid)
>>>                  return 0;
>>>
>>>          cnt++;
>>>
>>>          bpf_printk("ENTRY - BEFORE %d", cnt);
>>>
>>>          subprog(ctx);
>>>
>>>          bpf_printk("ENTRY - AFTER %d", cnt);
>>>
>>>          return 0;
>>> }
>>>
>>>
>>> And triggered one write syscall, getting the log above. You can see
>>> that only subprogs are replaced (we only get "SUBPROG - AFTER 34" due
>>> to the tail call limit). And we do indeed get lots of entry program
>>> recurrence.
>>>
>>> We *need to support recursion* is my main point.
>> Not quite.
>> It's not a recursion. The stack collapsed/gone/wiped out before tail_call.
>> static int cnt counts stuff because it's static.
>>
>> So we don't need to support recursion with private stack,
>> but tail_calls and private stack are buggy indeed.
>>
>> emit_bpf_tail_call*() shouldn't be adjusting 'rsp' when the private
>> stack is used.
>
> Right, stack_depth argument in 
> emit_bpf_tail_call_direct()/emit_bpf_tail_call_indirect()
> should be 0 if private stack is used. Will fix in next revision.

Actually, the current implementation is correct. We already set stack_depth to 0
when the private stack is used, so we should be fine here.
Alexei Starovoitov July 24, 2024, 3:17 a.m. UTC | #14
On Mon, Jul 22, 2024 at 8:27 PM Andrii Nakryiko
<andrii.nakryiko@gmail.com> wrote:
>
> > > We *need to support recursion* is my main point.
> >
> > Not quite.
> > It's not a recursion. The stack collapsed/gone/wiped out before tail_call.
>
> Only of subprog(), not of handle_tp(). See all those "ENTRY - AFTER"
> messages. We do return to all the nested handle_tp() calls and
> continue just fine.
>
> I put the log into [0] for a bit easier visual inspection.
>
>   [0] https://gist.github.com/anakryiko/6ccdfc62188f8ad4991641fb637d954c

Argh. So the pathological prog can consume 512*33 of stack.
We have to reject it somehow in the verifier or tailor private stack
to support it. Then private stack will be a feature and a fix for this issue.
But then it would need to preallocate 512*33 per cpu per program.
Which is too much.
Maybe we can preallocate _aligned_ 512 or 1k per cpu per prog,
then adjust r9 before call or tail_call and if r9 is about to cross
alignment before tail_call fail the tail call (like tail call cnt was
over limit).
Hopefully there are better ideas, since it's all quite messy.
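
A rough model of that boundary check (assuming a chunk-aligned per-cpu
allocation and, for the sake of the example, a stack that advances upward;
the chunk size and names below are made up):

#include <stdbool.h>
#include <stdint.h>

#define PRIV_STACK_CHUNK 1024   /* assumed size of the aligned per-cpu chunk */

/* r9 is assumed to start at the aligned chunk base and to be advanced by
 * each program's stack depth, so masking recovers the chunk base at run
 * time even though the verifier cannot know the tail-call nesting depth
 * statically. A real implementation would still have to settle the grow
 * direction and the exact-boundary case; this only shows the masking trick.
 */
static bool tail_call_fits(uintptr_t r9, uint32_t target_stack_depth)
{
        uintptr_t chunk_base = r9 & ~(uintptr_t)(PRIV_STACK_CHUNK - 1);
        uintptr_t chunk_end = chunk_base + PRIV_STACK_CHUNK;

        return r9 + target_stack_depth <= chunk_end;
}

If tail_call_fits() is false, the tail call would fail exactly like hitting
the tail-call count limit does today.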
Andrii Nakryiko July 24, 2024, 4:06 a.m. UTC | #15
On Tue, Jul 23, 2024 at 8:17 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Mon, Jul 22, 2024 at 8:27 PM Andrii Nakryiko
> <andrii.nakryiko@gmail.com> wrote:
> >
> > > > We *need to support recursion* is my main point.
> > >
> > > Not quite.
> > > It's not a recursion. The stack collapsed/gone/wiped out before tail_call.
> >
> > Only of subprog(), not of handle_tp(). See all those "ENTRY - AFTER"
> > messages. We do return to all the nested handle_tp() calls and
> > continue just fine.
> >
> > I put the log into [0] for a bit easier visual inspection.
> >
> >   [0] https://gist.github.com/anakryiko/6ccdfc62188f8ad4991641fb637d954c
>
> Argh. So the pathological prog can consume 512*33 of stack.
> We have to reject it somehow in the verifier or tailor private stack
> to support it. Then private stack will be a feature and a fix for this issue.
> But then it would need to preallocate 512*33 per cpu per program.
> Which is too much.
> Maybe we can preallocate _aligned_ 512 or 1k per cpu per prog,
> then adjust r9 before call or tail_call and if r9 is about to cross
> alignment before tail_call fail the tail call (like tail call cnt was
> over limit).

This is close to what I proposed to Yonghong offline. One approach I
had in mind was as follows. If we know that a BPF program can do a
tail call, then allocate some larger private stack (1KB, 4KB, 8KB,
don't know), compared to what the BPF program itself would need. Then in
bpf_tail_call() helper's inlining itself check whether R9 +
<max_prog_stack_size> is larger than the private stack's size. And if
yes, then don't do tail call (as if we reached max number of tail
calls). Tail call interface allows for that.

This way we don't slow down typical non-tail call cases and don't pay
unnecessary memory price, but we still make tail call work just fine
in most cases, except some pathological ones like my example. I think
the expected situation for tail call is to replace main program with
another main program, so the typical case will work perfectly fine.
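
The "as if we reached max number of tail calls" part leans on the existing
bpf_tail_call() semantics: on success it never returns, and on any failure
execution just continues after the call. A minimal BPF-side sketch of that
fallback, reusing the prog_array_init map from the example earlier in the
thread:

SEC("tp/syscalls/sys_enter_write")
int entry(void *ctx)
{
        /* If the kernel refuses the tail call -- bad index, tail-call
         * limit, or, with the proposal above, not enough private stack
         * left -- we simply fall through to the code below.
         */
        bpf_tail_call(ctx, &prog_array_init, 0);

        bpf_printk("tail call not taken");
        return 0;
}

So programs written against today's interface already handle the
"tail call refused" case.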

> Hopefully there are better ideas, since it's all quite messy.
Yonghong Song July 24, 2024, 4:32 a.m. UTC | #16
On 7/23/24 8:17 PM, Alexei Starovoitov wrote:
> On Mon, Jul 22, 2024 at 8:27 PM Andrii Nakryiko
> <andrii.nakryiko@gmail.com> wrote:
>>>> We *need to support recursion* is my main point.
>>> Not quite.
>>> It's not a recursion. The stack collapsed/gone/wiped out before tail_call.
>> Only of subprog(), not of handle_tp(). See all those "ENTRY - AFTER"
>> messages. We do return to all the nested handle_tp() calls and
>> continue just fine.
>>
>> I put the log into [0] for a bit easier visual inspection.
>>
>>    [0] https://gist.github.com/anakryiko/6ccdfc62188f8ad4991641fb637d954c
> Argh. So the pathological prog can consume 512*33 of stack.

We have a check in the verifier like below:

         if (idx && subprog[idx].has_tail_call && depth >= 256) {
                 verbose(env,
                         "tail_calls are not allowed when call stack of previous frames is %d bytes. Too large\n",
                         depth);
                 return -EACCES;
         }

So the maximum stack size could be around 256 * 33 = 8448 bytes, which is a little more than 8KB.

> We have to reject it somehow in the verifier or tailor private stack
> to support it. Then private stack will be a feature and a fix for this issue.
> But then it would need to preallocate 512*33 per cpu per program.
> Which is too much.
> Maybe we can preallocate _aligned_ 512 or 1k per cpu per prog,
> then adjust r9 before call or tail_call and if r9 is about to cross
> alignment before tail_call fail the tail call (like tail call cnt was
> over limit).
> Hopefully there are better ideas, since it's all quite messy.
Yonghong Song July 24, 2024, 4:46 a.m. UTC | #17
On 7/23/24 9:06 PM, Andrii Nakryiko wrote:
> On Tue, Jul 23, 2024 at 8:17 PM Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
>> On Mon, Jul 22, 2024 at 8:27 PM Andrii Nakryiko
>> <andrii.nakryiko@gmail.com> wrote:
>>>>> We *need to support recursion* is my main point.
>>>> Not quite.
>>>> It's not a recursion. The stack collapsed/gone/wiped out before tail_call.
>>> Only of subprog(), not of handle_tp(). See all those "ENTRY - AFTER"
>>> messages. We do return to all the nested handle_tp() calls and
>>> continue just fine.
>>>
>>> I put the log into [0] for a bit easier visual inspection.
>>>
>>>    [0] https://gist.github.com/anakryiko/6ccdfc62188f8ad4991641fb637d954c
>> Argh. So the pathological prog can consume 512*33 of stack.
>> We have to reject it somehow in the verifier or tailor private stack
>> to support it. Then private stack will be a feature and a fix for this issue.
>> But then it would need to preallocate 512*33 per cpu per program.
>> Which is too much.
>> Maybe we can preallocate _aligned_ 512 or 1k per cpu per prog,
>> then adjust r9 before call or tail_call and if r9 is about to cross
>> alignment before tail_call fail the tail call (like tail call cnt was
>> over limit).
> This is close to what I proposed to Yonghong offline. One approach I
> had in mind was as follows. If we know that a BPF program can do a
> tail call, then allocate some larger private stack (1KB, 4KB, 8KB,
> don't know), compared to what the BPF program itself would need. Then in
> bpf_tail_call() helper's inlining itself check whether R9 +
> <max_prog_stack_size> is larger than the private stack's size. And if
> yes, then don't do tail call (as if we reached max number of tail
> calls). Tail call interface allows for that.
>
> This way we don't slow down typical non-tail call cases and don't pay
> unnecessary memory price, but we still make tail call work just fine
> in most cases, except some pathological ones like my example. I think
> the expected situation for tail call is to replace main program with
> another main program, so the typical case will work perfectly fine.

Indeed, this is an approach we could use. Based on the recursion 'cnt',
we could calculate the frame pointer properly from the originally
allocated frame pointer. Currently, the private stack only supports
tracing programs, and it should be extremely rare for a tracing program
to recurse into itself through a tail call. So we could allocate a smaller
amount of memory, e.g. 1KB or 2KB, and add a runtime check against the
private stack size to prevent stack overflow.

>
>> Hopefully there are better ideas, since it's all quite messy.
Yonghong Song July 24, 2024, 5:08 a.m. UTC | #18
On 7/22/24 9:43 AM, Yonghong Song wrote:
>
> On 7/19/24 8:28 PM, Andrii Nakryiko wrote:
>> On Thu, Jul 18, 2024 at 1:52 PM Yonghong Song 
>> <yonghong.song@linux.dev> wrote:
>>> The main motivation for private stack comes from nested
>>> scheduler in sched-ext from Tejun. The basic idea is that
>>>   - each cgroup will its own associated bpf program,
>>>   - bpf program with parent cgroup will call bpf programs
>>>     in immediate child cgroups.
>>>
>>> Let us say we have the following cgroup hierarchy:
>>>    root_cg (prog0):
>>>      cg1 (prog1):
>>>        cg11 (prog11):
>>>          cg111 (prog111)
>>>          cg112 (prog112)
>>>        cg12 (prog12):
>>>          cg121 (prog121)
>>>          cg122 (prog122)
>>>      cg2 (prog2):
>>>        cg21 (prog21)
>>>        cg22 (prog22)
>>>        cg23 (prog23)
>>>
>>> In the above example, prog0 will call a kfunc which will
>>> call prog1 and prog2 to get sched info for cg1 and cg2 and
>>> then the information is summarized and sent back to prog0.
>>> Similarly, prog11 and prog12 will be invoked in the kfunc
>>> and the result will be summarized and sent back to prog1, etc.
>>>
>>> Currently, for each thread, the x86 kernel allocate 8KB stack.
>>> The each bpf program (including its subprograms) has maximum
>>> 512B stack size to avoid potential stack overflow.
>>> And nested bpf programs increase the risk of stack overflow.
>>> To avoid potential stack overflow caused by bpf programs,
>>> this patch implemented a private stack so bpf program stack
>>> space is allocated dynamically when the program is jited.
>>> Such private stack is applied to tracing programs like
>>> kprobe/uprobe, perf_event, tracepoint, raw tracepoint and
>>> tracing.
>>>
>>> But more than one instance of the same bpf program may
>>> run in the system. To make things simple, percpu private
>>> stack is allocated for each program, so if the same program
>>> is running on different cpus concurrently, we won't have
>>> any issue. Note that the kernel already have logic to prevent
>>> the recursion for the same bpf program on the same cpu
>>> (kprobe, fentry, etc.).
>>>
>>> The patch implemented a percpu private stack based approach
>>> for x86 arch.
>>>    - The stack size will be 0 and any stack access is from
>>>      jit-time allocated percpu storage.
>>>    - In the beginning of jit, r9 is used to save percpu
>>>      private stack pointer.
>>>    - Each rbp in the bpf asm insn is replaced by r9.
>>>    - For each call, push r9 before the call and pop r9
>>>      after the call to preserve r9 value.
>>>
>>> Compared to previous RFC patch [1], this patch added
>>> some conditions to enable private stack, e.g., verifier
>>> calculated stack size, prog type, etc. The new patch
>>> also added a performance test to compare private stack
>>> vs. no private stack.
>>>
>>> The following are some code example to illustrate the idea
>>> for selftest cgroup_skb_sk_lookup:
>>>
>>>     the existing code                        the private-stack approach code
>>>     endbr64                                  endbr64
>>>     nop    DWORD PTR [rax+rax*1+0x0]         nop    DWORD PTR [rax+rax*1+0x0]
>>>     xchg   ax,ax                             xchg   ax,ax
>>>     push   rbp                               push   rbp
>>>     mov    rbp,rsp                           mov    rbp,rsp
>>>     endbr64                                  endbr64
>>>     sub    rsp,0x68
>>>     push   rbx                               push   rbx
>>>     ...                                      ...
>>>     ...                                      mov    r9d,0x8c1c860
>>>     ...                                      add    r9,QWORD PTR gs:0x21a00
>>>     ...                                      ...
>>>     mov    rdx,rbp                           mov    rdx, r9
>>>     add    rdx,0xffffffffffffffb4            rdx,0xffffffffffffffb4
>>>     ...                                      ...
>>>     mov    ecx,0x28                          mov    ecx,0x28
>>>                                              push   r9
>>>     call   0xffffffffe305e474                call   0xffffffffe305e524
>>>                                              pop    r9
>>>     mov    rdi,rax                           mov    rdi,rax
>>>     ...                                      ...
>>>     movzx  rdi,BYTE PTR [rbp-0x46]           movzx  rdi,BYTE PTR [r9-0x46]
>>>     ...                                      ...
>>>
>> Eduard nerd-sniped me today with this a bit... :)
>>
>> I have a few questions and suggestions.
>>
>> So it seems like each *subprogram* (not the entire BPF program) gets
>> its own per-CPU private stack allocation. Is that intentional? That
>
> Currently yes. The reason is the same prog could be run on different
> cpus at the same time.
>
>> seems a bit unnecessary. It also prevents any sort of actual
>> recursion. Not sure if it's possible to write recursive BPF subprogram
>> today, verifier seems to reject obvious limited recursion cases, but
>> still, eventually we might need/want to support that, and this will be
>> just another hurdle to overcome (so it's best to avoid adding it in
>> the first place).
>>
>> I'm sure Eduard is going to try something like below and it will
>> probably break badly (I haven't tried, sorry):
>>
>> int entry(void *ctx);
>>
>> struct {
>>          __uint(type, BPF_MAP_TYPE_PROG_ARRAY);
>>          __uint(max_entries, 1);
>>          __uint(key_size, sizeof(__u32));
>>          __array(values, int (void *));
>> } prog_array_init SEC(".maps") = {
>>          .values = {
>>                  [0] = (void *)&entry,
>>          },
>> };
>>
>> static __noinline int subprog1(void)
>> {
>>      <some state on the stack>
>>
>>      /* here entry will replace subprog1, and so we'll have
>>       * entry -> entry -> entry -> ..... <tail call limit> -> subprog1
>>       */
>>      bpf_tail_call(ctx, &prog_array_init, 0);
>>
>>      return 0;
>> }
>>
>>
>> SEC("raw_tp/sys_enter")
>> int entry(void *ctx)
>> {
>>       <some state on the stack>
>>
>>       subprog1();
>> }
>>
>> And we effectively have limited recursion where the entry's stack
>> state is clobbered, no?
>>
>> So it seems like we need to support recursion.
>>
>>
>> So, the question I have is. Why not do the following:
>> a) only setup r9 *once* in entry program's prologue (before tail call
>> jump target)
>> b) before each call we can adjust r9 with current prog/subprog's
>> maximum *own* stack, something like:
>>
>> push r9;
>> r9 += 128; // 128 is subprog's stack usage
>> call <some-subprog>
>> pop r9;
>>
>> The idea being that on tail call or in subprog call we assume r9 is
>> already pointing to the right place. We can probably also figure out
>> how to avoid push/pop r9 if we make sure that subprogram always
>> restores r9 (taking tail calls into account and all that, of course)?
>>
>> Is this feasible?
>
> This is possible. I actually hacked such an idea easily. The basic
> idea is push frame pointer as an additional argument to the bpf
> static sub-prog. This is a little bit complicated. It will probably
> save some stack size but I am not sure how much it is.

Discussed with Andrii. I think the following approach should work.
For each non-static prog, the private stack is allocated to cover
that non-static prog plus the static subprogs it calls. For example,
     main_prog
        static_prog_1
          static_prog_11
          global_prog
             static_prog_12
        static_prog_2

So in verifier we calculate stack size for
     main_prog
        static_prog_1
           static_prog_11
        static_prog_2
  and
     global_prog
       static_prog_12

Let us say each (sub)prog in main_prog's call tree has the stack size below:
     main_prog // stack size 100
        static_prog_1 // stack size 100
          static_prog_11 // stack size 100
        static_prog_2 // stack size 100
so the deepest call chain (main_prog -> static_prog_1 -> static_prog_11) needs
300 bytes and the private stack size will be 300.
So R9 is calculated like below
     main_prog
       R9 = ... // for tailcall reachable, R9 may be original R9 + offset
                // for non-tailcall reachable, R9 equals the original R9 (based on jit-time allocation).
       ...  R9 ...
       R9 += 100
       static_prog_1
          ... R9 ...
          R9 += 100
          static_prog_11
            ... R9 ...
          R9 -= 100
       R9 -= 100
       ... R9 ...
       R9 += 100
       static_prog_2
          ... R9 ...
       R9 -= 100

Similary, we can calculate R9 offset for
     global_prog
       static_prog_12
as well.
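
To make the accounting above concrete, here is a small user-space model
(not verifier code; the struct and function names are made up, and the
global_prog/static_prog_12 sizes of 100 each are assumed for illustration):
the private stack size of an entry prog is its own depth plus the deepest
chain of statically-called subprogs, with global subprogs skipped because
they get their own allocation. For the numbers above it prints 300 for
main_prog and 200 for global_prog.

#include <stdio.h>

struct subprog {
        const char *name;
        int stack_depth;
        int is_global;                  /* global subprogs are accounted separately */
        const struct subprog *callees[4];
};

/* Own depth plus the deepest statically-called chain below this subprog. */
static int priv_stack_size(const struct subprog *p)
{
        int deepest = 0;

        for (int i = 0; i < 4 && p->callees[i]; i++) {
                const struct subprog *c = p->callees[i];
                int need;

                if (c->is_global)
                        continue;       /* starts its own private stack accounting */
                need = priv_stack_size(c);
                if (need > deepest)
                        deepest = need;
        }
        return p->stack_depth + deepest;
}

int main(void)
{
        const struct subprog static_prog_12 = { "static_prog_12", 100, 0, { 0 } };
        const struct subprog global_prog = { "global_prog", 100, 1, { &static_prog_12 } };
        const struct subprog static_prog_11 = { "static_prog_11", 100, 0, { 0 } };
        const struct subprog static_prog_1 = { "static_prog_1", 100, 0, { &static_prog_11, &global_prog } };
        const struct subprog static_prog_2 = { "static_prog_2", 100, 0, { 0 } };
        const struct subprog main_prog = { "main_prog", 100, 0, { &static_prog_1, &static_prog_2 } };

        printf("main_prog: %d\n", priv_stack_size(&main_prog));        /* 300 */
        printf("global_prog: %d\n", priv_stack_size(&global_prog));    /* 200 */
        return 0;
}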
Alexei Starovoitov July 24, 2024, 4:54 p.m. UTC | #19
On Tue, Jul 23, 2024 at 10:09 PM Yonghong Song <yonghong.song@linux.dev> wrote:
>
>
> Discussed with Andrii. I think the following approach should work.
> For each non-static prog, the private stack is allocated including
> that non-static prog and the called static progs. For example,
>      main_prog
>         static_prog_1
>           static_prog_11
>           global_prog
>              static_prog_12
>         static_prog_2
>
> So in verifier we calculate stack size for
>      main_prog
>         static_prog_1
>            static_prog_11
>         static_prog_2
>   and
>      global_prog
>        static_prog_12
>
> Let us say the stack size for main_prog like below for each (sub)prog
>      main_prog // stack size 100
>         static_prog_1 // stack size 100
>           static_prog_11 // stack size 100
>         static_prog_2 // static size 100
> so total static size is 300 so the private stack size will be 300.
> So R9 is calculated like below
>      main_prog
>        R9 = ... // for tailcall reachable, R9 may be original R9 + offset
>                 // for non-tailcall reachable, R9 equals the original R9 (based on jit-time allocation).
>        ...  R9 ...
>        R9 += 100
>        static_prog_1
>           ... R9 ...
>           R9 += 100
>           static_prog_11
>             ... R9 ...
>           R9 -= 100
>        R9 -= 100
>        ... R9 ...
>        R9 += 100
>        static_prog_2
>           ... R9 ...
>        R9 -= 100
>
> Similary, we can calculate R9 offset for
>      global_prog
>        static_prog_12
> as well.

I don't understand why differentiate static and global subprogs.

But, mainly, += and -= around the call is suboptimal.
Can we do it as a normal stack does ?
Each prog knows how much stack it needs,
so it can do:
r9 += stack_depth in the prologue
and all accesses are done as r9 - off.
Then to do a call nothing extra is needed.
The callee will do r9 += its own stack depth.
Whether the private stack grows up or down is tbd.

The challenge is how to supply proper r9 on entry
into the main prog. Potentially a task for bpf trampoline,
and kprobe/tp/etc will need special hack in bpf_dispatcher_nop_func.
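
A toy user-space model of that calling convention (not JIT output; the
cursor below plays the role of r9): every (sub)prog bumps it by its own
stack depth in the prologue, addresses its slots at negative offsets from
it, and restores it in the epilogue, so call sites need nothing extra.
Whether it should really grow upward like this is exactly the open
question above.

#include <assert.h>

static unsigned char priv_stack[8192];  /* stands in for the per-cpu area */
static unsigned char *r9 = priv_stack;  /* set up once on entry into the prog */

static void subprog(void)               /* stack depth 64 in this model */
{
        r9 += 64;                       /* prologue: r9 += own stack depth */
        r9[-8] = 0x2a;                  /* a slot accessed as "r9 - 8" */
        r9 -= 64;                       /* epilogue: undo the bump */
}

static void main_prog(void)             /* stack depth 128 in this model */
{
        r9 += 128;
        r9[-16] = 0x01;
        subprog();                      /* nothing extra at the call site */
        r9 -= 128;
}

int main(void)
{
        main_prog();
        assert(r9 == priv_stack);       /* cursor restored to its base */
        return 0;
}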
Yonghong Song July 24, 2024, 5:56 p.m. UTC | #20
On 7/24/24 9:54 AM, Alexei Starovoitov wrote:
> On Tue, Jul 23, 2024 at 10:09 PM Yonghong Song <yonghong.song@linux.dev> wrote:
>>
>> Discussed with Andrii. I think the following approach should work.
>> For each non-static prog, the private stack is allocated including
>> that non-static prog and the called static progs. For example,
>>       main_prog
>>          static_prog_1
>>            static_prog_11
>>            global_prog
>>               static_prog_12
>>          static_prog_2
>>
>> So in verifier we calculate stack size for
>>       main_prog
>>          static_prog_1
>>             static_prog_11
>>          static_prog_2
>>    and
>>       global_prog
>>         static_prog_12
>>
>> Let us say the stack size for main_prog like below for each (sub)prog
>>       main_prog // stack size 100
>>          static_prog_1 // stack size 100
>>            static_prog_11 // stack size 100
>>          static_prog_2 // static size 100
>> so total static size is 300 so the private stack size will be 300.
>> So R9 is calculated like below
>>       main_prog
>>         R9 = ... // for tailcall reachable, R9 may be original R9 + offset
>>                  // for non-tailcall reachable, R9 equals the original R9 (based on jit-time allocation).
>>         ...  R9 ...
>>         R9 += 100
>>         static_prog_1
>>            ... R9 ...
>>            R9 += 100
>>            static_prog_11
>>              ... R9 ...
>>            R9 -= 100
>>         R9 -= 100
>>         ... R9 ...
>>         R9 += 100
>>         static_prog_2
>>            ... R9 ...
>>         R9 -= 100
>>
>> Similary, we can calculate R9 offset for
>>       global_prog
>>         static_prog_12
>> as well.
> I don't understand why differentiate static and global subprogs.

Handling global subprogs specially is for a potential BPF_PROG_TYPE_EXT
prog which may replace a global subprog.

Therefore, for the private stack, a global subprog terminates the
stack accounting, to minimize stack usage. If we treated
static/global subprogs the same and freplace did happen,
we might allocate a larger-than-necessary private stack.

freplace is probably not a common use case. If it does happen,
the original global subprog may be a stub func which does
not have any stack usage of its own, and the freplace prog is the one
implementing the business logic. So from that perspective,
we do not need to differentiate static and global subprogs.

>
> But, mainly, += and -= around the call is suboptimal.
> Can we do it as a normal stack does ?
> Each prog knows how much stack it needs,
> so it can do:
> r9 += stack_depth in the prologue
> and all accesses are done as r9 - off.
> Then to do a call nothing extra is needed.
> The callee will do r9 += its own stack depth.

I thought the += and -= at the call site were easier to understand.
But certainly, doing r9 += stack_depth and
r9 -= stack_depth inside the subprog works too.

> Whether the private stack grows up or down is tbd.

My current approach is that the private stack grows down,
similar to the normal stack. But we have the flexibility
to grow up at the frame level.

>
> The challenge is how to supply proper r9 on entry
> into the main prog. Potentially a task for bpf trampoline,
> and kprobe/tp/etc will need special hack in bpf_dispatcher_nop_func.

I have an early hack for the bpf trampoline and
bpf_dispatcher_nop_func to pass the private stack pointer
as the third argument to the bpf program.
In this particular case, we can just pass the private
stack pointer in R9. I will pick this up.
Yonghong Song July 24, 2024, 9:28 p.m. UTC | #21
On 7/21/24 8:33 PM, Eduard Zingerman wrote:
> Hi Yonghong,
>
> In general I think that changes in this patch are logical and make sense.
> I have a suggestion regarding testing JIT related changes.
>
> We currently lack a convenient way to verify jit behaviour modulo
> runtime tests. I think we should have a capability to write tests like below:
>
>      SEC("tp")
>      __jited_x86("f:	endbr64")
>      __jited_x86("13:	movabs $0x.*,%r9")
>      __jited_x86("1d:	add    %gs:0x.*,%r9")
>      __jited_x86("26:	mov    $0x1,%edi")
>      __jited_x86("2b:	mov    %rdi,-0x8(%r9)")
>      __jited_x86("2f:	mov    -0x8(%r9),%rdi")
>      __jited_x86("33:	xor    %eax,%eax")
>      __jited_x86("35:	lock xchg %rax,-0x8(%r9)")
>      __jited_x86("3a:	lock xadd %rax,-0x8(%r9)")
>      __naked void stack_access_insns(void)
>      {
>      	asm volatile (
>      	"r1 = 1;"
>      	"*(u64 *)(r10 - 8) = r1;"
>      	"r1 = *(u64 *)(r10 - 8);"
>      	"r0 = 0;"
>      	"r0 = xchg_64(r10 - 8, r0);"
>      	"r0 = atomic_fetch_add((u64 *)(r10 - 8), r0);"
>      	"exit;"
>      	::: __clobber_all);
>      }
>
> In the following branch I explored a way to add such a capability:
> https://github.com/eddyz87/bpf/tree/yhs-private-stack-plus-jit-testing
>
> Besides testing the exact translation, such tests also provide a good
> starting point for people trying to figure out how some jit features work.
>
> The below two commits are the gist of the feature:
> 8f9361be2fb3 ("selftests/bpf: __jited_x86 test tag to check x86 assembly after jit")
> 0156b148b5b4 ("selftests/bpf: utility function to get program disassembly after jit")
>
> For "0156b148b5b4" I opted to do a popen() call and execute bpftool process,
> an alternative would be to:
> a. either link tools/bpf/bpftool/jit_disasm.c as a part of the
>     test_progs executable;
> b. call libbfd (binutils dis-assembler) directly from the bpftool.
>
> Currently bpftool can use two dis-assemblers: libbfd and the llvm library,
> depending on the build environment. For CI builds libbfd is used.
> I don't know if llvm and libbfd always produce the same output for
> identical binary code. Imo, if people are Ok with adding a libbfd

I tried a simple example like below.
$ cat test.c
#include <stdint.h>
typedef struct {
     unsigned char x[8];
} buf_t;
void f(buf_t *buf, uint64_t y, uint64_t z) {
     if (z > 8) z = 8;
     unsigned char *y_bytes = (unsigned char *)&y;
     for (int i = 0; i < z; ++i) {
         buf->x[i] = y_bytes[i];
     }
}
$ clang -c test.c
$ objdump -d test.o <==== gcc binutils based objdump

test.o:     file format elf64-x86-64


Disassembly of section .text:

0000000000000000 <f>:
    0:   55                      push   %rbp
    1:   48 89 e5                mov    %rsp,%rbp
    4:   48 89 7d f8             mov    %rdi,-0x8(%rbp)
    8:   48 89 75 f0             mov    %rsi,-0x10(%rbp)
    c:   48 89 55 e8             mov    %rdx,-0x18(%rbp)
   10:   48 83 7d e8 08          cmpq   $0x8,-0x18(%rbp)
   15:   76 08                   jbe    1f <f+0x1f>
   17:   48 c7 45 e8 08 00 00    movq   $0x8,-0x18(%rbp)
   1e:   00
   1f:   48 8d 45 f0             lea    -0x10(%rbp),%rax
   23:   48 89 45 e0             mov    %rax,-0x20(%rbp)
   27:   c7 45 dc 00 00 00 00    movl   $0x0,-0x24(%rbp)
   2e:   48 63 45 dc             movslq -0x24(%rbp),%rax
   32:   48 3b 45 e8             cmp    -0x18(%rbp),%rax
   36:   73 21                   jae    59 <f+0x59>
   38:   48 8b 45 e0             mov    -0x20(%rbp),%rax
   3c:   48 63 4d dc             movslq -0x24(%rbp),%rcx
   40:   8a 14 08                mov    (%rax,%rcx,1),%dl
   43:   48 8b 45 f8             mov    -0x8(%rbp),%rax
   47:   48 63 4d dc             movslq -0x24(%rbp),%rcx
   4b:   88 14 08                mov    %dl,(%rax,%rcx,1)
   4e:   8b 45 dc                mov    -0x24(%rbp),%eax
   51:   83 c0 01                add    $0x1,%eax
   54:   89 45 dc                mov    %eax,-0x24(%rbp)
   57:   eb d5                   jmp    2e <f+0x2e>
   59:   5d                      pop    %rbp
   5a:   c3                      ret

$ llvm-objdump -d test.o  <== clang based objdump

test.o: file format elf64-x86-64

Disassembly of section .text:

0000000000000000 <f>:
        0: 55                            pushq   %rbp
        1: 48 89 e5                      movq    %rsp, %rbp
        4: 48 89 7d f8                   movq    %rdi, -0x8(%rbp)
        8: 48 89 75 f0                   movq    %rsi, -0x10(%rbp)
        c: 48 89 55 e8                   movq    %rdx, -0x18(%rbp)
       10: 48 83 7d e8 08                cmpq    $0x8, -0x18(%rbp)
       15: 76 08                         jbe     0x1f <f+0x1f>
       17: 48 c7 45 e8 08 00 00 00       movq    $0x8, -0x18(%rbp)
       1f: 48 8d 45 f0                   leaq    -0x10(%rbp), %rax
       23: 48 89 45 e0                   movq    %rax, -0x20(%rbp)
       27: c7 45 dc 00 00 00 00          movl    $0x0, -0x24(%rbp)
       2e: 48 63 45 dc                   movslq  -0x24(%rbp), %rax
       32: 48 3b 45 e8                   cmpq    -0x18(%rbp), %rax
       36: 73 21                         jae     0x59 <f+0x59>
       38: 48 8b 45 e0                   movq    -0x20(%rbp), %rax
       3c: 48 63 4d dc                   movslq  -0x24(%rbp), %rcx
       40: 8a 14 08                      movb    (%rax,%rcx), %dl
       43: 48 8b 45 f8                   movq    -0x8(%rbp), %rax
       47: 48 63 4d dc                   movslq  -0x24(%rbp), %rcx
       4b: 88 14 08                      movb    %dl, (%rax,%rcx)
       4e: 8b 45 dc                      movl    -0x24(%rbp), %eax
       51: 83 c0 01                      addl    $0x1, %eax
       54: 89 45 dc                      movl    %eax, -0x24(%rbp)
       57: eb d5                         jmp     0x2e <f+0x2e>
       59: 5d                            popq    %rbp
       5a: c3                            retq

There are some differences in constant representation, e.g. for jump offset hex numbers
the binutils objdump does not emit a '0x' prefix while llvm-objdump does. The insn at 4b
is also rendered differently. But overall the differences are small.

> dependency to test_progs, option (b) is the best. If folks on the
> mailing list agree with this, I can work on updating the patches.
>
> -------------
>
> Aside from testing, I agree with Andrii regarding rbp usage;
> it seems like it should be possible to do the following in the prologue:
>
>      movabs $0x...,%rsp
>      add %gs:0x...,%rsp
>      push %rbp
>
> and there would be no need to modify the translation for instructions
> accessing r10, plus debugger stack unwinding logic should still work?
> Or am I mistaken?
>
> Thanks,
> Eduard
Alexei Starovoitov July 25, 2024, 4:55 a.m. UTC | #22
On Wed, Jul 24, 2024 at 2:28 PM Yonghong Song <yonghong.song@linux.dev> wrote:
>
> > In the following branch I explored a way to add such a capability:
> > https://github.com/eddyz87/bpf/tree/yhs-private-stack-plus-jit-testing
...
> > 0156b148b5b4 ("selftests/bpf: utility function to get program disassembly after jit")
..

> There are some differences in constant representation, e.g. for jump offset hex numbers
> the binutils objdump does not emit a '0x' prefix while llvm-objdump does. The insn at 4b
> is also rendered differently. But overall the differences are small.

There are certainly differences between the disasm libs that will cause
mismatches with the expected text.

Also not everyone has disasm enabled in bpftool.
In my setup:
$ bpftool prog dump jited id 1
Error: No JIT disassembly support

So we can enable such a feature in selftests,
but it would have to skip the tests if bpftool is not built
with the right disasm library, hence the value of such
tests would be small.

It's probably better to make test_progs use
LLVMDisasm* directly and converge on that disasm style
assuming distros have this lib easily available.
Eduard Zingerman July 25, 2024, 5:20 p.m. UTC | #23
On Wed, 2024-07-24 at 21:55 -0700, Alexei Starovoitov wrote:

[...]

> So we can enable such a feature in selftests,
> but it would have to skip the tests if bpftool is not built
> with the right disasm library, hence the value of such
> tests would be small.
> 
> It's probably better to make test_progs use
> LLVMDisasm* directly and converge on that disasm style
> assuming distros have this lib easily available.

I agree that the differences in the disassembly are too big.
As Yonghong suggested, I checked why bpftool has two disassemblers;
this is explained in commit [0]:

> ... To disassemble instructions for JIT-ed programs, bpftool has
> relied on the libbfd library. This has been problematic in the past:
> libbfd's interface is not meant to be stable and has changed several
> times ...

I'll update the disassembly patch to use the LLVM library
(or skip the test if the library is not available).

[0] eb9d1acf634b ("bpftool: Add LLVM as default library for disassembling JIT-ed programs")
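
For reference, a minimal sketch (illustrative only, not the actual
selftest plumbing) of driving the LLVM-C disassembler directly over a
raw x86-64 byte buffer; the real patch would feed the jited image of a
bpf prog instead:

/* Disassemble a small x86-64 buffer with the LLVM-C API. */
#include <stdio.h>
#include <stdint.h>
#include <llvm-c/Disassembler.h>
#include <llvm-c/Target.h>

int main(void)
{
	/* push %rbp; mov %rsp,%rbp; pop %rbp; ret */
	uint8_t code[] = { 0x55, 0x48, 0x89, 0xe5, 0x5d, 0xc3 };
	char line[128];
	uint64_t pc = 0;
	LLVMDisasmContextRef ctx;

	LLVMInitializeAllTargetInfos();
	LLVMInitializeAllTargetMCs();
	LLVMInitializeAllDisassemblers();

	ctx = LLVMCreateDisasm("x86_64-pc-linux-gnu", NULL, 0, NULL, NULL);
	if (!ctx)
		return 1;

	while (pc < sizeof(code)) {
		size_t len = LLVMDisasmInstruction(ctx, code + pc,
						   sizeof(code) - pc, pc,
						   line, sizeof(line));
		if (!len)
			break;
		printf("%4llx:%s\n", (unsigned long long)pc, line);
		pc += len;
	}
	LLVMDisasmDispose(ctx);
	return 0;
}

Linking would need the LLVM libs, e.g. something like
clang sketch.c $(llvm-config --cflags --ldflags --libs) -o sketch,
with the exact flags depending on how the distro packages LLVM.
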
diff mbox series

Patch

diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c
index d25d81c8ecc0..60f5d86fb6aa 100644
--- a/arch/x86/net/bpf_jit_comp.c
+++ b/arch/x86/net/bpf_jit_comp.c
@@ -1309,6 +1309,22 @@  static void emit_shiftx(u8 **pprog, u32 dst_reg, u8 src_reg, bool is64, u8 op)
 	*pprog = prog;
 }
 
+static void emit_private_frame_ptr(u8 **pprog, void *private_frame_ptr)
+{
+	u8 *prog = *pprog;
+
+	/* movabs r9, private_frame_ptr */
+	emit_mov_imm64(&prog, X86_REG_R9, (long) private_frame_ptr >> 32,
+		       (u32) (long) private_frame_ptr);
+
+	/* add <r9>, gs:[<off>] */
+	EMIT2(0x65, 0x4c);
+	EMIT3(0x03, 0x0c, 0x25);
+	EMIT((u32)(unsigned long)&this_cpu_off, 4);
+
+	*pprog = prog;
+}
+
 #define INSN_SZ_DIFF (((addrs[i] - addrs[i - 1]) - (prog - temp)))
 
 /* mov rax, qword ptr [rbp - rounded_stack_depth - 8] */
@@ -1324,18 +1340,25 @@  static int do_jit(struct bpf_prog *bpf_prog, int *addrs, u8 *image, u8 *rw_image
 	int insn_cnt = bpf_prog->len;
 	bool seen_exit = false;
 	u8 temp[BPF_MAX_INSN_SIZE + BPF_INSN_SAFETY];
+	u32 stack_depth = bpf_prog->aux->stack_depth;
+	void __percpu *private_frame_ptr = NULL;
 	u64 arena_vm_start, user_vm_start;
 	int i, excnt = 0;
 	int ilen, proglen = 0;
 	u8 *prog = temp;
 	int err;
 
+	if (bpf_prog->private_stack_ptr) {
+		private_frame_ptr = bpf_prog->private_stack_ptr + round_up(stack_depth, 8);
+		stack_depth = 0;
+	}
+
 	arena_vm_start = bpf_arena_get_kern_vm_start(bpf_prog->aux->arena);
 	user_vm_start = bpf_arena_get_user_vm_start(bpf_prog->aux->arena);
 
 	detect_reg_usage(insn, insn_cnt, callee_regs_used);
 
-	emit_prologue(&prog, bpf_prog->aux->stack_depth,
+	emit_prologue(&prog, stack_depth,
 		      bpf_prog_was_classic(bpf_prog), tail_call_reachable,
 		      bpf_is_subprog(bpf_prog), bpf_prog->aux->exception_cb);
 	/* Exception callback will clobber callee regs for its own use, and
@@ -1357,6 +1380,9 @@  static int do_jit(struct bpf_prog *bpf_prog, int *addrs, u8 *image, u8 *rw_image
 		emit_mov_imm64(&prog, X86_REG_R12,
 			       arena_vm_start >> 32, (u32) arena_vm_start);
 
+	if (private_frame_ptr)
+		emit_private_frame_ptr(&prog, private_frame_ptr);
+
 	ilen = prog - temp;
 	if (rw_image)
 		memcpy(rw_image + proglen, temp, ilen);
@@ -1376,6 +1402,14 @@  static int do_jit(struct bpf_prog *bpf_prog, int *addrs, u8 *image, u8 *rw_image
 		u8 *func;
 		int nops;
 
+		if (private_frame_ptr) {
+			if (src_reg == BPF_REG_FP)
+				src_reg = X86_REG_R9;
+
+			if (dst_reg == BPF_REG_FP)
+				dst_reg = X86_REG_R9;
+		}
+
 		switch (insn->code) {
 			/* ALU */
 		case BPF_ALU | BPF_ADD | BPF_X:
@@ -2007,6 +2041,7 @@  st:			if (is_imm8(insn->off))
 				emit_mov_reg(&prog, is64, real_src_reg, BPF_REG_0);
 				/* Restore R0 after clobbering RAX */
 				emit_mov_reg(&prog, true, BPF_REG_0, BPF_REG_AX);
+
 				break;
 			}
 
@@ -2031,14 +2066,20 @@  st:			if (is_imm8(insn->off))
 
 			func = (u8 *) __bpf_call_base + imm32;
 			if (tail_call_reachable) {
-				RESTORE_TAIL_CALL_CNT(bpf_prog->aux->stack_depth);
+				RESTORE_TAIL_CALL_CNT(stack_depth);
 				ip += 7;
 			}
 			if (!imm32)
 				return -EINVAL;
+			if (private_frame_ptr) {
+				EMIT2(0x41, 0x51); /* push r9 */
+				ip += 2;
+			}
 			ip += x86_call_depth_emit_accounting(&prog, func, ip);
 			if (emit_call(&prog, func, ip))
 				return -EINVAL;
+			if (private_frame_ptr)
+				EMIT2(0x41, 0x59); /* pop r9 */
 			break;
 		}
 
@@ -2048,13 +2089,13 @@  st:			if (is_imm8(insn->off))
 							  &bpf_prog->aux->poke_tab[imm32 - 1],
 							  &prog, image + addrs[i - 1],
 							  callee_regs_used,
-							  bpf_prog->aux->stack_depth,
+							  stack_depth,
 							  ctx);
 			else
 				emit_bpf_tail_call_indirect(bpf_prog,
 							    &prog,
 							    callee_regs_used,
-							    bpf_prog->aux->stack_depth,
+							    stack_depth,
 							    image + addrs[i - 1],
 							    ctx);
 			break;
@@ -3218,6 +3259,7 @@  struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *prog)
 {
 	struct bpf_binary_header *rw_header = NULL;
 	struct bpf_binary_header *header = NULL;
+	void __percpu *private_stack_ptr = NULL;
 	struct bpf_prog *tmp, *orig_prog = prog;
 	struct x64_jit_data *jit_data;
 	int proglen, oldproglen = 0;
@@ -3284,6 +3326,15 @@  struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *prog)
 	ctx.cleanup_addr = proglen;
 skip_init_addrs:
 
+	if (bpf_enable_private_stack(prog) && !prog->private_stack_ptr) {
+		private_stack_ptr = __alloc_percpu_gfp(prog->aux->stack_depth, 8, GFP_KERNEL);
+		if (!private_stack_ptr) {
+			prog = orig_prog;
+			goto out_addrs;
+		}
+		prog->private_stack_ptr = private_stack_ptr;
+	}
+
 	/*
 	 * JITed image shrinks with every pass and the loop iterates
 	 * until the image stops shrinking. Very large BPF programs
@@ -3309,6 +3360,10 @@  struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *prog)
 				prog->jited = 0;
 				prog->jited_len = 0;
 			}
+			if (private_stack_ptr) {
+				free_percpu(private_stack_ptr);
+				prog->private_stack_ptr = NULL;
+			}
 			goto out_addrs;
 		}
 		if (image) {
diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 4f1d4a97b9d1..19a3f5355363 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -1563,6 +1563,7 @@  struct bpf_prog {
 					    const struct bpf_insn *insn);
 	struct bpf_prog_aux	*aux;		/* Auxiliary fields */
 	struct sock_fprog_kern	*orig_prog;	/* Original BPF program */
+	void __percpu		*private_stack_ptr;
 	/* Instructions for interpreter */
 	union {
 		DECLARE_FLEX_ARRAY(struct sock_filter, insns);
@@ -1819,6 +1820,7 @@  static inline void bpf_module_put(const void *data, struct module *owner)
 		module_put(owner);
 }
 int bpf_struct_ops_link_create(union bpf_attr *attr);
+bool bpf_enable_private_stack(struct bpf_prog *prog);
 
 #ifdef CONFIG_NET
 /* Define it here to avoid the use of forward declaration */
diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
index 7ee62e38faf0..f69eb0c5fe03 100644
--- a/kernel/bpf/core.c
+++ b/kernel/bpf/core.c
@@ -2813,6 +2813,26 @@  void bpf_prog_free(struct bpf_prog *fp)
 }
 EXPORT_SYMBOL_GPL(bpf_prog_free);
 
+bool bpf_enable_private_stack(struct bpf_prog *prog)
+{
+	if (prog->aux->stack_depth <= 64)
+		return false;
+
+	switch (prog->aux->prog->type) {
+	case BPF_PROG_TYPE_KPROBE:
+	case BPF_PROG_TYPE_TRACEPOINT:
+	case BPF_PROG_TYPE_PERF_EVENT:
+	case BPF_PROG_TYPE_RAW_TRACEPOINT:
+		return true;
+	case BPF_PROG_TYPE_TRACING:
+		if (prog->expected_attach_type != BPF_TRACE_ITER)
+			return true;
+		fallthrough;
+	default:
+		return false;
+	}
+}
+
 /* RNG for unprivileged user space with separated state from prandom_u32(). */
 static DEFINE_PER_CPU(struct rnd_state, bpf_user_rnd_state);
 
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 869265852d51..89162ddb4747 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -2244,6 +2244,7 @@  static void __bpf_prog_put_rcu(struct rcu_head *rcu)
 
 	kvfree(aux->func_info);
 	kfree(aux->func_info_aux);
+	free_percpu(aux->prog->private_stack_ptr);
 	free_uid(aux->user);
 	security_bpf_prog_free(aux->prog);
 	bpf_prog_free(aux->prog);