[RFC,bpf-next,v2,0/9] no_caller_saved_registers attribute for helper calls


Message ID

20240704102402.1644916-1-eddyz87@gmail.com (mailing list archive)

Headers

From: Eduard Zingerman <eddyz87@gmail.com>
To: bpf@vger.kernel.org,
	ast@kernel.org
Cc: andrii@kernel.org,
	daniel@iogearbox.net,
	martin.lau@linux.dev,
	kernel-team@fb.com,
	yonghong.song@linux.dev,
	puranjay@kernel.org,
	jose.marchesi@oracle.com,
	Eduard Zingerman <eddyz87@gmail.com>
Subject: [RFC bpf-next v2 0/9] no_caller_saved_registers attribute for helper
 calls
Date: Thu,  4 Jul 2024 03:23:52 -0700
Message-ID: <20240704102402.1644916-1-eddyz87@gmail.com>
Precedence: bulk
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit

Series

no_caller_saved_registers attribute for helper calls

Message

Eduard Zingerman July 4, 2024, 10:23 a.m. UTC

This RFC seeks to allow using the no_caller_saved_registers gcc/clang
attribute with some BPF helper functions (and kfuncs in the future).

As documented in [1], this attribute means that a function scratches
only some of the caller-saved registers defined by the ABI.
For BPF the set of such registers could be defined as follows:
- R0 is scratched only if the function is non-void;
- R1-R5 are scratched only if the corresponding parameter type is
  defined in the function prototype.
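
As an illustration of the rule above, here is a minimal sketch of how the
set of scratched registers could be derived from a prototype. The
struct proto_sketch type and the nocsr_clobber_mask() name are
hypothetical and are not taken from the patch-set; the sketch only
assumes that parameters occupy r1..r5 in order.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified description of a helper prototype. */
struct proto_sketch {
	bool returns_value;	/* false for a void function */
	int nr_params;		/* number of declared parameters, 0..5 */
};

/*
 * Bit N of the result is set when register rN may be scratched by a
 * nocsr call to a function with the given prototype.
 */
static uint32_t nocsr_clobber_mask(const struct proto_sketch *p)
{
	uint32_t mask = 0;

	assert(p->nr_params >= 0 && p->nr_params <= 5);
	if (p->returns_value)
		mask |= 1u << 0;	/* r0 carries the return value */
	for (int i = 1; i <= p->nr_params; i++)
		mask |= 1u << i;	/* r1..r5 carry the arguments */
	return mask;
}
```

For a function shaped like bpf_get_smp_processor_id() (non-void, no
parameters) this yields a mask with only bit 0 set, so r1-r5 may stay
live across the call.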

The goal of the RFC is to implement no_caller_saved_registers
(nocsr for short) in a backwards compatible manner:
- for kernels that support the feature, gain some performance boost
  from better register allocation;
- for kernels that don't support the feature, allow program execution
  with a minor performance loss.

To achieve this, use a scheme suggested by Alexei Starovoitov:
- for nocsr calls clang allocates registers as if the relevant r0-r5
  registers were not scratched by the call;
- as a post-processing step, clang visits each nocsr call and adds
  a spill/fill pair for every live r0-r5;
- stack offsets used for spills/fills are allocated as minimal
  stack offsets in whole function and are not used for any other
  purposes;
- when the kernel loads a program, it looks for such patterns
  (a nocsr function call surrounded by spills/fills) and checks if
  the spill/fill stack offsets are used exclusively in nocsr patterns;
- if so, and if the current JIT inlines the call to the nocsr function
  (e.g. a helper call), the kernel removes unnecessary spill/fill pairs;
- when an old kernel loads a program, the presence of spill/fill pairs
  keeps the BPF program valid, albeit slightly less efficient.
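
The kernel-side part of the scheme can be sketched roughly as follows.
This is not the verifier code from the patch-set: struct insn_sketch,
enum op_sketch and both function names are hypothetical, the
instruction encoding is heavily simplified, and the sketch leaves out
the check that the stack offsets are used exclusively by nocsr patterns
as well as any jump offset adjustment after removal.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified instruction kinds. */
enum op_sketch { OP_SPILL, OP_FILL, OP_CALL, OP_OTHER };

struct insn_sketch {
	enum op_sketch op;
	int reg;	/* register spilled or filled, unused otherwise */
	int off;	/* stack offset relative to r10, unused otherwise */
};

/*
 * Count how many spill/fill pairs surround the call at index `call`.
 * Pairs must nest symmetrically: spill at call - k, fill at call + k,
 * both using the same register and the same stack offset.
 */
static int nocsr_pairs_around(const struct insn_sketch *insns, size_t cnt,
			      size_t call)
{
	int pairs = 0;

	assert(call < cnt && insns[call].op == OP_CALL);
	for (size_t k = 1; k <= call && call + k < cnt; k++) {
		const struct insn_sketch *st = &insns[call - k];
		const struct insn_sketch *ld = &insns[call + k];

		if (st->op != OP_SPILL || ld->op != OP_FILL ||
		    st->reg != ld->reg || st->off != ld->off)
			break;
		pairs++;
	}
	return pairs;
}

/* Drop the matched pairs in place and return the new instruction count. */
static size_t nocsr_remove_pairs(struct insn_sketch *insns, size_t cnt,
				 size_t call)
{
	size_t pairs = (size_t)nocsr_pairs_around(insns, cnt, call);
	size_t out = 0;

	for (size_t i = 0; i < cnt; i++) {
		/* true for indexes in [call - pairs, call + pairs] except the call */
		int in_pattern = i != call && i + pairs >= call &&
				 i <= call + pairs;

		if (!in_pattern)
			insns[out++] = insns[i];
	}
	return out;
}
```

An old kernel simply never runs the removal step, so the spills and
fills stay in place and the program remains valid.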

Corresponding clang/llvm changes are available in [2].

The patch-set uses the bpf_get_smp_processor_id() function as a canary,
making it the first helper with the nocsr attribute.

For example, consider the following program:

  #define __no_csr __attribute__((no_caller_saved_registers))
  #define SEC(name) __attribute__((section(name), used))
  #define bpf_printk(fmt, ...) bpf_trace_printk((fmt), sizeof(fmt), __VA_ARGS__)

  typedef unsigned int __u32;

  static long (* const bpf_trace_printk)(const char *fmt, __u32 fmt_size, ...) = (void *) 6;
  static __u32 (*const bpf_get_smp_processor_id)(void) __no_csr = (void *)8;

  SEC("raw_tp")
  int test(void *ctx)
  {
          __u32 task = bpf_get_smp_processor_id();
          bpf_printk("ctx=%p, smp=%d", ctx, task);
          return 0;
  }

  char _license[] SEC("license") = "GPL";

Compiled (using [2]) as follows:

  $ clang --target=bpf -O2 -g -c -o nocsr.bpf.o nocsr.bpf.c
  $ llvm-objdump --no-show-raw-insn -Sd nocsr.bpf.o
    ...
  3rd parameter for printk call     removable spill/fill pair
  .--- 0:       r3 = r1                             |
; |       __u32 task = bpf_get_smp_processor_id();  |
  |    1:       *(u64 *)(r10 - 0x8) = r3 <----------|
  |    2:       call 0x8                            |
  |    3:       r3 = *(u64 *)(r10 - 0x8) <----------'
; |     bpf_printk("ctx=%p, smp=%d", ctx, task);
  |    4:       r1 = 0x0 ll
  |    6:       r2 = 0xf
  |    7:       r4 = r0
  '--> 8:       call 0x6
;       return 0;
       9:       r0 = 0x0
      10:       exit

Here is how the program looks after verifier processing:

  # bpftool prog load ./nocsr.bpf.o /sys/fs/bpf/nocsr-test
  # bpftool prog dump xlated pinned /sys/fs/bpf/nocsr-test
  int test(void * ctx):
  ; int test(void *ctx)
     0: (bf) r3 = r1               <--------- 3rd printk parameter
  ; __u32 task = bpf_get_smp_processor_id();
     1: (b4) w0 = 197132           <--------- inlined helper call,
     2: (bf) r0 = r0               <--------- spill/fill pair removed
     3: (61) r0 = *(u32 *)(r0 +0)  <---------
  ; bpf_printk("ctx=%p, smp=%d", ctx, task);
     4: (18) r1 = map[id:13][0]+0
     6: (b7) r2 = 15
     7: (bf) r4 = r0
     8: (85) call bpf_trace_printk#-125920
  ; return 0;
     9: (b7) r0 = 0
    10: (95) exit

[1] https://clang.llvm.org/docs/AttributeReference.html#no-caller-saved-registers
[2] https://github.com/eddyz87/llvm-project/tree/bpf-no-caller-saved-registers

Change list:
- v1 -> v2:
  - assume that functions inlined by either jit or verifier
    conform to no_caller_saved_registers contract (Andrii, Puranjay);
  - allow nocsr rewrite for bpf_get_smp_processor_id()
    on arm64 and riscv64 architectures (Puranjay);
  - __arch_{x86_64,arm64,riscv64} macro for test_loader;
  - moved remove_nocsr_spills_fills() inside do_misc_fixups() (Andrii);
  - moved nocsr pattern detection from check_cfg() to a separate pass
    (Andrii);
  - various stylistic/correctness changes according to Andrii's
    comments.

Revisions:
- v1 https://lore.kernel.org/bpf/20240629094733.3863850-1-eddyz87@gmail.com/

Eduard Zingerman (9):
  bpf: add a get_helper_proto() utility function
  bpf: no_caller_saved_registers attribute for helper calls
  bpf, x86, riscv, arm: no_caller_saved_registers for
    bpf_get_smp_processor_id()
  selftests/bpf: extract utility function for BPF disassembly
  selftests/bpf: no need to track next_match_pos in struct test_loader
  selftests/bpf: extract test_loader->expect_msgs as a data structure
  selftests/bpf: allow checking xlated programs in verifier_* tests
  selftests/bpf: __arch_* macro to limit test cases to specific archs
  selftests/bpf: test no_caller_saved_registers spill/fill removal

 include/linux/bpf.h                           |   6 +
 include/linux/bpf_verifier.h                  |  14 +
 kernel/bpf/helpers.c                          |   1 +
 kernel/bpf/verifier.c                         | 339 +++++++++++-
 tools/testing/selftests/bpf/Makefile          |   1 +
 tools/testing/selftests/bpf/disasm_helpers.c  |  51 ++
 tools/testing/selftests/bpf/disasm_helpers.h  |  12 +
 .../selftests/bpf/prog_tests/ctx_rewrite.c    |  74 +--
 .../selftests/bpf/prog_tests/verifier.c       |   2 +
 tools/testing/selftests/bpf/progs/bpf_misc.h  |  13 +
 .../selftests/bpf/progs/verifier_nocsr.c      | 521 ++++++++++++++++++
 tools/testing/selftests/bpf/test_loader.c     | 217 ++++++--
 tools/testing/selftests/bpf/test_progs.h      |   1 -
 tools/testing/selftests/bpf/testing_helpers.c |   1 +
 14 files changed, 1124 insertions(+), 129 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/disasm_helpers.c
 create mode 100644 tools/testing/selftests/bpf/disasm_helpers.h
 create mode 100644 tools/testing/selftests/bpf/progs/verifier_nocsr.c

Comments

Puranjay Mohan July 8, 2024, 11:44 a.m. UTC | #1

Eduard Zingerman <eddyz87@gmail.com> writes:

[...]

Ran the selftest on riscv-64 on qemu:

    root@rv-tester:~/bpf# uname -a
    Linux rv-tester 6.10.0-rc2 #27 SMP Mon Jul  8 09:58:20 UTC 2024 riscv64 riscv64 riscv64 GNU/Linux
    root@rv-tester:~/bpf# ./test_progs -a verifier_nocsr/canary_arm64_riscv64
    #496/2   verifier_nocsr/canary_arm64_riscv64:OK
    #496     verifier_nocsr:OK
    Summary: 1/1 PASSED, 0 SKIPPED, 0 FAILED

Tested-by: Puranjay Mohan <puranjay@kernel.org> #riscv64

Thanks,
Puranjay

Eduard Zingerman July 8, 2024, 5:29 p.m. UTC | #2

On Mon, 2024-07-08 at 11:44 +0000, Puranjay Mohan wrote:

[...]

> Ran the selftest on riscv-64 on qemu:
> 
>     root@rv-tester:~/bpf# uname -a
>     Linux rv-tester 6.10.0-rc2 #27 SMP Mon Jul  8 09:58:20 UTC 2024 riscv64 riscv64 riscv64 GNU/Linux
>     root@rv-tester:~/bpf# ./test_progs -a verifier_nocsr/canary_arm64_riscv64
>     #496/2   verifier_nocsr/canary_arm64_riscv64:OK
>     #496     verifier_nocsr:OK
>     Summary: 1/1 PASSED, 0 SKIPPED, 0 FAILED
> 
> Tested-by: Puranjay Mohan <puranjay@kernel.org> #riscv64

Great, thank you for testing!

Alexei Starovoitov July 10, 2024, 1:18 a.m. UTC | #3

On Thu, Jul 4, 2024 at 3:24 AM Eduard Zingerman <eddyz87@gmail.com> wrote:
>
> - stack offsets used for spills/fills are allocated as minimal
>   stack offsets in whole function and are not used for any other
>   purposes;

"minimal stack offset" reads odd to me.
I noticed the same naming convention is used in the llvm diff.
imo it's odd there as well.
Maybe say:
llvm grows the stack, which in the bpf architecture always grows down,
and picks the lowest stack offset not used by local variables
and spill/fill.
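
One way to read this wording, as a sketch only: the names below are
hypothetical and the arithmetic is an assumption about the layout, not
something taken from the llvm diff. It assumes 8-byte slots and that
the nocsr slots sit directly below everything the function already
uses.

```c
#include <assert.h>

#define SLOT_SIZE_SKETCH 8	/* one 64-bit register per slot */

/*
 * `used_depth` is the number of stack bytes already taken by locals
 * and ordinary spills, i.e. they occupy r10-1 .. r10-used_depth.
 * Returns the r10-relative offset of the idx-th nocsr slot placed
 * right below them (the stack grows down, so offsets get more negative).
 */
static int nocsr_slot_off(int used_depth, int idx)
{
	assert(used_depth >= 0 && used_depth % SLOT_SIZE_SKETCH == 0);
	assert(idx >= 0);
	return -(used_depth + SLOT_SIZE_SKETCH * (idx + 1));
}
```

With no other stack usage the first slot lands at r10 - 8, which is the
offset seen in the objdump listing of the cover letter.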

> Here is how the program looks after verifier processing:
>
>   # bpftool prog load ./nocsr.bpf.o /sys/fs/bpf/nocsr-test
>   # bpftool prog dump xlated pinned /sys/fs/bpf/nocsr-test
>   int test(void * ctx):
>   ; int test(void *ctx)
>      0: (bf) r3 = r1               <--------- 3rd printk parameter
>   ; __u32 task = bpf_get_smp_processor_id();
>      1: (b4) w0 = 197132           <--------- inlined helper call,
>      2: (bf) r0 = r0               <--------- spill/fill pair removed

Are you using old bpftool or something?
That should have been:
r0 = &(void __percpu *)(r0)
?

>      3: (61) r0 = *(u32 *)(r0 +0)  <---------
>   ; bpf_printk("ctx=%p, smp=%d", ctx, task);
>      4: (18) r1 = map[id:13][0]+0
>      6: (b7) r2 = 15
>      7: (bf) r4 = r0
>      8: (85) call bpf_trace_printk#-125920
>   ; return 0;
>      9: (b7) r0 = 0
>     10: (95) exit

Eduard Zingerman July 10, 2024, 3:35 a.m. UTC | #4

On Tue, 2024-07-09 at 18:18 -0700, Alexei Starovoitov wrote:
> On Thu, Jul 4, 2024 at 3:24 AM Eduard Zingerman <eddyz87@gmail.com> wrote:
> > 
> > - stack offsets used for spills/fills are allocated as minimal
> >   stack offsets in whole function and are not used for any other
> >   purposes;
> 
> "minimal stack offset" reads odd to me.
> I noticed the same naming convention is used in the llvm diff.
> imo it's odd there as well.
> Maybe say:
> llvm grows the stack, which in the bpf architecture always grows down,
> and picks the lowest stack offset not used by local variables
> and spill/fill.

Will replace "minimal" with "lowest" here and in the LLVM diff.

> > Here is how the program looks after verifier processing:
> > 
> >   # bpftool prog load ./nocsr.bpf.o /sys/fs/bpf/nocsr-test
> >   # bpftool prog dump xlated pinned /sys/fs/bpf/nocsr-test
> >   int test(void * ctx):
> >   ; int test(void *ctx)
> >      0: (bf) r3 = r1               <--------- 3rd printk parameter
> >   ; __u32 task = bpf_get_smp_processor_id();
> >      1: (b4) w0 = 197132           <--------- inlined helper call,
> >      2: (bf) r0 = r0               <--------- spill/fill pair removed
> 
> Are you using old bpftool or something?
> That should have been:
> r0 = &(void __percpu *)(r0)
> ?

Yes, I was using the distro-provided bpftool.
Re-running with the kernel version of the tool shows the __percpu cast.

> 
> >      3: (61) r0 = *(u32 *)(r0 +0)  <---------

[...]