mbox series

[-next,v5,0/5] bpf, arm64: Optimize BPF store/load using arm64 str/ldr(immediate)

Message ID 20220321152852.2334294-1-xukuohai@huawei.com (mailing list archive)
Headers show
Series bpf, arm64: Optimize BPF store/load using arm64 str/ldr(immediate) | expand

Message

Xu Kuohai March 21, 2022, 3:28 p.m. UTC
The current BPF store/load instruction is translated by the JIT into two
instructions. The first instruction moves the immediate offset into a
temporary register. The second instruction uses this temporary register
to do the real store/load.

In fact, arm64 supports addressing with immediate offsets. So This series
introduces optimization that uses arm64 str/ldr instruction with immediate
offset when the offset fits.

Example of generated instuction for r2 = *(u64 *)(r1 + 0):

Without optimization:
mov x10, 0
ldr x1, [x0, x10]

With optimization:
ldr x1, [x0, 0]

For the following bpftrace command:

  bpftrace -e 'kprobe:do_sys_open { printf("opening: %s\n", str(arg1)); }'

Without this series, jited code(fragment):

   0:   bti     c
   4:   stp     x29, x30, [sp, #-16]!
   8:   mov     x29, sp
   c:   stp     x19, x20, [sp, #-16]!
  10:   stp     x21, x22, [sp, #-16]!
  14:   stp     x25, x26, [sp, #-16]!
  18:   mov     x25, sp
  1c:   mov     x26, #0x0                       // #0
  20:   bti     j
  24:   sub     sp, sp, #0x90
  28:   add     x19, x0, #0x0
  2c:   mov     x0, #0x0                        // #0
  30:   mov     x10, #0xffffffffffffff78        // #-136
  34:   str     x0, [x25, x10]
  38:   mov     x10, #0xffffffffffffff80        // #-128
  3c:   str     x0, [x25, x10]
  40:   mov     x10, #0xffffffffffffff88        // #-120
  44:   str     x0, [x25, x10]
  48:   mov     x10, #0xffffffffffffff90        // #-112
  4c:   str     x0, [x25, x10]
  50:   mov     x10, #0xffffffffffffff98        // #-104
  54:   str     x0, [x25, x10]
  58:   mov     x10, #0xffffffffffffffa0        // #-96
  5c:   str     x0, [x25, x10]
  60:   mov     x10, #0xffffffffffffffa8        // #-88
  64:   str     x0, [x25, x10]
  68:   mov     x10, #0xffffffffffffffb0        // #-80
  6c:   str     x0, [x25, x10]
  70:   mov     x10, #0xffffffffffffffb8        // #-72
  74:   str     x0, [x25, x10]
  78:   mov     x10, #0xffffffffffffffc0        // #-64
  7c:   str     x0, [x25, x10]
  80:   mov     x10, #0xffffffffffffffc8        // #-56
  84:   str     x0, [x25, x10]
  88:   mov     x10, #0xffffffffffffffd0        // #-48
  8c:   str     x0, [x25, x10]
  90:   mov     x10, #0xffffffffffffffd8        // #-40
  94:   str     x0, [x25, x10]
  98:   mov     x10, #0xffffffffffffffe0        // #-32
  9c:   str     x0, [x25, x10]
  a0:   mov     x10, #0xffffffffffffffe8        // #-24
  a4:   str     x0, [x25, x10]
  a8:   mov     x10, #0xfffffffffffffff0        // #-16
  ac:   str     x0, [x25, x10]
  b0:   mov     x10, #0xfffffffffffffff8        // #-8
  b4:   str     x0, [x25, x10]
  b8:   mov     x10, #0x8                       // #8
  bc:   ldr     x2, [x19, x10]
  [...]

With this series, jited code(fragment):

   0:   bti     c
   4:   stp     x29, x30, [sp, #-16]!
   8:   mov     x29, sp
   c:   stp     x19, x20, [sp, #-16]!
  10:   stp     x21, x22, [sp, #-16]!
  14:   stp     x25, x26, [sp, #-16]!
  18:   stp     x27, x28, [sp, #-16]!
  1c:   mov     x25, sp
  20:   sub     x27, x25, #0x88
  24:   mov     x26, #0x0                       // #0
  28:   bti     j
  2c:   sub     sp, sp, #0x90
  30:   add     x19, x0, #0x0
  34:   mov     x0, #0x0                        // #0
  38:   str     x0, [x27]
  3c:   str     x0, [x27, #8]
  40:   str     x0, [x27, #16]
  44:   str     x0, [x27, #24]
  48:   str     x0, [x27, #32]
  4c:   str     x0, [x27, #40]
  50:   str     x0, [x27, #48]
  54:   str     x0, [x27, #56]
  58:   str     x0, [x27, #64]
  5c:   str     x0, [x27, #72]
  60:   str     x0, [x27, #80]
  64:   str     x0, [x27, #88]
  68:   str     x0, [x27, #96]
  6c:   str     x0, [x27, #104]
  70:   str     x0, [x27, #112]
  74:   str     x0, [x27, #120]
  78:   str     x0, [x27, #128]
  7c:   ldr     x2, [x19, #8]
  [...]

Tested with test_bpf on both big-endian and little-endian arm64 qemu:

 test_bpf: Summary: 1026 PASSED, 0 FAILED, [1014/1014 JIT'ed]
 test_bpf: test_tail_calls: Summary: 10 PASSED, 0 FAILED, [10/10 JIT'ed]
 test_bpf: test_skb_segment: Summary: 2 PASSED, 0 FAILED

v4->v5:
1. Fix incorrect FP offset in tail call scenario pointed out by Daniel,
   and add a tail call test case for this issue
2. Align down fpb_offset to 8 bytes to avoid unaligned offsets
3. Style and spelling fix

v3->v4:
1. Fix compile error reported by kernel test robot
2. Add one more test case for load/store in different offsets, and move
   test case to last patch
3. Fix some obvious bugs

v2 -> v3:
1. Split the v2 patch into 2 patches, one for arm64 instruction encoder,
   the other for BPF JIT
2. Add tests for BPF_LDX/BPF_STX with different offsets
3. Adjust the offset of str/ldr(immediate) to positive number

v1 -> v2:
1. Remove macro definition that causes checkpatch to fail
2. Append result to commit message

Xu Kuohai (5):
  arm64: insn: add ldr/str with immediate offset
  bpf, arm64: Optimize BPF store/load using str/ldr with immediate
    offset
  bpf, arm64: adjust the offset of str/ldr(immediate) to positive number
  bpf/tests: Add tests for BPF_LDX/BPF_STX with different offsets
  bpf, arm64: add load store test case for tail call

 arch/arm64/include/asm/insn.h |   9 +
 arch/arm64/lib/insn.c         |  67 ++++++--
 arch/arm64/net/bpf_jit.h      |  14 ++
 arch/arm64/net/bpf_jit_comp.c | 243 ++++++++++++++++++++++++--
 lib/test_bpf.c                | 315 +++++++++++++++++++++++++++++++++-
 5 files changed, 613 insertions(+), 35 deletions(-)

Comments

Xu Kuohai March 30, 2022, 3:43 a.m. UTC | #1
在 2022/3/21 23:28, Xu Kuohai 写道:
> The current BPF store/load instruction is translated by the JIT into two
> instructions. The first instruction moves the immediate offset into a
> temporary register. The second instruction uses this temporary register
> to do the real store/load.
> 
> In fact, arm64 supports addressing with immediate offsets. So This series
> introduces optimization that uses arm64 str/ldr instruction with immediate
> offset when the offset fits.
> 
> Example of generated instuction for r2 = *(u64 *)(r1 + 0):
> 
> Without optimization:
> mov x10, 0
> ldr x1, [x0, x10]
> 
> With optimization:
> ldr x1, [x0, 0]
> 
> For the following bpftrace command:
> 
>    bpftrace -e 'kprobe:do_sys_open { printf("opening: %s\n", str(arg1)); }'
> 
> Without this series, jited code(fragment):
> 
>     0:   bti     c
>     4:   stp     x29, x30, [sp, #-16]!
>     8:   mov     x29, sp
>     c:   stp     x19, x20, [sp, #-16]!
>    10:   stp     x21, x22, [sp, #-16]!
>    14:   stp     x25, x26, [sp, #-16]!
>    18:   mov     x25, sp
>    1c:   mov     x26, #0x0                       // #0
>    20:   bti     j
>    24:   sub     sp, sp, #0x90
>    28:   add     x19, x0, #0x0
>    2c:   mov     x0, #0x0                        // #0
>    30:   mov     x10, #0xffffffffffffff78        // #-136
>    34:   str     x0, [x25, x10]
>    38:   mov     x10, #0xffffffffffffff80        // #-128
>    3c:   str     x0, [x25, x10]
>    40:   mov     x10, #0xffffffffffffff88        // #-120
>    44:   str     x0, [x25, x10]
>    48:   mov     x10, #0xffffffffffffff90        // #-112
>    4c:   str     x0, [x25, x10]
>    50:   mov     x10, #0xffffffffffffff98        // #-104
>    54:   str     x0, [x25, x10]
>    58:   mov     x10, #0xffffffffffffffa0        // #-96
>    5c:   str     x0, [x25, x10]
>    60:   mov     x10, #0xffffffffffffffa8        // #-88
>    64:   str     x0, [x25, x10]
>    68:   mov     x10, #0xffffffffffffffb0        // #-80
>    6c:   str     x0, [x25, x10]
>    70:   mov     x10, #0xffffffffffffffb8        // #-72
>    74:   str     x0, [x25, x10]
>    78:   mov     x10, #0xffffffffffffffc0        // #-64
>    7c:   str     x0, [x25, x10]
>    80:   mov     x10, #0xffffffffffffffc8        // #-56
>    84:   str     x0, [x25, x10]
>    88:   mov     x10, #0xffffffffffffffd0        // #-48
>    8c:   str     x0, [x25, x10]
>    90:   mov     x10, #0xffffffffffffffd8        // #-40
>    94:   str     x0, [x25, x10]
>    98:   mov     x10, #0xffffffffffffffe0        // #-32
>    9c:   str     x0, [x25, x10]
>    a0:   mov     x10, #0xffffffffffffffe8        // #-24
>    a4:   str     x0, [x25, x10]
>    a8:   mov     x10, #0xfffffffffffffff0        // #-16
>    ac:   str     x0, [x25, x10]
>    b0:   mov     x10, #0xfffffffffffffff8        // #-8
>    b4:   str     x0, [x25, x10]
>    b8:   mov     x10, #0x8                       // #8
>    bc:   ldr     x2, [x19, x10]
>    [...]
> 
> With this series, jited code(fragment):
> 
>     0:   bti     c
>     4:   stp     x29, x30, [sp, #-16]!
>     8:   mov     x29, sp
>     c:   stp     x19, x20, [sp, #-16]!
>    10:   stp     x21, x22, [sp, #-16]!
>    14:   stp     x25, x26, [sp, #-16]!
>    18:   stp     x27, x28, [sp, #-16]!
>    1c:   mov     x25, sp
>    20:   sub     x27, x25, #0x88
>    24:   mov     x26, #0x0                       // #0
>    28:   bti     j
>    2c:   sub     sp, sp, #0x90
>    30:   add     x19, x0, #0x0
>    34:   mov     x0, #0x0                        // #0
>    38:   str     x0, [x27]
>    3c:   str     x0, [x27, #8]
>    40:   str     x0, [x27, #16]
>    44:   str     x0, [x27, #24]
>    48:   str     x0, [x27, #32]
>    4c:   str     x0, [x27, #40]
>    50:   str     x0, [x27, #48]
>    54:   str     x0, [x27, #56]
>    58:   str     x0, [x27, #64]
>    5c:   str     x0, [x27, #72]
>    60:   str     x0, [x27, #80]
>    64:   str     x0, [x27, #88]
>    68:   str     x0, [x27, #96]
>    6c:   str     x0, [x27, #104]
>    70:   str     x0, [x27, #112]
>    74:   str     x0, [x27, #120]
>    78:   str     x0, [x27, #128]
>    7c:   ldr     x2, [x19, #8]
>    [...]
> 
> Tested with test_bpf on both big-endian and little-endian arm64 qemu:
> 
>   test_bpf: Summary: 1026 PASSED, 0 FAILED, [1014/1014 JIT'ed]
>   test_bpf: test_tail_calls: Summary: 10 PASSED, 0 FAILED, [10/10 JIT'ed]
>   test_bpf: test_skb_segment: Summary: 2 PASSED, 0 FAILED
> 
> v4->v5:
> 1. Fix incorrect FP offset in tail call scenario pointed out by Daniel,
>     and add a tail call test case for this issue
> 2. Align down fpb_offset to 8 bytes to avoid unaligned offsets
> 3. Style and spelling fix
> 
> v3->v4:
> 1. Fix compile error reported by kernel test robot
> 2. Add one more test case for load/store in different offsets, and move
>     test case to last patch
> 3. Fix some obvious bugs
> 
> v2 -> v3:
> 1. Split the v2 patch into 2 patches, one for arm64 instruction encoder,
>     the other for BPF JIT
> 2. Add tests for BPF_LDX/BPF_STX with different offsets
> 3. Adjust the offset of str/ldr(immediate) to positive number
> 
> v1 -> v2:
> 1. Remove macro definition that causes checkpatch to fail
> 2. Append result to commit message
> 
> Xu Kuohai (5):
>    arm64: insn: add ldr/str with immediate offset
>    bpf, arm64: Optimize BPF store/load using str/ldr with immediate
>      offset
>    bpf, arm64: adjust the offset of str/ldr(immediate) to positive number
>    bpf/tests: Add tests for BPF_LDX/BPF_STX with different offsets
>    bpf, arm64: add load store test case for tail call
> 
>   arch/arm64/include/asm/insn.h |   9 +
>   arch/arm64/lib/insn.c         |  67 ++++++--
>   arch/arm64/net/bpf_jit.h      |  14 ++
>   arch/arm64/net/bpf_jit_comp.c | 243 ++++++++++++++++++++++++--
>   lib/test_bpf.c                | 315 +++++++++++++++++++++++++++++++++-
>   5 files changed, 613 insertions(+), 35 deletions(-)
> 

ping ;)
patchwork-bot+netdevbpf@kernel.org March 31, 2022, 11:20 p.m. UTC | #2
Hello:

This series was applied to bpf/bpf-next.git (master)
by Daniel Borkmann <daniel@iogearbox.net>:

On Mon, 21 Mar 2022 11:28:47 -0400 you wrote:
> The current BPF store/load instruction is translated by the JIT into two
> instructions. The first instruction moves the immediate offset into a
> temporary register. The second instruction uses this temporary register
> to do the real store/load.
> 
> In fact, arm64 supports addressing with immediate offsets. So This series
> introduces optimization that uses arm64 str/ldr instruction with immediate
> offset when the offset fits.
> 
> [...]

Here is the summary with links:
  - [bpf-next,v5,1/5] arm64: insn: add ldr/str with immediate offset
    https://git.kernel.org/bpf/bpf-next/c/30c90f6757a7
  - [bpf-next,v5,2/5] bpf, arm64: Optimize BPF store/load using arm64 str/ldr(immediate offset)
    https://git.kernel.org/bpf/bpf-next/c/7db6c0f1d8ee
  - [bpf-next,v5,3/5] bpf, arm64: adjust the offset of str/ldr(immediate) to positive number
    https://git.kernel.org/bpf/bpf-next/c/5b3d19b9bd40
  - [bpf-next,v5,4/5] bpf/tests: Add tests for BPF_LDX/BPF_STX with different offsets
    https://git.kernel.org/bpf/bpf-next/c/f516420f683d
  - [bpf-next,v5,5/5] bpf, arm64: add load store test case for tail call
    https://git.kernel.org/bpf/bpf-next/c/38608ee7b690

You are awesome, thank you!