
[bpf-next,v2,2/2] bpf, arm64: inline bpf_get_smp_processor_id() helper

Message ID 20240424173550.16359-3-puranjay@kernel.org (mailing list archive)
State Changes Requested
Delegated to: BPF
Series bpf, arm64: Support per-cpu instructions

Checks

Context Check Description
netdev/series_format success Posting correctly formatted
netdev/tree_selection success Clearly marked for bpf-next
netdev/ynl success Generated files up to date; no warnings/errors; no diff in generated;
netdev/fixes_present success Fixes tag not required for -next series
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 955 this patch: 955
netdev/build_tools success No tools touched, skip
netdev/cc_maintainers success CCed 13 of 13 maintainers
netdev/build_clang success Errors and warnings before: 955 this patch: 955
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/deprecated_api success None detected
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes success No Fixes tag
netdev/build_allmodconfig_warn success Errors and warnings before: 966 this patch: 966
netdev/checkpatch warning WARNING: line length of 107 exceeds 80 columns WARNING: line length of 81 exceeds 80 columns WARNING: line length of 82 exceeds 80 columns
netdev/build_clang_rust success No Rust files in patch. Skipping build
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/source_inline success Was 0 now: 0
bpf/vmtest-bpf-next-PR success PR summary
bpf/vmtest-bpf-next-VM_Test-1 success Logs for ShellCheck
bpf/vmtest-bpf-next-VM_Test-3 success Logs for Validate matrix.py
bpf/vmtest-bpf-next-VM_Test-0 success Logs for Lint
bpf/vmtest-bpf-next-VM_Test-2 success Logs for Unittests
bpf/vmtest-bpf-next-VM_Test-5 success Logs for aarch64-gcc / build-release
bpf/vmtest-bpf-next-VM_Test-13 success Logs for s390x-gcc / test (test_maps, false, 360) / test_maps on s390x with gcc
bpf/vmtest-bpf-next-VM_Test-14 success Logs for s390x-gcc / test (test_progs, false, 360) / test_progs on s390x with gcc
bpf/vmtest-bpf-next-VM_Test-17 success Logs for s390x-gcc / veristat
bpf/vmtest-bpf-next-VM_Test-29 success Logs for x86_64-llvm-17 / build-release / build for x86_64 with llvm-17 and -O2 optimization
bpf/vmtest-bpf-next-VM_Test-9 success Logs for aarch64-gcc / test (test_verifier, false, 360) / test_verifier on aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-18 success Logs for set-matrix
bpf/vmtest-bpf-next-VM_Test-36 success Logs for x86_64-llvm-18 / build-release / build for x86_64 with llvm-18 and -O2 optimization
bpf/vmtest-bpf-next-VM_Test-6 success Logs for aarch64-gcc / test (test_maps, false, 360) / test_maps on aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-10 success Logs for aarch64-gcc / veristat
bpf/vmtest-bpf-next-VM_Test-34 success Logs for x86_64-llvm-17 / veristat
bpf/vmtest-bpf-next-VM_Test-35 success Logs for x86_64-llvm-18 / build / build for x86_64 with llvm-18
bpf/vmtest-bpf-next-VM_Test-41 success Logs for x86_64-llvm-18 / test (test_verifier, false, 360) / test_verifier on x86_64 with llvm-18
bpf/vmtest-bpf-next-VM_Test-26 success Logs for x86_64-gcc / test (test_verifier, false, 360) / test_verifier on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-42 success Logs for x86_64-llvm-18 / veristat
bpf/vmtest-bpf-next-VM_Test-20 success Logs for x86_64-gcc / build-release
bpf/vmtest-bpf-next-VM_Test-12 success Logs for s390x-gcc / build-release
bpf/vmtest-bpf-next-VM_Test-15 success Logs for s390x-gcc / test (test_progs_no_alu32, false, 360) / test_progs_no_alu32 on s390x with gcc
bpf/vmtest-bpf-next-VM_Test-28 success Logs for x86_64-llvm-17 / build / build for x86_64 with llvm-17
bpf/vmtest-bpf-next-VM_Test-11 success Logs for s390x-gcc / build / build for s390x with gcc
bpf/vmtest-bpf-next-VM_Test-4 success Logs for aarch64-gcc / build / build for aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-19 success Logs for x86_64-gcc / build / build for x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-16 success Logs for s390x-gcc / test (test_verifier, false, 360) / test_verifier on s390x with gcc
bpf/vmtest-bpf-next-VM_Test-33 success Logs for x86_64-llvm-17 / test (test_verifier, false, 360) / test_verifier on x86_64 with llvm-17
bpf/vmtest-bpf-next-VM_Test-7 success Logs for aarch64-gcc / test (test_progs, false, 360) / test_progs on aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-23 success Logs for x86_64-gcc / test (test_progs_no_alu32, false, 360) / test_progs_no_alu32 on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-22 success Logs for x86_64-gcc / test (test_progs, false, 360) / test_progs on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-39 success Logs for x86_64-llvm-18 / test (test_progs_cpuv4, false, 360) / test_progs_cpuv4 on x86_64 with llvm-18
bpf/vmtest-bpf-next-VM_Test-21 success Logs for x86_64-gcc / test (test_maps, false, 360) / test_maps on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-24 success Logs for x86_64-gcc / test (test_progs_no_alu32_parallel, true, 30) / test_progs_no_alu32_parallel on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-25 success Logs for x86_64-gcc / test (test_progs_parallel, true, 30) / test_progs_parallel on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-8 success Logs for aarch64-gcc / test (test_progs_no_alu32, false, 360) / test_progs_no_alu32 on aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-30 success Logs for x86_64-llvm-17 / test (test_maps, false, 360) / test_maps on x86_64 with llvm-17
bpf/vmtest-bpf-next-VM_Test-32 success Logs for x86_64-llvm-17 / test (test_progs_no_alu32, false, 360) / test_progs_no_alu32 on x86_64 with llvm-17
bpf/vmtest-bpf-next-VM_Test-31 success Logs for x86_64-llvm-17 / test (test_progs, false, 360) / test_progs on x86_64 with llvm-17
bpf/vmtest-bpf-next-VM_Test-40 success Logs for x86_64-llvm-18 / test (test_progs_no_alu32, false, 360) / test_progs_no_alu32 on x86_64 with llvm-18
bpf/vmtest-bpf-next-VM_Test-38 success Logs for x86_64-llvm-18 / test (test_progs, false, 360) / test_progs on x86_64 with llvm-18
bpf/vmtest-bpf-next-VM_Test-27 success Logs for x86_64-gcc / veristat / veristat on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-37 success Logs for x86_64-llvm-18 / test (test_maps, false, 360) / test_maps on x86_64 with llvm-18

Commit Message

Puranjay Mohan April 24, 2024, 5:35 p.m. UTC
As ARM64 JIT now implements BPF_MOV64_PERCPU_REG instruction, inline
bpf_get_smp_processor_id().

ARM64 uses the per-cpu variable cpu_number to store the cpu id.
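
To make the per-cpu lookup concrete, here is a small user-space model of
what the inlined sequence does: take the address of the variable, add the
running CPU's per-cpu base, and load a u32. The names NR_CPUS, AREA_SIZE,
CPU_NUMBER_OFF, percpu_area and the model_* functions are made up for this
sketch and are not kernel symbols; the array simply stands in for the
per-cpu areas that the real base register (tpidr_el1 in the JIT output
shown in this message) points into.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define NR_CPUS        4   /* hypothetical CPU count for the model */
#define AREA_SIZE      64  /* hypothetical size of one per-cpu area */
#define CPU_NUMBER_OFF 8   /* hypothetical offset of cpu_number in an area */

/* One private copy of the per-cpu area for each modelled CPU. */
static unsigned char percpu_area[NR_CPUS][AREA_SIZE];

/* Store each CPU's id in its own copy of cpu_number. */
static void model_init(void)
{
	for (uint32_t cpu = 0; cpu < NR_CPUS; cpu++)
		memcpy(&percpu_area[cpu][CPU_NUMBER_OFF], &cpu, sizeof(cpu));
}

/*
 * Model of the three-step sequence when it runs on `cpu`:
 *   1. r0 = address (here: offset) of cpu_number
 *   2. r0 = r0 + per-cpu base of the running CPU
 *   3. r0 = *(u32 *)(r0 + 0)
 */
static uint32_t model_get_smp_processor_id(unsigned int cpu)
{
	const unsigned char *base;
	const unsigned char *addr;
	uint32_t id;

	assert(cpu < NR_CPUS);
	base = percpu_area[cpu];        /* stands in for the per-cpu base */
	addr = base + CPU_NUMBER_OFF;   /* step 2: base + variable offset */
	memcpy(&id, addr, sizeof(id));  /* step 3: 32-bit load */
	return id;
}
```

After model_init(), calling model_get_smp_processor_id(n) returns n for
every modelled CPU, which is the whole point of the per-cpu copy: the same
variable address yields a different value depending on the base added.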

Here is how the BPF and ARM64 JITed assembly changes after this commit:

                                         BPF
                                        =====
              BEFORE                                       AFTER
             --------                                     -------

int cpu = bpf_get_smp_processor_id();           int cpu = bpf_get_smp_processor_id();
(85) call bpf_get_smp_processor_id#229032       (18) r0 = 0xffff800082072008
                                                (bf) r0 = r0
                                                (61) r0 = *(u32 *)(r0 +0)

				      ARM64 JIT
				     ===========

              BEFORE                                       AFTER
             --------                                     -------

int cpu = bpf_get_smp_processor_id();      int cpu = bpf_get_smp_processor_id();
mov     x10, #0xfffffffffffff4d0           mov     x7, #0xffff8000ffffffff
movk    x10, #0x802b, lsl #16              movk    x7, #0x8207, lsl #16
movk    x10, #0x8000, lsl #32              movk    x7, #0x2008
blr     x10                                mrs     x10, tpidr_el1
add     x7, x0, #0x0                       add     x7, x7, x10
                                           ldr     w7, [x7]
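
For readers who want to check the immediates in the listing above, the
following sketch replays the mov/movk sequences in plain C. It assumes
the leading "mov" is the disassembler's alias for MOVN (move inverted
wide immediate), which is how such constants are normally printed; the
a64_* helper names are invented for this illustration.

```c
#include <assert.h>
#include <stdint.h>

/* MOVN Xd, #imm16, LSL #shift: Xd = ~(imm16 << shift) */
static uint64_t a64_movn(uint16_t imm16, unsigned int shift)
{
	assert(shift < 64 && shift % 16 == 0);
	return ~((uint64_t)imm16 << shift);
}

/* MOVK Xd, #imm16, LSL #shift: overwrite one 16-bit field, keep the rest */
static uint64_t a64_movk(uint64_t reg, uint16_t imm16, unsigned int shift)
{
	assert(shift < 64 && shift % 16 == 0);
	reg &= ~((uint64_t)0xffff << shift);
	return reg | ((uint64_t)imm16 << shift);
}

/* AFTER column: builds the address loaded by the (18) instruction. */
static uint64_t after_sequence(void)
{
	uint64_t x7 = a64_movn(0x7fff, 32);  /* mov  x7, #0xffff8000ffffffff */

	x7 = a64_movk(x7, 0x8207, 16);       /* movk x7, #0x8207, lsl #16 */
	x7 = a64_movk(x7, 0x2008, 0);        /* movk x7, #0x2008 */
	return x7;
}

/* BEFORE column: builds the branch target used by blr x10. */
static uint64_t before_sequence(void)
{
	uint64_t x10 = a64_movn(0x0b2f, 0);  /* mov  x10, #0xfffffffffffff4d0 */

	x10 = a64_movk(x10, 0x802b, 16);     /* movk x10, #0x802b, lsl #16 */
	x10 = a64_movk(x10, 0x8000, 32);     /* movk x10, #0x8000, lsl #32 */
	return x10;
}
```

after_sequence() yields 0xffff800082072008, matching the constant in the
BPF dump, and before_sequence() yields 0xffff8000802bf4d0, presumably the
address of the helper that the BEFORE code calls.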

Performance improvement using benchmark[1]

             BEFORE                                       AFTER
            --------                                     -------

glob-arr-inc   :   23.817 ± 0.019M/s      glob-arr-inc   :   24.631 ± 0.027M/s
arr-inc        :   23.253 ± 0.019M/s      arr-inc        :   23.742 ± 0.023M/s
hash-inc       :   12.258 ± 0.010M/s      hash-inc       :   12.625 ± 0.004M/s

[1] https://github.com/anakryiko/linux/commit/8dec900975ef
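
The relative gains are not stated above; a quick way to derive them from
the table is sketched below. The struct and function names are made up
for this note, and the error bars are ignored, so treat the percentages
as rough figures only.

```c
#include <assert.h>
#include <stdio.h>

struct bench_row {
	const char *name;
	double before;  /* M/s before the patch */
	double after;   /* M/s after the patch */
};

/* Mean throughput values copied from the table in this message. */
static const struct bench_row rows[] = {
	{ "glob-arr-inc", 23.817, 24.631 },
	{ "arr-inc",      23.253, 23.742 },
	{ "hash-inc",     12.258, 12.625 },
};

/* Relative improvement in percent: (after - before) / before * 100 */
static double speedup_percent(double before, double after)
{
	assert(before > 0.0);
	return (after - before) / before * 100.0;
}

/* Print one line per benchmark with the derived percentage. */
static void print_speedups(void)
{
	for (size_t i = 0; i < sizeof(rows) / sizeof(rows[0]); i++)
		printf("%-14s: %+.2f%%\n", rows[i].name,
		       speedup_percent(rows[i].before, rows[i].after));
}
```

With the numbers above this works out to roughly 3.4% for glob-arr-inc,
2.1% for arr-inc and 3.0% for hash-inc.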

Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
---
 kernel/bpf/verifier.c | 11 ++++++++++-
 1 file changed, 10 insertions(+), 1 deletion(-)

Comments

Andrii Nakryiko April 24, 2024, 10:01 p.m. UTC | #1
On Wed, Apr 24, 2024 at 10:36 AM Puranjay Mohan <puranjay@kernel.org> wrote:
>
> As ARM64 JIT now implements BPF_MOV64_PERCPU_REG instruction, inline
> bpf_get_smp_processor_id().
>
> ARM64 uses the per-cpu variable cpu_number to store the cpu id.
>
> Here is how the BPF and ARM64 JITed assembly changes after this commit:
>
>                                          BPF
>                                         =====
>               BEFORE                                       AFTER
>              --------                                     -------
>
> int cpu = bpf_get_smp_processor_id();           int cpu = bpf_get_smp_processor_id();
> (85) call bpf_get_smp_processor_id#229032       (18) r0 = 0xffff800082072008
>                                                 (bf) r0 = r0

nit: hmm, you are probably using a bit outdated bpftool, it should be
emitted as:

(bf) r0 = &(void __percpu *)(r0)

>                                                 (61) r0 = *(u32 *)(r0 +0)
>
>                                       ARM64 JIT
>                                      ===========
>
>               BEFORE                                       AFTER
>              --------                                     -------
>
> int cpu = bpf_get_smp_processor_id();      int cpu = bpf_get_smp_processor_id();
> mov     x10, #0xfffffffffffff4d0           mov     x7, #0xffff8000ffffffff
> movk    x10, #0x802b, lsl #16              movk    x7, #0x8207, lsl #16
> movk    x10, #0x8000, lsl #32              movk    x7, #0x2008
> blr     x10                                mrs     x10, tpidr_el1
> add     x7, x0, #0x0                       add     x7, x7, x10
>                                            ldr     w7, [x7]
>
> Performance improvement using benchmark[1]
>
>              BEFORE                                       AFTER
>             --------                                     -------
>
> glob-arr-inc   :   23.817 ± 0.019M/s      glob-arr-inc   :   24.631 ± 0.027M/s
> arr-inc        :   23.253 ± 0.019M/s      arr-inc        :   23.742 ± 0.023M/s
> hash-inc       :   12.258 ± 0.010M/s      hash-inc       :   12.625 ± 0.004M/s
>
> [1] https://github.com/anakryiko/linux/commit/8dec900975ef
>
> Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
> ---
>  kernel/bpf/verifier.c | 11 ++++++++++-
>  1 file changed, 10 insertions(+), 1 deletion(-)
>

Besides the nits, lgtm.

Acked-by: Andrii Nakryiko <andrii@kernel.org>

> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> index 9715c88cc025..3373be261889 100644
> --- a/kernel/bpf/verifier.c
> +++ b/kernel/bpf/verifier.c
> @@ -20205,7 +20205,7 @@ static int do_misc_fixups(struct bpf_verifier_env *env)
>                         goto next_insn;
>                 }
>
> -#ifdef CONFIG_X86_64
> +#if defined(CONFIG_X86_64) || defined(CONFIG_ARM64)

I think you can drop this, we are protected by
bpf_jit_supports_percpu_insn() check and newly added inner #if/#elif
checks?

>                 /* Implement bpf_get_smp_processor_id() inline. */
>                 if (insn->imm == BPF_FUNC_get_smp_processor_id &&
>                     prog->jit_requested && bpf_jit_supports_percpu_insn()) {
> @@ -20214,11 +20214,20 @@ static int do_misc_fixups(struct bpf_verifier_env *env)
>                          * changed in some incompatible and hard to support
>                          * way, it's fine to back out this inlining logic
>                          */
> +#if defined(CONFIG_X86_64)
>                         insn_buf[0] = BPF_MOV32_IMM(BPF_REG_0, (u32)(unsigned long)&pcpu_hot.cpu_number);
>                         insn_buf[1] = BPF_MOV64_PERCPU_REG(BPF_REG_0, BPF_REG_0);
>                         insn_buf[2] = BPF_LDX_MEM(BPF_W, BPF_REG_0, BPF_REG_0, 0);
>                         cnt = 3;
> +#elif defined(CONFIG_ARM64)
> +                       struct bpf_insn cpu_number_addr[2] = { BPF_LD_IMM64(BPF_REG_0, (u64)&cpu_number) };
>

this &cpu_number offset is not guaranteed to be within 4GB on arm64?

> +                       insn_buf[0] = cpu_number_addr[0];
> +                       insn_buf[1] = cpu_number_addr[1];
> +                       insn_buf[2] = BPF_MOV64_PERCPU_REG(BPF_REG_0, BPF_REG_0);
> +                       insn_buf[3] = BPF_LDX_MEM(BPF_W, BPF_REG_0, BPF_REG_0, 0);
> +                       cnt = 4;
> +#endif
>                         new_prog = bpf_patch_insn_data(env, i + delta, insn_buf, cnt);
>                         if (!new_prog)
>                                 return -ENOMEM;
> --
> 2.40.1
>
Puranjay Mohan April 25, 2024, 10:14 a.m. UTC | #2
Andrii Nakryiko <andrii.nakryiko@gmail.com> writes:

> On Wed, Apr 24, 2024 at 10:36 AM Puranjay Mohan <puranjay@kernel.org> wrote:
>>
>> As ARM64 JIT now implements BPF_MOV64_PERCPU_REG instruction, inline
>> bpf_get_smp_processor_id().
>>
>> ARM64 uses the per-cpu variable cpu_number to store the cpu id.
>>
>> Here is how the BPF and ARM64 JITed assembly changes after this commit:
>>
>>                                          BPF
>>                                         =====
>>               BEFORE                                       AFTER
>>              --------                                     -------
>>
>> int cpu = bpf_get_smp_processor_id();           int cpu = bpf_get_smp_processor_id();
>> (85) call bpf_get_smp_processor_id#229032       (18) r0 = 0xffff800082072008
>>                                                 (bf) r0 = r0
>
> nit: hmm, you are probably using a bit outdated bpftool, it should be
> emitted as:
>
> (bf) r0 = &(void __percpu *)(r0)

Yes, I was using the bpftool shipped with the distro. I tried it again
with the latest bpftool and it emitted this as expected.

>
>>                                                 (61) r0 = *(u32 *)(r0 +0)
>>
>>                                       ARM64 JIT
>>                                      ===========
>>
>>               BEFORE                                       AFTER
>>              --------                                     -------
>>
>> int cpu = bpf_get_smp_processor_id();      int cpu = bpf_get_smp_processor_id();
>> mov     x10, #0xfffffffffffff4d0           mov     x7, #0xffff8000ffffffff
>> movk    x10, #0x802b, lsl #16              movk    x7, #0x8207, lsl #16
>> movk    x10, #0x8000, lsl #32              movk    x7, #0x2008
>> blr     x10                                mrs     x10, tpidr_el1
>> add     x7, x0, #0x0                       add     x7, x7, x10
>>                                            ldr     w7, [x7]
>>
>> Performance improvement using benchmark[1]
>>
>>              BEFORE                                       AFTER
>>             --------                                     -------
>>
>> glob-arr-inc   :   23.817 ± 0.019M/s      glob-arr-inc   :   24.631 ± 0.027M/s
>> arr-inc        :   23.253 ± 0.019M/s      arr-inc        :   23.742 ± 0.023M/s
>> hash-inc       :   12.258 ± 0.010M/s      hash-inc       :   12.625 ± 0.004M/s
>>
>> [1] https://github.com/anakryiko/linux/commit/8dec900975ef
>>
>> Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
>> ---
>>  kernel/bpf/verifier.c | 11 ++++++++++-
>>  1 file changed, 10 insertions(+), 1 deletion(-)
>>
>
> Besides the nits, lgtm.
>
> Acked-by: Andrii Nakryiko <andrii@kernel.org>
>
>> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
>> index 9715c88cc025..3373be261889 100644
>> --- a/kernel/bpf/verifier.c
>> +++ b/kernel/bpf/verifier.c
>> @@ -20205,7 +20205,7 @@ static int do_misc_fixups(struct bpf_verifier_env *env)
>>                         goto next_insn;
>>                 }
>>
>> -#ifdef CONFIG_X86_64
>> +#if defined(CONFIG_X86_64) || defined(CONFIG_ARM64)
>
> I think you can drop this, we are protected by
> bpf_jit_supports_percpu_insn() check and newly added inner #if/#elif
> checks?

If I remove this and later add percpu_insn support on RISC-V without
inlining bpf_get_smp_processor_id(), then it will cause problems here,
right? The last 5-6 lines inside this if () {} block would then be
executed for RISC-V.

>
>>                 /* Implement bpf_get_smp_processor_id() inline. */
>>                 if (insn->imm == BPF_FUNC_get_smp_processor_id &&
>>                     prog->jit_requested && bpf_jit_supports_percpu_insn()) {
>> @@ -20214,11 +20214,20 @@ static int do_misc_fixups(struct bpf_verifier_env *env)
>>                          * changed in some incompatible and hard to support
>>                          * way, it's fine to back out this inlining logic
>>                          */
>> +#if defined(CONFIG_X86_64)
>>                         insn_buf[0] = BPF_MOV32_IMM(BPF_REG_0, (u32)(unsigned long)&pcpu_hot.cpu_number);
>>                         insn_buf[1] = BPF_MOV64_PERCPU_REG(BPF_REG_0, BPF_REG_0);
>>                         insn_buf[2] = BPF_LDX_MEM(BPF_W, BPF_REG_0, BPF_REG_0, 0);
>>                         cnt = 3;
>> +#elif defined(CONFIG_ARM64)
>> +                       struct bpf_insn cpu_number_addr[2] = { BPF_LD_IMM64(BPF_REG_0, (u64)&cpu_number) };
>>
>
> this &cpu_number offset is not guaranteed to be within 4GB on arm64?

Unfortunately, the per-cpu section is not placed in the first 4GB and
therefore the per-cpu pointers are not 32-bit on ARM64.
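
To illustrate with the address from the commit message: the value does
not survive truncation to 32 bits, which is why the patch uses the
two-slot BPF_LD_IMM64 form (opcode 0x18 in the dump) instead of a single
32-bit move. The sketch below uses a simplified stand-in for struct
bpf_insn; model_insn and the model_* helpers are names invented here, and
the register fields are collapsed into one byte for brevity.

```c
#include <assert.h>
#include <stdint.h>

/* Simplified stand-in for one 8-byte BPF instruction slot. */
struct model_insn {
	uint8_t code;  /* opcode */
	uint8_t regs;  /* destination register only, in this model */
	int16_t off;
	int32_t imm;   /* 32-bit immediate */
};

#define MODEL_LD_IMM64 0x18  /* BPF_LD | BPF_DW | BPF_IMM */

/* Does val fit in an unsigned 32-bit immediate without losing bits? */
static int fits_in_u32(uint64_t val)
{
	return val == (uint64_t)(uint32_t)val;
}

/*
 * Encode a 64-bit immediate load as two slots: the first carries the
 * opcode and the low 32 bits, the second carries only the high 32 bits.
 */
static void model_ld_imm64(struct model_insn out[2], uint8_t dst, uint64_t val)
{
	assert(dst <= 10);  /* BPF registers r0..r10 */
	out[0] = (struct model_insn){
		.code = MODEL_LD_IMM64,
		.regs = dst,
		.imm  = (int32_t)(uint32_t)val,
	};
	out[1] = (struct model_insn){
		.imm = (int32_t)(uint32_t)(val >> 32),
	};
}

/* Reassemble the 64-bit value from the two slots. */
static uint64_t model_imm64_value(const struct model_insn in[2])
{
	return (uint64_t)(uint32_t)in[0].imm |
	       ((uint64_t)(uint32_t)in[1].imm << 32);
}
```

For 0xffff800082072008 the first slot carries 0x82072008 and the second
0xffff8000, so the ARM64 path needs cnt = 4 where x86-64 gets by with 3.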

>
>> +                       insn_buf[0] = cpu_number_addr[0];
>> +                       insn_buf[1] = cpu_number_addr[1];
>> +                       insn_buf[2] = BPF_MOV64_PERCPU_REG(BPF_REG_0, BPF_REG_0);
>> +                       insn_buf[3] = BPF_LDX_MEM(BPF_W, BPF_REG_0, BPF_REG_0, 0);
>> +                       cnt = 4;
>> +#endif
>>                         new_prog = bpf_patch_insn_data(env, i + delta, insn_buf, cnt);
>>                         if (!new_prog)
>>                                 return -ENOMEM;
>> --
>> 2.40.1
>>
Andrii Nakryiko April 25, 2024, 6:09 p.m. UTC | #3
On Thu, Apr 25, 2024 at 3:14 AM Puranjay Mohan <puranjay@kernel.org> wrote:
>
> Andrii Nakryiko <andrii.nakryiko@gmail.com> writes:
>
> > On Wed, Apr 24, 2024 at 10:36 AM Puranjay Mohan <puranjay@kernel.org> wrote:
> >>
> >> As ARM64 JIT now implements BPF_MOV64_PERCPU_REG instruction, inline
> >> bpf_get_smp_processor_id().
> >>
> >> ARM64 uses the per-cpu variable cpu_number to store the cpu id.
> >>
> >> Here is how the BPF and ARM64 JITed assembly changes after this commit:
> >>
> >>                                          BPF
> >>                                         =====
> >>               BEFORE                                       AFTER
> >>              --------                                     -------
> >>
> >> int cpu = bpf_get_smp_processor_id();           int cpu = bpf_get_smp_processor_id();
> >> (85) call bpf_get_smp_processor_id#229032       (18) r0 = 0xffff800082072008
> >>                                                 (bf) r0 = r0
> >
> > nit: hmm, you are probably using a bit outdated bpftool, it should be
> > emitted as:
> >
> > (bf) r0 = &(void __percpu *)(r0)
>
> Yes, I was using the bpftool shipped with the distro. I tried it again
> with the latest bpftool and it emitted this as expected.

Cool, would be nice to update the commit message with the right syntax
for the next revision, thanks!

>
> >
> >>                                                 (61) r0 = *(u32 *)(r0 +0)
> >>
> >>                                       ARM64 JIT
> >>                                      ===========
> >>
> >>               BEFORE                                       AFTER
> >>              --------                                     -------
> >>
> >> int cpu = bpf_get_smp_processor_id();      int cpu = bpf_get_smp_processor_id();
> >> mov     x10, #0xfffffffffffff4d0           mov     x7, #0xffff8000ffffffff
> >> movk    x10, #0x802b, lsl #16              movk    x7, #0x8207, lsl #16
> >> movk    x10, #0x8000, lsl #32              movk    x7, #0x2008
> >> blr     x10                                mrs     x10, tpidr_el1
> >> add     x7, x0, #0x0                       add     x7, x7, x10
> >>                                            ldr     w7, [x7]
> >>
> >> Performance improvement using benchmark[1]
> >>
> >>              BEFORE                                       AFTER
> >>             --------                                     -------
> >>
> >> glob-arr-inc   :   23.817 ± 0.019M/s      glob-arr-inc   :   24.631 ± 0.027M/s
> >> arr-inc        :   23.253 ± 0.019M/s      arr-inc        :   23.742 ± 0.023M/s
> >> hash-inc       :   12.258 ± 0.010M/s      hash-inc       :   12.625 ± 0.004M/s
> >>
> >> [1] https://github.com/anakryiko/linux/commit/8dec900975ef
> >>
> >> Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
> >> ---
> >>  kernel/bpf/verifier.c | 11 ++++++++++-
> >>  1 file changed, 10 insertions(+), 1 deletion(-)
> >>
> >
> > Besides the nits, lgtm.
> >
> > Acked-by: Andrii Nakryiko <andrii@kernel.org>
> >
> >> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> >> index 9715c88cc025..3373be261889 100644
> >> --- a/kernel/bpf/verifier.c
> >> +++ b/kernel/bpf/verifier.c
> >> @@ -20205,7 +20205,7 @@ static int do_misc_fixups(struct bpf_verifier_env *env)
> >>                         goto next_insn;
> >>                 }
> >>
> >> -#ifdef CONFIG_X86_64
> >> +#if defined(CONFIG_X86_64) || defined(CONFIG_ARM64)
> >
> > I think you can drop this, we are protected by
> > bpf_jit_supports_percpu_insn() check and newly added inner #if/#elif
> > checks?
>
> If I remove this and later add support of percpu_insn on RISCV without
> inlining bpf_get_smp_processor_id() then it will cause problems here
> right? because then the last 5-6 lines inside this if(){} will be
> executed for RISCV.

Just add

#else
return -EFAULT;
#endif

?

I'm trying to avoid this duplication of the defined(CONFIG_xxx) checks
for supported architectures.

>
> >
> >>                 /* Implement bpf_get_smp_processor_id() inline. */
> >>                 if (insn->imm == BPF_FUNC_get_smp_processor_id &&
> >>                     prog->jit_requested && bpf_jit_supports_percpu_insn()) {
> >> @@ -20214,11 +20214,20 @@ static int do_misc_fixups(struct bpf_verifier_env *env)
> >>                          * changed in some incompatible and hard to support
> >>                          * way, it's fine to back out this inlining logic
> >>                          */
> >> +#if defined(CONFIG_X86_64)
> >>                         insn_buf[0] = BPF_MOV32_IMM(BPF_REG_0, (u32)(unsigned long)&pcpu_hot.cpu_number);
> >>                         insn_buf[1] = BPF_MOV64_PERCPU_REG(BPF_REG_0, BPF_REG_0);
> >>                         insn_buf[2] = BPF_LDX_MEM(BPF_W, BPF_REG_0, BPF_REG_0, 0);
> >>                         cnt = 3;
> >> +#elif defined(CONFIG_ARM64)
> >> +                       struct bpf_insn cpu_number_addr[2] = { BPF_LD_IMM64(BPF_REG_0, (u64)&cpu_number) };
> >>
> >
> > this &cpu_number offset is not guaranteed to be within 4GB on arm64?
>
> Unfortunately, the per-cpu section is not placed in the first 4GB and
> therefore the per-cpu pointers are not 32-bit on ARM64.

I see. It might make sense to turn x86-64 code into using MOV64_IMM as
well to keep more of the logic common. Then it will be just the
difference of an offset that's loaded. Give it a try?

>
> >
> >> +                       insn_buf[0] = cpu_number_addr[0];
> >> +                       insn_buf[1] = cpu_number_addr[1];
> >> +                       insn_buf[2] = BPF_MOV64_PERCPU_REG(BPF_REG_0, BPF_REG_0);
> >> +                       insn_buf[3] = BPF_LDX_MEM(BPF_W, BPF_REG_0, BPF_REG_0, 0);
> >> +                       cnt = 4;
> >> +#endif
> >>                         new_prog = bpf_patch_insn_data(env, i + delta, insn_buf, cnt);
> >>                         if (!new_prog)
> >>                                 return -ENOMEM;
> >> --
> >> 2.40.1
> >>
Puranjay Mohan April 25, 2024, 6:55 p.m. UTC | #4
Andrii Nakryiko <andrii.nakryiko@gmail.com> writes:

> On Thu, Apr 25, 2024 at 3:14 AM Puranjay Mohan <puranjay@kernel.org> wrote:
>>
>> Andrii Nakryiko <andrii.nakryiko@gmail.com> writes:
>>
>> > On Wed, Apr 24, 2024 at 10:36 AM Puranjay Mohan <puranjay@kernel.org> wrote:
>> >>
>> >> As ARM64 JIT now implements BPF_MOV64_PERCPU_REG instruction, inline
>> >> bpf_get_smp_processor_id().
>> >>
>> >> ARM64 uses the per-cpu variable cpu_number to store the cpu id.
>> >>
>> >> Here is how the BPF and ARM64 JITed assembly changes after this commit:
>> >>
>> >>                                          BPF
>> >>                                         =====
>> >>               BEFORE                                       AFTER
>> >>              --------                                     -------
>> >>
>> >> int cpu = bpf_get_smp_processor_id();           int cpu = bpf_get_smp_processor_id();
>> >> (85) call bpf_get_smp_processor_id#229032       (18) r0 = 0xffff800082072008
>> >>                                                 (bf) r0 = r0
>> >
>> > nit: hmm, you are probably using a bit outdated bpftool, it should be
>> > emitted as:
>> >
>> > (bf) r0 = &(void __percpu *)(r0)
>>
>> Yes, I was using the bpftool shipped with the distro. I tried it again
>> with the latest bpftool and it emitted this as expected.
>
> Cool, would be nice to update the commit message with the right syntax
> for next revision, thanks!
>

Sure, will do.

>>
>> >
>> >>                                                 (61) r0 = *(u32 *)(r0 +0)
>> >>
>> >>                                       ARM64 JIT
>> >>                                      ===========
>> >>
>> >>               BEFORE                                       AFTER
>> >>              --------                                     -------
>> >>
>> >> int cpu = bpf_get_smp_processor_id();      int cpu = bpf_get_smp_processor_id();
>> >> mov     x10, #0xfffffffffffff4d0           mov     x7, #0xffff8000ffffffff
>> >> movk    x10, #0x802b, lsl #16              movk    x7, #0x8207, lsl #16
>> >> movk    x10, #0x8000, lsl #32              movk    x7, #0x2008
>> >> blr     x10                                mrs     x10, tpidr_el1
>> >> add     x7, x0, #0x0                       add     x7, x7, x10
>> >>                                            ldr     w7, [x7]
>> >>
>> >> Performance improvement using benchmark[1]
>> >>
>> >>              BEFORE                                       AFTER
>> >>             --------                                     -------
>> >>
>> >> glob-arr-inc   :   23.817 ± 0.019M/s      glob-arr-inc   :   24.631 ± 0.027M/s
>> >> arr-inc        :   23.253 ± 0.019M/s      arr-inc        :   23.742 ± 0.023M/s
>> >> hash-inc       :   12.258 ± 0.010M/s      hash-inc       :   12.625 ± 0.004M/s
>> >>
>> >> [1] https://github.com/anakryiko/linux/commit/8dec900975ef
>> >>
>> >> Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
>> >> ---
>> >>  kernel/bpf/verifier.c | 11 ++++++++++-
>> >>  1 file changed, 10 insertions(+), 1 deletion(-)
>> >>
>> >
>> > Besides the nits, lgtm.
>> >
>> > Acked-by: Andrii Nakryiko <andrii@kernel.org>
>> >
>> >> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
>> >> index 9715c88cc025..3373be261889 100644
>> >> --- a/kernel/bpf/verifier.c
>> >> +++ b/kernel/bpf/verifier.c
>> >> @@ -20205,7 +20205,7 @@ static int do_misc_fixups(struct bpf_verifier_env *env)
>> >>                         goto next_insn;
>> >>                 }
>> >>
>> >> -#ifdef CONFIG_X86_64
>> >> +#if defined(CONFIG_X86_64) || defined(CONFIG_ARM64)
>> >
>> > I think you can drop this, we are protected by
>> > bpf_jit_supports_percpu_insn() check and newly added inner #if/#elif
>> > checks?
>>
>> If I remove this and later add support of percpu_insn on RISCV without
>> inlining bpf_get_smp_processor_id() then it will cause problems here
>> right? because then the last 5-6 lines inside this if(){} will be
>> executed for RISCV.
>
> Just add
>
> #else
> return -EFAULT;

I don't think we can return.

> #endif
>
> ?
>
> I'm trying to avoid this duplication of the defined(CONFIG_xxx) checks
> for supported architectures.

I plan to do it like this. Does the following look correct?

                /* Implement bpf_get_smp_processor_id() inline. */
                if (insn->imm == BPF_FUNC_get_smp_processor_id &&
                    prog->jit_requested && bpf_jit_supports_percpu_insn()) {
                        /* BPF_FUNC_get_smp_processor_id inlining is an
                         * optimization, so if pcpu_hot.cpu_number is ever
                         * changed in some incompatible and hard to support
                         * way, it's fine to back out this inlining logic
                         */
#if defined(CONFIG_X86_64)
                        insn_buf[0] = BPF_MOV32_IMM(BPF_REG_0, (u32)(unsigned long)&pcpu_hot.cpu_number);
                        insn_buf[1] = BPF_MOV64_PERCPU_REG(BPF_REG_0, BPF_REG_0);
                        insn_buf[2] = BPF_LDX_MEM(BPF_W, BPF_REG_0, BPF_REG_0, 0);
                        cnt = 3;
#elif defined(CONFIG_ARM64)
                        struct bpf_insn cpu_number_addr[2] = { BPF_LD_IMM64(BPF_REG_0, (u64)&cpu_number) };

                        insn_buf[0] = cpu_number_addr[0];
                        insn_buf[1] = cpu_number_addr[1];
                        insn_buf[2] = BPF_MOV64_PERCPU_REG(BPF_REG_0, BPF_REG_0);
                        insn_buf[3] = BPF_LDX_MEM(BPF_W, BPF_REG_0, BPF_REG_0, 0);
                        cnt = 4;
#else
                        goto next_insn;
#endif
                        new_prog = bpf_patch_insn_data(env, i + delta, insn_buf, cnt);
                        if (!new_prog)
                                return -ENOMEM;

                        delta    += cnt - 1;
                        env->prog = prog = new_prog;
                        insn      = new_prog->insnsi + i + delta;
                        goto next_insn;
                }
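
For anyone following the index bookkeeping in the snippet above, here is
a toy user-space model of why delta grows by cnt - 1 after each patch:
one instruction is replaced by cnt instructions, so every later original
index i now lives at i + delta. The model_prog type, model_patch() and
model_fixups() are invented for this sketch; the real
bpf_patch_insn_data() may also reallocate the program, which is why prog
and insn are reassigned in the kernel code.

```c
#include <assert.h>
#include <string.h>

#define MAX_INSNS 32
#define CALL_ID   1  /* marker for "call bpf_get_smp_processor_id" */

struct model_prog {
	int insns[MAX_INSNS];
	int len;
};

/* Replace the single instruction at pos with cnt instructions from buf. */
static int model_patch(struct model_prog *p, int pos, const int *buf, int cnt)
{
	if (pos < 0 || pos >= p->len || cnt < 1 ||
	    p->len + cnt - 1 > MAX_INSNS)
		return -1;
	/* Shift the tail right to make room for cnt - 1 extra slots. */
	memmove(&p->insns[pos + cnt], &p->insns[pos + 1],
		(size_t)(p->len - pos - 1) * sizeof(int));
	memcpy(&p->insns[pos], buf, (size_t)cnt * sizeof(int));
	p->len += cnt - 1;
	return 0;
}

/*
 * Walk the original indices and expand every CALL_ID into a
 * four-instruction sequence. Returns the final delta, or -1 on error.
 */
static int model_fixups(struct model_prog *p)
{
	static const int repl[4] = { 10, 11, 12, 13 };
	const int cnt = 4;
	int orig_len = p->len;
	int delta = 0;

	for (int i = 0; i < orig_len; i++) {
		if (p->insns[i + delta] != CALL_ID)
			continue;
		if (model_patch(p, i + delta, repl, cnt))
			return -1;
		delta += cnt - 1;  /* later indices shift by cnt - 1 */
	}
	return delta;
}
```

Starting from the four instructions { 0, CALL_ID, 2, CALL_ID }, the
program ends up with ten instructions and a delta of 6, and the untouched
instruction 2 is found at index 2 + 3 = 5 after the first patch.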


>>
>> >
>> >>                 /* Implement bpf_get_smp_processor_id() inline. */
>> >>                 if (insn->imm == BPF_FUNC_get_smp_processor_id &&
>> >>                     prog->jit_requested && bpf_jit_supports_percpu_insn()) {
>> >> @@ -20214,11 +20214,20 @@ static int do_misc_fixups(struct bpf_verifier_env *env)
>> >>                          * changed in some incompatible and hard to support
>> >>                          * way, it's fine to back out this inlining logic
>> >>                          */
>> >> +#if defined(CONFIG_X86_64)
>> >>                         insn_buf[0] = BPF_MOV32_IMM(BPF_REG_0, (u32)(unsigned long)&pcpu_hot.cpu_number);
>> >>                         insn_buf[1] = BPF_MOV64_PERCPU_REG(BPF_REG_0, BPF_REG_0);
>> >>                         insn_buf[2] = BPF_LDX_MEM(BPF_W, BPF_REG_0, BPF_REG_0, 0);
>> >>                         cnt = 3;
>> >> +#elif defined(CONFIG_ARM64)
>> >> +                       struct bpf_insn cpu_number_addr[2] = { BPF_LD_IMM64(BPF_REG_0, (u64)&cpu_number) };
>> >>
>> >
>> > this &cpu_number offset is not guaranteed to be within 4GB on arm64?
>>
>> Unfortunately, the per-cpu section is not placed in the first 4GB and
>> therefore the per-cpu pointers are not 32-bit on ARM64.
>
> I see. It might make sense to turn x86-64 code into using MOV64_IMM as
> well to keep more of the logic common. Then it will be just the
> difference of an offset that's loaded. Give it a try?

I think MOV64_IMM would have more overhead than MOV32_IMM, so as long as
MOV32_IMM works on x86-64 we should keep using it there. Wdyt?

>>
>> >
>> >> +                       insn_buf[0] = cpu_number_addr[0];
>> >> +                       insn_buf[1] = cpu_number_addr[1];
>> >> +                       insn_buf[2] = BPF_MOV64_PERCPU_REG(BPF_REG_0, BPF_REG_0);
>> >> +                       insn_buf[3] = BPF_LDX_MEM(BPF_W, BPF_REG_0, BPF_REG_0, 0);
>> >> +                       cnt = 4;
>> >> +#endif
>> >>                         new_prog = bpf_patch_insn_data(env, i + delta, insn_buf, cnt);
>> >>                         if (!new_prog)
>> >>                                 return -ENOMEM;
>> >> --
>> >> 2.40.1
>> >>
Andrii Nakryiko April 25, 2024, 8:43 p.m. UTC | #5
On Thu, Apr 25, 2024 at 11:56 AM Puranjay Mohan <puranjay@kernel.org> wrote:
>
> Andrii Nakryiko <andrii.nakryiko@gmail.com> writes:
>
> > On Thu, Apr 25, 2024 at 3:14 AM Puranjay Mohan <puranjay@kernel.org> wrote:
> >>
> >> Andrii Nakryiko <andrii.nakryiko@gmail.com> writes:
> >>
> >> > On Wed, Apr 24, 2024 at 10:36 AM Puranjay Mohan <puranjay@kernel.org> wrote:
> >> >>
> >> >> As ARM64 JIT now implements BPF_MOV64_PERCPU_REG instruction, inline
> >> >> bpf_get_smp_processor_id().
> >> >>
> >> >> ARM64 uses the per-cpu variable cpu_number to store the cpu id.
> >> >>
> >> >> Here is how the BPF and ARM64 JITed assembly changes after this commit:
> >> >>
> >> >>                                          BPF
> >> >>                                         =====
> >> >>               BEFORE                                       AFTER
> >> >>              --------                                     -------
> >> >>
> >> >> int cpu = bpf_get_smp_processor_id();           int cpu = bpf_get_smp_processor_id();
> >> >> (85) call bpf_get_smp_processor_id#229032       (18) r0 = 0xffff800082072008
> >> >>                                                 (bf) r0 = r0
> >> >
> >> > nit: hmm, you are probably using a bit outdated bpftool, it should be
> >> > emitted as:
> >> >
> >> > (bf) r0 = &(void __percpu *)(r0)
> >>
> >> Yes, I was using the bpftool shipped with the distro. I tried it again
> >> with the latest bpftool and it emitted this as expected.
> >
> > Cool, would be nice to update the commit message with the right syntax
> > for next revision, thanks!
> >
>
> Sure, will do.
>
> >>
> >> >
> >> >>                                                 (61) r0 = *(u32 *)(r0 +0)
> >> >>
> >> >>                                       ARM64 JIT
> >> >>                                      ===========
> >> >>
> >> >>               BEFORE                                       AFTER
> >> >>              --------                                     -------
> >> >>
> >> >> int cpu = bpf_get_smp_processor_id();      int cpu = bpf_get_smp_processor_id();
> >> >> mov     x10, #0xfffffffffffff4d0           mov     x7, #0xffff8000ffffffff
> >> >> movk    x10, #0x802b, lsl #16              movk    x7, #0x8207, lsl #16
> >> >> movk    x10, #0x8000, lsl #32              movk    x7, #0x2008
> >> >> blr     x10                                mrs     x10, tpidr_el1
> >> >> add     x7, x0, #0x0                       add     x7, x7, x10
> >> >>                                            ldr     w7, [x7]
> >> >>
> >> >> Performance improvement using benchmark[1]
> >> >>
> >> >>              BEFORE                                       AFTER
> >> >>             --------                                     -------
> >> >>
> >> >> glob-arr-inc   :   23.817 ± 0.019M/s      glob-arr-inc   :   24.631 ± 0.027M/s
> >> >> arr-inc        :   23.253 ± 0.019M/s      arr-inc        :   23.742 ± 0.023M/s
> >> >> hash-inc       :   12.258 ± 0.010M/s      hash-inc       :   12.625 ± 0.004M/s
> >> >>
> >> >> [1] https://github.com/anakryiko/linux/commit/8dec900975ef
> >> >>
> >> >> Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
> >> >> ---
> >> >>  kernel/bpf/verifier.c | 11 ++++++++++-
> >> >>  1 file changed, 10 insertions(+), 1 deletion(-)
> >> >>
> >> >
> >> > Besides the nits, lgtm.
> >> >
> >> > Acked-by: Andrii Nakryiko <andrii@kernel.org>
> >> >
> >> >> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> >> >> index 9715c88cc025..3373be261889 100644
> >> >> --- a/kernel/bpf/verifier.c
> >> >> +++ b/kernel/bpf/verifier.c
> >> >> @@ -20205,7 +20205,7 @@ static int do_misc_fixups(struct bpf_verifier_env *env)
> >> >>                         goto next_insn;
> >> >>                 }
> >> >>
> >> >> -#ifdef CONFIG_X86_64
> >> >> +#if defined(CONFIG_X86_64) || defined(CONFIG_ARM64)
> >> >
> >> > I think you can drop this, we are protected by
> >> > bpf_jit_supports_percpu_insn() check and newly added inner #if/#elif
> >> > checks?
> >>
> >> If I remove this and later add support of percpu_insn on RISCV without
> >> inlining bpf_get_smp_processor_id() then it will cause problems here
> >> right? because then the last 5-6 lines inside this if(){} will be
> >> executed for RISCV.
> >
> > Just add
> >
> > #else
> > return -EFAULT;
>
> I don't think we can return.

ah, because it's not an error condition, right

>
> > #endif
> >
> > ?
> >
> > I'm trying to avoid this duplication of the defined(CONFIG_xxx) checks
> > for supported architectures.
>
> Does the following look correct?
>
> I will do it like this:
>
>                 /* Implement bpf_get_smp_processor_id() inline. */
>                 if (insn->imm == BPF_FUNC_get_smp_processor_id &&
>                     prog->jit_requested && bpf_jit_supports_percpu_insn()) {
>                         /* BPF_FUNC_get_smp_processor_id inlining is an
>                          * optimization, so if pcpu_hot.cpu_number is ever
>                          * changed in some incompatible and hard to support
>                          * way, it's fine to back out this inlining logic
>                          */
> #if defined(CONFIG_X86_64)
>                         insn_buf[0] = BPF_MOV32_IMM(BPF_REG_0, (u32)(unsigned long)&pcpu_hot.cpu_number);
>                         insn_buf[1] = BPF_MOV64_PERCPU_REG(BPF_REG_0, BPF_REG_0);
>                         insn_buf[2] = BPF_LDX_MEM(BPF_W, BPF_REG_0, BPF_REG_0, 0);
>                         cnt = 3;
> #elif defined(CONFIG_ARM64)
>                         struct bpf_insn cpu_number_addr[2] = { BPF_LD_IMM64(BPF_REG_0, (u64)&cpu_number) };
>
>                         insn_buf[0] = cpu_number_addr[0];
>                         insn_buf[1] = cpu_number_addr[1];
>                         insn_buf[2] = BPF_MOV64_PERCPU_REG(BPF_REG_0, BPF_REG_0);
>                         insn_buf[3] = BPF_LDX_MEM(BPF_W, BPF_REG_0, BPF_REG_0, 0);
>                         cnt = 4;
> #else
>                         goto next_insn;
> #endif

Yep, I just wrote a long comment about goto next_insn above and then
saw you had already proposed it :) I think this is the way.

>                         new_prog = bpf_patch_insn_data(env, i + delta, insn_buf, cnt);
>                         if (!new_prog)
>                                 return -ENOMEM;
>
>                         delta    += cnt - 1;
>                         env->prog = prog = new_prog;
>                         insn      = new_prog->insnsi + i + delta;
>                         goto next_insn;
>                 }
>
>
> >>
> >> >
> >> >>                 /* Implement bpf_get_smp_processor_id() inline. */
> >> >>                 if (insn->imm == BPF_FUNC_get_smp_processor_id &&
> >> >>                     prog->jit_requested && bpf_jit_supports_percpu_insn()) {
> >> >> @@ -20214,11 +20214,20 @@ static int do_misc_fixups(struct bpf_verifier_env *env)
> >> >>                          * changed in some incompatible and hard to support
> >> >>                          * way, it's fine to back out this inlining logic
> >> >>                          */
> >> >> +#if defined(CONFIG_X86_64)
> >> >>                         insn_buf[0] = BPF_MOV32_IMM(BPF_REG_0, (u32)(unsigned long)&pcpu_hot.cpu_number);
> >> >>                         insn_buf[1] = BPF_MOV64_PERCPU_REG(BPF_REG_0, BPF_REG_0);
> >> >>                         insn_buf[2] = BPF_LDX_MEM(BPF_W, BPF_REG_0, BPF_REG_0, 0);
> >> >>                         cnt = 3;
> >> >> +#elif defined(CONFIG_ARM64)
> >> >> +                       struct bpf_insn cpu_number_addr[2] = { BPF_LD_IMM64(BPF_REG_0, (u64)&cpu_number) };
> >> >>
> >> >
> >> > this &cpu_number offset is not guaranteed to be within 4GB on arm64?
> >>
> >> Unfortunately, the per-cpu section is not placed in the first 4GB and
> >> therefore the per-cpu pointers are not 32-bit on ARM64.
> >
> > I see. It might make sense to turn x86-64 code into using MOV64_IMM as
> > well to keep more of the logic common. Then it will be just the
> > difference of an offset that's loaded. Give it a try?
>
> I think MOV64_IMM would have more overhead than MOV32_IMM and if we can
> use it in x86-64 we should keep doing it that way. Wdyt?

My assumption (which I didn't check) was that BPF JITs should optimize
such MOV64_IMM that have a constant fitting within 32-bits with a
faster and smaller instruction. But I'm fine leaving it as is, of
course.

>
> >>
> >> >
> >> >> +                       insn_buf[0] = cpu_number_addr[0];
> >> >> +                       insn_buf[1] = cpu_number_addr[1];
> >> >> +                       insn_buf[2] = BPF_MOV64_PERCPU_REG(BPF_REG_0, BPF_REG_0);
> >> >> +                       insn_buf[3] = BPF_LDX_MEM(BPF_W, BPF_REG_0, BPF_REG_0, 0);
> >> >> +                       cnt = 4;
> >> >> +#endif
> >> >>                         new_prog = bpf_patch_insn_data(env, i + delta, insn_buf, cnt);
> >> >>                         if (!new_prog)
> >> >>                                 return -ENOMEM;
> >> >> --
> >> >> 2.40.1
> >> >>
Puranjay Mohan April 26, 2024, 10:26 a.m. UTC | #6
Andrii Nakryiko <andrii.nakryiko@gmail.com> writes:

> On Thu, Apr 25, 2024 at 11:56 AM Puranjay Mohan <puranjay@kernel.org> wrote:
>>
>> Andrii Nakryiko <andrii.nakryiko@gmail.com> writes:
>>
>> > On Thu, Apr 25, 2024 at 3:14 AM Puranjay Mohan <puranjay@kernel.org> wrote:
>> >>
>> >> Andrii Nakryiko <andrii.nakryiko@gmail.com> writes:
>> >>
>> >> > On Wed, Apr 24, 2024 at 10:36 AM Puranjay Mohan <puranjay@kernel.org> wrote:
>> >> >>
>> >> >> As ARM64 JIT now implements BPF_MOV64_PERCPU_REG instruction, inline
>> >> >> bpf_get_smp_processor_id().
>> >> >>
>> >> >> ARM64 uses the per-cpu variable cpu_number to store the cpu id.
>> >> >>
>> >> >> Here is how the BPF and ARM64 JITed assembly changes after this commit:
>> >> >>
>> >> >>                                          BPF
>> >> >>                                         =====
>> >> >>               BEFORE                                       AFTER
>> >> >>              --------                                     -------
>> >> >>
>> >> >> int cpu = bpf_get_smp_processor_id();           int cpu = bpf_get_smp_processor_id();
>> >> >> (85) call bpf_get_smp_processor_id#229032       (18) r0 = 0xffff800082072008
>> >> >>                                                 (bf) r0 = r0
>> >> >
>> >> > nit: hmm, you are probably using a bit outdated bpftool, it should be
>> >> > emitted as:
>> >> >
>> >> > (bf) r0 = &(void __percpu *)(r0)
>> >>
>> >> Yes, I was using the bpftool shipped with the distro. I tried it again
>> >> with the latest bpftool and it emitted this as expected.
>> >
>> > Cool, would be nice to update the commit message with the right syntax
>> > for next revision, thanks!
>> >
>>
>> Sure, will do.
>>
>> >>
>> >> >
>> >> >>                                                 (61) r0 = *(u32 *)(r0 +0)
>> >> >>
>> >> >>                                       ARM64 JIT
>> >> >>                                      ===========
>> >> >>
>> >> >>               BEFORE                                       AFTER
>> >> >>              --------                                     -------
>> >> >>
>> >> >> int cpu = bpf_get_smp_processor_id();      int cpu = bpf_get_smp_processor_id();
>> >> >> mov     x10, #0xfffffffffffff4d0           mov     x7, #0xffff8000ffffffff
>> >> >> movk    x10, #0x802b, lsl #16              movk    x7, #0x8207, lsl #16
>> >> >> movk    x10, #0x8000, lsl #32              movk    x7, #0x2008
>> >> >> blr     x10                                mrs     x10, tpidr_el1
>> >> >> add     x7, x0, #0x0                       add     x7, x7, x10
>> >> >>                                            ldr     w7, [x7]
>> >> >>
>> >> >> Performance improvement using benchmark[1]
>> >> >>
>> >> >>              BEFORE                                       AFTER
>> >> >>             --------                                     -------
>> >> >>
>> >> >> glob-arr-inc   :   23.817 ± 0.019M/s      glob-arr-inc   :   24.631 ± 0.027M/s
>> >> >> arr-inc        :   23.253 ± 0.019M/s      arr-inc        :   23.742 ± 0.023M/s
>> >> >> hash-inc       :   12.258 ± 0.010M/s      hash-inc       :   12.625 ± 0.004M/s
>> >> >>
>> >> >> [1] https://github.com/anakryiko/linux/commit/8dec900975ef
>> >> >>
>> >> >> Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
>> >> >> ---
>> >> >>  kernel/bpf/verifier.c | 11 ++++++++++-
>> >> >>  1 file changed, 10 insertions(+), 1 deletion(-)
>> >> >>
>> >> >
>> >> > Besides the nits, lgtm.
>> >> >
>> >> > Acked-by: Andrii Nakryiko <andrii@kernel.org>
>> >> >
>> >> >> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
>> >> >> index 9715c88cc025..3373be261889 100644
>> >> >> --- a/kernel/bpf/verifier.c
>> >> >> +++ b/kernel/bpf/verifier.c
>> >> >> @@ -20205,7 +20205,7 @@ static int do_misc_fixups(struct bpf_verifier_env *env)
>> >> >>                         goto next_insn;
>> >> >>                 }
>> >> >>
>> >> >> -#ifdef CONFIG_X86_64
>> >> >> +#if defined(CONFIG_X86_64) || defined(CONFIG_ARM64)
>> >> >
>> >> > I think you can drop this, we are protected by
>> >> > bpf_jit_supports_percpu_insn() check and newly added inner #if/#elif
>> >> > checks?
>> >>
>> >> If I remove this and later add support of percpu_insn on RISCV without
>> >> inlining bpf_get_smp_processor_id() then it will cause problems here
>> >> right? because then the last 5-6 lines inside this if(){} will be
>> >> executed for RISCV.
>> >
>> > Just add
>> >
>> > #else
>> > return -EFAULT;
>>
>> I don't think we can return.
>
> ah, because it's not an error condition, right
>
>>
>> > #endif
>> >
>> > ?
>> >
>> > I'm trying to avoid this duplication of the defined(CONFIG_xxx) checks
>> > for supported architectures.
>>
>> Does the following look correct?
>>
>> I will do it like this:
>>
>>                 /* Implement bpf_get_smp_processor_id() inline. */
>>                 if (insn->imm == BPF_FUNC_get_smp_processor_id &&
>>                     prog->jit_requested && bpf_jit_supports_percpu_insn()) {
>>                         /* BPF_FUNC_get_smp_processor_id inlining is an
>>                          * optimization, so if pcpu_hot.cpu_number is ever
>>                          * changed in some incompatible and hard to support
>>                          * way, it's fine to back out this inlining logic
>>                          */
>> #if defined(CONFIG_X86_64)
>>                         insn_buf[0] = BPF_MOV32_IMM(BPF_REG_0, (u32)(unsigned long)&pcpu_hot.cpu_number);
>>                         insn_buf[1] = BPF_MOV64_PERCPU_REG(BPF_REG_0, BPF_REG_0);
>>                         insn_buf[2] = BPF_LDX_MEM(BPF_W, BPF_REG_0, BPF_REG_0, 0);
>>                         cnt = 3;
>> #elif defined(CONFIG_ARM64)
>>                         struct bpf_insn cpu_number_addr[2] = { BPF_LD_IMM64(BPF_REG_0, (u64)&cpu_number) };
>>
>>                         insn_buf[0] = cpu_number_addr[0];
>>                         insn_buf[1] = cpu_number_addr[1];
>>                         insn_buf[2] = BPF_MOV64_PERCPU_REG(BPF_REG_0, BPF_REG_0);
>>                         insn_buf[3] = BPF_LDX_MEM(BPF_W, BPF_REG_0, BPF_REG_0, 0);
>>                         cnt = 4;
>> #else
>>                         goto next_insn;
>> #endif
>
> Yep, I just wrote a long comment about goto next_insn above and then
> saw you had already proposed it :) I think this is the way.
>
>>                         new_prog = bpf_patch_insn_data(env, i + delta, insn_buf, cnt);
>>                         if (!new_prog)
>>                                 return -ENOMEM;
>>
>>                         delta    += cnt - 1;
>>                         env->prog = prog = new_prog;
>>                         insn      = new_prog->insnsi + i + delta;
>>                         goto next_insn;
>>                 }
>>
>>
>> >>
>> >> >
>> >> >>                 /* Implement bpf_get_smp_processor_id() inline. */
>> >> >>                 if (insn->imm == BPF_FUNC_get_smp_processor_id &&
>> >> >>                     prog->jit_requested && bpf_jit_supports_percpu_insn()) {
>> >> >> @@ -20214,11 +20214,20 @@ static int do_misc_fixups(struct bpf_verifier_env *env)
>> >> >>                          * changed in some incompatible and hard to support
>> >> >>                          * way, it's fine to back out this inlining logic
>> >> >>                          */
>> >> >> +#if defined(CONFIG_X86_64)
>> >> >>                         insn_buf[0] = BPF_MOV32_IMM(BPF_REG_0, (u32)(unsigned long)&pcpu_hot.cpu_number);
>> >> >>                         insn_buf[1] = BPF_MOV64_PERCPU_REG(BPF_REG_0, BPF_REG_0);
>> >> >>                         insn_buf[2] = BPF_LDX_MEM(BPF_W, BPF_REG_0, BPF_REG_0, 0);
>> >> >>                         cnt = 3;
>> >> >> +#elif defined(CONFIG_ARM64)
>> >> >> +                       struct bpf_insn cpu_number_addr[2] = { BPF_LD_IMM64(BPF_REG_0, (u64)&cpu_number) };
>> >> >>
>> >> >
>> >> > this &cpu_number offset is not guaranteed to be within 4GB on arm64?
>> >>
>> >> Unfortunately, the per-cpu section is not placed in the first 4GB and
>> >> therefore the per-cpu pointers are not 32-bit on ARM64.
>> >
>> > I see. It might make sense to turn x86-64 code into using MOV64_IMM as
>> > well to keep more of the logic common. Then it will be just the
>> > difference of an offset that's loaded. Give it a try?
>>
>> I think MOV64_IMM would have more overhead than MOV32_IMM and if we can
>> use it in x86-64 we should keep doing it that way. Wdyt?
>
> My assumption (which I didn't check) was that BPF JITs should optimize
> such MOV64_IMM that have a constant fitting within 32-bits with a
> faster and smaller instruction. But I'm fine leaving it as is, of
> course.

You are right. I verified that the JITs will optimize this if the imm is
32-bit. So, I will make it common in the next version.

Also, for the readers, we are discussing:

1) BPF_MOV32_IMM: This moves a 32-bit imm into a register and
                  zero-extends it.

2) BPF_LD_IMM64: This moves (loads) a 64-bit imm into a register. The
                 JITs will optimize this to a BPF_MOV32_IMM if the imm
                 fits in 32 bits.

Not to be confused with:
3) BPF_MOV64_IMM: This also works with a 32-bit imm but will sign-extend
                  it to 64 bits rather than zero-extend.
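
To make the difference concrete, here is a minimal user-space sketch of
what each of the three leaves in the 64-bit destination register. It is
not kernel code: model_mov32_imm(), model_mov64_imm(), model_ld_imm64()
and fits_in_u32() are hypothetical helpers written only for this
illustration, and fits_in_u32() is just one plausible form of the check a
JIT could use to pick the shorter encoding. The 64-bit constant is the
arm64 address shown in the commit message.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical user-space models; none of these are kernel APIs. */

/* BPF_MOV32_IMM: write the 32-bit imm, upper 32 bits become zero. */
static uint64_t model_mov32_imm(int32_t imm)
{
	return (uint64_t)(uint32_t)imm;
}

/* BPF_MOV64_IMM: the 32-bit imm is sign-extended to 64 bits. */
static uint64_t model_mov64_imm(int32_t imm)
{
	return (uint64_t)(int64_t)imm;
}

/* BPF_LD_IMM64: full 64-bit imm, carried as low and high 32-bit halves. */
static uint64_t model_ld_imm64(uint32_t lo, uint32_t hi)
{
	return ((uint64_t)hi << 32) | lo;
}

/* Assumed check: can this constant be loaded by a zero-extending 32-bit move? */
static int fits_in_u32(uint64_t imm)
{
	return imm <= UINT32_MAX;
}

/* Self-check of the semantics described above. */
static void check_imm_semantics(void)
{
	assert(model_mov32_imm(-1) == 0x00000000ffffffffULL);
	assert(model_mov64_imm(-1) == 0xffffffffffffffffULL);
	assert(model_ld_imm64(0x82072008u, 0xffff8000u) == 0xffff800082072008ULL);
	assert(!fits_in_u32(0xffff800082072008ULL));
	assert(fits_in_u32(0x12345678ULL));
}
```

Calling check_imm_semantics() from a main() should pass silently. The
arm64 per-cpu address does not fit in 32 bits, which is why that path
needs BPF_LD_IMM64, while a constant that does fit can be narrowed.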

Thanks,
Puranjay

Patch

diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 9715c88cc025..3373be261889 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -20205,7 +20205,7 @@  static int do_misc_fixups(struct bpf_verifier_env *env)
 			goto next_insn;
 		}
 
-#ifdef CONFIG_X86_64
+#if defined(CONFIG_X86_64) || defined(CONFIG_ARM64)
 		/* Implement bpf_get_smp_processor_id() inline. */
 		if (insn->imm == BPF_FUNC_get_smp_processor_id &&
 		    prog->jit_requested && bpf_jit_supports_percpu_insn()) {
@@ -20214,11 +20214,20 @@  static int do_misc_fixups(struct bpf_verifier_env *env)
 			 * changed in some incompatible and hard to support
 			 * way, it's fine to back out this inlining logic
 			 */
+#if defined(CONFIG_X86_64)
 			insn_buf[0] = BPF_MOV32_IMM(BPF_REG_0, (u32)(unsigned long)&pcpu_hot.cpu_number);
 			insn_buf[1] = BPF_MOV64_PERCPU_REG(BPF_REG_0, BPF_REG_0);
 			insn_buf[2] = BPF_LDX_MEM(BPF_W, BPF_REG_0, BPF_REG_0, 0);
 			cnt = 3;
+#elif defined(CONFIG_ARM64)
+			struct bpf_insn cpu_number_addr[2] = { BPF_LD_IMM64(BPF_REG_0, (u64)&cpu_number) };
 
+			insn_buf[0] = cpu_number_addr[0];
+			insn_buf[1] = cpu_number_addr[1];
+			insn_buf[2] = BPF_MOV64_PERCPU_REG(BPF_REG_0, BPF_REG_0);
+			insn_buf[3] = BPF_LDX_MEM(BPF_W, BPF_REG_0, BPF_REG_0, 0);
+			cnt = 4;
+#endif
 			new_prog = bpf_patch_insn_data(env, i + delta, insn_buf, cnt);
 			if (!new_prog)
 				return -ENOMEM;