Message ID | 20201127175738.1085417-1-jackmanb@google.com (mailing list archive)
Series     | Atomics for eBPF
On 11/27/20 9:57 AM, Brendan Jackman wrote: > Status of the patches > ===================== > > Thanks for the reviews! Differences from v1->v2 [1]: > > * Fixed mistakes in the netronome driver > > * Addd sub, add, or, xor operations > > * The above led to some refactors to keep things readable. (Maybe I > should have just waited until I'd implemented these before starting > the review...) > > * Replaced BPF_[CMP]SET | BPF_FETCH with just BPF_[CMP]XCHG, which > include the BPF_FETCH flag > > * Added a bit of documentation. Suggestions welcome for more places > to dump this info... > > The prog_test that's added depends on Clang/LLVM features added by > Yonghong in https://reviews.llvm.org/D72184 > > This only includes a JIT implementation for x86_64 - I don't plan to > implement JIT support myself for other architectures. > > Operations > ========== > > This patchset adds atomic operations to the eBPF instruction set. The > use-case that motivated this work was a trivial and efficient way to > generate globally-unique cookies in BPF progs, but I think it's > obvious that these features are pretty widely applicable. The > instructions that are added here can be summarised with this list of > kernel operations: > > * atomic[64]_[fetch_]add > * atomic[64]_[fetch_]sub > * atomic[64]_[fetch_]and > * atomic[64]_[fetch_]or * atomic[64]_[fetch_]xor > * atomic[64]_xchg > * atomic[64]_cmpxchg Thanks. Overall looks good to me but I did not check carefully on jit part as I am not an expert in x64 assembly... This patch also introduced atomic[64]_{sub,and,or,xor}, similar to xadd. I am not sure whether it is necessary. For one thing, users can just use atomic[64]_fetch_{sub,and,or,xor} to ignore return value and they will achieve the same result, right? From llvm side, there is no ready-to-use gcc builtin matching atomic[64]_{sub,and,or,xor} which does not have return values. If we go this route, we will need to invent additional bpf specific builtins. > > The following are left out of scope for this effort: > > * 16 and 8 bit operations > * Explicit memory barriers > > Encoding > ======== > > I originally planned to add new values for bpf_insn.opcode. This was > rather unpleasant: the opcode space has holes in it but no entire > instruction classes[2]. Yonghong Song had a better idea: use the > immediate field of the existing STX XADD instruction to encode the > operation. This works nicely, without breaking existing programs, > because the immediate field is currently reserved-must-be-zero, and > extra-nicely because BPF_ADD happens to be zero. > > Note that this of course makes immediate-source atomic operations > impossible. It's hard to imagine a measurable speedup from such > instructions, and if it existed it would certainly not benefit x86, > which has no support for them. > > The BPF_OP opcode fields are re-used in the immediate, and an > additional flag BPF_FETCH is used to mark instructions that should > fetch a pre-modification value from memory. > > So, BPF_XADD is now called BPF_ATOMIC (the old name is kept to avoid > breaking userspace builds), and where we previously had .imm = 0, we > now have .imm = BPF_ADD (which is 0). > > Operands > ======== > > Reg-source eBPF instructions only have two operands, while these > atomic operations have up to four. To avoid needing to encode > additional operands, then: > > - One of the input registers is re-used as an output register > (e.g. atomic_fetch_add both reads from and writes to the source > register). > > - Where necessary (i.e. 
for cmpxchg) , R0 is "hard-coded" as one of > the operands. > > This approach also allows the new eBPF instructions to map directly > to single x86 instructions. > > [1] Previous patchset: > https://lore.kernel.org/bpf/20201123173202.1335708-1-jackmanb@google.com/ > > [2] Visualisation of eBPF opcode space: > https://gist.github.com/bjackman/00fdad2d5dfff601c1918bc29b16e778 > > > Brendan Jackman (13): > bpf: x86: Factor out emission of ModR/M for *(reg + off) > bpf: x86: Factor out emission of REX byte > bpf: x86: Factor out function to emit NEG > bpf: x86: Factor out a lookup table for some ALU opcodes > bpf: Rename BPF_XADD and prepare to encode other atomics in .imm > bpf: Move BPF_STX reserved field check into BPF_STX verifier code > bpf: Add BPF_FETCH field / create atomic_fetch_add instruction > bpf: Add instructions for atomic_[cmp]xchg > bpf: Pull out a macro for interpreting atomic ALU operations > bpf: Add instructions for atomic[64]_[fetch_]sub > bpf: Add bitwise atomic instructions > bpf: Add tests for new BPF atomic operations > bpf: Document new atomic instructions > > Documentation/networking/filter.rst | 57 ++- > arch/arm/net/bpf_jit_32.c | 7 +- > arch/arm64/net/bpf_jit_comp.c | 16 +- > arch/mips/net/ebpf_jit.c | 11 +- > arch/powerpc/net/bpf_jit_comp64.c | 25 +- > arch/riscv/net/bpf_jit_comp32.c | 20 +- > arch/riscv/net/bpf_jit_comp64.c | 16 +- > arch/s390/net/bpf_jit_comp.c | 27 +- > arch/sparc/net/bpf_jit_comp_64.c | 17 +- > arch/x86/net/bpf_jit_comp.c | 252 ++++++++++---- > arch/x86/net/bpf_jit_comp32.c | 6 +- > drivers/net/ethernet/netronome/nfp/bpf/jit.c | 14 +- > drivers/net/ethernet/netronome/nfp/bpf/main.h | 4 +- > .../net/ethernet/netronome/nfp/bpf/verifier.c | 15 +- > include/linux/filter.h | 117 ++++++- > include/uapi/linux/bpf.h | 8 +- > kernel/bpf/core.c | 67 +++- > kernel/bpf/disasm.c | 41 ++- > kernel/bpf/verifier.c | 77 +++- > lib/test_bpf.c | 2 +- > samples/bpf/bpf_insn.h | 4 +- > samples/bpf/sock_example.c | 2 +- > samples/bpf/test_cgrp2_attach.c | 4 +- > tools/include/linux/filter.h | 117 ++++++- > tools/include/uapi/linux/bpf.h | 8 +- > tools/testing/selftests/bpf/Makefile | 12 +- > .../selftests/bpf/prog_tests/atomics_test.c | 329 ++++++++++++++++++ > .../bpf/prog_tests/cgroup_attach_multi.c | 4 +- > .../selftests/bpf/progs/atomics_test.c | 124 +++++++ > .../selftests/bpf/verifier/atomic_and.c | 77 ++++ > .../selftests/bpf/verifier/atomic_cmpxchg.c | 96 +++++ > .../selftests/bpf/verifier/atomic_fetch_add.c | 106 ++++++ > .../selftests/bpf/verifier/atomic_or.c | 77 ++++ > .../selftests/bpf/verifier/atomic_sub.c | 44 +++ > .../selftests/bpf/verifier/atomic_xchg.c | 46 +++ > .../selftests/bpf/verifier/atomic_xor.c | 77 ++++ > tools/testing/selftests/bpf/verifier/ctx.c | 7 +- > .../testing/selftests/bpf/verifier/leak_ptr.c | 4 +- > tools/testing/selftests/bpf/verifier/unpriv.c | 3 +- > tools/testing/selftests/bpf/verifier/xadd.c | 2 +- > 40 files changed, 1754 insertions(+), 188 deletions(-) > create mode 100644 tools/testing/selftests/bpf/prog_tests/atomics_test.c > create mode 100644 tools/testing/selftests/bpf/progs/atomics_test.c > create mode 100644 tools/testing/selftests/bpf/verifier/atomic_and.c > create mode 100644 tools/testing/selftests/bpf/verifier/atomic_cmpxchg.c > create mode 100644 tools/testing/selftests/bpf/verifier/atomic_fetch_add.c > create mode 100644 tools/testing/selftests/bpf/verifier/atomic_or.c > create mode 100644 tools/testing/selftests/bpf/verifier/atomic_sub.c > create mode 100644 
tools/testing/selftests/bpf/verifier/atomic_xchg.c > create mode 100644 tools/testing/selftests/bpf/verifier/atomic_xor.c > > -- > 2.29.2.454.gaff20da3a2-goog >
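To make the Encoding and Operands sections above concrete, here is a minimal hand-rolled sketch of a 64-bit atomic fetch-add under that scheme: the operation lives in the .imm field of a BPF_STX | BPF_ATOMIC instruction, BPF_FETCH is an extra flag, and BPF_ADD conveniently is zero. The fallback #defines and the register choices are illustrative only; the authoritative values are whatever the patched include/uapi/linux/bpf.h ends up defining.

    #include <linux/bpf.h>

    #ifndef BPF_ATOMIC
    #define BPF_ATOMIC 0xc0   /* reuses the old BPF_XADD opcode value */
    #endif
    #ifndef BPF_FETCH
    #define BPF_FETCH  0x01   /* also return the pre-modification value */
    #endif

    /* lock: *(u64 *)(dst_reg + off) += src_reg; src_reg = old value */
    static const struct bpf_insn atomic_fetch_add64 = {
        .code    = BPF_STX | BPF_DW | BPF_ATOMIC,
        .dst_reg = BPF_REG_2,           /* pointer to the memory location */
        .src_reg = BPF_REG_1,           /* addend in, old value out */
        .off     = 0,
        .imm     = BPF_ADD | BPF_FETCH, /* .imm = BPF_ADD (0) alone is plain xadd */
    };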
On Fri, Nov 27, 2020 at 09:53:05PM -0800, Yonghong Song wrote: > > > On 11/27/20 9:57 AM, Brendan Jackman wrote: > > Status of the patches > > ===================== > > > > Thanks for the reviews! Differences from v1->v2 [1]: > > > > * Fixed mistakes in the netronome driver > > > > * Addd sub, add, or, xor operations > > > > * The above led to some refactors to keep things readable. (Maybe I > > should have just waited until I'd implemented these before starting > > the review...) > > > > * Replaced BPF_[CMP]SET | BPF_FETCH with just BPF_[CMP]XCHG, which > > include the BPF_FETCH flag > > > > * Added a bit of documentation. Suggestions welcome for more places > > to dump this info... > > > > The prog_test that's added depends on Clang/LLVM features added by > > Yonghong in https://reviews.llvm.org/D72184 > > > > This only includes a JIT implementation for x86_64 - I don't plan to > > implement JIT support myself for other architectures. > > > > Operations > > ========== > > > > This patchset adds atomic operations to the eBPF instruction set. The > > use-case that motivated this work was a trivial and efficient way to > > generate globally-unique cookies in BPF progs, but I think it's > > obvious that these features are pretty widely applicable. The > > instructions that are added here can be summarised with this list of > > kernel operations: > > > > * atomic[64]_[fetch_]add > > * atomic[64]_[fetch_]sub > > * atomic[64]_[fetch_]and > > * atomic[64]_[fetch_]or > > * atomic[64]_[fetch_]xor > > > * atomic[64]_xchg > > * atomic[64]_cmpxchg > > Thanks. Overall looks good to me but I did not check carefully > on jit part as I am not an expert in x64 assembly... > > This patch also introduced atomic[64]_{sub,and,or,xor}, similar to > xadd. I am not sure whether it is necessary. For one thing, > users can just use atomic[64]_fetch_{sub,and,or,xor} to ignore > return value and they will achieve the same result, right? > From llvm side, there is no ready-to-use gcc builtin matching > atomic[64]_{sub,and,or,xor} which does not have return values. > If we go this route, we will need to invent additional bpf > specific builtins. I think bpf specific builtins are overkill. As you said the users can use atomic_fetch_xor() and ignore return value. I think llvm backend should be smart enough to use BPF_ATOMIC | BPF_XOR insn without BPF_FETCH bit in such case. But if it's too cumbersome to do at the moment we skip this optimization for now.
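For reference, a minimal sketch of the pattern being discussed, using only the standard __sync builtins: when the return value is discarded the memory effect is identical, and that is the case the backend could later lower to BPF_ATOMIC | BPF_XOR (or BPF_OR) without BPF_FETCH.

    #include <linux/types.h>

    static __u64 flags;

    void set_bits_only(__u64 mask)
    {
        /* results discarded - semantically plain atomic OR/XOR, fetch unused */
        (void)__sync_fetch_and_or(&flags, mask);
        (void)__sync_fetch_and_xor(&flags, mask);
    }

    __u64 set_bits_and_observe(__u64 mask)
    {
        /* result used - this genuinely needs the BPF_FETCH form */
        return __sync_fetch_and_or(&flags, mask);
    }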
On 11/28/20 5:40 PM, Alexei Starovoitov wrote: > On Fri, Nov 27, 2020 at 09:53:05PM -0800, Yonghong Song wrote: >> >> >> On 11/27/20 9:57 AM, Brendan Jackman wrote: >>> Status of the patches >>> ===================== >>> >>> Thanks for the reviews! Differences from v1->v2 [1]: >>> >>> * Fixed mistakes in the netronome driver >>> >>> * Addd sub, add, or, xor operations >>> >>> * The above led to some refactors to keep things readable. (Maybe I >>> should have just waited until I'd implemented these before starting >>> the review...) >>> >>> * Replaced BPF_[CMP]SET | BPF_FETCH with just BPF_[CMP]XCHG, which >>> include the BPF_FETCH flag >>> >>> * Added a bit of documentation. Suggestions welcome for more places >>> to dump this info... >>> >>> The prog_test that's added depends on Clang/LLVM features added by >>> Yonghong in https://reviews.llvm.org/D72184 >>> >>> This only includes a JIT implementation for x86_64 - I don't plan to >>> implement JIT support myself for other architectures. >>> >>> Operations >>> ========== >>> >>> This patchset adds atomic operations to the eBPF instruction set. The >>> use-case that motivated this work was a trivial and efficient way to >>> generate globally-unique cookies in BPF progs, but I think it's >>> obvious that these features are pretty widely applicable. The >>> instructions that are added here can be summarised with this list of >>> kernel operations: >>> >>> * atomic[64]_[fetch_]add >>> * atomic[64]_[fetch_]sub >>> * atomic[64]_[fetch_]and >>> * atomic[64]_[fetch_]or >> >> * atomic[64]_[fetch_]xor >> >>> * atomic[64]_xchg >>> * atomic[64]_cmpxchg >> >> Thanks. Overall looks good to me but I did not check carefully >> on jit part as I am not an expert in x64 assembly... >> >> This patch also introduced atomic[64]_{sub,and,or,xor}, similar to >> xadd. I am not sure whether it is necessary. For one thing, >> users can just use atomic[64]_fetch_{sub,and,or,xor} to ignore >> return value and they will achieve the same result, right? >> From llvm side, there is no ready-to-use gcc builtin matching >> atomic[64]_{sub,and,or,xor} which does not have return values. >> If we go this route, we will need to invent additional bpf >> specific builtins. > > I think bpf specific builtins are overkill. > As you said the users can use atomic_fetch_xor() and ignore > return value. I think llvm backend should be smart enough to use > BPF_ATOMIC | BPF_XOR insn without BPF_FETCH bit in such case. > But if it's too cumbersome to do at the moment we skip this > optimization for now. We can initially all have BPF_FETCH bit as at that point we do not have def-use chain. Later on we can add a machine ssa IR phase and check whether the result of, say atomic_fetch_or(), is used or not. If not, we can change the instruction to atomic_or. >
On 11/30/20 9:22 AM, Yonghong Song wrote: > > > On 11/28/20 5:40 PM, Alexei Starovoitov wrote: >> On Fri, Nov 27, 2020 at 09:53:05PM -0800, Yonghong Song wrote: >>> >>> >>> On 11/27/20 9:57 AM, Brendan Jackman wrote: >>>> Status of the patches >>>> ===================== >>>> >>>> Thanks for the reviews! Differences from v1->v2 [1]: >>>> >>>> * Fixed mistakes in the netronome driver >>>> >>>> * Addd sub, add, or, xor operations >>>> >>>> * The above led to some refactors to keep things readable. (Maybe I >>>> should have just waited until I'd implemented these before starting >>>> the review...) >>>> >>>> * Replaced BPF_[CMP]SET | BPF_FETCH with just BPF_[CMP]XCHG, which >>>> include the BPF_FETCH flag >>>> >>>> * Added a bit of documentation. Suggestions welcome for more places >>>> to dump this info... >>>> >>>> The prog_test that's added depends on Clang/LLVM features added by >>>> Yonghong in >>>> https://reviews.llvm.org/D72184 >>>> >>>> This only includes a JIT implementation for x86_64 - I don't plan to >>>> implement JIT support myself for other architectures. >>>> >>>> Operations >>>> ========== >>>> >>>> This patchset adds atomic operations to the eBPF instruction set. The >>>> use-case that motivated this work was a trivial and efficient way to >>>> generate globally-unique cookies in BPF progs, but I think it's >>>> obvious that these features are pretty widely applicable. The >>>> instructions that are added here can be summarised with this list of >>>> kernel operations: >>>> >>>> * atomic[64]_[fetch_]add >>>> * atomic[64]_[fetch_]sub >>>> * atomic[64]_[fetch_]and >>>> * atomic[64]_[fetch_]or >>> >>> * atomic[64]_[fetch_]xor >>> >>>> * atomic[64]_xchg >>>> * atomic[64]_cmpxchg >>> >>> Thanks. Overall looks good to me but I did not check carefully >>> on jit part as I am not an expert in x64 assembly... >>> >>> This patch also introduced atomic[64]_{sub,and,or,xor}, similar to >>> xadd. I am not sure whether it is necessary. For one thing, >>> users can just use atomic[64]_fetch_{sub,and,or,xor} to ignore >>> return value and they will achieve the same result, right? >>> From llvm side, there is no ready-to-use gcc builtin matching >>> atomic[64]_{sub,and,or,xor} which does not have return values. >>> If we go this route, we will need to invent additional bpf >>> specific builtins. >> >> I think bpf specific builtins are overkill. >> As you said the users can use atomic_fetch_xor() and ignore >> return value. I think llvm backend should be smart enough to use >> BPF_ATOMIC | BPF_XOR insn without BPF_FETCH bit in such case. >> But if it's too cumbersome to do at the moment we skip this >> optimization for now. > > We can initially all have BPF_FETCH bit as at that point we do not > have def-use chain. Later on we can add a > machine ssa IR phase and check whether the result of, say > atomic_fetch_or(), is used or not. If not, we can change the > instruction to atomic_or. Just implemented what we discussed above in llvm: https://reviews.llvm.org/D72184 main change: 1. atomic_fetch_sub (and later atomic_sub) is gone. llvm will transparently transforms it to negation followed by atomic_fetch_add or atomic_add (xadd). Kernel can remove atomic_fetch_sub/atomic_sub insns. 2. added new instructions for atomic_{and, or, xor}. 3. for gcc builtin e.g., __sync_fetch_and_or(), if return value is used, atomic_fetch_or will be generated. Otherwise, atomic_or will be generated.
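A small sketch of point 1 above, assuming nothing beyond the standard __sync builtins: a fetch-and-sub in C is expected to lower to negation of the operand followed by fetch-and-add (or a plain xadd when the result is unused), so no dedicated sub instruction is needed in the BPF ISA.

    #include <linux/types.h>

    static __u64 counter;

    __u64 dec_and_fetch_old(__u64 delta)
    {
        /* expected lowering: tmp = -delta; old = atomic_fetch_add(&counter, tmp) */
        return __sync_fetch_and_sub(&counter, delta);
    }

    void dec_only(__u64 delta)
    {
        /* result unused: can become negate + xadd (BPF_ADD without BPF_FETCH) */
        (void)__sync_fetch_and_sub(&counter, delta);
    }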
On Mon, Nov 30, 2020 at 7:51 PM Yonghong Song <yhs@fb.com> wrote: > > > > On 11/30/20 9:22 AM, Yonghong Song wrote: > > > > > > On 11/28/20 5:40 PM, Alexei Starovoitov wrote: > >> On Fri, Nov 27, 2020 at 09:53:05PM -0800, Yonghong Song wrote: > >>> > >>> > >>> On 11/27/20 9:57 AM, Brendan Jackman wrote: > >>>> Status of the patches > >>>> ===================== > >>>> > >>>> Thanks for the reviews! Differences from v1->v2 [1]: > >>>> > >>>> * Fixed mistakes in the netronome driver > >>>> > >>>> * Addd sub, add, or, xor operations > >>>> > >>>> * The above led to some refactors to keep things readable. (Maybe I > >>>> should have just waited until I'd implemented these before starting > >>>> the review...) > >>>> > >>>> * Replaced BPF_[CMP]SET | BPF_FETCH with just BPF_[CMP]XCHG, which > >>>> include the BPF_FETCH flag > >>>> > >>>> * Added a bit of documentation. Suggestions welcome for more places > >>>> to dump this info... > >>>> > >>>> The prog_test that's added depends on Clang/LLVM features added by > >>>> Yonghong in > >>>> https://reviews.llvm.org/D72184 > >>>> > >>>> This only includes a JIT implementation for x86_64 - I don't plan to > >>>> implement JIT support myself for other architectures. > >>>> > >>>> Operations > >>>> ========== > >>>> > >>>> This patchset adds atomic operations to the eBPF instruction set. The > >>>> use-case that motivated this work was a trivial and efficient way to > >>>> generate globally-unique cookies in BPF progs, but I think it's > >>>> obvious that these features are pretty widely applicable. The > >>>> instructions that are added here can be summarised with this list of > >>>> kernel operations: > >>>> > >>>> * atomic[64]_[fetch_]add > >>>> * atomic[64]_[fetch_]sub > >>>> * atomic[64]_[fetch_]and > >>>> * atomic[64]_[fetch_]or > >>> > >>> * atomic[64]_[fetch_]xor > >>> > >>>> * atomic[64]_xchg > >>>> * atomic[64]_cmpxchg > >>> > >>> Thanks. Overall looks good to me but I did not check carefully > >>> on jit part as I am not an expert in x64 assembly... > >>> > >>> This patch also introduced atomic[64]_{sub,and,or,xor}, similar to > >>> xadd. I am not sure whether it is necessary. For one thing, > >>> users can just use atomic[64]_fetch_{sub,and,or,xor} to ignore > >>> return value and they will achieve the same result, right? > >>> From llvm side, there is no ready-to-use gcc builtin matching > >>> atomic[64]_{sub,and,or,xor} which does not have return values. > >>> If we go this route, we will need to invent additional bpf > >>> specific builtins. > >> > >> I think bpf specific builtins are overkill. > >> As you said the users can use atomic_fetch_xor() and ignore > >> return value. I think llvm backend should be smart enough to use > >> BPF_ATOMIC | BPF_XOR insn without BPF_FETCH bit in such case. > >> But if it's too cumbersome to do at the moment we skip this > >> optimization for now. > > > > We can initially all have BPF_FETCH bit as at that point we do not > > have def-use chain. Later on we can add a > > machine ssa IR phase and check whether the result of, say > > atomic_fetch_or(), is used or not. If not, we can change the > > instruction to atomic_or. > > Just implemented what we discussed above in llvm: > https://reviews.llvm.org/D72184 > main change: > 1. atomic_fetch_sub (and later atomic_sub) is gone. llvm will > transparently transforms it to negation followed by > atomic_fetch_add or atomic_add (xadd). Kernel can remove > atomic_fetch_sub/atomic_sub insns. > 2. added new instructions for atomic_{and, or, xor}. > 3. 
for gcc builtin e.g., __sync_fetch_and_or(), if return > value is used, atomic_fetch_or will be generated. Otherwise, > atomic_or will be generated. Great, this means that all existing valid uses of __sync_fetch_and_add() will generate BPF_XADD instructions and will work on old kernels, right? If that's the case, do we still need cpu=v4? The new instructions are *only* going to be generated if the user uses previously unsupported __sync_fetch_xxx() intrinsics. So, in effect, the user consciously opts into using new BPF instructions. cpu=v4 seems like an unnecessary tautology then?
On 12/1/20 6:00 PM, Andrii Nakryiko wrote: > On Mon, Nov 30, 2020 at 7:51 PM Yonghong Song <yhs@fb.com> wrote: >> >> >> >> On 11/30/20 9:22 AM, Yonghong Song wrote: >>> >>> >>> On 11/28/20 5:40 PM, Alexei Starovoitov wrote: >>>> On Fri, Nov 27, 2020 at 09:53:05PM -0800, Yonghong Song wrote: >>>>> >>>>> >>>>> On 11/27/20 9:57 AM, Brendan Jackman wrote: >>>>>> Status of the patches >>>>>> ===================== >>>>>> >>>>>> Thanks for the reviews! Differences from v1->v2 [1]: >>>>>> >>>>>> * Fixed mistakes in the netronome driver >>>>>> >>>>>> * Addd sub, add, or, xor operations >>>>>> >>>>>> * The above led to some refactors to keep things readable. (Maybe I >>>>>> should have just waited until I'd implemented these before starting >>>>>> the review...) >>>>>> >>>>>> * Replaced BPF_[CMP]SET | BPF_FETCH with just BPF_[CMP]XCHG, which >>>>>> include the BPF_FETCH flag >>>>>> >>>>>> * Added a bit of documentation. Suggestions welcome for more places >>>>>> to dump this info... >>>>>> >>>>>> The prog_test that's added depends on Clang/LLVM features added by >>>>>> Yonghong in >>>>>> https://reviews.llvm.org/D72184 >>>>>> >>>>>> This only includes a JIT implementation for x86_64 - I don't plan to >>>>>> implement JIT support myself for other architectures. >>>>>> >>>>>> Operations >>>>>> ========== >>>>>> >>>>>> This patchset adds atomic operations to the eBPF instruction set. The >>>>>> use-case that motivated this work was a trivial and efficient way to >>>>>> generate globally-unique cookies in BPF progs, but I think it's >>>>>> obvious that these features are pretty widely applicable. The >>>>>> instructions that are added here can be summarised with this list of >>>>>> kernel operations: >>>>>> >>>>>> * atomic[64]_[fetch_]add >>>>>> * atomic[64]_[fetch_]sub >>>>>> * atomic[64]_[fetch_]and >>>>>> * atomic[64]_[fetch_]or >>>>> >>>>> * atomic[64]_[fetch_]xor >>>>> >>>>>> * atomic[64]_xchg >>>>>> * atomic[64]_cmpxchg >>>>> >>>>> Thanks. Overall looks good to me but I did not check carefully >>>>> on jit part as I am not an expert in x64 assembly... >>>>> >>>>> This patch also introduced atomic[64]_{sub,and,or,xor}, similar to >>>>> xadd. I am not sure whether it is necessary. For one thing, >>>>> users can just use atomic[64]_fetch_{sub,and,or,xor} to ignore >>>>> return value and they will achieve the same result, right? >>>>> From llvm side, there is no ready-to-use gcc builtin matching >>>>> atomic[64]_{sub,and,or,xor} which does not have return values. >>>>> If we go this route, we will need to invent additional bpf >>>>> specific builtins. >>>> >>>> I think bpf specific builtins are overkill. >>>> As you said the users can use atomic_fetch_xor() and ignore >>>> return value. I think llvm backend should be smart enough to use >>>> BPF_ATOMIC | BPF_XOR insn without BPF_FETCH bit in such case. >>>> But if it's too cumbersome to do at the moment we skip this >>>> optimization for now. >>> >>> We can initially all have BPF_FETCH bit as at that point we do not >>> have def-use chain. Later on we can add a >>> machine ssa IR phase and check whether the result of, say >>> atomic_fetch_or(), is used or not. If not, we can change the >>> instruction to atomic_or. >> >> Just implemented what we discussed above in llvm: >> https://reviews.llvm.org/D72184 >> main change: >> 1. atomic_fetch_sub (and later atomic_sub) is gone. llvm will >> transparently transforms it to negation followed by >> atomic_fetch_add or atomic_add (xadd). Kernel can remove >> atomic_fetch_sub/atomic_sub insns. >> 2. 
added new instructions for atomic_{and, or, xor}. >> 3. for gcc builtin e.g., __sync_fetch_and_or(), if return >> value is used, atomic_fetch_or will be generated. Otherwise, >> atomic_or will be generated. > > Great, this means that all existing valid uses of > __sync_fetch_and_add() will generate BPF_XADD instructions and will > work on old kernels, right? That is correct. > > If that's the case, do we still need cpu=v4? The new instructions are > *only* going to be generated if the user uses previously unsupported > __sync_fetch_xxx() intrinsics. So, in effect, the user consciously > opts into using new BPF instructions. cpu=v4 seems like an unnecessary > tautology then? This is a very good question. Essentially this boils down to when users can use the new functionality, including the meaningful return value of __sync_fetch_and_add(). (1) The user can write a small bpf program to test the feature. If the user gets a failed compilation (fatal error), it isn't supported; otherwise, it is. (2) The compiler provides some way to tell the user it is safe to use, e.g., -mcpu=v4, or some clang macro suggested by Brendan earlier. I guess, since the kernel already does a lot of feature discovery, option (1) is probably fine.
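A sketch of option (1), with a hypothetical file name: compile a trivial program that uses the new intrinsics and treat a fatal compiler error as "not supported".

    /* probe_atomics.c - hypothetical compile-time probe, e.g.:
     *   clang -O2 -target bpf -c probe_atomics.c -o /dev/null
     * A compiler without the new atomics support is expected to fail
     * with a fatal error here rather than silently emit something else. */
    #include <linux/types.h>

    __u64 probe_target;

    int probe(void)
    {
        __u64 old = __sync_val_compare_and_swap(&probe_target, 0, 1);

        old += __sync_fetch_and_or(&probe_target, 2);
        return old != 0;
    }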
Yonghong Song wrote: > > [...] > > Great, this means that all existing valid uses of > > __sync_fetch_and_add() will generate BPF_XADD instructions and will > > work on old kernels, right? > > That is correct. > > > > > If that's the case, do we still need cpu=v4? The new instructions are > > *only* going to be generated if the user uses previously unsupported > > __sync_fetch_xxx() intrinsics. So, in effect, the user consciously > > opts into using new BPF instructions. cpu=v4 seems like an unnecessary > > tautology then? > > This is a very good question. Essentially this boils to when users can > use the new functionality including meaningful return value of > __sync_fetch_and_add(). > (1). user can write a small bpf program to test the feature. If user > gets a failed compilation (fatal error), it won't be supported. > Otherwise, it is supported. > (2). compiler provides some way to tell user it is safe to use, e.g., > -mcpu=v4, or some clang macro suggested by Brendan earlier. > > I guess since kernel already did a lot of feature discovery. Option (1) > is probably fine. For option (2) we can use BTF with kernel version check. If kernel is greater than kernel this lands in we use the the new instructions if not we use a slower locked version. That should work for all cases unless someone backports patches into an older case. At least thats what I'll probably end up wrapping in a helper function.
On Tue, Dec 1, 2020 at 9:53 PM John Fastabend <john.fastabend@gmail.com> wrote: > > Yonghong Song wrote: > > > > > > [...] > > > > Great, this means that all existing valid uses of > > > __sync_fetch_and_add() will generate BPF_XADD instructions and will > > > work on old kernels, right? > > > > That is correct. > > > > > > > > If that's the case, do we still need cpu=v4? The new instructions are > > > *only* going to be generated if the user uses previously unsupported > > > __sync_fetch_xxx() intrinsics. So, in effect, the user consciously > > > opts into using new BPF instructions. cpu=v4 seems like an unnecessary > > > tautology then? > > > > This is a very good question. Essentially this boils to when users can > > use the new functionality including meaningful return value of > > __sync_fetch_and_add(). > > (1). user can write a small bpf program to test the feature. If user > > gets a failed compilation (fatal error), it won't be supported. > > Otherwise, it is supported. > > (2). compiler provides some way to tell user it is safe to use, e.g., > > -mcpu=v4, or some clang macro suggested by Brendan earlier. > > > > I guess since kernel already did a lot of feature discovery. Option (1) > > is probably fine. > > For option (2) we can use BTF with kernel version check. If kernel is > greater than kernel this lands in we use the the new instructions if > not we use a slower locked version. That should work for all cases > unless someone backports patches into an older case. Two different things: Clang support detection and kernel support detection. You are talking about kernel detection, I and Yonghong were talking about Clang detection and explicit cpu=v4 opt-in. For kernel detection, if there is an enum value or type that gets added along the feature, then with CO-RE built-ins it's easy to detect and kernel dead code elimination will make sure that unsupported instructions won't trip up the BPF verifier. Still need Clang support to compile the program in the first place, though. If there is no such BTF-based way to check, it is still possible to try to load a trivial BPF program with __sync_fech_and_xxx() to do feature detection and then use .rodata to turn off code paths relying on a new instruction set. > > At least thats what I'll probably end up wrapping in a helper function.
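Purely as an illustration of the .rodata idea (all names here are hypothetical, and it assumes the loader has already probed the kernel - e.g. by test-loading a one-instruction program that uses the new opcodes - and that an old kernel only rejects the new encoding on reachable paths):

    #include <linux/types.h>
    #include <stdbool.h>
    #include <bpf/bpf_helpers.h>

    /* set by userspace (e.g. via the skeleton's .rodata) before load */
    const volatile bool kernel_has_atomic_insns;

    __u64 counter;

    SEC("raw_tp/sched_switch")
    int count_switches(void *ctx)
    {
        if (kernel_has_atomic_insns) {
            /* new path: fetch form, needs BPF_ATOMIC | BPF_FETCH support */
            __u64 old = __sync_fetch_and_add(&counter, 1);

            if (old == 0)
                bpf_printk("first event");
        } else {
            /* old path: result unused, plain BPF_XADD, works on old kernels */
            __sync_fetch_and_add(&counter, 1);
        }
        return 0;
    }

    char LICENSE[] SEC("license") = "GPL";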
Andrii Nakryiko wrote: > On Tue, Dec 1, 2020 at 9:53 PM John Fastabend <john.fastabend@gmail.com> wrote: > > > > Yonghong Song wrote: > > > > > > > > > > [...] > > > > > > Great, this means that all existing valid uses of > > > > __sync_fetch_and_add() will generate BPF_XADD instructions and will > > > > work on old kernels, right? > > > > > > That is correct. > > > > > > > > > > > If that's the case, do we still need cpu=v4? The new instructions are > > > > *only* going to be generated if the user uses previously unsupported > > > > __sync_fetch_xxx() intrinsics. So, in effect, the user consciously > > > > opts into using new BPF instructions. cpu=v4 seems like an unnecessary > > > > tautology then? > > > > > > This is a very good question. Essentially this boils to when users can > > > use the new functionality including meaningful return value of > > > __sync_fetch_and_add(). > > > (1). user can write a small bpf program to test the feature. If user > > > gets a failed compilation (fatal error), it won't be supported. > > > Otherwise, it is supported. > > > (2). compiler provides some way to tell user it is safe to use, e.g., > > > -mcpu=v4, or some clang macro suggested by Brendan earlier. > > > > > > I guess since kernel already did a lot of feature discovery. Option (1) > > > is probably fine. > > > > For option (2) we can use BTF with kernel version check. If kernel is > > greater than kernel this lands in we use the the new instructions if > > not we use a slower locked version. That should work for all cases > > unless someone backports patches into an older case. > > Two different things: Clang support detection and kernel support > detection. You are talking about kernel detection, I and Yonghong were > talking about Clang detection and explicit cpu=v4 opt-in. > Ah right, catching up on email and reading the thread backwords I lost the context thanks! So, I live in a dev world where I control the build infrastructure so always know clang/llvm versions and features. What I don't know as well is where the program I just built might be run. So its a bit of an odd question from my perspective to ask if my clang supports feature X. If it doesn't support feature X and I want it we upgrade clang so that it does support it. I don't think we would ever write a program to test the assertion. Anyways thanks. > For kernel detection, if there is an enum value or type that gets > added along the feature, then with CO-RE built-ins it's easy to detect > and kernel dead code elimination will make sure that unsupported > instructions won't trip up the BPF verifier. Still need Clang support > to compile the program in the first place, though. +1 > > If there is no such BTF-based way to check, it is still possible to > try to load a trivial BPF program with __sync_fech_and_xxx() to do > feature detection and then use .rodata to turn off code paths relying > on a new instruction set. Right. > > > > > At least thats what I'll probably end up wrapping in a helper function.
On 12/1/20 9:05 PM, Yonghong Song wrote: > > > On 12/1/20 6:00 PM, Andrii Nakryiko wrote: >> On Mon, Nov 30, 2020 at 7:51 PM Yonghong Song <yhs@fb.com> wrote: >>> >>> >>> >>> On 11/30/20 9:22 AM, Yonghong Song wrote: >>>> >>>> >>>> On 11/28/20 5:40 PM, Alexei Starovoitov wrote: >>>>> On Fri, Nov 27, 2020 at 09:53:05PM -0800, Yonghong Song wrote: >>>>>> >>>>>> >>>>>> On 11/27/20 9:57 AM, Brendan Jackman wrote: >>>>>>> Status of the patches >>>>>>> ===================== >>>>>>> >>>>>>> Thanks for the reviews! Differences from v1->v2 [1]: >>>>>>> >>>>>>> * Fixed mistakes in the netronome driver >>>>>>> >>>>>>> * Addd sub, add, or, xor operations >>>>>>> >>>>>>> * The above led to some refactors to keep things readable. (Maybe I >>>>>>> should have just waited until I'd implemented these before >>>>>>> starting >>>>>>> the review...) >>>>>>> >>>>>>> * Replaced BPF_[CMP]SET | BPF_FETCH with just BPF_[CMP]XCHG, which >>>>>>> include the BPF_FETCH flag >>>>>>> >>>>>>> * Added a bit of documentation. Suggestions welcome for more places >>>>>>> to dump this info... >>>>>>> >>>>>>> The prog_test that's added depends on Clang/LLVM features added by >>>>>>> Yonghong in >>>>>>> https://reviews.llvm.org/D72184 >>>>>>> >>>>>>> This only includes a JIT implementation for x86_64 - I don't plan to >>>>>>> implement JIT support myself for other architectures. >>>>>>> >>>>>>> Operations >>>>>>> ========== >>>>>>> >>>>>>> This patchset adds atomic operations to the eBPF instruction set. >>>>>>> The >>>>>>> use-case that motivated this work was a trivial and efficient way to >>>>>>> generate globally-unique cookies in BPF progs, but I think it's >>>>>>> obvious that these features are pretty widely applicable. The >>>>>>> instructions that are added here can be summarised with this list of >>>>>>> kernel operations: >>>>>>> >>>>>>> * atomic[64]_[fetch_]add >>>>>>> * atomic[64]_[fetch_]sub >>>>>>> * atomic[64]_[fetch_]and >>>>>>> * atomic[64]_[fetch_]or >>>>>> >>>>>> * atomic[64]_[fetch_]xor >>>>>> >>>>>>> * atomic[64]_xchg >>>>>>> * atomic[64]_cmpxchg >>>>>> >>>>>> Thanks. Overall looks good to me but I did not check carefully >>>>>> on jit part as I am not an expert in x64 assembly... >>>>>> >>>>>> This patch also introduced atomic[64]_{sub,and,or,xor}, similar to >>>>>> xadd. I am not sure whether it is necessary. For one thing, >>>>>> users can just use atomic[64]_fetch_{sub,and,or,xor} to ignore >>>>>> return value and they will achieve the same result, right? >>>>>> From llvm side, there is no ready-to-use gcc builtin matching >>>>>> atomic[64]_{sub,and,or,xor} which does not have return values. >>>>>> If we go this route, we will need to invent additional bpf >>>>>> specific builtins. >>>>> >>>>> I think bpf specific builtins are overkill. >>>>> As you said the users can use atomic_fetch_xor() and ignore >>>>> return value. I think llvm backend should be smart enough to use >>>>> BPF_ATOMIC | BPF_XOR insn without BPF_FETCH bit in such case. >>>>> But if it's too cumbersome to do at the moment we skip this >>>>> optimization for now. >>>> >>>> We can initially all have BPF_FETCH bit as at that point we do not >>>> have def-use chain. Later on we can add a >>>> machine ssa IR phase and check whether the result of, say >>>> atomic_fetch_or(), is used or not. If not, we can change the >>>> instruction to atomic_or. >>> >>> Just implemented what we discussed above in llvm: >>> >>> https://reviews.llvm.org/D72184 >>> main change: >>> 1. atomic_fetch_sub (and later atomic_sub) is gone. 
llvm will >>> transparently transforms it to negation followed by >>> atomic_fetch_add or atomic_add (xadd). Kernel can remove >>> atomic_fetch_sub/atomic_sub insns. >>> 2. added new instructions for atomic_{and, or, xor}. >>> 3. for gcc builtin e.g., __sync_fetch_and_or(), if return >>> value is used, atomic_fetch_or will be generated. Otherwise, >>> atomic_or will be generated. >> >> Great, this means that all existing valid uses of >> __sync_fetch_and_add() will generate BPF_XADD instructions and will >> work on old kernels, right? > > That is correct. > >> >> If that's the case, do we still need cpu=v4? The new instructions are >> *only* going to be generated if the user uses previously unsupported >> __sync_fetch_xxx() intrinsics. So, in effect, the user consciously >> opts into using new BPF instructions. cpu=v4 seems like an unnecessary >> tautology then? > > This is a very good question. Essentially this boils to when users can > use the new functionality including meaningful return value of > __sync_fetch_and_add(). > (1). user can write a small bpf program to test the feature. If user > gets a failed compilation (fatal error), it won't be supported. > Otherwise, it is supported. > (2). compiler provides some way to tell user it is safe to use, e.g., > -mcpu=v4, or some clang macro suggested by Brendan earlier. > > I guess since kernel already did a lot of feature discovery. Option (1) > is probably fine. Just pushed a new llvm version (https://reviews.llvm.org/D72184) which removed -mcpu=v4. The new instructions will be generated by default for the 64bit type. For the 32bit type, alu32 is required. Currently -mcpu=v3 already has alu32 as the default, and a kernel supporting the atomic insns should have good alu32 support too. So I decided to skip the non-alu32 32bit mode. But if people feel strongly about supporting non-alu32 32bit mode atomic instructions, I can add them in llvm... The instruction encodings are the same for alu32/non-alu32 32bit mode, so the kernel will not be impacted.
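For illustration, a sketch of the 64-bit vs 32-bit distinction just described (file and symbol names are made up; the compile command is only the usual invocation with -mcpu=v3, which enables alu32 by default):

    /* atomics_32_64.c - e.g.: clang -O2 -target bpf -mcpu=v3 -c atomics_32_64.c */
    #include <linux/types.h>

    __u64 stats64;
    __u32 stats32;

    __u64 bump_both(void)
    {
        /* 64-bit form: the new atomic insns are generated by default */
        __u64 old64 = __sync_fetch_and_add(&stats64, 1);

        /* 32-bit form: needs alu32, the default from -mcpu=v3 onwards */
        __u32 old32 = __sync_fetch_and_add(&stats32, 1);

        return old64 + old32;
    }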