[v2,bpf-next,00/13] Atomics for eBPF

Message ID 20201127175738.1085417-1-jackmanb@google.com (mailing list archive)

Message

Brendan Jackman Nov. 27, 2020, 5:57 p.m. UTC
Status of the patches
=====================

Thanks for the reviews! Differences from v1->v2 [1]:

* Fixed mistakes in the netronome driver

* Added sub, and, or, xor operations

* The above led to some refactors to keep things readable. (Maybe I
  should have just waited until I'd implemented these before starting
  the review...)

* Replaced BPF_[CMP]SET | BPF_FETCH with just BPF_[CMP]XCHG, which
  include the BPF_FETCH flag

* Added a bit of documentation. Suggestions welcome for more places
  to dump this info...

The prog_test that's added depends on Clang/LLVM features added by
Yonghong in https://reviews.llvm.org/D72184

This only includes a JIT implementation for x86_64 - I don't plan to
implement JIT support myself for other architectures.

Operations
==========

This patchset adds atomic operations to the eBPF instruction set. The
use-case that motivated this work was a trivial and efficient way to
generate globally-unique cookies in BPF progs, but I think it's
obvious that these features are pretty widely applicable.  The
instructions that are added here can be summarised with this list of
kernel operations:

* atomic[64]_[fetch_]add
* atomic[64]_[fetch_]sub
* atomic[64]_[fetch_]and
* atomic[64]_[fetch_]or
* atomic[64]_xchg
* atomic[64]_cmpxchg
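
As a rough illustration (not from the patches themselves), this is the
kind of BPF program C that should compile down to these operations with
the Clang support referenced above; the names here are invented:

#include <linux/types.h>

static __u64 next_cookie;

/* atomic64_fetch_add: the builtin returns the pre-add value, so every
 * caller gets a distinct cookie. */
static __u64 get_unique_cookie(void)
{
	return __sync_fetch_and_add(&next_cookie, 1);
}

/* atomic64_cmpxchg: claim ownership only if *owner was still 0. */
static int try_claim(__u64 *owner, __u64 me)
{
	return __sync_val_compare_and_swap(owner, 0, me) == 0;
}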

The following are left out of scope for this effort:

* 16 and 8 bit operations
* Explicit memory barriers

Encoding
========

I originally planned to add new values for bpf_insn.opcode. This was
rather unpleasant: the opcode space has holes in it, but no entire free
instruction classes[2]. Yonghong Song had a better idea: use the
immediate field of the existing STX XADD instruction to encode the
operation. This works nicely, without breaking existing programs,
because the immediate field is currently reserved-must-be-zero, and
extra-nicely because BPF_ADD happens to be zero.

Note that this of course makes immediate-source atomic operations
impossible. It's hard to imagine a measurable speedup from such
instructions, and if it existed it would certainly not benefit x86,
which has no support for them.

The BPF_OP opcode fields are re-used in the immediate, and an
additional flag BPF_FETCH is used to mark instructions that should
fetch a pre-modification value from memory.

So, BPF_XADD is now called BPF_ATOMIC (the old name is kept to avoid
breaking userspace builds), and where we previously had .imm = 0, we
now have .imm = BPF_ADD (which is 0).
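
To make that concrete, here is a sketch (register choices arbitrary,
assuming headers that carry this series' BPF_ATOMIC and BPF_FETCH
defines) of two of the resulting instructions:

#include <linux/bpf.h>

/* lock *(u64 *)(r2 + 0) += r3 -- bit-for-bit the old BPF_XADD,
 * since BPF_ADD == 0 */
static const struct bpf_insn atomic_add = {
	.code    = BPF_STX | BPF_DW | BPF_ATOMIC,
	.dst_reg = BPF_REG_2,	/* memory address */
	.src_reg = BPF_REG_3,	/* addend */
	.off     = 0,
	.imm     = BPF_ADD,
};

/* r3 = atomic_fetch_add((u64 *)(r2 + 0), r3): BPF_FETCH asks for the
 * old value to be written back to the source register */
static const struct bpf_insn fetch_add = {
	.code    = BPF_STX | BPF_DW | BPF_ATOMIC,
	.dst_reg = BPF_REG_2,
	.src_reg = BPF_REG_3,
	.imm     = BPF_ADD | BPF_FETCH,
};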

Operands
========

Reg-source eBPF instructions only have two operands, while these
atomic operations have up to four. To avoid needing to encode
additional operands:

- One of the input registers is re-used as an output register
  (e.g. atomic_fetch_add both reads from and writes to the source
  register).

- Where necessary (i.e. for cmpxchg), R0 is "hard-coded" as one of
  the operands.

This approach also allows the new eBPF instructions to map directly
to single x86 instructions.
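
To sketch the cmpxchg case in plain, non-atomic C (a model for
illustration, not the kernel implementation), with R0 mirroring x86's
use of RAX for CMPXCHG:

/* dst is the memory operand, src the value to store; r0 is both the
 * expected value (input) and the old value (output). */
static unsigned long long
cmpxchg_model(unsigned long long *dst, unsigned long long r0,
	      unsigned long long src)
{
	unsigned long long old = *dst;

	if (old == r0)
		*dst = src;
	return old;		/* delivered back in R0 */
}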

[1] Previous patchset:
    https://lore.kernel.org/bpf/20201123173202.1335708-1-jackmanb@google.com/

[2] Visualisation of eBPF opcode space:
    https://gist.github.com/bjackman/00fdad2d5dfff601c1918bc29b16e778


Brendan Jackman (13):
  bpf: x86: Factor out emission of ModR/M for *(reg + off)
  bpf: x86: Factor out emission of REX byte
  bpf: x86: Factor out function to emit NEG
  bpf: x86: Factor out a lookup table for some ALU opcodes
  bpf: Rename BPF_XADD and prepare to encode other atomics in .imm
  bpf: Move BPF_STX reserved field check into BPF_STX verifier code
  bpf: Add BPF_FETCH field / create atomic_fetch_add instruction
  bpf: Add instructions for atomic_[cmp]xchg
  bpf: Pull out a macro for interpreting atomic ALU operations
  bpf: Add instructions for atomic[64]_[fetch_]sub
  bpf: Add bitwise atomic instructions
  bpf: Add tests for new BPF atomic operations
  bpf: Document new atomic instructions

 Documentation/networking/filter.rst           |  57 ++-
 arch/arm/net/bpf_jit_32.c                     |   7 +-
 arch/arm64/net/bpf_jit_comp.c                 |  16 +-
 arch/mips/net/ebpf_jit.c                      |  11 +-
 arch/powerpc/net/bpf_jit_comp64.c             |  25 +-
 arch/riscv/net/bpf_jit_comp32.c               |  20 +-
 arch/riscv/net/bpf_jit_comp64.c               |  16 +-
 arch/s390/net/bpf_jit_comp.c                  |  27 +-
 arch/sparc/net/bpf_jit_comp_64.c              |  17 +-
 arch/x86/net/bpf_jit_comp.c                   | 252 ++++++++++----
 arch/x86/net/bpf_jit_comp32.c                 |   6 +-
 drivers/net/ethernet/netronome/nfp/bpf/jit.c  |  14 +-
 drivers/net/ethernet/netronome/nfp/bpf/main.h |   4 +-
 .../net/ethernet/netronome/nfp/bpf/verifier.c |  15 +-
 include/linux/filter.h                        | 117 ++++++-
 include/uapi/linux/bpf.h                      |   8 +-
 kernel/bpf/core.c                             |  67 +++-
 kernel/bpf/disasm.c                           |  41 ++-
 kernel/bpf/verifier.c                         |  77 +++-
 lib/test_bpf.c                                |   2 +-
 samples/bpf/bpf_insn.h                        |   4 +-
 samples/bpf/sock_example.c                    |   2 +-
 samples/bpf/test_cgrp2_attach.c               |   4 +-
 tools/include/linux/filter.h                  | 117 ++++++-
 tools/include/uapi/linux/bpf.h                |   8 +-
 tools/testing/selftests/bpf/Makefile          |  12 +-
 .../selftests/bpf/prog_tests/atomics_test.c   | 329 ++++++++++++++++++
 .../bpf/prog_tests/cgroup_attach_multi.c      |   4 +-
 .../selftests/bpf/progs/atomics_test.c        | 124 +++++++
 .../selftests/bpf/verifier/atomic_and.c       |  77 ++++
 .../selftests/bpf/verifier/atomic_cmpxchg.c   |  96 +++++
 .../selftests/bpf/verifier/atomic_fetch_add.c | 106 ++++++
 .../selftests/bpf/verifier/atomic_or.c        |  77 ++++
 .../selftests/bpf/verifier/atomic_sub.c       |  44 +++
 .../selftests/bpf/verifier/atomic_xchg.c      |  46 +++
 .../selftests/bpf/verifier/atomic_xor.c       |  77 ++++
 tools/testing/selftests/bpf/verifier/ctx.c    |   7 +-
 .../testing/selftests/bpf/verifier/leak_ptr.c |   4 +-
 tools/testing/selftests/bpf/verifier/unpriv.c |   3 +-
 tools/testing/selftests/bpf/verifier/xadd.c   |   2 +-
 40 files changed, 1754 insertions(+), 188 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/atomics_test.c
 create mode 100644 tools/testing/selftests/bpf/progs/atomics_test.c
 create mode 100644 tools/testing/selftests/bpf/verifier/atomic_and.c
 create mode 100644 tools/testing/selftests/bpf/verifier/atomic_cmpxchg.c
 create mode 100644 tools/testing/selftests/bpf/verifier/atomic_fetch_add.c
 create mode 100644 tools/testing/selftests/bpf/verifier/atomic_or.c
 create mode 100644 tools/testing/selftests/bpf/verifier/atomic_sub.c
 create mode 100644 tools/testing/selftests/bpf/verifier/atomic_xchg.c
 create mode 100644 tools/testing/selftests/bpf/verifier/atomic_xor.c

--
2.29.2.454.gaff20da3a2-goog

Comments

Yonghong Song Nov. 28, 2020, 5:53 a.m. UTC | #1
On 11/27/20 9:57 AM, Brendan Jackman wrote:
> Status of the patches
> =====================
> 
> Thanks for the reviews! Differences from v1->v2 [1]:
> 
> * Fixed mistakes in the netronome driver
> 
> * Addd sub, add, or, xor operations
> 
> * The above led to some refactors to keep things readable. (Maybe I
>    should have just waited until I'd implemented these before starting
>    the review...)
> 
> * Replaced BPF_[CMP]SET | BPF_FETCH with just BPF_[CMP]XCHG, which
>    include the BPF_FETCH flag
> 
> * Added a bit of documentation. Suggestions welcome for more places
>    to dump this info...
> 
> The prog_test that's added depends on Clang/LLVM features added by
> Yonghong in https://reviews.llvm.org/D72184
> 
> This only includes a JIT implementation for x86_64 - I don't plan to
> implement JIT support myself for other architectures.
> 
> Operations
> ==========
> 
> This patchset adds atomic operations to the eBPF instruction set. The
> use-case that motivated this work was a trivial and efficient way to
> generate globally-unique cookies in BPF progs, but I think it's
> obvious that these features are pretty widely applicable.  The
> instructions that are added here can be summarised with this list of
> kernel operations:
> 
> * atomic[64]_[fetch_]add
> * atomic[64]_[fetch_]sub
> * atomic[64]_[fetch_]and
> * atomic[64]_[fetch_]or

* atomic[64]_[fetch_]xor

> * atomic[64]_xchg
> * atomic[64]_cmpxchg

Thanks. Overall looks good to me, but I did not check carefully
on the JIT part as I am not an expert in x64 assembly...

This patchset also introduced atomic[64]_{sub,and,or,xor}, similar to
xadd. I am not sure whether that is necessary. For one thing,
users can just use atomic[64]_fetch_{sub,and,or,xor}, ignore the
return value, and achieve the same result, right?
From the llvm side, there is no ready-to-use gcc builtin matching
atomic[64]_{sub,and,or,xor}, which do not have return values.
If we go this route, we will need to invent additional bpf-specific
builtins.
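
A minimal sketch of that usage pattern (illustration only, not from the
series):

static void set_bits(unsigned long long *flags, unsigned long long bits)
{
	/* old value discarded; the memory effect is the same as a
	 * plain atomic_or */
	(void)__sync_fetch_and_or(flags, bits);
}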

Alexei Starovoitov Nov. 29, 2020, 1:40 a.m. UTC | #2
On Fri, Nov 27, 2020 at 09:53:05PM -0800, Yonghong Song wrote:
> 
> 
> On 11/27/20 9:57 AM, Brendan Jackman wrote:
> > Status of the patches
> > =====================
> > 
> > Thanks for the reviews! Differences from v1->v2 [1]:
> > 
> > * Fixed mistakes in the netronome driver
> > 
> > * Addd sub, add, or, xor operations
> > 
> > * The above led to some refactors to keep things readable. (Maybe I
> >    should have just waited until I'd implemented these before starting
> >    the review...)
> > 
> > * Replaced BPF_[CMP]SET | BPF_FETCH with just BPF_[CMP]XCHG, which
> >    include the BPF_FETCH flag
> > 
> > * Added a bit of documentation. Suggestions welcome for more places
> >    to dump this info...
> > 
> > The prog_test that's added depends on Clang/LLVM features added by
> > Yonghong in https://reviews.llvm.org/D72184
> > 
> > This only includes a JIT implementation for x86_64 - I don't plan to
> > implement JIT support myself for other architectures.
> > 
> > Operations
> > ==========
> > 
> > This patchset adds atomic operations to the eBPF instruction set. The
> > use-case that motivated this work was a trivial and efficient way to
> > generate globally-unique cookies in BPF progs, but I think it's
> > obvious that these features are pretty widely applicable.  The
> > instructions that are added here can be summarised with this list of
> > kernel operations:
> > 
> > * atomic[64]_[fetch_]add
> > * atomic[64]_[fetch_]sub
> > * atomic[64]_[fetch_]and
> > * atomic[64]_[fetch_]or
> 
> * atomic[64]_[fetch_]xor
> 
> > * atomic[64]_xchg
> > * atomic[64]_cmpxchg
> 
> Thanks. Overall looks good to me but I did not check carefully
> on jit part as I am not an expert in x64 assembly...
> 
> This patch also introduced atomic[64]_{sub,and,or,xor}, similar to
> xadd. I am not sure whether it is necessary. For one thing,
> users can just use atomic[64]_fetch_{sub,and,or,xor} to ignore
> return value and they will achieve the same result, right?
> From llvm side, there is no ready-to-use gcc builtin matching
> atomic[64]_{sub,and,or,xor} which does not have return values.
> If we go this route, we will need to invent additional bpf
> specific builtins.

I think bpf-specific builtins are overkill.
As you said, the users can use atomic_fetch_xor() and ignore the
return value. I think the llvm backend should be smart enough to use
the BPF_ATOMIC | BPF_XOR insn without the BPF_FETCH bit in such a case.
But if it's too cumbersome to do at the moment, we can skip this
optimization for now.
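
A sketch of the two source forms in question (illustrative only):

static unsigned long long v;

static unsigned long long toggle(unsigned long long mask)
{
	/* result unused: could be BPF_ATOMIC | BPF_XOR, no BPF_FETCH */
	__sync_fetch_and_xor(&v, mask);

	/* result used: needs BPF_ATOMIC | BPF_XOR | BPF_FETCH */
	return __sync_fetch_and_xor(&v, mask);
}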
Yonghong Song Nov. 30, 2020, 5:22 p.m. UTC | #3
On 11/28/20 5:40 PM, Alexei Starovoitov wrote:
> On Fri, Nov 27, 2020 at 09:53:05PM -0800, Yonghong Song wrote:
>>
>>
>> On 11/27/20 9:57 AM, Brendan Jackman wrote:
>>> Status of the patches
>>> =====================
>>>
>>> Thanks for the reviews! Differences from v1->v2 [1]:
>>>
>>> * Fixed mistakes in the netronome driver
>>>
>>> * Addd sub, add, or, xor operations
>>>
>>> * The above led to some refactors to keep things readable. (Maybe I
>>>     should have just waited until I'd implemented these before starting
>>>     the review...)
>>>
>>> * Replaced BPF_[CMP]SET | BPF_FETCH with just BPF_[CMP]XCHG, which
>>>     include the BPF_FETCH flag
>>>
>>> * Added a bit of documentation. Suggestions welcome for more places
>>>     to dump this info...
>>>
>>> The prog_test that's added depends on Clang/LLVM features added by
>>> Yonghong in https://reviews.llvm.org/D72184
>>>
>>> This only includes a JIT implementation for x86_64 - I don't plan to
>>> implement JIT support myself for other architectures.
>>>
>>> Operations
>>> ==========
>>>
>>> This patchset adds atomic operations to the eBPF instruction set. The
>>> use-case that motivated this work was a trivial and efficient way to
>>> generate globally-unique cookies in BPF progs, but I think it's
>>> obvious that these features are pretty widely applicable.  The
>>> instructions that are added here can be summarised with this list of
>>> kernel operations:
>>>
>>> * atomic[64]_[fetch_]add
>>> * atomic[64]_[fetch_]sub
>>> * atomic[64]_[fetch_]and
>>> * atomic[64]_[fetch_]or
>>
>> * atomic[64]_[fetch_]xor
>>
>>> * atomic[64]_xchg
>>> * atomic[64]_cmpxchg
>>
>> Thanks. Overall looks good to me but I did not check carefully
>> on jit part as I am not an expert in x64 assembly...
>>
>> This patch also introduced atomic[64]_{sub,and,or,xor}, similar to
>> xadd. I am not sure whether it is necessary. For one thing,
>> users can just use atomic[64]_fetch_{sub,and,or,xor} to ignore
>> return value and they will achieve the same result, right?
>>  From llvm side, there is no ready-to-use gcc builtin matching
>> atomic[64]_{sub,and,or,xor} which does not have return values.
>> If we go this route, we will need to invent additional bpf
>> specific builtins.
> 
> I think bpf specific builtins are overkill.
> As you said the users can use atomic_fetch_xor() and ignore
> return value. I think llvm backend should be smart enough to use
> BPF_ATOMIC | BPF_XOR insn without BPF_FETCH bit in such case.
> But if it's too cumbersome to do at the moment we skip this
> optimization for now.

We can initially set the BPF_FETCH bit on all of them, as at that point
we do not have a def-use chain. Later on we can add a machine SSA IR
phase and check whether the result of, say, atomic_fetch_or() is used
or not. If not, we can change the instruction to atomic_or.

>
Yonghong Song Dec. 1, 2020, 3:48 a.m. UTC | #4
On 11/30/20 9:22 AM, Yonghong Song wrote:
> 
> 
> On 11/28/20 5:40 PM, Alexei Starovoitov wrote:
>> On Fri, Nov 27, 2020 at 09:53:05PM -0800, Yonghong Song wrote:
>>>
>>>
>>> On 11/27/20 9:57 AM, Brendan Jackman wrote:
>>>> Status of the patches
>>>> =====================
>>>>
>>>> Thanks for the reviews! Differences from v1->v2 [1]:
>>>>
>>>> * Fixed mistakes in the netronome driver
>>>>
>>>> * Addd sub, add, or, xor operations
>>>>
>>>> * The above led to some refactors to keep things readable. (Maybe I
>>>>     should have just waited until I'd implemented these before starting
>>>>     the review...)
>>>>
>>>> * Replaced BPF_[CMP]SET | BPF_FETCH with just BPF_[CMP]XCHG, which
>>>>     include the BPF_FETCH flag
>>>>
>>>> * Added a bit of documentation. Suggestions welcome for more places
>>>>     to dump this info...
>>>>
>>>> The prog_test that's added depends on Clang/LLVM features added by
>>>> Yonghong in 
>>>> https://reviews.llvm.org/D72184 
>>>>
>>>> This only includes a JIT implementation for x86_64 - I don't plan to
>>>> implement JIT support myself for other architectures.
>>>>
>>>> Operations
>>>> ==========
>>>>
>>>> This patchset adds atomic operations to the eBPF instruction set. The
>>>> use-case that motivated this work was a trivial and efficient way to
>>>> generate globally-unique cookies in BPF progs, but I think it's
>>>> obvious that these features are pretty widely applicable.  The
>>>> instructions that are added here can be summarised with this list of
>>>> kernel operations:
>>>>
>>>> * atomic[64]_[fetch_]add
>>>> * atomic[64]_[fetch_]sub
>>>> * atomic[64]_[fetch_]and
>>>> * atomic[64]_[fetch_]or
>>>
>>> * atomic[64]_[fetch_]xor
>>>
>>>> * atomic[64]_xchg
>>>> * atomic[64]_cmpxchg
>>>
>>> Thanks. Overall looks good to me but I did not check carefully
>>> on jit part as I am not an expert in x64 assembly...
>>>
>>> This patch also introduced atomic[64]_{sub,and,or,xor}, similar to
>>> xadd. I am not sure whether it is necessary. For one thing,
>>> users can just use atomic[64]_fetch_{sub,and,or,xor} to ignore
>>> return value and they will achieve the same result, right?
>>>  From llvm side, there is no ready-to-use gcc builtin matching
>>> atomic[64]_{sub,and,or,xor} which does not have return values.
>>> If we go this route, we will need to invent additional bpf
>>> specific builtins.
>>
>> I think bpf specific builtins are overkill.
>> As you said the users can use atomic_fetch_xor() and ignore
>> return value. I think llvm backend should be smart enough to use
>> BPF_ATOMIC | BPF_XOR insn without BPF_FETCH bit in such case.
>> But if it's too cumbersome to do at the moment we skip this
>> optimization for now.
> 
> We can initially all have BPF_FETCH bit as at that point we do not
> have def-use chain. Later on we can add a
> machine ssa IR phase and check whether the result of, say 
> atomic_fetch_or(), is used or not. If not, we can change the
> instruction to atomic_or.

Just implemented what we discussed above in llvm:
   https://reviews.llvm.org/D72184
Main changes:
   1. atomic_fetch_sub (and later atomic_sub) is gone. llvm will
      transparently transform it to negation followed by
      atomic_fetch_add or atomic_add (xadd). The kernel can remove the
      atomic_fetch_sub/atomic_sub insns.
   2. Added new instructions for atomic_{and, or, xor}.
   3. For a gcc builtin, e.g. __sync_fetch_and_or(), if the return
      value is used, atomic_fetch_or will be generated. Otherwise,
      atomic_or will be generated.
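
For example, the lowering in (1) amounts to something like this at the
source level (a sketch, not actual compiler output):

static long long sub_old(long long *p, long long x)
{
	/* an atomic_fetch_sub becomes: negate, then atomic_fetch_add */
	return __sync_fetch_and_add(p, -x);
}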
Andrii Nakryiko Dec. 2, 2020, 2 a.m. UTC | #5
On Mon, Nov 30, 2020 at 7:51 PM Yonghong Song <yhs@fb.com> wrote:
>
>
>
> On 11/30/20 9:22 AM, Yonghong Song wrote:
> >
> >
> > On 11/28/20 5:40 PM, Alexei Starovoitov wrote:
> >> On Fri, Nov 27, 2020 at 09:53:05PM -0800, Yonghong Song wrote:
> >>>
> >>>
> >>> On 11/27/20 9:57 AM, Brendan Jackman wrote:
> >>>> Status of the patches
> >>>> =====================
> >>>>
> >>>> Thanks for the reviews! Differences from v1->v2 [1]:
> >>>>
> >>>> * Fixed mistakes in the netronome driver
> >>>>
> >>>> * Addd sub, add, or, xor operations
> >>>>
> >>>> * The above led to some refactors to keep things readable. (Maybe I
> >>>>     should have just waited until I'd implemented these before starting
> >>>>     the review...)
> >>>>
> >>>> * Replaced BPF_[CMP]SET | BPF_FETCH with just BPF_[CMP]XCHG, which
> >>>>     include the BPF_FETCH flag
> >>>>
> >>>> * Added a bit of documentation. Suggestions welcome for more places
> >>>>     to dump this info...
> >>>>
> >>>> The prog_test that's added depends on Clang/LLVM features added by
> >>>> Yonghong in
> >>>> https://reviews.llvm.org/D72184
> >>>>
> >>>> This only includes a JIT implementation for x86_64 - I don't plan to
> >>>> implement JIT support myself for other architectures.
> >>>>
> >>>> Operations
> >>>> ==========
> >>>>
> >>>> This patchset adds atomic operations to the eBPF instruction set. The
> >>>> use-case that motivated this work was a trivial and efficient way to
> >>>> generate globally-unique cookies in BPF progs, but I think it's
> >>>> obvious that these features are pretty widely applicable.  The
> >>>> instructions that are added here can be summarised with this list of
> >>>> kernel operations:
> >>>>
> >>>> * atomic[64]_[fetch_]add
> >>>> * atomic[64]_[fetch_]sub
> >>>> * atomic[64]_[fetch_]and
> >>>> * atomic[64]_[fetch_]or
> >>>
> >>> * atomic[64]_[fetch_]xor
> >>>
> >>>> * atomic[64]_xchg
> >>>> * atomic[64]_cmpxchg
> >>>
> >>> Thanks. Overall looks good to me but I did not check carefully
> >>> on jit part as I am not an expert in x64 assembly...
> >>>
> >>> This patch also introduced atomic[64]_{sub,and,or,xor}, similar to
> >>> xadd. I am not sure whether it is necessary. For one thing,
> >>> users can just use atomic[64]_fetch_{sub,and,or,xor} to ignore
> >>> return value and they will achieve the same result, right?
> >>>  From llvm side, there is no ready-to-use gcc builtin matching
> >>> atomic[64]_{sub,and,or,xor} which does not have return values.
> >>> If we go this route, we will need to invent additional bpf
> >>> specific builtins.
> >>
> >> I think bpf specific builtins are overkill.
> >> As you said the users can use atomic_fetch_xor() and ignore
> >> return value. I think llvm backend should be smart enough to use
> >> BPF_ATOMIC | BPF_XOR insn without BPF_FETCH bit in such case.
> >> But if it's too cumbersome to do at the moment we skip this
> >> optimization for now.
> >
> > We can initially all have BPF_FETCH bit as at that point we do not
> > have def-use chain. Later on we can add a
> > machine ssa IR phase and check whether the result of, say
> > atomic_fetch_or(), is used or not. If not, we can change the
> > instruction to atomic_or.
>
> Just implemented what we discussed above in llvm:
>    https://reviews.llvm.org/D72184
> main change:
>    1. atomic_fetch_sub (and later atomic_sub) is gone. llvm will
>       transparently transforms it to negation followed by
>       atomic_fetch_add or atomic_add (xadd). Kernel can remove
>       atomic_fetch_sub/atomic_sub insns.
>    2. added new instructions for atomic_{and, or, xor}.
>    3. for gcc builtin e.g., __sync_fetch_and_or(), if return
>       value is used, atomic_fetch_or will be generated. Otherwise,
>       atomic_or will be generated.

Great, this means that all existing valid uses of
__sync_fetch_and_add() will generate BPF_XADD instructions and will
work on old kernels, right?

If that's the case, do we still need cpu=v4? The new instructions are
*only* going to be generated if the user uses previously unsupported
__sync_fetch_xxx() intrinsics. So, in effect, the user consciously
opts into using new BPF instructions. cpu=v4 seems like an unnecessary
tautology then?
Yonghong Song Dec. 2, 2020, 5:05 a.m. UTC | #6
On 12/1/20 6:00 PM, Andrii Nakryiko wrote:
> On Mon, Nov 30, 2020 at 7:51 PM Yonghong Song <yhs@fb.com> wrote:
>>
>>
>>
>> On 11/30/20 9:22 AM, Yonghong Song wrote:
>>>
>>>
>>> On 11/28/20 5:40 PM, Alexei Starovoitov wrote:
>>>> On Fri, Nov 27, 2020 at 09:53:05PM -0800, Yonghong Song wrote:
>>>>>
>>>>>
>>>>> On 11/27/20 9:57 AM, Brendan Jackman wrote:
>>>>>> Status of the patches
>>>>>> =====================
>>>>>>
>>>>>> Thanks for the reviews! Differences from v1->v2 [1]:
>>>>>>
>>>>>> * Fixed mistakes in the netronome driver
>>>>>>
>>>>>> * Addd sub, add, or, xor operations
>>>>>>
>>>>>> * The above led to some refactors to keep things readable. (Maybe I
>>>>>>      should have just waited until I'd implemented these before starting
>>>>>>      the review...)
>>>>>>
>>>>>> * Replaced BPF_[CMP]SET | BPF_FETCH with just BPF_[CMP]XCHG, which
>>>>>>      include the BPF_FETCH flag
>>>>>>
>>>>>> * Added a bit of documentation. Suggestions welcome for more places
>>>>>>      to dump this info...
>>>>>>
>>>>>> The prog_test that's added depends on Clang/LLVM features added by
>>>>>> Yonghong in
>>>>>> https://reviews.llvm.org/D72184
>>>>>>
>>>>>> This only includes a JIT implementation for x86_64 - I don't plan to
>>>>>> implement JIT support myself for other architectures.
>>>>>>
>>>>>> Operations
>>>>>> ==========
>>>>>>
>>>>>> This patchset adds atomic operations to the eBPF instruction set. The
>>>>>> use-case that motivated this work was a trivial and efficient way to
>>>>>> generate globally-unique cookies in BPF progs, but I think it's
>>>>>> obvious that these features are pretty widely applicable.  The
>>>>>> instructions that are added here can be summarised with this list of
>>>>>> kernel operations:
>>>>>>
>>>>>> * atomic[64]_[fetch_]add
>>>>>> * atomic[64]_[fetch_]sub
>>>>>> * atomic[64]_[fetch_]and
>>>>>> * atomic[64]_[fetch_]or
>>>>>
>>>>> * atomic[64]_[fetch_]xor
>>>>>
>>>>>> * atomic[64]_xchg
>>>>>> * atomic[64]_cmpxchg
>>>>>
>>>>> Thanks. Overall looks good to me but I did not check carefully
>>>>> on jit part as I am not an expert in x64 assembly...
>>>>>
>>>>> This patch also introduced atomic[64]_{sub,and,or,xor}, similar to
>>>>> xadd. I am not sure whether it is necessary. For one thing,
>>>>> users can just use atomic[64]_fetch_{sub,and,or,xor} to ignore
>>>>> return value and they will achieve the same result, right?
>>>>>   From llvm side, there is no ready-to-use gcc builtin matching
>>>>> atomic[64]_{sub,and,or,xor} which does not have return values.
>>>>> If we go this route, we will need to invent additional bpf
>>>>> specific builtins.
>>>>
>>>> I think bpf specific builtins are overkill.
>>>> As you said the users can use atomic_fetch_xor() and ignore
>>>> return value. I think llvm backend should be smart enough to use
>>>> BPF_ATOMIC | BPF_XOR insn without BPF_FETCH bit in such case.
>>>> But if it's too cumbersome to do at the moment we skip this
>>>> optimization for now.
>>>
>>> We can initially all have BPF_FETCH bit as at that point we do not
>>> have def-use chain. Later on we can add a
>>> machine ssa IR phase and check whether the result of, say
>>> atomic_fetch_or(), is used or not. If not, we can change the
>>> instruction to atomic_or.
>>
>> Just implemented what we discussed above in llvm:
>>     https://reviews.llvm.org/D72184
>> main change:
>>     1. atomic_fetch_sub (and later atomic_sub) is gone. llvm will
>>        transparently transforms it to negation followed by
>>        atomic_fetch_add or atomic_add (xadd). Kernel can remove
>>        atomic_fetch_sub/atomic_sub insns.
>>     2. added new instructions for atomic_{and, or, xor}.
>>     3. for gcc builtin e.g., __sync_fetch_and_or(), if return
>>        value is used, atomic_fetch_or will be generated. Otherwise,
>>        atomic_or will be generated.
> 
> Great, this means that all existing valid uses of
> __sync_fetch_and_add() will generate BPF_XADD instructions and will
> work on old kernels, right?

That is correct.

> 
> If that's the case, do we still need cpu=v4? The new instructions are
> *only* going to be generated if the user uses previously unsupported
> __sync_fetch_xxx() intrinsics. So, in effect, the user consciously
> opts into using new BPF instructions. cpu=v4 seems like an unnecessary
> tautology then?

This is a very good question. Essentially this boils down to when users
can use the new functionality, including the meaningful return value of
__sync_fetch_and_add().
   (1). The user can write a small bpf program to test the feature. If
        the user gets a failed compilation (fatal error), it isn't
        supported. Otherwise, it is supported. (A sketch follows below.)
   (2). The compiler provides some way to tell the user it is safe to
        use, e.g., -mcpu=v4, or some clang macro suggested by Brendan
        earlier.

I guess since the kernel already does a lot of feature discovery,
option (1) is probably fine.
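
A probe translation unit for option (1) might look like this (names
invented; the expectation is that it fails to compile on a Clang
without the new support):

long probe_new_atomics(long *p, long expected, long new_val)
{
	/* these builtins need the new BPF_ATOMIC encodings */
	long old = __sync_fetch_and_and(p, 0x0fL);

	old += __sync_val_compare_and_swap(p, expected, new_val);
	return old;
}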
John Fastabend Dec. 2, 2020, 5:53 a.m. UTC | #7
Yonghong Song wrote:
> 
> 

[...]

> > Great, this means that all existing valid uses of
> > __sync_fetch_and_add() will generate BPF_XADD instructions and will
> > work on old kernels, right?
> 
> That is correct.
> 
> > 
> > If that's the case, do we still need cpu=v4? The new instructions are
> > *only* going to be generated if the user uses previously unsupported
> > __sync_fetch_xxx() intrinsics. So, in effect, the user consciously
> > opts into using new BPF instructions. cpu=v4 seems like an unnecessary
> > tautology then?
> 
> This is a very good question. Essentially this boils to when users can 
> use the new functionality including meaningful return value  of 
> __sync_fetch_and_add().
>    (1). user can write a small bpf program to test the feature. If user
>         gets a failed compilation (fatal error), it won't be supported.
>         Otherwise, it is supported.
>    (2). compiler provides some way to tell user it is safe to use, e.g.,
>         -mcpu=v4, or some clang macro suggested by Brendan earlier.
> 
> I guess since kernel already did a lot of feature discovery. Option (1)
> is probably fine.

For option (2) we can use BTF with a kernel version check. If the kernel
is newer than the one this lands in, we use the new instructions; if
not, we use a slower locked version. That should work for all cases
unless someone backports the patches into an older kernel.

At least that's what I'll probably end up wrapping in a helper function.
Andrii Nakryiko Dec. 2, 2020, 5:59 a.m. UTC | #8
On Tue, Dec 1, 2020 at 9:53 PM John Fastabend <john.fastabend@gmail.com> wrote:
>
> Yonghong Song wrote:
> >
> >
>
> [...]
>
> > > Great, this means that all existing valid uses of
> > > __sync_fetch_and_add() will generate BPF_XADD instructions and will
> > > work on old kernels, right?
> >
> > That is correct.
> >
> > >
> > > If that's the case, do we still need cpu=v4? The new instructions are
> > > *only* going to be generated if the user uses previously unsupported
> > > __sync_fetch_xxx() intrinsics. So, in effect, the user consciously
> > > opts into using new BPF instructions. cpu=v4 seems like an unnecessary
> > > tautology then?
> >
> > This is a very good question. Essentially this boils to when users can
> > use the new functionality including meaningful return value  of
> > __sync_fetch_and_add().
> >    (1). user can write a small bpf program to test the feature. If user
> >         gets a failed compilation (fatal error), it won't be supported.
> >         Otherwise, it is supported.
> >    (2). compiler provides some way to tell user it is safe to use, e.g.,
> >         -mcpu=v4, or some clang macro suggested by Brendan earlier.
> >
> > I guess since kernel already did a lot of feature discovery. Option (1)
> > is probably fine.
>
> For option (2) we can use BTF with kernel version check. If kernel is
> greater than kernel this lands in we use the the new instructions if
> not we use a slower locked version. That should work for all cases
> unless someone backports patches into an older case.

Two different things: Clang support detection and kernel support
detection. You are talking about kernel detection; Yonghong and I were
talking about Clang detection and the explicit cpu=v4 opt-in.

For kernel detection, if there is an enum value or type that gets
added along with the feature, then with CO-RE built-ins it's easy to
detect, and kernel dead code elimination will make sure that
unsupported instructions won't trip up the BPF verifier. You still need
Clang support to compile the program in the first place, though.

If there is no such BTF-based way to check, it is still possible to
try to load a trivial BPF program with __sync_fetch_and_xxx() to do
feature detection and then use .rodata to turn off code paths relying
on the new instruction set.
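
A sketch of that .rodata approach (all names invented for
illustration):

const volatile int kernel_has_atomics;	/* set by the loader after probing */

static long bump(long *counter)
{
	if (kernel_has_atomics)
		return __sync_fetch_and_add(counter, 1); /* new-insn path */

	/* fallback for older kernels goes here */
	return 0;
}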

>
> At least thats what I'll probably end up wrapping in a helper function.
John Fastabend Dec. 2, 2020, 6:27 a.m. UTC | #9
Andrii Nakryiko wrote:
> On Tue, Dec 1, 2020 at 9:53 PM John Fastabend <john.fastabend@gmail.com> wrote:
> >
> > Yonghong Song wrote:
> > >
> > >
> >
> > [...]
> >
> > > > Great, this means that all existing valid uses of
> > > > __sync_fetch_and_add() will generate BPF_XADD instructions and will
> > > > work on old kernels, right?
> > >
> > > That is correct.
> > >
> > > >
> > > > If that's the case, do we still need cpu=v4? The new instructions are
> > > > *only* going to be generated if the user uses previously unsupported
> > > > __sync_fetch_xxx() intrinsics. So, in effect, the user consciously
> > > > opts into using new BPF instructions. cpu=v4 seems like an unnecessary
> > > > tautology then?
> > >
> > > This is a very good question. Essentially this boils to when users can
> > > use the new functionality including meaningful return value  of
> > > __sync_fetch_and_add().
> > >    (1). user can write a small bpf program to test the feature. If user
> > >         gets a failed compilation (fatal error), it won't be supported.
> > >         Otherwise, it is supported.
> > >    (2). compiler provides some way to tell user it is safe to use, e.g.,
> > >         -mcpu=v4, or some clang macro suggested by Brendan earlier.
> > >
> > > I guess since kernel already did a lot of feature discovery. Option (1)
> > > is probably fine.
> >
> > For option (2) we can use BTF with kernel version check. If kernel is
> > greater than kernel this lands in we use the the new instructions if
> > not we use a slower locked version. That should work for all cases
> > unless someone backports patches into an older case.
> 
> Two different things: Clang support detection and kernel support
> detection. You are talking about kernel detection, I and Yonghong were
> talking about Clang detection and explicit cpu=v4 opt-in.
> 

Ah right, catching up on email and reading the thread backwards I lost
the context, thanks!

So, I live in a dev world where I control the build infrastructure, so
I always know the clang/llvm versions and features. What I don't know
as well is where the program I just built might be run. So it's a bit
of an odd question from my perspective to ask if my clang supports
feature X. If it doesn't support feature X and I want it, we upgrade
clang so that it does support it. I don't think we would ever
write a program to test that assertion. Anyway, thanks.

> For kernel detection, if there is an enum value or type that gets
> added along the feature, then with CO-RE built-ins it's easy to detect
> and kernel dead code elimination will make sure that unsupported
> instructions won't trip up the BPF verifier. Still need Clang support
> to compile the program in the first place, though.

+1

> 
> If there is no such BTF-based way to check, it is still possible to
> try to load a trivial BPF program with __sync_fech_and_xxx() to do
> feature detection and then use .rodata to turn off code paths relying
> on a new instruction set.

Right.

> 
> >
> > At least thats what I'll probably end up wrapping in a helper function.
Yonghong Song Dec. 2, 2020, 8:03 a.m. UTC | #10
On 12/1/20 9:05 PM, Yonghong Song wrote:
> 
> 
> On 12/1/20 6:00 PM, Andrii Nakryiko wrote:
>> On Mon, Nov 30, 2020 at 7:51 PM Yonghong Song <yhs@fb.com> wrote:
>>>
>>>
>>>
>>> On 11/30/20 9:22 AM, Yonghong Song wrote:
>>>>
>>>>
>>>> On 11/28/20 5:40 PM, Alexei Starovoitov wrote:
>>>>> On Fri, Nov 27, 2020 at 09:53:05PM -0800, Yonghong Song wrote:
>>>>>>
>>>>>>
>>>>>> On 11/27/20 9:57 AM, Brendan Jackman wrote:
>>>>>>> Status of the patches
>>>>>>> =====================
>>>>>>>
>>>>>>> Thanks for the reviews! Differences from v1->v2 [1]:
>>>>>>>
>>>>>>> * Fixed mistakes in the netronome driver
>>>>>>>
>>>>>>> * Addd sub, add, or, xor operations
>>>>>>>
>>>>>>> * The above led to some refactors to keep things readable. (Maybe I
>>>>>>>      should have just waited until I'd implemented these before 
>>>>>>> starting
>>>>>>>      the review...)
>>>>>>>
>>>>>>> * Replaced BPF_[CMP]SET | BPF_FETCH with just BPF_[CMP]XCHG, which
>>>>>>>      include the BPF_FETCH flag
>>>>>>>
>>>>>>> * Added a bit of documentation. Suggestions welcome for more places
>>>>>>>      to dump this info...
>>>>>>>
>>>>>>> The prog_test that's added depends on Clang/LLVM features added by
>>>>>>> Yonghong in
>>>>>>> https://reviews.llvm.org/D72184 
>>>>>>>
>>>>>>> This only includes a JIT implementation for x86_64 - I don't plan to
>>>>>>> implement JIT support myself for other architectures.
>>>>>>>
>>>>>>> Operations
>>>>>>> ==========
>>>>>>>
>>>>>>> This patchset adds atomic operations to the eBPF instruction set. 
>>>>>>> The
>>>>>>> use-case that motivated this work was a trivial and efficient way to
>>>>>>> generate globally-unique cookies in BPF progs, but I think it's
>>>>>>> obvious that these features are pretty widely applicable.  The
>>>>>>> instructions that are added here can be summarised with this list of
>>>>>>> kernel operations:
>>>>>>>
>>>>>>> * atomic[64]_[fetch_]add
>>>>>>> * atomic[64]_[fetch_]sub
>>>>>>> * atomic[64]_[fetch_]and
>>>>>>> * atomic[64]_[fetch_]or
>>>>>>
>>>>>> * atomic[64]_[fetch_]xor
>>>>>>
>>>>>>> * atomic[64]_xchg
>>>>>>> * atomic[64]_cmpxchg
>>>>>>
>>>>>> Thanks. Overall looks good to me but I did not check carefully
>>>>>> on jit part as I am not an expert in x64 assembly...
>>>>>>
>>>>>> This patch also introduced atomic[64]_{sub,and,or,xor}, similar to
>>>>>> xadd. I am not sure whether it is necessary. For one thing,
>>>>>> users can just use atomic[64]_fetch_{sub,and,or,xor} to ignore
>>>>>> return value and they will achieve the same result, right?
>>>>>>   From llvm side, there is no ready-to-use gcc builtin matching
>>>>>> atomic[64]_{sub,and,or,xor} which does not have return values.
>>>>>> If we go this route, we will need to invent additional bpf
>>>>>> specific builtins.
>>>>>
>>>>> I think bpf specific builtins are overkill.
>>>>> As you said the users can use atomic_fetch_xor() and ignore
>>>>> return value. I think llvm backend should be smart enough to use
>>>>> BPF_ATOMIC | BPF_XOR insn without BPF_FETCH bit in such case.
>>>>> But if it's too cumbersome to do at the moment we skip this
>>>>> optimization for now.
>>>>
>>>> We can initially all have BPF_FETCH bit as at that point we do not
>>>> have def-use chain. Later on we can add a
>>>> machine ssa IR phase and check whether the result of, say
>>>> atomic_fetch_or(), is used or not. If not, we can change the
>>>> instruction to atomic_or.
>>>
>>> Just implemented what we discussed above in llvm:
>>>     
>>> https://reviews.llvm.org/D72184 
>>> main change:
>>>     1. atomic_fetch_sub (and later atomic_sub) is gone. llvm will
>>>        transparently transforms it to negation followed by
>>>        atomic_fetch_add or atomic_add (xadd). Kernel can remove
>>>        atomic_fetch_sub/atomic_sub insns.
>>>     2. added new instructions for atomic_{and, or, xor}.
>>>     3. for gcc builtin e.g., __sync_fetch_and_or(), if return
>>>        value is used, atomic_fetch_or will be generated. Otherwise,
>>>        atomic_or will be generated.
>>
>> Great, this means that all existing valid uses of
>> __sync_fetch_and_add() will generate BPF_XADD instructions and will
>> work on old kernels, right?
> 
> That is correct.
> 
>>
>> If that's the case, do we still need cpu=v4? The new instructions are
>> *only* going to be generated if the user uses previously unsupported
>> __sync_fetch_xxx() intrinsics. So, in effect, the user consciously
>> opts into using new BPF instructions. cpu=v4 seems like an unnecessary
>> tautology then?
> 
> This is a very good question. Essentially this boils to when users can 
> use the new functionality including meaningful return value  of 
> __sync_fetch_and_add().
>    (1). user can write a small bpf program to test the feature. If user
>         gets a failed compilation (fatal error), it won't be supported.
>         Otherwise, it is supported.
>    (2). compiler provides some way to tell user it is safe to use, e.g.,
>         -mcpu=v4, or some clang macro suggested by Brendan earlier.
> 
> I guess since kernel already did a lot of feature discovery. Option (1)
> is probably fine.

Just pushed a new llvm version (https://reviews.llvm.org/D72184) which
removed -mcpu=v4. The new instructions will be generated by default
for the 64-bit types. For 32-bit types, alu32 is required. Currently
-mcpu=v3 already has alu32 as the default, and a kernel supporting the
atomic insns should have good alu32 support too. So I decided to skip
the non-alu32 32-bit mode. But if people feel strongly about supporting
non-alu32 32-bit mode atomic instructions, I can add them in llvm...
The instruction encodings are the same for alu32/non-alu32 32-bit mode,
so the kernel will not be impacted.