diff mbox series

[bpf-next,2/3] bpf, x86: Don't generate lock prefix for BPF_XCHG

Message ID 20231219135615.2656572-3-houtao@huaweicloud.com (mailing list archive)
State Superseded
Delegated to: BPF
Headers show
Series bpf: inline bpf_kptr_xchg() | expand

Checks

Context Check Description
netdev/tree_selection success Clearly marked for bpf-next
netdev/apply fail Patch does not apply to bpf-next
bpf/vmtest-bpf-next-VM_Test-1 success Logs for ShellCheck
bpf/vmtest-bpf-next-VM_Test-2 success Logs for Unittests
bpf/vmtest-bpf-next-PR success PR summary
bpf/vmtest-bpf-next-VM_Test-0 success Logs for Lint
bpf/vmtest-bpf-next-VM_Test-3 success Logs for Validate matrix.py
bpf/vmtest-bpf-next-VM_Test-5 success Logs for aarch64-gcc / build-release
bpf/vmtest-bpf-next-VM_Test-4 success Logs for aarch64-gcc / build / build for aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-6 success Logs for aarch64-gcc / test (test_maps, false, 360) / test_maps on aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-7 success Logs for aarch64-gcc / test (test_progs, false, 360) / test_progs on aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-8 success Logs for aarch64-gcc / test (test_progs_no_alu32, false, 360) / test_progs_no_alu32 on aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-9 success Logs for aarch64-gcc / test (test_verifier, false, 360) / test_verifier on aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-10 success Logs for aarch64-gcc / veristat
bpf/vmtest-bpf-next-VM_Test-11 success Logs for s390x-gcc / build / build for s390x with gcc
bpf/vmtest-bpf-next-VM_Test-12 success Logs for s390x-gcc / build-release
bpf/vmtest-bpf-next-VM_Test-13 success Logs for s390x-gcc / test (test_maps, false, 360) / test_maps on s390x with gcc
bpf/vmtest-bpf-next-VM_Test-33 success Logs for x86_64-llvm-17 / test (test_verifier, false, 360) / test_verifier on x86_64 with llvm-17
bpf/vmtest-bpf-next-VM_Test-32 success Logs for x86_64-llvm-17 / test (test_progs_no_alu32, false, 360) / test_progs_no_alu32 on x86_64 with llvm-17
bpf/vmtest-bpf-next-VM_Test-34 success Logs for x86_64-llvm-17 / veristat
bpf/vmtest-bpf-next-VM_Test-35 success Logs for x86_64-llvm-18 / build / build for x86_64 with llvm-18
bpf/vmtest-bpf-next-VM_Test-36 success Logs for x86_64-llvm-18 / build-release / build for x86_64 with llvm-18 and -O2 optimization
bpf/vmtest-bpf-next-VM_Test-37 success Logs for x86_64-llvm-18 / test (test_maps, false, 360) / test_maps on x86_64 with llvm-18
bpf/vmtest-bpf-next-VM_Test-38 success Logs for x86_64-llvm-18 / test (test_progs, false, 360) / test_progs on x86_64 with llvm-18
bpf/vmtest-bpf-next-VM_Test-39 success Logs for x86_64-llvm-18 / test (test_progs_cpuv4, false, 360) / test_progs_cpuv4 on x86_64 with llvm-18
bpf/vmtest-bpf-next-VM_Test-40 success Logs for x86_64-llvm-18 / test (test_progs_no_alu32, false, 360) / test_progs_no_alu32 on x86_64 with llvm-18
bpf/vmtest-bpf-next-VM_Test-41 success Logs for x86_64-llvm-18 / test (test_verifier, false, 360) / test_verifier on x86_64 with llvm-18
bpf/vmtest-bpf-next-VM_Test-42 success Logs for x86_64-llvm-18 / veristat
bpf/vmtest-bpf-next-VM_Test-22 success Logs for x86_64-gcc / test (test_progs, false, 360) / test_progs on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-19 success Logs for x86_64-gcc / build / build for x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-30 success Logs for x86_64-llvm-17 / test (test_maps, false, 360) / test_maps on x86_64 with llvm-17
bpf/vmtest-bpf-next-VM_Test-24 success Logs for x86_64-gcc / test (test_progs_no_alu32_parallel, true, 30) / test_progs_no_alu32_parallel on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-28 success Logs for x86_64-llvm-17 / build / build for x86_64 with llvm-17
bpf/vmtest-bpf-next-VM_Test-16 success Logs for s390x-gcc / test (test_verifier, false, 360) / test_verifier on s390x with gcc
bpf/vmtest-bpf-next-VM_Test-27 success Logs for x86_64-gcc / veristat / veristat on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-21 success Logs for x86_64-gcc / test (test_maps, false, 360) / test_maps on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-18 success Logs for set-matrix
bpf/vmtest-bpf-next-VM_Test-29 success Logs for x86_64-llvm-17 / build-release / build for x86_64 with llvm-17 and -O2 optimization
bpf/vmtest-bpf-next-VM_Test-20 success Logs for x86_64-gcc / build-release
bpf/vmtest-bpf-next-VM_Test-14 success Logs for s390x-gcc / test (test_progs, false, 360) / test_progs on s390x with gcc
bpf/vmtest-bpf-next-VM_Test-15 success Logs for s390x-gcc / test (test_progs_no_alu32, false, 360) / test_progs_no_alu32 on s390x with gcc
bpf/vmtest-bpf-next-VM_Test-17 success Logs for s390x-gcc / veristat
bpf/vmtest-bpf-next-VM_Test-23 success Logs for x86_64-gcc / test (test_progs_no_alu32, false, 360) / test_progs_no_alu32 on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-25 success Logs for x86_64-gcc / test (test_progs_parallel, true, 30) / test_progs_parallel on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-31 success Logs for x86_64-llvm-17 / test (test_progs, false, 360) / test_progs on x86_64 with llvm-17
bpf/vmtest-bpf-next-VM_Test-26 success Logs for x86_64-gcc / test (test_verifier, false, 360) / test_verifier on x86_64 with gcc

Commit Message

Hou Tao Dec. 19, 2023, 1:56 p.m. UTC
From: Hou Tao <houtao1@huawei.com>

According to the implementation of atomic_xchg() under x86-64, the lock
prefix is not necessary for the BPF_XCHG atomic operation, so just
remove it.

Signed-off-by: Hou Tao <houtao1@huawei.com>
---
 arch/x86/net/bpf_jit_comp.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

Comments

Daniel Borkmann Dec. 20, 2023, 2:58 p.m. UTC | #1
On 12/19/23 2:56 PM, Hou Tao wrote:
> From: Hou Tao <houtao1@huawei.com>
> 
> According to the implementation of atomic_xchg() under x86-64, the lock
> prefix is not necessary for BPF_XCHG atomic operation, so just remove
> it.

It's probably a good idea for the commit message to explicitly quote the
Intel docs here, so it's easier to see why the lock prefix would
not be needed for the xchg op.

> Signed-off-by: Hou Tao <houtao1@huawei.com>
> ---
>   arch/x86/net/bpf_jit_comp.c | 4 +++-
>   1 file changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c
> index c89a4abdd726..49dac4d22a7b 100644
> --- a/arch/x86/net/bpf_jit_comp.c
> +++ b/arch/x86/net/bpf_jit_comp.c
> @@ -990,7 +990,9 @@ static int emit_atomic(u8 **pprog, u8 atomic_op,
>   {
>   	u8 *prog = *pprog;
>   
> -	EMIT1(0xF0); /* lock prefix */
> +	/* lock prefix */
> +	if (atomic_op != BPF_XCHG)
> +		EMIT1(0xF0);
>   
>   	maybe_emit_mod(&prog, dst_reg, src_reg, bpf_size == BPF_DW);
>   
>
Alexei Starovoitov Dec. 20, 2023, 6:25 p.m. UTC | #2
On Wed, Dec 20, 2023 at 6:58 AM Daniel Borkmann <daniel@iogearbox.net> wrote:
>
> On 12/19/23 2:56 PM, Hou Tao wrote:
> > From: Hou Tao <houtao1@huawei.com>
> >
> > According to the implementation of atomic_xchg() under x86-64, the lock
> > prefix is not necessary for BPF_XCHG atomic operation, so just remove
> > it.
>
> It's probably a good idea for the commit message to explicitly quote the
> Intel docs in here, so it's easier to find on why the lock prefix would
> not be needed for the xchg op.

It's a surprise to me as well.
Definitely more info would be good.

Also, if the xchg insn without lock somehow implies lock in the HW,
what is the harm of adding it explicitly?
If it's a lock in HW, then performance with and without the lock prefix
should be the same, right?
Daniel Borkmann Dec. 20, 2023, 6:46 p.m. UTC | #3
On 12/20/23 7:25 PM, Alexei Starovoitov wrote:
> On Wed, Dec 20, 2023 at 6:58 AM Daniel Borkmann <daniel@iogearbox.net> wrote:
>>
>> On 12/19/23 2:56 PM, Hou Tao wrote:
>>> From: Hou Tao <houtao1@huawei.com>
>>>
>>> According to the implementation of atomic_xchg() under x86-64, the lock
>>> prefix is not necessary for BPF_XCHG atomic operation, so just remove
>>> it.
>>
>> It's probably a good idea for the commit message to explicitly quote the
>> Intel docs in here, so it's easier to find on why the lock prefix would
>> not be needed for the xchg op.
> 
> It's a surprise to me as well.
> Definitely more info would be good.
> 
> Also if xchg insn without lock somehow implies lock in the HW
> what is the harm of adding it explicitly?
> If it's a lock in HW than performance with and without lock prefix
> should be the same, right?

e.g. 7.3.1.2 Exchange Instructions says:

   The XCHG (exchange) instruction swaps the contents of two operands. This
   instruction takes the place of three MOV instructions and does not require
   a temporary location to save the contents of one operand location while
   the other is being loaded. When a memory operand is used with the XCHG
   instruction, the processor’s LOCK signal is automatically asserted.

Also curious if there is any harm adding it explicitly.
Hou Tao Dec. 21, 2023, 11:37 a.m. UTC | #4
Hi,

On 12/21/2023 2:46 AM, Daniel Borkmann wrote:
> On 12/20/23 7:25 PM, Alexei Starovoitov wrote:
>> On Wed, Dec 20, 2023 at 6:58 AM Daniel Borkmann
>> <daniel@iogearbox.net> wrote:
>>>
>>> On 12/19/23 2:56 PM, Hou Tao wrote:
>>>> From: Hou Tao <houtao1@huawei.com>
>>>>
>>>> According to the implementation of atomic_xchg() under x86-64, the
>>>> lock
>>>> prefix is not necessary for BPF_XCHG atomic operation, so just remove
>>>> it.
>>>
>>> It's probably a good idea for the commit message to explicitly quote
>>> the
>>> Intel docs in here, so it's easier to find on why the lock prefix would
>>> not be needed for the xchg op.
>>
>> It's a surprise to me as well.
>> Definitely more info would be good.
>>
>> Also if xchg insn without lock somehow implies lock in the HW
>> what is the harm of adding it explicitly?
>> If it's a lock in HW than performance with and without lock prefix
>> should be the same, right?
>
> e.g. 7.3.1.2 Exchange Instructions says:
>
>   The XCHG (exchange) instruction swaps the contents of two operands.
> This
>   instruction takes the place of three MOV instructions and does not
> require
>   a temporary location to save the contents of one operand location while
>   the other is being loaded. When a memory operand is used with the XCHG
>   instruction, the processor’s LOCK signal is automatically asserted.
>
> Also curious if there is any harm adding it explicitly.
>
> .

I could use the bpf ma benchmark to test it, but I doubt it will make
any visible difference.
Hou Tao Dec. 23, 2023, 4:01 a.m. UTC | #5
Hi,

On 12/21/2023 7:37 PM, Hou Tao wrote:
> Hi,
>
> On 12/21/2023 2:46 AM, Daniel Borkmann wrote:
>> On 12/20/23 7:25 PM, Alexei Starovoitov wrote:
>>> On Wed, Dec 20, 2023 at 6:58 AM Daniel Borkmann
>>> <daniel@iogearbox.net> wrote:
>>>> On 12/19/23 2:56 PM, Hou Tao wrote:
>>>>> From: Hou Tao <houtao1@huawei.com>
>>>>>
>>>>> According to the implementation of atomic_xchg() under x86-64, the
>>>>> lock
>>>>> prefix is not necessary for BPF_XCHG atomic operation, so just remove
>>>>> it.
>>>> It's probably a good idea for the commit message to explicitly quote
>>>> the
>>>> Intel docs in here, so it's easier to find on why the lock prefix would
>>>> not be needed for the xchg op.
>>> It's a surprise to me as well.
>>> Definitely more info would be good.
>>>
>>> Also if xchg insn without lock somehow implies lock in the HW
>>> what is the harm of adding it explicitly?
>>> If it's a lock in HW than performance with and without lock prefix
>>> should be the same, right?
>> e.g. 7.3.1.2 Exchange Instructions says:
>>
>>   The XCHG (exchange) instruction swaps the contents of two operands.
>> This
>>   instruction takes the place of three MOV instructions and does not
>> require
>>   a temporary location to save the contents of one operand location while
>>   the other is being loaded. When a memory operand is used with the XCHG
>>   instruction, the processor’s LOCK signal is automatically asserted.
>>
>> Also curious if there is any harm adding it explicitly.
>>
>> .
> I could use the bpf ma benchmark to test it, but I doubt it will make
> any visible difference.
>
> .

The following is the performance comparison between inlined
bpf_kptr_xchg() without and with the lock prefix. The peak free
performance without the lock prefix is indeed a bit higher than with
it, but the difference is so tiny that it may just be fluctuation.
So I will keep the JIT of BPF_XCHG as-is.


(1) inlined bpf_kptr_xchg() without lock prefix:

$ for i in 1 2 4 8; do ./bench -w3 -d10 bpf_ma -p${i} -a; done | grep
Summary
Summary: per-prod alloc   11.50 ± 0.25M/s free   37.52 ± 0.59M/s, total
memory usage    0.00 ± 0.00MiB
Summary: per-prod alloc    8.52 ± 0.14M/s free   34.73 ± 0.46M/s, total
memory usage    0.01 ± 0.00MiB
Summary: per-prod alloc    7.40 ± 0.09M/s free   36.09 ± 0.65M/s, total
memory usage    0.02 ± 0.00MiB
Summary: per-prod alloc    6.62 ± 0.16M/s free   36.48 ± 2.06M/s, total
memory usage    0.07 ± 0.00MiB

$ for j in $(seq 3); do ./bench -w3 -d10 bpf_ma -p1 -a; done | grep Summary
Summary: per-prod alloc   10.81 ± 0.12M/s free   36.78 ± 0.35M/s, total
memory usage    0.00 ± 0.00MiB
Summary: per-prod alloc   10.77 ± 0.07M/s free   35.76 ± 0.37M/s, total
memory usage    0.01 ± 0.00MiB
Summary: per-prod alloc   10.61 ± 0.09M/s free   33.92 ± 0.41M/s, total
memory usage    0.00 ± 0.00MiB

(2) inlined bpf_kptr_xchg() with lock prefix:

$ for i in 1 2 4 8; do ./bench -w3 -d10 bpf_ma -p${i} -a; done | grep
Summary
Summary: per-prod alloc   11.10 ± 0.19M/s free   36.07 ± 0.40M/s, total
memory usage    0.00 ± 0.00MiB
Summary: per-prod alloc    8.56 ± 0.13M/s free   35.74 ± 0.55M/s, total
memory usage    0.01 ± 0.00MiB
Summary: per-prod alloc    7.37 ± 0.05M/s free   35.78 ± 0.64M/s, total
memory usage    0.02 ± 0.00MiB
Summary: per-prod alloc    6.57 ± 0.10M/s free   33.72 ± 0.57M/s, total
memory usage    0.07 ± 0.00MiB

$ for j in $(seq 3); do ./bench -w3 -d10 bpf_ma -p1 -a; done | grep Summary
Summary: per-prod alloc   11.51 ± 0.20M/s free   35.42 ± 0.34M/s, total
memory usage    0.00 ± 0.00MiB
Summary: per-prod alloc   11.30 ± 0.08M/s free   34.81 ± 0.35M/s, total
memory usage    0.00 ± 0.00MiB
Summary: per-prod alloc   11.43 ± 0.11M/s free   35.43 ± 0.33M/s, total
memory usage    0.00 ± 0.00MiB

Patch

diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c
index c89a4abdd726..49dac4d22a7b 100644
--- a/arch/x86/net/bpf_jit_comp.c
+++ b/arch/x86/net/bpf_jit_comp.c
@@ -990,7 +990,9 @@ static int emit_atomic(u8 **pprog, u8 atomic_op,
 {
 	u8 *prog = *pprog;
 
-	EMIT1(0xF0); /* lock prefix */
+	/* lock prefix */
+	if (atomic_op != BPF_XCHG)
+		EMIT1(0xF0);
 
 	maybe_emit_mod(&prog, dst_reg, src_reg, bpf_size == BPF_DW);