diff mbox series

[v2,bpf-next,4/4] selftest/bpf/benchs: add bpf_loop benchmark

Message ID 20211123183409.3599979-5-joannekoong@fb.com (mailing list archive)
State Changes Requested
Delegated to: BPF
Headers show
Series Add bpf_loop helper | expand

Checks

Context Check Description
bpf/vmtest-bpf-next fail VM_Test
bpf/vmtest-bpf-next-PR fail PR summary
netdev/tree_selection success Clearly marked for bpf-next
netdev/fixes_present success Fixes tag not required for -next series
netdev/subject_prefix success Link
netdev/cover_letter success Series has a cover letter
netdev/patch_count success Link
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 0 this patch: 0
netdev/cc_maintainers warning 8 maintainers not CCed: netdev@vger.kernel.org kafai@fb.com shuah@kernel.org songliubraving@fb.com john.fastabend@gmail.com kpsingh@kernel.org yhs@fb.com linux-kselftest@vger.kernel.org
netdev/build_clang success Errors and warnings before: 0 this patch: 0
netdev/module_param success Was 0 now: 0
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/verify_fixes success No Fixes tag
netdev/build_allmodconfig_warn success Errors and warnings before: 0 this patch: 0
netdev/checkpatch warning WARNING: added, moved or deleted file(s), does MAINTAINERS need updating? WARNING: externs should be avoided in .c files WARNING: line length of 100 exceeds 80 columns WARNING: line length of 87 exceeds 80 columns
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/source_inline success Was 0 now: 0

Commit Message

Joanne Koong Nov. 23, 2021, 6:34 p.m. UTC
Add benchmark to measure the throughput and latency of the bpf_loop
call.

Testing this on qemu on my dev machine on 1 thread, the data is
as follows:

        nr_loops: 1
bpf_loop - throughput: 43.350 ± 0.864 M ops/s, latency: 23.068 ns/op

        nr_loops: 10
bpf_loop - throughput: 69.586 ± 1.722 M ops/s, latency: 14.371 ns/op

        nr_loops: 100
bpf_loop - throughput: 72.046 ± 1.352 M ops/s, latency: 13.880 ns/op

        nr_loops: 500
bpf_loop - throughput: 71.677 ± 1.316 M ops/s, latency: 13.951 ns/op

        nr_loops: 1000
bpf_loop - throughput: 69.435 ± 1.219 M ops/s, latency: 14.402 ns/op

        nr_loops: 5000
bpf_loop - throughput: 72.624 ± 1.162 M ops/s, latency: 13.770 ns/op

        nr_loops: 10000
bpf_loop - throughput: 75.417 ± 1.446 M ops/s, latency: 13.260 ns/op

        nr_loops: 50000
bpf_loop - throughput: 77.400 ± 2.214 M ops/s, latency: 12.920 ns/op

        nr_loops: 100000
bpf_loop - throughput: 78.636 ± 2.107 M ops/s, latency: 12.717 ns/op

        nr_loops: 500000
bpf_loop - throughput: 76.909 ± 2.035 M ops/s, latency: 13.002 ns/op

        nr_loops: 1000000
bpf_loop - throughput: 77.636 ± 1.748 M ops/s, latency: 12.881 ns/op

From this data, we can see that the latency per loop decreases as the
number of loops increases. On this particular machine, each loop had an
overhead of about 13 ns, and we were able to run ~70 million loops
per second.

Signed-off-by: Joanne Koong <joannekoong@fb.com>
---
 tools/testing/selftests/bpf/Makefile          |   4 +-
 tools/testing/selftests/bpf/bench.c           |  26 +++++
 tools/testing/selftests/bpf/bench.h           |   1 +
 .../selftests/bpf/benchs/bench_bpf_loop.c     | 105 ++++++++++++++++++
 .../bpf/benchs/run_bench_bpf_loop.sh          |  15 +++
 .../selftests/bpf/benchs/run_common.sh        |  15 +++
 .../selftests/bpf/progs/bpf_loop_bench.c      |  26 +++++
 7 files changed, 191 insertions(+), 1 deletion(-)
 create mode 100644 tools/testing/selftests/bpf/benchs/bench_bpf_loop.c
 create mode 100755 tools/testing/selftests/bpf/benchs/run_bench_bpf_loop.sh
 create mode 100644 tools/testing/selftests/bpf/progs/bpf_loop_bench.c
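
On the figures above: the reported latency is simply the reciprocal of the
throughput (ops_report_final() in the patch below prints 1000.0 / hits_mean),
so e.g. the nr_loops = 1000000 row works out to 1000 / 77.636 ≈ 12.881 ns/op.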

Comments

Toke Høiland-Jørgensen Nov. 23, 2021, 7:19 p.m. UTC | #1
Joanne Koong <joannekoong@fb.com> writes:

> Add benchmark to measure the throughput and latency of the bpf_loop
> call.
>
> Testing this on qemu on my dev machine on 1 thread, the data is
> as follows:
>
>         nr_loops: 1
> bpf_loop - throughput: 43.350 ± 0.864 M ops/s, latency: 23.068 ns/op
>
>         nr_loops: 10
> bpf_loop - throughput: 69.586 ± 1.722 M ops/s, latency: 14.371 ns/op
>
>         nr_loops: 100
> bpf_loop - throughput: 72.046 ± 1.352 M ops/s, latency: 13.880 ns/op
>
>         nr_loops: 500
> bpf_loop - throughput: 71.677 ± 1.316 M ops/s, latency: 13.951 ns/op
>
>         nr_loops: 1000
> bpf_loop - throughput: 69.435 ± 1.219 M ops/s, latency: 14.402 ns/op
>
>         nr_loops: 5000
> bpf_loop - throughput: 72.624 ± 1.162 M ops/s, latency: 13.770 ns/op
>
>         nr_loops: 10000
> bpf_loop - throughput: 75.417 ± 1.446 M ops/s, latency: 13.260 ns/op
>
>         nr_loops: 50000
> bpf_loop - throughput: 77.400 ± 2.214 M ops/s, latency: 12.920 ns/op
>
>         nr_loops: 100000
> bpf_loop - throughput: 78.636 ± 2.107 M ops/s, latency: 12.717 ns/op
>
>         nr_loops: 500000
> bpf_loop - throughput: 76.909 ± 2.035 M ops/s, latency: 13.002 ns/op
>
>         nr_loops: 1000000
> bpf_loop - throughput: 77.636 ± 1.748 M ops/s, latency: 12.881 ns/op
>
> From this data, we can see that the latency per loop decreases as the
> number of loops increases. On this particular machine, each loop had an
> overhead of about ~13 ns, and we were able to run ~70 million loops
> per second.

The latency figures are great, thanks! I assume these numbers are with
retpolines enabled? Otherwise 12ns seems a bit much... Or is this
because of qemu?

-Toke
Joanne Koong Nov. 24, 2021, 12:20 a.m. UTC | #2
On 11/23/21 11:19 AM, Toke Høiland-Jørgensen wrote:

> Joanne Koong <joannekoong@fb.com> writes:
>
>> Add benchmark to measure the throughput and latency of the bpf_loop
>> call.
>>
>> Testing this on qemu on my dev machine on 1 thread, the data is
>> as follows:
>>
>>          nr_loops: 1
>> bpf_loop - throughput: 43.350 ± 0.864 M ops/s, latency: 23.068 ns/op
>>
>>          nr_loops: 10
>> bpf_loop - throughput: 69.586 ± 1.722 M ops/s, latency: 14.371 ns/op
>>
>>          nr_loops: 100
>> bpf_loop - throughput: 72.046 ± 1.352 M ops/s, latency: 13.880 ns/op
>>
>>          nr_loops: 500
>> bpf_loop - throughput: 71.677 ± 1.316 M ops/s, latency: 13.951 ns/op
>>
>>          nr_loops: 1000
>> bpf_loop - throughput: 69.435 ± 1.219 M ops/s, latency: 14.402 ns/op
>>
>>          nr_loops: 5000
>> bpf_loop - throughput: 72.624 ± 1.162 M ops/s, latency: 13.770 ns/op
>>
>>          nr_loops: 10000
>> bpf_loop - throughput: 75.417 ± 1.446 M ops/s, latency: 13.260 ns/op
>>
>>          nr_loops: 50000
>> bpf_loop - throughput: 77.400 ± 2.214 M ops/s, latency: 12.920 ns/op
>>
>>          nr_loops: 100000
>> bpf_loop - throughput: 78.636 ± 2.107 M ops/s, latency: 12.717 ns/op
>>
>>          nr_loops: 500000
>> bpf_loop - throughput: 76.909 ± 2.035 M ops/s, latency: 13.002 ns/op
>>
>>          nr_loops: 1000000
>> bpf_loop - throughput: 77.636 ± 1.748 M ops/s, latency: 12.881 ns/op
>>
>>  From this data, we can see that the latency per loop decreases as the
>> number of loops increases. On this particular machine, each loop had an
>> overhead of about ~13 ns, and we were able to run ~70 million loops
>> per second.
> The latency figures are great, thanks! I assume these numbers are with
> retpolines enabled? Otherwise 12ns seems a bit much... Or is this
> because of qemu?
I just tested it on a machine (without retpoline enabled) that runs on
actual hardware and here is what I found:

             nr_loops: 1
     bpf_loop - throughput: 46.780 ± 0.064 M ops/s, latency: 21.377 ns/op

             nr_loops: 10
     bpf_loop - throughput: 198.519 ± 0.155 M ops/s, latency: 5.037 ns/op

             nr_loops: 100
     bpf_loop - throughput: 247.448 ± 0.305 M ops/s, latency: 4.041 ns/op

             nr_loops: 500
     bpf_loop - throughput: 260.839 ± 0.380 M ops/s, latency: 3.834 ns/op

             nr_loops: 1000
     bpf_loop - throughput: 262.806 ± 0.629 M ops/s, latency: 3.805 ns/op

             nr_loops: 5000
     bpf_loop - throughput: 264.211 ± 1.508 M ops/s, latency: 3.785 ns/op

             nr_loops: 10000
     bpf_loop - throughput: 265.366 ± 3.054 M ops/s, latency: 3.768 ns/op

             nr_loops: 50000
     bpf_loop - throughput: 235.986 ± 20.205 M ops/s, latency: 4.238 ns/op

             nr_loops: 100000
     bpf_loop - throughput: 264.482 ± 0.279 M ops/s, latency: 3.781 ns/op

             nr_loops: 500000
     bpf_loop - throughput: 309.773 ± 87.713 M ops/s, latency: 3.228 ns/op

             nr_loops: 1000000
     bpf_loop - throughput: 262.818 ± 4.143 M ops/s, latency: 3.805 ns/op

The latency is about 4 ns / loop.

I will update the commit message in v3 with these new numbers as well.
>
> -Toke
>
Toke Høiland-Jørgensen Nov. 24, 2021, 12:56 p.m. UTC | #3
Joanne Koong <joannekoong@fb.com> writes:

> On 11/23/21 11:19 AM, Toke Høiland-Jørgensen wrote:
>
>> Joanne Koong <joannekoong@fb.com> writes:
>>
>>> Add benchmark to measure the throughput and latency of the bpf_loop
>>> call.
>>>
>>> Testing this on qemu on my dev machine on 1 thread, the data is
>>> as follows:
>>>
>>>          nr_loops: 1
>>> bpf_loop - throughput: 43.350 ± 0.864 M ops/s, latency: 23.068 ns/op
>>>
>>>          nr_loops: 10
>>> bpf_loop - throughput: 69.586 ± 1.722 M ops/s, latency: 14.371 ns/op
>>>
>>>          nr_loops: 100
>>> bpf_loop - throughput: 72.046 ± 1.352 M ops/s, latency: 13.880 ns/op
>>>
>>>          nr_loops: 500
>>> bpf_loop - throughput: 71.677 ± 1.316 M ops/s, latency: 13.951 ns/op
>>>
>>>          nr_loops: 1000
>>> bpf_loop - throughput: 69.435 ± 1.219 M ops/s, latency: 14.402 ns/op
>>>
>>>          nr_loops: 5000
>>> bpf_loop - throughput: 72.624 ± 1.162 M ops/s, latency: 13.770 ns/op
>>>
>>>          nr_loops: 10000
>>> bpf_loop - throughput: 75.417 ± 1.446 M ops/s, latency: 13.260 ns/op
>>>
>>>          nr_loops: 50000
>>> bpf_loop - throughput: 77.400 ± 2.214 M ops/s, latency: 12.920 ns/op
>>>
>>>          nr_loops: 100000
>>> bpf_loop - throughput: 78.636 ± 2.107 M ops/s, latency: 12.717 ns/op
>>>
>>>          nr_loops: 500000
>>> bpf_loop - throughput: 76.909 ± 2.035 M ops/s, latency: 13.002 ns/op
>>>
>>>          nr_loops: 1000000
>>> bpf_loop - throughput: 77.636 ± 1.748 M ops/s, latency: 12.881 ns/op
>>>
>>>  From this data, we can see that the latency per loop decreases as the
>>> number of loops increases. On this particular machine, each loop had an
>>> overhead of about ~13 ns, and we were able to run ~70 million loops
>>> per second.
>> The latency figures are great, thanks! I assume these numbers are with
>> retpolines enabled? Otherwise 12ns seems a bit much... Or is this
>> because of qemu?
> I just tested it on a machine (without retpoline enabled) that runs on 
> actual
> hardware and here is what I found:
>
>              nr_loops: 1
>      bpf_loop - throughput: 46.780 ± 0.064 M ops/s, latency: 21.377 ns/op
>
>              nr_loops: 10
>      bpf_loop - throughput: 198.519 ± 0.155 M ops/s, latency: 5.037 ns/op
>
>              nr_loops: 100
>      bpf_loop - throughput: 247.448 ± 0.305 M ops/s, latency: 4.041 ns/op
>
>              nr_loops: 500
>      bpf_loop - throughput: 260.839 ± 0.380 M ops/s, latency: 3.834 ns/op
>
>              nr_loops: 1000
>      bpf_loop - throughput: 262.806 ± 0.629 M ops/s, latency: 3.805 ns/op
>
>              nr_loops: 5000
>      bpf_loop - throughput: 264.211 ± 1.508 M ops/s, latency: 3.785 ns/op
>
>              nr_loops: 10000
>      bpf_loop - throughput: 265.366 ± 3.054 M ops/s, latency: 3.768 ns/op
>
>              nr_loops: 50000
>      bpf_loop - throughput: 235.986 ± 20.205 M ops/s, latency: 4.238 ns/op
>
>              nr_loops: 100000
>      bpf_loop - throughput: 264.482 ± 0.279 M ops/s, latency: 3.781 ns/op
>
>              nr_loops: 500000
>      bpf_loop - throughput: 309.773 ± 87.713 M ops/s, latency: 3.228 ns/op
>
>              nr_loops: 1000000
>      bpf_loop - throughput: 262.818 ± 4.143 M ops/s, latency: 3.805 ns/op
>
> The latency is about ~4ns / loop.
>
> I will update the commit message in v3 with these new numbers as well.

Right, awesome, thank you for the additional test. This is closer to
what I would expect: on the hardware I'm usually testing on, a function
call takes ~1.5ns, but the difference might just be the hardware, or
because these are indirect calls.

Another comparison just occurred to me (but it's totally OK if you don't
want to add any more benchmarks):

The difference between a program that does:

bpf_loop(nr_loops, empty_callback, NULL, 0);

and

for (i = 0; i < nr_loops; i++)
  empty_callback();


should show the difference between the indirect call in the helper and a
direct call from BPF (and show what the potential performance gain from
having the verifier inline the helper would be). This was more
interesting when there was a ~10x delta than a ~2x between your numbers
and mine, so also totally OK to leave this as-is, and we can cycle back
to such optimisations if it turns out to be necessary...

-Toke
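
A minimal sketch of what such a direct-call counterpart to the
bpf_loop_bench.c program added in this patch could look like (the program
name, and whether the loop verifies for large nr_loops, are assumptions on
my part, not something from this series):

    // SPDX-License-Identifier: GPL-2.0
    /* Hypothetical direct-call variant of bpf_loop_bench.c, sketched only
     * to illustrate the comparison suggested above; not part of this series.
     */
    #include "vmlinux.h"
    #include <bpf/bpf_helpers.h>

    char _license[] SEC("license") = "GPL";

    u32 nr_loops;
    long hits;

    /* noinline so the compiler keeps an actual call instead of folding
     * the empty body away.
     */
    static __attribute__((noinline)) int empty_callback(__u32 index, void *data)
    {
            return 0;
    }

    SEC("fentry/__x64_sys_getpgid")
    int benchmark_direct(void *ctx)
    {
            u32 i;

            /* Open-coded loop with direct calls; with a runtime-variable
             * bound this is limited by the verifier's instruction budget,
             * which is part of what bpf_loop() is meant to avoid.
             */
            for (i = 0; i < nr_loops; i++)
                    empty_callback(i, NULL);

            __sync_add_and_fetch(&hits, nr_loops);
            return 0;
    }

Comparing its numbers against the bpf_loop version would then show the
combined cost of the helper call and its indirect callback dispatch
relative to a direct call.
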
Andrii Nakryiko Nov. 24, 2021, 7:26 p.m. UTC | #4
On Wed, Nov 24, 2021 at 4:56 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>
> Joanne Koong <joannekoong@fb.com> writes:
>
> > On 11/23/21 11:19 AM, Toke Høiland-Jørgensen wrote:
> >
> >> Joanne Koong <joannekoong@fb.com> writes:
> >>
> >>> Add benchmark to measure the throughput and latency of the bpf_loop
> >>> call.
> >>>
> >>> Testing this on qemu on my dev machine on 1 thread, the data is
> >>> as follows:
> >>>
> >>>          nr_loops: 1
> >>> bpf_loop - throughput: 43.350 ± 0.864 M ops/s, latency: 23.068 ns/op
> >>>
> >>>          nr_loops: 10
> >>> bpf_loop - throughput: 69.586 ± 1.722 M ops/s, latency: 14.371 ns/op
> >>>
> >>>          nr_loops: 100
> >>> bpf_loop - throughput: 72.046 ± 1.352 M ops/s, latency: 13.880 ns/op
> >>>
> >>>          nr_loops: 500
> >>> bpf_loop - throughput: 71.677 ± 1.316 M ops/s, latency: 13.951 ns/op
> >>>
> >>>          nr_loops: 1000
> >>> bpf_loop - throughput: 69.435 ± 1.219 M ops/s, latency: 14.402 ns/op
> >>>
> >>>          nr_loops: 5000
> >>> bpf_loop - throughput: 72.624 ± 1.162 M ops/s, latency: 13.770 ns/op
> >>>
> >>>          nr_loops: 10000
> >>> bpf_loop - throughput: 75.417 ± 1.446 M ops/s, latency: 13.260 ns/op
> >>>
> >>>          nr_loops: 50000
> >>> bpf_loop - throughput: 77.400 ± 2.214 M ops/s, latency: 12.920 ns/op
> >>>
> >>>          nr_loops: 100000
> >>> bpf_loop - throughput: 78.636 ± 2.107 M ops/s, latency: 12.717 ns/op
> >>>
> >>>          nr_loops: 500000
> >>> bpf_loop - throughput: 76.909 ± 2.035 M ops/s, latency: 13.002 ns/op
> >>>
> >>>          nr_loops: 1000000
> >>> bpf_loop - throughput: 77.636 ± 1.748 M ops/s, latency: 12.881 ns/op
> >>>
> >>>  From this data, we can see that the latency per loop decreases as the
> >>> number of loops increases. On this particular machine, each loop had an
> >>> overhead of about ~13 ns, and we were able to run ~70 million loops
> >>> per second.
> >> The latency figures are great, thanks! I assume these numbers are with
> >> retpolines enabled? Otherwise 12ns seems a bit much... Or is this
> >> because of qemu?
> > I just tested it on a machine (without retpoline enabled) that runs on
> > actual
> > hardware and here is what I found:
> >
> >              nr_loops: 1
> >      bpf_loop - throughput: 46.780 ± 0.064 M ops/s, latency: 21.377 ns/op
> >
> >              nr_loops: 10
> >      bpf_loop - throughput: 198.519 ± 0.155 M ops/s, latency: 5.037 ns/op
> >
> >              nr_loops: 100
> >      bpf_loop - throughput: 247.448 ± 0.305 M ops/s, latency: 4.041 ns/op
> >
> >              nr_loops: 500
> >      bpf_loop - throughput: 260.839 ± 0.380 M ops/s, latency: 3.834 ns/op
> >
> >              nr_loops: 1000
> >      bpf_loop - throughput: 262.806 ± 0.629 M ops/s, latency: 3.805 ns/op
> >
> >              nr_loops: 5000
> >      bpf_loop - throughput: 264.211 ± 1.508 M ops/s, latency: 3.785 ns/op
> >
> >              nr_loops: 10000
> >      bpf_loop - throughput: 265.366 ± 3.054 M ops/s, latency: 3.768 ns/op
> >
> >              nr_loops: 50000
> >      bpf_loop - throughput: 235.986 ± 20.205 M ops/s, latency: 4.238 ns/op
> >
> >              nr_loops: 100000
> >      bpf_loop - throughput: 264.482 ± 0.279 M ops/s, latency: 3.781 ns/op
> >
> >              nr_loops: 500000
> >      bpf_loop - throughput: 309.773 ± 87.713 M ops/s, latency: 3.228 ns/op
> >
> >              nr_loops: 1000000
> >      bpf_loop - throughput: 262.818 ± 4.143 M ops/s, latency: 3.805 ns/op
> >
> > The latency is about ~4ns / loop.
> >
> > I will update the commit message in v3 with these new numbers as well.
>
> Right, awesome, thank you for the additional test. This is closer to
> what I would expect: on the hardware I'm usually testing on, a function
> call takes ~1.5ns, but the difference might just be the hardware, or
> because these are indirect calls.
>
> Another comparison just occurred to me (but it's totally OK if you don't
> want to add any more benchmarks):
>
> The difference between a program that does:
>
> bpf_loop(nr_loops, empty_callback, NULL, 0);
>
> and
>
> for (i = 0; i < nr_loops; i++)
>   empty_callback();

You are basically trying to measure the overhead of bpf_loop() helper
call itself, because other than that it should be identical. We can
estimate that already from the numbers Joanne posted above:

             nr_loops: 1
      bpf_loop - throughput: 46.780 ± 0.064 M ops/s, latency: 21.377 ns/op
             nr_loops: 1000
      bpf_loop - throughput: 262.806 ± 0.629 M ops/s, latency: 3.805 ns/op

nr_loops:1 is bpf_loop() overhead and one static callback call.
bpf_loop()'s own overhead will be in the ballpark of 21.4 - 3.8 =
17.6ns. I don't think we need yet another benchmark just for this.


>
> should show the difference between the indirect call in the helper and a
> direct call from BPF (and show what the potential performance gain from
> having the verifier inline the helper would be). This was more
> interesting when there was a ~10x delta than a ~2x between your numbers
> and mine, so also totally OK to leave this as-is, and we can cycle back
> to such optimisations if it turns out to be necessary...
>
> -Toke
>
Toke Høiland-Jørgensen Nov. 24, 2021, 9:59 p.m. UTC | #5
Andrii Nakryiko <andrii.nakryiko@gmail.com> writes:

> On Wed, Nov 24, 2021 at 4:56 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>>
>> Joanne Koong <joannekoong@fb.com> writes:
>>
>> > On 11/23/21 11:19 AM, Toke Høiland-Jørgensen wrote:
>> >
>> >> Joanne Koong <joannekoong@fb.com> writes:
>> >>
>> >>> Add benchmark to measure the throughput and latency of the bpf_loop
>> >>> call.
>> >>>
>> >>> Testing this on qemu on my dev machine on 1 thread, the data is
>> >>> as follows:
>> >>>
>> >>>          nr_loops: 1
>> >>> bpf_loop - throughput: 43.350 ± 0.864 M ops/s, latency: 23.068 ns/op
>> >>>
>> >>>          nr_loops: 10
>> >>> bpf_loop - throughput: 69.586 ± 1.722 M ops/s, latency: 14.371 ns/op
>> >>>
>> >>>          nr_loops: 100
>> >>> bpf_loop - throughput: 72.046 ± 1.352 M ops/s, latency: 13.880 ns/op
>> >>>
>> >>>          nr_loops: 500
>> >>> bpf_loop - throughput: 71.677 ± 1.316 M ops/s, latency: 13.951 ns/op
>> >>>
>> >>>          nr_loops: 1000
>> >>> bpf_loop - throughput: 69.435 ± 1.219 M ops/s, latency: 14.402 ns/op
>> >>>
>> >>>          nr_loops: 5000
>> >>> bpf_loop - throughput: 72.624 ± 1.162 M ops/s, latency: 13.770 ns/op
>> >>>
>> >>>          nr_loops: 10000
>> >>> bpf_loop - throughput: 75.417 ± 1.446 M ops/s, latency: 13.260 ns/op
>> >>>
>> >>>          nr_loops: 50000
>> >>> bpf_loop - throughput: 77.400 ± 2.214 M ops/s, latency: 12.920 ns/op
>> >>>
>> >>>          nr_loops: 100000
>> >>> bpf_loop - throughput: 78.636 ± 2.107 M ops/s, latency: 12.717 ns/op
>> >>>
>> >>>          nr_loops: 500000
>> >>> bpf_loop - throughput: 76.909 ± 2.035 M ops/s, latency: 13.002 ns/op
>> >>>
>> >>>          nr_loops: 1000000
>> >>> bpf_loop - throughput: 77.636 ± 1.748 M ops/s, latency: 12.881 ns/op
>> >>>
>> >>>  From this data, we can see that the latency per loop decreases as the
>> >>> number of loops increases. On this particular machine, each loop had an
>> >>> overhead of about ~13 ns, and we were able to run ~70 million loops
>> >>> per second.
>> >> The latency figures are great, thanks! I assume these numbers are with
>> >> retpolines enabled? Otherwise 12ns seems a bit much... Or is this
>> >> because of qemu?
>> > I just tested it on a machine (without retpoline enabled) that runs on
>> > actual
>> > hardware and here is what I found:
>> >
>> >              nr_loops: 1
>> >      bpf_loop - throughput: 46.780 ± 0.064 M ops/s, latency: 21.377 ns/op
>> >
>> >              nr_loops: 10
>> >      bpf_loop - throughput: 198.519 ± 0.155 M ops/s, latency: 5.037 ns/op
>> >
>> >              nr_loops: 100
>> >      bpf_loop - throughput: 247.448 ± 0.305 M ops/s, latency: 4.041 ns/op
>> >
>> >              nr_loops: 500
>> >      bpf_loop - throughput: 260.839 ± 0.380 M ops/s, latency: 3.834 ns/op
>> >
>> >              nr_loops: 1000
>> >      bpf_loop - throughput: 262.806 ± 0.629 M ops/s, latency: 3.805 ns/op
>> >
>> >              nr_loops: 5000
>> >      bpf_loop - throughput: 264.211 ± 1.508 M ops/s, latency: 3.785 ns/op
>> >
>> >              nr_loops: 10000
>> >      bpf_loop - throughput: 265.366 ± 3.054 M ops/s, latency: 3.768 ns/op
>> >
>> >              nr_loops: 50000
>> >      bpf_loop - throughput: 235.986 ± 20.205 M ops/s, latency: 4.238 ns/op
>> >
>> >              nr_loops: 100000
>> >      bpf_loop - throughput: 264.482 ± 0.279 M ops/s, latency: 3.781 ns/op
>> >
>> >              nr_loops: 500000
>> >      bpf_loop - throughput: 309.773 ± 87.713 M ops/s, latency: 3.228 ns/op
>> >
>> >              nr_loops: 1000000
>> >      bpf_loop - throughput: 262.818 ± 4.143 M ops/s, latency: 3.805 ns/op
>> >
>> > The latency is about ~4ns / loop.
>> >
>> > I will update the commit message in v3 with these new numbers as well.
>>
>> Right, awesome, thank you for the additional test. This is closer to
>> what I would expect: on the hardware I'm usually testing on, a function
>> call takes ~1.5ns, but the difference might just be the hardware, or
>> because these are indirect calls.
>>
>> Another comparison just occurred to me (but it's totally OK if you don't
>> want to add any more benchmarks):
>>
>> The difference between a program that does:
>>
>> bpf_loop(nr_loops, empty_callback, NULL, 0);
>>
>> and
>>
>> for (i = 0; i < nr_loops; i++)
>>   empty_callback();
>
> You are basically trying to measure the overhead of bpf_loop() helper
> call itself, because other than that it should be identical.

No, I'm trying to measure the difference between the indirect call in
the helper, and the direct call from the BPF program. Should be minor
without retpolines, and somewhat higher where they are enabled...

> We can estimate that already from the numbers Joanne posted above:
>
>              nr_loops: 1
>       bpf_loop - throughput: 46.780 ± 0.064 M ops/s, latency: 21.377 ns/op
>              nr_loops: 1000
>       bpf_loop - throughput: 262.806 ± 0.629 M ops/s, latency: 3.805 ns/op
>
> nr_loops:1 is bpf_loop() overhead and one static callback call.
> bpf_loop()'s own overhead will be in the ballpark of 21.4 - 3.8 =
> 17.6ns. I don't think we need yet another benchmark just for this.

That seems really high, though? The helper is a pretty simple function,
and the call to it should just be JIT'ed into a single regular function
call, right? So why the order-of-magnitude difference?

-Toke
Joanne Koong Nov. 25, 2021, 12:04 a.m. UTC | #6
On 11/24/21 1:59 PM, Toke Høiland-Jørgensen wrote:

> Andrii Nakryiko <andrii.nakryiko@gmail.com> writes:
>
>> On Wed, Nov 24, 2021 at 4:56 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>>> Joanne Koong <joannekoong@fb.com> writes:
>>>
>>>> On 11/23/21 11:19 AM, Toke Høiland-Jørgensen wrote:
>>>>
>>>>> Joanne Koong <joannekoong@fb.com> writes:
>>>>>
>>>>>> Add benchmark to measure the throughput and latency of the bpf_loop
>>>>>> call.
>>>>>>
>>>>>> Testing this on qemu on my dev machine on 1 thread, the data is
>>>>>> as follows:
>>>>>>
>>>>>>           nr_loops: 1
>>>>>> bpf_loop - throughput: 43.350 ± 0.864 M ops/s, latency: 23.068 ns/op
>>>>>>
>>>>>>           nr_loops: 10
>>>>>> bpf_loop - throughput: 69.586 ± 1.722 M ops/s, latency: 14.371 ns/op
>>>>>>
>>>>>>           nr_loops: 100
>>>>>> bpf_loop - throughput: 72.046 ± 1.352 M ops/s, latency: 13.880 ns/op
>>>>>>
>>>>>>           nr_loops: 500
>>>>>> bpf_loop - throughput: 71.677 ± 1.316 M ops/s, latency: 13.951 ns/op
>>>>>>
>>>>>>           nr_loops: 1000
>>>>>> bpf_loop - throughput: 69.435 ± 1.219 M ops/s, latency: 14.402 ns/op
>>>>>>
>>>>>>           nr_loops: 5000
>>>>>> bpf_loop - throughput: 72.624 ± 1.162 M ops/s, latency: 13.770 ns/op
>>>>>>
>>>>>>           nr_loops: 10000
>>>>>> bpf_loop - throughput: 75.417 ± 1.446 M ops/s, latency: 13.260 ns/op
>>>>>>
>>>>>>           nr_loops: 50000
>>>>>> bpf_loop - throughput: 77.400 ± 2.214 M ops/s, latency: 12.920 ns/op
>>>>>>
>>>>>>           nr_loops: 100000
>>>>>> bpf_loop - throughput: 78.636 ± 2.107 M ops/s, latency: 12.717 ns/op
>>>>>>
>>>>>>           nr_loops: 500000
>>>>>> bpf_loop - throughput: 76.909 ± 2.035 M ops/s, latency: 13.002 ns/op
>>>>>>
>>>>>>           nr_loops: 1000000
>>>>>> bpf_loop - throughput: 77.636 ± 1.748 M ops/s, latency: 12.881 ns/op
>>>>>>
>>>>>>   From this data, we can see that the latency per loop decreases as the
>>>>>> number of loops increases. On this particular machine, each loop had an
>>>>>> overhead of about ~13 ns, and we were able to run ~70 million loops
>>>>>> per second.
>>>>> The latency figures are great, thanks! I assume these numbers are with
>>>>> retpolines enabled? Otherwise 12ns seems a bit much... Or is this
>>>>> because of qemu?
>>>> I just tested it on a machine (without retpoline enabled) that runs on
>>>> actual
>>>> hardware and here is what I found:
>>>>
>>>>               nr_loops: 1
>>>>       bpf_loop - throughput: 46.780 ± 0.064 M ops/s, latency: 21.377 ns/op
>>>>
>>>>               nr_loops: 10
>>>>       bpf_loop - throughput: 198.519 ± 0.155 M ops/s, latency: 5.037 ns/op
>>>>
>>>>               nr_loops: 100
>>>>       bpf_loop - throughput: 247.448 ± 0.305 M ops/s, latency: 4.041 ns/op
>>>>
>>>>               nr_loops: 500
>>>>       bpf_loop - throughput: 260.839 ± 0.380 M ops/s, latency: 3.834 ns/op
>>>>
>>>>               nr_loops: 1000
>>>>       bpf_loop - throughput: 262.806 ± 0.629 M ops/s, latency: 3.805 ns/op
>>>>
>>>>               nr_loops: 5000
>>>>       bpf_loop - throughput: 264.211 ± 1.508 M ops/s, latency: 3.785 ns/op
>>>>
>>>>               nr_loops: 10000
>>>>       bpf_loop - throughput: 265.366 ± 3.054 M ops/s, latency: 3.768 ns/op
>>>>
>>>>               nr_loops: 50000
>>>>       bpf_loop - throughput: 235.986 ± 20.205 M ops/s, latency: 4.238 ns/op
>>>>
>>>>               nr_loops: 100000
>>>>       bpf_loop - throughput: 264.482 ± 0.279 M ops/s, latency: 3.781 ns/op
>>>>
>>>>               nr_loops: 500000
>>>>       bpf_loop - throughput: 309.773 ± 87.713 M ops/s, latency: 3.228 ns/op
>>>>
>>>>               nr_loops: 1000000
>>>>       bpf_loop - throughput: 262.818 ± 4.143 M ops/s, latency: 3.805 ns/op
>>>>
>>>> The latency is about ~4ns / loop.
>>>>
>>>> I will update the commit message in v3 with these new numbers as well.
>>> Right, awesome, thank you for the additional test. This is closer to
>>> what I would expect: on the hardware I'm usually testing on, a function
>>> call takes ~1.5ns, but the difference might just be the hardware, or
>>> because these are indirect calls.
>>>
>>> Another comparison just occurred to me (but it's totally OK if you don't
>>> want to add any more benchmarks):
>>>
>>> The difference between a program that does:
>>>
>>> bpf_loop(nr_loops, empty_callback, NULL, 0);
>>>
>>> and
>>>
>>> for (i = 0; i < nr_loops; i++)
>>>    empty_callback();
>> You are basically trying to measure the overhead of bpf_loop() helper
>> call itself, because other than that it should be identical.
> No, I'm trying to measure the difference between the indirect call in
> the helper, and the direct call from the BPF program. Should be minor
> without retpolines, and somewhat higher where they are enabled...
>
>> We can estimate that already from the numbers Joanne posted above:
>>
>>               nr_loops: 1
>>        bpf_loop - throughput: 46.780 ± 0.064 M ops/s, latency: 21.377 ns/op
>>               nr_loops: 1000
>>        bpf_loop - throughput: 262.806 ± 0.629 M ops/s, latency: 3.805 ns/op
>>
>> nr_loops:1 is bpf_loop() overhead and one static callback call.
>> bpf_loop()'s own overhead will be in the ballpark of 21.4 - 3.8 =
>> 17.6ns. I don't think we need yet another benchmark just for this.
> That seems really high, though? The helper is a pretty simple function,
> and the call to it should just be JIT'ed into a single regular function
> call, right? So why the order-of-magnitude difference?
I think the overhead of triggering the bpf program from the userspace
benchmarking program is also contributing to this. When nr_loops = 1, we
have to do the context switch between userspace and the kernel once per
1000 loops; this overhead also contributes to the latency numbers above.
> -Toke
>
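
To spell out the arithmetic behind that: the benchmark program issues 1000
bpf_loop() calls per syscall trigger and counts nr_loops hits per call, so
the reported per-op latency decomposes roughly as (the T_* terms are just
shorthand, not anything from the patch):

    latency(nr_loops) ≈ T_trigger / (1000 * nr_loops)
                        + T_bpf_loop / nr_loops
                        + T_callback

For nr_loops = 1, the amortized trigger cost and the full helper-call cost
land on a single iteration, while for large nr_loops the figure converges
on the per-iteration cost alone.
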
Toke Høiland-Jørgensen Nov. 25, 2021, 11:35 a.m. UTC | #7
Joanne Koong <joannekoong@fb.com> writes:

> On 11/24/21 1:59 PM, Toke Høiland-Jørgensen wrote:
>
>> Andrii Nakryiko <andrii.nakryiko@gmail.com> writes:
>>
>>> On Wed, Nov 24, 2021 at 4:56 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>>>> Joanne Koong <joannekoong@fb.com> writes:
>>>>
>>>>> On 11/23/21 11:19 AM, Toke Høiland-Jørgensen wrote:
>>>>>
>>>>>> Joanne Koong <joannekoong@fb.com> writes:
>>>>>>
>>>>>>> Add benchmark to measure the throughput and latency of the bpf_loop
>>>>>>> call.
>>>>>>>
>>>>>>> Testing this on qemu on my dev machine on 1 thread, the data is
>>>>>>> as follows:
>>>>>>>
>>>>>>>           nr_loops: 1
>>>>>>> bpf_loop - throughput: 43.350 ± 0.864 M ops/s, latency: 23.068 ns/op
>>>>>>>
>>>>>>>           nr_loops: 10
>>>>>>> bpf_loop - throughput: 69.586 ± 1.722 M ops/s, latency: 14.371 ns/op
>>>>>>>
>>>>>>>           nr_loops: 100
>>>>>>> bpf_loop - throughput: 72.046 ± 1.352 M ops/s, latency: 13.880 ns/op
>>>>>>>
>>>>>>>           nr_loops: 500
>>>>>>> bpf_loop - throughput: 71.677 ± 1.316 M ops/s, latency: 13.951 ns/op
>>>>>>>
>>>>>>>           nr_loops: 1000
>>>>>>> bpf_loop - throughput: 69.435 ± 1.219 M ops/s, latency: 14.402 ns/op
>>>>>>>
>>>>>>>           nr_loops: 5000
>>>>>>> bpf_loop - throughput: 72.624 ± 1.162 M ops/s, latency: 13.770 ns/op
>>>>>>>
>>>>>>>           nr_loops: 10000
>>>>>>> bpf_loop - throughput: 75.417 ± 1.446 M ops/s, latency: 13.260 ns/op
>>>>>>>
>>>>>>>           nr_loops: 50000
>>>>>>> bpf_loop - throughput: 77.400 ± 2.214 M ops/s, latency: 12.920 ns/op
>>>>>>>
>>>>>>>           nr_loops: 100000
>>>>>>> bpf_loop - throughput: 78.636 ± 2.107 M ops/s, latency: 12.717 ns/op
>>>>>>>
>>>>>>>           nr_loops: 500000
>>>>>>> bpf_loop - throughput: 76.909 ± 2.035 M ops/s, latency: 13.002 ns/op
>>>>>>>
>>>>>>>           nr_loops: 1000000
>>>>>>> bpf_loop - throughput: 77.636 ± 1.748 M ops/s, latency: 12.881 ns/op
>>>>>>>
>>>>>>>   From this data, we can see that the latency per loop decreases as the
>>>>>>> number of loops increases. On this particular machine, each loop had an
>>>>>>> overhead of about ~13 ns, and we were able to run ~70 million loops
>>>>>>> per second.
>>>>>> The latency figures are great, thanks! I assume these numbers are with
>>>>>> retpolines enabled? Otherwise 12ns seems a bit much... Or is this
>>>>>> because of qemu?
>>>>> I just tested it on a machine (without retpoline enabled) that runs on
>>>>> actual
>>>>> hardware and here is what I found:
>>>>>
>>>>>               nr_loops: 1
>>>>>       bpf_loop - throughput: 46.780 ± 0.064 M ops/s, latency: 21.377 ns/op
>>>>>
>>>>>               nr_loops: 10
>>>>>       bpf_loop - throughput: 198.519 ± 0.155 M ops/s, latency: 5.037 ns/op
>>>>>
>>>>>               nr_loops: 100
>>>>>       bpf_loop - throughput: 247.448 ± 0.305 M ops/s, latency: 4.041 ns/op
>>>>>
>>>>>               nr_loops: 500
>>>>>       bpf_loop - throughput: 260.839 ± 0.380 M ops/s, latency: 3.834 ns/op
>>>>>
>>>>>               nr_loops: 1000
>>>>>       bpf_loop - throughput: 262.806 ± 0.629 M ops/s, latency: 3.805 ns/op
>>>>>
>>>>>               nr_loops: 5000
>>>>>       bpf_loop - throughput: 264.211 ± 1.508 M ops/s, latency: 3.785 ns/op
>>>>>
>>>>>               nr_loops: 10000
>>>>>       bpf_loop - throughput: 265.366 ± 3.054 M ops/s, latency: 3.768 ns/op
>>>>>
>>>>>               nr_loops: 50000
>>>>>       bpf_loop - throughput: 235.986 ± 20.205 M ops/s, latency: 4.238 ns/op
>>>>>
>>>>>               nr_loops: 100000
>>>>>       bpf_loop - throughput: 264.482 ± 0.279 M ops/s, latency: 3.781 ns/op
>>>>>
>>>>>               nr_loops: 500000
>>>>>       bpf_loop - throughput: 309.773 ± 87.713 M ops/s, latency: 3.228 ns/op
>>>>>
>>>>>               nr_loops: 1000000
>>>>>       bpf_loop - throughput: 262.818 ± 4.143 M ops/s, latency: 3.805 ns/op
>>>>>
>>>>> The latency is about ~4ns / loop.
>>>>>
>>>>> I will update the commit message in v3 with these new numbers as well.
>>>> Right, awesome, thank you for the additional test. This is closer to
>>>> what I would expect: on the hardware I'm usually testing on, a function
>>>> call takes ~1.5ns, but the difference might just be the hardware, or
>>>> because these are indirect calls.
>>>>
>>>> Another comparison just occurred to me (but it's totally OK if you don't
>>>> want to add any more benchmarks):
>>>>
>>>> The difference between a program that does:
>>>>
>>>> bpf_loop(nr_loops, empty_callback, NULL, 0);
>>>>
>>>> and
>>>>
>>>> for (i = 0; i < nr_loops; i++)
>>>>    empty_callback();
>>> You are basically trying to measure the overhead of bpf_loop() helper
>>> call itself, because other than that it should be identical.
>> No, I'm trying to measure the difference between the indirect call in
>> the helper, and the direct call from the BPF program. Should be minor
>> without retpolines, and somewhat higher where they are enabled...
>>
>>> We can estimate that already from the numbers Joanne posted above:
>>>
>>>               nr_loops: 1
>>>        bpf_loop - throughput: 46.780 ± 0.064 M ops/s, latency: 21.377 ns/op
>>>               nr_loops: 1000
>>>        bpf_loop - throughput: 262.806 ± 0.629 M ops/s, latency: 3.805 ns/op
>>>
>>> nr_loops:1 is bpf_loop() overhead and one static callback call.
>>> bpf_loop()'s own overhead will be in the ballpark of 21.4 - 3.8 =
>>> 17.6ns. I don't think we need yet another benchmark just for this.
>> That seems really high, though? The helper is a pretty simple function,
>> and the call to it should just be JIT'ed into a single regular function
>> call, right? So why the order-of-magnitude difference?
> I think the overhead of triggering the bpf program from the userspace
> benchmarking program is also contributing to this. When nr_loops = 1, we
> have to do the context switch between userspace + kernel per every 1000 
> loops;
> this overhead also contributes to the latency numbers above

Right okay. But then that data point is not really measuring what it's
purporting to measure? That's a bit misleading, so maybe better to leave
it out entirely?

-Toke
Joanne Koong Nov. 29, 2021, 7:41 p.m. UTC | #8
On 11/25/21 3:35 AM, Toke Høiland-Jørgensen wrote:

> Joanne Koong <joannekoong@fb.com> writes:
>
>> On 11/24/21 1:59 PM, Toke Høiland-Jørgensen wrote:
>>
>>> Andrii Nakryiko <andrii.nakryiko@gmail.com> writes:
>>>
>>>> On Wed, Nov 24, 2021 at 4:56 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>>>>> Joanne Koong <joannekoong@fb.com> writes:
>>>>>
>>>>>> On 11/23/21 11:19 AM, Toke Høiland-Jørgensen wrote:
>>>>>>
>>>>>>> Joanne Koong <joannekoong@fb.com> writes:
>>>>>>>
>>>>>>>> Add benchmark to measure the throughput and latency of the bpf_loop
>>>>>>>> call.
>>>>>>>>
>>>>>>>> Testing this on qemu on my dev machine on 1 thread, the data is
>>>>>>>> as follows:
>>>>>>>>
>>>>>>>>            nr_loops: 1
>>>>>>>> bpf_loop - throughput: 43.350 ± 0.864 M ops/s, latency: 23.068 ns/op
>>>>>>>>
>>>>>>>>            nr_loops: 10
>>>>>>>> bpf_loop - throughput: 69.586 ± 1.722 M ops/s, latency: 14.371 ns/op
>>>>>>>>
>>>>>>>>            nr_loops: 100
>>>>>>>> bpf_loop - throughput: 72.046 ± 1.352 M ops/s, latency: 13.880 ns/op
>>>>>>>>
>>>>>>>>            nr_loops: 500
>>>>>>>> bpf_loop - throughput: 71.677 ± 1.316 M ops/s, latency: 13.951 ns/op
>>>>>>>>
>>>>>>>>            nr_loops: 1000
>>>>>>>> bpf_loop - throughput: 69.435 ± 1.219 M ops/s, latency: 14.402 ns/op
>>>>>>>>
>>>>>>>>            nr_loops: 5000
>>>>>>>> bpf_loop - throughput: 72.624 ± 1.162 M ops/s, latency: 13.770 ns/op
>>>>>>>>
>>>>>>>>            nr_loops: 10000
>>>>>>>> bpf_loop - throughput: 75.417 ± 1.446 M ops/s, latency: 13.260 ns/op
>>>>>>>>
>>>>>>>>            nr_loops: 50000
>>>>>>>> bpf_loop - throughput: 77.400 ± 2.214 M ops/s, latency: 12.920 ns/op
>>>>>>>>
>>>>>>>>            nr_loops: 100000
>>>>>>>> bpf_loop - throughput: 78.636 ± 2.107 M ops/s, latency: 12.717 ns/op
>>>>>>>>
>>>>>>>>            nr_loops: 500000
>>>>>>>> bpf_loop - throughput: 76.909 ± 2.035 M ops/s, latency: 13.002 ns/op
>>>>>>>>
>>>>>>>>            nr_loops: 1000000
>>>>>>>> bpf_loop - throughput: 77.636 ± 1.748 M ops/s, latency: 12.881 ns/op
>>>>>>>>
>>>>>>>>    From this data, we can see that the latency per loop decreases as the
>>>>>>>> number of loops increases. On this particular machine, each loop had an
>>>>>>>> overhead of about ~13 ns, and we were able to run ~70 million loops
>>>>>>>> per second.
>>>>>>> The latency figures are great, thanks! I assume these numbers are with
>>>>>>> retpolines enabled? Otherwise 12ns seems a bit much... Or is this
>>>>>>> because of qemu?
>>>>>> I just tested it on a machine (without retpoline enabled) that runs on
>>>>>> actual
>>>>>> hardware and here is what I found:
>>>>>>
>>>>>>                nr_loops: 1
>>>>>>        bpf_loop - throughput: 46.780 ± 0.064 M ops/s, latency: 21.377 ns/op
>>>>>>
>>>>>>                nr_loops: 10
>>>>>>        bpf_loop - throughput: 198.519 ± 0.155 M ops/s, latency: 5.037 ns/op
>>>>>>
>>>>>>                nr_loops: 100
>>>>>>        bpf_loop - throughput: 247.448 ± 0.305 M ops/s, latency: 4.041 ns/op
>>>>>>
>>>>>>                nr_loops: 500
>>>>>>        bpf_loop - throughput: 260.839 ± 0.380 M ops/s, latency: 3.834 ns/op
>>>>>>
>>>>>>                nr_loops: 1000
>>>>>>        bpf_loop - throughput: 262.806 ± 0.629 M ops/s, latency: 3.805 ns/op
>>>>>>
>>>>>>                nr_loops: 5000
>>>>>>        bpf_loop - throughput: 264.211 ± 1.508 M ops/s, latency: 3.785 ns/op
>>>>>>
>>>>>>                nr_loops: 10000
>>>>>>        bpf_loop - throughput: 265.366 ± 3.054 M ops/s, latency: 3.768 ns/op
>>>>>>
>>>>>>                nr_loops: 50000
>>>>>>        bpf_loop - throughput: 235.986 ± 20.205 M ops/s, latency: 4.238 ns/op
>>>>>>
>>>>>>                nr_loops: 100000
>>>>>>        bpf_loop - throughput: 264.482 ± 0.279 M ops/s, latency: 3.781 ns/op
>>>>>>
>>>>>>                nr_loops: 500000
>>>>>>        bpf_loop - throughput: 309.773 ± 87.713 M ops/s, latency: 3.228 ns/op
>>>>>>
>>>>>>                nr_loops: 1000000
>>>>>>        bpf_loop - throughput: 262.818 ± 4.143 M ops/s, latency: 3.805 ns/op
>>>>>>
>>>>>> The latency is about ~4ns / loop.
>>>>>>
>>>>>> I will update the commit message in v3 with these new numbers as well.
>>>>> Right, awesome, thank you for the additional test. This is closer to
>>>>> what I would expect: on the hardware I'm usually testing on, a function
>>>>> call takes ~1.5ns, but the difference might just be the hardware, or
>>>>> because these are indirect calls.
>>>>>
>>>>> Another comparison just occurred to me (but it's totally OK if you don't
>>>>> want to add any more benchmarks):
>>>>>
>>>>> The difference between a program that does:
>>>>>
>>>>> bpf_loop(nr_loops, empty_callback, NULL, 0);
>>>>>
>>>>> and
>>>>>
>>>>> for (i = 0; i < nr_loops; i++)
>>>>>     empty_callback();
>>>> You are basically trying to measure the overhead of bpf_loop() helper
>>>> call itself, because other than that it should be identical.
>>> No, I'm trying to measure the difference between the indirect call in
>>> the helper, and the direct call from the BPF program. Should be minor
>>> without retpolines, and somewhat higher where they are enabled...
>>>
>>>> We can estimate that already from the numbers Joanne posted above:
>>>>
>>>>                nr_loops: 1
>>>>         bpf_loop - throughput: 46.780 ± 0.064 M ops/s, latency: 21.377 ns/op
>>>>                nr_loops: 1000
>>>>         bpf_loop - throughput: 262.806 ± 0.629 M ops/s, latency: 3.805 ns/op
>>>>
>>>> nr_loops:1 is bpf_loop() overhead and one static callback call.
>>>> bpf_loop()'s own overhead will be in the ballpark of 21.4 - 3.8 =
>>>> 17.6ns. I don't think we need yet another benchmark just for this.
>>> That seems really high, though? The helper is a pretty simple function,
>>> and the call to it should just be JIT'ed into a single regular function
>>> call, right? So why the order-of-magnitude difference?
>> I think the overhead of triggering the bpf program from the userspace
>> benchmarking program is also contributing to this. When nr_loops = 1, we
>> have to do the context switch between userspace + kernel per every 1000
>> loops;
>> this overhead also contributes to the latency numbers above
> Right okay. But then that data point is not really measuring what it's
> purporting to measure? That's a bit misleading, so maybe better to leave
> it out entirely?
Sure, I will leave this nr_loops = 1 datapoint out in v3 of this patchset :)
The overhead of triggering the bpf program from the userspace benchmarks
is present in every datapoint, but for nr_loops = 1 it's especially
emphasized, since this overhead is incurred once per 1000 loops, whereas
for the other datapoints it is incurred once per 1000 * nr_loops loops.
> -Toke
>
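
Plugging the bare-metal numbers into that picture: at nr_loops = 1000000
the ~3.8 ns/op is essentially the pure per-iteration cost, so the
21.4 ns/op seen at nr_loops = 1 leaves roughly 21.4 - 3.8 = 17.6 ns that
mixes the bpf_loop() call itself with the trigger overhead amortized over
only 1000 loops; the nr_loops = 1 point therefore overstates the helper's
own cost, which is why it is being dropped in v3.
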
diff mbox series

Patch

diff --git a/tools/testing/selftests/bpf/Makefile b/tools/testing/selftests/bpf/Makefile
index 35684d61aaeb..a6c0e92c86a1 100644
--- a/tools/testing/selftests/bpf/Makefile
+++ b/tools/testing/selftests/bpf/Makefile
@@ -531,6 +531,7 @@  $(OUTPUT)/bench_trigger.o: $(OUTPUT)/trigger_bench.skel.h
 $(OUTPUT)/bench_ringbufs.o: $(OUTPUT)/ringbuf_bench.skel.h \
 			    $(OUTPUT)/perfbuf_bench.skel.h
 $(OUTPUT)/bench_bloom_filter_map.o: $(OUTPUT)/bloom_filter_bench.skel.h
+$(OUTPUT)/bench_bpf_loop.o: $(OUTPUT)/bpf_loop_bench.skel.h
 $(OUTPUT)/bench.o: bench.h testing_helpers.h $(BPFOBJ)
 $(OUTPUT)/bench: LDLIBS += -lm
 $(OUTPUT)/bench: $(OUTPUT)/bench.o \
@@ -540,7 +541,8 @@  $(OUTPUT)/bench: $(OUTPUT)/bench.o \
 		 $(OUTPUT)/bench_rename.o \
 		 $(OUTPUT)/bench_trigger.o \
 		 $(OUTPUT)/bench_ringbufs.o \
-		 $(OUTPUT)/bench_bloom_filter_map.o
+		 $(OUTPUT)/bench_bloom_filter_map.o \
+		 $(OUTPUT)/bench_bpf_loop.o
 	$(call msg,BINARY,,$@)
 	$(Q)$(CC) $(LDFLAGS) $(filter %.a %.o,$^) $(LDLIBS) -o $@
 
diff --git a/tools/testing/selftests/bpf/bench.c b/tools/testing/selftests/bpf/bench.c
index c75e7ee28746..bfd969e0424f 100644
--- a/tools/testing/selftests/bpf/bench.c
+++ b/tools/testing/selftests/bpf/bench.c
@@ -134,6 +134,28 @@  void hits_drops_report_final(struct bench_res res[], int res_cnt)
 	       total_ops_mean, total_ops_stddev);
 }
 
+void ops_report_final(struct bench_res res[], int res_cnt)
+{
+	double hits_mean = 0.0, hits_stddev = 0.0;
+	int i;
+
+	for (i = 0; i < res_cnt; i++)
+		hits_mean += res[i].hits / 1000000.0 / (0.0 + res_cnt);
+
+	if (res_cnt > 1)  {
+		for (i = 0; i < res_cnt; i++)
+			hits_stddev += (hits_mean - res[i].hits / 1000000.0) *
+				       (hits_mean - res[i].hits / 1000000.0) /
+				       (res_cnt - 1.0);
+
+		hits_stddev = sqrt(hits_stddev);
+	}
+	printf("Summary: throughput %8.3lf \u00B1 %5.3lf M ops/s (%7.3lfM ops/prod), ",
+	       hits_mean, hits_stddev, hits_mean / env.producer_cnt);
+	printf("latency %8.3lf ns/op (%7.3lf ns/op /prod)\n",
+	       1000.0 / hits_mean, 1000.0 / hits_mean / env.producer_cnt);
+}
+
 const char *argp_program_version = "benchmark";
 const char *argp_program_bug_address = "<bpf@vger.kernel.org>";
 const char argp_program_doc[] =
@@ -171,10 +193,12 @@  static const struct argp_option opts[] = {
 
 extern struct argp bench_ringbufs_argp;
 extern struct argp bench_bloom_map_argp;
+extern struct argp bench_bpf_loop_argp;
 
 static const struct argp_child bench_parsers[] = {
 	{ &bench_ringbufs_argp, 0, "Ring buffers benchmark", 0 },
 	{ &bench_bloom_map_argp, 0, "Bloom filter map benchmark", 0 },
+	{ &bench_bpf_loop_argp, 0, "bpf_loop helper benchmark", 0 },
 	{},
 };
 
@@ -373,6 +397,7 @@  extern const struct bench bench_bloom_update;
 extern const struct bench bench_bloom_false_positive;
 extern const struct bench bench_hashmap_without_bloom;
 extern const struct bench bench_hashmap_with_bloom;
+extern const struct bench bench_bpf_loop;
 
 static const struct bench *benchs[] = {
 	&bench_count_global,
@@ -404,6 +429,7 @@  static const struct bench *benchs[] = {
 	&bench_bloom_false_positive,
 	&bench_hashmap_without_bloom,
 	&bench_hashmap_with_bloom,
+	&bench_bpf_loop,
 };
 
 static void setup_benchmark()
diff --git a/tools/testing/selftests/bpf/bench.h b/tools/testing/selftests/bpf/bench.h
index 624c6b11501f..f0895de9aad0 100644
--- a/tools/testing/selftests/bpf/bench.h
+++ b/tools/testing/selftests/bpf/bench.h
@@ -59,6 +59,7 @@  void hits_drops_report_progress(int iter, struct bench_res *res, long delta_ns);
 void hits_drops_report_final(struct bench_res res[], int res_cnt);
 void false_hits_report_progress(int iter, struct bench_res *res, long delta_ns);
 void false_hits_report_final(struct bench_res res[], int res_cnt);
+void ops_report_final(struct bench_res res[], int res_cnt);
 
 static inline __u64 get_time_ns() {
 	struct timespec t;
diff --git a/tools/testing/selftests/bpf/benchs/bench_bpf_loop.c b/tools/testing/selftests/bpf/benchs/bench_bpf_loop.c
new file mode 100644
index 000000000000..809a51d7be66
--- /dev/null
+++ b/tools/testing/selftests/bpf/benchs/bench_bpf_loop.c
@@ -0,0 +1,105 @@ 
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2021 Facebook */
+
+#include <argp.h>
+#include "bench.h"
+#include "bpf_loop_bench.skel.h"
+
+/* BPF triggering benchmarks */
+static struct ctx {
+	struct bpf_loop_bench *skel;
+} ctx;
+
+static struct {
+	__u32 nr_loops;
+} args = {
+	.nr_loops = 10,
+};
+
+enum {
+	ARG_NR_LOOPS = 4000,
+};
+
+static const struct argp_option opts[] = {
+	{ "nr_loops", ARG_NR_LOOPS, "nr_loops", 0,
+		"Set number of loops for the bpf_loop helper"},
+	{},
+};
+
+static error_t parse_arg(int key, char *arg, struct argp_state *state)
+{
+	switch (key) {
+	case ARG_NR_LOOPS:
+		args.nr_loops = strtol(arg, NULL, 10);
+		break;
+	default:
+		return ARGP_ERR_UNKNOWN;
+	}
+
+	return 0;
+}
+
+/* exported into benchmark runner */
+const struct argp bench_bpf_loop_argp = {
+	.options = opts,
+	.parser = parse_arg,
+};
+
+static void validate(void)
+{
+	if (env.consumer_cnt != 1) {
+		fprintf(stderr, "benchmark doesn't support multi-consumer!\n");
+		exit(1);
+	}
+}
+
+static void *producer(void *input)
+{
+	while (true)
+		/* trigger the bpf program */
+		syscall(__NR_getpgid);
+
+	return NULL;
+}
+
+static void *consumer(void *input)
+{
+	return NULL;
+}
+
+static void measure(struct bench_res *res)
+{
+	res->hits = atomic_swap(&ctx.skel->bss->hits, 0);
+}
+
+static void setup(void)
+{
+	struct bpf_link *link;
+
+	setup_libbpf();
+
+	ctx.skel = bpf_loop_bench__open_and_load();
+	if (!ctx.skel) {
+		fprintf(stderr, "failed to open skeleton\n");
+		exit(1);
+	}
+
+	link = bpf_program__attach(ctx.skel->progs.benchmark);
+	if (!link) {
+		fprintf(stderr, "failed to attach program!\n");
+		exit(1);
+	}
+
+	ctx.skel->bss->nr_loops = args.nr_loops;
+}
+
+const struct bench bench_bpf_loop = {
+	.name = "bpf-loop",
+	.validate = validate,
+	.setup = setup,
+	.producer_thread = producer,
+	.consumer_thread = consumer,
+	.measure = measure,
+	.report_progress = hits_drops_report_progress,
+	.report_final = ops_report_final,
+};
diff --git a/tools/testing/selftests/bpf/benchs/run_bench_bpf_loop.sh b/tools/testing/selftests/bpf/benchs/run_bench_bpf_loop.sh
new file mode 100755
index 000000000000..ff740e80ba84
--- /dev/null
+++ b/tools/testing/selftests/bpf/benchs/run_bench_bpf_loop.sh
@@ -0,0 +1,15 @@ 
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+
+source ./benchs/run_common.sh
+
+set -eufo pipefail
+
+for t in 1 4 8 12 16; do
+for i in 1 10 100 500 1000 5000 10000 50000 100000 500000 1000000; do
+subtitle "nr_loops: $i, nr_threads: $t"
+	summarize_ops "bpf_loop: " \
+	    "$($RUN_BENCH -p $t --nr_loops $i bpf-loop)"
+	printf "\n"
+done
+done
diff --git a/tools/testing/selftests/bpf/benchs/run_common.sh b/tools/testing/selftests/bpf/benchs/run_common.sh
index 9a16be78b180..6c5e6023a69f 100644
--- a/tools/testing/selftests/bpf/benchs/run_common.sh
+++ b/tools/testing/selftests/bpf/benchs/run_common.sh
@@ -33,6 +33,14 @@  function percentage()
 	echo "$*" | sed -E "s/.*Percentage\s=\s+([0-9]+\.[0-9]+).*/\1/"
 }
 
+function ops()
+{
+	echo -n "throughput: "
+	echo -n "$*" | sed -E "s/.*throughput\s+([0-9]+\.[0-9]+ ± [0-9]+\.[0-9]+\sM\sops\/s).*/\1/"
+	echo -n -e ", latency: "
+	echo "$*" | sed -E "s/.*latency\s+([0-9]+\.[0-9]+\sns\/op).*/\1/"
+}
+
 function total()
 {
 	echo "$*" | sed -E "s/.*total operations\s+([0-9]+\.[0-9]+ ± [0-9]+\.[0-9]+M\/s).*/\1/"
@@ -52,6 +60,13 @@  function summarize_percentage()
 	printf "%-20s %s%%\n" "$bench" "$(percentage $summary)"
 }
 
+function summarize_ops()
+{
+	bench="$1"
+	summary=$(echo $2 | tail -n1)
+	printf "%-20s %s\n" "$bench" "$(ops $summary)"
+}
+
 function summarize_total()
 {
 	bench="$1"
diff --git a/tools/testing/selftests/bpf/progs/bpf_loop_bench.c b/tools/testing/selftests/bpf/progs/bpf_loop_bench.c
new file mode 100644
index 000000000000..ff00621858c0
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/bpf_loop_bench.c
@@ -0,0 +1,26 @@ 
+// SPDX-License-Identifier: GPL-2.0
+// Copyright (c) 2021 Facebook
+
+#include "vmlinux.h"
+#include <bpf/bpf_helpers.h>
+
+char _license[] SEC("license") = "GPL";
+
+u32 nr_loops;
+long hits;
+
+static int empty_callback(__u32 index, void *data)
+{
+	return 0;
+}
+
+SEC("fentry/__x64_sys_getpgid")
+int benchmark(void *ctx)
+{
+	for (int i = 0; i < 1000; i++) {
+		bpf_loop(nr_loops, empty_callback, NULL, 0);
+
+		__sync_add_and_fetch(&hits, nr_loops);
+	}
+	return 0;
+}
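
A note on the benchmark structure above: the benchmark() program wraps
bpf_loop() in an inner 1000-iteration loop, which amortizes the cost of
triggering the program via the getpgid syscall over many helper calls;
this is the same effect discussed in the thread for the nr_loops = 1 data
point. Throughput is derived from the hits counter swapped out in
measure(), and run_bench_bpf_loop.sh sweeps thread counts and nr_loops
values by invoking the bench binary (via $RUN_BENCH) with
"-p <threads> --nr_loops <N> bpf-loop".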