bpf: call get_random_u32() for random integers

Message ID	20221205181534.612702-1-Jason@zx2c4.com (mailing list archive)
State	Changes Requested
Delegated to:	BPF
Headers	show Return-Path: <netdev-owner@kernel.org> From: "Jason A. Donenfeld" <Jason@zx2c4.com> To: bpf@vger.kernel.org, netdev@vger.kernel.org, linux-kernel@vger.kernel.org Cc: "Jason A. Donenfeld" <Jason@zx2c4.com>, Daniel Borkmann <daniel@iogearbox.net>, Andrii Nakryiko <andrii@kernel.org>, Martin KaFai Lau <martin.lau@linux.dev> Subject: [PATCH] bpf: call get_random_u32() for random integers Date: Mon, 5 Dec 2022 19:15:34 +0100 Message-Id: <20221205181534.612702-1-Jason@zx2c4.com> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit Precedence: bulk
Series	bpf: call get_random_u32() for random integers \| expand bpf: call get_random_u32() for random integers

Context	Check	Description
bpf/vmtest-bpf-next-VM_Test-1	success	Logs for ShellCheck
bpf/vmtest-bpf-next-VM_Test-7	success	Logs for llvm-toolchain
bpf/vmtest-bpf-next-VM_Test-8	success	Logs for set-matrix
bpf/vmtest-bpf-next-VM_Test-5	success	Logs for build for x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-6	success	Logs for build for x86_64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-3	success	Logs for build for aarch64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-4	success	Logs for build for s390x with gcc
netdev/tree_selection	success	Guessed tree name to be net-next, async
netdev/fixes_present	success	Fixes tag not required for -next series
netdev/subject_prefix	warning	Target tree name not specified in the subject
netdev/cover_letter	success	Single patches do not need cover letters
netdev/patch_count	success	Link
netdev/header_inline	success	No static functions without inline keyword in header files
netdev/build_32bit	success	Errors and warnings before: 1391 this patch: 1391
netdev/cc_maintainers	warning	12 maintainers not CCed: edumazet@google.com ast@kernel.org kpsingh@kernel.org davem@davemloft.net haoluo@google.com song@kernel.org yhs@fb.com kuba@kernel.org sdf@google.com pabeni@redhat.com john.fastabend@gmail.com jolsa@kernel.org
netdev/build_clang	success	Errors and warnings before: 154 this patch: 154
netdev/module_param	success	Was 0 now: 0
netdev/verify_signedoff	success	Signed-off-by tag matches author and committer
netdev/check_selftest	success	No net selftest shell script
netdev/verify_fixes	success	No Fixes tag
netdev/build_allmodconfig_warn	success	Errors and warnings before: 1381 this patch: 1381
netdev/checkpatch	success	total: 0 errors, 0 warnings, 0 checks, 51 lines checked
netdev/kdoc	success	Errors and warnings before: 0 this patch: 0
netdev/source_inline	success	Was 0 now: 0
bpf/vmtest-bpf-next-VM_Test-2	success	Logs for build for aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-9	success	Logs for test_maps on aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-10	success	Logs for test_maps on aarch64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-12	success	Logs for test_maps on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-13	success	Logs for test_maps on x86_64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-14	success	Logs for test_progs on aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-15	success	Logs for test_progs on aarch64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-17	success	Logs for test_progs on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-18	success	Logs for test_progs on x86_64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-19	fail	Logs for test_progs_no_alu32 on aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-20	success	Logs for test_progs_no_alu32 on aarch64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-22	success	Logs for test_progs_no_alu32 on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-23	success	Logs for test_progs_no_alu32 on x86_64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-24	success	Logs for test_progs_no_alu32_parallel on aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-25	success	Logs for test_progs_no_alu32_parallel on aarch64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-27	success	Logs for test_progs_no_alu32_parallel on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-28	success	Logs for test_progs_no_alu32_parallel on x86_64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-29	success	Logs for test_progs_parallel on aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-30	success	Logs for test_progs_parallel on aarch64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-32	success	Logs for test_progs_parallel on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-33	success	Logs for test_progs_parallel on x86_64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-34	success	Logs for test_verifier on aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-35	success	Logs for test_verifier on aarch64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-37	success	Logs for test_verifier on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-38	success	Logs for test_verifier on x86_64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-31	success	Logs for test_progs_parallel on s390x with gcc
bpf/vmtest-bpf-next-VM_Test-36	success	Logs for test_verifier on s390x with gcc
bpf/vmtest-bpf-next-VM_Test-16	success	Logs for test_progs on s390x with gcc
bpf/vmtest-bpf-next-VM_Test-21	success	Logs for test_progs_no_alu32 on s390x with gcc
bpf/vmtest-bpf-next-VM_Test-26	success	Logs for test_progs_no_alu32_parallel on s390x with gcc
bpf/vmtest-bpf-next-PR	fail	PR summary
bpf/vmtest-bpf-next-VM_Test-11	success	Logs for test_maps on s390x with gcc

Jason A. Donenfeld Dec. 5, 2022, 6:15 p.m. UTC

Since BPF's bpf_user_rnd_u32() was introduced, there have been three
significant developments in the RNG: 1) get_random_u32() returns the
same types of bytes as /dev/urandom, eliminating the distinction between
"kernel random bytes" and "userspace random bytes", 2) get_random_u32()
operates mostly locklessly over percpu state, 3) get_random_u32() has
become quite fast.

So rather than using the old clunky Tausworthe prandom code, just call
get_random_u32(), which should fit BPF uses perfectly.

Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Andrii Nakryiko <andrii@kernel.org>
Cc: Martin KaFai Lau <martin.lau@linux.dev>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
---
 include/linux/bpf.h   |  1 -
 kernel/bpf/core.c     | 17 +----------------
 kernel/bpf/verifier.c |  2 --
 net/core/filter.c     |  1 -
 4 files changed, 1 insertion(+), 20 deletions(-)

Daniel Borkmann Dec. 5, 2022, 10:21 p.m. UTC | #1

On 12/5/22 7:15 PM, Jason A. Donenfeld wrote:
> Since BPF's bpf_user_rnd_u32() was introduced, there have been three
> significant developments in the RNG: 1) get_random_u32() returns the
> same types of bytes as /dev/urandom, eliminating the distinction between
> "kernel random bytes" and "userspace random bytes", 2) get_random_u32()
> operates mostly locklessly over percpu state, 3) get_random_u32() has
> become quite fast.

Wrt "quite fast", do you have a comparison between the two? Asking as its
often used in networking worst case on per packet basis (e.g. via XDP), would
be useful to state concrete numbers for the two on a given machine.

> So rather than using the old clunky Tausworthe prandom code, just call
> get_random_u32(), which should fit BPF uses perfectly.

Jason A. Donenfeld Dec. 5, 2022, 10:47 p.m. UTC | #2

On Mon, Dec 05, 2022 at 11:21:51PM +0100, Daniel Borkmann wrote:
> On 12/5/22 7:15 PM, Jason A. Donenfeld wrote:
> > Since BPF's bpf_user_rnd_u32() was introduced, there have been three
> > significant developments in the RNG: 1) get_random_u32() returns the
> > same types of bytes as /dev/urandom, eliminating the distinction between
> > "kernel random bytes" and "userspace random bytes", 2) get_random_u32()
> > operates mostly locklessly over percpu state, 3) get_random_u32() has
> > become quite fast.
> 
> Wrt "quite fast", do you have a comparison between the two? Asking as its
> often used in networking worst case on per packet basis (e.g. via XDP), would
> be useful to state concrete numbers for the two on a given machine.

Median of 25 cycles vs median of 38, on my Tiger Lake machine. So a
little slower, but too small of a difference to matter.

Toke Høiland-Jørgensen Dec. 6, 2022, 12:50 p.m. UTC | #3

"Jason A. Donenfeld" <Jason@zx2c4.com> writes:

> On Mon, Dec 05, 2022 at 11:21:51PM +0100, Daniel Borkmann wrote:
>> On 12/5/22 7:15 PM, Jason A. Donenfeld wrote:
>> > Since BPF's bpf_user_rnd_u32() was introduced, there have been three
>> > significant developments in the RNG: 1) get_random_u32() returns the
>> > same types of bytes as /dev/urandom, eliminating the distinction between
>> > "kernel random bytes" and "userspace random bytes", 2) get_random_u32()
>> > operates mostly locklessly over percpu state, 3) get_random_u32() has
>> > become quite fast.
>> 
>> Wrt "quite fast", do you have a comparison between the two? Asking as its
>> often used in networking worst case on per packet basis (e.g. via XDP), would
>> be useful to state concrete numbers for the two on a given machine.
>
> Median of 25 cycles vs median of 38, on my Tiger Lake machine. So a
> little slower, but too small of a difference to matter.

Assuming a 3Ghz CPU clock (so 3 cycles per nanosecond), that's an
additional overhead of ~4.3 ns. When processing 10 Gbps at line rate
with small packets, the per-packet processing budget is 67.2 ns, so
those extra 4.3 ns will eat up ~6.4% of the budget.

So in other words, "too small a difference to matter" is definitely not
true in general. It really depends on the use case; if someone is using
this to, say, draw per-packet random numbers to compute a drop frequency
on ingress, that extra processing time will most likely result in a
quite measurable drop in performance.

-Toke

Jason A. Donenfeld Dec. 6, 2022, 12:59 p.m. UTC | #4

Hi Toke,

On Tue, Dec 6, 2022 at 1:50 PM Toke Høiland-Jørgensen <toke@kernel.org> wrote:
>
> "Jason A. Donenfeld" <Jason@zx2c4.com> writes:
>
> > On Mon, Dec 05, 2022 at 11:21:51PM +0100, Daniel Borkmann wrote:
> >> On 12/5/22 7:15 PM, Jason A. Donenfeld wrote:
> >> > Since BPF's bpf_user_rnd_u32() was introduced, there have been three
> >> > significant developments in the RNG: 1) get_random_u32() returns the
> >> > same types of bytes as /dev/urandom, eliminating the distinction between
> >> > "kernel random bytes" and "userspace random bytes", 2) get_random_u32()
> >> > operates mostly locklessly over percpu state, 3) get_random_u32() has
> >> > become quite fast.
> >>
> >> Wrt "quite fast", do you have a comparison between the two? Asking as its
> >> often used in networking worst case on per packet basis (e.g. via XDP), would
> >> be useful to state concrete numbers for the two on a given machine.
> >
> > Median of 25 cycles vs median of 38, on my Tiger Lake machine. So a
> > little slower, but too small of a difference to matter.
>
> Assuming a 3Ghz CPU clock (so 3 cycles per nanosecond), that's an
> additional overhead of ~4.3 ns. When processing 10 Gbps at line rate
> with small packets, the per-packet processing budget is 67.2 ns, so
> those extra 4.3 ns will eat up ~6.4% of the budget.
>
> So in other words, "too small a difference to matter" is definitely not
> true in general. It really depends on the use case; if someone is using
> this to, say, draw per-packet random numbers to compute a drop frequency
> on ingress, that extra processing time will most likely result in a
> quite measurable drop in performance.

Huh, neat calculation, I'll keep that method in mind.

Alright, sorry for the noise here. I'll check back in if I ever manage
to eliminate that performance gap.

Jason

Toke Høiland-Jørgensen Dec. 6, 2022, 1:26 p.m. UTC | #5

"Jason A. Donenfeld" <Jason@zx2c4.com> writes:

> Hi Toke,
>
> On Tue, Dec 6, 2022 at 1:50 PM Toke Høiland-Jørgensen <toke@kernel.org> wrote:
>>
>> "Jason A. Donenfeld" <Jason@zx2c4.com> writes:
>>
>> > On Mon, Dec 05, 2022 at 11:21:51PM +0100, Daniel Borkmann wrote:
>> >> On 12/5/22 7:15 PM, Jason A. Donenfeld wrote:
>> >> > Since BPF's bpf_user_rnd_u32() was introduced, there have been three
>> >> > significant developments in the RNG: 1) get_random_u32() returns the
>> >> > same types of bytes as /dev/urandom, eliminating the distinction between
>> >> > "kernel random bytes" and "userspace random bytes", 2) get_random_u32()
>> >> > operates mostly locklessly over percpu state, 3) get_random_u32() has
>> >> > become quite fast.
>> >>
>> >> Wrt "quite fast", do you have a comparison between the two? Asking as its
>> >> often used in networking worst case on per packet basis (e.g. via XDP), would
>> >> be useful to state concrete numbers for the two on a given machine.
>> >
>> > Median of 25 cycles vs median of 38, on my Tiger Lake machine. So a
>> > little slower, but too small of a difference to matter.
>>
>> Assuming a 3Ghz CPU clock (so 3 cycles per nanosecond), that's an
>> additional overhead of ~4.3 ns. When processing 10 Gbps at line rate
>> with small packets, the per-packet processing budget is 67.2 ns, so
>> those extra 4.3 ns will eat up ~6.4% of the budget.
>>
>> So in other words, "too small a difference to matter" is definitely not
>> true in general. It really depends on the use case; if someone is using
>> this to, say, draw per-packet random numbers to compute a drop frequency
>> on ingress, that extra processing time will most likely result in a
>> quite measurable drop in performance.
>
> Huh, neat calculation, I'll keep that method in mind.

Yeah, I find that thinking in "time budget per packet" helps a lot! For
completeness, the 67.2 ns comes from 10 Gbps / 84 bytes (that's 64-byte
packets + 20 bytes of preamble and inter-packet gap), which is 14.8
Mpps, or 67.2 ns/pkt.

> Alright, sorry for the noise here. I'll check back in if I ever manage
> to eliminate that performance gap.

One approach that we use generally for XDP (and which may or may not
help for this case) is that we try to amortise as much fixed cost as we
can over a batch of packets. Because XDP processing always happens in
the NAPI receive poll loop, we have a natural batching mechanism, and we
know that for that whole loop we'll keep running on the same CPU.

So for instance, if there's a large fixed component of the overhead of
get_random_u32(), we could have bpf_user_rnd_u32() populate a larger
per-CPU buffer and then just emit u32 chunks of that as long as we're
still in the same NAPI loop as the first call. Or something to that
effect. Not sure if this makes sense for this use case, but figured I'd
throw the idea out there :)

-Toke

Jason A. Donenfeld Dec. 6, 2022, 1:30 p.m. UTC | #6

Hi Toke,

On Tue, Dec 6, 2022 at 2:26 PM Toke Høiland-Jørgensen <toke@kernel.org> wrote:
> So for instance, if there's a large fixed component of the overhead of
> get_random_u32(), we could have bpf_user_rnd_u32() populate a larger
> per-CPU buffer and then just emit u32 chunks of that as long as we're
> still in the same NAPI loop as the first call. Or something to that
> effect. Not sure if this makes sense for this use case, but figured I'd
> throw the idea out there :)

Actually, this already is how get_random_u32() works! It buffers a
bunch of u32s in percpu batches, and doles them out as requested.
However, this API currently works in all contexts, including in
interrupts. So each call results in disabling irqs and reenabling
them. If I bifurcated batches into irq batches and non-irq batches, so
that we only needed to disable preemption for the non-irq batches,
that'd probably improve things quite a bit, since then the overhead
really would reduce to just a memcpy for the majority of calls. But I
don't know if adding that duplication of all code paths is really
worth the huge hassle.

Jason

Toke Høiland-Jørgensen Dec. 6, 2022, 1:53 p.m. UTC | #7

"Jason A. Donenfeld" <Jason@zx2c4.com> writes:

> Hi Toke,
>
> On Tue, Dec 6, 2022 at 2:26 PM Toke Høiland-Jørgensen <toke@kernel.org> wrote:
>> So for instance, if there's a large fixed component of the overhead of
>> get_random_u32(), we could have bpf_user_rnd_u32() populate a larger
>> per-CPU buffer and then just emit u32 chunks of that as long as we're
>> still in the same NAPI loop as the first call. Or something to that
>> effect. Not sure if this makes sense for this use case, but figured I'd
>> throw the idea out there :)
>
> Actually, this already is how get_random_u32() works! It buffers a
> bunch of u32s in percpu batches, and doles them out as requested.

Ah, right. Not terribly surprised you already did this!

> However, this API currently works in all contexts, including in
> interrupts. So each call results in disabling irqs and reenabling
> them. If I bifurcated batches into irq batches and non-irq batches, so
> that we only needed to disable preemption for the non-irq batches,
> that'd probably improve things quite a bit, since then the overhead
> really would reduce to just a memcpy for the majority of calls. But I
> don't know if adding that duplication of all code paths is really
> worth the huge hassle.

Right, makes sense; happy to leave that decision entirely up to you :)

-Toke

bpf: call get_random_u32() for random integers

Checks

Commit Message

Comments

Patch