diff mbox series

[v2,bpf-next,1/4] bpf: Add bloom filter map implementation

Message ID 20210914040433.3184308-2-joannekoong@fb.com (mailing list archive)
State Superseded
Delegated to: BPF
Series Implement bloom filter map

Checks

Context Check Description
netdev/apply fail Patch does not apply to bpf-next
netdev/tree_selection success Clearly marked for bpf-next
bpf/vmtest-bpf-next-PR fail merge-conflict

Commit Message

Joanne Koong Sept. 14, 2021, 4:04 a.m. UTC
Bloom filters are a space-efficient probabilistic data structure
used to quickly test whether an element exists in a set.
In a bloom filter, false positives are possible whereas false
negatives should never be.

This patch adds a bloom filter map for bpf programs.
The bloom filter map supports peek (determining whether an element
is present in the map) and push (adding an element to the map)
operations. These operations are exposed to userspace applications
through the already existing syscalls in the following way:

BPF_MAP_LOOKUP_ELEM -> peek
BPF_MAP_UPDATE_ELEM -> push

The bloom filter map does not have keys, only values. In light of
this, the bloom filter map's API matches that of queue/stack maps:
user applications use BPF_MAP_LOOKUP_ELEM/BPF_MAP_UPDATE_ELEM
which correspond internally to bpf_map_peek_elem/bpf_map_push_elem,
and bpf programs must use the bpf_map_peek_elem and bpf_map_push_elem
APIs to query or add an element to the bloom filter map. When the
bloom filter map is created, it must be created with a key_size of 0.

For updates, the user will pass in the element to add to the map
as the value, with a NULL key. For lookups, the user will pass in the
element to query in the map as the value. In the verifier layer, this
requires us to modify the argument type of a bloom filter's
BPF_FUNC_map_peek_elem call to ARG_PTR_TO_MAP_VALUE; as well, in
the syscall layer, we need to copy over the user value so that in
bpf_map_peek_elem, we know which specific value to query.
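The push/peek flow above can be modeled in plain userspace C. This is only an illustrative sketch: a simple FNV-1a-style function stands in for the kernel's jhash, the bitmap size is an arbitrary power of two, and none of the names below come from the patch itself.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define NR_BITS   (1u << 16)   /* power of two, so a simple mask works */
#define NR_HASHES 5            /* the patch's default hash count */

static uint32_t bitmap[NR_BITS / 32];

/* Stand-in for jhash: FNV-1a mixed with a per-round seed. */
static uint32_t stand_in_hash(const void *data, uint32_t len, uint32_t seed)
{
	const uint8_t *p = data;
	uint32_t h = 2166136261u ^ seed;

	for (uint32_t i = 0; i < len; i++) {
		h ^= p[i];
		h *= 16777619u;
	}
	return h;
}

/* Models BPF_MAP_UPDATE_ELEM (NULL key) -> push: set one bit per hash. */
static void bloom_push(const void *value, uint32_t size)
{
	for (uint32_t i = 0; i < NR_HASHES; i++) {
		uint32_t bit = stand_in_hash(value, size, i) & (NR_BITS - 1);

		bitmap[bit / 32] |= 1u << (bit % 32);
	}
}

/* Models BPF_MAP_LOOKUP_ELEM -> peek: "not found" (here -1, standing in
 * for -ENOENT) as soon as any probed bit is clear. */
static int bloom_peek(const void *value, uint32_t size)
{
	for (uint32_t i = 0; i < NR_HASHES; i++) {
		uint32_t bit = stand_in_hash(value, size, i) & (NR_BITS - 1);

		if (!(bitmap[bit / 32] & (1u << (bit % 32))))
			return -1;
	}
	return 0;
}
```

Note the structural point the commit message makes: every pushed value must later peek as present (no false negatives), while a peek hit for a never-pushed value is possible (false positive).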

A few things to please take note of:
 * If there are any concurrent lookups + updates, the user is
responsible for synchronizing this to ensure no false negative lookups
occur.
 * The number of hashes to use for the bloom filter is configurable from
userspace. If no number is specified, the default used will be 5 hash
functions. The benchmarks later in this patchset can help compare the
performance of using different numbers of hashes on different entry
sizes. In general, using more hashes slows down a lookup, but also
lowers the false positive rate of an element being wrongly reported as
present in the bloom filter (up to the optimal number of hashes for the
given bitmap size and entry count).
 * Deleting an element in the bloom filter map is not supported.
 * The bloom filter map may be used as an inner map.
 * The "max_entries" size that is specified at map creation time is used to
approximate a reasonable bitmap size for the bloom filter, and is not
otherwise strictly enforced. If the user wishes to insert more entries into
the bloom filter than "max_entries", they may do so but they should be
aware that this may lead to a higher false positive rate.
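For intuition on the max_entries / hash-count tradeoff, the standard bloom filter analysis applies: with m bits, n inserted entries, and k hash functions, a given bit stays clear with probability (1 - 1/m)^(kn), so the false positive rate is roughly (1 - (1 - 1/m)^(kn))^k. A small self-contained check of that formula (standard math, not code from this patch; plain loops are used to avoid a libm dependency):

```c
#include <assert.h>

/* Approximate false positive rate after inserting n entries with k hash
 * functions into an m-bit bitmap: a given bit is still clear with
 * probability (1 - 1/m)^(k*n), and a false positive needs all k probed
 * bits to be set. */
static double bloom_fpr(unsigned int k, unsigned int n, unsigned int m)
{
	double bit_clear = 1.0;
	double fpr = 1.0;
	unsigned int i;

	for (i = 0; i < k * n; i++)
		bit_clear *= 1.0 - 1.0 / m;
	for (i = 0; i < k; i++)
		fpr *= 1.0 - bit_clear;
	return fpr;
}
```

This makes the tradeoff in the notes concrete: below the optimal hash count (about (m/n)·ln 2), adding hashes lowers the false positive rate at the cost of lookup speed; far beyond it, the rate climbs again as the bitmap saturates.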

Signed-off-by: Joanne Koong <joannekoong@fb.com>
---
 include/linux/bpf_types.h      |   1 +
 include/uapi/linux/bpf.h       |  10 ++
 kernel/bpf/Makefile            |   2 +-
 kernel/bpf/bloom_filter.c      | 205 +++++++++++++++++++++++++++++++++
 kernel/bpf/syscall.c           |  14 ++-
 kernel/bpf/verifier.c          |  19 ++-
 tools/include/uapi/linux/bpf.h |  10 ++
 7 files changed, 255 insertions(+), 6 deletions(-)
 create mode 100644 kernel/bpf/bloom_filter.c

Comments

Alexei Starovoitov Sept. 17, 2021, 5:01 p.m. UTC | #1
On Mon, Sep 13, 2021 at 09:04:30PM -0700, Joanne Koong wrote:
> +
> +/* For bloom filter maps, the next 4 bits represent how many hashes to use.
> + * The maximum number of hash functions supported is 15. If this is not set,
> + * the default number of hash functions used will be 5.
> + */
> +	BPF_F_BLOOM_FILTER_HASH_BIT_1 = (1U << 13),
> +	BPF_F_BLOOM_FILTER_HASH_BIT_2 = (1U << 14),
> +	BPF_F_BLOOM_FILTER_HASH_BIT_3 = (1U << 15),
> +	BPF_F_BLOOM_FILTER_HASH_BIT_4 = (1U << 16),

The bit selection is unintuitive.
Since key_size has to be zero, maybe that could be used instead to indicate the number
of hash functions in the rare case when 5 is not good enough?
Or use inner_map_fd since there is no possibility of having an inner map in bloomfilter.
It could be a union:
    __u32   max_entries;    /* max number of entries in a map */
    __u32   map_flags;      /* BPF_MAP_CREATE related
                             * flags defined above.
                             */
    union {
       __u32  inner_map_fd;   /* fd pointing to the inner map */
       __u32  nr_hash_funcs;  /* or number of hash functions */
    };
    __u32   numa_node;      /* numa node */

> +struct bpf_bloom_filter {
> +	struct bpf_map map;
> +	u32 bit_array_mask;
> +	u32 hash_seed;
> +	/* If the size of the values in the bloom filter is u32 aligned,
> +	 * then it is more performant to use jhash2 as the underlying hash
> +	 * function, else we use jhash. This tracks the number of u32s
> +	 * in an u32-aligned value size. If the value size is not u32 aligned,
> +	 * this will be 0.
> +	 */
> +	u32 aligned_u32_count;

What is the performance difference?
Maybe we should enforce a 4-byte sized value for simplicity?
Andrii Nakryiko Sept. 17, 2021, 9:48 p.m. UTC | #2
On Mon, Sep 13, 2021 at 9:09 PM Joanne Koong <joannekoong@fb.com> wrote:
>
> Bloom filters are a space-efficient probabilistic data structure
> used to quickly test whether an element exists in a set.
> In a bloom filter, false positives are possible whereas false
> negatives should never be.
>
> This patch adds a bloom filter map for bpf programs.
> The bloom filter map supports peek (determining whether an element
> is present in the map) and push (adding an element to the map)
> operations.These operations are exposed to userspace applications
> through the already existing syscalls in the following way:
>
> BPF_MAP_LOOKUP_ELEM -> peek
> BPF_MAP_UPDATE_ELEM -> push
>
> The bloom filter map does not have keys, only values. In light of
> this, the bloom filter map's API matches that of queue stack maps:
> user applications use BPF_MAP_LOOKUP_ELEM/BPF_MAP_UPDATE_ELEM
> which correspond internally to bpf_map_peek_elem/bpf_map_push_elem,
> and bpf programs must use the bpf_map_peek_elem and bpf_map_push_elem
> APIs to query or add an element to the bloom filter map. When the
> bloom filter map is created, it must be created with a key_size of 0.
>
> For updates, the user will pass in the element to add to the map
> as the value, with a NULL key. For lookups, the user will pass in the
> element to query in the map as the value. In the verifier layer, this
> requires us to modify the argument type of a bloom filter's
> BPF_FUNC_map_peek_elem call to ARG_PTR_TO_MAP_VALUE; as well, in
> the syscall layer, we need to copy over the user value so that in
> bpf_map_peek_elem, we know which specific value to query.
>
> A few things to please take note of:
>  * If there are any concurrent lookups + updates, the user is
> responsible for synchronizing this to ensure no false negative lookups
> occur.
>  * The number of hashes to use for the bloom filter is configurable from
> userspace. If no number is specified, the default used will be 5 hash
> functions. The benchmarks later in this patchset can help compare the
> performance of using different number of hashes on different entry
> sizes. In general, using more hashes decreases the speed of a lookup,
> but increases the false positive rate of an element being detected in the
> bloom filter.
>  * Deleting an element in the bloom filter map is not supported.
>  * The bloom filter map may be used as an inner map.
>  * The "max_entries" size that is specified at map creation time is used to
> approximate a reasonable bitmap size for the bloom filter, and is not
> otherwise strictly enforced. If the user wishes to insert more entries into
> the bloom filter than "max_entries", they may do so but they should be
> aware that this may lead to a higher false positive rate.
>
> Signed-off-by: Joanne Koong <joannekoong@fb.com>
> ---
>  include/linux/bpf_types.h      |   1 +
>  include/uapi/linux/bpf.h       |  10 ++
>  kernel/bpf/Makefile            |   2 +-
>  kernel/bpf/bloom_filter.c      | 205 +++++++++++++++++++++++++++++++++
>  kernel/bpf/syscall.c           |  14 ++-
>  kernel/bpf/verifier.c          |  19 ++-
>  tools/include/uapi/linux/bpf.h |  10 ++
>  7 files changed, 255 insertions(+), 6 deletions(-)
>  create mode 100644 kernel/bpf/bloom_filter.c
>
> diff --git a/include/linux/bpf_types.h b/include/linux/bpf_types.h
> index 9c81724e4b98..c4424ac2fa02 100644
> --- a/include/linux/bpf_types.h
> +++ b/include/linux/bpf_types.h
> @@ -125,6 +125,7 @@ BPF_MAP_TYPE(BPF_MAP_TYPE_STACK, stack_map_ops)
>  BPF_MAP_TYPE(BPF_MAP_TYPE_STRUCT_OPS, bpf_struct_ops_map_ops)
>  #endif
>  BPF_MAP_TYPE(BPF_MAP_TYPE_RINGBUF, ringbuf_map_ops)
> +BPF_MAP_TYPE(BPF_MAP_TYPE_BLOOM_FILTER, bloom_filter_map_ops)
>
>  BPF_LINK_TYPE(BPF_LINK_TYPE_RAW_TRACEPOINT, raw_tracepoint)
>  BPF_LINK_TYPE(BPF_LINK_TYPE_TRACING, tracing)
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index 791f31dd0abe..1d82860fd98e 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -906,6 +906,7 @@ enum bpf_map_type {
>         BPF_MAP_TYPE_RINGBUF,
>         BPF_MAP_TYPE_INODE_STORAGE,
>         BPF_MAP_TYPE_TASK_STORAGE,
> +       BPF_MAP_TYPE_BLOOM_FILTER,
>  };
>
>  /* Note that tracing related programs such as
> @@ -1210,6 +1211,15 @@ enum {
>
>  /* Create a map that is suitable to be an inner map with dynamic max entries */
>         BPF_F_INNER_MAP         = (1U << 12),
> +
> +/* For bloom filter maps, the next 4 bits represent how many hashes to use.
> + * The maximum number of hash functions supported is 15. If this is not set,
> + * the default number of hash functions used will be 5.
> + */
> +       BPF_F_BLOOM_FILTER_HASH_BIT_1 = (1U << 13),
> +       BPF_F_BLOOM_FILTER_HASH_BIT_2 = (1U << 14),
> +       BPF_F_BLOOM_FILTER_HASH_BIT_3 = (1U << 15),
> +       BPF_F_BLOOM_FILTER_HASH_BIT_4 = (1U << 16),

I didn't realize all those map_flags are sequentially numbered, but I
guess we are not yet running out of space, so this might be ok. But to
be usable from a BPF program nicely, it would be better to define this as
the offset of the first hash bit and the number of bits:

BPF_F_BLOOM_NR_HASH_OFF = 13,
BPF_F_BLOOM_NR_HASH_CNT = 4,

So that in a BPF map definition in a BPF program we could do

struct {
    __uint(type, BPF_MAP_TYPE_BLOOM_FILTER),
    ...
    __uint(map_flags, 5 << BPF_F_BLOOM_NR_HASH_OFF),
};

It's still quite ugly, but given it's unlikely to be used very
frequently, might be ok.
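With those two constants (hypothetical, as proposed above; they are not part of the UAPI), encoding and decoding the hash count from map_flags reduces to a shift and a mask. A sketch:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical constants from the proposal above, not in the UAPI. */
#define BPF_F_BLOOM_NR_HASH_OFF 13
#define BPF_F_BLOOM_NR_HASH_CNT 4   /* 4 bits -> up to 15 hash functions */

/* Pack a hash count into map_flags at the proposed bit offset. */
static uint32_t bloom_flags_set_nr_hashes(uint32_t flags, uint32_t nr)
{
	return flags | (nr << BPF_F_BLOOM_NR_HASH_OFF);
}

/* Recover the hash count the map was created with. */
static uint32_t bloom_flags_get_nr_hashes(uint32_t flags)
{
	uint32_t mask = (1u << BPF_F_BLOOM_NR_HASH_CNT) - 1;

	return (flags >> BPF_F_BLOOM_NR_HASH_OFF) & mask;
}
```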


[...]

> +
> +static int bloom_filter_map_peek_elem(struct bpf_map *map, void *value)
> +{
> +       struct bpf_bloom_filter *bloom_filter =
> +               container_of(map, struct bpf_bloom_filter, map);
> +       u32 hash;
> +       u8 i;
> +
> +       for (i = 0; i < bloom_filter->nr_hashes; i++) {
> +               if (bloom_filter->aligned_u32_count)
> +                       hash = jhash2(value, bloom_filter->aligned_u32_count,
> +                                     bloom_filter->hash_seed + i) &
> +                               bloom_filter->bit_array_mask;
> +               else
> +                       hash = jhash(value, map->value_size,
> +                                    bloom_filter->hash_seed + i) &
> +                               bloom_filter->bit_array_mask;

this looks like a good candidate for helper function used in at least two places

> +
> +               if (!test_bit(hash, bloom_filter->bit_array))
> +                       return -ENOENT;
> +       }
> +
> +       return 0;
> +}
> +

[...]
Andrii Nakryiko Sept. 20, 2021, 8:58 p.m. UTC | #3
On Fri, Sep 17, 2021 at 6:08 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Mon, Sep 13, 2021 at 09:04:30PM -0700, Joanne Koong wrote:
> > +
> > +/* For bloom filter maps, the next 4 bits represent how many hashes to use.
> > + * The maximum number of hash functions supported is 15. If this is not set,
> > + * the default number of hash functions used will be 5.
> > + */
> > +     BPF_F_BLOOM_FILTER_HASH_BIT_1 = (1U << 13),
> > +     BPF_F_BLOOM_FILTER_HASH_BIT_2 = (1U << 14),
> > +     BPF_F_BLOOM_FILTER_HASH_BIT_3 = (1U << 15),
> > +     BPF_F_BLOOM_FILTER_HASH_BIT_4 = (1U << 16),
>
> The bit selection is unintuitive.
> Since key_size has to be zero may be used that instead to indicate the number of hash
> functions in the rare case when 5 is not good enough?

Hm... I was initially thinking about proposing something like that,
but it felt a bit ugly at the time. But now thinking about this a bit
more, I think this would be a bit more meaningful if we change the
terminology a bit. Instead of saying that Bloom filter has values and
no keys, we actually have keys and no values. So all those bytes that
are hashed are treated as keys (which is actually how sets are
implemented on top of maps, where you have keys and no values, or at
least the value is always true).

So with that we'll have key/key_size to specify the number of bytes that
need to be hashed (and their type info). And then we can squint a bit
and say that number of hashes are specified by value_size, as in
values are those nr_hash bits that we set in Bloom filter.

Still a bit of terminology stretch, but won't necessitate those
specialized fields just for the Bloom filter map. But if the default value is
going to be good enough for most cases and most cases won't need to
adjust the number of hashes, this seems pretty clean to me.

> Or use inner_map_fd since there is no possibility of having an inner map in bloomfilter.
> It could be a union:
>     __u32   max_entries;    /* max number of entries in a map */
>     __u32   map_flags;      /* BPF_MAP_CREATE related
>                              * flags defined above.
>                              */
>     union {
>        __u32  inner_map_fd;   /* fd pointing to the inner map */
>        __u32  nr_hash_funcs;  /* or number of hash functions */
>     };

This works as well. A bit more Bloom filter-only terminology
throughout UAPI and libbpf, but I'd be fine with that as well.


>     __u32   numa_node;      /* numa node */
>
> > +struct bpf_bloom_filter {
> > +     struct bpf_map map;
> > +     u32 bit_array_mask;
> > +     u32 hash_seed;
> > +     /* If the size of the values in the bloom filter is u32 aligned,
> > +      * then it is more performant to use jhash2 as the underlying hash
> > +      * function, else we use jhash. This tracks the number of u32s
> > +      * in an u32-aligned value size. If the value size is not u32 aligned,
> > +      * this will be 0.
> > +      */
> > +     u32 aligned_u32_count;
>
> what is the performance difference?
> May be we enforce 4-byte sized value for simplicity?

This might be a bit too surprising, especially if keys are just some
strings, where people might not expect that it has to be a 4-byte-multiple
size. And debugging this without extra tooling (like retsnoop) is
going to be nightmarish.

If the performance diff is huge and that if/else logic is
unacceptable, we can also internally pad with up to 3 zero bytes and
include those into the hash.
Joanne Koong Sept. 20, 2021, 9:03 p.m. UTC | #4
On 9/17/21 10:01 AM, Alexei Starovoitov wrote:

> On Mon, Sep 13, 2021 at 09:04:30PM -0700, Joanne Koong wrote:
>> +
>> +/* For bloom filter maps, the next 4 bits represent how many hashes to use.
>> + * The maximum number of hash functions supported is 15. If this is not set,
>> + * the default number of hash functions used will be 5.
>> + */
>> +	BPF_F_BLOOM_FILTER_HASH_BIT_1 = (1U << 13),
>> +	BPF_F_BLOOM_FILTER_HASH_BIT_2 = (1U << 14),
>> +	BPF_F_BLOOM_FILTER_HASH_BIT_3 = (1U << 15),
>> +	BPF_F_BLOOM_FILTER_HASH_BIT_4 = (1U << 16),
> The bit selection is unintuitive.
> Since key_size has to be zero may be used that instead to indicate the number of hash
> functions in the rare case when 5 is not good enough?
> Or use inner_map_fd since there is no possibility of having an inner map in bloomfilter.
> It could be a union:
>      __u32   max_entries;    /* max number of entries in a map */
>      __u32   map_flags;      /* BPF_MAP_CREATE related
>                               * flags defined above.
>                               */
>      union {
>         __u32  inner_map_fd;   /* fd pointing to the inner map */
>         __u32  nr_hash_funcs;  /* or number of hash functions */
>      };
>      __u32   numa_node;      /* numa node */
I really like the idea of union-ing inner_map_fd with the number of hash
functions (my worry with using key_size is that it might be a confusing /
non-intuitive API quirk for users), but I think this would later require us
to add some bloom filter specific APIs to libbpf (such as
bpf_map__set_nr_hashes).

To make the bit selection more intuitive, Andrii suggested defining some 
helper like

BPF_F_BLOOM_NR_HASH_OFF = 13

where the user could then do something like

struct {
     __uint(type, BPF_MAP_TYPE_BLOOM_FILTER),
     ...
     __uint(map_flags, 5 << BPF_F_BLOOM_NR_HASH_OFF),
};

to set the number of hash functions.

Would this approach address your concerns about the unintuitiveness of 
the bit selection?

>> +struct bpf_bloom_filter {
>> +	struct bpf_map map;
>> +	u32 bit_array_mask;
>> +	u32 hash_seed;
>> +	/* If the size of the values in the bloom filter is u32 aligned,
>> +	 * then it is more performant to use jhash2 as the underlying hash
>> +	 * function, else we use jhash. This tracks the number of u32s
>> +	 * in an u32-aligned value size. If the value size is not u32 aligned,
>> +	 * this will be 0.
>> +	 */
>> +	u32 aligned_u32_count;
> what is the performance difference?

Using results from the hashmap benchmark tests, using jhash2 instead of
jhash for 4-byte aligned value sizes improved the performance by roughly
5% to 15%. For non-4-byte aligned value sizes, there wasn't a noticeable
difference between using jhash2 (and truncating the remainder bits) vs.
using jhash.

> May be we enforce 4-byte sized value for simplicity?
Sounds great! And if in the future this becomes too restrictive, we
could always loosen it.
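The aligned_u32_count bookkeeping the patch describes boils down to a single calculation. A hypothetical userspace restatement (not the kernel code):

```c
#include <assert.h>
#include <stdint.h>

/* Mirrors the patch's description: the number of u32 words in a
 * u32-aligned value size, or 0 when the size is not a multiple of 4
 * (in which case the byte-wise jhash path would be taken). */
static uint32_t aligned_u32_count(uint32_t value_size)
{
	return value_size % 4 == 0 ? value_size / 4 : 0;
}
```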
Joanne Koong Sept. 20, 2021, 10:52 p.m. UTC | #5
My previous email replied to Alexei's email before I saw Andrii's new
email, so please feel free to disregard my previous email.

On 9/20/21 1:58 PM, Andrii Nakryiko wrote:

> On Fri, Sep 17, 2021 at 6:08 PM Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
>> On Mon, Sep 13, 2021 at 09:04:30PM -0700, Joanne Koong wrote:
>>> +
>>> +/* For bloom filter maps, the next 4 bits represent how many hashes to use.
>>> + * The maximum number of hash functions supported is 15. If this is not set,
>>> + * the default number of hash functions used will be 5.
>>> + */
>>> +     BPF_F_BLOOM_FILTER_HASH_BIT_1 = (1U << 13),
>>> +     BPF_F_BLOOM_FILTER_HASH_BIT_2 = (1U << 14),
>>> +     BPF_F_BLOOM_FILTER_HASH_BIT_3 = (1U << 15),
>>> +     BPF_F_BLOOM_FILTER_HASH_BIT_4 = (1U << 16),
>> The bit selection is unintuitive.
>> Since key_size has to be zero may be used that instead to indicate the number of hash
>> functions in the rare case when 5 is not good enough?
> Hm... I was initially thinking about proposing something like that,
> but it felt a bit ugly at the time. But now thinking about this a bit
> more, I think this would be a bit more meaningful if we change the
> terminology a bit. Instead of saying that Bloom filter has values and
> no keys, we actually have keys and no values. So all those bytes that
> are hashed are treated as keys (which is actually how sets are
> implemented on top of maps, where you have keys and no values, or at
> least the value is always true).
>
> So with that we'll have key/key_size to specify number of bytes that
> needs to be hashed (and it's type info). And then we can squint a bit
> and say that number of hashes are specified by value_size, as in
> values are those nr_hash bits that we set in Bloom filter.
>
> Still a bit of terminology stretch, but won't necessitate those
> specialized fields just for Bloom filter map. But if default value is
> going to be good enough for most cases and most cases won't need to
> adjust number of hashes, this seems to be pretty clean to me.

With the bloom filter map having keys instead of values, I think this would
lead to messier code in the kernel for handling map_lookup_elem and
map_update_elem calls, because the bloom filter map is a non-associative
map and the current APIs for non-associative map types
(peek_elem/push_elem/pop_elem) all treat the map data as the value, not
the key.

For example, for map_update_elem, the API from the eBPF program side is

int (*map_update_elem)(struct bpf_map *map, void *key, void *value, u64 
flags);

This would require us to either

a) Add some custom logic in syscall.c so that we bypass the
copy_from_bpfptr call on bloom filter map values (necessary because
memcpying 0 bytes still requires the src pointer to be valid), which
would allow us to pass in a NULL value

b) Add a new function like

int (*map_push_key)(struct bpf_map *map, void *key, u64 flags)

that eBPF programs can call instead of map_update_elem.

or

c) Try to repurpose the existing map_push_elem API by passing in the key
instead of the value, which would lead to inconsistent use of the API

I think if we could change the non-associative map types (currently only
stack maps and queue maps, I believe) to have their data be a key instead
of a value, and have the pop/peek APIs use keys instead of values, then
this would be cleaner, since we could then just use the existing peek/pop
APIs.

>> Or use inner_map_fd since there is no possibility of having an inner map in bloomfilter.
>> It could be a union:
>>      __u32   max_entries;    /* max number of entries in a map */
>>      __u32   map_flags;      /* BPF_MAP_CREATE related
>>                               * flags defined above.
>>                               */
>>      union {
>>         __u32  inner_map_fd;   /* fd pointing to the inner map */
>>         __u32  nr_hash_funcs;  /* or number of hash functions */
>>      };
> This works as well. A bit more Bloom filter-only terminology
> throughout UAPI and libbpf, but I'd be fine with that as well.
>
Great, it looks like this is the consensus - I will go with this option 
for v3!
>>      __u32   numa_node;      /* numa node */
>>
>>> +struct bpf_bloom_filter {
>>> +     struct bpf_map map;
>>> +     u32 bit_array_mask;
>>> +     u32 hash_seed;
>>> +     /* If the size of the values in the bloom filter is u32 aligned,
>>> +      * then it is more performant to use jhash2 as the underlying hash
>>> +      * function, else we use jhash. This tracks the number of u32s
>>> +      * in an u32-aligned value size. If the value size is not u32 aligned,
>>> +      * this will be 0.
>>> +      */
>>> +     u32 aligned_u32_count;
>> what is the performance difference?
>> May be we enforce 4-byte sized value for simplicity?
> This might be a bit too surprising, especially if keys are just some
> strings, where people might not expect that it has to 4-byte multiple
> size. And debugging this without extra tooling (like retsnoop) is
> going to be nightmarish.
>
> If the performance diff is huge and that if/else logic is
> unacceptable, we can also internally pad with up to 3 zero bytes and
> include those into the hash.
I think the if/else logic is unavoidable if we support non-4-byte-aligned
value sizes, unless we are okay with truncating any remainder bytes of
non-4-byte-aligned values and stipulating that a bloom filter map value
size has to be greater than 4 bytes (these conditions would allow us to
use jhash2 for every value without an if/else check). If we internally
pad, we will have to pad on every update and lookup, which would also
require an if/else.

Thanks for the comments and reviews, Alexei and Andrii. They are much
appreciated!
Andrii Nakryiko Sept. 20, 2021, 11:21 p.m. UTC | #6
On Mon, Sep 20, 2021 at 3:52 PM Joanne Koong <joannekoong@fb.com> wrote:
>
> My previous email replied to Alexei's email before I saw Andrii's new
> email, so please
> feel free to disregard my previous email.

Never got that reply of yours. Alexei's email arrived a few hours
after I'd already replied to you. It was a time warp anomaly :)

>
> On 9/20/21 1:58 PM, Andrii Nakryiko wrote:
>
> > On Fri, Sep 17, 2021 at 6:08 PM Alexei Starovoitov
> > <alexei.starovoitov@gmail.com> wrote:
> >> On Mon, Sep 13, 2021 at 09:04:30PM -0700, Joanne Koong wrote:
> >>> +
> >>> +/* For bloom filter maps, the next 4 bits represent how many hashes to use.
> >>> + * The maximum number of hash functions supported is 15. If this is not set,
> >>> + * the default number of hash functions used will be 5.
> >>> + */
> >>> +     BPF_F_BLOOM_FILTER_HASH_BIT_1 = (1U << 13),
> >>> +     BPF_F_BLOOM_FILTER_HASH_BIT_2 = (1U << 14),
> >>> +     BPF_F_BLOOM_FILTER_HASH_BIT_3 = (1U << 15),
> >>> +     BPF_F_BLOOM_FILTER_HASH_BIT_4 = (1U << 16),
> >> The bit selection is unintuitive.
> >> Since key_size has to be zero may be used that instead to indicate the number of hash
> >> functions in the rare case when 5 is not good enough?
> > Hm... I was initially thinking about proposing something like that,
> > but it felt a bit ugly at the time. But now thinking about this a bit
> > more, I think this would be a bit more meaningful if we change the
> > terminology a bit. Instead of saying that Bloom filter has values and
> > no keys, we actually have keys and no values. So all those bytes that
> > are hashed are treated as keys (which is actually how sets are
> > implemented on top of maps, where you have keys and no values, or at
> > least the value is always true).
> >
> > So with that we'll have key/key_size to specify number of bytes that
> > needs to be hashed (and it's type info). And then we can squint a bit
> > and say that number of hashes are specified by value_size, as in
> > values are those nr_hash bits that we set in Bloom filter.
> >
> > Still a bit of terminology stretch, but won't necessitate those
> > specialized fields just for Bloom filter map. But if default value is
> > going to be good enough for most cases and most cases won't need to
> > adjust number of hashes, this seems to be pretty clean to me.
>
> With having bloom filter map keys instead of values,  I think this would
> lead to messier code in the kernel for handling map_lookup_elem
> and map_update_elem calls, due to the fact that the bloom filter map
> is a non-associative map and the current APIs for non-associative map types
> (peek_elem/push_elem/pop_elem) all have the map data as the value and
> not the key.
>
> For example, for map_update_elem, the API from the eBPF program side is
>
> int (*map_update_elem)(struct bpf_map *map, void *key, void *value, u64
> flags);
>
> This would require us to either
>
> a) Add some custom logic in syscall.c so that we bypass the
> copy_from_bpfptr call on
> bloom filter map values (necessary because memcpying 0 bytes still
> requires the src pointer
> to be valid), which would allow us to pass in a NULL value
>
> b) Add a new function like
>
> int (*map_push_key)(struct bpf_map *map, void *key, u64 flags)
>
> that eBPF programs can call instead of map_update_elem.
>
> or
>
> c) Try to repurpose the existing map_push_elem API by passing in the key
> instead of the value,
> which would lead to inconsistent use of the API
>
> I think if we could change the non-associative map types (currently only
> stack maps and queue
> maps, I believe) to have their data be a key instead of a value, and
> have the pop/peek APIs use
> keys instead of values, then this would be cleaner, since we could then
> just use the existing peek/pop
> APIs.

I don't think we can change existing map APIs anymore, unfortunately.

>
> >> Or use inner_map_fd since there is no possibility of having an inner map in bloomfilter.
> >> It could be a union:
> >>      __u32   max_entries;    /* max number of entries in a map */
> >>      __u32   map_flags;      /* BPF_MAP_CREATE related
> >>                               * flags defined above.
> >>                               */
> >>      union {
> >>         __u32  inner_map_fd;   /* fd pointing to the inner map */
> >>         __u32  nr_hash_funcs;  /* or number of hash functions */
> >>      };
> > This works as well. A bit more Bloom filter-only terminology
> > throughout UAPI and libbpf, but I'd be fine with that as well.
> >
> Great, it looks like this is the consensus - I will go with this option
> for v3!
> >>      __u32   numa_node;      /* numa node */
> >>
> >>> +struct bpf_bloom_filter {
> >>> +     struct bpf_map map;
> >>> +     u32 bit_array_mask;
> >>> +     u32 hash_seed;
> >>> +     /* If the size of the values in the bloom filter is u32 aligned,
> >>> +      * then it is more performant to use jhash2 as the underlying hash
> >>> +      * function, else we use jhash. This tracks the number of u32s
> >>> +      * in an u32-aligned value size. If the value size is not u32 aligned,
> >>> +      * this will be 0.
> >>> +      */
> >>> +     u32 aligned_u32_count;
> >> what is the performance difference?
> >> May be we enforce 4-byte sized value for simplicity?
> > This might be a bit too surprising, especially if keys are just some
> > strings, where people might not expect that it has to 4-byte multiple
> > size. And debugging this without extra tooling (like retsnoop) is
> > going to be nightmarish.
> >
> > If the performance diff is huge and that if/else logic is
> > unacceptable, we can also internally pad with up to 3 zero bytes and
> > include those into the hash.
> I think the if/else logic is unavoidable if we support non 4-byte
> aligned value sizes,
> unless we are okay with truncating any remainder bytes of non 4-byte
> aligned values
> and stipulating that a bloom filter map value size has to be greater
> than 4 bytes (these
> conditions would allow us to use jhash2 for every value without an
> if/else check). If we
> internally pad, we will have to pad on every update and lookup, which
> would also
> require an if/else.
> Thanks for the comments and reviews, Alexei and Andrii. They are much
> appreciated!

I don't think truncation is an option. And I also forgot that we don't
really store values, so there is nothing to pad, really. So yeah, I'd
keep it as is, especially if that is not expensive (which I assume
it's not). As I mentioned before, that logic can be encapsulated in a
dedicated helper function and reused in a few places.

Patch

diff --git a/include/linux/bpf_types.h b/include/linux/bpf_types.h
index 9c81724e4b98..c4424ac2fa02 100644
--- a/include/linux/bpf_types.h
+++ b/include/linux/bpf_types.h
@@ -125,6 +125,7 @@  BPF_MAP_TYPE(BPF_MAP_TYPE_STACK, stack_map_ops)
 BPF_MAP_TYPE(BPF_MAP_TYPE_STRUCT_OPS, bpf_struct_ops_map_ops)
 #endif
 BPF_MAP_TYPE(BPF_MAP_TYPE_RINGBUF, ringbuf_map_ops)
+BPF_MAP_TYPE(BPF_MAP_TYPE_BLOOM_FILTER, bloom_filter_map_ops)
 
 BPF_LINK_TYPE(BPF_LINK_TYPE_RAW_TRACEPOINT, raw_tracepoint)
 BPF_LINK_TYPE(BPF_LINK_TYPE_TRACING, tracing)
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 791f31dd0abe..1d82860fd98e 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -906,6 +906,7 @@  enum bpf_map_type {
 	BPF_MAP_TYPE_RINGBUF,
 	BPF_MAP_TYPE_INODE_STORAGE,
 	BPF_MAP_TYPE_TASK_STORAGE,
+	BPF_MAP_TYPE_BLOOM_FILTER,
 };
 
 /* Note that tracing related programs such as
@@ -1210,6 +1211,15 @@  enum {
 
 /* Create a map that is suitable to be an inner map with dynamic max entries */
 	BPF_F_INNER_MAP		= (1U << 12),
+
+/* For bloom filter maps, the next 4 bits represent how many hashes to use.
+ * The maximum number of hash functions supported is 15. If this is not set,
+ * the default number of hash functions used will be 5.
+ */
+	BPF_F_BLOOM_FILTER_HASH_BIT_1 = (1U << 13),
+	BPF_F_BLOOM_FILTER_HASH_BIT_2 = (1U << 14),
+	BPF_F_BLOOM_FILTER_HASH_BIT_3 = (1U << 15),
+	BPF_F_BLOOM_FILTER_HASH_BIT_4 = (1U << 16),
 };
 
 /* Flags for BPF_PROG_QUERY. */
diff --git a/kernel/bpf/Makefile b/kernel/bpf/Makefile
index 7f33098ca63f..cf6ca339f3cd 100644
--- a/kernel/bpf/Makefile
+++ b/kernel/bpf/Makefile
@@ -7,7 +7,7 @@  endif
 CFLAGS_core.o += $(call cc-disable-warning, override-init) $(cflags-nogcse-yy)
 
 obj-$(CONFIG_BPF_SYSCALL) += syscall.o verifier.o inode.o helpers.o tnum.o bpf_iter.o map_iter.o task_iter.o prog_iter.o
-obj-$(CONFIG_BPF_SYSCALL) += hashtab.o arraymap.o percpu_freelist.o bpf_lru_list.o lpm_trie.o map_in_map.o
+obj-$(CONFIG_BPF_SYSCALL) += hashtab.o arraymap.o percpu_freelist.o bpf_lru_list.o lpm_trie.o map_in_map.o bloom_filter.o
 obj-$(CONFIG_BPF_SYSCALL) += local_storage.o queue_stack_maps.o ringbuf.o
 obj-$(CONFIG_BPF_SYSCALL) += bpf_local_storage.o bpf_task_storage.o
 obj-${CONFIG_BPF_LSM}	  += bpf_inode_storage.o
diff --git a/kernel/bpf/bloom_filter.c b/kernel/bpf/bloom_filter.c
new file mode 100644
index 000000000000..43a17c5b35ac
--- /dev/null
+++ b/kernel/bpf/bloom_filter.c
@@ -0,0 +1,205 @@ 
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2021 Facebook */
+
+#include <linux/bitmap.h>
+#include <linux/bpf.h>
+#include <linux/err.h>
+#include <linux/jhash.h>
+#include <linux/random.h>
+
+#define BLOOM_FILTER_HASH_BITMASK \
+	 (BPF_F_BLOOM_FILTER_HASH_BIT_1 | BPF_F_BLOOM_FILTER_HASH_BIT_2 | \
+	 BPF_F_BLOOM_FILTER_HASH_BIT_3 | BPF_F_BLOOM_FILTER_HASH_BIT_4)
+
+#define BLOOM_FILTER_CREATE_FLAG_MASK \
+	(BPF_F_NUMA_NODE | BPF_F_ZERO_SEED | BPF_F_ACCESS_MASK | \
+	 BLOOM_FILTER_HASH_BITMASK)
+
+struct bpf_bloom_filter {
+	struct bpf_map map;
+	u32 bit_array_mask;
+	u32 hash_seed;
+	/* If the size of the values in the bloom filter is u32 aligned,
+	 * then it is more performant to use jhash2 as the underlying hash
+	 * function, else we use jhash. This tracks the number of u32s
+	 * in an u32-aligned value size. If the value size is not u32 aligned,
+	 * this will be 0.
+	 */
+	u32 aligned_u32_count;
+	u8 nr_hashes;
+	unsigned long bit_array[];
+};
+
+static int bloom_filter_map_peek_elem(struct bpf_map *map, void *value)
+{
+	struct bpf_bloom_filter *bloom_filter =
+		container_of(map, struct bpf_bloom_filter, map);
+	u32 hash;
+	u8 i;
+
+	for (i = 0; i < bloom_filter->nr_hashes; i++) {
+		if (bloom_filter->aligned_u32_count)
+			hash = jhash2(value, bloom_filter->aligned_u32_count,
+				      bloom_filter->hash_seed + i) &
+				bloom_filter->bit_array_mask;
+		else
+			hash = jhash(value, map->value_size,
+				     bloom_filter->hash_seed + i) &
+				bloom_filter->bit_array_mask;
+
+		if (!test_bit(hash, bloom_filter->bit_array))
+			return -ENOENT;
+	}
+
+	return 0;
+}
+
+static u8 get_nr_hashes(u32 map_flags)
+{
+	u8 nr_hashes = (map_flags & BLOOM_FILTER_HASH_BITMASK) >>
+		ilog2(BPF_F_BLOOM_FILTER_HASH_BIT_1);
+
+	/* Default to 5 if no number of hashes was specified */
+	return nr_hashes == 0 ? 5 : nr_hashes;
+}
+
+static struct bpf_map *bloom_filter_map_alloc(union bpf_attr *attr)
+{
+	u32 nr_bits, bit_array_bytes, bit_array_mask;
+	int numa_node = bpf_map_attr_numa_node(attr);
+	struct bpf_bloom_filter *bloom_filter;
+	u8 nr_hashes;
+
+	if (!bpf_capable())
+		return ERR_PTR(-EPERM);
+
+	if (attr->key_size != 0 || attr->value_size == 0 || attr->max_entries == 0 ||
+	    attr->map_flags & ~BLOOM_FILTER_CREATE_FLAG_MASK ||
+	    !bpf_map_flags_access_ok(attr->map_flags))
+		return ERR_PTR(-EINVAL);
+
+	nr_hashes = get_nr_hashes(attr->map_flags);
+
+	/* For the bloom filter, the optimal bit array size that minimizes the
+	 * false positive probability is n * k / ln(2) where n is the number of
+	 * expected entries in the bloom filter and k is the number of hash
+	 * functions. We use 7 / 5 to approximate 1 / ln(2).
+	 *
+	 * We round this up to the nearest power of two to enable more efficient
+	 * hashing using bitmasks. The bitmask will be the bit array size - 1.
+	 *
+	 * If this overflows a u32, the bit array size will have 2^32 (4
+	 * GB) bits.
+	 */
+	if (check_mul_overflow(attr->max_entries, (u32)nr_hashes, &nr_bits) ||
+	    check_mul_overflow(nr_bits / 5, (u32)7, &nr_bits) ||
+	    nr_bits > (1UL << 31)) {
+		/* The bit array size is 2^32 bits but to avoid overflowing the
+		 * u32, we use BITS_TO_BYTES(U32_MAX), which will round up to the
+		 * equivalent number of bytes
+		 */
+		bit_array_bytes = BITS_TO_BYTES(U32_MAX);
+		bit_array_mask = U32_MAX;
+	} else {
+		if (nr_bits <= BITS_PER_LONG)
+			nr_bits = BITS_PER_LONG;
+		else
+			nr_bits = roundup_pow_of_two(nr_bits);
+		bit_array_bytes = BITS_TO_BYTES(nr_bits);
+		bit_array_mask = nr_bits - 1;
+	}
+
+	bit_array_bytes = roundup(bit_array_bytes, sizeof(unsigned long));
+	bloom_filter = bpf_map_area_alloc(sizeof(*bloom_filter) + bit_array_bytes,
+					  numa_node);
+
+	if (!bloom_filter)
+		return ERR_PTR(-ENOMEM);
+
+	bpf_map_init_from_attr(&bloom_filter->map, attr);
+
+	bloom_filter->nr_hashes = nr_hashes;
+	bloom_filter->bit_array_mask = bit_array_mask;
+	if ((attr->value_size & (sizeof(u32) - 1)) == 0)
+		bloom_filter->aligned_u32_count = attr->value_size / sizeof(u32);
+
+	if (!(attr->map_flags & BPF_F_ZERO_SEED))
+		bloom_filter->hash_seed = get_random_int();
+
+	return &bloom_filter->map;
+}
+
+static void bloom_filter_map_free(struct bpf_map *map)
+{
+	struct bpf_bloom_filter *bloom_filter =
+		container_of(map, struct bpf_bloom_filter, map);
+
+	bpf_map_area_free(bloom_filter);
+}
+
+static int bloom_filter_map_push_elem(struct bpf_map *map, void *value,
+				      u64 flags)
+{
+	struct bpf_bloom_filter *bloom_filter =
+		container_of(map, struct bpf_bloom_filter, map);
+	u32 hash;
+	u8 i;
+
+	if (flags != BPF_ANY)
+		return -EINVAL;
+
+	for (i = 0; i < bloom_filter->nr_hashes; i++) {
+		if (bloom_filter->aligned_u32_count)
+			hash = jhash2(value, bloom_filter->aligned_u32_count,
+				      bloom_filter->hash_seed + i) &
+				bloom_filter->bit_array_mask;
+		else
+			hash = jhash(value, map->value_size,
+				     bloom_filter->hash_seed + i) &
+				bloom_filter->bit_array_mask;
+
+		set_bit(hash, bloom_filter->bit_array);
+	}
+
+	return 0;
+}
+
+static void *bloom_filter_map_lookup_elem(struct bpf_map *map, void *key)
+{
+	/* The eBPF program should use map_peek_elem instead */
+	return ERR_PTR(-EINVAL);
+}
+
+static int bloom_filter_map_update_elem(struct bpf_map *map, void *key,
+					void *value, u64 flags)
+{
+	/* The eBPF program should use map_push_elem instead */
+	return -EINVAL;
+}
+
+static int bloom_filter_map_delete_elem(struct bpf_map *map, void *key)
+{
+	return -EOPNOTSUPP;
+}
+
+static int bloom_filter_map_get_next_key(struct bpf_map *map, void *key,
+					 void *next_key)
+{
+	return -EOPNOTSUPP;
+}
+
+static int bloom_filter_map_btf_id;
+const struct bpf_map_ops bloom_filter_map_ops = {
+	.map_meta_equal = bpf_map_meta_equal,
+	.map_alloc = bloom_filter_map_alloc,
+	.map_free = bloom_filter_map_free,
+	.map_push_elem = bloom_filter_map_push_elem,
+	.map_peek_elem = bloom_filter_map_peek_elem,
+	.map_lookup_elem = bloom_filter_map_lookup_elem,
+	.map_update_elem = bloom_filter_map_update_elem,
+	.map_delete_elem = bloom_filter_map_delete_elem,
+	.map_get_next_key = bloom_filter_map_get_next_key,
+	.map_check_btf = map_check_no_btf,
+	.map_btf_name = "bpf_bloom_filter",
+	.map_btf_id = &bloom_filter_map_btf_id,
+};
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 4e50c0bfdb7d..9865b5b1e667 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -199,7 +199,8 @@  static int bpf_map_update_value(struct bpf_map *map, struct fd f, void *key,
 		err = bpf_fd_reuseport_array_update_elem(map, key, value,
 							 flags);
 	} else if (map->map_type == BPF_MAP_TYPE_QUEUE ||
-		   map->map_type == BPF_MAP_TYPE_STACK) {
+		   map->map_type == BPF_MAP_TYPE_STACK ||
+		   map->map_type == BPF_MAP_TYPE_BLOOM_FILTER) {
 		err = map->ops->map_push_elem(map, value, flags);
 	} else {
 		rcu_read_lock();
@@ -238,7 +239,8 @@  static int bpf_map_copy_value(struct bpf_map *map, void *key, void *value,
 	} else if (map->map_type == BPF_MAP_TYPE_REUSEPORT_SOCKARRAY) {
 		err = bpf_fd_reuseport_array_lookup_elem(map, key, value);
 	} else if (map->map_type == BPF_MAP_TYPE_QUEUE ||
-		   map->map_type == BPF_MAP_TYPE_STACK) {
+		   map->map_type == BPF_MAP_TYPE_STACK ||
+		   map->map_type == BPF_MAP_TYPE_BLOOM_FILTER) {
 		err = map->ops->map_peek_elem(map, value);
 	} else if (map->map_type == BPF_MAP_TYPE_STRUCT_OPS) {
 		/* struct_ops map requires directly updating "value" */
@@ -1080,6 +1082,14 @@  static int map_lookup_elem(union bpf_attr *attr)
 	if (!value)
 		goto free_key;
 
+	if (map->map_type == BPF_MAP_TYPE_BLOOM_FILTER) {
+		if (copy_from_user(value, uvalue, value_size))
+			err = -EFAULT;
+		else
+			err = bpf_map_copy_value(map, key, value, attr->flags);
+		goto free_value;
+	}
+
 	err = bpf_map_copy_value(map, key, value, attr->flags);
 	if (err)
 		goto free_value;
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 047ac4b4703b..5cbcff4c2222 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -4813,7 +4813,10 @@  static int resolve_map_arg_type(struct bpf_verifier_env *env,
 			return -EINVAL;
 		}
 		break;
-
+	case BPF_MAP_TYPE_BLOOM_FILTER:
+		if (meta->func_id == BPF_FUNC_map_peek_elem)
+			*arg_type = ARG_PTR_TO_MAP_VALUE;
+		break;
 	default:
 		break;
 	}
@@ -5388,6 +5391,11 @@  static int check_map_func_compatibility(struct bpf_verifier_env *env,
 		    func_id != BPF_FUNC_task_storage_delete)
 			goto error;
 		break;
+	case BPF_MAP_TYPE_BLOOM_FILTER:
+		if (func_id != BPF_FUNC_map_push_elem &&
+		    func_id != BPF_FUNC_map_peek_elem)
+			goto error;
+		break;
 	default:
 		break;
 	}
@@ -5455,13 +5463,18 @@  static int check_map_func_compatibility(struct bpf_verifier_env *env,
 		    map->map_type != BPF_MAP_TYPE_SOCKHASH)
 			goto error;
 		break;
-	case BPF_FUNC_map_peek_elem:
 	case BPF_FUNC_map_pop_elem:
-	case BPF_FUNC_map_push_elem:
 		if (map->map_type != BPF_MAP_TYPE_QUEUE &&
 		    map->map_type != BPF_MAP_TYPE_STACK)
 			goto error;
 		break;
+	case BPF_FUNC_map_push_elem:
+	case BPF_FUNC_map_peek_elem:
+		if (map->map_type != BPF_MAP_TYPE_QUEUE &&
+		    map->map_type != BPF_MAP_TYPE_STACK &&
+		    map->map_type != BPF_MAP_TYPE_BLOOM_FILTER)
+			goto error;
+		break;
 	case BPF_FUNC_sk_storage_get:
 	case BPF_FUNC_sk_storage_delete:
 		if (map->map_type != BPF_MAP_TYPE_SK_STORAGE)
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 791f31dd0abe..1d82860fd98e 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -906,6 +906,7 @@  enum bpf_map_type {
 	BPF_MAP_TYPE_RINGBUF,
 	BPF_MAP_TYPE_INODE_STORAGE,
 	BPF_MAP_TYPE_TASK_STORAGE,
+	BPF_MAP_TYPE_BLOOM_FILTER,
 };
 
 /* Note that tracing related programs such as
@@ -1210,6 +1211,15 @@  enum {
 
 /* Create a map that is suitable to be an inner map with dynamic max entries */
 	BPF_F_INNER_MAP		= (1U << 12),
+
+/* For bloom filter maps, the next 4 bits represent how many hashes to use.
+ * The maximum number of hash functions supported is 15. If this is not set,
+ * the default number of hash functions used will be 5.
+ */
+	BPF_F_BLOOM_FILTER_HASH_BIT_1 = (1U << 13),
+	BPF_F_BLOOM_FILTER_HASH_BIT_2 = (1U << 14),
+	BPF_F_BLOOM_FILTER_HASH_BIT_3 = (1U << 15),
+	BPF_F_BLOOM_FILTER_HASH_BIT_4 = (1U << 16),
 };
 
 /* Flags for BPF_PROG_QUERY. */