Message ID | 20210831225005.2762202-2-joannekoong@fb.com (mailing list archive) |
---|---|
State | Superseded |
Delegated to: | BPF |
Series | Implement bloom filter map |
On Tue, Aug 31, 2021 at 03:50:01PM -0700, Joanne Koong wrote:
> +static int bloom_filter_map_peek_elem(struct bpf_map *map, void *value)
> +{
> +	struct bpf_bloom_filter *bloom_filter =
> +		container_of(map, struct bpf_bloom_filter, map);
> +	u32 i, hash;
> +
> +	for (i = 0; i < bloom_filter->map.nr_hashes; i++) {
> +		hash = jhash(value, map->value_size, bloom_filter->hash_seed + i) &
> +			bloom_filter->bit_array_mask;
> +		if (!test_bit(hash, bloom_filter->bit_array))
> +			return -ENOENT;
> +	}

I'm curious what bloom filter theory says about concurrent access
with updates when nr_hashes > 1, in terms of false negatives?
A race between two concurrent updates is prevented by the spin_lock,
but what about peek and update?
The update might set one bit, but not the other.
That shouldn't trigger a false negative lookup, right?

Is bloom filter supported as an inner map?
Hash and lru maps are often used as inner maps.
The lookups from them would be pre-filtered by a bloom filter
map that would have to be (in some cases) an inner map.
I suspect one bloom filter for all inner maps might be
a reasonable workaround in some cases too.
The delete is not supported in bloom filter, of course.
Would be good to mention it in the commit log.
Since there is no delete, the users would likely need
to replace the whole bloom filter. So map-in-map would
become necessary.
Do you think a 'clear-all' operation might be useful for bloom filter?
It feels that if map-in-map is supported then clear-all
is probably not that useful, since atomic replacement
and delete of the map would work better. 'clear-all' will have
issues with lookup, since it cannot be done in parallel.
Would be good to document all these ideas and restrictions.

Could you collect 'perf annotate' data for the above performance
critical loop?
I wonder whether using jhash2 and forcing u32 value size could speed it up.
Probably not, but would be good to check, since restricting value_size
later would be problematic due to backward compatibility.

The recommended nr_hashes=3 was computed with value_size=8, right?
I wonder whether nr_hashes would be different for value_size=16 and =4,
which are ipv6/ipv4 addresses, and value_size=40,
an approximation of a networking n-tuple.

> +static struct bpf_map *bloom_filter_map_alloc(union bpf_attr *attr)
> +{
> +	int numa_node = bpf_map_attr_numa_node(attr);
> +	u32 nr_bits, bit_array_bytes, bit_array_mask;
> +	struct bpf_bloom_filter *bloom_filter;
> +
> +	if (!bpf_capable())
> +		return ERR_PTR(-EPERM);
> +
> +	if (attr->key_size != 0 || attr->value_size == 0 || attr->max_entries == 0 ||
> +	    attr->nr_hashes == 0 || attr->map_flags & ~BLOOM_FILTER_CREATE_FLAG_MASK ||

Would it make sense to default to nr_hashes=3 if zero is passed?
This way the libbpf changes for nr_hashes will become 'optional'.
Most users wouldn't have to specify it explicitly.

Overall looks great!
Performance numbers are impressive.
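The peek/update question above can be explored in userspace. The following is a minimal sketch, not the patch's code: hash_bytes() is a seeded FNV-1a used as a stand-in for the kernel's jhash(), the sizes are arbitrary, and bloom_push_partial() is a hypothetical helper that models an update observed after only some of its bits have been set. Because a push only ever sets bits, a half-finished push cannot clear the bits of an element that was already fully added.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define NR_BITS   1024u	/* power of two, so masking works (assumption) */
#define NR_HASHES 3u	/* arbitrary choice for the sketch */

struct bloom {
	uint32_t seed;
	uint64_t bits[NR_BITS / 64];
};

/* Seeded FNV-1a; a stand-in for the kernel's jhash(), not the same function. */
static uint32_t hash_bytes(const void *data, size_t len, uint32_t seed)
{
	const unsigned char *p = data;
	uint32_t h = 2166136261u ^ seed;

	for (size_t i = 0; i < len; i++) {
		h ^= p[i];
		h *= 16777619u;
	}
	return h;
}

static uint32_t bit_index(const struct bloom *b, const void *v, size_t len,
			  uint32_t i)
{
	return hash_bytes(v, len, b->seed + i) & (NR_BITS - 1);
}

/* Hypothetical helper: set only the first nr_set bits of an element,
 * modelling a push that a concurrent reader sees midway.
 */
static void bloom_push_partial(struct bloom *b, const void *v, size_t len,
			       uint32_t nr_set)
{
	for (uint32_t i = 0; i < nr_set && i < NR_HASHES; i++) {
		uint32_t bit = bit_index(b, v, len, i);

		b->bits[bit / 64] |= 1ull << (bit % 64);
	}
}

static void bloom_push(struct bloom *b, const void *v, size_t len)
{
	bloom_push_partial(b, v, len, NR_HASHES);
}

/* Returns 0 if all designated bits are set, -ENOENT otherwise. */
static int bloom_peek(const struct bloom *b, const void *v, size_t len)
{
	for (uint32_t i = 0; i < NR_HASHES; i++) {
		uint32_t bit = bit_index(b, v, len, i);

		if (!(b->bits[bit / 64] & (1ull << (bit % 64))))
			return -ENOENT;
	}
	return 0;
}
```

An element whose own push is still in flight may be reported as absent, but that element has not finished being added yet, so this is not a false negative in the usual sense.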
On Tue, Aug 31, 2021 at 3:51 PM Joanne Koong <joannekoong@fb.com> wrote:
>
> Bloom filters are a space-efficient probabilistic data structure
> used to quickly test whether an element exists in a set.
> In a bloom filter, false positives are possible whereas false
> negatives are not.
>
> This patch adds a bloom filter map for bpf programs.
> The bloom filter map supports peek (determining whether an element
> is present in the map) and push (adding an element to the map)
> operations. These operations are exposed to userspace applications
> through the already existing syscalls in the following way:
>
> BPF_MAP_LOOKUP_ELEM -> peek
> BPF_MAP_UPDATE_ELEM -> push
>
> The bloom filter map does not have keys, only values. In light of
> this, the bloom filter map's API matches that of queue stack maps:
> user applications use BPF_MAP_LOOKUP_ELEM/BPF_MAP_UPDATE_ELEM
> which correspond internally to bpf_map_peek_elem/bpf_map_push_elem,
> and bpf programs must use the bpf_map_peek_elem and bpf_map_push_elem
> APIs to query or add an element to the bloom filter map. When the
> bloom filter map is created, it must be created with a key_size of 0.
>
> For updates, the user will pass in the element to add to the map
> as the value, with a NULL key. For lookups, the user will pass in the
> element to query in the map as the value. In the verifier layer, this
> requires us to modify the argument type of a bloom filter's
> BPF_FUNC_map_peek_elem call to ARG_PTR_TO_MAP_VALUE; as well, in
> the syscall layer, we need to copy over the user value so that in
> bpf_map_peek_elem, we know which specific value to query.
>
> The maximum number of entries in the bloom filter is not enforced; if
> the user wishes to insert more entries into the bloom filter than they
> specified as the max entries size of the bloom filter, that is permitted
> but the performance of their bloom filter will have a higher false
> positive rate.
>
> The number of hashes to use for the bloom filter is configurable from
> userspace. The benchmarks later in this patchset can help compare the
> performances of different number of hashes on different entry
> sizes. In general, using more hashes decreases the speed of a lookup,
> but decreases the false positive rate of an element being detected in the
> bloom filter.
>
> Signed-off-by: Joanne Koong <joannekoong@fb.com>
> ---

This looks nice and simple. I left a few comments below.

But one high-level point I wanted to discuss was that bloom filter
logic is actually simple enough to be implementable by pure BPF
program logic. The only problematic part is generic hashing of a piece
of memory. Regardless of implementing bloom filter as kernel-provided
BPF map or implementing it with custom BPF program logic, having a BPF
helper for hashing a piece of memory seems extremely useful and very
generic. I can't recall if we ever discussed adding such helpers, but
maybe we should?

It would be a really interesting experiment to implement the same
logic in pure BPF logic and run it as another benchmark, alongside the
Bloom filter map. BPF has both spinlock and atomic operation, so we
can compare and contrast. We only miss a hashing BPF helper.

Being able to do this in pure BPF code has a bunch of advantages.
Depending on specific application, users can decide to:
  - speed up the operation by ditching spinlock or atomic operation,
if the logic allows to lose some bit updates;
  - decide on optimal size, which might not be a power of 2, depending
on memory vs CPU trade-off in any particular case;
  - it's also possible to implement a more general Counting Bloom
filter, all without modifying the kernel.

We could go further, and start implementing other simple data
structures relying on hashing, like HyperLogLog. And all with no
kernel modifications.
Map-in-map is no issue as well, because there is
a choice of using either fixed global data arrays for maximum
performance, or using BPF_MAP_TYPE_ARRAY maps that can go inside
map-in-map.

Basically, regardless of having this map in the kernel or not, let's
have a "universal" hashing function as a BPF helper as well.

Thoughts?

>  include/linux/bpf.h            |   3 +-
>  include/linux/bpf_types.h      |   1 +
>  include/uapi/linux/bpf.h       |   3 +
>  kernel/bpf/Makefile            |   2 +-
>  kernel/bpf/bloom_filter.c      | 171 +++++++++++++++++++++++++++++++++
>  kernel/bpf/syscall.c           |  20 +++-
>  kernel/bpf/verifier.c          |  19 +++-
>  tools/include/uapi/linux/bpf.h |   3 +
>  8 files changed, 214 insertions(+), 8 deletions(-)
>  create mode 100644 kernel/bpf/bloom_filter.c
>
> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> index f4c16f19f83e..2abaa1052096 100644
> --- a/include/linux/bpf.h
> +++ b/include/linux/bpf.h
> @@ -181,7 +181,8 @@ struct bpf_map {
>  	u32 btf_vmlinux_value_type_id;
>  	bool bypass_spec_v1;
>  	bool frozen; /* write-once; write-protected by freeze_mutex */
> -	/* 22 bytes hole */
> +	u32 nr_hashes; /* used for bloom filter maps */
> +	/* 18 bytes hole */
>
>  	/* The 3rd and 4th cacheline with misc members to avoid false sharing
>  	 * particularly with refcounting.
> diff --git a/include/linux/bpf_types.h b/include/linux/bpf_types.h
> index 9c81724e4b98..c4424ac2fa02 100644
> --- a/include/linux/bpf_types.h
> +++ b/include/linux/bpf_types.h
> @@ -125,6 +125,7 @@ BPF_MAP_TYPE(BPF_MAP_TYPE_STACK, stack_map_ops)
>  BPF_MAP_TYPE(BPF_MAP_TYPE_STRUCT_OPS, bpf_struct_ops_map_ops)
>  #endif
>  BPF_MAP_TYPE(BPF_MAP_TYPE_RINGBUF, ringbuf_map_ops)
> +BPF_MAP_TYPE(BPF_MAP_TYPE_BLOOM_FILTER, bloom_filter_map_ops)
>
>  BPF_LINK_TYPE(BPF_LINK_TYPE_RAW_TRACEPOINT, raw_tracepoint)
>  BPF_LINK_TYPE(BPF_LINK_TYPE_TRACING, tracing)
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index 791f31dd0abe..c2acb0a510fe 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -906,6 +906,7 @@ enum bpf_map_type {
>  	BPF_MAP_TYPE_RINGBUF,
>  	BPF_MAP_TYPE_INODE_STORAGE,
>  	BPF_MAP_TYPE_TASK_STORAGE,
> +	BPF_MAP_TYPE_BLOOM_FILTER,
>  };
>
>  /* Note that tracing related programs such as
> @@ -1274,6 +1275,7 @@ union bpf_attr {
>  		 * struct stored as the
>  		 * map value
>  		 */
> +	__u32 nr_hashes; /* used for configuring bloom filter maps */

This feels like a bit too one-off property that won't be ever reused
by any other type of map. Also consider that we should probably limit
nr_hashes to some pretty small sane value (<16? <64?) to prevent easy
DoS from inside BPF programs (e.g., set nr_hash to 2bln, each
operation is now extremely slow and CPU intensive). So with that,
maybe let's provide number of hashes as part of map_flags? And as
Alexei proposed, zero would mean some recommended value (2 or 3,
right?). This would also mean that libbpf won't need to know about
one-off map property in parsing BTF map definitions.
>  };
>
>  struct { /* anonymous struct used by BPF_MAP_*_ELEM commands */
> @@ -5594,6 +5596,7 @@ struct bpf_map_info {
>  	__u32 btf_id;
>  	__u32 btf_key_type_id;
>  	__u32 btf_value_type_id;
> +	__u32 nr_hashes; /* used for bloom filter maps */
>  } __attribute__((aligned(8)));
>
>  struct bpf_btf_info {
> diff --git a/kernel/bpf/Makefile b/kernel/bpf/Makefile
> index 7f33098ca63f..cf6ca339f3cd 100644
> --- a/kernel/bpf/Makefile
> +++ b/kernel/bpf/Makefile
> @@ -7,7 +7,7 @@ endif
>  CFLAGS_core.o += $(call cc-disable-warning, override-init) $(cflags-nogcse-yy)
>
>  obj-$(CONFIG_BPF_SYSCALL) += syscall.o verifier.o inode.o helpers.o tnum.o bpf_iter.o map_iter.o task_iter.o prog_iter.o
> -obj-$(CONFIG_BPF_SYSCALL) += hashtab.o arraymap.o percpu_freelist.o bpf_lru_list.o lpm_trie.o map_in_map.o
> +obj-$(CONFIG_BPF_SYSCALL) += hashtab.o arraymap.o percpu_freelist.o bpf_lru_list.o lpm_trie.o map_in_map.o bloom_filter.o
>  obj-$(CONFIG_BPF_SYSCALL) += local_storage.o queue_stack_maps.o ringbuf.o
>  obj-$(CONFIG_BPF_SYSCALL) += bpf_local_storage.o bpf_task_storage.o
>  obj-${CONFIG_BPF_LSM} += bpf_inode_storage.o
> diff --git a/kernel/bpf/bloom_filter.c b/kernel/bpf/bloom_filter.c
> new file mode 100644
> index 000000000000..3ae799ab3747
> --- /dev/null
> +++ b/kernel/bpf/bloom_filter.c
> @@ -0,0 +1,171 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/* Copyright (c) 2021 Facebook */
> +
> +#include <linux/bitmap.h>
> +#include <linux/bpf.h>
> +#include <linux/err.h>
> +#include <linux/jhash.h>
> +#include <linux/random.h>
> +#include <linux/spinlock.h>
> +
> +#define BLOOM_FILTER_CREATE_FLAG_MASK \
> +	(BPF_F_NUMA_NODE | BPF_F_ZERO_SEED | BPF_F_ACCESS_MASK)
> +
> +struct bpf_bloom_filter {
> +	struct bpf_map map;
> +	u32 bit_array_mask;
> +	u32 hash_seed;
> +	/* Used for synchronizing parallel writes to the bit array */
> +	spinlock_t spinlock;
> +	unsigned long bit_array[];
> +};
> +
> +static int bloom_filter_map_peek_elem(struct bpf_map *map, void *value)
> +{
> +	struct bpf_bloom_filter *bloom_filter =
> +		container_of(map, struct bpf_bloom_filter, map);
> +	u32 i, hash;
> +
> +	for (i = 0; i < bloom_filter->map.nr_hashes; i++) {
> +		hash = jhash(value, map->value_size, bloom_filter->hash_seed + i) &
> +			bloom_filter->bit_array_mask;
> +		if (!test_bit(hash, bloom_filter->bit_array))
> +			return -ENOENT;
> +	}
> +
> +	return 0;
> +}
> +
> +static struct bpf_map *bloom_filter_map_alloc(union bpf_attr *attr)
> +{
> +	int numa_node = bpf_map_attr_numa_node(attr);
> +	u32 nr_bits, bit_array_bytes, bit_array_mask;
> +	struct bpf_bloom_filter *bloom_filter;
> +
> +	if (!bpf_capable())
> +		return ERR_PTR(-EPERM);
> +
> +	if (attr->key_size != 0 || attr->value_size == 0 || attr->max_entries == 0 ||
> +	    attr->nr_hashes == 0 || attr->map_flags & ~BLOOM_FILTER_CREATE_FLAG_MASK ||
> +	    !bpf_map_flags_access_ok(attr->map_flags))
> +		return ERR_PTR(-EINVAL);
> +
> +	/* For the bloom filter, the optimal bit array size that minimizes the
> +	 * false positive probability is n * k / ln(2) where n is the number of
> +	 * expected entries in the bloom filter and k is the number of hash
> +	 * functions. We use 7 / 5 to approximate 1 / ln(2).
> +	 *
> +	 * We round this up to the nearest power of two to enable more efficient
> +	 * hashing using bitmasks. The bitmask will be the bit array size - 1.
> +	 *
> +	 * If this overflows a u32, the bit array size will have 2^32 (4
> +	 * GB) bits.
> +	 */
> +	if (unlikely(check_mul_overflow(attr->max_entries, attr->nr_hashes, &nr_bits)) ||
> +	    unlikely(check_mul_overflow(nr_bits / 5, (u32)7, &nr_bits)) ||
> +	    unlikely(nr_bits > (1UL << 31))) {

nit: map_alloc is not performance-critical (because it's infrequent),
so unlikely() are probably unnecessary?
> +		/* The bit array size is 2^32 bits but to avoid overflowing the
> +		 * u32, we use BITS_TO_BYTES(U32_MAX), which will round up to the
> +		 * equivalent number of bytes
> +		 */
> +		bit_array_bytes = BITS_TO_BYTES(U32_MAX);
> +		bit_array_mask = U32_MAX;
> +	} else {
> +		if (nr_bits <= BITS_PER_LONG)
> +			nr_bits = BITS_PER_LONG;
> +		else
> +			nr_bits = roundup_pow_of_two(nr_bits);
> +		bit_array_bytes = BITS_TO_BYTES(nr_bits);
> +		bit_array_mask = nr_bits - 1;
> +	}
> +
> +	bit_array_bytes = roundup(bit_array_bytes, sizeof(unsigned long));
> +	bloom_filter = bpf_map_area_alloc(sizeof(*bloom_filter) + bit_array_bytes,
> +					  numa_node);
> +
> +	if (!bloom_filter)
> +		return ERR_PTR(-ENOMEM);
> +
> +	bpf_map_init_from_attr(&bloom_filter->map, attr);
> +	bloom_filter->map.nr_hashes = attr->nr_hashes;
> +
> +	bloom_filter->bit_array_mask = bit_array_mask;
> +	spin_lock_init(&bloom_filter->spinlock);
> +
> +	if (!(attr->map_flags & BPF_F_ZERO_SEED))
> +		bloom_filter->hash_seed = get_random_int();
> +
> +	return &bloom_filter->map;
> +}
> +
> +static void bloom_filter_map_free(struct bpf_map *map)
> +{
> +	struct bpf_bloom_filter *bloom_filter =
> +		container_of(map, struct bpf_bloom_filter, map);
> +
> +	bpf_map_area_free(bloom_filter);
> +}
> +
> +static int bloom_filter_map_push_elem(struct bpf_map *map, void *value,
> +				      u64 flags)
> +{
> +	struct bpf_bloom_filter *bloom_filter =
> +		container_of(map, struct bpf_bloom_filter, map);
> +	unsigned long spinlock_flags;
> +	u32 i, hash;
> +
> +	if (flags != BPF_ANY)
> +		return -EINVAL;
> +
> +	spin_lock_irqsave(&bloom_filter->spinlock, spinlock_flags);
> +

If value_size is pretty big, hashing might take a noticeable amount of
CPU, during which we'll be keeping spinlock. With what I said above
about sane number of hashes, if we bound it to small reasonable number
(e.g., 16), we can have a local 16-element array with hashes
calculated before we take lock. That way spinlock will be held only
for few bit flips.

Also, I wonder if ditching spinlock in favor of atomic bit set
operation would improve performance in typical scenarios. Seems like
set_bit() is an atomic operation, so it should be easy to test. Do you
mind running benchmarks with spinlock and with set_bit()?

> +	for (i = 0; i < bloom_filter->map.nr_hashes; i++) {
> +		hash = jhash(value, map->value_size, bloom_filter->hash_seed + i) &
> +			bloom_filter->bit_array_mask;
> +		bitmap_set(bloom_filter->bit_array, hash, 1);
> +	}
> +
> +	spin_unlock_irqrestore(&bloom_filter->spinlock, spinlock_flags);
> +
> +	return 0;
> +}
> +

[...]
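The two suggestions in this review (hash into a small local array first, then set bits atomically without a lock) can be sketched in userspace C11. This is an illustration under assumptions, not kernel code: MAX_HASHES is a hypothetical bound, hash_bytes() is a seeded FNV-1a standing in for jhash(), and atomic_fetch_or_explicit() plays the role the kernel's set_bit() would play.

```c
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_HASHES    16u	/* hypothetical upper bound on nr_hashes */
#define BITS_PER_WORD (8 * sizeof(unsigned long))

struct bloom {
	uint32_t nr_hashes;		/* must be <= MAX_HASHES */
	uint32_t bit_array_mask;	/* number of bits - 1; bits is a power of two */
	uint32_t hash_seed;
	_Atomic unsigned long *bit_array;
};

/* Seeded FNV-1a; a stand-in for the kernel's jhash(), not the same function. */
static uint32_t hash_bytes(const void *data, size_t len, uint32_t seed)
{
	const unsigned char *p = data;
	uint32_t h = 2166136261u ^ seed;

	for (size_t i = 0; i < len; i++) {
		h ^= p[i];
		h *= 16777619u;
	}
	return h;
}

static int bloom_push(struct bloom *b, const void *value, size_t len)
{
	uint32_t idx[MAX_HASHES];
	uint32_t i;

	if (b->nr_hashes > MAX_HASHES)
		return -1;

	/* Step 1: do all the hashing before touching shared state. */
	for (i = 0; i < b->nr_hashes; i++)
		idx[i] = hash_bytes(value, len, b->hash_seed + i) &
			 b->bit_array_mask;

	/* Step 2: one atomic OR per bit; no lock is taken at any point. */
	for (i = 0; i < b->nr_hashes; i++)
		atomic_fetch_or_explicit(&b->bit_array[idx[i] / BITS_PER_WORD],
					 1ul << (idx[i] % BITS_PER_WORD),
					 memory_order_relaxed);
	return 0;
}

/* Returns 1 if every designated bit is set, 0 otherwise. */
static int bloom_peek(struct bloom *b, const void *value, size_t len)
{
	for (uint32_t i = 0; i < b->nr_hashes; i++) {
		uint32_t bit = hash_bytes(value, len, b->hash_seed + i) &
			       b->bit_array_mask;
		unsigned long word = atomic_load_explicit(
			&b->bit_array[bit / BITS_PER_WORD], memory_order_relaxed);

		if (!(word & (1ul << (bit % BITS_PER_WORD))))
			return 0;
	}
	return 1;
}
```

Relaxed ordering is an assumption that suits a structure where bits only ever go from 0 to 1; whether this beats the spinlock is exactly what the requested benchmark would have to show.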
Joanne Koong wrote:
> Bloom filters are a space-efficient probabilistic data structure
> used to quickly test whether an element exists in a set.
> In a bloom filter, false positives are possible whereas false
> negatives are not.
>
> This patch adds a bloom filter map for bpf programs.
> The bloom filter map supports peek (determining whether an element
> is present in the map) and push (adding an element to the map)
> operations. These operations are exposed to userspace applications
> through the already existing syscalls in the following way:
>
> BPF_MAP_LOOKUP_ELEM -> peek
> BPF_MAP_UPDATE_ELEM -> push
>
> The bloom filter map does not have keys, only values. In light of
> this, the bloom filter map's API matches that of queue stack maps:
> user applications use BPF_MAP_LOOKUP_ELEM/BPF_MAP_UPDATE_ELEM
> which correspond internally to bpf_map_peek_elem/bpf_map_push_elem,
> and bpf programs must use the bpf_map_peek_elem and bpf_map_push_elem
> APIs to query or add an element to the bloom filter map. When the
> bloom filter map is created, it must be created with a key_size of 0.
>
> For updates, the user will pass in the element to add to the map
> as the value, with a NULL key. For lookups, the user will pass in the
> element to query in the map as the value. In the verifier layer, this
> requires us to modify the argument type of a bloom filter's
> BPF_FUNC_map_peek_elem call to ARG_PTR_TO_MAP_VALUE; as well, in
> the syscall layer, we need to copy over the user value so that in
> bpf_map_peek_elem, we know which specific value to query.
>
> The maximum number of entries in the bloom filter is not enforced; if
> the user wishes to insert more entries into the bloom filter than they
> specified as the max entries size of the bloom filter, that is permitted
> but the performance of their bloom filter will have a higher false
> positive rate.

Hmm, I'm wondering if this means the memory footprint can grow without
bounds? Typically maps have an upper bound on memory established at
alloc time.

In queue_stack_map_alloc() we have,

 queue_size = sizeof(*qs) + size * attr->value_size;
 bpf_map_area_alloc(queue_size, numa_node)

In hashmap (not preallocated) we have alloc_htab_elem(), which will
give us an upper bound.

Is there a practical value in allowing these to grow endlessly? And
should we be charging the value memory against something? In
bpf_map_kmalloc_node we set_active_memcg() for example.

I'll review the code as well, but I think the above is worth some thought.

>
> The number of hashes to use for the bloom filter is configurable from
> userspace. The benchmarks later in this patchset can help compare the
> performances of different number of hashes on different entry
> sizes. In general, using more hashes decreases the speed of a lookup,
> but decreases the false positive rate of an element being detected in the
> bloom filter.
>
> Signed-off-by: Joanne Koong <joannekoong@fb.com>
On Wed, Sep 1, 2021 at 8:18 PM John Fastabend <john.fastabend@gmail.com> wrote:
>
> Joanne Koong wrote:
> > Bloom filters are a space-efficient probabilistic data structure
> > used to quickly test whether an element exists in a set.
> > In a bloom filter, false positives are possible whereas false
> > negatives are not.
> >
> > This patch adds a bloom filter map for bpf programs.
> > The bloom filter map supports peek (determining whether an element
> > is present in the map) and push (adding an element to the map)
> > operations. These operations are exposed to userspace applications
> > through the already existing syscalls in the following way:
> >
> > BPF_MAP_LOOKUP_ELEM -> peek
> > BPF_MAP_UPDATE_ELEM -> push
> >
> > The bloom filter map does not have keys, only values. In light of
> > this, the bloom filter map's API matches that of queue stack maps:
> > user applications use BPF_MAP_LOOKUP_ELEM/BPF_MAP_UPDATE_ELEM
> > which correspond internally to bpf_map_peek_elem/bpf_map_push_elem,
> > and bpf programs must use the bpf_map_peek_elem and bpf_map_push_elem
> > APIs to query or add an element to the bloom filter map. When the
> > bloom filter map is created, it must be created with a key_size of 0.
> >
> > For updates, the user will pass in the element to add to the map
> > as the value, with a NULL key. For lookups, the user will pass in the
> > element to query in the map as the value. In the verifier layer, this
> > requires us to modify the argument type of a bloom filter's
> > BPF_FUNC_map_peek_elem call to ARG_PTR_TO_MAP_VALUE; as well, in
> > the syscall layer, we need to copy over the user value so that in
> > bpf_map_peek_elem, we know which specific value to query.
> >
> > The maximum number of entries in the bloom filter is not enforced; if
> > the user wishes to insert more entries into the bloom filter than they
> > specified as the max entries size of the bloom filter, that is permitted
> > but the performance of their bloom filter will have a higher false
> > positive rate.
>
> hmm I'm wondering if this means the memory footprint can grow without
> bounds? Typically maps have an upper bound on memory established at
> alloc time.

It's a bit unfortunate wording, but no, the amount of used memory is
fixed. Bloom filter is a probabilistic data structure in which each
"value" has a few designated bits, determined by hash functions on that
value. The number of bits is fixed, though. If all designated bits are
set to 1, then we declare "value" to be present in the Bloom filter.
If at least one is 0, then we definitely didn't see "value" yet
(that's what guarantees no false negatives; this also answers Alexei's
worry about possible false negative due to unsynchronized update and
lookup, it can't happen by the nature of the data structure design,
regardless of synchronization). We can, of course, have all such bits
set to 1 even if the actual value was never "added" into the Bloom
filter, just by the nature of hash collisions with other elements'
hash functions (that's where the false positive comes from). It might
be useful to just leave a link to Wikipedia for description of Bloom
filter data structure ([0]).

  [0] https://en.wikipedia.org/wiki/Bloom_filter

>
> In queue_stack_map_alloc() we have,
>
>  queue_size = sizeof(*qs) + size * attr->value_size;
>  bpf_map_area_alloc(queue_size, numa_node)
>
> In hashmap (not preallocated) we have, alloc_htab_elem() that will
> give us an upper bound.
>
> Is there a practical value in allowing these to grow endlessly? And
> should we be charging the value memory against something? In
> bpf_map_kmalloc_node we set_active_memcg() for example.
>
> I'll review code as well, but think above is worth some thought.
>
> >
> > The number of hashes to use for the bloom filter is configurable from
> > userspace. The benchmarks later in this patchset can help compare the
> > performances of different number of hashes on different entry
> > sizes. In general, using more hashes decreases the speed of a lookup,
> > but decreases the false positive rate of an element being detected in the
> > bloom filter.
> >
> > Signed-off-by: Joanne Koong <joannekoong@fb.com>
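To make the "memory is fixed at alloc time" point concrete, here is a userspace sketch of the sizing arithmetic described in the patch's comment (n * k * 7 / 5, rounded up to a power of two, capped at 2^32 bits). The names struct bloom_size and bloom_compute_size() are hypothetical, the 64-bit floor stands in for BITS_PER_LONG on an LP64 machine, and 64-bit intermediate math replaces the patch's check_mul_overflow() calls.

```c
#include <stdint.h>

/* Hypothetical result type for the sketch. */
struct bloom_size {
	uint32_t bit_array_mask;	/* number of bits - 1 */
	uint32_t bit_array_bytes;	/* bytes needed for the bit array */
};

static uint64_t roundup_pow_of_two64(uint64_t v)
{
	uint64_t p = 1;

	while (p < v)
		p <<= 1;
	return p;
}

static struct bloom_size bloom_compute_size(uint32_t max_entries,
					    uint32_t nr_hashes)
{
	struct bloom_size s;
	uint64_t product = (uint64_t)max_entries * nr_hashes;
	/* 7 / 5 approximates 1 / ln(2), as in the patch comment. */
	uint64_t nr_bits = product <= UINT32_MAX ? product / 5 * 7 : 0;

	if (product > UINT32_MAX || nr_bits > (1ull << 31)) {
		/* Cap at 2^32 bits, i.e. 2^29 bytes. */
		s.bit_array_mask = UINT32_MAX;
		s.bit_array_bytes = 1u << 29;
		return s;
	}

	if (nr_bits <= 64)	/* assumed BITS_PER_LONG == 64 */
		nr_bits = 64;
	else
		nr_bits = roundup_pow_of_two64(nr_bits);

	s.bit_array_mask = (uint32_t)(nr_bits - 1);
	s.bit_array_bytes = (uint32_t)(nr_bits / 8);
	return s;
}
```

For example, max_entries = 1000 with nr_hashes = 3 gives 3000 / 5 * 7 = 4200 bits, rounded up to 8192 bits (1024 bytes). Nothing in push or peek changes that size afterwards; extra insertions only set more of the same bits.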
Andrii Nakryiko wrote:
> On Wed, Sep 1, 2021 at 8:18 PM John Fastabend <john.fastabend@gmail.com> wrote:
> >
> > Joanne Koong wrote:
> > > Bloom filters are a space-efficient probabilistic data structure
> > > used to quickly test whether an element exists in a set.
> > > In a bloom filter, false positives are possible whereas false
> > > negatives are not.
> > >
> > > This patch adds a bloom filter map for bpf programs.
> > > The bloom filter map supports peek (determining whether an element
> > > is present in the map) and push (adding an element to the map)
> > > operations. These operations are exposed to userspace applications
> > > through the already existing syscalls in the following way:
> > >
> > > BPF_MAP_LOOKUP_ELEM -> peek
> > > BPF_MAP_UPDATE_ELEM -> push
> > >
> > > The bloom filter map does not have keys, only values. In light of
> > > this, the bloom filter map's API matches that of queue stack maps:
> > > user applications use BPF_MAP_LOOKUP_ELEM/BPF_MAP_UPDATE_ELEM
> > > which correspond internally to bpf_map_peek_elem/bpf_map_push_elem,
> > > and bpf programs must use the bpf_map_peek_elem and bpf_map_push_elem
> > > APIs to query or add an element to the bloom filter map. When the
> > > bloom filter map is created, it must be created with a key_size of 0.
> > >
> > > For updates, the user will pass in the element to add to the map
> > > as the value, with a NULL key. For lookups, the user will pass in the
> > > element to query in the map as the value. In the verifier layer, this
> > > requires us to modify the argument type of a bloom filter's
> > > BPF_FUNC_map_peek_elem call to ARG_PTR_TO_MAP_VALUE; as well, in
> > > the syscall layer, we need to copy over the user value so that in
> > > bpf_map_peek_elem, we know which specific value to query.
> > >
> > > The maximum number of entries in the bloom filter is not enforced; if
> > > the user wishes to insert more entries into the bloom filter than they
> > > specified as the max entries size of the bloom filter, that is permitted
> > > but the performance of their bloom filter will have a higher false
> > > positive rate.
> >
> > hmm I'm wondering if this means the memory footprint can grow without
> > bounds? Typically maps have an upper bound on memory established at
> > alloc time.
>
> It's a bit unfortunate wording, but no, the amount of used memory is
> fixed. Bloom filter is a probabilistic data structure in which each
> "value" has few designated bits, determined by hash functions on that
> value. The number of bits is fixed, though. If all designated bits are
> set to 1, then we declare "value" to be present in the Bloom filter.

Thanks, actually reading the code helped as well. Looks like a v2 will
likely happen, so perhaps add a small note here that the maximum number
of entries in the bloom filter is only used to estimate the number of
bits used.

I guess if a BPF user did want to enforce the max number of entries, a
simple BPF counter wouldn't be much for users to add. I usually add
these to maps for debug/statistic reasons anyway.

> If at least one is 0, then we definitely didn't see "value" yet
> (that's what guarantees no false negatives; this also answers Alexei's
> worry about possible false negative due to unsynchronized update and
> lookup, it can't happen by the nature of the data structure design,
> regardless of synchronization). We can, of course, have all such bits
> set to 1 even if the actual value was never "added" into the Bloom
> filter, just by the nature of hash collisions with other elements'
> hash functions (that's where the false positive comes from). It might
> be useful to just leave a link to Wikipedia for description of Bloom
> filter data structure ([0]).
>
>   [0] https://en.wikipedia.org/wiki/Bloom_filter

Thanks. Yep, needed a refresher for sure.
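As a refresher on why exceeding max_entries only costs accuracy, the textbook false positive estimate can be combined with the sizing rule quoted in the patch comment (m = n k / ln 2 bits for n expected entries and k hashes). This is the standard approximation that treats hash outputs as independent and uniform; it is not a measurement of this patch, and m, n, k, p are the usual symbols rather than names from the code.

```latex
p \approx \left(1 - e^{-kn/m}\right)^{k},
\qquad
m = \frac{n k}{\ln 2}
\;\Rightarrow\;
e^{-kn/m} = \tfrac{1}{2}
\;\Rightarrow\;
p \approx 2^{-k}
```

Under that approximation k = 3 gives roughly 12.5%. Rounding m up to a power of two can only lower p, while inserting more than n elements raises kn/m and therefore p; m itself never changes.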
Andrii Nakryiko wrote:
> On Tue, Aug 31, 2021 at 3:51 PM Joanne Koong <joannekoong@fb.com> wrote:
> >
> > Bloom filters are a space-efficient probabilistic data structure
> > used to quickly test whether an element exists in a set.
> > In a bloom filter, false positives are possible whereas false
> > negatives are not.
> >
> > This patch adds a bloom filter map for bpf programs.
> > The bloom filter map supports peek (determining whether an element
> > is present in the map) and push (adding an element to the map)
> > operations. These operations are exposed to userspace applications
> > through the already existing syscalls in the following way:
> >
> > BPF_MAP_LOOKUP_ELEM -> peek
> > BPF_MAP_UPDATE_ELEM -> push
> >
> > The bloom filter map does not have keys, only values. In light of
> > this, the bloom filter map's API matches that of queue stack maps:
> > user applications use BPF_MAP_LOOKUP_ELEM/BPF_MAP_UPDATE_ELEM
> > which correspond internally to bpf_map_peek_elem/bpf_map_push_elem,
> > and bpf programs must use the bpf_map_peek_elem and bpf_map_push_elem
> > APIs to query or add an element to the bloom filter map. When the
> > bloom filter map is created, it must be created with a key_size of 0.
> >
> > For updates, the user will pass in the element to add to the map
> > as the value, with a NULL key. For lookups, the user will pass in the
> > element to query in the map as the value. In the verifier layer, this
> > requires us to modify the argument type of a bloom filter's
> > BPF_FUNC_map_peek_elem call to ARG_PTR_TO_MAP_VALUE; as well, in
> > the syscall layer, we need to copy over the user value so that in
> > bpf_map_peek_elem, we know which specific value to query.
> >
> > The maximum number of entries in the bloom filter is not enforced; if
> > the user wishes to insert more entries into the bloom filter than they
> > specified as the max entries size of the bloom filter, that is permitted
> > but the performance of their bloom filter will have a higher false
> > positive rate.
> >
> > The number of hashes to use for the bloom filter is configurable from
> > userspace. The benchmarks later in this patchset can help compare the
> > performances of different number of hashes on different entry
> > sizes. In general, using more hashes decreases the speed of a lookup,
> > but decreases the false positive rate of an element being detected in the
> > bloom filter.
> >
> > Signed-off-by: Joanne Koong <joannekoong@fb.com>
> > ---
>
> This looks nice and simple. I left a few comments below.
>
> But one high-level point I wanted to discuss was that bloom filter
> logic is actually simple enough to be implementable by pure BPF
> program logic. The only problematic part is generic hashing of a piece
> of memory. Regardless of implementing bloom filter as kernel-provided
> BPF map or implementing it with custom BPF program logic, having BPF
> helper for hashing a piece of memory seems extremely useful and very
> generic. I can't recall if we ever discussed adding such helpers, but
> maybe we should?

Aha, started typing the same thing :)

Adding a generic hash helper has been on my todo list and is close to the
top now. The use case is hashing data from skb payloads and such from the
kprobe and sockmap side. I'm happy to work on it as soon as possible if
no one else picks it up.

>
> It would be a really interesting experiment to implement the same
> logic in pure BPF logic and run it as another benchmark, along the
> Bloom filter map. BPF has both spinlock and atomic operation, so we
> can compare and contrast. We only miss hashing BPF helper.

The one issue I've found writing hash logic is that it's a bit tricky
to get the verifier to consume it. Especially when the hash is nested
inside a for loop, and sometimes a couple of for loops, so you end up
with things like,

 for (i = 0; i < someTlvs; i++) {
  for (j = 0; j < someKeys; j++) {
    ...
    bpf_hash(someValue)
    ...
  }
 }

I've found that small, seemingly unrelated changes cause the complexity
limit to explode. Usually we can work around it with code to get pruning
points and such, but it's a bit ugly. Perhaps this means we need to dive
into the details of why the complexity explodes, but I've not got to it
yet. The todo list is long.

>
> Being able to do this in pure BPF code has a bunch of advantages.
> Depending on specific application, users can decide to:
>   - speed up the operation by ditching spinlock or atomic operation,
> if the logic allows to lose some bit updates;
>   - decide on optimal size, which might not be a power of 2, depending
> on memory vs CPU trade of in any particular case;
>   - it's also possible to implement a more general Counting Bloom
> filter, all without modifying the kernel.

Also it means no call, and if you build it on top of an array map of
size 1 it's just a load. Could this be a performance win, for example
for a Bloom filter in XDP for DDoS? Maybe. Not sure; if the program is
complex enough, a call might be in the noise. I don't know.

>
> We could go further, and start implementing other simple data
> structures relying on hashing, like HyperLogLog. And all with no
> kernel modifications. Map-in-map is no issue as well, because there is
> a choice of using either fixed global data arrays for maximum
> performance, or using BPF_MAP_TYPE_ARRAY maps that can go inside
> map-in-map.

We've been doing most of our array maps as single entry arrays
at this point and just calculating offsets directly in BPF. Same
for some simple hashing algorithms.

>
> Basically, regardless of having this map in the kernel or not, let's
> have a "universal" hashing function as a BPF helper as well.

Yes please!

>
> Thoughts?

I like it, but I'm not the bloom filter expert here.
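The Counting Bloom filter and non-power-of-two sizing mentioned in the quoted list can be sketched over a flat array, which is roughly the shape a single-entry array map or global data array would take. This is plain userspace C, not BPF: hash_bytes() is a seeded FNV-1a standing in for the not-yet-existing hashing helper, and NR_COUNTERS, NR_HASHES and the cb_* names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

#define NR_COUNTERS 1000u	/* deliberately not a power of two */
#define NR_HASHES   3u

struct counting_bloom {
	uint32_t seed;
	uint8_t counters[NR_COUNTERS];
};

/* Seeded FNV-1a; a stand-in for a generic hashing helper. */
static uint32_t hash_bytes(const void *data, size_t len, uint32_t seed)
{
	const unsigned char *p = data;
	uint32_t h = 2166136261u ^ seed;

	for (size_t i = 0; i < len; i++) {
		h ^= p[i];
		h *= 16777619u;
	}
	return h;
}

/* Modulo instead of a bitmask, so any table size works. */
static uint32_t cb_slot(const struct counting_bloom *cb, const void *v,
			size_t len, uint32_t i)
{
	return hash_bytes(v, len, cb->seed + i) % NR_COUNTERS;
}

static void cb_add(struct counting_bloom *cb, const void *v, size_t len)
{
	for (uint32_t i = 0; i < NR_HASHES; i++) {
		uint32_t s = cb_slot(cb, v, len, i);

		if (cb->counters[s] < UINT8_MAX)	/* saturate, never wrap */
			cb->counters[s]++;
	}
}

/* Returns 1 if the value may be present, 0 if it is definitely absent. */
static int cb_contains(const struct counting_bloom *cb, const void *v,
		       size_t len)
{
	for (uint32_t i = 0; i < NR_HASHES; i++)
		if (cb->counters[cb_slot(cb, v, len, i)] == 0)
			return 0;
	return 1;
}

/* Only valid for values that were previously added. Saturated counters
 * are left alone because their true count is no longer known.
 */
static void cb_remove(struct counting_bloom *cb, const void *v, size_t len)
{
	for (uint32_t i = 0; i < NR_HASHES; i++) {
		uint32_t s = cb_slot(cb, v, len, i);

		if (cb->counters[s] > 0 && cb->counters[s] < UINT8_MAX)
			cb->counters[s]--;
	}
}
```

The trade-off is the one named in the list: the modulo costs more CPU than a mask but allows any table size, and per-slot counters cost more memory than bits but make removal possible.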
> > > include/linux/bpf.h | 3 +- > > include/linux/bpf_types.h | 1 + > > include/uapi/linux/bpf.h | 3 + > > kernel/bpf/Makefile | 2 +- > > kernel/bpf/bloom_filter.c | 171 +++++++++++++++++++++++++++++++++ > > kernel/bpf/syscall.c | 20 +++- > > kernel/bpf/verifier.c | 19 +++- > > tools/include/uapi/linux/bpf.h | 3 + > > 8 files changed, 214 insertions(+), 8 deletions(-) > > create mode 100644 kernel/bpf/bloom_filter.c > > > > diff --git a/include/linux/bpf.h b/include/linux/bpf.h > > index f4c16f19f83e..2abaa1052096 100644 > > --- a/include/linux/bpf.h > > +++ b/include/linux/bpf.h > > @@ -181,7 +181,8 @@ struct bpf_map { > > u32 btf_vmlinux_value_type_id; > > bool bypass_spec_v1; > > bool frozen; /* write-once; write-protected by freeze_mutex */ > > - /* 22 bytes hole */ > > + u32 nr_hashes; /* used for bloom filter maps */ > > + /* 18 bytes hole */ > > > > /* The 3rd and 4th cacheline with misc members to avoid false sharing > > * particularly with refcounting. > > diff --git a/include/linux/bpf_types.h b/include/linux/bpf_types.h > > index 9c81724e4b98..c4424ac2fa02 100644 > > --- a/include/linux/bpf_types.h > > +++ b/include/linux/bpf_types.h > > @@ -125,6 +125,7 @@ BPF_MAP_TYPE(BPF_MAP_TYPE_STACK, stack_map_ops) > > BPF_MAP_TYPE(BPF_MAP_TYPE_STRUCT_OPS, bpf_struct_ops_map_ops) > > #endif > > BPF_MAP_TYPE(BPF_MAP_TYPE_RINGBUF, ringbuf_map_ops) > > +BPF_MAP_TYPE(BPF_MAP_TYPE_BLOOM_FILTER, bloom_filter_map_ops) > > > > BPF_LINK_TYPE(BPF_LINK_TYPE_RAW_TRACEPOINT, raw_tracepoint) > > BPF_LINK_TYPE(BPF_LINK_TYPE_TRACING, tracing) > > diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h > > index 791f31dd0abe..c2acb0a510fe 100644 > > --- a/include/uapi/linux/bpf.h > > +++ b/include/uapi/linux/bpf.h > > @@ -906,6 +906,7 @@ enum bpf_map_type { > > BPF_MAP_TYPE_RINGBUF, > > BPF_MAP_TYPE_INODE_STORAGE, > > BPF_MAP_TYPE_TASK_STORAGE, > > + BPF_MAP_TYPE_BLOOM_FILTER, > > }; > > > > /* Note that tracing related programs such as > > @@ -1274,6 +1275,7 @@ union 
bpf_attr { > > * struct stored as the > > * map value > > */ > > + __u32 nr_hashes; /* used for configuring bloom filter maps */ > > This feels like a bit too one-off property that won't be ever reused > by any other type of map. Also consider that we should probably limit > nr_hashes to some pretty small sane value (<16? <64?) to prevent easy > DOS from inside BPF programs (e.g., set nr_hash to 2bln, each > operation is now extremely slow and CPU intensive). So with that, > maybe let's provide number of hashes as part of map_flags? And as > Alexei proposed, zero would mean some recommended value (2 or 3, > right?). This would also mean that libbpf won't need to know about > one-off map property in parsing BTF map definitions. > > > }; > > > > struct { /* anonymous struct used by BPF_MAP_*_ELEM commands */ > > @@ -5594,6 +5596,7 @@ struct bpf_map_info { > > __u32 btf_id; > > __u32 btf_key_type_id; > > __u32 btf_value_type_id; > > + __u32 nr_hashes; /* used for bloom filter maps */ > > } __attribute__((aligned(8))); > > > > struct bpf_btf_info { > > diff --git a/kernel/bpf/Makefile b/kernel/bpf/Makefile > > index 7f33098ca63f..cf6ca339f3cd 100644 > > --- a/kernel/bpf/Makefile > > +++ b/kernel/bpf/Makefile > > @@ -7,7 +7,7 @@ endif > > CFLAGS_core.o += $(call cc-disable-warning, override-init) $(cflags-nogcse-yy) > > > > obj-$(CONFIG_BPF_SYSCALL) += syscall.o verifier.o inode.o helpers.o tnum.o bpf_iter.o map_iter.o task_iter.o prog_iter.o > > -obj-$(CONFIG_BPF_SYSCALL) += hashtab.o arraymap.o percpu_freelist.o bpf_lru_list.o lpm_trie.o map_in_map.o > > +obj-$(CONFIG_BPF_SYSCALL) += hashtab.o arraymap.o percpu_freelist.o bpf_lru_list.o lpm_trie.o map_in_map.o bloom_filter.o > > obj-$(CONFIG_BPF_SYSCALL) += local_storage.o queue_stack_maps.o ringbuf.o > > obj-$(CONFIG_BPF_SYSCALL) += bpf_local_storage.o bpf_task_storage.o > > obj-${CONFIG_BPF_LSM} += bpf_inode_storage.o > > diff --git a/kernel/bpf/bloom_filter.c b/kernel/bpf/bloom_filter.c > > new file mode 100644 > > 
index 000000000000..3ae799ab3747 > > --- /dev/null > > +++ b/kernel/bpf/bloom_filter.c > > @@ -0,0 +1,171 @@ > > +// SPDX-License-Identifier: GPL-2.0 > > +/* Copyright (c) 2021 Facebook */ > > + > > +#include <linux/bitmap.h> > > +#include <linux/bpf.h> > > +#include <linux/err.h> > > +#include <linux/jhash.h> > > +#include <linux/random.h> > > +#include <linux/spinlock.h> > > + > > +#define BLOOM_FILTER_CREATE_FLAG_MASK \ > > + (BPF_F_NUMA_NODE | BPF_F_ZERO_SEED | BPF_F_ACCESS_MASK) > > + > > +struct bpf_bloom_filter { > > + struct bpf_map map; > > + u32 bit_array_mask; > > + u32 hash_seed; > > + /* Used for synchronizing parallel writes to the bit array */ > > + spinlock_t spinlock; > > + unsigned long bit_array[]; > > +}; > > + > > +static int bloom_filter_map_peek_elem(struct bpf_map *map, void *value) > > +{ > > + struct bpf_bloom_filter *bloom_filter = > > + container_of(map, struct bpf_bloom_filter, map); > > + u32 i, hash; > > + > > + for (i = 0; i < bloom_filter->map.nr_hashes; i++) { > > + hash = jhash(value, map->value_size, bloom_filter->hash_seed + i) & > > + bloom_filter->bit_array_mask; > > + if (!test_bit(hash, bloom_filter->bit_array)) > > + return -ENOENT; > > + } > > + > > + return 0; > > +} > > + > > +static struct bpf_map *bloom_filter_map_alloc(union bpf_attr *attr) > > +{ > > + int numa_node = bpf_map_attr_numa_node(attr); > > + u32 nr_bits, bit_array_bytes, bit_array_mask; > > + struct bpf_bloom_filter *bloom_filter; > > + > > + if (!bpf_capable()) > > + return ERR_PTR(-EPERM); > > + > > + if (attr->key_size != 0 || attr->value_size == 0 || attr->max_entries == 0 || > > + attr->nr_hashes == 0 || attr->map_flags & ~BLOOM_FILTER_CREATE_FLAG_MASK || > > + !bpf_map_flags_access_ok(attr->map_flags)) > > + return ERR_PTR(-EINVAL); > > + > > + /* For the bloom filter, the optimal bit array size that minimizes the > > + * false positive probability is n * k / ln(2) where n is the number of > > + * expected entries in the bloom filter and k is the 
number of hash > > + * functions. We use 7 / 5 to approximate 1 / ln(2). > > + * > > + * We round this up to the nearest power of two to enable more efficient > > + * hashing using bitmasks. The bitmask will be the bit array size - 1. > > + * > > + * If this overflows a u32, the bit array size will have 2^32 (4 > > + * GB) bits. Would it be better to return E2BIG or EINVAL here? Speculating a bit, but if I was a user I might want to know that the number of bits I pushed down is not the actual number? Another thought, would it be simpler to let user do this calculation and just let max_elements be number of bits they want? Then we could have examples with the above comment. Just a thought... > > + */ > > + if (unlikely(check_mul_overflow(attr->max_entries, attr->nr_hashes, &nr_bits)) || > > + unlikely(check_mul_overflow(nr_bits / 5, (u32)7, &nr_bits)) || > > + unlikely(nr_bits > (1UL << 31))) { > > nit: map_alloc is not performance-critical (because it's infrequent), > so unlikely() are probably unnecessary? 
> > > + /* The bit array size is 2^32 bits but to avoid overflowing the > > + * u32, we use BITS_TO_BYTES(U32_MAX), which will round up to the > > + * equivalent number of bytes > > + */ > > + bit_array_bytes = BITS_TO_BYTES(U32_MAX); > > + bit_array_mask = U32_MAX; > > + } else { > > + if (nr_bits <= BITS_PER_LONG) > > + nr_bits = BITS_PER_LONG; > > + else > > + nr_bits = roundup_pow_of_two(nr_bits); > > + bit_array_bytes = BITS_TO_BYTES(nr_bits); > > + bit_array_mask = nr_bits - 1; > > + } > > + > > + bit_array_bytes = roundup(bit_array_bytes, sizeof(unsigned long)); > > + bloom_filter = bpf_map_area_alloc(sizeof(*bloom_filter) + bit_array_bytes, > > + numa_node); > > + > > + if (!bloom_filter) > > + return ERR_PTR(-ENOMEM); > > + > > + bpf_map_init_from_attr(&bloom_filter->map, attr); > > + bloom_filter->map.nr_hashes = attr->nr_hashes; > > + > > + bloom_filter->bit_array_mask = bit_array_mask; > > + spin_lock_init(&bloom_filter->spinlock); > > + > > + if (!(attr->map_flags & BPF_F_ZERO_SEED)) > > + bloom_filter->hash_seed = get_random_int(); > > + > > + return &bloom_filter->map; > > +} > > + > > +static void bloom_filter_map_free(struct bpf_map *map) > > +{ > > + struct bpf_bloom_filter *bloom_filter = > > + container_of(map, struct bpf_bloom_filter, map); > > + > > + bpf_map_area_free(bloom_filter); > > +} > > + > > +static int bloom_filter_map_push_elem(struct bpf_map *map, void *value, > > + u64 flags) > > +{ > > + struct bpf_bloom_filter *bloom_filter = > > + container_of(map, struct bpf_bloom_filter, map); > > + unsigned long spinlock_flags; > > + u32 i, hash; > > + > > + if (flags != BPF_ANY) > > + return -EINVAL; > > + > > + spin_lock_irqsave(&bloom_filter->spinlock, spinlock_flags); > > + > > If value_size is pretty big, hashing might take a noticeable amount of > CPU, during which we'll be keeping spinlock. 
With what I said above
> about sane number of hashes, if we bound it to small reasonable number
> (e.g., 16), we can have a local 16-element array with hashes
> calculated before we take lock. That way spinlock will be held only
> for few bit flips.

+1. Anyway, we are inside an RCU section here and the map shouldn't
disappear without a grace period, so we can READ_ONCE the seed, right?
Or are we thinking about sleepable programs here?

> Also, I wonder if ditching spinlock in favor of atomic bit set
> operation would improve performance in typical scenarios. Seems like
> set_bit() is an atomic operation, so it should be easy to test. Do you
> mind running benchmarks with spinlock and with set_bit()?

With the jhash pulled out of lock, I think it might be noticeable. Curious
to see.

>
> > +	for (i = 0; i < bloom_filter->map.nr_hashes; i++) {
> > +		hash = jhash(value, map->value_size, bloom_filter->hash_seed + i) &
> > +			bloom_filter->bit_array_mask;
> > +		bitmap_set(bloom_filter->bit_array, hash, 1);
> > +	}
> > +
> > +	spin_unlock_irqrestore(&bloom_filter->spinlock, spinlock_flags);
> > +
> > +	return 0;
> > +}
> > +
>
> [...]
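For reference, the two suggestions in this message (do all hashing before touching shared state, and set bits atomically instead of taking a spinlock) can be sketched in user space roughly as follows. This is a minimal sketch under assumptions, not the patch's code: C11 atomic_fetch_or() stands in for the kernel's set_bit(), a seeded FNV-1a stands in for jhash(), and MAX_HASHES, NR_BITS, struct bloom, hash32(), bloom_push() and bloom_peek() are all hypothetical names.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_HASHES 16      /* hypothetical cap, one of the values floated in the thread */
#define NR_BITS    1024u   /* power of two, so masking selects a valid bit */
#define WORD_BITS  64u

struct bloom {
	uint32_t nr_hashes;
	uint32_t seed;
	_Atomic uint64_t bits[NR_BITS / WORD_BITS];
};

/* Stand-in for the kernel's jhash(): seeded 32-bit FNV-1a (an assumption). */
static uint32_t hash32(const void *data, size_t len, uint32_t seed)
{
	const unsigned char *p = data;
	uint32_t h = 2166136261u ^ seed;

	for (size_t i = 0; i < len; i++) {
		h ^= p[i];
		h *= 16777619u;
	}
	return h;
}

static void bloom_push(struct bloom *b, const void *val, size_t len)
{
	uint32_t idx[MAX_HASHES];

	assert(b->nr_hashes >= 1 && b->nr_hashes <= MAX_HASHES);

	/* Phase 1: compute every bit index without touching shared state. */
	for (uint32_t i = 0; i < b->nr_hashes; i++)
		idx[i] = hash32(val, len, b->seed + i) & (NR_BITS - 1);

	/* Phase 2: set each bit with one atomic OR, in the role of set_bit(). */
	for (uint32_t i = 0; i < b->nr_hashes; i++)
		atomic_fetch_or(&b->bits[idx[i] / WORD_BITS],
				UINT64_C(1) << (idx[i] % WORD_BITS));
}

static bool bloom_peek(struct bloom *b, const void *val, size_t len)
{
	for (uint32_t i = 0; i < b->nr_hashes; i++) {
		uint32_t bit = hash32(val, len, b->seed + i) & (NR_BITS - 1);
		uint64_t word = atomic_load(&b->bits[bit / WORD_BITS]);

		if (!(word & (UINT64_C(1) << (bit % WORD_BITS))))
			return false;
	}
	return true;
}
```

Because pushes only ever OR bits in, two concurrent pushes in this shape cannot lose each other's bits, which is why no lock appears in the sketch. Whether it is actually faster than the spinlock version is exactly what the requested benchmark would have to show.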
On Wed, Sep 01, 2021 at 10:11:20PM -0700, John Fastabend wrote: [ ... ] > > > +static struct bpf_map *bloom_filter_map_alloc(union bpf_attr *attr) > > > +{ > > > + int numa_node = bpf_map_attr_numa_node(attr); > > > + u32 nr_bits, bit_array_bytes, bit_array_mask; > > > + struct bpf_bloom_filter *bloom_filter; > > > + > > > + if (!bpf_capable()) > > > + return ERR_PTR(-EPERM); > > > + > > > + if (attr->key_size != 0 || attr->value_size == 0 || attr->max_entries == 0 || > > > + attr->nr_hashes == 0 || attr->map_flags & ~BLOOM_FILTER_CREATE_FLAG_MASK || > > > + !bpf_map_flags_access_ok(attr->map_flags)) > > > + return ERR_PTR(-EINVAL); > > > + > > > + /* For the bloom filter, the optimal bit array size that minimizes the > > > + * false positive probability is n * k / ln(2) where n is the number of > > > + * expected entries in the bloom filter and k is the number of hash > > > + * functions. We use 7 / 5 to approximate 1 / ln(2). > > > + * > > > + * We round this up to the nearest power of two to enable more efficient > > > + * hashing using bitmasks. The bitmask will be the bit array size - 1. > > > + * > > > + * If this overflows a u32, the bit array size will have 2^32 (4 > > > + * GB) bits. > > Would it be better to return E2BIG or EINVAL here? Speculating a bit, but if I was > a user I might want to know that the number of bits I pushed down is not the actual > number? > > Another thought, would it be simpler to let user do this calculation and just let > max_elements be number of bits they want? Then we could have examples with the > above comment. Just a thought... Instead of having user second guessing on what max_entries means for a particular map, I think it is better to keep max_entries meaning as consistent as possible and let the kernel figure out the correct nr_bits to use. [ ... 
] > > > +static int bloom_filter_map_push_elem(struct bpf_map *map, void *value, > > > + u64 flags) > > > +{ > > > + struct bpf_bloom_filter *bloom_filter = > > > + container_of(map, struct bpf_bloom_filter, map); > > > + unsigned long spinlock_flags; > > > + u32 i, hash; > > > + > > > + if (flags != BPF_ANY) > > > + return -EINVAL; > > > + > > > + spin_lock_irqsave(&bloom_filter->spinlock, spinlock_flags); > > > + > > > > If value_size is pretty big, hashing might take a noticeable amount of > > CPU, during which we'll be keeping spinlock. With what I said above Good catch on big value_size. > > about sane number of hashes, if we bound it to small reasonable number > > (e.g., 16), we can have a local 16-element array with hashes > > calculated before we take lock. That way spinlock will be held only > > for few bit flips. > > +1. Anyways we are inside a RCU section here and the map shouldn't > disapper without a grace period so we can READ_ONCE the seed right? > Or are we thinking about sleepable programs here? > > > Also, I wonder if ditching spinlock in favor of atomic bit set > > operation would improve performance in typical scenarios. Seems like > > set_bit() is an atomic operation, so it should be easy to test. Do you > > mind running benchmarks with spinlock and with set_bit()? The atomic set_bit() is a good idea. Then no need to have a 16-element array and keep thing simple. It is in general useful to optimize the update/push path (e.g. I would like to have the bloom-filter bench populating millions entries faster). Our current usecase is to have the userspace populates the map (e.g. a lot of suspicious IP that we have already learned) at the very beginning and then very sparse update after that. The bpf prog will mostly only lookup/peek which I think is a better optimization and benchmark target. > > With the jhash pulled out of lock, I think it might be noticable. Curious > to see. 
> > > > > > + for (i = 0; i < bloom_filter->map.nr_hashes; i++) { > > > + hash = jhash(value, map->value_size, bloom_filter->hash_seed + i) & > > > + bloom_filter->bit_array_mask; > > > + bitmap_set(bloom_filter->bit_array, hash, 1); > > > + } > > > + > > > + spin_unlock_irqrestore(&bloom_filter->spinlock, spinlock_flags); > > > + > > > + return 0; > > > +} > > > + > > > > [...] > >
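To make the "let the kernel figure out nr_bits" point concrete, the sizing rule from the quoted comment can be replayed in user space. This is a sketch under assumptions: __builtin_mul_overflow() (a GCC/Clang builtin) stands in for the kernel's check_mul_overflow(), BITS_PER_LONG is assumed to be 64, and bloom_bit_mask() and roundup_pow2_u32() are hypothetical names. The divide-by-5-then-multiply-by-7 order mirrors the quoted patch.

```c
#include <assert.h>
#include <stdint.h>

/* Round v up to a power of two; caller guarantees 1 <= v <= 2^31. */
static uint32_t roundup_pow2_u32(uint32_t v)
{
	uint32_t p = 1;

	while (p < v)
		p <<= 1;
	return p;
}

/*
 * Returns the bit mask (bit array size - 1) for a filter sized as
 * max_entries * nr_hashes * 7 / 5, rounded up to a power of two.
 * Falls back to a 2^32-bit array (mask UINT32_MAX) on overflow.
 */
static uint32_t bloom_bit_mask(uint32_t max_entries, uint32_t nr_hashes)
{
	uint32_t nr_bits;

	assert(max_entries > 0 && nr_hashes > 0);

	if (__builtin_mul_overflow(max_entries, nr_hashes, &nr_bits) ||
	    __builtin_mul_overflow(nr_bits / 5, (uint32_t)7, &nr_bits) ||
	    nr_bits > (UINT32_C(1) << 31))
		return UINT32_MAX;

	if (nr_bits <= 64)	/* assumed BITS_PER_LONG */
		nr_bits = 64;
	else
		nr_bits = roundup_pow2_u32(nr_bits);

	return nr_bits - 1;
}
```

For example, max_entries = 1000 with nr_hashes = 3 gives 3000 / 5 * 7 = 4200 bits, which rounds up to 8192, so the mask is 8191. The user only ever states how many entries they expect.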
On 8/31/21 7:55 PM, Alexei Starovoitov wrote: > On Tue, Aug 31, 2021 at 03:50:01PM -0700, Joanne Koong wrote: >> +static int bloom_filter_map_peek_elem(struct bpf_map *map, void *value) >> +{ >> + struct bpf_bloom_filter *bloom_filter = >> + container_of(map, struct bpf_bloom_filter, map); >> + u32 i, hash; >> + >> + for (i = 0; i < bloom_filter->map.nr_hashes; i++) { >> + hash = jhash(value, map->value_size, bloom_filter->hash_seed + i) & >> + bloom_filter->bit_array_mask; >> + if (!test_bit(hash, bloom_filter->bit_array)) >> + return -ENOENT; >> + } > I'm curious what bloom filter theory says about n-hashes > 1 > concurrent access with updates in terms of false negative? > Two concurrent updates race is protected by spin_lock, > but what about peek and update? > The update might set one bit, but not the other. > That shouldn't trigger false negative lookup, right? For cases where there is a concurrent peek and update, the user is responsible for synchronizing these operations if they want to ensure that the peek will always return true while the update is occurring. I will add this to the commit message. > Is bloom filter supported as inner map? > Hash and lru maps are often used as inner maps. > The lookups from them would be pre-filtered by bloom filter > map that would have to be (in some cases) inner map. > I suspect one bloom filter for all inner maps might be > reasonable workaround in some cases too. > The delete is not supported in bloom filter, of course. > Would be good to mention it in the commit log. > Since there is no delete the users would likely need > to replace the whole bloom filter. So map-in-map would > become necessary. > Do you think 'clear-all' operation might be useful for bloom filter? > It feels that if map-in-map is supported then clear-all is probably > not that useful, since atomic replacement and delete of the map > would work better. 'clear-all' will have issues with > lookup, since it cannot be done in parallel. 
> Would be good to document all these ideas and restrictions. The bloom filter is supported as an inner map. I will include a test for this and add it to v2 (and document this in the commit message in v2) > Could you collect 'perf annotate' data for the above performance > critical loop? > I wonder whether using jhash2 and forcing u32 value size could speed it up. > Probably not, but would be good to check, since restricting value_size > later would be problematic due to backward compatibility. > > The recommended nr_hashes=3 was computed with value_size=8, right? > I wonder whether nr_hashes would be different for value_size=16 and =4 > which are ipv6/ipv4 addresses and value_size = 40 > an approximation of networking n-tuple. Great suggestions! I will do all of these you mentioned, after I incorporate the edits for v2, and report back with the results. >> +static struct bpf_map *bloom_filter_map_alloc(union bpf_attr *attr) >> +{ >> + int numa_node = bpf_map_attr_numa_node(attr); >> + u32 nr_bits, bit_array_bytes, bit_array_mask; >> + struct bpf_bloom_filter *bloom_filter; >> + >> + if (!bpf_capable()) >> + return ERR_PTR(-EPERM); >> + >> + if (attr->key_size != 0 || attr->value_size == 0 || attr->max_entries == 0 || >> + attr->nr_hashes == 0 || attr->map_flags & ~BLOOM_FILTER_CREATE_FLAG_MASK || > Would it make sense to default to nr_hashes=3 if zero is passed? > This way the libbpf changes for nr_hashes will become 'optional'. > Most users wouldn't have to specify it explicitly. I like this idea - it'll make the API more friendly to use as well for people who might not be acquainted with bloom filters :) > > Overall looks great! > Performance numbers are impressive.
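The peek-versus-update question above can be illustrated with a single-threaded replay of one possible interleaving. This is only a sketch of the general idea, not a statement about the kernel code: elem_bits, set_one_bit() and peek() are hypothetical names, and the two bit positions are picked by hand rather than hashed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NR_BITS 64u

/* The two bit positions one element maps to when two hashes are used. */
struct elem_bits {
	uint32_t first;
	uint32_t second;
};

/* One step of an update: sets a single bit. */
static void set_one_bit(uint64_t *bits, uint32_t pos)
{
	assert(pos < NR_BITS);
	*bits |= UINT64_C(1) << pos;
}

/* A lookup reports "present" only if both bits are already set. */
static bool peek(uint64_t bits, struct elem_bits e)
{
	return ((bits >> e.first) & 1) && ((bits >> e.second) & 1);
}
```

If a reader runs after set_one_bit(&bits, e.first) but before set_one_bit(&bits, e.second), peek() returns false for an element whose insertion is still in flight. Once both bits are set it returns true and stays true, since bits are never cleared. That window is what the user has to synchronize around if they need the lookup to succeed while the update is in progress.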
On 9/1/21 10:11 PM, John Fastabend wrote: > Andrii Nakryiko wrote: >> On Tue, Aug 31, 2021 at 3:51 PM Joanne Koong <joannekoong@fb.com> wrote: >>> Bloom filters are a space-efficient probabilistic data structure >>> used to quickly test whether an element exists in a set. >>> In a bloom filter, false positives are possible whereas false >>> negatives are not. >>> >>> This patch adds a bloom filter map for bpf programs. >>> The bloom filter map supports peek (determining whether an element >>> is present in the map) and push (adding an element to the map) >>> operations.These operations are exposed to userspace applications >>> through the already existing syscalls in the following way: >>> >>> BPF_MAP_LOOKUP_ELEM -> peek >>> BPF_MAP_UPDATE_ELEM -> push >>> >>> The bloom filter map does not have keys, only values. In light of >>> this, the bloom filter map's API matches that of queue stack maps: >>> user applications use BPF_MAP_LOOKUP_ELEM/BPF_MAP_UPDATE_ELEM >>> which correspond internally to bpf_map_peek_elem/bpf_map_push_elem, >>> and bpf programs must use the bpf_map_peek_elem and bpf_map_push_elem >>> APIs to query or add an element to the bloom filter map. When the >>> bloom filter map is created, it must be created with a key_size of 0. >>> >>> For updates, the user will pass in the element to add to the map >>> as the value, wih a NULL key. For lookups, the user will pass in the >>> element to query in the map as the value. In the verifier layer, this >>> requires us to modify the argument type of a bloom filter's >>> BPF_FUNC_map_peek_elem call to ARG_PTR_TO_MAP_VALUE; as well, in >>> the syscall layer, we need to copy over the user value so that in >>> bpf_map_peek_elem, we know which specific value to query. 
>>>
>>> The maximum number of entries in the bloom filter is not enforced; if
>>> the user wishes to insert more entries into the bloom filter than they
>>> specified as the max entries size of the bloom filter, that is permitted,
>>> but their bloom filter will have a higher false positive rate.
>>>
>>> The number of hashes to use for the bloom filter is configurable from
>>> userspace. The benchmarks later in this patchset can help compare the
>>> performance of different numbers of hashes on different entry
>>> sizes. In general, using more hashes decreases the speed of a lookup,
>>> but also decreases the false positive rate of an element being detected
>>> in the bloom filter.
>>>
>>> Signed-off-by: Joanne Koong <joannekoong@fb.com>
>>> ---
>> This looks nice and simple. I left a few comments below.
>>
>> But one high-level point I wanted to discuss was that bloom filter
>> logic is actually simple enough to be implementable by pure BPF
>> program logic. The only problematic part is generic hashing of a piece
>> of memory. Regardless of implementing bloom filter as kernel-provided
>> BPF map or implementing it with custom BPF program logic, having BPF
>> helper for hashing a piece of memory seems extremely useful and very
>> generic. I can't recall if we ever discussed adding such helpers, but
>> maybe we should?
> Aha, started typing the same thing :)
>
> Adding a generic hash helper has been on my todo list and is close to the top
> now. The use case is hashing data from skb payloads and such from the kprobe
> and sockmap side. I'm happy to work on it as soon as possible if no one
> else picks it up.
>
>> It would be a really interesting experiment to implement the same
>> logic in pure BPF logic and run it as another benchmark, along the
>> Bloom filter map. BPF has both spinlock and atomic operation, so we
>> can compare and contrast. We only miss hashing BPF helper.
> The one issue I've found writing hash logic is that it's a bit tricky
> to get the verifier to consume it. Especially when the hash is nested
> inside a for loop, and sometimes a couple of for loops, so you end up with
> things like,
>
>  for (i = 0; i < someTlvs; i++) {
>   for (j = 0; j < someKeys; j++) {
>     ...
>     bpf_hash(someValue)
>     ...
>   }
>  }
>
> I've found that small, seemingly unrelated changes cause the complexity limit
> to explode. Usually we can work around it with code to get pruning
> points and such, but it's a bit ugly. Perhaps this means we need
> to dive into details of why the complexity explodes, but I've not
> got to it yet. The todo list is long.
>
>> Being able to do this in pure BPF code has a bunch of advantages.
>> Depending on specific application, users can decide to:
>>   - speed up the operation by ditching spinlock or atomic operation,
>> if the logic allows to lose some bit updates;
>>   - decide on optimal size, which might not be a power of 2, depending
>> on memory vs CPU trade-off in any particular case;
>>   - it's also possible to implement a more general Counting Bloom
>> filter, all without modifying the kernel.
> Also, it means no call, and if you build it on top of an array
> map of size 1 it's just a load. Could this be a performance win for,
> say, a Bloom filter in XDP for DDoS? Maybe. Not sure; if the program
> is complex enough, a call might be in the noise. I don't know.
>
>> We could go further, and start implementing other simple data
>> structures relying on hashing, like HyperLogLog. And all with no
>> kernel modifications. Map-in-map is no issue as well, because there is
>> a choice of using either fixed global data arrays for maximum
>> performance, or using BPF_MAP_TYPE_ARRAY maps that can go inside
>> map-in-map.
> We've been doing most of our array maps as single entry arrays
> at this point and just calculating offsets directly in BPF. Same
> for some simple hashing algorithms.
> >> Basically, regardless of having this map in the kernel or not, let's >> have a "universal" hashing function as a BPF helper as well. > Yes please! I completely agree! >> Thoughts? > I like it, but not the bloom filter expert here. Ooh, I like your idea of comparing the performance of the bloom filter with a kernel-provided BPF map vs. custom BPF program logic using a hash helper, especially if a BPF hash helper is something useful that we want to add to the codebase in and of itself! >>> include/linux/bpf.h | 3 +- >>> include/linux/bpf_types.h | 1 + >>> include/uapi/linux/bpf.h | 3 + >>> kernel/bpf/Makefile | 2 +- >>> kernel/bpf/bloom_filter.c | 171 +++++++++++++++++++++++++++++++++ >>> kernel/bpf/syscall.c | 20 +++- >>> kernel/bpf/verifier.c | 19 +++- >>> tools/include/uapi/linux/bpf.h | 3 + >>> 8 files changed, 214 insertions(+), 8 deletions(-) >>> create mode 100644 kernel/bpf/bloom_filter.c >>> >>> diff --git a/include/linux/bpf.h b/include/linux/bpf.h >>> index f4c16f19f83e..2abaa1052096 100644 >>> --- a/include/linux/bpf.h >>> +++ b/include/linux/bpf.h >>> @@ -181,7 +181,8 @@ struct bpf_map { >>> u32 btf_vmlinux_value_type_id; >>> bool bypass_spec_v1; >>> bool frozen; /* write-once; write-protected by freeze_mutex */ >>> - /* 22 bytes hole */ >>> + u32 nr_hashes; /* used for bloom filter maps */ >>> + /* 18 bytes hole */ >>> >>> /* The 3rd and 4th cacheline with misc members to avoid false sharing >>> * particularly with refcounting. 
>>> diff --git a/include/linux/bpf_types.h b/include/linux/bpf_types.h >>> index 9c81724e4b98..c4424ac2fa02 100644 >>> --- a/include/linux/bpf_types.h >>> +++ b/include/linux/bpf_types.h >>> @@ -125,6 +125,7 @@ BPF_MAP_TYPE(BPF_MAP_TYPE_STACK, stack_map_ops) >>> BPF_MAP_TYPE(BPF_MAP_TYPE_STRUCT_OPS, bpf_struct_ops_map_ops) >>> #endif >>> BPF_MAP_TYPE(BPF_MAP_TYPE_RINGBUF, ringbuf_map_ops) >>> +BPF_MAP_TYPE(BPF_MAP_TYPE_BLOOM_FILTER, bloom_filter_map_ops) >>> >>> BPF_LINK_TYPE(BPF_LINK_TYPE_RAW_TRACEPOINT, raw_tracepoint) >>> BPF_LINK_TYPE(BPF_LINK_TYPE_TRACING, tracing) >>> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h >>> index 791f31dd0abe..c2acb0a510fe 100644 >>> --- a/include/uapi/linux/bpf.h >>> +++ b/include/uapi/linux/bpf.h >>> @@ -906,6 +906,7 @@ enum bpf_map_type { >>> BPF_MAP_TYPE_RINGBUF, >>> BPF_MAP_TYPE_INODE_STORAGE, >>> BPF_MAP_TYPE_TASK_STORAGE, >>> + BPF_MAP_TYPE_BLOOM_FILTER, >>> }; >>> >>> /* Note that tracing related programs such as >>> @@ -1274,6 +1275,7 @@ union bpf_attr { >>> * struct stored as the >>> * map value >>> */ >>> + __u32 nr_hashes; /* used for configuring bloom filter maps */ >> This feels like a bit too one-off property that won't be ever reused >> by any other type of map. Also consider that we should probably limit >> nr_hashes to some pretty small sane value (<16? <64?) to prevent easy >> DOS from inside BPF programs (e.g., set nr_hash to 2bln, each >> operation is now extremely slow and CPU intensive). So with that, >> maybe let's provide number of hashes as part of map_flags? And as >> Alexei proposed, zero would mean some recommended value (2 or 3, >> right?). This would also mean that libbpf won't need to know about >> one-off map property in parsing BTF map definitions. 
I think we can limit nr_hashes to 10, since 10 hashes has a false positive rate of roughly ~0% >>> }; >>> >>> struct { /* anonymous struct used by BPF_MAP_*_ELEM commands */ >>> @@ -5594,6 +5596,7 @@ struct bpf_map_info { >>> __u32 btf_id; >>> __u32 btf_key_type_id; >>> __u32 btf_value_type_id; >>> + __u32 nr_hashes; /* used for bloom filter maps */ >>> } __attribute__((aligned(8))); >>> >>> struct bpf_btf_info { >>> diff --git a/kernel/bpf/Makefile b/kernel/bpf/Makefile >>> index 7f33098ca63f..cf6ca339f3cd 100644 >>> --- a/kernel/bpf/Makefile >>> +++ b/kernel/bpf/Makefile >>> @@ -7,7 +7,7 @@ endif >>> CFLAGS_core.o += $(call cc-disable-warning, override-init) $(cflags-nogcse-yy) >>> >>> obj-$(CONFIG_BPF_SYSCALL) += syscall.o verifier.o inode.o helpers.o tnum.o bpf_iter.o map_iter.o task_iter.o prog_iter.o >>> -obj-$(CONFIG_BPF_SYSCALL) += hashtab.o arraymap.o percpu_freelist.o bpf_lru_list.o lpm_trie.o map_in_map.o >>> +obj-$(CONFIG_BPF_SYSCALL) += hashtab.o arraymap.o percpu_freelist.o bpf_lru_list.o lpm_trie.o map_in_map.o bloom_filter.o >>> obj-$(CONFIG_BPF_SYSCALL) += local_storage.o queue_stack_maps.o ringbuf.o >>> obj-$(CONFIG_BPF_SYSCALL) += bpf_local_storage.o bpf_task_storage.o >>> obj-${CONFIG_BPF_LSM} += bpf_inode_storage.o >>> diff --git a/kernel/bpf/bloom_filter.c b/kernel/bpf/bloom_filter.c >>> new file mode 100644 >>> index 000000000000..3ae799ab3747 >>> --- /dev/null >>> +++ b/kernel/bpf/bloom_filter.c >>> @@ -0,0 +1,171 @@ >>> +// SPDX-License-Identifier: GPL-2.0 >>> +/* Copyright (c) 2021 Facebook */ >>> + >>> +#include <linux/bitmap.h> >>> +#include <linux/bpf.h> >>> +#include <linux/err.h> >>> +#include <linux/jhash.h> >>> +#include <linux/random.h> >>> +#include <linux/spinlock.h> >>> + >>> +#define BLOOM_FILTER_CREATE_FLAG_MASK \ >>> + (BPF_F_NUMA_NODE | BPF_F_ZERO_SEED | BPF_F_ACCESS_MASK) >>> + >>> +struct bpf_bloom_filter { >>> + struct bpf_map map; >>> + u32 bit_array_mask; >>> + u32 hash_seed; >>> + /* Used for synchronizing parallel 
writes to the bit array */ >>> + spinlock_t spinlock; >>> + unsigned long bit_array[]; >>> +}; >>> + >>> +static int bloom_filter_map_peek_elem(struct bpf_map *map, void *value) >>> +{ >>> + struct bpf_bloom_filter *bloom_filter = >>> + container_of(map, struct bpf_bloom_filter, map); >>> + u32 i, hash; >>> + >>> + for (i = 0; i < bloom_filter->map.nr_hashes; i++) { >>> + hash = jhash(value, map->value_size, bloom_filter->hash_seed + i) & >>> + bloom_filter->bit_array_mask; >>> + if (!test_bit(hash, bloom_filter->bit_array)) >>> + return -ENOENT; >>> + } >>> + >>> + return 0; >>> +} >>> + >>> +static struct bpf_map *bloom_filter_map_alloc(union bpf_attr *attr) >>> +{ >>> + int numa_node = bpf_map_attr_numa_node(attr); >>> + u32 nr_bits, bit_array_bytes, bit_array_mask; >>> + struct bpf_bloom_filter *bloom_filter; >>> + >>> + if (!bpf_capable()) >>> + return ERR_PTR(-EPERM); >>> + >>> + if (attr->key_size != 0 || attr->value_size == 0 || attr->max_entries == 0 || >>> + attr->nr_hashes == 0 || attr->map_flags & ~BLOOM_FILTER_CREATE_FLAG_MASK || >>> + !bpf_map_flags_access_ok(attr->map_flags)) >>> + return ERR_PTR(-EINVAL); >>> + >>> + /* For the bloom filter, the optimal bit array size that minimizes the >>> + * false positive probability is n * k / ln(2) where n is the number of >>> + * expected entries in the bloom filter and k is the number of hash >>> + * functions. We use 7 / 5 to approximate 1 / ln(2). >>> + * >>> + * We round this up to the nearest power of two to enable more efficient >>> + * hashing using bitmasks. The bitmask will be the bit array size - 1. >>> + * >>> + * If this overflows a u32, the bit array size will have 2^32 (4 >>> + * GB) bits. > Would it be better to return E2BIG or EINVAL here? Speculating a bit, but if I was > a user I might want to know that the number of bits I pushed down is not the actual > number? 
I think if we return E2BIG or EINVAL here, this will fail to create the bloom filter map if the max_entries exceeds some limit (~3 billion, according to my math) whereas automatically setting the bit array size to 2^32 if the max_entries is extraordinarily large will still allow the user to create and use a bloom filter (albeit one with a higher false positive rate). > Another thought, would it be simpler to let user do this calculation and just let > max_elements be number of bits they want? Then we could have examples with the > above comment. Just a thought... I like Martin's idea of keeping the max_entries meaning consistent across all map types. I think that makes the interface clearer for users. >>> + */ >>> + if (unlikely(check_mul_overflow(attr->max_entries, attr->nr_hashes, &nr_bits)) || >>> + unlikely(check_mul_overflow(nr_bits / 5, (u32)7, &nr_bits)) || >>> + unlikely(nr_bits > (1UL << 31))) { >> nit: map_alloc is not performance-critical (because it's infrequent), >> so unlikely() are probably unnecessary? 
>> >>> + /* The bit array size is 2^32 bits but to avoid overflowing the >>> + * u32, we use BITS_TO_BYTES(U32_MAX), which will round up to the >>> + * equivalent number of bytes >>> + */ >>> + bit_array_bytes = BITS_TO_BYTES(U32_MAX); >>> + bit_array_mask = U32_MAX; >>> + } else { >>> + if (nr_bits <= BITS_PER_LONG) >>> + nr_bits = BITS_PER_LONG; >>> + else >>> + nr_bits = roundup_pow_of_two(nr_bits); >>> + bit_array_bytes = BITS_TO_BYTES(nr_bits); >>> + bit_array_mask = nr_bits - 1; >>> + } >>> + >>> + bit_array_bytes = roundup(bit_array_bytes, sizeof(unsigned long)); >>> + bloom_filter = bpf_map_area_alloc(sizeof(*bloom_filter) + bit_array_bytes, >>> + numa_node); >>> + >>> + if (!bloom_filter) >>> + return ERR_PTR(-ENOMEM); >>> + >>> + bpf_map_init_from_attr(&bloom_filter->map, attr); >>> + bloom_filter->map.nr_hashes = attr->nr_hashes; >>> + >>> + bloom_filter->bit_array_mask = bit_array_mask; >>> + spin_lock_init(&bloom_filter->spinlock); >>> + >>> + if (!(attr->map_flags & BPF_F_ZERO_SEED)) >>> + bloom_filter->hash_seed = get_random_int(); >>> + >>> + return &bloom_filter->map; >>> +} >>> + >>> +static void bloom_filter_map_free(struct bpf_map *map) >>> +{ >>> + struct bpf_bloom_filter *bloom_filter = >>> + container_of(map, struct bpf_bloom_filter, map); >>> + >>> + bpf_map_area_free(bloom_filter); >>> +} >>> + >>> +static int bloom_filter_map_push_elem(struct bpf_map *map, void *value, >>> + u64 flags) >>> +{ >>> + struct bpf_bloom_filter *bloom_filter = >>> + container_of(map, struct bpf_bloom_filter, map); >>> + unsigned long spinlock_flags; >>> + u32 i, hash; >>> + >>> + if (flags != BPF_ANY) >>> + return -EINVAL; >>> + >>> + spin_lock_irqsave(&bloom_filter->spinlock, spinlock_flags); >>> + >> If value_size is pretty big, hashing might take a noticeable amount of >> CPU, during which we'll be keeping spinlock. 
With what I said above >> about sane number of hashes, if we bound it to small reasonable number >> (e.g., 16), we can have a local 16-element array with hashes >> calculated before we take lock. That way spinlock will be held only >> for few bit flips.
> +1. Anyways we are inside an RCU section here and the map shouldn't > disappear without a grace period so we can READ_ONCE the seed right? > Or are we thinking about sleepable programs here? >
>> Also, I wonder if ditching spinlock in favor of atomic bit set >> operation would improve performance in typical scenarios. Seems like >> set_bit() is an atomic operation, so it should be easy to test. Do you >> mind running benchmarks with spinlock and with set_bit()?
> With the jhash pulled out of lock, I think it might be noticeable. Curious > to see.
Awesome, I will test this out and report back!
>>> + for (i = 0; i < bloom_filter->map.nr_hashes; i++) { >>> + hash = jhash(value, map->value_size, bloom_filter->hash_seed + i) & >>> + bloom_filter->bit_array_mask; >>> + bitmap_set(bloom_filter->bit_array, hash, 1); >>> + } >>> + >>> + spin_unlock_irqrestore(&bloom_filter->spinlock, spinlock_flags); >>> + >>> + return 0; >>> +} >>> + >> [...] >
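To make the two suggestions above concrete (hash into a small local array first, then set bits atomically instead of holding a spinlock), here is a rough userspace sketch. It is not the patch's code: `toy_hash()` is a seeded FNV-1a standing in for `jhash()`, C11 `atomic_fetch_or` stands in for the kernel's `set_bit()`, and `MAX_HASHES`, `NR_BITS` and the `struct bloom` layout are assumptions made up for the illustration.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_HASHES    16u    /* assumed upper bound on nr_hashes */
#define NR_BITS       1024u  /* assumed filter size; a power of two so masking works */
#define BITS_PER_WORD 64u

static_assert((NR_BITS & (NR_BITS - 1)) == 0, "NR_BITS must be a power of two");

struct bloom {
	uint32_t nr_hashes;
	uint32_t seed;
	_Atomic uint64_t bits[NR_BITS / BITS_PER_WORD];
};

/* Stand-in for jhash(): seeded FNV-1a. NOT the kernel's hash function. */
static uint32_t toy_hash(const void *data, size_t len, uint32_t seed)
{
	const unsigned char *p = data;
	uint32_t h = 2166136261u ^ seed;

	for (size_t i = 0; i < len; i++) {
		h ^= p[i];
		h *= 16777619u;
	}
	return h;
}

/* Returns 0 on success, -1 if nr_hashes is outside the assumed bound. */
static int bloom_push(struct bloom *b, const void *val, size_t len)
{
	uint32_t idx[MAX_HASHES];
	uint32_t i;

	if (b->nr_hashes == 0 || b->nr_hashes > MAX_HASHES)
		return -1;

	/* Phase 1: all hashing up front; no shared state is touched. */
	for (i = 0; i < b->nr_hashes; i++)
		idx[i] = toy_hash(val, len, b->seed + i) & (NR_BITS - 1);

	/* Phase 2: only bit flips remain; each one is an atomic OR. */
	for (i = 0; i < b->nr_hashes; i++)
		atomic_fetch_or_explicit(&b->bits[idx[i] / BITS_PER_WORD],
					 UINT64_C(1) << (idx[i] % BITS_PER_WORD),
					 memory_order_relaxed);
	return 0;
}

/* Returns 1 if the value may be present, 0 if it is definitely absent. */
static int bloom_peek(struct bloom *b, const void *val, size_t len)
{
	for (uint32_t i = 0; i < b->nr_hashes; i++) {
		uint32_t bit = toy_hash(val, len, b->seed + i) & (NR_BITS - 1);
		uint64_t word = atomic_load_explicit(&b->bits[bit / BITS_PER_WORD],
						     memory_order_relaxed);

		if (!(word & (UINT64_C(1) << (bit % BITS_PER_WORD))))
			return 0;
	}
	return 1;
}
```

With either variant, the expensive part (hashing `value_size` bytes `nr_hashes` times) happens before any shared word is written, which is what the benchmark comparison asked for above would measure.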
On Thu, Sep 02, 2021 at 03:07:56PM -0700, Joanne Koong wrote: [ ... ] > > > But one high-level point I wanted to discuss was that bloom filter > > > logic is actually simple enough to be implementable by pure BPF > > > program logic. The only problematic part is generic hashing of a piece > > > of memory. Regardless of implementing bloom filter as kernel-provided > > > BPF map or implementing it with custom BPF program logic, having BPF > > > helper for hashing a piece of memory seems extremely useful and very > > > generic. I can't recall if we ever discussed adding such helpers, but > > > maybe we should? > > Aha started typing the same thing :) > > > > Adding generic hash helper has been on my todo list and close to the top > > now. The use case is hashing data from skb payloads and such from kprobe > > and sockmap side. I'm happy to work on it as soon as possible if no one > > else picks it up. > > > > > It would be a really interesting experiment to implement the same > > > logic in pure BPF logic and run it as another benchmark, along the > > > Bloom filter map. BPF has both spinlock and atomic operation, so we > > > can compare and contrast. We only miss hashing BPF helper. > > > > I've find small seemingly unrelated changes cause the complexity limit > > to explode. Usually we can work around it with code to get pruning > > points and such, but its a bit ugly. Perhaps this means we need > > to dive into details of why the complexity explodes, but I've not > > got to it yet. The todo list is long. > > > > > Being able to do this in pure BPF code has a bunch of advantages. 
> > > Depending on specific application, users can decide to: > > > - speed up the operation by ditching spinlock or atomic operation, > > > if the logic allows to lose some bit updates; > > > - decide on optimal size, which might not be a power of 2, depending > > > on memory vs CPU trade of in any particular case; > > > - it's also possible to implement a more general Counting Bloom > > > filter, all without modifying the kernel. > > Also it means no call and if you build it on top of an array > > map of size 1 its just a load. Could this be a performance win for > > example a Bloom filter in XDP for DDOS? Maybe. Not sure if the program > > would be complex enough a call might be in the noise. I don't know. > > > > > We could go further, and start implementing other simple data > > > structures relying on hashing, like HyperLogLog. And all with no > > > kernel modifications. Map-in-map is no issue as well, because there is > > > a choice of using either fixed global data arrays for maximum > > > performance, or using BPF_MAP_TYPE_ARRAY maps that can go inside > > > map-in-map. > > We've been doing most of our array maps as single entry arrays > > at this point and just calculating offsets directly in BPF. Same > > for some simple hashing algorithms. > > > > > Basically, regardless of having this map in the kernel or not, let's > > > have a "universal" hashing function as a BPF helper as well. > > > Thoughts? > > I like it, but not the bloom filter expert here. > > Ooh, I like your idea of comparing the performance of the bloom filter with > a kernel-provided BPF map vs. custom BPF program logic using a > hash helper, especially if a BPF hash helper is something useful that > we want to add to the codebase in and of itself! I think a hash helper will be useful in general but could it be a separate experiment to try using it to implement some bpf maps (probably a mix of an easy one and a little harder one) ?
On 9/2/21 5:56 PM, Martin KaFai Lau wrote: > On Thu, Sep 02, 2021 at 03:07:56PM -0700, Joanne Koong wrote: > [ ... ] >>>> But one high-level point I wanted to discuss was that bloom filter >>>> logic is actually simple enough to be implementable by pure BPF >>>> program logic. The only problematic part is generic hashing of a piece >>>> of memory. Regardless of implementing bloom filter as kernel-provided >>>> BPF map or implementing it with custom BPF program logic, having BPF >>>> helper for hashing a piece of memory seems extremely useful and very >>>> generic. I can't recall if we ever discussed adding such helpers, but >>>> maybe we should? >>> Aha started typing the same thing :) >>> >>> Adding generic hash helper has been on my todo list and close to the top >>> now. The use case is hashing data from skb payloads and such from kprobe >>> and sockmap side. I'm happy to work on it as soon as possible if no one >>> else picks it up. >>> After thinking through this some more, I'm curious to hear your thoughts, Andrii and John, on how the bitmap would be allocated. From what I understand, we do not currently support dynamic memory allocations in bpf programs. Assuming the optimal number of bits the user wants to use for their bitmap follows something like num_entries * num_hash_functions / ln(2), I think the bitmap would have to be dynamically allocated in the bpf program since it'd be too large to store on the stack, unless there's some other way I'm not seeing? >>>> It would be a really interesting experiment to implement the same >>>> logic in pure BPF logic and run it as another benchmark, along the >>>> Bloom filter map. BPF has both spinlock and atomic operation, so we >>>> can compare and contrast. We only miss hashing BPF helper. >>> The one issue I've found writing a hash logic is its a bit tricky >>> to get the verifier to consume it. 
Especially when the hash is nested >>> inside a for loop and sometimes a couple for loops so you end up with >>> things like, >>> >>> for (i = 0; i < someTlvs; i++) { >>> for (j = 0; j < someKeys; i++) { >>> ... >>> bpf_hash(someValue) >>> ... >>> } >>> >>> I've find small seemingly unrelated changes cause the complexity limit >>> to explode. Usually we can work around it with code to get pruning >>> points and such, but its a bit ugly. Perhaps this means we need >>> to dive into details of why the complexity explodes, but I've not >>> got to it yet. The todo list is long. Out of curiosity, why would this helper have trouble in the verifier? From a quick glance, it seems like the implementation for it would be pretty similar to how bpf_get_prandom_u32() is implemented (except where the arguments for the hash helper would take in a void* data (ARG_PTR_TO_MEM), the size of the data buffer, and the seed)? I'm a bit new to bpf, so there's a good chance I might be completely overlooking something here :) >>>> Being able to do this in pure BPF code has a bunch of advantages. >>>> Depending on specific application, users can decide to: >>>> - speed up the operation by ditching spinlock or atomic operation, >>>> if the logic allows to lose some bit updates; >>>> - decide on optimal size, which might not be a power of 2, depending >>>> on memory vs CPU trade of in any particular case; >>>> - it's also possible to implement a more general Counting Bloom >>>> filter, all without modifying the kernel. >>> Also it means no call and if you build it on top of an array >>> map of size 1 its just a load. Could this be a performance win for >>> example a Bloom filter in XDP for DDOS? Maybe. Not sure if the program >>> would be complex enough a call might be in the noise. I don't know. >>> >>>> We could go further, and start implementing other simple data >>>> structures relying on hashing, like HyperLogLog. And all with no >>>> kernel modifications. 
Map-in-map is no issue as well, because there is >>>> a choice of using either fixed global data arrays for maximum >>>> performance, or using BPF_MAP_TYPE_ARRAY maps that can go inside >>>> map-in-map. >>> We've been doing most of our array maps as single entry arrays >>> at this point and just calculating offsets directly in BPF. Same >>> for some simple hashing algorithms. >>> >>>> Basically, regardless of having this map in the kernel or not, let's >>>> have a "universal" hashing function as a BPF helper as well. >>>> Thoughts? >>> I like it, but not the bloom filter expert here. >> Ooh, I like your idea of comparing the performance of the bloom filter with >> a kernel-provided BPF map vs. custom BPF program logic using a >> hash helper, especially if a BPF hash helper is something useful that >> we want to add to the codebase in and of itself! > I think a hash helper will be useful in general but could it be a > separate experiment to try using it to implement some bpf maps (probably > a mix of an easy one and a little harder one) ? I agree, I think the hash helper implementation should be its own separate patchset orthogonal to this one.
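For a sense of scale on the stack question raised above, a small sketch of the arithmetic follows. It reuses the 7 / 5 approximation of 1 / ln(2) from the patch comment and compares the result against the 512-byte BPF stack limit; the function names are made up for this illustration.

```c
#include <stdint.h>

#define BPF_STACK_LIMIT_BYTES 512u /* per-program BPF stack limit */

/* bits ~= n * k / ln(2), with 7 / 5 approximating 1 / ln(2) as in the patch comment */
static uint64_t bloom_bits(uint64_t n, uint64_t k)
{
	return n * k / 5 * 7;
}

/* Round the bit count up to whole bytes. */
static uint64_t bloom_bytes(uint64_t n, uint64_t k)
{
	return (bloom_bits(n, k) + 7) / 8;
}

/* Non-zero if a bitmap for n entries and k hashes would fit in the stack limit. */
static int fits_on_bpf_stack(uint64_t n, uint64_t k)
{
	return bloom_bytes(n, k) <= BPF_STACK_LIMIT_BYTES;
}
```

For example, 1000 entries with 3 hashes already need 4200 bits (525 bytes), so anything beyond a toy filter has to live outside the stack.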
On Fri, Sep 3, 2021 at 12:13 AM Joanne Koong <joannekoong@fb.com> wrote: > > On 9/2/21 5:56 PM, Martin KaFai Lau wrote: > > > On Thu, Sep 02, 2021 at 03:07:56PM -0700, Joanne Koong wrote: > > [ ... ] > >>>> But one high-level point I wanted to discuss was that bloom filter > >>>> logic is actually simple enough to be implementable by pure BPF > >>>> program logic. The only problematic part is generic hashing of a piece > >>>> of memory. Regardless of implementing bloom filter as kernel-provided > >>>> BPF map or implementing it with custom BPF program logic, having BPF > >>>> helper for hashing a piece of memory seems extremely useful and very > >>>> generic. I can't recall if we ever discussed adding such helpers, but > >>>> maybe we should? > >>> Aha started typing the same thing :) > >>> > >>> Adding generic hash helper has been on my todo list and close to the top > >>> now. The use case is hashing data from skb payloads and such from kprobe > >>> and sockmap side. I'm happy to work on it as soon as possible if no one > >>> else picks it up. > >>> > After thinking through this some more, I'm curious to hear your thoughts, > Andrii and John, on how the bitmap would be allocated. From what I > understand, we do not currently support dynamic memory allocations > in bpf programs. Assuming the optimal number of bits the user wants > to use for their bitmap follows something like > num_entries * num_hash_functions / ln(2), I think the bitmap would > have to be dynamically allocated in the bpf program since it'd be too > large to store on the stack, unless there's some other way I'm not seeing?
You can either use BPF_MAP_TYPE_ARRAY and size it at runtime, or use a compile-time fixed-size array in the BPF program:
    u64 bits[HOWEVER_MANY_U64S_WE_NEED];
    /* then in BPF program itself */
    h = hash(...);
    bits[h / 64] |= (1ULL << (h % 64));
As an example. The latter case avoids map lookups completely, except you'd need to prove to the verifier that you are not going out of bounds for bits, which is simple to do if HOWEVER_MANY_U64S_WE_NEED is a power of 2. Then you can do:
    h = hash(...);
    bits[(h / 64) & (HOWEVER_MANY_U64S_WE_NEED - 1)] |= (1ULL << (h % 64));
> >>>> It would be a really interesting experiment to implement the same > >>>> logic in pure BPF logic and run it as another benchmark, along the > >>>> Bloom filter map. BPF has both spinlock and atomic operation, so we > >>>> can compare and contrast. We only miss hashing BPF helper. > >>> The one issue I've found writing a hash logic is its a bit tricky > >>> to get the verifier to consume it. Especially when the hash is nested > >>> inside a for loop and sometimes a couple for loops so you end up with > >>> things like, > >>> > >>> for (i = 0; i < someTlvs; i++) { > >>> for (j = 0; j < someKeys; i++) { > >>> ... > >>> bpf_hash(someValue) > >>> ... > >>> } > >>> > >>> I've find small seemingly unrelated changes cause the complexity limit > >>> to explode. Usually we can work around it with code to get pruning
btw, global BPF functions (sub-programs) should limit this complexity explosion, even if you implement your own hashing function purely in BPF.
> >>> points and such, but its a bit ugly. Perhaps this means we need > >>> to dive into details of why the complexity explodes, but I've not > >>> got to it yet. The todo list is long. > Out of curiosity, why would this helper have trouble in the verifier? > From a quick glance, it seems like the implementation for it would > be pretty similar to how bpf_get_prandom_u32() is implemented > (except where the arguments for the hash helper would take in a > void* data (ARG_PTR_TO_MEM), the size of the data buffer, and > the seed)? I'm a bit new to bpf, so there's a good chance I might be > completely overlooking something here :)
Curious as well.
I imagine we'd define new helper with this signature: u64 bpf_hash_mem(void *data, u64 sz, enum bpf_hash_func hash_fn, u64 flags); Where enum bpf_hash_func { JHASH, MURMUR, CRC32, etc }, whatever is available in the kernel (or will be added later). John, would this still cause problems for the verifier? > > >>>> Being able to do this in pure BPF code has a bunch of advantages. > >>>> Depending on specific application, users can decide to: > >>>> - speed up the operation by ditching spinlock or atomic operation, > >>>> if the logic allows to lose some bit updates; > >>>> - decide on optimal size, which might not be a power of 2, depending > >>>> on memory vs CPU trade of in any particular case; > >>>> - it's also possible to implement a more general Counting Bloom > >>>> filter, all without modifying the kernel. > >>> Also it means no call and if you build it on top of an array > >>> map of size 1 its just a load. Could this be a performance win for > >>> example a Bloom filter in XDP for DDOS? Maybe. Not sure if the program > >>> would be complex enough a call might be in the noise. I don't know. > >>> > >>>> We could go further, and start implementing other simple data > >>>> structures relying on hashing, like HyperLogLog. And all with no > >>>> kernel modifications. Map-in-map is no issue as well, because there is > >>>> a choice of using either fixed global data arrays for maximum > >>>> performance, or using BPF_MAP_TYPE_ARRAY maps that can go inside > >>>> map-in-map. > >>> We've been doing most of our array maps as single entry arrays > >>> at this point and just calculating offsets directly in BPF. Same > >>> for some simple hashing algorithms. > >>> > >>>> Basically, regardless of having this map in the kernel or not, let's > >>>> have a "universal" hashing function as a BPF helper as well. > >>>> Thoughts? > >>> I like it, but not the bloom filter expert here. 
> >> Ooh, I like your idea of comparing the performance of the bloom filter with > >> a kernel-provided BPF map vs. custom BPF program logic using a > >> hash helper, especially if a BPF hash helper is something useful that > >> we want to add to the codebase in and of itself!
> > I think a hash helper will be useful in general but could it be a > > separate experiment to try using it to implement some bpf maps (probably > > a mix of an easy one and a little harder one) ? >
> I agree, I think the hash helper implementation should be its own separate > patchset orthogonal to this one. >
Sure, I don't feel strongly against having a Bloom filter as a BPF map.
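The fixed-size bit array with power-of-two masking described in this message can be tried out in plain userspace C. The sketch below is only an illustration of the indexing: `NR_WORDS`, `bloom_words`, `bits_set()` and `bits_test()` are names made up here, and a real BPF program would additionally need the hash itself. Note that the shift uses a 64-bit constant, since a plain `int` 1 cannot be shifted by up to 63 positions.

```c
#include <assert.h>
#include <stdint.h>

#define NR_WORDS 128u /* hypothetical size; must be a power of two */

static_assert((NR_WORDS & (NR_WORDS - 1)) == 0, "NR_WORDS must be a power of two");

static uint64_t bloom_words[NR_WORDS];

/* Set the bit selected by hash h; the mask keeps the word index in bounds. */
static void bits_set(uint32_t h)
{
	bloom_words[(h / 64) & (NR_WORDS - 1)] |= UINT64_C(1) << (h % 64);
}

/* Return 1 if the bit selected by hash h is set, 0 otherwise. */
static int bits_test(uint32_t h)
{
	return (int)((bloom_words[(h / 64) & (NR_WORDS - 1)] >> (h % 64)) & 1);
}
```

Because the word index is masked with `NR_WORDS - 1`, any 32-bit hash lands inside the array, which is the property the verifier needs to see; hashes beyond `NR_WORDS * 64` simply wrap around.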
Joanne Koong wrote: > On 9/1/21 10:11 PM, John Fastabend wrote: > > > Andrii Nakryiko wrote: > >> On Tue, Aug 31, 2021 at 3:51 PM Joanne Koong <joannekoong@fb.com> wrote: > >>> Bloom filters are a space-efficient probabilistic data structure > >>> used to quickly test whether an element exists in a set. > >>> In a bloom filter, false positives are possible whereas false > >>> negatives are not. > >>> > >>> This patch adds a bloom filter map for bpf programs. > >>> The bloom filter map supports peek (determining whether an element > >>> is present in the map) and push (adding an element to the map) > >>> operations.These operations are exposed to userspace applications > >>> through the already existing syscalls in the following way: > >>> > >>> BPF_MAP_LOOKUP_ELEM -> peek > >>> BPF_MAP_UPDATE_ELEM -> push > >>> > >>> The bloom filter map does not have keys, only values. In light of > >>> this, the bloom filter map's API matches that of queue stack maps: > >>> user applications use BPF_MAP_LOOKUP_ELEM/BPF_MAP_UPDATE_ELEM > >>> which correspond internally to bpf_map_peek_elem/bpf_map_push_elem, > >>> and bpf programs must use the bpf_map_peek_elem and bpf_map_push_elem > >>> APIs to query or add an element to the bloom filter map. When the > >>> bloom filter map is created, it must be created with a key_size of 0. > >>> > >>> For updates, the user will pass in the element to add to the map > >>> as the value, wih a NULL key. For lookups, the user will pass in the > >>> element to query in the map as the value. In the verifier layer, this > >>> requires us to modify the argument type of a bloom filter's > >>> BPF_FUNC_map_peek_elem call to ARG_PTR_TO_MAP_VALUE; as well, in > >>> the syscall layer, we need to copy over the user value so that in > >>> bpf_map_peek_elem, we know which specific value to query. 
> >>> > >>> The maximum number of entries in the bloom filter is not enforced; if > >>> the user wishes to insert more entries into the bloom filter than they > >>> specified as the max entries size of the bloom filter, that is permitted > >>> but their bloom filter will have a higher false > >>> positive rate. > >>> > >>> The number of hashes to use for the bloom filter is configurable from > >>> userspace. The benchmarks later in this patchset can help compare the > >>> performances of different number of hashes on different entry > >>> sizes. In general, using more hashes decreases the speed of a lookup, > >>> but decreases the false positive rate of an element being detected in the > >>> bloom filter. > >>> > >>> Signed-off-by: Joanne Koong <joannekoong@fb.com> [...] > >>> +static struct bpf_map *bloom_filter_map_alloc(union bpf_attr *attr) > >>> +{ > >>> + int numa_node = bpf_map_attr_numa_node(attr); > >>> + u32 nr_bits, bit_array_bytes, bit_array_mask; > >>> + struct bpf_bloom_filter *bloom_filter; > >>> + > >>> + if (!bpf_capable()) > >>> + return ERR_PTR(-EPERM); > >>> + > >>> + if (attr->key_size != 0 || attr->value_size == 0 || attr->max_entries == 0 || > >>> + attr->nr_hashes == 0 || attr->map_flags & ~BLOOM_FILTER_CREATE_FLAG_MASK || > >>> + !bpf_map_flags_access_ok(attr->map_flags)) > >>> + return ERR_PTR(-EINVAL); > >>> + > >>> + /* For the bloom filter, the optimal bit array size that minimizes the > >>> + * false positive probability is n * k / ln(2) where n is the number of > >>> + * expected entries in the bloom filter and k is the number of hash > >>> + * functions. We use 7 / 5 to approximate 1 / ln(2). > >>> + * > >>> + * We round this up to the nearest power of two to enable more efficient > >>> + * hashing using bitmasks. The bitmask will be the bit array size - 1. > >>> + * > >>> + * If this overflows a u32, the bit array size will have 2^32 (4 > >>> + * GB) bits. > > Would it be better to return E2BIG or EINVAL here?
Speculating a bit, but if I was > > a user I might want to know that the number of bits I pushed down is not the actual > > number? > > I think if we return E2BIG or EINVAL here, this will fail to create the > bloom filter map > if the max_entries exceeds some limit (~3 billion, according to my math) > whereas > automatically setting the bit array size to 2^32 if the max_entries is > extraordinarily large will still allow the user to create and use a > bloom filter (albeit > one with a higher false positive rate).
It doesn't matter much to me, but I think if a user requests 3+ billion max entries it's ok to return E2BIG and then they can use a lower limit and know the false positive rate is going to go up.
> > > Another thought, would it be simpler to let user do this calculation and just let > > max_elements be number of bits they want? Then we could have examples with the > > above comment. Just a thought... > > I like Martin's idea of keeping the max_entries meaning consistent > across all map types. > I think that makes the interface clearer for users.
I'm convinced as well, let's keep it consistent. Thanks.
[...]
> >> Also, I wonder if ditching spinlock in favor of atomic bit set > >> operation would improve performance in typical scenarios. Seems like > >> set_bit() is an atomic operation, so it should be easy to test. Do you > >> mind running benchmarks with spinlock and with set_bit()? > > With the jhash pulled out of lock, I think it might be noticeable. Curious > > to see. > Awesome, I will test this out and report back!
It looks like the benchmark tests were done with a value size of __u64; should we try larger entries? I guess (you tell me?) if this is used from XDP for DDoS you would use a flow tuple, and with IPv6 this could be {dstIp, srcIp, sport, dport, proto} with roughly 44B.
> >>> + for (i = 0; i < bloom_filter->map.nr_hashes; i++) { > >>> + hash = jhash(value, map->value_size, bloom_filter->hash_seed + i) & > >>> + bloom_filter->bit_array_mask; > >>> + bitmap_set(bloom_filter->bit_array, hash, 1); > >>> + } > >>> + > >>> + spin_unlock_irqrestore(&bloom_filter->spinlock, spinlock_flags); > >>> + > >>> + return 0; > >>> +} > >>> + > >> [...] > >
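To illustrate the alternative discussed in this message (reject an oversized request instead of silently capping the bit array), here is a hedged userspace sketch of the sizing step. It is not the patch's code: the function name `bloom_nr_bits()` is made up, the arithmetic is done in 64 bits instead of using `check_mul_overflow()`, and the 64-bit minimum assumes `BITS_PER_LONG` is 64.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Compute the bit array size for max_entries and nr_hashes:
 * n * k * 7 / 5 (approximating n * k / ln(2)), rounded up to a power of two.
 * Returns 0 and stores the size in *out, or -E2BIG if more than 2^31 bits
 * would be needed.
 */
static int bloom_nr_bits(uint32_t max_entries, uint32_t nr_hashes, uint32_t *out)
{
	uint64_t bits = (uint64_t)max_entries * nr_hashes / 5 * 7;
	uint32_t pow2 = 64; /* assumed minimum: one 64-bit word */

	assert(out != NULL);

	if (bits > (UINT64_C(1) << 31))
		return -E2BIG;

	/* bits <= 2^31 here, so pow2 never overflows a u32. */
	while (pow2 < bits)
		pow2 <<= 1;

	*out = pow2;
	return 0;
}
```

The bitmask used for hashing would then be `*out - 1`. Whether to return an error or fall back to the largest size is exactly the interface question being debated in this thread; the sketch only shows what the error-returning variant could look like.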
On 9/3/21 10:22 AM, John Fastabend wrote: > Joanne Koong wrote: >> On 9/1/21 10:11 PM, John Fastabend wrote: >> >>> Andrii Nakryiko wrote: >>>> On Tue, Aug 31, 2021 at 3:51 PM Joanne Koong <joannekoong@fb.com> wrote: >>>>> Bloom filters are a space-efficient probabilistic data structure >>>>> used to quickly test whether an element exists in a set. >>>>> In a bloom filter, false positives are possible whereas false >>>>> negatives are not. >>>>> >>>>> This patch adds a bloom filter map for bpf programs. >>>>> The bloom filter map supports peek (determining whether an element >>>>> is present in the map) and push (adding an element to the map) >>>>> operations.These operations are exposed to userspace applications >>>>> through the already existing syscalls in the following way: >>>>> >>>>> BPF_MAP_LOOKUP_ELEM -> peek >>>>> BPF_MAP_UPDATE_ELEM -> push >>>>> >>>>> The bloom filter map does not have keys, only values. In light of >>>>> this, the bloom filter map's API matches that of queue stack maps: >>>>> user applications use BPF_MAP_LOOKUP_ELEM/BPF_MAP_UPDATE_ELEM >>>>> which correspond internally to bpf_map_peek_elem/bpf_map_push_elem, >>>>> and bpf programs must use the bpf_map_peek_elem and bpf_map_push_elem >>>>> APIs to query or add an element to the bloom filter map. When the >>>>> bloom filter map is created, it must be created with a key_size of 0. >>>>> >>>>> For updates, the user will pass in the element to add to the map >>>>> as the value, wih a NULL key. For lookups, the user will pass in the >>>>> element to query in the map as the value. In the verifier layer, this >>>>> requires us to modify the argument type of a bloom filter's >>>>> BPF_FUNC_map_peek_elem call to ARG_PTR_TO_MAP_VALUE; as well, in >>>>> the syscall layer, we need to copy over the user value so that in >>>>> bpf_map_peek_elem, we know which specific value to query. 
>>>>> >>>>> The maximum number of entries in the bloom filter is not enforced; if >>>>> the user wishes to insert more entries into the bloom filter than they >>>>> specified as the max entries size of the bloom filter, that is permitted >>>>> but their bloom filter will have a higher false >>>>> positive rate. >>>>> >>>>> The number of hashes to use for the bloom filter is configurable from >>>>> userspace. The benchmarks later in this patchset can help compare the >>>>> performances of different number of hashes on different entry >>>>> sizes. In general, using more hashes decreases the speed of a lookup, >>>>> but decreases the false positive rate of an element being detected in the >>>>> bloom filter. >>>>> >>>>> Signed-off-by: Joanne Koong <joannekoong@fb.com> > [...] > >>>>> +static struct bpf_map *bloom_filter_map_alloc(union bpf_attr *attr) >>>>> +{ >>>>> + int numa_node = bpf_map_attr_numa_node(attr); >>>>> + u32 nr_bits, bit_array_bytes, bit_array_mask; >>>>> + struct bpf_bloom_filter *bloom_filter; >>>>> + >>>>> + if (!bpf_capable()) >>>>> + return ERR_PTR(-EPERM); >>>>> + >>>>> + if (attr->key_size != 0 || attr->value_size == 0 || attr->max_entries == 0 || >>>>> + attr->nr_hashes == 0 || attr->map_flags & ~BLOOM_FILTER_CREATE_FLAG_MASK || >>>>> + !bpf_map_flags_access_ok(attr->map_flags)) >>>>> + return ERR_PTR(-EINVAL); >>>>> + >>>>> + /* For the bloom filter, the optimal bit array size that minimizes the >>>>> + * false positive probability is n * k / ln(2) where n is the number of >>>>> + * expected entries in the bloom filter and k is the number of hash >>>>> + * functions. We use 7 / 5 to approximate 1 / ln(2). >>>>> + * >>>>> + * We round this up to the nearest power of two to enable more efficient >>>>> + * hashing using bitmasks. The bitmask will be the bit array size - 1. >>>>> + * >>>>> + * If this overflows a u32, the bit array size will have 2^32 (4 >>>>> + * GB) bits. >>> Would it be better to return E2BIG or EINVAL here?
Speculating a bit, but if I was >>> a user I might want to know that the number of bits I pushed down is not the actual >>> number? >> I think if we return E2BIG or EINVAL here, this will fail to create the >> bloom filter map >> if the max_entries exceeds some limit (~3 billion, according to my math) >> whereas >> automatically setting the bit array size to 2^32 if the max_entries is >> extraordinarily large will still allow the user to create and use a >> bloom filter (albeit >> one with a higher false positive rate). > It doesn't matter much to me, but I think if a user requests 3+ billion max entries > it's ok to return E2BIG and then they can use a lower limit and know the > false positive rate is going to go up. > >>> Another thought, would it be simpler to let user do this calculation and just let >>> max_elements be number of bits they want? Then we could have examples with the >>> above comment. Just a thought... >> I like Martin's idea of keeping the max_entries meaning consistent >> across all map types. >> I think that makes the interface clearer for users. > I'm convinced as well, let's keep it consistent. Thanks. > > [...] > >>>> Also, I wonder if ditching spinlock in favor of atomic bit set >>>> operation would improve performance in typical scenarios. Seems like >>>> set_bit() is an atomic operation, so it should be easy to test. Do you >>>> mind running benchmarks with spinlock and with set_bit()? >>> With the jhash pulled out of lock, I think it might be noticeable. Curious >>> to see. >> Awesome, I will test this out and report back! > It looks like the benchmark tests were done with a value size of __u64; should > we try larger entries? I guess (you tell me?) if this is used from XDP for > DDoS you would use a flow tuple, and with IPv6 this could be > {dstIp, srcIp, sport, dport, proto} with roughly 44B.
Great suggestion. Alexei mentioned this as well in his earlier reply.
I am planning to run benchmarks on v2 using value sizes of 4, 8, 16, and 40 bytes.
>>>>> + for (i = 0; i < bloom_filter->map.nr_hashes; i++) { >>>>> + hash = jhash(value, map->value_size, bloom_filter->hash_seed + i) & >>>>> + bloom_filter->bit_array_mask; >>>>> + bitmap_set(bloom_filter->bit_array, hash, 1); >>>>> + } >>>>> + >>>>> + spin_unlock_irqrestore(&bloom_filter->spinlock, spinlock_flags); >>>>> + >>>>> + return 0; >>>>> +} >>>>> + >>>> [...]
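As one possible reading of the flow-tuple case mentioned above, the sketch below lays out an IPv6 5-tuple whose fields add up to 37 bytes and which occupies 40 bytes once padded to 4-byte alignment. The struct and function names are hypothetical and not taken from the patchset. Since the map hashes all `value_size` bytes, the padding is made explicit and zeroed so that two equal tuples hash identically.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical IPv6 flow tuple: 16 + 16 + 2 + 2 + 1 = 37 bytes of fields. */
struct flow_tuple_v6 {
	uint32_t dst_ip[4];
	uint32_t src_ip[4];
	uint16_t sport;
	uint16_t dport;
	uint8_t  proto;
	uint8_t  pad[3]; /* explicit padding, so every hashed byte is defined */
};

static_assert(sizeof(struct flow_tuple_v6) == 40, "expected a 40-byte value");

/* Fill in a tuple; the memset guarantees the padding bytes are zero. */
static void flow_tuple_init(struct flow_tuple_v6 *t,
			    const uint32_t dst[4], const uint32_t src[4],
			    uint16_t sport, uint16_t dport, uint8_t proto)
{
	memset(t, 0, sizeof(*t));
	memcpy(t->dst_ip, dst, sizeof(t->dst_ip));
	memcpy(t->src_ip, src, sizeof(t->src_ip));
	t->sport = sport;
	t->dport = dport;
	t->proto = proto;
}
```

A value like this would be passed as the element to push or peek, with the map's `value_size` set to `sizeof(struct flow_tuple_v6)`.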
diff --git a/include/linux/bpf.h b/include/linux/bpf.h index f4c16f19f83e..2abaa1052096 100644 --- a/include/linux/bpf.h +++ b/include/linux/bpf.h @@ -181,7 +181,8 @@ struct bpf_map { u32 btf_vmlinux_value_type_id; bool bypass_spec_v1; bool frozen; /* write-once; write-protected by freeze_mutex */ - /* 22 bytes hole */ + u32 nr_hashes; /* used for bloom filter maps */ + /* 18 bytes hole */ /* The 3rd and 4th cacheline with misc members to avoid false sharing * particularly with refcounting. diff --git a/include/linux/bpf_types.h b/include/linux/bpf_types.h index 9c81724e4b98..c4424ac2fa02 100644 --- a/include/linux/bpf_types.h +++ b/include/linux/bpf_types.h @@ -125,6 +125,7 @@ BPF_MAP_TYPE(BPF_MAP_TYPE_STACK, stack_map_ops) BPF_MAP_TYPE(BPF_MAP_TYPE_STRUCT_OPS, bpf_struct_ops_map_ops) #endif BPF_MAP_TYPE(BPF_MAP_TYPE_RINGBUF, ringbuf_map_ops) +BPF_MAP_TYPE(BPF_MAP_TYPE_BLOOM_FILTER, bloom_filter_map_ops) BPF_LINK_TYPE(BPF_LINK_TYPE_RAW_TRACEPOINT, raw_tracepoint) BPF_LINK_TYPE(BPF_LINK_TYPE_TRACING, tracing) diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h index 791f31dd0abe..c2acb0a510fe 100644 --- a/include/uapi/linux/bpf.h +++ b/include/uapi/linux/bpf.h @@ -906,6 +906,7 @@ enum bpf_map_type { BPF_MAP_TYPE_RINGBUF, BPF_MAP_TYPE_INODE_STORAGE, BPF_MAP_TYPE_TASK_STORAGE, + BPF_MAP_TYPE_BLOOM_FILTER, }; /* Note that tracing related programs such as @@ -1274,6 +1275,7 @@ union bpf_attr { * struct stored as the * map value */ + __u32 nr_hashes; /* used for configuring bloom filter maps */ }; struct { /* anonymous struct used by BPF_MAP_*_ELEM commands */ @@ -5594,6 +5596,7 @@ struct bpf_map_info { __u32 btf_id; __u32 btf_key_type_id; __u32 btf_value_type_id; + __u32 nr_hashes; /* used for bloom filter maps */ } __attribute__((aligned(8))); struct bpf_btf_info { diff --git a/kernel/bpf/Makefile b/kernel/bpf/Makefile index 7f33098ca63f..cf6ca339f3cd 100644 --- a/kernel/bpf/Makefile +++ b/kernel/bpf/Makefile @@ -7,7 +7,7 @@ endif CFLAGS_core.o += $(call 
cc-disable-warning, override-init) $(cflags-nogcse-yy) obj-$(CONFIG_BPF_SYSCALL) += syscall.o verifier.o inode.o helpers.o tnum.o bpf_iter.o map_iter.o task_iter.o prog_iter.o -obj-$(CONFIG_BPF_SYSCALL) += hashtab.o arraymap.o percpu_freelist.o bpf_lru_list.o lpm_trie.o map_in_map.o +obj-$(CONFIG_BPF_SYSCALL) += hashtab.o arraymap.o percpu_freelist.o bpf_lru_list.o lpm_trie.o map_in_map.o bloom_filter.o obj-$(CONFIG_BPF_SYSCALL) += local_storage.o queue_stack_maps.o ringbuf.o obj-$(CONFIG_BPF_SYSCALL) += bpf_local_storage.o bpf_task_storage.o obj-${CONFIG_BPF_LSM} += bpf_inode_storage.o diff --git a/kernel/bpf/bloom_filter.c b/kernel/bpf/bloom_filter.c new file mode 100644 index 000000000000..3ae799ab3747 --- /dev/null +++ b/kernel/bpf/bloom_filter.c @@ -0,0 +1,171 @@ +// SPDX-License-Identifier: GPL-2.0 +/* Copyright (c) 2021 Facebook */ + +#include <linux/bitmap.h> +#include <linux/bpf.h> +#include <linux/err.h> +#include <linux/jhash.h> +#include <linux/random.h> +#include <linux/spinlock.h> + +#define BLOOM_FILTER_CREATE_FLAG_MASK \ + (BPF_F_NUMA_NODE | BPF_F_ZERO_SEED | BPF_F_ACCESS_MASK) + +struct bpf_bloom_filter { + struct bpf_map map; + u32 bit_array_mask; + u32 hash_seed; + /* Used for synchronizing parallel writes to the bit array */ + spinlock_t spinlock; + unsigned long bit_array[]; +}; + +static int bloom_filter_map_peek_elem(struct bpf_map *map, void *value) +{ + struct bpf_bloom_filter *bloom_filter = + container_of(map, struct bpf_bloom_filter, map); + u32 i, hash; + + for (i = 0; i < bloom_filter->map.nr_hashes; i++) { + hash = jhash(value, map->value_size, bloom_filter->hash_seed + i) & + bloom_filter->bit_array_mask; + if (!test_bit(hash, bloom_filter->bit_array)) + return -ENOENT; + } + + return 0; +} + +static struct bpf_map *bloom_filter_map_alloc(union bpf_attr *attr) +{ + int numa_node = bpf_map_attr_numa_node(attr); + u32 nr_bits, bit_array_bytes, bit_array_mask; + struct bpf_bloom_filter *bloom_filter; + + if (!bpf_capable()) + return 
ERR_PTR(-EPERM); + + if (attr->key_size != 0 || attr->value_size == 0 || attr->max_entries == 0 || + attr->nr_hashes == 0 || attr->map_flags & ~BLOOM_FILTER_CREATE_FLAG_MASK || + !bpf_map_flags_access_ok(attr->map_flags)) + return ERR_PTR(-EINVAL); + + /* For the bloom filter, the optimal bit array size that minimizes the + * false positive probability is n * k / ln(2) where n is the number of + * expected entries in the bloom filter and k is the number of hash + * functions. We use 7 / 5 to approximate 1 / ln(2). + * + * We round this up to the nearest power of two to enable more efficient + * hashing using bitmasks. The bitmask will be the bit array size - 1. + * + * If this overflows a u32, the bit array size will have 2^32 (4 + * GB) bits. + */ + if (unlikely(check_mul_overflow(attr->max_entries, attr->nr_hashes, &nr_bits)) || + unlikely(check_mul_overflow(nr_bits / 5, (u32)7, &nr_bits)) || + unlikely(nr_bits > (1UL << 31))) { + /* The bit array size is 2^32 bits but to avoid overflowing the + * u32, we use BITS_TO_BYTES(U32_MAX), which will round up to the + * equivalent number of bytes + */ + bit_array_bytes = BITS_TO_BYTES(U32_MAX); + bit_array_mask = U32_MAX; + } else { + if (nr_bits <= BITS_PER_LONG) + nr_bits = BITS_PER_LONG; + else + nr_bits = roundup_pow_of_two(nr_bits); + bit_array_bytes = BITS_TO_BYTES(nr_bits); + bit_array_mask = nr_bits - 1; + } + + bit_array_bytes = roundup(bit_array_bytes, sizeof(unsigned long)); + bloom_filter = bpf_map_area_alloc(sizeof(*bloom_filter) + bit_array_bytes, + numa_node); + + if (!bloom_filter) + return ERR_PTR(-ENOMEM); + + bpf_map_init_from_attr(&bloom_filter->map, attr); + bloom_filter->map.nr_hashes = attr->nr_hashes; + + bloom_filter->bit_array_mask = bit_array_mask; + spin_lock_init(&bloom_filter->spinlock); + + if (!(attr->map_flags & BPF_F_ZERO_SEED)) + bloom_filter->hash_seed = get_random_int(); + + return &bloom_filter->map; +} + +static void bloom_filter_map_free(struct bpf_map *map) +{ + struct 
bpf_bloom_filter *bloom_filter = + container_of(map, struct bpf_bloom_filter, map); + + bpf_map_area_free(bloom_filter); +} + +static int bloom_filter_map_push_elem(struct bpf_map *map, void *value, + u64 flags) +{ + struct bpf_bloom_filter *bloom_filter = + container_of(map, struct bpf_bloom_filter, map); + unsigned long spinlock_flags; + u32 i, hash; + + if (flags != BPF_ANY) + return -EINVAL; + + spin_lock_irqsave(&bloom_filter->spinlock, spinlock_flags); + + for (i = 0; i < bloom_filter->map.nr_hashes; i++) { + hash = jhash(value, map->value_size, bloom_filter->hash_seed + i) & + bloom_filter->bit_array_mask; + bitmap_set(bloom_filter->bit_array, hash, 1); + } + + spin_unlock_irqrestore(&bloom_filter->spinlock, spinlock_flags); + + return 0; +} + +static void *bloom_filter_map_lookup_elem(struct bpf_map *map, void *key) +{ + /* The eBPF program should use map_peek_elem instead */ + return ERR_PTR(-EINVAL); +} + +static int bloom_filter_map_update_elem(struct bpf_map *map, void *key, + void *value, u64 flags) +{ + /* The eBPF program should use map_push_elem instead */ + return -EINVAL; +} + +static int bloom_filter_map_delete_elem(struct bpf_map *map, void *key) +{ + return -EOPNOTSUPP; +} + +static int bloom_filter_map_get_next_key(struct bpf_map *map, void *key, + void *next_key) +{ + return -EOPNOTSUPP; +} + +static int bloom_filter_map_btf_id; +const struct bpf_map_ops bloom_filter_map_ops = { + .map_meta_equal = bpf_map_meta_equal, + .map_alloc = bloom_filter_map_alloc, + .map_free = bloom_filter_map_free, + .map_push_elem = bloom_filter_map_push_elem, + .map_peek_elem = bloom_filter_map_peek_elem, + .map_lookup_elem = bloom_filter_map_lookup_elem, + .map_update_elem = bloom_filter_map_update_elem, + .map_delete_elem = bloom_filter_map_delete_elem, + .map_get_next_key = bloom_filter_map_get_next_key, + .map_check_btf = map_check_no_btf, + .map_btf_name = "bpf_bloom_filter", + .map_btf_id = &bloom_filter_map_btf_id, +}; diff --git a/kernel/bpf/syscall.c 
b/kernel/bpf/syscall.c index 4e50c0bfdb7d..b80bdda26fbf 100644 --- a/kernel/bpf/syscall.c +++ b/kernel/bpf/syscall.c @@ -199,7 +199,8 @@ static int bpf_map_update_value(struct bpf_map *map, struct fd f, void *key, err = bpf_fd_reuseport_array_update_elem(map, key, value, flags); } else if (map->map_type == BPF_MAP_TYPE_QUEUE || - map->map_type == BPF_MAP_TYPE_STACK) { + map->map_type == BPF_MAP_TYPE_STACK || + map->map_type == BPF_MAP_TYPE_BLOOM_FILTER) { err = map->ops->map_push_elem(map, value, flags); } else { rcu_read_lock(); @@ -238,7 +239,8 @@ static int bpf_map_copy_value(struct bpf_map *map, void *key, void *value, } else if (map->map_type == BPF_MAP_TYPE_REUSEPORT_SOCKARRAY) { err = bpf_fd_reuseport_array_lookup_elem(map, key, value); } else if (map->map_type == BPF_MAP_TYPE_QUEUE || - map->map_type == BPF_MAP_TYPE_STACK) { + map->map_type == BPF_MAP_TYPE_STACK || + map->map_type == BPF_MAP_TYPE_BLOOM_FILTER) { err = map->ops->map_peek_elem(map, value); } else if (map->map_type == BPF_MAP_TYPE_STRUCT_OPS) { /* struct_ops map requires directly updating "value" */ @@ -810,7 +812,7 @@ static int map_check_btf(struct bpf_map *map, const struct btf *btf, return ret; } -#define BPF_MAP_CREATE_LAST_FIELD btf_vmlinux_value_type_id +#define BPF_MAP_CREATE_LAST_FIELD nr_hashes /* called via syscall */ static int map_create(union bpf_attr *attr) { @@ -831,6 +833,9 @@ static int map_create(union bpf_attr *attr) return -EINVAL; } + if (attr->nr_hashes != 0 && attr->map_type != BPF_MAP_TYPE_BLOOM_FILTER) + return -EINVAL; + f_flags = bpf_get_file_flag(attr->map_flags); if (f_flags < 0) return f_flags; @@ -1080,6 +1085,14 @@ static int map_lookup_elem(union bpf_attr *attr) if (!value) goto free_key; + if (map->map_type == BPF_MAP_TYPE_BLOOM_FILTER) { + if (copy_from_user(value, uvalue, value_size)) + err = -EFAULT; + else + err = bpf_map_copy_value(map, key, value, attr->flags); + goto free_value; + } + err = bpf_map_copy_value(map, key, value, attr->flags); if (err) 
goto free_value; @@ -3872,6 +3885,7 @@ static int bpf_map_get_info_by_fd(struct file *file, info.max_entries = map->max_entries; info.map_flags = map->map_flags; memcpy(info.name, map->name, sizeof(map->name)); + info.nr_hashes = map->nr_hashes; if (map->btf) { info.btf_id = btf_obj_id(map->btf); diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c index 047ac4b4703b..5cbcff4c2222 100644 --- a/kernel/bpf/verifier.c +++ b/kernel/bpf/verifier.c @@ -4813,7 +4813,10 @@ static int resolve_map_arg_type(struct bpf_verifier_env *env, return -EINVAL; } break; - + case BPF_MAP_TYPE_BLOOM_FILTER: + if (meta->func_id == BPF_FUNC_map_peek_elem) + *arg_type = ARG_PTR_TO_MAP_VALUE; + break; default: break; } @@ -5388,6 +5391,11 @@ static int check_map_func_compatibility(struct bpf_verifier_env *env, func_id != BPF_FUNC_task_storage_delete) goto error; break; + case BPF_MAP_TYPE_BLOOM_FILTER: + if (func_id != BPF_FUNC_map_push_elem && + func_id != BPF_FUNC_map_peek_elem) + goto error; + break; default: break; } @@ -5455,13 +5463,18 @@ static int check_map_func_compatibility(struct bpf_verifier_env *env, map->map_type != BPF_MAP_TYPE_SOCKHASH) goto error; break; - case BPF_FUNC_map_peek_elem: case BPF_FUNC_map_pop_elem: - case BPF_FUNC_map_push_elem: if (map->map_type != BPF_MAP_TYPE_QUEUE && map->map_type != BPF_MAP_TYPE_STACK) goto error; break; + case BPF_FUNC_map_push_elem: + case BPF_FUNC_map_peek_elem: + if (map->map_type != BPF_MAP_TYPE_QUEUE && + map->map_type != BPF_MAP_TYPE_STACK && + map->map_type != BPF_MAP_TYPE_BLOOM_FILTER) + goto error; + break; case BPF_FUNC_sk_storage_get: case BPF_FUNC_sk_storage_delete: if (map->map_type != BPF_MAP_TYPE_SK_STORAGE) diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h index 791f31dd0abe..26b814a7d61a 100644 --- a/tools/include/uapi/linux/bpf.h +++ b/tools/include/uapi/linux/bpf.h @@ -906,6 +906,7 @@ enum bpf_map_type { BPF_MAP_TYPE_RINGBUF, BPF_MAP_TYPE_INODE_STORAGE, BPF_MAP_TYPE_TASK_STORAGE, + 
BPF_MAP_TYPE_BLOOM_FILTER, }; /* Note that tracing related programs such as @@ -1274,6 +1275,7 @@ union bpf_attr { * struct stored as the * map value */ + __u32 nr_hashes; /* used for configuring bloom filter maps */ }; struct { /* anonymous struct used by BPF_MAP_*_ELEM commands */ @@ -5594,6 +5596,7 @@ struct bpf_map_info { __u32 btf_id; __u32 btf_key_type_id; __u32 btf_value_type_id; + __u32 nr_hashes; /* used for bloom filter maps */ } __attribute__((aligned(8))); struct bpf_btf_info {
Bloom filters are a space-efficient probabilistic data structure
used to quickly test whether an element exists in a set. In a bloom
filter, false positives are possible whereas false negatives are
not.

This patch adds a bloom filter map for bpf programs. The bloom
filter map supports peek (determining whether an element is present
in the map) and push (adding an element to the map) operations.
These operations are exposed to userspace applications through the
already existing syscalls in the following way:

BPF_MAP_LOOKUP_ELEM -> peek
BPF_MAP_UPDATE_ELEM -> push

The bloom filter map does not have keys, only values. In light of
this, the bloom filter map's API matches that of queue and stack
maps: user applications use BPF_MAP_LOOKUP_ELEM/BPF_MAP_UPDATE_ELEM,
which correspond internally to bpf_map_peek_elem/bpf_map_push_elem,
and bpf programs must use the bpf_map_peek_elem and
bpf_map_push_elem APIs to query or add an element to the bloom
filter map. The bloom filter map must be created with a key_size
of 0.

For updates, the user will pass in the element to add to the map as
the value, with a NULL key. For lookups, the user will pass in the
element to query in the map as the value. In the verifier layer,
this requires us to modify the argument type of a bloom filter's
BPF_FUNC_map_peek_elem call to ARG_PTR_TO_MAP_VALUE; likewise, in
the syscall layer, we need to copy over the user value so that in
bpf_map_peek_elem, we know which specific value to query.

The maximum number of entries in the bloom filter is not enforced:
the user may insert more entries than the max entries they
specified when creating the map, but the bloom filter will then
have a higher false positive rate.

The number of hashes to use for the bloom filter is configurable
from userspace. The benchmarks later in this patchset can help
compare the performance of different numbers of hashes on different
entry sizes.
In general, using more hashes slows down lookups but lowers the
false positive rate of an element being detected in the bloom
filter.

Signed-off-by: Joanne Koong <joannekoong@fb.com>
---
 include/linux/bpf.h            |   3 +-
 include/linux/bpf_types.h      |   1 +
 include/uapi/linux/bpf.h       |   3 +
 kernel/bpf/Makefile            |   2 +-
 kernel/bpf/bloom_filter.c      | 171 +++++++++++++++++++++++++++++++++
 kernel/bpf/syscall.c           |  20 +++-
 kernel/bpf/verifier.c          |  19 +++-
 tools/include/uapi/linux/bpf.h |   3 +
 8 files changed, 214 insertions(+), 8 deletions(-)
 create mode 100644 kernel/bpf/bloom_filter.c
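
For readers who want to experiment with the push/peek semantics and the
sizing rule outside the kernel, the following is a minimal userspace
sketch of the logic described in this patch. It is not the kernel code:
all names (bloom_model, model_hash, model_bit_array_mask, model_init,
model_push, model_peek) are hypothetical, a seeded FNV-1a hash stands in
for the kernel's jhash(), BITS_PER_LONG is assumed to be 64, and the
spinlock that serializes concurrent pushes is left out.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Userspace model of the patch's bloom filter state (hypothetical names). */
struct bloom_model {
	uint32_t nr_hashes;
	uint32_t bit_array_mask;	/* nr_bits - 1; nr_bits is a power of two */
	uint32_t hash_seed;
	uint64_t *bit_array;
};

/* Assumption: seeded FNV-1a as a stand-in for the kernel's jhash(). */
static uint32_t model_hash(const void *data, size_t len, uint32_t seed)
{
	const unsigned char *p = data;
	uint32_t h = 2166136261u ^ seed;
	size_t i;

	for (i = 0; i < len; i++) {
		h ^= p[i];
		h *= 16777619u;
	}
	return h;
}

/*
 * Sizing rule from the patch: max_entries * nr_hashes * 7 / 5 bits
 * (7 / 5 approximates 1 / ln(2)), at least 64 bits, rounded up to a
 * power of two. On u32 overflow, or above 2^31 bits, fall back to a
 * 2^32-bit array (mask U32_MAX). Returns the bit array mask.
 */
static uint32_t model_bit_array_mask(uint32_t max_entries, uint32_t nr_hashes)
{
	uint64_t nr_bits = (uint64_t)max_entries * nr_hashes;
	uint64_t pow2 = 64;

	if (nr_bits > UINT32_MAX)
		return UINT32_MAX;
	nr_bits = nr_bits / 5 * 7;
	if (nr_bits > (1ULL << 31))
		return UINT32_MAX;
	while (pow2 < nr_bits)
		pow2 <<= 1;
	return (uint32_t)(pow2 - 1);
}

static int model_init(struct bloom_model *bf, uint32_t max_entries,
		      uint32_t nr_hashes, uint32_t seed)
{
	if (max_entries == 0 || nr_hashes == 0)
		return -EINVAL;
	bf->nr_hashes = nr_hashes;
	bf->hash_seed = seed;
	bf->bit_array_mask = model_bit_array_mask(max_entries, nr_hashes);
	bf->bit_array = calloc(((size_t)bf->bit_array_mask >> 6) + 1,
			       sizeof(uint64_t));
	return bf->bit_array ? 0 : -ENOMEM;
}

/* push: set one bit per hash function. */
static int model_push(struct bloom_model *bf, const void *value, size_t len)
{
	uint32_t i, bit;

	for (i = 0; i < bf->nr_hashes; i++) {
		bit = model_hash(value, len, bf->hash_seed + i) &
		      bf->bit_array_mask;
		bf->bit_array[bit >> 6] |= 1ULL << (bit & 63);
	}
	return 0;
}

/* peek: 0 if every bit is set (possibly present), -ENOENT otherwise. */
static int model_peek(const struct bloom_model *bf, const void *value,
		      size_t len)
{
	uint32_t i, bit;

	for (i = 0; i < bf->nr_hashes; i++) {
		bit = model_hash(value, len, bf->hash_seed + i) &
		      bf->bit_array_mask;
		if (!(bf->bit_array[bit >> 6] & (1ULL << (bit & 63))))
			return -ENOENT;
	}
	return 0;
}
```

With max_entries = 1000 and nr_hashes = 3, the rule gives 4200 bits,
rounded up to 8192. A peek on a freshly initialized model returns
-ENOENT, and a peek for a value that was pushed always returns 0; a peek
for a value that was never pushed may also return 0, which is the false
positive case.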