Message ID: 20220924133620.4147153-1-houtao@huaweicloud.com (mailing list archive)
Series: Add support for qp-trie with dynptr key
On Sat, Sep 24, 2022 at 09:36:07PM +0800, Hou Tao wrote:
> From: Hou Tao <houtao1@huawei.com>
>
> Hi,
>
> The initial motivation for the qp-trie map is to reduce memory usage for
> string keys, especially those with large differences in length, as
> discussed in [0]. And as a big-endian lexicographically ordered map, it
> can also be used for any binary data with fixed or variable length.
>
> Now the basic functionality of qp-trie is ready, so posting it to get
> more feedback or suggestions about qp-trie. Especially feedback
> about the following questions:
>
> (1) Use cases for qp-trie
> Andrii had proposed to re-implement lpm-trie by using qp-trie. The
> advantage would be a speed-up of lookup operations due to the lower tree
> depth of qp-trie, and the performance of update may also increase.
> But are there any other use cases for qp-trie? Especially those cases
> which need both ordering and memory efficiency, or cases in which qp-trie
> will have high fan-out and its lookup performance will be much better than
> hash-table, as shown below:
>
> Randomly-generated binary data (key size=255, max entries=16K, key length range:[1, 255])
> htab lookup (1 thread)     4.968 ± 0.009M/s (drops 0.002 ± 0.000M/s mem 8.169 MiB)
> htab lookup (2 thread)    10.118 ± 0.010M/s (drops 0.007 ± 0.000M/s mem 8.169 MiB)
> htab lookup (4 thread)    20.084 ± 0.022M/s (drops 0.007 ± 0.000M/s mem 8.168 MiB)
> htab lookup (8 thread)    39.866 ± 0.047M/s (drops 0.010 ± 0.000M/s mem 8.168 MiB)
> htab lookup (16 thread)   79.412 ± 0.065M/s (drops 0.049 ± 0.000M/s mem 8.169 MiB)
>
> qp-trie lookup (1 thread)   10.291 ± 0.007M/s (drops 0.004 ± 0.000M/s mem 4.899 MiB)
> qp-trie lookup (2 thread)   20.797 ± 0.009M/s (drops 0.006 ± 0.000M/s mem 4.879 MiB)
> qp-trie lookup (4 thread)   41.943 ± 0.019M/s (drops 0.015 ± 0.000M/s mem 4.262 MiB)
> qp-trie lookup (8 thread)   81.985 ± 0.032M/s (drops 0.025 ± 0.000M/s mem 4.215 MiB)
> qp-trie lookup (16 thread) 164.681 ± 0.051M/s (drops 0.050 ± 0.000M/s mem 4.261 MiB)
>
> * non-zero drops
is due to duplicate keys among the generated keys.
>
> (2) Improve update/delete performance for qp-trie
> Now the top-5 overheads in update/delete operations are:
>
> 21.23% bench [kernel.vmlinux] [k] qp_trie_update_elem
> 13.98% bench [kernel.vmlinux] [k] qp_trie_delete_elem
>  7.96% bench [kernel.vmlinux] [k] native_queued_spin_lock_slowpath
>  5.16% bench [kernel.vmlinux] [k] memcpy_erms
>  5.00% bench [kernel.vmlinux] [k] __kmalloc_node
>
> The top-2 overheads are due to memory access and atomic ops on
> max_entries. I had tried memory prefetch but it didn't work out, maybe
> I did it wrong. For the subtree spinlock overhead, I also had tried
> hierarchical locking using a hand-over-hand lock scheme, but it didn't
> scale well [1]. I will try to increase the number of subtrees from 256
> to 1024, 4096 or bigger and check whether it makes any difference.
>
> For atomic ops and kmalloc overhead, I think I can reuse the idea from
> the patchset "bpf: BPF specific memory allocator". I have given bpf_mem_alloc
> a simple try and encountered some problems. One problem is the
> immediate reuse of freed objects in the bpf memory allocator. Because qp-trie
> uses the bpf memory allocator to allocate and free qp_trie_branch, if
> qp_trie_branch is reused immediately, the lookup procedure may oops due
> to the incorrect content in qp_trie_branch. And another problem is that the
> size limitation in bpf_mem_alloc() is 4096. It may be a little small for
> the total size of key size and value size, but maybe I can use two
> separate bpf_mem_allocs for key and value.

A 4096 limit for key+value size would be an acceptable trade-off.
With kptrs the user will be able to extend the value to much bigger sizes
while doing <= 4096 allocations at a time. Larger allocations are failing
in production more often than not. Any algorithm relying on a successful
>= 4096 allocation is likely to fail. kvmalloc is a fallback that
the kernel is using, but we're not there yet in bpf land.
The benefits of bpf_mem_alloc in qp-trie would be huge though.
qp-trie would work in all contexts including sleepable progs.
As presented the use cases for qp-trie are quite limited.

If I understand correctly the concern for not using bpf_mem_alloc
is that qp_trie_branch can be reused. Can you provide an exact scenario
that will cause issues?

Instead of call_rcu in qp_trie_branch_free (which will work only for
regular progs and have high overhead as demonstrated by the mem_alloc patches)
the qp-trie freeing logic can scrub that element, so it's ready to be
reused as another struct qp_trie_branch.

I guess I'm missing how RCU protects these internal data structures of qp-trie.
The rcu_read_lock of a regular bpf prog helps to stay lock-less during lookup?
Is that it?
So to make qp-trie work in sleepable progs the algo would need to
be changed to do both call_rcu and call_rcu_tasks_trace everywhere
to protect these inner structs?
call_rcu_tasks_trace can take a long time. So qp_trie_branch-s may linger
around. So quick update/delete (in sleepable with call_rcu_tasks_trace)
may very well exhaust memory. With bpf_mem_alloc we don't have this issue
since the rcu_tasks_trace gp is observed only when freeing into the global mem pool.

Say qp-trie just uses bpf_mem_alloc for qp_trie_branch.
What is the worst that can happen? qp_trie_lookup_elem will go down a wrong
path, but won't crash, right? Can we do the hlist_nulls trick to address that?
In other words bpf_mem_alloc reuse behavior is pretty much SLAB_TYPESAFE_BY_RCU.
Many kernel data structures know how to deal with such object reuse.
We can have a private bpf_mem_alloc here for qp_trie_branch-s only and
construct the logic in a way that obj reuse is not problematic.

Another alternative would be to add explicit rcu_read_lock in qp_trie_lookup_elem
to protect qp_trie_branch during lookup while using bpf_mem_alloc
for both qp_trie_branch and leaf nodes, but that's not a great solution either.
It will allow qp-trie to be usable in sleepable progs, but the use of call_rcu
in update/delete will prevent qp-trie from being usable in tracing progs.

Let's try to brainstorm how to make qp_trie_branch work like SLAB_TYPESAFE_BY_RCU.

Other than this issue the patches look great. This new map would be an awesome addition.
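For reference, the hlist_nulls trick mentioned above can be illustrated with a tiny userspace sketch (hypothetical, simplified types — this shows only the idea behind the kernel's hlist_nulls lists, not their actual API): every bucket's list is terminated by a "nulls" marker that encodes the bucket number, so a lockless reader that wandered into a freed-and-reused node detects that the terminator names a foreign bucket and restarts.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-in for an hlist_nulls node: the low bit of `next`
 * marks a "nulls" terminator whose upper bits encode the bucket id. */
struct node {
	struct node *next;
	int key;
};

#define NULLS(bucket)   ((struct node *)((((uintptr_t)(bucket)) << 1) | 1))
#define IS_NULLS(p)     (((uintptr_t)(p)) & 1)
#define NULLS_BUCKET(p) (((uintptr_t)(p)) >> 1)

/* Walk one bucket's list looking for key.
 * Returns 1 when found, 0 on a definite miss, and -1 when the walk ended
 * on a terminator of a *different* bucket, i.e. a node was freed and
 * reused elsewhere mid-walk and the caller must restart. */
static int walk(struct node *head, unsigned int bucket, int key)
{
	struct node *n = head;

	while (!IS_NULLS(n)) {
		if (n->key == key)
			return 1;
		n = n->next;
	}
	return NULLS_BUCKET(n) == bucket ? 0 : -1;
}
```

The point is that object reuse never has to be prevented; it only has to be detectable, which is what makes the pattern compatible with SLAB_TYPESAFE_BY_RCU-style allocators.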
Hi,

On 9/26/2022 9:25 AM, Alexei Starovoitov wrote:
> On Sat, Sep 24, 2022 at 09:36:07PM +0800, Hou Tao wrote:
>> From: Hou Tao <houtao1@huawei.com>
SNIP
> If I understand correctly the concern for not using bpf_mem_alloc
> is that qp_trie_branch can be reused. Can you provide an exact scenario
> that will cause issues?
The usage of a branch node during lookup is as follows:
(1) check the index field of the branch node, which records the position of
    the nibble in which the keys of the child nodes differ
(2) calculate the index of the child node by using the nibble value of the
    lookup key at the index position
(3) get the pointer of the child node by dereferencing the variable-length
    pointer array in the branch node

Because both branch nodes and leaf nodes have variable length, I used one
bpf_mem_alloc for these two node types, so if a leaf node is reused as a
branch node, the pointer obtained in step (3) may be invalid.

If using separate bpf_mem_allocs for branch nodes and leaf nodes, it may still
be problematic, because the updates to a reused branch node are not atomic and
branch nodes with different numbers of child nodes will reuse the same object
due to size alignment in the allocator, so the lookup procedure below may get
an uninitialized pointer from the pointer array:

    lookup procedure                   update procedure

                                       // three child nodes, 48-bytes
                                       branch node x
                                       // four child nodes, 56-bytes
                                       reuse branch node x
                                       x->bitmap = 0xf
    // got an uninitialized pointer
    x->nodes[3]
                                       initialize x->nodes[0~3]

The problem may be solved by zeroing the unused or whole part of the allocated
object. Maybe adding a paired smp_wmb() and smp_rmb() to ensure the update of
the node array happens before the update of the bitmap is also OK, and the
cost will be much cheaper on x86 hosts.

Beside the lookup procedure, get_next_key() from syscall also looks up the
trie locklessly. If a branch node is reused, the order of returned keys may be
broken. There is also a parent pointer in the branch node which is used for
reverse lookup during get_next_key; the reuse may lead to unexpected skips in
the iteration.

> Instead of call_rcu in qp_trie_branch_free (which will work only for
> regular progs and have high overhead as demonstrated by mem_alloc patches)
> the qp-trie freeing logic can scrub that element, so it's ready to be
> reused as another struct qp_trie_branch.
> I guess I'm missing how rcu protects this internal data structures of qp-trie.
> The rcu_read_lock of regular bpf prog helps to stay lock-less during lookup?
> Is that it?
Yes. The update is made atomic by copying the parent branch node to a new
branch node and replacing the pointer to the parent branch node with the
pointer to the new branch node, so the lookup procedure finds either the old
branch node or the new branch node.

> So to make qp-trie work in sleepable progs the algo would need to
> be changed to do both call_rcu and call_rcu_task_trace everywhere
> to protect these inner structs?
> call_rcu_task_trace can take long time. So qp_trie_branch-s may linger
> around. So quick update/delete (in sleepable with call_rcu_task_trace)
> may very well exhaust memory. With bpf_mem_alloc we don't have this issue
> since rcu_task_trace gp is observed only when freeing into global mem pool.
> Say qp-trie just uses bpf_mem_alloc for qp_trie_branch.
> What is the worst that can happen? qp_trie_lookup_elem will go into wrong
> path, but won't crash, right? Can we do hlist_nulls trick to address that?
> In other words bpf_mem_alloc reuse behavior is pretty much SLAB_TYPESAFE_BY_RCU.
> Many kernel data structures know how to deal with such object reuse.
> We can have a private bpf_mem_alloc here for qp_trie_branch-s only and
> construct a logic in a way that obj reuse is not problematic.
As said above, qp_trie_lookup_elem may be OK with SLAB_TYPESAFE_BY_RCU. But I
don't know how to do it for get_next_key, because the iteration result needs
to be ordered and cannot skip elements which existed before the iteration
began.

If we remove immediate reuse from bpf_mem_alloc, besides the possible decrease
in performance, is there any reason we cannot do that?

> Another alternative would be to add explicit rcu_read_lock in qp_trie_lookup_elem
> to protect qp_trie_branch during lookup while using bpf_mem_alloc
> for both qp_trie_branch and leaf nodes, but that's not a great solution either.
> It will allow qp-trie to be usable in sleepable, but use of call_rcu > in update/delete will prevent qp-trie to be usable in tracing progs. > > Let's try to brainstorm how to make qp_trie_branch work like SLAB_TYPESAFE_BY_RCU. > > Other than this issue the patches look great. This new map would be awesome addition. Thanks for that.
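The three lookup steps described in the reply above can be condensed into a small userspace sketch (all names here — struct branch, nib_at, the leaf handling — are illustrative simplifications, not the patchset's code): a branch stores the discriminating nibble position plus a bitmap of present child nibbles, and the child slot is the popcount of the lower bitmap bits.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Simplified qp-trie branch: pos is the nibble position where the children's
 * keys differ, bitmap has one bit per present child nibble value, and the
 * children live in a dense array indexed by popcount. */
struct branch {
	unsigned int pos;        /* discriminating nibble position */
	unsigned int bitmap;     /* which nibble values have a child */
	void *nodes[16];         /* child pointers: branch or leaf (char *) */
	bool is_branch[16];      /* per-slot leaf/branch flag, for this sketch */
};

/* nibble of key at position pos (high nibble first); 0 past the end */
static unsigned int nib_at(const char *key, size_t len, unsigned int pos)
{
	size_t byte = pos / 2;

	if (byte >= len)
		return 0;
	return (pos & 1) ? ((unsigned char)key[byte] & 0xf)
			 : ((unsigned char)key[byte] >> 4);
}

static const char *trie_lookup(const struct branch *root, const char *key)
{
	size_t len = strlen(key);
	const void *node = root;
	bool is_br = true;

	while (is_br) {
		const struct branch *b = node;
		/* (1)+(2): nibble at the branch's index -> bitmap bit */
		unsigned int bit = 1u << nib_at(key, len, b->pos);
		unsigned int slot;

		if (!(bit & b->bitmap))
			return NULL;          /* no child for this nibble */
		/* (3): dense child slot = popcount of lower bits */
		slot = (unsigned int)__builtin_popcount(b->bitmap & (bit - 1));
		is_br = b->is_branch[slot];
		node = b->nodes[slot];
	}
	/* leaves store the whole key: nibbles skipped on the way down were
	 * never compared, so confirm before declaring a match */
	return strcmp((const char *)node, key) == 0 ? (const char *)node : NULL;
}
```

The final strcmp() is essential to the structure: the descent only inspects the discriminating nibbles, so an absent key can reach a leaf that merely shares those nibbles.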
On Mon, Sep 26, 2022 at 09:18:46PM +0800, Hou Tao wrote:
> Hi,
>
> On 9/26/2022 9:25 AM, Alexei Starovoitov wrote:
SNIP
> The usage of branch node during lookup is as follows:
SNIP
> so the lookup procedure below may get an uninitialized
> pointer in the pointer array

Looking at lookup:

+	while (is_branch_node(node)) {
+		struct qp_trie_branch *br = node;
+		unsigned int bitmap;
+		unsigned int iip;
+
+		/* When byte index equals with key len, the target key
+		 * may be in twigs->nodes[0].
+		 */
+		if (index_to_byte_index(br->index) > data_len)
+			goto done;
+
+		bitmap = calc_br_bitmap(br->index, data, data_len);
+		if (!(bitmap & br->bitmap))
+			goto done;
+
+		iip = calc_twig_index(br->bitmap, bitmap);
+		node = rcu_dereference_check(br->nodes[iip], rcu_read_lock_bh_held());
+	}

To be safe the br->index needs to be initialized after br->nodes and br->bitmap.
While deleting, the br->index can be set to a special value which would mean:
restart the lookup from the beginning.
As you're suggesting, with smp_rmb/wmb pairs the lookup will only see a valid br.
Also the race is extremely tight, right?
After br->nodes[iip] + is_branch_node, that memory needs to be deleted on
another cpu after spin_lock and reused in update after another spin_lock.
Without an artificial big delay it's hard to imagine how the nodes[iip] pointer
would be initialized to some other qp_trie_branch or leaf during delete, then
the memory reused and nodes[iip] initialized again with the same address.
Theoretically possible, but unlikely, right?
And with correct ordering of scrubbing and updates to
br->nodes, br->bitmap, br->index it can be made safe.

We can add a sequence number to qp_trie_branch as well and read it before and after.
Every reuse would inc the seq.
If the seq number differs, re-read the node pointer from the parent.

> The problem may can be solved by zeroing the unused or whole part of allocated
> object. Maybe adding a paired smp_wmb() and smp_rmb() to ensure the update of
> node array happens before the update of bitmap is also OK and the cost will be
> much cheaper in x86 host.

Something like this, right.
We can also consider doing lookup under spin_lock. For a large branchy trie
the cost of spin_lock may be negligible.

> Beside lookup procedure, get_next_key() from syscall also lookups trie
> locklessly. If the branch node is reused, the order of returned keys may be
> broken. There is also a parent pointer in branch node and it is used for reverse
> lookup during get_next_key, the reuse may lead to unexpected skip in iteration.

qp_trie_lookup_next_node can be done under spin_lock.
Iterating all map elements is a slow operation anyway.

> > Instead of call_rcu in qp_trie_branch_free (which will work only for
> > regular progs and have high overhead as demonstrated by mem_alloc patches)
> > the qp-trie freeing logic can scrub that element, so it's ready to be
> > reused as another struct qp_trie_branch.
> > I guess I'm missing how rcu protects this internal data structures of qp-trie.
> > The rcu_read_lock of regular bpf prog helps to stay lock-less during lookup? > > Is that it? > Yes. The update is made atomic by copying the parent branch node to a new branch > node and replacing the pointer to the parent branch node by the new branch node, > so the lookup procedure either find the old branch node or the new branch node. > > So to make qp-trie work in sleepable progs the algo would need to > > be changed to do both call_rcu and call_rcu_task_trace everywhere > > to protect these inner structs? > > call_rcu_task_trace can take long time. So qp_trie_branch-s may linger > > around. So quick update/delete (in sleepable with call_rcu_task_trace) > > may very well exhaust memory. With bpf_mem_alloc we don't have this issue > > since rcu_task_trace gp is observed only when freeing into global mem pool. > > Say qp-trie just uses bpf_mem_alloc for qp_trie_branch. > > What is the worst that can happen? qp_trie_lookup_elem will go into wrong > > path, but won't crash, right? Can we do hlist_nulls trick to address that? > > In other words bpf_mem_alloc reuse behavior is pretty much SLAB_TYPESAFE_BY_RCU. > > Many kernel data structures know how to deal with such object reuse. > > We can have a private bpf_mem_alloc here for qp_trie_branch-s only and > > construct a logic in a way that obj reuse is not problematic. > As said above, qp_trie_lookup_elem may be OK with SLAB_TYPESAFE_BY_RCU. But I > don't know how to do it for get_next_key because the iteration result needs to > be ordered and can not skip existed elements before the iterations begins. imo it's fine to spin_lock in get_next_key. We should measure the lock overhead in lookup. It might be acceptable too. > If removing immediate reuse from bpf_mem_alloc, beside the may-decreased > performance, is there any reason we can not do that ? What do you mean? Always do call_rcu + call_rcu_tasks_trace for every bpf_mem_free ? As I said above: " call_rcu_task_trace can take long time. 
So qp_trie_branch-s may linger around. So quick update/delete (in sleepable
with call_rcu_tasks_trace) may very well exhaust memory. "
As an exercise try samples/bpf/map_perf_test on a non-prealloc hashmap before
the mem_alloc conversion. Just regular call_rcu consumes 100% of all cpus.
With call_rcu_tasks_trace it's worse. It cannot sustain such a flood.
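The sequence-number scheme floated in this exchange (every reuse bumps a per-branch counter; a lockless reader samples it before and after dereferencing the child pointer and restarts from the parent on a change) could look roughly like this as a userspace C11 sketch — field and helper names are made up, and acquire/release atomics stand in for the kernel's barriers:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical branch with a reuse sequence counter.  `child` stands in
 * for the real nodes[] array to keep the sketch small. */
struct qbranch {
	_Atomic unsigned int seq;    /* bumped on every free-for-reuse */
	struct qbranch *child;
};

static unsigned int reuse_seq_begin(const struct qbranch *br)
{
	/* acquire pairs with the release in reuse_scrub() */
	return atomic_load_explicit(&br->seq, memory_order_acquire);
}

static int reuse_seq_retry(const struct qbranch *br, unsigned int start)
{
	atomic_thread_fence(memory_order_acquire);
	return atomic_load_explicit(&br->seq, memory_order_relaxed) != start;
}

/* updater side: scrub the node and mark it reused before handing it back */
static void reuse_scrub(struct qbranch *br)
{
	atomic_fetch_add_explicit(&br->seq, 1, memory_order_release);
	br->child = NULL;
}

/* reader side: NULL means the branch was reused under us and the caller
 * should re-read the node pointer from the parent, as suggested above */
static struct qbranch *read_child(struct qbranch *br)
{
	unsigned int s = reuse_seq_begin(br);
	struct qbranch *c = br->child;

	if (reuse_seq_retry(br, s))
		return NULL;
	return c;
}
```

A real implementation would loop bounded times and then fall back to the subtree lock, addressing the lookup-starvation concern raised in the reply.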
Hi,

On 9/27/2022 9:19 AM, Alexei Starovoitov wrote:
> On Mon, Sep 26, 2022 at 09:18:46PM +0800, Hou Tao wrote:
SNIP
> To be safe the br->index needs to be initialized after br->nodes and br->bitmap.
> While deleting the br->index can be set to special value which would mean
> restart the lookup from the beginning.
> As you're suggesting with smp_rmb/wmb pairs the lookup will only see valid br.
> Also the race is extremely tight, right?
> After br->nodes[iip] + is_branch_node that memory needs to be deleted on
> another cpu after spin_lock and reused in update after another spin_lock.
> Without an artificial big delay it's hard to imagine how the nodes[iip] pointer
> would be initialized to some other qp_trie_branch or leaf during delete,
> then the memory reused and nodes[iip] initialized again with the same address.
> Theoretically possible, but unlikely, right?
> And with correct ordering of scrubbing and updates to
> br->nodes, br->bitmap, br->index it can be made safe.
The reuse of a node not only introduces a safety problem (e.g. accessing an
invalid pointer), but also incurs a false-negative problem (e.g. not being
able to find an existing element), as shown below:

    lookup A in X on CPU 1             update X on CPU 2

    [ branch X v1 ]
    leaf A | leaf B | leaf C
                                       [ branch X v2 ]
                                       leaf A | leaf B | leaf C | leaf D

                                       // free and reuse branch X v1
                                       [ branch X v1 ]
                                       leaf O | leaf P | leaf Q
    // leaf A can not be found

> We can add a sequence number to qp_trie_branch as well and read it before and after.
> Every reuse would inc the seq.
> If seq number differs, re-read the node pointer from parent.
A seq number on qp_trie_branch is a good idea. Will try it. But we also need
to consider the starvation of lookups by updates/deletions. Maybe we need to
fall back to the subtree spinlock after some number of rereads.

>> The problem may can be solved by zeroing the unused or whole part of allocated
>> object. Maybe adding a paired smp_wmb() and smp_rmb() to ensure the update of
>> node array happens before the update of bitmap is also OK and the cost will be
>> much cheaper in x86 host.
> Something like this, right.
> We can also consider doing lookup under spin_lock. For a large branchy trie
> the cost of spin_lock maybe negligible.
Do you mean adding an extra spinlock to qp_trie_branch to protect against
reuse, or taking the subtree spinlock during lookup? IMO the latter will make
the lookup performance suffer, but I will check it as well.

>> Beside lookup procedure, get_next_key() from syscall also lookups trie
>> locklessly. If the branch node is reused, the order of returned keys may be
>> broken. There is also a parent pointer in branch node and it is used for reverse
>> lookup during get_next_key, the reuse may lead to unexpected skip in iteration.
> qp_trie_lookup_next_node can be done under spin_lock.
> Iterating all map elements is a slow operation anyway.
OK. Taking the subtree spinlock is simpler, but the scalability will be bad.
Not sure whether or not the solution for lockless lookup will work for
get_next_key. Will check.

SNIP
>> As said above, qp_trie_lookup_elem may be OK with SLAB_TYPESAFE_BY_RCU. But I
>> don't know how to do it for get_next_key because the iteration result needs to
>> be ordered and can not skip existed elements before the iterations begins.
> imo it's fine to spin_lock in get_next_key.
> We should measure the lock overhead in lookup. It might be acceptable too.
Will check that.

>> If removing immediate reuse from bpf_mem_alloc, beside the may-decreased
>> performance, is there any reason we can not do that ?
> What do you mean?
> Always do call_rcu + call_rcu_tasks_trace for every bpf_mem_free ?
Yes. Does doing call_rcu() + call_rcu_tasks_trace() in batches help, just as
free_bulk does?

> As I said above:
> " call_rcu_task_trace can take long time. So qp_trie_branch-s may linger
> around. So quick update/delete (in sleepable with call_rcu_task_trace)
> may very well exhaust memory.
> "
> As an exercise try samples/bpf/map_perf_test on non-prealloc hashmap
> before mem_alloc conversion. Just regular call_rcu consumes 100% of all cpus.
> With call_rcu_tasks_trace it's worse. It cannot sustain such flood.
Will check the result of map_perf_test. But it seems bpf_mem_alloc may still
exhaust memory if __free_rcu_tasks_trace() cannot be called timely. Will take
a close look at that.
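The ordering fix both sides converge on above (fully initialize the nodes[] array, then publish the bitmap; readers read the bitmap before touching nodes[]) can be mimicked in userspace with a release store standing in for smp_wmb() and an acquire load for smp_rmb(). Structure and function names below are illustrative, not the patchset's:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical branch: the bitmap is the "publish" word, so it must be
 * written last; slots in nodes[] are dense, indexed by popcount. */
struct xbranch {
	_Atomic unsigned int bitmap;   /* published last */
	void *nodes[16];               /* initialized before the bitmap */
};

static void branch_publish(struct xbranch *br, void **children,
			   unsigned int bitmap)
{
	unsigned int n = (unsigned int)__builtin_popcount(bitmap);

	for (unsigned int i = 0; i < n; i++)
		br->nodes[i] = children[i];
	/* release: all nodes[] stores above become visible before the new
	 * bitmap does — the userspace analogue of smp_wmb() */
	atomic_store_explicit(&br->bitmap, bitmap, memory_order_release);
}

static void *branch_lookup(struct xbranch *br, unsigned int bit)
{
	/* acquire pairs with the release store in branch_publish(), so any
	 * slot admitted by the bitmap is fully initialized — the analogue
	 * of smp_rmb() */
	unsigned int bitmap = atomic_load_explicit(&br->bitmap,
						   memory_order_acquire);

	if (!(bitmap & bit))
		return NULL;
	return br->nodes[__builtin_popcount(bitmap & (bit - 1))];
}
```

With this ordering the uninitialized-pointer interleaving from the earlier diagram cannot happen: a reader that observes the new bitmap is guaranteed to observe the matching nodes[] contents.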
On Mon, Sep 26, 2022 at 8:08 PM Hou Tao <houtao@huaweicloud.com> wrote:
> Hi,
>
> On 9/27/2022 9:19 AM, Alexei Starovoitov wrote:
SNIP
> > While deleting the br->index can be set to a special value which would mean
> > restart the lookup from the beginning.
> > As you're suggesting, with smp_rmb/wmb pairs the lookup will only see a valid br.
> > Also the race is extremely tight, right?
> > After br->nodes[iip] + is_branch_node that memory needs to be deleted on another cpu
> > after spin_lock and reused in an update after another spin_lock.
> > Without an artificial big delay it's hard to imagine how the nodes[iip] pointer
> > would be initialized to some other qp_trie_branch or leaf during delete,
> > then the memory reused and nodes[iip] initialized again with the same address.
> > Theoretically possible, but unlikely, right?
> > And with correct ordering of scrubbing and updates to
> > br->nodes, br->bitmap, br->index it can be made safe.
> The reuse of a node not only introduces a safety problem (e.g. access of an invalid
> pointer), but also incurs a false-negative problem (e.g. an existent element can not
> be found) as shown below:
>
>   lookup A in X on CPU1              update X on CPU 2
>
>   [ branch X v1 ]
>   leaf A | leaf B | leaf C
>                                      [ branch X v2 ]
>                                      leaf A | leaf B | leaf C | leaf D
>
>                                      // free and reuse branch X v1
>                                      [ branch X v1 ]
>                                      leaf O | leaf P | leaf Q
>   // leaf A can not be found

Right. That's why I suggested to consider the hlist_nulls-like approach
that htab is using.

> > We can add a sequence number to qp_trie_branch as well and read it before and after.
> > Every reuse would inc the seq.
> > If the seq number differs, re-read the node pointer from the parent.
> A seq number on qp_trie_branch is a good idea. Will try it. But we also need to
> consider the starvation of lookup by update/deletion. Maybe we need a fallback to the
> subtree spinlock after some rereads.

I think the fallback is overkill. The race is extremely unlikely.

> >> The problem may be solved by zeroing the unused or whole part of the allocated
> >> object. Maybe adding a paired smp_wmb() and smp_rmb() to ensure the update of
> >> the node array happens before the update of the bitmap is also OK, and the cost will
> >> be much cheaper on x86 hosts.
> > Something like this, right.
> > We can also consider doing lookup under spin_lock. For a large branchy trie
> > the cost of spin_lock may be negligible.
> Do you mean adding an extra spinlock to qp_trie_branch to protect against reuse,
> or taking the subtree spinlock during lookup? IMO the latter will make the
> lookup performance suffer, but I will check it as well.

subtree lock. lookup perf will suffer a bit.
The numbers will tell the true story.

> >> Beside the lookup procedure, get_next_key() from syscall also looks up the trie
> >> locklessly. If a branch node is reused, the order of returned keys may be
> >> broken. There is also a parent pointer in the branch node, used for reverse
> >> lookup during get_next_key; the reuse may lead to unexpected skips in iteration.
> > qp_trie_lookup_next_node can be done under spin_lock.
> > Iterating all map elements is a slow operation anyway.
> OK. Taking the subtree spinlock is simpler but the scalability will be bad. Not sure
> whether or not the solution for lockless lookup will work for get_next_key. Will
> check.

What kind of scalability are you concerned about?
get_next is done by user space only. Plenty of overhead already.

> >>> Instead of call_rcu in qp_trie_branch_free (which will work only for
> >>> regular progs and have high overhead as demonstrated by the mem_alloc patches)
> >>> the qp-trie freeing logic can scrub that element, so it's ready to be
> >>> reused as another struct qp_trie_branch.
> >>> I guess I'm missing how rcu protects these internal data structures of qp-trie.
> >>> The rcu_read_lock of regular bpf progs helps to stay lock-less during lookup?
> >>> Is that it?
> >> Yes.
The update is made atomic by copying the parent branch node to a new branch > >> node and replacing the pointer to the parent branch node by the new branch node, > >> so the lookup procedure either find the old branch node or the new branch node. > >>> So to make qp-trie work in sleepable progs the algo would need to > >>> be changed to do both call_rcu and call_rcu_task_trace everywhere > >>> to protect these inner structs? > >>> call_rcu_task_trace can take long time. So qp_trie_branch-s may linger > >>> around. So quick update/delete (in sleepable with call_rcu_task_trace) > >>> may very well exhaust memory. With bpf_mem_alloc we don't have this issue > >>> since rcu_task_trace gp is observed only when freeing into global mem pool. > >>> Say qp-trie just uses bpf_mem_alloc for qp_trie_branch. > >>> What is the worst that can happen? qp_trie_lookup_elem will go into wrong > >>> path, but won't crash, right? Can we do hlist_nulls trick to address that? > >>> In other words bpf_mem_alloc reuse behavior is pretty much SLAB_TYPESAFE_BY_RCU. > >>> Many kernel data structures know how to deal with such object reuse. > >>> We can have a private bpf_mem_alloc here for qp_trie_branch-s only and > >>> construct a logic in a way that obj reuse is not problematic. > >> As said above, qp_trie_lookup_elem may be OK with SLAB_TYPESAFE_BY_RCU. But I > >> don't know how to do it for get_next_key because the iteration result needs to > >> be ordered and can not skip existed elements before the iterations begins. > > imo it's fine to spin_lock in get_next_key. > > We should measure the lock overhead in lookup. It might be acceptable too. > Will check that. > > > >> If removing immediate reuse from bpf_mem_alloc, beside the may-decreased > >> performance, is there any reason we can not do that ? > > What do you mean? > > Always do call_rcu + call_rcu_tasks_trace for every bpf_mem_free ? > Yes. Does doing call_rcu() + call_rcu_task_trace in batch help just like > free_bulk does ? 
> > As I said above: > > " call_rcu_task_trace can take long time. So qp_trie_branch-s may linger > > around. So quick update/delete (in sleepable with call_rcu_task_trace) > > may very well exhaust memory. > > " > > As an exercise try samples/bpf/map_perf_test on non-prealloc hashmap > > before mem_alloc conversion. Just regular call_rcu consumes 100% of all cpus. > > With call_rcu_tasks_trace it's worse. It cannot sustain such flood. > > . > Will check the result of map_perf_test. But it seems bpf_mem_alloc may still > exhaust memory if __free_rcu_tasks_trace() can not called timely, Will take a > close lookup on that. In theory. yes. The batching makes a big difference.
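The batching referred to here amortizes one RCU callback over many freed objects instead of issuing one call_rcu per free. A toy sketch of the idea (names invented; the deferred grace-period wait is simulated by an immediate bulk free plus a counter, so only the amortization is shown):

```c
#include <assert.h>
#include <stdlib.h>

#define BATCH_SIZE 8

/* Collect freed objects and queue a single (simulated) RCU callback for
 * the whole batch, rather than one callback per object. */
struct free_batch {
	void *objs[BATCH_SIZE];
	int cnt;
	int callbacks;	/* how many (simulated) call_rcu invocations happened */
};

/* Stand-in for the batched RCU callback: real code would defer this free
 * until a grace period elapses; here we free immediately and count. */
static void flush_batch(struct free_batch *b)
{
	for (int i = 0; i < b->cnt; i++)
		free(b->objs[i]);
	b->cnt = 0;
	b->callbacks++;	/* one callback covers the whole batch */
}

static void batched_free(struct free_batch *b, void *obj)
{
	b->objs[b->cnt++] = obj;
	if (b->cnt == BATCH_SIZE)
		flush_batch(b);
}
```

With a batch of 8, a flood of frees produces one-eighth as many callbacks, which is why the flood that overwhelms per-object call_rcu becomes sustainable.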
Hi, On 9/27/2022 11:18 AM, Alexei Starovoitov wrote: > On Mon, Sep 26, 2022 at 8:08 PM Hou Tao <houtao@huaweicloud.com> wrote: SNIP >>>>>> For atomic ops and kmalloc overhead, I think I can reuse the idea from >>>>>> patchset "bpf: BPF specific memory allocator". I have given bpf_mem_alloc >>>>>> a simple try and encounter some problems. One problem is that >>>>>> immediate reuse of freed object in bpf memory allocator. Because qp-trie >>>>>> uses bpf memory allocator to allocate and free qp_trie_branch, if >>>>>> qp_trie_branch is reused immediately, the lookup procedure may oops due >>>>>> to the incorrect content in qp_trie_branch. And another problem is the >>>>>> size limitation in bpf_mem_alloc() is 4096. It may be a little small for >>>>>> the total size of key size and value size, but maybe I can use two >>>>>> separated bpf_mem_alloc for key and value. >>>>> 4096 limit for key+value size would be an acceptable trade-off. >>>>> With kptrs the user will be able to extend value to much bigger sizes >>>>> while doing <= 4096 allocation at a time. Larger allocations are failing >>>>> in production more often than not. Any algorithm relying on successful >>>>> >= 4096 allocation is likely to fail. kvmalloc is a fallback that >>>>> the kernel is using, but we're not there yet in bpf land. >>>>> The benefits of bpf_mem_alloc in qp-trie would be huge though. >>>>> qp-trie would work in all contexts including sleepable progs. >>>>> As presented the use cases for qp-trie are quite limited. >>>>> If I understand correctly the concern for not using bpf_mem_alloc >>>>> is that qp_trie_branch can be reused. Can you provide an exact scenario >>>>> that will casue issuses? SNIP >>>> Looking at lookup: >>>> + while (is_branch_node(node)) { >>>> + struct qp_trie_branch *br = node; >>>> + unsigned int bitmap; >>>> + unsigned int iip; >>>> + >>>> + /* When byte index equals with key len, the target key >>>> + * may be in twigs->nodes[0]. 
>>>> + */ >>>> + if (index_to_byte_index(br->index) > data_len) >>>> + goto done; >>>> + >>>> + bitmap = calc_br_bitmap(br->index, data, data_len); >>>> + if (!(bitmap & br->bitmap)) >>>> + goto done; >>>> + >>>> + iip = calc_twig_index(br->bitmap, bitmap); >>>> + node = rcu_dereference_check(br->nodes[iip], rcu_read_lock_bh_held()); >>>> + } >>>> >>>> To be safe the br->index needs to be initialized after br->nodex and br->bitmap. >>>> While deleting the br->index can be set to special value which would mean >>>> restart the lookup from the beginning. >>>> As you're suggesting with smp_rmb/wmb pairs the lookup will only see valid br. >>>> Also the race is extremely tight, right? >>>> After brb->nodes[iip] + is_branch_node that memory needs to deleted on other cpu >>>> after spin_lock and reused in update after another spin_lock. >>>> Without artifical big delay it's hard to imagine how nodes[iip] pointer >>>> would be initialized to some other qp_trie_branch or leaf during delete, >>>> then memory reused and nodes[iip] is initialized again with the same address. >>>> Theoretically possible, but unlikely, right? >>>> And with correct ordering of scrubbing and updates to >>>> br->nodes, br->bitmap, br->index it can be made safe. >> The reuse of node not only introduces the safety problem (e.g. access an invalid >> pointer), but also incur the false negative problem (e.g. can not find an >> existent element) as show below: >> >> lookup A in X on CPU1 update X on CPU 2 >> >> [ branch X v1 ] >> leaf A | leaf B | leaf C >> [ branch X v2 ] >> leaf A | leaf B | leaf C | leaf D >> >> // free and reuse branch X v1 >> [ branch X v1 ] >> leaf O | leaf P | leaf Q >> // leaf A can not be found > Right. That's why I suggested to consider hlist_nulls-like approach > that htab is using. > >>> We can add a sequence number to qp_trie_branch as well and read it before and after. >>> Every reuse would inc the seq. >>> If seq number differs, re-read the node pointer form parent. 
>> A seq number on qp_trie_branch is a good idea. Will try it. But we also need to >> consider the starvation of lookup by update/deletion. Maybe need fallback to the >> subtree spinlock after some reread. > I think the fallback is an overkill. The race is extremely unlikely. OK. Will add a test on tiny qp-trie to ensure it is OK. >>>> The problem may can be solved by zeroing the unused or whole part of allocated >>>> object. Maybe adding a paired smp_wmb() and smp_rmb() to ensure the update of >>>> node array happens before the update of bitmap is also OK and the cost will be >>>> much cheaper in x86 host. >>> Something like this, right. >>> We can also consider doing lookup under spin_lock. For a large branchy trie >>> the cost of spin_lock maybe negligible. >> Do you meaning adding an extra spinlock to qp_trie_branch to protect again reuse >> or taking the subtree spinlock during lookup ? IMO the latter will make the >> lookup performance suffer, but I will check it as well. > subtree lock. lookup perf will suffer a bit. > The numbers will tell the true story. 
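The sequence-number scheme discussed here (bump a per-node seq on every reuse; a lookup that sees the seq change re-reads the child pointer from the parent) is essentially a seqlock read side. A self-contained sketch with invented names — not the qp-trie implementation:

```c
#include <assert.h>
#include <stdatomic.h>

/* Every reuse of a node bumps seq: odd while the rewrite is in flight,
 * even once it is complete. */
struct node {
	_Atomic unsigned int seq;
	int payload;
};

/* Reader: snapshot seq, read the payload, re-check seq. Any change means
 * the node was recycled underneath us and the caller must retry (in the
 * qp-trie case, by re-reading the node pointer from the parent). */
static int read_payload(struct node *n, int *out)
{
	unsigned int s1 = atomic_load_explicit(&n->seq, memory_order_acquire);

	if (s1 & 1)
		return -1;			/* reuse in progress: retry */
	int v = n->payload;
	atomic_thread_fence(memory_order_acquire);	/* read-side barrier */
	unsigned int s2 = atomic_load_explicit(&n->seq, memory_order_relaxed);

	if (s1 != s2)
		return -1;			/* reused mid-read: retry */
	*out = v;
	return 0;
}

/* Reuse path: odd seq before the rewrite, even seq after, so a concurrent
 * reader either sees "in progress" or a changed sequence. */
static void reuse_node(struct node *n, int new_payload)
{
	atomic_fetch_add_explicit(&n->seq, 1, memory_order_relaxed);	/* -> odd */
	atomic_thread_fence(memory_order_release);
	n->payload = new_payload;
	atomic_fetch_add_explicit(&n->seq, 1, memory_order_release);	/* -> even */
}
```

The starvation concern above corresponds to a reader retrying forever while writers keep bumping seq; the retry loop is why a bound (or a lock fallback) was considered at all.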
A quick benchmark shows the performance is bad when using the subtree lock for lookup:

Randomly-generated binary data (key size=255, max entries=16K, key length range:[1, 255])
* no lock
qp-trie lookup (1  thread)   10.250 ± 0.009M/s (drops 0.006 ± 0.000M/s mem 0.000 MiB)
qp-trie lookup (2  thread)   20.466 ± 0.009M/s (drops 0.010 ± 0.000M/s mem 0.000 MiB)
qp-trie lookup (4  thread)   41.211 ± 0.010M/s (drops 0.018 ± 0.000M/s mem 0.000 MiB)
qp-trie lookup (8  thread)   82.933 ± 0.409M/s (drops 0.031 ± 0.000M/s mem 0.000 MiB)
qp-trie lookup (16 thread)  162.615 ± 0.842M/s (drops 0.070 ± 0.000M/s mem 0.000 MiB)

* subtree lock
qp-trie lookup (1  thread)    8.990 ± 0.506M/s (drops 0.006 ± 0.000M/s mem 0.000 MiB)
qp-trie lookup (2  thread)   15.908 ± 0.141M/s (drops 0.004 ± 0.000M/s mem 0.000 MiB)
qp-trie lookup (4  thread)   27.551 ± 0.025M/s (drops 0.019 ± 0.000M/s mem 0.000 MiB)
qp-trie lookup (8  thread)   42.040 ± 0.241M/s (drops 0.018 ± 0.000M/s mem 0.000 MiB)
qp-trie lookup (16 thread)   50.884 ± 0.171M/s (drops 0.012 ± 0.000M/s mem 0.000 MiB)

Strings in /proc/kallsyms (key size=83, max entries=170958)
* no lock
qp-trie lookup (1  thread)    4.096 ± 0.234M/s (drops 0.249 ± 0.014M/s mem 0.000 MiB)
qp-trie lookup (2  thread)    8.226 ± 0.009M/s (drops 0.500 ± 0.002M/s mem 0.000 MiB)
qp-trie lookup (4  thread)   15.356 ± 0.034M/s (drops 0.933 ± 0.006M/s mem 0.000 MiB)
qp-trie lookup (8  thread)   30.037 ± 0.584M/s (drops 1.827 ± 0.037M/s mem 0.000 MiB)
qp-trie lookup (16 thread)   62.600 ± 0.307M/s (drops 3.808 ± 0.029M/s mem 0.000 MiB)

* subtree lock
qp-trie lookup (1  thread)    4.454 ± 0.108M/s (drops 0.271 ± 0.007M/s mem 0.000 MiB)
qp-trie lookup (2  thread)    4.883 ± 0.500M/s (drops 0.297 ± 0.031M/s mem 0.000 MiB)
qp-trie lookup (4  thread)    5.771 ± 0.137M/s (drops 0.351 ± 0.008M/s mem 0.000 MiB)
qp-trie lookup (8  thread)    5.926 ± 0.104M/s (drops 0.359 ± 0.011M/s mem 0.000 MiB)
qp-trie lookup (16 thread)    5.947 ± 0.171M/s (drops 0.362 ± 0.023M/s mem 0.000 MiB)

>>>> Beside lookup procedure, get_next_key() from syscall
also lookups trie >>>> locklessly. If the branch node is reused, the order of returned keys may be >>>> broken. There is also a parent pointer in branch node and it is used for reverse >>>> lookup during get_next_key, the reuse may lead to unexpected skip in iteration. >>> qp_trie_lookup_next_node can be done under spin_lock. >>> Iterating all map elements is a slow operation anyway. >> OK. Taking subtree spinlock is simpler but the scalability will be bad. Not sure >> whether or not the solution for lockless lookup will work for get_next_key. Will >> check. > What kind of scalability are you concerned about? > get_next is done by user space only. Plenty of overhead already. As an ordered map, maybe the next and prev iteration operations are needed in bpf program in the future. For now, i think it is OK. >>>>> Instead of call_rcu in qp_trie_branch_free (which will work only for >>>>> regular progs and have high overhead as demonstrated by mem_alloc patches) >>>>> the qp-trie freeing logic can scrub that element, so it's ready to be >>>>> reused as another struct qp_trie_branch. >>>>> I guess I'm missing how rcu protects this internal data structures of qp-trie. >>>>> The rcu_read_lock of regular bpf prog helps to stay lock-less during lookup? >>>>> Is that it? >>>> Yes. The update is made atomic by copying the parent branch node to a new branch >>>> node and replacing the pointer to the parent branch node by the new branch node, >>>> so the lookup procedure either find the old branch node or the new branch node. >>>>> So to make qp-trie work in sleepable progs the algo would need to >>>>> be changed to do both call_rcu and call_rcu_task_trace everywhere >>>>> to protect these inner structs? >>>>> call_rcu_task_trace can take long time. So qp_trie_branch-s may linger >>>>> around. So quick update/delete (in sleepable with call_rcu_task_trace) >>>>> may very well exhaust memory. 
With bpf_mem_alloc we don't have this issue >>>>> since rcu_task_trace gp is observed only when freeing into global mem pool. >>>>> Say qp-trie just uses bpf_mem_alloc for qp_trie_branch. >>>>> What is the worst that can happen? qp_trie_lookup_elem will go into wrong >>>>> path, but won't crash, right? Can we do hlist_nulls trick to address that? >>>>> In other words bpf_mem_alloc reuse behavior is pretty much SLAB_TYPESAFE_BY_RCU. >>>>> Many kernel data structures know how to deal with such object reuse. >>>>> We can have a private bpf_mem_alloc here for qp_trie_branch-s only and >>>>> construct a logic in a way that obj reuse is not problematic. >>>> As said above, qp_trie_lookup_elem may be OK with SLAB_TYPESAFE_BY_RCU. But I >>>> don't know how to do it for get_next_key because the iteration result needs to >>>> be ordered and can not skip existed elements before the iterations begins. >>> imo it's fine to spin_lock in get_next_key. >>> We should measure the lock overhead in lookup. It might be acceptable too. >> Will check that. >>>> If removing immediate reuse from bpf_mem_alloc, beside the may-decreased >>>> performance, is there any reason we can not do that ? >>> What do you mean? >>> Always do call_rcu + call_rcu_tasks_trace for every bpf_mem_free ? >> Yes. Does doing call_rcu() + call_rcu_task_trace in batch help just like >> free_bulk does ? >>> As I said above: >>> " call_rcu_task_trace can take long time. So qp_trie_branch-s may linger >>> around. So quick update/delete (in sleepable with call_rcu_task_trace) >>> may very well exhaust memory. >>> " >>> As an exercise try samples/bpf/map_perf_test on non-prealloc hashmap >>> before mem_alloc conversion. Just regular call_rcu consumes 100% of all cpus. >>> With call_rcu_tasks_trace it's worse. It cannot sustain such flood. >>> . I can not reproduce the phenomenon that call_rcu consumes 100% of all cpus in my local environment, could you share the setup for it ? 
The following is the output of perf report (--no-children) for "./map_perf_test 4 72 10240 100000" on an x86-64 host with 72 CPUs:

26.63%  map_perf_test  [kernel.vmlinux]  [k] alloc_htab_elem
21.57%  map_perf_test  [kernel.vmlinux]  [k] htab_map_update_elem
18.08%  map_perf_test  [kernel.vmlinux]  [k] htab_map_delete_elem
12.30%  map_perf_test  [kernel.vmlinux]  [k] free_htab_elem
10.55%  map_perf_test  [kernel.vmlinux]  [k] __htab_map_lookup_elem
 1.58%  map_perf_test  [kernel.vmlinux]  [k] bpf_map_kmalloc_node
 1.39%  map_perf_test  [kernel.vmlinux]  [k] _raw_spin_lock_irqsave
 1.37%  map_perf_test  [kernel.vmlinux]  [k] __copy_map_value.constprop.0
 0.45%  map_perf_test  [kernel.vmlinux]  [k] check_and_free_fields
 0.33%  map_perf_test  [kernel.vmlinux]  [k] rcu_segcblist_enqueue

The overhead of call_rcu is tiny compared with the hash map operations. Instead, alloc_htab_elem() and free_htab_elem() are the bottlenecks. The following is the output of perf record after applying bpf_mem_alloc:

25.35%  map_perf_test  [kernel.vmlinux]  [k] htab_map_delete_elem
23.69%  map_perf_test  [kernel.vmlinux]  [k] htab_map_update_elem
 8.42%  map_perf_test  [kernel.vmlinux]  [k] __htab_map_lookup_elem
 7.60%  map_perf_test  [kernel.vmlinux]  [k] alloc_htab_elem
 4.35%  map_perf_test  [kernel.vmlinux]  [k] free_htab_elem
 2.28%  map_perf_test  [kernel.vmlinux]  [k] memcpy_erms
 2.24%  map_perf_test  [kernel.vmlinux]  [k] jhash
 2.02%  map_perf_test  [kernel.vmlinux]  [k] _raw_spin_lock_irqsave

>> Will check the result of map_perf_test. But it seems bpf_mem_alloc may still
>> exhaust memory if __free_rcu_tasks_trace() can not be called timely. Will take a
>> close look at that.
> In theory. yes. The batching makes a big difference.
On Tue, Sep 27, 2022 at 7:08 AM Hou Tao <houtao@huaweicloud.com> wrote: > > A quick benchmark show the performance is bad when using subtree lock for lookup: > > Randomly-generated binary data (key size=255, max entries=16K, key length > range:[1, 255]) > * no lock > qp-trie lookup (1 thread) 10.250 ± 0.009M/s (drops 0.006 ± 0.000M/s mem > 0.000 MiB) > qp-trie lookup (2 thread) 20.466 ± 0.009M/s (drops 0.010 ± 0.000M/s mem > 0.000 MiB) > qp-trie lookup (4 thread) 41.211 ± 0.010M/s (drops 0.018 ± 0.000M/s mem > 0.000 MiB) > qp-trie lookup (8 thread) 82.933 ± 0.409M/s (drops 0.031 ± 0.000M/s mem > 0.000 MiB) > qp-trie lookup (16 thread) 162.615 ± 0.842M/s (drops 0.070 ± 0.000M/s mem > 0.000 MiB) > > * subtree lock > qp-trie lookup (1 thread) 8.990 ± 0.506M/s (drops 0.006 ± 0.000M/s mem > 0.000 MiB) > qp-trie lookup (2 thread) 15.908 ± 0.141M/s (drops 0.004 ± 0.000M/s mem > 0.000 MiB) > qp-trie lookup (4 thread) 27.551 ± 0.025M/s (drops 0.019 ± 0.000M/s mem > 0.000 MiB) > qp-trie lookup (8 thread) 42.040 ± 0.241M/s (drops 0.018 ± 0.000M/s mem > 0.000 MiB) > qp-trie lookup (16 thread) 50.884 ± 0.171M/s (drops 0.012 ± 0.000M/s mem > 0.000 MiB) That's indeed significant. But I interpret it differently. Since single thread perf is close enough while 16 thread suffers 3x it means the lock mechanism is inefficient. It means update/delete performance equally doesn't scale. > Strings in /proc/kallsyms (key size=83, max entries=170958) > * no lock > qp-trie lookup (1 thread) 4.096 ± 0.234M/s (drops 0.249 ± 0.014M/s mem > 0.000 MiB) > > * subtree lock > qp-trie lookup (1 thread) 4.454 ± 0.108M/s (drops 0.271 ± 0.007M/s mem > 0.000 MiB) Here a single thread with spin_lock is _faster_ than without. So it's not about the cost of spin_lock, but its contention. So all the complexity to do lockless lookup needs to be considered in this context. Looks like update/delete don't scale anyway. 
So lock-less lookup complexity is justified only for the case with a lot of concurrent lookups and little update/delete. When bpf hash map was added the majority of tracing use cases had # of lookups == # of updates == # of deletes. For qp-trie we obviously cannot predict, but should not pivot strongly into lock-less lookup without data. The lock-less lookup should have reasonable complexity. Otherwise let's do spin_lock and work on a different locking scheme for all operations (lookup/update/delete). > I can not reproduce the phenomenon that call_rcu consumes 100% of all cpus in my > local environment, could you share the setup for it ? > > The following is the output of perf report (--no-children) for "./map_perf_test > 4 72 10240 100000" on a x86-64 host with 72-cpus: > > 26.63% map_perf_test [kernel.vmlinux] [k] > alloc_htab_elem > 21.57% map_perf_test [kernel.vmlinux] [k] > htab_map_update_elem Looks like the perf is lost on atomic_inc/dec. Try a partial revert of mem_alloc. In particular to make sure commit 0fd7c5d43339 ("bpf: Optimize call_rcu in non-preallocated hash map.") is reverted and call_rcu is in place, but percpu counter optimization is still there. Also please use 'map_perf_test 4'. I doubt 1000 vs 10240 will make a difference, but still.
Hi, On 9/28/2022 9:08 AM, Alexei Starovoitov wrote: > On Tue, Sep 27, 2022 at 7:08 AM Hou Tao <houtao@huaweicloud.com> wrote: >> A quick benchmark show the performance is bad when using subtree lock for lookup: >> >> Randomly-generated binary data (key size=255, max entries=16K, key length >> range:[1, 255]) >> * no lock >> qp-trie lookup (1 thread) 10.250 ± 0.009M/s (drops 0.006 ± 0.000M/s mem >> 0.000 MiB) >> qp-trie lookup (2 thread) 20.466 ± 0.009M/s (drops 0.010 ± 0.000M/s mem >> 0.000 MiB) >> qp-trie lookup (4 thread) 41.211 ± 0.010M/s (drops 0.018 ± 0.000M/s mem >> 0.000 MiB) >> qp-trie lookup (8 thread) 82.933 ± 0.409M/s (drops 0.031 ± 0.000M/s mem >> 0.000 MiB) >> qp-trie lookup (16 thread) 162.615 ± 0.842M/s (drops 0.070 ± 0.000M/s mem >> 0.000 MiB) >> >> * subtree lock >> qp-trie lookup (1 thread) 8.990 ± 0.506M/s (drops 0.006 ± 0.000M/s mem >> 0.000 MiB) >> qp-trie lookup (2 thread) 15.908 ± 0.141M/s (drops 0.004 ± 0.000M/s mem >> 0.000 MiB) >> qp-trie lookup (4 thread) 27.551 ± 0.025M/s (drops 0.019 ± 0.000M/s mem >> 0.000 MiB) >> qp-trie lookup (8 thread) 42.040 ± 0.241M/s (drops 0.018 ± 0.000M/s mem >> 0.000 MiB) >> qp-trie lookup (16 thread) 50.884 ± 0.171M/s (drops 0.012 ± 0.000M/s mem >> 0.000 MiB) > That's indeed significant. > But I interpret it differently. > Since single thread perf is close enough while 16 thread > suffers 3x it means the lock mechanism is inefficient. > It means update/delete performance equally doesn't scale. > >> Strings in /proc/kallsyms (key size=83, max entries=170958) >> * no lock >> qp-trie lookup (1 thread) 4.096 ± 0.234M/s (drops 0.249 ± 0.014M/s mem >> 0.000 MiB) >> >> * subtree lock >> qp-trie lookup (1 thread) 4.454 ± 0.108M/s (drops 0.271 ± 0.007M/s mem >> 0.000 MiB) > Here a single thread with spin_lock is _faster_ than without. It is a little strange because there is not any overhead for lockless lookup for now. I thought it may be due to CPU boost or something similar. Will check it again. 
> So it's not about the cost of spin_lock, but its contention. > So all the complexity to do lockless lookup > needs to be considered in this context. > Looks like update/delete don't scale anyway. Yes. > So lock-less lookup complexity is justified only > for the case with a lot of concurrent lookups and > little update/delete. > When bpf hash map was added the majority of tracing use cases > had # of lookups == # of updates == # of deletes. > For qp-trie we obviously cannot predict, > but should not pivot strongly into lock-less lookup without data. > The lock-less lookup should have reasonable complexity. > Otherwise let's do spin_lock and work on a different locking scheme > for all operations (lookup/update/delete). For lpm-trie, does the use case in cillium [0] has many lockless lookups and few updates/deletions ? So I think the use case is applicable for qp-trie as well. Before steering towards spin_lock, let's implement a demo for lock-less lookup to see its complexity. For spin-lock way, I had implemented a hand-over-hand lock but the lookup was still lockless, but it doesn't scale well as well. > >> I can not reproduce the phenomenon that call_rcu consumes 100% of all cpus in my >> local environment, could you share the setup for it ? >> >> The following is the output of perf report (--no-children) for "./map_perf_test >> 4 72 10240 100000" on a x86-64 host with 72-cpus: >> >> 26.63% map_perf_test [kernel.vmlinux] [k] >> alloc_htab_elem >> 21.57% map_perf_test [kernel.vmlinux] [k] >> htab_map_update_elem > Looks like the perf is lost on atomic_inc/dec. > Try a partial revert of mem_alloc. > In particular to make sure > commit 0fd7c5d43339 ("bpf: Optimize call_rcu in non-preallocated hash map.") > is reverted and call_rcu is in place, > but percpu counter optimization is still there. > Also please use 'map_perf_test 4'. > I doubt 1000 vs 10240 will make a difference, but still. Will do. The suggestion is reasonable. 
But when max_entries=1000, use_percpu_counter will be false.
On Tue, Sep 27, 2022 at 8:27 PM Hou Tao <houtao@huaweicloud.com> wrote: > > Also please use 'map_perf_test 4'. > > I doubt 1000 vs 10240 will make a difference, but still. > Will do. The suggestion is reasonable. But when max_entries=1000, > use_percpu_counter will be false. ... when the system has a lot of cpus... The knobs in the microbenchmark are there to tune it to observe desired behavior. Please mention the reasons to use that specific value next time. When N things have changed it's hard to do apples-to-apples.
Hi, On 9/28/2022 9:08 AM, Alexei Starovoitov wrote: > On Tue, Sep 27, 2022 at 7:08 AM Hou Tao <houtao@huaweicloud.com> wrote: > SNIP >> I can not reproduce the phenomenon that call_rcu consumes 100% of all cpus in my >> local environment, could you share the setup for it ? >> >> The following is the output of perf report (--no-children) for "./map_perf_test >> 4 72 10240 100000" on a x86-64 host with 72-cpus: >> >> 26.63% map_perf_test [kernel.vmlinux] [k] >> alloc_htab_elem >> 21.57% map_perf_test [kernel.vmlinux] [k] >> htab_map_update_elem > Looks like the perf is lost on atomic_inc/dec. > Try a partial revert of mem_alloc. > In particular to make sure > commit 0fd7c5d43339 ("bpf: Optimize call_rcu in non-preallocated hash map.") > is reverted and call_rcu is in place, > but percpu counter optimization is still there. > Also please use 'map_perf_test 4'. > I doubt 1000 vs 10240 will make a difference, but still. > I have tried the following two setups: (1) Don't use bpf_mem_alloc in hash-map and use per-cpu counter in hash-map # Samples: 1M of event 'cycles:ppp' # Event count (approx.): 1041345723234 # # Overhead Command Shared Object Symbol # ........ ............... ........................................... ............................................... 
#
10.36%  map_perf_test  [kernel.vmlinux]  [k] bpf_map_get_memcg.isra.0
 9.82%  map_perf_test  [kernel.vmlinux]  [k] bpf_map_kmalloc_node
 4.24%  map_perf_test  [kernel.vmlinux]  [k] check_preemption_disabled
 2.86%  map_perf_test  [kernel.vmlinux]  [k] htab_map_update_elem
 2.80%  map_perf_test  [kernel.vmlinux]  [k] __kmalloc_node
 2.72%  map_perf_test  [kernel.vmlinux]  [k] htab_map_delete_elem
 2.30%  map_perf_test  [kernel.vmlinux]  [k] memcg_slab_post_alloc_hook
 2.21%  map_perf_test  [kernel.vmlinux]  [k] entry_SYSCALL_64
 2.17%  map_perf_test  [kernel.vmlinux]  [k] syscall_exit_to_user_mode
 2.12%  map_perf_test  [kernel.vmlinux]  [k] jhash
 2.11%  map_perf_test  [kernel.vmlinux]  [k] syscall_return_via_sysret
 2.05%  map_perf_test  [kernel.vmlinux]  [k] alloc_htab_elem
 1.94%  map_perf_test  [kernel.vmlinux]  [k] _raw_spin_lock_irqsave
 1.92%  map_perf_test  [kernel.vmlinux]  [k] preempt_count_add
 1.92%  map_perf_test  [kernel.vmlinux]  [k] preempt_count_sub
 1.87%  map_perf_test  [kernel.vmlinux]  [k] call_rcu

(2) Use bpf_mem_alloc & per-cpu counter in hash-map, but no batch call_rcu optimization, by reverting the following commits:

9f2c6e96c65e bpf: Optimize rcu_barrier usage between hash map and bpf_mem_alloc.
bfc03c15bebf bpf: Remove usage of kmem_cache from bpf_mem_cache.
02cc5aa29e8c bpf: Remove prealloc-only restriction for sleepable bpf programs.
dccb4a9013a6 bpf: Prepare bpf_mem_alloc to be used by sleepable bpf programs.
96da3f7d489d bpf: Remove tracing program restriction on map types
ee4ed53c5eb6 bpf: Convert percpu hash map to per-cpu bpf_mem_alloc.
4ab67149f3c6 bpf: Add percpu allocation support to bpf_mem_alloc.
8d5a8011b35d bpf: Batch call_rcu callbacks instead of SLAB_TYPESAFE_BY_RCU.
7c266178aa51 bpf: Adjust low/high watermarks in bpf_mem_cache
0fd7c5d43339 bpf: Optimize call_rcu in non-preallocated hash map.
 5.17%  map_perf_test  [kernel.vmlinux]  [k] check_preemption_disabled
 4.53%  map_perf_test  [kernel.vmlinux]  [k] __get_obj_cgroup_from_memcg
 2.97%  map_perf_test  [kernel.vmlinux]  [k] htab_map_update_elem
 2.74%  map_perf_test  [kernel.vmlinux]  [k] htab_map_delete_elem
 2.62%  map_perf_test  [kernel.vmlinux]  [k] kmem_cache_alloc_node
 2.57%  map_perf_test  [kernel.vmlinux]  [k] memcg_slab_post_alloc_hook
 2.34%  map_perf_test  [kernel.vmlinux]  [k] jhash
 2.30%  map_perf_test  [kernel.vmlinux]  [k] entry_SYSCALL_64
 2.25%  map_perf_test  [kernel.vmlinux]  [k] obj_cgroup_charge
 2.23%  map_perf_test  [kernel.vmlinux]  [k] alloc_htab_elem
 2.17%  map_perf_test  [kernel.vmlinux]  [k] memcpy_erms
 2.17%  map_perf_test  [kernel.vmlinux]  [k] syscall_exit_to_user_mode
 2.16%  map_perf_test  [kernel.vmlinux]  [k] syscall_return_via_sysret
 2.14%  map_perf_test  [kernel.vmlinux]  [k] _raw_spin_lock_irqsave
 2.13%  map_perf_test  [kernel.vmlinux]  [k] preempt_count_add
 2.12%  map_perf_test  [kernel.vmlinux]  [k] preempt_count_sub
 2.00%  map_perf_test  [kernel.vmlinux]  [k] percpu_counter_add_batch
 1.99%  map_perf_test  [kernel.vmlinux]  [k] alloc_bulk
 1.97%  map_perf_test  [kernel.vmlinux]  [k] call_rcu
 1.52%  map_perf_test  [kernel.vmlinux]  [k] mod_objcg_state
 1.36%  map_perf_test  [kernel.vmlinux]  [k] allocate_slab

In both of these two setups, the overhead of call_rcu is about 2% and it is not the biggest overhead.

Maybe add a not-immediate-reuse flag support to bpf_mem_alloc is reason. What do you think ?
On 9/28/2022 4:45 PM, Hou Tao wrote: > Hi, > > On 9/28/2022 9:08 AM, Alexei Starovoitov wrote: >> On Tue, Sep 27, 2022 at 7:08 AM Hou Tao <houtao@huaweicloud.com> wrote: SNIP >> >> In both of these two setups, the overhead of call_rcu is about 2% and it is not >> the biggest overhead. Also considering that the overhead of qp-trie update/delete will be much bigger than hash map update. >> >> Maybe add a not-immediate-reuse flag support to bpf_mem_alloc is reason. What do >> you think ? s/reason/reasonable. > .
On Wed, Sep 28, 2022 at 1:46 AM Hou Tao <houtao@huaweicloud.com> wrote: > > Hi, > > On 9/28/2022 9:08 AM, Alexei Starovoitov wrote: > > On Tue, Sep 27, 2022 at 7:08 AM Hou Tao <houtao@huaweicloud.com> wrote: > > > SNIP > >> I can not reproduce the phenomenon that call_rcu consumes 100% of all cpus in my > >> local environment, could you share the setup for it ? > >> > >> The following is the output of perf report (--no-children) for "./map_perf_test > >> 4 72 10240 100000" on a x86-64 host with 72-cpus: > >> > >> 26.63% map_perf_test [kernel.vmlinux] [k] > >> alloc_htab_elem > >> 21.57% map_perf_test [kernel.vmlinux] [k] > >> htab_map_update_elem > > Looks like the perf is lost on atomic_inc/dec. > > Try a partial revert of mem_alloc. > > In particular to make sure > > commit 0fd7c5d43339 ("bpf: Optimize call_rcu in non-preallocated hash map.") > > is reverted and call_rcu is in place, > > but percpu counter optimization is still there. > > Also please use 'map_perf_test 4'. > > I doubt 1000 vs 10240 will make a difference, but still. > > > I have tried the following two setups: > (1) Don't use bpf_mem_alloc in hash-map and use per-cpu counter in hash-map > # Samples: 1M of event 'cycles:ppp' > # Event count (approx.): 1041345723234 > # > # Overhead Command Shared Object Symbol > # ........ ............... ........................................... > ............................................... > # > 10.36% map_perf_test [kernel.vmlinux] [k] > bpf_map_get_memcg.isra.0 That is per-cpu counter and it's consuming 10% ?! Something is really odd in your setup. A lot of debug configs? > 9.82% map_perf_test [kernel.vmlinux] [k] > bpf_map_kmalloc_node > 4.24% map_perf_test [kernel.vmlinux] [k] > check_preemption_disabled clearly debug build. Please use production build. 
> 2.86% map_perf_test [kernel.vmlinux] [k] > htab_map_update_elem > 2.80% map_perf_test [kernel.vmlinux] [k] > __kmalloc_node > 2.72% map_perf_test [kernel.vmlinux] [k] > htab_map_delete_elem > 2.30% map_perf_test [kernel.vmlinux] [k] > memcg_slab_post_alloc_hook > 2.21% map_perf_test [kernel.vmlinux] [k] > entry_SYSCALL_64 > 2.17% map_perf_test [kernel.vmlinux] [k] > syscall_exit_to_user_mode > 2.12% map_perf_test [kernel.vmlinux] [k] jhash > 2.11% map_perf_test [kernel.vmlinux] [k] > syscall_return_via_sysret > 2.05% map_perf_test [kernel.vmlinux] [k] > alloc_htab_elem > 1.94% map_perf_test [kernel.vmlinux] [k] > _raw_spin_lock_irqsave > 1.92% map_perf_test [kernel.vmlinux] [k] > preempt_count_add > 1.92% map_perf_test [kernel.vmlinux] [k] > preempt_count_sub > 1.87% map_perf_test [kernel.vmlinux] [k] > call_rcu > > > (2) Use bpf_mem_alloc & per-cpu counter in hash-map, but no batch call_rcu > optimization > By revert the following commits: > > 9f2c6e96c65e bpf: Optimize rcu_barrier usage between hash map and bpf_mem_alloc. > bfc03c15bebf bpf: Remove usage of kmem_cache from bpf_mem_cache. > 02cc5aa29e8c bpf: Remove prealloc-only restriction for sleepable bpf programs. > dccb4a9013a6 bpf: Prepare bpf_mem_alloc to be used by sleepable bpf programs. > 96da3f7d489d bpf: Remove tracing program restriction on map types > ee4ed53c5eb6 bpf: Convert percpu hash map to per-cpu bpf_mem_alloc. > 4ab67149f3c6 bpf: Add percpu allocation support to bpf_mem_alloc. > 8d5a8011b35d bpf: Batch call_rcu callbacks instead of SLAB_TYPESAFE_BY_RCU. > 7c266178aa51 bpf: Adjust low/high watermarks in bpf_mem_cache > 0fd7c5d43339 bpf: Optimize call_rcu in non-preallocated hash map. 
> > 5.17% map_perf_test [kernel.vmlinux] [k] > check_preemption_disabled > 4.53% map_perf_test [kernel.vmlinux] [k] > __get_obj_cgroup_from_memcg > 2.97% map_perf_test [kernel.vmlinux] [k] > htab_map_update_elem > 2.74% map_perf_test [kernel.vmlinux] [k] > htab_map_delete_elem > 2.62% map_perf_test [kernel.vmlinux] [k] > kmem_cache_alloc_node > 2.57% map_perf_test [kernel.vmlinux] [k] > memcg_slab_post_alloc_hook > 2.34% map_perf_test [kernel.vmlinux] [k] jhash > 2.30% map_perf_test [kernel.vmlinux] [k] > entry_SYSCALL_64 > 2.25% map_perf_test [kernel.vmlinux] [k] > obj_cgroup_charge > 2.23% map_perf_test [kernel.vmlinux] [k] > alloc_htab_elem > 2.17% map_perf_test [kernel.vmlinux] [k] > memcpy_erms > 2.17% map_perf_test [kernel.vmlinux] [k] > syscall_exit_to_user_mode > 2.16% map_perf_test [kernel.vmlinux] [k] > syscall_return_via_sysret > 2.14% map_perf_test [kernel.vmlinux] [k] > _raw_spin_lock_irqsave > 2.13% map_perf_test [kernel.vmlinux] [k] > preempt_count_add > 2.12% map_perf_test [kernel.vmlinux] [k] > preempt_count_sub > 2.00% map_perf_test [kernel.vmlinux] [k] > percpu_counter_add_batch > 1.99% map_perf_test [kernel.vmlinux] [k] > alloc_bulk > 1.97% map_perf_test [kernel.vmlinux] [k] > call_rcu > 1.52% map_perf_test [kernel.vmlinux] [k] > mod_objcg_state > 1.36% map_perf_test [kernel.vmlinux] [k] > allocate_slab > > In both of these two setups, the overhead of call_rcu is about 2% and it is not > the biggest overhead. > > Maybe add a not-immediate-reuse flag support to bpf_mem_alloc is reason. What do > you think ? We've discussed it twice already. It's not an option due to OOM and performance considerations. call_rcu doesn't scale to millions a second.
Hi, On 9/29/2022 11:22 AM, Alexei Starovoitov wrote: > On Wed, Sep 28, 2022 at 1:46 AM Hou Tao <houtao@huaweicloud.com> wrote: >> Hi, >> >> On 9/28/2022 9:08 AM, Alexei Starovoitov wrote: >>> On Tue, Sep 27, 2022 at 7:08 AM Hou Tao <houtao@huaweicloud.com> wrote: >>> >>> Looks like the perf is lost on atomic_inc/dec. >>> Try a partial revert of mem_alloc. >>> In particular to make sure >>> commit 0fd7c5d43339 ("bpf: Optimize call_rcu in non-preallocated hash map.") >>> is reverted and call_rcu is in place, >>> but percpu counter optimization is still there. >>> Also please use 'map_perf_test 4'. >>> I doubt 1000 vs 10240 will make a difference, but still. >>> >> I have tried the following two setups: >> (1) Don't use bpf_mem_alloc in hash-map and use per-cpu counter in hash-map >> # Samples: 1M of event 'cycles:ppp' >> # Event count (approx.): 1041345723234 >> # >> # Overhead Command Shared Object Symbol >> # ........ ............... ........................................... >> ............................................... >> # >> 10.36% map_perf_test [kernel.vmlinux] [k] >> bpf_map_get_memcg.isra.0 > That is per-cpu counter and it's consuming 10% ?! > Something is really odd in your setup. > A lot of debug configs? Sorry for the late reply. Just back to work from a long vacation. My local .config is derived from Fedora distribution. It indeed has some DEBUG related configs. Will turn these configs off to check it again :) >> 9.82% map_perf_test [kernel.vmlinux] [k] >> bpf_map_kmalloc_node >> 4.24% map_perf_test [kernel.vmlinux] [k] >> check_preemption_disabled > clearly debug build. > Please use production build. check_preemption_disabled is due to CONFIG_DEBUG_PREEMPT. And it is enabled on Fedora distribution. 
>> 2.86% map_perf_test [kernel.vmlinux] [k] >> htab_map_update_elem >> 2.80% map_perf_test [kernel.vmlinux] [k] >> __kmalloc_node >> 2.72% map_perf_test [kernel.vmlinux] [k] >> htab_map_delete_elem >> 2.30% map_perf_test [kernel.vmlinux] [k] >> memcg_slab_post_alloc_hook >> 2.21% map_perf_test [kernel.vmlinux] [k] >> entry_SYSCALL_64 >> 2.17% map_perf_test [kernel.vmlinux] [k] >> syscall_exit_to_user_mode >> 2.12% map_perf_test [kernel.vmlinux] [k] jhash >> 2.11% map_perf_test [kernel.vmlinux] [k] >> syscall_return_via_sysret >> 2.05% map_perf_test [kernel.vmlinux] [k] >> alloc_htab_elem >> 1.94% map_perf_test [kernel.vmlinux] [k] >> _raw_spin_lock_irqsave >> 1.92% map_perf_test [kernel.vmlinux] [k] >> preempt_count_add >> 1.92% map_perf_test [kernel.vmlinux] [k] >> preempt_count_sub >> 1.87% map_perf_test [kernel.vmlinux] [k] >> call_rcu SNIP >> Maybe add a not-immediate-reuse flag support to bpf_mem_alloc is reason. What do >> you think ? > We've discussed it twice already. It's not an option due to OOM > and performance considerations. > call_rcu doesn't scale to millions a second. Understood. I was just trying to understand the exact performance overhead of call_rcu(). If the overhead of map operations is much greater than the overhead of call_rcu(), calling call_rcu() one million times a second will not be a problem, and it would also make the implementation of qp-trie much simpler. The OOM problem is indeed a problem, although it can also occur with the current implementation, so I will try to implement a lookup procedure which handles the reuse problem. Regards, Tao > .
On Fri, Oct 7, 2022 at 6:56 PM Hou Tao <houtao@huaweicloud.com> wrote: > > Hi, > > On 9/29/2022 11:22 AM, Alexei Starovoitov wrote: > > On Wed, Sep 28, 2022 at 1:46 AM Hou Tao <houtao@huaweicloud.com> wrote: > >> Hi, > >> > >> On 9/28/2022 9:08 AM, Alexei Starovoitov wrote: > >>> On Tue, Sep 27, 2022 at 7:08 AM Hou Tao <houtao@huaweicloud.com> wrote: > >>> > >>> Looks like the perf is lost on atomic_inc/dec. > >>> Try a partial revert of mem_alloc. > >>> In particular to make sure > >>> commit 0fd7c5d43339 ("bpf: Optimize call_rcu in non-preallocated hash map.") > >>> is reverted and call_rcu is in place, > >>> but percpu counter optimization is still there. > >>> Also please use 'map_perf_test 4'. > >>> I doubt 1000 vs 10240 will make a difference, but still. > >>> > >> I have tried the following two setups: > >> (1) Don't use bpf_mem_alloc in hash-map and use per-cpu counter in hash-map > >> # Samples: 1M of event 'cycles:ppp' > >> # Event count (approx.): 1041345723234 > >> # > >> # Overhead Command Shared Object Symbol > >> # ........ ............... ........................................... > >> ............................................... > >> # > >> 10.36% map_perf_test [kernel.vmlinux] [k] > >> bpf_map_get_memcg.isra.0 > > That is per-cpu counter and it's consuming 10% ?! > > Something is really odd in your setup. > > A lot of debug configs? > Sorry for the late reply. Just back to work from a long vacation. > > My local .config is derived from Fedora distribution. It indeed has some DEBUG > related configs. Will turn these configs off to check it again :) > >> 9.82% map_perf_test [kernel.vmlinux] [k] > >> bpf_map_kmalloc_node > >> 4.24% map_perf_test [kernel.vmlinux] [k] > >> check_preemption_disabled > > clearly debug build. > > Please use production build. > check_preemption_disabled is due to CONFIG_DEBUG_PREEMPT. And it is enabled on > Fedora distribution. 
> >> 2.86% map_perf_test [kernel.vmlinux] [k] > >> htab_map_update_elem > >> 2.80% map_perf_test [kernel.vmlinux] [k] > >> __kmalloc_node > >> 2.72% map_perf_test [kernel.vmlinux] [k] > >> htab_map_delete_elem > >> 2.30% map_perf_test [kernel.vmlinux] [k] > >> memcg_slab_post_alloc_hook > >> 2.21% map_perf_test [kernel.vmlinux] [k] > >> entry_SYSCALL_64 > >> 2.17% map_perf_test [kernel.vmlinux] [k] > >> syscall_exit_to_user_mode > >> 2.12% map_perf_test [kernel.vmlinux] [k] jhash > >> 2.11% map_perf_test [kernel.vmlinux] [k] > >> syscall_return_via_sysret > >> 2.05% map_perf_test [kernel.vmlinux] [k] > >> alloc_htab_elem > >> 1.94% map_perf_test [kernel.vmlinux] [k] > >> _raw_spin_lock_irqsave > >> 1.92% map_perf_test [kernel.vmlinux] [k] > >> preempt_count_add > >> 1.92% map_perf_test [kernel.vmlinux] [k] > >> preempt_count_sub > >> 1.87% map_perf_test [kernel.vmlinux] [k] > >> call_rcu > SNIP > >> Maybe add a not-immediate-reuse flag support to bpf_mem_alloc is reason. What do > >> you think ? > > We've discussed it twice already. It's not an option due to OOM > > and performance considerations. > > call_rcu doesn't scale to millions a second. > Understand. I was just trying to understand the exact performance overhead of > call_rcu(). If the overhead of map operations are much greater than the overhead > of call_rcu(), I think calling call_rcu() one millions a second will be not a > problem and it also makes the implementation of qp-trie being much simpler. The > OOM problem is indeed a problem, although it is also possible for the current > implementation, so I will try to implement the lookup procedure which handles > the reuse problem. call_rcu is not just that particular function. It's all the work rcu subsystem needs to do to observe gp and execute that callback. Just see how many kthreads it will start when overloaded like this.
On Fri, Oct 07, 2022 at 06:59:08PM -0700, Alexei Starovoitov wrote: > On Fri, Oct 7, 2022 at 6:56 PM Hou Tao <houtao@huaweicloud.com> wrote: > > > > Hi, > > > > On 9/29/2022 11:22 AM, Alexei Starovoitov wrote: > > > On Wed, Sep 28, 2022 at 1:46 AM Hou Tao <houtao@huaweicloud.com> wrote: > > >> Hi, > > >> > > >> On 9/28/2022 9:08 AM, Alexei Starovoitov wrote: > > >>> On Tue, Sep 27, 2022 at 7:08 AM Hou Tao <houtao@huaweicloud.com> wrote: > > >>> > > >>> Looks like the perf is lost on atomic_inc/dec. > > >>> Try a partial revert of mem_alloc. > > >>> In particular to make sure > > >>> commit 0fd7c5d43339 ("bpf: Optimize call_rcu in non-preallocated hash map.") > > >>> is reverted and call_rcu is in place, > > >>> but percpu counter optimization is still there. > > >>> Also please use 'map_perf_test 4'. > > >>> I doubt 1000 vs 10240 will make a difference, but still. > > >>> > > >> I have tried the following two setups: > > >> (1) Don't use bpf_mem_alloc in hash-map and use per-cpu counter in hash-map > > >> # Samples: 1M of event 'cycles:ppp' > > >> # Event count (approx.): 1041345723234 > > >> # > > >> # Overhead Command Shared Object Symbol > > >> # ........ ............... ........................................... > > >> ............................................... > > >> # > > >> 10.36% map_perf_test [kernel.vmlinux] [k] > > >> bpf_map_get_memcg.isra.0 > > > That is per-cpu counter and it's consuming 10% ?! > > > Something is really odd in your setup. > > > A lot of debug configs? > > Sorry for the late reply. Just back to work from a long vacation. > > > > My local .config is derived from Fedora distribution. It indeed has some DEBUG > > related configs. Will turn these configs off to check it again :) > > >> 9.82% map_perf_test [kernel.vmlinux] [k] > > >> bpf_map_kmalloc_node > > >> 4.24% map_perf_test [kernel.vmlinux] [k] > > >> check_preemption_disabled > > > clearly debug build. > > > Please use production build. 
> > check_preemption_disabled is due to CONFIG_DEBUG_PREEMPT. And it is enabled on > > Fedora distribution. > > >> 2.86% map_perf_test [kernel.vmlinux] [k] > > >> htab_map_update_elem > > >> 2.80% map_perf_test [kernel.vmlinux] [k] > > >> __kmalloc_node > > >> 2.72% map_perf_test [kernel.vmlinux] [k] > > >> htab_map_delete_elem > > >> 2.30% map_perf_test [kernel.vmlinux] [k] > > >> memcg_slab_post_alloc_hook > > >> 2.21% map_perf_test [kernel.vmlinux] [k] > > >> entry_SYSCALL_64 > > >> 2.17% map_perf_test [kernel.vmlinux] [k] > > >> syscall_exit_to_user_mode > > >> 2.12% map_perf_test [kernel.vmlinux] [k] jhash > > >> 2.11% map_perf_test [kernel.vmlinux] [k] > > >> syscall_return_via_sysret > > >> 2.05% map_perf_test [kernel.vmlinux] [k] > > >> alloc_htab_elem > > >> 1.94% map_perf_test [kernel.vmlinux] [k] > > >> _raw_spin_lock_irqsave > > >> 1.92% map_perf_test [kernel.vmlinux] [k] > > >> preempt_count_add > > >> 1.92% map_perf_test [kernel.vmlinux] [k] > > >> preempt_count_sub > > >> 1.87% map_perf_test [kernel.vmlinux] [k] > > >> call_rcu > > SNIP > > >> Maybe add a not-immediate-reuse flag support to bpf_mem_alloc is reason. What do > > >> you think ? > > > We've discussed it twice already. It's not an option due to OOM > > > and performance considerations. > > > call_rcu doesn't scale to millions a second. > > Understand. I was just trying to understand the exact performance overhead of > > call_rcu(). If the overhead of map operations are much greater than the overhead > > of call_rcu(), I think calling call_rcu() one millions a second will be not a > > problem and it also makes the implementation of qp-trie being much simpler. The > > OOM problem is indeed a problem, although it is also possible for the current > > implementation, so I will try to implement the lookup procedure which handles > > the reuse problem. > > call_rcu is not just that particular function. > It's all the work rcu subsystem needs to do to observe gp > and execute that callback. 
Just see how many kthreads it will > start when overloaded like this. The kthreads to watch include rcu_preempt, rcu_sched, ksoftirqd*, rcuc*, and rcuo*. There is also the back-of-interrupt softirq context, which requires some care to measure accurately. The possibility of SLAB_TYPESAFE_BY_RCU has been discussed. I take it that the per-element locking overhead for exact iterations was a problem? If so, what exactly are the consistency rules for iteration? Presumably stronger than "if the element existed throughout, it is included in the iteration; if it did not exist throughout, it is not included; otherwise it might or might not be included" given that you get that for free. Either way, could you please tell me the exact iteration rules? Thanx, Paul
On Sat, Oct 8, 2022 at 6:22 AM Paul E. McKenney <paulmck@kernel.org> wrote: > > On Fri, Oct 07, 2022 at 06:59:08PM -0700, Alexei Starovoitov wrote: > > On Fri, Oct 7, 2022 at 6:56 PM Hou Tao <houtao@huaweicloud.com> wrote: > > > > > > Hi, > > > > > > On 9/29/2022 11:22 AM, Alexei Starovoitov wrote: > > > > On Wed, Sep 28, 2022 at 1:46 AM Hou Tao <houtao@huaweicloud.com> wrote: > > > >> Hi, > > > >> > > > >> On 9/28/2022 9:08 AM, Alexei Starovoitov wrote: > > > >>> On Tue, Sep 27, 2022 at 7:08 AM Hou Tao <houtao@huaweicloud.com> wrote: > > > >>> > > > >>> Looks like the perf is lost on atomic_inc/dec. > > > >>> Try a partial revert of mem_alloc. > > > >>> In particular to make sure > > > >>> commit 0fd7c5d43339 ("bpf: Optimize call_rcu in non-preallocated hash map.") > > > >>> is reverted and call_rcu is in place, > > > >>> but percpu counter optimization is still there. > > > >>> Also please use 'map_perf_test 4'. > > > >>> I doubt 1000 vs 10240 will make a difference, but still. > > > >>> > > > >> I have tried the following two setups: > > > >> (1) Don't use bpf_mem_alloc in hash-map and use per-cpu counter in hash-map > > > >> # Samples: 1M of event 'cycles:ppp' > > > >> # Event count (approx.): 1041345723234 > > > >> # > > > >> # Overhead Command Shared Object Symbol > > > >> # ........ ............... ........................................... > > > >> ............................................... > > > >> # > > > >> 10.36% map_perf_test [kernel.vmlinux] [k] > > > >> bpf_map_get_memcg.isra.0 > > > > That is per-cpu counter and it's consuming 10% ?! > > > > Something is really odd in your setup. > > > > A lot of debug configs? > > > Sorry for the late reply. Just back to work from a long vacation. > > > > > > My local .config is derived from Fedora distribution. It indeed has some DEBUG > > > related configs. 
Will turn these configs off to check it again :) > > > >> 9.82% map_perf_test [kernel.vmlinux] [k] > > > >> bpf_map_kmalloc_node > > > >> 4.24% map_perf_test [kernel.vmlinux] [k] > > > >> check_preemption_disabled > > > > clearly debug build. > > > > Please use production build. > > > check_preemption_disabled is due to CONFIG_DEBUG_PREEMPT. And it is enabled on > > > Fedora distribution. > > > >> 2.86% map_perf_test [kernel.vmlinux] [k] > > > >> htab_map_update_elem > > > >> 2.80% map_perf_test [kernel.vmlinux] [k] > > > >> __kmalloc_node > > > >> 2.72% map_perf_test [kernel.vmlinux] [k] > > > >> htab_map_delete_elem > > > >> 2.30% map_perf_test [kernel.vmlinux] [k] > > > >> memcg_slab_post_alloc_hook > > > >> 2.21% map_perf_test [kernel.vmlinux] [k] > > > >> entry_SYSCALL_64 > > > >> 2.17% map_perf_test [kernel.vmlinux] [k] > > > >> syscall_exit_to_user_mode > > > >> 2.12% map_perf_test [kernel.vmlinux] [k] jhash > > > >> 2.11% map_perf_test [kernel.vmlinux] [k] > > > >> syscall_return_via_sysret > > > >> 2.05% map_perf_test [kernel.vmlinux] [k] > > > >> alloc_htab_elem > > > >> 1.94% map_perf_test [kernel.vmlinux] [k] > > > >> _raw_spin_lock_irqsave > > > >> 1.92% map_perf_test [kernel.vmlinux] [k] > > > >> preempt_count_add > > > >> 1.92% map_perf_test [kernel.vmlinux] [k] > > > >> preempt_count_sub > > > >> 1.87% map_perf_test [kernel.vmlinux] [k] > > > >> call_rcu > > > SNIP > > > >> Maybe add a not-immediate-reuse flag support to bpf_mem_alloc is reason. What do > > > >> you think ? > > > > We've discussed it twice already. It's not an option due to OOM > > > > and performance considerations. > > > > call_rcu doesn't scale to millions a second. > > > Understand. I was just trying to understand the exact performance overhead of > > > call_rcu(). 
If the overhead of map operations are much greater than the overhead > > > of call_rcu(), I think calling call_rcu() one millions a second will be not a > > > problem and it also makes the implementation of qp-trie being much simpler. The > > > OOM problem is indeed a problem, although it is also possible for the current > > > implementation, so I will try to implement the lookup procedure which handles > > > the reuse problem. > > > > call_rcu is not just that particular function. > > It's all the work rcu subsystem needs to do to observe gp > > and execute that callback. Just see how many kthreads it will > > start when overloaded like this. > > The kthreads to watch include rcu_preempt, rcu_sched, ksoftirqd*, rcuc*, > and rcuo*. There is also the back-of-interrupt softirq context, which > requires some care to measure accurately. > > The possibility of SLAB_TYPESAFE_BY_RCU has been discussed. I take it > that the per-element locking overhead for exact iterations was a problem? > If so, what exactly are the consistency rules for iteration? Presumably > stronger than "if the element existed throughout, it is included in the > iteration; if it did not exist throughout, it is not included; otherwise > it might or might not be included" given that you get that for free. > > Either way, could you please tell me the exact iteration rules? The rules are the way we make them to be. iteration will be under lock. lookup needs to be correct. It can retry if necessary (like htab is doing). Randomly returning 'noexist' is, of course, not acceptable.
On Sat, Oct 08, 2022 at 09:40:04AM -0700, Alexei Starovoitov wrote: > On Sat, Oct 8, 2022 at 6:22 AM Paul E. McKenney <paulmck@kernel.org> wrote: > > > > On Fri, Oct 07, 2022 at 06:59:08PM -0700, Alexei Starovoitov wrote: > > > On Fri, Oct 7, 2022 at 6:56 PM Hou Tao <houtao@huaweicloud.com> wrote: > > > > > > > > Hi, > > > > > > > > On 9/29/2022 11:22 AM, Alexei Starovoitov wrote: > > > > > On Wed, Sep 28, 2022 at 1:46 AM Hou Tao <houtao@huaweicloud.com> wrote: > > > > >> Hi, > > > > >> > > > > >> On 9/28/2022 9:08 AM, Alexei Starovoitov wrote: > > > > >>> On Tue, Sep 27, 2022 at 7:08 AM Hou Tao <houtao@huaweicloud.com> wrote: > > > > >>> > > > > >>> Looks like the perf is lost on atomic_inc/dec. > > > > >>> Try a partial revert of mem_alloc. > > > > >>> In particular to make sure > > > > >>> commit 0fd7c5d43339 ("bpf: Optimize call_rcu in non-preallocated hash map.") > > > > >>> is reverted and call_rcu is in place, > > > > >>> but percpu counter optimization is still there. > > > > >>> Also please use 'map_perf_test 4'. > > > > >>> I doubt 1000 vs 10240 will make a difference, but still. > > > > >>> > > > > >> I have tried the following two setups: > > > > >> (1) Don't use bpf_mem_alloc in hash-map and use per-cpu counter in hash-map > > > > >> # Samples: 1M of event 'cycles:ppp' > > > > >> # Event count (approx.): 1041345723234 > > > > >> # > > > > >> # Overhead Command Shared Object Symbol > > > > >> # ........ ............... ........................................... > > > > >> ............................................... > > > > >> # > > > > >> 10.36% map_perf_test [kernel.vmlinux] [k] > > > > >> bpf_map_get_memcg.isra.0 > > > > > That is per-cpu counter and it's consuming 10% ?! > > > > > Something is really odd in your setup. > > > > > A lot of debug configs? > > > > Sorry for the late reply. Just back to work from a long vacation. > > > > > > > > My local .config is derived from Fedora distribution. It indeed has some DEBUG > > > > related configs. 
Will turn these configs off to check it again :) > > > > >> 9.82% map_perf_test [kernel.vmlinux] [k] > > > > >> bpf_map_kmalloc_node > > > > >> 4.24% map_perf_test [kernel.vmlinux] [k] > > > > >> check_preemption_disabled > > > > > clearly debug build. > > > > > Please use production build. > > > > check_preemption_disabled is due to CONFIG_DEBUG_PREEMPT. And it is enabled on > > > > Fedora distribution. > > > > >> 2.86% map_perf_test [kernel.vmlinux] [k] > > > > >> htab_map_update_elem > > > > >> 2.80% map_perf_test [kernel.vmlinux] [k] > > > > >> __kmalloc_node > > > > >> 2.72% map_perf_test [kernel.vmlinux] [k] > > > > >> htab_map_delete_elem > > > > >> 2.30% map_perf_test [kernel.vmlinux] [k] > > > > >> memcg_slab_post_alloc_hook > > > > >> 2.21% map_perf_test [kernel.vmlinux] [k] > > > > >> entry_SYSCALL_64 > > > > >> 2.17% map_perf_test [kernel.vmlinux] [k] > > > > >> syscall_exit_to_user_mode > > > > >> 2.12% map_perf_test [kernel.vmlinux] [k] jhash > > > > >> 2.11% map_perf_test [kernel.vmlinux] [k] > > > > >> syscall_return_via_sysret > > > > >> 2.05% map_perf_test [kernel.vmlinux] [k] > > > > >> alloc_htab_elem > > > > >> 1.94% map_perf_test [kernel.vmlinux] [k] > > > > >> _raw_spin_lock_irqsave > > > > >> 1.92% map_perf_test [kernel.vmlinux] [k] > > > > >> preempt_count_add > > > > >> 1.92% map_perf_test [kernel.vmlinux] [k] > > > > >> preempt_count_sub > > > > >> 1.87% map_perf_test [kernel.vmlinux] [k] > > > > >> call_rcu > > > > SNIP > > > > >> Maybe add a not-immediate-reuse flag support to bpf_mem_alloc is reason. What do > > > > >> you think ? > > > > > We've discussed it twice already. It's not an option due to OOM > > > > > and performance considerations. > > > > > call_rcu doesn't scale to millions a second. > > > > Understand. I was just trying to understand the exact performance overhead of > > > > call_rcu(). 
If the overhead of map operations are much greater than the overhead > > > > of call_rcu(), I think calling call_rcu() one millions a second will be not a > > > > problem and it also makes the implementation of qp-trie being much simpler. The > > > > OOM problem is indeed a problem, although it is also possible for the current > > > > implementation, so I will try to implement the lookup procedure which handles > > > > the reuse problem. > > > > > > call_rcu is not just that particular function. > > > It's all the work rcu subsystem needs to do to observe gp > > > and execute that callback. Just see how many kthreads it will > > > start when overloaded like this. > > > > The kthreads to watch include rcu_preempt, rcu_sched, ksoftirqd*, rcuc*, > > and rcuo*. There is also the back-of-interrupt softirq context, which > > requires some care to measure accurately. > > > > The possibility of SLAB_TYPESAFE_BY_RCU has been discussed. I take it > > that the per-element locking overhead for exact iterations was a problem? > > If so, what exactly are the consistency rules for iteration? Presumably > > stronger than "if the element existed throughout, it is included in the > > iteration; if it did not exist throughout, it is not included; otherwise > > it might or might not be included" given that you get that for free. > > > > Either way, could you please tell me the exact iteration rules? > > The rules are the way we make them to be. > iteration will be under lock. > lookup needs to be correct. It can retry if necessary (like htab is doing). > Randomly returning 'noexist' is, of course, not acceptable. OK, so then it is important that updates to this data structure be carried out in such a way as to avoid discombobulating lockless readers. Do the updates have that property? The usual way to get that property is to leave the old search structure around, replacing it with the new one, and RCU-freeing the old one. 
In case it helps, Kung and Lehman describe how to do that for search trees: http://www.eecs.harvard.edu/~htk/publication/1980-tods-kung-lehman.pdf Thanx, Paul
Hi Paul, On 10/9/2022 4:11 AM, Paul E. McKenney wrote: > On Sat, Oct 08, 2022 at 09:40:04AM -0700, Alexei Starovoitov wrote: >> On Sat, Oct 8, 2022 at 6:22 AM Paul E. McKenney <paulmck@kernel.org> wrote: >>> On Fri, Oct 07, 2022 at 06:59:08PM -0700, Alexei Starovoitov wrote: SNIP >>>>> Understand. I was just trying to understand the exact performance overhead of >>>>> call_rcu(). If the overhead of map operations are much greater than the overhead >>>>> of call_rcu(), I think calling call_rcu() one millions a second will be not a >>>>> problem and it also makes the implementation of qp-trie being much simpler. The >>>>> OOM problem is indeed a problem, although it is also possible for the current >>>>> implementation, so I will try to implement the lookup procedure which handles >>>>> the reuse problem. >>>> call_rcu is not just that particular function. >>>> It's all the work rcu subsystem needs to do to observe gp >>>> and execute that callback. Just see how many kthreads it will >>>> start when overloaded like this. >>> The kthreads to watch include rcu_preempt, rcu_sched, ksoftirqd*, rcuc*, >>> and rcuo*. There is also the back-of-interrupt softirq context, which >>> requires some care to measure accurately. >>> >>> The possibility of SLAB_TYPESAFE_BY_RCU has been discussed. I take it >>> that the per-element locking overhead for exact iterations was a problem? >>> If so, what exactly are the consistency rules for iteration? Presumably >>> stronger than "if the element existed throughout, it is included in the >>> iteration; if it did not exist throughout, it is not included; otherwise >>> it might or might not be included" given that you get that for free. >>> >>> Either way, could you please tell me the exact iteration rules? >> The rules are the way we make them to be. >> iteration will be under lock. >> lookup needs to be correct. It can retry if necessary (like htab is doing). >> Randomly returning 'noexist' is, of course, not acceptable. 
> OK, so then it is important that updates to this data structure be > carried out in such a way as to avoid discombobulating lockless readers. > Do the updates have that property? Yes. The update procedure will copy the old pointer array to a new array first, then update the new array and replace the pointer to the old array with the pointer to the new array. > > The usual way to get that property is to leave the old search structure > around, replacing it with the new one, and RCU-freeing the old one. > In case it helps, Kung and Lehman describe how to do that for search trees: > > http://www.eecs.harvard.edu/~htk/publication/1980-tods-kung-lehman.pdf Thanks for the paper. Just skimming through it, it seems that it uses reference counting and garbage collection to solve the safe memory reclamation problem. That may be too heavy for qp-trie; we plan to use a seqcount-like way to check whether or not the branch and leaf nodes are reused during lookup, and to retry the lookup if that happens. We are still checking the feasibility of this solution, and it seems a little more complicated than expected. > > Thanx, Paul > .
On Sun, Oct 09, 2022 at 09:09:44AM +0800, Hou Tao wrote: > Hi Paul, > > On 10/9/2022 4:11 AM, Paul E. McKenney wrote: > > On Sat, Oct 08, 2022 at 09:40:04AM -0700, Alexei Starovoitov wrote: > >> On Sat, Oct 8, 2022 at 6:22 AM Paul E. McKenney <paulmck@kernel.org> wrote: > >>> On Fri, Oct 07, 2022 at 06:59:08PM -0700, Alexei Starovoitov wrote: > SNIP > >>>>> Understand. I was just trying to understand the exact performance overhead of > >>>>> call_rcu(). If the overhead of map operations are much greater than the overhead > >>>>> of call_rcu(), I think calling call_rcu() one millions a second will be not a > >>>>> problem and it also makes the implementation of qp-trie being much simpler. The > >>>>> OOM problem is indeed a problem, although it is also possible for the current > >>>>> implementation, so I will try to implement the lookup procedure which handles > >>>>> the reuse problem. > >>>> call_rcu is not just that particular function. > >>>> It's all the work rcu subsystem needs to do to observe gp > >>>> and execute that callback. Just see how many kthreads it will > >>>> start when overloaded like this. > >>> The kthreads to watch include rcu_preempt, rcu_sched, ksoftirqd*, rcuc*, > >>> and rcuo*. There is also the back-of-interrupt softirq context, which > >>> requires some care to measure accurately. > >>> > >>> The possibility of SLAB_TYPESAFE_BY_RCU has been discussed. I take it > >>> that the per-element locking overhead for exact iterations was a problem? > >>> If so, what exactly are the consistency rules for iteration? Presumably > >>> stronger than "if the element existed throughout, it is included in the > >>> iteration; if it did not exist throughout, it is not included; otherwise > >>> it might or might not be included" given that you get that for free. > >>> > >>> Either way, could you please tell me the exact iteration rules? > >> The rules are the way we make them to be. > >> iteration will be under lock. > >> lookup needs to be correct. 
It can retry if necessary (like htab is doing). > >> Randomly returning 'noexist' is, of course, not acceptable. > > OK, so then it is important that updates to this data structure be > > carried out in such a way as to avoid discombobulating lockless readers. > > Do the updates have that property? > > Yes. The update procedure will copy the old pointer array to a new array first, > then update the new array and replace the pointer of old array by the pointer of > new array. Very good. But then why is there a problem? Is the iteration using multiple RCU read-side critical sections or something? > > The usual way to get that property is to leave the old search structure > > around, replacing it with the new one, and RCU-freeing the old one. > > In case it helps, Kung and Lehman describe how to do that for search trees: > > > > http://www.eecs.harvard.edu/~htk/publication/1980-tods-kung-lehman.pdf > Thanks for the paper. Just skimming through it, it seems that it uses > reference-counting and garbage collection to solve the safe memory reclamation > problem. It may be too heavy for qp-trie and we plan to use seqcount-like way to > check whether or not the branch and the leaf node is reused during lookup, and > retry the lookup if it happened. Now just checking the feasibility of the > solution and it seems a little complicated than expected. The main thing in that paper is the handling of rotations in the search-tree update. But if you are not using a tree, that won't be all that relevant. Thanx, Paul
Hi, On 10/9/2022 5:05 PM, Paul E. McKenney wrote: > On Sun, Oct 09, 2022 at 09:09:44AM +0800, Hou Tao wrote: >> Hi Paul, >> >> On 10/9/2022 4:11 AM, Paul E. McKenney wrote: >>> On Sat, Oct 08, 2022 at 09:40:04AM -0700, Alexei Starovoitov wrote: >>>> On Sat, Oct 8, 2022 at 6:22 AM Paul E. McKenney <paulmck@kernel.org> wrote: >>>>> On Fri, Oct 07, 2022 at 06:59:08PM -0700, Alexei Starovoitov wrote: >> SNIP >>>>>>> Understand. I was just trying to understand the exact performance overhead of >>>>>>> call_rcu(). If the overhead of map operations are much greater than the overhead >>>>>>> of call_rcu(), I think calling call_rcu() one millions a second will be not a >>>>>>> problem and it also makes the implementation of qp-trie being much simpler. The >>>>>>> OOM problem is indeed a problem, although it is also possible for the current >>>>>>> implementation, so I will try to implement the lookup procedure which handles >>>>>>> the reuse problem. >>>>>> call_rcu is not just that particular function. >>>>>> It's all the work rcu subsystem needs to do to observe gp >>>>>> and execute that callback. Just see how many kthreads it will >>>>>> start when overloaded like this. >>>>> The kthreads to watch include rcu_preempt, rcu_sched, ksoftirqd*, rcuc*, >>>>> and rcuo*. There is also the back-of-interrupt softirq context, which >>>>> requires some care to measure accurately. >>>>> >>>>> The possibility of SLAB_TYPESAFE_BY_RCU has been discussed. I take it >>>>> that the per-element locking overhead for exact iterations was a problem? >>>>> If so, what exactly are the consistency rules for iteration? Presumably >>>>> stronger than "if the element existed throughout, it is included in the >>>>> iteration; if it did not exist throughout, it is not included; otherwise >>>>> it might or might not be included" given that you get that for free. >>>>> >>>>> Either way, could you please tell me the exact iteration rules? >>>> The rules are the way we make them to be. 
>>>> iteration will be under lock.
>>>> lookup needs to be correct. It can retry if necessary (like htab is doing).
>>>> Randomly returning 'noexist' is, of course, not acceptable.
>>> OK, so then it is important that updates to this data structure be
>>> carried out in such a way as to avoid discombobulating lockless readers.
>>> Do the updates have that property?
>> Yes. The update procedure will copy the old pointer array to a new array first,
>> then update the new array and replace the pointer of old array by the pointer of
>> new array.
> Very good. But then why is there a problem? Is the iteration using
> multiple RCU read-side critical sections or something?
The problem is that although the objects are RCU-freed, they can also be
reused immediately by the bpf memory allocator. The reuse exists for
performance and to reduce the possibility of OOM. Because an object can be
reused during an RCU-protected lookup, and the possibility of reuse is low,
the lookup procedure needs to check whether reuse has happened during the
lookup. And I was arguing with Alexei about whether or not it is reasonable
to provide an optional flag that removes the immediate reuse in the bpf
memory allocator.
>
>>> The usual way to get that property is to leave the old search structure
>>> around, replacing it with the new one, and RCU-freeing the old one.
>>> In case it helps, Kung and Lehman describe how to do that for search trees:
>>>
>>> http://www.eecs.harvard.edu/~htk/publication/1980-tods-kung-lehman.pdf
>> Thanks for the paper. Just skimming through it, it seems that it uses
>> reference-counting and garbage collection to solve the safe memory reclamation
>> problem. It may be too heavy for qp-trie and we plan to use seqcount-like way to
>> check whether or not the branch and the leaf node is reused during lookup, and
>> retry the lookup if it happened. Now just checking the feasibility of the
>> solution and it seems a little complicated than expected.
> The main thing in that paper is the handling of rotations in the > search-tree update. But if you are not using a tree, that won't be all > that relevant. I see. Thanks for the explanation. > > Thanx, Paul
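[Editorial note: the "seqcount-like way" mentioned above — detect that a node was reused while it was being read, and retry — can be modelled with a per-node version counter in userspace C11 atomics. Everything below is a hypothetical sketch with simplified memory ordering, not the scheme the patch actually implements.]

```c
#include <stdatomic.h>

/* Each node carries a version; the updater bumps it to an odd value
 * while reusing the node and bumps it again when done. A lookup samples
 * the version before and after examining the node, and retries on any
 * mismatch or an odd (reuse-in-progress) value. */
struct node_ver {
	_Atomic unsigned int seq;
};

static unsigned int node_read_begin(const struct node_ver *v)
{
	unsigned int s;

	/* Odd means reuse in progress; a real lookup would restart
	 * from the trie root instead of spinning here. */
	while ((s = atomic_load_explicit(&v->seq, memory_order_acquire)) & 1)
		;
	return s;
}

/* Returns non-zero if the node changed since node_read_begin(). */
static int node_read_retry(const struct node_ver *v, unsigned int s)
{
	atomic_thread_fence(memory_order_acquire);
	return atomic_load_explicit(&v->seq, memory_order_relaxed) != s;
}

static void node_reuse_begin(struct node_ver *v)
{
	atomic_fetch_add_explicit(&v->seq, 1, memory_order_release); /* now odd */
}

static void node_reuse_end(struct node_ver *v)
{
	atomic_fetch_add_explicit(&v->seq, 1, memory_order_release); /* even again */
}
```

A lookup that sees `node_read_retry()` fire cannot trust anything it read from the node and must restart, which is the retry Alexei refers to above.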
On Sun, Oct 09, 2022 at 06:45:22PM +0800, Hou Tao wrote: > Hi, > On 10/9/2022 5:05 PM, Paul E. McKenney wrote: > > On Sun, Oct 09, 2022 at 09:09:44AM +0800, Hou Tao wrote: > >> Hi Paul, > >> > >> On 10/9/2022 4:11 AM, Paul E. McKenney wrote: > >>> On Sat, Oct 08, 2022 at 09:40:04AM -0700, Alexei Starovoitov wrote: > >>>> On Sat, Oct 8, 2022 at 6:22 AM Paul E. McKenney <paulmck@kernel.org> wrote: > >>>>> On Fri, Oct 07, 2022 at 06:59:08PM -0700, Alexei Starovoitov wrote: > >> SNIP > >>>>>>> Understand. I was just trying to understand the exact performance overhead of > >>>>>>> call_rcu(). If the overhead of map operations are much greater than the overhead > >>>>>>> of call_rcu(), I think calling call_rcu() one millions a second will be not a > >>>>>>> problem and it also makes the implementation of qp-trie being much simpler. The > >>>>>>> OOM problem is indeed a problem, although it is also possible for the current > >>>>>>> implementation, so I will try to implement the lookup procedure which handles > >>>>>>> the reuse problem. > >>>>>> call_rcu is not just that particular function. > >>>>>> It's all the work rcu subsystem needs to do to observe gp > >>>>>> and execute that callback. Just see how many kthreads it will > >>>>>> start when overloaded like this. > >>>>> The kthreads to watch include rcu_preempt, rcu_sched, ksoftirqd*, rcuc*, > >>>>> and rcuo*. There is also the back-of-interrupt softirq context, which > >>>>> requires some care to measure accurately. > >>>>> > >>>>> The possibility of SLAB_TYPESAFE_BY_RCU has been discussed. I take it > >>>>> that the per-element locking overhead for exact iterations was a problem? > >>>>> If so, what exactly are the consistency rules for iteration? Presumably > >>>>> stronger than "if the element existed throughout, it is included in the > >>>>> iteration; if it did not exist throughout, it is not included; otherwise > >>>>> it might or might not be included" given that you get that for free. 
> >>>>> > >>>>> Either way, could you please tell me the exact iteration rules? > >>>> The rules are the way we make them to be. > >>>> iteration will be under lock. > >>>> lookup needs to be correct. It can retry if necessary (like htab is doing). > >>>> Randomly returning 'noexist' is, of course, not acceptable. > >>> OK, so then it is important that updates to this data structure be > >>> carried out in such a way as to avoid discombobulating lockless readers. > >>> Do the updates have that property? > >> Yes. The update procedure will copy the old pointer array to a new array first, > >> then update the new array and replace the pointer of old array by the pointer of > >> new array. > > Very good. But then why is there a problem? Is the iteration using > > multiple RCU read-side critical sections or something? > > The problem is that although the objects are RCU-freed, but these object also > can be reused immediately in bpf memory allocator. The reason for reuse is for > performance and is to reduce the possibility of OOM. Because the object can be > reused during RCU-protected lookup and the possibility of reuse is low, the > lookup procedure needs to check whether reuse is happening during lookup. And I > was arguing with Alexei about whether or no it is reasonable to provide an > optional flag to remove the immediate reuse in bpf memory allocator. Indeed, in that case there needs to be a check, for example as described in the comment preceding the definition of SLAB_TYPESAFE_BY_RCU. If the use of the element is read-only on the one hand or heuristic/statistical on the other, lighter weight approaches are possible. Would that help? Thanx, Paul > >>> The usual way to get that property is to leave the old search structure > >>> around, replacing it with the new one, and RCU-freeing the old one. 
> >>> In case it helps, Kung and Lehman describe how to do that for search trees: > >>> > >>> http://www.eecs.harvard.edu/~htk/publication/1980-tods-kung-lehman.pdf > >> Thanks for the paper. Just skimming through it, it seems that it uses > >> reference-counting and garbage collection to solve the safe memory reclamation > >> problem. It may be too heavy for qp-trie and we plan to use seqcount-like way to > >> check whether or not the branch and the leaf node is reused during lookup, and > >> retry the lookup if it happened. Now just checking the feasibility of the > >> solution and it seems a little complicated than expected. > > The main thing in that paper is the handling of rotations in the > > search-tree update. But if you are not using a tree, that won't be all > > that relevant. > I see. Thanks for the explanation. > > > > Thanx, Paul >
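[Editorial note: the check Paul refers to — described in the comment preceding SLAB_TYPESAFE_BY_RCU in the kernel — is: pin the object found under RCU, then re-validate its identity, because the slab may already have recycled it as another element of the same type. A minimal single-threaded userspace model, with illustrative names only:]

```c
#include <stdatomic.h>

/* An element whose storage may be recycled for a different key while a
 * reader still holds a pointer obtained under RCU protection. */
struct elem {
	_Atomic int refcnt;
	int key;
};

/* Pin @e against further reuse, then confirm it still carries the key
 * that was searched for. Returns 0 on success; -1 means the object was
 * recycled and the caller must retry the lookup from the top. */
static int elem_get_checked(struct elem *e, int key)
{
	atomic_fetch_add_explicit(&e->refcnt, 1, memory_order_acquire);
	if (e->key != key) {
		/* freed and reused for another key: unpin and retry */
		atomic_fetch_sub_explicit(&e->refcnt, 1, memory_order_release);
		return -1;
	}
	return 0;
}
```

The same idea applies whether the pin is a refcount or a per-element lock; the essential step is the re-validation after the pin.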
Hello all, I have just found out about this qp-trie work, and I'm pleased to hear that it is looking promising for you! I have a few very broad observations: The "q" in qp-trie doesn't have to stand for "quadbit". There's a tradeoff between branch factor, maximum key length, and size of branch node. The greater the branch factor, the fewer indirections needed to traverse the tree; but if you go too wide then prefetching is less effective and branch nodes get bigger. I found that 5 bits was the sweet spot (32 wide bitmap, 30ish bit key length) - indexing 5 bit mouthfuls out of the key is HORRID but it was measurably faster than 4 bits. 6 bits (64 bits of bitmap) grew nodes from 16 bytes to 24 bytes, and it ended up slower. Your interior nodes are much bigger than mine, so you might find the tradeoff is different. I encourage you to try it out. I saw there has been some discussion about locking and RCU. My current project is integrating a qp-trie into BIND, with the aim of replacing its old red-black tree for searching DNS records. It's based on a concurrent qp-trie that I prototyped in NSD (a smaller and simpler DNS server than BIND). My strategy is based on a custom allocator for interior nodes. This has two main effects: * Node references are now 32 bit indexes into the allocator's pool, instead of 64 bit pointers; nodes are 12 bytes instead of 16 bytes. * The allocator supports copy-on-write and safe memory reclamation with a fairly small overhead, 3 x 32 bit counters per memory chunk (each chunk is roughly page sized). I wrote some notes when the design was new, but things have changed since then. https://dotat.at/@/2021-06-23-page-based-gc-for-qp-trie-rcu.html For memory reclamation the interior nodes get moved / compacted. It's a kind of garbage collector, but easy-mode because the per-chunk counters accurately indicate when compaction is worthwhile. 
I've written some notes on my several failed GC experiments; the last / current attempt seems (by and large) good enough. https://dotat.at/@/2022-06-22-compact-qp.html For exterior / leaf nodes, I'm using atomic refcounts to know when they can be reclaimed. The caller is responsible for COWing its leaves when necessary. Updates to the tree are transactional in style, and do not block readers: a single writer gets the write mutex, makes whatever changes it needs (copying as required), then commits by flipping the tree's root. After a commit it can free unused chunks. (Compaction can be part of an update transaction or a transaction of its own.) I'm currently using a reader-writer lock for the tree root, but I designed it with liburcu in mind, while trying to keep things simple. This strategy is very heavily biased in favour of readers, which suits DNS servers. I don't know enough about BPF to have any idea what kind of update traffic you need to support. At the moment I am reworking and simplifying my transaction and reclamation code and it's all very broken. I guess this isn't the best possible time to compare notes on qp-trie variants, but I'm happy to hear from others who have code and ideas to share.
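[Editorial note: for readers new to the structure, the branch-node trick Tony describes — a bitmap of present children plus popcount indexing into a densely packed child array — can be sketched for the quadbit variant as follows; the 5-bit variant is the same idea with a 32-wide bitmap. Names and nibble ordering are illustrative, not taken from any of the implementations discussed in this thread.]

```c
#include <stdint.h>

/* Extract the n-th quadbit (4-bit nibble) of the key; even n takes the
 * high nibble of a byte, odd n the low nibble. */
static unsigned int key_nibble(const uint8_t *key, unsigned int n)
{
	uint8_t b = key[n / 2];

	return (n & 1) ? (b & 0xf) : (b >> 4);
}

/* A branch stores a bitmap of which of the 16 possible children exist,
 * and only the existing children, packed. The packed index of a child
 * is the popcount of the bitmap bits below its nibble. Returns -1 if
 * the child is absent. (__builtin_popcount is a GCC/Clang builtin.) */
static int child_index(uint32_t bitmap, unsigned int nibble)
{
	uint32_t bit = 1u << nibble;

	if (!(bitmap & bit))
		return -1;
	return __builtin_popcount(bitmap & (bit - 1));
}
```

This is why branch nodes stay small regardless of fan-out: a 16-child slot table costs only a 16-bit (or, for 5-bit mouthfuls, 32-bit) bitmap plus the children actually present.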
On Wed, Oct 19, 2022 at 10:01 AM Tony Finch <dot@dotat.at> wrote: > > Hello all, > > I have just found out about this qp-trie work, and I'm pleased to hear > that it is looking promising for you! > This is a very nice data structure, so thank you for doing a great job explaining it in your post! > I have a few very broad observations: > > The "q" in qp-trie doesn't have to stand for "quadbit". There's a tradeoff > between branch factor, maximum key length, and size of branch node. The > greater the branch factor, the fewer indirections needed to traverse the > tree; but if you go too wide then prefetching is less effective and branch > nodes get bigger. I found that 5 bits was the sweet spot (32 wide bitmap, > 30ish bit key length) - indexing 5 bit mouthfuls out of the key is HORRID > but it was measurably faster than 4 bits. 6 bits (64 bits of bitmap) grew > nodes from 16 bytes to 24 bytes, and it ended up slower. > > Your interior nodes are much bigger than mine, so you might find the > tradeoff is different. I encourage you to try it out. True, but I think for (at least initial) simplicity, sticking to half-bytes would simplify the code and let us figure out BPF and kernel-specific issues without having to worry about the correctness of the qp-trie core logic itself. > > I saw there has been some discussion about locking and RCU. My current > project is integrating a qp-trie into BIND, with the aim of replacing its > old red-black tree for searching DNS records. It's based on a concurrent > qp-trie that I prototyped in NSD (a smaller and simpler DNS server than > BIND). My strategy is based on a custom allocator for interior nodes. This > has two main effects: > > * Node references are now 32 bit indexes into the allocator's pool, > instead of 64 bit pointers; nodes are 12 bytes instead of 16 bytes. 
> > * The allocator supports copy-on-write and safe memory reclamation with > a fairly small overhead, 3 x 32 bit counters per memory chunk (each > chunk is roughly page sized). > > I wrote some notes when the design was new, but things have changed since > then. > > https://dotat.at/@/2021-06-23-page-based-gc-for-qp-trie-rcu.html > > For memory reclamation the interior nodes get moved / compacted. It's a > kind of garbage collector, but easy-mode because the per-chunk counters > accurately indicate when compaction is worthwhile. I've written some notes > on my several failed GC experiments; the last / current attempt seems (by > and large) good enough. > > https://dotat.at/@/2022-06-22-compact-qp.html > > For exterior / leaf nodes, I'm using atomic refcounts to know when they > can be reclaimed. The caller is responsible for COWing its leaves when > necessary. > > Updates to the tree are transactional in style, and do not block readers: > a single writer gets the write mutex, makes whatever changes it needs > (copying as required), then commits by flipping the tree's root. After a > commit it can free unused chunks. (Compaction can be part of an update > transaction or a transaction of its own.) > > I'm currently using a reader-writer lock for the tree root, but I designed > it with liburcu in mind, while trying to keep things simple. > > This strategy is very heavily biased in favour of readers, which suits DNS > servers. I don't know enough about BPF to have any idea what kind of > update traffic you need to support. These are some nice ideas, I did a quick read on your latest blog posts, missed those updates since last time I checked your blog. One limitation that we have in the BPF world is that BPF programs can be run in extremely restrictive contexts (e.g., NMI), in which things that user-space can assume will almost always succeed (like memory allocation), are not allowed. 
We do have BPF-specific memory allocator, but even it can fail to allocate memory, depending on allocation patterns. So we need to think if this COW approach is acceptable. I'd love for Hou Tao to think about this and chime in, though, as he spent a lot of time thinking about particulars. But very basically, ultimate memory and performance savings are perhaps less important in trying to fit qp-trie into BPF framework. We can iterate after with optimizations and improvements, but first we need to get the things correct and well-behaved. > > At the moment I am reworking and simplifying my transaction and > reclamation code and it's all very broken. I guess this isn't the best > possible time to compare notes on qp-trie variants, but I'm happy to hear > from others who have code and ideas to share. It would be great if you can lend your expertise in reviewing at least generic qp-trie parts, but also in helping to figure out the overall concurrency approach we can take in kernel/BPF land (depending on your familiarity with kernel specifics, of course). Thanks for offering the latest on qp-trie, exciting to see more production applications of qp-trie and that you are still actively working on this! > > -- > Tony Finch <dot@dotat.at> https://dotat.at/ > Mull of Kintyre to Ardnamurchan Point: East or southeast 4 to 6, > increasing 6 to gale 8 for a time. Smooth or slight in eastern > shelter, otherwise slight or moderate. Rain or showers. Good, > occasionally poor.
Hi,

On 10/28/2022 2:52 AM, Andrii Nakryiko wrote:
> On Wed, Oct 19, 2022 at 10:01 AM Tony Finch <dot@dotat.at> wrote:
>> Hello all,
>>
>> I have just found out about this qp-trie work, and I'm pleased to hear
>> that it is looking promising for you!
>>
> This is a very nice data structure, so thank you for doing a great job
> explaining it in your post!
Sorry for the late reply. Still digging into other problems. Also thanks
to Tony for his work on qp.git.
>
>> I have a few very broad observations:
>>
>> The "q" in qp-trie doesn't have to stand for "quadbit". There's a tradeoff
>> between branch factor, maximum key length, and size of branch node. The
>> greater the branch factor, the fewer indirections needed to traverse the
>> tree; but if you go too wide then prefetching is less effective and branch
>> nodes get bigger. I found that 5 bits was the sweet spot (32 wide bitmap,
>> 30ish bit key length) - indexing 5 bit mouthfuls out of the key is HORRID
>> but it was measurably faster than 4 bits. 6 bits (64 bits of bitmap) grew
>> nodes from 16 bytes to 24 bytes, and it ended up slower.
>>
>> Your interior nodes are much bigger than mine, so you might find the
>> tradeoff is different. I encourage you to try it out.
The parent field in qp_trie_branch is used to support non-recursive
iteration, and the rcu_head is used for RCU memory freeing.
> True, but I think for (at least initial) simplicity, sticking to
> half-bytes would simplify the code and let us figure out BPF and
> kernel-specific issues without having to worry about the correctness
> of the qp-trie core logic itself.
Agreed.
>
>> I saw there has been some discussion about locking and RCU. My current
>> project is integrating a qp-trie into BIND, with the aim of replacing its
>> old red-black tree for searching DNS records. It's based on a concurrent
>> qp-trie that I prototyped in NSD (a smaller and simpler DNS server than
>> BIND). My strategy is based on a custom allocator for interior nodes.
This >> has two main effects: >> >> * Node references are now 32 bit indexes into the allocator's pool, >> instead of 64 bit pointers; nodes are 12 bytes instead of 16 bytes. >> >> * The allocator supports copy-on-write and safe memory reclamation with >> a fairly small overhead, 3 x 32 bit counters per memory chunk (each >> chunk is roughly page sized). >> >> I wrote some notes when the design was new, but things have changed since >> then. >> >> https://dotat.at/@/2021-06-23-page-based-gc-for-qp-trie-rcu.html >> >> For memory reclamation the interior nodes get moved / compacted. It's a >> kind of garbage collector, but easy-mode because the per-chunk counters >> accurately indicate when compaction is worthwhile. I've written some notes >> on my several failed GC experiments; the last / current attempt seems (by >> and large) good enough. >> >> https://dotat.at/@/2022-06-22-compact-qp.html >> >> For exterior / leaf nodes, I'm using atomic refcounts to know when they >> can be reclaimed. The caller is responsible for COWing its leaves when >> necessary. >> >> Updates to the tree are transactional in style, and do not block readers: >> a single writer gets the write mutex, makes whatever changes it needs >> (copying as required), then commits by flipping the tree's root. After a >> commit it can free unused chunks. (Compaction can be part of an update >> transaction or a transaction of its own.) >> >> I'm currently using a reader-writer lock for the tree root, but I designed >> it with liburcu in mind, while trying to keep things simple. >> >> This strategy is very heavily biased in favour of readers, which suits DNS >> servers. I don't know enough about BPF to have any idea what kind of >> update traffic you need to support. > These are some nice ideas, I did a quick read on your latest blog > posts, missed those updates since last time I checked your blog. 
>
> One limitation that we have in the BPF world is that BPF programs can
> be run in extremely restrictive contexts (e.g., NMI), in which things
> that user-space can assume will almost always succeed (like memory
> allocation), are not allowed. We do have BPF-specific memory
> allocator, but even it can fail to allocate memory, depending on
> allocation patterns. So we need to think if this COW approach is
> acceptable. I'd love for Hou Tao to think about this and chime in,
> though, as he spent a lot of time thinking about particulars.
The current implementation of BPF_MAP_TYPE_QP_TRIE is already COWed. When
adding or deleting a leaf node, its parent interior node will be copied to
a new interior node, the pointer to the old parent node (in the
grand-parent interior node) will be replaced by the new parent node, and
the old parent node will be RCU-freed. According to the above description,
COW in that qp-trie means all nodes on the path from the root node to the
leaf node are COWed, so I think the current COW implementation is better
for the bpf map usage scenario. But I will check the qp-trie code in BIND
[0] later.

0: https://gitlab.isc.org/isc-projects/bind9/-/commit/ecc555e6ec763c4f8f2495864ec08749202fff1a#65b4d67ce64e9195e41ac43d78af5156f9ebb779_0_553

> But very basically, ultimate memory and performance savings are
> perhaps less important in trying to fit qp-trie into BPF framework. We
> can iterate after with optimizations and improvements, but first we
> need to get the things correct and well-behaved.
Understand.
>
>> At the moment I am reworking and simplifying my transaction and
>> reclamation code and it's all very broken. I guess this isn't the best
>> possible time to compare notes on qp-trie variants, but I'm happy to hear
>> from others who have code and ideas to share.
> It would be great if you can lend your expertise in reviewing at least > generic qp-trie parts, but also in helping to figure out the overall > concurrency approach we can take in kernel/BPF land (depending on your > familiarity with kernel specifics, of course). > > Thanks for offering the latest on qp-trie, exciting to see more > production applications of qp-trie and that you are still actively > working on this! Yes, it would be great if Tony could help to review or co-design bpf qp-trie map. >> -- >> Tony Finch <dot@dotat.at> https://dotat.at/ >> Mull of Kintyre to Ardnamurchan Point: East or southeast 4 to 6, >> increasing 6 to gale 8 for a time. Smooth or slight in eastern >> shelter, otherwise slight or moderate. Rain or showers. Good, >> occasionally poor.
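[Editorial note: the copy-on-write insertion Hou Tao describes earlier in this message — clone the parent branch, splice in the new leaf, then swing the grandparent's pointer and RCU-free the old branch — might look roughly like the following in userspace C. The types and names are hypothetical, not the patch's, and the RCU publish/free steps are left as comments.]

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* A branch with a bitmap of present children and a packed child array. */
struct branch {
	uint32_t bitmap;
	unsigned int nr;     /* number of children present */
	void *children[];    /* packed: popcount(bitmap below bit) = index */
};

/* Return a clone of @old with @leaf inserted at nibble @nib, which must
 * not already be present. Returns NULL on allocation failure. */
static struct branch *branch_cow_add(const struct branch *old,
				     unsigned int nib, void *leaf)
{
	uint32_t bit = 1u << nib;
	unsigned int pos = __builtin_popcount(old->bitmap & (bit - 1));
	struct branch *new;

	new = malloc(sizeof(*new) + (old->nr + 1) * sizeof(void *));
	if (!new)
		return NULL;
	new->bitmap = old->bitmap | bit;
	new->nr = old->nr + 1;
	memcpy(new->children, old->children, pos * sizeof(void *));
	new->children[pos] = leaf;
	memcpy(&new->children[pos + 1], &old->children[pos],
	       (old->nr - pos) * sizeof(void *));
	/* caller would then: rcu_assign_pointer(grandparent slot, new)
	 * and RCU-free @old, so lockless readers see either version */
	return new;
}
```

Concurrent readers never observe a half-built branch: they either still traverse the old node or the fully populated clone.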
From: Hou Tao <houtao1@huawei.com>

Hi,

The initial motivation for the qp-trie map is to reduce memory usage for
string keys, especially those with large differences in length, as
discussed in [0]. And as a big-endian, lexicographically ordered map, it
can also be used for any binary data with fixed or variable length.

Now the basic functionality of qp-trie is ready, so I am posting it to get
more feedback or suggestions about qp-trie, especially feedback on the
following questions:

(1) Use cases for qp-trie
Andrii had proposed to re-implement lpm-trie by using qp-trie. The
advantage would be a speed-up of lookup operations due to the lower tree
depth of qp-trie, and the performance of update may also increase. But are
there any other use cases for qp-trie? Especially cases which need both
ordering and memory efficiency, or cases in which qp-trie will have high
fan-out and its lookup performance will be much better than hash-table, as
shown below:

Randomly-generated binary data (key size=255, max entries=16K, key length range:[1, 255])
htab lookup (1 thread) 4.968 ± 0.009M/s (drops 0.002 ± 0.000M/s mem 8.169 MiB)
htab lookup (2 thread) 10.118 ± 0.010M/s (drops 0.007 ± 0.000M/s mem 8.169 MiB)
htab lookup (4 thread) 20.084 ± 0.022M/s (drops 0.007 ± 0.000M/s mem 8.168 MiB)
htab lookup (8 thread) 39.866 ± 0.047M/s (drops 0.010 ± 0.000M/s mem 8.168 MiB)
htab lookup (16 thread) 79.412 ± 0.065M/s (drops 0.049 ± 0.000M/s mem 8.169 MiB)

qp-trie lookup (1 thread) 10.291 ± 0.007M/s (drops 0.004 ± 0.000M/s mem 4.899 MiB)
qp-trie lookup (2 thread) 20.797 ± 0.009M/s (drops 0.006 ± 0.000M/s mem 4.879 MiB)
qp-trie lookup (4 thread) 41.943 ± 0.019M/s (drops 0.015 ± 0.000M/s mem 4.262 MiB)
qp-trie lookup (8 thread) 81.985 ± 0.032M/s (drops 0.025 ± 0.000M/s mem 4.215 MiB)
qp-trie lookup (16 thread) 164.681 ± 0.051M/s (drops 0.050 ± 0.000M/s mem 4.261 MiB)

* non-zero drops are due to duplicated keys among the generated keys.
(2) Improve update/delete performance for qp-trie
The top-5 overheads in update/delete operations are now:

21.23% bench [kernel.vmlinux] [k] qp_trie_update_elem
13.98% bench [kernel.vmlinux] [k] qp_trie_delete_elem
 7.96% bench [kernel.vmlinux] [k] native_queued_spin_lock_slowpath
 5.16% bench [kernel.vmlinux] [k] memcpy_erms
 5.00% bench [kernel.vmlinux] [k] __kmalloc_node

The top-2 overheads are due to memory access and atomic ops on
max_entries. I had tried memory prefetch, but it didn't work out; maybe I
did it wrong. For the subtree spinlock overhead, I had also tried
hierarchical locking using a hand-over-hand lock scheme, but it didn't
scale well [1]. I will try to increase the number of subtrees from 256 to
1024, 4096 or more and check whether it makes any difference. For the
atomic ops and kmalloc overhead, I think I can reuse the idea from the
patchset "bpf: BPF specific memory allocator". I have given bpf_mem_alloc
a simple try and encountered some problems. One problem is the immediate
reuse of freed objects in the bpf memory allocator. Because qp-trie uses
the bpf memory allocator to allocate and free qp_trie_branch, if a
qp_trie_branch is reused immediately, the lookup procedure may oops due to
incorrect content in the qp_trie_branch. Another problem is that the size
limitation in bpf_mem_alloc() is 4096. It may be a little small for the
total of key size and value size, but maybe I can use two separate
bpf_mem_allocs for key and value.
Strings in BTF string sections (entries=115980, max size=71, threads=4)

On hash-table:
Iter 0 ( 21.872us): hits 11.496M/s ( 2.874M/prod), drops 0.000M/s, total operations 11.496M/s
Iter 1 ( 4.816us): hits 12.701M/s ( 3.175M/prod), drops 0.000M/s, total operations 12.701M/s
Iter 2 (-11.006us): hits 12.443M/s ( 3.111M/prod), drops 0.000M/s, total operations 12.443M/s
Iter 3 ( 2.628us): hits 11.223M/s ( 2.806M/prod), drops 0.000M/s, total operations 11.223M/s
Slab: 22.225 MiB

On qp-trie:
Iter 0 ( 24.388us): hits 4.062M/s ( 1.016M/prod), drops 0.000M/s, total operations 4.062M/s
Iter 1 ( 20.612us): hits 3.884M/s ( 0.971M/prod), drops 0.000M/s, total operations 3.884M/s
Iter 2 (-21.533us): hits 4.046M/s ( 1.012M/prod), drops 0.000M/s, total operations 4.046M/s
Iter 3 ( 2.090us): hits 3.971M/s ( 0.993M/prod), drops 0.000M/s, total operations 3.971M/s
Slab: 10.849 MiB

(3) Improve memory efficiency further
When using the strings in the BTF string section as a data set for
qp-trie, the slab memory usage shown in the cgroup memory.stats file is
about 11MB for qp-trie and 22MB for hash-table, as shown above. However,
the theoretical memory usage for qp-trie is ~6.8MB (~4.9MB if the "parent"
& "rcu" fields are removed from qp_trie_branch), and the extra memory
usage (about 38% of total usage) mainly comes from internal fragmentation
in slab (namely 2^n alignment for allocation) and overhead in kmem-cgroup
accounting. We could reduce the internal fragmentation by creating
separate kmem_caches for qp_trie_branch with different numbers of child
nodes, but I am not sure whether it is worth it. And in order to avoid
allocating an rcu_head for each leaf node, currently only the branch node
is RCU-freed, so when replacing a leaf node, a new branch node and a new
leaf node will be allocated instead of replacing the old leaf node and
RCU-freeing it. I am not sure whether there are ways to remove the
rcu_head completely from the qp_trie_branch node. Headless kfree_rcu()
doesn't need an rcu_head, but it might sleep.
It seems the BPF memory allocator is a better choice for now, because it
replaces the rcu_head with a llist_node.

Sorted strings in BTF string sections (entries=115980)
htab lookup (1 thread) 6.915 ± 0.029M/s (drops 0.000 ± 0.000M/s mem 22.216 MiB)
qp-trie lookup (1 thread) 6.791 ± 0.005M/s (drops 0.000 ± 0.000M/s mem 11.273 MiB)

All files under linux kernel source directory (entries=74359)
htab lookup (1 thread) 7.978 ± 0.009M/s (drops 0.000 ± 0.000M/s mem 14.272 MiB)
qp-trie lookup (1 thread) 5.521 ± 0.003M/s (drops 0.000 ± 0.000M/s mem 9.367 MiB)

Domain names for Alexa top million web sites (entries=1000000)
htab lookup (1 thread) 3.148 ± 0.039M/s (drops 0.000 ± 0.000M/s mem 190.831 MiB)
qp-trie lookup (1 thread) 2.374 ± 0.026M/s (drops 0.000 ± 0.000M/s mem 83.733 MiB)

Comments and suggestions are always welcome.

[0]: https://lore.kernel.org/bpf/CAEf4Bzb7keBS8vXgV5JZzwgNGgMV0X3_guQ_m9JW3X6fJBDpPQ@mail.gmail.com/
[1]: https://lore.kernel.org/bpf/db34696a-cbfe-16e8-6dd5-8174b97dcf1d@huawei.com/

Change Log:
v2:
 * Always use copy_to_user() in bpf_copy_to_dynptr_ukey() (from kernel
   test robot <lkp@intel.com>)
   A no-MMU ARM32 host doesn't support 8-byte get_user().
 * Remove BPF_F_DYNPTR_KEY and use the more extensible dynptr_key_off
 * Remove the unnecessary rcu_barrier in qp_trie_free()
 * Add a new helper bpf_dynptr_user_trim() for bpftool in libbpf
 * Support qp-trie map in bpftool
 * Add 2 more test_progs test cases for qp-trie map
 * Fix test_progs-no_alu32 failure for test_progs test case
 * Add tests for not-supported operations in test_maps

v1: https://lore.kernel.org/bpf/20220917153125.2001645-1-houtao@huaweicloud.com/
 * Use bpf_dynptr as map key type instead of a bpf_lpm_trie_key-styled key
   (Suggested by Andrii)
 * Fix build error and RCU-related sparse errors reported by the lkp robot
 * Copy the passed key first in qp_trie_update_elem(), because the content
   of the passed key may change and may break the assumption in the
   two-round lookup process during update.
 * Add the missing rcu_barrier in qp_trie_free()

RFC: https://lore.kernel.org/bpf/20220726130005.3102470-1-houtao1@huawei.com/

Hou Tao (13):
  bpf: Export bpf_dynptr_set_size()
  bpf: Add helper btf_find_dynptr()
  bpf: Support bpf_dynptr-typed map key in bpf syscall
  bpf: Support bpf_dynptr-typed map key in verifier
  libbpf: Add helpers for bpf_dynptr_user
  bpf: Add support for qp-trie map with dynptr key
  libbpf: Add probe support for BPF_MAP_TYPE_QP_TRIE
  bpftool: Add support for qp-trie map
  selftests/bpf: Add two new dynptr_fail cases for map key
  selftests/bpf: Move ENOTSUPP into bpf_util.h
  selftests/bpf: Add prog tests for qp-trie map
  selftests/bpf: Add benchmark for qp-trie map
  selftests/bpf: Add map tests for qp-trie by using bpf syscall

 include/linux/bpf.h | 11 +-
 include/linux/bpf_types.h | 1 +
 include/linux/btf.h | 1 +
 include/uapi/linux/bpf.h | 7 +
 kernel/bpf/Makefile | 1 +
 kernel/bpf/bpf_qp_trie.c | 1057 ++++++++++++++
 kernel/bpf/btf.c | 13 +
 kernel/bpf/helpers.c | 9 +-
 kernel/bpf/map_in_map.c | 3 +
 kernel/bpf/syscall.c | 121 +-
 kernel/bpf/verifier.c | 17 +-
 .../bpf/bpftool/Documentation/bpftool-map.rst | 4 +-
 tools/bpf/bpftool/btf_dumper.c | 33 +
 tools/bpf/bpftool/map.c | 149 +-
 tools/include/uapi/linux/bpf.h | 7 +
 tools/lib/bpf/bpf.h | 29 +
 tools/lib/bpf/libbpf.c | 1 +
 tools/lib/bpf/libbpf_probes.c | 25 +
 tools/testing/selftests/bpf/Makefile | 5 +-
 tools/testing/selftests/bpf/bench.c | 10 +
 .../selftests/bpf/benchs/bench_qp_trie.c | 511 +++++++
 .../selftests/bpf/benchs/run_bench_qp_trie.sh | 55 +
 tools/testing/selftests/bpf/bpf_util.h | 4 +
 .../selftests/bpf/map_tests/qp_trie_map.c | 1209 +++++++++++++++++
 .../selftests/bpf/prog_tests/bpf_tcp_ca.c | 4 -
 .../testing/selftests/bpf/prog_tests/dynptr.c | 2 +
 .../selftests/bpf/prog_tests/lsm_cgroup.c | 4 -
 .../selftests/bpf/prog_tests/qp_trie_test.c | 214 +++
 .../testing/selftests/bpf/progs/dynptr_fail.c | 43 +
 .../selftests/bpf/progs/qp_trie_bench.c | 236 ++++
 .../selftests/bpf/progs/qp_trie_test.c | 200 +++
 tools/testing/selftests/bpf/test_maps.c | 4 -
 32 files changed, 3931 insertions(+), 59 deletions(-)
 create mode 100644 kernel/bpf/bpf_qp_trie.c
 create mode 100644 tools/testing/selftests/bpf/benchs/bench_qp_trie.c
 create mode 100755 tools/testing/selftests/bpf/benchs/run_bench_qp_trie.sh
 create mode 100644 tools/testing/selftests/bpf/map_tests/qp_trie_map.c
 create mode 100644 tools/testing/selftests/bpf/prog_tests/qp_trie_test.c
 create mode 100644 tools/testing/selftests/bpf/progs/qp_trie_bench.c
 create mode 100644 tools/testing/selftests/bpf/progs/qp_trie_test.c