
[bpf-next,v2,00/13] Add support for qp-trie with dynptr key

Message ID 20220924133620.4147153-1-houtao@huaweicloud.com (mailing list archive)

Message

Hou Tao Sept. 24, 2022, 1:36 p.m. UTC
From: Hou Tao <houtao1@huawei.com>

Hi,

The initial motivation for the qp-trie map is to reduce memory usage for
string keys, especially those with large differences in length, as
discussed in [0]. And as a big-endian, lexicographically ordered map, it
can also be used for any binary data with fixed or variable length.

Now the basic functionality of qp-trie is ready, so I am posting it to get
more feedback or suggestions about qp-trie, especially feedback
on the following questions:

(1) Use cases for qp-trie
Andrii had proposed to re-implement lpm-trie by using qp-trie. The
advantage would be faster lookup operations due to the lower tree
depth of qp-trie, and update performance may also improve.
But are there any other use cases for qp-trie? Especially cases
which need both ordering and memory efficiency, or cases in which qp-trie
will have high fan-out and its lookup performance will be much better than
a hash table, as shown below:

  Randomly-generated binary data (key size=255, max entries=16K, key length range:[1, 255])
  htab lookup      (1  thread)    4.968 ± 0.009M/s (drops 0.002 ± 0.000M/s mem 8.169 MiB)
  htab lookup      (2  thread)   10.118 ± 0.010M/s (drops 0.007 ± 0.000M/s mem 8.169 MiB)
  htab lookup      (4  thread)   20.084 ± 0.022M/s (drops 0.007 ± 0.000M/s mem 8.168 MiB)
  htab lookup      (8  thread)   39.866 ± 0.047M/s (drops 0.010 ± 0.000M/s mem 8.168 MiB)
  htab lookup      (16 thread)   79.412 ± 0.065M/s (drops 0.049 ± 0.000M/s mem 8.169 MiB)
  
  qp-trie lookup   (1  thread)   10.291 ± 0.007M/s (drops 0.004 ± 0.000M/s mem 4.899 MiB)
  qp-trie lookup   (2  thread)   20.797 ± 0.009M/s (drops 0.006 ± 0.000M/s mem 4.879 MiB)
  qp-trie lookup   (4  thread)   41.943 ± 0.019M/s (drops 0.015 ± 0.000M/s mem 4.262 MiB)
  qp-trie lookup   (8  thread)   81.985 ± 0.032M/s (drops 0.025 ± 0.000M/s mem 4.215 MiB)
  qp-trie lookup   (16 thread)  164.681 ± 0.051M/s (drops 0.050 ± 0.000M/s mem 4.261 MiB)

  * The non-zero drop rate is due to duplicate keys in the generated data.

(2) Improve update/delete performance for qp-trie
The current top-5 overheads in update/delete operations are:

    21.23%  bench    [kernel.vmlinux]    [k] qp_trie_update_elem
    13.98%  bench    [kernel.vmlinux]    [k] qp_trie_delete_elem
     7.96%  bench    [kernel.vmlinux]    [k] native_queued_spin_lock_slowpath
     5.16%  bench    [kernel.vmlinux]    [k] memcpy_erms
     5.00%  bench    [kernel.vmlinux]    [k] __kmalloc_node

The top-2 overheads are due to memory accesses and atomic ops on
max_entries. I tried memory prefetching but it didn't work out, maybe
I did it wrong. For the subtree spinlock overhead, I also tried
hierarchical locking with a hand-over-hand lock scheme, but it didn't
scale well [1]. I will try to increase the number of subtrees from 256
to 1024, 4096 or more and check whether it makes any difference.

For the atomic ops and kmalloc overhead, I think I can reuse the idea from
the patchset "bpf: BPF specific memory allocator". I have given bpf_mem_alloc
a simple try and encountered some problems. One problem is the
immediate reuse of freed objects in the BPF memory allocator. Because qp-trie
uses the BPF memory allocator to allocate and free qp_trie_branch, if a
qp_trie_branch is reused immediately, the lookup procedure may oops due
to the incorrect content in qp_trie_branch. Another problem is that the
size limit of bpf_mem_alloc() is 4096 bytes. It may be a little small for
the total of key size and value size, but maybe I can use two
separate bpf_mem_alloc instances for key and value, as sketched below.
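Below is a minimal sketch of that idea, assuming the bpf_mem_alloc API from the
"bpf: BPF specific memory allocator" patchset (bpf_mem_alloc_init(),
bpf_mem_alloc(), bpf_mem_free(), bpf_mem_alloc_destroy()); the struct and
helper names are made up for illustration and are not part of this series:

  #include <linux/errno.h>
  #include <linux/bpf_mem_alloc.h>

  /* Illustrative names, not from this series */
  struct qp_trie_allocs {
          struct bpf_mem_alloc key_ma;
          struct bpf_mem_alloc val_ma;
  };

  static int qp_trie_allocs_init(struct qp_trie_allocs *a)
  {
          int err;

          /* size == 0 selects the "any size up to 4096 bytes" mode, if I
           * read the bpf_mem_alloc patches correctly
           */
          err = bpf_mem_alloc_init(&a->key_ma, 0, false);
          if (err)
                  return err;
          err = bpf_mem_alloc_init(&a->val_ma, 0, false);
          if (err)
                  bpf_mem_alloc_destroy(&a->key_ma);
          return err;
  }

  /* allocate key and value separately so each allocation stays <= 4096 */
  static int qp_trie_alloc_kv(struct qp_trie_allocs *a, size_t key_size,
                              size_t val_size, void **keyp, void **valp)
  {
          *keyp = bpf_mem_alloc(&a->key_ma, key_size);
          if (!*keyp)
                  return -ENOMEM;
          *valp = bpf_mem_alloc(&a->val_ma, val_size);
          if (!*valp) {
                  bpf_mem_free(&a->key_ma, *keyp);
                  return -ENOMEM;
          }
          return 0;
  }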

Strings in BTF string sections (entries=115980, max size=71, threads=4)

On hash-table
Iter   0 ( 21.872us): hits   11.496M/s (  2.874M/prod), drops    0.000M/s, total operations   11.496M/s
Iter   1 (  4.816us): hits   12.701M/s (  3.175M/prod), drops    0.000M/s, total operations   12.701M/s
Iter   2 (-11.006us): hits   12.443M/s (  3.111M/prod), drops    0.000M/s, total operations   12.443M/s
Iter   3 (  2.628us): hits   11.223M/s (  2.806M/prod), drops    0.000M/s, total operations   11.223M/s
Slab: 22.225 MiB

On qp-trie:
Iter   0 ( 24.388us): hits    4.062M/s (  1.016M/prod), drops    0.000M/s, total operations    4.062M/s
Iter   1 ( 20.612us): hits    3.884M/s (  0.971M/prod), drops    0.000M/s, total operations    3.884M/s
Iter   2 (-21.533us): hits    4.046M/s (  1.012M/prod), drops    0.000M/s, total operations    4.046M/s
Iter   3 (  2.090us): hits    3.971M/s (  0.993M/prod), drops    0.000M/s, total operations    3.971M/s
Slab: 10.849 MiB

(3) Improve memory efficiency further
When using the strings in the BTF string section as a data set for qp-trie,
the slab memory usage shown in the cgroup memory.stat file is about 11MB for
qp-trie and 22MB for the hash table, as shown above. However the theoretical
memory usage for qp-trie is ~6.8MB (~4.9MB if the "parent" & "rcu" fields are
removed from qp_trie_branch), and the extra memory usage (about 38% of total
usage) mainly comes from internal fragmentation in slab (namely the 2^n
alignment for allocations) and the overhead of kmem-cgroup accounting. We
could reduce the internal fragmentation by creating separate kmem_caches for
qp_trie_branch nodes with different numbers of children (see the sketch
below), but I am not sure whether it is worth it.
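For reference, here is a rough sketch of the branch-node layout implied by the
discussion in this thread and of the per-fanout kmem_cache idea; the field
names, the fanout limit and the SLAB flags are assumptions rather than the
actual patch code:

  #include <linux/kernel.h>
  #include <linux/slab.h>
  #include <linux/rcupdate.h>

  /* Approximate layout only; the real struct lives in kernel/bpf/bpf_qp_trie.c */
  struct qp_trie_branch {
          struct rcu_head rcu;                  /* only branch nodes are RCU-freed */
          struct qp_trie_branch __rcu *parent;  /* reverse walk for get_next_key */
          unsigned int index;                   /* nibble position that splits children */
          unsigned int bitmap;                  /* which nibble values have a child */
          void __rcu *nodes[];                  /* children: branch nodes or leaves */
  };

  /* Assumed fanout limit: 16 nibble values plus one "key ends here" slot */
  #define QP_TRIE_MAX_CHILDREN 17

  static struct kmem_cache *qp_trie_branch_caches[QP_TRIE_MAX_CHILDREN + 1];

  static int qp_trie_branch_caches_init(void)
  {
          int i;

          for (i = 2; i <= QP_TRIE_MAX_CHILDREN; i++) {
                  /* one cache per child count avoids rounding every branch
                   * node up to the next power-of-two kmalloc size
                   */
                  size_t size = sizeof(struct qp_trie_branch) + i * sizeof(void *);
                  char *name = kasprintf(GFP_KERNEL, "qp_trie_branch_%d", i);

                  if (!name)
                          return -ENOMEM;
                  /* name is intentionally not freed here; the cache may keep
                   * referring to it for its whole lifetime
                   */
                  qp_trie_branch_caches[i] =
                          kmem_cache_create(name, size, 0, SLAB_ACCOUNT, NULL);
                  if (!qp_trie_branch_caches[i])
                          return -ENOMEM;
          }
          return 0;
  }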

And in order to avoid allocating an rcu_head for each leaf node, only the
branch node is RCU-freed now, so when replacing a leaf node, a new branch
node and a new leaf node are allocated instead of replacing the old leaf
node and RCU-freeing it.

Not sure whether there are ways to remove rcu_head completely from the
qp_trie_branch node. Headless kfree_rcu() doesn't need an rcu_head but it
might sleep. It seems the BPF memory allocator is a better choice for
now because it replaces rcu_head with llist_node.

  Sorted strings in BTF string sections (entries=115980)
  htab lookup      (1  thread)    6.915 ± 0.029M/s (drops 0.000 ± 0.000M/s mem 22.216 MiB)
  qp-trie lookup   (1  thread)    6.791 ± 0.005M/s (drops 0.000 ± 0.000M/s mem 11.273 MiB)
  
  All files under linux kernel source directory (entries=74359)
  htab lookup      (1  thread)    7.978 ± 0.009M/s (drops 0.000 ± 0.000M/s mem 14.272 MiB)
  qp-trie lookup   (1  thread)    5.521 ± 0.003M/s (drops 0.000 ± 0.000M/s mem 9.367 MiB)
  
  Domain names for Alexa top million web site (entries=1000000)
  htab lookup      (1  thread)    3.148 ± 0.039M/s (drops 0.000 ± 0.000M/s mem 190.831 MiB)
  qp-trie lookup   (1  thread)    2.374 ± 0.026M/s (drops 0.000 ± 0.000M/s mem 83.733 MiB)

Comments and suggestions are always welcome.

[0]: https://lore.kernel.org/bpf/CAEf4Bzb7keBS8vXgV5JZzwgNGgMV0X3_guQ_m9JW3X6fJBDpPQ@mail.gmail.com/
[1]: https://lore.kernel.org/bpf/db34696a-cbfe-16e8-6dd5-8174b97dcf1d@huawei.com/

Change Log:
v2:
 * Always use copy_to_user() in bpf_copy_to_dynptr_ukey() (reported by kernel test robot <lkp@intel.com>)
   A no-MMU ARM32 host doesn't support 8-byte get_user().
 * Remove BPF_F_DYNPTR_KEY and use the more extensible dynptr_key_off
 * Remove the unnecessary rcu_barrier in qp_trie_free()
 * Add a new helper bpf_dynptr_user_trim() for bpftool in libbpf
 * Support qp-trie map in bpftool
 * Add 2 more test_progs test cases for qp-trie map
 * Fix test_progs-no_alu32 failure for test_progs test case
 * Add tests for not-supported operations in test_maps

v1: https://lore.kernel.org/bpf/20220917153125.2001645-1-houtao@huaweicloud.com/
 * Use bpf_dynptr as map key type instead of bpf_lpm_trie_key-styled key (Suggested by Andrii)
 * Fix build error and RCU related sparse errors reported by lkp robot
 * Copy the passed key first in qp_trie_update_elem(), because the content of the passed key may
   change and break the assumption of the two-round lookup process during update.
 * Add the missing rcu_barrier in qp_trie_free()

RFC: https://lore.kernel.org/bpf/20220726130005.3102470-1-houtao1@huawei.com/

Hou Tao (13):
  bpf: Export bpf_dynptr_set_size()
  bpf: Add helper btf_find_dynptr()
  bpf: Support bpf_dynptr-typed map key in bpf syscall
  bpf: Support bpf_dynptr-typed map key in verifier
  libbpf: Add helpers for bpf_dynptr_user
  bpf: Add support for qp-trie map with dynptr key
  libbpf: Add probe support for BPF_MAP_TYPE_QP_TRIE
  bpftool: Add support for qp-trie map
  selftests/bpf: Add two new dynptr_fail cases for map key
  selftests/bpf: Move ENOTSUPP into bpf_util.h
  selftests/bpf: Add prog tests for qp-trie map
  selftests/bpf: Add benchmark for qp-trie map
  selftests/bpf: Add map tests for qp-trie by using bpf syscall

 include/linux/bpf.h                           |   11 +-
 include/linux/bpf_types.h                     |    1 +
 include/linux/btf.h                           |    1 +
 include/uapi/linux/bpf.h                      |    7 +
 kernel/bpf/Makefile                           |    1 +
 kernel/bpf/bpf_qp_trie.c                      | 1057 ++++++++++++++
 kernel/bpf/btf.c                              |   13 +
 kernel/bpf/helpers.c                          |    9 +-
 kernel/bpf/map_in_map.c                       |    3 +
 kernel/bpf/syscall.c                          |  121 +-
 kernel/bpf/verifier.c                         |   17 +-
 .../bpf/bpftool/Documentation/bpftool-map.rst |    4 +-
 tools/bpf/bpftool/btf_dumper.c                |   33 +
 tools/bpf/bpftool/map.c                       |  149 +-
 tools/include/uapi/linux/bpf.h                |    7 +
 tools/lib/bpf/bpf.h                           |   29 +
 tools/lib/bpf/libbpf.c                        |    1 +
 tools/lib/bpf/libbpf_probes.c                 |   25 +
 tools/testing/selftests/bpf/Makefile          |    5 +-
 tools/testing/selftests/bpf/bench.c           |   10 +
 .../selftests/bpf/benchs/bench_qp_trie.c      |  511 +++++++
 .../selftests/bpf/benchs/run_bench_qp_trie.sh |   55 +
 tools/testing/selftests/bpf/bpf_util.h        |    4 +
 .../selftests/bpf/map_tests/qp_trie_map.c     | 1209 +++++++++++++++++
 .../selftests/bpf/prog_tests/bpf_tcp_ca.c     |    4 -
 .../testing/selftests/bpf/prog_tests/dynptr.c |    2 +
 .../selftests/bpf/prog_tests/lsm_cgroup.c     |    4 -
 .../selftests/bpf/prog_tests/qp_trie_test.c   |  214 +++
 .../testing/selftests/bpf/progs/dynptr_fail.c |   43 +
 .../selftests/bpf/progs/qp_trie_bench.c       |  236 ++++
 .../selftests/bpf/progs/qp_trie_test.c        |  200 +++
 tools/testing/selftests/bpf/test_maps.c       |    4 -
 32 files changed, 3931 insertions(+), 59 deletions(-)
 create mode 100644 kernel/bpf/bpf_qp_trie.c
 create mode 100644 tools/testing/selftests/bpf/benchs/bench_qp_trie.c
 create mode 100755 tools/testing/selftests/bpf/benchs/run_bench_qp_trie.sh
 create mode 100644 tools/testing/selftests/bpf/map_tests/qp_trie_map.c
 create mode 100644 tools/testing/selftests/bpf/prog_tests/qp_trie_test.c
 create mode 100644 tools/testing/selftests/bpf/progs/qp_trie_bench.c
 create mode 100644 tools/testing/selftests/bpf/progs/qp_trie_test.c

Comments

Alexei Starovoitov Sept. 26, 2022, 1:25 a.m. UTC | #1
On Sat, Sep 24, 2022 at 09:36:07PM +0800, Hou Tao wrote:
> From: Hou Tao <houtao1@huawei.com>
> 
> Hi,
> 
> The initial motivation for qp-trie map is to reduce memory usage for
> string keys special those with large differencies in length as
> discussed in [0]. And as a big-endian lexicographical ordered map, it
> can also be used for any binary data with fixed or variable length.
> 
> Now the basic functionality of qp-trie is ready, so posting it to get
> more feedback or suggestions about qp-trie. Specially feedback
> about the following questions:
> 
> (1) Use cases for qp-trie
> Andrii had proposed to re-implement lpm-trie by using qp-trie. The
> advantage would be the speed up of lookup operations due to lower tree
> depth of qp-trie and the performance of update may also increase.
> But is there any other use cases for qp-trie ? Specially those cases
> which need both ordering and memory efficiency or cases in which qp-trie
> will have high fan-out and its lookup performance will be much better than
> hash-table as shown below:
> 
>   Randomly-generated binary data (key size=255, max entries=16K, key length range:[1, 255])
>   htab lookup      (1  thread)    4.968 ± 0.009M/s (drops 0.002 ± 0.000M/s mem 8.169 MiB)
>   htab lookup      (2  thread)   10.118 ± 0.010M/s (drops 0.007 ± 0.000M/s mem 8.169 MiB)
>   htab lookup      (4  thread)   20.084 ± 0.022M/s (drops 0.007 ± 0.000M/s mem 8.168 MiB)
>   htab lookup      (8  thread)   39.866 ± 0.047M/s (drops 0.010 ± 0.000M/s mem 8.168 MiB)
>   htab lookup      (16 thread)   79.412 ± 0.065M/s (drops 0.049 ± 0.000M/s mem 8.169 MiB)
>   
>   qp-trie lookup   (1  thread)   10.291 ± 0.007M/s (drops 0.004 ± 0.000M/s mem 4.899 MiB)
>   qp-trie lookup   (2  thread)   20.797 ± 0.009M/s (drops 0.006 ± 0.000M/s mem 4.879 MiB)
>   qp-trie lookup   (4  thread)   41.943 ± 0.019M/s (drops 0.015 ± 0.000M/s mem 4.262 MiB)
>   qp-trie lookup   (8  thread)   81.985 ± 0.032M/s (drops 0.025 ± 0.000M/s mem 4.215 MiB)
>   qp-trie lookup   (16 thread)  164.681 ± 0.051M/s (drops 0.050 ± 0.000M/s mem 4.261 MiB)
> 
>   * non-zero drops is due to duplicated keys in generated keys.
> 
> (2) Improve update/delete performance for qp-trie
> Now top-5 overheads in update/delete operations are:
> 
>     21.23%  bench    [kernel.vmlinux]    [k] qp_trie_update_elem
>     13.98%  bench    [kernel.vmlinux]    [k] qp_trie_delete_elem
>      7.96%  bench    [kernel.vmlinux]    [k] native_queued_spin_lock_slowpath
>      5.16%  bench    [kernel.vmlinux]    [k] memcpy_erms
>      5.00%  bench    [kernel.vmlinux]    [k] __kmalloc_node
> 
> The top-2 overheads are due to memory access and atomic ops on
> max_entries. I had tried memory prefetch but it didn't work out, maybe
> I did it wrong. For subtree spinlock overhead, I also had tried the
> hierarchical lock by using hand-over-hand lock scheme, but it didn't
> scale well [1]. I will try to increase the number of subtrees from 256
> to 1024, 4096 or bigger and check whether it makes any difference.
> 
> For atomic ops and kmalloc overhead, I think I can reuse the idea from
> patchset "bpf: BPF specific memory allocator". I have given bpf_mem_alloc
> a simple try and encounter some problems. One problem is that
> immediate reuse of freed object in bpf memory allocator. Because qp-trie
> uses bpf memory allocator to allocate and free qp_trie_branch, if
> qp_trie_branch is reused immediately, the lookup procedure may oops due
> to the incorrect content in qp_trie_branch. And another problem is the
> size limitation in bpf_mem_alloc() is 4096. It may be a little small for
> the total size of key size and value size, but maybe I can use two
> separated bpf_mem_alloc for key and value.

4096 limit for key+value size would be an acceptable trade-off.
With kptrs the user will be able to extend value to much bigger sizes
while doing <= 4096 allocation at a time. Larger allocations are failing
in production more often than not. Any algorithm relying on a successful
>= 4096 allocation is likely to fail. kvmalloc is a fallback that
the kernel is using, but we're not there yet in bpf land.
The benefits of bpf_mem_alloc in qp-trie would be huge though.
qp-trie would work in all contexts including sleepable progs.
As presented the use cases for qp-trie are quite limited.
If I understand correctly the concern for not using bpf_mem_alloc
is that qp_trie_branch can be reused. Can you provide an exact scenario
that will cause issues?
Instead of call_rcu in qp_trie_branch_free (which will work only for
regular progs and have high overhead as demonstrated by mem_alloc patches)
the qp-trie freeing logic can scrub that element, so it's ready to be
reused as another struct qp_trie_branch.
I guess I'm missing how rcu protects these internal data structures of qp-trie.
The rcu_read_lock of regular bpf prog helps to stay lock-less during lookup?
Is that it?
So to make qp-trie work in sleepable progs the algo would need to
be changed to do both call_rcu and call_rcu_task_trace everywhere
to protect these inner structs?
call_rcu_task_trace can take a long time. So qp_trie_branch-s may linger
around. So quick update/delete (in sleepable with call_rcu_task_trace)
may very well exhaust memory. With bpf_mem_alloc we don't have this issue
since rcu_task_trace gp is observed only when freeing into global mem pool.
Say qp-trie just uses bpf_mem_alloc for qp_trie_branch.
What is the worst that can happen? qp_trie_lookup_elem will go down the wrong
path, but won't crash, right? Can we do the hlist_nulls trick to address that?
In other words bpf_mem_alloc reuse behavior is pretty much SLAB_TYPESAFE_BY_RCU.
Many kernel data structures know how to deal with such object reuse.
We can have a private bpf_mem_alloc here for qp_trie_branch-s only and
construct a logic in a way that obj reuse is not problematic.
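For reference, the "hlist_nulls trick" in the existing hash table looks roughly
like the sketch below (paraphrased from lookup_nulls_elem_raw() in
kernel/bpf/hashtab.c, so treat the details as approximate): the lookup tolerates
elements being freed and reused underneath it, and restarts when the nulls
marker at the end of the chain shows it drifted onto the wrong bucket.

  static struct htab_elem *lookup_nulls_elem_raw(struct hlist_nulls_head *head,
                                                 u32 hash, void *key,
                                                 u32 key_size, u32 n_buckets)
  {
          struct hlist_nulls_node *n;
          struct htab_elem *l;

  again:
          hlist_nulls_for_each_entry_rcu(l, n, head, hash_node)
                  if (l->hash == hash && !memcmp(&l->key, key, key_size))
                          return l;

          /* An element may have been freed and reused in another bucket while
           * we walked this chain (SLAB_TYPESAFE_BY_RCU-style reuse). The nulls
           * value at the end of the list encodes the bucket index, so a
           * mismatch means the walk drifted and must be restarted.
           */
          if (get_nulls_value(n) != (hash & (n_buckets - 1)))
                  goto again;

          return NULL;
  }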

Another alternative would be to add explicit rcu_read_lock in qp_trie_lookup_elem
to protect qp_trie_branch during lookup while using bpf_mem_alloc
for both qp_trie_branch and leaf nodes, but that's not a great solution either.
It will allow qp-trie to be usable in sleepable progs, but the use of call_rcu
in update/delete will prevent qp-trie from being usable in tracing progs.

Let's try to brainstorm how to make qp_trie_branch work like SLAB_TYPESAFE_BY_RCU.

Other than this issue the patches look great. This new map would be an awesome addition.
Hou Tao Sept. 26, 2022, 1:18 p.m. UTC | #2
Hi,

On 9/26/2022 9:25 AM, Alexei Starovoitov wrote:
> On Sat, Sep 24, 2022 at 09:36:07PM +0800, Hou Tao wrote:
>> From: Hou Tao <houtao1@huawei.com>
SNIP
>> For atomic ops and kmalloc overhead, I think I can reuse the idea from
>> patchset "bpf: BPF specific memory allocator". I have given bpf_mem_alloc
>> a simple try and encounter some problems. One problem is that
>> immediate reuse of freed object in bpf memory allocator. Because qp-trie
>> uses bpf memory allocator to allocate and free qp_trie_branch, if
>> qp_trie_branch is reused immediately, the lookup procedure may oops due
>> to the incorrect content in qp_trie_branch. And another problem is the
>> size limitation in bpf_mem_alloc() is 4096. It may be a little small for
>> the total size of key size and value size, but maybe I can use two
>> separated bpf_mem_alloc for key and value.
> 4096 limit for key+value size would be an acceptable trade-off.
> With kptrs the user will be able to extend value to much bigger sizes
> while doing <= 4096 allocation at a time. Larger allocations are failing
> in production more often than not. Any algorithm relying on successful
>  >= 4096 allocation is likely to fail. kvmalloc is a fallback that
> the kernel is using, but we're not there yet in bpf land.
> The benefits of bpf_mem_alloc in qp-trie would be huge though.
> qp-trie would work in all contexts including sleepable progs.
> As presented the use cases for qp-trie are quite limited.
> If I understand correctly the concern for not using bpf_mem_alloc
> is that qp_trie_branch can be reused. Can you provide an exact scenario
> that will casue issuses?
The usage of a branch node during lookup is as follows:
(1) check the index field of the branch node, which records the position of the
nibble in which the keys of the child nodes differ
(2) calculate the index of the child node by using the nibble value of the
lookup key at that position
(3) get the pointer to the child node by dereferencing the variable-length
pointer array in the branch node

Because both branch nodes and leaf nodes have variable length, I used one
bpf_mem_alloc for these two node types, so if a leaf node is reused as a branch
node, the pointer obtained in step 3 may be invalid.

Even when using separate bpf_mem_allocs for branch nodes and leaf nodes, it may
still be problematic because the updates to a reused branch node are not atomic
and branch nodes with different numbers of children will reuse the same object
due to size alignment in the allocator, so the lookup procedure below may get an
uninitialized pointer from the pointer array:

lookup procedure                          update procedure

// three child nodes, 48-bytes
branch node x
                                          // four child nodes, 56-bytes
                                          reuse branch node x
                                          x->bitmap = 0xf
// got an uninitialized pointer
x->nodes[3]
                                          Initialize x->nodes[0~3]

The problem may be solved by zeroing the unused part or the whole of the
allocated object. Adding a paired smp_wmb() and smp_rmb() to ensure the update
of the node array happens before the update of the bitmap would also work, and
the cost would be much cheaper on an x86 host (see the sketch below).
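A minimal sketch of that barrier pairing, reusing the field and helper names
from the lookup snippet quoted later in this thread (br->bitmap, br->nodes[],
calc_twig_index()); the publish/lookup helpers themselves are made up for
illustration:

  /* Update side: make the child pointer visible before the bitmap bit that
   * makes it reachable.
   */
  static void qp_trie_branch_publish(struct qp_trie_branch *br, unsigned int iip,
                                     void *child, unsigned int new_bitmap)
  {
          rcu_assign_pointer(br->nodes[iip], child);
          smp_wmb();                      /* order nodes[] before bitmap */
          WRITE_ONCE(br->bitmap, new_bitmap);
  }

  /* Lookup side: a reader that observes the new bitmap bit is guaranteed to
   * observe the initialized child pointer as well.
   */
  static void *qp_trie_branch_child(struct qp_trie_branch *br, unsigned int bitmap)
  {
          unsigned int br_bitmap = READ_ONCE(br->bitmap);
          unsigned int iip;

          if (!(bitmap & br_bitmap))
                  return NULL;
          smp_rmb();                      /* pairs with the smp_wmb() above */
          iip = calc_twig_index(br_bitmap, bitmap);
          return rcu_dereference(br->nodes[iip]);
  }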

Besides the lookup procedure, get_next_key() from the syscall path also looks up
the trie locklessly. If a branch node is reused, the order of the returned keys
may be broken. There is also a parent pointer in the branch node which is used
for reverse lookup during get_next_key(), so the reuse may lead to unexpected
skips in the iteration.
> Instead of call_rcu in qp_trie_branch_free (which will work only for
> regular progs and have high overhead as demonstrated by mem_alloc patches)
> the qp-trie freeing logic can scrub that element, so it's ready to be
> reused as another struct qp_trie_branch.
> I guess I'm missing how rcu protects this internal data structures of qp-trie.
> The rcu_read_lock of regular bpf prog helps to stay lock-less during lookup?
> Is that it?
Yes. The update is made atomic by copying the parent branch node to a new branch
node and replacing the pointer to the parent branch node with the new branch
node, so the lookup procedure finds either the old branch node or the new branch
node, as sketched below.
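A rough sketch of that copy-on-write step as described above (the helper names
and the leaf-freeing detail are my assumptions, not the patch code):

  /* 'slot' is the __rcu pointer (in the parent branch or the root) that
   * currently points to 'old'; 'new' is a fully initialized copy of 'old'
   * with the child added, removed or replaced.
   */
  static void qp_trie_cow_replace(void __rcu **slot, struct qp_trie_branch *old,
                                  struct qp_trie_branch *new)
  {
          /* Publish the complete new branch in one pointer store, so a
           * lock-less lookup sees either the whole old branch or the whole
           * new one, never a half-updated branch.
           */
          rcu_assign_pointer(*slot, new);

          /* Lookups may still be traversing 'old'. Only branch nodes carry an
           * rcu_head; a displaced leaf would presumably be freed after the
           * same grace period (e.g. from this callback), instead of being
           * RCU-freed on its own.
           */
          call_rcu(&old->rcu, qp_trie_branch_free_rcu);
  }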
> So to make qp-trie work in sleepable progs the algo would need to
> be changed to do both call_rcu and call_rcu_task_trace everywhere
> to protect these inner structs?
> call_rcu_task_trace can take long time. So qp_trie_branch-s may linger
> around. So quick update/delete (in sleepable with call_rcu_task_trace)
> may very well exhaust memory. With bpf_mem_alloc we don't have this issue
> since rcu_task_trace gp is observed only when freeing into global mem pool.
> Say qp-trie just uses bpf_mem_alloc for qp_trie_branch.
> What is the worst that can happen? qp_trie_lookup_elem will go into wrong
> path, but won't crash, right? Can we do hlist_nulls trick to address that?
> In other words bpf_mem_alloc reuse behavior is pretty much SLAB_TYPESAFE_BY_RCU.
> Many kernel data structures know how to deal with such object reuse.
> We can have a private bpf_mem_alloc here for qp_trie_branch-s only and
> construct a logic in a way that obj reuse is not problematic.
As said above, qp_trie_lookup_elem may be OK with SLAB_TYPESAFE_BY_RCU. But I
don't know how to do it for get_next_key because the iteration result needs to
be ordered and cannot skip elements which existed before the iteration began.
If we removed immediate reuse from bpf_mem_alloc, besides the possibly decreased
performance, is there any reason we cannot do that?
>
> Another alternative would be to add explicit rcu_read_lock in qp_trie_lookup_elem
> to protect qp_trie_branch during lookup while using bpf_mem_alloc
> for both qp_trie_branch and leaf nodes, but that's not a great solution either.
> It will allow qp-trie to be usable in sleepable, but use of call_rcu
> in update/delete will prevent qp-trie to be usable in tracing progs.
>
> Let's try to brainstorm how to make qp_trie_branch work like SLAB_TYPESAFE_BY_RCU.
>
> Other than this issue the patches look great. This new map would be awesome addition.
Thanks for that.
Alexei Starovoitov Sept. 27, 2022, 1:19 a.m. UTC | #3
On Mon, Sep 26, 2022 at 09:18:46PM +0800, Hou Tao wrote:
> Hi,
> 
> On 9/26/2022 9:25 AM, Alexei Starovoitov wrote:
> > On Sat, Sep 24, 2022 at 09:36:07PM +0800, Hou Tao wrote:
> >> From: Hou Tao <houtao1@huawei.com>
> SNIP
> >> For atomic ops and kmalloc overhead, I think I can reuse the idea from
> >> patchset "bpf: BPF specific memory allocator". I have given bpf_mem_alloc
> >> a simple try and encounter some problems. One problem is that
> >> immediate reuse of freed object in bpf memory allocator. Because qp-trie
> >> uses bpf memory allocator to allocate and free qp_trie_branch, if
> >> qp_trie_branch is reused immediately, the lookup procedure may oops due
> >> to the incorrect content in qp_trie_branch. And another problem is the
> >> size limitation in bpf_mem_alloc() is 4096. It may be a little small for
> >> the total size of key size and value size, but maybe I can use two
> >> separated bpf_mem_alloc for key and value.
> > 4096 limit for key+value size would be an acceptable trade-off.
> > With kptrs the user will be able to extend value to much bigger sizes
> > while doing <= 4096 allocation at a time. Larger allocations are failing
> > in production more often than not. Any algorithm relying on successful
> >  >= 4096 allocation is likely to fail. kvmalloc is a fallback that
> > the kernel is using, but we're not there yet in bpf land.
> > The benefits of bpf_mem_alloc in qp-trie would be huge though.
> > qp-trie would work in all contexts including sleepable progs.
> > As presented the use cases for qp-trie are quite limited.
> > If I understand correctly the concern for not using bpf_mem_alloc
> > is that qp_trie_branch can be reused. Can you provide an exact scenario
> > that will casue issuses?
> The usage of branch node during lookup is as follows:
> (1) check the index field of branch node which records the position of nibble in
> which the keys of child nodes are different
> (2) calculate the index of child node by using the nibble value of lookup key in
> index position
> (3) get the pointer of child node by dereferencing the variable-length pointer
> array in branch node
> 
> Because both branch node and leaf node have variable length, I used one
> bpf_mem_alloc for these two node types, so if a leaf node is reused as a branch
> node, the pointer got in step 3 may be invalid.
> 
> If using separated bpf_mem_alloc for branch node and leaf node, it may still be
> problematic because the updates to a reused branch node are not atomic and
> branch nodes with different child node will reuse the same object due to size
> alignment in allocator, so the lookup procedure below may get an uninitialized
> pointer in the pointer array:
> 
> lookup procedure                                update procedure
> 
> 
> // three child nodes, 48-bytes
> branch node x
>                                                               //  four child
> nodes, 56-bytes
>                                                               reuse branch node x
>                                                               x->bitmap = 0xf
> // got an uninitialized pointer
> x->nodes[3]
>                                                               Initialize
> x->nodes[0~3]

Looking at lookup:
+	while (is_branch_node(node)) {
+		struct qp_trie_branch *br = node;
+		unsigned int bitmap;
+		unsigned int iip;
+
+		/* When byte index equals with key len, the target key
+		 * may be in twigs->nodes[0].
+		 */
+		if (index_to_byte_index(br->index) > data_len)
+			goto done;
+
+		bitmap = calc_br_bitmap(br->index, data, data_len);
+		if (!(bitmap & br->bitmap))
+			goto done;
+
+		iip = calc_twig_index(br->bitmap, bitmap);
+		node = rcu_dereference_check(br->nodes[iip], rcu_read_lock_bh_held());
+	}

To be safe, br->index needs to be initialized after br->nodes and br->bitmap.
When deleting, br->index can be set to a special value which would mean
restart the lookup from the beginning.
As you're suggesting, with smp_rmb/wmb pairs the lookup will only see a valid br.
Also the race is extremely tight, right?
After br->nodes[iip] + is_branch_node, that memory needs to be deleted on another
cpu after spin_lock and reused in an update after another spin_lock.
Without an artificially big delay it's hard to imagine how the nodes[iip] pointer
would be initialized to some other qp_trie_branch or leaf during delete,
then the memory reused and nodes[iip] initialized again with the same address.
Theoretically possible, but unlikely, right?
And with correct ordering of scrubbing and updates to
br->nodes, br->bitmap, br->index it can be made safe.

We can add a sequence number to qp_trie_branch as well and read it before and after.
Every reuse would inc the seq.
If the seq number differs, re-read the node pointer from the parent.
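A minimal sketch of that sequence-number check (the seq field and the helpers
are assumptions for illustration; is_branch_node() and calc_twig_index() are
from the lookup code quoted above):

  /* Child lookup that tolerates branch reuse: sample the branch's sequence
   * number before and after reading its payload, and retry from the parent
   * slot if the branch was recycled in between.
   */
  static void *qp_trie_child_get(void __rcu **parent_slot, unsigned int bitmap)
  {
          struct qp_trie_branch *br;
          unsigned int seq, iip;
          void *child;

  retry:
          br = rcu_dereference(*parent_slot);
          if (!br || !is_branch_node(br))
                  return br;

          seq = READ_ONCE(br->seq);       /* assumed new field, bumped on reuse */
          smp_rmb();                      /* seq read before payload reads */

          child = NULL;
          if (bitmap & READ_ONCE(br->bitmap)) {
                  iip = calc_twig_index(br->bitmap, bitmap);
                  child = rcu_dereference(br->nodes[iip]);
          }

          smp_rmb();                      /* payload reads before seq re-read */
          if (READ_ONCE(br->seq) != seq)  /* branch was freed and reused */
                  goto retry;
          return child;
  }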

> The problem may can be solved by zeroing the unused or whole part of allocated
> object. Maybe adding a paired smp_wmb() and smp_rmb() to ensure the update of
> node array happens before the update of bitmap is also OK and the cost will be
> much cheaper in x86 host.

Something like this, right.
We can also consider doing lookup under spin_lock. For a large branchy trie
the cost of spin_lock may be negligible.

> Beside lookup procedure, get_next_key() from syscall also lookups trie
> locklessly. If the branch node is reused, the order of returned keys may be
> broken. There is also a parent pointer in branch node and it is used for reverse
> lookup during get_next_key, the reuse may lead to unexpected skip in iteration.

qp_trie_lookup_next_node can be done under spin_lock.
Iterating all map elements is a slow operation anyway.

> > Instead of call_rcu in qp_trie_branch_free (which will work only for
> > regular progs and have high overhead as demonstrated by mem_alloc patches)
> > the qp-trie freeing logic can scrub that element, so it's ready to be
> > reused as another struct qp_trie_branch.
> > I guess I'm missing how rcu protects this internal data structures of qp-trie.
> > The rcu_read_lock of regular bpf prog helps to stay lock-less during lookup?
> > Is that it?
> Yes. The update is made atomic by copying the parent branch node to a new branch
> node and replacing the pointer to the parent branch node by the new branch node,
> so the lookup procedure either find the old branch node or the new branch node.
> > So to make qp-trie work in sleepable progs the algo would need to
> > be changed to do both call_rcu and call_rcu_task_trace everywhere
> > to protect these inner structs?
> > call_rcu_task_trace can take long time. So qp_trie_branch-s may linger
> > around. So quick update/delete (in sleepable with call_rcu_task_trace)
> > may very well exhaust memory. With bpf_mem_alloc we don't have this issue
> > since rcu_task_trace gp is observed only when freeing into global mem pool.
> > Say qp-trie just uses bpf_mem_alloc for qp_trie_branch.
> > What is the worst that can happen? qp_trie_lookup_elem will go into wrong
> > path, but won't crash, right? Can we do hlist_nulls trick to address that?
> > In other words bpf_mem_alloc reuse behavior is pretty much SLAB_TYPESAFE_BY_RCU.
> > Many kernel data structures know how to deal with such object reuse.
> > We can have a private bpf_mem_alloc here for qp_trie_branch-s only and
> > construct a logic in a way that obj reuse is not problematic.
> As said above, qp_trie_lookup_elem may be OK with SLAB_TYPESAFE_BY_RCU. But I
> don't know how to do it for get_next_key because the iteration result needs to
> be ordered and can not skip existed elements before the iterations begins.

imo it's fine to spin_lock in get_next_key.
We should measure the lock overhead in lookup. It might be acceptable too.

> If removing immediate reuse from bpf_mem_alloc, beside the may-decreased
> performance, is there any reason we can not do that ?

What do you mean?
Always do call_rcu + call_rcu_tasks_trace for every bpf_mem_free ?
As I said above:
" call_rcu_task_trace can take long time. So qp_trie_branch-s may linger
  around. So quick update/delete (in sleepable with call_rcu_task_trace)
  may very well exhaust memory.
"
As an exercise try samples/bpf/map_perf_test on a non-prealloc hashmap
before the mem_alloc conversion. Just regular call_rcu consumes 100% of all cpus.
With call_rcu_tasks_trace it's worse. It cannot sustain such a flood.
Hou Tao Sept. 27, 2022, 3:08 a.m. UTC | #4
Hi,

On 9/27/2022 9:19 AM, Alexei Starovoitov wrote:
> On Mon, Sep 26, 2022 at 09:18:46PM +0800, Hou Tao wrote:
>> Hi,
>>
>> On 9/26/2022 9:25 AM, Alexei Starovoitov wrote:
>>> On Sat, Sep 24, 2022 at 09:36:07PM +0800, Hou Tao wrote:
>>>> From: Hou Tao <houtao1@huawei.com>
>> SNIP
>>>> For atomic ops and kmalloc overhead, I think I can reuse the idea from
>>>> patchset "bpf: BPF specific memory allocator". I have given bpf_mem_alloc
>>>> a simple try and encounter some problems. One problem is that
>>>> immediate reuse of freed object in bpf memory allocator. Because qp-trie
>>>> uses bpf memory allocator to allocate and free qp_trie_branch, if
>>>> qp_trie_branch is reused immediately, the lookup procedure may oops due
>>>> to the incorrect content in qp_trie_branch. And another problem is the
>>>> size limitation in bpf_mem_alloc() is 4096. It may be a little small for
>>>> the total size of key size and value size, but maybe I can use two
>>>> separated bpf_mem_alloc for key and value.
>>> 4096 limit for key+value size would be an acceptable trade-off.
>>> With kptrs the user will be able to extend value to much bigger sizes
>>> while doing <= 4096 allocation at a time. Larger allocations are failing
>>> in production more often than not. Any algorithm relying on successful
>>>  >= 4096 allocation is likely to fail. kvmalloc is a fallback that
>>> the kernel is using, but we're not there yet in bpf land.
>>> The benefits of bpf_mem_alloc in qp-trie would be huge though.
>>> qp-trie would work in all contexts including sleepable progs.
>>> As presented the use cases for qp-trie are quite limited.
>>> If I understand correctly the concern for not using bpf_mem_alloc
>>> is that qp_trie_branch can be reused. Can you provide an exact scenario
>>> that will casue issuses?
>> The usage of branch node during lookup is as follows:
>> (1) check the index field of branch node which records the position of nibble in
>> which the keys of child nodes are different
>> (2) calculate the index of child node by using the nibble value of lookup key in
>> index position
>> (3) get the pointer of child node by dereferencing the variable-length pointer
>> array in branch node
>>
>> Because both branch node and leaf node have variable length, I used one
>> bpf_mem_alloc for these two node types, so if a leaf node is reused as a branch
>> node, the pointer got in step 3 may be invalid.
>>
>> If using separated bpf_mem_alloc for branch node and leaf node, it may still be
>> problematic because the updates to a reused branch node are not atomic and
>> branch nodes with different child node will reuse the same object due to size
>> alignment in allocator, so the lookup procedure below may get an uninitialized
>> pointer in the pointer array:
>>
>> lookup procedure                                update procedure
>>
>>
>> // three child nodes, 48-bytes
>> branch node x
>>                                                               //  four child
>> nodes, 56-bytes
>>                                                               reuse branch node x
>>                                                               x->bitmap = 0xf
>> // got an uninitialized pointer
>> x->nodes[3]
>>                                                               Initialize
>> x->nodes[0~3]
> Looking at lookup:
> +	while (is_branch_node(node)) {
> +		struct qp_trie_branch *br = node;
> +		unsigned int bitmap;
> +		unsigned int iip;
> +
> +		/* When byte index equals with key len, the target key
> +		 * may be in twigs->nodes[0].
> +		 */
> +		if (index_to_byte_index(br->index) > data_len)
> +			goto done;
> +
> +		bitmap = calc_br_bitmap(br->index, data, data_len);
> +		if (!(bitmap & br->bitmap))
> +			goto done;
> +
> +		iip = calc_twig_index(br->bitmap, bitmap);
> +		node = rcu_dereference_check(br->nodes[iip], rcu_read_lock_bh_held());
> +	}
>
> To be safe the br->index needs to be initialized after br->nodex and br->bitmap.
> While deleting the br->index can be set to special value which would mean
> restart the lookup from the beginning.
> As you're suggesting with smp_rmb/wmb pairs the lookup will only see valid br.
> Also the race is extremely tight, right?
> After brb->nodes[iip] + is_branch_node that memory needs to deleted on other cpu
> after spin_lock and reused in update after another spin_lock.
> Without artifical big delay it's hard to imagine how nodes[iip] pointer
> would be initialized to some other qp_trie_branch or leaf during delete,
> then memory reused and nodes[iip] is initialized again with the same address.
> Theoretically possible, but unlikely, right?
> And with correct ordering of scrubbing and updates to
> br->nodes, br->bitmap, br->index it can be made safe.
The reuse of a node not only introduces a safety problem (e.g. accessing an
invalid pointer), but also incurs a false-negative problem (e.g. not finding an
existing element), as shown below:

lookup A in X on CPU1            update X on CPU 2

     [ branch X v1 ]                   
 leaf A | leaf B | leaf C
                                                 [ branch X v2 ]
                                               leaf A | leaf B | leaf C | leaf D
                                                 
                                                  // free and reuse branch X v1
                                                  [ branch X v1 ]
                                                leaf O | leaf P | leaf Q
// leaf A can not be found
> We can add a sequence number to qp_trie_branch as well and read it before and after.
> Every reuse would inc the seq.
> If seq number differs, re-read the node pointer form parent.
A seq number on qp_trie_branch is a good idea. Will try it. But we also need to
consider the starvation of lookups by updates/deletions. Maybe we need a
fallback to the subtree spinlock after some number of rereads.
>> The problem may can be solved by zeroing the unused or whole part of allocated
>> object. Maybe adding a paired smp_wmb() and smp_rmb() to ensure the update of
>> node array happens before the update of bitmap is also OK and the cost will be
>> much cheaper in x86 host.
> Something like this, right.
> We can also consider doing lookup under spin_lock. For a large branchy trie
> the cost of spin_lock maybe negligible.
Do you mean adding an extra spinlock to qp_trie_branch to protect against reuse,
or taking the subtree spinlock during lookup? IMO the latter will make the
lookup performance suffer, but I will check it as well.
>
>> Beside lookup procedure, get_next_key() from syscall also lookups trie
>> locklessly. If the branch node is reused, the order of returned keys may be
>> broken. There is also a parent pointer in branch node and it is used for reverse
>> lookup during get_next_key, the reuse may lead to unexpected skip in iteration.
> qp_trie_lookup_next_node can be done under spin_lock.
> Iterating all map elements is a slow operation anyway.
OK. Taking the subtree spinlock is simpler but the scalability will be bad. Not
sure whether the solution for lockless lookup will work for get_next_key(). Will
check.
>
>>> Instead of call_rcu in qp_trie_branch_free (which will work only for
>>> regular progs and have high overhead as demonstrated by mem_alloc patches)
>>> the qp-trie freeing logic can scrub that element, so it's ready to be
>>> reused as another struct qp_trie_branch.
>>> I guess I'm missing how rcu protects this internal data structures of qp-trie.
>>> The rcu_read_lock of regular bpf prog helps to stay lock-less during lookup?
>>> Is that it?
>> Yes. The update is made atomic by copying the parent branch node to a new branch
>> node and replacing the pointer to the parent branch node by the new branch node,
>> so the lookup procedure either find the old branch node or the new branch node.
>>> So to make qp-trie work in sleepable progs the algo would need to
>>> be changed to do both call_rcu and call_rcu_task_trace everywhere
>>> to protect these inner structs?
>>> call_rcu_task_trace can take long time. So qp_trie_branch-s may linger
>>> around. So quick update/delete (in sleepable with call_rcu_task_trace)
>>> may very well exhaust memory. With bpf_mem_alloc we don't have this issue
>>> since rcu_task_trace gp is observed only when freeing into global mem pool.
>>> Say qp-trie just uses bpf_mem_alloc for qp_trie_branch.
>>> What is the worst that can happen? qp_trie_lookup_elem will go into wrong
>>> path, but won't crash, right? Can we do hlist_nulls trick to address that?
>>> In other words bpf_mem_alloc reuse behavior is pretty much SLAB_TYPESAFE_BY_RCU.
>>> Many kernel data structures know how to deal with such object reuse.
>>> We can have a private bpf_mem_alloc here for qp_trie_branch-s only and
>>> construct a logic in a way that obj reuse is not problematic.
>> As said above, qp_trie_lookup_elem may be OK with SLAB_TYPESAFE_BY_RCU. But I
>> don't know how to do it for get_next_key because the iteration result needs to
>> be ordered and can not skip existed elements before the iterations begins.
> imo it's fine to spin_lock in get_next_key.
> We should measure the lock overhead in lookup. It might be acceptable too.
Will check that.
>
>> If removing immediate reuse from bpf_mem_alloc, beside the may-decreased
>> performance, is there any reason we can not do that ?
> What do you mean?
> Always do call_rcu + call_rcu_tasks_trace for every bpf_mem_free ?
Yes. Does doing call_rcu() + call_rcu_task_trace in batches help, just like
free_bulk() does?
> As I said above:
> " call_rcu_task_trace can take long time. So qp_trie_branch-s may linger
>   around. So quick update/delete (in sleepable with call_rcu_task_trace)
>   may very well exhaust memory.
> "
> As an exercise try samples/bpf/map_perf_test on non-prealloc hashmap
> before mem_alloc conversion. Just regular call_rcu consumes 100% of all cpus.
> With call_rcu_tasks_trace it's worse. It cannot sustain such flood.
> .
Will check the result of map_perf_test. But it seems bpf_mem_alloc may still
exhaust memory if __free_rcu_tasks_trace() is not called in a timely manner.
Will take a closer look at that.
Alexei Starovoitov Sept. 27, 2022, 3:18 a.m. UTC | #5
On Mon, Sep 26, 2022 at 8:08 PM Hou Tao <houtao@huaweicloud.com> wrote:
>
> Hi,
>
> On 9/27/2022 9:19 AM, Alexei Starovoitov wrote:
> > On Mon, Sep 26, 2022 at 09:18:46PM +0800, Hou Tao wrote:
> >> Hi,
> >>
> >> On 9/26/2022 9:25 AM, Alexei Starovoitov wrote:
> >>> On Sat, Sep 24, 2022 at 09:36:07PM +0800, Hou Tao wrote:
> >>>> From: Hou Tao <houtao1@huawei.com>
> >> SNIP
> >>>> For atomic ops and kmalloc overhead, I think I can reuse the idea from
> >>>> patchset "bpf: BPF specific memory allocator". I have given bpf_mem_alloc
> >>>> a simple try and encounter some problems. One problem is that
> >>>> immediate reuse of freed object in bpf memory allocator. Because qp-trie
> >>>> uses bpf memory allocator to allocate and free qp_trie_branch, if
> >>>> qp_trie_branch is reused immediately, the lookup procedure may oops due
> >>>> to the incorrect content in qp_trie_branch. And another problem is the
> >>>> size limitation in bpf_mem_alloc() is 4096. It may be a little small for
> >>>> the total size of key size and value size, but maybe I can use two
> >>>> separated bpf_mem_alloc for key and value.
> >>> 4096 limit for key+value size would be an acceptable trade-off.
> >>> With kptrs the user will be able to extend value to much bigger sizes
> >>> while doing <= 4096 allocation at a time. Larger allocations are failing
> >>> in production more often than not. Any algorithm relying on successful
> >>>  >= 4096 allocation is likely to fail. kvmalloc is a fallback that
> >>> the kernel is using, but we're not there yet in bpf land.
> >>> The benefits of bpf_mem_alloc in qp-trie would be huge though.
> >>> qp-trie would work in all contexts including sleepable progs.
> >>> As presented the use cases for qp-trie are quite limited.
> >>> If I understand correctly the concern for not using bpf_mem_alloc
> >>> is that qp_trie_branch can be reused. Can you provide an exact scenario
> >>> that will casue issuses?
> >> The usage of branch node during lookup is as follows:
> >> (1) check the index field of branch node which records the position of nibble in
> >> which the keys of child nodes are different
> >> (2) calculate the index of child node by using the nibble value of lookup key in
> >> index position
> >> (3) get the pointer of child node by dereferencing the variable-length pointer
> >> array in branch node
> >>
> >> Because both branch node and leaf node have variable length, I used one
> >> bpf_mem_alloc for these two node types, so if a leaf node is reused as a branch
> >> node, the pointer got in step 3 may be invalid.
> >>
> >> If using separated bpf_mem_alloc for branch node and leaf node, it may still be
> >> problematic because the updates to a reused branch node are not atomic and
> >> branch nodes with different child node will reuse the same object due to size
> >> alignment in allocator, so the lookup procedure below may get an uninitialized
> >> pointer in the pointer array:
> >>
> >> lookup procedure                                update procedure
> >>
> >>
> >> // three child nodes, 48-bytes
> >> branch node x
> >>                                                               //  four child
> >> nodes, 56-bytes
> >>                                                               reuse branch node x
> >>                                                               x->bitmap = 0xf
> >> // got an uninitialized pointer
> >> x->nodes[3]
> >>                                                               Initialize
> >> x->nodes[0~3]
> > Looking at lookup:
> > +     while (is_branch_node(node)) {
> > +             struct qp_trie_branch *br = node;
> > +             unsigned int bitmap;
> > +             unsigned int iip;
> > +
> > +             /* When byte index equals with key len, the target key
> > +              * may be in twigs->nodes[0].
> > +              */
> > +             if (index_to_byte_index(br->index) > data_len)
> > +                     goto done;
> > +
> > +             bitmap = calc_br_bitmap(br->index, data, data_len);
> > +             if (!(bitmap & br->bitmap))
> > +                     goto done;
> > +
> > +             iip = calc_twig_index(br->bitmap, bitmap);
> > +             node = rcu_dereference_check(br->nodes[iip], rcu_read_lock_bh_held());
> > +     }
> >
> > To be safe the br->index needs to be initialized after br->nodex and br->bitmap.
> > While deleting the br->index can be set to special value which would mean
> > restart the lookup from the beginning.
> > As you're suggesting with smp_rmb/wmb pairs the lookup will only see valid br.
> > Also the race is extremely tight, right?
> > After brb->nodes[iip] + is_branch_node that memory needs to deleted on other cpu
> > after spin_lock and reused in update after another spin_lock.
> > Without artifical big delay it's hard to imagine how nodes[iip] pointer
> > would be initialized to some other qp_trie_branch or leaf during delete,
> > then memory reused and nodes[iip] is initialized again with the same address.
> > Theoretically possible, but unlikely, right?
> > And with correct ordering of scrubbing and updates to
> > br->nodes, br->bitmap, br->index it can be made safe.
> The reuse of node not only introduces the safety problem (e.g. access an invalid
> pointer), but also incur the false negative problem (e.g. can not find an
> existent element) as show below:
>
> lookup A in X on CPU1            update X on CPU 2
>
>      [ branch X v1 ]
>  leaf A | leaf B | leaf C
>                                                  [ branch X v2 ]
>                                                leaf A | leaf B | leaf C | leaf D
>
>                                                   // free and reuse branch X v1
>                                                   [ branch X v1 ]
>                                                 leaf O | leaf P | leaf Q
> // leaf A can not be found

Right. That's why I suggested considering an hlist_nulls-like approach
like the one htab is using.

> > We can add a sequence number to qp_trie_branch as well and read it before and after.
> > Every reuse would inc the seq.
> > If seq number differs, re-read the node pointer form parent.
> A seq number on qp_trie_branch is a good idea. Will try it. But we also need to
> consider the starvation of lookup by update/deletion. Maybe need fallback to the
> subtree spinlock after some reread.

I think the fallback is overkill. The race is extremely unlikely.

> >> The problem may can be solved by zeroing the unused or whole part of allocated
> >> object. Maybe adding a paired smp_wmb() and smp_rmb() to ensure the update of
> >> node array happens before the update of bitmap is also OK and the cost will be
> >> much cheaper in x86 host.
> > Something like this, right.
> > We can also consider doing lookup under spin_lock. For a large branchy trie
> > the cost of spin_lock maybe negligible.
> Do you meaning adding an extra spinlock to qp_trie_branch to protect again reuse
> or taking the subtree spinlock during lookup ? IMO the latter will make the
> lookup performance suffer, but I will check it as well.

subtree lock. lookup perf will suffer a bit.
The numbers will tell the true story.

> >
> >> Beside lookup procedure, get_next_key() from syscall also lookups trie
> >> locklessly. If the branch node is reused, the order of returned keys may be
> >> broken. There is also a parent pointer in branch node and it is used for reverse
> >> lookup during get_next_key, the reuse may lead to unexpected skip in iteration.
> > qp_trie_lookup_next_node can be done under spin_lock.
> > Iterating all map elements is a slow operation anyway.
> OK. Taking subtree spinlock is simpler but the scalability will be bad. Not sure
> whether or not the solution for lockless lookup will work for get_next_key. Will
> check.

What kind of scalability are you concerned about?
get_next is done by user space only. Plenty of overhead already.

> >
> >>> Instead of call_rcu in qp_trie_branch_free (which will work only for
> >>> regular progs and have high overhead as demonstrated by mem_alloc patches)
> >>> the qp-trie freeing logic can scrub that element, so it's ready to be
> >>> reused as another struct qp_trie_branch.
> >>> I guess I'm missing how rcu protects this internal data structures of qp-trie.
> >>> The rcu_read_lock of regular bpf prog helps to stay lock-less during lookup?
> >>> Is that it?
> >> Yes. The update is made atomic by copying the parent branch node to a new branch
> >> node and replacing the pointer to the parent branch node by the new branch node,
> >> so the lookup procedure either find the old branch node or the new branch node.
> >>> So to make qp-trie work in sleepable progs the algo would need to
> >>> be changed to do both call_rcu and call_rcu_task_trace everywhere
> >>> to protect these inner structs?
> >>> call_rcu_task_trace can take long time. So qp_trie_branch-s may linger
> >>> around. So quick update/delete (in sleepable with call_rcu_task_trace)
> >>> may very well exhaust memory. With bpf_mem_alloc we don't have this issue
> >>> since rcu_task_trace gp is observed only when freeing into global mem pool.
> >>> Say qp-trie just uses bpf_mem_alloc for qp_trie_branch.
> >>> What is the worst that can happen? qp_trie_lookup_elem will go into wrong
> >>> path, but won't crash, right? Can we do hlist_nulls trick to address that?
> >>> In other words bpf_mem_alloc reuse behavior is pretty much SLAB_TYPESAFE_BY_RCU.
> >>> Many kernel data structures know how to deal with such object reuse.
> >>> We can have a private bpf_mem_alloc here for qp_trie_branch-s only and
> >>> construct a logic in a way that obj reuse is not problematic.
> >> As said above, qp_trie_lookup_elem may be OK with SLAB_TYPESAFE_BY_RCU. But I
> >> don't know how to do it for get_next_key because the iteration result needs to
> >> be ordered and can not skip existed elements before the iterations begins.
> > imo it's fine to spin_lock in get_next_key.
> > We should measure the lock overhead in lookup. It might be acceptable too.
> Will check that.
> >
> >> If removing immediate reuse from bpf_mem_alloc, beside the may-decreased
> >> performance, is there any reason we can not do that ?
> > What do you mean?
> > Always do call_rcu + call_rcu_tasks_trace for every bpf_mem_free ?
> Yes. Does doing call_rcu() + call_rcu_task_trace in batch help just like
> free_bulk does ?
> > As I said above:
> > " call_rcu_task_trace can take long time. So qp_trie_branch-s may linger
> >   around. So quick update/delete (in sleepable with call_rcu_task_trace)
> >   may very well exhaust memory.
> > "
> > As an exercise try samples/bpf/map_perf_test on non-prealloc hashmap
> > before mem_alloc conversion. Just regular call_rcu consumes 100% of all cpus.
> > With call_rcu_tasks_trace it's worse. It cannot sustain such flood.
> > .
> Will check the result of map_perf_test. But it seems bpf_mem_alloc may still
> exhaust memory if __free_rcu_tasks_trace() can not called timely, Will take a
> close lookup on that.

In theory, yes. The batching makes a big difference.
Hou Tao Sept. 27, 2022, 2:07 p.m. UTC | #6
Hi,

On 9/27/2022 11:18 AM, Alexei Starovoitov wrote:
> On Mon, Sep 26, 2022 at 8:08 PM Hou Tao <houtao@huaweicloud.com> wrote:
SNIP
>>>>>> For atomic ops and kmalloc overhead, I think I can reuse the idea from
>>>>>> patchset "bpf: BPF specific memory allocator". I have given bpf_mem_alloc
>>>>>> a simple try and encounter some problems. One problem is that
>>>>>> immediate reuse of freed object in bpf memory allocator. Because qp-trie
>>>>>> uses bpf memory allocator to allocate and free qp_trie_branch, if
>>>>>> qp_trie_branch is reused immediately, the lookup procedure may oops due
>>>>>> to the incorrect content in qp_trie_branch. And another problem is the
>>>>>> size limitation in bpf_mem_alloc() is 4096. It may be a little small for
>>>>>> the total size of key size and value size, but maybe I can use two
>>>>>> separated bpf_mem_alloc for key and value.
>>>>> 4096 limit for key+value size would be an acceptable trade-off.
>>>>> With kptrs the user will be able to extend value to much bigger sizes
>>>>> while doing <= 4096 allocation at a time. Larger allocations are failing
>>>>> in production more often than not. Any algorithm relying on successful
>>>>>  >= 4096 allocation is likely to fail. kvmalloc is a fallback that
>>>>> the kernel is using, but we're not there yet in bpf land.
>>>>> The benefits of bpf_mem_alloc in qp-trie would be huge though.
>>>>> qp-trie would work in all contexts including sleepable progs.
>>>>> As presented the use cases for qp-trie are quite limited.
>>>>> If I understand correctly the concern for not using bpf_mem_alloc
>>>>> is that qp_trie_branch can be reused. Can you provide an exact scenario
>>>>> that will casue issuses?
SNIP
>>>> Looking at lookup:
>>>> +     while (is_branch_node(node)) {
>>>> +             struct qp_trie_branch *br = node;
>>>> +             unsigned int bitmap;
>>>> +             unsigned int iip;
>>>> +
>>>> +             /* When byte index equals with key len, the target key
>>>> +              * may be in twigs->nodes[0].
>>>> +              */
>>>> +             if (index_to_byte_index(br->index) > data_len)
>>>> +                     goto done;
>>>> +
>>>> +             bitmap = calc_br_bitmap(br->index, data, data_len);
>>>> +             if (!(bitmap & br->bitmap))
>>>> +                     goto done;
>>>> +
>>>> +             iip = calc_twig_index(br->bitmap, bitmap);
>>>> +             node = rcu_dereference_check(br->nodes[iip], rcu_read_lock_bh_held());
>>>> +     }
>>>>
>>>> To be safe the br->index needs to be initialized after br->nodex and br->bitmap.
>>>> While deleting the br->index can be set to special value which would mean
>>>> restart the lookup from the beginning.
>>>> As you're suggesting with smp_rmb/wmb pairs the lookup will only see valid br.
>>>> Also the race is extremely tight, right?
>>>> After brb->nodes[iip] + is_branch_node that memory needs to deleted on other cpu
>>>> after spin_lock and reused in update after another spin_lock.
>>>> Without artifical big delay it's hard to imagine how nodes[iip] pointer
>>>> would be initialized to some other qp_trie_branch or leaf during delete,
>>>> then memory reused and nodes[iip] is initialized again with the same address.
>>>> Theoretically possible, but unlikely, right?
>>>> And with correct ordering of scrubbing and updates to
>>>> br->nodes, br->bitmap, br->index it can be made safe.
>> The reuse of a node not only introduces a safety problem (e.g. accessing an invalid
>> pointer), but also incurs a false-negative problem (e.g. failing to find an
>> existing element) as shown below:
>>
>> lookup A in X on CPU1            update X on CPU 2
>>
>>      [ branch X v1 ]
>>  leaf A | leaf B | leaf C
>>                                                  [ branch X v2 ]
>>                                                leaf A | leaf B | leaf C | leaf D
>>
>>                                                   // free and reuse branch X v1
>>                                                   [ branch X v1 ]
>>                                                 leaf O | leaf P | leaf Q
>> // leaf A can not be found
> Right. That's why I suggested considering the hlist_nulls-like approach
> that htab is using.
>
>>> We can add a sequence number to qp_trie_branch as well and read it before and after.
>>> Every reuse would inc the seq.
>>> If the seq number differs, re-read the node pointer from the parent.
>> A seq number on qp_trie_branch is a good idea. Will try it. But we also need to
>> consider the starvation of lookup by update/deletion. Maybe we need to fall back
>> to the subtree spinlock after some rereads.
> I think the fallback is overkill. The race is extremely unlikely.
OK. Will add a test on a tiny qp-trie to make sure the lockless lookup stays correct.
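To make the idea concrete, below is a rough, untested sketch of how the seq-based retry could slot into the lookup loop quoted above. The reuse_seq field, the root slot and the bare goto-retry policy are assumptions for illustration, not code from this series; the reuse path would have to do reuse_seq++ followed by smp_wmb() before rewriting any other field.

retry:
	node = rcu_dereference_check(*root, rcu_read_lock_bh_held());
	while (node && is_branch_node(node)) {
		struct qp_trie_branch *br = node;
		unsigned int seq, bitmap;
		void *next = NULL;

		seq = READ_ONCE(br->reuse_seq);
		smp_rmb();	/* read reuse_seq before index/bitmap/nodes[] */

		if (index_to_byte_index(br->index) <= data_len) {
			bitmap = calc_br_bitmap(br->index, data, data_len);
			if (bitmap & br->bitmap)
				next = rcu_dereference_check(br->nodes[calc_twig_index(br->bitmap, bitmap)],
							     rcu_read_lock_bh_held());
		}

		/* Only trust what was read if br was not recycled meanwhile. */
		smp_rmb();
		if (seq != READ_ONCE(br->reuse_seq))
			goto retry;	/* or restart from the parent, or take the subtree lock */

		if (!next)
			goto done;	/* a genuine "not found" */
		node = next;
	}

Note that the "not found" exits also go through the seq re-check, otherwise a recycled branch could still produce a false negative.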
>>>> The problem may be solved by zeroing the unused or whole part of the allocated
>>>> object. Maybe adding a paired smp_wmb() and smp_rmb() to ensure the update of the
>>>> node array happens before the update of the bitmap would also work, and the cost
>>>> would be much cheaper on x86 hosts.
>>> Something like this, right.
>>> We can also consider doing the lookup under spin_lock. For a large branchy trie
>>> the cost of spin_lock may be negligible.
>> Do you mean adding an extra spinlock to qp_trie_branch to protect against reuse,
>> or taking the subtree spinlock during lookup? IMO the latter will make lookup
>> performance suffer, but I will check it as well.
> subtree lock. lookup perf will suffer a bit.
> The numbers will tell the true story.
A quick benchmark shows that the performance is bad when using the subtree lock for lookup:

Randomly-generated binary data (key size=255, max entries=16K, key length range:[1, 255])
* no lock
qp-trie lookup   (1  thread)   10.250 ± 0.009M/s (drops 0.006 ± 0.000M/s mem 0.000 MiB)
qp-trie lookup   (2  thread)   20.466 ± 0.009M/s (drops 0.010 ± 0.000M/s mem 0.000 MiB)
qp-trie lookup   (4  thread)   41.211 ± 0.010M/s (drops 0.018 ± 0.000M/s mem 0.000 MiB)
qp-trie lookup   (8  thread)   82.933 ± 0.409M/s (drops 0.031 ± 0.000M/s mem 0.000 MiB)
qp-trie lookup   (16 thread)  162.615 ± 0.842M/s (drops 0.070 ± 0.000M/s mem 0.000 MiB)

* subtree lock
qp-trie lookup   (1  thread)    8.990 ± 0.506M/s (drops 0.006 ± 0.000M/s mem 0.000 MiB)
qp-trie lookup   (2  thread)   15.908 ± 0.141M/s (drops 0.004 ± 0.000M/s mem 0.000 MiB)
qp-trie lookup   (4  thread)   27.551 ± 0.025M/s (drops 0.019 ± 0.000M/s mem 0.000 MiB)
qp-trie lookup   (8  thread)   42.040 ± 0.241M/s (drops 0.018 ± 0.000M/s mem 0.000 MiB)
qp-trie lookup   (16 thread)   50.884 ± 0.171M/s (drops 0.012 ± 0.000M/s mem 0.000 MiB)

Strings in /proc/kallsyms (key size=83, max entries=170958)
* no lock
qp-trie lookup   (1  thread)    4.096 ± 0.234M/s (drops 0.249 ± 0.014M/s mem 0.000 MiB)
qp-trie lookup   (2  thread)    8.226 ± 0.009M/s (drops 0.500 ± 0.002M/s mem 0.000 MiB)
qp-trie lookup   (4  thread)   15.356 ± 0.034M/s (drops 0.933 ± 0.006M/s mem 0.000 MiB)
qp-trie lookup   (8  thread)   30.037 ± 0.584M/s (drops 1.827 ± 0.037M/s mem 0.000 MiB)
qp-trie lookup   (16 thread)   62.600 ± 0.307M/s (drops 3.808 ± 0.029M/s mem 0.000 MiB)

* subtree lock
qp-trie lookup   (1  thread)    4.454 ± 0.108M/s (drops 0.271 ± 0.007M/s mem 0.000 MiB)
qp-trie lookup   (2  thread)    4.883 ± 0.500M/s (drops 0.297 ± 0.031M/s mem 0.000 MiB)
qp-trie lookup   (4  thread)    5.771 ± 0.137M/s (drops 0.351 ± 0.008M/s mem 0.000 MiB)
qp-trie lookup   (8  thread)    5.926 ± 0.104M/s (drops 0.359 ± 0.011M/s mem 0.000 MiB)
qp-trie lookup   (16 thread)    5.947 ± 0.171M/s (drops 0.362 ± 0.023M/s mem 0.000 MiB)
>>>> Besides the lookup procedure, get_next_key() from the syscall path also looks up
>>>> the trie locklessly. If a branch node is reused, the order of the returned keys may
>>>> be broken. There is also a parent pointer in the branch node, used for reverse
>>>> lookup during get_next_key; reuse may lead to unexpected skips in the iteration.
>>> qp_trie_lookup_next_node can be done under spin_lock.
>>> Iterating all map elements is a slow operation anyway.
>> OK. Taking the subtree spinlock is simpler but the scalability will be bad. Not sure
>> whether the solution for lockless lookup will work for get_next_key. Will
>> check.
> What kind of scalability are you concerned about?
> get_next is done by user space only. Plenty of overhead already.
As an ordered map, next and prev iteration operations may be needed in bpf
programs in the future. For now, I think it is OK.
>>>>> Instead of call_rcu in qp_trie_branch_free (which will work only for
>>>>> regular progs and have high overhead as demonstrated by mem_alloc patches)
>>>>> the qp-trie freeing logic can scrub that element, so it's ready to be
>>>>> reused as another struct qp_trie_branch.
>>>>> I guess I'm missing how rcu protects this internal data structures of qp-trie.
>>>>> The rcu_read_lock of regular bpf prog helps to stay lock-less during lookup?
>>>>> Is that it?
>>>> Yes. The update is made atomic by copying the parent branch node to a new branch
>>>> node and replacing the pointer to the parent branch node with the new branch node,
>>>> so the lookup procedure finds either the old branch node or the new branch node.
>>>>> So to make qp-trie work in sleepable progs the algo would need to
>>>>> be changed to do both call_rcu and call_rcu_task_trace everywhere
>>>>> to protect these inner structs?
>>>>> call_rcu_task_trace can take long time. So qp_trie_branch-s may linger
>>>>> around. So quick update/delete (in sleepable with call_rcu_task_trace)
>>>>> may very well exhaust memory. With bpf_mem_alloc we don't have this issue
>>>>> since rcu_task_trace gp is observed only when freeing into global mem pool.
>>>>> Say qp-trie just uses bpf_mem_alloc for qp_trie_branch.
>>>>> What is the worst that can happen? qp_trie_lookup_elem will go into wrong
>>>>> path, but won't crash, right? Can we do hlist_nulls trick to address that?
>>>>> In other words bpf_mem_alloc reuse behavior is pretty much SLAB_TYPESAFE_BY_RCU.
>>>>> Many kernel data structures know how to deal with such object reuse.
>>>>> We can have a private bpf_mem_alloc here for qp_trie_branch-s only and
>>>>> construct a logic in a way that obj reuse is not problematic.
>>>> As said above, qp_trie_lookup_elem may be OK with SLAB_TYPESAFE_BY_RCU. But I
>>>> don't know how to do it for get_next_key because the iteration result needs to
>>>> be ordered and must not skip elements that existed before the iteration began.
>>> imo it's fine to spin_lock in get_next_key.
>>> We should measure the lock overhead in lookup. It might be acceptable too.
>> Will check that.
>>>> If we remove immediate reuse from bpf_mem_alloc, besides the possibly decreased
>>>> performance, is there any reason we cannot do that?
>>> What do you mean?
>>> Always do call_rcu + call_rcu_tasks_trace for every bpf_mem_free ?
>> Yes. Does doing call_rcu() + call_rcu_tasks_trace() in a batch help, just like
>> free_bulk does?
>>> As I said above:
>>> " call_rcu_task_trace can take long time. So qp_trie_branch-s may linger
>>>   around. So quick update/delete (in sleepable with call_rcu_task_trace)
>>>   may very well exhaust memory.
>>> "
>>> As an exercise try samples/bpf/map_perf_test on non-prealloc hashmap
>>> before mem_alloc conversion. Just regular call_rcu consumes 100% of all cpus.
>>> With call_rcu_tasks_trace it's worse. It cannot sustain such flood.
>>> .
I cannot reproduce the phenomenon that call_rcu consumes 100% of all CPUs in my
local environment. Could you share the setup for it?

The following is the output of perf report (--no-children) for "./map_perf_test
4 72 10240 100000" on an x86-64 host with 72 CPUs:

    26.63%  map_perf_test    [kernel.vmlinux]    [k] alloc_htab_elem
    21.57%  map_perf_test    [kernel.vmlinux]    [k] htab_map_update_elem
    18.08%  map_perf_test    [kernel.vmlinux]    [k] htab_map_delete_elem
    12.30%  map_perf_test    [kernel.vmlinux]    [k] free_htab_elem
    10.55%  map_perf_test    [kernel.vmlinux]    [k] __htab_map_lookup_elem
     1.58%  map_perf_test    [kernel.vmlinux]    [k] bpf_map_kmalloc_node
     1.39%  map_perf_test    [kernel.vmlinux]    [k] _raw_spin_lock_irqsave
     1.37%  map_perf_test    [kernel.vmlinux]    [k] __copy_map_value.constprop.0
     0.45%  map_perf_test    [kernel.vmlinux]    [k] check_and_free_fields
     0.33%  map_perf_test    [kernel.vmlinux]    [k] rcu_segcblist_enqueue

The overhead of call_rcu is tiny compared with the hash map operations; instead,
alloc_htab_elem() and free_htab_elem() are the bottlenecks. The following is the
perf output after applying bpf_mem_alloc:

    25.35%  map_perf_test    [kernel.vmlinux]    [k] htab_map_delete_elem
    23.69%  map_perf_test    [kernel.vmlinux]    [k] htab_map_update_elem
     8.42%  map_perf_test    [kernel.vmlinux]    [k] __htab_map_lookup_elem
     7.60%  map_perf_test    [kernel.vmlinux]    [k] alloc_htab_elem
     4.35%  map_perf_test    [kernel.vmlinux]    [k] free_htab_elem
     2.28%  map_perf_test    [kernel.vmlinux]    [k] memcpy_erms
     2.24%  map_perf_test    [kernel.vmlinux]    [k] jhash
     2.02%  map_perf_test    [kernel.vmlinux]    [k] _raw_spin_lock_irqsave

>> Will check the result of map_perf_test. But it seems bpf_mem_alloc may still
>> exhaust memory if __free_rcu_tasks_trace() is not called in a timely manner. Will
>> take a closer look at that.
> In theory, yes. The batching makes a big difference.
Alexei Starovoitov Sept. 28, 2022, 1:08 a.m. UTC | #7
On Tue, Sep 27, 2022 at 7:08 AM Hou Tao <houtao@huaweicloud.com> wrote:
>
> A quick benchmark show the performance is bad when using subtree lock for lookup:
>
> Randomly-generated binary data (key size=255, max entries=16K, key length
> range:[1, 255])
> * no lock
> qp-trie lookup   (1  thread)   10.250 ± 0.009M/s (drops 0.006 ± 0.000M/s mem
> 0.000 MiB)
> qp-trie lookup   (2  thread)   20.466 ± 0.009M/s (drops 0.010 ± 0.000M/s mem
> 0.000 MiB)
> qp-trie lookup   (4  thread)   41.211 ± 0.010M/s (drops 0.018 ± 0.000M/s mem
> 0.000 MiB)
> qp-trie lookup   (8  thread)   82.933 ± 0.409M/s (drops 0.031 ± 0.000M/s mem
> 0.000 MiB)
> qp-trie lookup   (16 thread)  162.615 ± 0.842M/s (drops 0.070 ± 0.000M/s mem
> 0.000 MiB)
>
> * subtree lock
> qp-trie lookup   (1  thread)    8.990 ± 0.506M/s (drops 0.006 ± 0.000M/s mem
> 0.000 MiB)
> qp-trie lookup   (2  thread)   15.908 ± 0.141M/s (drops 0.004 ± 0.000M/s mem
> 0.000 MiB)
> qp-trie lookup   (4  thread)   27.551 ± 0.025M/s (drops 0.019 ± 0.000M/s mem
> 0.000 MiB)
> qp-trie lookup   (8  thread)   42.040 ± 0.241M/s (drops 0.018 ± 0.000M/s mem
> 0.000 MiB)
> qp-trie lookup   (16 thread)   50.884 ± 0.171M/s (drops 0.012 ± 0.000M/s mem
> 0.000 MiB)

That's indeed significant.
But I interpret it differently.
Since single-thread perf is close enough while 16 threads
suffer a 3x slowdown, it means the lock mechanism is inefficient.
It means update/delete performance doesn't scale either.

> Strings in /proc/kallsyms (key size=83, max entries=170958)
> * no lock
> qp-trie lookup   (1  thread)    4.096 ± 0.234M/s (drops 0.249 ± 0.014M/s mem
> 0.000 MiB)
>
> * subtree lock
> qp-trie lookup   (1  thread)    4.454 ± 0.108M/s (drops 0.271 ± 0.007M/s mem
> 0.000 MiB)

Here a single thread with spin_lock is _faster_ than without.
So it's not about the cost of spin_lock, but its contention.
So all the complexity to do lockless lookup
needs to be considered in this context.
Looks like update/delete don't scale anyway.
So lock-less lookup complexity is justified only
for the case with a lot of concurrent lookups and
little update/delete.
When bpf hash map was added the majority of tracing use cases
had # of lookups == # of updates == # of deletes.
For qp-trie we obviously cannot predict,
but should not pivot strongly into lock-less lookup without data.
The lock-less lookup should have reasonable complexity.
Otherwise let's do spin_lock and work on a different locking scheme
for all operations (lookup/update/delete).

> I can not reproduce the phenomenon that call_rcu consumes 100% of all cpus in my
> local environment, could you share the setup for it ?
>
> The following is the output of perf report (--no-children) for "./map_perf_test
> 4 72 10240 100000" on a x86-64 host with 72-cpus:
>
>     26.63%  map_perf_test    [kernel.vmlinux]                             [k]
> alloc_htab_elem
>     21.57%  map_perf_test    [kernel.vmlinux]                             [k]
> htab_map_update_elem

Looks like the perf is lost on atomic_inc/dec.
Try a partial revert of mem_alloc.
In particular to make sure
commit 0fd7c5d43339 ("bpf: Optimize call_rcu in non-preallocated hash map.")
is reverted and call_rcu is in place,
but percpu counter optimization is still there.
Also please use 'map_perf_test 4'.
I doubt 1000 vs 10240 will make a difference, but still.
Hou Tao Sept. 28, 2022, 3:27 a.m. UTC | #8
Hi,

On 9/28/2022 9:08 AM, Alexei Starovoitov wrote:
> On Tue, Sep 27, 2022 at 7:08 AM Hou Tao <houtao@huaweicloud.com> wrote:
>> A quick benchmark show the performance is bad when using subtree lock for lookup:
>>
>> Randomly-generated binary data (key size=255, max entries=16K, key length
>> range:[1, 255])
>> * no lock
>> qp-trie lookup   (1  thread)   10.250 ± 0.009M/s (drops 0.006 ± 0.000M/s mem
>> 0.000 MiB)
>> qp-trie lookup   (2  thread)   20.466 ± 0.009M/s (drops 0.010 ± 0.000M/s mem
>> 0.000 MiB)
>> qp-trie lookup   (4  thread)   41.211 ± 0.010M/s (drops 0.018 ± 0.000M/s mem
>> 0.000 MiB)
>> qp-trie lookup   (8  thread)   82.933 ± 0.409M/s (drops 0.031 ± 0.000M/s mem
>> 0.000 MiB)
>> qp-trie lookup   (16 thread)  162.615 ± 0.842M/s (drops 0.070 ± 0.000M/s mem
>> 0.000 MiB)
>>
>> * subtree lock
>> qp-trie lookup   (1  thread)    8.990 ± 0.506M/s (drops 0.006 ± 0.000M/s mem
>> 0.000 MiB)
>> qp-trie lookup   (2  thread)   15.908 ± 0.141M/s (drops 0.004 ± 0.000M/s mem
>> 0.000 MiB)
>> qp-trie lookup   (4  thread)   27.551 ± 0.025M/s (drops 0.019 ± 0.000M/s mem
>> 0.000 MiB)
>> qp-trie lookup   (8  thread)   42.040 ± 0.241M/s (drops 0.018 ± 0.000M/s mem
>> 0.000 MiB)
>> qp-trie lookup   (16 thread)   50.884 ± 0.171M/s (drops 0.012 ± 0.000M/s mem
>> 0.000 MiB)
> That's indeed significant.
> But I interpret it differently.
> Since single thread perf is close enough while 16 thread
> suffers 3x it means the lock mechanism is inefficient.
> It means update/delete performance equally doesn't scale.
>
>> Strings in /proc/kallsyms (key size=83, max entries=170958)
>> * no lock
>> qp-trie lookup   (1  thread)    4.096 ± 0.234M/s (drops 0.249 ± 0.014M/s mem
>> 0.000 MiB)
>>
>> * subtree lock
>> qp-trie lookup   (1  thread)    4.454 ± 0.108M/s (drops 0.271 ± 0.007M/s mem
>> 0.000 MiB)
> Here a single thread with spin_lock is _faster_ than without.
It is a little strange because there is no extra overhead in the lockless lookup for
now. I thought it may be due to CPU frequency boost or something similar. Will check it again.
> So it's not about the cost of spin_lock, but its contention.
> So all the complexity to do lockless lookup
> needs to be considered in this context.
> Looks like update/delete don't scale anyway.
Yes.
> So lock-less lookup complexity is justified only
> for the case with a lot of concurrent lookups and
> little update/delete.
> When bpf hash map was added the majority of tracing use cases
> had # of lookups == # of updates == # of deletes.
> For qp-trie we obviously cannot predict,
> but should not pivot strongly into lock-less lookup without data.
> The lock-less lookup should have reasonable complexity.
> Otherwise let's do spin_lock and work on a different locking scheme
> for all operations (lookup/update/delete).
For lpm-trie, does the use case in Cilium [0] have many lockless lookups and few
updates/deletions? If so, I think that use case applies to qp-trie as well.
Before steering towards spin_lock, let me implement a demo of the lockless lookup
to see its complexity. As for the spin-lock approach, I had implemented a
hand-over-hand lock (the lookup was still lockless), but it didn't scale well either.
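To be clear about what hand-over-hand means here, a rough illustrative sketch follows; the node layout and names are hypothetical and this is not the code that was benchmarked.

#include <linux/spinlock.h>

/* Hypothetical node layout, only to illustrate lock coupling. */
struct qpt_ll_node {
	spinlock_t lock;
	bool is_leaf;
	struct qpt_ll_node *child[2];
};

/* Take the child's lock before releasing the parent's, so an updater can
 * never free or rewrite the node the walker is currently standing on.
 */
static struct qpt_ll_node *qpt_walk_coupled(struct qpt_ll_node *root,
					    int (*pick_child)(struct qpt_ll_node *))
{
	struct qpt_ll_node *cur = root, *next;

	spin_lock(&cur->lock);
	while (!cur->is_leaf) {
		next = cur->child[pick_child(cur)];
		spin_lock(&next->lock);		/* acquire child ... */
		spin_unlock(&cur->lock);	/* ... then drop parent */
		cur = next;
	}
	return cur;	/* returned with cur->lock held */
}

Each walker still takes and releases a lock per visited node, and all walkers contend on the root's lock first, which is consistent with the poor scaling mentioned above.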
>
>> I can not reproduce the phenomenon that call_rcu consumes 100% of all cpus in my
>> local environment, could you share the setup for it ?
>>
>> The following is the output of perf report (--no-children) for "./map_perf_test
>> 4 72 10240 100000" on a x86-64 host with 72-cpus:
>>
>>     26.63%  map_perf_test    [kernel.vmlinux]                             [k]
>> alloc_htab_elem
>>     21.57%  map_perf_test    [kernel.vmlinux]                             [k]
>> htab_map_update_elem
> Looks like the perf is lost on atomic_inc/dec.
> Try a partial revert of mem_alloc.
> In particular to make sure
> commit 0fd7c5d43339 ("bpf: Optimize call_rcu in non-preallocated hash map.")
> is reverted and call_rcu is in place,
> but percpu counter optimization is still there.
> Also please use 'map_perf_test 4'.
> I doubt 1000 vs 10240 will make a difference, but still.
Will do. The suggestion is reasonable. But when max_entries=1000,
use_percpu_counter will be false.
Alexei Starovoitov Sept. 28, 2022, 4:37 a.m. UTC | #9
On Tue, Sep 27, 2022 at 8:27 PM Hou Tao <houtao@huaweicloud.com> wrote:
> > Also please use 'map_perf_test 4'.
> > I doubt 1000 vs 10240 will make a difference, but still.
> Will do. The suggestion is reasonable. But when max_entries=1000,
> use_percpu_counter will be false.

... when the system has a lot of CPUs...
The knobs in the microbenchmark are there to tune it
to observe the desired behavior. Please mention the reason for
using that specific value next time.
When N things have changed it's hard to do an apples-to-apples comparison.
Hou Tao Sept. 28, 2022, 8:45 a.m. UTC | #10
Hi,

On 9/28/2022 9:08 AM, Alexei Starovoitov wrote:
> On Tue, Sep 27, 2022 at 7:08 AM Hou Tao <houtao@huaweicloud.com> wrote:
>
SNIP
>> I can not reproduce the phenomenon that call_rcu consumes 100% of all cpus in my
>> local environment, could you share the setup for it ?
>>
>> The following is the output of perf report (--no-children) for "./map_perf_test
>> 4 72 10240 100000" on a x86-64 host with 72-cpus:
>>
>>     26.63%  map_perf_test    [kernel.vmlinux]                             [k]
>> alloc_htab_elem
>>     21.57%  map_perf_test    [kernel.vmlinux]                             [k]
>> htab_map_update_elem
> Looks like the perf is lost on atomic_inc/dec.
> Try a partial revert of mem_alloc.
> In particular to make sure
> commit 0fd7c5d43339 ("bpf: Optimize call_rcu in non-preallocated hash map.")
> is reverted and call_rcu is in place,
> but percpu counter optimization is still there.
> Also please use 'map_perf_test 4'.
> I doubt 1000 vs 10240 will make a difference, but still.
>
I have tried the following two setups:
(1) Don't use bpf_mem_alloc in hash-map and use per-cpu counter in hash-map
# Samples: 1M of event 'cycles:ppp'
# Event count (approx.): 1041345723234
#
# Overhead  Command          Shared Object       Symbol
# ........  ...............  ..................  ...............................
#
    10.36%  map_perf_test    [kernel.vmlinux]    [k] bpf_map_get_memcg.isra.0
     9.82%  map_perf_test    [kernel.vmlinux]    [k] bpf_map_kmalloc_node
     4.24%  map_perf_test    [kernel.vmlinux]    [k] check_preemption_disabled
     2.86%  map_perf_test    [kernel.vmlinux]    [k] htab_map_update_elem
     2.80%  map_perf_test    [kernel.vmlinux]    [k] __kmalloc_node
     2.72%  map_perf_test    [kernel.vmlinux]    [k] htab_map_delete_elem
     2.30%  map_perf_test    [kernel.vmlinux]    [k] memcg_slab_post_alloc_hook
     2.21%  map_perf_test    [kernel.vmlinux]    [k] entry_SYSCALL_64
     2.17%  map_perf_test    [kernel.vmlinux]    [k] syscall_exit_to_user_mode
     2.12%  map_perf_test    [kernel.vmlinux]    [k] jhash
     2.11%  map_perf_test    [kernel.vmlinux]    [k] syscall_return_via_sysret
     2.05%  map_perf_test    [kernel.vmlinux]    [k] alloc_htab_elem
     1.94%  map_perf_test    [kernel.vmlinux]    [k] _raw_spin_lock_irqsave
     1.92%  map_perf_test    [kernel.vmlinux]    [k] preempt_count_add
     1.92%  map_perf_test    [kernel.vmlinux]    [k] preempt_count_sub
     1.87%  map_perf_test    [kernel.vmlinux]    [k] call_rcu


(2) Use bpf_mem_alloc & per-cpu counter in hash-map, but without the batched
call_rcu optimization, by reverting the following commits:

9f2c6e96c65e bpf: Optimize rcu_barrier usage between hash map and bpf_mem_alloc.
bfc03c15bebf bpf: Remove usage of kmem_cache from bpf_mem_cache.
02cc5aa29e8c bpf: Remove prealloc-only restriction for sleepable bpf programs.
dccb4a9013a6 bpf: Prepare bpf_mem_alloc to be used by sleepable bpf programs.
96da3f7d489d bpf: Remove tracing program restriction on map types
ee4ed53c5eb6 bpf: Convert percpu hash map to per-cpu bpf_mem_alloc.
4ab67149f3c6 bpf: Add percpu allocation support to bpf_mem_alloc.
8d5a8011b35d bpf: Batch call_rcu callbacks instead of SLAB_TYPESAFE_BY_RCU.
7c266178aa51 bpf: Adjust low/high watermarks in bpf_mem_cache
0fd7c5d43339 bpf: Optimize call_rcu in non-preallocated hash map.

     5.17%  map_perf_test    [kernel.vmlinux]    [k] check_preemption_disabled
     4.53%  map_perf_test    [kernel.vmlinux]    [k] __get_obj_cgroup_from_memcg
     2.97%  map_perf_test    [kernel.vmlinux]    [k] htab_map_update_elem
     2.74%  map_perf_test    [kernel.vmlinux]    [k] htab_map_delete_elem
     2.62%  map_perf_test    [kernel.vmlinux]    [k] kmem_cache_alloc_node
     2.57%  map_perf_test    [kernel.vmlinux]    [k] memcg_slab_post_alloc_hook
     2.34%  map_perf_test    [kernel.vmlinux]    [k] jhash
     2.30%  map_perf_test    [kernel.vmlinux]    [k] entry_SYSCALL_64
     2.25%  map_perf_test    [kernel.vmlinux]    [k] obj_cgroup_charge
     2.23%  map_perf_test    [kernel.vmlinux]    [k] alloc_htab_elem
     2.17%  map_perf_test    [kernel.vmlinux]    [k] memcpy_erms
     2.17%  map_perf_test    [kernel.vmlinux]    [k] syscall_exit_to_user_mode
     2.16%  map_perf_test    [kernel.vmlinux]    [k] syscall_return_via_sysret
     2.14%  map_perf_test    [kernel.vmlinux]    [k] _raw_spin_lock_irqsave
     2.13%  map_perf_test    [kernel.vmlinux]    [k] preempt_count_add
     2.12%  map_perf_test    [kernel.vmlinux]    [k] preempt_count_sub
     2.00%  map_perf_test    [kernel.vmlinux]    [k] percpu_counter_add_batch
     1.99%  map_perf_test    [kernel.vmlinux]    [k] alloc_bulk
     1.97%  map_perf_test    [kernel.vmlinux]    [k] call_rcu
     1.52%  map_perf_test    [kernel.vmlinux]    [k] mod_objcg_state
     1.36%  map_perf_test    [kernel.vmlinux]    [k] allocate_slab

In both of these two setups, the overhead of call_rcu is about 2% and it is not
the biggest overhead.

Maybe add a not-immediate-reuse flag support to bpf_mem_alloc is reason. What do
you think ?
Hou Tao Sept. 28, 2022, 8:49 a.m. UTC | #11
On 9/28/2022 4:45 PM, Hou Tao wrote:
> Hi,
>
> On 9/28/2022 9:08 AM, Alexei Starovoitov wrote:
>> On Tue, Sep 27, 2022 at 7:08 AM Hou Tao <houtao@huaweicloud.com> wrote:
SNIP
>>
>> In both of these two setups, the overhead of call_rcu is about 2% and it is not
>> the biggest overhead.
Also consider that the overhead of a qp-trie update/delete will be much bigger
than that of a hash map update.
>>
>> Maybe add a not-immediate-reuse flag support to bpf_mem_alloc is reason. What do
>> you think ?
s/reason/reasonable.
> .
Alexei Starovoitov Sept. 29, 2022, 3:22 a.m. UTC | #12
On Wed, Sep 28, 2022 at 1:46 AM Hou Tao <houtao@huaweicloud.com> wrote:
>
> Hi,
>
> On 9/28/2022 9:08 AM, Alexei Starovoitov wrote:
> > On Tue, Sep 27, 2022 at 7:08 AM Hou Tao <houtao@huaweicloud.com> wrote:
> >
> SNIP
> >> I can not reproduce the phenomenon that call_rcu consumes 100% of all cpus in my
> >> local environment, could you share the setup for it ?
> >>
> >> The following is the output of perf report (--no-children) for "./map_perf_test
> >> 4 72 10240 100000" on a x86-64 host with 72-cpus:
> >>
> >>     26.63%  map_perf_test    [kernel.vmlinux]                             [k]
> >> alloc_htab_elem
> >>     21.57%  map_perf_test    [kernel.vmlinux]                             [k]
> >> htab_map_update_elem
> > Looks like the perf is lost on atomic_inc/dec.
> > Try a partial revert of mem_alloc.
> > In particular to make sure
> > commit 0fd7c5d43339 ("bpf: Optimize call_rcu in non-preallocated hash map.")
> > is reverted and call_rcu is in place,
> > but percpu counter optimization is still there.
> > Also please use 'map_perf_test 4'.
> > I doubt 1000 vs 10240 will make a difference, but still.
> >
> I have tried the following two setups:
> (1) Don't use bpf_mem_alloc in hash-map and use per-cpu counter in hash-map
> # Samples: 1M of event 'cycles:ppp'
> # Event count (approx.): 1041345723234
> #
> # Overhead  Command          Shared Object                                Symbol
> # ........  ...............  ...........................................
> ...............................................
> #
>     10.36%  map_perf_test    [kernel.vmlinux]                             [k]
> bpf_map_get_memcg.isra.0

That is per-cpu counter and it's consuming 10% ?!
Something is really odd in your setup.
A lot of debug configs?

>      9.82%  map_perf_test    [kernel.vmlinux]                             [k]
> bpf_map_kmalloc_node
>      4.24%  map_perf_test    [kernel.vmlinux]                             [k]
> check_preemption_disabled

clearly debug build.
Please use production build.

>      2.86%  map_perf_test    [kernel.vmlinux]                             [k]
> htab_map_update_elem
>      2.80%  map_perf_test    [kernel.vmlinux]                             [k]
> __kmalloc_node
>      2.72%  map_perf_test    [kernel.vmlinux]                             [k]
> htab_map_delete_elem
>      2.30%  map_perf_test    [kernel.vmlinux]                             [k]
> memcg_slab_post_alloc_hook
>      2.21%  map_perf_test    [kernel.vmlinux]                             [k]
> entry_SYSCALL_64
>      2.17%  map_perf_test    [kernel.vmlinux]                             [k]
> syscall_exit_to_user_mode
>      2.12%  map_perf_test    [kernel.vmlinux]                             [k] jhash
>      2.11%  map_perf_test    [kernel.vmlinux]                             [k]
> syscall_return_via_sysret
>      2.05%  map_perf_test    [kernel.vmlinux]                             [k]
> alloc_htab_elem
>      1.94%  map_perf_test    [kernel.vmlinux]                             [k]
> _raw_spin_lock_irqsave
>      1.92%  map_perf_test    [kernel.vmlinux]                             [k]
> preempt_count_add
>      1.92%  map_perf_test    [kernel.vmlinux]                             [k]
> preempt_count_sub
>      1.87%  map_perf_test    [kernel.vmlinux]                             [k]
> call_rcu
>
>
> (2) Use bpf_mem_alloc & per-cpu counter in hash-map, but no batch call_rcu
> optimization
> By revert the following commits:
>
> 9f2c6e96c65e bpf: Optimize rcu_barrier usage between hash map and bpf_mem_alloc.
> bfc03c15bebf bpf: Remove usage of kmem_cache from bpf_mem_cache.
> 02cc5aa29e8c bpf: Remove prealloc-only restriction for sleepable bpf programs.
> dccb4a9013a6 bpf: Prepare bpf_mem_alloc to be used by sleepable bpf programs.
> 96da3f7d489d bpf: Remove tracing program restriction on map types
> ee4ed53c5eb6 bpf: Convert percpu hash map to per-cpu bpf_mem_alloc.
> 4ab67149f3c6 bpf: Add percpu allocation support to bpf_mem_alloc.
> 8d5a8011b35d bpf: Batch call_rcu callbacks instead of SLAB_TYPESAFE_BY_RCU.
> 7c266178aa51 bpf: Adjust low/high watermarks in bpf_mem_cache
> 0fd7c5d43339 bpf: Optimize call_rcu in non-preallocated hash map.
>
>      5.17%  map_perf_test    [kernel.vmlinux]                             [k]
> check_preemption_disabled
>      4.53%  map_perf_test    [kernel.vmlinux]                             [k]
> __get_obj_cgroup_from_memcg
>      2.97%  map_perf_test    [kernel.vmlinux]                             [k]
> htab_map_update_elem
>      2.74%  map_perf_test    [kernel.vmlinux]                             [k]
> htab_map_delete_elem
>      2.62%  map_perf_test    [kernel.vmlinux]                             [k]
> kmem_cache_alloc_node
>      2.57%  map_perf_test    [kernel.vmlinux]                             [k]
> memcg_slab_post_alloc_hook
>      2.34%  map_perf_test    [kernel.vmlinux]                             [k] jhash
>      2.30%  map_perf_test    [kernel.vmlinux]                             [k]
> entry_SYSCALL_64
>      2.25%  map_perf_test    [kernel.vmlinux]                             [k]
> obj_cgroup_charge
>      2.23%  map_perf_test    [kernel.vmlinux]                             [k]
> alloc_htab_elem
>      2.17%  map_perf_test    [kernel.vmlinux]                             [k]
> memcpy_erms
>      2.17%  map_perf_test    [kernel.vmlinux]                             [k]
> syscall_exit_to_user_mode
>      2.16%  map_perf_test    [kernel.vmlinux]                             [k]
> syscall_return_via_sysret
>      2.14%  map_perf_test    [kernel.vmlinux]                             [k]
> _raw_spin_lock_irqsave
>      2.13%  map_perf_test    [kernel.vmlinux]                             [k]
> preempt_count_add
>      2.12%  map_perf_test    [kernel.vmlinux]                             [k]
> preempt_count_sub
>      2.00%  map_perf_test    [kernel.vmlinux]                             [k]
> percpu_counter_add_batch
>      1.99%  map_perf_test    [kernel.vmlinux]                             [k]
> alloc_bulk
>      1.97%  map_perf_test    [kernel.vmlinux]                             [k]
> call_rcu
>      1.52%  map_perf_test    [kernel.vmlinux]                             [k]
> mod_objcg_state
>      1.36%  map_perf_test    [kernel.vmlinux]                             [k]
> allocate_slab
>
> In both of these two setups, the overhead of call_rcu is about 2% and it is not
> the biggest overhead.
>
> Maybe add a not-immediate-reuse flag support to bpf_mem_alloc is reason. What do
> you think ?

We've discussed it twice already. It's not an option due to OOM
and performance considerations.
call_rcu doesn't scale to millions a second.
Hou Tao Oct. 8, 2022, 1:56 a.m. UTC | #13
Hi,

On 9/29/2022 11:22 AM, Alexei Starovoitov wrote:
> On Wed, Sep 28, 2022 at 1:46 AM Hou Tao <houtao@huaweicloud.com> wrote:
>> Hi,
>>
>> On 9/28/2022 9:08 AM, Alexei Starovoitov wrote:
>>> On Tue, Sep 27, 2022 at 7:08 AM Hou Tao <houtao@huaweicloud.com> wrote:
>>>
>>> Looks like the perf is lost on atomic_inc/dec.
>>> Try a partial revert of mem_alloc.
>>> In particular to make sure
>>> commit 0fd7c5d43339 ("bpf: Optimize call_rcu in non-preallocated hash map.")
>>> is reverted and call_rcu is in place,
>>> but percpu counter optimization is still there.
>>> Also please use 'map_perf_test 4'.
>>> I doubt 1000 vs 10240 will make a difference, but still.
>>>
>> I have tried the following two setups:
>> (1) Don't use bpf_mem_alloc in hash-map and use per-cpu counter in hash-map
>> # Samples: 1M of event 'cycles:ppp'
>> # Event count (approx.): 1041345723234
>> #
>> # Overhead  Command          Shared Object                                Symbol
>> # ........  ...............  ...........................................
>> ...............................................
>> #
>>     10.36%  map_perf_test    [kernel.vmlinux]                             [k]
>> bpf_map_get_memcg.isra.0
> That is per-cpu counter and it's consuming 10% ?!
> Something is really odd in your setup.
> A lot of debug configs?
Sorry for the late reply. Just back to work from a long vacation.

My local .config is derived from the Fedora distribution. It indeed has some DEBUG
related configs. Will turn these configs off and check again :)
>>      9.82%  map_perf_test    [kernel.vmlinux]                             [k]
>> bpf_map_kmalloc_node
>>      4.24%  map_perf_test    [kernel.vmlinux]                             [k]
>> check_preemption_disabled
> clearly debug build.
> Please use production build.
check_preemption_disabled is due to CONFIG_DEBUG_PREEMPT, which is enabled in the
Fedora distribution.
>>      2.86%  map_perf_test    [kernel.vmlinux]                             [k]
>> htab_map_update_elem
>>      2.80%  map_perf_test    [kernel.vmlinux]                             [k]
>> __kmalloc_node
>>      2.72%  map_perf_test    [kernel.vmlinux]                             [k]
>> htab_map_delete_elem
>>      2.30%  map_perf_test    [kernel.vmlinux]                             [k]
>> memcg_slab_post_alloc_hook
>>      2.21%  map_perf_test    [kernel.vmlinux]                             [k]
>> entry_SYSCALL_64
>>      2.17%  map_perf_test    [kernel.vmlinux]                             [k]
>> syscall_exit_to_user_mode
>>      2.12%  map_perf_test    [kernel.vmlinux]                             [k] jhash
>>      2.11%  map_perf_test    [kernel.vmlinux]                             [k]
>> syscall_return_via_sysret
>>      2.05%  map_perf_test    [kernel.vmlinux]                             [k]
>> alloc_htab_elem
>>      1.94%  map_perf_test    [kernel.vmlinux]                             [k]
>> _raw_spin_lock_irqsave
>>      1.92%  map_perf_test    [kernel.vmlinux]                             [k]
>> preempt_count_add
>>      1.92%  map_perf_test    [kernel.vmlinux]                             [k]
>> preempt_count_sub
>>      1.87%  map_perf_test    [kernel.vmlinux]                             [k]
>> call_rcu
SNIP
>> Maybe add a not-immediate-reuse flag support to bpf_mem_alloc is reason. What do
>> you think ?
> We've discussed it twice already. It's not an option due to OOM
> and performance considerations.
> call_rcu doesn't scale to millions a second.
Understood. I was just trying to understand the exact performance overhead of
call_rcu(). If the overhead of the map operations is much greater than the overhead
of call_rcu(), I think calling call_rcu() one million times a second will not be a
problem, and it also makes the implementation of qp-trie much simpler. The
OOM problem is indeed a problem, although it is also possible in the current
implementation, so I will try to implement a lookup procedure which handles
the reuse problem.

Regards.
Tao
> .
Alexei Starovoitov Oct. 8, 2022, 1:59 a.m. UTC | #14
On Fri, Oct 7, 2022 at 6:56 PM Hou Tao <houtao@huaweicloud.com> wrote:
>
> Hi,
>
> On 9/29/2022 11:22 AM, Alexei Starovoitov wrote:
> > On Wed, Sep 28, 2022 at 1:46 AM Hou Tao <houtao@huaweicloud.com> wrote:
> >> Hi,
> >>
> >> On 9/28/2022 9:08 AM, Alexei Starovoitov wrote:
> >>> On Tue, Sep 27, 2022 at 7:08 AM Hou Tao <houtao@huaweicloud.com> wrote:
> >>>
> >>> Looks like the perf is lost on atomic_inc/dec.
> >>> Try a partial revert of mem_alloc.
> >>> In particular to make sure
> >>> commit 0fd7c5d43339 ("bpf: Optimize call_rcu in non-preallocated hash map.")
> >>> is reverted and call_rcu is in place,
> >>> but percpu counter optimization is still there.
> >>> Also please use 'map_perf_test 4'.
> >>> I doubt 1000 vs 10240 will make a difference, but still.
> >>>
> >> I have tried the following two setups:
> >> (1) Don't use bpf_mem_alloc in hash-map and use per-cpu counter in hash-map
> >> # Samples: 1M of event 'cycles:ppp'
> >> # Event count (approx.): 1041345723234
> >> #
> >> # Overhead  Command          Shared Object                                Symbol
> >> # ........  ...............  ...........................................
> >> ...............................................
> >> #
> >>     10.36%  map_perf_test    [kernel.vmlinux]                             [k]
> >> bpf_map_get_memcg.isra.0
> > That is per-cpu counter and it's consuming 10% ?!
> > Something is really odd in your setup.
> > A lot of debug configs?
> Sorry for the late reply. Just back to work from a long vacation.
>
> My local .config is derived from Fedora distribution. It indeed has some DEBUG
> related configs. Will turn these configs off to check it again :)
> >>      9.82%  map_perf_test    [kernel.vmlinux]                             [k]
> >> bpf_map_kmalloc_node
> >>      4.24%  map_perf_test    [kernel.vmlinux]                             [k]
> >> check_preemption_disabled
> > clearly debug build.
> > Please use production build.
> check_preemption_disabled is due to CONFIG_DEBUG_PREEMPT. And it is enabled on
> Fedora distribution.
> >>      2.86%  map_perf_test    [kernel.vmlinux]                             [k]
> >> htab_map_update_elem
> >>      2.80%  map_perf_test    [kernel.vmlinux]                             [k]
> >> __kmalloc_node
> >>      2.72%  map_perf_test    [kernel.vmlinux]                             [k]
> >> htab_map_delete_elem
> >>      2.30%  map_perf_test    [kernel.vmlinux]                             [k]
> >> memcg_slab_post_alloc_hook
> >>      2.21%  map_perf_test    [kernel.vmlinux]                             [k]
> >> entry_SYSCALL_64
> >>      2.17%  map_perf_test    [kernel.vmlinux]                             [k]
> >> syscall_exit_to_user_mode
> >>      2.12%  map_perf_test    [kernel.vmlinux]                             [k] jhash
> >>      2.11%  map_perf_test    [kernel.vmlinux]                             [k]
> >> syscall_return_via_sysret
> >>      2.05%  map_perf_test    [kernel.vmlinux]                             [k]
> >> alloc_htab_elem
> >>      1.94%  map_perf_test    [kernel.vmlinux]                             [k]
> >> _raw_spin_lock_irqsave
> >>      1.92%  map_perf_test    [kernel.vmlinux]                             [k]
> >> preempt_count_add
> >>      1.92%  map_perf_test    [kernel.vmlinux]                             [k]
> >> preempt_count_sub
> >>      1.87%  map_perf_test    [kernel.vmlinux]                             [k]
> >> call_rcu
> SNIP
> >> Maybe add a not-immediate-reuse flag support to bpf_mem_alloc is reason. What do
> >> you think ?
> > We've discussed it twice already. It's not an option due to OOM
> > and performance considerations.
> > call_rcu doesn't scale to millions a second.
> Understand. I was just trying to understand the exact performance overhead of
> call_rcu(). If the overhead of map operations are much greater than the overhead
> of call_rcu(), I think calling call_rcu() one millions a second will be not a
> problem and  it also makes the implementation of qp-trie being much simpler. The
> OOM problem is indeed a problem, although it is also possible for the current
> implementation, so I will try to implement the lookup procedure which handles
> the reuse problem.

call_rcu is not just that particular function.
It's all the work rcu subsystem needs to do to observe gp
and execute that callback. Just see how many kthreads it will
start when overloaded like this.
Paul E. McKenney Oct. 8, 2022, 1:22 p.m. UTC | #15
On Fri, Oct 07, 2022 at 06:59:08PM -0700, Alexei Starovoitov wrote:
> On Fri, Oct 7, 2022 at 6:56 PM Hou Tao <houtao@huaweicloud.com> wrote:
> >
> > Hi,
> >
> > On 9/29/2022 11:22 AM, Alexei Starovoitov wrote:
> > > On Wed, Sep 28, 2022 at 1:46 AM Hou Tao <houtao@huaweicloud.com> wrote:
> > >> Hi,
> > >>
> > >> On 9/28/2022 9:08 AM, Alexei Starovoitov wrote:
> > >>> On Tue, Sep 27, 2022 at 7:08 AM Hou Tao <houtao@huaweicloud.com> wrote:
> > >>>
> > >>> Looks like the perf is lost on atomic_inc/dec.
> > >>> Try a partial revert of mem_alloc.
> > >>> In particular to make sure
> > >>> commit 0fd7c5d43339 ("bpf: Optimize call_rcu in non-preallocated hash map.")
> > >>> is reverted and call_rcu is in place,
> > >>> but percpu counter optimization is still there.
> > >>> Also please use 'map_perf_test 4'.
> > >>> I doubt 1000 vs 10240 will make a difference, but still.
> > >>>
> > >> I have tried the following two setups:
> > >> (1) Don't use bpf_mem_alloc in hash-map and use per-cpu counter in hash-map
> > >> # Samples: 1M of event 'cycles:ppp'
> > >> # Event count (approx.): 1041345723234
> > >> #
> > >> # Overhead  Command          Shared Object                                Symbol
> > >> # ........  ...............  ...........................................
> > >> ...............................................
> > >> #
> > >>     10.36%  map_perf_test    [kernel.vmlinux]                             [k]
> > >> bpf_map_get_memcg.isra.0
> > > That is per-cpu counter and it's consuming 10% ?!
> > > Something is really odd in your setup.
> > > A lot of debug configs?
> > Sorry for the late reply. Just back to work from a long vacation.
> >
> > My local .config is derived from Fedora distribution. It indeed has some DEBUG
> > related configs. Will turn these configs off to check it again :)
> > >>      9.82%  map_perf_test    [kernel.vmlinux]                             [k]
> > >> bpf_map_kmalloc_node
> > >>      4.24%  map_perf_test    [kernel.vmlinux]                             [k]
> > >> check_preemption_disabled
> > > clearly debug build.
> > > Please use production build.
> > check_preemption_disabled is due to CONFIG_DEBUG_PREEMPT. And it is enabled on
> > Fedora distribution.
> > >>      2.86%  map_perf_test    [kernel.vmlinux]                             [k]
> > >> htab_map_update_elem
> > >>      2.80%  map_perf_test    [kernel.vmlinux]                             [k]
> > >> __kmalloc_node
> > >>      2.72%  map_perf_test    [kernel.vmlinux]                             [k]
> > >> htab_map_delete_elem
> > >>      2.30%  map_perf_test    [kernel.vmlinux]                             [k]
> > >> memcg_slab_post_alloc_hook
> > >>      2.21%  map_perf_test    [kernel.vmlinux]                             [k]
> > >> entry_SYSCALL_64
> > >>      2.17%  map_perf_test    [kernel.vmlinux]                             [k]
> > >> syscall_exit_to_user_mode
> > >>      2.12%  map_perf_test    [kernel.vmlinux]                             [k] jhash
> > >>      2.11%  map_perf_test    [kernel.vmlinux]                             [k]
> > >> syscall_return_via_sysret
> > >>      2.05%  map_perf_test    [kernel.vmlinux]                             [k]
> > >> alloc_htab_elem
> > >>      1.94%  map_perf_test    [kernel.vmlinux]                             [k]
> > >> _raw_spin_lock_irqsave
> > >>      1.92%  map_perf_test    [kernel.vmlinux]                             [k]
> > >> preempt_count_add
> > >>      1.92%  map_perf_test    [kernel.vmlinux]                             [k]
> > >> preempt_count_sub
> > >>      1.87%  map_perf_test    [kernel.vmlinux]                             [k]
> > >> call_rcu
> > SNIP
> > >> Maybe add a not-immediate-reuse flag support to bpf_mem_alloc is reason. What do
> > >> you think ?
> > > We've discussed it twice already. It's not an option due to OOM
> > > and performance considerations.
> > > call_rcu doesn't scale to millions a second.
> > Understand. I was just trying to understand the exact performance overhead of
> > call_rcu(). If the overhead of map operations are much greater than the overhead
> > of call_rcu(), I think calling call_rcu() one millions a second will be not a
> > problem and  it also makes the implementation of qp-trie being much simpler. The
> > OOM problem is indeed a problem, although it is also possible for the current
> > implementation, so I will try to implement the lookup procedure which handles
> > the reuse problem.
> 
> call_rcu is not just that particular function.
> It's all the work rcu subsystem needs to do to observe gp
> and execute that callback. Just see how many kthreads it will
> start when overloaded like this.

The kthreads to watch include rcu_preempt, rcu_sched, ksoftirqd*, rcuc*,
and rcuo*.  There is also the back-of-interrupt softirq context, which
requires some care to measure accurately.

The possibility of SLAB_TYPESAFE_BY_RCU has been discussed.  I take it
that the per-element locking overhead for exact iterations was a problem?
If so, what exactly are the consistency rules for iteration?  Presumably
stronger than "if the element existed throughout, it is included in the
iteration; if it did not exist throughout, it is not included; otherwise
it might or might not be included" given that you get that for free.

Either way, could you please tell me the exact iteration rules?

							Thanx, Paul
Alexei Starovoitov Oct. 8, 2022, 4:40 p.m. UTC | #16
On Sat, Oct 8, 2022 at 6:22 AM Paul E. McKenney <paulmck@kernel.org> wrote:
>
> On Fri, Oct 07, 2022 at 06:59:08PM -0700, Alexei Starovoitov wrote:
> > On Fri, Oct 7, 2022 at 6:56 PM Hou Tao <houtao@huaweicloud.com> wrote:
> > >
> > > Hi,
> > >
> > > On 9/29/2022 11:22 AM, Alexei Starovoitov wrote:
> > > > On Wed, Sep 28, 2022 at 1:46 AM Hou Tao <houtao@huaweicloud.com> wrote:
> > > >> Hi,
> > > >>
> > > >> On 9/28/2022 9:08 AM, Alexei Starovoitov wrote:
> > > >>> On Tue, Sep 27, 2022 at 7:08 AM Hou Tao <houtao@huaweicloud.com> wrote:
> > > >>>
> > > >>> Looks like the perf is lost on atomic_inc/dec.
> > > >>> Try a partial revert of mem_alloc.
> > > >>> In particular to make sure
> > > >>> commit 0fd7c5d43339 ("bpf: Optimize call_rcu in non-preallocated hash map.")
> > > >>> is reverted and call_rcu is in place,
> > > >>> but percpu counter optimization is still there.
> > > >>> Also please use 'map_perf_test 4'.
> > > >>> I doubt 1000 vs 10240 will make a difference, but still.
> > > >>>
> > > >> I have tried the following two setups:
> > > >> (1) Don't use bpf_mem_alloc in hash-map and use per-cpu counter in hash-map
> > > >> # Samples: 1M of event 'cycles:ppp'
> > > >> # Event count (approx.): 1041345723234
> > > >> #
> > > >> # Overhead  Command          Shared Object                                Symbol
> > > >> # ........  ...............  ...........................................
> > > >> ...............................................
> > > >> #
> > > >>     10.36%  map_perf_test    [kernel.vmlinux]                             [k]
> > > >> bpf_map_get_memcg.isra.0
> > > > That is per-cpu counter and it's consuming 10% ?!
> > > > Something is really odd in your setup.
> > > > A lot of debug configs?
> > > Sorry for the late reply. Just back to work from a long vacation.
> > >
> > > My local .config is derived from Fedora distribution. It indeed has some DEBUG
> > > related configs. Will turn these configs off to check it again :)
> > > >>      9.82%  map_perf_test    [kernel.vmlinux]                             [k]
> > > >> bpf_map_kmalloc_node
> > > >>      4.24%  map_perf_test    [kernel.vmlinux]                             [k]
> > > >> check_preemption_disabled
> > > > clearly debug build.
> > > > Please use production build.
> > > check_preemption_disabled is due to CONFIG_DEBUG_PREEMPT. And it is enabled on
> > > Fedora distribution.
> > > >>      2.86%  map_perf_test    [kernel.vmlinux]                             [k]
> > > >> htab_map_update_elem
> > > >>      2.80%  map_perf_test    [kernel.vmlinux]                             [k]
> > > >> __kmalloc_node
> > > >>      2.72%  map_perf_test    [kernel.vmlinux]                             [k]
> > > >> htab_map_delete_elem
> > > >>      2.30%  map_perf_test    [kernel.vmlinux]                             [k]
> > > >> memcg_slab_post_alloc_hook
> > > >>      2.21%  map_perf_test    [kernel.vmlinux]                             [k]
> > > >> entry_SYSCALL_64
> > > >>      2.17%  map_perf_test    [kernel.vmlinux]                             [k]
> > > >> syscall_exit_to_user_mode
> > > >>      2.12%  map_perf_test    [kernel.vmlinux]                             [k] jhash
> > > >>      2.11%  map_perf_test    [kernel.vmlinux]                             [k]
> > > >> syscall_return_via_sysret
> > > >>      2.05%  map_perf_test    [kernel.vmlinux]                             [k]
> > > >> alloc_htab_elem
> > > >>      1.94%  map_perf_test    [kernel.vmlinux]                             [k]
> > > >> _raw_spin_lock_irqsave
> > > >>      1.92%  map_perf_test    [kernel.vmlinux]                             [k]
> > > >> preempt_count_add
> > > >>      1.92%  map_perf_test    [kernel.vmlinux]                             [k]
> > > >> preempt_count_sub
> > > >>      1.87%  map_perf_test    [kernel.vmlinux]                             [k]
> > > >> call_rcu
> > > SNIP
> > > >> Maybe add a not-immediate-reuse flag support to bpf_mem_alloc is reason. What do
> > > >> you think ?
> > > > We've discussed it twice already. It's not an option due to OOM
> > > > and performance considerations.
> > > > call_rcu doesn't scale to millions a second.
> > > Understand. I was just trying to understand the exact performance overhead of
> > > call_rcu(). If the overhead of map operations are much greater than the overhead
> > > of call_rcu(), I think calling call_rcu() one millions a second will be not a
> > > problem and  it also makes the implementation of qp-trie being much simpler. The
> > > OOM problem is indeed a problem, although it is also possible for the current
> > > implementation, so I will try to implement the lookup procedure which handles
> > > the reuse problem.
> >
> > call_rcu is not just that particular function.
> > It's all the work rcu subsystem needs to do to observe gp
> > and execute that callback. Just see how many kthreads it will
> > start when overloaded like this.
>
> The kthreads to watch include rcu_preempt, rcu_sched, ksoftirqd*, rcuc*,
> and rcuo*.  There is also the back-of-interrupt softirq context, which
> requires some care to measure accurately.
>
> The possibility of SLAB_TYPESAFE_BY_RCU has been discussed.  I take it
> that the per-element locking overhead for exact iterations was a problem?
> If so, what exactly are the consistency rules for iteration?  Presumably
> stronger than "if the element existed throughout, it is included in the
> iteration; if it did not exist throughout, it is not included; otherwise
> it might or might not be included" given that you get that for free.
>
> Either way, could you please tell me the exact iteration rules?

The rules are the way we make them to be.
iteration will be under lock.
lookup needs to be correct. It can retry if necessary (like htab is doing).
Randomly returning 'noexist' is, of course, not acceptable.
Paul E. McKenney Oct. 8, 2022, 8:11 p.m. UTC | #17
On Sat, Oct 08, 2022 at 09:40:04AM -0700, Alexei Starovoitov wrote:
> On Sat, Oct 8, 2022 at 6:22 AM Paul E. McKenney <paulmck@kernel.org> wrote:
> >
> > On Fri, Oct 07, 2022 at 06:59:08PM -0700, Alexei Starovoitov wrote:
> > > On Fri, Oct 7, 2022 at 6:56 PM Hou Tao <houtao@huaweicloud.com> wrote:
> > > >
> > > > Hi,
> > > >
> > > > On 9/29/2022 11:22 AM, Alexei Starovoitov wrote:
> > > > > On Wed, Sep 28, 2022 at 1:46 AM Hou Tao <houtao@huaweicloud.com> wrote:
> > > > >> Hi,
> > > > >>
> > > > >> On 9/28/2022 9:08 AM, Alexei Starovoitov wrote:
> > > > >>> On Tue, Sep 27, 2022 at 7:08 AM Hou Tao <houtao@huaweicloud.com> wrote:
> > > > >>>
> > > > >>> Looks like the perf is lost on atomic_inc/dec.
> > > > >>> Try a partial revert of mem_alloc.
> > > > >>> In particular to make sure
> > > > >>> commit 0fd7c5d43339 ("bpf: Optimize call_rcu in non-preallocated hash map.")
> > > > >>> is reverted and call_rcu is in place,
> > > > >>> but percpu counter optimization is still there.
> > > > >>> Also please use 'map_perf_test 4'.
> > > > >>> I doubt 1000 vs 10240 will make a difference, but still.
> > > > >>>
> > > > >> I have tried the following two setups:
> > > > >> (1) Don't use bpf_mem_alloc in hash-map and use per-cpu counter in hash-map
> > > > >> # Samples: 1M of event 'cycles:ppp'
> > > > >> # Event count (approx.): 1041345723234
> > > > >> #
> > > > >> # Overhead  Command          Shared Object                                Symbol
> > > > >> # ........  ...............  ...........................................
> > > > >> ...............................................
> > > > >> #
> > > > >>     10.36%  map_perf_test    [kernel.vmlinux]                             [k]
> > > > >> bpf_map_get_memcg.isra.0
> > > > > That is per-cpu counter and it's consuming 10% ?!
> > > > > Something is really odd in your setup.
> > > > > A lot of debug configs?
> > > > Sorry for the late reply. Just back to work from a long vacation.
> > > >
> > > > My local .config is derived from Fedora distribution. It indeed has some DEBUG
> > > > related configs. Will turn these configs off to check it again :)
> > > > >>      9.82%  map_perf_test    [kernel.vmlinux]                             [k]
> > > > >> bpf_map_kmalloc_node
> > > > >>      4.24%  map_perf_test    [kernel.vmlinux]                             [k]
> > > > >> check_preemption_disabled
> > > > > clearly debug build.
> > > > > Please use production build.
> > > > check_preemption_disabled is due to CONFIG_DEBUG_PREEMPT. And it is enabled on
> > > > Fedora distribution.
> > > > >>      2.86%  map_perf_test    [kernel.vmlinux]                             [k]
> > > > >> htab_map_update_elem
> > > > >>      2.80%  map_perf_test    [kernel.vmlinux]                             [k]
> > > > >> __kmalloc_node
> > > > >>      2.72%  map_perf_test    [kernel.vmlinux]                             [k]
> > > > >> htab_map_delete_elem
> > > > >>      2.30%  map_perf_test    [kernel.vmlinux]                             [k]
> > > > >> memcg_slab_post_alloc_hook
> > > > >>      2.21%  map_perf_test    [kernel.vmlinux]                             [k]
> > > > >> entry_SYSCALL_64
> > > > >>      2.17%  map_perf_test    [kernel.vmlinux]                             [k]
> > > > >> syscall_exit_to_user_mode
> > > > >>      2.12%  map_perf_test    [kernel.vmlinux]                             [k] jhash
> > > > >>      2.11%  map_perf_test    [kernel.vmlinux]                             [k]
> > > > >> syscall_return_via_sysret
> > > > >>      2.05%  map_perf_test    [kernel.vmlinux]                             [k]
> > > > >> alloc_htab_elem
> > > > >>      1.94%  map_perf_test    [kernel.vmlinux]                             [k]
> > > > >> _raw_spin_lock_irqsave
> > > > >>      1.92%  map_perf_test    [kernel.vmlinux]                             [k]
> > > > >> preempt_count_add
> > > > >>      1.92%  map_perf_test    [kernel.vmlinux]                             [k]
> > > > >> preempt_count_sub
> > > > >>      1.87%  map_perf_test    [kernel.vmlinux]                             [k]
> > > > >> call_rcu
> > > > SNIP
> > > > >> Maybe add a not-immediate-reuse flag support to bpf_mem_alloc is reason. What do
> > > > >> you think ?
> > > > > We've discussed it twice already. It's not an option due to OOM
> > > > > and performance considerations.
> > > > > call_rcu doesn't scale to millions a second.
> > > > Understand. I was just trying to understand the exact performance overhead of
> > > > call_rcu(). If the overhead of map operations are much greater than the overhead
> > > > of call_rcu(), I think calling call_rcu() one millions a second will be not a
> > > > problem and  it also makes the implementation of qp-trie being much simpler. The
> > > > OOM problem is indeed a problem, although it is also possible for the current
> > > > implementation, so I will try to implement the lookup procedure which handles
> > > > the reuse problem.
> > >
> > > call_rcu is not just that particular function.
> > > It's all the work rcu subsystem needs to do to observe gp
> > > and execute that callback. Just see how many kthreads it will
> > > start when overloaded like this.
> >
> > The kthreads to watch include rcu_preempt, rcu_sched, ksoftirqd*, rcuc*,
> > and rcuo*.  There is also the back-of-interrupt softirq context, which
> > requires some care to measure accurately.
> >
> > The possibility of SLAB_TYPESAFE_BY_RCU has been discussed.  I take it
> > that the per-element locking overhead for exact iterations was a problem?
> > If so, what exactly are the consistency rules for iteration?  Presumably
> > stronger than "if the element existed throughout, it is included in the
> > iteration; if it did not exist throughout, it is not included; otherwise
> > it might or might not be included" given that you get that for free.
> >
> > Either way, could you please tell me the exact iteration rules?
> 
> The rules are the way we make them to be.
> iteration will be under lock.
> lookup needs to be correct. It can retry if necessary (like htab is doing).
> Randomly returning 'noexist' is, of course, not acceptable.

OK, so then it is important that updates to this data structure be
carried out in such a way as to avoid discombobulating lockless readers.
Do the updates have that property?

The usual way to get that property is to leave the old search structure
around, replacing it with the new one, and RCU-freeing the old one.
In case it helps, Kung and Lehman describe how to do that for search trees:

http://www.eecs.harvard.edu/~htk/publication/1980-tods-kung-lehman.pdf

							Thanx, Paul
Hou Tao Oct. 9, 2022, 1:09 a.m. UTC | #18
Hi Paul,

On 10/9/2022 4:11 AM, Paul E. McKenney wrote:
> On Sat, Oct 08, 2022 at 09:40:04AM -0700, Alexei Starovoitov wrote:
>> On Sat, Oct 8, 2022 at 6:22 AM Paul E. McKenney <paulmck@kernel.org> wrote:
>>> On Fri, Oct 07, 2022 at 06:59:08PM -0700, Alexei Starovoitov wrote:
SNIP
>>>>> Understand. I was just trying to understand the exact performance overhead of
>>>>> call_rcu(). If the overhead of map operations are much greater than the overhead
>>>>> of call_rcu(), I think calling call_rcu() one millions a second will be not a
>>>>> problem and  it also makes the implementation of qp-trie being much simpler. The
>>>>> OOM problem is indeed a problem, although it is also possible for the current
>>>>> implementation, so I will try to implement the lookup procedure which handles
>>>>> the reuse problem.
>>>> call_rcu is not just that particular function.
>>>> It's all the work rcu subsystem needs to do to observe gp
>>>> and execute that callback. Just see how many kthreads it will
>>>> start when overloaded like this.
>>> The kthreads to watch include rcu_preempt, rcu_sched, ksoftirqd*, rcuc*,
>>> and rcuo*.  There is also the back-of-interrupt softirq context, which
>>> requires some care to measure accurately.
>>>
>>> The possibility of SLAB_TYPESAFE_BY_RCU has been discussed.  I take it
>>> that the per-element locking overhead for exact iterations was a problem?
>>> If so, what exactly are the consistency rules for iteration?  Presumably
>>> stronger than "if the element existed throughout, it is included in the
>>> iteration; if it did not exist throughout, it is not included; otherwise
>>> it might or might not be included" given that you get that for free.
>>>
>>> Either way, could you please tell me the exact iteration rules?
>> The rules are the way we make them to be.
>> iteration will be under lock.
>> lookup needs to be correct. It can retry if necessary (like htab is doing).
>> Randomly returning 'noexist' is, of course, not acceptable.
> OK, so then it is important that updates to this data structure be
> carried out in such a way as to avoid discombobulating lockless readers.
> Do the updates have that property?
Yes. The update procedure will copy the old pointer array to a new array
first, then update the new array and replace the pointer to the old array
with the pointer to the new array.
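
For illustration, a minimal sketch of that copy-then-publish step, assuming
made-up helper and field names (dup_branch(), replace_child(),
free_branch_rcu(), ->nodes[] and ->rcu are placeholders, not the identifiers
used in the patch):

  /* Called with the subtree lock held.  Lockless readers either see the
   * complete old node or the complete new one, never a mix. */
  static int cow_update_child(struct qp_trie_branch *parent, unsigned int slot,
                              struct qp_trie_branch *new_child)
  {
          struct qp_trie_branch *old = parent->nodes[slot];
          struct qp_trie_branch *copy = dup_branch(old); /* copy old pointer array */

          if (!copy)
                  return -ENOMEM;
          replace_child(copy, new_child);                /* modify only the copy  */
          rcu_assign_pointer(parent->nodes[slot], copy); /* publish the new array */
          call_rcu(&old->rcu, free_branch_rcu);          /* old outlives readers  */
          return 0;
  }
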
>
> The usual way to get that property is to leave the old search structure
> around, replacing it with the new one, and RCU-freeing the old one.
> In case it helps, Kung and Lehman describe how to do that for search trees:
>
> http://www.eecs.harvard.edu/~htk/publication/1980-tods-kung-lehman.pdf
Thanks for the paper. From a quick skim, it seems to use reference counting
and garbage collection to solve the safe memory reclamation problem. That may
be too heavy for qp-trie, so we plan to use a seqcount-like way to check
whether the branch and leaf nodes were reused during lookup, and to retry the
lookup if that happened. I am still checking the feasibility of the solution,
and it seems a little more complicated than expected.
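
To make the plan concrete, a rough sketch of that seqcount-like check (purely
illustrative; 'reuse_seq', is_branch(), to_branch(), descend_one() and the
node layout are assumptions, not the planned implementation):

  /* Each visited node is revalidated with a seqcount-style version, so a
   * node that was freed and reused underneath the lookup forces a retry
   * of the whole walk. */
  struct qp_trie_node *node, *next;

retry:
  node = rcu_dereference(trie->root);
  while (node && is_branch(node)) {
          struct qp_trie_branch *br = to_branch(node);
          unsigned int seq = read_seqcount_begin(&br->reuse_seq);

          next = descend_one(br, key);
          if (read_seqcount_retry(&br->reuse_seq, seq))
                  goto retry;             /* node was reused, start over */
          node = next;
  }
  /* a final key comparison against the leaf is still required */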

>
> 							Thanx, Paul
> .
Paul E. McKenney Oct. 9, 2022, 9:05 a.m. UTC | #19
On Sun, Oct 09, 2022 at 09:09:44AM +0800, Hou Tao wrote:
> Hi Paul,
> 
> On 10/9/2022 4:11 AM, Paul E. McKenney wrote:
> > On Sat, Oct 08, 2022 at 09:40:04AM -0700, Alexei Starovoitov wrote:
> >> On Sat, Oct 8, 2022 at 6:22 AM Paul E. McKenney <paulmck@kernel.org> wrote:
> >>> On Fri, Oct 07, 2022 at 06:59:08PM -0700, Alexei Starovoitov wrote:
> SNIP
> >>>>> Understand. I was just trying to understand the exact performance overhead of
> >>>>> call_rcu(). If the overhead of map operations are much greater than the overhead
> >>>>> of call_rcu(), I think calling call_rcu() one millions a second will be not a
> >>>>> problem and  it also makes the implementation of qp-trie being much simpler. The
> >>>>> OOM problem is indeed a problem, although it is also possible for the current
> >>>>> implementation, so I will try to implement the lookup procedure which handles
> >>>>> the reuse problem.
> >>>> call_rcu is not just that particular function.
> >>>> It's all the work rcu subsystem needs to do to observe gp
> >>>> and execute that callback. Just see how many kthreads it will
> >>>> start when overloaded like this.
> >>> The kthreads to watch include rcu_preempt, rcu_sched, ksoftirqd*, rcuc*,
> >>> and rcuo*.  There is also the back-of-interrupt softirq context, which
> >>> requires some care to measure accurately.
> >>>
> >>> The possibility of SLAB_TYPESAFE_BY_RCU has been discussed.  I take it
> >>> that the per-element locking overhead for exact iterations was a problem?
> >>> If so, what exactly are the consistency rules for iteration?  Presumably
> >>> stronger than "if the element existed throughout, it is included in the
> >>> iteration; if it did not exist throughout, it is not included; otherwise
> >>> it might or might not be included" given that you get that for free.
> >>>
> >>> Either way, could you please tell me the exact iteration rules?
> >> The rules are the way we make them to be.
> >> iteration will be under lock.
> >> lookup needs to be correct. It can retry if necessary (like htab is doing).
> >> Randomly returning 'noexist' is, of course, not acceptable.
> > OK, so then it is important that updates to this data structure be
> > carried out in such a way as to avoid discombobulating lockless readers.
> > Do the updates have that property?
> 
> Yes. The update procedure will copy the old pointer array to a new array first,
> then update the new array and replace the pointer of old array by the pointer of
> new array.

Very good.  But then why is there a problem?  Is the iteration using
multiple RCU read-side critical sections or something?

> > The usual way to get that property is to leave the old search structure
> > around, replacing it with the new one, and RCU-freeing the old one.
> > In case it helps, Kung and Lehman describe how to do that for search trees:
> >
> > http://www.eecs.harvard.edu/~htk/publication/1980-tods-kung-lehman.pdf
> Thanks for the paper. Just skimming through it, it seems that it uses
> reference-counting and garbage collection to solve the safe memory reclamation
> problem. It may be too heavy for qp-trie and we plan to use seqcount-like way to
> check whether or not the branch and the leaf node is reused during lookup, and
> retry the lookup if it happened. Now just checking the feasibility of the
> solution and it seems a little complicated than expected.

The main thing in that paper is the handling of rotations in the
search-tree update.  But if you are not using a tree, that won't be all
that relevant.

								Thanx, Paul
Hou Tao Oct. 9, 2022, 10:45 a.m. UTC | #20
Hi,

On 10/9/2022 5:05 PM, Paul E. McKenney wrote:
> On Sun, Oct 09, 2022 at 09:09:44AM +0800, Hou Tao wrote:
>> Hi Paul,
>>
>> On 10/9/2022 4:11 AM, Paul E. McKenney wrote:
>>> On Sat, Oct 08, 2022 at 09:40:04AM -0700, Alexei Starovoitov wrote:
>>>> On Sat, Oct 8, 2022 at 6:22 AM Paul E. McKenney <paulmck@kernel.org> wrote:
>>>>> On Fri, Oct 07, 2022 at 06:59:08PM -0700, Alexei Starovoitov wrote:
>> SNIP
>>>>>>> Understand. I was just trying to understand the exact performance overhead of
>>>>>>> call_rcu(). If the overhead of map operations are much greater than the overhead
>>>>>>> of call_rcu(), I think calling call_rcu() one millions a second will be not a
>>>>>>> problem and  it also makes the implementation of qp-trie being much simpler. The
>>>>>>> OOM problem is indeed a problem, although it is also possible for the current
>>>>>>> implementation, so I will try to implement the lookup procedure which handles
>>>>>>> the reuse problem.
>>>>>> call_rcu is not just that particular function.
>>>>>> It's all the work rcu subsystem needs to do to observe gp
>>>>>> and execute that callback. Just see how many kthreads it will
>>>>>> start when overloaded like this.
>>>>> The kthreads to watch include rcu_preempt, rcu_sched, ksoftirqd*, rcuc*,
>>>>> and rcuo*.  There is also the back-of-interrupt softirq context, which
>>>>> requires some care to measure accurately.
>>>>>
>>>>> The possibility of SLAB_TYPESAFE_BY_RCU has been discussed.  I take it
>>>>> that the per-element locking overhead for exact iterations was a problem?
>>>>> If so, what exactly are the consistency rules for iteration?  Presumably
>>>>> stronger than "if the element existed throughout, it is included in the
>>>>> iteration; if it did not exist throughout, it is not included; otherwise
>>>>> it might or might not be included" given that you get that for free.
>>>>>
>>>>> Either way, could you please tell me the exact iteration rules?
>>>> The rules are the way we make them to be.
>>>> iteration will be under lock.
>>>> lookup needs to be correct. It can retry if necessary (like htab is doing).
>>>> Randomly returning 'noexist' is, of course, not acceptable.
>>> OK, so then it is important that updates to this data structure be
>>> carried out in such a way as to avoid discombobulating lockless readers.
>>> Do the updates have that property?
>> Yes. The update procedure will copy the old pointer array to a new array first,
>> then update the new array and replace the pointer of old array by the pointer of
>> new array.
> Very good.  But then why is there a problem?  Is the iteration using
> multiple RCU read-side critical sections or something?
The problem is that although the objects are RCU-freed, they can also be
reused immediately by the bpf memory allocator. The immediate reuse is done
for performance and to reduce the possibility of OOM. Because an object can
be reused during an RCU-protected lookup (even though the possibility of
reuse is low), the lookup procedure needs to check whether reuse happened
during the lookup. And I was arguing with Alexei about whether or not it is
reasonable to provide an optional flag to disable the immediate reuse in the
bpf memory allocator.
>
>>> The usual way to get that property is to leave the old search structure
>>> around, replacing it with the new one, and RCU-freeing the old one.
>>> In case it helps, Kung and Lehman describe how to do that for search trees:
>>>
>>> http://www.eecs.harvard.edu/~htk/publication/1980-tods-kung-lehman.pdf
>> Thanks for the paper. Just skimming through it, it seems that it uses
>> reference-counting and garbage collection to solve the safe memory reclamation
>> problem. It may be too heavy for qp-trie and we plan to use seqcount-like way to
>> check whether or not the branch and the leaf node is reused during lookup, and
>> retry the lookup if it happened. Now just checking the feasibility of the
>> solution and it seems a little complicated than expected.
> The main thing in that paper is the handling of rotations in the
> search-tree update.  But if you are not using a tree, that won't be all
> that relevant.
I see. Thanks for the explanation.
>
> 								Thanx, Paul
Paul E. McKenney Oct. 9, 2022, 11:04 a.m. UTC | #21
On Sun, Oct 09, 2022 at 06:45:22PM +0800, Hou Tao wrote:
> Hi,
> On 10/9/2022 5:05 PM, Paul E. McKenney wrote:
> > On Sun, Oct 09, 2022 at 09:09:44AM +0800, Hou Tao wrote:
> >> Hi Paul,
> >>
> >> On 10/9/2022 4:11 AM, Paul E. McKenney wrote:
> >>> On Sat, Oct 08, 2022 at 09:40:04AM -0700, Alexei Starovoitov wrote:
> >>>> On Sat, Oct 8, 2022 at 6:22 AM Paul E. McKenney <paulmck@kernel.org> wrote:
> >>>>> On Fri, Oct 07, 2022 at 06:59:08PM -0700, Alexei Starovoitov wrote:
> >> SNIP
> >>>>>>> Understand. I was just trying to understand the exact performance overhead of
> >>>>>>> call_rcu(). If the overhead of map operations are much greater than the overhead
> >>>>>>> of call_rcu(), I think calling call_rcu() one millions a second will be not a
> >>>>>>> problem and  it also makes the implementation of qp-trie being much simpler. The
> >>>>>>> OOM problem is indeed a problem, although it is also possible for the current
> >>>>>>> implementation, so I will try to implement the lookup procedure which handles
> >>>>>>> the reuse problem.
> >>>>>> call_rcu is not just that particular function.
> >>>>>> It's all the work rcu subsystem needs to do to observe gp
> >>>>>> and execute that callback. Just see how many kthreads it will
> >>>>>> start when overloaded like this.
> >>>>> The kthreads to watch include rcu_preempt, rcu_sched, ksoftirqd*, rcuc*,
> >>>>> and rcuo*.  There is also the back-of-interrupt softirq context, which
> >>>>> requires some care to measure accurately.
> >>>>>
> >>>>> The possibility of SLAB_TYPESAFE_BY_RCU has been discussed.  I take it
> >>>>> that the per-element locking overhead for exact iterations was a problem?
> >>>>> If so, what exactly are the consistency rules for iteration?  Presumably
> >>>>> stronger than "if the element existed throughout, it is included in the
> >>>>> iteration; if it did not exist throughout, it is not included; otherwise
> >>>>> it might or might not be included" given that you get that for free.
> >>>>>
> >>>>> Either way, could you please tell me the exact iteration rules?
> >>>> The rules are the way we make them to be.
> >>>> iteration will be under lock.
> >>>> lookup needs to be correct. It can retry if necessary (like htab is doing).
> >>>> Randomly returning 'noexist' is, of course, not acceptable.
> >>> OK, so then it is important that updates to this data structure be
> >>> carried out in such a way as to avoid discombobulating lockless readers.
> >>> Do the updates have that property?
> >> Yes. The update procedure will copy the old pointer array to a new array first,
> >> then update the new array and replace the pointer of old array by the pointer of
> >> new array.
> > Very good.  But then why is there a problem?  Is the iteration using
> > multiple RCU read-side critical sections or something?
> 
> The problem is that although the objects are RCU-freed, but these object also
> can be reused immediately in bpf memory allocator. The reason for reuse is for
> performance and is to reduce the possibility of OOM. Because the object can be
> reused during RCU-protected lookup and the possibility of reuse is low, the
> lookup procedure needs to check whether reuse is happening during lookup. And I
> was arguing with Alexei about whether or no it is reasonable to provide an
> optional flag to remove the immediate reuse in bpf memory allocator.

Indeed, in that case there needs to be a check, for example as described
in the comment preceding the definition of SLAB_TYPESAFE_BY_RCU.

If the use of the element is read-only on the one hand or
heuristic/statistical on the other, lighter weight approaches are
possible.  Would that help?
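
For reference, that slab.h comment boils down to the following revalidation
pattern (sketched with generic placeholder helpers such as lockless_lookup()
and put_ref(), not qp-trie code):

  rcu_read_lock();
again:
  obj = lockless_lookup(key);                     /* may return a reused object */
  if (obj) {
          if (!atomic_inc_not_zero(&obj->refcnt)) /* object is mid-free */
                  goto again;
          if (obj->key != key) {                  /* reused for another key */
                  put_ref(obj);
                  goto again;
          }
  }
  rcu_read_unlock();
  /* obj, if non-NULL, is now pinned and known to match 'key' */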

							Thanx, Paul

> >>> The usual way to get that property is to leave the old search structure
> >>> around, replacing it with the new one, and RCU-freeing the old one.
> >>> In case it helps, Kung and Lehman describe how to do that for search trees:
> >>>
> >>> http://www.eecs.harvard.edu/~htk/publication/1980-tods-kung-lehman.pdf
> >> Thanks for the paper. Just skimming through it, it seems that it uses
> >> reference-counting and garbage collection to solve the safe memory reclamation
> >> problem. It may be too heavy for qp-trie and we plan to use seqcount-like way to
> >> check whether or not the branch and the leaf node is reused during lookup, and
> >> retry the lookup if it happened. Now just checking the feasibility of the
> >> solution and it seems a little complicated than expected.
> > The main thing in that paper is the handling of rotations in the
> > search-tree update.  But if you are not using a tree, that won't be all
> > that relevant.
> I see. Thanks for the explanation.
> >
> > 								Thanx, Paul
>
Tony Finch Oct. 19, 2022, 5:01 p.m. UTC | #22
Hello all,

I have just found out about this qp-trie work, and I'm pleased to hear
that it is looking promising for you!

I have a few very broad observations:

The "q" in qp-trie doesn't have to stand for "quadbit". There's a tradeoff
between branch factor, maximum key length, and size of branch node. The
greater the branch factor, the fewer indirections needed to traverse the
tree; but if you go too wide then prefetching is less effective and branch
nodes get bigger. I found that 5 bits was the sweet spot (32 wide bitmap,
30ish bit key length) - indexing 5 bit mouthfuls out of the key is HORRID
but it was measurably faster than 4 bits. 6 bits (64 bits of bitmap) grew
nodes from 16 bytes to 24 bytes, and it ended up slower.

Your interior nodes are much bigger than mine, so you might find the
tradeoff is different. I encourage you to try it out.
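
For the curious, the fiddly part of the 5-bit case is roughly this (a plain C
sketch, not Tony's actual code): a chunk can straddle a byte boundary, so two
bytes have to be loaded and shifted, whereas 4-bit nibbles never cross a byte.

  #include <stddef.h>

  /* Extract the i-th 5-bit chunk of 'key', most significant bit first. */
  static unsigned int key_chunk5(const unsigned char *key, size_t len, size_t i)
  {
          size_t bit = i * 5;
          size_t byte = bit / 8;
          unsigned int off = bit % 8;
          unsigned int win;

          if (byte >= len)
                  return 0;
          win = key[byte] << 8;                   /* 16-bit window over the key */
          if (byte + 1 < len)
                  win |= key[byte + 1];
          return (win >> (11 - off)) & 0x1f;
  }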

I saw there has been some discussion about locking and RCU. My current
project is integrating a qp-trie into BIND, with the aim of replacing its
old red-black tree for searching DNS records. It's based on a concurrent
qp-trie that I prototyped in NSD (a smaller and simpler DNS server than
BIND). My strategy is based on a custom allocator for interior nodes. This
has two main effects:

  * Node references are now 32 bit indexes into the allocator's pool,
    instead of 64 bit pointers; nodes are 12 bytes instead of 16 bytes
    (roughly sketched just after this list).

  * The allocator supports copy-on-write and safe memory reclamation with
    a fairly small overhead, 3 x 32 bit counters per memory chunk (each
    chunk is roughly page sized).
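
As a rough sketch of the first point above (the field names and packing are
guesses for illustration, not BIND's actual layout), the saving comes from
replacing the 64-bit child pointer with a 32-bit index into the pooled
allocator:

  #include <stdint.h>

  typedef uint32_t qp_ref;        /* chunk number + offset into the node pool */

  struct qp_branch {
          uint32_t bitmap;        /* which of the 32 possible 5-bit chunks exist */
          uint32_t key_offset;    /* which chunk of the key this node tests,
                                     plus flags */
          qp_ref   twigs;         /* pool index of the first child; children
                                     are stored contiguously */
  };                              /* 12 bytes, instead of 16 with a pointer */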

I wrote some notes when the design was new, but things have changed since
then.

https://dotat.at/@/2021-06-23-page-based-gc-for-qp-trie-rcu.html

For memory reclamation the interior nodes get moved / compacted. It's a
kind of garbage collector, but easy-mode because the per-chunk counters
accurately indicate when compaction is worthwhile. I've written some notes
on my several failed GC experiments; the last / current attempt seems (by
and large) good enough.

https://dotat.at/@/2022-06-22-compact-qp.html

For exterior / leaf nodes, I'm using atomic refcounts to know when they
can be reclaimed. The caller is responsible for COWing its leaves when
necessary.

Updates to the tree are transactional in style, and do not block readers:
a single writer gets the write mutex, makes whatever changes it needs
(copying as required), then commits by flipping the tree's root. After a
commit it can free unused chunks. (Compaction can be part of an update
transaction or a transaction of its own.)

I'm currently using a reader-writer lock for the tree root, but I designed
it with liburcu in mind, while trying to keep things simple.
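
A condensed sketch of that transaction flow (liburcu-flavoured pseudo-C;
names such as cow_modify(), qp_lookup() and tree->root are placeholders
rather than BIND's API):

  /* Writer side: a single writer, copy-on-write, readers never blocked. */
  pthread_mutex_lock(&tree->write_mutex);
  new_root = cow_modify(tree->root, changes);   /* copy nodes as required   */
  rcu_assign_pointer(tree->root, new_root);     /* commit: flip the root    */
  pthread_mutex_unlock(&tree->write_mutex);
  synchronize_rcu();                            /* wait out current readers */
  free_retired_chunks(tree);                    /* now safe to reclaim      */

  /* Reader side: */
  rcu_read_lock();
  leaf = qp_lookup(rcu_dereference(tree->root), key);
  rcu_read_unlock();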

This strategy is very heavily biased in favour of readers, which suits DNS
servers. I don't know enough about BPF to have any idea what kind of
update traffic you need to support.

At the moment I am reworking and simplifying my transaction and
reclamation code and it's all very broken. I guess this isn't the best
possible time to compare notes on qp-trie variants, but I'm happy to hear
from others who have code and ideas to share.
Andrii Nakryiko Oct. 27, 2022, 6:52 p.m. UTC | #23
On Wed, Oct 19, 2022 at 10:01 AM Tony Finch <dot@dotat.at> wrote:
>
> Hello all,
>
> I have just found out about this qp-trie work, and I'm pleased to hear
> that it is looking promising for you!
>

This is a very nice data structure, so thank you for doing a great job
explaining it in your post!

> I have a few very broad observations:
>
> The "q" in qp-trie doesn't have to stand for "quadbit". There's a tradeoff
> between branch factor, maximum key length, and size of branch node. The
> greater the branch factor, the fewer indirections needed to traverse the
> tree; but if you go too wide then prefetching is less effective and branch
> nodes get bigger. I found that 5 bits was the sweet spot (32 wide bitmap,
> 30ish bit key length) - indexing 5 bit mouthfuls out of the key is HORRID
> but it was measurably faster than 4 bits. 6 bits (64 bits of bitmap) grew
> nodes from 16 bytes to 24 bytes, and it ended up slower.
>
> Your interior nodes are much bigger than mine, so you might find the
> tradeoff is different. I encourage you to try it out.

True, but I think for (at least initial) simplicity, sticking to
half-bytes would simplify the code and let us figure out BPF and
kernel-specific issues without having to worry about the correctness
of the qp-trie core logic itself.

>
> I saw there has been some discussion about locking and RCU. My current
> project is integrating a qp-trie into BIND, with the aim of replacing its
> old red-black tree for searching DNS records. It's based on a concurrent
> qp-trie that I prototyped in NSD (a smaller and simpler DNS server than
> BIND). My strategy is based on a custom allocator for interior nodes. This
> has two main effects:
>
>   * Node references are now 32 bit indexes into the allocator's pool,
>     instead of 64 bit pointers; nodes are 12 bytes instead of 16 bytes.
>
>   * The allocator supports copy-on-write and safe memory reclamation with
>     a fairly small overhead, 3 x 32 bit counters per memory chunk (each
>     chunk is roughly page sized).
>
> I wrote some notes when the design was new, but things have changed since
> then.
>
> https://dotat.at/@/2021-06-23-page-based-gc-for-qp-trie-rcu.html
>
> For memory reclamation the interior nodes get moved / compacted. It's a
> kind of garbage collector, but easy-mode because the per-chunk counters
> accurately indicate when compaction is worthwhile. I've written some notes
> on my several failed GC experiments; the last / current attempt seems (by
> and large) good enough.
>
> https://dotat.at/@/2022-06-22-compact-qp.html
>
> For exterior / leaf nodes, I'm using atomic refcounts to know when they
> can be reclaimed. The caller is responsible for COWing its leaves when
> necessary.
>
> Updates to the tree are transactional in style, and do not block readers:
> a single writer gets the write mutex, makes whatever changes it needs
> (copying as required), then commits by flipping the tree's root. After a
> commit it can free unused chunks. (Compaction can be part of an update
> transaction or a transaction of its own.)
>
> I'm currently using a reader-writer lock for the tree root, but I designed
> it with liburcu in mind, while trying to keep things simple.
>
> This strategy is very heavily biased in favour of readers, which suits DNS
> servers. I don't know enough about BPF to have any idea what kind of
> update traffic you need to support.

These are some nice ideas. I did a quick read of your latest blog posts; I
had missed those updates since the last time I checked your blog.

One limitation that we have in the BPF world is that BPF programs can be run
in extremely restrictive contexts (e.g., NMI), in which things that user
space can assume will almost always succeed (like memory allocation) are not
allowed. We do have a BPF-specific memory allocator, but even it can fail to
allocate memory, depending on allocation patterns. So we need to think about
whether this COW approach is acceptable. I'd love for Hou Tao to think about
this and chime in, though, as he has spent a lot of time thinking about the
particulars.

But very basically, ultimate memory and performance savings are perhaps less
important when trying to fit qp-trie into the BPF framework. We can iterate
afterwards with optimizations and improvements, but first we need to get
things correct and well-behaved.

>
> At the moment I am reworking and simplifying my transaction and
> reclamation code and it's all very broken. I guess this isn't the best
> possible time to compare notes on qp-trie variants, but I'm happy to hear
> from others who have code and ideas to share.

It would be great if you could lend your expertise in reviewing at least the
generic qp-trie parts, and also in helping to figure out the overall
concurrency approach we can take in kernel/BPF land (depending on your
familiarity with kernel specifics, of course).

Thanks for offering the latest on qp-trie, exciting to see more
production applications of qp-trie and that you are still actively
working on this!

>
> --
> Tony Finch  <dot@dotat.at>  https://dotat.at/
> Mull of Kintyre to Ardnamurchan Point: East or southeast 4 to 6,
> increasing 6 to gale 8 for a time. Smooth or slight in eastern
> shelter, otherwise slight or moderate. Rain or showers. Good,
> occasionally poor.
Hou Tao Nov. 1, 2022, 12:07 p.m. UTC | #24
Hi,

On 10/28/2022 2:52 AM, Andrii Nakryiko wrote:
> On Wed, Oct 19, 2022 at 10:01 AM Tony Finch <dot@dotat.at> wrote:
>> Hello all,
>>
>> I have just found out about this qp-trie work, and I'm pleased to hear
>> that it is looking promising for you!
>>
> This is a very nice data structure, so thank you for doing a great job
> explaining it in your post!
Sorry for the late reply. Still digging into other problems. Also thanks to
Tony for his work on qp.git.
>
>> I have a few very broad observations:
>>
>> The "q" in qp-trie doesn't have to stand for "quadbit". There's a tradeoff
>> between branch factor, maximum key length, and size of branch node. The
>> greater the branch factor, the fewer indirections needed to traverse the
>> tree; but if you go too wide then prefetching is less effective and branch
>> nodes get bigger. I found that 5 bits was the sweet spot (32 wide bitmap,
>> 30ish bit key length) - indexing 5 bit mouthfuls out of the key is HORRID
>> but it was measurably faster than 4 bits. 6 bits (64 bits of bitmap) grew
>> nodes from 16 bytes to 24 bytes, and it ended up slower.
>>
>> Your interior nodes are much bigger than mine, so you might find the
>> tradeoff is different. I encourage you to try it out.
The parent field in qp_trie_branch is used to support non-recursive
iteration, and the rcu_head is used for RCU memory freeing; that is why the
interior nodes are bigger.
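
For readers following along, the interior node being discussed looks roughly
like this (a simplified sketch based on this thread, not the exact layout in
the patch):

  struct qp_trie_branch {
          struct rcu_head rcu;            /* for RCU-freeing the node        */
          struct qp_trie_branch *parent;  /* enables non-recursive iteration */
          u32 bitmap;                     /* which nibble values are present */
          u32 index;                      /* which nibble of the key to test */
          void *nodes[];                  /* children: branches or leaves    */
  };
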
> True, but I think for (at least initial) simplicity, sticking to
> half-bytes would simplify the code and let us figure out BPF and
> kernel-specific issues without having to worry about the correctness
> of the qp-trie core logic itself.
Agreed.
>
>> I saw there has been some discussion about locking and RCU. My current
>> project is integrating a qp-trie into BIND, with the aim of replacing its
>> old red-black tree for searching DNS records. It's based on a concurrent
>> qp-trie that I prototyped in NSD (a smaller and simpler DNS server than
>> BIND). My strategy is based on a custom allocator for interior nodes. This
>> has two main effects:
>>
>>   * Node references are now 32 bit indexes into the allocator's pool,
>>     instead of 64 bit pointers; nodes are 12 bytes instead of 16 bytes.
>>
>>   * The allocator supports copy-on-write and safe memory reclamation with
>>     a fairly small overhead, 3 x 32 bit counters per memory chunk (each
>>     chunk is roughly page sized).
>>
>> I wrote some notes when the design was new, but things have changed since
>> then.
>>
>> https://dotat.at/@/2021-06-23-page-based-gc-for-qp-trie-rcu.html
>>
>> For memory reclamation the interior nodes get moved / compacted. It's a
>> kind of garbage collector, but easy-mode because the per-chunk counters
>> accurately indicate when compaction is worthwhile. I've written some notes
>> on my several failed GC experiments; the last / current attempt seems (by
>> and large) good enough.
>>
>> https://dotat.at/@/2022-06-22-compact-qp.html
>>
>> For exterior / leaf nodes, I'm using atomic refcounts to know when they
>> can be reclaimed. The caller is responsible for COWing its leaves when
>> necessary.
>>
>> Updates to the tree are transactional in style, and do not block readers:
>> a single writer gets the write mutex, makes whatever changes it needs
>> (copying as required), then commits by flipping the tree's root. After a
>> commit it can free unused chunks. (Compaction can be part of an update
>> transaction or a transaction of its own.)
>>
>> I'm currently using a reader-writer lock for the tree root, but I designed
>> it with liburcu in mind, while trying to keep things simple.
>>
>> This strategy is very heavily biased in favour of readers, which suits DNS
>> servers. I don't know enough about BPF to have any idea what kind of
>> update traffic you need to support.
> These are some nice ideas, I did a quick read on your latest blog
> posts, missed those updates since last time I checked your blog.
>
> One limitation that we have in the BPF world is that BPF programs can
> be run in extremely restrictive contexts (e.g., NMI), in which things
> that user-space can assume will almost always succeed (like memory
> allocation), are not allowed. We do have BPF-specific memory
> allocator, but even it can fail to allocate memory, depending on
> allocation patterns. So we need to think if this COW approach is
> acceptable. I'd love for Hou Tao to think about this and chime in,
> though, as he spent a lot of time thinking about particulars.
The current implementation of BPF_MAP_TYPE_QP_TRIE is already copy-on-write.
When adding or deleting a leaf node, its parent interior node is copied to a
new interior node, the pointer to the old parent node (in the grand-parent
interior node) is updated to point to the new parent node, and the old parent
node is RCU-freed.
According to the description above, COW in qp-trie would mean that all nodes
on the path from the root node to the leaf node are COWed, so I think the
current COW implementation is a better fit for the bpf map usage scenario.
But I will check the qp-trie code in BIND [0] later.

0:
https://gitlab.isc.org/isc-projects/bind9/-/commit/ecc555e6ec763c4f8f2495864ec08749202fff1a#65b4d67ce64e9195e41ac43d78af5156f9ebb779_0_553
> But very basically, ultimate memory and performance savings are
> perhaps less important in trying to fit qp-trie into BPF framework. We
> can iterate after with optimizations and improvements, but first we
> need to get the things correct and well-behaved.
Understand.
>
>> At the moment I am reworking and simplifying my transaction and
>> reclamation code and it's all very broken. I guess this isn't the best
>> possible time to compare notes on qp-trie variants, but I'm happy to hear
>> from others who have code and ideas to share.
> It would be great if you can lend your expertise in reviewing at least
> generic qp-trie parts, but also in helping to figure out the overall
> concurrency approach we can take in kernel/BPF land (depending on your
> familiarity with kernel specifics, of course).
>
> Thanks for offering the latest on qp-trie, exciting to see more
> production applications of qp-trie and that you are still actively
> working on this!
Yes, it would be great if Tony could help review or co-design the bpf qp-trie
map.
>> --
>> Tony Finch  <dot@dotat.at>  https://dotat.at/
>> Mull of Kintyre to Ardnamurchan Point: East or southeast 4 to 6,
>> increasing 6 to gale 8 for a time. Smooth or slight in eastern
>> shelter, otherwise slight or moderate. Rain or showers. Good,
>> occasionally poor.