
[RFC,bpf-next,v4,0/3] Handle immediate reuse in bpf memory allocator

Message ID 20230606035310.4026145-1-houtao@huaweicloud.com (mailing list archive)


Hou Tao June 6, 2023, 3:53 a.m. UTC
From: Hou Tao <houtao1@huawei.com>

Hi,

The implementation of v4 is mainly based on suggestions from Alexei [0].
There are still pending problems in the current implementation, as shown
by the benchmark results in patch #3, but a long time has passed since
the posting of v3, so I am posting v4 here for further discussion and
more suggestions.

The first problem is the huge memory usage compared with a bpf memory
allocator that does immediate reuse:

htab-mem-benchmark (reuse-after-RCU-GP):
| name               | loop (k/s)| average memory (MiB)| peak memory (MiB)|
| --                 | --        | --                  | --               |
| no_op              | 1159.18   | 0.99                | 0.99             |
| overwrite          | 11.00     | 2288                | 4109             |
| batch_add_batch_del| 8.86      | 1558                | 2763             |
| add_del_on_diff_cpu| 4.74      | 11.39               | 14.77            |

htab-mem-benchmark (immediate-reuse):
| name               | loop (k/s)| average memory (MiB)| peak memory (MiB)|
| --                 | --        | --                  | --               |
| no_op              | 1160.66   | 0.99                | 1.00             |
| overwrite          | 28.52     | 2.46                | 2.73             |
| batch_add_batch_del| 11.50     | 2.69                | 2.95             |
| add_del_on_diff_cpu| 3.75      | 15.85               | 24.24            |

It seems the direct reason is the slow RCU grace period. During the
benchmark, the elapsed time before the reuse_rcu() callback is invoked is
about 100ms or even more (e.g., 2 seconds). I suspect that the global
per-bpf-ma spin-lock and the irq-work running in the context of the
freeing process increase the running overhead of the bpf program: the
running time of getpgid() increases, context switches slow down, and the
RCU grace period lengthens [1], but I am still digging into it.

Another problem is the performance degradation compared with immediate
reuse; the output of perf report shows that the per-bpf-ma spin-lock is
a top hotspot:

map_perf_test (reuse-after-RCU-GP)
0:hash_map_perf kmalloc 194677 events per sec

map_perf_test (immediate reuse)
2:hash_map_perf kmalloc 384527 events per sec

The purpose of introducing the per-bpf-ma reusable list is to handle the
case in which allocation and free are done on different CPUs (e.g.,
add_del_on_diff_cpu), while a per-cpu reuse list is enough for the
overwrite & batch_add_batch_del cases. So maybe we could implement a
hybrid of a global reusable list and per-cpu reusable lists and switch
between the two kinds of list according to the history of allocation and
free frequency.
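
The hybrid scheme above could look roughly like the following userspace
sketch. It is only an illustration of the idea, not code from the series:
all names (hybrid_cache, CROSS_CPU_THRESHOLD, etc.) and the switching
heuristic are assumptions. Objects freed on their allocating CPU go to a
per-cpu list (lock-free in a real implementation), and a history counter
of cross-CPU frees switches freeing over to a locked global list:

```c
#include <assert.h>
#include <stddef.h>
#include <pthread.h>

#define NR_CPUS 4
#define CROSS_CPU_THRESHOLD 8 /* arbitrary switch point for the sketch */

struct obj {
	struct obj *next;
	int alloc_cpu;
};

struct hybrid_cache {
	struct obj *percpu_head[NR_CPUS]; /* lock-free in the real thing */
	struct obj *global_head;          /* protected by lock */
	pthread_mutex_t lock;
	int cross_cpu_frees;              /* recent-history heuristic */
};

static void cache_free(struct hybrid_cache *c, struct obj *o, int cpu)
{
	if (o->alloc_cpu != cpu)
		c->cross_cpu_frees++;
	if (c->cross_cpu_frees >= CROSS_CPU_THRESHOLD) {
		/* cross-CPU pattern: share via the global list */
		pthread_mutex_lock(&c->lock);
		o->next = c->global_head;
		c->global_head = o;
		pthread_mutex_unlock(&c->lock);
	} else {
		/* same-CPU pattern: cheap per-cpu list */
		o->next = c->percpu_head[cpu];
		c->percpu_head[cpu] = o;
	}
}

static struct obj *cache_alloc(struct hybrid_cache *c, int cpu)
{
	struct obj *o = c->percpu_head[cpu];

	if (o) {
		c->percpu_head[cpu] = o->next;
		return o;
	}
	/* per-cpu list empty: fall back to the global list */
	pthread_mutex_lock(&c->lock);
	o = c->global_head;
	if (o)
		c->global_head = o->next;
	pthread_mutex_unlock(&c->lock);
	if (o)
		o->alloc_cpu = cpu;
	return o;
}
```

Whether the switch should look at counters, rates, or something else is
exactly the open question in the paragraph above.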

As usual, suggestions and comments are always welcome.

[0]: https://lore.kernel.org/bpf/20230503184841.6mmvdusr3rxiabmu@MacBook-Pro-6.local
[1]: https://lore.kernel.org/bpf/1b64fc4e-d92e-de2f-4895-2e0c36427425@huaweicloud.com

Change Log:
v4:
 * no kworker (Alexei)
 * Use a global reusable list in bpf memory allocator (Alexei)
 * Remove the BPF_MA_FREE_AFTER_RCU_GP flag and do reuse-after-rcu-gp
   by default in the bpf memory allocator (Alexei)
 * add benchmark results from map_perf_test (Alexei)

v3: https://lore.kernel.org/bpf/20230429101215.111262-1-houtao@huaweicloud.com/
 * add BPF_MA_FREE_AFTER_RCU_GP bpf memory allocator
 * Update htab memory benchmark
   * move the benchmark patch to the last patch
   * remove array and useless bpf_map_lookup_elem(&array, ...) in bpf
     programs
   * add synchronization between addition CPU and deletion CPU for
     add_del_on_diff_cpu case to prevent unnecessary loop
   * add the benchmark result for "extra call_rcu + bpf ma"

v2: https://lore.kernel.org/bpf/20230408141846.1878768-1-houtao@huaweicloud.com/
 * add a benchmark for the bpf memory allocator to compare different
   flavors of the bpf memory allocator.
 * implement BPF_MA_REUSE_AFTER_RCU_GP for bpf memory allocator.

v1: https://lore.kernel.org/bpf/20221230041151.1231169-1-houtao@huaweicloud.com/

Hou Tao (3):
  bpf: Factor out a common helper free_all()
  selftests/bpf: Add benchmark for bpf memory allocator
  bpf: Only reuse after one RCU GP in bpf memory allocator

 include/linux/bpf_mem_alloc.h                 |   4 +
 kernel/bpf/memalloc.c                         | 385 ++++++++++++------
 tools/testing/selftests/bpf/Makefile          |   3 +
 tools/testing/selftests/bpf/bench.c           |   4 +
 .../selftests/bpf/benchs/bench_htab_mem.c     | 352 ++++++++++++++++
 .../bpf/benchs/run_bench_htab_mem.sh          |  42 ++
 .../selftests/bpf/progs/htab_mem_bench.c      | 135 ++++++
 7 files changed, 809 insertions(+), 116 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/benchs/bench_htab_mem.c
 create mode 100755 tools/testing/selftests/bpf/benchs/run_bench_htab_mem.sh
 create mode 100644 tools/testing/selftests/bpf/progs/htab_mem_bench.c


Hou Tao June 6, 2023, 12:30 p.m. UTC | #1
Hi,

On 6/6/2023 11:53 AM, Hou Tao wrote:
> From: Hou Tao <houtao1@huawei.com>
>
> Hi,
>
> The implementation of v4 is mainly based on suggestions from Alexei [0].
> There are still pending problems in the current implementation, as shown
> by the benchmark results in patch #3, but a long time has passed since
> the posting of v3, so I am posting v4 here for further discussion and
> more suggestions.
>
> The first problem is the huge memory usage compared with bpf memory
> allocator which does immediate reuse:
>
> htab-mem-benchmark (reuse-after-RCU-GP):
> | name               | loop (k/s)| average memory (MiB)| peak memory (MiB)|
> | --                 | --        | --                  | --               |
> | no_op              | 1159.18   | 0.99                | 0.99             |
> | overwrite          | 11.00     | 2288                | 4109             |
> | batch_add_batch_del| 8.86      | 1558                | 2763             |
> | add_del_on_diff_cpu| 4.74      | 11.39               | 14.77            |
>
> htab-mem-benchmark (immediate-reuse):
> | name               | loop (k/s)| average memory (MiB)| peak memory (MiB)|
> | --                 | --        | --                  | --               |
> | no_op              | 1160.66   | 0.99                | 1.00             |
> | overwrite          | 28.52     | 2.46                | 2.73             |
> | batch_add_batch_del| 11.50     | 2.69                | 2.95             |
> | add_del_on_diff_cpu| 3.75      | 15.85               | 24.24            |
>
> It seems the direct reason is the slow RCU grace period. During the
> benchmark, the elapsed time before the reuse_rcu() callback is invoked is
> about 100ms or even more (e.g., 2 seconds). I suspect that the global
> per-bpf-ma spin-lock and the irq-work running in the context of the
> freeing process increase the running overhead of the bpf program: the
> running time of getpgid() increases, context switches slow down, and the
> RCU grace period lengthens [1], but I am still digging into it.
For the reuse-after-RCU-GP flavor, removing the per-bpf-ma reusable list
(namely bpf_mem_shared_cache) and using a per-cpu reusable list (as v3
did) instead decreases the memory usage of htab-mem-benchmark a lot:

htab-mem-benchmark (reuse-after-RCU-GP + per-cpu reusable list):
| name               | loop (k/s)| average memory (MiB)| peak memory (MiB)|
| --                 | --        | --                  | --               |
| no_op              | 1165.38   | 0.97                | 1.00             |
| overwrite          | 17.25     | 626.41              | 781.82           |
| batch_add_batch_del| 11.51     | 398.56              | 500.29           |
| add_del_on_diff_cpu| 4.21      | 31.06               | 48.84            |

But the memory usage is still large compared with v3, and the elapsed
time of the reuse_rcu() callback is about 90~200ms. Compared with v3,
there are still two differences:
1) v3 uses kmalloc() to allocate multiple inflight RCU callbacks to
accelerate the reuse of freed objects.
2) v3 uses a kworker instead of irq_work for the free procedure.

For 1), after using kmalloc() in the irq_work to allocate multiple
inflight RCU callbacks (namely reuse_rcu()), the memory usage decreases a
bit, but not enough:

htab-mem-benchmark (reuse-after-RCU-GP + per-cpu reusable list + multiple reuse_rcu() callbacks):
| name               | loop (k/s)| average memory (MiB)| peak memory (MiB)|
| --                 | --        | --                  | --               |
| no_op              | 1247.00   | 0.97                | 1.00             |
| overwrite          | 16.56     | 490.18              | 557.17           |
| batch_add_batch_del| 11.31     | 276.32              | 360.89           |
| add_del_on_diff_cpu| 4.00      | 24.76               | 42.58            |

So it seems the large memory usage is due to the irq_work (reuse_bulk)
used for the free procedure. However, after increasing the threshold for
invoking the reuse_bulk irq_work (e.g., using 10 * c->high_watermark),
there is no big difference in the memory usage or in the delay of the RCU
callbacks. Perhaps the reason is that although the number of reuse_bulk
irq_work invocations is reduced, the number of alloc_bulk() irq_work
invocations is increased because there are no reusable objects.

Alexei Starovoitov June 6, 2023, 9:04 p.m. UTC | #2
On Tue, Jun 06, 2023 at 08:30:58PM +0800, Hou Tao wrote:
> For reuse-after-RCU-GP flavor, by removing per-bpf-ma reusable list
> (namely bpf_mem_shared_cache) and using per-cpu reusable list (like v3
> did) instead, the memory usage of htab-mem-benchmark will decrease a lot:
> 
> htab-mem-benchmark (reuse-after-RCU-GP + per-cpu reusable list):
> | name               | loop (k/s)| average memory (MiB)| peak memory (MiB)|
> | --                 | --        | --                  | --               |
> | no_op              | 1165.38   | 0.97                | 1.00             |
> | overwrite          | 17.25     | 626.41              | 781.82           |
> | batch_add_batch_del| 11.51     | 398.56              | 500.29           |
> | add_del_on_diff_cpu| 4.21      | 31.06               | 48.84            |
> 
> But the memory usage is still large compared with v3 and the elapsed
> time of reuse_rcu() callback is about 90~200ms. Compared with v3, there
> are still two differences:
> 1) v3 uses kmalloc() to allocate multiple inflight RCU callbacks to
> accelerate the reuse of freed objects.
> 2) v3 uses kworker instead of irq_work for free procedure.
> 
> For 1), after using kmalloc() in irq_work to allocate multiple inflight
> RCU callbacks (namely reuse_rcu()), the memory usage decreases a bit,
> but is not enough:
> 
> htab-mem-benchmark (reuse-after-RCU-GP + per-cpu reusable list + multiple reuse_rcu() callbacks):
> | name               | loop (k/s)| average memory (MiB)| peak memory (MiB)|
> | --                 | --        | --                  | --               |
> | no_op              | 1247.00   | 0.97                | 1.00             |
> | overwrite          | 16.56     | 490.18              | 557.17           |
> | batch_add_batch_del| 11.31     | 276.32              | 360.89           |
> | add_del_on_diff_cpu| 4.00      | 24.76               | 42.58            |
> 
> So it seems the large memory usage is due to the irq_work (reuse_bulk)
> used for the free procedure. However, after increasing the threshold for
> invoking the reuse_bulk irq_work (e.g., using 10 * c->high_watermark),
> there is no big difference in the memory usage or in the delay of the RCU
> callbacks. Perhaps the reason is that although the number of reuse_bulk
> irq_work invocations is reduced, the number of alloc_bulk() irq_work
> invocations is increased because there are no reusable objects.

The large memory usage is because the benchmark in patch 2 is abusing it.
It's doing one bpf_loop() over 16k elements (in case of 1 producer)
and 16k/8 loops for --producers=8.
That's 2k memory allocations that have to wait for RCU GP.
Of course that's a ton of memory.
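
The "ton of memory" follows from simple queueing arithmetic: elements in
flight while waiting for a grace period are roughly the free rate times
the grace-period delay. A tiny sketch of that relationship; the inputs in
the usage below are illustrative assumptions, not measurements from this
thread:

```c
#include <assert.h>

/* By Little's law, objects pinned behind a grace period
 * ~= frees per second * grace-period latency in seconds. */
static long inflight_elems(long frees_per_sec, double gp_sec)
{
	return (long)((double)frees_per_sec * gp_sec);
}

/* Same quantity expressed in MiB for a given element size. */
static long inflight_mib(long frees_per_sec, double gp_sec, long elem_bytes)
{
	return inflight_elems(frees_per_sec, gp_sec) * elem_bytes >> 20;
}
```

With GP delays in the 100ms-2s range reported earlier in the thread, even
moderate free rates pin hundreds of MiB, which is consistent with the
memory growing as the grace period stretches.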

As for the implementation in patch 3, please respin it asap and remove
the *_tail optimization.
It makes the code hard to read and doesn't buy us anything.
Other than that the algorithm looks fine.

> > Another problem is the performance degradation compared with immediate
> > reuse; the output of perf report shows that the per-bpf-ma spin-lock is
> > a top hotspot:

That's not what I see.
The hot spin_lock is in the generic htab code, not in the ma.
I still believe per-bpf-ma spin-lock is fine.
The bench in patch 2 is measuring something that no real bpf prog cares about.

See how map_perf_test is doing:
        for (i = 0; i < 10; i++) {
                bpf_map_update_elem(&hash_map_alloc, &key, &init_val, BPF_ANY);

Even 10 map updates for the same map in a single bpf prog invocation are not realistic.
16k/8 is beyond any normal scenario.
There is no reason to optimize bpf_ma for the case of htab abuse.

> > map_perf_test (reuse-after-RCU-GP)
> > 0:hash_map_perf kmalloc 194677 events per sec
> >
> > map_perf_test (immediate reuse)
> > 2:hash_map_perf kmalloc 384527 events per sec

For some reason I cannot reproduce the slow down with map_perf_test 4 8.
I see the same perf with/without patch 3.

I've applied patch 1.
Please respin with patch 2 doing no more than 10 map_updates under rcu lock
and remove *_tail optimization from patch 3.
Just do llist_for_each_safe() when you move elements from one list to another.
And let's brainstorm further.
Please do not delay.
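
The llist_for_each_safe() approach Alexei asks for can be sketched in
plain userspace C. This is a single-threaded model of the kernel's llist
API with the atomics omitted, not the actual patch code:

```c
#include <assert.h>
#include <stddef.h>

struct llist_node {
	struct llist_node *next;
};

struct llist_head {
	struct llist_node *first;
};

/* the kernel's llist_add() is an atomic cmpxchg loop; plain store here */
static void llist_add(struct llist_node *n, struct llist_head *h)
{
	n->next = h->first;
	h->first = n;
}

static struct llist_node *llist_del_all(struct llist_head *h)
{
	struct llist_node *first = h->first;

	h->first = NULL;
	return first;
}

/* Move every node from @src to @dst, llist_for_each_safe() style:
 * @pos gets relinked inside the loop, so @n keeps the iteration alive.
 * No tail pointer is needed; the move is a simple linear walk. */
static int move_all(struct llist_head *src, struct llist_head *dst)
{
	struct llist_node *pos, *n;
	int moved = 0;

	for (pos = llist_del_all(src); pos; pos = n) {
		n = pos->next;   /* save before relinking pos */
		llist_add(pos, dst);
		moved++;
	}
	return moved;
}
```

In the kernel proper this would use llist_del_all() and
llist_for_each_safe() from <linux/llist.h>; note the walk reverses the
list order, which the freelist code does not care about.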
Hou Tao June 7, 2023, 1:19 a.m. UTC | #3
Hi,

On 6/7/2023 5:04 AM, Alexei Starovoitov wrote:
> On Tue, Jun 06, 2023 at 08:30:58PM +0800, Hou Tao wrote:
>> Hi,
>>
>> On 6/6/2023 11:53 AM, Hou Tao wrote:
>>> From: Hou Tao <houtao1@huawei.com>
>>>
>>> Hi,
>>>
>>> The implementation of v4 is mainly based on suggestions from Alexi [0].
>>> There are still pending problems for the current implementation as shown
>>> in the benchmark result in patch #3, but there was a long time from the
>>> posting of v3, so posting v4 here for further disscussions and more
>>> suggestions.
>>>
>>> The first problem is the huge memory usage compared with bpf memory
>>> allocator which does immediate reuse:
>>>
>>> htab-mem-benchmark (reuse-after-RCU-GP):
>>> | name               | loop (k/s)| average memory (MiB)| peak memory (MiB)|
>>> | --                 | --        | --                  | --               |
>>> | no_op              | 1159.18   | 0.99                | 0.99             |
>>> | overwrite          | 11.00     | 2288                | 4109             |
>>> | batch_add_batch_del| 8.86      | 1558                | 2763             |
>>> | add_del_on_diff_cpu| 4.74      | 11.39               | 14.77            |
>>>
>>> htab-mem-benchmark (immediate-reuse):
>>> | name               | loop (k/s)| average memory (MiB)| peak memory (MiB)|
>>> | --                 | --        | --                  | --               |
>>> | no_op              | 1160.66   | 0.99                | 1.00             |
>>> | overwrite          | 28.52     | 2.46                | 2.73             |
>>> | batch_add_batch_del| 11.50     | 2.69                | 2.95             |
>>> | add_del_on_diff_cpu| 3.75      | 15.85               | 24.24            |
>>>
>>> It seems the direct reason is the slow RCU grace period. During
>>> benchmark, the elapsed time when reuse_rcu() callback is called is about
>>> 100ms or even more (e.g., 2 seconds). I suspect the global per-bpf-ma
>>> spin-lock and the irq-work running in the contex of freeing process will
>>> increase the running overhead of bpf program, the running time of
>>> getpgid() is increased, the contex switch is slowed down and the RCU
>>> grace period increases [1], but I am still diggin into it.
>> For reuse-after-RCU-GP flavor, by removing per-bpf-ma reusable list
>> (namely bpf_mem_shared_cache) and using per-cpu reusable list (like v3
>> did) instead, the memory usage of htab-mem-benchmark will decrease a lot:
>>
>> htab-mem-benchmark (reuse-after-RCU-GP + per-cpu reusable list):
>> | name               | loop (k/s)| average memory (MiB)| peak memory (MiB)|
>> | --                 | --        | --                  | --               |
>> | no_op              | 1165.38   | 0.97                | 1.00             |
>> | overwrite          | 17.25     | 626.41              | 781.82           |
>> | batch_add_batch_del| 11.51     | 398.56              | 500.29           |
>> | add_del_on_diff_cpu| 4.21      | 31.06               | 48.84            |
>>
>> But the memory usage is still large compared with v3 and the elapsed
>> time of reuse_rcu() callback is about 90~200ms. Compared with v3, there
>> are still two differences:
>> 1) v3 uses kmalloc() to allocate multiple inflight RCU callbacks to
>> accelerate the reuse of freed objects.
>> 2) v3 uses kworker instead of irq_work for free procedure.
>>
>> For 1), after using kmalloc() in irq_work to allocate multiple inflight
>> RCU callbacks (namely reuse_rcu()), the memory usage decreases a bit,
>> but is not enough:
>>
>> htab-mem-benchmark (reuse-after-RCU-GP + per-cpu reusable list + multiple reuse_rcu() callbacks):
>> | name               | loop (k/s)| average memory (MiB)| peak memory (MiB)|
>> | --                 | --        | --                  | --               |
>> | no_op              | 1247.00   | 0.97                | 1.00             |
>> | overwrite          | 16.56     | 490.18              | 557.17           |
>> | batch_add_batch_del| 11.31     | 276.32              | 360.89           |
>> | add_del_on_diff_cpu| 4.00      | 24.76               | 42.58            |
>>
>> So it seems the large memory usage is due to irq_work (reuse_bulk) used
>> for free procedure. However after increasing the threshold for invoking
>> irq_work reuse_bulk (e.g., use 10 * c->high_watermark), but there is no
>> big difference in the memory usage and the delayed time for RCU
>> callbacks. Perhaps the reason is that although the number of  reuse_bulk
>> irq_work calls is reduced but the time of alloc_bulk() irq_work calls is
>> increased because there are no reusable objects.
> The large memory usage is because the benchmark in patch 2 is abusing it.
> It's doing one bpf_loop() over 16k elements (in case of 1 producer)
> and 16k/8 loops for --producers=8.
> That's 2k memory allocations that have to wait for RCU GP.
> Of course that's a ton of memory.
I don't agree with that. In v3 the benchmark is the same, but both
the performance and the memory usage are better than in v4. Even compared
with the "htab-mem-benchmark (reuse-after-RCU-GP + per-cpu reusable list +
multiple reuse_rcu() callbacks)" above, the memory usage in v3 is still
much smaller, as shown below. If the large memory usage were due to
abuse in the benchmark, how do you explain the memory usage in v3?

htab-mem-benchmark (reuse-after-rcu-gp v3)

| name                | loop (k/s) | average memory (MiB) | peak memory (MiB) |
| --                  | --         | --                   | --                |
| no_op               | 1199.16    | 0.97                 | 0.99              |
| overwrite           | 16.37      | 24.01                | 31.76             |
| batch_add_batch_del | 9.61       | 16.71                | 19.95             |
| add_del_on_diff_cpu | 3.62       | 22.93                | 37.02             |

>
> As far as implementation in patch 3 please respin it asap and remove *_tail optimization.
> It makes the code hard to read and doesn't buy us anything.
The reason I added a tail for each list is that there could be thousands
or even tens of thousands of elements in these lists, and there is no need
to spend CPU time traversing these lists one by one. It may be a premature
optimization, so let me remove the tails first; I will try to add them
back later and check whether there is any performance improvement.
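
For reference, the *_tail idea under discussion amounts to keeping a tail
pointer so one list can be spliced onto another in O(1) instead of walking
to its end. A minimal userspace sketch with illustrative names (tlist,
etc.), not the patch itself:

```c
#include <assert.h>
#include <stddef.h>

struct node {
	struct node *next;
};

struct tlist {
	struct node *head;
	struct node *tail;
	int len;
};

static void tlist_push(struct tlist *l, struct node *n)
{
	n->next = l->head;
	l->head = n;
	if (!l->tail)
		l->tail = n;   /* first node is also the tail */
	l->len++;
}

/* O(1) splice: the saved tail links straight to dst's head,
 * so no traversal of src is needed no matter how long it is. */
static void tlist_splice(struct tlist *src, struct tlist *dst)
{
	if (!src->head)
		return;
	src->tail->next = dst->head;
	dst->head = src->head;
	if (!dst->tail)
		dst->tail = src->tail;
	dst->len += src->len;
	src->head = src->tail = NULL;
	src->len = 0;
}
```

Whether the O(1) splice is worth the extra bookkeeping is exactly the
trade-off above; if the lists stay short, a plain walk is cheap anyway.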
> Other than that the algorithm looks fine.
>
>>> Another problem is the performance degradation compared with immediate
>>> reuse; the output of perf report shows that the per-bpf-ma spin-lock is
>>> a top hotspot:
> That's not what I see.
> The hot spin_lock is in the generic htab code, not in the ma.
> I still believe per-bpf-ma spin-lock is fine.
> The bench in patch 2 is measuring something that no real bpf prog cares about.
>
> See how map_perf_test is doing:
>         for (i = 0; i < 10; i++) {
>                 bpf_map_update_elem(&hash_map_alloc, &key, &init_val, BPF_ANY);
>
> Even 10 map updates for the same map in a single bpf prog invocation is not realistic.
> 16k/8 is beyond any normal scenario.
> There is no reason to optimize bpf_ma for the case of htab abuse.
I have a different view of the benchmark. Firstly, htab is not the only
user of the bpf memory allocator; secondly, we can't predict the exact
behavior of bpf programs, so I think stressing the bpf memory allocator
with various kinds of use cases is good for its broad usage.
>
>>> map_perf_test (reuse-after-RCU-GP)
>>> 0:hash_map_perf kmalloc 194677 events per sec
>>>
>>> map_perf_test (immediate reuse)
>>> 2:hash_map_perf kmalloc 384527 events per sec
> For some reason I cannot reproduce the slow down with map_perf_test 4 8.
> I see the same perf with/without patch 3.
I will double check my local setup and test results.
Alexei Starovoitov June 7, 2023, 1:39 a.m. UTC | #4
On Tue, Jun 6, 2023 at 6:19 PM Hou Tao <houtao@huaweicloud.com> wrote:
>
> Hi,
>
> On 6/7/2023 5:04 AM, Alexei Starovoitov wrote:
> > The large memory usage is because the benchmark in patch 2 is abusing it.
> > It's doing one bpf_loop() over 16k elements (in case of 1 producer)
> > and 16k/8 loops for --producers=8.
> > That's 2k memory allocations that have to wait for RCU GP.
> > Of course that's a ton of memory.
> I don't agree with that. Because in v3 the benchmark is the same, but both
> the performance and the memory usage are better than in v4. Even compared
> with "htab-mem-benchmark (reuse-after-RCU-GP + per-cpu reusable list +
> multiple reuse_rcu() callbacks)" above, the memory usage in v3 is still
> much smaller, as shown below. If the large memory usage is due to the
> abuse in the benchmark, how do you explain the memory usage in v3?

There could have been implementation bugs or whatever else.
The main point is the bench test is not realistic and should not be
used to make design decisions.

> The reason I added a tail for each list is that there could be thousands
> or even tens of thousands of elements in these lists and there is no need
> to spend CPU time traversing these lists one by one. It may be a premature
> optimization, so let me remove the tails from these lists first; I will
> try to add the tails back later and check whether or not there is any
> performance improvement.

There will be thousands of elements only because the bench test is wrong.
It's doing something no real prog would do.

> I have a different view of the benchmark. Firstly, htab is not the only
> user of the bpf memory allocator; secondly, we can't predict the exact
> behavior of bpf programs, so I think stressing the bpf memory allocator
> with various kinds of use cases is good for its broad usage.

It is not a stress test. It's an abuse.
A stress test would be something that can happen in practice.
Doing thousands of map_updates in a forever loop is not something
useful code would do.
For example, call_rcu_tasks_trace is not designed to be called millions
of times a second. That's an anti-pattern and the rcu core won't be optimized for it.
rcu, srcu, and rcu_tasks_trace have different usage patterns.
The programmer has to correctly pick one depending on the use case.
Same with bpf htab. If somebody has a real need to do thousands of
updates under an rcu lock, they should be using a preallocated map and deal
with the immediate reuse.
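For reference, opting into a preallocated hash map simply means not setting BPF_F_NO_PREALLOC, since BPF hash maps preallocate their elements by default. A minimal libbpf-style definition might look like the sketch below; the map name, key/value types, and sizing are made up for the example:

```c
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

struct {
	__uint(type, BPF_MAP_TYPE_HASH);
	__uint(max_entries, 16384);
	__type(key, __u32);
	__type(value, __u64);
	/* No BPF_F_NO_PREALLOC in map_flags: elements are preallocated
	 * and reused immediately instead of going through bpf_mem_alloc. */
} prealloc_map SEC(".maps");
```

The trade-off is exactly the one discussed in this thread: preallocation avoids allocator pressure under heavy update rates, at the cost of immediate element reuse.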
Hou Tao June 7, 2023, 7:56 a.m. UTC | #5
Hi,

On 6/7/2023 9:39 AM, Alexei Starovoitov wrote:
> On Tue, Jun 6, 2023 at 6:19 PM Hou Tao <houtao@huaweicloud.com> wrote:
>> Hi,
>>
>> On 6/7/2023 5:04 AM, Alexei Starovoitov wrote:
>>> On Tue, Jun 06, 2023 at 08:30:58PM +0800, Hou Tao wrote:
>>>> Hi,
>>>>
>>>> On 6/6/2023 11:53 AM, Hou Tao wrote:
>>>>> From: Hou Tao <houtao1@huawei.com>
>>>>>
>>>>> Hi,
>>>>>
>>>>> The implementation of v4 is mainly based on suggestions from Alexei [0].
>>>>> There are still pending problems in the current implementation, as shown
>>>>> in the benchmark results in patch #3, but a long time has passed since the
>>>>> posting of v3, so I am posting v4 here for further discussions and more
>>>>> suggestions.
>>>>>
>>>>> The first problem is the huge memory usage compared with bpf memory
>>>>> allocator which does immediate reuse:
>>>>>
>>>>> htab-mem-benchmark (reuse-after-RCU-GP):
>>>>> | name               | loop (k/s)| average memory (MiB)| peak memory (MiB)|
>>>>> | --                 | --        | --                  | --               |
>>>>> | no_op              | 1159.18   | 0.99                | 0.99             |
>>>>> | overwrite          | 11.00     | 2288                | 4109             |
>>>>> | batch_add_batch_del| 8.86      | 1558                | 2763             |
>>>>> | add_del_on_diff_cpu| 4.74      | 11.39               | 14.77            |
>>>>>
>>>>> htab-mem-benchmark (immediate-reuse):
>>>>> | name               | loop (k/s)| average memory (MiB)| peak memory (MiB)|
>>>>> | --                 | --        | --                  | --               |
>>>>> | no_op              | 1160.66   | 0.99                | 1.00             |
>>>>> | overwrite          | 28.52     | 2.46                | 2.73             |
>>>>> | batch_add_batch_del| 11.50     | 2.69                | 2.95             |
>>>>> | add_del_on_diff_cpu| 3.75      | 15.85               | 24.24            |
>>>>>
>>>>> It seems the direct reason is the slow RCU grace period. During the
>>>>> benchmark, the elapsed time before the reuse_rcu() callback is invoked is about
>>>>> 100ms or even more (e.g., 2 seconds). I suspect the global per-bpf-ma
>>>>> spin-lock and the irq-work running in the context of the freeing process
>>>>> increase the running overhead of the bpf program, the running time of
>>>>> getpgid() is increased, the context switch is slowed down, and the RCU
>>>>> grace period increases [1], but I am still digging into it.
>>>> For reuse-after-RCU-GP flavor, by removing per-bpf-ma reusable list
>>>> (namely bpf_mem_shared_cache) and using per-cpu reusable list (like v3
>>>> did) instead, the memory usage of htab-mem-benchmark will decrease a lot:
>>>>
>>>> htab-mem-benchmark (reuse-after-RCU-GP + per-cpu reusable list):
>>>> | name               | loop (k/s)| average memory (MiB)| peak memory (MiB)|
>>>> | --                 | --        | --                  | --               |
>>>> | no_op              | 1165.38   | 0.97                | 1.00             |
>>>> | overwrite          | 17.25     | 626.41              | 781.82           |
>>>> | batch_add_batch_del| 11.51     | 398.56              | 500.29           |
>>>> | add_del_on_diff_cpu| 4.21      | 31.06               | 48.84            |
>>>>
>>>> But the memory usage is still large compared with v3 and the elapsed
>>>> time of reuse_rcu() callback is about 90~200ms. Compared with v3, there
>>>> are still two differences:
>>>> 1) v3 uses kmalloc() to allocate multiple inflight RCU callbacks to
>>>> accelerate the reuse of freed objects.
>>>> 2) v3 uses kworker instead of irq_work for free procedure.
>>>>
>>>> For 1), after using kmalloc() in irq_work to allocate multiple inflight
>>>> RCU callbacks (namely reuse_rcu()), the memory usage decreases a bit,
>>>> but is not enough:
>>>>
>>>> htab-mem-benchmark (reuse-after-RCU-GP + per-cpu reusable list + multiple reuse_rcu() callbacks):
>>>> | name               | loop (k/s)| average memory (MiB)| peak memory (MiB)|
>>>> | --                 | --        | --                  | --               |
>>>> | no_op              | 1247.00   | 0.97                | 1.00             |
>>>> | overwrite          | 16.56     | 490.18              | 557.17           |
>>>> | batch_add_batch_del| 11.31     | 276.32              | 360.89           |
>>>> | add_del_on_diff_cpu| 4.00      | 24.76               | 42.58            |
>>>>
>>>> So it seems the large memory usage is due to the irq_work (reuse_bulk) used
>>>> for the free procedure. However, after increasing the threshold for invoking
>>>> the reuse_bulk irq_work (e.g., using 10 * c->high_watermark), there is no
>>>> big difference in the memory usage or in the delayed time for the RCU
>>>> callbacks. Perhaps the reason is that although the number of reuse_bulk
>>>> irq_work calls is reduced, the number of alloc_bulk() irq_work calls is
>>>> increased because there are no reusable objects.
>>> The large memory usage is because the benchmark in patch 2 is abusing it.
>>> It's doing one bpf_loop() over 16k elements (in case of 1 producer)
>>> and 16k/8 loops for --producers=8.
>>> That's 2k memory allocations that have to wait for RCU GP.
>>> Of course that's a ton of memory.
>> I don't agree with that. Because in v3 the benchmark is the same, but both
>> the performance and the memory usage are better than in v4. Even compared
>> with "htab-mem-benchmark (reuse-after-RCU-GP + per-cpu reusable list +
>> multiple reuse_rcu() callbacks)" above, the memory usage in v3 is still
>> much smaller, as shown below. If the large memory usage is due to the
>> abuse in the benchmark, how do you explain the memory usage in v3?
> There could have been implementation bugs or whatever else.
> The main point is the bench test is not realistic and should not be
> used to make design decisions.
I see your point. I will continue to debug the memory usage difference
between v3 and v4.
>
>> The reason I added a tail for each list is that there could be thousands
>> or even tens of thousands of elements in these lists and there is no need
>> to spend CPU time traversing these lists one by one. It may be a premature
>> optimization, so let me remove the tails from these lists first; I will
>> try to add the tails back later and check whether or not there is any
>> performance improvement.
> There will be thousands of elements only because the bench test is wrong.
> It's doing something no real prog would do.
I don't think so. Let's consider the per-cpu list first. Assume the
normal RCU grace period is about 30ms and we are tracing the IO latency
of a normal SSD. The IOPS is about 176K per second, so before one RCU
GP has passed, we will need to allocate about 176 * 30 = 5.2K elements.
For the per-ma list, when the number of CPUs increases, it is easy for
the list to accumulate thousands of elements.
>
>> I have a different view of the benchmark. Firstly, htab is not the only
>> user of the bpf memory allocator; secondly, we can't predict the exact
>> behavior of bpf programs, so I think stressing the bpf memory allocator
>> with various kinds of use cases is good for its broad usage.
> It is not a stress test. It's an abuse.
> A stress test would be something that can happen in practice.
> Doing thousands of map_updates in a forever loop is not something
> useful code would do.
> For example, call_rcu_tasks_trace is not designed to be called millions
> of times a second. That's an anti-pattern and the rcu core won't be optimized for it.
> rcu, srcu, and rcu_tasks_trace have different usage patterns.
> The programmer has to correctly pick one depending on the use case.
> Same with bpf htab. If somebody has a real need to do thousands of
> updates under an rcu lock, they should be using a preallocated map and deal
> with the immediate reuse.
Before I post v5, I want to know the reason why the per-bpf-ma list is
introduced. Previously, I thought it was used to handle the case in which
allocation and freeing are done on different CPUs. And as we can see
from the benchmark results above and in v3, the performance and the
memory usage of v4 for add_del_on_diff_cpu are better than v3's. But now I
am not sure, because as you mentioned above, it is used to reduce the
calling frequency of call_rcu_tasks_trace(). So could you tell me the
main reason for the per-bpf-ma list? As shown in the benchmark results
above, using a per-cpu reuse list (namely htab-mem-benchmark
(reuse-after-RCU-GP + per-cpu reusable list)) gives better performance
and memory usage compared with the per-ma list (htab-mem-benchmark
(reuse-after-RCU-GP)). If we just want to reduce the calling frequency
of call_rcu_tasks_trace(), we could make
bpf_mem_shared_cache->reuse_ready_head per-cpu and leave
wait_for_free per-bpf-ma.
Hou Tao June 7, 2023, 8:42 a.m. UTC | #6
Hi,

On 6/7/2023 5:04 AM, Alexei Starovoitov wrote:
> On Tue, Jun 06, 2023 at 08:30:58PM +0800, Hou Tao wrote:
>> Hi,
>>
>> On 6/6/2023 11:53 AM, Hou Tao wrote:
>>> From: Hou Tao <houtao1@huawei.com>
>>>
>>> Hi,
>>>
>>> The implementation of v4 is mainly based on suggestions from Alexei [0].
>>> There are still pending problems in the current implementation, as shown
>>> in the benchmark results in patch #3, but a long time has passed since the
>>> posting of v3, so I am posting v4 here for further discussions and more
>>> suggestions.
>>>
>>> The first problem is the huge memory usage compared with bpf memory
>>> allocator which does immediate reuse:
SNIP
> The large memory usage is because the benchmark in patch 2 is abusing it.
> It's doing one bpf_loop() over 16k elements (in case of 1 producer)
> and 16k/8 loops for --producers=8.
> That's 2k memory allocations that have to wait for RCU GP.
> Of course that's a ton of memory.
>
> As far as the implementation in patch 3 goes, please respin it asap and remove the *_tail optimization.
> It makes the code hard to read and doesn't buy us anything.
> Other than that, the algorithm looks fine.
>
>>> Another problem is the performance degradation compared with immediate
>>> reuse; the output from perf report showed that the per-bpf-ma spin-lock is a
>>> top-one hotspot:
> That's not what I see.
> The hot spin_lock is in the generic htab code, not in the ma.
> I still believe the per-bpf-ma spin-lock is fine.
> The bench in patch 2 is measuring something that no real bpf prog cares about.
>
> See how map_perf_test is doing:
>         for (i = 0; i < 10; i++) {
>                 bpf_map_update_elem(&hash_map_alloc, &key, &init_val, BPF_ANY);
>
> Even 10 map updates for the same map in a single bpf prog invocation is not realistic.
> 16k/8 is beyond any normal scenario.
> There is no reason to optimize bpf_ma for the case of htab abuse.
>
>>> map_perf_test (reuse-after-RCU-GP)
>>> 0:hash_map_perf kmalloc 194677 events per sec
>>>
>>> map_perf_test (immediate reuse)
>>> 2:hash_map_perf kmalloc 384527 events per sec
> For some reason I cannot reproduce the slow down with map_perf_test 4 8.
> I see the same perf with/without patch 3.
As said in the commit message, the command line for the test is
"./map_perf_test 4 8 16384", because the default max_entries is 1000. If
using the default max_entries and the number of CPUs is greater than 15,
use_percpu_counter will be false.

I have double checked my local VM setup (8 CPUs + 16GB) and rerun the
test.  For both "./map_perf_test 4 8" and "./map_perf_test 4 8 16384"
there is an obvious performance degradation.

Before reuse-after-rcu-gp:
[root@hello bpf]# ./map_perf_test 4 8
5:hash_map_perf kmalloc 358498 events per sec
4:hash_map_perf kmalloc 306351 events per sec
1:hash_map_perf kmalloc 299175 events per sec
3:hash_map_perf kmalloc 297689 events per sec
6:hash_map_perf kmalloc 299839 events per sec
2:hash_map_perf kmalloc 292286 events per sec
7:hash_map_perf kmalloc 278138 events per sec
0:hash_map_perf kmalloc 265031 events per sec

[root@hello bpf]# ./map_perf_test 4 8 16384
2:hash_map_perf kmalloc 359201 events per sec
6:hash_map_perf kmalloc 302475 events per sec
3:hash_map_perf kmalloc 298583 events per sec
7:hash_map_perf kmalloc 301594 events per sec
0:hash_map_perf kmalloc 295210 events per sec
4:hash_map_perf kmalloc 230496 events per sec
5:hash_map_perf kmalloc 163530 events per sec
1:hash_map_perf kmalloc 153520 events per sec

After reuse-after-rcu-gp:
[root@hello bpf]# ./map_perf_test 4 8
1:hash_map_perf kmalloc 203252 events per sec
2:hash_map_perf kmalloc 181777 events per sec
6:hash_map_perf kmalloc 183467 events per sec
4:hash_map_perf kmalloc 182590 events per sec
0:hash_map_perf kmalloc 180840 events per sec
3:hash_map_perf kmalloc 179875 events per sec
5:hash_map_perf kmalloc 152250 events per sec
7:hash_map_perf kmalloc 137500 events per sec

[root@hello bpf]# ./map_perf_test 4 8 16384
4:hash_map_perf kmalloc 203983 events per sec
5:hash_map_perf kmalloc 197902 events per sec
2:hash_map_perf kmalloc 184693 events per sec
3:hash_map_perf kmalloc 185341 events per sec
1:hash_map_perf kmalloc 183064 events per sec
7:hash_map_perf kmalloc 181148 events per sec
0:hash_map_perf kmalloc 178520 events per sec
6:hash_map_perf kmalloc 179340 events per sec

I also ran map_perf_test on a physical x86-64 host with 72 CPUs. The
numbers for "./map_perf_test 4 8" are similar, but there is an obvious
performance degradation for "./map_perf_test 4 8 16384"

Before reuse-after-rcu-gp:

[houtao@fedora bpf]$ sudo ./map_perf_test 4 8
1:hash_map_perf kmalloc 41711 events per sec
3:hash_map_perf kmalloc 41352 events per sec
4:hash_map_perf kmalloc 41352 events per sec
0:hash_map_perf kmalloc 41008 events per sec
7:hash_map_perf kmalloc 41086 events per sec
5:hash_map_perf kmalloc 41038 events per sec
2:hash_map_perf kmalloc 40971 events per sec
6:hash_map_perf kmalloc 41008 events per sec

[houtao@fedora bpf]$ sudo ./map_perf_test 4 8 16384
1:hash_map_perf kmalloc 388088 events per sec
7:hash_map_perf kmalloc 391843 events per sec
6:hash_map_perf kmalloc 387901 events per sec
3:hash_map_perf kmalloc 356096 events per sec
4:hash_map_perf kmalloc 356836 events per sec
2:hash_map_perf kmalloc 349728 events per sec
0:hash_map_perf kmalloc 345792 events per sec
5:hash_map_perf kmalloc 346742 events per sec

After reuse-after-rcu-gp:

[houtao@fedora bpf]$ sudo ./map_perf_test 4 8
4:hash_map_perf kmalloc 42667 events per sec
1:hash_map_perf kmalloc 42206 events per sec
5:hash_map_perf kmalloc 42264 events per sec
6:hash_map_perf kmalloc 42196 events per sec
7:hash_map_perf kmalloc 42142 events per sec
2:hash_map_perf kmalloc 42028 events per sec
0:hash_map_perf kmalloc 41933 events per sec
3:hash_map_perf kmalloc 41986 events per sec

[houtao@fedora bpf]$ sudo ./map_perf_test 4 8 16384
5:hash_map_perf kmalloc 655628 events per sec
7:hash_map_perf kmalloc 650691 events per sec
2:hash_map_perf kmalloc 624384 events per sec
1:hash_map_perf kmalloc 615011 events per sec
3:hash_map_perf kmalloc 618335 events per sec
4:hash_map_perf kmalloc 624072 events per sec
6:hash_map_perf kmalloc 628559 events per sec
0:hash_map_perf kmalloc 585384 events per sec

So could you please double check your setup and rerun map_perf_test? If
there is no performance degradation, could you please share your setup
and your kernel config file?
>
> I've applied patch 1.
> Please respin with patch 2 doing no more than 10 map_updates under rcu lock
> and remove *_tail optimization from patch 3.
> Just do llist_for_each_safe() when you move elements from one list to another.
> And let's brainstorm further.
> Please do not delay.
Alexei Starovoitov June 7, 2023, 5:52 p.m. UTC | #7
On Wed, Jun 07, 2023 at 04:42:11PM +0800, Hou Tao wrote:
> As said in the commit message, the command line for test is
> "./map_perf_test 4 8 16384", because the default max_entries is 1000. If
> using default max_entries and the number of CPUs is greater than 15,
> use_percpu_counter will be false.

Right. percpu or not depends on number of cpus.

> 
> I have double checked my local VM setup (8 CPUs + 16GB) and rerun the
> test.  For both "./map_perf_test 4 8" and "./map_perf_test 4 8 16384"
> there are obvious performance degradation.
...
> [root@hello bpf]# ./map_perf_test 4 8 16384
> 2:hash_map_perf kmalloc 359201 events per sec
..
> [root@hello bpf]# ./map_perf_test 4 8 16384
> 4:hash_map_perf kmalloc 203983 events per sec

this is indeed a degradation in a VM.

> I also run map_perf_test on a physical x86-64 host with 72 CPUs. The
> performances for "./map_perf_test 4 8" are similar, but there is obvious
> performance degradation for "./map_perf_test 4 8 16384"

but... a degradation?

> Before reuse-after-rcu-gp:
>
> [houtao@fedora bpf]$ sudo ./map_perf_test 4 8 16384
> 1:hash_map_perf kmalloc 388088 events per sec
...
> After reuse-after-rcu-gp:
> [houtao@fedora bpf]$ sudo ./map_perf_test 4 8 16384
> 5:hash_map_perf kmalloc 655628 events per sec

This is a big improvement :) Not a degradation.
You always have to double-check the numbers with perf report.

> So could you please double check your setup and rerun map_perf_test ? If
> there is no performance degradation, could you please share your setup
> and your kernel configure file ?

I'm testing on a normal no-debug kernel. No kasan. No lockdep. HZ=1000
Playing with it a bit more I found something interesting:
map_perf_test 4 8 16348
before/after has too much noise to be conclusive.

So I did
map_perf_test 4 8 16348 1000000

and now I see a significant degradation from patch 3.
It drops from 800k to 200k.
And perf report confirms that heavy contention on sc->reuse_lock is the culprit.
The following hack addresses most of the perf degradation:

diff --git a/kernel/bpf/memalloc.c b/kernel/bpf/memalloc.c
index fea1cb0c78bb..eeadc9359097 100644
--- a/kernel/bpf/memalloc.c
+++ b/kernel/bpf/memalloc.c
@@ -188,7 +188,7 @@ static int bpf_ma_get_reusable_obj(struct bpf_mem_cache *c, int cnt)
        alloc = 0;
        head = NULL;
        tail = NULL;
-       raw_spin_lock_irqsave(&sc->reuse_lock, flags);
+       if (raw_spin_trylock_irqsave(&sc->reuse_lock, flags)) {
        while (alloc < cnt) {
                obj = __llist_del_first(&sc->reuse_ready_head);
                if (obj) {
@@ -206,6 +206,7 @@ static int bpf_ma_get_reusable_obj(struct bpf_mem_cache *c, int cnt)
                alloc++;
        }
        raw_spin_unlock_irqrestore(&sc->reuse_lock, flags);
+       }

        if (alloc) {
                if (IS_ENABLED(CONFIG_PREEMPT_RT))
@@ -334,9 +335,11 @@ static void bpf_ma_add_to_reuse_ready_or_free(struct bpf_mem_cache *c)
                sc->reuse_ready_tail = NULL;
                WARN_ON_ONCE(!llist_empty(&sc->wait_for_free));
                __llist_add_batch(head, tail, &sc->wait_for_free);
+               raw_spin_unlock_irqrestore(&sc->reuse_lock, flags);
                call_rcu_tasks_trace(&sc->rcu, free_rcu);
+       } else {
+               raw_spin_unlock_irqrestore(&sc->reuse_lock, flags);
        }
-       raw_spin_unlock_irqrestore(&sc->reuse_lock, flags);
 }

It now drops from 800k to 450k.
And perf report shows that both reuse is happening and slab is working hard to satisfy kmalloc/kfree.
So we may consider per-cpu waiting_for_rcu_gp and per-bpf-ma waiting_for_rcu_task_trace_gp lists.
For now I still prefer to see v5 with per-bpf-ma and no _tail optimization.

Answering your other email:

> I see your point. I will continue to debug the memory usage difference
> between v3 and v4.

imo it's a waste of time to continue analyzing performance based on bench in patch 2.

> I don't think so. Let's consider the per-cpu list first. Assume the
> normal RCU grace period is about 30ms and we are tracing the IO latency
> of a normal SSD. The IOPS is about 176K per second, so before one RCU
> GP has passed, we will need to allocate about 176 * 30 = 5.2K elements.
> For the per-ma list, when the number of CPUs increases, it is easy for
> the list to accumulate thousands of elements.

That would be true only if there were no scheduling events in all of 176K ops.
Which is not the case.
I'm not sure why you're saying that RCU GP is 30ms.
In CONFIG_PREEMPT_NONE rcu_read_lock/unlock are true nops.
Every sched event is sort-of implicit rcu_read_lock/unlock.
Network and block IO don't process 176K packets without a resched.
Don't know how block does it, but in networking NAPI will process 64 packets and yield the softirq.

For small size buckets low_watermark=32 and high=96.
We typically move 32 elements at a time from one list to another.
A bunch of elements may be sitting in free_by_rcu, and moving them to waiting_for_gp
is not instant, but once __free_rcu_tasks_trace is called we need to take
elements from waiting_for_gp one at a time and kfree them one at a time.
So optimizing the move from free_by_rcu into waiting_for_gp is not worth the code complexity.

> Before I post v5, I want to know the reason why the per-bpf-ma list is
> introduced. Previously, I thought it was used to handle the case in which
> allocation and freeing are done on different CPUs.

Correct. per-bpf-ma list is necessary to avoid OOM-ing due to slow rcu_tasks_trace GP.

> And as we can see
> from the benchmark results above and in v3, the performance and the
> memory usage of v4 for add_del_on_diff_cpu are better than v3's.

bench from patch 2 is invalid. Hence no conclusion can be made.

So far the only bench we can trust and analyze is map_perf_test.
Please make the bench in patch 2 yield the cpu after a few updates.
Earlier I suggested sticking to 10, but since NAPI can do 64 at a time,
64 updates is realistic too. A thousand is not.
Alexei Starovoitov June 7, 2023, 8:50 p.m. UTC | #8
On Wed, Jun 7, 2023 at 10:52 AM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Wed, Jun 07, 2023 at 04:42:11PM +0800, Hou Tao wrote:
> > As said in the commit message, the command line for test is
> > "./map_perf_test 4 8 16384", because the default max_entries is 1000. If
> > using default max_entries and the number of CPUs is greater than 15,
> > use_percpu_counter will be false.
>
> Right. percpu or not depends on number of cpus.
>
> >
> > I have double checked my local VM setup (8 CPUs + 16GB) and rerun the
> > test.  For both "./map_perf_test 4 8" and "./map_perf_test 4 8 16384"
> > there are obvious performance degradation.
> ...
> > [root@hello bpf]# ./map_perf_test 4 8 16384
> > 2:hash_map_perf kmalloc 359201 events per sec
> ..
> > [root@hello bpf]# ./map_perf_test 4 8 16384
> > 4:hash_map_perf kmalloc 203983 events per sec
>
> this is indeed a degradation in a VM.
>
> > I also run map_perf_test on a physical x86-64 host with 72 CPUs. The
> > performances for "./map_perf_test 4 8" are similar, but there is obvious
> > performance degradation for "./map_perf_test 4 8 16384"
>
> but... a degradation?
>
> > Before reuse-after-rcu-gp:
> >
> > [houtao@fedora bpf]$ sudo ./map_perf_test 4 8 16384
> > 1:hash_map_perf kmalloc 388088 events per sec
> ...
> > After reuse-after-rcu-gp:
> > [houtao@fedora bpf]$ sudo ./map_perf_test 4 8 16384
> > 5:hash_map_perf kmalloc 655628 events per sec
>
> This is a big improvement :) Not a degradation.
> You always have to double-check the numbers with perf report.
>
> > So could you please double check your setup and rerun map_perf_test ? If
> > there is no performance degradation, could you please share your setup
> > and your kernel configure file ?
>
> I'm testing on normal no-debug kernel. No kasan. No lockdep. HZ=1000
> Playing with it a bit more I found something interesting:
> map_perf_test 4 8 16348
> before/after has too much noise to be conclusive.
>
> So I did
> map_perf_test 4 8 16348 1000000
>
> and now I see a significant degradation from patch 3.
> It drops from 800k to 200k.
> And perf report confirms that heavy contention on sc->reuse_lock is the culprit.
> The following hack addresses most of the perf degradation:
>
> diff --git a/kernel/bpf/memalloc.c b/kernel/bpf/memalloc.c
> index fea1cb0c78bb..eeadc9359097 100644
> --- a/kernel/bpf/memalloc.c
> +++ b/kernel/bpf/memalloc.c
> @@ -188,7 +188,7 @@ static int bpf_ma_get_reusable_obj(struct bpf_mem_cache *c, int cnt)
>         alloc = 0;
>         head = NULL;
>         tail = NULL;
> -       raw_spin_lock_irqsave(&sc->reuse_lock, flags);
> +       if (raw_spin_trylock_irqsave(&sc->reuse_lock, flags)) {
>         while (alloc < cnt) {
>                 obj = __llist_del_first(&sc->reuse_ready_head);
>                 if (obj) {
> @@ -206,6 +206,7 @@ static int bpf_ma_get_reusable_obj(struct bpf_mem_cache *c, int cnt)
>                 alloc++;
>         }
>         raw_spin_unlock_irqrestore(&sc->reuse_lock, flags);
> +       }
>
>         if (alloc) {
>                 if (IS_ENABLED(CONFIG_PREEMPT_RT))
> @@ -334,9 +335,11 @@ static void bpf_ma_add_to_reuse_ready_or_free(struct bpf_mem_cache *c)
>                 sc->reuse_ready_tail = NULL;
>                 WARN_ON_ONCE(!llist_empty(&sc->wait_for_free));
>                 __llist_add_batch(head, tail, &sc->wait_for_free);
> +               raw_spin_unlock_irqrestore(&sc->reuse_lock, flags);
>                 call_rcu_tasks_trace(&sc->rcu, free_rcu);
> +       } else {
> +               raw_spin_unlock_irqrestore(&sc->reuse_lock, flags);
>         }
> -       raw_spin_unlock_irqrestore(&sc->reuse_lock, flags);
>  }
>
> It now drops from 800k to 450k.
> And perf report shows that both reuse is happening and slab is working hard to satisfy kmalloc/kfree.
> So we may consider per-cpu waiting_for_rcu_gp and per-bpf-ma waiting_for_rcu_task_trace_gp lists.

Sorry. per-cpu waiting_for_rcu_gp is what patch 3 does already.
I meant per-cpu reuse_ready and per-bpf-ma waiting_for_rcu_task_trace_gp.

Also noticed that the overhead of shared reuse_ready list
comes both from the contended lock and from cache misses
when one cpu pushes to the list after RCU GP and another
cpu removes.

Also, the low/batch/high watermarks are all wrong in patch 3.
low=32 and high=96 make no sense when it's not a single list.
I'm experimenting with 32 for all three heuristics.

Another thing I noticed is that the per-cpu prepare_reuse and free_by_rcu
lists are redundant.
unit_free() can push into free_by_rcu directly;
then reuse_bulk() can fill it up from free_llist_extra and
move everything into waiting_for_gp.

All these _tail optimizations are obscuring the code and making it hard
to notice these issues.

> For now I still prefer to see v5 with per-bpf-ma and no _tail optimization.
>
> Answering your other email:
>
> > I see your point. I will continue to debug the memory usage difference
> > between v3 and v4.
>
> imo it's a waste of time to continue analyzing performance based on bench in patch 2.
>
> > I don't think so. Let's consider the per-cpu list first. Assume the
> > normal RCU grace period is about 30ms and we are tracing the IO latency
> > of a normal SSD. The IOPS is about 176K per second, so before one RCU
> > GP has passed, we will need to allocate about 176 * 30 = 5.2K elements.
> > For the per-ma list, when the number of CPUs increases, it is easy for
> > the list to accumulate thousands of elements.
>
> That would be true only if there were no scheduling events in all of 176K ops.
> Which is not the case.
> I'm not sure why you're saying that RCU GP is 30ms.
> In CONFIG_PREEMPT_NONE rcu_read_lock/unlock are true nops.
> Every sched event is sort-of implicit rcu_read_lock/unlock.
> Network and block IO doesn't process 176K packets without resched.
> Don't know how block does it, but in networking NAPI will process 64 packets and will yield softirq.
>
> For small size buckets low_watermark=32 and high=96.
> We typically move 32 elements at a time from one list to another.
> A bunch of elements may be sitting in free_by_rcu, and moving them to waiting_for_gp
> is not instant, but once __free_rcu_tasks_trace is called we need to take
> elements from waiting_for_gp one at a time and kfree them one at a time.
> So optimizing the move from free_by_rcu into waiting_for_gp is not worth the code complexity.
>
> > Before I post v5, I want to know the reason why the per-bpf-ma list is
> > introduced. Previously, I thought it was used to handle the case in which
> > allocation and freeing are done on different CPUs.
>
> Correct. per-bpf-ma list is necessary to avoid OOM-ing due to slow rcu_tasks_trace GP.
>
> > And as we can see
> > from the benchmark results above and in v3, the performance and the
> > memory usage of v4 for add_del_on_diff_cpu are better than v3's.
>
> bench from patch 2 is invalid. Hence no conclusion can be made.
>
> So far the only bench we can trust and analyze is map_perf_test.
> Please make the bench in patch 2 yield the cpu after a few updates.
> Earlier I suggested sticking to 10, but since NAPI can do 64 at a time,
> 64 updates is realistic too. A thousand is not.
Alexei Starovoitov June 7, 2023, 11:23 p.m. UTC | #9
On Wed, Jun 7, 2023 at 1:50 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Wed, Jun 7, 2023 at 10:52 AM Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> >
> > On Wed, Jun 07, 2023 at 04:42:11PM +0800, Hou Tao wrote:
> > > As said in the commit message, the command line for test is
> > > "./map_perf_test 4 8 16384", because the default max_entries is 1000. If
> > > using default max_entries and the number of CPUs is greater than 15,
> > > use_percpu_counter will be false.
> >
> > Right. percpu or not depends on number of cpus.
> >
> > >
> > > I have double checked my local VM setup (8 CPUs + 16GB) and rerun the
> > > test.  For both "./map_perf_test 4 8" and "./map_perf_test 4 8 16384"
> > > there are obvious performance degradation.
> > ...
> > > [root@hello bpf]# ./map_perf_test 4 8 16384
> > > 2:hash_map_perf kmalloc 359201 events per sec
> > ..
> > > [root@hello bpf]# ./map_perf_test 4 8 16384
> > > 4:hash_map_perf kmalloc 203983 events per sec
> >
> > this is indeed a degradation in a VM.
> >
> > > I also run map_perf_test on a physical x86-64 host with 72 CPUs. The
> > > performances for "./map_perf_test 4 8" are similar, but there is obvious
> > > performance degradation for "./map_perf_test 4 8 16384"
> >
> > but... a degradation?
> >
> > > Before reuse-after-rcu-gp:
> > >
> > > [houtao@fedora bpf]$ sudo ./map_perf_test 4 8 16384
> > > 1:hash_map_perf kmalloc 388088 events per sec
> > ...
> > > After reuse-after-rcu-gp:
> > > [houtao@fedora bpf]$ sudo ./map_perf_test 4 8 16384
> > > 5:hash_map_perf kmalloc 655628 events per sec
> >
> > This is a big improvement :) Not a degradation.
> > You always have to double check the numbers with perf report.
> >
> > > So could you please double check your setup and rerun map_perf_test? If
> > > there is no performance degradation, could you please share your setup
> > > and your kernel configure file ?
> >
> > I'm testing on normal no-debug kernel. No kasan. No lockdep. HZ=1000
> > Playing with it a bit more I found something interesting:
> > map_perf_test 4 8 16348
> > before/after has too much noise to be conclusive.
> >
> > So I did
> > map_perf_test 4 8 16348 1000000
> >
> > and now I see significant degradation from patch 3.
> > It drops from 800k to 200k.
> > And perf report confirms that heavy contention on sc->reuse_lock is the culprit.
> > The following hack addresses most of the perf degradation:
> >
> > diff --git a/kernel/bpf/memalloc.c b/kernel/bpf/memalloc.c
> > index fea1cb0c78bb..eeadc9359097 100644
> > --- a/kernel/bpf/memalloc.c
> > +++ b/kernel/bpf/memalloc.c
> > @@ -188,7 +188,7 @@ static int bpf_ma_get_reusable_obj(struct bpf_mem_cache *c, int cnt)
> >         alloc = 0;
> >         head = NULL;
> >         tail = NULL;
> > -       raw_spin_lock_irqsave(&sc->reuse_lock, flags);
> > +       if (raw_spin_trylock_irqsave(&sc->reuse_lock, flags)) {
> >         while (alloc < cnt) {
> >                 obj = __llist_del_first(&sc->reuse_ready_head);
> >                 if (obj) {
> > @@ -206,6 +206,7 @@ static int bpf_ma_get_reusable_obj(struct bpf_mem_cache *c, int cnt)
> >                 alloc++;
> >         }
> >         raw_spin_unlock_irqrestore(&sc->reuse_lock, flags);
> > +       }
> >
> >         if (alloc) {
> >                 if (IS_ENABLED(CONFIG_PREEMPT_RT))
> > @@ -334,9 +335,11 @@ static void bpf_ma_add_to_reuse_ready_or_free(struct bpf_mem_cache *c)
> >                 sc->reuse_ready_tail = NULL;
> >                 WARN_ON_ONCE(!llist_empty(&sc->wait_for_free));
> >                 __llist_add_batch(head, tail, &sc->wait_for_free);
> > +               raw_spin_unlock_irqrestore(&sc->reuse_lock, flags);
> >                 call_rcu_tasks_trace(&sc->rcu, free_rcu);
> > +       } else {
> > +               raw_spin_unlock_irqrestore(&sc->reuse_lock, flags);
> >         }
> > -       raw_spin_unlock_irqrestore(&sc->reuse_lock, flags);
> >  }
> >
> > It now drops from 800k to 450k.
> > And perf report shows that both reuse is happening and slab is working hard to satisfy kmalloc/kfree.
> > So we may consider per-cpu waiting_for_rcu_gp and per-bpf-ma waiting_for_rcu_task_trace_gp lists.
>
> Sorry. per-cpu waiting_for_rcu_gp is what patch 3 does already.
> I meant per-cpu reuse_ready and per-bpf-ma waiting_for_rcu_task_trace_gp.

An update..

I tweaked patch 3 to do per-cpu reuse_ready and it addressed
the lock contention, but cache miss on
__llist_del_first(&c->reuse_ready_head);
was still very high and performance was still at 450k as
with a simple hack above.

Then I removed some of the _tail optimizations and added counters
to these llists.
To my surprise
map_perf_test 4 1 16348 1000000
was showing ~200k on average in waiting_for_gp when reuse_rcu() is called
and ~400k sitting in reuse_ready_head.

Then I noticed that we should be doing:
call_rcu_hurry(&c->rcu, reuse_rcu);
instead of call_rcu(),
but my config didn't have RCU_LAZY, so that didn't help.
Obviously we cannot allow such a huge number of elements to sit
in these linked lists.
The whole "reuse-after-rcu-gp" idea for bpf_mem_alloc may not work.
To unblock the qp-trie work I suggest adding an rcu_head to each inner node
and doing call_rcu() on them before freeing them to bpf_mem_alloc.
An explicit call_rcu would disqualify qp-trie from tracing programs though :(
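The rcu_head-per-inner-node idea can be modeled in user space roughly as follows. fake_call_rcu() and fake_rcu_gp() stand in for the real call_rcu() and grace-period machinery; all names here are illustrative, not kernel code:

```c
#include <stddef.h>
#include <stdlib.h>

/* Embed a deferral hook (standing in for struct rcu_head) in each
 * inner node and queue the node on a "waiting for grace period" list
 * instead of freeing it immediately. */
struct fake_rcu_head {
	struct fake_rcu_head *next;
	void (*func)(struct fake_rcu_head *head);
};

struct inner_node {
	int key;			/* payload */
	struct fake_rcu_head rcu;	/* deferral hook, like rcu_head */
};

static struct fake_rcu_head *gp_pending;
static int nodes_freed;

static void fake_call_rcu(struct fake_rcu_head *head,
			  void (*func)(struct fake_rcu_head *))
{
	head->func = func;
	head->next = gp_pending;
	gp_pending = head;
}

/* Run all queued callbacks, as if a grace period elapsed. */
static void fake_rcu_gp(void)
{
	struct fake_rcu_head *head = gp_pending;

	gp_pending = NULL;
	while (head) {
		struct fake_rcu_head *next = head->next;

		head->func(head);
		head = next;
	}
}

static void free_node_cb(struct fake_rcu_head *head)
{
	/* container_of by hand: rcu is a member of inner_node */
	struct inner_node *node = (struct inner_node *)
		((char *)head - offsetof(struct inner_node, rcu));

	free(node);		/* here: back to bpf_mem_alloc in the real code */
	nodes_freed++;
}

static void delete_node(struct inner_node *node)
{
	/* node is assumed already unlinked from the trie */
	fake_call_rcu(&node->rcu, free_node_cb);	/* defer, don't free now */
}
```

The key property is that no node returns to the allocator until a grace period has elapsed, which is exactly what makes the scheme unusable from tracing programs that cannot issue an explicit call_rcu.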
Paul E. McKenney June 7, 2023, 11:30 p.m. UTC | #10
On Wed, Jun 07, 2023 at 04:23:20PM -0700, Alexei Starovoitov wrote:
> [ ... quoted text trimmed ... ]
>
> Obviously we cannot allow such a huge number of elements to sit
> in these linked lists.
> The whole "reuse-after-rcu-gp" idea for bpf_mem_alloc may not work.
> To unblock the qp-trie work I suggest adding an rcu_head to each inner node
> and doing call_rcu() on them before freeing them to bpf_mem_alloc.
> An explicit call_rcu would disqualify qp-trie from tracing programs though :(

I am sure that you guys have already considered and discarded this one,
but I cannot help but suggest SLAB_TYPESAFE_BY_RCU.

							Thanx, Paul
Alexei Starovoitov June 7, 2023, 11:50 p.m. UTC | #11
On Wed, Jun 7, 2023 at 4:30 PM Paul E. McKenney <paulmck@kernel.org> wrote:
> [ ... quoted text trimmed ... ]
>
> I am sure that you guys have already considered and discarded this one,
> but I cannot help but suggest SLAB_TYPESAFE_BY_RCU.

SLAB_TYPESAFE_BY_RCU is what bpf_mem_alloc is doing right now.
We want to add an option to make it not do it and instead observe RCU GP
for every element freed via bpf_mem_free().
In other words, make bpf_mem_free() behave like kfree_rcu.
I just tried to use rcu_expedite_gp() before bpf prog runs
and it helps a bit.
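The two free-side policies under discussion -- SLAB_TYPESAFE_BY_RCU-style immediate reuse (what bpf_mem_alloc does today) versus kfree_rcu-like reuse only after a grace period -- can be contrasted with a toy single-threaded model. This is a sketch only: the real per-cpu llists, locking and fallback to the slab are omitted, and fake_rcu_gp() stands in for an actual grace period:

```c
#include <stddef.h>

struct obj {
	struct obj *next;
	int payload;
};

static struct obj *free_list;	/* safe to hand out again */
static struct obj *waiting;	/* freed, but GP not elapsed yet */

static struct obj *obj_alloc(void)
{
	struct obj *o = free_list;

	if (o)
		free_list = o->next;
	return o;	/* NULL means fall back to (slow) slab allocation */
}

/* SLAB_TYPESAFE_BY_RCU style: the object may be handed out again
 * immediately; readers must tolerate reuse of the same type. */
static void obj_free_immediate(struct obj *o)
{
	o->next = free_list;
	free_list = o;
}

/* kfree_rcu-like style: the object is not reusable until a GP. */
static void obj_free_after_gp(struct obj *o)
{
	o->next = waiting;
	waiting = o;
}

static void fake_rcu_gp(void)	/* a grace period elapsed */
{
	while (waiting) {
		struct obj *o = waiting;

		waiting = o->next;
		o->next = free_list;
		free_list = o;
	}
}
```

The memory-footprint problem in the thread falls directly out of the second policy: every element freed between grace periods sits on `waiting`, so a slow GP lets that list grow into the hundreds of thousands.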
Paul E. McKenney June 8, 2023, 12:13 a.m. UTC | #12
On Wed, Jun 07, 2023 at 04:50:35PM -0700, Alexei Starovoitov wrote:
> [ ... quoted text trimmed ... ]
>
> SLAB_TYPESAFE_BY_RCU is what bpf_mem_alloc is doing right now.
> We want to add an option to make it not do it and instead observe RCU GP
> for every element freed via bpf_mem_free().
> In other words, make bpf_mem_free() behave like kfree_rcu.
> I just tried to use rcu_expedite_gp() before bpf prog runs
> and it helps a bit.

OK, got it, so you guys have considered, implemented, and are now trying
to discard SLAB_TYPESAFE_BY_RCU.  ;-)

Given that you are using call_rcu() / call_rcu_hurry(), I am a bit
surprised that rcu_expedite_gp() makes any difference.

We do some expediting if there are huge numbers of callbacks or if one
of RCU's shrinker notifiers is invoked.  If the concern is only memory
footprint, it is possible to make the shrinkers more aggressive.  I am
not sure whether making them unconditionally more aggressive is a good
idea, however if memory footprint is the only concern and if shrink-time
expediting would suffice, it is certainly worth some investigation.

Thoughts?

							Thanx, Paul
Alexei Starovoitov June 8, 2023, 12:34 a.m. UTC | #13
On Wed, Jun 7, 2023 at 5:13 PM Paul E. McKenney <paulmck@kernel.org> wrote:
>
> [ ... quoted text trimmed ... ]
>
> Given that you are using call_rcu() / call_rcu_hurry(), I am a bit
> surprised that rcu_expedite_gp() makes any difference.
>
> We do some expediting if there are huge numbers of callbacks or if one
> of RCU's shrinker notifiers is invoked.  If the concern is only memory
> footprint, it is possible to make the shrinkers more aggressive.  I am
> not sure whether making them unconditionally more aggressive is a good
> idea, however if memory footprint is the only concern and if shrink-time
> expediting would suffice, it is certainly worth some investigation.

Right. I don't think it's a good idea to tweak RCU for this use case.
RCU parameters have to be optimized for all. Instead the bpf side needs
to understand how RCU heuristics/watermarks work and play that game.
For example, Hou's patch 3 has one pending call_rcu per cpu.
As soon as one call_rcu_hurry is in flight, all future freed elements
get queued into the llist, and for the next call_rcu_hurry() that list
will contain 100k elements.
I believe from RCU pov one pending call_rcu cb is not a reason to
act right away. It's trying to batch multiple cb-s.
Right now I'm experimenting with multiple call_rcu calls from the bpf side,
so that RCU sees multiple pending cb-s and has to act.
It seems to work much better. Memory footprint is now reasonable.
Could you point me to a code in RCU where it's doing callback batching?
Paul E. McKenney June 8, 2023, 1:15 a.m. UTC | #14
On Wed, Jun 07, 2023 at 05:34:50PM -0700, Alexei Starovoitov wrote:
> On Wed, Jun 7, 2023 at 5:13 PM Paul E. McKenney <paulmck@kernel.org> wrote:
> >
> > On Wed, Jun 07, 2023 at 04:50:35PM -0700, Alexei Starovoitov wrote:
> > > On Wed, Jun 7, 2023 at 4:30 PM Paul E. McKenney <paulmck@kernel.org> wrote:
> > > >
> > > > On Wed, Jun 07, 2023 at 04:23:20PM -0700, Alexei Starovoitov wrote:
> > > > > On Wed, Jun 7, 2023 at 1:50 PM Alexei Starovoitov
> > > > > <alexei.starovoitov@gmail.com> wrote:
> > > > > >
> > > > > > On Wed, Jun 7, 2023 at 10:52 AM Alexei Starovoitov
> > > > > > <alexei.starovoitov@gmail.com> wrote:
> > > > > > >
> > > > > > > On Wed, Jun 07, 2023 at 04:42:11PM +0800, Hou Tao wrote:
> > > > > > > > As said in the commit message, the command line for test is
> > > > > > > > "./map_perf_test 4 8 16384", because the default max_entries is 1000. If
> > > > > > > > using default max_entries and the number of CPUs is greater than 15,
> > > > > > > > use_percpu_counter will be false.
> > > > > > >
> > > > > > > Right. percpu or not depends on number of cpus.
> > > > > > >
> > > > > > > >
> > > > > > > > I have double checked my local VM setup (8 CPUs + 16GB) and rerun the
> > > > > > > > test.  For both "./map_perf_test 4 8" and "./map_perf_test 4 8 16384"
> > > > > > > > there are obvious performance degradation.
> > > > > > > ...
> > > > > > > > [root@hello bpf]# ./map_perf_test 4 8 16384
> > > > > > > > 2:hash_map_perf kmalloc 359201 events per sec
> > > > > > > ..
> > > > > > > > [root@hello bpf]# ./map_perf_test 4 8 16384
> > > > > > > > 4:hash_map_perf kmalloc 203983 events per sec
> > > > > > >
> > > > > > > this is indeed a degration in a VM.
> > > > > > >
> > > > > > > > I also run map_perf_test on a physical x86-64 host with 72 CPUs. The
> > > > > > > > performances for "./map_perf_test 4 8" are similar, but there is obvious
> > > > > > > > performance degradation for "./map_perf_test 4 8 16384"
> > > > > > >
> > > > > > > but... a degradation?
> > > > > > >
> > > > > > > > Before reuse-after-rcu-gp:
> > > > > > > >
> > > > > > > > [houtao@fedora bpf]$ sudo ./map_perf_test 4 8 16384
> > > > > > > > 1:hash_map_perf kmalloc 388088 events per sec
> > > > > > > ...
> > > > > > > > After reuse-after-rcu-gp:
> > > > > > > > [houtao@fedora bpf]$ sudo ./map_perf_test 4 8 16384
> > > > > > > > 5:hash_map_perf kmalloc 655628 events per sec
> > > > > > >
> > > > > > > This is a big improvement :) Not a degration.
> > > > > > > You always have to double check the numbers with perf report.
> > > > > > >
> > > > > > > > So could you please double check your setup and rerun map_perf_test ? If
> > > > > > > > there is no performance degradation, could you please share your setup
> > > > > > > > and your kernel configure file ?
> > > > > > >
> > > > > > > I'm testing on normal no-debug kernel. No kasan. No lockdep. HZ=1000
> > > > > > > Playing with it a bit more I found something interesting:
> > > > > > > map_perf_test 4 8 16348
> > > > > > > before/after has too much noise to be conclusive.
> > > > > > >
> > > > > > > So I did
> > > > > > > map_perf_test 4 8 16348 1000000
> > > > > > >
> > > > > > > and now I see significant degration from patch 3.
> > > > > > > It drops from 800k to 200k.
> > > > > > > And perf report confirms that heavy contention on sc->reuse_lock is the culprit.
> > > > > > > The following hack addresses most of the perf degradtion:
> > > > > > >
> > > > > > > diff --git a/kernel/bpf/memalloc.c b/kernel/bpf/memalloc.c
> > > > > > > index fea1cb0c78bb..eeadc9359097 100644
> > > > > > > --- a/kernel/bpf/memalloc.c
> > > > > > > +++ b/kernel/bpf/memalloc.c
> > > > > > > @@ -188,7 +188,7 @@ static int bpf_ma_get_reusable_obj(struct bpf_mem_cache *c, int cnt)
> > > > > > >         alloc = 0;
> > > > > > >         head = NULL;
> > > > > > >         tail = NULL;
> > > > > > > -       raw_spin_lock_irqsave(&sc->reuse_lock, flags);
> > > > > > > +       if (raw_spin_trylock_irqsave(&sc->reuse_lock, flags)) {
> > > > > > >         while (alloc < cnt) {
> > > > > > >                 obj = __llist_del_first(&sc->reuse_ready_head);
> > > > > > >                 if (obj) {
> > > > > > > @@ -206,6 +206,7 @@ static int bpf_ma_get_reusable_obj(struct bpf_mem_cache *c, int cnt)
> > > > > > >                 alloc++;
> > > > > > >         }
> > > > > > >         raw_spin_unlock_irqrestore(&sc->reuse_lock, flags);
> > > > > > > +       }
> > > > > > >
> > > > > > >         if (alloc) {
> > > > > > >                 if (IS_ENABLED(CONFIG_PREEMPT_RT))
> > > > > > > @@ -334,9 +335,11 @@ static void bpf_ma_add_to_reuse_ready_or_free(struct bpf_mem_cache *c)
> > > > > > >                 sc->reuse_ready_tail = NULL;
> > > > > > >                 WARN_ON_ONCE(!llist_empty(&sc->wait_for_free));
> > > > > > >                 __llist_add_batch(head, tail, &sc->wait_for_free);
> > > > > > > +               raw_spin_unlock_irqrestore(&sc->reuse_lock, flags);
> > > > > > >                 call_rcu_tasks_trace(&sc->rcu, free_rcu);
> > > > > > > +       } else {
> > > > > > > +               raw_spin_unlock_irqrestore(&sc->reuse_lock, flags);
> > > > > > >         }
> > > > > > > -       raw_spin_unlock_irqrestore(&sc->reuse_lock, flags);
> > > > > > >  }
> > > > > > >
> > > > > > > It now drops from 800k to 450k.
> > > > > > > And perf report shows that both reuse is happening and slab is working hard to satisfy kmalloc/kfree.
> > > > > > > So we may consider per-cpu waiting_for_rcu_gp and per-bpf-ma waiting_for_rcu_task_trace_gp lists.
> > > > > >
> > > > > > Sorry. per-cpu waiting_for_rcu_gp is what patch 3 does already.
> > > > > > I meant per-cpu reuse_ready and per-bpf-ma waiting_for_rcu_task_trace_gp.
> > > > >
> > > > > An update..
> > > > >
> > > > > I tweaked patch 3 to do per-cpu reuse_ready and it addressed
> > > > > the lock contention, but cache miss on
> > > > > __llist_del_first(&c->reuse_ready_head);
> > > > > was still very high and performance was still at 450k as
> > > > > with a simple hack above.
> > > > >
> > > > > Then I removed some of the _tail optimizations and added counters
> > > > > to these llists.
> > > > > To my surprise
> > > > > map_perf_test 4 1 16348 1000000
> > > > > was showing ~200k on average in waiting_for_gp when reuse_rcu() is called
> > > > > and ~400k sitting in reuse_ready_head.
> > > > >
> > > > > Then noticed that we should be doing:
> > > > > call_rcu_hurry(&c->rcu, reuse_rcu);
> > > > > instead of call_rcu(),
> > > > > but my config didn't have RCU_LAZY, so that didn't help.
> > > > > Obviously we cannot allow such a huge number of elements to sit
> > > > > in these link lists.
> > > > > The whole "reuse-after-rcu-gp" idea for bpf_mem_alloc may not work.
> > > > > To unblock qp-trie work I suggest to add rcu_head to each inner node
> > > > > and do call_rcu() on them before free-ing them to bpf_mem_alloc.
> > > > > Explicit call_rcu would disqualify qp-tree from tracing programs though :(
> > > >
> > > > I am sure that you guys have already considered and discarded this one,
> > > > but I cannot help but suggest SLAB_TYPESAFE_BY_RCU.
> > >
> > > SLAB_TYPESAFE_BY_RCU is what bpf_mem_alloc is doing right now.
> > > We want to add an option to make it not do it and instead observe RCU GP
> > > for every element freed via bpf_mem_free().
> > > In other words, make bpf_mem_free() behave like kfree_rcu.
> > > I just tried to use rcu_expedite_gp() before bpf prog runs
> > > and it helps a bit.
> >
> > OK, got it, so you guys have considered, implemented, and are now trying
> > to discard SLAB_TYPESAFE_BY_RCU.  ;-)
> >
> > Given that you are using call_rcu() / call_rcu_hurry(), I am a bit
> > surprised that rcu_expedite_gp() makes any difference.
> >
> > We do some expediting if there are huge numbers of callbacks or if one
> > of RCU's shrinker notifiers is invoked.  If the concern is only memory
> > footprint, it is possible to make the shrinkers more aggressive.  I am
> > not sure whether making them unconditionally more aggressive is a good
> > idea, however if memory footprint is the only concern and if shrink-time
> > expediting would suffice, it is certainly worth some investigation.
> 
> Right. I don't think it's a good idea to tweak RCU for this use case.
> RCU parameters have to be optimized for all. Instead the bpf side needs
> to understand how RCU heuristics/watermarks work and play that game.
> For example, Hou's patch 3 has one pending call_rcu per-cpu.
> As soon as one call_rcu_hurry is done all future freed elements gets
> queued into llist and for the next call_rcu_hurry() that list will
> contain 100k elements.
> I believe from RCU pov one pending call_rcu cb is not a reason to
> act right away. It's trying to batch multiple cb-s.

Indeed, if a single RCU callback were that important, it would be necessary
for the caller to explicitly say so somehow.  And I agree that this
would open a can of worms that might be best left unopened.

> Right now I'm experimenting with multiple call_rcu calls from the bpf side,
> so that RCU sees multiple pending cb-s and has to act.
> It seems to work much better. Memory footprint is now reasonable.

Nice!!!

> Could you point me to a code in RCU where it's doing callback batching?

There are quite a few things in RCU that make this happen:

1.	The latency of the previous RCU grace period automatically
	batches up callbacks for the RCU grace period that is just
	now starting.

2.	More specifically, the rcutree.jiffies_till_first_fqs and
	rcutree.jiffies_till_next_fqs module parameters control
	how long RCU grace-period processing waits between checks
	for idle CPUs.

3.	The rcu_implicit_dynticks_qs() function is invoked periodically
	as specified by #2 above, and pays attention to grace-period
	progress.  If it decides that the current grace period is not
	moving along quickly enough, it will enlist help from
	cond_resched(), the context-switch code, and from a
	scheduler IPI.  Failing that, it collects information that
	can be printed out by the ensuing RCU CPU stall warning.

4.	When callbacks are not offloaded to the rcuo kthreads, when
	the __call_rcu_core() function decides that there are too many
	callbacks piled up on the current CPU, it gives the current grace
	period a kick by invoking rcu_force_quiescent_state(), which
	in turn forces RCU to check for holdout CPUs, and also forces
	the current CPU to check immediately for a quiescent state.

5.	The __call_rcu_nocb_wake() does something similar for the case
	where callbacks are offloaded.

	Most datacenter Linux kernels do *not* offload callbacks.

As a rough rule of thumb, RCU wants to keep grace period latency
somewhere in the milliseconds.  If the grace period goes too slowly or
if there are too many callbacks piling up, it will take evasive action
as described above.

Does that help, or am I missing the point of your question?

							Thanx, Paul
Hou Tao June 8, 2023, 1:51 a.m. UTC | #15
Hi,

On 6/8/2023 4:50 AM, Alexei Starovoitov wrote:
> On Wed, Jun 7, 2023 at 10:52 AM Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
>> On Wed, Jun 07, 2023 at 04:42:11PM +0800, Hou Tao wrote:
>>> As said in the commit message, the command line for test is
>>> "./map_perf_test 4 8 16384", because the default max_entries is 1000. If
>>> using default max_entries and the number of CPUs is greater than 15,
>>> use_percpu_counter will be false.
>> Right. percpu or not depends on number of cpus.
>>
>>> I have double checked my local VM setup (8 CPUs + 16GB) and rerun the
>>> test.  For both "./map_perf_test 4 8" and "./map_perf_test 4 8 16384"
>>> there are obvious performance degradation.
>> ...
>>> [root@hello bpf]# ./map_perf_test 4 8 16384
>>> 2:hash_map_perf kmalloc 359201 events per sec
>> ..
>>> [root@hello bpf]# ./map_perf_test 4 8 16384
>>> 4:hash_map_perf kmalloc 203983 events per sec
>> this is indeed a degration in a VM.
>>
>>> I also run map_perf_test on a physical x86-64 host with 72 CPUs. The
>>> performances for "./map_perf_test 4 8" are similar, but there is obvious
>>> performance degradation for "./map_perf_test 4 8 16384"
>> but... a degradation?
Er, my bad. I mislabeled "Before" and "After". v4 indeed introduces a
big performance degradation on the physical host.
>>
>>> Before reuse-after-rcu-gp:
>>>
>>> [houtao@fedora bpf]$ sudo ./map_perf_test 4 8 16384
>>> 1:hash_map_perf kmalloc 388088 events per sec
>> ...
>>> After reuse-after-rcu-gp:
>>> [houtao@fedora bpf]$ sudo ./map_perf_test 4 8 16384
>>> 5:hash_map_perf kmalloc 655628 events per sec
>> This is a big improvement :) Not a degration.
>> You always have to double check the numbers with perf report.
>>
>>> So could you please double check your setup and rerun map_perf_test ? If
>>> there is no performance degradation, could you please share your setup
>>> and your kernel configure file ?
>> I'm testing on normal no-debug kernel. No kasan. No lockdep. HZ=1000
>> Playing with it a bit more I found something interesting:
>> map_perf_test 4 8 16348
>> before/after has too much noise to be conclusive.
>>
>> So I did
>> map_perf_test 4 8 16348 1000000
>>
>> and now I see significant degration from patch 3.
>> It drops from 800k to 200k.
>> And perf report confirms that heavy contention on sc->reuse_lock is the culprit.
>> The following hack addresses most of the perf degradtion:
>>
>> diff --git a/kernel/bpf/memalloc.c b/kernel/bpf/memalloc.c
>> index fea1cb0c78bb..eeadc9359097 100644
>> --- a/kernel/bpf/memalloc.c
>> +++ b/kernel/bpf/memalloc.c
>> @@ -188,7 +188,7 @@ static int bpf_ma_get_reusable_obj(struct bpf_mem_cache *c, int cnt)
>>         alloc = 0;
>>         head = NULL;
>>         tail = NULL;
>> -       raw_spin_lock_irqsave(&sc->reuse_lock, flags);
>> +       if (raw_spin_trylock_irqsave(&sc->reuse_lock, flags)) {
>>         while (alloc < cnt) {
>>                 obj = __llist_del_first(&sc->reuse_ready_head);
>>                 if (obj) {
>> @@ -206,6 +206,7 @@ static int bpf_ma_get_reusable_obj(struct bpf_mem_cache *c, int cnt)
>>                 alloc++;
>>         }
>>         raw_spin_unlock_irqrestore(&sc->reuse_lock, flags);
>> +       }
>>
>>         if (alloc) {
>>                 if (IS_ENABLED(CONFIG_PREEMPT_RT))
>> @@ -334,9 +335,11 @@ static void bpf_ma_add_to_reuse_ready_or_free(struct bpf_mem_cache *c)
>>                 sc->reuse_ready_tail = NULL;
>>                 WARN_ON_ONCE(!llist_empty(&sc->wait_for_free));
>>                 __llist_add_batch(head, tail, &sc->wait_for_free);
>> +               raw_spin_unlock_irqrestore(&sc->reuse_lock, flags);
>>                 call_rcu_tasks_trace(&sc->rcu, free_rcu);
>> +       } else {
>> +               raw_spin_unlock_irqrestore(&sc->reuse_lock, flags);
>>         }
>> -       raw_spin_unlock_irqrestore(&sc->reuse_lock, flags);
>>  }
>>
>> It now drops from 800k to 450k.
>> And perf report shows that both reuse is happening and slab is working hard to satisfy kmalloc/kfree.
>> So we may consider per-cpu waiting_for_rcu_gp and per-bpf-ma waiting_for_rcu_task_trace_gp lists.
> Sorry. per-cpu waiting_for_rcu_gp is what patch 3 does already.
> I meant per-cpu reuse_ready and per-bpf-ma waiting_for_rcu_task_trace_gp.
Yes, I know, because I had just proposed it in the email yesterday.
>
> Also noticed that the overhead of shared reuse_ready list
> comes both from the contended lock and from cache misses
> when one cpu pushes to the list after RCU GP and another
> cpu removes.
>
> Also low/batch/high watermark are all wrong in patch 3.
> low=32 and high=96 makes no sense when it's not a single list.
> I'm experimenting with 32 for all three heuristics.
>
> Another thing I noticed that per-cpu prepare_reuse and free_by_rcu
> are redundant.
> unit_free() can push into free_by_rcu directly
> then reuse_bulk() can fill it up with free_llist_extra and
> move them into waiting_for_gp.
Yes, I indeed missed that.
>
> All these _tail optimizations are obscuring the code and make it hard
> to notice these issues.
>
>> For now I still prefer to see v5 with per-bpf-ma and no _tail optimization.
>>
>> Answering your other email:
>>
>>> I see your point. I will continue to debug the memory usage difference
>>> between v3 and v4.
>> imo it's a waste of time to continue analyzing performance based on bench in patch 2.
I don't agree with that. I still think the potential memory usage of v4 is
a problem, and the difference in memory usage between v3 and v4 reveals
some peculiarity in the RCU subsystem, because the difference comes
from the duration of the RCU grace period. We need to find out why and fix
or work around that.
>>
>>> I don't think so. Let's considering the per-cpu list first. Assume the
>>> normal RCU grace period is about 30ms and we are tracing the IO latency
>>> of a normal SSD. The iops is about 176K per seconds, so before one RCU
>>> GP is passed, we will need to allocate about 176 * 30 = 5.2K elements.
>>> For the per-ma list, when the number of CPUs increased, it is easy to
>>> make the list contain thousands of elements.
>> That would be true only if there were no scheduling events in all of 176K ops.
>> Which is not the case.
>> I'm not sure why you're saying that RCU GP is 30ms.
Because these freed elements will only be freed after one RCU GP, and in v4
there is only one RCU callback per-cpu, so until that RCU GP expires,
the freed elements accumulate on the list.
>> In CONFIG_PREEMPT_NONE rcu_read_lock/unlock are true nops.
>> Every sched event is sort-of implicit rcu_read_lock/unlock.
>> Network and block IO doesn't process 176K packets without resched.
>> Don't know how block does it, but in networking NAPI will process 64 packets and will yield softirq.
>>
>> For small size buckets low_watermark=32 and high=96.
>> We typically move 32 elements at a time from one list to another.
>> A bunch of elements maybe sitting in free_by_rcu and moving them to waiting_for_gp
>> is not instant, but once __free_rcu_tasks_trace is called we need to take
>> elements from waiting_for_gp one at a time and kfree it one at a time.
>> So optimizing the move from free_by_rcu into waiting_for_gp is not worth the code complexity.
>>
>>> Before I post v5, I want to know the reason why per-bpf-ma list is
>>> introduced. Previously, I though it was used to handle the case in which
>>> allocation and freeing are done on different CPUs.
>> Correct. per-bpf-ma list is necessary to avoid OOM-ing due to slow rcu_tasks_trace GP.
>>
>>> And as we can see
>>> from the benchmark result above and in v3, the performance and the
>>> memory usage of v4 for add_del_on_diff_cpu is better than v3.
>> bench from patch 2 is invalid. Hence no conclusion can be made.
>>
>> So far the only bench we can trust and analyze is map_perf_test.
>> Please make bench in patch 2 yield the cpu after few updates.
>> Earlier I suggested to stick to 10, but since NAPI can do 64 at a time.
>> 64 updates is realistic too. A thousand is not.
Will do that.
Hou Tao June 8, 2023, 1:57 a.m. UTC | #16
Hi,

On 6/8/2023 7:23 AM, Alexei Starovoitov wrote:
> On Wed, Jun 7, 2023 at 1:50 PM Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
>> On Wed, Jun 7, 2023 at 10:52 AM Alexei Starovoitov
>> <alexei.starovoitov@gmail.com> wrote:
>>> On Wed, Jun 07, 2023 at 04:42:11PM +0800, Hou Tao wrote:
>>>> As said in the commit message, the command line for test is
>>>> "./map_perf_test 4 8 16384", because the default max_entries is 1000. If
>>>> using default max_entries and the number of CPUs is greater than 15,
>>>> use_percpu_counter will be false.
>>> Right. percpu or not depends on number of cpus.
>>>
>>>> I have double checked my local VM setup (8 CPUs + 16GB) and rerun the
>>>> test.  For both "./map_perf_test 4 8" and "./map_perf_test 4 8 16384"
>>>> there are obvious performance degradation.
>>> ...
>>>> [root@hello bpf]# ./map_perf_test 4 8 16384
>>>> 2:hash_map_perf kmalloc 359201 events per sec
>>> ..
>>>> [root@hello bpf]# ./map_perf_test 4 8 16384
>>>> 4:hash_map_perf kmalloc 203983 events per sec
>>> this is indeed a degration in a VM.
>>>
>>>> I also run map_perf_test on a physical x86-64 host with 72 CPUs. The
>>>> performances for "./map_perf_test 4 8" are similar, but there is obvious
>>>> performance degradation for "./map_perf_test 4 8 16384"
>>> but... a degradation?
>>>
>>>> Before reuse-after-rcu-gp:
>>>>
>>>> [houtao@fedora bpf]$ sudo ./map_perf_test 4 8 16384
>>>> 1:hash_map_perf kmalloc 388088 events per sec
>>> ...
>>>> After reuse-after-rcu-gp:
>>>> [houtao@fedora bpf]$ sudo ./map_perf_test 4 8 16384
>>>> 5:hash_map_perf kmalloc 655628 events per sec
>>> This is a big improvement :) Not a degration.
>>> You always have to double check the numbers with perf report.
>>>
>>>> So could you please double check your setup and rerun map_perf_test ? If
>>>> there is no performance degradation, could you please share your setup
>>>> and your kernel configure file ?
>>> I'm testing on normal no-debug kernel. No kasan. No lockdep. HZ=1000
>>> Playing with it a bit more I found something interesting:
>>> map_perf_test 4 8 16348
>>> before/after has too much noise to be conclusive.
>>>
>>> So I did
>>> map_perf_test 4 8 16348 1000000
>>>
>>> and now I see significant degration from patch 3.
>>> It drops from 800k to 200k.
>>> And perf report confirms that heavy contention on sc->reuse_lock is the culprit.
>>> The following hack addresses most of the perf degradtion:
>>>
>>> diff --git a/kernel/bpf/memalloc.c b/kernel/bpf/memalloc.c
>>> index fea1cb0c78bb..eeadc9359097 100644
>>> --- a/kernel/bpf/memalloc.c
>>> +++ b/kernel/bpf/memalloc.c
>>> @@ -188,7 +188,7 @@ static int bpf_ma_get_reusable_obj(struct bpf_mem_cache *c, int cnt)
>>>         alloc = 0;
>>>         head = NULL;
>>>         tail = NULL;
>>> -       raw_spin_lock_irqsave(&sc->reuse_lock, flags);
>>> +       if (raw_spin_trylock_irqsave(&sc->reuse_lock, flags)) {
>>>         while (alloc < cnt) {
>>>                 obj = __llist_del_first(&sc->reuse_ready_head);
>>>                 if (obj) {
>>> @@ -206,6 +206,7 @@ static int bpf_ma_get_reusable_obj(struct bpf_mem_cache *c, int cnt)
>>>                 alloc++;
>>>         }
>>>         raw_spin_unlock_irqrestore(&sc->reuse_lock, flags);
>>> +       }
>>>
>>>         if (alloc) {
>>>                 if (IS_ENABLED(CONFIG_PREEMPT_RT))
>>> @@ -334,9 +335,11 @@ static void bpf_ma_add_to_reuse_ready_or_free(struct bpf_mem_cache *c)
>>>                 sc->reuse_ready_tail = NULL;
>>>                 WARN_ON_ONCE(!llist_empty(&sc->wait_for_free));
>>>                 __llist_add_batch(head, tail, &sc->wait_for_free);
>>> +               raw_spin_unlock_irqrestore(&sc->reuse_lock, flags);
>>>                 call_rcu_tasks_trace(&sc->rcu, free_rcu);
>>> +       } else {
>>> +               raw_spin_unlock_irqrestore(&sc->reuse_lock, flags);
>>>         }
>>> -       raw_spin_unlock_irqrestore(&sc->reuse_lock, flags);
>>>  }
>>>
>>> It now drops from 800k to 450k.
>>> And perf report shows that both reuse is happening and slab is working hard to satisfy kmalloc/kfree.
>>> So we may consider per-cpu waiting_for_rcu_gp and per-bpf-ma waiting_for_rcu_task_trace_gp lists.
>> Sorry. per-cpu waiting_for_rcu_gp is what patch 3 does already.
>> I meant per-cpu reuse_ready and per-bpf-ma waiting_for_rcu_task_trace_gp.
> An update..
>
> I tweaked patch 3 to do per-cpu reuse_ready and it addressed
> the lock contention, but cache miss on
> __llist_del_first(&c->reuse_ready_head);
> was still very high and performance was still at 450k as
> with a simple hack above.
>
> Then I removed some of the _tail optimizations and added counters
> to these llists.
> To my surprise
> map_perf_test 4 1 16348 1000000
> was showing ~200k on average in waiting_for_gp when reuse_rcu() is called
> and ~400k sitting in reuse_ready_head.
Yep. If you use the htab-mem benchmark in patch #2, you will find the same
results: the same long singly linked lists and the same huge memory usage.
>
> Then noticed that we should be doing:
> call_rcu_hurry(&c->rcu, reuse_rcu);
> instead of call_rcu(),
> but my config didn't have RCU_LAZY, so that didn't help.
> Obviously we cannot allow such a huge number of elements to sit
> in these link lists.
> The whole "reuse-after-rcu-gp" idea for bpf_mem_alloc may not work.
I think the main blocker is the huge memory usage (which is the same thing
as the long wait_for_reuse and wait_for_free lists), right?
> To unblock qp-trie work I suggest to add rcu_head to each inner node
> and do call_rcu() on them before free-ing them to bpf_mem_alloc.
> Explicit call_rcu would disqualify qp-tree from tracing programs though :(
Paul E. McKenney June 8, 2023, 2:55 a.m. UTC | #17
On Thu, Jun 08, 2023 at 09:51:27AM +0800, Hou Tao wrote:
> Hi,
> 
> On 6/8/2023 4:50 AM, Alexei Starovoitov wrote:
> > On Wed, Jun 7, 2023 at 10:52 AM Alexei Starovoitov
> > <alexei.starovoitov@gmail.com> wrote:
> >> On Wed, Jun 07, 2023 at 04:42:11PM +0800, Hou Tao wrote:
> >>> As said in the commit message, the command line for test is
> >>> "./map_perf_test 4 8 16384", because the default max_entries is 1000. If
> >>> using default max_entries and the number of CPUs is greater than 15,
> >>> use_percpu_counter will be false.
> >> Right. percpu or not depends on number of cpus.
> >>
> >>> I have double checked my local VM setup (8 CPUs + 16GB) and rerun the
> >>> test.  For both "./map_perf_test 4 8" and "./map_perf_test 4 8 16384"
> >>> there are obvious performance degradation.
> >> ...
> >>> [root@hello bpf]# ./map_perf_test 4 8 16384
> >>> 2:hash_map_perf kmalloc 359201 events per sec
> >> ..
> >>> [root@hello bpf]# ./map_perf_test 4 8 16384
> >>> 4:hash_map_perf kmalloc 203983 events per sec
> >> this is indeed a degration in a VM.
> >>
> >>> I also run map_perf_test on a physical x86-64 host with 72 CPUs. The
> >>> performances for "./map_perf_test 4 8" are similar, but there is obvious
> >>> performance degradation for "./map_perf_test 4 8 16384"
> >> but... a degradation?
> Er, My bad. I miss-labeled "Before" and "After". v4 indeed introduces
> big performance degradation in physical host.
> >>
> >>> Before reuse-after-rcu-gp:
> >>>
> >>> [houtao@fedora bpf]$ sudo ./map_perf_test 4 8 16384
> >>> 1:hash_map_perf kmalloc 388088 events per sec
> >> ...
> >>> After reuse-after-rcu-gp:
> >>> [houtao@fedora bpf]$ sudo ./map_perf_test 4 8 16384
> >>> 5:hash_map_perf kmalloc 655628 events per sec
> >> This is a big improvement :) Not a degration.
> >> You always have to double check the numbers with perf report.
> >>
> >>> So could you please double check your setup and rerun map_perf_test ? If
> >>> there is no performance degradation, could you please share your setup
> >>> and your kernel configure file ?
> >> I'm testing on normal no-debug kernel. No kasan. No lockdep. HZ=1000
> >> Playing with it a bit more I found something interesting:
> >> map_perf_test 4 8 16348
> >> before/after has too much noise to be conclusive.
> >>
> >> So I did
> >> map_perf_test 4 8 16348 1000000
> >>
> >> and now I see significant degration from patch 3.
> >> It drops from 800k to 200k.
> >> And perf report confirms that heavy contention on sc->reuse_lock is the culprit.
> >> The following hack addresses most of the perf degradtion:
> >>
> >> diff --git a/kernel/bpf/memalloc.c b/kernel/bpf/memalloc.c
> >> index fea1cb0c78bb..eeadc9359097 100644
> >> --- a/kernel/bpf/memalloc.c
> >> +++ b/kernel/bpf/memalloc.c
> >> @@ -188,7 +188,7 @@ static int bpf_ma_get_reusable_obj(struct bpf_mem_cache *c, int cnt)
> >>         alloc = 0;
> >>         head = NULL;
> >>         tail = NULL;
> >> -       raw_spin_lock_irqsave(&sc->reuse_lock, flags);
> >> +       if (raw_spin_trylock_irqsave(&sc->reuse_lock, flags)) {
> >>         while (alloc < cnt) {
> >>                 obj = __llist_del_first(&sc->reuse_ready_head);
> >>                 if (obj) {
> >> @@ -206,6 +206,7 @@ static int bpf_ma_get_reusable_obj(struct bpf_mem_cache *c, int cnt)
> >>                 alloc++;
> >>         }
> >>         raw_spin_unlock_irqrestore(&sc->reuse_lock, flags);
> >> +       }
> >>
> >>         if (alloc) {
> >>                 if (IS_ENABLED(CONFIG_PREEMPT_RT))
> >> @@ -334,9 +335,11 @@ static void bpf_ma_add_to_reuse_ready_or_free(struct bpf_mem_cache *c)
> >>                 sc->reuse_ready_tail = NULL;
> >>                 WARN_ON_ONCE(!llist_empty(&sc->wait_for_free));
> >>                 __llist_add_batch(head, tail, &sc->wait_for_free);
> >> +               raw_spin_unlock_irqrestore(&sc->reuse_lock, flags);
> >>                 call_rcu_tasks_trace(&sc->rcu, free_rcu);
> >> +       } else {
> >> +               raw_spin_unlock_irqrestore(&sc->reuse_lock, flags);
> >>         }
> >> -       raw_spin_unlock_irqrestore(&sc->reuse_lock, flags);
> >>  }
> >>
> >> It now drops from 800k to 450k.
> >> And perf report shows that both reuse is happening and slab is working hard to satisfy kmalloc/kfree.
> >> So we may consider per-cpu waiting_for_rcu_gp and per-bpf-ma waiting_for_rcu_task_trace_gp lists.
> > Sorry. per-cpu waiting_for_rcu_gp is what patch 3 does already.
> > I meant per-cpu reuse_ready and per-bpf-ma waiting_for_rcu_task_trace_gp.
> Yes, I known, because I had just proposed it in the email yesterday.
> >
> > Also noticed that the overhead of shared reuse_ready list
> > comes both from the contended lock and from cache misses
> > when one cpu pushes to the list after RCU GP and another
> > cpu removes.
> >
> > Also low/batch/high watermark are all wrong in patch 3.
> > low=32 and high=96 makes no sense when it's not a single list.
> > I'm experimenting with 32 for all three heuristics.
> >
> > Another thing I noticed that per-cpu prepare_reuse and free_by_rcu
> > are redundant.
> > unit_free() can push into free_by_rcu directly
> > then reuse_bulk() can fill it up with free_llist_extra and
> > move them into waiting_for_gp.
> Yes. Indeed missing that.
> >
> > All these _tail optimizations are obscuring the code and make it hard
> > to notice these issues.
> >
> >> For now I still prefer to see v5 with per-bpf-ma and no _tail optimization.
> >>
> >> Answering your other email:
> >>
> >>> I see your point. I will continue to debug the memory usage difference
> >>> between v3 and v4.
> >> imo it's a waste of time to continue analyzing performance based on bench in patch 2.
> Don't agree with that. I still think the potential memory usage of v4 is
> a problem and the difference memory usage between v3 and v4 reveals that
> there is some peculiarity in RCU subsystem, because the difference comes
> from the duration of RCU grace period. We need to find out why and fix
> or workaround that.

A tight loop in the kernel can extend RCU grace periods, especially
for kernels built with CONFIG_PREEMPTION=n.  Placing things like
cond_resched() in such loops can help.  Of course, if you are in an
RCU read-side critical section (or holding a spinlock), you will need
to exit, cond_resched(), then re-enter.  Taking care to ensure that the
state upon re-entry is valid.  After all, having exited either type of
critical section, anything might happen.

							Thanx, Paul

> >>> I don't think so. Let's considering the per-cpu list first. Assume the
> >>> normal RCU grace period is about 30ms and we are tracing the IO latency
> >>> of a normal SSD. The iops is about 176K per seconds, so before one RCU
> >>> GP is passed, we will need to allocate about 176 * 30 = 5.2K elements.
> >>> For the per-ma list, when the number of CPUs increased, it is easy to
> >>> make the list contain thousands of elements.
> >> That would be true only if there were no scheduling events in all of 176K ops.
> >> Which is not the case.
> >> I'm not sure why you're saying that RCU GP is 30ms.
> Because these freed elements will be freed after one RCU GP and in v4
> there is only one RCU callback per-cpu, so before one RCU GP is expired,
> these freed elements will be accumulated on the list.
> >> In CONFIG_PREEMPT_NONE rcu_read_lock/unlock are true nops.
> >> Every sched event is sort-of implicit rcu_read_lock/unlock.
> >> Network and block IO doesn't process 176K packets without resched.
> >> Don't know how block does it, but in networking NAPI will process 64 packets and will yield softirq.
> >>
> >> For small size buckets low_watermark=32 and high=96.
> >> We typically move 32 elements at a time from one list to another.
> >> A bunch of elements maybe sitting in free_by_rcu and moving them to waiting_for_gp
> >> is not instant, but once __free_rcu_tasks_trace is called we need to take
> >> elements from waiting_for_gp one at a time and kfree it one at a time.
> >> So optimizing the move from free_by_rcu into waiting_for_gp is not worth the code complexity.
> >>
> >>> Before I post v5, I want to know the reason why per-bpf-ma list is
> >>> introduced. Previously, I though it was used to handle the case in which
> >>> allocation and freeing are done on different CPUs.
> >> Correct. per-bpf-ma list is necessary to avoid OOM-ing due to slow rcu_tasks_trace GP.
> >>
> >>> And as we can see
> >>> from the benchmark result above and in v3, the performance and the
> >>> memory usage of v4 for add_del_on_diff_cpu is better than v3.
> >> bench from patch 2 is invalid. Hence no conclusion can be made.
> >>
> >> So far the only bench we can trust and analyze is map_perf_test.
> >> Please make bench in patch 2 yield the cpu after few updates.
> >> Earlier I suggested to stick to 10, but since NAPI can do 64 at a time.
> >> 64 updates is realistic too. A thousand is not.
> Will do that.
>
Hou Tao June 8, 2023, 3:35 a.m. UTC | #18
Hi,

On 6/8/2023 8:34 AM, Alexei Starovoitov wrote:
> On Wed, Jun 7, 2023 at 5:13 PM Paul E. McKenney <paulmck@kernel.org> wrote:
>> On Wed, Jun 07, 2023 at 04:50:35PM -0700, Alexei Starovoitov wrote:
>>> On Wed, Jun 7, 2023 at 4:30 PM Paul E. McKenney <paulmck@kernel.org> wrote:
SNIP
>>>> An update..
>>>>
>>>> I tweaked patch 3 to do per-cpu reuse_ready and it addressed
>>>> the lock contention, but cache miss on
>>>> __llist_del_first(&c->reuse_ready_head);
>>>> was still very high and performance was still at 450k as
>>>> with a simple hack above.
>>>>
>>>> Then I removed some of the _tail optimizations and added counters
>>>> to these llists.
>>>> To my surprise
>>>> map_perf_test 4 1 16348 1000000
>>>> was showing ~200k on average in waiting_for_gp when reuse_rcu() is called
>>>> and ~400k sitting in reuse_ready_head.
>>>>
>>>> Then noticed that we should be doing:
>>>> call_rcu_hurry(&c->rcu, reuse_rcu);
>>>> instead of call_rcu(),
>>>> but my config didn't have RCU_LAZY, so that didn't help.
>>>> Obviously we cannot allow such a huge number of elements to sit
>>>> in these link lists.
>>>> The whole "reuse-after-rcu-gp" idea for bpf_mem_alloc may not work.
>>>> To unblock qp-trie work I suggest to add rcu_head to each inner node
>>>> and do call_rcu() on them before free-ing them to bpf_mem_alloc.
>>>> Explicit call_rcu would disqualify qp-tree from tracing programs though :(
>>>> I am sure that you guys have already considered and discarded this one,
>>>> but I cannot help but suggest SLAB_TYPESAFE_BY_RCU.
Thanks for the suggestion. SLAB_TYPESAFE_BY_RCU requires the reader to
check whether the found element is still valid and to retry the lookup
if it is not. The validity check also differs from cache to cache. In
bpf we don't want readers to take care of the reuse and retry logic
themselves; we want the bpf memory allocator to handle it automatically.
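
A rough sketch of the reader-side retry logic SLAB_TYPESAFE_BY_RCU
would require (lookup(), obj->key, obj->ref, and put_obj() are
hypothetical names, not an actual API):

```c
/* Under SLAB_TYPESAFE_BY_RCU the object may be freed and reused for
 * a new object of the same type at any time, so the lookup key must
 * be re-checked (and a reference taken) after the object is found.
 */
rcu_read_lock();
again:
obj = lookup(key);
if (obj) {
	if (!refcount_inc_not_zero(&obj->ref))
		goto again;		/* object was being freed */
	if (obj->key != key) {
		put_obj(obj);		/* object was reused, retry */
		goto again;
	}
}
rcu_read_unlock();
```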
>>> SLAB_TYPESAFE_BY_RCU is what bpf_mem_alloc is doing right now.
>>> We want to add an option to make it not do it and instead observe RCU GP
>>> for every element freed via bpf_mem_free().
>>> In other words, make bpf_mem_free() behave like kfree_rcu.
>>> I just tried to use rcu_expedite_gp() before bpf prog runs
>>> and it helps a bit.
>> OK, got it, so you guys have considered, implemented, and are now trying
>> to discard SLAB_TYPESAFE_BY_RCU.  ;-)
>>
>> Given that you are using call_rcu() / call_rcu_hurry(), I am a bit
>> surprised that rcu_expedite_gp() makes any difference.
>>
>> We do some expediting if there are huge numbers of callbacks or if one
>> of RCU's shrinker notifiers is invoked.  If the concern is only memory
>> footprint, it is possible to make the shrinkers more aggressive.  I am
>> not sure whether making them unconditionally more aggressive is a good
>> idea, however if memory footprint is the only concern and if shrink-time
>> expediting would suffice, it is certainly worth some investigation.
> Right. I don't think it's a good idea to tweak RCU for this use case.
> RCU parameters have to be optimized for all. Instead the bpf side needs
> to understand how RCU heuristics/watermarks work and play that game.
> For example, Hou's patch 3 has one pending call_rcu per-cpu.
> As soon as one call_rcu_hurry is done all future freed elements gets
> queued into llist and for the next call_rcu_hurry() that list will
> contain 100k elements.
> I believe from RCU pov one pending call_rcu cb is not a reason to
> act right away. It's trying to batch multiple cb-s.
> Right now I'm experimenting with multiple call_rcu calls from the bpf side,
> so that RCU sees multiple pending cb-s and has to act.
> It seems to work much better. Memory footprint is now reasonable.
Could you please share the htab-mem-benchmark memory usage for both the
original v4 and the version with reasonable memory footprint?

v3 already uses multiple RCU cbs for reuse to accelerate the reuse of
freed objects. I also did the experiment for v4 as I replied the day
before yesterday, so I just quote it:

htab-mem-benchmark (reuse-after-RCU-GP):
| name               | loop (k/s)| average memory (MiB)| peak memory (MiB)|
| --                 | --        | --                  | --               |
| no_op              | 1159.18   | 0.99                | 0.99             |
| overwrite          | 11.00     | 2288                | 4109             |
| batch_add_batch_del| 8.86      | 1558                | 2763             |
| add_del_on_diff_cpu| 4.74      | 11.39               | 14.77            |

For 1), after using kmalloc() in irq_work to allocate multiple inflight
RCU callbacks (namely reuse_rcu()), the memory usage decreases a bit,
but it is still not enough:

htab-mem-benchmark (reuse-after-RCU-GP + per-cpu reusable list + multiple reuse_rcu() callbacks):
| name               | loop (k/s)| average memory (MiB)| peak memory (MiB)|
| --                 | --        | --                  | --               |
| no_op              | 1247.00   | 0.97                | 1.00             |
| overwrite          | 16.56     | 490.18              | 557.17           |
| batch_add_batch_del| 11.31     | 276.32              | 360.89           |
| add_del_on_diff_cpu| 4.00      | 24.76               | 42.58            |


By comparing the implementations of v3 and v4, I found one hack which
reduces the memory usage of v4 (with per-cpu reusable list)
significantly and makes the memory usage of v3 and v4 similar. If we
queue an empty work before calling irq_work_raise() as shown below, the
calling latency of reuse_rcu() (a normal RCU callback) decreases from
~150ms to ~10ms. I think the reason is that the duration of a normal
RCU grace period decreases a lot, but I don't know why that happens and
hope to get some help from Paul. Because Paul doesn't have enough
context, I will try to explain the context of the weird problem
below. And Alexei, could you please also try the hack below on your
multiple-rcu-cbs version?

Hi Paul,

I just found out that the time between the call to call_rcu(..,
reuse_rcu) and the invocation of the RCU callback (namely reuse_rcu())
decreases a lot (from ~150ms to ~10ms) if I queue an empty kworker
periodically, as shown in the diff below. Before the diff is applied,
the benchmark process does the following things on a VM with 8 CPUs:

1) create a pthread htab_mem_producer on each CPU and pin the thread
to that specific CPU
2) htab_mem_producer calls syscall(__NR_getpgid) repeatedly in a
tight loop
3) each getpgid() call triggers the invocation of a bpf program
attached to the getpgid() syscall
4) the bpf program overwrites 2048 elements in a bpf hash map
5) during the overwrite, it first frees the existing element
6) the free calls unit_free(), which raises an irq-work in batches
after 96 elements have been freed
7) in the irq_work, it allocates a new struct to hold the freed
elements and an rcu_head, and calls call_rcu(..., reuse_rcu)
8) in reuse_rcu() it just moves these freed elements onto a per-cpu
reuse list
9) after the free completes, the overwrite allocates a new element
10) the allocation may also raise an irq-work in batches after the
preallocated elements are exhausted
11) in that irq-work, it tries to fetch elements from the per-cpu
reuse list, and if the list is empty, it falls back to kmalloc()
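
Steps 6)-8) could be sketched roughly as follows (field and function
names are illustrative, not the actual patch):

```c
/* Each irq-work batch gets its own rcu_head, so multiple RCU grace
 * periods can be in flight per CPU instead of one pending callback.
 */
struct reuse_batch {
	struct rcu_head rcu;
	struct llist_node *head;	/* freed elements in this batch */
	struct bpf_mem_cache *c;
};

static void reuse_rcu(struct rcu_head *rcu)
{
	struct reuse_batch *b = container_of(rcu, struct reuse_batch, rcu);

	/* a grace period has elapsed: elements are safe to reuse */
	move_to_percpu_reuse_list(b->c, b->head);
	kfree(b);
}
```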

For the procedure described above, the latency between the call to
call_rcu() and the invocation of reuse_rcu() is about ~150ms or larger.
I have also checked the latency of syscall(__NR_getpgid) and it is
always less than 1ms. But after queueing an empty kworker in step 6),
the callback latency decreases from ~150ms to ~10ms, and I suspect that
is because the RCU grace period shortens a lot, but I don't know how to
debug that (e.g., to debug why the RCU grace period is so long), so I
hope to get some help.

htab-mem-benchmark (reuse-after-RCU-GP + per-cpu reusable list + multiple reuse_rcu() callbacks + queue_empty_work):
| name               | loop (k/s)| average memory (MiB)| peak memory (MiB)|
| --                 | --        | --                  | --               |
| overwrite          | 13.85     | 17.89               | 21.49            |
| batch_add_batch_del| 10.22     | 16.65               | 19.07            |
| add_del_on_diff_cpu| 3.82      | 21.36               | 33.05            |


+static void bpf_ma_prepare_reuse_work(struct work_struct *work)
+{
+       udelay(100);
+}
+
 /* When size != 0 bpf_mem_cache for each cpu.
  * This is typical bpf hash map use case when all elements have equal size.
  *
@@ -547,6 +559,7 @@ int bpf_mem_alloc_init(struct bpf_mem_alloc *ma, int size, bool percpu)
                        c->cpu = cpu;
                        c->objcg = objcg;
                        c->percpu_size = percpu_size;
+                       INIT_WORK(&c->reuse_work, bpf_ma_prepare_reuse_work);
                        raw_spin_lock_init(&c->lock);
                        c->reuse.percpu = percpu;
                        c->reuse.cpu = cpu;
@@ -574,6 +587,7 @@ int bpf_mem_alloc_init(struct bpf_mem_alloc *ma, int size, bool percpu)
                        c->unit_size = sizes[i];
                        c->cpu = cpu;
                        c->objcg = objcg;
+                       INIT_WORK(&c->reuse_work, bpf_ma_prepare_reuse_work);
                        raw_spin_lock_init(&c->lock);
                        c->reuse.percpu = percpu;
                        c->reuse.lock = &c->lock;
@@ -793,6 +807,8 @@ static void notrace unit_free(struct bpf_mem_cache *c, void *ptr)
                        c->prepare_reuse_tail = llnode;
                __llist_add(llnode, &c->prepare_reuse_head);
                cnt = ++c->prepare_reuse_cnt;
+               if (cnt > c->high_watermark && !work_pending(&c->reuse_work))
+                       queue_work(bpf_ma_wq, &c->reuse_work);
        } else {
                /* unit_free() cannot fail. Therefore add an object to atomic
                 * llist. reuse_bulk() will drain it. Though free_llist_extra is
@@ -901,3 +917,11 @@ void notrace *bpf_mem_cache_alloc_flags(struct bpf_mem_alloc *ma, gfp_t flags)

        return !ret ? NULL : ret + LLIST_NODE_SZ;
 }
+
+static int __init bpf_ma_init(void)
+{
+       bpf_ma_wq = alloc_workqueue("bpf_ma", WQ_MEM_RECLAIM, 0);
+       BUG_ON(!bpf_ma_wq);
+       return 0;
+}
+late_initcall(bpf_ma_init);



> Could you point me to a code in RCU where it's doing callback batching?
Hou Tao June 8, 2023, 3:43 a.m. UTC | #19
Hi Paul,

On 6/8/2023 10:55 AM, Paul E. McKenney wrote:
> On Thu, Jun 08, 2023 at 09:51:27AM +0800, Hou Tao wrote:
>> Hi,
>>
>> On 6/8/2023 4:50 AM, Alexei Starovoitov wrote:
>>> On Wed, Jun 7, 2023 at 10:52 AM Alexei Starovoitov
>>> <alexei.starovoitov@gmail.com> wrote:
>>>> On Wed, Jun 07, 2023 at 04:42:11PM +0800, Hou Tao wrote:
>>>>> As said in the commit message, the command line for test is
>>>>> "./map_perf_test 4 8 16384", because the default max_entries is 1000. If
>>>>> using default max_entries and the number of CPUs is greater than 15,
>>>>> use_percpu_counter will be false.
>>>> Right. percpu or not depends on number of cpus.
SNIP
>>>>  known, because I had just proposed it in the email yesterday.
>>> Also noticed that the overhead of shared reuse_ready list
>>> comes both from the contended lock and from cache misses
>>> when one cpu pushes to the list after RCU GP and another
>>> cpu removes.
>>>
>>> Also low/batch/high watermark are all wrong in patch 3.
>>> low=32 and high=96 makes no sense when it's not a single list.
>>> I'm experimenting with 32 for all three heuristics.
>>>
>>> Another thing I noticed that per-cpu prepare_reuse and free_by_rcu
>>> are redundant.
>>> unit_free() can push into free_by_rcu directly
>>> then reuse_bulk() can fill it up with free_llist_extra and
>>> move them into waiting_for_gp.
>> Yes. Indeed missing that.
>>> All these _tail optimizations are obscuring the code and make it hard
>>> to notice these issues.
>>>
>>>> For now I still prefer to see v5 with per-bpf-ma and no _tail optimization.
>>>>
>>>> Answering your other email:
>>>>
>>>>> I see your point. I will continue to debug the memory usage difference
>>>>> between v3 and v4.
>>>> imo it's a waste of time to continue analyzing performance based on bench in patch 2.
>> Don't agree with that. I still think the potential memory usage of v4 is
>> a problem and the difference memory usage between v3 and v4 reveals that
>> there is some peculiarity in RCU subsystem, because the difference comes
>> from the duration of RCU grace period. We need to find out why and fix
>> or workaround that.
> A tight loop in the kernel can extend RCU grace periods, especially
> for kernels built with CONFIG_PREEMPTION=n.  Placing things like
> cond_resched() in such loops can help.  Of course, if you are in an
> RCU read-side critical section (or holding a spinlock), you will need
> to exit, cond_resched(), then re-enter.  Taking care to ensure that the
> state upon re-entry is valid.  After all, having exited either type of
> critical section, anything might happen.

As said in the help-wanted email just sent out, it was weird that the
RCU grace period was extended a lot, up to ~150ms or more, but
periodically queueing a dummy kworker which does nothing helps reduce
the grace period to ~10ms. I have explained the context of the problem
in that email and hope that we could get some help from you and the RCU
experts in the community.

Regards,
Tao
>
> 							Thanx, Paul
>
>>>>> I don't think so. Let's considering the per-cpu list first. Assume the
>>>>> normal RCU grace period is about 30ms and we are tracing the IO latency
>>>>> of a normal SSD. The iops is about 176K per seconds, so before one RCU
>>>>> GP is passed, we will need to allocate about 176 * 30 = 5.2K elements.
>>>>> For the per-ma list, when the number of CPUs increased, it is easy to
>>>>> make the list contain thousands of elements.
>>>> That would be true only if there were no scheduling events in all of 176K ops.
>>>> Which is not the case.
>>>> I'm not sure why you're saying that RCU GP is 30ms.
>> Because these freed elements will be freed after one RCU GP and in v4
>> there is only one RCU callback per-cpu, so before one RCU GP is expired,
>> these freed elements will be accumulated on the list.
>>>> In CONFIG_PREEMPT_NONE rcu_read_lock/unlock are true nops.
>>>> Every sched event is sort-of implicit rcu_read_lock/unlock.
>>>> Network and block IO doesn't process 176K packets without resched.
>>>> Don't know how block does it, but in networking NAPI will process 64 packets and will yield softirq.
>>>>
>>>> For small size buckets low_watermark=32 and high=96.
>>>> We typically move 32 elements at a time from one list to another.
>>>> A bunch of elements maybe sitting in free_by_rcu and moving them to waiting_for_gp
>>>> is not instant, but once __free_rcu_tasks_trace is called we need to take
>>>> elements from waiting_for_gp one at a time and kfree it one at a time.
>>>> So optimizing the move from free_by_rcu into waiting_for_gp is not worth the code complexity.
>>>>
>>>>> Before I post v5, I want to know the reason why per-bpf-ma list is
>>>>> introduced. Previously, I though it was used to handle the case in which
>>>>> allocation and freeing are done on different CPUs.
>>>> Correct. per-bpf-ma list is necessary to avoid OOM-ing due to slow rcu_tasks_trace GP.
>>>>
>>>>> And as we can see
>>>>> from the benchmark result above and in v3, the performance and the
>>>>> memory usage of v4 for add_del_on_diff_cpu is better than v3.
>>>> bench from patch 2 is invalid. Hence no conclusion can be made.
>>>>
>>>> So far the only bench we can trust and analyze is map_perf_test.
>>>> Please make bench in patch 2 yield the cpu after few updates.
>>>> Earlier I suggested to stick to 10, but since NAPI can do 64 at a time.
>>>> 64 updates is realistic too. A thousand is not.
>> Will do that.
>>
Hou Tao June 8, 2023, 4:30 a.m. UTC | #20
Hi,

On 6/8/2023 11:35 AM, Hou Tao wrote:
> Hi,
>
> On 6/8/2023 8:34 AM, Alexei Starovoitov wrote:
>> On Wed, Jun 7, 2023 at 5:13 PM Paul E. McKenney <paulmck@kernel.org> wrote:
>>> On Wed, Jun 07, 2023 at 04:50:35PM -0700, Alexei Starovoitov wrote:
>>>> On Wed, Jun 7, 2023 at 4:30 PM Paul E. McKenney <paulmck@kernel.org> wrote:
>
SNIP
>
> By comparing the implementation of v3 and v4, I just find one hack which
> could reduce the memory usage of v4 (with per-cpu reusabe list)
> significantly and memory usage will be similar between v3 and v4. If we
> queue a empty work before calling irq_work_raise() as shown below, the
> calling latency of reuse_rcu (a normal RCU callback) will decreased from
> ~150ms to ~10 ms. I think the reason is that the duration of normal RCU
> grace period is decreased a lot, but I don't know why did it happen.
> Hope to get some help from Paul. Because Paul doesn't have enough
> context, so I will try to explain the context of the weird problem
> below. And Alexei, could you please also try the hack below for your
> multiple-rcu-cbs version ?
An update on the queue_work() hack. It works for both CONFIG_PREEMPT=y
and CONFIG_PREEMPT=n cases. I will try to enable RCU trace to check
whether or not there is any difference.
>
> Hi Paul,
>
> I just found out the time between the calling of call_rcu(..,
> reuse_rcub) and the calling of RCU callback (namely resue_cb()) will
> decrease a lot (from ~150ms to ~10ms) if I queued a empty kworker
> periodically as shown in the diff below. Before the diff below applied,
> the benchmark process will do the following things on a VM with 8 CPUs:
>
> 1) create a pthread htab_mem_producer on each CPU and pinned the thread
> on the specific CPU
> 2) htab_mem_producer will call syscall(__NR_getpgid) repeatedly in a
> dead-loop
> 3) the calling of getpgid() will trigger the invocation of a bpf program
> attached on getpgid() syscall
> 4) the bpf program will overwrite 2048 elements in a bpf hash map
> 5) during the overwrite, it will free the existed element firstly
> 6) the free will call unit_free(), unit_free() will trigger a irq-work
> batchly after 96-element were freed
> 7) in the irq_work, it will allocate a new struct to save the freed
> elements and the rcu_head and do call_rcu(..., reuse_rcu)
> 8) in reuse_rcu() it just moves these freed elements into a per-cpu
> reuse list
> 9) After the free completes, the overwrite will allocate a new element
> 10) the allocation may also trigger a irq-work batchly after the
> preallocated elements were exhausted
> 11) in the irq-work, it will try to fetch elements from per-cpu reuse
> list and if it is empty, it will do kmalloc()
>
> For the procedure describe above, the calling latency between the call
> of call_rcu() and the call of reuse_rcu is about ~150ms or larger. I
> have also checked the calling latency of syscall(__NR_getpgid) and all
> latency is less than 1ms. But after do queueing a empty kwork in step
> 6), the calling latency will decreased from ~150ms to ~10ms and I
> suspect that is because the RCU grace period is decreased a lot, but I
> don't know how to debug that (e.g., to debug why the RCU grace period is
> so long), so I hope to get some help.
>
> htab-mem-benchmark (reuse-after-RCU-GP + per-cpu reusable list + multiple reuse_rcu() callbacks + queue_empty_work):
> | name               | loop (k/s)| average memory (MiB)| peak memory (MiB)|
> | --                 | --        | --                  | --               |
> | overwrite          | 13.85     | 17.89               | 21.49            |
> | batch_add_batch_del| 10.22     | 16.65               | 19.07            |
> | add_del_on_diff_cpu| 3.82      | 21.36               | 33.05            |
>
>
> +static void bpf_ma_prepare_reuse_work(struct work_struct *work)
> +{
> +       udelay(100);
> +}
> +
>  /* When size != 0 bpf_mem_cache for each cpu.
>   * This is typical bpf hash map use case when all elements have equal size.
>   *
> @@ -547,6 +559,7 @@ int bpf_mem_alloc_init(struct bpf_mem_alloc *ma, int
> size, bool percpu)
>                         c->cpu = cpu;
>                         c->objcg = objcg;
>                         c->percpu_size = percpu_size;
> +                       INIT_WORK(&c->reuse_work,
> bpf_ma_prepare_reuse_work);
>                         raw_spin_lock_init(&c->lock);
>                         c->reuse.percpu = percpu;
>                         c->reuse.cpu = cpu;
> @@ -574,6 +587,7 @@ int bpf_mem_alloc_init(struct bpf_mem_alloc *ma, int
> size, bool percpu)
>                         c->unit_size = sizes[i];
>                         c->cpu = cpu
>                         c->objcg = objcg;
> +                       INIT_WORK(&c->reuse_work,
> bpf_ma_prepare_reuse_work);
>                         raw_spin_lock_init(&c->lock);
>                         c->reuse.percpu = percpu;
>                         c->reuse.lock = &c->lock;
> @@ -793,6 +807,8 @@ static void notrace unit_free(struct bpf_mem_cache
> *c, void *ptr)
>                         c->prepare_reuse_tail = llnode;
>                 __llist_add(llnode, &c->prepare_reuse_head);
>                 cnt = ++c->prepare_reuse_cnt;
> +               if (cnt > c->high_watermark &&
> !work_pending(&c->reuse_work))
> +                       queue_work(bpf_ma_wq, &c->reuse_work);
>         } else {
>                 /* unit_free() cannot fail. Therefore add an object to
> atomic
>                  * llist. reuse_bulk() will drain it. Though
> free_llist_extra is
> @@ -901,3 +917,11 @@ void notrace *bpf_mem_cache_alloc_flags(struct
> bpf_mem_alloc *ma, gfp_t flags)
>
>         return !ret ? NULL : ret + LLIST_NODE_SZ;
>  }
> +
> +static int __init bpf_ma_init(void)
> +{
> +       bpf_ma_wq = alloc_workqueue("bpf_ma", WQ_MEM_RECLAIM, 0);
> +       BUG_ON(!bpf_ma_wq);
> +       return 0;
> +}
> +late_initcall(bpf_ma_init);
>
>
>
>> Could you point me to a code in RCU where it's doing callback batching?
> .
Alexei Starovoitov June 8, 2023, 4:30 a.m. UTC | #21
On Wed, Jun 7, 2023 at 8:36 PM Hou Tao <houtao@huaweicloud.com> wrote:
>
> below. And Alexei, could you please also try the hack below for your
> multiple-rcu-cbs version ?

Before any further experiments... please fix the bench in patch 2.
Make sure it exits out of RCU CS after no more than 64 map_update_elem.
Just send that patch alone.
Paul E. McKenney June 8, 2023, 4:18 p.m. UTC | #22
On Thu, Jun 08, 2023 at 11:43:54AM +0800, Hou Tao wrote:
> Hi Paul,
> 
> On 6/8/2023 10:55 AM, Paul E. McKenney wrote:
> > On Thu, Jun 08, 2023 at 09:51:27AM +0800, Hou Tao wrote:
> >> Hi,
> >>
> >> On 6/8/2023 4:50 AM, Alexei Starovoitov wrote:
> >>> On Wed, Jun 7, 2023 at 10:52 AM Alexei Starovoitov
> >>> <alexei.starovoitov@gmail.com> wrote:
> >>>> On Wed, Jun 07, 2023 at 04:42:11PM +0800, Hou Tao wrote:
> >>>>> As said in the commit message, the command line for test is
> >>>>> "./map_perf_test 4 8 16384", because the default max_entries is 1000. If
> >>>>> using default max_entries and the number of CPUs is greater than 15,
> >>>>> use_percpu_counter will be false.
> >>>> Right. percpu or not depends on number of cpus.
> SNIP
> >>>>  known, because I had just proposed it in the email yesterday.
> >>> Also noticed that the overhead of shared reuse_ready list
> >>> comes both from the contended lock and from cache misses
> >>> when one cpu pushes to the list after RCU GP and another
> >>> cpu removes.
> >>>
> >>> Also low/batch/high watermark are all wrong in patch 3.
> >>> low=32 and high=96 makes no sense when it's not a single list.
> >>> I'm experimenting with 32 for all three heuristics.
> >>>
> >>> Another thing I noticed that per-cpu prepare_reuse and free_by_rcu
> >>> are redundant.
> >>> unit_free() can push into free_by_rcu directly
> >>> then reuse_bulk() can fill it up with free_llist_extra and
> >>> move them into waiting_for_gp.
> >> Yes. Indeed missing that.
> >>> All these _tail optimizations are obscuring the code and make it hard
> >>> to notice these issues.
> >>>
> >>>> For now I still prefer to see v5 with per-bpf-ma and no _tail optimization.
> >>>>
> >>>> Answering your other email:
> >>>>
> >>>>> I see your point. I will continue to debug the memory usage difference
> >>>>> between v3 and v4.
> >>>> imo it's a waste of time to continue analyzing performance based on bench in patch 2.
> >> Don't agree with that. I still think the potential memory usage of v4 is
> >> a problem and the difference memory usage between v3 and v4 reveals that
> >> there is some peculiarity in RCU subsystem, because the difference comes
> >> from the duration of RCU grace period. We need to find out why and fix
> >> or workaround that.
> > A tight loop in the kernel can extend RCU grace periods, especially
> > for kernels built with CONFIG_PREEMPTION=n.  Placing things like
> > cond_resched() in such loops can help.  Of course, if you are in an
> > RCU read-side critical section (or holding a spinlock), you will need
> > to exit, cond_resched(), then re-enter.  Taking care to ensure that the
> > state upon re-entry is valid.  After all, having exited either type of
> > critical section, anything might happen.
> 
> As said in the help-wanted email just send out, it was weird that the
> RCU grace period was extended a lot, up to ~150ms or more. But queue a
> dummy kworker periodically which does nothing will help to reduce the
> grace period to ~10ms. I have explained the context of the problem in
> that email. And hope that we could get some help from you and the RCU
> experts in the community.

OK, I will bite...  Why do you think this is weird?

RCU depends on context switches, among other things.  If you have a
long-running in-kernel task, that will naturally extend the grace period.
Scheduling the empty worker provided the context switch that RCU needed
to make forward progress.

So 150 milliseconds is an OK RCU grace period.  A bit long, but not
excessively so.  In contrast, by default in mainline, RCU starts seriously
complaining if its grace period is extended beyond 21 *seconds*.  This is
when the RCU CPU stall warning will appear.  (Yes, some Android configs
are tuning this down to 20 milliseconds, but that is a special hardware
and software configuration.)

But if you want shorter RCU grace periods, what can you do?

1.	Follow Alexei's good advice on one of your early patches.

2.	As an alternative to scheduling an empty kworker, invoke something
	like rcu_momentary_dyntick_idle() periodically.  Note well that
	it is illegal to invoke this in an RCU read-side critical section,
	with preemption disabled, from idle, ...

3.	In non-preemptible kernels, cond_resched() is a much lighter
	weight alternative to rcu_momentary_dyntick_idle().  (Preemptible
	kernels get the same effect by preempting.  In your case, this
	is also true, but it takes 150 milliseconds.)

That should do for a start.  ;-)
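
Options 2) and 3) above might look roughly like this inside a
long-running kernel loop (purely illustrative; more_work() and
do_one_item() are hypothetical):

```c
/* Report a quiescent state from a long-running code path.  Neither
 * call is legal inside an RCU read-side critical section or with
 * preemption disabled.
 */
static void long_running_loop(void)
{
	while (more_work()) {
		do_one_item();
		/* Option 3: cheap on non-preemptible kernels */
		cond_resched();
		/* Option 2 (alternative): heavier, but usable where
		 * cond_resched() would be a no-op:
		 * rcu_momentary_dyntick_idle();
		 */
	}
}
```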

							Thanx, Paul

> Regards,
> Tao
> >
> > 							Thanx, Paul
> >
> >>>>> I don't think so. Let's considering the per-cpu list first. Assume the
> >>>>> normal RCU grace period is about 30ms and we are tracing the IO latency
> >>>>> of a normal SSD. The iops is about 176K per seconds, so before one RCU
> >>>>> GP is passed, we will need to allocate about 176 * 30 = 5.2K elements.
> >>>>> For the per-ma list, when the number of CPUs increased, it is easy to
> >>>>> make the list contain thousands of elements.
> >>>> That would be true only if there were no scheduling events in all of 176K ops.
> >>>> Which is not the case.
> >>>> I'm not sure why you're saying that RCU GP is 30ms.
> >> Because these freed elements will be freed after one RCU GP and in v4
> >> there is only one RCU callback per-cpu, so before one RCU GP is expired,
> >> these freed elements will be accumulated on the list.
> >>>> In CONFIG_PREEMPT_NONE rcu_read_lock/unlock are true nops.
> >>>> Every sched event is sort-of implicit rcu_read_lock/unlock.
> >>>> Network and block IO doesn't process 176K packets without resched.
> >>>> Don't know how block does it, but in networking NAPI will process 64 packets and will yield softirq.
> >>>>
> >>>> For small size buckets low_watermark=32 and high=96.
> >>>> We typically move 32 elements at a time from one list to another.
> >>>> A bunch of elements maybe sitting in free_by_rcu and moving them to waiting_for_gp
> >>>> is not instant, but once __free_rcu_tasks_trace is called we need to take
> >>>> elements from waiting_for_gp one at a time and kfree it one at a time.
> >>>> So optimizing the move from free_by_rcu into waiting_for_gp is not worth the code complexity.
> >>>>
> >>>>> Before I post v5, I want to know the reason why per-bpf-ma list is
> >>>>> introduced. Previously, I though it was used to handle the case in which
> >>>>> allocation and freeing are done on different CPUs.
> >>>> Correct. per-bpf-ma list is necessary to avoid OOM-ing due to slow rcu_tasks_trace GP.
> >>>>
> >>>>> And as we can see
> >>>>> from the benchmark result above and in v3, the performance and the
> >>>>> memory usage of v4 for add_del_on_diff_cpu is better than v3.
> >>>> bench from patch 2 is invalid. Hence no conclusion can be made.
> >>>>
> >>>> So far the only bench we can trust and analyze is map_perf_test.
> >>>> Please make bench in patch 2 yield the cpu after few updates.
> >>>> Earlier I suggested to stick to 10, but since NAPI can do 64 at a time.
> >>>> 64 updates is realistic too. A thousand is not.
> >> Will do that.
> >>
>
Hou Tao June 9, 2023, 3:02 a.m. UTC | #23
Hi Paul,

On 6/9/2023 12:18 AM, Paul E. McKenney wrote:
> On Thu, Jun 08, 2023 at 11:43:54AM +0800, Hou Tao wrote:
>> Hi Paul,
>>
>> On 6/8/2023 10:55 AM, Paul E. McKenney wrote:
>>> On Thu, Jun 08, 2023 at 09:51:27AM +0800, Hou Tao wrote:
>>>> Hi,
>>>>
>>>> On 6/8/2023 4:50 AM, Alexei Starovoitov wrote:
>>>>> On Wed, Jun 7, 2023 at 10:52 AM Alexei Starovoitov
>>>>> <alexei.starovoitov@gmail.com> wrote:
>>>>>> On Wed, Jun 07, 2023 at 04:42:11PM +0800, Hou Tao wrote:
>>>>>>> As said in the commit message, the command line for test is
>>>>>>> "./map_perf_test 4 8 16384", because the default max_entries is 1000. If
>>>>>>> using default max_entries and the number of CPUs is greater than 15,
>>>>>>> use_percpu_counter will be false.
>>>>>> Right. percpu or not depends on number of cpus.
>> SNIP
>>>>>>  known, because I had just proposed it in the email yesterday.
>>>>> Also noticed that the overhead of shared reuse_ready list
>>>>> comes both from the contended lock and from cache misses
>>>>> when one cpu pushes to the list after RCU GP and another
>>>>> cpu removes.
>>>>>
>>>>> Also low/batch/high watermark are all wrong in patch 3.
>>>>> low=32 and high=96 makes no sense when it's not a single list.
>>>>> I'm experimenting with 32 for all three heuristics.
>>>>>
>>>>> Another thing I noticed that per-cpu prepare_reuse and free_by_rcu
>>>>> are redundant.
>>>>> unit_free() can push into free_by_rcu directly
>>>>> then reuse_bulk() can fill it up with free_llist_extra and
>>>>> move them into waiting_for_gp.
>>>> Yes. Indeed missing that.
>>>>> All these _tail optimizations are obscuring the code and make it hard
>>>>> to notice these issues.
>>>>>
>>>>>> For now I still prefer to see v5 with per-bpf-ma and no _tail optimization.
>>>>>>
>>>>>> Answering your other email:
>>>>>>
>>>>>>> I see your point. I will continue to debug the memory usage difference
>>>>>>> between v3 and v4.
>>>>>> imo it's a waste of time to continue analyzing performance based on bench in patch 2.
>>>> Don't agree with that. I still think the potential memory usage of v4 is
>>>> a problem and the difference memory usage between v3 and v4 reveals that
>>>> there is some peculiarity in RCU subsystem, because the difference comes
>>>> from the duration of RCU grace period. We need to find out why and fix
>>>> or workaround that.
>>> A tight loop in the kernel can extend RCU grace periods, especially
>>> for kernels built with CONFIG_PREEMPTION=n.  Placing things like
>>> cond_resched() in such loops can help.  Of course, if you are in an
>>> RCU read-side critical section (or holding a spinlock), you will need
>>> to exit, cond_resched(), then re-enter.  Taking care to ensure that the
>>> state upon re-entry is valid.  After all, having exited either type of
>>> critical section, anything might happen.
>> As said in the help-wanted email just sent out, it was weird that the
>> RCU grace period was extended a lot, up to ~150ms or more, but queueing a
>> dummy kworker periodically which does nothing helps to reduce the
>> grace period to ~10ms. I have explained the context of the problem in
>> that email, and I hope that we can get some help from you and the RCU
>> experts in the community.
> OK, I will bite...  Why do you think this is weird?
>
> RCU depends on context switches, among other things.  If you have a
> long-running in-kernel task, that will naturally extend the grace period.
> Scheduling the empty worker provided the context switch that RCU needed
> to make forward progress.

Because the empty kworker trick also works for CONFIG_PREEMPT_VOLUNTARY=y,
and there is neither an implicit nor an explicit call of schedule() in the
kernel space of the producer thread when CONFIG_PREEMPT_VOLUNTARY=y.
Also, I don't know how to define a long-running in-kernel task,
because I have checked the latency of the getpgid() syscall in the producer
thread when CONFIG_PREEMPT_VOLUNTARY=y. As shown in the histogram below,
there are indeed some outliers, but most latencies are less than 1ms,
so the producer thread will do a context switch within at most 1ms. But before
the empty kworker trick, the latency of the RCU callback was about 100ms or
more, and after the empty kworker trick, the latency of the RCU callback is
reduced to ~8ms.

@hist_us:
[128, 256)            60
|                                                    |
[256, 512)        101364
|@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[512, 1K)          17580
|@@@@@@@@@                                           |
[1K, 2K)              16
|                                                    |

@stat_us: count 119020, average 436, total 51957277
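
For reference, the kernel-time fraction implied by this histogram can be
estimated with a back-of-the-envelope calculation. This is only a sketch:
the bucket midpoints and the ~5us of userspace time between syscalls are
assumptions, not measurements.

```python
# Estimate the average getpgid() latency from the @hist_us bucket
# midpoints, and the fraction of time the producer CPU spends in
# kernel space, assuming a tight syscall loop with ~5us of userspace
# work per iteration (an assumed value, not measured).
buckets = [           # (low_us, high_us, count) from @hist_us above
    (128, 256, 60),
    (256, 512, 101364),
    (512, 1024, 17580),
    (1024, 2048, 16),
]
count = sum(n for _, _, n in buckets)
avg_us = sum((lo + hi) / 2 * n for lo, hi, n in buckets) / count
user_gap_us = 5.0     # assumed userspace time between syscalls
kernel_frac = avg_us / (avg_us + user_gap_us)
print(f"avg ~{avg_us:.0f} us, kernel-time fraction ~{kernel_frac:.1%}")
```

The midpoint estimate (~441us) is close to the reported average of 436us,
and under these assumptions the CPU spends roughly 99% of its time in
kernel space.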


>
> So 150 milliseconds is an OK RCU grace period.  A bit long, but not
> excessively so.  In contrast, by default in mainline, RCU starts seriously
> complaining if its grace period is extended beyond 21 *seconds*.  This is
> when the RCU CPU stall warning will appear.  (Yes, some Android configs
> are tuning this down to 20 milliseconds, but that is a special hardware
> and software configuration.)
>
> But if you want shorter RCU grace periods, what can you do?
>
> 1.	Follow Alexei's good advice on one of your early patches.
>
> 2.	As an alternative to scheduling an empty kworker, invoke something
> 	like rcu_momentary_dyntick_idle() periodically.  Note well that
> 	it is illegal to invoke this in an RCU read-side critical section,
> 	with preemption disabled, from idle, ...
>
> 3.	In non-preemptible kernels, cond_resched() is a much lighter
> 	weight alternative to rcu_momentary_dyntick_idle().  (Preemptible
> 	kernels get the same effect by preempting.  In your case, this
> 	is also true, but it takes 150 milliseconds.)
>
> That should do for a start.  ;-)
>
> 							Thanx, Paul
>
>> Regards,
>> Tao
>>> 							Thanx, Paul
>>>
>>>>>>> I don't think so. Let's considering the per-cpu list first. Assume the
>>>>>>> normal RCU grace period is about 30ms and we are tracing the IO latency
>>>>>>> of a normal SSD. The iops is about 176K per seconds, so before one RCU
>>>>>>> GP is passed, we will need to allocate about 176 * 30 = 5.2K elements.
>>>>>>> For the per-ma list, when the number of CPUs increased, it is easy to
>>>>>>> make the list contain thousands of elements.
>>>>>> That would be true only if there were no scheduling events in all of 176K ops.
>>>>>> Which is not the case.
>>>>>> I'm not sure why you're saying that RCU GP is 30ms.
>>>> Because these freed elements will be freed after one RCU GP and in v4
>>>> there is only one RCU callback per-cpu, so before one RCU GP is expired,
>>>> these freed elements will be accumulated on the list.
>>>>>> In CONFIG_PREEMPT_NONE rcu_read_lock/unlock are true nops.
>>>>>> Every sched event is sort-of implicit rcu_read_lock/unlock.
>>>>>> Network and block IO doesn't process 176K packets without resched.
>>>>>> Don't know how block does it, but in networking NAPI will process 64 packets and will yield softirq.
>>>>>>
>>>>>> For small size buckets low_watermark=32 and high=96.
>>>>>> We typically move 32 elements at a time from one list to another.
>>>>>> A bunch of elements maybe sitting in free_by_rcu and moving them to waiting_for_gp
>>>>>> is not instant, but once __free_rcu_tasks_trace is called we need to take
>>>>>> elements from waiting_for_gp one at a time and kfree it one at a time.
>>>>>> So optimizing the move from free_by_rcu into waiting_for_gp is not worth the code complexity.
>>>>>>
>>>>>>> Before I post v5, I want to know the reason why per-bpf-ma list is
>>>>>> introduced. Previously, I thought it was used to handle the case in which
>>>>>>> allocation and freeing are done on different CPUs.
>>>>>> Correct. per-bpf-ma list is necessary to avoid OOM-ing due to slow rcu_tasks_trace GP.
>>>>>>
>>>>>>> And as we can see
>>>>>>> from the benchmark result above and in v3, the performance and the
>>>>>>> memory usage of v4 for add_del_on_diff_cpu is better than v3.
>>>>>> bench from patch 2 is invalid. Hence no conclusion can be made.
>>>>>>
>>>>>> So far the only bench we can trust and analyze is map_perf_test.
>>>>>> Please make bench in patch 2 yield the cpu after few updates.
>>>>>> Earlier I suggested to stick to 10, but since NAPI can do 64 at a time.
>>>>>> 64 updates is realistic too. A thousand is not.
>>>> Will do that.
>>>>
> .
Paul E. McKenney June 9, 2023, 2:30 p.m. UTC | #24
On Fri, Jun 09, 2023 at 11:02:59AM +0800, Hou Tao wrote:
> Hi Paul,
> 
> On 6/9/2023 12:18 AM, Paul E. McKenney wrote:
> > On Thu, Jun 08, 2023 at 11:43:54AM +0800, Hou Tao wrote:
> >> Hi Paul,
> >>
> >> On 6/8/2023 10:55 AM, Paul E. McKenney wrote:
> >>> On Thu, Jun 08, 2023 at 09:51:27AM +0800, Hou Tao wrote:
> >>>> Hi,
> >>>>
> >>>> On 6/8/2023 4:50 AM, Alexei Starovoitov wrote:
> >>>>> On Wed, Jun 7, 2023 at 10:52 AM Alexei Starovoitov
> >>>>> <alexei.starovoitov@gmail.com> wrote:
> >>>>>> On Wed, Jun 07, 2023 at 04:42:11PM +0800, Hou Tao wrote:
> >>>>>>> As said in the commit message, the command line for test is
> >>>>>>> "./map_perf_test 4 8 16384", because the default max_entries is 1000. If
> >>>>>>> using default max_entries and the number of CPUs is greater than 15,
> >>>>>>> use_percpu_counter will be false.
> >>>>>> Right. percpu or not depends on number of cpus.
> >> SNIP
> >>>>>>  known, because I had just proposed it in the email yesterday.
> >>>>> Also noticed that the overhead of shared reuse_ready list
> >>>>> comes both from the contended lock and from cache misses
> >>>>> when one cpu pushes to the list after RCU GP and another
> >>>>> cpu removes.
> >>>>>
> >>>>> Also low/batch/high watermark are all wrong in patch 3.
> >>>>> low=32 and high=96 makes no sense when it's not a single list.
> >>>>> I'm experimenting with 32 for all three heuristics.
> >>>>>
> >>>>> Another thing I noticed that per-cpu prepare_reuse and free_by_rcu
> >>>>> are redundant.
> >>>>> unit_free() can push into free_by_rcu directly
> >>>>> then reuse_bulk() can fill it up with free_llist_extra and
> >>>>> move them into waiting_for_gp.
> >>>> Yes. Indeed missing that.
> >>>>> All these _tail optimizations are obscuring the code and make it hard
> >>>>> to notice these issues.
> >>>>>
> >>>>>> For now I still prefer to see v5 with per-bpf-ma and no _tail optimization.
> >>>>>>
> >>>>>> Answering your other email:
> >>>>>>
> >>>>>>> I see your point. I will continue to debug the memory usage difference
> >>>>>>> between v3 and v4.
> >>>>>> imo it's a waste of time to continue analyzing performance based on bench in patch 2.
> >>>> Don't agree with that. I still think the potential memory usage of v4 is
> >>>> a problem and the difference memory usage between v3 and v4 reveals that
> >>>> there is some peculiarity in RCU subsystem, because the difference comes
> >>>> from the duration of RCU grace period. We need to find out why and fix
> >>>> or workaround that.
> >>> A tight loop in the kernel can extend RCU grace periods, especially
> >>> for kernels built with CONFIG_PREEMPTION=n.  Placing things like
> >>> cond_resched() in such loops can help.  Of course, if you are in an
> >>> RCU read-side critical section (or holding a spinlock), you will need
> >>> to exit, cond_resched(), then re-enter.  Taking care to ensure that the
> >>> state upon re-entry is valid.  After all, having exited either type of
> >>> critical section, anything might happen.
> >> As said in the help-wanted email just send out, it was weird that the
> >> RCU grace period was extended a lot, up to ~150ms or more. But queue a
> >> dummy kworker periodically which does nothing will help to reduce the
> >> grace period to ~10ms. I have explained the context of the problem in
> >> that email. And hope that we could get some help from you and the RCU
> >> experts in the community.
> > OK, I will bite...  Why do you think this is weird?
> >
> > RCU depends on context switches, among other things.  If you have a
> > long-running in-kernel task, that will naturally extend the grace period.
> > Scheduling the empty worker provided the context switch that RCU needed
> > to make forward progress.
> 
> Because the empty kworker trick also works for CONFIG_PREEMPT_VOLUNTARY=y,
> and there is neither an implicit nor an explicit call of schedule() in the
> kernel space of the producer thread when CONFIG_PREEMPT_VOLUNTARY=y.
> Also, I don't know how to define a long-running in-kernel task,
> because I have checked the latency of the getpgid() syscall in the producer
> thread when CONFIG_PREEMPT_VOLUNTARY=y. As shown in the histogram below,
> there are indeed some outliers, but most latencies are less than 1ms,
> so the producer thread will do a context switch within at most 1ms. But before
> the empty kworker trick, the latency of the RCU callback was about 100ms or
> more, and after the empty kworker trick, the latency of the RCU callback is
> reduced to ~8ms.
> 
> @hist_us:
> [128, 256)            60
> |                                                    |
> [256, 512)        101364
> |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
> [512, 1K)          17580
> |@@@@@@@@@                                           |
> [1K, 2K)              16
> |                                                    |
> 
> @stat_us: count 119020, average 436, total 51957277

Ah!

The trick is that if you have a non-nohz_full CPU spending most of its
time in kernel space (in your case, due to repeated system calls), with
high probability this CPU will appear to RCU as if it were spending all
of its time in kernel space.  This probability will increase with the
fraction of the time this CPU spends in kernel space.

If I am reading your histogram correctly, the overhead of your system
call is usually between 256 and 512 microseconds.  If you are invoking
that system call in a tight loop, that CPU could easily be spending
way more than 99% of its time in the kernel.  Unless you have similar
statistics for the average length of your CPU's times in userspace,
let's assume exactly 99%.

On a non-nohz_full CPU, RCU's only hint that a CPU is executing in
userspace occurs when the once-per-jiffy scheduling-clock interrupt
is taken from userspace.  If the CPU is spending 99% of its time in
the kernel, we can expect only one in 100 interrupts being taken from
userspace.  On a HZ=1000 system, that will give you 100 milliseconds
right there.

But it gets worse.  The variance is quite large.  For example, for a
given grace period, there is about a 5% chance that the delay will be
300 milliseconds instead of the expected 100 milliseconds.  (You can
easily check this by taking 0.99 to the 300th power.  Or whatever other
power you desire.)
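
To make the arithmetic concrete, here is a small sketch of the model
(assuming exactly 99% kernel time, HZ=1000, and independence between
jiffies, as in the paragraphs above):

```python
# Model: each scheduling-clock interrupt lands in userspace with
# probability p = 0.01, independently.  RCU then sees a userspace
# (quiescent) tick on average once per 1/p jiffies, i.e. 100 ms at
# HZ=1000, with a heavy tail.
p = 0.01                        # fraction of ticks taken from userspace
expected_ms = 1 / p             # mean wait for a userspace tick (HZ=1000)
p_over_300ms = (1 - p) ** 300   # all of 300 consecutive ticks in-kernel
print(f"expected ~{expected_ms:.0f} ms, "
      f"P(delay > 300 ms) ~ {p_over_300ms:.1%}")
```

This reproduces the numbers here: an expected ~100 milliseconds, with
roughly a 5% chance that a given grace period is delayed beyond 300
milliseconds.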

So the problem is that simple statistics is biting pretty hard here.

As you may have noticed, life is like that sometimes.  ;-)

							Thanx, Paul

> > So 150 milliseconds is an OK RCU grace period.  A bit long, but not
> > excessively so.  In contrast, by default in mainline, RCU starts seriously
> > complaining if its grace period is extended beyond 21 *seconds*.  This is
> > when the RCU CPU stall warning will appear.  (Yes, some Android configs
> > are tuning this down to 20 milliseconds, but that is a special hardware
> > and software configuration.)
> >
> > But if you want shorter RCU grace periods, what can you do?
> >
> > 1.	Follow Alexei's good advice on one of your early patches.
> >
> > 2.	As an alternative to scheduling an empty kworker, invoke something
> > 	like rcu_momentary_dyntick_idle() periodically.  Note well that
> > 	it is illegal to invoke this in an RCU read-side critical section,
> > 	with preemption disabled, from idle, ...
> >
> > 3.	In non-preemptible kernels, cond_resched() is a much lighter
> > 	weight alternative to rcu_momentary_dyntick_idle().  (Preemptible
> > 	kernels get the same effect by preempting.  In your case, this
> > 	is also true, but it takes 150 milliseconds.)
> >
> > That should do for a start.  ;-)
> >
> > 							Thanx, Paul
> >
> >> Regards,
> >> Tao
> >>> 							Thanx, Paul
> >>>
> >>>>>>> I don't think so. Let's considering the per-cpu list first. Assume the
> >>>>>>> normal RCU grace period is about 30ms and we are tracing the IO latency
> >>>>>>> of a normal SSD. The iops is about 176K per seconds, so before one RCU
> >>>>>>> GP is passed, we will need to allocate about 176 * 30 = 5.2K elements.
> >>>>>>> For the per-ma list, when the number of CPUs increased, it is easy to
> >>>>>>> make the list contain thousands of elements.
> >>>>>> That would be true only if there were no scheduling events in all of 176K ops.
> >>>>>> Which is not the case.
> >>>>>> I'm not sure why you're saying that RCU GP is 30ms.
> >>>> Because these freed elements will be freed after one RCU GP and in v4
> >>>> there is only one RCU callback per-cpu, so before one RCU GP is expired,
> >>>> these freed elements will be accumulated on the list.
> >>>>>> In CONFIG_PREEMPT_NONE rcu_read_lock/unlock are true nops.
> >>>>>> Every sched event is sort-of implicit rcu_read_lock/unlock.
> >>>>>> Network and block IO doesn't process 176K packets without resched.
> >>>>>> Don't know how block does it, but in networking NAPI will process 64 packets and will yield softirq.
> >>>>>>
> >>>>>> For small size buckets low_watermark=32 and high=96.
> >>>>>> We typically move 32 elements at a time from one list to another.
> >>>>>> A bunch of elements maybe sitting in free_by_rcu and moving them to waiting_for_gp
> >>>>>> is not instant, but once __free_rcu_tasks_trace is called we need to take
> >>>>>> elements from waiting_for_gp one at a time and kfree it one at a time.
> >>>>>> So optimizing the move from free_by_rcu into waiting_for_gp is not worth the code complexity.
> >>>>>>
> >>>>>>> Before I post v5, I want to know the reason why per-bpf-ma list is
> >>>>>> introduced. Previously, I thought it was used to handle the case in which
> >>>>>>> allocation and freeing are done on different CPUs.
> >>>>>> Correct. per-bpf-ma list is necessary to avoid OOM-ing due to slow rcu_tasks_trace GP.
> >>>>>>
> >>>>>>> And as we can see
> >>>>>>> from the benchmark result above and in v3, the performance and the
> >>>>>>> memory usage of v4 for add_del_on_diff_cpu is better than v3.
> >>>>>> bench from patch 2 is invalid. Hence no conclusion can be made.
> >>>>>>
> >>>>>> So far the only bench we can trust and analyze is map_perf_test.
> >>>>>> Please make bench in patch 2 yield the cpu after few updates.
> >>>>>> Earlier I suggested to stick to 10, but since NAPI can do 64 at a time.
> >>>>>> 64 updates is realistic too. A thousand is not.
> >>>> Will do that.
> >>>>
> > .
>
Hou Tao June 12, 2023, 2:03 a.m. UTC | #25
Hi Paul,

On 6/9/2023 10:30 PM, Paul E. McKenney wrote:
> On Fri, Jun 09, 2023 at 11:02:59AM +0800, Hou Tao wrote:
>> Hi Paul,
>>
>> On 6/9/2023 12:18 AM, Paul E. McKenney wrote:
>>> On Thu, Jun 08, 2023 at 11:43:54AM +0800, Hou Tao wrote:
>>>> Hi Paul,
>>>>
>>>> On 6/8/2023 10:55 AM, Paul E. McKenney wrote:
>>>>> On Thu, Jun 08, 2023 at 09:51:27AM +0800, Hou Tao wrote:
>>>>>> Hi,
>>>>>>
>>>>>> On 6/8/2023 4:50 AM, Alexei Starovoitov wrote:
>>>>>>> On Wed, Jun 7, 2023 at 10:52 AM Alexei Starovoitov
>>>>>>> <alexei.starovoitov@gmail.com> wrote:
>>>>>>>> On Wed, Jun 07, 2023 at 04:42:11PM +0800, Hou Tao wrote:
>>>>>>>>> As said in the commit message, the command line for test is
>>>>>>>>> "./map_perf_test 4 8 16384", because the default max_entries is 1000. If
>>>>>>>>> using default max_entries and the number of CPUs is greater than 15,
>>>>>>>>> use_percpu_counter will be false.
>>>>>>>> Right. percpu or not depends on number of cpus.
>>>> SNIP
>>>>>>>>  known, because I had just proposed it in the email yesterday.
>>>>>>> Also noticed that the overhead of shared reuse_ready list
>>>>>>> comes both from the contended lock and from cache misses
>>>>>>> when one cpu pushes to the list after RCU GP and another
>>>>>>> cpu removes.
>>>>>>>
>>>>>>> Also low/batch/high watermark are all wrong in patch 3.
>>>>>>> low=32 and high=96 makes no sense when it's not a single list.
>>>>>>> I'm experimenting with 32 for all three heuristics.
>>>>>>>
>>>>>>> Another thing I noticed that per-cpu prepare_reuse and free_by_rcu
>>>>>>> are redundant.
>>>>>>> unit_free() can push into free_by_rcu directly
>>>>>>> then reuse_bulk() can fill it up with free_llist_extra and
>>>>>>> move them into waiting_for_gp.
>>>>>> Yes. Indeed missing that.
>>>>>>> All these _tail optimizations are obscuring the code and make it hard
>>>>>>> to notice these issues.
>>>>>>>
>>>>>>>> For now I still prefer to see v5 with per-bpf-ma and no _tail optimization.
>>>>>>>>
>>>>>>>> Answering your other email:
>>>>>>>>
>>>>>>>>> I see your point. I will continue to debug the memory usage difference
>>>>>>>>> between v3 and v4.
>>>>>>>> imo it's a waste of time to continue analyzing performance based on bench in patch 2.
>>>>>> Don't agree with that. I still think the potential memory usage of v4 is
>>>>>> a problem and the difference memory usage between v3 and v4 reveals that
>>>>>> there is some peculiarity in RCU subsystem, because the difference comes
>>>>>> from the duration of RCU grace period. We need to find out why and fix
>>>>>> or workaround that.
>>>>> A tight loop in the kernel can extend RCU grace periods, especially
>>>>> for kernels built with CONFIG_PREEMPTION=n.  Placing things like
>>>>> cond_resched() in such loops can help.  Of course, if you are in an
>>>>> RCU read-side critical section (or holding a spinlock), you will need
>>>>> to exit, cond_resched(), then re-enter.  Taking care to ensure that the
>>>>> state upon re-entry is valid.  After all, having exited either type of
>>>>> critical section, anything might happen.
>>>> As said in the help-wanted email just send out, it was weird that the
>>>> RCU grace period was extended a lot, up to ~150ms or more. But queue a
>>>> dummy kworker periodically which does nothing will help to reduce the
>>>> grace period to ~10ms. I have explained the context of the problem in
>>>> that email. And hope that we could get some help from you and the RCU
>>>> experts in the community.
>>> OK, I will bite...  Why do you think this is weird?
>>>
>>> RCU depends on context switches, among other things.  If you have a
>>> long-running in-kernel task, that will naturally extend the grace period.
>>> Scheduling the empty worker provided the context switch that RCU needed
>>> to make forward progress.
>> Because the empty kworker trick also works for CONFIG_PREEMPT_VOLUNTARY=y,
>> and there is neither an implicit nor an explicit call of schedule() in the
>> kernel space of the producer thread when CONFIG_PREEMPT_VOLUNTARY=y.
>> Also, I don't know how to define a long-running in-kernel task,
>> because I have checked the latency of the getpgid() syscall in the producer
>> thread when CONFIG_PREEMPT_VOLUNTARY=y. As shown in the histogram below,
>> there are indeed some outliers, but most latencies are less than 1ms,
>> so the producer thread will do a context switch within at most 1ms. But before
>> the empty kworker trick, the latency of the RCU callback was about 100ms or
>> more, and after the empty kworker trick, the latency of the RCU callback is
>> reduced to ~8ms.
>>
>> @hist_us:
>> [128, 256)            60
>> |                                                    |
>> [256, 512)        101364
>> |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
>> [512, 1K)          17580
>> |@@@@@@@@@                                           |
>> [1K, 2K)              16
>> |                                                    |
>>
>> @stat_us: count 119020, average 436, total 51957277
> Ah!
>
> The trick is that if you have a non-nohz_full CPU spending most of its
> time in kernel space (in your case, due to repeated system calls), with
> high probability this CPU will appear to RCU as if it were spending all
> of its time in kernel space.  This probability will increase with the
> fraction of the time this CPU spends in kernel space.
>
> If I am reading your histogram correctly, the overhead of your system
> call is usually between 256 and 512 microseconds.  If you are invoking
> that system call in a tight loop, that CPU could easily be spending
> way more than 99% of its time in the kernel.  Unless you have similar
> statistics for the average length of your CPU's times in userspace,
> let's assume exactly 99%.
>
> On a non-nohz_full CPU, RCU's only hint that a CPU is executing in
> userspace occurs when the once-per-jiffy scheduling-clock interrupt
> is taken from userspace.  If the CPU is spending 99% of its time in
> the kernel, we can expect only one in 100 interrupts being taken from
> userspace.  On a HZ=1000 system, that will give you 100 milliseconds
> right there.
Thanks for the detailed explanation. It seems my previous understanding
of RCU was wrong. My original thought was that when a syscall returns
from kernel space, the process which did the syscall would be
considered to be in a quiescent state, and the RCU grace period would
expire soon afterwards.
>
> But it gets worse.  The variance is quite large.  For example, for a
> given grace period, there is about a 5% chance that the delay will be
> 300 milliseconds instead of the expected 100 milliseconds.  (You can
> easily check this by taking 0.99 to the 300th power.  Or whatever other
> power you desire.)
>
> So the problem is that simple statistics is biting pretty hard here.
>
> As you may have noticed, life is like that sometimes.  ;-)
>
> 							Thanx, Paul
>
>>> So 150 milliseconds is an OK RCU grace period.  A bit long, but not
>>> excessively so.  In contrast, by default in mainline, RCU starts seriously
>>> complaining if its grace period is extended beyond 21 *seconds*.  This is
>>> when the RCU CPU stall warning will appear.  (Yes, some Android configs
>>> are tuning this down to 20 milliseconds, but that is a special hardware
>>> and software configuration.)
>>>
>>> But if you want shorter RCU grace periods, what can you do?
>>>
>>> 1.	Follow Alexei's good advice on one of your early patches.
>>>
>>> 2.	As an alternative to scheduling an empty kworker, invoke something
>>> 	like rcu_momentary_dyntick_idle() periodically.  Note well that
>>> 	it is illegal to invoke this in an RCU read-side critical section,
>>> 	with preemption disabled, from idle, ...
>>>
>>> 3.	In non-preemptible kernels, cond_resched() is a much lighter
>>> 	weight alternative to rcu_momentary_dyntick_idle().  (Preemptible
>>> 	kernels get the same effect by preempting.  In your case, this
>>> 	is also true, but it takes 150 milliseconds.)
>>>
>>> That should do for a start.  ;-)
>>>
>>> 							Thanx, Paul
>>>
>>>> Regards,
>>>> Tao
>>>>> 							Thanx, Paul
>>>>>
>>>>>>>>> I don't think so. Let's considering the per-cpu list first. Assume the
>>>>>>>>> normal RCU grace period is about 30ms and we are tracing the IO latency
>>>>>>>>> of a normal SSD. The iops is about 176K per seconds, so before one RCU
>>>>>>>>> GP is passed, we will need to allocate about 176 * 30 = 5.2K elements.
>>>>>>>>> For the per-ma list, when the number of CPUs increased, it is easy to
>>>>>>>>> make the list contain thousands of elements.
>>>>>>>> That would be true only if there were no scheduling events in all of 176K ops.
>>>>>>>> Which is not the case.
>>>>>>>> I'm not sure why you're saying that RCU GP is 30ms.
>>>>>> Because these freed elements will be freed after one RCU GP and in v4
>>>>>> there is only one RCU callback per-cpu, so before one RCU GP is expired,
>>>>>> these freed elements will be accumulated on the list.
>>>>>>>> In CONFIG_PREEMPT_NONE rcu_read_lock/unlock are true nops.
>>>>>>>> Every sched event is sort-of implicit rcu_read_lock/unlock.
>>>>>>>> Network and block IO doesn't process 176K packets without resched.
>>>>>>>> Don't know how block does it, but in networking NAPI will process 64 packets and will yield softirq.
>>>>>>>>
>>>>>>>> For small size buckets low_watermark=32 and high=96.
>>>>>>>> We typically move 32 elements at a time from one list to another.
>>>>>>>> A bunch of elements maybe sitting in free_by_rcu and moving them to waiting_for_gp
>>>>>>>> is not instant, but once __free_rcu_tasks_trace is called we need to take
>>>>>>>> elements from waiting_for_gp one at a time and kfree it one at a time.
>>>>>>>> So optimizing the move from free_by_rcu into waiting_for_gp is not worth the code complexity.
>>>>>>>>
>>>>>>>>> Before I post v5, I want to know the reason why per-bpf-ma list is
> >>>>>>>> introduced. Previously, I thought it was used to handle the case in which
>>>>>>>>> allocation and freeing are done on different CPUs.
>>>>>>>> Correct. per-bpf-ma list is necessary to avoid OOM-ing due to slow rcu_tasks_trace GP.
>>>>>>>>
>>>>>>>>> And as we can see
>>>>>>>>> from the benchmark result above and in v3, the performance and the
>>>>>>>>> memory usage of v4 for add_del_on_diff_cpu is better than v3.
>>>>>>>> bench from patch 2 is invalid. Hence no conclusion can be made.
>>>>>>>>
>>>>>>>> So far the only bench we can trust and analyze is map_perf_test.
>>>>>>>> Please make bench in patch 2 yield the cpu after few updates.
>>>>>>>> Earlier I suggested to stick to 10, but since NAPI can do 64 at a time.
>>>>>>>> 64 updates is realistic too. A thousand is not.
>>>>>> Will do that.
>>>>>>
>>> .
Paul E. McKenney June 12, 2023, 3:40 a.m. UTC | #26
On Mon, Jun 12, 2023 at 10:03:15AM +0800, Hou Tao wrote:
> Hi Paul,
> 
> On 6/9/2023 10:30 PM, Paul E. McKenney wrote:
> > On Fri, Jun 09, 2023 at 11:02:59AM +0800, Hou Tao wrote:
> >> Hi Paul,
> >>
> >> On 6/9/2023 12:18 AM, Paul E. McKenney wrote:
> >>> On Thu, Jun 08, 2023 at 11:43:54AM +0800, Hou Tao wrote:
> >>>> Hi Paul,
> >>>>
> >>>> On 6/8/2023 10:55 AM, Paul E. McKenney wrote:
> >>>>> On Thu, Jun 08, 2023 at 09:51:27AM +0800, Hou Tao wrote:
> >>>>>> Hi,
> >>>>>>
> >>>>>> On 6/8/2023 4:50 AM, Alexei Starovoitov wrote:
> >>>>>>> On Wed, Jun 7, 2023 at 10:52 AM Alexei Starovoitov
> >>>>>>> <alexei.starovoitov@gmail.com> wrote:
> >>>>>>>> On Wed, Jun 07, 2023 at 04:42:11PM +0800, Hou Tao wrote:
> >>>>>>>>> As said in the commit message, the command line for test is
> >>>>>>>>> "./map_perf_test 4 8 16384", because the default max_entries is 1000. If
> >>>>>>>>> using default max_entries and the number of CPUs is greater than 15,
> >>>>>>>>> use_percpu_counter will be false.
> >>>>>>>> Right. percpu or not depends on number of cpus.
> >>>> SNIP
> >>>>>>>>  known, because I had just proposed it in the email yesterday.
> >>>>>>> Also noticed that the overhead of shared reuse_ready list
> >>>>>>> comes both from the contended lock and from cache misses
> >>>>>>> when one cpu pushes to the list after RCU GP and another
> >>>>>>> cpu removes.
> >>>>>>>
> >>>>>>> Also low/batch/high watermark are all wrong in patch 3.
> >>>>>>> low=32 and high=96 makes no sense when it's not a single list.
> >>>>>>> I'm experimenting with 32 for all three heuristics.
> >>>>>>>
> >>>>>>> Another thing I noticed that per-cpu prepare_reuse and free_by_rcu
> >>>>>>> are redundant.
> >>>>>>> unit_free() can push into free_by_rcu directly
> >>>>>>> then reuse_bulk() can fill it up with free_llist_extra and
> >>>>>>> move them into waiting_for_gp.
> >>>>>> Yes. Indeed missing that.
> >>>>>>> All these _tail optimizations are obscuring the code and make it hard
> >>>>>>> to notice these issues.
> >>>>>>>
> >>>>>>>> For now I still prefer to see v5 with per-bpf-ma and no _tail optimization.
> >>>>>>>>
> >>>>>>>> Answering your other email:
> >>>>>>>>
> >>>>>>>>> I see your point. I will continue to debug the memory usage difference
> >>>>>>>>> between v3 and v4.
> >>>>>>>> imo it's a waste of time to continue analyzing performance based on bench in patch 2.
> >>>>>> Don't agree with that. I still think the potential memory usage of v4 is
> >>>>>> a problem and the difference memory usage between v3 and v4 reveals that
> >>>>>> there is some peculiarity in RCU subsystem, because the difference comes
> >>>>>> from the duration of RCU grace period. We need to find out why and fix
> >>>>>> or workaround that.
> >>>>> A tight loop in the kernel can extend RCU grace periods, especially
> >>>>> for kernels built with CONFIG_PREEMPTION=n.  Placing things like
> >>>>> cond_resched() in such loops can help.  Of course, if you are in an
> >>>>> RCU read-side critical section (or holding a spinlock), you will need
> >>>>> to exit, cond_resched(), then re-enter.  Taking care to ensure that the
> >>>>> state upon re-entry is valid.  After all, having exited either type of
> >>>>> critical section, anything might happen.
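The exit/resched/re-enter pattern described above looks roughly like the following. This is pseudocode sketching the idiom, not code from the patches; revalidate() is a hypothetical step standing in for whatever check the caller needs.

```c
rcu_read_lock();
while (more_work()) {
	do_one_unit();
	if (need_resched()) {
		rcu_read_unlock();     /* leave the read-side critical section */
		cond_resched();        /* lets RCU record a quiescent state */
		rcu_read_lock();
		/* Anything may have happened while we were outside the
		 * critical section, so the previously observed state
		 * must be revalidated before continuing. */
		if (!revalidate())
			break;
	}
}
rcu_read_unlock();
```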
> >>>> As said in the help-wanted email just sent out, it was weird that the
> >>>> RCU grace period was extended a lot, up to ~150ms or more. But queuing
> >>>> a dummy kworker periodically, which does nothing, helps to reduce the
> >>>> grace period to ~10ms. I have explained the context of the problem in
> >>>> that email, and hope that we can get some help from you and the RCU
> >>>> experts in the community.
> >>> OK, I will bite...  Why do you think this is weird?
> >>>
> >>> RCU depends on context switches, among other things.  If you have a
> >>> long-running in-kernel task, that will naturally extend the grace period.
> >>> Scheduling the empty worker provided the context switch that RCU needed
> >>> to make forward progress.
> >> Because the empty kworker trick also works for CONFIG_PREEMPT_VOLUNTARY=y.
> >> And there is neither an implicit nor an explicit call of schedule() in
> >> the kernel space of the producer thread when CONFIG_PREEMPT_VOLUNTARY=y.
> >> Also, I don't know how to define a long-running in-kernel task, because
> >> I have checked the latency of the getpgid() syscall in the producer
> >> thread when CONFIG_PREEMPT_VOLUNTARY=y. As shown in the histogram below,
> >> there are indeed some outliers, but most latencies are less than 1ms,
> >> so the producer thread will do a context switch within at most ~1ms.
> >> But before the empty kworker trick, the latency of the RCU callback was
> >> about 100ms or more, and after the trick, the latency of the RCU
> >> callback is reduced to ~8ms.
> >>
> >> @hist_us:
> >> [128, 256)            60 |                                                    |
> >> [256, 512)        101364 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
> >> [512, 1K)          17580 |@@@@@@@@@                                           |
> >> [1K, 2K)              16 |                                                    |
> >>
> >> @stat_us: count 119020, average 436, total 51957277
> > Ah!
> >
> > The trick is that if you have a non-nohz_full CPU spending most of its
> > time in kernel space (in your case, due to repeated system calls), with
> > high probability this CPU will appear to RCU as if it were spending all
> > of its time in kernel space.  This probability will increase with the
> > fraction of the time this CPU spends in kernel space.
> >
> > If I am reading your histogram correctly, the overhead of your system
> > call is usually between 256 and 512 microseconds.  If you are invoking
> > that system call in a tight loop, that CPU could easily be spending
> > way more than 99% of its time in the kernel.  Unless you have similar
> > statistics for the average length of your CPU's times in userspace,
> > let's assume exactly 99%.
> >
> > On a non-nohz_full CPU, RCU's only hint that a CPU is executing in
> > userspace occurs when the once-per-jiffy scheduling-clock interrupt
> > is taken from userspace.  If the CPU is spending 99% of its time in
> > the kernel, we can expect only one in 100 interrupts being taken from
> > userspace.  On a HZ=1000 system, that will give you 100 milliseconds
> > right there.
> 
> Thanks for the detailed explanation. It seems my previous understanding
> of RCU was wrong. My original thought was that when a syscall returns
> from kernel space, the process that made the syscall would be considered
> to be in a quiescent state, and the RCU grace period would expire soon.

On a nohz_full CPU, that does in fact happen, courtesy of the
context-tracking code's ct_user_enter() and ct_user_exit() functions.

So your understanding was not entirely wrong.

But context tracking adds overhead, so it is enabled only on kernels
that require it, typically those running certain types of HPC and/or
real-time workloads.  On other kernels, there is no context tracking
for user/kernel transitions, which (as discussed above) forces RCU to
rely on the scheduling-clock interrupt, with the resulting grace-period
delays.

							Thanx, Paul

> > But it gets worse.  The variance is quite large.  For example, for a
> > given grace period, there is about a 5% chance that the delay will be
> > 300 milliseconds instead of the expected 100 milliseconds.  (You can
> > easily check this by taking 0.99 to the 300th power.  Or whatever other
> > power you desire.)
> >
> > So the problem is that simple statistics is biting pretty hard here.
> >
> > As you may have noticed, life is like that sometimes.  ;-)
> >
> > 							Thanx, Paul
> >
> >>> So 150 milliseconds is an OK RCU grace period.  A bit long, but not
> >>> excessively so.  In contrast, by default in mainline, RCU starts seriously
> >>> complaining if its grace period is extended beyond 21 *seconds*.  This is
> >>> when the RCU CPU stall warning will appear.  (Yes, some Android configs
> >>> are tuning this down to 20 milliseconds, but that is a special hardware
> >>> and software configuration.)
> >>>
> >>> But if you want shorter RCU grace periods, what can you do?
> >>>
> >>> 1.	Follow Alexei's good advice on one of your early patches.
> >>>
> >>> 2.	As an alternative to scheduling an empty kworker, invoke something
> >>> 	like rcu_momentary_dyntick_idle() periodically.  Note well that
> >>> 	it is illegal to invoke this in an RCU read-side critical section,
> >>> 	with preemption disabled, from idle, ...
> >>>
> >>> 3.	In non-preemptible kernels, cond_resched() is a much lighter
> >>> 	weight alternative to rcu_momentary_dyntick_idle().  (Preemptible
> >>> 	kernels get the same effect by preempting.  In your case, this
> >>> 	is also true, but it takes 150 milliseconds.)
> >>>
> >>> That should do for a start.  ;-)
> >>>
> >>> 							Thanx, Paul
> >>>
> >>>> Regards,
> >>>> Tao
> >>>>> 							Thanx, Paul
> >>>>>
> >>>>>>>>> I don't think so. Let's consider the per-cpu list first. Assume
> >>>>>>>>> the normal RCU grace period is about 30ms and we are tracing the
> >>>>>>>>> IO latency of a normal SSD. The IOPS is about 176K per second, so
> >>>>>>>>> before one RCU GP has passed, we will need to allocate about
> >>>>>>>>> 176 * 30 = ~5.3K elements. For the per-ma list, when the number
> >>>>>>>>> of CPUs increases, it is easy for the list to contain thousands
> >>>>>>>>> of elements.
> >>>>>>>> That would be true only if there were no scheduling events in all of 176K ops.
> >>>>>>>> Which is not the case.
> >>>>>>>> I'm not sure why you're saying that RCU GP is 30ms.
> >>>>>> Because these freed elements will be freed after one RCU GP, and in
> >>>>>> v4 there is only one RCU callback per CPU, so before one RCU GP
> >>>>>> expires, the freed elements will accumulate on the list.
> >>>>>>>> In CONFIG_PREEMPT_NONE rcu_read_lock/unlock are true nops.
> >>>>>>>> Every sched event is sort-of implicit rcu_read_lock/unlock.
> >>>>>>>> Network and block IO doesn't process 176K packets without resched.
> >>>>>>>> Don't know how block does it, but in networking NAPI will process 64 packets and will yield softirq.
> >>>>>>>>
> >>>>>>>> For small size buckets low_watermark=32 and high=96.
> >>>>>>>> We typically move 32 elements at a time from one list to another.
> >>>>>>>> A bunch of elements may be sitting in free_by_rcu, and moving them
> >>>>>>>> to waiting_for_gp is not instant, but once __free_rcu_tasks_trace
> >>>>>>>> is called we need to take elements from waiting_for_gp one at a
> >>>>>>>> time and kfree them one at a time.
> >>>>>>>> So optimizing the move from free_by_rcu into waiting_for_gp is not
> >>>>>>>> worth the code complexity.
> >>>>>>>>
> >>>>>>>>> Before I post v5, I want to know the reason why the per-bpf-ma
> >>>>>>>>> list is introduced. Previously, I thought it was used to handle
> >>>>>>>>> the case in which allocation and freeing are done on different
> >>>>>>>>> CPUs.
> >>>>>>>> Correct. per-bpf-ma list is necessary to avoid OOM-ing due to slow rcu_tasks_trace GP.
> >>>>>>>>
> >>>>>>>>> And as we can see
> >>>>>>>>> from the benchmark result above and in v3, the performance and the
> >>>>>>>>> memory usage of v4 for add_del_on_diff_cpu is better than v3.
> >>>>>>>> bench from patch 2 is invalid. Hence no conclusion can be made.
> >>>>>>>>
> >>>>>>>> So far the only bench we can trust and analyze is map_perf_test.
> >>>>>>>> Please make bench in patch 2 yield the cpu after a few updates.
> >>>>>>>> Earlier I suggested sticking to 10, but since NAPI can do 64 at a
> >>>>>>>> time, 64 updates is realistic too. A thousand is not.
> >>>>>> Will do that.
> >>>>>>
>