Message ID: 20240909054318.1809580-1-almasrymina@google.com (mailing list archive)
Series: Device Memory TCP
On 2024/9/9 13:43, Mina Almasry wrote:
>
> Perf - page-pool benchmark:
> ---------------------------
>
> bench_page_pool_simple.ko tests with and without these changes:
> https://pastebin.com/raw/ncHDwAbn
>
> AFAIK the number that really matters in the perf tests is the
> 'tasklet_page_pool01_fast_path Per elem'. This one measures at about 8
> cycles without the changes, but there is some 1-cycle noise in some
> results.
>
> With the patches this regresses to 9 cycles, with 1-cycle noise
> occasionally when running this test repeatedly.
>
> Lastly, I tried disabling the static_branch_unlikely() in the
> netmem_is_net_iov() check. To my surprise, disabling the
> static_branch_unlikely() check reduces the fast path back to 8 cycles,
> but the 1-cycle noise remains.

Sorry for the late report; I was adding a testing page_pool ko based on
[1] to avoid introducing a performance regression when fixing the bug in
[2].

I used it to test the performance impact of the devmem patchset on
page_pool too. There seems to be a noticeable and quite stable
performance impact for the testcases below, about 5%~16% degradation on
an arm64 system:

Before the devmem patchset:

 Performance counter stats for 'insmod ./page_pool_test.ko test_push_cpu=16 test_pop_cpu=16 nr_test=100000000 test_napi=1' (100 runs):

      17.167561  task-clock (msec)  #    0.003 CPUs utilized            ( +-  0.40% )
              8  context-switches   #    0.474 K/sec                    ( +-  0.65% )
              0  cpu-migrations     #    0.001 K/sec                    ( +-100.00% )
             84  page-faults        #    0.005 M/sec                    ( +-  0.13% )
       44576552  cycles             #    2.597 GHz                      ( +-  0.40% )
       59627412  instructions       #    1.34  insn per cycle           ( +-  0.03% )
       14370325  branches           #  837.063 M/sec                    ( +-  0.02% )
          21902  branch-misses      #    0.15% of all branches          ( +-  0.27% )

    6.818873600 seconds time elapsed                                    ( +-  0.02% )

 Performance counter stats for 'insmod ./page_pool_test.ko test_push_cpu=16 test_pop_cpu=16 nr_test=100000000 test_napi=1 test_direct=1' (100 runs):

      17.595423  task-clock (msec)  #    0.004 CPUs utilized            ( +-  0.01% )
              8  context-switches   #    0.460 K/sec                    ( +-  0.50% )
              0  cpu-migrations     #    0.000 K/sec
             84  page-faults        #    0.005 M/sec                    ( +-  0.15% )
       45693020  cycles             #    2.597 GHz                      ( +-  0.01% )
       59676212  instructions       #    1.31  insn per cycle           ( +-  0.00% )
       14385384  branches           #  817.564 M/sec                    ( +-  0.00% )
          21786  branch-misses      #    0.15% of all branches          ( +-  0.14% )

    4.098627802 seconds time elapsed                                    ( +-  0.11% )

After the devmem patchset:

 Performance counter stats for 'insmod ./page_pool_test.ko test_push_cpu=16 test_pop_cpu=16 nr_test=100000000 test_napi=1' (100 runs):

      17.047973  task-clock (msec)  #    0.002 CPUs utilized            ( +-  0.39% )
              8  context-switches   #    0.488 K/sec                    ( +-  0.82% )
              0  cpu-migrations     #    0.001 K/sec                    ( +- 70.35% )
             84  page-faults        #    0.005 M/sec                    ( +-  0.12% )
       44269558  cycles             #    2.597 GHz                      ( +-  0.39% )
       59594383  instructions       #    1.35  insn per cycle           ( +-  0.02% )
       14362599  branches           #  842.481 M/sec                    ( +-  0.02% )
          21949  branch-misses      #    0.15% of all branches          ( +-  0.25% )

    7.964890303 seconds time elapsed                                    ( +-  0.16% )

 Performance counter stats for 'insmod ./page_pool_test.ko test_push_cpu=16 test_pop_cpu=16 nr_test=100000000 test_napi=1 test_direct=1' (100 runs):

      17.660975  task-clock (msec)  #    0.004 CPUs utilized            ( +-  0.02% )
              8  context-switches   #    0.458 K/sec                    ( +-  0.57% )
              0  cpu-migrations     #    0.003 K/sec                    ( +- 43.81% )
             84  page-faults        #    0.005 M/sec                    ( +-  0.17% )
       45862652  cycles             #    2.597 GHz                      ( +-  0.02% )
       59764866  instructions       #    1.30  insn per cycle           ( +-  0.01% )
       14404323  branches           #  815.602 M/sec                    ( +-  0.01% )
          21826  branch-misses      #    0.15% of all branches          ( +-  0.19% )

    4.304644609 seconds time elapsed                                    ( +-  0.75% )

1. https://lore.kernel.org/all/20240906073646.2930809-2-linyunsheng@huawei.com/
2. https://lore.kernel.org/lkml/8067f204-1380-4d37-8ffd-007fc6f26738@kernel.org/T/
On Mon, Sep 9, 2024 at 4:21 AM Yunsheng Lin <linyunsheng@huawei.com> wrote:
>
> Sorry for the late report; I was adding a testing page_pool ko based
> on [1] to avoid introducing a performance regression when fixing the
> bug in [2].
>
> I used it to test the performance impact of the devmem patchset on
> page_pool too. There seems to be a noticeable and quite stable
> performance impact for the testcases below, about 5%~16% degradation
> on an arm64 system:
>
> [...]

Correct me if I'm wrong here, but on the surface it seems that you're
re-reporting a known issue. Consensus seems to be that it's a non-issue.

In v6 I reported that the bench_page_pool_simple.ko test shows a 1-cycle
regression with these patches, from 8 to 9 cycles. That is roughly
consistent with the 5%~16% you're reporting.

I root-caused the regression to the netmem_is_net_iov() check in the
fast path. I removed this regression in v7 (see the change log) by
conditionally compiling the check in that function.

In v8, Pavel/Jens/David pushed back on the ifdef check. See this entire
thread, but in particular this response from Jens:

https://lore.kernel.org/lkml/11f52113-7b67-4b45-ba1d-29b070050cec@kernel.dk/

Seems to be consensus that it's 'not really worth it in this scenario'.
On 2024/9/10 0:54, Mina Almasry wrote:
> On Mon, Sep 9, 2024 at 4:21 AM Yunsheng Lin <linyunsheng@huawei.com> wrote:
>>
>> [...]
>
> Correct me if I'm wrong here, but on the surface it seems that you're
> re-reporting a known issue. Consensus seems to be that it's a
> non-issue.
>
> In v6 I reported that the bench_page_pool_simple.ko test shows a
> 1-cycle regression with these patches, from 8 to 9 cycles. That is
> roughly consistent with the 5%~16% you're reporting.

From the description above in the cover letter, I thought the
performance data using the out-of-tree testing ko was not stable enough
to justify the performance impact.

> I root-caused the regression to the netmem_is_net_iov() check in the
> fast path. I removed this regression in v7 (see the change log) by
> conditionally compiling the check in that function.
>
> In v8, Pavel/Jens/David pushed back on the ifdef check. See this
> entire thread, but in particular this response from Jens:

It seems the main objection is about how to enable this feature for
io_uring? And since you have already added CONFIG_NET_DEVMEM for this
devmem thing, why not use it for that?

> https://lore.kernel.org/lkml/11f52113-7b67-4b45-ba1d-29b070050cec@kernel.dk/
>
> Seems to be consensus that it's 'not really worth it in this scenario'.

I was only reading through the above thread; it didn't seem to reach
consensus, as Jesper pointed out the performance impact for the XDP_DROP
case in the same thread:

https://lore.kernel.org/lkml/779b9542-4170-483a-af54-ca0dd471f774@kernel.org/
On 9/10/24 11:44, Yunsheng Lin wrote:
> On 2024/9/10 0:54, Mina Almasry wrote:
>>
>> [...]
>>
>> In v8, Pavel/Jens/David pushed back on the ifdef check. See this
>> entire thread, but in particular this response from Jens:
>
> It seems the main objection is about how to enable this feature for
> io_uring?

The pushback was that config checks as an optimisation don't work in
real life; they inevitably get enabled everywhere but some niche cases.
io_uring could add another config for memory providers, but even if it's
not enabled by default (which is not a great option), distributions will
eventually turn it on.

So, if you have that "niche use case" that fully controls the kernel and
wants to shed this overhead, we can do a config structure; but if it's
about overhead for everyone in general, configs hardly help anything,
even without any io_uring in the picture.

> And it seemed that you had added CONFIG_NET_DEVMEM for this devmem
> thing, why not use it for that?
On Mon, 9 Sep 2024 05:43:05 +0000 Mina Almasry wrote:
> Device memory TCP (devmem TCP) is a proposal for transferring data to
> and/or from device memory efficiently, without bouncing the data to a
> host memory buffer.

Mina, if you'd like to see this in v6.12 -- please fix the nits and
repost ASAP.
On Tue, Sep 10, 2024 at 8:02 AM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Mon, 9 Sep 2024 05:43:05 +0000 Mina Almasry wrote:
> > Device memory TCP (devmem TCP) is a proposal for transferring data
> > to and/or from device memory efficiently, without bouncing the data
> > to a host memory buffer.
>
> Mina, if you'd like to see this in v6.12 -- please fix the nits and
> repost ASAP.

Running my presubmits now and will repost in the next 2 hours or so.

--
Thanks,
Mina