Message ID: 1643764336-63864-1-git-send-email-jdamato@fastly.com
Series: page_pool: Add page_pool stat counters

Hi Joe,

Again thanks for the patches!

On Wed, 2 Feb 2022 at 03:13, Joe Damato <jdamato@fastly.com> wrote:
>
> Greetings:
>
> Sending a v3 as I noted some issues with the procfs code in patch 10 I
> submit in v2 (thanks, kernel test robot) and fixing the placement of the
> refill stat increment in patch 8.
>
> I only modified the placement of the refill stat, but decided to re-run the
> benchmarks used in the v2 [1], and the results are:
>
> Test system:
>       - 2x Intel(R) Xeon(R) Gold 6140 CPU @ 2.30GHz
>       - 2 NUMA zones, with 18 cores per zone and 2 threads per core
>
> bench_page_pool_simple results:
> test name                       stats enabled       stats disabled
>                                 cycles  nanosec     cycles  nanosec
>
> for_loop                             0    0.335          0    0.334
> atomic_inc                          13    6.028         13    6.035
> lock                                32   14.017         31   13.552
>
> no-softirq-page_pool01              45   19.832         46   20.193
> no-softirq-page_pool02              44   19.478         46   20.083
> no-softirq-page_pool03             110   48.365        109   47.699
>
> tasklet_page_pool01_fast_path       14    6.204         13    6.021
> tasklet_page_pool02_ptr_ring        41   18.115         42   18.699
> tasklet_page_pool03_slow           110   48.085        108   47.395
>
> bench_page_pool_cross_cpu results:
> test name                       stats enabled       stats disabled
>                                 cycles  nanosec     cycles  nanosec
>
> page_pool_cross_cpu CPU(0)        2216  966.179       2101  915.692
> page_pool_cross_cpu CPU(1)        2211  963.914       2159  941.087
> page_pool_cross_cpu CPU(2)        1108  483.097       1079  470.573
>
> page_pool_cross_cpu average       1845        -       1779        -
>
> v2 -> v3:
>       - patch 8/10 ("Add stat tracking cache refill") fixed placement of
>         counter increment.
>       - patch 10/10 ("net-procfs: Show page pool stats in proc") updated:
>               - fix unused label warning from kernel test robot,
>               - fixed page_pool_seq_show to only display the refill stat
>                 once,
>               - added a remove_proc_entry for page_pool_stat to
>                 dev_proc_net_exit.
>
> v1 -> v2:
>       - A new kernel config option has been added, which defaults to N,
>         preventing this code from being compiled in by default
>       - The stats structure has been converted to a per-cpu structure
>       - The stats are now exported via proc (/proc/net/page_pool_stat)

CC'ing Saeed since he is interested in page pool stats for mlx5.

I'd be much happier if we had per cpu per pool stats and a way to pick
them up via ethtool, instead of global page pool stats in /proc.
Does anyone have an opinion on this?

[...]

Thanks!
/Ilias
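The per-pool, per-CPU layout being suggested here could look roughly like the
sketch below. All struct, field, and Kconfig names are illustrative guesses,
not code from this series; the point is only that each pool carries its own
counters and the softirq hot path can bump them without locks or atomics.

struct page_pool_alloc_stats {          /* hypothetical */
        u64 fast;               /* allocations served from the per-CPU cache */
        u64 slow;               /* order-0 fallbacks to the page allocator */
        u64 slow_high_order;    /* high-order fallbacks */
        u64 empty;              /* cache and ptr_ring both empty */
        u64 refill;             /* cache refills from the ptr_ring */
        u64 waive;              /* pages not recycled (e.g. NUMA mismatch) */
};

struct page_pool {
        /* ... existing fields ... */
#ifdef CONFIG_PAGE_POOL_STATS
        struct page_pool_alloc_stats __percpu *alloc_stats;
#endif
};

/* page_pool_create(): pool->alloc_stats = alloc_percpu(struct page_pool_alloc_stats);
 * allocation paths:   this_cpu_inc(pool->alloc_stats->fast); and so on.
 */

Per-pool counters like these can then be summed per pool for reporting (see
the fold sketch further down), which is what makes an ethtool-style read-out
possible.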
Adding Cc. Tariq and Saeed, as they wanted page_pool stats in the past.

On 02/02/2022 02.12, Joe Damato wrote:
> Greetings:
>
> Sending a v3 as I noted some issues with the procfs code in patch 10 I
> submit in v2 (thanks, kernel test robot) and fixing the placement of the
> refill stat increment in patch 8.

Could you explain why a single global stats (/proc/net/page_pool_stat)
for all page_pool instances for all RX-queues makes sense?

I think this argument/explanation belongs in the cover letter.

What are you using this for?

And do Tariq and Saeeds agree with this single global stats approach?

> I only modified the placement of the refill stat, but decided to re-run the
> benchmarks used in the v2 [1], and the results are:

I appreciate that you are running the benchmarks.

> Test system:
>       - 2x Intel(R) Xeon(R) Gold 6140 CPU @ 2.30GHz
>       - 2 NUMA zones, with 18 cores per zone and 2 threads per core
>
> bench_page_pool_simple results:
> test name                       stats enabled       stats disabled
>                                 cycles  nanosec     cycles  nanosec
>
> for_loop                             0    0.335          0    0.334

I think you can drop the 'for_loop' results, we can see that the
overhead is insignificant.

> atomic_inc                          13    6.028         13    6.035
> lock                                32   14.017         31   13.552
>
> no-softirq-page_pool01              45   19.832         46   20.193
> no-softirq-page_pool02              44   19.478         46   20.083
> no-softirq-page_pool03             110   48.365        109   47.699
>
> tasklet_page_pool01_fast_path       14    6.204         13    6.021
> tasklet_page_pool02_ptr_ring        41   18.115         42   18.699
> tasklet_page_pool03_slow           110   48.085        108   47.395
>
> bench_page_pool_cross_cpu results:
> test name                       stats enabled       stats disabled
>                                 cycles  nanosec     cycles  nanosec
>
> page_pool_cross_cpu CPU(0)        2216  966.179       2101  915.692
> page_pool_cross_cpu CPU(1)        2211  963.914       2159  941.087
> page_pool_cross_cpu CPU(2)        1108  483.097       1079  470.573
>
> page_pool_cross_cpu average       1845        -       1779        -
>
> v2 -> v3:
>       - patch 8/10 ("Add stat tracking cache refill") fixed placement of
>         counter increment.
>       - patch 10/10 ("net-procfs: Show page pool stats in proc") updated:
>               - fix unused label warning from kernel test robot,
>               - fixed page_pool_seq_show to only display the refill stat
>                 once,
>               - added a remove_proc_entry for page_pool_stat to
>                 dev_proc_net_exit.
>
> v1 -> v2:
>       - A new kernel config option has been added, which defaults to N,
>         preventing this code from being compiled in by default
>       - The stats structure has been converted to a per-cpu structure
>       - The stats are now exported via proc (/proc/net/page_pool_stat)
>
> Thanks.
>
> [1]:
> https://lore.kernel.org/all/1643499540-8351-1-git-send-email-jdamato@fastly.com/T/#md82c6d5233e35bb518bc40c8fd7dff7a7a17e199
>
> Joe Damato (10):
>   page_pool: kconfig: Add flag for page pool stats
>   page_pool: Add per-cpu page_pool_stats struct
>   page_pool: Add a macro for incrementing stats
>   page_pool: Add stat tracking fast path allocations
>   page_pool: Add slow path order 0 allocation stat
>   page_pool: Add slow path high order allocation stat
>   page_pool: Add stat tracking empty ring
>   page_pool: Add stat tracking cache refill
>   page_pool: Add a stat tracking waived pages
>   net-procfs: Show page pool stats in proc
>
>  include/net/page_pool.h | 20 +++++++++++++++
>  net/Kconfig             | 12 +++++++++
>  net/core/net-procfs.c   | 67 +++++++++++++++++++++++++++++++++++++++++++++++++
>  net/core/page_pool.c    | 28 ++++++++++++++++++---
>  4 files changed, 124 insertions(+), 3 deletions(-)
>
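For contrast, the single global file Jesper is questioning boils down to one
per-CPU counter block shared by every page_pool instance in the system, along
the lines of the sketch below. The names and the increment macro are
illustrative, compiled out when the Kconfig option is off, and are not the
series' actual code.

#ifdef CONFIG_PAGE_POOL_STATS
struct page_pool_stats {                /* hypothetical global counters */
        u64 fast_path_alloc;
        u64 slow_path_alloc;
        u64 slow_path_high_order_alloc;
        u64 ring_empty;
        u64 cache_refill;
        u64 waive;
};

/* One copy per CPU, shared by all pools (needs <linux/percpu-defs.h>). */
DEFINE_PER_CPU(struct page_pool_stats, page_pool_stats);

#define page_pool_stat_inc(field) this_cpu_inc(page_pool_stats.field)
#else
#define page_pool_stat_inc(field) do { } while (0)
#endif

Because every pool increments the same per-CPU block, the read-out can only
show per-CPU totals, which is exactly the attribution question raised above.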
On Wed, Feb 2, 2022 at 6:31 AM Jesper Dangaard Brouer <jbrouer@redhat.com> wrote:
>
>
> Adding Cc. Tariq and Saeed, as they wanted page_pool stats in the past.
>
> On 02/02/2022 02.12, Joe Damato wrote:
> > Greetings:
> >
> > Sending a v3 as I noted some issues with the procfs code in patch 10 I
> > submit in v2 (thanks, kernel test robot) and fixing the placement of the
> > refill stat increment in patch 8.
>
> Could you explain why a single global stats (/proc/net/page_pool_stat)
> for all page_pool instances for all RX-queues makes sense?
>
> I think this argument/explanation belongs in the cover letter.

I included an explanation in the v2 cover letter where those changes
occurred, but you are right: I should have also included it in the v3
cover letter.

My thought process was this:

- Stats now have to be enabled by an explicit kernel config option, so
  the user has to know what they are doing
- Advanced users can move softirqs to CPUs as they wish and they could
  isolate a particular set of RX-queues on a set of CPUs this way
- The result is that there is no need to expose anything to the
  drivers and no modifications to drivers are necessary once the single
  kernel config option is enabled and softirq affinity is configured

I had assumed by not exposing new APIs / page pool internals and by
not requiring drivers to make any changes, I would have a better shot
of getting my patches accepted.

It sounds like both you and Ilias strongly prefer per-pool-per-cpu
stats, so I can make that change in the v4.

> What are you using this for?

I currently graph NIC driver stats from a number of different vendors
to help better understand the performance of those NICs under my
company's production workload.

For example, on i40e, I submit changes to the upstream driver [1] and
am graphing those stats to better understand memory reuse rate. We
have seen some issues around mm allocation contention in production
workloads with certain NICs and system architectures.

My findings with mlx5 have indicated that the proprietary page reuse
algorithm in the driver, with our workload, does not provide much
memory re-use, and causes pressure against the kernel's page
allocator. The page pool should help remedy this, but without stats I
don't have a clear way to measure the effect.

So in short: I'd like to gather and graph stats about the page pool
API to determine how much impact the page pool API has on page reuse
for mlx5 in our workload.

> And do Tariq and Saeeds agree with this single global stats approach?

I don't know; I hope they'll chime in.

As I mentioned above, I don't really mind which approach is preferred
by you all. I had assumed that something with fewer external APIs
would be more likely to be accepted, and so I made that change in v2.

> > I only modified the placement of the refill stat, but decided to re-run the
> > benchmarks used in the v2 [1], and the results are:
>
> I appreciate that you are running the benchmarks.

Sure, no worries. As you mentioned in the other thread, perhaps some
settings need to be adjusted to show more relevant data on faster
systems.

When I work on the v4, I will take a look at the benchmarks and
explain any modifications made to them or their options when
presenting the test results.

> > Test system:

[...]

Thanks,
Joe

[1]: https://patchwork.ozlabs.org/project/intel-wired-lan/cover/1639769719-81285-1-git-send-email-jdamato@fastly.com/
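To make the trade-off concrete, a read side for a global per-CPU file would
look something like the sketch below; each row is attributable to particular
RX queues only under the pinned softirq-affinity assumption Joe describes.
The function name page_pool_seq_show appears in the changelog, but this body
is a guess based on the global sketch above, not the actual patch 10/10.

/* Hypothetical /proc/net/page_pool_stat show function (needs <linux/seq_file.h>). */
static int page_pool_seq_show(struct seq_file *seq, void *v)
{
        int cpu;

        seq_puts(seq, "cpu fast_path slow_path refill ring_empty waive\n");
        for_each_possible_cpu(cpu) {
                const struct page_pool_stats *s = per_cpu_ptr(&page_pool_stats, cpu);

                seq_printf(seq, "%d %llu %llu %llu %llu %llu\n", cpu,
                           (unsigned long long)s->fast_path_alloc,
                           (unsigned long long)s->slow_path_alloc,
                           (unsigned long long)s->cache_refill,
                           (unsigned long long)s->ring_empty,
                           (unsigned long long)s->waive);
        }
        return 0;
}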
On 2/2/2022 7:30 PM, Joe Damato wrote:
> On Wed, Feb 2, 2022 at 6:31 AM Jesper Dangaard Brouer
> <jbrouer@redhat.com> wrote:
>>
>>
>> Adding Cc. Tariq and Saeed, as they wanted page_pool stats in the past.
>>
>> On 02/02/2022 02.12, Joe Damato wrote:
>>> Greetings:
>>>
>>> Sending a v3 as I noted some issues with the procfs code in patch 10 I
>>> submit in v2 (thanks, kernel test robot) and fixing the placement of the
>>> refill stat increment in patch 8.
>>
>> Could you explain why a single global stats (/proc/net/page_pool_stat)
>> for all page_pool instances for all RX-queues makes sense?
>>
>> I think this argument/explanation belongs in the cover letter.
>
> I included an explanation in the v2 cover letter where those changes
> occurred, but you are right: I should have also included it in the v3
> cover letter.
>
> My thought process was this:
>
> - Stats now have to be enabled by an explicit kernel config option, so
> the user has to know what they are doing
> - Advanced users can move softirqs to CPUs as they wish and they could
> isolate a particular set of RX-queues on a set of CPUs this way
> - The result is that there is no need to expose anything to the
> drivers and no modifications to drivers are necessary once the single
> kernel config option is enabled and softirq affinity is configured
>
> I had assumed by not exposing new APIs / page pool internals and by
> not requiring drivers to make any changes, I would have a better shot
> of getting my patches accepted.
>
> It sounds like both you and Ilias strongly prefer per-pool-per-cpu
> stats, so I can make that change in the v4.
>
>> What are you using this for?
>
> I currently graph NIC driver stats from a number of different vendors
> to help better understand the performance of those NICs under my
> company's production workload.
>
> For example, on i40e, I submit changes to the upstream driver [1] and
> am graphing those stats to better understand memory reuse rate. We
> have seen some issues around mm allocation contention in production
> workloads with certain NICs and system architectures.
>
> My findings with mlx5 have indicated that the proprietary page reuse
> algorithm in the driver, with our workload, does not provide much
> memory re-use, and causes pressure against the kernel's page
> allocator. The page pool should help remedy this, but without stats I
> don't have a clear way to measure the effect.
>
> So in short: I'd like to gather and graph stats about the page pool
> API to determine how much impact the page pool API has on page reuse
> for mlx5 in our workload.
>

Hi Joe, Jesper, Ilias, and all,

We plan to totally remove the in-driver page-cache and fully rely on
page-pool for the allocations and dma mapping. This did not happen until
now as the page pool did not support elevated page refcount (multiple
frags per-page) and stats.

I'm happy to see that these are getting attention! Thanks for investing
time and effort to push these tasks forward!

>> And do Tariq and Saeeds agree with this single global stats approach?
>
> I don't know; I hope they'll chime in.
>

I agree with Jesper and Ilias. Global per-cpu pool stats are very
limited. There is not much we can do with the super-position of several
page-pools. IMO, these stats can be of real value only when each cpu has
a single pool. Otherwise, the summed stats of two or more pools won't
help much in observability, or debug.

Tariq

> As I mentioned above, I don't really mind which approach is preferred
> by you all. I had assumed that something with fewer external APIs
> would be more likely to be accepted, and so I made that change in v2.
>
>>> I only modified the placement of the refill stat, but decided to re-run the
>>> benchmarks used in the v2 [1], and the results are:
>>
>> I appreciate that you are running the benchmarks.
>
> Sure, no worries. As you mentioned in the other thread, perhaps some
> settings need to be adjusted to show more relevant data on faster
> systems.
>
> When I work on the v4, I will take a look at the benchmarks and
> explain any modifications made to them or their options when
> presenting the test results.
>
>>> Test system:
>
> [...]
>
> Thanks,
> Joe
>
> [1]: https://patchwork.ozlabs.org/project/intel-wired-lan/cover/1639769719-81285-1-git-send-email-jdamato@fastly.com/
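Tariq's per-pool point is straightforward to act on once the counters live in
the pool itself: a small helper can fold one pool's per-CPU copies into a
single summary for reporting. This continues the hypothetical per-pool sketch
from earlier and is not code from the series; racing with concurrent
increments is tolerable here because the counters are monotonic and only read
for reporting.

static void page_pool_fold_stats(const struct page_pool *pool,
                                 struct page_pool_alloc_stats *sum)
{
        int cpu;

        memset(sum, 0, sizeof(*sum));
        for_each_possible_cpu(cpu) {
                const struct page_pool_alloc_stats *s =
                        per_cpu_ptr(pool->alloc_stats, cpu);

                sum->fast            += s->fast;
                sum->slow            += s->slow;
                sum->slow_high_order += s->slow_high_order;
                sum->empty           += s->empty;
                sum->refill          += s->refill;
                sum->waive           += s->waive;
        }
}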
On Thu, Feb 3, 2022 at 11:21 AM Tariq Toukan <ttoukan.linux@gmail.com> wrote:
>
>
>
> On 2/2/2022 7:30 PM, Joe Damato wrote:
> > On Wed, Feb 2, 2022 at 6:31 AM Jesper Dangaard Brouer
> > <jbrouer@redhat.com> wrote:
> >>
> >>
> >> Adding Cc. Tariq and Saeed, as they wanted page_pool stats in the past.
> >>
> >> On 02/02/2022 02.12, Joe Damato wrote:
> >>> Greetings:
> >>>
> >>> Sending a v3 as I noted some issues with the procfs code in patch 10 I
> >>> submit in v2 (thanks, kernel test robot) and fixing the placement of the
> >>> refill stat increment in patch 8.
> >>
> >> Could you explain why a single global stats (/proc/net/page_pool_stat)
> >> for all page_pool instances for all RX-queues makes sense?
> >>
> >> I think this argument/explanation belongs in the cover letter.
> >
> > I included an explanation in the v2 cover letter where those changes
> > occurred, but you are right: I should have also included it in the v3
> > cover letter.
> >
> > My thought process was this:
> >
> > - Stats now have to be enabled by an explicit kernel config option, so
> > the user has to know what they are doing
> > - Advanced users can move softirqs to CPUs as they wish and they could
> > isolate a particular set of RX-queues on a set of CPUs this way
> > - The result is that there is no need to expose anything to the
> > drivers and no modifications to drivers are necessary once the single
> > kernel config option is enabled and softirq affinity is configured
> >
> > I had assumed by not exposing new APIs / page pool internals and by
> > not requiring drivers to make any changes, I would have a better shot
> > of getting my patches accepted.
> >
> > It sounds like both you and Ilias strongly prefer per-pool-per-cpu
> > stats, so I can make that change in the v4.
> >
> >> What are you using this for?
> >
> > I currently graph NIC driver stats from a number of different vendors
> > to help better understand the performance of those NICs under my
> > company's production workload.
> >
> > For example, on i40e, I submit changes to the upstream driver [1] and
> > am graphing those stats to better understand memory reuse rate. We
> > have seen some issues around mm allocation contention in production
> > workloads with certain NICs and system architectures.
> >
> > My findings with mlx5 have indicated that the proprietary page reuse
> > algorithm in the driver, with our workload, does not provide much
> > memory re-use, and causes pressure against the kernel's page
> > allocator. The page pool should help remedy this, but without stats I
> > don't have a clear way to measure the effect.
> >
> > So in short: I'd like to gather and graph stats about the page pool
> > API to determine how much impact the page pool API has on page reuse
> > for mlx5 in our workload.
> >
>
> Hi Joe, Jesper, Ilias, and all,
>
> We plan to totally remove the in-driver page-cache and fully rely on
> page-pool for the allocations and dma mapping. This did not happen until
> now as the page pool did not support elevated page refcount (multiple
> frags per-page) and stats.
>
> I'm happy to see that these are getting attention! Thanks for investing
> time and effort to push these tasks forward!
>
> >> And do Tariq and Saeeds agree with this single global stats approach?
> >
> > I don't know; I hope they'll chime in.
> >
>
> I agree with Jesper and Ilias. Global per-cpu pool stats are very
> limited. There is not much we can do with the super-position of several
> page-pools. IMO, these stats can be of real value only when each cpu has
> a single pool. Otherwise, the summed stats of two or more pools won't
> help much in observability, or debug.

OK thanks Tariq -- that makes sense to me.

I can propose a v4 that converts the stats to per-pool-per-cpu and
re-run the benchmarks, with the modification Jesper suggested to make
them run a bit longer.

I'm still thinking through what the best API design is for accessing
stats from the drivers, but I'll propose something and see what you all
think in the v4.

Thanks,
Joe
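One possible shape for the driver-facing read-out Joe is still thinking
through: a driver already implements get_ethtool_stats, so it could fold its
RX queues' pool counters into that output. Everything below (the foo_* names
and the private-struct layout) is invented for illustration and builds on the
fold helper sketched earlier; it is not an existing driver's code.

static void foo_get_ethtool_stats(struct net_device *dev,
                                  struct ethtool_stats *stats, u64 *data)
{
        struct foo_priv *priv = netdev_priv(dev);   /* hypothetical driver private data */
        struct page_pool_alloc_stats sum, total = {};
        int i;

        for (i = 0; i < priv->num_rx_queues; i++) {
                page_pool_fold_stats(priv->rxq[i].page_pool, &sum);
                total.fast   += sum.fast;
                total.slow   += sum.slow;
                total.refill += sum.refill;
                total.empty  += sum.empty;
                total.waive  += sum.waive;
        }

        /* Order must match the names reported by the driver's .get_strings. */
        *data++ = total.fast;
        *data++ = total.slow;
        *data++ = total.refill;
        *data++ = total.empty;
        *data++ = total.waive;
}

With something along these lines, `ethtool -S <iface>` would show per-pool
totals alongside the driver's existing counters.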