mbox series

[net-next,v3,00/10] page_pool: Add page_pool stat counters

Message ID 1643764336-63864-1-git-send-email-jdamato@fastly.com (mailing list archive)
Headers show
Series page_pool: Add page_pool stat counters | expand

Message

Joe Damato Feb. 2, 2022, 1:12 a.m. UTC
Greetings:

Sending a v3 as I noted some issues with the procfs code in patch 10 I
submit in v2 (thanks, kernel test robot) and fixing the placement of the
refill stat increment in patch 8.

I only modified the placement of the refill stat, but decided to re-run the
benchmarks used in the v2 [1], and the results are:

Test system:
	- 2x Intel(R) Xeon(R) Gold 6140 CPU @ 2.30GHz
	- 2 NUMA zones, with 18 cores per zone and 2 threads per core

bench_page_pool_simple results:
test name			stats enabled		stats disabled
				cycles	nanosec		cycles	nanosec

for_loop			0	0.335		0	0.334
atomic_inc 			13	6.028		13	6.035
lock				32	14.017		31	13.552

no-softirq-page_pool01		45	19.832		46	20.193
no-softirq-page_pool02		44	19.478		46	20.083
no-softirq-page_pool03		110	48.365		109	47.699

tasklet_page_pool01_fast_path	14	6.204		13	6.021
tasklet_page_pool02_ptr_ring	41	18.115		42	18.699
tasklet_page_pool03_slow	110	48.085		108	47.395

bench_page_pool_cross_cpu results:
test name			stats enabled		stats disabled
				cycles	nanosec		cycles	nanosec

page_pool_cross_cpu CPU(0)	2216	966.179		2101	915.692
page_pool_cross_cpu CPU(1)	2211	963.914		2159	941.087
page_pool_cross_cpu CPU(2)	1108	483.097		1079	470.573

page_pool_cross_cpu average	1845	-		1779	-

v2 -> v3:
	- patch 8/10 ("Add stat tracking cache refill") fixed placement of
	  counter increment.
	- patch 10/10 ("net-procfs: Show page pool stats in proc") updated:
		- fix unused label warning from kernel test robot,
		- fixed page_pool_seq_show to only display the refill stat
		  once,
		- added a remove_proc_entry for page_pool_stat to
		  dev_proc_net_exit.

v1 -> v2:
	- A new kernel config option has been added, which defaults to N,
	   preventing this code from being compiled in by default
	- The stats structure has been converted to a per-cpu structure
	- The stats are now exported via proc (/proc/net/page_pool_stat)

Thanks.

[1]:
https://lore.kernel.org/all/1643499540-8351-1-git-send-email-jdamato@fastly.com/T/#md82c6d5233e35bb518bc40c8fd7dff7a7a17e199

Joe Damato (10):
  page_pool: kconfig: Add flag for page pool stats
  page_pool: Add per-cpu page_pool_stats struct
  page_pool: Add a macro for incrementing stats
  page_pool: Add stat tracking fast path allocations
  page_pool: Add slow path order 0 allocation stat
  page_pool: Add slow path high order allocation stat
  page_pool: Add stat tracking empty ring
  page_pool: Add stat tracking cache refill
  page_pool: Add a stat tracking waived pages
  net-procfs: Show page pool stats in proc

 include/net/page_pool.h | 20 +++++++++++++++
 net/Kconfig             | 12 +++++++++
 net/core/net-procfs.c   | 67 +++++++++++++++++++++++++++++++++++++++++++++++++
 net/core/page_pool.c    | 28 ++++++++++++++++++---
 4 files changed, 124 insertions(+), 3 deletions(-)

Comments

Ilias Apalodimas Feb. 2, 2022, 2:29 p.m. UTC | #1
Hi Joe,

Again thanks for the patches!

On Wed, 2 Feb 2022 at 03:13, Joe Damato <jdamato@fastly.com> wrote:
>
> Greetings:
>
> Sending a v3 as I noted some issues with the procfs code in patch 10 I
> submit in v2 (thanks, kernel test robot) and fixing the placement of the
> refill stat increment in patch 8.
>
> I only modified the placement of the refill stat, but decided to re-run the
> benchmarks used in the v2 [1], and the results are:
>
> Test system:
>         - 2x Intel(R) Xeon(R) Gold 6140 CPU @ 2.30GHz
>         - 2 NUMA zones, with 18 cores per zone and 2 threads per core
>
> bench_page_pool_simple results:
> test name                       stats enabled           stats disabled
>                                 cycles  nanosec         cycles  nanosec
>
> for_loop                        0       0.335           0       0.334
> atomic_inc                      13      6.028           13      6.035
> lock                            32      14.017          31      13.552
>
> no-softirq-page_pool01          45      19.832          46      20.193
> no-softirq-page_pool02          44      19.478          46      20.083
> no-softirq-page_pool03          110     48.365          109     47.699
>
> tasklet_page_pool01_fast_path   14      6.204           13      6.021
> tasklet_page_pool02_ptr_ring    41      18.115          42      18.699
> tasklet_page_pool03_slow        110     48.085          108     47.395
>
> bench_page_pool_cross_cpu results:
> test name                       stats enabled           stats disabled
>                                 cycles  nanosec         cycles  nanosec
>
> page_pool_cross_cpu CPU(0)      2216    966.179         2101    915.692
> page_pool_cross_cpu CPU(1)      2211    963.914         2159    941.087
> page_pool_cross_cpu CPU(2)      1108    483.097         1079    470.573
>
> page_pool_cross_cpu average     1845    -               1779    -
>
> v2 -> v3:
>         - patch 8/10 ("Add stat tracking cache refill") fixed placement of
>           counter increment.
>         - patch 10/10 ("net-procfs: Show page pool stats in proc") updated:
>                 - fix unused label warning from kernel test robot,
>                 - fixed page_pool_seq_show to only display the refill stat
>                   once,
>                 - added a remove_proc_entry for page_pool_stat to
>                   dev_proc_net_exit.
>
> v1 -> v2:
>         - A new kernel config option has been added, which defaults to N,
>            preventing this code from being compiled in by default
>         - The stats structure has been converted to a per-cpu structure
>         - The stats are now exported via proc (/proc/net/page_pool_stat)
>

CC'ing Saeed since he is interested on page pool stats for mlx5.
I'd be much happier if we had per cpu per pool stats and a way to pick
them up via ethtool, instead of global page pool stats in /proc.
Anyone has an opinion on this?

[...]

Thanks!
/Ilias
Jesper Dangaard Brouer Feb. 2, 2022, 2:31 p.m. UTC | #2
Adding Cc. Tariq and Saeed, as they wanted page_pool stats in the past.

On 02/02/2022 02.12, Joe Damato wrote:
> Greetings:
> 
> Sending a v3 as I noted some issues with the procfs code in patch 10 I
> submit in v2 (thanks, kernel test robot) and fixing the placement of the
> refill stat increment in patch 8.

Could you explain why a single global stats (/proc/net/page_pool_stat) 
for all page_pool instances for all RX-queues makes sense?

I think this argument/explanation belongs in the cover letter.

What are you using this for?

And do Tariq and Saeeds agree with this single global stats approach?


> I only modified the placement of the refill stat, but decided to re-run the
> benchmarks used in the v2 [1], and the results are:

I appreciate that you are running the benchmarks.

> Test system:
> 	- 2x Intel(R) Xeon(R) Gold 6140 CPU @ 2.30GHz
> 	- 2 NUMA zones, with 18 cores per zone and 2 threads per core
> 
> bench_page_pool_simple results:
> test name			stats enabled		stats disabled
> 				cycles	nanosec		cycles	nanosec
> 
> for_loop			0	0.335		0	0.334

I think you can drop the 'for_loop' results, we can see that the 
overhead is insignificant.

> atomic_inc 			13	6.028		13	6.035
> lock				32	14.017		31	13.552
> 
> no-softirq-page_pool01		45	19.832		46	20.193
> no-softirq-page_pool02		44	19.478		46	20.083
> no-softirq-page_pool03		110	48.365		109	47.699
> 
> tasklet_page_pool01_fast_path	14	6.204		13	6.021
> tasklet_page_pool02_ptr_ring	41	18.115		42	18.699
> tasklet_page_pool03_slow	110	48.085		108	47.395
> 
> bench_page_pool_cross_cpu results:
> test name			stats enabled		stats disabled
> 				cycles	nanosec		cycles	nanosec
> 
> page_pool_cross_cpu CPU(0)	2216	966.179		2101	915.692
> page_pool_cross_cpu CPU(1)	2211	963.914		2159	941.087
> page_pool_cross_cpu CPU(2)	1108	483.097		1079	470.573
> 
> page_pool_cross_cpu average	1845	-		1779	-
> 
> v2 -> v3:
> 	- patch 8/10 ("Add stat tracking cache refill") fixed placement of
> 	  counter increment.
> 	- patch 10/10 ("net-procfs: Show page pool stats in proc") updated:
> 		- fix unused label warning from kernel test robot,
> 		- fixed page_pool_seq_show to only display the refill stat
> 		  once,
> 		- added a remove_proc_entry for page_pool_stat to
> 		  dev_proc_net_exit.
> 
> v1 -> v2:
> 	- A new kernel config option has been added, which defaults to N,
> 	   preventing this code from being compiled in by default
> 	- The stats structure has been converted to a per-cpu structure
> 	- The stats are now exported via proc (/proc/net/page_pool_stat)
> 
> Thanks.
> 
> [1]:
> https://lore.kernel.org/all/1643499540-8351-1-git-send-email-jdamato@fastly.com/T/#md82c6d5233e35bb518bc40c8fd7dff7a7a17e199
> 
> Joe Damato (10):
>    page_pool: kconfig: Add flag for page pool stats
>    page_pool: Add per-cpu page_pool_stats struct
>    page_pool: Add a macro for incrementing stats
>    page_pool: Add stat tracking fast path allocations
>    page_pool: Add slow path order 0 allocation stat
>    page_pool: Add slow path high order allocation stat
>    page_pool: Add stat tracking empty ring
>    page_pool: Add stat tracking cache refill
>    page_pool: Add a stat tracking waived pages
>    net-procfs: Show page pool stats in proc
> 
>   include/net/page_pool.h | 20 +++++++++++++++
>   net/Kconfig             | 12 +++++++++
>   net/core/net-procfs.c   | 67 +++++++++++++++++++++++++++++++++++++++++++++++++
>   net/core/page_pool.c    | 28 ++++++++++++++++++---
>   4 files changed, 124 insertions(+), 3 deletions(-)
>
Joe Damato Feb. 2, 2022, 5:30 p.m. UTC | #3
On Wed, Feb 2, 2022 at 6:31 AM Jesper Dangaard Brouer
<jbrouer@redhat.com> wrote:
>
>
> Adding Cc. Tariq and Saeed, as they wanted page_pool stats in the past.
>
> On 02/02/2022 02.12, Joe Damato wrote:
> > Greetings:
> >
> > Sending a v3 as I noted some issues with the procfs code in patch 10 I
> > submit in v2 (thanks, kernel test robot) and fixing the placement of the
> > refill stat increment in patch 8.
>
> Could you explain why a single global stats (/proc/net/page_pool_stat)
> for all page_pool instances for all RX-queues makes sense?
>
> I think this argument/explanation belongs in the cover letter.

I included an explanation in the v2 cover letter where those changes
occurred, but you are right: I should have also included it in the v3
cover letter.

My thought process was this:

- Stats now have to be enabled by an explicit kernel config option, so
the user has to know what they are doing
- Advanced users can move softirqs to CPUs as they wish and they could
isolate a particular set of RX-queues on a set of CPUs this way
- The result is that there is no need to expose anything to the
drivers and no modifications to drivers are necessary once the single
kernel config option is enabled and softirq affinity is configured

I had assumed by not exposing new APIs / page pool internals and by
not requiring drivers to make any changes, I would have a better shot
of getting my patches accepted.

It sounds like both you and Ilias strongly prefer per-pool-per-cpu
stats, so I can make that change in the v4.

> What are you using this for?

I currently graph NIC driver stats from a number of different vendors
to help better understand the performance of those NICs under my
company's production workload.

For example, on i40e, I submit changes to the upstream driver [1] and
am graphing those stats to better understand memory reuse rate. We
have seen some issues around mm allocation contention in production
workloads with certain NICs and system architectures.

My findings with mlx5 have indicated that the proprietary page reuse
algorithm in the driver, with our workload, does not provide much
memory re-use, and causes pressure against the kernel's page
allocator.  The page pool should help remedy this, but without stats I
don't have a clear way to measure the effect.

So in short: I'd like to gather and graph stats about the page pool
API to determine how much impact the page pool API has on page reuse
for mlx5 in our workload.

> And do Tariq and Saeeds agree with this single global stats approach?

I don't know; I hope they'll chime in.

As I mentioned above, I don't really mind which approach is preferred
by you all. I had assumed that something with fewer external APIs
would be more likely to be accepted, and so I made that change in v2.

> > I only modified the placement of the refill stat, but decided to re-run the
> > benchmarks used in the v2 [1], and the results are:
>
> I appreciate that you are running the benchmarks.

Sure, no worries. As you mentioned in the other thread, perhaps some
settings need to be adjusted to show more relevant data on faster
systems.

When I work on the v4, I will take a look at the benchmarks and
explain any modifications made to them or their options when
presenting the test results.

> > Test system:

[...]

Thanks,
Joe

[1]: https://patchwork.ozlabs.org/project/intel-wired-lan/cover/1639769719-81285-1-git-send-email-jdamato@fastly.com/
Tariq Toukan Feb. 3, 2022, 7:21 p.m. UTC | #4
On 2/2/2022 7:30 PM, Joe Damato wrote:
> On Wed, Feb 2, 2022 at 6:31 AM Jesper Dangaard Brouer
> <jbrouer@redhat.com> wrote:
>>
>>
>> Adding Cc. Tariq and Saeed, as they wanted page_pool stats in the past.
>>
>> On 02/02/2022 02.12, Joe Damato wrote:
>>> Greetings:
>>>
>>> Sending a v3 as I noted some issues with the procfs code in patch 10 I
>>> submit in v2 (thanks, kernel test robot) and fixing the placement of the
>>> refill stat increment in patch 8.
>>
>> Could you explain why a single global stats (/proc/net/page_pool_stat)
>> for all page_pool instances for all RX-queues makes sense?
>>
>> I think this argument/explanation belongs in the cover letter.
> 
> I included an explanation in the v2 cover letter where those changes
> occurred, but you are right: I should have also included it in the v3
> cover letter.
> 
> My thought process was this:
> 
> - Stats now have to be enabled by an explicit kernel config option, so
> the user has to know what they are doing
> - Advanced users can move softirqs to CPUs as they wish and they could
> isolate a particular set of RX-queues on a set of CPUs this way
> - The result is that there is no need to expose anything to the
> drivers and no modifications to drivers are necessary once the single
> kernel config option is enabled and softirq affinity is configured
> 
> I had assumed by not exposing new APIs / page pool internals and by
> not requiring drivers to make any changes, I would have a better shot
> of getting my patches accepted.
> 
> It sounds like both you and Ilias strongly prefer per-pool-per-cpu
> stats, so I can make that change in the v4.
> 
>> What are you using this for?
> 
> I currently graph NIC driver stats from a number of different vendors
> to help better understand the performance of those NICs under my
> company's production workload.
> 
> For example, on i40e, I submit changes to the upstream driver [1] and
> am graphing those stats to better understand memory reuse rate. We
> have seen some issues around mm allocation contention in production
> workloads with certain NICs and system architectures.
> 
> My findings with mlx5 have indicated that the proprietary page reuse
> algorithm in the driver, with our workload, does not provide much
> memory re-use, and causes pressure against the kernel's page
> allocator.  The page pool should help remedy this, but without stats I
> don't have a clear way to measure the effect.
> 
> So in short: I'd like to gather and graph stats about the page pool
> API to determine how much impact the page pool API has on page reuse
> for mlx5 in our workload.
> 
Hi Joe, Jesper, Ilias, and all,

We plan to totally remove the in-driver page-cache and fully rely on 
page-pool for the allocations and dma mapping. This did not happen until 
now as the page pool did not support elevated page refcount (multiple 
frags per-page) and stats.

I'm happy to see that these are getting attention! Thanks for investing 
time and effort to push these tasks forward!

>> And do Tariq and Saeeds agree with this single global stats approach?
> 
> I don't know; I hope they'll chime in.
> 

I agree with Jesper and Ilias. Global per-cpu pool stats are very 
limited. There is not much we can do with the super-position of several 
page-pools. IMO, these stats can be of real value only when each cpu has 
a single pool. Otherwise, the summed stats of two or more pools won't 
help much in observability, or debug.

Tariq

> As I mentioned above, I don't really mind which approach is preferred
> by you all. I had assumed that something with fewer external APIs
> would be more likely to be accepted, and so I made that change in v2.
> 
>>> I only modified the placement of the refill stat, but decided to re-run the
>>> benchmarks used in the v2 [1], and the results are:
>>
>> I appreciate that you are running the benchmarks.
> 
> Sure, no worries. As you mentioned in the other thread, perhaps some
> settings need to be adjusted to show more relevant data on faster
> systems.
> 
> When I work on the v4, I will take a look at the benchmarks and
> explain any modifications made to them or their options when
> presenting the test results.
> 
>>> Test system:
> 
> [...]
> 
> Thanks,
> Joe
> 
> [1]: https://patchwork.ozlabs.org/project/intel-wired-lan/cover/1639769719-81285-1-git-send-email-jdamato@fastly.com/
Joe Damato Feb. 3, 2022, 7:31 p.m. UTC | #5
On Thu, Feb 3, 2022 at 11:21 AM Tariq Toukan <ttoukan.linux@gmail.com> wrote:
>
>
>
> On 2/2/2022 7:30 PM, Joe Damato wrote:
> > On Wed, Feb 2, 2022 at 6:31 AM Jesper Dangaard Brouer
> > <jbrouer@redhat.com> wrote:
> >>
> >>
> >> Adding Cc. Tariq and Saeed, as they wanted page_pool stats in the past.
> >>
> >> On 02/02/2022 02.12, Joe Damato wrote:
> >>> Greetings:
> >>>
> >>> Sending a v3 as I noted some issues with the procfs code in patch 10 I
> >>> submit in v2 (thanks, kernel test robot) and fixing the placement of the
> >>> refill stat increment in patch 8.
> >>
> >> Could you explain why a single global stats (/proc/net/page_pool_stat)
> >> for all page_pool instances for all RX-queues makes sense?
> >>
> >> I think this argument/explanation belongs in the cover letter.
> >
> > I included an explanation in the v2 cover letter where those changes
> > occurred, but you are right: I should have also included it in the v3
> > cover letter.
> >
> > My thought process was this:
> >
> > - Stats now have to be enabled by an explicit kernel config option, so
> > the user has to know what they are doing
> > - Advanced users can move softirqs to CPUs as they wish and they could
> > isolate a particular set of RX-queues on a set of CPUs this way
> > - The result is that there is no need to expose anything to the
> > drivers and no modifications to drivers are necessary once the single
> > kernel config option is enabled and softirq affinity is configured
> >
> > I had assumed by not exposing new APIs / page pool internals and by
> > not requiring drivers to make any changes, I would have a better shot
> > of getting my patches accepted.
> >
> > It sounds like both you and Ilias strongly prefer per-pool-per-cpu
> > stats, so I can make that change in the v4.
> >
> >> What are you using this for?
> >
> > I currently graph NIC driver stats from a number of different vendors
> > to help better understand the performance of those NICs under my
> > company's production workload.
> >
> > For example, on i40e, I submit changes to the upstream driver [1] and
> > am graphing those stats to better understand memory reuse rate. We
> > have seen some issues around mm allocation contention in production
> > workloads with certain NICs and system architectures.
> >
> > My findings with mlx5 have indicated that the proprietary page reuse
> > algorithm in the driver, with our workload, does not provide much
> > memory re-use, and causes pressure against the kernel's page
> > allocator.  The page pool should help remedy this, but without stats I
> > don't have a clear way to measure the effect.
> >
> > So in short: I'd like to gather and graph stats about the page pool
> > API to determine how much impact the page pool API has on page reuse
> > for mlx5 in our workload.
> >
> Hi Joe, Jesper, Ilias, and all,
>
> We plan to totally remove the in-driver page-cache and fully rely on
> page-pool for the allocations and dma mapping. This did not happen until
> now as the page pool did not support elevated page refcount (multiple
> frags per-page) and stats.
>
> I'm happy to see that these are getting attention! Thanks for investing
> time and effort to push these tasks forward!
>
> >> And do Tariq and Saeeds agree with this single global stats approach?
> >
> > I don't know; I hope they'll chime in.
> >
>
> I agree with Jesper and Ilias. Global per-cpu pool stats are very
> limited. There is not much we can do with the super-position of several
> page-pools. IMO, these stats can be of real value only when each cpu has
> a single pool. Otherwise, the summed stats of two or more pools won't
> help much in observability, or debug.

OK thanks Tariq -- that makes sense to me.

I can propose a v4 that converts the stats to per-pool-per-cpu and
re-run the benchmarks, with the modification Jesper suggested to make
them run a bit longer.

I'm still thinking through what the best API design is for accessing
stats from the drivers, but I'll propose something and see what you
all think in the v4.

Thanks,
Joe