mbox series

[net-next,v6,0/2] page_pool: Add page pool stats

Message ID 1645574424-60857-1-git-send-email-jdamato@fastly.com (mailing list archive)
Headers show
Series page_pool: Add page pool stats | expand

Message

Joe Damato Feb. 23, 2022, midnight UTC
Greetings:

Welcome to v6.

As a reminder: this patch series adds basic per-cpu per-page pool stats
tracking, if enabled at compile time. An API is provided,
page_pool_get_stats, which fills a caller provided page_pool_stats struct
with the totaled stats for the specified page pool. The intention is that
drivers can call this API, if they choose, and export these statistics to
users via ethtool/debugfs/etc. Users can then examine and monitor this
information to get a better sense of how their RX queues, page pools, and
the kernel page allocator are interacting with each other.

This revision contains a single change from v5 based on Jesper's feedback:
the struct page_pool_stats pointer in struct page_pool is now marked as
____cacheline_aligned_in_smp. Its position is unchanged: it remains the
last field in the page_pool struct, but is now located on its own
cache-line. Other locations are possible, such as: sharing the cache-line
with xdp_mem_id, but I didn't hear back on placement preference so I went
with a new cache-line for the page_pool_stats pointer.

With this series applied, but with stats disabled, pahole reports the
following data about struct page_pool:

/* size: 1600, cachelines: 25, members: 15 */
/* sum members: 1436, holes: 2, sum holes: 116 */
/* padding: 48 */

With this series applied, but with stats enabled, pahole shows that the
stats pointer is on its own cache-line:

/* --- cacheline 25 boundary (1600 bytes) --- */
struct page_pool_stats *   stats;                /*  1600     8 */

And, the page_pool struct has grown in size slightly (as expected):

/* size: 1664, cachelines: 26, members: 16 */
/* sum members: 1444, holes: 3, sum holes: 164 */
/* padding: 56 */

I re-ran bench_page_pool_simple and bench_page_pool_cross_cpu with the same
arguments as previous revisions (see below). I also ran
bench_page_pool_cross_cpu a second time, but with 8 returning cpus. Links
to the raw output with this series applied with stats off [1] and stats on
[2] are included for examination. Note that the benchmarks were slightly
modified [3] to produce more verbose output, but no functional changes were
made.

Back to back runs of the same benchmark reveal variability in the results.
The runs below show that the kernel with the extra code enabled is slightly
faster than with a kernel built with this code disabled. Re-running the
benchmarks again on each kernel can (and has) shown different results;
including results which show that disabling stats is slightly faster.

This code is defaulted to disabled unless explicitly enabled
by a user in their kernel configuration. Any additional resource
consumption this code generates is limited to users who have explicitly
opted-in.

Jesper: my apologies if I'm missing something obvious with running the
benchmarks, but please let me know if there is anything else you'd like me
to provide. Also, feel free to let me know if you think this code is not
desired at all -- in which case, I'll stop sending updated revisions :)

Test system:
	- 2x Intel(R) Xeon(R) Gold 6140 CPU @ 2.30GHz
	- 2 NUMA zones, with 18 cores per zone and 2 threads per core

bench_page_pool_simple results, loops=200000000
test name			stats enabled		stats disabled
				cycles	nanosec		cycles	nanosec

for_loop			0	0.335		0	0.336
atomic_inc 			13	6.051		13	6.071
lock				31	13.938		32	13.985

no-softirq-page_pool01		75	32.856		74	32.305
no-softirq-page_pool02		75	32.894		74	32.343
no-softirq-page_pool03		107	46.651		107	46.710

tasklet_page_pool01_fast_path	14	6.386		13	5.829
tasklet_page_pool02_ptr_ring	41	17.998		39	17.390
tasklet_page_pool03_slow	107	47.040		107	46.830

bench_page_pool_cross_cpu results, loops=20000000 returning_cpus=4:
test name			stats enabled		stats disabled
				cycles	nanosec		cycles	nanosec

page_pool_cross_cpu CPU(0)	4050	1765.133	4090	1782.700
page_pool_cross_cpu CPU(1)	4043	1762.174	4090	1782.636
page_pool_cross_cpu CPU(2)	4050	1765.171	4090	1782.762
page_pool_cross_cpu CPU(3)	4050	1765.074	4087	1781.393
page_pool_cross_cpu CPU(4)	1012	441.293		1022	445.691

page_pool_cross_cpu average	3441	-		3475	-

bench_page_pool_cross_cpu results, loops=20000000 returning_cpus=8:
test name			stats enabled		stats disabled
				cycles	nanosec		cycles	nanosec

page_pool_cross_cpu CPU(0)	7458	3250.426	7531	3282.485
page_pool_cross_cpu CPU(1)	7473	3256.957	7531	3282.409
page_pool_cross_cpu CPU(2)	7459	3251.038	7532	3282.635
page_pool_cross_cpu CPU(3)	7473	3257.048	7530	3282.006
page_pool_cross_cpu CPU(4)	7460	3251.341	7531	3282.372
page_pool_cross_cpu CPU(5)	7474	3257.719	7532	3282.831
page_pool_cross_cpu CPU(6)	7462	3252.377	7531	3282.369
page_pool_cross_cpu CPU(7)	7478	3259.245	7532	3282.799
page_pool_cross_cpu CPU(8)	934	407.406		941	410.354

page_pool_cross_cpu average	6741	-		6799	-

Thanks.

[1]:
https://gist.githubusercontent.com/jdamato-fsly/c86d9bb4144fa12dbab41126772b167e/raw/d1fcf995bbc3f44d1362233fab1c18365922c47f/stats_disabled
[2]:
https://gist.githubusercontent.com/jdamato-fsly/c86d9bb4144fa12dbab41126772b167e/raw/d1fcf995bbc3f44d1362233fab1c18365922c47f/stats_enabled
[3]:
https://gist.githubusercontent.com/jdamato-fsly/26ec72238522074515d47011434d054f/raw/0774a02d29e33cd5876dbe240e23c7a118b743da/benchmark.patch

v5 -> v6:
	- Per cpu page_pool_stats struct pointer is now marked as
	  ____cacheline_aligned_in_smp. Placement of the field in the
	  struct is unchanged; it is the last field.

v4 -> v5:
	- Fixed the description of the kernel option in Kconfig.
	- Squashed commits 1-10 from v4 into a single commit for easier
	  review.
	- Changed the comment style of the comment for
	  the this_cpu_inc_alloc_stat macro.
	- Changed the return type of page_pool_get_stats from struct
	  page_pool_stat * to bool.

v3 -> v4:
	- Restructured stats to be per-cpu per-pool.
	- Global stats and proc file were removed.
	- Exposed an API (page_pool_get_stats) for batching the pool stats.

v2 -> v3:
	- patch 8/10 ("Add stat tracking cache refill") fixed placement of
	  counter increment.
	- patch 10/10 ("net-procfs: Show page pool stats in proc") updated:
		- fix unused label warning from kernel test robot,
		- fixed page_pool_seq_show to only display the refill stat
		  once,
		- added a remove_proc_entry for page_pool_stat to
		  dev_proc_net_exit.

v1 -> v2:
	- A new kernel config option has been added, which defaults to N,
	   preventing this code from being compiled in by default
	- The stats structure has been converted to a per-cpu structure
	- The stats are now exported via proc (/proc/net/page_pool_stat)
Joe Damato (2):
  page_pool: Add page_pool stats
  page_pool: Add function to batch and return stats

 include/net/page_pool.h | 27 +++++++++++++++++++++
 net/Kconfig             | 13 +++++++++++
 net/core/page_pool.c    | 62 +++++++++++++++++++++++++++++++++++++++++++++----
 3 files changed, 98 insertions(+), 4 deletions(-)