| Message ID | 20240625120807.1165581-1-amcohen@nvidia.com (mailing list archive) |
|---|---|
| Series | Adjust page pool netlink filling to non common case |
On Tue, 25 Jun 2024 15:08:03 +0300 Amit Cohen wrote:
> Most network drivers have a 1:1 mapping between netdevice and event
> queues, so each page pool is used by only one netdevice. This is not the
> case in the mlxsw driver.
>
> Currently, the netlink message is filled with 'pool->slow.netdev->ifindex',
> which should be NULL when several netdevices use the same pool. Adjust
> page pool netlink filling to use the netdevice in whose list the pool is
> stored. See more info in the commit messages.
>
> Without this set, the mlxsw driver cannot dump all page pools:
> $ ./tools/net/ynl/cli.py --spec Documentation/netlink/specs/netdev.yaml \
>     --dump page-pool-stats-get --output-json | jq
> []
>
> With this set, the "dump" command prints all the page pools for all the
> netdevices:
> $ ./tools/net/ynl/cli.py --spec Documentation/netlink/specs/netdev.yaml \
>     --dump page-pool-get --output-json | \
>     jq -e ".[] | select(.ifindex == 64)" | grep "napi-id" | wc -l
> 56
>
> From the driver's POV, such queries are supported by associating the pools
> with an unregistered netdevice (a dummy netdevice). This implementation has
> the following limitations:
> 1. The get command output specifies the 'ifindex' as 0, which is
>    meaningless. `iproute2` will print this as "*", but there might be
>    other tools which fail in such a case.
> 2. The get command does not work when the devlink instance is reloaded to
>    a namespace other than the initial one, as the dummy device associated
>    with the pools belongs to the initial namespace.
> See examples in the commit messages.
>
> We would like to expose page pool stats and info via the standard
> interface, but such an implementation is not perfect. An additional option
> is to use debugfs, but we prefer to avoid it if possible. Any suggestions
> for a better implementation in the case of a pool shared by several
> netdevices are welcome.

If I read the code correctly you dump all page pools for all port
netdevs? The primary use for page pool stats right now is to measure how
much memory netdevs have gobbled up. You can't duplicate entries, because
user space may double count the memory...

How about we instead add a net pointer and have the page pools listed
under loopback from the start? That's the best we can do, I reckon. Or
just go with debugfs / ethtool -S; the standard interface is for things
which are standard. If the device doesn't work in a standard way, there's
no need to shoehorn it in. This series feels a bit like checkbox
engineering to me, if I'm completely honest..
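For context on the double-counting concern: the page-pool-get dump reports per-pool memory (the inflight-mem attribute in the netdev.yaml spec), so a naive per-netdev aggregation along the lines of the sketch below would count a shared pool once for every port it is listed under. This is an illustrative sketch only; it assumes the dump emits the inflight-mem attribute under that name for each pool.

$ ./tools/net/ynl/cli.py --spec Documentation/netlink/specs/netdev.yaml \
      --dump page-pool-get --output-json | \
      jq 'group_by(.ifindex)
          | map({ifindex: .[0].ifindex,
                 pools: length,
                 inflight_mem: ([.[]."inflight-mem" // 0] | add)})'

With the same pools duplicated across all port netdevs, such per-ifindex totals would add up to more memory than the device actually holds.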
> -----Original Message-----
> From: Jakub Kicinski <kuba@kernel.org>
> Sent: Tuesday, 25 June 2024 17:35
> To: Amit Cohen <amcohen@nvidia.com>
> Cc: davem@davemloft.net; edumazet@google.com; pabeni@redhat.com; hawk@kernel.org; Ido Schimmel <idosch@nvidia.com>; Petr Machata <petrm@nvidia.com>; mlxsw <mlxsw@nvidia.com>; netdev@vger.kernel.org
> Subject: Re: [PATCH RFC net-next 0/4] Adjust page pool netlink filling to non common case
>
> On Tue, 25 Jun 2024 15:08:03 +0300 Amit Cohen wrote:
> > Most network drivers have a 1:1 mapping between netdevice and event
> > queues, so each page pool is used by only one netdevice. This is not
> > the case in the mlxsw driver.
> >
> > Currently, the netlink message is filled with 'pool->slow.netdev->ifindex',
> > which should be NULL when several netdevices use the same pool. Adjust
> > page pool netlink filling to use the netdevice in whose list the pool
> > is stored. See more info in the commit messages.
> >
> > Without this set, the mlxsw driver cannot dump all page pools:
> > $ ./tools/net/ynl/cli.py --spec Documentation/netlink/specs/netdev.yaml \
> >     --dump page-pool-stats-get --output-json | jq
> > []
> >
> > With this set, the "dump" command prints all the page pools for all
> > the netdevices:
> > $ ./tools/net/ynl/cli.py --spec Documentation/netlink/specs/netdev.yaml \
> >     --dump page-pool-get --output-json | \
> >     jq -e ".[] | select(.ifindex == 64)" | grep "napi-id" | wc -l
> > 56
> >
> > From the driver's POV, such queries are supported by associating the
> > pools with an unregistered netdevice (a dummy netdevice). This
> > implementation has the following limitations:
> > 1. The get command output specifies the 'ifindex' as 0, which is
> >    meaningless. `iproute2` will print this as "*", but there might be
> >    other tools which fail in such a case.
> > 2. The get command does not work when the devlink instance is reloaded
> >    to a namespace other than the initial one, as the dummy device
> >    associated with the pools belongs to the initial namespace.
> > See examples in the commit messages.
> >
> > We would like to expose page pool stats and info via the standard
> > interface, but such an implementation is not perfect. An additional
> > option is to use debugfs, but we prefer to avoid it if possible. Any
> > suggestions for a better implementation in the case of a pool shared
> > by several netdevices are welcome.
>
> If I read the code correctly you dump all page pools for all port
> netdevs?

Yes.

> The primary use for page pool stats right now is to measure how much
> memory netdevs have gobbled up. You can't duplicate entries, because
> user space may double count the memory...
>
> How about we instead add a net pointer and have the page pools listed
> under loopback from the start? That's the best we can do, I reckon. Or
> just go with debugfs / ethtool -S; the standard interface is for things
> which are standard. If the device doesn't work in a standard way,
> there's no need to shoehorn it in. This series feels a bit like checkbox
> engineering to me, if I'm completely honest..

Thanks for your feedback, I sent it to consult whether you have better
suggestions for this case.

I agree that if the device is not standard, we should not use the
standard interface when the solution is not perfect. I think that for
now we won't expose statistics; if we find it necessary, we will
probably use debugfs for that.

Thanks!
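As a rough comparison, under Jakub's alternative of parenting the shared pools to the loopback device, the same dump would still expose them; a consumer could pick them out by filtering on loopback's ifindex (assumed here to be 1, as on a default setup) instead of on a port netdev. A sketch:

$ ./tools/net/ynl/cli.py --spec Documentation/netlink/specs/netdev.yaml \
      --dump page-pool-get --output-json | \
      jq '[.[] | select(.ifindex == 1)] | length'

Each pool would then appear exactly once, so memory totals are not inflated, at the cost of the pools no longer being discoverable through the port netdevs themselves.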