
[RFC,net-next,1/3] netdev: add per-queue statistics

Message ID 20240222223629.158254-2-kuba@kernel.org (mailing list archive)
State Superseded
Delegated to: Netdev Maintainers

Checks

Context                         Check    Description
netdev/series_format            success  Posting correctly formatted
netdev/tree_selection           success  Clearly marked for net-next, async
netdev/ynl                      success  Generated files up to date; no warnings/errors; GEN HAS DIFF 2 files changed, 189 insertions(+);
netdev/fixes_present            success  Fixes tag not required for -next series
netdev/header_inline            success  No static functions without inline keyword in header files
netdev/build_32bit              success  Errors and warnings before: 5046 this patch: 5046
netdev/build_tools              success  Errors and warnings before: 0 this patch: 0
netdev/cc_maintainers           warning  3 maintainers not CCed: linux-doc@vger.kernel.org sridhar.samudrala@intel.com corbet@lwn.net
netdev/build_clang              success  Errors and warnings before: 1071 this patch: 1071
netdev/verify_signedoff         success  Signed-off-by tag matches author and committer
netdev/deprecated_api           success  None detected
netdev/check_selftest           success  No net selftest shell script
netdev/verify_fixes             success  No Fixes tag
netdev/build_allmodconfig_warn  success  Errors and warnings before: 5348 this patch: 5348
netdev/checkpatch               warning  WARNING: line length of 92 exceeds 80 columns
netdev/build_clang_rust         success  No Rust files in patch. Skipping build
netdev/kdoc                     success  Errors and warnings before: 0 this patch: 0
netdev/source_inline            success  Was 0 now: 0

Commit Message

Jakub Kicinski Feb. 22, 2024, 10:36 p.m. UTC
The ethtool-nl family does a good job exposing various protocol
related and IEEE/IETF statistics which used to get dumped under
ethtool -S, with creative names. Queue stats don't have a netlink
API yet, and remain the lion's share of ethtool -S output for new
drivers. Not only is that bad because the names differ from driver
to driver, but it's also bug-prone. Intuitively, drivers try to report
only the stats for active queues, but querying ethtool stats
involves multiple system calls, and the number of stats is
read separately from the stats themselves. Worse still, when user
space asks for the values of the stats, it doesn't tell the kernel
how big the buffer is. If the number of stats increases in the
meantime, the kernel will overflow the user buffer.

Add a netlink API for dumping queue stats. Queue information is
exposed via the netdev-genl family, so add the stats there.
Support per-queue and sum-for-device dumps. The latter will be
useful when subsequent patches add more interesting common stats
than just bytes and packets.

The API does not currently distinguish between HW and SW stats.
The expectation is that the source of the stats will either not
matter much (good packets) or be obvious (skb alloc errors).

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
---
 Documentation/netlink/specs/netdev.yaml |  84 +++++++++
 Documentation/networking/statistics.rst |  17 +-
 include/linux/netdevice.h               |   3 +
 include/net/netdev_queues.h             |  54 ++++++
 include/uapi/linux/netdev.h             |  20 +++
 net/core/netdev-genl-gen.c              |  12 ++
 net/core/netdev-genl-gen.h              |   2 +
 net/core/netdev-genl.c                  | 218 ++++++++++++++++++++++++
 tools/include/uapi/linux/netdev.h       |  20 +++
 9 files changed, 429 insertions(+), 1 deletion(-)
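
The sum-for-device dump depends on a "not set" sentinel: every counter
starts as all-ones, the driver assigns only what it collects, and a
counter left unset by get_base_stats never shows up in the device total.
Below is a minimal user-space sketch of that rule, modelled on
netdev_nl_stats_add() and netdev_nl_stats_by_netdev() from this patch.
The struct name queue_stats, the demo_* callbacks and their values are
hypothetical and exist only for illustration; this is not the kernel code.

```c
#include <stdint.h>
#include <string.h>

#define STAT_NOT_SET (~0ULL)	/* mirrors NETDEV_STAT_NOT_SET in the patch */

/* Same layout as struct netdev_queue_stats_rx in the patch. */
struct queue_stats {
	uint64_t bytes;
	uint64_t packets;
};

/* Mark every counter as "not reported", as memset(..., 0xff, ...) does. */
static void stats_init(struct queue_stats *s)
{
	memset(s, 0xff, sizeof(*s));
}

/* Add one queue's counters to the total, skipping anything left unset. */
static void stats_add(struct queue_stats *sum, const struct queue_stats *add)
{
	if (add->bytes != STAT_NOT_SET && sum->bytes != STAT_NOT_SET)
		sum->bytes += add->bytes;
	if (add->packets != STAT_NOT_SET && sum->packets != STAT_NOT_SET)
		sum->packets += add->packets;
}

/* Hypothetical driver: keeps device history for packets, not for bytes. */
static void demo_get_base_stats(struct queue_stats *base)
{
	base->packets = 100;	/* events from queues that no longer exist */
	/* base->bytes stays unset: no reliable device-wide total */
}

/* Hypothetical driver: every live queue reports both counters. */
static void demo_get_queue_stats(int idx, struct queue_stats *q)
{
	q->packets = 10 * (uint64_t)(idx + 1);
	q->bytes = 1000 * (uint64_t)(idx + 1);
}

/* Device-level total over nqueues live queues. */
static struct queue_stats device_total(int nqueues)
{
	struct queue_stats sum, q;
	int i;

	stats_init(&sum);
	demo_get_base_stats(&sum);	/* base first, then the live queues */
	for (i = 0; i < nqueues; i++) {
		stats_init(&q);
		demo_get_queue_stats(i, &q);
		stats_add(&sum, &q);
	}
	return sum;
}
```

With three live queues, device_total(3) yields packets == 160 (100 of
history plus 10 + 20 + 30), while bytes stays at STAT_NOT_SET and would
be skipped when the reply is written, even though every queue reported
bytes.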

Comments

Nambiar, Amritha Feb. 23, 2024, 12:23 a.m. UTC | #1
On 2/22/2024 2:36 PM, Jakub Kicinski wrote:
> The ethtool-nl family does a good job exposing various protocol
> related and IEEE/IETF statistics which used to get dumped under
> ethtool -S, with creative names. Queue stats don't have a netlink
> API, yet, and remain a lion's share of ethtool -S output for new
> drivers. Not only is that bad because the names differ driver to
> driver but it's also bug-prone. Intuitively drivers try to report
> only the stats for active queues, but querying ethtool stats
> involves multiple system calls, and the number of stats is
> read separately from the stats themselves. Worse still when user
> space asks for values of the stats, it doesn't inform the kernel
> how big the buffer is. If number of stats increases in the meantime
> kernel will overflow user buffer.
> 
> Add a netlink API for dumping queue stats. Queue information is
> exposed via the netdev-genl family, so add the stats there.
> Support per-queue and sum-for-device dumps. Latter will be useful
> when subsequent patches add more interesting common stats than
> just bytes and packets.
> 
> The API does not currently distinguish between HW and SW stats.
> The expectation is that the source of the stats will either not
> matter much (good packets) or be obvious (skb alloc errors).
> 
> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
> ---
>   Documentation/netlink/specs/netdev.yaml |  84 +++++++++
>   Documentation/networking/statistics.rst |  17 +-
>   include/linux/netdevice.h               |   3 +
>   include/net/netdev_queues.h             |  54 ++++++
>   include/uapi/linux/netdev.h             |  20 +++
>   net/core/netdev-genl-gen.c              |  12 ++
>   net/core/netdev-genl-gen.h              |   2 +
>   net/core/netdev-genl.c                  | 218 ++++++++++++++++++++++++
>   tools/include/uapi/linux/netdev.h       |  20 +++
>   9 files changed, 429 insertions(+), 1 deletion(-)
> 
> diff --git a/Documentation/netlink/specs/netdev.yaml b/Documentation/netlink/specs/netdev.yaml
> index 3addac970680..eea41e9de98c 100644
> --- a/Documentation/netlink/specs/netdev.yaml
> +++ b/Documentation/netlink/specs/netdev.yaml
> @@ -74,6 +74,10 @@ name: netdev
>       name: queue-type
>       type: enum
>       entries: [ rx, tx ]
> +  -
> +    name: stats-projection
> +    type: enum
> +    entries: [ netdev, queue ]
>   
>   attribute-sets:
>     -
> @@ -265,6 +269,66 @@ name: netdev
>           doc: ID of the NAPI instance which services this queue.
>           type: u32
>   
> +  -
> +    name: stats
> +    doc: |
> +      Get device statistics, scoped to a device or a queue.
> +      These statistics extend (and partially duplicate) statistics available
> +      in struct rtnl_link_stats64.
> +      Value of the `projection` attribute determines how statistics are
> +      aggregated. When aggregated for the entire device the statistics
> +      represent the total number of events since last explicit reset of
> +      the device (i.e. not a reconfiguration like changing queue count).
> +      When reported per-queue, however, the statistics may not add
> +      up to the total number of events, will only be reported for currently
> +      active objects, and will likely report the number of events since last
> +      reconfiguration.
> +    attributes:
> +      -
> +        name: ifindex
> +        doc: ifindex of the netdevice to which stats belong.
> +        type: u32
> +        checks:
> +          min: 1
> +      -
> +        name: queue-type
> +        doc: Queue type as rx, tx, for queue-id.
> +        type: u32
> +        enum: queue-type
> +      -
> +        name: queue-id
> +        doc: Queue ID, if stats are scoped to a single queue instance.
> +        type: u32
> +      -
> +        name: projection
> +        doc: |
> +          What object type should be used to iterate over the stats.
> +        type: uint
> +        enum: stats-projection
> +      -
> +        name: rx-packets
> +        doc: |
> +          Number of wire packets successfully received and passed to the stack.
> +          For drivers supporting XDP, XDP is considered the first layer
> +          of the stack, so packets consumed by XDP are still counted here.
> +        type: uint
> +        value: 8 # reserve some attr ids in case we need more metadata later
> +      -
> +        name: rx-bytes
> +        doc: Successfully received bytes, see `rx-packets`.
> +        type: uint
> +      -
> +        name: tx-packets
> +        doc: |
> +          Number of wire packets successfully sent. Packet is considered to be
> +          successfully sent once it is in device memory (usually this means
> +          the device has issued a DMA completion for the packet).
> +        type: uint
> +      -
> +        name: tx-bytes
> +        doc: Successfully sent bytes, see `tx-packets`.
> +        type: uint
> +
>   operations:
>     list:
>       -
> @@ -405,6 +469,26 @@ name: netdev
>             attributes:
>               - ifindex
>           reply: *napi-get-op
> +    -
> +      name: stats-get
> +      doc: |
> +        Get / dump fine grained statistics. Which statistics are reported
> +        depends on the device and the driver, and whether the driver stores
> +        software counters per-queue.
> +      attribute-set: stats
> +      dump:
> +        request:
> +          attributes:
> +            - projection
> +        reply:
> +          attributes:
> +            - ifindex
> +            - queue-type
> +            - queue-id
> +            - rx-packets
> +            - rx-bytes
> +            - tx-packets
> +            - tx-bytes
>   
>   mcast-groups:
>     list:
> diff --git a/Documentation/networking/statistics.rst b/Documentation/networking/statistics.rst
> index 551b3cc29a41..8a4d166af3c0 100644
> --- a/Documentation/networking/statistics.rst
> +++ b/Documentation/networking/statistics.rst
> @@ -41,6 +41,15 @@ If `-s` is specified once the detailed errors won't be shown.
>   
>   `ip` supports JSON formatting via the `-j` option.
>   
> +Queue statistics
> +~~~~~~~~~~~~~~~~
> +
> +Queue statistics are accessible via the netdev netlink family.
> +
> +Currently no widely distributed CLI exists to access those statistics.
> +Kernel development tools (ynl) can be used to experiment with them,
> +see :ref:`Documentation/userspace-api/netlink/intro-specs.rst`.
> +
>   Protocol-specific statistics
>   ----------------------------
>   
> @@ -134,7 +143,7 @@ reading multiple stats as it internally performs a full dump of
>   and reports only the stat corresponding to the accessed file.
>   
>   Sysfs files are documented in
> -`Documentation/ABI/testing/sysfs-class-net-statistics`.
> +:ref:`Documentation/ABI/testing/sysfs-class-net-statistics`.
>   
>   
>   netlink
> @@ -147,6 +156,12 @@ Statistics are reported both in the responses to link information
>   requests (`RTM_GETLINK`) and statistic requests (`RTM_GETSTATS`,
>   when `IFLA_STATS_LINK_64` bit is set in the `.filter_mask` of the request).
>   
> +netdev (netlink)
> +~~~~~~~~~~~~~~~~
> +
> +`netdev` generic netlink family allows accessing page pool and per queue
> +statistics.
> +
>   ethtool
>   -------
>   
> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> index f07c8374f29c..afcb2a0566f9 100644
> --- a/include/linux/netdevice.h
> +++ b/include/linux/netdevice.h
> @@ -2039,6 +2039,7 @@ enum netdev_reg_state {
>    *
>    *	@sysfs_rx_queue_group:	Space for optional per-rx queue attributes
>    *	@rtnl_link_ops:	Rtnl_link_ops
> + *	@stat_ops:	Optional ops for queue-aware statistics
>    *
>    *	@gso_max_size:	Maximum size of generic segmentation offload
>    *	@tso_max_size:	Device (as in HW) limit on the max TSO request size
> @@ -2419,6 +2420,8 @@ struct net_device {
>   
>   	const struct rtnl_link_ops *rtnl_link_ops;
>   
> +	const struct netdev_stat_ops *stat_ops;
> +
>   	/* for setting kernel sock attribute on TCP connection setup */
>   #define GSO_MAX_SEGS		65535u
>   #define GSO_LEGACY_MAX_SIZE	65536u
> diff --git a/include/net/netdev_queues.h b/include/net/netdev_queues.h
> index 8b8ed4e13d74..d633347eeda5 100644
> --- a/include/net/netdev_queues.h
> +++ b/include/net/netdev_queues.h
> @@ -4,6 +4,60 @@
>   
>   #include <linux/netdevice.h>
>   
> +struct netdev_queue_stats_rx {
> +	u64 bytes;
> +	u64 packets;
> +};
> +
> +struct netdev_queue_stats_tx {
> +	u64 bytes;
> +	u64 packets;
> +};
> +
> +/**
> + * struct netdev_stat_ops - netdev ops for fine grained stats
> + * @get_queue_stats_rx:	get stats for a given Rx queue
> + * @get_queue_stats_tx:	get stats for a given Tx queue
> + * @get_base_stats:	get base stats (not belonging to any live instance)
> + *
> + * Query stats for a given object. The values of the statistics are undefined
> + * on entry (specifically they are *not* zero-initialized). Drivers should
> + * assign values only to the statistics they collect. Statistics which are not
> + * collected must be left undefined.
> + *
> + * Queue objects are not necessarily persistent, and only currently active
> + * queues are queried by the per-queue callbacks. This means that per-queue
> + * statistics will not generally add up to the total number of events for
> + * the device. The @get_base_stats callback allows filling in the delta
> + * between events for currently live queues and overall device history.
> + * When the statistics for the entire device are queried, first @get_base_stats
> + * is issued to collect the delta, and then a series of per-queue callbacks.
> + * Only statistics which are set in @get_base_stats will be reported
> + * at the device level, meaning that unlike in queue callbacks, setting
> + * a statistic to zero in @get_base_stats is a legitimate thing to do.
> + * This is because @get_base_stats has a second function of designating which
> + * statistics are in fact correct for the entire device (e.g. when history
> + * for some of the events is not maintained, and reliable "total" cannot
> + * be provided).
> + *
> + * Device drivers can assume that when collecting total device stats,
> + * the @get_base_stats and subsequent per-queue calls are performed
> + * "atomically" (without releasing the rtnl_lock).
> + *
> + * Device drivers are encouraged to reset the per-queue statistics when
> + * number of queues change. This is because the primary use case for
> + * per-queue statistics is currently to detect traffic imbalance.
> + */
> +struct netdev_stat_ops {
> +	void (*get_queue_stats_rx)(struct net_device *dev, int idx,
> +				   struct netdev_queue_stats_rx *stats);
> +	void (*get_queue_stats_tx)(struct net_device *dev, int idx,
> +				   struct netdev_queue_stats_tx *stats);
> +	void (*get_base_stats)(struct net_device *dev,
> +			       struct netdev_queue_stats_rx *rx,
> +			       struct netdev_queue_stats_tx *tx);
> +};
> +
>   /**
>    * DOC: Lockless queue stopping / waking helpers.
>    *
> diff --git a/include/uapi/linux/netdev.h b/include/uapi/linux/netdev.h
> index 93cb411adf72..c6a5e4b03828 100644
> --- a/include/uapi/linux/netdev.h
> +++ b/include/uapi/linux/netdev.h
> @@ -70,6 +70,11 @@ enum netdev_queue_type {
>   	NETDEV_QUEUE_TYPE_TX,
>   };
>   
> +enum netdev_stats_projection {
> +	NETDEV_STATS_PROJECTION_NETDEV,
> +	NETDEV_STATS_PROJECTION_QUEUE,
> +};
> +
>   enum {
>   	NETDEV_A_DEV_IFINDEX = 1,
>   	NETDEV_A_DEV_PAD,
> @@ -132,6 +137,20 @@ enum {
>   	NETDEV_A_QUEUE_MAX = (__NETDEV_A_QUEUE_MAX - 1)
>   };
>   
> +enum {
> +	NETDEV_A_STATS_IFINDEX = 1,
> +	NETDEV_A_STATS_QUEUE_TYPE,
> +	NETDEV_A_STATS_QUEUE_ID,
> +	NETDEV_A_STATS_PROJECTION,
> +	NETDEV_A_STATS_RX_PACKETS = 8,
> +	NETDEV_A_STATS_RX_BYTES,
> +	NETDEV_A_STATS_TX_PACKETS,
> +	NETDEV_A_STATS_TX_BYTES,
> +
> +	__NETDEV_A_STATS_MAX,
> +	NETDEV_A_STATS_MAX = (__NETDEV_A_STATS_MAX - 1)
> +};
> +
>   enum {
>   	NETDEV_CMD_DEV_GET = 1,
>   	NETDEV_CMD_DEV_ADD_NTF,
> @@ -144,6 +163,7 @@ enum {
>   	NETDEV_CMD_PAGE_POOL_STATS_GET,
>   	NETDEV_CMD_QUEUE_GET,
>   	NETDEV_CMD_NAPI_GET,
> +	NETDEV_CMD_STATS_GET,
>   
>   	__NETDEV_CMD_MAX,
>   	NETDEV_CMD_MAX = (__NETDEV_CMD_MAX - 1)
> diff --git a/net/core/netdev-genl-gen.c b/net/core/netdev-genl-gen.c
> index be7f2ebd61b2..a786590fc0e2 100644
> --- a/net/core/netdev-genl-gen.c
> +++ b/net/core/netdev-genl-gen.c
> @@ -68,6 +68,11 @@ static const struct nla_policy netdev_napi_get_dump_nl_policy[NETDEV_A_NAPI_IFIN
>   	[NETDEV_A_NAPI_IFINDEX] = NLA_POLICY_MIN(NLA_U32, 1),
>   };
>   
> +/* NETDEV_CMD_STATS_GET - dump */
> +static const struct nla_policy netdev_stats_get_nl_policy[NETDEV_A_STATS_PROJECTION + 1] = {
> +	[NETDEV_A_STATS_PROJECTION] = NLA_POLICY_MAX(NLA_UINT, 1),
> +};
> +
>   /* Ops table for netdev */
>   static const struct genl_split_ops netdev_nl_ops[] = {
>   	{
> @@ -138,6 +143,13 @@ static const struct genl_split_ops netdev_nl_ops[] = {
>   		.maxattr	= NETDEV_A_NAPI_IFINDEX,
>   		.flags		= GENL_CMD_CAP_DUMP,
>   	},
> +	{
> +		.cmd		= NETDEV_CMD_STATS_GET,
> +		.dumpit		= netdev_nl_stats_get_dumpit,
> +		.policy		= netdev_stats_get_nl_policy,
> +		.maxattr	= NETDEV_A_STATS_PROJECTION,
> +		.flags		= GENL_CMD_CAP_DUMP,
> +	},
>   };
>   
>   static const struct genl_multicast_group netdev_nl_mcgrps[] = {
> diff --git a/net/core/netdev-genl-gen.h b/net/core/netdev-genl-gen.h
> index a47f2bcbe4fa..de878ba2bad7 100644
> --- a/net/core/netdev-genl-gen.h
> +++ b/net/core/netdev-genl-gen.h
> @@ -28,6 +28,8 @@ int netdev_nl_queue_get_dumpit(struct sk_buff *skb,
>   			       struct netlink_callback *cb);
>   int netdev_nl_napi_get_doit(struct sk_buff *skb, struct genl_info *info);
>   int netdev_nl_napi_get_dumpit(struct sk_buff *skb, struct netlink_callback *cb);
> +int netdev_nl_stats_get_dumpit(struct sk_buff *skb,
> +			       struct netlink_callback *cb);
>   
>   enum {
>   	NETDEV_NLGRP_MGMT,
> diff --git a/net/core/netdev-genl.c b/net/core/netdev-genl.c
> index fd98936da3ae..fe4e9bc5436a 100644
> --- a/net/core/netdev-genl.c
> +++ b/net/core/netdev-genl.c
> @@ -8,6 +8,7 @@
>   #include <net/xdp.h>
>   #include <net/xdp_sock.h>
>   #include <net/netdev_rx_queue.h>
> +#include <net/netdev_queues.h>
>   #include <net/busy_poll.h>
>   
>   #include "netdev-genl-gen.h"
> @@ -469,6 +470,223 @@ int netdev_nl_queue_get_dumpit(struct sk_buff *skb, struct netlink_callback *cb)
>   	return skb->len;
>   }
>   
> +#define NETDEV_STAT_NOT_SET		(~0ULL)
> +
> +static void
> +netdev_nl_stats_add(void *_sum, const void *_add, size_t size)
> +{
> +	const u64 *add = _add;
> +	u64 *sum = _sum;
> +
> +	while (size) {
> +		if (*add != NETDEV_STAT_NOT_SET && *sum != NETDEV_STAT_NOT_SET)
> +			*sum += *add;
> +		sum++;
> +		add++;
> +		size -= 8;
> +	}
> +}
> +
> +static int netdev_stat_put(struct sk_buff *rsp, unsigned int attr_id, u64 value)
> +{
> +	if (value == NETDEV_STAT_NOT_SET)
> +		return 0;
> +	return nla_put_uint(rsp, attr_id, value);
> +}
> +
> +static int
> +netdev_nl_stats_write_rx(struct sk_buff *rsp, struct netdev_queue_stats_rx *rx)
> +{
> +	if (netdev_stat_put(rsp, NETDEV_A_STATS_RX_PACKETS, rx->packets) ||
> +	    netdev_stat_put(rsp, NETDEV_A_STATS_RX_BYTES, rx->bytes))
> +		return -EMSGSIZE;
> +	return 0;
> +}
> +
> +static int
> +netdev_nl_stats_write_tx(struct sk_buff *rsp, struct netdev_queue_stats_tx *tx)
> +{
> +	if (netdev_stat_put(rsp, NETDEV_A_STATS_TX_PACKETS, tx->packets) ||
> +	    netdev_stat_put(rsp, NETDEV_A_STATS_TX_BYTES, tx->bytes))
> +		return -EMSGSIZE;
> +	return 0;
> +}
> +
> +static int
> +netdev_nl_stats_queue(struct net_device *netdev, struct sk_buff *rsp,
> +		      u32 q_type, int i, const struct genl_info *info)
> +{
> +	const struct netdev_stat_ops *ops = netdev->stat_ops;
> +	struct netdev_queue_stats_rx rx;
> +	struct netdev_queue_stats_tx tx;
> +	void *hdr;
> +
> +	hdr = genlmsg_iput(rsp, info);
> +	if (!hdr)
> +		return -EMSGSIZE;
> +	if (nla_put_u32(rsp, NETDEV_A_STATS_IFINDEX, netdev->ifindex) ||
> +	    nla_put_u32(rsp, NETDEV_A_STATS_QUEUE_TYPE, q_type) ||
> +	    nla_put_u32(rsp, NETDEV_A_STATS_QUEUE_ID, i))
> +		goto nla_put_failure;
> +
> +	switch (q_type) {
> +	case NETDEV_QUEUE_TYPE_RX:
> +		memset(&rx, 0xff, sizeof(rx));
> +		ops->get_queue_stats_rx(netdev, i, &rx);
> +		if (!memchr_inv(&rx, 0xff, sizeof(rx)))
> +			goto nla_cancel;
> +		if (netdev_nl_stats_write_rx(rsp, &rx))
> +			goto nla_put_failure;
> +		break;
> +	case NETDEV_QUEUE_TYPE_TX:
> +		memset(&tx, 0xff, sizeof(tx));
> +		ops->get_queue_stats_tx(netdev, i, &tx);
> +		if (!memchr_inv(&tx, 0xff, sizeof(tx)))
> +			goto nla_cancel;
> +		if (netdev_nl_stats_write_tx(rsp, &tx))
> +			goto nla_put_failure;
> +		break;
> +	}
> +
> +	genlmsg_end(rsp, hdr);
> +	return 0;
> +
> +nla_cancel:
> +	genlmsg_cancel(rsp, hdr);
> +	return 0;
> +nla_put_failure:
> +	genlmsg_cancel(rsp, hdr);
> +	return -EMSGSIZE;
> +}
> +
> +static int
> +netdev_nl_stats_by_queue(struct net_device *netdev, struct sk_buff *rsp,
> +			 const struct genl_info *info,
> +			 struct netdev_nl_dump_ctx *ctx)
> +{
> +	const struct netdev_stat_ops *ops = netdev->stat_ops;
> +	int i, err;
> +
> +	if (!(netdev->flags & IFF_UP))
> +		return 0;
> +
> +	i = ctx->rxq_idx;
> +	while (ops->get_queue_stats_rx && i < netdev->real_num_rx_queues) {
> +		err = netdev_nl_stats_queue(netdev, rsp, NETDEV_QUEUE_TYPE_RX,
> +					    i, info);
> +		if (err)
> +			return err;
> +		ctx->rxq_idx = ++i;
> +	}
> +	i = ctx->txq_idx;
> +	while (ops->get_queue_stats_tx && i < netdev->real_num_tx_queues) {
> +		err = netdev_nl_stats_queue(netdev, rsp, NETDEV_QUEUE_TYPE_TX,
> +					    i, info);
> +		if (err)
> +			return err;
> +		ctx->txq_idx = ++i;
> +	}
> +
> +	ctx->rxq_idx = 0;
> +	ctx->txq_idx = 0;
> +	return 0;
> +}
> +
> +static int
> +netdev_nl_stats_by_netdev(struct net_device *netdev, struct sk_buff *rsp,
> +			  const struct genl_info *info)
> +{
> +	struct netdev_queue_stats_rx rx_sum, rx;
> +	struct netdev_queue_stats_tx tx_sum, tx;
> +	const struct netdev_stat_ops *ops;
> +	void *hdr;
> +	int i;
> +
> +	ops = netdev->stat_ops;
> +	/* Netdev can't guarantee any complete counters */
> +	if (!ops->get_base_stats)
> +		return 0;
> +
> +	memset(&rx_sum, 0xff, sizeof(rx_sum));
> +	memset(&tx_sum, 0xff, sizeof(tx_sum));
> +
> +	ops->get_base_stats(netdev, &rx_sum, &tx_sum);
> +
> +	/* The op was there, but nothing reported, don't bother */
> +	if (!memchr_inv(&rx_sum, 0xff, sizeof(rx_sum)) &&
> +	    !memchr_inv(&tx_sum, 0xff, sizeof(tx_sum)))
> +		return 0;
> +
> +	hdr = genlmsg_iput(rsp, info);
> +	if (!hdr)
> +		return -EMSGSIZE;
> +	if (nla_put_u32(rsp, NETDEV_A_STATS_IFINDEX, netdev->ifindex))
> +		goto nla_put_failure;
> +
> +	for (i = 0; i < netdev->real_num_rx_queues; i++) {
> +		memset(&rx, 0xff, sizeof(rx));
> +		if (ops->get_queue_stats_rx)
> +			ops->get_queue_stats_rx(netdev, i, &rx);
> +		netdev_nl_stats_add(&rx_sum, &rx, sizeof(rx));
> +	}
> +	for (i = 0; i < netdev->real_num_tx_queues; i++) {
> +		memset(&tx, 0xff, sizeof(tx));
> +		if (ops->get_queue_stats_tx)
> +			ops->get_queue_stats_tx(netdev, i, &tx);
> +		netdev_nl_stats_add(&tx_sum, &tx, sizeof(tx));
> +	}
> +
> +	if (netdev_nl_stats_write_rx(rsp, &rx_sum) ||
> +	    netdev_nl_stats_write_tx(rsp, &tx_sum))
> +		goto nla_put_failure;
> +
> +	genlmsg_end(rsp, hdr);
> +	return 0;
> +
> +nla_put_failure:
> +	genlmsg_cancel(rsp, hdr);
> +	return -EMSGSIZE;
> +}
> +
> +int netdev_nl_stats_get_dumpit(struct sk_buff *skb,
> +			       struct netlink_callback *cb)
> +{
> +	struct netdev_nl_dump_ctx *ctx = netdev_dump_ctx(cb);
> +	const struct genl_info *info = genl_info_dump(cb);
> +	enum netdev_stats_projection projection;
> +	struct net *net = sock_net(skb->sk);
> +	struct net_device *netdev;
> +	int err = 0;
> +
> +	projection = NETDEV_STATS_PROJECTION_NETDEV;
> +	if (info->attrs[NETDEV_A_STATS_PROJECTION])
> +		projection =
> +			nla_get_uint(info->attrs[NETDEV_A_STATS_PROJECTION]);
> +
> +	rtnl_lock();

Could we also add a filtered dump for a user-provided ifindex?
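
On the dump path more generally: a multi-part dump has to remember where
it stopped when the skb fills up, and the saved cursor needs to name the
next queue to emit. Storing the index with a post-increment would instead
record the queue that was just emitted, and a resumed pass would report
it twice. Here is a rough user-space model of that contract. The names
dump_ctx, dump_pass and dump_all are made up for this sketch, and the
buffer of ints stands in for the skb.

```c
/* Hypothetical stand-in for the per-dump context (ctx->rxq_idx). */
struct dump_ctx {
	int next_idx;
};

/*
 * Emit queue indices into out[] until all nqueues are written or the
 * buffer (cap entries) is full. Returns the number of entries written
 * in this pass; ctx->next_idx records where the next pass resumes.
 */
static int dump_pass(struct dump_ctx *ctx, int nqueues, int *out, int cap)
{
	int i = ctx->next_idx;
	int n = 0;

	while (i < nqueues) {
		if (n == cap)
			return n;	/* buffer full: models -EMSGSIZE */
		out[n++] = i;
		ctx->next_idx = ++i;	/* store the next queue, not this one */
	}
	return n;
}

/* Run passes until the dump completes; returns total entries emitted. */
static int dump_all(int nqueues, int cap, int *seen, int seen_cap)
{
	struct dump_ctx ctx = { 0 };
	int buf[8];
	int total = 0, n, k;

	if (cap < 1 || cap > 8)
		return -1;
	do {
		n = dump_pass(&ctx, nqueues, buf, cap);
		for (k = 0; k < n && total < seen_cap; k++)
			seen[total++] = buf[k];
	} while (ctx.next_idx < nqueues);
	return total;
}
```

For five queues and a two-entry buffer, dump_all() takes three passes and
reports each queue exactly once, in order.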

> +	for_each_netdev_dump(net, netdev, ctx->ifindex) {
> +		if (!netdev->stat_ops)
> +			continue;
> +
> +		switch (projection) {
> +		case NETDEV_STATS_PROJECTION_NETDEV:
> +			err = netdev_nl_stats_by_netdev(netdev, skb, info);
> +			break;
> +		case NETDEV_STATS_PROJECTION_QUEUE:
> +			err = netdev_nl_stats_by_queue(netdev, skb, info, ctx);
> +			break;
> +		}
> +		if (err < 0)
> +			break;
> +	}
> +	rtnl_unlock();
> +
> +	if (err != -EMSGSIZE)
> +		return err;
> +
> +	return skb->len;
> +}
> +
>   static int netdev_genl_netdevice_event(struct notifier_block *nb,
>   				       unsigned long event, void *ptr)
>   {
> diff --git a/tools/include/uapi/linux/netdev.h b/tools/include/uapi/linux/netdev.h
> index 93cb411adf72..c6a5e4b03828 100644
> --- a/tools/include/uapi/linux/netdev.h
> +++ b/tools/include/uapi/linux/netdev.h
> @@ -70,6 +70,11 @@ enum netdev_queue_type {
>   	NETDEV_QUEUE_TYPE_TX,
>   };
>   
> +enum netdev_stats_projection {
> +	NETDEV_STATS_PROJECTION_NETDEV,
> +	NETDEV_STATS_PROJECTION_QUEUE,
> +};
> +
>   enum {
>   	NETDEV_A_DEV_IFINDEX = 1,
>   	NETDEV_A_DEV_PAD,
> @@ -132,6 +137,20 @@ enum {
>   	NETDEV_A_QUEUE_MAX = (__NETDEV_A_QUEUE_MAX - 1)
>   };
>   
> +enum {
> +	NETDEV_A_STATS_IFINDEX = 1,
> +	NETDEV_A_STATS_QUEUE_TYPE,
> +	NETDEV_A_STATS_QUEUE_ID,
> +	NETDEV_A_STATS_PROJECTION,
> +	NETDEV_A_STATS_RX_PACKETS = 8,
> +	NETDEV_A_STATS_RX_BYTES,
> +	NETDEV_A_STATS_TX_PACKETS,
> +	NETDEV_A_STATS_TX_BYTES,
> +
> +	__NETDEV_A_STATS_MAX,
> +	NETDEV_A_STATS_MAX = (__NETDEV_A_STATS_MAX - 1)
> +};
> +
>   enum {
>   	NETDEV_CMD_DEV_GET = 1,
>   	NETDEV_CMD_DEV_ADD_NTF,
> @@ -144,6 +163,7 @@ enum {
>   	NETDEV_CMD_PAGE_POOL_STATS_GET,
>   	NETDEV_CMD_QUEUE_GET,
>   	NETDEV_CMD_NAPI_GET,
> +	NETDEV_CMD_STATS_GET,
>   
>   	__NETDEV_CMD_MAX,
>   	NETDEV_CMD_MAX = (__NETDEV_CMD_MAX - 1)
Nambiar, Amritha Feb. 23, 2024, 12:29 a.m. UTC | #2
On 2/22/2024 2:36 PM, Jakub Kicinski wrote:
> The ethtool-nl family does a good job exposing various protocol
> related and IEEE/IETF statistics which used to get dumped under
> ethtool -S, with creative names. Queue stats don't have a netlink
> API, yet, and remain a lion's share of ethtool -S output for new
> drivers. Not only is that bad because the names differ driver to
> driver but it's also bug-prone. Intuitively drivers try to report
> only the stats for active queues, but querying ethtool stats
> involves multiple system calls, and the number of stats is
> read separately from the stats themselves. Worse still when user
> space asks for values of the stats, it doesn't inform the kernel
> how big the buffer is. If number of stats increases in the meantime
> kernel will overflow user buffer.
> 
> Add a netlink API for dumping queue stats. Queue information is
> exposed via the netdev-genl family, so add the stats there.
> Support per-queue and sum-for-device dumps. Latter will be useful
> when subsequent patches add more interesting common stats than
> just bytes and packets.
> 
> The API does not currently distinguish between HW and SW stats.
> The expectation is that the source of the stats will either not
> matter much (good packets) or be obvious (skb alloc errors).
> 
> Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Thanks, this has almost all the bits needed to also look up stats for a
single queue with --do stats-get, given a queue id and type.

> ---
>   Documentation/netlink/specs/netdev.yaml |  84 +++++++++
>   Documentation/networking/statistics.rst |  17 +-
>   include/linux/netdevice.h               |   3 +
>   include/net/netdev_queues.h             |  54 ++++++
>   include/uapi/linux/netdev.h             |  20 +++
>   net/core/netdev-genl-gen.c              |  12 ++
>   net/core/netdev-genl-gen.h              |   2 +
>   net/core/netdev-genl.c                  | 218 ++++++++++++++++++++++++
>   tools/include/uapi/linux/netdev.h       |  20 +++
>   9 files changed, 429 insertions(+), 1 deletion(-)
> 
> diff --git a/Documentation/netlink/specs/netdev.yaml b/Documentation/netlink/specs/netdev.yaml
> index 3addac970680..eea41e9de98c 100644
> --- a/Documentation/netlink/specs/netdev.yaml
> +++ b/Documentation/netlink/specs/netdev.yaml
> @@ -74,6 +74,10 @@ name: netdev
>       name: queue-type
>       type: enum
>       entries: [ rx, tx ]
> +  -
> +    name: stats-projection
> +    type: enum
> +    entries: [ netdev, queue ]
>   
>   attribute-sets:
>     -
> @@ -265,6 +269,66 @@ name: netdev
>           doc: ID of the NAPI instance which services this queue.
>           type: u32
>   
> +  -
> +    name: stats
> +    doc: |
> +      Get device statistics, scoped to a device or a queue.
> +      These statistics extend (and partially duplicate) statistics available
> +      in struct rtnl_link_stats64.
> +      Value of the `projection` attribute determines how statistics are
> +      aggregated. When aggregated for the entire device the statistics
> +      represent the total number of events since last explicit reset of
> +      the device (i.e. not a reconfiguration like changing queue count).
> +      When reported per-queue, however, the statistics may not add
> +      up to the total number of events, will only be reported for currently
> +      active objects, and will likely report the number of events since last
> +      reconfiguration.
> +    attributes:
> +      -
> +        name: ifindex
> +        doc: ifindex of the netdevice to which stats belong.
> +        type: u32
> +        checks:
> +          min: 1
> +      -
> +        name: queue-type
> +        doc: Queue type as rx, tx, for queue-id.
> +        type: u32
> +        enum: queue-type
> +      -
> +        name: queue-id
> +        doc: Queue ID, if stats are scoped to a single queue instance.
> +        type: u32
> +      -
> +        name: projection
> +        doc: |
> +          What object type should be used to iterate over the stats.
> +        type: uint
> +        enum: stats-projection
> +      -
> +        name: rx-packets
> +        doc: |
> +          Number of wire packets successfully received and passed to the stack.
> +          For drivers supporting XDP, XDP is considered the first layer
> +          of the stack, so packets consumed by XDP are still counted here.
> +        type: uint
> +        value: 8 # reserve some attr ids in case we need more metadata later
> +      -
> +        name: rx-bytes
> +        doc: Successfully received bytes, see `rx-packets`.
> +        type: uint
> +      -
> +        name: tx-packets
> +        doc: |
> +          Number of wire packets successfully sent. Packet is considered to be
> +          successfully sent once it is in device memory (usually this means
> +          the device has issued a DMA completion for the packet).
> +        type: uint
> +      -
> +        name: tx-bytes
> +        doc: Successfully sent bytes, see `tx-packets`.
> +        type: uint
> +
>   operations:
>     list:
>       -
> @@ -405,6 +469,26 @@ name: netdev
>             attributes:
>               - ifindex
>           reply: *napi-get-op
> +    -
> +      name: stats-get
> +      doc: |
> +        Get / dump fine grained statistics. Which statistics are reported
> +        depends on the device and the driver, and whether the driver stores
> +        software counters per-queue.
> +      attribute-set: stats
> +      dump:
> +        request:
> +          attributes:
> +            - projection
> +        reply:
> +          attributes:
> +            - ifindex
> +            - queue-type
> +            - queue-id
> +            - rx-packets
> +            - rx-bytes
> +            - tx-packets
> +            - tx-bytes
>   
>   mcast-groups:
>     list:
> diff --git a/Documentation/networking/statistics.rst b/Documentation/networking/statistics.rst
> index 551b3cc29a41..8a4d166af3c0 100644
> --- a/Documentation/networking/statistics.rst
> +++ b/Documentation/networking/statistics.rst
> @@ -41,6 +41,15 @@ If `-s` is specified once the detailed errors won't be shown.
>   
>   `ip` supports JSON formatting via the `-j` option.
>   
> +Queue statistics
> +~~~~~~~~~~~~~~~~
> +
> +Queue statistics are accessible via the netdev netlink family.
> +
> +Currently no widely distributed CLI exists to access those statistics.
> +Kernel development tools (ynl) can be used to experiment with them,
> +see :ref:`Documentation/userspace-api/netlink/intro-specs.rst`.
> +
>   Protocol-specific statistics
>   ----------------------------
>   
> @@ -134,7 +143,7 @@ reading multiple stats as it internally performs a full dump of
>   and reports only the stat corresponding to the accessed file.
>   
>   Sysfs files are documented in
> -`Documentation/ABI/testing/sysfs-class-net-statistics`.
> +:ref:`Documentation/ABI/testing/sysfs-class-net-statistics`.
>   
>   
>   netlink
> @@ -147,6 +156,12 @@ Statistics are reported both in the responses to link information
>   requests (`RTM_GETLINK`) and statistic requests (`RTM_GETSTATS`,
>   when `IFLA_STATS_LINK_64` bit is set in the `.filter_mask` of the request).
>   
> +netdev (netlink)
> +~~~~~~~~~~~~~~~~
> +
> +The `netdev` generic netlink family allows accessing page pool and
> +per-queue statistics.
> +
>   ethtool
>   -------
>   
> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> index f07c8374f29c..afcb2a0566f9 100644
> --- a/include/linux/netdevice.h
> +++ b/include/linux/netdevice.h
> @@ -2039,6 +2039,7 @@ enum netdev_reg_state {
>    *
>    *	@sysfs_rx_queue_group:	Space for optional per-rx queue attributes
>    *	@rtnl_link_ops:	Rtnl_link_ops
> + *	@stat_ops:	Optional ops for queue-aware statistics
>    *
>    *	@gso_max_size:	Maximum size of generic segmentation offload
>    *	@tso_max_size:	Device (as in HW) limit on the max TSO request size
> @@ -2419,6 +2420,8 @@ struct net_device {
>   
>   	const struct rtnl_link_ops *rtnl_link_ops;
>   
> +	const struct netdev_stat_ops *stat_ops;
> +
>   	/* for setting kernel sock attribute on TCP connection setup */
>   #define GSO_MAX_SEGS		65535u
>   #define GSO_LEGACY_MAX_SIZE	65536u
> diff --git a/include/net/netdev_queues.h b/include/net/netdev_queues.h
> index 8b8ed4e13d74..d633347eeda5 100644
> --- a/include/net/netdev_queues.h
> +++ b/include/net/netdev_queues.h
> @@ -4,6 +4,60 @@
>   
>   #include <linux/netdevice.h>
>   
> +struct netdev_queue_stats_rx {
> +	u64 bytes;
> +	u64 packets;
> +};
> +
> +struct netdev_queue_stats_tx {
> +	u64 bytes;
> +	u64 packets;
> +};
> +
> +/**
> + * struct netdev_stat_ops - netdev ops for fine grained stats
> + * @get_queue_stats_rx:	get stats for a given Rx queue
> + * @get_queue_stats_tx:	get stats for a given Tx queue
> + * @get_base_stats:	get base stats (not belonging to any live instance)
> + *
> + * Query stats for a given object. The values of the statistics are undefined
> + * on entry (specifically they are *not* zero-initialized). Drivers should
> + * assign values only to the statistics they collect. Statistics which are not
> + * collected must be left undefined.
> + *
> + * Queue objects are not necessarily persistent, and only currently active
> + * queues are queried by the per-queue callbacks. This means that per-queue
> + * statistics will not generally add up to the total number of events for
> + * the device. The @get_base_stats callback allows filling in the delta
> + * between events for currently live queues and overall device history.
> + * When the statistics for the entire device are queried, first @get_base_stats
> + * is issued to collect the delta, and then a series of per-queue callbacks.
> + * Only statistics which are set in @get_base_stats will be reported
> + * at the device level, meaning that unlike in queue callbacks, setting
> + * a statistic to zero in @get_base_stats is a legitimate thing to do.
> + * This is because @get_base_stats has a second function of designating which
> + * statistics are in fact correct for the entire device (e.g. when history
> + * for some of the events is not maintained, and reliable "total" cannot
> + * be provided).
> + *
> + * Device drivers can assume that when collecting total device stats,
> + * the @get_base_stats and subsequent per-queue calls are performed
> + * "atomically" (without releasing the rtnl_lock).
> + *
> + * Device drivers are encouraged to reset the per-queue statistics when
> + * the number of queues changes. This is because the primary use case for
> + * per-queue statistics is currently to detect traffic imbalance.
> + */
> +struct netdev_stat_ops {
> +	void (*get_queue_stats_rx)(struct net_device *dev, int idx,
> +				   struct netdev_queue_stats_rx *stats);
> +	void (*get_queue_stats_tx)(struct net_device *dev, int idx,
> +				   struct netdev_queue_stats_tx *stats);
> +	void (*get_base_stats)(struct net_device *dev,
> +			       struct netdev_queue_stats_rx *rx,
> +			       struct netdev_queue_stats_tx *tx);
> +};
> +
>   /**
>    * DOC: Lockless queue stopping / waking helpers.
>    *
> diff --git a/include/uapi/linux/netdev.h b/include/uapi/linux/netdev.h
> index 93cb411adf72..c6a5e4b03828 100644
> --- a/include/uapi/linux/netdev.h
> +++ b/include/uapi/linux/netdev.h
> @@ -70,6 +70,11 @@ enum netdev_queue_type {
>   	NETDEV_QUEUE_TYPE_TX,
>   };
>   
> +enum netdev_stats_projection {
> +	NETDEV_STATS_PROJECTION_NETDEV,
> +	NETDEV_STATS_PROJECTION_QUEUE,
> +};
> +
>   enum {
>   	NETDEV_A_DEV_IFINDEX = 1,
>   	NETDEV_A_DEV_PAD,
> @@ -132,6 +137,20 @@ enum {
>   	NETDEV_A_QUEUE_MAX = (__NETDEV_A_QUEUE_MAX - 1)
>   };
>   
> +enum {
> +	NETDEV_A_STATS_IFINDEX = 1,
> +	NETDEV_A_STATS_QUEUE_TYPE,
> +	NETDEV_A_STATS_QUEUE_ID,
> +	NETDEV_A_STATS_PROJECTION,
> +	NETDEV_A_STATS_RX_PACKETS = 8,
> +	NETDEV_A_STATS_RX_BYTES,
> +	NETDEV_A_STATS_TX_PACKETS,
> +	NETDEV_A_STATS_TX_BYTES,
> +
> +	__NETDEV_A_STATS_MAX,
> +	NETDEV_A_STATS_MAX = (__NETDEV_A_STATS_MAX - 1)
> +};
> +
>   enum {
>   	NETDEV_CMD_DEV_GET = 1,
>   	NETDEV_CMD_DEV_ADD_NTF,
> @@ -144,6 +163,7 @@ enum {
>   	NETDEV_CMD_PAGE_POOL_STATS_GET,
>   	NETDEV_CMD_QUEUE_GET,
>   	NETDEV_CMD_NAPI_GET,
> +	NETDEV_CMD_STATS_GET,
>   
>   	__NETDEV_CMD_MAX,
>   	NETDEV_CMD_MAX = (__NETDEV_CMD_MAX - 1)
> diff --git a/net/core/netdev-genl-gen.c b/net/core/netdev-genl-gen.c
> index be7f2ebd61b2..a786590fc0e2 100644
> --- a/net/core/netdev-genl-gen.c
> +++ b/net/core/netdev-genl-gen.c
> @@ -68,6 +68,11 @@ static const struct nla_policy netdev_napi_get_dump_nl_policy[NETDEV_A_NAPI_IFIN
>   	[NETDEV_A_NAPI_IFINDEX] = NLA_POLICY_MIN(NLA_U32, 1),
>   };
>   
> +/* NETDEV_CMD_STATS_GET - dump */
> +static const struct nla_policy netdev_stats_get_nl_policy[NETDEV_A_STATS_PROJECTION + 1] = {
> +	[NETDEV_A_STATS_PROJECTION] = NLA_POLICY_MAX(NLA_UINT, 1),
> +};
> +
>   /* Ops table for netdev */
>   static const struct genl_split_ops netdev_nl_ops[] = {
>   	{
> @@ -138,6 +143,13 @@ static const struct genl_split_ops netdev_nl_ops[] = {
>   		.maxattr	= NETDEV_A_NAPI_IFINDEX,
>   		.flags		= GENL_CMD_CAP_DUMP,
>   	},
> +	{
> +		.cmd		= NETDEV_CMD_STATS_GET,
> +		.dumpit		= netdev_nl_stats_get_dumpit,
> +		.policy		= netdev_stats_get_nl_policy,
> +		.maxattr	= NETDEV_A_STATS_PROJECTION,
> +		.flags		= GENL_CMD_CAP_DUMP,
> +	},
>   };
>   
>   static const struct genl_multicast_group netdev_nl_mcgrps[] = {
> diff --git a/net/core/netdev-genl-gen.h b/net/core/netdev-genl-gen.h
> index a47f2bcbe4fa..de878ba2bad7 100644
> --- a/net/core/netdev-genl-gen.h
> +++ b/net/core/netdev-genl-gen.h
> @@ -28,6 +28,8 @@ int netdev_nl_queue_get_dumpit(struct sk_buff *skb,
>   			       struct netlink_callback *cb);
>   int netdev_nl_napi_get_doit(struct sk_buff *skb, struct genl_info *info);
>   int netdev_nl_napi_get_dumpit(struct sk_buff *skb, struct netlink_callback *cb);
> +int netdev_nl_stats_get_dumpit(struct sk_buff *skb,
> +			       struct netlink_callback *cb);
>   
>   enum {
>   	NETDEV_NLGRP_MGMT,
> diff --git a/net/core/netdev-genl.c b/net/core/netdev-genl.c
> index fd98936da3ae..fe4e9bc5436a 100644
> --- a/net/core/netdev-genl.c
> +++ b/net/core/netdev-genl.c
> @@ -8,6 +8,7 @@
>   #include <net/xdp.h>
>   #include <net/xdp_sock.h>
>   #include <net/netdev_rx_queue.h>
> +#include <net/netdev_queues.h>
>   #include <net/busy_poll.h>
>   
>   #include "netdev-genl-gen.h"
> @@ -469,6 +470,223 @@ int netdev_nl_queue_get_dumpit(struct sk_buff *skb, struct netlink_callback *cb)
>   	return skb->len;
>   }
>   
> +#define NETDEV_STAT_NOT_SET		(~0ULL)
> +
> +static void
> +netdev_nl_stats_add(void *_sum, const void *_add, size_t size)
> +{
> +	const u64 *add = _add;
> +	u64 *sum = _sum;
> +
> +	while (size) {
> +		if (*add != NETDEV_STAT_NOT_SET && *sum != NETDEV_STAT_NOT_SET)
> +			*sum += *add;
> +		sum++;
> +		add++;
> +		size -= 8;
> +	}
> +}
> +
> +static int netdev_stat_put(struct sk_buff *rsp, unsigned int attr_id, u64 value)
> +{
> +	if (value == NETDEV_STAT_NOT_SET)
> +		return 0;
> +	return nla_put_uint(rsp, attr_id, value);
> +}
> +
> +static int
> +netdev_nl_stats_write_rx(struct sk_buff *rsp, struct netdev_queue_stats_rx *rx)
> +{
> +	if (netdev_stat_put(rsp, NETDEV_A_STATS_RX_PACKETS, rx->packets) ||
> +	    netdev_stat_put(rsp, NETDEV_A_STATS_RX_BYTES, rx->bytes))
> +		return -EMSGSIZE;
> +	return 0;
> +}
> +
> +static int
> +netdev_nl_stats_write_tx(struct sk_buff *rsp, struct netdev_queue_stats_tx *tx)
> +{
> +	if (netdev_stat_put(rsp, NETDEV_A_STATS_TX_PACKETS, tx->packets) ||
> +	    netdev_stat_put(rsp, NETDEV_A_STATS_TX_BYTES, tx->bytes))
> +		return -EMSGSIZE;
> +	return 0;
> +}
> +
> +static int
> +netdev_nl_stats_queue(struct net_device *netdev, struct sk_buff *rsp,
> +		      u32 q_type, int i, const struct genl_info *info)
> +{
> +	const struct netdev_stat_ops *ops = netdev->stat_ops;
> +	struct netdev_queue_stats_rx rx;
> +	struct netdev_queue_stats_tx tx;
> +	void *hdr;
> +
> +	hdr = genlmsg_iput(rsp, info);
> +	if (!hdr)
> +		return -EMSGSIZE;
> +	if (nla_put_u32(rsp, NETDEV_A_STATS_IFINDEX, netdev->ifindex) ||
> +	    nla_put_u32(rsp, NETDEV_A_STATS_QUEUE_TYPE, q_type) ||
> +	    nla_put_u32(rsp, NETDEV_A_STATS_QUEUE_ID, i))
> +		goto nla_put_failure;
> +
> +	switch (q_type) {
> +	case NETDEV_QUEUE_TYPE_RX:
> +		memset(&rx, 0xff, sizeof(rx));
> +		ops->get_queue_stats_rx(netdev, i, &rx);
> +		if (!memchr_inv(&rx, 0xff, sizeof(rx)))
> +			goto nla_cancel;
> +		if (netdev_nl_stats_write_rx(rsp, &rx))
> +			goto nla_put_failure;
> +		break;
> +	case NETDEV_QUEUE_TYPE_TX:
> +		memset(&tx, 0xff, sizeof(tx));
> +		ops->get_queue_stats_tx(netdev, i, &tx);
> +		if (!memchr_inv(&tx, 0xff, sizeof(tx)))
> +			goto nla_cancel;
> +		if (netdev_nl_stats_write_tx(rsp, &tx))
> +			goto nla_put_failure;
> +		break;
> +	}
> +
> +	genlmsg_end(rsp, hdr);
> +	return 0;
> +
> +nla_cancel:
> +	genlmsg_cancel(rsp, hdr);
> +	return 0;
> +nla_put_failure:
> +	genlmsg_cancel(rsp, hdr);
> +	return -EMSGSIZE;
> +}
> +
> +static int
> +netdev_nl_stats_by_queue(struct net_device *netdev, struct sk_buff *rsp,
> +			 const struct genl_info *info,
> +			 struct netdev_nl_dump_ctx *ctx)
> +{
> +	const struct netdev_stat_ops *ops = netdev->stat_ops;
> +	int i, err;
> +
> +	if (!(netdev->flags & IFF_UP))
> +		return 0;
> +
> +	i = ctx->rxq_idx;
> +	while (ops->get_queue_stats_rx && i < netdev->real_num_rx_queues) {
> +		err = netdev_nl_stats_queue(netdev, rsp, NETDEV_QUEUE_TYPE_RX,
> +					    i, info);
> +		if (err)
> +			return err;
> +		ctx->rxq_idx = ++i;
> +	}
> +	i = ctx->txq_idx;
> +	while (ops->get_queue_stats_tx && i < netdev->real_num_tx_queues) {
> +		err = netdev_nl_stats_queue(netdev, rsp, NETDEV_QUEUE_TYPE_TX,
> +					    i, info);
> +		if (err)
> +			return err;
> +		ctx->txq_idx = ++i;
> +	}
> +
> +	ctx->rxq_idx = 0;
> +	ctx->txq_idx = 0;
> +	return 0;
> +}
> +
> +static int
> +netdev_nl_stats_by_netdev(struct net_device *netdev, struct sk_buff *rsp,
> +			  const struct genl_info *info)
> +{
> +	struct netdev_queue_stats_rx rx_sum, rx;
> +	struct netdev_queue_stats_tx tx_sum, tx;
> +	const struct netdev_stat_ops *ops;
> +	void *hdr;
> +	int i;
> +
> +	ops = netdev->stat_ops;
> +	/* Netdev can't guarantee any complete counters */
> +	if (!ops->get_base_stats)
> +		return 0;
> +
> +	memset(&rx_sum, 0xff, sizeof(rx_sum));
> +	memset(&tx_sum, 0xff, sizeof(tx_sum));
> +
> +	ops->get_base_stats(netdev, &rx_sum, &tx_sum);
> +
> +	/* The op was there, but nothing reported, don't bother */
> +	if (!memchr_inv(&rx_sum, 0xff, sizeof(rx_sum)) &&
> +	    !memchr_inv(&tx_sum, 0xff, sizeof(tx_sum)))
> +		return 0;
> +
> +	hdr = genlmsg_iput(rsp, info);
> +	if (!hdr)
> +		return -EMSGSIZE;
> +	if (nla_put_u32(rsp, NETDEV_A_STATS_IFINDEX, netdev->ifindex))
> +		goto nla_put_failure;
> +
> +	for (i = 0; i < netdev->real_num_rx_queues; i++) {
> +		memset(&rx, 0xff, sizeof(rx));
> +		if (ops->get_queue_stats_rx)
> +			ops->get_queue_stats_rx(netdev, i, &rx);
> +		netdev_nl_stats_add(&rx_sum, &rx, sizeof(rx));
> +	}
> +	for (i = 0; i < netdev->real_num_tx_queues; i++) {
> +		memset(&tx, 0xff, sizeof(tx));
> +		if (ops->get_queue_stats_tx)
> +			ops->get_queue_stats_tx(netdev, i, &tx);
> +		netdev_nl_stats_add(&tx_sum, &tx, sizeof(tx));
> +	}
> +
> +	if (netdev_nl_stats_write_rx(rsp, &rx_sum) ||
> +	    netdev_nl_stats_write_tx(rsp, &tx_sum))
> +		goto nla_put_failure;
> +
> +	genlmsg_end(rsp, hdr);
> +	return 0;
> +
> +nla_put_failure:
> +	genlmsg_cancel(rsp, hdr);
> +	return -EMSGSIZE;
> +}
> +
> +int netdev_nl_stats_get_dumpit(struct sk_buff *skb,
> +			       struct netlink_callback *cb)
> +{
> +	struct netdev_nl_dump_ctx *ctx = netdev_dump_ctx(cb);
> +	const struct genl_info *info = genl_info_dump(cb);
> +	enum netdev_stats_projection projection;
> +	struct net *net = sock_net(skb->sk);
> +	struct net_device *netdev;
> +	int err = 0;
> +
> +	projection = NETDEV_STATS_PROJECTION_NETDEV;
> +	if (info->attrs[NETDEV_A_STATS_PROJECTION])
> +		projection =
> +			nla_get_uint(info->attrs[NETDEV_A_STATS_PROJECTION]);
> +
> +	rtnl_lock();
> +	for_each_netdev_dump(net, netdev, ctx->ifindex) {
> +		if (!netdev->stat_ops)
> +			continue;
> +
> +		switch (projection) {
> +		case NETDEV_STATS_PROJECTION_NETDEV:
> +			err = netdev_nl_stats_by_netdev(netdev, skb, info);
> +			break;
> +		case NETDEV_STATS_PROJECTION_QUEUE:
> +			err = netdev_nl_stats_by_queue(netdev, skb, info, ctx);
> +			break;
> +		}
> +		if (err < 0)
> +			break;
> +	}
> +	rtnl_unlock();
> +
> +	if (err != -EMSGSIZE)
> +		return err;
> +
> +	return skb->len;
> +}
> +
>   static int netdev_genl_netdevice_event(struct notifier_block *nb,
>   				       unsigned long event, void *ptr)
>   {
> diff --git a/tools/include/uapi/linux/netdev.h b/tools/include/uapi/linux/netdev.h
> index 93cb411adf72..c6a5e4b03828 100644
> --- a/tools/include/uapi/linux/netdev.h
> +++ b/tools/include/uapi/linux/netdev.h
> @@ -70,6 +70,11 @@ enum netdev_queue_type {
>   	NETDEV_QUEUE_TYPE_TX,
>   };
>   
> +enum netdev_stats_projection {
> +	NETDEV_STATS_PROJECTION_NETDEV,
> +	NETDEV_STATS_PROJECTION_QUEUE,
> +};
> +
>   enum {
>   	NETDEV_A_DEV_IFINDEX = 1,
>   	NETDEV_A_DEV_PAD,
> @@ -132,6 +137,20 @@ enum {
>   	NETDEV_A_QUEUE_MAX = (__NETDEV_A_QUEUE_MAX - 1)
>   };
>   
> +enum {
> +	NETDEV_A_STATS_IFINDEX = 1,
> +	NETDEV_A_STATS_QUEUE_TYPE,
> +	NETDEV_A_STATS_QUEUE_ID,
> +	NETDEV_A_STATS_PROJECTION,
> +	NETDEV_A_STATS_RX_PACKETS = 8,
> +	NETDEV_A_STATS_RX_BYTES,
> +	NETDEV_A_STATS_TX_PACKETS,
> +	NETDEV_A_STATS_TX_BYTES,
> +
> +	__NETDEV_A_STATS_MAX,
> +	NETDEV_A_STATS_MAX = (__NETDEV_A_STATS_MAX - 1)
> +};
> +
>   enum {
>   	NETDEV_CMD_DEV_GET = 1,
>   	NETDEV_CMD_DEV_ADD_NTF,
> @@ -144,6 +163,7 @@ enum {
>   	NETDEV_CMD_PAGE_POOL_STATS_GET,
>   	NETDEV_CMD_QUEUE_GET,
>   	NETDEV_CMD_NAPI_GET,
> +	NETDEV_CMD_STATS_GET,
>   
>   	__NETDEV_CMD_MAX,
>   	NETDEV_CMD_MAX = (__NETDEV_CMD_MAX - 1)
Jakub Kicinski Feb. 23, 2024, 1:37 a.m. UTC | #3
On Thu, 22 Feb 2024 16:23:57 -0800 Nambiar, Amritha wrote:
> > +int netdev_nl_stats_get_dumpit(struct sk_buff *skb,
> > +			       struct netlink_callback *cb)
> > +{
> > +	struct netdev_nl_dump_ctx *ctx = netdev_dump_ctx(cb);
> > +	const struct genl_info *info = genl_info_dump(cb);
> > +	enum netdev_stats_projection projection;
> > +	struct net *net = sock_net(skb->sk);
> > +	struct net_device *netdev;
> > +	int err = 0;
> > +
> > +	projection = NETDEV_STATS_PROJECTION_NETDEV;
> > +	if (info->attrs[NETDEV_A_STATS_PROJECTION])
> > +		projection =
> > +			nla_get_uint(info->attrs[NETDEV_A_STATS_PROJECTION]);
> > +
> > +	rtnl_lock();  
> 
> Could we also add filtered-dump for a user provided ifindex ?

Definitely, wasn't sure if that's a pre-requisite for merging,
or we can leave it on the "netdev ToDo sheet" as a learning task 
for someone. Opinions welcome..
Jakub Kicinski Feb. 23, 2024, 1:44 a.m. UTC | #4
On Thu, 22 Feb 2024 16:29:08 -0800 Nambiar, Amritha wrote:
> Thanks, this almost has all the bits to also lookup stats for a single 
> queue with --do stats-get with a queue id and type.

We could without the projection. The projection (BTW not a great name,
couldn't come up with a better one.. split? dis-aggregation? view?
un-grouping?) "splits" a single object (netdev stats) across components
(queues). I was wondering if at some point we may add another
projection, splitting a queue. And then a queue+id+projection would
actually have to return multiple objects. So maybe it's more consistent
to just not support do at all for this op, and only support dump?

We can support filtered dump on ifindex + queue id + type, and expect
it to return one object for now.

Not 100% sure so I went with the "keep it simple, we can add more later"
approach.
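
A minimal userspace sketch of the filtered dump described above, assuming a
filter with optional ifindex, queue type and queue id. The names
`stats_obj`, `stats_filter`, `filter_match` and `filtered_dump` are
hypothetical and are not part of the patch; this only models the matching
logic, not the netlink plumbing.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for enum netdev_queue_type. */
enum queue_type { QUEUE_TYPE_RX, QUEUE_TYPE_TX };

/* One per-queue stats object as it might appear in a dump reply. */
struct stats_obj {
	uint32_t ifindex;
	enum queue_type type;
	uint32_t queue_id;
};

/* Hypothetical filter: a field is compared only when its has_* flag is set. */
struct stats_filter {
	bool has_ifindex, has_type, has_queue_id;
	uint32_t ifindex;
	enum queue_type type;
	uint32_t queue_id;
};

/* Return true when the object passes every filter field that is set. */
static bool filter_match(const struct stats_filter *f,
			 const struct stats_obj *o)
{
	if (f->has_ifindex && f->ifindex != o->ifindex)
		return false;
	if (f->has_type && f->type != o->type)
		return false;
	if (f->has_queue_id && f->queue_id != o->queue_id)
		return false;
	return true;
}

/* Copy matching objects into out (at most out_cap); return how many. */
static size_t filtered_dump(const struct stats_filter *f,
			    const struct stats_obj *objs, size_t n,
			    struct stats_obj *out, size_t out_cap)
{
	size_t count = 0;

	for (size_t i = 0; i < n && count < out_cap; i++)
		if (filter_match(f, &objs[i]))
			out[count++] = objs[i];
	return count;
}
```

With all three filter fields set, the dump yields at most one object today.
If a later projection split a queue further, the same filter could return
several objects without changing the request format.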
Stanislav Fomichev Feb. 23, 2024, 4:32 a.m. UTC | #5
On 02/22, Jakub Kicinski wrote:
> On Thu, 22 Feb 2024 16:29:08 -0800 Nambiar, Amritha wrote:
> > Thanks, this almost has all the bits to also lookup stats for a single 
> > queue with --do stats-get with a queue id and type.
> 
> We could without the projection. The projection (BTW not a great name,
> couldn't come up with a better one.. split? dis-aggregation? view?
> un-grouping?) "splits" a single object (netdev stats) across components

How about "scope" ? Device scope. Queue scope.

> (queues). I was wondering if at some point we may add another
> projection, splitting a queue. And then a queue+id+projection would
> actually have to return multiple objects. So maybe it's more consistent
> to just not support do at all for this op, and only support dump?
> 
> We can support filtered dump on ifindex + queue id + type, and expect
> it to return one object for now.
> 
> Not 100% sure so I went with the "keep it simple, we can add more later"
> approach.
Vadim Fedorenko Feb. 23, 2024, 9:22 a.m. UTC | #6
On 23/02/2024 04:32, Stanislav Fomichev wrote:
> On 02/22, Jakub Kicinski wrote:
>> On Thu, 22 Feb 2024 16:29:08 -0800 Nambiar, Amritha wrote:
>>> Thanks, this almost has all the bits to also lookup stats for a single
>>> queue with --do stats-get with a queue id and type.
>>
>> We could without the projection. The projection (BTW not a great name,
>> couldn't come up with a better one.. split? dis-aggregation? view?
>> un-grouping?) "splits" a single object (netdev stats) across components
> 
> How about "scope" ? Device scope. Queue scope.
> 

"scope" or "view" looks better, WDYT?

>> (queues). I was wondering if at some point we may add another
>> projection, splitting a queue. And then a queue+id+projection would
>> actually have to return multiple objects. So maybe it's more consistent
>> to just not support do at all for this op, and only support dump?
>>
>> We can support filtered dump on ifindex + queue id + type, and expect
>> it to return one object for now.
>>
>> Not 100% sure so I went with the "keep it simple, we can add more later"
>> approach.
Nambiar, Amritha Feb. 23, 2024, 8:40 p.m. UTC | #7
On 2/22/2024 5:37 PM, Jakub Kicinski wrote:
> On Thu, 22 Feb 2024 16:23:57 -0800 Nambiar, Amritha wrote:
>>> +int netdev_nl_stats_get_dumpit(struct sk_buff *skb,
>>> +			       struct netlink_callback *cb)
>>> +{
>>> +	struct netdev_nl_dump_ctx *ctx = netdev_dump_ctx(cb);
>>> +	const struct genl_info *info = genl_info_dump(cb);
>>> +	enum netdev_stats_projection projection;
>>> +	struct net *net = sock_net(skb->sk);
>>> +	struct net_device *netdev;
>>> +	int err = 0;
>>> +
>>> +	projection = NETDEV_STATS_PROJECTION_NETDEV;
>>> +	if (info->attrs[NETDEV_A_STATS_PROJECTION])
>>> +		projection =
>>> +			nla_get_uint(info->attrs[NETDEV_A_STATS_PROJECTION]);
>>> +
>>> +	rtnl_lock();
>>
>> Could we also add filtered-dump for a user provided ifindex ?
> 
> Definitely, wasn't sure if that's a pre-requisite for merging,
> or we can leave it on the "netdev ToDo sheet" as a learning task
> for someone. Opinions welcome..

Totally! Ignore the nit-pick.
Nambiar, Amritha Feb. 23, 2024, 8:51 p.m. UTC | #8
On 2/22/2024 5:44 PM, Jakub Kicinski wrote:
> On Thu, 22 Feb 2024 16:29:08 -0800 Nambiar, Amritha wrote:
>> Thanks, this almost has all the bits to also lookup stats for a single
>> queue with --do stats-get with a queue id and type.
> 
> We could without the projection. The projection (BTW not a great name,
> couldn't come up with a better one.. split? dis-aggregation? view?
> un-grouping?) "splits" a single object (netdev stats) across components
> (queues). I was wondering if at some point we may add another
> projection, splitting a queue. And then a queue+id+projection would
> actually have to return multiple objects. So maybe it's more consistent
> to just not support do at all for this op, and only support dump?
> 

So I understand splitting a netdev object into component queues, but do
you have anything in mind WRT splitting a queue? What could be the
components for a queue object?
Agree that we can avoid the 'do' support if there are multiple 
possibilities for the projection/scope/view.
"scope" or "view" LGTM.

> We can support filtered dump on ifindex + queue id + type, and expect
> it to return one object for now.
> 
Sounds good if we are doing away with the 'do' support.

> Not 100% sure so I went with the "keep it simple, we can add more later"
> approach.
>
Jakub Kicinski Feb. 24, 2024, 12:13 a.m. UTC | #9
On Fri, 23 Feb 2024 12:51:51 -0800 Nambiar, Amritha wrote:
> So I understand splitting a netdev object into component queues, but do
> you have anything in mind WRT splitting a queue? What could be the
> components for a queue object?

HW vs SW stats was something that came to mind when I was writing
the code. More speculatively speaking - there could also be queues
fed from multiple buffer pools, so a split per buffer pool could maybe
one day make some sense?
Nambiar, Amritha Feb. 26, 2024, 7:42 p.m. UTC | #10
On 2/23/2024 4:13 PM, Jakub Kicinski wrote:
> On Fri, 23 Feb 2024 12:51:51 -0800 Nambiar, Amritha wrote:
>> So I understand splitting a netdev object into component queues, but do
>> you have anything in mind WRT splitting a queue? What could be the
>> components for a queue object?
> 
> HW vs SW stats was something that came to mind when I was writing
> the code. More speculatively speaking - there could also be queues
> fed from multiple buffer pools, so a split per buffer pool could maybe
> one day make some sense?

Okay, HW/SW stats SGTM. Split per buffer pool also could be useful, 
queue+id+projection/scope/view would return multiple objects based on 
the pool.
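
For reference when reading the patch, the device-level ("netdev")
projection can be modelled in userspace roughly as follows: counters start
out as an all-ones "not set" sentinel, the base stats decide which counters
are reportable at all, and live queues are then added on top. This is a
sketch under those assumptions; `queue_stats`, `stats_init`, `stats_add`
and `device_total` are illustrative names, not the kernel's.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Sentinel meaning "counter not reported" (all bits set), mirroring
 * NETDEV_STAT_NOT_SET in the patch.
 */
#define STAT_NOT_SET UINT64_MAX

/* Userspace stand-in for struct netdev_queue_stats_rx / _tx. */
struct queue_stats {
	uint64_t bytes;
	uint64_t packets;
};

/* Mark every counter as "not set" before a driver fills in the struct. */
static void stats_init(struct queue_stats *s)
{
	memset(s, 0xff, sizeof(*s));
}

/* Add one queue's counters into the running total. A counter is summed
 * only when both sides are set, so a counter that the base stats left
 * unset stays unset in the total.
 */
static void stats_add(struct queue_stats *sum, const struct queue_stats *add)
{
	if (sum->bytes != STAT_NOT_SET && add->bytes != STAT_NOT_SET)
		sum->bytes += add->bytes;
	if (sum->packets != STAT_NOT_SET && add->packets != STAT_NOT_SET)
		sum->packets += add->packets;
}

/* Device total = base stats (history of removed queues) + live queues. */
static struct queue_stats device_total(const struct queue_stats *base,
				       const struct queue_stats *queues,
				       size_t n_queues)
{
	struct queue_stats sum = *base;

	for (size_t i = 0; i < n_queues; i++)
		stats_add(&sum, &queues[i]);
	return sum;
}
```

In this model, a driver that keeps packet history but not byte history
sets only `packets` in the base stats, so the device-level reply carries a
packet total and omits bytes even if individual queues report them.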
Patch

diff --git a/Documentation/netlink/specs/netdev.yaml b/Documentation/netlink/specs/netdev.yaml
index 3addac970680..eea41e9de98c 100644
--- a/Documentation/netlink/specs/netdev.yaml
+++ b/Documentation/netlink/specs/netdev.yaml
@@ -74,6 +74,10 @@  name: netdev
     name: queue-type
     type: enum
     entries: [ rx, tx ]
+  -
+    name: stats-projection
+    type: enum
+    entries: [ netdev, queue ]
 
 attribute-sets:
   -
@@ -265,6 +269,66 @@  name: netdev
         doc: ID of the NAPI instance which services this queue.
         type: u32
 
+  -
+    name: stats
+    doc: |
+      Get device statistics, scoped to a device or a queue.
+      These statistics extend (and partially duplicate) statistics available
+      in struct rtnl_link_stats64.
+      Value of the `projection` attribute determines how statistics are
+      aggregated. When aggregated for the entire device the statistics
+      represent the total number of events since last explicit reset of
+      the device (i.e. not a reconfiguration like changing queue count).
+      When reported per-queue, however, the statistics may not add
+      up to the total number of events, will only be reported for currently
+      active objects, and will likely report the number of events since last
+      reconfiguration.
+    attributes:
+      -
+        name: ifindex
+        doc: ifindex of the netdevice to which stats belong.
+        type: u32
+        checks:
+          min: 1
+      -
+        name: queue-type
+        doc: Queue type as rx, tx, for queue-id.
+        type: u32
+        enum: queue-type
+      -
+        name: queue-id
+        doc: Queue ID, if stats are scoped to a single queue instance.
+        type: u32
+      -
+        name: projection
+        doc: |
+          What object type should be used to iterate over the stats.
+        type: uint
+        enum: stats-projection
+      -
+        name: rx-packets
+        doc: |
+          Number of wire packets successfully received and passed to the stack.
+          For drivers supporting XDP, XDP is considered the first layer
+          of the stack, so packets consumed by XDP are still counted here.
+        type: uint
+        value: 8 # reserve some attr ids in case we need more metadata later
+      -
+        name: rx-bytes
+        doc: Successfully received bytes, see `rx-packets`.
+        type: uint
+      -
+        name: tx-packets
+        doc: |
+          Number of wire packets successfully sent. Packet is considered to be
+          successfully sent once it is in device memory (usually this means
+          the device has issued a DMA completion for the packet).
+        type: uint
+      -
+        name: tx-bytes
+        doc: Successfully sent bytes, see `tx-packets`.
+        type: uint
+
 operations:
   list:
     -
@@ -405,6 +469,26 @@  name: netdev
           attributes:
             - ifindex
         reply: *napi-get-op
+    -
+      name: stats-get
+      doc: |
+        Get / dump fine grained statistics. Which statistics are reported
+        depends on the device and the driver, and whether the driver stores
+        software counters per-queue.
+      attribute-set: stats
+      dump:
+        request:
+          attributes:
+            - projection
+        reply:
+          attributes:
+            - ifindex
+            - queue-type
+            - queue-id
+            - rx-packets
+            - rx-bytes
+            - tx-packets
+            - tx-bytes
 
 mcast-groups:
   list:
diff --git a/Documentation/networking/statistics.rst b/Documentation/networking/statistics.rst
index 551b3cc29a41..8a4d166af3c0 100644
--- a/Documentation/networking/statistics.rst
+++ b/Documentation/networking/statistics.rst
@@ -41,6 +41,15 @@  If `-s` is specified once the detailed errors won't be shown.
 
 `ip` supports JSON formatting via the `-j` option.
 
+Queue statistics
+~~~~~~~~~~~~~~~~
+
+Queue statistics are accessible via the netdev netlink family.
+
+Currently no widely distributed CLI exists to access those statistics.
+Kernel development tools (ynl) can be used to experiment with them,
+see :ref:`Documentation/userspace-api/netlink/intro-specs.rst`.
+
 Protocol-specific statistics
 ----------------------------
 
@@ -134,7 +143,7 @@  reading multiple stats as it internally performs a full dump of
 and reports only the stat corresponding to the accessed file.
 
 Sysfs files are documented in
-`Documentation/ABI/testing/sysfs-class-net-statistics`.
+:ref:`Documentation/ABI/testing/sysfs-class-net-statistics`.
 
 
 netlink
@@ -147,6 +156,12 @@  Statistics are reported both in the responses to link information
 requests (`RTM_GETLINK`) and statistic requests (`RTM_GETSTATS`,
 when `IFLA_STATS_LINK_64` bit is set in the `.filter_mask` of the request).
 
+netdev (netlink)
+~~~~~~~~~~~~~~~~
+
+The `netdev` generic netlink family allows accessing page pool and
+per-queue statistics.
+
 ethtool
 -------
 
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index f07c8374f29c..afcb2a0566f9 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -2039,6 +2039,7 @@  enum netdev_reg_state {
  *
  *	@sysfs_rx_queue_group:	Space for optional per-rx queue attributes
  *	@rtnl_link_ops:	Rtnl_link_ops
+ *	@stat_ops:	Optional ops for queue-aware statistics
  *
  *	@gso_max_size:	Maximum size of generic segmentation offload
  *	@tso_max_size:	Device (as in HW) limit on the max TSO request size
@@ -2419,6 +2420,8 @@  struct net_device {
 
 	const struct rtnl_link_ops *rtnl_link_ops;
 
+	const struct netdev_stat_ops *stat_ops;
+
 	/* for setting kernel sock attribute on TCP connection setup */
 #define GSO_MAX_SEGS		65535u
 #define GSO_LEGACY_MAX_SIZE	65536u
diff --git a/include/net/netdev_queues.h b/include/net/netdev_queues.h
index 8b8ed4e13d74..d633347eeda5 100644
--- a/include/net/netdev_queues.h
+++ b/include/net/netdev_queues.h
@@ -4,6 +4,60 @@ 
 
 #include <linux/netdevice.h>
 
+struct netdev_queue_stats_rx {
+	u64 bytes;
+	u64 packets;
+};
+
+struct netdev_queue_stats_tx {
+	u64 bytes;
+	u64 packets;
+};
+
+/**
+ * struct netdev_stat_ops - netdev ops for fine grained stats
+ * @get_queue_stats_rx:	get stats for a given Rx queue
+ * @get_queue_stats_tx:	get stats for a given Tx queue
+ * @get_base_stats:	get base stats (not belonging to any live instance)
+ *
+ * Query stats for a given object. The values of the statistics are undefined
+ * on entry (specifically they are *not* zero-initialized). Drivers should
+ * assign values only to the statistics they collect. Statistics which are not
+ * collected must be left undefined.
+ *
+ * Queue objects are not necessarily persistent, and only currently active
+ * queues are queried by the per-queue callbacks. This means that per-queue
+ * statistics will not generally add up to the total number of events for
+ * the device. The @get_base_stats callback allows filling in the delta
+ * between events for currently live queues and overall device history.
+ * When the statistics for the entire device are queried, first @get_base_stats
+ * is issued to collect the delta, and then a series of per-queue callbacks.
+ * Only statistics which are set in @get_base_stats will be reported
+ * at the device level, meaning that unlike in queue callbacks, setting
+ * a statistic to zero in @get_base_stats is a legitimate thing to do.
+ * This is because @get_base_stats has a second function of designating which
+ * statistics are in fact correct for the entire device (e.g. when history
+ * for some of the events is not maintained, and reliable "total" cannot
+ * be provided).
+ *
+ * Device drivers can assume that when collecting total device stats,
+ * the @get_base_stats and subsequent per-queue calls are performed
+ * "atomically" (without releasing the rtnl_lock).
+ *
+ * Device drivers are encouraged to reset the per-queue statistics when
+ * the number of queues changes. This is because the primary use case for
+ * per-queue statistics is currently to detect traffic imbalance.
+ */
+struct netdev_stat_ops {
+	void (*get_queue_stats_rx)(struct net_device *dev, int idx,
+				   struct netdev_queue_stats_rx *stats);
+	void (*get_queue_stats_tx)(struct net_device *dev, int idx,
+				   struct netdev_queue_stats_tx *stats);
+	void (*get_base_stats)(struct net_device *dev,
+			       struct netdev_queue_stats_rx *rx,
+			       struct netdev_queue_stats_tx *tx);
+};
+
 /**
  * DOC: Lockless queue stopping / waking helpers.
  *
diff --git a/include/uapi/linux/netdev.h b/include/uapi/linux/netdev.h
index 93cb411adf72..c6a5e4b03828 100644
--- a/include/uapi/linux/netdev.h
+++ b/include/uapi/linux/netdev.h
@@ -70,6 +70,11 @@  enum netdev_queue_type {
 	NETDEV_QUEUE_TYPE_TX,
 };
 
+enum netdev_stats_projection {
+	NETDEV_STATS_PROJECTION_NETDEV,
+	NETDEV_STATS_PROJECTION_QUEUE,
+};
+
 enum {
 	NETDEV_A_DEV_IFINDEX = 1,
 	NETDEV_A_DEV_PAD,
@@ -132,6 +137,20 @@  enum {
 	NETDEV_A_QUEUE_MAX = (__NETDEV_A_QUEUE_MAX - 1)
 };
 
+enum {
+	NETDEV_A_STATS_IFINDEX = 1,
+	NETDEV_A_STATS_QUEUE_TYPE,
+	NETDEV_A_STATS_QUEUE_ID,
+	NETDEV_A_STATS_PROJECTION,
+	NETDEV_A_STATS_RX_PACKETS = 8,
+	NETDEV_A_STATS_RX_BYTES,
+	NETDEV_A_STATS_TX_PACKETS,
+	NETDEV_A_STATS_TX_BYTES,
+
+	__NETDEV_A_STATS_MAX,
+	NETDEV_A_STATS_MAX = (__NETDEV_A_STATS_MAX - 1)
+};
+
 enum {
 	NETDEV_CMD_DEV_GET = 1,
 	NETDEV_CMD_DEV_ADD_NTF,
@@ -144,6 +163,7 @@  enum {
 	NETDEV_CMD_PAGE_POOL_STATS_GET,
 	NETDEV_CMD_QUEUE_GET,
 	NETDEV_CMD_NAPI_GET,
+	NETDEV_CMD_STATS_GET,
 
 	__NETDEV_CMD_MAX,
 	NETDEV_CMD_MAX = (__NETDEV_CMD_MAX - 1)
diff --git a/net/core/netdev-genl-gen.c b/net/core/netdev-genl-gen.c
index be7f2ebd61b2..a786590fc0e2 100644
--- a/net/core/netdev-genl-gen.c
+++ b/net/core/netdev-genl-gen.c
@@ -68,6 +68,11 @@  static const struct nla_policy netdev_napi_get_dump_nl_policy[NETDEV_A_NAPI_IFIN
 	[NETDEV_A_NAPI_IFINDEX] = NLA_POLICY_MIN(NLA_U32, 1),
 };
 
+/* NETDEV_CMD_STATS_GET - dump */
+static const struct nla_policy netdev_stats_get_nl_policy[NETDEV_A_STATS_PROJECTION + 1] = {
+	[NETDEV_A_STATS_PROJECTION] = NLA_POLICY_MAX(NLA_UINT, 1),
+};
+
 /* Ops table for netdev */
 static const struct genl_split_ops netdev_nl_ops[] = {
 	{
@@ -138,6 +143,13 @@  static const struct genl_split_ops netdev_nl_ops[] = {
 		.maxattr	= NETDEV_A_NAPI_IFINDEX,
 		.flags		= GENL_CMD_CAP_DUMP,
 	},
+	{
+		.cmd		= NETDEV_CMD_STATS_GET,
+		.dumpit		= netdev_nl_stats_get_dumpit,
+		.policy		= netdev_stats_get_nl_policy,
+		.maxattr	= NETDEV_A_STATS_PROJECTION,
+		.flags		= GENL_CMD_CAP_DUMP,
+	},
 };
 
 static const struct genl_multicast_group netdev_nl_mcgrps[] = {
diff --git a/net/core/netdev-genl-gen.h b/net/core/netdev-genl-gen.h
index a47f2bcbe4fa..de878ba2bad7 100644
--- a/net/core/netdev-genl-gen.h
+++ b/net/core/netdev-genl-gen.h
@@ -28,6 +28,8 @@  int netdev_nl_queue_get_dumpit(struct sk_buff *skb,
 			       struct netlink_callback *cb);
 int netdev_nl_napi_get_doit(struct sk_buff *skb, struct genl_info *info);
 int netdev_nl_napi_get_dumpit(struct sk_buff *skb, struct netlink_callback *cb);
+int netdev_nl_stats_get_dumpit(struct sk_buff *skb,
+			       struct netlink_callback *cb);
 
 enum {
 	NETDEV_NLGRP_MGMT,
diff --git a/net/core/netdev-genl.c b/net/core/netdev-genl.c
index fd98936da3ae..fe4e9bc5436a 100644
--- a/net/core/netdev-genl.c
+++ b/net/core/netdev-genl.c
@@ -8,6 +8,7 @@ 
 #include <net/xdp.h>
 #include <net/xdp_sock.h>
 #include <net/netdev_rx_queue.h>
+#include <net/netdev_queues.h>
 #include <net/busy_poll.h>
 
 #include "netdev-genl-gen.h"
@@ -469,6 +470,223 @@  int netdev_nl_queue_get_dumpit(struct sk_buff *skb, struct netlink_callback *cb)
 	return skb->len;
 }
 
+#define NETDEV_STAT_NOT_SET		(~0ULL)
+
+static void
+netdev_nl_stats_add(void *_sum, const void *_add, size_t size)
+{
+	const u64 *add = _add;
+	u64 *sum = _sum;
+
+	while (size) {
+		if (*add != NETDEV_STAT_NOT_SET && *sum != NETDEV_STAT_NOT_SET)
+			*sum += *add;
+		sum++;
+		add++;
+		size -= 8;
+	}
+}
+
+static int netdev_stat_put(struct sk_buff *rsp, unsigned int attr_id, u64 value)
+{
+	if (value == NETDEV_STAT_NOT_SET)
+		return 0;
+	return nla_put_uint(rsp, attr_id, value);
+}
+
+static int
+netdev_nl_stats_write_rx(struct sk_buff *rsp, struct netdev_queue_stats_rx *rx)
+{
+	if (netdev_stat_put(rsp, NETDEV_A_STATS_RX_PACKETS, rx->packets) ||
+	    netdev_stat_put(rsp, NETDEV_A_STATS_RX_BYTES, rx->bytes))
+		return -EMSGSIZE;
+	return 0;
+}
+
+static int
+netdev_nl_stats_write_tx(struct sk_buff *rsp, struct netdev_queue_stats_tx *tx)
+{
+	if (netdev_stat_put(rsp, NETDEV_A_STATS_TX_PACKETS, tx->packets) ||
+	    netdev_stat_put(rsp, NETDEV_A_STATS_TX_BYTES, tx->bytes))
+		return -EMSGSIZE;
+	return 0;
+}
+
+static int
+netdev_nl_stats_queue(struct net_device *netdev, struct sk_buff *rsp,
+		      u32 q_type, int i, const struct genl_info *info)
+{
+	const struct netdev_stat_ops *ops = netdev->stat_ops;
+	struct netdev_queue_stats_rx rx;
+	struct netdev_queue_stats_tx tx;
+	void *hdr;
+
+	hdr = genlmsg_iput(rsp, info);
+	if (!hdr)
+		return -EMSGSIZE;
+	if (nla_put_u32(rsp, NETDEV_A_STATS_IFINDEX, netdev->ifindex) ||
+	    nla_put_u32(rsp, NETDEV_A_STATS_QUEUE_TYPE, q_type) ||
+	    nla_put_u32(rsp, NETDEV_A_STATS_QUEUE_ID, i))
+		goto nla_put_failure;
+
+	switch (q_type) {
+	case NETDEV_QUEUE_TYPE_RX:
+		memset(&rx, 0xff, sizeof(rx));
+		ops->get_queue_stats_rx(netdev, i, &rx);
+		if (!memchr_inv(&rx, 0xff, sizeof(rx)))
+			goto nla_cancel;
+		if (netdev_nl_stats_write_rx(rsp, &rx))
+			goto nla_put_failure;
+		break;
+	case NETDEV_QUEUE_TYPE_TX:
+		memset(&tx, 0xff, sizeof(tx));
+		ops->get_queue_stats_tx(netdev, i, &tx);
+		if (!memchr_inv(&tx, 0xff, sizeof(tx)))
+			goto nla_cancel;
+		if (netdev_nl_stats_write_tx(rsp, &tx))
+			goto nla_put_failure;
+		break;
+	}
+
+	genlmsg_end(rsp, hdr);
+	return 0;
+
+nla_cancel:
+	genlmsg_cancel(rsp, hdr);
+	return 0;
+nla_put_failure:
+	genlmsg_cancel(rsp, hdr);
+	return -EMSGSIZE;
+}
+
+static int
+netdev_nl_stats_by_queue(struct net_device *netdev, struct sk_buff *rsp,
+			 const struct genl_info *info,
+			 struct netdev_nl_dump_ctx *ctx)
+{
+	const struct netdev_stat_ops *ops = netdev->stat_ops;
+	int i, err;
+
+	if (!(netdev->flags & IFF_UP))
+		return 0;
+
+	i = ctx->rxq_idx;
+	while (ops->get_queue_stats_rx && i < netdev->real_num_rx_queues) {
+		err = netdev_nl_stats_queue(netdev, rsp, NETDEV_QUEUE_TYPE_RX,
+					    i, info);
+		if (err)
+			return err;
+		ctx->rxq_idx = ++i;
+	}
+	i = ctx->txq_idx;
+	while (ops->get_queue_stats_tx && i < netdev->real_num_tx_queues) {
+		err = netdev_nl_stats_queue(netdev, rsp, NETDEV_QUEUE_TYPE_TX,
+					    i, info);
+		if (err)
+			return err;
+		ctx->txq_idx = ++i;
+	}
+
+	ctx->rxq_idx = 0;
+	ctx->txq_idx = 0;
+	return 0;
+}
+
+static int
+netdev_nl_stats_by_netdev(struct net_device *netdev, struct sk_buff *rsp,
+			  const struct genl_info *info)
+{
+	struct netdev_queue_stats_rx rx_sum, rx;
+	struct netdev_queue_stats_tx tx_sum, tx;
+	const struct netdev_stat_ops *ops;
+	void *hdr;
+	int i;
+
+	ops = netdev->stat_ops;
+	/* Netdev can't guarantee any complete counters */
+	if (!ops->get_base_stats)
+		return 0;
+
+	memset(&rx_sum, 0xff, sizeof(rx_sum));
+	memset(&tx_sum, 0xff, sizeof(tx_sum));
+
+	ops->get_base_stats(netdev, &rx_sum, &tx_sum);
+
+	/* The op was there, but nothing reported, don't bother */
+	if (!memchr_inv(&rx_sum, 0xff, sizeof(rx_sum)) &&
+	    !memchr_inv(&tx_sum, 0xff, sizeof(tx_sum)))
+		return 0;
+
+	hdr = genlmsg_iput(rsp, info);
+	if (!hdr)
+		return -EMSGSIZE;
+	if (nla_put_u32(rsp, NETDEV_A_STATS_IFINDEX, netdev->ifindex))
+		goto nla_put_failure;
+
+	for (i = 0; i < netdev->real_num_rx_queues; i++) {
+		memset(&rx, 0xff, sizeof(rx));
+		if (ops->get_queue_stats_rx)
+			ops->get_queue_stats_rx(netdev, i, &rx);
+		netdev_nl_stats_add(&rx_sum, &rx, sizeof(rx));
+	}
+	for (i = 0; i < netdev->real_num_tx_queues; i++) {
+		memset(&tx, 0xff, sizeof(tx));
+		if (ops->get_queue_stats_tx)
+			ops->get_queue_stats_tx(netdev, i, &tx);
+		netdev_nl_stats_add(&tx_sum, &tx, sizeof(tx));
+	}
+
+	if (netdev_nl_stats_write_rx(rsp, &rx_sum) ||
+	    netdev_nl_stats_write_tx(rsp, &tx_sum))
+		goto nla_put_failure;
+
+	genlmsg_end(rsp, hdr);
+	return 0;
+
+nla_put_failure:
+	genlmsg_cancel(rsp, hdr);
+	return -EMSGSIZE;
+}
+
+int netdev_nl_stats_get_dumpit(struct sk_buff *skb,
+			       struct netlink_callback *cb)
+{
+	struct netdev_nl_dump_ctx *ctx = netdev_dump_ctx(cb);
+	const struct genl_info *info = genl_info_dump(cb);
+	enum netdev_stats_projection projection;
+	struct net *net = sock_net(skb->sk);
+	struct net_device *netdev;
+	int err = 0;
+
+	projection = NETDEV_STATS_PROJECTION_NETDEV;
+	if (info->attrs[NETDEV_A_STATS_PROJECTION])
+		projection =
+			nla_get_uint(info->attrs[NETDEV_A_STATS_PROJECTION]);
+
+	rtnl_lock();
+	for_each_netdev_dump(net, netdev, ctx->ifindex) {
+		if (!netdev->stat_ops)
+			continue;
+
+		switch (projection) {
+		case NETDEV_STATS_PROJECTION_NETDEV:
+			err = netdev_nl_stats_by_netdev(netdev, skb, info);
+			break;
+		case NETDEV_STATS_PROJECTION_QUEUE:
+			err = netdev_nl_stats_by_queue(netdev, skb, info, ctx);
+			break;
+		}
+		if (err < 0)
+			break;
+	}
+	rtnl_unlock();
+
+	if (err != -EMSGSIZE)
+		return err;
+
+	return skb->len;
+}
+
 static int netdev_genl_netdevice_event(struct notifier_block *nb,
 				       unsigned long event, void *ptr)
 {
diff --git a/tools/include/uapi/linux/netdev.h b/tools/include/uapi/linux/netdev.h
index 93cb411adf72..c6a5e4b03828 100644
--- a/tools/include/uapi/linux/netdev.h
+++ b/tools/include/uapi/linux/netdev.h
@@ -70,6 +70,11 @@  enum netdev_queue_type {
 	NETDEV_QUEUE_TYPE_TX,
 };
 
+enum netdev_stats_projection {
+	NETDEV_STATS_PROJECTION_NETDEV,
+	NETDEV_STATS_PROJECTION_QUEUE,
+};
+
 enum {
 	NETDEV_A_DEV_IFINDEX = 1,
 	NETDEV_A_DEV_PAD,
@@ -132,6 +137,20 @@  enum {
 	NETDEV_A_QUEUE_MAX = (__NETDEV_A_QUEUE_MAX - 1)
 };
 
+enum {
+	NETDEV_A_STATS_IFINDEX = 1,
+	NETDEV_A_STATS_QUEUE_TYPE,
+	NETDEV_A_STATS_QUEUE_ID,
+	NETDEV_A_STATS_PROJECTION,
+	NETDEV_A_STATS_RX_PACKETS = 8,
+	NETDEV_A_STATS_RX_BYTES,
+	NETDEV_A_STATS_TX_PACKETS,
+	NETDEV_A_STATS_TX_BYTES,
+
+	__NETDEV_A_STATS_MAX,
+	NETDEV_A_STATS_MAX = (__NETDEV_A_STATS_MAX - 1)
+};
+
 enum {
 	NETDEV_CMD_DEV_GET = 1,
 	NETDEV_CMD_DEV_ADD_NTF,
@@ -144,6 +163,7 @@  enum {
 	NETDEV_CMD_PAGE_POOL_STATS_GET,
 	NETDEV_CMD_QUEUE_GET,
 	NETDEV_CMD_NAPI_GET,
+	NETDEV_CMD_STATS_GET,
 
 	__NETDEV_CMD_MAX,
 	NETDEV_CMD_MAX = (__NETDEV_CMD_MAX - 1)
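
The "not set" convention used by netdev_nl_stats_add() and the 0xff memset()s
above can be tried out in user space. The following is only a minimal sketch
under stated assumptions, not the kernel code: struct queue_stats, stats_add()
and stats_all_unset() are hypothetical stand-ins for the kernel structures and
helpers, and the loop in stats_all_unset() plays the role that memchr_inv()
plays in the patch. It shows that a counter is summed only when both the base
value and the per-queue value were actually reported.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Sentinel for "counter not reported", mirroring NETDEV_STAT_NOT_SET. */
#define STAT_NOT_SET (~0ULL)

/* Hypothetical stand-in for struct netdev_queue_stats_rx / _tx. */
struct queue_stats {
	uint64_t packets;
	uint64_t bytes;
};

/*
 * Add every reported counter in @_add to the matching counter in @_sum.
 * A field is left untouched if either side still holds the sentinel,
 * so an unset base counter stays unset in the total.
 */
static void stats_add(void *_sum, const void *_add, size_t size)
{
	const uint64_t *add = _add;
	uint64_t *sum = _sum;

	/* The struct must consist solely of u64 counters. */
	assert(size % sizeof(uint64_t) == 0);

	while (size) {
		if (*add != STAT_NOT_SET && *sum != STAT_NOT_SET)
			*sum += *add;
		sum++;
		add++;
		size -= sizeof(uint64_t);
	}
}

/* Return 1 if no counter was filled in (every byte is still 0xff). */
static int stats_all_unset(const void *stats, size_t size)
{
	const unsigned char *p = stats;
	size_t i;

	for (i = 0; i < size; i++)
		if (p[i] != 0xff)
			return 0;
	return 1;
}
```

A caller would memset() a struct queue_stats to 0xff, let the "driver" fill in
the counters it can vouch for, and then fold each queue in with stats_add().
Counters that were never set in the base stay at the sentinel and would be
skipped when writing the netlink message.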