
net :mana : Add per-cpu stats for MANA device

Message ID 1709823132-22411-1-git-send-email-shradhagupta@linux.microsoft.com (mailing list archive)
State Superseded
Delegated to: Netdev Maintainers
Series net :mana : Add per-cpu stats for MANA device

Checks

Context Check Description
netdev/series_format warning Single patches do not need cover letters; Target tree name not specified in the subject
netdev/tree_selection success Guessed tree name to be net-next
netdev/ynl success Generated files up to date; no warnings/errors; no diff in generated;
netdev/fixes_present success Fixes tag not required for -next series
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 940 this patch: 940
netdev/build_tools success Errors and warnings before: 0 this patch: 0
netdev/cc_maintainers warning 1 maintainers not CCed: ahmed.zaki@intel.com
netdev/build_clang success Errors and warnings before: 956 this patch: 956
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/deprecated_api success None detected
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes success No Fixes tag
netdev/build_allmodconfig_warn success Errors and warnings before: 956 this patch: 956
netdev/checkpatch success total: 0 errors, 0 warnings, 0 checks, 152 lines checked
netdev/build_clang_rust success No Rust files in patch. Skipping build
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/source_inline success Was 0 now: 0
netdev/contest success net-next-2024-03-07--21-00 (tests: 890)

Commit Message

Shradha Gupta March 7, 2024, 2:52 p.m. UTC
Extend 'ethtool -S' output for mana devices to include per-CPU packet
stats

Built-on: Ubuntu22
Tested-on: Ubuntu22
Signed-off-by: Shradha Gupta <shradhagupta@linux.microsoft.com>
---
 drivers/net/ethernet/microsoft/mana/mana_en.c | 22 ++++++++++
 .../ethernet/microsoft/mana/mana_ethtool.c    | 40 ++++++++++++++++++-
 include/net/mana/mana.h                       | 12 ++++++
 3 files changed, 72 insertions(+), 2 deletions(-)

Comments

Jakub Kicinski March 7, 2024, 3:29 p.m. UTC | #1
On Thu,  7 Mar 2024 06:52:12 -0800 Shradha Gupta wrote:
> Extend 'ethtool -S' output for mana devices to include per-CPU packet
> stats

But why? You already have per queue stats.

> Built-on: Ubuntu22
> Tested-on: Ubuntu22

I understand when people mention testing driver changes on particular
SKUs of supported devices, that adds meaningful info. I don't understand
what "built" and "tested" on a distro is supposed to tell us.
Honest question, what is the value of this?
Haiyang Zhang March 7, 2024, 3:49 p.m. UTC | #2
> -----Original Message-----
> From: Jakub Kicinski <kuba@kernel.org>
> Sent: Thursday, March 7, 2024 10:29 AM
> To: Shradha Gupta <shradhagupta@linux.microsoft.com>
> Cc: linux-kernel@vger.kernel.org; linux-hyperv@vger.kernel.org; linux-
> rdma@vger.kernel.org; netdev@vger.kernel.org; Eric Dumazet
> <edumazet@google.com>; Paolo Abeni <pabeni@redhat.com>; Ajay Sharma
> <sharmaajay@microsoft.com>; Leon Romanovsky <leon@kernel.org>; Thomas
> Gleixner <tglx@linutronix.de>; Sebastian Andrzej Siewior
> <bigeasy@linutronix.de>; KY Srinivasan <kys@microsoft.com>; Haiyang Zhang
> <haiyangz@microsoft.com>; Wei Liu <wei.liu@kernel.org>; Dexuan Cui
> <decui@microsoft.com>; Long Li <longli@microsoft.com>; Michael Kelley
> <mikelley@microsoft.com>; Shradha Gupta <shradhagupta@microsoft.com>
> Subject: Re: [PATCH] net :mana : Add per-cpu stats for MANA device
> 
> On Thu,  7 Mar 2024 06:52:12 -0800 Shradha Gupta wrote:
> > Extend 'ethtool -S' output for mana devices to include per-CPU packet
> > stats
> 
> But why? You already have per queue stats.
Yes. But the q to cpu binding is dynamic, we also want the per-CPU stat 
to analyze the CPU usage by counting the packets and bytes on each CPU.

> 
> > Built-on: Ubuntu22
> > Tested-on: Ubuntu22
> 
> I understand when people mention testing driver changes on particular
> SKUs of supported devices, that adds meaningful info. I don't understand
> what "built" and "tested" on a distro is supposed to tell us.
> Honest question, what is the value of this?
@Shradha Gupta Please add the tested SKU and test case here. Or, you don't
have to include test info with the patch.

Thanks,
- Haiyang
Haiyang Zhang March 7, 2024, 4:17 p.m. UTC | #3
> -----Original Message-----
> From: Shradha Gupta <shradhagupta@linux.microsoft.com>
> Sent: Thursday, March 7, 2024 9:52 AM
> To: linux-kernel@vger.kernel.org; linux-hyperv@vger.kernel.org; linux-
> rdma@vger.kernel.org; netdev@vger.kernel.org
> Cc: Shradha Gupta <shradhagupta@linux.microsoft.com>; Eric Dumazet
> <edumazet@google.com>; Jakub Kicinski <kuba@kernel.org>; Paolo Abeni
> <pabeni@redhat.com>; Ajay Sharma <sharmaajay@microsoft.com>; Leon
> Romanovsky <leon@kernel.org>; Thomas Gleixner <tglx@linutronix.de>;
> Sebastian Andrzej Siewior <bigeasy@linutronix.de>; KY Srinivasan
> <kys@microsoft.com>; Haiyang Zhang <haiyangz@microsoft.com>; Wei Liu
> <wei.liu@kernel.org>; Dexuan Cui <decui@microsoft.com>; Long Li
> <longli@microsoft.com>; Michael Kelley <mikelley@microsoft.com>; Shradha
> Gupta <shradhagupta@microsoft.com>
> Subject: [PATCH] net :mana : Add per-cpu stats for MANA device
> 
> Extend 'ethtool -S' output for mana devices to include per-CPU packet
> stats
> 
> Built-on: Ubuntu22
> Tested-on: Ubuntu22
> Signed-off-by: Shradha Gupta <shradhagupta@linux.microsoft.com>
> ---
>  drivers/net/ethernet/microsoft/mana/mana_en.c | 22 ++++++++++
>  .../ethernet/microsoft/mana/mana_ethtool.c    | 40 ++++++++++++++++++-
>  include/net/mana/mana.h                       | 12 ++++++
>  3 files changed, 72 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/net/ethernet/microsoft/mana/mana_en.c
> b/drivers/net/ethernet/microsoft/mana/mana_en.c
> index 59287c6e6cee..b27ee6684936 100644
> --- a/drivers/net/ethernet/microsoft/mana/mana_en.c
> +++ b/drivers/net/ethernet/microsoft/mana/mana_en.c
> @@ -224,6 +224,7 @@ netdev_tx_t mana_start_xmit(struct sk_buff *skb,
> struct net_device *ndev)
>  	int gso_hs = 0; /* zero for non-GSO pkts */
>  	u16 txq_idx = skb_get_queue_mapping(skb);
>  	struct gdma_dev *gd = apc->ac->gdma_dev;
> +	struct mana_pcpu_stats *pcpu_stats;
>  	bool ipv4 = false, ipv6 = false;
>  	struct mana_tx_package pkg = {};
>  	struct netdev_queue *net_txq;
> @@ -234,6 +235,8 @@ netdev_tx_t mana_start_xmit(struct sk_buff *skb,
> struct net_device *ndev)
>  	struct mana_cq *cq;
>  	int err, len;
> 
> +	pcpu_stats = this_cpu_ptr(apc->pcpu_stats);
> +
>  	if (unlikely(!apc->port_is_up))
>  		goto tx_drop;
> 
> @@ -412,6 +415,12 @@ netdev_tx_t mana_start_xmit(struct sk_buff *skb,
> struct net_device *ndev)
>  	tx_stats->bytes += len;
>  	u64_stats_update_end(&tx_stats->syncp);
> 
> +	/* Also update the per-CPU stats */
> +	u64_stats_update_begin(&pcpu_stats->syncp);
> +	pcpu_stats->tx_packets++;
> +	pcpu_stats->tx_bytes += len;
> +	u64_stats_update_end(&pcpu_stats->syncp);
> +
>  tx_busy:
>  	if (netif_tx_queue_stopped(net_txq) && mana_can_tx(gdma_sq)) {
>  		netif_tx_wake_queue(net_txq);
> @@ -425,6 +434,9 @@ netdev_tx_t mana_start_xmit(struct sk_buff *skb,
> struct net_device *ndev)
>  	kfree(pkg.sgl_ptr);
>  tx_drop_count:
>  	ndev->stats.tx_dropped++;
> +	u64_stats_update_begin(&pcpu_stats->syncp);
> +	pcpu_stats->tx_dropped++;
> +	u64_stats_update_end(&pcpu_stats->syncp);
>  tx_drop:
>  	dev_kfree_skb_any(skb);
>  	return NETDEV_TX_OK;
> @@ -1505,6 +1517,8 @@ static void mana_rx_skb(void *buf_va, bool
> from_pool,
>  	struct mana_stats_rx *rx_stats = &rxq->stats;
>  	struct net_device *ndev = rxq->ndev;
>  	uint pkt_len = cqe->ppi[0].pkt_len;
> +	struct mana_pcpu_stats *pcpu_stats;
> +	struct mana_port_context *apc;
>  	u16 rxq_idx = rxq->rxq_idx;
>  	struct napi_struct *napi;
>  	struct xdp_buff xdp = {};
> @@ -1512,6 +1526,9 @@ static void mana_rx_skb(void *buf_va, bool
> from_pool,
>  	u32 hash_value;
>  	u32 act;
> 
> +	apc = netdev_priv(ndev);
> +	pcpu_stats = this_cpu_ptr(apc->pcpu_stats);
> +
>  	rxq->rx_cq.work_done++;
>  	napi = &rxq->rx_cq.napi;
> 
> @@ -1570,6 +1587,11 @@ static void mana_rx_skb(void *buf_va, bool
> from_pool,
>  		rx_stats->xdp_tx++;
>  	u64_stats_update_end(&rx_stats->syncp);
> 
> +	u64_stats_update_begin(&pcpu_stats->syncp);
> +	pcpu_stats->rx_packets++;
> +	pcpu_stats->rx_bytes += pkt_len;
> +	u64_stats_update_end(&pcpu_stats->syncp);
> +
>  	if (act == XDP_TX) {
>  		skb_set_queue_mapping(skb, rxq_idx);
>  		mana_xdp_tx(skb, ndev);
> diff --git a/drivers/net/ethernet/microsoft/mana/mana_ethtool.c
> b/drivers/net/ethernet/microsoft/mana/mana_ethtool.c
> index ab2413d71f6c..e3aa47ead601 100644
> --- a/drivers/net/ethernet/microsoft/mana/mana_ethtool.c
> +++ b/drivers/net/ethernet/microsoft/mana/mana_ethtool.c
> @@ -83,8 +83,9 @@ static int mana_get_sset_count(struct net_device *ndev,
> int stringset)
>  	if (stringset != ETH_SS_STATS)
>  		return -EINVAL;
> 
> -	return ARRAY_SIZE(mana_eth_stats) + num_queues *
> -				(MANA_STATS_RX_COUNT + MANA_STATS_TX_COUNT);
> +	return ARRAY_SIZE(mana_eth_stats) +
> +	       (num_queues * (MANA_STATS_RX_COUNT + MANA_STATS_TX_COUNT)) +
> +	       (num_present_cpus() * (MANA_STATS_RX_PCPU +
> MANA_STATS_TX_PCPU));
>  }
> 
>  static void mana_get_strings(struct net_device *ndev, u32 stringset, u8
> *data)
> @@ -139,6 +140,19 @@ static void mana_get_strings(struct net_device
> *ndev, u32 stringset, u8 *data)
>  		sprintf(p, "tx_%d_mana_map_err", i);
>  		p += ETH_GSTRING_LEN;
>  	}
> +
> +	for (i = 0; i < num_present_cpus(); i++) {
> +		sprintf(p, "cpu%d_rx_packets", i);
> +		p += ETH_GSTRING_LEN;
> +		sprintf(p, "cpu%d_rx_bytes", i);
> +		p += ETH_GSTRING_LEN;
> +		sprintf(p, "cpu%d_tx_packets", i);
> +		p += ETH_GSTRING_LEN;
> +		sprintf(p, "cpu%d_tx_bytes", i);
> +		p += ETH_GSTRING_LEN;
> +		sprintf(p, "cpu%d_tx_dropped", i);
> +		p += ETH_GSTRING_LEN;
> +	}
>  }
> 
>  static void mana_get_ethtool_stats(struct net_device *ndev,
> @@ -222,6 +236,28 @@ static void mana_get_ethtool_stats(struct net_device
> *ndev,
>  		data[i++] = csum_partial;
>  		data[i++] = mana_map_err;
>  	}
> +
> +	for_each_possible_cpu(q) {
> +		const struct mana_pcpu_stats *pcpu_stats =
> +				per_cpu_ptr(apc->pcpu_stats, q);
> +		u64 rx_packets, rx_bytes, tx_packets, tx_bytes, tx_dropped;
> +		unsigned int start;
> +
> +		do {
> +			start = u64_stats_fetch_begin(&pcpu_stats->syncp);
> +			rx_packets = pcpu_stats->rx_packets;
> +			tx_packets = pcpu_stats->tx_packets;
> +			rx_bytes = pcpu_stats->rx_bytes;
> +			tx_bytes = pcpu_stats->tx_bytes;
> +			tx_dropped = pcpu_stats->tx_dropped;
> +		} while (u64_stats_fetch_retry(&pcpu_stats->syncp, start));
> +
> +		data[i++] = rx_packets;
> +		data[i++] = rx_bytes;
> +		data[i++] = tx_packets;
> +		data[i++] = tx_bytes;
> +		data[i++] = tx_dropped;
> +	}
>  }
> 
>  static int mana_get_rxnfc(struct net_device *ndev, struct ethtool_rxnfc
> *cmd,
> diff --git a/include/net/mana/mana.h b/include/net/mana/mana.h
> index 76147feb0d10..9a2414ee7f02 100644
> --- a/include/net/mana/mana.h
> +++ b/include/net/mana/mana.h
> @@ -51,6 +51,8 @@ enum TRI_STATE {
>  /* Update this count whenever the respective structures are changed */
>  #define MANA_STATS_RX_COUNT 5
>  #define MANA_STATS_TX_COUNT 11
> +#define MANA_STATS_RX_PCPU 2
> +#define MANA_STATS_TX_PCPU 3
> 
>  struct mana_stats_rx {
>  	u64 packets;
> @@ -386,6 +388,15 @@ struct mana_ethtool_stats {
>  	u64 rx_cqe_unknown_type;
>  };
> 
> +struct mana_pcpu_stats {
> +	u64 rx_packets;
> +	u64 rx_bytes;
> +	u64 tx_packets;
> +	u64 tx_bytes;
> +	u64 tx_dropped;
> +	struct u64_stats_sync syncp;
> +};
> +
>  struct mana_context {
>  	struct gdma_dev *gdma_dev;
> 
> @@ -449,6 +460,7 @@ struct mana_port_context {
>  	bool port_st_save; /* Saved port state */
> 
>  	struct mana_ethtool_stats eth_stats;
> +	struct mana_pcpu_stats __percpu *pcpu_stats;

Where are pcpu_stats alloc-ed?
Seems I cannot see any alloc in the patch.

Thanks,
- Haiyang
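
A minimal sketch of the allocation asked about above, with hypothetical helper
names and hook points (not part of the posted patch); netdev_alloc_pcpu_stats()
also initializes the per-CPU syncp member:

/* Hypothetical sketch, not part of the posted patch: allocate the per-CPU
 * counters before the port can pass traffic and free them on teardown.
 * netdev_alloc_pcpu_stats() zeroes the counters and initializes the
 * u64_stats_sync 'syncp' member for every possible CPU.
 */
static int mana_alloc_pcpu_stats(struct mana_port_context *apc)
{
	apc->pcpu_stats = netdev_alloc_pcpu_stats(struct mana_pcpu_stats);
	return apc->pcpu_stats ? 0 : -ENOMEM;
}

static void mana_free_pcpu_stats(struct mana_port_context *apc)
{
	free_percpu(apc->pcpu_stats);
	apc->pcpu_stats = NULL;
}
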
Jakub Kicinski March 7, 2024, 5:01 p.m. UTC | #4
On Thu, 7 Mar 2024 15:49:15 +0000 Haiyang Zhang wrote:
> > > Extend 'ethtool -S' output for mana devices to include per-CPU packet
> > > stats  
> > 
> > But why? You already have per queue stats.  
> Yes. But the q to cpu binding is dynamic, we also want the per-CPU stat 
> to analyze the CPU usage by counting the packets and bytes on each CPU.

Dynamic is a bit of an exaggeration, right? On a well-configured system
each CPU should use a single queue assigned thru XPS. And for manual
debug bpftrace should serve the purpose quite well.

Please note that you can't use num_present_cpus() to size stats in
ethtool -S , you have to use possible_cpus(), because the retrieval
of the stats is done in a multi-syscall fashion and there are no
explicit lengths in the API. So you must always report all possible
stats, not just currently active :(
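
A minimal sketch of the sizing fix described above, reusing the
MANA_STATS_*_PCPU defines from the patch; the exact kernel helper is
num_possible_cpus(), and the string loop in mana_get_strings() would have to
iterate possible CPUs as well so the count and the strings stay in sync:

/* Sketch only: report stats for all possible CPUs, not just present ones,
 * so the ethtool string table and value array never disagree.
 */
static int mana_get_sset_count(struct net_device *ndev, int stringset)
{
	struct mana_port_context *apc = netdev_priv(ndev);
	unsigned int num_queues = apc->num_queues;

	if (stringset != ETH_SS_STATS)
		return -EINVAL;

	return ARRAY_SIZE(mana_eth_stats) +
	       num_queues * (MANA_STATS_RX_COUNT + MANA_STATS_TX_COUNT) +
	       num_possible_cpus() * (MANA_STATS_RX_PCPU + MANA_STATS_TX_PCPU);
}
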
Shradha Gupta March 8, 2024, 5:30 a.m. UTC | #5
Thanks for the comments, I am sending out a newer version with these fixes.

On Thu, Mar 07, 2024 at 09:01:45AM -0800, Jakub Kicinski wrote:
> On Thu, 7 Mar 2024 15:49:15 +0000 Haiyang Zhang wrote:
> > > > Extend 'ethtool -S' output for mana devices to include per-CPU packet
> > > > stats  
> > > 
> > > But why? You already have per queue stats.  
> > Yes. But the q to cpu binding is dynamic, we also want the per-CPU stat 
> > to analyze the CPU usage by counting the packets and bytes on each CPU.
> 
> Dynamic is a bit of an exaggeration, right? On a well-configured system
> each CPU should use a single queue assigned thru XPS. And for manual
> debug bpftrace should serve the purpose quite well.
> 
> Please note that you can't use num_present_cpus() to size stats in
> ethtool -S , you have to use possible_cpus(), because the retrieval
> of the stats is done in a multi-syscall fashion and there are no
> explicit lengths in the API. So you must always report all possible
> stats, not just currently active :(
Haiyang Zhang March 8, 2024, 6:51 p.m. UTC | #6
> -----Original Message-----
> From: Jakub Kicinski <kuba@kernel.org>
> Sent: Thursday, March 7, 2024 12:02 PM
> To: Haiyang Zhang <haiyangz@microsoft.com>
> Cc: Shradha Gupta <shradhagupta@linux.microsoft.com>; Shradha Gupta
> <shradhagupta@microsoft.com>; linux-kernel@vger.kernel.org; linux-
> hyperv@vger.kernel.org; linux-rdma@vger.kernel.org;
> netdev@vger.kernel.org; Eric Dumazet <edumazet@google.com>; Paolo Abeni
> <pabeni@redhat.com>; Ajay Sharma <sharmaajay@microsoft.com>; Leon
> Romanovsky <leon@kernel.org>; Thomas Gleixner <tglx@linutronix.de>;
> Sebastian Andrzej Siewior <bigeasy@linutronix.de>; KY Srinivasan
> <kys@microsoft.com>; Wei Liu <wei.liu@kernel.org>; Dexuan Cui
> <decui@microsoft.com>; Long Li <longli@microsoft.com>; Michael Kelley
> <mikelley@microsoft.com>
> Subject: Re: [PATCH] net :mana : Add per-cpu stats for MANA device
> 
> On Thu, 7 Mar 2024 15:49:15 +0000 Haiyang Zhang wrote:
> > > > Extend 'ethtool -S' output for mana devices to include per-CPU
> packet
> > > > stats
> > >
> > > But why? You already have per queue stats.
> > Yes. But the q to cpu binding is dynamic, we also want the per-CPU stat
> > to analyze the CPU usage by counting the packets and bytes on each CPU.
> 
> Dynamic is a bit of an exaggeration, right? On a well-configured system
> each CPU should use a single queue assigned thru XPS. And for manual
> debug bpftrace should serve the purpose quite well.

Some programs, like irqbalancer can dynamically change the CPU affinity, 
so we want to add the per-CPU counters for better understanding of the CPU 
usage.

Thanks,
- Haiyang
Jakub Kicinski March 8, 2024, 7:22 p.m. UTC | #7
On Fri, 8 Mar 2024 18:51:58 +0000 Haiyang Zhang wrote:
> > Dynamic is a bit of an exaggeration, right? On a well-configured system
> > each CPU should use a single queue assigned thru XPS. And for manual
> > debug bpftrace should serve the purpose quite well.  
> 
> Some programs, like irqbalancer can dynamically change the CPU affinity, 
> so we want to add the per-CPU counters for better understanding of the CPU 
> usage.

Do you have experimental data showing this making a difference
in production?

Seems unlikely, but if it does work we should enable it for all
devices, no driver by driver.
Haiyang Zhang March 8, 2024, 7:43 p.m. UTC | #8
> -----Original Message-----
> From: Jakub Kicinski <kuba@kernel.org>
> Sent: Friday, March 8, 2024 2:23 PM
> To: Haiyang Zhang <haiyangz@microsoft.com>
> Cc: Shradha Gupta <shradhagupta@linux.microsoft.com>; Shradha Gupta
> <shradhagupta@microsoft.com>; linux-kernel@vger.kernel.org; linux-
> hyperv@vger.kernel.org; linux-rdma@vger.kernel.org;
> netdev@vger.kernel.org; Eric Dumazet <edumazet@google.com>; Paolo Abeni
> <pabeni@redhat.com>; Ajay Sharma <sharmaajay@microsoft.com>; Leon
> Romanovsky <leon@kernel.org>; Thomas Gleixner <tglx@linutronix.de>;
> Sebastian Andrzej Siewior <bigeasy@linutronix.de>; KY Srinivasan
> <kys@microsoft.com>; Wei Liu <wei.liu@kernel.org>; Dexuan Cui
> <decui@microsoft.com>; Long Li <longli@microsoft.com>; Michael Kelley
> <mikelley@microsoft.com>
> Subject: Re: [PATCH] net :mana : Add per-cpu stats for MANA device
> 
> On Fri, 8 Mar 2024 18:51:58 +0000 Haiyang Zhang wrote:
> > > Dynamic is a bit of an exaggeration, right? On a well-configured
> system
> > > each CPU should use a single queue assigned thru XPS. And for manual
> > > debug bpftrace should serve the purpose quite well.
> >
> > Some programs, like irqbalancer can dynamically change the CPU
> affinity,
> > so we want to add the per-CPU counters for better understanding of the
> CPU
> > usage.
> 
> Do you have experimental data showing this making a difference
> in production?
Shradha, could you please add some data before / after enabling irqbalancer 
which changes cpu affinity?

> 
> Seems unlikely, but if it does work we should enable it for all
> devices, no driver by driver.
There are some existing drivers, like mlx, rmnet, netvsc, etc. using percpu 
counters. Are you suggesting we add a common API for all drivers?

Thanks,
- Haiyang
Rahul Rameshbabu March 8, 2024, 7:52 p.m. UTC | #9
On Fri, 08 Mar, 2024 19:43:57 +0000 Haiyang Zhang <haiyangz@microsoft.com> wrote:
>> -----Original Message-----
>> From: Jakub Kicinski <kuba@kernel.org>
>> Sent: Friday, March 8, 2024 2:23 PM
>> To: Haiyang Zhang <haiyangz@microsoft.com>
>> Cc: Shradha Gupta <shradhagupta@linux.microsoft.com>; Shradha Gupta
>> <shradhagupta@microsoft.com>; linux-kernel@vger.kernel.org; linux-
>> hyperv@vger.kernel.org; linux-rdma@vger.kernel.org;
>> netdev@vger.kernel.org; Eric Dumazet <edumazet@google.com>; Paolo Abeni
>> <pabeni@redhat.com>; Ajay Sharma <sharmaajay@microsoft.com>; Leon
>> Romanovsky <leon@kernel.org>; Thomas Gleixner <tglx@linutronix.de>;
>> Sebastian Andrzej Siewior <bigeasy@linutronix.de>; KY Srinivasan
>> <kys@microsoft.com>; Wei Liu <wei.liu@kernel.org>; Dexuan Cui
>> <decui@microsoft.com>; Long Li <longli@microsoft.com>; Michael Kelley
>> <mikelley@microsoft.com>
>> Subject: Re: [PATCH] net :mana : Add per-cpu stats for MANA device
>> 
>> On Fri, 8 Mar 2024 18:51:58 +0000 Haiyang Zhang wrote:
>> > > Dynamic is a bit of an exaggeration, right? On a well-configured
>> system
>> > > each CPU should use a single queue assigned thru XPS. And for manual
>> > > debug bpftrace should serve the purpose quite well.
>> >
>> > Some programs, like irqbalancer can dynamically change the CPU
>> affinity,
>> > so we want to add the per-CPU counters for better understanding of the
>> CPU
>> > usage.
>> 
>> Do you have experimental data showing this making a difference
>> in production?
> Shradha, could you please add some data before / after enabling irqbalancer 
> which changes cpu affinity?
>
>> 
>> Seems unlikely, but if it does work we should enable it for all
>> devices, no driver by driver.
> There are some existing drivers, like mlx, rmnet, netvsc, etc. using percpu 
> counters. Are you suggesting we add a common API for all drivers?

Wanted to chime in with regards to mlx. You might be conflating per-cpu
with per-queue. When we run ethtool -S, we present counters per netdev
queue rather than per-cpu. The number of queues we instantiate is
related to CPUs but it is not always 1:1.

Jakub just recently added a proper interface for per-queue stats
counters that we are interested in supporting.

  https://lore.kernel.org/netdev/20240222223629.158254-1-kuba@kernel.org/

--
Thanks,

Rahul Rameshbabu
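
For reference, a rough sketch of the per-queue stats interface linked above, as
merged in net-next around this time (the example_* callbacks are illustrative
and the exact struct members should be checked against
include/net/netdev_queues.h): a driver fills struct netdev_stat_ops and the
values are retrieved over the netdev netlink family instead of ethtool -S.

/* Hedged sketch of the queue-stats hooks; the example_* callbacks are
 * illustrative, not MANA code.
 */
static void example_get_queue_stats_rx(struct net_device *ndev, int idx,
				       struct netdev_queue_stats_rx *stats)
{
	/* fill stats->packets / stats->bytes for RX queue 'idx' */
}

static void example_get_queue_stats_tx(struct net_device *ndev, int idx,
				       struct netdev_queue_stats_tx *stats)
{
	/* fill stats->packets / stats->bytes for TX queue 'idx' */
}

static void example_get_base_stats(struct net_device *ndev,
				   struct netdev_queue_stats_rx *rx,
				   struct netdev_queue_stats_tx *tx)
{
	/* counters not attributable to a currently live queue */
	rx->packets = 0;
	rx->bytes = 0;
	tx->packets = 0;
	tx->bytes = 0;
}

static const struct netdev_stat_ops example_stat_ops = {
	.get_queue_stats_rx = example_get_queue_stats_rx,
	.get_queue_stats_tx = example_get_queue_stats_tx,
	.get_base_stats = example_get_base_stats,
};

/* at probe time: ndev->stat_ops = &example_stat_ops; */
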
Jakub Kicinski March 8, 2024, 8:27 p.m. UTC | #10
On Fri, 8 Mar 2024 19:43:57 +0000 Haiyang Zhang wrote:
> > Seems unlikely, but if it does work we should enable it for all
> > devices, no driver by driver.  
> There are some existing drivers, like mlx, rmnet, netvsc, etc. using percpu 
> counters. Are you suggesting we add a common API for all drivers?

Hm, I don't think mlx does. The typical use case for pcpu stats so far
has been software devices which don't have queues, and implement
lockless Tx. In that case instead of recording stats on a local queue
struct we can count per-cpu and not worry about locking.
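
On recent kernels that pattern typically looks like the sketch below (an
illustration, not MANA code, assuming the core-allocated tstats path via
pcpu_stat_type is available; the example_* names are made up): the core
allocates struct pcpu_sw_netstats per CPU, the driver bumps the counters
locklessly on the hot path, and dev_get_tstats64() sums them for
ndo_get_stats64.

/* Sketch of the software-device per-CPU stats pattern, not MANA code. */
static netdev_tx_t example_xmit(struct sk_buff *skb, struct net_device *dev)
{
	/* count the packet on this CPU's pcpu_sw_netstats, locklessly */
	dev_sw_netstats_tx_add(dev, 1, skb->len);
	dev_kfree_skb(skb);
	return NETDEV_TX_OK;
}

static const struct net_device_ops example_netdev_ops = {
	.ndo_start_xmit	 = example_xmit,
	.ndo_get_stats64 = dev_get_tstats64,	/* sums the per-CPU counters */
};

static void example_setup(struct net_device *dev)
{
	dev->netdev_ops = &example_netdev_ops;
	/* let the core allocate and free the per-CPU tstats */
	dev->pcpu_stat_type = NETDEV_PCPU_STAT_TSTATS;
}
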
Sebastian Andrzej Siewior March 8, 2024, 8:33 p.m. UTC | #11
On 2024-03-08 19:43:57 [+0000], Haiyang Zhang wrote:
> > Do you have experimental data showing this making a difference
> > in production?
> Shradha, could you please add some data before / after enabling irqbalancer 
> which changes cpu affinity?

So you have one queue and one interrupt, and then the irqbalancer pushes the
interrupt from one CPU to another. What kind of information do you gain from
per-CPU counters here?

> Thanks,
> - Haiyang

Sebastian
Shradha Gupta March 11, 2024, 4:19 a.m. UTC | #12
On Fri, Mar 08, 2024 at 11:22:44AM -0800, Jakub Kicinski wrote:
> On Fri, 8 Mar 2024 18:51:58 +0000 Haiyang Zhang wrote:
> > > Dynamic is a bit of an exaggeration, right? On a well-configured system
> > > each CPU should use a single queue assigned thru XPS. And for manual
> > > debug bpftrace should serve the purpose quite well.  
> > 
> > Some programs, like irqbalancer can dynamically change the CPU affinity, 
> > so we want to add the per-CPU counters for better understanding of the CPU 
> > usage.
> 
> Do you have experimental data showing this making a difference
> in production?
Sure, will try to get that data for this discussion
> 
> Seems unlikely, but if it does work we should enable it for all
> devices, no driver by driver.
You mean, if the usecase seems valid we should try to extend the framework
mentioned by Rahul (https://lore.kernel.org/lkml/20240307072923.6cc8a2ba@kernel.org/)
to include these stats as well?
Will explore this a bit more and update. Thanks.
Stephen Hemminger March 11, 2024, 3:49 p.m. UTC | #13
On Sun, 10 Mar 2024 21:19:50 -0700
Shradha Gupta <shradhagupta@linux.microsoft.com> wrote:

> On Fri, Mar 08, 2024 at 11:22:44AM -0800, Jakub Kicinski wrote:
> > On Fri, 8 Mar 2024 18:51:58 +0000 Haiyang Zhang wrote:  
> > > > Dynamic is a bit of an exaggeration, right? On a well-configured system
> > > > each CPU should use a single queue assigned thru XPS. And for manual
> > > > debug bpftrace should serve the purpose quite well.    
> > > 
> > > Some programs, like irqbalancer can dynamically change the CPU affinity, 
> > > so we want to add the per-CPU counters for better understanding of the CPU 
> > > usage.  
> > 
> > Do you have experimental data showing this making a difference
> > in production?  
> Sure, will try to get that data for this discussion
> > 
> > Seems unlikely, but if it does work we should enable it for all
> > devices, no driver by driver.  
> You mean, if the usecase seems valid we should try to extend the framework
> mentioned by Rahul (https://lore.kernel.org/lkml/20240307072923.6cc8a2ba@kernel.org/)
> to include these stats as well?
> Will explore this a bit more and update. Thanks.
> 

Remember, statistics aren't free, and even per-cpu stats end up taking cache space.
Jakub Kicinski March 11, 2024, 3:51 p.m. UTC | #14
On Sun, 10 Mar 2024 21:19:50 -0700 Shradha Gupta wrote:
> > Seems unlikely, but if it does work we should enable it for all
> > devices, no driver by driver.  
> You mean, if the usecase seems valid we should try to extend the framework
> mentioned by Rahul (https://lore.kernel.org/lkml/20240307072923.6cc8a2ba@kernel.org/)
> to include these stats as well?

"framework" is a big word, but yes, add a netlink command to get 
pcpu stats. Let's focus on the usefulness before investing time
in a rewrite, tho.
Stephen Hemminger March 11, 2024, 4:41 p.m. UTC | #15
On Mon, 11 Mar 2024 08:51:26 -0700
Jakub Kicinski <kuba@kernel.org> wrote:

> On Sun, 10 Mar 2024 21:19:50 -0700 Shradha Gupta wrote:
> > > Seems unlikely, but if it does work we should enable it for all
> > > devices, no driver by driver.    
> > You mean, if the usecase seems valid we should try to extend the framework
> > mentioned by Rahul (https://lore.kernel.org/lkml/20240307072923.6cc8a2ba@kernel.org/)
> > to include these stats as well?  
> 
> "framework" is a big word, but yes, add a netlink command to get 
> pcpu stats. Let's focus on the usefulness before investing time
> in a rewrite, tho.

Couldn't userspace do the same thing by looking at /proc/interrupts
and PCI info?
Shradha Gupta March 14, 2024, 2:57 a.m. UTC | #16
On Sun, Mar 10, 2024 at 09:19:50PM -0700, Shradha Gupta wrote:
> On Fri, Mar 08, 2024 at 11:22:44AM -0800, Jakub Kicinski wrote:
> > On Fri, 8 Mar 2024 18:51:58 +0000 Haiyang Zhang wrote:
> > > > Dynamic is a bit of an exaggeration, right? On a well-configured system
> > > > each CPU should use a single queue assigned thru XPS. And for manual
> > > > debug bpftrace should serve the purpose quite well.  
> > > 
> > > Some programs, like irqbalancer can dynamically change the CPU affinity, 
> > > so we want to add the per-CPU counters for better understanding of the CPU 
> > > usage.
> > 
> > Do you have experimental data showing this making a difference
> > in production?
> Sure, will try to get that data for this discussion
> > 
> > Seems unlikely, but if it does work we should enable it for all
> > devices, no driver by driver.
> You mean, if the usecase seems valid we should try to extend the framework
> mentioned by Rahul (https://lore.kernel.org/lkml/20240307072923.6cc8a2ba@kernel.org/)
> to include these stats as well?
> Will explore this a bit more and update. Thanks.

Following is the data we can share:

Default interrupts affinity for each queue:

 25:          1        103          0    2989138  Hyper-V PCIe MSI 4138200989697-edge      mana_q0@pci:7870:00:00.0
 26:          0          1    4005360          0  Hyper-V PCIe MSI 4138200989698-edge      mana_q1@pci:7870:00:00.0
 27:          0          0          1    2997584  Hyper-V PCIe MSI 4138200989699-edge      mana_q2@pci:7870:00:00.0
 28:    3565461          0          0          1  Hyper-V PCIe MSI 4138200989700-edge      mana_q3@pci:7870:00:00.0

As seen the CPU-queue mapping is not 1:1, Queue 0 and Queue 2 are both mapped 
to cpu3. From this knowledge we can figure out the total RX stats processed by
each CPU by adding the values of mana_q0 and mana_q2 stats for cpu3. But if
this data changes dynamically using irqbalance or smp_affinity file edits, the
above assumption fails. 

Interrupt affinity for mana_q2 changes and the affinity table looks as follows
 25:          1        103          0    3038084  Hyper-V PCIe MSI 4138200989697-edge      mana_q0@pci:7870:00:00.0
 26:          0          1    4012447          0  Hyper-V PCIe MSI 4138200989698-edge      mana_q1@pci:7870:00:00.0
 27:     157181         10          1    3007990  Hyper-V PCIe MSI 4138200989699-edge      mana_q2@pci:7870:00:00.0
 28:    3593858          0          0          1  Hyper-V PCIe MSI 4138200989700-edge      mana_q3@pci:7870:00:00.0 

And during this time we might end up calculating the per-CPU stats incorrectly,
messing up the understanding of CPU usage by MANA driver that is consumed by 
monitoring services. 
 

Also sharing the existing per-queue stats during this experiment, in case needed

Per-queue stats before changing CPU-affinities:
     tx_cq_err: 0
     tx_cqe_unknown_type: 0
     rx_coalesced_err: 0
     rx_cqe_unknown_type: 0
     rx_0_packets: 4230152
     rx_0_bytes: 289545167
     rx_0_xdp_drop: 0
     rx_0_xdp_tx: 0
     rx_0_xdp_redirect: 0
     rx_1_packets: 4113017
     rx_1_bytes: 314552601
     rx_1_xdp_drop: 0
     rx_1_xdp_tx: 0
     rx_1_xdp_redirect: 0
     rx_2_packets: 4458906
     rx_2_bytes: 305117506
     rx_2_xdp_drop: 0
     rx_2_xdp_tx: 0
     rx_2_xdp_redirect: 0
     rx_3_packets: 4619589
     rx_3_bytes: 315445084
     rx_3_xdp_drop: 0
     rx_3_xdp_tx: 0
     rx_3_xdp_redirect: 0
     hc_tx_err_vport_disabled: 0
     hc_tx_err_inval_vportoffset_pkt: 0
     hc_tx_err_vlan_enforcement: 0
     hc_tx_err_eth_type_enforcement: 0
     hc_tx_err_sa_enforcement: 0
     hc_tx_err_sqpdid_enforcement: 0
     hc_tx_err_cqpdid_enforcement: 0
     hc_tx_err_mtu_violation: 0
     hc_tx_err_inval_oob: 0
     hc_tx_err_gdma: 0
     hc_tx_bytes: 126336708121
     hc_tx_ucast_pkts: 86748013
     hc_tx_ucast_bytes: 126336703775
     hc_tx_bcast_pkts: 37
     hc_tx_bcast_bytes: 2842
     hc_tx_mcast_pkts: 7
     hc_tx_mcast_bytes: 1504
     tx_0_packets: 5995507
     tx_0_bytes: 28749696408
     tx_0_xdp_xmit: 0
     tx_0_tso_packets: 4719840
     tx_0_tso_bytes: 26873844525
     tx_0_tso_inner_packets: 0
     tx_0_tso_inner_bytes: 0
     tx_0_long_pkt_fmt: 0
     tx_0_short_pkt_fmt: 5995507
     tx_0_csum_partial: 1275621
     tx_0_mana_map_err: 0
     tx_1_packets: 6653598
     tx_1_bytes: 38318341475
     tx_1_xdp_xmit: 0
     tx_1_tso_packets: 5330921
     tx_1_tso_bytes: 36210150488
     tx_1_tso_inner_packets: 0
     tx_1_tso_inner_bytes: 0
     tx_1_long_pkt_fmt: 0
     tx_1_short_pkt_fmt: 6653598
     tx_1_csum_partial: 1322643
     tx_1_mana_map_err: 0
     tx_2_packets: 5715246
     tx_2_bytes: 25662283686
     tx_2_xdp_xmit: 0
     tx_2_tso_packets: 4619118
     tx_2_tso_bytes: 23829680267
     tx_2_tso_inner_packets: 0
     tx_2_tso_inner_bytes: 0
     tx_2_long_pkt_fmt: 0
     tx_2_short_pkt_fmt: 5715246
     tx_2_csum_partial: 1096092
     tx_2_mana_map_err: 0
     tx_3_packets: 6175860
     tx_3_bytes: 29500667904
     tx_3_xdp_xmit: 0
     tx_3_tso_packets: 4951591
     tx_3_tso_bytes: 27446937448
     tx_3_tso_inner_packets: 0
     tx_3_tso_inner_bytes: 0
     tx_3_long_pkt_fmt: 0
     tx_3_short_pkt_fmt: 6175860
     tx_3_csum_partial: 1224213
     tx_3_mana_map_err: 0

Per-queue stats after changing CPU-affinities:
     rx_0_packets: 4781895
     rx_0_bytes: 326478061
     rx_0_xdp_drop: 0
     rx_0_xdp_tx: 0
     rx_0_xdp_redirect: 0
     rx_1_packets: 4116990
     rx_1_bytes: 315439234
     rx_1_xdp_drop: 0
     rx_1_xdp_tx: 0
     rx_1_xdp_redirect: 0
     rx_2_packets: 4528800
     rx_2_bytes: 310312337
     rx_2_xdp_drop: 0
     rx_2_xdp_tx: 0
     rx_2_xdp_redirect: 0
     rx_3_packets: 4622622
     rx_3_bytes: 316282431
     rx_3_xdp_drop: 0
     rx_3_xdp_tx: 0
     rx_3_xdp_redirect: 0
     tx_0_packets: 5999379
     tx_0_bytes: 28750864476
     tx_0_xdp_xmit: 0
     tx_0_tso_packets: 4720027
     tx_0_tso_bytes: 26874344494
     tx_0_tso_inner_packets: 0
     tx_0_tso_inner_bytes: 0
     tx_0_long_pkt_fmt: 0
     tx_0_short_pkt_fmt: 5999379
     tx_0_csum_partial: 1279296
     tx_0_mana_map_err: 0
     tx_1_packets: 6656913
     tx_1_bytes: 38319355168
     tx_1_xdp_xmit: 0
     tx_1_tso_packets: 5331086
     tx_1_tso_bytes: 36210592040
     tx_1_tso_inner_packets: 0
     tx_1_tso_inner_bytes: 0
     tx_1_long_pkt_fmt: 0
     tx_1_short_pkt_fmt: 6656913
     tx_1_csum_partial: 1325785
     tx_1_mana_map_err: 0
     tx_2_packets: 5906172
     tx_2_bytes: 36758032245
     tx_2_xdp_xmit: 0
     tx_2_tso_packets: 4806348
     tx_2_tso_bytes: 34912213258
     tx_2_tso_inner_packets: 0
     tx_2_tso_inner_bytes: 0
     tx_2_long_pkt_fmt: 0
     tx_2_short_pkt_fmt: 5906172
     tx_2_csum_partial: 1099782
     tx_2_mana_map_err: 0
     tx_3_packets: 6202399
     tx_3_bytes: 30840325531
     tx_3_xdp_xmit: 0
     tx_3_tso_packets: 4973730
     tx_3_tso_bytes: 28784371532
     tx_3_tso_inner_packets: 0
     tx_3_tso_inner_bytes: 0
     tx_3_long_pkt_fmt: 0
     tx_3_short_pkt_fmt: 6202399
     tx_3_csum_partial: 1228603
     tx_3_mana_map_err: 0
Stephen Hemminger March 14, 2024, 3:05 a.m. UTC | #17
On Wed, 13 Mar 2024 19:57:20 -0700
Shradha Gupta <shradhagupta@linux.microsoft.com> wrote:

> Following is the data we can share:
> 
> Default interrupts affinity for each queue:
> 
>  25:          1        103          0    2989138  Hyper-V PCIe MSI 4138200989697-edge      mana_q0@pci:7870:00:00.0
>  26:          0          1    4005360          0  Hyper-V PCIe MSI 4138200989698-edge      mana_q1@pci:7870:00:00.0
>  27:          0          0          1    2997584  Hyper-V PCIe MSI 4138200989699-edge      mana_q2@pci:7870:00:00.0
>  28:    3565461          0          0          1  Hyper-V PCIe MSI 4138200989700-edge      mana_q3
> @pci:7870:00:00.0
> 
> As seen the CPU-queue mapping is not 1:1, Queue 0 and Queue 2 are both mapped 
> to cpu3. From this knowledge we can figure out the total RX stats processed by
> each CPU by adding the values of mana_q0 and mana_q2 stats for cpu3. But if
> this data changes dynamically using irqbalance or smp_affinity file edits, the
> above assumption fails. 


irqbalance is often a bad idea.
In the past, doing one shot balancing at startup was a better plan.
Jakub Kicinski March 14, 2024, 6:27 p.m. UTC | #18
On Wed, 13 Mar 2024 19:57:20 -0700 Shradha Gupta wrote:
> Default interrupts affinity for each queue:
> 
>  25:          1        103          0    2989138  Hyper-V PCIe MSI 4138200989697-edge      mana_q0@pci:7870:00:00.0
>  26:          0          1    4005360          0  Hyper-V PCIe MSI 4138200989698-edge      mana_q1@pci:7870:00:00.0
>  27:          0          0          1    2997584  Hyper-V PCIe MSI 4138200989699-edge      mana_q2@pci:7870:00:00.0
>  28:    3565461          0          0          1  Hyper-V PCIe MSI 4138200989700-edge      mana_q3
> @pci:7870:00:00.0
> 
> As seen the CPU-queue mapping is not 1:1, Queue 0 and Queue 2 are both mapped 
> to cpu3. From this knowledge we can figure out the total RX stats processed by
> each CPU by adding the values of mana_q0 and mana_q2 stats for cpu3. But if
> this data changes dynamically using irqbalance or smp_affinity file edits, the
> above assumption fails. 
> 
> Interrupt affinity for mana_q2 changes and the affinity table looks as follows
>  25:          1        103          0    3038084  Hyper-V PCIe MSI 4138200989697-edge      mana_q0@pci:7870:00:00.0
>  26:          0          1    4012447          0  Hyper-V PCIe MSI 4138200989698-edge      mana_q1@pci:7870:00:00.0
>  27:     157181         10          1    3007990  Hyper-V PCIe MSI 4138200989699-edge      mana_q2@pci:7870:00:00.0
>  28:    3593858          0          0          1  Hyper-V PCIe MSI 4138200989700-edge      mana_q3@pci:7870:00:00.0 
> 
> And during this time we might end up calculating the per-CPU stats incorrectly,
> messing up the understanding of CPU usage by MANA driver that is consumed by 
> monitoring services. 

Like Stephen said, forget about irqbalance for networking.

Assume that the IRQs are affinitized and XPS set, correctly.

Now, presumably you can use your pcpu stats to "trade queues",
e.g. 4 CPUs / 4 queues, if CPU 0 insists on using queue 1
instead of queue 0, you can swap the 0 <> 1 assignment.
That's just an example of an "algorithm", maybe you have other
use cases. But if the problem is "user runs broken irqbalance"
the solution is not in the kernel...
Haiyang Zhang March 14, 2024, 6:54 p.m. UTC | #19
> -----Original Message-----
> From: Jakub Kicinski <kuba@kernel.org>
> Sent: Thursday, March 14, 2024 2:28 PM
> To: Shradha Gupta <shradhagupta@linux.microsoft.com>
> Cc: Haiyang Zhang <haiyangz@microsoft.com>; Shradha Gupta
> <shradhagupta@microsoft.com>; linux-kernel@vger.kernel.org; linux-
> hyperv@vger.kernel.org; linux-rdma@vger.kernel.org;
> netdev@vger.kernel.org; Eric Dumazet <edumazet@google.com>; Paolo Abeni
> <pabeni@redhat.com>; Ajay Sharma <sharmaajay@microsoft.com>; Leon
> Romanovsky <leon@kernel.org>; Thomas Gleixner <tglx@linutronix.de>;
> Sebastian Andrzej Siewior <bigeasy@linutronix.de>; KY Srinivasan
> <kys@microsoft.com>; Wei Liu <wei.liu@kernel.org>; Dexuan Cui
> <decui@microsoft.com>; Long Li <longli@microsoft.com>; Michael Kelley
> <mikelley@microsoft.com>
> Subject: Re: [PATCH] net :mana : Add per-cpu stats for MANA device
> 
> On Wed, 13 Mar 2024 19:57:20 -0700 Shradha Gupta wrote:
> > Default interrupts affinity for each queue:
> >
> >  25:          1        103          0    2989138  Hyper-V PCIe MSI
> 4138200989697-edge      mana_q0@pci:7870:00:00.0
> >  26:          0          1    4005360          0  Hyper-V PCIe MSI
> 4138200989698-edge      mana_q1@pci:7870:00:00.0
> >  27:          0          0          1    2997584  Hyper-V PCIe MSI
> 4138200989699-edge      mana_q2@pci:7870:00:00.0
> >  28:    3565461          0          0          1  Hyper-V PCIe MSI
> 4138200989700-edge      mana_q3
> > @pci:7870:00:00.0
> >
> > As seen the CPU-queue mapping is not 1:1, Queue 0 and Queue 2 are both
> mapped
> > to cpu3. From this knowledge we can figure out the total RX stats
> processed by
> > each CPU by adding the values of mana_q0 and mana_q2 stats for cpu3.
> But if
> > this data changes dynamically using irqbalance or smp_affinity file
> edits, the
> > above assumption fails.
> >
> > Interrupt affinity for mana_q2 changes and the affinity table looks as
> follows
> >  25:          1        103          0    3038084  Hyper-V PCIe MSI
> 4138200989697-edge      mana_q0@pci:7870:00:00.0
> >  26:          0          1    4012447          0  Hyper-V PCIe MSI
> 4138200989698-edge      mana_q1@pci:7870:00:00.0
> >  27:     157181         10          1    3007990  Hyper-V PCIe MSI
> 4138200989699-edge      mana_q2@pci:7870:00:00.0
> >  28:    3593858          0          0          1  Hyper-V PCIe MSI
> 4138200989700-edge      mana_q3@pci:7870:00:00.0
> >
> > And during this time we might end up calculating the per-CPU stats
> incorrectly,
> > messing up the understanding of CPU usage by MANA driver that is
> consumed by
> > monitoring services.
> 
> Like Stephen said, forget about irqbalance for networking.
> 
> Assume that the IRQs are affinitized and XPS set, correctly.
> 
> Now, presumably you can use your pcpu stats to "trade queues",
> e.g. 4 CPUs / 4 queues, if CPU 0 insists on using queue 1
> instead of queue 0, you can swap the 0 <> 1 assignment.
> That's just an example of an "algorithm", maybe you have other
> use cases. But if the problem is "user runs broken irqbalance"
> the solution is not in the kernel...

We understand irqbalance may be a "bad idea", and recommended some 
customers to disable it when having problems with it... But it's 
still enabled by default, and we cannot let all distro vendors 
and custom image makers to disable the irqbalance. So, our host-
networking team is eager to have per-CPU stats for analyzing CPU 
usage related to irqbalance or other issues.

Thanks,
- Haiyang
Jakub Kicinski March 14, 2024, 7:05 p.m. UTC | #20
On Thu, 14 Mar 2024 18:54:31 +0000 Haiyang Zhang wrote:
> We understand irqbalance may be a "bad idea", and recommended some 
> customers to disable it when having problems with it... But it's 
> still enabled by default, and we cannot let all distro vendors 
> and custom image makers to disable the irqbalance. So, our host-
> networking team is eager to have per-CPU stats for analyzing CPU 
> usage related to irqbalance or other issues.

You need a use case to get the stats upstream.
"CPU usage related to irqbalance or other issues" is both too vague,
and irqbalance is a user space problem.
Alireza Dabagh March 14, 2024, 8:01 p.m. UTC | #21
Per processor network stats have been supported on netvsc and Mellanox NICs for years. They are also supported on Windows.

I routinely use these stats to rule in/out the issues with RSS imbalance due to either NIC not handling RSS correctly, incorrect MSI-X affinity, or not having enough entropy in flows. 

And yes, I perfectly understand that there are cases that packets received on processor x are processed (in tcp/ip stack) on processor y. But in many cases, there is a correlation between individual processor utilization and the processor where these packets are received on by the NIC.

-thanks, ali
(some suggested that I do mention on this thread that I am one of the original inventors of RSS. So there you have it. Personally I don't think that telling people "I invented the damn thing and I know what I am talking about" is the right way to handle this.) 
-----Original Message-----
From: Jakub Kicinski <kuba@kernel.org> 
Sent: Thursday, March 14, 2024 12:05 PM
To: Haiyang Zhang <haiyangz@microsoft.com>
Cc: Shradha Gupta <shradhagupta@linux.microsoft.com>; Shradha Gupta <shradhagupta@microsoft.com>; linux-kernel@vger.kernel.org; linux-hyperv@vger.kernel.org; linux-rdma@vger.kernel.org; netdev@vger.kernel.org; Eric Dumazet <edumazet@google.com>; Paolo Abeni <pabeni@redhat.com>; Ajay Sharma <sharmaajay@microsoft.com>; Leon Romanovsky <leon@kernel.org>; Thomas Gleixner <tglx@linutronix.de>; Sebastian Andrzej Siewior <bigeasy@linutronix.de>; KY Srinivasan <kys@microsoft.com>; Wei Liu <wei.liu@kernel.org>; Dexuan Cui <decui@microsoft.com>; Long Li <longli@microsoft.com>; Michael Kelley <mikelley@microsoft.com>; Alireza Dabagh <alid@microsoft.com>; Paul Rosswurm <paulros@microsoft.com>
Subject: [EXTERNAL] Re: [PATCH] net :mana : Add per-cpu stats for MANA device

On Thu, 14 Mar 2024 18:54:31 +0000 Haiyang Zhang wrote:
> We understand irqbalance may be a "bad idea", and recommended some 
> customers to disable it when having problems with it... But it's still 
> enabled by default, and we cannot let all distro vendors and custom 
> image makers to disable the irqbalance. So, our host- networking team 
> is eager to have per-CPU stats for analyzing CPU usage related to 
> irqbalance or other issues.

You need a use case to get the stats upstream.
"CPU usage related to irqbalance or other issues" is both too vague, and irqbalance is a user space problem.
Shradha Gupta April 3, 2024, 5:43 a.m. UTC | #22
Hi all,
I have a newer version of this patch with a netlink implementation ready, to support the use case of per-CPU stats for RSS handling and imbalance investigations. Please let me know if we can proceed with that.

Thanks and regards,
Shradha.

On Thu, Mar 14, 2024 at 08:01:24PM +0000, Alireza Dabagh wrote:
> Per processor network stats have been supported on netvsc and Mellanox NICs for years. They are also supported on Windows.
> 
> I routinely use these stats to rule in/out the issues with RSS imbalance due to either NIC not handling RSS correctly, incorrect MSI-X affinity, or not having enough entropy in flows. 
> 
> And yes, I perfectly understand that there are cases that packets received on processor x are processed (in tcp/ip stack) on processor y. But in many cases, there is a correlation between individual processor utilization and the processor where these packets are received on by the NIC.
> 
> -thanks, ali
> (some suggested that I do mention on this thread that I am one of the original inventors of RSS. So there you have it. Personally I don't think that telling people "I invented the damn thing and I know what I am talking about" is the right way to handle this.) 
> -----Original Message-----
> From: Jakub Kicinski <kuba@kernel.org> 
> Sent: Thursday, March 14, 2024 12:05 PM
> To: Haiyang Zhang <haiyangz@microsoft.com>
> Cc: Shradha Gupta <shradhagupta@linux.microsoft.com>; Shradha Gupta <shradhagupta@microsoft.com>; linux-kernel@vger.kernel.org; linux-hyperv@vger.kernel.org; linux-rdma@vger.kernel.org; netdev@vger.kernel.org; Eric Dumazet <edumazet@google.com>; Paolo Abeni <pabeni@redhat.com>; Ajay Sharma <sharmaajay@microsoft.com>; Leon Romanovsky <leon@kernel.org>; Thomas Gleixner <tglx@linutronix.de>; Sebastian Andrzej Siewior <bigeasy@linutronix.de>; KY Srinivasan <kys@microsoft.com>; Wei Liu <wei.liu@kernel.org>; Dexuan Cui <decui@microsoft.com>; Long Li <longli@microsoft.com>; Michael Kelley <mikelley@microsoft.com>; Alireza Dabagh <alid@microsoft.com>; Paul Rosswurm <paulros@microsoft.com>
> Subject: [EXTERNAL] Re: [PATCH] net :mana : Add per-cpu stats for MANA device
> 
> On Thu, 14 Mar 2024 18:54:31 +0000 Haiyang Zhang wrote:
> > We understand irqbalance may be a "bad idea", and recommended some 
> > customers to disable it when having problems with it... But it's still 
> > enabled by default, and we cannot let all distro vendors and custom 
> > image makers to disable the irqbalance. So, our host- networking team 
> > is eager to have per-CPU stats for analyzing CPU usage related to 
> > irqbalance or other issues.
> 
> You need a use case to get the stats upstream.
> "CPU usage related to irqbalance or other issues" is both too vague, and irqbalance is a user space problem.

Patch

diff --git a/drivers/net/ethernet/microsoft/mana/mana_en.c b/drivers/net/ethernet/microsoft/mana/mana_en.c
index 59287c6e6cee..b27ee6684936 100644
--- a/drivers/net/ethernet/microsoft/mana/mana_en.c
+++ b/drivers/net/ethernet/microsoft/mana/mana_en.c
@@ -224,6 +224,7 @@  netdev_tx_t mana_start_xmit(struct sk_buff *skb, struct net_device *ndev)
 	int gso_hs = 0; /* zero for non-GSO pkts */
 	u16 txq_idx = skb_get_queue_mapping(skb);
 	struct gdma_dev *gd = apc->ac->gdma_dev;
+	struct mana_pcpu_stats *pcpu_stats;
 	bool ipv4 = false, ipv6 = false;
 	struct mana_tx_package pkg = {};
 	struct netdev_queue *net_txq;
@@ -234,6 +235,8 @@  netdev_tx_t mana_start_xmit(struct sk_buff *skb, struct net_device *ndev)
 	struct mana_cq *cq;
 	int err, len;
 
+	pcpu_stats = this_cpu_ptr(apc->pcpu_stats);
+
 	if (unlikely(!apc->port_is_up))
 		goto tx_drop;
 
@@ -412,6 +415,12 @@  netdev_tx_t mana_start_xmit(struct sk_buff *skb, struct net_device *ndev)
 	tx_stats->bytes += len;
 	u64_stats_update_end(&tx_stats->syncp);
 
+	/* Also update the per-CPU stats */
+	u64_stats_update_begin(&pcpu_stats->syncp);
+	pcpu_stats->tx_packets++;
+	pcpu_stats->tx_bytes += len;
+	u64_stats_update_end(&pcpu_stats->syncp);
+
 tx_busy:
 	if (netif_tx_queue_stopped(net_txq) && mana_can_tx(gdma_sq)) {
 		netif_tx_wake_queue(net_txq);
@@ -425,6 +434,9 @@  netdev_tx_t mana_start_xmit(struct sk_buff *skb, struct net_device *ndev)
 	kfree(pkg.sgl_ptr);
 tx_drop_count:
 	ndev->stats.tx_dropped++;
+	u64_stats_update_begin(&pcpu_stats->syncp);
+	pcpu_stats->tx_dropped++;
+	u64_stats_update_end(&pcpu_stats->syncp);
 tx_drop:
 	dev_kfree_skb_any(skb);
 	return NETDEV_TX_OK;
@@ -1505,6 +1517,8 @@  static void mana_rx_skb(void *buf_va, bool from_pool,
 	struct mana_stats_rx *rx_stats = &rxq->stats;
 	struct net_device *ndev = rxq->ndev;
 	uint pkt_len = cqe->ppi[0].pkt_len;
+	struct mana_pcpu_stats *pcpu_stats;
+	struct mana_port_context *apc;
 	u16 rxq_idx = rxq->rxq_idx;
 	struct napi_struct *napi;
 	struct xdp_buff xdp = {};
@@ -1512,6 +1526,9 @@  static void mana_rx_skb(void *buf_va, bool from_pool,
 	u32 hash_value;
 	u32 act;
 
+	apc = netdev_priv(ndev);
+	pcpu_stats = this_cpu_ptr(apc->pcpu_stats);
+
 	rxq->rx_cq.work_done++;
 	napi = &rxq->rx_cq.napi;
 
@@ -1570,6 +1587,11 @@  static void mana_rx_skb(void *buf_va, bool from_pool,
 		rx_stats->xdp_tx++;
 	u64_stats_update_end(&rx_stats->syncp);
 
+	u64_stats_update_begin(&pcpu_stats->syncp);
+	pcpu_stats->rx_packets++;
+	pcpu_stats->rx_bytes += pkt_len;
+	u64_stats_update_end(&pcpu_stats->syncp);
+
 	if (act == XDP_TX) {
 		skb_set_queue_mapping(skb, rxq_idx);
 		mana_xdp_tx(skb, ndev);
diff --git a/drivers/net/ethernet/microsoft/mana/mana_ethtool.c b/drivers/net/ethernet/microsoft/mana/mana_ethtool.c
index ab2413d71f6c..e3aa47ead601 100644
--- a/drivers/net/ethernet/microsoft/mana/mana_ethtool.c
+++ b/drivers/net/ethernet/microsoft/mana/mana_ethtool.c
@@ -83,8 +83,9 @@  static int mana_get_sset_count(struct net_device *ndev, int stringset)
 	if (stringset != ETH_SS_STATS)
 		return -EINVAL;
 
-	return ARRAY_SIZE(mana_eth_stats) + num_queues *
-				(MANA_STATS_RX_COUNT + MANA_STATS_TX_COUNT);
+	return ARRAY_SIZE(mana_eth_stats) +
+	       (num_queues * (MANA_STATS_RX_COUNT + MANA_STATS_TX_COUNT)) +
+	       (num_present_cpus() * (MANA_STATS_RX_PCPU + MANA_STATS_TX_PCPU));
 }
 
 static void mana_get_strings(struct net_device *ndev, u32 stringset, u8 *data)
@@ -139,6 +140,19 @@  static void mana_get_strings(struct net_device *ndev, u32 stringset, u8 *data)
 		sprintf(p, "tx_%d_mana_map_err", i);
 		p += ETH_GSTRING_LEN;
 	}
+
+	for (i = 0; i < num_present_cpus(); i++) {
+		sprintf(p, "cpu%d_rx_packets", i);
+		p += ETH_GSTRING_LEN;
+		sprintf(p, "cpu%d_rx_bytes", i);
+		p += ETH_GSTRING_LEN;
+		sprintf(p, "cpu%d_tx_packets", i);
+		p += ETH_GSTRING_LEN;
+		sprintf(p, "cpu%d_tx_bytes", i);
+		p += ETH_GSTRING_LEN;
+		sprintf(p, "cpu%d_tx_dropped", i);
+		p += ETH_GSTRING_LEN;
+	}
 }
 
 static void mana_get_ethtool_stats(struct net_device *ndev,
@@ -222,6 +236,28 @@  static void mana_get_ethtool_stats(struct net_device *ndev,
 		data[i++] = csum_partial;
 		data[i++] = mana_map_err;
 	}
+
+	for_each_possible_cpu(q) {
+		const struct mana_pcpu_stats *pcpu_stats =
+				per_cpu_ptr(apc->pcpu_stats, q);
+		u64 rx_packets, rx_bytes, tx_packets, tx_bytes, tx_dropped;
+		unsigned int start;
+
+		do {
+			start = u64_stats_fetch_begin(&pcpu_stats->syncp);
+			rx_packets = pcpu_stats->rx_packets;
+			tx_packets = pcpu_stats->tx_packets;
+			rx_bytes = pcpu_stats->rx_bytes;
+			tx_bytes = pcpu_stats->tx_bytes;
+			tx_dropped = pcpu_stats->tx_dropped;
+		} while (u64_stats_fetch_retry(&pcpu_stats->syncp, start));
+
+		data[i++] = rx_packets;
+		data[i++] = rx_bytes;
+		data[i++] = tx_packets;
+		data[i++] = tx_bytes;
+		data[i++] = tx_dropped;
+	}
 }
 
 static int mana_get_rxnfc(struct net_device *ndev, struct ethtool_rxnfc *cmd,
diff --git a/include/net/mana/mana.h b/include/net/mana/mana.h
index 76147feb0d10..9a2414ee7f02 100644
--- a/include/net/mana/mana.h
+++ b/include/net/mana/mana.h
@@ -51,6 +51,8 @@  enum TRI_STATE {
 /* Update this count whenever the respective structures are changed */
 #define MANA_STATS_RX_COUNT 5
 #define MANA_STATS_TX_COUNT 11
+#define MANA_STATS_RX_PCPU 2
+#define MANA_STATS_TX_PCPU 3
 
 struct mana_stats_rx {
 	u64 packets;
@@ -386,6 +388,15 @@  struct mana_ethtool_stats {
 	u64 rx_cqe_unknown_type;
 };
 
+struct mana_pcpu_stats {
+	u64 rx_packets;
+	u64 rx_bytes;
+	u64 tx_packets;
+	u64 tx_bytes;
+	u64 tx_dropped;
+	struct u64_stats_sync syncp;
+};
+
 struct mana_context {
 	struct gdma_dev *gdma_dev;
 
@@ -449,6 +460,7 @@  struct mana_port_context {
 	bool port_st_save; /* Saved port state */
 
 	struct mana_ethtool_stats eth_stats;
+	struct mana_pcpu_stats __percpu *pcpu_stats;
 };
 
 netdev_tx_t mana_start_xmit(struct sk_buff *skb, struct net_device *ndev);