Message ID | 1709823132-22411-1-git-send-email-shradhagupta@linux.microsoft.com (mailing list archive)
---|---
State | Superseded
Delegated to | Netdev Maintainers
Series | net :mana : Add per-cpu stats for MANA device
On Thu, 7 Mar 2024 06:52:12 -0800 Shradha Gupta wrote:
> Extend 'ethtool -S' output for mana devices to include per-CPU packet
> stats

But why? You already have per queue stats.

> Built-on: Ubuntu22
> Tested-on: Ubuntu22

I understand when people mention testing driver changes on particular
SKUs of supported devices, that adds meaningful info. I don't understand
what "built" and "tested" on a distro is supposed to tell us.
Honest question, what is the value of this?
> -----Original Message-----
> From: Jakub Kicinski <kuba@kernel.org>
> Sent: Thursday, March 7, 2024 10:29 AM
> To: Shradha Gupta <shradhagupta@linux.microsoft.com>
> Cc: linux-kernel@vger.kernel.org; linux-hyperv@vger.kernel.org;
> linux-rdma@vger.kernel.org; netdev@vger.kernel.org; Eric Dumazet
> <edumazet@google.com>; Paolo Abeni <pabeni@redhat.com>; Ajay Sharma
> <sharmaajay@microsoft.com>; Leon Romanovsky <leon@kernel.org>; Thomas
> Gleixner <tglx@linutronix.de>; Sebastian Andrzej Siewior
> <bigeasy@linutronix.de>; KY Srinivasan <kys@microsoft.com>; Haiyang Zhang
> <haiyangz@microsoft.com>; Wei Liu <wei.liu@kernel.org>; Dexuan Cui
> <decui@microsoft.com>; Long Li <longli@microsoft.com>; Michael Kelley
> <mikelley@microsoft.com>; Shradha Gupta <shradhagupta@microsoft.com>
> Subject: Re: [PATCH] net :mana : Add per-cpu stats for MANA device
>
> On Thu, 7 Mar 2024 06:52:12 -0800 Shradha Gupta wrote:
> > Extend 'ethtool -S' output for mana devices to include per-CPU packet
> > stats
>
> But why? You already have per queue stats.

Yes. But the q to cpu binding is dynamic, we also want the per-CPU stat
to analyze the CPU usage by counting the packets and bytes on each CPU.

> > Built-on: Ubuntu22
> > Tested-on: Ubuntu22
>
> I understand when people mention testing driver changes on particular
> SKUs of supported devices, that adds meaningful info. I don't understand
> what "built" and "tested" on a distro is supposed to tell us.
> Honest question, what is the value of this?

@Shradha Gupta Please add the tested SKU and test case here. Or, you
don't have to include test info with the patch.

Thanks,
- Haiyang
> -----Original Message-----
> From: Shradha Gupta <shradhagupta@linux.microsoft.com>
> Sent: Thursday, March 7, 2024 9:52 AM
> To: linux-kernel@vger.kernel.org; linux-hyperv@vger.kernel.org;
> linux-rdma@vger.kernel.org; netdev@vger.kernel.org
> Cc: Shradha Gupta <shradhagupta@linux.microsoft.com>; Eric Dumazet
> <edumazet@google.com>; Jakub Kicinski <kuba@kernel.org>; Paolo Abeni
> <pabeni@redhat.com>; Ajay Sharma <sharmaajay@microsoft.com>; Leon
> Romanovsky <leon@kernel.org>; Thomas Gleixner <tglx@linutronix.de>;
> Sebastian Andrzej Siewior <bigeasy@linutronix.de>; KY Srinivasan
> <kys@microsoft.com>; Haiyang Zhang <haiyangz@microsoft.com>; Wei Liu
> <wei.liu@kernel.org>; Dexuan Cui <decui@microsoft.com>; Long Li
> <longli@microsoft.com>; Michael Kelley <mikelley@microsoft.com>; Shradha
> Gupta <shradhagupta@microsoft.com>
> Subject: [PATCH] net :mana : Add per-cpu stats for MANA device
>
> Extend 'ethtool -S' output for mana devices to include per-CPU packet
> stats
>
> Built-on: Ubuntu22
> Tested-on: Ubuntu22
> Signed-off-by: Shradha Gupta <shradhagupta@linux.microsoft.com>
> ---
>  drivers/net/ethernet/microsoft/mana/mana_en.c |  22 ++++++++++
>  .../ethernet/microsoft/mana/mana_ethtool.c    |  40 ++++++++++++++++++-
>  include/net/mana/mana.h                       |  12 ++++++
>  3 files changed, 72 insertions(+), 2 deletions(-)
>
[...]
> @@ -449,6 +460,7 @@ struct mana_port_context {
>  	bool port_st_save; /* Saved port state */
>
>  	struct mana_ethtool_stats eth_stats;
> +	struct mana_pcpu_stats __percpu *pcpu_stats;

Where are pcpu_stats alloc-ed?
Seems I cannot see any alloc in the patch.

Thanks,
- Haiyang
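The question above points at a real gap: the posted diff dereferences apc->pcpu_stats in the hot paths but never allocates it. The snippet below is only a sketch of the kind of allocation being asked for, not part of the submitted patch; the helper names (mana_port_alloc_pcpu_stats / mana_port_free_pcpu_stats) and the idea of calling them from the port setup and teardown paths are assumptions for illustration.

/* Hypothetical helpers, not from the posted patch: allocate the per-CPU
 * counters once per port and free them on teardown.
 * netdev_alloc_pcpu_stats() zeroes the counters and initializes the
 * u64_stats_sync member named "syncp" for every possible CPU.
 */
static int mana_port_alloc_pcpu_stats(struct mana_port_context *apc)
{
	apc->pcpu_stats = netdev_alloc_pcpu_stats(struct mana_pcpu_stats);
	return apc->pcpu_stats ? 0 : -ENOMEM;
}

static void mana_port_free_pcpu_stats(struct mana_port_context *apc)
{
	free_percpu(apc->pcpu_stats);
	apc->pcpu_stats = NULL;
}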
On Thu, 7 Mar 2024 15:49:15 +0000 Haiyang Zhang wrote:
> > > Extend 'ethtool -S' output for mana devices to include per-CPU packet
> > > stats
> >
> > But why? You already have per queue stats.
> Yes. But the q to cpu binding is dynamic, we also want the per-CPU stat
> to analyze the CPU usage by counting the packets and bytes on each CPU.

Dynamic is a bit of an exaggeration, right? On a well-configured system
each CPU should use a single queue assigned thru XPS. And for manual
debug bpftrace should serve the purpose quite well.

Please note that you can't use num_present_cpus() to size stats in
ethtool -S, you have to use possible_cpus(), because the retrieval
of the stats is done in a multi-syscall fashion and there are no
explicit lengths in the API. So you must always report all possible
stats, not just currently active :(
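Concretely, the point is that the string count, the string names, and the values all have to be generated over the same fixed set of CPUs, and that set is the possible CPUs (num_possible_cpus() / for_each_possible_cpu()), not the present ones. A hedged sketch of that correction to the posted mana_get_sset_count() is below; the opening local-variable lines are assumed to match the existing driver, and only the final term differs from the patch.

static int mana_get_sset_count(struct net_device *ndev, int stringset)
{
	struct mana_port_context *apc = netdev_priv(ndev);
	unsigned int num_queues = apc->num_queues;

	if (stringset != ETH_SS_STATS)
		return -EINVAL;

	/* Size the per-CPU entries over all *possible* CPUs so the count can
	 * never be smaller than what mana_get_ethtool_stats() walks with
	 * for_each_possible_cpu(); mana_get_strings() would need the same
	 * num_possible_cpus() bound in its naming loop.
	 */
	return ARRAY_SIZE(mana_eth_stats) +
	       num_queues * (MANA_STATS_RX_COUNT + MANA_STATS_TX_COUNT) +
	       num_possible_cpus() * (MANA_STATS_RX_PCPU + MANA_STATS_TX_PCPU);
}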
Thanks for the comments, I am sending out a newer version with these fixes.

On Thu, Mar 07, 2024 at 09:01:45AM -0800, Jakub Kicinski wrote:
> On Thu, 7 Mar 2024 15:49:15 +0000 Haiyang Zhang wrote:
> > > > Extend 'ethtool -S' output for mana devices to include per-CPU packet
> > > > stats
> > >
> > > But why? You already have per queue stats.
> > Yes. But the q to cpu binding is dynamic, we also want the per-CPU stat
> > to analyze the CPU usage by counting the packets and bytes on each CPU.
>
> Dynamic is a bit of an exaggeration, right? On a well-configured system
> each CPU should use a single queue assigned thru XPS. And for manual
> debug bpftrace should serve the purpose quite well.
>
> Please note that you can't use num_present_cpus() to size stats in
> ethtool -S, you have to use possible_cpus(), because the retrieval
> of the stats is done in a multi-syscall fashion and there are no
> explicit lengths in the API. So you must always report all possible
> stats, not just currently active :(
> -----Original Message-----
> From: Jakub Kicinski <kuba@kernel.org>
> Sent: Thursday, March 7, 2024 12:02 PM
> To: Haiyang Zhang <haiyangz@microsoft.com>
> Cc: Shradha Gupta <shradhagupta@linux.microsoft.com>; Shradha Gupta
> <shradhagupta@microsoft.com>; linux-kernel@vger.kernel.org;
> linux-hyperv@vger.kernel.org; linux-rdma@vger.kernel.org;
> netdev@vger.kernel.org; Eric Dumazet <edumazet@google.com>; Paolo Abeni
> <pabeni@redhat.com>; Ajay Sharma <sharmaajay@microsoft.com>; Leon
> Romanovsky <leon@kernel.org>; Thomas Gleixner <tglx@linutronix.de>;
> Sebastian Andrzej Siewior <bigeasy@linutronix.de>; KY Srinivasan
> <kys@microsoft.com>; Wei Liu <wei.liu@kernel.org>; Dexuan Cui
> <decui@microsoft.com>; Long Li <longli@microsoft.com>; Michael Kelley
> <mikelley@microsoft.com>
> Subject: Re: [PATCH] net :mana : Add per-cpu stats for MANA device
>
> On Thu, 7 Mar 2024 15:49:15 +0000 Haiyang Zhang wrote:
> > > > Extend 'ethtool -S' output for mana devices to include per-CPU
> > > > packet stats
> > >
> > > But why? You already have per queue stats.
> > Yes. But the q to cpu binding is dynamic, we also want the per-CPU stat
> > to analyze the CPU usage by counting the packets and bytes on each CPU.
>
> Dynamic is a bit of an exaggeration, right? On a well-configured system
> each CPU should use a single queue assigned thru XPS. And for manual
> debug bpftrace should serve the purpose quite well.

Some programs, like irqbalancer can dynamically change the CPU affinity,
so we want to add the per-CPU counters for better understanding of the
CPU usage.

Thanks,
- Haiyang
On Fri, 8 Mar 2024 18:51:58 +0000 Haiyang Zhang wrote:
> > Dynamic is a bit of an exaggeration, right? On a well-configured system
> > each CPU should use a single queue assigned thru XPS. And for manual
> > debug bpftrace should serve the purpose quite well.
>
> Some programs, like irqbalancer can dynamically change the CPU affinity,
> so we want to add the per-CPU counters for better understanding of the
> CPU usage.

Do you have experimental data showing this making a difference
in production?

Seems unlikely, but if it does work we should enable it for all
devices, not driver by driver.
> -----Original Message-----
> From: Jakub Kicinski <kuba@kernel.org>
> Sent: Friday, March 8, 2024 2:23 PM
> To: Haiyang Zhang <haiyangz@microsoft.com>
> Cc: Shradha Gupta <shradhagupta@linux.microsoft.com>; Shradha Gupta
> <shradhagupta@microsoft.com>; linux-kernel@vger.kernel.org;
> linux-hyperv@vger.kernel.org; linux-rdma@vger.kernel.org;
> netdev@vger.kernel.org; Eric Dumazet <edumazet@google.com>; Paolo Abeni
> <pabeni@redhat.com>; Ajay Sharma <sharmaajay@microsoft.com>; Leon
> Romanovsky <leon@kernel.org>; Thomas Gleixner <tglx@linutronix.de>;
> Sebastian Andrzej Siewior <bigeasy@linutronix.de>; KY Srinivasan
> <kys@microsoft.com>; Wei Liu <wei.liu@kernel.org>; Dexuan Cui
> <decui@microsoft.com>; Long Li <longli@microsoft.com>; Michael Kelley
> <mikelley@microsoft.com>
> Subject: Re: [PATCH] net :mana : Add per-cpu stats for MANA device
>
> On Fri, 8 Mar 2024 18:51:58 +0000 Haiyang Zhang wrote:
> > > Dynamic is a bit of an exaggeration, right? On a well-configured system
> > > each CPU should use a single queue assigned thru XPS. And for manual
> > > debug bpftrace should serve the purpose quite well.
> >
> > Some programs, like irqbalancer can dynamically change the CPU affinity,
> > so we want to add the per-CPU counters for better understanding of the
> > CPU usage.
>
> Do you have experimental data showing this making a difference
> in production?

Shradha, could you please add some data before / after enabling irqbalancer
which changes cpu affinity?

>
> Seems unlikely, but if it does work we should enable it for all
> devices, not driver by driver.

There are some existing drivers, like mlx, rmnet, netvsc, etc. using percpu
counters. Are you suggesting we add a common API for all drivers?

Thanks,
- Haiyang
On Fri, 08 Mar, 2024 19:43:57 +0000 Haiyang Zhang <haiyangz@microsoft.com> wrote:

>> -----Original Message-----
>> From: Jakub Kicinski <kuba@kernel.org>
>> Sent: Friday, March 8, 2024 2:23 PM
>> To: Haiyang Zhang <haiyangz@microsoft.com>
>> Cc: Shradha Gupta <shradhagupta@linux.microsoft.com>; Shradha Gupta
>> <shradhagupta@microsoft.com>; linux-kernel@vger.kernel.org;
>> linux-hyperv@vger.kernel.org; linux-rdma@vger.kernel.org;
>> netdev@vger.kernel.org; Eric Dumazet <edumazet@google.com>; Paolo Abeni
>> <pabeni@redhat.com>; Ajay Sharma <sharmaajay@microsoft.com>; Leon
>> Romanovsky <leon@kernel.org>; Thomas Gleixner <tglx@linutronix.de>;
>> Sebastian Andrzej Siewior <bigeasy@linutronix.de>; KY Srinivasan
>> <kys@microsoft.com>; Wei Liu <wei.liu@kernel.org>; Dexuan Cui
>> <decui@microsoft.com>; Long Li <longli@microsoft.com>; Michael Kelley
>> <mikelley@microsoft.com>
>> Subject: Re: [PATCH] net :mana : Add per-cpu stats for MANA device
>>
>> On Fri, 8 Mar 2024 18:51:58 +0000 Haiyang Zhang wrote:
>> > > Dynamic is a bit of an exaggeration, right? On a well-configured system
>> > > each CPU should use a single queue assigned thru XPS. And for manual
>> > > debug bpftrace should serve the purpose quite well.
>> >
>> > Some programs, like irqbalancer can dynamically change the CPU affinity,
>> > so we want to add the per-CPU counters for better understanding of the
>> > CPU usage.
>>
>> Do you have experimental data showing this making a difference
>> in production?
> Shradha, could you please add some data before / after enabling irqbalancer
> which changes cpu affinity?
>
>>
>> Seems unlikely, but if it does work we should enable it for all
>> devices, not driver by driver.
> There are some existing drivers, like mlx, rmnet, netvsc, etc. using percpu
> counters. Are you suggesting we add a common API for all drivers?

Wanted to chime in with regards to mlx. You might be conflating per-cpu
with per-queue. When we run ethtool -S, we present counters per netdev
queue rather than per-cpu. The number of queues we instantiate is related
to CPUs but it is not always 1-1.

Jakub just recently supported a proper interface for per-queue stats
counters that we are interested in supporting.

https://lore.kernel.org/netdev/20240222223629.158254-1-kuba@kernel.org/

--
Thanks,

Rahul Rameshbabu
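For reference, the interface linked above is the netdev netlink per-queue stats API built around struct netdev_stat_ops. A rough sketch of how a driver plugs into it follows; the callback and field names here are written from memory of that series and should be checked against include/net/netdev_queues.h in a current tree rather than taken as an exact quote.

/* Hedged sketch of the per-queue stats hooks from the series linked above.
 * Callback signatures and field names are approximations.
 */
static void example_get_queue_stats_rx(struct net_device *ndev, int idx,
					struct netdev_queue_stats_rx *stats)
{
	/* fill stats->packets / stats->bytes for RX queue idx */
}

static void example_get_queue_stats_tx(struct net_device *ndev, int idx,
					struct netdev_queue_stats_tx *stats)
{
	/* fill stats->packets / stats->bytes for TX queue idx */
}

static void example_get_base_stats(struct net_device *ndev,
				   struct netdev_queue_stats_rx *rx,
				   struct netdev_queue_stats_tx *tx)
{
	/* totals not attributable to a live queue, so sums stay consistent */
}

static const struct netdev_stat_ops example_stat_ops = {
	.get_queue_stats_rx	= example_get_queue_stats_rx,
	.get_queue_stats_tx	= example_get_queue_stats_tx,
	.get_base_stats		= example_get_base_stats,
};

/* registered from probe with: ndev->stat_ops = &example_stat_ops; */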
On Fri, 8 Mar 2024 19:43:57 +0000 Haiyang Zhang wrote:
> > Seems unlikely, but if it does work we should enable it for all
> > devices, not driver by driver.
> There are some existing drivers, like mlx, rmnet, netvsc, etc. using percpu
> counters. Are you suggesting we add a common API for all drivers?

Hm, I don't think mlx does. The typical use case for pcpu stats so far
has been software devices which don't have queues, and implement lockless
Tx. In that case instead of recording stats on a local queue struct we
can count per-cpu and not worry about locking.
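For context on the pattern described here, the core already provides generic per-CPU software stats that many software netdevs use: a struct pcpu_sw_netstats per CPU, updated locklessly on the datapath and folded into ndo_get_stats64. A minimal sketch follows; the device and function names are made up for illustration, while the helpers are the in-tree ones.

/* Sketch of the common per-CPU stats pattern for a software netdev. */
static netdev_tx_t swdev_xmit(struct sk_buff *skb, struct net_device *dev)
{
	unsigned int len = skb->len;

	/* ... actual transmit work for this hypothetical device ... */

	/* lockless per-CPU accounting, no queue struct needed */
	dev_sw_netstats_tx_add(dev, 1, len);

	consume_skb(skb);
	return NETDEV_TX_OK;
}

/* Setup, aggregation and teardown are also generic:
 *   dev->tstats = netdev_alloc_pcpu_stats(struct pcpu_sw_netstats);
 *   .ndo_get_stats64 = dev_get_tstats64;
 *   free_percpu(dev->tstats);
 */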
On 2024-03-08 19:43:57 [+0000], Haiyang Zhang wrote:
> > Do you have experimental data showing this making a difference
> > in production?
> Shradha, could you please add some data before / after enabling irqbalancer
> which changes cpu affinity?

so you have one queue and one interrupt and then the irqbalancer is
pushing the interrupt from one CPU to another. What kind of information
do you gain from per-CPU counters here?

> Thanks,
> - Haiyang

Sebastian
On Fri, Mar 08, 2024 at 11:22:44AM -0800, Jakub Kicinski wrote:
> On Fri, 8 Mar 2024 18:51:58 +0000 Haiyang Zhang wrote:
> > > Dynamic is a bit of an exaggeration, right? On a well-configured system
> > > each CPU should use a single queue assigned thru XPS. And for manual
> > > debug bpftrace should serve the purpose quite well.
> >
> > Some programs, like irqbalancer can dynamically change the CPU affinity,
> > so we want to add the per-CPU counters for better understanding of the
> > CPU usage.
>
> Do you have experimental data showing this making a difference
> in production?
Sure, will try to get that data for this discussion
>
> Seems unlikely, but if it does work we should enable it for all
> devices, not driver by driver.
You mean, if the usecase seems valid we should try to extend the framework
mentioned by Rahul (https://lore.kernel.org/lkml/20240307072923.6cc8a2ba@kernel.org/)
to include these stats as well?
Will explore this a bit more and update. Thanks.
On Sun, 10 Mar 2024 21:19:50 -0700
Shradha Gupta <shradhagupta@linux.microsoft.com> wrote:

> On Fri, Mar 08, 2024 at 11:22:44AM -0800, Jakub Kicinski wrote:
> > On Fri, 8 Mar 2024 18:51:58 +0000 Haiyang Zhang wrote:
> > > > Dynamic is a bit of an exaggeration, right? On a well-configured system
> > > > each CPU should use a single queue assigned thru XPS. And for manual
> > > > debug bpftrace should serve the purpose quite well.
> > >
> > > Some programs, like irqbalancer can dynamically change the CPU affinity,
> > > so we want to add the per-CPU counters for better understanding of the
> > > CPU usage.
> >
> > Do you have experimental data showing this making a difference
> > in production?
> Sure, will try to get that data for this discussion
> >
> > Seems unlikely, but if it does work we should enable it for all
> > devices, not driver by driver.
> You mean, if the usecase seems valid we should try to extend the framework
> mentioned by Rahul (https://lore.kernel.org/lkml/20240307072923.6cc8a2ba@kernel.org/)
> to include these stats as well?
> Will explore this a bit more and update. Thanks.

Remember, statistics aren't free, and even per-cpu stats end up taking
cache space.
On Sun, 10 Mar 2024 21:19:50 -0700 Shradha Gupta wrote:
> > Seems unlikely, but if it does work we should enable it for all
> > devices, not driver by driver.
> You mean, if the usecase seems valid we should try to extend the framework
> mentioned by Rahul (https://lore.kernel.org/lkml/20240307072923.6cc8a2ba@kernel.org/)
> to include these stats as well?

"framework" is a big word, but yes, add a netlink command to get
pcpu stats. Let's focus on the usefulness before investing time
in a rewrite, tho.
On Mon, 11 Mar 2024 08:51:26 -0700
Jakub Kicinski <kuba@kernel.org> wrote:

> On Sun, 10 Mar 2024 21:19:50 -0700 Shradha Gupta wrote:
> > > Seems unlikely, but if it does work we should enable it for all
> > > devices, not driver by driver.
> > You mean, if the usecase seems valid we should try to extend the framework
> > mentioned by Rahul (https://lore.kernel.org/lkml/20240307072923.6cc8a2ba@kernel.org/)
> > to include these stats as well?
>
> "framework" is a big word, but yes, add a netlink command to get
> pcpu stats. Let's focus on the usefulness before investing time
> in a rewrite, tho.

Couldn't userspace do the same thing by looking at /proc/interrupts
and PCI info?
On Sun, Mar 10, 2024 at 09:19:50PM -0700, Shradha Gupta wrote:
> On Fri, Mar 08, 2024 at 11:22:44AM -0800, Jakub Kicinski wrote:
> > On Fri, 8 Mar 2024 18:51:58 +0000 Haiyang Zhang wrote:
> > > > Dynamic is a bit of an exaggeration, right? On a well-configured system
> > > > each CPU should use a single queue assigned thru XPS. And for manual
> > > > debug bpftrace should serve the purpose quite well.
> > >
> > > Some programs, like irqbalancer can dynamically change the CPU affinity,
> > > so we want to add the per-CPU counters for better understanding of the
> > > CPU usage.
> >
> > Do you have experimental data showing this making a difference
> > in production?
> Sure, will try to get that data for this discussion
> >
> > Seems unlikely, but if it does work we should enable it for all
> > devices, not driver by driver.
> You mean, if the usecase seems valid we should try to extend the framework
> mentioned by Rahul (https://lore.kernel.org/lkml/20240307072923.6cc8a2ba@kernel.org/)
> to include these stats as well?
> Will explore this a bit more and update. Thanks.

Following is the data we can share:

Default interrupts affinity for each queue:

25:        1      103        0  2989138  Hyper-V PCIe MSI 4138200989697-edge  mana_q0@pci:7870:00:00.0
26:        0        1  4005360        0  Hyper-V PCIe MSI 4138200989698-edge  mana_q1@pci:7870:00:00.0
27:        0        0        1  2997584  Hyper-V PCIe MSI 4138200989699-edge  mana_q2@pci:7870:00:00.0
28:  3565461        0        0        1  Hyper-V PCIe MSI 4138200989700-edge  mana_q3@pci:7870:00:00.0

As seen the CPU-queue mapping is not 1:1, Queue 0 and Queue 2 are both mapped
to cpu3. From this knowledge we can figure out the total RX stats processed by
each CPU by adding the values of mana_q0 and mana_q2 stats for cpu3. But if
this data changes dynamically using irqbalance or smp_affinity file edits, the
above assumption fails.

Interrupt affinity for mana_q2 changes and the affinity table looks as follows:

25:        1      103        0  3038084  Hyper-V PCIe MSI 4138200989697-edge  mana_q0@pci:7870:00:00.0
26:        0        1  4012447        0  Hyper-V PCIe MSI 4138200989698-edge  mana_q1@pci:7870:00:00.0
27:   157181       10        1  3007990  Hyper-V PCIe MSI 4138200989699-edge  mana_q2@pci:7870:00:00.0
28:  3593858        0        0        1  Hyper-V PCIe MSI 4138200989700-edge  mana_q3@pci:7870:00:00.0

And during this time we might end up calculating the per-CPU stats incorrectly,
messing up the understanding of CPU usage by MANA driver that is consumed by
monitoring services.
Also sharing the existing per-queue stats during this experiment, in case needed.

Per-queue stats before changing CPU-affinities:

tx_cq_err: 0   tx_cqe_unknown_type: 0   rx_coalesced_err: 0   rx_cqe_unknown_type: 0
rx_0_packets: 4230152   rx_0_bytes: 289545167   rx_0_xdp_drop: 0   rx_0_xdp_tx: 0   rx_0_xdp_redirect: 0
rx_1_packets: 4113017   rx_1_bytes: 314552601   rx_1_xdp_drop: 0   rx_1_xdp_tx: 0   rx_1_xdp_redirect: 0
rx_2_packets: 4458906   rx_2_bytes: 305117506   rx_2_xdp_drop: 0   rx_2_xdp_tx: 0   rx_2_xdp_redirect: 0
rx_3_packets: 4619589   rx_3_bytes: 315445084   rx_3_xdp_drop: 0   rx_3_xdp_tx: 0   rx_3_xdp_redirect: 0
hc_tx_err_vport_disabled: 0   hc_tx_err_inval_vportoffset_pkt: 0   hc_tx_err_vlan_enforcement: 0
hc_tx_err_eth_type_enforcement: 0   hc_tx_err_sa_enforcement: 0   hc_tx_err_sqpdid_enforcement: 0
hc_tx_err_cqpdid_enforcement: 0   hc_tx_err_mtu_violation: 0   hc_tx_err_inval_oob: 0   hc_tx_err_gdma: 0
hc_tx_bytes: 126336708121   hc_tx_ucast_pkts: 86748013   hc_tx_ucast_bytes: 126336703775
hc_tx_bcast_pkts: 37   hc_tx_bcast_bytes: 2842   hc_tx_mcast_pkts: 7   hc_tx_mcast_bytes: 1504
tx_cq_err: 0   tx_cqe_unknown_type: 0   rx_coalesced_err: 0   rx_cqe_unknown_type: 0
rx_0_packets: 4230152   rx_0_bytes: 289545167   rx_0_xdp_drop: 0   rx_0_xdp_tx: 0   rx_0_xdp_redirect: 0
rx_1_packets: 4113017   rx_1_bytes: 314552601   rx_1_xdp_drop: 0   rx_1_xdp_tx: 0   rx_1_xdp_redirect: 0
rx_2_packets: 4458906   rx_2_bytes: 305117506   rx_2_xdp_drop: 0   rx_2_xdp_tx: 0   rx_2_xdp_redirect: 0
rx_3_packets: 4619589   rx_3_bytes: 315445084   rx_3_xdp_drop: 0   rx_3_xdp_tx: 0   rx_3_xdp_redirect: 0
tx_0_packets: 5995507   tx_0_bytes: 28749696408   tx_0_xdp_xmit: 0   tx_0_tso_packets: 4719840
tx_0_tso_bytes: 26873844525   tx_0_tso_inner_packets: 0   tx_0_tso_inner_bytes: 0   tx_0_long_pkt_fmt: 0
tx_0_short_pkt_fmt: 5995507   tx_0_csum_partial: 1275621   tx_0_mana_map_err: 0
tx_1_packets: 6653598   tx_1_bytes: 38318341475   tx_1_xdp_xmit: 0   tx_1_tso_packets: 5330921
tx_1_tso_bytes: 36210150488   tx_1_tso_inner_packets: 0   tx_1_tso_inner_bytes: 0   tx_1_long_pkt_fmt: 0
tx_1_short_pkt_fmt: 6653598   tx_1_csum_partial: 1322643   tx_1_mana_map_err: 0
tx_2_packets: 5715246   tx_2_bytes: 25662283686   tx_2_xdp_xmit: 0   tx_2_tso_packets: 4619118
tx_2_tso_bytes: 23829680267   tx_2_tso_inner_packets: 0   tx_2_tso_inner_bytes: 0   tx_2_long_pkt_fmt: 0
tx_2_short_pkt_fmt: 5715246   tx_2_csum_partial: 1096092   tx_2_mana_map_err: 0
tx_3_packets: 6175860   tx_3_bytes: 29500667904   tx_3_xdp_xmit: 0   tx_3_tso_packets: 4951591
tx_3_tso_bytes: 27446937448   tx_3_tso_inner_packets: 0   tx_3_tso_inner_bytes: 0   tx_3_long_pkt_fmt: 0
tx_3_short_pkt_fmt: 6175860   tx_3_csum_partial: 1224213   tx_3_mana_map_err: 0

Per-queue stats after changing CPU-affinities:

rx_0_packets: 4781895   rx_0_bytes: 326478061   rx_0_xdp_drop: 0   rx_0_xdp_tx: 0   rx_0_xdp_redirect: 0
rx_1_packets: 4116990   rx_1_bytes: 315439234   rx_1_xdp_drop: 0   rx_1_xdp_tx: 0   rx_1_xdp_redirect: 0
rx_2_packets: 4528800   rx_2_bytes: 310312337   rx_2_xdp_drop: 0   rx_2_xdp_tx: 0   rx_2_xdp_redirect: 0
rx_3_packets: 4622622   rx_3_bytes: 316282431   rx_3_xdp_drop: 0   rx_3_xdp_tx: 0   rx_3_xdp_redirect: 0
tx_0_packets: 5999379   tx_0_bytes: 28750864476   tx_0_xdp_xmit: 0   tx_0_tso_packets: 4720027
tx_0_tso_bytes: 26874344494   tx_0_tso_inner_packets: 0   tx_0_tso_inner_bytes: 0   tx_0_long_pkt_fmt: 0
tx_0_short_pkt_fmt: 5999379   tx_0_csum_partial: 1279296   tx_0_mana_map_err: 0
tx_1_packets: 6656913   tx_1_bytes: 38319355168   tx_1_xdp_xmit: 0   tx_1_tso_packets: 5331086
tx_1_tso_bytes: 36210592040   tx_1_tso_inner_packets: 0   tx_1_tso_inner_bytes: 0   tx_1_long_pkt_fmt: 0
tx_1_short_pkt_fmt: 6656913   tx_1_csum_partial: 1325785   tx_1_mana_map_err: 0
tx_2_packets: 5906172   tx_2_bytes: 36758032245   tx_2_xdp_xmit: 0   tx_2_tso_packets: 4806348
tx_2_tso_bytes: 34912213258   tx_2_tso_inner_packets: 0   tx_2_tso_inner_bytes: 0   tx_2_long_pkt_fmt: 0
tx_2_short_pkt_fmt: 5906172   tx_2_csum_partial: 1099782   tx_2_mana_map_err: 0
tx_3_packets: 6202399   tx_3_bytes: 30840325531   tx_3_xdp_xmit: 0   tx_3_tso_packets: 4973730
tx_3_tso_bytes: 28784371532   tx_3_tso_inner_packets: 0   tx_3_tso_inner_bytes: 0   tx_3_long_pkt_fmt: 0
tx_3_short_pkt_fmt: 6202399   tx_3_csum_partial: 1228603   tx_3_mana_map_err: 0
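As a worked example of the aggregation step described above, with the default affinity (mana_q0 and mana_q2 both serviced on CPU 3) and the "before" per-queue numbers just quoted:

  CPU3 RX packets ≈ rx_0_packets + rx_2_packets = 4230152 + 4458906 = 8689058
  CPU3 RX bytes   ≈ rx_0_bytes   + rx_2_bytes   = 289545167 + 305117506 = 594662673

Once the mana_q2 interrupt also starts firing on CPU 0 (the second /proc/interrupts snapshot), queue 2's packets are split across CPUs and this per-queue arithmetic can no longer attribute them to a single CPU, which is the gap the proposed per-CPU counters are meant to close.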
On Wed, 13 Mar 2024 19:57:20 -0700
Shradha Gupta <shradhagupta@linux.microsoft.com> wrote:

> Following is the data we can share:
>
> Default interrupts affinity for each queue:
>
> 25:        1      103        0  2989138  Hyper-V PCIe MSI 4138200989697-edge  mana_q0@pci:7870:00:00.0
> 26:        0        1  4005360        0  Hyper-V PCIe MSI 4138200989698-edge  mana_q1@pci:7870:00:00.0
> 27:        0        0        1  2997584  Hyper-V PCIe MSI 4138200989699-edge  mana_q2@pci:7870:00:00.0
> 28:  3565461        0        0        1  Hyper-V PCIe MSI 4138200989700-edge  mana_q3@pci:7870:00:00.0
>
> As seen the CPU-queue mapping is not 1:1, Queue 0 and Queue 2 are both mapped
> to cpu3. From this knowledge we can figure out the total RX stats processed by
> each CPU by adding the values of mana_q0 and mana_q2 stats for cpu3. But if
> this data changes dynamically using irqbalance or smp_affinity file edits, the
> above assumption fails.

irqbalance is often a bad idea. In the past, doing one shot balancing
at startup was a better plan.
On Wed, 13 Mar 2024 19:57:20 -0700 Shradha Gupta wrote:
> Default interrupts affinity for each queue:
>
> 25:        1      103        0  2989138  Hyper-V PCIe MSI 4138200989697-edge  mana_q0@pci:7870:00:00.0
> 26:        0        1  4005360        0  Hyper-V PCIe MSI 4138200989698-edge  mana_q1@pci:7870:00:00.0
> 27:        0        0        1  2997584  Hyper-V PCIe MSI 4138200989699-edge  mana_q2@pci:7870:00:00.0
> 28:  3565461        0        0        1  Hyper-V PCIe MSI 4138200989700-edge  mana_q3@pci:7870:00:00.0
>
> As seen the CPU-queue mapping is not 1:1, Queue 0 and Queue 2 are both mapped
> to cpu3. From this knowledge we can figure out the total RX stats processed by
> each CPU by adding the values of mana_q0 and mana_q2 stats for cpu3. But if
> this data changes dynamically using irqbalance or smp_affinity file edits, the
> above assumption fails.
>
> Interrupt affinity for mana_q2 changes and the affinity table looks as follows:
>
> 25:        1      103        0  3038084  Hyper-V PCIe MSI 4138200989697-edge  mana_q0@pci:7870:00:00.0
> 26:        0        1  4012447        0  Hyper-V PCIe MSI 4138200989698-edge  mana_q1@pci:7870:00:00.0
> 27:   157181       10        1  3007990  Hyper-V PCIe MSI 4138200989699-edge  mana_q2@pci:7870:00:00.0
> 28:  3593858        0        0        1  Hyper-V PCIe MSI 4138200989700-edge  mana_q3@pci:7870:00:00.0
>
> And during this time we might end up calculating the per-CPU stats incorrectly,
> messing up the understanding of CPU usage by MANA driver that is consumed by
> monitoring services.

Like Stephen said, forget about irqbalance for networking.

Assume that the IRQs are affinitized and XPS set, correctly.

Now, presumably you can use your pcpu stats to "trade queues",
e.g. 4 CPUs / 4 queues, if CPU 0 insists on using queue 1
instead of queue 0, you can swap the 0 <> 1 assignment.
That's just an example of an "algorithm", maybe you have other
use cases. But if the problem is "user runs broken irqbalance"
the solution is not in the kernel...
> -----Original Message-----
> From: Jakub Kicinski <kuba@kernel.org>
> Sent: Thursday, March 14, 2024 2:28 PM
> To: Shradha Gupta <shradhagupta@linux.microsoft.com>
> Cc: Haiyang Zhang <haiyangz@microsoft.com>; Shradha Gupta
> <shradhagupta@microsoft.com>; linux-kernel@vger.kernel.org;
> linux-hyperv@vger.kernel.org; linux-rdma@vger.kernel.org;
> netdev@vger.kernel.org; Eric Dumazet <edumazet@google.com>; Paolo Abeni
> <pabeni@redhat.com>; Ajay Sharma <sharmaajay@microsoft.com>; Leon
> Romanovsky <leon@kernel.org>; Thomas Gleixner <tglx@linutronix.de>;
> Sebastian Andrzej Siewior <bigeasy@linutronix.de>; KY Srinivasan
> <kys@microsoft.com>; Wei Liu <wei.liu@kernel.org>; Dexuan Cui
> <decui@microsoft.com>; Long Li <longli@microsoft.com>; Michael Kelley
> <mikelley@microsoft.com>
> Subject: Re: [PATCH] net :mana : Add per-cpu stats for MANA device
>
> On Wed, 13 Mar 2024 19:57:20 -0700 Shradha Gupta wrote:
> > Default interrupts affinity for each queue:
> >
> > 25:        1      103        0  2989138  Hyper-V PCIe MSI 4138200989697-edge  mana_q0@pci:7870:00:00.0
> > 26:        0        1  4005360        0  Hyper-V PCIe MSI 4138200989698-edge  mana_q1@pci:7870:00:00.0
> > 27:        0        0        1  2997584  Hyper-V PCIe MSI 4138200989699-edge  mana_q2@pci:7870:00:00.0
> > 28:  3565461        0        0        1  Hyper-V PCIe MSI 4138200989700-edge  mana_q3@pci:7870:00:00.0
> >
> > As seen the CPU-queue mapping is not 1:1, Queue 0 and Queue 2 are both mapped
> > to cpu3. From this knowledge we can figure out the total RX stats processed by
> > each CPU by adding the values of mana_q0 and mana_q2 stats for cpu3. But if
> > this data changes dynamically using irqbalance or smp_affinity file edits, the
> > above assumption fails.
> >
> > Interrupt affinity for mana_q2 changes and the affinity table looks as follows:
> >
> > 25:        1      103        0  3038084  Hyper-V PCIe MSI 4138200989697-edge  mana_q0@pci:7870:00:00.0
> > 26:        0        1  4012447        0  Hyper-V PCIe MSI 4138200989698-edge  mana_q1@pci:7870:00:00.0
> > 27:   157181       10        1  3007990  Hyper-V PCIe MSI 4138200989699-edge  mana_q2@pci:7870:00:00.0
> > 28:  3593858        0        0        1  Hyper-V PCIe MSI 4138200989700-edge  mana_q3@pci:7870:00:00.0
> >
> > And during this time we might end up calculating the per-CPU stats incorrectly,
> > messing up the understanding of CPU usage by MANA driver that is consumed by
> > monitoring services.
>
> Like Stephen said, forget about irqbalance for networking.
>
> Assume that the IRQs are affinitized and XPS set, correctly.
>
> Now, presumably you can use your pcpu stats to "trade queues",
> e.g. 4 CPUs / 4 queues, if CPU 0 insists on using queue 1
> instead of queue 0, you can swap the 0 <> 1 assignment.
> That's just an example of an "algorithm", maybe you have other
> use cases. But if the problem is "user runs broken irqbalance"
> the solution is not in the kernel...

We understand irqbalance may be a "bad idea", and recommended some
customers to disable it when having problems with it... But it's still
enabled by default, and we cannot let all distro vendors and custom
image makers to disable the irqbalance. So, our host-networking team
is eager to have per-CPU stats for analyzing CPU usage related to
irqbalance or other issues.

Thanks,
- Haiyang
On Thu, 14 Mar 2024 18:54:31 +0000 Haiyang Zhang wrote:
> We understand irqbalance may be a "bad idea", and recommended some
> customers to disable it when having problems with it... But it's still
> enabled by default, and we cannot let all distro vendors and custom
> image makers to disable the irqbalance. So, our host-networking team
> is eager to have per-CPU stats for analyzing CPU usage related to
> irqbalance or other issues.

You need a use case to get the stats upstream.
"CPU usage related to irqbalance or other issues" is both too vague,
and irqbalance is a user space problem.
Per processor network stats have been supported on netvsc and Mellanox
NICs for years. They are also supported on Windows.

I routinely use these stats to rule in/out the issues with RSS imbalance
due to either NIC not handling RSS correctly, incorrect MSI-X affinity,
or not having enough entropy in flows.

And yes, I perfectly understand that there are cases that packets
received on processor x are processed (in tcp/ip stack) on processor y.
But in many cases, there is a correlation between individual processor
utilization and the processor where these packets are received on by
the NIC.

-thanks, ali

(some suggested that I do mention on this thread that I am one of the
original inventors of RSS. So there you have it. Personally I don't
think that telling people "I invented the damn thing and I know what I
am talking about" is the right way to handle this.)

-----Original Message-----
From: Jakub Kicinski <kuba@kernel.org>
Sent: Thursday, March 14, 2024 12:05 PM
To: Haiyang Zhang <haiyangz@microsoft.com>
Cc: Shradha Gupta <shradhagupta@linux.microsoft.com>; Shradha Gupta
<shradhagupta@microsoft.com>; linux-kernel@vger.kernel.org;
linux-hyperv@vger.kernel.org; linux-rdma@vger.kernel.org;
netdev@vger.kernel.org; Eric Dumazet <edumazet@google.com>; Paolo Abeni
<pabeni@redhat.com>; Ajay Sharma <sharmaajay@microsoft.com>; Leon
Romanovsky <leon@kernel.org>; Thomas Gleixner <tglx@linutronix.de>;
Sebastian Andrzej Siewior <bigeasy@linutronix.de>; KY Srinivasan
<kys@microsoft.com>; Wei Liu <wei.liu@kernel.org>; Dexuan Cui
<decui@microsoft.com>; Long Li <longli@microsoft.com>; Michael Kelley
<mikelley@microsoft.com>; Alireza Dabagh <alid@microsoft.com>; Paul
Rosswurm <paulros@microsoft.com>
Subject: [EXTERNAL] Re: [PATCH] net :mana : Add per-cpu stats for MANA device

On Thu, 14 Mar 2024 18:54:31 +0000 Haiyang Zhang wrote:
> We understand irqbalance may be a "bad idea", and recommended some
> customers to disable it when having problems with it... But it's still
> enabled by default, and we cannot let all distro vendors and custom
> image makers to disable the irqbalance. So, our host-networking team
> is eager to have per-CPU stats for analyzing CPU usage related to
> irqbalance or other issues.

You need a use case to get the stats upstream.
"CPU usage related to irqbalance or other issues" is both too vague, and
irqbalance is a user space problem.
Hi all,

I have a newer version of this patch with netlink implementation ready,
to support the usecase of per-cpu stats for RSS handling and imbalance
investigations. Please let me know if we can proceed with that.

Thanks and regards,
Shradha.

On Thu, Mar 14, 2024 at 08:01:24PM +0000, Alireza Dabagh wrote:
> Per processor network stats have been supported on netvsc and Mellanox
> NICs for years. They are also supported on Windows.
>
> I routinely use these stats to rule in/out the issues with RSS imbalance
> due to either NIC not handling RSS correctly, incorrect MSI-X affinity,
> or not having enough entropy in flows.
>
> And yes, I perfectly understand that there are cases that packets
> received on processor x are processed (in tcp/ip stack) on processor y.
> But in many cases, there is a correlation between individual processor
> utilization and the processor where these packets are received on by
> the NIC.
>
> -thanks, ali
> (some suggested that I do mention on this thread that I am one of the
> original inventors of RSS. So there you have it. Personally I don't
> think that telling people "I invented the damn thing and I know what I
> am talking about" is the right way to handle this.)
>
> -----Original Message-----
> From: Jakub Kicinski <kuba@kernel.org>
> Sent: Thursday, March 14, 2024 12:05 PM
> To: Haiyang Zhang <haiyangz@microsoft.com>
> Cc: Shradha Gupta <shradhagupta@linux.microsoft.com>; Shradha Gupta
> <shradhagupta@microsoft.com>; linux-kernel@vger.kernel.org;
> linux-hyperv@vger.kernel.org; linux-rdma@vger.kernel.org;
> netdev@vger.kernel.org; Eric Dumazet <edumazet@google.com>; Paolo Abeni
> <pabeni@redhat.com>; Ajay Sharma <sharmaajay@microsoft.com>; Leon
> Romanovsky <leon@kernel.org>; Thomas Gleixner <tglx@linutronix.de>;
> Sebastian Andrzej Siewior <bigeasy@linutronix.de>; KY Srinivasan
> <kys@microsoft.com>; Wei Liu <wei.liu@kernel.org>; Dexuan Cui
> <decui@microsoft.com>; Long Li <longli@microsoft.com>; Michael Kelley
> <mikelley@microsoft.com>; Alireza Dabagh <alid@microsoft.com>; Paul
> Rosswurm <paulros@microsoft.com>
> Subject: [EXTERNAL] Re: [PATCH] net :mana : Add per-cpu stats for MANA device
>
> On Thu, 14 Mar 2024 18:54:31 +0000 Haiyang Zhang wrote:
> > We understand irqbalance may be a "bad idea", and recommended some
> > customers to disable it when having problems with it... But it's still
> > enabled by default, and we cannot let all distro vendors and custom
> > image makers to disable the irqbalance. So, our host-networking team
> > is eager to have per-CPU stats for analyzing CPU usage related to
> > irqbalance or other issues.
>
> You need a use case to get the stats upstream.
> "CPU usage related to irqbalance or other issues" is both too vague, and
> irqbalance is a user space problem.
diff --git a/drivers/net/ethernet/microsoft/mana/mana_en.c b/drivers/net/ethernet/microsoft/mana/mana_en.c
index 59287c6e6cee..b27ee6684936 100644
--- a/drivers/net/ethernet/microsoft/mana/mana_en.c
+++ b/drivers/net/ethernet/microsoft/mana/mana_en.c
@@ -224,6 +224,7 @@ netdev_tx_t mana_start_xmit(struct sk_buff *skb, struct net_device *ndev)
 	int gso_hs = 0; /* zero for non-GSO pkts */
 	u16 txq_idx = skb_get_queue_mapping(skb);
 	struct gdma_dev *gd = apc->ac->gdma_dev;
+	struct mana_pcpu_stats *pcpu_stats;
 	bool ipv4 = false, ipv6 = false;
 	struct mana_tx_package pkg = {};
 	struct netdev_queue *net_txq;
@@ -234,6 +235,8 @@ netdev_tx_t mana_start_xmit(struct sk_buff *skb, struct net_device *ndev)
 	struct mana_cq *cq;
 	int err, len;
 
+	pcpu_stats = this_cpu_ptr(apc->pcpu_stats);
+
 	if (unlikely(!apc->port_is_up))
 		goto tx_drop;
 
@@ -412,6 +415,12 @@ netdev_tx_t mana_start_xmit(struct sk_buff *skb, struct net_device *ndev)
 	tx_stats->bytes += len;
 	u64_stats_update_end(&tx_stats->syncp);
 
+	/* Also update the per-CPU stats */
+	u64_stats_update_begin(&pcpu_stats->syncp);
+	pcpu_stats->tx_packets++;
+	pcpu_stats->tx_bytes += len;
+	u64_stats_update_end(&pcpu_stats->syncp);
+
 tx_busy:
 	if (netif_tx_queue_stopped(net_txq) && mana_can_tx(gdma_sq)) {
 		netif_tx_wake_queue(net_txq);
@@ -425,6 +434,9 @@ netdev_tx_t mana_start_xmit(struct sk_buff *skb, struct net_device *ndev)
 	kfree(pkg.sgl_ptr);
 tx_drop_count:
 	ndev->stats.tx_dropped++;
+	u64_stats_update_begin(&pcpu_stats->syncp);
+	pcpu_stats->tx_dropped++;
+	u64_stats_update_end(&pcpu_stats->syncp);
 tx_drop:
 	dev_kfree_skb_any(skb);
 	return NETDEV_TX_OK;
@@ -1505,6 +1517,8 @@ static void mana_rx_skb(void *buf_va, bool from_pool,
 	struct mana_stats_rx *rx_stats = &rxq->stats;
 	struct net_device *ndev = rxq->ndev;
 	uint pkt_len = cqe->ppi[0].pkt_len;
+	struct mana_pcpu_stats *pcpu_stats;
+	struct mana_port_context *apc;
 	u16 rxq_idx = rxq->rxq_idx;
 	struct napi_struct *napi;
 	struct xdp_buff xdp = {};
@@ -1512,6 +1526,9 @@ static void mana_rx_skb(void *buf_va, bool from_pool,
 	u32 hash_value;
 	u32 act;
 
+	apc = netdev_priv(ndev);
+	pcpu_stats = this_cpu_ptr(apc->pcpu_stats);
+
 	rxq->rx_cq.work_done++;
 	napi = &rxq->rx_cq.napi;
 
@@ -1570,6 +1587,11 @@ static void mana_rx_skb(void *buf_va, bool from_pool,
 		rx_stats->xdp_tx++;
 	u64_stats_update_end(&rx_stats->syncp);
 
+	u64_stats_update_begin(&pcpu_stats->syncp);
+	pcpu_stats->rx_packets++;
+	pcpu_stats->rx_bytes += pkt_len;
+	u64_stats_update_end(&pcpu_stats->syncp);
+
 	if (act == XDP_TX) {
 		skb_set_queue_mapping(skb, rxq_idx);
 		mana_xdp_tx(skb, ndev);
diff --git a/drivers/net/ethernet/microsoft/mana/mana_ethtool.c b/drivers/net/ethernet/microsoft/mana/mana_ethtool.c
index ab2413d71f6c..e3aa47ead601 100644
--- a/drivers/net/ethernet/microsoft/mana/mana_ethtool.c
+++ b/drivers/net/ethernet/microsoft/mana/mana_ethtool.c
@@ -83,8 +83,9 @@ static int mana_get_sset_count(struct net_device *ndev, int stringset)
 	if (stringset != ETH_SS_STATS)
 		return -EINVAL;
 
-	return ARRAY_SIZE(mana_eth_stats) + num_queues *
-		(MANA_STATS_RX_COUNT + MANA_STATS_TX_COUNT);
+	return ARRAY_SIZE(mana_eth_stats) +
+	       (num_queues * (MANA_STATS_RX_COUNT + MANA_STATS_TX_COUNT)) +
+	       (num_present_cpus() * (MANA_STATS_RX_PCPU + MANA_STATS_TX_PCPU));
 }
 
 static void mana_get_strings(struct net_device *ndev, u32 stringset, u8 *data)
@@ -139,6 +140,19 @@ static void mana_get_strings(struct net_device *ndev, u32 stringset, u8 *data)
 		sprintf(p, "tx_%d_mana_map_err", i);
 		p += ETH_GSTRING_LEN;
 	}
+
+	for (i = 0; i < num_present_cpus(); i++) {
+		sprintf(p, "cpu%d_rx_packets", i);
+		p += ETH_GSTRING_LEN;
+		sprintf(p, "cpu%d_rx_bytes", i);
+		p += ETH_GSTRING_LEN;
+		sprintf(p, "cpu%d_tx_packets", i);
+		p += ETH_GSTRING_LEN;
+		sprintf(p, "cpu%d_tx_bytes", i);
+		p += ETH_GSTRING_LEN;
+		sprintf(p, "cpu%d_tx_dropped", i);
+		p += ETH_GSTRING_LEN;
+	}
 }
 
 static void mana_get_ethtool_stats(struct net_device *ndev,
@@ -222,6 +236,28 @@ static void mana_get_ethtool_stats(struct net_device *ndev,
 		data[i++] = csum_partial;
 		data[i++] = mana_map_err;
 	}
+
+	for_each_possible_cpu(q) {
+		const struct mana_pcpu_stats *pcpu_stats =
+				per_cpu_ptr(apc->pcpu_stats, q);
+		u64 rx_packets, rx_bytes, tx_packets, tx_bytes, tx_dropped;
+		unsigned int start;
+
+		do {
+			start = u64_stats_fetch_begin(&pcpu_stats->syncp);
+			rx_packets = pcpu_stats->rx_packets;
+			tx_packets = pcpu_stats->tx_packets;
+			rx_bytes = pcpu_stats->rx_bytes;
+			tx_bytes = pcpu_stats->tx_bytes;
+			tx_dropped = pcpu_stats->tx_dropped;
+		} while (u64_stats_fetch_retry(&pcpu_stats->syncp, start));
+
+		data[i++] = rx_packets;
+		data[i++] = rx_bytes;
+		data[i++] = tx_packets;
+		data[i++] = tx_bytes;
+		data[i++] = tx_dropped;
+	}
 }
 
 static int mana_get_rxnfc(struct net_device *ndev, struct ethtool_rxnfc *cmd,
diff --git a/include/net/mana/mana.h b/include/net/mana/mana.h
index 76147feb0d10..9a2414ee7f02 100644
--- a/include/net/mana/mana.h
+++ b/include/net/mana/mana.h
@@ -51,6 +51,8 @@ enum TRI_STATE {
 /* Update this count whenever the respective structures are changed */
 #define MANA_STATS_RX_COUNT 5
 #define MANA_STATS_TX_COUNT 11
+#define MANA_STATS_RX_PCPU 2
+#define MANA_STATS_TX_PCPU 3
 
 struct mana_stats_rx {
 	u64 packets;
@@ -386,6 +388,15 @@ struct mana_ethtool_stats {
 	u64 rx_cqe_unknown_type;
 };
 
+struct mana_pcpu_stats {
+	u64 rx_packets;
+	u64 rx_bytes;
+	u64 tx_packets;
+	u64 tx_bytes;
+	u64 tx_dropped;
+	struct u64_stats_sync syncp;
+};
+
 struct mana_context {
 	struct gdma_dev *gdma_dev;
 
@@ -449,6 +460,7 @@ struct mana_port_context {
 	bool port_st_save; /* Saved port state */
 
 	struct mana_ethtool_stats eth_stats;
+	struct mana_pcpu_stats __percpu *pcpu_stats;
 };
 
 netdev_tx_t mana_start_xmit(struct sk_buff *skb, struct net_device *ndev);
Extend 'ethtool -S' output for mana devices to include per-CPU packet
stats

Built-on: Ubuntu22
Tested-on: Ubuntu22
Signed-off-by: Shradha Gupta <shradhagupta@linux.microsoft.com>
---
 drivers/net/ethernet/microsoft/mana/mana_en.c |  22 ++++++++++
 .../ethernet/microsoft/mana/mana_ethtool.c    |  40 ++++++++++++++++++-
 include/net/mana/mana.h                       |  12 ++++++
 3 files changed, 72 insertions(+), 2 deletions(-)