[3/3] net: mana: add a function to spread IRQs per CPUs

Message ID	20231217213214.1905481-4-yury.norov@gmail.com (mailing list archive)
State	Not Applicable
Delegated to:	Netdev Maintainers
Headers	show Received: from mail-io1-f43.google.com (mail-io1-f43.google.com [209.85.166.43]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id A2E4449F98; Sun, 17 Dec 2023 21:32:22 +0000 (UTC) From: Yury Norov <yury.norov@gmail.com> To: Souradeep Chakrabarti <schakrabarti@linux.microsoft.com>, kys@microsoft.com, haiyangz@microsoft.com, wei.liu@kernel.org, decui@microsoft.com, davem@davemloft.net, edumazet@google.com, kuba@kernel.org, pabeni@redhat.com, longli@microsoft.com, yury.norov@gmail.com, leon@kernel.org, cai.huoqing@linux.dev, ssengar@linux.microsoft.com, vkuznets@redhat.com, tglx@linutronix.de, linux-hyperv@vger.kernel.org, netdev@vger.kernel.org, linux-kernel@vger.kernel.org, linux-rdma@vger.kernel.org Cc: schakrabarti@microsoft.com, paulros@microsoft.com Subject: [PATCH 3/3] net: mana: add a function to spread IRQs per CPUs Date: Sun, 17 Dec 2023 13:32:14 -0800 Message-Id: <20231217213214.1905481-4-yury.norov@gmail.com> In-Reply-To: <20231217213214.1905481-1-yury.norov@gmail.com> References: <20231217213214.1905481-1-yury.norov@gmail.com> Precedence: bulk MIME-Version: 1.0 Content-Transfer-Encoding: 8bit
Series	net: mana: add irq_spread(): [0/3] net: mana: add irq_spread() [1/3] cpumask: add cpumask_weight_andnot() [2/3] cpumask: define cleanup function for cpumasks [3/3] net: mana: add a function to spread IRQs per CPUs

Context	Check	Description
netdev/series_format	success	Posting correctly formatted
netdev/tree_selection	success	Guessed tree name to be net-next
netdev/ynl	success	Generated files up to date; no warnings/errors; no diff in generated;
netdev/fixes_present	success	Fixes tag not required for -next series
netdev/header_inline	success	No static functions without inline keyword in header files
netdev/build_32bit	success	Errors and warnings before: 8 this patch: 8
netdev/cc_maintainers	warning	1 maintainers not CCed: kotaranov@microsoft.com
netdev/build_clang	success	Errors and warnings before: 1142 this patch: 1142
netdev/verify_signedoff	success	Signed-off-by tag matches author and committer
netdev/deprecated_api	success	None detected
netdev/check_selftest	success	No net selftest shell script
netdev/verify_fixes	success	No Fixes tag
netdev/build_allmodconfig_warn	success	Errors and warnings before: 1142 this patch: 1142
netdev/checkpatch	warning	WARNING: Co-developed-by and Signed-off-by: name/email do not match WARNING: Missing a blank line after declarations WARNING: externs should be avoided in .c files WARNING: function definition argument 'free_cpumask_var' should also have an identifier name WARNING: line length of 83 exceeds 80 columns WARNING: line length of 90 exceeds 80 columns WARNING: line length of 98 exceeds 80 columns
netdev/build_clang_rust	success	No Rust files in patch. Skipping build
netdev/kdoc	success	Errors and warnings before: 0 this patch: 0
netdev/source_inline	success	Was 0 now: 0

Yury Norov Dec. 17, 2023, 9:32 p.m. UTC

Souradeep found that the driver performs faster if IRQs are
spread across CPUs using the following heuristics:

1. No more than one IRQ per CPU, if possible;
2. NUMA locality is the second priority;
3. Sibling dislocality is the last priority.

Let's consider this topology:

Node            0               1
Core        0       1       2       3
CPU       0   1   2   3   4   5   6   7

The most performant IRQ distribution based on the above topology
and heuristics may look like this:

IRQ     Nodes   Cores   CPUs
0       0       0       0-1
1       0       1       2-3
2       0       0       0-1
3       0       1       2-3
4       1       2       4-5
5       1       3       6-7
6       1       2       4-5
7       1       3       6-7

The irq_setup() routine introduced in this patch leverages the
for_each_numa_hop_mask() iterator and assigns IRQs to sibling groups
as described above.
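
To make the ordering concrete, here is a minimal userspace sketch of the
spreading order on the example topology. It is not the kernel code:
sibling_mask(), hop_mask(), weight32() and spread_irqs() are hypothetical
stand-ins for topology_sibling_cpumask(), for_each_numa_hop_mask(),
cpumask_weight_andnot() and irq_setup(); CPU masks are plain 32-bit integers
and the two-node layout is hard-coded. The per-hop budget is decremented once
per assigned IRQ, as suggested later in review.

```c
#include <assert.h>
#include <stdint.h>

#define NR_CPUS 8u	/* toy topology: 2 nodes x 2 cores x 2 sibling CPUs */

/* Stand-in for topology_sibling_cpumask(): CPUs 2k and 2k+1 share a core. */
static uint32_t sibling_mask(unsigned int cpu)
{
	return 0x3u << (cpu & ~1u);
}

/* Stand-in for the NUMA hop masks: local CPUs first, then every CPU. */
static uint32_t hop_mask(int node, int hop)
{
	uint32_t local = (node == 0) ? 0x0Fu : 0xF0u;

	return (hop == 0) ? local : 0xFFu;
}

/* Number of set bits in a mask. */
static int weight32(uint32_t mask)
{
	int n = 0;

	for (; mask; mask &= mask - 1)
		n++;
	return n;
}

/*
 * Store in out[i] the sibling group chosen for IRQ i, walking hops from
 * the nearest to the farthest. Returns the number of IRQs placed.
 */
static unsigned int spread_irqs(uint32_t *out, unsigned int len, int node)
{
	uint32_t prev = 0;
	unsigned int placed = 0;

	assert(node == 0 || node == 1);

	for (int hop = 0; hop < 2; hop++) {
		uint32_t next = hop_mask(node, hop);
		int weight = weight32(next & ~prev);	/* CPUs new in this hop */

		while (weight > 0) {
			uint32_t cpus = next & ~prev;

			for (unsigned int cpu = 0; cpu < NR_CPUS; cpu++) {
				if (!(cpus & (1u << cpu)))
					continue;
				if (placed == len)
					return placed;
				out[placed++] = sibling_mask(cpu);
				cpus &= ~sibling_mask(cpu);	/* one IRQ per core per pass */
				--weight;
			}
		}
		prev = next;
	}
	return placed;
}
```

With 8 IRQs and node 0, the sketch yields the CPU groups 0-1, 2-3, 0-1, 2-3,
4-5, 6-7, 4-5, 6-7, the same order as the CPUs column in the table above.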

According to [1], for NUMA-aware but sibling-ignorant IRQ distribution
based on cpumask_local_spread(), the performance test results look like this:

./ntttcp -r -m 16
NTTTCP for Linux 1.4.0
---------------------------------------------------------
08:05:20 INFO: 17 threads created
08:05:28 INFO: Network activity progressing...
08:06:28 INFO: Test run completed.
08:06:28 INFO: Test cycle finished.
08:06:28 INFO: #####  Totals:  #####
08:06:28 INFO: test duration    :60.00 seconds
08:06:28 INFO: total bytes      :630292053310
08:06:28 INFO:   throughput     :84.04Gbps
08:06:28 INFO:   retrans segs   :4
08:06:28 INFO: cpu cores        :192
08:06:28 INFO:   cpu speed      :3799.725MHz
08:06:28 INFO:   user           :0.05%
08:06:28 INFO:   system         :1.60%
08:06:28 INFO:   idle           :96.41%
08:06:28 INFO:   iowait         :0.00%
08:06:28 INFO:   softirq        :1.94%
08:06:28 INFO:   cycles/byte    :2.50
08:06:28 INFO: cpu busy (all)   :534.41%

For NUMA- and sibling-aware IRQ distribution, the same test runs
more than 15% faster (98.93Gbps versus 84.04Gbps):

./ntttcp -r -m 16
NTTTCP for Linux 1.4.0
---------------------------------------------------------
08:08:51 INFO: 17 threads created
08:08:56 INFO: Network activity progressing...
08:09:56 INFO: Test run completed.
08:09:56 INFO: Test cycle finished.
08:09:56 INFO: #####  Totals:  #####
08:09:56 INFO: test duration    :60.00 seconds
08:09:56 INFO: total bytes      :741966608384
08:09:56 INFO:   throughput     :98.93Gbps
08:09:56 INFO:   retrans segs   :6
08:09:56 INFO: cpu cores        :192
08:09:56 INFO:   cpu speed      :3799.791MHz
08:09:56 INFO:   user           :0.06%
08:09:56 INFO:   system         :1.81%
08:09:56 INFO:   idle           :96.18%
08:09:56 INFO:   iowait         :0.00%
08:09:56 INFO:   softirq        :1.95%
08:09:56 INFO:   cycles/byte    :2.25
08:09:56 INFO: cpu busy (all)   :569.22%
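
The reported throughput can be cross-checked from the byte counts and the
60-second duration in the two logs. A small sketch, where gbps() and
gain_percent() are illustrative helper names and not part of ntttcp:

```c
#include <assert.h>

/* Throughput in Gbps from a byte count and a duration in seconds. */
static double gbps(double bytes, double seconds)
{
	assert(seconds > 0.0);
	return bytes * 8.0 / seconds / 1e9;
}

/* Relative change of `after` with respect to `before`, in percent. */
static double gain_percent(double before, double after)
{
	assert(before != 0.0);
	return (after - before) / before * 100.0;
}
```

gbps(630292053310, 60) and gbps(741966608384, 60) reproduce the logged
84.04Gbps and 98.93Gbps. Relative to the first run that is a throughput gain
of about 17.7%, while cycles/byte drops from 2.50 to 2.25, i.e. by 10%.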

[1] https://lore.kernel.org/all/20231211063726.GA4977@linuxonhyperv3.guj3yctzbm1etfxqx2vob5hsef.xx.internal.cloudapp.net/

Signed-off-by: Yury Norov <yury.norov@gmail.com>
Co-developed-by: Souradeep Chakrabarti <schakrabarti@linux.microsoft.com>
---
 .../net/ethernet/microsoft/mana/gdma_main.c   | 28 +++++++++++++++++++
 1 file changed, 28 insertions(+)

Jacob Keller Dec. 18, 2023, 9:17 p.m. UTC | #1

On 12/17/2023 1:32 PM, Yury Norov wrote:
> +static __maybe_unused int irq_setup(unsigned int *irqs, unsigned int len, int node)
> +{
> +	const struct cpumask *next, *prev = cpu_none_mask;
> +	cpumask_var_t cpus __free(free_cpumask_var);
> +	int cpu, weight;
> +
> +	if (!alloc_cpumask_var(&cpus, GFP_KERNEL))
> +		return -ENOMEM;
> +
> +	rcu_read_lock();
> +	for_each_numa_hop_mask(next, node) {
> +		weight = cpumask_weight_andnot(next, prev);
> +		while (weight-- > 0) {
> +			cpumask_andnot(cpus, next, prev);
> +			for_each_cpu(cpu, cpus) {
> +				if (len-- == 0)
> +					goto done;
> +				irq_set_affinity_and_hint(*irqs++, topology_sibling_cpumask(cpu));
> +				cpumask_andnot(cpus, cpus, topology_sibling_cpumask(cpu));
> +			}
> +		}
> +		prev = next;
> +	}
> +done:
> +	rcu_read_unlock();
> +	return 0;
> +}
> +

You're adding a function here, but it's not called and is even marked as
__maybe_unused?

>  static int mana_gd_setup_irqs(struct pci_dev *pdev)
>  {
>  	unsigned int max_queues_per_port = num_online_cpus();

Yury Norov Dec. 18, 2023, 9:42 p.m. UTC | #2

On Mon, Dec 18, 2023 at 01:17:53PM -0800, Jacob Keller wrote:
> You're adding a function here, but it's not called and is even marked as
> __maybe_unused?

I expect that Souradeep will build his driver improvement on top of
this function. The cpumask API is somewhat tricky to use properly here,
so this is an attempt to help him, instead of going back and forth on
review.

Sorry, I should have been more explicit.

Thanks,
Yury

Souradeep Chakrabarti Dec. 19, 2023, 7:14 a.m. UTC | #3

>Signed-off-by: Yury Norov <yury.norov@gmail.com>
>Co-developed-by: Souradeep Chakrabarti <schakrabarti@linux.microsoft.com>
Please also add Signed-off-by: Souradeep Chakrabarti <schakrabarti@linux.microsoft.com>

Souradeep Chakrabarti Dec. 19, 2023, 10:18 a.m. UTC | #4

>diff --git a/drivers/net/ethernet/microsoft/mana/gdma_main.c
>b/drivers/net/ethernet/microsoft/mana/gdma_main.c
>index 6367de0c2c2e..11e64e42e3b2 100644
>--- a/drivers/net/ethernet/microsoft/mana/gdma_main.c
>+++ b/drivers/net/ethernet/microsoft/mana/gdma_main.c
>@@ -1243,6 +1243,34 @@ void mana_gd_free_res_map(struct gdma_resource
>*r)
>        r->size = 0;
> }
>
>+static __maybe_unused int irq_setup(unsigned int *irqs, unsigned int len, int node)
>+{
>+       const struct cpumask *next, *prev = cpu_none_mask;
>+       cpumask_var_t cpus __free(free_cpumask_var);
>+       int cpu, weight;
>+
>+       if (!alloc_cpumask_var(&cpus, GFP_KERNEL))
>+               return -ENOMEM;
>+
>+       rcu_read_lock();
>+       for_each_numa_hop_mask(next, node) {
>+               weight = cpumask_weight_andnot(next, prev);
>+               while (weight-- > 0) {
Make it while (weight > 0) {
>+                       cpumask_andnot(cpus, next, prev);
>+                       for_each_cpu(cpu, cpus) {
>+                               if (len-- == 0)
>+                                       goto done;
>+                               irq_set_affinity_and_hint(*irqs++, topology_sibling_cpumask(cpu));
>+                               cpumask_andnot(cpus, cpus, topology_sibling_cpumask(cpu));
Do --weight here; otherwise this code will traverse the same node N^2 times, where each
node has N CPUs.
>+                       }
>+               }
>+               prev = next;
>+       }
>+done:
>+       rcu_read_unlock();
>+       return 0;
>+}
>+
> static int mana_gd_setup_irqs(struct pci_dev *pdev)
> {
>        unsigned int max_queues_per_port = num_online_cpus();
>--
>2.40.1
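
The effect of moving the decrement can be checked with a small userspace
model. This is a sketch, not the driver code: irqs_per_hop() and its helpers
are hypothetical, masks are 32-bit integers, and CPUs 2k and 2k+1 are assumed
to be siblings. It counts how many IRQs a single hop absorbs when the supply
of IRQs is unlimited.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed toy topology: CPUs 2k and 2k+1 share a core. */
static uint32_t sibling_mask(unsigned int cpu)
{
	return 0x3u << (cpu & ~1u);
}

/* Number of set bits in a mask. */
static int weight32(uint32_t mask)
{
	int n = 0;

	for (; mask; mask &= mask - 1)
		n++;
	return n;
}

/*
 * Count the IRQs assigned within one hop whose CPUs are given by `hop`.
 * per_irq == 0: weight is decremented once per pass, as in the posted loop.
 * per_irq != 0: weight is decremented once per assigned IRQ, as suggested.
 */
static unsigned int irqs_per_hop(uint32_t hop, int per_irq)
{
	int weight = weight32(hop);
	unsigned int assigned = 0;

	/* Posted form: while (weight-- > 0); suggested form: while (weight > 0) */
	while (per_irq ? (weight > 0) : (weight-- > 0)) {
		uint32_t cpus = hop;

		for (unsigned int cpu = 0; cpu < 32; cpu++) {
			if (!(cpus & (1u << cpu)))
				continue;
			assigned++;			/* one IRQ lands on this core */
			cpus &= ~sibling_mask(cpu);	/* skip its siblings in this pass */
			if (per_irq)
				--weight;
		}
	}
	return assigned;
}
```

For a hop with 4 CPUs (mask 0x0f) the loop as posted assigns 8 IRQs before
moving on to the next hop, while the per-IRQ decrement assigns 4, one per CPU.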

Yury Norov Dec. 19, 2023, 2:03 p.m. UTC | #5

On Tue, Dec 19, 2023 at 10:18:49AM +0000, Souradeep Chakrabarti wrote:
> >+               while (weight-- > 0) {
> Make it while (weight > 0) {
> >+                       cpumask_andnot(cpus, next, prev);
> >+                       for_each_cpu(cpu, cpus) {
> >+                               if (len-- == 0)
> >+                                       goto done;
> >+                               irq_set_affinity_and_hint(*irqs++, topology_sibling_cpumask(cpu));
> >+                               cpumask_andnot(cpus, cpus, topology_sibling_cpumask(cpu));
> Do --weight here; otherwise this code will traverse the same node N^2 times, where each
> node has N CPUs.

Sure.

When building your series on top of this, can you please fix it
in place?

Thanks,
Yury

