diff mbox series

[net-next,V4,1/3] sched/topology: Add NUMA-based CPUs spread API

Message ID 20220728191203.4055-2-tariqt@nvidia.com (mailing list archive)
State Changes Requested
Delegated to: Netdev Maintainers
Headers show
Series Introduce and use NUMA distance metrics | expand

Checks

Context Check Description
netdev/tree_selection success Clearly marked for net-next
netdev/fixes_present success Fixes tag not required for -next series
netdev/subject_prefix success Link
netdev/cover_letter success Series has a cover letter
netdev/patch_count success Link
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 16608 this patch: 16608
netdev/cc_maintainers warning 9 maintainers not CCed: rostedt@goodmis.org vschneid@redhat.com lukasz.luba@arm.com mgorman@suse.de song.bao.hua@hisilicon.com bsegall@google.com viresh.kumar@linaro.org dietmar.eggemann@arm.com bristot@redhat.com
netdev/build_clang success Errors and warnings before: 3021 this patch: 3021
netdev/module_param success Was 0 now: 0
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes success No Fixes tag
netdev/build_allmodconfig_warn success Errors and warnings before: 15865 this patch: 15865
netdev/checkpatch success total: 0 errors, 0 warnings, 0 checks, 74 lines checked
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/source_inline success Was 0 now: 0

Commit Message

Tariq Toukan July 28, 2022, 7:12 p.m. UTC
Implement and expose API that sets the spread of CPUs based on distance,
given a NUMA node.  Fallback to legacy logic that uses
cpumask_local_spread.

This logic can be used by device drivers to prefer some remote cpus over
others.

Reviewed-by: Gal Pressman <gal@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
---
 include/linux/sched/topology.h |  5 ++++
 kernel/sched/topology.c        | 49 ++++++++++++++++++++++++++++++++++
 2 files changed, 54 insertions(+)
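
For illustration only, a driver consuming this API could look roughly like
the sketch below. The device handle, vector count and bind_vector_to_cpu()
helper are made up here; they are not part of this series, which wires the
API into mlx5 in the follow-up patches.

	static int example_spread_irqs(struct device *dev, int nvec)
	{
		u16 *cpus;
		int i, err = 0;

		cpus = kcalloc(nvec, sizeof(*cpus), GFP_KERNEL);
		if (!cpus)
			return -ENOMEM;

		/* cpus[] now holds the nvec CPUs closest to the device's
		 * node, closest first (or the cpumask_local_spread()
		 * fallback ordering).
		 */
		sched_cpus_set_spread(dev_to_node(dev), cpus, nvec);

		for (i = 0; i < nvec && !err; i++)
			err = bind_vector_to_cpu(dev, i, cpus[i]); /* hypothetical */

		kfree(cpus);
		return err;
	}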

Comments

Tariq Toukan July 30, 2022, 5:29 p.m. UTC | #1
On 7/28/2022 10:12 PM, Tariq Toukan wrote:
> Implement and expose API that sets the spread of CPUs based on distance,
> given a NUMA node.  Fallback to legacy logic that uses
> cpumask_local_spread.
> 
> This logic can be used by device drivers to prefer some remote cpus over
> others.
> 
> Reviewed-by: Gal Pressman <gal@nvidia.com>
> Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
> ---
>   include/linux/sched/topology.h |  5 ++++
>   kernel/sched/topology.c        | 49 ++++++++++++++++++++++++++++++++++
>   2 files changed, 54 insertions(+)
> 

++

Dear SCHEDULER maintainers,

V3 of my series was submitted ~12 days ago and had significant changes.
I'd appreciate your review of this patch, so we can make it into the
upcoming kernel.

Regards,
Tariq
Tariq Toukan Aug. 2, 2022, 6:40 a.m. UTC | #2
On 7/30/2022 8:29 PM, Tariq Toukan wrote:
> 
> 
> On 7/28/2022 10:12 PM, Tariq Toukan wrote:
>> Implement and expose API that sets the spread of CPUs based on distance,
>> given a NUMA node.  Fallback to legacy logic that uses
>> cpumask_local_spread.
>>
>> This logic can be used by device drivers to prefer some remote cpus over
>> others.
>>
>> Reviewed-by: Gal Pressman <gal@nvidia.com>
>> Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
>> ---
>>   include/linux/sched/topology.h |  5 ++++
>>   kernel/sched/topology.c        | 49 ++++++++++++++++++++++++++++++++++
>>   2 files changed, 54 insertions(+)
>>
> 
> ++
> 
> Dear SCHEDULER maintainers,
> 
> V3 of my series was submitted ~12 days ago and had significant changes.
> I'd appreciate your review of this patch, so we can make it into the
> upcoming kernel.
> 
> Regards,
> Tariq

Hi,
Another reminder.
Do you have any comments on this patch?
If not, please provide your Ack.
Valentin Schneider Aug. 2, 2022, 9:38 a.m. UTC | #3
On 02/08/22 09:40, Tariq Toukan wrote:
> On 7/30/2022 8:29 PM, Tariq Toukan wrote:
>>
>>
>> On 7/28/2022 10:12 PM, Tariq Toukan wrote:
>>> Implement and expose API that sets the spread of CPUs based on distance,
>>> given a NUMA node.  Fallback to legacy logic that uses
>>> cpumask_local_spread.
>>>
>>> This logic can be used by device drivers to prefer some remote cpus over
>>> others.
>>>
>>> Reviewed-by: Gal Pressman <gal@nvidia.com>
>>> Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
>>> ---
>>>   include/linux/sched/topology.h |  5 ++++
>>>   kernel/sched/topology.c        | 49 ++++++++++++++++++++++++++++++++++
>>>   2 files changed, 54 insertions(+)
>>>
>>
>> ++
>>
>> Dear SCHEDULER maintainers,
>>
>> V3 of my series was submitted ~12 days ago and had significant changes.
>> I'd appreciate your review of this patch, so we can make it into the
>> upcoming kernel.
>>
>> Regards,
>> Tariq
>
> Hi,
> Another reminder.
> Do you have any comments on this patch?
> If not, please provide your Ack.

It's not even been a week since you submitted v4 (and ~3 days since you
last pinged this thread), and not all of us are limitless reviewing
machines :-)

This is already in my todo-list, but isn't the topmost item yet.
Jakub Kicinski Aug. 2, 2022, 4:05 p.m. UTC | #4
On Tue, 02 Aug 2022 10:38:08 +0100 Valentin Schneider wrote:
> It's not even been a week since you submitted v4 (and ~3 days since you
> last pinged this thread), and not all of us are limitless reviewing
> machines :-)
> 
> This is already in my todo-list, but isn't the topmost item yet.

I'd appreciate a review on this one soonish (as a favor, don't read this
as a passive-aggressive reprimand).

Tariq got a rather prompt review on a trivial export patch, which put all
the logic in the driver instead. I asked them to go the extra
mile and move the code to the core so other drivers can benefit.

If this doesn't get into 6.0 it'll be a clear sign for driver
maintainers that building shared infrastructure is arduous and should
be avoided at all cost :(
Valentin Schneider Aug. 4, 2022, 5:28 p.m. UTC | #5
On 28/07/22 22:12, Tariq Toukan wrote:
> Implement and expose API that sets the spread of CPUs based on distance,
> given a NUMA node.  Fallback to legacy logic that uses
> cpumask_local_spread.
>
> This logic can be used by device drivers to prefer some remote cpus over
> others.
>
> Reviewed-by: Gal Pressman <gal@nvidia.com>
> Signed-off-by: Tariq Toukan <tariqt@nvidia.com>

IIUC you want a multi-CPU version of sched_numa_find_closest(). I'm OK with
the need (and you have the numbers to back it up), but I have some qualms
with the implementation, see more below.

> ---
>  include/linux/sched/topology.h |  5 ++++
>  kernel/sched/topology.c        | 49 ++++++++++++++++++++++++++++++++++
>  2 files changed, 54 insertions(+)
>
> diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
> index 56cffe42abbc..a49167c2a0e5 100644
> --- a/include/linux/sched/topology.h
> +++ b/include/linux/sched/topology.h
> @@ -210,6 +210,7 @@ extern void set_sched_topology(struct sched_domain_topology_level *tl);
>  # define SD_INIT_NAME(type)
>  #endif
>
> +void sched_cpus_set_spread(int node, u16 *cpus, int ncpus);
>  #else /* CONFIG_SMP */
>
>  struct sched_domain_attr;
> @@ -231,6 +232,10 @@ static inline bool cpus_share_cache(int this_cpu, int that_cpu)
>       return true;
>  }
>
> +static inline void sched_cpus_set_spread(int node, u16 *cpus, int ncpus)
> +{
> +	memset(cpus, 0, ncpus * sizeof(*cpus));
> +}
>  #endif	/* !CONFIG_SMP */
>
>  #if defined(CONFIG_ENERGY_MODEL) && defined(CONFIG_CPU_FREQ_GOV_SCHEDUTIL)
> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> index 05b6c2ad90b9..157aef862c04 100644
> --- a/kernel/sched/topology.c
> +++ b/kernel/sched/topology.c
> @@ -2067,8 +2067,57 @@ int sched_numa_find_closest(const struct cpumask *cpus, int cpu)
>       return found;
>  }
>
> +static bool sched_cpus_spread_by_distance(int node, u16 *cpus, int ncpus)
                                                       ^^^^^^^^^
That should be a struct cpumask *.

> +{
> +	cpumask_var_t cpumask;
> +	int first, i;
> +
> +	if (!zalloc_cpumask_var(&cpumask, GFP_KERNEL))
> +		return false;
> +
> +	cpumask_copy(cpumask, cpu_online_mask);
> +
> +	first = cpumask_first(cpumask_of_node(node));
> +
> +	for (i = 0; i < ncpus; i++) {
> +		int cpu;
> +
> +		cpu = sched_numa_find_closest(cpumask, first);
> +		if (cpu >= nr_cpu_ids) {
> +			free_cpumask_var(cpumask);
> +			return false;
> +		}
> +		cpus[i] = cpu;
> +		__cpumask_clear_cpu(cpu, cpumask);
> +	}
> +
> +	free_cpumask_var(cpumask);
> +	return true;
> +}

This will fail if ncpus > nr_cpu_ids, which shouldn't be a problem. It
would make more sense to set *up to* ncpus, the calling code can then
decide if getting fewer than requested is OK or not.

I also don't get the fallback to cpumask_local_spread(), is that if the
NUMA topology hasn't been initialized yet? It feels like most users of this
would invoke it late enough (i.e. anything after early initcalls) to have
the backing data available.

Finally, I think iterating only once per NUMA level would make more sense.

I've scribbled something together from those thoughts, see below. This has
just the mlx5 bits touched to show what I mean, but that's just compile
tested.
---
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eq.c b/drivers/net/ethernet/mellanox/mlx5/core/eq.c
index 229728c80233..2d010d8d670c 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/eq.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/eq.c
@@ -810,7 +810,7 @@ static int comp_irqs_request(struct mlx5_core_dev *dev)
 {
 	struct mlx5_eq_table *table = dev->priv.eq_table;
 	int ncomp_eqs = table->num_comp_eqs;
-	u16 *cpus;
+	cpumask_var_t cpus;
 	int ret;
 	int i;
 
@@ -825,15 +825,14 @@ static int comp_irqs_request(struct mlx5_core_dev *dev)
 		return ret;
 	}
 
-	cpus = kcalloc(ncomp_eqs, sizeof(*cpus), GFP_KERNEL);
-	if (!cpus) {
+	if (!zalloc_cpumask_var(&cpus, GFP_KERNEL)) {
 		ret = -ENOMEM;
 		goto free_irqs;
 	}
-	for (i = 0; i < ncomp_eqs; i++)
-		cpus[i] = cpumask_local_spread(i, dev->priv.numa_node);
+
+	sched_numa_find_n_closest(cpus, dev->priv.numa_node, ncomp_eqs);
 	ret = mlx5_irqs_request_vectors(dev, cpus, ncomp_eqs, table->comp_irqs);
-	kfree(cpus);
+	free_cpumask_var(cpus);
 	if (ret < 0)
 		goto free_irqs;
 	return ret;
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/pci_irq.c b/drivers/net/ethernet/mellanox/mlx5/core/pci_irq.c
index 662f1d55e30e..2330f81aeede 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/pci_irq.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/pci_irq.c
@@ -448,7 +448,7 @@ void mlx5_irqs_release_vectors(struct mlx5_irq **irqs, int nirqs)
 /**
  * mlx5_irqs_request_vectors - request one or more IRQs for mlx5 device.
  * @dev: mlx5 device that is requesting the IRQs.
- * @cpus: CPUs array for binding the IRQs
+ * @cpus: cpumask for binding the IRQs
  * @nirqs: number of IRQs to request.
  * @irqs: an output array of IRQs pointers.
  *
@@ -458,25 +458,22 @@ void mlx5_irqs_release_vectors(struct mlx5_irq **irqs, int nirqs)
  * This function returns the number of IRQs requested, (which might be smaller than
  * @nirqs), if successful, or a negative error code in case of an error.
  */
-int mlx5_irqs_request_vectors(struct mlx5_core_dev *dev, u16 *cpus, int nirqs,
+int mlx5_irqs_request_vectors(struct mlx5_core_dev *dev,
+			      const struct cpumask *cpus,
+			      int nirqs,
 			      struct mlx5_irq **irqs)
 {
-	cpumask_var_t req_mask;
+	int cpu = cpumask_first(cpus);
 	struct mlx5_irq *irq;
-	int i;
 
-	if (!zalloc_cpumask_var(&req_mask, GFP_KERNEL))
-		return -ENOMEM;
-	for (i = 0; i < nirqs; i++) {
-		cpumask_set_cpu(cpus[i], req_mask);
-		irq = mlx5_irq_request(dev, i, req_mask);
+	for (i = 0; i < nirqs && cpu < nr_cpu_ids; i++) {
+		irq = mlx5_irq_request(dev, i, cpumask_of(cpu));
 		if (IS_ERR(irq))
 			break;
-		cpumask_clear(req_mask);
 		irqs[i] = irq;
+		cpu = cpumask_next(cpu, cpus);
 	}
 
-	free_cpumask_var(req_mask);
 	return i ? i : PTR_ERR(irq);
 }
 
diff --git a/include/linux/topology.h b/include/linux/topology.h
index 4564faafd0e1..bdc9c5df84cd 100644
--- a/include/linux/topology.h
+++ b/include/linux/topology.h
@@ -245,5 +245,13 @@ static inline const struct cpumask *cpu_cpu_mask(int cpu)
 	return cpumask_of_node(cpu_to_node(cpu));
 }
 
+#ifdef CONFIG_NUMA
+extern int sched_numa_find_n_closest(struct cpumask *cpus, int node, int ncpus);
+#else
+static inline int sched_numa_find_n_closest(struct cpumask *cpus, int node, int ncpus)
+{
+	return -ENOTSUPP;
+}
+#endif
 
 #endif /* _LINUX_TOPOLOGY_H */
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 8739c2a5a54e..499f6ef611fa 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -2067,6 +2067,56 @@ int sched_numa_find_closest(const struct cpumask *cpus, int cpu)
 	return found;
 }
 
+/**
+ * sched_numa_find_n_closest - Find the 'n' closest cpus to a given node
+ * @cpus: The cpumask to fill in with CPUs
+ * @ncpus: How many CPUs to look for
+ * @node: The node to start the search from
+ *
+ * This will fill *up to* @ncpus in @cpus, using the closest (in NUMA distance)
+ * first and expanding outside the node if more CPUs are required.
+ *
+ * Return: Number of found CPUs, negative value on error.
+ */
+int sched_numa_find_n_closest(struct cpumask *cpus, int node, int ncpus)
+{
+	struct cpumask ***masks;
+	int cpu, lvl, ntofind = ncpus;
+
+	if (node >= nr_node_ids)
+		return -EINVAL;
+
+	rcu_read_lock();
+
+	masks = rcu_dereference(sched_domains_numa_masks);
+	if (!masks)
+		goto unlock;
+
+	/*
+	 * Walk up the level masks; the first mask should be CPUs LOCAL_DISTANCE
+	 * away (aka the local node), and we incrementally grow the search
+	 * beyond that.
+	 */
+	for (lvl = 0; lvl < sched_domains_numa_levels; lvl++) {
+		if (!masks[lvl][node])
+			goto unlock;
+
+		/* XXX: could be neater with for_each_cpu_andnot() */
+		for_each_cpu(cpu, masks[lvl][node]) {
+			if (cpumask_test_cpu(cpu, cpus))
+				continue;
+
+			__cpumask_set_cpu(cpu, cpus);
+			if (--ntofind == 0)
+				goto unlock;
+		}
+	}
+unlock:
+	rcu_read_unlock();
+	return ncpus - ntofind;
+}
+EXPORT_SYMBOL_GPL(sched_numa_find_n_closest);
+
 #endif /* CONFIG_NUMA */
 
 static int __sdt_alloc(const struct cpumask *cpu_map)
Tariq Toukan Aug. 8, 2022, 2:39 p.m. UTC | #6
On 8/4/2022 8:28 PM, Valentin Schneider wrote:
> On 28/07/22 22:12, Tariq Toukan wrote:
>> Implement and expose API that sets the spread of CPUs based on distance,
>> given a NUMA node.  Fallback to legacy logic that uses
>> cpumask_local_spread.
>>
>> This logic can be used by device drivers to prefer some remote cpus over
>> others.
>>
>> Reviewed-by: Gal Pressman <gal@nvidia.com>
>> Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
> 
> IIUC you want a multi-CPU version of sched_numa_find_closest(). I'm OK with
> the need (and you have the numbers to back it up), but I have some qualms
> with the implementation, see more below.
> 

I want a sorted multi-CPU version.

>> ---
>>   include/linux/sched/topology.h |  5 ++++
>>   kernel/sched/topology.c        | 49 ++++++++++++++++++++++++++++++++++
>>   2 files changed, 54 insertions(+)
>>
>> diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
>> index 56cffe42abbc..a49167c2a0e5 100644
>> --- a/include/linux/sched/topology.h
>> +++ b/include/linux/sched/topology.h
>> @@ -210,6 +210,7 @@ extern void set_sched_topology(struct sched_domain_topology_level *tl);
>>   # define SD_INIT_NAME(type)
>>   #endif
>>
>> +void sched_cpus_set_spread(int node, u16 *cpus, int ncpus);
>>   #else /* CONFIG_SMP */
>>
>>   struct sched_domain_attr;
>> @@ -231,6 +232,10 @@ static inline bool cpus_share_cache(int this_cpu, int that_cpu)
>>        return true;
>>   }
>>
>> +static inline void sched_cpus_set_spread(int node, u16 *cpus, int ncpus)
>> +{
>> +	memset(cpus, 0, ncpus * sizeof(*cpus));
>> +}
>>   #endif	/* !CONFIG_SMP */
>>
>>   #if defined(CONFIG_ENERGY_MODEL) && defined(CONFIG_CPU_FREQ_GOV_SCHEDUTIL)
>> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
>> index 05b6c2ad90b9..157aef862c04 100644
>> --- a/kernel/sched/topology.c
>> +++ b/kernel/sched/topology.c
>> @@ -2067,8 +2067,57 @@ int sched_numa_find_closest(const struct cpumask *cpus, int cpu)
>>        return found;
>>   }
>>
>> +static bool sched_cpus_spread_by_distance(int node, u16 *cpus, int ncpus)
>                                                         ^^^^^^^^^
> That should be a struct cpumask *.

With cpumask, we'll lose the order.

> 
>> +{
>> +	cpumask_var_t cpumask;
>> +	int first, i;
>> +
>> +	if (!zalloc_cpumask_var(&cpumask, GFP_KERNEL))
>> +		return false;
>> +
>> +	cpumask_copy(cpumask, cpu_online_mask);
>> +
>> +	first = cpumask_first(cpumask_of_node(node));
>> +
>> +	for (i = 0; i < ncpus; i++) {
>> +		int cpu;
>> +
>> +		cpu = sched_numa_find_closest(cpumask, first);
>> +		if (cpu >= nr_cpu_ids) {
>> +			free_cpumask_var(cpumask);
>> +			return false;
>> +		}
>> +		cpus[i] = cpu;
>> +		__cpumask_clear_cpu(cpu, cpumask);
>> +	}
>> +
>> +	free_cpumask_var(cpumask);
>> +	return true;
>> +}
> 
> This will fail if ncpus > nr_cpu_ids, which shouldn't be a problem. It
> would make more sense to set *up to* ncpus, the calling code can then
> decide if getting fewer than requested is OK or not.
> 
> I also don't get the fallback to cpumask_local_spread(), is that if the
> NUMA topology hasn't been initialized yet? It feels like most users of this
> would invoke it late enough (i.e. anything after early initcalls) to have
> the backing data available.

I don't expect this to fail, as we invoke it late enough. Fallback is 
there just in case, to preserve the old behavior instead of getting 
totally broken.

> 
> Finally, I think iterating only once per NUMA level would make more sense.

Agree, although it's just a setup stage.
I'll check if it can work for me, based on the reference code below.

> 
> I've scribbled something together from those thoughts, see below. This has
> just the mlx5 bits touched to show what I mean, but that's just compile
> tested.

My function returns a *sorted* list of the N closest cpus.
That is important. In many cases, drivers do not need all N irqs, but 
only a portion of them, so they want to use the closest subset of cpus.

IIUC, the code below relaxes this and returns the set of N closest cpus, 
unsorted.


> ---
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eq.c b/drivers/net/ethernet/mellanox/mlx5/core/eq.c
> index 229728c80233..2d010d8d670c 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/eq.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/eq.c
> @@ -810,7 +810,7 @@ static int comp_irqs_request(struct mlx5_core_dev *dev)
>   {
>   	struct mlx5_eq_table *table = dev->priv.eq_table;
>   	int ncomp_eqs = table->num_comp_eqs;
> -	u16 *cpus;
> +	cpumask_var_t cpus;
>   	int ret;
>   	int i;
>   
> @@ -825,15 +825,14 @@ static int comp_irqs_request(struct mlx5_core_dev *dev)
>   		return ret;
>   	}
>   
> -	cpus = kcalloc(ncomp_eqs, sizeof(*cpus), GFP_KERNEL);
> -	if (!cpus) {
> +	if (!zalloc_cpumask_var(&cpus, GFP_KERNEL)) {
>   		ret = -ENOMEM;
>   		goto free_irqs;
>   	}
> -	for (i = 0; i < ncomp_eqs; i++)
> -		cpus[i] = cpumask_local_spread(i, dev->priv.numa_node);
> +
> +	sched_numa_find_n_closest(cpus, dev->priv.numa_node, ncomp_eqs);
>   	ret = mlx5_irqs_request_vectors(dev, cpus, ncomp_eqs, table->comp_irqs);
> -	kfree(cpus);
> +	free_cpumask_var(cpus);
>   	if (ret < 0)
>   		goto free_irqs;
>   	return ret;
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/pci_irq.c b/drivers/net/ethernet/mellanox/mlx5/core/pci_irq.c
> index 662f1d55e30e..2330f81aeede 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/pci_irq.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/pci_irq.c
> @@ -448,7 +448,7 @@ void mlx5_irqs_release_vectors(struct mlx5_irq **irqs, int nirqs)
>   /**
>    * mlx5_irqs_request_vectors - request one or more IRQs for mlx5 device.
>    * @dev: mlx5 device that is requesting the IRQs.
> - * @cpus: CPUs array for binding the IRQs
> + * @cpus: cpumask for binding the IRQs
>    * @nirqs: number of IRQs to request.
>    * @irqs: an output array of IRQs pointers.
>    *
> @@ -458,25 +458,22 @@ void mlx5_irqs_release_vectors(struct mlx5_irq **irqs, int nirqs)
>    * This function returns the number of IRQs requested, (which might be smaller than
>    * @nirqs), if successful, or a negative error code in case of an error.
>    */
> -int mlx5_irqs_request_vectors(struct mlx5_core_dev *dev, u16 *cpus, int nirqs,
> +int mlx5_irqs_request_vectors(struct mlx5_core_dev *dev,
> +			      const struct cpumask *cpus,
> +			      int nirqs,
>   			      struct mlx5_irq **irqs)
>   {
> -	cpumask_var_t req_mask;
> +	int cpu = cpumask_first(cpus);
>   	struct mlx5_irq *irq;
> -	int i;
>   
> -	if (!zalloc_cpumask_var(&req_mask, GFP_KERNEL))
> -		return -ENOMEM;
> -	for (i = 0; i < nirqs; i++) {
> -		cpumask_set_cpu(cpus[i], req_mask);
> -		irq = mlx5_irq_request(dev, i, req_mask);
> +	for (i = 0; i < nirqs && cpu < nr_cpu_ids; i++) {
> +		irq = mlx5_irq_request(dev, i, cpumask_of(cpu));
>   		if (IS_ERR(irq))
>   			break;
> -		cpumask_clear(req_mask);
>   		irqs[i] = irq;
> +		cpu = cpumask_next(cpu, cpus);
>   	}
>   
> -	free_cpumask_var(req_mask);
>   	return i ? i : PTR_ERR(irq);
>   }
>   
> diff --git a/include/linux/topology.h b/include/linux/topology.h
> index 4564faafd0e1..bdc9c5df84cd 100644
> --- a/include/linux/topology.h
> +++ b/include/linux/topology.h
> @@ -245,5 +245,13 @@ static inline const struct cpumask *cpu_cpu_mask(int cpu)
>   	return cpumask_of_node(cpu_to_node(cpu));
>   }
>   
> +#ifdef CONFIG_NUMA
> +extern int sched_numa_find_n_closest(struct cpumask *cpus, int node, int ncpus);
> +#else
> +static inline int sched_numa_find_n_closest(struct cpumask *cpus, int node, int ncpus)
> +{
> +	return -ENOTSUPP;
> +}
> +#endif
>   
>   #endif /* _LINUX_TOPOLOGY_H */
> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> index 8739c2a5a54e..499f6ef611fa 100644
> --- a/kernel/sched/topology.c
> +++ b/kernel/sched/topology.c
> @@ -2067,6 +2067,56 @@ int sched_numa_find_closest(const struct cpumask *cpus, int cpu)
>   	return found;
>   }
>   
> +/**
> + * sched_numa_find_n_closest - Find the 'n' closest cpus to a given node
> + * @cpus: The cpumask to fill in with CPUs
> + * @ncpus: How many CPUs to look for
> + * @node: The node to start the search from
> + *
> + * This will fill *up to* @ncpus in @cpus, using the closest (in NUMA distance)
> + * first and expanding outside the node if more CPUs are required.
> + *
> + * Return: Number of found CPUs, negative value on error.
> + */
> +int sched_numa_find_n_closest(struct cpumask *cpus, int node, int ncpus)
> +{
> +	struct cpumask ***masks;
> +	int cpu, lvl, ntofind = ncpus;
> +
> +	if (node >= nr_node_ids)
> +		return -EINVAL;
> +
> +	rcu_read_lock();
> +
> +	masks = rcu_dereference(sched_domains_numa_masks);
> +	if (!masks)
> +		goto unlock;
> +
> +	/*
> +	 * Walk up the level masks; the first mask should be CPUs LOCAL_DISTANCE
> +	 * away (aka the local node), and we incrementally grow the search
> +	 * beyond that.
> +	 */
> +	for (lvl = 0; lvl < sched_domains_numa_levels; lvl++) {
> +		if (!masks[lvl][node])
> +			goto unlock;
> +
> +		/* XXX: could be neater with for_each_cpu_andnot() */
> +		for_each_cpu(cpu, masks[lvl][node]) {
> +			if (cpumask_test_cpu(cpu, cpus))
> +				continue;
> +
> +			__cpumask_set_cpu(cpu, cpus);
> +			if (--ntofind == 0)
> +				goto unlock;
> +		}
> +	}
> +unlock:
> +	rcu_read_unlock();
> +	return ncpus - ntofind;
> +}
> +EXPORT_SYMBOL_GPL(sched_numa_find_n_closest);
> +
>   #endif /* CONFIG_NUMA */
>   
>   static int __sdt_alloc(const struct cpumask *cpu_map)
>
Valentin Schneider Aug. 9, 2022, 10:02 a.m. UTC | #7
On 08/08/22 17:39, Tariq Toukan wrote:
> On 8/4/2022 8:28 PM, Valentin Schneider wrote:
>> On 28/07/22 22:12, Tariq Toukan wrote:
>>> +static bool sched_cpus_spread_by_distance(int node, u16 *cpus, int ncpus)
>>                                                         ^^^^^^^^^
>> That should be a struct cpumask *.
>
> With cpumask, we'll lose the order.
>

Right, I didn't get that from the changelog.

>>
>>> +{
>>> +	cpumask_var_t cpumask;
>>> +	int first, i;
>>> +
>>> +	if (!zalloc_cpumask_var(&cpumask, GFP_KERNEL))
>>> +		return false;
>>> +
>>> +	cpumask_copy(cpumask, cpu_online_mask);
>>> +
>>> +	first = cpumask_first(cpumask_of_node(node));
>>> +
>>> +	for (i = 0; i < ncpus; i++) {
>>> +		int cpu;
>>> +
>>> +		cpu = sched_numa_find_closest(cpumask, first);
>>> +		if (cpu >= nr_cpu_ids) {
>>> +			free_cpumask_var(cpumask);
>>> +			return false;
>>> +		}
>>> +		cpus[i] = cpu;
>>> +		__cpumask_clear_cpu(cpu, cpumask);
>>> +	}
>>> +
>>> +	free_cpumask_var(cpumask);
>>> +	return true;
>>> +}
>>
>> This will fail if ncpus > nr_cpu_ids, which shouldn't be a problem. It
>> would make more sense to set *up to* ncpus, the calling code can then
>> decide if getting fewer than requested is OK or not.
>>
>> I also don't get the fallback to cpumask_local_spread(), is that if the
>> NUMA topology hasn't been initialized yet? It feels like most users of this
>> would invoke it late enough (i.e. anything after early initcalls) to have
>> the backing data available.
>
> I don't expect this to fail, as we invoke it late enough. Fallback is
> there just in case, to preserve the old behavior instead of getting
> totally broken.
>

Then there shouldn't be a fallback method - the main method is expected to
work.

>>
>> Finally, I think iterating only once per NUMA level would make more sense.
>
> Agree, although it's just a setup stage.
> I'll check if it can work for me, based on the reference code below.
>
>>
>> I've scribbled something together from those thoughts, see below. This has
>> just the mlx5 bits touched to show what I mean, but that's just compile
>> tested.
>
> My function returns a *sorted* list of the N closest cpus.
> That is important. In many cases, drivers do not need all N irqs, but
> only a portion of them, so they want to use the closest subset of cpus.
>

Are there cases where we can't figure this out in advance? From what I grok
out of the two callsites you patched, all vectors will be used unless some
error happens, so compressing the CPUs in a single cpumask seemed
sufficient.
Tariq Toukan Aug. 9, 2022, 10:18 a.m. UTC | #8
On 8/9/2022 1:02 PM, Valentin Schneider wrote:
> On 08/08/22 17:39, Tariq Toukan wrote:
>> On 8/4/2022 8:28 PM, Valentin Schneider wrote:
>>> On 28/07/22 22:12, Tariq Toukan wrote:
>>>> +static bool sched_cpus_spread_by_distance(int node, u16 *cpus, int ncpus)
>>>                                                          ^^^^^^^^^
>>> That should be a struct cpumask *.
>>
>> With cpumask, we'll lose the order.
>>
> 
> Right, I didn't get that from the changelog.

I'll make it clear when re-spinned.

> 
>>>
>>>> +{
>>>> +	cpumask_var_t cpumask;
>>>> +	int first, i;
>>>> +
>>>> +	if (!zalloc_cpumask_var(&cpumask, GFP_KERNEL))
>>>> +		return false;
>>>> +
>>>> +	cpumask_copy(cpumask, cpu_online_mask);
>>>> +
>>>> +	first = cpumask_first(cpumask_of_node(node));
>>>> +
>>>> +	for (i = 0; i < ncpus; i++) {
>>>> +		int cpu;
>>>> +
>>>> +		cpu = sched_numa_find_closest(cpumask, first);
>>>> +		if (cpu >= nr_cpu_ids) {
>>>> +			free_cpumask_var(cpumask);
>>>> +			return false;
>>>> +		}
>>>> +		cpus[i] = cpu;
>>>> +		__cpumask_clear_cpu(cpu, cpumask);
>>>> +	}
>>>> +
>>>> +	free_cpumask_var(cpumask);
>>>> +	return true;
>>>> +}
>>>
>>> This will fail if ncpus > nr_cpu_ids, which shouldn't be a problem. It
>>> would make more sense to set *up to* ncpus, the calling code can then
>>> decide if getting fewer than requested is OK or not.
>>>
>>> I also don't get the fallback to cpumask_local_spread(), is that if the
>>> NUMA topology hasn't been initialized yet? It feels like most users of this
>>> would invoke it late enough (i.e. anything after early initcalls) to have
>>> the backing data available.
>>
>> I don't expect this to fail, as we invoke it late enough. Fallback is
>> there just in case, to preserve the old behavior instead of getting
>> totally broken.
>>
> 
> Then there shouldn't be a fallback method - the main method is expected to
> work.
> 

I'll drop the fallback then.

>>>
>>> Finally, I think iterating only once per NUMA level would make more sense.
>>
>> Agree, although it's just a setup stage.
>> I'll check if it can work for me, based on the reference code below.
>>
>>>
>>> I've scribbled something together from those thoughts, see below. This has
>>> just the mlx5 bits touched to show what I mean, but that's just compile
>>> tested.
>>
>> My function returns a *sorted* list of the N closest cpus.
>> That is important. In many cases, drivers do not need all N irqs, but
>> only a portion of them, so they want to use the closest subset of cpus.
>>
> 
> Are there cases where we can't figure this out in advance? From what I grok
> out of the two callsites you patched, all vectors will be used unless some
> error happens, so compressing the CPUs in a single cpumask seemed
> sufficient.
> 

All vectors will be initialized to support the maximum number of traffic 
rings. However, the actual number of traffic rings can be controlled and 
set to a lower number N_actual < N. In this case, we'll be using only 
N_actual instances and we want them to be the first/closest.
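
To put numbers on that (made up for illustration): on a two-node system with
CPUs 0-7 on node 0 and CPUs 8-15 on node 1, a device on node 1 asking for
N = 12 vectors would get the sorted array

	cpus[] = { 8, 9, 10, 11, 12, 13, 14, 15, 0, 1, 2, 3 }

so a later reconfiguration to N_actual = 4 rings keeps cpus[0..3], i.e. the
local CPUs 8-11. A single cpumask holding the same 12 CPUs iterates in CPU
id order, so taking its first four bits would select the remote CPUs 0-3
instead.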
Valentin Schneider Aug. 9, 2022, 12:52 p.m. UTC | #9
On 09/08/22 13:18, Tariq Toukan wrote:
> On 8/9/2022 1:02 PM, Valentin Schneider wrote:
>>
>> Are there cases where we can't figure this out in advance? From what I grok
>> out of the two callsites you patched, all vectors will be used unless some
>> error happens, so compressing the CPUs in a single cpumask seemed
>> sufficient.
>>
>
> All vectors will be initialized to support the maximum number of traffic
> rings. However, the actual number of traffic rings can be controlled and
> set to a lower number N_actual < N. In this case, we'll be using only
> N_actual instances and we want them to be the first/closest.

Ok, that makes sense, thank you.

In that case I wonder if we'd want a public-facing iterator for
sched_domains_numa_masks[%i][node], rather than copy a portion of
it. Something like the below (naming and implementation haven't been
thought about too much).

  const struct cpumask *sched_numa_level_mask(int node, int level)
  {
          struct cpumask ***masks = rcu_dereference(sched_domains_numa_masks);

          if (node >= nr_node_ids || level >= sched_domains_numa_levels)
                  return NULL;

          if (!masks)
                  return NULL;

          return masks[level][node];
  }
  EXPORT_SYMBOL_GPL(sched_numa_level_mask);

  #define for_each_numa_level_mask(node, lvl, mask)	    \
          for (mask = sched_numa_level_mask(node, lvl); mask;	\
               mask = sched_numa_level_mask(node, ++lvl))

  void foo(int node, int cpus[], int ncpus)
  {
          const struct cpumask *mask;
          int lvl = 0;
          int i = 0;
          int cpu;

          rcu_read_lock();
          for_each_numa_level_mask(node, lvl, mask) {
                  for_each_cpu(cpu, mask) {
                          cpus[i] = cpu;
                          if (++i == ncpus)
                                  goto done;
                  }
          }
  done:
          rcu_read_unlock();
  }
Tariq Toukan Aug. 9, 2022, 2:04 p.m. UTC | #10
On 8/9/2022 3:52 PM, Valentin Schneider wrote:
> On 09/08/22 13:18, Tariq Toukan wrote:
>> On 8/9/2022 1:02 PM, Valentin Schneider wrote:
>>>
>>> Are there cases where we can't figure this out in advance? From what I grok
>>> out of the two callsites you patched, all vectors will be used unless some
>>> error happens, so compressing the CPUs in a single cpumask seemed
>>> sufficient.
>>>
>>
>> All vectors will be initialized to support the maximum number of traffic
>> rings. However, the actual number of traffic rings can be controlled and
>> set to a lower number N_actual < N. In this case, we'll be using only
>> N_actual instances and we want them to be the first/closest.
> 
> Ok, that makes sense, thank you.
> 
> In that case I wonder if we'd want a public-facing iterator for
> sched_domains_numa_masks[%i][node], rather than copy a portion of
> it. Something like the below (naming and implementation haven't been
> thought about too much).
> 
>    const struct cpumask *sched_numa_level_mask(int node, int level)
>    {
>            struct cpumask ***masks = rcu_dereference(sched_domains_numa_masks);
> 
>            if (node >= nr_node_ids || level >= sched_domains_numa_levels)
>                    return NULL;
> 
>            if (!masks)
>                    return NULL;
> 
>            return masks[level][node];
>    }
>    EXPORT_SYMBOL_GPL(sched_numa_level_mask);
> 

The above can be kept static, and expose only the foo() function below, 
similar to my sched_cpus_set_spread().

LGTM.
How do you suggest to proceed?
You want to formalize it? Or should I take it from here?


>    #define for_each_numa_level_mask(node, lvl, mask)	    \
>            for (mask = sched_numa_level_mask(node, lvl); mask;	\
>                 mask = sched_numa_level_mask(node, ++lvl))
> 
>    void foo(int node, int cpus[], int ncpus)
>    {
>            const struct cpumask *mask;
>            int lvl = 0;
>            int i = 0;
>            int cpu;
> 
>            rcu_read_lock();
>            for_each_numa_level_mask(node, lvl, mask) {
>                    for_each_cpu(cpu, mask) {
>                            cpus[i] = cpu;
>                            if (++i == ncpus)
>                                    goto done;
>                    }
>            }
>    done:
>            rcu_read_unlock();
>    }
>
Valentin Schneider Aug. 9, 2022, 5:36 p.m. UTC | #11
On 09/08/22 17:04, Tariq Toukan wrote:
> On 8/9/2022 3:52 PM, Valentin Schneider wrote:
>> On 09/08/22 13:18, Tariq Toukan wrote:
>>> On 8/9/2022 1:02 PM, Valentin Schneider wrote:
>>>>
>>>> Are there cases where we can't figure this out in advance? From what I grok
>>>> out of the two callsites you patched, all vectors will be used unless some
>>>> error happens, so compressing the CPUs in a single cpumask seemed
>>>> sufficient.
>>>>
>>>
>>> All vectors will be initialized to support the maximum number of traffic
>>> rings. However, the actual number of traffic rings can be controlled and
>>> set to a lower number N_actual < N. In this case, we'll be using only
>>> N_actual instances and we want them to be the first/closest.
>>
>> Ok, that makes sense, thank you.
>>
>> In that case I wonder if we'd want a public-facing iterator for
>> sched_domains_numa_masks[%i][node], rather than copy a portion of
>> it. Something like the below (naming and implementation haven't been
>> thought about too much).
>>
>>    const struct cpumask *sched_numa_level_mask(int node, int level)
>>    {
>>            struct cpumask ***masks = rcu_dereference(sched_domains_numa_masks);
>>
>>            if (node >= nr_node_ids || level >= sched_domains_numa_levels)
>>                    return NULL;
>>
>>            if (!masks)
>>                    return NULL;
>>
>>            return masks[level][node];
>>    }
>>    EXPORT_SYMBOL_GPL(sched_numa_level_mask);
>>
>
> The above can be kept static, and expose only the foo() function below,
> similar to my sched_cpus_set_spread().
>

So what I was thinking with this was to only have to export the
sched_numa_level_mask() thing and the iterator, and then consumers would be
free to use whatever storage form they want (cpumask, array, list...).

Right now I believe sched_domains_numa_masks has the right shape for the
interface (for a given node, you get a cpumask per distance level) and I
don't want to impose an interface that uses just an array, but perhaps I'm
being silly and the array isn't so bad and simpler to use - that said we
could always build an array-based helper on top of the array of cpumasks
thing.
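
For what it's worth, such an array-based helper built on top of the
sched_numa_level_mask() / for_each_numa_level_mask() sketch above (so all
of these names are still hypothetical) could be kept out of the core
entirely. Since each level mask is a superset of the previous one, CPUs
already picked at a closer level have to be skipped:

  /* Hypothetical: fill cpus[] with up to @ncpus CPUs, closest to @node
   * first. Returns the number of CPUs written, or -ENOMEM.
   */
  int sched_numa_spread_array(int node, u16 *cpus, int ncpus)
  {
          const struct cpumask *mask;
          cpumask_var_t seen;
          int lvl = 0, i = 0, cpu;

          if (!zalloc_cpumask_var(&seen, GFP_KERNEL))
                  return -ENOMEM;

          rcu_read_lock();
          for_each_numa_level_mask(node, lvl, mask) {
                  for_each_cpu(cpu, mask) {
                          /* Level masks are cumulative, skip CPUs already
                           * recorded at a closer level.
                           */
                          if (cpumask_test_cpu(cpu, seen))
                                  continue;
                          __cpumask_set_cpu(cpu, seen);
                          cpus[i] = cpu;
                          if (++i == ncpus)
                                  goto done;
                  }
          }
  done:
          rcu_read_unlock();
          free_cpumask_var(seen);
          return i;
  }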

Clearly I need to scratch my head a bit longer :-)

> LGTM.
> How do you suggest to proceed?
> You want to formalize it? Or should I take it from here?
>

I need to have a think (feel free to ponder and share your thoughts as
well) - I'm happy to push something if I get a brain-wave, but don't let
that hold you back either.
Valentin Schneider Aug. 10, 2022, 10:46 a.m. UTC | #12
On 09/08/22 18:36, Valentin Schneider wrote:
> On 09/08/22 17:04, Tariq Toukan wrote:
>> LGTM.
>> How do you suggest to proceed?
>> You want to formalize it? Or should I take it from here?
>>
>
> I need to have a think (feel free to ponder and share your thoughts as
> well) - I'm happy to push something if I get a brain-wave, but don't let
> that hold you back either.

This is the form that I hate the least, what do you think?
diff mbox series

Patch

diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index 56cffe42abbc..a49167c2a0e5 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -210,6 +210,7 @@  extern void set_sched_topology(struct sched_domain_topology_level *tl);
 # define SD_INIT_NAME(type)
 #endif
 
+void sched_cpus_set_spread(int node, u16 *cpus, int ncpus);
 #else /* CONFIG_SMP */
 
 struct sched_domain_attr;
@@ -231,6 +232,10 @@  static inline bool cpus_share_cache(int this_cpu, int that_cpu)
 	return true;
 }
 
+static inline void sched_cpus_set_spread(int node, u16 *cpus, int ncpus)
+{
+	memset(cpus, 0, ncpus * sizeof(*cpus));
+}
 #endif	/* !CONFIG_SMP */
 
 #if defined(CONFIG_ENERGY_MODEL) && defined(CONFIG_CPU_FREQ_GOV_SCHEDUTIL)
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 05b6c2ad90b9..157aef862c04 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -2067,8 +2067,57 @@  int sched_numa_find_closest(const struct cpumask *cpus, int cpu)
 	return found;
 }
 
+static bool sched_cpus_spread_by_distance(int node, u16 *cpus, int ncpus)
+{
+	cpumask_var_t cpumask;
+	int first, i;
+
+	if (!zalloc_cpumask_var(&cpumask, GFP_KERNEL))
+		return false;
+
+	cpumask_copy(cpumask, cpu_online_mask);
+
+	first = cpumask_first(cpumask_of_node(node));
+
+	for (i = 0; i < ncpus; i++) {
+		int cpu;
+
+		cpu = sched_numa_find_closest(cpumask, first);
+		if (cpu >= nr_cpu_ids) {
+			free_cpumask_var(cpumask);
+			return false;
+		}
+		cpus[i] = cpu;
+		__cpumask_clear_cpu(cpu, cpumask);
+	}
+
+	free_cpumask_var(cpumask);
+	return true;
+}
+#else
+static bool sched_cpus_spread_by_distance(int node, u16 *cpus, int ncpus)
+{
+	return false;
+}
 #endif /* CONFIG_NUMA */
 
+static void sched_cpus_by_local_spread(int node, u16 *cpus, int ncpus)
+{
+	int i;
+
+	for (i = 0; i < ncpus; i++)
+		cpus[i] = cpumask_local_spread(i, node);
+}
+
+void sched_cpus_set_spread(int node, u16 *cpus, int ncpus)
+{
+	bool success = sched_cpus_spread_by_distance(node, cpus, ncpus);
+
+	if (!success)
+		sched_cpus_by_local_spread(node, cpus, ncpus);
+}
+EXPORT_SYMBOL_GPL(sched_cpus_set_spread);
+
 static int __sdt_alloc(const struct cpumask *cpu_map)
 {
 	struct sched_domain_topology_level *tl;