
[1/2] sched/topology: Introduce sched_numa_hop_mask()

Message ID 20220810105119.2684079-1-vschneid@redhat.com (mailing list archive)
State Changes Requested
Delegated to: Netdev Maintainers
Series [1/2] sched/topology: Introduce sched_numa_hop_mask()

Checks

Context Check Description
netdev/tree_selection success Guessed tree name to be net-next
netdev/fixes_present success Fixes tag not required for -next series
netdev/subject_prefix success Link
netdev/cover_letter success Single patches do not need cover letters
netdev/patch_count success Link
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit fail Errors and warnings before: 17605 this patch: 38069
netdev/cc_maintainers warning 8 maintainers not CCed: rostedt@goodmis.org hca@linux.ibm.com mgorman@suse.de gregkh@linuxfoundation.org song.bao.hua@hisilicon.com bsegall@google.com dietmar.eggemann@arm.com bristot@redhat.com
netdev/build_clang success Errors and warnings before: 3259 this patch: 3259
netdev/module_param success Was 0 now: 0
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes success No Fixes tag
netdev/build_allmodconfig_warn fail Errors and warnings before: 16768 this patch: 16769
netdev/checkpatch warning CHECK: extern prototypes should be avoided in .h files WARNING: ENOTSUPP is not a SUSV4 error code, prefer EOPNOTSUPP WARNING: line length of 84 exceeds 80 columns
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/source_inline success Was 0 now: 0

Commit Message

Valentin Schneider Aug. 10, 2022, 10:51 a.m. UTC
Tariq has pointed out that drivers allocating IRQ vectors would benefit
from having smarter NUMA-awareness - cpumask_local_spread() only knows
about the local node and everything outside is in the same bucket.

sched_domains_numa_masks is pretty much what we want to hand out (a cpumask
of CPUs reachable within a given distance budget); introduce
sched_numa_hop_mask() to export those cpumasks. Add an iteration helper
to iterate over CPUs at an incremental distance from a given node.

Link: http://lore.kernel.org/r/20220728191203.4055-1-tariqt@nvidia.com
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
---
 include/linux/topology.h | 12 ++++++++++++
 kernel/sched/topology.c  | 28 ++++++++++++++++++++++++++++
 2 files changed, 40 insertions(+)
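
For illustration, a minimal sketch of how a caller might use the proposed
interface, going by the macro and kdoc in this patch; the NUMA node id in
'node' and the pr_info() reporting are assumptions made for the example, not
part of the series:

	const struct cpumask *mask;
	int hops = 0;

	rcu_read_lock();
	for_each_numa_hop_mask(node, hops, mask) {
		/*
		 * 'mask' covers all CPUs within 'hops' hops of 'node'; each
		 * successive mask is a superset of the previous one, so a
		 * caller spreading work (e.g. IRQ vectors) has to avoid
		 * revisiting CPUs it already handled at a closer hop.
		 */
		pr_info("node %d, hop %d: %*pbl\n",
			node, hops, cpumask_pr_args(mask));
	}
	rcu_read_unlock();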

Comments

Tariq Toukan Aug. 10, 2022, 12:42 p.m. UTC | #1
On 8/10/2022 1:51 PM, Valentin Schneider wrote:
> Tariq has pointed out that drivers allocating IRQ vectors would benefit
> from having smarter NUMA-awareness - cpumask_local_spread() only knows
> about the local node and everything outside is in the same bucket.
> 
> sched_domains_numa_masks is pretty much what we want to hand out (a cpumask
> of CPUs reachable within a given distance budget), introduce
> sched_numa_hop_mask() to export those cpumasks. Add in an iteration helper
> to iterate over CPUs at an incremental distance from a given node.
> 
> Link: http://lore.kernel.org/r/20220728191203.4055-1-tariqt@nvidia.com
> Signed-off-by: Valentin Schneider <vschneid@redhat.com>
> ---
>   include/linux/topology.h | 12 ++++++++++++
>   kernel/sched/topology.c  | 28 ++++++++++++++++++++++++++++
>   2 files changed, 40 insertions(+)
> 
> diff --git a/include/linux/topology.h b/include/linux/topology.h
> index 4564faafd0e1..d66e3cf40823 100644
> --- a/include/linux/topology.h
> +++ b/include/linux/topology.h
> @@ -245,5 +245,17 @@ static inline const struct cpumask *cpu_cpu_mask(int cpu)
>   	return cpumask_of_node(cpu_to_node(cpu));
>   }
>   
> +#ifdef CONFIG_NUMA
> +extern const struct cpumask *sched_numa_hop_mask(int node, int hops);
> +#else
> +static inline const struct cpumask *sched_numa_hop_mask(int node, int hops)
> +{
> +	return -ENOTSUPP;

missing ERR_PTR()

> +}
> +#endif	/* CONFIG_NUMA */
> +
> +#define for_each_numa_hop_mask(node, hops, mask)			\
> +	for (mask = sched_numa_hop_mask(node, hops); !IS_ERR_OR_NULL(mask); \
> +	     mask = sched_numa_hop_mask(node, ++hops))
>   
>   #endif /* _LINUX_TOPOLOGY_H */
> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> index 8739c2a5a54e..f0236a0ae65c 100644
> --- a/kernel/sched/topology.c
> +++ b/kernel/sched/topology.c
> @@ -2067,6 +2067,34 @@ int sched_numa_find_closest(const struct cpumask *cpus, int cpu)
>   	return found;
>   }
>   
> +/**
> + * sched_numa_hop_mask() - Get the cpumask of CPUs at most @hops hops away.
> + * @node: The node to count hops from.
> + * @hops: Include CPUs up to that many hops away. 0 means local node.
> + *
> + * Requires rcu_lock to be held. Returned cpumask is only valid within that
> + * read-side section, copy it if required beyond that.
> + *
> + * Note that not all hops are equal in size; see sched_init_numa() for how
> + * distances and masks are handled.
> + *
> + * Also note that this is a reflection of sched_domains_numa_masks, which may change
> + * during the lifetime of the system (offline nodes are taken out of the masks).
> + */
> +const struct cpumask *sched_numa_hop_mask(int node, int hops)
> +{
> +	struct cpumask ***masks = rcu_dereference(sched_domains_numa_masks);
> +
> +	if (node >= nr_node_ids || hops >= sched_domains_numa_levels)
> +		return ERR_PTR(-EINVAL);
> +
> +	if (!masks)
> +		return NULL;
> +
> +	return masks[hops][node];
> +}
> +EXPORT_SYMBOL_GPL(sched_numa_hop_mask);
> +
>   #endif /* CONFIG_NUMA */
>   
>   static int __sdt_alloc(const struct cpumask *cpu_map)
Tariq Toukan Aug. 10, 2022, 12:57 p.m. UTC | #2
On 8/10/2022 3:42 PM, Tariq Toukan wrote:
> 
> 
> On 8/10/2022 1:51 PM, Valentin Schneider wrote:
>> Tariq has pointed out that drivers allocating IRQ vectors would benefit
>> from having smarter NUMA-awareness - cpumask_local_spread() only knows
>> about the local node and everything outside is in the same bucket.
>>
>> sched_domains_numa_masks is pretty much what we want to hand out (a 
>> cpumask
>> of CPUs reachable within a given distance budget), introduce
>> sched_numa_hop_mask() to export those cpumasks. Add in an iteration 
>> helper
>> to iterate over CPUs at an incremental distance from a given node.
>>
>> Link: http://lore.kernel.org/r/20220728191203.4055-1-tariqt@nvidia.com
>> Signed-off-by: Valentin Schneider <vschneid@redhat.com>
>> ---
>>   include/linux/topology.h | 12 ++++++++++++
>>   kernel/sched/topology.c  | 28 ++++++++++++++++++++++++++++
>>   2 files changed, 40 insertions(+)
>>
>> diff --git a/include/linux/topology.h b/include/linux/topology.h
>> index 4564faafd0e1..d66e3cf40823 100644
>> --- a/include/linux/topology.h
>> +++ b/include/linux/topology.h
>> @@ -245,5 +245,17 @@ static inline const struct cpumask 
>> *cpu_cpu_mask(int cpu)
>>       return cpumask_of_node(cpu_to_node(cpu));
>>   }
>> +#ifdef CONFIG_NUMA
>> +extern const struct cpumask *sched_numa_hop_mask(int node, int hops);
>> +#else
>> +static inline const struct cpumask *sched_numa_hop_mask(int node, int 
>> hops)
>> +{
>> +    return -ENOTSUPP;
> 
> missing ERR_PTR()
> 
>> +}
>> +#endif    /* CONFIG_NUMA */
>> +
>> +#define for_each_numa_hop_mask(node, hops, mask)            \
>> +    for (mask = sched_numa_hop_mask(node, hops); 
>> !IS_ERR_OR_NULL(mask); \
>> +         mask = sched_numa_hop_mask(node, ++hops))
>>   #endif /* _LINUX_TOPOLOGY_H */
>> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
>> index 8739c2a5a54e..f0236a0ae65c 100644
>> --- a/kernel/sched/topology.c
>> +++ b/kernel/sched/topology.c
>> @@ -2067,6 +2067,34 @@ int sched_numa_find_closest(const struct 
>> cpumask *cpus, int cpu)
>>       return found;
>>   }
>> +/**
>> + * sched_numa_hop_mask() - Get the cpumask of CPUs at most @hops hops 
>> away.
>> + * @node: The node to count hops from.
>> + * @hops: Include CPUs up to that many hops away. 0 means local node.

AFAIU, here you work with a specific level/number of hops; the description is
not accurate.


>> + *
>> + * Requires rcu_lock to be held. Returned cpumask is only valid 
>> within that
>> + * read-side section, copy it if required beyond that.
>> + *
>> + * Note that not all hops are equal in size; see sched_init_numa() 
>> for how
>> + * distances and masks are handled.
>> + *
>> + * Also note that this is a reflection of sched_domains_numa_masks, 
>> which may change
>> + * during the lifetime of the system (offline nodes are taken out of 
>> the masks).
>> + */
>> +const struct cpumask *sched_numa_hop_mask(int node, int hops)
>> +{
>> +    struct cpumask ***masks = rcu_dereference(sched_domains_numa_masks);
>> +
>> +    if (node >= nr_node_ids || hops >= sched_domains_numa_levels)
>> +        return ERR_PTR(-EINVAL);
>> +
>> +    if (!masks)
>> +        return NULL;
>> +
>> +    return masks[hops][node];
>> +}
>> +EXPORT_SYMBOL_GPL(sched_numa_hop_mask);
>> +
>>   #endif /* CONFIG_NUMA */
>>   static int __sdt_alloc(const struct cpumask *cpu_map)
Valentin Schneider Aug. 11, 2022, 2:26 p.m. UTC | #3
On 10/08/22 15:57, Tariq Toukan wrote:
> On 8/10/2022 3:42 PM, Tariq Toukan wrote:
>>
>>
>> On 8/10/2022 1:51 PM, Valentin Schneider wrote:
>>> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
>>> index 8739c2a5a54e..f0236a0ae65c 100644
>>> --- a/kernel/sched/topology.c
>>> +++ b/kernel/sched/topology.c
>>> @@ -2067,6 +2067,34 @@ int sched_numa_find_closest(const struct
>>> cpumask *cpus, int cpu)
>>>       return found;
>>>   }
>>> +/**
>>> + * sched_numa_hop_mask() - Get the cpumask of CPUs at most @hops hops
>>> away.
>>> + * @node: The node to count hops from.
>>> + * @hops: Include CPUs up to that many hops away. 0 means local node.
>
> AFAIU, here you work with a specific level/num of hops, description is
> not accurate.
>

Hmph, unfortunately it's the other way around - the masks do include CPUs
*up to* a number of hops, but in my mlx5 example I've used it as if it only
included CPUs a specific distance away :/

As things stand we'd need a temporary cpumask to account for which CPUs we
have visited (which is what you had in your original submission), but with
a for_each_cpu_andnot() we don't need any of that.

Below is what I ended up with. I've tested it on a range of NUMA topologies
and it behaves as I'd expect, and on the plus side the code required in the
driver side is even simpler than before.

If you don't have major gripes with it, I'll shape that into a proper
series and will let you handle the mlx5/enic bits.

---

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eq.c b/drivers/net/ethernet/mellanox/mlx5/core/eq.c
index 229728c80233..0a5432903edd 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/eq.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/eq.c
@@ -812,6 +812,7 @@ static int comp_irqs_request(struct mlx5_core_dev *dev)
 	int ncomp_eqs = table->num_comp_eqs;
 	u16 *cpus;
 	int ret;
+	int cpu;
 	int i;
 
 	ncomp_eqs = table->num_comp_eqs;
@@ -830,8 +831,15 @@ static int comp_irqs_request(struct mlx5_core_dev *dev)
 		ret = -ENOMEM;
 		goto free_irqs;
 	}
-	for (i = 0; i < ncomp_eqs; i++)
-		cpus[i] = cpumask_local_spread(i, dev->priv.numa_node);
+	i = 0;
+	rcu_read_lock();
+	for_each_numa_hop_cpu(cpu, dev->priv.numa_node) {
+		cpus[i] = cpu;
+		if (++i == ncomp_eqs)
+			goto spread_done;
+	}
+spread_done:
+	rcu_read_unlock();
 	ret = mlx5_irqs_request_vectors(dev, cpus, ncomp_eqs, table->comp_irqs);
 	kfree(cpus);
 	if (ret < 0)
diff --git a/include/linux/cpumask.h b/include/linux/cpumask.h
index fe29ac7cc469..ccd5d71aefef 100644
--- a/include/linux/cpumask.h
+++ b/include/linux/cpumask.h
@@ -157,6 +157,13 @@ static inline unsigned int cpumask_next_and(int n,
 	return n+1;
 }
 
+static inline unsigned int cpumask_next_andnot(int n,
+					    const struct cpumask *srcp,
+					    const struct cpumask *andp)
+{
+	return n+1;
+}
+
 static inline unsigned int cpumask_next_wrap(int n, const struct cpumask *mask,
 					     int start, bool wrap)
 {
@@ -194,6 +201,8 @@ static inline int cpumask_any_distribute(const struct cpumask *srcp)
 	for ((cpu) = 0; (cpu) < 1; (cpu)++, (void)mask, (void)(start))
 #define for_each_cpu_and(cpu, mask1, mask2)	\
 	for ((cpu) = 0; (cpu) < 1; (cpu)++, (void)mask1, (void)mask2)
+#define for_each_cpu_andnot(cpu, mask1, mask2)	\
+	for ((cpu) = 0; (cpu) < 1; (cpu)++, (void)mask1, (void)mask2)
 #else
 /**
  * cpumask_first - get the first cpu in a cpumask
@@ -259,6 +268,7 @@ static inline unsigned int cpumask_next_zero(int n, const struct cpumask *srcp)
 }
 
 int __pure cpumask_next_and(int n, const struct cpumask *, const struct cpumask *);
+int __pure cpumask_next_andnot(int n, const struct cpumask *, const struct cpumask *);
 int __pure cpumask_any_but(const struct cpumask *mask, unsigned int cpu);
 unsigned int cpumask_local_spread(unsigned int i, int node);
 int cpumask_any_and_distribute(const struct cpumask *src1p,
@@ -324,6 +334,26 @@ extern int cpumask_next_wrap(int n, const struct cpumask *mask, int start, bool
 	for ((cpu) = -1;						\
 		(cpu) = cpumask_next_and((cpu), (mask1), (mask2)),	\
 		(cpu) < nr_cpu_ids;)
+
+/**
+ * for_each_cpu_andnot - iterate over every cpu in one mask but not in another
+ * @cpu: the (optionally unsigned) integer iterator
+ * @mask1: the first cpumask pointer
+ * @mask2: the second cpumask pointer
+ *
+ * This saves a temporary CPU mask in many places.  It is equivalent to:
+ *	struct cpumask tmp;
+ *	cpumask_andnot(&tmp, &mask1, &mask2);
+ *	for_each_cpu(cpu, &tmp)
+ *		...
+ *
+ * After the loop, cpu is >= nr_cpu_ids.
+ */
+#define for_each_cpu_andnot(cpu, mask1, mask2)				\
+	for ((cpu) = -1;						\
+		(cpu) = cpumask_next_andnot((cpu), (mask1), (mask2)),	\
+		(cpu) < nr_cpu_ids;)
+
 #endif /* SMP */
 
 #define CPU_BITS_NONE						\
diff --git a/include/linux/find.h b/include/linux/find.h
index 424ef67d4a42..454cde69b30b 100644
--- a/include/linux/find.h
+++ b/include/linux/find.h
@@ -10,7 +10,8 @@
 
 extern unsigned long _find_next_bit(const unsigned long *addr1,
 		const unsigned long *addr2, unsigned long nbits,
-		unsigned long start, unsigned long invert, unsigned long le);
+		unsigned long start, unsigned long invert, unsigned long le,
+		bool negate);
 extern unsigned long _find_first_bit(const unsigned long *addr, unsigned long size);
 extern unsigned long _find_first_and_bit(const unsigned long *addr1,
 					 const unsigned long *addr2, unsigned long size);
@@ -41,7 +42,7 @@ unsigned long find_next_bit(const unsigned long *addr, unsigned long size,
 		return val ? __ffs(val) : size;
 	}
 
-	return _find_next_bit(addr, NULL, size, offset, 0UL, 0);
+	return _find_next_bit(addr, NULL, size, offset, 0UL, 0, 0);
 }
 #endif
 
@@ -71,7 +72,38 @@ unsigned long find_next_and_bit(const unsigned long *addr1,
 		return val ? __ffs(val) : size;
 	}
 
-	return _find_next_bit(addr1, addr2, size, offset, 0UL, 0);
+	return _find_next_bit(addr1, addr2, size, offset, 0UL, 0, 0);
+}
+#endif
+
+#ifndef find_next_andnot_bit
+/**
+ * find_next_andnot_bit - find the next set bit in one memory region
+ *                        but not in the other
+ * @addr1: The first address to base the search on
+ * @addr2: The second address to base the search on
+ * @size: The bitmap size in bits
+ * @offset: The bitnumber to start searching at
+ *
+ * Returns the bit number for the next set bit
+ * If no bits are set, returns @size.
+ */
+static inline
+unsigned long find_next_andnot_bit(const unsigned long *addr1,
+		const unsigned long *addr2, unsigned long size,
+		unsigned long offset)
+{
+	if (small_const_nbits(size)) {
+		unsigned long val;
+
+		if (unlikely(offset >= size))
+			return size;
+
+		val = *addr1 & ~*addr2 & GENMASK(size - 1, offset);
+		return val ? __ffs(val) : size;
+	}
+
+	return _find_next_bit(addr1, addr2, size, offset, 0UL, 0, 1);
 }
 #endif
 
@@ -99,7 +131,7 @@ unsigned long find_next_zero_bit(const unsigned long *addr, unsigned long size,
 		return val == ~0UL ? size : ffz(val);
 	}
 
-	return _find_next_bit(addr, NULL, size, offset, ~0UL, 0);
+	return _find_next_bit(addr, NULL, size, offset, ~0UL, 0, 0);
 }
 #endif
 
@@ -247,7 +279,7 @@ unsigned long find_next_zero_bit_le(const void *addr, unsigned
 		return val == ~0UL ? size : ffz(val);
 	}
 
-	return _find_next_bit(addr, NULL, size, offset, ~0UL, 1);
+	return _find_next_bit(addr, NULL, size, offset, ~0UL, 1, 0);
 }
 #endif
 
@@ -266,7 +298,7 @@ unsigned long find_next_bit_le(const void *addr, unsigned
 		return val ? __ffs(val) : size;
 	}
 
-	return _find_next_bit(addr, NULL, size, offset, 0UL, 1);
+	return _find_next_bit(addr, NULL, size, offset, 0UL, 1, 0);
 }
 #endif
 
diff --git a/include/linux/topology.h b/include/linux/topology.h
index 4564faafd0e1..41bed4b883d3 100644
--- a/include/linux/topology.h
+++ b/include/linux/topology.h
@@ -245,5 +245,50 @@ static inline const struct cpumask *cpu_cpu_mask(int cpu)
 	return cpumask_of_node(cpu_to_node(cpu));
 }
 
+#ifdef CONFIG_NUMA
+extern const struct cpumask *sched_numa_hop_mask(int node, int hops);
+#else
+static inline const struct cpumask *sched_numa_hop_mask(int node, int hops)
+{
+	return ERR_PTR(-ENOTSUPP);
+}
+#endif	/* CONFIG_NUMA */
+
+/**
+ * for_each_numa_hop_cpu - iterate over CPUs by increasing NUMA distance,
+ *                         starting from a given node.
+ * @cpu: the iteration variable.
+ * @node: the NUMA node to start the search from.
+ *
+ * Requires rcu_lock to be held.
+ * Careful: this is a double loop, 'break' won't work as expected.
+ *
+ *
+ * Implementation notes:
+ *
+ * Providing it is valid, the mask returned by
+ *  sched_numa_hop_mask(node, hops+1)
+ * is a superset of the one returned by
+ *   sched_numa_hop_mask(node, hops)
+ * which may not be that useful for drivers that try to spread things out and
+ * want to visit a CPU not more than once.
+ *
+ * To accommodate that, we use for_each_cpu_andnot() to iterate over the cpus
+ * of sched_numa_hop_mask(node, hops+1) with the CPUs of
+ * sched_numa_hop_mask(node, hops) removed, IOW we only iterate over CPUs
+ * a given distance away (rather than *up to* a given distance).
+ *
+ * h=0 forces us to play silly games and pass cpu_none_mask to
+ * for_each_cpu_andnot(), which turns it into for_each_cpu().
+ */
+#define for_each_numa_hop_cpu(cpu, node)				       \
+	for (struct { const struct cpumask *mask; int hops; } __v__ =	       \
+		     { sched_numa_hop_mask(node, 0), 0 };		       \
+	     !IS_ERR_OR_NULL(__v__.mask);				       \
+	     __v__.hops++, __v__.mask = sched_numa_hop_mask(node, __v__.hops)) \
+		for_each_cpu_andnot(cpu, __v__.mask,			       \
+				    __v__.hops ?			       \
+				    sched_numa_hop_mask(node, __v__.hops - 1) :\
+				    cpu_none_mask)
 
 #endif /* _LINUX_TOPOLOGY_H */
diff --git a/kernel/sched/Makefile b/kernel/sched/Makefile
index 976092b7bd45..9182101f2c4f 100644
--- a/kernel/sched/Makefile
+++ b/kernel/sched/Makefile
@@ -29,6 +29,6 @@ endif
 # build parallelizes well and finishes roughly at once:
 #
 obj-y += core.o
-obj-y += fair.o
+obj-y += fair.o yolo.o
 obj-y += build_policy.o
 obj-y += build_utility.o
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 8739c2a5a54e..f0236a0ae65c 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -2067,6 +2067,34 @@ int sched_numa_find_closest(const struct cpumask *cpus, int cpu)
 	return found;
 }
 
+/**
+ * sched_numa_hop_mask() - Get the cpumask of CPUs at most @hops hops away.
+ * @node: The node to count hops from.
+ * @hops: Include CPUs up to that many hops away. 0 means local node.
+ *
+ * Requires rcu_lock to be held. Returned cpumask is only valid within that
+ * read-side section, copy it if required beyond that.
+ *
+ * Note that not all hops are equal in size; see sched_init_numa() for how
+ * distances and masks are handled.
+ *
+ * Also note that this is a reflection of sched_domains_numa_masks, which may change
+ * during the lifetime of the system (offline nodes are taken out of the masks).
+ */
+const struct cpumask *sched_numa_hop_mask(int node, int hops)
+{
+	struct cpumask ***masks = rcu_dereference(sched_domains_numa_masks);
+
+	if (node >= nr_node_ids || hops >= sched_domains_numa_levels)
+		return ERR_PTR(-EINVAL);
+
+	if (!masks)
+		return NULL;
+
+	return masks[hops][node];
+}
+EXPORT_SYMBOL_GPL(sched_numa_hop_mask);
+
 #endif /* CONFIG_NUMA */
 
 static int __sdt_alloc(const struct cpumask *cpu_map)
diff --git a/lib/cpumask.c b/lib/cpumask.c
index a971a82d2f43..8bcf7e919193 100644
--- a/lib/cpumask.c
+++ b/lib/cpumask.c
@@ -42,6 +42,25 @@ int cpumask_next_and(int n, const struct cpumask *src1p,
 }
 EXPORT_SYMBOL(cpumask_next_and);
 
+/**
+ * cpumask_next_andnot - get the next cpu in *src1p & ~*src2p
+ * @n: the cpu prior to the place to search (ie. return will be > @n)
+ * @src1p: the first cpumask pointer
+ * @src2p: the second cpumask pointer
+ *
+ * Returns >= nr_cpu_ids if no further cpus set in both.
+ */
+int cpumask_next_andnot(int n, const struct cpumask *src1p,
+		     const struct cpumask *src2p)
+{
+	/* -1 is a legal arg here. */
+	if (n != -1)
+		cpumask_check(n);
+	return find_next_andnot_bit(cpumask_bits(src1p), cpumask_bits(src2p),
+		nr_cpumask_bits, n + 1);
+}
+EXPORT_SYMBOL(cpumask_next_andnot);
+
 /**
  * cpumask_any_but - return a "random" in a cpumask, but not this one.
  * @mask: the cpumask to search
diff --git a/lib/find_bit.c b/lib/find_bit.c
index 1b8e4b2a9cba..6e5f42c621a9 100644
--- a/lib/find_bit.c
+++ b/lib/find_bit.c
@@ -21,17 +21,19 @@
 
 #if !defined(find_next_bit) || !defined(find_next_zero_bit) ||			\
 	!defined(find_next_bit_le) || !defined(find_next_zero_bit_le) ||	\
-	!defined(find_next_and_bit)
+	!defined(find_next_and_bit) || !defined(find_next_andnot_bit)
 /*
  * This is a common helper function for find_next_bit, find_next_zero_bit, and
  * find_next_and_bit. The differences are:
  *  - The "invert" argument, which is XORed with each fetched word before
  *    searching it for one bits.
- *  - The optional "addr2", which is anded with "addr1" if present.
+ *  - The optional "addr2", negated if "negate" and ANDed with "addr1" if
+ *    present.
  */
 unsigned long _find_next_bit(const unsigned long *addr1,
 		const unsigned long *addr2, unsigned long nbits,
-		unsigned long start, unsigned long invert, unsigned long le)
+		unsigned long start, unsigned long invert, unsigned long le,
+		bool negate)
 {
 	unsigned long tmp, mask;
 
@@ -40,7 +42,9 @@ unsigned long _find_next_bit(const unsigned long *addr1,
 
 	tmp = addr1[start / BITS_PER_LONG];
 	if (addr2)
-		tmp &= addr2[start / BITS_PER_LONG];
+		tmp &= negate ?
+		       ~addr2[start / BITS_PER_LONG] :
+			addr2[start / BITS_PER_LONG];
 	tmp ^= invert;
 
 	/* Handle 1st word. */
@@ -59,7 +63,9 @@ unsigned long _find_next_bit(const unsigned long *addr1,
 
 		tmp = addr1[start / BITS_PER_LONG];
 		if (addr2)
-			tmp &= addr2[start / BITS_PER_LONG];
+			tmp &= negate ?
+			       ~addr2[start / BITS_PER_LONG] :
+				addr2[start / BITS_PER_LONG];
 		tmp ^= invert;
 	}
Tariq Toukan Aug. 14, 2022, 8:19 a.m. UTC | #4
On 8/11/2022 5:26 PM, Valentin Schneider wrote:
> On 10/08/22 15:57, Tariq Toukan wrote:
>> On 8/10/2022 3:42 PM, Tariq Toukan wrote:
>>>
>>>
>>> On 8/10/2022 1:51 PM, Valentin Schneider wrote:
>>>> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
>>>> index 8739c2a5a54e..f0236a0ae65c 100644
>>>> --- a/kernel/sched/topology.c
>>>> +++ b/kernel/sched/topology.c
>>>> @@ -2067,6 +2067,34 @@ int sched_numa_find_closest(const struct
>>>> cpumask *cpus, int cpu)
>>>>        return found;
>>>>    }
>>>> +/**
>>>> + * sched_numa_hop_mask() - Get the cpumask of CPUs at most @hops hops
>>>> away.
>>>> + * @node: The node to count hops from.
>>>> + * @hops: Include CPUs up to that many hops away. 0 means local node.
>>
>> AFAIU, here you work with a specific level/num of hops, description is
>> not accurate.
>>
> 
> Hmph, unfortunately it's the other way around - the masks do include CPUs
> *up to* a number of hops, but in my mlx5 example I've used it as if it only
> included CPUs a specific distance away :/
> 

Aha, got it. It makes it more challenging :)

> As things stand we'd need a temporary cpumask to account for which CPUs we
> have visited (which is what you had in your original submission), but with
> a for_each_cpu_andnot() we don't need any of that.
> 
> Below is what I ended up with. I've tested it on a range of NUMA topologies
> and it behaves as I'd expect, and on the plus side the code required in the
> driver side is even simpler than before.
> 
> If you don't have major gripes with it, I'll shape that into a proper
> series and will let you handle the mlx5/enic bits.
> 

The API is indeed easy to use; the driver part looks straightforward.

I appreciate the tricks you used to make it work!
However, the implementation is relatively complicated, not easy to read 
or understand, and touches several files. I do understand what you did 
here, but I guess not all respective maintainers will like it. Let's see.

One alternative to consider, which would simplify things, is switching 
back to returning an array of cpus, ordered by their distance, up to a 
provided argument 'npus'.
This way, you would iterate over sched_numa_hop_mask() internally, easily 
maintaining the cpumask diffs between two hops, without the need of 
computing them on the fly as part of an exposed for-loop macro.
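
A hypothetical sketch of what that array-returning alternative could look
like, assuming the sched_numa_hop_mask() from this series; the name
sched_numa_spread_cpus() and its signature are invented here purely for
illustration:

	/*
	 * Fill 'cpus' with up to 'ncpus' CPU ids ordered by increasing NUMA
	 * distance from 'node'; returns how many were written. The 'visited'
	 * mask is the bookkeeping needed to avoid handing out a CPU more than
	 * once, since the hop masks are cumulative.
	 */
	static int sched_numa_spread_cpus(int node, u16 *cpus, int ncpus)
	{
		const struct cpumask *mask;
		cpumask_var_t visited;
		int hops, cpu, n = 0;

		if (!zalloc_cpumask_var(&visited, GFP_KERNEL))
			return -ENOMEM;

		rcu_read_lock();
		for (hops = 0; ; hops++) {
			mask = sched_numa_hop_mask(node, hops);
			if (IS_ERR_OR_NULL(mask))
				break;
			for_each_cpu(cpu, mask) {
				if (cpumask_test_and_set_cpu(cpu, visited))
					continue;
				cpus[n++] = cpu;
				if (n == ncpus)
					goto out;
			}
		}
	out:
		rcu_read_unlock();
		free_cpumask_var(visited);
		return n;
	}

Whether the destination array plus the 'visited' mask are acceptable extra
storage is exactly the trade-off raised in the reply below.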

> ---
> 
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eq.c b/drivers/net/ethernet/mellanox/mlx5/core/eq.c
> index 229728c80233..0a5432903edd 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/eq.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/eq.c
> @@ -812,6 +812,7 @@ static int comp_irqs_request(struct mlx5_core_dev *dev)
>   	int ncomp_eqs = table->num_comp_eqs;
>   	u16 *cpus;
>   	int ret;
> +	int cpu;
>   	int i;
>   
>   	ncomp_eqs = table->num_comp_eqs;
> @@ -830,8 +831,15 @@ static int comp_irqs_request(struct mlx5_core_dev *dev)
>   		ret = -ENOMEM;
>   		goto free_irqs;
>   	}
> -	for (i = 0; i < ncomp_eqs; i++)
> -		cpus[i] = cpumask_local_spread(i, dev->priv.numa_node);
> +
> +	rcu_read_lock();
> +	for_each_numa_hop_cpus(cpu, dev->priv.numa_node) {
> +		cpus[i] = cpu;
> +		if (++i == ncomp_eqs)
> +			goto spread_done;
> +	}
> +spread_done:
> +	rcu_read_unlock();
>   	ret = mlx5_irqs_request_vectors(dev, cpus, ncomp_eqs, table->comp_irqs);
>   	kfree(cpus);
>   	if (ret < 0)
> diff --git a/include/linux/cpumask.h b/include/linux/cpumask.h
> index fe29ac7cc469..ccd5d71aefef 100644
> --- a/include/linux/cpumask.h
> +++ b/include/linux/cpumask.h
> @@ -157,6 +157,13 @@ static inline unsigned int cpumask_next_and(int n,
>   	return n+1;
>   }
>   
> +static inline unsigned int cpumask_next_andnot(int n,
> +					    const struct cpumask *srcp,
> +					    const struct cpumask *andp)
> +{
> +	return n+1;
> +}
> +
>   static inline unsigned int cpumask_next_wrap(int n, const struct cpumask *mask,
>   					     int start, bool wrap)
>   {
> @@ -194,6 +201,8 @@ static inline int cpumask_any_distribute(const struct cpumask *srcp)
>   	for ((cpu) = 0; (cpu) < 1; (cpu)++, (void)mask, (void)(start))
>   #define for_each_cpu_and(cpu, mask1, mask2)	\
>   	for ((cpu) = 0; (cpu) < 1; (cpu)++, (void)mask1, (void)mask2)
> +#define for_each_cpu_andnot(cpu, mask1, mask2)	\
> +	for ((cpu) = 0; (cpu) < 1; (cpu)++, (void)mask1, (void)mask2)
>   #else
>   /**
>    * cpumask_first - get the first cpu in a cpumask
> @@ -259,6 +268,7 @@ static inline unsigned int cpumask_next_zero(int n, const struct cpumask *srcp)
>   }
>   
>   int __pure cpumask_next_and(int n, const struct cpumask *, const struct cpumask *);
> +int __pure cpumask_next_andnot(int n, const struct cpumask *, const struct cpumask *);
>   int __pure cpumask_any_but(const struct cpumask *mask, unsigned int cpu);
>   unsigned int cpumask_local_spread(unsigned int i, int node);
>   int cpumask_any_and_distribute(const struct cpumask *src1p,
> @@ -324,6 +334,26 @@ extern int cpumask_next_wrap(int n, const struct cpumask *mask, int start, bool
>   	for ((cpu) = -1;						\
>   		(cpu) = cpumask_next_and((cpu), (mask1), (mask2)),	\
>   		(cpu) < nr_cpu_ids;)
> +
> +/**
> + * for_each_cpu_andnot - iterate over every cpu in one mask but not in another
> + * @cpu: the (optionally unsigned) integer iterator
> + * @mask1: the first cpumask pointer
> + * @mask2: the second cpumask pointer
> + *
> + * This saves a temporary CPU mask in many places.  It is equivalent to:
> + *	struct cpumask tmp;
> + *	cpumask_andnot(&tmp, &mask1, &mask2);
> + *	for_each_cpu(cpu, &tmp)
> + *		...
> + *
> + * After the loop, cpu is >= nr_cpu_ids.
> + */
> +#define for_each_cpu_andnot(cpu, mask1, mask2)				\
> +	for ((cpu) = -1;						\
> +		(cpu) = cpumask_next_andnot((cpu), (mask1), (mask2)),	\
> +		(cpu) < nr_cpu_ids;)
> +
>   #endif /* SMP */
>   
>   #define CPU_BITS_NONE						\
> diff --git a/include/linux/find.h b/include/linux/find.h
> index 424ef67d4a42..454cde69b30b 100644
> --- a/include/linux/find.h
> +++ b/include/linux/find.h
> @@ -10,7 +10,8 @@
>   
>   extern unsigned long _find_next_bit(const unsigned long *addr1,
>   		const unsigned long *addr2, unsigned long nbits,
> -		unsigned long start, unsigned long invert, unsigned long le);
> +		unsigned long start, unsigned long invert, unsigned long le,
> +		bool negate);
>   extern unsigned long _find_first_bit(const unsigned long *addr, unsigned long size);
>   extern unsigned long _find_first_and_bit(const unsigned long *addr1,
>   					 const unsigned long *addr2, unsigned long size);
> @@ -41,7 +42,7 @@ unsigned long find_next_bit(const unsigned long *addr, unsigned long size,
>   		return val ? __ffs(val) : size;
>   	}
>   
> -	return _find_next_bit(addr, NULL, size, offset, 0UL, 0);
> +	return _find_next_bit(addr, NULL, size, offset, 0UL, 0, 0);
>   }
>   #endif
>   
> @@ -71,7 +72,38 @@ unsigned long find_next_and_bit(const unsigned long *addr1,
>   		return val ? __ffs(val) : size;
>   	}
>   
> -	return _find_next_bit(addr1, addr2, size, offset, 0UL, 0);
> +	return _find_next_bit(addr1, addr2, size, offset, 0UL, 0, 0);
> +}
> +#endif
> +
> +#ifndef find_next_andnot_bit
> +/**
> + * find_next_andnot_bit - find the next set bit in one memory region
> + *                        but not in the other
> + * @addr1: The first address to base the search on
> + * @addr2: The second address to base the search on
> + * @size: The bitmap size in bits
> + * @offset: The bitnumber to start searching at
> + *
> + * Returns the bit number for the next set bit
> + * If no bits are set, returns @size.
> + */
> +static inline
> +unsigned long find_next_andnot_bit(const unsigned long *addr1,
> +		const unsigned long *addr2, unsigned long size,
> +		unsigned long offset)
> +{
> +	if (small_const_nbits(size)) {
> +		unsigned long val;
> +
> +		if (unlikely(offset >= size))
> +			return size;
> +
> +		val = *addr1 & ~*addr2 & GENMASK(size - 1, offset);
> +		return val ? __ffs(val) : size;
> +	}
> +
> +	return _find_next_bit(addr1, addr2, size, offset, 0UL, 0, 1);
>   }
>   #endif
>   
> @@ -99,7 +131,7 @@ unsigned long find_next_zero_bit(const unsigned long *addr, unsigned long size,
>   		return val == ~0UL ? size : ffz(val);
>   	}
>   
> -	return _find_next_bit(addr, NULL, size, offset, ~0UL, 0);
> +	return _find_next_bit(addr, NULL, size, offset, ~0UL, 0, 0);
>   }
>   #endif
>   
> @@ -247,7 +279,7 @@ unsigned long find_next_zero_bit_le(const void *addr, unsigned
>   		return val == ~0UL ? size : ffz(val);
>   	}
>   
> -	return _find_next_bit(addr, NULL, size, offset, ~0UL, 1);
> +	return _find_next_bit(addr, NULL, size, offset, ~0UL, 1, 0);
>   }
>   #endif
>   
> @@ -266,7 +298,7 @@ unsigned long find_next_bit_le(const void *addr, unsigned
>   		return val ? __ffs(val) : size;
>   	}
>   
> -	return _find_next_bit(addr, NULL, size, offset, 0UL, 1);
> +	return _find_next_bit(addr, NULL, size, offset, 0UL, 1, 0);
>   }
>   #endif
>   
> diff --git a/include/linux/topology.h b/include/linux/topology.h
> index 4564faafd0e1..41bed4b883d3 100644
> --- a/include/linux/topology.h
> +++ b/include/linux/topology.h
> @@ -245,5 +245,50 @@ static inline const struct cpumask *cpu_cpu_mask(int cpu)
>   	return cpumask_of_node(cpu_to_node(cpu));
>   }
>   
> +#ifdef CONFIG_NUMA
> +extern const struct cpumask *sched_numa_hop_mask(int node, int hops);
> +#else
> +static inline const struct cpumask *sched_numa_hop_mask(int node, int hops)
> +{
> +	return ERR_PTR(-ENOTSUPP);
> +}
> +#endif	/* CONFIG_NUMA */
> +
> +/**
> + * for_each_numa_hop_cpu - iterate over CPUs by increasing NUMA distance,
> + *                         starting from a given node.
> + * @cpu: the iteration variable.
> + * @node: the NUMA node to start the search from.
> + *
> + * Requires rcu_lock to be held.
> + * Careful: this is a double loop, 'break' won't work as expected.
> + *
> + *
> + * Implementation notes:
> + *
> + * Providing it is valid, the mask returned by
> + *  sched_numa_hop_mask(node, hops+1)
> + * is a superset of the one returned by
> + *   sched_numa_hop_mask(node, hops)
> + * which may not be that useful for drivers that try to spread things out and
> + * want to visit a CPU not more than once.
> + *
> + * To accomodate for that, we use for_each_cpu_andnot() to iterate over the cpus
> + * of sched_numa_hop_mask(node, hops+1) with the CPUs of
> + * sched_numa_hop_mask(node, hops) removed, IOW we only iterate over CPUs
> + * a given distance away (rather than *up to* a given distance).
> + *
> + * h=0 forces us to play silly games and pass cpu_none_mask to
> + * for_each_cpu_andnot(), which turns it into for_each_cpu().
> + */
> +#define for_each_numa_hop_cpu(cpu, node)				       \
> +	for (struct { const struct cpumask *mask; int hops; } __v__ =	       \
> +		     { sched_numa_hop_mask(node, 0), 0 };		       \
> +	     !IS_ERR_OR_NULL(__v__.mask);				       \
> +	     __v__.hops++, __v__.mask = sched_numa_hop_mask(node, __v__.hops)) \
> +		for_each_cpu_andnot(cpu, __v__.mask,			       \
> +				    __v__.hops ?			       \
> +				    sched_numa_hop_mask(node, __v__.hops - 1) :\
> +				    cpu_none_mask)
>   
>   #endif /* _LINUX_TOPOLOGY_H */
> diff --git a/kernel/sched/Makefile b/kernel/sched/Makefile
> index 976092b7bd45..9182101f2c4f 100644
> --- a/kernel/sched/Makefile
> +++ b/kernel/sched/Makefile
> @@ -29,6 +29,6 @@ endif
>   # build parallelizes well and finishes roughly at once:
>   #
>   obj-y += core.o
> -obj-y += fair.o
> +obj-y += fair.o yolo.o
>   obj-y += build_policy.o
>   obj-y += build_utility.o
> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> index 8739c2a5a54e..f0236a0ae65c 100644
> --- a/kernel/sched/topology.c
> +++ b/kernel/sched/topology.c
> @@ -2067,6 +2067,34 @@ int sched_numa_find_closest(const struct cpumask *cpus, int cpu)
>   	return found;
>   }
>   
> +/**
> + * sched_numa_hop_mask() - Get the cpumask of CPUs at most @hops hops away.
> + * @node: The node to count hops from.
> + * @hops: Include CPUs up to that many hops away. 0 means local node.
> + *
> + * Requires rcu_lock to be held. Returned cpumask is only valid within that
> + * read-side section, copy it if required beyond that.
> + *
> + * Note that not all hops are equal in size; see sched_init_numa() for how
> + * distances and masks are handled.
> + *
> + * Also note that this is a reflection of sched_domains_numa_masks, which may change
> + * during the lifetime of the system (offline nodes are taken out of the masks).
> + */
> +const struct cpumask *sched_numa_hop_mask(int node, int hops)
> +{
> +	struct cpumask ***masks = rcu_dereference(sched_domains_numa_masks);
> +
> +	if (node >= nr_node_ids || hops >= sched_domains_numa_levels)
> +		return ERR_PTR(-EINVAL);
> +
> +	if (!masks)
> +		return NULL;
> +
> +	return masks[hops][node];
> +}
> +EXPORT_SYMBOL_GPL(sched_numa_hop_mask);
> +
>   #endif /* CONFIG_NUMA */
>   
>   static int __sdt_alloc(const struct cpumask *cpu_map)
> diff --git a/lib/cpumask.c b/lib/cpumask.c
> index a971a82d2f43..8bcf7e919193 100644
> --- a/lib/cpumask.c
> +++ b/lib/cpumask.c
> @@ -42,6 +42,25 @@ int cpumask_next_and(int n, const struct cpumask *src1p,
>   }
>   EXPORT_SYMBOL(cpumask_next_and);
>   
> +/**
> + * cpumask_next_andnot - get the next cpu in *src1p & ~*src2p
> + * @n: the cpu prior to the place to search (ie. return will be > @n)
> + * @src1p: the first cpumask pointer
> + * @src2p: the second cpumask pointer
> + *
> + * Returns >= nr_cpu_ids if no further cpus set in both.
> + */
> +int cpumask_next_andnot(int n, const struct cpumask *src1p,
> +		     const struct cpumask *src2p)
> +{
> +	/* -1 is a legal arg here. */
> +	if (n != -1)
> +		cpumask_check(n);
> +	return find_next_andnot_bit(cpumask_bits(src1p), cpumask_bits(src2p),
> +		nr_cpumask_bits, n + 1);
> +}
> +EXPORT_SYMBOL(cpumask_next_andnot);
> +
>   /**
>    * cpumask_any_but - return a "random" in a cpumask, but not this one.
>    * @mask: the cpumask to search
> diff --git a/lib/find_bit.c b/lib/find_bit.c
> index 1b8e4b2a9cba..6e5f42c621a9 100644
> --- a/lib/find_bit.c
> +++ b/lib/find_bit.c
> @@ -21,17 +21,19 @@
>   
>   #if !defined(find_next_bit) || !defined(find_next_zero_bit) ||			\
>   	!defined(find_next_bit_le) || !defined(find_next_zero_bit_le) ||	\
> -	!defined(find_next_and_bit)
> +	!defined(find_next_and_bit) || !defined(find_next_andnot_bit)
>   /*
>    * This is a common helper function for find_next_bit, find_next_zero_bit, and
>    * find_next_and_bit. The differences are:
>    *  - The "invert" argument, which is XORed with each fetched word before
>    *    searching it for one bits.
> - *  - The optional "addr2", which is anded with "addr1" if present.
> + *  - The optional "addr2", negated if "negate" and ANDed with "addr1" if
> + *    present.
>    */
>   unsigned long _find_next_bit(const unsigned long *addr1,
>   		const unsigned long *addr2, unsigned long nbits,
> -		unsigned long start, unsigned long invert, unsigned long le)
> +		unsigned long start, unsigned long invert, unsigned long le,
> +		bool negate)
>   {
>   	unsigned long tmp, mask;
>   
> @@ -40,7 +42,9 @@ unsigned long _find_next_bit(const unsigned long *addr1,
>   
>   	tmp = addr1[start / BITS_PER_LONG];
>   	if (addr2)
> -		tmp &= addr2[start / BITS_PER_LONG];
> +		tmp &= negate ?
> +		       ~addr2[start / BITS_PER_LONG] :
> +			addr2[start / BITS_PER_LONG];
>   	tmp ^= invert;
>   
>   	/* Handle 1st word. */
> @@ -59,7 +63,9 @@ unsigned long _find_next_bit(const unsigned long *addr1,
>   
>   		tmp = addr1[start / BITS_PER_LONG];
>   		if (addr2)
> -			tmp &= addr2[start / BITS_PER_LONG];
> +			tmp &= negate ?
> +			       ~addr2[start / BITS_PER_LONG] :
> +				addr2[start / BITS_PER_LONG];
>   		tmp ^= invert;
>   	}
>   
>
Tariq Toukan Aug. 14, 2022, 8:26 a.m. UTC | #5
>> If you don't have major gripes with it, I'll shape that into a proper
>> series and will let you handle the mlx5/enic bits.
>>

Sure, I can take the drivers/networking parts. In order to submit the API 
and its usage combined in one patchset, I can send you these parts 
privately and you can combine them into your submitted series, or the other 
way around if you want me to do the submission.
Both work for me.
I will do it once we converge on the API.

Important note: I'll be out-of-office, with very limited access to 
email, until Sep 1st. I doubt I can progress much before then.
Valentin Schneider Aug. 15, 2022, 2:20 p.m. UTC | #6
On 14/08/22 11:19, Tariq Toukan wrote:
> The API is indeed easy to use, the driver part looks straight forward.
>
> I appreciate the tricks you used to make it work!
> However, the implementation is relatively complicated, not easy to read
> or understand, and touches several files. I do understand what you did
> here, but I guess not all respective maintainers will like it. Let's see.
>

Dumping it all into a single diff also doesn't help :-) I think the changes to
get a for_each_cpu_andnot() are straightforward enough; the one eyesore is
the macro, but I consider it a necessary evil to get an allocation-free
interface.

> One alternative to consider, that will simplify things up, is switching
> back to returning an array of cpus, ordered by their distance, up to a
> provided argument 'npus'.
> This way, you will iterate over sched_numa_hop_mask() internally, easily
> maintaining the cpumask diffs between two hops, without the need of
> making it on-the-fly as part an an exposed for-loop macro.
>

That requires extra storage, however: at the very least the array, and a
temp cpumask to remember already-visited CPUs (the alternative being
scanning the array on every CPU iteration to figure out whether it's been
added already).

I'm going to submit the cpumask / sched changes; hopefully I'll get to
something by the time you're back from PTO.
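
For comparison, a rough sketch of the allocation-free pattern that the
for_each_numa_hop_cpu() macro from the earlier diff encapsulates; only a
pointer to the previous hop's mask is carried across iterations, with no
temporary cpumask or array ('node' is again an assumed caller-supplied NUMA
node id):

	const struct cpumask *curr, *prev = cpu_none_mask;
	int hops, cpu;

	rcu_read_lock();
	for (hops = 0; ; hops++) {
		curr = sched_numa_hop_mask(node, hops);
		if (IS_ERR_OR_NULL(curr))
			break;
		for_each_cpu_andnot(cpu, curr, prev) {
			/* 'cpu' is exactly 'hops' hops away from 'node' */
		}
		prev = curr;
	}
	rcu_read_unlock();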

Patch

diff --git a/include/linux/topology.h b/include/linux/topology.h
index 4564faafd0e1..d66e3cf40823 100644
--- a/include/linux/topology.h
+++ b/include/linux/topology.h
@@ -245,5 +245,17 @@  static inline const struct cpumask *cpu_cpu_mask(int cpu)
 	return cpumask_of_node(cpu_to_node(cpu));
 }
 
+#ifdef CONFIG_NUMA
+extern const struct cpumask *sched_numa_hop_mask(int node, int hops);
+#else
+static inline const struct cpumask *sched_numa_hop_mask(int node, int hops)
+{
+	return -ENOTSUPP;
+}
+#endif	/* CONFIG_NUMA */
+
+#define for_each_numa_hop_mask(node, hops, mask)			\
+	for (mask = sched_numa_hop_mask(node, hops); !IS_ERR_OR_NULL(mask); \
+	     mask = sched_numa_hop_mask(node, ++hops))
 
 #endif /* _LINUX_TOPOLOGY_H */
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 8739c2a5a54e..f0236a0ae65c 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -2067,6 +2067,34 @@  int sched_numa_find_closest(const struct cpumask *cpus, int cpu)
 	return found;
 }
 
+/**
+ * sched_numa_hop_mask() - Get the cpumask of CPUs at most @hops hops away.
+ * @node: The node to count hops from.
+ * @hops: Include CPUs up to that many hops away. 0 means local node.
+ *
+ * Requires rcu_lock to be held. Returned cpumask is only valid within that
+ * read-side section, copy it if required beyond that.
+ *
+ * Note that not all hops are equal in size; see sched_init_numa() for how
+ * distances and masks are handled.
+ *
+ * Also note that this is a reflection of sched_domains_numa_masks, which may change
+ * during the lifetime of the system (offline nodes are taken out of the masks).
+ */
+const struct cpumask *sched_numa_hop_mask(int node, int hops)
+{
+	struct cpumask ***masks = rcu_dereference(sched_domains_numa_masks);
+
+	if (node >= nr_node_ids || hops >= sched_domains_numa_levels)
+		return ERR_PTR(-EINVAL);
+
+	if (!masks)
+		return NULL;
+
+	return masks[hops][node];
+}
+EXPORT_SYMBOL_GPL(sched_numa_hop_mask);
+
 #endif /* CONFIG_NUMA */
 
 static int __sdt_alloc(const struct cpumask *cpu_map)