Message ID: 20250207211104.30009-6-arighi@nvidia.com (mailing list archive)
State:      Not Applicable
Series:     [1/6] mm/numa: Introduce numa_nearest_nodemask()

Checks:
  netdev/tree_selection: success (Not a local patch)
Hello, On Fri, Feb 07, 2025 at 09:40:52PM +0100, Andrea Righi wrote: > +/* > + * cpumasks to track idle CPUs within each NUMA node. > + * > + * If SCX_OPS_BUILTIN_IDLE_PER_NODE is not enabled, a single global cpumask > + * from is used to track all the idle CPUs in the system. > + */ > +struct idle_cpus { > cpumask_var_t cpu; > cpumask_var_t smt; > -} idle_masks CL_ALIGNED_IF_ONSTACK; > +}; Can you prefix the type name with scx_? Unrelated to this series but I wonder whether we can replace "smt" with "core" in the future to become more consistent with how the terms are used in the kernel: struct scx_idle_masks { cpumask_var_t cpus; cpumask_var_t cores; }; We expose "smt" name through kfuncs but we can rename that to "core" through compat macros later too. > +/* > + * Find the best idle CPU in the system, relative to @node. > + */ > +s32 scx_pick_idle_cpu(const struct cpumask *cpus_allowed, int node, u64 flags) > +{ > + nodemask_t unvisited = NODE_MASK_ALL; > + s32 cpu = -EBUSY; > + > + if (!static_branch_maybe(CONFIG_NUMA, &scx_builtin_idle_per_node)) > + return pick_idle_cpu_from_node(cpus_allowed, NUMA_NO_NODE, flags); > + > + /* > + * If an initial node is not specified, start with the current > + * node. > + */ > + if (node == NUMA_NO_NODE) > + node = numa_node_id(); > + > + /* > + * Traverse all nodes in order of increasing distance, starting > + * from @node. > + * > + * This loop is O(N^2), with N being the amount of NUMA nodes, > + * which might be quite expensive in large NUMA systems. However, > + * this complexity comes into play only when a scheduler enables > + * SCX_OPS_BUILTIN_IDLE_PER_NODE and it's requesting an idle CPU > + * without specifying a target NUMA node, so it shouldn't be a > + * bottleneck is most cases. > + * > + * As a future optimization we may want to cache the list of hop > + * nodes in a per-node array, instead of actually traversing them > + * every time. > + */ > + for_each_numa_node(node, unvisited, N_POSSIBLE) { > + cpu = pick_idle_cpu_from_node(cpus_allowed, node, flags); Maybe rename pick_idle_cpu_in_node() to stay in sync with SCX_PICK_IDLE_IN_NODE? It's not like pick_idle_cpu_from_node() walks from the node, right? It just picks within the node. > @@ -460,38 +582,50 @@ s32 scx_select_cpu_dfl(struct task_struct *p, s32 prev_cpu, u64 wake_flags, bool > > void scx_idle_reset_masks(void) > { > + int node; > + > + if (!static_branch_maybe(CONFIG_NUMA, &scx_builtin_idle_per_node)) { > + cpumask_copy(idle_cpumask(NUMA_NO_NODE)->cpu, cpu_online_mask); > + cpumask_copy(idle_cpumask(NUMA_NO_NODE)->smt, cpu_online_mask); > + return; > + } > + > /* > * Consider all online cpus idle. Should converge to the actual state > * quickly. > */ > - cpumask_copy(idle_masks.cpu, cpu_online_mask); > - cpumask_copy(idle_masks.smt, cpu_online_mask); > -} > + for_each_node(node) { > + const struct cpumask *node_mask = cpumask_of_node(node); > + struct cpumask *idle_cpus = idle_cpumask(node)->cpu; > + struct cpumask *idle_smts = idle_cpumask(node)->smt; > -void scx_idle_init_masks(void) > -{ > - BUG_ON(!alloc_cpumask_var(&idle_masks.cpu, GFP_KERNEL)); > - BUG_ON(!alloc_cpumask_var(&idle_masks.smt, GFP_KERNEL)); > + cpumask_and(idle_cpus, cpu_online_mask, node_mask); > + cpumask_copy(idle_smts, idle_cpus); > + } nitpick: Maybe something like the following is more symmetric with the global case and easier to read? 
for_each_node(node) { const struct cpumask *node_mask = cpumask_of_node(node); cpumask_and(idle_cpumask(node)->cpu, cpu_online_mask, node_mask); cpumask_and(idle_cpumask(node)->smt, cpu_online_mask, node_mask); } > } > > static void update_builtin_idle(int cpu, bool idle) > { > - assign_cpu(cpu, idle_masks.cpu, idle); > + int node = idle_cpu_to_node(cpu); minor: I wonder whether idle_cpu_to_node() name is a bit confusing - why does a CPU being idle have anything to do with its node mapping? If there is a better naming convention, great. If not, it is what it is. Thanks.
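For reference, combining the two suggestions above (the scx_-prefixed type name and the symmetric cpumask_and() form of the per-node reset) would look roughly like the sketch below. This is only an illustration of the review comments, not the final code; the cpu/smt field names are kept as in the patch since the smt-to-core rename is deferred.

struct scx_idle_masks {
	cpumask_var_t cpu;
	cpumask_var_t smt;
};

void scx_idle_reset_masks(void)
{
	int node;

	if (!static_branch_maybe(CONFIG_NUMA, &scx_builtin_idle_per_node)) {
		cpumask_copy(idle_cpumask(NUMA_NO_NODE)->cpu, cpu_online_mask);
		cpumask_copy(idle_cpumask(NUMA_NO_NODE)->smt, cpu_online_mask);
		return;
	}

	/* Consider all online CPUs idle; the state converges quickly. */
	for_each_node(node) {
		const struct cpumask *node_mask = cpumask_of_node(node);

		cpumask_and(idle_cpumask(node)->cpu, cpu_online_mask, node_mask);
		cpumask_and(idle_cpumask(node)->smt, cpu_online_mask, node_mask);
	}
}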
On Fri, Feb 07, 2025 at 12:30:37PM -1000, Tejun Heo wrote: > Hello, > > On Fri, Feb 07, 2025 at 09:40:52PM +0100, Andrea Righi wrote: > > +/* > > + * cpumasks to track idle CPUs within each NUMA node. > > + * > > + * If SCX_OPS_BUILTIN_IDLE_PER_NODE is not enabled, a single global cpumask > > + * from is used to track all the idle CPUs in the system. > > + */ > > +struct idle_cpus { > > cpumask_var_t cpu; > > cpumask_var_t smt; > > -} idle_masks CL_ALIGNED_IF_ONSTACK; > > +}; > > Can you prefix the type name with scx_? > > Unrelated to this series but I wonder whether we can replace "smt" with > "core" in the future to become more consistent with how the terms are used > in the kernel: > > struct scx_idle_masks { > cpumask_var_t cpus; > cpumask_var_t cores; > }; > > We expose "smt" name through kfuncs but we can rename that to "core" through > compat macros later too. > > > +/* > > + * Find the best idle CPU in the system, relative to @node. > > + */ > > +s32 scx_pick_idle_cpu(const struct cpumask *cpus_allowed, int node, u64 flags) > > +{ > > + nodemask_t unvisited = NODE_MASK_ALL; > > + s32 cpu = -EBUSY; > > + > > + if (!static_branch_maybe(CONFIG_NUMA, &scx_builtin_idle_per_node)) > > + return pick_idle_cpu_from_node(cpus_allowed, NUMA_NO_NODE, flags); > > + > > + /* > > + * If an initial node is not specified, start with the current > > + * node. > > + */ > > + if (node == NUMA_NO_NODE) > > + node = numa_node_id(); > > + > > + /* > > + * Traverse all nodes in order of increasing distance, starting > > + * from @node. > > + * > > + * This loop is O(N^2), with N being the amount of NUMA nodes, > > + * which might be quite expensive in large NUMA systems. However, > > + * this complexity comes into play only when a scheduler enables > > + * SCX_OPS_BUILTIN_IDLE_PER_NODE and it's requesting an idle CPU > > + * without specifying a target NUMA node, so it shouldn't be a > > + * bottleneck is most cases. > > + * > > + * As a future optimization we may want to cache the list of hop > > + * nodes in a per-node array, instead of actually traversing them > > + * every time. > > + */ > > + for_each_numa_node(node, unvisited, N_POSSIBLE) { > > + cpu = pick_idle_cpu_from_node(cpus_allowed, node, flags); > > Maybe rename pick_idle_cpu_in_node() to stay in sync with > SCX_PICK_IDLE_IN_NODE? It's not like pick_idle_cpu_from_node() walks from > the node, right? It just picks within the node. > > > @@ -460,38 +582,50 @@ s32 scx_select_cpu_dfl(struct task_struct *p, s32 prev_cpu, u64 wake_flags, bool > > > > void scx_idle_reset_masks(void) > > { > > + int node; > > + > > + if (!static_branch_maybe(CONFIG_NUMA, &scx_builtin_idle_per_node)) { > > + cpumask_copy(idle_cpumask(NUMA_NO_NODE)->cpu, cpu_online_mask); > > + cpumask_copy(idle_cpumask(NUMA_NO_NODE)->smt, cpu_online_mask); > > + return; > > + } > > + > > /* > > * Consider all online cpus idle. Should converge to the actual state > > * quickly. 
> > */ > > - cpumask_copy(idle_masks.cpu, cpu_online_mask); > > - cpumask_copy(idle_masks.smt, cpu_online_mask); > > -} > > + for_each_node(node) { > > + const struct cpumask *node_mask = cpumask_of_node(node); > > + struct cpumask *idle_cpus = idle_cpumask(node)->cpu; > > + struct cpumask *idle_smts = idle_cpumask(node)->smt; > > -void scx_idle_init_masks(void) > > -{ > > - BUG_ON(!alloc_cpumask_var(&idle_masks.cpu, GFP_KERNEL)); > > - BUG_ON(!alloc_cpumask_var(&idle_masks.smt, GFP_KERNEL)); > > + cpumask_and(idle_cpus, cpu_online_mask, node_mask); > > + cpumask_copy(idle_smts, idle_cpus); > > + } > > nitpick: Maybe something like the following is more symmetric with the > global case and easier to read? > > for_each_node(node) { > const struct cpumask *node_mask = cpumask_of_node(node); > cpumask_and(idle_cpumask(node)->cpu, cpu_online_mask, node_mask); > cpumask_and(idle_cpumask(node)->smt, cpu_online_mask, node_mask); > } > > > } > > > > static void update_builtin_idle(int cpu, bool idle) > > { > > - assign_cpu(cpu, idle_masks.cpu, idle); > > + int node = idle_cpu_to_node(cpu); Ok to all of the above. > > minor: I wonder whether idle_cpu_to_node() name is a bit confusing - why > does a CPU being idle have anything to do with its node mapping? If there is > a better naming convention, great. If not, it is what it is. Maybe scx_cpu_to_node()? At the end it's just a wrapper to cpu_to_node(), but from the scx perspective, so if NUMA-awareness is not enabled in scx, it'd return NUMA_NO_NODE. Thanks, -Andrea
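A minimal sketch of the rename Andrea floats here: same body as idle_cpu_to_node() in the patch, only the name changes, making it clearer that the NUMA_NO_NODE fallback reflects the scx idle-tracking mode rather than the CPU's idle state.

static int scx_cpu_to_node(int cpu)
{
	/* Without per-node idle tracking, everything maps to the global masks. */
	if (!static_branch_maybe(CONFIG_NUMA, &scx_builtin_idle_per_node))
		return NUMA_NO_NODE;

	return cpu_to_node(cpu);
}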
On Fri, Feb 07, 2025 at 09:40:52PM +0100, Andrea Righi wrote: > Using a single global idle mask can lead to inefficiencies and a lot of > stress on the cache coherency protocol on large systems with multiple > NUMA nodes, since all the CPUs can create a really intense read/write > activity on the single global cpumask. Can you put your perf numbers here too? > Therefore, split the global cpumask into multiple per-NUMA node cpumasks > to improve scalability and performance on large systems. > > The concept is that each cpumask will track only the idle CPUs within > its corresponding NUMA node, treating CPUs in other NUMA nodes as busy. > In this way concurrent access to the idle cpumask will be restricted > within each NUMA node. > > The split of multiple per-node idle cpumasks can be controlled using the > SCX_OPS_BUILTIN_IDLE_PER_NODE flag. > > By default SCX_OPS_BUILTIN_IDLE_PER_NODE is not enabled and a global > host-wide idle cpumask is used, maintaining the previous behavior. > > NOTE: if a scheduler explicitly enables the per-node idle cpumasks (via > SCX_OPS_BUILTIN_IDLE_PER_NODE), scx_bpf_get_idle_cpu/smtmask() will > trigger an scx error, since there are no system-wide cpumasks. > > Signed-off-by: Andrea Righi <arighi@nvidia.com> > --- > kernel/sched/ext_idle.c | 242 ++++++++++++++++++++++++++++++++-------- > kernel/sched/ext_idle.h | 11 +- > 2 files changed, 203 insertions(+), 50 deletions(-) > > diff --git a/kernel/sched/ext_idle.c b/kernel/sched/ext_idle.c > index a3f2b00903ac2..4b90ec9018c1a 100644 > --- a/kernel/sched/ext_idle.c > +++ b/kernel/sched/ext_idle.c > @@ -18,25 +18,88 @@ DEFINE_STATIC_KEY_FALSE(scx_builtin_idle_enabled); > DEFINE_STATIC_KEY_FALSE(scx_builtin_idle_per_node); > > #ifdef CONFIG_SMP > -#ifdef CONFIG_CPUMASK_OFFSTACK > -#define CL_ALIGNED_IF_ONSTACK > -#else > -#define CL_ALIGNED_IF_ONSTACK __cacheline_aligned_in_smp > -#endif > - > /* Enable/disable LLC aware optimizations */ > DEFINE_STATIC_KEY_FALSE(scx_selcpu_topo_llc); > > /* Enable/disable NUMA aware optimizations */ > DEFINE_STATIC_KEY_FALSE(scx_selcpu_topo_numa); > > -static struct { > +/* > + * cpumasks to track idle CPUs within each NUMA node. > + * > + * If SCX_OPS_BUILTIN_IDLE_PER_NODE is not enabled, a single global cpumask > + * from is used to track all the idle CPUs in the system. > + */ > +struct idle_cpus { > cpumask_var_t cpu; > cpumask_var_t smt; > -} idle_masks CL_ALIGNED_IF_ONSTACK; > +}; > + > +/* > + * Global host-wide idle cpumasks (used when SCX_OPS_BUILTIN_IDLE_PER_NODE > + * is not enabled). > + */ > +static struct idle_cpus scx_idle_global_masks; > + > +/* > + * Per-node idle cpumasks. > + */ > +static struct idle_cpus **scx_idle_node_masks; > + > +/* > + * Initialize per-node idle cpumasks. > + * > + * In case of a single NUMA node or if NUMA support is disabled, only a > + * single global host-wide cpumask will be initialized. 
> + */ > +void scx_idle_init_masks(void) > +{ > + int node; > + > + /* Allocate global idle cpumasks */ > + BUG_ON(!alloc_cpumask_var(&scx_idle_global_masks.cpu, GFP_KERNEL)); > + BUG_ON(!alloc_cpumask_var(&scx_idle_global_masks.smt, GFP_KERNEL)); > + > + /* Allocate per-node idle cpumasks */ > + scx_idle_node_masks = kcalloc(num_possible_nodes(), > + sizeof(*scx_idle_node_masks), GFP_KERNEL); > + BUG_ON(!scx_idle_node_masks); > + > + for_each_node(node) { > + scx_idle_node_masks[node] = kzalloc_node(sizeof(**scx_idle_node_masks), > + GFP_KERNEL, node); > + BUG_ON(!scx_idle_node_masks[node]); > + > + BUG_ON(!alloc_cpumask_var_node(&scx_idle_node_masks[node]->cpu, GFP_KERNEL, node)); > + BUG_ON(!alloc_cpumask_var_node(&scx_idle_node_masks[node]->smt, GFP_KERNEL, node)); > + } > +} > + > +/* > + * Return the idle masks associated to a target @node. > + */ > +static struct idle_cpus *idle_cpumask(int node) > +{ > + return node == NUMA_NO_NODE ? &scx_idle_global_masks : scx_idle_node_masks[node]; > +} > + > +/* > + * Return the node id associated to a target idle CPU (used to determine > + * the proper idle cpumask). > + */ > +static int idle_cpu_to_node(int cpu) > +{ > + if (!static_branch_maybe(CONFIG_NUMA, &scx_builtin_idle_per_node)) > + return NUMA_NO_NODE; > + > + return cpu_to_node(cpu); > +} > > bool scx_idle_test_and_clear_cpu(int cpu) > { > + int node = idle_cpu_to_node(cpu); > + struct cpumask *idle_cpus = idle_cpumask(node)->cpu; > + > #ifdef CONFIG_SCHED_SMT > /* > * SMT mask should be cleared whether we can claim @cpu or not. The SMT > @@ -45,33 +108,38 @@ bool scx_idle_test_and_clear_cpu(int cpu) > */ > if (sched_smt_active()) { > const struct cpumask *smt = cpu_smt_mask(cpu); > + struct cpumask *idle_smts = idle_cpumask(node)->smt; > > /* > * If offline, @cpu is not its own sibling and > * scx_pick_idle_cpu() can get caught in an infinite loop as > - * @cpu is never cleared from idle_masks.smt. Ensure that @cpu > - * is eventually cleared. > + * @cpu is never cleared from the idle SMT mask. Ensure that > + * @cpu is eventually cleared. > * > * NOTE: Use cpumask_intersects() and cpumask_test_cpu() to > * reduce memory writes, which may help alleviate cache > * coherence pressure. > */ > - if (cpumask_intersects(smt, idle_masks.smt)) > - cpumask_andnot(idle_masks.smt, idle_masks.smt, smt); > - else if (cpumask_test_cpu(cpu, idle_masks.smt)) > - __cpumask_clear_cpu(cpu, idle_masks.smt); > + if (cpumask_intersects(smt, idle_smts)) > + cpumask_andnot(idle_smts, idle_smts, smt); > + else if (cpumask_test_cpu(cpu, idle_smts)) > + __cpumask_clear_cpu(cpu, idle_smts); > } > #endif > - return cpumask_test_and_clear_cpu(cpu, idle_masks.cpu); > + > + return cpumask_test_and_clear_cpu(cpu, idle_cpus); > } > > -s32 scx_pick_idle_cpu(const struct cpumask *cpus_allowed, u64 flags) > +/* > + * Pick an idle CPU in a specific NUMA node. 
> + */ > +s32 pick_idle_cpu_from_node(const struct cpumask *cpus_allowed, int node, u64 flags) > { > int cpu; > > retry: > if (sched_smt_active()) { > - cpu = cpumask_any_and_distribute(idle_masks.smt, cpus_allowed); > + cpu = cpumask_any_and_distribute(idle_cpumask(node)->smt, cpus_allowed); > if (cpu < nr_cpu_ids) > goto found; > > @@ -79,7 +147,7 @@ s32 scx_pick_idle_cpu(const struct cpumask *cpus_allowed, u64 flags) > return -EBUSY; > } > > - cpu = cpumask_any_and_distribute(idle_masks.cpu, cpus_allowed); > + cpu = cpumask_any_and_distribute(idle_cpumask(node)->cpu, cpus_allowed); > if (cpu >= nr_cpu_ids) > return -EBUSY; > > @@ -90,6 +158,55 @@ s32 scx_pick_idle_cpu(const struct cpumask *cpus_allowed, u64 flags) > goto retry; > } > > +/* > + * Find the best idle CPU in the system, relative to @node. > + */ > +s32 scx_pick_idle_cpu(const struct cpumask *cpus_allowed, int node, u64 flags) > +{ > + nodemask_t unvisited = NODE_MASK_ALL; > + s32 cpu = -EBUSY; > + > + if (!static_branch_maybe(CONFIG_NUMA, &scx_builtin_idle_per_node)) > + return pick_idle_cpu_from_node(cpus_allowed, NUMA_NO_NODE, flags); > + > + /* > + * If an initial node is not specified, start with the current > + * node. > + */ > + if (node == NUMA_NO_NODE) > + node = numa_node_id(); > + > + /* > + * Traverse all nodes in order of increasing distance, starting > + * from @node. > + * > + * This loop is O(N^2), with N being the amount of NUMA nodes, > + * which might be quite expensive in large NUMA systems. However, > + * this complexity comes into play only when a scheduler enables > + * SCX_OPS_BUILTIN_IDLE_PER_NODE and it's requesting an idle CPU > + * without specifying a target NUMA node, so it shouldn't be a > + * bottleneck is most cases. > + * > + * As a future optimization we may want to cache the list of hop > + * nodes in a per-node array, instead of actually traversing them > + * every time. > + */ > + for_each_numa_node(node, unvisited, N_POSSIBLE) { > + cpu = pick_idle_cpu_from_node(cpus_allowed, node, flags); > + if (cpu >= 0) > + break; > + > + /* > + * Check if the search is restricted to the same core or > + * the same node. > + */ > + if (flags & SCX_PICK_IDLE_IN_NODE) > + break; If SCX_PICK_IDLE_IN_NODE is set, you can avoid the loop at all, right? Just: if (flags & SCX_PICK_IDLE_IN_NODE) return pick_idle_cpu_from_node(cpus_allowed, node, flags); for_each_numa_node(node, unvisited, N_POSSIBLE) { cpu = pick_idle_cpu_from_node(cpus_allowed, node, flags); if (cpu >= 0) return cpu; } Thanks, Yury
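Folding Yury's suggestion into the function quoted above gives roughly the following shape (a sketch based on the code in this patch, not the final version): SCX_PICK_IDLE_IN_NODE is handled before the traversal, so the loop only runs for cross-node searches.

s32 scx_pick_idle_cpu(const struct cpumask *cpus_allowed, int node, u64 flags)
{
	nodemask_t unvisited = NODE_MASK_ALL;
	s32 cpu;

	if (!static_branch_maybe(CONFIG_NUMA, &scx_builtin_idle_per_node))
		return pick_idle_cpu_from_node(cpus_allowed, NUMA_NO_NODE, flags);

	if (node == NUMA_NO_NODE)
		node = numa_node_id();

	/* A node-restricted search never needs to leave @node. */
	if (flags & SCX_PICK_IDLE_IN_NODE)
		return pick_idle_cpu_from_node(cpus_allowed, node, flags);

	/* Visit the remaining nodes in order of increasing distance. */
	for_each_numa_node(node, unvisited, N_POSSIBLE) {
		cpu = pick_idle_cpu_from_node(cpus_allowed, node, flags);
		if (cpu >= 0)
			return cpu;
	}

	return -EBUSY;
}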
On Sun, Feb 09, 2025 at 01:07:44PM -0500, Yury Norov wrote: > On Fri, Feb 07, 2025 at 09:40:52PM +0100, Andrea Righi wrote: > > Using a single global idle mask can lead to inefficiencies and a lot of > > stress on the cache coherency protocol on large systems with multiple > > NUMA nodes, since all the CPUs can create a really intense read/write > > activity on the single global cpumask. > > Can you put your perf numbers here too? > > > Therefore, split the global cpumask into multiple per-NUMA node cpumasks > > to improve scalability and performance on large systems. > > > > The concept is that each cpumask will track only the idle CPUs within > > its corresponding NUMA node, treating CPUs in other NUMA nodes as busy. > > In this way concurrent access to the idle cpumask will be restricted > > within each NUMA node. > > > > The split of multiple per-node idle cpumasks can be controlled using the > > SCX_OPS_BUILTIN_IDLE_PER_NODE flag. > > > > By default SCX_OPS_BUILTIN_IDLE_PER_NODE is not enabled and a global > > host-wide idle cpumask is used, maintaining the previous behavior. > > > > NOTE: if a scheduler explicitly enables the per-node idle cpumasks (via > > SCX_OPS_BUILTIN_IDLE_PER_NODE), scx_bpf_get_idle_cpu/smtmask() will > > trigger an scx error, since there are no system-wide cpumasks. > > > > Signed-off-by: Andrea Righi <arighi@nvidia.com> > > --- > > kernel/sched/ext_idle.c | 242 ++++++++++++++++++++++++++++++++-------- > > kernel/sched/ext_idle.h | 11 +- > > 2 files changed, 203 insertions(+), 50 deletions(-) > > > > diff --git a/kernel/sched/ext_idle.c b/kernel/sched/ext_idle.c > > index a3f2b00903ac2..4b90ec9018c1a 100644 > > --- a/kernel/sched/ext_idle.c > > +++ b/kernel/sched/ext_idle.c > > @@ -18,25 +18,88 @@ DEFINE_STATIC_KEY_FALSE(scx_builtin_idle_enabled); > > DEFINE_STATIC_KEY_FALSE(scx_builtin_idle_per_node); > > > > #ifdef CONFIG_SMP > > -#ifdef CONFIG_CPUMASK_OFFSTACK > > -#define CL_ALIGNED_IF_ONSTACK > > -#else > > -#define CL_ALIGNED_IF_ONSTACK __cacheline_aligned_in_smp > > -#endif > > - > > /* Enable/disable LLC aware optimizations */ > > DEFINE_STATIC_KEY_FALSE(scx_selcpu_topo_llc); > > > > /* Enable/disable NUMA aware optimizations */ > > DEFINE_STATIC_KEY_FALSE(scx_selcpu_topo_numa); > > > > -static struct { > > +/* > > + * cpumasks to track idle CPUs within each NUMA node. > > + * > > + * If SCX_OPS_BUILTIN_IDLE_PER_NODE is not enabled, a single global cpumask > > + * from is used to track all the idle CPUs in the system. > > + */ > > +struct idle_cpus { > > cpumask_var_t cpu; > > cpumask_var_t smt; > > -} idle_masks CL_ALIGNED_IF_ONSTACK; > > +}; > > + > > +/* > > + * Global host-wide idle cpumasks (used when SCX_OPS_BUILTIN_IDLE_PER_NODE > > + * is not enabled). > > + */ > > +static struct idle_cpus scx_idle_global_masks; > > + > > +/* > > + * Per-node idle cpumasks. > > + */ > > +static struct idle_cpus **scx_idle_node_masks; > > + > > +/* > > + * Initialize per-node idle cpumasks. > > + * > > + * In case of a single NUMA node or if NUMA support is disabled, only a > > + * single global host-wide cpumask will be initialized. 
> > + */ > > +void scx_idle_init_masks(void) > > +{ > > + int node; > > + > > + /* Allocate global idle cpumasks */ > > + BUG_ON(!alloc_cpumask_var(&scx_idle_global_masks.cpu, GFP_KERNEL)); > > + BUG_ON(!alloc_cpumask_var(&scx_idle_global_masks.smt, GFP_KERNEL)); > > + > > + /* Allocate per-node idle cpumasks */ > > + scx_idle_node_masks = kcalloc(num_possible_nodes(), > > + sizeof(*scx_idle_node_masks), GFP_KERNEL); > > + BUG_ON(!scx_idle_node_masks); > > + > > + for_each_node(node) { > > + scx_idle_node_masks[node] = kzalloc_node(sizeof(**scx_idle_node_masks), > > + GFP_KERNEL, node); > > + BUG_ON(!scx_idle_node_masks[node]); > > + > > + BUG_ON(!alloc_cpumask_var_node(&scx_idle_node_masks[node]->cpu, GFP_KERNEL, node)); > > + BUG_ON(!alloc_cpumask_var_node(&scx_idle_node_masks[node]->smt, GFP_KERNEL, node)); > > + } > > +} > > + > > +/* > > + * Return the idle masks associated to a target @node. > > + */ > > +static struct idle_cpus *idle_cpumask(int node) > > +{ > > + return node == NUMA_NO_NODE ? &scx_idle_global_masks : scx_idle_node_masks[node]; > > +} > > + > > +/* > > + * Return the node id associated to a target idle CPU (used to determine > > + * the proper idle cpumask). > > + */ > > +static int idle_cpu_to_node(int cpu) > > +{ > > + if (!static_branch_maybe(CONFIG_NUMA, &scx_builtin_idle_per_node)) > > + return NUMA_NO_NODE; > > + > > + return cpu_to_node(cpu); > > +} > > > > bool scx_idle_test_and_clear_cpu(int cpu) > > { > > + int node = idle_cpu_to_node(cpu); > > + struct cpumask *idle_cpus = idle_cpumask(node)->cpu; > > + > > #ifdef CONFIG_SCHED_SMT > > /* > > * SMT mask should be cleared whether we can claim @cpu or not. The SMT > > @@ -45,33 +108,38 @@ bool scx_idle_test_and_clear_cpu(int cpu) > > */ > > if (sched_smt_active()) { > > const struct cpumask *smt = cpu_smt_mask(cpu); > > + struct cpumask *idle_smts = idle_cpumask(node)->smt; > > > > /* > > * If offline, @cpu is not its own sibling and > > * scx_pick_idle_cpu() can get caught in an infinite loop as > > - * @cpu is never cleared from idle_masks.smt. Ensure that @cpu > > - * is eventually cleared. > > + * @cpu is never cleared from the idle SMT mask. Ensure that > > + * @cpu is eventually cleared. > > * > > * NOTE: Use cpumask_intersects() and cpumask_test_cpu() to > > * reduce memory writes, which may help alleviate cache > > * coherence pressure. > > */ > > - if (cpumask_intersects(smt, idle_masks.smt)) > > - cpumask_andnot(idle_masks.smt, idle_masks.smt, smt); > > - else if (cpumask_test_cpu(cpu, idle_masks.smt)) > > - __cpumask_clear_cpu(cpu, idle_masks.smt); > > + if (cpumask_intersects(smt, idle_smts)) > > + cpumask_andnot(idle_smts, idle_smts, smt); > > + else if (cpumask_test_cpu(cpu, idle_smts)) > > + __cpumask_clear_cpu(cpu, idle_smts); > > } > > #endif > > - return cpumask_test_and_clear_cpu(cpu, idle_masks.cpu); > > + > > + return cpumask_test_and_clear_cpu(cpu, idle_cpus); > > } > > > > -s32 scx_pick_idle_cpu(const struct cpumask *cpus_allowed, u64 flags) > > +/* > > + * Pick an idle CPU in a specific NUMA node. 
> > + */ > > +s32 pick_idle_cpu_from_node(const struct cpumask *cpus_allowed, int node, u64 flags) > > { > > int cpu; > > > > retry: > > if (sched_smt_active()) { > > - cpu = cpumask_any_and_distribute(idle_masks.smt, cpus_allowed); > > + cpu = cpumask_any_and_distribute(idle_cpumask(node)->smt, cpus_allowed); > > if (cpu < nr_cpu_ids) > > goto found; > > > > @@ -79,7 +147,7 @@ s32 scx_pick_idle_cpu(const struct cpumask *cpus_allowed, u64 flags) > > return -EBUSY; > > } > > > > - cpu = cpumask_any_and_distribute(idle_masks.cpu, cpus_allowed); > > + cpu = cpumask_any_and_distribute(idle_cpumask(node)->cpu, cpus_allowed); > > if (cpu >= nr_cpu_ids) > > return -EBUSY; > > > > @@ -90,6 +158,55 @@ s32 scx_pick_idle_cpu(const struct cpumask *cpus_allowed, u64 flags) > > goto retry; > > } > > > > +/* > > + * Find the best idle CPU in the system, relative to @node. > > + */ > > +s32 scx_pick_idle_cpu(const struct cpumask *cpus_allowed, int node, u64 flags) > > +{ > > + nodemask_t unvisited = NODE_MASK_ALL; This should be a NODEMASK_ALLOC(). We don't want to eat up too much of the stack, right? > > + s32 cpu = -EBUSY; > > + > > + if (!static_branch_maybe(CONFIG_NUMA, &scx_builtin_idle_per_node)) > > + return pick_idle_cpu_from_node(cpus_allowed, NUMA_NO_NODE, flags); > > + > > + /* > > + * If an initial node is not specified, start with the current > > + * node. > > + */ > > + if (node == NUMA_NO_NODE) > > + node = numa_node_id(); > > + > > + /* > > + * Traverse all nodes in order of increasing distance, starting > > + * from @node. > > + * > > + * This loop is O(N^2), with N being the amount of NUMA nodes, > > + * which might be quite expensive in large NUMA systems. However, > > + * this complexity comes into play only when a scheduler enables > > + * SCX_OPS_BUILTIN_IDLE_PER_NODE and it's requesting an idle CPU > > + * without specifying a target NUMA node, so it shouldn't be a > > + * bottleneck is most cases. > > + * > > + * As a future optimization we may want to cache the list of hop > > + * nodes in a per-node array, instead of actually traversing them > > + * every time. > > + */ > > + for_each_numa_node(node, unvisited, N_POSSIBLE) { > > + cpu = pick_idle_cpu_from_node(cpus_allowed, node, flags); > > + if (cpu >= 0) > > + break; > > + > > + /* > > + * Check if the search is restricted to the same core or > > + * the same node. > > + */ > > + if (flags & SCX_PICK_IDLE_IN_NODE) > > + break; > > If SCX_PICK_IDLE_IN_NODE is set, you can avoid the loop at all, right? > Just: > if (flags & SCX_PICK_IDLE_IN_NODE) > return pick_idle_cpu_from_node(cpus_allowed, node, flags); > > for_each_numa_node(node, unvisited, N_POSSIBLE) { > cpu = pick_idle_cpu_from_node(cpus_allowed, node, flags); > if (cpu >= 0) > return cpu; > } > > Thanks, > Yury
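For concreteness, the NODEMASK_ALLOC() variant Yury suggests would start roughly as below. The GFP flag is only a placeholder, and the allocation-context problem this runs into (the function can be reached while holding raw spinlocks) is what the rest of the thread discusses.

s32 scx_pick_idle_cpu(const struct cpumask *cpus_allowed, int node, u64 flags)
{
	/* Heap-allocates when nodemask_t is large, stack-declares otherwise. */
	NODEMASK_ALLOC(nodemask_t, unvisited, GFP_KERNEL);
	s32 cpu = -EBUSY;

	if (!unvisited)
		return -EBUSY;

	nodes_setall(*unvisited);

	/* ... node traversal as in the quoted patch, using *unvisited ... */

	NODEMASK_FREE(unvisited);
	return cpu;
}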
On Mon, Feb 10, 2025 at 11:57:42AM -0500, Yury Norov wrote: ... > > > +/* > > > + * Find the best idle CPU in the system, relative to @node. > > > + */ > > > +s32 scx_pick_idle_cpu(const struct cpumask *cpus_allowed, int node, u64 flags) > > > +{ > > > + nodemask_t unvisited = NODE_MASK_ALL; > > This should be a NODEMASK_ALLOC(). We don't want to eat up too much of the > stack, right? Ok, and if I want to initialize unvisited to all online nodes, is there a better way than doing: nodemask_clear(*unvisited); nodemask_or(*unvisited, *unvisited, node_states[N_ONLINE]); We don't have nodemask_copy() right? Thanks, -Andrea

On Tue, Feb 11, 2025 at 08:32:51AM +0100, Andrea Righi wrote: > On Mon, Feb 10, 2025 at 11:57:42AM -0500, Yury Norov wrote: > ... > > > > +/* > > > > + * Find the best idle CPU in the system, relative to @node. > > > > + */ > > > > +s32 scx_pick_idle_cpu(const struct cpumask *cpus_allowed, int node, u64 flags) > > > > +{ > > > > + nodemask_t unvisited = NODE_MASK_ALL; > > > > This should be a NODEMASK_ALLOC(). We don't want to eat up too much of the > > stack, right? > > Ok, and if I want to initialize unvisited to all online nodes, is there a > better than doing: > > nodemask_clear(*unvisited); > nodemask_or(*unvisited, *unvisited, node_states[N_ONLINE]); > > We don't have nodemask_copy() right? Sorry, and with that I mean nodes_clear() / nodes_or() / nodes_copy(). -Andrea
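Since nodemask_t is a plain struct wrapping a bitmap, it can also be copied by ordinary assignment, so a dedicated copy helper is not strictly required. A small sketch of two equivalent ways to seed the mask from the online nodes, shown for a pointer as in the NODEMASK_ALLOC() case (the helper name is made up for illustration):

static void seed_unvisited_nodes(nodemask_t *unvisited)
{
	/* Option 1: structure assignment copies the whole mask. */
	*unvisited = node_states[N_ONLINE];

	/* Option 2: the same result with the nodes_*() helpers. */
	nodes_clear(*unvisited);
	nodes_or(*unvisited, *unvisited, node_states[N_ONLINE]);
}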
On Tue, Feb 11, 2025 at 08:41:45AM +0100, Andrea Righi wrote: > On Tue, Feb 11, 2025 at 08:32:51AM +0100, Andrea Righi wrote: > > On Mon, Feb 10, 2025 at 11:57:42AM -0500, Yury Norov wrote: > > ... > > > > > +/* > > > > > + * Find the best idle CPU in the system, relative to @node. > > > > > + */ > > > > > +s32 scx_pick_idle_cpu(const struct cpumask *cpus_allowed, int node, u64 flags) > > > > > +{ > > > > > + nodemask_t unvisited = NODE_MASK_ALL; > > > > > > This should be a NODEMASK_ALLOC(). We don't want to eat up too much of the > > > stack, right? > > > > Ok, and if I want to initialize unvisited to all online nodes, is there a > > better than doing: > > > > nodemask_clear(*unvisited); > > nodemask_or(*unvisited, *unvisited, node_states[N_ONLINE]); > > > > We don't have nodemask_copy() right? > > Sorry, and with that I mean nodes_clear() / nodes_or() / nodes_copy(). Also, it might be problematic to use NODEMASK_ALLOC() here, since we're potentially holding raw spinlocks. Maybe we could use per-cpu nodemask_t, but then we need to preempt_disable() the entire loop, since scx_pick_idle_cpu() can be called potentially from any context. Considering that the maximum value for NODE_SHIFT is 10 with CONFIG_MAXSMP, nodemask_t should be 128 bytes at most, that doesn't seem too bad... Maybe we can accept having it on the stack in this case? Thanks, -Andrea
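As a sanity check of the 128-byte figure, here is the arithmetic as a tiny standalone userspace C program, assuming CONFIG_NODES_SHIFT=10 (the value CONFIG_MAXSMP selects):

#include <stdio.h>

int main(void)
{
	unsigned int nodes_shift = 10;                   /* CONFIG_NODES_SHIFT under MAXSMP */
	unsigned int max_numnodes = 1u << nodes_shift;   /* 1024 possible nodes */
	unsigned int nodemask_bytes = max_numnodes / 8;  /* one bit per node */

	printf("worst-case nodemask_t: %u bytes\n", nodemask_bytes);  /* prints 128 */
	return 0;
}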
On Tue, Feb 11, 2025 at 10:50:46AM +0100, Andrea Righi wrote: > On Tue, Feb 11, 2025 at 08:41:45AM +0100, Andrea Righi wrote: > > On Tue, Feb 11, 2025 at 08:32:51AM +0100, Andrea Righi wrote: > > > On Mon, Feb 10, 2025 at 11:57:42AM -0500, Yury Norov wrote: > > > ... > > > > > +/* > > > > > + * Find the best idle CPU in the system, relative to @node. > > > > > + */ > > > > > +s32 scx_pick_idle_cpu(const struct cpumask *cpus_allowed, int node, u64 flags) > > > > > +{ > > > > > + nodemask_t unvisited = NODE_MASK_ALL; > > > > > > This should be a NODEMASK_ALLOC(). We don't want to eat up too much of the > > > > stack, right? > > > > > > Ok, and if I want to initialize unvisited to all online nodes, is there a > > > better than doing: > > > > > > nodemask_clear(*unvisited); > > > nodemask_or(*unvisited, *unvisited, node_states[N_ONLINE]); > > > > > > We don't have nodemask_copy() right? > > > > Sorry, and with that I mean nodes_clear() / nodes_or() / nodes_copy(). > > Also, it might be problematic to use NODEMASK_ALLOC() here, since we're > potentially holding raw spinlocks. Maybe we could use per-cpu nodemask_t, > but then we need to preempt_disable() the entire loop, since > scx_pick_idle_cpu() can be be called potentially from any context. > > Considering that the maximum value for NODE_SHIFT is 10 with CONFIG_MAXSMP, > nodemask_t should be 128 bytes at most, that doesn't seem too bad... Maybe > we can accept to have it on the stack in this case? If you expect to call this in strict SMP lock-held or IRQ contexts, you need to be careful about stack overflow even more. We've got GFP_ATOMIC for that: non sleeping allocation with an expensive fallback so it can access some portion of memory reserves. Usually used from interrupt/bottom-half context with an expensive slow path fallback. Check Documentation/core-api/memory-allocation.rst for other options. You may be interested in __GFP_NORETRY as well. Thanks, Yury
On Tue, Feb 11, 2025 at 09:19:52AM -0500, Yury Norov wrote: > On Tue, Feb 11, 2025 at 10:50:46AM +0100, Andrea Righi wrote: > > On Tue, Feb 11, 2025 at 08:41:45AM +0100, Andrea Righi wrote: > > > On Tue, Feb 11, 2025 at 08:32:51AM +0100, Andrea Righi wrote: > > > > On Mon, Feb 10, 2025 at 11:57:42AM -0500, Yury Norov wrote: > > > > ... > > > > > > > +/* > > > > > > > + * Find the best idle CPU in the system, relative to @node. > > > > > > > + */ > > > > > > > +s32 scx_pick_idle_cpu(const struct cpumask *cpus_allowed, int node, u64 flags) > > > > > > > +{ > > > > > > > + nodemask_t unvisited = NODE_MASK_ALL; > > > > > > > > > > This should be a NODEMASK_ALLOC(). We don't want to eat up too much of the > > > > > stack, right? > > > > > > > > Ok, and if I want to initialize unvisited to all online nodes, is there a > > > > better than doing: > > > > > > > > nodemask_clear(*unvisited); > > > > nodemask_or(*unvisited, *unvisited, node_states[N_ONLINE]); > > > > > > > > We don't have nodemask_copy() right? > > > > > > Sorry, and with that I mean nodes_clear() / nodes_or() / nodes_copy(). > > > > Also, it might be problematic to use NODEMASK_ALLOC() here, since we're > > potentially holding raw spinlocks. Maybe we could use per-cpu nodemask_t, > > but then we need to preempt_disable() the entire loop, since > > scx_pick_idle_cpu() can be be called potentially from any context. > > > > Considering that the maximum value for NODE_SHIFT is 10 with CONFIG_MAXSMP, > > nodemask_t should be 128 bytes at most, that doesn't seem too bad... Maybe > > we can accept to have it on the stack in this case? > > If you expect calling this in strict SMP lock-held or IRQ contexts, You > need to be careful about stack overflow even mode. We've got GFP_ATOMIC > for that: > non sleeping allocation with an expensive fallback so it can access > some portion of memory reserves. Usually used from interrupt/bottom-half > context with an expensive slow path fallback. > > Check Documentation/core-api/memory-allocation.rst for other options. > You may be interested in __GFP_NORETRY as well. I know about GFP_ATOMIC, but even with that I'm hitting some bugs. Will try with __GFP_NORETRY. Thanks, -Andrea
On Tue, Feb 11, 2025 at 03:34:11PM +0100, Andrea Righi wrote: > On Tue, Feb 11, 2025 at 09:19:52AM -0500, Yury Norov wrote: > > On Tue, Feb 11, 2025 at 10:50:46AM +0100, Andrea Righi wrote: > > > On Tue, Feb 11, 2025 at 08:41:45AM +0100, Andrea Righi wrote: > > > > On Tue, Feb 11, 2025 at 08:32:51AM +0100, Andrea Righi wrote: > > > > > On Mon, Feb 10, 2025 at 11:57:42AM -0500, Yury Norov wrote: > > > > > ... > > > > > > > > +/* > > > > > > > > + * Find the best idle CPU in the system, relative to @node. > > > > > > > > + */ > > > > > > > > +s32 scx_pick_idle_cpu(const struct cpumask *cpus_allowed, int node, u64 flags) > > > > > > > > +{ > > > > > > > > + nodemask_t unvisited = NODE_MASK_ALL; > > > > > > > > > > > > This should be a NODEMASK_ALLOC(). We don't want to eat up too much of the > > > > > > stack, right? > > > > > > > > > > Ok, and if I want to initialize unvisited to all online nodes, is there a > > > > > better than doing: > > > > > > > > > > nodemask_clear(*unvisited); > > > > > nodemask_or(*unvisited, *unvisited, node_states[N_ONLINE]); > > > > > > > > > > We don't have nodemask_copy() right? > > > > > > > > Sorry, and with that I mean nodes_clear() / nodes_or() / nodes_copy(). > > > > > > Also, it might be problematic to use NODEMASK_ALLOC() here, since we're > > > potentially holding raw spinlocks. Maybe we could use per-cpu nodemask_t, > > > but then we need to preempt_disable() the entire loop, since > > > scx_pick_idle_cpu() can be be called potentially from any context. > > > > > > Considering that the maximum value for NODE_SHIFT is 10 with CONFIG_MAXSMP, > > > nodemask_t should be 128 bytes at most, that doesn't seem too bad... Maybe > > > we can accept to have it on the stack in this case? > > > > If you expect calling this in strict SMP lock-held or IRQ contexts, You > > need to be careful about stack overflow even mode. We've got GFP_ATOMIC > > for that: > > non sleeping allocation with an expensive fallback so it can access > > some portion of memory reserves. Usually used from interrupt/bottom-half > > context with an expensive slow path fallback. > > > > Check Documentation/core-api/memory-allocation.rst for other options. > > You may be interested in __GFP_NORETRY as well. > > I know about GFP_ATOMIC, but even with that I'm hitting some bugs. > Will try with __GFP_NORETRY. 
...which is basically this (with GFP_ATOMIC): [ 11.829079] ============================= [ 11.829109] [ BUG: Invalid wait context ] [ 11.829146] 6.13.0-virtme #51 Not tainted [ 11.829185] ----------------------------- [ 11.829243] fish/344 is trying to lock: [ 11.829285] ffff9659bec450b0 (&c->lock){..-.}-{3:3}, at: ___slab_alloc+0x66/0x1510 [ 11.829380] other info that might help us debug this: [ 11.829450] context-{5:5} [ 11.829494] 8 locks held by fish/344: [ 11.829534] #0: ffff965a409c70a0 (&tty->ldisc_sem){++++}-{0:0}, at: tty_ldisc_ref_wait+0x28/0x60 [ 11.829643] #1: ffff965a409c7130 (&tty->atomic_write_lock){+.+.}-{4:4}, at: file_tty_write.isra.0+0xa1/0x330 [ 11.829765] #2: ffff965a409c72e8 (&tty->termios_rwsem/1){++++}-{4:4}, at: n_tty_write+0x9e/0x510 [ 11.829871] #3: ffffbc6d01433380 (&ldata->output_lock){+.+.}-{4:4}, at: n_tty_write+0x1f1/0x510 [ 11.829979] #4: ffffffffb556b5c0 (rcu_read_lock){....}-{1:3}, at: __queue_work+0x59/0x680 [ 11.830173] #5: ffff9659800f0018 (&pool->lock){-.-.}-{2:2}, at: __queue_work+0xd7/0x680 [ 11.830286] #6: ffff9659801bcf60 (&p->pi_lock){-.-.}-{2:2}, at: try_to_wake_up+0x56/0x920 [ 11.830396] #7: ffffffffb556b5c0 (rcu_read_lock){....}-{1:3}, at: scx_select_cpu_dfl+0x56/0x460 And I think that's because: * %GFP_ATOMIC users can not sleep and need the allocation to succeed. A lower * watermark is applied to allow access to "atomic reserves". * The current implementation doesn't support NMI and few other strict * non-preemptive contexts (e.g. raw_spin_lock). The same applies to %GFP_NOWAIT. So I guess the only viable option is to preallocate a nodemask_t and protect it somehow, hoping that it doesn't add too much overhead... -Andrea
On Tue, 11 Feb 2025 15:45:15 +0100 Andrea Righi <arighi@nvidia.com> wrote: > ...which is basically this (with GFP_ATOMIC): > > [ 11.829079] ============================= > [ 11.829109] [ BUG: Invalid wait context ] > [ 11.829146] 6.13.0-virtme #51 Not tainted > [ 11.829185] ----------------------------- > [ 11.829243] fish/344 is trying to lock: > [ 11.829285] ffff9659bec450b0 (&c->lock){..-.}-{3:3}, at: ___slab_alloc+0x66/0x1510 > [ 11.829380] other info that might help us debug this: > [ 11.829450] context-{5:5} > [ 11.829494] 8 locks held by fish/344: > [ 11.829534] #0: ffff965a409c70a0 (&tty->ldisc_sem){++++}-{0:0}, at: tty_ldisc_ref_wait+0x28/0x60 > [ 11.829643] #1: ffff965a409c7130 (&tty->atomic_write_lock){+.+.}-{4:4}, at: file_tty_write.isra.0+0xa1/0x330 > [ 11.829765] #2: ffff965a409c72e8 (&tty->termios_rwsem/1){++++}-{4:4}, at: n_tty_write+0x9e/0x510 > [ 11.829871] #3: ffffbc6d01433380 (&ldata->output_lock){+.+.}-{4:4}, at: n_tty_write+0x1f1/0x510 > [ 11.829979] #4: ffffffffb556b5c0 (rcu_read_lock){....}-{1:3}, at: __queue_work+0x59/0x680 > [ 11.830173] #5: ffff9659800f0018 (&pool->lock){-.-.}-{2:2}, at: __queue_work+0xd7/0x680 > [ 11.830286] #6: ffff9659801bcf60 (&p->pi_lock){-.-.}-{2:2}, at: try_to_wake_up+0x56/0x920 > [ 11.830396] #7: ffffffffb556b5c0 (rcu_read_lock){....}-{1:3}, at: scx_select_cpu_dfl+0x56/0x460 > > And I think that's because: > > * %GFP_ATOMIC users can not sleep and need the allocation to succeed. A lower > * watermark is applied to allow access to "atomic reserves". > * The current implementation doesn't support NMI and few other strict > * non-preemptive contexts (e.g. raw_spin_lock). The same applies to %GFP_NOWAIT. > > So I guess we the only viable option is to preallocate nodemask_t and > protect it somehow, hoping that it doesn't add too much overhead... I believe it's because you have p->pi_lock which is a raw_spin_lock() and you are trying to take a lock in ___slab_alloc() which I bet is a normal spin_lock(). In PREEMPT_RT() that turns into a mutex, and you can not take a spin_lock while holding a raw_spin_lock. -- Steve
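A stripped-down illustration of the nesting rule Steve describes, with stand-in lock names (hypothetical, not from the patch): on PREEMPT_RT a spin_lock() becomes a sleeping rt_mutex, and lockdep's wait-context checking flags this ordering even on non-RT builds, which is what the "Invalid wait context" report above is about.

#include <linux/spinlock.h>

static DEFINE_RAW_SPINLOCK(example_raw_lock);	/* stands in for p->pi_lock */
static DEFINE_SPINLOCK(example_spinlock);	/* stands in for the slab's c->lock */

static void invalid_nesting(void)
{
	raw_spin_lock(&example_raw_lock);

	/*
	 * Invalid: spin_lock() may sleep on PREEMPT_RT, but we are inside
	 * a raw spinlock with preemption disabled. This is the pattern a
	 * GFP_ATOMIC slab allocation ends up hitting here.
	 */
	spin_lock(&example_spinlock);
	spin_unlock(&example_spinlock);

	raw_spin_unlock(&example_raw_lock);
}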
On Tue, Feb 11, 2025 at 11:38:27AM -0500, Steven Rostedt wrote: > On Tue, 11 Feb 2025 15:45:15 +0100 > Andrea Righi <arighi@nvidia.com> wrote: > > > ...which is basically this (with GFP_ATOMIC): > > > > [ 11.829079] ============================= > > [ 11.829109] [ BUG: Invalid wait context ] > > [ 11.829146] 6.13.0-virtme #51 Not tainted > > [ 11.829185] ----------------------------- > > [ 11.829243] fish/344 is trying to lock: > > [ 11.829285] ffff9659bec450b0 (&c->lock){..-.}-{3:3}, at: ___slab_alloc+0x66/0x1510 > > [ 11.829380] other info that might help us debug this: > > [ 11.829450] context-{5:5} > > [ 11.829494] 8 locks held by fish/344: > > [ 11.829534] #0: ffff965a409c70a0 (&tty->ldisc_sem){++++}-{0:0}, at: tty_ldisc_ref_wait+0x28/0x60 > > [ 11.829643] #1: ffff965a409c7130 (&tty->atomic_write_lock){+.+.}-{4:4}, at: file_tty_write.isra.0+0xa1/0x330 > > [ 11.829765] #2: ffff965a409c72e8 (&tty->termios_rwsem/1){++++}-{4:4}, at: n_tty_write+0x9e/0x510 > > [ 11.829871] #3: ffffbc6d01433380 (&ldata->output_lock){+.+.}-{4:4}, at: n_tty_write+0x1f1/0x510 > > [ 11.829979] #4: ffffffffb556b5c0 (rcu_read_lock){....}-{1:3}, at: __queue_work+0x59/0x680 > > [ 11.830173] #5: ffff9659800f0018 (&pool->lock){-.-.}-{2:2}, at: __queue_work+0xd7/0x680 > > [ 11.830286] #6: ffff9659801bcf60 (&p->pi_lock){-.-.}-{2:2}, at: try_to_wake_up+0x56/0x920 > > [ 11.830396] #7: ffffffffb556b5c0 (rcu_read_lock){....}-{1:3}, at: scx_select_cpu_dfl+0x56/0x460 > > > > And I think that's because: > > > > * %GFP_ATOMIC users can not sleep and need the allocation to succeed. A lower > > * watermark is applied to allow access to "atomic reserves". > > * The current implementation doesn't support NMI and few other strict > > * non-preemptive contexts (e.g. raw_spin_lock). The same applies to %GFP_NOWAIT. > > > > So I guess we the only viable option is to preallocate nodemask_t and > > protect it somehow, hoping that it doesn't add too much overhead... > > I believe it's because you have p->pi_lock which is a raw_spin_lock() and > you are trying to take a lock in ___slab_alloc() which I bet is a normal > spin_lock(). In PREEMPT_RT() that turns into a mutex, and you can not take > a spin_lock while holding a raw_spin_lock. Exactly that, thanks Steve. I'll run some tests using per-cpu nodemask_t, given that most of the times this is called with p->pi_lock held, it should be safe and we shouldn't introduce any overhead. -Andrea
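A rough sketch of the per-CPU scratch nodemask Andrea mentions. The variable names are made up, it assumes every caller of scx_pick_idle_cpu() already runs with preemption disabled (e.g. under p->pi_lock), so the per-CPU buffer needs no extra locking, and it assumes the series' for_each_numa_node() iterator accepts the mask as a plain lvalue expression. Not the final code, just the shape of the idea.

static DEFINE_PER_CPU(nodemask_t, scx_idle_unvisited);

s32 scx_pick_idle_cpu(const struct cpumask *cpus_allowed, int node, u64 flags)
{
	nodemask_t *unvisited = this_cpu_ptr(&scx_idle_unvisited);
	s32 cpu = -EBUSY;

	if (!static_branch_maybe(CONFIG_NUMA, &scx_builtin_idle_per_node))
		return pick_idle_cpu_from_node(cpus_allowed, NUMA_NO_NODE, flags);

	if (node == NUMA_NO_NODE)
		node = numa_node_id();

	/* Same as "unvisited = NODE_MASK_ALL" in the patch, without stack usage. */
	nodes_setall(*unvisited);

	for_each_numa_node(node, *unvisited, N_POSSIBLE) {
		cpu = pick_idle_cpu_from_node(cpus_allowed, node, flags);
		if (cpu >= 0)
			break;

		if (flags & SCX_PICK_IDLE_IN_NODE)
			break;
	}

	return cpu;
}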
diff --git a/kernel/sched/ext_idle.c b/kernel/sched/ext_idle.c index a3f2b00903ac2..4b90ec9018c1a 100644 --- a/kernel/sched/ext_idle.c +++ b/kernel/sched/ext_idle.c @@ -18,25 +18,88 @@ DEFINE_STATIC_KEY_FALSE(scx_builtin_idle_enabled); DEFINE_STATIC_KEY_FALSE(scx_builtin_idle_per_node); #ifdef CONFIG_SMP -#ifdef CONFIG_CPUMASK_OFFSTACK -#define CL_ALIGNED_IF_ONSTACK -#else -#define CL_ALIGNED_IF_ONSTACK __cacheline_aligned_in_smp -#endif - /* Enable/disable LLC aware optimizations */ DEFINE_STATIC_KEY_FALSE(scx_selcpu_topo_llc); /* Enable/disable NUMA aware optimizations */ DEFINE_STATIC_KEY_FALSE(scx_selcpu_topo_numa); -static struct { +/* + * cpumasks to track idle CPUs within each NUMA node. + * + * If SCX_OPS_BUILTIN_IDLE_PER_NODE is not enabled, a single global cpumask + * from is used to track all the idle CPUs in the system. + */ +struct idle_cpus { cpumask_var_t cpu; cpumask_var_t smt; -} idle_masks CL_ALIGNED_IF_ONSTACK; +}; + +/* + * Global host-wide idle cpumasks (used when SCX_OPS_BUILTIN_IDLE_PER_NODE + * is not enabled). + */ +static struct idle_cpus scx_idle_global_masks; + +/* + * Per-node idle cpumasks. + */ +static struct idle_cpus **scx_idle_node_masks; + +/* + * Initialize per-node idle cpumasks. + * + * In case of a single NUMA node or if NUMA support is disabled, only a + * single global host-wide cpumask will be initialized. + */ +void scx_idle_init_masks(void) +{ + int node; + + /* Allocate global idle cpumasks */ + BUG_ON(!alloc_cpumask_var(&scx_idle_global_masks.cpu, GFP_KERNEL)); + BUG_ON(!alloc_cpumask_var(&scx_idle_global_masks.smt, GFP_KERNEL)); + + /* Allocate per-node idle cpumasks */ + scx_idle_node_masks = kcalloc(num_possible_nodes(), + sizeof(*scx_idle_node_masks), GFP_KERNEL); + BUG_ON(!scx_idle_node_masks); + + for_each_node(node) { + scx_idle_node_masks[node] = kzalloc_node(sizeof(**scx_idle_node_masks), + GFP_KERNEL, node); + BUG_ON(!scx_idle_node_masks[node]); + + BUG_ON(!alloc_cpumask_var_node(&scx_idle_node_masks[node]->cpu, GFP_KERNEL, node)); + BUG_ON(!alloc_cpumask_var_node(&scx_idle_node_masks[node]->smt, GFP_KERNEL, node)); + } +} + +/* + * Return the idle masks associated to a target @node. + */ +static struct idle_cpus *idle_cpumask(int node) +{ + return node == NUMA_NO_NODE ? &scx_idle_global_masks : scx_idle_node_masks[node]; +} + +/* + * Return the node id associated to a target idle CPU (used to determine + * the proper idle cpumask). + */ +static int idle_cpu_to_node(int cpu) +{ + if (!static_branch_maybe(CONFIG_NUMA, &scx_builtin_idle_per_node)) + return NUMA_NO_NODE; + + return cpu_to_node(cpu); +} bool scx_idle_test_and_clear_cpu(int cpu) { + int node = idle_cpu_to_node(cpu); + struct cpumask *idle_cpus = idle_cpumask(node)->cpu; + #ifdef CONFIG_SCHED_SMT /* * SMT mask should be cleared whether we can claim @cpu or not. The SMT @@ -45,33 +108,38 @@ bool scx_idle_test_and_clear_cpu(int cpu) */ if (sched_smt_active()) { const struct cpumask *smt = cpu_smt_mask(cpu); + struct cpumask *idle_smts = idle_cpumask(node)->smt; /* * If offline, @cpu is not its own sibling and * scx_pick_idle_cpu() can get caught in an infinite loop as - * @cpu is never cleared from idle_masks.smt. Ensure that @cpu - * is eventually cleared. + * @cpu is never cleared from the idle SMT mask. Ensure that + * @cpu is eventually cleared. * * NOTE: Use cpumask_intersects() and cpumask_test_cpu() to * reduce memory writes, which may help alleviate cache * coherence pressure. 
*/ - if (cpumask_intersects(smt, idle_masks.smt)) - cpumask_andnot(idle_masks.smt, idle_masks.smt, smt); - else if (cpumask_test_cpu(cpu, idle_masks.smt)) - __cpumask_clear_cpu(cpu, idle_masks.smt); + if (cpumask_intersects(smt, idle_smts)) + cpumask_andnot(idle_smts, idle_smts, smt); + else if (cpumask_test_cpu(cpu, idle_smts)) + __cpumask_clear_cpu(cpu, idle_smts); } #endif - return cpumask_test_and_clear_cpu(cpu, idle_masks.cpu); + + return cpumask_test_and_clear_cpu(cpu, idle_cpus); } -s32 scx_pick_idle_cpu(const struct cpumask *cpus_allowed, u64 flags) +/* + * Pick an idle CPU in a specific NUMA node. + */ +s32 pick_idle_cpu_from_node(const struct cpumask *cpus_allowed, int node, u64 flags) { int cpu; retry: if (sched_smt_active()) { - cpu = cpumask_any_and_distribute(idle_masks.smt, cpus_allowed); + cpu = cpumask_any_and_distribute(idle_cpumask(node)->smt, cpus_allowed); if (cpu < nr_cpu_ids) goto found; @@ -79,7 +147,7 @@ s32 scx_pick_idle_cpu(const struct cpumask *cpus_allowed, u64 flags) return -EBUSY; } - cpu = cpumask_any_and_distribute(idle_masks.cpu, cpus_allowed); + cpu = cpumask_any_and_distribute(idle_cpumask(node)->cpu, cpus_allowed); if (cpu >= nr_cpu_ids) return -EBUSY; @@ -90,6 +158,55 @@ s32 scx_pick_idle_cpu(const struct cpumask *cpus_allowed, u64 flags) goto retry; } +/* + * Find the best idle CPU in the system, relative to @node. + */ +s32 scx_pick_idle_cpu(const struct cpumask *cpus_allowed, int node, u64 flags) +{ + nodemask_t unvisited = NODE_MASK_ALL; + s32 cpu = -EBUSY; + + if (!static_branch_maybe(CONFIG_NUMA, &scx_builtin_idle_per_node)) + return pick_idle_cpu_from_node(cpus_allowed, NUMA_NO_NODE, flags); + + /* + * If an initial node is not specified, start with the current + * node. + */ + if (node == NUMA_NO_NODE) + node = numa_node_id(); + + /* + * Traverse all nodes in order of increasing distance, starting + * from @node. + * + * This loop is O(N^2), with N being the amount of NUMA nodes, + * which might be quite expensive in large NUMA systems. However, + * this complexity comes into play only when a scheduler enables + * SCX_OPS_BUILTIN_IDLE_PER_NODE and it's requesting an idle CPU + * without specifying a target NUMA node, so it shouldn't be a + * bottleneck is most cases. + * + * As a future optimization we may want to cache the list of hop + * nodes in a per-node array, instead of actually traversing them + * every time. + */ + for_each_numa_node(node, unvisited, N_POSSIBLE) { + cpu = pick_idle_cpu_from_node(cpus_allowed, node, flags); + if (cpu >= 0) + break; + + /* + * Check if the search is restricted to the same core or + * the same node. + */ + if (flags & SCX_PICK_IDLE_IN_NODE) + break; + } + + return cpu; +} + /* * Return the amount of CPUs in the same LLC domain of @cpu (or zero if the LLC * domain is not defined). @@ -310,6 +427,7 @@ s32 scx_select_cpu_dfl(struct task_struct *p, s32 prev_cpu, u64 wake_flags, bool { const struct cpumask *llc_cpus = NULL; const struct cpumask *numa_cpus = NULL; + int node = idle_cpu_to_node(prev_cpu); s32 cpu; *found = false; @@ -367,9 +485,9 @@ s32 scx_select_cpu_dfl(struct task_struct *p, s32 prev_cpu, u64 wake_flags, bool * piled up on it even if there is an idle core elsewhere on * the system. 
*/ - if (!cpumask_empty(idle_masks.cpu) && - !(current->flags & PF_EXITING) && - cpu_rq(cpu)->scx.local_dsq.nr == 0) { + if (!(current->flags & PF_EXITING) && + cpu_rq(cpu)->scx.local_dsq.nr == 0 && + !cpumask_empty(idle_cpumask(cpu_to_node(cpu))->cpu)) { if (cpumask_test_cpu(cpu, p->cpus_ptr)) goto cpu_found; } @@ -383,7 +501,7 @@ s32 scx_select_cpu_dfl(struct task_struct *p, s32 prev_cpu, u64 wake_flags, bool /* * Keep using @prev_cpu if it's part of a fully idle core. */ - if (cpumask_test_cpu(prev_cpu, idle_masks.smt) && + if (cpumask_test_cpu(prev_cpu, idle_cpumask(node)->smt) && scx_idle_test_and_clear_cpu(prev_cpu)) { cpu = prev_cpu; goto cpu_found; @@ -393,7 +511,7 @@ s32 scx_select_cpu_dfl(struct task_struct *p, s32 prev_cpu, u64 wake_flags, bool * Search for any fully idle core in the same LLC domain. */ if (llc_cpus) { - cpu = scx_pick_idle_cpu(llc_cpus, SCX_PICK_IDLE_CORE); + cpu = pick_idle_cpu_from_node(llc_cpus, node, SCX_PICK_IDLE_CORE); if (cpu >= 0) goto cpu_found; } @@ -402,15 +520,19 @@ s32 scx_select_cpu_dfl(struct task_struct *p, s32 prev_cpu, u64 wake_flags, bool * Search for any fully idle core in the same NUMA node. */ if (numa_cpus) { - cpu = scx_pick_idle_cpu(numa_cpus, SCX_PICK_IDLE_CORE); + cpu = pick_idle_cpu_from_node(numa_cpus, node, SCX_PICK_IDLE_CORE); if (cpu >= 0) goto cpu_found; } /* * Search for any full idle core usable by the task. + * + * If NUMA aware idle selection is enabled, the search will + * begin in prev_cpu's node and proceed to other nodes in + * order of increasing distance. */ - cpu = scx_pick_idle_cpu(p->cpus_ptr, SCX_PICK_IDLE_CORE); + cpu = scx_pick_idle_cpu(p->cpus_ptr, node, SCX_PICK_IDLE_CORE); if (cpu >= 0) goto cpu_found; } @@ -427,7 +549,7 @@ s32 scx_select_cpu_dfl(struct task_struct *p, s32 prev_cpu, u64 wake_flags, bool * Search for any idle CPU in the same LLC domain. */ if (llc_cpus) { - cpu = scx_pick_idle_cpu(llc_cpus, 0); + cpu = pick_idle_cpu_from_node(llc_cpus, node, 0); if (cpu >= 0) goto cpu_found; } @@ -436,7 +558,7 @@ s32 scx_select_cpu_dfl(struct task_struct *p, s32 prev_cpu, u64 wake_flags, bool * Search for any idle CPU in the same NUMA node. */ if (numa_cpus) { - cpu = scx_pick_idle_cpu(numa_cpus, 0); + cpu = pick_idle_cpu_from_node(numa_cpus, node, 0); if (cpu >= 0) goto cpu_found; } @@ -444,7 +566,7 @@ s32 scx_select_cpu_dfl(struct task_struct *p, s32 prev_cpu, u64 wake_flags, bool /* * Search for any idle CPU usable by the task. */ - cpu = scx_pick_idle_cpu(p->cpus_ptr, 0); + cpu = scx_pick_idle_cpu(p->cpus_ptr, node, 0); if (cpu >= 0) goto cpu_found; @@ -460,38 +582,50 @@ s32 scx_select_cpu_dfl(struct task_struct *p, s32 prev_cpu, u64 wake_flags, bool void scx_idle_reset_masks(void) { + int node; + + if (!static_branch_maybe(CONFIG_NUMA, &scx_builtin_idle_per_node)) { + cpumask_copy(idle_cpumask(NUMA_NO_NODE)->cpu, cpu_online_mask); + cpumask_copy(idle_cpumask(NUMA_NO_NODE)->smt, cpu_online_mask); + return; + } + /* * Consider all online cpus idle. Should converge to the actual state * quickly. 
*/ - cpumask_copy(idle_masks.cpu, cpu_online_mask); - cpumask_copy(idle_masks.smt, cpu_online_mask); -} + for_each_node(node) { + const struct cpumask *node_mask = cpumask_of_node(node); + struct cpumask *idle_cpus = idle_cpumask(node)->cpu; + struct cpumask *idle_smts = idle_cpumask(node)->smt; -void scx_idle_init_masks(void) -{ - BUG_ON(!alloc_cpumask_var(&idle_masks.cpu, GFP_KERNEL)); - BUG_ON(!alloc_cpumask_var(&idle_masks.smt, GFP_KERNEL)); + cpumask_and(idle_cpus, cpu_online_mask, node_mask); + cpumask_copy(idle_smts, idle_cpus); + } } static void update_builtin_idle(int cpu, bool idle) { - assign_cpu(cpu, idle_masks.cpu, idle); + int node = idle_cpu_to_node(cpu); + struct cpumask *idle_cpus = idle_cpumask(node)->cpu; + + assign_cpu(cpu, idle_cpus, idle); #ifdef CONFIG_SCHED_SMT if (sched_smt_active()) { const struct cpumask *smt = cpu_smt_mask(cpu); + struct cpumask *idle_smts = idle_cpumask(node)->smt; if (idle) { /* - * idle_masks.smt handling is racy but that's fine as - * it's only for optimization and self-correcting. + * idle_smt handling is racy but that's fine as it's + * only for optimization and self-correcting. */ - if (!cpumask_subset(smt, idle_masks.cpu)) + if (!cpumask_subset(smt, idle_cpus)) return; - cpumask_or(idle_masks.smt, idle_masks.smt, smt); + cpumask_or(idle_smts, idle_smts, smt); } else { - cpumask_andnot(idle_masks.smt, idle_masks.smt, smt); + cpumask_andnot(idle_smts, idle_smts, smt); } } #endif @@ -599,15 +733,21 @@ __bpf_kfunc s32 scx_bpf_select_cpu_dfl(struct task_struct *p, s32 prev_cpu, * scx_bpf_get_idle_cpumask - Get a referenced kptr to the idle-tracking * per-CPU cpumask. * - * Returns NULL if idle tracking is not enabled, or running on a UP kernel. + * Returns an empty mask if idle tracking is not enabled, or running on a + * UP kernel. */ __bpf_kfunc const struct cpumask *scx_bpf_get_idle_cpumask(void) { if (!check_builtin_idle_enabled()) return cpu_none_mask; + if (static_branch_unlikely(&scx_builtin_idle_per_node)) { + scx_ops_error("SCX_OPS_BUILTIN_IDLE_PER_NODE enabled"); + return cpu_none_mask; + } + #ifdef CONFIG_SMP - return idle_masks.cpu; + return idle_cpumask(NUMA_NO_NODE)->cpu; #else return cpu_none_mask; #endif @@ -618,18 +758,24 @@ __bpf_kfunc const struct cpumask *scx_bpf_get_idle_cpumask(void) * per-physical-core cpumask. Can be used to determine if an entire physical * core is free. * - * Returns NULL if idle tracking is not enabled, or running on a UP kernel. + * Returns an empty mask if idle tracking is not enabled, or running on a + * UP kernel. 
*/ __bpf_kfunc const struct cpumask *scx_bpf_get_idle_smtmask(void) { if (!check_builtin_idle_enabled()) return cpu_none_mask; + if (static_branch_unlikely(&scx_builtin_idle_per_node)) { + scx_ops_error("SCX_OPS_BUILTIN_IDLE_PER_NODE enabled"); + return cpu_none_mask; + } + #ifdef CONFIG_SMP if (sched_smt_active()) - return idle_masks.smt; + return idle_cpumask(NUMA_NO_NODE)->smt; else - return idle_masks.cpu; + return idle_cpumask(NUMA_NO_NODE)->cpu; #else return cpu_none_mask; #endif @@ -696,7 +842,7 @@ __bpf_kfunc s32 scx_bpf_pick_idle_cpu(const struct cpumask *cpus_allowed, if (!check_builtin_idle_enabled()) return -EBUSY; - return scx_pick_idle_cpu(cpus_allowed, flags); + return scx_pick_idle_cpu(cpus_allowed, NUMA_NO_NODE, flags); } /** @@ -719,7 +865,7 @@ __bpf_kfunc s32 scx_bpf_pick_any_cpu(const struct cpumask *cpus_allowed, s32 cpu; if (static_branch_likely(&scx_builtin_idle_enabled)) { - cpu = scx_pick_idle_cpu(cpus_allowed, flags); + cpu = scx_pick_idle_cpu(cpus_allowed, NUMA_NO_NODE, flags); if (cpu >= 0) return cpu; } diff --git a/kernel/sched/ext_idle.h b/kernel/sched/ext_idle.h index d005bd22c19a5..b00ed5ad48e89 100644 --- a/kernel/sched/ext_idle.h +++ b/kernel/sched/ext_idle.h @@ -13,6 +13,7 @@ struct sched_ext_ops; extern struct static_key_false scx_builtin_idle_enabled; +extern struct static_key_false scx_builtin_idle_per_node; #ifdef CONFIG_SMP extern struct static_key_false scx_selcpu_topo_llc; @@ -22,13 +23,18 @@ void scx_idle_update_selcpu_topology(struct sched_ext_ops *ops); void scx_idle_reset_masks(void); void scx_idle_init_masks(void); bool scx_idle_test_and_clear_cpu(int cpu); -s32 scx_pick_idle_cpu(const struct cpumask *cpus_allowed, u64 flags); +s32 pick_idle_cpu_from_node(const struct cpumask *cpus_allowed, int node, u64 flags); +s32 scx_pick_idle_cpu(const struct cpumask *cpus_allowed, int node, u64 flags); #else /* !CONFIG_SMP */ static inline void scx_idle_update_selcpu_topology(struct sched_ext_ops *ops) {} static inline void scx_idle_reset_masks(void) {} static inline void scx_idle_init_masks(void) {} static inline bool scx_idle_test_and_clear_cpu(int cpu) { return false; } -static inline s32 scx_pick_idle_cpu(const struct cpumask *cpus_allowed, u64 flags) +static inline s32 pick_idle_cpu_from_node(const struct cpumask *cpus_allowed, int node, u64 flags) +{ + return -EBUSY; +} +static inline s32 scx_pick_idle_cpu(const struct cpumask *cpus_allowed, int node, u64 flags) { return -EBUSY; } @@ -36,6 +42,7 @@ static inline s32 scx_pick_idle_cpu(const struct cpumask *cpus_allowed, u64 flag s32 scx_select_cpu_dfl(struct task_struct *p, s32 prev_cpu, u64 wake_flags, bool *found); +extern void scx_idle_init_masks(void); extern int scx_idle_init(void); #endif /* _KERNEL_SCHED_EXT_IDLE_H */
Using a single global idle mask can lead to inefficiencies and a lot of stress on the cache coherency protocol on large systems with multiple NUMA nodes, since all the CPUs can create a really intense read/write activity on the single global cpumask. Therefore, split the global cpumask into multiple per-NUMA node cpumasks to improve scalability and performance on large systems. The concept is that each cpumask will track only the idle CPUs within its corresponding NUMA node, treating CPUs in other NUMA nodes as busy. In this way concurrent access to the idle cpumask will be restricted within each NUMA node. The split of multiple per-node idle cpumasks can be controlled using the SCX_OPS_BUILTIN_IDLE_PER_NODE flag. By default SCX_OPS_BUILTIN_IDLE_PER_NODE is not enabled and a global host-wide idle cpumask is used, maintaining the previous behavior. NOTE: if a scheduler explicitly enables the per-node idle cpumasks (via SCX_OPS_BUILTIN_IDLE_PER_NODE), scx_bpf_get_idle_cpu/smtmask() will trigger an scx error, since there are no system-wide cpumasks. Signed-off-by: Andrea Righi <arighi@nvidia.com> --- kernel/sched/ext_idle.c | 242 ++++++++++++++++++++++++++++++++-------- kernel/sched/ext_idle.h | 11 +- 2 files changed, 203 insertions(+), 50 deletions(-)