Message ID | 20250317175717.163267-2-arighi@nvidia.com (mailing list archive) |
---|---|
State | New |
Series | [1/6] sched_ext: idle: Extend topology optimizations to all tasks |
Hello,

On Mon, Mar 17, 2025 at 06:53:24PM +0100, Andrea Righi wrote:
> +/*
> + * Return the subset of @cpus that task @p can use or NULL if none of the
> + * CPUs in the @cpus cpumask can be used.
> + */
> +static const struct cpumask *task_cpumask(const struct task_struct *p, const struct cpumask *cpus,
> +					   struct cpumask *local_cpus)

task_cpus_allowed_and()? It also would help to add a comment explaining the
parameters as the function is a bit unusual.

> +{
> +	/*
> +	 * If the task is allowed to run on all CPUs, simply use the
> +	 * architecture's cpumask directly. Otherwise, compute the
> +	 * intersection of the architecture's cpumask and the task's
> +	 * allowed cpumask.
> +	 */
> +	if (!cpus || p->nr_cpus_allowed >= num_possible_cpus() ||
> +	    cpumask_subset(cpus, p->cpus_ptr))
> +		return cpus;
> +
> +	if (!cpumask_equal(cpus, p->cpus_ptr) &&

Hmm... isn't this covered by the preceding cpumask_subset() test? Here, cpus
is not a subset of p->cpus_ptr, so how can it be the same as p->cpus_ptr?

> +	    cpumask_and(local_cpus, cpus, p->cpus_ptr))
> +		return local_cpus;
> +
> +	return NULL;

and the return values need some explanation too.

Thanks.
On Mon, Mar 17, 2025 at 08:22:35AM -1000, Tejun Heo wrote:
> Hello,
>
> On Mon, Mar 17, 2025 at 06:53:24PM +0100, Andrea Righi wrote:
> > +/*
> > + * Return the subset of @cpus that task @p can use or NULL if none of the
> > + * CPUs in the @cpus cpumask can be used.
> > + */
> > +static const struct cpumask *task_cpumask(const struct task_struct *p, const struct cpumask *cpus,
> > +					   struct cpumask *local_cpus)
>
> task_cpus_allowed_and()? It also would help to add a comment explaining the
> parameters as the function is a bit unusual.

Ack.

> > +{
> > +	/*
> > +	 * If the task is allowed to run on all CPUs, simply use the
> > +	 * architecture's cpumask directly. Otherwise, compute the
> > +	 * intersection of the architecture's cpumask and the task's
> > +	 * allowed cpumask.
> > +	 */
> > +	if (!cpus || p->nr_cpus_allowed >= num_possible_cpus() ||
> > +	    cpumask_subset(cpus, p->cpus_ptr))
> > +		return cpus;
> > +
> > +	if (!cpumask_equal(cpus, p->cpus_ptr) &&
>
> Hmm... isn't this covered by the preceding cpumask_subset() test? Here, cpus
> is not a subset of p->cpus_ptr, so how can it be the same as p->cpus_ptr?

Oh that's right, I missed that between all the refactoring, thanks for
catching it. Will remove it.

> > +	    cpumask_and(local_cpus, cpus, p->cpus_ptr))
> > +		return local_cpus;
> > +
> > +	return NULL;
>
> and the return values need some explanation too.

Ok.

Thanks,
-Andrea
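[Editor's note: putting the review feedback together, the revised helper might look roughly like the sketch below: renamed to task_cpus_allowed_and() as Tejun suggested, with the redundant cpumask_equal() check dropped and the parameters and return values documented. This is only an illustration of the agreed changes, not the actual follow-up patch.]

/*
 * task_cpus_allowed_and - restrict a topology cpumask to a task's affinity
 * @p: target task
 * @cpus: topology cpumask to restrict (LLC or NUMA span), may be NULL
 * @local_cpus: temporary per-CPU cpumask used to store the intersection
 *
 * Return @cpus if the task can use all of its CPUs, @local_cpus if only a
 * non-empty intersection is usable, or NULL if the task can use none of
 * the CPUs in @cpus.
 */
static const struct cpumask *
task_cpus_allowed_and(const struct task_struct *p, const struct cpumask *cpus,
		      struct cpumask *local_cpus)
{
	/*
	 * If the task can run on all CPUs, or @cpus is already fully
	 * contained in the task's allowed mask, use @cpus directly.
	 */
	if (!cpus || p->nr_cpus_allowed >= num_possible_cpus() ||
	    cpumask_subset(cpus, p->cpus_ptr))
		return cpus;

	/* Redundant cpumask_equal() check dropped per the review above. */
	if (cpumask_and(local_cpus, cpus, p->cpus_ptr))
		return local_cpus;

	return NULL;
}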
On Mon, Mar 17, 2025 at 08:22:35AM -1000, Tejun Heo wrote:
...
> > +	/*
> > +	 * If the task is allowed to run on all CPUs, simply use the
> > +	 * architecture's cpumask directly. Otherwise, compute the
> > +	 * intersection of the architecture's cpumask and the task's
> > +	 * allowed cpumask.
> > +	 */
> > +	if (!cpus || p->nr_cpus_allowed >= num_possible_cpus() ||
> > +	    cpumask_subset(cpus, p->cpus_ptr))
> > +		return cpus;
> > +
> > +	if (!cpumask_equal(cpus, p->cpus_ptr) &&
>
> Hmm... isn't this covered by the preceding cpumask_subset() test? Here, cpus
> is not a subset of p->cpus_ptr, so how can it be the same as p->cpus_ptr?
>
> > +	    cpumask_and(local_cpus, cpus, p->cpus_ptr))
> > +		return local_cpus;
> > +
> > +	return NULL;

Also, I'm wondering if there's really a benefit in checking for
cpumask_subset() and then doing cpumask_and() only when it's needed, or if
we should just do cpumask_and(). It's true that we can save some writes,
but they're done on a temporary local per-CPU cpumask, so they shouldn't
introduce cache contention.

-Andrea
Hello,

On Tue, Mar 18, 2025 at 08:31:29AM +0100, Andrea Righi wrote:
> On Mon, Mar 17, 2025 at 08:22:35AM -1000, Tejun Heo wrote:
> ...
> > > +	/*
> > > +	 * If the task is allowed to run on all CPUs, simply use the
> > > +	 * architecture's cpumask directly. Otherwise, compute the
> > > +	 * intersection of the architecture's cpumask and the task's
> > > +	 * allowed cpumask.
> > > +	 */
> > > +	if (!cpus || p->nr_cpus_allowed >= num_possible_cpus() ||
> > > +	    cpumask_subset(cpus, p->cpus_ptr))
> > > +		return cpus;
> > > +
> > > +	if (!cpumask_equal(cpus, p->cpus_ptr) &&
> >
> > Hmm... isn't this covered by the preceding cpumask_subset() test? Here, cpus
> > is not a subset of p->cpus_ptr, so how can it be the same as p->cpus_ptr?
> >
> > > +	    cpumask_and(local_cpus, cpus, p->cpus_ptr))
> > > +		return local_cpus;
> > > +
> > > +	return NULL;
>
> Also, I'm wondering if there's really a benefit in checking for
> cpumask_subset() and then doing cpumask_and() only when it's needed, or if
> we should just do cpumask_and(). It's true that we can save some writes,
> but they're done on a temporary local per-CPU cpumask, so they shouldn't
> introduce cache contention.

Yeah, I can imagine it going either way, so no strong preference.

Thanks.
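[Editor's note: for reference, the simpler variant discussed above, which always computes the intersection instead of short-circuiting on cpumask_subset(), would look roughly like the following. It trades a few extra writes to the temporary per-CPU cpumask for less branching; sketch only, not part of the posted series.]

static const struct cpumask *
task_cpus_allowed_and(const struct task_struct *p, const struct cpumask *cpus,
		      struct cpumask *local_cpus)
{
	/* Tasks that can run on every CPU can use the topology mask as-is. */
	if (!cpus || p->nr_cpus_allowed >= num_possible_cpus())
		return cpus;

	/* Unconditionally intersect into the temporary per-CPU cpumask. */
	if (cpumask_and(local_cpus, cpus, p->cpus_ptr))
		return local_cpus;

	return NULL;
}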
diff --git a/kernel/sched/ext_idle.c b/kernel/sched/ext_idle.c
index 52c36a70a3d04..e1e020c27c07c 100644
--- a/kernel/sched/ext_idle.c
+++ b/kernel/sched/ext_idle.c
@@ -46,6 +46,12 @@ static struct scx_idle_cpus scx_idle_global_masks;
  */
 static struct scx_idle_cpus **scx_idle_node_masks;
 
+/*
+ * Local per-CPU cpumasks (used to generate temporary idle cpumasks).
+ */
+static DEFINE_PER_CPU(cpumask_var_t, local_llc_idle_cpumask);
+static DEFINE_PER_CPU(cpumask_var_t, local_numa_idle_cpumask);
+
 /*
  * Return the idle masks associated to a target @node.
  *
@@ -391,6 +397,30 @@ void scx_idle_update_selcpu_topology(struct sched_ext_ops *ops)
 		static_branch_disable_cpuslocked(&scx_selcpu_topo_numa);
 }
 
+/*
+ * Return the subset of @cpus that task @p can use or NULL if none of the
+ * CPUs in the @cpus cpumask can be used.
+ */
+static const struct cpumask *task_cpumask(const struct task_struct *p, const struct cpumask *cpus,
+					  struct cpumask *local_cpus)
+{
+	/*
+	 * If the task is allowed to run on all CPUs, simply use the
+	 * architecture's cpumask directly. Otherwise, compute the
+	 * intersection of the architecture's cpumask and the task's
+	 * allowed cpumask.
+	 */
+	if (!cpus || p->nr_cpus_allowed >= num_possible_cpus() ||
+	    cpumask_subset(cpus, p->cpus_ptr))
+		return cpus;
+
+	if (!cpumask_equal(cpus, p->cpus_ptr) &&
+	    cpumask_and(local_cpus, cpus, p->cpus_ptr))
+		return local_cpus;
+
+	return NULL;
+}
+
 /*
  * Built-in CPU idle selection policy:
  *
@@ -426,8 +456,7 @@ void scx_idle_update_selcpu_topology(struct sched_ext_ops *ops)
  */
 s32 scx_select_cpu_dfl(struct task_struct *p, s32 prev_cpu, u64 wake_flags, u64 flags)
 {
-	const struct cpumask *llc_cpus = NULL;
-	const struct cpumask *numa_cpus = NULL;
+	const struct cpumask *llc_cpus = NULL, *numa_cpus = NULL;
 	int node = scx_cpu_node_if_enabled(prev_cpu);
 	s32 cpu;
 
@@ -437,23 +466,16 @@ s32 scx_select_cpu_dfl(struct task_struct *p, s32 prev_cpu, u64 wake_flags, u64
 	rcu_read_lock();
 
 	/*
-	 * Determine the scheduling domain only if the task is allowed to run
-	 * on all CPUs.
-	 *
-	 * This is done primarily for efficiency, as it avoids the overhead of
-	 * updating a cpumask every time we need to select an idle CPU (which
-	 * can be costly in large SMP systems), but it also aligns logically:
-	 * if a task's scheduling domain is restricted by user-space (through
-	 * CPU affinity), the task will simply use the flat scheduling domain
-	 * defined by user-space.
+	 * Determine the subset of CPUs that the task can use in its
+	 * current LLC and node.
 	 */
-	if (p->nr_cpus_allowed >= num_possible_cpus()) {
-		if (static_branch_maybe(CONFIG_NUMA, &scx_selcpu_topo_numa))
-			numa_cpus = numa_span(prev_cpu);
+	if (static_branch_maybe(CONFIG_NUMA, &scx_selcpu_topo_numa))
+		numa_cpus = task_cpumask(p, numa_span(prev_cpu),
+					 this_cpu_cpumask_var_ptr(local_numa_idle_cpumask));
 
-		if (static_branch_maybe(CONFIG_SCHED_MC, &scx_selcpu_topo_llc))
-			llc_cpus = llc_span(prev_cpu);
-	}
+	if (static_branch_maybe(CONFIG_SCHED_MC, &scx_selcpu_topo_llc))
+		llc_cpus = task_cpumask(p, llc_span(prev_cpu),
+					this_cpu_cpumask_var_ptr(local_llc_idle_cpumask));
 
 	/*
 	 * If WAKE_SYNC, try to migrate the wakee to the waker's CPU.
@@ -598,7 +620,7 @@ s32 scx_select_cpu_dfl(struct task_struct *p, s32 prev_cpu, u64 wake_flags, u64
  */
 void scx_idle_init_masks(void)
 {
-	int node;
+	int i;
 
 	/* Allocate global idle cpumasks */
 	BUG_ON(!alloc_cpumask_var(&scx_idle_global_masks.cpu, GFP_KERNEL));
@@ -609,13 +631,21 @@ void scx_idle_init_masks(void)
 			       sizeof(*scx_idle_node_masks), GFP_KERNEL);
 	BUG_ON(!scx_idle_node_masks);
 
-	for_each_node(node) {
-		scx_idle_node_masks[node] = kzalloc_node(sizeof(**scx_idle_node_masks),
-							 GFP_KERNEL, node);
-		BUG_ON(!scx_idle_node_masks[node]);
+	for_each_node(i) {
+		scx_idle_node_masks[i] = kzalloc_node(sizeof(**scx_idle_node_masks),
+						      GFP_KERNEL, i);
+		BUG_ON(!scx_idle_node_masks[i]);
+
+		BUG_ON(!alloc_cpumask_var_node(&scx_idle_node_masks[i]->cpu, GFP_KERNEL, i));
+		BUG_ON(!alloc_cpumask_var_node(&scx_idle_node_masks[i]->smt, GFP_KERNEL, i));
+	}
 
-		BUG_ON(!alloc_cpumask_var_node(&scx_idle_node_masks[node]->cpu, GFP_KERNEL, node));
-		BUG_ON(!alloc_cpumask_var_node(&scx_idle_node_masks[node]->smt, GFP_KERNEL, node));
+	/* Allocate local per-cpu idle cpumasks */
+	for_each_possible_cpu(i) {
+		BUG_ON(!alloc_cpumask_var_node(&per_cpu(local_llc_idle_cpumask, i),
+					       GFP_KERNEL, cpu_to_node(i)));
+		BUG_ON(!alloc_cpumask_var_node(&per_cpu(local_numa_idle_cpumask, i),
+					       GFP_KERNEL, cpu_to_node(i)));
 	}
 }
The built-in idle selection policy, scx_select_cpu_dfl(), always prioritizes
picking idle CPUs within the same LLC or NUMA node, but these optimizations
are currently applied only when a task has no CPU affinity constraints.

This is done primarily for efficiency, as it avoids the overhead of updating
a cpumask every time we need to select an idle CPU (which can be costly in
large SMP systems).

However, this approach limits the effectiveness of the built-in idle policy
and results in inconsistent behavior, as affinity-restricted tasks don't
benefit from topology-aware optimizations.

To address this, modify the policy to apply LLC and NUMA-aware optimizations
even when a task is constrained to a subset of CPUs. We can still avoid
updating the cpumasks by checking if the subset of LLC and node CPUs are
contained in the subset of allowed CPUs usable by the task (which is true in
most cases, i.e. for tasks that don't have affinity constraints).

Moreover, use temporary local per-CPU cpumasks to determine the LLC and node
subsets, minimizing potential overhead even on large SMP systems.

Signed-off-by: Andrea Righi <arighi@nvidia.com>
---
 kernel/sched/ext_idle.c | 78 ++++++++++++++++++++++++++++-------------
 1 file changed, 54 insertions(+), 24 deletions(-)