Message ID: 20250207211104.30009-6-arighi@nvidia.com (mailing list archive)
State:      Not Applicable
Series:     [1/6] mm/numa: Introduce numa_nearest_nodemask()

Checks:
  netdev/tree_selection: success (Not a local patch)
Hello, On Fri, Feb 07, 2025 at 09:40:52PM +0100, Andrea Righi wrote: > +/* > + * cpumasks to track idle CPUs within each NUMA node. > + * > + * If SCX_OPS_BUILTIN_IDLE_PER_NODE is not enabled, a single global cpumask > + * from is used to track all the idle CPUs in the system. > + */ > +struct idle_cpus { > cpumask_var_t cpu; > cpumask_var_t smt; > -} idle_masks CL_ALIGNED_IF_ONSTACK; > +}; Can you prefix the type name with scx_? Unrelated to this series but I wonder whether we can replace "smt" with "core" in the future to become more consistent with how the terms are used in the kernel: struct scx_idle_masks { cpumask_var_t cpus; cpumask_var_t cores; }; We expose "smt" name through kfuncs but we can rename that to "core" through compat macros later too. > +/* > + * Find the best idle CPU in the system, relative to @node. > + */ > +s32 scx_pick_idle_cpu(const struct cpumask *cpus_allowed, int node, u64 flags) > +{ > + nodemask_t unvisited = NODE_MASK_ALL; > + s32 cpu = -EBUSY; > + > + if (!static_branch_maybe(CONFIG_NUMA, &scx_builtin_idle_per_node)) > + return pick_idle_cpu_from_node(cpus_allowed, NUMA_NO_NODE, flags); > + > + /* > + * If an initial node is not specified, start with the current > + * node. > + */ > + if (node == NUMA_NO_NODE) > + node = numa_node_id(); > + > + /* > + * Traverse all nodes in order of increasing distance, starting > + * from @node. > + * > + * This loop is O(N^2), with N being the amount of NUMA nodes, > + * which might be quite expensive in large NUMA systems. However, > + * this complexity comes into play only when a scheduler enables > + * SCX_OPS_BUILTIN_IDLE_PER_NODE and it's requesting an idle CPU > + * without specifying a target NUMA node, so it shouldn't be a > + * bottleneck is most cases. > + * > + * As a future optimization we may want to cache the list of hop > + * nodes in a per-node array, instead of actually traversing them > + * every time. > + */ > + for_each_numa_node(node, unvisited, N_POSSIBLE) { > + cpu = pick_idle_cpu_from_node(cpus_allowed, node, flags); Maybe rename pick_idle_cpu_in_node() to stay in sync with SCX_PICK_IDLE_IN_NODE? It's not like pick_idle_cpu_from_node() walks from the node, right? It just picks within the node. > @@ -460,38 +582,50 @@ s32 scx_select_cpu_dfl(struct task_struct *p, s32 prev_cpu, u64 wake_flags, bool > > void scx_idle_reset_masks(void) > { > + int node; > + > + if (!static_branch_maybe(CONFIG_NUMA, &scx_builtin_idle_per_node)) { > + cpumask_copy(idle_cpumask(NUMA_NO_NODE)->cpu, cpu_online_mask); > + cpumask_copy(idle_cpumask(NUMA_NO_NODE)->smt, cpu_online_mask); > + return; > + } > + > /* > * Consider all online cpus idle. Should converge to the actual state > * quickly. > */ > - cpumask_copy(idle_masks.cpu, cpu_online_mask); > - cpumask_copy(idle_masks.smt, cpu_online_mask); > -} > + for_each_node(node) { > + const struct cpumask *node_mask = cpumask_of_node(node); > + struct cpumask *idle_cpus = idle_cpumask(node)->cpu; > + struct cpumask *idle_smts = idle_cpumask(node)->smt; > -void scx_idle_init_masks(void) > -{ > - BUG_ON(!alloc_cpumask_var(&idle_masks.cpu, GFP_KERNEL)); > - BUG_ON(!alloc_cpumask_var(&idle_masks.smt, GFP_KERNEL)); > + cpumask_and(idle_cpus, cpu_online_mask, node_mask); > + cpumask_copy(idle_smts, idle_cpus); > + } nitpick: Maybe something like the following is more symmetric with the global case and easier to read? 
for_each_node(node) { const struct cpumask *node_mask = cpumask_of_node(node); cpumask_and(idle_cpumask(node)->cpu, cpu_online_mask, node_mask); cpumask_and(idle_cpumask(node)->smt, cpu_online_mask, node_mask); } > } > > static void update_builtin_idle(int cpu, bool idle) > { > - assign_cpu(cpu, idle_masks.cpu, idle); > + int node = idle_cpu_to_node(cpu); minor: I wonder whether idle_cpu_to_node() name is a bit confusing - why does a CPU being idle have anything to do with its node mapping? If there is a better naming convention, great. If not, it is what it is. Thanks.
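For reference, combining the two suggestions above (the scx_-prefixed type name and the symmetric cpumask_and() form of the per-node reset) would look roughly like the sketch below. This is only an illustration of the review comments, not the final code; the cpu/smt field names are kept as in the patch since the smt-to-core rename is deferred.

struct scx_idle_masks {
	cpumask_var_t cpu;
	cpumask_var_t smt;
};

void scx_idle_reset_masks(void)
{
	int node;

	if (!static_branch_maybe(CONFIG_NUMA, &scx_builtin_idle_per_node)) {
		cpumask_copy(idle_cpumask(NUMA_NO_NODE)->cpu, cpu_online_mask);
		cpumask_copy(idle_cpumask(NUMA_NO_NODE)->smt, cpu_online_mask);
		return;
	}

	/* Consider all online CPUs idle; the state converges quickly. */
	for_each_node(node) {
		const struct cpumask *node_mask = cpumask_of_node(node);

		cpumask_and(idle_cpumask(node)->cpu, cpu_online_mask, node_mask);
		cpumask_and(idle_cpumask(node)->smt, cpu_online_mask, node_mask);
	}
}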
On Fri, Feb 07, 2025 at 12:30:37PM -1000, Tejun Heo wrote: > Hello, > > On Fri, Feb 07, 2025 at 09:40:52PM +0100, Andrea Righi wrote: > > +/* > > + * cpumasks to track idle CPUs within each NUMA node. > > + * > > + * If SCX_OPS_BUILTIN_IDLE_PER_NODE is not enabled, a single global cpumask > > + * from is used to track all the idle CPUs in the system. > > + */ > > +struct idle_cpus { > > cpumask_var_t cpu; > > cpumask_var_t smt; > > -} idle_masks CL_ALIGNED_IF_ONSTACK; > > +}; > > Can you prefix the type name with scx_? > > Unrelated to this series but I wonder whether we can replace "smt" with > "core" in the future to become more consistent with how the terms are used > in the kernel: > > struct scx_idle_masks { > cpumask_var_t cpus; > cpumask_var_t cores; > }; > > We expose "smt" name through kfuncs but we can rename that to "core" through > compat macros later too. > > > +/* > > + * Find the best idle CPU in the system, relative to @node. > > + */ > > +s32 scx_pick_idle_cpu(const struct cpumask *cpus_allowed, int node, u64 flags) > > +{ > > + nodemask_t unvisited = NODE_MASK_ALL; > > + s32 cpu = -EBUSY; > > + > > + if (!static_branch_maybe(CONFIG_NUMA, &scx_builtin_idle_per_node)) > > + return pick_idle_cpu_from_node(cpus_allowed, NUMA_NO_NODE, flags); > > + > > + /* > > + * If an initial node is not specified, start with the current > > + * node. > > + */ > > + if (node == NUMA_NO_NODE) > > + node = numa_node_id(); > > + > > + /* > > + * Traverse all nodes in order of increasing distance, starting > > + * from @node. > > + * > > + * This loop is O(N^2), with N being the amount of NUMA nodes, > > + * which might be quite expensive in large NUMA systems. However, > > + * this complexity comes into play only when a scheduler enables > > + * SCX_OPS_BUILTIN_IDLE_PER_NODE and it's requesting an idle CPU > > + * without specifying a target NUMA node, so it shouldn't be a > > + * bottleneck is most cases. > > + * > > + * As a future optimization we may want to cache the list of hop > > + * nodes in a per-node array, instead of actually traversing them > > + * every time. > > + */ > > + for_each_numa_node(node, unvisited, N_POSSIBLE) { > > + cpu = pick_idle_cpu_from_node(cpus_allowed, node, flags); > > Maybe rename pick_idle_cpu_in_node() to stay in sync with > SCX_PICK_IDLE_IN_NODE? It's not like pick_idle_cpu_from_node() walks from > the node, right? It just picks within the node. > > > @@ -460,38 +582,50 @@ s32 scx_select_cpu_dfl(struct task_struct *p, s32 prev_cpu, u64 wake_flags, bool > > > > void scx_idle_reset_masks(void) > > { > > + int node; > > + > > + if (!static_branch_maybe(CONFIG_NUMA, &scx_builtin_idle_per_node)) { > > + cpumask_copy(idle_cpumask(NUMA_NO_NODE)->cpu, cpu_online_mask); > > + cpumask_copy(idle_cpumask(NUMA_NO_NODE)->smt, cpu_online_mask); > > + return; > > + } > > + > > /* > > * Consider all online cpus idle. Should converge to the actual state > > * quickly. 
> > */ > > - cpumask_copy(idle_masks.cpu, cpu_online_mask); > > - cpumask_copy(idle_masks.smt, cpu_online_mask); > > -} > > + for_each_node(node) { > > + const struct cpumask *node_mask = cpumask_of_node(node); > > + struct cpumask *idle_cpus = idle_cpumask(node)->cpu; > > + struct cpumask *idle_smts = idle_cpumask(node)->smt; > > -void scx_idle_init_masks(void) > > -{ > > - BUG_ON(!alloc_cpumask_var(&idle_masks.cpu, GFP_KERNEL)); > > - BUG_ON(!alloc_cpumask_var(&idle_masks.smt, GFP_KERNEL)); > > + cpumask_and(idle_cpus, cpu_online_mask, node_mask); > > + cpumask_copy(idle_smts, idle_cpus); > > + } > > nitpick: Maybe something like the following is more symmetric with the > global case and easier to read? > > for_each_node(node) { > const struct cpumask *node_mask = cpumask_of_node(node); > cpumask_and(idle_cpumask(node)->cpu, cpu_online_mask, node_mask); > cpumask_and(idle_cpumask(node)->smt, cpu_online_mask, node_mask); > } > > > } > > > > static void update_builtin_idle(int cpu, bool idle) > > { > > - assign_cpu(cpu, idle_masks.cpu, idle); > > + int node = idle_cpu_to_node(cpu); Ok to all of the above. > > minor: I wonder whether idle_cpu_to_node() name is a bit confusing - why > does a CPU being idle have anything to do with its node mapping? If there is > a better naming convention, great. If not, it is what it is. Maybe scx_cpu_to_node()? At the end it's just a wrapper to cpu_to_node(), but from the scx perspective, so if NUMA-awareness is not enabled in scx, it'd return NUMA_NO_NODE. Thanks, -Andrea
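A minimal sketch of the rename Andrea floats here: same body as idle_cpu_to_node() in the patch, only the name changes, making it clearer that the NUMA_NO_NODE fallback reflects the scx idle-tracking mode rather than the CPU's idle state.

static int scx_cpu_to_node(int cpu)
{
	/* Without per-node idle tracking, everything maps to the global masks. */
	if (!static_branch_maybe(CONFIG_NUMA, &scx_builtin_idle_per_node))
		return NUMA_NO_NODE;

	return cpu_to_node(cpu);
}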
On Fri, Feb 07, 2025 at 09:40:52PM +0100, Andrea Righi wrote: > Using a single global idle mask can lead to inefficiencies and a lot of > stress on the cache coherency protocol on large systems with multiple > NUMA nodes, since all the CPUs can create a really intense read/write > activity on the single global cpumask. Can you put your perf numbers here too? > Therefore, split the global cpumask into multiple per-NUMA node cpumasks > to improve scalability and performance on large systems. > > The concept is that each cpumask will track only the idle CPUs within > its corresponding NUMA node, treating CPUs in other NUMA nodes as busy. > In this way concurrent access to the idle cpumask will be restricted > within each NUMA node. > > The split of multiple per-node idle cpumasks can be controlled using the > SCX_OPS_BUILTIN_IDLE_PER_NODE flag. > > By default SCX_OPS_BUILTIN_IDLE_PER_NODE is not enabled and a global > host-wide idle cpumask is used, maintaining the previous behavior. > > NOTE: if a scheduler explicitly enables the per-node idle cpumasks (via > SCX_OPS_BUILTIN_IDLE_PER_NODE), scx_bpf_get_idle_cpu/smtmask() will > trigger an scx error, since there are no system-wide cpumasks. > > Signed-off-by: Andrea Righi <arighi@nvidia.com> > --- > kernel/sched/ext_idle.c | 242 ++++++++++++++++++++++++++++++++-------- > kernel/sched/ext_idle.h | 11 +- > 2 files changed, 203 insertions(+), 50 deletions(-) > > diff --git a/kernel/sched/ext_idle.c b/kernel/sched/ext_idle.c > index a3f2b00903ac2..4b90ec9018c1a 100644 > --- a/kernel/sched/ext_idle.c > +++ b/kernel/sched/ext_idle.c > @@ -18,25 +18,88 @@ DEFINE_STATIC_KEY_FALSE(scx_builtin_idle_enabled); > DEFINE_STATIC_KEY_FALSE(scx_builtin_idle_per_node); > > #ifdef CONFIG_SMP > -#ifdef CONFIG_CPUMASK_OFFSTACK > -#define CL_ALIGNED_IF_ONSTACK > -#else > -#define CL_ALIGNED_IF_ONSTACK __cacheline_aligned_in_smp > -#endif > - > /* Enable/disable LLC aware optimizations */ > DEFINE_STATIC_KEY_FALSE(scx_selcpu_topo_llc); > > /* Enable/disable NUMA aware optimizations */ > DEFINE_STATIC_KEY_FALSE(scx_selcpu_topo_numa); > > -static struct { > +/* > + * cpumasks to track idle CPUs within each NUMA node. > + * > + * If SCX_OPS_BUILTIN_IDLE_PER_NODE is not enabled, a single global cpumask > + * from is used to track all the idle CPUs in the system. > + */ > +struct idle_cpus { > cpumask_var_t cpu; > cpumask_var_t smt; > -} idle_masks CL_ALIGNED_IF_ONSTACK; > +}; > + > +/* > + * Global host-wide idle cpumasks (used when SCX_OPS_BUILTIN_IDLE_PER_NODE > + * is not enabled). > + */ > +static struct idle_cpus scx_idle_global_masks; > + > +/* > + * Per-node idle cpumasks. > + */ > +static struct idle_cpus **scx_idle_node_masks; > + > +/* > + * Initialize per-node idle cpumasks. > + * > + * In case of a single NUMA node or if NUMA support is disabled, only a > + * single global host-wide cpumask will be initialized. 
> + */ > +void scx_idle_init_masks(void) > +{ > + int node; > + > + /* Allocate global idle cpumasks */ > + BUG_ON(!alloc_cpumask_var(&scx_idle_global_masks.cpu, GFP_KERNEL)); > + BUG_ON(!alloc_cpumask_var(&scx_idle_global_masks.smt, GFP_KERNEL)); > + > + /* Allocate per-node idle cpumasks */ > + scx_idle_node_masks = kcalloc(num_possible_nodes(), > + sizeof(*scx_idle_node_masks), GFP_KERNEL); > + BUG_ON(!scx_idle_node_masks); > + > + for_each_node(node) { > + scx_idle_node_masks[node] = kzalloc_node(sizeof(**scx_idle_node_masks), > + GFP_KERNEL, node); > + BUG_ON(!scx_idle_node_masks[node]); > + > + BUG_ON(!alloc_cpumask_var_node(&scx_idle_node_masks[node]->cpu, GFP_KERNEL, node)); > + BUG_ON(!alloc_cpumask_var_node(&scx_idle_node_masks[node]->smt, GFP_KERNEL, node)); > + } > +} > + > +/* > + * Return the idle masks associated to a target @node. > + */ > +static struct idle_cpus *idle_cpumask(int node) > +{ > + return node == NUMA_NO_NODE ? &scx_idle_global_masks : scx_idle_node_masks[node]; > +} > + > +/* > + * Return the node id associated to a target idle CPU (used to determine > + * the proper idle cpumask). > + */ > +static int idle_cpu_to_node(int cpu) > +{ > + if (!static_branch_maybe(CONFIG_NUMA, &scx_builtin_idle_per_node)) > + return NUMA_NO_NODE; > + > + return cpu_to_node(cpu); > +} > > bool scx_idle_test_and_clear_cpu(int cpu) > { > + int node = idle_cpu_to_node(cpu); > + struct cpumask *idle_cpus = idle_cpumask(node)->cpu; > + > #ifdef CONFIG_SCHED_SMT > /* > * SMT mask should be cleared whether we can claim @cpu or not. The SMT > @@ -45,33 +108,38 @@ bool scx_idle_test_and_clear_cpu(int cpu) > */ > if (sched_smt_active()) { > const struct cpumask *smt = cpu_smt_mask(cpu); > + struct cpumask *idle_smts = idle_cpumask(node)->smt; > > /* > * If offline, @cpu is not its own sibling and > * scx_pick_idle_cpu() can get caught in an infinite loop as > - * @cpu is never cleared from idle_masks.smt. Ensure that @cpu > - * is eventually cleared. > + * @cpu is never cleared from the idle SMT mask. Ensure that > + * @cpu is eventually cleared. > * > * NOTE: Use cpumask_intersects() and cpumask_test_cpu() to > * reduce memory writes, which may help alleviate cache > * coherence pressure. > */ > - if (cpumask_intersects(smt, idle_masks.smt)) > - cpumask_andnot(idle_masks.smt, idle_masks.smt, smt); > - else if (cpumask_test_cpu(cpu, idle_masks.smt)) > - __cpumask_clear_cpu(cpu, idle_masks.smt); > + if (cpumask_intersects(smt, idle_smts)) > + cpumask_andnot(idle_smts, idle_smts, smt); > + else if (cpumask_test_cpu(cpu, idle_smts)) > + __cpumask_clear_cpu(cpu, idle_smts); > } > #endif > - return cpumask_test_and_clear_cpu(cpu, idle_masks.cpu); > + > + return cpumask_test_and_clear_cpu(cpu, idle_cpus); > } > > -s32 scx_pick_idle_cpu(const struct cpumask *cpus_allowed, u64 flags) > +/* > + * Pick an idle CPU in a specific NUMA node. 
> + */ > +s32 pick_idle_cpu_from_node(const struct cpumask *cpus_allowed, int node, u64 flags) > { > int cpu; > > retry: > if (sched_smt_active()) { > - cpu = cpumask_any_and_distribute(idle_masks.smt, cpus_allowed); > + cpu = cpumask_any_and_distribute(idle_cpumask(node)->smt, cpus_allowed); > if (cpu < nr_cpu_ids) > goto found; > > @@ -79,7 +147,7 @@ s32 scx_pick_idle_cpu(const struct cpumask *cpus_allowed, u64 flags) > return -EBUSY; > } > > - cpu = cpumask_any_and_distribute(idle_masks.cpu, cpus_allowed); > + cpu = cpumask_any_and_distribute(idle_cpumask(node)->cpu, cpus_allowed); > if (cpu >= nr_cpu_ids) > return -EBUSY; > > @@ -90,6 +158,55 @@ s32 scx_pick_idle_cpu(const struct cpumask *cpus_allowed, u64 flags) > goto retry; > } > > +/* > + * Find the best idle CPU in the system, relative to @node. > + */ > +s32 scx_pick_idle_cpu(const struct cpumask *cpus_allowed, int node, u64 flags) > +{ > + nodemask_t unvisited = NODE_MASK_ALL; > + s32 cpu = -EBUSY; > + > + if (!static_branch_maybe(CONFIG_NUMA, &scx_builtin_idle_per_node)) > + return pick_idle_cpu_from_node(cpus_allowed, NUMA_NO_NODE, flags); > + > + /* > + * If an initial node is not specified, start with the current > + * node. > + */ > + if (node == NUMA_NO_NODE) > + node = numa_node_id(); > + > + /* > + * Traverse all nodes in order of increasing distance, starting > + * from @node. > + * > + * This loop is O(N^2), with N being the amount of NUMA nodes, > + * which might be quite expensive in large NUMA systems. However, > + * this complexity comes into play only when a scheduler enables > + * SCX_OPS_BUILTIN_IDLE_PER_NODE and it's requesting an idle CPU > + * without specifying a target NUMA node, so it shouldn't be a > + * bottleneck is most cases. > + * > + * As a future optimization we may want to cache the list of hop > + * nodes in a per-node array, instead of actually traversing them > + * every time. > + */ > + for_each_numa_node(node, unvisited, N_POSSIBLE) { > + cpu = pick_idle_cpu_from_node(cpus_allowed, node, flags); > + if (cpu >= 0) > + break; > + > + /* > + * Check if the search is restricted to the same core or > + * the same node. > + */ > + if (flags & SCX_PICK_IDLE_IN_NODE) > + break; If SCX_PICK_IDLE_IN_NODE is set, you can avoid the loop at all, right? Just: if (flags & SCX_PICK_IDLE_IN_NODE) return pick_idle_cpu_from_node(cpus_allowed, node, flags); for_each_numa_node(node, unvisited, N_POSSIBLE) { cpu = pick_idle_cpu_from_node(cpus_allowed, node, flags); if (cpu >= 0) return cpu; } Thanks, Yury
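Folding Yury's suggestion into the function quoted above gives roughly the following shape (a sketch based on the code in this patch, not the final version): SCX_PICK_IDLE_IN_NODE is handled before the traversal, so the loop only runs for cross-node searches.

s32 scx_pick_idle_cpu(const struct cpumask *cpus_allowed, int node, u64 flags)
{
	nodemask_t unvisited = NODE_MASK_ALL;
	s32 cpu;

	if (!static_branch_maybe(CONFIG_NUMA, &scx_builtin_idle_per_node))
		return pick_idle_cpu_from_node(cpus_allowed, NUMA_NO_NODE, flags);

	if (node == NUMA_NO_NODE)
		node = numa_node_id();

	/* A node-restricted search never needs to leave @node. */
	if (flags & SCX_PICK_IDLE_IN_NODE)
		return pick_idle_cpu_from_node(cpus_allowed, node, flags);

	/* Visit the remaining nodes in order of increasing distance. */
	for_each_numa_node(node, unvisited, N_POSSIBLE) {
		cpu = pick_idle_cpu_from_node(cpus_allowed, node, flags);
		if (cpu >= 0)
			return cpu;
	}

	return -EBUSY;
}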
On Sun, Feb 09, 2025 at 01:07:44PM -0500, Yury Norov wrote: > On Fri, Feb 07, 2025 at 09:40:52PM +0100, Andrea Righi wrote: > > Using a single global idle mask can lead to inefficiencies and a lot of > > stress on the cache coherency protocol on large systems with multiple > > NUMA nodes, since all the CPUs can create a really intense read/write > > activity on the single global cpumask. > > Can you put your perf numbers here too? > > > Therefore, split the global cpumask into multiple per-NUMA node cpumasks > > to improve scalability and performance on large systems. > > > > The concept is that each cpumask will track only the idle CPUs within > > its corresponding NUMA node, treating CPUs in other NUMA nodes as busy. > > In this way concurrent access to the idle cpumask will be restricted > > within each NUMA node. > > > > The split of multiple per-node idle cpumasks can be controlled using the > > SCX_OPS_BUILTIN_IDLE_PER_NODE flag. > > > > By default SCX_OPS_BUILTIN_IDLE_PER_NODE is not enabled and a global > > host-wide idle cpumask is used, maintaining the previous behavior. > > > > NOTE: if a scheduler explicitly enables the per-node idle cpumasks (via > > SCX_OPS_BUILTIN_IDLE_PER_NODE), scx_bpf_get_idle_cpu/smtmask() will > > trigger an scx error, since there are no system-wide cpumasks. > > > > Signed-off-by: Andrea Righi <arighi@nvidia.com> > > --- > > kernel/sched/ext_idle.c | 242 ++++++++++++++++++++++++++++++++-------- > > kernel/sched/ext_idle.h | 11 +- > > 2 files changed, 203 insertions(+), 50 deletions(-) > > > > diff --git a/kernel/sched/ext_idle.c b/kernel/sched/ext_idle.c > > index a3f2b00903ac2..4b90ec9018c1a 100644 > > --- a/kernel/sched/ext_idle.c > > +++ b/kernel/sched/ext_idle.c > > @@ -18,25 +18,88 @@ DEFINE_STATIC_KEY_FALSE(scx_builtin_idle_enabled); > > DEFINE_STATIC_KEY_FALSE(scx_builtin_idle_per_node); > > > > #ifdef CONFIG_SMP > > -#ifdef CONFIG_CPUMASK_OFFSTACK > > -#define CL_ALIGNED_IF_ONSTACK > > -#else > > -#define CL_ALIGNED_IF_ONSTACK __cacheline_aligned_in_smp > > -#endif > > - > > /* Enable/disable LLC aware optimizations */ > > DEFINE_STATIC_KEY_FALSE(scx_selcpu_topo_llc); > > > > /* Enable/disable NUMA aware optimizations */ > > DEFINE_STATIC_KEY_FALSE(scx_selcpu_topo_numa); > > > > -static struct { > > +/* > > + * cpumasks to track idle CPUs within each NUMA node. > > + * > > + * If SCX_OPS_BUILTIN_IDLE_PER_NODE is not enabled, a single global cpumask > > + * from is used to track all the idle CPUs in the system. > > + */ > > +struct idle_cpus { > > cpumask_var_t cpu; > > cpumask_var_t smt; > > -} idle_masks CL_ALIGNED_IF_ONSTACK; > > +}; > > + > > +/* > > + * Global host-wide idle cpumasks (used when SCX_OPS_BUILTIN_IDLE_PER_NODE > > + * is not enabled). > > + */ > > +static struct idle_cpus scx_idle_global_masks; > > + > > +/* > > + * Per-node idle cpumasks. > > + */ > > +static struct idle_cpus **scx_idle_node_masks; > > + > > +/* > > + * Initialize per-node idle cpumasks. > > + * > > + * In case of a single NUMA node or if NUMA support is disabled, only a > > + * single global host-wide cpumask will be initialized. 
> > + */ > > +void scx_idle_init_masks(void) > > +{ > > + int node; > > + > > + /* Allocate global idle cpumasks */ > > + BUG_ON(!alloc_cpumask_var(&scx_idle_global_masks.cpu, GFP_KERNEL)); > > + BUG_ON(!alloc_cpumask_var(&scx_idle_global_masks.smt, GFP_KERNEL)); > > + > > + /* Allocate per-node idle cpumasks */ > > + scx_idle_node_masks = kcalloc(num_possible_nodes(), > > + sizeof(*scx_idle_node_masks), GFP_KERNEL); > > + BUG_ON(!scx_idle_node_masks); > > + > > + for_each_node(node) { > > + scx_idle_node_masks[node] = kzalloc_node(sizeof(**scx_idle_node_masks), > > + GFP_KERNEL, node); > > + BUG_ON(!scx_idle_node_masks[node]); > > + > > + BUG_ON(!alloc_cpumask_var_node(&scx_idle_node_masks[node]->cpu, GFP_KERNEL, node)); > > + BUG_ON(!alloc_cpumask_var_node(&scx_idle_node_masks[node]->smt, GFP_KERNEL, node)); > > + } > > +} > > + > > +/* > > + * Return the idle masks associated to a target @node. > > + */ > > +static struct idle_cpus *idle_cpumask(int node) > > +{ > > + return node == NUMA_NO_NODE ? &scx_idle_global_masks : scx_idle_node_masks[node]; > > +} > > + > > +/* > > + * Return the node id associated to a target idle CPU (used to determine > > + * the proper idle cpumask). > > + */ > > +static int idle_cpu_to_node(int cpu) > > +{ > > + if (!static_branch_maybe(CONFIG_NUMA, &scx_builtin_idle_per_node)) > > + return NUMA_NO_NODE; > > + > > + return cpu_to_node(cpu); > > +} > > > > bool scx_idle_test_and_clear_cpu(int cpu) > > { > > + int node = idle_cpu_to_node(cpu); > > + struct cpumask *idle_cpus = idle_cpumask(node)->cpu; > > + > > #ifdef CONFIG_SCHED_SMT > > /* > > * SMT mask should be cleared whether we can claim @cpu or not. The SMT > > @@ -45,33 +108,38 @@ bool scx_idle_test_and_clear_cpu(int cpu) > > */ > > if (sched_smt_active()) { > > const struct cpumask *smt = cpu_smt_mask(cpu); > > + struct cpumask *idle_smts = idle_cpumask(node)->smt; > > > > /* > > * If offline, @cpu is not its own sibling and > > * scx_pick_idle_cpu() can get caught in an infinite loop as > > - * @cpu is never cleared from idle_masks.smt. Ensure that @cpu > > - * is eventually cleared. > > + * @cpu is never cleared from the idle SMT mask. Ensure that > > + * @cpu is eventually cleared. > > * > > * NOTE: Use cpumask_intersects() and cpumask_test_cpu() to > > * reduce memory writes, which may help alleviate cache > > * coherence pressure. > > */ > > - if (cpumask_intersects(smt, idle_masks.smt)) > > - cpumask_andnot(idle_masks.smt, idle_masks.smt, smt); > > - else if (cpumask_test_cpu(cpu, idle_masks.smt)) > > - __cpumask_clear_cpu(cpu, idle_masks.smt); > > + if (cpumask_intersects(smt, idle_smts)) > > + cpumask_andnot(idle_smts, idle_smts, smt); > > + else if (cpumask_test_cpu(cpu, idle_smts)) > > + __cpumask_clear_cpu(cpu, idle_smts); > > } > > #endif > > - return cpumask_test_and_clear_cpu(cpu, idle_masks.cpu); > > + > > + return cpumask_test_and_clear_cpu(cpu, idle_cpus); > > } > > > > -s32 scx_pick_idle_cpu(const struct cpumask *cpus_allowed, u64 flags) > > +/* > > + * Pick an idle CPU in a specific NUMA node. 
> > + */ > > +s32 pick_idle_cpu_from_node(const struct cpumask *cpus_allowed, int node, u64 flags) > > { > > int cpu; > > > > retry: > > if (sched_smt_active()) { > > - cpu = cpumask_any_and_distribute(idle_masks.smt, cpus_allowed); > > + cpu = cpumask_any_and_distribute(idle_cpumask(node)->smt, cpus_allowed); > > if (cpu < nr_cpu_ids) > > goto found; > > > > @@ -79,7 +147,7 @@ s32 scx_pick_idle_cpu(const struct cpumask *cpus_allowed, u64 flags) > > return -EBUSY; > > } > > > > - cpu = cpumask_any_and_distribute(idle_masks.cpu, cpus_allowed); > > + cpu = cpumask_any_and_distribute(idle_cpumask(node)->cpu, cpus_allowed); > > if (cpu >= nr_cpu_ids) > > return -EBUSY; > > > > @@ -90,6 +158,55 @@ s32 scx_pick_idle_cpu(const struct cpumask *cpus_allowed, u64 flags) > > goto retry; > > } > > > > +/* > > + * Find the best idle CPU in the system, relative to @node. > > + */ > > +s32 scx_pick_idle_cpu(const struct cpumask *cpus_allowed, int node, u64 flags) > > +{ > > + nodemask_t unvisited = NODE_MASK_ALL; This should be a NODEMASK_ALLOC(). We don't want to eat up too much of the stack, right? > > + s32 cpu = -EBUSY; > > + > > + if (!static_branch_maybe(CONFIG_NUMA, &scx_builtin_idle_per_node)) > > + return pick_idle_cpu_from_node(cpus_allowed, NUMA_NO_NODE, flags); > > + > > + /* > > + * If an initial node is not specified, start with the current > > + * node. > > + */ > > + if (node == NUMA_NO_NODE) > > + node = numa_node_id(); > > + > > + /* > > + * Traverse all nodes in order of increasing distance, starting > > + * from @node. > > + * > > + * This loop is O(N^2), with N being the amount of NUMA nodes, > > + * which might be quite expensive in large NUMA systems. However, > > + * this complexity comes into play only when a scheduler enables > > + * SCX_OPS_BUILTIN_IDLE_PER_NODE and it's requesting an idle CPU > > + * without specifying a target NUMA node, so it shouldn't be a > > + * bottleneck is most cases. > > + * > > + * As a future optimization we may want to cache the list of hop > > + * nodes in a per-node array, instead of actually traversing them > > + * every time. > > + */ > > + for_each_numa_node(node, unvisited, N_POSSIBLE) { > > + cpu = pick_idle_cpu_from_node(cpus_allowed, node, flags); > > + if (cpu >= 0) > > + break; > > + > > + /* > > + * Check if the search is restricted to the same core or > > + * the same node. > > + */ > > + if (flags & SCX_PICK_IDLE_IN_NODE) > > + break; > > If SCX_PICK_IDLE_IN_NODE is set, you can avoid the loop at all, right? > Just: > if (flags & SCX_PICK_IDLE_IN_NODE) > return pick_idle_cpu_from_node(cpus_allowed, node, flags); > > for_each_numa_node(node, unvisited, N_POSSIBLE) { > cpu = pick_idle_cpu_from_node(cpus_allowed, node, flags); > if (cpu >= 0) > return cpu; > } > > Thanks, > Yury
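For concreteness, the NODEMASK_ALLOC() variant Yury suggests would start roughly as below. The GFP flag is only a placeholder, and the allocation-context problem this runs into (the function can be reached while holding raw spinlocks) is what the rest of the thread discusses.

s32 scx_pick_idle_cpu(const struct cpumask *cpus_allowed, int node, u64 flags)
{
	/* Heap-allocates when nodemask_t is large, stack-declares otherwise. */
	NODEMASK_ALLOC(nodemask_t, unvisited, GFP_KERNEL);
	s32 cpu = -EBUSY;

	if (!unvisited)
		return -EBUSY;

	nodes_setall(*unvisited);

	/* ... node traversal as in the quoted patch, using *unvisited ... */

	NODEMASK_FREE(unvisited);
	return cpu;
}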
On Mon, Feb 10, 2025 at 11:57:42AM -0500, Yury Norov wrote: ... > > > +/* > > > + * Find the best idle CPU in the system, relative to @node. > > > + */ > > > +s32 scx_pick_idle_cpu(const struct cpumask *cpus_allowed, int node, u64 flags) > > > +{ > > > + nodemask_t unvisited = NODE_MASK_ALL; > > This should be a NODEMASK_ALLOC(). We don't want to eat up too much of the > stack, right? Ok, and if I want to initialize unvisited to all online nodes, is there a better way than doing: nodemask_clear(*unvisited); nodemask_or(*unvisited, *unvisited, node_states[N_ONLINE]); We don't have nodemask_copy() right? Thanks, -Andrea

On Tue, Feb 11, 2025 at 08:32:51AM +0100, Andrea Righi wrote: > On Mon, Feb 10, 2025 at 11:57:42AM -0500, Yury Norov wrote: > ... > > > > +/* > > > > + * Find the best idle CPU in the system, relative to @node. > > > > + */ > > > > +s32 scx_pick_idle_cpu(const struct cpumask *cpus_allowed, int node, u64 flags) > > > > +{ > > > > + nodemask_t unvisited = NODE_MASK_ALL; > > > > This should be a NODEMASK_ALLOC(). We don't want to eat up too much of the > > stack, right? > > Ok, and if I want to initialize unvisited to all online nodes, is there a > better than doing: > > nodemask_clear(*unvisited); > nodemask_or(*unvisited, *unvisited, node_states[N_ONLINE]); > > We don't have nodemask_copy() right? Sorry, and with that I mean nodes_clear() / nodes_or() / nodes_copy(). -Andrea
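Since nodemask_t is a plain struct wrapping a bitmap, it can also be copied by ordinary assignment, so a dedicated copy helper is not strictly required. A small sketch of two equivalent ways to seed the mask from the online nodes, shown for a pointer as in the NODEMASK_ALLOC() case (the helper name is made up for illustration):

static void seed_unvisited_nodes(nodemask_t *unvisited)
{
	/* Option 1: structure assignment copies the whole mask. */
	*unvisited = node_states[N_ONLINE];

	/* Option 2: the same result with the nodes_*() helpers. */
	nodes_clear(*unvisited);
	nodes_or(*unvisited, *unvisited, node_states[N_ONLINE]);
}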
On Tue, Feb 11, 2025 at 08:41:45AM +0100, Andrea Righi wrote: > On Tue, Feb 11, 2025 at 08:32:51AM +0100, Andrea Righi wrote: > > On Mon, Feb 10, 2025 at 11:57:42AM -0500, Yury Norov wrote: > > ... > > > > > +/* > > > > > + * Find the best idle CPU in the system, relative to @node. > > > > > + */ > > > > > +s32 scx_pick_idle_cpu(const struct cpumask *cpus_allowed, int node, u64 flags) > > > > > +{ > > > > > + nodemask_t unvisited = NODE_MASK_ALL; > > > > > > This should be a NODEMASK_ALLOC(). We don't want to eat up too much of the > > > stack, right? > > > > Ok, and if I want to initialize unvisited to all online nodes, is there a > > better than doing: > > > > nodemask_clear(*unvisited); > > nodemask_or(*unvisited, *unvisited, node_states[N_ONLINE]); > > > > We don't have nodemask_copy() right? > > Sorry, and with that I mean nodes_clear() / nodes_or() / nodes_copy(). Also, it might be problematic to use NODEMASK_ALLOC() here, since we're potentially holding raw spinlocks. Maybe we could use per-cpu nodemask_t, but then we need to preempt_disable() the entire loop, since scx_pick_idle_cpu() can be called potentially from any context. Considering that the maximum value for NODE_SHIFT is 10 with CONFIG_MAXSMP, nodemask_t should be 128 bytes at most, that doesn't seem too bad... Maybe we can accept having it on the stack in this case? Thanks, -Andrea
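As a sanity check of the 128-byte figure, here is the arithmetic as a tiny standalone userspace C program, assuming CONFIG_NODES_SHIFT=10 (the value CONFIG_MAXSMP selects):

#include <stdio.h>

int main(void)
{
	unsigned int nodes_shift = 10;                   /* CONFIG_NODES_SHIFT under MAXSMP */
	unsigned int max_numnodes = 1u << nodes_shift;   /* 1024 possible nodes */
	unsigned int nodemask_bytes = max_numnodes / 8;  /* one bit per node */

	printf("worst-case nodemask_t: %u bytes\n", nodemask_bytes);  /* prints 128 */
	return 0;
}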
On Tue, Feb 11, 2025 at 10:50:46AM +0100, Andrea Righi wrote: > On Tue, Feb 11, 2025 at 08:41:45AM +0100, Andrea Righi wrote: > > On Tue, Feb 11, 2025 at 08:32:51AM +0100, Andrea Righi wrote: > > > On Mon, Feb 10, 2025 at 11:57:42AM -0500, Yury Norov wrote: > > > ... > > > > > +/* > > > > > + * Find the best idle CPU in the system, relative to @node. > > > > > + */ > > > > > +s32 scx_pick_idle_cpu(const struct cpumask *cpus_allowed, int node, u64 flags) > > > > > +{ > > > > > + nodemask_t unvisited = NODE_MASK_ALL; > > > > > > This should be a NODEMASK_ALLOC(). We don't want to eat up too much of the > > > > stack, right? > > > > > > Ok, and if I want to initialize unvisited to all online nodes, is there a > > > better than doing: > > > > > > nodemask_clear(*unvisited); > > > nodemask_or(*unvisited, *unvisited, node_states[N_ONLINE]); > > > > > > We don't have nodemask_copy() right? > > > > Sorry, and with that I mean nodes_clear() / nodes_or() / nodes_copy(). > > Also, it might be problematic to use NODEMASK_ALLOC() here, since we're > potentially holding raw spinlocks. Maybe we could use per-cpu nodemask_t, > but then we need to preempt_disable() the entire loop, since > scx_pick_idle_cpu() can be be called potentially from any context. > > Considering that the maximum value for NODE_SHIFT is 10 with CONFIG_MAXSMP, > nodemask_t should be 128 bytes at most, that doesn't seem too bad... Maybe > we can accept to have it on the stack in this case? If you expect to call this in strict SMP lock-held or IRQ contexts, you need to be careful about stack overflow even more. We've got GFP_ATOMIC for that: non sleeping allocation with an expensive fallback so it can access some portion of memory reserves. Usually used from interrupt/bottom-half context with an expensive slow path fallback. Check Documentation/core-api/memory-allocation.rst for other options. You may be interested in __GFP_NORETRY as well. Thanks, Yury
On Tue, Feb 11, 2025 at 09:19:52AM -0500, Yury Norov wrote: > On Tue, Feb 11, 2025 at 10:50:46AM +0100, Andrea Righi wrote: > > On Tue, Feb 11, 2025 at 08:41:45AM +0100, Andrea Righi wrote: > > > On Tue, Feb 11, 2025 at 08:32:51AM +0100, Andrea Righi wrote: > > > > On Mon, Feb 10, 2025 at 11:57:42AM -0500, Yury Norov wrote: > > > > ... > > > > > > > +/* > > > > > > > + * Find the best idle CPU in the system, relative to @node. > > > > > > > + */ > > > > > > > +s32 scx_pick_idle_cpu(const struct cpumask *cpus_allowed, int node, u64 flags) > > > > > > > +{ > > > > > > > + nodemask_t unvisited = NODE_MASK_ALL; > > > > > > > > > > This should be a NODEMASK_ALLOC(). We don't want to eat up too much of the > > > > > stack, right? > > > > > > > > Ok, and if I want to initialize unvisited to all online nodes, is there a > > > > better than doing: > > > > > > > > nodemask_clear(*unvisited); > > > > nodemask_or(*unvisited, *unvisited, node_states[N_ONLINE]); > > > > > > > > We don't have nodemask_copy() right? > > > > > > Sorry, and with that I mean nodes_clear() / nodes_or() / nodes_copy(). > > > > Also, it might be problematic to use NODEMASK_ALLOC() here, since we're > > potentially holding raw spinlocks. Maybe we could use per-cpu nodemask_t, > > but then we need to preempt_disable() the entire loop, since > > scx_pick_idle_cpu() can be be called potentially from any context. > > > > Considering that the maximum value for NODE_SHIFT is 10 with CONFIG_MAXSMP, > > nodemask_t should be 128 bytes at most, that doesn't seem too bad... Maybe > > we can accept to have it on the stack in this case? > > If you expect calling this in strict SMP lock-held or IRQ contexts, You > need to be careful about stack overflow even mode. We've got GFP_ATOMIC > for that: > non sleeping allocation with an expensive fallback so it can access > some portion of memory reserves. Usually used from interrupt/bottom-half > context with an expensive slow path fallback. > > Check Documentation/core-api/memory-allocation.rst for other options. > You may be interested in __GFP_NORETRY as well. I know about GFP_ATOMIC, but even with that I'm hitting some bugs. Will try with __GFP_NORETRY. Thanks, -Andrea
On Tue, Feb 11, 2025 at 03:34:11PM +0100, Andrea Righi wrote: > On Tue, Feb 11, 2025 at 09:19:52AM -0500, Yury Norov wrote: > > On Tue, Feb 11, 2025 at 10:50:46AM +0100, Andrea Righi wrote: > > > On Tue, Feb 11, 2025 at 08:41:45AM +0100, Andrea Righi wrote: > > > > On Tue, Feb 11, 2025 at 08:32:51AM +0100, Andrea Righi wrote: > > > > > On Mon, Feb 10, 2025 at 11:57:42AM -0500, Yury Norov wrote: > > > > > ... > > > > > > > > +/* > > > > > > > > + * Find the best idle CPU in the system, relative to @node. > > > > > > > > + */ > > > > > > > > +s32 scx_pick_idle_cpu(const struct cpumask *cpus_allowed, int node, u64 flags) > > > > > > > > +{ > > > > > > > > + nodemask_t unvisited = NODE_MASK_ALL; > > > > > > > > > > > > This should be a NODEMASK_ALLOC(). We don't want to eat up too much of the > > > > > > stack, right? > > > > > > > > > > Ok, and if I want to initialize unvisited to all online nodes, is there a > > > > > better than doing: > > > > > > > > > > nodemask_clear(*unvisited); > > > > > nodemask_or(*unvisited, *unvisited, node_states[N_ONLINE]); > > > > > > > > > > We don't have nodemask_copy() right? > > > > > > > > Sorry, and with that I mean nodes_clear() / nodes_or() / nodes_copy(). > > > > > > Also, it might be problematic to use NODEMASK_ALLOC() here, since we're > > > potentially holding raw spinlocks. Maybe we could use per-cpu nodemask_t, > > > but then we need to preempt_disable() the entire loop, since > > > scx_pick_idle_cpu() can be be called potentially from any context. > > > > > > Considering that the maximum value for NODE_SHIFT is 10 with CONFIG_MAXSMP, > > > nodemask_t should be 128 bytes at most, that doesn't seem too bad... Maybe > > > we can accept to have it on the stack in this case? > > > > If you expect calling this in strict SMP lock-held or IRQ contexts, You > > need to be careful about stack overflow even mode. We've got GFP_ATOMIC > > for that: > > non sleeping allocation with an expensive fallback so it can access > > some portion of memory reserves. Usually used from interrupt/bottom-half > > context with an expensive slow path fallback. > > > > Check Documentation/core-api/memory-allocation.rst for other options. > > You may be interested in __GFP_NORETRY as well. > > I know about GFP_ATOMIC, but even with that I'm hitting some bugs. > Will try with __GFP_NORETRY. 
...which is basically this (with GFP_ATOMIC): [ 11.829079] ============================= [ 11.829109] [ BUG: Invalid wait context ] [ 11.829146] 6.13.0-virtme #51 Not tainted [ 11.829185] ----------------------------- [ 11.829243] fish/344 is trying to lock: [ 11.829285] ffff9659bec450b0 (&c->lock){..-.}-{3:3}, at: ___slab_alloc+0x66/0x1510 [ 11.829380] other info that might help us debug this: [ 11.829450] context-{5:5} [ 11.829494] 8 locks held by fish/344: [ 11.829534] #0: ffff965a409c70a0 (&tty->ldisc_sem){++++}-{0:0}, at: tty_ldisc_ref_wait+0x28/0x60 [ 11.829643] #1: ffff965a409c7130 (&tty->atomic_write_lock){+.+.}-{4:4}, at: file_tty_write.isra.0+0xa1/0x330 [ 11.829765] #2: ffff965a409c72e8 (&tty->termios_rwsem/1){++++}-{4:4}, at: n_tty_write+0x9e/0x510 [ 11.829871] #3: ffffbc6d01433380 (&ldata->output_lock){+.+.}-{4:4}, at: n_tty_write+0x1f1/0x510 [ 11.829979] #4: ffffffffb556b5c0 (rcu_read_lock){....}-{1:3}, at: __queue_work+0x59/0x680 [ 11.830173] #5: ffff9659800f0018 (&pool->lock){-.-.}-{2:2}, at: __queue_work+0xd7/0x680 [ 11.830286] #6: ffff9659801bcf60 (&p->pi_lock){-.-.}-{2:2}, at: try_to_wake_up+0x56/0x920 [ 11.830396] #7: ffffffffb556b5c0 (rcu_read_lock){....}-{1:3}, at: scx_select_cpu_dfl+0x56/0x460 And I think that's because: * %GFP_ATOMIC users can not sleep and need the allocation to succeed. A lower * watermark is applied to allow access to "atomic reserves". * The current implementation doesn't support NMI and few other strict * non-preemptive contexts (e.g. raw_spin_lock). The same applies to %GFP_NOWAIT. So I guess the only viable option is to preallocate a nodemask_t and protect it somehow, hoping that it doesn't add too much overhead... -Andrea
On Tue, 11 Feb 2025 15:45:15 +0100 Andrea Righi <arighi@nvidia.com> wrote: > ...which is basically this (with GFP_ATOMIC): > > [ 11.829079] ============================= > [ 11.829109] [ BUG: Invalid wait context ] > [ 11.829146] 6.13.0-virtme #51 Not tainted > [ 11.829185] ----------------------------- > [ 11.829243] fish/344 is trying to lock: > [ 11.829285] ffff9659bec450b0 (&c->lock){..-.}-{3:3}, at: ___slab_alloc+0x66/0x1510 > [ 11.829380] other info that might help us debug this: > [ 11.829450] context-{5:5} > [ 11.829494] 8 locks held by fish/344: > [ 11.829534] #0: ffff965a409c70a0 (&tty->ldisc_sem){++++}-{0:0}, at: tty_ldisc_ref_wait+0x28/0x60 > [ 11.829643] #1: ffff965a409c7130 (&tty->atomic_write_lock){+.+.}-{4:4}, at: file_tty_write.isra.0+0xa1/0x330 > [ 11.829765] #2: ffff965a409c72e8 (&tty->termios_rwsem/1){++++}-{4:4}, at: n_tty_write+0x9e/0x510 > [ 11.829871] #3: ffffbc6d01433380 (&ldata->output_lock){+.+.}-{4:4}, at: n_tty_write+0x1f1/0x510 > [ 11.829979] #4: ffffffffb556b5c0 (rcu_read_lock){....}-{1:3}, at: __queue_work+0x59/0x680 > [ 11.830173] #5: ffff9659800f0018 (&pool->lock){-.-.}-{2:2}, at: __queue_work+0xd7/0x680 > [ 11.830286] #6: ffff9659801bcf60 (&p->pi_lock){-.-.}-{2:2}, at: try_to_wake_up+0x56/0x920 > [ 11.830396] #7: ffffffffb556b5c0 (rcu_read_lock){....}-{1:3}, at: scx_select_cpu_dfl+0x56/0x460 > > And I think that's because: > > * %GFP_ATOMIC users can not sleep and need the allocation to succeed. A lower > * watermark is applied to allow access to "atomic reserves". > * The current implementation doesn't support NMI and few other strict > * non-preemptive contexts (e.g. raw_spin_lock). The same applies to %GFP_NOWAIT. > > So I guess we the only viable option is to preallocate nodemask_t and > protect it somehow, hoping that it doesn't add too much overhead... I believe it's because you have p->pi_lock which is a raw_spin_lock() and you are trying to take a lock in ___slab_alloc() which I bet is a normal spin_lock(). In PREEMPT_RT() that turns into a mutex, and you can not take a spin_lock while holding a raw_spin_lock. -- Steve
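A stripped-down illustration of the nesting rule Steve describes, with stand-in lock names (hypothetical, not from the patch): on PREEMPT_RT a spin_lock() becomes a sleeping rt_mutex, and lockdep's wait-context checking flags this ordering even on non-RT builds, which is what the "Invalid wait context" report above is about.

#include <linux/spinlock.h>

static DEFINE_RAW_SPINLOCK(example_raw_lock);	/* stands in for p->pi_lock */
static DEFINE_SPINLOCK(example_spinlock);	/* stands in for the slab's c->lock */

static void invalid_nesting(void)
{
	raw_spin_lock(&example_raw_lock);

	/*
	 * Invalid: spin_lock() may sleep on PREEMPT_RT, but we are inside
	 * a raw spinlock with preemption disabled. This is the pattern a
	 * GFP_ATOMIC slab allocation ends up hitting here.
	 */
	spin_lock(&example_spinlock);
	spin_unlock(&example_spinlock);

	raw_spin_unlock(&example_raw_lock);
}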
On Tue, Feb 11, 2025 at 11:38:27AM -0500, Steven Rostedt wrote: > On Tue, 11 Feb 2025 15:45:15 +0100 > Andrea Righi <arighi@nvidia.com> wrote: > > > ...which is basically this (with GFP_ATOMIC): > > > > [ 11.829079] ============================= > > [ 11.829109] [ BUG: Invalid wait context ] > > [ 11.829146] 6.13.0-virtme #51 Not tainted > > [ 11.829185] ----------------------------- > > [ 11.829243] fish/344 is trying to lock: > > [ 11.829285] ffff9659bec450b0 (&c->lock){..-.}-{3:3}, at: ___slab_alloc+0x66/0x1510 > > [ 11.829380] other info that might help us debug this: > > [ 11.829450] context-{5:5} > > [ 11.829494] 8 locks held by fish/344: > > [ 11.829534] #0: ffff965a409c70a0 (&tty->ldisc_sem){++++}-{0:0}, at: tty_ldisc_ref_wait+0x28/0x60 > > [ 11.829643] #1: ffff965a409c7130 (&tty->atomic_write_lock){+.+.}-{4:4}, at: file_tty_write.isra.0+0xa1/0x330 > > [ 11.829765] #2: ffff965a409c72e8 (&tty->termios_rwsem/1){++++}-{4:4}, at: n_tty_write+0x9e/0x510 > > [ 11.829871] #3: ffffbc6d01433380 (&ldata->output_lock){+.+.}-{4:4}, at: n_tty_write+0x1f1/0x510 > > [ 11.829979] #4: ffffffffb556b5c0 (rcu_read_lock){....}-{1:3}, at: __queue_work+0x59/0x680 > > [ 11.830173] #5: ffff9659800f0018 (&pool->lock){-.-.}-{2:2}, at: __queue_work+0xd7/0x680 > > [ 11.830286] #6: ffff9659801bcf60 (&p->pi_lock){-.-.}-{2:2}, at: try_to_wake_up+0x56/0x920 > > [ 11.830396] #7: ffffffffb556b5c0 (rcu_read_lock){....}-{1:3}, at: scx_select_cpu_dfl+0x56/0x460 > > > > And I think that's because: > > > > * %GFP_ATOMIC users can not sleep and need the allocation to succeed. A lower > > * watermark is applied to allow access to "atomic reserves". > > * The current implementation doesn't support NMI and few other strict > > * non-preemptive contexts (e.g. raw_spin_lock). The same applies to %GFP_NOWAIT. > > > > So I guess we the only viable option is to preallocate nodemask_t and > > protect it somehow, hoping that it doesn't add too much overhead... > > I believe it's because you have p->pi_lock which is a raw_spin_lock() and > you are trying to take a lock in ___slab_alloc() which I bet is a normal > spin_lock(). In PREEMPT_RT() that turns into a mutex, and you can not take > a spin_lock while holding a raw_spin_lock. Exactly that, thanks Steve. I'll run some tests using per-cpu nodemask_t, given that most of the times this is called with p->pi_lock held, it should be safe and we shouldn't introduce any overhead. -Andrea
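A rough sketch of the per-CPU scratch nodemask Andrea mentions. The variable names are made up, it assumes every caller of scx_pick_idle_cpu() already runs with preemption disabled (e.g. under p->pi_lock), so the per-CPU buffer needs no extra locking, and it assumes the series' for_each_numa_node() iterator accepts the mask as a plain lvalue expression. Not the final code, just the shape of the idea.

static DEFINE_PER_CPU(nodemask_t, scx_idle_unvisited);

s32 scx_pick_idle_cpu(const struct cpumask *cpus_allowed, int node, u64 flags)
{
	nodemask_t *unvisited = this_cpu_ptr(&scx_idle_unvisited);
	s32 cpu = -EBUSY;

	if (!static_branch_maybe(CONFIG_NUMA, &scx_builtin_idle_per_node))
		return pick_idle_cpu_from_node(cpus_allowed, NUMA_NO_NODE, flags);

	if (node == NUMA_NO_NODE)
		node = numa_node_id();

	/* Same as "unvisited = NODE_MASK_ALL" in the patch, without stack usage. */
	nodes_setall(*unvisited);

	for_each_numa_node(node, *unvisited, N_POSSIBLE) {
		cpu = pick_idle_cpu_from_node(cpus_allowed, node, flags);
		if (cpu >= 0)
			break;

		if (flags & SCX_PICK_IDLE_IN_NODE)
			break;
	}

	return cpu;
}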
diff --git a/kernel/sched/ext_idle.c b/kernel/sched/ext_idle.c index a3f2b00903ac2..4b90ec9018c1a 100644 --- a/kernel/sched/ext_idle.c +++ b/kernel/sched/ext_idle.c @@ -18,25 +18,88 @@ DEFINE_STATIC_KEY_FALSE(scx_builtin_idle_enabled); DEFINE_STATIC_KEY_FALSE(scx_builtin_idle_per_node); #ifdef CONFIG_SMP -#ifdef CONFIG_CPUMASK_OFFSTACK -#define CL_ALIGNED_IF_ONSTACK -#else -#define CL_ALIGNED_IF_ONSTACK __cacheline_aligned_in_smp -#endif - /* Enable/disable LLC aware optimizations */ DEFINE_STATIC_KEY_FALSE(scx_selcpu_topo_llc); /* Enable/disable NUMA aware optimizations */ DEFINE_STATIC_KEY_FALSE(scx_selcpu_topo_numa); -static struct { +/* + * cpumasks to track idle CPUs within each NUMA node. + * + * If SCX_OPS_BUILTIN_IDLE_PER_NODE is not enabled, a single global cpumask + * from is used to track all the idle CPUs in the system. + */ +struct idle_cpus { cpumask_var_t cpu; cpumask_var_t smt; -} idle_masks CL_ALIGNED_IF_ONSTACK; +}; + +/* + * Global host-wide idle cpumasks (used when SCX_OPS_BUILTIN_IDLE_PER_NODE + * is not enabled). + */ +static struct idle_cpus scx_idle_global_masks; + +/* + * Per-node idle cpumasks. + */ +static struct idle_cpus **scx_idle_node_masks; + +/* + * Initialize per-node idle cpumasks. + * + * In case of a single NUMA node or if NUMA support is disabled, only a + * single global host-wide cpumask will be initialized. + */ +void scx_idle_init_masks(void) +{ + int node; + + /* Allocate global idle cpumasks */ + BUG_ON(!alloc_cpumask_var(&scx_idle_global_masks.cpu, GFP_KERNEL)); + BUG_ON(!alloc_cpumask_var(&scx_idle_global_masks.smt, GFP_KERNEL)); + + /* Allocate per-node idle cpumasks */ + scx_idle_node_masks = kcalloc(num_possible_nodes(), + sizeof(*scx_idle_node_masks), GFP_KERNEL); + BUG_ON(!scx_idle_node_masks); + + for_each_node(node) { + scx_idle_node_masks[node] = kzalloc_node(sizeof(**scx_idle_node_masks), + GFP_KERNEL, node); + BUG_ON(!scx_idle_node_masks[node]); + + BUG_ON(!alloc_cpumask_var_node(&scx_idle_node_masks[node]->cpu, GFP_KERNEL, node)); + BUG_ON(!alloc_cpumask_var_node(&scx_idle_node_masks[node]->smt, GFP_KERNEL, node)); + } +} + +/* + * Return the idle masks associated to a target @node. + */ +static struct idle_cpus *idle_cpumask(int node) +{ + return node == NUMA_NO_NODE ? &scx_idle_global_masks : scx_idle_node_masks[node]; +} + +/* + * Return the node id associated to a target idle CPU (used to determine + * the proper idle cpumask). + */ +static int idle_cpu_to_node(int cpu) +{ + if (!static_branch_maybe(CONFIG_NUMA, &scx_builtin_idle_per_node)) + return NUMA_NO_NODE; + + return cpu_to_node(cpu); +} bool scx_idle_test_and_clear_cpu(int cpu) { + int node = idle_cpu_to_node(cpu); + struct cpumask *idle_cpus = idle_cpumask(node)->cpu; + #ifdef CONFIG_SCHED_SMT /* * SMT mask should be cleared whether we can claim @cpu or not. The SMT @@ -45,33 +108,38 @@ bool scx_idle_test_and_clear_cpu(int cpu) */ if (sched_smt_active()) { const struct cpumask *smt = cpu_smt_mask(cpu); + struct cpumask *idle_smts = idle_cpumask(node)->smt; /* * If offline, @cpu is not its own sibling and * scx_pick_idle_cpu() can get caught in an infinite loop as - * @cpu is never cleared from idle_masks.smt. Ensure that @cpu - * is eventually cleared. + * @cpu is never cleared from the idle SMT mask. Ensure that + * @cpu is eventually cleared. * * NOTE: Use cpumask_intersects() and cpumask_test_cpu() to * reduce memory writes, which may help alleviate cache * coherence pressure. 
*/ - if (cpumask_intersects(smt, idle_masks.smt)) - cpumask_andnot(idle_masks.smt, idle_masks.smt, smt); - else if (cpumask_test_cpu(cpu, idle_masks.smt)) - __cpumask_clear_cpu(cpu, idle_masks.smt); + if (cpumask_intersects(smt, idle_smts)) + cpumask_andnot(idle_smts, idle_smts, smt); + else if (cpumask_test_cpu(cpu, idle_smts)) + __cpumask_clear_cpu(cpu, idle_smts); } #endif - return cpumask_test_and_clear_cpu(cpu, idle_masks.cpu); + + return cpumask_test_and_clear_cpu(cpu, idle_cpus); } -s32 scx_pick_idle_cpu(const struct cpumask *cpus_allowed, u64 flags) +/* + * Pick an idle CPU in a specific NUMA node. + */ +s32 pick_idle_cpu_from_node(const struct cpumask *cpus_allowed, int node, u64 flags) { int cpu; retry: if (sched_smt_active()) { - cpu = cpumask_any_and_distribute(idle_masks.smt, cpus_allowed); + cpu = cpumask_any_and_distribute(idle_cpumask(node)->smt, cpus_allowed); if (cpu < nr_cpu_ids) goto found; @@ -79,7 +147,7 @@ s32 scx_pick_idle_cpu(const struct cpumask *cpus_allowed, u64 flags) return -EBUSY; } - cpu = cpumask_any_and_distribute(idle_masks.cpu, cpus_allowed); + cpu = cpumask_any_and_distribute(idle_cpumask(node)->cpu, cpus_allowed); if (cpu >= nr_cpu_ids) return -EBUSY; @@ -90,6 +158,55 @@ s32 scx_pick_idle_cpu(const struct cpumask *cpus_allowed, u64 flags) goto retry; } +/* + * Find the best idle CPU in the system, relative to @node. + */ +s32 scx_pick_idle_cpu(const struct cpumask *cpus_allowed, int node, u64 flags) +{ + nodemask_t unvisited = NODE_MASK_ALL; + s32 cpu = -EBUSY; + + if (!static_branch_maybe(CONFIG_NUMA, &scx_builtin_idle_per_node)) + return pick_idle_cpu_from_node(cpus_allowed, NUMA_NO_NODE, flags); + + /* + * If an initial node is not specified, start with the current + * node. + */ + if (node == NUMA_NO_NODE) + node = numa_node_id(); + + /* + * Traverse all nodes in order of increasing distance, starting + * from @node. + * + * This loop is O(N^2), with N being the amount of NUMA nodes, + * which might be quite expensive in large NUMA systems. However, + * this complexity comes into play only when a scheduler enables + * SCX_OPS_BUILTIN_IDLE_PER_NODE and it's requesting an idle CPU + * without specifying a target NUMA node, so it shouldn't be a + * bottleneck is most cases. + * + * As a future optimization we may want to cache the list of hop + * nodes in a per-node array, instead of actually traversing them + * every time. + */ + for_each_numa_node(node, unvisited, N_POSSIBLE) { + cpu = pick_idle_cpu_from_node(cpus_allowed, node, flags); + if (cpu >= 0) + break; + + /* + * Check if the search is restricted to the same core or + * the same node. + */ + if (flags & SCX_PICK_IDLE_IN_NODE) + break; + } + + return cpu; +} + /* * Return the amount of CPUs in the same LLC domain of @cpu (or zero if the LLC * domain is not defined). @@ -310,6 +427,7 @@ s32 scx_select_cpu_dfl(struct task_struct *p, s32 prev_cpu, u64 wake_flags, bool { const struct cpumask *llc_cpus = NULL; const struct cpumask *numa_cpus = NULL; + int node = idle_cpu_to_node(prev_cpu); s32 cpu; *found = false; @@ -367,9 +485,9 @@ s32 scx_select_cpu_dfl(struct task_struct *p, s32 prev_cpu, u64 wake_flags, bool * piled up on it even if there is an idle core elsewhere on * the system. 
*/ - if (!cpumask_empty(idle_masks.cpu) && - !(current->flags & PF_EXITING) && - cpu_rq(cpu)->scx.local_dsq.nr == 0) { + if (!(current->flags & PF_EXITING) && + cpu_rq(cpu)->scx.local_dsq.nr == 0 && + !cpumask_empty(idle_cpumask(cpu_to_node(cpu))->cpu)) { if (cpumask_test_cpu(cpu, p->cpus_ptr)) goto cpu_found; } @@ -383,7 +501,7 @@ s32 scx_select_cpu_dfl(struct task_struct *p, s32 prev_cpu, u64 wake_flags, bool /* * Keep using @prev_cpu if it's part of a fully idle core. */ - if (cpumask_test_cpu(prev_cpu, idle_masks.smt) && + if (cpumask_test_cpu(prev_cpu, idle_cpumask(node)->smt) && scx_idle_test_and_clear_cpu(prev_cpu)) { cpu = prev_cpu; goto cpu_found; @@ -393,7 +511,7 @@ s32 scx_select_cpu_dfl(struct task_struct *p, s32 prev_cpu, u64 wake_flags, bool * Search for any fully idle core in the same LLC domain. */ if (llc_cpus) { - cpu = scx_pick_idle_cpu(llc_cpus, SCX_PICK_IDLE_CORE); + cpu = pick_idle_cpu_from_node(llc_cpus, node, SCX_PICK_IDLE_CORE); if (cpu >= 0) goto cpu_found; } @@ -402,15 +520,19 @@ s32 scx_select_cpu_dfl(struct task_struct *p, s32 prev_cpu, u64 wake_flags, bool * Search for any fully idle core in the same NUMA node. */ if (numa_cpus) { - cpu = scx_pick_idle_cpu(numa_cpus, SCX_PICK_IDLE_CORE); + cpu = pick_idle_cpu_from_node(numa_cpus, node, SCX_PICK_IDLE_CORE); if (cpu >= 0) goto cpu_found; } /* * Search for any full idle core usable by the task. + * + * If NUMA aware idle selection is enabled, the search will + * begin in prev_cpu's node and proceed to other nodes in + * order of increasing distance. */ - cpu = scx_pick_idle_cpu(p->cpus_ptr, SCX_PICK_IDLE_CORE); + cpu = scx_pick_idle_cpu(p->cpus_ptr, node, SCX_PICK_IDLE_CORE); if (cpu >= 0) goto cpu_found; } @@ -427,7 +549,7 @@ s32 scx_select_cpu_dfl(struct task_struct *p, s32 prev_cpu, u64 wake_flags, bool * Search for any idle CPU in the same LLC domain. */ if (llc_cpus) { - cpu = scx_pick_idle_cpu(llc_cpus, 0); + cpu = pick_idle_cpu_from_node(llc_cpus, node, 0); if (cpu >= 0) goto cpu_found; } @@ -436,7 +558,7 @@ s32 scx_select_cpu_dfl(struct task_struct *p, s32 prev_cpu, u64 wake_flags, bool * Search for any idle CPU in the same NUMA node. */ if (numa_cpus) { - cpu = scx_pick_idle_cpu(numa_cpus, 0); + cpu = pick_idle_cpu_from_node(numa_cpus, node, 0); if (cpu >= 0) goto cpu_found; } @@ -444,7 +566,7 @@ s32 scx_select_cpu_dfl(struct task_struct *p, s32 prev_cpu, u64 wake_flags, bool /* * Search for any idle CPU usable by the task. */ - cpu = scx_pick_idle_cpu(p->cpus_ptr, 0); + cpu = scx_pick_idle_cpu(p->cpus_ptr, node, 0); if (cpu >= 0) goto cpu_found; @@ -460,38 +582,50 @@ s32 scx_select_cpu_dfl(struct task_struct *p, s32 prev_cpu, u64 wake_flags, bool void scx_idle_reset_masks(void) { + int node; + + if (!static_branch_maybe(CONFIG_NUMA, &scx_builtin_idle_per_node)) { + cpumask_copy(idle_cpumask(NUMA_NO_NODE)->cpu, cpu_online_mask); + cpumask_copy(idle_cpumask(NUMA_NO_NODE)->smt, cpu_online_mask); + return; + } + /* * Consider all online cpus idle. Should converge to the actual state * quickly. 
*/ - cpumask_copy(idle_masks.cpu, cpu_online_mask); - cpumask_copy(idle_masks.smt, cpu_online_mask); -} + for_each_node(node) { + const struct cpumask *node_mask = cpumask_of_node(node); + struct cpumask *idle_cpus = idle_cpumask(node)->cpu; + struct cpumask *idle_smts = idle_cpumask(node)->smt; -void scx_idle_init_masks(void) -{ - BUG_ON(!alloc_cpumask_var(&idle_masks.cpu, GFP_KERNEL)); - BUG_ON(!alloc_cpumask_var(&idle_masks.smt, GFP_KERNEL)); + cpumask_and(idle_cpus, cpu_online_mask, node_mask); + cpumask_copy(idle_smts, idle_cpus); + } } static void update_builtin_idle(int cpu, bool idle) { - assign_cpu(cpu, idle_masks.cpu, idle); + int node = idle_cpu_to_node(cpu); + struct cpumask *idle_cpus = idle_cpumask(node)->cpu; + + assign_cpu(cpu, idle_cpus, idle); #ifdef CONFIG_SCHED_SMT if (sched_smt_active()) { const struct cpumask *smt = cpu_smt_mask(cpu); + struct cpumask *idle_smts = idle_cpumask(node)->smt; if (idle) { /* - * idle_masks.smt handling is racy but that's fine as - * it's only for optimization and self-correcting. + * idle_smt handling is racy but that's fine as it's + * only for optimization and self-correcting. */ - if (!cpumask_subset(smt, idle_masks.cpu)) + if (!cpumask_subset(smt, idle_cpus)) return; - cpumask_or(idle_masks.smt, idle_masks.smt, smt); + cpumask_or(idle_smts, idle_smts, smt); } else { - cpumask_andnot(idle_masks.smt, idle_masks.smt, smt); + cpumask_andnot(idle_smts, idle_smts, smt); } } #endif @@ -599,15 +733,21 @@ __bpf_kfunc s32 scx_bpf_select_cpu_dfl(struct task_struct *p, s32 prev_cpu, * scx_bpf_get_idle_cpumask - Get a referenced kptr to the idle-tracking * per-CPU cpumask. * - * Returns NULL if idle tracking is not enabled, or running on a UP kernel. + * Returns an empty mask if idle tracking is not enabled, or running on a + * UP kernel. */ __bpf_kfunc const struct cpumask *scx_bpf_get_idle_cpumask(void) { if (!check_builtin_idle_enabled()) return cpu_none_mask; + if (static_branch_unlikely(&scx_builtin_idle_per_node)) { + scx_ops_error("SCX_OPS_BUILTIN_IDLE_PER_NODE enabled"); + return cpu_none_mask; + } + #ifdef CONFIG_SMP - return idle_masks.cpu; + return idle_cpumask(NUMA_NO_NODE)->cpu; #else return cpu_none_mask; #endif @@ -618,18 +758,24 @@ __bpf_kfunc const struct cpumask *scx_bpf_get_idle_cpumask(void) * per-physical-core cpumask. Can be used to determine if an entire physical * core is free. * - * Returns NULL if idle tracking is not enabled, or running on a UP kernel. + * Returns an empty mask if idle tracking is not enabled, or running on a + * UP kernel. 
*/ __bpf_kfunc const struct cpumask *scx_bpf_get_idle_smtmask(void) { if (!check_builtin_idle_enabled()) return cpu_none_mask; + if (static_branch_unlikely(&scx_builtin_idle_per_node)) { + scx_ops_error("SCX_OPS_BUILTIN_IDLE_PER_NODE enabled"); + return cpu_none_mask; + } + #ifdef CONFIG_SMP if (sched_smt_active()) - return idle_masks.smt; + return idle_cpumask(NUMA_NO_NODE)->smt; else - return idle_masks.cpu; + return idle_cpumask(NUMA_NO_NODE)->cpu; #else return cpu_none_mask; #endif @@ -696,7 +842,7 @@ __bpf_kfunc s32 scx_bpf_pick_idle_cpu(const struct cpumask *cpus_allowed, if (!check_builtin_idle_enabled()) return -EBUSY; - return scx_pick_idle_cpu(cpus_allowed, flags); + return scx_pick_idle_cpu(cpus_allowed, NUMA_NO_NODE, flags); } /** @@ -719,7 +865,7 @@ __bpf_kfunc s32 scx_bpf_pick_any_cpu(const struct cpumask *cpus_allowed, s32 cpu; if (static_branch_likely(&scx_builtin_idle_enabled)) { - cpu = scx_pick_idle_cpu(cpus_allowed, flags); + cpu = scx_pick_idle_cpu(cpus_allowed, NUMA_NO_NODE, flags); if (cpu >= 0) return cpu; } diff --git a/kernel/sched/ext_idle.h b/kernel/sched/ext_idle.h index d005bd22c19a5..b00ed5ad48e89 100644 --- a/kernel/sched/ext_idle.h +++ b/kernel/sched/ext_idle.h @@ -13,6 +13,7 @@ struct sched_ext_ops; extern struct static_key_false scx_builtin_idle_enabled; +extern struct static_key_false scx_builtin_idle_per_node; #ifdef CONFIG_SMP extern struct static_key_false scx_selcpu_topo_llc; @@ -22,13 +23,18 @@ void scx_idle_update_selcpu_topology(struct sched_ext_ops *ops); void scx_idle_reset_masks(void); void scx_idle_init_masks(void); bool scx_idle_test_and_clear_cpu(int cpu); -s32 scx_pick_idle_cpu(const struct cpumask *cpus_allowed, u64 flags); +s32 pick_idle_cpu_from_node(const struct cpumask *cpus_allowed, int node, u64 flags); +s32 scx_pick_idle_cpu(const struct cpumask *cpus_allowed, int node, u64 flags); #else /* !CONFIG_SMP */ static inline void scx_idle_update_selcpu_topology(struct sched_ext_ops *ops) {} static inline void scx_idle_reset_masks(void) {} static inline void scx_idle_init_masks(void) {} static inline bool scx_idle_test_and_clear_cpu(int cpu) { return false; } -static inline s32 scx_pick_idle_cpu(const struct cpumask *cpus_allowed, u64 flags) +static inline s32 pick_idle_cpu_from_node(const struct cpumask *cpus_allowed, int node, u64 flags) +{ + return -EBUSY; +} +static inline s32 scx_pick_idle_cpu(const struct cpumask *cpus_allowed, int node, u64 flags) { return -EBUSY; } @@ -36,6 +42,7 @@ static inline s32 scx_pick_idle_cpu(const struct cpumask *cpus_allowed, u64 flag s32 scx_select_cpu_dfl(struct task_struct *p, s32 prev_cpu, u64 wake_flags, bool *found); +extern void scx_idle_init_masks(void); extern int scx_idle_init(void); #endif /* _KERNEL_SCHED_EXT_IDLE_H */
Using a single global idle mask can lead to inefficiencies and a lot of stress on the cache coherency protocol on large systems with multiple NUMA nodes, since all the CPUs can create a really intense read/write activity on the single global cpumask. Therefore, split the global cpumask into multiple per-NUMA node cpumasks to improve scalability and performance on large systems. The concept is that each cpumask will track only the idle CPUs within its corresponding NUMA node, treating CPUs in other NUMA nodes as busy. In this way concurrent access to the idle cpumask will be restricted within each NUMA node. The split of multiple per-node idle cpumasks can be controlled using the SCX_OPS_BUILTIN_IDLE_PER_NODE flag. By default SCX_OPS_BUILTIN_IDLE_PER_NODE is not enabled and a global host-wide idle cpumask is used, maintaining the previous behavior. NOTE: if a scheduler explicitly enables the per-node idle cpumasks (via SCX_OPS_BUILTIN_IDLE_PER_NODE), scx_bpf_get_idle_cpu/smtmask() will trigger an scx error, since there are no system-wide cpumasks. Signed-off-by: Andrea Righi <arighi@nvidia.com> --- kernel/sched/ext_idle.c | 242 ++++++++++++++++++++++++++++++++-------- kernel/sched/ext_idle.h | 11 +- 2 files changed, 203 insertions(+), 50 deletions(-)