[6/6] sched_ext: idle: Introduce node-aware idle cpu kfunc helpers

Message ID 20250207211104.30009-7-arighi@nvidia.com (mailing list archive)
State Not Applicable
Series [1/6] mm/numa: Introduce numa_nearest_nodemask()

Commit Message

Andrea Righi Feb. 7, 2025, 8:40 p.m. UTC
Add the following kfuncs to provide scx schedulers direct access to
per-node idle cpumask information:

 int scx_bpf_cpu_to_node(s32 cpu)
 const struct cpumask *scx_bpf_get_idle_cpumask_node(int node)
 const struct cpumask *scx_bpf_get_idle_smtmask_node(int node)
 s32 scx_bpf_pick_idle_cpu_node(const cpumask_t *cpus_allowed,
				int node, u64 flags)

Signed-off-by: Andrea Righi <arighi@nvidia.com>
---
 kernel/sched/ext_idle.c                  | 121 +++++++++++++++++++++++
 tools/sched_ext/include/scx/common.bpf.h |   4 +
 tools/sched_ext/include/scx/compat.bpf.h |  19 ++++
 3 files changed, 144 insertions(+)

Comments

Tejun Heo Feb. 7, 2025, 10:39 p.m. UTC | #1
Hello,

On Fri, Feb 07, 2025 at 09:40:53PM +0100, Andrea Righi wrote:
> +/**
> + * scx_bpf_cpu_to_node - Return the NUMA node the given @cpu belongs to
> + */
> +__bpf_kfunc int scx_bpf_cpu_to_node(s32 cpu)

Maybe scx_bpf_cpu_node() to be in line with scx_bpf_task_cpu/cgroup()?

> +{
> +#ifdef CONFIG_NUMA
> +	if (cpu < 0 || cpu >= nr_cpu_ids)
> +		return -EINVAL;

Use ops_cpu_valid()? Otherwise, we can end up calling cpu_to_node() with an
impossible CPU. Also, I don't think CPU -> node mapping function should be
able to return an error value. It should just trigger ops error.
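
For illustration, a kfunc following this suggestion might look roughly like
the sketch below (assuming the scx_bpf_cpu_node() rename and reusing the
existing ops_cpu_valid() helper, which already reports an ops error for an
impossible CPU; just a sketch, not the actual respin):

__bpf_kfunc int scx_bpf_cpu_node(s32 cpu)
{
	/* ops_cpu_valid() triggers the ops error on an impossible CPU */
	if (!ops_cpu_valid(cpu, NULL))
		return NUMA_NO_NODE;

	return cpu_to_node(cpu);
}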

> +
> +	return idle_cpu_to_node(cpu);

This is contingent on scx_builtin_idle_per_node, right? It's confusing for
CPU -> node mapping function to return NUMA_NO_NODE depending on an ops
flag. Shouldn't this be a generic mapping function?

> index 50e1499ae0935..caa1a80f9a60c 100644
> --- a/tools/sched_ext/include/scx/compat.bpf.h
> +++ b/tools/sched_ext/include/scx/compat.bpf.h
> @@ -130,6 +130,25 @@ bool scx_bpf_dispatch_vtime_from_dsq___compat(struct bpf_iter_scx_dsq *it__iter,
>  	 scx_bpf_now() :							\
>  	 bpf_ktime_get_ns())
>  
> +#define __COMPAT_scx_bpf_cpu_to_node(cpu)					\
> +	(bpf_ksym_exists(scx_bpf_cpu_to_node) ?					\
> +	 scx_bpf_cpu_to_node(cpu) : 0)
> +
> +#define __COMPAT_scx_bpf_get_idle_cpumask_node(node)				\
> +	(bpf_ksym_exists(scx_bpf_get_idle_cpumask_node) ?			\
> +	 scx_bpf_get_idle_cpumask_node(node) :					\
> +	 scx_bpf_get_idle_cpumask())						\
> +
> +#define __COMPAT_scx_bpf_get_idle_smtmask_node(node)				\
> +	(bpf_ksym_exists(scx_bpf_get_idle_smtmask_node) ?			\
> +	 scx_bpf_get_idle_smtmask_node(node) :					\
> +	 scx_bpf_get_idle_smtmask())
> +
> +#define __COMPAT_scx_bpf_pick_idle_cpu_node(cpus_allowed, node, flags)		\
> +	(bpf_ksym_exists(scx_bpf_pick_idle_cpu_node) ?				\
> +	 scx_bpf_pick_idle_cpu_node(cpus_allowed, node, flags) :		\
> +	 scx_bpf_pick_idle_cpu(cpus_allowed, flags))

Can you please document when these compat macros can be dropped? Also,
shouldn't it also provide a compat macro for the new ops flag using
__COMPAT_ENUM_OR_ZERO()? Otherwise, trying to load new binary using the new
flag on an older kernel will fail, right?

Thanks.
Andrea Righi Feb. 8, 2025, 9:19 a.m. UTC | #2
On Fri, Feb 07, 2025 at 12:39:57PM -1000, Tejun Heo wrote:
> Hello,
> 
> On Fri, Feb 07, 2025 at 09:40:53PM +0100, Andrea Righi wrote:
> > +/**
> > + * scx_bpf_cpu_to_node - Return the NUMA node the given @cpu belongs to
> > + */
> > +__bpf_kfunc int scx_bpf_cpu_to_node(s32 cpu)
> 
> Maybe scx_bpf_cpu_node() to be in line with scx_bpf_task_cpu/cgroup()?

Ok, then maybe we can have scx_bpf_cpu_node() for the kfunc, which wraps
scx_cpu_node() for internal use.

> 
> > +{
> > +#ifdef CONFIG_NUMA
> > +	if (cpu < 0 || cpu >= nr_cpu_ids)
> > +		return -EINVAL;
> 
> Use ops_cpu_valid()? Otherwise, we can end up calling cpu_to_node() with an
> impossible CPU. Also, I don't think CPU -> node mapping function should be
> able to return an error value. It should just trigger ops error.

Ok.

> 
> > +
> > +	return idle_cpu_to_node(cpu);
> 
> This is contingent on scx_builtin_idle_per_node, right? It's confusing for
> CPU -> node mapping function to return NUMA_NO_NODE depending on an ops
> flag. Shouldn't this be a generic mapping function?

The idea is that BPF schedulers can use this kfunc to determine the right
idle cpumask to use, for example a typical usage could be:

  int node = scx_bpf_cpu_node(prev_cpu);
  s32 cpu = scx_bpf_pick_idle_cpu_in_node(p->cpus_ptr, node, SCX_PICK_IDLE_IN_NODE);

Or:

  int node = scx_bpf_cpu_node(prev_cpu);
  const struct cpumask *idle_cpumask = scx_bpf_get_idle_cpumask_node(node);

When SCX_OPS_BUILTIN_IDLE_PER_NODE is disabled, we need to point to the
global idle cpumask, which is identified by NUMA_NO_NODE, so this is why we
can return NUMA_NO_NODE from scx_bpf_cpu_node().

Do you think we should make this clearer / document this better? Or do
you think we should use a different API?
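
To make the intended flow more concrete, a select_cpu() built on these
kfuncs could look roughly like the sketch below (illustrative only:
numa_select_cpu and the direct-dispatch fallback are made up for the
example, and SCX_PICK_IDLE_IN_NODE comes from the earlier patches in this
series):

s32 BPF_STRUCT_OPS(numa_select_cpu, struct task_struct *p,
		   s32 prev_cpu, u64 wake_flags)
{
	int node = scx_bpf_cpu_to_node(prev_cpu);
	s32 cpu;

	/* Try to keep the task on the node @prev_cpu belongs to */
	cpu = scx_bpf_pick_idle_cpu_node(p->cpus_ptr, node,
					 SCX_PICK_IDLE_IN_NODE);

	/* Fall back to any idle CPU, visiting nodes by increasing distance */
	if (cpu < 0)
		cpu = scx_bpf_pick_idle_cpu_node(p->cpus_ptr, node, 0);

	if (cpu < 0)
		return prev_cpu;

	/* Idle CPU claimed: dispatch directly to its local DSQ */
	scx_bpf_dsq_insert(p, SCX_DSQ_LOCAL, SCX_SLICE_DFL, 0);
	return cpu;
}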

> 
> > index 50e1499ae0935..caa1a80f9a60c 100644
> > --- a/tools/sched_ext/include/scx/compat.bpf.h
> > +++ b/tools/sched_ext/include/scx/compat.bpf.h
> > @@ -130,6 +130,25 @@ bool scx_bpf_dispatch_vtime_from_dsq___compat(struct bpf_iter_scx_dsq *it__iter,
> >  	 scx_bpf_now() :							\
> >  	 bpf_ktime_get_ns())
> >  
> > +#define __COMPAT_scx_bpf_cpu_to_node(cpu)					\
> > +	(bpf_ksym_exists(scx_bpf_cpu_to_node) ?					\
> > +	 scx_bpf_cpu_to_node(cpu) : 0)
> > +
> > +#define __COMPAT_scx_bpf_get_idle_cpumask_node(node)				\
> > +	(bpf_ksym_exists(scx_bpf_get_idle_cpumask_node) ?			\
> > +	 scx_bpf_get_idle_cpumask_node(node) :					\
> > +	 scx_bpf_get_idle_cpumask())						\
> > +
> > +#define __COMPAT_scx_bpf_get_idle_smtmask_node(node)				\
> > +	(bpf_ksym_exists(scx_bpf_get_idle_smtmask_node) ?			\
> > +	 scx_bpf_get_idle_smtmask_node(node) :					\
> > +	 scx_bpf_get_idle_smtmask())
> > +
> > +#define __COMPAT_scx_bpf_pick_idle_cpu_node(cpus_allowed, node, flags)		\
> > +	(bpf_ksym_exists(scx_bpf_pick_idle_cpu_node) ?				\
> > +	 scx_bpf_pick_idle_cpu_node(cpus_allowed, node, flags) :		\
> > +	 scx_bpf_pick_idle_cpu(cpus_allowed, flags))
> 
> Can you please document when these compat macros can be dropped? Also,
> shouldn't it also provide a compat macro for the new ops flag using
> __COMPAT_ENUM_OR_ZERO()? Otherwise, trying to load new binary using the new
> flag on an older kernel will fail, right?

Right. Will add that.
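
For reference, the ops flag part could presumably follow the pattern already
used for SCX_OPS_SWITCH_PARTIAL in compat.h, i.e. something along these
lines (exact form to be settled in the respin):

/* Evaluates to 0 on kernels that don't know the new flag yet */
#define SCX_OPS_BUILTIN_IDLE_PER_NODE					\
	__COMPAT_ENUM_OR_ZERO("scx_ops_flags", "SCX_OPS_BUILTIN_IDLE_PER_NODE")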

Thanks,
-Andrea
Tejun Heo Feb. 9, 2025, 6:31 a.m. UTC | #3
Hello,

On Sat, Feb 08, 2025 at 10:19:38AM +0100, Andrea Righi wrote:
...
> > This is contingent on scx_builtin_idle_per_node, right? It's confusing for
> > CPU -> node mapping function to return NUMA_NO_NODE depending on an ops
> > flag. Shouldn't this be a generic mapping function?
> 
> The idea is that BPF schedulers can use this kfunc to determine the right
> idle cpumask to use, for example a typical usage could be:
> 
>   int node = scx_bpf_cpu_node(prev_cpu);
>   s32 cpu = scx_bpf_pick_idle_cpu_in_node(p->cpus_ptr, node, SCX_PICK_IDLE_IN_NODE);
> 
> Or:
> 
>   int node = scx_bpf_cpu_node(prev_cpu);
>   const struct cpumask *idle_cpumask = scx_bpf_get_idle_cpumask_node(node);
> 
> When SCX_OPS_BUILTIN_IDLE_PER_NODE is disabled, we need to point to the
> global idle cpumask, which is identified by NUMA_NO_NODE, so this is why we
> can return NUMA_NO_NODE from scx_bpf_cpu_node().
> 
> Do you think we should make this clearer / document this better? Or do
> you think we should use a different API?

I think this is too error-prone. It'd be really easy for users to assume
that scx_bpf_cpu_node() always returns the NUMA node for the given CPU, which
can lead to really subtle surprises. Why even allow e.g.
scx_bpf_get_idle_cpumask_node() if IDLE_PER_NODE is not enabled?

Thanks.
Andrea Righi Feb. 9, 2025, 8:11 a.m. UTC | #4
On Sat, Feb 08, 2025 at 08:31:27PM -1000, Tejun Heo wrote:
> Hello,
> 
> On Sat, Feb 08, 2025 at 10:19:38AM +0100, Andrea Righi wrote:
> ...
> > > This is contingent on scx_builtin_idle_per_node, right? It's confusing for
> > > CPU -> node mapping function to return NUMA_NO_NODE depending on an ops
> > > flag. Shouldn't this be a generic mapping function?
> > 
> > The idea is that BPF schedulers can use this kfunc to determine the right
> > idle cpumask to use, for example a typical usage could be:
> > 
> >   int node = scx_bpf_cpu_node(prev_cpu);
> >   s32 cpu = scx_bpf_pick_idle_cpu_in_node(p->cpus_ptr, node, SCX_PICK_IDLE_IN_NODE);
> > 
> > Or:
> > 
> >   int node = scx_bpf_cpu_node(prev_cpu);
> >   const struct cpumask *idle_cpumask = scx_bpf_get_idle_cpumask_node(node);
> > 
> > When SCX_OPS_BUILTIN_IDLE_PER_NODE is disabled, we need to point to the
> > global idle cpumask, which is identified by NUMA_NO_NODE, so this is why we
> > can return NUMA_NO_NODE from scx_bpf_cpu_node().
> > 
> > Do you think we should make this clearer / document this better? Or do
> > you think we should use a different API?
> 
> I think this is too error-prone. It'd be really easy for users to assume
> that scx_bpf_cpu_node() always returns the NUMA node for the given CPU, which
> can lead to really subtle surprises. Why even allow e.g.
> scx_bpf_get_idle_cpumask_node() if IDLE_PER_NODE is not enabled?

Ok, for the kfuncs I agree that we should just trigger an scx_ops_error()
if any of the scx_*_node() kfuncs are used when SCX_OPS_BUILTIN_IDLE_PER_NODE
is disabled (will change this).

About scx_cpu_node(), which is used internally, I think it's convenient to
return NUMA_NO_NODE when idle-per-node is disabled, just to avoid repeating
the same check for scx_builtin_idle_per_node everywhere, and NUMA_NO_NODE
internally always means "use the global cpumask".

Do you think this is still error-prone? Or should I try to refactor the
code to get rid of this NUMA_NO_NODE == global cpumask logic?
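
In other words, the internal helper boils down to something like the sketch
below (the series' actual helper is idle_cpu_to_node(); this is just to
spell out the NUMA_NO_NODE convention):

static int scx_cpu_node(int cpu)
{
	/* NUMA_NO_NODE selects the single global idle cpumask */
	if (!static_branch_likely(&scx_builtin_idle_per_node))
		return NUMA_NO_NODE;

	return cpu_to_node(cpu);
}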

Thanks,
-Andrea
Tejun Heo Feb. 10, 2025, 6:01 a.m. UTC | #5
Hello,

On Sun, Feb 09, 2025 at 09:11:15AM +0100, Andrea Righi wrote:
...
> About scx_cpu_node(), which is used internally, I think it's convenient to
> return NUMA_NO_NODE when idle-per-node is disabled, just to avoid repeating
> the same check for scx_builtin_idle_per_node everywhere, and NUMA_NO_NODE
> internally always means "use the global cpumask".
> 
> Do you think this is still error-prone? Or should I try to refactor the
> code to get rid of this NUMA_NO_NODE == global cpumask logic?

I think that's fine as long as the name clearly indicates that the function
is doing something special other than mapping CPU to node. It can get pretty
confusing if something with a really generic name doesn't behave in a
generic manner.

Thanks.

Patch

diff --git a/kernel/sched/ext_idle.c b/kernel/sched/ext_idle.c
index 4b90ec9018c1a..260a26c6cb0b8 100644
--- a/kernel/sched/ext_idle.c
+++ b/kernel/sched/ext_idle.c
@@ -686,6 +686,33 @@  void __scx_update_idle(struct rq *rq, bool idle, bool do_notify)
 /********************************************************************************
  * Helpers that can be called from the BPF scheduler.
  */
+
+static int validate_node(int node)
+{
+	if (!static_branch_likely(&scx_builtin_idle_per_node)) {
+		scx_ops_error("per-node idle tracking is disabled");
+		return -EOPNOTSUPP;
+	}
+
+	/* Return no entry for NUMA_NO_NODE (not a critical scx error) */
+	if (node == NUMA_NO_NODE)
+		return -ENOENT;
+
+	/* Make sure node is in a valid range */
+	if (node < 0 || node >= nr_node_ids) {
+		scx_ops_error("invalid node %d", node);
+		return -EINVAL;
+	}
+
+	/* Make sure the node is part of the set of possible nodes */
+	if (!node_possible(node)) {
+		scx_ops_error("unavailable node %d", node);
+		return -EINVAL;
+	}
+
+	return node;
+}
+
 __bpf_kfunc_start_defs();
 
 static bool check_builtin_idle_enabled(void)
@@ -697,6 +724,21 @@  static bool check_builtin_idle_enabled(void)
 	return false;
 }
 
+/**
+ * scx_bpf_cpu_to_node - Return the NUMA node the given @cpu belongs to
+ */
+__bpf_kfunc int scx_bpf_cpu_to_node(s32 cpu)
+{
+#ifdef CONFIG_NUMA
+	if (cpu < 0 || cpu >= nr_cpu_ids)
+		return -EINVAL;
+
+	return idle_cpu_to_node(cpu);
+#else
+	return 0;
+#endif
+}
+
 /**
  * scx_bpf_select_cpu_dfl - The default implementation of ops.select_cpu()
  * @p: task_struct to select a CPU for
@@ -729,6 +771,27 @@  __bpf_kfunc s32 scx_bpf_select_cpu_dfl(struct task_struct *p, s32 prev_cpu,
 	return prev_cpu;
 }
 
+/**
+ * scx_bpf_get_idle_cpumask_node - Get a referenced kptr to the
+ * idle-tracking per-CPU cpumask of a target NUMA node.
+ *
+ * Returns an empty cpumask if idle tracking is not enabled, if @node is
+ * not valid, or running on a UP kernel. In this case the actual error will
+ * be reported to the BPF scheduler via scx_ops_error().
+ */
+__bpf_kfunc const struct cpumask *scx_bpf_get_idle_cpumask_node(int node)
+{
+	node = validate_node(node);
+	if (node < 0)
+		return cpu_none_mask;
+
+#ifdef CONFIG_SMP
+	return idle_cpumask(node)->cpu;
+#else
+	return cpu_none_mask;
+#endif
+}
+
 /**
  * scx_bpf_get_idle_cpumask - Get a referenced kptr to the idle-tracking
  * per-CPU cpumask.
@@ -753,6 +816,31 @@  __bpf_kfunc const struct cpumask *scx_bpf_get_idle_cpumask(void)
 #endif
 }
 
+/**
+ * scx_bpf_get_idle_smtmask_node - Get a referenced kptr to the
+ * idle-tracking, per-physical-core cpumask of a target NUMA node. Can be
+ * used to determine if an entire physical core is free.
+ *
+ * Returns an empty cpumask if idle tracking is not enabled, if @node is
+ * not valid, or running on a UP kernel. In this case the actual error will
+ * be reported to the BPF scheduler via scx_ops_error().
+ */
+__bpf_kfunc const struct cpumask *scx_bpf_get_idle_smtmask_node(int node)
+{
+	node = validate_node(node);
+	if (node < 0)
+		return cpu_none_mask;
+
+#ifdef CONFIG_SMP
+	if (sched_smt_active())
+		return idle_cpumask(node)->smt;
+	else
+		return idle_cpumask(node)->cpu;
+#else
+	return cpu_none_mask;
+#endif
+}
+
 /**
  * scx_bpf_get_idle_smtmask - Get a referenced kptr to the idle-tracking,
  * per-physical-core cpumask. Can be used to determine if an entire physical
@@ -817,6 +905,35 @@  __bpf_kfunc bool scx_bpf_test_and_clear_cpu_idle(s32 cpu)
 		return false;
 }
 
+/**
+ * scx_bpf_pick_idle_cpu_node - Pick and claim an idle cpu from a NUMA node
+ * @cpus_allowed: Allowed cpumask
+ * @node: target NUMA node
+ * @flags: %SCX_PICK_IDLE_* flags
+ *
+ * Pick and claim an idle cpu in @cpus_allowed from the NUMA node @node.
+ *
+ * Returns the picked idle cpu number on success, or -%EBUSY if no matching
+ * cpu was found.
+ *
+ * The search starts from @node and proceeds to other online NUMA nodes in
+ * order of increasing distance (unless SCX_PICK_IDLE_IN_NODE is specified,
+ * in which case the search is limited to the target @node).
+ *
+ * Always returns an error if ops.update_idle() is implemented and
+ * %SCX_OPS_KEEP_BUILTIN_IDLE is not set, or if
+ * %SCX_OPS_BUILTIN_IDLE_PER_NODE is not set.
+ */
+__bpf_kfunc s32 scx_bpf_pick_idle_cpu_node(const struct cpumask *cpus_allowed,
+					   int node, u64 flags)
+{
+	node = validate_node(node);
+	if (node < 0)
+		return node;
+
+	return scx_pick_idle_cpu(cpus_allowed, node, flags);
+}
+
 /**
  * scx_bpf_pick_idle_cpu - Pick and claim an idle cpu
  * @cpus_allowed: Allowed cpumask
@@ -880,10 +997,14 @@  __bpf_kfunc s32 scx_bpf_pick_any_cpu(const struct cpumask *cpus_allowed,
 __bpf_kfunc_end_defs();
 
 BTF_KFUNCS_START(scx_kfunc_ids_idle)
+BTF_ID_FLAGS(func, scx_bpf_cpu_to_node)
+BTF_ID_FLAGS(func, scx_bpf_get_idle_cpumask_node, KF_ACQUIRE)
 BTF_ID_FLAGS(func, scx_bpf_get_idle_cpumask, KF_ACQUIRE)
+BTF_ID_FLAGS(func, scx_bpf_get_idle_smtmask_node, KF_ACQUIRE)
 BTF_ID_FLAGS(func, scx_bpf_get_idle_smtmask, KF_ACQUIRE)
 BTF_ID_FLAGS(func, scx_bpf_put_idle_cpumask, KF_RELEASE)
 BTF_ID_FLAGS(func, scx_bpf_test_and_clear_cpu_idle)
+BTF_ID_FLAGS(func, scx_bpf_pick_idle_cpu_node, KF_RCU)
 BTF_ID_FLAGS(func, scx_bpf_pick_idle_cpu, KF_RCU)
 BTF_ID_FLAGS(func, scx_bpf_pick_any_cpu, KF_RCU)
 BTF_KFUNCS_END(scx_kfunc_ids_idle)
diff --git a/tools/sched_ext/include/scx/common.bpf.h b/tools/sched_ext/include/scx/common.bpf.h
index 7055400030241..7dd8ba2964553 100644
--- a/tools/sched_ext/include/scx/common.bpf.h
+++ b/tools/sched_ext/include/scx/common.bpf.h
@@ -63,13 +63,17 @@  u32 scx_bpf_cpuperf_cap(s32 cpu) __ksym __weak;
 u32 scx_bpf_cpuperf_cur(s32 cpu) __ksym __weak;
 void scx_bpf_cpuperf_set(s32 cpu, u32 perf) __ksym __weak;
 u32 scx_bpf_nr_cpu_ids(void) __ksym __weak;
+int scx_bpf_cpu_to_node(s32 cpu) __ksym __weak;
 const struct cpumask *scx_bpf_get_possible_cpumask(void) __ksym __weak;
 const struct cpumask *scx_bpf_get_online_cpumask(void) __ksym __weak;
 void scx_bpf_put_cpumask(const struct cpumask *cpumask) __ksym __weak;
+const struct cpumask *scx_bpf_get_idle_cpumask_node(int node) __ksym __weak;
 const struct cpumask *scx_bpf_get_idle_cpumask(void) __ksym;
+const struct cpumask *scx_bpf_get_idle_smtmask_node(int node) __ksym __weak;
 const struct cpumask *scx_bpf_get_idle_smtmask(void) __ksym;
 void scx_bpf_put_idle_cpumask(const struct cpumask *cpumask) __ksym;
 bool scx_bpf_test_and_clear_cpu_idle(s32 cpu) __ksym;
+s32 scx_bpf_pick_idle_cpu_node(const cpumask_t *cpus_allowed, int node, u64 flags) __ksym __weak;
 s32 scx_bpf_pick_idle_cpu(const cpumask_t *cpus_allowed, u64 flags) __ksym;
 s32 scx_bpf_pick_any_cpu(const cpumask_t *cpus_allowed, u64 flags) __ksym;
 bool scx_bpf_task_running(const struct task_struct *p) __ksym;
diff --git a/tools/sched_ext/include/scx/compat.bpf.h b/tools/sched_ext/include/scx/compat.bpf.h
index 50e1499ae0935..caa1a80f9a60c 100644
--- a/tools/sched_ext/include/scx/compat.bpf.h
+++ b/tools/sched_ext/include/scx/compat.bpf.h
@@ -130,6 +130,25 @@  bool scx_bpf_dispatch_vtime_from_dsq___compat(struct bpf_iter_scx_dsq *it__iter,
 	 scx_bpf_now() :							\
 	 bpf_ktime_get_ns())
 
+#define __COMPAT_scx_bpf_cpu_to_node(cpu)					\
+	(bpf_ksym_exists(scx_bpf_cpu_to_node) ?					\
+	 scx_bpf_cpu_to_node(cpu) : 0)
+
+#define __COMPAT_scx_bpf_get_idle_cpumask_node(node)				\
+	(bpf_ksym_exists(scx_bpf_get_idle_cpumask_node) ?			\
+	 scx_bpf_get_idle_cpumask_node(node) :					\
+	 scx_bpf_get_idle_cpumask())						\
+
+#define __COMPAT_scx_bpf_get_idle_smtmask_node(node)				\
+	(bpf_ksym_exists(scx_bpf_get_idle_smtmask_node) ?			\
+	 scx_bpf_get_idle_smtmask_node(node) :					\
+	 scx_bpf_get_idle_smtmask())
+
+#define __COMPAT_scx_bpf_pick_idle_cpu_node(cpus_allowed, node, flags)		\
+	(bpf_ksym_exists(scx_bpf_pick_idle_cpu_node) ?				\
+	 scx_bpf_pick_idle_cpu_node(cpus_allowed, node, flags) :		\
+	 scx_bpf_pick_idle_cpu(cpus_allowed, flags))
+
 /*
  * Define sched_ext_ops. This may be expanded to define multiple variants for
  * backward compatibility. See compat.h::SCX_OPS_LOAD/ATTACH().