[27/30] sched_ext: Implement core-sched support

Message ID 20230128001639.3510083-28-tj@kernel.org (mailing list archive)
State Not Applicable
Series [01/30] cgroup: Implement cgroup_show_cftypes()

Commit Message

Tejun Heo Jan. 28, 2023, 12:16 a.m. UTC
The core-sched support is composed of the following parts:

* task_struct->scx.core_sched_at is added. This is a timestamp which can be
  used to order tasks. Depending on whether the BPF scheduler implements
  custom ordering, it tracks either global FIFO ordering of all tasks or
  local-DSQ ordering within the dispatched tasks on a CPU.

* prio_less() is updated to call scx_prio_less() when comparing SCX tasks.
  scx_prio_less() calls ops.core_sched_before() if available or uses the
  core_sched_at timestamp. For global FIFO ordering, the BPF scheduler
  doesn't need to do anything. Otherwise, it should implement
  ops.core_sched_before() which reflects the ordering (see the sketch below).

* When core-sched is enabled, balance_scx() balances all SMT siblings so
  that they all have tasks dispatched if necessary before pick_task_scx() is
  called. pick_task_scx() picks between the current task and the first
  dispatched task on the local DSQ based on availability and the
  core_sched_at timestamps. Note that FIFO ordering is expected among the
  already dispatched tasks whether running or on the local DSQ, so this path
  always compares core_sched_at instead of calling into
  ops.core_sched_before().
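
For illustration only (not part of this series), a BPF scheduler which tracks
per-task virtual time in task local storage might implement the callback along
these lines. The names and the vtime bookkeeping are made up; the include is
assumed to match the example schedulers under tools/sched_ext.

#include "scx_common.bpf.h"

/* hypothetical per-task context carrying the scheduler's own vtime */
struct vtime_ctx {
	u64	vtime;
};

struct {
	__uint(type, BPF_MAP_TYPE_TASK_STORAGE);
	__uint(map_flags, BPF_F_NO_PREALLOC);
	__type(key, int);
	__type(value, struct vtime_ctx);
} vtime_stor SEC(".maps");

bool BPF_STRUCT_OPS(vtime_core_sched_before,
		    struct task_struct *a, struct task_struct *b)
{
	struct vtime_ctx *va, *vb;

	va = bpf_task_storage_get(&vtime_stor, a, 0, 0);
	vb = bpf_task_storage_get(&vtime_stor, b, 0, 0);
	if (!va || !vb)
		return false;	/* no required ordering */

	/* per the documented contract: %true iff @a should run before @b */
	return va->vtime < vb->vtime;
}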

qmap_core_sched_before() is added to scx_example_qmap. It scales the
distances from the heads of the queues to compare the tasks across different
priority queues and seems to behave as expected.
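
To make that scaling concrete, here is a worked example following the shift
logic of task_qdist() in the patch below; the queue distances are invented for
illustration:

/*
 * Lower scaled distance == older == higher priority. The queue-3 task
 * wins despite sitting twice as deep in its queue.
 */
s64 qdist_q1 = 8 << (4 - 1);	/* queue 1, 8 entries from its head -> 64 */
s64 qdist_q3 = 16 << (4 - 3);	/* queue 3, 16 entries from its head -> 32 */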

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: David Vernet <dvernet@meta.com>
---
 include/linux/sched/ext.h              |  21 +++
 kernel/Kconfig.preempt                 |   2 +-
 kernel/sched/core.c                    |  12 +-
 kernel/sched/ext.c                     | 196 +++++++++++++++++++++++--
 kernel/sched/ext.h                     |  12 ++
 tools/sched_ext/scx_example_qmap.bpf.c |  87 ++++++++++-
 tools/sched_ext/scx_example_qmap.c     |   5 +-
 7 files changed, 319 insertions(+), 16 deletions(-)

Comments

kernel test robot Jan. 28, 2023, 7:07 p.m. UTC | #1
Hi Tejun,

I love your patch! Yet something to improve:

[auto build test ERROR on linux/master]
[also build test ERROR on linus/master v6.2-rc5]
[cannot apply to tip/sched/core tj-cgroup/for-next next-20230127]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Tejun-Heo/sched-Encapsulate-task-attribute-change-sequence-into-a-helper-macro/20230128-123001
patch link:    https://lore.kernel.org/r/20230128001639.3510083-28-tj%40kernel.org
patch subject: [PATCH 27/30] sched_ext: Implement core-sched support
config: s390-allmodconfig (https://download.01.org/0day-ci/archive/20230129/202301290223.0qWZoY9T-lkp@intel.com/config)
compiler: s390-linux-gcc (GCC) 12.1.0
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # https://github.com/intel-lab-lkp/linux/commit/750f973dd349cc5c3df29319b0ffae740738a9d2
        git remote add linux-review https://github.com/intel-lab-lkp/linux
        git fetch --no-tags linux-review Tejun-Heo/sched-Encapsulate-task-attribute-change-sequence-into-a-helper-macro/20230128-123001
        git checkout 750f973dd349cc5c3df29319b0ffae740738a9d2
        # save the config file
        mkdir build_dir && cp config build_dir/.config
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-12.1.0 make.cross W=1 O=build_dir ARCH=s390 olddefconfig
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-12.1.0 make.cross W=1 O=build_dir ARCH=s390 SHELL=/bin/bash kernel/

If you fix the issue, kindly add following tag where applicable
| Reported-by: kernel test robot <lkp@intel.com>

All error/warnings (new ones prefixed by >>):

   In file included from kernel/sched/build_policy.c:58:
>> kernel/sched/ext.c:3349:25: error: initialization of 'int (*)(const struct btf_type *, const struct btf_member *)' from incompatible pointer type 'int (*)(const struct btf_type *, const struct btf_member *, const struct bpf_prog *)' [-Werror=incompatible-pointer-types]
    3349 |         .check_member = bpf_scx_check_member,
         |                         ^~~~~~~~~~~~~~~~~~~~
   kernel/sched/ext.c:3349:25: note: (near initialization for 'bpf_sched_ext_ops.check_member')
   kernel/sched/ext.c: In function 'scx_bpf_error_bstr':
>> kernel/sched/ext.c:3906:16: error: variable 'bprintf_data' has initializer but incomplete type
    3906 |         struct bpf_bprintf_data bprintf_data = { .get_bin_args = true };
         |                ^~~~~~~~~~~~~~~~
>> kernel/sched/ext.c:3906:51: error: 'struct bpf_bprintf_data' has no member named 'get_bin_args'
    3906 |         struct bpf_bprintf_data bprintf_data = { .get_bin_args = true };
         |                                                   ^~~~~~~~~~~~
>> kernel/sched/ext.c:3906:66: warning: excess elements in struct initializer
    3906 |         struct bpf_bprintf_data bprintf_data = { .get_bin_args = true };
         |                                                                  ^~~~
   kernel/sched/ext.c:3906:66: note: (near initialization for 'bprintf_data')
>> kernel/sched/ext.c:3906:33: error: storage size of 'bprintf_data' isn't known
    3906 |         struct bpf_bprintf_data bprintf_data = { .get_bin_args = true };
         |                                 ^~~~~~~~~~~~
>> kernel/sched/ext.c:3927:71: warning: passing argument 4 of 'bpf_bprintf_prepare' makes pointer from integer without a cast [-Wint-conversion]
    3927 |         ret = bpf_bprintf_prepare(fmt, UINT_MAX, bufs->data, data__sz / 8,
         |                                                              ~~~~~~~~~^~~
         |                                                                       |
         |                                                                       u32 {aka unsigned int}
   In file included from include/linux/bpf_verifier.h:7,
                    from kernel/sched/ext.c:3193:
   include/linux/bpf.h:2800:31: note: expected 'u32 **' {aka 'unsigned int **'} but argument is of type 'u32' {aka 'unsigned int'}
    2800 |                         u32 **bin_buf, u32 num_args);
         |                         ~~~~~~^~~~~~~
>> kernel/sched/ext.c:3936:9: error: too many arguments to function 'bpf_bprintf_cleanup'
    3936 |         bpf_bprintf_cleanup(&bprintf_data);
         |         ^~~~~~~~~~~~~~~~~~~
   include/linux/bpf.h:2801:6: note: declared here
    2801 | void bpf_bprintf_cleanup(void);
         |      ^~~~~~~~~~~~~~~~~~~
   kernel/sched/ext.c:3906:33: warning: unused variable 'bprintf_data' [-Wunused-variable]
    3906 |         struct bpf_bprintf_data bprintf_data = { .get_bin_args = true };
         |                                 ^~~~~~~~~~~~
   cc1: some warnings being treated as errors


vim +3349 kernel/sched/ext.c

4c016c2bafb66b Tejun Heo 2023-01-27  3344  
4c016c2bafb66b Tejun Heo 2023-01-27  3345  struct bpf_struct_ops bpf_sched_ext_ops = {
4c016c2bafb66b Tejun Heo 2023-01-27  3346  	.verifier_ops = &bpf_scx_verifier_ops,
4c016c2bafb66b Tejun Heo 2023-01-27  3347  	.reg = bpf_scx_reg,
4c016c2bafb66b Tejun Heo 2023-01-27  3348  	.unreg = bpf_scx_unreg,
4c016c2bafb66b Tejun Heo 2023-01-27 @3349  	.check_member = bpf_scx_check_member,
4c016c2bafb66b Tejun Heo 2023-01-27  3350  	.init_member = bpf_scx_init_member,
4c016c2bafb66b Tejun Heo 2023-01-27  3351  	.init = bpf_scx_init,
4c016c2bafb66b Tejun Heo 2023-01-27  3352  	.name = "sched_ext_ops",
4c016c2bafb66b Tejun Heo 2023-01-27  3353  };
4c016c2bafb66b Tejun Heo 2023-01-27  3354
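
The first error above is a prototype mismatch: the tree the robot built
against passes a third const struct bpf_prog * argument to .check_member, as
the expected pointer type in the error message shows. A sketch of the
signature adjustment (the body is a placeholder; the patch's existing checks
would stay):

static int bpf_scx_check_member(const struct btf_type *t,
				const struct btf_member *member,
				const struct bpf_prog *prog)
{
	/* only the added @prog parameter is new; original checks go here */
	return 0;
}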

Josh Don Jan. 30, 2023, 9:38 p.m. UTC | #2
Hi Tejun,

On Fri, Jan 27, 2023 at 4:17 PM Tejun Heo <tj@kernel.org> wrote:
>
> The core-sched support is composed of the following parts:

Thanks, this looks pretty reasonable overall.

One meta comment is that I think we can short-circuit from
touch_core_sched when we have sched_core_disabled().

Reviewed-by: Josh Don <joshdon@google.com>

> +                       /*
> +                        * While core-scheduling, rq lock is shared among
> +                        * siblings but the debug annotations and rq clock
> +                        * aren't. Do pinning dance to transfer the ownership.
> +                        */
> +                       WARN_ON_ONCE(__rq_lockp(rq) != __rq_lockp(srq));
> +                       rq_unpin_lock(rq, rf);
> +                       rq_pin_lock(srq, &srf);
> +
> +                       update_rq_clock(srq);

Unfortunate that we have to do this superfluous update; maybe we can
save/restore the clock flags from before the pinning shenanigans?

> +static struct task_struct *pick_task_scx(struct rq *rq)
> +{
> +       struct task_struct *curr = rq->curr;
> +       struct task_struct *first = first_local_task(rq);
> +
> +       if (curr->scx.flags & SCX_TASK_QUEUED) {
> +               /* is curr the only runnable task? */
> +               if (!first)
> +                       return curr;
> +
> +               /*
> +                * Does curr trump first? We can always go by core_sched_at for
> +                * this comparison as it represents global FIFO ordering when
> +                        * the default core-sched ordering is in use and local-DSQ FIFO
> +                * ordering otherwise.
> +                */
> +               if (curr->scx.slice && time_before64(curr->scx.core_sched_at,
> +                                                    first->scx.core_sched_at))
> +                       return curr;

So is this to handle the case where we have something running on 'rq'
to match the cookie of our sibling (which had priority), but now we
want to switch to running the first thing in the local queue, which
has a different cookie (and is now the highest priority entity)? Maybe
being slightly more specific in the comment would help :)

Tejun Heo Jan. 31, 2023, 12:26 a.m. UTC | #3
Hello,

On Mon, Jan 30, 2023 at 01:38:15PM -0800, Josh Don wrote:
> > The core-sched support is composed of the following parts:
> 
> Thanks, this looks pretty reasonable overall.
> 
> One meta comment is that I think we can short-circuit from
> touch_core_sched when we have sched_core_disabled().

Yeah, touch_core_sched() is really cheap (it's just an assignment from an rq
field to a task field) but sched_core_disabled() is also just a static
branch. Will update.
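
A minimal sketch of that short-circuit (illustrative only, not the actual
update):

static void touch_core_sched(struct rq *rq, struct task_struct *p)
{
#ifdef CONFIG_SCHED_CORE
	/* suggested short-circuit: nothing to order when core-sched is off */
	if (sched_core_disabled())
		return;

	p->scx.core_sched_at = rq_clock_task(rq);
#endif
}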

> Reviewed-by: Josh Don <joshdon@google.com>
> 
> > +                       /*
> > +                        * While core-scheduling, rq lock is shared among
> > +                        * siblings but the debug annotations and rq clock
> > +                        * aren't. Do pinning dance to transfer the ownership.
> > +                        */
> > +                       WARN_ON_ONCE(__rq_lockp(rq) != __rq_lockp(srq));
> > +                       rq_unpin_lock(rq, rf);
> > +                       rq_pin_lock(srq, &srf);
> > +
> > +                       update_rq_clock(srq);
> 
> Unfortunate that we have to do this superfluous update; maybe we can
> save/restore the clock flags from before the pinning shenanigans?

So, this one isn't really superfluous. There are two rq's involved - self and
sibling. self's rq clock is saved and restored through rq_unpin_lock() and
rq_repin_lock(). We're transferring the lock ownership from self to sibling
without actually unlocking and relocking the lock as they should be sharing
the same lock; however, that doesn't mean that the two queues share rq
clocks, so the sibling needs to update its rq clock upon getting the lock
transferred to it. It might make sense to make the siblings share the rq
clock when core-sched is enabled but that's probably for some other time.

> > +static struct task_struct *pick_task_scx(struct rq *rq)
> > +{
> > +       struct task_struct *curr = rq->curr;
> > +       struct task_struct *first = first_local_task(rq);
> > +
> > +       if (curr->scx.flags & SCX_TASK_QUEUED) {
> > +               /* is curr the only runnable task? */
> > +               if (!first)
> > +                       return curr;
> > +
> > +               /*
> > +                * Does curr trump first? We can always go by core_sched_at for
> > +                * this comparison as it represents global FIFO ordering when
> > +                * the default core-sched ordering is in use and local-DSQ FIFO
> > +                * ordering otherwise.
> > +                */
> > +               if (curr->scx.slice && time_before64(curr->scx.core_sched_at,
> > +                                                    first->scx.core_sched_at))
> > +                       return curr;
> 
> So is this to handle the case where we have something running on 'rq'
> to match the cookie of our sibling (which had priority), but now we
> want to switch to running the first thing in the local queue, which
> has a different cookie (and is now the highest priority entity)? Maybe
> being slightly more specific in the comment would help :)

pick_task_scx() is to pick the next best candidate for the rq. The
candidates then compete to determine the next cookie. Here, as long as only
one task gets dispatched at any given time, the condition check shouldn't
really trigger - ie. if curr has slice remaining, balance_one() wouldn't
have populated the local DSQ anyway and first would be NULL.

However, the BPF scheduler is free to dispatch whatever tasks anytime (e.g.
scx_example_central), so it's possible that a task with an earlier timestamp
has been dispatched to the local DSQ since curr started executing, in which
case we likely want to return the first on DSQ as the CPU's candidate.

IOW, the time_before() is there mostly to cover unusual cases. Most should
either trigger !first before or fail the curr->scx.slice test. Will update
the comment to clarify.

Thanks.

Tejun Heo Jan. 31, 2023, 12:36 a.m. UTC | #4
On Mon, Jan 30, 2023 at 02:26:20PM -1000, Tejun Heo wrote:
> However, the BPF scheduler is free to dispatch whatever tasks anytime (e.g.
> scx_example_central), so it's possible that a task with an earlier timestamp
> has been dispatched to the local DSQ since curr started executing, in which
> case we likely want to return the first on DSQ as the CPU's candidate.

Okay, a more common case would be when a CPU is forced to run a task which
isn't its current one because its sibling won with a different cookie, and the
BPF scheduler then puts that task right back on the local DSQ. For that CPU,
the right candidate would be the first task on the DSQ, not the currently
running one, which is dragged forward only because the sibling is trumping us.

Thanks.

Josh Don Jan. 31, 2023, 1:45 a.m. UTC | #5
On Mon, Jan 30, 2023 at 4:26 PM Tejun Heo <tj@kernel.org> wrote:
>
> Hello,
>
> On Mon, Jan 30, 2023 at 01:38:15PM -0800, Josh Don wrote:
> > > The core-sched support is composed of the following parts:
> >
> > Thanks, this looks pretty reasonable overall.
> >
> > One meta comment is that I think we can short-circuit from
> > touch_core_sched when we have sched_core_disabled().
>
> Yeah, touch_core_sched() is really cheap (it's just an assignment from an rq
> field to a task field) but sched_core_disabled() is also just a static
> branch. Will update.

Yep, true, I was just going through and reasoning about whether
anything needed to be done in the !sched_core_disabled() case.

> > Reviewed-by: Josh Don <joshdon@google.com>
> >
> > > +                       /*
> > > +                        * While core-scheduling, rq lock is shared among
> > > +                        * siblings but the debug annotations and rq clock
> > > +                        * aren't. Do pinning dance to transfer the ownership.
> > > +                        */
> > > +                       WARN_ON_ONCE(__rq_lockp(rq) != __rq_lockp(srq));
> > > +                       rq_unpin_lock(rq, rf);
> > > +                       rq_pin_lock(srq, &srf);
> > > +
> > > +                       update_rq_clock(srq);
> >
> > Unfortunate that we have to do this superfluous update; maybe we can
> > save/restore the clock flags from before the pinning shenanigans?
>
> So, this one isn't really superfluous. There are two rq's involved - self and
> sibling. self's rq clock is saved and restored through rq_unpin_lock() and
> rq_repin_lock(). We're transferring the lock ownership from self to sibling
> without actually unlocking and relocking the lock as they should be sharing
> the same lock; however, that doesn't mean that the two queues share rq
> clocks, so the sibling needs to update its rq clock upon getting the lock
> transferred to it. It might make sense to make the siblings share the rq
> clock when core-sched is enabled but that's probably for some other time.

Yep, whoops, I forgot that part didn't make it.

Patch

diff --git a/include/linux/sched/ext.h b/include/linux/sched/ext.h
index 01c846445243..d3c2701bb4b4 100644
--- a/include/linux/sched/ext.h
+++ b/include/linux/sched/ext.h
@@ -315,6 +315,24 @@  struct sched_ext_ops {
 	 */
 	bool (*yield)(struct task_struct *from, struct task_struct *to);
 
+	/**
+	 * core_sched_before - Task ordering for core-sched
+	 * @a: task A
+	 * @b: task B
+	 *
+	 * Used by core-sched to determine the ordering between two tasks. See
+	 * Documentation/admin-guide/hw-vuln/core-scheduling.rst for details on
+	 * core-sched.
+	 *
+	 * Both @a and @b are runnable and may or may not currently be queued on
+	 * the BPF scheduler. Should return %true if @a should run before @b.
+	 * %false if there's no required ordering or @b should run before @a.
+	 *
+	 * If not specified, the default is ordering them according to when they
+	 * became runnable.
+	 */
+	bool (*core_sched_before)(struct task_struct *a, struct task_struct *b);
+
 	/**
 	 * set_cpumask - Set CPU affinity
 	 * @p: task to set CPU affinity for
@@ -611,6 +629,9 @@  struct sched_ext_entity {
 	u32			kf_mask;	/* see scx_kf_mask above */
 	atomic64_t		ops_state;
 	unsigned long		runnable_at;
+#ifdef CONFIG_SCHED_CORE
+	u64			core_sched_at;	/* see scx_prio_less() */
+#endif
 
 	/* BPF scheduler modifiable fields */
 
diff --git a/kernel/Kconfig.preempt b/kernel/Kconfig.preempt
index 0afcda19bc50..e12a057ead7b 100644
--- a/kernel/Kconfig.preempt
+++ b/kernel/Kconfig.preempt
@@ -135,7 +135,7 @@  config SCHED_CORE
 
 config SCHED_CLASS_EXT
 	bool "Extensible Scheduling Class"
-	depends on BPF_SYSCALL && BPF_JIT && !SCHED_CORE
+	depends on BPF_SYSCALL && BPF_JIT
 	help
 	  This option enables a new scheduler class sched_ext (SCX), which
 	  allows scheduling policies to be implemented as BPF programs to
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 47334e428031..a40b74a2fdbd 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -163,7 +163,12 @@  static inline int __task_prio(struct task_struct *p)
 	if (p->sched_class == &idle_sched_class)
 		return MAX_RT_PRIO + NICE_WIDTH; /* 140 */
 
-	return MAX_RT_PRIO + MAX_NICE; /* 120, squash fair */
+#ifdef CONFIG_SCHED_CLASS_EXT
+	if (p->sched_class == &ext_sched_class)
+		return MAX_RT_PRIO + MAX_NICE + 1; /* 120, squash ext */
+#endif
+
+	return MAX_RT_PRIO + MAX_NICE; /* 119, squash fair */
 }
 
 /*
@@ -191,6 +196,11 @@  static inline bool prio_less(struct task_struct *a, struct task_struct *b, bool
 	if (pa == MAX_RT_PRIO + MAX_NICE)	/* fair */
 		return cfs_prio_less(a, b, in_fi);
 
+#ifdef CONFIG_SCHED_CLASS_EXT
+	if (pa == MAX_RT_PRIO + MAX_NICE + 1)	/* ext */
+		return scx_prio_less(a, b, in_fi);
+#endif
+
 	return false;
 }
 
diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index e981b7111e0a..8619eb2dcbd5 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -447,6 +447,44 @@  static int ops_sanitize_err(const char *ops_name, s32 err)
 	return -EPROTO;
 }
 
+/**
+ * touch_core_sched - Update timestamp used for core-sched task ordering
+ * @rq: rq to read clock from, must be locked
+ * @p: task to update the timestamp for
+ *
+ * Update @p->scx.core_sched_at timestamp. This is used by scx_prio_less() to
+ * implement global or local-DSQ FIFO ordering for core-sched. Should be called
+ * when a task becomes runnable and when its turn on the CPU ends (e.g. slice
+ * exhaustion).
+ */
+static void touch_core_sched(struct rq *rq, struct task_struct *p)
+{
+#ifdef CONFIG_SCHED_CORE
+	p->scx.core_sched_at = rq_clock_task(rq);
+#endif
+}
+
+/**
+ * touch_core_sched_dispatch - Update core-sched timestamp on dispatch
+ * @rq: rq to read clock from, must be locked
+ * @p: task being dispatched
+ *
+ * If the BPF scheduler implements custom core-sched ordering via
+ * ops.core_sched_before(), @p->scx.core_sched_at is used to implement FIFO
+ * ordering within each local DSQ. This function is called from dispatch paths
+ * and updates @p->scx.core_sched_at if custom core-sched ordering is in effect.
+ */
+static void touch_core_sched_dispatch(struct rq *rq, struct task_struct *p)
+{
+	lockdep_assert_rq_held(rq);
+	assert_clock_updated(rq);
+
+#ifdef CONFIG_SCHED_CORE
+	if (SCX_HAS_OP(core_sched_before))
+		touch_core_sched(rq, p);
+#endif
+}
+
 static void update_curr_scx(struct rq *rq)
 {
 	struct task_struct *curr = rq->curr;
@@ -462,8 +500,11 @@  static void update_curr_scx(struct rq *rq)
 	account_group_exec_runtime(curr, delta_exec);
 	cgroup_account_cputime(curr, delta_exec);
 
-	if (curr->scx.slice != SCX_SLICE_INF)
+	if (curr->scx.slice != SCX_SLICE_INF) {
 		curr->scx.slice -= min(curr->scx.slice, delta_exec);
+		if (!curr->scx.slice)
+			touch_core_sched(rq, curr);
+	}
 }
 
 static void dispatch_enqueue(struct scx_dispatch_q *dsq, struct task_struct *p,
@@ -619,6 +660,8 @@  static void direct_dispatch(struct task_struct *ddsp_task, struct task_struct *p
 		return;
 	}
 
+	touch_core_sched_dispatch(task_rq(p), p);
+
 	dsq = find_dsq_for_dispatch(task_rq(p), dsq_id, p);
 	dispatch_enqueue(dsq, p, enq_flags | SCX_ENQ_CLEAR_OPSS);
 
@@ -702,12 +745,19 @@  static void do_enqueue_task(struct rq *rq, struct task_struct *p, u64 enq_flags,
 	return;
 
 local:
+	/*
+	 * For task-ordering, slice refill must be treated as implying the end
+	 * of the current slice. Otherwise, the longer @p stays on the CPU, the
+	 * higher priority it becomes from scx_prio_less()'s POV.
+	 */
+	touch_core_sched(rq, p);
 	p->scx.slice = SCX_SLICE_DFL;
 local_norefill:
 	dispatch_enqueue(&rq->scx.local_dsq, p, enq_flags);
 	return;
 
 global:
+	touch_core_sched(rq, p);	/* see the comment in local: */
 	p->scx.slice = SCX_SLICE_DFL;
 	dispatch_enqueue(&scx_dsq_global, p, enq_flags);
 }
@@ -762,6 +812,9 @@  static void enqueue_task_scx(struct rq *rq, struct task_struct *p, int enq_flags
 	if (SCX_HAS_OP(runnable))
 		scx_ops.runnable(p, enq_flags);
 
+	if (enq_flags & SCX_ENQ_WAKEUP)
+		touch_core_sched(rq, p);
+
 	do_enqueue_task(rq, p, enq_flags, sticky_cpu);
 }
 
@@ -1201,6 +1254,7 @@  static void finish_dispatch(struct rq *rq, struct rq_flags *rf,
 	struct scx_dispatch_q *dsq;
 	u64 opss;
 
+	touch_core_sched_dispatch(rq, p);
 retry:
 	/*
 	 * No need for _acquire here. @p is accessed only after a successful
@@ -1278,8 +1332,8 @@  static void flush_dispatch_buf(struct rq *rq, struct rq_flags *rf)
 	dspc->buf_cursor = 0;
 }
 
-static int balance_scx(struct rq *rq, struct task_struct *prev,
-		       struct rq_flags *rf)
+static int balance_one(struct rq *rq, struct task_struct *prev,
+		       struct rq_flags *rf, bool local)
 {
 	struct scx_rq *scx_rq = &rq->scx;
 	struct scx_dsp_ctx *dspc = this_cpu_ptr(&scx_dsp_ctx);
@@ -1302,7 +1356,7 @@  static int balance_scx(struct rq *rq, struct task_struct *prev,
 	}
 
 	if (prev_on_scx) {
-		WARN_ON_ONCE(prev->scx.flags & SCX_TASK_BAL_KEEP);
+		WARN_ON_ONCE(local && (prev->scx.flags & SCX_TASK_BAL_KEEP));
 		update_curr_scx(rq);
 
 		/*
@@ -1314,10 +1368,16 @@  static int balance_scx(struct rq *rq, struct task_struct *prev,
 		 *
 		 * See scx_ops_disable_workfn() for the explanation on the
 		 * disabling() test.
+		 *
+		 * When balancing a remote CPU for core-sched, there won't be a
+		 * following put_prev_task_scx() call and we don't own
+		 * %SCX_TASK_BAL_KEEP. Instead, pick_task_scx() will test the
+		 * same conditions later and pick @rq->curr accordingly.
 		 */
 		if ((prev->scx.flags & SCX_TASK_QUEUED) &&
 		    prev->scx.slice && !scx_ops_disabling()) {
-			prev->scx.flags |= SCX_TASK_BAL_KEEP;
+			if (local)
+				prev->scx.flags |= SCX_TASK_BAL_KEEP;
 			return 1;
 		}
 	}
@@ -1373,10 +1433,55 @@  static int balance_scx(struct rq *rq, struct task_struct *prev,
 	return 0;
 }
 
+static int balance_scx(struct rq *rq, struct task_struct *prev,
+		       struct rq_flags *rf)
+{
+	int ret;
+
+	ret = balance_one(rq, prev, rf, true);
+
+	/*
+	 * When core-sched is enabled, this ops.balance() call will be followed
+	 * by put_prev_scx() and pick_task_scx() on this CPU and pick_task_scx()
+	 * by put_prev_task_scx() and pick_task_scx() on this CPU and pick_task_scx()
+	 */
+	if (sched_core_enabled(rq)) {
+		const struct cpumask *smt_mask = cpu_smt_mask(cpu_of(rq));
+		int scpu;
+
+		for_each_cpu_andnot(scpu, smt_mask, cpumask_of(cpu_of(rq))) {
+			struct rq *srq = cpu_rq(scpu);
+			struct rq_flags srf;
+			struct task_struct *sprev = srq->curr;
+
+			/*
+			 * While core-scheduling, rq lock is shared among
+			 * siblings but the debug annotations and rq clock
+			 * aren't. Do pinning dance to transfer the ownership.
+			 */
+			WARN_ON_ONCE(__rq_lockp(rq) != __rq_lockp(srq));
+			rq_unpin_lock(rq, rf);
+			rq_pin_lock(srq, &srf);
+
+			update_rq_clock(srq);
+			balance_one(srq, sprev, &srf, false);
+
+			rq_unpin_lock(srq, &srf);
+			rq_repin_lock(rq, rf);
+		}
+	}
+
+	return ret;
+}
+
 static void set_next_task_scx(struct rq *rq, struct task_struct *p, bool first)
 {
 	if (p->scx.flags & SCX_TASK_QUEUED) {
-		WARN_ON_ONCE(atomic64_read(&p->scx.ops_state) != SCX_OPSS_NONE);
+		/*
+		 * Core-sched might decide to execute @p before it is
+		 * dispatched. Call ops_dequeue() to notify the BPF scheduler.
+		 */
+		ops_dequeue(p, SCX_DEQ_CORE_SCHED_EXEC);
 		dispatch_dequeue(&rq->scx, p);
 	}
 
@@ -1516,6 +1621,69 @@  static struct task_struct *pick_next_task_scx(struct rq *rq)
 	return p;
 }
 
+#ifdef CONFIG_SCHED_CORE
+/**
+ * scx_prio_less - Task ordering for core-sched
+ * @a: task A
+ * @b: task B
+ *
+ * Core-sched is implemented as an additional scheduling layer on top of the
+ * usual sched_class'es and needs to find out the expected task ordering. For
+ * SCX, core-sched calls this function to interrogate the task ordering.
+ *
+ * Unless overridden by ops.core_sched_before(), @p->scx.core_sched_at is used
+ * to implement the default task ordering. The older the timestamp, the higher
+ * priority the task - the global FIFO ordering matching the default scheduling
+ * behavior.
+ *
+ * When ops.core_sched_before() is enabled, @p->scx.core_sched_at is used to
+ * implement FIFO ordering within each local DSQ. See pick_task_scx().
+ */
+bool scx_prio_less(struct task_struct *a, struct task_struct *b, bool in_fi)
+{
+	if (SCX_HAS_OP(core_sched_before) && !scx_ops_disabling())
+		return scx_ops.core_sched_before(a, b);
+	else
+		return time_after64(a->scx.core_sched_at, b->scx.core_sched_at);
+}
+
+/**
+ * pick_task_scx - Pick a candidate task for core-sched
+ * @rq: rq to pick the candidate task from
+ *
+ * Core-sched calls this function on each SMT sibling to determine the next
+ * tasks to run on the SMT siblings. balance_one() has been called on all
+ * siblings and put_prev_task_scx() has been called only for the current CPU.
+ *
+ * As put_prev_task_scx() hasn't been called on remote CPUs, we can't just look
+ * at the first task in the local dsq. @rq->curr has to be considered explicitly
+ * to mimic %SCX_TASK_BAL_KEEP.
+ */
+static struct task_struct *pick_task_scx(struct rq *rq)
+{
+	struct task_struct *curr = rq->curr;
+	struct task_struct *first = first_local_task(rq);
+
+	if (curr->scx.flags & SCX_TASK_QUEUED) {
+		/* is curr the only runnable task? */
+		if (!first)
+			return curr;
+
+		/*
+		 * Does curr trump first? We can always go by core_sched_at for
+		 * this comparison as it represents global FIFO ordering when
+		 * the default core-sched ordering is in use and local-DSQ FIFO
+		 * ordering otherwise.
+		 */
+		if (curr->scx.slice && time_before64(curr->scx.core_sched_at,
+						     first->scx.core_sched_at))
+			return curr;
+	}
+
+	return first;	/* this may be %NULL */
+}
+#endif	/* CONFIG_SCHED_CORE */
+
 static enum scx_cpu_preempt_reason
 preempt_reason_from_class(const struct sched_class *class)
 {
@@ -1795,11 +1963,13 @@  static void task_tick_scx(struct rq *rq, struct task_struct *curr, int queued)
 	update_curr_scx(rq);
 
 	/*
-	 * While disabling, always resched as we can't trust the slice
-	 * management.
+	 * While disabling, always resched and refresh core-sched timestamp as
+	 * we can't trust the slice management or ops.core_sched_before().
 	 */
-	if (scx_ops_disabling())
+	if (scx_ops_disabling()) {
 		curr->scx.slice = 0;
+		touch_core_sched(rq, curr);
+	}
 
 	if (!curr->scx.slice)
 		resched_curr(rq);
@@ -2232,6 +2402,10 @@  DEFINE_SCHED_CLASS(ext) = {
 	.rq_offline		= rq_offline_scx,
 #endif
 
+#ifdef CONFIG_SCHED_CORE
+	.pick_task		= pick_task_scx,
+#endif
+
 	.task_tick		= task_tick_scx,
 
 	.switching_to		= switching_to_scx,
@@ -2560,9 +2734,11 @@  static void scx_ops_disable_workfn(struct kthread_work *work)
 	 *
 	 * b. balance_scx() never sets %SCX_TASK_BAL_KEEP as the slice value
 	 *    can't be trusted. Whenever a tick triggers, the running task is
-	 *    rotated to the tail of the queue.
+	 *    rotated to the tail of the queue with core_sched_at touched.
 	 *
 	 * c. pick_next_task() suppresses zero slice warning.
+	 *
+	 * d. scx_prio_less() reverts to the default core_sched_at order.
 	 */
 	scx_ops.enqueue = scx_ops_fallback_enqueue;
 	scx_ops.dispatch = scx_ops_fallback_dispatch;
diff --git a/kernel/sched/ext.h b/kernel/sched/ext.h
index 099e17e92228..c3df39984fc9 100644
--- a/kernel/sched/ext.h
+++ b/kernel/sched/ext.h
@@ -68,6 +68,14 @@  enum scx_enq_flags {
 enum scx_deq_flags {
 	/* expose select DEQUEUE_* flags as enums */
 	SCX_DEQ_SLEEP		= DEQUEUE_SLEEP,
+
+	/* high 32bits are SCX specific */
+
+	/*
+	 * The generic core-sched layer decided to execute the task even though
+	 * it hasn't been dispatched yet. Dequeue from the BPF side.
+	 */
+	SCX_DEQ_CORE_SCHED_EXEC	= 1LLU << 32,
 };
 
 enum scx_tg_flags {
@@ -173,6 +181,10 @@  static inline const struct sched_class *next_active_class(const struct sched_cla
 	for_active_class_range(class, (prev_class) > &ext_sched_class ?		\
 			       &ext_sched_class : (prev_class), (end_class))
 
+#ifdef CONFIG_SCHED_CORE
+bool scx_prio_less(struct task_struct *a, struct task_struct *b, bool in_fi);
+#endif
+
 #else	/* CONFIG_SCHED_CLASS_EXT */
 
 #define scx_enabled()		false
diff --git a/tools/sched_ext/scx_example_qmap.bpf.c b/tools/sched_ext/scx_example_qmap.bpf.c
index 7e670986542b..7d851fd987ac 100644
--- a/tools/sched_ext/scx_example_qmap.bpf.c
+++ b/tools/sched_ext/scx_example_qmap.bpf.c
@@ -13,6 +13,7 @@ 
  * - Sleepable per-task storage allocation using ops.prep_enable().
  * - Using ops.cpu_release() to handle a higher priority scheduling class taking
  *   the CPU away.
+ * - Core-sched support.
  *
  * Copyright (c) 2022 Meta Platforms, Inc. and affiliates.
  * Copyright (c) 2022 Tejun Heo <tj@kernel.org>
@@ -59,9 +60,21 @@  struct {
 	},
 };
 
+/*
+ * Per-queue sequence numbers to implement core-sched ordering.
+ *
+ * Tail seq is assigned to each queued task and incremented. Head seq tracks the
+ * sequence number of the latest dispatched task. The distance between a
+ * task's seq and the associated queue's head seq is called the queue distance
+ * and used when comparing two tasks for ordering. See qmap_core_sched_before().
+ */
+static u64 core_sched_head_seqs[5];
+static u64 core_sched_tail_seqs[5];
+
 /* Per-task scheduling context */
 struct task_ctx {
 	bool	force_local;	/* Dispatch directly to local_dsq */
+	u64	core_sched_seq;
 };
 
 struct {
@@ -81,6 +94,7 @@  struct {
 
 /* Statistics */
 unsigned long nr_enqueued, nr_dispatched, nr_reenqueued, nr_dequeued;
+unsigned long nr_core_sched_execed;
 
 s32 BPF_STRUCT_OPS(qmap_select_cpu, struct task_struct *p,
 		   s32 prev_cpu, u64 wake_flags)
@@ -147,8 +161,18 @@  void BPF_STRUCT_OPS(qmap_enqueue, struct task_struct *p, u64 enq_flags)
 		return;
 	}
 
-	/* Is select_cpu() is telling us to enqueue locally? */
-	if (tctx->force_local) {
+	/*
+	 * All enqueued tasks must have their core_sched_seq updated for correct
+	 * core-sched ordering, which is why %SCX_OPS_ENQ_LAST is specified in
+	 * qmap_ops.flags.
+	 */
+	tctx->core_sched_seq = core_sched_tail_seqs[idx]++;
+
+	/*
+	 * If qmap_select_cpu() is telling us to or this is the last runnable
+	 * task on the CPU, enqueue locally.
+	 */
+	if (tctx->force_local || (enq_flags & SCX_ENQ_LAST)) {
 		tctx->force_local = false;
 		scx_bpf_dispatch(p, SCX_DSQ_LOCAL, slice_ns, enq_flags);
 		return;
@@ -192,6 +216,19 @@  void BPF_STRUCT_OPS(qmap_enqueue, struct task_struct *p, u64 enq_flags)
 void BPF_STRUCT_OPS(qmap_dequeue, struct task_struct *p, u64 deq_flags)
 {
 	__sync_fetch_and_add(&nr_dequeued, 1);
+	if (deq_flags & SCX_DEQ_CORE_SCHED_EXEC)
+		__sync_fetch_and_add(&nr_core_sched_execed, 1);
+}
+
+static void update_core_sched_head_seq(struct task_struct *p)
+{
+	struct task_ctx *tctx = bpf_task_storage_get(&task_ctx_stor, p, 0, 0);
+	int idx = weight_to_idx(p->scx.weight);
+
+	if (tctx)
+		core_sched_head_seqs[idx] = tctx->core_sched_seq;
+	else
+		scx_bpf_error("task_ctx lookup failed");
 }
 
 void BPF_STRUCT_OPS(qmap_dispatch, s32 cpu, struct task_struct *prev)
@@ -244,6 +281,7 @@  void BPF_STRUCT_OPS(qmap_dispatch, s32 cpu, struct task_struct *prev)
 
 			p = bpf_task_from_pid(pid);
 			if (p) {
+				update_core_sched_head_seq(p);
 				__sync_fetch_and_add(&nr_dispatched, 1);
 				scx_bpf_dispatch(p, SCX_DSQ_GLOBAL, slice_ns, 0);
 				bpf_task_release(p);
@@ -255,6 +293,49 @@  void BPF_STRUCT_OPS(qmap_dispatch, s32 cpu, struct task_struct *prev)
 	}
 }
 
+/*
+ * The distance from the head of the queue scaled by the weight of the queue.
+ * The lower the number, the older the task and the higher the priority.
+ */
+static s64 task_qdist(struct task_struct *p)
+{
+	int idx = weight_to_idx(p->scx.weight);
+	struct task_ctx *tctx;
+	s64 qdist;
+
+	tctx = bpf_task_storage_get(&task_ctx_stor, p, 0, 0);
+	if (!tctx) {
+		scx_bpf_error("task_ctx lookup failed");
+		return 0;
+	}
+
+	qdist = tctx->core_sched_seq - core_sched_head_seqs[idx];
+
+	/*
+	 * As queue index increments, the priority doubles. The queue w/ index 3
+	 * is dispatched twice as frequently as 2. Reflect the difference by
+	 * scaling qdists accordingly. Note that the shift amount needs to be
+	 * flipped depending on the sign to avoid flipping priority direction.
+	 */
+	if (qdist >= 0)
+		return qdist << (4 - idx);
+	else
+		return qdist << idx;
+}
+
+/*
+ * This is called to determine the task ordering when core-sched is picking
+ * tasks to execute on SMT siblings and should encode roughly the same ordering as
+ * the regular scheduling path. Use the priority-scaled distances from the head
+ * of the queues to compare the two tasks which should be consistent with the
+ * dispatch path behavior.
+ */
+bool BPF_STRUCT_OPS(qmap_core_sched_before,
+		    struct task_struct *a, struct task_struct *b)
+{
+	return task_qdist(a) > task_qdist(b);
+}
+
 void BPF_STRUCT_OPS(qmap_cpu_release, s32 cpu, struct scx_cpu_release_args *args)
 {
 	u32 cnt;
@@ -306,10 +387,12 @@  struct sched_ext_ops qmap_ops = {
 	.enqueue		= (void *)qmap_enqueue,
 	.dequeue		= (void *)qmap_dequeue,
 	.dispatch		= (void *)qmap_dispatch,
+	.core_sched_before	= (void *)qmap_core_sched_before,
 	.cpu_release		= (void *)qmap_cpu_release,
 	.prep_enable		= (void *)qmap_prep_enable,
 	.init			= (void *)qmap_init,
 	.exit			= (void *)qmap_exit,
+	.flags			= SCX_OPS_ENQ_LAST,
 	.timeout_ms		= 5000U,
 	.name			= "qmap",
 };
diff --git a/tools/sched_ext/scx_example_qmap.c b/tools/sched_ext/scx_example_qmap.c
index de6f03ccb233..02fabe97ac9f 100644
--- a/tools/sched_ext/scx_example_qmap.c
+++ b/tools/sched_ext/scx_example_qmap.c
@@ -91,9 +91,10 @@  int main(int argc, char **argv)
 		long nr_enqueued = skel->bss->nr_enqueued;
 		long nr_dispatched = skel->bss->nr_dispatched;
 
-		printf("enq=%lu, dsp=%lu, delta=%ld, reenq=%lu, deq=%lu\n",
+		printf("enq=%lu, dsp=%lu, delta=%ld, reenq=%lu, deq=%lu, core=%lu\n",
 		       nr_enqueued, nr_dispatched, nr_enqueued - nr_dispatched,
-		       skel->bss->nr_reenqueued, skel->bss->nr_dequeued);
+		       skel->bss->nr_reenqueued, skel->bss->nr_dequeued,
+		       skel->bss->nr_core_sched_execed);
 		fflush(stdout);
 		sleep(1);
 	}