[v2,tip:,sched/core] sched: Move PLACE_LAG and RUN_TO_PARITY to sysctl

Message ID 20250212053644.14787-1-cpru@amazon.com
State New

Commit Message

Cristian Prundeanu Feb. 12, 2025, 5:36 a.m. UTC
Replacing CFS with the EEVDF scheduler in kernel 6.6 introduced
significant performance degradation in multiple database-oriented
workloads. This degradation manifests in all kernel versions using EEVDF,
across multiple Linux distributions, hardware architectures (x86_64,
aarch64, amd64), and CPU generations.

Testing combinations of available scheduler features showed that the
largest improvement (short of disabling all EEVDF features) came from
disabling both PLACE_LAG and RUN_TO_PARITY.

Moving PLACE_LAG and RUN_TO_PARITY to sysctl will allow users to override
their default values and persist them with established mechanisms.

Link: https://lore.kernel.org/20241017052000.99200-1-cpru@amazon.com
Signed-off-by: Cristian Prundeanu <cpru@amazon.com>
---
v2: use latest sched/core; defer default value change to a follow-up patch

 include/linux/sched/sysctl.h |  8 ++++++++
 kernel/sched/core.c          | 13 +++++++++++++
 kernel/sched/fair.c          |  7 ++++---
 kernel/sched/features.h      | 10 ----------
 kernel/sysctl.c              | 20 ++++++++++++++++++++
 5 files changed, 45 insertions(+), 13 deletions(-)


base-commit: 05dbaf8dd8bf537d4b4eb3115ab42a5fb40ff1f5

Comments

Peter Zijlstra Feb. 12, 2025, 9:17 a.m. UTC | #1
On Tue, Feb 11, 2025 at 11:36:44PM -0600, Cristian Prundeanu wrote:
> Replacing CFS with the EEVDF scheduler in kernel 6.6 introduced
> significant performance degradation in multiple database-oriented
> workloads. This degradation manifests in all kernel versions using EEVDF,
> across multiple Linux distributions, hardware architectures (x86_64,
> aarm64, amd64), and CPU generations.
> 
> Testing combinations of available scheduler features showed that the
> largest improvement (short of disabling all EEVDF features) came from
> disabling both PLACE_LAG and RUN_TO_PARITY.
> 
> Moving PLACE_LAG and RUN_TO_PARITY to sysctl will allow users to override
> their default values and persist them with established mechanisms.

Nope -- you have knobs in debugfs, and that's where they'll stay. Esp.
PLACE_LAG is super dodgy and should not get elevated to anything
remotely official.
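
For readers unfamiliar with the debugfs knobs referred to above: with debugfs mounted at its usual location, scheduler features are toggled by writing a feature name, or the same name prefixed with NO_, to the sched/features file. The following is a minimal sketch of doing that from C. The helper name sched_feat_write() and the SCHED_FEATURES_PATH macro are made up for this illustration, the path assumes the default debugfs mount point, and the write needs root plus a kernel that exposes the file.

```c
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Assumed default location; debugfs may be mounted elsewhere. */
#define SCHED_FEATURES_PATH "/sys/kernel/debug/sched/features"

/*
 * Hypothetical helper: write one feature token (for example "NO_PLACE_LAG"
 * or "PLACE_LAG") to the given features file.
 * Returns 0 on success or a negative errno value on failure.
 */
int sched_feat_write(const char *path, const char *token)
{
	size_t len = strlen(token);
	ssize_t n;
	int fd, err = 0;

	fd = open(path, O_WRONLY);
	if (fd < 0)
		return -errno;

	/* The whole token has to go out in a single write. */
	n = write(fd, token, len);
	if (n < 0)
		err = -errno;
	else if ((size_t)n != len)
		err = -EIO;

	if (close(fd) < 0 && !err)
		err = -errno;

	return err;
}
```

Calling sched_feat_write(SCHED_FEATURES_PATH, "NO_PLACE_LAG") followed by sched_feat_write(SCHED_FEATURES_PATH, "NO_RUN_TO_PARITY") would correspond to the configuration discussed in this thread. The setting does not survive a reboot, which is the persistence gap the patch tries to address.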

Also, FYI, by keeping these emails threaded in the old thread I nearly
missed them again. I'm not sure where this nonsense of keeping
everything in one thread came from, but it is bloody stupid.

Peter Zijlstra Feb. 12, 2025, 9:37 a.m. UTC | #2
On Wed, Feb 12, 2025 at 10:17:11AM +0100, Peter Zijlstra wrote:
> On Tue, Feb 11, 2025 at 11:36:44PM -0600, Cristian Prundeanu wrote:
> > Replacing CFS with the EEVDF scheduler in kernel 6.6 introduced
> > significant performance degradation in multiple database-oriented
> > workloads. This degradation manifests in all kernel versions using EEVDF,
> > across multiple Linux distributions, hardware architectures (x86_64,
> > aarm64, amd64), and CPU generations.
> > 
> > Testing combinations of available scheduler features showed that the
> > largest improvement (short of disabling all EEVDF features) came from
> > disabling both PLACE_LAG and RUN_TO_PARITY.
> > 
> > Moving PLACE_LAG and RUN_TO_PARITY to sysctl will allow users to override
> > their default values and persist them with established mechanisms.
> 
> Nope -- you have knobs in debugfs, and that's where they'll stay. Esp.
> PLACE_LAG is super dodgy and should not get elevated to anything
> remotely official.

Just to clarify, the problem with NO_PLACE_LAG is that by discarding
lag, a task can game the system to 'gain' time. It fundamentally breaks
fairness, and the only reason I implemented it at all was because it is
one of the 'official' placement strategies in the original paper.

But ideally, it should just go, it is not a sound strategy and relies on
tasks behaving themselves.

That is, assuming your tasks behave like the traditional periodic or
sporadic tasks, then it works, but only because the tasks are limited by
the constraints of the task model.

If the tasks are unconstrained / aperiodic, this goes out the window and
the placement strategy becomes unsound. And given we must assume
userspace to be malicious / hostile / unbehaved, the whole thing is just
not good.

It is for this same reason that SCHED_DEADLINE has a constant bandwidth
server on top of the earliest deadline first policy. Pure EDF is only
sound for periodic / sporadic tasks, but we cannot assume userspace will
behave themselves, so we have to put in guard-rails, CBS in this case.
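
The point about discarding lag can be illustrated with a toy model of the placement arithmetic. This is a simplified sketch, not the kernel code: all toy_* names are invented here, tasks are assumed to have equal weight, and the kernel's clamping of the saved lag and its scaling of the lag on insertion are left out. Lag is taken as the average virtual runtime minus the task's own virtual runtime, so a negative value means the task has already received more than its share.

```c
#include <stdint.h>

/* Toy stand-in for a scheduling entity; virtual times in nanoseconds. */
struct toy_entity {
	int64_t vruntime;	/* virtual runtime of this task */
	int64_t vlag;		/* lag saved when the task goes to sleep */
};

/* On sleep: remember how far ahead of or behind the average the task is. */
void toy_dequeue(struct toy_entity *se, int64_t avg_vruntime)
{
	se->vlag = avg_vruntime - se->vruntime;
}

/*
 * On wakeup: place the task relative to the current average.
 * With place_lag set the saved lag is restored; without it the task
 * re-enters with zero lag, so any service it owed is forgotten.
 */
void toy_place(struct toy_entity *se, int64_t avg_vruntime, int place_lag)
{
	int64_t lag = place_lag ? se->vlag : 0;

	se->vruntime = avg_vruntime - lag;
}

/* A task is eligible to run once its lag is not negative. */
int toy_eligible(const struct toy_entity *se, int64_t avg_vruntime)
{
	return se->vruntime <= avg_vruntime;
}
```

In this model, a task that goes to sleep 3 ms ahead of the average wakes up still 3 ms ahead when its lag is preserved, and has to wait until the others catch up. When the lag is discarded it wakes up exactly at the average and is eligible immediately, so a task that overruns and then sleeps briefly keeps the extra time it took.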

Cristian Prundeanu Feb. 12, 2025, 11 p.m. UTC | #3
>>> Moving PLACE_LAG and RUN_TO_PARITY to sysctl will allow users to override
>>> their default values and persist them with established mechanisms.
>>
>> Nope -- you have knobs in debugfs, and that's where they'll stay. Esp.
>> PLACE_LAG is super dodgy and should not get elevated to anything
>> remotely official.
>
> Just to clarify, the problem with NO_PLACE_LAG is that by discarding
> lag, a task can game the system to 'gain' time. It fundamentally breaks
> fairness, and the only reason I implemented it at all was because it is
> one of the 'official' placement strategies in the original paper.

Wouldn't this be an argument in favor of more official positioning of this 
knob? It may be dodgy, but it's currently the best mitigation option, 
until something better comes along.

> If the tasks are unconstrained / aperiodic, this goes out the window and
> the placement strategy becomes unsound. And given we must assume
> userspace to be malicious / hostile / unbehaved, the whole thing is just
> not good.

Userspace in general, absolutely. User intent should be king though, and 
impairing the ability to do precisely what you want with your machine 
feels like it stands against what Linux is best known (and often feared) 
for: configurability. There is _another_ OS which has made a habit of 
dictating how users should want to do something. We're not there of 
course, but it's a strong cautionary tale.

To ask more specifically, isn't a strong point of EEVDF the fact that it 
considers _more_ user needs and use cases than CFS (for instance, task 
lag/latency)?

>> Conversely, setting NO_PLACE_LAG + NO_RUN_TO_PARITY is simply done at boot 
>> time, and does not require further user effort. 
>
> For your workload. It will wreck other workloads.

I'd like to invite you to name one real-life workload that would be 
wrecked by allowing PL and RTP override in sysctl. I can name three that 
are currently impacted (mysql, postgres, and wordpress), with only poor 
means (increased effort, non-standard persistence leading to higher 
maintenance cost, requirement for debugfs) to mitigate the regression.

> Yes, SCHED_BATCH might be more fiddly, but it allows for composition.
> You can run multiple workloads together and they all behave.

Shouldn't we leave that to the user to decide, though? Forcing a new 
default configuration that only works well with multiple workloads can not 
be the right thing for everyone - especially for large scale providers, 
where servers and corresponding images are intended to run one main 
workload. Importantly, things that used to run well and now don't.

> Maybe the right thing here is to get mysql patched; so that it will
> request BATCH itself for the threads that need it.

For mysql in particular, it's a possible avenue (though I still object to 
the idea that individual users and vendors now need to put in additional 
effort to maintain the same performance as before).
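
For reference, a thread requesting SCHED_BATCH for itself could look roughly like the sketch below. It only shows the system call involved and is not taken from mysql or any other project; the function name request_batch_for_current_thread() is hypothetical, and which threads would benefit is exactly the part that needs per-application knowledge.

```c
#define _GNU_SOURCE
#include <sched.h>

/* Fallback in case the libc headers hide the Linux-specific policy. */
#ifndef SCHED_BATCH
#define SCHED_BATCH 3
#endif

/*
 * Hypothetical helper: ask the kernel to schedule the calling thread
 * under SCHED_BATCH.
 * Returns 0 on success, -1 with errno set on failure.
 */
int request_batch_for_current_thread(void)
{
	/* sched_priority must be 0 for the non-realtime policies. */
	struct sched_param param = { .sched_priority = 0 };

	/* On Linux, a pid of 0 addresses the calling thread. */
	return sched_setscheduler(0, SCHED_BATCH, &param);
}
```

Each worker thread would have to make this call itself, or have it made on its behalf with its thread ID, which is why this route requires changes inside the application rather than in system configuration.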

But on a larger picture, this reproducer is only meant as a simplified 
illustration of the performance issues. It is not a single occurrence. 
There are far more complex workloads where tuning at thread level is at 
best impractical, or even downright impossible. Think of managed clusters 
where the load distribution and corresponding task density are not user 
controlled, or JVM workloads where individual threads are not even 
designed to be managed externally, or containers built from external 
dependencies where tuning a service is anything but trivial.

Are we really saying that everyone just needs to swallow the cost of this 
change, or put up with the lower performance level? Even if the Linux 
Kernel doesn't concern itself with business cost, surely at least the time 
burned on this by both commercial and non-commercial projects cannot be 
lost on you.

> Also, FYI, by keeping these emails threaded in the old thread I nearly
> missed them again. I'm not sure where this nonsense of keeping
> everything in one thread came from, but it is bloody stupid.

Thank you. This is a great opportunity for both of us to relate to the 
opposing stance on this patch, and I hope you too will see the parallel:

My reason for threading was well intended. I value your time and wanted to 
avoid you wasting it by having to search for the previous patch or older 
threads on the same topic.

However, I ended up inadvertently creating an issue for your use case. 
It, arguably, doesn't have a noticeable impact on my side, and it could be 
avoided by you, the user, by configuring your email client to always 
highlight messages directly addressed to you; assuming that your email 
client supports it, and you are able and willing to invest the effort to 
do it.
Nevertheless, this doesn't make it right.

I do apologize for the annoyance; it was not my intent to put additional 
burden on you, only to have the same experience or efficiency that you are 
used to having. I did consolidate the two recent threads into this one 
though, because I believe that it's easier to follow by everyone else.

It may be a silly parallel, but please consider that similar frustration 
is happening to many users who now are asked to put effort towards 
bringing performance back to previous levels - if at all possible and 
feasible - and at the same time are denied the right tools to do so.
Please consider that it took years for EEVDF commit messages to go from 
"horribly messes up things" to "isn't perfect yet, but much closer", and 
it may take years still until it's as stable, performant and vetted across 
varied scenarios as CFS was in kernel 6.5.
Please consider that along this journey are countless users and groups who 
would rather not wait for perfection, but have easy means to at least get 
the same performance they were getting before.

-Cristian

Patch

diff --git a/include/linux/sched/sysctl.h b/include/linux/sched/sysctl.h
index 5a64582b086b..a899398bc1c4 100644
--- a/include/linux/sched/sysctl.h
+++ b/include/linux/sched/sysctl.h
@@ -29,4 +29,12 @@  extern int sysctl_numa_balancing_mode;
 #define sysctl_numa_balancing_mode	0
 #endif
 
+#if defined(CONFIG_SCHED_DEBUG) && defined(CONFIG_SYSCTL)
+extern unsigned int sysctl_sched_place_lag_enabled;
+extern unsigned int sysctl_sched_run_to_parity_enabled;
+#else
+#define sysctl_sched_place_lag_enabled 1
+#define sysctl_sched_run_to_parity_enabled 1
+#endif
+
 #endif /* _LINUX_SCHED_SYSCTL_H */
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 9142a0394d46..a379240628ea 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -134,6 +134,19 @@  const_debug unsigned int sysctl_sched_features =
 	0;
 #undef SCHED_FEAT
 
+#if defined(CONFIG_SCHED_DEBUG) && defined(CONFIG_SYSCTL)
+/*
+ * Using the avg_vruntime, do the right thing and preserve lag across
+ * sleep+wake cycles. EEVDF placement strategy #1, #2 if disabled.
+ */
+__read_mostly unsigned int sysctl_sched_place_lag_enabled = 1;
+/*
+ * Inhibit (wakeup) preemption until the current task has either matched the
+ * 0-lag point or until it has exhausted its slice.
+ */
+__read_mostly unsigned int sysctl_sched_run_to_parity_enabled = 1;
+#endif
+
 /*
  * Print a warning if need_resched is set for the given duration (if
  * LATENCY_WARN is enabled).
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 1e78caa21436..c87fd1accd54 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -923,7 +923,8 @@  static struct sched_entity *pick_eevdf(struct cfs_rq *cfs_rq)
 	 * Once selected, run a task until it either becomes non-eligible or
 	 * until it gets a new slice. See the HACK in set_next_entity().
 	 */
-	if (sched_feat(RUN_TO_PARITY) && curr && curr->vlag == curr->deadline)
+	if (sysctl_sched_run_to_parity_enabled && curr &&
+	    curr->vlag == curr->deadline)
 		return curr;
 
 	/* Pick the leftmost entity if it's eligible */
@@ -5199,7 +5200,7 @@  place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 	 *
 	 * EEVDF: placement strategy #1 / #2
 	 */
-	if (sched_feat(PLACE_LAG) && cfs_rq->nr_queued && se->vlag) {
+	if (sysctl_sched_place_lag_enabled && cfs_rq->nr_queued && se->vlag) {
 		struct sched_entity *curr = cfs_rq->curr;
 		unsigned long load;
 
@@ -9327,7 +9328,7 @@  static inline int task_is_ineligible_on_dst_cpu(struct task_struct *p, int dest_
 #else
 	dst_cfs_rq = &cpu_rq(dest_cpu)->cfs;
 #endif
-	if (sched_feat(PLACE_LAG) && dst_cfs_rq->nr_queued &&
+	if (sysctl_sched_place_lag_enabled && dst_cfs_rq->nr_queued &&
 	    !entity_eligible(task_cfs_rq(p), &p->se))
 		return 1;
 
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index 3c12d9f93331..b98ec31ef2c4 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -1,10 +1,5 @@ 
 /* SPDX-License-Identifier: GPL-2.0 */
 
-/*
- * Using the avg_vruntime, do the right thing and preserve lag across
- * sleep+wake cycles. EEVDF placement strategy #1, #2 if disabled.
- */
-SCHED_FEAT(PLACE_LAG, true)
 /*
  * Give new tasks half a slice to ease into the competition.
  */
@@ -13,11 +8,6 @@  SCHED_FEAT(PLACE_DEADLINE_INITIAL, true)
  * Preserve relative virtual deadline on 'migration'.
  */
 SCHED_FEAT(PLACE_REL_DEADLINE, true)
-/*
- * Inhibit (wakeup) preemption until the current task has either matched the
- * 0-lag point or until is has exhausted it's slice.
- */
-SCHED_FEAT(RUN_TO_PARITY, true)
 /*
  * Allow wakeup of tasks with a shorter slice to cancel RUN_TO_PARITY for
  * current.
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 7ae7a4136855..11651d87f6d4 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -2019,6 +2019,26 @@  static struct ctl_table kern_table[] = {
 		.extra2		= SYSCTL_INT_MAX,
 	},
 #endif
+#ifdef CONFIG_SCHED_DEBUG
+	{
+		.procname	= "sched_place_lag_enabled",
+		.data		= &sysctl_sched_place_lag_enabled,
+		.maxlen		= sizeof(unsigned int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec_minmax,
+		.extra1		= SYSCTL_ZERO,
+		.extra2		= SYSCTL_ONE,
+	},
+	{
+		.procname	= "sched_run_to_parity_enabled",
+		.data		= &sysctl_sched_run_to_parity_enabled,
+		.maxlen		= sizeof(unsigned int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec_minmax,
+		.extra1		= SYSCTL_ZERO,
+		.extra2		= SYSCTL_ONE,
+	},
+#endif
 };
 
 static struct ctl_table vm_table[] = {