Message ID | 20231107215742.363031-46-ankur.a.arora@oracle.com (mailing list archive) |
---|---|
State | New |
Series | Make the kernel preemptible |
On Tue, 7 Nov 2023 13:57:31 -0800
Ankur Arora <ankur.a.arora@oracle.com> wrote:

> Note: this commit is badly broken. Only here for discussion.
>
> Configurations with ARCH_NO_PREEMPT support preempt_count, but might
> not be tested well enough under PREEMPTION to support it: they might
> not be demarcating the necessary non-preemptible sections.
>
> One way to handle this is by limiting them to PREEMPT_NONE mode, not
> doing any tick enforcement and limiting preemption to happen only at
> the user boundary.
>
> Unfortunately, this is only a partial solution because eager
> rescheduling could still happen (say, due to RCU wanting an
> expedited quiescent period). And, because we do not trust the
> preempt_count accounting, this would mean preemption inside an
> unmarked critical section.

Is preempt_count accounting really not trustworthy?

That is, if we preempt at preempt_count() going to zero but nowhere
else, would that work? At least it would create some places where a
resched can happen.

What's the broken part of these archs? The assembly? If that's the
case, as long as the generic code has the preempt_count() I would
think that would be trustworthy. I'm also guessing that in_irq() and
friends are still reliable.

-- Steve
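For reference, preempting only when the count drops to zero is already
the shape of the generic preempt_enable() slow path under PREEMPTION.
A simplified sketch of that pattern (illustrative, not the exact macros
from include/linux/preempt.h):

	static inline void sketch_preempt_enable(void)
	{
		barrier();
		/* the only eager preemption point: the count just hit zero */
		if (unlikely(preempt_count_dec_and_test()))
			__preempt_schedule();
	}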
Steven Rostedt <rostedt@goodmis.org> writes:

> On Tue, 7 Nov 2023 13:57:31 -0800
> Ankur Arora <ankur.a.arora@oracle.com> wrote:
>
>> Note: this commit is badly broken. Only here for discussion.
>>
>> Configurations with ARCH_NO_PREEMPT support preempt_count, but might
>> not be tested well enough under PREEMPTION to support it: they might
>> not be demarcating the necessary non-preemptible sections.
>>
>> One way to handle this is by limiting them to PREEMPT_NONE mode, not
>> doing any tick enforcement and limiting preemption to happen only at
>> the user boundary.
>>
>> Unfortunately, this is only a partial solution because eager
>> rescheduling could still happen (say, due to RCU wanting an
>> expedited quiescent period). And, because we do not trust the
>> preempt_count accounting, this would mean preemption inside an
>> unmarked critical section.
>
> Is preempt_count accounting really not trustworthy?

I think the concern was that we might not be marking all the sections
that need to be non-preemptible. Plus, given that these archs have
always been !PREEMPTION, it is unlikely that they would work without
changes.

Basically, I don't think we have a clue whether they work or not,
since they have never been tested.

> That is, if we preempt at preempt_count() going to zero but nowhere
> else, would that work? At least it would create some places where a
> resched can happen.

I'm not sure we can know that. I can imagine places where it should be

	preempt_disable(); spin_lock(); ...; spin_unlock(); preempt_enable();

but the preempt_disable()/_enable() are MIA.

Even so, I think it is a pretty good idea. We could decompose
ARCH_NO_PREEMPT into ARCH_NO_PREEMPT_COUNT and ARCH_NO_PREEMPT_IRQ.
And, as you imply, PREEMPTION (or rather PREEMPT_COUNT) would only
depend on ARCH_NO_PREEMPT_COUNT, not on ARCH_NO_PREEMPT_IRQ.

This change might also make the task of fixing these architectures
simpler, since you would only have to worry about the neighborhood of,
and the paths leading to, preempt_enable():

 void irqentry_exit_cond_resched(void)
 {
-	if (!preempt_count()) {
+	if (!IS_ENABLED(CONFIG_ARCH_NO_PREEMPT_IRQ) && !preempt_count()) {
 		/* Sanity check RCU and thread stack */
 		rcu_irq_exit_check_preempt();

Geert, if you think it might help I can send out a patch.

> What's the broken part of these archs? The assembly?

Not sure anyone knows. But, assuming m68k is representative of the
other three ARCH_NO_PREEMPT architectures (it might be better placed,
because COLDFIRE m68k already supports preemption), the patches Geert
had sent out add:

 - preempt_enable()/_disable() pairs around the cache/tlb flush logic
 - a need-resched check and a call to preempt_schedule_irq() in the
   exception return path

m68k support:
https://lore.kernel.org/all/7858a184cda66e0991fd295c711dfed7e4d1248c.1696603287.git.geert@linux-m68k.org/#t

(The series itself ran into an mmput() bug which might or might not
have anything to do with preemption.)

> If that's the case, as long as the generic code has the
> preempt_count() I would think that would be trustworthy. I'm also
> guessing that in_irq() and friends are still reliable.

True.

--
ankur
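The cache/tlb flush bracketing mentioned above follows a simple
pattern. A sketch of what such a fix looks like; the function name and
flush body here are illustrative, not the actual m68k code:

	static void sketch_flush_cache_range(unsigned long start,
					     unsigned long end)
	{
		/*
		 * The flush walks per-CPU state, so it must not be
		 * preempted or migrated partway through.
		 */
		preempt_disable();
		/* ... push/invalidate cache lines for [start, end) ... */
		preempt_enable();	/* also a resched point at count 0 */
	}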
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 3fa78e8afb7d..bf5df2b866df 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1059,6 +1059,14 @@ void __resched_curr(struct rq *rq, resched_t rs)
 		trace_sched_wake_idle_without_ipi(cpu);
 }
 
+#ifndef CONFIG_ARCH_NO_PREEMPT
+#define force_preempt()		sched_feat(FORCE_PREEMPT)
+#define preempt_priority()	sched_feat(PREEMPT_PRIORITY)
+#else
+#define force_preempt()		false
+#define preempt_priority()	false
+#endif
+
 /*
  * resched_curr - mark rq's current task 'to be rescheduled' eagerly
  * or lazily according to the current policy.
@@ -1084,7 +1092,7 @@ void resched_curr(struct rq *rq, bool above)
 	resched_t rs = RESCHED_lazy;
 	int context;
 
-	if (sched_feat(FORCE_PREEMPT) ||
+	if (force_preempt() ||
 	    (rq->curr->sched_class == &idle_sched_class)) {
 		rs = RESCHED_eager;
 		goto resched;
@@ -1115,7 +1123,7 @@ void resched_curr(struct rq *rq, bool above)
 		goto resched;
 	}
 
-	if (sched_feat(PREEMPT_PRIORITY) && above)
+	if (preempt_priority() && above)
 		rs = RESCHED_eager;
 
 resched:
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index 9bf30732b03f..2575d018b181 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -90,6 +90,12 @@ SCHED_FEAT(LATENCY_WARN, false)
 
 SCHED_FEAT(HZ_BW, true)
 
+#ifndef CONFIG_ARCH_NO_PREEMPT
+/*
+ * Architectures with CONFIG_ARCH_NO_PREEMPT cannot safely preempt.
+ * So even though they enable CONFIG_PREEMPTION, they never have the
+ * option to dynamically switch preemption models.
+ */
 #if defined(CONFIG_PREEMPT)
 SCHED_FEAT(FORCE_PREEMPT, true)
 SCHED_FEAT(PREEMPT_PRIORITY, true)
@@ -100,3 +106,4 @@ SCHED_FEAT(PREEMPT_PRIORITY, true)
 SCHED_FEAT(FORCE_PREEMPT, false)
 SCHED_FEAT(PREEMPT_PRIORITY, false)
 #endif
+#endif
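The net effect of the core.c hunks is that on ARCH_NO_PREEMPT builds
force_preempt() and preempt_priority() become constant false, so the
compiler can drop the eager-resched branches entirely. A minimal
standalone illustration of that compile-time gating pattern (the names
and the runtime flag below are made up for the example, not kernel
code):

	#include <stdbool.h>
	#include <stdio.h>

	/* #define CONFIG_ARCH_NO_PREEMPT */	/* set by Kconfig in a real build */

	#ifndef CONFIG_ARCH_NO_PREEMPT
	static bool sched_feat_force_preempt = true;	/* runtime-tunable */
	#define force_preempt()	(sched_feat_force_preempt)
	#else
	#define force_preempt()	false	/* branch folds away at compile time */
	#endif

	int main(void)
	{
		if (force_preempt())
			puts("eager rescheduling allowed");
		else
			puts("lazy rescheduling only");
		return 0;
	}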
Note: this commit is badly broken. Only here for discussion.

Configurations with ARCH_NO_PREEMPT support preempt_count, but might
not be tested well enough under PREEMPTION to support it: they might
not be demarcating the necessary non-preemptible sections.

One way to handle this is by limiting them to PREEMPT_NONE mode, not
doing any tick enforcement and limiting preemption to happen only at
the user boundary.

Unfortunately, this is only a partial solution because eager
rescheduling could still happen (say, due to RCU wanting an
expedited quiescent period). And, because we do not trust the
preempt_count accounting, this would mean preemption inside an
unmarked critical section.

I suppose we could disable that (say by selecting PREEMPTION=n), but
then the only avenue for driving scheduling between kernel contexts
(when there is no ongoing userspace work) would be explicit calls to
schedule().

Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 kernel/sched/core.c     | 12 ++++++++++--
 kernel/sched/features.h |  7 +++++++
 2 files changed, 17 insertions(+), 2 deletions(-)
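On that last point: with PREEMPTION=n, rescheduling between kernel
contexts happens only at voluntary points such as cond_resched(). A
simplified sketch of that fallback, modelled on the generic
__cond_resched() (details elided):

	static inline int sketch_cond_resched(void)
	{
		/* reschedule only if needed and the preempt count is zero */
		if (should_resched(0)) {
			preempt_schedule_common();
			return 1;
		}
		return 0;
	}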