Message ID | 20231107215742.363031-42-ankur.a.arora@oracle.com
---|---
State | New
Series | Make the kernel preemptible
On Tue, Nov 07, 2023 at 01:57:27PM -0800, Ankur Arora wrote:

> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -1027,13 +1027,13 @@ void wake_up_q(struct wake_q_head *head)
>  }
>
>  /*
> - * resched_curr - mark rq's current task 'to be rescheduled now'.
> + * __resched_curr - mark rq's current task 'to be rescheduled'.
>   *
> - * On UP this means the setting of the need_resched flag, on SMP it
> - * might also involve a cross-CPU call to trigger the scheduler on
> - * the target CPU.
> + * On UP this means the setting of the need_resched flag, on SMP, for
> + * eager resched it might also involve a cross-CPU call to trigger
> + * the scheduler on the target CPU.
>   */
> -void resched_curr(struct rq *rq)
> +void __resched_curr(struct rq *rq, resched_t rs)
>  {
>  	struct task_struct *curr = rq->curr;
>  	int cpu;
> @@ -1046,17 +1046,77 @@ void resched_curr(struct rq *rq)
>  	cpu = cpu_of(rq);
>
>  	if (cpu == smp_processor_id()) {
> -		set_tsk_need_resched(curr, RESCHED_eager);
> -		set_preempt_need_resched();
> +		set_tsk_need_resched(curr, rs);
> +		if (rs == RESCHED_eager)
> +			set_preempt_need_resched();
>  		return;
>  	}
>
> -	if (set_nr_and_not_polling(curr, RESCHED_eager))
> -		smp_send_reschedule(cpu);
> -	else
> +	if (set_nr_and_not_polling(curr, rs)) {
> +		if (rs == RESCHED_eager)
> +			smp_send_reschedule(cpu);

I think you just broke things.

Not all idle threads have POLLING support, in which case you need that
IPI to wake them up, even if it's LAZY.

> +	} else if (rs == RESCHED_eager)
>  		trace_sched_wake_idle_without_ipi(cpu);
>  }
>
> +/*
> + * resched_curr - mark rq's current task 'to be rescheduled' eagerly
> + * or lazily according to the current policy.
> + *
> + * Always schedule eagerly, if:
> + *
> + *   - running under full preemption
> + *
> + *   - idle: when not polling (or if we don't have TIF_POLLING_NRFLAG)
> + *     force TIF_NEED_RESCHED to be set and send a resched IPI.
> + *     (the polling case has already set TIF_NEED_RESCHED via
> + *      set_nr_if_polling()).
> + *
> + *   - in userspace: run to completion semantics are only for kernel tasks
> + *
> + * Otherwise (regardless of priority), run to completion.
> + */
> +void resched_curr(struct rq *rq)
> +{
> +	resched_t rs = RESCHED_lazy;
> +	int context;
> +
> +	if (IS_ENABLED(CONFIG_PREEMPT) ||
> +	    (rq->curr->sched_class == &idle_sched_class)) {
> +		rs = RESCHED_eager;
> +		goto resched;
> +	}
> +
> +	/*
> +	 * We might race with the target CPU while checking its ct_state:
> +	 *
> +	 * 1. The task might have just entered the kernel, but has not yet
> +	 * called user_exit(). We will see stale state (CONTEXT_USER) and
> +	 * send an unnecessary resched-IPI.
> +	 *
> +	 * 2. The user task is through with exit_to_user_mode_loop() but has
> +	 * not yet called user_enter().
> +	 *
> +	 * We'll see the thread's state as CONTEXT_KERNEL and will try to
> +	 * schedule it lazily. There's obviously nothing that will handle
> +	 * this need-resched bit until the thread enters the kernel next.
> +	 *
> +	 * The scheduler will still do tick accounting, but a potentially
> +	 * higher priority task waited to be scheduled for a user tick,
> +	 * instead of execution time in the kernel.
> +	 */
> +	context = ct_state_cpu(cpu_of(rq));
> +	if ((context == CONTEXT_USER) ||
> +	    (context == CONTEXT_GUEST)) {
> +
> +		rs = RESCHED_eager;
> +		goto resched;
> +	}

Like said, this simply cannot be. You must not rely on the remote CPU
being in some state or not. Also, it's racy, you could observe USER and
then it enters KERNEL.

> +
> +resched:
> +	__resched_curr(rq, rs);
> +}
> +
>  void resched_cpu(int cpu)
>  {
>  	struct rq *rq = cpu_rq(cpu);
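[Editor's note] For context on the POLLING point above, this is roughly what
mainline's set_nr_and_not_polling() looks like before the series extends it
with the resched_t argument (a sketch for illustration, not part of this
patch): without TIF_POLLING_NRFLAG it unconditionally returns true and relies
on the caller to send the IPI, which is why suppressing that IPI for the lazy
case can leave a non-polling idle CPU asleep.

        /* Sketch: pre-series set_nr_and_not_polling(), kernel/sched/core.c. */
        #ifdef TIF_POLLING_NRFLAG
        static inline bool set_nr_and_not_polling(struct task_struct *p)
        {
                struct thread_info *ti = task_thread_info(p);

                /* A polling idle CPU notices the flag on its own; no IPI needed. */
                return !(fetch_or(&ti->flags, _TIF_NEED_RESCHED) & _TIF_POLLING_NRFLAG);
        }
        #else
        static inline bool set_nr_and_not_polling(struct task_struct *p)
        {
                /* No polling support: the caller must always send the resched IPI. */
                set_tsk_need_resched(p);
                return true;
        }
        #endif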
Peter Zijlstra <peterz@infradead.org> writes:

> On Tue, Nov 07, 2023 at 01:57:27PM -0800, Ankur Arora wrote:
>
>> --- a/kernel/sched/core.c
>> +++ b/kernel/sched/core.c
>> @@ -1027,13 +1027,13 @@ void wake_up_q(struct wake_q_head *head)
>>  }
>>
>>  /*
>> - * resched_curr - mark rq's current task 'to be rescheduled now'.
>> + * __resched_curr - mark rq's current task 'to be rescheduled'.
>>   *
>> - * On UP this means the setting of the need_resched flag, on SMP it
>> - * might also involve a cross-CPU call to trigger the scheduler on
>> - * the target CPU.
>> + * On UP this means the setting of the need_resched flag, on SMP, for
>> + * eager resched it might also involve a cross-CPU call to trigger
>> + * the scheduler on the target CPU.
>>   */
>> -void resched_curr(struct rq *rq)
>> +void __resched_curr(struct rq *rq, resched_t rs)
>>  {
>>  	struct task_struct *curr = rq->curr;
>>  	int cpu;
>> @@ -1046,17 +1046,77 @@ void resched_curr(struct rq *rq)
>>  	cpu = cpu_of(rq);
>>
>>  	if (cpu == smp_processor_id()) {
>> -		set_tsk_need_resched(curr, RESCHED_eager);
>> -		set_preempt_need_resched();
>> +		set_tsk_need_resched(curr, rs);
>> +		if (rs == RESCHED_eager)
>> +			set_preempt_need_resched();
>>  		return;
>>  	}
>>
>> -	if (set_nr_and_not_polling(curr, RESCHED_eager))
>> -		smp_send_reschedule(cpu);
>> -	else
>> +	if (set_nr_and_not_polling(curr, rs)) {
>> +		if (rs == RESCHED_eager)
>> +			smp_send_reschedule(cpu);
>
> I think you just broke things.
>
> Not all idle threads have POLLING support, in which case you need that
> IPI to wake them up, even if it's LAZY.

Yes, I was concerned about that too. But doesn't this check against the
idle_sched_class in resched_curr() cover that?

>> +	if (IS_ENABLED(CONFIG_PREEMPT) ||
>> +	    (rq->curr->sched_class == &idle_sched_class)) {
>> +		rs = RESCHED_eager;
>> +		goto resched;
>> +	}

>> +	} else if (rs == RESCHED_eager)
>>  		trace_sched_wake_idle_without_ipi(cpu);
>>  }
>
>
>
>>
>> +/*
>> + * resched_curr - mark rq's current task 'to be rescheduled' eagerly
>> + * or lazily according to the current policy.
>> + *
>> + * Always schedule eagerly, if:
>> + *
>> + *   - running under full preemption
>> + *
>> + *   - idle: when not polling (or if we don't have TIF_POLLING_NRFLAG)
>> + *     force TIF_NEED_RESCHED to be set and send a resched IPI.
>> + *     (the polling case has already set TIF_NEED_RESCHED via
>> + *      set_nr_if_polling()).
>> + *
>> + *   - in userspace: run to completion semantics are only for kernel tasks
>> + *
>> + * Otherwise (regardless of priority), run to completion.
>> + */
>> +void resched_curr(struct rq *rq)
>> +{
>> +	resched_t rs = RESCHED_lazy;
>> +	int context;
>> +
>> +	if (IS_ENABLED(CONFIG_PREEMPT) ||
>> +	    (rq->curr->sched_class == &idle_sched_class)) {
>> +		rs = RESCHED_eager;
>> +		goto resched;
>> +	}
>> +
>> +	/*
>> +	 * We might race with the target CPU while checking its ct_state:
>> +	 *
>> +	 * 1. The task might have just entered the kernel, but has not yet
>> +	 * called user_exit(). We will see stale state (CONTEXT_USER) and
>> +	 * send an unnecessary resched-IPI.
>> +	 *
>> +	 * 2. The user task is through with exit_to_user_mode_loop() but has
>> +	 * not yet called user_enter().
>> +	 *
>> +	 * We'll see the thread's state as CONTEXT_KERNEL and will try to
>> +	 * schedule it lazily. There's obviously nothing that will handle
>> +	 * this need-resched bit until the thread enters the kernel next.
>> +	 *
>> +	 * The scheduler will still do tick accounting, but a potentially
>> +	 * higher priority task waited to be scheduled for a user tick,
>> +	 * instead of execution time in the kernel.
>> +	 */
>> +	context = ct_state_cpu(cpu_of(rq));
>> +	if ((context == CONTEXT_USER) ||
>> +	    (context == CONTEXT_GUEST)) {
>> +
>> +		rs = RESCHED_eager;
>> +		goto resched;
>> +	}
>
> Like said, this simply cannot be. You must not rely on the remote CPU
> being in some state or not. Also, it's racy, you could observe USER and
> then it enters KERNEL.

Or worse. We might observe KERNEL and it enters USER.

I think we would be fine if we observe USER: it would be upgraded
to RESCHED_eager and send an unnecessary IPI.

But if we observe KERNEL and it enters USER, then we will have
set the need-resched-lazy bit which the thread might not see
(it might have left exit_to_user_mode_loop()) until the next
entry to the kernel.

But, yes, I would like to avoid the ct_state as well. But
need-resched-lazy only makes sense when the task on the runqueue
is executing in the kernel...

--
ankur
On Wed, Nov 08, 2023 at 02:26:37AM -0800, Ankur Arora wrote:
>
> Peter Zijlstra <peterz@infradead.org> writes:
>
> > On Tue, Nov 07, 2023 at 01:57:27PM -0800, Ankur Arora wrote:
> >
> >> --- a/kernel/sched/core.c
> >> +++ b/kernel/sched/core.c
> >> @@ -1027,13 +1027,13 @@ void wake_up_q(struct wake_q_head *head)
> >>  }
> >>
> >>  /*
> >> - * resched_curr - mark rq's current task 'to be rescheduled now'.
> >> + * __resched_curr - mark rq's current task 'to be rescheduled'.
> >>   *
> >> - * On UP this means the setting of the need_resched flag, on SMP it
> >> - * might also involve a cross-CPU call to trigger the scheduler on
> >> - * the target CPU.
> >> + * On UP this means the setting of the need_resched flag, on SMP, for
> >> + * eager resched it might also involve a cross-CPU call to trigger
> >> + * the scheduler on the target CPU.
> >>   */
> >> -void resched_curr(struct rq *rq)
> >> +void __resched_curr(struct rq *rq, resched_t rs)
> >>  {
> >>  	struct task_struct *curr = rq->curr;
> >>  	int cpu;
> >> @@ -1046,17 +1046,77 @@ void resched_curr(struct rq *rq)
> >>  	cpu = cpu_of(rq);
> >>
> >>  	if (cpu == smp_processor_id()) {
> >> -		set_tsk_need_resched(curr, RESCHED_eager);
> >> -		set_preempt_need_resched();
> >> +		set_tsk_need_resched(curr, rs);
> >> +		if (rs == RESCHED_eager)
> >> +			set_preempt_need_resched();
> >>  		return;
> >>  	}
> >>
> >> -	if (set_nr_and_not_polling(curr, RESCHED_eager))
> >> -		smp_send_reschedule(cpu);
> >> -	else
> >> +	if (set_nr_and_not_polling(curr, rs)) {
> >> +		if (rs == RESCHED_eager)
> >> +			smp_send_reschedule(cpu);
> >
> > I think you just broke things.
> >
> > Not all idle threads have POLLING support, in which case you need that
> > IPI to wake them up, even if it's LAZY.
>
> Yes, I was concerned about that too. But doesn't this check against the
> idle_sched_class in resched_curr() cover that?

Ah, that's what that was. Hmm, maybe.

I mean, we have idle-injection too, those run as FIFO, but as such,
they can only get preempted by RT/DL, and those will already force
preempt anyway.

The way you've split and structured the code makes it very hard to
follow. Something like:

	if (set_nr_and_not_polling(curr, rs) &&
	    (rs == RESCHED_force || is_idle_task(curr)))
		smp_send_reschedule();

is *far* clearer, no?
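[Editor's note] A minimal sketch of how the suggested condition could fold
back into the patch's __resched_curr(); RESCHED_force here is the stand-in
name used in the one-liner above for what the patch calls RESCHED_eager, and
the cpu argument elided above is restored. Illustrative only, under those
naming assumptions.

        static void __resched_curr(struct rq *rq, resched_t rs)
        {
                struct task_struct *curr = rq->curr;
                int cpu = cpu_of(rq);

                if (cpu == smp_processor_id()) {
                        set_tsk_need_resched(curr, rs);
                        if (rs == RESCHED_force)
                                set_preempt_need_resched();
                        return;
                }

                /*
                 * Send the IPI for an eager resched, or when the target is a
                 * non-polling idle thread that will not notice the flag on
                 * its own (the lazy case Peter points out above).
                 */
                if (set_nr_and_not_polling(curr, rs) &&
                    (rs == RESCHED_force || is_idle_task(curr)))
                        smp_send_reschedule(cpu);
                else if (rs == RESCHED_force)
                        trace_sched_wake_idle_without_ipi(cpu);
        }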
Ankur Arora <ankur.a.arora@oracle.com> writes:

> Peter Zijlstra <peterz@infradead.org> writes:
>
>> On Tue, Nov 07, 2023 at 01:57:27PM -0800, Ankur Arora wrote:
>>
>>> +	/*
>>> +	 * We might race with the target CPU while checking its ct_state:
>>> +	 *
>>> +	 * 1. The task might have just entered the kernel, but has not yet
>>> +	 * called user_exit(). We will see stale state (CONTEXT_USER) and
>>> +	 * send an unnecessary resched-IPI.
>>> +	 *
>>> +	 * 2. The user task is through with exit_to_user_mode_loop() but has
>>> +	 * not yet called user_enter().
>>> +	 *
>>> +	 * We'll see the thread's state as CONTEXT_KERNEL and will try to
>>> +	 * schedule it lazily. There's obviously nothing that will handle
>>> +	 * this need-resched bit until the thread enters the kernel next.
>>> +	 *
>>> +	 * The scheduler will still do tick accounting, but a potentially
>>> +	 * higher priority task waited to be scheduled for a user tick,
>>> +	 * instead of execution time in the kernel.
>>> +	 */
>>> +	context = ct_state_cpu(cpu_of(rq));
>>> +	if ((context == CONTEXT_USER) ||
>>> +	    (context == CONTEXT_GUEST)) {
>>> +
>>> +		rs = RESCHED_eager;
>>> +		goto resched;
>>> +	}
>>
>> Like said, this simply cannot be. You must not rely on the remote CPU
>> being in some state or not. Also, it's racy, you could observe USER and
>> then it enters KERNEL.
>
> Or worse. We might observe KERNEL and it enters USER.
>
> I think we would be fine if we observe USER: it would be upgraded
> to RESCHED_eager and send an unnecessary IPI.
>
> But if we observe KERNEL and it enters USER, then we will have
> set the need-resched-lazy bit which the thread might not see
> (it might have left exit_to_user_mode_loop()) until the next
> entry to the kernel.
>
> But, yes, I would like to avoid the ct_state as well. But
> need-resched-lazy only makes sense when the task on the runqueue
> is executing in the kernel...

So, I discussed this with Thomas offlist, and he pointed out that I'm
overengineering this.

If we decide to wake up a remote rq lazily (with !sched_feat(TTWU_QUEUE)),
and the target is running in user space, then the resched happens when
the process next enters kernel mode.

That's somewhat similar to how, in this preemption model, we let a task
run for up to one extra tick while in kernel mode. So I'll drop this and
allow the same behaviour in userspace instead of solving it in
unnecessarily complicated ways.

--
ankur
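[Editor's note] A sketch of the behaviour being relied on here, assumed from
the rest of the series rather than shown in this patch: the exit-to-user path
folds both need-resched flavours, so a lazily set bit on a task that was in
userspace is acted on no later than its next return from the kernel. The
function name is hypothetical; only the flag check illustrates the point.

        /*
         * Sketch only: on return to userspace the lazy bit is treated
         * like the eager one, so a lazily marked task reschedules at
         * its next kernel exit at the latest.
         */
        static void resched_on_exit_to_user_sketch(unsigned long ti_work)
        {
                if (ti_work & (_TIF_NEED_RESCHED | _TIF_NEED_RESCHED_LAZY))
                        schedule();
        }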
Peter Zijlstra <peterz@infradead.org> writes:

> On Wed, Nov 08, 2023 at 02:26:37AM -0800, Ankur Arora wrote:
>>
>> Peter Zijlstra <peterz@infradead.org> writes:
>>
>> > On Tue, Nov 07, 2023 at 01:57:27PM -0800, Ankur Arora wrote:
>> >
>> >> --- a/kernel/sched/core.c
>> >> +++ b/kernel/sched/core.c
>> >> @@ -1027,13 +1027,13 @@ void wake_up_q(struct wake_q_head *head)
>> >>  }
>> >>
>> >>  /*
>> >> - * resched_curr - mark rq's current task 'to be rescheduled now'.
>> >> + * __resched_curr - mark rq's current task 'to be rescheduled'.
>> >>   *
>> >> - * On UP this means the setting of the need_resched flag, on SMP it
>> >> - * might also involve a cross-CPU call to trigger the scheduler on
>> >> - * the target CPU.
>> >> + * On UP this means the setting of the need_resched flag, on SMP, for
>> >> + * eager resched it might also involve a cross-CPU call to trigger
>> >> + * the scheduler on the target CPU.
>> >>   */
>> >> -void resched_curr(struct rq *rq)
>> >> +void __resched_curr(struct rq *rq, resched_t rs)
>> >>  {
>> >>  	struct task_struct *curr = rq->curr;
>> >>  	int cpu;
>> >> @@ -1046,17 +1046,77 @@ void resched_curr(struct rq *rq)
>> >>  	cpu = cpu_of(rq);
>> >>
>> >>  	if (cpu == smp_processor_id()) {
>> >> -		set_tsk_need_resched(curr, RESCHED_eager);
>> >> -		set_preempt_need_resched();
>> >> +		set_tsk_need_resched(curr, rs);
>> >> +		if (rs == RESCHED_eager)
>> >> +			set_preempt_need_resched();
>> >>  		return;
>> >>  	}
>> >>
>> >> -	if (set_nr_and_not_polling(curr, RESCHED_eager))
>> >> -		smp_send_reschedule(cpu);
>> >> -	else
>> >> +	if (set_nr_and_not_polling(curr, rs)) {
>> >> +		if (rs == RESCHED_eager)
>> >> +			smp_send_reschedule(cpu);
>> >
>> > I think you just broke things.
>> >
>> > Not all idle threads have POLLING support, in which case you need that
>> > IPI to wake them up, even if it's LAZY.
>>
>> Yes, I was concerned about that too. But doesn't this check against the
>> idle_sched_class in resched_curr() cover that?
>
> Ah, that's what that was. Hmm, maybe.
>
> I mean, we have idle-injection too, those run as FIFO, but as such,
> they can only get preempted by RT/DL, and those will already force
> preempt anyway.

Aah yes, of course those are FIFO. Thanks, I missed that.

> The way you've split and structured the code makes it very hard to
> follow. Something like:
>
>	if (set_nr_and_not_polling(curr, rs) &&
>	    (rs == RESCHED_force || is_idle_task(curr)))
>		smp_send_reschedule();
>
> is *far* clearer, no?

Nods. I was trying to separate out where we decide whether to do things
eagerly or lazily. But this is way clearer.

--
ankur
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 01df5ac2982c..f65bf3ce0e9d 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1027,13 +1027,13 @@ void wake_up_q(struct wake_q_head *head)
 }
 
 /*
- * resched_curr - mark rq's current task 'to be rescheduled now'.
+ * __resched_curr - mark rq's current task 'to be rescheduled'.
  *
- * On UP this means the setting of the need_resched flag, on SMP it
- * might also involve a cross-CPU call to trigger the scheduler on
- * the target CPU.
+ * On UP this means the setting of the need_resched flag, on SMP, for
+ * eager resched it might also involve a cross-CPU call to trigger
+ * the scheduler on the target CPU.
  */
-void resched_curr(struct rq *rq)
+void __resched_curr(struct rq *rq, resched_t rs)
 {
 	struct task_struct *curr = rq->curr;
 	int cpu;
@@ -1046,17 +1046,77 @@ void resched_curr(struct rq *rq)
 	cpu = cpu_of(rq);
 
 	if (cpu == smp_processor_id()) {
-		set_tsk_need_resched(curr, RESCHED_eager);
-		set_preempt_need_resched();
+		set_tsk_need_resched(curr, rs);
+		if (rs == RESCHED_eager)
+			set_preempt_need_resched();
 		return;
 	}
 
-	if (set_nr_and_not_polling(curr, RESCHED_eager))
-		smp_send_reschedule(cpu);
-	else
+	if (set_nr_and_not_polling(curr, rs)) {
+		if (rs == RESCHED_eager)
+			smp_send_reschedule(cpu);
+	} else if (rs == RESCHED_eager)
 		trace_sched_wake_idle_without_ipi(cpu);
 }
 
+/*
+ * resched_curr - mark rq's current task 'to be rescheduled' eagerly
+ * or lazily according to the current policy.
+ *
+ * Always schedule eagerly, if:
+ *
+ *   - running under full preemption
+ *
+ *   - idle: when not polling (or if we don't have TIF_POLLING_NRFLAG)
+ *     force TIF_NEED_RESCHED to be set and send a resched IPI.
+ *     (the polling case has already set TIF_NEED_RESCHED via
+ *      set_nr_if_polling()).
+ *
+ *   - in userspace: run to completion semantics are only for kernel tasks
+ *
+ * Otherwise (regardless of priority), run to completion.
+ */
+void resched_curr(struct rq *rq)
+{
+	resched_t rs = RESCHED_lazy;
+	int context;
+
+	if (IS_ENABLED(CONFIG_PREEMPT) ||
+	    (rq->curr->sched_class == &idle_sched_class)) {
+		rs = RESCHED_eager;
+		goto resched;
+	}
+
+	/*
+	 * We might race with the target CPU while checking its ct_state:
+	 *
+	 * 1. The task might have just entered the kernel, but has not yet
+	 * called user_exit(). We will see stale state (CONTEXT_USER) and
+	 * send an unnecessary resched-IPI.
+	 *
+	 * 2. The user task is through with exit_to_user_mode_loop() but has
+	 * not yet called user_enter().
+	 *
+	 * We'll see the thread's state as CONTEXT_KERNEL and will try to
+	 * schedule it lazily. There's obviously nothing that will handle
+	 * this need-resched bit until the thread enters the kernel next.
+	 *
+	 * The scheduler will still do tick accounting, but a potentially
+	 * higher priority task waited to be scheduled for a user tick,
+	 * instead of execution time in the kernel.
+	 */
+	context = ct_state_cpu(cpu_of(rq));
+	if ((context == CONTEXT_USER) ||
+	    (context == CONTEXT_GUEST)) {
+
+		rs = RESCHED_eager;
+		goto resched;
+	}
+
+resched:
+	__resched_curr(rq, rs);
+}
+
 void resched_cpu(int cpu)
 {
 	struct rq *rq = cpu_rq(cpu);
One of the last ports of call before rescheduling is triggered is
resched_curr().

Its task is to set TIF_NEED_RESCHED and, if running locally, either fold
it into the preempt_count, or send a resched-IPI so the target CPU folds
it in.

To handle TIF_NEED_RESCHED_LAZY -- since the reschedule is not imminent
-- it only needs to set the appropriate bit.

Move all of the underlying mechanism into __resched_curr(), and define
resched_curr(), which handles the policy on when we want to set which
need-resched variant.

For now the approach is to run to completion (TIF_NEED_RESCHED_LAZY),
with the following exceptions where we always want to reschedule at the
next preemptible point (TIF_NEED_RESCHED):

 - idle: if we are polling in idle, then set_nr_if_polling() will do
   the right thing. When not polling, we force TIF_NEED_RESCHED and
   send a resched-IPI if needed.

 - the target CPU is in userspace: run to completion semantics are
   only for kernel tasks.

 - running under the full preemption model.

Originally-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 kernel/sched/core.c | 80 +++++++++++++++++++++++++++++++++++++++------
 1 file changed, 70 insertions(+), 10 deletions(-)
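[Editor's note] For reference, a sketch of the definitions this patch builds
on. The names follow the usage in the diff above, but the actual definitions
come from earlier patches in the series; the values and the helper shown here
are illustrative assumptions, not quoted from the series.

        /* Sketch: the two need-resched variants assumed by resched_curr(). */
        typedef enum {
                RESCHED_lazy = 0,       /* maps to TIF_NEED_RESCHED_LAZY */
                RESCHED_eager,          /* maps to TIF_NEED_RESCHED */
        } resched_t;

        /* Illustrative helper mapping a resched_t to its thread-info bit. */
        static inline int tif_resched(resched_t rs)
        {
                return rs == RESCHED_eager ? TIF_NEED_RESCHED
                                           : TIF_NEED_RESCHED_LAZY;
        }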