Message ID:    1480368809-23685-2-git-send-email-jacob.jun.pan@linux.intel.com (mailing list archive)
State:         Superseded, archived
Delegated to:  Rafael Wysocki
On Mon, Nov 28, 2016 at 10:33 PM, Jacob Pan <jacob.jun.pan@linux.intel.com> wrote:
> From: Peter Zijlstra <peterz@infradead.org>
>
> Idle injection drivers such as Intel powerclamp and ACPI PAD drivers use
> realtime tasks to take control of CPU then inject idle. There are two
> issues with this approach:
>
> 1. Low efficiency: injected idle task is treated as busy so sched ticks
>    do not stop during injected idle period, the result of these
>    unwanted wakeups can be ~20% loss in power savings.
>
> 2. Idle accounting: injected idle time is presented to user as busy.
>
> This patch addresses the issues by introducing a new PF_IDLE flag which
> allows any given task to be treated as idle task while the flag is set.
> Therefore, idle injection tasks can run through the normal flow of NOHZ
> idle enter/exit to get the correct accounting as well as tick stop when
> possible.
>
> The implication is that idle task is then no longer limited to PID == 0.
>
> Acked-by: Ingo Molnar <mingo@kernel.org>
> Signed-off-by: Peter Zijlstra <peterz@infradead.org>
> Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>

Have you made any changes to Peter's original patch, or is this just a
resend of that?

Thanks,
Rafael
--
To unsubscribe from this list: send the line "unsubscribe linux-pm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
On Mon, 28 Nov 2016 22:39:07 +0100
"Rafael J. Wysocki" <rafael@kernel.org> wrote:

> On Mon, Nov 28, 2016 at 10:33 PM, Jacob Pan
> <jacob.jun.pan@linux.intel.com> wrote:
> > From: Peter Zijlstra <peterz@infradead.org>
> > [...]
> > Acked-by: Ingo Molnar <mingo@kernel.org>
> > Signed-off-by: Peter Zijlstra <peterz@infradead.org>
> > Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
>
> Have you made any changes to Peter's original patch, or is this
> just a resend of that?

No changes made to Peter's patch. I just rebased to v4.9-rc7 and tested
it.
On Mon, Nov 28, 2016 at 10:46 PM, Jacob Pan <jacob.jun.pan@linux.intel.com> wrote:
> On Mon, 28 Nov 2016 22:39:07 +0100
> "Rafael J. Wysocki" <rafael@kernel.org> wrote:
>
>> On Mon, Nov 28, 2016 at 10:33 PM, Jacob Pan
>> <jacob.jun.pan@linux.intel.com> wrote:
>> > From: Peter Zijlstra <peterz@infradead.org>
>> > [...]
>>
>> Have you made any changes to Peter's original patch, or is this
>> just a resend of that?
>
> No changes made to Peter's patch. I just rebased to v4.9-rc7 and tested
> it.

OK, thanks!
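The core of the patch is the one-line change to is_idle_task(): idleness becomes a per-task flag rather than a property of PID 0, so an injection task that sets PF_IDLE is accounted as idle. A minimal userspace sketch of that flag-test idiom (the struct below is an illustrative stand-in for task_struct, not kernel code; only the PF_IDLE value is taken from the patch):

```c
#include <stdbool.h>

/* Flag value as defined in the patch's include/linux/sched.h hunk. */
#define PF_IDLE 0x00000002

/* Illustrative stand-in for struct task_struct; only flags matter here. */
struct task {
	unsigned int flags;
};

/* Mirrors the patched is_idle_task(): a flag test instead of
 * "p->pid == 0", so any task that sets PF_IDLE (as play_idle() does)
 * counts as idle while the flag is set. */
static bool is_idle_task(const struct task *p)
{
	return !!(p->flags & PF_IDLE);
}
```

This is also why copy_process() must mask PF_IDLE out of the child's flags: a fork from a PF_IDLE task must not inherit idleness.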
Hi Peter,

[auto build test ERROR on tip/sched/core]
[also build test ERROR on v4.9-rc7 next-20161128]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:      https://github.com/0day-ci/linux/commits/Jacob-Pan/Stop-sched-tick-in-idle-injection-task/20161129-062641
config:   i386-randconfig-x007-201648 (attached as .config)
compiler: gcc-6 (Debian 6.2.0-3) 6.2.0 20160901
reproduce:
        # save the attached .config to linux build tree
        make ARCH=i386

Note: the linux-review/Jacob-Pan/Stop-sched-tick-in-idle-injection-task/20161129-062641
HEAD 84bf4b8c5cda5c3d80df1d46e4e4f7e3f5ad31a6 builds fine.
It only hurts bisectibility.

All errors (new ones prefixed by >>):

   kernel/sched/idle.c: In function 'play_idle':
>> kernel/sched/idle.c:304:2: error: implicit declaration of function 'cpuidle_use_deepest_state' [-Werror=implicit-function-declaration]
     cpuidle_use_deepest_state(true);
     ^~~~~~~~~~~~~~~~~~~~~~~~~
   cc1: some warnings being treated as errors

vim +/cpuidle_use_deepest_state +304 kernel/sched/idle.c

   298		WARN_ON_ONCE(!(current->flags & PF_NO_SETAFFINITY));
   299		WARN_ON_ONCE(!duration_ms);
   300
   301		rcu_sleep_check();
   302		preempt_disable();
   303		current->flags |= PF_IDLE;
 > 304		cpuidle_use_deepest_state(true);
   305
   306		it.done = 0;
   307		hrtimer_init_on_stack(&it.timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);

---
0-DAY kernel test infrastructure            Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all   Intel Corporation
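The failing randconfig has cpuidle disabled, so no declaration of cpuidle_use_deepest_state() is visible where play_idle() calls it. The usual kernel fix for this class of build error is a config-guarded stub in the header, which is what reordering the series resolves here. A compilable sketch of the pattern (the guard name and function match the report; the stub body is an assumption about what the companion patch provides):

```c
#include <stdbool.h>

/* Config-guarded declaration pattern: a real prototype when the feature
 * is built in, an empty inline stub otherwise, so callers such as
 * play_idle() compile in every configuration. */
#ifdef CONFIG_CPU_IDLE
extern void cpuidle_use_deepest_state(bool enable);
#else
static inline void cpuidle_use_deepest_state(bool enable)
{
	(void)enable;	/* nothing to select without cpuidle */
}
#endif
```

With CONFIG_CPU_IDLE unset (as in the i386 randconfig above), callers bind to the empty inline stub and the implicit-declaration error disappears.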
On Mon, Nov 28, 2016 at 10:33 PM, Jacob Pan <jacob.jun.pan@linux.intel.com> wrote:
> From: Peter Zijlstra <peterz@infradead.org>
>
> Idle injection drivers such as Intel powerclamp and ACPI PAD drivers use
> realtime tasks to take control of CPU then inject idle. There are two
> issues with this approach:
>
> [...]
>
> +	it.done = 0;
> +	hrtimer_init_on_stack(&it.timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
> +	it.timer.function = idle_inject_timer_fn;
> +	hrtimer_start(&it.timer, ms_to_ktime(duration_ms), HRTIMER_MODE_REL_PINNED);
> +
> +	while (!READ_ONCE(it.done))
> +		do_idle();
> +
> +	cpuidle_use_deepest_state(false);

This actually depends on your [2/2], doesn't it?

Thanks,
Rafael
On Tue, 29 Nov 2016 00:22:23 +0100
"Rafael J. Wysocki" <rafael@kernel.org> wrote:

> > +	while (!READ_ONCE(it.done))
> > +		do_idle();
> > +
> > +	cpuidle_use_deepest_state(false);
>
> This actually depends on your [2/2], doesn't it?

right, I shall put that after 2/2.
On Mon, 28 Nov 2016 16:33:07 -0800
Jacob Pan <jacob.jun.pan@linux.intel.com> wrote:

> > > +	cpuidle_use_deepest_state(false);
> >
> > This actually depends on your [2/2], doesn't it?
>
> right, I shall put that after 2/2.

I mean reverse the order of the two patches. Should I resend the series,
or can you reverse the order?

Thanks,
Jacob
On Tue, Nov 29, 2016 at 1:39 AM, Jacob Pan <jacob.jun.pan@linux.intel.com> wrote:
> On Mon, 28 Nov 2016 16:33:07 -0800
> Jacob Pan <jacob.jun.pan@linux.intel.com> wrote:
>
>> > > +	cpuidle_use_deepest_state(false);
>> >
>> > This actually depends on your [2/2], doesn't it?
>>
>> right, I shall put that after 2/2.
>
> I mean reverse the order of the two patches. Should I resend the series
> or you can reverse the order?

Well, you need to fix the [1/2] (build issue), so please resend.

Thanks,
Rafael
diff --git a/include/linux/cpu.h b/include/linux/cpu.h
index b886dc1..ac0efae 100644
--- a/include/linux/cpu.h
+++ b/include/linux/cpu.h
@@ -245,6 +245,8 @@ static inline void enable_nonboot_cpus(void) {}
 int cpu_report_state(int cpu);
 int cpu_check_up_prepare(int cpu);
 void cpu_set_state_online(int cpu);
+void play_idle(unsigned long duration_ms);
+
 #ifdef CONFIG_HOTPLUG_CPU
 bool cpu_wait_death(unsigned int cpu, int seconds);
 bool cpu_report_death(void);
diff --git a/include/linux/sched.h b/include/linux/sched.h
index e9c009d..a3d338e 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2254,6 +2254,7 @@ static inline cputime_t task_gtime(struct task_struct *t)
 /*
  * Per process flags
  */
+#define PF_IDLE		0x00000002	/* I am an IDLE thread */
 #define PF_EXITING	0x00000004	/* getting shut down */
 #define PF_EXITPIDONE	0x00000008	/* pi exit done on shut down */
 #define PF_VCPU		0x00000010	/* I'm a virtual CPU */
@@ -2611,7 +2612,7 @@ extern int sched_setattr(struct task_struct *,
  */
 static inline bool is_idle_task(const struct task_struct *p)
 {
-	return p->pid == 0;
+	return !!(p->flags & PF_IDLE);
 }
 extern struct task_struct *curr_task(int cpu);
 extern void ia64_set_curr_task(int cpu, struct task_struct *p);
diff --git a/kernel/fork.c b/kernel/fork.c
index 997ac1d..a8eb821 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1540,7 +1540,7 @@ static __latent_entropy struct task_struct *copy_process(
 		goto bad_fork_cleanup_count;
 
 	delayacct_tsk_init(p);	/* Must remain after dup_task_struct() */
-	p->flags &= ~(PF_SUPERPRIV | PF_WQ_WORKER);
+	p->flags &= ~(PF_SUPERPRIV | PF_WQ_WORKER | PF_IDLE);
 	p->flags |= PF_FORKNOEXEC;
 	INIT_LIST_HEAD(&p->children);
 	INIT_LIST_HEAD(&p->sibling);
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 154fd68..c95fbcd 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5279,6 +5279,7 @@ void init_idle(struct task_struct *idle, int cpu)
 	__sched_fork(0, idle);
 	idle->state = TASK_RUNNING;
 	idle->se.exec_start = sched_clock();
+	idle->flags |= PF_IDLE;
 
 	kasan_unpoison_task_stack(idle);
 
diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
index 1d8718d..f01d494 100644
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -202,76 +202,65 @@ static void cpuidle_idle_call(void)
  *
  * Called with polling cleared.
  */
-static void cpu_idle_loop(void)
+static void do_idle(void)
 {
-	int cpu = smp_processor_id();
+	/*
+	 * If the arch has a polling bit, we maintain an invariant:
+	 *
+	 * Our polling bit is clear if we're not scheduled (i.e. if rq->curr !=
+	 * rq->idle). This means that, if rq->idle has the polling bit set,
+	 * then setting need_resched is guaranteed to cause the CPU to
+	 * reschedule.
+	 */
 
-	while (1) {
-		/*
-		 * If the arch has a polling bit, we maintain an invariant:
-		 *
-		 * Our polling bit is clear if we're not scheduled (i.e. if
-		 * rq->curr != rq->idle). This means that, if rq->idle has
-		 * the polling bit set, then setting need_resched is
-		 * guaranteed to cause the cpu to reschedule.
-		 */
+	__current_set_polling();
+	tick_nohz_idle_enter();
+
+	while (!need_resched()) {
+		check_pgt_cache();
+		rmb();
 
-		__current_set_polling();
-		quiet_vmstat();
-		tick_nohz_idle_enter();
-
-		while (!need_resched()) {
-			check_pgt_cache();
-			rmb();
-
-			if (cpu_is_offline(cpu)) {
-				cpuhp_report_idle_dead();
-				arch_cpu_idle_dead();
-			}
-
-			local_irq_disable();
-			arch_cpu_idle_enter();
-
-			/*
-			 * In poll mode we reenable interrupts and spin.
-			 *
-			 * Also if we detected in the wakeup from idle
-			 * path that the tick broadcast device expired
-			 * for us, we don't want to go deep idle as we
-			 * know that the IPI is going to arrive right
-			 * away
-			 */
-			if (cpu_idle_force_poll || tick_check_broadcast_expired())
-				cpu_idle_poll();
-			else
-				cpuidle_idle_call();
-
-			arch_cpu_idle_exit();
+		if (cpu_is_offline(smp_processor_id())) {
+			cpuhp_report_idle_dead();
+			arch_cpu_idle_dead();
 		}
 
-		/*
-		 * Since we fell out of the loop above, we know
-		 * TIF_NEED_RESCHED must be set, propagate it into
-		 * PREEMPT_NEED_RESCHED.
-		 *
-		 * This is required because for polling idle loops we will
-		 * not have had an IPI to fold the state for us.
-		 */
-		preempt_set_need_resched();
-		tick_nohz_idle_exit();
-		__current_clr_polling();
+		local_irq_disable();
+		arch_cpu_idle_enter();
 
 		/*
-		 * We promise to call sched_ttwu_pending and reschedule
-		 * if need_resched is set while polling is set. That
-		 * means that clearing polling needs to be visible
-		 * before doing these things.
+		 * In poll mode we reenable interrupts and spin. Also if we
+		 * detected in the wakeup from idle path that the tick
+		 * broadcast device expired for us, we don't want to go deep
+		 * idle as we know that the IPI is going to arrive right away.
 		 */
-		smp_mb__after_atomic();
-
-		sched_ttwu_pending();
-		schedule_preempt_disabled();
+		if (cpu_idle_force_poll || tick_check_broadcast_expired())
+			cpu_idle_poll();
+		else
+			cpuidle_idle_call();
+		arch_cpu_idle_exit();
 	}
+
+	/*
+	 * Since we fell out of the loop above, we know TIF_NEED_RESCHED must
+	 * be set, propagate it into PREEMPT_NEED_RESCHED.
+	 *
+	 * This is required because for polling idle loops we will not have had
+	 * an IPI to fold the state for us.
+	 */
+	preempt_set_need_resched();
+	tick_nohz_idle_exit();
+	__current_clr_polling();
+
+	/*
+	 * We promise to call sched_ttwu_pending() and reschedule if
+	 * need_resched() is set while polling is set. That means that clearing
+	 * polling needs to be visible before doing these things.
+	 */
+	smp_mb__after_atomic();
+
+	sched_ttwu_pending();
+	schedule_preempt_disabled();
 }
 
 bool cpu_in_idle(unsigned long pc)
@@ -280,6 +269,56 @@ bool cpu_in_idle(unsigned long pc)
 		pc < (unsigned long)__cpuidle_text_end;
 }
 
+struct idle_timer {
+	struct hrtimer timer;
+	int done;
+};
+
+static enum hrtimer_restart idle_inject_timer_fn(struct hrtimer *timer)
+{
+	struct idle_timer *it = container_of(timer, struct idle_timer, timer);
+
+	WRITE_ONCE(it->done, 1);
+	set_tsk_need_resched(current);
+
+	return HRTIMER_NORESTART;
+}
+
+void play_idle(unsigned long duration_ms)
+{
+	struct idle_timer it;
+
+	/*
+	 * Only FIFO tasks can disable the tick since they don't need the forced
+	 * preemption.
+	 */
+	WARN_ON_ONCE(current->policy != SCHED_FIFO);
+	WARN_ON_ONCE(current->nr_cpus_allowed != 1);
+	WARN_ON_ONCE(!(current->flags & PF_KTHREAD));
+	WARN_ON_ONCE(!(current->flags & PF_NO_SETAFFINITY));
+	WARN_ON_ONCE(!duration_ms);
+
+	rcu_sleep_check();
+	preempt_disable();
+	current->flags |= PF_IDLE;
+	cpuidle_use_deepest_state(true);
+
+	it.done = 0;
+	hrtimer_init_on_stack(&it.timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
+	it.timer.function = idle_inject_timer_fn;
+	hrtimer_start(&it.timer, ms_to_ktime(duration_ms), HRTIMER_MODE_REL_PINNED);
+
+	while (!READ_ONCE(it.done))
+		do_idle();
+
+	cpuidle_use_deepest_state(false);
+	current->flags &= ~PF_IDLE;
+
+	preempt_fold_need_resched();
+	preempt_enable();
+}
+EXPORT_SYMBOL_GPL(play_idle);
+
 void cpu_startup_entry(enum cpuhp_state state)
 {
 	/*
@@ -299,5 +338,6 @@ void cpu_startup_entry(enum cpuhp_state state)
 #endif
 	arch_cpu_idle_prepare();
 	cpuhp_online_idle(state);
-	cpu_idle_loop();
+	while (1)
+		do_idle();
 }
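play_idle() loops in do_idle() until the pinned hrtimer's callback stores 1 to it.done and sets need_resched. A rough userspace analogue of that done-flag handshake, with a plain thread standing in for the hrtimer callback (all names and timings here are illustrative; the kernel code uses WRITE_ONCE/READ_ONCE plus set_tsk_need_resched(), not C11 atomics):

```c
#include <pthread.h>
#include <stdatomic.h>
#include <unistd.h>

/* Userspace stand-in for struct idle_timer: the "timer" side writes the
 * done flag, the injecting side polls it between idle iterations. */
struct idle_timer_sim {
	atomic_int done;
};

/* Plays the role of idle_inject_timer_fn(): fire once, mark done. */
static void *timer_fn(void *arg)
{
	struct idle_timer_sim *it = arg;

	usleep(10 * 1000);		/* stand-in for the hrtimer expiry */
	atomic_store(&it->done, 1);	/* WRITE_ONCE(it->done, 1) in the patch */
	return NULL;
}

/* Plays the role of play_idle(): loop until the timer marks completion.
 * Returns the number of "idle" iterations, or -1 on setup failure. */
static int play_idle_sim(void)
{
	struct idle_timer_sim it = { .done = 0 };
	pthread_t tid;
	int iterations = 0;

	if (pthread_create(&tid, NULL, timer_fn, &it) != 0)
		return -1;

	while (!atomic_load(&it.done)) {	/* READ_ONCE(it.done) in the patch */
		iterations++;			/* stand-in for one do_idle() pass */
		usleep(1000);
	}

	pthread_join(tid, NULL);
	return iterations;
}
```

The kernel version does not need atomics for the exit path because the timer also sets TIF_NEED_RESCHED on the injecting task, which kicks it out of do_idle()'s inner while (!need_resched()) loop.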