[1/3] idle: add support for tasks that inject idle

Message ID 1478718312-12847-2-git-send-email-jacob.jun.pan@linux.intel.com (mailing list archive)
State Superseded, archived
Delegated to: Zhang Rui

Commit Message

Jacob Pan Nov. 9, 2016, 7:05 p.m. UTC
From: Peter Zijlstra <peterz@infradead.org>

Idle injection drivers such as the Intel powerclamp and ACPI PAD drivers use
realtime tasks to take control of a CPU and then inject idle time. There are
two issues with this approach:
 1. Low efficiency: the injected idle task is treated as busy, so the
    scheduler tick does not stop during the injected idle period; the
    resulting unwanted wakeups can cost ~20% of the power savings.
 2. Idle accounting: injected idle time is presented to the user as busy time.

This patch addresses both issues by introducing a new PF_IDLE flag, which
allows any given task to be treated as an idle task while the flag is set.
Idle injection tasks can therefore run through the normal NOHZ idle
enter/exit flow, which gives correct idle accounting and stops the tick when
possible.

The implication is that the idle task is no longer limited to PID == 0.
Inheriting the PF_IDLE flag across fork() is also prevented.
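
As an illustration, a minimal sketch of a hypothetical caller (not part of
this patch): a per-CPU FIFO kthread that calls play_idle() and relies on a
wakeup armed by the caller to end each injection period. The thread function
and the duty-cycle numbers below are assumptions; only play_idle() and
PF_IDLE come from this series.

	static int idle_inject_fn(void *unused)
	{
		struct sched_param param = {
			.sched_priority = MAX_USER_RT_PRIO / 2,
		};

		/* play_idle() warns unless we are a bound FIFO kthread */
		sched_setscheduler(current, SCHED_FIFO, &param);

		while (!kthread_should_stop()) {
			/*
			 * In this version the caller arms an hrtimer whose
			 * expiry sets need_resched() and ends the idle stint.
			 */
			play_idle();
			/* stay runnable for the rest of the duty cycle */
			schedule_timeout_interruptible(HZ / 10);
		}
		return 0;
	}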

Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
---
 include/linux/cpu.h   |   2 +
 include/linux/sched.h |   3 +-
 kernel/fork.c         |   3 ++
 kernel/sched/core.c   |   1 +
 kernel/sched/idle.c   | 144 ++++++++++++++++++++++++++++----------------------
 5 files changed, 88 insertions(+), 65 deletions(-)

Comments

Peter Zijlstra Nov. 14, 2016, 2:57 p.m. UTC | #1
On Wed, Nov 09, 2016 at 11:05:10AM -0800, Jacob Pan wrote:
>  static inline bool is_idle_task(const struct task_struct *p)
>  {
> -	return p->pid == 0;
> +	return !!(p->flags & PF_IDLE);
>  }

> +void play_idle()
> +{
> +	/*
> +	 * Only FIFO tasks can disable the tick since they don't need the forced
> +	 * preemption.
> +	 */
> +	WARN_ON_ONCE(current->policy != SCHED_FIFO);
> +	WARN_ON_ONCE(current->nr_cpus_allowed != 1);
> +	WARN_ON_ONCE(!(current->flags & PF_KTHREAD));
> +	WARN_ON_ONCE(!(current->flags & PF_NO_SETAFFINITY));
> +	rcu_sleep_check();
> +
> +	preempt_disable();
> +	current->flags |= PF_IDLE;
> +	do_idle();
> +	current->flags &= ~PF_IDLE;

Whoops, that will clear PF_IDLE from the idle thread. Maybe do something
like:

	unsigned int idle_flag = (~current->flags & PF_IDLE);

	current->flags |= PF_IDLE;
	do_idle();
	current->flags ^= idle_flag;

> +
> +	preempt_fold_need_resched();
> +	preempt_enable();
> +}
> +EXPORT_SYMBOL_GPL(play_idle);

> @@ -299,5 +313,7 @@ void cpu_startup_entry(enum cpuhp_state state)
>  #endif
>  	arch_cpu_idle_prepare();
>  	cpuhp_online_idle(state);
> -	cpu_idle_loop();
> +	while (1)
> +		do_idle();
> +
>  }
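
Spelled out: idle_flag captures whether PF_IDLE was clear on entry (PF_IDLE
if it was clear, 0 if it was already set), so the final XOR clears the bit
only when play_idle() was the one to set it. A sketch with that restore
folded in, assuming the rest of the function stays as posted:

	void play_idle(void)
	{
		unsigned int idle_flag;

		/* ... WARN_ON_ONCE() sanity checks as above ... */
		rcu_sleep_check();

		preempt_disable();
		/* remember whether PF_IDLE was clear on entry */
		idle_flag = ~current->flags & PF_IDLE;
		current->flags |= PF_IDLE;
		do_idle();
		/* a no-op if the caller already had PF_IDLE set */
		current->flags ^= idle_flag;

		preempt_fold_need_resched();
		preempt_enable();
	}
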
Peter Zijlstra Nov. 14, 2016, 3:01 p.m. UTC | #2
On Wed, Nov 09, 2016 at 11:05:10AM -0800, Jacob Pan wrote:
> +void play_idle()
> +{
> +	/*
> +	 * Only FIFO tasks can disable the tick since they don't need the forced
> +	 * preemption.
> +	 */
> +	WARN_ON_ONCE(current->policy != SCHED_FIFO);
> +	WARN_ON_ONCE(current->nr_cpus_allowed != 1);
> +	WARN_ON_ONCE(!(current->flags & PF_KTHREAD));
> +	WARN_ON_ONCE(!(current->flags & PF_NO_SETAFFINITY));
> +	rcu_sleep_check();
> +
> +	preempt_disable();
> +	current->flags |= PF_IDLE;
> +	do_idle();
> +	current->flags &= ~PF_IDLE;
> +
> +	preempt_fold_need_resched();
> +	preempt_enable();
> +}
> +EXPORT_SYMBOL_GPL(play_idle);

Hurm.. didn't this initially also include the setup of the timer and
take a timeout argument?
Peter Zijlstra Nov. 14, 2016, 3:01 p.m. UTC | #3
On Mon, Nov 14, 2016 at 03:57:21PM +0100, Peter Zijlstra wrote:
> On Wed, Nov 09, 2016 at 11:05:10AM -0800, Jacob Pan wrote:
> >  static inline bool is_idle_task(const struct task_struct *p)
> >  {
> > -	return p->pid == 0;
> > +	return !!(p->flags & PF_IDLE);
> >  }
> 
> > +void play_idle()
> > +{
> > +	/*
> > +	 * Only FIFO tasks can disable the tick since they don't need the forced
> > +	 * preemption.
> > +	 */
> > +	WARN_ON_ONCE(current->policy != SCHED_FIFO);
> > +	WARN_ON_ONCE(current->nr_cpus_allowed != 1);
> > +	WARN_ON_ONCE(!(current->flags & PF_KTHREAD));
> > +	WARN_ON_ONCE(!(current->flags & PF_NO_SETAFFINITY));
> > +	rcu_sleep_check();
> > +
> > +	preempt_disable();
> > +	current->flags |= PF_IDLE;
> > +	do_idle();
> > +	current->flags &= ~PF_IDLE;
> 
> Whoops, that will clear PF_IDLE from the idle thread. Maybe do something
> like:
> 
> 	unsigned int idle_flag = (~current->flags & PF_IDLE);
> 
> 	current->flags |= PF_IDLE;
> 	do_idle();
> 	current->flags ^= idle_flag;
> 
> > +
> > +	preempt_fold_need_resched();
> > +	preempt_enable();
> > +}
> > +EXPORT_SYMBOL_GPL(play_idle);

n/m, I cannot read; the actual idle loop below uses do_idle(), not
play_idle().

> > @@ -299,5 +313,7 @@ void cpu_startup_entry(enum cpuhp_state state)
> >  #endif
> >  	arch_cpu_idle_prepare();
> >  	cpuhp_online_idle(state);
> > -	cpu_idle_loop();
> > +	while (1)
> > +		do_idle();
> > +
> >  }
Jacob Pan Nov. 14, 2016, 4:20 p.m. UTC | #4
On Mon, 14 Nov 2016 16:01:19 +0100
Peter Zijlstra <peterz@infradead.org> wrote:

> On Wed, Nov 09, 2016 at 11:05:10AM -0800, Jacob Pan wrote:
> > +void play_idle()
> > +{
> > +	/*
> > +	 * Only FIFO tasks can disable the tick since they don't
> > need the forced
> > +	 * preemption.
> > +	 */
> > +	WARN_ON_ONCE(current->policy != SCHED_FIFO);
> > +	WARN_ON_ONCE(current->nr_cpus_allowed != 1);
> > +	WARN_ON_ONCE(!(current->flags & PF_KTHREAD));
> > +	WARN_ON_ONCE(!(current->flags & PF_NO_SETAFFINITY));
> > +	rcu_sleep_check();
> > +
> > +	preempt_disable();
> > +	current->flags |= PF_IDLE;
> > +	do_idle();
> > +	current->flags &= ~PF_IDLE;
> > +
> > +	preempt_fold_need_resched();
> > +	preempt_enable();
> > +}
> > +EXPORT_SYMBOL_GPL(play_idle);  
> 
> Hurm.. didn't this initially also include the setup of the timer and
> take a timeout argument?

Right, the initial version had a timeout and used
	init_timer_on_stack(timer);

Since this play_idle timer is likely to expire rather than be canceled,
I switched to an hrtimer armed by the caller. We don't have an on-stack
hrtimer, right? Or should we consider adding one?

Thanks,

Jacob
Peter Zijlstra Nov. 14, 2016, 4:22 p.m. UTC | #5
On Mon, Nov 14, 2016 at 08:20:28AM -0800, Jacob Pan wrote:
> On Mon, 14 Nov 2016 16:01:19 +0100
> Peter Zijlstra <peterz@infradead.org> wrote:
> 
> > On Wed, Nov 09, 2016 at 11:05:10AM -0800, Jacob Pan wrote:
> > > +void play_idle()
> > > +{
> > > +	/*
> > > +	 * Only FIFO tasks can disable the tick since they don't
> > > need the forced
> > > +	 * preemption.
> > > +	 */
> > > +	WARN_ON_ONCE(current->policy != SCHED_FIFO);
> > > +	WARN_ON_ONCE(current->nr_cpus_allowed != 1);
> > > +	WARN_ON_ONCE(!(current->flags & PF_KTHREAD));
> > > +	WARN_ON_ONCE(!(current->flags & PF_NO_SETAFFINITY));
> > > +	rcu_sleep_check();
> > > +
> > > +	preempt_disable();
> > > +	current->flags |= PF_IDLE;
> > > +	do_idle();
> > > +	current->flags &= ~PF_IDLE;
> > > +
> > > +	preempt_fold_need_resched();
> > > +	preempt_enable();
> > > +}
> > > +EXPORT_SYMBOL_GPL(play_idle);  
> > 
> > Hurm.. didn't this initially also include the setup of the timer and
> > take a timeout argument?
> 
> Right, the initial version had a timeout and used
> 	init_timer_on_stack(timer);
> 
> Since this play_idle timer is likely to expire rather than be canceled,
> I switched to an hrtimer armed by the caller. We don't have an on-stack
> hrtimer, right? Or should we consider adding one?

hrtimer_init_on_stack() exists.
Jacob Pan Nov. 14, 2016, 4:29 p.m. UTC | #6
On Mon, 14 Nov 2016 17:22:11 +0100
Peter Zijlstra <peterz@infradead.org> wrote:

> hrtimer_init_on_stack() exists.

I missed it. Let me move the timeout inside play_idle().

Thanks,

Jacob
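
A rough sketch of what folding the timeout into play_idle() with
hrtimer_init_on_stack() could look like; the idle_timer struct, the timer
callback, and the _PINNED mode choice here are assumptions, not code from
this series:

	struct idle_timer {
		struct hrtimer timer;
		int done;
	};

	static enum hrtimer_restart idle_inject_timer_fn(struct hrtimer *timer)
	{
		struct idle_timer *it =
			container_of(timer, struct idle_timer, timer);

		WRITE_ONCE(it->done, 1);
		/* kick do_idle() out of its !need_resched() loop */
		set_tsk_need_resched(current);

		return HRTIMER_NORESTART;
	}

	void play_idle(unsigned long duration_ms)
	{
		struct idle_timer it;

		/* ... WARN_ON_ONCE() sanity checks as before ... */

		preempt_disable();
		current->flags |= PF_IDLE;

		it.done = 0;
		hrtimer_init_on_stack(&it.timer, CLOCK_MONOTONIC,
				      HRTIMER_MODE_REL);
		it.timer.function = idle_inject_timer_fn;
		/* pinned: expiry fires on this CPU, so current is us */
		hrtimer_start(&it.timer, ms_to_ktime(duration_ms),
			      HRTIMER_MODE_REL_PINNED);

		while (!READ_ONCE(it.done))
			do_idle();

		current->flags &= ~PF_IDLE;

		preempt_fold_need_resched();
		preempt_enable();
	}
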
Patch

diff --git a/include/linux/cpu.h b/include/linux/cpu.h
index b886dc1..9de0c4a 100644
--- a/include/linux/cpu.h
+++ b/include/linux/cpu.h
@@ -245,6 +245,8 @@ static inline void enable_nonboot_cpus(void) {}
 int cpu_report_state(int cpu);
 int cpu_check_up_prepare(int cpu);
 void cpu_set_state_online(int cpu);
+void play_idle(void);
+
 #ifdef CONFIG_HOTPLUG_CPU
 bool cpu_wait_death(unsigned int cpu, int seconds);
 bool cpu_report_death(void);
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 348f51b..114c7fc 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2254,6 +2254,7 @@ static inline cputime_t task_gtime(struct task_struct *t)
 /*
  * Per process flags
  */
+#define PF_IDLE		0x00000002	/* I am an IDLE thread */
 #define PF_EXITING	0x00000004	/* getting shut down */
 #define PF_EXITPIDONE	0x00000008	/* pi exit done on shut down */
 #define PF_VCPU		0x00000010	/* I'm a virtual CPU */
@@ -2609,7 +2610,7 @@ extern int sched_setattr(struct task_struct *,
  */
 static inline bool is_idle_task(const struct task_struct *p)
 {
-	return p->pid == 0;
+	return !!(p->flags & PF_IDLE);
 }
 extern struct task_struct *curr_task(int cpu);
 extern void ia64_set_curr_task(int cpu, struct task_struct *p);
diff --git a/kernel/fork.c b/kernel/fork.c
index 997ac1d..73edbcd 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1819,6 +1819,9 @@ static __latent_entropy struct task_struct *copy_process(
 	threadgroup_change_end(current);
 	perf_event_fork(p);
 
+	/* do not inherit idle task */
+	p->flags &= ~PF_IDLE;
+
 	trace_task_newtask(p, clone_flags);
 	uprobe_copy_process(p, clone_flags);
 
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 154fd68..c95fbcd 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5279,6 +5279,7 @@ void init_idle(struct task_struct *idle, int cpu)
 	__sched_fork(0, idle);
 	idle->state = TASK_RUNNING;
 	idle->se.exec_start = sched_clock();
+	idle->flags |= PF_IDLE;
 
 	kasan_unpoison_task_stack(idle);
 
diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
index 1d8718d..ffd9a83 100644
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -202,76 +202,68 @@ static void cpuidle_idle_call(void)
  *
  * Called with polling cleared.
  */
-static void cpu_idle_loop(void)
+static void do_idle(void)
 {
-	int cpu = smp_processor_id();
+	/*
+	 * If the arch has a polling bit, we maintain an invariant:
+	 *
+	 * Our polling bit is clear if we're not scheduled (i.e. if
+	 * rq->curr != rq->idle).  This means that, if rq->idle has
+	 * the polling bit set, then setting need_resched is
+	 * guaranteed to cause the cpu to reschedule.
+	 */
 
-	while (1) {
-		/*
-		 * If the arch has a polling bit, we maintain an invariant:
-		 *
-		 * Our polling bit is clear if we're not scheduled (i.e. if
-		 * rq->curr != rq->idle).  This means that, if rq->idle has
-		 * the polling bit set, then setting need_resched is
-		 * guaranteed to cause the cpu to reschedule.
-		 */
+	__current_set_polling();
+	tick_nohz_idle_enter();
 
-		__current_set_polling();
-		quiet_vmstat();
-		tick_nohz_idle_enter();
-
-		while (!need_resched()) {
-			check_pgt_cache();
-			rmb();
-
-			if (cpu_is_offline(cpu)) {
-				cpuhp_report_idle_dead();
-				arch_cpu_idle_dead();
-			}
-
-			local_irq_disable();
-			arch_cpu_idle_enter();
-
-			/*
-			 * In poll mode we reenable interrupts and spin.
-			 *
-			 * Also if we detected in the wakeup from idle
-			 * path that the tick broadcast device expired
-			 * for us, we don't want to go deep idle as we
-			 * know that the IPI is going to arrive right
-			 * away
-			 */
-			if (cpu_idle_force_poll || tick_check_broadcast_expired())
-				cpu_idle_poll();
-			else
-				cpuidle_idle_call();
-
-			arch_cpu_idle_exit();
-		}
+	while (!need_resched()) {
+		check_pgt_cache();
+		rmb();
 
-		/*
-		 * Since we fell out of the loop above, we know
-		 * TIF_NEED_RESCHED must be set, propagate it into
-		 * PREEMPT_NEED_RESCHED.
-		 *
-		 * This is required because for polling idle loops we will
-		 * not have had an IPI to fold the state for us.
-		 */
-		preempt_set_need_resched();
-		tick_nohz_idle_exit();
-		__current_clr_polling();
+		if (cpu_is_offline(smp_processor_id())) {
+			cpuhp_report_idle_dead();
+			arch_cpu_idle_dead();
+		}
 
-		/*
-		 * We promise to call sched_ttwu_pending and reschedule
-		 * if need_resched is set while polling is set.  That
-		 * means that clearing polling needs to be visible
-		 * before doing these things.
-		 */
-		smp_mb__after_atomic();
+		local_irq_disable();
+		arch_cpu_idle_enter();
 
-		sched_ttwu_pending();
-		schedule_preempt_disabled();
+		/* In poll mode we reenable interrupts and spin.
+		 * Also if we detected in the wakeup from idle
+		 * path that the tick broadcast device expired
+		 * for us, we don't want to go deep idle as we
+		 * know that the IPI is going to arrive right
+		 * away
+		 */
+		if (cpu_idle_force_poll || tick_check_broadcast_expired())
+			cpu_idle_poll();
+		else
+			cpuidle_idle_call();
+		arch_cpu_idle_exit();
 	}
+
+	/*
+	 * Since we fell out of the loop above, we know
+	 * TIF_NEED_RESCHED must be set, propagate it into
+	 * PREEMPT_NEED_RESCHED.
+	 *
+	 * This is required because for polling idle loops we will
+	 * not have had an IPI to fold the state for us.
+	 */
+	preempt_set_need_resched();
+	tick_nohz_idle_exit();
+	__current_clr_polling();
+
+	/*
+	 * We promise to call sched_ttwu_pending and reschedule
+	 * if need_resched is set while polling is set.  That
+	 * means that clearing polling needs to be visible
+	 * before doing these things.
+	 */
+	smp_mb__after_atomic();
+
+	sched_ttwu_pending();
+	schedule_preempt_disabled();
 }
 
 bool cpu_in_idle(unsigned long pc)
@@ -280,6 +272,28 @@ bool cpu_in_idle(unsigned long pc)
 		pc < (unsigned long)__cpuidle_text_end;
 }
 
+void play_idle()
+{
+	/*
+	 * Only FIFO tasks can disable the tick since they don't need the forced
+	 * preemption.
+	 */
+	WARN_ON_ONCE(current->policy != SCHED_FIFO);
+	WARN_ON_ONCE(current->nr_cpus_allowed != 1);
+	WARN_ON_ONCE(!(current->flags & PF_KTHREAD));
+	WARN_ON_ONCE(!(current->flags & PF_NO_SETAFFINITY));
+	rcu_sleep_check();
+
+	preempt_disable();
+	current->flags |= PF_IDLE;
+	do_idle();
+	current->flags &= ~PF_IDLE;
+
+	preempt_fold_need_resched();
+	preempt_enable();
+}
+EXPORT_SYMBOL_GPL(play_idle);
+
 void cpu_startup_entry(enum cpuhp_state state)
 {
 	/*
@@ -299,5 +313,7 @@ void cpu_startup_entry(enum cpuhp_state state)
 #endif
 	arch_cpu_idle_prepare();
 	cpuhp_online_idle(state);
-	cpu_idle_loop();
+	while (1)
+		do_idle();
+
 }
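
For completeness, a hypothetical way a driver could create a thread that
satisfies the WARN_ON_ONCE() preconditions in play_idle() above, using the
existing kthread API (idle_inject_fn is the sketched thread function from
the commit message; error handling is minimal):

	struct sched_param param = { .sched_priority = MAX_USER_RT_PRIO / 2 };
	struct task_struct *tsk;

	/* kthread_create_on_cpu() binds the thread and sets PF_NO_SETAFFINITY */
	tsk = kthread_create_on_cpu(idle_inject_fn, NULL, cpu, "idle_inject/%u");
	if (IS_ERR(tsk))
		return PTR_ERR(tsk);

	sched_setscheduler(tsk, SCHED_FIFO, &param);
	wake_up_process(tsk);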