
[3/7] idle, thermal, acpi: Remove home grown idle implementations

Message ID 20131120162736.508462614@infradead.org (mailing list archive)
State Not Applicable, archived

Commit Message

Peter Zijlstra Nov. 20, 2013, 4:04 p.m. UTC
People are starting to grow their own idle implementations in various
disgusting ways. Collapse the lot and use the generic idle code to
provide a proper idle cycle implementation.

This does not fully preserve existing behaviour in that the generic
idle cycle function calls into the normal cpuidle-governed idle
routines and should thus respect things like QoS parameters and the
like.

If people want to override the idle state they should talk to the
cpuidle folks about extending the interface and attempt to preserve
QoS guarantees, instead of jumping straight to the deepest possible C
state.

Compile tested only -- I've no idea how to actually use these vile
things.

Cc: hpa@zytor.com
Cc: arjan@linux.intel.com
Cc: rui.zhang@intel.com
Cc: jacob.jun.pan@linux.intel.com
Cc: lenb@kernel.org
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
---
 drivers/acpi/acpi_pad.c            |   41 ------------
 drivers/thermal/intel_powerclamp.c |   38 -----------
 include/linux/cpu.h                |    2 
 kernel/cpu/idle.c                  |  123 ++++++++++++++++++++++---------------
 kernel/time/tick-sched.c           |    2 
 5 files changed, 82 insertions(+), 124 deletions(-)
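
For reference, the conversion in both drivers boils down to replacing a
hand-rolled mwait loop with a single call into the generic code; a condensed
(not verbatim) view of the powerclamp side of the change, taken from the full
diff in the Patch section below:

	/* before: open-coded idle injection (condensed from intel_powerclamp.c) */
	preempt_disable();
	tick_nohz_idle_enter();
	while (time_before(jiffies, target_jiffies)) {
		local_touch_nmi();
		stop_critical_timings();
		mwait_idle_with_hints(target_mwait, 1);
		start_critical_timings();
	}
	tick_nohz_idle_exit();
	preempt_enable_no_resched();

	/* after: the wakeup timer, tick handling and NEED_RESCHED propagation
	 * are provided by the new play_idle() helper */
	play_idle(duration_jiffies);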




Comments

Arjan van de Ven Nov. 20, 2013, 4:40 p.m. UTC | #1
On 11/20/2013 8:04 AM, Peter Zijlstra wrote:
> This does not fully preseve existing behaviour in that the generic
> idle cycle function calls into the normal cpuidle governed idle
> routines and should thus respect things like QoS parameters and the
> like.


NAK on the powerclamp side.

powerclamp MUST NOT do that....
it needs to go to the deepest state no matter what
(this is for when your system is overheating; there is not a lot of choice
here... the alternative is an emergency reset that the hardware does for safety)


Peter Zijlstra Nov. 20, 2013, 4:59 p.m. UTC | #2
On Wed, Nov 20, 2013 at 08:40:49AM -0800, Arjan van de Ven wrote:
> On 11/20/2013 8:04 AM, Peter Zijlstra wrote:
> >This does not fully preseve existing behaviour in that the generic
> >idle cycle function calls into the normal cpuidle governed idle
> >routines and should thus respect things like QoS parameters and the
> >like.
> 
> 
> NAK on the powerclamp side.
> 
> powerclamp MUST NOT do that....
> it is needed to go to the deepest state no matter what
> (this is for when your system is overheating. there is not a lot of choice
> here... alternative is an emergency reset that the hardware does for safety)

Then it's a worse broken piece of shit than I thought it was.

The only way to do what you want is to pretty much do stop_machine().

There's no guarantee your current FIFO-50 tasks will ever get to run.

Also, since when does Intel hardware do emergency resets on thermal
events? It used to force throttle stuff.
Thomas Gleixner Nov. 20, 2013, 5:23 p.m. UTC | #3
On Wed, 20 Nov 2013, Arjan van de Ven wrote:

> On 11/20/2013 8:04 AM, Peter Zijlstra wrote:
> > This does not fully preseve existing behaviour in that the generic
> > idle cycle function calls into the normal cpuidle governed idle
> > routines and should thus respect things like QoS parameters and the
> > like.
> 
> 
> NAK on the powerclamp side.
> 
> powerclamp MUST NOT do that....
> it is needed to go to the deepest state no matter what
> (this is for when your system is overheating. there is not a lot of choice
> here... alternative is an emergency reset that the hardware does for safety)

So that whole machinery falls apart when the thing which is running on
that hot core is a while(1) loop with a higher or equal FIFO priority
than this thread. Even if you'd run at prio 99, then there is no
guarantee that the cpu hog does not run with prio 99 as well and due
to FIFO and being on the CPU it's not going to let you on.

And even if the issue was caused by a lower prio CPU hog, the system
might just be in an RT throttled state because of the hog. So depending
on the throttler settings it might take quite some time for the
clamper thread to get on the CPU.
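
(For reference, with the default throttler settings of sched_rt_period_us =
1000000 and sched_rt_runtime_us = 950000, the RT class is throttled for the
last 50 ms of every second once its budget is exhausted, so even a well-tuned
clamper thread can be kept off the CPU for tens of milliseconds at a time.)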

Amazingly naive and stupid approach.

Thanks,

	tglx
Arjan van de Ven Nov. 20, 2013, 5:23 p.m. UTC | #4
On 11/20/2013 9:23 AM, Thomas Gleixner wrote:
> On Wed, 20 Nov 2013, Arjan van de Ven wrote:
>
>> On 11/20/2013 8:04 AM, Peter Zijlstra wrote:
>>> This does not fully preseve existing behaviour in that the generic
>>> idle cycle function calls into the normal cpuidle governed idle
>>> routines and should thus respect things like QoS parameters and the
>>> like.
>>
>>
>> NAK on the powerclamp side.
>>
>> powerclamp MUST NOT do that....
>> it is needed to go to the deepest state no matter what
>> (this is for when your system is overheating. there is not a lot of choice
>> here... alternative is an emergency reset that the hardware does for safety)
>
> So that whole machinery falls apart when the thing which is running on
> that hot core is a while(1) loop with a higher or equal FIFO priority
> than this thread. Even if you'd run at prio 99, then there is no
> guarantee that the cpu hog does not run with prio 99 as well and due
> to FIFO and being on the CPU it's not going to let you on.

the idea was to at least give people who know what they're doing a chance to run

Thomas Gleixner Nov. 20, 2013, 5:55 p.m. UTC | #5
On Wed, 20 Nov 2013, Arjan van de Ven wrote:
> On 11/20/2013 9:23 AM, Thomas Gleixner wrote:
> > On Wed, 20 Nov 2013, Arjan van de Ven wrote:
> > 
> > > On 11/20/2013 8:04 AM, Peter Zijlstra wrote:
> > > > This does not fully preseve existing behaviour in that the generic
> > > > idle cycle function calls into the normal cpuidle governed idle
> > > > routines and should thus respect things like QoS parameters and the
> > > > like.
> > > 
> > > 
> > > NAK on the powerclamp side.
> > > 
> > > powerclamp MUST NOT do that....
> > > it is needed to go to the deepest state no matter what
> > > (this is for when your system is overheating. there is not a lot of choice
> > > here... alternative is an emergency reset that the hardware does for
> > > safety)
> > 
> > So that whole machinery falls apart when the thing which is running on
> > that hot core is a while(1) loop with a higher or equal FIFO priority
> > than this thread. Even if you'd run at prio 99, then there is no
> > guarantee that the cpu hog does not run with prio 99 as well and due
> > to FIFO and being on the CPU it's not going to let you on.
> 
> the idea was to at least give people who know what they're doing a chance to
> run

You can't be serious about that. Even if people know what they are
doing, they have no chance to prevent the occasional high prio runaway
bug. And then you shrug your shoulders and tell that guy who spent
time to tune the thing properly "bad luck" or what?

This is the worst of all approaches and you know that. You fought such
crap tooth and nail not so long ago.

Thanks,

	tglx
Arjan van de Ven Nov. 20, 2013, 6:21 p.m. UTC | #6
On 11/20/2013 9:55 AM, Thomas Gleixner wrote:
> On Wed, 20 Nov 2013, Arjan van de Ven wrote:
>> On 11/20/2013 9:23 AM, Thomas Gleixner wrote:
>>> On Wed, 20 Nov 2013, Arjan van de Ven wrote:
>>>
>>>> On 11/20/2013 8:04 AM, Peter Zijlstra wrote:
>>>>> This does not fully preseve existing behaviour in that the generic
>>>>> idle cycle function calls into the normal cpuidle governed idle
>>>>> routines and should thus respect things like QoS parameters and the
>>>>> like.
>>>>
>>>>
>>>> NAK on the powerclamp side.
>>>>
>>>> powerclamp MUST NOT do that....
>>>> it is needed to go to the deepest state no matter what
>>>> (this is for when your system is overheating. there is not a lot of choice
>>>> here... alternative is an emergency reset that the hardware does for
>>>> safety)
>>>
>>> So that whole machinery falls apart when the thing which is running on
>>> that hot core is a while(1) loop with a higher or equal FIFO priority
>>> than this thread. Even if you'd run at prio 99, then there is no
>>> guarantee that the cpu hog does not run with prio 99 as well and due
>>> to FIFO and being on the CPU it's not going to let you on.
>>
>> the idea was to at least give people who know what they're doing a chance to
>> run
>
> You can't be serious about that. Even if people know what they are
> doing, they have no chance to prevent the occasional high prio runaway
> bug. And then you shrug your shoulders and tell that guy who spent
> time to tune the thing proper "bad luck" or what?

his system will crawl to a halt (and if it's for a long time on a system with
submarginal cooling, potentially reboot) due to the hardware protecting itself.
What powerclamp is trying to do is avoid these steps from the hardware (they are both
very harsh and very inefficient).

but for powerclamp to work, it needs to inject a deep idle....
I'm very ok using generic abstractions for that, but the abstraction needs to then
include a "don't be nice about picking shallow C states for performance reasons, just
pick something very deep" parameter.

(powerclamp will adjust if you have some periodic realtime task or interrupts or ..
taking away idle from it, by over time injecting a bit more idle)
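
A rough sketch of what such a parameter could look like on the state-selection
side (purely illustrative, nothing like this exists in the patch; the
force_deepest_idle flag and the cpuidle_select_state() wrapper are assumed
names):

	static DEFINE_PER_CPU(bool, force_deepest_idle);

	static int cpuidle_select_state(struct cpuidle_driver *drv,
					struct cpuidle_device *dev)
	{
		/* an idle injector sets the flag around its injection window */
		if (this_cpu_read(force_deepest_idle))
			return drv->state_count - 1;	/* deepest available C-state */

		/* normal, QoS/latency-aware governor decision */
		return cpuidle_curr_governor->select(drv, dev);
	}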


Thomas Gleixner Nov. 20, 2013, 7:38 p.m. UTC | #7
On Wed, 20 Nov 2013, Arjan van de Ven wrote:
> 
> but for powerclamp to work, it needs to inject a deep idle....  I'm
> very ok using generic abstractions for that, but the abstraction
> needs to then include a "don't be nice about picking shallow C
> states for performance reasons, just pick something very deep"
> parameter.

And that's what you should have done in the first place. Make the
generic code take a parameter to indicate that. Or tell the scheduler
to throttle the machine and go deep idle. That would also be helpful
to others who might need some similar thing.

No, you went with the worst design:

    - Hack it into some random driver
    - Export random core interfaces so it's harder to change them
    - Let others deal with the fallout

I've been cleaning up that kind of mess for more than 9 years now and I'm
really disappointed that you went over to the "who cares, works for
me" camp.

I can vividly remember our discussions when we were cleaning up the
whole timer mess together in order to make NOHZ actually useful. Your
cursing about such code was definitely impressive back then.

Thanks,

	tglx




Jacob Pan Nov. 20, 2013, 10:08 p.m. UTC | #8
On Wed, 20 Nov 2013 20:38:03 +0100 (CET)
Thomas Gleixner <tglx@linutronix.de> wrote:

> On Wed, 20 Nov 2013, Arjan van de Ven wrote:
> > 
> > but for powerclamp to work, it needs to inject a deep idle....  I'm
> > very ok using generic abstractions for that, but the abstraction
> > needs to then include a "don't be nice about picking shallow C
> > states for performance reasons, just pick something very deep"
> > parameter.
> 
> And that's what you should have done in the first place. Make the
> generic code take a parameter to indicate that. Or tell the scheduler
> to throttle the machine and go deep idle. That would also be helpful
> to others who might need some similar thing.
> 
I thought about that. Since the generic code is in the performance-critical
path and idle injection is a rare case, the question is: do we want to
sacrifice the 99% for the 1% usage?

> No, you went with the worst design:
> 
>     - Hack it into some random driver
>     - Export random core interfaces so it's harder to change them
>     - Let others deal with the fallout
> 
> I'm cleaning up that kind of mess for more than 9 years now and I'm
> really disappointed that you went over to the "who cares, works for
> me" camp.
> 
> I can lively remember our discussions when we were cleaning up the
> whole timer mess together in order to make NOHZ actually useful. Your
> cursing about such code was definitely impressive back then.
> 
> Thanks,
> 
> 	tglx
> 
> 
> 
> 

[Jacob Pan]
Jacob Pan Nov. 21, 2013, 12:54 a.m. UTC | #9
On Wed, 20 Nov 2013 17:04:53 +0100
Peter Zijlstra <peterz@infradead.org> wrote:

> People are starting to grow their own idle implementations in various
> disgusting ways. Collapse the lot and use the generic idle code to
> provide a proper idle cycle implementation.
> 
+Paul

RCU and others that rely on is_idle_task() might be broken with the
consolidated idle code, since the caller of do_idle() may have pid != 0.

Should we use TS_POLL or introduce a new flag to identify the idle task?

The reason why the idle injection code does not inform RCU is that we have
a known short period of idle time which does not impact the RCU grace period.

On the other hand, I see the idle injection code working with this
patchset with a workaround in is_idle_task() that checks the TS_POLL flag.
But the efficiency is down by ~30%, i.e.

before: inject 25% time to get 23-24% package idle
after: inject 25% time to get 16-17% package idle

Still looking into improvement.

Jacob



> This does not fully preseve existing behaviour in that the generic
> idle cycle function calls into the normal cpuidle governed idle
> routines and should thus respect things like QoS parameters and the
> like.
> 
> If people want to over-ride the idle state they should talk to the
> cpuidle folks about extending the interface and attempt to preserve
> QoS guarantees, instead of jumping straight to the deepest possible C
> state.
> 
> Compile tested only -- I've no idea how to actually use these vile
> things.
> 
> Cc: hpa@zytor.com
> Cc: arjan@linux.intel.com
> Cc: rui.zhang@intel.com
> Cc: jacob.jun.pan@linux.intel.com
> Cc: lenb@kernel.org
> Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> Cc: Ingo Molnar <mingo@kernel.org>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Signed-off-by: Peter Zijlstra <peterz@infradead.org>
> ---
> 
> [ ... full patch elided; see the Patch section below ... ]

[Jacob Pan]
Peter Zijlstra Nov. 21, 2013, 8:21 a.m. UTC | #10
On Wed, Nov 20, 2013 at 04:54:06PM -0800, Jacob Pan wrote:
> On Wed, 20 Nov 2013 17:04:53 +0100
> Peter Zijlstra <peterz@infradead.org> wrote:
> 
> > People are starting to grow their own idle implementations in various
> > disgusting ways. Collapse the lot and use the generic idle code to
> > provide a proper idle cycle implementation.
> > 
> +Paul
> 
> RCU and others rely on is_idle_task() might be broken with the
> consolidated idle code since caller of do_idle may have pid != 0.
> 
> Should we use TS_POLL or introduce a new flag to identify idle task?

PF_IDLE would be my preference, I checked and we seem to have a grand
total of 2 unused task_struct::flags left ;-)
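
A minimal sketch of what a PF_IDLE based approach could look like (illustrative
only, not part of this series; the flag value is just one of the free bits):

	#define PF_IDLE		0x00000002	/* this task is (injecting) idle */

	static inline bool is_idle_task(const struct task_struct *p)
	{
		return !!(p->flags & PF_IDLE);
	}

	/* and play_idle() would mark the injecting kthread for the duration: */
	current->flags |= PF_IDLE;
	do_idle();
	current->flags &= ~PF_IDLE;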

> The reason why idle injection code does not inform RCU is that we have
> known short period of idle time which does not impact RCU grace period.
> 
> On the other side, I see idle injection code is working with this
> patchset with workaround in s_idle_task() by checking TS_POLL flag.
> But the efficiency is down by ~30%. i.e.
> 
> before: inject 25% time to get 23-24% package idle
> after: inject 25% time to get 16-17% package idle
> 
> Still looking into improvement.

So the quick hack is to make acpi_idle/intel_idle use the highest
possible C-state when pid!=0 && PF_IDLE.

Ideally though I'd like to see some of the QoS ramifications explored. Because
forcing the CPU into the highest C-state basically invalidates the
entire QoS stack.

So either make QoS and this idle injection stuff mutually exclusive in a
very explicit way -- disable the QoS interface when you enable one of
these idle injectors AND fail to engage the idle injectors when an
incompatible QoS setting is pre-existing.

Or come up with something smarter.

You also have to explore the case of higher priority tasks messing with
the proper operation of your injectors; this one is harder to deal with.
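
A sketch of the "fail to engage" half of that mutual exclusion (assumed helper
name, not an existing interface; pm_qos_request() and exit_latency are the
existing PM-QoS/cpuidle bits):

	static int idle_injection_start(struct cpuidle_driver *drv)
	{
		s32 constraint = pm_qos_request(PM_QOS_CPU_DMA_LATENCY);
		s32 deepest = drv->states[drv->state_count - 1].exit_latency;

		/* an incompatible QoS setting is pre-existing: refuse to engage */
		if (constraint < deepest)
			return -EBUSY;

		return 0;
	}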
Paul E. McKenney Nov. 21, 2013, 4:07 p.m. UTC | #11
On Thu, Nov 21, 2013 at 09:21:51AM +0100, Peter Zijlstra wrote:
> On Wed, Nov 20, 2013 at 04:54:06PM -0800, Jacob Pan wrote:
> > On Wed, 20 Nov 2013 17:04:53 +0100
> > Peter Zijlstra <peterz@infradead.org> wrote:
> > 
> > > People are starting to grow their own idle implementations in various
> > > disgusting ways. Collapse the lot and use the generic idle code to
> > > provide a proper idle cycle implementation.
> > > 
> > +Paul
> > 
> > RCU and others rely on is_idle_task() might be broken with the
> > consolidated idle code since caller of do_idle may have pid != 0.
> > 
> > Should we use TS_POLL or introduce a new flag to identify idle task?
> 
> PF_IDLE would be my preference, I checked and we seem to have a grand
> total of 2 unused task_struct::flags left ;-)

As long as RCU has some reliable way to identify an idle task, I am
good.  But I have to ask -- why can't idle injection coordinate with
the existing idle tasks rather than temporarily making alternative
idle tasks?

							Thanx, Paul

> > The reason why idle injection code does not inform RCU is that we have
> > known short period of idle time which does not impact RCU grace period.
> > 
> > On the other side, I see idle injection code is working with this
> > patchset with workaround in s_idle_task() by checking TS_POLL flag.
> > But the efficiency is down by ~30%. i.e.
> > 
> > before: inject 25% time to get 23-24% package idle
> > after: inject 25% time to get 16-17% package idle
> > 
> > Still looking into improvement.
> 
> So the quick hack is to make acpi_idle/intel_idle use the highest
> possible C-state when pid!=0 && PF_IDLE.
> 
> Ideally though I'd see some of the QoS ramifications explored. Because
> forcing the CPU into the highest C-state basically invalidates the
> entire QoS stack.
> 
> So either make QoS and this idle injection stuff mutually exclusive in a
> very explicit way -- disable the QoS interface when you enable one of
> these idle injectors AND fail to engage the idle injectors when an
> incompatible QoS setting is pre-existing.
> 
> Or come up with something smarter.
> 
> You also have to explore the case of higher priority tasks messing with
> the proper operation of your injectors, this one is harder to deal with.
> 

Arjan van de Ven Nov. 21, 2013, 4:21 p.m. UTC | #12
On 11/21/2013 8:07 AM, Paul E. McKenney wrote:
> As long as RCU has some reliable way to identify an idle task, I am
> good.  But I have to ask -- why can't idle injection coordinate with
> the existing idle tasks rather than temporarily making alternative
> idle tasks?

it's not a real idle. that's the whole problem of the situation.
to the rest of the OS, this is being BUSY (busy saving power using
a CPU instruction, but it might as well have been an mdelay() operation)
and it's also what end users expect; they want to be able to see
where their performance (read: cpu time in "top") is going.

Peter Zijlstra Nov. 21, 2013, 4:29 p.m. UTC | #13
On Thu, Nov 21, 2013 at 08:07:16AM -0800, Paul E. McKenney wrote:
> On Thu, Nov 21, 2013 at 09:21:51AM +0100, Peter Zijlstra wrote:
> > On Wed, Nov 20, 2013 at 04:54:06PM -0800, Jacob Pan wrote:
> > > On Wed, 20 Nov 2013 17:04:53 +0100
> > > Peter Zijlstra <peterz@infradead.org> wrote:
> > > 
> > > > People are starting to grow their own idle implementations in various
> > > > disgusting ways. Collapse the lot and use the generic idle code to
> > > > provide a proper idle cycle implementation.
> > > > 
> > > +Paul
> > > 
> > > RCU and others rely on is_idle_task() might be broken with the
> > > consolidated idle code since caller of do_idle may have pid != 0.
> > > 
> > > Should we use TS_POLL or introduce a new flag to identify idle task?
> > 
> > PF_IDLE would be my preference, I checked and we seem to have a grand
> > total of 2 unused task_struct::flags left ;-)
> 
> As long as RCU has some reliable way to identify an idle task, I am
> good.  But I have to ask -- why can't idle injection coordinate with
> the existing idle tasks rather than temporarily making alternative
> idle tasks?

Because that'd completely wreck how the scheduler selects tasks for just
these 2 arguably insane drivers.

We'd have to somehow teach it to pick the actual idle task instead of
this one task, but keep scheduling the rest of the tasks like normal --
we very much should keep higher priority tasks running like normal.

And we'd need a way to make it stop doing this 'proxy' execution.

That said, once we manage to replace the entire PI implementation with a
proper proxy execution scheme, the above would be possible by having a
resource (rt_mutex) associated with every idle task, and always held by
that task.

At that point we can do something like:

  rt_mutex_lock_timeout(cpu_idle_lock(cpu), jiffies);

And get the idle thread executing in our stead.

That said, idle is _special_ and I'd not be surprised we'd find a few
'funnies' along the way of trying to get that to actually work.

For now I'd rather not go there quite yet.
Paul E. McKenney Nov. 21, 2013, 5:27 p.m. UTC | #14
On Thu, Nov 21, 2013 at 05:29:56PM +0100, Peter Zijlstra wrote:
> On Thu, Nov 21, 2013 at 08:07:16AM -0800, Paul E. McKenney wrote:
> > On Thu, Nov 21, 2013 at 09:21:51AM +0100, Peter Zijlstra wrote:
> > > On Wed, Nov 20, 2013 at 04:54:06PM -0800, Jacob Pan wrote:
> > > > On Wed, 20 Nov 2013 17:04:53 +0100
> > > > Peter Zijlstra <peterz@infradead.org> wrote:
> > > > 
> > > > > People are starting to grow their own idle implementations in various
> > > > > disgusting ways. Collapse the lot and use the generic idle code to
> > > > > provide a proper idle cycle implementation.
> > > > > 
> > > > +Paul
> > > > 
> > > > RCU and others rely on is_idle_task() might be broken with the
> > > > consolidated idle code since caller of do_idle may have pid != 0.
> > > > 
> > > > Should we use TS_POLL or introduce a new flag to identify idle task?
> > > 
> > > PF_IDLE would be my preference, I checked and we seem to have a grand
> > > total of 2 unused task_struct::flags left ;-)
> > 
> > As long as RCU has some reliable way to identify an idle task, I am
> > good.  But I have to ask -- why can't idle injection coordinate with
> > the existing idle tasks rather than temporarily making alternative
> > idle tasks?
> 
> Because that'd completely wreck how the scheduler selects tasks for just
> these 2 arguably insane drivers.
> 
> We'd have to somehow teach it to pick the actual idle task instead of
> this one task, but keep scheduling the rest of the tasks like normal --
> we very much should keep higher priority tasks running like normal.
> 
> And we'd need a way to make it stop doing this 'proxy' execution.
> 
> That said, once we manage to replace the entire PI implementation with a
> proper proxy execution scheme, the above would be possible by having a
> resource (rt_mutex) associated with every idle task, and always held by
> that task.
> 
> At that point we can do something like:
> 
>   rt_mutex_lock_timeout(cpu_idle_lock(cpu), jiffies);
> 
> And get the idle thread executing in our stead.
> 
> That said, idle is _special_ and I'd not be surprised we'd find a few
> 'funnies' along the way of trying to get that to actually work.
> 
> For now I'd rather not go there quite yet.

Fair enough!

							Thanx, Paul

Paul E. McKenney Nov. 21, 2013, 7:19 p.m. UTC | #15
On Thu, Nov 21, 2013 at 08:21:03AM -0800, Arjan van de Ven wrote:
> On 11/21/2013 8:07 AM, Paul E. McKenney wrote:
> >As long as RCU has some reliable way to identify an idle task, I am
> >good.  But I have to ask -- why can't idle injection coordinate with
> >the existing idle tasks rather than temporarily making alternative
> >idle tasks?
> 
> it's not a real idle. that's the whole problem of the situation.
> to the rest of the OS, this is being BUSY (busy saving power using
> a CPU instruction, but it might as well have been an mdelay() operation)
> and it's also what end users expect; they want to be able to see
> where there performance (read: cpu time in "top") is going.

My concern is keeping RCU's books straight.  Suppose that there is a need
to call for idle in the middle of a preemptible RCU read-side critical
section.  Now, if that call for idle involves a context switch, all is
well -- RCU will see the task as still being in its RCU read-side critical
section, which means that it is OK for RCU to see the CPU as idle.

However, if there is no context switch and RCU sees the CPU as idle,
preemptible RCU could prematurely end the grace period.  If there is no
context switch and RCU sees the CPU as non-idle for too long, we start
getting RCU CPU stall warning splats.

Another approach would be to only inject idle when the CPU is not
doing anything that could possibly be in an RCU read-side critical
section.  But things might get a bit hot in case of an overly
long RCU read-side critical section.

One approach that might work would be to hook into RCU's context-switch
code going in and coming out, then telling RCU that the CPU is idle,
even though top and friends see it as non-idle.  This last is in fact
similar to how RCU handles userspace execution for NO_HZ_FULL.

Or were you thinking of yet another possible approach for this?
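
For comparison, the bracketing that last option describes is essentially what
the generic do_idle() path in this patch already does around the low-power
instruction (condensed; inject_idle_window() is just an illustrative name):

	static void inject_idle_window(void)
	{
		stop_critical_timings();
		rcu_idle_enter();	/* RCU treats the CPU as idle from here... */
		arch_cpu_idle();	/* ...while top still charges the time to the thread */
		rcu_idle_exit();
		start_critical_timings();
	}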

							Thanx, Paul

Arjan van de Ven Nov. 21, 2013, 7:45 p.m. UTC | #16
On 11/21/2013 11:19 AM, Paul E. McKenney wrote:
> On Thu, Nov 21, 2013 at 08:21:03AM -0800, Arjan van de Ven wrote:
>> On 11/21/2013 8:07 AM, Paul E. McKenney wrote:
>>> As long as RCU has some reliable way to identify an idle task, I am
>>> good.  But I have to ask -- why can't idle injection coordinate with
>>> the existing idle tasks rather than temporarily making alternative
>>> idle tasks?
>>
>> it's not a real idle. that's the whole problem of the situation.
>> to the rest of the OS, this is being BUSY (busy saving power using
>> a CPU instruction, but it might as well have been an mdelay() operation)
>> and it's also what end users expect; they want to be able to see
>> where there performance (read: cpu time in "top") is going.
>
> My concern is keeping RCU's books straight.  Suppose that there is a need
> to call for idle in the middle of a preemptible RCU read-side critical
> section.  Now, if that call for idle involves a context switch, all is
> well -- RCU will see the task as still being in its RCU read-side critical
> section, which means that it is OK for RCU to see the CPU as idle.
>
> However, if there is no context switch and RCU sees the CPU as idle,
> preemptible RCU could prematurely end the grace period.  If there is no
> context switch and RCU sees the CPU as non-idle for too long, we start
> getting RCU CPU stall warning splats.
>
> Another approach would be to only inject idle when the CPU is not
> doing anything that could possibly be in an RCU read-side critical
> section.  But things might get a bit hot in case of an overly
> long RCU read-side critical section.
>
> One approach that might work would be to hook into RCU's context-switch
> code going in and coming out, then telling RCU that the CPU is idle,
> even though top and friends see it as non-idle.  This last is in fact
> similar to how RCU handles userspace execution for NO_HZ_FULL.
>

so powerclamp and such are not "idle".
They are "busy" from everything except the lowest level of the CPU hardware.
once you start thinking of them as idle, all hell breaks loose in terms of implications
(including sysadmin visibility etc).... (hence some of the explosions in this thread
as well).

but it's not "idle".

it's "put the cpu in a low power state for a specified amount of time". sure it uses the same
instruction to do so that the idle loop uses.

(now to make it messy, the current driver does a bunch of things similar to the idle loop
which is a mess and fair to complain about)


Paul E. McKenney Nov. 21, 2013, 8:07 p.m. UTC | #17
On Thu, Nov 21, 2013 at 11:45:20AM -0800, Arjan van de Ven wrote:
> On 11/21/2013 11:19 AM, Paul E. McKenney wrote:
> >On Thu, Nov 21, 2013 at 08:21:03AM -0800, Arjan van de Ven wrote:
> >>On 11/21/2013 8:07 AM, Paul E. McKenney wrote:
> >>>As long as RCU has some reliable way to identify an idle task, I am
> >>>good.  But I have to ask -- why can't idle injection coordinate with
> >>>the existing idle tasks rather than temporarily making alternative
> >>>idle tasks?
> >>
> >>it's not a real idle. that's the whole problem of the situation.
> >>to the rest of the OS, this is being BUSY (busy saving power using
> >>a CPU instruction, but it might as well have been an mdelay() operation)
> >>and it's also what end users expect; they want to be able to see
> >>where there performance (read: cpu time in "top") is going.
> >
> >My concern is keeping RCU's books straight.  Suppose that there is a need
> >to call for idle in the middle of a preemptible RCU read-side critical
> >section.  Now, if that call for idle involves a context switch, all is
> >well -- RCU will see the task as still being in its RCU read-side critical
> >section, which means that it is OK for RCU to see the CPU as idle.
> >
> >However, if there is no context switch and RCU sees the CPU as idle,
> >preemptible RCU could prematurely end the grace period.  If there is no
> >context switch and RCU sees the CPU as non-idle for too long, we start
> >getting RCU CPU stall warning splats.
> >
> >Another approach would be to only inject idle when the CPU is not
> >doing anything that could possibly be in an RCU read-side critical
> >section.  But things might get a bit hot in case of an overly
> >long RCU read-side critical section.
> >
> >One approach that might work would be to hook into RCU's context-switch
> >code going in and coming out, then telling RCU that the CPU is idle,
> >even though top and friends see it as non-idle.  This last is in fact
> >similar to how RCU handles userspace execution for NO_HZ_FULL.
> >
> 
> so powerclamp and such are not "idle".
> They are "busy" from everything except the lowest level of the CPU hardware.
> once you start thinking of them as idle, all hell breaks lose in terms of implications
> (including sysadmin visibility etc).... (hence some of the explosions in this thread
> as well).
> 
> but it's not "idle".
> 
> it's "put the cpu in a low power state for a specified amount of time". sure it uses the same
> instruction to do so that the idle loop uses.
> 
> (now to make it messy, the current driver does a bunch of things similar to the idle loop
> which is a mess and fair to be complained about)

Then from an RCU viewpoint, they need to be short in duration.  Otherwise
you risk getting CPU stall-warning explosions from RCU.  ;-)

							Thanx, Paul

Jacob Pan Nov. 22, 2013, 12:10 a.m. UTC | #18
On Thu, 21 Nov 2013 12:07:17 -0800
"Paul E. McKenney" <paulmck@linux.vnet.ibm.com> wrote:

> On Thu, Nov 21, 2013 at 11:45:20AM -0800, Arjan van de Ven wrote:
> > On 11/21/2013 11:19 AM, Paul E. McKenney wrote:
> > >On Thu, Nov 21, 2013 at 08:21:03AM -0800, Arjan van de Ven wrote:
> > >>On 11/21/2013 8:07 AM, Paul E. McKenney wrote:
> > >>>As long as RCU has some reliable way to identify an idle task, I
> > >>>am good.  But I have to ask -- why can't idle injection
> > >>>coordinate with the existing idle tasks rather than temporarily
> > >>>making alternative idle tasks?
> > >>
> > >>it's not a real idle. that's the whole problem of the situation.
> > >>to the rest of the OS, this is being BUSY (busy saving power using
> > >>a CPU instruction, but it might as well have been an mdelay()
> > >>operation) and it's also what end users expect; they want to be
> > >>able to see where there performance (read: cpu time in "top") is
> > >>going.
> > >
> > >My concern is keeping RCU's books straight.  Suppose that there is
> > >a need to call for idle in the middle of a preemptible RCU
> > >read-side critical section.  Now, if that call for idle involves a
> > >context switch, all is well -- RCU will see the task as still
> > >being in its RCU read-side critical section, which means that it
> > >is OK for RCU to see the CPU as idle.
> > >
> > >However, if there is no context switch and RCU sees the CPU as
> > >idle, preemptible RCU could prematurely end the grace period.  If
> > >there is no context switch and RCU sees the CPU as non-idle for
> > >too long, we start getting RCU CPU stall warning splats.
> > >
> > >Another approach would be to only inject idle when the CPU is not
> > >doing anything that could possibly be in an RCU read-side critical
> > >section.  But things might get a bit hot in case of an overly
> > >long RCU read-side critical section.
> > >
> > >One approach that might work would be to hook into RCU's
> > >context-switch code going in and coming out, then telling RCU that
> > >the CPU is idle, even though top and friends see it as non-idle.
> > >This last is in fact similar to how RCU handles userspace
> > >execution for NO_HZ_FULL.
> > >
> > 
> > so powerclamp and such are not "idle".
> > They are "busy" from everything except the lowest level of the CPU
> > hardware. once you start thinking of them as idle, all hell breaks
> > lose in terms of implications (including sysadmin visibility
> > etc).... (hence some of the explosions in this thread as well).
> > 
> > but it's not "idle".
> > 
> > it's "put the cpu in a low power state for a specified amount of
> > time". sure it uses the same instruction to do so that the idle
> > loop uses.
> > 
> > (now to make it messy, the current driver does a bunch of things
> > similar to the idle loop which is a mess and fair to be complained
> > about)
> 
> Then from an RCU viewpoint, they need to be short in duration.
> Otherwise you risk getting CPU stall-warning explosions from RCU.  ;-)
> 
> 							Thanx, Paul
> 
Currently powerclamp allows idle injection durations between 6 and 25 ms.
I guess that is short considering the stall check is in seconds?
	return till_stall_check * HZ + RCU_STALL_DELAY_DELTA;

BTW, by forcing intel_idle to use the deepest C-states for the idle injection
thread, the efficiency problem is gone. I am surprised that cpuidle
would not pick the deepest C-states given the powerclamp driver is asking
for 6 ms of idle time and the wakeup latencies are in the usec range.
Anyway, for what I have tested so far, powerclamp with this patchset
works as well as the code before.
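
For scale (assuming the default CONFIG_RCU_CPU_STALL_TIMEOUT of 21 seconds,
which is an assumption about the configuration here): the stall threshold is
till_stall_check * HZ ~= 21 * HZ jiffies, i.e. about 21000 ms, so a 6-25 ms
injection window sits roughly three orders of magnitude below it.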


[Jacob Pan]
Paul E. McKenney Nov. 22, 2013, 4:20 a.m. UTC | #19
On Thu, Nov 21, 2013 at 04:10:05PM -0800, Jacob Pan wrote:
> On Thu, 21 Nov 2013 12:07:17 -0800
> "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> wrote:
> 
> > On Thu, Nov 21, 2013 at 11:45:20AM -0800, Arjan van de Ven wrote:
> > > On 11/21/2013 11:19 AM, Paul E. McKenney wrote:
> > > >On Thu, Nov 21, 2013 at 08:21:03AM -0800, Arjan van de Ven wrote:
> > > >>On 11/21/2013 8:07 AM, Paul E. McKenney wrote:
> > > >>>As long as RCU has some reliable way to identify an idle task, I
> > > >>>am good.  But I have to ask -- why can't idle injection
> > > >>>coordinate with the existing idle tasks rather than temporarily
> > > >>>making alternative idle tasks?
> > > >>
> > > >>it's not a real idle. that's the whole problem of the situation.
> > > >>to the rest of the OS, this is being BUSY (busy saving power using
> > > >>a CPU instruction, but it might as well have been an mdelay()
> > > >>operation) and it's also what end users expect; they want to be
> > > >>able to see where there performance (read: cpu time in "top") is
> > > >>going.
> > > >
> > > >My concern is keeping RCU's books straight.  Suppose that there is
> > > >a need to call for idle in the middle of a preemptible RCU
> > > >read-side critical section.  Now, if that call for idle involves a
> > > >context switch, all is well -- RCU will see the task as still
> > > >being in its RCU read-side critical section, which means that it
> > > >is OK for RCU to see the CPU as idle.
> > > >
> > > >However, if there is no context switch and RCU sees the CPU as
> > > >idle, preemptible RCU could prematurely end the grace period.  If
> > > >there is no context switch and RCU sees the CPU as non-idle for
> > > >too long, we start getting RCU CPU stall warning splats.
> > > >
> > > >Another approach would be to only inject idle when the CPU is not
> > > >doing anything that could possibly be in an RCU read-side critical
> > > >section.  But things might get a bit hot in case of an overly
> > > >long RCU read-side critical section.
> > > >
> > > >One approach that might work would be to hook into RCU's
> > > >context-switch code going in and coming out, then telling RCU that
> > > >the CPU is idle, even though top and friends see it as non-idle.
> > > >This last is in fact similar to how RCU handles userspace
> > > >execution for NO_HZ_FULL.
> > > >
> > > 
> > > so powerclamp and such are not "idle".
> > > They are "busy" from everything except the lowest level of the CPU
> > > hardware. once you start thinking of them as idle, all hell breaks
> > > lose in terms of implications (including sysadmin visibility
> > > etc).... (hence some of the explosions in this thread as well).
> > > 
> > > but it's not "idle".
> > > 
> > > it's "put the cpu in a low power state for a specified amount of
> > > time". sure it uses the same instruction to do so that the idle
> > > loop uses.
> > > 
> > > (now to make it messy, the current driver does a bunch of things
> > > similar to the idle loop which is a mess and fair to be complained
> > > about)
> > 
> > Then from an RCU viewpoint, they need to be short in duration.
> > Otherwise you risk getting CPU stall-warning explosions from RCU.  ;-)
> > 
> > 							Thanx, Paul
> > 
> currently powerclamp allow idle injection duration between 6 to 25ms.
> I guess that is short considering the stall check is in seconds?
> 	return till_stall_check * HZ + RCU_STALL_DELAY_DELTA;

The 6ms to 25ms range should be just fine as far as normal RCU grace
periods are concerned.  However, it does mean that expedited grace
periods could be delayed: They normally take a few tens of microseconds,
but if they were unlucky enough to show up during an idle injection,
they would be magnified by two to three orders of magnitude, which is
not pretty.

Hence my suggestion of hooking into RCU on idle-injection start and end
so that RCU considers that time period to be idle, just like it does
for user-mode execution on NO_HZ_FULL kernels; so I still don't see this
approach as a problem.  I must confess that I still don't understand
what Arjan doesn't like about it.

							Thanx, Paul

> BTW, by forcing intel_idle to use deepest c-states for idle injection
> thread the efficiency problem is gone. I am surprised that cpuidle
> would not pick the deepest c-states given powerclamp driver is asking
> for 6ms idle time and the wakeup latencies are in the usec.
> Anyway, for what i have tested so far powerclamp with this patchset can
> work as well as the code before.
> 
> 
> [Jacob Pan]
> 

Peter Zijlstra Nov. 22, 2013, 11:33 a.m. UTC | #20
On Thu, Nov 21, 2013 at 08:20:36PM -0800, Paul E. McKenney wrote:
> The 6ms to 25ms range should be just fine as far as normal RCU grace
> periods are concerned.  However, it does mean that expedited grace
> periods could be delayed: They normally take a few tens of microseconds,
> but if they were unlucky enough to show up during an idle injection,
> they would be magnified by two to three orders of magnitude, which is
> not pretty.
> 
> Hence my suggestion of hooking into RCU on idle-injection start and end
> so that RCU considers that time period to be idle.  Just like it does
> for user-mode execution on NO_HZ_FULL kernels, so I still don't see this
> approach to be a problem.  I must confess that I still don't understand
> what Arjan doesn't like about it.

Using these patches it would indeed use the RCU idle machinery as per
the normal idle path.

If you want, I can add more WARN_ON()s in play_idle() to ensure we're not
called while holding any RCU locks.
Paul E. McKenney Nov. 22, 2013, 5:17 p.m. UTC | #21
On Fri, Nov 22, 2013 at 12:33:00PM +0100, Peter Zijlstra wrote:
> On Thu, Nov 21, 2013 at 08:20:36PM -0800, Paul E. McKenney wrote:
> > The 6ms to 25ms range should be just fine as far as normal RCU grace
> > periods are concerned.  However, it does mean that expedited grace
> > periods could be delayed: They normally take a few tens of microseconds,
> > but if they were unlucky enough to show up during an idle injection,
> > they would be magnified by two to three orders of magnitude, which is
> > not pretty.
> > 
> > Hence my suggestion of hooking into RCU on idle-injection start and end
> > so that RCU considers that time period to be idle.  Just like it does
> > for user-mode execution on NO_HZ_FULL kernels, so I still don't see this
> > approach to be a problem.  I must confess that I still don't understand
> > what Arjan doesn't like about it.
> 
> Using these patches it would indeed use the RCU idle machinery as per
> the normal idle path.

OK, sorry for my confusion!

> If you can I can add more WARN_ON()s in play_idle() to ensure we're not
> called while holding any RCU locks.

An rcu_sleep_check() or something similar, please!
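
A minimal sketch of that check at the top of play_idle() (not in the posted
patch; rcu_sleep_check() is the existing CONFIG_PROVE_RCU helper also used by
might_sleep()):

	void play_idle(unsigned long duration)
	{
		/* complain if we are inside an RCU read-side critical section */
		rcu_sleep_check();

		WARN_ON_ONCE(current->policy != SCHED_FIFO);
		...
	}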

							Thanx, Paul


Patch

--- a/drivers/acpi/acpi_pad.c
+++ b/drivers/acpi/acpi_pad.c
@@ -41,9 +41,7 @@  static DEFINE_MUTEX(round_robin_lock);
 static unsigned long power_saving_mwait_eax;
 
 static unsigned char tsc_detected_unstable;
-static unsigned char tsc_marked_unstable;
 static unsigned char lapic_detected_unstable;
-static unsigned char lapic_marked_unstable;
 
 static void power_saving_mwait_init(void)
 {
@@ -153,10 +151,9 @@  static int power_saving_thread(void *dat
 	unsigned int tsk_index = (unsigned long)data;
 	u64 last_jiffies = 0;
 
-	sched_setscheduler(current, SCHED_RR, &param);
+	sched_setscheduler(current, SCHED_FIFO, &param);
 
 	while (!kthread_should_stop()) {
-		int cpu;
 		u64 expire_time;
 
 		try_to_freeze();
@@ -171,41 +168,7 @@  static int power_saving_thread(void *dat
 
 		expire_time = jiffies + HZ * (100 - idle_pct) / 100;
 
-		while (!need_resched()) {
-			if (tsc_detected_unstable && !tsc_marked_unstable) {
-				/* TSC could halt in idle, so notify users */
-				mark_tsc_unstable("TSC halts in idle");
-				tsc_marked_unstable = 1;
-			}
-			if (lapic_detected_unstable && !lapic_marked_unstable) {
-				int i;
-				/* LAPIC could halt in idle, so notify users */
-				for_each_online_cpu(i)
-					clockevents_notify(
-						CLOCK_EVT_NOTIFY_BROADCAST_ON,
-						&i);
-				lapic_marked_unstable = 1;
-			}
-			local_irq_disable();
-			cpu = smp_processor_id();
-			if (lapic_marked_unstable)
-				clockevents_notify(
-					CLOCK_EVT_NOTIFY_BROADCAST_ENTER, &cpu);
-			stop_critical_timings();
-
-			mwait_idle_with_hints(power_saving_mwait_eax, 1);
-
-			start_critical_timings();
-			if (lapic_marked_unstable)
-				clockevents_notify(
-					CLOCK_EVT_NOTIFY_BROADCAST_EXIT, &cpu);
-			local_irq_enable();
-
-			if (jiffies > expire_time) {
-				do_sleep = 1;
-				break;
-			}
-		}
+		play_idle(expire_time);
 
 		/*
 		 * current sched_rt has threshold for rt task running time.
--- a/drivers/thermal/intel_powerclamp.c
+++ b/drivers/thermal/intel_powerclamp.c
@@ -247,11 +247,6 @@  static u64 pkg_state_counter(void)
 	return count;
 }
 
-static void noop_timer(unsigned long foo)
-{
-	/* empty... just the fact that we get the interrupt wakes us up */
-}
-
 static unsigned int get_compensation(int ratio)
 {
 	unsigned int comp = 0;
@@ -356,7 +351,6 @@  static bool powerclamp_adjust_controls(u
 static int clamp_thread(void *arg)
 {
 	int cpunr = (unsigned long)arg;
-	DEFINE_TIMER(wakeup_timer, noop_timer, 0, 0);
 	static const struct sched_param param = {
 		.sched_priority = MAX_USER_RT_PRIO/2,
 	};
@@ -365,11 +359,9 @@  static int clamp_thread(void *arg)
 
 	set_bit(cpunr, cpu_clamping_mask);
 	set_freezable();
-	init_timer_on_stack(&wakeup_timer);
 	sched_setscheduler(current, SCHED_FIFO, &param);
 
-	while (true == clamping && !kthread_should_stop() &&
-		cpu_online(cpunr)) {
+	while (clamping && !kthread_should_stop() && cpu_online(cpunr)) {
 		int sleeptime;
 		unsigned long target_jiffies;
 		unsigned int guard;
@@ -417,35 +409,11 @@  static int clamp_thread(void *arg)
 		if (should_skip)
 			continue;
 
-		target_jiffies = jiffies + duration_jiffies;
-		mod_timer(&wakeup_timer, target_jiffies);
 		if (unlikely(local_softirq_pending()))
 			continue;
-		/*
-		 * stop tick sched during idle time, interrupts are still
-		 * allowed. thus jiffies are updated properly.
-		 */
-		preempt_disable();
-		tick_nohz_idle_enter();
-		/* mwait until target jiffies is reached */
-		while (time_before(jiffies, target_jiffies)) {
-			unsigned long ecx = 1;
-			unsigned long eax = target_mwait;
-
-			/*
-			 * REVISIT: may call enter_idle() to notify drivers who
-			 * can save power during cpu idle. same for exit_idle()
-			 */
-			local_touch_nmi();
-			stop_critical_timings();
-			mwait_idle_with_hints(eax, ecx);
-			start_critical_timings();
-			atomic_inc(&idle_wakeup_counter);
-		}
-		tick_nohz_idle_exit();
-		preempt_enable_no_resched();
+
+		play_idle(duration_jiffies);
 	}
-	del_timer_sync(&wakeup_timer);
 	clear_bit(cpunr, cpu_clamping_mask);
 
 	return 0;
--- a/include/linux/cpu.h
+++ b/include/linux/cpu.h
@@ -215,6 +215,8 @@  enum cpuhp_state {
 	CPUHP_ONLINE,
 };
 
+void play_idle(unsigned long jiffies);
+
 void cpu_startup_entry(enum cpuhp_state state);
 void cpu_idle(void);
 
--- a/kernel/cpu/idle.c
+++ b/kernel/cpu/idle.c
@@ -63,62 +63,88 @@  void __weak arch_cpu_idle(void)
 }
 
 /*
- * Generic idle loop implementation
+ * Generic idle cycle.
  */
-static void cpu_idle_loop(void)
+static void do_idle(void)
 {
-	while (1) {
-		tick_nohz_idle_enter();
+	tick_nohz_idle_enter();
 
-		while (!need_resched()) {
-			check_pgt_cache();
-			rmb();
-
-			if (cpu_is_offline(smp_processor_id()))
-				arch_cpu_idle_dead();
-
-			local_irq_disable();
-			arch_cpu_idle_enter();
-
-			/*
-			 * In poll mode we reenable interrupts and spin.
-			 *
-			 * Also if we detected in the wakeup from idle
-			 * path that the tick broadcast device expired
-			 * for us, we don't want to go deep idle as we
-			 * know that the IPI is going to arrive right
-			 * away
-			 */
-			if (cpu_idle_force_poll || tick_check_broadcast_expired()) {
-				cpu_idle_poll();
-			} else {
-				if (!current_clr_polling_and_test()) {
-					stop_critical_timings();
-					rcu_idle_enter();
-					arch_cpu_idle();
-					WARN_ON_ONCE(irqs_disabled());
-					rcu_idle_exit();
-					start_critical_timings();
-				} else {
-					local_irq_enable();
-				}
-				__current_set_polling();
-			}
-			arch_cpu_idle_exit();
-		}
+	while (!need_resched()) {
+		check_pgt_cache();
+		rmb();
+
+		if (cpu_is_offline(smp_processor_id()))
+			arch_cpu_idle_dead();
+
+		local_irq_disable();
+		arch_cpu_idle_enter();
 
 		/*
-		 * We need to test and propagate the TIF_NEED_RESCHED bit here
-		 * because we might not have send the reschedule IPI to idle
-		 * tasks.
+		 * In poll mode we reenable interrupts and spin.
+		 *
+		 * Also if we detected in the wakeup from idle path that the
+		 * tick broadcast device expired for us, we don't want to go
+		 * deep idle as we know that the IPI is going to arrive right
+		 * away
 		 */
-		preempt_fold_need_resched();
-		tick_nohz_idle_exit();
-		schedule_preempt_disabled();
+		if (cpu_idle_force_poll || tick_check_broadcast_expired()) {
+			cpu_idle_poll();
+		} else {
+			if (!current_clr_polling_and_test()) {
+				stop_critical_timings();
+				rcu_idle_enter();
+				arch_cpu_idle();
+				WARN_ON_ONCE(irqs_disabled());
+				rcu_idle_exit();
+				start_critical_timings();
+			} else {
+				local_irq_enable();
+			}
+			__current_set_polling();
+		}
+		arch_cpu_idle_exit();
 	}
+
+	/*
+	 * We need to test and propagate the TIF_NEED_RESCHED bit here
+	 * because we might not have send the reschedule IPI to idle
+	 * tasks.
+	 */
+	preempt_fold_need_resched();
+	tick_nohz_idle_exit();
+	schedule_preempt_disabled();
+}
+
+static void play_idle_timer(unsigned long foo)
+{
+	set_tsk_need_resched(current);
+}
+
+void play_idle(unsigned long duration)
+{
+	DEFINE_TIMER(wakeup_timer, play_idle_timer, 0, 0);
+
+	/*
+	 * Only FIFO tasks can disable the tick since they don't need the forced
+	 * preemption.
+	 */
+	WARN_ON_ONCE(current->policy != SCHED_FIFO);
+	WARN_ON_ONCE(current->nr_cpus_allowed != 1);
+	WARN_ON_ONCE(!(current->flags & PF_NO_SETAFFINITY));
+	WARN_ON_ONCE(!(current->flags & PF_KTHREAD));
+
+	init_timer_on_stack(&wakeup_timer);
+	mod_timer_pinned(&wakeup_timer, jiffies + duration);
+
+	preempt_disable();
+	do_idle();
+	del_timer_sync(&wakeup_timer);
+	preempt_fold_need_resched();
+	preempt_enable();
 }
+EXPORT_SYMBOL_GPL(play_idle);
 
-void cpu_startup_entry(enum cpuhp_state state)
+__noreturn void cpu_startup_entry(enum cpuhp_state state)
 {
 	/*
 	 * This #ifdef needs to die, but it's too late in the cycle to
@@ -137,5 +163,6 @@  void cpu_startup_entry(enum cpuhp_state
 #endif
 	__current_set_polling();
 	arch_cpu_idle_prepare();
-	cpu_idle_loop();
+	while (1)
+		do_idle();
 }
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -804,7 +804,6 @@  void tick_nohz_idle_enter(void)
 
 	local_irq_enable();
 }
-EXPORT_SYMBOL_GPL(tick_nohz_idle_enter);
 
 /**
  * tick_nohz_irq_exit - update next tick event from interrupt exit
@@ -932,7 +931,6 @@  void tick_nohz_idle_exit(void)
 
 	local_irq_enable();
 }
-EXPORT_SYMBOL_GPL(tick_nohz_idle_exit);
 
 static int tick_nohz_reprogram(struct tick_sched *ts, ktime_t now)
 {