
[v2] cpuidle: poll_state: Add time limit to poll_idle()

Message ID 4137867.C4jYrWdt8n@aspire.rjw.lan (mailing list archive)
State Superseded, archived

Commit Message

Rafael J. Wysocki March 12, 2018, 9:36 a.m. UTC
From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

If poll_idle() is allowed to spin until need_resched() returns 'true',
it may actually spin for a much longer time than expected by the idle
governor, since set_tsk_need_resched() is not always called by the
timer interrupt handler.  If that happens, the CPU may spend much
more time than anticipated in the "polling" state.

To prevent that from happening, limit the time of the spinning loop
in poll_idle().

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
---

-> v2: After additional testing reduce POLL_IDLE_TIME_CHECK_COUNT to 1000.

---
 drivers/cpuidle/poll_state.c |   17 ++++++++++++++++-
 1 file changed, 16 insertions(+), 1 deletion(-)
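
For a sense of scale (assuming HZ=1000, so TICK_NSEC is about
1,000,000 ns; both values scale with the kernel's HZ setting):

	POLL_IDLE_TIME_LIMIT = TICK_NSEC / 16
	                     = 1,000,000 ns / 16
	                     = 62,500 ns (62.5 us)

so the loop now spins for at most about 1/16 of a tick, and the clock
is read only once per POLL_IDLE_TIME_CHECK_COUNT (1000) iterations to
keep the cost of the check itself low.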

Comments

Rafael J. Wysocki March 14, 2018, 11:24 a.m. UTC | #1
On Monday, March 12, 2018 10:36:27 AM CET Rafael J. Wysocki wrote:
> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> 
> If poll_idle() is allowed to spin until need_resched() returns 'true',
> it may actually spin for a much longer time than expected by the idle
> governor, since set_tsk_need_resched() is not always called by the
> timer interrupt handler.  If that happens, the CPU may spend much
> more time than anticipated in the "polling" state.
> 
> To prevent that from happening, limit the time of the spinning loop
> in poll_idle().
> 
> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> ---
> 
> -> v2: After additional testing reduce POLL_IDLE_TIME_CHECK_COUNT to 1000.
> 
> ---
>  drivers/cpuidle/poll_state.c |   17 ++++++++++++++++-
>  1 file changed, 16 insertions(+), 1 deletion(-)
> 
> Index: linux-pm/drivers/cpuidle/poll_state.c
> ===================================================================
> --- linux-pm.orig/drivers/cpuidle/poll_state.c
> +++ linux-pm/drivers/cpuidle/poll_state.c
> @@ -5,16 +5,31 @@
>   */
>  
>  #include <linux/cpuidle.h>
> +#include <linux/ktime.h>
>  #include <linux/sched.h>
>  #include <linux/sched/idle.h>
>  
> +#define POLL_IDLE_TIME_CHECK_COUNT	1000
> +#define POLL_IDLE_TIME_LIMIT		(TICK_NSEC / 16)
> +
>  static int __cpuidle poll_idle(struct cpuidle_device *dev,
>  			       struct cpuidle_driver *drv, int index)
>  {
> +	ktime_t start = ktime_get();
> +
>  	local_irq_enable();
>  	if (!current_set_polling_and_test()) {
> -		while (!need_resched())
> +		unsigned int time_check_counter = 0;
> +
> +		while (!need_resched()) {
>  			cpu_relax();
> +			if (time_check_counter++ < POLL_IDLE_TIME_CHECK_COUNT)
> +				continue;
> +
> +			time_check_counter = 0;
> +			if (ktime_sub(ktime_get(), start) > POLL_IDLE_TIME_LIMIT)
> +				break;
> +		}
>  	}
>  	current_clr_polling();
>  

No comments, so I'm assuming no objections or concerns.

I've seen reports indicating that this patch alone may sometimes reduce
the CPU package power by as much as 30%.

Thanks!
Peter Zijlstra March 14, 2018, 12:04 p.m. UTC | #2
On Mon, Mar 12, 2018 at 10:36:27AM +0100, Rafael J. Wysocki wrote:
> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> 
> If poll_idle() is allowed to spin until need_resched() returns 'true',
> it may actually spin for a much longer time than expected by the idle
> governor, since set_tsk_need_resched() is not always called by the
> timer interrupt handler.  If that happens, the CPU may spend much
> more time than anticipated in the "polling" state.
> 
> To prevent that from happening, limit the time of the spinning loop
> in poll_idle().
> 
> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> ---
> 
> -> v2: After additional testing reduce POLL_IDLE_TIME_CHECK_COUNT to 1000.
> 
> ---
>  drivers/cpuidle/poll_state.c |   17 ++++++++++++++++-
>  1 file changed, 16 insertions(+), 1 deletion(-)
> 
> Index: linux-pm/drivers/cpuidle/poll_state.c
> ===================================================================
> --- linux-pm.orig/drivers/cpuidle/poll_state.c
> +++ linux-pm/drivers/cpuidle/poll_state.c
> @@ -5,16 +5,31 @@
>   */
>  
>  #include <linux/cpuidle.h>
> +#include <linux/ktime.h>
>  #include <linux/sched.h>
>  #include <linux/sched/idle.h>
>  
> +#define POLL_IDLE_TIME_CHECK_COUNT	1000
> +#define POLL_IDLE_TIME_LIMIT		(TICK_NSEC / 16)
> +
>  static int __cpuidle poll_idle(struct cpuidle_device *dev,
>  			       struct cpuidle_driver *drv, int index)
>  {
> +	ktime_t start = ktime_get();

I would recommend not using ktime_get(); imagine the 'joy' if that
happens to be the HPET.

>  	local_irq_enable();
>  	if (!current_set_polling_and_test()) {
> +		unsigned int time_check_counter = 0;
> +
> +		while (!need_resched()) {
>  			cpu_relax();
> +			if (time_check_counter++ < POLL_IDLE_TIME_CHECK_COUNT)
> +				continue;
> +
> +			time_check_counter = 0;
> +			if (ktime_sub(ktime_get(), start) > POLL_IDLE_TIME_LIMIT)
> +				break;
> +		}
>  	}
>  	current_clr_polling();

Since the idle loop is strictly per-cpu, you can use regular
sched_clock() here. Something like:

	u64 start = sched_clock();

	local_irq_enable();
	if (!current_set_polling_and_test()) {
		while (!need_resched()) {
			cpu_relax();

			if (sched_clock() - start > POLL_IDLE_TIME_LIMIT)
				break;
		}
	}
	current_clr_polling();

On x86 we don't have to use that time_check_counter thing, sched_clock()
is really cheap, not sure if it makes sense on other platforms.
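
For context: when the TSC is usable, sched_clock() on x86 boils down to
roughly the following (a simplified paraphrase of
arch/x86/kernel/tsc.c, not the exact source; cycles_2_ns() is a
multiply-and-shift conversion from TSC cycles to nanoseconds):

	u64 native_sched_clock(void)
	{
		/* One RDTSC plus arithmetic to convert cycles to ns. */
		return cycles_2_ns(rdtsc());
	}

That is on the order of a single instruction plus arithmetic, whereas a
ktime_get() backed by the HPET involves an MMIO read that can take on
the order of a microsecond.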
Rafael J. Wysocki March 14, 2018, 12:08 p.m. UTC | #3
On Wednesday, March 14, 2018 1:04:50 PM CET Peter Zijlstra wrote:
> On Mon, Mar 12, 2018 at 10:36:27AM +0100, Rafael J. Wysocki wrote:
> > From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> > 
> > If poll_idle() is allowed to spin until need_resched() returns 'true',
> > it may actually spin for a much longer time than expected by the idle
> > governor, since set_tsk_need_resched() is not always called by the
> > timer interrupt handler.  If that happens, the CPU may spend much
> > more time than anticipated in the "polling" state.
> > 
> > To prevent that from happening, limit the time of the spinning loop
> > in poll_idle().
> > 
> > Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> > ---
> > 
> > -> v2: After additional testing reduce POLL_IDLE_TIME_CHECK_COUNT to 1000.
> > 
> > ---
> >  drivers/cpuidle/poll_state.c |   17 ++++++++++++++++-
> >  1 file changed, 16 insertions(+), 1 deletion(-)
> > 
> > Index: linux-pm/drivers/cpuidle/poll_state.c
> > ===================================================================
> > --- linux-pm.orig/drivers/cpuidle/poll_state.c
> > +++ linux-pm/drivers/cpuidle/poll_state.c
> > @@ -5,16 +5,31 @@
> >   */
> >  
> >  #include <linux/cpuidle.h>
> > +#include <linux/ktime.h>
> >  #include <linux/sched.h>
> >  #include <linux/sched/idle.h>
> >  
> > +#define POLL_IDLE_TIME_CHECK_COUNT	1000
> > +#define POLL_IDLE_TIME_LIMIT		(TICK_NSEC / 16)
> > +
> >  static int __cpuidle poll_idle(struct cpuidle_device *dev,
> >  			       struct cpuidle_driver *drv, int index)
> >  {
> > +	ktime_t start = ktime_get();
> 
> I would recommend not using ktime_get(); imagine the 'joy' if that
> happens to be the HPET.
> 
> >  	local_irq_enable();
> >  	if (!current_set_polling_and_test()) {
> > +		unsigned int time_check_counter = 0;
> > +
> > +		while (!need_resched()) {
> >  			cpu_relax();
> > +			if (time_check_counter++ < POLL_IDLE_TIME_CHECK_COUNT)
> > +				continue;
> > +
> > +			time_check_counter = 0;
> > +			if (ktime_sub(ktime_get(), start) > POLL_IDLE_TIME_LIMIT)
> > +				break;
> > +		}
> >  	}
> >  	current_clr_polling();
> 
> Since the idle loop is strictly per-cpu, you can use regular
> sched_clock() here. Something like:
> 
> 	u64 start = sched_clock();
> 
> 	local_irq_enable();
> 	if (!current_set_polling_and_test()) {
> 		while (!need_resched()) {
> 			cpu_relax();
> 
> 			if (sched_clock() - start > POLL_IDLE_TIME_LIMIT)
> 				break;
> 		}
> 	}
> 	current_clr_polling();

Good idea!

> On x86 we don't have to use that time_check_counter thing, sched_clock()
> is really cheap, not sure if it makes sense on other platforms.

Let's do it the way you suggested, if there are issues with it, we
can still add the counter later.

Thanks!
Rik van Riel March 22, 2018, 4:32 p.m. UTC | #4
On Wed, 2018-03-14 at 13:04 +0100, Peter Zijlstra wrote:

> On x86 we don't have to use that time_check_counter thing,
> sched_clock()
> is really cheap, not sure if it makes sense on other platforms.

Are you sure? I saw a 5-10% increase in CPU use,
for a constant query rate to a memcache style
workload, with v3 of this patch.
Peter Zijlstra March 22, 2018, 7:11 p.m. UTC | #5
On Thu, Mar 22, 2018 at 12:32:18PM -0400, Rik van Riel wrote:
> On Wed, 2018-03-14 at 13:04 +0100, Peter Zijlstra wrote:
> 
> > On x86 we don't have to use that time_check_counter thing,
> > sched_clock()
> > is really cheap, not sure if it makes sense on other platforms.
> 
> Are you sure? I saw a 5-10% increase in CPU use,
> for a constant query rate to a memcache style
> workload, with v3 of this patch.

Well, I expected it to be really cheap, but if you have numbers saying
otherwise, clearly I'll need to adjust expectations ;-)
Doug Smythies March 22, 2018, 7:11 p.m. UTC | #6
On 2018.03.22 09:32 Rik van Riel wrote:
> On Wed, 2018-03-14 at 13:04 +0100, Peter Zijlstra wrote:
>
>> On x86 we don't have to use that time_check_counter thing,
>> sched_clock()
>> is really cheap, not sure if it makes sense on other platforms.
>
> Are you sure? I saw a 5-10% increase in CPU use,
> for a constant query rate to a memcache style
> workload, with v3 of this patch.

I would very much like to be able to repeat your test results.
However, I am not sure what you mean by "memcache style workload".
Is there a test you can point me to? Say a Phoronix type test, for example.

All of my tests with the V3 of this patch have been fine.

... Doug
Doug Smythies March 23, 2018, 3:19 a.m. UTC | #7
On 2018.03.22 12:12 Doug Smythies wrote: 
>On 2018.03.22 09:32 Rik van Riel wrote:
>> On Wed, 2018-03-14 at 13:04 +0100, Peter Zijlstra wrote:
>>
>>> On x86 we don't have to use that time_check_counter thing,
>>> sched_clock()
>>> is really cheap, not sure if it makes sense on other platforms.
>>
>> Are you sure? I saw a 5-10% increase in CPU use,
>> for a constant query rate to a memcache style
>> workload, with v3 of this patch.
>
> I would very much like to be able to repeat your test results.
> However, I am not sure what you mean by "memcache style workload".
> Is there a test you can point me to? Say a Phoronix type test, for example.
>
> All of my tests with the V3 of this patch have been fine.

What is the difference between sched_clock() talked about herein,
and local_clock() used in the patch?

I'm not sure how good it is but I made a test. I didn't believe
the results, so I did it 3 times.

V7.3 is as from the git branch.
V7.3p is plus the patch adding the counter loop to poll_state.c

The test is a tight loop (about 19600 loops per second) running
on all 8 CPUs. I can not seem to get my system to use Idle State
0, so I disabled Idle States 1 and 2 to force use of Idle State 0.

V7.3 uses a processor package power of 62.5 Watts
V7.3p uses a processor package power of 53.4 Watts, or 14.6% less power.

The loop times do not change.
The Idle state 0 residency per unit time does not change.

... Doug
Rafael J. Wysocki March 23, 2018, 8:57 a.m. UTC | #8
On Fri, Mar 23, 2018 at 4:19 AM, Doug Smythies <dsmythies@telus.net> wrote:
> On 2018.03.22 12:12 Doug Smythies wrote:
>>On 2018.03.22 09:32 Rik van Riel wrote:
>>> On Wed, 2018-03-14 at 13:04 +0100, Peter Zijlstra wrote:
>>>
>>>> On x86 we don't have to use that time_check_counter thing,
>>>> sched_clock()
>>>> is really cheap, not sure if it makes sense on other platforms.
>>>
>>> Are you sure? I saw a 5-10% increase in CPU use,
>>> for a constant query rate to a memcache style
>>> workload, with v3 of this patch.
>>
>> I would very much like to be able to repeat your test results.
>> However, I am not sure what you mean by "memcache style workload".
>> Is there a test you can point me to? Say a Phoronix type test, for example.
>>
>> All of my tests with the V3 of this patch have been fine.
>
> What is the difference between sched_clock() talked about herein,
> and local_clock() used in the patch?

It is almost the same (modulo offset) unless sched_clock_stable()
returns 'false'.
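
Roughly, the relationship looks like this (a simplified paraphrase of
kernel/sched/clock.c, not the exact source; details vary by kernel
version and configuration):

	u64 local_clock(void)
	{
		if (sched_clock_stable())
			return sched_clock() + __sched_clock_offset;

		/* Unstable clock: use the per-CPU filtered variant. */
		return sched_clock_cpu(raw_smp_processor_id());
	}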

> I'm not sure how good it is but I made a test. I didn't believe
> the results, so I did it 3 times.
>
> V7.3 is as from the git branch.
> V7.3p is plus the patch adding the counter loop to poll_state.c
>
> The test is a tight loop (about 19600 loops per second) running
> on all 8 CPUs. I can not seem to get my system to use Idle State
> 0, so I disabled Idle States 1 and 2 to force use of Idle State 0.
>
> V7.3 uses a processor package power of 62.5 Watts
> V7.3p uses a processor package power of 53.4 Watts, or 14.6% less power.
>
> The loop times do not change.
> The Idle state 0 residency per unit time does not change.

OK, so this means that the results should improve for Rik with this
patch too. :-)

Thanks!
Rafael J. Wysocki March 23, 2018, 9:07 a.m. UTC | #9
On Fri, Mar 23, 2018 at 9:57 AM, Rafael J. Wysocki <rafael@kernel.org> wrote:
> On Fri, Mar 23, 2018 at 4:19 AM, Doug Smythies <dsmythies@telus.net> wrote:
>> On 2018.03.22 12:12 Doug Smythies wrote:
>>>On 2018.03.22 09:32 Rik van Riel wrote:
>>>> On Wed, 2018-03-14 at 13:04 +0100, Peter Zijlstra wrote:
>>>>
>>>>> On x86 we don't have to use that time_check_counter thing,
>>>>> sched_clock()
>>>>> is really cheap, not sure if it makes sense on other platforms.
>>>>
>>>> Are you sure? I saw a 5-10% increase in CPU use,
>>>> for a constant query rate to a memcache style
>>>> workload, with v3 of this patch.
>>>
>>> I would very much like to be able to repeat your test results.
>>> However, I am not sure what you mean by "memcache style workload".
>>> Is there a test you can point me to? Say a Phoronix type test, for example.
>>>
>>> All of my tests with the V3 of this patch have been fine.
>>
>> What is the difference between sched_clock() talked about herein,
>> and local_clock() used in the patch?
>
> It is almost the same (modulo offset) unless sched_clock_stable()
> returns 'false'.
>
>> I'm not sure how good it is but I made a test. I didn't believe
>> the results, so I did it 3 times.
>>
>> V7.3 is as from the git branch.
>> V7.3p is plus the patch adding the counter loop to poll_state.c
>>
>> The test is a tight loop (about 19600 loops per second) running
>> on all 8 CPUs. I can not seem to get my system to use Idle State
>> 0, so I disabled Idle States 1 and 2 to force use of Idle State 0.
>>
>> V7.3 uses a processor package power of 62.5 Watts
>> V7.3p uses a processor package power of 53.4 Watts, or 14.6% less power.
>>
>> The loop times do not change.
>> The Idle state 0 residency per unit time does not change.
>
> OK, so this means that the results should improve for Rik with this
> patch too. :-)

BTW, can you possibly check how much of a difference it makes to
reduce POLL_IDLE_COUNT in the patch to, say, 500 or even more?

The lower it is, the less noise it will introduce AFAICS.
Doug Smythies March 23, 2018, 9:30 p.m. UTC | #10
On 2018.03.23 02:08 Rafael J. Wysocki wrote:
> On Fri, Mar 23, 2018 at 9:57 AM, Rafael J. Wysocki <rafael@kernel.org> wrote:
>> On Fri, Mar 23, 2018 at 4:19 AM, Doug Smythies <dsmythies@telus.net> wrote:
>>> On 2018.03.22 12:12 Doug Smythies wrote:

...[snip]...

>>
>>> I'm not sure how good it is but I made a test. I didn't believe
>>> the results, so I did it 3 times.
>>>
>>> V7.3 is as from the git branch.
>>> V7.3p is plus the patch adding the counter loop to poll_state.c
>>>
>>> The test is a tight loop (about 19600 loops per second) running
>>> on all 8 CPUs. I can not seem to get my system to use Idle State
>>> 0, so I disabled Idle States 1 and 2 to force use of Idle State 0.
>>>
>>> V7.3 uses a processor package power of 62.5 Watts
>>> V7.3p uses a processor package power of 53.4 Watts, or 14.6% less power.
>>>
>>> The loop times do not change.
>>> The Idle state 0 residency per unit time does not change.
>>
>> OK, so this means that the results should improve for Rik with this
>> patch too. :-)

I hope so.

> BTW, can you possibly check how much of a difference it makes to
> reduce POLL_IDLE_COUNT in the patch to, say, 500 or even more?
> 
> The lower it is, the less noise it will introduce AFAICS.

Well, we would expect the curve to be something like a typical 1/x curve:

Power = 53.4 + k1/(k2*POLL_IDLE_COUNT + k3)

I did some runs and did a crude fit:

Power ~= 53.4 + 35/(POLL_IDLE_COUNT + 3)

And then I calculated an allowed error from that. A count of 100
gives back only 0.64% of the power, so I suggest it would be a
reasonable number.
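
Working that through for a count of 100:

	extra power ~= 35/(100 + 3) ~= 0.34 Watts
	0.34 / 53.4 ~= 0.64%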

That being said, my test is quite crude and we should first see what
others, including Rik, get.

These two graphs might help explain what I did:

http://fast.smythies.com/v73p_vary_count.png
http://fast.smythies.com/v73p_extra_power.png

It is just my opinion, but I think users with very stringent
idle state 0 exit latency requirements should test with
POLL_IDLE_COUNT set to 1. Then they know the worst case works,
whereas at a probability of 1/POLL_IDLE_COUNT per iteration they
might never hit it in testing. Once happy that the worst case
works, they can use the nominal (T.B.D.) POLL_IDLE_COUNT for the
power savings.

... Doug
Rafael J. Wysocki March 27, 2018, 4:42 p.m. UTC | #11
On Thu, Mar 22, 2018 at 5:32 PM, Rik van Riel <riel@surriel.com> wrote:
> On Wed, 2018-03-14 at 13:04 +0100, Peter Zijlstra wrote:
>
>> On x86 we don't have to use that time_check_counter thing,
>> sched_clock()
>> is really cheap, not sure if it makes sense on other platforms.
>
> Are you sure? I saw a 5-10% increase in CPU use,
> for a constant query rate to a memcache style
> workload, with v3 of this patch.

I think I know what's going on.

Increased utilization with the same amount of work per time unit (and
I guess that's the case, given the lack of specific information about
the workload) means more non-idle time relative to total time, and
that implies reduced frequency (e.g. less turbo).

Now, combine that with Doug's observation that limiting the rate of
local_clock() invocations in the poll_idle() loop significantly
reduces power draw in his experiments, and with the other observation
that in both cases local_clock() ends up being rdtsc() (most likely).

What this implies to me is that invoking rdtsc() at a high rate on
multiple logical CPUs in parallel causes chips to get hot.  They may
actually get so hot that they hit power/thermal (e.g. RAPL) limits
and get their frequency reduced as a result.

Limiting the rate of local_clock() invocations obviously avoids this issue.
Rik van Riel March 27, 2018, 6:02 p.m. UTC | #12
On Tue, 2018-03-27 at 18:42 +0200, Rafael J. Wysocki wrote:
> On Thu, Mar 22, 2018 at 5:32 PM, Rik van Riel <riel@surriel.com>
> wrote:
> > On Wed, 2018-03-14 at 13:04 +0100, Peter Zijlstra wrote:
> > 
> > > On x86 we don't have to use that time_check_counter thing,
> > > sched_clock()
> > > is really cheap, not sure if it makes sense on other platforms.
> > 
> > Are you sure? I saw a 5-10% increase in CPU use,
> > for a constant query rate to a memcache style
> > workload, with v3 of this patch.
> 
> I think I know what's going on.

I ran my tests wrong, and the script never propagated
errors back to me. Sigh.

However, the poll_idle() that reads the TSC at a
reduced rate seems to perform better than the one
that reads the TSC every time it goes around the
loop.

The size of the idle loop seems to make a slight
difference, too. Having just one cpu_relax() in
the entire loop seems to be better than having
them all over the place.
Rafael J. Wysocki March 27, 2018, 9:09 p.m. UTC | #13
On Tue, Mar 27, 2018 at 8:02 PM, Rik van Riel <riel@surriel.com> wrote:
> On Tue, 2018-03-27 at 18:42 +0200, Rafael J. Wysocki wrote:
>> On Thu, Mar 22, 2018 at 5:32 PM, Rik van Riel <riel@surriel.com>
>> wrote:
>> > On Wed, 2018-03-14 at 13:04 +0100, Peter Zijlstra wrote:
>> >
>> > > On x86 we don't have to use that time_check_counter thing,
>> > > sched_clock()
>> > > is really cheap, not sure if it makes sense on other platforms.
>> >
>> > Are you sure? I saw a 5-10% increase in CPU use,
>> > for a constant query rate to a memcache style
>> > workload, with v3 of this patch.
>>
>> I think I know what's going on.
>
> I ran my tests wrong, and the script never propagated
> errors back to me. Sigh.

Again, no worries. :-)

> However, the poll_idle() that reads the TSC at a
> reduced rate seems to perform better than the one
> that reads the TSC every time it goes around the
> loop.

OK

> The size of the idle loop seems to make a slight
> difference, too. Having just one cpu_relax() in
> the entire loop seems to be better than having
> them all over the place.

OK

Thanks for the above observations; they match my understanding of
what's happening.

I'll resend the patch that reads the TSC at a reduced rate, with a
proper changelog, and I'll tentatively add it to my pm-cpuidle branch.

Thanks!
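
For reference, a minimal sketch of what such a rate-limited loop could
look like (combining Peter's clock-based limit with the original check
counter; this is an assumption about the resend, not the actual
follow-up patch, and names/constants may differ):

	u64 time_start = local_clock();

	local_irq_enable();
	if (!current_set_polling_and_test()) {
		unsigned int loop_count = 0;

		while (!need_resched()) {
			cpu_relax();
			if (loop_count++ < POLL_IDLE_TIME_CHECK_COUNT)
				continue;

			/* Only read the clock once per batch of iterations. */
			loop_count = 0;
			if (local_clock() - time_start > POLL_IDLE_TIME_LIMIT)
				break;
		}
	}
	current_clr_polling();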

Patch

Index: linux-pm/drivers/cpuidle/poll_state.c
===================================================================
--- linux-pm.orig/drivers/cpuidle/poll_state.c
+++ linux-pm/drivers/cpuidle/poll_state.c
@@ -5,16 +5,31 @@ 
  */
 
 #include <linux/cpuidle.h>
+#include <linux/ktime.h>
 #include <linux/sched.h>
 #include <linux/sched/idle.h>
 
+#define POLL_IDLE_TIME_CHECK_COUNT	1000
+#define POLL_IDLE_TIME_LIMIT		(TICK_NSEC / 16)
+
 static int __cpuidle poll_idle(struct cpuidle_device *dev,
 			       struct cpuidle_driver *drv, int index)
 {
+	ktime_t start = ktime_get();
+
 	local_irq_enable();
 	if (!current_set_polling_and_test()) {
-		while (!need_resched())
+		unsigned int time_check_counter = 0;
+
+		while (!need_resched()) {
 			cpu_relax();
+			if (time_check_counter++ < POLL_IDLE_TIME_CHECK_COUNT)
+				continue;
+
+			time_check_counter = 0;
+			if (ktime_sub(ktime_get(), start) > POLL_IDLE_TIME_LIMIT)
+				break;
+		}
 	}
 	current_clr_polling();