[RFT,v5,0/7] sched/cpuidle: Idle loop rework

Message ID 2043615.lCdO10SMaB@aspire.rjw.lan (mailing list archive)
State Not Applicable, archived

Commit Message

Rafael J. Wysocki March 18, 2018, 11 a.m. UTC
On Saturday, March 17, 2018 5:11:53 PM CET Doug Smythies wrote:
> On 2018.03.17 Thomas Ilsche wrote:
> 
> > Over the last week I tested v4+pollv2 and now v5+pollv3. With v5, I
> > observe a particular idle behavior, that I have not seen before with
> > v4. On a dual-socket Skylake system the idle power increases from
> > 74.1 W (system total) to 85.5 W with a 300 HZ build and even to
> > 138.3 W with a 1000 HZ build. A similar Haswell-EP system is also
> > affected.
> 
> I confirm your idle findings. There is a regression between V4 and V5.
> The differences on my test computer are much less than on yours,
> probably because I have only 8 CPUs.
> 
> http://fast.smythies.com/rjw_idle.png
> 
> 1000 Hz kernel only.

Doug, Thomas,

Thank you both for the reports, much appreciated!

Below is a drop-in v6 replacement for patch [4/7].

With this new patch applied instead of the [4/7], the behavior should be much
more in line with the v4 behavior, so please try it if you can and let me know
whether that really is the case on your systems.

Patches [5-7/7] from the original v5 apply on top of it right away for me,
but I've also created a git branch you can use to pull all of the series
with the below included:

 git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm.git \
 idle-loop

Thanks!

---
From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Subject: [PATCH v6] cpuidle: Return nohz hint from cpuidle_select()

Add a new pointer argument to cpuidle_select() and to the ->select
cpuidle governor callback to allow a boolean value indicating
whether or not the tick should be stopped before entering the
selected state to be returned from there.

Make the ladder governor ignore that pointer (to preserve its
current behavior) and make the menu governor return 'false' through
it if:
 (1) the idle exit latency is constrained at 0,
 (2) the selected state is a polling one, or
 (3) the selected state is not deep enough.

Since the value returned through the new argument pointer is not
used yet, this change is not expected to alter the functionality of
the code.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
---

v5 -> v6: Change tick stopping conditions in the menu governor to
  be more in line with the v4 behavior.

---
 drivers/cpuidle/cpuidle.c          |   10 ++++++--
 drivers/cpuidle/governors/ladder.c |    3 +-
 drivers/cpuidle/governors/menu.c   |   44 ++++++++++++++++++++++++++++++++-----
 include/linux/cpuidle.h            |    8 ++++--
 kernel/sched/idle.c                |    4 ++-
 5 files changed, 57 insertions(+), 12 deletions(-)

Comments

Rafael J. Wysocki March 18, 2018, 4:15 p.m. UTC | #1
On Sun, Mar 18, 2018 at 12:00 PM, Rafael J. Wysocki <rjw@rjwysocki.net> wrote:
> On Saturday, March 17, 2018 5:11:53 PM CET Doug Smythies wrote:
>> On 2018.03.17 Thomas Ilsche wrote:
>>
>> > Over the last week I tested v4+pollv2 and now v5+pollv3. With v5, I
>> > observe a particular idle behavior, that I have not seen before with
>> > v4. On a dual-socket Skylake system the idle power increases from
>> > 74.1 W (system total) to 85.5 W with a 300 HZ build and even to
>> > 138.3 W with a 1000 HZ build. A similar Haswell-EP system is also
>> > affected.
>>
>> I confirm your idle findings. There is a regression between V4 and V5.
>> The differences on my test computer are much less than on yours,
>> probably because I have only 8 CPUs.
>>
>> http://fast.smythies.com/rjw_idle.png
>>
>> 1000 Hz kernel only.
>
> Doug, Thomas,
>
> Thank you both for the reports, much appreciated!
>
> Below is a drop-in v6 replacement for patch [4/7].
>
> With this new patch applied instead of the [4/7] the behavior should be much
> more in line with the v4 behavior, so please try it if you can and let me know
> if that really is the case on your systems.
>
> Patches [5-7/7] from the original v5 apply on top of it right away for me,
> but I've also created a git branch you can use to pull all of the series
> with the below included:
>
>  git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm.git \
>  idle-loop
>
> Thanks!
>
> ---
> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> Subject: [PATCH v6] cpuidle: Return nohz hint from cpuidle_select()
>
> Add a new pointer argument to cpuidle_select() and to the ->select
> cpuidle governor callback to allow a boolean value indicating
> whether or not the tick should be stopped before entering the
> selected state to be returned from there.
>
> Make the ladder governor ignore that pointer (to preserve its
> current behavior) and make the menu governor return 'false" through
> it if:
>  (1) the idle exit latency is constrained at 0,
>  (2) the selected state is a polling one, or
>  (3) the selected state is not deep enough.
>
> Since the value returned through the new argument pointer is not
> used yet, this change is not expected to alter the functionality of
> the code.
>
> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> ---

[cut]

> @@ -354,6 +360,7 @@ static int menu_select(struct cpuidle_dr
>         if (latency_req > interactivity_req)
>                 latency_req = interactivity_req;
>
> +       expected_interval = TICK_USEC_HZ;
>         /*
>          * Find the idle state with the lowest power while satisfying
>          * our constraints.
> @@ -367,17 +374,44 @@ static int menu_select(struct cpuidle_dr
>                         continue;
>                 if (idx == -1)
>                         idx = i; /* first enabled state */
> -               if (s->target_residency > data->predicted_us)
> +               if (s->target_residency > data->predicted_us) {
> +                       /*
> +                        * Retain the tick if the selected state is shallower
> +                        * than the deepest available one with target residency
> +                        * within the tick period range.
> +                        *
> +                        * This allows the tick to be stopped even if the
> +                        * predicted idle duration is within the tick period
> +                        * range to counter the effect by which the prediction
> +                        * may be skewed towards lower values due to the tick
> +                        * bias.
> +                        */
> +                       expected_interval = s->target_residency;
>                         break;

BTW, I guess I need to explain the motivation here more thoroughly, so
here it goes.

The governor predicts idle duration under the assumption that the
tick will be stopped, so if the result of the prediction is within the tick
period range and it is not accurate, that needs to be taken into
account in the governor's statistics.  However, if the tick is allowed
to run every time the governor predicts idle duration within the tick
period range, the governor will always see that it was "almost
right" and the correction factor applied by it to improve the
prediction next time will not be sufficient.  For this reason, it
is better to stop the tick at least sometimes when the governor
predicts idle duration within the tick period range, and the idea
here is to do that when the selected state is the deepest available
one with the target residency within the tick period range.  This
seizes the opportunity to save more energy, which balances the
extra overhead of stopping the tick.
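
To make that easier to follow, here is a condensed, userspace-style
restatement of the selection loop from the patch (a sketch only, not
the actual kernel code: the latency-constrained exit and the
disabled-state handling are omitted, and TICK_PERIOD_US simply stands
in for TICK_USEC_HZ, which works out to ~1000 us at HZ=1000 and
~3333 us at HZ=300):

/* Sketch of the tick decision; not the actual kernel code. */
#include <stdbool.h>
#include <stdio.h>

#define TICK_PERIOD_US  1000

struct state {
        unsigned int target_residency;  /* us */
        bool polling;
};

static int select_state(const struct state *s, int n,
                        unsigned int predicted_us, bool *stop_tick)
{
        unsigned int expected_interval = TICK_PERIOD_US;
        int idx = 0, i;

        for (i = 0; i < n; i++) {
                if (s[i].target_residency > predicted_us) {
                        /*
                         * First state too deep for the prediction: if its
                         * target residency still fits within one tick
                         * period, the selected (shallower) state is not the
                         * deepest "sub-tick" state, so the tick is retained.
                         */
                        expected_interval = s[i].target_residency;
                        break;
                }
                idx = i;
        }

        /* Stop the tick only for a non-polling, deep enough selection. */
        *stop_tick = !s[idx].polling && expected_interval >= TICK_PERIOD_US;
        return idx;
}

int main(void)
{
        /* A polling state plus two C-states with made-up residencies. */
        const struct state states[] = {
                { 0, true }, { 10, false }, { 600, false },
        };
        bool stop_tick;
        int idx;

        idx = select_state(states, 3, 350, &stop_tick);
        printf("predicted 350 us -> state %d, stop tick: %d\n", idx, stop_tick);

        idx = select_state(states, 3, 5000, &stop_tick);
        printf("predicted 5000 us -> state %d, stop tick: %d\n", idx, stop_tick);
        return 0;
}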

HTH
Thomas Ilsche March 20, 2018, 10:01 a.m. UTC | #2
On 2018-03-18 17:15, Rafael J. Wysocki wrote:
>> Doug, Thomas,
>>
>> Thank you both for the reports, much appreciated!
>>
>> Below is a drop-in v6 replacement for patch [4/7].
>>
>> With this new patch applied instead of the [4/7] the behavior should be much
>> more in line with the v4 behavior, so please try it if you can and let me know
>> if that really is the case on your systems.
>>
>> Patches [5-7/7] from the original v5 apply on top of it right away for me,
>> but I've also created a git branch you can use to pull all of the series
>> with the below included:
>>
>>   git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm.git \
>>   idle-loop

Thanks for the git repo; that helps a lot. I have tested v6 on a
Skylake desktop and server system as well as a Haswell server system.
The odd idle behavior of v5 is gone.

Some of the other findings may be made obsolete by the upcoming
respin; I will retest.

Our originally observed Powernightmare pattern is effectively
prevented both in idle and with a synthetic trigger. However, I can
construct simple workloads under which the revised menu governor
wastes energy by going into *deeper* C-states than advisable.

Consider the Skylake server system, which has target residencies of
20 us for C1E and 800 us for C6. I use a small while(1) {usleep(300);}
loop, unsynchronized and pinned to each core. While this is an
artificial case, it is a very innocent one - easy to predict and
regular. Between vanilla 4.16.0-rc5 and idle-loop/v6, the power
consumption increases from 149.7 W to 158.1 W. On 4.16.0-rc5, the
cores sleep almost entirely in C1E. With the patches applied, the
cores spend ~75% of their sleep time in C6 and ~25% in C1E. The
average time/usage for C1E is also lower with v6 at ~350 us rather
than the ~550 us in C6 (and in C1E with the baseline). Generally, the
new menu governor seems to choose C1E if the next timer is an enabled
sched timer - which occasionally splits the sleep interval into two
C1E sleeps rather than one C6 sleep.
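
For reference, a minimal standalone version of that poller could look
like the following (a sketch: the mail only specifies the
unsynchronized while(1) {usleep(300);} loop pinned to each core, so
the sched_setaffinity() pinning and the command-line handling here
are assumptions):

/* One instance per core, e.g. "./poller 0 & ./poller 1 & ..." */
#define _GNU_SOURCE
#include <sched.h>
#include <stdlib.h>
#include <unistd.h>

int main(int argc, char **argv)
{
        cpu_set_t set;
        int cpu = argc > 1 ? atoi(argv[1]) : 0;

        CPU_ZERO(&set);
        CPU_SET(cpu, &set);
        sched_setaffinity(0, sizeof(set), &set);   /* pin to the given core */

        while (1)
                usleep(300);    /* ~3333 wakeups/s per core, minus overheads */
}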

Manually disabling C6 reduces power consumption back to 149.5 W.

This is far from what I expected; I have not yet figured out why the
patched menu governor decides to go to C6 under that workload. I
had tested this previously with v4 and saw this behavior even
without patch "7/7".

The results from Haswell-EP and Skylake desktop are similar.

The tests are with a 1000 Hz kernel because I wanted to amplify
effects that happen when C-state residencies and tick timers are
closer together. But I suspect the results will be similar with
300 Hz, as the impact from the sched tick interruption seems to be
minor compared to the odd C-state selection.

Some very raw illustrations, all from Skylake SP (2 == C1E, 3 == C6):
power consumption
(trigger-10-10 is the synthetic Powernightmare,
poller-omp-300 is the parallel usleep(300) loop):
https://wwwpub.zih.tu-dresden.de/~tilsche/powernightmares/v6_skl_sp_power.png

cstate utilization with usleep(300) loop
(as per /sys/.../stateN/time / wall time)
https://wwwpub.zih.tu-dresden.de/~tilsche/powernightmares/v6_skl_sp_poll_300_utilization.png

average time spent in cstates
(as /sys/.../stateN/time / /sys/.../stateN/usage)
https://wwwpub.zih.tu-dresden.de/~tilsche/powernightmares/v6_skl_sp_poll_300_avg_time.png

detailed look:
https://wwwpub.zih.tu-dresden.de/~tilsche/powernightmares/v6_poll_300_skl.png
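
For anyone who wants to reproduce the two derived metrics, they come
straight from the standard cpuidle sysfs attributes; a rough sketch
for a single CPU (not taken from the mail) could look like this:

/*
 * stateN/time is the cumulative residency in microseconds,
 * stateN/usage the number of times the state was entered.
 */
#include <stdio.h>

static unsigned long long read_ull(int cpu, int state, const char *file)
{
        char path[128];
        unsigned long long val = 0;
        FILE *f;

        snprintf(path, sizeof(path),
                 "/sys/devices/system/cpu/cpu%d/cpuidle/state%d/%s",
                 cpu, state, file);
        f = fopen(path, "r");
        if (f) {
                if (fscanf(f, "%llu", &val) != 1)
                        val = 0;
                fclose(f);
        }
        return val;
}

int main(void)
{
        int cpu = 0, state;

        for (state = 0; ; state++) {
                char path[128], name[32] = "";
                unsigned long long time_us, usage;
                FILE *f;

                snprintf(path, sizeof(path),
                         "/sys/devices/system/cpu/cpu%d/cpuidle/state%d/name",
                         cpu, state);
                f = fopen(path, "r");
                if (!f)
                        break;          /* no more states on this CPU */
                fscanf(f, "%31s", name);
                fclose(f);

                time_us = read_ull(cpu, state, "time");
                usage = read_ull(cpu, state, "usage");

                printf("cpu%d state%d (%s): %llu us total, avg %.1f us per entry\n",
                       cpu, state, name, time_us,
                       usage ? (double)time_us / usage : 0.0);
        }
        return 0;
}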


>>
>> Thanks!
>>
>> ---
>> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
>> Subject: [PATCH v6] cpuidle: Return nohz hint from cpuidle_select()
>>
>> Add a new pointer argument to cpuidle_select() and to the ->select
>> cpuidle governor callback to allow a boolean value indicating
>> whether or not the tick should be stopped before entering the
>> selected state to be returned from there.
>>
>> Make the ladder governor ignore that pointer (to preserve its
>> current behavior) and make the menu governor return 'false" through
>> it if:
>>   (1) the idle exit latency is constrained at 0,
>>   (2) the selected state is a polling one, or
>>   (3) the selected state is not deep enough.
>>
>> Since the value returned through the new argument pointer is not
>> used yet, this change is not expected to alter the functionality of
>> the code.
>>
>> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
>> ---
> 
> [cut]
> 
>> @@ -354,6 +360,7 @@ static int menu_select(struct cpuidle_dr
>>          if (latency_req > interactivity_req)
>>                  latency_req = interactivity_req;
>>
>> +       expected_interval = TICK_USEC_HZ;
>>          /*
>>           * Find the idle state with the lowest power while satisfying
>>           * our constraints.
>> @@ -367,17 +374,44 @@ static int menu_select(struct cpuidle_dr
>>                          continue;
>>                  if (idx == -1)
>>                          idx = i; /* first enabled state */
>> -               if (s->target_residency > data->predicted_us)
>> +               if (s->target_residency > data->predicted_us) {
>> +                       /*
>> +                        * Retain the tick if the selected state is shallower
>> +                        * than the deepest available one with target residency
>> +                        * within the tick period range.
>> +                        *
>> +                        * This allows the tick to be stopped even if the
>> +                        * predicted idle duration is within the tick period
>> +                        * range to counter the effect by which the prediction
>> +                        * may be skewed towards lower values due to the tick
>> +                        * bias.
>> +                        */
>> +                       expected_interval = s->target_residency;
>>                          break;
> 
> BTW, I guess I need to explain the motivation here more thoroughly, so
> here it goes.
> 
> The governor predicts idle duration under the assumption that the
> tick will be stopped, so if the result of the prediction is within the tick
> period range and it is not accurate, that needs to be taken into
> account in the governor's statistics.  However, if the tick is allowed
> to run every time the governor predicts idle duration within the tick
> period range, the governor will always see that it was "almost
> right" and the correction factor applied by it to improve the
> prediction next time will not be sufficient.  For this reason, it
> is better to stop the tick at least sometimes when the governor
> predicts idle duration within the tick period range and the idea
> here is to do that when the selected state is the deepest available
> one with the target residency within the tick period range.  This
> allows the opportunity to save more energy to be seized which
> balances the extra overhead of stopping the tick.
> 
> HTH
>
Rafael J. Wysocki March 20, 2018, 10:49 a.m. UTC | #3
On Tue, Mar 20, 2018 at 11:01 AM, Thomas Ilsche
<thomas.ilsche@tu-dresden.de> wrote:
> On 2018-03-18 17:15, Rafael J. Wysocki wrote:
>>>
>>> Doug, Thomas,
>>>
>>> Thank you both for the reports, much appreciated!
>>>
>>> Below is a drop-in v6 replacement for patch [4/7].
>>>
>>> With this new patch applied instead of the [4/7] the behavior should be
>>> much
>>> more in line with the v4 behavior, so please try it if you can and let me
>>> know
>>> if that really is the case on your systems.
>>>
>>> Patches [5-7/7] from the original v5 apply on top of it right away for
>>> me,
>>> but I've also created a git branch you can use to pull all of the series
>>> with the below included:
>>>
>>>   git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm.git \
>>>   idle-loop
>
>
> Thanks for the git repo, that helps alot. I have tested v6 on a
> Skylake desktop and server system as well as a Haswell server system.
> The odd idle behavior of v5 is gone.

Thank you for the report!

> Some of the other findings may be obsolete by the upcoming respin,
> I will retest.
>
> Our originally observed Powernightmare pattern is effectively
> prevented in both idle and with a synthetic trigger.

That's great!

> However, I can reproduce simple workloads under which the revised
> menu governor wastes energy by going into *deeper* C-states than
> advisable.
>
> Consider the Skylake server system which has residencies in C1E of
> 20 us and C6 of 800 us. I use a small while(1) {usleep(300);}
> unsynchronized pinned to each core. While this is an artificial
> case, it is a very innocent one - easy to predict and regular. Between
> vanilla 4.16.0-rc5 and idle-loop/v6, the power consumption increases
> from 149.7 W to 158.1 W. On 4.16.0-rc5, the cores sleep almost
> entirely in C1E. With the patches applied, the cores spend ~75% of
> their sleep time in C6, ~25% in C1E. The average time/usage for C1E is
> also lower with v6 at ~350 us rather than the ~550 us in C6 (and in
> C1E with the baseline). Generally the new menu governor seems to chose
> C1E if the next timer is an enabled sched timer - which occasionally
> interrupts the sleep-interval into two C1E sleeps rather than one C6.
>
> Manually disabling C6, reduces power consumption back to 149.5 W.
>
> This is far from what I expected, I did not yet figure out why the
> patched menu governor decides to go to C6 under that workload. I
> have tested this previously with v4 and saw this behavior even
> without path "7/7".

I see.

I'm not sure what the source of this effect is either.  If that is
also present in the v7 I'm working on now, it should be easier to
diagnose in there.

> The results from Haswell-EP and Skylake desktop are similar.
>
> The tests are with a 1000 Hz kernel because I wanted to amplify
> effects that happening when C-state residencies and tick timers are
> closer together. But I suspect the results will be similar with
> 300 Hz as the impact from the sched tick interruption seems to be
> minor compared to the odd C-state selection.

OK

Thanks!
Doug Smythies March 20, 2018, 5:15 p.m. UTC | #4
On 2018.03.20 03:02 Thomas Ilsche wrote:

...[snip]...

> Consider the Skylake server system which has residencies in C1E of
> 20 us and C6 of 800 us. I use a small while(1) {usleep(300);}
> unsynchronized pinned to each core. While this is an artificial
> case, it is a very innocent one - easy to predict and regular. Between
> vanilla 4.16.0-rc5 and idle-loop/v6, the power consumption increases
> from 149.7 W to 158.1 W. On 4.16.0-rc5, the cores sleep almost
> entirely in C1E. With the patches applied, the cores spend ~75% of
> their sleep time in C6, ~25% in C1E. The average time/usage for C1E is
> also lower with v6 at ~350 us rather than the ~550 us in C6 (and in
> C1E with the baseline). Generally the new menu governor seems to chose
> C1E if the next timer is an enabled sched timer - which occasionally
> interrupts the sleep-interval into two C1E sleeps rather than one C6.
>
> Manually disabling C6, reduces power consumption back to 149.5 W.

...[snip]...

Note that one of the tests that I normally do is a work/sleep
frequency sweep from 100 to 2100 Hz, typically at a lowish
workload. I didn't notice anything odd with this test:

http://fast.smythies.com/rjw_freq_sweep.png
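
The sweep program itself is not shown here, but its general shape is
roughly the following (a sketch only - the per-period workload, the
step size and how long each frequency is held are all assumptions):

#include <unistd.h>

static void burn(volatile unsigned long n)
{
        while (n--)
                ;                       /* lowish fixed workload per period */
}

int main(void)
{
        double freq;

        for (freq = 100.0; freq <= 2100.0; freq += 10.0) {
                useconds_t period_us = (useconds_t)(1e6 / freq);
                int i;

                /* hold each frequency for roughly one second */
                for (i = 0; i < (int)freq; i++) {
                        burn(100000);           /* a little work ... */
                        usleep(period_us);      /* ... then sleep the rest */
                }
        }
        return 0;
}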

However, your test is at 3333 Hz (well, minus overheads).
I did the same as you and was surprised to confirm your
power findings. In my case package power goes from
~8.6 watts to ~7.3 watts with idle state 4 (C6) disabled.

I am getting different residency times than you, though.
I also observe different overheads depending on whether
idle state 4 is disabled or not, i.e. my actual loop
frequency drops from ~2801 Hz to ~2754 Hz.

Example residencies over the previous minute:

Idle state 4 (C6) disabled (seconds):

Idle state 0: 0.001119
Idle state 1: 0.056638
Idle state 2: 13.100550
Idle state 3: 446.266744
Idle state 4: 0.000000

Idle state 4 (C6) enabled (seconds):

Idle state 0: 0.034502
Idle state 1: 1.949595
Idle state 2: 78.291793
Idle state 3: 96.467974
Idle state 4: 286.247524

... Doug
Doug Smythies March 20, 2018, 9:03 p.m. UTC | #5
Summary: My results with kernel 4.16-rc6 and V8 of the patch set
are completely different, and now show no clear difference
(a longer test might reveal something). 

On 2018.03.20 10:16 Doug Smythies wrote:
> On 2018.03.20 03:02 Thomas Ilsche wrote:
>
>...[snip]...
>
>> Consider the Skylake server system which has residencies in C1E of
>> 20 us and C6 of 800 us. I use a small while(1) {usleep(300);}
>> unsynchronized pinned to each core. While this is an artificial
>> case, it is a very innocent one - easy to predict and regular. Between
>> vanilla 4.16.0-rc5 and idle-loop/v6, the power consumption increases
>> from 149.7 W to 158.1 W. On 4.16.0-rc5, the cores sleep almost
>> entirely in C1E. With the patches applied, the cores spend ~75% of
>> their sleep time in C6, ~25% in C1E. The average time/usage for C1E is
>> also lower with v6 at ~350 us rather than the ~550 us in C6 (and in
>> C1E with the baseline). Generally the new menu governor seems to chose
>> C1E if the next timer is an enabled sched timer - which occasionally
>> interrupts the sleep-interval into two C1E sleeps rather than one C6.
>>
>> Manually disabling C6, reduces power consumption back to 149.5 W.
>
> ...[snip]...
>
> Note that one of the tests that I normally do is a work/sleep
> frequency sweep from 100 to 2100 Hz, typically at a lowish
> workload. I didn't notice anything odd with this test:
> 
> http://fast.smythies.com/rjw_freq_sweep.png
>
> However, your test is at 3333 Hz (well, minus overheads).
> I did the same as you. And was surprised to confirm
> your power findings. In my case package power goes from
> ~8.6 watts to ~7.3 watts with idle state 4 (C6) disabled.
>
> I am getting different residency times than you though.
> I also observe different overheads between idle state 4
> being disabled or not. i.e. my actual loop frequency
> drops from ~2801 Hz to ~2754 Hz.
>
> Example residencies over the previous minute:
>
> Idle state 4 (C6) disabled (seconds):
>
> Idle state 0: 0.001119
> Idle state 1: 0.056638
> Idle state 2: 13.100550
> Idle state 3: 446.266744
> Idle state 4: 0.000000
>
> Idle state 4 (C6) enabled (seconds):
>
> Idle state 0: 0.034502
> Idle state 1: 1.949595
> Idle state 2: 78.291793
> Idle state 3: 96.467974
> Idle state 4: 286.247524

Now, with kernel 4.16-rc6, V8 of the patch set and the poll fix,
I am unable to measure the processor package power difference
between idle state 0 being enabled or disabled (i.e. it is in the
noise). Also, the loop time changes (overhead changes) are minimal.
However, the overall loop rate has dropped to ~2730 Hz, so there
seems to be a little more overhead in general.

I increased my loop frequency to ~3316 Hz. Similar.

I increased my loop frequency to ~15474 Hz. Similar.
Compared to a stock 4.16-rc6 kernel: The loop rate dropped
to 15,209 Hz and it (the stock kernel) used about 0.3 more
watts (out of 10.97, or ~3% more).

... Doug
Rafael J. Wysocki March 21, 2018, 6:33 a.m. UTC | #6
On Tuesday, March 20, 2018 10:03:50 PM CET Doug Smythies wrote:
> Summary: My results with kernel 4.16-rc6 and V8 of the patch set
> are completely different, and now show no clear difference
> (a longer test might reveal something). 

Does this mean that you see the "powernightmares" pattern with the v8
again or are you referring to something else?

> On 2018.03.20 10:16 Doug Smythies wrote:
> > On 2018.03.20 03:02 Thomas Ilsche wrote:
> >
> >...[snip]...
> >
> >> Consider the Skylake server system which has residencies in C1E of
> >> 20 us and C6 of 800 us. I use a small while(1) {usleep(300);}
> >> unsynchronized pinned to each core. While this is an artificial
> >> case, it is a very innocent one - easy to predict and regular. Between
> >> vanilla 4.16.0-rc5 and idle-loop/v6, the power consumption increases
> >> from 149.7 W to 158.1 W. On 4.16.0-rc5, the cores sleep almost
> >> entirely in C1E. With the patches applied, the cores spend ~75% of
> >> their sleep time in C6, ~25% in C1E. The average time/usage for C1E is
> >> also lower with v6 at ~350 us rather than the ~550 us in C6 (and in
> >> C1E with the baseline). Generally the new menu governor seems to chose
> >> C1E if the next timer is an enabled sched timer - which occasionally
> >> interrupts the sleep-interval into two C1E sleeps rather than one C6.
> >>
> >> Manually disabling C6, reduces power consumption back to 149.5 W.
> >
> > ...[snip]...
> >
> > Note that one of the tests that I normally do is a work/sleep
> > frequency sweep from 100 to 2100 Hz, typically at a lowish
> > workload. I didn't notice anything odd with this test:
> > 
> > http://fast.smythies.com/rjw_freq_sweep.png

Would it be possible to produce this graph with the v8 of the
patchset?

> > However, your test is at 3333 Hz (well, minus overheads).
> > I did the same as you. And was surprised to confirm
> > your power findings. In my case package power goes from
> > ~8.6 watts to ~7.3 watts with idle state 4 (C6) disabled.
> >
> > I am getting different residency times than you though.
> > I also observe different overheads between idle state 4
> > being disabled or not. i.e. my actual loop frequency
> > drops from ~2801 Hz to ~2754 Hz.
> >
> > Example residencies over the previous minute:
> >
> > Idle state 4 (C6) disabled (seconds):
> >
> > Idle state 0: 0.001119
> > Idle state 1: 0.056638
> > Idle state 2: 13.100550
> > Idle state 3: 446.266744
> > Idle state 4: 0.000000
> >
> > Idle state 4 (C6) enabled (seconds):
> >
> > Idle state 0: 0.034502
> > Idle state 1: 1.949595
> > Idle state 2: 78.291793
> > Idle state 3: 96.467974
> > Idle state 4: 286.247524
> 
> Now, with kernel 4.16-rc6 and V8 of the patch set and the poll fix
> I am unable to measure the processor package power difference
> between idle state 0 enabled or disabled (i.e. it is in the noise).
> also the loop time changes (overhead changes) are minimal. However,
> the overall loop time has dropped to ~2730 Hz, so there seems to be
> a little more overhead in general.
> 
> I increased my loop frequency to ~3316 Hz. Similar.
> 
> I increased my loop frequency to ~15474 Hz. Similar.
> Compared to a stock 4.16-rc6 kernel: The loop rate dropped
> to 15,209 Hz and it (the stock kernel) used about 0.3 more
> watts (out of 10.97, or ~3% more).

So do you prefer v6 or v8?  I guess the former?
Doug Smythies March 21, 2018, 1:51 p.m. UTC | #7
On 2018.03.20 23:33 Rafael J. Wysocki wrote:
> On Tuesday, March 20, 2018 10:03:50 PM CET Doug Smythies wrote:
>> Summary: My results with kernel 4.16-rc6 and V8 of the patch set
>> are completely different, and now show no clear difference
>> (a longer test might reveal something). 
>
> Does this mean that you see the "powernightmares" pattern with the v8
> again or are you referring to something else?

Sorry for not being clear.
I do not see any "powernightmares" at all with V8.

After this e-mail I did a 3 hour trace and saw none.

>> On 2018.03.20 10:16 Doug Smythies wrote:
>>> On 2018.03.20 03:02 Thomas Ilsche wrote:
>>>
>>>...[snip]...
>>>
>>>> Consider the Skylake server system which has residencies in C1E of
>>>> 20 us and C6 of 800 us. I use a small while(1) {usleep(300);}
>>>> unsynchronized pinned to each core. While this is an artificial
>>>> case, it is a very innocent one - easy to predict and regular. Between
>>>> vanilla 4.16.0-rc5 and idle-loop/v6, the power consumption increases
>>>> from 149.7 W to 158.1 W. On 4.16.0-rc5, the cores sleep almost
>>>> entirely in C1E. With the patches applied, the cores spend ~75% of
>>>> their sleep time in C6, ~25% in C1E. The average time/usage for C1E is
>>>> also lower with v6 at ~350 us rather than the ~550 us in C6 (and in
>>>> C1E with the baseline). Generally the new menu governor seems to chose
>>>> C1E if the next timer is an enabled sched timer - which occasionally
>>>> interrupts the sleep-interval into two C1E sleeps rather than one C6.
>>>>
>>>> Manually disabling C6, reduces power consumption back to 149.5 W.
>>>
>>> ...[snip]...
>>>
>>> Note that one of the tests that I normally do is a work/sleep
>>> frequency sweep from 100 to 2100 Hz, typically at a lowish
>>> workload. I didn't notice anything odd with this test:
>>> 
>>> http://fast.smythies.com/rjw_freq_sweep.png
>
> Would it be possible to produce this graph with the v8 of the
> patchset?

Yes, sure.

>>> However, your test is at 3333 Hz (well, minus overheads).
>>> I did the same as you. And was surprised to confirm
>>> your power findings. In my case package power goes from
>>> ~8.6 watts to ~7.3 watts with idle state 4 (C6) disabled.
>>>
>>> I am getting different residency times than you though.
>>> I also observe different overheads between idle state 4
>>> being disabled or not. i.e. my actual loop frequency
>>> drops from ~2801 Hz to ~2754 Hz.
>>>
>>> Example residencies over the previous minute:
>>>
>>> Idle state 4 (C6) disabled (seconds):
>>>
>>> Idle state 0: 0.001119
>>> Idle state 1: 0.056638
>>> Idle state 2: 13.100550
>>> Idle state 3: 446.266744
>>> Idle state 4: 0.000000
>>>
>>> Idle state 4 (C6) enabled (seconds):
>>>
>>> Idle state 0: 0.034502
>>> Idle state 1: 1.949595
>>> Idle state 2: 78.291793
>>> Idle state 3: 96.467974
>>> Idle state 4: 286.247524
>>
>> Now, with kernel 4.16-rc6 and V8 of the patch set and the poll fix
>> I am unable to measure the processor package power difference
>> between idle state 0 enabled or disabled (i.e. it is in the noise).
>> also the loop time changes (overhead changes) are minimal. However,
>> the overall loop time has dropped to ~2730 Hz, so there seems to be
>> a little more overhead in general.
>>
>> I increased my loop frequency to ~3316 Hz. Similar.
>> 
>> I increased my loop frequency to ~15474 Hz. Similar.
>> Compared to a stock 4.16-rc6 kernel: The loop rate dropped
>> to 15,209 Hz and it (the stock kernel) used about 0.3 more
>> watts (out of 10.97, or ~3% more).
>
> So do you prefer v6 or v8?  I guess the former?

Again sorry for not being clear.
I was saying that V8 is great.

I did more tests after the original e-mail was sent,
and the noted slight overhead drop was not always there
(i.e. it was inconsistent).

... Doug
Rafael J. Wysocki March 21, 2018, 1:58 p.m. UTC | #8
On Wednesday, March 21, 2018 2:51:13 PM CET Doug Smythies wrote:
> On 2018.03.20 23:33 Rafael J. Wysocki wrote:
> > On Tuesday, March 20, 2018 10:03:50 PM CET Doug Smythies wrote:
> >> Summary: My results with kernel 4.16-rc6 and V8 of the patch set
> >> are completely different, and now show no clear difference
> >> (a longer test might reveal something). 
> >
> > Does this mean that you see the "powernightmares" pattern with the v8
> > again or are you referring to something else?
> 
> Sorry for not being clear.
> I do not see any "powernightmares" at all with V8.

Great!

> After this e-mail I did a 3 hour trace and saw none.
> 
> >> On 2018.03.20 10:16 Doug Smythies wrote:
> >>> On 2018.03.20 03:02 Thomas Ilsche wrote:
> >>>
> >>>...[snip]...
> >>>
> >>>> Consider the Skylake server system which has residencies in C1E of
> >>>> 20 us and C6 of 800 us. I use a small while(1) {usleep(300);}
> >>>> unsynchronized pinned to each core. While this is an artificial
> >>>> case, it is a very innocent one - easy to predict and regular. Between
> >>>> vanilla 4.16.0-rc5 and idle-loop/v6, the power consumption increases
> >>>> from 149.7 W to 158.1 W. On 4.16.0-rc5, the cores sleep almost
> >>>> entirely in C1E. With the patches applied, the cores spend ~75% of
> >>>> their sleep time in C6, ~25% in C1E. The average time/usage for C1E is
> >>>> also lower with v6 at ~350 us rather than the ~550 us in C6 (and in
> >>>> C1E with the baseline). Generally the new menu governor seems to chose
> >>>> C1E if the next timer is an enabled sched timer - which occasionally
> >>>> interrupts the sleep-interval into two C1E sleeps rather than one C6.
> >>>>
> >>>> Manually disabling C6, reduces power consumption back to 149.5 W.
> >>>
> >>> ...[snip]...
> >>>
> >>> Note that one of the tests that I normally do is a work/sleep
> >>> frequency sweep from 100 to 2100 Hz, typically at a lowish
> >>> workload. I didn't notice anything odd with this test:
> >>> 
> >>> http://fast.smythies.com/rjw_freq_sweep.png
> >
> > Would it be possible to produce this graph with the v8 of the
> > patchset?
> 
> Yes, sure.

Thanks!

> >>> However, your test is at 3333 Hz (well, minus overheads).
> >>> I did the same as you. And was surprised to confirm
> >>> your power findings. In my case package power goes from
> >>> ~8.6 watts to ~7.3 watts with idle state 4 (C6) disabled.
> >>>
> >>> I am getting different residency times than you though.
> >>> I also observe different overheads between idle state 4
> >>> being disabled or not. i.e. my actual loop frequency
> >>> drops from ~2801 Hz to ~2754 Hz.
> >>>
> >>> Example residencies over the previous minute:
> >>>
> >>> Idle state 4 (C6) disabled (seconds):
> >>>
> >>> Idle state 0: 0.001119
> >>> Idle state 1: 0.056638
> >>> Idle state 2: 13.100550
> >>> Idle state 3: 446.266744
> >>> Idle state 4: 0.000000
> >>>
> >>> Idle state 4 (C6) enabled (seconds):
> >>>
> >>> Idle state 0: 0.034502
> >>> Idle state 1: 1.949595
> >>> Idle state 2: 78.291793
> >>> Idle state 3: 96.467974
> >>> Idle state 4: 286.247524
> >>
> >> Now, with kernel 4.16-rc6 and V8 of the patch set and the poll fix
> >> I am unable to measure the processor package power difference
> >> between idle state 0 enabled or disabled (i.e. it is in the noise).
> >> also the loop time changes (overhead changes) are minimal. However,
> >> the overall loop time has dropped to ~2730 Hz, so there seems to be
> >> a little more overhead in general.
> >>
> >> I increased my loop frequency to ~3316 Hz. Similar.
> >> 
> >> I increased my loop frequency to ~15474 Hz. Similar.
> >> Compared to a stock 4.16-rc6 kernel: The loop rate dropped
> >> to 15,209 Hz and it (the stock kernel) used about 0.3 more
> >> watts (out of 10.97, or ~3% more).
> >
> > So do you prefer v6 or v8?  I guess the former?
> 
> Again sorry for not being clear.
> I was saying that V8 is great.

OK, thanks!

It's v7, actually. :-)

> I did more tests after the original e-mail was sent,
> and the noted slight overhead drop was not always there
> (i.e. it was inconsistent).

If possible, please also try the v7.2 replacement for patch [5/8]:

https://patchwork.kernel.org/patch/10299429/

Patch

Index: linux-pm/include/linux/cpuidle.h
===================================================================
--- linux-pm.orig/include/linux/cpuidle.h
+++ linux-pm/include/linux/cpuidle.h
@@ -135,7 +135,8 @@  extern bool cpuidle_not_available(struct
 				  struct cpuidle_device *dev);
 
 extern int cpuidle_select(struct cpuidle_driver *drv,
-			  struct cpuidle_device *dev);
+			  struct cpuidle_device *dev,
+			  bool *nohz_ret);
 extern int cpuidle_enter(struct cpuidle_driver *drv,
 			 struct cpuidle_device *dev, int index);
 extern void cpuidle_reflect(struct cpuidle_device *dev, int index);
@@ -167,7 +168,7 @@  static inline bool cpuidle_not_available
 					 struct cpuidle_device *dev)
 {return true; }
 static inline int cpuidle_select(struct cpuidle_driver *drv,
-				 struct cpuidle_device *dev)
+				 struct cpuidle_device *dev, bool *nohz_ret)
 {return -ENODEV; }
 static inline int cpuidle_enter(struct cpuidle_driver *drv,
 				struct cpuidle_device *dev, int index)
@@ -250,7 +251,8 @@  struct cpuidle_governor {
 					struct cpuidle_device *dev);
 
 	int  (*select)		(struct cpuidle_driver *drv,
-					struct cpuidle_device *dev);
+					struct cpuidle_device *dev,
+					bool *nohz_ret);
 	void (*reflect)		(struct cpuidle_device *dev, int index);
 };
 
Index: linux-pm/kernel/sched/idle.c
===================================================================
--- linux-pm.orig/kernel/sched/idle.c
+++ linux-pm/kernel/sched/idle.c
@@ -188,13 +188,15 @@  static void cpuidle_idle_call(void)
 		next_state = cpuidle_find_deepest_state(drv, dev);
 		call_cpuidle(drv, dev, next_state);
 	} else {
+		bool nohz = true;
+
 		tick_nohz_idle_stop_tick();
 		rcu_idle_enter();
 
 		/*
 		 * Ask the cpuidle framework to choose a convenient idle state.
 		 */
-		next_state = cpuidle_select(drv, dev);
+		next_state = cpuidle_select(drv, dev, &nohz);
 		entered_state = call_cpuidle(drv, dev, next_state);
 		/*
 		 * Give the governor an opportunity to reflect on the outcome
Index: linux-pm/drivers/cpuidle/cpuidle.c
===================================================================
--- linux-pm.orig/drivers/cpuidle/cpuidle.c
+++ linux-pm/drivers/cpuidle/cpuidle.c
@@ -272,12 +272,18 @@  int cpuidle_enter_state(struct cpuidle_d
  *
  * @drv: the cpuidle driver
  * @dev: the cpuidle device
+ * @nohz_ret: indication on whether or not to stop the tick
  *
  * Returns the index of the idle state.  The return value must not be negative.
+ *
+ * The memory location pointed to by @nohz_ret is expected to be written the
+ * 'false' boolean value if the scheduler tick should not be stopped before
+ * entering the returned state.
  */
-int cpuidle_select(struct cpuidle_driver *drv, struct cpuidle_device *dev)
+int cpuidle_select(struct cpuidle_driver *drv, struct cpuidle_device *dev,
+		   bool *nohz_ret)
 {
-	return cpuidle_curr_governor->select(drv, dev);
+	return cpuidle_curr_governor->select(drv, dev, nohz_ret);
 }
 
 /**
Index: linux-pm/drivers/cpuidle/governors/ladder.c
===================================================================
--- linux-pm.orig/drivers/cpuidle/governors/ladder.c
+++ linux-pm/drivers/cpuidle/governors/ladder.c
@@ -63,9 +63,10 @@  static inline void ladder_do_selection(s
  * ladder_select_state - selects the next state to enter
  * @drv: cpuidle driver
  * @dev: the CPU
+ * @dummy: not used
  */
 static int ladder_select_state(struct cpuidle_driver *drv,
-				struct cpuidle_device *dev)
+			       struct cpuidle_device *dev, bool *dummy)
 {
 	struct ladder_device *ldev = this_cpu_ptr(&ladder_devices);
 	struct device *device = get_cpu_device(dev->cpu);
Index: linux-pm/drivers/cpuidle/governors/menu.c
===================================================================
--- linux-pm.orig/drivers/cpuidle/governors/menu.c
+++ linux-pm/drivers/cpuidle/governors/menu.c
@@ -275,12 +275,16 @@  again:
 	goto again;
 }
 
+#define TICK_USEC_HZ   ((USEC_PER_SEC + HZ/2) / HZ)
+
 /**
  * menu_select - selects the next idle state to enter
  * @drv: cpuidle driver containing state data
  * @dev: the CPU
+ * @nohz_ret: indication on whether or not to stop the tick
  */
-static int menu_select(struct cpuidle_driver *drv, struct cpuidle_device *dev)
+static int menu_select(struct cpuidle_driver *drv, struct cpuidle_device *dev,
+		       bool *nohz_ret)
 {
 	struct menu_device *data = this_cpu_ptr(&menu_devices);
 	struct device *device = get_cpu_device(dev->cpu);
@@ -303,8 +307,10 @@  static int menu_select(struct cpuidle_dr
 		latency_req = resume_latency;
 
 	/* Special case when user has set very strict latency requirement */
-	if (unlikely(latency_req == 0))
+	if (unlikely(latency_req == 0)) {
+		*nohz_ret = false;
 		return 0;
+	}
 
 	/* determine the expected residency time, round up */
 	data->next_timer_us = ktime_to_us(tick_nohz_get_sleep_length());
@@ -354,6 +360,7 @@  static int menu_select(struct cpuidle_dr
 	if (latency_req > interactivity_req)
 		latency_req = interactivity_req;
 
+	expected_interval = TICK_USEC_HZ;
 	/*
 	 * Find the idle state with the lowest power while satisfying
 	 * our constraints.
@@ -367,17 +374,44 @@  static int menu_select(struct cpuidle_dr
 			continue;
 		if (idx == -1)
 			idx = i; /* first enabled state */
-		if (s->target_residency > data->predicted_us)
+		if (s->target_residency > data->predicted_us) {
+			/*
+			 * Retain the tick if the selected state is shallower
+			 * than the deepest available one with target residency
+			 * within the tick period range.
+			 *
+			 * This allows the tick to be stopped even if the
+			 * predicted idle duration is within the tick period
+			 * range to counter the effect by which the prediction
+			 * may be skewed towards lower values due to the tick
+			 * bias.
+			 */
+			expected_interval = s->target_residency;
 			break;
-		if (s->exit_latency > latency_req)
+		}
+		if (s->exit_latency > latency_req) {
+			/*
+			 * If we break out of the loop for latency reasons,
+			 * retain the tick unless the target residency of the
+			 * selected state is too high.
+			 */
+			expected_interval = drv->states[idx].target_residency;
 			break;
-
+		}
 		idx = i;
 	}
 
 	if (idx == -1)
 		idx = 0; /* No states enabled. Must use 0. */
 
+	/*
+	 * Don't stop the tick if the selected state is a polling one or it is
+	 * not deep enough.
+	 */
+	if ((drv->states[idx].flags & CPUIDLE_FLAG_POLLING) ||
+	    expected_interval < TICK_USEC_HZ)
+		*nohz_ret = false;
+
 	data->last_state_idx = idx;
 
 	return data->last_state_idx;