
[RFC] cpufreq: intel_pstate: Change the calculation of next pstate

Message ID 5368255D.3090207@semaphore.gr (mailing list archive)
State RFC, archived

Commit Message

Stratos Karafotis May 5, 2014, 11:57 p.m. UTC
Currently the driver calculates the next pstate proportionally to the
core_busy factor, scaled by the ratio max_pstate / current_pstate.

Using the scaled load (core_busy) to calculate the next pstate
is not always correct, because there are cases where the load is
independent of the current pstate. For example, a tight 'for' loop
running through many sampling intervals will cause a load of 100% in
every pstate.
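For contrast, the existing calculation described above scales the measured
busy percentage by max_pstate / current_pstate. A simplified sketch (plain
integer math and a made-up function name; the real driver uses fixed-point
arithmetic and feeds the result into a PID controller):

```c
#include <assert.h>

/*
 * Existing approach (simplified): the busy percentage is scaled
 * by max_pstate / current_pstate, so the same 100% busy reading
 * maps to a different "core busy" value depending on the pstate
 * it was measured at. Fixed-point details of the driver omitted.
 */
int core_busy_scaled(int busy_pct, int max_pstate, int current_pstate)
{
	return busy_pct * max_pstate / current_pstate;
}
```

This is why a loop that is 100% busy at the min pstate produces a large
scaled value, while the same loop at max pstate produces only 100.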

So, change the above method and calculate the next pstate on the
assumption that the next pstate should not depend on the
current pstate. The next pstate should only be proportional
to the measured load. Use a linear function to calculate it:

Next P-state = A + B * load

where A = min_pstate and B = (max_pstate - min_pstate) / 100.
If turbo is enabled, B = (turbo_pstate - min_pstate) / 100.
The load is calculated using the kernel time functions.
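A rough sketch of that load measurement, in the spirit of the duration_us
and idletime_us fields the patch adds to struct sample (the helper name and
plain-integer types here are made up for illustration, not the driver code):

```c
#include <assert.h>

/*
 * Load = percentage of the sampling period the CPU was not idle.
 * duration_us is the wall-clock length of the sampling period and
 * idletime_us the idle time accumulated within it, both taken as
 * deltas since the previous sample. Simplified sketch only.
 */
unsigned int busy_pct(unsigned int duration_us, unsigned int idletime_us)
{
	if (duration_us == 0 || idletime_us >= duration_us)
		return 0;

	return (duration_us - idletime_us) * 100 / duration_us;
}
```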

Also remove the now-unused pid_calc function, the pid structure and
related helper functions.

Tested on Intel i7-3770 CPU @ 3.40GHz.
Phoronix benchmark of the Linux Kernel Compilation 3.1 test (CPU busy 86%)
shows a ~1.35% increase in performance and a ~0.22% decrease in
energy consumption. When turbo was disabled there was a ~0.94% increase
in performance and a ~0.37% decrease in energy consumption.

The Phoronix Apache benchmark shows more interesting results.
With the CPU busy ~32%, there was a ~46.84% increase in performance
and a ~4.78% decrease in energy consumption.
When turbo was disabled, the performance boost was ~38.56% and
the decrease in energy consumption ~7.96%.

Signed-off-by: Stratos Karafotis <stratosk@semaphore.gr>
---

Detailed test results can be found in this link:
https://docs.google.com/spreadsheets/d/1xiw8FOswoNFA8seNMz0nYUdhjPPvJ8J2S54kG02dOP8/edit?usp=sharing

 drivers/cpufreq/intel_pstate.c | 208 +++++++----------------------------------
 1 file changed, 35 insertions(+), 173 deletions(-)

Comments

dirk.brandewie@gmail.com May 8, 2014, 8:52 p.m. UTC | #1
On 05/05/2014 04:57 PM, Stratos Karafotis wrote:
> Currently the driver calculates the next pstate proportional to
> core_busy factor, scaled by the ratio max_pstate / current_pstate.
> 
> Using the scaled load (core_busy) to calculate the next pstate
> is not always correct, because there are cases that the load is
> independent from current pstate. For example, a tight 'for' loop
> through many sampling intervals will cause a load of 100% in
> every pstate.
> 
> So, change the above method and calculate the next pstate with
> the assumption that the next pstate should not depend on the
> current pstate. The next pstate should only be proportional
> to measured load. Use the linear function to calculate the load:
> 
> Next P-state = A + B * load
> 
> where A = min_state and B = (max_pstate - min_pstate) / 100
> If turbo is enabled the B = (turbo_pstate - min_pstate) / 100
> The load is calculated using the kernel time functions.
> 

This will hurt your power numbers under "normal" conditions where you
are not running a performance workload. Consider the following:

   1. The system is idle, all cores at the min P state and utilization is low, say < 10%.
   2. You run something that drives the load as seen by the kernel to 100%,
      which is scaled by the current P state.

This would cause the P state to go from min -> max in one step.  Which is
what you want if you are only looking at a single core.  But this will also
drag every core in the package to the max P state as well.  This would be fine
if the power vs frequency curve were linear: all the cores would finish
their work faster and go idle sooner (race to halt) and maybe spend
more time in a deeper C state, which dwarfs the amount of power we can
save by controlling P states.  Unfortunately this is *not* the case; the
power vs frequency curve is non-linear and gets very steep in the turbo
range.  If it were linear there would be no reason to have P state
control; you could select the highest P state and walk away.

Being conservative on the way up and aggressive on the way down gives you
the best power efficiency on non-benchmark loads.  Most benchmarks
are pretty useless for measuring power efficiency (unless they were
designed for it) since they measure how fast something can be
done, which is measuring the efficiency at max performance.
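The asymmetric policy described here could be caricatured as a
one-step-up, immediate-drop rule (purely an illustrative sketch; the
driver's actual PID-based controller is more elaborate than this):

```c
#include <assert.h>

/*
 * Caricature of "conservative up, aggressive down": climb one
 * pstate per sample toward a higher target, but drop straight
 * to a lower target. Illustrative only.
 */
int ramp(int current_pstate, int target_pstate)
{
	if (target_pstate > current_pstate)
		return current_pstate + 1;	/* conservative on the way up */

	return target_pstate;			/* aggressive on the way down */
}
```

A transient load spike then costs at most one extra pstate step, while an
idle period pulls the core back down in a single sample.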

The performance issues you pointed out were caused by commit
fcb6a15c ("intel_pstate: Take core C0 time into account for core busy
calculation") and the problems that ensued.  These have been fixed in the patch set

   https://lkml.org/lkml/2014/5/8/574

The performance comparison between before/after this patch set, your patch
and ondemand/acpi_cpufreq is available at:
    http://openbenchmarking.org/result/1405085-PL-C0200965993
ffmpeg was added to the set of benchmarks because there was a regression
reported against this benchmark as well.
    https://bugzilla.kernel.org/show_bug.cgi?id=75121

--Dirk



--
To unsubscribe from this list: send the line "unsubscribe linux-pm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Stratos Karafotis May 9, 2014, 2:56 p.m. UTC | #2
Hi Dirk,

On 08/05/2014 11:52 μμ, Dirk Brandewie wrote:
> On 05/05/2014 04:57 PM, Stratos Karafotis wrote:
>> Currently the driver calculates the next pstate proportional to
>> core_busy factor, scaled by the ratio max_pstate / current_pstate.
>>
>> Using the scaled load (core_busy) to calculate the next pstate
>> is not always correct, because there are cases that the load is
>> independent from current pstate. For example, a tight 'for' loop
>> through many sampling intervals will cause a load of 100% in
>> every pstate.
>>
>> So, change the above method and calculate the next pstate with
>> the assumption that the next pstate should not depend on the
>> current pstate. The next pstate should only be proportional
>> to measured load. Use the linear function to calculate the load:
>>
>> Next P-state = A + B * load
>>
>> where A = min_state and B = (max_pstate - min_pstate) / 100
>> If turbo is enabled the B = (turbo_pstate - min_pstate) / 100
>> The load is calculated using the kernel time functions.
>>

Thank you very much for your comments and for taking the time to test my patch!


> 
> This will hurt your power numbers under "normal" conditions where you
> are not running a performance workload. Consider the following:
> 
>    1. The system is idle, all core at min P state and utilization is low say < 10%
>    2. You run something that drives the load as seen by the kernel to 100%
>       which scaled by the current P state.
> 
> This would cause the P state to go from min -> max in one step.  Which is
> what you want if you are only looking at a single core.  But this will also
> drag every core in the package to the max P state as well.  This would be fine

I think this will also happen with the original driver (before your
new patch 4/5), after some sampling intervals.


> if the power vs frequency cure was linear all the cores would finish
> their work faster and go idle sooner (race to halt) and maybe spend
> more time in a deeper C state which dwarfs the amount of power we can
> save by controlling P states. Unfortunately this is *not* the case, 
> power vs frequency curve is non-linear and get very steep in the turbo
> range.  If it were linear there would be no reason to have P state
> control you could select the highest P state and walk away.
> 
> Being conservative on the way up and aggressive on way down give you
> the best power efficiency on non-benchmark loads.  Most benchmarks
> are pretty useless for measuring power efficiency (unless they were
> designed for it) since they are measuring how fast something can be
> done which is measuring the efficiency at max performance.
> 
> The performance issues you pointed out were caused by commit 
> fcb6a15c intel_pstate: Take core C0 time into account for core busy calculation
> and the ensuing problem is caused. These have been fixed in the patch set
> 
>    https://lkml.org/lkml/2014/5/8/574
> 
> The performance comparison between before/after this patch set, your patch
> and ondemand/acpi_cpufreq is available at:
>     http://openbenchmarking.org/result/1405085-PL-C0200965993
> ffmpeg was added to the set of benchmarks because there was a regression
> reported against this benchmark as well.
>     https://bugzilla.kernel.org/show_bug.cgi?id=75121

Of course, I agree generally with your comments above. But I believe that
we should scale up the core as soon as we measure a high load.

I tested your new patches and I can confirm your benchmarks. But I think
they go against the above theory (at least on low loads).
With the new patches I get increased frequencies even on an idle system.
Please compare the results below.

With your latest patches, during mp3 decoding (a non-benchmark load)
the energy consumption increased to 5187.52 J from 5036.57 J (almost 3%).


Thanks again,
Stratos



With my patch
-------------
[root@albert ~]# /home/stratosk/kernels/linux-pm/tools/power/x86/turbostat/turbostat -i 60
    Core     CPU Avg_MHz   %Busy Bzy_MHz TSC_MHz     SMI  CPU%c1  CPU%c3  CPU%c6  CPU%c7 CoreTmp  PkgTmp Pkg%pc2 Pkg%pc3 Pkg%pc6 Pkg%pc7 PkgWatt CorWatt GFXWatt 
       -       -       1    0.06    1645    3392       0    0.26    0.00   99.67    0.00      32      32    0.00    0.00    0.00    0.00   20.18    2.00    0.02
       0       0       2    0.10    1623    3392       0    0.63    0.01   99.26    0.00      32      32    0.00    0.00    0.00    0.00   20.18    2.00    0.02
       0       4       0    0.01    1618    3392       0    0.72
       1       1       1    0.03    1618    3392       0    0.03    0.00   99.94    0.00      27
       1       5       0    0.01    1606    3392       0    0.05
       2       2       0    0.02    1635    3392       0    0.28    0.00   99.70    0.00      22
       2       6       3    0.17    1668    3392       0    0.13
       3       3       2    0.12    1647    3392       0    0.08    0.00   99.80    0.00      30
       3       7       0    0.02    1623    3392       0    0.18


With your latest patch
----------------------
[root@albert ~]# /home/stratosk/kernels/linux-pm/tools/power/x86/turbostat/turbostat -i 60
    Core     CPU Avg_MHz   %Busy Bzy_MHz TSC_MHz     SMI  CPU%c1  CPU%c3  CPU%c6  CPU%c7 CoreTmp  PkgTmp Pkg%pc2 Pkg%pc3 Pkg%pc6 Pkg%pc7 PkgWatt CorWatt GFXWatt 
       -       -       1    0.05    2035    3392       0    0.28    0.01   99.66    0.00      34      34    0.00    0.00    0.00    0.00   20.20    2.01    0.02
       0       0       1    0.04    1831    3392       0    0.06    0.00   99.90    0.00      34      34    0.00    0.00    0.00    0.00   20.20    2.01    0.02
       0       4       0    0.01    2136    3392       0    0.09
       1       1       1    0.06    1931    3392       0    0.70    0.00   99.24    0.00      31
       1       5       0    0.01    2024    3392       0    0.75
       2       2       1    0.03    2231    3392       0    0.21    0.03   99.73    0.00      26
       2       6       2    0.09    1967    3392       0    0.15
       3       3       3    0.15    2115    3392       0    0.06    0.00   99.78    0.00      34
       3       7       0    0.02    2073    3392       0    0.19



With my patch:
--------------
[root@albert ~]# /home/stratosk/kernels/linux-pm/tools/power/x86/turbostat/turbostat mpg321 /home/stratosk/One\ Direction\ -\ Story\ of\ My\ Life.mp3
[4:05] Decoding of One Direction - Story of My Life.mp3 finished.
    Core     CPU Avg_MHz   %Busy Bzy_MHz TSC_MHz     SMI  CPU%c1  CPU%c3  CPU%c6  CPU%c7 CoreTmp  PkgTmp Pkg%pc2 Pkg%pc3 Pkg%pc6 Pkg%pc7 PkgWatt CorWatt GFXWatt 
       -       -       7    0.45    1613    3392       0   14.55    0.02   84.97    0.00      35      35    0.00    0.00    0.00    0.00   20.51    2.33    0.01
       0       0      16    1.01    1623    3392       0    1.06    0.04   97.89    0.00      35      35    0.00    0.00    0.00    0.00   20.51    2.33    0.01
       0       4       0    0.02    1616    3392       0    2.05
       1       1       3    0.16    1609    3392       0    1.61    0.00   98.22    0.00      30
       1       5      13    0.80    1606    3392       0    0.97
       2       2       8    0.52    1606    3392       0   38.97    0.03   60.48    0.00      26
       2       6      10    0.65    1613    3392       0   38.84
       3       3       7    0.42    1613    3392       0   16.28    0.01   83.29    0.00      33
       3       7       1    0.05    1624    3392       0   16.65
245.566284 sec


With your patch:
----------------
[root@albert ~]# /home/stratosk/kernels/linux-pm/tools/power/x86/turbostat/turbostat mpg321 /home/stratosk/One\ Direction\ -\ Story\ of\ My\ Life.mp3
[4:05] Decoding of One Direction - Story of My Life.mp3 finished.
    Core     CPU Avg_MHz   %Busy Bzy_MHz TSC_MHz     SMI  CPU%c1  CPU%c3  CPU%c6  CPU%c7 CoreTmp  PkgTmp Pkg%pc2 Pkg%pc3 Pkg%pc6 Pkg%pc7 PkgWatt CorWatt GFXWatt 
       -       -       7    0.27    2773    3392       0   40.05    0.01   59.67    0.00      35      35    0.00    0.00    0.00    0.00   21.11    2.93    0.01
       0       0       9    0.31    2773    3392       0   82.55    0.01   17.12    0.00      35      35    0.00    0.00    0.00    0.00   21.11    2.93    0.01
       0       4       5    0.15    3290    3392       0   82.71
       1       1       8    0.31    2541    3392       0   26.87    0.00   72.82    0.00      30
       1       5      19    0.79    2400    3392       0   26.38
       2       2       8    0.23    3490    3392       0   15.43    0.00   84.34    0.00      27
       2       6       1    0.04    2086    3392       0   15.62
       3       3       4    0.13    2978    3392       0   35.44    0.00   64.42    0.00      31
       3       7       6    0.16    3553    3392       0   35.42
245.642873 sec


With original code
-----------------
[root@albert ~]# /home/stratosk/kernels/linux-pm/tools/power/x86/turbostat/turbostat mpg321 /home/stratosk/One\ Direction\ -\ Story\ of\ My\ Life.mp3
[4:05] Decoding of One Direction - Story of My Life.mp3 finished.
    Core     CPU Avg_MHz   %Busy Bzy_MHz TSC_MHz     SMI  CPU%c1  CPU%c3  CPU%c6  CPU%c7 CoreTmp  PkgTmp Pkg%pc2 Pkg%pc3 Pkg%pc6 Pkg%pc7 PkgWatt CorWatt GFXWatt 
       -       -       5    0.32    1608    3392       0   20.43    0.01   79.24    0.00      35      35    0.00    0.00    0.00    0.00   20.59    2.41    0.01
       0       0       2    0.11    1621    3392       0   20.90    0.01   78.98    0.00      35      35    0.00    0.00    0.00    0.00   20.59    2.41    0.01
       0       4       6    0.38    1600    3392       0   20.63
       1       1       8    0.50    1603    3392       0   24.10    0.00   75.40    0.00      29
       1       5       0    0.02    1611    3392       0   24.58
       2       2      13    0.81    1598    3392       0    0.45    0.02   98.73    0.00      29
       2       6       1    0.04    1675    3392       0    1.21
       3       3       9    0.59    1603    3392       0   35.54    0.01   63.86    0.00      33
       3       7       1    0.08    1749    3392       0   36.05
245.641863 sec
Yuyang Du May 12, 2014, 7:34 p.m. UTC | #3
On Mon, May 12, 2014 at 11:30:03PM +0300, Stratos Karafotis wrote:
> On 09/05/2014 05:56 μμ, Stratos Karafotis wrote:
>
> Next performance state = min_perf + (max_perf - min_perf) * load / 100
> 
Hi,

This formula is fundamentally broken. You need to associate the load with its
frequency.

Thanks,
Yuyang
Yuyang Du May 12, 2014, 8:01 p.m. UTC | #4
On Tue, May 13, 2014 at 06:59:42AM +0300, Stratos Karafotis wrote:
> Hi,
> 
> On 12/05/2014 10:34 μμ, Yuyang Du wrote:
> > On Mon, May 12, 2014 at 11:30:03PM +0300, Stratos Karafotis wrote:
> >> On 09/05/2014 05:56 μμ, Stratos Karafotis wrote:
> >>
> >> Next performance state = min_perf + (max_perf - min_perf) * load / 100
> >>
> > Hi,
> > 
> > This formula is fundamentally broken. You need to associate the load with its
> > frequency.
> 
> Could you please explain why is it broken? I think the load should be
> independent from the current frequency.

Why independent? Is the load not (somewhat) determined by that?

Stratos Karafotis May 12, 2014, 8:30 p.m. UTC | #5
On 09/05/2014 05:56 μμ, Stratos Karafotis wrote:
> Hi Dirk,
> 
> On 08/05/2014 11:52 μμ, Dirk Brandewie wrote:
>> On 05/05/2014 04:57 PM, Stratos Karafotis wrote:
>>> Currently the driver calculates the next pstate proportional to
>>> core_busy factor, scaled by the ratio max_pstate / current_pstate.
>>>
>>> Using the scaled load (core_busy) to calculate the next pstate
>>> is not always correct, because there are cases that the load is
>>> independent from current pstate. For example, a tight 'for' loop
>>> through many sampling intervals will cause a load of 100% in
>>> every pstate.
>>>
>>> So, change the above method and calculate the next pstate with
>>> the assumption that the next pstate should not depend on the
>>> current pstate. The next pstate should only be proportional
>>> to measured load. Use the linear function to calculate the load:
>>>
>>> Next P-state = A + B * load
>>>
>>> where A = min_state and B = (max_pstate - min_pstate) / 100
>>> If turbo is enabled the B = (turbo_pstate - min_pstate) / 100
>>> The load is calculated using the kernel time functions.
>>>
> 
> Thank you very much for your comments and for your time to test my patch!
> 
> 
>>
>> This will hurt your power numbers under "normal" conditions where you
>> are not running a performance workload. Consider the following:
>>
>>    1. The system is idle, all core at min P state and utilization is low say < 10%
>>    2. You run something that drives the load as seen by the kernel to 100%
>>       which scaled by the current P state.
>>
>> This would cause the P state to go from min -> max in one step.  Which is
>> what you want if you are only looking at a single core.  But this will also
>> drag every core in the package to the max P state as well.  This would be fine
> 
> I think, this will also happen using the original driver (before your
> new patch 4/5), after some sampling intervals.
> 
> 
>> if the power vs frequency cure was linear all the cores would finish
>> their work faster and go idle sooner (race to halt) and maybe spend
>> more time in a deeper C state which dwarfs the amount of power we can
>> save by controlling P states. Unfortunately this is *not* the case, 
>> power vs frequency curve is non-linear and get very steep in the turbo
>> range.  If it were linear there would be no reason to have P state
>> control you could select the highest P state and walk away.
>>
>> Being conservative on the way up and aggressive on way down give you
>> the best power efficiency on non-benchmark loads.  Most benchmarks
>> are pretty useless for measuring power efficiency (unless they were
>> designed for it) since they are measuring how fast something can be
>> done which is measuring the efficiency at max performance.
>>
>> The performance issues you pointed out were caused by commit 
>> fcb6a15c intel_pstate: Take core C0 time into account for core busy calculation
>> and the ensuing problem is caused. These have been fixed in the patch set
>>
>>    https://lkml.org/lkml/2014/5/8/574
>>
>> The performance comparison between before/after this patch set, your patch
>> and ondemand/acpi_cpufreq is available at:
>>     http://openbenchmarking.org/result/1405085-PL-C0200965993
>> ffmpeg was added to the set of benchmarks because there was a regression
>> reported against this benchmark as well.
>>     https://bugzilla.kernel.org/show_bug.cgi?id=75121
> 
> Of course, I agree generally with your comments above. But I believe that
> the we should scale the core as soon as we measure high load. 
> 
> I tested your new patches and I confirm your benchmarks. But I think
> they are against the above theory (at least on low loads).
> With the new patches I get increased frequencies even on an idle system.
> Please compare the results below.
> 
> With your latest patches during a mp3 decoding (a non-benchmark load)
> the energy consumption increased to 5187.52 J from 5036.57 J (almost 3%).
> 
> 
> Thanks again,
> Stratos
> 

I would like to explain a little bit further the logic behind this patch.

The patch is based on the following assumptions (some of them are pretty
obvious, but please let me mention them):

1) We define the load of the CPU as the percentage of the sampling period
that the CPU was busy (not idle), as measured by the kernel.

2) It's not possible to predict (with accuracy) the load of a CPU in future
sampling periods.

3) The load in the next sampling interval is most likely to be very
close to that of the current sampling interval. (Actually, the load in the
next sampling interval could have any value, 0 - 100.)

4) In order to select the next performance state of the CPU, we need to
calculate the load frequently (as fast as the hardware permits) and change
the next state accordingly.

5) At a given constant 0% (zero) load in a specific period, the CPU
performance state should be equal to the minimum available state.

6) At a given constant 100% load in a specific period, the CPU performance
state should be equal to the maximum available state.

7) Ideally, the CPU should execute instructions at the maximum performance
state.


According to the above, if the measured load in a sampling interval is, for
example, 50%, ideally the CPU should spend half of the next sampling period
at the maximum pstate and half of the period at the minimum pstate. Of course,
it's impossible to increase the sampling frequency that much.

Thus, we consider that the best approximation would be:

Next performance state = min_perf + (max_perf - min_perf) * load / 100
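The formula is simply the straight line through the two endpoints of
assumptions 5) and 6). A minimal integer sketch (illustrative names;
truncating division is my assumption here, not necessarily the patch's
exact rounding):

```c
#include <assert.h>

/*
 * Next performance state = min_perf + (max_perf - min_perf) * load / 100
 *
 * Straight line between (load = 0 -> min_perf) and
 * (load = 100 -> max_perf). Integer division truncates; the
 * rounding choice is illustrative.
 */
int next_perf(int min_perf, int max_perf, int load)
{
	return min_perf + (max_perf - min_perf) * load / 100;
}
```

For example, with the i7-3770's min pstate of 16 and turbo pstate of 39,
a 50% load yields 16 + 23 * 50 / 100 = 27, i.e. roughly 2.7 GHz.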


Thanks again for your time,
Stratos



Yuyang Du May 12, 2014, 8:34 p.m. UTC | #6
On Tue, May 13, 2014 at 07:16:24AM +0300, Stratos Karafotis wrote:
> On 12/05/2014 11:01 μμ, Yuyang Du wrote:
> > On Tue, May 13, 2014 at 06:59:42AM +0300, Stratos Karafotis wrote:
> >> Hi,
> >>
> >> On 12/05/2014 10:34 μμ, Yuyang Du wrote:
> >>> On Mon, May 12, 2014 at 11:30:03PM +0300, Stratos Karafotis wrote:
> >>>> On 09/05/2014 05:56 μμ, Stratos Karafotis wrote:
> >>>>
> >>>> Next performance state = min_perf + (max_perf - min_perf) * load / 100
> >>>>
> >>> Hi,
> >>>
> >>> This formula is fundamentally broken. You need to associate the load with its
> >>> frequency.
> >>
> >> Could you please explain why is it broken? I think the load should be
> >> independent from the current frequency.
> > 
> > Why independent? The load not (somewhat) determined by that?
> > 
> > 
> 
> Maybe, in some cases yes. But not always.
> For example, please consider a CPU running a tight "for" loop in 100MHz
> for a couple of seconds. This produces a load of 100%.
> It will produce the same load (100%) in any other frequency.
 
Still fundamentally wrong, because you are not making a fair
comparison ("load" at 100MHz vs. at any other freq).

Yuyang
Stratos Karafotis May 13, 2014, 3:59 a.m. UTC | #7
Hi,

On 12/05/2014 10:34 μμ, Yuyang Du wrote:
> On Mon, May 12, 2014 at 11:30:03PM +0300, Stratos Karafotis wrote:
>> On 09/05/2014 05:56 μμ, Stratos Karafotis wrote:
>>
>> Next performance state = min_perf + (max_perf - min_perf) * load / 100
>>
> Hi,
> 
> This formula is fundamentally broken. You need to associate the load with its
> frequency.

Could you please explain why it is broken? I think the load should be
independent of the current frequency.

Thanks,
Stratos
Stratos Karafotis May 13, 2014, 4:16 a.m. UTC | #8
On 12/05/2014 11:01 μμ, Yuyang Du wrote:
> On Tue, May 13, 2014 at 06:59:42AM +0300, Stratos Karafotis wrote:
>> Hi,
>>
>> On 12/05/2014 10:34 μμ, Yuyang Du wrote:
>>> On Mon, May 12, 2014 at 11:30:03PM +0300, Stratos Karafotis wrote:
>>>> On 09/05/2014 05:56 μμ, Stratos Karafotis wrote:
>>>>
>>>> Next performance state = min_perf + (max_perf - min_perf) * load / 100
>>>>
>>> Hi,
>>>
>>> This formula is fundamentally broken. You need to associate the load with its
>>> frequency.
>>
>> Could you please explain why is it broken? I think the load should be
>> independent from the current frequency.
> 
> Why independent? The load not (somewhat) determined by that?
> 
> 

Maybe, in some cases, yes. But not always.
For example, please consider a CPU running a tight "for" loop at 100MHz
for a couple of seconds. This produces a load of 100%.
It will produce the same load (100%) at any other frequency.


Stratos
Stratos Karafotis May 17, 2014, 6:52 a.m. UTC | #9
Hi all!

On 12/05/2014 11:30 μμ, Stratos Karafotis wrote:
> On 09/05/2014 05:56 μμ, Stratos Karafotis wrote:
>> Hi Dirk,
>>
>> On 08/05/2014 11:52 μμ, Dirk Brandewie wrote:
>>> On 05/05/2014 04:57 PM, Stratos Karafotis wrote:
>>>> Currently the driver calculates the next pstate proportional to
>>>> core_busy factor, scaled by the ratio max_pstate / current_pstate.
>>>>
>>>> Using the scaled load (core_busy) to calculate the next pstate
>>>> is not always correct, because there are cases that the load is
>>>> independent from current pstate. For example, a tight 'for' loop
>>>> through many sampling intervals will cause a load of 100% in
>>>> every pstate.
>>>>
>>>> So, change the above method and calculate the next pstate with
>>>> the assumption that the next pstate should not depend on the
>>>> current pstate. The next pstate should only be proportional
>>>> to measured load. Use the linear function to calculate the load:
>>>>
>>>> Next P-state = A + B * load
>>>>
>>>> where A = min_state and B = (max_pstate - min_pstate) / 100
>>>> If turbo is enabled the B = (turbo_pstate - min_pstate) / 100
>>>> The load is calculated using the kernel time functions.
>>>>
>>
>> Thank you very much for your comments and for your time to test my patch!
>>
>>
>>>
>>> This will hurt your power numbers under "normal" conditions where you
>>> are not running a performance workload. Consider the following:
>>>
>>>    1. The system is idle, all core at min P state and utilization is low say < 10%
>>>    2. You run something that drives the load as seen by the kernel to 100%
>>>       which scaled by the current P state.
>>>
>>> This would cause the P state to go from min -> max in one step.  Which is
>>> what you want if you are only looking at a single core.  But this will also
>>> drag every core in the package to the max P state as well.  This would be fine
>>
>> I think, this will also happen using the original driver (before your
>> new patch 4/5), after some sampling intervals.
>>
>>
>>> if the power vs frequency cure was linear all the cores would finish
>>> their work faster and go idle sooner (race to halt) and maybe spend
>>> more time in a deeper C state which dwarfs the amount of power we can
>>> save by controlling P states. Unfortunately this is *not* the case, 
>>> power vs frequency curve is non-linear and get very steep in the turbo
>>> range.  If it were linear there would be no reason to have P state
>>> control you could select the highest P state and walk away.
>>>
>>> Being conservative on the way up and aggressive on way down give you
>>> the best power efficiency on non-benchmark loads.  Most benchmarks
>>> are pretty useless for measuring power efficiency (unless they were
>>> designed for it) since they are measuring how fast something can be
>>> done which is measuring the efficiency at max performance.
>>>
>>> The performance issues you pointed out were caused by commit 
>>> fcb6a15c intel_pstate: Take core C0 time into account for core busy calculation
>>> and the ensuing problem is caused. These have been fixed in the patch set
>>>
>>>    https://lkml.org/lkml/2014/5/8/574
>>>
>>> The performance comparison between before/after this patch set, your patch
>>> and ondemand/acpi_cpufreq is available at:
>>>     http://openbenchmarking.org/result/1405085-PL-C0200965993
>>> ffmpeg was added to the set of benchmarks because there was a regression
>>> reported against this benchmark as well.
>>>     https://bugzilla.kernel.org/show_bug.cgi?id=75121
>>
>> Of course, I agree generally with your comments above. But I believe that
>> the we should scale the core as soon as we measure high load. 
>>
>> I tested your new patches and I confirm your benchmarks. But I think
>> they are against the above theory (at least on low loads).
>> With the new patches I get increased frequencies even on an idle system.
>> Please compare the results below.
>>
>> With your latest patches during a mp3 decoding (a non-benchmark load)
>> the energy consumption increased to 5187.52 J from 5036.57 J (almost 3%).
>>
>>
>> Thanks again,
>> Stratos
>>
> 
> I would like to explain a little bit further the logic behind this patch.
> 
> The patch is based on the following assumptions (some of them are pretty
> obvious but please let me mention them):
> 
> 1) We define the load of the CPU as the percentage of sampling period that
> CPU was busy (not idle), as measured by the kernel.
> 
> 2) It's not possible to predict (with accuracy) the load of a CPU in future
> sampling periods.
> 
> 3) The load in the next sampling interval is most probable to be very
> close to the current sampling interval. (Actually the load in the
> next sampling interval could have any value, 0 - 100).
> 
> 4) In order to select the next performance state of the CPU we need to
> calculate the load frequently (as fast as hardware permits) and change
> the next state accordingly.
> 
> 5) At a given constant 0% (zero) load in a specific period, the CPU
> performance state should be equal to minimum available state.
> 
> 6) At a given constant 100% load in a specific period, the CPU performance
> state should be equal to maximum available state.
> 
> 7) Ideally, the CPU should execute instructions at maximum performance state.
> 
> 
> According to the above if the measured load in a sampling interval is, for
> example 50%, ideally the CPU should spent half of the next sampling period
> to maximum pstate and half of the period to minimum pstate. Of course
> it's impossible to increase the sampling frequency so much.
> 
> Thus, we consider that the best approximation would be:
> 
> Next performance state = min_perf + (max_perf - min_perf) * load / 100
> 

Any additional comments?
Should I consider it a rejected approach?


Thanks,
Stratos



Patch

diff --git a/drivers/cpufreq/intel_pstate.c b/drivers/cpufreq/intel_pstate.c
index 0999673..124c675 100644
--- a/drivers/cpufreq/intel_pstate.c
+++ b/drivers/cpufreq/intel_pstate.c
@@ -32,8 +32,6 @@ 
 #include <asm/msr.h>
 #include <asm/cpu_device_id.h>
 
-#define SAMPLE_COUNT		3
-
 #define BYT_RATIOS		0x66a
 #define BYT_VIDS		0x66b
 #define BYT_TURBO_RATIOS	0x66c
@@ -55,10 +53,11 @@  static inline int32_t div_fp(int32_t x, int32_t y)
 }
 
 struct sample {
-	int32_t core_pct_busy;
+	unsigned int core_pct_busy;
+	unsigned int duration_us;
+	unsigned int idletime_us;
 	u64 aperf;
 	u64 mperf;
-	unsigned long long tsc;
 	int freq;
 };
 
@@ -75,16 +74,6 @@  struct vid_data {
 	int32_t ratio;
 };
 
-struct _pid {
-	int setpoint;
-	int32_t integral;
-	int32_t p_gain;
-	int32_t i_gain;
-	int32_t d_gain;
-	int deadband;
-	int32_t last_err;
-};
-
 struct cpudata {
 	int cpu;
 
@@ -94,22 +83,17 @@  struct cpudata {
 
 	struct pstate_data pstate;
 	struct vid_data vid;
-	struct _pid pid;
 
+	ktime_t prev_sample;
+	u64	prev_idle_time_us;
 	u64	prev_aperf;
 	u64	prev_mperf;
-	unsigned long long prev_tsc;
 	struct sample sample;
 };
 
 static struct cpudata **all_cpu_data;
 struct pstate_adjust_policy {
 	int sample_rate_ms;
-	int deadband;
-	int setpoint;
-	int p_gain_pct;
-	int d_gain_pct;
-	int i_gain_pct;
 };
 
 struct pstate_funcs {
@@ -148,87 +132,10 @@  static struct perf_limits limits = {
 	.max_sysfs_pct = 100,
 };
 
-static inline void pid_reset(struct _pid *pid, int setpoint, int busy,
-			int deadband, int integral) {
-	pid->setpoint = setpoint;
-	pid->deadband  = deadband;
-	pid->integral  = int_tofp(integral);
-	pid->last_err  = int_tofp(setpoint) - int_tofp(busy);
-}
-
-static inline void pid_p_gain_set(struct _pid *pid, int percent)
-{
-	pid->p_gain = div_fp(int_tofp(percent), int_tofp(100));
-}
-
-static inline void pid_i_gain_set(struct _pid *pid, int percent)
-{
-	pid->i_gain = div_fp(int_tofp(percent), int_tofp(100));
-}
-
-static inline void pid_d_gain_set(struct _pid *pid, int percent)
-{
-
-	pid->d_gain = div_fp(int_tofp(percent), int_tofp(100));
-}
-
-static signed int pid_calc(struct _pid *pid, int32_t busy)
-{
-	signed int result;
-	int32_t pterm, dterm, fp_error;
-	int32_t integral_limit;
-
-	fp_error = int_tofp(pid->setpoint) - busy;
-
-	if (abs(fp_error) <= int_tofp(pid->deadband))
-		return 0;
-
-	pterm = mul_fp(pid->p_gain, fp_error);
-
-	pid->integral += fp_error;
-
-	/* limit the integral term */
-	integral_limit = int_tofp(30);
-	if (pid->integral > integral_limit)
-		pid->integral = integral_limit;
-	if (pid->integral < -integral_limit)
-		pid->integral = -integral_limit;
-
-	dterm = mul_fp(pid->d_gain, fp_error - pid->last_err);
-	pid->last_err = fp_error;
-
-	result = pterm + mul_fp(pid->integral, pid->i_gain) + dterm;
-
-	return (signed int)fp_toint(result);
-}
-
-static inline void intel_pstate_busy_pid_reset(struct cpudata *cpu)
-{
-	pid_p_gain_set(&cpu->pid, pid_params.p_gain_pct);
-	pid_d_gain_set(&cpu->pid, pid_params.d_gain_pct);
-	pid_i_gain_set(&cpu->pid, pid_params.i_gain_pct);
-
-	pid_reset(&cpu->pid,
-		pid_params.setpoint,
-		100,
-		pid_params.deadband,
-		0);
-}
-
-static inline void intel_pstate_reset_all_pid(void)
-{
-	unsigned int cpu;
-	for_each_online_cpu(cpu) {
-		if (all_cpu_data[cpu])
-			intel_pstate_busy_pid_reset(all_cpu_data[cpu]);
-	}
-}
-
 /************************** debugfs begin ************************/
 static int pid_param_set(void *data, u64 val)
 {
 	*(u32 *)data = val;
-	intel_pstate_reset_all_pid();
 	return 0;
 }
 static int pid_param_get(void *data, u64 *val)
@@ -246,11 +153,6 @@  struct pid_param {
 
 static struct pid_param pid_files[] = {
 	{"sample_rate_ms", &pid_params.sample_rate_ms},
-	{"d_gain_pct", &pid_params.d_gain_pct},
-	{"i_gain_pct", &pid_params.i_gain_pct},
-	{"deadband", &pid_params.deadband},
-	{"setpoint", &pid_params.setpoint},
-	{"p_gain_pct", &pid_params.p_gain_pct},
 	{NULL, NULL}
 };
 
@@ -452,11 +354,6 @@  static void core_set_pstate(struct cpudata *cpudata, int pstate)
 static struct cpu_defaults core_params = {
 	.pid_policy = {
 		.sample_rate_ms = 10,
-		.deadband = 0,
-		.setpoint = 97,
-		.p_gain_pct = 20,
-		.d_gain_pct = 0,
-		.i_gain_pct = 0,
 	},
 	.funcs = {
 		.get_max = core_get_max_pstate,
@@ -469,11 +366,6 @@  static struct cpu_defaults core_params = {
 static struct cpu_defaults byt_params = {
 	.pid_policy = {
 		.sample_rate_ms = 10,
-		.deadband = 0,
-		.setpoint = 97,
-		.p_gain_pct = 14,
-		.d_gain_pct = 0,
-		.i_gain_pct = 4,
 	},
 	.funcs = {
 		.get_max = byt_get_max_pstate,
@@ -520,21 +412,6 @@  static void intel_pstate_set_pstate(struct cpudata *cpu, int pstate)
 	pstate_funcs.set(cpu, pstate);
 }
 
-static inline void intel_pstate_pstate_increase(struct cpudata *cpu, int steps)
-{
-	int target;
-	target = cpu->pstate.current_pstate + steps;
-
-	intel_pstate_set_pstate(cpu, target);
-}
-
-static inline void intel_pstate_pstate_decrease(struct cpudata *cpu, int steps)
-{
-	int target;
-	target = cpu->pstate.current_pstate - steps;
-	intel_pstate_set_pstate(cpu, target);
-}
-
 static void intel_pstate_get_cpu_pstates(struct cpudata *cpu)
 {
 	sprintf(cpu->name, "Intel 2nd generation core");
@@ -553,50 +430,55 @@  static void intel_pstate_get_cpu_pstates(struct cpudata *cpu)
 	intel_pstate_set_pstate(cpu, cpu->pstate.max_pstate);
 }
 
-static inline void intel_pstate_calc_busy(struct cpudata *cpu,
-					struct sample *sample)
+static inline void intel_pstate_calc_busy(struct cpudata *cpu)
 {
+	struct sample *sample = &cpu->sample;
 	int32_t core_pct;
-	int32_t c0_pct;
 
-	core_pct = div_fp(int_tofp((sample->aperf)),
-			int_tofp((sample->mperf)));
+	sample->core_pct_busy = 100 *
+				(sample->duration_us - sample->idletime_us) /
+				sample->duration_us;
+
+	core_pct = div_fp(int_tofp(sample->aperf), int_tofp(sample->mperf));
 	core_pct = mul_fp(core_pct, int_tofp(100));
 	FP_ROUNDUP(core_pct);
 
-	c0_pct = div_fp(int_tofp(sample->mperf), int_tofp(sample->tsc));
-
 	sample->freq = fp_toint(
 		mul_fp(int_tofp(cpu->pstate.max_pstate * 1000), core_pct));
 
-	sample->core_pct_busy = mul_fp(core_pct, c0_pct);
+	pr_debug("%s: core_pct_busy = %u", __func__, sample->core_pct_busy);
 }
 
 static inline void intel_pstate_sample(struct cpudata *cpu)
 {
+	ktime_t now;
+	u64 idle_time_us;
 	u64 aperf, mperf;
-	unsigned long long tsc;
+
+	now = ktime_get();
+	idle_time_us = get_cpu_idle_time_us(cpu->cpu, NULL);
 
 	rdmsrl(MSR_IA32_APERF, aperf);
 	rdmsrl(MSR_IA32_MPERF, mperf);
-	tsc = native_read_tsc();
 
 	aperf = aperf >> FRAC_BITS;
 	mperf = mperf >> FRAC_BITS;
-	tsc = tsc >> FRAC_BITS;
 
 	cpu->sample.aperf = aperf;
 	cpu->sample.mperf = mperf;
-	cpu->sample.tsc = tsc;
 	cpu->sample.aperf -= cpu->prev_aperf;
 	cpu->sample.mperf -= cpu->prev_mperf;
-	cpu->sample.tsc -= cpu->prev_tsc;
+	cpu->sample.duration_us = (unsigned int)ktime_us_delta(now,
+							cpu->prev_sample);
+	cpu->sample.idletime_us = (unsigned int)(idle_time_us -
+						 cpu->prev_idle_time_us);
 
-	intel_pstate_calc_busy(cpu, &cpu->sample);
+	intel_pstate_calc_busy(cpu);
 
+	cpu->prev_sample = now;
+	cpu->prev_idle_time_us = idle_time_us;
 	cpu->prev_aperf = aperf;
 	cpu->prev_mperf = mperf;
-	cpu->prev_tsc = tsc;
 }
 
 static inline void intel_pstate_set_sample_time(struct cpudata *cpu)
@@ -608,35 +490,21 @@  static inline void intel_pstate_set_sample_time(struct cpudata *cpu)
 	mod_timer_pinned(&cpu->timer, jiffies + delay);
 }
 
-static inline int32_t intel_pstate_get_scaled_busy(struct cpudata *cpu)
-{
-	int32_t core_busy, max_pstate, current_pstate;
-
-	core_busy = cpu->sample.core_pct_busy;
-	max_pstate = int_tofp(cpu->pstate.max_pstate);
-	current_pstate = int_tofp(cpu->pstate.current_pstate);
-	core_busy = mul_fp(core_busy, div_fp(max_pstate, current_pstate));
-	return FP_ROUNDUP(core_busy);
-}
-
 static inline void intel_pstate_adjust_busy_pstate(struct cpudata *cpu)
 {
-	int32_t busy_scaled;
-	struct _pid *pid;
-	signed int ctl = 0;
-	int steps;
+	int max_pstate, min_pstate, pstate;
+	unsigned int busy;
 
-	pid = &cpu->pid;
-	busy_scaled = intel_pstate_get_scaled_busy(cpu);
+	busy = cpu->sample.core_pct_busy;
+	max_pstate = limits.no_turbo ? cpu->pstate.max_pstate :
+				       cpu->pstate.turbo_pstate;
+	min_pstate = cpu->pstate.min_pstate;
 
-	ctl = pid_calc(pid, busy_scaled);
+	pstate = min_pstate + (max_pstate - min_pstate) * busy / 100;
 
-	steps = abs(ctl);
+	intel_pstate_set_pstate(cpu, pstate);
 
-	if (ctl < 0)
-		intel_pstate_pstate_increase(cpu, steps);
-	else
-		intel_pstate_pstate_decrease(cpu, steps);
+	pr_debug("%s, busy = %u, pstate = %u", __func__, busy, pstate);
 }
 
 static void intel_pstate_timer_func(unsigned long __data)
@@ -651,7 +519,7 @@  static void intel_pstate_timer_func(unsigned long __data)
 	intel_pstate_adjust_busy_pstate(cpu);
 
 	trace_pstate_sample(fp_toint(sample->core_pct_busy),
-			fp_toint(intel_pstate_get_scaled_busy(cpu)),
+			0,
 			cpu->pstate.current_pstate,
 			sample->mperf,
 			sample->aperf,
@@ -708,7 +576,6 @@  static int intel_pstate_init_cpu(unsigned int cpunum)
 	cpu->timer.data =
 		(unsigned long)cpu;
 	cpu->timer.expires = jiffies + HZ/100;
-	intel_pstate_busy_pid_reset(cpu);
 	intel_pstate_sample(cpu);
 	intel_pstate_set_pstate(cpu, cpu->pstate.max_pstate);
 
@@ -852,11 +719,6 @@  static int intel_pstate_msrs_not_valid(void)
 static void copy_pid_params(struct pstate_adjust_policy *policy)
 {
 	pid_params.sample_rate_ms = policy->sample_rate_ms;
-	pid_params.p_gain_pct = policy->p_gain_pct;
-	pid_params.i_gain_pct = policy->i_gain_pct;
-	pid_params.d_gain_pct = policy->d_gain_pct;
-	pid_params.deadband = policy->deadband;
-	pid_params.setpoint = policy->setpoint;
 }
 
 static void copy_cpu_funcs(struct pstate_funcs *funcs)