diff mbox series

cpufreq: intel_pstate: Optimize IO boost in non HWP mode

Message ID 20180831172851.79812-1-srinivas.pandruvada@linux.intel.com (mailing list archive)
State Changes Requested, archived
Headers show
Series cpufreq: intel_pstate: Optimize IO boost in non HWP mode | expand

Commit Message

Srinivas Pandruvada Aug. 31, 2018, 5:28 p.m. UTC
The io_boost mechanism, applied when the scheduler's update_util callback
indicates a wake from IO, was intended as a short-term boost to improve disk
IO performance. But once the i915 graphics driver changed from
schedule_timeout() to io_schedule_timeout() while waiting for a response from
the GPU, this resulted in a constant IO boost, causing intel_pstate to boost
to turbo continuously. This has a bad side effect on TDP-limited Atom
platforms. The following two links show the boost and frequency plots for a
GPU test called triangle_benchmark:
https://bugzilla.kernel.org/attachment.cgi?id=278091
https://bugzilla.kernel.org/attachment.cgi?id=278093
This type of behavior was observed with many graphics tests and during
regular use of such platforms with a graphical interface.

The fix in this patch is to optimize the IO boost by:
- Limiting the maximum boost to the base frequency on TDP-limited Atom
platforms, and to the 1-core turbo frequency on the rest of the non-HWP
platforms (same as the current algorithm).
- Boosting gradually in multiple levels: start with an increment equal to
half of the maximum boost and increase to the maximum on subsequent IO wakes
(see the sketch below).
- Once the maximum is reached, further IO wakes don't cause any additional
boost on TDP-limited Atom platforms, on the assumption that the target CPU
has already built up enough load to run at a higher P-state, and the use of
the average frequency in the current algorithm also helps avoid a drastic
reduction. For other platforms this is not limited, as in the current
algorithm.
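
For illustration, here is a minimal user-space model of the stepped boost
described above (only a sketch of the intel_pstate_adjust_for_io() logic in
the diff below; it omits the per-sample gating and the Atom-only cutoff once
the maximum is reached, and the P-state values are made up):

#include <stdio.h>

/*
 * Simplified model of the multilevel IO boost: each IO wake bumps the
 * boost P-state by half of the boost range until the maximum is hit.
 */
struct boost_state {
	int min_pstate;    /* lowest P-state */
	int max_boost;     /* base frequency P-state (Atom) or 1-core turbo */
	int iowait_boost;  /* current boost P-state, 0 means no boost */
};

static int adjust_for_io(struct boost_state *b, int target_pstate)
{
	int increment = (b->max_boost - b->min_pstate) >> 1;

	if (!b->iowait_boost)
		b->iowait_boost = b->min_pstate;

	b->iowait_boost += increment;
	if (b->iowait_boost > b->max_boost)
		b->iowait_boost = b->max_boost;

	return b->iowait_boost > target_pstate ? b->iowait_boost : target_pstate;
}

int main(void)
{
	struct boost_state b = { .min_pstate = 6, .max_boost = 18 };
	int wake;

	/* Three consecutive IO wakes step the boost 12 -> 18 -> 18. */
	for (wake = 0; wake < 3; wake++)
		printf("wake %d: boosted P-state %d\n",
		       wake, adjust_for_io(&b, 8));
	return 0;
}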

With this fix the resultant plots show a reduced average frequency, and the
change also yields up to 10% improvement in some graphics tests on an Atom
(Broxton) platform:
https://bugzilla.kernel.org/attachment.cgi?id=278095
https://bugzilla.kernel.org/attachment.cgi?id=278097

As per testing by Eero Tamminen, the results are comparable to the patchset
https://patchwork.kernel.org/patch/10312259/
but he has to watch the results for several days to check the trend.

Since the boost here is limited to turbo and non-turbo P-states, we would
also need some way to adjust the fractions corresponding to max non-turbo.
It is much easier to use the actual P-state limits for the boost instead of
fractions. So here the P-state IO boost limit is applied on top of the
P-state limit calculated via the current algorithm, and the current io_wait
boost calculation using fractions is removed.

Since we prefer to use a common algorithm for all processor platforms, this
change was tested on other client and server platforms as well. All results
were within the margin of error. Results:
https://bugzilla.kernel.org/attachment.cgi?id=278149

To identify TDP-limited platforms a new callback boost_limited() is added,
which sets a per-cpudata field called iowait_boost_limited to 1. Currently
this is set for Atom platforms only.

Tested-by: Eero Tamminen <eero.t.tamminen@intel.com>
Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
---
 drivers/cpufreq/intel_pstate.c | 112 +++++++++++++++++++++++++++------
 1 file changed, 92 insertions(+), 20 deletions(-)

Comments

Eero Tamminen Sept. 3, 2018, 3:10 p.m. UTC | #1
Hi,

On 31.08.2018 20:28, Srinivas Pandruvada wrote:
...
> As per testing Eero Tamminen, the results are comparable to the patchset
> https://patchwork.kernel.org/patch/10312259/
> But he has to watch results for several days to check trend.

It's close, but there is some gap compared to Francisco's version.

(Because of the large variance on TDP-limited devices, and factors 
causing extra perf differences e.g. between boots, it's hard to give an 
exact number without having trends from several days / weeks.  I would 
also need a new version of Francisco's patch set that applies to the latest 
kernel like yours does.)


> Since here boost is getting limited to turbo and non turbo, we need some
> ways to adjust the fractions corresponding to max non turbo as well. It
> is much easier to use the actual P-state limits for boost instead of
> fractions. So here P-state io boost limit is applied on top of the
> P-state limit calculated via current algorithm by removing current
> io_wait boost calculation using fractions.
> 
> Since we prefer to use common algorithm for all processor platforms, this
> change was tested on other client and sever platforms as well. All results
> were within the margin of errors. Results:
> https://bugzilla.kernel.org/attachment.cgi?id=278149

Good.

Francisco, how well do the listed PTS tests cover the latency-bound cases 
you were concerned about?  [1]


	- Eero

[1] Francisco was concerned that the patch:
* trade-off might regress latency-bound cases (which I haven't tested; I 
tested only 3D throughput), and
* that it addressed only one of the sources of energy inefficiencies 
he had identified (which could explain the slightly better 3D results from 
his more complex patch set).
Srinivas Pandruvada Sept. 3, 2018, 3:15 p.m. UTC | #2
On Mon, 2018-09-03 at 18:10 +0300, Eero Tamminen wrote:
> Hi,
> 
> On 31.08.2018 20:28, Srinivas Pandruvada wrote:
> ...
> > As per testing Eero Tamminen, the results are comparable to the
> > patchset
> > https://patchwork.kernel.org/patch/10312259/
> > But he has to watch results for several days to check trend.
> 
> It's close, but there is some gap compared to Francisco's version.
> 
> (Because of the large variance on TDP limited devices, and factors 
> causing extra perf differences e.g. between boots, it's hard to give 
> exact number without having trends from several days / weeks.  I
> would 
> also need new version of Fransisco's patch set that applies to
> latest 
> kernel like yours does.)
> 
> 
> > Since here boost is getting limited to turbo and non turbo, we need
> > some
> > ways to adjust the fractions corresponding to max non turbo as
> > well. It
> > is much easier to use the actual P-state limits for boost instead
> > of
> > fractions. So here P-state io boost limit is applied on top of the
> > P-state limit calculated via current algorithm by removing current
> > io_wait boost calculation using fractions.
> > 
> > Since we prefer to use common algorithm for all processor
> > platforms, this
> > change was tested on other client and sever platforms as well. All
> > results
> > were within the margin of errors. Results:
> > https://bugzilla.kernel.org/attachment.cgi?id=278149
> 
> Good.
> 
> Francisco, how well the listed PTS tests cover latency bound cases
> you 
> were concerned about?  [1]
> 
> 
> 	- Eero
> 
> [1] Fransisco was concerned that the patch:
> * trade-off might regress latency bound cases (which I haven't
> tested, I 
> tested only 3D throughput), and
> * that it addressed only other of the sources of energy
> inefficiencies 
> he had identified (which could explain slightly better 3D results
> from 
> his more complex patch set).
It is always possible to submit an improvement patch later. At least this
patch will reduce the usage of turbo.

Thanks,
Srinivas
Francisco Jerez Sept. 4, 2018, 6:53 a.m. UTC | #3
Eero Tamminen <eero.t.tamminen@intel.com> writes:

> Hi,
>
> On 31.08.2018 20:28, Srinivas Pandruvada wrote:
> ...
>> As per testing Eero Tamminen, the results are comparable to the patchset
>> https://patchwork.kernel.org/patch/10312259/
>> But he has to watch results for several days to check trend.
>
> It's close, but there is some gap compared to Francisco's version.
>
> (Because of the large variance on TDP limited devices, and factors 
> causing extra perf differences e.g. between boots, it's hard to give 
> exact number without having trends from several days / weeks.  I would 
> also need new version of Fransisco's patch set that applies to latest 
> kernel like yours does.)
>
>
>> Since here boost is getting limited to turbo and non turbo, we need some
>> ways to adjust the fractions corresponding to max non turbo as well. It
>> is much easier to use the actual P-state limits for boost instead of
>> fractions. So here P-state io boost limit is applied on top of the
>> P-state limit calculated via current algorithm by removing current
>> io_wait boost calculation using fractions.
>> 
>> Since we prefer to use common algorithm for all processor platforms, this
>> change was tested on other client and sever platforms as well. All results
>> were within the margin of errors. Results:
>> https://bugzilla.kernel.org/attachment.cgi?id=278149
>
> Good.
>
> Francisco, how well the listed PTS tests cover latency bound cases you 
> were concerned about?  [1]
>
>
> 	- Eero
>
> [1] Fransisco was concerned that the patch:
> * trade-off might regress latency bound cases (which I haven't tested, I 
> tested only 3D throughput), and
> * that it addressed only other of the sources of energy inefficiencies 
> he had identified (which could explain slightly better 3D results from 
> his more complex patch set).

This patch causes a number of statistically significant regressions
(with significance of 1%) on the two systems I've tested it on.  On my
CHV N3050:

| phoronix/fs-mark/test=0:                                                       XXX ±7.25% x34 -> XXX ±7.00% x39  d=-36.85% ±5.91%  p=0.00%
| phoronix/sqlite:                                                               XXX ±1.86% x34 -> XXX ±1.88% x39  d=-21.73% ±1.66%  p=0.00%
| warsow/benchsow:                                                               XXX ±1.25% x34 -> XXX ±1.95% x39  d=-10.83% ±1.53%  p=0.00%
| phoronix/iozone/record-size=4Kb/file-size=2GB/test=Read Performance:           XXX ±1.70% x31 -> XXX ±1.02% x34  d=-7.39% ±1.36%   p=0.00%
| phoronix/gtkperf/gtk-test=GtkComboBox:                                         XXX ±1.15% x13 -> XXX ±1.59% x14   d=-5.37% ±1.35%  p=0.00%
| lightsmark:                                                                    XXX ±1.45% x34 -> XXX ±0.97% x41   d=-4.66% ±1.19%  p=0.00%
| jxrendermark/rendering-test=Transformed Blit Bilinear/rendering-size=128x128:  XXX ±1.04% x31 -> XXX ±1.04% x39   d=-4.58% ±1.01%  p=0.00%
| jxrendermark/rendering-test=Linear Gradient Blend/rendering-size=128x128:      XXX ±0.12% x31 -> XXX ±0.19% x39   d=-3.60% ±0.16%  p=0.00%
| dbench/client-count=1:                                                         XXX ±0.50% x34 -> XXX ±0.50% x39   d=-2.51% ±0.49%  p=0.00%

On my BXT J3455:

| fs-mark/test=0:                                                                XXX ±3.04% x6 ->   XXX ±3.05% x9  d=-15.96% ±2.76%  p=0.00%
| sqlite:                                                                        XXX ±2.54% x6 ->   XXX ±2.72% x9  d=-12.42% ±2.44%  p=0.00%
| dbench/client-count=1:                                                         XXX ±0.42% x6 ->   XXX ±0.36% x9   d=-6.52% ±0.37%  p=0.00%
| dbench/client-count=2:                                                         XXX ±0.26% x6 ->   XXX ±0.33% x9   d=-5.22% ±0.29%  p=0.00%
| dbench/client-count=3:                                                         XXX ±0.34% x6 ->   XXX ±0.53% x9   d=-2.92% ±0.45%  p=0.00%
| x11perf/test=500px Compositing From Pixmap To Window:                          XXX ±2.29% x16 ->  XXX ±2.11% x19  d=-2.69% ±2.16%  p=0.09%
| lightsmark:                                                                    XXX ±0.44% x6 ->   XXX ±0.33% x9   d=-1.76% ±0.37%  p=0.00%
| j2dbench/rendering-test=Vector Graphics Rendering:                             XXX ±1.18% x16 ->  XXX ±1.82% x19  d=-1.71% ±1.54%  p=0.26%
| gtkperf/gtk-test=GtkComboBox:                                                  XXX ±0.37% x6 ->   XXX ±0.45% x9   d=-0.95% ±0.42%  p=0.08%
| jxrendermark/rendering-test=Transformed Blit Bilinear/rendering-size=128x128:  XXX ±0.21% x3 ->   XXX ±0.23% x6   d=-0.87% ±0.22%  p=0.08%

This is not surprising given that the patch is making a hard trade-off
between latency and energy efficiency without considering whether the
workload is IO- or latency-bound, which is the reason why the series I
submitted earlier [1] to address this problem implemented an IO
utilization statistic in order to determine whether the workload is
IO-bound, in which case the latency trade-off wouldn't impact
performance negatively.

Aside from that the improvement in graphics throughput seems like a
small fraction of the series [1] while TDP-bound.  E.g. with this patch
on my BXT J3455:

| unigine/heaven:           XXX ±0.21% x3 -> XXX ±0.19% x6  d=1.18% ±0.19%  p=0.01%
| unigine/valley:           XXX ±0.52% x3 -> XXX ±0.28% x6  d=1.56% ±0.37%  p=0.06%
| gfxbench/gl_manhattan31:  XXX ±0.12% x3 -> XXX ±0.21% x6  d=1.64% ±0.19%  p=0.00%
| gfxbench/gl_trex:         XXX ±0.56% x3 -> XXX ±0.36% x6  d=7.07% ±0.44%  p=0.00%

vs. my series on the same system:

| gfxbench/gl_manhattan31:  XXX ±0.37% x3 -> XXX ±0.08% x3  d=7.30% ±0.27%  p=0.00%
| unigine/heaven:           XXX ±0.47% x3 -> XXX ±0.40% x3  d=7.99% ±0.45%  p=0.00%
| unigine/valley:           XXX ±0.35% x3 -> XXX ±0.50% x3  d=8.24% ±0.45%  p=0.00%
| gfxbench/gl_trex:         XXX ±0.15% x3 -> XXX ±0.26% x3  d=9.12% ±0.23%  p=0.00%

That's not surprising either considering that this patch is only
addressing one of the two reasons the current non-HWP intel_pstate
governor behaves inefficiently (see [1] for more details on the other
reason).  And even that is only partially addressed since the heuristic
implemented in this patch in order to decide the degree of IOWAIT
boosting to apply can and will frequently trigger in heavily GPU-bound
cases, which will cause the task to IOWAIT on the GPU frequently,
causing the P-state controller to waste shared TDP for no benefit.

[1] https://lists.freedesktop.org/archives/intel-gfx/2018-March/160532.html
Srinivas Pandruvada Sept. 4, 2018, 5:50 p.m. UTC | #4
On Mon, 2018-09-03 at 23:53 -0700, Francisco Jerez wrote:
> Eero Tamminen <eero.t.tamminen@intel.com> writes:
> 
> > Hi,
> > 
> > On 31.08.2018 20:28, Srinivas Pandruvada wrote:
> > ...
> > > As per testing Eero Tamminen, the results are comparable to the
> > > patchset
> > > https://patchwork.kernel.org/patch/10312259/
> > > But he has to watch results for several days to check trend.
> > 
> > It's close, but there is some gap compared to Francisco's version.
> > 
> > (Because of the large variance on TDP limited devices, and factors 
> > causing extra perf differences e.g. between boots, it's hard to
> > give 
> > exact number without having trends from several days / weeks.  I
> > would 
> > also need new version of Fransisco's patch set that applies to
> > latest 
> > kernel like yours does.)
> > 
> > 
> > > Since here boost is getting limited to turbo and non turbo, we
> > > need some
> > > ways to adjust the fractions corresponding to max non turbo as
> > > well. It
> > > is much easier to use the actual P-state limits for boost instead
> > > of
> > > fractions. So here P-state io boost limit is applied on top of
> > > the
> > > P-state limit calculated via current algorithm by removing
> > > current
> > > io_wait boost calculation using fractions.
> > > 
> > > Since we prefer to use common algorithm for all processor
> > > platforms, this
> > > change was tested on other client and sever platforms as well.
> > > All results
> > > were within the margin of errors. Results:
> > > https://bugzilla.kernel.org/attachment.cgi?id=278149
> > 
> > Good.
> > 
> > Francisco, how well the listed PTS tests cover latency bound cases
> > you 
> > were concerned about?  [1]
> > 
> > 
> > 	- Eero
> > 
> > [1] Fransisco was concerned that the patch:
> > * trade-off might regress latency bound cases (which I haven't
> > tested, I 
> > tested only 3D throughput), and
> > * that it addressed only other of the sources of energy
> > inefficiencies 
> > he had identified (which could explain slightly better 3D results
> > from 
> > his more complex patch set).
> 
> This patch causes a number of statistically significant regressions
> (with significance of 1%) on the two systems I've tested it on.  On
> my
Sure. These patches are targeted at Atom clients, where some of these
server-like workloads may see minor regressions on few-watt TDP parts. But
that is weighed against the reduced turbo usage (which is enough of a
trigger by itself) and the improvement in the tests done by Eero (which was
the primary complaint to us).

It is a trivial patch. Yes, the patch is not perfect, but it doesn't close
the door on any further improvements.

I see your idea, but how to implement it in an acceptable way is a
challenge. So ultimately we will get there, but it will require some time
compared with other high-priority items.


Thanks
Srinivas

> CHV N3050:
> 
> > phoronix/fs-mark/test=0:                                                       XXX ±7.25% x34 -> XXX ±7.00% x39  d=-36.85% ±5.91%  p=0.00%
> > phoronix/sqlite:                                                               XXX ±1.86% x34 -> XXX ±1.88% x39  d=-21.73% ±1.66%  p=0.00%
> > warsow/benchsow:                                                               XXX ±1.25% x34 -> XXX ±1.95% x39  d=-10.83% ±1.53%  p=0.00%
> > phoronix/iozone/record-size=4Kb/file-size=2GB/test=Read Performance:           XXX ±1.70% x31 -> XXX ±1.02% x34  d=-7.39% ±1.36%   p=0.00%
> > phoronix/gtkperf/gtk-test=GtkComboBox:                                         XXX ±1.15% x13 -> XXX ±1.59% x14   d=-5.37% ±1.35%  p=0.00%
> > lightsmark:                                                                    XXX ±1.45% x34 -> XXX ±0.97% x41   d=-4.66% ±1.19%  p=0.00%
> > jxrendermark/rendering-test=Transformed Blit Bilinear/rendering-size=128x128:  XXX ±1.04% x31 -> XXX ±1.04% x39   d=-4.58% ±1.01%  p=0.00%
> > jxrendermark/rendering-test=Linear Gradient Blend/rendering-size=128x128:      XXX ±0.12% x31 -> XXX ±0.19% x39   d=-3.60% ±0.16%  p=0.00%
> > dbench/client-count=1:                                                         XXX ±0.50% x34 -> XXX ±0.50% x39   d=-2.51% ±0.49%  p=0.00%
> 
> On my BXT J3455:
> 
> > fs-mark/test=0:                                                                XXX ±3.04% x6 ->   XXX ±3.05% x9  d=-15.96% ±2.76%  p=0.00%
> > sqlite:                                                                        XXX ±2.54% x6 ->   XXX ±2.72% x9  d=-12.42% ±2.44%  p=0.00%
> > dbench/client-count=1:                                                         XXX ±0.42% x6 ->   XXX ±0.36% x9   d=-6.52% ±0.37%  p=0.00%
> > dbench/client-count=2:                                                         XXX ±0.26% x6 ->   XXX ±0.33% x9   d=-5.22% ±0.29%  p=0.00%
> > dbench/client-count=3:                                                         XXX ±0.34% x6 ->   XXX ±0.53% x9   d=-2.92% ±0.45%  p=0.00%
> > x11perf/test=500px Compositing From Pixmap To Window:                          XXX ±2.29% x16 ->  XXX ±2.11% x19  d=-2.69% ±2.16%  p=0.09%
> > lightsmark:                                                                    XXX ±0.44% x6 ->   XXX ±0.33% x9   d=-1.76% ±0.37%  p=0.00%
> > j2dbench/rendering-test=Vector Graphics Rendering:                             XXX ±1.18% x16 ->  XXX ±1.82% x19  d=-1.71% ±1.54%  p=0.26%
> > gtkperf/gtk-test=GtkComboBox:                                                  XXX ±0.37% x6 ->   XXX ±0.45% x9   d=-0.95% ±0.42%  p=0.08%
> > jxrendermark/rendering-test=Transformed Blit Bilinear/rendering-size=128x128:  XXX ±0.21% x3 ->   XXX ±0.23% x6   d=-0.87% ±0.22%  p=0.08%
> 
> This is not surprising given that the patch is making a hard trade-
> off
> between latency and energy efficiency without considering whether the
> workload is IO- or latency-bound, which is the reason why the series
> I
> submitted earlier [1] to address this problem implemented an IO
> utilization statistic in order to determine whether the workload is
> IO-bound, in which case the latency trade-off wouldn't impact
> performance negatively.
> 
> Aside from that the improvement in graphics throughput seems like a
> small fraction of the series [1] while TDP-bound.  E.g. with this
> patch
> on my BXT J3455:
> 
> > unigine/heaven:           XXX ±0.21% x3 -> XXX ±0.19% x6  d=1.18% ±0.19%  p=0.01%
> > unigine/valley:           XXX ±0.52% x3 -> XXX ±0.28% x6  d=1.56% ±0.37%  p=0.06%
> > gfxbench/gl_manhattan31:  XXX ±0.12% x3 -> XXX ±0.21% x6  d=1.64% ±0.19%  p=0.00%
> > gfxbench/gl_trex:         XXX ±0.56% x3 -> XXX ±0.36% x6  d=7.07% ±0.44%  p=0.00%
> 
> vs. my series on the same system:
> 
> > gfxbench/gl_manhattan31:  XXX ±0.37% x3 -> XXX ±0.08% x3  d=7.30% ±0.27%  p=0.00%
> > unigine/heaven:           XXX ±0.47% x3 -> XXX ±0.40% x3  d=7.99% ±0.45%  p=0.00%
> > unigine/valley:           XXX ±0.35% x3 -> XXX ±0.50% x3  d=8.24% ±0.45%  p=0.00%
> > gfxbench/gl_trex:         XXX ±0.15% x3 -> XXX ±0.26% x3  d=9.12% ±0.23%  p=0.00%
> 
> That's not surprising either considering that this patch is only
> addressing one of the two reasons the current non-HWP intel_pstate
> governor behaves inefficiently (see [1] for more details on the other
> reason).  And even that is only partially addressed since the
> heuristic
> implemented in this patch in order to decide the degree of IOWAIT
> boosting to apply can and will frequently trigger in heavily GPU-
> bound
> cases, which will cause the task to IOWAIT on the GPU frequently,
> causing the P-state controller to waste shared TDP for no benefit.
> 
> [1] https://lists.freedesktop.org/archives/intel-gfx/2018-March/16053
> 2.html
Francisco Jerez Sept. 5, 2018, 1:37 a.m. UTC | #5
Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com> writes:

> On Mon, 2018-09-03 at 23:53 -0700, Francisco Jerez wrote:
>> Eero Tamminen <eero.t.tamminen@intel.com> writes:
>> 
>> > Hi,
>> > 
>> > On 31.08.2018 20:28, Srinivas Pandruvada wrote:
>> > ...
>> > > As per testing Eero Tamminen, the results are comparable to the
>> > > patchset
>> > > https://patchwork.kernel.org/patch/10312259/
>> > > But he has to watch results for several days to check trend.
>> > 
>> > It's close, but there is some gap compared to Francisco's version.
>> > 
>> > (Because of the large variance on TDP limited devices, and factors 
>> > causing extra perf differences e.g. between boots, it's hard to
>> > give 
>> > exact number without having trends from several days / weeks.  I
>> > would 
>> > also need new version of Fransisco's patch set that applies to
>> > latest 
>> > kernel like yours does.)
>> > 
>> > 
>> > > Since here boost is getting limited to turbo and non turbo, we
>> > > need some
>> > > ways to adjust the fractions corresponding to max non turbo as
>> > > well. It
>> > > is much easier to use the actual P-state limits for boost instead
>> > > of
>> > > fractions. So here P-state io boost limit is applied on top of
>> > > the
>> > > P-state limit calculated via current algorithm by removing
>> > > current
>> > > io_wait boost calculation using fractions.
>> > > 
>> > > Since we prefer to use common algorithm for all processor
>> > > platforms, this
>> > > change was tested on other client and sever platforms as well.
>> > > All results
>> > > were within the margin of errors. Results:
>> > > https://bugzilla.kernel.org/attachment.cgi?id=278149
>> > 
>> > Good.
>> > 
>> > Francisco, how well the listed PTS tests cover latency bound cases
>> > you 
>> > were concerned about?  [1]
>> > 
>> > 
>> > 	- Eero
>> > 
>> > [1] Fransisco was concerned that the patch:
>> > * trade-off might regress latency bound cases (which I haven't
>> > tested, I 
>> > tested only 3D throughput), and
>> > * that it addressed only other of the sources of energy
>> > inefficiencies 
>> > he had identified (which could explain slightly better 3D results
>> > from 
>> > his more complex patch set).
>> 
>> This patch causes a number of statistically significant regressions
>> (with significance of 1%) on the two systems I've tested it on.  On
>> my
> Sure. These patches are targeted to Atom clients where some of these
> server like workload may have some minor regression on few watts TDP
> parts.

Neither the 36% regression of fs-mark, the 21% regression of sqlite, nor
the 10% regression of warsow qualifies as small.  And most of the test
cases on the list of regressions aren't exclusively server-like, if at
all.  Warsow, gtkperf, jxrendermark and lightsmark are all graphics
benchmarks -- Latency is as important if not more for interactive
workloads than it is for server workloads.  In the case of a conflict
like the one we're dealing with right now between optimizing for
throughput (e.g. for the maximum number of requests per second) and
optimizing for latency (e.g. for the minimum request duration), you are
more likely to be concerned about the former than about the latter in a
server setup.

> But weighing against reduced TURBO usage (which is enough trigger) and
> improvement in tests done by Eero (which was primary complaint to us).
>
> It is trivial patch. Yes, the patch is not perfect and doesn't close
> doors for any improvements.
>

It's sort of self-contained because it's awfully incomplete.

> I see your idea, but how to implement in acceptable way is a challenge.

Main challenge was getting the code to work without regressions in
latency-sensitive workloads, which I did, because you told me that it
wasn't acceptable for it to cause any regressions on the Phoronix
daily-system-tracker, whether latency-bound or not.  What is left in
order to address Peter's concerns is for the most part plumbing, that's
guaranteed not to have any functional impact on the heuristic.  The fact
that we don't expect it to change the performance of the system makes it
substantially less time-consuming to validate than altering the
performance trade-offs of the heuristic dynamically in order to avoid
regressions (which is what has kept my systems busy most of the time
lately).  Seems like my series, even in its current state without the
changes requested by Peter is closer to being ready for production than
this patch is.

> So ultimately we will get there, but will require some time compared
> with other high priority items.
>
>
> Thanks
> Srinivas
>
>> CHV N3050:
>> 
>> > phoronix/fs-mark/test=0:                                                       XXX ±7.25% x34 -> XXX ±7.00% x39  d=-36.85% ±5.91%  p=0.00%
>> > phoronix/sqlite:                                                               XXX ±1.86% x34 -> XXX ±1.88% x39  d=-21.73% ±1.66%  p=0.00%
>> > warsow/benchsow:                                                               XXX ±1.25% x34 -> XXX ±1.95% x39  d=-10.83% ±1.53%  p=0.00%
>> > phoronix/iozone/record-size=4Kb/file-size=2GB/test=Read Performance:           XXX ±1.70% x31 -> XXX ±1.02% x34  d=-7.39% ±1.36%   p=0.00%
>> > phoronix/gtkperf/gtk-test=GtkComboBox:                                         XXX ±1.15% x13 -> XXX ±1.59% x14   d=-5.37% ±1.35%  p=0.00%
>> > lightsmark:                                                                    XXX ±1.45% x34 -> XXX ±0.97% x41   d=-4.66% ±1.19%  p=0.00%
>> > jxrendermark/rendering-test=Transformed Blit Bilinear/rendering-size=128x128:  XXX ±1.04% x31 -> XXX ±1.04% x39   d=-4.58% ±1.01%  p=0.00%
>> > jxrendermark/rendering-test=Linear Gradient Blend/rendering-size=128x128:      XXX ±0.12% x31 -> XXX ±0.19% x39   d=-3.60% ±0.16%  p=0.00%
>> > dbench/client-count=1:                                                         XXX ±0.50% x34 -> XXX ±0.50% x39   d=-2.51% ±0.49%  p=0.00%
>> 
>> On my BXT J3455:
>> 
>> > fs-mark/test=0:                                                                XXX ±3.04% x6 ->   XXX ±3.05% x9  d=-15.96% ±2.76%  p=0.00%
>> > sqlite:                                                                        XXX ±2.54% x6 ->   XXX ±2.72% x9  d=-12.42% ±2.44%  p=0.00%
>> > dbench/client-count=1:                                                         XXX ±0.42% x6 ->   XXX ±0.36% x9   d=-6.52% ±0.37%  p=0.00%
>> > dbench/client-count=2:                                                         XXX ±0.26% x6 ->   XXX ±0.33% x9   d=-5.22% ±0.29%  p=0.00%
>> > dbench/client-count=3:                                                         XXX ±0.34% x6 ->   XXX ±0.53% x9   d=-2.92% ±0.45%  p=0.00%
>> > x11perf/test=500px Compositing From Pixmap To Window:                          XXX ±2.29% x16 ->  XXX ±2.11% x19  d=-2.69% ±2.16%  p=0.09%
>> > lightsmark:                                                                    XXX ±0.44% x6 ->   XXX ±0.33% x9   d=-1.76% ±0.37%  p=0.00%
>> > j2dbench/rendering-test=Vector Graphics Rendering:                             XXX ±1.18% x16 ->  XXX ±1.82% x19  d=-1.71% ±1.54%  p=0.26%
>> > gtkperf/gtk-test=GtkComboBox:                                                  XXX ±0.37% x6 ->   XXX ±0.45% x9   d=-0.95% ±0.42%  p=0.08%
>> > jxrendermark/rendering-test=Transformed Blit Bilinear/rendering-size=128x128:  XXX ±0.21% x3 ->   XXX ±0.23% x6   d=-0.87% ±0.22%  p=0.08%
>> 
>> This is not surprising given that the patch is making a hard trade-
>> off
>> between latency and energy efficiency without considering whether the
>> workload is IO- or latency-bound, which is the reason why the series
>> I
>> submitted earlier [1] to address this problem implemented an IO
>> utilization statistic in order to determine whether the workload is
>> IO-bound, in which case the latency trade-off wouldn't impact
>> performance negatively.
>> 
>> Aside from that the improvement in graphics throughput seems like a
>> small fraction of the series [1] while TDP-bound.  E.g. with this
>> patch
>> on my BXT J3455:
>> 
>> > unigine/heaven:           XXX ±0.21% x3 -> XXX ±0.19% x6  d=1.18% ±0.19%  p=0.01%
>> > unigine/valley:           XXX ±0.52% x3 -> XXX ±0.28% x6  d=1.56% ±0.37%  p=0.06%
>> > gfxbench/gl_manhattan31:  XXX ±0.12% x3 -> XXX ±0.21% x6  d=1.64% ±0.19%  p=0.00%
>> > gfxbench/gl_trex:         XXX ±0.56% x3 -> XXX ±0.36% x6  d=7.07% ±0.44%  p=0.00%
>> 
>> vs. my series on the same system:
>> 
>> > gfxbench/gl_manhattan31:  XXX ±0.37% x3 -> XXX ±0.08% x3  d=7.30% ±0.27%  p=0.00%
>> > unigine/heaven:           XXX ±0.47% x3 -> XXX ±0.40% x3  d=7.99% ±0.45%  p=0.00%
>> > unigine/valley:           XXX ±0.35% x3 -> XXX ±0.50% x3  d=8.24% ±0.45%  p=0.00%
>> > gfxbench/gl_trex:         XXX ±0.15% x3 -> XXX ±0.26% x3  d=9.12% ±0.23%  p=0.00%
>> 
>> That's not surprising either considering that this patch is only
>> addressing one of the two reasons the current non-HWP intel_pstate
>> governor behaves inefficiently (see [1] for more details on the other
>> reason).  And even that is only partially addressed since the
>> heuristic
>> implemented in this patch in order to decide the degree of IOWAIT
>> boosting to apply can and will frequently trigger in heavily GPU-
>> bound
>> cases, which will cause the task to IOWAIT on the GPU frequently,
>> causing the P-state controller to waste shared TDP for no benefit.
>> 
>> [1] https://lists.freedesktop.org/archives/intel-gfx/2018-March/16053
>> 2.html
Srinivas Pandruvada Sept. 5, 2018, 5:59 a.m. UTC | #6
[...]

> > > 
> > > This patch causes a number of statistically significant
> > > regressions
> > > (with significance of 1%) on the two systems I've tested it
> > > on.  On
> > > my
> > 
> > Sure. These patches are targeted to Atom clients where some of
> > these
> > server like workload may have some minor regression on few watts
> > TDP
> > parts.
> 
> Neither the 36% regression of fs-mark, the 21% regression of sqlite,
> nor
> the 10% regression of warsaw qualify as small.  And most of the test
> cases on the list of regressions aren't exclusively server-like, if
> at
> all.  Warsaw, gtkperf, jxrendermark and lightsmark are all graphics
> benchmarks -- Latency is as important if not more for interactive
> workloads than it is for server workloads.  In the case of a conflict
> like the one we're dealing with right now between optimizing for
> throughput (e.g. for the maximum number of requests per second) and
> optimizing for latency (e.g. for the minimum request duration), you
> are
> more likely to be concerned about the former than about the latter in
> a
> server setup.

Eero,
Please add your test results here.

No matter which algorithm you use, there will be variations. So you have
to look at the platforms which you are targeting. For this platform the
number one item is using less turbo, and I hope you know why.
On this platform the GFX patch caused this issue, as it was submitted after
the io boost patchset. So rather, that should be reverted first before
optimizing anything.


> 
> > But weighing against reduced TURBO usage (which is enough trigger)
> > and
> > improvement in tests done by Eero (which was primary complaint to
> > us).
> > 
> > It is trivial patch. Yes, the patch is not perfect and doesn't
> > close
> > doors for any improvements.
> > 
> 
> It's sort of self-contained because it's awfully incomplete.
Don't agree.

> 
> > I see your idea, but how to implement in acceptable way is a
> > challenge.
> 
> Main challenge was getting the code to work without regressions in
> latency-sensitive workloads, which I did, because you told me that it
> wasn't acceptable for it to cause any regressions on the Phoronix
> daily-system-tracker, whether latency-bound or not.
Yes, because your intention was to have a global change, not a
low-power-Atom-specific one.

>   What is left in
> order to address Peter's concerns is for the most part plumbing,
> that's
> guaranteed not to have any functional impact on the heuristic.  The
> fact
> that we don't expect it to change the performance of the system makes
> it
> substantially less time-consuming to validate than altering the
> performance trade-offs of the heuristic dynamically in order to avoid
> regressions (which is what has kept my systems busy most of the time
> lately).  Seems like my series, even in its current state without the
> changes requested by Peter is closer to being ready for production
> than
> this patch is.

Sorry, not at all. We call such patches an experimental series.
You caused a 100% regression in idle power. There is no version 2 after
that for us to even look at, even if you fixed it locally.

Thanks,
Srinivas
Francisco Jerez Sept. 6, 2018, 4:20 a.m. UTC | #7
Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com> writes:

> [...]
>
>> > > 
>> > > This patch causes a number of statistically significant
>> > > regressions
>> > > (with significance of 1%) on the two systems I've tested it
>> > > on.  On
>> > > my
>> > 
>> > Sure. These patches are targeted to Atom clients where some of
>> > these
>> > server like workload may have some minor regression on few watts
>> > TDP
>> > parts.
>> 
>> Neither the 36% regression of fs-mark, the 21% regression of sqlite,
>> nor
>> the 10% regression of warsaw qualify as small.  And most of the test
>> cases on the list of regressions aren't exclusively server-like, if
>> at
>> all.  Warsaw, gtkperf, jxrendermark and lightsmark are all graphics
>> benchmarks -- Latency is as important if not more for interactive
>> workloads than it is for server workloads.  In the case of a conflict
>> like the one we're dealing with right now between optimizing for
>> throughput (e.g. for the maximum number of requests per second) and
>> optimizing for latency (e.g. for the minimum request duration), you
>> are
>> more likely to be concerned about the former than about the latter in
>> a
>> server setup.
>
> Eero,
> Please add your test results here.
>
> No matter which algorithm you do, there will be variations. So you have
> to look at the platforms which you are targeting. For this platform 
> number one item is use of less turbo and hope you know why?

Unfortunately the current controller uses turbo frequently on Atoms for
TDP-limited graphics workloads regardless of IOWAIT boosting.  IOWAIT
boosting simply exacerbated the pre-existing energy efficiency problem.

> On this platform GFX patch caused this issue as it was submitted after
> io boost patchset. So rather that should be reverted first before
> optimizing anything.
>
>
>
>> 
>> > But weighing against reduced TURBO usage (which is enough trigger)
>> > and
>> > improvement in tests done by Eero (which was primary complaint to
>> > us).
>> > 
>> > It is trivial patch. Yes, the patch is not perfect and doesn't
>> > close
>> > doors for any improvements.
>> > 
>> 
>> It's sort of self-contained because it's awfully incomplete.Don't
>> agtr
>
>> 
>> > I see your idea, but how to implement in acceptable way is a
>> > challenge.
>> 
>> Main challenge was getting the code to work without regressions in
>> latency-sensitive workloads, which I did, because you told me that it
>> wasn't acceptable for it to cause any regressions on the Phoronix
>> daily-system-tracker, whether latency-bound or not.
> Yes, because your intention was to have a global change not a low power
> Atom specific,
>

My intention was to target low-power Atoms only since the first day we
discussed this problem.  The cover letter of the series I sent and the
commit messages make this fairly clear.

>>   What is left in
>> order to address Peter's concerns is for the most part plumbing,
>> that's
>> guaranteed not to have any functional impact on the heuristic.  The
>> fact
>> that we don't expect it to change the performance of the system makes
>> it
>> substantially less time-consuming to validate than altering the
>> performance trade-offs of the heuristic dynamically in order to avoid
>> regressions (which is what has kept my systems busy most of the time
>> lately).  Seems like my series, even in its current state without the
>> changes requested by Peter is closer to being ready for production
>> than
>> this patch is.
>
> Sorry, Not at all. We call such patches as experimental series.

The numbers speak otherwise.

> You caused 100% regression to idle power. There is no version 2 after
> that, even if you fixed locally even to look.
>

I did post a link to a v2 that fixed the idle power issue on the v1
thread, but I didn't intend to send it for review yet.  I'll send it out
once I've fully taken into account Peter's feedback.

> Thanks,
> Srinivas
Rafael J. Wysocki Sept. 11, 2018, 10:28 a.m. UTC | #8
On Friday, August 31, 2018 7:28:51 PM CEST Srinivas Pandruvada wrote:
> The io_boost mechanism, when scheduler update util callback indicates wake
> from IO was intended for short term boost to improve disk IO performance.
> But once i915 graphics driver changed to io-schedule_timeout() from
> schedule_timeout() during waiting for response from GPU, this caused
> constant IO boost, causing intel_pstate to constantly boost to turbo.
> This has bad side effect on TDP limited Atom platforms. The following two
> links shows the boost and frequency plot for GPU Test called
> triangle_benchmark.
> https://bugzilla.kernel.org/attachment.cgi?id=278091
> https://bugzilla.kernel.org/attachment.cgi?id=278093
> This type of behavior was observed with many graphics tests and regular
> use of such platforms with graphical interface.
> 
> The fix in this patch is to optimize the io boost by:
> - Limit the maximum boost to base frequency on TDP limited Atom platforms
> and max limit as 1 core turbo for the rest of the non-HWP platforms (same
> as the current algorithm).
> - Multilevel gradual boost: Start with increment = half of max boost and
> increase to max on the subsequent IO wakes.
> - Once max is reached and subsequent IO wakes don't cause boost for TDP
> limited Atom platforms, with assumption that the target CPU already built
> up enough load to run at higher P-state and the use of average frequency
> in the current algorithm will also help not to reduce drastically. For
> other platforms this is not limited similar to the current algorithm.
> 
> With this fix the resultant plots show the reduced average frequency and
> also causes upto 10% improvement in some graphics tests on Atom (Broxton)
> platform.
> https://bugzilla.kernel.org/attachment.cgi?id=278095
> https://bugzilla.kernel.org/attachment.cgi?id=278097
> 
> As per testing Eero Tamminen, the results are comparable to the patchset
> https://patchwork.kernel.org/patch/10312259/
> But he has to watch results for several days to check trend.
> 
> Since here boost is getting limited to turbo and non turbo, we need some
> ways to adjust the fractions corresponding to max non turbo as well. It
> is much easier to use the actual P-state limits for boost instead of
> fractions. So here P-state io boost limit is applied on top of the
> P-state limit calculated via current algorithm by removing current
> io_wait boost calculation using fractions.
> 
> Since we prefer to use common algorithm for all processor platforms, this
> change was tested on other client and sever platforms as well. All results
> were within the margin of errors. Results:
> https://bugzilla.kernel.org/attachment.cgi?id=278149
> 
> To identify TDP limited platforms a new callback boost_limited() is
> added, which will set a per cpudata field called iowait_boost_limited to
> 1. Currently this is set for Atom platforms only.
> 
> Tested-by: Eero Tamminen <eero.t.tamminen@intel.com>
> Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>

Let me say I would do this differently in a couple of ways.

First off, the patch makes at least two different changes in one go which
I don't quite agree with: it changes the way iowait boost works for everybody
and introduces differences between Atom and Core.  Doing them both in one
patch is quite questionable IMO.

It is not particularly clear that the iowait boost mechanism on non-Atom
needs to be changed at all and if so, then why.  Yes, you have results showing
that the change doesn't regress the other systems, but since we are going to
distinguish between Atom and non-Atom anyway, why change the others?

Also, it is not particularly clear to me if there's any benefit from doing
the iowait boosting on Atoms at all.  Wouldn't it be better to restrict
the iowait boosting to Core CPUs only and just leave Atoms alone?

On top of that, I don't quite like the new code organization, but let me
comment on the details below.

> ---
>  drivers/cpufreq/intel_pstate.c | 112 +++++++++++++++++++++++++++------
>  1 file changed, 92 insertions(+), 20 deletions(-)
> 
> diff --git a/drivers/cpufreq/intel_pstate.c b/drivers/cpufreq/intel_pstate.c
> index 1f2ce2f57842..15d9d5483d85 100644
> --- a/drivers/cpufreq/intel_pstate.c
> +++ b/drivers/cpufreq/intel_pstate.c
> @@ -195,7 +195,11 @@ struct global_params {
>   * @policy:		CPUFreq policy value
>   * @update_util:	CPUFreq utility callback information
>   * @update_util_set:	CPUFreq utility callback is set
> - * @iowait_boost:	iowait-related boost fraction
> + * @iowait_boost:	iowait-related boost P-state
> + * @iowait_boost_active: iowait boost processing is active
> + * @iowait_boost_max:	Max boost P-state to apply
> + * @iowait_boost_increment: increment to last boost P-state

I don't think you need this field.  It just takes space for almost no
speed benefit.

> + * @owait_boost_limited: If set give up boost, once reach max boost state

This field doesn't appear to be necessary either.  The value of iowait_boost_max
depends on it, but it looks like using iowait_boost_max alone should be
sufficient.

>   * @last_update:	Time of the last update.
>   * @pstate:		Stores P state limits for this CPU
>   * @vid:		Stores VID limits for this CPU
> @@ -254,6 +258,10 @@ struct cpudata {
>  	bool valid_pss_table;
>  #endif
>  	unsigned int iowait_boost;
> +	unsigned int iowait_boost_active;
> +	int iowait_boost_max;
> +	int iowait_boost_increment;
> +	int iowait_boost_limited;
>  	s16 epp_powersave;
>  	s16 epp_policy;
>  	s16 epp_default;
> @@ -276,6 +284,7 @@ static struct cpudata **all_cpu_data;
>   * @get_scaling:	Callback to get frequency scaling factor
>   * @get_val:		Callback to convert P state to actual MSR write value
>   * @get_vid:		Callback to get VID data for Atom platforms
> + * @boost_limited:	Callback to get max boost P-state, when applicable
>   *
>   * Core and Atom CPU models have different way to get P State limits. This
>   * structure is used to store those callbacks.
> @@ -289,6 +298,7 @@ struct pstate_funcs {
>  	int (*get_aperf_mperf_shift)(void);
>  	u64 (*get_val)(struct cpudata*, int pstate);
>  	void (*get_vid)(struct cpudata *);
> +	void (*boost_limited)(struct cpudata *cpudata);
>  };
>  
>  static struct pstate_funcs pstate_funcs __read_mostly;
> @@ -1251,6 +1261,11 @@ static void atom_get_vid(struct cpudata *cpudata)
>  	cpudata->vid.turbo = value & 0x7f;
>  }
>  
> +static void atom_client_boost_limited(struct cpudata *cpudata)
> +{
> +	cpudata->iowait_boost_limited = 1;
> +}

Adding a new callback just for this is a bit excessive IMO.

> +
>  static int core_get_min_pstate(void)
>  {
>  	u64 value;
> @@ -1441,6 +1456,19 @@ static void intel_pstate_get_cpu_pstates(struct cpudata *cpu)
>  		pstate_funcs.get_vid(cpu);
>  
>  	intel_pstate_set_min_pstate(cpu);
> +
> +	if (pstate_funcs.boost_limited)
> +		pstate_funcs.boost_limited(cpu);

And here the new callback is invoked just to set the new flag and this is
the only place the new flag is checked (apart from the check below that looks
redundant to me) and the only effect of it is the different value of
cpu->iowait_boost_max.

I really think this could be done in a more straightforward way. :-)

> +
> +	if (cpu->iowait_boost_limited)
> +		cpu->iowait_boost_max = cpu->pstate.max_pstate;
> +	else
> +		cpu->iowait_boost_max = cpu->pstate.turbo_pstate;
> +
> +	cpu->iowait_boost_increment = (cpu->iowait_boost_max - cpu->pstate.min_pstate) >> 1;

This is a very simple and fast computation and you can do it when necessary
(so long as both iowait_boost_max and pstate.min_pstate reside in the same cache
line).
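
E.g. something along these lines inside the boost path (just a sketch;
intel_pstate_adjust_for_io() is shown further down):

	/* Derive the step on demand instead of caching it in struct cpudata. */
	int increment = (cpu->iowait_boost_max - cpu->pstate.min_pstate) >> 1;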

> +	pr_debug("iowait_boost_limited: %d max_limit: %d increment %d\n",
> +		 cpu->iowait_boost_limited, cpu->iowait_boost_max,
> +		 cpu->iowait_boost_increment);
>  }
>  
>  /*
> @@ -1616,18 +1644,12 @@ static inline int32_t get_avg_pstate(struct cpudata *cpu)
>  static inline int32_t get_target_pstate(struct cpudata *cpu)
>  {
>  	struct sample *sample = &cpu->sample;
> -	int32_t busy_frac, boost;
> +	int32_t busy_frac;
>  	int target, avg_pstate;
>  
>  	busy_frac = div_fp(sample->mperf << cpu->aperf_mperf_shift,
>  			   sample->tsc);
>  
> -	boost = cpu->iowait_boost;
> -	cpu->iowait_boost >>= 1;
> -
> -	if (busy_frac < boost)
> -		busy_frac = boost;
> -
>  	sample->busy_scaled = busy_frac * 100;
>  
>  	target = global.no_turbo || global.turbo_disabled ?
> @@ -1670,6 +1692,27 @@ static void intel_pstate_update_pstate(struct cpudata *cpu, int pstate)
>  	wrmsrl(MSR_IA32_PERF_CTL, pstate_funcs.get_val(cpu, pstate));
>  }
>  
> +static int intel_pstate_adjust_for_io(struct cpudata *cpu, int target_pstate)
> +{
> +	if (!cpu->iowait_boost_active)
> +		return target_pstate;
> +
> +	cpu->iowait_boost_active = 0;
> +
> +	if (!cpu->iowait_boost)
> +		cpu->iowait_boost = cpu->pstate.min_pstate;
> +
> +	cpu->iowait_boost += cpu->iowait_boost_increment;
> +
> +	if (cpu->iowait_boost > cpu->iowait_boost_max)
> +		cpu->iowait_boost = cpu->iowait_boost_max;
> +
> +	if (cpu->iowait_boost > target_pstate)
> +		return cpu->iowait_boost;
> +
> +	return target_pstate;
> +}
> +
>  static void intel_pstate_adjust_pstate(struct cpudata *cpu)
>  {
>  	int from = cpu->pstate.current_pstate;
> @@ -1679,6 +1722,7 @@ static void intel_pstate_adjust_pstate(struct cpudata *cpu)
>  	update_turbo_state();
>  
>  	target_pstate = get_target_pstate(cpu);
> +	target_pstate = intel_pstate_adjust_for_io(cpu, target_pstate);
>  	target_pstate = intel_pstate_prepare_request(cpu, target_pstate);
>  	trace_cpu_frequency(target_pstate * cpu->pstate.scaling, cpu->cpu);
>  	intel_pstate_update_pstate(cpu, target_pstate);
> @@ -1692,7 +1736,7 @@ static void intel_pstate_adjust_pstate(struct cpudata *cpu)
>  		sample->aperf,
>  		sample->tsc,
>  		get_avg_frequency(cpu),
> -		fp_toint(cpu->iowait_boost * 100));
> +		cpu->iowait_boost);

This implicitly changes the definition of the tracepoint.

>  }
>  
>  static void intel_pstate_update_util(struct update_util_data *data, u64 time,
> @@ -1701,27 +1745,42 @@ static void intel_pstate_update_util(struct update_util_data *data, u64 time,
>  	struct cpudata *cpu = container_of(data, struct cpudata, update_util);
>  	u64 delta_ns;
>  
> +	cpu->sched_flags |= flags;
> +
>  	/* Don't allow remote callbacks */
>  	if (smp_processor_id() != cpu->cpu)
>  		return;
>  
> -	if (flags & SCHED_CPUFREQ_IOWAIT) {
> -		cpu->iowait_boost = int_tofp(1);
> -		cpu->last_update = time;
> +	if (cpu->sched_flags & SCHED_CPUFREQ_IOWAIT) {
> +		cpu->sched_flags &= ~SCHED_CPUFREQ_IOWAIT;
> +
> +		if (cpu->iowait_boost_limited && cpu->iowait_boost >= cpu->iowait_boost_max)

Isn't the iowait_boost_limited check redundant here?  Why do you want to skip
only in that case?

> +			goto skip_ioboost;
> +
>  		/*
> -		 * The last time the busy was 100% so P-state was max anyway
> -		 * so avoid overhead of computation.
> +		 * Set iowait_boost flag and update time. Since IO WAIT flag
> +		 * is set all the time, we can't just conclude that there is
> +		 * some IO bound activity is scheduled on this CPU with just
> +		 * one occurrence. If we receive at least two in two
> +		 * consecutive ticks, then we treat as a boost trigger.
>  		 */
> -		if (fp_toint(cpu->sample.busy_scaled) == 100)
> -			return;
> +		if (cpu->iowait_boost || time_before64(time, cpu->last_io_update + 2 * TICK_NSEC)) {

It took me quite a bit of time to figure out what was going on here. :-)

It might be clearer to treat the "iowait_boost in progress" and "start
iowait boosting" conditions as different and move some code from
intel_pstate_adjust_for_io() to here with that in mind.
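
One possible shape of that, purely as a sketch of the idea using the fields
from the quoted diff (not necessarily what is being suggested here):

		if (cpu->iowait_boost) {
			/* Boost already in progress: step it up here. */
			cpu->iowait_boost += cpu->iowait_boost_increment;
			if (cpu->iowait_boost > cpu->iowait_boost_max)
				cpu->iowait_boost = cpu->iowait_boost_max;
		} else if (time_before64(time, cpu->last_io_update + 2 * TICK_NSEC)) {
			/* Second IO wake within two ticks: start boosting. */
			cpu->iowait_boost = cpu->pstate.min_pstate +
					    cpu->iowait_boost_increment;
		}
		cpu->last_io_update = time;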

> +			cpu->iowait_boost_active = 1;
> +			cpu->last_io_update = time;
> +			cpu->last_update = time;
> +			goto set_pstate;
> +		}
>  
> -		goto set_pstate;
> +		cpu->last_io_update = time;
>  	} else if (cpu->iowait_boost) {
>  		/* Clear iowait_boost if the CPU may have been idle. */
>  		delta_ns = time - cpu->last_update;
> -		if (delta_ns > TICK_NSEC)
> +		if (delta_ns > TICK_NSEC) {
> +			cpu->iowait_boost_active = 0;
>  			cpu->iowait_boost = 0;
> +		}
>  	}
> +skip_ioboost:
>  	cpu->last_update = time;
>  	delta_ns = time - cpu->sample.time;
>  	if ((s64)delta_ns < INTEL_PSTATE_SAMPLING_INTERVAL)
> @@ -1749,6 +1808,7 @@ static const struct pstate_funcs silvermont_funcs = {
>  	.get_val = atom_get_val,
>  	.get_scaling = silvermont_get_scaling,
>  	.get_vid = atom_get_vid,
> +	.boost_limited = atom_client_boost_limited,
>  };

So as I said before, IMO this is unnecessarily convoluted.

I would introduce a struct cpu_model_data defined like this:

struct cpu_model_data {
	const struct pstate_funcs funcs;
	unsigned int iowait_boost_limited:1;
};

and redo the populating of intel_pstate_cpu_ids[] accordingly.  That would
be more lines of code, but easier to follow IMO.
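
For instance, the Goldmont table entries might then end up looking something
like this (just a sketch of the suggestion, reusing the callbacks from the
quoted atom_core_funcs and the existing ICPU macro):

static const struct cpu_model_data atom_core_model = {
	.funcs = {
		.get_max = core_get_max_pstate,
		.get_max_physical = core_get_max_pstate_physical,
		.get_min = core_get_min_pstate,
		.get_turbo = core_get_turbo_pstate,
		.get_scaling = core_get_scaling,
		.get_val = core_get_val,
	},
	.iowait_boost_limited = 1,
};

#define ICPU(model, model_data) \
	{ X86_VENDOR_INTEL, 6, model, X86_FEATURE_APERFMPERF, \
	  (unsigned long)&model_data }

	ICPU(INTEL_FAM6_ATOM_GOLDMONT,		atom_core_model),
	ICPU(INTEL_FAM6_ATOM_GEMINI_LAKE,	atom_core_model),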

>  static const struct pstate_funcs airmont_funcs = {
> @@ -1759,6 +1819,7 @@ static const struct pstate_funcs airmont_funcs = {
>  	.get_val = atom_get_val,
>  	.get_scaling = airmont_get_scaling,
>  	.get_vid = atom_get_vid,
> +	.boost_limited = atom_client_boost_limited,
>  };
>  
>  static const struct pstate_funcs knl_funcs = {
> @@ -1771,6 +1832,16 @@ static const struct pstate_funcs knl_funcs = {
>  	.get_val = core_get_val,
>  };
>
> +static struct pstate_funcs atom_core_funcs = {

Missing const?

And this whole struct looks like a waste of space.

> +	.get_max = core_get_max_pstate,
> +	.get_max_physical = core_get_max_pstate_physical,
> +	.get_min = core_get_min_pstate,
> +	.get_turbo = core_get_turbo_pstate,
> +	.get_scaling = core_get_scaling,
> +	.get_val = core_get_val,
> +	.boost_limited = atom_client_boost_limited,
> +};
> +
>  #define ICPU(model, policy) \
>  	{ X86_VENDOR_INTEL, 6, model, X86_FEATURE_APERFMPERF,\
>  			(unsigned long)&policy }
> @@ -1794,8 +1865,8 @@ static const struct x86_cpu_id intel_pstate_cpu_ids[] = {
>  	ICPU(INTEL_FAM6_BROADWELL_XEON_D,	core_funcs),
>  	ICPU(INTEL_FAM6_XEON_PHI_KNL,		knl_funcs),
>  	ICPU(INTEL_FAM6_XEON_PHI_KNM,		knl_funcs),
> -	ICPU(INTEL_FAM6_ATOM_GOLDMONT,		core_funcs),
> -	ICPU(INTEL_FAM6_ATOM_GEMINI_LAKE,       core_funcs),
> +	ICPU(INTEL_FAM6_ATOM_GOLDMONT,		atom_core_funcs),
> +	ICPU(INTEL_FAM6_ATOM_GEMINI_LAKE,	atom_core_funcs),
>  	ICPU(INTEL_FAM6_SKYLAKE_X,		core_funcs),
>  #ifdef INTEL_FAM6_ICELAKE_X
>  	ICPU(INTEL_FAM6_ICELAKE_X,		core_funcs),
> @@ -2390,6 +2461,7 @@ static void __init copy_cpu_funcs(struct pstate_funcs *funcs)
>  	pstate_funcs.get_val   = funcs->get_val;
>  	pstate_funcs.get_vid   = funcs->get_vid;
>  	pstate_funcs.get_aperf_mperf_shift = funcs->get_aperf_mperf_shift;
> +	pstate_funcs.boost_limited = funcs->boost_limited;
>  }
>  
>  #ifdef CONFIG_ACPI
>
Rafael J. Wysocki Sept. 11, 2018, 10:29 a.m. UTC | #9
On Monday, September 3, 2018 5:10:27 PM CEST Eero Tamminen wrote:
> Hi,
> 
> On 31.08.2018 20:28, Srinivas Pandruvada wrote:
> ...
> > As per testing Eero Tamminen, the results are comparable to the patchset
> > https://patchwork.kernel.org/patch/10312259/
> > But he has to watch results for several days to check trend.
> 
> It's close, but there is some gap compared to Francisco's version.

It is unclear what exactly you are referring to without a link to the
patch.

Thanks,
Rafael
Rafael J. Wysocki Sept. 11, 2018, 11:21 a.m. UTC | #10
On Thursday, September 6, 2018 6:20:08 AM CEST Francisco Jerez wrote:
> 
> Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com> writes:
> 
> > [...]
> >
> >> > >
> >> > > This patch causes a number of statistically significant
> >> > > regressions
> >> > > (with significance of 1%) on the two systems I've tested it
> >> > > on.  On
> >> > > my
> >> >
> >> > Sure. These patches are targeted to Atom clients where some of
> >> > these
> >> > server like workload may have some minor regression on few watts
> >> > TDP
> >> > parts.
> >>
> >> Neither the 36% regression of fs-mark, the 21% regression of sqlite,
> >> nor
> >> the 10% regression of warsaw qualify as small.  And most of the test
> >> cases on the list of regressions aren't exclusively server-like, if
> >> at
> >> all.  Warsaw, gtkperf, jxrendermark and lightsmark are all graphics
> >> benchmarks -- Latency is as important if not more for interactive
> >> workloads than it is for server workloads.  In the case of a conflict
> >> like the one we're dealing with right now between optimizing for
> >> throughput (e.g. for the maximum number of requests per second) and
> >> optimizing for latency (e.g. for the minimum request duration), you
> >> are
> >> more likely to be concerned about the former than about the latter in
> >> a
> >> server setup.
> >
> > Eero,
> > Please add your test results here.
> >
> > No matter which algorithm you do, there will be variations. So you have
> > to look at the platforms which you are targeting. For this platform
> > number one item is use of less turbo and hope you know why?
> 
> Unfortunately the current controller uses turbo frequently on Atoms for
> TDP-limited graphics workloads regardless of IOWAIT boosting.  IOWAIT
> boosting simply exacerbated the pre-existing energy efficiency problem.

My current understanding of the issue at hand is that using IOWAIT boosting
on Atoms is a regression relative to the previous behavior.  That is what
Srinivas is trying to address here AFAICS.

Now, you seem to be saying that the overall behavior is suboptimal and the
IOWAIT boosting doesn't matter that much, so some deeper changes are needed
anyway.  That may be the case, but if there is a meaningful regression, we
should first get back to the point where it is not present and then take
care of the more general problems.

So, I'd like to understand how much of a problem the IOWAIT boosting really is
in the first place.  If it is significant enough, let's address it first, this
way or another, and move on to the other problems subsequently.

Thanks,
Rafael
Francisco Jerez Sept. 11, 2018, 5:35 p.m. UTC | #11
"Rafael J. Wysocki" <rjw@rjwysocki.net> writes:

> On Thursday, September 6, 2018 6:20:08 AM CEST Francisco Jerez wrote:
>> 
>> Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com> writes:
>> 
>> > [...]
>> >
>> >> > >=20
>> >> > > This patch causes a number of statistically significant
>> >> > > regressions
>> >> > > (with significance of 1%) on the two systems I've tested it
>> >> > > on.  On
>> >> > > my
>> >> >=20
>> >> > Sure. These patches are targeted to Atom clients where some of
>> >> > these
>> >> > server like workload may have some minor regression on few watts
>> >> > TDP
>> >> > parts.
>> >>=20
>> >> Neither the 36% regression of fs-mark, the 21% regression of sqlite,
>> >> nor
>> >> the 10% regression of warsaw qualify as small.  And most of the test
>> >> cases on the list of regressions aren't exclusively server-like, if
>> >> at
>> >> all.  Warsaw, gtkperf, jxrendermark and lightsmark are all graphics
>> >> benchmarks -- Latency is as important if not more for interactive
>> >> workloads than it is for server workloads.  In the case of a conflict
>> >> like the one we're dealing with right now between optimizing for
>> >> throughput (e.g. for the maximum number of requests per second) and
>> >> optimizing for latency (e.g. for the minimum request duration), you
>> >> are
>> >> more likely to be concerned about the former than about the latter in
>> >> a
>> >> server setup.
>> >
>> > Eero,
>> > Please add your test results here.
>> >
>> > No matter which algorithm you do, there will be variations. So you have
>> > to look at the platforms which you are targeting. For this platform=20
>> > number one item is use of less turbo and hope you know why?
>> 
>> Unfortunately the current controller uses turbo frequently on Atoms for
>> TDP-limited graphics workloads regardless of IOWAIT boosting.  IOWAIT
>> boosting simply exacerbated the pre-existing energy efficiency problem.
>
> My current understanding of the issue at hand is that using IOWAIT boosting
> on Atoms is a regression relative to the previous behavior.

Not universally.  IOWAIT boosting helps under roughly the same
conditions on Atom as it does on big core, so applying this patch will
necessarily cause regressions too (see my reply from Sep. 3 for some
numbers).  It also won't completely restore the previous behavior, since
it merely decreases the degree of IOWAIT boosting applied without being
able to avoid it (cf. the series I'm working on, which does something
similar to IOWAIT boosting only when it can determine that the workload
is actually CPU-bound, preventing energy-inefficient behavior for
non-CPU-bound workloads that don't benefit from a higher CPU clock
frequency anyway).
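
To make "decreases the degree of IOWAIT boosting" concrete, below is a rough
standalone sketch of the ramp implemented by intel_pstate_adjust_for_io() in
the patch at the bottom of this page; the P-state numbers are made up for
illustration and do not correspond to any particular part:

#include <stdio.h>

/* Made-up P-state numbers, roughly in the range of a TDP-limited Atom part. */
#define MIN_PSTATE	6
#define MAX_BOOST	18	/* base-frequency P-state when boost is limited */

/*
 * Mirrors the ramp in intel_pstate_adjust_for_io(): jump to roughly half
 * of the boost range on the first IO wake, then clamp to the maximum on
 * subsequent wakes.
 */
static int next_boost(int cur_boost)
{
	int increment = (MAX_BOOST - MIN_PSTATE) >> 1;

	if (!cur_boost)
		cur_boost = MIN_PSTATE;
	cur_boost += increment;
	if (cur_boost > MAX_BOOST)
		cur_boost = MAX_BOOST;
	return cur_boost;
}

int main(void)
{
	int boost = 0;
	int wake;

	for (wake = 1; wake <= 3; wake++) {
		boost = next_boost(boost);
		printf("IO wake %d -> boost P-state %d\n", wake, boost);
	}
	return 0;	/* prints 12, 18, 18 with the numbers above */
}

With the fraction-based code that the patch removes, the very first IO wake
already drove the request all the way to the maximum (turbo) P-state.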

> That is what Srinivas is trying to address here AFAICS.
>
> Now, you seem to be saying that the overall behavior is suboptimal and the
> IOWAIT boosting doesn't matter that much,

I was just saying that IOWAIT boosting is less than half of the energy
efficiency problem, and this patch only partially addresses that half of
the problem.

> so some deeper changes are needed anyway.  That may be the case, but
> if there is a meaningful regression, we should first get back to the
> point where it is not present and then to take care of the more
> general problems.
>
> So, I'd like to understand how much of a problem the IOWAIT boosting really is
> in the first place.  If it is significant enough, let's address it first, this
> way or another, and move on to the other problems subsequently.
>

See the Unigine and Gfxbench numbers I provided in my reply from Sep. 3
to get an idea of the magnitude of the IOWAIT boosting problem vs. the
overall energy efficiency problem addressed by my series.

> Thanks,
> Rafael
Rafael J. Wysocki Sept. 14, 2018, 8:58 a.m. UTC | #12
On Tuesday, September 11, 2018 7:35:15 PM CEST Francisco Jerez wrote:
>
> "Rafael J. Wysocki" <rjw@rjwysocki.net> writes:
> 
> > On Thursday, September 6, 2018 6:20:08 AM CEST Francisco Jerez wrote:
> >> [...]
> >>
> >> Unfortunately the current controller uses turbo frequently on Atoms for
> >> TDP-limited graphics workloads regardless of IOWAIT boosting.  IOWAIT
> >> boosting simply exacerbated the pre-existing energy efficiency problem.
> >
> > My current understanding of the issue at hand is that using IOWAIT boosting
> > on Atoms is a regression relative to the previous behavior.
> 
> Not universally.  IOWAIT boosting helps under roughly the same
> conditions on Atom as it does on big core, so applying this patch will
> necessarily cause regressions too (see my reply from Sep. 3 for some
> numbers), and won't completely restore the previous behavior since it
> simply decreases the degree of IOWAIT boosting applied without being
> able to avoid it (c.f. the series I'm working on that does something
> similar to IOWAIT boosting when it's able to determine it's actually
> CPU-bound, which prevents energy inefficient behavior for non-CPU-bound
> workloads that don't benefit from a higher CPU clock frequency anyway).

Well, OK.  That doesn't seem to be a clear-cut regression situation then,
since going back to the previous behavior is apparently not desirable.

Or would it restore the previous behavior if we didn't do any IOWAIT
boosting on Atoms at all?

> > That is what Srinivas is trying to address here AFAICS.
> >
> > Now, you seem to be saying that the overall behavior is suboptimal and the
> > IOWAIT boosting doesn't matter that much,
> 
> I was just saying that IOWAIT boosting is less than half of the energy
> efficiency problem, and this patch only partially addresses that half of
> the problem.

Well, fair enough, but there are two things to consider here: the general
energy-efficiency problem and the difference made by IOWAIT boosting.

If the general energy-efficiency problem has existed for a relatively long
time but has got worse recently due to the IOWAIT boosting, it may still
be desirable to get the IOWAIT boosting out of the picture first
and then get to the general problem.

I'm not sure whether that is the case, as my experience with Atoms is
limited, so please advise.

Thanks,
Rafael
Francisco Jerez Sept. 15, 2018, 6:34 a.m. UTC | #13
"Rafael J. Wysocki" <rjw@rjwysocki.net> writes:

> On Tuesday, September 11, 2018 7:35:15 PM CEST Francisco Jerez wrote:
>> [...]
>>
>> Not universally.  IOWAIT boosting helps under roughly the same
>> conditions on Atom as it does on big core, so applying this patch will
>> necessarily cause regressions too (see my reply from Sep. 3 for some
>> numbers), and won't completely restore the previous behavior since it
>> simply decreases the degree of IOWAIT boosting applied without being
>> able to avoid it (c.f. the series I'm working on that does something
>> similar to IOWAIT boosting when it's able to determine it's actually
>> CPU-bound, which prevents energy inefficient behavior for non-CPU-bound
>> workloads that don't benefit from a higher CPU clock frequency anyway).
>
> Well, OK.  That doesn't seem to be a clear-cut regression situation, then,
> since getting back is not desirable, apparently.
>
> Or would it restore the previous behavior if we didn't do any IOWAIT
> boosting on Atoms at all?
>
>> > That is what Srinivas is trying to address here AFAICS.
>> >
>> > Now, you seem to be saying that the overall behavior is suboptimal and the
>> > IOWAIT boosting doesn't matter that much,
>> 
>> I was just saying that IOWAIT boosting is less than half of the energy
>> efficiency problem, and this patch only partially addresses that half of
>> the problem.
>
> Well, fair enough, but there are two things to consider here, the general
> energy-efficiency problem and the difference made by IOWAIT boosting.
>
> If the general energy-efficiency problem had existed for a relatively long
> time, but it has got worse recently due to the IOWAIT boosting, it still
> may be desirable to get the IOWAIT boosting out of the picture first
> and then get to the general problem.
>

IMHO what is needed in order to address the IOWAIT boosting energy
efficiency problem is roughly the same as what we need in order to
address the other energy efficiency problem: a mechanism along the lines
of [1] allowing us to determine whether the workload is IO-bound or not.
In the former case IOWAIT boosting won't be able to improve the
performance of the workload, since the limiting factor is the IO
throughput, so it will only increase the energy usage, potentially
exacerbating the bottleneck if the IO device is an integrated GPU.  In
the latter case, where the CPU and the IO devices being waited on are
both underutilized, it makes sense to optimize for low latency more
aggressively (certainly more aggressively than this patch does), which
will increase the utilization of the IO devices until at least one of
them becomes a bottleneck, at which point the throughput of the system
becomes roughly independent of the CPU frequency and we're back to the
former case.

[1] https://patchwork.kernel.org/patch/10312259/
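
As a purely illustrative reading of the above, the decision boils down to
boosting for latency only when neither the CPU nor the IO device being
waited on is saturated.  The sketch below is hypothetical; the utilization
inputs, the 90% threshold and the should_latency_boost() helper are not
taken from [1] or from intel_pstate:

#include <stdbool.h>
#include <stdio.h>

struct util_sample {
	double cpu_util;	/* 0.0 .. 1.0, fraction of time the CPU was busy */
	double io_util;		/* 0.0 .. 1.0, utilization of the busiest IO device waited on */
};

/* Boost for low latency only when neither side of the pipeline is saturated. */
static bool should_latency_boost(const struct util_sample *s)
{
	const double saturated = 0.9;	/* arbitrary example threshold */

	if (s->io_util >= saturated)
		return false;	/* IO-bound: a higher CPU clock can't raise throughput */
	if (s->cpu_util >= saturated)
		return false;	/* CPU-bound: the regular governor ramps up on its own */
	return true;		/* latency-bound: boosting may raise IO utilization */
}

int main(void)
{
	struct util_sample io_bound = { 0.35, 0.95 };
	struct util_sample latency_bound = { 0.30, 0.40 };

	printf("io_bound: boost=%d\n", should_latency_boost(&io_bound));
	printf("latency_bound: boost=%d\n", should_latency_boost(&latency_bound));
	return 0;
}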

> I'm not sure if that is the case as my experience with Atoms is limited
> anyway, so please advise.
>
> Thanks,
> Rafael
Rafael J. Wysocki Sept. 17, 2018, 9:07 a.m. UTC | #14
On Sat, Sep 15, 2018 at 8:53 AM Francisco Jerez <currojerez@riseup.net> wrote:
>
> "Rafael J. Wysocki" <rjw@rjwysocki.net> writes:
>
> > [...]
> >
> > Well, fair enough, but there are two things to consider here, the general
> > energy-efficiency problem and the difference made by IOWAIT boosting.
> >
> > If the general energy-efficiency problem had existed for a relatively long
> > time, but it has got worse recently due to the IOWAIT boosting, it still
> > may be desirable to get the IOWAIT boosting out of the picture first
> > and then get to the general problem.
> >
>
> IMHO what is needed in order to address the IOWAIT boosting energy
> efficiency problem is roughly the same we need in order to address the
> other energy efficiency problem: A mechanism along the lines of [1]
> allowing us to determine whether the workload is IO-bound or not.  In
> the former case IOWAIT boosting won't be able to improve the performance
> of the workload since the limiting factor is the IO throughput, so it
> will only increase the energy usage, potentially exacerbating the
> bottleneck if the IO device is an integrated GPU.  In the latter case
> where the CPU and IO devices being waited on are both underutilized it
> makes sense to optimize for low latency more aggressively (certainly
> more aggressively than this patch does) which will increase the
> utilization of the IO devices until at least one IO device becomes a
> bottleneck, at which point the throughput of the system becomes roughly
> independent of the CPU frequency and we're back to the former case.
>
> [1] https://patchwork.kernel.org/patch/10312259/

I remember your argumentation above from the previous posts and I'm
not questioning it.  I don't see much point in repeating arguments
that have been given already.

My question was whether or not there was a regression related
specifically to adding the IOWAIT boosting mechanism that needed to be
addressed separately.  I gather from the discussion so far that this
is not the case.

Thanks,
Rafael
Francisco Jerez Sept. 17, 2018, 9:23 p.m. UTC | #15
"Rafael J. Wysocki" <rafael@kernel.org> writes:

> On Sat, Sep 15, 2018 at 8:53 AM Francisco Jerez <currojerez@riseup.net> wrote:
>> [...]
>>
>> IMHO what is needed in order to address the IOWAIT boosting energy
>> efficiency problem is roughly the same we need in order to address the
>> other energy efficiency problem: A mechanism along the lines of [1]
>> allowing us to determine whether the workload is IO-bound or not.  In
>> the former case IOWAIT boosting won't be able to improve the performance
>> of the workload since the limiting factor is the IO throughput, so it
>> will only increase the energy usage, potentially exacerbating the
>> bottleneck if the IO device is an integrated GPU.  In the latter case
>> where the CPU and IO devices being waited on are both underutilized it
>> makes sense to optimize for low latency more aggressively (certainly
>> more aggressively than this patch does) which will increase the
>> utilization of the IO devices until at least one IO device becomes a
>> bottleneck, at which point the throughput of the system becomes roughly
>> independent of the CPU frequency and we're back to the former case.
>>
>> [1] https://patchwork.kernel.org/patch/10312259/
>
> I remember your argumentation above from the previous posts and I'm
> not questioning it.  I don't see much point in repeating arguments
> that have been given already.
>
> My question was whether or not there was a regression related
> specifically to adding the IOWAIT boosting mechanism that needed to be
> addressed separately.  I gather from the discussion so far that this
> is not the case.
>

There was possibly some slight graphics performance regression when i915
started doing IO waits, but i915 didn't support any BXT+ devices back
then, which are the ones most severely hurt by it, so it probably didn't
cause a significant drop in most available benchmark numbers and went
unnoticed.

Regardless of whether there was a regression, I don't see the need to
fix the IOWAIT issue separately from the other energy-efficiency issues
of the active-mode governor, because they both admit a common
solution...

> Thanks,
> Rafael
diff mbox series

Patch

diff --git a/drivers/cpufreq/intel_pstate.c b/drivers/cpufreq/intel_pstate.c
index 1f2ce2f57842..15d9d5483d85 100644
--- a/drivers/cpufreq/intel_pstate.c
+++ b/drivers/cpufreq/intel_pstate.c
@@ -195,7 +195,11 @@  struct global_params {
  * @policy:		CPUFreq policy value
  * @update_util:	CPUFreq utility callback information
  * @update_util_set:	CPUFreq utility callback is set
- * @iowait_boost:	iowait-related boost fraction
+ * @iowait_boost:	iowait-related boost P-state
+ * @iowait_boost_active: iowait boost processing is active
+ * @iowait_boost_max:	Max boost P-state to apply
+ * @iowait_boost_increment: increment to last boost P-state
+ * @iowait_boost_limited: If set, give up boosting once the max boost P-state is reached
  * @last_update:	Time of the last update.
  * @pstate:		Stores P state limits for this CPU
  * @vid:		Stores VID limits for this CPU
@@ -254,6 +258,10 @@  struct cpudata {
 	bool valid_pss_table;
 #endif
 	unsigned int iowait_boost;
+	unsigned int iowait_boost_active;
+	int iowait_boost_max;
+	int iowait_boost_increment;
+	int iowait_boost_limited;
 	s16 epp_powersave;
 	s16 epp_policy;
 	s16 epp_default;
@@ -276,6 +284,7 @@  static struct cpudata **all_cpu_data;
  * @get_scaling:	Callback to get frequency scaling factor
  * @get_val:		Callback to convert P state to actual MSR write value
  * @get_vid:		Callback to get VID data for Atom platforms
+ * @boost_limited:	Callback to get max boost P-state, when applicable
  *
  * Core and Atom CPU models have different way to get P State limits. This
  * structure is used to store those callbacks.
@@ -289,6 +298,7 @@  struct pstate_funcs {
 	int (*get_aperf_mperf_shift)(void);
 	u64 (*get_val)(struct cpudata*, int pstate);
 	void (*get_vid)(struct cpudata *);
+	void (*boost_limited)(struct cpudata *cpudata);
 };
 
 static struct pstate_funcs pstate_funcs __read_mostly;
@@ -1251,6 +1261,11 @@  static void atom_get_vid(struct cpudata *cpudata)
 	cpudata->vid.turbo = value & 0x7f;
 }
 
+static void atom_client_boost_limited(struct cpudata *cpudata)
+{
+	cpudata->iowait_boost_limited = 1;
+}
+
 static int core_get_min_pstate(void)
 {
 	u64 value;
@@ -1441,6 +1456,19 @@  static void intel_pstate_get_cpu_pstates(struct cpudata *cpu)
 		pstate_funcs.get_vid(cpu);
 
 	intel_pstate_set_min_pstate(cpu);
+
+	if (pstate_funcs.boost_limited)
+		pstate_funcs.boost_limited(cpu);
+
+	if (cpu->iowait_boost_limited)
+		cpu->iowait_boost_max = cpu->pstate.max_pstate;
+	else
+		cpu->iowait_boost_max = cpu->pstate.turbo_pstate;
+
+	cpu->iowait_boost_increment = (cpu->iowait_boost_max - cpu->pstate.min_pstate) >> 1;
+	pr_debug("iowait_boost_limited: %d max_limit: %d increment %d\n",
+		 cpu->iowait_boost_limited, cpu->iowait_boost_max,
+		 cpu->iowait_boost_increment);
 }
 
 /*
@@ -1616,18 +1644,12 @@  static inline int32_t get_avg_pstate(struct cpudata *cpu)
 static inline int32_t get_target_pstate(struct cpudata *cpu)
 {
 	struct sample *sample = &cpu->sample;
-	int32_t busy_frac, boost;
+	int32_t busy_frac;
 	int target, avg_pstate;
 
 	busy_frac = div_fp(sample->mperf << cpu->aperf_mperf_shift,
 			   sample->tsc);
 
-	boost = cpu->iowait_boost;
-	cpu->iowait_boost >>= 1;
-
-	if (busy_frac < boost)
-		busy_frac = boost;
-
 	sample->busy_scaled = busy_frac * 100;
 
 	target = global.no_turbo || global.turbo_disabled ?
@@ -1670,6 +1692,27 @@  static void intel_pstate_update_pstate(struct cpudata *cpu, int pstate)
 	wrmsrl(MSR_IA32_PERF_CTL, pstate_funcs.get_val(cpu, pstate));
 }
 
+static int intel_pstate_adjust_for_io(struct cpudata *cpu, int target_pstate)
+{
+	if (!cpu->iowait_boost_active)
+		return target_pstate;
+
+	cpu->iowait_boost_active = 0;
+
+	if (!cpu->iowait_boost)
+		cpu->iowait_boost = cpu->pstate.min_pstate;
+
+	cpu->iowait_boost += cpu->iowait_boost_increment;
+
+	if (cpu->iowait_boost > cpu->iowait_boost_max)
+		cpu->iowait_boost = cpu->iowait_boost_max;
+
+	if (cpu->iowait_boost > target_pstate)
+		return cpu->iowait_boost;
+
+	return target_pstate;
+}
+
 static void intel_pstate_adjust_pstate(struct cpudata *cpu)
 {
 	int from = cpu->pstate.current_pstate;
@@ -1679,6 +1722,7 @@  static void intel_pstate_adjust_pstate(struct cpudata *cpu)
 	update_turbo_state();
 
 	target_pstate = get_target_pstate(cpu);
+	target_pstate = intel_pstate_adjust_for_io(cpu, target_pstate);
 	target_pstate = intel_pstate_prepare_request(cpu, target_pstate);
 	trace_cpu_frequency(target_pstate * cpu->pstate.scaling, cpu->cpu);
 	intel_pstate_update_pstate(cpu, target_pstate);
@@ -1692,7 +1736,7 @@  static void intel_pstate_adjust_pstate(struct cpudata *cpu)
 		sample->aperf,
 		sample->tsc,
 		get_avg_frequency(cpu),
-		fp_toint(cpu->iowait_boost * 100));
+		cpu->iowait_boost);
 }
 
 static void intel_pstate_update_util(struct update_util_data *data, u64 time,
@@ -1701,27 +1745,42 @@  static void intel_pstate_update_util(struct update_util_data *data, u64 time,
 	struct cpudata *cpu = container_of(data, struct cpudata, update_util);
 	u64 delta_ns;
 
+	cpu->sched_flags |= flags;
+
 	/* Don't allow remote callbacks */
 	if (smp_processor_id() != cpu->cpu)
 		return;
 
-	if (flags & SCHED_CPUFREQ_IOWAIT) {
-		cpu->iowait_boost = int_tofp(1);
-		cpu->last_update = time;
+	if (cpu->sched_flags & SCHED_CPUFREQ_IOWAIT) {
+		cpu->sched_flags &= ~SCHED_CPUFREQ_IOWAIT;
+
+		if (cpu->iowait_boost_limited && cpu->iowait_boost >= cpu->iowait_boost_max)
+			goto skip_ioboost;
+
 		/*
-		 * The last time the busy was 100% so P-state was max anyway
-		 * so avoid overhead of computation.
+		 * Set the iowait_boost flag and update the time. Since the
+		 * IOWAIT flag is set all the time, we can't conclude from a
+		 * single occurrence that some IO-bound activity is scheduled
+		 * on this CPU. Only if we receive at least two in two
+		 * consecutive ticks do we treat it as a boost trigger.
 		 */
-		if (fp_toint(cpu->sample.busy_scaled) == 100)
-			return;
+		if (cpu->iowait_boost || time_before64(time, cpu->last_io_update + 2 * TICK_NSEC)) {
+			cpu->iowait_boost_active = 1;
+			cpu->last_io_update = time;
+			cpu->last_update = time;
+			goto set_pstate;
+		}
 
-		goto set_pstate;
+		cpu->last_io_update = time;
 	} else if (cpu->iowait_boost) {
 		/* Clear iowait_boost if the CPU may have been idle. */
 		delta_ns = time - cpu->last_update;
-		if (delta_ns > TICK_NSEC)
+		if (delta_ns > TICK_NSEC) {
+			cpu->iowait_boost_active = 0;
 			cpu->iowait_boost = 0;
+		}
 	}
+skip_ioboost:
 	cpu->last_update = time;
 	delta_ns = time - cpu->sample.time;
 	if ((s64)delta_ns < INTEL_PSTATE_SAMPLING_INTERVAL)
@@ -1749,6 +1808,7 @@  static const struct pstate_funcs silvermont_funcs = {
 	.get_val = atom_get_val,
 	.get_scaling = silvermont_get_scaling,
 	.get_vid = atom_get_vid,
+	.boost_limited = atom_client_boost_limited,
 };
 
 static const struct pstate_funcs airmont_funcs = {
@@ -1759,6 +1819,7 @@  static const struct pstate_funcs airmont_funcs = {
 	.get_val = atom_get_val,
 	.get_scaling = airmont_get_scaling,
 	.get_vid = atom_get_vid,
+	.boost_limited = atom_client_boost_limited,
 };
 
 static const struct pstate_funcs knl_funcs = {
@@ -1771,6 +1832,16 @@  static const struct pstate_funcs knl_funcs = {
 	.get_val = core_get_val,
 };
 
+static struct pstate_funcs atom_core_funcs = {
+	.get_max = core_get_max_pstate,
+	.get_max_physical = core_get_max_pstate_physical,
+	.get_min = core_get_min_pstate,
+	.get_turbo = core_get_turbo_pstate,
+	.get_scaling = core_get_scaling,
+	.get_val = core_get_val,
+	.boost_limited = atom_client_boost_limited,
+};
+
 #define ICPU(model, policy) \
 	{ X86_VENDOR_INTEL, 6, model, X86_FEATURE_APERFMPERF,\
 			(unsigned long)&policy }
@@ -1794,8 +1865,8 @@  static const struct x86_cpu_id intel_pstate_cpu_ids[] = {
 	ICPU(INTEL_FAM6_BROADWELL_XEON_D,	core_funcs),
 	ICPU(INTEL_FAM6_XEON_PHI_KNL,		knl_funcs),
 	ICPU(INTEL_FAM6_XEON_PHI_KNM,		knl_funcs),
-	ICPU(INTEL_FAM6_ATOM_GOLDMONT,		core_funcs),
-	ICPU(INTEL_FAM6_ATOM_GEMINI_LAKE,       core_funcs),
+	ICPU(INTEL_FAM6_ATOM_GOLDMONT,		atom_core_funcs),
+	ICPU(INTEL_FAM6_ATOM_GEMINI_LAKE,	atom_core_funcs),
 	ICPU(INTEL_FAM6_SKYLAKE_X,		core_funcs),
 #ifdef INTEL_FAM6_ICELAKE_X
 	ICPU(INTEL_FAM6_ICELAKE_X,		core_funcs),
@@ -2390,6 +2461,7 @@  static void __init copy_cpu_funcs(struct pstate_funcs *funcs)
 	pstate_funcs.get_val   = funcs->get_val;
 	pstate_funcs.get_vid   = funcs->get_vid;
 	pstate_funcs.get_aperf_mperf_shift = funcs->get_aperf_mperf_shift;
+	pstate_funcs.boost_limited = funcs->boost_limited;
 }
 
 #ifdef CONFIG_ACPI