
[4/4] cpufreq: intel_pstate: enable boost for Skylake Xeon

Message ID 20180605214242.62156-5-srinivas.pandruvada@linux.intel.com (mailing list archive)
State Mainlined
Delegated to: Rafael Wysocki

Commit Message

srinivas pandruvada June 5, 2018, 9:42 p.m. UTC
Enable HWP boost on Skylake server and workstations.

Reported-by: Mel Gorman <mgorman@techsingularity.net>
Tested-by: Giovanni Gherdovich <ggherdovich@suse.cz>
Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
---
 drivers/cpufreq/intel_pstate.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

Comments

Francisco Jerez July 28, 2018, 5:34 a.m. UTC | #1
Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com> writes:

> Enable HWP boost on Skylake server and workstations.
>

Please revert this series, it led to significant energy usage and
graphics performance regressions [1].  The reasons are roughly the ones
we discussed by e-mail off-list last April: This causes the intel_pstate
driver to decrease the EPP to zero when the workload blocks on IO
frequently enough, which for the regressing benchmarks detailed in [1]
is a symptom of the workload being heavily IO-bound, which means they
won't benefit at all from the EPP boost since they aren't significantly
CPU-bound, and they will suffer a decrease in parallelism due to the
active CPU core using a larger fraction of the TDP in order to achieve
the same work, causing the GPU to have a lower power budget available,
leading to a decrease in system performance.

You may want to give a shot to my previous suggestion of using [2] in
order to detect whether the system is IO-bound, which you can use as an
indicator that the optimization implemented in this series cannot
possibly improve performance and can be expected to hurt energy usage.

Thanks.

[1] https://bugs.freedesktop.org/show_bug.cgi?id=107410
[2] https://patchwork.kernel.org/patch/10312259/

> Reported-by: Mel Gorman <mgorman@techsingularity.net>
> Tested-by: Giovanni Gherdovich <ggherdovich@suse.cz>
> Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
> ---
>  drivers/cpufreq/intel_pstate.c | 10 ++++++++++
>  1 file changed, 10 insertions(+)
>
> diff --git a/drivers/cpufreq/intel_pstate.c b/drivers/cpufreq/intel_pstate.c
> index 70bf63bb4e0e..01c8da1f99db 100644
> --- a/drivers/cpufreq/intel_pstate.c
> +++ b/drivers/cpufreq/intel_pstate.c
> @@ -1794,6 +1794,12 @@ static const struct x86_cpu_id intel_pstate_cpu_ee_disable_ids[] = {
>  	{}
>  };
>  
> +static const struct x86_cpu_id intel_pstate_hwp_boost_ids[] __initconst = {
> +	ICPU(INTEL_FAM6_SKYLAKE_X, core_funcs),
> +	ICPU(INTEL_FAM6_SKYLAKE_DESKTOP, core_funcs),
> +	{}
> +};
> +
>  static int intel_pstate_init_cpu(unsigned int cpunum)
>  {
>  	struct cpudata *cpu;
> @@ -1824,6 +1830,10 @@ static int intel_pstate_init_cpu(unsigned int cpunum)
>  			intel_pstate_disable_ee(cpunum);
>  
>  		intel_pstate_hwp_enable(cpu);
> +
> +		id = x86_match_cpu(intel_pstate_hwp_boost_ids);
> +		if (id)
> +			hwp_boost = true;
>  	}
>  
>  	intel_pstate_get_cpu_pstates(cpu);
> -- 
> 2.13.6
Mel Gorman July 28, 2018, 12:36 p.m. UTC | #2
On Fri, Jul 27, 2018 at 10:34:03PM -0700, Francisco Jerez wrote:
> Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com> writes:
> 
> > Enable HWP boost on Skylake server and workstations.
> >
> 
> Please revert this series, it led to significant energy usage and
> graphics performance regressions [1].  The reasons are roughly the ones
> we discussed by e-mail off-list last April: This causes the intel_pstate
> driver to decrease the EPP to zero when the workload blocks on IO
> frequently enough, which for the regressing benchmarks detailed in [1]
> is a symptom of the workload being heavily IO-bound, which means they
> won't benefit at all from the EPP boost since they aren't significantly
> CPU-bound, and they will suffer a decrease in parallelism due to the
> active CPU core using a larger fraction of the TDP in order to achieve
> the same work, causing the GPU to have a lower power budget available,
> leading to a decrease in system performance.

It slices both ways. With the series, there are large boosts to
performance on other workloads where a slight increase in power usage is
acceptable in exchange for performance. For example,

Single socket skylake running sqlite
                                 v4.17               41ab43c9
Min       Trans     2580.85 (   0.00%)     5401.58 ( 109.29%)
Hmean     Trans     2610.38 (   0.00%)     5518.36 ( 111.40%)
Stddev    Trans       28.08 (   0.00%)      208.90 (-644.02%)
CoeffVar  Trans        1.08 (   0.00%)        3.78 (-251.57%)
Max       Trans     2648.02 (   0.00%)     5992.74 ( 126.31%)
BHmean-50 Trans     2629.78 (   0.00%)     5643.81 ( 114.61%)
BHmean-95 Trans     2620.38 (   0.00%)     5538.32 ( 111.36%)
BHmean-99 Trans     2620.38 (   0.00%)     5538.32 ( 111.36%)

That's over doubling the transactions per second for that workload.

Two-socket skylake running dbench4
                                v4.17               41ab43c9
Amean      1         40.85 (   0.00%)       14.97 (  63.36%)
Amean      2         42.31 (   0.00%)       17.33 (  59.04%)
Amean      4         53.77 (   0.00%)       27.85 (  48.20%)
Amean      8         68.86 (   0.00%)       43.78 (  36.42%)
Amean      16        82.62 (   0.00%)       56.51 (  31.60%)
Amean      32       135.80 (   0.00%)      116.06 (  14.54%)
Amean      64       737.51 (   0.00%)      701.00 (   4.95%)
Amean      512    14996.60 (   0.00%)    14755.05 (   1.61%)

This is reporting the average latency of operations running dbench. The
series over halves the latencies. There are many examples of basic
workloads that benefit heavily from the series and while I accept it may
not be universal, such as the case where the graphics card needs the power
and not the CPU, a straight revert is not the answer. Without the series,
HWP cripples the CPU.
srinivas pandruvada July 28, 2018, 2:14 p.m. UTC | #3
On Fri, 2018-07-27 at 22:34 -0700, Francisco Jerez wrote:
> Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com> writes:
> 
> > Enable HWP boost on Skylake server and workstations.
> > 
> 
> Please revert this series, it led to significant energy usage and
> graphics performance regressions [1]. 
Which SKX platform is targeted to graphics?

>  The reasons are roughly the ones
> we discussed by e-mail off-list last April: This causes the
> intel_pstate
> driver to decrease the EPP to zero 
No. You didn't check this series. We are not using EPP at all.
The boost mechanism used here is not boost to max.
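For anyone reading this without the rest of the series at hand: the idea is to
step up the HWP minimum-performance hint on IO-wait wakeups and let it decay
when they stop, leaving EPP alone. A tiny standalone sketch of that behaviour
(the ratios and step sizes below are invented for illustration; this is not
the driver code):

#include <stdio.h>

/* Invented example ratios; the real ones come from the HWP capabilities MSR. */
#define MIN_RATIO	8	/* lowest P-state        */
#define GUAR_RATIO	24	/* guaranteed ratio (P1) */
#define TURBO_RATIO	30	/* one-core turbo ratio  */

static int hwp_min_hint = MIN_RATIO;

/* On an IO-wait wakeup: jump to P1 first, then creep towards turbo. */
static void io_wakeup_boost(void)
{
	if (hwp_min_hint < GUAR_RATIO)
		hwp_min_hint = GUAR_RATIO;
	else if (hwp_min_hint < TURBO_RATIO)
		hwp_min_hint++;
}

/* Once the IO wakeups stop for a while: drop the boost again. */
static void idle_decay(void)
{
	hwp_min_hint = MIN_RATIO;
}

int main(void)
{
	int i;

	for (i = 0; i < 8; i++) {
		io_wakeup_boost();
		printf("IO wakeup %d -> HWP min hint %d\n", i + 1, hwp_min_hint);
	}
	idle_decay();
	printf("after decay  -> HWP min hint %d\n", hwp_min_hint);
	return 0;
}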

Thanks,
Srinivas

> when the workload blocks on IO
> frequently enough, which for the regressing benchmarks detailed in
> [1]
> is a symptom of the workload being heavily IO-bound, which means they
> won't benefit at all from the EPP boost since they aren't
> significantly
> CPU-bound, and they will suffer a decrease in parallelism due to the
> active CPU core using a larger fraction of the TDP in order to
> achieve
> the same work, causing the GPU to have a lower power budget
> available,
> leading to a decrease in system performance.
> 
> You may want to give a shot to my previous suggestion of using [2] in
> order to detect whether the system is IO-bound, which you can use as
> an
> indicator that the optimization implemented in this series cannot
> possibly improve performance and can be expected to hurt energy
> usage.
> 
> Thanks.
> 
> [1] https://bugs.freedesktop.org/show_bug.cgi?id=107410
> [2] https://patchwork.kernel.org/patch/10312259/
> 
> > Reported-by: Mel Gorman <mgorman@techsingularity.net>
> > Tested-by: Giovanni Gherdovich <ggherdovich@suse.cz>
> > Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel
> > .com>
> > ---
> >  drivers/cpufreq/intel_pstate.c | 10 ++++++++++
> >  1 file changed, 10 insertions(+)
> > 
> > diff --git a/drivers/cpufreq/intel_pstate.c
> > b/drivers/cpufreq/intel_pstate.c
> > index 70bf63bb4e0e..01c8da1f99db 100644
> > --- a/drivers/cpufreq/intel_pstate.c
> > +++ b/drivers/cpufreq/intel_pstate.c
> > @@ -1794,6 +1794,12 @@ static const struct x86_cpu_id
> > intel_pstate_cpu_ee_disable_ids[] = {
> >  	{}
> >  };
> >  
> > +static const struct x86_cpu_id intel_pstate_hwp_boost_ids[]
> > __initconst = {
> > +	ICPU(INTEL_FAM6_SKYLAKE_X, core_funcs),
> > +	ICPU(INTEL_FAM6_SKYLAKE_DESKTOP, core_funcs),
> > +	{}
> > +};
> > +
> >  static int intel_pstate_init_cpu(unsigned int cpunum)
> >  {
> >  	struct cpudata *cpu;
> > @@ -1824,6 +1830,10 @@ static int intel_pstate_init_cpu(unsigned
> > int cpunum)
> >  			intel_pstate_disable_ee(cpunum);
> >  
> >  		intel_pstate_hwp_enable(cpu);
> > +
> > +		id = x86_match_cpu(intel_pstate_hwp_boost_ids);
> > +		if (id)
> > +			hwp_boost = true;
> >  	}
> >  
> >  	intel_pstate_get_cpu_pstates(cpu);
> > -- 
> > 2.13.6
Francisco Jerez July 28, 2018, 8:21 p.m. UTC | #4
Mel Gorman <mgorman@techsingularity.net> writes:

> On Fri, Jul 27, 2018 at 10:34:03PM -0700, Francisco Jerez wrote:
>> Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com> writes:
>> 
>> > Enable HWP boost on Skylake server and workstations.
>> >
>> 
>> Please revert this series, it led to significant energy usage and
>> graphics performance regressions [1].  The reasons are roughly the ones
>> we discussed by e-mail off-list last April: This causes the intel_pstate
>> driver to decrease the EPP to zero when the workload blocks on IO
>> frequently enough, which for the regressing benchmarks detailed in [1]
>> is a symptom of the workload being heavily IO-bound, which means they
>> won't benefit at all from the EPP boost since they aren't significantly
>> CPU-bound, and they will suffer a decrease in parallelism due to the
>> active CPU core using a larger fraction of the TDP in order to achieve
>> the same work, causing the GPU to have a lower power budget available,
>> leading to a decrease in system performance.
>
> It slices both ways.

I don't think it's acceptable to land an optimization that trades
performance of one use-case for another, especially since one could make
both use-cases happy by avoiding the boost in cases where we know
beforehand that we aren't going to achieve any improvement in
performance, because an application waiting frequently on an IO device
which is 100% utilized isn't going to run faster just because we ramp up
the CPU frequency, since the IO device won't be able to process requests
from the application faster anyway, so we will only be pessimizing
energy efficiency (and potentially decreasing performance of the GPU
*and* of other CPU cores living on the same package for no benefit).

> With the series, there are large boosts to performance on other
> workloads where a slight increase in power usage is acceptable in
> exchange for performance. For example,
>
> Single socket skylake running sqlite
>                                  v4.17               41ab43c9
> Min       Trans     2580.85 (   0.00%)     5401.58 ( 109.29%)
> Hmean     Trans     2610.38 (   0.00%)     5518.36 ( 111.40%)
> Stddev    Trans       28.08 (   0.00%)      208.90 (-644.02%)
> CoeffVar  Trans        1.08 (   0.00%)        3.78 (-251.57%)
> Max       Trans     2648.02 (   0.00%)     5992.74 ( 126.31%)
> BHmean-50 Trans     2629.78 (   0.00%)     5643.81 ( 114.61%)
> BHmean-95 Trans     2620.38 (   0.00%)     5538.32 ( 111.36%)
> BHmean-99 Trans     2620.38 (   0.00%)     5538.32 ( 111.36%)
>
> That's over doubling the transactions per second for that workload.
>
> Two-socket skylake running dbench4
>                                 v4.17               41ab43c9
> Amean      1         40.85 (   0.00%)       14.97 (  63.36%)
> Amean      2         42.31 (   0.00%)       17.33 (  59.04%)
> Amean      4         53.77 (   0.00%)       27.85 (  48.20%)
> Amean      8         68.86 (   0.00%)       43.78 (  36.42%)
> Amean      16        82.62 (   0.00%)       56.51 (  31.60%)
> Amean      32       135.80 (   0.00%)      116.06 (  14.54%)
> Amean      64       737.51 (   0.00%)      701.00 (   4.95%)
> Amean      512    14996.60 (   0.00%)    14755.05 (   1.61%)
>
> This is reporting the average latency of operations running
> dbench. The series over halves the latencies. There are many examples
> of basic workloads that benefit heavily from the series and while I
> accept it may not be universal, such as the case where the graphics
> card needs the power and not the CPU, a straight revert is not the
> answer. Without the series, HWP cripples the CPU.
>

That seems like a huge overstatement.  HWP doesn't "cripple" the CPU
without this series.  It will certainly set lower clocks than with this
series for workloads like you show above that utilize the CPU very
intermittently (i.e. they underutilize it).  But one could argue that
such workloads are inherently misdesigned and will perform suboptimally
regardless of the behavior of the CPUFREQ governor, for two different
reasons: On the one hand because they are unable to fully utilize their
CPU time (otherwise HWP would be giving them a CPU frequency close to
the maximum already), and on the other hand, because in order to achieve
maximum performance they will necessarily have to bounce back and forth
between the maximum P-state and idle at high frequency, which is
inherently energy-inefficient and will effectively *decrease* the
overall number of requests per second that an actual multi-threaded
server can process, even though the request throughput may seem to
increase in a single-threaded benchmark.

> -- 
> Mel Gorman
> SUSE Labs
Francisco Jerez July 28, 2018, 8:23 p.m. UTC | #5
Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com> writes:

> On Fri, 2018-07-27 at 22:34 -0700, Francisco Jerez wrote:
>> Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com> writes:
>> 
>> > Enable HWP boost on Skylake server and workstations.
>> > 
>> 
>> Please revert this series, it led to significant energy usage and
>> graphics performance regressions [1]. 
> Which SKX platform is targeted to graphics?
>

See the bug report, it's a regular desktop SKL.

>>  The reasons are roughly the ones
>> we discussed by e-mail off-list last April: This causes the
>> intel_pstate
>> driver to decrease the EPP to zero 
> No. You didn't check this series. We are not using EPP at all.
> The boost mechanism used here is not boost to max.
>

I see you've changed the mechanism to obtain a latency boost since our
last discussion, but that doesn't address my concerns in any way: This
series causes the intel_pstate driver to clamp the CPU frequency above
the optimal frequency to run the workload at, as a response to the
application waiting on IO frequently, even though that's only a sign of
the application being IO-bound and *not* a sign of it being
latency-sensitive, since the application's IO and CPU work are properly
pipelined.  This leads to a decrease in parallelism due to the active
CPU core using a larger fraction of the package TDP in order to achieve
the same work, leading to a decrease in system performance.

> Thanks,
> Srinivas
>
>> when the workload blocks on IO
>> frequently enough, which for the regressing benchmarks detailed in
>> [1]
>> is a symptom of the workload being heavily IO-bound, which means they
>> won't benefit at all from the EPP boost since they aren't
>> significantly
>> CPU-bound, and they will suffer a decrease in parallelism due to the
>> active CPU core using a larger fraction of the TDP in order to
>> achieve
>> the same work, causing the GPU to have a lower power budget
>> available,
>> leading to a decrease in system performance.
>> 
>> You may want to give a shot to my previous suggestion of using [2] in
>> order to detect whether the system is IO-bound, which you can use as
>> an
>> indicator that the optimization implemented in this series cannot
>> possibly improve performance and can be expected to hurt energy
>> usage.
>> 
>> Thanks.
>> 
>> [1] https://bugs.freedesktop.org/show_bug.cgi?id=107410
>> [2] https://patchwork.kernel.org/patch/10312259/
>> 
>> > Reported-by: Mel Gorman <mgorman@techsingularity.net>
>> > Tested-by: Giovanni Gherdovich <ggherdovich@suse.cz>
>> > Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel
>> > .com>
>> > ---
>> >  drivers/cpufreq/intel_pstate.c | 10 ++++++++++
>> >  1 file changed, 10 insertions(+)
>> > 
>> > diff --git a/drivers/cpufreq/intel_pstate.c
>> > b/drivers/cpufreq/intel_pstate.c
>> > index 70bf63bb4e0e..01c8da1f99db 100644
>> > --- a/drivers/cpufreq/intel_pstate.c
>> > +++ b/drivers/cpufreq/intel_pstate.c
>> > @@ -1794,6 +1794,12 @@ static const struct x86_cpu_id
>> > intel_pstate_cpu_ee_disable_ids[] = {
>> >  	{}
>> >  };
>> >  
>> > +static const struct x86_cpu_id intel_pstate_hwp_boost_ids[]
>> > __initconst = {
>> > +	ICPU(INTEL_FAM6_SKYLAKE_X, core_funcs),
>> > +	ICPU(INTEL_FAM6_SKYLAKE_DESKTOP, core_funcs),
>> > +	{}
>> > +};
>> > +
>> >  static int intel_pstate_init_cpu(unsigned int cpunum)
>> >  {
>> >  	struct cpudata *cpu;
>> > @@ -1824,6 +1830,10 @@ static int intel_pstate_init_cpu(unsigned
>> > int cpunum)
>> >  			intel_pstate_disable_ee(cpunum);
>> >  
>> >  		intel_pstate_hwp_enable(cpu);
>> > +
>> > +		id = x86_match_cpu(intel_pstate_hwp_boost_ids);
>> > +		if (id)
>> > +			hwp_boost = true;
>> >  	}
>> >  
>> >  	intel_pstate_get_cpu_pstates(cpu);
>> > -- 
>> > 2.13.6
Francisco Jerez July 28, 2018, 8:23 p.m. UTC | #6
"Pandruvada, Srinivas" <srinivas.pandruvada@intel.com> writes:

> On Sat, 2018-07-28 at 07:14 -0700, Srinivas Pandruvada wrote:
>> On Fri, 2018-07-27 at 22:34 -0700, Francisco Jerez wrote:
>> > Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com> writes:
>> > 
>> > > Enable HWP boost on Skylake server and workstations.
>> > > 
>> > 
>> > Please revert this series, it led to significant energy usage and
>> > graphics performance regressions [1]. 
>> 
>> Which SKX platform is targeted to graphics?
> There are some Xeon E3, which is using SKL desktop CPUID.
> Do you have a SKL desktop with FADT pm profile 
> acpi_gbl_FADT.preferred_profile == PM_DESKTOP? You can take acpidump
> and check.
>
> Also what is
>  /sys/devices/system/cpu/intel_pstate/hwp_dynamic_boost
>

Eero is the one that reproduced the regressions -- Can you get Srinivas
this information from your system?

Thank you.

> Thanks,
> Srinivas
>
>> 
>> >  The reasons are roughly the ones
>> > we discussed by e-mail off-list last April: This causes the
>> > intel_pstate
>> > driver to decrease the EPP to zero 
>> 
>> No. You didn't check this series. We are not using EPP at all.
>> The boost mechanism used here is not boost to max.
>> 
>> Thanks,
>> Srinivas
>> 
>> > when the workload blocks on IO
>> > frequently enough, which for the regressing benchmarks detailed in
>> > [1]
>> > is a symptom of the workload being heavily IO-bound, which means
>> > they
>> > won't benefit at all from the EPP boost since they aren't
>> > significantly
>> > CPU-bound, and they will suffer a decrease in parallelism due to
>> > the
>> > active CPU core using a larger fraction of the TDP in order to
>> > achieve
>> > the same work, causing the GPU to have a lower power budget
>> > available,
>> > leading to a decrease in system performance.
>> > 
>> > You may want to give a shot to my previous suggestion of using [2]
>> > in
>> > order to detect whether the system is IO-bound, which you can use
>> > as
>> > an
>> > indicator that the optimization implemented in this series cannot
>> > possibly improve performance and can be expected to hurt energy
>> > usage.
>> > 
>> > Thanks.
>> > 
>> > [1] https://bugs.freedesktop.org/show_bug.cgi?id=107410
>> > [2] https://patchwork.kernel.org/patch/10312259/
>> > 
>> > > Reported-by: Mel Gorman <mgorman@techsingularity.net>
>> > > Tested-by: Giovanni Gherdovich <ggherdovich@suse.cz>
>> > > Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.int
>> > > el
>> > > .com>
>> > > ---
>> > >  drivers/cpufreq/intel_pstate.c | 10 ++++++++++
>> > >  1 file changed, 10 insertions(+)
>> > > 
>> > > diff --git a/drivers/cpufreq/intel_pstate.c
>> > > b/drivers/cpufreq/intel_pstate.c
>> > > index 70bf63bb4e0e..01c8da1f99db 100644
>> > > --- a/drivers/cpufreq/intel_pstate.c
>> > > +++ b/drivers/cpufreq/intel_pstate.c
>> > > @@ -1794,6 +1794,12 @@ static const struct x86_cpu_id
>> > > intel_pstate_cpu_ee_disable_ids[] = {
>> > >  	{}
>> > >  };
>> > >  
>> > > +static const struct x86_cpu_id intel_pstate_hwp_boost_ids[]
>> > > __initconst = {
>> > > +	ICPU(INTEL_FAM6_SKYLAKE_X, core_funcs),
>> > > +	ICPU(INTEL_FAM6_SKYLAKE_DESKTOP, core_funcs),
>> > > +	{}
>> > > +};
>> > > +
>> > >  static int intel_pstate_init_cpu(unsigned int cpunum)
>> > >  {
>> > >  	struct cpudata *cpu;
>> > > @@ -1824,6 +1830,10 @@ static int intel_pstate_init_cpu(unsigned
>> > > int cpunum)
>> > >  			intel_pstate_disable_ee(cpunum);
>> > >  
>> > >  		intel_pstate_hwp_enable(cpu);
>> > > +
>> > > +		id = x86_match_cpu(intel_pstate_hwp_boost_ids);
>> > > +		if (id)
>> > > +			hwp_boost = true;
>> > >  	}
>> > >  
>> > >  	intel_pstate_get_cpu_pstates(cpu);
>> > > -- 
>> > > 2.13.6
Pandruvada, Srinivas July 28, 2018, 10:06 p.m. UTC | #7
On Sat, 2018-07-28 at 13:23 -0700, Francisco Jerez wrote:
> "Pandruvada, Srinivas" <srinivas.pandruvada@intel.com> writes:
> 
> > On Sat, 2018-07-28 at 07:14 -0700, Srinivas Pandruvada wrote:
> > > On Fri, 2018-07-27 at 22:34 -0700, Francisco Jerez wrote:
> > > > Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
> > > > writes:
> > > > 
> > > > > Enable HWP boost on Skylake server and workstations.
> > > > > 
> > > > 
> > > > Please revert this series, it led to significant energy usage
> > > > and
> > > > graphics performance regressions [1]. 
> > > 
> > > Which SKX platform is targeted to graphics?
> > 
> > There are some Xeon E3, which is using SKL desktop CPUID.
> > Do you have a SKL desktop with FADT pm profile 
> > acpi_gbl_FADT.preferred_profile == PM_DESKTOP? You can take
> > acpidump
> > and check.
> > 
> > Also what is
> >  /sys/devices/system/cpu/intel_pstate/hwp_dynamic_boost
> > 
> 
> Eero is the one that reproduced the regressions -- Can you get
> Srinivas
> this information from your system?
I commented in the bug. Let's take it there first.

Thanks,
Srinivas

> 
> Thank you.
> 
> > Thanks,
> > Srinivas
> > 
> > > 
> > > >  The reasons are roughly the ones
> > > > we discussed by e-mail off-list last April: This causes the
> > > > intel_pstate
> > > > driver to decrease the EPP to zero 
> > > 
> > > No. You didn't check this series. We are not using EPP at all.
> > > The boost mechanism used here is not boost to max.
> > > 
> > > Thanks,
> > > Srinivas
> > > 
> > > > when the workload blocks on IO
> > > > frequently enough, which for the regressing benchmarks detailed
> > > > in
> > > > [1]
> > > > is a symptom of the workload being heavily IO-bound, which
> > > > means
> > > > they
> > > > won't benefit at all from the EPP boost since they aren't
> > > > significantly
> > > > CPU-bound, and they will suffer a decrease in parallelism due
> > > > to
> > > > the
> > > > active CPU core using a larger fraction of the TDP in order to
> > > > achieve
> > > > the same work, causing the GPU to have a lower power budget
> > > > available,
> > > > leading to a decrease in system performance.
> > > > 
> > > > You may want to give a shot to my previous suggestion of using
> > > > [2]
> > > > in
> > > > order to detect whether the system is IO-bound, which you can
> > > > use
> > > > as
> > > > an
> > > > indicator that the optimization implemented in this series
> > > > cannot
> > > > possibly improve performance and can be expected to hurt energy
> > > > usage.
> > > > 
> > > > Thanks.
> > > > 
> > > > [1] https://bugs.freedesktop.org/show_bug.cgi?id=107410
> > > > [2] https://patchwork.kernel.org/patch/10312259/
> > > > 
> > > > > Reported-by: Mel Gorman <mgorman@techsingularity.net>
> > > > > Tested-by: Giovanni Gherdovich <ggherdovich@suse.cz>
> > > > > Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux
> > > > > .int
> > > > > el
> > > > > .com>
> > > > > ---
> > > > >  drivers/cpufreq/intel_pstate.c | 10 ++++++++++
> > > > >  1 file changed, 10 insertions(+)
> > > > > 
> > > > > diff --git a/drivers/cpufreq/intel_pstate.c
> > > > > b/drivers/cpufreq/intel_pstate.c
> > > > > index 70bf63bb4e0e..01c8da1f99db 100644
> > > > > --- a/drivers/cpufreq/intel_pstate.c
> > > > > +++ b/drivers/cpufreq/intel_pstate.c
> > > > > @@ -1794,6 +1794,12 @@ static const struct x86_cpu_id
> > > > > intel_pstate_cpu_ee_disable_ids[] = {
> > > > >  	{}
> > > > >  };
> > > > >  
> > > > > +static const struct x86_cpu_id intel_pstate_hwp_boost_ids[]
> > > > > __initconst = {
> > > > > +	ICPU(INTEL_FAM6_SKYLAKE_X, core_funcs),
> > > > > +	ICPU(INTEL_FAM6_SKYLAKE_DESKTOP, core_funcs),
> > > > > +	{}
> > > > > +};
> > > > > +
> > > > >  static int intel_pstate_init_cpu(unsigned int cpunum)
> > > > >  {
> > > > >  	struct cpudata *cpu;
> > > > > @@ -1824,6 +1830,10 @@ static int
> > > > > intel_pstate_init_cpu(unsigned
> > > > > int cpunum)
> > > > >  			intel_pstate_disable_ee(cpunum);
> > > > >  
> > > > >  		intel_pstate_hwp_enable(cpu);
> > > > > +
> > > > > +		id =
> > > > > x86_match_cpu(intel_pstate_hwp_boost_ids);
> > > > > +		if (id)
> > > > > +			hwp_boost = true;
> > > > >  	}
> > > > >  
> > > > >  	intel_pstate_get_cpu_pstates(cpu);
> > > > > -- 
> > > > > 2.13.6
Eero Tamminen July 30, 2018, 8:33 a.m. UTC | #8
Hi,

On 28.07.2018 17:14, Srinivas Pandruvada wrote:
> On Fri, 2018-07-27 at 22:34 -0700, Francisco Jerez wrote:
>> Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com> writes:
>>
>>> Enable HWP boost on Skylake server and workstations.
>>>
>>
>> Please revert this series, it led to significant energy usage and
>> graphics performance regressions [1].
> Which SKX platform is targeted to graphics?

The patch that Chris pointed out is this:
+static const struct x86_cpu_id intel_pstate_hwp_boost_ids[] = {
+       ICPU(INTEL_FAM6_SKYLAKE_X, core_funcs),
+       ICPU(INTEL_FAM6_SKYLAKE_DESKTOP, core_funcs),
+       {}
+};

The regressing platforms in our test system were:
- SKL 6600K i5 / GT2
- SKL 6770HQ i7 / GT4e

SKL-U i5 / GT3e device wasn't impacted, so I assume U devices don't 
match INTEL_FAM6_SKYLAKE_DESKTOP.
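
That matches the model numbers, if I read arch/x86/include/asm/intel-family.h
right: SKL-U/Y parts report model 0x4E while the S/H parts report 0x5E, so only
the latter hit the boost list. A small userspace illustration (values copied
from intel-family.h as I understand them):

#include <stdbool.h>
#include <stdio.h>

/* Model numbers as defined in arch/x86/include/asm/intel-family.h. */
#define INTEL_FAM6_SKYLAKE_MOBILE	0x4E	/* SKL-U/Y, e.g. the GT3e i5   */
#define INTEL_FAM6_SKYLAKE_DESKTOP	0x5E	/* SKL-S/H, e.g. 6600K, 6770HQ */
#define INTEL_FAM6_SKYLAKE_X		0x55	/* SKL-SP servers              */

/* Mirrors the intel_pstate_hwp_boost_ids[] table quoted above. */
static bool hwp_boost_listed(unsigned int model)
{
	return model == INTEL_FAM6_SKYLAKE_X ||
	       model == INTEL_FAM6_SKYLAKE_DESKTOP;
}

int main(void)
{
	printf("SKL-U   (0x4E) boosted: %d\n", hwp_boost_listed(INTEL_FAM6_SKYLAKE_MOBILE));
	printf("SKL-S/H (0x5E) boosted: %d\n", hwp_boost_listed(INTEL_FAM6_SKYLAKE_DESKTOP));
	return 0;
}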



	- Eero

>>   The reasons are roughly the ones
>> we discussed by e-mail off-list last April: This causes the
>> intel_pstate
>> driver to decrease the EPP to zero
> No. You didn't check this series. We are not using EPP at all.
> The boost mechanism used here is not boost to max.
> 
> Thanks,
> Srinivas
> 
>> when the workload blocks on IO
>> frequently enough, which for the regressing benchmarks detailed in
>> [1]
>> is a symptom of the workload being heavily IO-bound, which means they
>> won't benefit at all from the EPP boost since they aren't
>> significantly
>> CPU-bound, and they will suffer a decrease in parallelism due to the
>> active CPU core using a larger fraction of the TDP in order to
>> achieve
>> the same work, causing the GPU to have a lower power budget
>> available,
>> leading to a decrease in system performance.
>>
>> You may want to give a shot to my previous suggestion of using [2] in
>> order to detect whether the system is IO-bound, which you can use as
>> an
>> indicator that the optimization implemented in this series cannot
>> possibly improve performance and can be expected to hurt energy
>> usage.
>>
>> Thanks.
>>
>> [1] https://bugs.freedesktop.org/show_bug.cgi?id=107410
>> [2] https://patchwork.kernel.org/patch/10312259/
>>
>>> Reported-by: Mel Gorman <mgorman@techsingularity.net>
>>> Tested-by: Giovanni Gherdovich <ggherdovich@suse.cz>
>>> Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel
>>> .com>
>>> ---
>>>   drivers/cpufreq/intel_pstate.c | 10 ++++++++++
>>>   1 file changed, 10 insertions(+)
>>>
>>> diff --git a/drivers/cpufreq/intel_pstate.c
>>> b/drivers/cpufreq/intel_pstate.c
>>> index 70bf63bb4e0e..01c8da1f99db 100644
>>> --- a/drivers/cpufreq/intel_pstate.c
>>> +++ b/drivers/cpufreq/intel_pstate.c
>>> @@ -1794,6 +1794,12 @@ static const struct x86_cpu_id
>>> intel_pstate_cpu_ee_disable_ids[] = {
>>>   	{}
>>>   };
>>>   
>>> +static const struct x86_cpu_id intel_pstate_hwp_boost_ids[]
>>> __initconst = {
>>> +	ICPU(INTEL_FAM6_SKYLAKE_X, core_funcs),
>>> +	ICPU(INTEL_FAM6_SKYLAKE_DESKTOP, core_funcs),
>>> +	{}
>>> +};
>>> +
>>>   static int intel_pstate_init_cpu(unsigned int cpunum)
>>>   {
>>>   	struct cpudata *cpu;
>>> @@ -1824,6 +1830,10 @@ static int intel_pstate_init_cpu(unsigned
>>> int cpunum)
>>>   			intel_pstate_disable_ee(cpunum);
>>>   
>>>   		intel_pstate_hwp_enable(cpu);
>>> +
>>> +		id = x86_match_cpu(intel_pstate_hwp_boost_ids);
>>> +		if (id)
>>> +			hwp_boost = true;
>>>   	}
>>>   
>>>   	intel_pstate_get_cpu_pstates(cpu);
>>> -- 
>>> 2.13.6
Eero Tamminen July 30, 2018, 11:16 a.m. UTC | #9
Hi Mel,

On 28.07.2018 15:36, Mel Gorman wrote:
> On Fri, Jul 27, 2018 at 10:34:03PM -0700, Francisco Jerez wrote:
>> Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com> writes:
>>
>>> Enable HWP boost on Skylake server and workstations.
>>>
>>
>> Please revert this series, it led to significant energy usage and
>> graphics performance regressions [1].  The reasons are roughly the ones
>> we discussed by e-mail off-list last April: This causes the intel_pstate
>> driver to decrease the EPP to zero when the workload blocks on IO
>> frequently enough, which for the regressing benchmarks detailed in [1]
>> is a symptom of the workload being heavily IO-bound, which means they
>> won't benefit at all from the EPP boost since they aren't significantly
>> CPU-bound, and they will suffer a decrease in parallelism due to the
>> active CPU core using a larger fraction of the TDP in order to achieve
>> the same work, causing the GPU to have a lower power budget available,
>> leading to a decrease in system performance.
>
> It slices both ways. With the series, there are large boosts to
> performance on other workloads where a slight increase in power usage is
> acceptable in exchange for performance. For example,
> 
> Single socket skylake running sqlite
[...]
> That's over doubling the transactions per second for that workload.
> 
> Two-socket skylake running dbench4
[...]
> This is reporting the average latency of operations running dbench. The
> series over halves the latencies. There are many examples of basic
> workloads that benefit heavily from the series and while I accept it may
> not be universal, such as the case where the graphics card needs the power
> and not the CPU, a straight revert is not the answer. Without the series,
> HWP cripples the CPU.

I assume the SQLite IO bottleneck is the disk.  The disk doesn't share
the TDP limit with the CPU the way the IGP does.

Constraints / performance considerations for IO loads that share TDP
differ from ones that don't share TDP with the CPU cores.


Workloads that can be "IO-bound" and which can be on the same chip
as the CPU, i.e. share TDP with it, are:
- 3D rendering
- Camera / video processing
- Compute

Intel, AMD and ARM manufacturers all have (non-server) chips where these
IP blocks are on the same die as the CPU cores.  If the CPU part redundantly
doubles its power consumption, it's directly eating TDP budget away from
these devices.

For workloads where the IO bottleneck doesn't share the TDP budget with
the CPU, like (sqlite) databases, you don't lose performance by running
the CPU constantly at full tilt, you only use more power [1].

Questions:

* Does the kernel's CPU freq management currently have any idea which IO
   devices share TDP with the CPU cores?

* Do you do performance testing also in conditions that hit TDP limits?


	- Eero

[1]  For them, power usage is a performance problem only if you start
hitting the TDP limit with CPUs alone, or you hit temperature limits.

For CPUs alone to hit TDP limits, our test-case needs to be utilizing
multiple cores and the device needs to have a lowish TDP compared to the
performance of the chip.

TDP limiting adds test results variance significantly, but that's
a property of the chips themselves so it cannot be avoided. :-/

Temperature limiting might happen on small enclosures like the ones used
for the SKL HQ devices i.e. laptops & NUCs, but not on servers.   In our
testing we try to avoid temperature limitations when it's possible (=
extra cooling) as it increases variance so much that results are mostly
useless (same devices are also TDP limited i.e. already have high
variance).
srinivas pandruvada July 30, 2018, 1:38 p.m. UTC | #10
Hi Eero,
On Mon, 2018-07-30 at 11:33 +0300, Eero Tamminen wrote:
> Hi,
> 
> On 28.07.2018 17:14, Srinivas Pandruvada wrote:
> > On Fri, 2018-07-27 at 22:34 -0700, Francisco Jerez wrote:
> > > Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com> writes:
> > > 
> > > > Enable HWP boost on Skylake server and workstations.
> > > > 
> > > 
> > > Please revert this series, it led to significant energy usage and
> > > graphics performance regressions [1].
> > 
> > Which SKX platform is targeted to graphics?
> 
> Patch that Chris pointed out is this:
> +static const struct x86_cpu_id intel_pstate_hwp_boost_ids[] = {
> +       ICPU(INTEL_FAM6_SKYLAKE_X, core_funcs),
> +       ICPU(INTEL_FAM6_SKYLAKE_DESKTOP, core_funcs),
> +       {}
> +};
> 
> The regressing platforms in our test system were:
> - SKL 6600K i5 / GT2
> - SKL 6770HQ i7 / GT4e
> 
> SKL-U i5 / GT3e device wasn't impacted, so I assume U devices don't 
> match INTEL_FAM6_SKYLAKE_DESKTOP.

I have updated some steps in the bugzilla. Can you try that?
Also please add one workload which shows this issue immediately,
including any parameters needed to run it.

Thanks,
Srinivas


> 
> 
> 
> 	- Eero
> 
> > >   The reasons are roughly the ones
> > > we discussed by e-mail off-list last April: This causes the
> > > intel_pstate
> > > driver to decrease the EPP to zero
> > 
> > No. You didn't check this series. We are not using EPP at all.
> > The boost mechanism used here is not boost to max.
> > 
> > Thanks,
> > Srinivas
> > 
> > > when the workload blocks on IO
> > > frequently enough, which for the regressing benchmarks detailed
> > > in
> > > [1]
> > > is a symptom of the workload being heavily IO-bound, which means
> > > they
> > > won't benefit at all from the EPP boost since they aren't
> > > significantly
> > > CPU-bound, and they will suffer a decrease in parallelism due to
> > > the
> > > active CPU core using a larger fraction of the TDP in order to
> > > achieve
> > > the same work, causing the GPU to have a lower power budget
> > > available,
> > > leading to a decrease in system performance.
> > > 
> > > You may want to give a shot to my previous suggestion of using
> > > [2] in
> > > order to detect whether the system is IO-bound, which you can use
> > > as
> > > an
> > > indicator that the optimization implemented in this series cannot
> > > possibly improve performance and can be expected to hurt energy
> > > usage.
> > > 
> > > Thanks.
> > > 
> > > [1] https://bugs.freedesktop.org/show_bug.cgi?id=107410
> > > [2] https://patchwork.kernel.org/patch/10312259/
> > > 
> > > > Reported-by: Mel Gorman <mgorman@techsingularity.net>
> > > > Tested-by: Giovanni Gherdovich <ggherdovich@suse.cz>
> > > > Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.i
> > > > ntel
> > > > .com>
> > > > ---
> > > >   drivers/cpufreq/intel_pstate.c | 10 ++++++++++
> > > >   1 file changed, 10 insertions(+)
> > > > 
> > > > diff --git a/drivers/cpufreq/intel_pstate.c
> > > > b/drivers/cpufreq/intel_pstate.c
> > > > index 70bf63bb4e0e..01c8da1f99db 100644
> > > > --- a/drivers/cpufreq/intel_pstate.c
> > > > +++ b/drivers/cpufreq/intel_pstate.c
> > > > @@ -1794,6 +1794,12 @@ static const struct x86_cpu_id
> > > > intel_pstate_cpu_ee_disable_ids[] = {
> > > >   	{}
> > > >   };
> > > >   
> > > > +static const struct x86_cpu_id intel_pstate_hwp_boost_ids[]
> > > > __initconst = {
> > > > +	ICPU(INTEL_FAM6_SKYLAKE_X, core_funcs),
> > > > +	ICPU(INTEL_FAM6_SKYLAKE_DESKTOP, core_funcs),
> > > > +	{}
> > > > +};
> > > > +
> > > >   static int intel_pstate_init_cpu(unsigned int cpunum)
> > > >   {
> > > >   	struct cpudata *cpu;
> > > > @@ -1824,6 +1830,10 @@ static int
> > > > intel_pstate_init_cpu(unsigned
> > > > int cpunum)
> > > >   			intel_pstate_disable_ee(cpunum);
> > > >   
> > > >   		intel_pstate_hwp_enable(cpu);
> > > > +
> > > > +		id =
> > > > x86_match_cpu(intel_pstate_hwp_boost_ids);
> > > > +		if (id)
> > > > +			hwp_boost = true;
> > > >   	}
> > > >   
> > > >   	intel_pstate_get_cpu_pstates(cpu);
> > > > -- 
> > > > 2.13.6
> 
>
srinivas pandruvada July 30, 2018, 2:06 p.m. UTC | #11
On Mon, 2018-07-30 at 14:16 +0300, Eero Tamminen wrote:
> Hi Mel,
> 
> On 28.07.2018 15:36, Mel Gorman wrote:
> > On Fri, Jul 27, 2018 at 10:34:03PM -0700, Francisco Jerez wrote:
> > > Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com> writes:
> > > 
> > > > Enable HWP boost on Skylake server and workstations.
> > > > 
> > > 
> > > Please revert this series, it led to significant energy usage and
> > > graphics performance regressions [1].  The reasons are roughly
> > > the ones
> > > we discussed by e-mail off-list last April: This causes the
> > > intel_pstate
> > > driver to decrease the EPP to zero when the workload blocks on IO
> > > frequently enough, which for the regressing benchmarks detailed
> > > in [1]
> > > is a symptom of the workload being heavily IO-bound, which means
> > > they
> > > won't benefit at all from the EPP boost since they aren't
> > > significantly
> > > CPU-bound, and they will suffer a decrease in parallelism due to
> > > the
> > > active CPU core using a larger fraction of the TDP in order to
> > > achieve
> > > the same work, causing the GPU to have a lower power budget
> > > available,
> > > leading to a decrease in system performance.
> 
> >
> > It slices both ways. With the series, there are large boosts to
> > performance on other workloads where a slight increase in power
> > usage is
> > acceptable in exchange for performance. For example,
> > 
> > Single socket skylake running sqlite
> 
> [...]
> > That's over doubling the transactions per second for that workload.
> > 
> > Two-socket skylake running dbench4
> 
> [...]
> > This is reporting the average latency of operations running dbench. The
> > series over halves the latencies. There are many examples of basic
> > workloads that benefit heavily from the series and while I accept
> > it may
> > not be universal, such as the case where the graphics card needs
> > the power
> > and not the CPU, a straight revert is not the answer. Without the
> > series,
> > HWP cripples the CPU.
> 
> I assume SQLite IO-bottleneck is for the disk.  Disk doesn't share
> the TDP limit with the CPU, like IGP does.
> 
> Constraints / performance considerations for TDP sharing IO-loads
> differ from ones that don't share TDP with CPU cores.
> 
> 
> Workloads that can be "IO-bound" and which can be on the same chip
> with CPU i.e. share TDP with it are:
> - 3D rendering
> - Camera / video processing
> - Compute
> 
> Intel, AMD and ARM manufacturers all have (non-server) chips where
> these
> IP blocks are on the same die as CPU cores.  If CPU part redundantly
> doubles its power consumption, it's directly eating TDP budget away
> from
> these devices.
> 
> For workloads where IO bottleneck doesn't share TDP budget with CPU,
> like (sqlite) databases, you don't lose performance by running CPU
> constantly at full tilt, you only use more power [1].
> 
> Questions:
> 
> * Does currently kernel CPU freq management have any idea which IO
>    devices share TDP with the CPU cores?
No. The requests we make to the hardware are just hints (HW can
ignore them). The HW has a bias register to adjust and distribute power
among users.
Servers can have several other active devices besides the CPU which
need extra power when they run. So the HW arbitrates power.

> 
> * Do you do performance testing also in conditions that hit TDP
> limits?
Yes, several server benchmarks which also measure perf/watt.
For graphics we run KBL-G, which hits TDP limits; there we have a user
space power manager to distribute power.
Also, one Intel 6600 and one 7600 run the whole suite of Phoronix tests.

Thanks,
Srinivas

> 
> 
> 	- Eero
> 
> [1]  For them power usage is performance problem only if you start
> hitting TDP limit with CPUs alone, or you hit temperature limits.
> 
> For CPUs alone to hit TDP limits, our test-case needs to be
> utilizing 
> multiple cores and device needs to have lowish TDP compared to the
> performance of the chip.
> 
> TDP limiting adds test results variance significantly, but that's
> a property of the chips themselves so it cannot be avoided. :-/
> 
> Temperature limiting might happen on small enclosures like the ones
> used
> for the SKL HQ devices i.e. laptops & NUCs, but not on servers.   In
> our
> testing we try to avoid temperature limitations when its possible (=
> extra cooling) as it increases variance so much that results are
> mostly
> useless (same devices are also TDP limited i.e. already have high
> variance).
Mel Gorman July 30, 2018, 3:43 p.m. UTC | #12
On Sat, Jul 28, 2018 at 01:21:51PM -0700, Francisco Jerez wrote:
> >> Please revert this series, it led to significant energy usage and
> >> graphics performance regressions [1].  The reasons are roughly the ones
> >> we discussed by e-mail off-list last April: This causes the intel_pstate
> >> driver to decrease the EPP to zero when the workload blocks on IO
> >> frequently enough, which for the regressing benchmarks detailed in [1]
> >> is a symptom of the workload being heavily IO-bound, which means they
> >> won't benefit at all from the EPP boost since they aren't significantly
> >> CPU-bound, and they will suffer a decrease in parallelism due to the
> >> active CPU core using a larger fraction of the TDP in order to achieve
> >> the same work, causing the GPU to have a lower power budget available,
> >> leading to a decrease in system performance.
> >
> > It slices both ways.
> 
> I don't think it's acceptable to land an optimization that trades
> performance of one use-case for another,

The same logic applies to a revert but that aside, I see that there is
at least one patch floating around to disable HWP Boost for desktops and
laptops. Maybe that'll be sufficient for the cases where IGP is a major
component.

> especially since one could make
> both use-cases happy by avoiding the boost in cases where we know
> beforehand that we aren't going to achieve any improvement in
> performance, because an application waiting frequently on an IO device
> which is 100% utilized isn't going to run faster just because we ramp up
> the CPU frequency, since the IO device won't be able to process requests
> from the application faster anyway, so we will only be pessimizing
> energy efficiency (and potentially decreasing performance of the GPU
> *and* of other CPU cores living on the same package for no benefit).
> 

The benchmarks in question are not necessarily utilising IO at 100% or
IO-bound. One pattern is a small fsync which ends up context switching
between the process and a journalling thread (may be dedicated thread, may be
workqueue depending on filesystem) and the process waking again in the very
near future on IO completion. While the workload may be single threaded,
more than one core is in use because of how the short sleeps migrate the
task to other cores.  HWP does not necessarily notice that the task is
quite CPU-intensive due to the migrations and so the performance suffers.
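
For illustration, a minimal userspace sketch of that pattern (not one of the
benchmarks quoted here): a loop of tiny appends followed by fsync(), so the
task repeatedly blocks on the journal commit and wakes again almost
immediately, often on another core.

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	const char buf[] = "small journalled write\n";
	int fd = open("fsync-test.dat", O_WRONLY | O_CREAT | O_TRUNC, 0644);
	int i;

	if (fd < 0) {
		perror("open");
		return 1;
	}
	for (i = 0; i < 10000; i++) {
		/* Tiny write followed by fsync(): block on the journal
		 * commit, wake shortly afterwards, repeat. */
		if (write(fd, buf, strlen(buf)) < 0 || fsync(fd) < 0) {
			perror("write/fsync");
			break;
		}
	}
	close(fd);
	return 0;
}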

Some effort is made to minimise the number of cores used with this sort
of waker/wakee relationship but it's not necessarily enough for HWP to
boost the frequency.  Minimally, the journalling thread woken up will
not wake on the same CPU as the IO issuer except under extremely heavy
utilisation and this is not likely to change (stacking tasks too often
increases wakeup latency).

> > With the series, there are large boosts to performance on other
> > workloads where a slight increase in power usage is acceptable in
> > exchange for performance. For example,
> >
> > Single socket skylake running sqlite
> >                                  v4.17               41ab43c9
> > Min       Trans     2580.85 (   0.00%)     5401.58 ( 109.29%)
> > Hmean     Trans     2610.38 (   0.00%)     5518.36 ( 111.40%)
> > Stddev    Trans       28.08 (   0.00%)      208.90 (-644.02%)
> > CoeffVar  Trans        1.08 (   0.00%)        3.78 (-251.57%)
> > Max       Trans     2648.02 (   0.00%)     5992.74 ( 126.31%)
> > BHmean-50 Trans     2629.78 (   0.00%)     5643.81 ( 114.61%)
> > BHmean-95 Trans     2620.38 (   0.00%)     5538.32 ( 111.36%)
> > BHmean-99 Trans     2620.38 (   0.00%)     5538.32 ( 111.36%)
> >
> > That's over doubling the transactions per second for that workload.
> >
> > Two-socket skylake running dbench4
> >                                 v4.17               41ab43c9
> > Amean      1         40.85 (   0.00%)       14.97 (  63.36%)
> > Amean      2         42.31 (   0.00%)       17.33 (  59.04%)
> > Amean      4         53.77 (   0.00%)       27.85 (  48.20%)
> > Amean      8         68.86 (   0.00%)       43.78 (  36.42%)
> > Amean      16        82.62 (   0.00%)       56.51 (  31.60%)
> > Amean      32       135.80 (   0.00%)      116.06 (  14.54%)
> > Amean      64       737.51 (   0.00%)      701.00 (   4.95%)
> > Amean      512    14996.60 (   0.00%)    14755.05 (   1.61%)
> >
> > This is reporting the average latency of operations running
> > dbench. The series over halves the latencies. There are many examples
> > of basic workloads that benefit heavily from the series and while I
> > accept it may not be universal, such as the case where the graphics
> > card needs the power and not the CPU, a straight revert is not the
>> > answer. Without the series, HWP cripples the CPU.
> >
> 
> That seems like a huge overstatement.  HWP doesn't "cripple" the CPU
> without this series.  It will certainly set lower clocks than with this
> series for workloads like you show above that utilize the CPU very
> intermittently (i.e. they underutilize it). 

Dbench for example can be quite CPU intensive. When bound to a single
core, it shows up to 80% utilisation of a single core. When unbound,
the usage of individual cores appears low due to the migrations. It may
be intermittent usage as it context switches to worker threads but it's
not low utilisation either.

intel_pstate also had logic for IO-boosting before HWP so the user-visible
impact for some workloads is that upgrading a machine's CPU can result
in regressions due to HWP. Similarly it has been observed prior to the
series that specifying no_hwp often performed better. So one could argue
that HWP isn't "crippled" but it did have surprising behaviour.
srinivas pandruvada July 30, 2018, 3:57 p.m. UTC | #13
On Mon, 2018-07-30 at 16:43 +0100, Mel Gorman wrote:
> On Sat, Jul 28, 2018 at 01:21:51PM -0700, Francisco Jerez wrote:
> > > > Please revert this series, it led to significant energy usage
> > > > and
> > > > graphics performance regressions [1].  The reasons are roughly
> > > > the ones
> > > > we discussed by e-mail off-list last April: This causes the
> > > > intel_pstate
> > > > driver to decrease the EPP to zero when the workload blocks on
> > > > IO
> > > > frequently enough, which for the regressing benchmarks detailed
> > > > in [1]
> > > > is a symptom of the workload being heavily IO-bound, which
> > > > means they
> > > > won't benefit at all from the EPP boost since they aren't
> > > > significantly
> > > > CPU-bound, and they will suffer a decrease in parallelism due
> > > > to the
> > > > active CPU core using a larger fraction of the TDP in order to
> > > > achieve
> > > > the same work, causing the GPU to have a lower power budget
> > > > available,
> > > > leading to a decrease in system performance.
> > > 
> > > It slices both ways.
> > 
> > I don't think it's acceptable to land an optimization that trades
> > performance of one use-case for another,
> 
> The same logic applies to a revert but that aside, I see that there
> is
> at least one patch floating around to disable HWP Boost for desktops
> and
> laptops. Maybe that'll be sufficient for the cases where IGP is a
> major
> component.
We don't have to revert the series. The only contention is the desktop CPU
model, which I didn't include before the last version.
One entry-level server uses the same CPU model as the desktop parts, which
users like Giovanni suggested adding. But we know from ACPI that those
machines are servers, so we can differentiate.
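For illustration, a kernel-style sketch of one way to differentiate using the
FADT preferred PM profile (the helper name and the exact list of profiles here
are just assumptions for the sketch, not necessarily what will land):

#include <linux/acpi.h>

/*
 * Sketch only: treat the machine as a server when the ACPI FADT
 * preferred PM profile says so, so hwp_boost can default to off on
 * desktop/workstation profiles that happen to share the CPU model.
 */
static bool intel_pstate_acpi_pm_profile_server(void)
{
	switch (acpi_gbl_FADT.preferred_profile) {
	case PM_ENTERPRISE_SERVER:
	case PM_SOHO_SERVER:
	case PM_PERFORMANCE_SERVER:
		return true;
	default:
		return false;
	}
}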

Thanks,
Srinivas




> 
> > especially since one could make
> > both use-cases happy by avoiding the boost in cases where we know
> > beforehand that we aren't going to achieve any improvement in
> > performance, because an application waiting frequently on an IO
> > device
> > which is 100% utilized isn't going to run faster just because we
> > ramp up
> > the CPU frequency, since the IO device won't be able to process
> > requests
> > from the application faster anyway, so we will only be pessimizing
> > energy efficiency (and potentially decreasing performance of the
> > GPU
> > *and* of other CPU cores living on the same package for no
> > benefit).
> > 
> 
> The benchmarks in question are not necessarily utilising IO at 100%
> or
> IO-bound. One pattern is a small fsync which ends up context
> switching
> between the process and a journalling thread (may be dedicated
> thread, may be
> workqueue depending on filesystem) and the process waking again in
> the very
> near future on IO completion. While the workload may be single
> threaded,
> more than one core is in use because of how the short sleeps migrate
> the
> task to other cores.  HWP does not necessarily notice that the task
> is
> quite CPU-intensive due to the migrations and so the performance
> suffers.
> 
> Some effort is made to minimise the number of cores used with this
> sort
> of waker/wakee relationship but it's not necessarily enough for HWP
> to
> boost the frequency.  Minimally, the journalling thread woken up will
> not wake on the same CPU as the IO issuer except under extremely
> heavily
> utilisation and this is not likely to change (stacking stacks too
> often
> increases wakeup latency).
> 
> > > With the series, there are large boosts to performance on other
> > > workloads where a slight increase in power usage is acceptable in
> > > exchange for performance. For example,
> > > 
> > > Single socket skylake running sqlite
> > >                                  v4.17               41ab43c9
> > > Min       Trans     2580.85 (   0.00%)     5401.58 ( 109.29%)
> > > Hmean     Trans     2610.38 (   0.00%)     5518.36 ( 111.40%)
> > > Stddev    Trans       28.08 (   0.00%)      208.90 (-644.02%)
> > > CoeffVar  Trans        1.08 (   0.00%)        3.78 (-251.57%)
> > > Max       Trans     2648.02 (   0.00%)     5992.74 ( 126.31%)
> > > BHmean-50 Trans     2629.78 (   0.00%)     5643.81 ( 114.61%)
> > > BHmean-95 Trans     2620.38 (   0.00%)     5538.32 ( 111.36%)
> > > BHmean-99 Trans     2620.38 (   0.00%)     5538.32 ( 111.36%)
> > > 
> > > That's over doubling the transactions per second for that
> > > workload.
> > > 
> > > Two-socket skylake running dbench4
> > >                                 v4.17               41ab43c9
> > > Amean      1         40.85 (   0.00%)       14.97 (  63.36%)
> > > Amean      2         42.31 (   0.00%)       17.33 (  59.04%)
> > > Amean      4         53.77 (   0.00%)       27.85 (  48.20%)
> > > Amean      8         68.86 (   0.00%)       43.78 (  36.42%)
> > > Amean      16        82.62 (   0.00%)       56.51 (  31.60%)
> > > Amean      32       135.80 (   0.00%)      116.06 (  14.54%)
> > > Amean      64       737.51 (   0.00%)      701.00 (   4.95%)
> > > Amean      512    14996.60 (   0.00%)    14755.05 (   1.61%)
> > > 
> > > This is reporting the average latency of operations running
> > > dbench. The series over halves the latencies. There are many
> > > examples
> > > of basic workloads that benefit heavily from the series and while
> > > I
> > > accept it may not be universal, such as the case where the
> > > graphics
> > > card needs the power and not the CPU, a straight revert is not
> > > the
> > > answer. Without the series, HWP cripples the CPU.
> > > 
> > 
> > That seems like a huge overstatement.  HWP doesn't "cripple" the
> > CPU
> > without this series.  It will certainly set lower clocks than with
> > this
> > series for workloads like you show above that utilize the CPU very
> > intermittently (i.e. they underutilize it). 
> 
> Dbench for example can be quite CPU intensive. When bound to a single
> core, it shows up to 80% utilisation of a single core. When unbound,
> the usage of individual cores appears low due to the migrations. It
> may
> be intermittent usage as it context switches to worker threads but
> it's
> not low utilisation either.
> 
> intel_pstate also had logic for IO-boosting before HWP so the user-
> visible
> impact for some workloads is that upgrading a machine's CPU can
> result
> in regressions due to HWP. Similarly it has been observed prior to
> the
> series that specifying no_hwp often performed better. So one could
> argue
> that HWP isn't "crippled" but it did have surprising behaviour.
>
Francisco Jerez July 30, 2018, 6:32 p.m. UTC | #14
Mel Gorman <mgorman@techsingularity.net> writes:

> On Sat, Jul 28, 2018 at 01:21:51PM -0700, Francisco Jerez wrote:
>> >> Please revert this series, it led to significant energy usage and
>> >> graphics performance regressions [1].  The reasons are roughly the ones
>> >> we discussed by e-mail off-list last April: This causes the intel_pstate
>> >> driver to decrease the EPP to zero when the workload blocks on IO
>> >> frequently enough, which for the regressing benchmarks detailed in [1]
>> >> is a symptom of the workload being heavily IO-bound, which means they
>> >> won't benefit at all from the EPP boost since they aren't significantly
>> >> CPU-bound, and they will suffer a decrease in parallelism due to the
>> >> active CPU core using a larger fraction of the TDP in order to achieve
>> >> the same work, causing the GPU to have a lower power budget available,
>> >> leading to a decrease in system performance.
>> >
>> > It slices both ways.
>> 
>> I don't think it's acceptable to land an optimization that trades
>> performance of one use-case for another,
>
> The same logic applies to a revert

No, it doesn't: the responsibility of addressing the fallout from a
change that happens to hurt performance even though it was supposed to
improve it lies on the author of the change, not on the reporter of the
regression.

> but that aside, I see that there is at least one patch floating around
> to disable HWP Boost for desktops and laptops. Maybe that'll be
> sufficient for the cases where IGP is a major component.
>
>> especially since one could make
>> both use-cases happy by avoiding the boost in cases where we know
>> beforehand that we aren't going to achieve any improvement in
>> performance, because an application waiting frequently on an IO device
>> which is 100% utilized isn't going to run faster just because we ramp up
>> the CPU frequency, since the IO device won't be able to process requests
>> from the application faster anyway, so we will only be pessimizing
>> energy efficiency (and potentially decreasing performance of the GPU
>> *and* of other CPU cores living on the same package for no benefit).
>> 
>
> The benchmarks in question are not necessarily utilising IO at 100% or
> IO-bound.

Exactly.  That's the only reason why they are able to take advantage of
HWP boost, while the regressing graphics benchmarks are not, since they
are utilizing an IO device at 100%.  Both categories of use-cases sleep
on IO-wait frequently, but only the former are authentically CPU-bound.

> One pattern is a small fsync which ends up context switching between
> the process and a journalling thread (may be dedicated thread, may be
> workqueue depending on filesystem) and the process waking again in the
> very near future on IO completion. While the workload may be single
> threaded, more than one core is in use because of how the short sleeps
> migrate the task to other cores.  HWP does not necessarily notice that
> the task is quite CPU-intensive due to the migrations and so the
> performance suffers.
>
> Some effort is made to minimise the number of cores used with this sort
> of waker/wakee relationship but it's not necessarily enough for HWP to
> boost the frequency.  Minimally, the journalling thread woken up will
> not wake on the same CPU as the IO issuer except under extremely heavy
> utilisation and this is not likely to change (stacking tasks too often
> increases wakeup latency).
>

The task scheduler does go through the effort of attempting to re-use
the most frequently active CPU when a task wakes up, at least last time
I checked.  But yes some migration patterns can exacerbate the downward
bias of the response of the HWP to an intermittent workload, primarily
in cases where the application is unable to take advantage of the
parallelism between CPU and the IO device involved, like you're
describing above.
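
Just to make the pattern concrete, here is a minimal sketch of the kind
of small-fsync loop being described (the file path, record size and
iteration count are made up; it only illustrates the shape of the
workload):

#include <fcntl.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	char buf[512];
	int fd = open("/tmp/journal-test.dat", O_CREAT | O_WRONLY | O_TRUNC, 0600);
	int i;

	if (fd < 0)
		return 1;

	memset(buf, 0xa5, sizeof(buf));

	for (i = 0; i < 100000; i++) {
		if (write(fd, buf, sizeof(buf)) != (ssize_t)sizeof(buf))
			break;
		/*
		 * Blocks briefly while the journalling thread runs,
		 * possibly on another core, then wakes right away.
		 */
		fsync(fd);
	}

	close(fd);
	return 0;
}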

>> > With the series, there are large boosts to performance on other
>> > workloads where a slight increase in power usage is acceptable in
>> > exchange for performance. For example,
>> >
>> > Single socket skylake running sqlite
>> >                                  v4.17               41ab43c9
>> > Min       Trans     2580.85 (   0.00%)     5401.58 ( 109.29%)
>> > Hmean     Trans     2610.38 (   0.00%)     5518.36 ( 111.40%)
>> > Stddev    Trans       28.08 (   0.00%)      208.90 (-644.02%)
>> > CoeffVar  Trans        1.08 (   0.00%)        3.78 (-251.57%)
>> > Max       Trans     2648.02 (   0.00%)     5992.74 ( 126.31%)
>> > BHmean-50 Trans     2629.78 (   0.00%)     5643.81 ( 114.61%)
>> > BHmean-95 Trans     2620.38 (   0.00%)     5538.32 ( 111.36%)
>> > BHmean-99 Trans     2620.38 (   0.00%)     5538.32 ( 111.36%)
>> >
>> > That's over doubling the transactions per second for that workload.
>> >
>> > Two-socket skylake running dbench4
>> >                                 v4.17               41ab43c9
>> > Amean      1         40.85 (   0.00%)       14.97 (  63.36%)
>> > Amean      2         42.31 (   0.00%)       17.33 (  59.04%)
>> > Amean      4         53.77 (   0.00%)       27.85 (  48.20%)
>> > Amean      8         68.86 (   0.00%)       43.78 (  36.42%)
>> > Amean      16        82.62 (   0.00%)       56.51 (  31.60%)
>> > Amean      32       135.80 (   0.00%)      116.06 (  14.54%)
>> > Amean      64       737.51 (   0.00%)      701.00 (   4.95%)
>> > Amean      512    14996.60 (   0.00%)    14755.05 (   1.61%)
>> >
>> > This is reporting the average latency of operations running
>> > dbench. The series over halves the latencies. There are many examples
>> > of basic workloads that benefit heavily from the series and while I
>> > accept it may not be universal, such as the case where the graphics
>> > card needs the power and not the CPU, a straight revert is not the
>> > answer. Without the series, HWP cripples the CPU.
>> >
>> 
>> That seems like a huge overstatement.  HWP doesn't "cripple" the CPU
>> without this series.  It will certainly set lower clocks than with this
>> series for workloads like you show above that utilize the CPU very
>> intermittently (i.e. they underutilize it). 
>
> Dbench for example can be quite CPU intensive. When bound to a single
> core, it shows up to 80% utilisation of a single core.

So even with an oracle cpufreq governor able to guess that the
application relies on the CPU being locked to the maximum frequency
despite it utilizing less than 80% of the CPU cycles, the application
will still perform 20% worse than an alternative application handling
its IO work asynchronously.

> When unbound, the usage of individual cores appears low due to the
> migrations. It may be intermittent usage as it context switches to
> worker threads but it's not low utilisation either.
>
> intel_pstate also had logic for IO-boosting before HWP 

The IO-boosting logic of the intel_pstate governor has the same flaw as
this unfortunately.

> so the user-visible impact for some workloads is that upgrading a
> machine's CPU can result in regressions due to HWP. Similarly it has
> been observed prior to the series that specifying no_hwp often
> performed better. So one could argue that HWP isn't "crippled" but it
> did have surprising behaviour.
>
> -- 
> Mel Gorman
> SUSE Labs
Giovanni Gherdovich July 31, 2018, 7:10 a.m. UTC | #15
On Mon, 2018-07-30 at 11:32 -0700, Francisco Jerez wrote:
> Mel Gorman <mgorman@techsingularity.net> writes:

> > On Sat, Jul 28, 2018 at 01:21:51PM -0700, Francisco Jerez wrote:
> > > > > Please revert this series, it led to significant energy usage and
> > > > > graphics performance regressions [1].  The reasons are roughly the ones
> > > > > we discussed by e-mail off-list last April: This causes the intel_pstate
> > > > > driver to decrease the EPP to zero when the workload blocks on IO
> > > > > frequently enough, which for the regressing benchmarks detailed in [1]
> > > > > is a symptom of the workload being heavily IO-bound, which means they
> > > > > won't benefit at all from the EPP boost since they aren't significantly
> > > > > CPU-bound, and they will suffer a decrease in parallelism due to the
> > > > > active CPU core using a larger fraction of the TDP in order to achieve
> > > > > the same work, causing the GPU to have a lower power budget available,
> > > > > leading to a decrease in system performance.
> > > > 
> > > > It slices both ways.
> > > 
> > > I don't think it's acceptable to land an optimization that trades
> > > performance of one use-case for another,
> > 
> > The same logic applies to a revert

> No, it doesn't, the responsibility of addressing the fallout from a
> change that happens to hurt performance even though it was supposed to
> improve it lies on the author of the change, not on the reporter of the
> regression.

The server and desktop worlds have different characteristics and needs, which
in this particular case appear to be conflicting. Luckily we can
differentiate the two scenarios (as in the bugfix patch by Srinivas a few
hours ago).

> The task scheduler does go through the effort of attempting to re-use
> the most frequently active CPU when a task wakes up, at least last time
> I checked.

Unfortunately that doesn't happen in practice; the load balancer in the
scheduler is in a constant tension between spreading tasks evenly across all
cores (necessary when the machine is under heavy load) and packing on just a
few (which makes more sense when only a few threads are working and the box is
almost idle otherwise). Recent evolutions favour spreading. We often observe
tasks helplessly bounce from core to core losing all their accrued utilisation
score, and intel_pstate (with or without HWP) penalizes that.

On Mon, 2018-07-30 at 11:32 -0700, Francisco Jerez wrote:
> Mel Gorman <mgorman@techsingularity.net> writes:
>
> [...]
> > One pattern is a small fsync which ends up context switching between
> > the process and a journalling thread (may be dedicated thread, may be
> > workqueue depending on filesystem) and the process waking again in the
> > very near future on IO completion. While the workload may be single
> > threaded, more than one core is in use because of how the short sleeps
> > migrate the task to other cores.  HWP does not necessarily notice that
> > the task is quite CPU-intensive due to the migrations and so the
> > performance suffers.
> > 
> > Some effort is made to minimise the number of cores used with this sort
> > of waker/wakee relationship but it's not necessarily enough for HWP to
> > boost the frequency.  Minimally, the journalling thread woken up will
> > not wake on the same CPU as the IO issuer except under extremely heavy
> > utilisation and this is not likely to change (stacking tasks too often
> > increases wakeup latency).
> > 

> The task scheduler does go through the effort of attempting to re-use
> the most frequently active CPU when a task wakes up, at least last time
> I checked.  But yes some migration patterns can exacerbate the downward
> bias of the response of the HWP to an intermittent workload, primarily
> in cases where the application is unable to take advantage of the
> parallelism between CPU and the IO device involved, like you're
> describing above.

Unfortunately that doesn't happen in practice; the load balancer in the
scheduler is in a constant tension between spreading tasks evenly across all
cores (necessary when the machine is under heavy load) and packing on just a
few (which makes more sense when only a few threads are working and the box is
idle otherwise). Recent evolutions favour spreading. We often observe tasks
helplessly bounce from core to core losing all their accrued utilization
score, and intel_pstate (with or without HWP) penalizes that.

That's why in our distro SLES-15 (which is based on 4.12.14) we're sporting a
patch like this:
https://kernel.opensuse.org/cgit/kernel/commit/?h=SLE15&id=3a287868cb7a9 which
boosts tasks that have been placed on a previously idle CPU. We haven't even
proposed this patch upstream as we hope to solve those problems at a more
fundamental level, but when you're supporting power management (freq scaling)
in the server world you get compared to the performance governor, so your
policy needs to be aggressive.
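
Roughly, the idea is something like the sketch below (this is only an
illustration of the principle, not the code of the SLE patch linked
above; all the names are invented):

/*
 * Sketch of the idea only.  After a task is migrated to a CPU that was
 * idle, its utilisation history on that CPU is empty, so the governor's
 * estimate starts near zero; briefly trusting a boost value instead
 * avoids requesting the lowest P-state after every migration.
 */
struct gov_cpu {
	unsigned int util_pct;		/* last sampled utilisation, 0..100 */
	unsigned int was_idle;		/* CPU was idle before this wakeup */
};

static unsigned int gov_target_pct(struct gov_cpu *c, unsigned int util_pct,
				   unsigned int boost_pct)
{
	if (c->was_idle && util_pct < boost_pct)
		util_pct = boost_pct;	/* no history here yet: boost instead */

	c->util_pct = util_pct;
	c->was_idle = 0;

	return util_pct;
}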

> > > > With the series, there are large boosts to performance on other
> > > > workloads where a slight increase in power usage is acceptable in
> > > > exchange for performance. For example,
> > > > 
> > > > Single socket skylake running sqlite
> > > >                                  v4.17               41ab43c9
> > > > Min       Trans     2580.85 (   0.00%)     5401.58 ( 109.29%)
> > > > Hmean     Trans     2610.38 (   0.00%)     5518.36 ( 111.40%)
> > > > Stddev    Trans       28.08 (   0.00%)      208.90 (-644.02%)
> > > > CoeffVar  Trans        1.08 (   0.00%)        3.78 (-251.57%)
> > > > Max       Trans     2648.02 (   0.00%)     5992.74 ( 126.31%)
> > > > BHmean-50 Trans     2629.78 (   0.00%)     5643.81 ( 114.61%)
> > > > BHmean-95 Trans     2620.38 (   0.00%)     5538.32 ( 111.36%)
> > > > BHmean-99 Trans     2620.38 (   0.00%)     5538.32 ( 111.36%)
> > > > 
> > > > That's over doubling the transactions per second for that workload.
> > > > 
> > > > Two-socket skylake running dbench4
> > > >                                 v4.17               41ab43c9
> > > > Amean      1         40.85 (   0.00%)       14.97 (  63.36%)
> > > > Amean      2         42.31 (   0.00%)       17.33 (  59.04%)
> > > > Amean      4         53.77 (   0.00%)       27.85 (  48.20%)
> > > > Amean      8         68.86 (   0.00%)       43.78 (  36.42%)
> > > > Amean      16        82.62 (   0.00%)       56.51 (  31.60%)
> > > > Amean      32       135.80 (   0.00%)      116.06 (  14.54%)
> > > > Amean      64       737.51 (   0.00%)      701.00 (   4.95%)
> > > > Amean      512    14996.60 (   0.00%)    14755.05 (   1.61%)
> > > > 
> > > > This is reporting the average latency of operations running
> > > > dbench. The series over halves the latencies. There are many examples
> > > > of basic workloads that benefit heavily from the series and while I
> > > > accept it may not be universal, such as the case where the graphics
> > > > card needs the power and not the CPU, a straight revert is not the
> > > > answer. Without the series, HWP cripples the CPU.
> > > > 
> > > 
> > > That seems like a huge overstatement.  HWP doesn't "cripple" the CPU
> > > without this series.  It will certainly set lower clocks than with this
> > > series for workloads like you show above that utilize the CPU very
> > > intermittently (i.e. they underutilize it). 
> > 
> > Dbench for example can be quite CPU intensive. When bound to a single
> > core, it shows up to 80% utilisation of a single core.

> So even with an oracle cpufreq governor able to guess that the
> application relies on the CPU being locked to the maximum frequency
> despite it utilizing less than 80% of the CPU cycles, the application
> will still perform 20% worse than an alternative application handling
> its IO work asynchronously.

It's a matter of being pragmatic. You're saying that a given application is
badly designed and should be rewritten to leverage parallelism between CPU and
IO. But in the field you *have* applications that behave that way, and the OS
is in a position to do something to mitigate the damage.

> > When unbound, the usage of individual cores appears low due to the
> > migrations. It may be intermittent usage as it context switches to
> > worker threads but it's not low utilisation either.
> > 
> > intel_pstate also had logic for IO-boosting before HWP 

> The IO-boosting logic of the intel_pstate governor has the same flaw as
> this unfortunately.
>

Again it's a matter of pragmatism. You'll find that another governor uses
IO-boosting: schedutil. And while intel_pstate needs it because it gets
otherwise killed by migrating tasks, schedutil is based on the PELT
utilization signal and doesn't have that problem at all. The principle there
is plain and simple: if I've been "wasting time" waiting on "slow" IO (disk),
that probably means I'm getting late and there is soon some compute to do:
better catch up on the lost time and speed up. IO-wait boosting on schedutil
was discussed at
https://lore.kernel.org/lkml/3752826.3sXAQIvcIA@vostro.rjw.lan/
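
In rough pseudo-C, that principle looks something like the following (a
simplified sketch, not the actual schedutil implementation):

/*
 * Simplified sketch of the iowait-boost principle described above.
 * Every wakeup from IO wait doubles a boost value up to the maximum
 * frequency; governor updates without an IO wakeup halve it again, so
 * the boost decays once the IO phase ends.
 */
struct iowait_boost {
	unsigned int boost;	/* current boost, kHz (0 = inactive) */
	unsigned int min;	/* e.g. policy->min */
	unsigned int max;	/* e.g. policy->max */
};

static void iowait_boost_hit(struct iowait_boost *b)
{
	unsigned int next = b->boost ? b->boost * 2 : b->min;

	b->boost = next > b->max ? b->max : next;
}

static void iowait_boost_decay(struct iowait_boost *b)
{
	b->boost /= 2;
	if (b->boost < b->min)
		b->boost = 0;
}

static unsigned int iowait_boost_apply(struct iowait_boost *b,
				       unsigned int freq)
{
	return freq > b->boost ? freq : b->boost;
}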


Giovanni Gherdovich
SUSE Labs
Peter Zijlstra July 31, 2018, 3:04 p.m. UTC | #16
On Mon, Jul 30, 2018 at 07:06:21AM -0700, Srinivas Pandruvada wrote:
> On Mon, 2018-07-30 at 14:16 +0300, Eero Tamminen wrote:
> > Questions:
> > 
> > * Does currently kernel CPU freq management have any idea which IO
> >    devices share TDP with the CPU cores?

> No. The requests we make to hardware are an indication only (HW can
> ignore them). The HW has a bias register to adjust and distribute power
> among users.
> We can have several other active devices on servers besides the CPU
> which, when running, need extra power. So the HW arbitrates power.

That's not entirely accurate AFAIK. "No" is accurate for Intel, but the
ARM people have their IPA thing (not a beer):

  https://developer.arm.com/open-source/intelligent-power-allocation

  drivers/thermal/power_allocator.c

which IIUC interacts with their cpufreq driver to disallow certain OPP
states.

And note that I discourage intel_pstate active mode.
srinivas pandruvada July 31, 2018, 7:07 p.m. UTC | #17
On Tue, 2018-07-31 at 17:04 +0200, Peter Zijlstra wrote:
> On Mon, Jul 30, 2018 at 07:06:21AM -0700, Srinivas Pandruvada wrote:
> > On Mon, 2018-07-30 at 14:16 +0300, Eero Tamminen wrote:
> > > Questions:
> > > 
> > > * Does currently kernel CPU freq management have any idea which
> > > IO
> > >    devices share TDP with the CPU cores?
> > No. The requests we make to hardware are an indication only (HW can
> > ignore them). The HW has a bias register to adjust and distribute
> > power
> > among users.
> > We can have several other active devices on servers besides the CPU
> > which, when running, need extra power. So the HW arbitrates power.
> 
> That's not entirely accurate AFAIK. "No" is accurate for Intel, but
> the
> ARM people have their IPA thing (not a beer):
We also have that for user space programs (e.g. KBL-G).

Thanks,
Srinivas

> 
>   https://developer.arm.com/open-source/intelligent-power-allocation
> 
>   drivers/thermal/power_allocator.c
> 
> which IIUC interacts with their cpufreq driver to disallow certain
> OPP
> states.
> 
> And note that I discourage intel_pstate active mode.
Francisco Jerez Aug. 1, 2018, 6:52 a.m. UTC | #18
Giovanni Gherdovich <ggherdovich@suse.cz> writes:

> On Mon, 2018-07-30 at 11:32 -0700, Francisco Jerez wrote:
>> Mel Gorman <mgorman@techsingularity.net> writes:
>> 
>> > On Sat, Jul 28, 2018 at 01:21:51PM -0700, Francisco Jerez wrote:
>> > > > > Please revert this series, it led to significant energy usage and
>> > > > > graphics performance regressions [1].  The reasons are roughly the ones
>> > > > > we discussed by e-mail off-list last April: This causes the intel_pstate
>> > > > > driver to decrease the EPP to zero when the workload blocks on IO
>> > > > > frequently enough, which for the regressing benchmarks detailed in [1]
>> > > > > is a symptom of the workload being heavily IO-bound, which means they
>> > > > > won't benefit at all from the EPP boost since they aren't significantly
>> > > > > CPU-bound, and they will suffer a decrease in parallelism due to the
>> > > > > active CPU core using a larger fraction of the TDP in order to achieve
>> > > > > the same work, causing the GPU to have a lower power budget available,
>> > > > > leading to a decrease in system performance.
>> > > > 
>> > > > It slices both ways.
>> > > 
>> > > I don't think it's acceptable to land an optimization that trades
>> > > performance of one use-case for another,
>> > 
>> > The same logic applies to a revert
>> 
>> No, it doesn't, the responsibility of addressing the fallout from a
>> change that happens to hurt performance even though it was supposed to
>> improve it lies on the author of the change, not on the reporter of the
>> regression.
>
> The server and desktop worlds have different characteristics and needs, which
> in this particular case appear to be conflicting. Luckily we can
> differentiate the two scenarios (as in the bugfix patch by Srinivas a few
> hours ago).
>

I'm skeptical that the needs of the server and desktop world are really
as different and conflicting as this discussion may make them seem.  In
a server environment, how many requests the system can process per second
can matter as much as (if not more than) the latency
of any individual request (since the latency of the network can easily
be an order of magnitude higher than the latency reduction that can
possibly be achieved by tricking the HWP into reacting faster).  I'm not
convinced about the usefulness of trading the former for the latter in a
server environment, particularly since we could achieve both goals
simultaneously.

>> The task scheduler does go through the effort of attempting to re-use
>> the most frequently active CPU when a task wakes up, at least last time
>> I checked.
>
> Unfortunately that doesn't happen in practice; the load balancer in the
> scheduler is in a constant tension between spreading tasks evenly across all
> cores (necessary when the machine is under heavy load) and packing on just a
> few (which makes more sense when only a few threads are working and the box is
> almost idle otherwise). Recent evolutions favour spreading. We often observe
> tasks helplessly bounce from core to core losing all their accrued utilisation
> score, and intel_pstate (with or without HWP) penalizes that.
>

That's unfortunate.  Luckily it's easy enough for the cpufreq governor
to differentiate those cases from the applications that have enough
parallelism to utilize at least one system resource close to its maximum
throughput and become non-CPU-bound.

> On Mon, 2018-07-30 at 11:32 -0700, Francisco Jerez wrote:
>> Mel Gorman <mgorman@techsingularity.net> writes:
>>
>> [...]
>> > One pattern is a small fsync which ends up context switching between
>> > the process and a journalling thread (may be dedicated thread, may be
>> > workqueue depending on filesystem) and the process waking again in the
>> > very near future on IO completion. While the workload may be single
>> > threaded, more than one core is in use because of how the short sleeps
>> > migrate the task to other cores.  HWP does not necessarily notice that
>> > the task is quite CPU-intensive due to the migrations and so the
>> > performance suffers.
>> > 
>> > Some effort is made to minimise the number of cores used with this sort
>> > of waker/wakee relationship but it's not necessarily enough for HWP to
>> > boost the frequency.  Minimally, the journalling thread woken up will
>> > not wake on the same CPU as the IO issuer except under extremely heavy
>> > utilisation and this is not likely to change (stacking tasks too often
>> > increases wakeup latency).
>> > 
>> 
>> The task scheduler does go through the effort of attempting to re-use
>> the most frequently active CPU when a task wakes up, at least last time
>> I checked.  But yes some migration patterns can exacerbate the downward
>> bias of the response of the HWP to an intermittent workload, primarily
>> in cases where the application is unable to take advantage of the
>> parallelism between CPU and the IO device involved, like you're
>> describing above.
>
> Unfortunately that doesn't happen in practice; the load balancer in the
> scheduler is in a constant tension between spreading tasks evenly across all
> cores (necessary when the machine is under heavy load) and packing on just a
> few (which makes more sense when only a few threads are working and the box is
> idle otherwise). Recent evolutions favour spreading. We often observe tasks
> helplessly bounce from core to core losing all their accrued utilization
> score, and intel_pstate (with or without HWP) penalizes that.
>
> That's why in our distro SLES-15 (which is based on 4.12.14) we're sporting a
> patch like this:
> https://kernel.opensuse.org/cgit/kernel/commit/?h=SLE15&id=3a287868cb7a9 which
> boosts tasks that have been placed on a previously idle CPU. We haven't even
> proposed this patch upstream as we hope to solve those problems at a more
> fundamental level, but when you're supporting power management (freq scaling)
> in the server world you get compared to the performance governor, so your
> policy needs to be aggressive.
>
>> 
>> > > > With the series, there are large boosts to performance on other
>> > > > workloads where a slight increase in power usage is acceptable in
>> > > > exchange for performance. For example,
>> > > > 
>> > > > Single socket skylake running sqlite
>> > > >                                  v4.17               41ab43c9
>> > > > Min       Trans     2580.85 (   0.00%)     5401.58 ( 109.29%)
>> > > > Hmean     Trans     2610.38 (   0.00%)     5518.36 ( 111.40%)
>> > > > Stddev    Trans       28.08 (   0.00%)      208.90 (-644.02%)
>> > > > CoeffVar  Trans        1.08 (   0.00%)        3.78 (-251.57%)
>> > > > Max       Trans     2648.02 (   0.00%)     5992.74 ( 126.31%)
>> > > > BHmean-50 Trans     2629.78 (   0.00%)     5643.81 ( 114.61%)
>> > > > BHmean-95 Trans     2620.38 (   0.00%)     5538.32 ( 111.36%)
>> > > > BHmean-99 Trans     2620.38 (   0.00%)     5538.32 ( 111.36%)
>> > > > 
>> > > > That's over doubling the transactions per second for that workload.
>> > > > 
>> > > > Two-socket skylake running dbench4
>> > > >                                 v4.17               41ab43c9
>> > > > Amean      1         40.85 (   0.00%)       14.97 (  63.36%)
>> > > > Amean      2         42.31 (   0.00%)       17.33 (  59.04%)
>> > > > Amean      4         53.77 (   0.00%)       27.85 (  48.20%)
>> > > > Amean      8         68.86 (   0.00%)       43.78 (  36.42%)
>> > > > Amean      16        82.62 (   0.00%)       56.51 (  31.60%)
>> > > > Amean      32       135.80 (   0.00%)      116.06 (  14.54%)
>> > > > Amean      64       737.51 (   0.00%)      701.00 (   4.95%)
>> > > > Amean      512    14996.60 (   0.00%)    14755.05 (   1.61%)
>> > > > 
>> > > > This is reporting the average latency of operations running
>> > > > dbench. The series over halves the latencies. There are many examples
>> > > > of basic workloads that benefit heavily from the series and while I
>> > > > accept it may not be universal, such as the case where the graphics
>> > > > card needs the power and not the CPU, a straight revert is not the
>> > > > answer. Without the series, HWP cripples the CPU.
>> > > > 
>> > > 
>> > > That seems like a huge overstatement.  HWP doesn't "cripple" the CPU
>> > > without this series.  It will certainly set lower clocks than with this
>> > > series for workloads like you show above that utilize the CPU very
>> > > intermittently (i.e. they underutilize it). 
>> > 
>> > Dbench for example can be quite CPU intensive. When bound to a single
>> > core, it shows up to 80% utilisation of a single core.
>> 
>> So even with an oracle cpufreq governor able to guess that the
>> application relies on the CPU being locked to the maximum frequency
>> despite it utilizing less than 80% of the CPU cycles, the application
>> will still perform 20% worse than an alternative application handling
>> its IO work asynchronously.
>
> It's a matter of being pragmatic. You're saying that a given application is
> badly designed and should be rewritten to leverage parallelism between CPU and
> IO.

Maybe some of them should be rewritten but that wasn't what I was trying
to say -- The point was that the kind of applications that benefit from
boosting on IO wait are necessarily within this category of workloads
that aren't able to take full advantage of any system resource anyway.
It's not like HWP would be "crippling" the CPU for a well-behaved
application.

> But in the field you *have* applications that behave that way, and the OS
> is in a position to do something to mitigate the damage.
>

Even when such a mitigation may actually reduce the performance of the
*same* applications when they are TDP-bound and the parallelism of the
system is limited by its energy usage?  I'm not objecting to optimizing
for latency-sensitive applications a priori, but such optimizations
shouldn't be applied unless we have an indication that the performance
of the system can possibly improve as a result (e.g. because the
application doesn't already have a bottleneck on some IO device).

>> 
>> > When unbound, the usage of individual cores appears low due to the
>> > migrations. It may be intermittent usage as it context switches to
>> > worker threads but it's not low utilisation either.
>> > 
>> > intel_pstate also had logic for IO-boosting before HWP 
>> 
>> The IO-boosting logic of the intel_pstate governor has the same flaw as
>> this unfortunately.
>>
>
> Again it's a matter of pragmatism. You'll find that another governor uses
> IO-boosting: schedutil. And while intel_pstate needs it because it gets
> otherwise killed by migrating tasks,

Right, I've been working on an alternative to that.

> schedutil is based on the PELT utilization signal and doesn't have
> that problem at all.

Yes, I agree that the reaction time of PELT can be superior to HWP at
times.

> The principle there is plain and simple: if I've been "wasting time"
> waiting on "slow" IO (disk), that probably means I'm getting late and
> there is soon some compute to do: better catch up on the lost time and
> speed up. IO-wait boosting on schedutil was discussed at
> https://lore.kernel.org/lkml/3752826.3sXAQIvcIA@vostro.rjw.lan/
>

That principle is nowhere close to a universal rule.  Waiting on an IO
device repeatedly is often a sign that the IO device is itself
overloaded and the CPU is running at an unnecessarily high frequency
(which means lower parallelism than optimal while TDP-bound), since
otherwise the CPU wouldn't need to wait on the IO device as frequently.
In such cases IOWAIT boosting gives you the opposite of what the
application needs.
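
Concretely, the gating I'm suggesting is along these lines (a
hypothetical sketch: io_device_busy_pct() and the threshold are made up,
and an actual utilization statistic for the device the workload keeps
waiting on would have to provide the real number):

/*
 * Hypothetical sketch of the gating argued for above.  The helper and
 * the threshold are made up; a real utilization statistic for the
 * device the workload keeps waiting on would have to supply this
 * number.
 */
#define IO_SATURATION_PCT	95	/* illustrative threshold */

static int io_boost_can_help(unsigned int io_device_busy_pct)
{
	/*
	 * If the IO device is already close to fully utilised, a higher
	 * CPU frequency cannot make its requests complete any sooner, so
	 * boosting only burns TDP that other agents (GPU, sibling cores)
	 * could have used.
	 */
	return io_device_busy_pct < IO_SATURATION_PCT;
}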

>
> Giovanni Gherdovich
> SUSE Labs
diff mbox

Patch

diff --git a/drivers/cpufreq/intel_pstate.c b/drivers/cpufreq/intel_pstate.c
index 70bf63bb4e0e..01c8da1f99db 100644
--- a/drivers/cpufreq/intel_pstate.c
+++ b/drivers/cpufreq/intel_pstate.c
@@ -1794,6 +1794,12 @@  static const struct x86_cpu_id intel_pstate_cpu_ee_disable_ids[] = {
 	{}
 };
 
+static const struct x86_cpu_id intel_pstate_hwp_boost_ids[] __initconst = {
+	ICPU(INTEL_FAM6_SKYLAKE_X, core_funcs),
+	ICPU(INTEL_FAM6_SKYLAKE_DESKTOP, core_funcs),
+	{}
+};
+
 static int intel_pstate_init_cpu(unsigned int cpunum)
 {
 	struct cpudata *cpu;
@@ -1824,6 +1830,10 @@  static int intel_pstate_init_cpu(unsigned int cpunum)
 			intel_pstate_disable_ee(cpunum);
 
 		intel_pstate_hwp_enable(cpu);
+
+		id = x86_match_cpu(intel_pstate_hwp_boost_ids);
+		if (id)
+			hwp_boost = true;
 	}
 
 	intel_pstate_get_cpu_pstates(cpu);