diff mbox

[RFC/RFT,02/10] cpufreq: intel_pstate: Conditional frequency invariant accounting

Message ID 20180516044911.28797-3-srinivas.pandruvada@linux.intel.com (mailing list archive)
State RFC, archived
Headers show

Commit Message

Srinivas Pandruvada May 16, 2018, 4:49 a.m. UTC
intel_pstate has two operating modes: active and passive. In "active"
mode, the built-in scaling governor is used and in "passive" mode,
the driver can be used with any governor, such as "schedutil". In "active"
mode the utilization values from schedutil are not used, and high
performance computing use cases require that no APERF/MPERF MSRs be
read at all. In that case there is no need to spend CPU cycles on
frequency invariant accounting by reading APERF/MPERF MSRs.
With this change, frequency invariant accounting is only enabled in
"passive" mode.

Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
---
[Note: The tick will be enabled later in the series when hwp dynamic
boost is enabled]

 drivers/cpufreq/intel_pstate.c | 5 +++++
 1 file changed, 5 insertions(+)
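
For context, the "frequency invariant accounting" gated here scales the
utilization tracked by PELT (and consumed by schedutil) by the ratio of
delivered to reference cycles, sampled from APERF/MPERF on each tick. Below
is a minimal, self-contained sketch of that computation; the function name,
the choice of the 1-core max turbo as the reference frequency, and the
example frequencies are assumptions for illustration, not the code added
earlier in this series.

    #include <stdint.h>

    #define SCHED_CAPACITY_SCALE 1024ULL

    /*
     * Ratio of the reference ("max") frequency to the base frequency,
     * scaled by SCHED_CAPACITY_SCALE.  Example: 3.6 GHz turbo / 2.3 GHz base.
     */
    static const uint64_t max_freq_ratio =
            (3600ULL * SCHED_CAPACITY_SCALE) / 2300ULL;

    /*
     * APERF counts at the frequency the CPU actually ran at, MPERF at the
     * base (TSC) frequency, so aperf_delta/mperf_delta == f_cur / f_base.
     * Normalising by max_freq_ratio yields a 0..1024 factor relative to the
     * chosen maximum.
     */
    static uint64_t freq_scale(uint64_t aperf_delta, uint64_t mperf_delta)
    {
            uint64_t scale;

            if (!mperf_delta)
                    return SCHED_CAPACITY_SCALE;

            scale = (aperf_delta * SCHED_CAPACITY_SCALE * SCHED_CAPACITY_SCALE) /
                    (mperf_delta * max_freq_ratio);

            return scale > SCHED_CAPACITY_SCALE ? SCHED_CAPACITY_SCALE : scale;
    }

A CPU that is fully busy at its 2.3 GHz base frequency then reports a scale of
roughly 654/1024 (~0.64), which is the clipping behaviour discussed in the
thread below.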

Comments

Peter Zijlstra May 16, 2018, 7:16 a.m. UTC | #1
On Tue, May 15, 2018 at 09:49:03PM -0700, Srinivas Pandruvada wrote:
> intel_pstate has two operating modes: active and passive. In "active"
> mode, the in-built scaling governor is used and in "passive" mode,
> the driver can be used with any governor like "schedutil". In "active"
> mode the utilization values from schedutil is not used and there is
> a requirement from high performance computing use cases, not to read
> any APERF/MPERF MSRs. In this case no need to use CPU cycles for
> frequency invariant accounting by reading APERF/MPERF MSRs.
> With this change frequency invariant account is only enabled in
> "passive" mode.

WTH is active/passive? Is passive when we select performance governor?

Also; you have to explain why using APERF/MPERF is bad in that case. Why
do they care if we read those MSRs during the tick?
Peter Zijlstra May 16, 2018, 7:29 a.m. UTC | #2
On Wed, May 16, 2018 at 09:16:40AM +0200, Peter Zijlstra wrote:
> On Tue, May 15, 2018 at 09:49:03PM -0700, Srinivas Pandruvada wrote:
> > intel_pstate has two operating modes: active and passive. In "active"
> > mode, the in-built scaling governor is used and in "passive" mode,
> > the driver can be used with any governor like "schedutil". In "active"
> > mode the utilization values from schedutil is not used and there is
> > a requirement from high performance computing use cases, not to read
> > any APERF/MPERF MSRs. In this case no need to use CPU cycles for
> > frequency invariant accounting by reading APERF/MPERF MSRs.
> > With this change frequency invariant account is only enabled in
> > "passive" mode.
> 
> WTH is active/passive? Is passive when we select performance governor?

Bah, I cannot read it seems. active is when we use the intel_pstate
governor and passive is when we use schedutil and only use intel_pstate
as a driver.

> Also; you have to explain why using APERF/MPERF is bad in that case. Why
> do they care if we read those MSRs during the tick?

That still stands.. this needs to be properly explained.
Rafael J. Wysocki May 16, 2018, 9:07 a.m. UTC | #3
On Wed, May 16, 2018 at 9:29 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Wed, May 16, 2018 at 09:16:40AM +0200, Peter Zijlstra wrote:
>> On Tue, May 15, 2018 at 09:49:03PM -0700, Srinivas Pandruvada wrote:
>> > intel_pstate has two operating modes: active and passive. In "active"
>> > mode, the in-built scaling governor is used and in "passive" mode,
>> > the driver can be used with any governor like "schedutil". In "active"
>> > mode the utilization values from schedutil is not used and there is
>> > a requirement from high performance computing use cases, not to read
>> > any APERF/MPERF MSRs. In this case no need to use CPU cycles for
>> > frequency invariant accounting by reading APERF/MPERF MSRs.
>> > With this change frequency invariant account is only enabled in
>> > "passive" mode.
>>
>> WTH is active/passive? Is passive when we select performance governor?
>
> Bah, I cannot read it seems. active is when we use the intel_pstate
> governor and passive is when we use schedutil and only use intel_pstate
> as a driver.
>
>> Also; you have to explain why using APERF/MPERF is bad in that case. Why
>> do they care if we read those MSRs during the tick?
>
> That still stands.. this needs to be properly explained.

I guess this is from the intel_pstate perspective only.

The active mode is only used with HWP, so intel_pstate doesn't look at
the utilization (in any form) in the passive mode today.

Still, there are other reasons for PELT to be scale-invariant, so ...
Juri Lelli May 16, 2018, 3:19 p.m. UTC | #4
On 15/05/18 21:49, Srinivas Pandruvada wrote:
> intel_pstate has two operating modes: active and passive. In "active"
> mode, the in-built scaling governor is used and in "passive" mode,
> the driver can be used with any governor like "schedutil". In "active"
> mode the utilization values from schedutil is not used and there is
> a requirement from high performance computing use cases, not to read
> any APERF/MPERF MSRs. In this case no need to use CPU cycles for
> frequency invariant accounting by reading APERF/MPERF MSRs.
> With this change frequency invariant account is only enabled in
> "passive" mode.
> 
> Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
> ---
> [Note: The tick will be enabled later in the series when hwp dynamic
> boost is enabled]
> 
>  drivers/cpufreq/intel_pstate.c | 5 +++++
>  1 file changed, 5 insertions(+)
> 
> diff --git a/drivers/cpufreq/intel_pstate.c b/drivers/cpufreq/intel_pstate.c
> index 17e566af..f686bbe 100644
> --- a/drivers/cpufreq/intel_pstate.c
> +++ b/drivers/cpufreq/intel_pstate.c
> @@ -2040,6 +2040,8 @@ static int intel_pstate_register_driver(struct cpufreq_driver *driver)
>  {
>  	int ret;
>  
> +	x86_arch_scale_freq_tick_disable();
> +
>  	memset(&global, 0, sizeof(global));
>  	global.max_perf_pct = 100;
>  
> @@ -2052,6 +2054,9 @@ static int intel_pstate_register_driver(struct cpufreq_driver *driver)
>  
>  	global.min_perf_pct = min_perf_pct_min();
>  
> +	if (driver == &intel_cpufreq)
> +		x86_arch_scale_freq_tick_enable();

This will unconditionally trigger the reading/calculation at each tick
even though the information is not actually consumed (e.g., when running
the performance or any other governor), right? Do we want that?

Anyway, FWIW I started testing this on an E5-2609 v3 and I'm not seeing
hackbench regressions so far (running with the schedutil governor).

Best,

- Juri
Peter Zijlstra May 16, 2018, 3:47 p.m. UTC | #5
On Wed, May 16, 2018 at 05:19:25PM +0200, Juri Lelli wrote:

> Anyway, FWIW I started testing this on a E5-2609 v3 and I'm not seeing
> hackbench regressions so far (running with schedutil governor).

https://en.wikipedia.org/wiki/Haswell_(microarchitecture)#Server_processors

Lists the E5 2609 v3 as not having turbo at all, which is basically a
best case scenario for this patch.

As I wrote earlier today; when turbo exists, like say the 2699, then
when we're busy we'll run at U=2.3/3.6 ~ .64, which might confuse
things.
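
Worked out with the E5-2699 v3's published frequencies (2.3 GHz base, 3.6 GHz
single-core max turbo), and assuming the invariance factor is normalised
against that max turbo:

    u_inv = u_raw * f_cur / f_max = 1.0 * 2.3 / 3.6 ~= 0.64

So a CPU that is 100% busy at its sustained frequency reports only ~64%
utilization, and a schedutil-style governor asking for f = 1.25 * u_inv * f_max
would request about 2.9 GHz rather than the top of the range.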
Srinivas Pandruvada May 16, 2018, 3:58 p.m. UTC | #6
On Wed, 2018-05-16 at 17:19 +0200, Juri Lelli wrote:
> On 15/05/18 21:49, Srinivas Pandruvada wrote:
> > intel_pstate has two operating modes: active and passive. In
> > "active"
> > mode, the in-built scaling governor is used and in "passive" mode,
> > the driver can be used with any governor like "schedutil". In
> > "active"
> > mode the utilization values from schedutil is not used and there is
> > a requirement from high performance computing use cases, not to
> > read
> > any APERF/MPERF MSRs. In this case no need to use CPU cycles for
> > frequency invariant accounting by reading APERF/MPERF MSRs.
> > With this change frequency invariant account is only enabled in
> > "passive" mode.
> > 
> > Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel
> > .com>
> > ---
> > [Note: The tick will be enabled later in the series when hwp
> > dynamic
> > boost is enabled]
> > 
> >  drivers/cpufreq/intel_pstate.c | 5 +++++
> >  1 file changed, 5 insertions(+)
> > 
> > diff --git a/drivers/cpufreq/intel_pstate.c
> > b/drivers/cpufreq/intel_pstate.c
> > index 17e566af..f686bbe 100644
> > --- a/drivers/cpufreq/intel_pstate.c
> > +++ b/drivers/cpufreq/intel_pstate.c
> > @@ -2040,6 +2040,8 @@ static int
> > intel_pstate_register_driver(struct cpufreq_driver *driver)
> >  {
> >  	int ret;
> >  
> > +	x86_arch_scale_freq_tick_disable();
> > +
> >  	memset(&global, 0, sizeof(global));
> >  	global.max_perf_pct = 100;
> >  
> > @@ -2052,6 +2054,9 @@ static int
> > intel_pstate_register_driver(struct cpufreq_driver *driver)
> >  
> >  	global.min_perf_pct = min_perf_pct_min();
> >  
> > +	if (driver == &intel_cpufreq)
> > +		x86_arch_scale_freq_tick_enable();
> 
> This will unconditionally trigger the reading/calculation at each
> tick
> even though information is not actually consumed (e.g., running
> performance or any other governor), right? Do we want that?
Good point. I should call x86_arch_scale_freq_tick_disable() when
switching to the performance governor in active mode.

Thanks,
Srinivas

> 
> Anyway, FWIW I started testing this on a E5-2609 v3 and I'm not
> seeing
> hackbench regressions so far (running with schedutil governor).
> 
> Best,
> 
> - Juri
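
A sketch of the conditional disable Srinivas describes above, reusing the
helper names from this series; the hook point and the exact condition are
assumptions here, not posted code:

    #include <linux/cpufreq.h>

    /*
     * Only sample APERF/MPERF on the tick when something will consume the
     * result: in active mode, skip it once the "performance" policy is
     * selected.
     */
    static void intel_pstate_update_tick_sampling(struct cpufreq_policy *policy,
                                                  bool active_mode)
    {
            if (active_mode && policy->policy == CPUFREQ_POLICY_PERFORMANCE)
                    x86_arch_scale_freq_tick_disable();
            else
                    x86_arch_scale_freq_tick_enable();
    }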
Juri Lelli May 16, 2018, 4:31 p.m. UTC | #7
On 16/05/18 17:47, Peter Zijlstra wrote:
> On Wed, May 16, 2018 at 05:19:25PM +0200, Juri Lelli wrote:
> 
> > Anyway, FWIW I started testing this on a E5-2609 v3 and I'm not seeing
> > hackbench regressions so far (running with schedutil governor).
> 
> https://en.wikipedia.org/wiki/Haswell_(microarchitecture)#Server_processors
> 
> Lists the E5 2609 v3 as not having turbo at all, which is basically a
> best case scenario for this patch.
> 
> As I wrote earlier today; when turbo exists, like say the 2699, then
> when we're busy we'll run at U=2.3/3.6 ~ .64, which might confuse
> things.

Indeed. I was mostly trying to see if adding this to the tick might
introduce noticeable overhead.
Srinivas Pandruvada May 16, 2018, 5:32 p.m. UTC | #8
On Wed, 2018-05-16 at 11:07 +0200, Rafael J. Wysocki wrote:
> On Wed, May 16, 2018 at 9:29 AM, Peter Zijlstra <peterz@infradead.org
> > wrote:
> > On Wed, May 16, 2018 at 09:16:40AM +0200, Peter Zijlstra wrote:
> > > On Tue, May 15, 2018 at 09:49:03PM -0700, Srinivas Pandruvada
> > > wrote:
> > > > intel_pstate has two operating modes: active and passive. In
> > > > "active"
> > > > mode, the in-built scaling governor is used and in "passive"
> > > > mode,
> > > > the driver can be used with any governor like "schedutil". In
> > > > "active"
> > > > mode the utilization values from schedutil is not used and
> > > > there is
> > > > a requirement from high performance computing use cases, not to
> > > > read
> > > > any APERF/MPERF MSRs. In this case no need to use CPU cycles
> > > > for
> > > > frequency invariant accounting by reading APERF/MPERF MSRs.
> > > > With this change frequency invariant account is only enabled in
> > > > "passive" mode.
> > > 
> > > WTH is active/passive? Is passive when we select performance
> > > governor?
> > 
> > Bah, I cannot read it seems. active is when we use the intel_pstate
> > governor and passive is when we use schedutil and only use
> > intel_pstate
> > as a driver.
> > 
> > > Also; you have to explain why using APERF/MPERF is bad in that
> > > case. Why
> > > do they care if we read those MSRs during the tick?
> > 
> > That still stands.. this needs to be properly explained.
> 
> I guess this is from the intel_pstate perspective only.
> 
> The active mode is only used with HWP, so intel_pstate doesn't look
> at
> the utilization (in any form) in the passive mode today.
> 
> Still, there are other reasons for PELT to be scale-invariant, so ...
I am not sure there is a use case in active mode other than the dynamic
HWP boost later in this series. If needed, I can drop this patch.
Juri Lelli May 17, 2018, 10:59 a.m. UTC | #9
On 16/05/18 18:31, Juri Lelli wrote:
> On 16/05/18 17:47, Peter Zijlstra wrote:
> > On Wed, May 16, 2018 at 05:19:25PM +0200, Juri Lelli wrote:
> > 
> > > Anyway, FWIW I started testing this on a E5-2609 v3 and I'm not seeing
> > > hackbench regressions so far (running with schedutil governor).
> > 
> > https://en.wikipedia.org/wiki/Haswell_(microarchitecture)#Server_processors
> > 
> > Lists the E5 2609 v3 as not having turbo at all, which is basically a
> > best case scenario for this patch.
> > 
> > As I wrote earlier today; when turbo exists, like say the 2699, then
> > when we're busy we'll run at U=2.3/3.6 ~ .64, which might confuse
> > things.
> 
> Indeed. I was mostly trying to see if adding this to the tick might
> introduce noticeable overhead.

Blindly testing on an i5-5200U (2.2/2.7 GHz) gave the following

# perf bench sched messaging --pipe --thread --group 2 --loop 20000

                      count       mean       std     min     50%       95%       99%     max
hostname kernel                                                                             
i5-5200U test_after    30.0  13.843433  0.590605  12.369  13.810  14.85635  15.08205  15.127
         test_before   30.0  13.571167  0.999798  12.228  13.302  15.57805  16.40029  16.690

It might be interesting to see what happens when using a single CPU
only?

Also, I will look at how the util signals look when a single CPU is
busy..
Juri Lelli May 17, 2018, 3:04 p.m. UTC | #10
On 17/05/18 12:59, Juri Lelli wrote:
> On 16/05/18 18:31, Juri Lelli wrote:
> > On 16/05/18 17:47, Peter Zijlstra wrote:
> > > On Wed, May 16, 2018 at 05:19:25PM +0200, Juri Lelli wrote:
> > > 
> > > > Anyway, FWIW I started testing this on a E5-2609 v3 and I'm not seeing
> > > > hackbench regressions so far (running with schedutil governor).
> > > 
> > > https://en.wikipedia.org/wiki/Haswell_(microarchitecture)#Server_processors
> > > 
> > > Lists the E5 2609 v3 as not having turbo at all, which is basically a
> > > best case scenario for this patch.
> > > 
> > > As I wrote earlier today; when turbo exists, like say the 2699, then
> > > when we're busy we'll run at U=2.3/3.6 ~ .64, which might confuse
> > > things.
> > 
> > Indeed. I was mostly trying to see if adding this to the tick might
> > introduce noticeable overhead.
> 
> Blindly testing on an i5-5200U (2.2/2.7 GHz) gave the following
> 
> # perf bench sched messaging --pipe --thread --group 2 --loop 20000
> 
>                       count       mean       std     min     50%       95%       99%     max
> hostname kernel                                                                             
> i5-5200U test_after    30.0  13.843433  0.590605  12.369  13.810  14.85635  15.08205  15.127
>          test_before   30.0  13.571167  0.999798  12.228  13.302  15.57805  16.40029  16.690
> 
> It might be interesting to see what happens when using a single CPU
> only?
> 
> Also, I will look at how the util signals look when a single CPU is
> busy..

And this is showing where the problem is (as you were saying [1]):

https://gist.github.com/jlelli/f5438221186e5ed3660194e4f645fe93

Just look at the plots (and ignore setup).

First one (pid:4483) shows a single task busy running on a single CPU,
which seems to be able to sustain turbo for 5 sec. So task util reaches
~1024.

Second one (pid:4283) shows the same task, but running together with
three other tasks (each one pinned to a different CPU). In this case util
saturates at ~943, which is due to the fact that max freq is still
considered to be the turbo one. :/

[1] https://marc.info/?l=linux-kernel&m=152646464017810&w=2
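
For what it's worth, the ~943 ceiling is consistent with the invariance factor
being normalised against the single-core max turbo: if the i5-5200U's 2-core
turbo bin is 2.5 GHz (an assumption here, not something stated in the thread),
then with all CPUs busy

    1024 * 2.5 / 2.7 ~= 948

which is right around the observed saturation point.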
Srinivas Pandruvada May 17, 2018, 3:41 p.m. UTC | #11
On Thu, 2018-05-17 at 17:04 +0200, Juri Lelli wrote:
> On 17/05/18 12:59, Juri Lelli wrote:
> > On 16/05/18 18:31, Juri Lelli wrote:
> > > On 16/05/18 17:47, Peter Zijlstra wrote:
> > > > On Wed, May 16, 2018 at 05:19:25PM +0200, Juri Lelli wrote:
> > > > 
> > > > > Anyway, FWIW I started testing this on a E5-2609 v3 and I'm
> > > > > not seeing
> > > > > hackbench regressions so far (running with schedutil
> > > > > governor).
> > > > 
> > > > https://en.wikipedia.org/wiki/Haswell_(microarchitecture)#Server_processors
> > > > 
> > > > Lists the E5 2609 v3 as not having turbo at all, which is
> > > > basically a
> > > > best case scenario for this patch.
> > > > 
> > > > As I wrote earlier today; when turbo exists, like say the 2699,
> > > > then
> > > > when we're busy we'll run at U=2.3/3.6 ~ .64, which might
> > > > confuse
> > > > things.
> > > 
> > > Indeed. I was mostly trying to see if adding this to the tick
> > > might
> > > introduce noticeable overhead.
> > 
> > Blindly testing on an i5-5200U (2.2/2.7 GHz) gave the following
> > 
> > # perf bench sched messaging --pipe --thread --group 2 --loop 20000
> > 
> >                       count       mean       std     min     50%       95%       99%     max
> > hostname kernel
> > i5-5200U test_after    30.0  13.843433  0.590605  12.369  13.810  14.85635  15.08205  15.127
> >          test_before   30.0  13.571167  0.999798  12.228  13.302  15.57805  16.40029  16.690
> > 
> > It might be interesting to see what happens when using a single CPU
> > only?
> > 
> > Also, I will look at how the util signals look when a single CPU is
> > busy..
> 
> And this is showing where the problem is (as you were saying [1]):
> 
> https://gist.github.com/jlelli/f5438221186e5ed3660194e4f645fe93
> 
> Just look at the plots (and ignore setup).
> 
> First one (pid:4483) shows a single task busy running on a single
> CPU,
> which seems to be able to sustain turbo for 5 sec. So task util
> reaches
> ~1024.
> 
> Second one (pid:4283) shows the same task, but running together with
> other 3 tasks (each one pinned to a different CPU). In this case util
> saturates at ~943, which is due to the fact that max freq is still
> considered to be the turbo one. :/


One more point to note. Even if we calculate some utilization based on
the freq-invariant accounting and arrive at a P-state, we will not be
able to control any P-state in the turbo region (not even as a cap) on
several Intel processors using the PERF_CTL MSRs.


> 
> [1] https://marc.info/?l=linux-kernel&m=152646464017810&w=2
Peter Zijlstra May 17, 2018, 4:16 p.m. UTC | #12
On Thu, May 17, 2018 at 08:41:32AM -0700, Srinivas Pandruvada wrote:
> One more point to note. Even if we calculate some utilization based on
> the freq-invariant and arrive at a P-state, we will not be able to
> control any P-state in turbo region (not even as a cap) on several
> Intel processors using PERF_CTL MSRs.

Right, but don't we need to set the PERF_CTL to max P in order to access
the turbo bins? So we still need to compute a P state, but as soon as we
reach max P, we're done.

And its not as if setting anything below max P is a firm setting either
anyway, its hints all the way down.
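
For reference, "setting PERF_CTL to max P" in intel_pstate's non-HWP path
boils down to writing the requested ratio into bits 15:8 of MSR_IA32_PERF_CTL.
A simplified sketch (no no-turbo bit or per-CPU limit handling):

    #include <asm/msr.h>
    #include <asm/msr-index.h>

    /*
     * Request the highest P-state; the hardware is then free to pick a
     * frequency anywhere up to and including the turbo bins.
     */
    static void request_max_pstate(unsigned int cpu, int max_pstate)
    {
            wrmsrl_on_cpu(cpu, MSR_IA32_PERF_CTL, (u64)max_pstate << 8);
    }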
Srinivas Pandruvada May 17, 2018, 4:42 p.m. UTC | #13
On Thu, 2018-05-17 at 18:16 +0200, Peter Zijlstra wrote:
> On Thu, May 17, 2018 at 08:41:32AM -0700, Srinivas Pandruvada wrote:
> > One more point to note. Even if we calculate some utilization based
> > on
> > the freq-invariant and arrive at a P-state, we will not be able to
> > control any P-state in turbo region (not even as a cap) on several
> > Intel processors using PERF_CTL MSRs.
> 
> Right, but don't we need to set the PERF_CTL to max P in order to
> access
> the turbo bins?
Any PERF_CTL setting above what we call the "turbo activation ratio" (which
can be less than the P1 read from the platform info MSR) will do that on
these systems (most client parts since Ivy Bridge).

>  So we still need to compute a P state, but as soon as we
> reach max P, we're done.
What will happen if we look at all core turbo as max and cap any
utilization above this to 1024?

> 
> And its not as if setting anything below max P is a firm setting
> either
> anyway, its hints all the way down.
Rafael J. Wysocki May 17, 2018, 4:56 p.m. UTC | #14
On Thu, May 17, 2018 at 6:42 PM, Srinivas Pandruvada
<srinivas.pandruvada@linux.intel.com> wrote:
> On Thu, 2018-05-17 at 18:16 +0200, Peter Zijlstra wrote:
>> On Thu, May 17, 2018 at 08:41:32AM -0700, Srinivas Pandruvada wrote:
>> > One more point to note. Even if we calculate some utilization based
>> > on
>> > the freq-invariant and arrive at a P-state, we will not be able to
>> > control any P-state in turbo region (not even as a cap) on several
>> > Intel processors using PERF_CTL MSRs.
>>
>> Right, but don't we need to set the PERF_CTL to max P in order to
>> access the turbo bins?
>
> Any PERF_CTL setting above what we call "Turbo Activation ratio" (which
> can be less than P1 read from platform info MSR) will do that on these
> systems (Most clients from Ivy bridge).
>
>>  So we still need to compute a P state, but as soon as we
>> reach max P, we're done.
>
> What will happen if we look at all core turbo as max and cap any
> utilization above this to 1024?

I was going to suggest that.

Otherwise, if we ever get (say) the max one-core turbo at any point
and the system only runs parallel workloads after that, it will appear
as underutilized.
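
A sketch of the capping idea under discussion, along the same lines as the
sketch under the commit message; the reference ratio simply becomes the
all-core turbo and anything above it is clamped (names are illustrative):

    /* all_core_turbo_ratio = f_all_core_turbo * 1024 / f_base (hypothetical) */
    static uint64_t freq_scale_capped(uint64_t aperf_delta, uint64_t mperf_delta,
                                      uint64_t all_core_turbo_ratio)
    {
            uint64_t scale;

            if (!mperf_delta)
                    return SCHED_CAPACITY_SCALE;

            scale = (aperf_delta * SCHED_CAPACITY_SCALE * SCHED_CAPACITY_SCALE) /
                    (mperf_delta * all_core_turbo_ratio);

            /*
             * Running in single-core turbo now yields >1024; clamp it, so a
             * fully busy CPU reads as u = 1 once it is at or above the
             * all-core turbo.
             */
            return scale > SCHED_CAPACITY_SCALE ? SCHED_CAPACITY_SCALE : scale;
    }

This is what makes a fully loaded parallel workload read as u = 1 (Rafael's
concern above), at the cost Peter raises next: u = 1 no longer distinguishes
"at the all-core turbo" from "at the absolute fastest".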
Peter Zijlstra May 17, 2018, 6:28 p.m. UTC | #15
On Thu, May 17, 2018 at 06:56:37PM +0200, Rafael J. Wysocki wrote:
> On Thu, May 17, 2018 at 6:42 PM, Srinivas Pandruvada

> > What will happen if we look at all core turbo as max and cap any
> > utilization above this to 1024?
> 
> I was going to suggest that.

The basic premise behind all our frequency scaling is that there's a
linear relation between utilization and frequency, where u=1 gets us the
fastest.

Now, we all know this is fairly crude, but it is what we work with.

OTOH, the whole premise of turbo is that you don't in fact know what the
fastest is, and in that respect setting u=1 at the guaranteed or
sustainable frequency makes sense.

The direct consequence of allowing clipping is that u=1 doesn't select
the highest frequency, but since we don't select anything anyway
(p-code does that for us) all we really need is to have u=1 above that
turbo activation point you mentioned.

For parts where we have to directly select frequency this obviously
comes apart.

However; what happens when the sustainable freq drops below our initial
'max'? Imagine us dropping below the all-core-turbo because of AVX. Then
we're back to running at u<1 at full tilt.

Or for mobile parts, the sustainable frequency could drop because of
severe thermal limits. Now I _think_ we have the possibility for getting
interrupts and reading the new guaranteed frequency, so we could
re-gauge.

So in theory I think it works, in practice we need to always be able to
find the actual max -- be it all-core turbo, AVX or thermal constrained
frequency. Can we do that in all cases?


I need to go back to see what the complaints against Vincent's proposal
were, because I really liked the fact that it did away with all this.
Rafael J. Wysocki May 18, 2018, 7:36 a.m. UTC | #16
On Thu, May 17, 2018 at 8:28 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Thu, May 17, 2018 at 06:56:37PM +0200, Rafael J. Wysocki wrote:
>> On Thu, May 17, 2018 at 6:42 PM, Srinivas Pandruvada
>
>> > What will happen if we look at all core turbo as max and cap any
>> > utilization above this to 1024?
>>
>> I was going to suggest that.
>
> To the basic premise behind all our frequency scaling is that there's a
> linear relation between utilization and frequency, where u=1 gets us the
> fastest.
>
> Now, we all know this is fairly crude, but it is what we work with.
>
> OTOH, the whole premise of turbo is that you don't in fact know what the
> fastest is, and in that respect setting u=1 at the guaranteed or
> sustainable frequency makes sense.
>
> The direct concequence of allowing clipping is that u=1 doesn't select
> the highest frequency, but since we don't select anything anyway
> (p-code does that for us) all we really need is to have u=1 above that
> turbo activation point you mentioned.
>
> For parts where we have to directly select frequency this obviously
> comes apart.
>
> However; what happens when the sustainable freq drops below our initial
> 'max'? Imagine us dropping below the all-core-turbo because of AVX. Then
> we're back to running at u<1 at full tilt.
>
> Or for mobile parts, the sustainable frequency could drop because of
> severe thermal limits. Now I _think_ we have the possibility for getting
> interrupts and reading the new guaranteed frequency, so we could
> re-guage.
>
> So in theory I think it works, in practise we need to always be able to
> find the actual max -- be it all-core turbo, AVX or thermal constrained
> frequency. Can we do that in all cases?

We should be, but unfortunately that's a dynamic thing.

For example, the AVX limit only kicks in when AVX instructions are executed.

> I need to go back to see what the complains against Vincent's proposal
> were, because I really liked the fact that it did away with all this.

That would be the best way to deal with this mess, I agree.
Patrick Bellasi May 18, 2018, 10:57 a.m. UTC | #17
On 17-May 20:28, Peter Zijlstra wrote:
> On Thu, May 17, 2018 at 06:56:37PM +0200, Rafael J. Wysocki wrote:
> > On Thu, May 17, 2018 at 6:42 PM, Srinivas Pandruvada
> 
> > > What will happen if we look at all core turbo as max and cap any
> > > utilization above this to 1024?
> > 
> > I was going to suggest that.
> 
> To the basic premise behind all our frequency scaling is that there's a
> linear relation between utilization and frequency, where u=1 gets us the
> fastest.
> 
> Now, we all know this is fairly crude, but it is what we work with.
> 
> OTOH, the whole premise of turbo is that you don't in fact know what the
> fastest is, and in that respect setting u=1 at the guaranteed or
> sustainable frequency makes sense.

Looking at it from the FAIR class standpoint, we can also argue that
although you know that the max possible utilization is 1024, you are
not always guaranteed to reach it, because of RT and interrupt pressure,
or, in big.LITTLE systems, because of the arch scaling factor.

Is it not something quite similar to the problem of having
   "some not always available OPPs"
?

To track these "capacity limitations" we already have the two
different concepts of cpu_capacity_orig and cpu_capacity.

Are not "thermal constraints" and "some not always available OPPs"
just another form of "capacity limitations".
They are:
 - transient
   exactly like RT and Interrupt pressure
 - HW related
   which is the main different wrt RT and Interrupt pressure

But, apart from this last point (i.e.they have an HW related "nature"),
IMHO they seems quite similar concept... which are already addresses,
although only within the FAIR class perhaps.

Thus, my simple (maybe dumb) questions are:
- why can't we just fold turbo boost frequency into the existing concepts?
- what are the limitations of such a "simple" approach?

IOW: utilization is always measured wrt the maximum possible capacity
(i.e. max turbo mode), and then there is a way to know what is, on each
CPU and at every decision time, the actual "transient maximum" we can
expect to reach for a "reasonable" future time.

> The direct concequence of allowing clipping is that u=1 doesn't select
> the highest frequency, but since we don't select anything anyway
> (p-code does that for us) all we really need is to have u=1 above that
> turbo activation point you mentioned.

If clipping means that we can also have >1024 values which are just
clamped at read/get time, could this maybe have some side-effects on the
math (signal propagation across task groups) and on type range control?

> For parts where we have to directly select frequency this obviously
> comes apart.

Moreover, utilization is not (and will not be) just for frequency driving.
We should also take the task placement perspective into account.

On that side, I personally like the definition _I think_ we have now:

  utilization is the amount of maximum capacity used

where maximum is a constant defined at boot time and representing the
absolute max you can expect to get...
... apart from "transient capacity limitations".

Scaling the maximum depending on these transient conditions reads, to me,
like "changing the scale", which I fear will make it more difficult, for
example, to compare in space (different CPUs) or in time (different
scheduler events) what a utilization measure means.

For example, if you have a busy loop running on a CPU which is subject
to RT pressure, you will read a <100% utilization (let's say 60%). Still,
it's interesting to know that maybe I can try to move that task to an
idle CPU to run it faster.

Should not be the same for turbo boost?

If the same task is generating only 60% utilization because turbo boost
OPPs are not available, would it still not be useful to see that there
is, for example, another CPU (maybe on a different NUMA node) which is
idle and cold, where we can move the task to exploit the 100% capacity
provided by the topmost turbo boost mode?

> However; what happens when the sustainable freq drops below our initial
> 'max'? Imagine us dropping below the all-core-turbo because of AVX. Then
> we're back to running at u<1 at full tilt.
> 
> Or for mobile parts, the sustainable frequency could drop because of
> severe thermal limits. Now I _think_ we have the possibility for getting
> interrupts and reading the new guaranteed frequency, so we could
> re-guage.
> 
> So in theory I think it works, in practise we need to always be able to
> find the actual max -- be it all-core turbo, AVX or thermal constrained
> frequency. Can we do that in all cases?
> 
> 
> I need to go back to see what the complains against Vincent's proposal
> were, because I really liked the fact that it did away with all this.

AFAIR Vincent's proposal was mainly addressing a different issue: fast
ramp-up... I don't recall if there was any specific intent to cover
the issue of "transient maximum capacities".

And still, based on my (maybe bogus) reasoning above, I think we are
discussing a slightly different problem here, one which already has a
(maybe partial) solution.
Peter Zijlstra May 18, 2018, 11:29 a.m. UTC | #18
On Fri, May 18, 2018 at 11:57:42AM +0100, Patrick Bellasi wrote:
> Thus, my simple (maybe dumb) questions are:
> - why can't we just fold turbo boost frequency into the existing concepts?
> - what are the limitations of such a "simple" approach?

Perhaps... but does this not further complicate the whole capacity vs
util thing we already have in say the misfit patches? And the
util_fits_capacity() thing from the EAS ones.

The thing is, we either need to dynamically scale the util or the
capacity or both. I think for Thermal there are patches out there that
drop the capacity.

But we'd then have to do the same for turbo/vector and all the other
stuff as well. Otherwise we risk things like running at low U with 0%
idle and not triggering the tipping point between eas and regular
balancing.

So either way around we need to know the 'true' max, either to fudge
util or to fudge capacity. And I'm not sure we can know in some of these
cases :/

And while Vincent's patches might have been inspired by another problem,
they do have the effect of always allowing util to go to 1, which is
nice for this.
Patrick Bellasi May 18, 2018, 1:33 p.m. UTC | #19
On 18-May 13:29, Peter Zijlstra wrote:
> On Fri, May 18, 2018 at 11:57:42AM +0100, Patrick Bellasi wrote:
> > Thus, my simple (maybe dumb) questions are:
> > - why can't we just fold turbo boost frequency into the existing concepts?
> > - what are the limitations of such a "simple" approach?
> 
> Perhaps... but does this not further complicate the whole capacity vs
> util thing we already have in say the misfit patches?

Not sure about that...

> And the  util_fits_capacity() thing from the EAS ones.

In this case instead, if we can track somehow (not saying we can)
what is the currently available "transient maximum capacity"...
then a util_fits_capacity() should just look at that.

If the transient capacity is already folded into cpu_capacity, as it
is now for RT and IRQ pressure, then likely we don't have to change
anything.

> The thing is, we either need to dynamically scale the util or the
> capacity or both. I think for Thermal there are patches out there that
> drop the capacity.

Not sure... but I would feel more comfortable with something which caps
the maximum capacity. Meaning you can eventually fill up the maximum
possible capacity only "up to" a given value, because of thermal or other
reasons most of the scheduler maybe doesn't even have to know about.

> But we'd then have to do the same for turbo/vector and all the other
> stuff as well. Otherwise we risk things like running at low U with 0%
> idle and not triggering the tipping point between eas and regular
> balancing.

Interacting with the tipping point and/or OPP changes is indeed an
interesting side of the problem I was not considering so far...

But again, could the tipping point not be defined as a delta
with respect to the "transient maximum capacity"?

> So either way around we need to know the 'true' max, either to fudge
> util or to fudge capacity.

Right, but what I see from a concepts standpoint is something like:

     +--+--+   cpu_capacity_orig (CONSTANT at boot time)
     |  |  |
     |  |  |       HW generated constraints
     |  v  |
     +-----+   cpu_capacity_max (depending on thermal/turbo boost)
     |  |  |
     |  |  |       SW generated constraints
     |  v  |
     +-----+   cpu_capacity (depending on RT/IRQ pressure)
     |  |  |
     |  |  |       tipping point delta
     +--v--+
     |     |   Energy Aware mode available capacity
     +-----+

Where all the wkp/lb heuristics are updated to properly consider the
cpu_capacity_max metric whenever they need to know the max speed we can
reach right now on a CPU.

> And I'm not sure we can know in some of these cases :/

Right, this scheme will eventually work only under the hypothesis that
"somehow" we can update cpu_capacity_max from HW events.

Not entirely sure that's possible and/or at which time granularity on
all different platforms.

> And while Vincent's patches might have been inspired by another problem,
> they do have the effect of always allowing util to go to 1, which is
> nice for this.

Sure, that's a nice point, but still I have the feeling that always
reaching u=1 can defeat other interesting properties of a task.
For example, comparing task requirements on different CPUs and/or at
different times, which plays a big role in energy-aware task
placement decisions.
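
A sketch of how the "transient maximum capacity" level could be consumed by
the fits-capacity style checks mentioned above; cpu_capacity_max() is
hypothetical, and the 1280/1024 (~80%) margin mirrors the capacity_margin
used by the scheduler's capacity checks at the time, but none of this is
posted code:

    /*
     * Task fits if its utilization leaves ~20% headroom below the capacity
     * the CPU can actually deliver right now (thermal/turbo constrained),
     * rather than below the boot-time cpu_capacity_orig.
     */
    static inline bool task_fits_transient_max(unsigned long task_util, int cpu)
    {
            return cpu_capacity_max(cpu) * 1024 > task_util * 1280;
    }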
Valentin Schneider May 18, 2018, 2:09 p.m. UTC | #20
Hi,

On 18/05/18 12:29, Peter Zijlstra wrote:
> On Fri, May 18, 2018 at 11:57:42AM +0100, Patrick Bellasi wrote:
>> Thus, my simple (maybe dumb) questions are:
>> - why can't we just fold turbo boost frequency into the existing concepts?
>> - what are the limitations of such a "simple" approach?
> 
> Perhaps... but does this not further complicate the whole capacity vs
> util thing we already have in say the misfit patches?

What do you mean by that ? Bear in mind, I'm a complete stranger to turbo
boost so I fail to see why, as Patrick puts it, it can't fit within the
existing concepts (i.e. max util is 1024 but is only reachable when turbo
boosted).
Patrick Bellasi May 30, 2018, 4:57 p.m. UTC | #21
Hi Peter,
maybe you missed this previous response of mine:
   20180518133353.GO30654@e110439-lin
?

I would like to have your thoughts on the concept of "transient maximum
capacity" I was describing...

On 18-May 14:33, Patrick Bellasi wrote:
> On 18-May 13:29, Peter Zijlstra wrote:
> > On Fri, May 18, 2018 at 11:57:42AM +0100, Patrick Bellasi wrote:
> > > Thus, my simple (maybe dumb) questions are:
> > > - why can't we just fold turbo boost frequency into the existing concepts?
> > > - what are the limitations of such a "simple" approach?
> > 
> > Perhaps... but does this not further complicate the whole capacity vs
> > util thing we already have in say the misfit patches?
> 
> Not sure about that...
> 
> > And the  util_fits_capacity() thing from the EAS ones.
> 
> In this case instead, if we can track somehow (not saying we can)
> what is the currently available "transient maximum capacity"...
> then a util_fits_capacity() should just look at that.
> 
> If the transient capacity is already folded into cpu_capacity, as it
> is now for RT and IRQ pressure, then likely we don't have to change
> anything.
> 
> > The thing is, we either need to dynamically scale the util or the
> > capacity or both. I think for Thermal there are patches out there that
> > drop the capacity.
> 
> Not sure... but I would feel more comfortable by something which caps
> the maximum capacity. Meaning, eventually you can fill up the maximum
> possible capacity only "up to" a given value, because of thermal or other
> reasons most of the scheduler maybe doesn't even have to know why?
> 
> > But we'd then have to do the same for turbo/vector and all the other
> > stuff as well. Otherwise we risk things like running at low U with 0%
> > idle and not triggering the tipping point between eas and regular
> > balancing.
> 
> Interacting with the tipping point and/or OPP changes is indeed an
> interesting side of the problem I was not considering so far...
> 
> But again, the tipping point could not be defined as a delta
> with respect to the "transient maximum capacity" ?
> 
> > So either way around we need to know the 'true' max, either to fudge
> > util or to fudge capacity.
> 
> Right, but what I see from a concepts standpoint is something like:
> 
>      +--+--+   cpu_capacity_orig (CONSTANT at boot time)
>      |  |  |
>      |  |  |       HW generated constraints
>      |  v  |
>      +-----+   cpu_capacity_max (depending on thermal/turbo boost)
>      |  |  |
>      |  |  |       SW generated constraints
>      |  v  |
>      +-----+   cpu_capacity (depending on RT/IRQ pressure)
>      |  |  |
>      |  |  |       tipping point delta
>      +--v--+
>      |     |   Energy Aware mode available capacity
>      +-----+
> 
> Where all the wkp/lb heuristics are updated to properly consider the
> cpu_capacity_max metrics whenever it comes to know what is the max
> speed we can reach now on a CPU.
> 
> > And I'm not sure we can know in some of these cases :/
> 
> Right, this schema will eventually work only under the hypothesis that
> "somehow" we can update cpu_capacity_max from HW events.
> 
> Not entirely sure that's possible and/or at which time granularity on
> all different platforms.
> 
> > And while Vincent's patches might have been inspired by another problem,
> > they do have the effect of always allowing util to go to 1, which is
> > nice for this.
> 
> Sure, that's a nice point, but still I have the feeling that always
> reaching u=1 can defeat other interesting properties of a task,
> For example, comparing task requirements in different CPUs and/or at
> different times, which plays a big role for energy aware task
> placement decisions.
> 
> -- 
> #include <best/regards.h>
> 
> Patrick Bellasi
diff mbox

Patch

diff --git a/drivers/cpufreq/intel_pstate.c b/drivers/cpufreq/intel_pstate.c
index 17e566af..f686bbe 100644
--- a/drivers/cpufreq/intel_pstate.c
+++ b/drivers/cpufreq/intel_pstate.c
@@ -2040,6 +2040,8 @@  static int intel_pstate_register_driver(struct cpufreq_driver *driver)
 {
 	int ret;
 
+	x86_arch_scale_freq_tick_disable();
+
 	memset(&global, 0, sizeof(global));
 	global.max_perf_pct = 100;
 
@@ -2052,6 +2054,9 @@  static int intel_pstate_register_driver(struct cpufreq_driver *driver)
 
 	global.min_perf_pct = min_perf_pct_min();
 
+	if (driver == &intel_cpufreq)
+		x86_arch_scale_freq_tick_enable();
+
 	return 0;
 }