
[V5,2/3] cpufreq: intel_pstate: account for non C0 time

Message ID 1449165359-25832-5-git-send-email-philippe.longepe@linux.intel.com (mailing list archive)
State Superseded, archived
Headers show

Commit Message

plongepe Dec. 3, 2015, 5:55 p.m. UTC
The current function to calculate busyness uses the ratio of actual
performance to max non turbo state requested during the last sample
period. This causes overestimation of busyness which results in higher
power consumption. This is a problem for low power systems.
The algorithm here uses cpu busyness to select next target P state.
The Percent Busy (or load) can be estimated as the ratio of the mperf
counter running at a constant frequency only during active periods (C0)
and the time stamp counter running at the same frequency but also during
idle. So,
Percent Busy = 100 * (delta_mperf / delta_tsc)
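For illustration (hypothetical numbers): if MPERF advances by 6,000,000 ticks
while the TSC advances by 24,000,000 ticks over a sample period, the load is
100 * 6/24 = 25%.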
Use this algorithm for platforms with SoCs based on airmont and silvermont
cores.

Signed-off-by: Philippe Longepe <philippe.longepe@linux.intel.com>
Signed-off-by: Stephane Gasparini <stephane.gasparini@linux.intel.com>
---
 drivers/cpufreq/intel_pstate.c | 25 +++++++++++++++++++++++--
 1 file changed, 23 insertions(+), 2 deletions(-)

Comments

Rafael J. Wysocki Dec. 4, 2015, 2:12 a.m. UTC | #1
On Thursday, December 03, 2015 06:55:56 PM Philippe Longepe wrote:
> The current function to calculate busyness uses the ratio of actual

"Utilization" would be a better word here.

> performance to max non turbo state requested during the last sample
> period.

What it really does is to use the measured percentage of busy cycles
scaled by the ratio of the current P-state to the max available
non-turbo one.

> This causes overestimation of busyness which results in higher
> power consumption.

"This leads to an overestimation of utilization which causes higher-performance
P-states to be selected more often and that leads to increased energy
consumption."  [Power can't be consumed.]

> This is a problem for low power systems.

"This is a problem for low-power systems, so it is better to use a different
utilization calculation algorithm for them."

> The algorithm here uses cpu busyness to select next target P state.
> The Percent Busy (or load) can be estimated as the ratio of the mperf
> counter running at a constant frequency only during active periods (C0)
> and the time stamp counter running at the same frequency but also during
> idle.

"Namely, the Percent Busy value (or load) can be estimated as the ratio of the
MPERF counter that runs at a constant rate only during active periods (C0) to
the time stamp counter (TSC) that also runs (at the same rate) during idle. 
That is"

> Percent Busy = 100 * (delta_mperf / delta_tsc)

> Use this algorithm for platforms with SoCs based on airmont and silvermont
> cores.

"based on the Airmont and Silvermont Atom cores".

> 
> Signed-off-by: Philippe Longepe <philippe.longepe@linux.intel.com>
> Signed-off-by: Stephane Gasparini <stephane.gasparini@linux.intel.com>

Is Stephane a co-author of the patch?  Or if not, then what's his role?

> ---
>  drivers/cpufreq/intel_pstate.c | 25 +++++++++++++++++++++++--
>  1 file changed, 23 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/cpufreq/intel_pstate.c b/drivers/cpufreq/intel_pstate.c
> index c2553a1..2cf8bb6 100644
> --- a/drivers/cpufreq/intel_pstate.c
> +++ b/drivers/cpufreq/intel_pstate.c
> @@ -143,6 +143,7 @@ struct cpu_defaults {
>  };
>  
>  static inline int32_t get_target_pstate_use_performance(struct cpudata *cpu);
> +static inline int32_t get_target_pstate_use_cpu_load(struct cpudata *cpu);
>  
>  static struct pstate_adjust_policy pid_params;
>  static struct pstate_funcs pstate_funcs;
> @@ -763,7 +764,7 @@ static struct cpu_defaults silvermont_params = {
>  		.set = atom_set_pstate,
>  		.get_scaling = silvermont_get_scaling,
>  		.get_vid = atom_get_vid,
> -		.get_target_pstate = get_target_pstate_use_performance,
> +		.get_target_pstate = get_target_pstate_use_cpu_load,
>  	},
>  };
>  
> @@ -784,7 +785,7 @@ static struct cpu_defaults airmont_params = {
>  		.set = atom_set_pstate,
>  		.get_scaling = airmont_get_scaling,
>  		.get_vid = atom_get_vid,
> -		.get_target_pstate = get_target_pstate_use_performance,
> +		.get_target_pstate = get_target_pstate_use_cpu_load,
>  	},
>  };
>  
> @@ -930,6 +931,26 @@ static inline void intel_pstate_set_sample_time(struct cpudata *cpu)
>  	mod_timer_pinned(&cpu->timer, jiffies + delay);
>  }
>  
> +static inline int32_t get_target_pstate_use_cpu_load(struct cpudata *cpu)

The function name might have been better, but well.

> +{
> +	struct sample *sample = &cpu->sample;
> +	int32_t cpu_load;
> +
> +	/*
> +	 * The load can be estimated as the ratio of the mperf counter
> +	 * running at a constant frequency during active periods
> +	 * (C0) and the time stamp counter running at the same frequency
> +	 * also during C-states.
> +	 */
> +	cpu_load = div64_u64(100 * sample->mperf, sample->tsc);
> +
> +	cpu->sample.busy_scaled = int_tofp(cpu_load);

This is somewhat questionable.

Why do you convert the result of an integer calculation to fixed point
instead of doing the calculation in fixed point in the first place?
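
For example, a minimal sketch of an all-fixed-point variant (reusing the
int_tofp() and div64_u64() helpers the patch already uses; the exact form is
only illustrative):

	/* keep the whole calculation in fixed point */
	cpu->sample.busy_scaled = div64_u64(int_tofp(100) * sample->mperf,
					    sample->tsc);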

> +
> +	return (cpu->pstate.current_pstate - pid_calc(&cpu->pid,
> +						cpu->sample.busy_scaled));
> +}
> +
> +
>  static inline int32_t get_target_pstate_use_performance(struct cpudata *cpu)
>  {
>  	int32_t core_busy, max_pstate, current_pstate, sample_ratio;
> 

Thanks,
Rafael

plongepe Dec. 4, 2015, 2:33 p.m. UTC | #2
Thank you Rafael,

I'll try to send a new patch set taking your remarks into account for
this series before tonight.

Also, do you think we could have a division by zero in some corner case
when prev_tsc == tsc?

Normally it should never happen, but just in case I think we should
complete the following test:

if ((cpu->prev_mperf == mperf) || (cpu->prev_tsc == tsc)) {
...
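
A minimal sketch of such a guard (its placement in the sampling path is only
illustrative) could be:

	/* hypothetical: skip this sample if neither counter advanced */
	if (cpu->prev_mperf == mperf || cpu->prev_tsc == tsc)
		return;	/* keep the previous sample and try again next period */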

Thx & Regards

Philippe


srinivas pandruvada Dec. 4, 2015, 5:39 p.m. UTC | #3
On Fri, 2015-12-04 at 15:33 +0100, Philippe Longepe wrote:
> Thank you Rafael,
> 
> I'll try to send a new patch set taking your remarks into account for
> this series before tonight.
> 
> Also, do you think we could have a division by zero in some corner case
> when prev_tsc == tsc?
> 
> Normally it should never happen, but just in case I think we should
> complete the following test:
> 
> if ((cpu->prev_mperf == mperf) || (cpu->prev_tsc == tsc)) {
> ...
On the Knights Landing platform the first one is possible and the second
one should never happen. But this function will never be called for such
platforms.

srinivas pandruvada Dec. 4, 2015, 10:48 p.m. UTC | #4
On Sat, 2015-12-05 at 00:01 +0100, Rafael J. Wysocki wrote:
> On Friday, December 04, 2015 09:39:03 AM Srinivas Pandruvada wrote:
> > On Fri, 2015-12-04 at 15:33 +0100, Philippe Longepe wrote:
> > > if ((cpu->prev_mperf == mperf) || (cpu->prev_tsc == tsc)) {
> > > ...
> > On the Knights Landing platform the first one is possible and the second
> > one should never happen. But this function will never be called for such
> > platforms.
> 
> Still, in case there's a future platform where those counters don't advance
> for extended time intervals, would it really hurt to add a check for that
> and skip the sample when it triggers?
Doesn't hurt.

Thanks,
Srinivas

> 
> Thanks,
> Rafael
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-pm" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


Rafael J. Wysocki Dec. 4, 2015, 11:01 p.m. UTC | #5
On Friday, December 04, 2015 09:39:03 AM Srinivas Pandruvada wrote:
> On Fri, 2015-12-04 at 15:33 +0100, Philippe Longepe wrote:
> > if ((cpu->prev_mperf == mperf) || (cpu->prev_tsc == tsc)) {
> > ...
> On the Knights Landing platform the first one is possible and the second
> one should never happen. But this function will never be called for such
> platforms.

Still, in case there's a future platform where those counters don't advance
for extended time intervals, would it really hurt to add a check for that
and skip the sample when it triggers?

Thanks,
Rafael

Rafael J. Wysocki Dec. 4, 2015, 11:16 p.m. UTC | #6
On Fri, Dec 4, 2015 at 11:48 PM, Srinivas Pandruvada
<srinivas.pandruvada@linux.intel.com> wrote:
> On Sat, 2015-12-05 at 00:01 +0100, Rafael J. Wysocki wrote:
>> On Friday, December 04, 2015 09:39:03 AM Srinivas Pandruvada wrote:
>> > On Fri, 2015-12-04 at 15:33 +0100, Philippe Longepe wrote:
>> > > if ((cpu->prev_mperf == mperf) || (cpu->prev_tsc == tsc)) {
>> > > ...
>> > On the Knights Landing platform the first one is possible and the second
>> > one should never happen. But this function will never be called for such
>> > platforms.
>>
>> Still, in case there's a future platform where those counters don't advance
>> for extended time intervals, would it really hurt to add a check for that
>> and skip the sample when it triggers?
> Doesn't hurt.

OK

Maybe checking the TSC would be overkill as there would be other
significant problems if it didn't advance, but IMO it would be good to
have a sanity check for the MPERF in there.
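
Something along those lines (a sketch only, placement illustrative) would do:

	/* hypothetical: sanity-check MPERF only; a stuck TSC is a far bigger problem */
	if (cpu->prev_mperf == mperf)
		return;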

Thanks,
Rafael

Patch

diff --git a/drivers/cpufreq/intel_pstate.c b/drivers/cpufreq/intel_pstate.c
index c2553a1..2cf8bb6 100644
--- a/drivers/cpufreq/intel_pstate.c
+++ b/drivers/cpufreq/intel_pstate.c
@@ -143,6 +143,7 @@  struct cpu_defaults {
 };
 
 static inline int32_t get_target_pstate_use_performance(struct cpudata *cpu);
+static inline int32_t get_target_pstate_use_cpu_load(struct cpudata *cpu);
 
 static struct pstate_adjust_policy pid_params;
 static struct pstate_funcs pstate_funcs;
@@ -763,7 +764,7 @@  static struct cpu_defaults silvermont_params = {
 		.set = atom_set_pstate,
 		.get_scaling = silvermont_get_scaling,
 		.get_vid = atom_get_vid,
-		.get_target_pstate = get_target_pstate_use_performance,
+		.get_target_pstate = get_target_pstate_use_cpu_load,
 	},
 };
 
@@ -784,7 +785,7 @@  static struct cpu_defaults airmont_params = {
 		.set = atom_set_pstate,
 		.get_scaling = airmont_get_scaling,
 		.get_vid = atom_get_vid,
-		.get_target_pstate = get_target_pstate_use_performance,
+		.get_target_pstate = get_target_pstate_use_cpu_load,
 	},
 };
 
@@ -930,6 +931,26 @@  static inline void intel_pstate_set_sample_time(struct cpudata *cpu)
 	mod_timer_pinned(&cpu->timer, jiffies + delay);
 }
 
+static inline int32_t get_target_pstate_use_cpu_load(struct cpudata *cpu)
+{
+	struct sample *sample = &cpu->sample;
+	int32_t cpu_load;
+
+	/*
+	 * The load can be estimated as the ratio of the mperf counter
+	 * running at a constant frequency during active periods
+	 * (C0) and the time stamp counter running at the same frequency
+	 * also during C-states.
+	 */
+	cpu_load = div64_u64(100 * sample->mperf, sample->tsc);
+
+	cpu->sample.busy_scaled = int_tofp(cpu_load);
+
+	return (cpu->pstate.current_pstate - pid_calc(&cpu->pid,
+						cpu->sample.busy_scaled));
+}
+
+
 static inline int32_t get_target_pstate_use_performance(struct cpudata *cpu)
 {
 	int32_t core_busy, max_pstate, current_pstate, sample_ratio;