diff mbox series

[2/5] thermal: devfreq_cooling: get a copy of device status

Message ID 20200921122007.29610-3-lukasz.luba@arm.com (mailing list archive)
State New, archived
Delegated to: Daniel Lezcano
Headers show
Series Thermal devfreq cooling improvements with Energy Model | expand

Commit Message

Lukasz Luba Sept. 21, 2020, 12:20 p.m. UTC
Devfreq cooling needs to now the correct status of the device in order
to operate. Do not rely on Devfreq last_status which might be a stale data
and get more up-to-date values of the load.

Devfreq framework can change the device status in the background. To
mitigate this situation make a copy of the status structure and use it
for internal calculations.

In addition this patch adds normalization function, which also makes sure
that whatever data comes from the device, it is in a sane range.

Signed-off-by: Lukasz Luba <lukasz.luba@arm.com>
---
 drivers/thermal/devfreq_cooling.c | 52 +++++++++++++++++++++++++------
 1 file changed, 43 insertions(+), 9 deletions(-)

Comments

Ionela Voinescu Oct. 7, 2020, 4:11 p.m. UTC | #1
On Monday 21 Sep 2020 at 13:20:04 (+0100), Lukasz Luba wrote:
> Devfreq cooling needs to now the correct status of the device in order
> to operate. Do not rely on Devfreq last_status which might be a stale data
> and get more up-to-date values of the load.
> 
> Devfreq framework can change the device status in the background. To
> mitigate this situation make a copy of the status structure and use it
> for internal calculations.
> 
> In addition this patch adds normalization function, which also makes sure
> that whatever data comes from the device, it is in a sane range.
> 
> Signed-off-by: Lukasz Luba <lukasz.luba@arm.com>
> ---
>  drivers/thermal/devfreq_cooling.c | 52 +++++++++++++++++++++++++------
>  1 file changed, 43 insertions(+), 9 deletions(-)
> 
> diff --git a/drivers/thermal/devfreq_cooling.c b/drivers/thermal/devfreq_cooling.c
> index 7063ccb7b86d..cf045bd4d16b 100644
> --- a/drivers/thermal/devfreq_cooling.c
> +++ b/drivers/thermal/devfreq_cooling.c
> @@ -227,6 +227,24 @@ static inline unsigned long get_total_power(struct devfreq_cooling_device *dfc,
>  							       voltage);
>  }
>  
> +static void _normalize_load(struct devfreq_dev_status *status)

Is there a reason for the leading "_" ?
AFAIK, "__name()" is meant to suggest a "worker" function for another
"name()" function, but that would not apply here.

> +{
> +	/* Make some space if needed */
> +	if (status->busy_time > 0xffff) {
> +		status->busy_time >>= 10;
> +		status->total_time >>= 10;
> +	}

How about removing the above code and adding here:

status->busy_time = status->busy_time ? : 1;

> +
> +	if (status->busy_time > status->total_time)

This check would then cover the possibility that total_time is 0.

> +		status->busy_time = status->total_time;

But a reversal is needed here:
		status->total_time = status->busy_time;

> +
> +	status->busy_time *= 100;
> +	status->busy_time /= status->total_time ? : 1;
> +
> +	/* Avoid division by 0 */
> +	status->busy_time = status->busy_time ? : 1;
> +	status->total_time = 100;

Then all of this code can be replaced by:

status->busy_time = (unsigned long)div64_u64((u64)status->busy_time << 10,
					     status->total_time);
status->total_time = 1 << 10;

This way you gain some resolution to busy_time and the divisions in the
callers would just become shifts by 10.

Hope it helps,
Ionela.
Daniel Lezcano Oct. 14, 2020, 2:34 p.m. UTC | #2
On 21/09/2020 14:20, Lukasz Luba wrote:
> Devfreq cooling needs to now the correct status of the device in order
> to operate. Do not rely on Devfreq last_status which might be a stale data
> and get more up-to-date values of the load.
> 
> Devfreq framework can change the device status in the background. To
> mitigate this situation make a copy of the status structure and use it
> for internal calculations.
> 
> In addition this patch adds normalization function, which also makes sure
> that whatever data comes from the device, it is in a sane range.
> 
> Signed-off-by: Lukasz Luba <lukasz.luba@arm.com>
> ---
>  drivers/thermal/devfreq_cooling.c | 52 +++++++++++++++++++++++++------
>  1 file changed, 43 insertions(+), 9 deletions(-)
> 
> diff --git a/drivers/thermal/devfreq_cooling.c b/drivers/thermal/devfreq_cooling.c
> index 7063ccb7b86d..cf045bd4d16b 100644
> --- a/drivers/thermal/devfreq_cooling.c
> +++ b/drivers/thermal/devfreq_cooling.c
> @@ -227,6 +227,24 @@ static inline unsigned long get_total_power(struct devfreq_cooling_device *dfc,
>  							       voltage);
>  }
>  
> +static void _normalize_load(struct devfreq_dev_status *status)
> +{
> +	/* Make some space if needed */
> +	if (status->busy_time > 0xffff) {
> +		status->busy_time >>= 10;
> +		status->total_time >>= 10;
> +	}
> +
> +	if (status->busy_time > status->total_time)
> +		status->busy_time = status->total_time;
> +
> +	status->busy_time *= 100;
> +	status->busy_time /= status->total_time ? : 1;
> +
> +	/* Avoid division by 0 */
> +	status->busy_time = status->busy_time ? : 1;
> +	status->total_time = 100;
> +}

Not sure that works if the devfreq governor is not on-demand.

Is it possible to use the energy model directly in devfreq to compute
the energy consumption given the state transitions since the last reading ?

The power will be read directly from devfreq which will return:

nrj + (current_power * (jiffies - last_update)) / (jiffies - last_update)

The devfreq cooling device driver would result in a much simpler code, no?

>  static int devfreq_cooling_get_requested_power(struct thermal_cooling_device *cdev,
>  					       struct thermal_zone_device *tz,
> @@ -234,14 +252,22 @@ static int devfreq_cooling_get_requested_power(struct thermal_cooling_device *cd
>  {
>  	struct devfreq_cooling_device *dfc = cdev->devdata;
>  	struct devfreq *df = dfc->devfreq;
> -	struct devfreq_dev_status *status = &df->last_status;
> +	struct devfreq_dev_status status;
>  	unsigned long state;
> -	unsigned long freq = status->current_frequency;
> +	unsigned long freq;
>  	unsigned long voltage;
>  	u32 dyn_power = 0;
>  	u32 static_power = 0;
>  	int res;
>  
> +	mutex_lock(&df->lock);
> +	res = df->profile->get_dev_status(df->dev.parent, &status);
> +	mutex_unlock(&df->lock);
> +	if (res)
> +		return res;
> +
> +	freq = status.current_frequency;
> +
>  	state = freq_get_state(dfc, freq);
>  	if (state == THERMAL_CSTATE_INVALID) {
>  		res = -EAGAIN;
> @@ -269,16 +295,18 @@ static int devfreq_cooling_get_requested_power(struct thermal_cooling_device *cd
>  	} else {
>  		dyn_power = dfc->power_table[state];
>  
> +		_normalize_load(&status);
> +
>  		/* Scale dynamic power for utilization */
> -		dyn_power *= status->busy_time;
> -		dyn_power /= status->total_time;
> +		dyn_power *= status.busy_time;
> +		dyn_power /= status.total_time;
>  		/* Get static power */
>  		static_power = get_static_power(dfc, freq);
>  
>  		*power = dyn_power + static_power;
>  	}
>  
> -	trace_thermal_power_devfreq_get_power(cdev, status, freq, *power);
> +	trace_thermal_power_devfreq_get_power(cdev, &status, freq, *power);
>  
>  	return 0;
>  fail:
> @@ -312,14 +340,20 @@ static int devfreq_cooling_power2state(struct thermal_cooling_device *cdev,
>  {
>  	struct devfreq_cooling_device *dfc = cdev->devdata;
>  	struct devfreq *df = dfc->devfreq;
> -	struct devfreq_dev_status *status = &df->last_status;
> -	unsigned long freq = status->current_frequency;
> +	struct devfreq_dev_status status;
>  	unsigned long busy_time;
> +	unsigned long freq;
>  	s32 dyn_power;
>  	u32 static_power;
>  	s32 est_power;
>  	int i;
>  
> +	mutex_lock(&df->lock);
> +	status = df->last_status;
> +	mutex_unlock(&df->lock);
> +
> +	freq = status.current_frequency;
> +
>  	if (dfc->power_ops->get_real_power) {
>  		/* Scale for resource utilization */
>  		est_power = power * dfc->res_util;
> @@ -331,8 +365,8 @@ static int devfreq_cooling_power2state(struct thermal_cooling_device *cdev,
>  		dyn_power = dyn_power > 0 ? dyn_power : 0;
>  
>  		/* Scale dynamic power for utilization */
> -		busy_time = status->busy_time ?: 1;
> -		est_power = (dyn_power * status->total_time) / busy_time;
> +		busy_time = status.busy_time ?: 1;
> +		est_power = (dyn_power * status.total_time) / busy_time;
>  	}
>  
>  	/*
>
Lukasz Luba Oct. 22, 2020, 10:55 a.m. UTC | #3
Hi Ionela,


On 10/7/20 5:11 PM, Ionela Voinescu wrote:
> On Monday 21 Sep 2020 at 13:20:04 (+0100), Lukasz Luba wrote:
>> Devfreq cooling needs to now the correct status of the device in order
>> to operate. Do not rely on Devfreq last_status which might be a stale data
>> and get more up-to-date values of the load.
>>
>> Devfreq framework can change the device status in the background. To
>> mitigate this situation make a copy of the status structure and use it
>> for internal calculations.
>>
>> In addition this patch adds normalization function, which also makes sure
>> that whatever data comes from the device, it is in a sane range.
>>
>> Signed-off-by: Lukasz Luba <lukasz.luba@arm.com>
>> ---
>>   drivers/thermal/devfreq_cooling.c | 52 +++++++++++++++++++++++++------
>>   1 file changed, 43 insertions(+), 9 deletions(-)
>>
>> diff --git a/drivers/thermal/devfreq_cooling.c b/drivers/thermal/devfreq_cooling.c
>> index 7063ccb7b86d..cf045bd4d16b 100644
>> --- a/drivers/thermal/devfreq_cooling.c
>> +++ b/drivers/thermal/devfreq_cooling.c
>> @@ -227,6 +227,24 @@ static inline unsigned long get_total_power(struct devfreq_cooling_device *dfc,
>>   							       voltage);
>>   }
>>   
>> +static void _normalize_load(struct devfreq_dev_status *status)
> 
> Is there a reason for the leading "_" ?
> AFAIK, "__name()" is meant to suggest a "worker" function for another
> "name()" function, but that would not apply here.

It is just a local name. Check e.g. ./drivers/opp/core.c there is a few:
_generic_set_opp_regulator(), _generic_set_opp_clk_only(),
_get_opp_count(), _find_opp_table(), _set_opp_bw(), etc.

It is just a shorter name for me, '_' means here locality.
Instead of calling it devfreq_cooling_normalize_load().

> 
>> +{
>> +	/* Make some space if needed */
>> +	if (status->busy_time > 0xffff) {
>> +		status->busy_time >>= 10;
>> +		status->total_time >>= 10;
>> +	}
> 
> How about removing the above code and adding here:
> 
> status->busy_time = status->busy_time ? : 1;

It's not equivalent. The code operates on raw device values, which
might be big (e.g. read from counters). If it's lager than the 0xffff,
it is going to be shifted to get smaller.

> 
>> +
>> +	if (status->busy_time > status->total_time)
> 
> This check would then cover the possibility that total_time is 0.
> 
>> +		status->busy_time = status->total_time;
> 
> But a reversal is needed here:
> 		status->total_time = status->busy_time;

No, I want to clamp the busy_time, which should not be bigger that
total time. It could happen when we deal with 'raw' values from device
counters.

> 
>> +
>> +	status->busy_time *= 100;
>> +	status->busy_time /= status->total_time ? : 1;
>> +
>> +	/* Avoid division by 0 */
>> +	status->busy_time = status->busy_time ? : 1;
>> +	status->total_time = 100;
> 
> Then all of this code can be replaced by:
> 
> status->busy_time = (unsigned long)div64_u64((u64)status->busy_time << 10,
> 					     status->total_time);
> status->total_time = 1 << 10;

No, the total_time closed to 'unsigned long' would overflow.

> 
> This way you gain some resolution to busy_time and the divisions in the
> callers would just become shifts by 10.


I don't want to gain more resolution here. I want to be prepare for raw
(not processed yet) big values coming from driver.

Regards,
Lukasz

> 
> Hope it helps,
> Ionela.
>
Lukasz Luba Oct. 22, 2020, 11:45 a.m. UTC | #4
Hi Daniel,

On 10/14/20 3:34 PM, Daniel Lezcano wrote:
> On 21/09/2020 14:20, Lukasz Luba wrote:
>> Devfreq cooling needs to now the correct status of the device in order
>> to operate. Do not rely on Devfreq last_status which might be a stale data
>> and get more up-to-date values of the load.
>>
>> Devfreq framework can change the device status in the background. To
>> mitigate this situation make a copy of the status structure and use it
>> for internal calculations.
>>
>> In addition this patch adds normalization function, which also makes sure
>> that whatever data comes from the device, it is in a sane range.
>>
>> Signed-off-by: Lukasz Luba <lukasz.luba@arm.com>
>> ---
>>   drivers/thermal/devfreq_cooling.c | 52 +++++++++++++++++++++++++------
>>   1 file changed, 43 insertions(+), 9 deletions(-)
>>
>> diff --git a/drivers/thermal/devfreq_cooling.c b/drivers/thermal/devfreq_cooling.c
>> index 7063ccb7b86d..cf045bd4d16b 100644
>> --- a/drivers/thermal/devfreq_cooling.c
>> +++ b/drivers/thermal/devfreq_cooling.c
>> @@ -227,6 +227,24 @@ static inline unsigned long get_total_power(struct devfreq_cooling_device *dfc,
>>   							       voltage);
>>   }
>>   
>> +static void _normalize_load(struct devfreq_dev_status *status)
>> +{
>> +	/* Make some space if needed */
>> +	if (status->busy_time > 0xffff) {
>> +		status->busy_time >>= 10;
>> +		status->total_time >>= 10;
>> +	}
>> +
>> +	if (status->busy_time > status->total_time)
>> +		status->busy_time = status->total_time;
>> +
>> +	status->busy_time *= 100;
>> +	status->busy_time /= status->total_time ? : 1;
>> +
>> +	/* Avoid division by 0 */
>> +	status->busy_time = status->busy_time ? : 1;
>> +	status->total_time = 100;
>> +}
> 
> Not sure that works if the devfreq governor is not on-demand.
> 
> Is it possible to use the energy model directly in devfreq to compute
> the energy consumption given the state transitions since the last reading ?

This change is actually trying to create a safety net for what we do.

In the original code we take the last_state directly:
-	struct devfreq_dev_status *status = &df->last_status;

Then we simply multiply by 'busy_time' and div by 'total_time',
without checks... The values might be huge or zero, etc.
The _normalize_load() introduces this safety.

Apart from that, just simply taking a pointer to &df->last_status does
not protect us from:
- working on a struct which might be modified at the same time in
   background - not safe
- that struct might not be updated by long time, because devfreq
   didn't check it for a long (there are two polling modes in devfreq)

So taking a mutex and then a trigger the device status check and
make a copy of newest data. It is more safe.

I think this can be treated as a fix, not a feature.

> 
> The power will be read directly from devfreq which will return:
> 
> nrj + (current_power * (jiffies - last_update)) / (jiffies - last_update)
> 
> The devfreq cooling device driver would result in a much simpler code, no?

This is something that I would like to address after the EM changes are
merged. It would be the next step, how to estimate the power by taking
into consideration more information. This patch series just tries to
make it possible to use EM. The model improvements would be next.

Thank you Daniel for your review.

Regards,
Lukasz
Ionela Voinescu Dec. 1, 2020, 10:36 a.m. UTC | #5
Hi,

Sorry for the delay and for the noise on this older version. I first
want to understand the code better.

On Thursday 22 Oct 2020 at 11:55:28 (+0100), Lukasz Luba wrote:
[..]
> 
> > 
> > > +{
> > > +	/* Make some space if needed */
> > > +	if (status->busy_time > 0xffff) {
> > > +		status->busy_time >>= 10;
> > > +		status->total_time >>= 10;
> > > +	}
> > 
> > How about removing the above code and adding here:
> > 
> > status->busy_time = status->busy_time ? : 1;
> 
> It's not equivalent. The code operates on raw device values, which
> might be big (e.g. read from counters). If it's lager than the 0xffff,
> it is going to be shifted to get smaller.
> 

Yes, the big values are handled below through the division and by making
total_time = 1024. These two initial checks are only to cover the
possibility for busy_time and total_time being 0, or busy_time >
total_time.

> > 
> > > +
> > > +	if (status->busy_time > status->total_time)
> > 
> > This check would then cover the possibility that total_time is 0.
> > 
> > > +		status->busy_time = status->total_time;
> > 
> > But a reversal is needed here:
> > 		status->total_time = status->busy_time;
> 
> No, I want to clamp the busy_time, which should not be bigger that
> total time. It could happen when we deal with 'raw' values from device
> counters.
> 

Yes, I understand. But isn't making total_time = busy_time accomplishing
the same thing?

> > 
> > > +
> > > +	status->busy_time *= 100;
> > > +	status->busy_time /= status->total_time ? : 1;
> > > +
> > > +	/* Avoid division by 0 */
> > > +	status->busy_time = status->busy_time ? : 1;
> > > +	status->total_time = 100;
> > 
> > Then all of this code can be replaced by:
> > 
> > status->busy_time = (unsigned long)div64_u64((u64)status->busy_time << 10,
> > 					     status->total_time);
> > status->total_time = 1 << 10;
> 
> No, the total_time closed to 'unsigned long' would overflow.
> 

I'm not sure I understand. total_time gets a value of 1024, it's not
itself shifted by 10.

> > 
> > This way you gain some resolution to busy_time and the divisions in the
> > callers would just become shifts by 10.
> 
> 
> I don't want to gain more resolution here. I want to be prepare for raw
> (not processed yet) big values coming from driver.
>

Agreed! The higher resolution is an extra benefit. The more important
benefit is that, through my suggestion, you'd be replacing all future
divisions by shifts.

Thanks,
Ionela.

> Regards,
> Lukasz
> 
> > 
> > Hope it helps,
> > Ionela.
> >
Lukasz Luba Dec. 1, 2020, 12:19 p.m. UTC | #6
On 12/1/20 10:36 AM, Ionela Voinescu wrote:
> Hi,
> 
> Sorry for the delay and for the noise on this older version. I first
> want to understand the code better.
> 
> On Thursday 22 Oct 2020 at 11:55:28 (+0100), Lukasz Luba wrote:
> [..]
>>
>>>
>>>> +{
>>>> +	/* Make some space if needed */
>>>> +	if (status->busy_time > 0xffff) {
>>>> +		status->busy_time >>= 10;
>>>> +		status->total_time >>= 10;
>>>> +	}
>>>
>>> How about removing the above code and adding here:
>>>
>>> status->busy_time = status->busy_time ? : 1;
>>
>> It's not equivalent. The code operates on raw device values, which
>> might be big (e.g. read from counters). If it's lager than the 0xffff,
>> it is going to be shifted to get smaller.
>>
> 
> Yes, the big values are handled below through the division and by making
> total_time = 1024. These two initial checks are only to cover the
> possibility for busy_time and total_time being 0, or busy_time >
> total_time.
> 
>>>
>>>> +
>>>> +	if (status->busy_time > status->total_time)
>>>
>>> This check would then cover the possibility that total_time is 0.
>>>
>>>> +		status->busy_time = status->total_time;
>>>
>>> But a reversal is needed here:
>>> 		status->total_time = status->busy_time;
>>
>> No, I want to clamp the busy_time, which should not be bigger that
>> total time. It could happen when we deal with 'raw' values from device
>> counters.
>>
> 
> Yes, I understand. But isn't making total_time = busy_time accomplishing
> the same thing?
> 
>>>
>>>> +
>>>> +	status->busy_time *= 100;
>>>> +	status->busy_time /= status->total_time ? : 1;
>>>> +
>>>> +	/* Avoid division by 0 */
>>>> +	status->busy_time = status->busy_time ? : 1;
>>>> +	status->total_time = 100;
>>>
>>> Then all of this code can be replaced by:
>>>
>>> status->busy_time = (unsigned long)div64_u64((u64)status->busy_time << 10,
>>> 					     status->total_time);
>>> status->total_time = 1 << 10;
>>
>> No, the total_time closed to 'unsigned long' would overflow.
>>
> 
> I'm not sure I understand. total_time gets a value of 1024, it's not
> itself shifted by 10.
> 
>>>
>>> This way you gain some resolution to busy_time and the divisions in the
>>> callers would just become shifts by 10.
>>
>>
>> I don't want to gain more resolution here. I want to be prepare for raw
>> (not processed yet) big values coming from driver.
>>
> 
> Agreed! The higher resolution is an extra benefit. The more important
> benefit is that, through my suggestion, you'd be replacing all future
> divisions by shifts.

You have probably missed some bits.
I don't see benefits, you have div64_u64() which is heavy on 32bit CPUs.

Then, what is the range of these values:
busy_time [0, 1024], total_time 1024 in your case.
These values are used for estimating power in two cases:
1. in devfreq_cooling_get_requested_power()
	est_power = power * busy_time / total_time
2. in devfreq_cooling_power2state():
	est_power = power * total_time / busy_time

As you can see above, the est_power values could overflow if total_time,
busy_time are raw values (like in old implementation). So normalize them
into 'some' scale. That was the motivation ('scale' motivation below).

In your case you cannot avoid division in 2. use case, because busy_time
can be any value in range [0, 1024].
We could avoid the division in 1. use case, but load in cpufreq cooling
is also in range of [0, 100], so this devfreq cooling is aligned. I
would like to avoid situation when someone is parsing the traces
and these two devices present different load scale.

I will think about better 'devfreq utilization' (as also Daniel
suggested)in future, but first this EM must be in mainline and cpufreq
cooling changes made by Viresh also there.
But it would be more then just scale change to [0, 1024]...

Regards,
Lukasz


> 
> Thanks,
> Ionela.
> 
>> Regards,
>> Lukasz
>>
>>>
>>> Hope it helps,
>>> Ionela.
>>>
Ionela Voinescu Dec. 1, 2020, 2:55 p.m. UTC | #7
On Tuesday 01 Dec 2020 at 12:19:18 (+0000), Lukasz Luba wrote:
> 
> 
> On 12/1/20 10:36 AM, Ionela Voinescu wrote:
> > Hi,
> > 
> > Sorry for the delay and for the noise on this older version. I first
> > want to understand the code better.
> > 
> > On Thursday 22 Oct 2020 at 11:55:28 (+0100), Lukasz Luba wrote:
> > [..]
> > > 
> > > > 
> > > > > +{
> > > > > +	/* Make some space if needed */
> > > > > +	if (status->busy_time > 0xffff) {
> > > > > +		status->busy_time >>= 10;
> > > > > +		status->total_time >>= 10;
> > > > > +	}
> > > > 
> > > > How about removing the above code and adding here:
> > > > 
> > > > status->busy_time = status->busy_time ? : 1;
> > > 
> > > It's not equivalent. The code operates on raw device values, which
> > > might be big (e.g. read from counters). If it's lager than the 0xffff,
> > > it is going to be shifted to get smaller.
> > > 
> > 
> > Yes, the big values are handled below through the division and by making
> > total_time = 1024. These two initial checks are only to cover the
> > possibility for busy_time and total_time being 0, or busy_time >
> > total_time.
> > 
> > > > 
> > > > > +
> > > > > +	if (status->busy_time > status->total_time)
> > > > 
> > > > This check would then cover the possibility that total_time is 0.
> > > > 
> > > > > +		status->busy_time = status->total_time;
> > > > 
> > > > But a reversal is needed here:
> > > > 		status->total_time = status->busy_time;
> > > 
> > > No, I want to clamp the busy_time, which should not be bigger that
> > > total time. It could happen when we deal with 'raw' values from device
> > > counters.
> > > 
> > 
> > Yes, I understand. But isn't making total_time = busy_time accomplishing
> > the same thing?
> > 
> > > > 
> > > > > +
> > > > > +	status->busy_time *= 100;
> > > > > +	status->busy_time /= status->total_time ? : 1;
> > > > > +
> > > > > +	/* Avoid division by 0 */
> > > > > +	status->busy_time = status->busy_time ? : 1;
> > > > > +	status->total_time = 100;
> > > > 
> > > > Then all of this code can be replaced by:
> > > > 
> > > > status->busy_time = (unsigned long)div64_u64((u64)status->busy_time << 10,
> > > > 					     status->total_time);
> > > > status->total_time = 1 << 10;
> > > 
> > > No, the total_time closed to 'unsigned long' would overflow.
> > > 
> > 
> > I'm not sure I understand. total_time gets a value of 1024, it's not
> > itself shifted by 10.
> > 
> > > > 
> > > > This way you gain some resolution to busy_time and the divisions in the
> > > > callers would just become shifts by 10.
> > > 
> > > 
> > > I don't want to gain more resolution here. I want to be prepare for raw
> > > (not processed yet) big values coming from driver.
> > > 
> > 
> > Agreed! The higher resolution is an extra benefit. The more important
> > benefit is that, through my suggestion, you'd be replacing all future
> > divisions by shifts.
> 
> You have probably missed some bits.
> I don't see benefits, you have div64_u64() which is heavy on 32bit CPUs.
> 
> Then, what is the range of these values:
> busy_time [0, 1024], total_time 1024 in your case.
> These values are used for estimating power in two cases:
> 1. in devfreq_cooling_get_requested_power()
> 	est_power = power * busy_time / total_time
> 2. in devfreq_cooling_power2state():
> 	est_power = power * total_time / busy_time
> 
> As you can see above, the est_power values could overflow if total_time,
> busy_time are raw values (like in old implementation). So normalize them
> into 'some' scale. That was the motivation ('scale' motivation below).
> 

Agreed! I do think scaling is necessary, but in my mind the [0, 1024] scale
made more sense.

> In your case you cannot avoid division in 2. use case, because busy_time
> can be any value in range [0, 1024].
> We could avoid the division in 1. use case, but load in cpufreq cooling
> is also in range of [0, 100], so this devfreq cooling is aligned. I
> would like to avoid situation when someone is parsing the traces
> and these two devices present different load scale.
> 

Got it! Looking through the code I did overlook that 2 was reversed.

> I will think about better 'devfreq utilization' (as also Daniel
> suggested)in future, but first this EM must be in mainline and cpufreq
> cooling changes made by Viresh also there.
> But it would be more then just scale change to [0, 1024]...
> 

Okay, looking forward to this. It would be nice to align all of these
utilization metrics in the future for all kinds of devices.

Thanks,
Ionela.
diff mbox series

Patch

diff --git a/drivers/thermal/devfreq_cooling.c b/drivers/thermal/devfreq_cooling.c
index 7063ccb7b86d..cf045bd4d16b 100644
--- a/drivers/thermal/devfreq_cooling.c
+++ b/drivers/thermal/devfreq_cooling.c
@@ -227,6 +227,24 @@  static inline unsigned long get_total_power(struct devfreq_cooling_device *dfc,
 							       voltage);
 }
 
+static void _normalize_load(struct devfreq_dev_status *status)
+{
+	/* Make some space if needed */
+	if (status->busy_time > 0xffff) {
+		status->busy_time >>= 10;
+		status->total_time >>= 10;
+	}
+
+	if (status->busy_time > status->total_time)
+		status->busy_time = status->total_time;
+
+	status->busy_time *= 100;
+	status->busy_time /= status->total_time ? : 1;
+
+	/* Avoid division by 0 */
+	status->busy_time = status->busy_time ? : 1;
+	status->total_time = 100;
+}
 
 static int devfreq_cooling_get_requested_power(struct thermal_cooling_device *cdev,
 					       struct thermal_zone_device *tz,
@@ -234,14 +252,22 @@  static int devfreq_cooling_get_requested_power(struct thermal_cooling_device *cd
 {
 	struct devfreq_cooling_device *dfc = cdev->devdata;
 	struct devfreq *df = dfc->devfreq;
-	struct devfreq_dev_status *status = &df->last_status;
+	struct devfreq_dev_status status;
 	unsigned long state;
-	unsigned long freq = status->current_frequency;
+	unsigned long freq;
 	unsigned long voltage;
 	u32 dyn_power = 0;
 	u32 static_power = 0;
 	int res;
 
+	mutex_lock(&df->lock);
+	res = df->profile->get_dev_status(df->dev.parent, &status);
+	mutex_unlock(&df->lock);
+	if (res)
+		return res;
+
+	freq = status.current_frequency;
+
 	state = freq_get_state(dfc, freq);
 	if (state == THERMAL_CSTATE_INVALID) {
 		res = -EAGAIN;
@@ -269,16 +295,18 @@  static int devfreq_cooling_get_requested_power(struct thermal_cooling_device *cd
 	} else {
 		dyn_power = dfc->power_table[state];
 
+		_normalize_load(&status);
+
 		/* Scale dynamic power for utilization */
-		dyn_power *= status->busy_time;
-		dyn_power /= status->total_time;
+		dyn_power *= status.busy_time;
+		dyn_power /= status.total_time;
 		/* Get static power */
 		static_power = get_static_power(dfc, freq);
 
 		*power = dyn_power + static_power;
 	}
 
-	trace_thermal_power_devfreq_get_power(cdev, status, freq, *power);
+	trace_thermal_power_devfreq_get_power(cdev, &status, freq, *power);
 
 	return 0;
 fail:
@@ -312,14 +340,20 @@  static int devfreq_cooling_power2state(struct thermal_cooling_device *cdev,
 {
 	struct devfreq_cooling_device *dfc = cdev->devdata;
 	struct devfreq *df = dfc->devfreq;
-	struct devfreq_dev_status *status = &df->last_status;
-	unsigned long freq = status->current_frequency;
+	struct devfreq_dev_status status;
 	unsigned long busy_time;
+	unsigned long freq;
 	s32 dyn_power;
 	u32 static_power;
 	s32 est_power;
 	int i;
 
+	mutex_lock(&df->lock);
+	status = df->last_status;
+	mutex_unlock(&df->lock);
+
+	freq = status.current_frequency;
+
 	if (dfc->power_ops->get_real_power) {
 		/* Scale for resource utilization */
 		est_power = power * dfc->res_util;
@@ -331,8 +365,8 @@  static int devfreq_cooling_power2state(struct thermal_cooling_device *cdev,
 		dyn_power = dyn_power > 0 ? dyn_power : 0;
 
 		/* Scale dynamic power for utilization */
-		busy_time = status->busy_time ?: 1;
-		est_power = (dyn_power * status->total_time) / busy_time;
+		busy_time = status.busy_time ?: 1;
+		est_power = (dyn_power * status.total_time) / busy_time;
 	}
 
 	/*