[5/5] powercap/drivers/dtpm: Scale the power with the load

Message ID 20210301212149.22877-5-daniel.lezcano@linaro.org (mailing list archive)
State Superseded, archived
Series [1/5] powercap/drivers/dtpm: Encapsulate even more the code

Commit Message

Daniel Lezcano March 1, 2021, 9:21 p.m. UTC
Currently the power consumption is based on the current OPP power,
assuming the entire performance domain is fully loaded.

That gives a very coarse power estimate, and we can do much better by
using the load to scale the power consumption.

Use the utilization to normalize and scale the power usage over the
maximum possible power.

Tested on a rock960 with 2 big CPUs; the estimated power consumption
matches the expected one.

Before this change:

~$ ~/dhrystone -t 1 -l 10000&
~$ cat /sys/devices/virtual/powercap/dtpm/dtpm:0/dtpm:0:1/constraint_0_max_power_uw
2260000

After this change:

~$ ~/dhrystone -t 1 -l 10000&
~$ cat /sys/devices/virtual/powercap/dtpm/dtpm:0/dtpm:0:1/constraint_0_max_power_uw
1130000

~$ ~/dhrystone -t 2 -l 10000&
~$ cat /sys/devices/virtual/powercap/dtpm/dtpm:0/dtpm:0:1/constraint_0_max_power_uw
2260000

Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
---
 drivers/powercap/dtpm_cpu.c | 21 +++++++++++++++++----
 1 file changed, 17 insertions(+), 4 deletions(-)

Comments

Lukasz Luba March 9, 2021, 10:01 a.m. UTC | #1
Hi Daniel,

I've started reviewing the series, please find some comments below.

On 3/1/21 9:21 PM, Daniel Lezcano wrote:
> Currently the power consumption is based on the current OPP power
> assuming the entire performance domain is fully loaded.
> 
> That gives very gross power estimation and we can do much better by
> using the load to scale the power consumption.
> 
> Use the utilization to normalize and scale the power usage over the
> max possible power.
> 
> Tested on a rock960 with 2 big CPUS, the power consumption estimation
> conforms with the expected one.
> 
> Before this change:
> 
> ~$ ~/dhrystone -t 1 -l 10000&
> ~$ cat /sys/devices/virtual/powercap/dtpm/dtpm:0/dtpm:0:1/constraint_0_max_power_uw
> 2260000
> 
> After this change:
> 
> ~$ ~/dhrystone -t 1 -l 10000&
> ~$ cat /sys/devices/virtual/powercap/dtpm/dtpm:0/dtpm:0:1/constraint_0_max_power_uw
> 1130000
> 
> ~$ ~/dhrystone -t 2 -l 10000&
> ~$ cat /sys/devices/virtual/powercap/dtpm/dtpm:0/dtpm:0:1/constraint_0_max_power_uw
> 2260000
> 
> Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
> ---
>   drivers/powercap/dtpm_cpu.c | 21 +++++++++++++++++----
>   1 file changed, 17 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/powercap/dtpm_cpu.c b/drivers/powercap/dtpm_cpu.c
> index e728ebd6d0ca..8379b96468ef 100644
> --- a/drivers/powercap/dtpm_cpu.c
> +++ b/drivers/powercap/dtpm_cpu.c
> @@ -68,27 +68,40 @@ static u64 set_pd_power_limit(struct dtpm *dtpm, u64 power_limit)
>   	return power_limit;
>   }
>   
> +static u64 scale_pd_power_uw(struct cpumask *cpus, u64 power)

Rename 'cpus' to 'pd_mask', see below.

> +{
> +	unsigned long max, util;
> +	int cpu, load = 0;

IMHO 'int load' looks odd next to 'util' and 'max'.
I would declare it on the line above so they all share the same type,
and rename it to 'sum_util'.

> +
> +	for_each_cpu(cpu, cpus) {

I would avoid the temporary CPU mask in get_pd_power_uw()
by using this modified loop:

for_each_cpu_and(cpu, pd_mask, cpu_online_mask) {


> +		max = arch_scale_cpu_capacity(cpu);
> +		util = sched_cpu_util(cpu, max);
> +		load += ((util * 100) / max);

Below you can find 3 optimizations. Since we are not in the hot
path here, it's up to you whether to use all or some of them,
or just ignore them.

1st optimization.
If we use 'load += (util << 10) / max' in the loop, then
we could avoid div by 100 and use a right shift:
(power * load) >> 10

2nd optimization.
Since we use the EM CPU mask, which spans all CPUs with the same
arch_scale_cpu_capacity(), you can avoid N divs inside the loop
and do a single one below the loop.

3rd optimization.
If we simply add all 'util' values into 'sum_util' (no mul or div in
the loop), then we could just have a simple macro:

#define CALC_POWER_USAGE(power, sum_util, max) \
	(((power * (sum_util << 10)) / max) >> 10)


> +	}
> +
> +	return (power * load) / 100;
> +}
> +
>   static u64 get_pd_power_uw(struct dtpm *dtpm)
>   {
>   	struct dtpm_cpu *dtpm_cpu = to_dtpm_cpu(dtpm);
>   	struct em_perf_domain *pd;
>   	struct cpumask cpus;

Since we don't need 'nr_cpus', we also don't need the
cpumask, which occupies stack space. Maybe use
	struct cpumask *pd_mask;

then

>   	unsigned long freq;
> -	int i, nr_cpus;
> +	int i;
>   
>   	pd = em_cpu_get(dtpm_cpu->cpu);
>   	freq = cpufreq_quick_get(dtpm_cpu->cpu);
>   
>   	cpumask_and(&cpus, cpu_online_mask, to_cpumask(pd->cpus));

	remove ^^^^^ and set
	pd_mask = em_span_cpus(pd);

> -	nr_cpus = cpumask_weight(&cpus);
>   
>   	for (i = 0; i < pd->nr_perf_states; i++) {
>   
>   		if (pd->table[i].frequency < freq)
>   			continue;
>   
> -		return pd->table[i].power *
> -			MICROWATT_PER_MILLIWATT * nr_cpus;
> +		return scale_pd_power_uw(&cpus, pd->table[i].power *
> +					 MICROWATT_PER_MILLIWATT);

Instead of '&cpus' I would put 'pd_mask' and that should do the job.

>   	}
>   
>   	return 0;
> 

Apart from that, the design idea with util looks good.

Regards,
Lukasz
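
The three optimizations suggested above can be compared in a small userspace model. This is only a sketch under stated assumptions: the helper names (scale_percent, scale_shift, scale_sum_util) are hypothetical, the utilization values are passed in as a plain array instead of being read through sched_cpu_util(), and every CPU is assumed to share the same 'max' capacity, as the review states for an EM performance domain.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Original patch: per-CPU percentage, one division per CPU in the loop. */
static uint64_t scale_percent(const unsigned long *util, size_t nr_cpus,
			      unsigned long max, uint64_t power)
{
	unsigned long load = 0;

	assert(max > 0);
	for (size_t i = 0; i < nr_cpus; i++)
		load += (util[i] * 100) / max;

	return (power * load) / 100;
}

/* 1st optimization: scale by 1024 so the final division becomes a shift. */
static uint64_t scale_shift(const unsigned long *util, size_t nr_cpus,
			    unsigned long max, uint64_t power)
{
	unsigned long load = 0;

	assert(max > 0);
	for (size_t i = 0; i < nr_cpus; i++)
		load += (util[i] << 10) / max;

	return (power * load) >> 10;
}

/* 2nd and 3rd optimizations: only sum in the loop, divide once afterwards. */
static uint64_t scale_sum_util(const unsigned long *util, size_t nr_cpus,
			       unsigned long max, uint64_t power)
{
	uint64_t sum_util = 0;

	assert(max > 0);
	for (size_t i = 0; i < nr_cpus; i++)
		sum_util += util[i];

	return ((power * (sum_util << 10)) / max) >> 10;
}
```

With a per-CPU power of 1130000 uW and max = 1024, one fully loaded CPU out of two gives 1130000 in all three variants, and two fully loaded CPUs give 2260000, which matches the figures in the commit message. For utilizations of 300 on both CPUs, the percentage variant returns 655400 while the two 1024-based variants return 662109, so the fixed-point forms also lose less precision in this model.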
Daniel Lezcano March 9, 2021, 7:03 p.m. UTC | #2
Hi Lukasz,

thanks for your comments, one question below.

On 09/03/2021 11:01, Lukasz Luba wrote:

[ ... ]

>>   +static u64 scale_pd_power_uw(struct cpumask *cpus, u64 power)
> 
> renamed 'cpus' into 'pd_mask', see below
> 
>> +{
>> +    unsigned long max, util;
>> +    int cpu, load = 0;
> 
> IMHO 'int load' looks odd when used with 'util' and 'max'.
> I would put in the line above to have them all the same type and
> renamed to 'sum_util'.
> 
>> +
>> +    for_each_cpu(cpu, cpus) {
> 
> I would avoid the temporary CPU mask in the get_pd_power_uw()
> with this modified loop:
> 
> for_each_cpu_and(cpu, pd_mask, cpu_online_mask) {
> 
> 
>> +        max = arch_scale_cpu_capacity(cpu);
>> +        util = sched_cpu_util(cpu, max);
>> +        load += ((util * 100) / max);
> 
> Below you can find 3 optimizations. Since we are not in the hot
> path here, it's up to if you would like to use all/some of them
> or just ignore.
> 
> 1st optimization.
> If we use 'load += (util << 10) / max' in the loop, then
> we could avoid div by 100 and use a right shift:
> (power * load) >> 10
> 
> 2nd optimization.
> Since we use EM CPU mask, which span all CPUs with the same
> arch_scale_cpu_capacity(), you can avoid N divs inside the loop
> and do it once, below the loop.
> 
> 3rd optimization.
> If we just simply add all 'util' into 'sum_util' (no mul or div in
> the loop), then we might just have simple macro
> 
> #define CALC_POWER_USAGE(power, sum_util, max) \
>     (((power * (sum_util << 10)) / max) >> 10)

I don't understand the 'max' division; I was expecting something
like: ((sum_util << 10) / sum_max) >> 10

no ?
Daniel Lezcano March 9, 2021, 7:22 p.m. UTC | #3
On 09/03/2021 11:01, Lukasz Luba wrote:
> Hi Daniel,
> 
> I've started reviewing the series, please find some comments below.
> 
> On 3/1/21 9:21 PM, Daniel Lezcano wrote:
>> Currently the power consumption is based on the current OPP power
>> assuming the entire performance domain is fully loaded.
>>
>> That gives very gross power estimation and we can do much better by
>> using the load to scale the power consumption.
>>
>> Use the utilization to normalize and scale the power usage over the
>> max possible power.
>>
>> Tested on a rock960 with 2 big CPUS, the power consumption estimation
>> conforms with the expected one.
>>
>> Before this change:
>>
>> ~$ ~/dhrystone -t 1 -l 10000&
>> ~$ cat
>> /sys/devices/virtual/powercap/dtpm/dtpm:0/dtpm:0:1/constraint_0_max_power_uw
>>
>> 2260000
>>
>> After this change:
>>
>> ~$ ~/dhrystone -t 1 -l 10000&
>> ~$ cat
>> /sys/devices/virtual/powercap/dtpm/dtpm:0/dtpm:0:1/constraint_0_max_power_uw
>>
>> 1130000
>>
>> ~$ ~/dhrystone -t 2 -l 10000&
>> ~$ cat
>> /sys/devices/virtual/powercap/dtpm/dtpm:0/dtpm:0:1/constraint_0_max_power_uw
>>
>> 2260000
>>
>> Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
>> ---
>>   drivers/powercap/dtpm_cpu.c | 21 +++++++++++++++++----
>>   1 file changed, 17 insertions(+), 4 deletions(-)
>>
>> diff --git a/drivers/powercap/dtpm_cpu.c b/drivers/powercap/dtpm_cpu.c
>> index e728ebd6d0ca..8379b96468ef 100644
>> --- a/drivers/powercap/dtpm_cpu.c
>> +++ b/drivers/powercap/dtpm_cpu.c
>> @@ -68,27 +68,40 @@ static u64 set_pd_power_limit(struct dtpm *dtpm,
>> u64 power_limit)
>>       return power_limit;
>>   }
>>   +static u64 scale_pd_power_uw(struct cpumask *cpus, u64 power)
> 
> renamed 'cpus' into 'pd_mask', see below
> 
>> +{
>> +    unsigned long max, util;
>> +    int cpu, load = 0;
> 
> IMHO 'int load' looks odd when used with 'util' and 'max'.
> I would put in the line above to have them all the same type and
> renamed to 'sum_util'.
> 
>> +
>> +    for_each_cpu(cpu, cpus) {
> 
> I would avoid the temporary CPU mask in the get_pd_power_uw()
> with this modified loop:
> 
> for_each_cpu_and(cpu, pd_mask, cpu_online_mask) {
> 
> 
>> +        max = arch_scale_cpu_capacity(cpu);
>> +        util = sched_cpu_util(cpu, max);
>> +        load += ((util * 100) / max);
> 
> Below you can find 3 optimizations. Since we are not in the hot
> path here, it's up to if you would like to use all/some of them
> or just ignore.
> 
> 1st optimization.
> If we use 'load += (util << 10) / max' in the loop, then
> we could avoid div by 100 and use a right shift:
> (power * load) >> 10
> 
> 2nd optimization.
> Since we use EM CPU mask, which span all CPUs with the same
> arch_scale_cpu_capacity(), you can avoid N divs inside the loop
> and do it once, below the loop.
> 
> 3rd optimization.
> If we just simply add all 'util' into 'sum_util' (no mul or div in
> the loop), then we might just have simple macro
> 
> #define CALC_POWER_USAGE(power, sum_util, max) \
>     (((power * (sum_util << 10)) / max) >> 10)

static u64 scale_pd_power_uw(struct cpumask *pd_mask, u64 power)
{
        unsigned long max, sum_max = 0, sum_util = 0;
        int cpu;

        for_each_cpu_and(cpu, pd_mask, cpu_online_mask) {
                max = arch_scale_cpu_capacity(cpu);
                sum_util += sched_cpu_util(cpu, max);
                sum_max += max;
        }

        return (power * ((sum_util << 10) / sum_max)) >> 10;
}

??
Lukasz Luba March 9, 2021, 8:44 p.m. UTC | #4
On 3/9/21 7:03 PM, Daniel Lezcano wrote:
> 
> Hi Lukasz,
> 
> thanks for your comments, one question below.
> 
> On 09/03/2021 11:01, Lukasz Luba wrote:
> 
> [ ... ]
> 
>>>    +static u64 scale_pd_power_uw(struct cpumask *cpus, u64 power)
>>
>> renamed 'cpus' into 'pd_mask', see below
>>
>>> +{
>>> +    unsigned long max, util;
>>> +    int cpu, load = 0;
>>
>> IMHO 'int load' looks odd when used with 'util' and 'max'.
>> I would put in the line above to have them all the same type and
>> renamed to 'sum_util'.
>>
>>> +
>>> +    for_each_cpu(cpu, cpus) {
>>
>> I would avoid the temporary CPU mask in the get_pd_power_uw()
>> with this modified loop:
>>
>> for_each_cpu_and(cpu, pd_mask, cpu_online_mask) {
>>
>>
>>> +        max = arch_scale_cpu_capacity(cpu);
>>> +        util = sched_cpu_util(cpu, max);
>>> +        load += ((util * 100) / max);
>>
>> Below you can find 3 optimizations. Since we are not in the hot
>> path here, it's up to if you would like to use all/some of them
>> or just ignore.
>>
>> 1st optimization.
>> If we use 'load += (util << 10) / max' in the loop, then
>> we could avoid div by 100 and use a right shift:
>> (power * load) >> 10
>>
>> 2nd optimization.
>> Since we use EM CPU mask, which span all CPUs with the same
>> arch_scale_cpu_capacity(), you can avoid N divs inside the loop
>> and do it once, below the loop.
>>
>> 3rd optimization.
>> If we just simply add all 'util' into 'sum_util' (no mul or div in
>> the loop), then we might just have simple macro
>>
>> #define CALC_POWER_USAGE(power, sum_util, max) \
>>      (((power * (sum_util << 10)) / max) >> 10)
> 
> I don't understand the 'max' division, I was expecting here something
> like: ((sum_util << 10) / sum_max) >> 10)
> 
> no ?
> 

No, it should be a single 'max', which is in the range 0..1024.
We would like to calculate the power for the whole perf domain, e.g.
4 CPUs almost fully utilized would each have util ~1000, so the total
power should be around ~4 * EM_table[i].power. This '~4' comes from
the sum of the 4 utils divided by one max util:
4000 / 1024


The 'max' in the equation can be factored out of the bracket, as can
'power'.

If we had floating point numbers, the power for cpu1, cpu2, ..., cpuN
would simply be:
power_1 = power * util_1 / max
power_2 = power * util_2 / max
power_N = power * util_N / max
(since they have the same 'max' capacity and the same EM 'power')

The total domain power would be:
total_power = power_1 + power_2 + ... + power_N
which is:
total_power = (power * util_1 / max) + (power * util_2 / max) + ... +
               (power * util_N / max)

Factor 'power' and 'max' out of the bracket:
total_power = power * (util_1 + util_2 + ... + util_N) * (1/max)

introduce the 'sum_util':
sum_util = util_1 + util_2 + ... + util_N
then:
total_power = power * sum_util / max

Unfortunately, we don't use floating point, so we rely on temporary
fixed-point tricks: the '<< 10' and '>> 10' avoid some rounding errors.
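
The difference between dividing by a single 'max' and dividing by 'sum_max' can be checked with the numbers from the explanation above. The following is a minimal sketch, not kernel code: the function names are hypothetical, and the per-CPU power of 1000000 uW used in the note below is an illustrative value, not one taken from a real Energy Model table.

```c
#include <assert.h>
#include <stdint.h>

/* total_power = power * sum_util / max, where max is one CPU's capacity. */
static uint64_t domain_power_single_max(uint64_t power, uint64_t sum_util,
					uint64_t max)
{
	assert(max > 0);
	return ((power * (sum_util << 10)) / max) >> 10;
}

/* Variant dividing by the summed capacities of all CPUs in the domain. */
static uint64_t domain_power_sum_max(uint64_t power, uint64_t sum_util,
				     uint64_t sum_max)
{
	assert(sum_max > 0);
	return (power * ((sum_util << 10) / sum_max)) >> 10;
}
```

For 4 CPUs with util 1000 each (sum_util = 4000) and a capacity of 1024, the single-'max' form returns 3906250 uW for a per-CPU power of 1000000 uW, roughly 4 times the per-CPU value as expected for the whole domain. The 'sum_max' form (sum_max = 4096) returns 976562 uW, which is the average per-CPU power rather than the domain total.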

Patch

diff --git a/drivers/powercap/dtpm_cpu.c b/drivers/powercap/dtpm_cpu.c
index e728ebd6d0ca..8379b96468ef 100644
--- a/drivers/powercap/dtpm_cpu.c
+++ b/drivers/powercap/dtpm_cpu.c
@@ -68,27 +68,40 @@  static u64 set_pd_power_limit(struct dtpm *dtpm, u64 power_limit)
 	return power_limit;
 }
 
+static u64 scale_pd_power_uw(struct cpumask *cpus, u64 power)
+{
+	unsigned long max, util;
+	int cpu, load = 0;
+
+	for_each_cpu(cpu, cpus) {
+		max = arch_scale_cpu_capacity(cpu);
+		util = sched_cpu_util(cpu, max);
+		load += ((util * 100) / max);
+	}
+
+	return (power * load) / 100;
+}
+
 static u64 get_pd_power_uw(struct dtpm *dtpm)
 {
 	struct dtpm_cpu *dtpm_cpu = to_dtpm_cpu(dtpm);
 	struct em_perf_domain *pd;
 	struct cpumask cpus;
 	unsigned long freq;
-	int i, nr_cpus;
+	int i;
 
 	pd = em_cpu_get(dtpm_cpu->cpu);
 	freq = cpufreq_quick_get(dtpm_cpu->cpu);
 
 	cpumask_and(&cpus, cpu_online_mask, to_cpumask(pd->cpus));
-	nr_cpus = cpumask_weight(&cpus);
 
 	for (i = 0; i < pd->nr_perf_states; i++) {
 
 		if (pd->table[i].frequency < freq)
 			continue;
 
-		return pd->table[i].power *
-			MICROWATT_PER_MILLIWATT * nr_cpus;
+		return scale_pd_power_uw(&cpus, pd->table[i].power *
+					 MICROWATT_PER_MILLIWATT);
 	}
 
 	return 0;
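
The OPP lookup in get_pd_power_uw() above can be modeled in userspace as follows. This is a sketch under assumptions: 'struct perf_state' and opp_power_uw() are hypothetical stand-ins for the Energy Model table entries and the loop in the patch, the table is assumed to be sorted by ascending frequency, and the value of MICROWATT_PER_MILLIWATT is assumed to be 1000.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MICROWATT_PER_MILLIWATT 1000ULL

/* Hypothetical stand-in for one Energy Model entry (kHz, mW per CPU). */
struct perf_state {
	unsigned long frequency;
	unsigned long power;
};

/*
 * Return the per-CPU power in uW of the first state whose frequency is
 * not below 'freq', or 0 when no state matches. The result is what the
 * patch then passes to scale_pd_power_uw() for utilization scaling.
 */
static uint64_t opp_power_uw(const struct perf_state *table, size_t nr,
			     unsigned long freq)
{
	assert(table != NULL);
	for (size_t i = 0; i < nr; i++) {
		if (table[i].frequency < freq)
			continue;

		return table[i].power * MICROWATT_PER_MILLIWATT;
	}

	return 0;
}
```

A frequency that falls between two entries selects the next higher state, and a frequency above the last entry yields 0, mirroring the 'return 0' fallthrough in the patch.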