diff mbox

[v2,04/11] sched: Allow all archs to set the power_orig

Message ID 1400860385-14555-5-git-send-email-vincent.guittot@linaro.org (mailing list archive)
State New, archived
Headers show

Commit Message

Vincent Guittot May 23, 2014, 3:52 p.m. UTC
power_orig is only changed for system with a SMT sched_domain level in order to
reflect the lower capacity of CPUs. Heterogenous system also have to reflect an
original capacity that is different from the default value.

Create a more generic function arch_scale_cpu_power that can be also used by
non SMT platform to set power_orig.

The weak behavior of arch_scale_cpu_power is the previous SMT one in order to
keep backward compatibility in the use of power_orig.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
---
 kernel/sched/fair.c | 24 ++++++++++++++++--------
 1 file changed, 16 insertions(+), 8 deletions(-)

Comments

Dietmar Eggemann May 30, 2014, 2:04 p.m. UTC | #1
On 23/05/14 16:52, Vincent Guittot wrote:
> power_orig is only changed for system with a SMT sched_domain level in order to
> reflect the lower capacity of CPUs. Heterogenous system also have to reflect an
> original capacity that is different from the default value.
> 
> Create a more generic function arch_scale_cpu_power that can be also used by
> non SMT platform to set power_orig.
> 
> The weak behavior of arch_scale_cpu_power is the previous SMT one in order to
> keep backward compatibility in the use of power_orig.
> 
> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>

As you know, besides uarch scaled cpu power for HMP, freq scaled cpu
power is important for energy-aware scheduling to achieve freq scale
invariance for task load.

I know that your patch-set is not about introducing freq scaled cpu
power, but we were discussing how this can be achieved w/ your patch-set
in place, so maybe you can share your opinion regarding the easiest way
to achieve freq scale invariance with us?

(1) We assume that the current way (update_cpu_power() calls
arch_scale_freq_power() to get the avg power(freq) over the time period
since the last call to arch_scale_freq_power()) is suitable
for us. Do you have another opinion here?

(2) Is the current layout of update_cpu_power() adequate for this, where
we scale power_orig related to freq and then related to rt/(irq):

  power_orig = scale_cpu(SCHED_POWER_SCALE)
  power = scale_rt(scale_freq(power_orig))

or do we need an extra power_freq data member on the rq and do:

  power_orig = scale_cpu(SCHED_POWER_SCALE)
  power_freq = scale_freq(power_orig))
  power = scale_rt(power_orig))

In other words, do we consider rt/(irq) pressure when calculating freq
scale invariant task load or not?

Thanks,

-- Dietmar
[...]
Peter Zijlstra May 30, 2014, 2:46 p.m. UTC | #2
On Fri, May 30, 2014 at 03:04:32PM +0100, Dietmar Eggemann wrote:
> On 23/05/14 16:52, Vincent Guittot wrote:
> > power_orig is only changed for system with a SMT sched_domain level in order to
> > reflect the lower capacity of CPUs. Heterogenous system also have to reflect an
> > original capacity that is different from the default value.
> > 
> > Create a more generic function arch_scale_cpu_power that can be also used by
> > non SMT platform to set power_orig.
> > 
> > The weak behavior of arch_scale_cpu_power is the previous SMT one in order to
> > keep backward compatibility in the use of power_orig.
> > 
> > Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
> 
> As you know, besides uarch scaled cpu power for HMP, freq scaled cpu
> power is important for energy-aware scheduling to achieve freq scale
> invariance for task load.
> 
> I know that your patch-set is not about introducing freq scaled cpu
> power, but we were discussing how this can be achieved w/ your patch-set
> in place, so maybe you can share your opinion regarding the easiest way
> to achieve freq scale invariance with us?
> 
> (1) We assume that the current way (update_cpu_power() calls
> arch_scale_freq_power() to get the avg power(freq) over the time period
> since the last call to arch_scale_freq_power()) is suitable
> for us. Do you have another opinion here?
> 
> (2) Is the current layout of update_cpu_power() adequate for this, where
> we scale power_orig related to freq and then related to rt/(irq):
> 
>   power_orig = scale_cpu(SCHED_POWER_SCALE)
>   power = scale_rt(scale_freq(power_orig))
> 
> or do we need an extra power_freq data member on the rq and do:
> 
>   power_orig = scale_cpu(SCHED_POWER_SCALE)
>   power_freq = scale_freq(power_orig))
>   power = scale_rt(power_orig))
> 
> In other words, do we consider rt/(irq) pressure when calculating freq
> scale invariant task load or not?

I don't think you should. The work done depends on the frequency, not on
other tasks present on the cpu. The same is true for an over-utilized
cpu, a task will run less than the desired amount of time, this is no
different from a RT/irq preempting the task and taking its time.
Vincent Guittot May 30, 2014, 8:50 p.m. UTC | #3
On 30 May 2014 16:04, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
> On 23/05/14 16:52, Vincent Guittot wrote:
>> power_orig is only changed for system with a SMT sched_domain level in order to
>> reflect the lower capacity of CPUs. Heterogenous system also have to reflect an
>> original capacity that is different from the default value.
>>
>> Create a more generic function arch_scale_cpu_power that can be also used by
>> non SMT platform to set power_orig.
>>
>> The weak behavior of arch_scale_cpu_power is the previous SMT one in order to
>> keep backward compatibility in the use of power_orig.
>>
>> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
>
> As you know, besides uarch scaled cpu power for HMP, freq scaled cpu
> power is important for energy-aware scheduling to achieve freq scale
> invariance for task load.
>
> I know that your patch-set is not about introducing freq scaled cpu
> power, but we were discussing how this can be achieved w/ your patch-set
> in place, so maybe you can share your opinion regarding the easiest way
> to achieve freq scale invariance with us?
>
> (1) We assume that the current way (update_cpu_power() calls
> arch_scale_freq_power() to get the avg power(freq) over the time period
> since the last call to arch_scale_freq_power()) is suitable
> for us. Do you have another opinion here?

Using power (or power_freq as you mentioned below) is probably the
easiest and more straight forward solution. You can use it to scale
each element when updating entity runnable.
Nevertheless, I see to 2 potential issues:
- is power updated often enough to correctly follow the frequency
scaling ? we need to compare power update frequency with
runnable_avg_sum variation speed and the rate at which we will change
the CPU's frequency.
- the max value of runnable_avg_sum will be also scaled so a task
running on a CPU with less capacity could be seen as a "low" load even
if it's an always running tasks. So we need to find a way to reach the
max value for such situation

>
> (2) Is the current layout of update_cpu_power() adequate for this, where
> we scale power_orig related to freq and then related to rt/(irq):
>
>   power_orig = scale_cpu(SCHED_POWER_SCALE)
>   power = scale_rt(scale_freq(power_orig))
>
> or do we need an extra power_freq data member on the rq and do:
>
>   power_orig = scale_cpu(SCHED_POWER_SCALE)
>   power_freq = scale_freq(power_orig))
>   power = scale_rt(power_orig))

do you really mean power = scale_rt(power_orig) or power=scale_rt(power_freq) ?

>
> In other words, do we consider rt/(irq) pressure when calculating freq
> scale invariant task load or not?

we should take power_freq which implies a new field

Thanks,
Vincent
>
> Thanks,
>
> -- Dietmar
> [...]
>
Morten Rasmussen June 3, 2014, 1:22 p.m. UTC | #4
On Fri, May 23, 2014 at 04:52:58PM +0100, Vincent Guittot wrote:
> power_orig is only changed for system with a SMT sched_domain level in order to
> reflect the lower capacity of CPUs. Heterogenous system also have to reflect an
> original capacity that is different from the default value.
> 
> Create a more generic function arch_scale_cpu_power that can be also used by
> non SMT platform to set power_orig.

I did a quick test of the patch set with adjusting cpu_power on
big.LITTLE (ARM TC2) to reflect the different compute capacities of the
A15s and A7s. I ran the sysbench cpu benchmark with 5 threads with and
without the patches applied, but with non-default cpu_powers.

I didn't see any difference in the load-balance. Three tasks ended up on
the two A15s and two tasks ended up on two of the three A7s leaving one
unused in both cases.

Using default cpu_power I get one task on each of the five cpus (best
throughput). Unless I messed something up, it seems that setting
cpu_power doesn't give me the best throughput with these patches
applied.

Have you done any tests on big.LITTLE?

Morten
Vincent Guittot June 3, 2014, 2:02 p.m. UTC | #5
On 3 June 2014 15:22, Morten Rasmussen <morten.rasmussen@arm.com> wrote:
> On Fri, May 23, 2014 at 04:52:58PM +0100, Vincent Guittot wrote:
>> power_orig is only changed for system with a SMT sched_domain level in order to
>> reflect the lower capacity of CPUs. Heterogenous system also have to reflect an
>> original capacity that is different from the default value.
>>
>> Create a more generic function arch_scale_cpu_power that can be also used by
>> non SMT platform to set power_orig.
>
> I did a quick test of the patch set with adjusting cpu_power on
> big.LITTLE (ARM TC2) to reflect the different compute capacities of the
> A15s and A7s. I ran the sysbench cpu benchmark with 5 threads with and
> without the patches applied, but with non-default cpu_powers.
>
> I didn't see any difference in the load-balance. Three tasks ended up on
> the two A15s and two tasks ended up on two of the three A7s leaving one
> unused in both cases.
>
> Using default cpu_power I get one task on each of the five cpus (best
> throughput). Unless I messed something up, it seems that setting
> cpu_power doesn't give me the best throughput with these patches
> applied.

That's normal this patchset is necessary but not enough to solve the
issue you mention. We also need to fix the way the imbalance is
calculated for such situation. I have planned to push that in another
patchset in order to not mix too much thing together

Vincent

>
> Have you done any tests on big.LITTLE?
>
> Morten
Dietmar Eggemann June 4, 2014, 9:42 a.m. UTC | #6
[...]
>> (1) We assume that the current way (update_cpu_power() calls
>> arch_scale_freq_power() to get the avg power(freq) over the time period
>> since the last call to arch_scale_freq_power()) is suitable
>> for us. Do you have another opinion here?
> 
> Using power (or power_freq as you mentioned below) is probably the
> easiest and more straight forward solution. You can use it to scale
> each element when updating entity runnable.
> Nevertheless, I see to 2 potential issues:
> - is power updated often enough to correctly follow the frequency
> scaling ? we need to compare power update frequency with
> runnable_avg_sum variation speed and the rate at which we will change
> the CPU's frequency.
> - the max value of runnable_avg_sum will be also scaled so a task
> running on a CPU with less capacity could be seen as a "low" load even
> if it's an always running tasks. So we need to find a way to reach the
> max value for such situation

I think I mixed two problems together here:

Firstly, we need to scale cpu power in update_cpu_power() regarding
uArch, frequency and rt/irq pressure.
Here the freq related value we get back from arch_scale_freq_power(...,
cpu) could be an instantaneous value (curr_freq(cpu)/max_freq(cpu)).

Secondly, to be able to scale the runnable avg sum of a sched entity
(se->avg->runnable_avg_sum), we preferable have a coefficient
representing uArch diffs (cpu_power_orig(cpu)/cpu_power_orig(most
powerful cpu in the system) and another coefficient (avg freq over 'now
- sa->last_runnable_update'(cpu)/max_freq(cpu). This value would have to
be retrieved from the arch in __update_entity_runnable_avg().

>> (2) Is the current layout of update_cpu_power() adequate for this, where
>> we scale power_orig related to freq and then related to rt/(irq):
>>
>>   power_orig = scale_cpu(SCHED_POWER_SCALE)
>>   power = scale_rt(scale_freq(power_orig))
>>
>> or do we need an extra power_freq data member on the rq and do:
>>
>>   power_orig = scale_cpu(SCHED_POWER_SCALE)
>>   power_freq = scale_freq(power_orig))
>>   power = scale_rt(power_orig))
> 
> do you really mean power = scale_rt(power_orig) or power=scale_rt(power_freq) ?

No, I also think that power=scale_rt(power_freq) is correct.

>> In other words, do we consider rt/(irq) pressure when calculating freq
>> scale invariant task load or not?
> 
> we should take power_freq which implies a new field
[...]
Vincent Guittot June 4, 2014, 11:15 a.m. UTC | #7
On 4 June 2014 11:42, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
> [...]
>>> (1) We assume that the current way (update_cpu_power() calls
>>> arch_scale_freq_power() to get the avg power(freq) over the time period
>>> since the last call to arch_scale_freq_power()) is suitable
>>> for us. Do you have another opinion here?
>>
>> Using power (or power_freq as you mentioned below) is probably the
>> easiest and more straight forward solution. You can use it to scale
>> each element when updating entity runnable.
>> Nevertheless, I see to 2 potential issues:
>> - is power updated often enough to correctly follow the frequency
>> scaling ? we need to compare power update frequency with
>> runnable_avg_sum variation speed and the rate at which we will change
>> the CPU's frequency.
>> - the max value of runnable_avg_sum will be also scaled so a task
>> running on a CPU with less capacity could be seen as a "low" load even
>> if it's an always running tasks. So we need to find a way to reach the
>> max value for such situation
>
> I think I mixed two problems together here:
>
> Firstly, we need to scale cpu power in update_cpu_power() regarding
> uArch, frequency and rt/irq pressure.
> Here the freq related value we get back from arch_scale_freq_power(...,
> cpu) could be an instantaneous value (curr_freq(cpu)/max_freq(cpu)).
>
> Secondly, to be able to scale the runnable avg sum of a sched entity
> (se->avg->runnable_avg_sum), we preferable have a coefficient
> representing uArch diffs (cpu_power_orig(cpu)/cpu_power_orig(most
> powerful cpu in the system) and another coefficient (avg freq over 'now

AFAICT, the coefficient representing uArch diffs is already taken into
account into power_freq thanks to scale_cpu, isn't it ?

> - sa->last_runnable_update'(cpu)/max_freq(cpu). This value would have to
> be retrieved from the arch in __update_entity_runnable_avg().
>
>>> (2) Is the current layout of update_cpu_power() adequate for this, where
>>> we scale power_orig related to freq and then related to rt/(irq):
>>>
>>>   power_orig = scale_cpu(SCHED_POWER_SCALE)
>>>   power = scale_rt(scale_freq(power_orig))
>>>
>>> or do we need an extra power_freq data member on the rq and do:
>>>
>>>   power_orig = scale_cpu(SCHED_POWER_SCALE)
>>>   power_freq = scale_freq(power_orig))
>>>   power = scale_rt(power_orig))
>>
>> do you really mean power = scale_rt(power_orig) or power=scale_rt(power_freq) ?
>
> No, I also think that power=scale_rt(power_freq) is correct.
>
>>> In other words, do we consider rt/(irq) pressure when calculating freq
>>> scale invariant task load or not?
>>
>> we should take power_freq which implies a new field
> [...]
>
Morten Rasmussen June 4, 2014, 11:17 a.m. UTC | #8
On Tue, Jun 03, 2014 at 03:02:18PM +0100, Vincent Guittot wrote:
> On 3 June 2014 15:22, Morten Rasmussen <morten.rasmussen@arm.com> wrote:
> > On Fri, May 23, 2014 at 04:52:58PM +0100, Vincent Guittot wrote:
> >> power_orig is only changed for system with a SMT sched_domain level in order to
> >> reflect the lower capacity of CPUs. Heterogenous system also have to reflect an
> >> original capacity that is different from the default value.
> >>
> >> Create a more generic function arch_scale_cpu_power that can be also used by
> >> non SMT platform to set power_orig.
> >
> > I did a quick test of the patch set with adjusting cpu_power on
> > big.LITTLE (ARM TC2) to reflect the different compute capacities of the
> > A15s and A7s. I ran the sysbench cpu benchmark with 5 threads with and
> > without the patches applied, but with non-default cpu_powers.
> >
> > I didn't see any difference in the load-balance. Three tasks ended up on
> > the two A15s and two tasks ended up on two of the three A7s leaving one
> > unused in both cases.
> >
> > Using default cpu_power I get one task on each of the five cpus (best
> > throughput). Unless I messed something up, it seems that setting
> > cpu_power doesn't give me the best throughput with these patches
> > applied.
> 
> That's normal this patchset is necessary but not enough to solve the
> issue you mention. We also need to fix the way the imbalance is
> calculated for such situation. I have planned to push that in another
> patchset in order to not mix too much thing together

Based on the commit messages I was just lead to believe that this was a
self-contained patch set that also addressed issues related to handling
heterogeneous systems. Maybe it would be worth mentioning that this set
is only part of the solution somewhere?

It is a bit unclear to me how these changes, which appear to mainly
improve factoring rt and irq time into cpu_power, will solve the
cpu_power issues related to heterogeneous systems. Can you share your
plans for the follow up patch set? I think it would be better to review
the solution as a whole.

I absolutely agree that the imbalance calculation needs to fixed, but I
don't think the current rq runnable_avg_sum is the right choice for that
purpose for the reasons I pointed out the in other thread.

Morten
Dietmar Eggemann June 5, 2014, 8:59 a.m. UTC | #9
[...]
>> Firstly, we need to scale cpu power in update_cpu_power() regarding
>> uArch, frequency and rt/irq pressure.
>> Here the freq related value we get back from arch_scale_freq_power(...,
>> cpu) could be an instantaneous value (curr_freq(cpu)/max_freq(cpu)).
>>
>> Secondly, to be able to scale the runnable avg sum of a sched entity
>> (se->avg->runnable_avg_sum), we preferable have a coefficient
>> representing uArch diffs (cpu_power_orig(cpu)/cpu_power_orig(most
>> powerful cpu in the system) and another coefficient (avg freq over 'now
> 
> AFAICT, the coefficient representing uArch diffs is already taken into
> account into power_freq thanks to scale_cpu, isn't it ?

True, but I can't see how the current signature of
arch_scale_cpu_power() and arch_scale_freq_power() fit into this uArch
and freq invariant updating of se->avg->runnable_avg_sum business.

>> - sa->last_runnable_update'(cpu)/max_freq(cpu). This value would have to
>> be retrieved from the arch in __update_entity_runnable_avg().
>>
[...]
Vincent Guittot June 6, 2014, 7:01 a.m. UTC | #10
On 4 June 2014 13:17, Morten Rasmussen <morten.rasmussen@arm.com> wrote:
> On Tue, Jun 03, 2014 at 03:02:18PM +0100, Vincent Guittot wrote:
>> On 3 June 2014 15:22, Morten Rasmussen <morten.rasmussen@arm.com> wrote:
>> > On Fri, May 23, 2014 at 04:52:58PM +0100, Vincent Guittot wrote:
>> >> power_orig is only changed for system with a SMT sched_domain level in order to
>> >> reflect the lower capacity of CPUs. Heterogenous system also have to reflect an
>> >> original capacity that is different from the default value.
>> >>
>> >> Create a more generic function arch_scale_cpu_power that can be also used by
>> >> non SMT platform to set power_orig.
>> >
>> > I did a quick test of the patch set with adjusting cpu_power on
>> > big.LITTLE (ARM TC2) to reflect the different compute capacities of the
>> > A15s and A7s. I ran the sysbench cpu benchmark with 5 threads with and
>> > without the patches applied, but with non-default cpu_powers.
>> >
>> > I didn't see any difference in the load-balance. Three tasks ended up on
>> > the two A15s and two tasks ended up on two of the three A7s leaving one
>> > unused in both cases.
>> >
>> > Using default cpu_power I get one task on each of the five cpus (best
>> > throughput). Unless I messed something up, it seems that setting
>> > cpu_power doesn't give me the best throughput with these patches
>> > applied.
>>
>> That's normal this patchset is necessary but not enough to solve the
>> issue you mention. We also need to fix the way the imbalance is
>> calculated for such situation. I have planned to push that in another
>> patchset in order to not mix too much thing together
>
> Based on the commit messages I was just lead to believe that this was a
> self-contained patch set that also addressed issues related to handling
> heterogeneous systems. Maybe it would be worth mentioning that this set
> is only part of the solution somewhere?

So this patch addresses the issue of describing the CPU capacity of an
HMP system but doesn't solve the one task per CPU issue alone.

>
> It is a bit unclear to me how these changes, which appear to mainly
> improve factoring rt and irq time into cpu_power, will solve the
> cpu_power issues related to heterogeneous systems. Can you share your
> plans for the follow up patch set? I think it would be better to review
> the solution as a whole.

I have planned to use load_per_task to calculate the imbalance to pull

>
> I absolutely agree that the imbalance calculation needs to fixed, but I
> don't think the current rq runnable_avg_sum is the right choice for that
> purpose for the reasons I pointed out the in other thread.
>
> Morten
Vincent Guittot June 16, 2014, 9:01 a.m. UTC | #11
On 5 June 2014 10:59, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
> [...]
>>> Firstly, we need to scale cpu power in update_cpu_power() regarding
>>> uArch, frequency and rt/irq pressure.
>>> Here the freq related value we get back from arch_scale_freq_power(...,
>>> cpu) could be an instantaneous value (curr_freq(cpu)/max_freq(cpu)).
>>>
>>> Secondly, to be able to scale the runnable avg sum of a sched entity
>>> (se->avg->runnable_avg_sum), we preferable have a coefficient
>>> representing uArch diffs (cpu_power_orig(cpu)/cpu_power_orig(most
>>> powerful cpu in the system) and another coefficient (avg freq over 'now
>>
>> AFAICT, the coefficient representing uArch diffs is already taken into
>> account into power_freq thanks to scale_cpu, isn't it ?
>
> True, but I can't see how the current signature of
> arch_scale_cpu_power() and arch_scale_freq_power() fit into this uArch
> and freq invariant updating of se->avg->runnable_avg_sum business.

My 1st assumption is that the update of rq->cpu_power (or
rq->cpu_power_freq as we discussed earlier) is fast enough compare to
frequency scaling so that it can be used to scale the load without
major error. In this case, you can use the cpu_power_orig, cpu_power
fields. The update period of cpu_power is equal balance_interval
which is initialized with the weight of the sched_domain (less or
equal to 4ms for most of ARM platform). Most of ARM platforms use a
100 HZ jiffies so the update period will be aligned to 10ms (1 tick).

If this update is not fast enough, the second possibility could be to
directly access to current frequency (or something that represent it).
This might be possible once the cpufreq will have been consolidated
into the scheduler similarly to what is going with cpuidle

Vincent

>
>>> - sa->last_runnable_update'(cpu)/max_freq(cpu). This value would have to
>>> be retrieved from the arch in __update_entity_runnable_avg().
>>>
> [...]
>
diff mbox

Patch

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 6a84114..6dc05f7 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5618,6 +5618,20 @@  unsigned long __weak arch_scale_smt_power(struct sched_domain *sd, int cpu)
 	return default_scale_smt_power(sd, cpu);
 }
 
+unsigned long __weak arch_scale_cpu_power(struct sched_domain *sd, int cpu)
+{
+	unsigned long weight = sd->span_weight;
+
+	if ((sd->flags & SD_SHARE_CPUPOWER) && weight > 1) {
+		if (sched_feat(ARCH_POWER))
+			return arch_scale_smt_power(sd, cpu);
+		else
+			return default_scale_smt_power(sd, cpu);
+	}
+
+	return SCHED_POWER_SCALE;
+}
+
 static unsigned long scale_rt_power(int cpu)
 {
 	struct rq *rq = cpu_rq(cpu);
@@ -5654,18 +5668,12 @@  static unsigned long scale_rt_power(int cpu)
 
 static void update_cpu_power(struct sched_domain *sd, int cpu)
 {
-	unsigned long weight = sd->span_weight;
 	unsigned long power = SCHED_POWER_SCALE;
 	struct sched_group *sdg = sd->groups;
 
-	if ((sd->flags & SD_SHARE_CPUPOWER) && weight > 1) {
-		if (sched_feat(ARCH_POWER))
-			power *= arch_scale_smt_power(sd, cpu);
-		else
-			power *= default_scale_smt_power(sd, cpu);
+	power *= arch_scale_cpu_power(sd, cpu);
 
-		power >>= SCHED_POWER_SHIFT;
-	}
+	power >>= SCHED_POWER_SHIFT;
 
 	sdg->sgp->power_orig = power;