Message ID | 1436293469-25707-23-git-send-email-morten.rasmussen@arm.com (mailing list archive)
---|---
State | RFC |
On Tue, Jul 07, 2015 at 07:24:05PM +0100, Morten Rasmussen wrote:

> +static unsigned int sched_group_energy(struct sched_group *sg_top)
> +{
> +        struct sched_domain *sd;
> +        int cpu, total_energy = 0;
> +        struct cpumask visit_cpus;
> +        struct sched_group *sg;
> +
> +        WARN_ON(!sg_top->sge);
> +
> +        cpumask_copy(&visit_cpus, sched_group_cpus(sg_top));
> +
> +        while (!cpumask_empty(&visit_cpus)) {
> +                struct sched_group *sg_shared_cap = NULL;
> +
> +                cpu = cpumask_first(&visit_cpus);
> +
> +                /*
> +                 * Is the group utilization affected by cpus outside this
> +                 * sched_group?
> +                 */
> +                sd = highest_flag_domain(cpu, SD_SHARE_CAP_STATES);
> +                if (sd && sd->parent)
> +                        sg_shared_cap = sd->parent->groups;
> +
> +                for_each_domain(cpu, sd) {
> +                        sg = sd->groups;
> +
> +                        /* Has this sched_domain already been visited? */
> +                        if (sd->child && group_first_cpu(sg) != cpu)
> +                                break;
> +
> +                        do {
> +                                struct sched_group *sg_cap_util;
> +                                unsigned long group_util;
> +                                int sg_busy_energy, sg_idle_energy, cap_idx;
> +
> +                                if (sg_shared_cap && sg_shared_cap->group_weight >= sg->group_weight)
> +                                        sg_cap_util = sg_shared_cap;
> +                                else
> +                                        sg_cap_util = sg;
> +
> +                                cap_idx = find_new_capacity(sg_cap_util, sg->sge);

So here it's not really 'new' capacity, is it? More like the current
capacity?

So in the case of coupled P-states, you look for the CPU with the highest
utilization, as that is the one that determines the required P-state.

> +                                group_util = group_norm_usage(sg, cap_idx);
> +                                sg_busy_energy = (group_util * sg->sge->cap_states[cap_idx].power)
> +                                                                >> SCHED_CAPACITY_SHIFT;
> +                                sg_idle_energy = ((SCHED_LOAD_SCALE-group_util) * sg->sge->idle_states[0].power)
> +                                                                >> SCHED_CAPACITY_SHIFT;
> +
> +                                total_energy += sg_busy_energy + sg_idle_energy;
> +
> +                                if (!sd->child)
> +                                        cpumask_xor(&visit_cpus, &visit_cpus, sched_group_cpus(sg));
> +
> +                                if (cpumask_equal(sched_group_cpus(sg), sched_group_cpus(sg_top)))
> +                                        goto next_cpu;
> +
> +                        } while (sg = sg->next, sg != sd->groups);
> +                }
> +next_cpu:
> +                continue;
> +        }
> +
> +        return total_energy;
> +}
On Thu, Aug 13, 2015 at 05:34:17PM +0200, Peter Zijlstra wrote:
> On Tue, Jul 07, 2015 at 07:24:05PM +0100, Morten Rasmussen wrote:
> > +static unsigned int sched_group_energy(struct sched_group *sg_top)
> > +{
> > +        struct sched_domain *sd;
> > +        int cpu, total_energy = 0;
> > +        struct cpumask visit_cpus;
> > +        struct sched_group *sg;
> > +
> > +        WARN_ON(!sg_top->sge);
> > +
> > +        cpumask_copy(&visit_cpus, sched_group_cpus(sg_top));
> > +
> > +        while (!cpumask_empty(&visit_cpus)) {
> > +                struct sched_group *sg_shared_cap = NULL;
> > +
> > +                cpu = cpumask_first(&visit_cpus);
> > +
> > +                /*
> > +                 * Is the group utilization affected by cpus outside this
> > +                 * sched_group?
> > +                 */
> > +                sd = highest_flag_domain(cpu, SD_SHARE_CAP_STATES);
> > +                if (sd && sd->parent)
> > +                        sg_shared_cap = sd->parent->groups;
> > +
> > +                for_each_domain(cpu, sd) {
> > +                        sg = sd->groups;
> > +
> > +                        /* Has this sched_domain already been visited? */
> > +                        if (sd->child && group_first_cpu(sg) != cpu)
> > +                                break;
> > +
> > +                        do {
> > +                                struct sched_group *sg_cap_util;
> > +                                unsigned long group_util;
> > +                                int sg_busy_energy, sg_idle_energy, cap_idx;
> > +
> > +                                if (sg_shared_cap && sg_shared_cap->group_weight >= sg->group_weight)
> > +                                        sg_cap_util = sg_shared_cap;
> > +                                else
> > +                                        sg_cap_util = sg;
> > +
> > +                                cap_idx = find_new_capacity(sg_cap_util, sg->sge);
>
> So here it's not really 'new' capacity, is it? More like the current
> capacity?

Yes, sort of. It is what the current capacity (P-state) should be to
accommodate the current utilization. Using a sane cpufreq governor it is
most likely not far off. I could rename it to find_capacity() instead.
It is extended in a subsequent patch to figure out the 'new' capacity in
cases where we consider putting more utilization into the group.

> So in the case of coupled P-states, you look for the CPU with the highest
> utilization, as that is the one that determines the required P-state.

Yes. That is why we need the SD_SHARE_CAP_STATES flag and why we use
group_max_usage() in find_new_capacity().
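[To make the capacity-index selection discussed above concrete, here is a
minimal userspace C sketch, not kernel code: the struct, table values and
usage numbers are invented for illustration. It models how the highest
per-cpu usage within a frequency domain picks the P-state (capacity) index,
which is what find_new_capacity()/group_max_usage() do for coupled P-states.]

#include <stdio.h>

/* Simplified stand-in for the kernel's cap_states[] table entries. */
struct cap_state {
        unsigned long cap;      /* compute capacity at this P-state */
        unsigned long power;    /* busy power at this P-state */
};

/* Hypothetical three-state table for a small cluster. */
static struct cap_state cap_states[] = {
        { .cap =  342, .power =  25 },
        { .cap =  512, .power =  55 },
        { .cap = 1024, .power = 180 },
};

/* Smallest capacity that can accommodate the highest usage in the domain. */
static int find_capacity_idx(unsigned long max_usage, int nr_cap_states)
{
        int idx;

        for (idx = 0; idx < nr_cap_states; idx++)
                if (cap_states[idx].cap >= max_usage)
                        return idx;
        /* Like the posted patch, falls through if usage exceeds the largest cap. */
        return idx;
}

int main(void)
{
        /* Two cpus sharing a clock source: the busier one dictates the P-state. */
        unsigned long usage_cpu0 = 150, usage_cpu1 = 600;
        unsigned long max_usage = usage_cpu1 > usage_cpu0 ? usage_cpu1 : usage_cpu0;

        /* Prints "selected cap_idx = 2": 600 > 512, so the 1024-capacity state is needed. */
        printf("selected cap_idx = %d\n", find_capacity_idx(max_usage, 3));
        return 0;
}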
On Tue, Jul 07, 2015 at 07:24:05PM +0100, Morten Rasmussen wrote:
> For energy-aware load-balancing decisions it is necessary to know the
> energy consumption estimates of groups of cpus. This patch introduces a
> basic function, sched_group_energy(), which estimates the energy
> consumption of the cpus in the group and any resources shared by the
> members of the group.
>
> NOTE: The function has five levels of indentation and breaks the 80
> character limit. Refactoring is necessary.
>
> cc: Ingo Molnar <mingo@redhat.com>
> cc: Peter Zijlstra <peterz@infradead.org>
>
> Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
> ---
>  kernel/sched/fair.c | 146 ++++++++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 146 insertions(+)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 78d3081..bd0be9d 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -4846,6 +4846,152 @@ static inline bool energy_aware(void)
>          return sched_feat(ENERGY_AWARE);
>  }
>
> +/*
> + * cpu_norm_usage() returns the cpu usage relative to a specific capacity,
> + * i.e. its busy ratio, in the range [0..SCHED_LOAD_SCALE] which is useful for
> + * energy calculations. Using the scale-invariant usage returned by
> + * get_cpu_usage() and approximating scale-invariant usage by:
> + *
> + * usage ~ (curr_freq/max_freq)*1024 * capacity_orig/1024 * running_time/time
> + *
> + * the normalized usage can be found using the specific capacity.
> + *
> + * capacity = capacity_orig * curr_freq/max_freq
> + *
> + * norm_usage = running_time/time ~ usage/capacity
> + */
> +static unsigned long cpu_norm_usage(int cpu, unsigned long capacity)
> +{
> +        int usage = __get_cpu_usage(cpu);
> +
> +        if (usage >= capacity)
> +                return SCHED_CAPACITY_SCALE;
> +
> +        return (usage << SCHED_CAPACITY_SHIFT)/capacity;
> +}
> +
> +static unsigned long group_max_usage(struct sched_group *sg)
> +{
> +        int i;
> +        unsigned long max_usage = 0;
> +
> +        for_each_cpu(i, sched_group_cpus(sg))
> +                max_usage = max(max_usage, get_cpu_usage(i));
> +
> +        return max_usage;
> +}
> +
> +/*
> + * group_norm_usage() returns the approximated group usage relative to its
> + * current capacity (busy ratio) in the range [0..SCHED_LOAD_SCALE] for use in
> + * energy calculations. Since task executions may or may not overlap in time in
> + * the group the true normalized usage is between max(cpu_norm_usage(i)) and
> + * sum(cpu_norm_usage(i)) when iterating over all cpus in the group, i. The
> + * latter is used as the estimate as it leads to a more pessimistic energy
> + * estimate (more busy).
> + */
> +static unsigned long group_norm_usage(struct sched_group *sg, int cap_idx)
> +{
> +        int i;
> +        unsigned long usage_sum = 0;
> +        unsigned long capacity = sg->sge->cap_states[cap_idx].cap;
> +
> +        for_each_cpu(i, sched_group_cpus(sg))
> +                usage_sum += cpu_norm_usage(i, capacity);
> +
> +        if (usage_sum > SCHED_CAPACITY_SCALE)
> +                return SCHED_CAPACITY_SCALE;
> +        return usage_sum;
> +}
> +
> +static int find_new_capacity(struct sched_group *sg,
> +                             struct sched_group_energy *sge)
> +{
> +        int idx;
> +        unsigned long util = group_max_usage(sg);
> +
> +        for (idx = 0; idx < sge->nr_cap_states; idx++) {
> +                if (sge->cap_states[idx].cap >= util)
> +                        return idx;
> +        }
> +
> +        return idx;
> +}
> +
> +/*
> + * sched_group_energy(): Returns absolute energy consumption of cpus belonging
> + * to the sched_group including shared resources shared only by members of the
> + * group. Iterates over all cpus in the hierarchy below the sched_group starting
> + * from the bottom working its way up before going to the next cpu until all
> + * cpus are covered at all levels. The current implementation is likely to
> + * gather the same usage statistics multiple times. This can probably be done in
> + * a faster but more complex way.
> + */
> +static unsigned int sched_group_energy(struct sched_group *sg_top)
> +{
> +        struct sched_domain *sd;
> +        int cpu, total_energy = 0;
> +        struct cpumask visit_cpus;
> +        struct sched_group *sg;
> +
> +        WARN_ON(!sg_top->sge);
> +
> +        cpumask_copy(&visit_cpus, sched_group_cpus(sg_top));
> +
> +        while (!cpumask_empty(&visit_cpus)) {
> +                struct sched_group *sg_shared_cap = NULL;
> +
> +                cpu = cpumask_first(&visit_cpus);
> +
> +                /*
> +                 * Is the group utilization affected by cpus outside this
> +                 * sched_group?
> +                 */
> +                sd = highest_flag_domain(cpu, SD_SHARE_CAP_STATES);
> +                if (sd && sd->parent)
> +                        sg_shared_cap = sd->parent->groups;

If the sched_domain is already the highest level, should we directly use
its own group to calculate the shared capacity? So the code would look
like below:

        if (sd && sd->parent)
                sg_shared_cap = sd->parent->groups;
        else if (sd && !sd->parent)
                sg_shared_cap = sd->groups;

> +
> +                for_each_domain(cpu, sd) {
> +                        sg = sd->groups;
> +
> +                        /* Has this sched_domain already been visited? */
> +                        if (sd->child && group_first_cpu(sg) != cpu)
> +                                break;
> +
> +                        do {
> +                                struct sched_group *sg_cap_util;
> +                                unsigned long group_util;
> +                                int sg_busy_energy, sg_idle_energy, cap_idx;
> +
> +                                if (sg_shared_cap && sg_shared_cap->group_weight >= sg->group_weight)
> +                                        sg_cap_util = sg_shared_cap;
> +                                else
> +                                        sg_cap_util = sg;
> +
> +                                cap_idx = find_new_capacity(sg_cap_util, sg->sge);
> +                                group_util = group_norm_usage(sg, cap_idx);
> +                                sg_busy_energy = (group_util * sg->sge->cap_states[cap_idx].power)
> +                                                                >> SCHED_CAPACITY_SHIFT;
> +                                sg_idle_energy = ((SCHED_LOAD_SCALE-group_util) * sg->sge->idle_states[0].power)
> +                                                                >> SCHED_CAPACITY_SHIFT;
> +
> +                                total_energy += sg_busy_energy + sg_idle_energy;
> +
> +                                if (!sd->child)
> +                                        cpumask_xor(&visit_cpus, &visit_cpus, sched_group_cpus(sg));
> +
> +                                if (cpumask_equal(sched_group_cpus(sg), sched_group_cpus(sg_top)))
> +                                        goto next_cpu;
> +
> +                        } while (sg = sg->next, sg != sd->groups);
> +                }
> +next_cpu:
> +                continue;
> +        }
> +
> +        return total_energy;
> +}
> +
>  static int wake_wide(struct task_struct *p)
>  {
>          int factor = this_cpu_read(sd_llc_size);
> --
> 1.9.1
On Thu, Sep 03, 2015 at 01:19:23AM +0800, Leo Yan wrote:
> On Tue, Jul 07, 2015 at 07:24:05PM +0100, Morten Rasmussen wrote:
> > +/*
> > + * sched_group_energy(): Returns absolute energy consumption of cpus belonging
> > + * to the sched_group including shared resources shared only by members of the
> > + * group. Iterates over all cpus in the hierarchy below the sched_group starting
> > + * from the bottom working its way up before going to the next cpu until all
> > + * cpus are covered at all levels. The current implementation is likely to
> > + * gather the same usage statistics multiple times. This can probably be done in
> > + * a faster but more complex way.
> > + */
> > +static unsigned int sched_group_energy(struct sched_group *sg_top)
> > +{
> > +        struct sched_domain *sd;
> > +        int cpu, total_energy = 0;
> > +        struct cpumask visit_cpus;
> > +        struct sched_group *sg;
> > +
> > +        WARN_ON(!sg_top->sge);
> > +
> > +        cpumask_copy(&visit_cpus, sched_group_cpus(sg_top));
> > +
> > +        while (!cpumask_empty(&visit_cpus)) {
> > +                struct sched_group *sg_shared_cap = NULL;
> > +
> > +                cpu = cpumask_first(&visit_cpus);
> > +
> > +                /*
> > +                 * Is the group utilization affected by cpus outside this
> > +                 * sched_group?
> > +                 */
> > +                sd = highest_flag_domain(cpu, SD_SHARE_CAP_STATES);
> > +                if (sd && sd->parent)
> > +                        sg_shared_cap = sd->parent->groups;
>
> If the sched_domain is already the highest level, should we directly use
> its own group to calculate the shared capacity? So the code would look
> like below:
>
>         if (sd && sd->parent)
>                 sg_shared_cap = sd->parent->groups;
>         else if (sd && !sd->parent)
>                 sg_shared_cap = sd->groups;

This isn't really the right thing to do. The fundamental problem is that
we need to know somehow which cpus share the same clock source
(frequency). We have chosen to use sched_groups to represent groups for
all the energy model calculations, so we use sg_shared_cap to indicate
which cpus share the same clock source.

In the loop above we find the sched_domain that spans all cpus sharing
the same clock source, and the sd->parent->groups trick gives us a
sched_group spanning the same cpus, if such a sched_domain/group exists.
The problem is when it doesn't, i.e. when all cpus share the same clock
source. Using a sched_group at the current level would be wrong as it
would only span a subset of the cpus that really share the clock source.

It is clearly a missing piece in the current patch set. If you are after
a quick and ugly fix you can either:

1) create a temporary sched_group spanning the same cpus as sd, or

2) change struct energy_env and find_new_capacity() to use a cpumask
   instead of a sched_group and pass the cpumask from the sd instead of
   a sched_group pointer.

IMHO, the right solution is to introduce a system-wide sched_group
(there have been previous discussions on this) that spans all the cpus.
I think it should work even without attaching any energy data to that
sched_group. Otherwise, I think you can get away with just adding a
zero-cost capacity and idle state.

Dietmar already has patches that implement a system-wide sched_group
which I'm sure he is willing to share ;-)
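[A rough, untested sketch of option 2 above, purely for illustration: it is
not part of the posted series. It assumes the get_cpu_usage() and struct
sched_group_energy definitions from this patch set, and shows how the
capacity index could be selected from an arbitrary cpumask, so a caller
could pass sched_domain_span(sd) when no parent sched_group exists.]

/*
 * Illustrative only: pick the capacity index from the highest usage over
 * a frequency-domain cpumask instead of a sched_group.
 */
static int find_new_capacity_mask(const struct cpumask *freq_domain,
                                  struct sched_group_energy *sge)
{
        unsigned long util = 0;
        int idx, i;

        /* Highest usage in the clock domain determines the required P-state. */
        for_each_cpu(i, freq_domain)
                util = max(util, get_cpu_usage(i));

        for (idx = 0; idx < sge->nr_cap_states; idx++) {
                if (sge->cap_states[idx].cap >= util)
                        return idx;
        }

        return idx;
}

[sched_group_energy() could then pass sched_group_cpus(sg_shared_cap) when a
spanning parent group exists and sched_domain_span(sd) otherwise; again, a
sketch under those assumptions, not the author's posted solution.]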
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 78d3081..bd0be9d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4846,6 +4846,152 @@ static inline bool energy_aware(void)
         return sched_feat(ENERGY_AWARE);
 }
 
+/*
+ * cpu_norm_usage() returns the cpu usage relative to a specific capacity,
+ * i.e. its busy ratio, in the range [0..SCHED_LOAD_SCALE] which is useful for
+ * energy calculations. Using the scale-invariant usage returned by
+ * get_cpu_usage() and approximating scale-invariant usage by:
+ *
+ * usage ~ (curr_freq/max_freq)*1024 * capacity_orig/1024 * running_time/time
+ *
+ * the normalized usage can be found using the specific capacity.
+ *
+ * capacity = capacity_orig * curr_freq/max_freq
+ *
+ * norm_usage = running_time/time ~ usage/capacity
+ */
+static unsigned long cpu_norm_usage(int cpu, unsigned long capacity)
+{
+        int usage = __get_cpu_usage(cpu);
+
+        if (usage >= capacity)
+                return SCHED_CAPACITY_SCALE;
+
+        return (usage << SCHED_CAPACITY_SHIFT)/capacity;
+}
+
+static unsigned long group_max_usage(struct sched_group *sg)
+{
+        int i;
+        unsigned long max_usage = 0;
+
+        for_each_cpu(i, sched_group_cpus(sg))
+                max_usage = max(max_usage, get_cpu_usage(i));
+
+        return max_usage;
+}
+
+/*
+ * group_norm_usage() returns the approximated group usage relative to its
+ * current capacity (busy ratio) in the range [0..SCHED_LOAD_SCALE] for use in
+ * energy calculations. Since task executions may or may not overlap in time in
+ * the group the true normalized usage is between max(cpu_norm_usage(i)) and
+ * sum(cpu_norm_usage(i)) when iterating over all cpus in the group, i. The
+ * latter is used as the estimate as it leads to a more pessimistic energy
+ * estimate (more busy).
+ */
+static unsigned long group_norm_usage(struct sched_group *sg, int cap_idx)
+{
+        int i;
+        unsigned long usage_sum = 0;
+        unsigned long capacity = sg->sge->cap_states[cap_idx].cap;
+
+        for_each_cpu(i, sched_group_cpus(sg))
+                usage_sum += cpu_norm_usage(i, capacity);
+
+        if (usage_sum > SCHED_CAPACITY_SCALE)
+                return SCHED_CAPACITY_SCALE;
+        return usage_sum;
+}
+
+static int find_new_capacity(struct sched_group *sg,
+                             struct sched_group_energy *sge)
+{
+        int idx;
+        unsigned long util = group_max_usage(sg);
+
+        for (idx = 0; idx < sge->nr_cap_states; idx++) {
+                if (sge->cap_states[idx].cap >= util)
+                        return idx;
+        }
+
+        return idx;
+}
+
+/*
+ * sched_group_energy(): Returns absolute energy consumption of cpus belonging
+ * to the sched_group including shared resources shared only by members of the
+ * group. Iterates over all cpus in the hierarchy below the sched_group starting
+ * from the bottom working its way up before going to the next cpu until all
+ * cpus are covered at all levels. The current implementation is likely to
+ * gather the same usage statistics multiple times. This can probably be done in
+ * a faster but more complex way.
+ */
+static unsigned int sched_group_energy(struct sched_group *sg_top)
+{
+        struct sched_domain *sd;
+        int cpu, total_energy = 0;
+        struct cpumask visit_cpus;
+        struct sched_group *sg;
+
+        WARN_ON(!sg_top->sge);
+
+        cpumask_copy(&visit_cpus, sched_group_cpus(sg_top));
+
+        while (!cpumask_empty(&visit_cpus)) {
+                struct sched_group *sg_shared_cap = NULL;
+
+                cpu = cpumask_first(&visit_cpus);
+
+                /*
+                 * Is the group utilization affected by cpus outside this
+                 * sched_group?
+                 */
+                sd = highest_flag_domain(cpu, SD_SHARE_CAP_STATES);
+                if (sd && sd->parent)
+                        sg_shared_cap = sd->parent->groups;
+
+                for_each_domain(cpu, sd) {
+                        sg = sd->groups;
+
+                        /* Has this sched_domain already been visited? */
+                        if (sd->child && group_first_cpu(sg) != cpu)
+                                break;
+
+                        do {
+                                struct sched_group *sg_cap_util;
+                                unsigned long group_util;
+                                int sg_busy_energy, sg_idle_energy, cap_idx;
+
+                                if (sg_shared_cap && sg_shared_cap->group_weight >= sg->group_weight)
+                                        sg_cap_util = sg_shared_cap;
+                                else
+                                        sg_cap_util = sg;
+
+                                cap_idx = find_new_capacity(sg_cap_util, sg->sge);
+                                group_util = group_norm_usage(sg, cap_idx);
+                                sg_busy_energy = (group_util * sg->sge->cap_states[cap_idx].power)
+                                                                >> SCHED_CAPACITY_SHIFT;
+                                sg_idle_energy = ((SCHED_LOAD_SCALE-group_util) * sg->sge->idle_states[0].power)
+                                                                >> SCHED_CAPACITY_SHIFT;
+
+                                total_energy += sg_busy_energy + sg_idle_energy;
+
+                                if (!sd->child)
+                                        cpumask_xor(&visit_cpus, &visit_cpus, sched_group_cpus(sg));
+
+                                if (cpumask_equal(sched_group_cpus(sg), sched_group_cpus(sg_top)))
+                                        goto next_cpu;
+
+                        } while (sg = sg->next, sg != sd->groups);
+                }
+next_cpu:
+                continue;
+        }
+
+        return total_energy;
+}
+
 static int wake_wide(struct task_struct *p)
 {
         int factor = this_cpu_read(sd_llc_size);
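[As a rough worked example of the busy/idle energy terms computed in
sched_group_energy() above: the following standalone C sketch uses invented
capacity/power numbers and assumes SCHED_LOAD_SCALE equals
SCHED_CAPACITY_SCALE (1024), as it did at the time of this posting.]

#include <stdio.h>

#define SCHED_CAPACITY_SHIFT    10
#define SCHED_CAPACITY_SCALE    (1 << SCHED_CAPACITY_SHIFT)

int main(void)
{
        /* Hypothetical group at a P-state with busy power 55 and idle power 5. */
        unsigned long group_util = 600;         /* normalized busy ratio, 0..1024 */
        unsigned long busy_power = 55, idle_power = 5;

        /* Same arithmetic as the patch: scale power by busy and idle fractions. */
        unsigned long busy_energy = (group_util * busy_power) >> SCHED_CAPACITY_SHIFT;
        unsigned long idle_energy = ((SCHED_CAPACITY_SCALE - group_util) * idle_power)
                                        >> SCHED_CAPACITY_SHIFT;

        /* (600*55)/1024 = 32, (424*5)/1024 = 2, total 34 abstract energy units. */
        printf("busy=%lu idle=%lu total=%lu\n",
               busy_energy, idle_energy, busy_energy + idle_energy);
        return 0;
}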
For energy-aware load-balancing decisions it is necessary to know the
energy consumption estimates of groups of cpus. This patch introduces a
basic function, sched_group_energy(), which estimates the energy
consumption of the cpus in the group and any resources shared by the
members of the group.

NOTE: The function has five levels of indentation and breaks the 80
character limit. Refactoring is necessary.

cc: Ingo Molnar <mingo@redhat.com>
cc: Peter Zijlstra <peterz@infradead.org>

Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
---
 kernel/sched/fair.c | 146 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 146 insertions(+)