
[RFCv5,25/46] sched: Add over-utilization/tipping point indicator

Message ID 1436293469-25707-26-git-send-email-morten.rasmussen@arm.com (mailing list archive)
State RFC

Commit Message

Morten Rasmussen July 7, 2015, 6:24 p.m. UTC
Energy-aware scheduling is only meant to be active while the system is
_not_ over-utilized. That is, there are spare cycles available to shift
tasks around based on their actual utilization to get a more
energy-efficient task distribution without depriving any tasks. When
above the tipping point, task placement is done the traditional way,
spreading the tasks across as many cpus as possible based on
priority-scaled load to preserve smp_nice.

The over-utilization condition is conservatively chosen to indicate
over-utilization as soon as one cpu is fully utilized at its highest
frequency. We don't consider groups, as lumping usage and capacity
together for a group of cpus may hide the fact that one or more cpus in
the group are over-utilized while group-siblings are partially idle. The
tasks could be served better if moved to another group with completely
idle cpus. This is particularly problematic if some cpus have a
significantly reduced capacity due to RT/IRQ pressure or if the system
has cpus of different capacity (e.g. ARM big.LITTLE).

cc: Ingo Molnar <mingo@redhat.com>
cc: Peter Zijlstra <peterz@infradead.org>

Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
---
 kernel/sched/fair.c  | 35 +++++++++++++++++++++++++++++++----
 kernel/sched/sched.h |  3 +++
 2 files changed, 34 insertions(+), 4 deletions(-)

Comments

Peter Zijlstra Aug. 13, 2015, 5:35 p.m. UTC | #1
On Tue, Jul 07, 2015 at 07:24:08PM +0100, Morten Rasmussen wrote:
> Energy-aware scheduling is only meant to be active while the system is
> _not_ over-utilized. That is, there are spare cycles available to shift
> tasks around based on their actual utilization to get a more
> energy-efficient task distribution without depriving any tasks. When
> above the tipping point task placement is done the traditional way,
> spreading the tasks across as many cpus as possible based on priority
> scaled load to preserve smp_nice.
> 
> The over-utilization condition is conservatively chosen to indicate
> over-utilization as soon as one cpu is fully utilized at it's highest
> frequency. We don't consider groups as lumping usage and capacity
> together for a group of cpus may hide the fact that one or more cpus in
> the group are over-utilized while group-siblings are partially idle. The
> tasks could be served better if moved to another group with completely
> idle cpus. This is particularly problematic if some cpus have a
> significantly reduced capacity due to RT/IRQ pressure or if the system
> has cpus of different capacity (e.g. ARM big.LITTLE).

I might be tired, but I'm having a very hard time deciphering this
second paragraph.
Morten Rasmussen Aug. 14, 2015, 1:02 p.m. UTC | #2
On Thu, Aug 13, 2015 at 07:35:33PM +0200, Peter Zijlstra wrote:
> On Tue, Jul 07, 2015 at 07:24:08PM +0100, Morten Rasmussen wrote:
> > Energy-aware scheduling is only meant to be active while the system is
> > _not_ over-utilized. That is, there are spare cycles available to shift
> > tasks around based on their actual utilization to get a more
> > energy-efficient task distribution without depriving any tasks. When
> > above the tipping point task placement is done the traditional way,
> > spreading the tasks across as many cpus as possible based on priority
> > scaled load to preserve smp_nice.
> > 
> > The over-utilization condition is conservatively chosen to indicate
> > over-utilization as soon as one cpu is fully utilized at it's highest
> > frequency. We don't consider groups as lumping usage and capacity
> > together for a group of cpus may hide the fact that one or more cpus in
> > the group are over-utilized while group-siblings are partially idle. The
> > tasks could be served better if moved to another group with completely
> > idle cpus. This is particularly problematic if some cpus have a
> > significantly reduced capacity due to RT/IRQ pressure or if the system
> > has cpus of different capacity (e.g. ARM big.LITTLE).
> 
> I might be tired, but I'm having a very hard time deciphering this
> second paragraph.

I can see why, let me try again :-)

It is essentially about when we make balancing decisions based on
load_avg and util_avg (using the new names in Yuyang's rewrite). As you
mentioned in another thread recently, we want to use util_avg until the
system is over-utilized and then switch to load_avg. We need to define
the conditions that determine the switch.

The util_avg for each cpu converges towards 100% (1024) regardless of
how many additional tasks we may put on it. If we define
over-utilized as being something like:

sum_{cpus}(rq::cfs::avg::util_avg) + margin > sum_{cpus}(rq::capacity)

some individual cpus may be over-utilized running multiple tasks even
when the above condition is false. That should be okay as long as we try
to spread the tasks out to avoid per-cpu over-utilization as much as
possible and all tasks have the _same_ priority. If the latter isn't
true, we have to consider priority to preserve smp_nice.

For example, we could have n_cpus nice=-10 util_avg=55% tasks and
n_cpus/2 nice=0 util_avg=60% tasks. Balancing based on util_avg we are
likely to end up with the nice=-10 tasks sharing cpus and the nice=0
tasks getting cpus to themselves, as we have 1.5*n_cpus tasks in total
and 55%+55% is less over-utilized than 55%+60% for those cpus that have
to be shared. The total utilization, n_cpus*55% + (n_cpus/2)*60% =
n_cpus*85%, is only 85% of the system capacity, but we are breaking
smp_nice.

To be sure not to break smp_nice, we have defined over-utilization as
when:

cpu_rq(any)::cfs::avg::util_avg + margin > cpu_rq(any)::capacity

is true for any cpu in the system. IOW, as soon as one cpu is (nearly)
100% utilized, we switch to load_avg to factor in priority.
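
In code that is essentially what this patch's cpu_overutilized() checks
per cpu. A minimal sketch of the system-wide view, reusing the helpers
from this series (the any-cpu scan below is illustrative only -- the
patch sets a root_domain flag from the enqueue/tick/load-balance paths
instead of scanning all cpus):

static unsigned int capacity_margin = 1280; /* ~20% margin */

static bool cpu_overutilized(int cpu)
{
	/* usage * 1.25 > capacity, i.e. usage above ~80% of capacity */
	return (capacity_of(cpu) * 1024) <
				(get_cpu_usage(cpu) * capacity_margin);
}

/*
 * Illustrative only: the tipping point trips as soon as any single
 * cpu is over-utilized.
 */
static bool system_overutilized(void)
{
	int cpu;

	for_each_online_cpu(cpu)
		if (cpu_overutilized(cpu))
			return true;

	return false;
}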

Now with this definition, we can skip periodic load-balance as no cpu
has an always-running task when the system is not over-utilized. All
tasks will be periodic and we can balance them at wake-up. This
conservative condition does, however, mean that some scenarios that
could still benefit from energy-aware decisions even if one cpu is fully
utilized would not get those benefits.
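
A minimal sketch of that gating, assuming only the rd->overutilized flag
added by this patch (the helper name is made up for illustration):

static inline bool tipping_point_reached(struct rq *rq)
{
	/* below the tipping point, wake-up balancing is assumed to suffice */
	return rq->rd->overutilized;
}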

For systems where some cpus might have reduced capacity (RT-pressure
and/or big.LITTLE), we want periodic load-balance checks as soon as just
a single cpu is fully utilized, as it might be one of those with reduced
capacity and in that case we want to migrate the task.

I haven't found any reasonably easy-to-track conditions that would work
better. Suggestions are very welcome.

Leo Yan Aug. 17, 2015, 1:10 p.m. UTC | #3
On Tue, Jul 07, 2015 at 07:24:08PM +0100, Morten Rasmussen wrote:
> Energy-aware scheduling is only meant to be active while the system is
> _not_ over-utilized. That is, there are spare cycles available to shift
> tasks around based on their actual utilization to get a more
> energy-efficient task distribution without depriving any tasks. When
> above the tipping point task placement is done the traditional way,
> spreading the tasks across as many cpus as possible based on priority
> scaled load to preserve smp_nice.
> 
> The over-utilization condition is conservatively chosen to indicate
> over-utilization as soon as one cpu is fully utilized at it's highest
> frequency. We don't consider groups as lumping usage and capacity
> together for a group of cpus may hide the fact that one or more cpus in
> the group are over-utilized while group-siblings are partially idle. The
> tasks could be served better if moved to another group with completely
> idle cpus. This is particularly problematic if some cpus have a
> significantly reduced capacity due to RT/IRQ pressure or if the system
> has cpus of different capacity (e.g. ARM big.LITTLE).
> 
> cc: Ingo Molnar <mingo@redhat.com>
> cc: Peter Zijlstra <peterz@infradead.org>
> 
> Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
> ---
>  kernel/sched/fair.c  | 35 +++++++++++++++++++++++++++++++----
>  kernel/sched/sched.h |  3 +++
>  2 files changed, 34 insertions(+), 4 deletions(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index bf1d34c..99e43ee 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -4281,6 +4281,8 @@ static inline void hrtick_update(struct rq *rq)
>  }
>  #endif
>  
> +static bool cpu_overutilized(int cpu);
> +
>  /*
>   * The enqueue_task method is called before nr_running is
>   * increased. Here we update the fair scheduling stats and
> @@ -4291,6 +4293,7 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
>  {
>  	struct cfs_rq *cfs_rq;
>  	struct sched_entity *se = &p->se;
> +	int task_new = !(flags & ENQUEUE_WAKEUP);
>  
>  	for_each_sched_entity(se) {
>  		if (se->on_rq)
> @@ -4325,6 +4328,9 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
>  	if (!se) {
>  		update_rq_runnable_avg(rq, rq->nr_running);
>  		add_nr_running(rq, 1);
> +		if (!task_new && !rq->rd->overutilized &&
> +		    cpu_overutilized(rq->cpu))
> +			rq->rd->overutilized = true;

Maybe this is a stupid question: the root domain's overutilized value is
shared by all CPUs, so just curious whether we need a lock to protect
this variable, or should use an atomic type for it?

[...]

Thanks,
Leo Yan
Steve Muckle Sept. 29, 2015, 8:08 p.m. UTC | #4
On 08/14/2015 06:02 AM, Morten Rasmussen wrote:
> To be sure not to break smp_nice, we have defined over-utilization as
> when:
> 
> cpu_rq(any)::cfs::avg::util_avg + margin > cpu_rq(any)::capacity
> 
> is true for any cpu in the system. IOW, as soon as one cpu is (nearly)
> 100% utilized, we switch to load_avg to factor in priority.
> 
> Now with this definition, we can skip periodic load-balance as no cpu
> has an always-running task when the system is not over-utilized. All
> tasks will be periodic and we can balance them at wake-up. This
> conservative condition does however mean that some scenarios that could
> benefit from energy-aware decisions even if one cpu is fully utilized
> would not get those benefits.
> 
> For system where some cpus might have reduced capacity on some cpus
> (RT-pressure and/or big.LITTLE), we want periodic load-balance checks as
> soon a just a single cpu is fully utilized as it might one of those with
> reduced capacity and in that case we want to migrate it.
> 
> I haven't found any reasonably easy-to-track conditions that would work
> better. Suggestions are very welcome.

Workloads with a single heavy task and many small tasks are pretty
common. I'm worried about the single heavy task tripping the
over-utilization condition on a b.L system, EAS getting turned off, and
small tasks running on big CPUs, leading to an increase in power
consumption.

Perhaps an extension to the over-utilization logic such as the following
could allow big CPUs saturated by a single task to be ignored?

util(cpu X) + margin > capacity(cpu X) &&
  (capacity(cpu X) != max_capacity ? 1 : nr_running(cpu X) > 1)
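
As a hedged sketch only (not a tested change), that could be folded into
the patch's cpu_overutilized(), with max_capacity standing for the
largest cpu capacity in the system as in the pseudo-code above:

static bool cpu_overutilized(int cpu)
{
	/* ignore a max-capacity cpu saturated by a single task */
	if (capacity_of(cpu) == max_capacity &&
	    cpu_rq(cpu)->nr_running <= 1)
		return false;

	return (capacity_of(cpu) * 1024) <
				(get_cpu_usage(cpu) * capacity_margin);
}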
Morten Rasmussen Oct. 9, 2015, 12:49 p.m. UTC | #5
On Tue, Sep 29, 2015 at 01:08:46PM -0700, Steve Muckle wrote:
> On 08/14/2015 06:02 AM, Morten Rasmussen wrote:
> > To be sure not to break smp_nice, we have defined over-utilization as
> > when:
> > 
> > cpu_rq(any)::cfs::avg::util_avg + margin > cpu_rq(any)::capacity
> > 
> > is true for any cpu in the system. IOW, as soon as one cpu is (nearly)
> > 100% utilized, we switch to load_avg to factor in priority.
> > 
> > Now with this definition, we can skip periodic load-balance as no cpu
> > has an always-running task when the system is not over-utilized. All
> > tasks will be periodic and we can balance them at wake-up. This
> > conservative condition does however mean that some scenarios that could
> > benefit from energy-aware decisions even if one cpu is fully utilized
> > would not get those benefits.
> > 
> > For system where some cpus might have reduced capacity on some cpus
> > (RT-pressure and/or big.LITTLE), we want periodic load-balance checks as
> > soon a just a single cpu is fully utilized as it might one of those with
> > reduced capacity and in that case we want to migrate it.
> > 
> > I haven't found any reasonably easy-to-track conditions that would work
> > better. Suggestions are very welcome.
> 
> Workloads with a single heavy task and many small tasks are pretty
> common. I'm worried about the single heavy task tripping the
> over-utilization condition on a b.L system, EAS getting turned off, and
> small tasks running on big CPUs, leading to an increase in power
> consumption.
> 
> Perhaps an extension to the over-utilization logic such as the following
> could cause big CPUs being saturated by a single task to be ignored?
> 
> util(cpu X) + margin > capacity(cpu X) &&
>   (capacity(cpu X) != max_capacity ? 1 : nr_running(cpu X) > 1)

I have had the same thought as well. I think it could work. nr_running()
doesn't take into account blocked tasks, so we could in theory see fewer
tasks than there are, but those scenarios are currently ignored by
load-balancing anyway, and if the cpu is seriously over-utilized we are
quite likely to have nr_running() > 1.

I'm in favor of giving it a try and see what explodes :-)

Patch

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index bf1d34c..99e43ee 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4281,6 +4281,8 @@  static inline void hrtick_update(struct rq *rq)
 }
 #endif
 
+static bool cpu_overutilized(int cpu);
+
 /*
  * The enqueue_task method is called before nr_running is
  * increased. Here we update the fair scheduling stats and
@@ -4291,6 +4293,7 @@  enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 {
 	struct cfs_rq *cfs_rq;
 	struct sched_entity *se = &p->se;
+	int task_new = !(flags & ENQUEUE_WAKEUP);
 
 	for_each_sched_entity(se) {
 		if (se->on_rq)
@@ -4325,6 +4328,9 @@  enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 	if (!se) {
 		update_rq_runnable_avg(rq, rq->nr_running);
 		add_nr_running(rq, 1);
+		if (!task_new && !rq->rd->overutilized &&
+		    cpu_overutilized(rq->cpu))
+			rq->rd->overutilized = true;
 	}
 	hrtick_update(rq);
 }
@@ -4952,6 +4958,14 @@  static int find_new_capacity(struct energy_env *eenv,
 	return idx;
 }
 
+static unsigned int capacity_margin = 1280; /* ~20% margin */
+
+static bool cpu_overutilized(int cpu)
+{
+	return (capacity_of(cpu) * 1024) <
+				(get_cpu_usage(cpu) * capacity_margin);
+}
+
 /*
  * sched_group_energy(): Returns absolute energy consumption of cpus belonging
  * to the sched_group including shared resources shared only by members of the
@@ -6756,11 +6770,12 @@  static enum group_type group_classify(struct lb_env *env,
  * @local_group: Does group contain this_cpu.
  * @sgs: variable to hold the statistics for this group.
  * @overload: Indicate more than one runnable task for any CPU.
+ * @overutilized: Indicate overutilization for any CPU.
  */
 static inline void update_sg_lb_stats(struct lb_env *env,
 			struct sched_group *group, int load_idx,
 			int local_group, struct sg_lb_stats *sgs,
-			bool *overload)
+			bool *overload, bool *overutilized)
 {
 	unsigned long load;
 	int i;
@@ -6790,6 +6805,9 @@  static inline void update_sg_lb_stats(struct lb_env *env,
 		sgs->sum_weighted_load += weighted_cpuload(i);
 		if (idle_cpu(i))
 			sgs->idle_cpus++;
+
+		if (cpu_overutilized(i))
+			*overutilized = true;
 	}
 
 	/* Adjust by relative CPU capacity of the group */
@@ -6895,7 +6913,7 @@  static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
 	struct sched_group *sg = env->sd->groups;
 	struct sg_lb_stats tmp_sgs;
 	int load_idx, prefer_sibling = 0;
-	bool overload = false;
+	bool overload = false, overutilized = false;
 
 	if (child && child->flags & SD_PREFER_SIBLING)
 		prefer_sibling = 1;
@@ -6917,7 +6935,7 @@  static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
 		}
 
 		update_sg_lb_stats(env, sg, load_idx, local_group, sgs,
-						&overload);
+						&overload, &overutilized);
 
 		if (local_group)
 			goto next_group;
@@ -6959,8 +6977,14 @@  static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
 		/* update overload indicator if we are at root domain */
 		if (env->dst_rq->rd->overload != overload)
 			env->dst_rq->rd->overload = overload;
-	}
 
+		/* Update over-utilization (tipping point, U >= 0) indicator */
+		if (env->dst_rq->rd->overutilized != overutilized)
+			env->dst_rq->rd->overutilized = overutilized;
+	} else {
+		if (!env->dst_rq->rd->overutilized && overutilized)
+			env->dst_rq->rd->overutilized = true;
+	}
 }
 
 /**
@@ -8324,6 +8348,9 @@  static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
 		task_tick_numa(rq, curr);
 
 	update_rq_runnable_avg(rq, 1);
+
+	if (!rq->rd->overutilized && cpu_overutilized(task_cpu(curr)))
+		rq->rd->overutilized = true;
 }
 
 /*
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 8a51692..fbe2da0 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -535,6 +535,9 @@  struct root_domain {
 	/* Indicate more than one runnable task for any CPU */
 	bool overload;
 
+	/* Indicate one or more cpus over-utilized (tipping point) */
+	bool overutilized;
+
 	/*
 	 * The bit corresponding to a CPU gets set here if such CPU has more
 	 * than one runnable -deadline task (as it is below for RT tasks).