
[2/8] sched/fair: add margin to utilization update

Message ID 1457932932-28444-3-git-send-email-mturquette+renesas@baylibre.com (mailing list archive)
State RFC, archived
Headers show

Commit Message

Michael Turquette March 14, 2016, 5:22 a.m. UTC
Utilization contributions to cfs_rq->avg.util_avg are scaled for both
microarchitecture-invariance and frequency-invariance. This means
that any given utilization contribution will be scaled against the
current cpu capacity (cpu frequency). Contributions from long running
tasks, whose utilization grows larger over time, will asymptotically
approach the current capacity.

This causes a problem when using this utilization signal to select a
target cpu capacity (cpu frequency), as our signal will never exceed the
current capacity, which would otherwise be our signal to increase
frequency.

Solve this by introducing a default capacity margin that is added to the
utilization signal when requesting a change to capacity (cpu frequency).
The margin is 1280, i.e. 1.25 x SCHED_CAPACITY_SCALE (1024). This
matches existing margins such as the default 125 value assigned to
struct sched_domain.imbalance_pct for load balancing, and the 80%
up_threshold used by the legacy cpufreq ondemand governor.

Signed-off-by: Michael Turquette <mturquette+renesas@baylibre.com>
---
 kernel/sched/fair.c  | 18 ++++++++++++++++--
 kernel/sched/sched.h |  3 +++
 2 files changed, 19 insertions(+), 2 deletions(-)

Comments

Peter Zijlstra March 15, 2016, 9:16 p.m. UTC | #1
On Sun, Mar 13, 2016 at 10:22:06PM -0700, Michael Turquette wrote:
> @@ -2840,6 +2853,8 @@ static inline void update_load_avg(struct sched_entity *se, int update_tg)
>  
>  	if (cpu == smp_processor_id() && &rq->cfs == cfs_rq) {
>  		unsigned long max = rq->cpu_capacity_orig;
> +		unsigned long cap = cfs_rq->avg.util_avg *
> +			cfs_capacity_margin / max;
>  
>  		/*
>  		 * There are a few boundary cases this might miss but it should
> @@ -2852,8 +2867,7 @@ static inline void update_load_avg(struct sched_entity *se, int update_tg)
>  		 * thread is a different class (!fair), nor will the utilization
>  		 * number include things like RT tasks.
>  		 */
> -		cpufreq_update_util(rq_clock(rq),
> -				    min(cfs_rq->avg.util_avg, max), max);
> +		cpufreq_update_util(rq_clock(rq), min(cap, max), max);
>  	}
>  }

I really don't see why that is here, and not inside whatever uses
cpufreq_update_util().
--
To unsubscribe from this list: send the line "unsubscribe linux-pm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Peter Zijlstra March 15, 2016, 9:43 p.m. UTC | #2
On Tue, Mar 15, 2016 at 02:28:48PM -0700, Michael Turquette wrote:
> Quoting Peter Zijlstra (2016-03-15 14:16:14)
> > On Sun, Mar 13, 2016 at 10:22:06PM -0700, Michael Turquette wrote:
> > > @@ -2840,6 +2853,8 @@ static inline void update_load_avg(struct sched_entity *se, int update_tg)
> > >  
> > >       if (cpu == smp_processor_id() && &rq->cfs == cfs_rq) {
> > >               unsigned long max = rq->cpu_capacity_orig;
> > > +             unsigned long cap = cfs_rq->avg.util_avg *
> > > +                     cfs_capacity_margin / max;
> > >  
> > >               /*
> > >                * There are a few boundary cases this might miss but it should
> > > @@ -2852,8 +2867,7 @@ static inline void update_load_avg(struct sched_entity *se, int update_tg)
> > >                * thread is a different class (!fair), nor will the utilization
> > >                * number include things like RT tasks.
> > >                */
> > > -             cpufreq_update_util(rq_clock(rq),
> > > -                                 min(cfs_rq->avg.util_avg, max), max);
> > > +             cpufreq_update_util(rq_clock(rq), min(cap, max), max);
> > >       }
> > >  }
> > 
> > I really don't see why that is here, and not inside whatever uses
> > cpufreq_update_util().
> 
> Because I want schedutil to be dumb and not implement any policy of its
> own. The idea is for the scheduler to select frequency after all.
> 
> I want to avoid a weird hybrid solution where we try to be smart about
> selecting the right capacity/frequency in fair.c (e.g. Steve and Juri's
> patches to fair.c from the sched-freq-v7 series), but then have an
> additional layer of "smarts" massaging those values further in the
> cpufreq governor.

So the problem here is that you add an unconditional division, even if
cpufreq_update_util() then decides to not do anything with it.

Please, these are scheduler paths, do not add (fancy) instructions if
you don't absolutely have to.
Steve Muckle March 16, 2016, 2:52 a.m. UTC | #3
On 03/13/2016 10:22 PM, Michael Turquette wrote:
> +unsigned long cfs_capacity_margin = CAPACITY_MARGIN_DEFAULT;
> +
>  #ifdef CONFIG_CFS_BANDWIDTH
>  /*
>   * Amount of runtime to allocate from global (tg) to local (per-cfs_rq) pool
> @@ -2840,6 +2853,8 @@ static inline void update_load_avg(struct sched_entity *se, int update_tg)
>  
>  	if (cpu == smp_processor_id() && &rq->cfs == cfs_rq) {
>  		unsigned long max = rq->cpu_capacity_orig;
> +		unsigned long cap = cfs_rq->avg.util_avg *
> +			cfs_capacity_margin / max;

Doesn't rq->cpu_capacity_orig get scaled per the microarch invariance?
This would mean that the margin we're applying here would differ based
on that.

I'd expect that the margin would be * (cfs_capacity_margin /
SCHED_CAPACITY_SCALE) which would then reduce the division into a shift.
Michael Turquette March 16, 2016, 10:12 p.m. UTC | #4
Quoting Steve Muckle (2016-03-15 19:52:59)
> On 03/13/2016 10:22 PM, Michael Turquette wrote:
> > +unsigned long cfs_capacity_margin = CAPACITY_MARGIN_DEFAULT;
> > +
> >  #ifdef CONFIG_CFS_BANDWIDTH
> >  /*
> >   * Amount of runtime to allocate from global (tg) to local (per-cfs_rq) pool
> > @@ -2840,6 +2853,8 @@ static inline void update_load_avg(struct sched_entity *se, int update_tg)
> >  
> >       if (cpu == smp_processor_id() && &rq->cfs == cfs_rq) {
> >               unsigned long max = rq->cpu_capacity_orig;
> > +             unsigned long cap = cfs_rq->avg.util_avg *
> > +                     cfs_capacity_margin / max;
> 
> Doesn't rq->cpu_capacity_orig get scaled per the microarch invariance?
> This would mean that the margin we're applying here would differ based
> on that.
> 
> I'd expect that the margin would be * (cfs_capacity_margin /
> SCHED_CAPACITY_SCALE) which would then reduce the division into a shift.

Will fix.

Thanks,
Mike

Patch

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index a32f281..29e8bae 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -100,6 +100,19 @@  const_debug unsigned int sysctl_sched_migration_cost = 500000UL;
  */
 unsigned int __read_mostly sysctl_sched_shares_window = 10000000UL;
 
+/*
+ * Add a 25% margin globally to all capacity requests from cfs. This is
+ * equivalent to an 80% up_threshold in legacy governors like ondemand.
+ *
+ * This is required as task utilization increases. The frequency-invariant
+ * utilization will asymptotically approach the current capacity of the cpu and
+ * the additional margin will cross the threshold into the next capacity state.
+ *
+ * XXX someday expand to separate, per-call site margins? e.g. enqueue, fork,
+ * task_tick, load_balance, etc
+ */
+unsigned long cfs_capacity_margin = CAPACITY_MARGIN_DEFAULT;
+
 #ifdef CONFIG_CFS_BANDWIDTH
 /*
  * Amount of runtime to allocate from global (tg) to local (per-cfs_rq) pool
@@ -2840,6 +2853,8 @@  static inline void update_load_avg(struct sched_entity *se, int update_tg)
 
 	if (cpu == smp_processor_id() && &rq->cfs == cfs_rq) {
 		unsigned long max = rq->cpu_capacity_orig;
+		unsigned long cap = cfs_rq->avg.util_avg *
+			cfs_capacity_margin / max;
 
 		/*
 		 * There are a few boundary cases this might miss but it should
@@ -2852,8 +2867,7 @@  static inline void update_load_avg(struct sched_entity *se, int update_tg)
 		 * thread is a different class (!fair), nor will the utilization
 		 * number include things like RT tasks.
 		 */
-		cpufreq_update_util(rq_clock(rq),
-				    min(cfs_rq->avg.util_avg, max), max);
+		cpufreq_update_util(rq_clock(rq), min(cap, max), max);
 	}
 }
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index f06dfca..8c93ed2 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -27,6 +27,9 @@  extern __read_mostly int scheduler_running;
 extern unsigned long calc_load_update;
 extern atomic_long_t calc_load_tasks;
 
+#define CAPACITY_MARGIN_DEFAULT 1280
+extern unsigned long cfs_capacity_margin;
+
 extern void calc_global_load_tick(struct rq *this_rq);
 extern long calc_load_fold_active(struct rq *this_rq);