[v10,1/3] cpufreq: Add mechanism for registering utilization update callbacks

Message ID 20160222105224.GE6356@twins.programming.kicks-ass.net (mailing list archive)
State Not Applicable, archived

Commit Message

Peter Zijlstra Feb. 22, 2016, 10:52 a.m. UTC
On Fri, Feb 19, 2016 at 09:28:23AM -0800, Steve Muckle wrote:
> On 02/19/2016 08:42 AM, Srinivas Pandruvada wrote:
> > We did experiments using util/max in intel_pstate. For some benchmarks
> > there were regressions of 4 to 5%, for some benchmarks it performed at
> > par with getting utilization from the processor. Further optimization
> > in the algorithm is possible and still in progress. Idea is that we can
> > change P-State fast enough and be more reactive. Once I have good data,
> > I will send to this list. The algorithm can be part of the cpufreq
> > governor too.
> 
> There has been a lot of work in the area of scheduler-driven CPU
> frequency selection by Linaro and ARM as well. It was posted most
> recently a couple months ago:
> 
> http://thread.gmane.org/gmane.linux.power-management.general/69176
> 
> It was also posted as part of the energy-aware scheduling series last
> July. There's a new RFC series forthcoming which I had hoped (and
> failed) to post prior to my business travel this week; it should be out
> next week. It will address the feedback received thus far along with
> locking and other things.

Right, so I had a wee look at that again, and had a quick chat with Juri
on IRC. So the main difference seems to be that you guys want to know
why the utilization changed, as opposed to purely _that_ it changed.

And hence you have callbacks all over the place.

I'm not too sure I really like that too much, it bloats the code and
somewhat obfuscates the point.

So I would really like there to be just the one callback when we
actually compute a new number, and that is update_load_avg().

Now I think we can 'easily' propagate the information you want into
update_load_avg() (see below), but I would like to see actual arguments
for why you would need this.
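
As a rough illustration of what that buys us: once the type is threaded
through, the single update_load_avg() site has both the freshly computed
numbers and the reason they changed. The sketch below shows roughly the
v4.5-era body with one extra, made-up hook call at the end;
cpufreq_notify_load() is not part of the patch below, it only marks where
a governor could be told "what" and "why" in one place.

static inline void update_load_avg(struct sched_entity *se, int update_tg,
				   enum load_update_type type)
{
	struct cfs_rq *cfs_rq = cfs_rq_of(se);
	u64 now = cfs_rq_clock_task(cfs_rq);
	int cpu = cpu_of(rq_of(cfs_rq));

	/* decay/accumulate the entity's own signal */
	__update_load_avg(now, cpu, &se->avg,
			  se->on_rq * scale_load_down(se->load.weight),
			  cfs_rq->curr == se, NULL);

	/* fold it into the cfs_rq and, if needed, the task group */
	if (update_cfs_rq_load_avg(now, cfs_rq) && update_tg)
		update_tg_load_avg(cfs_rq, 0);

	/* hypothetical hook, NOT in the patch below: report the new
	 * utilization together with the reason for the update */
	cpufreq_notify_load(cpu, cfs_rq->avg.util_avg, type);
}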

For one, the migration bits don't really make sense. We typically do not
call migration code local on both cpus, typically just one, but possibly
neither. That means you cannot actually update the relevant CPU state
from these sites anyway.

> The scheduler hooks for utilization-based cpufreq operation deserve a
> lot more debate I think. They could quite possibly have different
> requirements than hooks which are chosen just to guarantee periodic
> callbacks into sampling-based governors.

I'll repeat what Rafael said, the periodic callback nature is a
'temporary' hack, simply because current cpufreq depends on that.

The idea is to wean cpufreq off of that requirement and then drop that
part.

Very-much-not-signed-off-by: Peter Zijlstra
---
 kernel/sched/fair.c | 29 +++++++++++++++++++++--------
 1 file changed, 21 insertions(+), 8 deletions(-)


Comments

Vincent Guittot Feb. 22, 2016, 2:33 p.m. UTC | #1
On 22 February 2016 at 11:52, Peter Zijlstra <peterz@infradead.org> wrote:
> On Fri, Feb 19, 2016 at 09:28:23AM -0800, Steve Muckle wrote:
>> On 02/19/2016 08:42 AM, Srinivas Pandruvada wrote:
>> > We did experiments using util/max in intel_pstate. For some benchmarks
>> > there were regressions of 4 to 5%, for some benchmarks it performed at
>> > par with getting utilization from the processor. Further optimization
>> > in the algorithm is possible and still in progress. Idea is that we can
>> > change P-State fast enough and be more reactive. Once I have good data,
>> > I will send to this list. The algorithm can be part of the cpufreq
>> > governor too.
>>
>> There has been a lot of work in the area of scheduler-driven CPU
>> frequency selection by Linaro and ARM as well. It was posted most
>> recently a couple months ago:
>>
>> http://thread.gmane.org/gmane.linux.power-management.general/69176
>>
>> It was also posted as part of the energy-aware scheduling series last
>> July. There's a new RFC series forthcoming which I had hoped (and
>> failed) to post prior to my business travel this week; it should be out
>> next week. It will address the feedback received thus far along with
>> locking and other things.
>
> Right, so I had a wee look at that again, and had a quick chat with Juri
> on IRC. So the main difference seems to be that you guys want to know
> why the utilization changed, as opposed to purely _that_ it changed.

Yes, the main goal was to be able to filter the useful updates of the
rq's utilization from the useless ones, in order to minimize/optimize
how often a frequency update is triggered. These patches were made for
a cpufreq driver that reacts far more slowly than the scheduler. It
might be worth starting with a simple solution and updating it
afterwards.
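
To make the filtering idea concrete, a governor-side consumer could look
roughly like the sketch below. All names here are invented for
illustration (this is not code from the posted series); only the enum
values come from Peter's patch.

static void slow_governor_update(int cpu, unsigned long util,
				 enum load_update_type type)
{
	switch (type) {
	case LOAD_TICK:
	case LOAD_ENQUEUE_MOVE:
	case LOAD_DEQUEUE_MOVE:
		/* periodic refresh, or utilization moved between CPUs:
		 * worth re-evaluating the frequency even for a driver
		 * that reacts slowly */
		request_freq_reevaluation(cpu, util);
		break;
	default:
		/* local enqueue/dequeue noise: skip it, the driver cannot
		 * follow changes at that rate anyway */
		break;
	}
}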

>
> And hence you have callbacks all over the place.
>
> I'm not too sure I really like that too much, it bloats the code and
> somewhat obfuscates the point.
>
> So I would really like there to be just the one callback when we
> actually compute a new number, and that is update_load_avg().
>
> Now I think we can 'easily' propagate the information you want into
> update_load_avg() (see below), but I would like to see actual arguments
> for why you would need this.

Your proposal is interesting, except that we are interested in the
rq's utilization more than the se's, so we should rather use
update_cfs_rq_load_avg plus a few additional places like
attach_entity_load_avg, which bypasses update_cfs_rq_load_avg when it
updates the rq's utilization and load.
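
Roughly what that would look like, as a heavily abbreviated sketch (the
real update_cfs_rq_load_avg() also folds in removed load, and
cpufreq_notify_load() is again a made-up name): notify only when the
root cfs_rq -- i.e. the rq's own utilization -- has actually been
recomputed.

static inline int update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
{
	int decayed;

	decayed = __update_load_avg(now, cpu_of(rq_of(cfs_rq)), &cfs_rq->avg,
				    scale_load_down(cfs_rq->load.weight),
				    cfs_rq->curr != NULL, cfs_rq);

	/* only the root cfs_rq reflects the rq's utilization */
	if (decayed && &rq_of(cfs_rq)->cfs == cfs_rq)
		cpufreq_notify_load(cpu_of(rq_of(cfs_rq)),
				    cfs_rq->avg.util_avg, LOAD_NONE);

	return decayed;
}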

>
> For one, the migration bits don't really make sense. We typically do not
> call migration code local on both cpus, typically just one, but possibly
> neither. That means you cannot actually update the relevant CPU state
> from these sites anyway.
>
>> The scheduler hooks for utilization-based cpufreq operation deserve a
>> lot more debate I think. They could quite possibly have different
>> requirements than hooks which are chosen just to guarantee periodic
>> callbacks into sampling-based governors.
>
> I'll repeat what Rafael said, the periodic callback nature is a
> 'temporary' hack, simply because current cpufreq depends on that.
>
> The idea is to wean cpufreq off of that requirement and then drop that
> part.
>
> Very-much-not-signed-off-by: Peter Zijlstra
> ---
>  kernel/sched/fair.c | 29 +++++++++++++++++++++--------
>  1 file changed, 21 insertions(+), 8 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 7ce24a456322..f3e95d8b65c3 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -2528,6 +2528,17 @@ static inline void update_cfs_shares(struct cfs_rq *cfs_rq)
>  }
>  #endif /* CONFIG_FAIR_GROUP_SCHED */
>
> +enum load_update_type {
> +       LOAD_NONE,
> +       LOAD_TICK,
> +       LOAD_PUT,
> +       LOAD_SET,
> +       LOAD_ENQUEUE,
> +       LOAD_DEQUEUE,
> +       LOAD_ENQUEUE_MOVE = LOAD_ENQUEUE + 2,
> +       LOAD_DEQUEUE_MOVE = LOAD_DEQUEUE + 2,
> +};
> +
>  #ifdef CONFIG_SMP
>  /* Precomputed fixed inverse multiplies for multiplication by y^n */
>  static const u32 runnable_avg_yN_inv[] = {
> @@ -2852,7 +2863,8 @@ static inline int update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
>  }
>
>  /* Update task and its cfs_rq load average */
> -static inline void update_load_avg(struct sched_entity *se, int update_tg)
> +static inline void update_load_avg(struct sched_entity *se, int update_tg,
> +                                  enum load_update_type type)
>  {
>         struct cfs_rq *cfs_rq = cfs_rq_of(se);
>         u64 now = cfs_rq_clock_task(cfs_rq);
> @@ -2940,7 +2952,7 @@ enqueue_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)
>  static inline void
>  dequeue_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)
>  {
> -       update_load_avg(se, 1);
> +       update_load_avg(se, 1, LOAD_DEQUEUE);
>
>         cfs_rq->runnable_load_avg =
>                 max_t(long, cfs_rq->runnable_load_avg - se->avg.load_avg, 0);
> @@ -3006,7 +3018,8 @@ static int idle_balance(struct rq *this_rq);
>
>  #else /* CONFIG_SMP */
>
> -static inline void update_load_avg(struct sched_entity *se, int update_tg) {}
> +static inline void update_load_avg(struct sched_entity *se, int update_tg,
> +                                  enum load_update_type type) {}
>  static inline void
>  enqueue_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se) {}
>  static inline void
> @@ -3327,7 +3340,7 @@ set_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
>                 if (schedstat_enabled())
>                         update_stats_wait_end(cfs_rq, se);
>                 __dequeue_entity(cfs_rq, se);
> -               update_load_avg(se, 1);
> +               update_load_avg(se, 1, LOAD_SET);
>         }
>
>         update_stats_curr_start(cfs_rq, se);
> @@ -3431,7 +3444,7 @@ static void put_prev_entity(struct cfs_rq *cfs_rq, struct sched_entity *prev)
>                 /* Put 'current' back into the tree. */
>                 __enqueue_entity(cfs_rq, prev);
>                 /* in !on_rq case, update occurred at dequeue */
> -               update_load_avg(prev, 0);
> +               update_load_avg(prev, 0, LOAD_PUT);
>         }
>         cfs_rq->curr = NULL;
>  }
> @@ -3447,7 +3460,7 @@ entity_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr, int queued)
>         /*
>          * Ensure that runnable average is periodically updated.
>          */
> -       update_load_avg(curr, 1);
> +       update_load_avg(curr, 1, LOAD_TICK);
>         update_cfs_shares(cfs_rq);
>
>  #ifdef CONFIG_SCHED_HRTICK
> @@ -4320,7 +4333,7 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
>                 if (cfs_rq_throttled(cfs_rq))
>                         break;
>
> -               update_load_avg(se, 1);
> +               update_load_avg(se, 1, LOAD_ENQUEUE + (p->on_rq & TASK_ON_RQ_MIGRATING));
>                 update_cfs_shares(cfs_rq);
>         }
>
> @@ -4380,7 +4393,7 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
>                 if (cfs_rq_throttled(cfs_rq))
>                         break;
>
> -               update_load_avg(se, 1);
> +               update_load_avg(se, 1, LOAD_DEQUEUE + (p->on_rq & TASK_ON_RQ_MIGRATING));
>                 update_cfs_shares(cfs_rq);
>         }
>
Juri Lelli Feb. 22, 2016, 2:40 p.m. UTC | #2
Hi Peter,

On 22/02/16 11:52, Peter Zijlstra wrote:
> On Fri, Feb 19, 2016 at 09:28:23AM -0800, Steve Muckle wrote:
> > On 02/19/2016 08:42 AM, Srinivas Pandruvada wrote:
> > > We did experiments using util/max in intel_pstate. For some benchmarks
> > > there were regressions of 4 to 5%, for some benchmarks it performed at
> > > par with getting utilization from the processor. Further optimization
> > > in the algorithm is possible and still in progress. Idea is that we can
> > > change P-State fast enough and be more reactive. Once I have good data,
> > > I will send to this list. The algorithm can be part of the cpufreq
> > > governor too.
> > 
> > There has been a lot of work in the area of scheduler-driven CPU
> > frequency selection by Linaro and ARM as well. It was posted most
> > recently a couple months ago:
> > 
> > http://thread.gmane.org/gmane.linux.power-management.general/69176
> > 
> > It was also posted as part of the energy-aware scheduling series last
> > July. There's a new RFC series forthcoming which I had hoped (and
> > failed) to post prior to my business travel this week; it should be out
> > next week. It will address the feedback received thus far along with
> > locking and other things.
> 
> Right, so I had a wee look at that again, and had a quick chat with Juri
> on IRC. So the main difference seems to be that you guys want to know
> why the utilization changed, as opposed to purely _that_ it changed.
> 
> And hence you have callbacks all over the place.
> 
> I'm not too sure I really like that too much, it bloats the code and
> somewhat obfuscates the point.
> 
> So I would really like there to be just the one callback when we
> actually compute a new number, and that is update_load_avg().
> 
> Now I think we can 'easily' propagate the information you want into
> update_load_avg() (see below), but I would like to see actual arguments
> for why you would need this.
> 

Right. The information we propagate with your patch might be all we
need, but I'll have to play with it on top of Rafael's or Steve's
changes to fully convince myself. :-)

> For one, the migration bits don't really make sense. We typically do not
> call migration code local on both cpus, typically just one, but possibly
> neither. That means you cannot actually update the relevant CPU state
> from these sites anyway.
> 

I might actually have one point regarding migrations. See below. And I'm
not sure I understand why you are saying that we can't update the
relevant CPU state on migrations; we do know src and dst cpus, don't we?

[...]

> @@ -4320,7 +4333,7 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
>  		if (cfs_rq_throttled(cfs_rq))
>  			break;
>  
> -		update_load_avg(se, 1);
> +		update_load_avg(se, 1, LOAD_ENQUEUE + (p->on_rq & TASK_ON_RQ_MIGRATING));
>  		update_cfs_shares(cfs_rq);
>  	}
>  
> @@ -4380,7 +4393,7 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
>  		if (cfs_rq_throttled(cfs_rq))
>  			break;
>  
> -		update_load_avg(se, 1);
> +		update_load_avg(se, 1, LOAD_DEQUEUE + (p->on_rq & TASK_ON_RQ_MIGRATING));
>  		update_cfs_shares(cfs_rq);
>  	}
>  

What we are trying to do with the sched-freq approach (and maybe that is
just broken :-/) is to wait until all tasks are detached from src cpu
and attached to dst cpu to trigger updates on such cpus. I fear that if
we don't do that we might have problems with any sort of rate limiting for
freq transitions we might need to put in place.
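
For concreteness, the kind of rate limiting being referred to is roughly
the sketch below (the struct and field names are invented). If
per-task migration callbacks fire while tasks are still being moved one
by one, each of them eats into this budget, and the update that finally
reflects the complete migration may be the one that gets dropped.

static bool freq_update_allowed(struct freq_policy_state *st, u64 now)
{
	/* refuse a transition that comes too soon after the previous one */
	if (now - st->last_transition_ns < st->min_interval_ns)
		return false;

	st->last_transition_ns = now;
	return true;
}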

Best,

- Juri
Peter Zijlstra Feb. 22, 2016, 3:42 p.m. UTC | #3
On Mon, Feb 22, 2016 at 02:40:01PM +0000, Juri Lelli wrote:

> > For one, the migration bits don't really make sense. We typically do not
> > call migration code local on both cpus, typically just one, but possibly
> > neither. That means you cannot actually update the relevant CPU state
> > from these sites anyway.
> > 
> 
> I might actually have one point regarding migrations. See below. And I'm
> not sure I understand why you are saying that we can't update the
> relevant CPU state on migrations; we do know src and dst cpus, don't we?
> 
> [...]
> 
> > @@ -4320,7 +4333,7 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> >  		if (cfs_rq_throttled(cfs_rq))
> >  			break;
> >  
> > -		update_load_avg(se, 1);
> > +		update_load_avg(se, 1, LOAD_ENQUEUE + (p->on_rq & TASK_ON_RQ_MIGRATING));
> >  		update_cfs_shares(cfs_rq);
> >  	}
> >  
> > @@ -4380,7 +4393,7 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> >  		if (cfs_rq_throttled(cfs_rq))
> >  			break;
> >  
> > -		update_load_avg(se, 1);
> > +		update_load_avg(se, 1, LOAD_DEQUEUE + (p->on_rq & TASK_ON_RQ_MIGRATING));
> >  		update_cfs_shares(cfs_rq);
> >  	}
> >  

Well, yes, you have the src and dst cpu numbers, but if you want to
access that data remotely you'll have to go add atomic ops or locking.

And you'll have to go trigger IPIs to program remote state (or wait for
the next event on the CPU).

That all is expensive.
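
The cost being referred to, sketched with a standard kernel primitive
(the IPI handler body is just a placeholder): pushing a frequency
re-evaluation to a remote CPU means a cross-CPU interrupt, or waiting
for whatever the remote CPU does next.

static void reeval_freq_ipi(void *info)
{
	/* placeholder: whatever re-evaluates the frequency on this CPU */
}

static void kick_remote_freq_update(int cpu)
{
	/* wait == 0: fire-and-forget, but still an interrupt on 'cpu' */
	smp_call_function_single(cpu, reeval_freq_ipi, NULL, 0);
}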

> What we are trying to do with the sched-freq approach (and maybe that is
> just broken :-/) is to wait until all tasks are detached from src cpu
> and attached to dst cpu to trigger updates on such cpus. I fear that if
> we don't do that we might have problems with any sort of rate limiting for
> freq transitions we might need to put in place.

Hurm.. tricky that :-)
Rafael J. Wysocki Feb. 22, 2016, 9:46 p.m. UTC | #4
On Mon, Feb 22, 2016 at 11:52 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Fri, Feb 19, 2016 at 09:28:23AM -0800, Steve Muckle wrote:
>> On 02/19/2016 08:42 AM, Srinivas Pandruvada wrote:
>> > We did experiments using util/max in intel_pstate. For some benchmarks
>> > there were regressions of 4 to 5%, for some benchmarks it performed at
>> > par with getting utilization from the processor. Further optimization
>> > in the algorithm is possible and still in progress. Idea is that we can
>> > change P-State fast enough and be more reactive. Once I have good data,
>> > I will send to this list. The algorithm can be part of the cpufreq
>> > governor too.
>>
>> There has been a lot of work in the area of scheduler-driven CPU
>> frequency selection by Linaro and ARM as well. It was posted most
>> recently a couple months ago:
>>
>> http://thread.gmane.org/gmane.linux.power-management.general/69176
>>
>> It was also posted as part of the energy-aware scheduling series last
>> July. There's a new RFC series forthcoming which I had hoped (and
>> failed) to post prior to my business travel this week; it should be out
>> next week. It will address the feedback received thus far along with
>> locking and other things.
>
> Right, so I had a wee look at that again, and had a quick chat with Juri
> on IRC. So the main difference seems to be that you guys want to know
> why the utilization changed, as opposed to purely _that_ it changed.
>
> And hence you have callbacks all over the place.
>
> I'm not too sure I really like that too much, it bloats the code and
> somewhat obfuscates the point.
>
> So I would really like there to be just the one callback when we
> actually compute a new number, and that is update_load_avg().
>
> Now I think we can 'easily' propagate the information you want into
> update_load_avg() (see below), but I would like to see actual arguments
> for why you would need this.
>
> For one, the migration bits don't really make sense. We typically do not
> call migration code local on both cpus, typically just one, but possibly
> neither. That means you cannot actually update the relevant CPU state
> from these sites anyway.
>
>> The scheduler hooks for utilization-based cpufreq operation deserve a
>> lot more debate I think. They could quite possibly have different
>> requirements than hooks which are chosen just to guarantee periodic
>> callbacks into sampling-based governors.
>
> I'll repeat what Rafael said, the periodic callback nature is a
> 'temporary' hack, simply because current cpufreq depends on that.
>
> The idea is to wean cpufreq off of that requirement and then drop that
> part.

Right, and I can see at least a couple of ways to do that, but it'll
depend on where the final hooks will be located and what arguments
they will pass.

Thanks,
Rafael

Patch

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 7ce24a456322..f3e95d8b65c3 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2528,6 +2528,17 @@  static inline void update_cfs_shares(struct cfs_rq *cfs_rq)
 }
 #endif /* CONFIG_FAIR_GROUP_SCHED */
 
+enum load_update_type {
+	LOAD_NONE,
+	LOAD_TICK,
+	LOAD_PUT,
+	LOAD_SET,
+	LOAD_ENQUEUE,
+	LOAD_DEQUEUE,
+	LOAD_ENQUEUE_MOVE = LOAD_ENQUEUE + 2,
+	LOAD_DEQUEUE_MOVE = LOAD_DEQUEUE + 2,
+};
+
 #ifdef CONFIG_SMP
 /* Precomputed fixed inverse multiplies for multiplication by y^n */
 static const u32 runnable_avg_yN_inv[] = {
@@ -2852,7 +2863,8 @@  static inline int update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
 }
 
 /* Update task and its cfs_rq load average */
-static inline void update_load_avg(struct sched_entity *se, int update_tg)
+static inline void update_load_avg(struct sched_entity *se, int update_tg,
+				   enum load_update_type type)
 {
 	struct cfs_rq *cfs_rq = cfs_rq_of(se);
 	u64 now = cfs_rq_clock_task(cfs_rq);
@@ -2940,7 +2952,7 @@  enqueue_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)
 static inline void
 dequeue_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
-	update_load_avg(se, 1);
+	update_load_avg(se, 1, LOAD_DEQUEUE);
 
 	cfs_rq->runnable_load_avg =
 		max_t(long, cfs_rq->runnable_load_avg - se->avg.load_avg, 0);
@@ -3006,7 +3018,8 @@  static int idle_balance(struct rq *this_rq);
 
 #else /* CONFIG_SMP */
 
-static inline void update_load_avg(struct sched_entity *se, int update_tg) {}
+static inline void update_load_avg(struct sched_entity *se, int update_tg,
+				   enum load_update_type type) {}
 static inline void
 enqueue_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se) {}
 static inline void
@@ -3327,7 +3340,7 @@  set_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
 		if (schedstat_enabled())
 			update_stats_wait_end(cfs_rq, se);
 		__dequeue_entity(cfs_rq, se);
-		update_load_avg(se, 1);
+		update_load_avg(se, 1, LOAD_SET);
 	}
 
 	update_stats_curr_start(cfs_rq, se);
@@ -3431,7 +3444,7 @@  static void put_prev_entity(struct cfs_rq *cfs_rq, struct sched_entity *prev)
 		/* Put 'current' back into the tree. */
 		__enqueue_entity(cfs_rq, prev);
 		/* in !on_rq case, update occurred at dequeue */
-		update_load_avg(prev, 0);
+		update_load_avg(prev, 0, LOAD_PUT);
 	}
 	cfs_rq->curr = NULL;
 }
@@ -3447,7 +3460,7 @@  entity_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr, int queued)
 	/*
 	 * Ensure that runnable average is periodically updated.
 	 */
-	update_load_avg(curr, 1);
+	update_load_avg(curr, 1, LOAD_TICK);
 	update_cfs_shares(cfs_rq);
 
 #ifdef CONFIG_SCHED_HRTICK
@@ -4320,7 +4333,7 @@  enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 		if (cfs_rq_throttled(cfs_rq))
 			break;
 
-		update_load_avg(se, 1);
+		update_load_avg(se, 1, LOAD_ENQUEUE + (p->on_rq & TASK_ON_RQ_MIGRATING));
 		update_cfs_shares(cfs_rq);
 	}
 
@@ -4380,7 +4393,7 @@  static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 		if (cfs_rq_throttled(cfs_rq))
 			break;
 
-		update_load_avg(se, 1);
+		update_load_avg(se, 1, LOAD_DEQUEUE + (p->on_rq & TASK_ON_RQ_MIGRATING));
 		update_cfs_shares(cfs_rq);
 	}