
[2/2] sched/fair: do not call cpufreq hook unless util changed

Message ID 1458606068-7476-2-git-send-email-smuckle@linaro.org (mailing list archive)
State RFC

Commit Message

Steve Muckle March 22, 2016, 12:21 a.m. UTC
There's no reason to call the cpufreq hook if the root cfs_rq
utilization has not been modified.

Signed-off-by: Steve Muckle <smuckle@linaro.org>
---
 kernel/sched/fair.c | 10 ++++++----
 1 file changed, 6 insertions(+), 4 deletions(-)

Comments

Sai March 24, 2016, 11:47 p.m. UTC | #1
Hi Steve,

On 03/21/2016 05:21 PM, Steve Muckle wrote:
> There's no reason to call the cpufreq hook if the root cfs_rq
> utilization has not been modified.
> 
> Signed-off-by: Steve Muckle <smuckle@linaro.org>
> ---
>  kernel/sched/fair.c | 10 ++++++----
>  1 file changed, 6 insertions(+), 4 deletions(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index d418deb04049..af58e826cffe 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -2826,20 +2826,21 @@ static inline int update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
>  {
>  	struct sched_avg *sa = &cfs_rq->avg;
>  	struct rq *rq = rq_of(cfs_rq);
> -	int decayed, removed = 0;
> +	int decayed, removed_load = 0, removed_util = 0;
>  	int cpu = cpu_of(rq);
>  
>  	if (atomic_long_read(&cfs_rq->removed_load_avg)) {
>  		s64 r = atomic_long_xchg(&cfs_rq->removed_load_avg, 0);
>  		sa->load_avg = max_t(long, sa->load_avg - r, 0);
>  		sa->load_sum = max_t(s64, sa->load_sum - r * LOAD_AVG_MAX, 0);
> -		removed = 1;
> +		removed_load = 1;
>  	}
>  
>  	if (atomic_long_read(&cfs_rq->removed_util_avg)) {
>  		long r = atomic_long_xchg(&cfs_rq->removed_util_avg, 0);
>  		sa->util_avg = max_t(long, sa->util_avg - r, 0);
>  		sa->util_sum = max_t(s32, sa->util_sum - r * LOAD_AVG_MAX, 0);
> +		removed_util = 1;
>  	}
>  
>  	decayed = __update_load_avg(now, cpu, sa,
> @@ -2850,7 +2851,8 @@ static inline int update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
>  	cfs_rq->load_last_update_time_copy = sa->last_update_time;
>  #endif
>  
> -	if (cpu == smp_processor_id() && &rq->cfs == cfs_rq) {
> +	if (cpu == smp_processor_id() && &rq->cfs == cfs_rq &&
> +	    (decayed || removed_util)) {
>  		unsigned long max = rq->cpu_capacity_orig;

Should this filtering instead happen on the governor side?

Even if the CFS load itself didn't change, we could have switched from an
RT/DL thread to a CFS thread so util would have to be updated from ULONG_MAX
to whatever CFS needs right?

Also now that CFS enqueue calls cpufreq_update_util, RT/DL requests could
potentially get overridden.

-Sai
Steve Muckle March 25, 2016, 1:01 a.m. UTC | #2
Hi Sai,

On 03/24/2016 04:47 PM, Sai Gurrappadi wrote:
>> @@ -2850,7 +2851,8 @@ static inline int update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
>>  	cfs_rq->load_last_update_time_copy = sa->last_update_time;
>>  #endif
>>  
>> -	if (cpu == smp_processor_id() && &rq->cfs == cfs_rq) {
>> +	if (cpu == smp_processor_id() && &rq->cfs == cfs_rq &&
>> +	    (decayed || removed_util)) {
>>  		unsigned long max = rq->cpu_capacity_orig;
> 
> Should this filtering instead happen on the governor side?

Perhaps but that also means making a trip into that logic from this hot
path. To me it seemed better to avoid the overhead if possible,
especially since we already have info here on whether the util changed.
But if everyone agrees the overhead is negligible I'm happy to drop the
patch.
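
For concreteness, a governor-side filter along those lines might look
roughly like the sketch below. The sg_cpu/sugov_update names are made
up and the update_util_data callback signature is just what I assume
the hook passes today, so treat it as illustrative only:

/*
 * Hypothetical governor-side filter: cache the last utilization seen
 * and return early when the request hasn't changed.
 */
struct sg_cpu {
	struct update_util_data update_util;	/* hook registered with the scheduler */
	unsigned long last_util;
	unsigned long last_max;
};

static void sugov_update(struct update_util_data *data, u64 time,
			 unsigned long util, unsigned long max)
{
	struct sg_cpu *sg = container_of(data, struct sg_cpu, update_util);

	/* Nothing changed since the last callback, skip frequency selection. */
	if (util == sg->last_util && max == sg->last_max)
		return;

	sg->last_util = util;
	sg->last_max = max;

	/* ... evaluate and request a new frequency based on util/max ... */
}

Even then we'd still pay for the indirect call on every update, which
is the overhead I'd like to avoid.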

> Even if the CFS load itself didn't change, we could have switched from an
> RT/DL thread to a CFS thread so util would have to be updated from ULONG_MAX
> to whatever CFS needs right?

Agreed, given the current state of things this will delay the ramp down
in that case. The current scheme of having a single vote for CPU
capacity seems broken overall to me however.

If the CPU goes idle after RT/DL execution we'll leave the vote at fmax
until cpufreq_sched starts ignoring it due to staleness.

More importantly though, without capacity vote aggregation from
CFS/RT/DL it doesn't seem possible to ensure appropriate capacity. If
CFS keeps setting the capacity based solely on its own requirement
whenever it runs, and there is RT or DL utilization in the system,
won't that utilization tend to be underserved? It may actually be
better to be lazy in
ramping down from fmax to compensate for not including RT/DL's
utilization, until we can more accurately calculate it.

We need vote aggregation from each sched class. This has been posted
both as part of the now-defunct schedfreq series as well as Mike
Turquette's recent series, which I hear he's working on rebasing.
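
As a rough illustration of the kind of aggregation I mean (the names
and structure here are invented, not taken from either series):

/*
 * Hypothetical per-class capacity vote aggregation: each sched class
 * records its own request and the governor acts on the combined value.
 */
enum capacity_vote_class { CAP_VOTE_CFS, CAP_VOTE_RT, CAP_VOTE_DL, CAP_VOTE_NR };

struct capacity_votes {
	unsigned long vote[CAP_VOTE_NR];
};

static unsigned long aggregate_capacity(struct capacity_votes *cv,
					unsigned long max)
{
	unsigned long req = 0;
	int i;

	/* Sum the per-class requests and clamp to the CPU's capacity. */
	for (i = 0; i < CAP_VOTE_NR; i++)
		req += cv->vote[i];

	return min(req, max);
}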

Once that is in we need to decide how RT tasks should vote. I'm not
really a fan of the decision to run them at fmax. I think this changes
their semantics and it will be a non-starter for platforms with power
constraints and/or slow frequency transition times. Perhaps we could
make it configurable how the RT class should vote. It should be the RT
class's responsibility though, IMO, to reduce/drop its vote when
necessary, which would address your concern above.

> Also now that CFS enqueue calls cpufreq_update_util, RT/DL requests could
> potentially get overridden.

I think this was already busted - enqueue_task_fair() calls
update_load_avg() on the sched entities in the hierarchy which were
already enqueued.
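
Roughly, enqueue_task_fair() ends with a loop over the already-enqueued
part of the hierarchy along these lines (paraphrased, not an exact
quote of the current code):

	for_each_sched_entity(se) {
		cfs_rq = cfs_rq_of(se);
		cfs_rq->h_nr_running++;

		if (cfs_rq_throttled(cfs_rq))
			break;

		/* Updates the root cfs_rq avg and so fires the cpufreq hook. */
		update_load_avg(se, 1);
		update_cfs_shares(cfs_rq);
	}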

thanks,
Steve
Sai March 25, 2016, 9:24 p.m. UTC | #3
On 03/24/2016 06:01 PM, Steve Muckle wrote:
> Hi Sai,
> 
> On 03/24/2016 04:47 PM, Sai Gurrappadi wrote:
>>> @@ -2850,7 +2851,8 @@ static inline int update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
>>>  	cfs_rq->load_last_update_time_copy = sa->last_update_time;
>>>  #endif
>>>  
>>> -	if (cpu == smp_processor_id() && &rq->cfs == cfs_rq) {
>>> +	if (cpu == smp_processor_id() && &rq->cfs == cfs_rq &&
>>> +	    (decayed || removed_util)) {
>>>  		unsigned long max = rq->cpu_capacity_orig;
>>
>> Should this filtering instead happen on the governor side?
> 
> Perhaps but that also means making a trip into that logic from this hot
> path. To me it seemed better to avoid the overhead if possible,
> especially since we already have info here on whether the util changed.
> But if everyone agrees the overhead is negligible I'm happy to drop the
> patch.

I agree, ideally we skip a pointless function call as long as we make sure we
aren't dropping important frequency requests.

> 
>> Even if the CFS load itself didn't change, we could have switched from an
>> RT/DL thread to a CFS thread so util would have to be updated from ULONG_MAX
>> to whatever CFS needs right?
> 
> Agreed, given the current state of things this will delay the ramp down
> in that case. The current scheme of having a single vote for CPU
> capacity seems broken overall to me however.

Yup, the current hooks just stomp on each other. Need to organize that better.
Not to mention the interaction with the throttling bit on the governor side.

> 
> If the CPU goes idle after RT/DL execution we'll leave the vote at fmax
> until cpufreq_sched starts ignoring it due to staleness.

Note that the same thing can happen due to throttling on the governor
side, so it isn't just this change per se. I do realize it won't be
possible to track all freq. requests perfectly, especially given how
often the hook fires right now, but it's something to keep in mind.
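
To spell out the throttling case (invented names, just a sketch of the
kind of logic the governor has, not the actual cpufreq_sched code):

struct sg_policy {
	u64 last_freq_update_time;
	u64 throttle_ns;
};

static bool sugov_should_update(struct sg_policy *sg, u64 time)
{
	/* Too soon after the previous frequency change, drop the request. */
	if (time - sg->last_freq_update_time < sg->throttle_ns)
		return false;

	sg->last_freq_update_time = time;
	return true;
}

A ramp-down request that lands inside the throttle window is lost in
exactly the same way.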

> 
> More importantly though, without capacity vote aggregation from
> CFS/RT/DL it doesn't seem possible to ensure appropriate capacity. If
> CFS keeps setting the capacity based solely on its own requirement
> whenever it runs, and there is RT or DL utilization in the system,
> won't that utilization tend to be underserved? It may actually be
> better to be lazy in
> ramping down from fmax to compensate for not including RT/DL's
> utilization, until we can more accurately calculate it.

We could potentially make CFS requests based off of the max available
CFS capacity, i.e. cpu_capacity instead of cpu_capacity_orig. But yes,
I agree, there needs to be better interaction between the classes.
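
In the hunk above that would roughly mean (sketch only):

	if (cpu == smp_processor_id() && &rq->cfs == cfs_rq &&
	    (decayed || removed_util)) {
		/*
		 * Scale the CFS request against the capacity left over
		 * after RT/DL rather than the full original capacity.
		 */
		unsigned long max = rq->cpu_capacity;
		...
	}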

> 
> We need vote aggregation from each sched class. This has been posted
> both as part of the now-defunct schedfreq series as well as Mike
> Turquette's recent series, which I hear he's working on rebasing.
> 
> Once that is in we need to decide how RT tasks should vote. I'm not
> really a fan of the decision to run them at fmax. I think this changes
> their semantics and it will be a non-starter for platforms with power
> constraints and/or slow frequency transition times. Perhaps we could
> make it configurable how the RT class should vote. It should be the RT
> class's responsibility though, IMO, to reduce/drop its vote when
> necessary, which would address your concern above.

The Fmax bit for RT I think is very much a per-platform decision based on
voltage ramp up/down times, power/thermal budget etc.

> 
>> Also now that CFS enqueue calls cpufreq_update_util, RT/DL requests could
>> potentially get overridden.
> 
> I think this was already busted - enqueue_task_fair() calls
> update_load_avg() on the sched entities in the hierarchy which were
> already enqueued.

Yup, that was busted before too.

-Sai

Patch

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index d418deb04049..af58e826cffe 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2826,20 +2826,21 @@  static inline int update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
 {
 	struct sched_avg *sa = &cfs_rq->avg;
 	struct rq *rq = rq_of(cfs_rq);
-	int decayed, removed = 0;
+	int decayed, removed_load = 0, removed_util = 0;
 	int cpu = cpu_of(rq);
 
 	if (atomic_long_read(&cfs_rq->removed_load_avg)) {
 		s64 r = atomic_long_xchg(&cfs_rq->removed_load_avg, 0);
 		sa->load_avg = max_t(long, sa->load_avg - r, 0);
 		sa->load_sum = max_t(s64, sa->load_sum - r * LOAD_AVG_MAX, 0);
-		removed = 1;
+		removed_load = 1;
 	}
 
 	if (atomic_long_read(&cfs_rq->removed_util_avg)) {
 		long r = atomic_long_xchg(&cfs_rq->removed_util_avg, 0);
 		sa->util_avg = max_t(long, sa->util_avg - r, 0);
 		sa->util_sum = max_t(s32, sa->util_sum - r * LOAD_AVG_MAX, 0);
+		removed_util = 1;
 	}
 
 	decayed = __update_load_avg(now, cpu, sa,
@@ -2850,7 +2851,8 @@  static inline int update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
 	cfs_rq->load_last_update_time_copy = sa->last_update_time;
 #endif
 
-	if (cpu == smp_processor_id() && &rq->cfs == cfs_rq) {
+	if (cpu == smp_processor_id() && &rq->cfs == cfs_rq &&
+	    (decayed || removed_util)) {
 		unsigned long max = rq->cpu_capacity_orig;
 
 		/*
@@ -2873,7 +2875,7 @@  static inline int update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
 				    min(sa->util_avg, max), max);
 	}
 
-	return decayed || removed;
+	return decayed || removed_load;
 }
 
 /* Update task and its cfs_rq load average */