[v2,2/6] cpufreq: schedutil: reset sg_cpus's flags at IDLE enter

Message ID 1499189651-18797-3-git-send-email-patrick.bellasi@arm.com (mailing list archive)
State Deferred

Commit Message

Patrick Bellasi July 4, 2017, 5:34 p.m. UTC
Currently, sg_cpu's flags are set to the value defined by the last call to
cpufreq_update_util()/cpufreq_update_this_cpu(); for the RT/DL classes
this corresponds to the SCHED_CPUFREQ_{RT/DL} flags always being set.

When multiple CPUs share the same frequency domain, it might happen that a
CPU which executed an RT task right before entering IDLE keeps one of the
SCHED_CPUFREQ_RT_DL flags set, permanently, until it exits IDLE.

Although such an idle CPU is _going to be_ ignored by
sugov_next_freq_shared():
  1. this kind of "useless RT request" is ignored only if more than
     TICK_NSEC has elapsed since the last update
  2. we can still potentially trigger an already too late switch to
     MAX, which also starts a new throttling interval
  3. the internal state machine is not consistent with what the
     scheduler knows, i.e. that the CPU is now actually idle

Thus, in sugov_next_freq_shared(), where utilisation and flags are
aggregated across all the CPUs of a frequency domain, it can turn out
that all the CPUs of that domain run unnecessarily at the maximum OPP
until another event happens on the idle CPU, which eventually clears the
SCHED_CPUFREQ_{RT/DL} flag, or until that CPU gets ignored because
TICK_NSEC has elapsed since it entered IDLE.

Such behaviour can harm the energy efficiency of systems where RT
workloads are infrequent and other CPUs in the same frequency domain
are running small-utilisation workloads, which is quite a common
scenario in mobile embedded systems.

This patch proposes a solution aligned with the current principle of
updating the flags each time a scheduling event happens. The scheduling
of the idle_task on a CPU is considered one such meaningful event.
That's why, when the idle_task is selected for execution, we poke the
schedutil policy to reset the flags for that CPU.

No frequency transition is triggered at that point, which is fair in
case the RT workload comes back in the future. However, this still
allows other CPUs in the same frequency domain to scale down the
frequency, in case that is possible.

Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Cc: linux-kernel@vger.kernel.org
Cc: linux-pm@vger.kernel.org

---
Changes from v1:
- added "unlikely()" around the statement (SteveR)
---
 include/linux/sched/cpufreq.h    | 1 +
 kernel/sched/cpufreq_schedutil.c | 7 +++++++
 kernel/sched/idle_task.c         | 4 ++++
 3 files changed, 12 insertions(+)

Comments

Viresh Kumar July 5, 2017, 4:50 a.m. UTC | #1
On 04-07-17, 18:34, Patrick Bellasi wrote:
> diff --git a/include/linux/sched/cpufreq.h b/include/linux/sched/cpufreq.h
> index d2be2cc..36ac8d2 100644
> --- a/include/linux/sched/cpufreq.h
> +++ b/include/linux/sched/cpufreq.h
> @@ -10,6 +10,7 @@
>  #define SCHED_CPUFREQ_RT	(1U << 0)
>  #define SCHED_CPUFREQ_DL	(1U << 1)
>  #define SCHED_CPUFREQ_IOWAIT	(1U << 2)
> +#define SCHED_CPUFREQ_IDLE	(1U << 3)
>  
>  #define SCHED_CPUFREQ_RT_DL	(SCHED_CPUFREQ_RT | SCHED_CPUFREQ_DL)
>  
> diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
> index eaba6d6..004ae18 100644
> --- a/kernel/sched/cpufreq_schedutil.c
> +++ b/kernel/sched/cpufreq_schedutil.c
> @@ -304,6 +304,12 @@ static void sugov_update_shared(struct update_util_data *hook, u64 time,
>  
>  	sg_cpu->util = util;
>  	sg_cpu->max = max;
> +
> +	/* CPU is entering IDLE, reset flags without triggering an update */
> +	if (unlikely(flags & SCHED_CPUFREQ_IDLE)) {
> +		sg_cpu->flags = 0;
> +		goto done;
> +	}

Why is it important to have the above diff at all? For example, we aren't
doing anything similar in sugov_update_single(), which will go on and try to
change the frequency if rate_limit_us has elapsed since the last update.

Also, why is it important to write 0 to sg_cpu->flags? What wouldn't work if
we set sg_cpu->flags to SCHED_CPUFREQ_IDLE in this case? i.e. just the below
statement should be good for us.

>  	sg_cpu->flags = flags;
>  
>  	sugov_set_iowait_boost(sg_cpu, time, flags);
> @@ -318,6 +324,7 @@ static void sugov_update_shared(struct update_util_data *hook, u64 time,
>  		sugov_update_commit(sg_policy, time, next_f);
>  	}
>  
> +done:
>  	raw_spin_unlock(&sg_policy->update_lock);
>  }
>  
> diff --git a/kernel/sched/idle_task.c b/kernel/sched/idle_task.c
> index 0c00172..a844c91 100644
> --- a/kernel/sched/idle_task.c
> +++ b/kernel/sched/idle_task.c
> @@ -29,6 +29,10 @@ pick_next_task_idle(struct rq *rq, struct task_struct *prev, struct rq_flags *rf
>  	put_prev_task(rq, prev);
>  	update_idle_core(rq);
>  	schedstat_inc(rq->sched_goidle);
> +
> +	/* kick cpufreq (see the comment in kernel/sched/sched.h). */
> +	cpufreq_update_this_cpu(rq, SCHED_CPUFREQ_IDLE);
> +

This looks correct.

Can we completely avoid the utilization contribution of the CPUs which have
gone idle? Right now we avoid them with the help of (delta_ns > TICK_NSEC).
Can we instead check this SCHED_CPUFREQ_IDLE flag?
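
i.e. assuming sg_cpu->flags gets set to SCHED_CPUFREQ_IDLE at idle enter
(instead of being cleared to 0), something like this in the
sugov_next_freq_shared() loop (untested, just to show the idea):

	for_each_cpu(j, policy->cpus) {
		struct sugov_cpu *j_sg_cpu = &sg_policy->sg_cpu[j];

		/* Skip idle CPUs based on the flag, not on elapsed time */
		if (j_sg_cpu->flags & SCHED_CPUFREQ_IDLE) {
			j_sg_cpu->iowait_boost = 0;
			continue;
		}
		...
	}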
Patrick Bellasi July 5, 2017, 1:04 p.m. UTC | #2
On 05-Jul 10:20, Viresh Kumar wrote:
> On 04-07-17, 18:34, Patrick Bellasi wrote:

[...]

> > @@ -304,6 +304,12 @@ static void sugov_update_shared(struct update_util_data *hook, u64 time,
> >  
> >  	sg_cpu->util = util;
> >  	sg_cpu->max = max;
> > +
> > +	/* CPU is entering IDLE, reset flags without triggering an update */
> > +	if (unlikely(flags & SCHED_CPUFREQ_IDLE)) {
> > +		sg_cpu->flags = 0;
> > +		goto done;
> > +	}
> 
> Why is it important to have the above diff at all? For example, we aren't
> doing anything similar in sugov_update_single(), which will go on and try to
> change the frequency if rate_limit_us has elapsed since the last update.

The purpose here is mainly to avoid idle CPUs interfering with other
CPUs in the same frequency domain, by just resetting their
"requests".

In the single-CPU case, it's completely up to the policy to decide what
to do when we enter IDLE, without any risk of affecting other CPUs.
But perhaps you are right: maybe we should use the same heuristic in
both cases, i.e. entering idle just resets the flags and does not
enforce, for example, a frequency drop.
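
Something like this, perhaps (just a sketch, not tested):

	static void sugov_update_single(struct update_util_data *hook, u64 time,
					unsigned int flags)
	{
		...
		/* CPU is entering IDLE: don't trigger a frequency update */
		if (unlikely(flags & SCHED_CPUFREQ_IDLE))
			return;
		...
	}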

> Also, why is it important to write 0 to sg_cpu->flags? What wouldn't work if
> we set sg_cpu->flags to SCHED_CPUFREQ_IDLE in this case? i.e. just the below
> statement should be good for us.

Let's say flags has the RT/DL flag set when the RT task goes to sleep:
is there any specific reason to keep this flag set while the CPU is
IDLE? IOW, why should we care about information related to an event
which is now over?

The proposal of this patch is just meant to make sure that the flags,
being a state variable, always describe the current status of the
sugov "state machine".
If a CPU is IDLE there are no meaningful events going on, and thus the
flags had better be reset.

> 
> >  	sg_cpu->flags = flags;
> >  
> >  	sugov_set_iowait_boost(sg_cpu, time, flags);
> > @@ -318,6 +324,7 @@ static void sugov_update_shared(struct update_util_data *hook, u64 time,
> >  		sugov_update_commit(sg_policy, time, next_f);
> >  	}
> >  
> > +done:
> >  	raw_spin_unlock(&sg_policy->update_lock);
> >  }
> >  
> > diff --git a/kernel/sched/idle_task.c b/kernel/sched/idle_task.c
> > index 0c00172..a844c91 100644
> > --- a/kernel/sched/idle_task.c
> > +++ b/kernel/sched/idle_task.c
> > @@ -29,6 +29,10 @@ pick_next_task_idle(struct rq *rq, struct task_struct *prev, struct rq_flags *rf
> >  	put_prev_task(rq, prev);
> >  	update_idle_core(rq);
> >  	schedstat_inc(rq->sched_goidle);
> > +
> > +	/* kick cpufreq (see the comment in kernel/sched/sched.h). */
> > +	cpufreq_update_this_cpu(rq, SCHED_CPUFREQ_IDLE);
> > +
> 
> This looks correct.
> 
> Can we completely avoid the utilization contribution of the CPUs which have
> gone idle? Right now we avoid them with the help of (delta_ns > TICK_NSEC).
> Can we instead check this SCHED_CPUFREQ_IDLE flag?

I would say that the blocked utilization of an IDLE CPU is still worth
considering, at least for a limited amount of time, for a few main
reasons:

1. it represents CPU bandwidth that is likely to be required by a task
   which can wake up in a short while. Consider for example an 80% task
   activated every 16ms: even if it's not running right now, it's
   likely to wake up in the next ~3ms and run for the following ~13ms.
   Thus, we should probably still consider that CPU's utilization.

2. we already have policies to gracefully reduce the current OPP when
   its utilization decreases. This means that we are interested in a
   sort of policy which favors higher OPPs, to avoid impacting the
   performance of tasks which suddenly wake up.

3. a CPU entering IDLE is not a great source of new information for
   OPP selection, so I would not strictly bind an OPP change to this
   event. That's also why this patch proposes to clear the flags
   without actually triggering an OPP change.

Moreover, maybe the issue you are trying to solve is more related to
having stale utilization for an IDLE CPU?
In that case we should fix the real source of the issue, which is the
utilization of an IDLE CPU not being updated over time. But that's
outside the scope of this series.

Cheers,
Patrick
Viresh Kumar July 6, 2017, 5:46 a.m. UTC | #3
On 05-07-17, 14:04, Patrick Bellasi wrote:
> On 05-Jul 10:20, Viresh Kumar wrote:
> > Also, why is it important to write 0 to sg_cpu->flags? What wouldn't work if
> > we set sg_cpu->flags to SCHED_CPUFREQ_IDLE in this case? i.e. just the below
> > statement should be good for us.
> 
> Let's say flags has the RT/DL flag set when the RT task goes to sleep:
> is there any specific reason to keep this flag set while the CPU is
> IDLE? IOW, why should we care about information related to an event
> which is now over?

Maybe I wasn't able to communicate what I wanted to say, but I am not asking
you to keep the RT/DL flags as they are, but rather to set the flags variable
to SCHED_CPUFREQ_IDLE (1 << 3). My concern was about adding an additional
conditional statement here, while we can live without one.

> The proposal of this patch is just meant to make sure that the flags,
> being a state variable, always describe the current status of the
> sugov "state machine".
> If a CPU is IDLE there are no meaningful events going on, and thus the
> flags had better be reset.

or set to SCHED_CPUFREQ_IDLE.
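
i.e. just let the existing assignment below store it (untested):

	sg_cpu->flags = flags;	/* flags == SCHED_CPUFREQ_IDLE on idle enter */

and, if needed, skip such CPUs while aggregating in
sugov_next_freq_shared():

	if (j_sg_cpu->flags & SCHED_CPUFREQ_IDLE)
		continue;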

> > This looks correct.
> > 
> > Can we completely avoid the utilization contribution of the CPUs which have
> > gone idle? Right now we avoid them with the help of (delta_ns > TICK_NSEC).
> > Can we instead check this SCHED_CPUFREQ_IDLE flag?
> 
> I would say that the blocked utilization of an IDLE CPU is still worth
> considering, at least for a limited amount of time, for a few main
> reasons:
> 
> 1. it represents CPU bandwidth that is likely to be required by a task
>    which can wake up in a short while. Consider for example an 80% task
>    activated every 16ms: even if it's not running right now, it's
>    likely to wake up in the next ~3ms and run for the following ~13ms.
>    Thus, we should probably still consider that CPU's utilization.
> 
> 2. we already have policies to gracefully reduce the current OPP when
>    its utilization decreases. This means that we are interested in a
>    sort of policy which favors higher OPPs, to avoid impacting the
>    performance of tasks which suddenly wake up.
> 
> 3. a CPU entering IDLE is not a great source of new information for
>    OPP selection, so I would not strictly bind an OPP change to this
>    event. That's also why this patch proposes to clear the flags
>    without actually triggering an OPP change.
> 
> Moreover, maybe the issue you are trying to solve is more related to
> having stale utilization for an IDLE CPU?

I wasn't trying to solve any issue here, just discussing what we should do.
Yeah, it seems fair to keep the utilization of the idle CPU for another tick,
after which we ignore it anyway.
Joel Fernandes July 7, 2017, 4:43 a.m. UTC | #4
On Tue, Jul 4, 2017 at 10:34 AM, Patrick Bellasi
<patrick.bellasi@arm.com> wrote:

[...]

> @@ -304,6 +304,12 @@ static void sugov_update_shared(struct update_util_data *hook, u64 time,
>
>         sg_cpu->util = util;
>         sg_cpu->max = max;
> +
> +       /* CPU is entering IDLE, reset flags without triggering an update */
> +       if (unlikely(flags & SCHED_CPUFREQ_IDLE)) {
> +               sg_cpu->flags = 0;
> +               goto done;
> +       }

Instead of defining a new flag for idle, wouldn't another way be to
just clear the flag from the RT scheduling class, with an extra call to
cpufreq_update_util() with flags = 0 during dequeue_rt_entity()? That
also seems to me to be the right place to clear the flag, since the
flag is set in the corresponding class to begin with.
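
For example, something like this (completely untested, just to
illustrate the idea):

	/* kernel/sched/rt.c */
	static void dequeue_rt_entity(struct sched_rt_entity *rt_se, unsigned int flags)
	{
		struct rq *rq = rq_of_rt_se(rt_se);

		dequeue_rt_stack(rt_se, flags);

		for_each_sched_rt_entity(rt_se) {
			struct rt_rq *rt_rq = group_rt_rq(rt_se);

			if (rt_rq && rt_rq->rt_nr_running)
				__enqueue_rt_entity(rt_se, flags);
		}
		enqueue_top_rt_rq(&rq->rt);

		/* No runnable RT tasks left on this rq: clear our request */
		if (!rq->rt.rt_nr_running)
			cpufreq_update_util(rq, 0);
	}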

thanks,

-Joel
Juri Lelli July 7, 2017, 10:17 a.m. UTC | #5
On 06/07/17 21:43, Joel Fernandes wrote:
> On Tue, Jul 4, 2017 at 10:34 AM, Patrick Bellasi
> <patrick.bellasi@arm.com> wrote:

[...]

> > @@ -304,6 +304,12 @@ static void sugov_update_shared(struct update_util_data *hook, u64 time,
> >
> >         sg_cpu->util = util;
> >         sg_cpu->max = max;
> > +
> > +       /* CPU is entering IDLE, reset flags without triggering an update */
> > +       if (unlikely(flags & SCHED_CPUFREQ_IDLE)) {
> > +               sg_cpu->flags = 0;
> > +               goto done;
> > +       }
> 
> Instead of defining a new flag for idle, wouldn't another way be to
> just clear the flag from the RT scheduling class, with an extra call to
> cpufreq_update_util() with flags = 0 during dequeue_rt_entity()? That
> also seems to me to be the right place to clear the flag, since the
> flag is set in the corresponding class to begin with.
> 

Makes sense to me too. Also considering that for DL (with my patches) we
generally don't want to clear the flag at dequeue time, but only when
the 0-lag timer fires.

Best,

- Juri
Saravana Kannan July 11, 2017, 7:16 p.m. UTC | #6
On 07/07/2017 03:17 AM, Juri Lelli wrote:
> On 06/07/17 21:43, Joel Fernandes wrote:
>> On Tue, Jul 4, 2017 at 10:34 AM, Patrick Bellasi
>> <patrick.bellasi@arm.com> wrote:
>
> [...]
>
>>> @@ -304,6 +304,12 @@ static void sugov_update_shared(struct update_util_data *hook, u64 time,
>>>
>>>          sg_cpu->util = util;
>>>          sg_cpu->max = max;
>>> +
>>> +       /* CPU is entering IDLE, reset flags without triggering an update */
>>> +       if (unlikely(flags & SCHED_CPUFREQ_IDLE)) {
>>> +               sg_cpu->flags = 0;
>>> +               goto done;
>>> +       }
>>
>> Instead of defining a new flag for idle, wouldn't another way be to
>> just clear the flag from the RT scheduling class, with an extra call to
>> cpufreq_update_util() with flags = 0 during dequeue_rt_entity()? That
>> also seems to me to be the right place to clear the flag, since the
>> flag is set in the corresponding class to begin with.
>>
>
> Makes sense to me too. Also considering that for DL (with my patches) we
> generally don't want to clear the flag at dequeue time, but only when
> the 0-lag timer fires.
>

Makes sense to me too.

-Saravana

Patch

diff --git a/include/linux/sched/cpufreq.h b/include/linux/sched/cpufreq.h
index d2be2cc..36ac8d2 100644
--- a/include/linux/sched/cpufreq.h
+++ b/include/linux/sched/cpufreq.h
@@ -10,6 +10,7 @@ 
 #define SCHED_CPUFREQ_RT	(1U << 0)
 #define SCHED_CPUFREQ_DL	(1U << 1)
 #define SCHED_CPUFREQ_IOWAIT	(1U << 2)
+#define SCHED_CPUFREQ_IDLE	(1U << 3)
 
 #define SCHED_CPUFREQ_RT_DL	(SCHED_CPUFREQ_RT | SCHED_CPUFREQ_DL)
 
diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
index eaba6d6..004ae18 100644
--- a/kernel/sched/cpufreq_schedutil.c
+++ b/kernel/sched/cpufreq_schedutil.c
@@ -304,6 +304,12 @@  static void sugov_update_shared(struct update_util_data *hook, u64 time,
 
 	sg_cpu->util = util;
 	sg_cpu->max = max;
+
+	/* CPU is entering IDLE, reset flags without triggering an update */
+	if (unlikely(flags & SCHED_CPUFREQ_IDLE)) {
+		sg_cpu->flags = 0;
+		goto done;
+	}
 	sg_cpu->flags = flags;
 
 	sugov_set_iowait_boost(sg_cpu, time, flags);
@@ -318,6 +324,7 @@  static void sugov_update_shared(struct update_util_data *hook, u64 time,
 		sugov_update_commit(sg_policy, time, next_f);
 	}
 
+done:
 	raw_spin_unlock(&sg_policy->update_lock);
 }
 
diff --git a/kernel/sched/idle_task.c b/kernel/sched/idle_task.c
index 0c00172..a844c91 100644
--- a/kernel/sched/idle_task.c
+++ b/kernel/sched/idle_task.c
@@ -29,6 +29,10 @@  pick_next_task_idle(struct rq *rq, struct task_struct *prev, struct rq_flags *rf
 	put_prev_task(rq, prev);
 	update_idle_core(rq);
 	schedstat_inc(rq->sched_goidle);
+
+	/* kick cpufreq (see the comment in kernel/sched/sched.h). */
+	cpufreq_update_this_cpu(rq, SCHED_CPUFREQ_IDLE);
+
 	return rq->idle;
 }