
[v6,09/16] sched/cpufreq: uclamp: Add utilization clamping for RT tasks

Message ID 20190115101513.2822-10-patrick.bellasi@arm.com (mailing list archive)
State Not Applicable, archived
Series Add utilization clamping support

Commit Message

Patrick Bellasi Jan. 15, 2019, 10:15 a.m. UTC
Schedutil enforces a maximum frequency when RT tasks are RUNNABLE.
This mandatory policy can be made tunable from userspace to define a max
frequency which is still reasonable for the execution of a specific RT
workload while also being power/energy friendly.

Extend the usage of util_{min,max} to the RT scheduling class.

Add uclamp_default_perf, a special set of clamp values to be used
for tasks requiring maximum performance, i.e. by default all the
non-clamped RT tasks.

Since utilization clamping now applies to both CFS and RT tasks,
schedutil clamps the combined utilization of these two classes.
The IOWait boost value is also subject to clamping for RT tasks.

Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

---
Changes in v6:
 Others:
 - wholesale s/group/bucket/
 - wholesale s/_{get,put}/_{inc,dec}/ to match refcount APIs
---
 kernel/sched/core.c              | 20 ++++++++++++++++----
 kernel/sched/cpufreq_schedutil.c | 27 +++++++++++++--------------
 kernel/sched/rt.c                |  4 ++++
 3 files changed, 33 insertions(+), 18 deletions(-)

Comments

Quentin Perret Jan. 22, 2019, 12:30 p.m. UTC | #1
On Tuesday 15 Jan 2019 at 10:15:06 (+0000), Patrick Bellasi wrote:
> diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
> index 520ee2b785e7..38a05a4f78cc 100644
> --- a/kernel/sched/cpufreq_schedutil.c
> +++ b/kernel/sched/cpufreq_schedutil.c
> @@ -201,9 +201,6 @@ unsigned long schedutil_freq_util(int cpu, unsigned long util_cfs,
>  	unsigned long dl_util, util, irq;
>  	struct rq *rq = cpu_rq(cpu);
>  
> -	if (type == FREQUENCY_UTIL && rt_rq_is_runnable(&rq->rt))
> -		return max;
> -
>  	/*
>  	 * Early check to see if IRQ/steal time saturates the CPU, can be
>  	 * because of inaccuracies in how we track these -- see
> @@ -219,15 +216,19 @@ unsigned long schedutil_freq_util(int cpu, unsigned long util_cfs,
>  	 * utilization (PELT windows are synchronized) we can directly add them
>  	 * to obtain the CPU's actual utilization.
>  	 *
> -	 * CFS utilization can be boosted or capped, depending on utilization
> -	 * clamp constraints requested by currently RUNNABLE tasks.
> +	 * CFS and RT utilization can be boosted or capped, depending on
> +	 * utilization clamp constraints requested by currently RUNNABLE
> +	 * tasks.
>  	 * When there are no CFS RUNNABLE tasks, clamps are released and
>  	 * frequency will be gracefully reduced with the utilization decay.
>  	 */
> -	util = (type == ENERGY_UTIL)
> -		? util_cfs
> -		: uclamp_util(rq, util_cfs);
> -	util += cpu_util_rt(rq);
> +	util = cpu_util_rt(rq);
> +	if (type == FREQUENCY_UTIL) {
> +		util += cpu_util_cfs(rq);
> +		util  = uclamp_util(rq, util);

So with this we don't go to max anymore for CONFIG_UCLAMP_TASK=n, no?

Thanks,
Quentin
Patrick Bellasi Jan. 22, 2019, 12:37 p.m. UTC | #2
On 22-Jan 12:30, Quentin Perret wrote:
> On Tuesday 15 Jan 2019 at 10:15:06 (+0000), Patrick Bellasi wrote:
> > diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
> > index 520ee2b785e7..38a05a4f78cc 100644
> > --- a/kernel/sched/cpufreq_schedutil.c
> > +++ b/kernel/sched/cpufreq_schedutil.c
> > @@ -201,9 +201,6 @@ unsigned long schedutil_freq_util(int cpu, unsigned long util_cfs,
> >  	unsigned long dl_util, util, irq;
> >  	struct rq *rq = cpu_rq(cpu);
> >  
> > -	if (type == FREQUENCY_UTIL && rt_rq_is_runnable(&rq->rt))
> > -		return max;
> > -
> >  	/*
> >  	 * Early check to see if IRQ/steal time saturates the CPU, can be
> >  	 * because of inaccuracies in how we track these -- see
> > @@ -219,15 +216,19 @@ unsigned long schedutil_freq_util(int cpu, unsigned long util_cfs,
> >  	 * utilization (PELT windows are synchronized) we can directly add them
> >  	 * to obtain the CPU's actual utilization.
> >  	 *
> > -	 * CFS utilization can be boosted or capped, depending on utilization
> > -	 * clamp constraints requested by currently RUNNABLE tasks.
> > +	 * CFS and RT utilization can be boosted or capped, depending on
> > +	 * utilization clamp constraints requested by currently RUNNABLE
> > +	 * tasks.
> >  	 * When there are no CFS RUNNABLE tasks, clamps are released and
> >  	 * frequency will be gracefully reduced with the utilization decay.
> >  	 */
> > -	util = (type == ENERGY_UTIL)
> > -		? util_cfs
> > -		: uclamp_util(rq, util_cfs);
> > -	util += cpu_util_rt(rq);
> > +	util = cpu_util_rt(rq);
> > +	if (type == FREQUENCY_UTIL) {
> > +		util += cpu_util_cfs(rq);
> > +		util  = uclamp_util(rq, util);
> 
> So with this we don't go to max to anymore for CONFIG_UCLAMP_TASK=n no ?

Mmm... good point!

I need to guard this change for the !CONFIG_UCLAMP_TASK case!

> 
> Thanks,
> Quentin

Cheers Patrick
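
A minimal sketch of what such a guard could look like in
schedutil_freq_util() (editorial illustration only, not code from the
posted series; it simply falls back to the pre-patch go-to-max policy
when uclamp is compiled out):

#ifdef CONFIG_UCLAMP_TASK
	util = cpu_util_rt(rq);
	if (type == FREQUENCY_UTIL) {
		util += cpu_util_cfs(rq);
		util  = uclamp_util(rq, util);
	} else {
		util += util_cfs;
	}
#else
	/* Without uclamp, keep the old policy: runnable RT drives frequency to max */
	if (type == FREQUENCY_UTIL && rt_rq_is_runnable(&rq->rt))
		return max;
	util = util_cfs + cpu_util_rt(rq);
#endif
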
Peter Zijlstra Jan. 23, 2019, 10:28 a.m. UTC | #3
On Tue, Jan 15, 2019 at 10:15:06AM +0000, Patrick Bellasi wrote:
> -	util = (type == ENERGY_UTIL)
> -		? util_cfs
> -		: uclamp_util(rq, util_cfs);
> -	util += cpu_util_rt(rq);
> +	util = cpu_util_rt(rq);
> +	if (type == FREQUENCY_UTIL) {
> +		util += cpu_util_cfs(rq);
> +		util  = uclamp_util(rq, util);
> +	} else {
> +		util += util_cfs;
> +	}

Or the combined thing:

-       util = util_cfs;
-       util += cpu_util_rt(rq);
+       util = cpu_util_rt(rq);
+       if (type == FREQUENCY_UTIL) {
+               util += cpu_util_cfs(rq);
+               util  = uclamp_util(rq, util);
+       } else {
+               util += util_cfs;
+       }


Leaves me confused.

When type == FREQ, util_cfs should already be cpu_util_cfs(), per
sugov_get_util().

So should that not end up like:

	util = util_cfs;
	util += cpu_util_rt(rq);
+	if (type == FREQUENCY_UTIL)
+		util = uclamp_util(rq, util);

instead?
Peter Zijlstra Jan. 23, 2019, 10:49 a.m. UTC | #4
On Tue, Jan 15, 2019 at 10:15:06AM +0000, Patrick Bellasi wrote:
> @@ -858,16 +859,23 @@ static inline void
>  uclamp_effective_get(struct task_struct *p, unsigned int clamp_id,
>  		     unsigned int *clamp_value, unsigned int *bucket_id)
>  {
> +	struct uclamp_se *default_clamp;
> +
>  	/* Task specific clamp value */
>  	*clamp_value = p->uclamp[clamp_id].value;
>  	*bucket_id = p->uclamp[clamp_id].bucket_id;
>  
> +	/* RT tasks have different default values */
> +	default_clamp = task_has_rt_policy(p)
> +		? uclamp_default_perf
> +		: uclamp_default;
> +
>  	/* System default restriction */
> -	if (unlikely(*clamp_value < uclamp_default[UCLAMP_MIN].value ||
> -		     *clamp_value > uclamp_default[UCLAMP_MAX].value)) {
> +	if (unlikely(*clamp_value < default_clamp[UCLAMP_MIN].value ||
> +		     *clamp_value > default_clamp[UCLAMP_MAX].value)) {
>  		/* Keep it simple: unconditionally enforce system defaults */
> -		*clamp_value = uclamp_default[clamp_id].value;
> -		*bucket_id = uclamp_default[clamp_id].bucket_id;
> +		*clamp_value = default_clamp[clamp_id].value;
> +		*bucket_id = default_clamp[clamp_id].bucket_id;
>  	}
>  }

So I still don't much like the whole effective thing; but I think you
should use rt_task() instead of task_has_rt_policy().
Patrick Bellasi Jan. 23, 2019, 2:33 p.m. UTC | #5
On 23-Jan 11:28, Peter Zijlstra wrote:
> On Tue, Jan 15, 2019 at 10:15:06AM +0000, Patrick Bellasi wrote:
> > -	util = (type == ENERGY_UTIL)
> > -		? util_cfs
> > -		: uclamp_util(rq, util_cfs);
> > -	util += cpu_util_rt(rq);
> > +	util = cpu_util_rt(rq);
> > +	if (type == FREQUENCY_UTIL) {
> > +		util += cpu_util_cfs(rq);
> > +		util  = uclamp_util(rq, util);
> > +	} else {
> > +		util += util_cfs;
> > +	}
> 
> Or the combined thing:
> 
> -       util = util_cfs;
> -       util += cpu_util_rt(rq);
> +       util = cpu_util_rt(rq);
> +       if (type == FREQUENCY_UTIL) {
> +               util += cpu_util_cfs(rq);
> +               util  = uclamp_util(rq, util);
> +       } else {
> +               util += util_cfs;
> +       }
> 
> 
> Leaves me confused.
> 
> When type == FREQ, util_cfs should already be cpu_util_cfs(), per
> sugov_get_util().
> 
> So should that not end up like:
> 
> 	util = util_cfs;
> 	util += cpu_util_rt(rq);
> +	if (type == FREQUENCY_UTIL)
> +		util = uclamp_util(rq, util);
> 
> instead?

You're right, I get to that code after the patches which integrate
compute_energy(). The chunk above was the version before the EM got
merged, but I missed backporting the change once I rebased on
tip/sched/core with the EM in.

Sorry for the confusion, will fix in v7.

Cheers
Patrick Bellasi Jan. 23, 2019, 2:40 p.m. UTC | #6
On 23-Jan 11:49, Peter Zijlstra wrote:
> On Tue, Jan 15, 2019 at 10:15:06AM +0000, Patrick Bellasi wrote:
> > @@ -858,16 +859,23 @@ static inline void
> >  uclamp_effective_get(struct task_struct *p, unsigned int clamp_id,
> >  		     unsigned int *clamp_value, unsigned int *bucket_id)
> >  {
> > +	struct uclamp_se *default_clamp;
> > +
> >  	/* Task specific clamp value */
> >  	*clamp_value = p->uclamp[clamp_id].value;
> >  	*bucket_id = p->uclamp[clamp_id].bucket_id;
> >  
> > +	/* RT tasks have different default values */
> > +	default_clamp = task_has_rt_policy(p)
> > +		? uclamp_default_perf
> > +		: uclamp_default;
> > +
> >  	/* System default restriction */
> > -	if (unlikely(*clamp_value < uclamp_default[UCLAMP_MIN].value ||
> > -		     *clamp_value > uclamp_default[UCLAMP_MAX].value)) {
> > +	if (unlikely(*clamp_value < default_clamp[UCLAMP_MIN].value ||
> > +		     *clamp_value > default_clamp[UCLAMP_MAX].value)) {
> >  		/* Keep it simple: unconditionally enforce system defaults */
> > -		*clamp_value = uclamp_default[clamp_id].value;
> > -		*bucket_id = uclamp_default[clamp_id].bucket_id;
> > +		*clamp_value = default_clamp[clamp_id].value;
> > +		*bucket_id = default_clamp[clamp_id].bucket_id;
> >  	}
> >  }
> 
> So I still don't much like the whole effective thing;

:/

I find back-annotation useful in many cases since we have different
sources for possible clamp values:

 1. task specific
 2. cgroup defined
 3. system defaults
 4. system power default

I don't think we can avoid somehow back-annotating on which bucket a
task has been refcounted... it makes dequeue so much easier, it helps
in ensuring that the refcounting is consistent and enables lazy updates.

> but I think you should use rt_task() instead of
> task_has_rt_policy().

Right... will do.
Peter Zijlstra Jan. 23, 2019, 8:11 p.m. UTC | #7
On Wed, Jan 23, 2019 at 02:40:11PM +0000, Patrick Bellasi wrote:
> On 23-Jan 11:49, Peter Zijlstra wrote:
> > On Tue, Jan 15, 2019 at 10:15:06AM +0000, Patrick Bellasi wrote:
> > > @@ -858,16 +859,23 @@ static inline void
> > >  uclamp_effective_get(struct task_struct *p, unsigned int clamp_id,
> > >  		     unsigned int *clamp_value, unsigned int *bucket_id)
> > >  {
> > > +	struct uclamp_se *default_clamp;
> > > +
> > >  	/* Task specific clamp value */
> > >  	*clamp_value = p->uclamp[clamp_id].value;
> > >  	*bucket_id = p->uclamp[clamp_id].bucket_id;
> > >  
> > > +	/* RT tasks have different default values */
> > > +	default_clamp = task_has_rt_policy(p)
> > > +		? uclamp_default_perf
> > > +		: uclamp_default;
> > > +
> > >  	/* System default restriction */
> > > -	if (unlikely(*clamp_value < uclamp_default[UCLAMP_MIN].value ||
> > > -		     *clamp_value > uclamp_default[UCLAMP_MAX].value)) {
> > > +	if (unlikely(*clamp_value < default_clamp[UCLAMP_MIN].value ||
> > > +		     *clamp_value > default_clamp[UCLAMP_MAX].value)) {
> > >  		/* Keep it simple: unconditionally enforce system defaults */
> > > -		*clamp_value = uclamp_default[clamp_id].value;
> > > -		*bucket_id = uclamp_default[clamp_id].bucket_id;
> > > +		*clamp_value = default_clamp[clamp_id].value;
> > > +		*bucket_id = default_clamp[clamp_id].bucket_id;
> > >  	}
> > >  }
> > 
> > So I still don't much like the whole effective thing;
> 
> :/
> 
> I find back-annotation useful in many cases since we have different
> sources for possible clamp values:
> 
>  1. task specific
>  2. cgroup defined
>  3. system defaults
>  4. system power default

(I'm not sure I've seen 4 happen yet, what's that?)

Anyway, once you get range composition defined; that should be something
like:

	R_p \Compose_g R_g

Where R_p is the range of task-p, and R_g is the range of the g'th
cgroup of p (where you can make an identity between the root cgroup and
the system default).

Now, as per the other email, I think the straightforward composition:

struct range compose(struct range a, struct range b)
{
	return (range){.min = clamp(a.min, b.min, b.max),
	               .max = clamp(a.max, b.min, b.max), };
}

(note that this is non-commutative, so we have to pay attention to
get the order 'right')

Works in this case; unlike the cpu/rq composition where we resort to a
pure max function for non-interference.
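
For illustration (an editorial sketch with made-up numbers, reusing the
struct range / compose() sketch above), the order-dependence shows up as
soon as the two ranges do not overlap:

	struct range task  = { .min =   0, .max = 100 };
	struct range group = { .min = 300, .max = 600 };

	compose(task, group);	/* -> {300, 300}: the task is pulled up into the group's range  */
	compose(group, task);	/* -> {100, 100}: the group is capped down into the task's range */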

> I don't think we can avoid to somehow back annotate on which bucket a
> task has been refcounted... it makes dequeue so much easier, it helps
> in ensuring that the refcouning is consistent and enable lazy updates.

So I'll have to go over the code again, but I'm wondering why you're
changing uclamp_se::bucket_id on a runnable task.

Ideally you keep bucket_id invariant between enqueue and dequeue; then
dequeue knows where we put it.

Now I suppose actually determining bucket_id is 'expensive' (it
certainly is with the whole mapping scheme, but even that integer
division is not nice), so we'd like to precompute the bucket_id.

This then leads to the problem of how to change uclamp_se::value while
runnable (since per the other thread you don't want to always update all
runnable tasks).

So far so right?

I'm thinking that if we have a single bit, say:

  struct uclamp_se {
	...
	unsigned int	changed : 1;
  };

We can update uclamp_se::value and set uclamp_se::changed, and then the
next enqueue will (unlikely) test-and-clear changed and recompute the
bucket_id.

Would that not be simpler?
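
A rough editorial sketch of that lazy recompute (the helper name
uclamp_se_sync() and the bucket-mapping function uclamp_bucket_id() are
illustrative, not names taken from the series):

static inline void uclamp_se_sync(struct uclamp_se *uc_se)
{
	/* only after a slow-path update flagged the requested value as changed */
	if (unlikely(uc_se->changed)) {
		uc_se->changed = 0;
		/* stands for whatever helper maps a clamp value to a bucket index */
		uc_se->bucket_id = uclamp_bucket_id(uc_se->value);
	}
}

Something like this would run from the enqueue path, right before the
bucket refcount is incremented.
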
Patrick Bellasi Jan. 24, 2019, 12:30 p.m. UTC | #8
On 23-Jan 21:11, Peter Zijlstra wrote:
> On Wed, Jan 23, 2019 at 02:40:11PM +0000, Patrick Bellasi wrote:
> > On 23-Jan 11:49, Peter Zijlstra wrote:
> > > On Tue, Jan 15, 2019 at 10:15:06AM +0000, Patrick Bellasi wrote:
> > > > @@ -858,16 +859,23 @@ static inline void
> > > >  uclamp_effective_get(struct task_struct *p, unsigned int clamp_id,
> > > >  		     unsigned int *clamp_value, unsigned int *bucket_id)
> > > >  {
> > > > +	struct uclamp_se *default_clamp;
> > > > +
> > > >  	/* Task specific clamp value */
> > > >  	*clamp_value = p->uclamp[clamp_id].value;
> > > >  	*bucket_id = p->uclamp[clamp_id].bucket_id;
> > > >  
> > > > +	/* RT tasks have different default values */
> > > > +	default_clamp = task_has_rt_policy(p)
> > > > +		? uclamp_default_perf
> > > > +		: uclamp_default;
> > > > +
> > > >  	/* System default restriction */
> > > > -	if (unlikely(*clamp_value < uclamp_default[UCLAMP_MIN].value ||
> > > > -		     *clamp_value > uclamp_default[UCLAMP_MAX].value)) {
> > > > +	if (unlikely(*clamp_value < default_clamp[UCLAMP_MIN].value ||
> > > > +		     *clamp_value > default_clamp[UCLAMP_MAX].value)) {
> > > >  		/* Keep it simple: unconditionally enforce system defaults */
> > > > -		*clamp_value = uclamp_default[clamp_id].value;
> > > > -		*bucket_id = uclamp_default[clamp_id].bucket_id;
> > > > +		*clamp_value = default_clamp[clamp_id].value;
> > > > +		*bucket_id = default_clamp[clamp_id].bucket_id;
> > > >  	}
> > > >  }
> > > 
> > > So I still don't much like the whole effective thing;
> > 
> > :/
> > 
> > I find back-annotation useful in many cases since we have different
> > sources for possible clamp values:
> > 
> >  1. task specific
> >  2. cgroup defined
> >  3. system defaults
> >  4. system power default
> 
> (I'm not sure I've seen 4 happen yet, what's that?)

Typo: that's "system s/power/perf/ default", i.e. uclamp_default_perf
introduced by this patch.

> Anyway, once you get range composition defined; that should be something
> like:
> 
> 	R_p \Compose_g R_g
> 
> Where R_p is the range of task-p, and R_g is the range of the g'th
> cgroup of p (where you can make an identity between the root cgroup and
> the system default).
> 
> Now; as per the other email; I think the straight forward composition:
> 
> struct range compose(struct range a, struct range b)
> {
> 	return (range){.min = clamp(a.min, b.min, b.max),
> 	               .max = clamp(a.max, b.min, b.max), };
> }

This composition is done in uclamp_effective_get() but it's
slightly different, since we want to support a "nice policy" where
tasks can always ask for less than what they have been assigned.

Thus, from an abstract standpoint, if a task is in a cgroup:

     task.min <= R_g.min
     task.max <= R_g.max

While, for tasks in the root cgroup, the system default applies and we
enforce:

     task.min >= R_0.min
     task.max <= R_0.max

... where the "nice policy" is not supported anymore, but
perhaps we can/should use the same for system defaults too.
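
One possible reading of that "nice policy", written against the
struct range / compose() sketch from the quoted mail (purely an editorial
illustration; compose_nice() and the componentwise min are an assumption,
not the actual code in uclamp_effective_get()):

struct range compose_nice(struct range task, struct range grp)
{
	/* a task may always ask for less than what its group grants it */
	return (struct range){ .min = min(task.min, grp.min),
			       .max = min(task.max, grp.max) };
}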

> (note that this is non-commutative, so we have to pay attention to
> get the order 'right')
> 
> Works in this case; unlike the cpu/rq conposition where we resort to a
> pure max function for non-interference.

Right.

> > I don't think we can avoid to somehow back annotate on which bucket a
> > task has been refcounted... it makes dequeue so much easier, it helps
> > in ensuring that the refcouning is consistent and enable lazy updates.
> 
> So I'll have to go over the code again, but I'm wondering why you're
> changing uclamp_se::bucket_id on a runnable task.

We change only the "requested" value, not the "effective" one.

> Ideally you keep bucket_id invariant between enqueue and dequeue; then
> dequeue knows where we put it.

Right, that's what we do for the "effective" value.

> Now I suppose actually determining bucket_id is 'expensive' (it
> certainly is with the whole mapping scheme, but even that integer
> division is not nice), so we'd like to precompute the bucket_id.

Yes, although the complexity is mostly in the composition logic
described above, not in the mapping at all. We have "mapping" overheads
only when we change a "request" value, and that's from the slow-paths.

The "effective" value is computed at each enqueue time by composing
precomputed bucket_id representing the current "requested" task's,
cgroup's and system default's values.

> This then leads to the problem of how to change uclamp_se::value while
> runnable (since per the other thread you don't want to always update all
> runnable tasks).
> 
> So far so right?

Yes.

> I'm thikning that if we haz a single bit, say:
> 
>   struct uclamp_se {
> 	...
> 	unsigned int	changed : 1;
>   };
> 
> We can update uclamp_se::value and set uclamp_se::changed, and then the
> next enqueue will (unlikely) test-and-clear changed and recompute the
> bucket_id.

This means we will lazily update the "requested" bucket_id by deferring
its computation to enqueue time, which saves us a copy of the bucket_id,
i.e. we will have only the "effective" value updated at enqueue time.

But...

> Would that not be simpler?

... although being simpler it does not fully exploit the slow-path,
a syscall which is usually running from a different process context
(system management software).

It also fits better for lazy updates, but in the cgroup case, where we
want to enforce an update ASAP for RUNNABLE tasks, we will still have to
do the updates from the slow-path.

Will look more into this simplification while working on v7; perhaps
the linear mapping can really help with that too.

Cheers Patrick
Patrick Bellasi Jan. 24, 2019, 12:38 p.m. UTC | #9
On 24-Jan 12:30, Patrick Bellasi wrote:
> On 23-Jan 21:11, Peter Zijlstra wrote:
> > On Wed, Jan 23, 2019 at 02:40:11PM +0000, Patrick Bellasi wrote:
> > > On 23-Jan 11:49, Peter Zijlstra wrote:
> > > > On Tue, Jan 15, 2019 at 10:15:06AM +0000, Patrick Bellasi wrote:

[...]

> > I'm thikning that if we haz a single bit, say:
> > 
> >   struct uclamp_se {
> > 	...
> > 	unsigned int	changed : 1;
> >   };
> > 
> > We can update uclamp_se::value and set uclamp_se::changed, and then the
> > next enqueue will (unlikely) test-and-clear changed and recompute the
> > bucket_id.
> 
> This mean will lazy update the "requested" bucket_id by deferring its
> computation at enqueue time. Which saves us a copy of the bucket_id,
> i.e. we will have only the "effective" value updated at enqueue time.
> 
> But...
> 
> > Would that not be simpler?
> 
> ... although being simpler it does not fully exploit the slow-path,
> a syscall which is usually running from a different process context
> (system management software).
> 
> It also fits better for lazy updates but, in the cgroup case, where we
> wanna enforce an update ASAP for RUNNABLE tasks, we will still have to
> do the updates from the slow-path.
> 
> Will look better into this simplification while working on v7, perhaps
> the linear mapping can really help in that too.

Actually, I forgot to mention that:

    uclamp_se::effective::{
        value, bucket_id
    }

will still be required to properly support the cgroup delegation model,
where a child group could be restricted by the parent but we want to
keep track of the original "requested" value for when the parent
should relax the restriction.

Thus, since the effective values are already there, why not use them
also to pre-compute the new requested bucket_id from the slow path?
Peter Zijlstra Jan. 24, 2019, 3:12 p.m. UTC | #10
On Thu, Jan 24, 2019 at 12:38:35PM +0000, Patrick Bellasi wrote:
> On 24-Jan 12:30, Patrick Bellasi wrote:
> > On 23-Jan 21:11, Peter Zijlstra wrote:
> > > On Wed, Jan 23, 2019 at 02:40:11PM +0000, Patrick Bellasi wrote:
> > > > On 23-Jan 11:49, Peter Zijlstra wrote:
> > > > > On Tue, Jan 15, 2019 at 10:15:06AM +0000, Patrick Bellasi wrote:
> 
> [...]
> 
> > > I'm thikning that if we haz a single bit, say:
> > > 
> > >   struct uclamp_se {
> > > 	...
> > > 	unsigned int	changed : 1;
> > >   };
> > > 
> > > We can update uclamp_se::value and set uclamp_se::changed, and then the
> > > next enqueue will (unlikely) test-and-clear changed and recompute the
> > > bucket_id.
> > 
> > This mean will lazy update the "requested" bucket_id by deferring its
> > computation at enqueue time. Which saves us a copy of the bucket_id,
> > i.e. we will have only the "effective" value updated at enqueue time.
> > 
> > But...
> > 
> > > Would that not be simpler?
> > 
> > ... although being simpler it does not fully exploit the slow-path,
> > a syscall which is usually running from a different process context
> > (system management software).
> > 
> > It also fits better for lazy updates but, in the cgroup case, where we
> > wanna enforce an update ASAP for RUNNABLE tasks, we will still have to
> > do the updates from the slow-path.
> > 
> > Will look better into this simplification while working on v7, perhaps
> > the linear mapping can really help in that too.
> 
> Actually, I forgot to mention that:
> 
>     uclamp_se::effective::{
>         value, bucket_id
>     }
> 
> will be still required to proper support the cgroup delegation model,
> where a child group could be restricted by the parent but we want to
> keep track of the original "requested" value for when the parent
> should relax the restriction.
> 
> Thus, since effective values are already there, why not using them
> also to pre-compute the new requested bucket_id from the slow path?

Well, we need the orig_value; but I'm still not sure why you need more
bucket_id's. Also, retaining orig_value is already required for the
system limits, there's nothing cgroup-y about this afaict.
Peter Zijlstra Jan. 24, 2019, 3:15 p.m. UTC | #11
On Tue, Jan 15, 2019 at 10:15:06AM +0000, Patrick Bellasi wrote:
> Add uclamp_default_perf, a special set of clamp values to be used
> for tasks requiring maximum performance, i.e. by default all the non
> clamped RT tasks.

Urgh... why though?
Peter Zijlstra Jan. 24, 2019, 3:31 p.m. UTC | #12
On Thu, Jan 24, 2019 at 12:30:09PM +0000, Patrick Bellasi wrote:
> > So I'll have to go over the code again, but I'm wondering why you're
> > changing uclamp_se::bucket_id on a runnable task.
> 
> We change only the "requested" value, not the "effective" one.
> 
> > Ideally you keep bucket_id invariant between enqueue and dequeue; then
> > dequeue knows where we put it.
> 
> Right, that's what we do for the "effective" value.

So the problem I have is that you first introduce uclamp_se::value and
use that all over the code, and then introduce effective and change all
the usage sites.

That seems daft. Why not keep all the code as-is and add orig_value.

> > Now I suppose actually determining bucket_id is 'expensive' (it
> > certainly is with the whole mapping scheme, but even that integer
> > division is not nice), so we'd like to precompute the bucket_id.
> 
> Yes, although the complexity is mostly in the composition logic
> described above not on mapping at all. We have "mapping" overheads
> only when we change a "request" value and that's from slow-paths.

It's weird though. Esp. when combined with that mapping logic, because
then you get to use additional maps that are not in fact ever used.

> > We can update uclamp_se::value and set uclamp_se::changed, and then the
> > next enqueue will (unlikely) test-and-clear changed and recompute the
> > bucket_id.
> 
> This mean will lazy update the "requested" bucket_id by deferring its
> computation at enqueue time. Which saves us a copy of the bucket_id,
> i.e. we will have only the "effective" value updated at enqueue time.
> 
> But...
> 
> > Would that not be simpler?
> 
> ... although being simpler it does not fully exploit the slow-path,
> a syscall which is usually running from a different process context
> (system management software).
> 
> It also fits better for lazy updates but, in the cgroup case, where we
> wanna enforce an update ASAP for RUNNABLE tasks, we will still have to
> do the updates from the slow-path.
> 
> Will look better into this simplification while working on v7, perhaps
> the linear mapping can really help in that too.

OK. So mostly my complaint is that it seems to do things oddly for
ill-explained reasons.
Peter Zijlstra Jan. 24, 2019, 3:33 p.m. UTC | #13
On Thu, Jan 24, 2019 at 12:30:09PM +0000, Patrick Bellasi wrote:
> On 23-Jan 21:11, Peter Zijlstra wrote:

> > Anyway, once you get range composition defined; that should be something
> > like:
> > 
> > 	R_p \Compose_g R_g
> > 
> > Where R_p is the range of task-p, and R_g is the range of the g'th
> > cgroup of p (where you can make an identity between the root cgroup and
> > the system default).
> > 
> > Now; as per the other email; I think the straight forward composition:
> > 
> > struct range compose(struct range a, struct range b)
> > {
> > 	return (range){.min = clamp(a.min, b.min, b.max),
> > 	               .max = clamp(a.max, b.min, b.max), };
> > }
> 
> This composition is done in uclamp_effective_get() but it's
> slightly different, since we want to support a "nice policy" where
> tasks can always ask less then what they have got assigned.

Not sure I follow..

> Thus, from an abstract standpoint, if a task is in a cgroup:
> 
>      task.min <= R_g.min
>      task.max <= R_g.max
> 
> While, for tasks in the root cgroup system default applies and we
> enforece:
> 
>      task.min >= R_0.min
>      task.max <= R_0.max
> 
> ... where the "nice policy" is currently not more supported, but
> perhaps we can/should use the same for system defaults too.

That seems inconsistent at best.

OK, I'll go have another look. I never recognised that function for
doing that.
Patrick Bellasi Jan. 24, 2019, 4 p.m. UTC | #14
On 24-Jan 16:12, Peter Zijlstra wrote:
> On Thu, Jan 24, 2019 at 12:38:35PM +0000, Patrick Bellasi wrote:
> > On 24-Jan 12:30, Patrick Bellasi wrote:
> > > On 23-Jan 21:11, Peter Zijlstra wrote:
> > > > On Wed, Jan 23, 2019 at 02:40:11PM +0000, Patrick Bellasi wrote:
> > > > > On 23-Jan 11:49, Peter Zijlstra wrote:
> > > > > > On Tue, Jan 15, 2019 at 10:15:06AM +0000, Patrick Bellasi wrote:
> > 
> > [...]
> > 
> > > > I'm thikning that if we haz a single bit, say:
> > > > 
> > > >   struct uclamp_se {
> > > > 	...
> > > > 	unsigned int	changed : 1;
> > > >   };
> > > > 
> > > > We can update uclamp_se::value and set uclamp_se::changed, and then the
> > > > next enqueue will (unlikely) test-and-clear changed and recompute the
> > > > bucket_id.
> > > 
> > > This mean will lazy update the "requested" bucket_id by deferring its
> > > computation at enqueue time. Which saves us a copy of the bucket_id,
> > > i.e. we will have only the "effective" value updated at enqueue time.
> > > 
> > > But...
> > > 
> > > > Would that not be simpler?
> > > 
> > > ... although being simpler it does not fully exploit the slow-path,
> > > a syscall which is usually running from a different process context
> > > (system management software).
> > > 
> > > It also fits better for lazy updates but, in the cgroup case, where we
> > > wanna enforce an update ASAP for RUNNABLE tasks, we will still have to
> > > do the updates from the slow-path.
> > > 
> > > Will look better into this simplification while working on v7, perhaps
> > > the linear mapping can really help in that too.
> > 
> > Actually, I forgot to mention that:
> > 
> >     uclamp_se::effective::{
> >         value, bucket_id
> >     }
> > 
> > will be still required to proper support the cgroup delegation model,
> > where a child group could be restricted by the parent but we want to
> > keep track of the original "requested" value for when the parent
> > should relax the restriction.
> > 
> > Thus, since effective values are already there, why not using them
> > also to pre-compute the new requested bucket_id from the slow path?
> 
> Well, we need the orig_value; but I'm still not sure why you need more
> bucket_id's. Also, retaining orig_value is already required for the
> system limits, there's nothing cgroup-y about this afaict.

Sure, the "effective" values are just a very convenient way (IMHO) to
know exactly which value/bucket_id is currently in use by a task while
keeping them well separated from the "requested" values.

So, you propose to add "orig_value"... but the end effect will be the
same... it's just that if we look at uclamp_se you have two dual pieces
of information:

 A) whatever a task or cgroup "request" is always in:

   uclamp_se::value
   uclamp_se::bucket_id

 B) whatever a task or cgroup "gets" is always in:

   uclamp_se::effective::value
   uclamp_se::effective::bucket_id

I find this partitioning useful and easy to use:

 1) the slow-path updates only data in A

 2) the fast-path updates only data in B
    by composing A data in uclamp_effective_get() @enqueue time.
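
An editorial sketch of that split (the field layout is only illustrative,
the actual series may arrange it differently):

struct uclamp_se {
	unsigned int		value;		/* A: "requested" clamp, written from the slow-path */
	unsigned int		bucket_id;	/* A: precomputed bucket for the requested value */
	struct {
		unsigned int	value;		/* B: "effective" clamp, resolved at enqueue time */
		unsigned int	bucket_id;	/* B: bucket actually refcounted on the rq */
	} effective;
};
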
Patrick Bellasi Jan. 24, 2019, 4:05 p.m. UTC | #15
On 24-Jan 16:15, Peter Zijlstra wrote:
> On Tue, Jan 15, 2019 at 10:15:06AM +0000, Patrick Bellasi wrote:
> > Add uclamp_default_perf, a special set of clamp values to be used
> > for tasks requiring maximum performance, i.e. by default all the non
> > clamped RT tasks.
> 
> Urgh... why though?

Because tasks without a task-specific clamp value assigned are subject
to the system default... but those, by default, don't clamp at all:

   uclamp_default = [0..1024]

For RT tasks however, we want to execute them always at max OPP, if
not otherwise requested (e.g. task-specific value). Thus we need a
different set of default clamps:

   uclamp_default_perf = [1024..1024]
Patrick Bellasi Jan. 24, 2019, 4:14 p.m. UTC | #16
On 24-Jan 16:31, Peter Zijlstra wrote:
> On Thu, Jan 24, 2019 at 12:30:09PM +0000, Patrick Bellasi wrote:
> > > So I'll have to go over the code again, but I'm wondering why you're
> > > changing uclamp_se::bucket_id on a runnable task.
> > 
> > We change only the "requested" value, not the "effective" one.
> > 
> > > Ideally you keep bucket_id invariant between enqueue and dequeue; then
> > > dequeue knows where we put it.
> > 
> > Right, that's what we do for the "effective" value.
> 
> So the problem I have is that you first introduce uclamp_se::value and
> use that all over the code, and then introduce effective and change all
> the usage sites.

Right, because the moment we introduce the combination/aggregation mechanism
is the moment the "effective" value makes sense to have. That's when the
code shows that:

   a task cannot always get what it "requests"; an "effective" value is
   computed by aggregation and that's the value we now use for the actual
   clamp enforcement.

> That seems daft. Why not keep all the code as-is and add orig_value.

If you prefer, I can use the effective values from the beginning and
then add the "requested" values later... but I fear the patchset would
not be any clearer to parse.


> > > Now I suppose actually determining bucket_id is 'expensive' (it
> > > certainly is with the whole mapping scheme, but even that integer
> > > division is not nice), so we'd like to precompute the bucket_id.
> > 
> > Yes, although the complexity is mostly in the composition logic
> > described above not on mapping at all. We have "mapping" overheads
> > only when we change a "request" value and that's from slow-paths.
> 
> It's weird though. Esp. when combined with that mapping logic, because
> then you get to use additional maps that are not in fact ever used.

Mmm... don't get this point...

AFAICS "mapping" and "effective" are two different concepts, that's
why I can probably get rid of the first by I would prefer to keep the
second.

> > > We can update uclamp_se::value and set uclamp_se::changed, and then the
> > > next enqueue will (unlikely) test-and-clear changed and recompute the
> > > bucket_id.
> > 
> > This mean will lazy update the "requested" bucket_id by deferring its
> > computation at enqueue time. Which saves us a copy of the bucket_id,
> > i.e. we will have only the "effective" value updated at enqueue time.
> > 
> > But...
> > 
> > > Would that not be simpler?
> > 
> > ... although being simpler it does not fully exploit the slow-path,
> > a syscall which is usually running from a different process context
> > (system management software).
> > 
> > It also fits better for lazy updates but, in the cgroup case, where we
> > wanna enforce an update ASAP for RUNNABLE tasks, we will still have to
> > do the updates from the slow-path.
> > 
> > Will look better into this simplification while working on v7, perhaps
> > the linear mapping can really help in that too.
> 
> OK. So mostly my complaint is that it seems to do things odd for ill
> explained reasons.

:( I'm really sorry I'm not able to convey the overall design idea.

TBH however, despite being a quite "limited" feature it has many
different viewpoints: task-specific, cgroups and system defaults.

I've really tried my best to come up with something reasonable but I
understand that, looking at the single patches, the overall design
could be difficult to grasp... without considering that optimizations
are always possible of course.

If you prefer, I can try to give it a respin by just removing the
mapping part and then we can go back and see if the remaining bits make
more sense?

Patch

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index d1ea5825501a..1ed01f381641 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -746,6 +746,7 @@  unsigned int sysctl_sched_uclamp_util_max = SCHED_CAPACITY_SCALE;
  * Tasks specific clamp values are required to be within this range
  */
 static struct uclamp_se uclamp_default[UCLAMP_CNT];
+static struct uclamp_se uclamp_default_perf[UCLAMP_CNT];
 
 /**
  * Reference count utilization clamp buckets
@@ -858,16 +859,23 @@  static inline void
 uclamp_effective_get(struct task_struct *p, unsigned int clamp_id,
 		     unsigned int *clamp_value, unsigned int *bucket_id)
 {
+	struct uclamp_se *default_clamp;
+
 	/* Task specific clamp value */
 	*clamp_value = p->uclamp[clamp_id].value;
 	*bucket_id = p->uclamp[clamp_id].bucket_id;
 
+	/* RT tasks have different default values */
+	default_clamp = task_has_rt_policy(p)
+		? uclamp_default_perf
+		: uclamp_default;
+
 	/* System default restriction */
-	if (unlikely(*clamp_value < uclamp_default[UCLAMP_MIN].value ||
-		     *clamp_value > uclamp_default[UCLAMP_MAX].value)) {
+	if (unlikely(*clamp_value < default_clamp[UCLAMP_MIN].value ||
+		     *clamp_value > default_clamp[UCLAMP_MAX].value)) {
 		/* Keep it simple: unconditionally enforce system defaults */
-		*clamp_value = uclamp_default[clamp_id].value;
-		*bucket_id = uclamp_default[clamp_id].bucket_id;
+		*clamp_value = default_clamp[clamp_id].value;
+		*bucket_id = default_clamp[clamp_id].bucket_id;
 	}
 }
 
@@ -1282,6 +1290,10 @@  static void __init init_uclamp(void)
 
 		uc_se = &uclamp_default[clamp_id];
 		uclamp_bucket_inc(NULL, uc_se, clamp_id, uclamp_none(clamp_id));
+
+		/* RT tasks by default will go to max frequency */
+		uc_se = &uclamp_default_perf[clamp_id];
+		uclamp_bucket_inc(NULL, uc_se, clamp_id, uclamp_none(UCLAMP_MAX));
 	}
 }
 
diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
index 520ee2b785e7..38a05a4f78cc 100644
--- a/kernel/sched/cpufreq_schedutil.c
+++ b/kernel/sched/cpufreq_schedutil.c
@@ -201,9 +201,6 @@  unsigned long schedutil_freq_util(int cpu, unsigned long util_cfs,
 	unsigned long dl_util, util, irq;
 	struct rq *rq = cpu_rq(cpu);
 
-	if (type == FREQUENCY_UTIL && rt_rq_is_runnable(&rq->rt))
-		return max;
-
 	/*
 	 * Early check to see if IRQ/steal time saturates the CPU, can be
 	 * because of inaccuracies in how we track these -- see
@@ -219,15 +216,19 @@  unsigned long schedutil_freq_util(int cpu, unsigned long util_cfs,
 	 * utilization (PELT windows are synchronized) we can directly add them
 	 * to obtain the CPU's actual utilization.
 	 *
-	 * CFS utilization can be boosted or capped, depending on utilization
-	 * clamp constraints requested by currently RUNNABLE tasks.
+	 * CFS and RT utilization can be boosted or capped, depending on
+	 * utilization clamp constraints requested by currently RUNNABLE
+	 * tasks.
 	 * When there are no CFS RUNNABLE tasks, clamps are released and
 	 * frequency will be gracefully reduced with the utilization decay.
 	 */
-	util = (type == ENERGY_UTIL)
-		? util_cfs
-		: uclamp_util(rq, util_cfs);
-	util += cpu_util_rt(rq);
+	util = cpu_util_rt(rq);
+	if (type == FREQUENCY_UTIL) {
+		util += cpu_util_cfs(rq);
+		util  = uclamp_util(rq, util);
+	} else {
+		util += util_cfs;
+	}
 
 	dl_util = cpu_util_dl(rq);
 
@@ -355,13 +356,11 @@  static void sugov_iowait_boost(struct sugov_cpu *sg_cpu, u64 time,
 	 *
 	 * Since DL tasks have a much more advanced bandwidth control, it's
 	 * safe to assume that IO boost does not apply to those tasks.
-	 * Instead, since RT tasks are not utilization clamped, we don't want
-	 * to apply clamping on IO boost while there is blocked RT
-	 * utilization.
+	 * Instead, for CFS and RT tasks we clamp the IO boost max value
+	 * considering the current constraints for the CPU.
 	 */
 	max_boost = sg_cpu->iowait_boost_max;
-	if (!cpu_util_rt(cpu_rq(sg_cpu->cpu)))
-		max_boost = uclamp_util(cpu_rq(sg_cpu->cpu), max_boost);
+	max_boost = uclamp_util(cpu_rq(sg_cpu->cpu), max_boost);
 
 	/* Double the boost at each request */
 	if (sg_cpu->iowait_boost) {
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index e4f398ad9e73..614b0bc359cb 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -2400,6 +2400,10 @@  const struct sched_class rt_sched_class = {
 	.switched_to		= switched_to_rt,
 
 	.update_curr		= update_curr_rt,
+
+#ifdef CONFIG_UCLAMP_TASK
+	.uclamp_enabled		= 1,
+#endif
 };
 
 #ifdef CONFIG_RT_GROUP_SCHED