
[RFC,v3,0/5] Add capacity capping support to the CPU controller

Message ID 20170410073622.2y6tnpcd2ssuoztz@hirez.programming.kicks-ass.net (mailing list archive)
State RFC, archived
Headers show

Commit Message

Peter Zijlstra April 10, 2017, 7:36 a.m. UTC
On Mon, Mar 20, 2017 at 05:22:33PM +0000, Patrick Bellasi wrote:

> a) Bias OPP selection.
>    Thus granting that certain critical tasks always run at least at a
>    specified frequency.
> 
> b) Bias TASKS placement, which requires an additional extension not
>    yet posted to keep things simple.
>    This allows heterogeneous systems, where different CPUs have
>    different capacities, to schedule important tasks in more capable
>    CPUs.

So I dislike the capacity min/max knobs because:

 1) capacity is more an implementation detail than a primary metric;
    as illustrated by your points above, it affects both, while in
    fact it actually modifies another metric, namely util_avg.

 2) they look like an absolute clamp for a best effort class; while they
    very much are not. This also leads to very confusing nesting
    properties re cgroups.

 3) they have absolutely unacceptable overhead in implementation. Two
    more RB tree operations per enqueue/dequeue is just not going to
    happen.

 4) they have muddled semantics, because while it's presented as a task
    property, it very much is not. The max(min) and min(max) of all
    runnable tasks is applied to the aggregate util signal. Also see 2.


So 'nice' is a fairly well understood knob; it controls the relative time
received between competing tasks (and we really should replace the cpu
controller's 'shares' file with a 'nice' file, see below).


I would much rather introduce something like nice-opp, which only
affects the OPP selection in a relative fashion. This immediately also
has a natural and obvious extension to cgroup hierarchies (just like
regular nice).


And could be folded as a factor in util_avg (just as we do with weight
in load_avg), although this will mess up everything else we use util for
:/. Or, alternatively, decompose the aggregate value upon usage and only
apply the factor for the current running task or something.... Blergh..
difficult..




---
 kernel/sched/core.c  | 29 +++++++++++++++++++++--------
 kernel/sched/fair.c  |  2 +-
 kernel/sched/sched.h |  1 +
 3 files changed, 23 insertions(+), 9 deletions(-)

Comments

Patrick Bellasi April 11, 2017, 5:58 p.m. UTC | #1
Hi Peter,
sorry for this pretty long response, but I think that if you have
time to patiently go through all the following comments and questions,
it will be a big step forward in defining what we really want.

On 10-Apr 09:36, Peter Zijlstra wrote:
> On Mon, Mar 20, 2017 at 05:22:33PM +0000, Patrick Bellasi wrote:
> 
> > a) Bias OPP selection.
> >    Thus granting that certain critical tasks always run at least at a
> >    specified frequency.
> > 
> > b) Bias TASKS placement, which requires an additional extension not
> >    yet posted to keep things simple.
> >    This allows heterogeneous systems, where different CPUs have
> >    different capacities, to schedule important tasks in more capable
> >    CPUs.
> 
> So I dislike the capacity min/max knobs because:

From your points below I see two main topics: "fundamental
concepts" and "implementation details".

Let's start with the "fundamental concepts", since I think we are not
yet aligned on the overall view and goals for this proposal.


.:: Fundamental Concepts

>  1) capacity is more an implementation detail than a primary metric;

Capacity has been defined in a "platform independent" way: it's a
metric normalized to 1024 for the most capable CPU running at
the maximum OPP.

To me it seems it can be considered a primary metric, maybe not in
the current form, i.e. by exposing a raw number in the [0..1024] range.

We should also consider that on the CPUFreq side we already expose
knobs like scaling_{min,max}_freq which are much more platform
dependent than capacity.

>     illustrated per your above points in that it affects both, while in
>     fact it actually modifies another metric, namely util_avg.

I don't see it modifying util_avg in any direct way.

Capacity_{min,max} are tracked independently from util_avg and used
"on demand" to "clamp" the original util_avg signal.

There are two main concepts I would like to know your opinion about:

1) we would like to have support to bias OPP selection and task placement

2) since util_avg directly affects both OPP selection and task placement,
   we can consider somehow "biasing" the "usage of" util_avg to
   achieve these goals

IMHO, if our goal is to bias OPP selection and task placement, we
should consider "working on top of" util_avg, thus getting a better
separation of concerns:
- util_avg is (already) used to define how utilization is "produced"
- capacity_{min,max} is used to _bias_ how utilization is "consumed"
  (i.e. by schedutil and the FAIR scheduler).


>  2) they look like an absolute clamp for a best effort class; while it
>     very much is not. This also leads to very confusing nesting
>     properties re cgroup.

If by absolute you mean something mandatory, I agree. We are still in
the best effort domain. But perhaps it's just a naming issue.

We want to request a possible minimum capacity, for example to
prefer a big CPU vs a LITTLE one.
However, being "not mandatory" does not, to me, mean that it has to be
necessarily "more abstract", like the interface you are proposing below.

Since we are defining a "tuning interface", the _possible_ effect of a
certain value should be easy to understand and to map onto an expected
outcome.
To me the proposed capacity clamping satisfies this goal, while a
previous proposal, where we advanced the idea of a more generic
"boost" value, drew complaints (e.g. from PJT) about it not being
really clear what the possible end result was.

Sorry, but what are the "confusing nesting properties"
you are referring to?

>  4) they have muddled semantics, because while its presented as a task
>     property, it very much is not.

Actually we always presented it as a task group property, while other
people suggested it should be a per-task API.

Still, IMO it could also make sense as a per-task API, for example
considering a specific RT task which we know, in our system, is
perfectly fine to always run below a certain OPP.

Do you think it should be a per-task API?
Should we focus (at least initially) on providing a per task-group API?

>     The max(min) and min(max) of all
>     runnable tasks is applied to the aggregate util signal. Also see 2.

Regarding the max/min aggregations, they are an intended feature.
The idea is to implement a sort of (best-effort, of course)
inheritance mechanism.

For example, if we have these two RUNNABLE tasks on a CPU:
 Task A, capacity_min=200
 Task B, capacity_min=400
then we want to run the CPU at least at 40% capacity even when A is
running while B is enqueued.

Don't you think this makes sense?
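
For illustration only (this is not the patch code, and the names are
invented), the capacity_min aggregation boils down to taking the max()
over the clamps of the tasks currently RUNNABLE on the CPU:

  /* Effective capacity_min of a CPU: the max() of the capacity_min
   * values of its currently RUNNABLE tasks.
   */
  static unsigned int cpu_effective_cap_min(const unsigned int *task_cap_min,
                                            int nr_runnable)
  {
          unsigned int cap_min = 0;
          int i;

          for (i = 0; i < nr_runnable; i++) {
                  if (task_cap_min[i] > cap_min)
                          cap_min = task_cap_min[i];
          }

          /* With the A=200, B=400 example above this returns 400 (~40%) */
          return cap_min;
  }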


.:: Implementation details

>  3) they have absolutely unacceptable overhead in implementation. Two
>     more RB tree operations per enqueue/dequeue is just not going to
>     happen.

This last point is about "implementation details": I'm pretty
confident that if we find an agreement on the previous points then
this last one will be simple to solve.

Just to be clear, the rb-trees are per CPU and used to track just the
RUNNABLE tasks on each CPU and, as I described in the previous
example, for the OPP biasing to work I think we really need an
aggregation mechanism.

Ideally, every time a task is enqueued/dequeued on a CPU we want to
know what the currently required capacity clamping is. This requires
maintaining an ordered list of values, and rb-trees are the most
effective solution.

Perhaps, if we accept a certain level of approximation, we can
potentially reduce the set of tracked values to a finite set (maybe
corresponding to the EM capacity values) and use just a per-CPU
ref-counting array.
Still, the min/max aggregation will have a worst-case O(N) search
complexity, where N is the number of finite values we want to use.

Do you think such a solution can work better?
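
Just to make the idea concrete, here is a minimal sketch of such a
per-CPU ref-counting array (names and the number of buckets are
invented; one bucket per tracked clamp value, e.g. one per EM capacity
step):

  #define NR_CAP_BUCKETS  8

  struct cap_min_buckets {
          unsigned int value[NR_CAP_BUCKETS];    /* clamp value of each bucket */
          unsigned int refcount[NR_CAP_BUCKETS]; /* RUNNABLE tasks requesting it */
  };

  /* enqueue/dequeue: O(1) ref-count update for the task's bucket */
  static void cap_min_enqueue(struct cap_min_buckets *b, int bucket)
  {
          b->refcount[bucket]++;
  }

  static void cap_min_dequeue(struct cap_min_buckets *b, int bucket)
  {
          b->refcount[bucket]--;
  }

  /* aggregation: worst-case O(NR_CAP_BUCKETS) scan for the highest
   * non-empty bucket
   */
  static unsigned int cap_min_aggregate(const struct cap_min_buckets *b)
  {
          int i;

          for (i = NR_CAP_BUCKETS - 1; i >= 0; i--) {
                  if (b->refcount[i])
                          return b->value[i];
          }

          return 0;
  }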

Just for completeness I've reported at the end an overhead analysis
for the current implementation; have a look only if the response to
the previous question was negative ;-)

> So 'nice' is a fairly well understood knob; it controls relative time
> received between competing tasks (and we really should replace the cpu
> controllers 'shares' file with a 'nice' file, see below).

Agreed on that point, it's an interesting cleanup/consolidation;
however, I cannot see it fitting our goals...

> I would much rather introduce something like nice-opp, which only
> affects the OPP selection in a relative fashion. This immediately also
> has a natural and obvious extension to cgroup hierarchies (just like
> regular nice).

... affecting OPPs in a relative fashion sounds like a nice abstract
interface, but I'm not convinced it can do the job:

a) We should define what "relative" means.
   While time partitioning between competing tasks makes perfect
   sense, I struggle to see how we can boost a task relative to
   other co-scheduled tasks.
   Does it even make any sense?

b) Nice values have this 10% time delta per step, which will be odd to
   map onto OPP selection.
   What does it mean to boost a task 10% less than another?

c) Nice values have a well defined discrete range [-20..19].
   Does it make sense to have such a range for OPP biasing?
   Doesn't using a 10% relative delta for each step make this range
   too wide?
   For example, at an opp-nice of 5 we could probably already be at the minimum OPP.

d) How do we use nice-opp to bias task placement?


> And could be folded as a factor in util_avg (just as we do with weight
> in load_avg), although this will mess up everything else we use util for
> :/.

Not if we use get() methods which are called from the places in the
code where we are interested in getting the biased value.

> Or, alternatively, decompose the aggregate value upon usage and only
> apply the factor for the current running task or something.... Blergh..
> difficult..

Maybe just because we are trying to over-abstract a relatively
simple concept?

Still, I don't completely get the downside of using a simple, well
defined and generic concept like "capacity"...


> ---
>  kernel/sched/core.c  | 29 +++++++++++++++++++++--------
>  kernel/sched/fair.c  |  2 +-
>  kernel/sched/sched.h |  1 +
>  3 files changed, 23 insertions(+), 9 deletions(-)
> 
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 27b4dd55b0c7..20ed17369cb1 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -6963,18 +6963,31 @@ static void cpu_cgroup_attach(struct cgroup_taskset *tset)
>  }
> 
>  #ifdef CONFIG_FAIR_GROUP_SCHED
> -static int cpu_shares_write_u64(struct cgroup_subsys_state *css,
> -				struct cftype *cftype, u64 shareval)
> +static int cpu_nice_write_s64(struct cgroup_subsys_state *css,
> +				struct cftype *cftype, s64 val)
>  {
> -	return sched_group_set_shares(css_tg(css), scale_load(shareval));
> +	struct task_group *tg = css_tg(css);
> +	unsigned long weight;
> +
> +	int ret;
> +
> +	if (val < MIN_NICE || val > MAX_NICE)
> +		return -EINVAL;
> +
> +	weight = sched_prio_to_weight[val - MIN_NICE];
> +
> +	ret = sched_group_set_shares(tg, scale_load(weight));
> +
> +	if (!ret)
> +		tg->nice = val;
>  }
> 
> -static u64 cpu_shares_read_u64(struct cgroup_subsys_state *css,
> +static u64 cpu_shares_read_s64(struct cgroup_subsys_state *css,
                 ^^^^^^^
you mean:     cpu_nice_read_s64
>  			       struct cftype *cft)
>  {
>  	struct task_group *tg = css_tg(css);
> 
> -	return (u64) scale_load_down(tg->shares);
> +	return (s64)tg->nice;
>  }
> 
>  #ifdef CONFIG_CFS_BANDWIDTH
> @@ -7252,9 +7265,9 @@ static u64 cpu_rt_period_read_uint(struct cgroup_subsys_state *css,
>  static struct cftype cpu_files[] = {
>  #ifdef CONFIG_FAIR_GROUP_SCHED
>  	{
> -		.name = "shares",
> -		.read_u64 = cpu_shares_read_u64,
> -		.write_u64 = cpu_shares_write_u64,
> +		.name = "nice",
> +		.read_s64 = cpu_nice_read_s64,
> +		.write_s64 = cpu_nice_write_s64,

However, isn't this going to break a user-space API?

>  	},
>  #endif
>  #ifdef CONFIG_CFS_BANDWIDTH
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 76f67b3e34d6..8088043f46d1 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -9471,7 +9471,7 @@ int sched_group_set_shares(struct task_group *tg, unsigned long shares)
>  	if (!tg->se[0])
>  		return -EINVAL;
> 
> -	shares = clamp(shares, scale_load(MIN_SHARES), scale_load(MAX_SHARES));
> +	shares = clamp(shares, MIN_SHARES, scale_load(MAX_SHARES));
                               ^^^^^^^^^^

Why do you remove the scaling here?

> 
>  	mutex_lock(&shares_mutex);
>  	if (tg->shares == shares)
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 57caf3606114..27f1ffe763bc 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -283,6 +283,7 @@ struct task_group {
>  	/* runqueue "owned" by this group on each cpu */
>  	struct cfs_rq **cfs_rq;
>  	unsigned long shares;
> +	int nice;
> 
>  #ifdef	CONFIG_SMP
>  	/*


.:: Overhead analysis for the current solution

Here are the stats on completion times I've got for 30 runs of:
   perf bench sched messaging --pipe --thread --group 1 --loop 5000
with tasks pinned to the two big CPUs of a Juno2 board and using the
performance governor:

        count   mean    std    min    25%    50%    75%   max
with:    30.0  4.666  0.164  4.228  4.585  4.649  4.781  4.95
without: 30.0  4.830  0.178  4.444  4.751  4.813  4.951  5.14

The average slowdown is ~3.5%, which we can easily agree is not
negligible.

However, a couple of considerations are still worth making:

1) this is certainly not the target use-case: the proposed API is not
   targeting server and/or highly intensive workloads. In those cases
   this support should not be enabled at kernel compilation time.
   We are mainly targeting low-utilization and latency-sensitive
   workloads.

2) the current implementation can perhaps be optimized by disabling it
   at run-time under certain working conditions, e.g. using something
   similar to what we do for EAS once we cross the tipping point and the
   system is marked as over-utilized.

3) there is a run-time overhead, of course, but if it gives
   energy/performance benefits for certain low-utilization workloads
   it's still worth paying.
Peter Zijlstra April 12, 2017, 12:10 p.m. UTC | #2
Let me reply in parts as I read this.. easy things first :-)

On Tue, Apr 11, 2017 at 06:58:33PM +0100, Patrick Bellasi wrote:
> On 10-Apr 09:36, Peter Zijlstra wrote:

> >  4) they have muddled semantics, because while its presented as a task
> >     property, it very much is not.
> 
> Actually we always presented it as a task group property, while other
> people suggested it should be a per-task API.
> 
> Still, IMO it could make sense also as a per-task API, for example
> considering a specific RT task which we know in our system is
> perfectly fine to always run it below a certain OPP.
> 
> Do you think it should be a per-task API?
> Should we focus (at least initially) on providing a per task-group API?

Even for the cgroup interface, I think they should set a per-task
property, not a group property.

> >  3) they have absolutely unacceptable overhead in implementation. Two
> >     more RB tree operations per enqueue/dequeue is just not going to
> >     happen.
> 
> This last point is about "implementation details", I'm pretty
> confident that if we find an agreement on the previous point than
> this last will be simple to solve.
> 
> Just to be clear, the rb-trees are per CPU and used to track just the
> RUNNABLE tasks on each CPUs and, as I described in the previous
> example, for the OPP biasing to work I think we really need an
> aggregation mechanism.

I know it's runnable, which is exactly what the regular RB tree in fair
already tracks. That gets us 3 RB tree operations per scheduling
operation, which is entirely ridiculous.

And while I disagree with the need to track this at all, see below, it
can be done _much_ cheaper using a double augmented RB-tree, where you
keep the min/max as heaps inside the regular RB-tree.
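
(Sketching the idea for the min side only, the max side being symmetric;
the names are illustrative and the kernel's rbtree_augmented.h callbacks
would do the actual maintenance during insert/erase and rotations:)

  #include <linux/rbtree.h>

  struct clamp_node {
          struct rb_node node;
          unsigned int cap_min;           /* this task's clamp */
          unsigned int subtree_cap_min;   /* max(cap_min) over this subtree */
  };

  static unsigned int clamp_subtree_cap_min(struct clamp_node *cn)
  {
          unsigned int m = cn->cap_min;
          struct clamp_node *child;

          if (cn->node.rb_left) {
                  child = rb_entry(cn->node.rb_left, struct clamp_node, node);
                  if (child->subtree_cap_min > m)
                          m = child->subtree_cap_min;
          }
          if (cn->node.rb_right) {
                  child = rb_entry(cn->node.rb_right, struct clamp_node, node);
                  if (child->subtree_cap_min > m)
                          m = child->subtree_cap_min;
          }

          return m;
  }

With the augment callbacks wired to something like the above, the
aggregate clamp of all runnable tasks is simply the root's cached value,
so no extra tree operations are needed on enqueue/dequeue beyond the
insert/erase already being done.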

> Ideally, every time a task is enqueue/dequeued from a CPU we want to
> know what is the currently required capacity clamping. This requires
> to maintain an ordered list of values and rb-trees are the most effective
> solution.
> 
> Perhaps, if we accept a certain level of approximation, we can
> potentially reduce the set of tracked values to a finite set (maybe
> corresponding to the EM capacity values) and use just a per-cpu
> ref-counting array.
> Still the min/max aggregation will have a worst case O(N) search
> complexity, where N is the number of finite values we want to use.

So the bigger point is that if the min/max is a per-task property (even
if set through a cgroup interface), the min(max) / max(min) thing is
wrong.

If the min/max were to apply to each individual task's util, you'd end
up with something like: Dom(\Sum util) = [min(1, \Sum min), min(1, \Sum max)].

Where a possible approximation is scaling the aggregate util into that
domain.
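
(One possible reading of that approximation, purely as an illustration
with made-up names: map the aggregate util linearly into the clamped
domain, everything on the usual 0..1024 capacity scale.)

  #define CAP_SCALE 1024UL

  static unsigned long scale_util_into_domain(unsigned long util_sum,
                                              unsigned long min_sum,
                                              unsigned long max_sum)
  {
          unsigned long lo = min_sum < CAP_SCALE ? min_sum : CAP_SCALE;
          unsigned long hi = max_sum < CAP_SCALE ? max_sum : CAP_SCALE;

          if (util_sum > CAP_SCALE)
                  util_sum = CAP_SCALE;

          /* linear map of [0, CAP_SCALE] onto [lo, hi] */
          return lo + util_sum * (hi - lo) / CAP_SCALE;
  }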
Peter Zijlstra April 12, 2017, 12:15 p.m. UTC | #3
On Tue, Apr 11, 2017 at 06:58:33PM +0100, Patrick Bellasi wrote:
> We should consider also that at the CPUFreq side we already expose
> knobs like scaling_{min,max}_freq which are much more platform
> dependant than capacity.

So I've always objected to these knobs.

That said; they are a pre-existing user interface, so changing them isn't
really all that feasible.

But at the very least we should integrate them properly. Which for
schedutil would mean they have to directly modify the relevant CPU
capacity values. Is this currently done? (I think not.)
Peter Zijlstra April 12, 2017, 12:22 p.m. UTC | #4
On Tue, Apr 11, 2017 at 06:58:33PM +0100, Patrick Bellasi wrote:
> Sorry, I don't get instead what are the "confusing nesting properties"
> you are referring to?

If a parent group sets min=.2 and max=.8, what are the constraints on
its child groups for setting their resp min and max?

I can't immediately give rules that would make sense.

For instance, allowing a child to lower min would violate the parent
constraint, while allowing a child to increase min would grant the child
more resources than the parent.

Neither seem like a good thing.
Peter Zijlstra April 12, 2017, 12:48 p.m. UTC | #5
On Tue, Apr 11, 2017 at 06:58:33PM +0100, Patrick Bellasi wrote:
> >     illustrated per your above points in that it affects both, while in
> >     fact it actually modifies another metric, namely util_avg.
> 
> I don't see it modifying in any direct way util_avg.

The point is that clamps called 'capacity' are applied to util. So while
you don't modify util directly, you do modify the util signal (for one
consumer).
Patrick Bellasi April 12, 2017, 1:24 p.m. UTC | #6
On 12-Apr 14:22, Peter Zijlstra wrote:
> On Tue, Apr 11, 2017 at 06:58:33PM +0100, Patrick Bellasi wrote:
> > Sorry, I don't get instead what are the "confusing nesting properties"
> > you are referring to?
> 
> If a parent group sets min=.2 and max=.8, what are the constraints on
> its child groups for setting their resp min and max?

Currently the logic I'm proposing enforces this (sketched in code below):

a) capacity_max can only be reduced
   because we accept that a child can be further constrained;
   for example:
   - a resource manager allocates a max capacity to an application
   - the application itself knows that some of its children are background
     tasks and they can be further constrained

b) capacity_min can only be increased
   because we want to prevent children from affecting overall performance;
   for example:
   - a resource manager allocates a minimum capacity to an application
   - the application itself cannot slow down some of its children
     without risking affecting other (unknown) external entities
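
A minimal sketch of these two rules as a validation step (invented names,
kernel-style error returns, just to make the constraint explicit):

  #include <linux/errno.h>

  /* May a child group set (child_min, child_max) given its parent's clamps?
   * Rule a: capacity_max can only be reduced.
   * Rule b: capacity_min can only be increased.
   */
  static int cap_clamp_validate(unsigned int parent_min, unsigned int parent_max,
                                unsigned int child_min, unsigned int child_max)
  {
          if (child_max > parent_max)
                  return -EINVAL;
          if (child_min < parent_min)
                  return -EINVAL;
          if (child_min > child_max)
                  return -EINVAL;

          return 0;
  }

Note that this still allows a child to set capacity_min up to the
parent's capacity_max, which is the case discussed below.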

> I can't immediately gives rules that would make sense.

The second rule is trickier, but I see it better matching an
overall decomposition scheme where a single resource manager is
allocating a capacity_min to two different entities (A and B) which
are independent but (only the manager knows) are also cooperating.

Let's think about the Android run-time, which allocates resources to a
system service (entity A) which it knows has to interact with
a certain app (entity B).

The cooperation dependency can be resolved only by the resource
manager, by assigning capacity_min at the entity-level CGroups.
Thus, entity subgroups should not be allowed to further reduce
this constraint without risking impacting an (unknown to them)
external entity.

> For instance, allowing a child to lower min would violate the parent
> constraint,

Quite likely don't want this.

> while allowing a child to increase min would grant the child
> more resources than the parent.

But still within the capacity_max enforced by the parent.

We should always consider the pair (min,max): once a parent has defined
this range, to me it seems OK that a child can freely play within that
range.

Why should a child group not be allowed to set:

   capacity_min_child = capacity_max_parent

?


> Neither seem like a good thing.
Patrick Bellasi April 12, 2017, 1:27 p.m. UTC | #7
On 12-Apr 14:48, Peter Zijlstra wrote:
> On Tue, Apr 11, 2017 at 06:58:33PM +0100, Patrick Bellasi wrote:
> > >     illustrated per your above points in that it affects both, while in
> > >     fact it actually modifies another metric, namely util_avg.
> > 
> > I don't see it modifying in any direct way util_avg.
> 
> The point is that clamps called 'capacity' are applied to util. So while
> you don't modify util directly, you do modify the util signal (for one
> consumer).

Right, but this consumer (i.e. schedutil) is already translating
the util_avg into a next_freq (which ultimately is a capacity).

Thus, I don't see a big misfit in that code path to "filter" this
translation with a capacity clamp.
Patrick Bellasi April 12, 2017, 1:34 p.m. UTC | #8
On 12-Apr 14:15, Peter Zijlstra wrote:
> On Tue, Apr 11, 2017 at 06:58:33PM +0100, Patrick Bellasi wrote:
> > We should consider also that at the CPUFreq side we already expose
> > knobs like scaling_{min,max}_freq which are much more platform
> > dependant than capacity.
> 
> So I've always objected to these knobs.
> 
> That said; they are a pre-existing user interface so changing them isn't
> really feasible much.
> 
> But at the very least we should integrate them properly. Which for
> schedutil would mean they have to directly modify the relevant CPU
> capacity values. Is this currently done? (I think not.)

AFAICS they are clamping the policy decisions, thus the frequency
domain... which can be more than one CPU on ARM platforms.

When you say you would like them to "directly modify the relevant CPU
capacity values" I really see this as exactly what we do with
capacity_{min,max}.

These two capacity clamping values are per task, and thus (by
definition) per CPU. Moreover, they have the interesting property of
being "aggregated" in such a way that, in this configuration:

  TaskA: capacity_max:  20%
  TaskB: capacity_max: 100%

when both tasks are RUNNABLE on the same CPU, that CPU is not capped
until TaskB leaves the CPU.

Do you think such a kind of feature can be useful?
Patrick Bellasi April 12, 2017, 1:55 p.m. UTC | #9
On 12-Apr 14:10, Peter Zijlstra wrote:
> Let me reply in parts as I read this.. easy things first :-)
> 
> On Tue, Apr 11, 2017 at 06:58:33PM +0100, Patrick Bellasi wrote:
> > On 10-Apr 09:36, Peter Zijlstra wrote:
> 
> > >  4) they have muddled semantics, because while its presented as a task
> > >     property, it very much is not.
> > 
> > Actually we always presented it as a task group property, while other
> > people suggested it should be a per-task API.
> > 
> > Still, IMO it could make sense also as a per-task API, for example
> > considering a specific RT task which we know in our system is
> > perfectly fine to always run it below a certain OPP.
> > 
> > Do you think it should be a per-task API?
> > Should we focus (at least initially) on providing a per task-group API?
> 
> Even for the cgroup interface, I think they should set a per-task
> property, not a group property.

Ok, right now, using CGroups as the primary (and only) interface, these
values are tracked as attributes of the CPU controller. Tasks get
them by reading these attributes once they are bound to a CGroup.

Are you proposing to move these attributes within the task_struct?

In that case we should also define a primary interface to set them;
any preferred proposal? sched_setattr(), prctl()?

> > >  3) they have absolutely unacceptable overhead in implementation. Two
> > >     more RB tree operations per enqueue/dequeue is just not going to
> > >     happen.
> > 
> > This last point is about "implementation details", I'm pretty
> > confident that if we find an agreement on the previous point than
> > this last will be simple to solve.
> > 
> > Just to be clear, the rb-trees are per CPU and used to track just the
> > RUNNABLE tasks on each CPUs and, as I described in the previous
> > example, for the OPP biasing to work I think we really need an
> > aggregation mechanism.
> 
> I know its runnable, which is exactly what the regular RB tree in fair
> already tracks. That gets us 3 RB tree operations per scheduling
> operation, which is entirely ridiculous.
> 
> And while I disagree with the need to track this at all, see below, it
> can be done _much_ cheaper using a double augmented RB-tree, where you
> keep the min/max as heaps inside the regular RB-tree.

By regular rb-tree do you mean the cfs_rq->tasks_timeline?

Because in that case this would apply only to the FAIR class, while
the rb-trees we are using here are across classes.
Supporting both FAIR and RT is, I think, a feature worth having.

> > Ideally, every time a task is enqueue/dequeued from a CPU we want to
> > know what is the currently required capacity clamping. This requires
> > to maintain an ordered list of values and rb-trees are the most effective
> > solution.
> > 
> > Perhaps, if we accept a certain level of approximation, we can
> > potentially reduce the set of tracked values to a finite set (maybe
> > corresponding to the EM capacity values) and use just a per-cpu
> > ref-counting array.
> > Still the min/max aggregation will have a worst case O(N) search
> > complexity, where N is the number of finite values we want to use.
> 
> So the bigger point is that if the min/max is a per-task property (even
> if set through a cgroup interface), the min(max) / max(min) thing is
> wrong.

Perhaps I'm not following you here, but being per-task does not mean
that we need to aggregate these constraints by summing them (see
below)...

> If the min/max were to apply to each individual task's util, you'd end
> up with something like: Dom(\Sum util) = [min(1, \Sum min), min(1, \Sum max)].

... as you do here.

Let's use the usual simple example, where these per-task constraints
are configured:
- TaskA: capacity_min: 20% capacity_max: 80%
- TaskB: capacity_min: 40% capacity_max: 60%

This means that, at CPU level, we want to enforce the following
clamping depending on the tasks status:

 RUNNABLE tasks    capacity_min    capacity_max
A) TaskA                      20%             80%
B) TaskA,TaskB                40%             80%
C) TaskB                      40%             60%
 
In case B, TaskA gets a bigger boost while it is co-scheduled with TaskB.

Notice that this CPU-level aggregation is used just for OPP selection
on that CPU, while for TaskA we still use capacity_min=20% when we are
looking for a CPU.

> Where a possible approximation is scaling the aggregate util into that
> domain.
>
Peter Zijlstra April 12, 2017, 2:34 p.m. UTC | #10
On Wed, Apr 12, 2017 at 02:27:41PM +0100, Patrick Bellasi wrote:
> On 12-Apr 14:48, Peter Zijlstra wrote:
> > On Tue, Apr 11, 2017 at 06:58:33PM +0100, Patrick Bellasi wrote:
> > > >     illustrated per your above points in that it affects both, while in
> > > >     fact it actually modifies another metric, namely util_avg.
> > > 
> > > I don't see it modifying in any direct way util_avg.
> > 
> > The point is that clamps called 'capacity' are applied to util. So while
> > you don't modify util directly, you do modify the util signal (for one
> > consumer).
> 
> Right, but this consumer (i.e. schedutil) it's already translating
> the util_avg into a next_freq (which ultimately it's a capacity).
> 
> Thus, I don't see a big misfit in that code path to "filter" this
> translation with a capacity clamp.

Still strikes me as odd though.
Peter Zijlstra April 12, 2017, 2:41 p.m. UTC | #11
On Wed, Apr 12, 2017 at 02:34:48PM +0100, Patrick Bellasi wrote:
> On 12-Apr 14:15, Peter Zijlstra wrote:
> > On Tue, Apr 11, 2017 at 06:58:33PM +0100, Patrick Bellasi wrote:
> > > We should consider also that at the CPUFreq side we already expose
> > > knobs like scaling_{min,max}_freq which are much more platform
> > > dependant than capacity.
> > 
> > So I've always objected to these knobs.
> > 
> > That said; they are a pre-existing user interface so changing them isn't
> > really feasible much.
> > 
> > But at the very least we should integrate them properly. Which for
> > schedutil would mean they have to directly modify the relevant CPU
> > capacity values. Is this currently done? (I think not.)
> 
> AFAICS they are clamping the policy decisions, thus the frequency
> domain... which can be more then one CPU on ARM platforms.

Right, knew that. Must've missed the 's' when typing ;-)

> When you say you would like them to "directly modify the relevant CPU
> capacity values" I really see this as exactly what we do with
> capacity_{min,max}.

What I mean is that when we set max as lower than max-freq, it should
reduce the CPU(s)' capacity. That's something entirely different from what
you're attempting (and not something we do currently afaik).

Also; there's as yet unexplored interaction between these knobs and the
RT bandwidth limits.

But the main point is that these knobs are system wide and do not affect
tasks.
Patrick Bellasi April 12, 2017, 2:43 p.m. UTC | #12
On 12-Apr 16:34, Peter Zijlstra wrote:
> On Wed, Apr 12, 2017 at 02:27:41PM +0100, Patrick Bellasi wrote:
> > On 12-Apr 14:48, Peter Zijlstra wrote:
> > > On Tue, Apr 11, 2017 at 06:58:33PM +0100, Patrick Bellasi wrote:
> > > > >     illustrated per your above points in that it affects both, while in
> > > > >     fact it actually modifies another metric, namely util_avg.
> > > > 
> > > > I don't see it modifying in any direct way util_avg.
> > > 
> > > The point is that clamps called 'capacity' are applied to util. So while
> > > you don't modify util directly, you do modify the util signal (for one
> > > consumer).
> > 
> > Right, but this consumer (i.e. schedutil) it's already translating
> > the util_avg into a next_freq (which ultimately it's a capacity).
> > 
> > Thus, I don't see a big misfit in that code path to "filter" this
> > translation with a capacity clamp.
> 
> Still strikes me as odd though.

Can you elaborate a bit more on why?
Peter Zijlstra April 12, 2017, 3:37 p.m. UTC | #13
On Wed, Apr 12, 2017 at 02:55:38PM +0100, Patrick Bellasi wrote:
> On 12-Apr 14:10, Peter Zijlstra wrote:

> > Even for the cgroup interface, I think they should set a per-task
> > property, not a group property.
> 
> Ok, right now using CGroups ans primary (and unique) interface, these
> values are tracked as attributes of the CPU controller. Tasks gets
> them by reading these attributes once they are binded to a CGroup.
> 
> Are you proposing to move these attributes within the task_struct?

/me goes look at your patches again, because I thought you already set
per task_struct values.

@@ -1531,6 +1531,9 @@ struct task_struct {
        struct sched_rt_entity rt;
 #ifdef CONFIG_CGROUP_SCHED
        struct task_group *sched_task_group;
+#ifdef CONFIG_CAPACITY_CLAMPING
+       struct rb_node cap_clamp_node[2];
+#endif

Yeah, see...

> In that case we should also defined a primary interface to set them,
> any preferred proposal? sched_setattr(), prctl?

We could, which I think is the important point.

> By regular rb-tree do you mean the cfs_rq->tasks_timeline?

Yep.

> Because in that case this would apply only to the FAIR class, while
> the rb-tree we are using here are across classes.
> Supporting both FAIR and RT I think is a worth having feature.

*groan* I don't want to even start thinking what this feature means in
the context of RT, head hurts enough.

> > So the bigger point is that if the min/max is a per-task property (even
> > if set through a cgroup interface), the min(max) / max(min) thing is
> > wrong.
> 
> Perhaps I'm not following you here but, being per-task does not mean
> that we need to aggregate these constraints by summing them (look
> below)...
>
> > If the min/max were to apply to each individual task's util, you'd end
> > up with something like: Dom(\Sum util) = [min(1, \Sum min), min(1, \Sum max)].
> 
> ... as you do here.
> 
> Let's use the usual simple example, where these per-tasks constraints
> are configured:
>
> - TaskA: capacity_min: 20% capacity_max: 80%
> - TaskB: capacity_min: 40% capacity_max: 60%
> 
> This means that, at CPU level, we want to enforce the following
> clamping depending on the tasks status:
> 
>  RUNNABLE tasks    capacity_min    capacity_max
> A) TaskA                      20%             80%
> B) TaskA,TaskB                40%             80%
> C) TaskB                      40%             60%
>  
> In case C, TaskA gets a bigger boost while is co-scheduled with TaskB.

(bit unfortunate you gave your cases and tasks the same enumeration)

But this I quite strongly feel is wrong. If you've given your tasks a
minimum OPP, you've in fact given them a minimum bandwidth, for at a
given frequency you can say how long they'll run, right?

So if you want to maintain that, case B should be 60%. Once one of the
tasks completes it will drop again. That is, the increased value
represents the additional runnable 'load' over the min from the
currently running task. Combined they will still complete in reduced
time.
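
(Reading this as: the effective floor is the capped sum of the runnable
tasks' minimums; illustration only, names invented.)

  /* effective_min = min(100%, sum of the runnable tasks' capacity_min);
   * with TaskA min=20% and TaskB min=40% both runnable this gives 60%,
   * dropping back to the remaining task's min once one of them completes.
   */
  static unsigned int cap_min_sum(const unsigned int *task_cap_min, int nr_runnable)
  {
          unsigned int sum = 0;
          int i;

          for (i = 0; i < nr_runnable; i++)
                  sum += task_cap_min[i];

          return sum > 100 ? 100 : sum;
  }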

> Notice that this CPU-level aggregation is used just for OPP selection
> on that CPU, while for TaskA we still use capacity_min=20% when we are
> looking for a CPU.

And you don't find that inconsistent?
Peter Zijlstra April 12, 2017, 4:14 p.m. UTC | #14
On Wed, Apr 12, 2017 at 03:43:10PM +0100, Patrick Bellasi wrote:
> On 12-Apr 16:34, Peter Zijlstra wrote:
> > On Wed, Apr 12, 2017 at 02:27:41PM +0100, Patrick Bellasi wrote:
> > > On 12-Apr 14:48, Peter Zijlstra wrote:
> > > > On Tue, Apr 11, 2017 at 06:58:33PM +0100, Patrick Bellasi wrote:
> > > > > >     illustrated per your above points in that it affects both, while in
> > > > > >     fact it actually modifies another metric, namely util_avg.
> > > > > 
> > > > > I don't see it modifying in any direct way util_avg.
> > > > 
> > > > The point is that clamps called 'capacity' are applied to util. So while
> > > > you don't modify util directly, you do modify the util signal (for one
> > > > consumer).
> > > 
> > > Right, but this consumer (i.e. schedutil) it's already translating
> > > the util_avg into a next_freq (which ultimately it's a capacity).
> > > 
> > > Thus, I don't see a big misfit in that code path to "filter" this
> > > translation with a capacity clamp.
> > 
> > Still strikes me as odd though.
> 
> Can you better elaborate on they why?

Because capacity is, as you pointed out earlier, a relative measure of
inter CPU performance (which isn't otherwise exposed to userspace
afaik).

While the utilization thing is a per task running signal.

There is no direct relation between the two.

The two main uses for the util signal are:

  OPP selection: the aggregate util of all runnable tasks for a
    particular CPU is used to select an OPP for said CPU [*], against
    whatever max-freq that CPU has. Capacity doesn't really come into play
    here.

  Task placement: capacity comes into play in so far that we want to
    make sure our task fits.


And I'm not at all sure we want to have both uses of our utilization
controlled by the one knob. They're quite distinct.


[*] yeah, I know clock domains with multiple CPUs in etc.. lets keep
this simple ;-)
Patrick Bellasi April 13, 2017, 10:34 a.m. UTC | #15
On 12-Apr 18:14, Peter Zijlstra wrote:
> On Wed, Apr 12, 2017 at 03:43:10PM +0100, Patrick Bellasi wrote:
> > On 12-Apr 16:34, Peter Zijlstra wrote:
> > > On Wed, Apr 12, 2017 at 02:27:41PM +0100, Patrick Bellasi wrote:
> > > > On 12-Apr 14:48, Peter Zijlstra wrote:
> > > > > On Tue, Apr 11, 2017 at 06:58:33PM +0100, Patrick Bellasi wrote:
> > > > > > >     illustrated per your above points in that it affects both, while in
> > > > > > >     fact it actually modifies another metric, namely util_avg.
> > > > > > 
> > > > > > I don't see it modifying in any direct way util_avg.
> > > > > 
> > > > > The point is that clamps called 'capacity' are applied to util. So while
> > > > > you don't modify util directly, you do modify the util signal (for one
> > > > > consumer).
> > > > 
> > > > Right, but this consumer (i.e. schedutil) it's already translating
> > > > the util_avg into a next_freq (which ultimately it's a capacity).

                                                               ^^^^^^^^
                                                                [REF1]

> > > > 
> > > > Thus, I don't see a big misfit in that code path to "filter" this
> > > > translation with a capacity clamp.
> > > 
> > > Still strikes me as odd though.
> > 
> > Can you better elaborate on they why?
> 
> Because capacity is, as you pointed out earlier, a relative measure of
> inter CPU performance (which isn't otherwise exposed to userspace
> afaik).

Perhaps, since I'm biased by EAS concepts which are still not
mainline, I was not clear in specifying what I meant by "capacity" in
[REF1].

My fault, sorry; perhaps it's worth starting by reviewing some
concepts to see if we can establish a common language.


.:: Mainline

If we look at mainline, "capacity" is actually a concept used to
represent the computational bandwidth available in a CPU, when running
at the highest OPP (let's consider SMP systems to keep it simple).

But things are already a bit more complicated. Specifically, looking
at update_cpu_capacity(), we distinguish between:

- cpu_rq(cpu)->cpu_capacity_orig
  which is the bandwidth available at the max OPP.

- cpu_rq(cpu)->cpu_capacity
  which discounts from the previous metric the "average" bandwidth used
  by RT tasks, but not (yet) DEADLINE tasks afaics.

Thus, "capacity" is already a polymorphic concept:
  we use cpu_capacity_orig to cap the cpu utilization of CFS tasks
  in cpu_util()
but
  this cpu utilization is a signal which converge to "current capacity"
  in  ___update_load_avg()

The "current capacity" (capacity_curr, but just in some comments) is actually
the computational bandwidth available at a certain OPP.

Thus, we already have in mainline a concept of capacity which refers to the
bandwidth available at a certain OPP. The "current capacity" is what we
ultimately use to scale PELT depending on the current OPP.


.:: EAS

Looking at EAS, and specifically the energy model, we describe each
OPP using a:

  struct capacity_state {
          unsigned long cap;      /* compute capacity */
          unsigned long power;    /* power consumption at this compute capacity */
  };

Where again we find a usage of the "current capacity", i.e. the
computational bandwidth available at each OPP.


.:: Current Capacity

In [REF1] I was referring to the concept of "current capacity", which is what
schedutil is after. There we need to translate cfs.avg.util_avg into an OPP, which
ultimately is a suitable level of "current capacity" to satisfy the
CPU bandwidth requested by CFS tasks.

> While the utilization thing is a per task running signal.

Which still is converging to the "current capacity", at least before
Vincent's patches.

> There is no direct relation between the two.

Given the previous definitions, can we say that there is a relation between task
utilization and "current capacity"?

     Sum(task_utilization) = cpu_utilization
            <= "current capacity" (cpufreq_schedutil::get_next_freq())    [1]
            <= cpu_capacity_orig

> The two main uses for the util signal are:
> 
>   OPP selection: the aggregate util of all runnable tasks for a
>     particular CPU is used to select an OPP for said CPU [*], against
>     whatever max-freq that CPU has. Capacity doesn't really come into play
>     here.

The OPP selected has to provide a suitable amount of "current capacity" to
accommodate the required utilization.

>   Task placement: capacity comes into play in so far that we want to
>     make sure our task fits.

These two usages are not completely independent, at least when EAS is
in use. In EAS we can evaluate/compare scenarios like:

  "should I increase the capacity of CPUx or wakeup CPUy"

Thus, we use capacity indexes to estimate energy deltas by
moving a task and, by consequence, changing a CPU's OPP.

Which means: expected "capacity" variations are affecting OPP selections.

> And I'm not at all sure we want to have both uses of our utilization
> controlled by the one knob. They're quite distinct.

The proposed knobs, for example capacity_min, are used to clamp the
scheduler/schedutil view of the required "current capacity", by
modifying the previous relation [1] to be:

                 Sum(task_utilization) = cpu_utilization
            clamp(cpu_utilization, capacity_min, capacity_max)
                        <= "current capacity"
                        <= cpu_capacity_orig

In [1] we already have a transformation from the cpu_utilization
domain to the "current capacity" domain. Here we are just adding a
clamping filter around that transformation.
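
As a rough sketch of where that filter would sit (this is not the actual
schedutil code, just the shape of the transformation):

  /* clamp the aggregated utilization before it is turned into a frequency */
  static unsigned long clamp_util(unsigned long util,
                                  unsigned long cap_min, unsigned long cap_max)
  {
          if (util < cap_min)
                  return cap_min;
          if (util > cap_max)
                  return cap_max;

          return util;
  }

  /* next_freq is then derived from clamp_util(cpu_utilization, ...) instead
   * of from the raw cpu_utilization, leaving how utilization is "produced"
   * untouched.
   */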

I hope this is useful to find some common ground; perhaps the naming
capacity_{min,max} is unfortunate and we can find a better one.
However, we should first agree on the utility of the proposed
clamping concept... ;-)
Patrick Bellasi April 13, 2017, 11:33 a.m. UTC | #16
On 12-Apr 17:37, Peter Zijlstra wrote:
> On Wed, Apr 12, 2017 at 02:55:38PM +0100, Patrick Bellasi wrote:
> > On 12-Apr 14:10, Peter Zijlstra wrote:
> 
> > > Even for the cgroup interface, I think they should set a per-task
> > > property, not a group property.
> > 
> > Ok, right now using CGroups ans primary (and unique) interface, these
> > values are tracked as attributes of the CPU controller. Tasks gets
> > them by reading these attributes once they are binded to a CGroup.
> > 
> > Are you proposing to move these attributes within the task_struct?
> 
> /me goes look at your patches again, because I thought you already set
> per task_struct values.
> 
> @@ -1531,6 +1531,9 @@ struct task_struct {
>         struct sched_rt_entity rt;
>  #ifdef CONFIG_CGROUP_SCHED
>         struct task_group *sched_task_group;
> +#ifdef CONFIG_CAPACITY_CLAMPING
> +       struct rb_node cap_clamp_node[2];
> +#endif
> 
> Yeah, see...

Well, these are not the actual attributes.

These rb_nodes are used to sort the tasks based on their constraints, but the
actual attribute values are stored in:

@@ -273,6 +273,14 @@ struct task_group {
 #endif
 #endif
 
+#ifdef CONFIG_CAPACITY_CLAMPING
+#define CAP_CLAMP_MIN 0
+#define CAP_CLAMP_MAX 1
+
+       /* Min and Max capacity constraints for tasks in this group */
+       unsigned int cap_clamp[2];
+#endif
+

This is done to avoid replicating information in each task's structure.

> > In that case we should also defined a primary interface to set them,
> > any preferred proposal? sched_setattr(), prctl?
> 
> We could, which I think is the important point.
> 
> > By regular rb-tree do you mean the cfs_rq->tasks_timeline?
> 
> Yep.
> 
> > Because in that case this would apply only to the FAIR class, while
> > the rb-tree we are using here are across classes.
> > Supporting both FAIR and RT I think is a worth having feature.
> 
> *groan* I don't want to even start thinking what this feature means in
> the context of RT, head hurts enough.

:-)

Still, mobile people *groan* when we go to max OPP every time an RT task runs.
Here you can find some energy numbers I've got recently on Pixel phones:
   https://lkml.org/lkml/2017/3/17/214
7%-54% more energy, for no benefit, is a big deal.

Of course, there can be many different solutions to this problem, but
capacity_max allows us to easily clamp the frequency used when certain RT
tasks run, while still keeping them within the expected latency performance.

> > > So the bigger point is that if the min/max is a per-task property (even
> > > if set through a cgroup interface), the min(max) / max(min) thing is
> > > wrong.
> > 
> > Perhaps I'm not following you here but, being per-task does not mean
> > that we need to aggregate these constraints by summing them (look
> > below)...
> >
> > > If the min/max were to apply to each individual task's util, you'd end
> > > up with something like: Dom(\Sum util) = [min(1, \Sum min), min(1, \Sum max)].
> > 
> > ... as you do here.
> > 
> > Let's use the usual simple example, where these per-tasks constraints
> > are configured:
> >
> > - TaskA: capacity_min: 20% capacity_max: 80%
> > - TaskB: capacity_min: 40% capacity_max: 60%
> > 
> > This means that, at CPU level, we want to enforce the following
> > clamping depending on the tasks status:
> > 
> >  RUNNABLE tasks    capacity_min    capacity_max
> > A) TaskA                      20%             80%
> > B) TaskA,TaskB                40%             80%
> > C) TaskB                      40%             60%
> >  
> > In case C, TaskA gets a bigger boost while is co-scheduled with TaskB.
> 
> (bit unfortunate you gave your cases and tasks the same enumeration)
> 
> But this I quite strongly feel is wrong. If you've given your tasks a
> minimum OPP, you've in fact given them a minimum bandwidth, for at a
> given frequency you can say how long they'll run, right?

Not really, we are still in the domain of a best-effort solution, and
I think we should stick with that.

The overall idea is not about allocating bandwidth at all, but about
expressing preferences, and there are two main reasons:

1) in principle we don't know how long a CFS task will run;
   we just know that the faster it completes, the better

   Think about a task which is relatively small but serves to
   trigger further processing on an external device (e.g. a GPU).
   In this case the task is part of a critical path and the sooner it
   finishes the better.
   It can be the case that allocating bandwidth for such a task is not
   easy, e.g. because the amount of processing the task does can
   vary considerably between activations.

   In this case you have two options:
   a) measure/estimate the WCET and go for over-budgeting,
      likely using DEADLINE
   b) find the minimum capacity which allows your task to complete
      reasonably fast most of the time

   The second is of course a best-effort approach; still, I find it
   could be useful to have, and it can be easily adapted at run-time to
   express a sort of power-vs-performance trade-off.

2) if you really want a guaranteed bandwidth, you quite likely also want
   a guaranteed deadline... and you should go for DEADLINE.

> So if you want to maintain that case B should be 60%. Once one of the
> tasks completes it will drop again. That is, the increased value
> represents the additional runnable 'load' over the min from the
> currently running task. Combined they will still complete in reduced
> time.

We already experimented with this approach in the past; actually, the
first version of SchedTune was based on the idea of aggregating by
adding the boosted utilizations.

It's true that in that way we are more likely to speed up task
completion also in the case of co-scheduling, but the downside is that we
are entering the domain of "guaranteed bandwidth allocation", which is
likely overkill for the scope of a best-effort solution.

Moreover, since bandwidth is a limited resource, we also found such an
approach does not fit well for systems where a certain number
of tasks run concurrently, whereas a simple threshold-based
boosting, where max() is used as the aggregation function, still seems
useful.


> > Notice that this CPU-level aggregation is used just for OPP selection
> > on that CPU, while for TaskA we still use capacity_min=20% when we are
> > looking for a CPU.
> 
> And you don't find that inconsistent?

In the previous example, TaskB seems to prefer a CPU which has between
40% and 60% capacity.
Let's assume these numbers come from a use-case where:
 a) your system provides 60% capacity in a LITTLE CPU
 b) you are after "sustained performance" for TaskB, which on that
    platform can be easily achieved by running at 40% of a LITTLE CPU

Don't you think that this can be valuable information for the
scheduler to just (possibly) prefer a LITTLE CPU?

With a max() aggregation we can place both TaskA and TaskB on the
LITTLE CPU and try to run them at least at 40% capacity.

Patch

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 27b4dd55b0c7..20ed17369cb1 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6963,18 +6963,31 @@  static void cpu_cgroup_attach(struct cgroup_taskset *tset)
 }
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
-static int cpu_shares_write_u64(struct cgroup_subsys_state *css,
-				struct cftype *cftype, u64 shareval)
+static int cpu_nice_write_s64(struct cgroup_subsys_state *css,
+				struct cftype *cftype, s64 val)
 {
-	return sched_group_set_shares(css_tg(css), scale_load(shareval));
+	struct task_group *tg = css_tg(css);
+	unsigned long weight;
+
+	int ret;
+
+	if (val < MIN_NICE || val > MAX_NICE)
+		return -EINVAL;
+
+	weight = sched_prio_to_weight[val - MIN_NICE];
+
+	ret = sched_group_set_shares(tg, scale_load(weight));
+
+	if (!ret)
+		tg->nice = val;
 }
 
-static u64 cpu_shares_read_u64(struct cgroup_subsys_state *css,
+static u64 cpu_shares_read_s64(struct cgroup_subsys_state *css,
 			       struct cftype *cft)
 {
 	struct task_group *tg = css_tg(css);
 
-	return (u64) scale_load_down(tg->shares);
+	return (s64)tg->nice;
 }
 
 #ifdef CONFIG_CFS_BANDWIDTH
@@ -7252,9 +7265,9 @@  static u64 cpu_rt_period_read_uint(struct cgroup_subsys_state *css,
 static struct cftype cpu_files[] = {
 #ifdef CONFIG_FAIR_GROUP_SCHED
 	{
-		.name = "shares",
-		.read_u64 = cpu_shares_read_u64,
-		.write_u64 = cpu_shares_write_u64,
+		.name = "nice",
+		.read_s64 = cpu_nice_read_s64,
+		.write_s64 = cpu_nice_write_s64,
 	},
 #endif
 #ifdef CONFIG_CFS_BANDWIDTH
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 76f67b3e34d6..8088043f46d1 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9471,7 +9471,7 @@  int sched_group_set_shares(struct task_group *tg, unsigned long shares)
 	if (!tg->se[0])
 		return -EINVAL;
 
-	shares = clamp(shares, scale_load(MIN_SHARES), scale_load(MAX_SHARES));
+	shares = clamp(shares, MIN_SHARES, scale_load(MAX_SHARES));
 
 	mutex_lock(&shares_mutex);
 	if (tg->shares == shares)
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 57caf3606114..27f1ffe763bc 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -283,6 +283,7 @@  struct task_group {
 	/* runqueue "owned" by this group on each cpu */
 	struct cfs_rq **cfs_rq;
 	unsigned long shares;
+	int nice;
 
 #ifdef	CONFIG_SMP
 	/*