Message ID: 20170410073622.2y6tnpcd2ssuoztz@hirez.programming.kicks-ass.net (mailing list archive)
State: RFC, archived
Hi Peter, sorry for this pretty long response, but I think that if you
have time to patiently go through all the following comments and
questions it will be a big step forward in defining what we really want.

On 10-Apr 09:36, Peter Zijlstra wrote:
> On Mon, Mar 20, 2017 at 05:22:33PM +0000, Patrick Bellasi wrote:
>
> > a) Bias OPP selection.
> >    Thus granting that certain critical tasks always run at least at a
> >    specified frequency.
> >
> > b) Bias TASKS placement, which requires an additional extension not
> >    yet posted to keep things simple.
> >    This allows heterogeneous systems, where different CPUs have
> >    different capacities, to schedule important tasks in more capable
> >    CPUs.
>
> So I dislike the capacity min/max knobs because:

From your following points I see two main topics: "fundamental
concepts" and "implementation details". Let's start from the
"fundamental concepts", since I think we are not yet aligned on the
overall view and goals for this proposal.

.:: Fundamental Concepts

> 1) capacity is more an implementation detail than a primary metric;

Capacity has been defined in a "platform independent" way: it's a
metric which is normalized to 1024 for the most capable CPU running at
the maximum OPP. To me it seems it can be considered a primary metric,
maybe not in its current form, i.e. by exposing a raw number in the
[0..1024] range.

We should also consider that on the CPUFreq side we already expose
knobs like scaling_{min,max}_freq which are much more platform
dependent than capacity.

> illustrated per your above points in that it affects both, while in
> fact it actually modifies another metric, namely util_avg.

I don't see it modifying util_avg in any direct way. Capacity_{min,max}
are tracked independently from util_avg and used "on demand" to "clamp"
the original util_avg signal.
There are two main concepts I would like to know your opinion about:

1) we would like to have support to bias OPP selection and task
   placement

2) since util_avg directly affects both OPP selection and task
   placement, we can consider somehow "biasing" the "usage of" util_avg
   to achieve these goals

IMHO, if our goal is to bias OPP selection and task placement, we
should consider "working on top of" util_avg, thus getting a better
separation of concerns:

- util_avg is (already) used to define how utilization is "produced"

- capacity_{min,max} is used to _bias_ how utilization is "consumed"
  (i.e. by schedutil and the FAIR scheduler)

> 2) they look like an absolute clamp for a best effort class; while it
>    very much is not. This also leads to very confusing nesting
>    properties re cgroup.

If by absolute you mean something mandatory, I agree. We are still in
the best effort domain. But perhaps it's just a naming issue.

We want to request a possible minimum capacity, for example to prefer
a big CPU vs a LITTLE one. However, being "not mandatory" to me does
not mean that it necessarily has to be "more abstract" like the one
you are proposing below.

Since we are defining a "tuning interface", the _possible_ effect of a
certain value should be easy to understand and to map onto a possible
outcome. To me the proposed capacity clamping satisfies this goal,
while in a previous proposal, where we advanced the idea of a more
generic "boost" value, there were complaints (e.g. from PJT) about it
not being really clear what the possible end result was.

Sorry, I don't get instead what the "confusing nesting properties" you
are referring to are.

> 4) they have muddled semantics, because while its presented as a task
>    property, it very much is not.

Actually we always presented it as a task group property, while other
people suggested it should be a per-task API.
Still, IMO it could also make sense as a per-task API, for example
considering a specific RT task which we know in our system is
perfectly fine to always run below a certain OPP.

Do you think it should be a per-task API?
Should we focus (at least initially) on providing a per task-group API?

> The max(min) and min(max) of all
> runnable tasks is applied to the aggregate util signal. Also see 2.

Regarding the max/min aggregations, they are an intended feature. The
idea is to implement a sort of (best-effort of course) inheritance
mechanism.

For example, if we have these two RUNNABLE tasks on a CPU:

  Task A, capacity_min=200
  Task B, capacity_min=400

then we want to run the CPU at least at 40% capacity even when A is
running while B is enqueued.

Don't you think this makes sense?

.:: Implementation details

> 3) they have absolutely unacceptable overhead in implementation. Two
>    more RB tree operations per enqueue/dequeue is just not going to
>    happen.

This last point is about "implementation details": I'm pretty
confident that if we find an agreement on the previous points then
this last one will be simple to solve.

Just to be clear, the rb-trees are per CPU and used to track just the
RUNNABLE tasks on each CPU and, as I described in the previous example,
for the OPP biasing to work I think we really need an aggregation
mechanism.

Ideally, every time a task is enqueued/dequeued on a CPU we want to
know what the currently required capacity clamping is. This requires
maintaining an ordered list of values, and rb-trees are the most
effective solution.

Perhaps, if we accept a certain level of approximation, we can
potentially reduce the set of tracked values to a finite set (maybe
corresponding to the EM capacity values) and use just a per-cpu
ref-counting array. Still, the min/max aggregation will have a worst
case O(N) search complexity, where N is the number of finite values we
want to use.

Do you think such a solution can work better?
Just for completeness, I've reported at the end an overhead analysis
for the current implementation; have a look only if the response to
the previous question was negative ;-)

> So 'nice' is a fairly well understood knob; it controls relative time
> received between competing tasks (and we really should replace the cpu
> controllers 'shares' file with a 'nice' file, see below).

Agree on that point, it's an interesting cleanup/consolidation,
however I cannot see it fitting our goals...

> I would much rather introduce something like nice-opp, which only
> affects the OPP selection in a relative fashion. This immediately also
> has a natural and obvious extension to cgroup hierarchies (just like
> regular nice).

... affecting OPPs in a relative fashion sounds like a nice abstract
interface but I'm not convinced it can do the job:

a) We should define what "relative" means. While time partitioning
   between competing tasks makes perfect sense, I struggle to see how
   we can boost a task relative to other co-scheduled tasks. Does it
   even make any sense?

b) Nice values have this 10% time delta per step, which will be odd to
   map onto OPP selection. What does it mean to boost a task 10% less
   than another?

c) Nice values have a well defined discrete range [-20..19]. Does it
   make sense to have such a range for OPP biasing? Doesn't a 10%
   relative delta per step make this range too wide? For example, at
   OPP-nice 5 we can probably already be at the minimum OPP.

d) How do we use nice-opp to bias task placement?

> And could be folded as a factor in util_avg (just as we do with weight
> in load_avg), although this will mess up everything else we use util for
> :/.

Not if we use get() methods which are called from the code places
where we are interested in getting the biased value.

> Or, alternatively, decompose the aggregate value upon usage and only
> apply the factor for the current running task or something.... Blergh..
> difficult..
Maybe just because we are trying to abstract too much a relatively
simple concept? Still, I don't completely get the downside of using a
simple, well defined and generic concept like "capacity"...

> ---
>  kernel/sched/core.c  | 29 +++++++++++++++++++++--------
>  kernel/sched/fair.c  |  2 +-
>  kernel/sched/sched.h |  1 +
>  3 files changed, 23 insertions(+), 9 deletions(-)
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 27b4dd55b0c7..20ed17369cb1 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -6963,18 +6963,31 @@ static void cpu_cgroup_attach(struct cgroup_taskset *tset)
>  }
>
>  #ifdef CONFIG_FAIR_GROUP_SCHED
> -static int cpu_shares_write_u64(struct cgroup_subsys_state *css,
> -				struct cftype *cftype, u64 shareval)
> +static int cpu_nice_write_s64(struct cgroup_subsys_state *css,
> +			      struct cftype *cftype, s64 val)
>  {
> -	return sched_group_set_shares(css_tg(css), scale_load(shareval));
> +	struct task_group *tg = css_tg(css);
> +	unsigned long weight;
> +
> +	int ret;
> +
> +	if (val < MIN_NICE || val > MAX_NICE)
> +		return -EINVAL;
> +
> +	weight = sched_prio_to_weight[val - MIN_NICE];
> +
> +	ret = sched_group_set_shares(tg, scale_load(weight));
> +
> +	if (!ret)
> +		tg->nice = val;
>  }
>
> -static u64 cpu_shares_read_u64(struct cgroup_subsys_state *css,
> +static u64 cpu_shares_read_s64(struct cgroup_subsys_state *css,
                ^^^^^^^^^^^^^^^^^^
                you mean: cpu_nice_read_s64

>  			       struct cftype *cft)
>  {
>  	struct task_group *tg = css_tg(css);
>
> -	return (u64) scale_load_down(tg->shares);
> +	return (s64)tg->nice;
>  }
>
>  #ifdef CONFIG_CFS_BANDWIDTH
> @@ -7252,9 +7265,9 @@ static u64 cpu_rt_period_read_uint(struct cgroup_subsys_state *css,
>  static struct cftype cpu_files[] = {
>  #ifdef CONFIG_FAIR_GROUP_SCHED
>  	{
> -		.name = "shares",
> -		.read_u64 = cpu_shares_read_u64,
> -		.write_u64 = cpu_shares_write_u64,
> +		.name = "nice",
> +		.read_s64 = cpu_nice_read_s64,
> +		.write_s64 = cpu_nice_write_s64,

However, isn't this going to break a user-space
API?

>  	},
>  #endif
>  #ifdef CONFIG_CFS_BANDWIDTH
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 76f67b3e34d6..8088043f46d1 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -9471,7 +9471,7 @@ int sched_group_set_shares(struct task_group *tg, unsigned long shares)
>  	if (!tg->se[0])
>  		return -EINVAL;
>
> -	shares = clamp(shares, scale_load(MIN_SHARES), scale_load(MAX_SHARES));
> +	shares = clamp(shares, MIN_SHARES, scale_load(MAX_SHARES));
                               ^^^^^^^^^^
                               Why do you remove the scaling here?

>
>  	mutex_lock(&shares_mutex);
>  	if (tg->shares == shares)
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 57caf3606114..27f1ffe763bc 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -283,6 +283,7 @@ struct task_group {
>  	/* runqueue "owned" by this group on each cpu */
>  	struct cfs_rq **cfs_rq;
>  	unsigned long shares;
> +	int nice;
>
>  #ifdef CONFIG_SMP
>  	/*

.:: Overhead analysis for the current solution

Here are the stats on completion times I've got for 30 runs of:

  perf bench sched messaging --pipe --thread --group 1 --loop 5000

with tasks pinned to the two big CPUs of a Juno2 board and using the
performance governor:

            count   mean    std    min    25%    50%    75%    max
  with:      30.0  4.666  0.164  4.228  4.585  4.649  4.781  4.95
  without:   30.0  4.830  0.178  4.444  4.751  4.813  4.951  5.14

The average slowdown is ~3.5%, which we can easily agree is not
negligible. However, a couple of considerations are still worth making:

1) this is certainly not the target use-case: the proposed API is not
   targeting server and/or highly intensive workloads. In these cases
   this support should not be enabled at kernel compilation time. We
   are mainly targeting low-utilization and latency sensitive
   workloads.

2) the current implementation can perhaps be optimized by disabling it
   at run-time under certain working conditions, e.g. using something
   similar to what we do for EAS once we cross the tipping point and
   the system is marked as over-utilized.
3) there is a run-time overhead, of course, but if it gives
   energy/performance benefits for certain low-utilization workloads
   it's still worth paying.
Let me reply in parts as I read this.. easy things first :-) On Tue, Apr 11, 2017 at 06:58:33PM +0100, Patrick Bellasi wrote: > On 10-Apr 09:36, Peter Zijlstra wrote: > > 4) they have muddled semantics, because while its presented as a task > > property, it very much is not. > > Actually we always presented it as a task group property, while other > people suggested it should be a per-task API. > > Still, IMO it could make sense also as a per-task API, for example > considering a specific RT task which we know in our system is > perfectly fine to always run it below a certain OPP. > > Do you think it should be a per-task API? > Should we focus (at least initially) on providing a per task-group API? Even for the cgroup interface, I think they should set a per-task property, not a group property. > > 3) they have absolutely unacceptable overhead in implementation. Two > > more RB tree operations per enqueue/dequeue is just not going to > > happen. > > This last point is about "implementation details", I'm pretty > confident that if we find an agreement on the previous point than > this last will be simple to solve. > > Just to be clear, the rb-trees are per CPU and used to track just the > RUNNABLE tasks on each CPUs and, as I described in the previous > example, for the OPP biasing to work I think we really need an > aggregation mechanism. I know its runnable, which is exactly what the regular RB tree in fair already tracks. That gets us 3 RB tree operations per scheduling operation, which is entirely ridiculous. And while I disagree with the need to track this at all, see below, it can be done _much_ cheaper using a double augmented RB-tree, where you keep the min/max as heaps inside the regular RB-tree. > Ideally, every time a task is enqueue/dequeued from a CPU we want to > know what is the currently required capacity clamping. This requires > to maintain an ordered list of values and rb-trees are the most effective > solution. 
> > Perhaps, if we accept a certain level of approximation, we can > potentially reduce the set of tracked values to a finite set (maybe > corresponding to the EM capacity values) and use just a per-cpu > ref-counting array. > Still the min/max aggregation will have a worst case O(N) search > complexity, where N is the number of finite values we want to use. So the bigger point is that if the min/max is a per-task property (even if set through a cgroup interface), the min(max) / max(min) thing is wrong. If the min/max were to apply to each individual task's util, you'd end up with something like: Dom(\Sum util) = [min(1, \Sum min), min(1, \Sum max)]. Where a possible approximation is scaling the aggregate util into that domain.
On Tue, Apr 11, 2017 at 06:58:33PM +0100, Patrick Bellasi wrote: > We should consider also that at the CPUFreq side we already expose > knobs like scaling_{min,max}_freq which are much more platform > dependant than capacity. So I've always objected to these knobs. That said; they are a pre-existing user interface so changing them isn't really feasible much. But at the very least we should integrate them properly. Which for schedutil would mean they have to directly modify the relevant CPU capacity values. Is this currently done? (I think not.)
On Tue, Apr 11, 2017 at 06:58:33PM +0100, Patrick Bellasi wrote: > Sorry, I don't get instead what are the "confusing nesting properties" > you are referring to? If a parent group sets min=.2 and max=.8, what are the constraints on its child groups for setting their resp min and max? I can't immediately gives rules that would make sense. For instance, allowing a child to lower min would violate the parent constraint, while allowing a child to increase min would grant the child more resources than the parent. Neither seem like a good thing.
On Tue, Apr 11, 2017 at 06:58:33PM +0100, Patrick Bellasi wrote: > > illustrated per your above points in that it affects both, while in > > fact it actually modifies another metric, namely util_avg. > > I don't see it modifying in any direct way util_avg. The point is that clamps called 'capacity' are applied to util. So while you don't modify util directly, you do modify the util signal (for one consumer).
On 12-Apr 14:22, Peter Zijlstra wrote:
> On Tue, Apr 11, 2017 at 06:58:33PM +0100, Patrick Bellasi wrote:
> > Sorry, I don't get instead what are the "confusing nesting properties"
> > you are referring to?
>
> If a parent group sets min=.2 and max=.8, what are the constraints on
> its child groups for setting their resp min and max?

Currently the logic I'm proposing enforces this:

a) capacity_max can only be reduced
   because we accept that a child can be further constrained
   for example:
   - a resource manager allocates a max capacity to an application
   - the application itself knows that some of its children are
     background tasks and they can be further constrained

b) capacity_min can only be increased
   because we want to inhibit children from affecting overall
   performance
   for example:
   - a resource manager allocates a minimum capacity to an application
   - the application itself cannot slow down some of its children
     without risking affecting other (unknown) external entities

> I can't immediately gives rules that would make sense.

The second rule is more tricky, but I see it matching better an
overall decomposition scheme where a single resource manager allocates
a capacity_min to two different entities (A and B) which are
independent but (only it knows) are also cooperating.

Let's think about the Android run-time, which allocates resources to a
system service (entity A) which it knows has to interact with a
certain app (entity B). The cooperation dependency can be resolved
only by the resource manager, by assigning capacity_min at
entity-level CGroups. Thus, the entities' subgroups should not be
allowed to further reduce this constraint without risking impacting an
(unknown to them) external entity.

> For instance, allowing a child to lower min would violate the parent
> constraint,

Quite likely we don't want this.

> while allowing a child to increase min would grant the child
> more resources than the parent.

But still within the capacity_max enforced by the parent.
We should always consider the pair (min,max): once a parent has
defined this range, to me it seems ok that a child can freely play
within that range.

Why should a child group not be allowed to set:

  capacity_min_child = capacity_max_parent ?

> Neither seem like a good thing.
On 12-Apr 14:48, Peter Zijlstra wrote: > On Tue, Apr 11, 2017 at 06:58:33PM +0100, Patrick Bellasi wrote: > > > illustrated per your above points in that it affects both, while in > > > fact it actually modifies another metric, namely util_avg. > > > > I don't see it modifying in any direct way util_avg. > > The point is that clamps called 'capacity' are applied to util. So while > you don't modify util directly, you do modify the util signal (for one > consumer). Right, but this consumer (i.e. schedutil) it's already translating the util_avg into a next_freq (which ultimately it's a capacity). Thus, I don't see a big misfit in that code path to "filter" this translation with a capacity clamp.
On 12-Apr 14:15, Peter Zijlstra wrote:
> On Tue, Apr 11, 2017 at 06:58:33PM +0100, Patrick Bellasi wrote:
> > We should consider also that at the CPUFreq side we already expose
> > knobs like scaling_{min,max}_freq which are much more platform
> > dependant than capacity.
>
> So I've always objected to these knobs.
>
> That said; they are a pre-existing user interface so changing them isn't
> really feasible much.
>
> But at the very least we should integrate them properly. Which for
> schedutil would mean they have to directly modify the relevant CPU
> capacity values. Is this currently done? (I think not.)

AFAICS they are clamping the policy decisions, thus the frequency
domain... which can be more than one CPU on ARM platforms.

When you say you would like them to "directly modify the relevant CPU
capacity values", I really see this as exactly what we do with
capacity_{min,max}.

These two capacity clamping values are per task, and thus (by
definition) per CPU. Moreover, they have the interesting property of
being "aggregated" in such a way that, in this configuration:

  TaskA: capacity_max:  20%
  TaskB: capacity_max: 100%

when both tasks are RUNNABLE on the same CPU, that CPU is not capped
until TaskB leaves the CPU.

Do you think such a feature can be useful?
On 12-Apr 14:10, Peter Zijlstra wrote:
> Let me reply in parts as I read this.. easy things first :-)
>
> On Tue, Apr 11, 2017 at 06:58:33PM +0100, Patrick Bellasi wrote:
> > On 10-Apr 09:36, Peter Zijlstra wrote:
> >
> > > 4) they have muddled semantics, because while its presented as a task
> > >    property, it very much is not.
> >
> > Actually we always presented it as a task group property, while other
> > people suggested it should be a per-task API.
> >
> > Still, IMO it could make sense also as a per-task API, for example
> > considering a specific RT task which we know in our system is
> > perfectly fine to always run it below a certain OPP.
> >
> > Do you think it should be a per-task API?
> > Should we focus (at least initially) on providing a per task-group API?
>
> Even for the cgroup interface, I think they should set a per-task
> property, not a group property.

Ok, right now, using CGroups as the primary (and only) interface,
these values are tracked as attributes of the CPU controller. Tasks
get them by reading these attributes once they are bound to a CGroup.

Are you proposing to move these attributes into the task_struct?

In that case we should also define a primary interface to set them;
any preferred proposal? sched_setattr(), prctl?

> > > 3) they have absolutely unacceptable overhead in implementation. Two
> > >    more RB tree operations per enqueue/dequeue is just not going to
> > >    happen.
> >
> > This last point is about "implementation details", I'm pretty
> > confident that if we find an agreement on the previous point than
> > this last will be simple to solve.
> >
> > Just to be clear, the rb-trees are per CPU and used to track just the
> > RUNNABLE tasks on each CPUs and, as I described in the previous
> > example, for the OPP biasing to work I think we really need an
> > aggregation mechanism.
>
> I know its runnable, which is exactly what the regular RB tree in fair
> already tracks.
> That gets us 3 RB tree operations per scheduling
> operation, which is entirely ridiculous.
>
> And while I disagree with the need to track this at all, see below, it
> can be done _much_ cheaper using a double augmented RB-tree, where you
> keep the min/max as heaps inside the regular RB-tree.

By regular rb-tree do you mean the cfs_rq->tasks_timeline?

Because in that case this would apply only to the FAIR class, while
the rb-trees we are using here are across classes. Supporting both
FAIR and RT is, I think, a feature worth having.

> > Ideally, every time a task is enqueue/dequeued from a CPU we want to
> > know what is the currently required capacity clamping. This requires
> > to maintain an ordered list of values and rb-trees are the most effective
> > solution.
> >
> > Perhaps, if we accept a certain level of approximation, we can
> > potentially reduce the set of tracked values to a finite set (maybe
> > corresponding to the EM capacity values) and use just a per-cpu
> > ref-counting array.
> > Still the min/max aggregation will have a worst case O(N) search
> > complexity, where N is the number of finite values we want to use.
>
> So the bigger point is that if the min/max is a per-task property (even
> if set through a cgroup interface), the min(max) / max(min) thing is
> wrong.

Perhaps I'm not following you here but, being per-task does not mean
that we need to aggregate these constraints by summing them (look
below)...

> If the min/max were to apply to each individual task's util, you'd end
> up with something like: Dom(\Sum util) = [min(1, \Sum min), min(1, \Sum max)].

... as you do here.
Let's use the usual simple example, where these per-task constraints
are configured:

- TaskA: capacity_min: 20%  capacity_max: 80%
- TaskB: capacity_min: 40%  capacity_max: 60%

This means that, at CPU level, we want to enforce the following
clamping depending on the tasks' status:

  RUNNABLE tasks    capacity_min    capacity_max
  A) TaskA               20%             80%
  B) TaskA,TaskB         40%             80%
  C) TaskB               40%             60%

In case B, TaskA gets a bigger boost while it is co-scheduled with
TaskB.

Notice that this CPU-level aggregation is used just for OPP selection
on that CPU, while for TaskA we still use capacity_min=20% when we are
looking for a CPU.

> Where a possible approximation is scaling the aggregate util into that
> domain.
>
On Wed, Apr 12, 2017 at 02:27:41PM +0100, Patrick Bellasi wrote: > On 12-Apr 14:48, Peter Zijlstra wrote: > > On Tue, Apr 11, 2017 at 06:58:33PM +0100, Patrick Bellasi wrote: > > > > illustrated per your above points in that it affects both, while in > > > > fact it actually modifies another metric, namely util_avg. > > > > > > I don't see it modifying in any direct way util_avg. > > > > The point is that clamps called 'capacity' are applied to util. So while > > you don't modify util directly, you do modify the util signal (for one > > consumer). > > Right, but this consumer (i.e. schedutil) it's already translating > the util_avg into a next_freq (which ultimately it's a capacity). > > Thus, I don't see a big misfit in that code path to "filter" this > translation with a capacity clamp. Still strikes me as odd though.
On Wed, Apr 12, 2017 at 02:34:48PM +0100, Patrick Bellasi wrote: > On 12-Apr 14:15, Peter Zijlstra wrote: > > On Tue, Apr 11, 2017 at 06:58:33PM +0100, Patrick Bellasi wrote: > > > We should consider also that at the CPUFreq side we already expose > > > knobs like scaling_{min,max}_freq which are much more platform > > > dependant than capacity. > > > > So I've always objected to these knobs. > > > > That said; they are a pre-existing user interface so changing them isn't > > really feasible much. > > > > But at the very least we should integrate them properly. Which for > > schedutil would mean they have to directly modify the relevant CPU > > capacity values. Is this currently done? (I think not.) > > AFAICS they are clamping the policy decisions, thus the frequency > domain... which can be more then one CPU on ARM platforms. Right, knew that. Must've missed the 's' when typing ;-) > When you say you would like them to "directly modify the relevant CPU > capacity values" I really see this as exactly what we do with > capacity_{min,max}. What I mean is that when we set max as lower than max-freq, it should reduce the CPU(s)' capacity. That's something entirely different from what you're attempting (and not something we do currently afaik). Also; there's as yet unexplored interaction between these knobs and the RT bandwidth limits. But the main point is that these knobs are system wide and do not affect tasks.
On 12-Apr 16:34, Peter Zijlstra wrote:
> On Wed, Apr 12, 2017 at 02:27:41PM +0100, Patrick Bellasi wrote:
> > On 12-Apr 14:48, Peter Zijlstra wrote:
> > > On Tue, Apr 11, 2017 at 06:58:33PM +0100, Patrick Bellasi wrote:
> > > > > illustrated per your above points in that it affects both, while in
> > > > > fact it actually modifies another metric, namely util_avg.
> > > >
> > > > I don't see it modifying in any direct way util_avg.
> > >
> > > The point is that clamps called 'capacity' are applied to util. So while
> > > you don't modify util directly, you do modify the util signal (for one
> > > consumer).
> >
> > Right, but this consumer (i.e. schedutil) it's already translating
> > the util_avg into a next_freq (which ultimately it's a capacity).
> >
> > Thus, I don't see a big misfit in that code path to "filter" this
> > translation with a capacity clamp.
>
> Still strikes me as odd though.

Can you better elaborate on the why?
On Wed, Apr 12, 2017 at 02:55:38PM +0100, Patrick Bellasi wrote: > On 12-Apr 14:10, Peter Zijlstra wrote: > > Even for the cgroup interface, I think they should set a per-task > > property, not a group property. > > Ok, right now using CGroups ans primary (and unique) interface, these > values are tracked as attributes of the CPU controller. Tasks gets > them by reading these attributes once they are binded to a CGroup. > > Are you proposing to move these attributes within the task_struct? /me goes look at your patches again, because I thought you already set per task_struct values. @@ -1531,6 +1531,9 @@ struct task_struct { struct sched_rt_entity rt; #ifdef CONFIG_CGROUP_SCHED struct task_group *sched_task_group; +#ifdef CONFIG_CAPACITY_CLAMPING + struct rb_node cap_clamp_node[2]; +#endif Yeah, see... > In that case we should also defined a primary interface to set them, > any preferred proposal? sched_setattr(), prctl? We could, which I think is the important point. > By regular rb-tree do you mean the cfs_rq->tasks_timeline? Yep. > Because in that case this would apply only to the FAIR class, while > the rb-tree we are using here are across classes. > Supporting both FAIR and RT I think is a worth having feature. *groan* I don't want to even start thinking what this feature means in the context of RT, head hurts enough. > > So the bigger point is that if the min/max is a per-task property (even > > if set through a cgroup interface), the min(max) / max(min) thing is > > wrong. > > Perhaps I'm not following you here but, being per-task does not mean > that we need to aggregate these constraints by summing them (look > below)... > > > If the min/max were to apply to each individual task's util, you'd end > > up with something like: Dom(\Sum util) = [min(1, \Sum min), min(1, \Sum max)]. > > ... as you do here. 
> > Let's use the usual simple example, where these per-tasks constraints > are configured: > > - TaskA: capacity_min: 20% capacity_max: 80% > - TaskB: capacity_min: 40% capacity_max: 60% > > This means that, at CPU level, we want to enforce the following > clamping depending on the tasks status: > > RUNNABLE tasks capacity_min capacity_max > A) TaskA 20% 80% > B) TaskA,TaskB 40% 80% > C) TaskB 40% 60% > > In case C, TaskA gets a bigger boost while is co-scheduled with TaskB. (bit unfortunate you gave your cases and tasks the same enumeration) But this I quite strongly feel is wrong. If you've given your tasks a minimum OPP, you've in fact given them a minimum bandwidth, for at a given frequency you can say how long they'll run, right? So if you want to maintain that case B should be 60%. Once one of the tasks completes it will drop again. That is, the increased value represents the additional runnable 'load' over the min from the currently running task. Combined they will still complete in reduced time. > Notice that this CPU-level aggregation is used just for OPP selection > on that CPU, while for TaskA we still use capacity_min=20% when we are > looking for a CPU. And you don't find that inconsistent?
On Wed, Apr 12, 2017 at 03:43:10PM +0100, Patrick Bellasi wrote: > On 12-Apr 16:34, Peter Zijlstra wrote: > > On Wed, Apr 12, 2017 at 02:27:41PM +0100, Patrick Bellasi wrote: > > > On 12-Apr 14:48, Peter Zijlstra wrote: > > > > On Tue, Apr 11, 2017 at 06:58:33PM +0100, Patrick Bellasi wrote: > > > > > > illustrated per your above points in that it affects both, while in > > > > > > fact it actually modifies another metric, namely util_avg. > > > > > > > > > > I don't see it modifying in any direct way util_avg. > > > > > > > > The point is that clamps called 'capacity' are applied to util. So while > > > > you don't modify util directly, you do modify the util signal (for one > > > > consumer). > > > > > > Right, but this consumer (i.e. schedutil) it's already translating > > > the util_avg into a next_freq (which ultimately it's a capacity). > > > > > > Thus, I don't see a big misfit in that code path to "filter" this > > > translation with a capacity clamp. > > > > Still strikes me as odd though. > > Can you better elaborate on they why? Because capacity is, as you pointed out earlier, a relative measure of inter CPU performance (which isn't otherwise exposed to userspace afaik). While the utilization thing is a per task running signal. There is no direct relation between the two. The two main uses for the util signal are: OPP selection: the aggregate util of all runnable tasks for a particular CPU is used to select an OPP for said CPU [*], against whatever max-freq that CPU has. Capacity doesn't really come into play here. Task placement: capacity comes into play in so far that we want to make sure our task fits. And I'm not at all sure we want to have both uses of our utilization controlled by the one knob. They're quite distinct. [*] yeah, I know clock domains with multiple CPUs in etc.. lets keep this simple ;-)
On 12-Apr 18:14, Peter Zijlstra wrote:
> On Wed, Apr 12, 2017 at 03:43:10PM +0100, Patrick Bellasi wrote:
> > On 12-Apr 16:34, Peter Zijlstra wrote:
> > > On Wed, Apr 12, 2017 at 02:27:41PM +0100, Patrick Bellasi wrote:
> > > > On 12-Apr 14:48, Peter Zijlstra wrote:
> > > > > On Tue, Apr 11, 2017 at 06:58:33PM +0100, Patrick Bellasi wrote:
> > > > > > > illustrated per your above points in that it affects both, while in
> > > > > > > fact it actually modifies another metric, namely util_avg.
> > > > > >
> > > > > > I don't see it modifying in any direct way util_avg.
> > > > >
> > > > > The point is that clamps called 'capacity' are applied to util. So while
> > > > > you don't modify util directly, you do modify the util signal (for one
> > > > > consumer).
> > > >
> > > > Right, but this consumer (i.e. schedutil) it's already translating
> > > > the util_avg into a next_freq (which ultimately it's a capacity).
                                                           ^^^^^^^^ [REF1]
> > > >
> > > > Thus, I don't see a big misfit in that code path to "filter" this
> > > > translation with a capacity clamp.
> > >
> > > Still strikes me as odd though.
> >
> > Can you better elaborate on they why?
>
> Because capacity is, as you pointed out earlier, a relative measure of
> inter CPU performance (which isn't otherwise exposed to userspace
> afaik).

Perhaps, since I'm biased by EAS concepts which are still not mainline,
I was not clear in specifying what I meant by "capacity" in [REF1].
My fault, sorry; perhaps it's worth starting with a review of some
concepts, to see if we can establish a common language.

.:: Mainline

If we look at mainline, "capacity" is actually a concept used to
represent the computational bandwidth available in a CPU when running
at the highest OPP (let's consider SMP systems to keep it simple).

But things are already a bit more complicated. Specifically, looking at
update_cpu_capacity(), we distinguish between:

- cpu_rq(cpu)->cpu_capacity_orig

  which is the bandwidth available at the max OPP.
- cpu_rq(cpu)->cpu_capacity

  which discounts from the previous metric the "average" bandwidth used
  by RT tasks, but not (yet) DEADLINE tasks AFAICS.

Thus, "capacity" is already a polymorphic concept: we use
cpu_capacity_orig to cap the CPU utilization of CFS tasks in
cpu_util(), but this CPU utilization is a signal which converges to the
"current capacity" in ___update_load_avg().

The "current capacity" (capacity_curr, but just in some comments) is
actually the computational bandwidth available at a certain OPP. Thus,
we already have in mainline a concept of capacity which refers to the
bandwidth available in a certain OPP. The "current capacity" is what we
ultimately use to scale PELT depending on the current OPP.

.:: EAS

Looking at EAS, and specifically the energy model, we describe each OPP
using a:

   struct capacity_state {
           unsigned long cap;    /* compute capacity */
           unsigned long power;  /* power consumption at this compute capacity */
   };

where again we find a usage of the "current capacity", i.e. the
computational bandwidth available at each OPP.

.:: Current Capacity

In [REF1] I was referring to the concept of "current capacity", which
is what schedutil is after. There we need to translate cfs.avg.util_avg
into an OPP, which ultimately is a suitable level of "current capacity"
to satisfy the CPU bandwidth requested by CFS tasks.

> While the utilization thing is a per task running signal.

Which still converges to the "current capacity", at least before
Vincent's patches.

> There is no direct relation between the two.

Given the previous definitions, can we say that there is a relation
between task utilization and "current capacity"?

   Sum(task_utilization) = cpu_utilization
                        <= "current capacity"   (cpufreq_schedutil::get_next_freq())  [1]
                        <= cpu_capacity_orig

> The two main uses for the util signal are:
>
>  OPP selection: the aggregate util of all runnable tasks for a
>  particular CPU is used to select an OPP for said CPU [*], against
>  whatever max-freq that CPU has.
> Capacity doesn't really come into play
> here.

The OPP selected has to provide a suitable amount of "current capacity"
to accommodate the required utilization.

> Task placement: capacity comes into play in so far that we want to
> make sure our task fits.

These two usages are not completely independent, at least when EAS is
in use. In EAS we can evaluate/compare scenarios like:

   "should I increase the capacity of CPUx or wake up CPUy?"

Thus, we use capacity indexes to estimate energy deltas when moving a
task and, by consequence, changing a CPU's OPP. Which means: expected
"capacity" variations affect OPP selection.

> And I'm not at all sure we want to have both uses of our utilization
> controlled by the one knob. They're quite distinct.

The proposed knobs, for example capacity_min, are used to clamp the
scheduler/schedutil view of the required "current capacity", by
modifying the previous relation [1] to be:

   Sum(task_utilization) = cpu_utilization
   clamp(cpu_utilization, capacity_min, capacity_max)
                        <= "current capacity"
                        <= cpu_capacity_orig

In [1] we already have a transformation from the cpu_utilization domain
to the "current capacity" domain. Here we are just adding a clamping
filter around that transformation.

I hope this is useful to find some common ground. Perhaps the naming
capacity_{min,max} is unfortunate and we can find a better one.
However, we should first agree on the utility of the proposed clamping
concept... ;-)
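The clamping filter proposed above can be sketched in a few lines of userspace C. This is purely illustrative (the function name and the use of raw values on the 1024 capacity scale are assumptions of this sketch, not the patch's actual code):

```c
#include <assert.h>

/*
 * Illustrative model of the proposed filter: the CPU's aggregate
 * utilization is clamped into [capacity_min, capacity_max] *before*
 * being translated into a "current capacity" (i.e. an OPP request).
 * All values are on the kernel's 1024 capacity scale.
 */
static unsigned long clamp_util(unsigned long util,
				unsigned long cap_min, unsigned long cap_max)
{
	if (util < cap_min)
		return cap_min;	/* boost: never request less than cap_min */
	if (util > cap_max)
		return cap_max;	/* cap: never request more than cap_max */
	return util;
}
```

With ~20%/~80% clamps (205/819 of 1024), a nearly idle CPU still requests at least 205 worth of capacity, and a fully loaded one is held at 819.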
On 12-Apr 17:37, Peter Zijlstra wrote:
> On Wed, Apr 12, 2017 at 02:55:38PM +0100, Patrick Bellasi wrote:
> > On 12-Apr 14:10, Peter Zijlstra wrote:
> > > Even for the cgroup interface, I think they should set a per-task
> > > property, not a group property.
> >
> > Ok, right now using CGroups as the primary (and only) interface,
> > these values are tracked as attributes of the CPU controller. Tasks
> > get them by reading these attributes once they are bound to a CGroup.
> >
> > Are you proposing to move these attributes within the task_struct?
>
> /me goes look at your patches again, because I thought you already set
> per task_struct values.
>
> @@ -1531,6 +1531,9 @@ struct task_struct {
>  	struct sched_rt_entity rt;
>  #ifdef CONFIG_CGROUP_SCHED
>  	struct task_group *sched_task_group;
> +#ifdef CONFIG_CAPACITY_CLAMPING
> +	struct rb_node cap_clamp_node[2];
> +#endif
>
> Yeah, see...

Well, these are not the actual attributes. These rb_nodes are used to
sort the tasks based on their constraints, but the actual attribute
values are stored in:

@@ -273,6 +273,14 @@ struct task_group {
 #endif
 #endif

+#ifdef CONFIG_CAPACITY_CLAMPING
+#define CAP_CLAMP_MIN 0
+#define CAP_CLAMP_MAX 1
+
+	/* Min and Max capacity constraints for tasks in this group */
+	unsigned int cap_clamp[2];
+#endif
+

This is done to avoid replicating information in each task's structure.

> > In that case we should also define a primary interface to set them,
> > any preferred proposal? sched_setattr(), prctl?
>
> We could, which I think is the important point.
>
> > By regular rb-tree do you mean the cfs_rq->tasks_timeline?
>
> Yep.
>
> > Because in that case this would apply only to the FAIR class, while
> > the rb-trees we are using here are across classes.
> > Supporting both FAIR and RT I think is a worthwhile feature.
>
> *groan* I don't want to even start thinking what this feature means in
> the context of RT, head hurts enough.
:-) Still, mobile people *groan* when we go to max OPP every time an RT
task runs. Here you can find some energy numbers I've got recently on
Pixel phones:

   https://lkml.org/lkml/2017/3/17/214

7%-54% (useless) more energy is a big deal. Of course, there can be
many different solutions to this problem, but capacity_max allows us to
easily clamp the frequency used when certain RT tasks run, while still
keeping them within the expected latency bounds.

> > > So the bigger point is that if the min/max is a per-task property (even
> > > if set through a cgroup interface), the min(max) / max(min) thing is
> > > wrong.
> >
> > Perhaps I'm not following you here but, being per-task does not mean
> > that we need to aggregate these constraints by summing them (look
> > below)...
> >
> > > If the min/max were to apply to each individual task's util, you'd end
> > > up with something like: Dom(\Sum util) = [min(1, \Sum min), min(1, \Sum max)].
> >
> > ... as you do here.
> >
> > Let's use the usual simple example, where these per-tasks constraints
> > are configured:
> >
> >  - TaskA: capacity_min: 20% capacity_max: 80%
> >  - TaskB: capacity_min: 40% capacity_max: 60%
> >
> > This means that, at CPU level, we want to enforce the following
> > clamping depending on the tasks status:
> >
> >    RUNNABLE tasks    capacity_min    capacity_max
> > A) TaskA                  20%             80%
> > B) TaskA,TaskB            40%             80%
> > C) TaskB                  40%             60%
> >
> > In case C, TaskA gets a bigger boost while is co-scheduled with TaskB.
>
> (bit unfortunate you gave your cases and tasks the same enumeration)
>
> But this I quite strongly feel is wrong. If you've given your tasks a
> minimum OPP, you've in fact given them a minimum bandwidth, for at a
> given frequency you can say how long they'll run, right?

Not really. We are still in the domain of a best-effort solution, and I
think we should stick with that.
The overall idea is not about allocating bandwidth at all, but about
expressing preferences, and there are two main reasons:

1) In principle we don't know how long a CFS task will run.
   We just know that the faster it completes, the better.

   Think about a task which is relatively small but functional to
   trigger further processing on an external device (e.g. a GPU).
   In this case the task is part of a critical path, and the sooner it
   finishes the better.

   It can be the case that allocating bandwidth for such a task is not
   easy, e.g. because the amount of processing the task does can change
   considerably between activations. In this case you have two options:

   a) measure/estimate the WCET and go for over-budgeting, likely
      using DEADLINE
   b) find the minimum capacity which allows your task to complete
      reasonably fast most of the time

   The second is of course a best-effort approach; still, I find it
   could be useful to have, and it can easily be adapted at run-time to
   express a sort of power-vs-performance trade-off.

2) If you really want a guaranteed bandwidth, you quite likely also
   want a guaranteed deadline... and you should go for DEADLINE.

> So if you want to maintain that case B should be 60%. Once one of the
> tasks completes it will drop again. That is, the increased value
> represents the additional runnable 'load' over the min from the
> currently running task. Combined they will still complete in reduced
> time.

We already experimented with this approach in the past. Actually, the
first version of SchedTune was based on the idea of aggregating by
adding the boosted utilizations.

It's true that in that way we are more likely to speed up task
completion also in the co-scheduling case, but the downside is that we
enter the domain of "guaranteed bandwidth allocation", which is likely
overkill for the scope of a best-effort solution.
Moreover, since bandwidth is a limited resource, we also found such an
approach a poor fit for systems where a certain number of tasks run
concurrently. By contrast, a simple threshold-based boosting, where
max() is used as the aggregation function, seems to still be useful.

> > Notice that this CPU-level aggregation is used just for OPP selection
> > on that CPU, while for TaskA we still use capacity_min=20% when we are
> > looking for a CPU.
>
> And you don't find that inconsistent?

In the previous example, TaskB seems to prefer a CPU which has between
40% and 60% capacity. Let's assume these numbers come from a use-case
where:

a) your system provides 60% capacity in a LITTLE CPU
b) you are after "sustained performance" for TaskA, which on that
   platform can easily be achieved by running at 40% of a LITTLE CPU

Don't you think this can be valuable information for the scheduler, to
just (possibly) prefer a LITTLE CPU? With a max() aggregation we can
place both TaskA and TaskB on the LITTLE CPU and try to run them at
least at 40% capacity.
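The max() aggregation discussed in this exchange can be sketched as follows. This is an illustrative userspace model (the struct and function names are invented here, and percentages are used as in the TaskA/TaskB example above):

```c
#include <assert.h>

/* Per-task capacity clamps, as percentages (TaskA: 20-80, TaskB: 40-60). */
struct clamp {
	unsigned int min;
	unsigned int max;
};

/*
 * CPU-level aggregation by max(): the effective clamps are the maximum
 * capacity_min and maximum capacity_max among the RUNNABLE tasks, so a
 * task with a high min is never slowed down by a co-runner's lower max.
 */
static struct clamp cpu_clamps(const struct clamp *tasks, int n)
{
	struct clamp c = { 0, 0 };
	int i;

	for (i = 0; i < n; i++) {
		if (tasks[i].min > c.min)
			c.min = tasks[i].min;
		if (tasks[i].max > c.max)
			c.max = tasks[i].max;
	}
	return c;
}
```

With both tasks runnable this yields the 40%/80% of case B in the table above; with only TaskB runnable it degrades to 40%/60%.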
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 27b4dd55b0c7..20ed17369cb1 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6963,18 +6963,30 @@ static void cpu_cgroup_attach(struct cgroup_taskset *tset)
 }
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
-static int cpu_shares_write_u64(struct cgroup_subsys_state *css,
-				struct cftype *cftype, u64 shareval)
+static int cpu_nice_write_s64(struct cgroup_subsys_state *css,
+			      struct cftype *cftype, s64 val)
 {
-	return sched_group_set_shares(css_tg(css), scale_load(shareval));
+	struct task_group *tg = css_tg(css);
+	unsigned long weight;
+	int ret;
+
+	if (val < MIN_NICE || val > MAX_NICE)
+		return -EINVAL;
+
+	weight = sched_prio_to_weight[val - MIN_NICE];
+	ret = sched_group_set_shares(tg, scale_load(weight));
+	if (!ret)
+		tg->nice = val;
+
+	return ret;
 }
 
-static u64 cpu_shares_read_u64(struct cgroup_subsys_state *css,
-			       struct cftype *cft)
+static s64 cpu_nice_read_s64(struct cgroup_subsys_state *css,
+			     struct cftype *cft)
 {
 	struct task_group *tg = css_tg(css);
 
-	return (u64) scale_load_down(tg->shares);
+	return (s64)tg->nice;
 }
 
 #ifdef CONFIG_CFS_BANDWIDTH
@@ -7252,9 +7265,9 @@ static u64 cpu_rt_period_read_uint(struct cgroup_subsys_state *css,
 static struct cftype cpu_files[] = {
 #ifdef CONFIG_FAIR_GROUP_SCHED
 	{
-		.name = "shares",
-		.read_u64 = cpu_shares_read_u64,
-		.write_u64 = cpu_shares_write_u64,
+		.name = "nice",
+		.read_s64 = cpu_nice_read_s64,
+		.write_s64 = cpu_nice_write_s64,
 	},
 #endif
 #ifdef CONFIG_CFS_BANDWIDTH
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 76f67b3e34d6..8088043f46d1 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9471,7 +9471,7 @@ int sched_group_set_shares(struct task_group *tg, unsigned long shares)
 	if (!tg->se[0])
 		return -EINVAL;
 
-	shares = clamp(shares, scale_load(MIN_SHARES), scale_load(MAX_SHARES));
+	shares = clamp(shares, MIN_SHARES, scale_load(MAX_SHARES));
 
 	mutex_lock(&shares_mutex);
 	if (tg->shares == shares)
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 57caf3606114..27f1ffe763bc 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -283,6 +283,7 @@ struct task_group {
 	/* runqueue "owned" by this group on each cpu */
 	struct cfs_rq **cfs_rq;
 	unsigned long shares;
+	int nice;
 
 #ifdef CONFIG_SMP
 	/*
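The write path in the patch above boils down to a table lookup: nice maps to a load weight via sched_prio_to_weight[], where each nice step changes the weight by roughly 1.25x (about 10% of CPU time between adjacent levels). A userspace sketch of that lookup (table values copied from kernel/sched/core.c; the wrapper function is invented here for illustration):

```c
#include <assert.h>

#define MIN_NICE (-20)
#define MAX_NICE 19

/* Nice-to-weight table from kernel/sched/core.c: nice 0 maps to the
 * reference weight 1024, and each nice level scales by ~1.25. */
static const unsigned int sched_prio_to_weight[40] = {
	/* -20 */ 88761, 71755, 56483, 46273, 36291,
	/* -15 */ 29154, 23254, 18705, 14949, 11916,
	/* -10 */  9548,  7620,  6100,  4904,  3906,
	/*  -5 */  3121,  2501,  1991,  1586,  1277,
	/*   0 */  1024,   820,   655,   526,   423,
	/*   5 */   335,   272,   215,   172,   137,
	/*  10 */   110,    87,    70,    56,    45,
	/*  15 */    36,    29,    23,    18,    15,
};

/* Same lookup as cpu_nice_write_s64(): index by (nice - MIN_NICE). */
static unsigned int nice_to_weight(int nice)
{
	assert(nice >= MIN_NICE && nice <= MAX_NICE);
	return sched_prio_to_weight[nice - MIN_NICE];
}
```

So writing nice 0 to the proposed cpu.nice file is equivalent to the default cpu.shares of 1024, and nice -20 to the maximum weight of 88761.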