
[v9,12/16] sched/core: uclamp: Extend CPU's cgroup controller

Message ID 20190515094459.10317-13-patrick.bellasi@arm.com (mailing list archive)
State Not Applicable, archived
Series Add utilization clamping support

Commit Message

Patrick Bellasi May 15, 2019, 9:44 a.m. UTC
The cgroup CPU bandwidth controller allows a specified (maximum)
bandwidth to be assigned to the tasks of a group. However, this bandwidth
is defined and enforced only on a temporal basis, without considering the
actual frequency the CPU is running at. Thus, the amount of computation
completed by a task within an allocated bandwidth can be very different
depending on the frequency at which the CPU runs that task.
The amount of computation is also affected by the specific CPU a
task runs on, especially on asymmetric-capacity systems
like Arm's big.LITTLE.

With the availability of schedutil, the scheduler is now able
to drive frequency selections based on actual task utilization.
Moreover, the utilization clamping support provides a mechanism to
bias the frequency selection operated by schedutil depending on
constraints assigned to the tasks currently RUNNABLE on a CPU.

Given the mechanisms described above, it is now possible to extend the
cpu controller to specify the minimum (or maximum) utilization which
should be considered for the tasks RUNNABLE on a CPU.
This makes it possible to better define the actual computational
power assigned to task groups, thus improving the cgroup CPU bandwidth
controller, which is currently based only on time constraints.

Extend the CPU controller with a couple of new attributes, util.{min,max},
which allow utilization boosting and capping to be enforced for all the
tasks in a group. Specifically:

- util.min: defines the minimum utilization which should be considered,
            i.e. the RUNNABLE tasks of this group will run at least at a
            minimum frequency which corresponds to the util.min
            utilization

- util.max: defines the maximum utilization which should be considered,
            i.e. the RUNNABLE tasks of this group will run up to a
            maximum frequency which corresponds to the util.max
            utilization

These attributes:

a) are available only for non-root nodes, both on the default and legacy
   hierarchies, while system wide clamps are defined by a generic
   interface which does not depend on cgroups. This system wide
   interface enforces constraints on tasks in the root node.

b) enforce effective constraints at each level of the hierarchy, which
   are a restriction of the group's requests considering its parent's
   effective constraints. Root group effective constraints are defined
   by the system wide interface.
   This mechanism allows each (non-root) level of the hierarchy to:
   - request whatever clamp values it would like to get
   - effectively get only up to the maximum amount allowed by its parent

c) have higher priority than task-specific clamps, defined via
   sched_setattr(), thus allowing task requests to be controlled and
   restricted

Add two new attributes to the cpu controller to collect "requested"
clamp values. Allow that at each non-root level of the hierarchy.
Validate local consistency by enforcing util.min <= util.max.
Keep it simple by not caring, for now, about "effective" values
computation and propagation along the hierarchy.
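
As a purely illustrative aside, the sketch below shows how the two new
attributes are meant to be driven from userspace. The cgroup mount point
and the group name are assumptions made up for the example; the accepted
range [0, 1024] and the util.min <= util.max consistency check come from
this patch.

#include <stdio.h>

/*
 * Hedged usage sketch: the mount point (/sys/fs/cgroup/cpu) and the
 * group name ("app") are assumptions for illustration only. Values are
 * on the [0, 1024] capacity scale; out-of-range or inconsistent writes
 * (util.min > util.max) are rejected by the kernel with -ERANGE/-EINVAL.
 */
static int write_attr(const char *path, unsigned int value)
{
        FILE *f = fopen(path, "w");

        if (!f)
                return -1;
        fprintf(f, "%u\n", value);
        return fclose(f) ? -1 : 0;      /* write errors surface on close */
}

int main(void)
{
        /* Boost: RUNNABLE tasks of "app" run at least at ~20% capacity... */
        write_attr("/sys/fs/cgroup/cpu/app/cpu.util.min", 200);
        /* ...and cap: their considered utilization never exceeds ~80% capacity. */
        write_attr("/sys/fs/cgroup/cpu/app/cpu.util.max", 800);
        return 0;
}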

Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tejun Heo <tj@kernel.org>

---
Changes in v9
 Message-ID: <20190507114232.npsvba4itex5qpvl@e110439-lin>
 - make alloc_uclamp_sched_group() a void function
 Message-ID: <20190508190011.GB32547@worktop.programming.kicks-ass.net>
 - use for_each_clamp_id() and uclamp_se_set() to make code less fragile
---
 Documentation/admin-guide/cgroup-v2.rst |  27 +++++
 init/Kconfig                            |  22 ++++
 kernel/sched/core.c                     | 135 +++++++++++++++++++++++-
 kernel/sched/sched.h                    |   6 ++
 4 files changed, 189 insertions(+), 1 deletion(-)

Comments

Tejun Heo May 31, 2019, 3:35 p.m. UTC | #1
Hello, Patrick.

On Wed, May 15, 2019 at 10:44:55AM +0100, Patrick Bellasi wrote:
> Extend the CPU controller with a couple of new attributes util.{min,max}
> which allows to enforce utilization boosting and capping for all the
> tasks in a group. Specifically:
> 
> - util.min: defines the minimum utilization which should be considered
> 	    i.e. the RUNNABLE tasks of this group will run at least at a
> 		 minimum frequency which corresponds to the util.min
> 		 utilization
> 
> - util.max: defines the maximum utilization which should be considered
> 	    i.e. the RUNNABLE tasks of this group will run up to a
> 		 maximum frequency which corresponds to the util.max
> 		 utilization

Let's please use a prefix which is more specific.  It's clamping the
utilization estimates of the member tasks which in turn affect
scheduling / frequency decisions but cpu.util.max reads like it's
gonna limit the cpu utilization directly.  Maybe just use uclamp?

> These attributes:
> 
> a) are available only for non-root nodes, both on default and legacy
>    hierarchies, while system wide clamps are defined by a generic
>    interface which does not depends on cgroups. This system wide
>    interface enforces constraints on tasks in the root node.

I'd much prefer if they weren't entangled this way.  The system wide
limits should work the same regardless of cgroup's existence.  cgroup
can put further restriction on top but mere creation of cgroups with
cpu controller enabled shouldn't take them out of the system-wide
limits.

> b) enforce effective constraints at each level of the hierarchy which
>    are a restriction of the group requests considering its parent's
>    effective constraints. Root group effective constraints are defined
>    by the system wide interface.
>    This mechanism allows each (non-root) level of the hierarchy to:
>    - request whatever clamp values it would like to get
>    - effectively get only up to the maximum amount allowed by its parent

I'll come back to this later.

> c) have higher priority than task-specific clamps, defined via
>    sched_setattr(), thus allowing to control and restrict task requests

This sounds good.

> Add two new attributes to the cpu controller to collect "requested"
> clamp values. Allow that at each non-root level of the hierarchy.
> Validate local consistency by enforcing util.min < util.max.
> Keep it simple by do not caring now about "effective" values computation
> and propagation along the hierarchy.

So, the following is what we do for hierarchical protection
and limit propagation.

* Limits (high / max) default to max.  Protections (low / min) to 0.  A
  new cgroup by default doesn't constrain itself further and doesn't
  have any protection.

* A limit defines the upper ceiling for the subtree.  If an ancestor
  has a limit of X, none of its descendants can have more than X.

* A protection defines the upper ceiling of protections for the
  subtree.  If an ancestor has a protection of X, none of its
  descendants can have more protection than X.

Note that there's no way for an ancestor to enforce protection on its
descendants.  It can only allow them to claim some.  This is
intentional, as the alternative would be descendants losing the ability
to further distribute protections as they see fit.

For proportions (as opposed to weights), we use percentage rational
numbers - e.g. 38.44 for 38.44%.  I have parser and doc update commits
pending.  I'll put them on cgroup/for-5.3.

Thanks.
Patrick Bellasi June 3, 2019, 12:24 p.m. UTC | #2
On 31-May 08:35, Tejun Heo wrote:
> Hello, Patrick.

Hi Tejun!

> On Wed, May 15, 2019 at 10:44:55AM +0100, Patrick Bellasi wrote:
> > Extend the CPU controller with a couple of new attributes util.{min,max}
> > which allows to enforce utilization boosting and capping for all the
> > tasks in a group. Specifically:
> > 
> > - util.min: defines the minimum utilization which should be considered
> > 	    i.e. the RUNNABLE tasks of this group will run at least at a
> > 		 minimum frequency which corresponds to the util.min
> > 		 utilization
> > 
> > - util.max: defines the maximum utilization which should be considered
> > 	    i.e. the RUNNABLE tasks of this group will run up to a
> > 		 maximum frequency which corresponds to the util.max
> > 		 utilization
> 
> Let's please use a prefix which is more specific.  It's clamping the
> utilization estimates of the member tasks which in turn affect
> scheduling / frequency decisions but cpu.util.max reads like it's
> gonna limit the cpu utilization directly.  Maybe just use uclamp?

Doesn't being too specific risk exposing implementation details?

If that's not a problem and Peter likes:

   cpu.uclamp.{min,max}

that's ok with me.
Patrick Bellasi June 3, 2019, 12:27 p.m. UTC | #3
On 31-May 08:35, Tejun Heo wrote:

[...]

> > These attributes:
> > 
> > a) are available only for non-root nodes, both on default and legacy
> >    hierarchies, while system wide clamps are defined by a generic
> >    interface which does not depends on cgroups. This system wide
> >    interface enforces constraints on tasks in the root node.
> 
> I'd much prefer if they weren't entangled this way.  The system wide
> limits should work the same regardless of cgroup's existence.  cgroup
> can put further restriction on top but mere creation of cgroups with
> cpu controller enabled shouldn't take them out of the system-wide
> limits.

That's correct, and what you describe matches, at least in its
intent, the current implementation provided in:

   [PATCH v9 14/16] sched/core: uclamp: Propagate system defaults to root group
   https://lore.kernel.org/lkml/20190515094459.10317-15-patrick.bellasi@arm.com/

System clamps always work the same way, independently of cgroups:
they define the upper bound for both min and max clamps.

When cgroups are not available, task-specific clamps are always
capped by the system clamps.

When cgroups are available, the root task group clamps are capped by
the system clamps, which affects its "effective" clamps and propagates
them down the hierarchy to the children's "effective" clamps.
That's done in:

   [PATCH v9 13/16] sched/core: uclamp: Propagate parent clamps
   https://lore.kernel.org/lkml/20190515094459.10317-14-patrick.bellasi@arm.com/


Example 1
---------

Here is an example of system and groups clamps aggregation:

                         min                 max
system defaults          400                 600

cg_name                  min  min.effective  max  max.effective
 /uclamp                1024  400            500  500
 /uclamp/app             512  400            512  500
 /uclamp/app/pid_smalls  100  100            200  200
 /uclamp/app/pid_bigs    500  400            700  500


The ".effective" clamps are used to define the actual clamp value to
apply to tasks, according to the aggregation rules defined in:

   [PATCH v9 15/16] sched/core: uclamp: Use TG's clamps to restrict TASK's clamps
   https://lore.kernel.org/lkml/20190515094459.10317-16-patrick.bellasi@arm.com/

All of the above, to me, means that:
 - cgroups are always capped by system clamps
 - cgroups can further restrict system clamps

Does that match with your view?

> > b) enforce effective constraints at each level of the hierarchy which
> >    are a restriction of the group requests considering its parent's
> >    effective constraints. Root group effective constraints are defined
> >    by the system wide interface.
> >    This mechanism allows each (non-root) level of the hierarchy to:
> >    - request whatever clamp values it would like to get
> >    - effectively get only up to the maximum amount allowed by its parent
> 
> I'll come back to this later.
> 
> > c) have higher priority than task-specific clamps, defined via
> >    sched_setattr(), thus allowing to control and restrict task requests
> 
> This sounds good.
> 
> > Add two new attributes to the cpu controller to collect "requested"
> > clamp values. Allow that at each non-root level of the hierarchy.
> > Validate local consistency by enforcing util.min < util.max.
> > Keep it simple by do not caring now about "effective" values computation
> > and propagation along the hierarchy.
> 
> So, the following is what we do for hierarchical protection
> and limit propagation.
> 
> * Limits (high / max) default to max.  Protections (low / min) 0.  A
>   new cgroup by default doesn't constrain itself further and doesn't
>   have any protection.

Example 2
---------

Let's say we have:

  /tg1:
        util_min=200 (as a protection)
        util_max=800 (as a limit)

the moment we create a subgroup /tg1/tg11, in v9 it is initialized
with the same limits _and protections_ as its parent:

  /tg1/tg11:
        util_min=200 (protection inherited from /tg1)
        util_max=800 (limit inherited from /tg1)

Do you mean that we should have instead:

  /tg1/tg11:
        util_min=0   (no protection by default at creation time)
        util_max=800 (limit inherited from /tg1)


i.e. we need to reset the protection of a newly created subgroup?


> * A limit defines the upper ceiling for the subtree.  If an ancestor
>   has a limit of X, none of its descendants can have more than X.

That's correct, however we distinguish between "requested" and
"effective" values.


Example 3
---------

We can have:

cg_name                  max  max.effective
 /uclamp/app             400  400
 /uclamp/app/pid_bigs    500  400


Which means that a subgroup can "request" a limit (max=500) higher
than its parent's (max=400), while still getting only up to what its
parent allows (max.effective=400).


Example 4
---------

Tracking the actual requested limit (max=500) is useful to enforce it
once the parent's limit is relaxed; for example, we would then have:

cg_name                  max  max.effective
 /uclamp/app             600  600
 /uclamp/app/pid_bigs    500  500

where a subgroup never gets more than what it has been configured for.

This is the logic implemented by cpu_util_update_eff() in:

   [PATCH v9 13/16] sched/core: uclamp: Propagate parent clamps
   https://lore.kernel.org/lkml/20190515094459.10317-14-patrick.bellasi@arm.com/
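
To make Examples 3 and 4 concrete, here is a hedged, userspace-only
sketch of the top-down re-propagation they imply; the toy structure and
helper below are illustrative assumptions and not the actual
cpu_util_update_eff() code from the patch referenced above.

#include <stdio.h>

/* Toy task-group node: illustrative only, not the kernel's task_group. */
struct tg {
        unsigned int max_req;   /* "requested" limit */
        unsigned int max_eff;   /* "effective" limit */
        struct tg *child;       /* single child, for brevity */
};

/* Re-clamp the subtree after a limit changes: eff = min(req, parent eff). */
static void update_eff(struct tg *tg, unsigned int parent_eff)
{
        tg->max_eff = tg->max_req < parent_eff ? tg->max_req : parent_eff;
        if (tg->child)
                update_eff(tg->child, tg->max_eff);
}

int main(void)
{
        struct tg bigs = { .max_req = 500 };
        struct tg app  = { .max_req = 400, .child = &bigs };

        update_eff(&app, 1024);         /* Example 3: app=400, bigs=400 */
        printf("%u %u\n", app.max_eff, bigs.max_eff);

        app.max_req = 600;              /* relax the parent's limit...  */
        update_eff(&app, 1024);         /* Example 4: app=600, bigs=500 */
        printf("%u %u\n", app.max_eff, bigs.max_eff);
        return 0;
}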


> * A protection defines the upper ceiling of protections for the
>   subtree.  If an ancestor has a protection of X, none of its
>   descendants can have more protection than X.

Right, that's the current behavior in v9.

> Note that there's no way for an ancestor to enforce protection its
> descendants.  It can only allow them to claim some.  This is
> intentional as the other end of the spectrum is either descendants
> losing the ability to further distribute protections as they see fit.

Ok, that means I need to update in v10 the initialization of subgroups'
min clamps to be none by default, as discussed in Example 2 above,
right?

[...]

Cheers,
Patrick
Patrick Bellasi June 3, 2019, 12:29 p.m. UTC | #4
On 31-May 08:35, Tejun Heo wrote:
> Hello, Patrick.
> 
> On Wed, May 15, 2019 at 10:44:55AM +0100, Patrick Bellasi wrote:

[...]

> For proportions (as opposed to weights), we use percentage rational
> numbers - e.g. 38.44 for 38.44%.  I have parser and doc update commits
> pending.  I'll put them on cgroup/for-5.3.

That's a point worth discussing with Peter; we already changed once
from percentages to the 1024 scale.

Utilization clamps are percentages by definition; they are just
expressed on a convenient 1024 scale, which should not be alien to
people using those knobs.

If we wanna use a "more specific" name like uclamp.{min,max} then we
should probably also accept a "more specific" metric, shouldn't we?

I personally like the [0..1024] range, but I guess that's really up to
you and Peter to agree upon.
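
For reference, whichever unit wins, the translation between the two
scales is trivial integer arithmetic. A hedged sketch, where the
rounding policy and the "hundredths of a percent" encoding are
assumptions of the example rather than anything defined by either
interface:

#include <stdio.h>

/* 38.44% is encoded as 3844 hundredths of a percent. */
static unsigned int pct_to_util(unsigned int pct_hundredths)
{
        return (pct_hundredths * 1024U + 5000U) / 10000U;   /* round to nearest */
}

static unsigned int util_to_pct(unsigned int util)
{
        return (util * 10000U + 512U) / 1024U;              /* back to hundredths */
}

int main(void)
{
        unsigned int pct = util_to_pct(1024);

        printf("38.44%% -> %u\n", pct_to_util(3844));           /* 394 */
        printf("1024   -> %u.%02u%%\n", pct / 100, pct % 100);  /* 100.00% */
        return 0;
}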

> Thanks.
> 
> --
> tejun

Cheers,
Patrick
Tejun Heo June 5, 2019, 2:03 p.m. UTC | #5
Hello,

On Mon, Jun 03, 2019 at 01:27:25PM +0100, Patrick Bellasi wrote:
> All the above, to me it means that:
>  - cgroups are always capped by system clamps
>  - cgroups can further restrict system clamps
> 
> Does that match with your view?

Yeah, as long as what's defined at system level clamps everything in
the system whether they're in cgroups or not, it's all good.

> > * Limits (high / max) default to max.  Protections (low / min) 0.  A
> >   new cgroup by default doesn't constrain itself further and doesn't
> >   have any protection.
> 
> Example 2
> ---------
> 
> Let say we have:
> 
>   /tg1:
>         util_min=200 (as a protection)
>         util_max=800 (as a limit)
> 
> the moment we create a subgroup /tg1/tg11, in v9 it is initialized
> with the same limits _and protections_ of its father:
> 
>   /tg1/tg11:
>         util_min=200 (protection inherited from /tg1)
>         util_max=800 (limit inherited from /tg1)
> 
> Do you mean that we should have instead:
> 
>   /tg1/tg11:
>         util_min=0   (no protection by default at creation time)
>         util_max=800 (limit inherited from /tg1)
> 
> 
> i.e. we need to reset the protection of a newly created subgroup?

The default value for limits should be max, and for protections 0.  Don't
inherit config values from the parent.  That gets confusing super fast
because then when the parent config is set and when each child is created
both play into the overall configuration.  Hierarchical confinements
should always be enforced and a new cgroup should always start afresh in
terms of its own configuration.

> > * A limit defines the upper ceiling for the subtree.  If an ancestor
> >   has a limit of X, none of its descendants can have more than X.
> 
> That's correct, however we distinguish between "requested" and
> "effective" values.

Sure, all property propagating controllers should.

> > Note that there's no way for an ancestor to enforce protection its
> > descendants.  It can only allow them to claim some.  This is
> > intentional as the other end of the spectrum is either descendants
> > losing the ability to further distribute protections as they see fit.
> 
> Ok, that means I need to update in v10 the initialization of subgroups
> min clamps to be none by default as discussed in the above Example 2,
> right?

Yeah and max to max.

Thanks.
Tejun Heo June 5, 2019, 2:09 p.m. UTC | #6
Hello,

On Mon, Jun 03, 2019 at 01:29:29PM +0100, Patrick Bellasi wrote:
> On 31-May 08:35, Tejun Heo wrote:
> > Hello, Patrick.
> > 
> > On Wed, May 15, 2019 at 10:44:55AM +0100, Patrick Bellasi wrote:
> 
> [...]
> 
> > For proportions (as opposed to weights), we use percentage rational
> > numbers - e.g. 38.44 for 38.44%.  I have parser and doc update commits
> > pending.  I'll put them on cgroup/for-5.3.
> 
> That's a point worth discussing with Peter, we already changed one
> time from percentages to 1024 scale.

cgroup tries to use uniform units for its interface files as much as
possible even when that deviates from non-cgroup interfaces.  We can
bikeshed the pros and cons of that design choice for sure but I don't
think it makes sense to deviate from it at this point unless there
are really strong reasons to do so.

> Utilization clamps are expressed as percentages by definition,
> they are just expressed in a convenient 1024 scale which should not be
> alien to people using those knobs.
> 
> If we wanna use a "more specific" name like uclamp.{min,max} then we
> should probably also accept to use a "more specific" metric, don't we?

Heh, this actually made me chuckle.  It's an interesting bargaining
take but I don't think that same word being in two different places
makes them tradable entities.  We can go into the weeds with the
semantics but how about us using an alternative adjective "misleading"
for the cpu.util.min/max names to short-circuit that?

Thanks.
Patrick Bellasi June 5, 2019, 2:39 p.m. UTC | #7
On 05-Jun 07:03, Tejun Heo wrote:
> Hello,

Hi!

> On Mon, Jun 03, 2019 at 01:27:25PM +0100, Patrick Bellasi wrote:
> > All the above, to me it means that:
> >  - cgroups are always capped by system clamps
> >  - cgroups can further restrict system clamps
> > 
> > Does that match with your view?
> 
> Yeah, as long as what's defined at system level clamps everything in
> the system whether they're in cgroups or not, it's all good.

Right, then we are good with v9 on this point.

> > > * Limits (high / max) default to max.  Protections (low / min) 0.  A
> > >   new cgroup by default doesn't constrain itself further and doesn't
> > >   have any protection.
> > 
> > Example 2
> > ---------
> > 
> > Let say we have:
> > 
> >   /tg1:
> >         util_min=200 (as a protection)
> >         util_max=800 (as a limit)
> > 
> > the moment we create a subgroup /tg1/tg11, in v9 it is initialized
> > with the same limits _and protections_ of its father:
> > 
> >   /tg1/tg11:
> >         util_min=200 (protection inherited from /tg1)
> >         util_max=800 (limit inherited from /tg1)
> > 
> > Do you mean that we should have instead:
> > 
> >   /tg1/tg11:
> >         util_min=0   (no protection by default at creation time)
> >         util_max=800 (limit inherited from /tg1)
> > 
> > 
> > i.e. we need to reset the protection of a newly created subgroup?
> 
> The default value for limits should be max, protections 0.  Don't
> inherit config values from the parent.  That gets confusing super fast
> because when the parent config is set and each child is created plays
> into the overall configuration.  Hierarchical confinements should
> always be enforced and a new cgroup should always start afresh in
> terms of its own configuration.

Got it, so in the example above we will create:

   /tg1/tg11:
         util_min=0    (no requested protection by default at creation time)
         util_max=1024 (no requested limit by default at creation time)

That's it for the "requested" values side, while the "effective"
values are enforced by the hierarchical confinement rules from
creation time onwards.
This means we will enforce the effective values as:

   /tg1/tg11:

         util_min.effective=0
            i.e. keep the child protection since smaller than parent

         util_max.effective=800
            i.e. keep parent limit since stricter than child

Please shout if I got it wrong, otherwise I'll update v10 to
implement the above logic.
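
A compact sketch of the creation-time behaviour being agreed on here,
as a userspace model with made-up names (the real change is deferred to
v10): requested values start afresh, while effective values are
immediately confined by the parent.

#include <stdio.h>

/* Illustrative model only, not kernel code. */
struct tg {
        unsigned int min_req, max_req;  /* "requested" clamps */
        unsigned int min_eff, max_eff;  /* "effective" clamps */
};

static unsigned int min_u(unsigned int a, unsigned int b)
{
        return a < b ? a : b;
}

/* New child: requests are not inherited; effective values are parent-confined. */
static void tg_init(struct tg *child, const struct tg *parent)
{
        child->min_req = 0;             /* no protection requested by default */
        child->max_req = 1024;          /* no limit requested by default */
        child->min_eff = min_u(child->min_req, parent->min_eff);
        child->max_eff = min_u(child->max_req, parent->max_eff);
}

int main(void)
{
        struct tg tg1 = { .min_req = 200, .max_req = 800,
                          .min_eff = 200, .max_eff = 800 };
        struct tg tg11;

        tg_init(&tg11, &tg1);
        printf("tg11: min_eff=%u max_eff=%u\n", tg11.min_eff, tg11.max_eff);
        /* prints: tg11: min_eff=0 max_eff=800 */
        return 0;
}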

> > > * A limit defines the upper ceiling for the subtree.  If an ancestor
> > >   has a limit of X, none of its descendants can have more than X.
> > 
> > That's correct, however we distinguish between "requested" and
> > "effective" values.
> 
> Sure, all property propagating controllers should.

Right.

> > > Note that there's no way for an ancestor to enforce protection its
> > > descendants.  It can only allow them to claim some.  This is
> > > intentional as the other end of the spectrum is either descendants
> > > losing the ability to further distribute protections as they see fit.
> > 
> > Ok, that means I need to update in v10 the initialization of subgroups
> > min clamps to be none by default as discussed in the above Example 2,
> > right?
> 
> Yeah and max to max.

Right, I've got it now.


> Thanks.

Cheers,
Patrick
Tejun Heo June 5, 2019, 2:44 p.m. UTC | #8
Hello,

On Wed, Jun 05, 2019 at 03:39:50PM +0100, Patrick Bellasi wrote:
> Which means we will enforce the effective values as:
> 
>    /tg1/tg11:
> 
>          util_min.effective=0
>             i.e. keep the child protection since smaller than parent
> 
>          util_max.effective=800
>             i.e. keep parent limit since stricter than child
>
> Please shout if I got it wrong, otherwise I'll update v10 to
> implement the above logic.

Everything sounds good to me.  Please note that cgroup interface files
actually use literal "max" for limit/protection max settings so that 0
and "max" mean the same things for all limit/protection knobs.

Thanks.
Patrick Bellasi June 5, 2019, 3:06 p.m. UTC | #9
On 05-Jun 07:09, Tejun Heo wrote:
> Hello,

Hi,

> On Mon, Jun 03, 2019 at 01:29:29PM +0100, Patrick Bellasi wrote:
> > On 31-May 08:35, Tejun Heo wrote:
> > > Hello, Patrick.
> > > 
> > > On Wed, May 15, 2019 at 10:44:55AM +0100, Patrick Bellasi wrote:
> > 
> > [...]
> > 
> > > For proportions (as opposed to weights), we use percentage rational
> > > numbers - e.g. 38.44 for 38.44%.  I have parser and doc update commits
> > > pending.  I'll put them on cgroup/for-5.3.
> > 
> > That's a point worth discussing with Peter, we already changed one
> > time from percentages to 1024 scale.
> 
> cgroup tries to use uniform units for its interface files as much as
> possible even when that deviates from non-cgroup interface.  We can
> bikeshed the pros and cons for that design choice for sure but I don't
> think it makes sense to deviate from that at this point unless there
> are really strong reasons to do so.

That makes sense to me; having a uniform interface certainly has
value.

The only additional point I can think of as a (slightly) stronger
reason is that I guess we would like to have the same API for cgroups
as well as for the task-specific and the system-wide settings.

The task-specific values come in via the sched_setattr() syscall:

   [PATCH v9 06/16] sched/core: uclamp: Extend sched_setattr() to support utilization clamping
   https://lore.kernel.org/lkml/20190515094459.10317-7-patrick.bellasi@arm.com/

where we need to encode each clamp into a __u32 value.

System-wide settings are exposed similarly to these:

   grep '' /proc/sys/kernel/sched_*

where we have always integer numbers.

AFAIU your proposal will require using a "scaled percentage" - e.g.
3844 for 38.44% - which however is still not quite the same as writing
the string "38.44".

Not sure that's a strong enough argument, is it?

> > Utilization clamps are expressed as percentages by definition,
> > they are just expressed in a convenient 1024 scale which should not be
> > alien to people using those knobs.
> > 
> > If we wanna use a "more specific" name like uclamp.{min,max} then we
> > should probably also accept to use a "more specific" metric, don't we?
> 
> Heh, this actually made me chuckle.

:)

> It's an interesting bargaining take but I don't think that same word
> being in two different places makes them tradable entities.

Sure, that was not my intention.

I was just trying to see if the need to be more specific could
be an argument for having also a more specific value.

> We can go into the weeds with the semantics but how about us using
> an alternative adjective "misleading" for the cpu.util.min/max names
> to short-circuit that?

Not quite sure I get what you mean here. Are you pointing out that
with clamps we don't strictly enforce a bandwidth but just set a
bias?

Cheers,
Patrick
Tejun Heo June 5, 2019, 3:27 p.m. UTC | #10
Hello, Patrick.

On Wed, Jun 05, 2019 at 04:06:30PM +0100, Patrick Bellasi wrote:
> The only additional point I can think about as a (slightly) stronger
> reason is that I guess we would like to have the same API for cgroups
> as well as for the task specific and the system wide settings.
> 
> The task specific values comes in via the sched_setattr() syscall:
> 
>    [PATCH v9 06/16] sched/core: uclamp: Extend sched_setattr() to support utilization clamping
>    https://lore.kernel.org/lkml/20190515094459.10317-7-patrick.bellasi@arm.com/
> 
> where we need to encode each clamp into a __u32 value.
> 
> System wide settings are expose similarly to these:
> 
>    grep '' /proc/sys/kernel/sched_*
> 
> where we have always integer numbers.
> 
> AFAIU your proposal will require to use a "scaled percentage" - e.g.
> 3844 for 38.44% which however it's still not quite the same as writing
> the string "38.44".
> 
> Not sure that's a strong enough argument, is it?

It definitely is an argument but the thing is that the units we use in
the kernel API are all over the place anyway.  Even for something as
simple as sizes, we use bytes, 512-byte sectors, kilobytes and pages
all over the place.  Some for good reasons (as you mentioned above)
and others for historical / random ones.

So, I'm generally not too concerned about units differing between the
cgroup interface and, say, the syscall interface.  That ship has sailed a
long while ago and we have to deal with it everywhere anyway (in many
cases there isn't even a good unit to pick for compatibility because
the existing interfaces are already mixing units heavily).  As long as
the translation is trivial, it isn't a big issue.  Note that some
translations are not trivial.  For example, the sched nice value
mapping to weight has a separate unit matching knob for that reason.

> > We can go into the weeds with the semantics but how about us using
> > an alternative adjective "misleading" for the cpu.util.min/max names
> > to short-circuit that?
> 
> Not quite sure to get what you mean here. Are you pointing out that
> with clamps we don't strictly enforce a bandwidth but we just set a
> bias?

It's just that "util" is already used a lot and cpu.util.max reads
like it should cap CPU utilization (wallclock based) to 80%, and it's
likely that it'd read that way to many other folks too.  A more
distinctive name signals that it isn't something that obvious.

Thanks.
Patrick Bellasi June 5, 2019, 3:37 p.m. UTC | #11
On 05-Jun 07:44, Tejun Heo wrote:
> Hello,

Hi,

> On Wed, Jun 05, 2019 at 03:39:50PM +0100, Patrick Bellasi wrote:
> > Which means we will enforce the effective values as:
> > 
> >    /tg1/tg11:
> > 
> >          util_min.effective=0
> >             i.e. keep the child protection since smaller than parent
> > 
> >          util_max.effective=800
> >             i.e. keep parent limit since stricter than child
> >
> > Please shout if I got it wrong, otherwise I'll update v10 to
> > implement the above logic.
> 
> Everything sounds good to me.  Please note that cgroup interface files
> actually use literal "max" for limit/protection max settings so that 0
> and "max" mean the same things for all limit/protection knobs.

Lemme see if I've got it right, do you mean that we can:

 1) write the _string_ "max" into a cgroup attribute to:

    - set    0 for util_max, since it's a protection
    - set 1024 for util_min, since it's a limit

 2) write the _string_ "0" into a cgroup attribute to:

    - set 1024 for util_max, since it's a protection
    - set    0 for util_min, since it's a limit

Is that correct or it's just me totally confused?


> Thanks.
> 
> --
> tejun

Cheers,
Patrick
Tejun Heo June 5, 2019, 3:39 p.m. UTC | #12
Hello, Patrick.

On Wed, Jun 05, 2019 at 04:37:43PM +0100, Patrick Bellasi wrote:
> > Everything sounds good to me.  Please note that cgroup interface files
> > actually use literal "max" for limit/protection max settings so that 0
> > and "max" mean the same things for all limit/protection knobs.
> 
> Lemme see if I've got it right, do you mean that we can:
> 
>  1) write the _string_ "max" into a cgroup attribute to:
> 
>     - set    0 for util_max, since it's a protection
>     - set 1024 for util_min, since it's a limit
>
>  2) write the _string_ "0" into a cgroup attribute to:
> 
>     - set 1024 for util_max, since it's a protection
>     - set    0 for util_min, since it's a limit
> 
> Is that correct or it's just me totally confused?

Heh, sorry about not being clearer.  "max" just means numerically
highest possible config for the config knob, so in your case, "max"
would always map to 1024.

Thanks.
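
To make the "max" convention concrete: it is a literal token accepted by
a cgroup interface file and maps to the numerically highest value, which
for these knobs is 1024 (SCHED_CAPACITY_SCALE). Below is a hedged,
userspace-only sketch of that kind of parsing, for illustration; the
patch below still uses plain u64 read/write handlers without the "max"
token.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define UCLAMP_SCALE 1024U      /* SCHED_CAPACITY_SCALE */

/* Accept either a number in [0, 1024] or the literal "max" (-> 1024). */
static int parse_clamp(const char *buf, unsigned int *value)
{
        char *end;
        unsigned long v;

        if (!strcmp(buf, "max")) {
                *value = UCLAMP_SCALE;
                return 0;
        }
        v = strtoul(buf, &end, 10);
        if (end == buf || *end != '\0' || v > UCLAMP_SCALE)
                return -1;
        *value = (unsigned int)v;
        return 0;
}

int main(void)
{
        unsigned int v;

        if (!parse_clamp("max", &v))
                printf("max -> %u\n", v);       /* max -> 1024 */
        if (!parse_clamp("512", &v))
                printf("512 -> %u\n", v);       /* 512 -> 512 */
        return 0;
}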

Patch

diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index 20f92c16ffbf..3a940bfe4e8c 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -909,6 +909,12 @@  controller implements weight and absolute bandwidth limit models for
 normal scheduling policy and absolute bandwidth allocation model for
 realtime scheduling policy.
 
+Cycles distribution is, by default, based purely on time and it
+does not account for the frequency at which tasks are executed.
+The (optional) utilization clamping support allows enforcing a minimum
+bandwidth, which should always be provided by a CPU, and a maximum bandwidth,
+which should never be exceeded by a CPU.
+
 WARNING: cgroup2 doesn't yet support control of realtime processes and
 the cpu controller can only be enabled when all RT processes are in
 the root cgroup.  Be aware that system management software may already
@@ -974,6 +980,27 @@  All time durations are in microseconds.
 	Shows pressure stall information for CPU. See
 	Documentation/accounting/psi.txt for details.
 
+  cpu.util.min
+        A read-write single value file which exists on non-root cgroups.
+        The default is "0", i.e. no utilization boosting.
+
+        The requested minimum utilization in the range [0, 1024].
+
+        This interface allows reading and setting minimum utilization clamp
+        values similar to the sched_setattr(2). This minimum utilization
+        value is used to clamp the task specific minimum utilization clamp.
+
+  cpu.util.max
+        A read-write single value file which exists on non-root cgroups.
+        The default is "1024", i.e. no utilization capping.
+
+        The requested maximum utilization in the range [0, 1024].
+
+        This interface allows reading and setting maximum utilization clamp
+        values similar to the sched_setattr(2). This maximum utilization
+        value is used to clamp the task specific maximum utilization clamp.
+
+
 
 Memory
 ------
diff --git a/init/Kconfig b/init/Kconfig
index 8e103505456a..5617742b97e5 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -894,6 +894,28 @@  config RT_GROUP_SCHED
 
 endif #CGROUP_SCHED
 
+config UCLAMP_TASK_GROUP
+	bool "Utilization clamping per group of tasks"
+	depends on CGROUP_SCHED
+	depends on UCLAMP_TASK
+	default n
+	help
+	  This feature enables the scheduler to track the clamped utilization
+	  of each CPU based on RUNNABLE tasks currently scheduled on that CPU.
+
+	  When this option is enabled, the user can specify a min and max
+	  CPU bandwidth which is allowed for each single task in a group.
+	  The max bandwidth allows to clamp the maximum frequency a task
+	  can use, while the min bandwidth allows to define a minimum
+	  frequency a task will always use.
+
+	  When task group based utilization clamping is enabled, any
+	  task-specific clamp value is constrained by the clamp value
+	  specified for the cgroup. Neither the minimum nor the maximum task
+	  clamp can exceed the corresponding clamp defined at task group level.
+
+	  If in doubt, say N.
+
 config CGROUP_PIDS
 	bool "PIDs controller"
 	help
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index eed7664437ac..19437257a08d 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1137,8 +1137,12 @@  static void __init init_uclamp(void)
 
 	/* System defaults allow max clamp values for both indexes */
 	uclamp_se_set(&uc_max, uclamp_none(UCLAMP_MAX), false);
-	for_each_clamp_id(clamp_id)
+	for_each_clamp_id(clamp_id) {
 		uclamp_default[clamp_id] = uc_max;
+#ifdef CONFIG_UCLAMP_TASK_GROUP
+		root_task_group.uclamp_req[clamp_id] = uc_max;
+#endif
+	}
 }
 
 #else /* CONFIG_UCLAMP_TASK */
@@ -6695,6 +6699,17 @@  void ia64_set_curr_task(int cpu, struct task_struct *p)
 /* task_group_lock serializes the addition/removal of task groups */
 static DEFINE_SPINLOCK(task_group_lock);
 
+static inline void alloc_uclamp_sched_group(struct task_group *tg,
+					    struct task_group *parent)
+{
+#ifdef CONFIG_UCLAMP_TASK_GROUP
+	int clamp_id;
+
+	for_each_clamp_id(clamp_id)
+		tg->uclamp_req[clamp_id] = parent->uclamp_req[clamp_id];
+#endif
+}
+
 static void sched_free_group(struct task_group *tg)
 {
 	free_fair_sched_group(tg);
@@ -6718,6 +6733,8 @@  struct task_group *sched_create_group(struct task_group *parent)
 	if (!alloc_rt_sched_group(tg, parent))
 		goto err;
 
+	alloc_uclamp_sched_group(tg, parent);
+
 	return tg;
 
 err:
@@ -6938,6 +6955,96 @@  static void cpu_cgroup_attach(struct cgroup_taskset *tset)
 		sched_move_task(task);
 }
 
+#ifdef CONFIG_UCLAMP_TASK_GROUP
+static int cpu_util_min_write_u64(struct cgroup_subsys_state *css,
+				  struct cftype *cftype, u64 min_value)
+{
+	struct task_group *tg;
+	int ret = 0;
+
+	if (min_value > SCHED_CAPACITY_SCALE)
+		return -ERANGE;
+
+	rcu_read_lock();
+
+	tg = css_tg(css);
+	if (tg == &root_task_group) {
+		ret = -EINVAL;
+		goto out;
+	}
+	if (tg->uclamp_req[UCLAMP_MIN].value == min_value)
+		goto out;
+	if (tg->uclamp_req[UCLAMP_MAX].value < min_value) {
+		ret = -EINVAL;
+		goto out;
+	}
+
+	uclamp_se_set(&tg->uclamp_req[UCLAMP_MIN], min_value, false);
+
+out:
+	rcu_read_unlock();
+
+	return ret;
+}
+
+static int cpu_util_max_write_u64(struct cgroup_subsys_state *css,
+				  struct cftype *cftype, u64 max_value)
+{
+	struct task_group *tg;
+	int ret = 0;
+
+	if (max_value > SCHED_CAPACITY_SCALE)
+		return -ERANGE;
+
+	rcu_read_lock();
+
+	tg = css_tg(css);
+	if (tg == &root_task_group) {
+		ret = -EINVAL;
+		goto out;
+	}
+	if (tg->uclamp_req[UCLAMP_MAX].value == max_value)
+		goto out;
+	if (tg->uclamp_req[UCLAMP_MIN].value > max_value) {
+		ret = -EINVAL;
+		goto out;
+	}
+
+	uclamp_se_set(&tg->uclamp_req[UCLAMP_MAX], max_value, false);
+
+out:
+	rcu_read_unlock();
+
+	return ret;
+}
+
+static inline u64 cpu_uclamp_read(struct cgroup_subsys_state *css,
+				  enum uclamp_id clamp_id)
+{
+	struct task_group *tg;
+	u64 util_clamp;
+
+	rcu_read_lock();
+	tg = css_tg(css);
+	util_clamp = tg->uclamp_req[clamp_id].value;
+	rcu_read_unlock();
+
+	return util_clamp;
+}
+
+static u64 cpu_util_min_read_u64(struct cgroup_subsys_state *css,
+				 struct cftype *cft)
+{
+	return cpu_uclamp_read(css, UCLAMP_MIN);
+}
+
+static u64 cpu_util_max_read_u64(struct cgroup_subsys_state *css,
+				 struct cftype *cft)
+{
+	return cpu_uclamp_read(css, UCLAMP_MAX);
+}
+#endif /* CONFIG_UCLAMP_TASK_GROUP */
+
 #ifdef CONFIG_FAIR_GROUP_SCHED
 static int cpu_shares_write_u64(struct cgroup_subsys_state *css,
 				struct cftype *cftype, u64 shareval)
@@ -7282,6 +7389,18 @@  static struct cftype cpu_legacy_files[] = {
 		.read_u64 = cpu_rt_period_read_uint,
 		.write_u64 = cpu_rt_period_write_uint,
 	},
+#endif
+#ifdef CONFIG_UCLAMP_TASK_GROUP
+	{
+		.name = "util.min",
+		.read_u64 = cpu_util_min_read_u64,
+		.write_u64 = cpu_util_min_write_u64,
+	},
+	{
+		.name = "util.max",
+		.read_u64 = cpu_util_max_read_u64,
+		.write_u64 = cpu_util_max_write_u64,
+	},
 #endif
 	{ }	/* Terminate */
 };
@@ -7449,6 +7568,20 @@  static struct cftype cpu_files[] = {
 		.seq_show = cpu_max_show,
 		.write = cpu_max_write,
 	},
+#endif
+#ifdef CONFIG_UCLAMP_TASK_GROUP
+	{
+		.name = "util.min",
+		.flags = CFTYPE_NOT_ON_ROOT,
+		.read_u64 = cpu_util_min_read_u64,
+		.write_u64 = cpu_util_min_write_u64,
+	},
+	{
+		.name = "util.max",
+		.flags = CFTYPE_NOT_ON_ROOT,
+		.read_u64 = cpu_util_max_read_u64,
+		.write_u64 = cpu_util_max_write_u64,
+	},
 #endif
 	{ }	/* terminate */
 };
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 6335cfcc81ba..fd31527fdcc8 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -399,6 +399,12 @@  struct task_group {
 #endif
 
 	struct cfs_bandwidth	cfs_bandwidth;
+
+#ifdef CONFIG_UCLAMP_TASK_GROUP
+	/* Clamp values requested for a task group */
+	struct uclamp_se	uclamp_req[UCLAMP_CNT];
+#endif
+
 };
 
 #ifdef CONFIG_FAIR_GROUP_SCHED