
[v4,14/16] sched/core: uclamp: request CAP_SYS_ADMIN by default

Message ID 20180828135324.21976-15-patrick.bellasi@arm.com (mailing list archive)
State Changes Requested, archived
Series: Add utilization clamping support

Commit Message

Patrick Bellasi Aug. 28, 2018, 1:53 p.m. UTC
The number of clamp groups supported is limited and defined at compile
time. However, a malicious user can currently ask for many different
clamp values thus consuming all the available clamp groups.

Since on properly configured systems we expect only a limited set of
different clamp values, the previous problem can be mitigated by
allowing access to clamp groups configuration only to privileged tasks.
This should still allow a System Management Software to properly
pre-configure the system.

Let's restrict the tuning of utilization clamp values, by default, to
tasks with CAP_SYS_ADMIN capabilities.

Whenever this should be considered too restrictive and/or not required
for specific platforms, a kernel boot option is provided to change
this default behavior, thus allowing non-privileged tasks to change their
utilization clamp values.

Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Paul Turner <pjt@google.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Todd Kjos <tkjos@google.com>
Cc: Joel Fernandes <joelaf@google.com>
Cc: Steve Muckle <smuckle@google.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Quentin Perret <quentin.perret@arm.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: linux-kernel@vger.kernel.org
Cc: linux-pm@vger.kernel.org

---
Changes in v4:
 Others:
 - new patch added in this version
 - rebased on v4.19-rc1
---
 .../admin-guide/kernel-parameters.txt         |  3 +++
 kernel/sched/core.c                           | 22 ++++++++++++++++---
 2 files changed, 22 insertions(+), 3 deletions(-)

Comments

Juri Lelli Sept. 4, 2018, 1:47 p.m. UTC | #1
Hi,

On 28/08/18 14:53, Patrick Bellasi wrote:
> The number of clamp groups supported is limited and defined at compile
> time. However, a malicious user can currently ask for many different

Even if not malicious.. :-)

> clamp values thus consuming all the available clamp groups.
> 
> Since on properly configured systems we expect only a limited set of
> different clamp values, the previous problem can be mitigated by
> allowing access to clamp groups configuration only to privileged tasks.
> This should still allow a System Management Software to properly
> pre-configure the system.
> 
> Let's restrict the tuning of utilization clamp values, by default, to
> tasks with CAP_SYS_ADMIN capabilities.
> 
> Whenever this should be considered too restrictive and/or not required
> for specific platforms, a kernel boot option is provided to change
> this default behavior, thus allowing non-privileged tasks to change their
> utilization clamp values.
> 
> Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> Cc: Paul Turner <pjt@google.com>
> Cc: Suren Baghdasaryan <surenb@google.com>
> Cc: Todd Kjos <tkjos@google.com>
> Cc: Joel Fernandes <joelaf@google.com>
> Cc: Steve Muckle <smuckle@google.com>
> Cc: Juri Lelli <juri.lelli@redhat.com>
> Cc: Quentin Perret <quentin.perret@arm.com>
> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
> Cc: Morten Rasmussen <morten.rasmussen@arm.com>
> Cc: linux-kernel@vger.kernel.org
> Cc: linux-pm@vger.kernel.org
> 
> ---
> Changes in v4:
>  Others:
>  - new patch added in this version
>  - rebased on v4.19-rc1
> ---
>  .../admin-guide/kernel-parameters.txt         |  3 +++
>  kernel/sched/core.c                           | 22 ++++++++++++++++---
>  2 files changed, 22 insertions(+), 3 deletions(-)
> 
> diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
> index 9871e649ffef..481f8214ea9a 100644
> --- a/Documentation/admin-guide/kernel-parameters.txt
> +++ b/Documentation/admin-guide/kernel-parameters.txt
> @@ -4561,6 +4561,9 @@
>  			<port#>,<js1>,<js2>,<js3>,<js4>,<js5>,<js6>,<js7>
>  			See also Documentation/input/devices/joystick-parport.rst
>  
> +	uclamp_user	[KNL] Enable task-specific utilization clamping tuning
> +			also from tasks without CAP_SYS_ADMIN capability.
> +
>  	udbg-immortal	[PPC] When debugging early kernel crashes that
>  			happen after console_init() and before a proper
>  			console driver takes over, this boot options might
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 222397edb8a7..8341ce580a9a 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -1510,14 +1510,29 @@ static inline int alloc_uclamp_sched_group(struct task_group *tg,
>  static inline void free_uclamp_sched_group(struct task_group *tg) { }
>  #endif /* CONFIG_UCLAMP_TASK_GROUP */
>  
> +static bool uclamp_user_allowed __read_mostly;
> +static int __init uclamp_user_allow(char *str)
> +{
> +	uclamp_user_allowed = true;
> +
> +	return 0;
> +}
> +early_param("uclamp_user", uclamp_user_allow);
> +
>  static inline int __setscheduler_uclamp(struct task_struct *p,
> -					const struct sched_attr *attr)
> +					const struct sched_attr *attr,
> +					bool user)

Wondering if you want to fold the check below inside the

 if (user && !capable(CAP_SYS_NICE)) {
   ...
 }

block. It would also save you from adding another parameter to the
function.

>  {
>  	int group_id[UCLAMP_CNT] = { UCLAMP_NOT_VALID };
>  	int lower_bound, upper_bound;
>  	struct uclamp_se *uc_se;
>  	int result = 0;
>  
> +	if (!capable(CAP_SYS_ADMIN) &&
> +	    user && !uclamp_user_allowed) {
> +		return -EPERM;
> +	}
> +

Best,

- Juri
Patrick Bellasi Sept. 6, 2018, 2:40 p.m. UTC | #2
On 04-Sep 15:47, Juri Lelli wrote:
> Hi,
> 
> On 28/08/18 14:53, Patrick Bellasi wrote:
> > The number of clamp groups supported is limited and defined at compile
> > time. However, a malicious user can currently ask for many different
> 
> Even if not malicious.. :-)

Yeah... I should have written "ambitious" :D
... I'll get rid of all those geeks in the next respin ;)

> > clamp values thus consuming all the available clamp groups.
> > 
> > Since on properly configured systems we expect only a limited set of
> > different clamp values, the previous problem can be mitigated by
> > allowing access to clamp groups configuration only to privileged tasks.
> > This should still allow a System Management Software to properly
> > pre-configure the system.
> > 
> > Let's restrict the tuning of utilization clamp values, by default, to
> > tasks with CAP_SYS_ADMIN capabilities.
> > 
> > Whenever this should be considered too restrictive and/or not required
> > for specific platforms, a kernel boot option is provided to change
> > this default behavior, thus allowing non-privileged tasks to change their
> > utilization clamp values.
> > 
> > Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
> > Cc: Ingo Molnar <mingo@redhat.com>
> > Cc: Peter Zijlstra <peterz@infradead.org>
> > Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> > Cc: Paul Turner <pjt@google.com>
> > Cc: Suren Baghdasaryan <surenb@google.com>
> > Cc: Todd Kjos <tkjos@google.com>
> > Cc: Joel Fernandes <joelaf@google.com>
> > Cc: Steve Muckle <smuckle@google.com>
> > Cc: Juri Lelli <juri.lelli@redhat.com>
> > Cc: Quentin Perret <quentin.perret@arm.com>
> > Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
> > Cc: Morten Rasmussen <morten.rasmussen@arm.com>
> > Cc: linux-kernel@vger.kernel.org
> > Cc: linux-pm@vger.kernel.org
> > 
> > ---
> > Changes in v4:
> >  Others:
> >  - new patch added in this version
> >  - rebased on v4.19-rc1
> > ---
> >  .../admin-guide/kernel-parameters.txt         |  3 +++
> >  kernel/sched/core.c                           | 22 ++++++++++++++++---
> >  2 files changed, 22 insertions(+), 3 deletions(-)
> > 
> > diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
> > index 9871e649ffef..481f8214ea9a 100644
> > --- a/Documentation/admin-guide/kernel-parameters.txt
> > +++ b/Documentation/admin-guide/kernel-parameters.txt
> > @@ -4561,6 +4561,9 @@
> >  			<port#>,<js1>,<js2>,<js3>,<js4>,<js5>,<js6>,<js7>
> >  			See also Documentation/input/devices/joystick-parport.rst
> >  
> > +	uclamp_user	[KNL] Enable task-specific utilization clamping tuning
> > +			also from tasks without CAP_SYS_ADMIN capability.
> > +
> >  	udbg-immortal	[PPC] When debugging early kernel crashes that
> >  			happen after console_init() and before a proper
> >  			console driver takes over, this boot options might
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index 222397edb8a7..8341ce580a9a 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -1510,14 +1510,29 @@ static inline int alloc_uclamp_sched_group(struct task_group *tg,
> >  static inline void free_uclamp_sched_group(struct task_group *tg) { }
> >  #endif /* CONFIG_UCLAMP_TASK_GROUP */
> >  
> > +static bool uclamp_user_allowed __read_mostly;
> > +static int __init uclamp_user_allow(char *str)
> > +{
> > +	uclamp_user_allowed = true;
> > +
> > +	return 0;
> > +}
> > +early_param("uclamp_user", uclamp_user_allow);
> > +
> >  static inline int __setscheduler_uclamp(struct task_struct *p,
> > -					const struct sched_attr *attr)
> > +					const struct sched_attr *attr,
> > +					bool user)
> 
> Wondering if you want to fold the check below inside the
> 
>  if (user && !capable(CAP_SYS_NICE)) {
>    ...
>  }
> 
> block. It would also save you from adding another parameter to the
> function.

So, there are two reasons for that:

1) _I think_ we don't want to depend on capable(CAP_SYS_NICE) but
   instead on capable(CAP_SYS_ADMIN)

   Does that make sense ?

   If yes, then I cannot fold it in the block you reported above
   because we will not check for users with CAP_SYS_NICE.

2) Then we could move it after that block, where there is another
   set of checks with just:

      if (user) {

   We can potentially add the check there yes... but when uclamp is
   not enabled we will still perform those checks or we have to add
   some compiler guards...

3) ... or at least check for:

     if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP)

   Which is what I'm doing right after the block above (2).

   But, at this point, by passing in the parameter to the
   __setscheduler_uclamp() call, I get the benefits of:

   a) reducing uclamp specific code in the caller
   b) avoiding the checks on !CONFIG_UCLAMP_TASK build

> >  {
> >  	int group_id[UCLAMP_CNT] = { UCLAMP_NOT_VALID };
> >  	int lower_bound, upper_bound;
> >  	struct uclamp_se *uc_se;
> >  	int result = 0;
> >  
> > +	if (!capable(CAP_SYS_ADMIN) &&
> > +	    user && !uclamp_user_allowed) {
> > +		return -EPERM;
> > +	}
> > +

Does all the above make sense ?

Cheers,
Patrick
Juri Lelli Sept. 6, 2018, 2:59 p.m. UTC | #3
On 06/09/18 15:40, Patrick Bellasi wrote:
> On 04-Sep 15:47, Juri Lelli wrote:

[...]

> > Wondering if you want to fold the check below inside the
> > 
> >  if (user && !capable(CAP_SYS_NICE)) {
> >    ...
> >  }
> > 
> > block. It would also save you from adding another parameter to the
> > function.
> 
> So, there are two reasons for that:
> 
> 1) _I think_ we don't want to depend on capable(CAP_SYS_NICE) but
>    instead on capable(CAP_SYS_ADMIN)
> 
>    Does that make sense ?
> 
>    If yes, then I cannot fold it in the block you reported above
>    because we will not check for users with CAP_SYS_NICE.

Ah, right, not sure though. Looks like CAP_SYS_NICE is used for settings
that relate to priorities, affinity, etc.: CPU-related stuff. Since
here you are also dealing with something that seems to fall into the
same realm, it might actually fit more than CAP_SYS_ADMIN?

Now that I think more about it, would it actually make sense to allow
unprivileged users to lower their assigned umin/umax properties if they
want? Something like what happens for nice values or RT priorities.

> 2) Then we could move it after that block, where there is another
>    set of checks with just:
> 
>       if (user) {
> 
>    We can potentially add the check there yes... but when uclamp is
>    not enabled we will still perform those checks or we have to add
>    some compiler guards...
> 
> 3) ... or at least check for:
> 
>      if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP)
> 
>    Which is what I'm doing right after the block above (2).
> 
>    But, at this point, by passing in the parameter to the
>    __setscheduler_uclamp() call, I get the benefits of:
> 
>    a) reducing uclamp specific code in the caller
>    b) avoiding the checks on !CONFIG_UCLAMP_TASK build
> 
> > >  {
> > >  	int group_id[UCLAMP_CNT] = { UCLAMP_NOT_VALID };
> > >  	int lower_bound, upper_bound;
> > >  	struct uclamp_se *uc_se;
> > >  	int result = 0;
> > >  
> > > +	if (!capable(CAP_SYS_ADMIN) &&
> > > +	    user && !uclamp_user_allowed) {
> > > +		return -EPERM;
> > > +	}
> > > +
> 
> Does all the above make sense ?

If we agree on CAP_SYS_ADMIN, however, your approach looks cleaner yes.
Patrick Bellasi Sept. 6, 2018, 5:21 p.m. UTC | #4
On 06-Sep 16:59, Juri Lelli wrote:
> On 06/09/18 15:40, Patrick Bellasi wrote:
> > On 04-Sep 15:47, Juri Lelli wrote:
> 
> [...]
> 
> > > Wondering if you want to fold the check below inside the
> > > 
> > >  if (user && !capable(CAP_SYS_NICE)) {
> > >    ...
> > >  }
> > > 
> > > block. It would also save you from adding another parameter to the
> > > function.
> > 
> > So, there are two reasons for that:
> > 
> > 1) _I think_ we don't want to depend on capable(CAP_SYS_NICE) but
> >    instead on capable(CAP_SYS_ADMIN)
> > 
> >    Does that make sense ?
> > 
> >    If yes, then I cannot fold it in the block you reported above
> >    because we will not check for users with CAP_SYS_NICE.
> 
> Ah, right, not sure though. Looks like CAP_SYS_NICE is used for settings
> that relate to priorities, affinity, etc.: CPU-related stuff. Since
> here you are also dealing with something that seems to fall into the
> same realm, it might actually fit more than CAP_SYS_ADMIN?

Yes and no... from the functional standpoint if a task is running in
the root cgroup, or cgroups are not in use at all, with this API a
task can always ask for the max OPP. Which is what CAP_SYS_NICE is
there for AFAIU... but...

... this check was meant also to fix the issue of the limited number
of clamp groups. That's why I'm now asking for CAP_SYS_ADMIN.

However, I would say that if we consider taking in also the
discretization support introduced in:

   [PATCH v4 15/16] sched/core: uclamp: add clamp group discretization support
   https://lore.kernel.org/lkml/20180828135324.21976-16-patrick.bellasi@arm.com/

then yes, we remain with the "nice" semantics to cover, and
CAP_SYS_NICE could be just enough.

> Now that I think more about it, would it actually make sense to allow
> unprivileged users to lower their assigned umin/umax properties if they
> want? Something like what happens for nice values or RT priorities.

Yes... if we fix the issue with the limited clamp groups, i.e. we take
discretization in.

> > 2) Then we could move it after that block, where there is another
> >    set of checks with just:
> > 
> >       if (user) {
> > 
> >    We can potentially add the check there yes... but when uclamp is
> >    not enabled we will still perform those checks or we have to add
> >    some compiler guards...
> > 
> > 3) ... or at least check for:
> > 
> >      if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP)
> > 
> >    Which is what I'm doing right after the block above (2).
> > 
> >    But, at this point, by passing in the parameter to the
> >    __setscheduler_uclamp() call, I get the benefits of:
> > 
> >    a) reducing uclamp specific code in the caller
> >    b) avoiding the checks on !CONFIG_UCLAMP_TASK build
> > 
> > > >  {
> > > >  	int group_id[UCLAMP_CNT] = { UCLAMP_NOT_VALID };
> > > >  	int lower_bound, upper_bound;
> > > >  	struct uclamp_se *uc_se;
> > > >  	int result = 0;
> > > >  
> > > > +	if (!capable(CAP_SYS_ADMIN) &&
> > > > +	    user && !uclamp_user_allowed) {
> > > > +		return -EPERM;
> > > > +	}
> > > > +
> > 
> > Does all the above make sense ?
> 
> If we agree on CAP_SYS_ADMIN, however, your approach looks cleaner yes.

Cheers Patrick
Peter Zijlstra Sept. 14, 2018, 11:10 a.m. UTC | #5
On Thu, Sep 06, 2018 at 03:40:53PM +0100, Patrick Bellasi wrote:
> 1) _I think_ we don't want to depend on capable(CAP_SYS_NICE) but
>    instead on capable(CAP_SYS_ADMIN)
> 
>    Does that make sense ?

Neither of them really makes sense to me.

The max clamp makes a task 'consume' less and you should always be able
to reduce yourself.

The min clamp doesn't avoid while(1); and is therefore also not a
problem.

So I think setting clamps on a task should not be subject to additional
capabilities.

Now, of course, there is a problem of clamp resources, which are
limited. Consuming those _is_ a problem.

I think the problem here is that the two are conflated in the very same
interface.

Would it make sense to move the available clamp values out to some sysfs
interface-like thing and guard that with a capability, while keeping the
task interface unprivileged?

Another thing that has me 'worried' about this interface is the direct
tie to CPU capacity (not that I have a better suggestion). But it does
raise the point of how userspace is going to discover the relevant
values of the platform.
Patrick Bellasi Sept. 14, 2018, 2:07 p.m. UTC | #6
On 14-Sep 13:10, Peter Zijlstra wrote:
> On Thu, Sep 06, 2018 at 03:40:53PM +0100, Patrick Bellasi wrote:
> > 1) _I think_ we don't want to depend on capable(CAP_SYS_NICE) but
> >    instead on capable(CAP_SYS_ADMIN)
> > 
> >    Does that make sense ?
> 
> Neither of them really makes sense to me.
> 
> The max clamp makes a task 'consume' less and you should always be able
> to reduce yourself.
> 
> The min clamp doesn't avoid while(1); and is therefore also not a
> problem.
> 
> So I think setting clamps on a task should not be subject to additional
> capabilities.
> 
> Now, of course, there is a problem of clamp resources, which are
> limited. Consuming those _is_ a problem.

Right, that problem could be solved if we convince ourselves that the
quantization approach proposed in:

   [PATCH v4 15/16] sched/core: uclamp: add clamp group discretization support
   https://lore.kernel.org/lkml/20180828135324.21976-16-patrick.bellasi@arm.com/

could make sense and, specifically, that the other limitation it imposes
(i.e. the quantization) is within reasonable rounding control errors.

> I think the problem here is that the two are conflated in the very same
> interface.
> 
> Would it make sense to move the available clamp values out to some sysfs
> interface-like thing and guard that with a capability, while keeping the
> task interface unprivileged?

You mean something like:

   $ cat /proc/sys/kernel/sched_uclamp_min_utils
   0 10 20 ... 100

to notify users about the set of clamp values which are available ?

> Another thing that has me 'worried' about this interface is the direct
> tie to CPU capacity (not that I have a better suggestion). But it does
> raise the point of how userspace is going to discover the relevant
> values of the platform.

This point worries me too, and that's what I think is addressed in a
sane way in:

   [PATCH v4 13/16] sched/core: uclamp: use percentage clamp values
   https://lore.kernel.org/lkml/20180828135324.21976-14-patrick.bellasi@arm.com/

IMHO percentages are a reasonably safe and generic API to expose to
user-space. Don't you think this should address your concern above ?
Peter Zijlstra Sept. 14, 2018, 2:28 p.m. UTC | #7
Just a quick reply because I have to run..

On Fri, Sep 14, 2018 at 03:07:32PM +0100, Patrick Bellasi wrote:
> On 14-Sep 13:10, Peter Zijlstra wrote:

> > I think the problem here is that the two are conflated in the very same
> > interface.
> > 
> > Would it make sense to move the available clamp values out to some sysfs
> > interface-like thing and guard that with a capability, while keeping the
> > task interface unprivileged?
> 
> You mean something like:
> 
>    $ cat /proc/sys/kernel/sched_uclamp_min_utils
>    0 10 20 ... 100
> 
> to notify users about the set of clamp values which are available ?
> 
> > Another thing that has me 'worried' about this interface is the direct
> > tie to CPU capacity (not that I have a better suggestion). But it does
> > raise the point of how userspace is going to discover the relevant
> > values of the platform.
> 
> This point worries me too, and that's what I think is addressed in a
> sane way in:
> 
>    [PATCH v4 13/16] sched/core: uclamp: use percentage clamp values
>    https://lore.kernel.org/lkml/20180828135324.21976-14-patrick.bellasi@arm.com/
> 
> IMHO percentages are a reasonably safe and generic API to expose to
> user-space. Don't you think this should address your concern above ?

Not at all what I meant, and no, percentages don't help.

The thing is, the values you'd want to use are for example the capacity
of the little CPUs, or the capacity of the most energy-efficient OPP
(the knee).

Similarly for boosting, how are we 'easily' going to find the values
that correspond to the various available OPPs.

The EAS thing might have these around; but I forgot if/how they're
exposed to userspace (I'll have to soon look at the latest posting).

But changing the clamp metric to something different than these values
is going to be pain.
Patrick Bellasi Sept. 17, 2018, 12:27 p.m. UTC | #8
On 14-Sep 16:28, Peter Zijlstra wrote:
> Just a quick reply because I have to run..
> 
> On Fri, Sep 14, 2018 at 03:07:32PM +0100, Patrick Bellasi wrote:
> > On 14-Sep 13:10, Peter Zijlstra wrote:
> 
> > > I think the problem here is that the two are conflated in the very same
> > > interface.
> > > 
> > > Would it make sense to move the available clamp values out to some sysfs
> > > interface-like thing and guard that with a capability, while keeping the
> > > task interface unprivileged?
> > 
> > You mean something like:
> > 
> >    $ cat /proc/sys/kernel/sched_uclamp_min_utils
> >    0 10 20 ... 100
> > 
> > to notify users about the set of clamp values which are available ?
> > 
> > > Another thing that has me 'worried' about this interface is the direct
> > > tie to CPU capacity (not that I have a better suggestion). But it does
> > > raise the point of how userspace is going to discover the relevant
> > > values of the platform.
> > 
> > This point worries me too, and that's what I think is addressed in a
> > sane way in:
> > 
> >    [PATCH v4 13/16] sched/core: uclamp: use percentage clamp values
> >    https://lore.kernel.org/lkml/20180828135324.21976-14-patrick.bellasi@arm.com/
> > 
> > IMHO percentages are a reasonably safe and generic API to expose to
> > user-space. Don't you think this should address your concern above ?
> 
> Not at all what I meant, and no, percentages don't help.
> 
> The thing is, the values you'd want to use are for example the capacity
> of the little CPUs, or the capacity of the most energy-efficient OPP
> (the knee).

I don't think so.

On the knee topic, we had some thinking and on most platforms it seems
to be a rather arbitrary decision.

On sane platforms, the Energy Efficiency (EE) is monotonically
decreasing with frequency increase.  Maybe we can define a threshold
for an "EE derivative ratio", but it will still be quite arbitrary.
Moreover, it could be that in certain use-cases we want to push for
higher energy efficiency (i.e. lower derivatives) than others.

> Similarly for boosting, how are we 'easily' going to find the values
> that correspond to the various available OPPs.

In our experience with SchedTune on Android, we found that we
generally focus on a small set of representative use-cases and then
run an exploration, by tuning the percentage of boost, to identify the
optimal trade-off between Performance and Energy.
The value you get could be something which does not match exactly an OPP
but still, since we (will) bias not only OPP selection but also task
placement, it's the one which makes most sense.

Thus, the capacity of little CPUs, or the exact capacity of an OPP, is
something we don't care to specify exactly, since:

 - schedutil will top the util request to the next frequency anyway

 - capacity by itself is a loosely defined metric, since it's usually
   measured considering a specific kind of instructions mix, which
   can be very different from the actual instruction mix (e.g. integer
   vs floating point)

 - certain platforms don't even expose OPPs, but just "performance
   levels"... which ultimately are a "percentage"

 - there are so many rounding errors around on utilization tracking
   and its aggregation that being exact on an OPP is of "relative"
   importance

Do you see specific use-cases where an exact OPP capacity is much
better than a percentage value ?

Of course there can be scenarios in which we want to clamp to a
specific OPP. But still, why should it be difficult for a platform
integrator to express it as a close enough percentage value ?

> The EAS thing might have these around; but I forgot if/how they're
> exposed to userspace (I'll have to soon look at the latest posting).

The new "Energy Model Management" framework can certainly be use to
get the list of OPPs for each frequency domain. IMO this could be
used to identify the maximum number of clamp groups we can have.
In this case, the discretization patch can translate a generic
percentage clamp into the closest OPP capacity... 

... but to me that's an internal detail which I'm not convinced we
need to expose to user-space.

IMHO we should instead focus just on defining a usable and generic
userspace interface. Then, platform specific tuning is something
user-space can do, either offline or on-line.

> But changing the clamp metric to something different than these values
> is going to be pain.

Maybe I don't completely get what you mean here... are you saying that
not using exact capacity values to define clamps is difficult ?
If that's the case why? Can you elaborate with an example ?

Cheers,
Patrick
Peter Zijlstra Sept. 21, 2018, 9:13 a.m. UTC | #9
On Mon, Sep 17, 2018 at 01:27:23PM +0100, Patrick Bellasi wrote:
> On 14-Sep 16:28, Peter Zijlstra wrote:

> > The thing is, the values you'd want to use are for example the capacity
> > of the little CPUs, or the capacity of the most energy-efficient OPP
> > (the knee).
> 
> I don't think so.
> 
> On the knee topic, we had some thinking and on most platforms it seems
> to be a rather arbitrary decision.
> 
> On sane platforms, the Energy Efficiency (EE) is monotonically
> decreasing with frequency increase.  Maybe we can define a threshold
> for an "EE derivative ratio", but it will still be quite arbitrary.
> Moreover, it could be that in certain use-cases we want to push for
> higher energy efficiency (i.e. lower derivatives) than others.

I remember IBM-power folks asking for knee-related features a number of
years ago (Dusseldorf IIRC) because after some point their chips start
to _really_ suck power. Sure, the curve is monotonic, but the perf/watt
takes a nose dive.

And given that: P = CfV^2, that seems like a fairly generic observation.
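
Under the common DVFS assumption that voltage scales roughly linearly with frequency, the observation can be spelled out:

```latex
P = C f V^2, \qquad V \propto f \;\Rightarrow\; P \propto C f^3,
\qquad
\frac{\mathrm{perf}}{\mathrm{watt}} \propto \frac{f}{f^3} = \frac{1}{f^2}
```

i.e. with performance growing at best linearly in f, efficiency falls off quadratically as frequency rises, and even faster once V has to be raised super-linearly past the knee.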

However, maybe, due to the very limited thermal capacity of these mobile
things, the issue doesn't really arise in them.

Laptops with active cooling however...

> > Similarly for boosting, how are we 'easily' going to find the values
> > that correspond to the various available OPPs.
> 
> In our experience with SchedTune on Android, we found that we
> generally focus on a small set of representative use-cases and then
> run an exploration, by tuning the percentage of boost, to identify the
> optimal trade-off between Performance and Energy.

So you basically do an automated optimization for a benchmark?

> The value you get could be something which does not match exactly an OPP
> but still, since we (will) bias not only OPP selection but also task
> placement, it's the one which makes most sense.

*groan*, so how exactly does that work? By limiting the task capacity,
we allow some stacking on the CPUs before we switch to regular
load-balancing?

> Thus, the capacity of little CPUs, or the exact capacity of an OPP, is
> something we don't care to specify exactly, since:
> 
>  - schedutil will top the util request to the next frequency anyway
> 
>  - capacity by itself is a loosely defined metric, since it's usually
>    measured considering a specific kind of instructions mix, which
>    can be very different from the actual instruction mix (e.g. integer
>    vs floating point)

Sure, things like pure SIMD workloads can skew things pretty bad, but on
average it should not drastically change the overall shape of the curve
and the knee point should not move around a lot.

>  - certain platforms don't even expose OPPs, but just "performance
>    levels"... which ultimately are a "percentage"

Well, the whole capacity thing is a 'percentage', it's just that 1024 is
much nicer to work with (for computers) than 100 is (also it provides a
wee bit more resolution).

But even the platforms with hidden OPPs (can) have knee points, and if
you measure their power to capacity curve you can place a workload
around the knee by capping capacity.

But yes, this gets tricky real fast :/

>  - there are so many rounding errors around on utilization tracking
>    and its aggregation that being exact on an OPP is of "relative"
>    importance

I'm not sure I understand that argument; sure the measurement is subject
to 'issues', but if we hard clip the result, that will exactly match the
fixed points for OPP selection. Any issues on the measurement are lost
after clipping.

> Do you see specific use-cases where an exact OPP capacity is much
> better than a percentage value ?

If I don't have algorithmic optimization available, hand selecting an
OPP is the 'obvious' thing to do.

> Of course there can be scenarios in which we want to clamp to a
> specific OPP. But still, why should it be difficult for a platform
> integrator to express it as a close enough percentage value ?

But why put him through the trouble of finding the capacity value in the
EAS-exposed data, converting that to a percentage that will work, and
then feeding it back in?

I don't see the point or benefit of percentages, there's nothing magical
about 1/100, _any_ other fraction works exactly the same.

So why bother changing it around?

> > The EAS thing might have these around; but I forgot if/how they're
> > exposed to userspace (I'll have to soon look at the latest posting).
> 
> The new "Energy Model Management" framework can certainly be use to
> get the list of OPPs for each frequency domain. IMO this could be
> used to identify the maximum number of clamp groups we can have.
> In this case, the discretization patch can translate a generic
> percentage clamp into the closest OPP capacity... 
> 
> ... but to me that's an internal detail which I'm not convinced we
> need to expose to user-space.
> 
> IMHO we should instead focus just on defining a usable and generic
> userspace interface. Then, platform specific tuning is something
> user-space can do, either offline or on-line.

The thing I worry about is how do we determine the value to put in in
the first place.

How are we expecting people to determine what to put into the interface?
Knee points, little capacity, those things make 'obvious' sense.

> > But changing the clamp metric to something different than these values
> > is going to be pain.
> 
> Maybe I don't completely get what you mean here... are you saying that
> not using exact capacity values to define clamps is difficult ?
> If that's the case why? Can you elaborate with an example ?

I meant changing the unit around, 1/1024 is what we use throughout and
is what EAS is also exposing IIRC, so why make things complicated again
and use 1/100 (which is a shit fraction for computers).
Patrick Bellasi Sept. 24, 2018, 3:14 p.m. UTC | #10
On 21-Sep 11:13, Peter Zijlstra wrote:
> On Mon, Sep 17, 2018 at 01:27:23PM +0100, Patrick Bellasi wrote:
> > On 14-Sep 16:28, Peter Zijlstra wrote:
> 
> > > The thing is, the values you'd want to use are for example the capacity
> > > of the little CPUs. or the capacity of the most energy efficient OPP
> > > (the knee).
> > 
> > I don't think so.
> > 
> > On the knee topic, we had some thinking and on most platforms it seems
> > to be a rather arbitrary decision.
> > 
> > On sane platforms, the Energy Efficiency (EE) is monotonically
> > decreasing with frequency increase.  Maybe we can define a threshold
> > for a "EE derivative ratio", but it will still be quite arbitrary.
> > Moreover, it could be that in certain use-cases we want to push for
> higher energy efficiency (i.e. lower derivatives) than others.
> 
> I remember IBM-power folks asking for knee related features a number of
> years ago (Dusseldorf IIRC) because after some point their chips start
> to _really_ suck power. Sure, the curve is monotonic, but the perf/watt
> takes a nose dive.
> 
> And given that: P = CfV^2, that seems like a fairly generic observation.
> 
> However, maybe, due to the very limited thermal capacity of these mobile
> things, the issue doesn't really arise in them.

The curve still follows the equation above for mobile devices too,
and it's exactly by looking at that curve that it seems rather
arbitrary to define a knee point (more on that later)...
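To put a number on that, here is a toy C sketch of the P = CfV^2 model
quoted above, with V scaling linearly with f (the cost constant is made
up, purely for illustration): the resulting energy efficiency curve is
monotonically decreasing, so no knee falls out of the model by itself.

```c
#include <assert.h>

/*
 * Toy model only: P = C * f * V^2 with V proportional to f, so
 * P ~ f^3 and energy efficiency EE = perf/P ~ 1/f^2.
 * The cost constant below is made up for illustration.
 */
#define C_COST 0.5

static double power_model(double f)
{
	double v = f;               /* assume V scales linearly with f */

	return C_COST * f * v * v;  /* P = C f V^2 */
}

static double energy_efficiency(double f)
{
	return f / power_model(f);  /* perf/Watt, with perf ~ f */
}
```

With these assumptions EE(f) = 2/f^2: strictly decreasing everywhere,
which is the point about any knee threshold being arbitrary.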

> Laptops with active cooling however...

How do you see active cooling playing a role ?

Are you thinking, for example, about reduced fan noise if we remain
below a certain OPP ?

Are you factoring fan power consumption into the overall P consumption ?

> > > Similarly for boosting, how are we 'easily' going to find the values
> > > that correspond to the various available OPPs.
> > 
> > In our experience with SchedTune on Android, we found that we
> > generally focus on a small set of representative use-cases and then
> > run an exploration, by tuning the percentage of boost, to identify the
> > optimal trade-off between Performance and Energy.
> 
> So you basically do an automated optimization for a benchmark?

Not on one single benchmark, we consider a set of interesting
use-cases. We mostly focus on:
- interactivity: no jank frames while scrolling fast on list/views
- power efficiency: on common video/audio playback scenarios

The exploration of some optimization parameters can be automated.
However, in the end there is always a rather arbitrary decision to
take: you can be slightly more oriented towards interactive
performance or towards energy efficiency.

Maybe (in the future) you can also see AI/ML, used from user-space, to
figure out the fine tuning based on user's usage patterns for different
apps... ;)

> > The value you get could be something which does not match exactly an OPP
> > but still, since we (will) bias not only OPP selection but also tasks
> > placement, it's the one which makes most sense.
> 
> *groan*, so how exactly does that work? By limiting the task capacity,
> we allow some stacking on the CPUs before we switch to regular
> load-balancing?

This is a big topic in itself, and one of the reasons why we did not
add it in this series. We will need dedicated discussions to figure out
something reasonable.

In principle, however, by capping the utilization of tasks and their
CPUs we can aim at a way to remain in energy_aware mode, i.e. below
the tipping point, and thus with load-balancing disabled.

Utilization clamping can be used to bias the CPUs selection from the
EA code paths. Other mechanisms, e.g. bandwidth control, can also be
exploited to keep CPU utilization under control.

> > Thus, the capacity of little CPUs, or the exact capacity of an OPP, is
> > something we don't care to specify exactly, since:
> > 
> >  - schedutil will top the util request to the next frequency anyway
> > 
> >  - capacity by itself is a loosely defined metric, since it's usually
> >    measured considering a specific kind of instructions mix, which
> >    can be very different from the actual instruction mix (e.g. integer
> >    vs floating point)
> 
> Sure, things like pure SIMD workloads can skew things pretty bad, but on
> average it should not drastically change the overall shape of the curve
> and the knee point should not move around a lot.

There can be quite consistent skews based not just on instructions
type but also "app phases", e.g. memory-vs-cpu bound.
It's also true that it's more likely a shift up/down on the capacity
axis than a change in shape.

However, I think my point here is that the actual capacity of each OPP
can be very different wrt the one reported by the EM.

> >  - certain platforms don't even expose OPPs, but just "performance
> >    levels"... which ultimately are a "percentage"
> 
> Well, the whole capacity thing is a 'percentage', it's just that 1024 is
> much nicer to work with (for computers) than 100 is (also it provides a
> wee bit more resolution).

Right, indeed in kernel-space we still use 1024 based values, we just
convert them at the syscall interface...

> But even the platforms with hidden OPPs (can) have knee points, and if
> you measure their power to capacity curve you can place a workload
> around the knee by capping capacity.

... still it's difficult to give a precise definition of knee point,
unless you know about platforms which have a sharp change in energy
efficiency.

The only cases we know about are those where:

A) multiple frequencies use the same voltage, e.g.


     ^                                     *
     | Energy                              O
     | efficiency                         O+
     |                                   O |
     |                                 O*  |
     |                              O**    |
     |   O**                    O***       |
     |   +  O**            O****           |
     |   |     O**   O*****                |
     |   |        O**                      |
     |   |          +                      |
     |   |  Same V  |   Increasing V       |
     +---+----------+----------------------+----------->
         |          |                      | Frequency
         L          M                      H

B) there is a big frequency gap between low frequency OPPs and high
   frequency OPPs, e.g.

                                       O
    ^                                **+
    | Energy                       **  |
    | efficiency                 **    |
    |                          **      |
    |                        **        |
    |                      **          |
    |                    **            |
    |                  **              |
    |               O**                |
    |        O******+                  |
    |O*******       |                  |
    |               |                  |
    ++--------------+------------------+------>
     |              |                  |  Frequency
     L              M                  H


In case A, all the OPPs left of M are dominated by M in terms
of energy efficiency and normally they should never be used...
unless you are under thermal constraints and you still want to keep
your code running, even if at a lower rate and energy efficiency.
At this point, however, you have already invalidated all the OPPs
right of M and, among the remaining ones, you still struggle to
define the knee point.

In case B... I'm wondering if such a conf even makes sense ;)
Is there really some platform out there with such a "non homogeneously
distributed" set of available frequencies ?

> But yes, this gets trick real fast :/
> 
> >  - there are so many rounding errors around on utilization tracking
> >    and its aggregation that being exact on an OPP is of "relative"
> >    importance
> 
> I'm not sure I understand that argument; sure the measurement is subject
> to 'issues', but if we hard clip the result, that will exactly match the
> fixed points for OPP selection. Any issues on the measurement are lost
> after clipping.

My point is just that the clamping value does not need to be specified
as the actual capacity of an OPP.

If we give a "performance level" P then schedutil already knows how to
translate it into an OPP: it will pick the OPP with the minimum
capacity which is not smaller than P.

However, given that the knee definition is anyway fuzzy, selecting
either that OPP or the previous one should almost always make little
difference from a "knee" standpoint, shouldn't it ?
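A minimal sketch of that translation (hypothetical OPP capacities, not
from any real platform; the selection rule mirrors the schedutil-like
"lowest OPP whose capacity is not smaller than the request" described
above):

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical capacities for a 4-OPP frequency domain (0..1024 scale). */
static const unsigned int opp_capacity[] = { 128, 256, 654, 1024 };
#define NR_OPPS (sizeof(opp_capacity) / sizeof(opp_capacity[0]))

/*
 * schedutil-like selection: return the capacity of the lowest OPP
 * whose capacity is not smaller than the requested performance level.
 */
static unsigned int opp_for_level(unsigned int level)
{
	size_t i;

	for (i = 0; i < NR_OPPS; i++)
		if (opp_capacity[i] >= level)
			return opp_capacity[i];

	return opp_capacity[NR_OPPS - 1];
}
```

Any performance level between 257 and 654 maps to the 654-capacity OPP
here, which is why a clamp does not need to name an OPP capacity
exactly.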

> > Do you see specific use-cases where an exact OPP capacity is much
> > better then a percentage value ?
> 
> If I don't have algorithmic optimization available, hand selecting an
> OPP is the 'obvious' thing to do.

Agree, but that's still possible by passing in a percentage value.

> > Of course there can be scenarios in which we want to clamp to a
> > specific OPP. But still, why should it be difficult for a platform
> > integrator to express it as a close enough percentage value ?
> 
> But why put him through the trouble of finding the capacity value in the
> EAS exposed data, converting that to a percentage that will work and
> then feeding it back in.
>
> I don't see the point or benefit of percentages, there's nothing magical
> about 1/100, _any_ other fraction works exactly the same.

If you wanna pass in an exact capacity value, you still need to look
at the EAS exposed data, don't you?

I think the main problem is picking the right capacity, more than
converting it to a percentage.

Once you got a capacity, let's say 654, its conversion to a
percentage is as simple as dropping the units, i.e. 654 ~= 65%, which
is quite likely to pick the 654 capacity OPP anyway.
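A possible sketch of those conversions (hypothetical helpers, not the
patchset's code): rounding the capacity-to-percentage conversion down
guarantees the round trip never overshoots the original capacity, so an
OPP selected as "lowest capacity not smaller than the clamp" stays the
same.

```c
#include <assert.h>

/*
 * Hypothetical round-down conversions between the 0..100 user-space
 * scale and the 0..1024 kernel scale.
 */
static unsigned int capacity_to_pct(unsigned int cap)
{
	return cap * 100 / 1024;
}

static unsigned int pct_to_capacity(unsigned int pct)
{
	return pct * 1024 / 100;
}
```

For example, capacity 654 converts to 63% (654 * 100 / 1024, rounded
down), which scales back to 645: just below 654, so the 654-capacity
OPP is still selected. Note the unit-dropping shortcut (654 -> 65%)
scales back to 665, slightly above 654, and so may overshoot into the
next OPP depending on rounding.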


> So why bother changing it around?

For two main reasons:

1) to expose userspace a more generic interface:
   a "performance percentage" is more generic than a "capacity value",
   while we keep translating to, and using, a 1024 based value in
   kernel space

2) to reduce the configuration space:
   it quite likely doesn't make sense to use, in the same system, 100
   different clamp values... and it makes even less sense to use 1024
   different clamp values, doesn't it ?

> > > The EAS thing might have these around; but I forgot if/how they're
> > > exposed to userspace (I'll have to soon look at the latest posting).
> > 
> > The new "Energy Model Management" framework can certainly be use to
> > get the list of OPPs for each frequency domain. IMO this could be
> > used to identify the maximum number of clamp groups we can have.
> > In this case, the discretization patch can translate a generic
> > percentage clamp into the closest OPP capacity... 
> > 
> > ... but to me that's an internal detail which I'm not convinced we
> > need to expose to user-space.
> > 
> > IMHO we should instead focus just on defining a usable and generic
> > userspace interface. Then, platform specific tuning is something
> > user-space can do, either offline or on-line.
> 
> The thing I worry about is how do we determine the value to put in in
> the first place.

I agree that's the main problem, but I also think that's outside of
the kernel-space mechanism.

Isn't all that quite similar to DEADLINE task configuration?

Given a DL task solving a certain issue, you can certainly define its
deadline (or period) on a completely platform independent way, by just
looking at the problem space. But when it comes to the run-time, we
always have to profile the task in a platform specific way.

In the DL case from user-space we figure out a bandwidth requirement.

In the clamping case, it's still the user-space that needs to figure
out an optimal clamp value, while considering your performance and
energy efficiency goals. This can be based on an automated profiling
process which comes up with "optimal" clamp values.

In the DL case, we are perfectly fine to have a running time
parameter, although we don't give any precise and deterministic
formula to quantify it. It's up to user-space to figure out the
required running time for a given app and platform.
It's also not unrealistic that you need to close a control loop
with user-space to keep updating this requirement.

Why the same cannot hold for clamp values ?

> How are we expecting people to determine what to put into the interface?
> Knee points, little capacity, those things make 'obvious' sense.

IMHO, they make "obvious" sense from a kernel-space perspective
exactly because they are implementation details and platform specific
concepts.

At the same time, I struggle to provide a definition of knee point and
I struggle to find a use-case where I can certainly say that a task
should be clamped exactly to the little capacity for example.

I'm more of the idea that the right clamp value is something a bit
fuzzy and possibly subject to change over time depending on the
specific application phase (e.g. cpu-vs-memory bounded) and/or
optimization goals (e.g. performance vs energy efficiency).

Here we are thus at defining, and agreeing on, a "generic and abstract"
interface which allows user-space to feed input to kernel-space.
To this purpose, I think platform specific details and/or internal
implementation details are not "a bonus".

> > > But changing the clamp metric to something different than these values
> > > is going to be pain.
> > 
> > Maybe I don't completely get what you mean here... are you saying that
> > not using exact capacity values to define clamps is difficult ?
> > If that's the case why? Can you elaborate with an example ?
> 
> I meant changing the unit around, 1/1024 is what we use throughout and
> is what EAS is also exposing IIRC, so why make things complicated again
> and use 1/100 (which is a shit fraction for computers).

Internally, in kernel space, we use 1024 units. It's just the
user-space interface that speaks percentages but, as soon as a
percentage value is used to configure a clamp, it's translated into a
[0..1024] range value.

Is this not an acceptable compromise? We have a generic user-space
interface and an effective/consistent kernel-space implementation.

Cheers,
Patrick
Peter Zijlstra Sept. 24, 2018, 3:56 p.m. UTC | #11
On Mon, Sep 24, 2018 at 04:14:00PM +0100, Patrick Bellasi wrote:
> On 21-Sep 11:13, Peter Zijlstra wrote:
> > Laptops with active cooling however...
> 
> How do you see active cooling playing a role ?
> 
> Are you thinking, for example, at reduced fan noise if we remain below
> a certain OPP ?
> 
> Are you factoring fan power consumption into the overall P consumption ?

Nothing as fancy as that; I just figured that with a larger cooling
capacity, you can push chips higher onto that curve past the optimal
IPC/Watt point. Make it go fast etc..
Peter Zijlstra Sept. 24, 2018, 4:26 p.m. UTC | #12
On Mon, Sep 24, 2018 at 04:14:00PM +0100, Patrick Bellasi wrote:

> ... still it's difficult to give a precise definition of knee point,
> unless you know about platforms which have a sharp change in energy
> efficiency.
> 
> The only cases we know about are those where:
> 
> A) multiple frequencies uses the same voltage, e.g.
> 
> 
>      ^                                     *
>      | Energy                              O
>      | efficiency                         O+
>      |                                   O |
>      |                                 O*  |
>      |                              O**    |
>      |   O**                    O***       |
>      |   +  O**            O****           |
>      |   |     O**   O*****                |
>      |   |        O**                      |
>      |   |          +                      |
>      |   |  Same V  |   Increasing V       |
>      +---+----------+----------------------+----------->
>          |          |                      | Frequency
>          L          M                      H
> 
> B) there is a big frequency gap between low frequency OPPs and high
>    frequency OPPs, e.g.
> 
>                                        O
>     ^                                **+
>     | Energy                       **  |
>     | efficiency                 **    |
>     |                          **      |
>     |                        **        |
>     |                      **          |
>     |                    **            |
>     |                  **              |
>     |               O**                |
>     |        O******+                  |
>     |O*******       |                  |
>     |               |                  |
>     ++--------------+------------------+------>
>      |              |                  |  Frequency
>      L              M                  H
> 
> 
> In case A, all the OPPs left of M are dominated by M in terms
> of energy efficiency and normally they should be never used.
> Unless you are under thermal constraints and you still want to keep
> your code running even if at a lower rate and energy efficiency.
> At this point, however, you already invalidated all the OPPs right of
> M and, on the remaining, you still struggle do define the knee point.
> 
> In case B... I'm wondering it such a conf even makes sense ;)
> Is there really some platform out there with such a "non homogeneously
> distributed" set of available frequencies ?

Well, the curve is a second or third order polynomial (when V~f -> fV^2
-> f^3), so it shoots up at some point. There's not really anything you
can do about that. But if you're willing to put in active cooling and
lots of energy, you can make it go fast :-)

Therefore I was thinking:

> Maybe we can define a threshold
> for a "EE derivative ratio", but it will still be quite arbitrary.

Because up until de/df=.5 we gain more performance than we lose ee. But
I might not have appreciated the fact that when we work with imaginary
cost units that skews the .5.
Patrick Bellasi Sept. 24, 2018, 5:19 p.m. UTC | #13
On 24-Sep 18:26, Peter Zijlstra wrote:
> On Mon, Sep 24, 2018 at 04:14:00PM +0100, Patrick Bellasi wrote:
> 
> > ... still it's difficult to give a precise definition of knee point,
> > unless you know about platforms which have a sharp change in energy
> > efficiency.
> > 
> > The only cases we know about are those where:
> > 
> > A) multiple frequencies uses the same voltage, e.g.
> > 

On a side note, the following plots actually represent ee^-1 or,
rather, P on the y axis... my bad... but you got the meaning anyway ;)

> > 
> >      ^                                     *
> >      | Energy                              O
> >      | efficiency                         O+
> >      |                                   O |
> >      |                                 O*  |
> >      |                              O**    |
> >      |   O**                    O***       |
> >      |   +  O**            O****           |
> >      |   |     O**   O*****                |
> >      |   |        O**                      |
> >      |   |          +                      |
> >      |   |  Same V  |   Increasing V       |
> >      +---+----------+----------------------+----------->
> >          |          |                      | Frequency
> >          L          M                      H
> > 
> > B) there is a big frequency gap between low frequency OPPs and high
> >    frequency OPPs, e.g.
> > 
> >                                        O
> >     ^                                **+
> >     | Energy                       **  |
> >     | efficiency                 **    |
> >     |                          **      |
> >     |                        **        |
> >     |                      **          |
> >     |                    **            |
> >     |                  **              |
> >     |               O**                |
> >     |        O******+                  |
> >     |O*******       |                  |
> >     |               |                  |
> >     ++--------------+------------------+------>
> >      |              |                  |  Frequency
> >      L              M                  H
> > 
> > 
> > In case A, all the OPPs left of M are dominated by M in terms
> > of energy efficiency and normally they should be never used.
> > Unless you are under thermal constraints and you still want to keep
> > your code running even if at a lower rate and energy efficiency.
> > At this point, however, you already invalidated all the OPPs right of
> > M and, on the remaining, you still struggle do define the knee point.
> > 
> > In case B... I'm wondering it such a conf even makes sense ;)
> > Is there really some platform out there with such a "non homogeneously
> > distributed" set of available frequencies ?
> 
> Well, the curve is a second or third order polynomial (when V~f -> fV^2
> -> f^3), so it shoots up at some point. There's not really anything you
> can do about that. But if you're willing to put in active cooling and
> lots of energy, you can make it go fast :-)

Sure... as long as you don't melt the silicon you can push the
frequency.

However, if you are going for such aggressive active cooling, perhaps
energy efficiency is already a very low priority goal for you.

> Therefore I was thinking:
> 
> > Maybe we can define a threshold
> > for a "EE derivative ratio", but it will still be quite arbitrary.
> 
> Because up until de/df=.5 we gain more performance than we lose ee.

You mean up until de < df ?

IOW... the threshold should be de == df => 45deg tangent ?

> But I might not have appreciated the fact that when we work with
> imaginary cost units that skews the .5.

The main skew IMO comes from the fact that the energy efficiency
"tipping point" is very much application / user specific...
and it can also change depending on the usage scenario for the
same user and platform.
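For reference, working the idealized P = CfV^2 model through (with V
proportional to f and all constants folded into a single k, the
"imaginary cost units" mentioned above) makes both points explicit:

```latex
% Idealized model from the thread: P = C f V^2 with V \propto f,
% constants folded into k.
\begin{align}
  P(f) &= k f^3,
  &
  e(f) &= \frac{\text{perf}}{P(f)} = \frac{f}{k f^3} = \frac{1}{k f^2}
  \\
  \frac{d\ln e}{d\ln f} &= -2 \;\; \text{(constant)},
  &
  \left|\frac{de}{df}\right| &= \frac{2}{k f^3} = \frac{1}{2}
  \;\Rightarrow\;
  f = \left(\frac{4}{k}\right)^{1/3}
\end{align}
```

In relative terms the elasticity is constant, so the idealized curve
has no distinguished point at all; and the frequency where any
absolute de/df threshold trips depends on k, i.e. on the cost units.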
Patrick Bellasi Sept. 24, 2018, 5:23 p.m. UTC | #14
On 24-Sep 17:56, Peter Zijlstra wrote:
> On Mon, Sep 24, 2018 at 04:14:00PM +0100, Patrick Bellasi wrote:
> > On 21-Sep 11:13, Peter Zijlstra wrote:
> > > Laptops with active cooling however...
> > 
> > How do you see active cooling playing a role ?
> > 
> > Are you thinking, for example, at reduced fan noise if we remain below
> > a certain OPP ?
> > 
> > Are you factoring fan power consumption into the overall P consumption ?
> 
> Nothing as fancy as that; I just figured that with a larger cooling
> capacity, you can push chips higher onto that curve past the optimal
> IPC/Watt point. Make it go fast etc..

That very concept of an "optimal IPC/Watt" point is not easy to
define on a monotonic function without considering the specific
optimization goals.

It really sounds like saying that an LP problem has a unique solution
independently of the optimization function.

Do you think we should "mandate" an optimization function from kernel
space? I'm not saying it does not make sense, but I find it at least a
strong implementation enforcement.
Peter Zijlstra Sept. 25, 2018, 3:49 p.m. UTC | #15
On Mon, Sep 24, 2018 at 04:14:00PM +0100, Patrick Bellasi wrote:

> > So why bother changing it around?
> 
> For two main reasons:
> 
> 1) to expose userspace a more generic interface:
>    a "performance percentage" is more generic then a "capacity value"
>    while keep translating and using a 1024 based value in kernel space

The unit doesn't make it more or less generic. It's the exact same thing
in the end.

> 2) to reduce the configuration space:
>    it quite likely doesn't make sense to use, in the same system, 100
>    different clamp values... and it makes even less sense to use 1024
>    different clamp values, doesn't it ?

I'd tend to agree with you that 1024 is probably too big a
configuration space, OTOH I also don't want to end up with a "640KB is
enough for everybody" situation.

And 100 really isn't that much better either way around.

> > The thing I worry about is how do we determine the value to put in in
> > the first place.
> 
> I agree that's the main problem, but I also think that's outside of
> the kernel-space mechanism.
> 
> Is not all that quite similar to DEADLINE tasks configuration?

Well, with DL there are well defined rules for what to put in and what
to then expect.

For this thing, not so much I feel.

> Given a DL task solving a certain issue, you can certainly define its
> deadline (or period) on a completely platform independent way, by just
> looking at the problem space. But when it comes to the run-time, we
> always have to profile the task in a platform specific way.
> 
> In the DL case from user-space we figure out a bandwidth requirement.

Most likely, although you can compute in a number of cases. But yes, it
is always platform specific.

> In the clamping case, it's still the user-space that needs to figure
> out an optimal clamp value, while considering your performance and
> energy efficiency goals. This can be based on an automated profiling
> process which comes up with "optimal" clamp values.
> 
> In the DL case, we are perfectly fine to have a running time
> parameter, although we don't give any precise and deterministic
> formula to quantify it. It's up to user-space to figure out the
> required running time for a given app and platform.
> It's also not unrealistic the case you need to close a control loop
> with user-space to keep updating this requirement.
> 
> Why the same cannot hold for clamp values ?

The big difference is that if I request (and am granted) a runtime quota
of a given amount, then that is what I'm guaranteed to get.

Irrespective of the amount being sufficient for the work in question --
which is where the platform dependency comes in.

But what am I getting when I set a specific clamp value? What does it
mean to set the value to 80%?

So far the only real meaning is when combined with the EAS OPP data, we
get a direct translation to OPPs. Irrespective of how the utilization is
measured and the capacity:OPP mapping established, once that's set, we
can map a clamp value to an OPP and get meaning.

But without that, it doesn't mean anything much at all. And that is my
complaint. It seems to get presented as: 'random knob that might do
something'. The range it takes as input doesn't change a thing.

> > How are we expecting people to determine what to put into the interface?
> > Knee points, little capacity, those things make 'obvious' sense.
> 
> IMHO, they make "obvious" sense from a kernel-space perspective
> exactly because they are implementation details and platform specific
> concepts.
> 
> At the same time, I struggle to provide a definition of knee point and
> I struggle to find a use-case where I can certainly say that a task
> should be clamped exactly to the little capacity for example.
> 
> I'm more of the idea that the right clamp value is something a bit
> fuzzy and possibly subject to change over time depending on the
> specific application phase (e.g. cpu-vs-memory bounded) and/or
> optimization goals (e.g. performance vs energy efficiency).
> 
> Here we are thus at defining and agree about a "generic and abstract"
> interface which allows user-space to feed input to kernel-space.
> To this purpose, I think platform specific details and/or internal
> implementation details are not "a bonus".

But this is unlike DL, which has well specified behaviour, and for
which, once I know my platform, I can compute a usable value. This
doesn't seem to gain meaning when I know the platform.

Or does it? If you say yes, then we need to be able to correlate to the
platform data that gives it meaning; which would be the OPP states. And
those come with capacity numbers.

> > > > But changing the clamp metric to something different than these values
> > > > is going to be pain.
> > > 
> > > Maybe I don't completely get what you mean here... are you saying that
> > > not using exact capacity values to define clamps is difficult ?
> > > If that's the case why? Can you elaborate with an example ?
> > 
> > I meant changing the unit around, 1/1024 is what we use throughout and
> > is what EAS is also exposing IIRC, so why make things complicated again
> > and use 1/100 (which is a shit fraction for computers).
> 
> Internally, in kernel space, we use 1024 units. It's just the
> user-space interface that speaks percentages but, as soon as a
> percentage value is used to configure a clamp, it's translated into a
> [0..1024] range value.
> 
> Is this not an acceptable compromise? We have a generic user-space
> interface and an effective/consistent kernel-space implementation.

I really don't see how changing the unit changes anything. Either you
want to relate to OPPs and those are exposed in 1/1024 unit capacity
through the EAS files, or you don't and then the knob has no meaning.

And how the heck are we supposed to assign a value for something that
has no meaning.

Again, with DL we ask for time, once I know the platform I can convert
my work into instructions and time and all makes sense.

With this, you seem reluctant to allow us to close that loop. Why is
that? Why not directly relate to the EAS OPPs, because that is directly
what they end up mapping to.

When I know the platform, I can convert my work into instructions and
obtain time, I can convert my clamp into an OPP and time*OPP gives an
energy consumption.

Why muddle things up and make it complicated?
Patrick Bellasi Sept. 26, 2018, 10:43 a.m. UTC | #16
On 25-Sep 17:49, Peter Zijlstra wrote:
> On Mon, Sep 24, 2018 at 04:14:00PM +0100, Patrick Bellasi wrote:

[...]

> Well, with DL there are well defined rules for what to put in and what
> to then expect.
> 
> For this thing, not so much I feel.

Maybe you'll prove me wrong, but isn't that already something that
happens for things like priorities?

When you set a prio for a CFS task you don't really know how much
more/less CPU time a CFS task will get, because it depends on other
tasks prios and tasks from higher prio classes.

The priority is thus a knob which defines an "intended behavior", a
preference, without being "legally binding" like in the case of DL
bandwidth... nevertheless we can still make a good use of prios.

[...]

> > In the clamping case, it's still the user-space that needs to figure
> > out an optimal clamp value, while considering your performance and
> > energy efficiency goals. This can be based on an automated profiling
> > process which comes up with "optimal" clamp values.
> > 
> > In the DL case, we are perfectly fine to have a running time
> > parameter, although we don't give any precise and deterministic
> > formula to quantify it. It's up to user-space to figure out the
> > required running time for a given app and platform.
> > It's also not unrealistic the case you need to close a control loop
> > with user-space to keep updating this requirement.
> > 
> > Why the same cannot hold for clamp values ?
> 
> The big difference is that if I request (and am granted) a runtime quota
> of a given amount, then that is what I'm guaranteed to get.

(I think not always... but that's a detail for a different discussion)

> Irrespective of the amount being sufficient for the work in question --
> which is where the platform dependency comes in.
> 
> But what am I getting when I set a specific clamp value? What does it
> mean to set the value to 80%?

Exactly, that's a good point: which "expectations" can we set on users
based on a given value ?

> So far the only real meaning is when combined with the EAS OPP data, we
> get a direct translation to OPPs. Irrespective of how the utilization is
> measured and the capacity:OPP mapping established, once that's set, we
> can map a clamp value to an OPP and get meaning.

If we strictly follow this line of reasoning then we should probably
set a frequency value directly...  but still we will not be saying
anything about "expectations".

Given the current patchset, right now we can't do much more than
_biasing_ an OPP selection.
It's really just a bias: we cannot grant anything to users based on
clamping. For example, if you request util_min=1024 you are not
actually guaranteed to run at the maximum capacity, especially in the
current patchset where we are not biasing task placement.

My fear then is: since we are not really granting/enforcing anything,
why should we base such an interface on an internal implementation
detail and/or platform-specific values?

Why is a slightly more abstract interface so much more confusing?

> But without that, it doesn't mean anything much at all. And that is my
> complaint. It seems to get presented as: 'random knob that might do
> something'. The range it takes as input doesn't change a thing.

Can the "range" not help in defining the perceived expectations?

If we use a percentage, IMHO it's more clear that's a _relative_ and
_not mandatory_ interface:

  relative: because, for example, a 50% capped task is expected
            (normally) to run slower than a 50% boosted task, although
            we don't know, or care to know, the exact frequency or CPU
            capacity

  not mandatory: because, for example, the 50% boosted task is not
                 guaranteed to always run at an OPP whose capacity is
                 no smaller than 512
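To make the "relative" point concrete, here is a minimal sketch of how
a percentage-based interface could map onto the kernel's internal
[0..1024] capacity scale. The helper name is made up for illustration
and is not part of the patchset:

```c
#include <assert.h>

/*
 * Hypothetical helper, not from the patchset: it only illustrates how
 * a [0..100] percentage interface could be translated onto the
 * internal [0..1024] capacity range (SCHED_CAPACITY_SCALE).
 */
#define UCLAMP_SCALE 1024

static int uclamp_from_pct(int pct)
{
	/* 50% maps to 512, i.e. half of the internal scale. */
	return (pct * UCLAMP_SCALE) / 100;
}
```

The ordering between two percentage values is preserved by the
translation, which is all the "relative" semantics above requires.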

> > > How are expecting people to determine what to put into the interface?

The same way people define priorities. Which means, with increasing
level of complexity:

 a) by guessing (or using the default, i.e. no clamps)

 b) by making an educated choice
   i.e. profiling your app to pick the value which makes the most
        sense given the platform and a set of optimization goals

 c) by controlling in a closed feedback loop
    i.e. by periodically measuring some app specific power/perf metric
    and tuning the clamp values to close a gap with respect to a given
    power/perf

I think that the complexity of both b) and c) is not really impacted
by the scale/range used... but it also does not benefit much in
"clarity" if we use capacity values instead of percentages.
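As a sketch of option c), the user-space side of such a closed loop
could look like the step function below. All names here are
illustrative assumptions: measure_metric() stands for whatever
app-specific power/perf probe is available, and the resulting clamp
value would be applied via sched_setattr():

```c
#include <assert.h>

/*
 * Illustrative only: one proportional step of a user-space control
 * loop that nudges a clamp value toward an app-specific target metric.
 * The step size and the [0..1024] range are assumptions, not part of
 * the patchset.
 */
static int clamp_step(int clamp, int metric, int target)
{
	if (metric < target)
		clamp += 25;	/* below target: boost a bit more */
	else if (metric > target)
		clamp -= 25;	/* above target: relax the boost */

	if (clamp < 0)
		clamp = 0;
	if (clamp > 1024)
		clamp = 1024;
	return clamp;
}
```

Run periodically, such a loop converges on a working clamp value
without the user ever needing to know the platform's exact OPP table.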

> > > Knee points, little capacity, those things make 'obvious' sense.

> > IMHO, they make "obvious" sense from a kernel-space perspective
> > exactly because they are implementation details and platform specific
> > concepts.
> > 
> > At the same time, I struggle to provide a definition of knee point and
> > I struggle to find a use-case where I can certainly say that a task
> > should be clamped exactly to the little capacity for example.
> > 
> > I'm more of the idea that the right clamp value is something a bit
> > fuzzy and possibly subject to change over time depending on the
> > specific application phase (e.g. cpu-vs-memory bounded) and/or
> > optimization goals (e.g. performance vs energy efficiency).

What do you think about this last sentence of mine?

> > Here we are thus at defining and agree about a "generic and abstract"
> > interface which allows user-space to feed input to kernel-space.
> > To this purpose, I think platform specific details and/or internal
> > implementation details are not "a bonus".
> 
> But unlike DL, which has well specified behaviour, and when I know my
> platform I can compute a usable value. This doesn't seem to gain meaning
> when I know the platform.
> 
> Or does it?

... or we don't really care about a platform specific meaning.

> If you say yes, then we need to be able to correlate to the platform
> data that gives it meaning; which would be the OPP states. And those
> come with capacity numbers.

The meaning, strictly speaking, should be just:

   I figured out (somehow) that if I set value X my app is now working
   as expected in terms of the acceptable power/performance
   optimization goal.

   I know that value X could require tuning over time depending on
   possible changes in platform status or task composition.

[...]

> > Internally, in kernel space, we use 1024 units. It's just the
> > user-space interface that speaks percentages but, as soon as a
> > percentage value is used to configure a clamp, it's translated into a
> > [0..1024] range value.
> > 
> > Is this not an acceptable compromise? We have a generic user-space
> > interface and an effective/consistent kernel-space implementation.
> 
> I really don't see how changing the unit changes anything. Either you
> want to relate to OPPs and those are exposed in 1/1024 unit capacity
> through the EAS files, or you don't and then the knob has no meaning.
> 
> And how the heck are we supposed to assign a value for something that
> has no meaning.
> 
> Again, with DL we ask for time, once I know the platform I can convert
> my work into instructions and time and all makes sense.
> 
> With this, you seem reluctant to allow us to close that loop. Why is
> that? Why not directly relate to the EAS OPPs, because that is directly
> what they end up mapping to.

I'm not really fighting against that: if people find the usage of
capacities more intuitive, we can certainly go for them.

My reluctance is really just about tossing out a possibly different
perspective, considering we are adding a user-space API which
certainly sets "expectations" for users.

Provided the concept is clear that it's a _non-mandatory_ and
_relative_ API, then any scale should be OK... I just personally
prefer a percentage for the reasons described above.

In both cases, whoever uses the interface can certainly close a
loop... especially an automated profiling or run-time control loop.

> When I know the platform, I can convert my work into instructions and
> obtain time, I can convert my clamp into an OPP and time*OPP gives an
> energy consumption.

What you describe makes sense, and it can definitely help a human user
to set a value. I'm just not convinced this will be the main usage of
such an interface... or that a single value could fit all the
run-time scenarios.

I think in real workload scenarios we will have so many tasks, some
competing and others cooperating, that it will not be possible to do
the exercise you describe above.

What we will do instead will probably be to close a profiling/control
loop from user-space and let the system figure out the optimal
value. In these cases, the platform details are just "informative"
and what we need is really just a "random knob which can do
something"... provided that "something" is a consistent mapping of
the knob values onto certain scheduler actions.

> Why muddle things up and make it complicated?

I'll not push this further, really: if you are strongly of the
opinion we should use capacity, I'll drop percentages in the next v5.

Otherwise, if you, like me, are still a bit unsure about what the
best API could be, we can hope for more feedback from other folks...
maybe I can ping someone in CC?

Cheers,
Patrick
Patrick Bellasi Sept. 26, 2018, 5:51 p.m. UTC | #17
Hi Peter,

On 21-Sep 11:13, Peter Zijlstra wrote:
> On Mon, Sep 17, 2018 at 01:27:23PM +0100, Patrick Bellasi wrote:

[...]

While going back to one of our previous conversation, I noted these
comments:

> > Thus, the capacity of little CPUs, or the exact capacity of an OPP, is
> > something we don't care to specify exactly, since:

[...]

> >  - certain platforms don't even expose OPPs, but just "performance
> >    levels"... which ultimately are a "percentage"
> 
> Well, the whole capacity thing is a 'percentage', it's just that 1024 is
> much nicer to work with (for computers) than 100 is (also it provides a
> wee bit more resolution).

Here above I was referring to the Intel's HWP support [1],
specifically at the:

  Ability of HWP to allow software to set an energy/performance
  preference hint in the IA32_HWP_REQUEST MSR.

which is detailed in section "14.4.4 Managing HWP".

The {Minimum,Maximum}_Performance registers represent what I consider
the best semantics for UtilClamp.

In the HWP case a 256-value range is used, and thus for UtilClamp as
well it would make more sense to use a 1024 scale, as suggested by
Peter, even just to have a bit more room, while still considering the
clamp values _as a percentage_, with just one decimal digit of
resolution.
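To illustrate the resolution argument, here is a tiny sketch (an
illustration only, not code from the patchset) that reads a [0..1024]
clamp value back as a percentage expressed in tenths of a percent:

```c
#include <assert.h>

/*
 * Illustration only: one unit of a [0..1024] range is ~0.098% of the
 * full scale, so reading the value as tenths of a percent shows the
 * scale carries roughly one decimal digit of percentage resolution.
 */
static int uclamp_to_pct10(int util)
{
	/* e.g. 512 -> 500, i.e. 50.0% */
	return (util * 1000) / 1024;
}
```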

I think the important bit here is the abstraction between what the
user can request and what the platform can provide.

If HWP does not allow the OS to pinpoint a specific frequency, why
should a user-space interface be designed to pinpoint a specific
capacity?

Can we find here a common ground around the idea that UtilClamp
values represent, on a 1024 range, the percentage of minimum/maximum
performance expected by a task?

It would be really nice to know what Rafael thinks about all that...

Cheers,
Patrick

[1] https://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-software-developer-vol-3b-part-2-manual.pdf
Quentin Perret Sept. 27, 2018, 10 a.m. UTC | #18
On Tuesday 25 Sep 2018 at 17:49:56 (+0200), Peter Zijlstra wrote:
> I really don't see how changing the unit changes anything. Either you
> want to relate to OPPs and those are exposed in 1/1024 unit capacity
> through the EAS files, or you don't and then the knob has no meaning.

FWIW, with the latest versions of the EAS series, we don't expose the
capacity of the various OPPs directly any more. You have 'frequency',
'power' and a more abstract thing called 'cost' (which is useful for
energy computations) in the sysfs files of the EM.

We decided to remove the 'capacity' file to simplify things quite a bit,
and because it wasn't helping much. I'm talking about this discussion w/
Dietmar on v4:

https://lore.kernel.org/lkml/20180717141955.GA4496@e108498-lin.cambridge.arm.com/

But it is also true that we could add back that 'capacity' file if it
makes sense for uclamp. It's just an additional show() function in
energy_model.c, so no problem on that side per se.

The only problem is if you want to use uclamp w/o EAS, I guess ...

Thanks,
Quentin
diff mbox series

Patch

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 9871e649ffef..481f8214ea9a 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -4561,6 +4561,9 @@ 
 			<port#>,<js1>,<js2>,<js3>,<js4>,<js5>,<js6>,<js7>
 			See also Documentation/input/devices/joystick-parport.rst
 
+	uclamp_user	[KNL] Enable task-specific utilization clamping tuning
+			also from tasks without CAP_SYS_ADMIN capability.
+
 	udbg-immortal	[PPC] When debugging early kernel crashes that
 			happen after console_init() and before a proper
 			console driver takes over, this boot options might
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 222397edb8a7..8341ce580a9a 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1510,14 +1510,29 @@  static inline int alloc_uclamp_sched_group(struct task_group *tg,
 static inline void free_uclamp_sched_group(struct task_group *tg) { }
 #endif /* CONFIG_UCLAMP_TASK_GROUP */
 
+static bool uclamp_user_allowed __read_mostly;
+static int __init uclamp_user_allow(char *str)
+{
+	uclamp_user_allowed = true;
+
+	return 0;
+}
+early_param("uclamp_user", uclamp_user_allow);
+
 static inline int __setscheduler_uclamp(struct task_struct *p,
-					const struct sched_attr *attr)
+					const struct sched_attr *attr,
+					bool user)
 {
 	int group_id[UCLAMP_CNT] = { UCLAMP_NOT_VALID };
 	int lower_bound, upper_bound;
 	struct uclamp_se *uc_se;
 	int result = 0;
 
+	if (!capable(CAP_SYS_ADMIN) &&
+	    user && !uclamp_user_allowed) {
+		return -EPERM;
+	}
+
 	mutex_lock(&uclamp_mutex);
 
 	/* Find a valid group_id for each required clamp value */
@@ -1702,7 +1717,8 @@  static inline int alloc_uclamp_sched_group(struct task_group *tg,
 	return 1;
 }
 static inline int __setscheduler_uclamp(struct task_struct *p,
-					const struct sched_attr *attr)
+					const struct sched_attr *attr,
+					bool user)
 {
 	return -EINVAL;
 }
@@ -5217,7 +5233,7 @@  static int __sched_setscheduler(struct task_struct *p,
 
 	/* Configure utilization clamps for the task */
 	if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP) {
-		retval = __setscheduler_uclamp(p, attr);
+		retval = __setscheduler_uclamp(p, attr, user);
 		if (retval)
 			return retval;
 	}