diff mbox series

[1/2] sched/uclamp: Add a new sysctl to control RT default boost value

Message ID 20200403123020.13897-1-qais.yousef@arm.com (mailing list archive)
State New, archived
Headers show
Series [1/2] sched/uclamp: Add a new sysctl to control RT default boost value | expand

Commit Message

Qais Yousef April 3, 2020, 12:30 p.m. UTC
RT tasks by default run at the highest capacity/performance level. When
uclamp is selected this default behavior is retained by enforcing the
requested uclamp.min (p->uclamp_req[UCLAMP_MIN]) of the RT tasks to be
uclamp_none(UCLAMP_MAX), which is SCHED_CAPACITY_SCALE; the maximum
value.

This is also referred to as 'the default boost value of RT tasks'.

See commit 1a00d999971c ("sched/uclamp: Set default clamps for RT tasks").

On battery powered devices, it is desired to control this default
(currently hardcoded) behavior at runtime to reduce energy consumed by
RT tasks.

For example, a mobile device manufacturer where big.LITTLE architecture
is dominant, the performance of the little cores varies across SoCs, and
on high end ones the big cores could be too power hungry.

Given the diversity of SoCs, the new knob allows manufactures to tune
the best performance/power for RT tasks for the particular hardware they
run on.

They could opt to further tune the value when the user selects
a different power saving mode or when the device is actively charging.

The runtime aspect of it further helps in creating a single kernel image
that can be run on multiple devices that require different tuning.

Keep in mind that a lot of RT tasks in the system are created by the
kernel. On Android for instance I can see over 50 RT tasks, only
a handful of which created by the Android framework.

To control the default behavior globally by system admins and device
integrators, introduce the new sysctl_sched_rt_default_uclamp_util_min
to change the default boost value of the RT tasks.

I anticipate this to be mostly in the form of modifying the init script
of a particular device.

Whenever the new default changes, it'd be applied lazily on the next
enqueue, assuming that it still uses the system default value and not a
user applied one.

Tested on Juno-r2 in combination with the RT capacity awareness [1].
By default an RT task will go to the highest capacity CPU and run at the
maximum frequency, which is particularly energy inefficient on high end
mobile devices because the biggest core[s] are 'huge' and power hungry.

With this patch the RT task can be controlled to run anywhere by
default, and doesn't cause the frequency to be maximum all the time.
Yet any task that really needs to be boosted can easily escape this
default behavior by modifying its requested uclamp.min value
(p->uclamp_req[UCLAMP_MIN]) via sched_setattr() syscall.

[1] 804d402fb6f6: ("sched/rt: Make RT capacity-aware")

Signed-off-by: Qais Yousef <qais.yousef@arm.com>
CC: Jonathan Corbet <corbet@lwn.net>
CC: Juri Lelli <juri.lelli@redhat.com>
CC: Vincent Guittot <vincent.guittot@linaro.org>
CC: Dietmar Eggemann <dietmar.eggemann@arm.com>
CC: Steven Rostedt <rostedt@goodmis.org>
CC: Ben Segall <bsegall@google.com>
CC: Mel Gorman <mgorman@suse.de>
CC: Luis Chamberlain <mcgrof@kernel.org>
CC: Kees Cook <keescook@chromium.org>
CC: Iurii Zaikin <yzaikin@google.com>
CC: Quentin Perret <qperret@google.com>
CC: Valentin Schneider <valentin.schneider@arm.com>
CC: Patrick Bellasi <patrick.bellasi@matbug.net>
CC: Pavan Kondeti <pkondeti@codeaurora.org>
CC: linux-doc@vger.kernel.org
CC: linux-kernel@vger.kernel.org
CC: linux-fsdevel@vger.kernel.org
---

Changes in v2:
	* Do the lazy update in uclamp_rq_inc() (Thanks Qeuntin)
	* Rename to: sysctl_sched_rt_default_uclamp_util_min (Quentin)
	* uclamp_rt_sync_default_util_min() now is a static function in core.c
	  with no prototype export in sched.h
	* Added docs in sysctl/kernel.rst (patch 2) (Quentin)

v1 can be found here:

	https://lore.kernel.org/lkml/20191220164838.31619-1-qais.yousef@arm.com/

Summary of v1 discussion:

Patrick has voiced a concern about the approach. AFAIU the suggestion proposed
by Patrick instead is to split sysctl_sched_uclamp_util_min into 2, one for RT
and another for CFS. And use the RT constraint to limit how much boost RT tasks
get globally.

If my understanding was correct, the proposed approach by Patrick doesn't work
for what we want to achieve here. Beside it breaks ABI.

The global per RT task doesn't work because if I want to disable boosting for
all RT tasks by default but still allow a handful of critical ones to be
boosted, the global constraint will render any request via sched_setattr()
syscall a NOP.

So IIUC, we'll still _always_ boost RT tasks to max, but introduce a new knob
to cap and restrict this boost. Which IMHO is a convoluted way to disable the
hardcoded max boost behavior.

This approach instead gives admins a direct control over the default boost
value for RT tasks, which is exactly what we want, without any level of
indirection. It converts a hardcoded value into a sysctl variable that
sysadmins can modify at runtime.


 include/linux/sched/sysctl.h |  1 +
 kernel/sched/core.c          | 66 +++++++++++++++++++++++++++++++++---
 kernel/sysctl.c              |  7 ++++
 3 files changed, 69 insertions(+), 5 deletions(-)

Comments

Patrick Bellasi April 14, 2020, 6:21 p.m. UTC | #1
Hi Qais!

On 03-Apr 13:30, Qais Yousef wrote:

[...]

> diff --git a/include/linux/sched/sysctl.h b/include/linux/sched/sysctl.h
> index d4f6215ee03f..91204480fabc 100644
> --- a/include/linux/sched/sysctl.h
> +++ b/include/linux/sched/sysctl.h
> @@ -59,6 +59,7 @@ extern int sysctl_sched_rt_runtime;
>  #ifdef CONFIG_UCLAMP_TASK
>  extern unsigned int sysctl_sched_uclamp_util_min;
>  extern unsigned int sysctl_sched_uclamp_util_max;
> +extern unsigned int sysctl_sched_rt_default_uclamp_util_min;

nit-pick: I would prefer to keep the same prefix of the already
exising knobs, i.e. sysctl_sched_uclamp_util_min_rt

The same change for consistency should be applied to all the following
symbols related to "uclamp_util_min_rt".

NOTE: I would not use "default" as I think that what we are doing is
exactly force setting a user_defined value for all RT tasks. More on
that later...

>  #endif
>  
>  #ifdef CONFIG_CFS_BANDWIDTH
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 1a9983da4408..a726b26a5056 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -797,6 +797,27 @@ unsigned int sysctl_sched_uclamp_util_min = SCHED_CAPACITY_SCALE;
>  /* Max allowed maximum utilization */
>  unsigned int sysctl_sched_uclamp_util_max = SCHED_CAPACITY_SCALE;
>  
> +/*
> + * By default RT tasks run at the maximum performance point/capacity of the
> + * system. Uclamp enforces this by always setting UCLAMP_MIN of RT tasks to
> + * SCHED_CAPACITY_SCALE.
> + *
> + * This knob allows admins to change the default behavior when uclamp is being
> + * used. In battery powered devices, particularly, running at the maximum
> + * capacity and frequency will increase energy consumption and shorten the
> + * battery life.
> + *
> + * This knob only affects the default value RT has when a new RT task is
> + * forked or has just changed policy to RT, given the user hasn't modified the
> + * uclamp.min value of the task via sched_setattr().
> + *
> + * This knob will not override the system default sched_util_clamp_min defined
> + * above.
> + *
> + * Any modification is applied lazily on the next RT task wakeup.
> + */
> +unsigned int sysctl_sched_rt_default_uclamp_util_min = SCHED_CAPACITY_SCALE;
> +
>  /* All clamps are required to be less or equal than these values */
>  static struct uclamp_se uclamp_default[UCLAMP_CNT];
>  
> @@ -924,6 +945,14 @@ uclamp_eff_get(struct task_struct *p, enum uclamp_id clamp_id)
>  	return uc_req;
>  }
>  
> +static void uclamp_rt_sync_default_util_min(struct task_struct *p)
> +{
> +	struct uclamp_se *uc_se = &p->uclamp_req[UCLAMP_MIN];

Don't we have to filter for RT tasks only here?

> +
> +	if (!uc_se->user_defined)
> +		uclamp_se_set(uc_se, sysctl_sched_rt_default_uclamp_util_min, false);

Here you are actually setting a user-requested value, why not marking
it as that, i.e. by using true for the last parameter?

Moreover, by keeping user_defined=false I think you are not getting
what you want for RT tasks running in a nested cgroup.

Let say a subgroup is still with the util_min=1024 inherited from the
system defaults, in uclamp_tg_restrict() we will still return the max
value and not what you requested for. Isn't it?

IOW, what about:

---8<---
static void uclamp_sync_util_min_rt(struct task_struct *p)
{
	struct uclamp_se *uc_se = &p->uclamp_req[UCLAMP_MIN];

  if (likely(uc_se->user_defined || !rt_task(p)))
    return;

  uclamp_se_set(uc_se, sysctl_sched_uclamp_util_min_rt, true);
}
---8<---


> +}
> +
>  unsigned long uclamp_eff_value(struct task_struct *p, enum uclamp_id clamp_id)
>  {
>  	struct uclamp_se uc_eff;
> @@ -1030,6 +1059,12 @@ static inline void uclamp_rq_inc(struct rq *rq, struct task_struct *p)
>  	if (unlikely(!p->sched_class->uclamp_enabled))
>  		return;
>  
> +	/*
> +	 * When sysctl_sched_rt_default_uclamp_util_min value is changed by the
> +	 * user, we apply any new value on the next wakeup, which is here.
> +	 */
> +	uclamp_rt_sync_default_util_min(p);
> +
>  	for_each_clamp_id(clamp_id)
>  		uclamp_rq_inc_id(rq, p, clamp_id);
>  
> @@ -1121,12 +1156,13 @@ int sysctl_sched_uclamp_handler(struct ctl_table *table, int write,
>  				loff_t *ppos)
>  {
>  	bool update_root_tg = false;
> -	int old_min, old_max;
> +	int old_min, old_max, old_rt_min;
>  	int result;
>  
>  	mutex_lock(&uclamp_mutex);
>  	old_min = sysctl_sched_uclamp_util_min;
>  	old_max = sysctl_sched_uclamp_util_max;
> +	old_rt_min = sysctl_sched_rt_default_uclamp_util_min;

Perpahs it's just my OCD but, is not "old_min_rt" reading better?

>  
>  	result = proc_dointvec(table, write, buffer, lenp, ppos);
>  	if (result)
> @@ -1134,12 +1170,23 @@ int sysctl_sched_uclamp_handler(struct ctl_table *table, int write,
>  	if (!write)
>  		goto done;
>  
> +	/*
> +	 * The new value will be applied to all RT tasks the next time they
> +	 * wakeup, assuming the task is using the system default and not a user
> +	 * specified value. In the latter we shall leave the value as the user
> +	 * requested.
> +	 */

Should not this comment go before the next block?

>  	if (sysctl_sched_uclamp_util_min > sysctl_sched_uclamp_util_max ||
>  	    sysctl_sched_uclamp_util_max > SCHED_CAPACITY_SCALE) {
>  		result = -EINVAL;
>  		goto undo;
>  	}
>  
> +	if (sysctl_sched_rt_default_uclamp_util_min > SCHED_CAPACITY_SCALE) {
> +		result = -EINVAL;
> +		goto undo;
> +	}
> +
>  	if (old_min != sysctl_sched_uclamp_util_min) {
>  		uclamp_se_set(&uclamp_default[UCLAMP_MIN],
>  			      sysctl_sched_uclamp_util_min, false);
> @@ -1165,6 +1212,7 @@ int sysctl_sched_uclamp_handler(struct ctl_table *table, int write,
>  undo:
>  	sysctl_sched_uclamp_util_min = old_min;
>  	sysctl_sched_uclamp_util_max = old_max;
> +	sysctl_sched_rt_default_uclamp_util_min = old_rt_min;
>  done:
>  	mutex_unlock(&uclamp_mutex);
>  
> @@ -1207,9 +1255,13 @@ static void __setscheduler_uclamp(struct task_struct *p,
>  		if (uc_se->user_defined)
>  			continue;
>  
> -		/* By default, RT tasks always get 100% boost */
> +		/*
> +		 * By default, RT tasks always get 100% boost, which the admins
> +		 * are allowed to change via
> +		 * sysctl_sched_rt_default_uclamp_util_min knob.
> +		 */
>  		if (unlikely(rt_task(p) && clamp_id == UCLAMP_MIN))
> -			clamp_value = uclamp_none(UCLAMP_MAX);
> +			clamp_value = sysctl_sched_rt_default_uclamp_util_min;
>
>  		uclamp_se_set(uc_se, clamp_value, false);
>  	}
> @@ -1241,9 +1293,13 @@ static void uclamp_fork(struct task_struct *p)
>  	for_each_clamp_id(clamp_id) {
>  		unsigned int clamp_value = uclamp_none(clamp_id);
>  
> -		/* By default, RT tasks always get 100% boost */
> +		/*
> +		 * By default, RT tasks always get 100% boost, which the admins
> +		 * are allowed to change via
> +		 * sysctl_sched_rt_default_uclamp_util_min knob.
> +		 */
>  		if (unlikely(rt_task(p) && clamp_id == UCLAMP_MIN))
> -			clamp_value = uclamp_none(UCLAMP_MAX);
> +			clamp_value = sysctl_sched_rt_default_uclamp_util_min;
>  

This is not required, look at this Quentin's patch:

   Message-ID: <20200414161320.251897-1-qperret@google.com>
   https://lore.kernel.org/lkml/20200414161320.251897-1-qperret@google.com/

>  		uclamp_se_set(&p->uclamp_req[clamp_id], clamp_value, false);
>  	}
> diff --git a/kernel/sysctl.c b/kernel/sysctl.c
> index ad5b88a53c5a..0272ae8c6147 100644
> --- a/kernel/sysctl.c
> +++ b/kernel/sysctl.c
> @@ -465,6 +465,13 @@ static struct ctl_table kern_table[] = {
>  		.mode		= 0644,
>  		.proc_handler	= sysctl_sched_uclamp_handler,
>  	},
> +	{
> +		.procname	= "sched_rt_default_util_clamp_min",
> +		.data		= &sysctl_sched_rt_default_uclamp_util_min,
> +		.maxlen		= sizeof(unsigned int),
> +		.mode		= 0644,
> +		.proc_handler	= sysctl_sched_uclamp_handler,
> +	},
>  #endif
>  #ifdef CONFIG_SCHED_AUTOGROUP
>  	{

Best,
Patrick
Patrick Bellasi April 15, 2020, 7:46 a.m. UTC | #2
On 14-Apr 20:21, Patrick Bellasi wrote:
> Hi Qais!

Hello againa!

> On 03-Apr 13:30, Qais Yousef wrote:
> 
> [...]
> 
> > diff --git a/include/linux/sched/sysctl.h b/include/linux/sched/sysctl.h
> > index d4f6215ee03f..91204480fabc 100644
> > --- a/include/linux/sched/sysctl.h
> > +++ b/include/linux/sched/sysctl.h
> > @@ -59,6 +59,7 @@ extern int sysctl_sched_rt_runtime;
> >  #ifdef CONFIG_UCLAMP_TASK
> >  extern unsigned int sysctl_sched_uclamp_util_min;
> >  extern unsigned int sysctl_sched_uclamp_util_max;
> > +extern unsigned int sysctl_sched_rt_default_uclamp_util_min;
> 
> nit-pick: I would prefer to keep the same prefix of the already
> exising knobs, i.e. sysctl_sched_uclamp_util_min_rt
> 
> The same change for consistency should be applied to all the following
> symbols related to "uclamp_util_min_rt".
> 
> NOTE: I would not use "default" as I think that what we are doing is
> exactly force setting a user_defined value for all RT tasks. More on
> that later...

Had a second tought on that...

> 
> >  #endif
> >  
> >  #ifdef CONFIG_CFS_BANDWIDTH
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index 1a9983da4408..a726b26a5056 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -797,6 +797,27 @@ unsigned int sysctl_sched_uclamp_util_min = SCHED_CAPACITY_SCALE;
> >  /* Max allowed maximum utilization */
> >  unsigned int sysctl_sched_uclamp_util_max = SCHED_CAPACITY_SCALE;
> >  
> > +/*
> > + * By default RT tasks run at the maximum performance point/capacity of the
> > + * system. Uclamp enforces this by always setting UCLAMP_MIN of RT tasks to
> > + * SCHED_CAPACITY_SCALE.
> > + *
> > + * This knob allows admins to change the default behavior when uclamp is being
> > + * used. In battery powered devices, particularly, running at the maximum
> > + * capacity and frequency will increase energy consumption and shorten the
> > + * battery life.
> > + *
> > + * This knob only affects the default value RT has when a new RT task is
> > + * forked or has just changed policy to RT, given the user hasn't modified the
> > + * uclamp.min value of the task via sched_setattr().
> > + *
> > + * This knob will not override the system default sched_util_clamp_min defined
> > + * above.
> > + *
> > + * Any modification is applied lazily on the next RT task wakeup.
> > + */
> > +unsigned int sysctl_sched_rt_default_uclamp_util_min = SCHED_CAPACITY_SCALE;
> > +
> >  /* All clamps are required to be less or equal than these values */
> >  static struct uclamp_se uclamp_default[UCLAMP_CNT];
> >  
> > @@ -924,6 +945,14 @@ uclamp_eff_get(struct task_struct *p, enum uclamp_id clamp_id)
> >  	return uc_req;
> >  }
> >  
> > +static void uclamp_rt_sync_default_util_min(struct task_struct *p)
> > +{
> > +	struct uclamp_se *uc_se = &p->uclamp_req[UCLAMP_MIN];
> 
> Don't we have to filter for RT tasks only here?

I think this is still a valid point.

> > +
> > +	if (!uc_se->user_defined)
> > +		uclamp_se_set(uc_se, sysctl_sched_rt_default_uclamp_util_min, false);
> 
> Here you are actually setting a user-requested value, why not marking
> it as that, i.e. by using true for the last parameter?

I think you don't want to set user_defined to ensure we keep updating
the value every time the task is enqueued, in case the "default"
should be updated at run-time.

> Moreover, by keeping user_defined=false I think you are not getting
> what you want for RT tasks running in a nested cgroup.
> 
> Let say a subgroup is still with the util_min=1024 inherited from the
> system defaults, in uclamp_tg_restrict() we will still return the max
> value and not what you requested for. Isn't it?

This is also not completely true since perhaps you assume that if an
RT task is running in a nested group with a non tuned uclamp_max then
that's probably what we want.

There is still a small concern due to the fact we don't distinguish
CFS and RT tasks when it comes to cgroup clamp values, which
potentially could still generate the same issue. Let say for example
you wanna allow CFS tasks to be boosted to max (util_min=1024) but
still want to run RT tasks only at lower OPPs.
Not sure if that could be a use case tho.
 
> IOW, what about:
> 
> ---8<---
> static void uclamp_sync_util_min_rt(struct task_struct *p)
> {
> 	struct uclamp_se *uc_se = &p->uclamp_req[UCLAMP_MIN];
> 
>   if (likely(uc_se->user_defined || !rt_task(p)))
>     return;
> 
>   uclamp_se_set(uc_se, sysctl_sched_uclamp_util_min_rt, true);
                                                          ^^^^
                     This should remain false as in your patch.
> }
> ---8<---

Still, I was thinking that perhaps it would be better to massage the
code above into the generation of the effective value, in uclamp_eff_get().

Since you wanna (possibly) update the value at each enqueue time,
that's what conceptually is represented by the "effective clamp
value": a value that is computed by definition at enqueue time by
aggregating all the requests and constraints.

Poking with the effective value instead of the requested value will
fix also the ambiguity above, where we set a "requested values" with
user-defined=false.

> > +}
> > +
> >  unsigned long uclamp_eff_value(struct task_struct *p, enum uclamp_id clamp_id)
> >  {
> >  	struct uclamp_se uc_eff;
> > @@ -1030,6 +1059,12 @@ static inline void uclamp_rq_inc(struct rq *rq, struct task_struct *p)
> >  	if (unlikely(!p->sched_class->uclamp_enabled))
> >  		return;
> >  
> > +	/*
> > +	 * When sysctl_sched_rt_default_uclamp_util_min value is changed by the
> > +	 * user, we apply any new value on the next wakeup, which is here.
> > +	 */
> > +	uclamp_rt_sync_default_util_min(p);
> > +
> >  	for_each_clamp_id(clamp_id)
> >  		uclamp_rq_inc_id(rq, p, clamp_id);
> >  
> > @@ -1121,12 +1156,13 @@ int sysctl_sched_uclamp_handler(struct ctl_table *table, int write,
> >  				loff_t *ppos)
> >  {
> >  	bool update_root_tg = false;
> > -	int old_min, old_max;
> > +	int old_min, old_max, old_rt_min;
> >  	int result;
> >  
> >  	mutex_lock(&uclamp_mutex);
> >  	old_min = sysctl_sched_uclamp_util_min;
> >  	old_max = sysctl_sched_uclamp_util_max;
> > +	old_rt_min = sysctl_sched_rt_default_uclamp_util_min;
> 
> Perpahs it's just my OCD but, is not "old_min_rt" reading better?
> 
> >  
> >  	result = proc_dointvec(table, write, buffer, lenp, ppos);
> >  	if (result)
> > @@ -1134,12 +1170,23 @@ int sysctl_sched_uclamp_handler(struct ctl_table *table, int write,
> >  	if (!write)
> >  		goto done;
> >  
> > +	/*
> > +	 * The new value will be applied to all RT tasks the next time they
> > +	 * wakeup, assuming the task is using the system default and not a user
> > +	 * specified value. In the latter we shall leave the value as the user
> > +	 * requested.
> > +	 */
> 
> Should not this comment go before the next block?
> 
> >  	if (sysctl_sched_uclamp_util_min > sysctl_sched_uclamp_util_max ||
> >  	    sysctl_sched_uclamp_util_max > SCHED_CAPACITY_SCALE) {
> >  		result = -EINVAL;
> >  		goto undo;
> >  	}
> >  
> > +	if (sysctl_sched_rt_default_uclamp_util_min > SCHED_CAPACITY_SCALE) {
> > +		result = -EINVAL;
> > +		goto undo;
> > +	}
> > +
> >  	if (old_min != sysctl_sched_uclamp_util_min) {
> >  		uclamp_se_set(&uclamp_default[UCLAMP_MIN],
> >  			      sysctl_sched_uclamp_util_min, false);
> > @@ -1165,6 +1212,7 @@ int sysctl_sched_uclamp_handler(struct ctl_table *table, int write,
> >  undo:
> >  	sysctl_sched_uclamp_util_min = old_min;
> >  	sysctl_sched_uclamp_util_max = old_max;
> > +	sysctl_sched_rt_default_uclamp_util_min = old_rt_min;
> >  done:
> >  	mutex_unlock(&uclamp_mutex);
> >  
> > @@ -1207,9 +1255,13 @@ static void __setscheduler_uclamp(struct task_struct *p,
> >  		if (uc_se->user_defined)
> >  			continue;
> >  
> > -		/* By default, RT tasks always get 100% boost */
> > +		/*
> > +		 * By default, RT tasks always get 100% boost, which the admins
> > +		 * are allowed to change via
> > +		 * sysctl_sched_rt_default_uclamp_util_min knob.
> > +		 */
> >  		if (unlikely(rt_task(p) && clamp_id == UCLAMP_MIN))
> > -			clamp_value = uclamp_none(UCLAMP_MAX);
> > +			clamp_value = sysctl_sched_rt_default_uclamp_util_min;
> >
> >  		uclamp_se_set(uc_se, clamp_value, false);
> >  	}
> > @@ -1241,9 +1293,13 @@ static void uclamp_fork(struct task_struct *p)
> >  	for_each_clamp_id(clamp_id) {
> >  		unsigned int clamp_value = uclamp_none(clamp_id);
> >  
> > -		/* By default, RT tasks always get 100% boost */
> > +		/*
> > +		 * By default, RT tasks always get 100% boost, which the admins
> > +		 * are allowed to change via
> > +		 * sysctl_sched_rt_default_uclamp_util_min knob.
> > +		 */
> >  		if (unlikely(rt_task(p) && clamp_id == UCLAMP_MIN))
> > -			clamp_value = uclamp_none(UCLAMP_MAX);
> > +			clamp_value = sysctl_sched_rt_default_uclamp_util_min;
> >  
> 
> This is not required, look at this Quentin's patch:
> 
>    Message-ID: <20200414161320.251897-1-qperret@google.com>
>    https://lore.kernel.org/lkml/20200414161320.251897-1-qperret@google.com/
> 
> >  		uclamp_se_set(&p->uclamp_req[clamp_id], clamp_value, false);
> >  	}
> > diff --git a/kernel/sysctl.c b/kernel/sysctl.c
> > index ad5b88a53c5a..0272ae8c6147 100644
> > --- a/kernel/sysctl.c
> > +++ b/kernel/sysctl.c
> > @@ -465,6 +465,13 @@ static struct ctl_table kern_table[] = {
> >  		.mode		= 0644,
> >  		.proc_handler	= sysctl_sched_uclamp_handler,
> >  	},
> > +	{
> > +		.procname	= "sched_rt_default_util_clamp_min",
> > +		.data		= &sysctl_sched_rt_default_uclamp_util_min,
> > +		.maxlen		= sizeof(unsigned int),
> > +		.mode		= 0644,
> > +		.proc_handler	= sysctl_sched_uclamp_handler,
> > +	},
> >  #endif
> >  #ifdef CONFIG_SCHED_AUTOGROUP
> >  	{
Quentin Perret April 15, 2020, 10:11 a.m. UTC | #3
Hi Qais,

On Friday 03 Apr 2020 at 13:30:19 (+0100), Qais Yousef wrote:
<snip>
> +	/*
> +	 * The new value will be applied to all RT tasks the next time they
> +	 * wakeup, assuming the task is using the system default and not a user
> +	 * specified value. In the latter we shall leave the value as the user
> +	 * requested.
> +	 */
>  	if (sysctl_sched_uclamp_util_min > sysctl_sched_uclamp_util_max ||
>  	    sysctl_sched_uclamp_util_max > SCHED_CAPACITY_SCALE) {
>  		result = -EINVAL;
>  		goto undo;
>  	}
>  
> +	if (sysctl_sched_rt_default_uclamp_util_min > SCHED_CAPACITY_SCALE) {
> +		result = -EINVAL;
> +		goto undo;
> +	}

Hmm, checking:

	if (sysctl_sched_rt_default_uclamp_util_min > sysctl_sched_uclamp_util_min)

would probably make sense too, but then that would make writing in
sysctl_sched_uclamp_util_min cumbersome for sysadmins as they'd need to
lower the rt default first. Is that the reason for checking against
SCHED_CAPACITY_SCALE? That might deserve a comment or something.

<snip>
> @@ -1241,9 +1293,13 @@ static void uclamp_fork(struct task_struct *p)
>  	for_each_clamp_id(clamp_id) {
>  		unsigned int clamp_value = uclamp_none(clamp_id);
>  
> -		/* By default, RT tasks always get 100% boost */
> +		/*
> +		 * By default, RT tasks always get 100% boost, which the admins
> +		 * are allowed to change via
> +		 * sysctl_sched_rt_default_uclamp_util_min knob.
> +		 */
>  		if (unlikely(rt_task(p) && clamp_id == UCLAMP_MIN))
> -			clamp_value = uclamp_none(UCLAMP_MAX);
> +			clamp_value = sysctl_sched_rt_default_uclamp_util_min;
>  
>  		uclamp_se_set(&p->uclamp_req[clamp_id], clamp_value, false);
>  	}

And that, as per 20200414161320.251897-1-qperret@google.com, should not
be there :)

Otherwise the patch pretty looks good to me!

Thanks,
Quentin
Dietmar Eggemann April 20, 2020, 8:24 a.m. UTC | #4
On 14/04/2020 20:21, Patrick Bellasi wrote:
> Hi Qais!
> 
> On 03-Apr 13:30, Qais Yousef wrote:

[...]

>> @@ -924,6 +945,14 @@ uclamp_eff_get(struct task_struct *p, enum uclamp_id clamp_id)
>>  	return uc_req;
>>  }
>>  
>> +static void uclamp_rt_sync_default_util_min(struct task_struct *p)
>> +{
>> +	struct uclamp_se *uc_se = &p->uclamp_req[UCLAMP_MIN];
> 
> Don't we have to filter for RT tasks only here?

I think so. It's probably because it got moved from rt.c to core.c.

[...]

>> diff --git a/kernel/sysctl.c b/kernel/sysctl.c
>> index ad5b88a53c5a..0272ae8c6147 100644
>> --- a/kernel/sysctl.c
>> +++ b/kernel/sysctl.c
>> @@ -465,6 +465,13 @@ static struct ctl_table kern_table[] = {
>>  		.mode		= 0644,
>>  		.proc_handler	= sysctl_sched_uclamp_handler,
>>  	},
>> +	{
>> +		.procname	= "sched_rt_default_util_clamp_min",

root@h960:~# find / -name "*util_clamp*"
/proc/sys/kernel/sched_rt_default_util_clamp_min
/proc/sys/kernel/sched_util_clamp_max
/proc/sys/kernel/sched_util_clamp_min

IMHO, keeping the common 'sched_util_clamp_' would be helpful here, e.g.

/proc/sys/kernel/sched_util_clamp_rt_default_min

[...]
Dietmar Eggemann April 20, 2020, 8:29 a.m. UTC | #5
On 03.04.20 14:30, Qais Yousef wrote:

[...]

> @@ -924,6 +945,14 @@ uclamp_eff_get(struct task_struct *p, enum uclamp_id clamp_id)
>  	return uc_req;
>  }
>  
> +static void uclamp_rt_sync_default_util_min(struct task_struct *p)
> +{
> +	struct uclamp_se *uc_se = &p->uclamp_req[UCLAMP_MIN];
> +
> +	if (!uc_se->user_defined)
> +		uclamp_se_set(uc_se, sysctl_sched_rt_default_uclamp_util_min, false);
> +}
> +
>  unsigned long uclamp_eff_value(struct task_struct *p, enum uclamp_id clamp_id)
>  {
>  	struct uclamp_se uc_eff;
> @@ -1030,6 +1059,12 @@ static inline void uclamp_rq_inc(struct rq *rq, struct task_struct *p)
>  	if (unlikely(!p->sched_class->uclamp_enabled))
>  		return;
>  
> +	/*
> +	 * When sysctl_sched_rt_default_uclamp_util_min value is changed by the
> +	 * user, we apply any new value on the next wakeup, which is here.
> +	 */
> +	uclamp_rt_sync_default_util_min(p);
> +

Does this have to be an extra function? Can we not reuse
uclamp_tg_restrict() by slightly rename it to uclamp_restrict()?

This function will then deal with enforcing restrictions, whether system
and taskgroup hierarchy related or default value (latter only for rt-min
right now since the others are fixed) related.

uclamp_eff_get() -> uclamp_restrict() is called from:

  'enqueue_task(), uclamp_update_active() -> uclamp_rq_inc() -> uclamp_rq_inc_id()' and

  'task_fits_capacity() -> clamp_task_util(), rt_task_fits_capacity() -> uclamp_eff_value()' and

  'schedutil_cpu_util(), find_energy_efficient_cpu() -> uclamp_rq_util_with() -> uclamp_eff_value()'

so there would be more check-points than the one in 'enqueue_task() -> uclamp_rq_inc()' now.

Only lightly tested:

---8<---

From: Dietmar Eggemann <dietmar.eggemann@arm.com>
Date: Sun, 19 Apr 2020 01:20:17 +0200
Subject: [PATCH] sched/core: uclamp: Move uclamp_rt_sync_default_util_min()
 into uclamp_tg_restrict()

Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
---
 kernel/sched/core.c | 34 +++++++++++++++-------------------
 1 file changed, 15 insertions(+), 19 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 8f4e0d5c7daf..6802113d6d4b 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -899,12 +899,22 @@ unsigned int uclamp_rq_max_value(struct rq *rq, enum uclamp_id clamp_id,
 }
 
 static inline struct uclamp_se
-uclamp_tg_restrict(struct task_struct *p, enum uclamp_id clamp_id)
+uclamp_restrict(struct task_struct *p, enum uclamp_id clamp_id)
 {
-	struct uclamp_se uc_req = p->uclamp_req[clamp_id];
-#ifdef CONFIG_UCLAMP_TASK_GROUP
-	struct uclamp_se uc_max;
+	struct uclamp_se uc_req, __maybe_unused uc_max;
+
+	if (unlikely(rt_task(p)) && clamp_id == UCLAMP_MIN &&
+	    !uc_req.user_defined) {
+		struct uclamp_se *uc_se = &p->uclamp_req[UCLAMP_MIN];
+		int rt_min = sysctl_sched_rt_default_uclamp_util_min;
+
+		if (uc_se->value != rt_min)
+			uclamp_se_set(uc_se, rt_min, false);
+	}
 
+	uc_req = p->uclamp_req[clamp_id];
+
+#ifdef CONFIG_UCLAMP_TASK_GROUP
 	/*
 	 * Tasks in autogroups or root task group will be
 	 * restricted by system defaults.
@@ -933,7 +943,7 @@ uclamp_tg_restrict(struct task_struct *p, enum uclamp_id clamp_id)
 static inline struct uclamp_se
 uclamp_eff_get(struct task_struct *p, enum uclamp_id clamp_id)
 {
-	struct uclamp_se uc_req = uclamp_tg_restrict(p, clamp_id);
+	struct uclamp_se uc_req = uclamp_restrict(p, clamp_id);
 	struct uclamp_se uc_max = uclamp_default[clamp_id];
 
 	/* System default restrictions always apply */
@@ -943,14 +953,6 @@ uclamp_eff_get(struct task_struct *p, enum uclamp_id clamp_id)
 	return uc_req;
 }
 
-static void uclamp_rt_sync_default_util_min(struct task_struct *p)
-{
-	struct uclamp_se *uc_se = &p->uclamp_req[UCLAMP_MIN];
-
-	if (!uc_se->user_defined)
-		uclamp_se_set(uc_se, sysctl_sched_rt_default_uclamp_util_min, false);
-}
-
 unsigned long uclamp_eff_value(struct task_struct *p, enum uclamp_id clamp_id)
 {
 	struct uclamp_se uc_eff;
@@ -1057,12 +1059,6 @@ static inline void uclamp_rq_inc(struct rq *rq, struct task_struct *p)
 	if (unlikely(!p->sched_class->uclamp_enabled))
 		return;
 
-	/*
-	 * When sysctl_sched_rt_default_uclamp_util_min value is changed by the
-	 * user, we apply any new value on the next wakeup, which is here.
-	 */
-	uclamp_rt_sync_default_util_min(p);
-
 	for_each_clamp_id(clamp_id)
 		uclamp_rq_inc_id(rq, p, clamp_id);
Qais Yousef April 20, 2020, 2:50 p.m. UTC | #6
Hi Patrick!

On 04/14/20 20:21, Patrick Bellasi wrote:
> Hi Qais!
> 
> On 03-Apr 13:30, Qais Yousef wrote:
> 
> [...]
> 
> > diff --git a/include/linux/sched/sysctl.h b/include/linux/sched/sysctl.h
> > index d4f6215ee03f..91204480fabc 100644
> > --- a/include/linux/sched/sysctl.h
> > +++ b/include/linux/sched/sysctl.h
> > @@ -59,6 +59,7 @@ extern int sysctl_sched_rt_runtime;
> >  #ifdef CONFIG_UCLAMP_TASK
> >  extern unsigned int sysctl_sched_uclamp_util_min;
> >  extern unsigned int sysctl_sched_uclamp_util_max;
> > +extern unsigned int sysctl_sched_rt_default_uclamp_util_min;
> 
> nit-pick: I would prefer to keep the same prefix of the already
> exising knobs, i.e. sysctl_sched_uclamp_util_min_rt
> 
> The same change for consistency should be applied to all the following
> symbols related to "uclamp_util_min_rt".

All rt sysctl knobs are prefixed with 'sched_rt', so I used this for
consistency.

> 
> NOTE: I would not use "default" as I think that what we are doing is

The 'default' was suggested by Quentin as it felt more descriptive to him. I
took his 'user' point of view.

I don't mind or care about the name ultimately. This symbol is internal and not
exported to userspace.

I just think sticking to sched_rt for consistency is better. What follows I'm
happy to put whatever reviewers agree on. I think Dietmar has another
suggestion for this too.

> exactly force setting a user_defined value for all RT tasks. More on
> that later...

We are NOT force setting the user_defined value. The whole point is to have
a variable default behavior while still allow the user to override this
default. Exactly like it currently works, except that the __hardcoded__ value
is no longer hardcoded and can be modifyied by sys admins at runtime.

> 
> >  #endif
> >  
> >  #ifdef CONFIG_CFS_BANDWIDTH
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index 1a9983da4408..a726b26a5056 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -797,6 +797,27 @@ unsigned int sysctl_sched_uclamp_util_min = SCHED_CAPACITY_SCALE;
> >  /* Max allowed maximum utilization */
> >  unsigned int sysctl_sched_uclamp_util_max = SCHED_CAPACITY_SCALE;
> >  
> > +/*
> > + * By default RT tasks run at the maximum performance point/capacity of the
> > + * system. Uclamp enforces this by always setting UCLAMP_MIN of RT tasks to
> > + * SCHED_CAPACITY_SCALE.
> > + *
> > + * This knob allows admins to change the default behavior when uclamp is being
> > + * used. In battery powered devices, particularly, running at the maximum
> > + * capacity and frequency will increase energy consumption and shorten the
> > + * battery life.
> > + *
> > + * This knob only affects the default value RT has when a new RT task is
> > + * forked or has just changed policy to RT, given the user hasn't modified the
> > + * uclamp.min value of the task via sched_setattr().
> > + *
> > + * This knob will not override the system default sched_util_clamp_min defined
> > + * above.
> > + *
> > + * Any modification is applied lazily on the next RT task wakeup.
> > + */
> > +unsigned int sysctl_sched_rt_default_uclamp_util_min = SCHED_CAPACITY_SCALE;
> > +
> >  /* All clamps are required to be less or equal than these values */
> >  static struct uclamp_se uclamp_default[UCLAMP_CNT];
> >  
> > @@ -924,6 +945,14 @@ uclamp_eff_get(struct task_struct *p, enum uclamp_id clamp_id)
> >  	return uc_req;
> >  }
> >  
> > +static void uclamp_rt_sync_default_util_min(struct task_struct *p)
> > +{
> > +	struct uclamp_se *uc_se = &p->uclamp_req[UCLAMP_MIN];
> 
> Don't we have to filter for RT tasks only here?

Indeed!

> 
> > +
> > +	if (!uc_se->user_defined)
> > +		uclamp_se_set(uc_se, sysctl_sched_rt_default_uclamp_util_min, false);
> 
> Here you are actually setting a user-requested value, why not marking
> it as that, i.e. by using true for the last parameter?

I will have to bounce the question back. The original code passed false.
So why didn't you pass true then? The fact that the value is no longer
hardcoded with this patch doesn't mean the purpose of this code has changed.

> 
> Moreover, by keeping user_defined=false I think you are not getting
> what you want for RT tasks running in a nested cgroup.
> 
> Let say a subgroup is still with the util_min=1024 inherited from the
> system defaults, in uclamp_tg_restrict() we will still return the max
> value and not what you requested for. Isn't it?

With the current code (without my patch) where RT tasks util_min is 1024 by
default and user_defined=false. How is it behaving then? My patch should
retain this behavior. I can't see how it breaks it.. :-/

> 
> IOW, what about:
> 
> ---8<---
> static void uclamp_sync_util_min_rt(struct task_struct *p)
> {
> 	struct uclamp_se *uc_se = &p->uclamp_req[UCLAMP_MIN];
> 
>   if (likely(uc_se->user_defined || !rt_task(p)))
>     return;
> 
>   uclamp_se_set(uc_se, sysctl_sched_uclamp_util_min_rt, true);
> }
> ---8<---

See above.

> 
> 
> > +}
> > +
> >  unsigned long uclamp_eff_value(struct task_struct *p, enum uclamp_id clamp_id)
> >  {
> >  	struct uclamp_se uc_eff;
> > @@ -1030,6 +1059,12 @@ static inline void uclamp_rq_inc(struct rq *rq, struct task_struct *p)
> >  	if (unlikely(!p->sched_class->uclamp_enabled))
> >  		return;
> >  
> > +	/*
> > +	 * When sysctl_sched_rt_default_uclamp_util_min value is changed by the
> > +	 * user, we apply any new value on the next wakeup, which is here.
> > +	 */
> > +	uclamp_rt_sync_default_util_min(p);
> > +
> >  	for_each_clamp_id(clamp_id)
> >  		uclamp_rq_inc_id(rq, p, clamp_id);
> >  
> > @@ -1121,12 +1156,13 @@ int sysctl_sched_uclamp_handler(struct ctl_table *table, int write,
> >  				loff_t *ppos)
> >  {
> >  	bool update_root_tg = false;
> > -	int old_min, old_max;
> > +	int old_min, old_max, old_rt_min;
> >  	int result;
> >  
> >  	mutex_lock(&uclamp_mutex);
> >  	old_min = sysctl_sched_uclamp_util_min;
> >  	old_max = sysctl_sched_uclamp_util_max;
> > +	old_rt_min = sysctl_sched_rt_default_uclamp_util_min;
> 
> Perpahs it's just my OCD but, is not "old_min_rt" reading better?

:D

> 
> >  
> >  	result = proc_dointvec(table, write, buffer, lenp, ppos);
> >  	if (result)
> > @@ -1134,12 +1170,23 @@ int sysctl_sched_uclamp_handler(struct ctl_table *table, int write,
> >  	if (!write)
> >  		goto done;
> >  
> > +	/*
> > +	 * The new value will be applied to all RT tasks the next time they
> > +	 * wakeup, assuming the task is using the system default and not a user
> > +	 * specified value. In the latter we shall leave the value as the user
> > +	 * requested.
> > +	 */
> 
> Should not this comment go before the next block?

+1

> 
> >  	if (sysctl_sched_uclamp_util_min > sysctl_sched_uclamp_util_max ||
> >  	    sysctl_sched_uclamp_util_max > SCHED_CAPACITY_SCALE) {
> >  		result = -EINVAL;
> >  		goto undo;
> >  	}
> >  
> > +	if (sysctl_sched_rt_default_uclamp_util_min > SCHED_CAPACITY_SCALE) {
> > +		result = -EINVAL;
> > +		goto undo;
> > +	}
> > +
> >  	if (old_min != sysctl_sched_uclamp_util_min) {
> >  		uclamp_se_set(&uclamp_default[UCLAMP_MIN],
> >  			      sysctl_sched_uclamp_util_min, false);
> > @@ -1165,6 +1212,7 @@ int sysctl_sched_uclamp_handler(struct ctl_table *table, int write,
> >  undo:
> >  	sysctl_sched_uclamp_util_min = old_min;
> >  	sysctl_sched_uclamp_util_max = old_max;
> > +	sysctl_sched_rt_default_uclamp_util_min = old_rt_min;
> >  done:
> >  	mutex_unlock(&uclamp_mutex);
> >  
> > @@ -1207,9 +1255,13 @@ static void __setscheduler_uclamp(struct task_struct *p,
> >  		if (uc_se->user_defined)
> >  			continue;
> >  
> > -		/* By default, RT tasks always get 100% boost */
> > +		/*
> > +		 * By default, RT tasks always get 100% boost, which the admins
> > +		 * are allowed to change via
> > +		 * sysctl_sched_rt_default_uclamp_util_min knob.
> > +		 */
> >  		if (unlikely(rt_task(p) && clamp_id == UCLAMP_MIN))
> > -			clamp_value = uclamp_none(UCLAMP_MAX);
> > +			clamp_value = sysctl_sched_rt_default_uclamp_util_min;
> >
> >  		uclamp_se_set(uc_se, clamp_value, false);
> >  	}
> > @@ -1241,9 +1293,13 @@ static void uclamp_fork(struct task_struct *p)
> >  	for_each_clamp_id(clamp_id) {
> >  		unsigned int clamp_value = uclamp_none(clamp_id);
> >  
> > -		/* By default, RT tasks always get 100% boost */
> > +		/*
> > +		 * By default, RT tasks always get 100% boost, which the admins
> > +		 * are allowed to change via
> > +		 * sysctl_sched_rt_default_uclamp_util_min knob.
> > +		 */
> >  		if (unlikely(rt_task(p) && clamp_id == UCLAMP_MIN))
> > -			clamp_value = uclamp_none(UCLAMP_MAX);
> > +			clamp_value = sysctl_sched_rt_default_uclamp_util_min;
> >  
> 
> This is not required, look at this Quentin's patch:
> 
>    Message-ID: <20200414161320.251897-1-qperret@google.com>
>    https://lore.kernel.org/lkml/20200414161320.251897-1-qperret@google.com/

Yep saw it.

Thanks for taking the time to review!

Cheers

--
Qais Yousef

> 
> >  		uclamp_se_set(&p->uclamp_req[clamp_id], clamp_value, false);
> >  	}
> > diff --git a/kernel/sysctl.c b/kernel/sysctl.c
> > index ad5b88a53c5a..0272ae8c6147 100644
> > --- a/kernel/sysctl.c
> > +++ b/kernel/sysctl.c
> > @@ -465,6 +465,13 @@ static struct ctl_table kern_table[] = {
> >  		.mode		= 0644,
> >  		.proc_handler	= sysctl_sched_uclamp_handler,
> >  	},
> > +	{
> > +		.procname	= "sched_rt_default_util_clamp_min",
> > +		.data		= &sysctl_sched_rt_default_uclamp_util_min,
> > +		.maxlen		= sizeof(unsigned int),
> > +		.mode		= 0644,
> > +		.proc_handler	= sysctl_sched_uclamp_handler,
> > +	},
> >  #endif
> >  #ifdef CONFIG_SCHED_AUTOGROUP
> >  	{
> 
> Best,
> Patrick
> 
> -- 
> #include <best/regards.h>
> 
> Patrick Bellasi
Qais Yousef April 20, 2020, 3:04 p.m. UTC | #7
On 04/15/20 09:46, Patrick Bellasi wrote:

[...]

> > > diff --git a/include/linux/sched/sysctl.h b/include/linux/sched/sysctl.h
> > > index d4f6215ee03f..91204480fabc 100644
> > > --- a/include/linux/sched/sysctl.h
> > > +++ b/include/linux/sched/sysctl.h
> > > @@ -59,6 +59,7 @@ extern int sysctl_sched_rt_runtime;
> > >  #ifdef CONFIG_UCLAMP_TASK
> > >  extern unsigned int sysctl_sched_uclamp_util_min;
> > >  extern unsigned int sysctl_sched_uclamp_util_max;
> > > +extern unsigned int sysctl_sched_rt_default_uclamp_util_min;
> > 
> > nit-pick: I would prefer to keep the same prefix of the already
> > exising knobs, i.e. sysctl_sched_uclamp_util_min_rt
> > 
> > The same change for consistency should be applied to all the following
> > symbols related to "uclamp_util_min_rt".
> > 
> > NOTE: I would not use "default" as I think that what we are doing is
> > exactly force setting a user_defined value for all RT tasks. More on
> > that later...
> 
> Had a second tought on that...

Sorry just noticed that you had a second reply to this. I just saw the first
initially.

Still catching up after holiday..

[...]

> > > +static void uclamp_rt_sync_default_util_min(struct task_struct *p)
> > > +{
> > > +	struct uclamp_se *uc_se = &p->uclamp_req[UCLAMP_MIN];
> > 
> > Don't we have to filter for RT tasks only here?
> 
> I think this is still a valid point.

Yep it is.

> 
> > > +
> > > +	if (!uc_se->user_defined)
> > > +		uclamp_se_set(uc_se, sysctl_sched_rt_default_uclamp_util_min, false);
> > 
> > Here you are actually setting a user-requested value, why not marking
> > it as that, i.e. by using true for the last parameter?
> 
> I think you don't want to set user_defined to ensure we keep updating
> the value every time the task is enqueued, in case the "default"
> should be updated at run-time.

Yes. I'm glad we're finally on agreement about this :-)

> 
> > Moreover, by keeping user_defined=false I think you are not getting
> > what you want for RT tasks running in a nested cgroup.
> > 
> > Let say a subgroup is still with the util_min=1024 inherited from the
> > system defaults, in uclamp_tg_restrict() we will still return the max
> > value and not what you requested for. Isn't it?
> 
> This is also not completely true since perhaps you assume that if an
> RT task is running in a nested group with a non tuned uclamp_max then
> that's probably what we want.
> 
> There is still a small concern due to the fact we don't distinguish
> CFS and RT tasks when it comes to cgroup clamp values, which
> potentially could still generate the same issue. Let say for example
> you wanna allow CFS tasks to be boosted to max (util_min=1024) but
> still want to run RT tasks only at lower OPPs.
> Not sure if that could be a use case tho.

No we can't within the same cgroup. But sys admins can potentially create
different cgroups to enforce the different policies.

A per sched-class uclamp control could simplify userspace if they end up with
this scenario. But given where we are now, I'm not sure how easy it would be to
stage the change.

>  
> > IOW, what about:
> > 
> > ---8<---
> > static void uclamp_sync_util_min_rt(struct task_struct *p)
> > {
> > 	struct uclamp_se *uc_se = &p->uclamp_req[UCLAMP_MIN];
> > 
> >   if (likely(uc_se->user_defined || !rt_task(p)))
> >     return;
> > 
> >   uclamp_se_set(uc_se, sysctl_sched_uclamp_util_min_rt, true);
>                                                           ^^^^
>                      This should remain false as in your patch.
> > }
> > ---8<---
> 
> Still, I was thinking that perhaps it would be better to massage the
> code above into the generation of the effective value, in uclamp_eff_get().
> 
> Since you wanna (possibly) update the value at each enqueue time,
> that's what conceptually is represented by the "effective clamp
> value": a value that is computed by definition at enqueue time by
> aggregating all the requests and constraints.
> 
> Poking with the effective value instead of the requested value will
> fix also the ambiguity above, where we set a "requested values" with
> user-defined=false.

Okay, let me have a second look at this. I just took what we had and improved
on it. But what you say could work too. Let me try it out.

Thanks

--
Qais Yousef
Qais Yousef April 20, 2020, 3:08 p.m. UTC | #8
Hi Quentin

On 04/15/20 11:11, Quentin Perret wrote:
> Hi Qais,
> 
> On Friday 03 Apr 2020 at 13:30:19 (+0100), Qais Yousef wrote:
> <snip>
> > +	/*
> > +	 * The new value will be applied to all RT tasks the next time they
> > +	 * wakeup, assuming the task is using the system default and not a user
> > +	 * specified value. In the latter we shall leave the value as the user
> > +	 * requested.
> > +	 */
> >  	if (sysctl_sched_uclamp_util_min > sysctl_sched_uclamp_util_max ||
> >  	    sysctl_sched_uclamp_util_max > SCHED_CAPACITY_SCALE) {
> >  		result = -EINVAL;
> >  		goto undo;
> >  	}
> >  
> > +	if (sysctl_sched_rt_default_uclamp_util_min > SCHED_CAPACITY_SCALE) {
> > +		result = -EINVAL;
> > +		goto undo;
> > +	}
> 
> Hmm, checking:
> 
> 	if (sysctl_sched_rt_default_uclamp_util_min > sysctl_sched_uclamp_util_min)
> 
> would probably make sense too, but then that would make writing in
> sysctl_sched_uclamp_util_min cumbersome for sysadmins as they'd need to
> lower the rt default first. Is that the reason for checking against
> SCHED_CAPACITY_SCALE? That might deserve a comment or something.

There's no need for that extra diff. That constraint will be applied
automatically when calculating the effective value.

The check for SCHED_CAPACITY_SCALE is a a range check. The possible value is
[0:SCHED_CAPACITY_SCALE].

Does this answer your question? I could add a comment that all the uclamp
sysctls need to be within this range.

> 
> <snip>
> > @@ -1241,9 +1293,13 @@ static void uclamp_fork(struct task_struct *p)
> >  	for_each_clamp_id(clamp_id) {
> >  		unsigned int clamp_value = uclamp_none(clamp_id);
> >  
> > -		/* By default, RT tasks always get 100% boost */
> > +		/*
> > +		 * By default, RT tasks always get 100% boost, which the admins
> > +		 * are allowed to change via
> > +		 * sysctl_sched_rt_default_uclamp_util_min knob.
> > +		 */
> >  		if (unlikely(rt_task(p) && clamp_id == UCLAMP_MIN))
> > -			clamp_value = uclamp_none(UCLAMP_MAX);
> > +			clamp_value = sysctl_sched_rt_default_uclamp_util_min;
> >  
> >  		uclamp_se_set(&p->uclamp_req[clamp_id], clamp_value, false);
> >  	}
> 
> And that, as per 20200414161320.251897-1-qperret@google.com, should not
> be there :)

Yep saw it. Thanks for fixing it!

> 
> Otherwise the patch pretty looks good to me!

Cheers

--
Qais Yousef
Qais Yousef April 20, 2020, 3:13 p.m. UTC | #9
On 04/20/20 10:29, Dietmar Eggemann wrote:
> On 03.04.20 14:30, Qais Yousef wrote:
> 
> [...]
> 
> > @@ -924,6 +945,14 @@ uclamp_eff_get(struct task_struct *p, enum uclamp_id clamp_id)
> >  	return uc_req;
> >  }
> >  
> > +static void uclamp_rt_sync_default_util_min(struct task_struct *p)
> > +{
> > +	struct uclamp_se *uc_se = &p->uclamp_req[UCLAMP_MIN];
> > +
> > +	if (!uc_se->user_defined)
> > +		uclamp_se_set(uc_se, sysctl_sched_rt_default_uclamp_util_min, false);
> > +}
> > +
> >  unsigned long uclamp_eff_value(struct task_struct *p, enum uclamp_id clamp_id)
> >  {
> >  	struct uclamp_se uc_eff;
> > @@ -1030,6 +1059,12 @@ static inline void uclamp_rq_inc(struct rq *rq, struct task_struct *p)
> >  	if (unlikely(!p->sched_class->uclamp_enabled))
> >  		return;
> >  
> > +	/*
> > +	 * When sysctl_sched_rt_default_uclamp_util_min value is changed by the
> > +	 * user, we apply any new value on the next wakeup, which is here.
> > +	 */
> > +	uclamp_rt_sync_default_util_min(p);
> > +
> 
> Does this have to be an extra function? Can we not reuse
> uclamp_tg_restrict() by slightly rename it to uclamp_restrict()?

Hmm the thing is that we're not restricting here. In contrary we're boosting,
so the name would be misleading.

> 
> This function will then deal with enforcing restrictions, whether system
> and taskgroup hierarchy related or default value (latter only for rt-min
> right now since the others are fixed) related.
> 
> uclamp_eff_get() -> uclamp_restrict() is called from:
> 
>   'enqueue_task(), uclamp_update_active() -> uclamp_rq_inc() -> uclamp_rq_inc_id()' and
> 
>   'task_fits_capacity() -> clamp_task_util(), rt_task_fits_capacity() -> uclamp_eff_value()' and
> 
>   'schedutil_cpu_util(), find_energy_efficient_cpu() -> uclamp_rq_util_with() -> uclamp_eff_value()'
> 
> so there would be more check-points than the one in 'enqueue_task() -> uclamp_rq_inc()' now.

I think you're revolving around the same idea that Patrick was suggesting.
I think it is possible to do something in uclamp_eff_get() too.

Thanks

--
Qais Yousef

> 
> Only lightly tested:
> 
> ---8<---
> 
> From: Dietmar Eggemann <dietmar.eggemann@arm.com>
> Date: Sun, 19 Apr 2020 01:20:17 +0200
> Subject: [PATCH] sched/core: uclamp: Move uclamp_rt_sync_default_util_min()
>  into uclamp_tg_restrict()
> 
> Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
> ---
>  kernel/sched/core.c | 34 +++++++++++++++-------------------
>  1 file changed, 15 insertions(+), 19 deletions(-)
> 
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 8f4e0d5c7daf..6802113d6d4b 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -899,12 +899,22 @@ unsigned int uclamp_rq_max_value(struct rq *rq, enum uclamp_id clamp_id,
>  }
>  
>  static inline struct uclamp_se
> -uclamp_tg_restrict(struct task_struct *p, enum uclamp_id clamp_id)
> +uclamp_restrict(struct task_struct *p, enum uclamp_id clamp_id)
>  {
> -	struct uclamp_se uc_req = p->uclamp_req[clamp_id];
> -#ifdef CONFIG_UCLAMP_TASK_GROUP
> -	struct uclamp_se uc_max;
> +	struct uclamp_se uc_req, __maybe_unused uc_max;
> +
> +	if (unlikely(rt_task(p)) && clamp_id == UCLAMP_MIN &&
> +	    !uc_req.user_defined) {
> +		struct uclamp_se *uc_se = &p->uclamp_req[UCLAMP_MIN];
> +		int rt_min = sysctl_sched_rt_default_uclamp_util_min;
> +
> +		if (uc_se->value != rt_min)
> +			uclamp_se_set(uc_se, rt_min, false);
> +	}
>  
> +	uc_req = p->uclamp_req[clamp_id];
> +
> +#ifdef CONFIG_UCLAMP_TASK_GROUP
>  	/*
>  	 * Tasks in autogroups or root task group will be
>  	 * restricted by system defaults.
> @@ -933,7 +943,7 @@ uclamp_tg_restrict(struct task_struct *p, enum uclamp_id clamp_id)
>  static inline struct uclamp_se
>  uclamp_eff_get(struct task_struct *p, enum uclamp_id clamp_id)
>  {
> -	struct uclamp_se uc_req = uclamp_tg_restrict(p, clamp_id);
> +	struct uclamp_se uc_req = uclamp_restrict(p, clamp_id);
>  	struct uclamp_se uc_max = uclamp_default[clamp_id];
>  
>  	/* System default restrictions always apply */
> @@ -943,14 +953,6 @@ uclamp_eff_get(struct task_struct *p, enum uclamp_id clamp_id)
>  	return uc_req;
>  }
>  
> -static void uclamp_rt_sync_default_util_min(struct task_struct *p)
> -{
> -	struct uclamp_se *uc_se = &p->uclamp_req[UCLAMP_MIN];
> -
> -	if (!uc_se->user_defined)
> -		uclamp_se_set(uc_se, sysctl_sched_rt_default_uclamp_util_min, false);
> -}
> -
>  unsigned long uclamp_eff_value(struct task_struct *p, enum uclamp_id clamp_id)
>  {
>  	struct uclamp_se uc_eff;
> @@ -1057,12 +1059,6 @@ static inline void uclamp_rq_inc(struct rq *rq, struct task_struct *p)
>  	if (unlikely(!p->sched_class->uclamp_enabled))
>  		return;
>  
> -	/*
> -	 * When sysctl_sched_rt_default_uclamp_util_min value is changed by the
> -	 * user, we apply any new value on the next wakeup, which is here.
> -	 */
> -	uclamp_rt_sync_default_util_min(p);
> -
>  	for_each_clamp_id(clamp_id)
>  		uclamp_rq_inc_id(rq, p, clamp_id);
>  
> -- 
> 2.17.1
Qais Yousef April 20, 2020, 3:19 p.m. UTC | #10
On 04/20/20 10:24, Dietmar Eggemann wrote:
> >> diff --git a/kernel/sysctl.c b/kernel/sysctl.c
> >> index ad5b88a53c5a..0272ae8c6147 100644
> >> --- a/kernel/sysctl.c
> >> +++ b/kernel/sysctl.c
> >> @@ -465,6 +465,13 @@ static struct ctl_table kern_table[] = {
> >>  		.mode		= 0644,
> >>  		.proc_handler	= sysctl_sched_uclamp_handler,
> >>  	},
> >> +	{
> >> +		.procname	= "sched_rt_default_util_clamp_min",
> 
> root@h960:~# find / -name "*util_clamp*"
> /proc/sys/kernel/sched_rt_default_util_clamp_min
> /proc/sys/kernel/sched_util_clamp_max
> /proc/sys/kernel/sched_util_clamp_min
> 
> IMHO, keeping the common 'sched_util_clamp_' would be helpful here, e.g.
> 
> /proc/sys/kernel/sched_util_clamp_rt_default_min

All RT related knobs are prefixed with 'sched_rt'. I kept the 'util_clamp_min'
coherent with the current sysctl (sched_util_clamp_min). Quentin suggested
adding 'default' to be more obvious, so I ended up with

	'sched_rt' + '_default' + '_util_clamp_min'.

I think this is the logical and most consistent form. Given that Patrick seems
to be okay with the 'default' now, does this look good to you too?

Thanks

--
Qais Yousef
Steven Rostedt April 21, 2020, 12:52 a.m. UTC | #11
On Mon, 20 Apr 2020 16:19:42 +0100
Qais Yousef <qais.yousef@arm.com> wrote:

> > root@h960:~# find / -name "*util_clamp*"
> > /proc/sys/kernel/sched_rt_default_util_clamp_min
> > /proc/sys/kernel/sched_util_clamp_max
> > /proc/sys/kernel/sched_util_clamp_min
> > 
> > IMHO, keeping the common 'sched_util_clamp_' would be helpful here, e.g.
> > 
> > /proc/sys/kernel/sched_util_clamp_rt_default_min  
> 
> All RT related knobs are prefixed with 'sched_rt'. I kept the 'util_clamp_min'
> coherent with the current sysctl (sched_util_clamp_min). Quentin suggested
> adding 'default' to be more obvious, so I ended up with
> 
> 	'sched_rt' + '_default' + '_util_clamp_min'.
> 
> I think this is the logical and most consistent form. Given that Patrick seems
> to be okay with the 'default' now, does this look good to you too?

There's only two files with "sched_rt" and they are tightly coupled
(they define how much an RT task may use the CPU).

My question is, is this "sched_rt_default_util_clamp_min" related in
any way to those other two files that start with "sched_rt", or is it
more related to the files that start with "sched_util_clamp"?

If the latter, then I would suggest using
"sched_util_clamp_min_rt_default", as it looks to be more related to
the "sched_util_clamp_min" than to anything else.

-- Steve
Dietmar Eggemann April 21, 2020, 11:16 a.m. UTC | #12
On 21/04/2020 02:52, Steven Rostedt wrote:
> On Mon, 20 Apr 2020 16:19:42 +0100
> Qais Yousef <qais.yousef@arm.com> wrote:
> 
>>> root@h960:~# find / -name "*util_clamp*"
>>> /proc/sys/kernel/sched_rt_default_util_clamp_min
>>> /proc/sys/kernel/sched_util_clamp_max
>>> /proc/sys/kernel/sched_util_clamp_min
>>>
>>> IMHO, keeping the common 'sched_util_clamp_' would be helpful here, e.g.
>>>
>>> /proc/sys/kernel/sched_util_clamp_rt_default_min  
>>
>> All RT related knobs are prefixed with 'sched_rt'. I kept the 'util_clamp_min'
>> coherent with the current sysctl (sched_util_clamp_min). Quentin suggested
>> adding 'default' to be more obvious, so I ended up with
>>
>> 	'sched_rt' + '_default' + '_util_clamp_min'.
>>
>> I think this is the logical and most consistent form. Given that Patrick seems
>> to be okay with the 'default' now, does this look good to you too?
> 
> There's only two files with "sched_rt" and they are tightly coupled
> (they define how much an RT task may use the CPU).
> 
> My question is, is this "sched_rt_default_util_clamp_min" related in
> any way to those other two files that start with "sched_rt", or is it
> more related to the files that start with "sched_util_clamp"?
> 
> If the latter, then I would suggest using
> "sched_util_clamp_min_rt_default", as it looks to be more related to
> the "sched_util_clamp_min" than to anything else.

For me it's the latter.
Dietmar Eggemann April 21, 2020, 11:18 a.m. UTC | #13
On 20/04/2020 17:13, Qais Yousef wrote:
> On 04/20/20 10:29, Dietmar Eggemann wrote:
>> On 03.04.20 14:30, Qais Yousef wrote:
>>
>> [...]
>>
>>> @@ -924,6 +945,14 @@ uclamp_eff_get(struct task_struct *p, enum uclamp_id clamp_id)
>>>  	return uc_req;
>>>  }
>>>  
>>> +static void uclamp_rt_sync_default_util_min(struct task_struct *p)
>>> +{
>>> +	struct uclamp_se *uc_se = &p->uclamp_req[UCLAMP_MIN];
>>> +
>>> +	if (!uc_se->user_defined)
>>> +		uclamp_se_set(uc_se, sysctl_sched_rt_default_uclamp_util_min, false);
>>> +}
>>> +
>>>  unsigned long uclamp_eff_value(struct task_struct *p, enum uclamp_id clamp_id)
>>>  {
>>>  	struct uclamp_se uc_eff;
>>> @@ -1030,6 +1059,12 @@ static inline void uclamp_rq_inc(struct rq *rq, struct task_struct *p)
>>>  	if (unlikely(!p->sched_class->uclamp_enabled))
>>>  		return;
>>>  
>>> +	/*
>>> +	 * When sysctl_sched_rt_default_uclamp_util_min value is changed by the
>>> +	 * user, we apply any new value on the next wakeup, which is here.
>>> +	 */
>>> +	uclamp_rt_sync_default_util_min(p);
>>> +
>>
>> Does this have to be an extra function? Can we not reuse
>> uclamp_tg_restrict() by slightly rename it to uclamp_restrict()?
> 
> Hmm the thing is that we're not restricting here. In contrary we're boosting,
> so the name would be misleading.

I always thought that we're restricting p->uclamp_req[UCLAMP_MIN].value (default 1024) to
sysctl_sched_rt_default_uclamp_util_min (0-1024)?

root@h960:~# echo 999 > /proc/sys/kernel/sched_rt_default_util_clamp_min

[  118.028582] uclamp_eff_get() [rtkit-daemon 410] tag=0 uclamp_id=0 uc_req.value=1024
[  118.036290] uclamp_eff_get() [rtkit-daemon 410] tag=1 uclamp_id=0 uc_req.value=1024
[  125.181747] uclamp_eff_get() [rtkit-daemon 410] tag=0 uclamp_id=0 uc_req.value=1024
[  125.189443] uclamp_eff_get() [rtkit-daemon 410] tag=1 uclamp_id=0 uc_req.value=1024
[  131.213211] uclamp_restrict() [rtkit-daemon 410] p->uclamp_req[0].value=999
[  131.220201] uclamp_eff_get() [rtkit-daemon 410] tag=0 uclamp_id=0 uc_req.value=999
[  131.227792] uclamp_eff_get() [rtkit-daemon 410] tag=1 uclamp_id=0 uc_req.value=999
[  137.181544] uclamp_eff_get() [rtkit-daemon 410] tag=0 uclamp_id=0 uc_req.value=999
[  137.189170] uclamp_eff_get() [rtkit-daemon 410] tag=1 uclamp_id=0 uc_req.value=999

>> This function will then deal with enforcing restrictions, whether system
>> and taskgroup hierarchy related or default value (latter only for rt-min
>> right now since the others are fixed) related.
>>
>> uclamp_eff_get() -> uclamp_restrict() is called from:
>>
>>   'enqueue_task(), uclamp_update_active() -> uclamp_rq_inc() -> uclamp_rq_inc_id()' and
>>
>>   'task_fits_capacity() -> clamp_task_util(), rt_task_fits_capacity() -> uclamp_eff_value()' and
>>
>>   'schedutil_cpu_util(), find_energy_efficient_cpu() -> uclamp_rq_util_with() -> uclamp_eff_value()'
>>
>> so there would be more check-points than the one in 'enqueue_task() -> uclamp_rq_inc()' now.
> 
> I think you're revolving around the same idea that Patrick was suggesting.
> I think it is possible to do something in uclamp_eff_get() too.

Yeah, I read https://lore.kernel.org/linux-doc/20200415074600.GA26984@darkstar again.

Everything which moves enforcing sysctl_sched_rt_default_uclamp_util_min closer to 'uclamp_eff_get() -> 
uclamp_(tg_)restrict()' is fine with me.
Qais Yousef April 21, 2020, 11:23 a.m. UTC | #14
On 04/21/20 13:16, Dietmar Eggemann wrote:
> On 21/04/2020 02:52, Steven Rostedt wrote:
> > On Mon, 20 Apr 2020 16:19:42 +0100
> > Qais Yousef <qais.yousef@arm.com> wrote:
> > 
> >>> root@h960:~# find / -name "*util_clamp*"
> >>> /proc/sys/kernel/sched_rt_default_util_clamp_min
> >>> /proc/sys/kernel/sched_util_clamp_max
> >>> /proc/sys/kernel/sched_util_clamp_min
> >>>
> >>> IMHO, keeping the common 'sched_util_clamp_' would be helpful here, e.g.
> >>>
> >>> /proc/sys/kernel/sched_util_clamp_rt_default_min  
> >>
> >> All RT related knobs are prefixed with 'sched_rt'. I kept the 'util_clamp_min'
> >> coherent with the current sysctl (sched_util_clamp_min). Quentin suggested
> >> adding 'default' to be more obvious, so I ended up with
> >>
> >> 	'sched_rt' + '_default' + '_util_clamp_min'.
> >>
> >> I think this is the logical and most consistent form. Given that Patrick seems
> >> to be okay with the 'default' now, does this look good to you too?
> > 
> > There's only two files with "sched_rt" and they are tightly coupled
> > (they define how much an RT task may use the CPU).
> > 
> > My question is, is this "sched_rt_default_util_clamp_min" related in
> > any way to those other two files that start with "sched_rt", or is it
> > more related to the files that start with "sched_util_clamp"?
> > 
> > If the latter, then I would suggest using
> > "sched_util_clamp_min_rt_default", as it looks to be more related to
> > the "sched_util_clamp_min" than to anything else.
> 
> For me it's the latter.

The way I see it is that 'sched_rt' define an rt class property. And running at
max performance level is an RT class property that uclamp honoured and has an
extra side effect that it allows us to tune.

That said, I'm fine with whatever.

Patrick, Quentin, is sched_util_clamp_min_rt_default fine with you too?

Thanks

--
Qais Yousef
Qais Yousef April 21, 2020, 11:27 a.m. UTC | #15
On 04/21/20 13:18, Dietmar Eggemann wrote:
> On 20/04/2020 17:13, Qais Yousef wrote:
> > On 04/20/20 10:29, Dietmar Eggemann wrote:
> >> On 03.04.20 14:30, Qais Yousef wrote:
> >>
> >> [...]
> >>
> >>> @@ -924,6 +945,14 @@ uclamp_eff_get(struct task_struct *p, enum uclamp_id clamp_id)
> >>>  	return uc_req;
> >>>  }
> >>>  
> >>> +static void uclamp_rt_sync_default_util_min(struct task_struct *p)
> >>> +{
> >>> +	struct uclamp_se *uc_se = &p->uclamp_req[UCLAMP_MIN];
> >>> +
> >>> +	if (!uc_se->user_defined)
> >>> +		uclamp_se_set(uc_se, sysctl_sched_rt_default_uclamp_util_min, false);
> >>> +}
> >>> +
> >>>  unsigned long uclamp_eff_value(struct task_struct *p, enum uclamp_id clamp_id)
> >>>  {
> >>>  	struct uclamp_se uc_eff;
> >>> @@ -1030,6 +1059,12 @@ static inline void uclamp_rq_inc(struct rq *rq, struct task_struct *p)
> >>>  	if (unlikely(!p->sched_class->uclamp_enabled))
> >>>  		return;
> >>>  
> >>> +	/*
> >>> +	 * When sysctl_sched_rt_default_uclamp_util_min value is changed by the
> >>> +	 * user, we apply any new value on the next wakeup, which is here.
> >>> +	 */
> >>> +	uclamp_rt_sync_default_util_min(p);
> >>> +
> >>
> >> Does this have to be an extra function? Can we not reuse
> >> uclamp_tg_restrict() by slightly rename it to uclamp_restrict()?
> > 
> > Hmm the thing is that we're not restricting here. In contrary we're boosting,
> > so the name would be misleading.
> 
> I always thought that we're restricting p->uclamp_req[UCLAMP_MIN].value (default 1024) to
> sysctl_sched_rt_default_uclamp_util_min (0-1024)?

The way I look at it is that we're *setting* it to
sysctl_sched_rt_default_uclamp_util_min if !user_defined.

The restriction mechanism that ensures this set value doesn't escape
cgroup/global restrictions setup.

> 
> root@h960:~# echo 999 > /proc/sys/kernel/sched_rt_default_util_clamp_min
> 
> [  118.028582] uclamp_eff_get() [rtkit-daemon 410] tag=0 uclamp_id=0 uc_req.value=1024
> [  118.036290] uclamp_eff_get() [rtkit-daemon 410] tag=1 uclamp_id=0 uc_req.value=1024
> [  125.181747] uclamp_eff_get() [rtkit-daemon 410] tag=0 uclamp_id=0 uc_req.value=1024
> [  125.189443] uclamp_eff_get() [rtkit-daemon 410] tag=1 uclamp_id=0 uc_req.value=1024
> [  131.213211] uclamp_restrict() [rtkit-daemon 410] p->uclamp_req[0].value=999
> [  131.220201] uclamp_eff_get() [rtkit-daemon 410] tag=0 uclamp_id=0 uc_req.value=999
> [  131.227792] uclamp_eff_get() [rtkit-daemon 410] tag=1 uclamp_id=0 uc_req.value=999
> [  137.181544] uclamp_eff_get() [rtkit-daemon 410] tag=0 uclamp_id=0 uc_req.value=999
> [  137.189170] uclamp_eff_get() [rtkit-daemon 410] tag=1 uclamp_id=0 uc_req.value=999
> 
> >> This function will then deal with enforcing restrictions, whether system
> >> and taskgroup hierarchy related or default value (latter only for rt-min
> >> right now since the others are fixed) related.
> >>
> >> uclamp_eff_get() -> uclamp_restrict() is called from:
> >>
> >>   'enqueue_task(), uclamp_update_active() -> uclamp_rq_inc() -> uclamp_rq_inc_id()' and
> >>
> >>   'task_fits_capacity() -> clamp_task_util(), rt_task_fits_capacity() -> uclamp_eff_value()' and
> >>
> >>   'schedutil_cpu_util(), find_energy_efficient_cpu() -> uclamp_rq_util_with() -> uclamp_eff_value()'
> >>
> >> so there would be more check-points than the one in 'enqueue_task() -> uclamp_rq_inc()' now.
> > 
> > I think you're revolving around the same idea that Patrick was suggesting.
> > I think it is possible to do something in uclamp_eff_get() too.
> 
> Yeah, I read https://lore.kernel.org/linux-doc/20200415074600.GA26984@darkstar again.
> 
> Everything which moves enforcing sysctl_sched_rt_default_uclamp_util_min closer to 'uclamp_eff_get() -> 
> uclamp_(tg_)restrict()' is fine with me.

Cool.

Thanks

--
Qais Yousef
Dietmar Eggemann April 22, 2020, 10:59 a.m. UTC | #16
On 21/04/2020 13:27, Qais Yousef wrote:
> On 04/21/20 13:18, Dietmar Eggemann wrote:
>> On 20/04/2020 17:13, Qais Yousef wrote:
>>> On 04/20/20 10:29, Dietmar Eggemann wrote:
>>>> On 03.04.20 14:30, Qais Yousef wrote:
>>>>
>>>> [...]
>>>>
>>>>> @@ -924,6 +945,14 @@ uclamp_eff_get(struct task_struct *p, enum uclamp_id clamp_id)
>>>>>  	return uc_req;
>>>>>  }
>>>>>  
>>>>> +static void uclamp_rt_sync_default_util_min(struct task_struct *p)
>>>>> +{
>>>>> +	struct uclamp_se *uc_se = &p->uclamp_req[UCLAMP_MIN];
>>>>> +
>>>>> +	if (!uc_se->user_defined)
>>>>> +		uclamp_se_set(uc_se, sysctl_sched_rt_default_uclamp_util_min, false);
>>>>> +}
>>>>> +
>>>>>  unsigned long uclamp_eff_value(struct task_struct *p, enum uclamp_id clamp_id)
>>>>>  {
>>>>>  	struct uclamp_se uc_eff;
>>>>> @@ -1030,6 +1059,12 @@ static inline void uclamp_rq_inc(struct rq *rq, struct task_struct *p)
>>>>>  	if (unlikely(!p->sched_class->uclamp_enabled))
>>>>>  		return;
>>>>>  
>>>>> +	/*
>>>>> +	 * When sysctl_sched_rt_default_uclamp_util_min value is changed by the
>>>>> +	 * user, we apply any new value on the next wakeup, which is here.
>>>>> +	 */
>>>>> +	uclamp_rt_sync_default_util_min(p);
>>>>> +
>>>>
>>>> Does this have to be an extra function? Can we not reuse
>>>> uclamp_tg_restrict() by slightly rename it to uclamp_restrict()?

Btw, there was an issue in my little snippet. I used uc_req.user_defined
uninitialized in uclamp_restrict().


diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index f3706dad32ce..7e6b2b7cd1e5 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -903,12 +903,11 @@ uclamp_restrict(struct task_struct *p, enum uclamp_id clamp_id)
 {
 	struct uclamp_se uc_req, __maybe_unused uc_max;
 
-	if (unlikely(rt_task(p)) && clamp_id == UCLAMP_MIN &&
-	    !uc_req.user_defined) {
+	if (unlikely(rt_task(p)) && clamp_id == UCLAMP_MIN) {
 		struct uclamp_se *uc_se = &p->uclamp_req[UCLAMP_MIN];
 		int rt_min = sysctl_sched_rt_default_uclamp_util_min;
 
-		if (uc_se->value != rt_min) {
+		if (!uc_se->user_defined && uc_se->value != rt_min) {
 			uclamp_se_set(uc_se, rt_min, false);
 			printk("uclamp_restrict() [%s %d] p->uclamp_req[%d].value=%d\n",
 			       p->comm, p->pid, clamp_id, uc_se->value);

>>> Hmm the thing is that we're not restricting here. In contrary we're boosting,
>>> so the name would be misleading.
>>
>> I always thought that we're restricting p->uclamp_req[UCLAMP_MIN].value (default 1024) to
>> sysctl_sched_rt_default_uclamp_util_min (0-1024)?
> 
> The way I look at it is that we're *setting* it to
> sysctl_sched_rt_default_uclamp_util_min if !user_defined.
> 
> The restriction mechanism that ensures this set value doesn't escape
> cgroup/global restrictions setup.

I guess we overall agree here. 

I see 3 restriction levels: (!user_defined) task -> taskgroup -> system

I see sysctl_sched_rt_default_uclamp_util_min (min_rt_default) as a
restriction on task level.

It's true that the task level restriction is setting the value at the same time.

For CFS (id=UCLAMP_[MIN\|MAX]) and RT (id=UCLAMP_MAX) we use
uclamp_none(id) and those values (0, 1024) are fixed so these task level
values don't need to be further restricted.

For RT (id=UCLAMP_MIN) we use 'min_rt_default' and since it can change
we have to check the task level restriction in 'uclamp_eff_get() ->
uclamp_(tg)_restrict()'.

root@h960:~# echo 999 > /proc/sys/kernel/sched_rt_default_util_clamp_min

[ 2540.507236] uclamp_eff_get() [rtkit-daemon 419] tag=0 uclamp_id=0 uc_req.value=1024
[ 2540.514947] uclamp_eff_get() [rtkit-daemon 419] tag=1 uclamp_id=0 uc_req.value=1024
[ 2548.015208] uclamp_restrict() [rtkit-daemon 419] p->uclamp_req[0].value=999

root@h960:~# echo 666 > /proc/sys/kernel/sched_util_clamp_min

[ 2548.022219] uclamp_eff_get() [rtkit-daemon 419] tag=0 uclamp_id=0 uc_req.value=999
[ 2548.029825] uclamp_eff_get() [rtkit-daemon 419] tag=1 uclamp_id=0 uc_req.value=999
[ 2553.479509] uclamp_eff_get() [rtkit-daemon 419] tag=0 uclamp_id=0 uc_max.value=666
[ 2553.487131] uclamp_eff_get() [rtkit-daemon 419] tag=1 uclamp_id=0 uc_max.value=666

Haven't tried to put an rt task into a taskgroup other than root.
Qais Yousef April 22, 2020, 1:13 p.m. UTC | #17
On 04/22/20 12:59, Dietmar Eggemann wrote:
> On 21/04/2020 13:27, Qais Yousef wrote:
> > On 04/21/20 13:18, Dietmar Eggemann wrote:
> >> On 20/04/2020 17:13, Qais Yousef wrote:
> >>> On 04/20/20 10:29, Dietmar Eggemann wrote:
> >>>> On 03.04.20 14:30, Qais Yousef wrote:
> >>>>
> >>>> [...]
> >>>>
> >>>>> @@ -924,6 +945,14 @@ uclamp_eff_get(struct task_struct *p, enum uclamp_id clamp_id)
> >>>>>  	return uc_req;
> >>>>>  }
> >>>>>  
> >>>>> +static void uclamp_rt_sync_default_util_min(struct task_struct *p)
> >>>>> +{
> >>>>> +	struct uclamp_se *uc_se = &p->uclamp_req[UCLAMP_MIN];
> >>>>> +
> >>>>> +	if (!uc_se->user_defined)
> >>>>> +		uclamp_se_set(uc_se, sysctl_sched_rt_default_uclamp_util_min, false);
> >>>>> +}
> >>>>> +
> >>>>>  unsigned long uclamp_eff_value(struct task_struct *p, enum uclamp_id clamp_id)
> >>>>>  {
> >>>>>  	struct uclamp_se uc_eff;
> >>>>> @@ -1030,6 +1059,12 @@ static inline void uclamp_rq_inc(struct rq *rq, struct task_struct *p)
> >>>>>  	if (unlikely(!p->sched_class->uclamp_enabled))
> >>>>>  		return;
> >>>>>  
> >>>>> +	/*
> >>>>> +	 * When sysctl_sched_rt_default_uclamp_util_min value is changed by the
> >>>>> +	 * user, we apply any new value on the next wakeup, which is here.
> >>>>> +	 */
> >>>>> +	uclamp_rt_sync_default_util_min(p);
> >>>>> +
> >>>>
> >>>> Does this have to be an extra function? Can we not reuse
> >>>> uclamp_tg_restrict() by slightly rename it to uclamp_restrict()?
> 
> Btw, there was an issue in my little snippet. I used uc_req.user_defined
> uninitialized in uclamp_restrict().
> 
> 
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index f3706dad32ce..7e6b2b7cd1e5 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -903,12 +903,11 @@ uclamp_restrict(struct task_struct *p, enum uclamp_id clamp_id)
>  {
>  	struct uclamp_se uc_req, __maybe_unused uc_max;
>  
> -	if (unlikely(rt_task(p)) && clamp_id == UCLAMP_MIN &&
> -	    !uc_req.user_defined) {
> +	if (unlikely(rt_task(p)) && clamp_id == UCLAMP_MIN) {
>  		struct uclamp_se *uc_se = &p->uclamp_req[UCLAMP_MIN];
>  		int rt_min = sysctl_sched_rt_default_uclamp_util_min;
>  
> -		if (uc_se->value != rt_min) {
> +		if (!uc_se->user_defined && uc_se->value != rt_min) {
>  			uclamp_se_set(uc_se, rt_min, false);
>  			printk("uclamp_restrict() [%s %d] p->uclamp_req[%d].value=%d\n",
>  			       p->comm, p->pid, clamp_id, uc_se->value);
> 
> >>> Hmm the thing is that we're not restricting here. In contrary we're boosting,
> >>> so the name would be misleading.
> >>
> >> I always thought that we're restricting p->uclamp_req[UCLAMP_MIN].value (default 1024) to
> >> sysctl_sched_rt_default_uclamp_util_min (0-1024)?
> > 
> > The way I look at it is that we're *setting* it to
> > sysctl_sched_rt_default_uclamp_util_min if !user_defined.
> > 
> > The restriction mechanism that ensures this set value doesn't escape
> > cgroup/global restrictions setup.
> 
> I guess we overall agree here. 
> 
> I see 3 restriction levels: (!user_defined) task -> taskgroup -> system
> 
> I see sysctl_sched_rt_default_uclamp_util_min (min_rt_default) as a
> restriction on task level.

Hmm from code perspective it is a request. One that is applied by default if
the user didn't make any request.

Since restriction has a different meaning from code point of view, I think
interchanging both could be confusing.

A restriction from the code view is that you can request 1024, but if cgroup or
global settings doesn't allow you, the system will automatically crop it.

The new sysctl doesn't result in any cropping. It is equivalent to a user
making a sched_setattr() call to change UCLAMP_MIN value of a task. It's just
this request is applied automatically by default, if !user_defined.

But if you meant your high level view of how it works you think of it as
a restriction, then yeah, it could be abstracted in this way. The terminology
just conflicts with the code.

> 
> It's true that the task level restriction is setting the value at the same time.
> 
> For CFS (id=UCLAMP_[MIN\|MAX]) and RT (id=UCLAMP_MAX) we use
> uclamp_none(id) and those values (0, 1024) are fixed so these task level
> values don't need to be further restricted.

I wouldn't think of these as restriction. They're default requests, if
I understood what you're saying correctly, by default:

	cfs_task->util_min = 0
	cfs_task->util_max = 1024

	rt_task->util_min = 1024
	rt_task->util_max = 1024

Which are the requested value.

sysctl_util_clamp_{min,max} are the default restriction which by default would
allow the tasks to request any value within the full range.

The root taskgroup will inherit this value by default. And new cgroups will
inherit from the root taskgroup.

> 
> For RT (id=UCLAMP_MIN) we use 'min_rt_default' and since it can change
> we have to check the task level restriction in 'uclamp_eff_get() ->
> uclamp_(tg)_restrict()'.

Yes. If we take the approach to apply the default request in uclamp_eff_get(),
then this must be applied before uclamp_tg_restrict() call.

> 
> root@h960:~# echo 999 > /proc/sys/kernel/sched_rt_default_util_clamp_min
> 
> [ 2540.507236] uclamp_eff_get() [rtkit-daemon 419] tag=0 uclamp_id=0 uc_req.value=1024
> [ 2540.514947] uclamp_eff_get() [rtkit-daemon 419] tag=1 uclamp_id=0 uc_req.value=1024
> [ 2548.015208] uclamp_restrict() [rtkit-daemon 419] p->uclamp_req[0].value=999
> 
> root@h960:~# echo 666 > /proc/sys/kernel/sched_util_clamp_min
> 
> [ 2548.022219] uclamp_eff_get() [rtkit-daemon 419] tag=0 uclamp_id=0 uc_req.value=999
> [ 2548.029825] uclamp_eff_get() [rtkit-daemon 419] tag=1 uclamp_id=0 uc_req.value=999
> [ 2553.479509] uclamp_eff_get() [rtkit-daemon 419] tag=0 uclamp_id=0 uc_max.value=666
> [ 2553.487131] uclamp_eff_get() [rtkit-daemon 419] tag=1 uclamp_id=0 uc_max.value=666
> 
> Haven't tried to put an rt task into a taskgroup other than root.

I do run a test that Patrick had which checks for cgroup values. One of the
tests checks if RT are boosted to max, and it fails when I change the default
RT max-boost value :)

I think we're in agreement, but the terminology is probably making things a bit
confusing.

Thanks

--
Qais Yousef
diff mbox series

Patch

diff --git a/include/linux/sched/sysctl.h b/include/linux/sched/sysctl.h
index d4f6215ee03f..91204480fabc 100644
--- a/include/linux/sched/sysctl.h
+++ b/include/linux/sched/sysctl.h
@@ -59,6 +59,7 @@  extern int sysctl_sched_rt_runtime;
 #ifdef CONFIG_UCLAMP_TASK
 extern unsigned int sysctl_sched_uclamp_util_min;
 extern unsigned int sysctl_sched_uclamp_util_max;
+extern unsigned int sysctl_sched_rt_default_uclamp_util_min;
 #endif
 
 #ifdef CONFIG_CFS_BANDWIDTH
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 1a9983da4408..a726b26a5056 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -797,6 +797,27 @@  unsigned int sysctl_sched_uclamp_util_min = SCHED_CAPACITY_SCALE;
 /* Max allowed maximum utilization */
 unsigned int sysctl_sched_uclamp_util_max = SCHED_CAPACITY_SCALE;
 
+/*
+ * By default RT tasks run at the maximum performance point/capacity of the
+ * system. Uclamp enforces this by always setting UCLAMP_MIN of RT tasks to
+ * SCHED_CAPACITY_SCALE.
+ *
+ * This knob allows admins to change the default behavior when uclamp is being
+ * used. In battery powered devices, particularly, running at the maximum
+ * capacity and frequency will increase energy consumption and shorten the
+ * battery life.
+ *
+ * This knob only affects the default value RT has when a new RT task is
+ * forked or has just changed policy to RT, given the user hasn't modified the
+ * uclamp.min value of the task via sched_setattr().
+ *
+ * This knob will not override the system default sched_util_clamp_min defined
+ * above.
+ *
+ * Any modification is applied lazily on the next RT task wakeup.
+ */
+unsigned int sysctl_sched_rt_default_uclamp_util_min = SCHED_CAPACITY_SCALE;
+
 /* All clamps are required to be less or equal than these values */
 static struct uclamp_se uclamp_default[UCLAMP_CNT];
 
@@ -924,6 +945,14 @@  uclamp_eff_get(struct task_struct *p, enum uclamp_id clamp_id)
 	return uc_req;
 }
 
+static void uclamp_rt_sync_default_util_min(struct task_struct *p)
+{
+	struct uclamp_se *uc_se = &p->uclamp_req[UCLAMP_MIN];
+
+	if (!uc_se->user_defined)
+		uclamp_se_set(uc_se, sysctl_sched_rt_default_uclamp_util_min, false);
+}
+
 unsigned long uclamp_eff_value(struct task_struct *p, enum uclamp_id clamp_id)
 {
 	struct uclamp_se uc_eff;
@@ -1030,6 +1059,12 @@  static inline void uclamp_rq_inc(struct rq *rq, struct task_struct *p)
 	if (unlikely(!p->sched_class->uclamp_enabled))
 		return;
 
+	/*
+	 * When sysctl_sched_rt_default_uclamp_util_min value is changed by the
+	 * user, we apply any new value on the next wakeup, which is here.
+	 */
+	uclamp_rt_sync_default_util_min(p);
+
 	for_each_clamp_id(clamp_id)
 		uclamp_rq_inc_id(rq, p, clamp_id);
 
@@ -1121,12 +1156,13 @@  int sysctl_sched_uclamp_handler(struct ctl_table *table, int write,
 				loff_t *ppos)
 {
 	bool update_root_tg = false;
-	int old_min, old_max;
+	int old_min, old_max, old_rt_min;
 	int result;
 
 	mutex_lock(&uclamp_mutex);
 	old_min = sysctl_sched_uclamp_util_min;
 	old_max = sysctl_sched_uclamp_util_max;
+	old_rt_min = sysctl_sched_rt_default_uclamp_util_min;
 
 	result = proc_dointvec(table, write, buffer, lenp, ppos);
 	if (result)
@@ -1134,12 +1170,23 @@  int sysctl_sched_uclamp_handler(struct ctl_table *table, int write,
 	if (!write)
 		goto done;
 
+	/*
+	 * The new value will be applied to all RT tasks the next time they
+	 * wakeup, assuming the task is using the system default and not a user
+	 * specified value. In the latter we shall leave the value as the user
+	 * requested.
+	 */
 	if (sysctl_sched_uclamp_util_min > sysctl_sched_uclamp_util_max ||
 	    sysctl_sched_uclamp_util_max > SCHED_CAPACITY_SCALE) {
 		result = -EINVAL;
 		goto undo;
 	}
 
+	if (sysctl_sched_rt_default_uclamp_util_min > SCHED_CAPACITY_SCALE) {
+		result = -EINVAL;
+		goto undo;
+	}
+
 	if (old_min != sysctl_sched_uclamp_util_min) {
 		uclamp_se_set(&uclamp_default[UCLAMP_MIN],
 			      sysctl_sched_uclamp_util_min, false);
@@ -1165,6 +1212,7 @@  int sysctl_sched_uclamp_handler(struct ctl_table *table, int write,
 undo:
 	sysctl_sched_uclamp_util_min = old_min;
 	sysctl_sched_uclamp_util_max = old_max;
+	sysctl_sched_rt_default_uclamp_util_min = old_rt_min;
 done:
 	mutex_unlock(&uclamp_mutex);
 
@@ -1207,9 +1255,13 @@  static void __setscheduler_uclamp(struct task_struct *p,
 		if (uc_se->user_defined)
 			continue;
 
-		/* By default, RT tasks always get 100% boost */
+		/*
+		 * By default, RT tasks always get 100% boost, which the admins
+		 * are allowed to change via
+		 * sysctl_sched_rt_default_uclamp_util_min knob.
+		 */
 		if (unlikely(rt_task(p) && clamp_id == UCLAMP_MIN))
-			clamp_value = uclamp_none(UCLAMP_MAX);
+			clamp_value = sysctl_sched_rt_default_uclamp_util_min;
 
 		uclamp_se_set(uc_se, clamp_value, false);
 	}
@@ -1241,9 +1293,13 @@  static void uclamp_fork(struct task_struct *p)
 	for_each_clamp_id(clamp_id) {
 		unsigned int clamp_value = uclamp_none(clamp_id);
 
-		/* By default, RT tasks always get 100% boost */
+		/*
+		 * By default, RT tasks always get 100% boost, which the admins
+		 * are allowed to change via
+		 * sysctl_sched_rt_default_uclamp_util_min knob.
+		 */
 		if (unlikely(rt_task(p) && clamp_id == UCLAMP_MIN))
-			clamp_value = uclamp_none(UCLAMP_MAX);
+			clamp_value = sysctl_sched_rt_default_uclamp_util_min;
 
 		uclamp_se_set(&p->uclamp_req[clamp_id], clamp_value, false);
 	}
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index ad5b88a53c5a..0272ae8c6147 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -465,6 +465,13 @@  static struct ctl_table kern_table[] = {
 		.mode		= 0644,
 		.proc_handler	= sysctl_sched_uclamp_handler,
 	},
+	{
+		.procname	= "sched_rt_default_util_clamp_min",
+		.data		= &sysctl_sched_rt_default_uclamp_util_min,
+		.maxlen		= sizeof(unsigned int),
+		.mode		= 0644,
+		.proc_handler	= sysctl_sched_uclamp_handler,
+	},
 #endif
 #ifdef CONFIG_SCHED_AUTOGROUP
 	{