[v11,0/5] Add utilization clamping support (CGroups API)

Message ID: 20190708084357.12944-1-patrick.bellasi@arm.com

Patrick Bellasi July 8, 2019, 8:43 a.m. UTC
Hi all, this is a follow up of:

  https://lore.kernel.org/lkml/20190621084217.8167-1-patrick.bellasi@arm.com/

to respin all the bits not yet queued by PeterZ and to address all of
Tejun's requests from the previous review:

 - remove checks for cpu_uclamp_{min,max}_write()s from root group
 - remove checks on "protections" being smaller than "limits"
 - rephrase uclamp extension description to avoid explicit
   mentioning of the bandwidth concept

the series is based on top of:

   tj/cgroup.git	for-5.3
   tip/tip.git		sched/core

I hope this version covers all major details about the expected behavior
and delegation model. The code, however, can still benefit from a closer
review; I'm looking forward to any additional feedback.

Cheers,
Patrick


Series Organization
===================

This series contains just the remaining bits of the original posting:

 - Patches [01-05]: Per task group (secondary) API

and it is based on today's tj/cgroup/for-5.3 and tip/sched/core.

The full tree is available here:

   git://linux-arm.org/linux-pb.git   lkml/utilclamp_v11
   http://www.linux-arm.org/git?p=linux-pb.git;a=shortlog;h=refs/heads/lkml/utilclamp_v11

where you can also get the patches already queued in tip/sched/core:

 - Patches [01-07]: Per task (primary) API
 - Patches [08-09]: Schedutil integration for FAIR and RT tasks
 - Patches [10-11]: Integration with EAS's energy_compute()


Newcomer's Short Abstract
=========================

The Linux scheduler tracks a "utilization" signal for each scheduling entity
(SE), e.g. tasks, to know how much CPU time they use. This signal allows the
scheduler to know how "big" a task is and, in principle, it can support
advanced task placement strategies by selecting the best CPU to run a task.
Some of these strategies are represented by the Energy Aware Scheduler [1].

When the schedutil cpufreq governor is in use, the utilization signal allows
the Linux scheduler to also drive frequency selection. The CPU utilization
signal, which represents the aggregated utilization of tasks scheduled on that
CPU, is used to select the frequency which best fits the workload generated by
the tasks.

The current translation of utilization values into a frequency selection is
simple: we go to max for RT tasks or to the minimum frequency which can
accommodate the utilization of DL+FAIR tasks.
However, utilization values by themselves cannot convey the desired
power/performance behaviors of each task as intended by user-space.
As such they are not ideally suited for task placement decisions.

Task placement and frequency selection policies in the kernel can be improved
by taking into consideration hints coming from authorized user-space elements,
like for example the Android middleware or more generally any "System
Management Software" (SMS) framework.

Utilization clamping is a mechanism which allows user-space to "clamp"
(i.e. filter) the utilization generated by RT and FAIR tasks within a
user-defined range.
The clamped utilization value can then be used, for example, to enforce a
minimum and/or maximum frequency depending on which tasks are active on a CPU.

The main use-cases for utilization clamping are:

 - boosting: better interactive response for small tasks which
   are affecting the user experience.

   Consider for example the case of a small control thread for an external
   accelerator (e.g. GPU, DSP, other devices). From the task's utilization
   alone the scheduler cannot see the task's real requirements and, since
   the utilization is small, it keeps selecting a more energy efficient
   CPU, with smaller capacity and lower frequency, thus negatively
   impacting the overall time required to complete the task's activations.

 - capping: increase energy efficiency for background tasks not affecting the
   user experience.

   Since running on a lower capacity CPU at a lower frequency is more energy
   efficient, when the completion time is not a main goal, then capping the
   utilization considered for certain (maybe big) tasks can have positive
   effects, both on energy consumption and thermal headroom.
   This feature also allows making RT tasks more energy friendly on mobile
   systems, where running them on high capacity CPUs and at the maximum
   frequency is not required.

From these two use-cases, it's worth noting that the frequency selection
biasing introduced by patches 9 and 10 of this series is just one possible
usage of utilization clamping. Another compelling extension of utilization
clamping is in helping the scheduler make task placement decisions.

Utilization is (also) a task specific property the scheduler uses to know
how much CPU bandwidth a task requires, at least as long as there is idle time.
Thus, the utilization clamp values, defined either per-task or per-task_group,
can represent tasks to the scheduler as being bigger (or smaller) than what
they actually are.

Utilization clamping thus enables interesting additional optimizations, for
example on asymmetric capacity systems like Arm big.LITTLE and DynamIQ CPUs,
where:

 - boosting: try to run small/foreground tasks on higher-capacity CPUs to
   complete them faster despite being less energy efficient.

 - capping: try to run big/background tasks on low-capacity CPUs to save
   power and thermal headroom for more important tasks.

This series does not present this additional usage of utilization clamping but
it's an integral part of the EAS feature set, where [2] is one of its main
components.

Android kernels use SchedTune, a solution similar to utilization clamping, to
bias both 'frequency selection' and 'task placement'. This series provides the
foundation to add similar features to mainline while focusing, for the
time being, just on schedutil integration.


References
==========

[1] Energy Aware Scheduling
    https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/Documentation/scheduler/sched-energy.txt?h=v5.1

[2] Expressing per-task/per-cgroup performance hints
    Linux Plumbers Conference 2018
    https://linuxplumbersconf.org/event/2/contributions/128/


Patrick Bellasi (5):
  sched/core: uclamp: Extend CPU's cgroup controller
  sched/core: uclamp: Propagate parent clamps
  sched/core: uclamp: Propagate system defaults to root group
  sched/core: uclamp: Use TG's clamps to restrict TASK's clamps
  sched/core: uclamp: Update CPU's refcount on TG's clamp changes

 Documentation/admin-guide/cgroup-v2.rst |  30 +++
 init/Kconfig                            |  22 ++
 kernel/sched/core.c                     | 335 +++++++++++++++++++++++-
 kernel/sched/sched.h                    |   8 +
 4 files changed, 392 insertions(+), 3 deletions(-)

Comments

Michal Koutný July 15, 2019, 4:51 p.m. UTC | #1
Hello Patrick.

I took a look at your series and I've posted some notes to your patches.

One applies more to the series overall -- I see there is enum uclamp_id
defined but at many places (local variables, function args) int or
unsigned int is used. Besides the inconsistency, I think it'd be nice to
use the enum at these places.

(Also, I may suggest CCing ML cgroups@vger.kernel.org where more eyes
may be available to the cgroup part of your series.)

Michal
Patrick Bellasi July 16, 2019, 2:03 p.m. UTC | #2
On 15-Jul 18:51, Michal Koutný wrote:
> Hello Patrick.

Hi Michal,

> I took a look at your series and I've posted some notes to your patches.

thanks for your review!

> One applies more to the series overall -- I see there is enum uclamp_id
> defined but at many places (local variables, function args) int or
> unsigned int is used. Besides the inconsistency, I think it'd be nice to
> use the enum at these places.

Right, I think in some of the original versions I had a few code paths
where it was not possible to use enum values. That no longer seems to be
the case.

Since this change will likely also affect core bits already merged in
5.3, in v12 I'm going to add a bulk rename patch at the end of the
series, so that we can better track this change.

> (Also, I may suggest CCing ML cgroups@vger.kernel.org where more eyes
> may be available to the cgroup part of your series.)

Good point, I'll add that for the upcoming v12 posting.

Cheers,
Patrick