From patchwork Tue Aug 28 13:53:09 2018
X-Patchwork-Submitter: Patrick Bellasi
X-Patchwork-Id: 10578563
From: Patrick Bellasi
To: linux-kernel@vger.kernel.org, linux-pm@vger.kernel.org
Cc: Ingo Molnar, Peter Zijlstra, Tejun Heo, "Rafael J. Wysocki", Viresh Kumar, Vincent Guittot, Paul Turner, Quentin Perret, Dietmar Eggemann, Morten Rasmussen, Juri Lelli, Todd Kjos, Joel Fernandes, Steve Muckle, Suren Baghdasaryan, Randy Dunlap, linux-api@vger.kernel.org
Subject: [PATCH v4 01/16] sched/core: uclamp: extend sched_setattr to support utilization clamping
Date: Tue, 28 Aug 2018 14:53:09 +0100
Message-Id: <20180828135324.21976-2-patrick.bellasi@arm.com>
In-Reply-To: <20180828135324.21976-1-patrick.bellasi@arm.com>
References: <20180828135324.21976-1-patrick.bellasi@arm.com>

The SCHED_DEADLINE scheduling class provides an advanced and formal model to define task requirements which can be translated into proper decisions for both task placement and frequency selection. Other classes have a simpler model, essentially based on the POSIX concept of priorities.

Such a simple priority-based model, however, does not allow exploiting some of the most advanced features of the Linux scheduler, such as driving frequency selection via the schedutil cpufreq governor. Still, even for non-SCHED_DEADLINE tasks it is interesting to define task properties which can be used to better support certain scheduler decisions.
Utilization clamping aims to expose to user space a new set of per-task attributes which can be used to provide the scheduler with hints about the expected/required utilization of a task. This allows the implementation of a more advanced per-task frequency control mechanism which is not based just on a "passive" measured task utilization but on a more "active" approach. For example, it becomes possible to boost interactive tasks, thus getting better performance, or to cap background tasks, thus being more energy efficient. Ultimately, such a mechanism can be considered similar to cpufreq's powersave, performance and userspace governors, but with much finer-grained, per-task control.

Let's introduce a new API to set utilization clamping values for a specified task by extending sched_setattr, a syscall which already allows defining task-specific properties for different scheduling classes. Specifically, a new pair of attributes allows specifying a minimum and a maximum utilization which the scheduler should consider for a task.

Signed-off-by: Patrick Bellasi
Cc: Ingo Molnar
Cc: Peter Zijlstra
Cc: Tejun Heo
Cc: Rafael J. Wysocki
Cc: Vincent Guittot
Cc: Viresh Kumar
Cc: Randy Dunlap
Cc: Paul Turner
Cc: Suren Baghdasaryan
Cc: Todd Kjos
Cc: Joel Fernandes
Cc: Steve Muckle
Cc: Juri Lelli
Cc: Quentin Perret
Cc: Dietmar Eggemann
Cc: Morten Rasmussen
Cc: linux-kernel@vger.kernel.org
Cc: linux-pm@vger.kernel.org
Cc: linux-api@vger.kernel.org

---
Changes in v4:
 Message-ID: <87897157-0b49-a0be-f66c-81cc2942b4dd@infradead.org>
 - remove a default setting which is not required
 - fixed some tabs/spaces
 Message-ID: <20180807095905.GB2288@localhost.localdomain>
 - replace/rephrase "bandwidth" references to use "capacity"
 - better stress that this does not enforce any bandwidth requirement
   but "just" gives hints to the scheduler
 - fixed some typos
 Others:
 - add support for SCHED_FLAG_RESET_ON_FORK
   default clamps are now set for init_task and inherited/reset at
   fork time (when the flag is set for the parent)
 - rebased on v4.19-rc1

Changes in v3:
 Message-ID:
 - removed UCLAMP_NONE, not used by this patch
 Others:
 - rebased on tip/sched/core

Changes in v2:
 - rebased on v4.18-rc4
 - moved to the head of the series

 As discussed at OSPM, using a [0..SCHED_CAPACITY_SCALE] range seems to be
 acceptable. However, an additional patch has been added at the end of the
 series which introduces a simple abstraction to use a more generic
 [0..100] range.

 At OSPM we also discarded the idea of "recycling" the usage of
 sched_runtime and sched_period, which would have made the API too complex
 for limited benefits.
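For reference, here is a minimal user-space sketch (not part of the patch) showing how the new attributes could be used. It assumes the sched_attr layout and the SCHED_FLAG_UTIL_CLAMP value defined by this patch (later patches in the series rework the flag), uses the raw syscall since glibc provides no wrapper, and the clamp values chosen are arbitrary examples.

#define _GNU_SOURCE
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>

/* Local copy of the extended sched_attr layout added by this patch */
struct sched_attr {
	uint32_t size;
	uint32_t sched_policy;
	uint64_t sched_flags;
	int32_t  sched_nice;
	uint32_t sched_priority;
	uint64_t sched_runtime;
	uint64_t sched_deadline;
	uint64_t sched_period;
	/* Utilization hints introduced by this patch */
	uint32_t sched_util_min;
	uint32_t sched_util_max;
};

/* Flag value as defined by this patch (reworked by later patches) */
#define SCHED_FLAG_UTIL_CLAMP	0x08

int main(void)
{
	struct sched_attr attr;

	memset(&attr, 0, sizeof(attr));
	attr.size = sizeof(attr);
	attr.sched_policy = 0;			/* SCHED_NORMAL */
	attr.sched_flags = SCHED_FLAG_UTIL_CLAMP;
	attr.sched_util_min = 200;		/* boost hint: ~20% of SCHED_CAPACITY_SCALE */
	attr.sched_util_max = 1024;		/* no upper clamp */

	/* pid 0: apply to the calling task; the last argument is flags */
	if (syscall(SYS_sched_setattr, 0, &attr, 0))
		perror("sched_setattr");

	return 0;
}

A task setting util_min=200 this way hints schedutil not to run it below roughly 20% of the highest CPU capacity; as stressed above, this is a hint, not a bandwidth guarantee.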
--- include/linux/sched.h | 13 +++++++ include/uapi/linux/sched.h | 4 +- include/uapi/linux/sched/types.h | 66 +++++++++++++++++++++++++++----- init/Kconfig | 21 ++++++++++ init/init_task.c | 5 +++ kernel/sched/core.c | 39 +++++++++++++++++++ 6 files changed, 138 insertions(+), 10 deletions(-) diff --git a/include/linux/sched.h b/include/linux/sched.h index 977cb57d7bc9..880a0c5c1f87 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -279,6 +279,14 @@ struct vtime { u64 gtime; }; +enum uclamp_id { + UCLAMP_MIN = 0, /* Minimum utilization */ + UCLAMP_MAX, /* Maximum utilization */ + + /* Utilization clamping constraints count */ + UCLAMP_CNT +}; + struct sched_info { #ifdef CONFIG_SCHED_INFO /* Cumulative counters: */ @@ -649,6 +657,11 @@ struct task_struct { #endif struct sched_dl_entity dl; +#ifdef CONFIG_UCLAMP_TASK + /* Utlization clamp values for this task */ + int uclamp[UCLAMP_CNT]; +#endif + #ifdef CONFIG_PREEMPT_NOTIFIERS /* List of struct preempt_notifier: */ struct hlist_head preempt_notifiers; diff --git a/include/uapi/linux/sched.h b/include/uapi/linux/sched.h index 22627f80063e..c27d6e81517b 100644 --- a/include/uapi/linux/sched.h +++ b/include/uapi/linux/sched.h @@ -50,9 +50,11 @@ #define SCHED_FLAG_RESET_ON_FORK 0x01 #define SCHED_FLAG_RECLAIM 0x02 #define SCHED_FLAG_DL_OVERRUN 0x04 +#define SCHED_FLAG_UTIL_CLAMP 0x08 #define SCHED_FLAG_ALL (SCHED_FLAG_RESET_ON_FORK | \ SCHED_FLAG_RECLAIM | \ - SCHED_FLAG_DL_OVERRUN) + SCHED_FLAG_DL_OVERRUN | \ + SCHED_FLAG_UTIL_CLAMP) #endif /* _UAPI_LINUX_SCHED_H */ diff --git a/include/uapi/linux/sched/types.h b/include/uapi/linux/sched/types.h index 10fbb8031930..7512b5934013 100644 --- a/include/uapi/linux/sched/types.h +++ b/include/uapi/linux/sched/types.h @@ -21,8 +21,33 @@ struct sched_param { * the tasks may be useful for a wide variety of application fields, e.g., * multimedia, streaming, automation and control, and many others. * - * This variant (sched_attr) is meant at describing a so-called - * sporadic time-constrained task. In such model a task is specified by: + * This variant (sched_attr) allows to define additional attributes to + * improve the scheduler knowledge about task requirements. + * + * Scheduling Class Attributes + * =========================== + * + * A subset of sched_attr attributes specifies the + * scheduling policy and relative POSIX attributes: + * + * @size size of the structure, for fwd/bwd compat. + * + * @sched_policy task's scheduling policy + * @sched_nice task's nice value (SCHED_NORMAL/BATCH) + * @sched_priority task's static priority (SCHED_FIFO/RR) + * + * Certain more advanced scheduling features can be controlled by a + * predefined set of flags via the attribute: + * + * @sched_flags for customizing the scheduler behaviour + * + * Sporadic Time-Constrained Tasks Attributes + * ========================================== + * + * A subset of sched_attr attributes allows to describe a so-called + * sporadic time-constrained task. + * + * In such model a task is specified by: * - the activation period or minimum instance inter-arrival time; * - the maximum (or average, depending on the actual scheduling * discipline) computation time of all instances, a.k.a. runtime; @@ -34,14 +59,8 @@ struct sched_param { * than the runtime and must be completed by time instant t equal to * the instance activation time + the deadline. 
* - * This is reflected by the actual fields of the sched_attr structure: + * This is reflected by the following fields of the sched_attr structure: * - * @size size of the structure, for fwd/bwd compat. - * - * @sched_policy task's scheduling policy - * @sched_flags for customizing the scheduler behaviour - * @sched_nice task's nice value (SCHED_NORMAL/BATCH) - * @sched_priority task's static priority (SCHED_FIFO/RR) * @sched_deadline representative of the task's deadline * @sched_runtime representative of the task's runtime * @sched_period representative of the task's period @@ -53,6 +72,30 @@ struct sched_param { * As of now, the SCHED_DEADLINE policy (sched_dl scheduling class) is the * only user of this new interface. More information about the algorithm * available in the scheduling class file or in Documentation/. + * + * Task Utilization Attributes + * =========================== + * + * A subset of sched_attr attributes allows to specify the utilization which + * should be expected by a task. These attributes allow to inform the + * scheduler about the utilization boundaries within which it is expected to + * schedule the task. These boundaries are valuable hints to support scheduler + * decisions on both task placement and frequencies selection. + * + * @sched_util_min represents the minimum utilization + * @sched_util_max represents the maximum utilization + * + * Utilization is a value in the range [0..SCHED_CAPACITY_SCALE] which + * represents the percentage of CPU time used by a task when running at the + * maximum frequency on the highest capacity CPU of the system. Thus, for + * example, a 20% utilization task is a task running for 2ms every 10ms. + * + * A task with a min utilization value bigger then 0 is more likely to be + * scheduled on a CPU which has a capacity big enough to fit the specified + * minimum utilization value. + * A task with a max utilization value smaller then 1024 is more likely to be + * scheduled on a CPU which do not necessarily have more capacity then the + * specified max utilization value. */ struct sched_attr { __u32 size; @@ -70,6 +113,11 @@ struct sched_attr { __u64 sched_runtime; __u64 sched_deadline; __u64 sched_period; + + /* Utilization hints */ + __u32 sched_util_min; + __u32 sched_util_max; + }; #endif /* _UAPI_LINUX_SCHED_TYPES_H */ diff --git a/init/Kconfig b/init/Kconfig index 1e234e2f1cba..738974c4f628 100644 --- a/init/Kconfig +++ b/init/Kconfig @@ -613,6 +613,27 @@ config HAVE_UNSTABLE_SCHED_CLOCK config GENERIC_SCHED_CLOCK bool +menu "Scheduler features" + +config UCLAMP_TASK + bool "Enable utilization clamping for RT/FAIR tasks" + depends on CPU_FREQ_GOV_SCHEDUTIL + help + This feature enables the scheduler to track the clamped utilization + of each CPU based on RUNNABLE tasks currently scheduled on that CPU. + + When this option is enabled, the user can specify a min and max CPU + utilization which is allowed for RUNNABLE tasks. + The max utilization allows to request a maximum frequency a task should + use, while the min utilization allows to request a minimum frequency a + task should use. + Both min and max utilization clamp values are hints to the scheduler, + aiming at improving its frequency selection policy, but they do not + enforce or grant any specific bandwidth for tasks. + + If in doubt, say N. 
+ +endmenu # # For architectures that want to enable the support for NUMA-affine scheduler # balancing logic: diff --git a/init/init_task.c b/init/init_task.c index 5aebe3be4d7c..5bfdcc3fb839 100644 --- a/init/init_task.c +++ b/init/init_task.c @@ -6,6 +6,7 @@ #include #include #include +#include #include #include #include @@ -91,6 +92,10 @@ struct task_struct init_task #endif #ifdef CONFIG_CGROUP_SCHED .sched_task_group = &root_task_group, +#endif +#ifdef CONFIG_UCLAMP_TASK + .uclamp[UCLAMP_MIN] = 0, + .uclamp[UCLAMP_MAX] = SCHED_CAPACITY_SCALE, #endif .ptraced = LIST_HEAD_INIT(init_task.ptraced), .ptrace_entry = LIST_HEAD_INIT(init_task.ptrace_entry), diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 625bc9897f62..16d3544c7ffa 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -716,6 +716,28 @@ static void set_load_weight(struct task_struct *p, bool update_load) } } +#ifdef CONFIG_UCLAMP_TASK +static inline int __setscheduler_uclamp(struct task_struct *p, + const struct sched_attr *attr) +{ + if (attr->sched_util_min > attr->sched_util_max) + return -EINVAL; + if (attr->sched_util_max > SCHED_CAPACITY_SCALE) + return -EINVAL; + + p->uclamp[UCLAMP_MIN] = attr->sched_util_min; + p->uclamp[UCLAMP_MAX] = attr->sched_util_max; + + return 0; +} +#else /* CONFIG_UCLAMP_TASK */ +static inline int __setscheduler_uclamp(struct task_struct *p, + const struct sched_attr *attr) +{ + return -EINVAL; +} +#endif /* CONFIG_UCLAMP_TASK */ + static inline void enqueue_task(struct rq *rq, struct task_struct *p, int flags) { if (!(flags & ENQUEUE_NOCLOCK)) @@ -2320,6 +2342,11 @@ int sched_fork(unsigned long clone_flags, struct task_struct *p) p->prio = p->normal_prio = __normal_prio(p); set_load_weight(p, false); +#ifdef CONFIG_UCLAMP_TASK + p->uclamp[UCLAMP_MIN] = 0; + p->uclamp[UCLAMP_MAX] = SCHED_CAPACITY_SCALE; +#endif + /* * We don't need the reset flag anymore after the fork. 
It has * fulfilled its duty: @@ -4215,6 +4242,13 @@ static int __sched_setscheduler(struct task_struct *p, return retval; } + /* Configure utilization clamps for the task */ + if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP) { + retval = __setscheduler_uclamp(p, attr); + if (retval) + return retval; + } + /* * Make sure no PI-waiters arrive (or leave) while we are * changing the priority of the task: @@ -4721,6 +4755,11 @@ SYSCALL_DEFINE4(sched_getattr, pid_t, pid, struct sched_attr __user *, uattr, else attr.sched_nice = task_nice(p); +#ifdef CONFIG_UCLAMP_TASK + attr.sched_util_min = p->uclamp[UCLAMP_MIN]; + attr.sched_util_max = p->uclamp[UCLAMP_MAX]; +#endif + rcu_read_unlock(); retval = sched_read_attr(uattr, &attr, size); From patchwork Tue Aug 28 13:53:10 2018 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Patrick Bellasi X-Patchwork-Id: 10578565 Return-Path: Received: from mail.wl.linuxfoundation.org (pdx-wl-mail.web.codeaurora.org [172.30.200.125]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 6ECF9174A for ; Tue, 28 Aug 2018 13:54:02 +0000 (UTC) Received: from mail.wl.linuxfoundation.org (localhost [127.0.0.1]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 5D15229C65 for ; Tue, 28 Aug 2018 13:54:02 +0000 (UTC) Received: by mail.wl.linuxfoundation.org (Postfix, from userid 486) id 5074229FE7; Tue, 28 Aug 2018 13:54:02 +0000 (UTC) X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on pdx-wl-mail.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-7.9 required=2.0 tests=BAYES_00,MAILING_LIST_MULTI, RCVD_IN_DNSWL_HI autolearn=ham version=3.3.1 Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 80E2829E7A for ; Tue, 28 Aug 2018 13:54:00 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1727028AbeH1Rpq (ORCPT ); Tue, 28 Aug 2018 13:45:46 -0400 Received: from usa-sjc-mx-foss1.foss.arm.com ([217.140.101.70]:38414 "EHLO foss.arm.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1726439AbeH1Rpp (ORCPT ); Tue, 28 Aug 2018 13:45:45 -0400 Received: from usa-sjc-imap-foss1.foss.arm.com (unknown [10.72.51.249]) by usa-sjc-mx-foss1.foss.arm.com (Postfix) with ESMTP id 7138815BF; Tue, 28 Aug 2018 06:53:58 -0700 (PDT) Received: from e110439-lin.Cambridge.arm.com (e110439-lin.emea.arm.com [10.4.12.126]) by usa-sjc-imap-foss1.foss.arm.com (Postfix) with ESMTPA id 5AECB3F5BD; Tue, 28 Aug 2018 06:53:55 -0700 (PDT) From: Patrick Bellasi To: linux-kernel@vger.kernel.org, linux-pm@vger.kernel.org Cc: Ingo Molnar , Peter Zijlstra , Tejun Heo , "Rafael J . Wysocki" , Viresh Kumar , Vincent Guittot , Paul Turner , Quentin Perret , Dietmar Eggemann , Morten Rasmussen , Juri Lelli , Todd Kjos , Joel Fernandes , Steve Muckle , Suren Baghdasaryan Subject: [PATCH v4 02/16] sched/core: uclamp: map TASK's clamp values into CPU's clamp groups Date: Tue, 28 Aug 2018 14:53:10 +0100 Message-Id: <20180828135324.21976-3-patrick.bellasi@arm.com> X-Mailer: git-send-email 2.18.0 In-Reply-To: <20180828135324.21976-1-patrick.bellasi@arm.com> References: <20180828135324.21976-1-patrick.bellasi@arm.com> Sender: linux-pm-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-pm@vger.kernel.org X-Virus-Scanned: ClamAV using ClamSMTP Utilization clamping requires each CPU to know which clamp values are assigned to tasks that are currently RUNNABLE on that CPU. 
Multiple tasks can be assigned the same clamp value, and tasks with different clamp values can be concurrently active on the same CPU. Thus, a proper data structure is required to support a fast and efficient aggregation of the clamp values required by the currently RUNNABLE tasks. For this purpose we use a per-CPU array of reference counters, where each slot is used to account for how many tasks requiring a certain clamp value are currently RUNNABLE on each CPU. Each clamp value corresponds to a "clamp index" which identifies the position within the array of reference counters.

 :              (user-space changes)            :      (kernel space / scheduler)
 :                                              :
 :               SLOW PATH                      :             FAST PATH
 :                                              :
 :    task_struct::uclamp::value                :    sched/core::enqueue/dequeue
 :                                              :         cpufreq_schedutil
 :
 +----------------+    +--------------------+    +-------------------+
 |      TASK      |    |     CLAMP GROUP    |    |    CPU CLAMPS     |
 +----------------+    +--------------------+    +-------------------+
 |                |    |   clamp_{min,max}  |    |  clamp_{min,max}  |
 | util_{min,max} |    |      se_count      |    |    tasks count    |
 +----------------+    +--------------------+    +-------------------+
 :
 :          +------------------>                +------------------->
 :          group_id = map(clamp_value)         ref_count(group_id)
 :
 :

Let's introduce the support to map tasks to "clamp groups". Specifically, we introduce the required functions to translate a "clamp value" into a clamp "group index" (group_id).

Only a limited number of (different) clamp values is supported since:
1. there are usually only a few classes of workloads for which it makes
   sense to boost/limit to different frequencies, e.g. background vs
   foreground, interactive vs low-priority
2. it allows a simpler and more memory/time efficient tracking of the
   per-CPU clamp values in the fast path.

The number of possible different clamp values is currently defined at compile time. Thus, setting a new clamp value for a task can result in a -ENOSPC error in case this would exceed the maximum number of different clamp values supported.
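To make the mapping above more concrete, here is a simplified, single-threaded sketch of the clamp-value to clamp-group translation and se_count refcounting. Names mirror the patch (uclamp_map, se_count, UCLAMP_NOT_VALID), but this is only an illustration under assumptions, not the kernel code: it drops the per-entry locking, the per-CPU accounting side and the reserved default group, and UCLAMP_GROUPS_COUNT is just a stand-in for CONFIG_UCLAMP_GROUPS_COUNT.

#include <stdio.h>

#define UCLAMP_GROUPS_COUNT	5	/* stand-in for CONFIG_UCLAMP_GROUPS_COUNT */
#define UCLAMP_NOT_VALID	-1

struct uclamp_map {
	int value;	/* clamp value tracked by this group, or UCLAMP_NOT_VALID */
	int se_count;	/* how many scheduling entities reference this value */
};

/* One row of the uclamp_maps matrix, e.g. the UCLAMP_MIN row */
static struct uclamp_map uc_map[UCLAMP_GROUPS_COUNT + 1] = {
	[0 ... UCLAMP_GROUPS_COUNT] = { .value = UCLAMP_NOT_VALID },
};

/* Return the group_id tracking @clamp_value, reusing a free slot if needed */
static int group_find(int clamp_value)
{
	int free_group_id = UCLAMP_NOT_VALID;
	int group_id;

	for (group_id = 0; group_id <= UCLAMP_GROUPS_COUNT; ++group_id) {
		/* Keep track of the first free clamp group */
		if (uc_map[group_id].value == UCLAMP_NOT_VALID) {
			if (free_group_id == UCLAMP_NOT_VALID)
				free_group_id = group_id;
			continue;
		}
		/* Return the first group already tracking this value */
		if (uc_map[group_id].value == clamp_value)
			return group_id;
	}
	return free_group_id;	/* UCLAMP_NOT_VALID means "no space left" */
}

/* Reference a clamp value: map it to a group and bump that group's count */
static int group_get(int clamp_value)
{
	int group_id = group_find(clamp_value);

	if (group_id == UCLAMP_NOT_VALID)
		return -1;	/* would be -ENOSPC in the kernel */
	uc_map[group_id].value = clamp_value;
	uc_map[group_id].se_count++;
	return group_id;
}

/* Drop a reference; a group with no users becomes available for a new value */
static void group_put(int group_id)
{
	if (group_id == UCLAMP_NOT_VALID || !uc_map[group_id].se_count)
		return;
	if (--uc_map[group_id].se_count == 0)
		uc_map[group_id].value = UCLAMP_NOT_VALID;
}

int main(void)
{
	int a = group_get(200);	/* two tasks asking for the same boost ... */
	int b = group_get(200);	/* ... share a single clamp group */
	int c = group_get(512);	/* a different value gets its own group */

	printf("group(200)=%d group(200)=%d group(512)=%d\n", a, b, c);
	group_put(a);
	group_put(b);		/* last user: the group is recycled */
	group_put(c);
	return 0;
}

With all groups already in use by different values, a further value gets no group and, in the kernel, the request fails with -ENOSPC as described above.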
Signed-off-by: Patrick Bellasi
Cc: Ingo Molnar
Cc: Peter Zijlstra
Cc: Paul Turner
Cc: Suren Baghdasaryan
Cc: Todd Kjos
Cc: Joel Fernandes
Cc: Juri Lelli
Cc: Quentin Perret
Cc: Dietmar Eggemann
Cc: Morten Rasmussen
Cc: linux-kernel@vger.kernel.org
Cc: linux-pm@vger.kernel.org

---
Changes in v4:
 Message-ID: <20180814112509.GB2661@codeaurora.org>
 - add uclamp_exit_task() to release clamp refcount from do_exit()
 Message-ID: <20180816133249.GA2964@e110439-lin>
 - keep the WARN but beautify that code a bit
 Message-ID: <20180413082648.GP4043@hirez.programming.kicks-ass.net>
 - move uclamp_enabled to the top of sched_class to keep it on the same
   cache line as the other main wakeup-time callbacks
 Others:
 - init uclamp for the init_task and refcount its clamp groups
 - add uclamp-specific fork time code into uclamp_fork
 - add support for SCHED_FLAG_RESET_ON_FORK
   default clamps are now set for init_task and inherited/reset at
   fork time (when the flag is set for the parent)
 - enable uclamp only for FAIR tasks; the RT class will be enabled only
   by a following patch which also integrates the class with schedutil
 - define uclamp_maps ____cacheline_aligned_in_smp
 - in uclamp_group_get() ensure that uclamp_group_available() and
   uclamp_group_init() are included in the atomic section defined by:
   uc_map[next_group_id].se_lock
 - do not use mutex_lock(&uclamp_mutex) in uclamp_exit_task, which is
   not needed since refcounting is already guarded by the
   uc_map[group_id].se_lock spinlock
 - rebased on v4.19-rc1

Changes in v3:
 Message-ID:
 - rename UCLAMP_NONE into UCLAMP_NOT_VALID
 - remove unnecessary checks in uclamp_group_find()
 - add WARN on unlikely un-referenced decrement in uclamp_group_put()
 - make __setscheduler_uclamp() able to set just one clamp value
 - make __setscheduler_uclamp() fail if both clamps are required but
   no clamp group is available for one of them
 - remove uclamp_group_find() from uclamp_group_get(), which now takes a
   group_id as a parameter
 Others:
 - rebased on tip/sched/core

Changes in v2:
 - rebased on v4.18-rc4
 - set UCLAMP_GROUPS_COUNT=2 by default, which allows fitting all the
   hot-path CPU clamp data, partially introduced also by the following
   patches, into a single cache line while still supporting up to 2
   different {min,max}_util clamps.

--- include/linux/sched.h | 16 +- include/linux/sched/task.h | 6 + include/uapi/linux/sched.h | 6 +- init/Kconfig | 20 ++ init/init_task.c | 4 - kernel/exit.c | 1 + kernel/sched/core.c | 395 +++++++++++++++++++++++++++++++++++-- kernel/sched/fair.c | 4 + kernel/sched/sched.h | 28 ++- 9 files changed, 456 insertions(+), 24 deletions(-) diff --git a/include/linux/sched.h b/include/linux/sched.h index 880a0c5c1f87..7385f0b1a7c0 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -279,6 +279,9 @@ struct vtime { u64 gtime; }; +/* Clamp not valid, i.e. group not assigned or invalid value */ +#define UCLAMP_NOT_VALID -1 + enum uclamp_id { UCLAMP_MIN = 0, /* Minimum utilization */ UCLAMP_MAX, /* Maximum utilization */ @@ -575,6 +578,17 @@ struct sched_dl_entity { struct hrtimer inactive_timer; }; +/** + * Utilization's clamp group + * + * A utilization clamp group maps a "clamp value" (value), i.e. + * util_{min,max}, to a "clamp group index" (group_id).
+ */ +struct uclamp_se { + unsigned int value; + unsigned int group_id; +}; + union rcu_special { struct { u8 blocked; @@ -659,7 +673,7 @@ struct task_struct { #ifdef CONFIG_UCLAMP_TASK /* Utlization clamp values for this task */ - int uclamp[UCLAMP_CNT]; + struct uclamp_se uclamp[UCLAMP_CNT]; #endif #ifdef CONFIG_PREEMPT_NOTIFIERS diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h index 108ede99e533..36c81c364112 100644 --- a/include/linux/sched/task.h +++ b/include/linux/sched/task.h @@ -68,6 +68,12 @@ static inline void exit_thread(struct task_struct *tsk) #endif extern void do_group_exit(int); +#ifdef CONFIG_UCLAMP_TASK +extern void uclamp_exit_task(struct task_struct *p); +#else +static inline void uclamp_exit_task(struct task_struct *p) { } +#endif /* CONFIG_UCLAMP_TASK */ + extern void exit_files(struct task_struct *); extern void exit_itimers(struct signal_struct *); diff --git a/include/uapi/linux/sched.h b/include/uapi/linux/sched.h index c27d6e81517b..ae7e12de32ca 100644 --- a/include/uapi/linux/sched.h +++ b/include/uapi/linux/sched.h @@ -50,7 +50,11 @@ #define SCHED_FLAG_RESET_ON_FORK 0x01 #define SCHED_FLAG_RECLAIM 0x02 #define SCHED_FLAG_DL_OVERRUN 0x04 -#define SCHED_FLAG_UTIL_CLAMP 0x08 + +#define SCHED_FLAG_UTIL_CLAMP_MIN 0x10 +#define SCHED_FLAG_UTIL_CLAMP_MAX 0x20 +#define SCHED_FLAG_UTIL_CLAMP (SCHED_FLAG_UTIL_CLAMP_MIN | \ + SCHED_FLAG_UTIL_CLAMP_MAX) #define SCHED_FLAG_ALL (SCHED_FLAG_RESET_ON_FORK | \ SCHED_FLAG_RECLAIM | \ diff --git a/init/Kconfig b/init/Kconfig index 738974c4f628..10536cb83295 100644 --- a/init/Kconfig +++ b/init/Kconfig @@ -633,7 +633,27 @@ config UCLAMP_TASK If in doubt, say N. +config UCLAMP_GROUPS_COUNT + int "Number of different utilization clamp values supported" + range 0 32 + default 5 + depends on UCLAMP_TASK + help + This defines the maximum number of different utilization clamp + values which can be concurrently enforced for each utilization + clamp index (i.e. minimum and maximum utilization). + + Only a limited number of clamp values are supported because: + 1. there are usually only few classes of workloads for which it + makes sense to boost/cap for different frequencies, + e.g. background vs foreground, interactive vs low-priority. + 2. it allows a simpler and more memory/time efficient tracking of + the per-CPU clamp values. + + If in doubt, use the default value. 
+ endmenu + # # For architectures that want to enable the support for NUMA-affine scheduler # balancing logic: diff --git a/init/init_task.c b/init/init_task.c index 5bfdcc3fb839..7f77741b6a9b 100644 --- a/init/init_task.c +++ b/init/init_task.c @@ -92,10 +92,6 @@ struct task_struct init_task #endif #ifdef CONFIG_CGROUP_SCHED .sched_task_group = &root_task_group, -#endif -#ifdef CONFIG_UCLAMP_TASK - .uclamp[UCLAMP_MIN] = 0, - .uclamp[UCLAMP_MAX] = SCHED_CAPACITY_SCALE, #endif .ptraced = LIST_HEAD_INIT(init_task.ptraced), .ptrace_entry = LIST_HEAD_INIT(init_task.ptrace_entry), diff --git a/kernel/exit.c b/kernel/exit.c index 0e21e6d21f35..feb540558051 100644 --- a/kernel/exit.c +++ b/kernel/exit.c @@ -877,6 +877,7 @@ void __noreturn do_exit(long code) sched_autogroup_exit_task(tsk); cgroup_exit(tsk); + uclamp_exit_task(tsk); /* * FIXME: do that only when needed, using sched_exit tracepoint diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 16d3544c7ffa..2668990b96d1 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -717,25 +717,389 @@ static void set_load_weight(struct task_struct *p, bool update_load) } #ifdef CONFIG_UCLAMP_TASK +/** + * uclamp_mutex: serializes updates of utilization clamp values + * + * A utilization clamp value update is usually triggered from a user-space + * process (slow-path) but it requires a synchronization with the scheduler's + * (fast-path) enqueue/dequeue operations. + * While the fast-path synchronization is protected by RQs spinlock, this + * mutex ensures that we sequentially serve user-space requests. + */ +static DEFINE_MUTEX(uclamp_mutex); + +/** + * uclamp_map: reference counts a utilization "clamp value" + * @value: the utilization "clamp value" required + * @se_count: the number of scheduling entities requiring the "clamp value" + * @se_lock: serialize reference count updates by protecting se_count + */ +struct uclamp_map { + int value; + int se_count; + raw_spinlock_t se_lock; +}; + +/** + * uclamp_maps: maps each SEs "clamp value" into a CPUs "clamp group" + * + * Since only a limited number of different "clamp values" are supported, we + * need to map each different clamp value into a "clamp group" (group_id) to + * be used by the per-CPU accounting in the fast-path, when tasks are + * enqueued and dequeued. + * We also support different kind of utilization clamping, min and max + * utilization for example, each representing what we call a "clamp index" + * (clamp_id). + * + * A matrix is thus required to map "clamp values" to "clamp groups" + * (group_id), for each "clamp index" (clamp_id), where: + * - rows are indexed by clamp_id and they collect the clamp groups for a + * given clamp index + * - columns are indexed by group_id and they collect the clamp values which + * maps to that clamp group + * + * Thus, the column index of a given (clamp_id, value) pair represents the + * clamp group (group_id) used by the fast-path's per-CPU accounting. + * + * NOTE: first clamp group (group_id=0) is reserved for tracking of non + * clamped tasks. Thus we allocate one more slot than the value of + * CONFIG_UCLAMP_GROUPS_COUNT. + * + * Here is the map layout and, right below, how entries are accessed by the + * following code. + * + * uclamp_maps is a matrix of + * +------- UCLAMP_CNT by CONFIG_UCLAMP_GROUPS_COUNT+1 entries + * | | + * | /---------------+---------------\ + * | +------------+ +------------+ + * | / UCLAMP_MIN | value | | value | + * | | | se_count |...... 
| se_count | + * | | +------------+ +------------+ + * +--+ +------------+ +------------+ + * | | value | | value | + * \ UCLAMP_MAX | se_count |...... | se_count | + * +-----^------+ +----^-------+ + * | | + * uc_map = + | + * &uclamp_maps[clamp_id][0] + + * clamp_value = + * uc_map[group_id].value + */ +static struct uclamp_map uclamp_maps[UCLAMP_CNT] + [CONFIG_UCLAMP_GROUPS_COUNT + 1] + ____cacheline_aligned_in_smp; + +#define UCLAMP_ENOSPC_FMT "Cannot allocate more than " \ + __stringify(CONFIG_UCLAMP_GROUPS_COUNT) " UTIL_%s clamp groups\n" + +/** + * uclamp_group_available: checks if a clamp group is available + * @clamp_id: the utilization clamp index (i.e. min or max clamp) + * @group_id: the group index in the given clamp_id + * + * A clamp group is not free if there is at least one SE which is sing a clamp + * value mapped on the specified clamp_id. These SEs are reference counted by + * the se_count of a uclamp_map entry. + * + * Return: true if there are no SE's mapped on the specified clamp + * index and group + */ +static inline bool uclamp_group_available(int clamp_id, int group_id) +{ + struct uclamp_map *uc_map = &uclamp_maps[clamp_id][0]; + + return (uc_map[group_id].value == UCLAMP_NOT_VALID); +} + +/** + * uclamp_group_init: maps a clamp value on a specified clamp group + * @clamp_id: the utilization clamp index (i.e. min or max clamp) + * @group_id: the group index to map a given clamp_value + * @clamp_value: the utilization clamp value to map + * + * Initializes a clamp group to track tasks from the fast-path. + * Each different clamp value, for a given clamp index (i.e. min/max + * utilization clamp), is mapped by a clamp group which index is used by the + * fast-path code to keep track of RUNNABLE tasks requiring a certain clamp + * value. + * + */ +static inline void uclamp_group_init(int clamp_id, int group_id, + unsigned int clamp_value) +{ + struct uclamp_map *uc_map = &uclamp_maps[clamp_id][0]; + + uc_map[group_id].value = clamp_value; + uc_map[group_id].se_count = 0; +} + +/** + * uclamp_group_reset: resets a specified clamp group + * @clamp_id: the utilization clamp index (i.e. min or max clamping) + * @group_id: the group index to release + * + * A clamp group can be reset every time there are no more task groups using + * the clamp value it maps for a given clamp index. + */ +static inline void uclamp_group_reset(int clamp_id, int group_id) +{ + uclamp_group_init(clamp_id, group_id, UCLAMP_NOT_VALID); +} + +/** + * uclamp_group_find: finds the group index of a utilization clamp group + * @clamp_id: the utilization clamp index (i.e. min or max clamping) + * @clamp_value: the utilization clamping value lookup for + * + * Verify if a group has been assigned to a certain clamp value and return + * its index to be used for accounting. + * + * Since only a limited number of utilization clamp groups are allowed, if no + * groups have been assigned for the specified value, a new group is assigned, + * if possible. + * Otherwise an error is returned, meaning that an additional clamp value is + * not (currently) supported. 
+ */ +static int +uclamp_group_find(int clamp_id, unsigned int clamp_value) +{ + struct uclamp_map *uc_map = &uclamp_maps[clamp_id][0]; + int free_group_id = UCLAMP_NOT_VALID; + unsigned int group_id = 0; + + for ( ; group_id <= CONFIG_UCLAMP_GROUPS_COUNT; ++group_id) { + /* Keep track of first free clamp group */ + if (uclamp_group_available(clamp_id, group_id)) { + if (free_group_id == UCLAMP_NOT_VALID) + free_group_id = group_id; + continue; + } + /* Return index of first group with same clamp value */ + if (uc_map[group_id].value == clamp_value) + return group_id; + } + + if (likely(free_group_id != UCLAMP_NOT_VALID)) + return free_group_id; + + return -ENOSPC; +} + +/** + * uclamp_group_put: decrease the reference count for a clamp group + * @clamp_id: the clamp index which was affected by a task group + * @uc_se: the utilization clamp data for that task group + * + * When the clamp value for a task group is changed we decrease the reference + * count for the clamp group mapping its current clamp value. A clamp group is + * released when there are no more task groups referencing its clamp value. + */ +static inline void uclamp_group_put(int clamp_id, int group_id) +{ + struct uclamp_map *uc_map = &uclamp_maps[clamp_id][0]; + unsigned long flags; + + /* Ignore SE's not yet attached */ + if (group_id == UCLAMP_NOT_VALID) + return; + + /* Remove SE from this clamp group */ + raw_spin_lock_irqsave(&uc_map[group_id].se_lock, flags); + if (likely(uc_map[group_id].se_count)) + uc_map[group_id].se_count -= 1; +#ifdef SCHED_DEBUG + else { + WARN(1, "invalid SE clamp group [%d:%d] refcount\n", + clamp_id, group_id); + } +#endif + if (uc_map[group_id].se_count == 0) + uclamp_group_reset(clamp_id, group_id); + raw_spin_unlock_irqrestore(&uc_map[group_id].se_lock, flags); +} + +/** + * uclamp_group_get: increase the reference count for a clamp group + * @clamp_id: the clamp index affected by the task + * @next_group_id: the clamp group to refcount + * @uc_se: the utilization clamp data for the task + * @clamp_value: the new clamp value for the task + * + * Each time a task changes its utilization clamp value, for a specified clamp + * index, we need to find an available clamp group which can be used to track + * this new clamp value. The corresponding clamp group index will be used by + * the task to reference count the clamp value on CPUs while enqueued. 
+ */ +static inline void uclamp_group_get(int clamp_id, int next_group_id, + struct uclamp_se *uc_se, + unsigned int clamp_value) +{ + struct uclamp_map *uc_map = &uclamp_maps[clamp_id][0]; + int prev_group_id = uc_se->group_id; + unsigned long flags; + + /* Allocate new clamp group for this clamp value */ + raw_spin_lock_irqsave(&uc_map[next_group_id].se_lock, flags); + if (uclamp_group_available(clamp_id, next_group_id)) + uclamp_group_init(clamp_id, next_group_id, clamp_value); + + /* Update SE's clamp values and attach it to new clamp group */ + uc_se->value = clamp_value; + uc_se->group_id = next_group_id; + uc_map[next_group_id].se_count += 1; + raw_spin_unlock_irqrestore(&uc_map[next_group_id].se_lock, flags); + + /* Release the previous clamp group */ + uclamp_group_put(clamp_id, prev_group_id); +} + static inline int __setscheduler_uclamp(struct task_struct *p, const struct sched_attr *attr) { - if (attr->sched_util_min > attr->sched_util_max) - return -EINVAL; - if (attr->sched_util_max > SCHED_CAPACITY_SCALE) - return -EINVAL; + int group_id[UCLAMP_CNT] = { UCLAMP_NOT_VALID }; + int lower_bound, upper_bound; + struct uclamp_se *uc_se; + int result = 0; - p->uclamp[UCLAMP_MIN] = attr->sched_util_min; - p->uclamp[UCLAMP_MAX] = attr->sched_util_max; + mutex_lock(&uclamp_mutex); - return 0; + /* Find a valid group_id for each required clamp value */ + if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_MIN) { + upper_bound = (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_MAX) + ? attr->sched_util_max + : p->uclamp[UCLAMP_MAX].value; + + if (upper_bound == UCLAMP_NOT_VALID) + upper_bound = SCHED_CAPACITY_SCALE; + if (attr->sched_util_min > upper_bound) { + result = -EINVAL; + goto done; + } + + result = uclamp_group_find(UCLAMP_MIN, attr->sched_util_min); + if (result == -ENOSPC) { + pr_err(UCLAMP_ENOSPC_FMT, "MIN"); + goto done; + } + group_id[UCLAMP_MIN] = result; + } + if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_MAX) { + lower_bound = (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_MIN) + ? attr->sched_util_min + : p->uclamp[UCLAMP_MIN].value; + + if (lower_bound == UCLAMP_NOT_VALID) + lower_bound = 0; + if (attr->sched_util_max < lower_bound || + attr->sched_util_max > SCHED_CAPACITY_SCALE) { + result = -EINVAL; + goto done; + } + + result = uclamp_group_find(UCLAMP_MAX, attr->sched_util_max); + if (result == -ENOSPC) { + pr_err(UCLAMP_ENOSPC_FMT, "MAX"); + goto done; + } + group_id[UCLAMP_MAX] = result; + } + + /* Update each required clamp group */ + if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_MIN) { + uc_se = &p->uclamp[UCLAMP_MIN]; + uclamp_group_get(UCLAMP_MIN, group_id[UCLAMP_MIN], + uc_se, attr->sched_util_min); + } + if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_MAX) { + uc_se = &p->uclamp[UCLAMP_MAX]; + uclamp_group_get(UCLAMP_MAX, group_id[UCLAMP_MAX], + uc_se, attr->sched_util_max); + } + +done: + mutex_unlock(&uclamp_mutex); + + return result; +} + +/** + * uclamp_exit_task: release referenced clamp groups + * @p: the task exiting + * + * When a task terminates, release all its (eventually) refcounted + * task-specific clamp groups. 
+ */ +void uclamp_exit_task(struct task_struct *p) +{ + struct uclamp_se *uc_se; + int clamp_id; + + for (clamp_id = 0; clamp_id < UCLAMP_CNT; ++clamp_id) { + uc_se = &p->uclamp[clamp_id]; + uclamp_group_put(clamp_id, uc_se->group_id); + } +} + +/** + * uclamp_fork: refcount task-specific clamp values for a new task + */ +static void uclamp_fork(struct task_struct *p, bool reset) +{ + int clamp_id; + + if (unlikely(!p->sched_class->uclamp_enabled)) + return; + + for (clamp_id = 0; clamp_id < UCLAMP_CNT; ++clamp_id) { + int next_group_id = p->uclamp[clamp_id].group_id; + struct uclamp_se *uc_se = &p->uclamp[clamp_id]; + + if (unlikely(reset)) { + next_group_id = 0; + p->uclamp[clamp_id].value = uclamp_none(clamp_id); + } + + p->uclamp[clamp_id].group_id = UCLAMP_NOT_VALID; + uclamp_group_get(clamp_id, next_group_id, uc_se, + p->uclamp[clamp_id].value); + } +} + +/** + * init_uclamp: initialize data structures required for utilization clamping + */ +static void __init init_uclamp(void) +{ + struct uclamp_se *uc_se; + int clamp_id; + + mutex_init(&uclamp_mutex); + + for (clamp_id = 0; clamp_id < UCLAMP_CNT; ++clamp_id) { + struct uclamp_map *uc_map = &uclamp_maps[clamp_id][0]; + int group_id = 0; + + for ( ; group_id <= CONFIG_UCLAMP_GROUPS_COUNT; ++group_id) { + uc_map[group_id].value = UCLAMP_NOT_VALID; + raw_spin_lock_init(&uc_map[group_id].se_lock); + } + + /* Init init_task's clamp group */ + uc_se = &init_task.uclamp[clamp_id]; + uc_se->group_id = UCLAMP_NOT_VALID; + uclamp_group_get(clamp_id, 0, uc_se, uclamp_none(clamp_id)); + } } + #else /* CONFIG_UCLAMP_TASK */ static inline int __setscheduler_uclamp(struct task_struct *p, const struct sched_attr *attr) { return -EINVAL; } +static inline void uclamp_fork(struct task_struct *p, bool reset) { } +static inline void init_uclamp(void) { } #endif /* CONFIG_UCLAMP_TASK */ static inline void enqueue_task(struct rq *rq, struct task_struct *p, int flags) @@ -2314,6 +2678,7 @@ static inline void init_schedstats(void) {} int sched_fork(unsigned long clone_flags, struct task_struct *p) { unsigned long flags; + bool reset; __sched_fork(clone_flags, p); /* @@ -2331,7 +2696,8 @@ int sched_fork(unsigned long clone_flags, struct task_struct *p) /* * Revert to default priority/policy on fork if requested. */ - if (unlikely(p->sched_reset_on_fork)) { + reset = p->sched_reset_on_fork; + if (unlikely(reset)) { if (task_has_dl_policy(p) || task_has_rt_policy(p)) { p->policy = SCHED_NORMAL; p->static_prio = NICE_TO_PRIO(0); @@ -2342,11 +2708,6 @@ int sched_fork(unsigned long clone_flags, struct task_struct *p) p->prio = p->normal_prio = __normal_prio(p); set_load_weight(p, false); -#ifdef CONFIG_UCLAMP_TASK - p->uclamp[UCLAMP_MIN] = 0; - p->uclamp[UCLAMP_MAX] = SCHED_CAPACITY_SCALE; -#endif - /* * We don't need the reset flag anymore after the fork. 
It has * fulfilled its duty: @@ -2363,6 +2724,8 @@ int sched_fork(unsigned long clone_flags, struct task_struct *p) init_entity_runnable_average(&p->se); + uclamp_fork(p, reset); + /* * The child is not yet in the pid-hash so no cgroup attach races, * and the cgroup is pinned to this child due to cgroup_fork() @@ -4756,8 +5119,8 @@ SYSCALL_DEFINE4(sched_getattr, pid_t, pid, struct sched_attr __user *, uattr, attr.sched_nice = task_nice(p); #ifdef CONFIG_UCLAMP_TASK - attr.sched_util_min = p->uclamp[UCLAMP_MIN]; - attr.sched_util_max = p->uclamp[UCLAMP_MAX]; + attr.sched_util_min = p->uclamp[UCLAMP_MIN].value; + attr.sched_util_max = p->uclamp[UCLAMP_MAX].value; #endif rcu_read_unlock(); @@ -6107,6 +6470,8 @@ void __init sched_init(void) init_schedstats(); + init_uclamp(); + scheduler_running = 1; } diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index b39fb596f6c1..dab0405386c1 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -10055,6 +10055,10 @@ const struct sched_class fair_sched_class = { #ifdef CONFIG_FAIR_GROUP_SCHED .task_change_group = task_change_group_fair, #endif + +#ifdef CONFIG_UCLAMP_TASK + .uclamp_enabled = 1, +#endif }; #ifdef CONFIG_SCHED_DEBUG diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index 4a2e8cae63c4..72df2dc779bc 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -1501,10 +1501,12 @@ extern const u32 sched_prio_to_wmult[40]; struct sched_class { const struct sched_class *next; +#ifdef CONFIG_UCLAMP_TASK + int uclamp_enabled; +#endif + void (*enqueue_task) (struct rq *rq, struct task_struct *p, int flags); void (*dequeue_task) (struct rq *rq, struct task_struct *p, int flags); - void (*yield_task) (struct rq *rq); - bool (*yield_to_task)(struct rq *rq, struct task_struct *p, bool preempt); void (*check_preempt_curr)(struct rq *rq, struct task_struct *p, int flags); @@ -1537,7 +1539,6 @@ struct sched_class { void (*set_curr_task)(struct rq *rq); void (*task_tick)(struct rq *rq, struct task_struct *p, int queued); void (*task_fork)(struct task_struct *p); - void (*task_dead)(struct task_struct *p); /* * The switched_from() call is allowed to drop rq->lock, therefore we @@ -1554,12 +1555,17 @@ struct sched_class { void (*update_curr)(struct rq *rq); + void (*yield_task) (struct rq *rq); + bool (*yield_to_task)(struct rq *rq, struct task_struct *p, bool preempt); + #define TASK_SET_GROUP 0 #define TASK_MOVE_GROUP 1 #ifdef CONFIG_FAIR_GROUP_SCHED void (*task_change_group)(struct task_struct *p, int type); #endif + + void (*task_dead)(struct task_struct *p); }; static inline void put_prev_task(struct rq *rq, struct task_struct *prev) @@ -2177,6 +2183,22 @@ static inline void cpufreq_update_util(struct rq *rq, unsigned int flags) static inline void cpufreq_update_util(struct rq *rq, unsigned int flags) {} #endif /* CONFIG_CPU_FREQ */ +/** + * uclamp_none: default value for a clamp + * + * This returns the default value for each clamp + * - 0 for a min utilization clamp + * - SCHED_CAPACITY_SCALE for a max utilization clamp + * + * Return: the default value for a given utilization clamp + */ +static inline unsigned int uclamp_none(int clamp_id) +{ + if (clamp_id == UCLAMP_MIN) + return 0; + return SCHED_CAPACITY_SCALE; +} + #ifdef arch_scale_freq_capacity # ifndef arch_scale_freq_invariant # define arch_scale_freq_invariant() true From patchwork Tue Aug 28 13:53:11 2018 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Patrick Bellasi X-Patchwork-Id: 10578597 
From: Patrick Bellasi
To: linux-kernel@vger.kernel.org, linux-pm@vger.kernel.org
Cc: Ingo Molnar, Peter Zijlstra, Tejun Heo, "Rafael J. Wysocki", Viresh Kumar, Vincent Guittot, Paul Turner, Quentin Perret, Dietmar Eggemann, Morten Rasmussen, Juri Lelli, Todd Kjos, Joel Fernandes, Steve Muckle, Suren Baghdasaryan
Subject: [PATCH v4 03/16] sched/core: uclamp: add CPU's clamp groups accounting
Date: Tue, 28 Aug 2018 14:53:11 +0100
Message-Id: <20180828135324.21976-4-patrick.bellasi@arm.com>
In-Reply-To: <20180828135324.21976-1-patrick.bellasi@arm.com>
References: <20180828135324.21976-1-patrick.bellasi@arm.com>

Utilization clamping allows clamping the utilization of a CPU within a [util_min, util_max] range. This range depends on the set of currently RUNNABLE tasks on a CPU, where each task references two "clamp groups" defining the util_min and the util_max clamp values to be considered for that task. The clamp value mapped by a clamp group applies to a CPU only when there is at least one RUNNABLE task referencing that clamp group.

When tasks are enqueued/dequeued on/from a CPU, the set of clamp groups active on that CPU can change. Since each clamp group enforces a different utilization clamp value, once the set of these groups changes it can be necessary to re-compute the new "aggregated" clamp value to apply on that CPU.

Clamp values are always MAX aggregated for both util_min and util_max. This is to ensure that no task can affect the performance of other co-scheduled tasks which are either more boosted (i.e. with higher util_min clamp) or less capped (i.e. with higher util_max clamp).
Here we introduce the required support to properly reference count clamp groups at each task enqueue/dequeue time.

Tasks have a:
   task_struct::uclamp::group_id[clamp_idx]
indexing, for each clamp index (i.e. util_{min,max}), the clamp group in which they should be refcounted at enqueue time.

CPU runqueues have a:
   rq::uclamp::group[clamp_idx][group_idx].tasks
which is used to reference count how many tasks are currently RUNNABLE on that CPU for each clamp group of each clamp index.

The clamp value of each clamp group is tracked by rq::uclamp::group[][].value, thus making rq::uclamp::group[][] an unordered array of clamp values. However, the MAX aggregation of the currently active clamp groups is implemented so as to minimize the number of times we need to scan the complete (unordered) clamp group array to figure out the new max value. This operation indeed happens only when we dequeue the last task of the clamp group corresponding to the current max clamp, and thus the CPU is either entering IDLE or going to schedule a less boosted or more capped task. Moreover, the expected number of different clamp values, which can be configured at build time, is usually so small that a more advanced ordering algorithm is not needed. In real use-cases we expect fewer than 10 different values.

Signed-off-by: Patrick Bellasi
Cc: Ingo Molnar
Cc: Peter Zijlstra
Cc: Paul Turner
Cc: Suren Baghdasaryan
Cc: Todd Kjos
Cc: Joel Fernandes
Cc: Juri Lelli
Cc: Quentin Perret
Cc: Dietmar Eggemann
Cc: Morten Rasmussen
Cc: linux-kernel@vger.kernel.org
Cc: linux-pm@vger.kernel.org

---
Changes in v4:
 Message-ID: <20180816133249.GA2964@e110439-lin>
 - keep the WARN in uclamp_cpu_put_id() but beautify that code a bit
 - add another WARN on the unexpected condition of releasing a refcount
   from a CPU which has a lower clamp value active
 Other:
 - ensure (and check) that all tasks have a valid group_id at
   uclamp_cpu_get_id()
 - rework the uclamp_cpu layout to better fit into just 2x64B cache lines
 - fix some s/SCHED_DEBUG/CONFIG_SCHED_DEBUG/
 - rebased on v4.19-rc1

Changes in v3:
 Message-ID:
 - add WARN on unlikely un-referenced decrement in uclamp_cpu_put_id()
 - rename UCLAMP_NONE into UCLAMP_NOT_VALID
 Message-ID:
 - a few typos fixed
 Other:
 - rebased on tip/sched/core

Changes in v2:
 Message-ID: <20180413093822.GM4129@hirez.programming.kicks-ass.net>
 - refactored struct rq::uclamp_cpu to be more cache efficient:
   no more holes, re-arranged vectors to match cache lines with expected
   data locality
 Message-ID: <20180413094615.GT4043@hirez.programming.kicks-ass.net>
 - use *rq as parameter whenever already available
 - add the scheduling class's uclamp_enabled marker
 - get rid of the "confusing" single callback uclamp_task_update() and
   use uclamp_cpu_{get,put}() directly from {en,de}queue_task()
 - fix/remove "bad" comments
 Message-ID: <20180413113337.GU14248@e110439-lin>
 - remove inline from init_uclamp, flag it __init
 Other:
 - rebased on v4.18-rc4
 - improved documentation to make some concepts more explicit.
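As a reading aid, the following toy sketch (single CPU, one clamp index, no locking) illustrates the MAX aggregation and the "rescan only when the last task of the current max group leaves" policy described above. It mirrors the semantics of uclamp_cpu_update() and the enqueue/dequeue hooks added by this patch, but it is only an illustrative model under those assumptions; the group layout and values are made up.

#include <stdio.h>

#define GROUPS		6	/* assumed CONFIG_UCLAMP_GROUPS_COUNT + 1 */
#define NOT_VALID	-1

struct group {
	int value;	/* clamp value of this group */
	int tasks;	/* RUNNABLE tasks currently referencing it */
};

static struct group grp[GROUPS];
static int cpu_clamp = NOT_VALID;	/* aggregated clamp for this "CPU" */

/* Full scan of the small, unordered group array to find the new max */
static void cpu_update(void)
{
	int max_value = NOT_VALID;
	int id;

	for (id = 0; id < GROUPS; ++id) {
		if (grp[id].tasks <= 0)		/* inactive group: ignore */
			continue;
		if (grp[id].value > max_value)
			max_value = grp[id].value;
	}
	cpu_clamp = max_value;
}

static void task_enqueue(int id)
{
	grp[id].tasks++;
	/* a new max can be propagated without a full scan */
	if (grp[id].value > cpu_clamp)
		cpu_clamp = grp[id].value;
}

static void task_dequeue(int id)
{
	if (grp[id].tasks > 0)
		grp[id].tasks--;
	/* full re-scan only if the current max may have just vanished */
	if (!grp[id].tasks && grp[id].value >= cpu_clamp)
		cpu_update();
}

int main(void)
{
	grp[1] = (struct group){ .value = 300, .tasks = 0 };
	grp[2] = (struct group){ .value = 800, .tasks = 0 };

	task_enqueue(1);			/* cpu_clamp -> 300 */
	task_enqueue(2);			/* cpu_clamp -> 800 */
	task_dequeue(2);			/* last "800" task: re-scan */
	printf("cpu clamp = %d\n", cpu_clamp);	/* prints 300 */
	return 0;
}

Keeping the array unordered stays cheap because the full rescan happens only on the "last task of the max group" dequeue path, and the array has at most CONFIG_UCLAMP_GROUPS_COUNT+1 entries.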
--- kernel/sched/core.c | 207 ++++++++++++++++++++++++++++++++++++++++++- kernel/sched/sched.h | 67 ++++++++++++++ 2 files changed, 273 insertions(+), 1 deletion(-) diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 2668990b96d1..8f908035701f 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -829,9 +829,19 @@ static inline void uclamp_group_init(int clamp_id, int group_id, unsigned int clamp_value) { struct uclamp_map *uc_map = &uclamp_maps[clamp_id][0]; + struct uclamp_cpu *uc_cpu; + int cpu; + /* Set clamp group map */ uc_map[group_id].value = clamp_value; uc_map[group_id].se_count = 0; + + /* Set clamp groups on all CPUs */ + for_each_possible_cpu(cpu) { + uc_cpu = &cpu_rq(cpu)->uclamp; + uc_cpu->group[clamp_id][group_id].value = clamp_value; + uc_cpu->group[clamp_id][group_id].tasks = 0; + } } /** @@ -886,6 +896,190 @@ uclamp_group_find(int clamp_id, unsigned int clamp_value) return -ENOSPC; } +/** + * uclamp_cpu_update: updates the utilization clamp of a CPU + * @cpu: the CPU which utilization clamp has to be updated + * @clamp_id: the clamp index to update + * + * When tasks are enqueued/dequeued on/from a CPU, the set of currently active + * clamp groups is subject to change. Since each clamp group enforces a + * different utilization clamp value, once the set of these groups changes it + * can be required to re-compute what is the new clamp value to apply for that + * CPU. + * + * For the specified clamp index, this method computes the new CPU utilization + * clamp to use until the next change on the set of RUNNABLE tasks on that CPU. + */ +static inline void uclamp_cpu_update(struct rq *rq, int clamp_id) +{ + struct uclamp_group *uc_grp = &rq->uclamp.group[clamp_id][0]; + int max_value = UCLAMP_NOT_VALID; + unsigned int group_id; + + for (group_id = 0; group_id <= CONFIG_UCLAMP_GROUPS_COUNT; ++group_id) { + /* Ignore inactive clamp groups, i.e. no RUNNABLE tasks */ + if (!uclamp_group_active(uc_grp, group_id)) + continue; + + /* Both min and max clamp are MAX aggregated */ + max_value = max(max_value, uc_grp[group_id].value); + + /* Stop if we reach the max possible clamp */ + if (max_value >= SCHED_CAPACITY_SCALE) + break; + } + rq->uclamp.value[clamp_id] = max_value; +} + +/** + * uclamp_cpu_get_id(): increase reference count for a clamp group on a CPU + * @p: the task being enqueued on a CPU + * @rq: the CPU's rq where the clamp group has to be reference counted + * @clamp_id: the utilization clamp (e.g. min or max utilization) to reference + * + * Once a task is enqueued on a CPU's RQ, the clamp group currently defined by + * the task's uclamp.group_id is reference counted on that CPU. + */ +static inline void uclamp_cpu_get_id(struct task_struct *p, + struct rq *rq, int clamp_id) +{ + struct uclamp_group *uc_grp; + struct uclamp_cpu *uc_cpu; + int clamp_value; + int group_id; + + /* Every task must reference a clamp group */ + group_id = p->uclamp[clamp_id].group_id; +#ifdef CONFIG_SCHED_DEBUG + if (unlikely(group_id == UCLAMP_NOT_VALID)) { + WARN(1, "invalid task [%d:%s] clamp group\n", + p->pid, p->comm); + return; + } +#endif + + /* Reference count the task into its current group_id */ + uc_grp = &rq->uclamp.group[clamp_id][0]; + uc_grp[group_id].tasks += 1; + + /* + * If this is the new max utilization clamp value, then we can update + * straight away the CPU clamp value. Otherwise, the current CPU clamp + * value is still valid and we are done. 
+ */ + uc_cpu = &rq->uclamp; + clamp_value = p->uclamp[clamp_id].value; + if (uc_cpu->value[clamp_id] < clamp_value) + uc_cpu->value[clamp_id] = clamp_value; +} + +/** + * uclamp_cpu_put_id(): decrease reference count for a clamp group on a CPU + * @p: the task being dequeued from a CPU + * @cpu: the CPU from where the clamp group has to be released + * @clamp_id: the utilization clamp (e.g. min or max utilization) to release + * + * When a task is dequeued from a CPU's RQ, the CPU's clamp group reference + * counted by the task is decreased. + * If this was the last task defining the current max clamp group, then the + * CPU clamping is updated to find the new max for the specified clamp + * index. + */ +static inline void uclamp_cpu_put_id(struct task_struct *p, + struct rq *rq, int clamp_id) +{ + struct uclamp_group *uc_grp; + struct uclamp_cpu *uc_cpu; + unsigned int clamp_value; + int group_id; + + /* New tasks don't have a previous clamp group */ + group_id = p->uclamp[clamp_id].group_id; + if (group_id == UCLAMP_NOT_VALID) + return; + + /* Decrement the task's reference counted group index */ + uc_grp = &rq->uclamp.group[clamp_id][0]; + if (likely(uc_grp[group_id].tasks)) + uc_grp[group_id].tasks -= 1; +#ifdef CONFIG_SCHED_DEBUG + else { + WARN(1, "invalid CPU[%d] clamp group [%d:%d] refcount\n", + cpu_of(rq), clamp_id, group_id); + } +#endif + + /* If this is not the last task, no updates are required */ + if (uc_grp[group_id].tasks > 0) + return; + + /* + * Update the CPU only if this was the last task of the group + * defining the current clamp value. + */ + uc_cpu = &rq->uclamp; + clamp_value = uc_grp[group_id].value; +#ifdef CONFIG_SCHED_DEBUG + if (unlikely(clamp_value > uc_cpu->value[clamp_id])) { + WARN(1, "invalid CPU[%d] clamp group [%d:%d] value\n", + cpu_of(rq), clamp_id, group_id); + } +#endif + if (clamp_value >= uc_cpu->value[clamp_id]) + uclamp_cpu_update(rq, clamp_id); +} + +/** + * uclamp_cpu_get(): increase CPU's clamp group refcount + * @rq: the CPU's rq where the clamp group has to be refcounted + * @p: the task being enqueued + * + * Once a task is enqueued on a CPU's rq, all the clamp groups currently + * enforced on a task are reference counted on that rq. + * Not all scheduling classes have utilization clamping support, their tasks + * will be silently ignored. + * + * This method updates the utilization clamp constraints considering the + * requirements for the specified task. Thus, this update must be done before + * calling into the scheduling classes, which will eventually update schedutil + * considering the new task requirements. + */ +static inline void uclamp_cpu_get(struct rq *rq, struct task_struct *p) +{ + int clamp_id; + + if (unlikely(!p->sched_class->uclamp_enabled)) + return; + + for (clamp_id = 0; clamp_id < UCLAMP_CNT; ++clamp_id) + uclamp_cpu_get_id(p, rq, clamp_id); +} + +/** + * uclamp_cpu_put(): decrease CPU's clamp group refcount + * @cpu: the CPU's rq where the clamp group refcount has to be decreased + * @p: the task being dequeued + * + * When a task is dequeued from a CPU's rq, all the clamp groups the task has + * been reference counted at task's enqueue time have to be decreased for that + * CPU. + * + * This method updates the utilization clamp constraints considering the + * requirements for the specified task. Thus, this update must be done before + * calling into the scheduling classes, which will eventually update schedutil + * considering the new task requirements. 
+ */ +static inline void uclamp_cpu_put(struct rq *rq, struct task_struct *p) +{ + int clamp_id; + + if (unlikely(!p->sched_class->uclamp_enabled)) + return; + + for (clamp_id = 0; clamp_id < UCLAMP_CNT; ++clamp_id) + uclamp_cpu_put_id(p, rq, clamp_id); +} + /** * uclamp_group_put: decrease the reference count for a clamp group * @clamp_id: the clamp index which was affected by a task group @@ -908,7 +1102,7 @@ static inline void uclamp_group_put(int clamp_id, int group_id) raw_spin_lock_irqsave(&uc_map[group_id].se_lock, flags); if (likely(uc_map[group_id].se_count)) uc_map[group_id].se_count -= 1; -#ifdef SCHED_DEBUG +#ifdef CONFIG_SCHED_DEBUG else { WARN(1, "invalid SE clamp group [%d:%d] refcount\n", clamp_id, group_id); @@ -1073,9 +1267,16 @@ static void __init init_uclamp(void) { struct uclamp_se *uc_se; int clamp_id; + int cpu; mutex_init(&uclamp_mutex); + for_each_possible_cpu(cpu) { + struct uclamp_cpu *uc_cpu = &cpu_rq(cpu)->uclamp; + + memset(uc_cpu, UCLAMP_NOT_VALID, sizeof(struct uclamp_cpu)); + } + for (clamp_id = 0; clamp_id < UCLAMP_CNT; ++clamp_id) { struct uclamp_map *uc_map = &uclamp_maps[clamp_id][0]; int group_id = 0; @@ -1093,6 +1294,8 @@ static void __init init_uclamp(void) } #else /* CONFIG_UCLAMP_TASK */ +static inline void uclamp_cpu_get(struct rq *rq, struct task_struct *p) { } +static inline void uclamp_cpu_put(struct rq *rq, struct task_struct *p) { } static inline int __setscheduler_uclamp(struct task_struct *p, const struct sched_attr *attr) { @@ -1110,6 +1313,7 @@ static inline void enqueue_task(struct rq *rq, struct task_struct *p, int flags) if (!(flags & ENQUEUE_RESTORE)) sched_info_queued(rq, p); + uclamp_cpu_get(rq, p); p->sched_class->enqueue_task(rq, p, flags); } @@ -1121,6 +1325,7 @@ static inline void dequeue_task(struct rq *rq, struct task_struct *p, int flags) if (!(flags & DEQUEUE_SAVE)) sched_info_dequeued(rq, p); + uclamp_cpu_put(rq, p); p->sched_class->dequeue_task(rq, p, flags); } diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index 72df2dc779bc..513608ae4908 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -764,6 +764,50 @@ extern void rto_push_irq_work_func(struct irq_work *work); #endif #endif /* CONFIG_SMP */ +#ifdef CONFIG_UCLAMP_TASK +/** + * struct uclamp_group - Utilization clamp Group + * @value: utilization clamp value for tasks on this clamp group + * @tasks: number of RUNNABLE tasks on this clamp group + * + * Keep track of how many tasks are RUNNABLE for a given utilization + * clamp value. + */ +struct uclamp_group { + int value; + int tasks; +}; + +/** + * struct uclamp_cpu - CPU's utilization clamp + * @value: currently active clamp values for a CPU + * @group: utilization clamp groups affecting a CPU + * + * Keep track of RUNNABLE tasks on a CPUs to aggregate their clamp values. + * A clamp value is affecting a CPU where there is at least one task RUNNABLE + * (or actually running) with that value. + * + * We have up to UCLAMP_CNT possible different clamp value, which are + * currently only two: minmum utilization and maximum utilization. + * + * All utilization clamping values are MAX aggregated, since: + * - for util_min: we want to run the CPU at least at the max of the minimum + * utilization required by its currently RUNNABLE tasks. + * - for util_max: we want to allow the CPU to run up to the max of the + * maximum utilization allowed by its currently RUNNABLE tasks. 
+ * + * Since on each system we expect only a limited number of different + * utilization clamp values (CONFIG_UCLAMP_GROUPS_COUNT), we use a simple + * array to track the metrics required to compute all the per-CPU utilization + * clamp values. The additional slot is used to track the default clamp + * values, i.e. no min/max clamping at all. + */ +struct uclamp_cpu { + struct uclamp_group group[UCLAMP_CNT][CONFIG_UCLAMP_GROUPS_COUNT + 1]; + int value[UCLAMP_CNT]; +}; +#endif /* CONFIG_UCLAMP_TASK */ + /* * This is the main, per-CPU runqueue data structure. * @@ -801,6 +845,11 @@ struct rq { unsigned long nr_load_updates; u64 nr_switches; +#ifdef CONFIG_UCLAMP_TASK + /* Utilization clamp values based on CPU's RUNNABLE tasks */ + struct uclamp_cpu uclamp ____cacheline_aligned; +#endif + struct cfs_rq cfs; struct rt_rq rt; struct dl_rq dl; @@ -2145,6 +2194,24 @@ static inline u64 irq_time_read(int cpu) } #endif /* CONFIG_IRQ_TIME_ACCOUNTING */ +#ifdef CONFIG_UCLAMP_TASK +/** + * uclamp_group_active: check if a clamp group is active on a CPU + * @uc_grp: the clamp groups for a CPU + * @group_id: the clamp group to check + * + * A clamp group affects a CPU if it has at least one RUNNABLE task. + * + * Return: true if the specified CPU has at least one RUNNABLE task + * for the specified clamp group. + */ +static inline bool uclamp_group_active(struct uclamp_group *uc_grp, + int group_id) +{ + return uc_grp[group_id].tasks > 0; +} +#endif /* CONFIG_UCLAMP_TASK */ + #ifdef CONFIG_CPU_FREQ DECLARE_PER_CPU(struct update_util_data *, cpufreq_update_util_data); From patchwork Tue Aug 28 13:53:12 2018 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Patrick Bellasi X-Patchwork-Id: 10578595 Return-Path: Received: from mail.wl.linuxfoundation.org (pdx-wl-mail.web.codeaurora.org [172.30.200.125]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id E4EC9139B for ; Tue, 28 Aug 2018 14:02:32 +0000 (UTC) Received: from mail.wl.linuxfoundation.org (localhost [127.0.0.1]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id D5AF42A336 for ; Tue, 28 Aug 2018 14:02:32 +0000 (UTC) Received: by mail.wl.linuxfoundation.org (Postfix, from userid 486) id C931C2A34E; Tue, 28 Aug 2018 14:02:32 +0000 (UTC) X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on pdx-wl-mail.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-7.9 required=2.0 tests=BAYES_00,MAILING_LIST_MULTI, RCVD_IN_DNSWL_HI autolearn=ham version=3.3.1 Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 121922A336 for ; Tue, 28 Aug 2018 14:02:32 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1727273AbeH1RyU (ORCPT ); Tue, 28 Aug 2018 13:54:20 -0400 Received: from usa-sjc-mx-foss1.foss.arm.com ([217.140.101.70]:38752 "EHLO foss.arm.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1726439AbeH1RyU (ORCPT ); Tue, 28 Aug 2018 13:54:20 -0400 Received: from usa-sjc-imap-foss1.foss.arm.com (unknown [10.72.51.249]) by usa-sjc-mx-foss1.foss.arm.com (Postfix) with ESMTP id 563A7ED1; Tue, 28 Aug 2018 06:54:11 -0700 (PDT) Received: from e110439-lin.Cambridge.arm.com (e110439-lin.emea.arm.com [10.4.12.126]) by usa-sjc-imap-foss1.foss.arm.com (Postfix) with ESMTPA id 64BDF3F5BD; Tue, 28 Aug 2018 06:54:08 -0700 (PDT) From: Patrick Bellasi To: linux-kernel@vger.kernel.org, linux-pm@vger.kernel.org Cc: Ingo Molnar , Peter Zijlstra , Tejun Heo , "Rafael J 
. Wysocki" , Viresh Kumar , Vincent Guittot , Paul Turner , Quentin Perret , Dietmar Eggemann , Morten Rasmussen , Juri Lelli , Todd Kjos , Joel Fernandes , Steve Muckle , Suren Baghdasaryan Subject: [PATCH v4 04/16] sched/core: uclamp: update CPU's refcount on clamp changes Date: Tue, 28 Aug 2018 14:53:12 +0100 Message-Id: <20180828135324.21976-5-patrick.bellasi@arm.com> X-Mailer: git-send-email 2.18.0 In-Reply-To: <20180828135324.21976-1-patrick.bellasi@arm.com> References: <20180828135324.21976-1-patrick.bellasi@arm.com> Sender: linux-pm-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-pm@vger.kernel.org X-Virus-Scanned: ClamAV using ClamSMTP Utilization clamp values enforced on a CPU by a task can be updated at run-time, for example via a sched_setattr syscall, while a task is currently RUNNABLE on that CPU. In these cases, the task can be already refcounting a clamp group for its CPU and thus we need to update this reference to ensure the new constraints are immediately enforced. Since a clamp value change always implies a clamp group refcount update, this patch hooks into the clamp group refcount getter to trigger a CPU refcount syncup. Such a syncup is required only by currently RUNNABLE tasks which are also referencing at least one valid clamp group. Signed-off-by: Patrick Bellasi Cc: Ingo Molnar Cc: Peter Zijlstra Cc: Paul Turner Cc: Suren Baghdasaryan Cc: Todd Kjos Cc: Joel Fernandes Cc: Juri Lelli Cc: Quentin Perret Cc: Dietmar Eggemann Cc: Morten Rasmussen Cc: linux-kernel@vger.kernel.org Cc: linux-pm@vger.kernel.org --- Changes in v4: Message-ID: <20180816132249.GA2960@e110439-lin> - inline uclamp_task_active() code into uclamp_task_update_active() - get rid of the now unused uclamp_task_active() Other: - allow to call uclamp_group_get() without a task pointer, which is used to refcount the initial clamp group for all the global objects (init_task, root_task_group and system_defaults) - rebased on v4.19-rc1 Changes in v3: Message-ID: - rename UCLAMP_NONE into UCLAMP_NOT_VALID Other: - rabased on tip/sched/core Changes in v2: Message-ID: <20180413111900.GF4082@hirez.programming.kicks-ass.net> - get rid of the group_id back annotation which is not requires at this stage where we have only per-task clamping support. It will be introduce later when CGroups support is added. Other: - rabased on v4.18-rc4 - this code has been split from a previous patch to simplify the review --- kernel/sched/core.c | 65 ++++++++++++++++++++++++++++++++++++++++---- kernel/sched/sched.h | 16 +++++++++++ 2 files changed, 76 insertions(+), 5 deletions(-) diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 8f908035701f..64e5c96bfdaf 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -1080,6 +1080,54 @@ static inline void uclamp_cpu_put(struct rq *rq, struct task_struct *p) uclamp_cpu_put_id(p, rq, clamp_id); } +/** + * uclamp_task_update_active: update the clamp group of a RUNNABLE task + * @p: the task which clamp groups must be updated + * @clamp_id: the clamp index to consider + * @group_id: the clamp group to update + * + * Each time the clamp value of a task group is changed, the old and new clamp + * groups have to be updated for each CPU containing a RUNNABLE task belonging + * to this tasks group. Sleeping tasks are not updated since they will be + * enqueued with the proper clamp group index at their next activation. 
+ */ +static inline void +uclamp_task_update_active(struct task_struct *p, int clamp_id, int group_id) +{ + struct rq_flags rf; + struct rq *rq; + + /* + * Lock the task and the CPU where the task is (or was) queued. + * + * We might lock the (previous) RQ of a !RUNNABLE task, but that's the + * price to pay to safely serialize util_{min,max} updates with + * enqueues, dequeues and migration operations. + * This is the same locking schema used by __set_cpus_allowed_ptr(). + */ + rq = task_rq_lock(p, &rf); + + /* + * The setting of the clamp group is serialized by task_rq_lock(). + * Thus, if the task is not yet RUNNABLE and its task_struct is not + * affecting a valid clamp group, then the next time it's going to be + * enqueued it will already see the updated clamp group value. + */ + if (!task_on_rq_queued(p) && !p->on_cpu) + goto done; + if (!uclamp_task_affects(p, clamp_id)) + goto done; + + /* Release p's currently referenced clamp group */ + uclamp_cpu_put_id(p, rq, clamp_id); + + /* Get p's new clamp group */ + uclamp_cpu_get_id(p, rq, clamp_id); + +done: + task_rq_unlock(rq, p, &rf); +} + /** * uclamp_group_put: decrease the reference count for a clamp group * @clamp_id: the clamp index which was affected by a task group @@ -1115,6 +1163,7 @@ static inline void uclamp_group_put(int clamp_id, int group_id) /** * uclamp_group_get: increase the reference count for a clamp group + * @p: the task which clamp value must be tracked * @clamp_id: the clamp index affected by the task * @next_group_id: the clamp group to refcount * @uc_se: the utilization clamp data for the task @@ -1125,7 +1174,8 @@ static inline void uclamp_group_put(int clamp_id, int group_id) * this new clamp value. The corresponding clamp group index will be used by * the task to reference count the clamp value on CPUs while enqueued. 
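For example, if two tasks both request util_min=512, uclamp_group_find() resolves them to the same group_id and uclamp_group_get() just bumps that group's se_count to 2, while a third task asking for util_min=300 takes a different slot; only once se_count drops back to zero in uclamp_group_put() can a slot be reused for a new value. This is why the number of different clamp values usable at the same time is bounded by CONFIG_UCLAMP_GROUPS_COUNT and why uclamp_group_find() can fail with -ENOSPC.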
*/ -static inline void uclamp_group_get(int clamp_id, int next_group_id, +static inline void uclamp_group_get(struct task_struct *p, + int clamp_id, int next_group_id, struct uclamp_se *uc_se, unsigned int clamp_value) { @@ -1144,6 +1194,10 @@ static inline void uclamp_group_get(int clamp_id, int next_group_id, uc_map[next_group_id].se_count += 1; raw_spin_unlock_irqrestore(&uc_map[next_group_id].se_lock, flags); + /* Update CPU's clamp group refcounts of RUNNABLE task */ + if (p) + uclamp_task_update_active(p, clamp_id, next_group_id); + /* Release the previous clamp group */ uclamp_group_put(clamp_id, prev_group_id); } @@ -1202,12 +1256,12 @@ static inline int __setscheduler_uclamp(struct task_struct *p, /* Update each required clamp group */ if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_MIN) { uc_se = &p->uclamp[UCLAMP_MIN]; - uclamp_group_get(UCLAMP_MIN, group_id[UCLAMP_MIN], + uclamp_group_get(p, UCLAMP_MIN, group_id[UCLAMP_MIN], uc_se, attr->sched_util_min); } if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_MAX) { uc_se = &p->uclamp[UCLAMP_MAX]; - uclamp_group_get(UCLAMP_MAX, group_id[UCLAMP_MAX], + uclamp_group_get(p, UCLAMP_MAX, group_id[UCLAMP_MAX], uc_se, attr->sched_util_max); } @@ -1255,7 +1309,7 @@ static void uclamp_fork(struct task_struct *p, bool reset) } p->uclamp[clamp_id].group_id = UCLAMP_NOT_VALID; - uclamp_group_get(clamp_id, next_group_id, uc_se, + uclamp_group_get(NULL, clamp_id, next_group_id, uc_se, p->uclamp[clamp_id].value); } } @@ -1289,7 +1343,8 @@ static void __init init_uclamp(void) /* Init init_task's clamp group */ uc_se = &init_task.uclamp[clamp_id]; uc_se->group_id = UCLAMP_NOT_VALID; - uclamp_group_get(clamp_id, 0, uc_se, uclamp_none(clamp_id)); + uclamp_group_get(NULL, clamp_id, 0, uc_se, + uclamp_none(clamp_id)); } } diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index 513608ae4908..25d1d218ae10 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -2210,6 +2210,22 @@ static inline bool uclamp_group_active(struct uclamp_group *uc_grp, { return uc_grp[group_id].tasks > 0; } + +/** + * uclamp_task_affects: check if a task affects a utilization clamp + * @p: the task to consider + * @clamp_id: the utilization clamp to check + * + * A task affects a clamp index if: + * - it's currently enqueued on a CPU + * - it references a valid clamp group index for the specified clamp index + * + * Return: true if p currently affects the specified clamp_id + */ +static inline bool uclamp_task_affects(struct task_struct *p, int clamp_id) +{ + return (p->uclamp[clamp_id].group_id != UCLAMP_NOT_VALID); +} #endif /* CONFIG_UCLAMP_TASK */ #ifdef CONFIG_CPU_FREQ From patchwork Tue Aug 28 13:53:13 2018 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Patrick Bellasi X-Patchwork-Id: 10578589 Return-Path: Received: from mail.wl.linuxfoundation.org (pdx-wl-mail.web.codeaurora.org [172.30.200.125]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 4CB0913B8 for ; Tue, 28 Aug 2018 13:56:23 +0000 (UTC) Received: from mail.wl.linuxfoundation.org (localhost [127.0.0.1]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 3C57629B5C for ; Tue, 28 Aug 2018 13:56:23 +0000 (UTC) Received: by mail.wl.linuxfoundation.org (Postfix, from userid 486) id 2DA872A30F; Tue, 28 Aug 2018 13:56:23 +0000 (UTC) X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on pdx-wl-mail.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-7.9 required=2.0 tests=BAYES_00,MAILING_LIST_MULTI, 
RCVD_IN_DNSWL_HI autolearn=ham version=3.3.1 Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 92F9829B5C for ; Tue, 28 Aug 2018 13:56:22 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1727775AbeH1Rq7 (ORCPT ); Tue, 28 Aug 2018 13:46:59 -0400 Received: from foss.arm.com ([217.140.101.70]:38460 "EHLO foss.arm.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1727172AbeH1Rq6 (ORCPT ); Tue, 28 Aug 2018 13:46:58 -0400 Received: from usa-sjc-imap-foss1.foss.arm.com (unknown [10.72.51.249]) by usa-sjc-mx-foss1.foss.arm.com (Postfix) with ESMTP id 896F715BF; Tue, 28 Aug 2018 06:54:14 -0700 (PDT) Received: from e110439-lin.Cambridge.arm.com (e110439-lin.emea.arm.com [10.4.12.126]) by usa-sjc-imap-foss1.foss.arm.com (Postfix) with ESMTPA id 980953F5BD; Tue, 28 Aug 2018 06:54:11 -0700 (PDT) From: Patrick Bellasi To: linux-kernel@vger.kernel.org, linux-pm@vger.kernel.org Cc: Ingo Molnar , Peter Zijlstra , Tejun Heo , "Rafael J . Wysocki" , Viresh Kumar , Vincent Guittot , Paul Turner , Quentin Perret , Dietmar Eggemann , Morten Rasmussen , Juri Lelli , Todd Kjos , Joel Fernandes , Steve Muckle , Suren Baghdasaryan Subject: [PATCH v4 05/16] sched/core: uclamp: enforce last task UCLAMP_MAX Date: Tue, 28 Aug 2018 14:53:13 +0100 Message-Id: <20180828135324.21976-6-patrick.bellasi@arm.com> X-Mailer: git-send-email 2.18.0 In-Reply-To: <20180828135324.21976-1-patrick.bellasi@arm.com> References: <20180828135324.21976-1-patrick.bellasi@arm.com> Sender: linux-pm-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-pm@vger.kernel.org X-Virus-Scanned: ClamAV using ClamSMTP When a util_max clamped task sleeps, its clamp constraints are removed from the CPU. However, the blocked utilization on that CPU can still be higher than the max clamp value enforced while that task was running. This max clamp removal when a CPU is going to be idle could thus allow unwanted CPU frequency increases, right while the task is not running. This can happen, for example, where there is another (smaller) task running on a different CPU of the same frequency domain. In this case, when we aggregate the utilization of all the CPUs in a shared frequency domain, schedutil can still see the full non clamped blocked utilization of all the CPUs and thus eventually increase the frequency. Let's fix this by using: uclamp_cpu_put_id(UCLAMP_MAX) uclamp_cpu_update(last_clamp_value) to detect when a CPU has no more RUNNABLE clamped tasks and to flag this condition. Thus, while a CPU is idle, we can still enforce the last used clamp value for it. To the contrary, we do not track any UCLAMP_MIN since, while a CPU is idle, we don't want to enforce any minimum frequency Indeed, we rely just on blocked load decay to smoothly reduce the frequency. Signed-off-by: Patrick Bellasi Cc: Ingo Molnar Cc: Peter Zijlstra Cc: Rafael J. 
Wysocki Cc: Viresh Kumar Cc: Suren Baghdasaryan Cc: Todd Kjos Cc: Joel Fernandes Cc: Juri Lelli Cc: Quentin Perret Cc: Dietmar Eggemann Cc: Morten Rasmussen Cc: linux-kernel@vger.kernel.org Cc: linux-pm@vger.kernel.org --- Changes in v4: Message-ID: <20180816172016.GG2960@e110439-lin> - ensure to always reset clamp holding on wakeup from IDLE Others: - rebased on v4.19-rc1 Changes in v3: Message-ID: - rename UCLAMP_NONE into UCLAMP_NOT_VALID Changes in v2: - rabased on v4.18-rc4 - new patch to improve a specific issue --- kernel/sched/core.c | 39 +++++++++++++++++++++++++++++++++++---- kernel/sched/sched.h | 11 +++++++++++ 2 files changed, 46 insertions(+), 4 deletions(-) diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 64e5c96bfdaf..ba0e7208c65a 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -910,7 +910,8 @@ uclamp_group_find(int clamp_id, unsigned int clamp_value) * For the specified clamp index, this method computes the new CPU utilization * clamp to use until the next change on the set of RUNNABLE tasks on that CPU. */ -static inline void uclamp_cpu_update(struct rq *rq, int clamp_id) +static inline void uclamp_cpu_update(struct rq *rq, int clamp_id, + unsigned int last_clamp_value) { struct uclamp_group *uc_grp = &rq->uclamp.group[clamp_id][0]; int max_value = UCLAMP_NOT_VALID; @@ -928,6 +929,24 @@ static inline void uclamp_cpu_update(struct rq *rq, int clamp_id) if (max_value >= SCHED_CAPACITY_SCALE) break; } + + /* + * Just for the UCLAMP_MAX value, in case there are no RUNNABLE + * task, we want to keep the CPU clamped to the last task's clamp + * value. This is to avoid frequency spikes to MAX when one CPU, with + * an high blocked utilization, sleeps and another CPU, in the same + * frequency domain, do not see anymore the clamp on the first CPU. + * + * The UCLAMP_FLAG_IDLE is set whenever we detect, from the above + * loop, that there are no more RUNNABLE taks on that CPU. + * In this case we enforce the CPU util_max to that of the last + * dequeued task. + */ + if (clamp_id == UCLAMP_MAX && max_value == UCLAMP_NOT_VALID) { + rq->uclamp.flags |= UCLAMP_FLAG_IDLE; + max_value = last_clamp_value; + } + rq->uclamp.value[clamp_id] = max_value; } @@ -962,13 +981,25 @@ static inline void uclamp_cpu_get_id(struct task_struct *p, uc_grp = &rq->uclamp.group[clamp_id][0]; uc_grp[group_id].tasks += 1; + /* Reset clamp holds on idle exit */ + uc_cpu = &rq->uclamp; + clamp_value = p->uclamp[clamp_id].value; + if (unlikely(uc_cpu->flags & UCLAMP_FLAG_IDLE)) { + /* + * This function is called for both UCLAMP_MIN (before) and + * UCLAMP_MAX (after). Let's reset the flag only the second + * once we know that UCLAMP_MIN has been already updated. + */ + if (clamp_id == UCLAMP_MAX) + uc_cpu->flags &= ~UCLAMP_FLAG_IDLE; + uc_cpu->value[clamp_id] = clamp_value; + } + /* * If this is the new max utilization clamp value, then we can update * straight away the CPU clamp value. Otherwise, the current CPU clamp * value is still valid and we are done. 
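As a concrete scenario for the problem being fixed here: consider two CPUs in the same frequency domain, where CPU0 has just been running a task clamped to util_max=300 and still carries a large blocked utilization, while CPU1 keeps running a small unclamped task. Without the idle clamp holding, as soon as the clamped task sleeps CPU0's util_max clamp disappears, schedutil aggregates the full blocked utilization of the domain and may raise the frequency even though no RUNNABLE task asked for it. With UCLAMP_FLAG_IDLE set at dequeue time, CPU0 keeps enforcing util_max=300 while idle; the flag is cleared, and the clamp reset to the newly enqueued task's value, on the next wakeup.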
*/ - uc_cpu = &rq->uclamp; - clamp_value = p->uclamp[clamp_id].value; if (uc_cpu->value[clamp_id] < clamp_value) uc_cpu->value[clamp_id] = clamp_value; } @@ -1026,7 +1057,7 @@ static inline void uclamp_cpu_put_id(struct task_struct *p, } #endif if (clamp_value >= uc_cpu->value[clamp_id]) - uclamp_cpu_update(rq, clamp_id); + uclamp_cpu_update(rq, clamp_id, clamp_value); } /** diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index 25d1d218ae10..411635c4c09a 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -805,6 +805,17 @@ struct uclamp_group { struct uclamp_cpu { struct uclamp_group group[UCLAMP_CNT][CONFIG_UCLAMP_GROUPS_COUNT + 1]; int value[UCLAMP_CNT]; +/* + * Idle clamp holding + * Whenever a CPU is idle, we enforce the util_max clamp value of the last + * task running on that CPU. This bit is used to flag a clamp holding + * currently active for a CPU. This flag is: + * - set when we update the clamp value of a CPU at the time of dequeuing the + * last before entering idle + * - reset when we enqueue the first task after a CPU wakeup from IDLE + */ +#define UCLAMP_FLAG_IDLE 0x01 + int flags; }; #endif /* CONFIG_UCLAMP_TASK */ From patchwork Tue Aug 28 13:53:14 2018 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Patrick Bellasi X-Patchwork-Id: 10578569 Return-Path: Received: from mail.wl.linuxfoundation.org (pdx-wl-mail.web.codeaurora.org [172.30.200.125]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id F347513B8 for ; Tue, 28 Aug 2018 13:55:13 +0000 (UTC) Received: from mail.wl.linuxfoundation.org (localhost [127.0.0.1]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id E3CD32A1A4 for ; Tue, 28 Aug 2018 13:55:13 +0000 (UTC) Received: by mail.wl.linuxfoundation.org (Postfix, from userid 486) id D757A2A30F; Tue, 28 Aug 2018 13:55:13 +0000 (UTC) X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on pdx-wl-mail.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-7.9 required=2.0 tests=BAYES_00,MAILING_LIST_MULTI, RCVD_IN_DNSWL_HI autolearn=ham version=3.3.1 Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 2DF4229953 for ; Tue, 28 Aug 2018 13:55:13 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1727645AbeH1Rq6 (ORCPT ); Tue, 28 Aug 2018 13:46:58 -0400 Received: from usa-sjc-mx-foss1.foss.arm.com ([217.140.101.70]:38486 "EHLO foss.arm.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1726439AbeH1Rq6 (ORCPT ); Tue, 28 Aug 2018 13:46:58 -0400 Received: from usa-sjc-imap-foss1.foss.arm.com (unknown [10.72.51.249]) by usa-sjc-mx-foss1.foss.arm.com (Postfix) with ESMTP id BC5F81682; Tue, 28 Aug 2018 06:54:17 -0700 (PDT) Received: from e110439-lin.Cambridge.arm.com (e110439-lin.emea.arm.com [10.4.12.126]) by usa-sjc-imap-foss1.foss.arm.com (Postfix) with ESMTPA id CAF4C3F5BD; Tue, 28 Aug 2018 06:54:14 -0700 (PDT) From: Patrick Bellasi To: linux-kernel@vger.kernel.org, linux-pm@vger.kernel.org Cc: Ingo Molnar , Peter Zijlstra , Tejun Heo , "Rafael J . 
Wysocki" , Viresh Kumar , Vincent Guittot , Paul Turner , Quentin Perret , Dietmar Eggemann , Morten Rasmussen , Juri Lelli , Todd Kjos , Joel Fernandes , Steve Muckle , Suren Baghdasaryan Subject: [PATCH v4 06/16] sched/cpufreq: uclamp: add utilization clamping for FAIR tasks Date: Tue, 28 Aug 2018 14:53:14 +0100 Message-Id: <20180828135324.21976-7-patrick.bellasi@arm.com> X-Mailer: git-send-email 2.18.0 In-Reply-To: <20180828135324.21976-1-patrick.bellasi@arm.com> References: <20180828135324.21976-1-patrick.bellasi@arm.com> Sender: linux-pm-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-pm@vger.kernel.org X-Virus-Scanned: ClamAV using ClamSMTP Each time a frequency update is required via schedutil, a frequency is selected to (possibly) satisfy the utilization reported by the CFS class. However, when utilization clamping is in use, the frequency selection should consider the requirements suggested by userspace, for example, to: - boost tasks which are directly affecting the user experience by running them at least at a minimum "required" frequency - cap low priority tasks not directly affecting the user experience by running them only up to a maximum "allowed" frequency These constraints are meant to support a per-task based tuning of the frequency selection thus allowing to have a fine grained definition of performance boosting vs energy saving strategies in kernel space. Let's add the required support to clamp the utilization generated by FAIR tasks within the boundaries defined by their aggregated utilization clamp constraints. On each CPU the aggregated clamp values are obtained by considering the maximum of the {min,max}_util values for each task. This max aggregation responds to the goal of not penalizing, for example, high boosted (i.e. more important for the user-experience) CFS tasks which happens to be co-scheduled with high capped (i.e. less important for the user-experience) CFS tasks. For FAIR tasks both the utilization as well as the IOWait boost values are clamped according to the CPU aggregated utilization clamp constraints. The default values for boosting and capping are defined to be: - util_min: 0 - util_max: SCHED_CAPACITY_SCALE which means that by default no boosting/capping is enforced on FAIR tasks, and thus the frequency will be selected considering the actual utilization value of each CPU. Signed-off-by: Patrick Bellasi Cc: Ingo Molnar Cc: Peter Zijlstra Cc: Rafael J. 
Wysocki Cc: Viresh Kumar Cc: Suren Baghdasaryan Cc: Todd Kjos Cc: Joel Fernandes Cc: Juri Lelli Cc: Quentin Perret Cc: Dietmar Eggemann Cc: Morten Rasmussen Cc: linux-kernel@vger.kernel.org Cc: linux-pm@vger.kernel.org --- Changes in v4: Message-ID: - use *rq instead of cpu for both uclamp_util() and uclamp_value() Message-ID: <20180816135300.GC2960@e110439-lin> - remove uclamp_value() which is never used outside CONFIG_UCLAMP_TASK Others: - rebased on v4.19-rc1 Changes in v3: Message-ID: - rename UCLAMP_NONE into UCLAMP_NOT_VALID Others: - rebased on tip/sched/core Changes in v2: - rebased on v4.18-rc4 --- kernel/sched/cpufreq_schedutil.c | 23 +++++++++++++-- kernel/sched/sched.h | 50 ++++++++++++++++++++++++++++++++ 2 files changed, 71 insertions(+), 2 deletions(-) diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c index 3fffad3bc8a8..949082555ee8 100644 --- a/kernel/sched/cpufreq_schedutil.c +++ b/kernel/sched/cpufreq_schedutil.c @@ -222,8 +222,13 @@ static unsigned long sugov_get_util(struct sugov_cpu *sg_cpu) * CFS tasks and we use the same metric to track the effective * utilization (PELT windows are synchronized) we can directly add them * to obtain the CPU's actual utilization. + * + * CFS utilization can be boosted or capped, depending on utilization + * clamp constraints configured for currently RUNNABLE tasks. */ util = cpu_util_cfs(rq); + if (util) + util = uclamp_util(rq, util); util += cpu_util_rt(rq); /* @@ -307,6 +312,7 @@ static void sugov_iowait_boost(struct sugov_cpu *sg_cpu, u64 time, unsigned int flags) { bool set_iowait_boost = flags & SCHED_CPUFREQ_IOWAIT; + unsigned int max_boost; /* Reset boost if the CPU appears to have been idle enough */ if (sg_cpu->iowait_boost && @@ -322,11 +328,24 @@ static void sugov_iowait_boost(struct sugov_cpu *sg_cpu, u64 time, return; sg_cpu->iowait_boost_pending = true; + /* + * Boost FAIR tasks only up to the CPU clamped utilization. + * + * Since DL tasks have a much more advanced bandwidth control, it's + * safe to assume that IO boost does not apply to those tasks. + * Instead, since RT tasks are not utiliation clamped, we don't want + * to apply clamping on IO boost while there is blocked RT + * utilization. + */ + max_boost = sg_cpu->iowait_boost_max; + if (!cpu_util_rt(cpu_rq(sg_cpu->cpu))) + max_boost = uclamp_util(cpu_rq(sg_cpu->cpu), max_boost); + /* Double the boost at each request */ if (sg_cpu->iowait_boost) { sg_cpu->iowait_boost <<= 1; - if (sg_cpu->iowait_boost > sg_cpu->iowait_boost_max) - sg_cpu->iowait_boost = sg_cpu->iowait_boost_max; + if (sg_cpu->iowait_boost > max_boost) + sg_cpu->iowait_boost = max_boost; return; } diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index 411635c4c09a..1b05b38b1081 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -2293,6 +2293,56 @@ static inline unsigned int uclamp_none(int clamp_id) return SCHED_CAPACITY_SCALE; } +#ifdef CONFIG_UCLAMP_TASK +/** + * uclamp_value: get the current CPU's utilization clamp value + * @rq: the CPU's RQ to consider + * @clamp_id: the utilization clamp index (i.e. min or max utilization) + * + * The utilization clamp value for a CPU depends on its set of currently + * RUNNABLE tasks and their specific util_{min,max} constraints. + * A max aggregated value is tracked for each CPU and returned by this + * function. 
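To put some numbers on the schedutil integration above: on a CPU with capacity SCHED_CAPACITY_SCALE (1024), a CFS utilization of 150 under a util_min=512 clamp is reported by sugov_get_util() as 512, while a CFS utilization of 900 under a util_max=600 clamp is reported as 600. Since schedutil's existing get_next_freq() heuristic requests roughly next_freq = 1.25 * max_freq * util / capacity, those two cases translate to frequency requests of about 62% and 73% of max_freq respectively (the 1.25 margin is pre-existing schedutil behaviour, mentioned here only to make the example concrete).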
+ * + * Return: the current value for the specified CPU and clamp index + */ +static inline unsigned int uclamp_value(struct rq *rq, int clamp_id) +{ + struct uclamp_cpu *uc_cpu = &rq->uclamp; + + if (uc_cpu->value[clamp_id] == UCLAMP_NOT_VALID) + return uclamp_none(clamp_id); + + return uc_cpu->value[clamp_id]; +} + +/** + * clamp_util: clamp a utilization value for a specified CPU + * @rq: the CPU's RQ to get the clamp values from + * @util: the utilization signal to clamp + * + * Each CPU tracks util_{min,max} clamp values depending on the set of its + * currently RUNNABLE tasks. Given a utilization signal, i.e a signal in + * the [0..SCHED_CAPACITY_SCALE] range, this function returns a clamped + * utilization signal considering the current clamp values for the + * specified CPU. + * + * Return: a clamped utilization signal for a given CPU. + */ +static inline unsigned int uclamp_util(struct rq *rq, unsigned int util) +{ + unsigned int min_util = uclamp_value(rq, UCLAMP_MIN); + unsigned int max_util = uclamp_value(rq, UCLAMP_MAX); + + return clamp(util, min_util, max_util); +} +#else /* CONFIG_UCLAMP_TASK */ +static inline unsigned int uclamp_util(struct rq *rq, unsigned int util) +{ + return util; +} +#endif /* CONFIG_UCLAMP_TASK */ + #ifdef arch_scale_freq_capacity # ifndef arch_scale_freq_invariant # define arch_scale_freq_invariant() true From patchwork Tue Aug 28 13:53:15 2018 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Patrick Bellasi X-Patchwork-Id: 10578583 Return-Path: Received: from mail.wl.linuxfoundation.org (pdx-wl-mail.web.codeaurora.org [172.30.200.125]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 6A5D013B8 for ; Tue, 28 Aug 2018 13:55:59 +0000 (UTC) Received: from mail.wl.linuxfoundation.org (localhost [127.0.0.1]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 595AF2A30F for ; Tue, 28 Aug 2018 13:55:59 +0000 (UTC) Received: by mail.wl.linuxfoundation.org (Postfix, from userid 486) id 4D4B32A320; Tue, 28 Aug 2018 13:55:59 +0000 (UTC) X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on pdx-wl-mail.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-7.9 required=2.0 tests=BAYES_00,MAILING_LIST_MULTI, RCVD_IN_DNSWL_HI autolearn=ham version=3.3.1 Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 1B64629953 for ; Tue, 28 Aug 2018 13:55:58 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1726439AbeH1Rq7 (ORCPT ); Tue, 28 Aug 2018 13:46:59 -0400 Received: from foss.arm.com ([217.140.101.70]:38504 "EHLO foss.arm.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1727209AbeH1Rq6 (ORCPT ); Tue, 28 Aug 2018 13:46:58 -0400 Received: from usa-sjc-imap-foss1.foss.arm.com (unknown [10.72.51.249]) by usa-sjc-mx-foss1.foss.arm.com (Postfix) with ESMTP id EF12C1684; Tue, 28 Aug 2018 06:54:20 -0700 (PDT) Received: from e110439-lin.Cambridge.arm.com (e110439-lin.emea.arm.com [10.4.12.126]) by usa-sjc-imap-foss1.foss.arm.com (Postfix) with ESMTPA id 094ED3F5BD; Tue, 28 Aug 2018 06:54:17 -0700 (PDT) From: Patrick Bellasi To: linux-kernel@vger.kernel.org, linux-pm@vger.kernel.org Cc: Ingo Molnar , Peter Zijlstra , Tejun Heo , "Rafael J . 
Wysocki" , Viresh Kumar , Vincent Guittot , Paul Turner , Quentin Perret , Dietmar Eggemann , Morten Rasmussen , Juri Lelli , Todd Kjos , Joel Fernandes , Steve Muckle , Suren Baghdasaryan Subject: [PATCH v4 07/16] sched/core: uclamp: extend cpu's cgroup controller Date: Tue, 28 Aug 2018 14:53:15 +0100 Message-Id: <20180828135324.21976-8-patrick.bellasi@arm.com> X-Mailer: git-send-email 2.18.0 In-Reply-To: <20180828135324.21976-1-patrick.bellasi@arm.com> References: <20180828135324.21976-1-patrick.bellasi@arm.com> Sender: linux-pm-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-pm@vger.kernel.org X-Virus-Scanned: ClamAV using ClamSMTP The cgroup's CPU controller allows to assign a specified (maximum) bandwidth to the tasks of a group. However this bandwidth is defined and enforced only on a temporal base, without considering the actual frequency a CPU is running on. Thus, the amount of computation completed by a task within an allocated bandwidth can be very different depending on the actual frequency the CPU is running that task. The amount of computation can be affected also by the specific CPU a task is running on, especially when running on asymmetric capacity systems like Arm's big.LITTLE. With the availability of schedutil, the scheduler is now able to drive frequency selections based on actual task utilization. Moreover, the utilization clamping support provides a mechanism to bias the frequency selection operated by schedutil depending on constraints assigned to the tasks currently RUNNABLE on a CPU. Give the above mechanisms, it is now possible to extend the cpu controller to specify what is the minimum (or maximum) utilization which a task is expected (or allowed) to generate. Constraints on minimum and maximum utilization allowed for tasks in a CPU cgroup can improve the control on the actual amount of CPU bandwidth consumed by tasks. Utilization clamping constraints are useful not only to bias frequency selection, when a task is running, but also to better support certain scheduler decisions regarding task placement. For example, on asymmetric capacity systems, a utilization clamp value can be conveniently used to enforce important interactive tasks on more capable CPUs or to run low priority and background tasks on more energy efficient CPUs. The ultimate goal of utilization clamping is thus to enable: - boosting: by selecting an higher capacity CPU and/or higher execution frequency for small tasks which are affecting the user interactive experience. - capping: by selecting more energy efficiency CPUs or lower execution frequency, for big tasks which are mainly related to background activities, and thus without a direct impact on the user experience. Thus, a proper extension of the cpu controller with utilization clamping support will make this controller even more suitable for integration with advanced system management software (e.g. Android). Indeed, an informed user-space can provide rich information hints to the scheduler regarding the tasks it's going to schedule. This patch extends the CPU controller by adding a couple of new attributes, util.min and util.max, which can be used to enforce task's utilization boosting and capping. Specifically: - util.min: defines the minimum utilization which should be considered, e.g. when schedutil selects the frequency for a CPU while a task in this group is RUNNABLE. i.e. 
the task will run at least at a minimum frequency which corresponds to the min_util utilization - util.max: defines the maximum utilization which should be considered, e.g. when schedutil selects the frequency for a CPU while a task in this group is RUNNABLE. i.e. the task will run up to a maximum frequency which corresponds to the max_util utilization These attributes: a) are available only for non-root nodes, both on default and legacy hierarchies b) do not enforce any constraints and/or dependency between the parent and its child nodes, thus relying on the delegation model and permission settings defined by the system management software c) allow to (eventually) further restrict task-specific clamps defined via sched_setattr(2) This patch provides the basic support to expose the two new attributes and to validate their run-time updates. However, we do not actually allocated clamp groups and thus the write calls added by this patch always returns -EINVAL. Following patches will provide the missing bits. Signed-off-by: Patrick Bellasi Cc: Ingo Molnar Cc: Peter Zijlstra Cc: Tejun Heo Cc: Rafael J. Wysocki Cc: Viresh Kumar Cc: Suren Baghdasaryan Cc: Todd Kjos Cc: Joel Fernandes Cc: Juri Lelli Cc: Quentin Perret Cc: Dietmar Eggemann Cc: Morten Rasmussen Cc: linux-kernel@vger.kernel.org Cc: linux-pm@vger.kernel.org --- Changes in v4: Others: - consolidate init_uclamp_sched_group() into init_uclamp() - refcount root_task_group's clamp groups from init_uclamp() - small documentation fixes Changes in v3: Message-ID: - rename UCLAMP_NONE into UCLAMP_NOT_VALID Message-ID: <20180409222417.GK3126663@devbig577.frc2.facebook.com> - use "." notation for attributes naming i.e. s/util_{min,max}/util.{min,max}/ Others - rebased on v4.19-rc1 Changes in v2: Message-ID: <20180409222417.GK3126663@devbig577.frc2.facebook.com> - make attributes available only on non-root nodes a system wide API seems of not immediate interest and thus it's not supported anymore - remove implicit parent-child constraints and dependencies Message-ID: <20180410200514.GA793541@devbig577.frc2.facebook.com> - add some cgroup-v2 documentation for the new attributes - (hopefully) better explain intended use-cases the changelog above has been extended to better justify the naming proposed by the new attributes Others: - rebased on v4.18-rc4 - reduced code to simplify the review of this patch which now provides just the basic code for CGroups integration - add attributes to the default hierarchy as well as the legacy one - use -ERANGE as range violation error These additional bits: - refcounting of clamp groups - RUNNABLE tasks refcount updates - aggregation of per-task and per-task_group utilization constraints are provided in separate and following patches to make it more clear and documented how they are performed. --- Documentation/admin-guide/cgroup-v2.rst | 25 ++++ include/linux/sched.h | 4 + init/Kconfig | 22 ++++ kernel/sched/core.c | 154 ++++++++++++++++++++++++ kernel/sched/sched.h | 5 + 5 files changed, 210 insertions(+) diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst index 184193bcb262..80ef7bdc517b 100644 --- a/Documentation/admin-guide/cgroup-v2.rst +++ b/Documentation/admin-guide/cgroup-v2.rst @@ -907,6 +907,12 @@ controller implements weight and absolute bandwidth limit models for normal scheduling policy and absolute bandwidth allocation model for realtime scheduling policy. 
+Cycles distribution is based, by default, on a temporal base and it +does not account for the frequency at which tasks are executed. +The (optional) utilization clamping support allows to enforce a minimum +bandwidth, which should always be provided by a CPU, and a maximum bandwidth, +which should never be exceeded by a CPU. + WARNING: cgroup2 doesn't yet support control of realtime processes and the cpu controller can only be enabled when all RT processes are in the root cgroup. Be aware that system management software may already @@ -966,6 +972,25 @@ All time durations are in microseconds. $PERIOD duration. "max" for $MAX indicates no limit. If only one number is written, $MAX is updated. + cpu.util.min + A read-write single value file which exists on non-root cgroups. + The default is "0", i.e. no bandwidth boosting. + + The minimum utilization in the range [0, 1023]. + + This interface allows reading and setting minimum utilization clamp + values similar to the sched_setattr(2). This minimum utilization + value is used to clamp the task specific minimum utilization clamp. + + cpu.util.max + A read-write single value file which exists on non-root cgroups. + The default is "1023". i.e. no bandwidth clamping + + The maximum utilization in the range [0, 1023]. + + This interface allows reading and setting maximum utilization clamp + values similar to the sched_setattr(2). This maximum utilization + value is used to clamp the task specific maximum utilization clamp. Memory ------ diff --git a/include/linux/sched.h b/include/linux/sched.h index 7385f0b1a7c0..dc39b67a366a 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -583,6 +583,10 @@ struct sched_dl_entity { * * A utilization clamp group maps a "clamp value" (value), i.e. * util_{min,max}, to a "clamp group index" (group_id). + * + * The same "group_id" can be used by multiple scheduling entities, i.e. + * either tasks or task groups, to enforce the same clamp "value" for a given + * clamp index. */ struct uclamp_se { unsigned int value; diff --git a/init/Kconfig b/init/Kconfig index 10536cb83295..089db7a804a8 100644 --- a/init/Kconfig +++ b/init/Kconfig @@ -827,6 +827,28 @@ config RT_GROUP_SCHED endif #CGROUP_SCHED +config UCLAMP_TASK_GROUP + bool "Utilization clamping per group of tasks" + depends on CGROUP_SCHED + depends on UCLAMP_TASK + default n + help + This feature enables the scheduler to track the clamped utilization + of each CPU based on RUNNABLE tasks currently scheduled on that CPU. + + When this option is enabled, the user can specify a min and max + CPU bandwidth which is allowed for each single task in a group. + The max bandwidth allows to clamp the maximum frequency a task + can use, while the min bandwidth allows to define a minimum + frequency a task will always use. + + When task group based utilization clamping is enabled, an eventually + specified task-specific clamp value is constrained by the cgroup + specified clamp value. Both minimum and maximum task clamping cannot + be bigger than the corresponding clamping defined at task group level. + + If in doubt, say N. 
+ config CGROUP_PIDS bool "PIDs controller" help diff --git a/kernel/sched/core.c b/kernel/sched/core.c index ba0e7208c65a..dcbf22abd0bf 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -1233,6 +1233,41 @@ static inline void uclamp_group_get(struct task_struct *p, uclamp_group_put(clamp_id, prev_group_id); } +#ifdef CONFIG_UCLAMP_TASK_GROUP +/** + * alloc_uclamp_sched_group: initialize a new TG's for utilization clamping + * @tg: the newly created task group + * @parent: its parent task group + * + * A newly created task group inherits its utilization clamp values, for all + * clamp indexes, from its parent task group. + * This ensures that its values are properly initialized and that the task + * group is accounted in the same parent's group index. + * + * Return: 0 on error + */ +static inline int alloc_uclamp_sched_group(struct task_group *tg, + struct task_group *parent) +{ + struct uclamp_se *uc_se; + int clamp_id; + + for (clamp_id = 0; clamp_id < UCLAMP_CNT; ++clamp_id) { + uc_se = &tg->uclamp[clamp_id]; + uc_se->value = parent->uclamp[clamp_id].value; + uc_se->group_id = parent->uclamp[clamp_id].group_id; + } + + return 1; +} +#else /* CONFIG_UCLAMP_TASK_GROUP */ +static inline int alloc_uclamp_sched_group(struct task_group *tg, + struct task_group *parent) +{ + return 1; +} +#endif /* CONFIG_UCLAMP_TASK_GROUP */ + static inline int __setscheduler_uclamp(struct task_struct *p, const struct sched_attr *attr) { @@ -1376,12 +1411,24 @@ static void __init init_uclamp(void) uc_se->group_id = UCLAMP_NOT_VALID; uclamp_group_get(NULL, clamp_id, 0, uc_se, uclamp_none(clamp_id)); + +#ifdef CONFIG_UCLAMP_TASK_GROUP + /* Init root TG's clamp group */ + uc_se = &root_task_group.uclamp[clamp_id]; + uc_se->value = uclamp_none(clamp_id); + uc_se->group_id = 0; +#endif } } #else /* CONFIG_UCLAMP_TASK */ static inline void uclamp_cpu_get(struct rq *rq, struct task_struct *p) { } static inline void uclamp_cpu_put(struct rq *rq, struct task_struct *p) { } +static inline int alloc_uclamp_sched_group(struct task_group *tg, + struct task_group *parent) +{ + return 1; +} static inline int __setscheduler_uclamp(struct task_struct *p, const struct sched_attr *attr) { @@ -6955,6 +7002,9 @@ struct task_group *sched_create_group(struct task_group *parent) if (!alloc_rt_sched_group(tg, parent)) goto err; + if (!alloc_uclamp_sched_group(tg, parent)) + goto err; + return tg; err: @@ -7175,6 +7225,84 @@ static void cpu_cgroup_attach(struct cgroup_taskset *tset) sched_move_task(task); } +#ifdef CONFIG_UCLAMP_TASK_GROUP +static int cpu_util_min_write_u64(struct cgroup_subsys_state *css, + struct cftype *cftype, u64 min_value) +{ + struct task_group *tg; + int ret = -EINVAL; + + if (min_value > SCHED_CAPACITY_SCALE) + return -ERANGE; + + rcu_read_lock(); + + tg = css_tg(css); + if (tg->uclamp[UCLAMP_MIN].value == min_value) { + ret = 0; + goto out; + } + if (tg->uclamp[UCLAMP_MAX].value < min_value) + goto out; + +out: + rcu_read_unlock(); + + return ret; +} + +static int cpu_util_max_write_u64(struct cgroup_subsys_state *css, + struct cftype *cftype, u64 max_value) +{ + struct task_group *tg; + int ret = -EINVAL; + + if (max_value > SCHED_CAPACITY_SCALE) + return -ERANGE; + + rcu_read_lock(); + + tg = css_tg(css); + if (tg->uclamp[UCLAMP_MAX].value == max_value) { + ret = 0; + goto out; + } + if (tg->uclamp[UCLAMP_MIN].value > max_value) + goto out; + +out: + rcu_read_unlock(); + + return ret; +} + +static inline u64 cpu_uclamp_read(struct cgroup_subsys_state *css, + enum uclamp_id clamp_id) +{ + struct 
task_group *tg; + u64 util_clamp; + + rcu_read_lock(); + tg = css_tg(css); + util_clamp = tg->uclamp[clamp_id].value; + rcu_read_unlock(); + + return util_clamp; +} + +static u64 cpu_util_min_read_u64(struct cgroup_subsys_state *css, + struct cftype *cft) +{ + return cpu_uclamp_read(css, UCLAMP_MIN); +} + +static u64 cpu_util_max_read_u64(struct cgroup_subsys_state *css, + struct cftype *cft) +{ + return cpu_uclamp_read(css, UCLAMP_MAX); +} +#endif /* CONFIG_UCLAMP_TASK_GROUP */ + #ifdef CONFIG_FAIR_GROUP_SCHED static int cpu_shares_write_u64(struct cgroup_subsys_state *css, struct cftype *cftype, u64 shareval) @@ -7512,6 +7640,18 @@ static struct cftype cpu_legacy_files[] = { .read_u64 = cpu_rt_period_read_uint, .write_u64 = cpu_rt_period_write_uint, }, +#endif +#ifdef CONFIG_UCLAMP_TASK_GROUP + { + .name = "util.min", + .read_u64 = cpu_util_min_read_u64, + .write_u64 = cpu_util_min_write_u64, + }, + { + .name = "util.max", + .read_u64 = cpu_util_max_read_u64, + .write_u64 = cpu_util_max_write_u64, + }, #endif { } /* Terminate */ }; @@ -7679,6 +7819,20 @@ static struct cftype cpu_files[] = { .seq_show = cpu_max_show, .write = cpu_max_write, }, +#endif +#ifdef CONFIG_UCLAMP_TASK_GROUP + { + .name = "util_min", + .flags = CFTYPE_NOT_ON_ROOT, + .read_u64 = cpu_util_min_read_u64, + .write_u64 = cpu_util_min_write_u64, + }, + { + .name = "util_max", + .flags = CFTYPE_NOT_ON_ROOT, + .read_u64 = cpu_util_max_read_u64, + .write_u64 = cpu_util_max_write_u64, + }, #endif { } /* terminate */ }; diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index 1b05b38b1081..489d7403affe 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -389,6 +389,11 @@ struct task_group { #endif struct cfs_bandwidth cfs_bandwidth; + +#ifdef CONFIG_UCLAMP_TASK_GROUP + struct uclamp_se uclamp[UCLAMP_CNT]; +#endif + }; #ifdef CONFIG_FAIR_GROUP_SCHED From patchwork Tue Aug 28 13:53:16 2018 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Patrick Bellasi X-Patchwork-Id: 10578585 Return-Path: Received: from mail.wl.linuxfoundation.org (pdx-wl-mail.web.codeaurora.org [172.30.200.125]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 8B49A174A for ; Tue, 28 Aug 2018 13:55:59 +0000 (UTC) Received: from mail.wl.linuxfoundation.org (localhost [127.0.0.1]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 7ACFC29953 for ; Tue, 28 Aug 2018 13:55:59 +0000 (UTC) Received: by mail.wl.linuxfoundation.org (Postfix, from userid 486) id 6EB4F2A320; Tue, 28 Aug 2018 13:55:59 +0000 (UTC) X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on pdx-wl-mail.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-7.9 required=2.0 tests=BAYES_00,MAILING_LIST_MULTI, RCVD_IN_DNSWL_HI autolearn=ham version=3.3.1 Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 6A79D2A1A4 for ; Tue, 28 Aug 2018 13:55:58 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1727519AbeH1Rro (ORCPT ); Tue, 28 Aug 2018 13:47:44 -0400 Received: from foss.arm.com ([217.140.101.70]:38662 "EHLO foss.arm.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1727284AbeH1Rq7 (ORCPT ); Tue, 28 Aug 2018 13:46:59 -0400 Received: from usa-sjc-imap-foss1.foss.arm.com (unknown [10.72.51.249]) by usa-sjc-mx-foss1.foss.arm.com (Postfix) with ESMTP id 2D291168F; Tue, 28 Aug 2018 06:54:24 -0700 (PDT) Received: from e110439-lin.Cambridge.arm.com 
(e110439-lin.emea.arm.com [10.4.12.126]) by usa-sjc-imap-foss1.foss.arm.com (Postfix) with ESMTPA id 3BC6E3F5BD; Tue, 28 Aug 2018 06:54:21 -0700 (PDT) From: Patrick Bellasi To: linux-kernel@vger.kernel.org, linux-pm@vger.kernel.org Cc: Ingo Molnar , Peter Zijlstra , Tejun Heo , "Rafael J . Wysocki" , Viresh Kumar , Vincent Guittot , Paul Turner , Quentin Perret , Dietmar Eggemann , Morten Rasmussen , Juri Lelli , Todd Kjos , Joel Fernandes , Steve Muckle , Suren Baghdasaryan Subject: [PATCH v4 08/16] sched/core: uclamp: propagate parent clamps Date: Tue, 28 Aug 2018 14:53:16 +0100 Message-Id: <20180828135324.21976-9-patrick.bellasi@arm.com> X-Mailer: git-send-email 2.18.0 In-Reply-To: <20180828135324.21976-1-patrick.bellasi@arm.com> References: <20180828135324.21976-1-patrick.bellasi@arm.com> Sender: linux-pm-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-pm@vger.kernel.org X-Virus-Scanned: ClamAV using ClamSMTP In order to properly support hierarchical resources control, the cgroup delegation model requires that attribute writes from a child group never fail but still are (potentially) constrained based on parent's assigned resources. This requires to properly propagate and aggregate parent attributes down to its descendants. Let's implement this mechanism by adding a new "effective" clamp value for each task group. The effective clamp value is defined as the smaller value between the clamp value of a group and the effective clamp value of its parent. This represent also the clamp value which is actually used to clamp tasks in each task group. Since it can be interesting for tasks in a cgroup to know exactly what is the currently propagated/enforced configuration, the effective clamp values are exposed to user-space by means of a new pair of read-only attributes: cpu.util.{min,max}.effective. Signed-off-by: Patrick Bellasi Cc: Ingo Molnar Cc: Peter Zijlstra Cc: Tejun Heo Cc: Rafael J. Wysocki Cc: Viresh Kumar Cc: Suren Baghdasaryan Cc: Todd Kjos Cc: Joel Fernandes Cc: Juri Lelli Cc: Quentin Perret Cc: Dietmar Eggemann Cc: Morten Rasmussen Cc: linux-kernel@vger.kernel.org Cc: linux-pm@vger.kernel.org --- Changes in v4: Message-ID: <20180816140731.GD2960@e110439-lin> - add ".effective" attributes to the default hierarchy Others: - small documentation fixes - rebased on v4.19-rc1 Changes in v3: Message-ID: <20180409222417.GK3126663@devbig577.frc2.facebook.com> - new patch in v3, to implement a suggestion from v1 review --- Documentation/admin-guide/cgroup-v2.rst | 25 +++++- include/linux/sched.h | 8 ++ kernel/sched/core.c | 112 +++++++++++++++++++++++- 3 files changed, 139 insertions(+), 6 deletions(-) diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst index 80ef7bdc517b..72272f58d304 100644 --- a/Documentation/admin-guide/cgroup-v2.rst +++ b/Documentation/admin-guide/cgroup-v2.rst @@ -976,22 +976,43 @@ All time durations are in microseconds. A read-write single value file which exists on non-root cgroups. The default is "0", i.e. no bandwidth boosting. - The minimum utilization in the range [0, 1023]. + The requested minimum utilization in the range [0, 1023]. This interface allows reading and setting minimum utilization clamp values similar to the sched_setattr(2). This minimum utilization value is used to clamp the task specific minimum utilization clamp. 
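A short example of the resulting semantics: if a parent group sets cpu.util.max=600 while one of its children requests cpu.util.max=800, the child's cpu.util.max.effective reports 600, since the more restrictive parent value wins; if the parent is later relaxed to 1023, cpu_util_update_hier() walks the subtree and raises the child's effective value back to its own requested 800. The requested values are never rewritten by the kernel; only the .effective files track what is actually enforced.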
+ cpu.util.min.effective + A read-only single value file which exists on non-root cgroups and + reports minimum utilization clamp value currently enforced on a task + group. + + The actual minimum utilization in the range [0, 1023]. + + This value can be lower then cpu.util.min in case a parent cgroup + is enforcing a more restrictive clamping on minimum utilization. + cpu.util.max A read-write single value file which exists on non-root cgroups. The default is "1023". i.e. no bandwidth clamping - The maximum utilization in the range [0, 1023]. + The requested maximum utilization in the range [0, 1023]. This interface allows reading and setting maximum utilization clamp values similar to the sched_setattr(2). This maximum utilization value is used to clamp the task specific maximum utilization clamp. + cpu.util.max.effective + A read-only single value file which exists on non-root cgroups and + reports maximum utilization clamp value currently enforced on a task + group. + + The actual maximum utilization in the range [0, 1023]. + + This value can be lower then cpu.util.max in case a parent cgroup + is enforcing a more restrictive clamping on max utilization. + + Memory ------ diff --git a/include/linux/sched.h b/include/linux/sched.h index dc39b67a366a..2da130d17e70 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -591,6 +591,14 @@ struct sched_dl_entity { struct uclamp_se { unsigned int value; unsigned int group_id; + /* + * Effective task (group) clamp value. + * For task groups is the value (eventually) enforced by a parent task + * group. + */ + struct { + unsigned int value; + } effective; }; union rcu_special { diff --git a/kernel/sched/core.c b/kernel/sched/core.c index dcbf22abd0bf..b2d438b6484b 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -1254,6 +1254,8 @@ static inline int alloc_uclamp_sched_group(struct task_group *tg, for (clamp_id = 0; clamp_id < UCLAMP_CNT; ++clamp_id) { uc_se = &tg->uclamp[clamp_id]; + uc_se->effective.value = + parent->uclamp[clamp_id].effective.value; uc_se->value = parent->uclamp[clamp_id].value; uc_se->group_id = parent->uclamp[clamp_id].group_id; } @@ -1415,6 +1417,7 @@ static void __init init_uclamp(void) #ifdef CONFIG_UCLAMP_TASK_GROUP /* Init root TG's clamp group */ uc_se = &root_task_group.uclamp[clamp_id]; + uc_se->effective.value = uclamp_none(clamp_id); uc_se->value = uclamp_none(clamp_id); uc_se->group_id = 0; #endif @@ -7226,6 +7229,68 @@ static void cpu_cgroup_attach(struct cgroup_taskset *tset) } #ifdef CONFIG_UCLAMP_TASK_GROUP +/** + * cpu_util_update_hier: propagete effective clamp down the hierarchy + * @css: the task group to update + * @clamp_id: the clamp index to update + * @value: the new task group clamp value + * + * The effective clamp for a TG is expected to track the most restrictive + * value between the TG's clamp value and it's parent effective clamp value. + * This method achieve that: + * 1. updating the current TG effective value + * 2. walking all the descendant task group that needs an update + * + * A TG's effective clamp needs to be updated when its current value is not + * matching the TG's clamp value. 
In this case indeed either: + * a) the parent has got a more relaxed clamp value + * thus potentially we can relax the effective value for this group + * b) the parent has got a more strict clamp value + * thus potentially we have to restrict the effective value of this group + * + * Restriction and relaxation of current TG's effective clamp values needs to + * be propagated down to all the descendants. When a subgroup is found which + * has already its effective clamp value matching its clamp value, then we can + * safely skip all its descendants which are granted to be already in sync. + */ +static void cpu_util_update_hier(struct cgroup_subsys_state *css, + int clamp_id, int value) +{ + struct cgroup_subsys_state *top_css = css; + struct uclamp_se *uc_se, *uc_parent; + + css_for_each_descendant_pre(css, top_css) { + /* + * The first visited task group is top_css, which clamp value + * is the one passed as parameter. For descendent task + * groups we consider their current value. + */ + uc_se = &css_tg(css)->uclamp[clamp_id]; + if (css != top_css) + value = uc_se->value; + /* + * Skip the whole subtrees if the current effective clamp is + * alredy matching the TG's clamp value. + * In this case, all the subtrees already have top_value, or a + * more restrictive, as effective clamp. + */ + uc_parent = &css_tg(css)->parent->uclamp[clamp_id]; + if (uc_se->effective.value == value && + uc_parent->effective.value >= value) { + css = css_rightmost_descendant(css); + continue; + } + + /* Propagate the most restrictive effective value */ + if (uc_parent->effective.value < value) + value = uc_parent->effective.value; + if (uc_se->effective.value == value) + continue; + + uc_se->effective.value = value; + } +} + static int cpu_util_min_write_u64(struct cgroup_subsys_state *css, struct cftype *cftype, u64 min_value) { @@ -7245,6 +7310,9 @@ static int cpu_util_min_write_u64(struct cgroup_subsys_state *css, if (tg->uclamp[UCLAMP_MAX].value < min_value) goto out; + /* Update effective clamps to track the most restrictive value */ + cpu_util_update_hier(css, UCLAMP_MIN, min_value); + out: rcu_read_unlock(); @@ -7270,6 +7338,9 @@ static int cpu_util_max_write_u64(struct cgroup_subsys_state *css, if (tg->uclamp[UCLAMP_MIN].value > max_value) goto out; + /* Update effective clamps to track the most restrictive value */ + cpu_util_update_hier(css, UCLAMP_MAX, max_value); + out: rcu_read_unlock(); @@ -7277,14 +7348,17 @@ static int cpu_util_max_write_u64(struct cgroup_subsys_state *css, } static inline u64 cpu_uclamp_read(struct cgroup_subsys_state *css, - enum uclamp_id clamp_id) + enum uclamp_id clamp_id, + bool effective) { struct task_group *tg; u64 util_clamp; rcu_read_lock(); tg = css_tg(css); - util_clamp = tg->uclamp[clamp_id].value; + util_clamp = effective + ? 
tg->uclamp[clamp_id].effective.value + : tg->uclamp[clamp_id].value; rcu_read_unlock(); return util_clamp; @@ -7293,13 +7367,25 @@ static inline u64 cpu_uclamp_read(struct cgroup_subsys_state *css, static u64 cpu_util_min_read_u64(struct cgroup_subsys_state *css, struct cftype *cft) { - return cpu_uclamp_read(css, UCLAMP_MIN); + return cpu_uclamp_read(css, UCLAMP_MIN, false); } static u64 cpu_util_max_read_u64(struct cgroup_subsys_state *css, struct cftype *cft) { - return cpu_uclamp_read(css, UCLAMP_MAX); + return cpu_uclamp_read(css, UCLAMP_MAX, false); +} + +static u64 cpu_util_min_effective_read_u64(struct cgroup_subsys_state *css, + struct cftype *cft) +{ + return cpu_uclamp_read(css, UCLAMP_MIN, true); +} + +static u64 cpu_util_max_effective_read_u64(struct cgroup_subsys_state *css, + struct cftype *cft) +{ + return cpu_uclamp_read(css, UCLAMP_MAX, true); } #endif /* CONFIG_UCLAMP_TASK_GROUP */ @@ -7647,11 +7733,19 @@ static struct cftype cpu_legacy_files[] = { .read_u64 = cpu_util_min_read_u64, .write_u64 = cpu_util_min_write_u64, }, + { + .name = "util.min.effective", + .read_u64 = cpu_util_min_effective_read_u64, + }, { .name = "util.max", .read_u64 = cpu_util_max_read_u64, .write_u64 = cpu_util_max_write_u64, }, + { + .name = "util.max.effective", + .read_u64 = cpu_util_max_effective_read_u64, + }, #endif { } /* Terminate */ }; @@ -7827,12 +7921,22 @@ static struct cftype cpu_files[] = { .read_u64 = cpu_util_min_read_u64, .write_u64 = cpu_util_min_write_u64, }, + { + .name = "util.min.effective", + .flags = CFTYPE_NOT_ON_ROOT, + .read_u64 = cpu_util_min_effective_read_u64, + }, { .name = "util_max", .flags = CFTYPE_NOT_ON_ROOT, .read_u64 = cpu_util_max_read_u64, .write_u64 = cpu_util_max_write_u64, }, + { + .name = "util.max.effective", + .flags = CFTYPE_NOT_ON_ROOT, + .read_u64 = cpu_util_max_effective_read_u64, + }, #endif { } /* terminate */ }; From patchwork Tue Aug 28 13:53:17 2018 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Patrick Bellasi X-Patchwork-Id: 10578591 Return-Path: Received: from mail.wl.linuxfoundation.org (pdx-wl-mail.web.codeaurora.org [172.30.200.125]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id A2F07920 for ; Tue, 28 Aug 2018 13:56:24 +0000 (UTC) Received: from mail.wl.linuxfoundation.org (localhost [127.0.0.1]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 93BD429953 for ; Tue, 28 Aug 2018 13:56:24 +0000 (UTC) Received: by mail.wl.linuxfoundation.org (Postfix, from userid 486) id 87F4A2A1A4; Tue, 28 Aug 2018 13:56:24 +0000 (UTC) X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on pdx-wl-mail.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-7.9 required=2.0 tests=BAYES_00,MAILING_LIST_MULTI, RCVD_IN_DNSWL_HI autolearn=ham version=3.3.1 Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 5E26829953 for ; Tue, 28 Aug 2018 13:56:22 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1727284AbeH1RsC (ORCPT ); Tue, 28 Aug 2018 13:48:02 -0400 Received: from usa-sjc-mx-foss1.foss.arm.com ([217.140.101.70]:38664 "EHLO foss.arm.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1727379AbeH1Rq7 (ORCPT ); Tue, 28 Aug 2018 13:46:59 -0400 Received: from usa-sjc-imap-foss1.foss.arm.com (unknown [10.72.51.249]) by usa-sjc-mx-foss1.foss.arm.com (Postfix) with ESMTP id 5FF6F174E; Tue, 28 Aug 2018 06:54:27 -0700 (PDT) Received: from 
e110439-lin.Cambridge.arm.com (e110439-lin.emea.arm.com [10.4.12.126]) by usa-sjc-imap-foss1.foss.arm.com (Postfix) with ESMTPA id 6EC343F5BD; Tue, 28 Aug 2018 06:54:24 -0700 (PDT) From: Patrick Bellasi To: linux-kernel@vger.kernel.org, linux-pm@vger.kernel.org Cc: Ingo Molnar , Peter Zijlstra , Tejun Heo , "Rafael J . Wysocki" , Viresh Kumar , Vincent Guittot , Paul Turner , Quentin Perret , Dietmar Eggemann , Morten Rasmussen , Juri Lelli , Todd Kjos , Joel Fernandes , Steve Muckle , Suren Baghdasaryan Subject: [PATCH v4 09/16] sched/core: uclamp: map TG's clamp values into CPU's clamp groups Date: Tue, 28 Aug 2018 14:53:17 +0100 Message-Id: <20180828135324.21976-10-patrick.bellasi@arm.com> X-Mailer: git-send-email 2.18.0 In-Reply-To: <20180828135324.21976-1-patrick.bellasi@arm.com> References: <20180828135324.21976-1-patrick.bellasi@arm.com> Sender: linux-pm-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-pm@vger.kernel.org X-Virus-Scanned: ClamAV using ClamSMTP Utilization clamping requires to map each different clamp value into one of the available clamp groups used by the scheduler's fast-path to account for RUNNABLE tasks. Thus, each time a TG's clamp value sysfs attribute is updated via: cpu_util_{min,max}_write_u64() we need to get (if possible) a reference to the new value's clamp group and release the reference to the previous one. Let's ensure that, whenever a task group is assigned a specific clamp_value, this is properly translated into a unique clamp group to be used in the fast-path (i.e. at enqueue/dequeue time). We do that by slightly refactoring uclamp_group_get() to make the *task_struct parameter optional. This allows to re-use the code already available to support the per-task API. Signed-off-by: Patrick Bellasi Cc: Ingo Molnar Cc: Peter Zijlstra Cc: Tejun Heo Cc: Rafael J. Wysocki Cc: Viresh Kumar Cc: Suren Baghdasaryan Cc: Todd Kjos Cc: Joel Fernandes Cc: Juri Lelli Cc: Quentin Perret Cc: Dietmar Eggemann Cc: Morten Rasmussen Cc: linux-kernel@vger.kernel.org Cc: linux-pm@vger.kernel.org --- Changes in v4: Others: - rebased on v4.19-rc1 Changes in v3: Message-ID: - add explicit calls to uclamp_group_find(), which is now not more part of uclamp_group_get() Others: - rebased on tip/sched/core Changes in v2: - rebased on v4.18-rc4 - this code has been split from a previous patch to simplify the review --- include/linux/sched.h | 11 +++-- kernel/sched/core.c | 95 +++++++++++++++++++++++++++++++++++++++---- 2 files changed, 95 insertions(+), 11 deletions(-) diff --git a/include/linux/sched.h b/include/linux/sched.h index 2da130d17e70..4e5522ed57e0 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -587,17 +587,22 @@ struct sched_dl_entity { * The same "group_id" can be used by multiple scheduling entities, i.e. * either tasks or task groups, to enforce the same clamp "value" for a given * clamp index. + * + * Scheduling entity's specific clamp group index can be different + * from the effective clamp group index used at enqueue time since + * task groups's clamps can be restricted by their parent task group. */ struct uclamp_se { unsigned int value; unsigned int group_id; /* - * Effective task (group) clamp value. - * For task groups is the value (eventually) enforced by a parent task - * group. + * Effective task (group) clamp value and group index. + * For task groups it's the value (eventually) enforced by a parent + * task group. 
*/ struct { unsigned int value; + unsigned int group_id; } effective; }; diff --git a/kernel/sched/core.c b/kernel/sched/core.c index b2d438b6484b..e617a7b18f2d 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -1250,24 +1250,51 @@ static inline int alloc_uclamp_sched_group(struct task_group *tg, struct task_group *parent) { struct uclamp_se *uc_se; + int next_group_id; int clamp_id; for (clamp_id = 0; clamp_id < UCLAMP_CNT; ++clamp_id) { uc_se = &tg->uclamp[clamp_id]; + uc_se->effective.value = parent->uclamp[clamp_id].effective.value; - uc_se->value = parent->uclamp[clamp_id].value; - uc_se->group_id = parent->uclamp[clamp_id].group_id; + uc_se->effective.group_id = + parent->uclamp[clamp_id].effective.group_id; + + next_group_id = parent->uclamp[clamp_id].group_id; + uc_se->group_id = UCLAMP_NOT_VALID; + uclamp_group_get(NULL, clamp_id, next_group_id, uc_se, + parent->uclamp[clamp_id].value); } return 1; } + +/** + * release_uclamp_sched_group: release utilization clamp references of a TG + * @tg: the task group being removed + * + * An empty task group can be removed only when it has no more tasks or child + * groups. This means that we can also safely release all the reference + * counting to clamp groups. + */ +static inline void free_uclamp_sched_group(struct task_group *tg) +{ + struct uclamp_se *uc_se; + int clamp_id; + + for (clamp_id = 0; clamp_id < UCLAMP_CNT; ++clamp_id) { + uc_se = &tg->uclamp[clamp_id]; + uclamp_group_put(clamp_id, uc_se->group_id); + } +} #else /* CONFIG_UCLAMP_TASK_GROUP */ static inline int alloc_uclamp_sched_group(struct task_group *tg, struct task_group *parent) { return 1; } +static inline void free_uclamp_sched_group(struct task_group *tg) { } #endif /* CONFIG_UCLAMP_TASK_GROUP */ static inline int __setscheduler_uclamp(struct task_struct *p, @@ -1417,9 +1444,18 @@ static void __init init_uclamp(void) #ifdef CONFIG_UCLAMP_TASK_GROUP /* Init root TG's clamp group */ uc_se = &root_task_group.uclamp[clamp_id]; + uc_se->effective.value = uclamp_none(clamp_id); - uc_se->value = uclamp_none(clamp_id); - uc_se->group_id = 0; + uc_se->effective.group_id = 0; + + /* + * The max utilization is always allowed for both clamps. + * This is required to not force a null minimum utiliation on + * all child groups. + */ + uc_se->group_id = UCLAMP_NOT_VALID; + uclamp_group_get(NULL, clamp_id, 0, uc_se, + uclamp_none(UCLAMP_MAX)); #endif } } @@ -1427,6 +1463,7 @@ static void __init init_uclamp(void) #else /* CONFIG_UCLAMP_TASK */ static inline void uclamp_cpu_get(struct rq *rq, struct task_struct *p) { } static inline void uclamp_cpu_put(struct rq *rq, struct task_struct *p) { } +static inline void free_uclamp_sched_group(struct task_group *tg) { } static inline int alloc_uclamp_sched_group(struct task_group *tg, struct task_group *parent) { @@ -6984,6 +7021,7 @@ static DEFINE_SPINLOCK(task_group_lock); static void sched_free_group(struct task_group *tg) { + free_uclamp_sched_group(tg); free_fair_sched_group(tg); free_rt_sched_group(tg); autogroup_free(tg); @@ -7234,6 +7272,7 @@ static void cpu_cgroup_attach(struct cgroup_taskset *tset) * @css: the task group to update * @clamp_id: the clamp index to update * @value: the new task group clamp value + * @group_id: the group index mapping the new task clamp value * * The effective clamp for a TG is expected to track the most restrictive * value between the TG's clamp value and it's parent effective clamp value. 
@@ -7252,9 +7291,12 @@ static void cpu_cgroup_attach(struct cgroup_taskset *tset) * be propagated down to all the descendants. When a subgroup is found which * has already its effective clamp value matching its clamp value, then we can * safely skip all its descendants which are granted to be already in sync. + * + * The TG's group_id is also updated to ensure it tracks the effective clamp + * value. */ static void cpu_util_update_hier(struct cgroup_subsys_state *css, - int clamp_id, int value) + int clamp_id, int value, int group_id) { struct cgroup_subsys_state *top_css = css; struct uclamp_se *uc_se, *uc_parent; @@ -7282,24 +7324,30 @@ static void cpu_util_update_hier(struct cgroup_subsys_state *css, } /* Propagate the most restrictive effective value */ - if (uc_parent->effective.value < value) + if (uc_parent->effective.value < value) { value = uc_parent->effective.value; + group_id = uc_parent->effective.group_id; + } if (uc_se->effective.value == value) continue; uc_se->effective.value = value; + uc_se->effective.group_id = group_id; } } static int cpu_util_min_write_u64(struct cgroup_subsys_state *css, struct cftype *cftype, u64 min_value) { + struct uclamp_se *uc_se; struct task_group *tg; int ret = -EINVAL; + int group_id; if (min_value > SCHED_CAPACITY_SCALE) return -ERANGE; + mutex_lock(&uclamp_mutex); rcu_read_lock(); tg = css_tg(css); @@ -7310,11 +7358,25 @@ static int cpu_util_min_write_u64(struct cgroup_subsys_state *css, if (tg->uclamp[UCLAMP_MAX].value < min_value) goto out; + /* Find a valid group_id */ + ret = uclamp_group_find(UCLAMP_MIN, min_value); + if (ret == -ENOSPC) { + pr_err(UCLAMP_ENOSPC_FMT, "MIN"); + goto out; + } + group_id = ret; + ret = 0; + /* Update effective clamps to track the most restrictive value */ - cpu_util_update_hier(css, UCLAMP_MIN, min_value); + cpu_util_update_hier(css, UCLAMP_MIN, min_value, group_id); + + /* Update TG's reference count */ + uc_se = &tg->uclamp[UCLAMP_MIN]; + uclamp_group_get(NULL, UCLAMP_MIN, group_id, uc_se, min_value); out: rcu_read_unlock(); + mutex_unlock(&uclamp_mutex); return ret; } @@ -7322,12 +7384,15 @@ static int cpu_util_min_write_u64(struct cgroup_subsys_state *css, static int cpu_util_max_write_u64(struct cgroup_subsys_state *css, struct cftype *cftype, u64 max_value) { + struct uclamp_se *uc_se; struct task_group *tg; int ret = -EINVAL; + int group_id; if (max_value > SCHED_CAPACITY_SCALE) return -ERANGE; + mutex_lock(&uclamp_mutex); rcu_read_lock(); tg = css_tg(css); @@ -7338,11 +7403,25 @@ static int cpu_util_max_write_u64(struct cgroup_subsys_state *css, if (tg->uclamp[UCLAMP_MIN].value > max_value) goto out; + /* Find a valid group_id */ + ret = uclamp_group_find(UCLAMP_MAX, max_value); + if (ret == -ENOSPC) { + pr_err(UCLAMP_ENOSPC_FMT, "MAX"); + goto out; + } + group_id = ret; + ret = 0; + /* Update effective clamps to track the most restrictive value */ - cpu_util_update_hier(css, UCLAMP_MAX, max_value); + cpu_util_update_hier(css, UCLAMP_MAX, max_value, group_id); + + /* Update TG's reference count */ + uc_se = &tg->uclamp[UCLAMP_MAX]; + uclamp_group_get(NULL, UCLAMP_MAX, group_id, uc_se, max_value); out: rcu_read_unlock(); + mutex_unlock(&uclamp_mutex); return ret; } From patchwork Tue Aug 28 13:53:18 2018 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Patrick Bellasi X-Patchwork-Id: 10578581 Return-Path: Received: from mail.wl.linuxfoundation.org (pdx-wl-mail.web.codeaurora.org [172.30.200.125]) by 
pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 24A11920 for ; Tue, 28 Aug 2018 13:55:53 +0000 (UTC) Received: from mail.wl.linuxfoundation.org (localhost [127.0.0.1]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 1119829B5C for ; Tue, 28 Aug 2018 13:55:53 +0000 (UTC) Received: by mail.wl.linuxfoundation.org (Postfix, from userid 486) id 038E52A320; Tue, 28 Aug 2018 13:55:53 +0000 (UTC) X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on pdx-wl-mail.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-7.9 required=2.0 tests=BAYES_00,MAILING_LIST_MULTI, RCVD_IN_DNSWL_HI autolearn=ham version=3.3.1 Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 1CDEF29B5C for ; Tue, 28 Aug 2018 13:55:52 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1727706AbeH1Rrc (ORCPT ); Tue, 28 Aug 2018 13:47:32 -0400 Received: from foss.arm.com ([217.140.101.70]:38668 "EHLO foss.arm.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1727323AbeH1Rq7 (ORCPT ); Tue, 28 Aug 2018 13:46:59 -0400 Received: from usa-sjc-imap-foss1.foss.arm.com (unknown [10.72.51.249]) by usa-sjc-mx-foss1.foss.arm.com (Postfix) with ESMTP id 931E91993; Tue, 28 Aug 2018 06:54:30 -0700 (PDT) Received: from e110439-lin.Cambridge.arm.com (e110439-lin.emea.arm.com [10.4.12.126]) by usa-sjc-imap-foss1.foss.arm.com (Postfix) with ESMTPA id A1DC93F5BD; Tue, 28 Aug 2018 06:54:27 -0700 (PDT) From: Patrick Bellasi To: linux-kernel@vger.kernel.org, linux-pm@vger.kernel.org Cc: Ingo Molnar , Peter Zijlstra , Tejun Heo , "Rafael J . Wysocki" , Viresh Kumar , Vincent Guittot , Paul Turner , Quentin Perret , Dietmar Eggemann , Morten Rasmussen , Juri Lelli , Todd Kjos , Joel Fernandes , Steve Muckle , Suren Baghdasaryan Subject: [PATCH v4 10/16] sched/core: uclamp: use TG's clamps to restrict Task's clamps Date: Tue, 28 Aug 2018 14:53:18 +0100 Message-Id: <20180828135324.21976-11-patrick.bellasi@arm.com> X-Mailer: git-send-email 2.18.0 In-Reply-To: <20180828135324.21976-1-patrick.bellasi@arm.com> References: <20180828135324.21976-1-patrick.bellasi@arm.com> Sender: linux-pm-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-pm@vger.kernel.org X-Virus-Scanned: ClamAV using ClamSMTP When a task's util_clamp value is configured via sched_setattr(2), this value has to be properly accounted in the corresponding clamp group every time the task is enqueued and dequeued. When cgroups are also in use, per-task clamp values have to be aggregated to those of the CPU's controller's Task Group (TG) in which the task is currently living. Let's update uclamp_cpu_get() to provide aggregation between the task and the TG clamp values. Every time a task is enqueued, it will be accounted in the clamp_group which defines the smaller clamp between the task specific value and its TG effective value. This also mimics what already happen for a task's CPU affinity mask when the task is also living in a cpuset. The overall idea is that cgroup attributes are always used to restrict the per-task attributes. Thus, this implementation allows to: 1. ensure cgroup clamps are always used to restrict task specific requests, i.e. boosted only up to the effective granted value or clamped at least to a certain value 2. 
implements a "nice-like" policy, where tasks are still allowed to request less then what enforced by their current TG For this mechanisms to work properly, we exploit the concept of "effective" clamp, which is already used by a TG to track parent enforced restrictions. In this patch we re-use the same variable: task_struct::uclamp::effective::group_id to track the currently most restrictive clamp group each task is subject to and thus it's also currently refcounted into. This solution allows also to better decouple the slow-path, where task and task group clamp values are updated, from the fast-path, where the most appropriate clamp value is tracked by refcounting clamp groups. For consistency purposes, as well as to properly inform userspace, the sched_getattr(2) call is updated to always return the properly aggregated constrains as described above. This will also make sched_getattr(2) a convenient userspace API to know the utilization constraints enforced on a task by the cgroup's CPU controller. Signed-off-by: Patrick Bellasi Cc: Ingo Molnar Cc: Peter Zijlstra Cc: Tejun Heo Cc: Paul Turner Cc: Suren Baghdasaryan Cc: Todd Kjos Cc: Joel Fernandes Cc: Steve Muckle Cc: Juri Lelli Cc: Quentin Perret Cc: Dietmar Eggemann Cc: Morten Rasmussen Cc: linux-kernel@vger.kernel.org Cc: linux-pm@vger.kernel.org --- Changes in v4: Message-ID: <20180816140731.GD2960@e110439-lin> - reuse already existing: task_struct::uclamp::effective::group_id instead of adding: task_struct::uclamp_group_id to back annotate the effective clamp group in which a task has been refcounted Others: - small documentation fixes - rebased on v4.19-rc1 Changes in v3: Message-ID: - rename UCLAMP_NONE into UCLAMP_NOT_VALID - fix not required override - fix typos in changelog Others: - clean up uclamp_cpu_get_id()/sched_getattr() code by moving task's clamp group_id/value code into dedicated getter functions: uclamp_task_group_id(), uclamp_group_value() and uclamp_task_value() - rebased on tip/sched/core Changes in v2: OSPM discussion: - implement a "nice" semantics where cgroup clamp values are always used to restrict task specific clamp values, i.e. tasks running on a TG are only allowed to demote themself. Other: - rabased on v4.18-rc4 - this code has been split from a previous patch to simplify the review --- kernel/sched/core.c | 86 ++++++++++++++++++++++++++++++++++++++++---- kernel/sched/sched.h | 2 +- 2 files changed, 80 insertions(+), 8 deletions(-) diff --git a/kernel/sched/core.c b/kernel/sched/core.c index e617a7b18f2d..da0b3bd41e96 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -950,14 +950,75 @@ static inline void uclamp_cpu_update(struct rq *rq, int clamp_id, rq->uclamp.value[clamp_id] = max_value; } +/** + * uclamp_task_group_id: get the effective clamp group index of a task + * + * The effective clamp group index of a task depends on its status, RUNNABLE + * or SLEEPING, and on: + * - the task specific clamp value, when !UCLAMP_NOT_VALID + * - its task group effective clamp value, for tasks not in the root group + * - the system default clamp value, for tasks in the root group + * + * This method returns the effective group index for a task, depending on its + * status and a proper aggregation of the clamp values listed above. 
+ */ +static inline int uclamp_task_group_id(struct task_struct *p, int clamp_id) +{ + struct uclamp_se *uc_se; + int clamp_value; + int group_id; + + /* Taks currently accounted into a clamp group */ + if (uclamp_task_affects(p, clamp_id)) + return p->uclamp[clamp_id].effective.group_id; + + /* Task specific clamp value */ + uc_se = &p->uclamp[clamp_id]; + clamp_value = uc_se->value; + group_id = uc_se->group_id; + +#ifdef CONFIG_UCLAMP_TASK_GROUP + /* Use TG's clamp value to limit task specific values */ + uc_se = &task_group(p)->uclamp[clamp_id]; + if (clamp_value > uc_se->effective.value) + group_id = uc_se->effective.group_id; +#endif + + return group_id; +} + +static inline int uclamp_group_value(int clamp_id, int group_id) +{ + struct uclamp_map *uc_map = &uclamp_maps[clamp_id][0]; + + if (group_id == UCLAMP_NOT_VALID) + return uclamp_none(clamp_id); + + return uc_map[group_id].value; +} + +static inline int uclamp_task_value(struct task_struct *p, int clamp_id) +{ + int group_id = uclamp_task_group_id(p, clamp_id); + + return uclamp_group_value(clamp_id, group_id); +} + /** * uclamp_cpu_get_id(): increase reference count for a clamp group on a CPU * @p: the task being enqueued on a CPU * @rq: the CPU's rq where the clamp group has to be reference counted * @clamp_id: the utilization clamp (e.g. min or max utilization) to reference * - * Once a task is enqueued on a CPU's RQ, the clamp group currently defined by - * the task's uclamp.group_id is reference counted on that CPU. + * Once a task is enqueued on a CPU's RQ, the most restrictive clamp group, + * among the task specific and that of the task's cgroup one, is reference + * counted on that CPU. + * + * Since the CPUs reference counted clamp group can be either that of the task + * or of its cgroup, we keep track of the reference counted clamp group by + * storing its index (group_id) into task_struct::uclamp::effective::group_id. + * This group index will then be used at task's dequeue time to release the + * correct refcount. 
*/ static inline void uclamp_cpu_get_id(struct task_struct *p, struct rq *rq, int clamp_id) @@ -968,7 +1029,7 @@ static inline void uclamp_cpu_get_id(struct task_struct *p, int group_id; /* Every task must reference a clamp group */ - group_id = p->uclamp[clamp_id].group_id; + group_id = uclamp_task_group_id(p, clamp_id); #ifdef CONFIG_SCHED_DEBUG if (unlikely(group_id == UCLAMP_NOT_VALID)) { WARN(1, "invalid task [%d:%s] clamp group\n", @@ -977,6 +1038,9 @@ static inline void uclamp_cpu_get_id(struct task_struct *p, } #endif + /* Track the effective clamp group */ + p->uclamp[clamp_id].effective.group_id = group_id; + /* Reference count the task into its current group_id */ uc_grp = &rq->uclamp.group[clamp_id][0]; uc_grp[group_id].tasks += 1; @@ -1025,7 +1089,7 @@ static inline void uclamp_cpu_put_id(struct task_struct *p, int group_id; /* New tasks don't have a previous clamp group */ - group_id = p->uclamp[clamp_id].group_id; + group_id = p->uclamp[clamp_id].effective.group_id; if (group_id == UCLAMP_NOT_VALID) return; @@ -1040,6 +1104,9 @@ static inline void uclamp_cpu_put_id(struct task_struct *p, } #endif + /* Flag the task as not affecting any clamp index */ + p->uclamp[clamp_id].effective.group_id = UCLAMP_NOT_VALID; + /* If this is not the last task, no updates are required */ if (uc_grp[group_id].tasks > 0) return; @@ -1402,6 +1469,8 @@ static void uclamp_fork(struct task_struct *p, bool reset) next_group_id = 0; p->uclamp[clamp_id].value = uclamp_none(clamp_id); } + /* Forked tasks are not yet enqueued */ + p->uclamp[clamp_id].effective.group_id = UCLAMP_NOT_VALID; p->uclamp[clamp_id].group_id = UCLAMP_NOT_VALID; uclamp_group_get(NULL, clamp_id, next_group_id, uc_se, @@ -5497,8 +5566,8 @@ SYSCALL_DEFINE4(sched_getattr, pid_t, pid, struct sched_attr __user *, uattr, attr.sched_nice = task_nice(p); #ifdef CONFIG_UCLAMP_TASK - attr.sched_util_min = p->uclamp[UCLAMP_MIN].value; - attr.sched_util_max = p->uclamp[UCLAMP_MAX].value; + attr.sched_util_min = uclamp_task_value(p, UCLAMP_MIN); + attr.sched_util_max = uclamp_task_value(p, UCLAMP_MAX); #endif rcu_read_unlock(); @@ -7308,8 +7377,11 @@ static void cpu_util_update_hier(struct cgroup_subsys_state *css, * groups we consider their current value. */ uc_se = &css_tg(css)->uclamp[clamp_id]; - if (css != top_css) + if (css != top_css) { value = uc_se->value; + group_id = uc_se->effective.group_id; + } + /* * Skip the whole subtrees if the current effective clamp is * alredy matching the TG's clamp value. 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index 489d7403affe..72b022b9a407 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -2240,7 +2240,7 @@ static inline bool uclamp_group_active(struct uclamp_group *uc_grp, */ static inline bool uclamp_task_affects(struct task_struct *p, int clamp_id) { - return (p->uclamp[clamp_id].group_id != UCLAMP_NOT_VALID); + return (p->uclamp[clamp_id].effective.group_id != UCLAMP_NOT_VALID); } #endif /* CONFIG_UCLAMP_TASK */ From patchwork Tue Aug 28 13:53:19 2018 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Patrick Bellasi X-Patchwork-Id: 10578587 Return-Path: Received: from mail.wl.linuxfoundation.org (pdx-wl-mail.web.codeaurora.org [172.30.200.125]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 7800213B8 for ; Tue, 28 Aug 2018 13:56:12 +0000 (UTC) Received: from mail.wl.linuxfoundation.org (localhost [127.0.0.1]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 6709829953 for ; Tue, 28 Aug 2018 13:56:12 +0000 (UTC) Received: by mail.wl.linuxfoundation.org (Postfix, from userid 486) id 5B2942A1A4; Tue, 28 Aug 2018 13:56:12 +0000 (UTC) X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on pdx-wl-mail.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-7.9 required=2.0 tests=BAYES_00,MAILING_LIST_MULTI, RCVD_IN_DNSWL_HI autolearn=ham version=3.3.1 Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 739B429953 for ; Tue, 28 Aug 2018 13:56:11 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1727323AbeH1Rro (ORCPT ); Tue, 28 Aug 2018 13:47:44 -0400 Received: from foss.arm.com ([217.140.101.70]:38666 "EHLO foss.arm.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1727385AbeH1Rq7 (ORCPT ); Tue, 28 Aug 2018 13:46:59 -0400 Received: from usa-sjc-imap-foss1.foss.arm.com (unknown [10.72.51.249]) by usa-sjc-mx-foss1.foss.arm.com (Postfix) with ESMTP id C6E4919BF; Tue, 28 Aug 2018 06:54:33 -0700 (PDT) Received: from e110439-lin.Cambridge.arm.com (e110439-lin.emea.arm.com [10.4.12.126]) by usa-sjc-imap-foss1.foss.arm.com (Postfix) with ESMTPA id D5B793F5BD; Tue, 28 Aug 2018 06:54:30 -0700 (PDT) From: Patrick Bellasi To: linux-kernel@vger.kernel.org, linux-pm@vger.kernel.org Cc: Ingo Molnar , Peter Zijlstra , Tejun Heo , "Rafael J . Wysocki" , Viresh Kumar , Vincent Guittot , Paul Turner , Quentin Perret , Dietmar Eggemann , Morten Rasmussen , Juri Lelli , Todd Kjos , Joel Fernandes , Steve Muckle , Suren Baghdasaryan Subject: [PATCH v4 11/16] sched/core: uclamp: add system default clamps Date: Tue, 28 Aug 2018 14:53:19 +0100 Message-Id: <20180828135324.21976-12-patrick.bellasi@arm.com> X-Mailer: git-send-email 2.18.0 In-Reply-To: <20180828135324.21976-1-patrick.bellasi@arm.com> References: <20180828135324.21976-1-patrick.bellasi@arm.com> Sender: linux-pm-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-pm@vger.kernel.org X-Virus-Scanned: ClamAV using ClamSMTP Clamp values cannot be tuned at the root cgroup level. Moreover, because of the delegation model requirements and how the parent clamps propagation works, if we want to enable subgroups to set a non null util.min, we need to be able to configure the root group util.min to the allow the maximum utilization (SCHED_CAPACITY_SCALE = 1024). 
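(Recall the propagation rule introduced by the earlier patches in this series: a group's effective clamp is the smaller of its own clamp value and its parent's effective value. With the root group left at util.min = 0, every descendant's effective util.min would therefore collapse to 0, which is why the root has to be allowed to expose the full range.)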
Unfortunately this setup will also mean that all tasks running in the root group, will always get a maximum util.min clamp, unless they have a lower task specific clamp which is definitively not a desirable default configuration. Let's fix this by explicitly adding a system default configuration (sysctl_sched_uclamp_util_{min,max}) which works as a restrictive clamp for all tasks running on the root group. This interface is available independently from cgroups, thus providing a complete solution for system wide utilization clamping configuration. Each task has now by default: task_struct::uclamp::value = UCLAMP_NOT_VALID unless: - the task has been forked from a parent with a valid clamp and !SCHED_FLAG_RESET_ON_FORK - the task has got a task-specific value set via sched_setattr() A task with a UCLAMP_NOT_VALID clamp value is refcounted considering the system default clamps if either we do not have task group support or they are part of the root_task_group. Tasks without a task specific clamp value in a child task group will be refcounted instead considering the task group clamps. Signed-off-by: Patrick Bellasi Cc: Ingo Molnar Cc: Peter Zijlstra Cc: Tejun Heo Cc: Paul Turner Cc: Suren Baghdasaryan Cc: Todd Kjos Cc: Joel Fernandes Cc: Steve Muckle Cc: Juri Lelli Cc: Quentin Perret Cc: Dietmar Eggemann Cc: Morten Rasmussen Cc: linux-kernel@vger.kernel.org Cc: linux-pm@vger.kernel.org --- Changes in v4: Message-ID: <20180820122728.GM2960@e110439-lin> - fix unwanted reset of clamp values on refcount success Others: - by default all tasks have a UCLAMP_NOT_VALID task specific clamp - always use: p->uclamp[clamp_id].effective.value to track the actual clamp value the task has been refcounted into. This matches with the usage of p->uclamp[clamp_id].effective.group_id - rebased on v4.19-rc1 --- include/linux/sched/sysctl.h | 11 +++ kernel/sched/core.c | 147 +++++++++++++++++++++++++++++++++-- kernel/sysctl.c | 16 ++++ 3 files changed, 168 insertions(+), 6 deletions(-) diff --git a/include/linux/sched/sysctl.h b/include/linux/sched/sysctl.h index a9c32daeb9d8..445fb54eaeff 100644 --- a/include/linux/sched/sysctl.h +++ b/include/linux/sched/sysctl.h @@ -56,6 +56,11 @@ int sched_proc_update_handler(struct ctl_table *table, int write, extern unsigned int sysctl_sched_rt_period; extern int sysctl_sched_rt_runtime; +#ifdef CONFIG_UCLAMP_TASK +extern unsigned int sysctl_sched_uclamp_util_min; +extern unsigned int sysctl_sched_uclamp_util_max; +#endif + #ifdef CONFIG_CFS_BANDWIDTH extern unsigned int sysctl_sched_cfs_bandwidth_slice; #endif @@ -75,6 +80,12 @@ extern int sched_rt_handler(struct ctl_table *table, int write, void __user *buffer, size_t *lenp, loff_t *ppos); +#ifdef CONFIG_UCLAMP_TASK +extern int sched_uclamp_handler(struct ctl_table *table, int write, + void __user *buffer, size_t *lenp, + loff_t *ppos); +#endif + extern int sysctl_numa_balancing(struct ctl_table *table, int write, void __user *buffer, size_t *lenp, loff_t *ppos); diff --git a/kernel/sched/core.c b/kernel/sched/core.c index da0b3bd41e96..fbc8d9fdfdbb 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -728,6 +728,20 @@ static void set_load_weight(struct task_struct *p, bool update_load) */ static DEFINE_MUTEX(uclamp_mutex); +/* + * Minimum utilization for tasks in the root cgroup + * default: 0 + */ +unsigned int sysctl_sched_uclamp_util_min; + +/* + * Maximum utilization for tasks in the root cgroup + * default: 1024 + */ +unsigned int sysctl_sched_uclamp_util_max = SCHED_CAPACITY_SCALE; + +static struct uclamp_se 
uclamp_default[UCLAMP_CNT]; + /** * uclamp_map: reference counts a utilization "clamp value" * @value: the utilization "clamp value" required @@ -961,11 +975,16 @@ static inline void uclamp_cpu_update(struct rq *rq, int clamp_id, * * This method returns the effective group index for a task, depending on its * status and a proper aggregation of the clamp values listed above. + * Moreover, it ensures that the task's effective value: + * task_struct::uclamp::effective::value + * is updated to represent the clamp value corresponding to the taks effective + * group index. */ static inline int uclamp_task_group_id(struct task_struct *p, int clamp_id) { struct uclamp_se *uc_se; int clamp_value; + bool unclamped; int group_id; /* Taks currently accounted into a clamp group */ @@ -977,13 +996,40 @@ static inline int uclamp_task_group_id(struct task_struct *p, int clamp_id) clamp_value = uc_se->value; group_id = uc_se->group_id; + unclamped = (clamp_value == UCLAMP_NOT_VALID); #ifdef CONFIG_UCLAMP_TASK_GROUP + /* + * Tasks in the root group, which do not have a task specific clamp + * value, get the system default clamp value. + */ + if (unclamped && (task_group_is_autogroup(task_group(p)) || + task_group(p) == &root_task_group)) { + p->uclamp[clamp_id].effective.value = + uclamp_default[clamp_id].value; + + return uclamp_default[clamp_id].group_id; + } + /* Use TG's clamp value to limit task specific values */ uc_se = &task_group(p)->uclamp[clamp_id]; - if (clamp_value > uc_se->effective.value) - group_id = uc_se->effective.group_id; + if (unclamped || clamp_value > uc_se->effective.value) { + p->uclamp[clamp_id].effective.value = + uc_se->effective.value; + + return uc_se->effective.group_id; + } +#else + /* By default, all tasks get the system default clamp value */ + if (unclamped) { + p->uclamp[clamp_id].effective.value = + uclamp_default[clamp_id].value; + + return uclamp_default[clamp_id].group_id; + } #endif + p->uclamp[clamp_id].effective.value = clamp_value; + return group_id; } @@ -999,9 +1045,10 @@ static inline int uclamp_group_value(int clamp_id, int group_id) static inline int uclamp_task_value(struct task_struct *p, int clamp_id) { - int group_id = uclamp_task_group_id(p, clamp_id); + /* Ensure effective task's clamp value is update */ + uclamp_task_group_id(p, clamp_id); - return uclamp_group_value(clamp_id, group_id); + return p->uclamp[clamp_id].effective.value; } /** @@ -1047,7 +1094,7 @@ static inline void uclamp_cpu_get_id(struct task_struct *p, /* Reset clamp holds on idle exit */ uc_cpu = &rq->uclamp; - clamp_value = p->uclamp[clamp_id].value; + clamp_value = p->uclamp[clamp_id].effective.value; if (unlikely(uc_cpu->flags & UCLAMP_FLAG_IDLE)) { /* * This function is called for both UCLAMP_MIN (before) and @@ -1300,6 +1347,77 @@ static inline void uclamp_group_get(struct task_struct *p, uclamp_group_put(clamp_id, prev_group_id); } +int sched_uclamp_handler(struct ctl_table *table, int write, + void __user *buffer, size_t *lenp, + loff_t *ppos) +{ + int group_id[UCLAMP_CNT] = { UCLAMP_NOT_VALID }; + struct uclamp_se *uc_se; + int old_min, old_max; + unsigned int value; + int result; + + mutex_lock(&uclamp_mutex); + + old_min = sysctl_sched_uclamp_util_min; + old_max = sysctl_sched_uclamp_util_max; + + result = proc_dointvec(table, write, buffer, lenp, ppos); + if (result) + goto undo; + if (!write) + goto done; + + result = -EINVAL; + if (sysctl_sched_uclamp_util_min > sysctl_sched_uclamp_util_max) + goto undo; + if (sysctl_sched_uclamp_util_max > SCHED_CAPACITY_SCALE) + goto 
undo; + + /* Find a valid group_id for each required clamp value */ + if (old_min != sysctl_sched_uclamp_util_min) { + value = sysctl_sched_uclamp_util_min; + result = uclamp_group_find(UCLAMP_MIN, value); + if (result == -ENOSPC) { + pr_err(UCLAMP_ENOSPC_FMT, "MIN"); + goto undo; + } + group_id[UCLAMP_MIN] = result; + } + if (old_max != sysctl_sched_uclamp_util_max) { + value = sysctl_sched_uclamp_util_max; + result = uclamp_group_find(UCLAMP_MAX, value); + if (result == -ENOSPC) { + pr_err(UCLAMP_ENOSPC_FMT, "MAX"); + goto undo; + } + group_id[UCLAMP_MAX] = result; + } + result = 0; + + /* Update each required clamp group */ + if (old_min != sysctl_sched_uclamp_util_min) { + uc_se = &uclamp_default[UCLAMP_MIN]; + uclamp_group_get(NULL, UCLAMP_MIN, group_id[UCLAMP_MIN], + uc_se, sysctl_sched_uclamp_util_min); + } + if (old_max != sysctl_sched_uclamp_util_max) { + uc_se = &uclamp_default[UCLAMP_MAX]; + uclamp_group_get(NULL, UCLAMP_MAX, group_id[UCLAMP_MAX], + uc_se, sysctl_sched_uclamp_util_max); + } + goto done; + +undo: + sysctl_sched_uclamp_util_min = old_min; + sysctl_sched_uclamp_util_max = old_max; + +done: + mutex_unlock(&uclamp_mutex); + + return result; +} + #ifdef CONFIG_UCLAMP_TASK_GROUP /** * alloc_uclamp_sched_group: initialize a new TG's for utilization clamping @@ -1468,6 +1586,8 @@ static void uclamp_fork(struct task_struct *p, bool reset) if (unlikely(reset)) { next_group_id = 0; p->uclamp[clamp_id].value = uclamp_none(clamp_id); + p->uclamp[clamp_id].effective.value = + uclamp_none(clamp_id); } /* Forked tasks are not yet enqueued */ p->uclamp[clamp_id].effective.group_id = UCLAMP_NOT_VALID; @@ -1475,6 +1595,10 @@ static void uclamp_fork(struct task_struct *p, bool reset) p->uclamp[clamp_id].group_id = UCLAMP_NOT_VALID; uclamp_group_get(NULL, clamp_id, next_group_id, uc_se, p->uclamp[clamp_id].value); + + /* By default we do not want task-specific clamp values */ + if (unlikely(reset)) + p->uclamp[clamp_id].value = UCLAMP_NOT_VALID; } } @@ -1509,12 +1633,17 @@ static void __init init_uclamp(void) uc_se->group_id = UCLAMP_NOT_VALID; uclamp_group_get(NULL, clamp_id, 0, uc_se, uclamp_none(clamp_id)); + /* + * By default we do not want task-specific clamp values, + * so that system default values apply. 
+ */ + uc_se->value = UCLAMP_NOT_VALID; #ifdef CONFIG_UCLAMP_TASK_GROUP /* Init root TG's clamp group */ uc_se = &root_task_group.uclamp[clamp_id]; - uc_se->effective.value = uclamp_none(clamp_id); + uc_se->effective.value = uclamp_none(UCLAMP_MAX); uc_se->effective.group_id = 0; /* @@ -1526,6 +1655,12 @@ static void __init init_uclamp(void) uclamp_group_get(NULL, clamp_id, 0, uc_se, uclamp_none(UCLAMP_MAX)); #endif + + /* Init system defaul's clamp group */ + uc_se = &uclamp_default[clamp_id]; + uc_se->group_id = UCLAMP_NOT_VALID; + uclamp_group_get(NULL, clamp_id, 0, uc_se, + uclamp_none(clamp_id)); } } diff --git a/kernel/sysctl.c b/kernel/sysctl.c index cc02050fd0c4..378ea57e5fc5 100644 --- a/kernel/sysctl.c +++ b/kernel/sysctl.c @@ -445,6 +445,22 @@ static struct ctl_table kern_table[] = { .mode = 0644, .proc_handler = sched_rr_handler, }, +#ifdef CONFIG_UCLAMP_TASK + { + .procname = "sched_uclamp_util_min", + .data = &sysctl_sched_uclamp_util_min, + .maxlen = sizeof(unsigned int), + .mode = 0644, + .proc_handler = sched_uclamp_handler, + }, + { + .procname = "sched_uclamp_util_max", + .data = &sysctl_sched_uclamp_util_max, + .maxlen = sizeof(unsigned int), + .mode = 0644, + .proc_handler = sched_uclamp_handler, + }, +#endif #ifdef CONFIG_SCHED_AUTOGROUP { .procname = "sched_autogroup_enabled", From patchwork Tue Aug 28 13:53:20 2018 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Patrick Bellasi X-Patchwork-Id: 10578577 Return-Path: Received: from mail.wl.linuxfoundation.org (pdx-wl-mail.web.codeaurora.org [172.30.200.125]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 7D9E813B8 for ; Tue, 28 Aug 2018 13:55:32 +0000 (UTC) Received: from mail.wl.linuxfoundation.org (localhost [127.0.0.1]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 6E2AF2A30F for ; Tue, 28 Aug 2018 13:55:32 +0000 (UTC) Received: by mail.wl.linuxfoundation.org (Postfix, from userid 486) id 618492A326; Tue, 28 Aug 2018 13:55:32 +0000 (UTC) X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on pdx-wl-mail.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-7.9 required=2.0 tests=BAYES_00,MAILING_LIST_MULTI, RCVD_IN_DNSWL_HI autolearn=ham version=3.3.1 Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id A18DF2A30F for ; Tue, 28 Aug 2018 13:55:31 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1728321AbeH1RrL (ORCPT ); Tue, 28 Aug 2018 13:47:11 -0400 Received: from usa-sjc-mx-foss1.foss.arm.com ([217.140.101.70]:38670 "EHLO foss.arm.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1727579AbeH1Rq7 (ORCPT ); Tue, 28 Aug 2018 13:46:59 -0400 Received: from usa-sjc-imap-foss1.foss.arm.com (unknown [10.72.51.249]) by usa-sjc-mx-foss1.foss.arm.com (Postfix) with ESMTP id 04C831AC1; Tue, 28 Aug 2018 06:54:37 -0700 (PDT) Received: from e110439-lin.Cambridge.arm.com (e110439-lin.emea.arm.com [10.4.12.126]) by usa-sjc-imap-foss1.foss.arm.com (Postfix) with ESMTPA id 135DB3F5BD; Tue, 28 Aug 2018 06:54:33 -0700 (PDT) From: Patrick Bellasi To: linux-kernel@vger.kernel.org, linux-pm@vger.kernel.org Cc: Ingo Molnar , Peter Zijlstra , Tejun Heo , "Rafael J . 
Wysocki" , Viresh Kumar , Vincent Guittot , Paul Turner , Quentin Perret , Dietmar Eggemann , Morten Rasmussen , Juri Lelli , Todd Kjos , Joel Fernandes , Steve Muckle , Suren Baghdasaryan Subject: [PATCH v4 12/16] sched/core: uclamp: update CPU's refcount on TG's clamp changes Date: Tue, 28 Aug 2018 14:53:20 +0100 Message-Id: <20180828135324.21976-13-patrick.bellasi@arm.com> X-Mailer: git-send-email 2.18.0 In-Reply-To: <20180828135324.21976-1-patrick.bellasi@arm.com> References: <20180828135324.21976-1-patrick.bellasi@arm.com> Sender: linux-pm-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-pm@vger.kernel.org X-Virus-Scanned: ClamAV using ClamSMTP When a task group refcounts a new clamp group, we need to ensure that the new clamp values are immediately enforced to all its tasks which are currently RUNNABLE. This is to ensure that all currently RUNNABLE tasks are boosted and/or clamped as requested as soon as possible. Let's ensure that, whenever a new clamp group is refcounted by a task group, all its RUNNABLE tasks are correctly accounted in their respective CPUs. We do that by slightly refactoring uclamp_group_get() to get an additional parameter *cgroup_subsys_state which, when provided, it's used to walk the list of tasks in the corresponding TGs and update the RUNNABLE ones. This is a "brute force" solution which allows to reuse the same refcount update code already used by the per-task API. That's also the only way to ensure a prompt enforcement of new clamp constraints on RUNNABLE tasks, as soon as a task group attribute is tweaked. Signed-off-by: Patrick Bellasi Cc: Ingo Molnar Cc: Peter Zijlstra Cc: Tejun Heo Cc: Paul Turner Cc: Suren Baghdasaryan Cc: Todd Kjos Cc: Joel Fernandes Cc: Steve Muckle Cc: Juri Lelli Cc: Quentin Perret Cc: Dietmar Eggemann Cc: Morten Rasmussen Cc: linux-kernel@vger.kernel.org Cc: linux-pm@vger.kernel.org --- Changes in v4: Others: - rebased on v4.19-rc1 Changes in v3: - rebased on tip/sched/core - fixed some typos Changes in v2: - rebased on v4.18-rc4 - this code has been split from a previous patch to simplify the review --- kernel/sched/core.c | 52 ++++++++++++++++++++++++++++++++--------- kernel/sched/features.h | 5 ++++ 2 files changed, 46 insertions(+), 11 deletions(-) diff --git a/kernel/sched/core.c b/kernel/sched/core.c index fbc8d9fdfdbb..9ca881d1ff9e 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -1306,9 +1306,30 @@ static inline void uclamp_group_put(int clamp_id, int group_id) raw_spin_unlock_irqrestore(&uc_map[group_id].se_lock, flags); } +static inline void uclamp_group_get_tg(struct cgroup_subsys_state *css, + int clamp_id, unsigned int group_id) +{ + struct css_task_iter it; + struct task_struct *p; + + /* + * In lazy update mode, tasks will be accounted into the right clamp + * group the next time they will be requeued. 
+ */ + if (unlikely(sched_feat(UCLAMP_LAZY_UPDATE))) + return; + + /* Update clamp groups for RUNNABLE tasks in this TG */ + css_task_iter_start(css, 0, &it); + while ((p = css_task_iter_next(&it))) + uclamp_task_update_active(p, clamp_id, group_id); + css_task_iter_end(&it); +} + /** * uclamp_group_get: increase the reference count for a clamp group * @p: the task which clamp value must be tracked + * @css: the task group which clamp value must be tracked * @clamp_id: the clamp index affected by the task * @next_group_id: the clamp group to refcount * @uc_se: the utilization clamp data for the task @@ -1320,6 +1341,7 @@ static inline void uclamp_group_put(int clamp_id, int group_id) * the task to reference count the clamp value on CPUs while enqueued. */ static inline void uclamp_group_get(struct task_struct *p, + struct cgroup_subsys_state *css, int clamp_id, int next_group_id, struct uclamp_se *uc_se, unsigned int clamp_value) @@ -1339,6 +1361,10 @@ static inline void uclamp_group_get(struct task_struct *p, uc_map[next_group_id].se_count += 1; raw_spin_unlock_irqrestore(&uc_map[next_group_id].se_lock, flags); + /* Newly created TG don't have tasks assigned */ + if (css) + uclamp_group_get_tg(css, clamp_id, next_group_id); + /* Update CPU's clamp group refcounts of RUNNABLE task */ if (p) uclamp_task_update_active(p, clamp_id, next_group_id); @@ -1398,12 +1424,12 @@ int sched_uclamp_handler(struct ctl_table *table, int write, /* Update each required clamp group */ if (old_min != sysctl_sched_uclamp_util_min) { uc_se = &uclamp_default[UCLAMP_MIN]; - uclamp_group_get(NULL, UCLAMP_MIN, group_id[UCLAMP_MIN], + uclamp_group_get(NULL, NULL, UCLAMP_MIN, group_id[UCLAMP_MIN], uc_se, sysctl_sched_uclamp_util_min); } if (old_max != sysctl_sched_uclamp_util_max) { uc_se = &uclamp_default[UCLAMP_MAX]; - uclamp_group_get(NULL, UCLAMP_MAX, group_id[UCLAMP_MAX], + uclamp_group_get(NULL, NULL, UCLAMP_MAX, group_id[UCLAMP_MAX], uc_se, sysctl_sched_uclamp_util_max); } goto done; @@ -1448,7 +1474,7 @@ static inline int alloc_uclamp_sched_group(struct task_group *tg, next_group_id = parent->uclamp[clamp_id].group_id; uc_se->group_id = UCLAMP_NOT_VALID; - uclamp_group_get(NULL, clamp_id, next_group_id, uc_se, + uclamp_group_get(NULL, NULL, clamp_id, next_group_id, uc_se, parent->uclamp[clamp_id].value); } @@ -1536,12 +1562,12 @@ static inline int __setscheduler_uclamp(struct task_struct *p, /* Update each required clamp group */ if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_MIN) { uc_se = &p->uclamp[UCLAMP_MIN]; - uclamp_group_get(p, UCLAMP_MIN, group_id[UCLAMP_MIN], + uclamp_group_get(p, NULL, UCLAMP_MIN, group_id[UCLAMP_MIN], uc_se, attr->sched_util_min); } if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_MAX) { uc_se = &p->uclamp[UCLAMP_MAX]; - uclamp_group_get(p, UCLAMP_MAX, group_id[UCLAMP_MAX], + uclamp_group_get(p, NULL, UCLAMP_MAX, group_id[UCLAMP_MAX], uc_se, attr->sched_util_max); } @@ -1593,7 +1619,7 @@ static void uclamp_fork(struct task_struct *p, bool reset) p->uclamp[clamp_id].effective.group_id = UCLAMP_NOT_VALID; p->uclamp[clamp_id].group_id = UCLAMP_NOT_VALID; - uclamp_group_get(NULL, clamp_id, next_group_id, uc_se, + uclamp_group_get(NULL, NULL, clamp_id, next_group_id, uc_se, p->uclamp[clamp_id].value); /* By default we do not want task-specific clamp values */ @@ -1631,7 +1657,7 @@ static void __init init_uclamp(void) /* Init init_task's clamp group */ uc_se = &init_task.uclamp[clamp_id]; uc_se->group_id = UCLAMP_NOT_VALID; - uclamp_group_get(NULL, clamp_id, 0, uc_se, + 
uclamp_group_get(NULL, NULL, clamp_id, 0, uc_se, uclamp_none(clamp_id)); /* * By default we do not want task-specific clamp values, @@ -1652,14 +1678,14 @@ static void __init init_uclamp(void) * all child groups. */ uc_se->group_id = UCLAMP_NOT_VALID; - uclamp_group_get(NULL, clamp_id, 0, uc_se, + uclamp_group_get(NULL, NULL, clamp_id, 0, uc_se, uclamp_none(UCLAMP_MAX)); #endif /* Init system defaul's clamp group */ uc_se = &uclamp_default[clamp_id]; uc_se->group_id = UCLAMP_NOT_VALID; - uclamp_group_get(NULL, clamp_id, 0, uc_se, + uclamp_group_get(NULL, NULL, clamp_id, 0, uc_se, uclamp_none(clamp_id)); } } @@ -7540,6 +7566,10 @@ static void cpu_util_update_hier(struct cgroup_subsys_state *css, uc_se->effective.value = value; uc_se->effective.group_id = group_id; + + /* Immediately updated descendants active tasks */ + if (css != top_css) + uclamp_group_get_tg(css, clamp_id, group_id); } } @@ -7579,7 +7609,7 @@ static int cpu_util_min_write_u64(struct cgroup_subsys_state *css, /* Update TG's reference count */ uc_se = &tg->uclamp[UCLAMP_MIN]; - uclamp_group_get(NULL, UCLAMP_MIN, group_id, uc_se, min_value); + uclamp_group_get(NULL, css, UCLAMP_MIN, group_id, uc_se, min_value); out: rcu_read_unlock(); @@ -7624,7 +7654,7 @@ static int cpu_util_max_write_u64(struct cgroup_subsys_state *css, /* Update TG's reference count */ uc_se = &tg->uclamp[UCLAMP_MAX]; - uclamp_group_get(NULL, UCLAMP_MAX, group_id, uc_se, max_value); + uclamp_group_get(NULL, css, UCLAMP_MAX, group_id, uc_se, max_value); out: rcu_read_unlock(); diff --git a/kernel/sched/features.h b/kernel/sched/features.h index 85ae8488039c..aad826aa55f8 100644 --- a/kernel/sched/features.h +++ b/kernel/sched/features.h @@ -90,3 +90,8 @@ SCHED_FEAT(WA_BIAS, true) * UtilEstimation. Use estimated CPU utilization. */ SCHED_FEAT(UTIL_EST, true) + +/* + * Utilization clamping lazy update. 
+ */ +SCHED_FEAT(UCLAMP_LAZY_UPDATE, false) From patchwork Tue Aug 28 13:53:21 2018 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Patrick Bellasi X-Patchwork-Id: 10578571 Return-Path: Received: from mail.wl.linuxfoundation.org (pdx-wl-mail.web.codeaurora.org [172.30.200.125]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 52D3C13B8 for ; Tue, 28 Aug 2018 13:55:25 +0000 (UTC) Received: from mail.wl.linuxfoundation.org (localhost [127.0.0.1]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 4028C29B5C for ; Tue, 28 Aug 2018 13:55:25 +0000 (UTC) Received: by mail.wl.linuxfoundation.org (Postfix, from userid 486) id 33FA62A320; Tue, 28 Aug 2018 13:55:25 +0000 (UTC) X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on pdx-wl-mail.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-7.9 required=2.0 tests=BAYES_00,MAILING_LIST_MULTI, RCVD_IN_DNSWL_HI autolearn=ham version=3.3.1 Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 5097329B5C for ; Tue, 28 Aug 2018 13:55:24 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1727344AbeH1RrF (ORCPT ); Tue, 28 Aug 2018 13:47:05 -0400 Received: from foss.arm.com ([217.140.101.70]:38671 "EHLO foss.arm.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1727463AbeH1Rq7 (ORCPT ); Tue, 28 Aug 2018 13:46:59 -0400 Received: from usa-sjc-imap-foss1.foss.arm.com (unknown [10.72.51.249]) by usa-sjc-mx-foss1.foss.arm.com (Postfix) with ESMTP id 38B171BB0; Tue, 28 Aug 2018 06:54:40 -0700 (PDT) Received: from e110439-lin.Cambridge.arm.com (e110439-lin.emea.arm.com [10.4.12.126]) by usa-sjc-imap-foss1.foss.arm.com (Postfix) with ESMTPA id 479003F5BD; Tue, 28 Aug 2018 06:54:37 -0700 (PDT) From: Patrick Bellasi To: linux-kernel@vger.kernel.org, linux-pm@vger.kernel.org Cc: Ingo Molnar , Peter Zijlstra , Tejun Heo , "Rafael J . Wysocki" , Viresh Kumar , Vincent Guittot , Paul Turner , Quentin Perret , Dietmar Eggemann , Morten Rasmussen , Juri Lelli , Todd Kjos , Joel Fernandes , Steve Muckle , Suren Baghdasaryan Subject: [PATCH v4 13/16] sched/core: uclamp: use percentage clamp values Date: Tue, 28 Aug 2018 14:53:21 +0100 Message-Id: <20180828135324.21976-14-patrick.bellasi@arm.com> X-Mailer: git-send-email 2.18.0 In-Reply-To: <20180828135324.21976-1-patrick.bellasi@arm.com> References: <20180828135324.21976-1-patrick.bellasi@arm.com> Sender: linux-pm-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-pm@vger.kernel.org X-Virus-Scanned: ClamAV using ClamSMTP The utilization is a well defined property of tasks and CPUs with an in-kernel representation based on power-of-two values. The current representation, in the [0..SCHED_CAPACITY_SCALE] range, allows efficient computations in hot-paths and a sufficient fixed point arithmetic precision. However, the utilization values range is still an implementation detail which is also possibly subject to changes in the future. Since we don't want to commit new user-space APIs to any in-kernel implementation detail, let's add an abstraction layer on top of the APIs used by util_clamp, i.e. sched_{set,get}attr syscalls and the cgroup's cpu.util_{min,max} attributes. 
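As a concrete example of the mapping, a task asking for a 20% minimum utilization would be stored internally as 1024 * 20 / 100 = 204, and the internal value 204 reads back as 20% once the inverse conversion compensates for the integer truncation.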
We do that by adding a couple of conversion functions which can be used to conveniently transform utilization/capacity values from/to the internal SCHED_FIXEDPOINT_SCALE representation to/from a more generic percentage in the standard [0..100] range. Signed-off-by: Patrick Bellasi Cc: Ingo Molnar Cc: Peter Zijlstra Cc: Tejun Heo Cc: Rafael J. Wysocki Cc: Paul Turner Cc: Suren Baghdasaryan Cc: Todd Kjos Cc: Joel Fernandes Cc: Steve Muckle Cc: Juri Lelli Cc: Quentin Perret Cc: Dietmar Eggemann Cc: Morten Rasmussen Cc: linux-kernel@vger.kernel.org Cc: linux-pm@vger.kernel.org --- Changes in v4: Others: - rebased on v4.19-rc1 Changes in v3: - rebased on tip/sched/core Changes in v2: - none: this is a new patch --- Documentation/admin-guide/cgroup-v2.rst | 10 +++---- include/linux/sched.h | 20 +++++++++++++ include/uapi/linux/sched/types.h | 14 +++++---- kernel/sched/core.c | 38 ++++++++++++++----------- 4 files changed, 55 insertions(+), 27 deletions(-) diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst index 72272f58d304..4b236390273b 100644 --- a/Documentation/admin-guide/cgroup-v2.rst +++ b/Documentation/admin-guide/cgroup-v2.rst @@ -976,7 +976,7 @@ All time durations are in microseconds. A read-write single value file which exists on non-root cgroups. The default is "0", i.e. no bandwidth boosting. - The requested minimum utilization in the range [0, 1023]. + The requested minimum percentage of utilization in the range [0, 100]. This interface allows reading and setting minimum utilization clamp values similar to the sched_setattr(2). This minimum utilization @@ -987,16 +987,16 @@ All time durations are in microseconds. reports minimum utilization clamp value currently enforced on a task group. - The actual minimum utilization in the range [0, 1023]. + The actual minimum percentage of utilization in the range [0, 100]. This value can be lower then cpu.util.min in case a parent cgroup is enforcing a more restrictive clamping on minimum utilization. cpu.util.max A read-write single value file which exists on non-root cgroups. - The default is "1023". i.e. no bandwidth clamping + The default is "100". i.e. no bandwidth clamping - The requested maximum utilization in the range [0, 1023]. + The requested maximum percentage of utilization in the range [0, 100]. This interface allows reading and setting maximum utilization clamp values similar to the sched_setattr(2). This maximum utilization @@ -1007,7 +1007,7 @@ All time durations are in microseconds. reports maximum utilization clamp value currently enforced on a task group. - The actual maximum utilization in the range [0, 1023]. + The actual maximum percentage of utilization in the range [0, 100]. This value can be lower then cpu.util.max in case a parent cgroup is enforcing a more restrictive clamping on max utilization. 
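Purely as an illustration of the rounding behaviour implemented by the util_from_pct()/util_to_pct() helpers added in the next hunk, the conversion can be modelled, slightly simplified, by the following standalone sketch; the names pct_to_util()/util_to_pct() and the hard-coded SCALE standing in for SCHED_FIXEDPOINT_SCALE are assumptions of this example, not kernel API:

#include <assert.h>
#include <stdio.h>

/* Stand-in for SCHED_FIXEDPOINT_SCALE; an assumption of this sketch. */
#define SCALE 1024U

/* Percentage [0..100] -> internal utilization [0..SCALE]. */
static unsigned int pct_to_util(unsigned int pct)
{
        return (SCALE * pct) / 100;
}

/*
 * Internal utilization [0..SCALE] -> percentage [0..100].
 * The +1 compensates the truncation of pct_to_util() for every value
 * which is not an exact multiple of SCALE/4 (0, 256, 512, 768, 1024).
 */
static unsigned int util_to_pct(unsigned int util)
{
        unsigned int pct = (100 * util) / SCALE;

        if (util % (SCALE / 4))
                pct += 1;

        return pct;
}

int main(void)
{
        /* A 20% request is stored as 204 and reads back as 20%. */
        assert(pct_to_util(20) == 204);
        assert(util_to_pct(204) == 20);

        /* The range boundaries map exactly. */
        assert(pct_to_util(0) == 0 && util_to_pct(0) == 0);
        assert(pct_to_util(100) == SCALE && util_to_pct(SCALE) == 100);

        printf("percentage <-> utilization conversions round-trip\n");

        return 0;
}

Running the sketch simply confirms that the example values round-trip between the percentage-based user-space view and the internal fixed-point scale, which is what allows the user-visible API to stay decoupled from the in-kernel representation.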
diff --git a/include/linux/sched.h b/include/linux/sched.h index 4e5522ed57e0..ca0a80881fa9 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -321,6 +321,26 @@ struct sched_info { # define SCHED_FIXEDPOINT_SHIFT 10 # define SCHED_FIXEDPOINT_SCALE (1L << SCHED_FIXEDPOINT_SHIFT) +static inline unsigned int util_from_pct(unsigned int pct) +{ + WARN_ON(pct > 100); + + return ((SCHED_FIXEDPOINT_SCALE * pct) / 100); +} + +static inline unsigned int util_to_pct(unsigned int value) +{ + unsigned int rounding = 0; + + WARN_ON(value > SCHED_FIXEDPOINT_SCALE); + + /* Compensate rounding errors for: 0, 256, 512, 768, 1024 */ + if (likely((value & 0xFF) && ~(value & 0x700))) + rounding = 1; + + return (rounding + ((100 * value) / SCHED_FIXEDPOINT_SCALE)); +} + struct load_weight { unsigned long weight; u32 inv_weight; diff --git a/include/uapi/linux/sched/types.h b/include/uapi/linux/sched/types.h index 7512b5934013..b0fe00939fb3 100644 --- a/include/uapi/linux/sched/types.h +++ b/include/uapi/linux/sched/types.h @@ -84,16 +84,18 @@ struct sched_param { * * @sched_util_min represents the minimum utilization * @sched_util_max represents the maximum utilization + * @sched_util_min represents the minimum utilization percentage + * @sched_util_max represents the maximum utilization percentage * - * Utilization is a value in the range [0..SCHED_CAPACITY_SCALE] which - * represents the percentage of CPU time used by a task when running at the - * maximum frequency on the highest capacity CPU of the system. Thus, for - * example, a 20% utilization task is a task running for 2ms every 10ms. + * Utilization is a value in the range [0..100] which represents the + * percentage of CPU time used by a task when running at the maximum frequency + * on the highest capacity CPU of the system. Thus, for example, a 20% + * utilization task is a task running for 2ms every 10ms. * - * A task with a min utilization value bigger then 0 is more likely to be + * A task with a min utilization value bigger then 0% is more likely to be * scheduled on a CPU which has a capacity big enough to fit the specified * minimum utilization value. - * A task with a max utilization value smaller then 1024 is more likely to be + * A task with a max utilization value smaller then 100% is more likely to be * scheduled on a CPU which do not necessarily have more capacity then the * specified max utilization value. 
*/ diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 9ca881d1ff9e..222397edb8a7 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -730,15 +730,15 @@ static DEFINE_MUTEX(uclamp_mutex); /* * Minimum utilization for tasks in the root cgroup - * default: 0 + * default: 0% */ unsigned int sysctl_sched_uclamp_util_min; /* * Maximum utilization for tasks in the root cgroup - * default: 1024 + * default: 100% */ -unsigned int sysctl_sched_uclamp_util_max = SCHED_CAPACITY_SCALE; +unsigned int sysctl_sched_uclamp_util_max = 100; static struct uclamp_se uclamp_default[UCLAMP_CNT]; @@ -940,7 +940,7 @@ static inline void uclamp_cpu_update(struct rq *rq, int clamp_id, max_value = max(max_value, uc_grp[group_id].value); /* Stop if we reach the max possible clamp */ - if (max_value >= SCHED_CAPACITY_SCALE) + if (max_value >= 100) break; } @@ -1397,7 +1397,7 @@ int sched_uclamp_handler(struct ctl_table *table, int write, result = -EINVAL; if (sysctl_sched_uclamp_util_min > sysctl_sched_uclamp_util_max) goto undo; - if (sysctl_sched_uclamp_util_max > SCHED_CAPACITY_SCALE) + if (sysctl_sched_uclamp_util_max > 100) goto undo; /* Find a valid group_id for each required clamp value */ @@ -1424,13 +1424,15 @@ int sched_uclamp_handler(struct ctl_table *table, int write, /* Update each required clamp group */ if (old_min != sysctl_sched_uclamp_util_min) { uc_se = &uclamp_default[UCLAMP_MIN]; + value = util_from_pct(sysctl_sched_uclamp_util_min); uclamp_group_get(NULL, NULL, UCLAMP_MIN, group_id[UCLAMP_MIN], - uc_se, sysctl_sched_uclamp_util_min); + uc_se, value); } if (old_max != sysctl_sched_uclamp_util_max) { uc_se = &uclamp_default[UCLAMP_MAX]; + value = util_from_pct(sysctl_sched_uclamp_util_max); uclamp_group_get(NULL, NULL, UCLAMP_MAX, group_id[UCLAMP_MAX], - uc_se, sysctl_sched_uclamp_util_max); + uc_se, value); } goto done; @@ -1525,7 +1527,7 @@ static inline int __setscheduler_uclamp(struct task_struct *p, : p->uclamp[UCLAMP_MAX].value; if (upper_bound == UCLAMP_NOT_VALID) - upper_bound = SCHED_CAPACITY_SCALE; + upper_bound = 100; if (attr->sched_util_min > upper_bound) { result = -EINVAL; goto done; @@ -1546,7 +1548,7 @@ static inline int __setscheduler_uclamp(struct task_struct *p, if (lower_bound == UCLAMP_NOT_VALID) lower_bound = 0; if (attr->sched_util_max < lower_bound || - attr->sched_util_max > SCHED_CAPACITY_SCALE) { + attr->sched_util_max > 100) { result = -EINVAL; goto done; } @@ -1563,12 +1565,12 @@ static inline int __setscheduler_uclamp(struct task_struct *p, if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_MIN) { uc_se = &p->uclamp[UCLAMP_MIN]; uclamp_group_get(p, NULL, UCLAMP_MIN, group_id[UCLAMP_MIN], - uc_se, attr->sched_util_min); + uc_se, util_from_pct(attr->sched_util_min)); } if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_MAX) { uc_se = &p->uclamp[UCLAMP_MAX]; uclamp_group_get(p, NULL, UCLAMP_MAX, group_id[UCLAMP_MAX], - uc_se, attr->sched_util_max); + uc_se, util_from_pct(attr->sched_util_max)); } done: @@ -5727,8 +5729,8 @@ SYSCALL_DEFINE4(sched_getattr, pid_t, pid, struct sched_attr __user *, uattr, attr.sched_nice = task_nice(p); #ifdef CONFIG_UCLAMP_TASK - attr.sched_util_min = uclamp_task_value(p, UCLAMP_MIN); - attr.sched_util_max = uclamp_task_value(p, UCLAMP_MAX); + attr.sched_util_min = util_to_pct(uclamp_task_value(p, UCLAMP_MIN)); + attr.sched_util_max = util_to_pct(uclamp_task_value(p, UCLAMP_MAX)); #endif rcu_read_unlock(); @@ -7581,8 +7583,10 @@ static int cpu_util_min_write_u64(struct cgroup_subsys_state *css, int ret = -EINVAL; int group_id; - if 
(min_value > SCHED_CAPACITY_SCALE) + /* Check range and scale to internal representation */ + if (min_value > 100) return -ERANGE; + min_value = util_from_pct(min_value); mutex_lock(&uclamp_mutex); rcu_read_lock(); @@ -7626,8 +7630,10 @@ static int cpu_util_max_write_u64(struct cgroup_subsys_state *css, int ret = -EINVAL; int group_id; - if (max_value > SCHED_CAPACITY_SCALE) + /* Check range and scale to internal representation */ + if (max_value > 100) return -ERANGE; + max_value = util_from_pct(max_value); mutex_lock(&uclamp_mutex); rcu_read_lock(); @@ -7677,7 +7683,7 @@ static inline u64 cpu_uclamp_read(struct cgroup_subsys_state *css, : tg->uclamp[clamp_id].value; rcu_read_unlock(); - return util_clamp; + return util_to_pct(util_clamp); } static u64 cpu_util_min_read_u64(struct cgroup_subsys_state *css, From patchwork Tue Aug 28 13:53:22 2018 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Patrick Bellasi X-Patchwork-Id: 10578579 Return-Path: Received: from mail.wl.linuxfoundation.org (pdx-wl-mail.web.codeaurora.org [172.30.200.125]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id C9F5D13B8 for ; Tue, 28 Aug 2018 13:55:41 +0000 (UTC) Received: from mail.wl.linuxfoundation.org (localhost [127.0.0.1]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id BA93C29953 for ; Tue, 28 Aug 2018 13:55:41 +0000 (UTC) Received: by mail.wl.linuxfoundation.org (Postfix, from userid 486) id AE9D52A1A4; Tue, 28 Aug 2018 13:55:41 +0000 (UTC) X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on pdx-wl-mail.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-7.9 required=2.0 tests=BAYES_00,MAILING_LIST_MULTI, RCVD_IN_DNSWL_HI autolearn=ham version=3.3.1 Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 2BD2529953 for ; Tue, 28 Aug 2018 13:55:41 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1728355AbeH1Rr0 (ORCPT ); Tue, 28 Aug 2018 13:47:26 -0400 Received: from usa-sjc-mx-foss1.foss.arm.com ([217.140.101.70]:38676 "EHLO foss.arm.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1727706AbeH1Rq7 (ORCPT ); Tue, 28 Aug 2018 13:46:59 -0400 Received: from usa-sjc-imap-foss1.foss.arm.com (unknown [10.72.51.249]) by usa-sjc-mx-foss1.foss.arm.com (Postfix) with ESMTP id 6AF421AED; Tue, 28 Aug 2018 06:54:43 -0700 (PDT) Received: from e110439-lin.Cambridge.arm.com (e110439-lin.emea.arm.com [10.4.12.126]) by usa-sjc-imap-foss1.foss.arm.com (Postfix) with ESMTPA id 796633F5BD; Tue, 28 Aug 2018 06:54:40 -0700 (PDT) From: Patrick Bellasi To: linux-kernel@vger.kernel.org, linux-pm@vger.kernel.org Cc: Ingo Molnar , Peter Zijlstra , Tejun Heo , "Rafael J . Wysocki" , Viresh Kumar , Vincent Guittot , Paul Turner , Quentin Perret , Dietmar Eggemann , Morten Rasmussen , Juri Lelli , Todd Kjos , Joel Fernandes , Steve Muckle , Suren Baghdasaryan Subject: [PATCH v4 14/16] sched/core: uclamp: request CAP_SYS_ADMIN by default Date: Tue, 28 Aug 2018 14:53:22 +0100 Message-Id: <20180828135324.21976-15-patrick.bellasi@arm.com> X-Mailer: git-send-email 2.18.0 In-Reply-To: <20180828135324.21976-1-patrick.bellasi@arm.com> References: <20180828135324.21976-1-patrick.bellasi@arm.com> Sender: linux-pm-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-pm@vger.kernel.org X-Virus-Scanned: ClamAV using ClamSMTP The number of clamp groups supported is limited and defined at compile time. 
However, a malicious user can currently request many different clamp values, thus consuming all the available clamp groups. Since, on properly configured systems, we expect only a limited set of different clamp values, this problem can be mitigated by restricting clamp group configuration to privileged tasks. This still allows system management software to properly pre-configure the system. Let's restrict the tuning of utilization clamp values, by default, to tasks with CAP_SYS_ADMIN capabilities. Where this is considered too restrictive, or not required for a specific platform, a kernel boot option is provided to change this default behavior, thus allowing non-privileged tasks to change their utilization clamp values. Signed-off-by: Patrick Bellasi Cc: Ingo Molnar Cc: Peter Zijlstra Cc: Rafael J. Wysocki Cc: Paul Turner Cc: Suren Baghdasaryan Cc: Todd Kjos Cc: Joel Fernandes Cc: Steve Muckle Cc: Juri Lelli Cc: Quentin Perret Cc: Dietmar Eggemann Cc: Morten Rasmussen Cc: linux-kernel@vger.kernel.org Cc: linux-pm@vger.kernel.org --- Changes in v4: Others: - new patch added in this version - rebased on v4.19-rc1 --- .../admin-guide/kernel-parameters.txt | 3 +++ kernel/sched/core.c | 22 ++++++++++++++++--- 2 files changed, 22 insertions(+), 3 deletions(-) diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt index 9871e649ffef..481f8214ea9a 100644 --- a/Documentation/admin-guide/kernel-parameters.txt +++ b/Documentation/admin-guide/kernel-parameters.txt @@ -4561,6 +4561,9 @@ ,,,,,,, See also Documentation/input/devices/joystick-parport.rst + uclamp_user [KNL] Enable task-specific utilization clamping tuning + also from tasks without CAP_SYS_ADMIN capability.
+ udbg-immortal [PPC] When debugging early kernel crashes that happen after console_init() and before a proper console driver takes over, this boot options might diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 222397edb8a7..8341ce580a9a 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -1510,14 +1510,29 @@ static inline int alloc_uclamp_sched_group(struct task_group *tg, static inline void free_uclamp_sched_group(struct task_group *tg) { } #endif /* CONFIG_UCLAMP_TASK_GROUP */ +static bool uclamp_user_allowed __read_mostly; +static int __init uclamp_user_allow(char *str) +{ + uclamp_user_allowed = true; + + return 0; +} +early_param("uclamp_user", uclamp_user_allow); + static inline int __setscheduler_uclamp(struct task_struct *p, - const struct sched_attr *attr) + const struct sched_attr *attr, + bool user) { int group_id[UCLAMP_CNT] = { UCLAMP_NOT_VALID }; int lower_bound, upper_bound; struct uclamp_se *uc_se; int result = 0; + if (!capable(CAP_SYS_ADMIN) && + user && !uclamp_user_allowed) { + return -EPERM; + } + mutex_lock(&uclamp_mutex); /* Find a valid group_id for each required clamp value */ @@ -1702,7 +1717,8 @@ static inline int alloc_uclamp_sched_group(struct task_group *tg, return 1; } static inline int __setscheduler_uclamp(struct task_struct *p, - const struct sched_attr *attr) + const struct sched_attr *attr, + bool user) { return -EINVAL; } @@ -5217,7 +5233,7 @@ static int __sched_setscheduler(struct task_struct *p, /* Configure utilization clamps for the task */ if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP) { - retval = __setscheduler_uclamp(p, attr); + retval = __setscheduler_uclamp(p, attr, user); if (retval) return retval; } From patchwork Tue Aug 28 13:53:23 2018 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Patrick Bellasi X-Patchwork-Id: 10578573 Return-Path: Received: from mail.wl.linuxfoundation.org (pdx-wl-mail.web.codeaurora.org [172.30.200.125]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 77E95920 for ; Tue, 28 Aug 2018 13:55:25 +0000 (UTC) Received: from mail.wl.linuxfoundation.org (localhost [127.0.0.1]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 678DB29B5C for ; Tue, 28 Aug 2018 13:55:25 +0000 (UTC) Received: by mail.wl.linuxfoundation.org (Postfix, from userid 486) id 5B2142A30F; Tue, 28 Aug 2018 13:55:25 +0000 (UTC) X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on pdx-wl-mail.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-7.9 required=2.0 tests=BAYES_00,MAILING_LIST_MULTI, RCVD_IN_DNSWL_HI autolearn=ham version=3.3.1 Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id A72AA2A1A4 for ; Tue, 28 Aug 2018 13:55:24 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1727209AbeH1Rq7 (ORCPT ); Tue, 28 Aug 2018 13:46:59 -0400 Received: from foss.arm.com ([217.140.101.70]:38674 "EHLO foss.arm.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1727725AbeH1Rq7 (ORCPT ); Tue, 28 Aug 2018 13:46:59 -0400 Received: from usa-sjc-imap-foss1.foss.arm.com (unknown [10.72.51.249]) by usa-sjc-mx-foss1.foss.arm.com (Postfix) with ESMTP id 9D2131BB2; Tue, 28 Aug 2018 06:54:46 -0700 (PDT) Received: from e110439-lin.Cambridge.arm.com (e110439-lin.emea.arm.com [10.4.12.126]) by usa-sjc-imap-foss1.foss.arm.com (Postfix) with ESMTPA id ABBE53F5BD; Tue, 28 Aug 2018 06:54:43 -0700 (PDT) From: Patrick Bellasi To: 
linux-kernel@vger.kernel.org, linux-pm@vger.kernel.org Cc: Ingo Molnar , Peter Zijlstra , Tejun Heo , "Rafael J . Wysocki" , Viresh Kumar , Vincent Guittot , Paul Turner , Quentin Perret , Dietmar Eggemann , Morten Rasmussen , Juri Lelli , Todd Kjos , Joel Fernandes , Steve Muckle , Suren Baghdasaryan Subject: [PATCH v4 15/16] sched/core: uclamp: add clamp group discretization support Date: Tue, 28 Aug 2018 14:53:23 +0100 Message-Id: <20180828135324.21976-16-patrick.bellasi@arm.com> X-Mailer: git-send-email 2.18.0 In-Reply-To: <20180828135324.21976-1-patrick.bellasi@arm.com> References: <20180828135324.21976-1-patrick.bellasi@arm.com> Sender: linux-pm-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-pm@vger.kernel.org X-Virus-Scanned: ClamAV using ClamSMTP The limited number of clamp groups is required to have both an effective and efficient run-time tracking of the clamp groups required by RUNNABLE tasks. However, being a limited number imposes some constraints on its usage at run-time. Specifically, a System Management Software should "reserve" all the possible clamp values required at run-time to ensure there will always be a clamp group to track them whenever required. To fix this problem we can trade-off CPU clamping precision for efficiency by transforming CPU's clamp groups into buckets of a predefined range. The number of clamp groups configured at compile time defines the range of utilization clamp values tracked by each CPU clamp group. Thus, for example, with the default: CONFIG_UCLAMP_GROUPS_COUNT 5 we will have 5 clamp groups tracking 20% utilization each and a task with util_min=25% will have group_id=1. Scheduling entities keep tracking the specific value defined from user-space, which can still be used for task placement biasing decisions. However, at enqueue time tasks will be refcounted in the clamp group which range includes the task specific clamp value. Each CPU's clamp value will also be updated to aggregate and represent at run-time the most restrictive value among those of the RUNNABLE tasks refcounted by that group. Each time a CPU clamp group becomes empty we reset its clamp value to the minimum value of the range it tracks. Signed-off-by: Patrick Bellasi Cc: Ingo Molnar Cc: Peter Zijlstra Cc: Rafael J. Wysocki Cc: Paul Turner Cc: Suren Baghdasaryan Cc: Todd Kjos Cc: Joel Fernandes Cc: Steve Muckle Cc: Juri Lelli Cc: Quentin Perret Cc: Dietmar Eggemann Cc: Morten Rasmussen Cc: linux-kernel@vger.kernel.org Cc: linux-pm@vger.kernel.org --- Changes in v4: Message-ID: <20180809152313.lewfhufidhxb2qrk@darkstar> - implements the idea discussed in this thread Others: - new patch added in this version - rebased on v4.19-rc1 --- include/linux/sched.h | 13 ++++----- kernel/sched/core.c | 59 ++++++++++++++++++++++++++++++++++++++++- kernel/sched/features.h | 5 ++++ 3 files changed, 70 insertions(+), 7 deletions(-) diff --git a/include/linux/sched.h b/include/linux/sched.h index ca0a80881fa9..752fcd5d2cea 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -608,17 +608,18 @@ struct sched_dl_entity { * either tasks or task groups, to enforce the same clamp "value" for a given * clamp index. * - * Scheduling entity's specific clamp group index can be different - * from the effective clamp group index used at enqueue time since - * task groups's clamps can be restricted by their parent task group. + * Scheduling entity's specific clamp value and group index can be different + * from the effective value and group index used at enqueue time. 
Indeed: + * - task's clamps can be restricted by their task group clamps + * - task groups' clamps can be restricted by their parent task group */ struct uclamp_se { unsigned int value; unsigned int group_id; /* - * Effective task (group) clamp value and group index. - * For task groups it's the value (eventually) enforced by a parent - * task group. + * Effective task (group) clamp value and group index: + * for tasks: those used at enqueue time + * for task groups: those (eventually) enforced by a parent task group */ struct { unsigned int value; diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 8341ce580a9a..f71e15eaf152 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -807,6 +807,34 @@ static struct uclamp_map uclamp_maps[UCLAMP_CNT] #define UCLAMP_ENOSPC_FMT "Cannot allocate more than " \ __stringify(CONFIG_UCLAMP_GROUPS_COUNT) " UTIL_%s clamp groups\n" +/* + * uclamp_round: round a clamp value down to the closest trackable value + * + * The number of clamp groups, which is defined at compile time, allows us to + * track a finite number of different clamp values. This makes sense both from + * a practical standpoint, since we do not expect many different values on + * a real system, and for run-time efficiency. + * + * To ensure a clamp group is always available, this method allows us to + * discretize a requested value into one of the available clamp + * groups. + */ +static inline int uclamp_round(int value) +{ +#define UCLAMP_GROUP_DELTA (SCHED_CAPACITY_SCALE / CONFIG_UCLAMP_GROUPS_COUNT) +#define UCLAMP_GROUP_UPPER (UCLAMP_GROUP_DELTA * CONFIG_UCLAMP_GROUPS_COUNT) + + if (unlikely(!sched_feat(UCLAMP_ROUNDING))) + return value; + + if (value <= 0) + return value; + if (value >= UCLAMP_GROUP_UPPER) + return SCHED_CAPACITY_SCALE; + + return UCLAMP_GROUP_DELTA * (value / UCLAMP_GROUP_DELTA); +} + /** * uclamp_group_available: checks if a clamp group is available * @clamp_id: the utilization clamp index (i.e. min or max clamp) @@ -846,6 +874,9 @@ static inline void uclamp_group_init(int clamp_id, int group_id, struct uclamp_cpu *uc_cpu; int cpu; + /* Clamp groups are always initialized to the rounded clamp value */ + clamp_value = uclamp_round(clamp_value); + /* Set clamp group map */ uc_map[group_id].value = clamp_value; uc_map[group_id].se_count = 0; @@ -892,6 +923,7 @@ uclamp_group_find(int clamp_id, unsigned int clamp_value) int free_group_id = UCLAMP_NOT_VALID; unsigned int group_id = 0; + clamp_value = uclamp_round(clamp_value); for ( ; group_id <= CONFIG_UCLAMP_GROUPS_COUNT; ++group_id) { /* Keep track of first free clamp group */ if (uclamp_group_available(clamp_id, group_id)) { @@ -979,6 +1011,22 @@ static inline void uclamp_cpu_update(struct rq *rq, int clamp_id, * task_struct::uclamp::effective::value * is updated to represent the clamp value corresponding to the taks effective * group index. + * + * Thus, the effective clamp value for a task is guaranteed to be in the range of + * the rounded clamp values of its effective clamp group.
For example: + * - CONFIG_UCLAMP_GROUPS_COUNT=5 => UCLAMP_GROUP_DELTA=20% + * - TaskA: util_min=25% => clamp_group1: range [20-39]% + * - TaskB: util_min=35% => clamp_group1: range [20-39]% + * - TaskGroupA: util_min=10% => clamp_group0: range [ 0-19]% + * Then, when TaskA is part of TaskGroupA, it will be: + * - allocated in clamp_group1 + * - clamp_group1.value=25 + * while TaskA is running alone + * - clamp_group1.value=35 + * since TaskB was RUNNABLE and until TaskA is RUNNABLE + * - clamp_group1.value=20 + * i.e. CPU's clamp group value is reset to the nominal rounded value, + * while TaskA and TaskB are not RUNNABLE */ static inline int uclamp_task_group_id(struct task_struct *p, int clamp_id) { @@ -1106,6 +1154,10 @@ static inline void uclamp_cpu_get_id(struct task_struct *p, uc_cpu->value[clamp_id] = clamp_value; } + /* Track the max effective clamp value for each CPU's clamp group */ + if (clamp_value > uc_cpu->group[clamp_id][group_id].value) + uc_cpu->group[clamp_id][group_id].value = clamp_value; + /* * If this is the new max utilization clamp value, then we can update * straight away the CPU clamp value. Otherwise, the current CPU clamp @@ -1170,8 +1222,13 @@ static inline void uclamp_cpu_put_id(struct task_struct *p, cpu_of(rq), clamp_id, group_id); } #endif - if (clamp_value >= uc_cpu->value[clamp_id]) + if (clamp_value >= uc_cpu->value[clamp_id]) { + /* Reset CPU's clamp value to rounded clamp group value */ + clamp_value = uclamp_group_value(clamp_id, group_id); + uc_cpu->group[clamp_id][group_id].value = clamp_value; + uclamp_cpu_update(rq, clamp_id, clamp_value); + } } /** diff --git a/kernel/sched/features.h b/kernel/sched/features.h index aad826aa55f8..5b7d0965b090 100644 --- a/kernel/sched/features.h +++ b/kernel/sched/features.h @@ -95,3 +95,8 @@ SCHED_FEAT(UTIL_EST, true) * Utilization clamping lazy update. */ SCHED_FEAT(UCLAMP_LAZY_UPDATE, false) + +/* + * Utilization clamping discretization. 
+ */ +SCHED_FEAT(UCLAMP_ROUNDING, true) From patchwork Tue Aug 28 13:53:24 2018 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Patrick Bellasi X-Patchwork-Id: 10578575 Return-Path: Received: from mail.wl.linuxfoundation.org (pdx-wl-mail.web.codeaurora.org [172.30.200.125]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 233BD920 for ; Tue, 28 Aug 2018 13:55:30 +0000 (UTC) Received: from mail.wl.linuxfoundation.org (localhost [127.0.0.1]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 134E12A30F for ; Tue, 28 Aug 2018 13:55:30 +0000 (UTC) Received: by mail.wl.linuxfoundation.org (Postfix, from userid 486) id 0714A2A326; Tue, 28 Aug 2018 13:55:30 +0000 (UTC) X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on pdx-wl-mail.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-7.9 required=2.0 tests=BAYES_00,MAILING_LIST_MULTI, RCVD_IN_DNSWL_HI autolearn=ham version=3.3.1 Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 52E642A30F for ; Tue, 28 Aug 2018 13:55:29 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1728331AbeH1RrL (ORCPT ); Tue, 28 Aug 2018 13:47:11 -0400 Received: from foss.arm.com ([217.140.101.70]:38678 "EHLO foss.arm.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1727836AbeH1Rq7 (ORCPT ); Tue, 28 Aug 2018 13:46:59 -0400 Received: from usa-sjc-imap-foss1.foss.arm.com (unknown [10.72.51.249]) by usa-sjc-mx-foss1.foss.arm.com (Postfix) with ESMTP id CFE6C1BC0; Tue, 28 Aug 2018 06:54:49 -0700 (PDT) Received: from e110439-lin.Cambridge.arm.com (e110439-lin.emea.arm.com [10.4.12.126]) by usa-sjc-imap-foss1.foss.arm.com (Postfix) with ESMTPA id DEDA83F5BD; Tue, 28 Aug 2018 06:54:46 -0700 (PDT) From: Patrick Bellasi To: linux-kernel@vger.kernel.org, linux-pm@vger.kernel.org Cc: Ingo Molnar , Peter Zijlstra , Tejun Heo , "Rafael J . Wysocki" , Viresh Kumar , Vincent Guittot , Paul Turner , Quentin Perret , Dietmar Eggemann , Morten Rasmussen , Juri Lelli , Todd Kjos , Joel Fernandes , Steve Muckle , Suren Baghdasaryan Subject: [PATCH v4 16/16] sched/cpufreq: uclamp: add utilization clamping for RT tasks Date: Tue, 28 Aug 2018 14:53:24 +0100 Message-Id: <20180828135324.21976-17-patrick.bellasi@arm.com> X-Mailer: git-send-email 2.18.0 In-Reply-To: <20180828135324.21976-1-patrick.bellasi@arm.com> References: <20180828135324.21976-1-patrick.bellasi@arm.com> Sender: linux-pm-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-pm@vger.kernel.org X-Virus-Scanned: ClamAV using ClamSMTP Currently schedutil enforces a maximum frequency when RT tasks are RUNNABLE. Such a mandatory policy can be made more tunable from userspace thus allowing for example to define a max frequency which is still reasonable for the execution of a specific RT workload. This will contribute to make the RT class more friendly for power/energy sensitive use-cases. This patch extends the usage of util_{min,max} to the RT scheduling class. Whenever a task in this class is RUNNABLE, the util required is defined by its task specific clamp value. However, we still want to run at maximum capacity RT tasks which: - do not have task specific clamp values - run either in the root task group or an autogroup Let's add uclamp_default_perf, a special set of clamp value to be used for tasks that require maximum performance. 
This set of clamps are then used whenever the above conditions matches for an RT task being enqueued on a CPU. Since utilization clamping applies now to both CFS and RT tasks, we clamp the combined utilization of these two classes. This approach, contrary to combining individually clamped utilizations, is more power efficient. Indeed, it selects lower frequencies when we have both RT and CFS clamped tasks. However, it could also affect performance of the lower priority CFS class, since the CFS's minimum utilization clamp could be completely eclipsed by the RT workloads. The IO wait boost value also is subject to clamping for RT tasks. This is to ensure that RT tasks as well as CFS ones are always subject to the set of current utilization clamping constraints. Signed-off-by: Patrick Bellasi Cc: Ingo Molnar Cc: Peter Zijlstra Cc: Rafael J. Wysocki Cc: Viresh Kumar Cc: Suren Baghdasaryan Cc: Todd Kjos Cc: Joel Fernandes Cc: Juri Lelli Cc: Quentin Perret Cc: Dietmar Eggemann Cc: Morten Rasmussen Cc: linux-kernel@vger.kernel.org Cc: linux-pm@vger.kernel.org --- Changes in v4: Message-ID: <20180813150112.GE2605@e110439-lin> - remove UCLAMP_SCHED_CLASS policy since we do not have in the current implementation a proper per-sched_class clamp tracking support Message-ID: <20180809155551.bp46sixk4u3ilcnh@queper01-lin> - add default boost for not clamped RT tasks Others: - rebased on v4.19-rc1 Changes in v3: - rebased on tip/sched/core Changes in v2: - rebased on v4.18-rc4 --- kernel/sched/core.c | 30 ++++++++++++++++++++++++------ kernel/sched/cpufreq_schedutil.c | 22 ++++++++++++---------- kernel/sched/rt.c | 4 ++++ 3 files changed, 40 insertions(+), 16 deletions(-) diff --git a/kernel/sched/core.c b/kernel/sched/core.c index f71e15eaf152..9761457af1ac 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -741,6 +741,7 @@ unsigned int sysctl_sched_uclamp_util_min; unsigned int sysctl_sched_uclamp_util_max = 100; static struct uclamp_se uclamp_default[UCLAMP_CNT]; +static struct uclamp_se uclamp_default_perf[UCLAMP_CNT]; /** * uclamp_map: reference counts a utilization "clamp value" @@ -1052,10 +1053,15 @@ static inline int uclamp_task_group_id(struct task_struct *p, int clamp_id) */ if (unclamped && (task_group_is_autogroup(task_group(p)) || task_group(p) == &root_task_group)) { - p->uclamp[clamp_id].effective.value = - uclamp_default[clamp_id].value; - return uclamp_default[clamp_id].group_id; + /* Unclamped RT tasks: max perfs by default */ + uc_se = task_has_rt_policy(p) + ? &uclamp_default_perf[clamp_id] + : &uclamp_default[clamp_id]; + + p->uclamp[clamp_id].effective.value = uc_se->value; + + return uc_se->group_id; } /* Use TG's clamp value to limit task specific values */ @@ -1069,10 +1075,15 @@ static inline int uclamp_task_group_id(struct task_struct *p, int clamp_id) #else /* By default, all tasks get the system default clamp value */ if (unclamped) { - p->uclamp[clamp_id].effective.value = - uclamp_default[clamp_id].value; - return uclamp_default[clamp_id].group_id; + /* Unclamped RT tasks: max perfs by default */ + uc_se = task_has_rt_policy(p) + ? 
&uclamp_default_perf[clamp_id] + : &uclamp_default[clamp_id]; + + p->uclamp[clamp_id].effective.value = uc_se->value; + + return uc_se->group_id; } #endif @@ -1761,6 +1772,13 @@ static void __init init_uclamp(void) uc_se->group_id = UCLAMP_NOT_VALID; uclamp_group_get(NULL, NULL, clamp_id, 0, uc_se, uclamp_none(clamp_id)); + + /* Init max perf clamps: default for RT tasks */ + uc_se = &uclamp_default_perf[clamp_id]; + uc_se->group_id = UCLAMP_NOT_VALID; + uclamp_group_get(NULL, NULL, clamp_id, 0, uc_se, + uclamp_none(UCLAMP_MAX)); + } } diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c index 949082555ee8..8a2d12a691eb 100644 --- a/kernel/sched/cpufreq_schedutil.c +++ b/kernel/sched/cpufreq_schedutil.c @@ -205,7 +205,10 @@ static unsigned long sugov_get_util(struct sugov_cpu *sg_cpu) sg_cpu->max = max = arch_scale_cpu_capacity(NULL, sg_cpu->cpu); sg_cpu->bw_dl = cpu_bw_dl(rq); - if (rt_rq_is_runnable(&rq->rt)) + util = rt_rq_is_runnable(&rq->rt) + ? uclamp_util(rq, SCHED_CAPACITY_SCALE) + : cpu_util_rt(rq); + if (unlikely(util >= max)) return max; /* @@ -223,13 +226,14 @@ static unsigned long sugov_get_util(struct sugov_cpu *sg_cpu) * utilization (PELT windows are synchronized) we can directly add them * to obtain the CPU's actual utilization. * - * CFS utilization can be boosted or capped, depending on utilization - * clamp constraints configured for currently RUNNABLE tasks. + * CFS and RT utilizations can be boosted or capped, depending on + * utilization constraints enforce by currently RUNNABLE tasks. */ - util = cpu_util_cfs(rq); + util += cpu_util_cfs(rq); if (util) util = uclamp_util(rq, util); - util += cpu_util_rt(rq); + if (unlikely(util >= max)) + return max; /* * We do not make cpu_util_dl() a permanent part of this sum because we @@ -333,13 +337,11 @@ static void sugov_iowait_boost(struct sugov_cpu *sg_cpu, u64 time, * * Since DL tasks have a much more advanced bandwidth control, it's * safe to assume that IO boost does not apply to those tasks. - * Instead, since RT tasks are not utiliation clamped, we don't want - * to apply clamping on IO boost while there is blocked RT - * utilization. + * Instead, for CFS and RT tasks we clamp the IO boost max value + * considering the current constraints for the CPU. */ max_boost = sg_cpu->iowait_boost_max; - if (!cpu_util_rt(cpu_rq(sg_cpu->cpu))) - max_boost = uclamp_util(cpu_rq(sg_cpu->cpu), max_boost); + max_boost = uclamp_util(cpu_rq(sg_cpu->cpu), max_boost); /* Double the boost at each request */ if (sg_cpu->iowait_boost) { diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c index 2e2955a8cf8f..06ec33467dd9 100644 --- a/kernel/sched/rt.c +++ b/kernel/sched/rt.c @@ -2404,6 +2404,10 @@ const struct sched_class rt_sched_class = { .switched_to = switched_to_rt, .update_curr = update_curr_rt, + +#ifdef CONFIG_UCLAMP_TASK + .uclamp_enabled = 1, +#endif }; #ifdef CONFIG_RT_GROUP_SCHED
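As a closing note on the "clamp the combined utilization" choice discussed in the changelog of this patch, a back-of-the-envelope sketch with made-up numbers illustrates why clamping the aggregate selects lower frequencies than summing individually clamped contributions; clamp_util() below is a local stand-in for the kernel's uclamp_util(), not kernel code:

#include <stdio.h>

/* Local stand-in for the kernel's uclamp_util(): min/max clamp a value. */
static unsigned long clamp_util(unsigned long util,
				unsigned long util_min, unsigned long util_max)
{
	if (util < util_min)
		return util_min;
	if (util > util_max)
		return util_max;
	return util;
}

int main(void)
{
	unsigned long rt = 300, cfs = 400;		/* per-class utilization (made up) */
	unsigned long util_min = 0, util_max = 512;	/* current CPU clamps (made up) */

	/* This patch: clamp the aggregate once -> 512 */
	printf("combined: %lu\n", clamp_util(rt + cfs, util_min, util_max));

	/* Alternative: clamp each class, then sum -> 700 */
	printf("separate: %lu\n", clamp_util(rt, util_min, util_max) +
				  clamp_util(cfs, util_min, util_max));

	return 0;
}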