
[3/4] arm64: topology: Tell the scheduler about the relative power of cores

Message ID 1387483575-11430-4-git-send-email-broonie@kernel.org (mailing list archive)
State New, archived

Commit Message

Mark Brown Dec. 19, 2013, 8:06 p.m. UTC
From: Mark Brown <broonie@linaro.org>

In heterogeneous systems such as big.LITTLE, the scheduler can make better
use of the available cores if we provide it with power numbers indicating
their relative performance. Do this by parsing the CPU nodes in the DT.

This code currently has no effect as no information on the relative
performance of the cores is provided.

Signed-off-by: Mark Brown <broonie@linaro.org>
---
 arch/arm64/kernel/topology.c | 145 +++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 145 insertions(+)

Comments

Lorenzo Pieralisi Jan. 7, 2014, 1:05 p.m. UTC | #1
On Thu, Dec 19, 2013 at 08:06:14PM +0000, Mark Brown wrote:
> From: Mark Brown <broonie@linaro.org>
> 
> In heterogeneous systems such as big.LITTLE, the scheduler can make better
> use of the available cores if we provide it with power numbers indicating
> their relative performance. Do this by parsing the CPU nodes in the DT.
> 
> This code currently has no effect as no information on the relative
> performance of the cores is provided.
> 
> Signed-off-by: Mark Brown <broonie@linaro.org>
> ---
>  arch/arm64/kernel/topology.c | 145 +++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 145 insertions(+)
> 
> diff --git a/arch/arm64/kernel/topology.c b/arch/arm64/kernel/topology.c

[...]

> +/*
> + * Iterate all CPUs' descriptor in DT and compute the efficiency
> + * (as per table_efficiency). Also calculate a middle efficiency
> + * as close as possible to (max{eff_i} + min{eff_i}) / 2
> + * This is later used to scale the cpu_power field such that an
> + * 'average' CPU is of middle power. Also see the comments near
> + * table_efficiency[] and update_cpu_power().
> + */
>  static void __init parse_dt_topology(void)
>  {
> +	const struct cpu_efficiency *cpu_eff;
>  	struct device_node *cn;
> +	unsigned long min_capacity = (unsigned long)(-1);

ULONG_MAX ?

> +	unsigned long max_capacity = 0;
> +	unsigned long capacity = 0;
> +	int alloc_size, cpu;
> +
> +	alloc_size = nr_cpu_ids * sizeof(*__cpu_capacity);
> +	__cpu_capacity = kzalloc(alloc_size, GFP_NOWAIT);

kcalloc? BTW, this patch, not the previous ones, should include slab.h,
because it is the only patch where memory allocation takes place, unless
I am missing something.

>  	cn = of_find_node_by_path("/cpus");
>  	if (!cn) {
> @@ -158,10 +221,88 @@ static void __init parse_dt_topology(void)
>  	if (!cn)
>  		return;
>  	parse_cluster(cn);
> +
> +	for_each_possible_cpu(cpu) {
> +		const u32 *rate;
> +		int len;
> +
> +		/* Too early to use cpu->of_node */
> +		cn = of_get_cpu_node(cpu, NULL);
> +		if (!cn) {
> +			pr_err("Missing device node for CPU %d\n", cpu);
> +			continue;
> +		}
> +
> +		/* check if the cpu is marked as "disabled", if so ignore */
> +		if (!of_device_is_available(cn))
> +			continue;

It is time we defined what a "disabled" CPU means in ARM world, I need to
have a proper look into this since this topic has been brought up before.

Lorenzo
Mark Brown Jan. 7, 2014, 1:38 p.m. UTC | #2
On Tue, Jan 07, 2014 at 01:05:40PM +0000, Lorenzo Pieralisi wrote:
> On Thu, Dec 19, 2013 at 08:06:14PM +0000, Mark Brown wrote:

> > +		/* check if the cpu is marked as "disabled", if so ignore */
> > +		if (!of_device_is_available(cn))
> > +			continue;

> It is time we defined what a "disabled" CPU means in ARM world, I need to
> have a proper look into this since this topic has been brought up before.

What is the confusion here - why would there be something architecture
specific going on?
Lorenzo Pieralisi Jan. 7, 2014, 2:29 p.m. UTC | #3
On Tue, Jan 07, 2014 at 01:38:29PM +0000, Mark Brown wrote:
> On Tue, Jan 07, 2014 at 01:05:40PM +0000, Lorenzo Pieralisi wrote:
> > On Thu, Dec 19, 2013 at 08:06:14PM +0000, Mark Brown wrote:
> 
> > > +		/* check if the cpu is marked as "disabled", if so ignore */
> > > +		if (!of_device_is_available(cn))
> > > +			continue;
> 
> > It is time we defined what a "disabled" CPU means in ARM world, I need to
> > have a proper look into this since this topic has been brought up before.
> 
> What is the confusion here - why would there be something architecture
> specific going on?

I think this check was added following this thread discussion:

http://lkml.indiana.edu/hypermail/linux/kernel/1306.0/03663.html

So my question is: what does "disabled" mean ? A CPU present in HW
that can't/must not be booted ?

ePAPR v1.1 page 43:

"disabled". The CPU is in a quiescent state. A quiescent CPU is in a state
where it cannot interfere with the normal operation of other CPUs, nor can
its state be affected by the normal operation of other running CPUs, except
by an explicit method for enabling or reenabling the quiescent CPU (see the
enable-method property).

This means that a "disabled" CPU can be booted with e.g. PSCI, but that is
not what the thread in the link above wants to achieve. Furthermore, if
we add the check in topology.c, the check must also be executed when
building the cpu_logical_map; otherwise a "disabled" CPU would be marked
possible and then booted, am I wrong?

Lorenzo
Mark Brown Jan. 7, 2014, 3:06 p.m. UTC | #4
On Tue, Jan 07, 2014 at 02:29:29PM +0000, Lorenzo Pieralisi wrote:
> On Tue, Jan 07, 2014 at 01:38:29PM +0000, Mark Brown wrote:
> > On Tue, Jan 07, 2014 at 01:05:40PM +0000, Lorenzo Pieralisi wrote:

> > > It is time we defined what a "disabled" CPU means in ARM world, I need to
> > > have a proper look into this since this topic has been brought up before.

> > What is the confusion here - why would there be something architecture
> > specific going on?

> I think this check was added following this thread discussion:

> http://lkml.indiana.edu/hypermail/linux/kernel/1306.0/03663.html

> So my question is: what does "disabled" mean ? A CPU present in HW
> that can't/must not be booted ?

Yes, that would seem to be the obvious meaning and consistent with ePAPR
(in so far as we're paying a blind bit of notice to ePAPR, see other
threads...).

> ePAPR v1.1 page 43:

> "disabled". The CPU is in a quiescent state. A quiescent CPU is in a state
> where it cannot interfere with the normal operation of other CPUs, nor can
> its state be affected by the normal operation of other running CPUs, except
> by an explicit method for enabling or reenabling the quiescent CPU (see the
> enable-method property).

> This means that a "disabled" CPU can be booted with eg PSCI but that is
> not what the thread in the link above wants to achieve. Furthermore, if

I think that's just another bit of ill-considered wording from ePAPR
that doesn't really reflect reality; it seems like what they're trying
to shoot for there is administratively down.

At the very least that means hot unplugged, and it seems reasonable to
read that as being stronger than that.  The current ARM implementation
is more conservative since it doesn't provide any way to put the core
online, but it does seem more likely to match what a system integrator
would be trying to achieve and it also matches the standard meaning of
disabled.

> we add the check in topology.c, the check must be executed also when
> building the cpu_logical_map, otherwise a "disabled" cpu would be marked
> possible and then booted, am I wrong ?

Right, this is a separate issue in the SMP enumeration code - it should
be paying attention to the property and at the very least defaulting the
core to being unplugged, though like I say I do find the ARM meaning
more sane.

In any case I don't vastly care, I guess I'll drop this for now.
Lorenzo Pieralisi Jan. 7, 2014, 5:56 p.m. UTC | #5
On Tue, Jan 07, 2014 at 03:06:31PM +0000, Mark Brown wrote:
> On Tue, Jan 07, 2014 at 02:29:29PM +0000, Lorenzo Pieralisi wrote:
> > On Tue, Jan 07, 2014 at 01:38:29PM +0000, Mark Brown wrote:
> > > On Tue, Jan 07, 2014 at 01:05:40PM +0000, Lorenzo Pieralisi wrote:
> 
> > > > It is time we defined what a "disabled" CPU means in ARM world, I need to
> > > > have a proper look into this since this topic has been brought up before.
> 
> > > What is the confusion here - why would there be something architecture
> > > specific going on?
> 
> > I think this check was added following this thread discussion:
> 
> > http://lkml.indiana.edu/hypermail/linux/kernel/1306.0/03663.html
> 
> > So my question is: what does "disabled" mean ? A CPU present in HW
> > that can't/must not be booted ?
> 
> Yes, that would seem to be the obvious meaning and consistent with ePAPR
> (in so far as we're paying a blind bit of notice to ePAPR, see other
> threads...).

Just playing devil's advocate and trying to reuse ePAPR bindings as much as
possible, provided they define what we need on ARM. In this case it seems they
do not.

> > ePAPR v1.1 page 43:
> 
> > "disabled". The CPU is in a quiescent state. A quiescent CPU is in a state
> > where it cannot interfere with the normal operation of other CPUs, nor can
> > its state be affected by the normal operation of other running CPUs, except
> > by an explicit method for enabling or reenabling the quiescent CPU (see the
> > enable-method property).
> 
> > This means that a "disabled" CPU can be booted with eg PSCI but that is
> > not what the thread in the link above wants to achieve. Furthermore, if
> 
> I think that's just another bit of ill considered wording from ePAPR
> that doesn't really reflect reality; it seems like what they're trying
> to shoot for there is administratively down.  
> 
> At the very least that means hot unplugged, and it seems reasonable to
> read that as being stronger than that.  The current ARM implementation
> is more conservative since it doesn't provide any way to put the core on
> line but it does seem more likely to match what a system integrator
> would be trying to achieve and it also matches the standard meaning of
> disabled.

What do you mean by ARM implementation ? The status property is currently
ignored on ARM. I'd agree with what you are saying but that should be
specified in DT bindings.

> > we add the check in topology.c, the check must be executed also when
> > building the cpu_logical_map, otherwise a "disabled" cpu would be marked
> > possible and then booted, am I wrong ?
> 
> Right, this is a separate issue in the SMP enumeration code - it should
> be paying attention to the property and at the very least defaulting the
> core to being unplugged, though like I say I do find the ARM meaning
> more sane.

Again, I tend to agree, since this means that the CPU is there but
simply is not a "possible" one. To be debated.

> In any case I don't vastly care, I guess I'll drop this for now.

Yes, I think dropping the check is fine for now, we can add it if/when
we achieve consensus, that should not be a big deal.

Lorenzo
Mark Brown Jan. 7, 2014, 6:02 p.m. UTC | #6
On Tue, Jan 07, 2014 at 05:56:42PM +0000, Lorenzo Pieralisi wrote:
> On Tue, Jan 07, 2014 at 03:06:31PM +0000, Mark Brown wrote:

> > At the very least that means hot unplugged, and it seems reasonable to
> > read that as being stronger than that.  The current ARM implementation
> > is more conservative since it doesn't provide any way to put the core on
> > line but it does seem more likely to match what a system integrator
> > would be trying to achieve and it also matches the standard meaning of
> > disabled.

> What do you mean by ARM implementation ? The status property is currently
> ignored on ARM. I'd agree with what you are saying but that should be
> specified in DT bindings.

The 32 bit ARM implementation.  That code was supposed to be just
cut'n'pasted from there, though I see now it wasn't...

Patch

diff --git a/arch/arm64/kernel/topology.c b/arch/arm64/kernel/topology.c
index 5a2724b3d4b7..68ccf4f4f258 100644
--- a/arch/arm64/kernel/topology.c
+++ b/arch/arm64/kernel/topology.c
@@ -23,6 +23,29 @@ 
 
 #include <asm/topology.h>
 
+/*
+ * cpu power table
+ * This per cpu data structure describes the relative capacity of each core.
+ * On a heterogeneous system, cores don't have the same computation capacity
+ * and we reflect that difference in the cpu_power field so the scheduler can
+ * take this difference into account during load balance. A per cpu structure
+ * is preferred because each CPU updates its own cpu_power field during the
+ * load balance except for idle cores. One idle core is selected to run the
+ * rebalance_domains for all idle cores and the cpu_power can be updated
+ * during this sequence.
+ */
+static DEFINE_PER_CPU(unsigned long, cpu_scale);
+
+unsigned long arch_scale_freq_power(struct sched_domain *sd, int cpu)
+{
+	return per_cpu(cpu_scale, cpu);
+}
+
+static void set_power_scale(unsigned int cpu, unsigned long power)
+{
+	per_cpu(cpu_scale, cpu) = power;
+}
+
 #ifdef CONFIG_OF
 static int cluster_id;
 
@@ -140,9 +163,49 @@  static void __init parse_cluster(struct device_node *cluster)
 		cluster_id++;
 }
 
+struct cpu_efficiency {
+	const char *compatible;
+	unsigned long efficiency;
+};
+
+/*
+ * Table of relative efficiency of each processor
+ * The efficiency value must fit in 20 bits and the final
+ * cpu_scale value must be in the range
+ *   0 < cpu_scale < 3*SCHED_POWER_SCALE/2
+ * in order to return at most 1 when DIV_ROUND_CLOSEST
+ * is used to compute the capacity of a CPU.
+ * Processors that are not defined in the table
+ * use the default SCHED_POWER_SCALE value for cpu_scale.
+ */
+static const struct cpu_efficiency table_efficiency[] = {
+	{ NULL, },
+};
+
+static unsigned long *__cpu_capacity;
+#define cpu_capacity(cpu)	__cpu_capacity[cpu]
+
+static unsigned long middle_capacity = 1;
+
+/*
+ * Iterate all CPUs' descriptor in DT and compute the efficiency
+ * (as per table_efficiency). Also calculate a middle efficiency
+ * as close as possible to (max{eff_i} + min{eff_i}) / 2
+ * This is later used to scale the cpu_power field such that an
+ * 'average' CPU is of middle power. Also see the comments near
+ * table_efficiency[] and update_cpu_power().
+ */
 static void __init parse_dt_topology(void)
 {
+	const struct cpu_efficiency *cpu_eff;
 	struct device_node *cn;
+	unsigned long min_capacity = (unsigned long)(-1);
+	unsigned long max_capacity = 0;
+	unsigned long capacity = 0;
+	int alloc_size, cpu;
+
+	alloc_size = nr_cpu_ids * sizeof(*__cpu_capacity);
+	__cpu_capacity = kzalloc(alloc_size, GFP_NOWAIT);
 
 	cn = of_find_node_by_path("/cpus");
 	if (!cn) {
@@ -158,10 +221,88 @@  static void __init parse_dt_topology(void)
 	if (!cn)
 		return;
 	parse_cluster(cn);
+
+	for_each_possible_cpu(cpu) {
+		const u32 *rate;
+		int len;
+
+		/* Too early to use cpu->of_node */
+		cn = of_get_cpu_node(cpu, NULL);
+		if (!cn) {
+			pr_err("Missing device node for CPU %d\n", cpu);
+			continue;
+		}
+
+		/* check if the cpu is marked as "disabled", if so ignore */
+		if (!of_device_is_available(cn))
+			continue;
+
+		for (cpu_eff = table_efficiency; cpu_eff->compatible; cpu_eff++)
+			if (of_device_is_compatible(cn, cpu_eff->compatible))
+				break;
+
+		if (cpu_eff->compatible == NULL) {
+			pr_warn("%s: Unknown CPU type\n", cn->full_name);
+			continue;
+		}
+
+		rate = of_get_property(cn, "clock-frequency", &len);
+		if (!rate || len != 4) {
+			pr_err("%s: Missing clock-frequency property\n",
+				cn->full_name);
+			continue;
+		}
+
+		capacity = ((be32_to_cpup(rate)) >> 20) * cpu_eff->efficiency;
+
+		/* Save min capacity of the system */
+		if (capacity < min_capacity)
+			min_capacity = capacity;
+
+		/* Save max capacity of the system */
+		if (capacity > max_capacity)
+			max_capacity = capacity;
+
+		cpu_capacity(cpu) = capacity;
+	}
+
+	/* If min and max capacities are equal we bypass the update of the
+	 * cpu_scale because all CPUs have the same capacity. Otherwise, we
+	 * compute a middle_capacity factor that will ensure that the capacity
+	 * of an 'average' CPU of the system will be as close as possible to
+	 * SCHED_POWER_SCALE, which is the default value, but with the
+	 * constraint explained near table_efficiency[].
+	 */
+	if (min_capacity == max_capacity)
+		return;
+	else if (4 * max_capacity < (3 * (max_capacity + min_capacity)))
+		middle_capacity = (min_capacity + max_capacity)
+				>> (SCHED_POWER_SHIFT+1);
+	else
+		middle_capacity = ((max_capacity / 3)
+				>> (SCHED_POWER_SHIFT-1)) + 1;
+
+}
+
+/*
+ * Look for a custom capacity of a CPU in the cpu_capacity table during
+ * boot. The update of all CPUs is in O(n^2) for heterogeneous systems but the
+ * function returns directly for SMP systems.
+ */
+static void update_cpu_power(unsigned int cpu)
+{
+	if (!cpu_capacity(cpu))
+		return;
+
+	set_power_scale(cpu, cpu_capacity(cpu) / middle_capacity);
+
+	pr_info("CPU%u: update cpu_power %lu\n",
+		cpu, arch_scale_freq_power(NULL, cpu));
 }
 
 #else
 static inline void parse_dt_topology(void) {}
+static inline void update_cpu_power(unsigned int cpuid) {}
 #endif
 
 /*
@@ -210,6 +351,8 @@  void store_cpu_topology(unsigned int cpuid)
 		pr_info("CPU%u: No topology information configured\n", cpuid);
 	else
 		update_siblings_masks(cpuid);
+
+	update_cpu_power(cpuid);
 }
 
 /*
@@ -229,6 +372,8 @@  void __init init_cpu_topology(void)
 		cpu_topo->socket_id = -1;
 		cpumask_clear(&cpu_topo->core_sibling);
 		cpumask_clear(&cpu_topo->thread_sibling);
+
+		set_power_scale(cpu, SCHED_POWER_SCALE);
 	}
 
 	parse_dt_topology();