diff mbox

[2/3] arm64: topology: Tell the scheduler about the relative power of cores

Message ID 1386792029-23148-2-git-send-email-broonie@kernel.org (mailing list archive)
State New, archived
Headers show

Commit Message

Mark Brown Dec. 11, 2013, 8 p.m. UTC
From: Mark Brown <broonie@linaro.org>

In non-heterogeneous systems like big.LITTLE systems the scheduler will be
able to make better use of the available cores if we provide power numbers
to it indicating their relative performance. Do this by parsing the CPU
nodes in the DT.

The power numbers are the same as for ARMv7 since it seems that the
expected differential between the big and little cores is very similar on
both ARMv7 and ARMv8. These numbers are just an initial and basic
approximation for use with the current scheduler, it is likely that both
experience with silicon and ongoing work on improving the scheduler will
lead to further tuning.

Signed-off-by: Mark Brown <broonie@linaro.org>
---
 arch/arm64/kernel/topology.c | 164 +++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 164 insertions(+)

Comments

Morten Rasmussen Dec. 12, 2013, 11:35 a.m. UTC | #1
On Wed, Dec 11, 2013 at 08:00:28PM +0000, Mark Brown wrote:
> From: Mark Brown <broonie@linaro.org>
> 
> In non-heterogeneous systems like big.LITTLE systems the scheduler will be
> able to make better use of the available cores if we provide power numbers
> to it indicating their relative performance. Do this by parsing the CPU
> nodes in the DT.

Setting the relative performance through cpu_power in it current form
eads to sub-optimal scheduling on big.LITTLE in common scenarios.

I know that this is how it is currently done for ARMv7 and one could
argue that we should do the same for ARMv8 until we have a better
solution. I just want to highlight that setting cpu_power this way is
not generally the right thing to do for big.LITTLE. It will have to be
fixed eventually.

I fully agree that we need to pass relative performance information to
the scheduler, but it isn't really ready for it yet.

It doesn't harm when using the ARM big.LITTLE reference patches as they
ignore cpu_power for load-balancing. However, it leads to
under-utilization of little cpus with the mainline scheduler.

> 
> The power numbers are the same as for ARMv7 since it seems that the
> expected differential between the big and little cores is very similar on
> both ARMv7 and ARMv8. These numbers are just an initial and basic
> approximation for use with the current scheduler, it is likely that both
> experience with silicon and ongoing work on improving the scheduler will
> lead to further tuning.

As said above, it needs to be fixed in the scheduler.

Morten
Mark Brown Dec. 12, 2013, 12:06 p.m. UTC | #2
On Thu, Dec 12, 2013 at 11:35:12AM +0000, Morten Rasmussen wrote:

> I know that this is how it is currently done for ARMv7 and one could
> argue that we should do the same for ARMv8 until we have a better
> solution. I just want to highlight that setting cpu_power this way is
> not generally the right thing to do for big.LITTLE. It will have to be
> fixed eventually.

Right, there's a good solid reason for all the work on the scheduler.  I
definitely think we ought to be following the same approach on both
ARMv7 and ARMv8 to avoid confusion between people based on the platform
they're working on.

If you're saying that the current ARMv7 code is always worse than doing
nothing then clearly we ought to be removing that code from ARMv7 rather
than hurting performance.  I'd been under the impression that what we
had there was not ideal but better than nothing in mainline rather than
actively harmful.
Morten Rasmussen Dec. 12, 2013, 1:36 p.m. UTC | #3
On Thu, Dec 12, 2013 at 12:06:49PM +0000, Mark Brown wrote:
> On Thu, Dec 12, 2013 at 11:35:12AM +0000, Morten Rasmussen wrote:
> 
> > I know that this is how it is currently done for ARMv7 and one could
> > argue that we should do the same for ARMv8 until we have a better
> > solution. I just want to highlight that setting cpu_power this way is
> > not generally the right thing to do for big.LITTLE. It will have to be
> > fixed eventually.
> 
> Right, there's a good solid reason for all the work on the scheduler.  I
> definitely think we ought to be following the same approach on both
> ARMv7 and ARMv8 to avoid confusion between people based on the platform
> they're working on.

That is fair enough.

> 
> If you're saying that the current ARMv7 code is always worse than doing
> nothing then clearly we ought to be removing that code from ARMv7 rather
> than hurting performance.  I'd been under the impression that what we
> had there was not ideal but better than nothing in mainline rather than
> actively harmful.

For some scenarios it might be better to set cpu_power to reflect the
relative performance, for others it is worse due to the way cpu_power
is currently used in the scheduler.

Setting cpu_power as it is done for v7 may bias the scheduler to put
heavier tasks on big cpus and will generally put more task on big cpus,
which is a good thing for some scenarios. However, if you have parallel
workloads that spawn a worker thread for each cpu and does dynamic work
distribution in user-space (OpenMP applications for example), then
setting cpu_power will put two worker threads on some big cpus and leave
some little cpus idle resulting in slower completion time. It happens on
TC2.

We need the code (or something very similar) later when the scheduler
has been fixed. For v7 it has been left in waiting for that fix, it
doesn't harm when using the reference big.LITTLE patches. We can do the
same for v8 to avoid confusion.

Morten
Catalin Marinas Dec. 12, 2013, 5:39 p.m. UTC | #4
On Thu, Dec 12, 2013 at 01:36:43PM +0000, Morten Rasmussen wrote:
> On Thu, Dec 12, 2013 at 12:06:49PM +0000, Mark Brown wrote:
> > On Thu, Dec 12, 2013 at 11:35:12AM +0000, Morten Rasmussen wrote:
> > 
> > > I know that this is how it is currently done for ARMv7 and one could
> > > argue that we should do the same for ARMv8 until we have a better
> > > solution. I just want to highlight that setting cpu_power this way is
> > > not generally the right thing to do for big.LITTLE. It will have to be
> > > fixed eventually.
> > 
> > Right, there's a good solid reason for all the work on the scheduler.  I
> > definitely think we ought to be following the same approach on both
> > ARMv7 and ARMv8 to avoid confusion between people based on the platform
> > they're working on.
[...]
> > If you're saying that the current ARMv7 code is always worse than doing
> > nothing then clearly we ought to be removing that code from ARMv7 rather
> > than hurting performance.  I'd been under the impression that what we
> > had there was not ideal but better than nothing in mainline rather than
> > actively harmful.
> 
> For some scenarios it might be better to set cpu_power to reflect the
> relative performance, for others it is worse due to the way cpu_power
> is currently used in the scheduler.
> 
> Setting cpu_power as it is done for v7 may bias the scheduler to put
> heavier tasks on big cpus and will generally put more task on big cpus,
> which is a good thing for some scenarios. However, if you have parallel
> workloads that spawn a worker thread for each cpu and does dynamic work
> distribution in user-space (OpenMP applications for example), then
> setting cpu_power will put two worker threads on some big cpus and leave
> some little cpus idle resulting in slower completion time. It happens on
> TC2.
> 
> We need the code (or something very similar) later when the scheduler
> has been fixed. For v7 it has been left in waiting for that fix, it
> doesn't harm when using the reference big.LITTLE patches. We can do the
> same for v8 to avoid confusion.

If we can't really guarantee the effect of this patch, I would rather
keep it in the LSK kernel only until the scheduler is fixed (can this be
treated as a performance issue independent of the power-aware
scheduling? We could get it merged quicker).
Mark Brown Dec. 12, 2013, 6:06 p.m. UTC | #5
On Thu, Dec 12, 2013 at 05:39:12PM +0000, Catalin Marinas wrote:

> If we can't really guarantee the effect of this patch, I would rather
> keep it in the LSK kernel only until the scheduler is fixed (can this be
> treated as a performance issue independent of the power-aware
> scheduling? We could get it merged quicker).

My understanding is that the behaviour is reasonably well understood,
it's just not great in all situations (hence all the energy aware
scheduler work) but then the default scheduler behaviour isn't that good
either and possibly worse (hence all the energy aware scheduler work).

I think if we're that worried about problems that might be caused by
doing this we should remove the equivalent ARMv7 code to keep the two in
parity in terms of behaviour - it's going to be enough work getting
big.LITTLE working well without introducing software only differences
between ARMv7 and ARMv8 big.LITTLE.  If you're dead set on that then I
can do patches for that.
diff mbox

Patch

diff --git a/arch/arm64/kernel/topology.c b/arch/arm64/kernel/topology.c
index aae9d4d72328..c88970b1b863 100644
--- a/arch/arm64/kernel/topology.c
+++ b/arch/arm64/kernel/topology.c
@@ -18,6 +18,7 @@ 
 #include <linux/percpu.h>
 #include <linux/node.h>
 #include <linux/nodemask.h>
+#include <linux/of.h>
 #include <linux/sched.h>
 #include <linux/slab.h>
 
@@ -26,6 +27,163 @@ 
 #include <asm/topology.h>
 
 /*
+ * cpu power scale management
+ */
+
+/*
+ * cpu power table
+ * This per cpu data structure describes the relative capacity of each core.
+ * On a heteregenous system, cores don't have the same computation capacity
+ * and we reflect that difference in the cpu_power field so the scheduler can
+ * take this difference into account during load balance. A per cpu structure
+ * is preferred because each CPU updates its own cpu_power field during the
+ * load balance except for idle cores. One idle core is selected to run the
+ * rebalance_domains for all idle cores and the cpu_power can be updated
+ * during this sequence.
+ */
+static DEFINE_PER_CPU(unsigned long, cpu_scale);
+
+unsigned long arch_scale_freq_power(struct sched_domain *sd, int cpu)
+{
+	return per_cpu(cpu_scale, cpu);
+}
+
+static void set_power_scale(unsigned int cpu, unsigned long power)
+{
+	per_cpu(cpu_scale, cpu) = power;
+}
+
+#ifdef CONFIG_OF
+struct cpu_efficiency {
+	const char *compatible;
+	unsigned long efficiency;
+};
+
+/*
+ * Table of relative efficiency of each processors
+ * The efficiency value must fit in 20bit and the final
+ * cpu_scale value must be in the range
+ *   0 < cpu_scale < 3*SCHED_POWER_SCALE/2
+ * in order to return at most 1 when DIV_ROUND_CLOSEST
+ * is used to compute the capacity of a CPU.
+ * Processors that are not defined in the table,
+ * use the default SCHED_POWER_SCALE value for cpu_scale.
+ */
+static const struct cpu_efficiency table_efficiency[] = {
+	{ "arm,cortex-a57", 3891 },
+	{ "arm,cortex-a53", 2048 },
+	{ NULL, },
+};
+
+static unsigned long *__cpu_capacity;
+#define cpu_capacity(cpu)	__cpu_capacity[cpu]
+
+static unsigned long middle_capacity = 1;
+
+/*
+ * Iterate all CPUs' descriptor in DT and compute the efficiency
+ * (as per table_efficiency). Also calculate a middle efficiency
+ * as close as possible to  (max{eff_i} - min{eff_i}) / 2
+ * This is later used to scale the cpu_power field such that an
+ * 'average' CPU is of middle power. Also see the comments near
+ * table_efficiency[] and update_cpu_power().
+ */
+static void __init parse_dt_topology(void)
+{
+	const struct cpu_efficiency *cpu_eff;
+	struct device_node *cn = NULL;
+	unsigned long min_capacity = (unsigned long)(-1);
+	unsigned long max_capacity = 0;
+	unsigned long capacity = 0;
+	int alloc_size, cpu;
+
+	alloc_size = nr_cpu_ids * sizeof(*__cpu_capacity);
+	__cpu_capacity = kzalloc(alloc_size, GFP_NOWAIT);
+
+	for_each_possible_cpu(cpu) {
+		const u32 *rate;
+		int len;
+
+		/* Too early to use cpu->of_node */
+		cn = of_get_cpu_node(cpu, NULL);
+		if (!cn) {
+			pr_err("Missing device node for CPU %d\n", cpu);
+			continue;
+		}
+
+		/* check if the cpu is marked as "disabled", if so ignore */
+		if (!of_device_is_available(cn))
+			continue;
+
+		for (cpu_eff = table_efficiency; cpu_eff->compatible; cpu_eff++)
+			if (of_device_is_compatible(cn, cpu_eff->compatible))
+				break;
+
+		if (cpu_eff->compatible == NULL) {
+			pr_warn("%s: Unknown CPU type\n", cn->full_name);
+			continue;
+		}
+
+		rate = of_get_property(cn, "clock-frequency", &len);
+		if (!rate || len != 4) {
+			pr_err("%s: Missing clock-frequency property\n",
+				cn->full_name);
+			continue;
+		}
+
+		capacity = ((be32_to_cpup(rate)) >> 20) * cpu_eff->efficiency;
+
+		/* Save min capacity of the system */
+		if (capacity < min_capacity)
+			min_capacity = capacity;
+
+		/* Save max capacity of the system */
+		if (capacity > max_capacity)
+			max_capacity = capacity;
+
+		cpu_capacity(cpu) = capacity;
+	}
+
+	/* If min and max capacities are equal we bypass the update of the
+	 * cpu_scale because all CPUs have the same capacity. Otherwise, we
+	 * compute a middle_capacity factor that will ensure that the capacity
+	 * of an 'average' CPU of the system will be as close as possible to
+	 * SCHED_POWER_SCALE, which is the default value, but with the
+	 * constraint explained near table_efficiency[].
+	 */
+	if (min_capacity == max_capacity)
+		return;
+	else if (4 * max_capacity < (3 * (max_capacity + min_capacity)))
+		middle_capacity = (min_capacity + max_capacity)
+				>> (SCHED_POWER_SHIFT+1);
+	else
+		middle_capacity = ((max_capacity / 3)
+				>> (SCHED_POWER_SHIFT-1)) + 1;
+
+}
+
+/*
+ * Look for a customed capacity of a CPU in the cpu_topo_data table during the
+ * boot. The update of all CPUs is in O(n^2) for heteregeneous system but the
+ * function returns directly for SMP system.
+ */
+static void update_cpu_power(unsigned int cpu)
+{
+	if (!cpu_capacity(cpu))
+		return;
+
+	set_power_scale(cpu, cpu_capacity(cpu) / middle_capacity);
+
+	pr_info("CPU%u: update cpu_power %lu\n",
+		cpu, arch_scale_freq_power(NULL, cpu));
+}
+
+#else
+static inline void parse_dt_topology(void) {}
+static inline void update_cpu_power(unsigned int cpuid) {}
+#endif
+
+/*
  * cpu topology table
  */
 struct cputopo_arm cpu_topology[NR_CPUS];
@@ -71,6 +229,8 @@  void store_cpu_topology(unsigned int cpuid)
 		pr_info("CPU%u: No topology information configured\n", cpuid);
 	else
 		update_siblings_masks(cpuid);
+
+	update_cpu_power(cpuid);
 }
 
 
@@ -119,6 +279,10 @@  void __init init_cpu_topology(void)
 		cpu_topo->socket_id = -1;
 		cpumask_clear(&cpu_topo->core_sibling);
 		cpumask_clear(&cpu_topo->thread_sibling);
+
+		set_power_scale(cpu, SCHED_POWER_SCALE);
 	}
 	smp_wmb();
+
+	parse_dt_topology();
 }