
[RFCv5,18/46] arm: topology: Define TC2 energy and provide it to the scheduler

Message ID 1436293469-25707-19-git-send-email-morten.rasmussen@arm.com (mailing list archive)
State RFC

Commit Message

Morten Rasmussen July 7, 2015, 6:24 p.m. UTC
From: Dietmar Eggemann <dietmar.eggemann@arm.com>

This patch is only here to be able to test provisioning of energy related
data from an arch topology shim layer to the scheduler. Since there is no
code today which deals with extracting energy related data from the dtb or
acpi and processing it in the topology shim layer, the contents of the
sched_group_energy structures as well as the idle_state and capacity_state
arrays are hard-coded here.

This patch defines the sched_group_energy structure as well as the
idle_state and capacity_state arrays for the cluster (related to sched
groups (sgs) at the DIE sched domain level) and for the core (related to
sgs at the MC sd level) for a Cortex-A7 as well as for a Cortex-A15.
It further provides the related implementations of the sched_domain_energy_f
functions (cpu_cluster_energy() and cpu_core_energy()).

To be able to propagate this information from the topology shim layer to
the scheduler, the elements of the arm_topology[] table have been
provisioned with the appropriate sched_domain_energy_f functions.

cc: Russell King <linux@arm.linux.org.uk>

Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
---
 arch/arm/kernel/topology.c | 118 +++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 115 insertions(+), 3 deletions(-)

Comments

Peter Zijlstra Aug. 12, 2015, 10:33 a.m. UTC | #1
On Tue, Jul 07, 2015 at 07:24:01PM +0100, Morten Rasmussen wrote:
> +static struct capacity_state cap_states_cluster_a7[] = {
> +	/* Cluster only power */
> +	 { .cap =  150, .power = 2967, }, /*  350 MHz */
> +	 { .cap =  172, .power = 2792, }, /*  400 MHz */
> +	 { .cap =  215, .power = 2810, }, /*  500 MHz */
> +	 { .cap =  258, .power = 2815, }, /*  600 MHz */
> +	 { .cap =  301, .power = 2919, }, /*  700 MHz */
> +	 { .cap =  344, .power = 2847, }, /*  800 MHz */
> +	 { .cap =  387, .power = 3917, }, /*  900 MHz */
> +	 { .cap =  430, .power = 4905, }, /* 1000 MHz */
> +	};

So can I suggest a SCHED_DEBUG validation of the data provided?

Given the above table, it _never_ makes sense to run at .cap=150, it
equally also doesn't make sense to run at .cap = 301.

So please add a SCHED_DEBUG test on domain creation that validates that
not only is the .cap monotonically increasing, but the .power is too.
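
A minimal userspace sketch of such a check, mirroring the structures from this patch (the helper name and the idea of returning the first offending index are illustrative, not existing kernel API):

```c
#include <stddef.h>

/* Minimal mirror of the structures from this patch, so the check can
 * be exercised outside the kernel. */
struct capacity_state { unsigned long cap, power; };

struct sched_group_energy {
	unsigned int nr_cap_states;
	const struct capacity_state *cap_states;
};

/* Suggested SCHED_DEBUG-style validation: both .cap and .power must be
 * strictly increasing across the capacity states. Returns the index of
 * the first offending entry, or -1 if the table is valid. */
static int validate_cap_states(const struct sched_group_energy *sge)
{
	unsigned int i;

	for (i = 1; i < sge->nr_cap_states; i++)
		if (sge->cap_states[i].cap <= sge->cap_states[i - 1].cap ||
		    sge->cap_states[i].power <= sge->cap_states[i - 1].power)
			return (int)i;
	return -1;
}
```

Run against the A7 cluster table above, the check already trips at index 1: .power drops from 2967 (350 MHz) to 2792 (400 MHz).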
--
To unsubscribe from this list: send the line "unsubscribe linux-pm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Dietmar Eggemann Aug. 12, 2015, 6:47 p.m. UTC | #2
On 12/08/15 11:33, Peter Zijlstra wrote:
> On Tue, Jul 07, 2015 at 07:24:01PM +0100, Morten Rasmussen wrote:
>> +static struct capacity_state cap_states_cluster_a7[] = {
>> +	/* Cluster only power */
>> +	 { .cap =  150, .power = 2967, }, /*  350 MHz */
>> +	 { .cap =  172, .power = 2792, }, /*  400 MHz */
>> +	 { .cap =  215, .power = 2810, }, /*  500 MHz */
>> +	 { .cap =  258, .power = 2815, }, /*  600 MHz */
>> +	 { .cap =  301, .power = 2919, }, /*  700 MHz */
>> +	 { .cap =  344, .power = 2847, }, /*  800 MHz */
>> +	 { .cap =  387, .power = 3917, }, /*  900 MHz */
>> +	 { .cap =  430, .power = 4905, }, /* 1000 MHz */
>> +	};
> 
> So can I suggest a SCHED_DEBUG validation of the data provided?

Yes we can do that.

> 
> Given the above table, it _never_ makes sense to run at .cap=150, it
> equally also doesn't make sense to run at .cap = 301.
>

Absolutely right.


> So please add a SCHED_DEBUG test on domain creation that validates that
> not only is the .cap monotonically increasing, but the .power is too.

The requirement for the current EAS code to work is even higher. We not
only require monotonically increasing values for .cap and .power but
also that the energy efficiency (.cap/.power) is monotonically
decreasing. Otherwise we can't stop the search for a new appropriate OPP
in find_new_capacity() in case .cap >= current 'max. group usage',
because we can't assume that this OPP will be the most energy efficient one.

For the example above we get .cap/.power = [0.05 0.06 0.08 0.09 0.1 0.12
0.1 0.09], so only the last 3 OPPs [800, 900, 1000 MHz] make sense from
this perspective on our TC2 test chip platform.

So we should check for monotonically decreasing (.cap/.power) values.
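
That stronger condition can be sketched in the same style; assuming the structures from the patch, the helper below (illustrative, not existing kernel code) returns the index of the first OPP from which .cap/.power decreases monotonically to the end of the table — OPPs below that index are never the most energy-efficient choice:

```c
struct capacity_state { unsigned long cap, power; };

/* Find the first OPP index from which the energy efficiency
 * (cap/power) decreases monotonically up to the last entry. The
 * comparison cap[i]/power[i] > cap[i-1]/power[i-1] is done by
 * cross-multiplication to stay in integer arithmetic. */
static unsigned int first_efficient_opp(const struct capacity_state *cs,
					unsigned int n)
{
	unsigned int i, first = 0;

	for (i = 1; i < n; i++)
		if (cs[i].cap * cs[i - 1].power >
		    cs[i - 1].cap * cs[i].power)
			first = i;	/* efficiency still rising here */
	return first;
}
```

For the A7 cluster table this yields index 5, i.e. the 800 MHz OPP — matching the three OPPs named above.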

Leo Yan Aug. 17, 2015, 9:19 a.m. UTC | #3
Hi Morten,

On Tue, Jul 07, 2015 at 07:24:01PM +0100, Morten Rasmussen wrote:
> From: Dietmar Eggemann <dietmar.eggemann@arm.com>
> 
> This patch is only here to be able to test provisioning of energy related
> data from an arch topology shim layer to the scheduler. Since there is no
> code today which deals with extracting energy related data from the dtb or
> acpi, and process it in the topology shim layer, the content of the
> sched_group_energy structures as well as the idle_state and capacity_state
> arrays are hard-coded here.
> 
> This patch defines the sched_group_energy structure as well as the
> idle_state and capacity_state array for the cluster (relates to sched
> groups (sgs) in DIE sched domain level) and for the core (relates to sgs
> in MC sd level) for a Cortex A7 as well as for a Cortex A15.
> It further provides related implementations of the sched_domain_energy_f
> functions (cpu_cluster_energy() and cpu_core_energy()).
> 
> To be able to propagate this information from the topology shim layer to
> the scheduler, the elements of the arm_topology[] table have been
> provisioned with the appropriate sched_domain_energy_f functions.
> 
> cc: Russell King <linux@arm.linux.org.uk>
> 
> Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
> 
> ---
> arch/arm/kernel/topology.c | 118 +++++++++++++++++++++++++++++++++++++++++++--
>  1 file changed, 115 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/arm/kernel/topology.c b/arch/arm/kernel/topology.c
> index b35d3e5..bbe20c7 100644
> --- a/arch/arm/kernel/topology.c
> +++ b/arch/arm/kernel/topology.c
> @@ -274,6 +274,119 @@ void store_cpu_topology(unsigned int cpuid)
>  		cpu_topology[cpuid].socket_id, mpidr);
>  }
>  
> +/*
> + * ARM TC2 specific energy cost model data. There are no unit requirements for
> + * the data. Data can be normalized to any reference point, but the
> + * normalization must be consistent. That is, one bogo-joule/watt must be the
> + * same quantity for all data, but we don't care what it is.
> + */
> +static struct idle_state idle_states_cluster_a7[] = {
> +	 { .power = 25 }, /* WFI */

This state is confusing. Does it correspond to all CPUs having been
powered off while the L2 cache RAM array and the SCU are still powered on?

> +	 { .power = 10 }, /* cluster-sleep-l */

Does this state mean that all CPUs and the cluster have been powered off?
If so, it should have no power consumption anymore...

> +	};
> +
> +static struct idle_state idle_states_cluster_a15[] = {
> +	 { .power = 70 }, /* WFI */
> +	 { .power = 25 }, /* cluster-sleep-b */
> +	};
> +
> +static struct capacity_state cap_states_cluster_a7[] = {
> +	/* Cluster only power */
> +	 { .cap =  150, .power = 2967, }, /*  350 MHz */

For the cluster-level capacity, does this mean the benchmark needs to be
run on all CPUs within the cluster?

> +	 { .cap =  172, .power = 2792, }, /*  400 MHz */
> +	 { .cap =  215, .power = 2810, }, /*  500 MHz */
> +	 { .cap =  258, .power = 2815, }, /*  600 MHz */
> +	 { .cap =  301, .power = 2919, }, /*  700 MHz */
> +	 { .cap =  344, .power = 2847, }, /*  800 MHz */
> +	 { .cap =  387, .power = 3917, }, /*  900 MHz */
> +	 { .cap =  430, .power = 4905, }, /* 1000 MHz */
> +	};
> +
> +static struct capacity_state cap_states_cluster_a15[] = {
> +	/* Cluster only power */
> +	 { .cap =  426, .power =  7920, }, /*  500 MHz */
> +	 { .cap =  512, .power =  8165, }, /*  600 MHz */
> +	 { .cap =  597, .power =  8172, }, /*  700 MHz */
> +	 { .cap =  682, .power =  8195, }, /*  800 MHz */
> +	 { .cap =  768, .power =  8265, }, /*  900 MHz */
> +	 { .cap =  853, .power =  8446, }, /* 1000 MHz */
> +	 { .cap =  938, .power = 11426, }, /* 1100 MHz */
> +	 { .cap = 1024, .power = 15200, }, /* 1200 MHz */
> +	};
> +
> +static struct sched_group_energy energy_cluster_a7 = {
> +	  .nr_idle_states = ARRAY_SIZE(idle_states_cluster_a7),
> +	  .idle_states    = idle_states_cluster_a7,
> +	  .nr_cap_states  = ARRAY_SIZE(cap_states_cluster_a7),
> +	  .cap_states     = cap_states_cluster_a7,
> +};
> +
> +static struct sched_group_energy energy_cluster_a15 = {
> +	  .nr_idle_states = ARRAY_SIZE(idle_states_cluster_a15),
> +	  .idle_states    = idle_states_cluster_a15,
> +	  .nr_cap_states  = ARRAY_SIZE(cap_states_cluster_a15),
> +	  .cap_states     = cap_states_cluster_a15,
> +};
> +
> +static struct idle_state idle_states_core_a7[] = {
> +	 { .power = 0 }, /* WFI */

Shouldn't there be two idle states at the CPU level (WFI and CPU power off)?

> +	};
> +
> +static struct idle_state idle_states_core_a15[] = {
> +	 { .power = 0 }, /* WFI */
> +	};
> +
> +static struct capacity_state cap_states_core_a7[] = {
> +	/* Power per cpu */
> +	 { .cap =  150, .power =  187, }, /*  350 MHz */
> +	 { .cap =  172, .power =  275, }, /*  400 MHz */
> +	 { .cap =  215, .power =  334, }, /*  500 MHz */
> +	 { .cap =  258, .power =  407, }, /*  600 MHz */
> +	 { .cap =  301, .power =  447, }, /*  700 MHz */
> +	 { .cap =  344, .power =  549, }, /*  800 MHz */
> +	 { .cap =  387, .power =  761, }, /*  900 MHz */
> +	 { .cap =  430, .power = 1024, }, /* 1000 MHz */
> +	};
> +
> +static struct capacity_state cap_states_core_a15[] = {
> +	/* Power per cpu */
> +	 { .cap =  426, .power = 2021, }, /*  500 MHz */
> +	 { .cap =  512, .power = 2312, }, /*  600 MHz */
> +	 { .cap =  597, .power = 2756, }, /*  700 MHz */
> +	 { .cap =  682, .power = 3125, }, /*  800 MHz */
> +	 { .cap =  768, .power = 3524, }, /*  900 MHz */
> +	 { .cap =  853, .power = 3846, }, /* 1000 MHz */
> +	 { .cap =  938, .power = 5177, }, /* 1100 MHz */
> +	 { .cap = 1024, .power = 6997, }, /* 1200 MHz */
> +	};
> +
> +static struct sched_group_energy energy_core_a7 = {
> +	  .nr_idle_states = ARRAY_SIZE(idle_states_core_a7),
> +	  .idle_states    = idle_states_core_a7,
> +	  .nr_cap_states  = ARRAY_SIZE(cap_states_core_a7),
> +	  .cap_states     = cap_states_core_a7,
> +};
> +
> +static struct sched_group_energy energy_core_a15 = {
> +	  .nr_idle_states = ARRAY_SIZE(idle_states_core_a15),
> +	  .idle_states    = idle_states_core_a15,
> +	  .nr_cap_states  = ARRAY_SIZE(cap_states_core_a15),
> +	  .cap_states     = cap_states_core_a15,
> +};
> +
> +/* sd energy functions */
> +static inline const struct sched_group_energy *cpu_cluster_energy(int cpu)
> +{
> +	return cpu_topology[cpu].socket_id ? &energy_cluster_a7 :
> +			&energy_cluster_a15;
> +}
> +
> +static inline const struct sched_group_energy *cpu_core_energy(int cpu)
> +{
> +	return cpu_topology[cpu].socket_id ? &energy_core_a7 :
> +			&energy_core_a15;
> +}
> +
>  static inline int cpu_corepower_flags(void)
>  {
>  	return SD_SHARE_PKG_RESOURCES  | SD_SHARE_POWERDOMAIN | \
> @@ -282,10 +395,9 @@ static inline int cpu_corepower_flags(void)
>  
>  static struct sched_domain_topology_level arm_topology[] = {
>  #ifdef CONFIG_SCHED_MC
> -	{ cpu_corepower_mask, cpu_corepower_flags, SD_INIT_NAME(GMC) },
> -	{ cpu_coregroup_mask, cpu_core_flags, SD_INIT_NAME(MC) },
> +	{ cpu_coregroup_mask, cpu_corepower_flags, cpu_core_energy, SD_INIT_NAME(MC) },
>  #endif
> -	{ cpu_cpu_mask, SD_INIT_NAME(DIE) },
> +	{ cpu_cpu_mask, 0, cpu_cluster_energy, SD_INIT_NAME(DIE) },
>  	{ NULL, },
>  };
>  
Dietmar Eggemann Aug. 20, 2015, 7:19 p.m. UTC | #4
Hi Leo,

On 08/17/2015 02:19 AM, Leo Yan wrote:

[...]

>> diff --git a/arch/arm/kernel/topology.c b/arch/arm/kernel/topology.c
>> index b35d3e5..bbe20c7 100644
>> --- a/arch/arm/kernel/topology.c
>> +++ b/arch/arm/kernel/topology.c
>> @@ -274,6 +274,119 @@ void store_cpu_topology(unsigned int cpuid)
>>   		cpu_topology[cpuid].socket_id, mpidr);
>>   }
>>
>> +/*
>> + * ARM TC2 specific energy cost model data. There are no unit requirements for
>> + * the data. Data can be normalized to any reference point, but the
>> + * normalization must be consistent. That is, one bogo-joule/watt must be the
>> + * same quantity for all data, but we don't care what it is.
>> + */
>> +static struct idle_state idle_states_cluster_a7[] = {
>> +	 { .power = 25 }, /* WFI */
>
> This state is confused. Is this state corresponding to all CPUs have been
> powered off but L2 cache RAM array and SCU are still power on?

This is what we refer to as 'active idle'. All cpus of the cluster are 
in WFI but the cluster is not in cluster-sleep yet. We measure the 
corresponding energy value by disabling the 'cluster-sleep-[b,l]' state 
and letting the cpus do nothing for a specific time period.
>
>> +	 { .power = 10 }, /* cluster-sleep-l */
>
> Is this status means all CPU and cluster have been powered off, if so
> then it will have no power consumption anymore...

The cluster is in cluster-sleep but there are still some peripherals 
related to the cluster active, which explains this power value. We 
calculated it from the pre/post energy value diff (by reading the 
vexpress energy counter for this cluster) and the time period we were 
idling on this cluster.
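
In other words, the per-state power value is just the energy counter delta over the idle period; a trivial sketch (the numbers and units are whatever bogo-joules the platform counter reports):

```c
/* power = (post - pre energy counter reading) / idle period.
 * Units are the platform's bogo-joules per second. */
static double idle_state_power(double e_pre, double e_post, double secs)
{
	return (e_post - e_pre) / secs;
}
```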

>
>> +	};
>> +
>> +static struct idle_state idle_states_cluster_a15[] = {
>> +	 { .power = 70 }, /* WFI */
>> +	 { .power = 25 }, /* cluster-sleep-b */
>> +	};
>> +
>> +static struct capacity_state cap_states_cluster_a7[] = {
>> +	/* Cluster only power */
>> +	 { .cap =  150, .power = 2967, }, /*  350 MHz */
>
> For cluster level's capacity, does it mean need run benchmark on all
> CPUs within cluster?

We run an 'always running thread per cpu' workload on {n, n-1, ..., 1} 
cpus of a cluster (hotplugging out the other cpus) for a specific time 
period. Then we calculate the cluster power value by extrapolating from 
the power values of the {n, n-1, ..., 1} test runs and use the delta 
between an n and an n+1 test run as the core power value.
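
Assuming total busy power is roughly affine in the number of running cpus, that extrapolation amounts to a straight-line fit: the slope is the per-core power and the intercept the cluster-only share. A sketch with made-up numbers:

```c
/* Least-squares fit of P(k) ~= cluster + k * core over measured total
 * power p[0..n-1] for k = 1..n busy cpus. The slope is the per-core
 * power, the intercept the cluster-only power. */
static void fit_cluster_core_power(const double *p, int n,
				   double *cluster, double *core)
{
	double sk = 0.0, sp = 0.0, skk = 0.0, skp = 0.0;
	int i;

	for (i = 0; i < n; i++) {
		double k = i + 1;

		sk  += k;
		sp  += p[i];
		skk += k * k;
		skp += k * p[i];
	}
	*core = (n * skp - sk * sp) / (n * skk - sk * sk);
	*cluster = (sp - *core * sk) / n;
}
```

E.g. made-up measured totals {3350, 3900, 4450, 5000} for 1..4 busy cpus decompose into a per-core power of 550 and a cluster-only share of 2800.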

[...]

>> +static struct idle_state idle_states_core_a7[] = {
>> +	 { .power = 0 }, /* WFI */
>
> Should have two idle states for CPU level (WFI and CPU's power off)?

The ARM TC2 platform has only 2 idle states, there is no 'cpu power off':

# cat /sys/devices/system/cpu/cpu[0,2]/cpuidle/state*/name
WFI
cluster-sleep-b
WFI
cluster-sleep-l

[...]


Patch

diff --git a/arch/arm/kernel/topology.c b/arch/arm/kernel/topology.c
index b35d3e5..bbe20c7 100644
--- a/arch/arm/kernel/topology.c
+++ b/arch/arm/kernel/topology.c
@@ -274,6 +274,119 @@  void store_cpu_topology(unsigned int cpuid)
 		cpu_topology[cpuid].socket_id, mpidr);
 }
 
+/*
+ * ARM TC2 specific energy cost model data. There are no unit requirements for
+ * the data. Data can be normalized to any reference point, but the
+ * normalization must be consistent. That is, one bogo-joule/watt must be the
+ * same quantity for all data, but we don't care what it is.
+ */
+static struct idle_state idle_states_cluster_a7[] = {
+	 { .power = 25 }, /* WFI */
+	 { .power = 10 }, /* cluster-sleep-l */
+	};
+
+static struct idle_state idle_states_cluster_a15[] = {
+	 { .power = 70 }, /* WFI */
+	 { .power = 25 }, /* cluster-sleep-b */
+	};
+
+static struct capacity_state cap_states_cluster_a7[] = {
+	/* Cluster only power */
+	 { .cap =  150, .power = 2967, }, /*  350 MHz */
+	 { .cap =  172, .power = 2792, }, /*  400 MHz */
+	 { .cap =  215, .power = 2810, }, /*  500 MHz */
+	 { .cap =  258, .power = 2815, }, /*  600 MHz */
+	 { .cap =  301, .power = 2919, }, /*  700 MHz */
+	 { .cap =  344, .power = 2847, }, /*  800 MHz */
+	 { .cap =  387, .power = 3917, }, /*  900 MHz */
+	 { .cap =  430, .power = 4905, }, /* 1000 MHz */
+	};
+
+static struct capacity_state cap_states_cluster_a15[] = {
+	/* Cluster only power */
+	 { .cap =  426, .power =  7920, }, /*  500 MHz */
+	 { .cap =  512, .power =  8165, }, /*  600 MHz */
+	 { .cap =  597, .power =  8172, }, /*  700 MHz */
+	 { .cap =  682, .power =  8195, }, /*  800 MHz */
+	 { .cap =  768, .power =  8265, }, /*  900 MHz */
+	 { .cap =  853, .power =  8446, }, /* 1000 MHz */
+	 { .cap =  938, .power = 11426, }, /* 1100 MHz */
+	 { .cap = 1024, .power = 15200, }, /* 1200 MHz */
+	};
+
+static struct sched_group_energy energy_cluster_a7 = {
+	  .nr_idle_states = ARRAY_SIZE(idle_states_cluster_a7),
+	  .idle_states    = idle_states_cluster_a7,
+	  .nr_cap_states  = ARRAY_SIZE(cap_states_cluster_a7),
+	  .cap_states     = cap_states_cluster_a7,
+};
+
+static struct sched_group_energy energy_cluster_a15 = {
+	  .nr_idle_states = ARRAY_SIZE(idle_states_cluster_a15),
+	  .idle_states    = idle_states_cluster_a15,
+	  .nr_cap_states  = ARRAY_SIZE(cap_states_cluster_a15),
+	  .cap_states     = cap_states_cluster_a15,
+};
+
+static struct idle_state idle_states_core_a7[] = {
+	 { .power = 0 }, /* WFI */
+	};
+
+static struct idle_state idle_states_core_a15[] = {
+	 { .power = 0 }, /* WFI */
+	};
+
+static struct capacity_state cap_states_core_a7[] = {
+	/* Power per cpu */
+	 { .cap =  150, .power =  187, }, /*  350 MHz */
+	 { .cap =  172, .power =  275, }, /*  400 MHz */
+	 { .cap =  215, .power =  334, }, /*  500 MHz */
+	 { .cap =  258, .power =  407, }, /*  600 MHz */
+	 { .cap =  301, .power =  447, }, /*  700 MHz */
+	 { .cap =  344, .power =  549, }, /*  800 MHz */
+	 { .cap =  387, .power =  761, }, /*  900 MHz */
+	 { .cap =  430, .power = 1024, }, /* 1000 MHz */
+	};
+
+static struct capacity_state cap_states_core_a15[] = {
+	/* Power per cpu */
+	 { .cap =  426, .power = 2021, }, /*  500 MHz */
+	 { .cap =  512, .power = 2312, }, /*  600 MHz */
+	 { .cap =  597, .power = 2756, }, /*  700 MHz */
+	 { .cap =  682, .power = 3125, }, /*  800 MHz */
+	 { .cap =  768, .power = 3524, }, /*  900 MHz */
+	 { .cap =  853, .power = 3846, }, /* 1000 MHz */
+	 { .cap =  938, .power = 5177, }, /* 1100 MHz */
+	 { .cap = 1024, .power = 6997, }, /* 1200 MHz */
+	};
+
+static struct sched_group_energy energy_core_a7 = {
+	  .nr_idle_states = ARRAY_SIZE(idle_states_core_a7),
+	  .idle_states    = idle_states_core_a7,
+	  .nr_cap_states  = ARRAY_SIZE(cap_states_core_a7),
+	  .cap_states     = cap_states_core_a7,
+};
+
+static struct sched_group_energy energy_core_a15 = {
+	  .nr_idle_states = ARRAY_SIZE(idle_states_core_a15),
+	  .idle_states    = idle_states_core_a15,
+	  .nr_cap_states  = ARRAY_SIZE(cap_states_core_a15),
+	  .cap_states     = cap_states_core_a15,
+};
+
+/* sd energy functions */
+static inline const struct sched_group_energy *cpu_cluster_energy(int cpu)
+{
+	return cpu_topology[cpu].socket_id ? &energy_cluster_a7 :
+			&energy_cluster_a15;
+}
+
+static inline const struct sched_group_energy *cpu_core_energy(int cpu)
+{
+	return cpu_topology[cpu].socket_id ? &energy_core_a7 :
+			&energy_core_a15;
+}
+
 static inline int cpu_corepower_flags(void)
 {
 	return SD_SHARE_PKG_RESOURCES  | SD_SHARE_POWERDOMAIN | \
@@ -282,10 +395,9 @@  static inline int cpu_corepower_flags(void)
 
 static struct sched_domain_topology_level arm_topology[] = {
 #ifdef CONFIG_SCHED_MC
-	{ cpu_corepower_mask, cpu_corepower_flags, SD_INIT_NAME(GMC) },
-	{ cpu_coregroup_mask, cpu_core_flags, SD_INIT_NAME(MC) },
+	{ cpu_coregroup_mask, cpu_corepower_flags, cpu_core_energy, SD_INIT_NAME(MC) },
 #endif
-	{ cpu_cpu_mask, SD_INIT_NAME(DIE) },
+	{ cpu_cpu_mask, 0, cpu_cluster_energy, SD_INIT_NAME(DIE) },
 	{ NULL, },
 };