
[v2,6/6] arm64: use activity monitors for frequency invariance

Message ID 20191218182607.21607-7-ionela.voinescu@arm.com (mailing list archive)
State New, archived
Series arm64: ARMv8.4 Activity Monitors support

Commit Message

Ionela Voinescu Dec. 18, 2019, 6:26 p.m. UTC
The Frequency Invariance Engine (FIE) is providing a frequency
scaling correction factor that helps achieve more accurate
load-tracking.

So far, for arm and arm64 platforms, this scale factor has been
obtained based on the ratio between the current frequency and the
maximum supported frequency recorded by the cpufreq policy. The
setting of this scale factor is triggered from cpufreq drivers by
calling arch_set_freq_scale. The current frequency used in computation
is the frequency requested by a governor, but it may not be the
frequency that was implemented by the platform.

This correction factor can also be obtained using a core counter and a
constant counter to get information on the performance (frequency based
only) obtained in a period of time. This will more accurately reflect
the actual current frequency of the CPU, compared with the alternative
implementation that reflects the request of a performance level from
the OS.
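
Concretely, the per-tick computation amounts to the following sketch
(the delta names are illustrative; max_freq is in Hz):

  delta_core  = core_cnt  - prev_core_cnt;   /* core cycles this tick  */
  delta_const = const_cnt - prev_const_cnt;  /* const cycles this tick */
  freq_scale  = (delta_core * arch_max_freq_scale / delta_const)
                  >> SCHED_CAPACITY_SHIFT;

where arch_max_freq_scale is the pre-computed constant
(arch_timer_rate << 2 * SCHED_CAPACITY_SHIFT) / max_freq, so freq_scale
approximates (cur_freq / max_freq) << SCHED_CAPACITY_SHIFT.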

Therefore, implement arch_scale_freq_tick to use activity monitors, if
present, for the computation of the frequency scale factor.

The use of AMU counters depends on:
 - CONFIG_ARM64_AMU_EXTN - depends on the AMU extension being present
 - CONFIG_CPU_FREQ - the current frequency obtained using counter
   information is divided by the maximum frequency obtained from the
   cpufreq policy.

While it is possible to have a combination of CPUs in the system with
and without support for activity monitors, the use of counters for
frequency invariance is only enabled for a CPU if all related CPUs
(CPUs in the same frequency domain) support and have enabled the core
and constant activity monitor counters. In this way, there is a clear
separation between the policies for which arch_set_freq_scale
(cpufreq based FIE) is used, and the policies for which
arch_scale_freq_tick (counter based FIE) is used to set the frequency
scale factor. For this purpose, a cpufreq notifier is registered to
trigger validation work for CPUs and policies at policy creation that
will enable or disable the use of AMU counters for frequency invariance.

Signed-off-by: Ionela Voinescu <ionela.voinescu@arm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: Sudeep Holla <sudeep.holla@arm.com>
---
 arch/arm64/include/asm/topology.h |   9 ++
 arch/arm64/kernel/topology.c      | 233 ++++++++++++++++++++++++++++++
 drivers/base/arch_topology.c      |  16 ++
 3 files changed, 258 insertions(+)

Comments

Lukasz Luba Jan. 23, 2020, 11:49 a.m. UTC | #1
Hi Ionela,

Please find my few comments below.

On 12/18/19 6:26 PM, Ionela Voinescu wrote:
> The Frequency Invariance Engine (FIE) is providing a frequency
> scaling correction factor that helps achieve more accurate
> load-tracking.
> 
> So far, for arm and arm64 platforms, this scale factor has been
> obtained based on the ratio between the current frequency and the
> maximum supported frequency recorded by the cpufreq policy. The
> setting of this scale factor is triggered from cpufreq drivers by
> calling arch_set_freq_scale. The current frequency used in computation
> is the frequency requested by a governor, but it may not be the
> frequency that was implemented by the platform.
> 
> This correction factor can also be obtained using a core counter and a
> constant counter to get information on the performance (frequency based
> only) obtained in a period of time. This will more accurately reflect
> the actual current frequency of the CPU, compared with the alternative
> implementation that reflects the request of a performance level from
> the OS.
> 
> Therefore, implement arch_scale_freq_tick to use activity monitors, if
> present, for the computation of the frequency scale factor.
> 
> The use of AMU counters depends on:
>   - CONFIG_ARM64_AMU_EXTN - depends on the AMU extension being present
>   - CONFIG_CPU_FREQ - the current frequency obtained using counter
>     information is divided by the maximum frequency obtained from the
>     cpufreq policy.
> 
> While it is possible to have a combination of CPUs in the system with
> and without support for activity monitors, the use of counters for
> frequency invariance is only enabled for a CPU if all related CPUs
> (CPUs in the same frequency domain) support and have enabled the core

This looks like an edge case scenario, for which we are designing the
whole machinery with workqueues. AFAIU we cannot run the code in
arch_set_freq_scale() and you want to check all CPUs upfront.

Maybe you can just wait till all CPUs boot and then set the proper
flags and finish initialization. Something like:
per_cpu(s8, amu_feat) /* from the patch 1/6 */
OR
per_cpu(u8, amu_scale_freq) /* from this patch */
with maybe some values:
0 - not checked yet
1 - checked and present
-1 - checked and not available
-2 - checked but in conflict with others in the freq domain
-3..-k - other odd configurations

could potentially eliminate the need of workqueues.

Then, if we could trigger this from e.g. late_initcall, the CPUs
should be online and you can validate them.
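
As a sketch, such an encoding could be a single per-CPU state variable
(all names below are illustrative only):

	enum amu_fie_state {
		AMU_FIE_UNCHECKED	=  0,	/* not checked yet */
		AMU_FIE_PRESENT		=  1,	/* checked and present */
		AMU_FIE_UNAVAILABLE	= -1,	/* checked, not available */
		AMU_FIE_CONFLICT	= -2,	/* conflict in freq domain */
	};
	static DEFINE_PER_CPU(s8, amu_fie_state);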

> and constant activity monitor counters. In this way, there is a clear
> separation between the policies for which arch_set_freq_scale
> (cpufreq based FIE) is used, and the policies for which
> arch_scale_freq_tick (counter based FIE) is used to set the frequency
> scale factor. For this purpose, a cpufreq notifier is registered to
> trigger validation work for CPUs and policies at policy creation that
> will enable or disable the use of AMU counters for frequency invariance.
> 
> Signed-off-by: Ionela Voinescu <ionela.voinescu@arm.com>
> Cc: Catalin Marinas <catalin.marinas@arm.com>
> Cc: Will Deacon <will@kernel.org>
> Cc: Sudeep Holla <sudeep.holla@arm.com>
> ---
>   arch/arm64/include/asm/topology.h |   9 ++
>   arch/arm64/kernel/topology.c      | 233 ++++++++++++++++++++++++++++++
>   drivers/base/arch_topology.c      |  16 ++
>   3 files changed, 258 insertions(+)
> 
> diff --git a/arch/arm64/include/asm/topology.h b/arch/arm64/include/asm/topology.h
> index a4d945db95a2..98412dd27565 100644
> --- a/arch/arm64/include/asm/topology.h
> +++ b/arch/arm64/include/asm/topology.h
> @@ -19,6 +19,15 @@ int pcibus_to_node(struct pci_bus *bus);
>   /* Replace task scheduler's default frequency-invariant accounting */
>   #define arch_scale_freq_capacity topology_get_freq_scale
>   
> +#if defined(CONFIG_ARM64_AMU_EXTN) && defined(CONFIG_CPU_FREQ)
> +void topology_scale_freq_tick(void);
> +/*
> + * Replace task scheduler's default counter-based frequency-invariance
> + * scale factor setting.
> + */
> +#define arch_scale_freq_tick topology_scale_freq_tick
> +#endif
> +
>   /* Replace task scheduler's default cpu-invariant accounting */
>   #define arch_scale_cpu_capacity topology_get_cpu_scale
>   
> diff --git a/arch/arm64/kernel/topology.c b/arch/arm64/kernel/topology.c
> index fa9528dfd0ce..61f8264afec9 100644
> --- a/arch/arm64/kernel/topology.c
> +++ b/arch/arm64/kernel/topology.c
> @@ -14,6 +14,7 @@
>   #include <linux/acpi.h>
>   #include <linux/arch_topology.h>
>   #include <linux/cacheinfo.h>
> +#include <linux/cpufreq.h>
>   #include <linux/init.h>
>   #include <linux/percpu.h>
>   
> @@ -120,4 +121,236 @@ int __init parse_acpi_topology(void)
>   }
>   #endif
>   
> +#if defined(CONFIG_ARM64_AMU_EXTN) && defined(CONFIG_CPU_FREQ)
>   
> +#undef pr_fmt
> +#define pr_fmt(fmt) "AMU: " fmt
> +
> +static void init_fie_counters_done_workfn(struct work_struct *work);
> +static DECLARE_WORK(init_fie_counters_done_work,
> +		    init_fie_counters_done_workfn);
> +
> +static struct workqueue_struct *policy_amu_fie_init_wq;
> +static struct workqueue_struct *cpu_amu_fie_init_wq;
> +
> +struct cpu_amu_work {
> +	struct work_struct cpu_work;
> +	struct work_struct policy_work;
> +	unsigned int cpuinfo_max_freq;
> +	struct cpumask policy_cpus;
> +	bool cpu_amu_fie;
> +};
> +static struct cpu_amu_work __percpu *works;
> +static cpumask_var_t cpus_to_visit;
> +
> +static DEFINE_PER_CPU_READ_MOSTLY(unsigned long, arch_max_freq_scale);
> +static DEFINE_PER_CPU(u64, arch_const_cycles_prev);
> +static DEFINE_PER_CPU(u64, arch_core_cycles_prev);
> +DECLARE_PER_CPU(u8, amu_scale_freq);
> +
> +static void cpu_amu_fie_init_workfn(struct work_struct *work)
> +{
> +	u64 core_cnt, const_cnt, ratio;
> +	struct cpu_amu_work *amu_work;
> +	int cpu = smp_processor_id();
> +
> +	if (!cpu_has_amu_feat()) {
> +		pr_debug("CPU%d: counters are not supported.\n", cpu);
> +		return;
> +	}
> +
> +	core_cnt = read_sysreg_s(SYS_AMEVCNTR0_CORE_EL0);
> +	const_cnt = read_sysreg_s(SYS_AMEVCNTR0_CONST_EL0);
> +
> +	if (unlikely(!core_cnt || !const_cnt)) {
> +		pr_err("CPU%d: cycle counters are not enabled.\n", cpu);
> +		return;
> +	}
> +
> +	amu_work = container_of(work, struct cpu_amu_work, cpu_work);
> +	if (unlikely(!(amu_work->cpuinfo_max_freq))) {
> +		pr_err("CPU%d: invalid maximum frequency.\n", cpu);
> +		return;
> +	}
> +
> +	/*
> +	 * Pre-compute the fixed ratio between the frequency of the
> +	 * constant counter and the maximum frequency of the CPU (hz).
> +	 */
> +	ratio = (u64)arch_timer_get_rate() << (2 * SCHED_CAPACITY_SHIFT);
> +	ratio = div64_u64(ratio, amu_work->cpuinfo_max_freq * 1000);
> +	this_cpu_write(arch_max_freq_scale, (unsigned long)ratio);
> +
> +	this_cpu_write(arch_core_cycles_prev, core_cnt);
> +	this_cpu_write(arch_const_cycles_prev, const_cnt);
> +	amu_work->cpu_amu_fie = true;
> +}
> +
> +static void policy_amu_fie_init_workfn(struct work_struct *work)
> +{
> +	struct cpu_amu_work *amu_work;
> +	u8 enable;
> +	int cpu;
> +
> +	amu_work = container_of(work, struct cpu_amu_work, policy_work);
> +
> +	flush_workqueue(cpu_amu_fie_init_wq);
> +
> +	for_each_cpu(cpu, &amu_work->policy_cpus)
> +		if (!(per_cpu_ptr(works, cpu)->cpu_amu_fie))
> +			break;
> +
> +	enable = (cpu >= nr_cpu_ids) ? 1 : 0;
> +
> +	for_each_cpu(cpu, &amu_work->policy_cpus)
> +		per_cpu(amu_scale_freq, cpu) = enable;
> +
> +	pr_info("CPUs[%*pbl]: counters %s be used for FIE.",
> +		cpumask_pr_args(&amu_work->policy_cpus),
> +		enable ? "will" : "WON'T");
> +}
> +
> +static int init_fie_counters_callback(struct notifier_block *nb,
> +				      unsigned long val,
> +				      void *data)
> +{
> +	struct cpufreq_policy *policy = data;
> +	struct cpu_amu_work *work;
> +	int cpu;
> +
> +	if (val != CPUFREQ_CREATE_POLICY)
> +		return 0;
> +
> +	/* Return if not all related CPUs are online */
> +	if (!cpumask_equal(policy->cpus, policy->related_cpus)) {
> +		pr_info("CPUs[%*pbl]: counters WON'T be used for FIE.",
> +			cpumask_pr_args(policy->related_cpus));
> +		return 0;
> +	}
> +
> +	/*
> +	 * Queue functions on all online CPUs from policy to:
> +	 *  - Check support and enablement for AMU counters
> +	 *  - Store system freq to max freq ratio per cpu
> +	 *  - Flag CPU as valid for use of counters for FIE
> +	 */
> +	for_each_cpu(cpu, policy->cpus) {
> +		work = per_cpu_ptr(works, cpu);
> +		work->cpuinfo_max_freq = policy->cpuinfo.max_freq;
> +		work->cpu_amu_fie = false;
> +		INIT_WORK(&work->cpu_work, cpu_amu_fie_init_workfn);
> +		queue_work_on(cpu, cpu_amu_fie_init_wq, &work->cpu_work);
> +	}
> +
> +	/*
> +	 * Queue function to validate support at policy level:
> +	 *  - Flush all work on online policy CPUs
> +	 *  - Verify that all online policy CPUs are flagged as
> +	 *    valid for use of counters for FIE
> +	 *  - Enable or disable use of counters for FIE on CPUs
> +	 */
> +	work = per_cpu_ptr(works, cpumask_first(policy->cpus));
> +	cpumask_copy(&work->policy_cpus, policy->cpus);
> +	INIT_WORK(&work->policy_work, policy_amu_fie_init_workfn);
> +	queue_work(policy_amu_fie_init_wq, &work->policy_work);
> +
> +	cpumask_andnot(cpus_to_visit, cpus_to_visit, policy->cpus);
> +	if (cpumask_empty(cpus_to_visit))
> +		schedule_work(&init_fie_counters_done_work);
> +
> +	return 0;
> +}
> +
> +static struct notifier_block init_fie_counters_notifier = {
> +	.notifier_call = init_fie_counters_callback,
> +};
> +
> +static void init_fie_counters_done_workfn(struct work_struct *work)
> +{
> +	cpufreq_unregister_notifier(&init_fie_counters_notifier,
> +				    CPUFREQ_POLICY_NOTIFIER);
> +
> +	/*
> +	 * Destroy policy_amu_fie_init_wq first to ensure all policy
> +	 * work is finished, which includes flushing of the per-CPU
> +	 * work, before cpu_amu_fie_init_wq is destroyed.
> +	 */
> +	destroy_workqueue(policy_amu_fie_init_wq);
> +	destroy_workqueue(cpu_amu_fie_init_wq);
> +
> +	free_percpu(works);
> +	free_cpumask_var(cpus_to_visit);
> +}
> +
> +static int __init register_fie_counters_cpufreq_notifier(void)
> +{
> +	int ret = -ENOMEM;
> +
> +	if (!alloc_cpumask_var(&cpus_to_visit, GFP_KERNEL))
> +		goto out;
> +
> +	cpumask_copy(cpus_to_visit, cpu_possible_mask);
> +
> +	cpu_amu_fie_init_wq = create_workqueue("cpu_amu_fie_init_wq");
> +	if (!cpu_amu_fie_init_wq)
> +		goto free_cpumask;
> +
> +	policy_amu_fie_init_wq = create_workqueue("policy_amu_fie_init_wq");
> +	if (!policy_amu_fie_init_wq)
> +		goto free_cpu_wq;
> +
> +	works = alloc_percpu(struct cpu_amu_work);
> +	if (!works)
> +		goto free_policy_wq;
> +
> +	ret = cpufreq_register_notifier(&init_fie_counters_notifier,
> +					CPUFREQ_POLICY_NOTIFIER);
> +	if (ret)
> +		goto free_works;
> +
> +	return 0;
> +
> +free_works:
> +	free_percpu(works);
> +free_policy_wq:
> +	destroy_workqueue(policy_amu_fie_init_wq);
> +free_cpu_wq:
> +	destroy_workqueue(cpu_amu_fie_init_wq);
> +free_cpumask:
> +	free_cpumask_var(cpus_to_visit);
> +out:
> +	return ret;
> +}
> +core_initcall(register_fie_counters_cpufreq_notifier);

If we move it to a bit later stage, maybe it could solve the
issue with not-all-CPUs-online? Is it needed at this stage?
Would device_initcall or late_initcall not be an option for it?


> +
> +void topology_scale_freq_tick(void)
> +{
> +	u64 prev_core_cnt, prev_const_cnt;
> +	u64 core_cnt, const_cnt, scale;
> +
> +	if (!this_cpu_read(amu_scale_freq))
> +		return;
> +
> +	const_cnt = read_sysreg_s(SYS_AMEVCNTR0_CONST_EL0);
> +	core_cnt = read_sysreg_s(SYS_AMEVCNTR0_CORE_EL0);
> +	prev_const_cnt = this_cpu_read(arch_const_cycles_prev);
> +	prev_core_cnt = this_cpu_read(arch_core_cycles_prev);
> +
> +	if (unlikely(core_cnt <= prev_core_cnt ||
> +		     const_cnt <= prev_const_cnt))
> +		goto store_and_exit;
> +
> +	scale = core_cnt - prev_core_cnt;
> +	scale *= this_cpu_read(arch_max_freq_scale);
> +	scale = div64_u64(scale >> SCHED_CAPACITY_SHIFT,
> +			  const_cnt - prev_const_cnt);
> +
> +	scale = min_t(unsigned long, scale, SCHED_CAPACITY_SCALE);
> +	this_cpu_write(freq_scale, (unsigned long)scale);
> +
> +store_and_exit:
> +	this_cpu_write(arch_core_cycles_prev, core_cnt);
> +	this_cpu_write(arch_const_cycles_prev, const_cnt);
> +}
> +
> +#endif
> diff --git a/drivers/base/arch_topology.c b/drivers/base/arch_topology.c
> index 1eb81f113786..3ae6091d845e 100644
> --- a/drivers/base/arch_topology.c
> +++ b/drivers/base/arch_topology.c
> @@ -23,12 +23,28 @@
>   
>   DEFINE_PER_CPU(unsigned long, freq_scale) = SCHED_CAPACITY_SCALE;
>   
> +#if defined(CONFIG_ARM64_AMU_EXTN) && defined(CONFIG_CPU_FREQ)
> +DEFINE_PER_CPU_READ_MOSTLY(u8, amu_scale_freq);
> +#endif
> +
>   void arch_set_freq_scale(struct cpumask *cpus, unsigned long cur_freq,
>   			 unsigned long max_freq)
>   {
>   	unsigned long scale;
>   	int i;
>   
> +#if defined(CONFIG_ARM64_AMU_EXTN) && defined(CONFIG_CPU_FREQ)

This kind of #ifdef is probably not the best option inside drivers/base/.
The function is called from cpufreq drivers; could we react earlier
and keep this function untouched?


> +	/*
> +	 * This function will only be called from CPUFREQ drivers.
> +	 * If the use of counters for FIE is enabled, establish if a CPU,
> +	 * the first one, supports counters and if they are valid. If they
> +	 * are, just return as we don't want to update with information
> +	 * from CPUFREQ. In this case the scale factor will be updated
> +	 * from arch_scale_freq_tick.
> +	 */
> +	if (per_cpu(amu_scale_freq, cpumask_first(cpus)))
> +		return;
> +#endif
>   	scale = (cur_freq << SCHED_CAPACITY_SHIFT) / max_freq;
>   
>   	for_each_cpu(i, cpus)
> 


Regards,
Lukasz
Ionela Voinescu Jan. 23, 2020, 5:07 p.m. UTC | #2
Hi Lukasz,

Thank you for taking a look over the patches.

On Thursday 23 Jan 2020 at 11:49:29 (+0000), Lukasz Luba wrote:
> Hi Ionela,
> 
> Please find my few comments below.
> 
> On 12/18/19 6:26 PM, Ionela Voinescu wrote:
> > The Frequency Invariance Engine (FIE) is providing a frequency
> > scaling correction factor that helps achieve more accurate
> > load-tracking.
> > 
> > So far, for arm and arm64 platforms, this scale factor has been
> > obtained based on the ratio between the current frequency and the
> > maximum supported frequency recorded by the cpufreq policy. The
> > setting of this scale factor is triggered from cpufreq drivers by
> > calling arch_set_freq_scale. The current frequency used in computation
> > is the frequency requested by a governor, but it may not be the
> > frequency that was implemented by the platform.
> > 
> > This correction factor can also be obtained using a core counter and a
> > constant counter to get information on the performance (frequency based
> > only) obtained in a period of time. This will more accurately reflect
> > the actual current frequency of the CPU, compared with the alternative
> > implementation that reflects the request of a performance level from
> > the OS.
> > 
> > Therefore, implement arch_scale_freq_tick to use activity monitors, if
> > present, for the computation of the frequency scale factor.
> > 
> > The use of AMU counters depends on:
> >   - CONFIG_ARM64_AMU_EXTN - depends on the AMU extension being present
> >   - CONFIG_CPU_FREQ - the current frequency obtained using counter
> >     information is divided by the maximum frequency obtained from the
> >     cpufreq policy.
> > 
> > While it is possible to have a combination of CPUs in the system with
> > and without support for activity monitors, the use of counters for
> > frequency invariance is only enabled for a CPU if all related CPUs
> > (CPUs in the same frequency domain) support and have enabled the core
> 
> This looks like an edge case scenario, for which we are designing the
> whole machinery with workqueues. AFAIU we cannot run the code in
> arch_set_freq_scale() and you want to check all CPUs upfront.
> 

Unfortunately, I don't believe it to be an edge case. Given that this
is an optional feature, I do believe that people might skip on
implementing it on some CPUs(LITTLEs) while keeping it for CPUs(bigs)
where power and thermal mitigation is more probable to happen in firmware.
This is the main reason to be conservative in the validation of CPUs and
cpufreq policies.

In regards to arch_set_freq_scale, I want to be able to tell, when that
function is called, if I should return a scale factor based on cpufreq
for the current policy. If activity monitors are usable for the CPUs in
the full policy, then I'm bailing out and letting the AMU FIE machinery
set the scale factor. Unfortunately this works at policy granularity.

This could be done in a nicer way by setting the scale factor per cpu
and not for all CPUs in a policy in this arch_set_freq_scale function.
But this would require some rewriting for the full frequency invariance
support in drivers which we've talked about for a while but it was not
the purpose of this patch set. But it would eliminate the policy
verification I do with the second workqueue.

> Maybe you can just wait till all CPUs boot and then set the proper
> flags and finish initialization. Something like:
> per_cpu(s8, amu_feat) /* from the patch 1/6 */
> OR
> per_cpu(u8, amu_scale_freq) /* from this patch */
> with maybe some values:
> 0 - not checked yet
> 1 - checked and present
> -1 - checked and not available
> -2 - checked but in conflict with others in the freq domain
> -3..-k - other odd configurations
> 
> could potentially eliminate the need of workqueues.
> 
> Then, if we could trigger this from e.g. late_initcall, the CPUs
> should be online and you can validate them.
> 

I did initially give such a state machine a try but it proved to be
quite messy. A big reason for this is that the activity monitors unit
has multiple counters that can be used for different purposes.

The amu_feat per_cpu variable only flags that you have the AMU present
for potential users (in this case FIE) to validate the counters they
need for their respective usecase. For this reason I don't want to
overload the meaning of amu_feat. For the same reason I'm not doing the
validation of the counters in a generic way, but I'm tying it to the
usecase for particular counters. For example, it would not matter if
the instructions retired counter is not enabled from firmware for the
usecase of FIE. For frequency invariance we only need the core and
constant cycle counters and I'm making it the job of the user (arm64
topology code) to do the checking.

Secondly, for amu_scale_freq I could have added such a state machine,
but I did not think it was useful. The only thing it would change is
that I would not have to use the cpu_amu_fie variable in the data
structure that gets passed to the work functions. The only way I would
eliminate the second workqueue was if I did not do a check of all CPUs
in a policy, as described above, and rewrite frequency invariance to
work at CPU granularity and not policy granularity. This would eliminate
the dependency on cpufreq policy all-together, so it would be worth
doing if only for this reason alone :).

But even in that case, it's probably not needed to have more than two
states for amu_scale_freq.

What do you think?

Thank you,
Ionela.
Lukasz Luba Jan. 24, 2020, 1:19 a.m. UTC | #3
On 1/23/20 5:07 PM, Ionela Voinescu wrote:
> Hi Lukasz,
> 
> Thank you for taking a look over the patches.
> 
> On Thursday 23 Jan 2020 at 11:49:29 (+0000), Lukasz Luba wrote:
>> Hi Ionela,
>>
>> Please find my few comments below.
>>
>> On 12/18/19 6:26 PM, Ionela Voinescu wrote:
>>> The Frequency Invariance Engine (FIE) is providing a frequency
>>> scaling correction factor that helps achieve more accurate
>>> load-tracking.
>>>
>>> So far, for arm and arm64 platforms, this scale factor has been
>>> obtained based on the ratio between the current frequency and the
>>> maximum supported frequency recorded by the cpufreq policy. The
>>> setting of this scale factor is triggered from cpufreq drivers by
>>> calling arch_set_freq_scale. The current frequency used in computation
>>> is the frequency requested by a governor, but it may not be the
>>> frequency that was implemented by the platform.
>>>
>>> This correction factor can also be obtained using a core counter and a
>>> constant counter to get information on the performance (frequency based
>>> only) obtained in a period of time. This will more accurately reflect
>>> the actual current frequency of the CPU, compared with the alternative
>>> implementation that reflects the request of a performance level from
>>> the OS.
>>>
>>> Therefore, implement arch_scale_freq_tick to use activity monitors, if
>>> present, for the computation of the frequency scale factor.
>>>
>>> The use of AMU counters depends on:
>>>    - CONFIG_ARM64_AMU_EXTN - depends on the AMU extension being present
>>>    - CONFIG_CPU_FREQ - the current frequency obtained using counter
>>>      information is divided by the maximum frequency obtained from the
>>>      cpufreq policy.
>>>
>>> While it is possible to have a combination of CPUs in the system with
>>> and without support for activity monitors, the use of counters for
>>> frequency invariance is only enabled for a CPU if all related CPUs
>>> (CPUs in the same frequency domain) support and have enabled the core
>>
>> This looks like an edge case scenario, for which we are designing the
>> whole machinery with workqueues. AFAIU we cannot run the code in
>> arch_set_freq_scale() and you want to check all CPUs upfront.
>>
> 
> Unfortunately, I don't believe it to be an edge case. Given that this
> is an optional feature, I do believe that people might skip on
> implementing it on some CPUs(LITTLEs) while keeping it for CPUs(bigs)
> where power and thermal mitigation is more probable to happen in firmware.
> This is the main reason to be conservative in the validation of CPUs and
> cpufreq policies.
> 
> In regards to arch_set_freq_scale, I want to be able to tell, when that
> function is called, if I should return a scale factor based on cpufreq
> for the current policy. If activity monitors are usable for the CPUs in
> the full policy, then I'm bailing out and letting the AMU FIE machinery
> set the scale factor. Unfortunately this works at policy granularity.
> 
> This could be done in a nicer way by setting the scale factor per cpu
> and not for all CPUs in a policy in this arch_set_freq_scale function.
> But this would require some rewriting for the full frequency invariance
> support in drivers which we've talked about for a while but it was not
> the purpose of this patch set. But it would eliminate the policy
> verification I do with the second workqueue.
> 
>> Maybe you can just wait till all CPUs boot and then set the proper
>> flags and finish initialization. Something like:
>> per_cpu(s8, amu_feat) /* from the patch 1/6 */
>> OR
>> per_cpu(u8, amu_scale_freq) /* from this patch */
>> with maybe some values:
>> 0 - not checked yet
>> 1 - checked and present
>> -1 - checked and not available
>> -2 - checked but in conflict with others in the freq domain
>> -3..-k - other odd configurations
>>
>> could potentially eliminate the need of workqueues.
>>
>> Then, if we could trigger this from e.g. late_initcall, the CPUs
>> should be online and you can validate them.
>>
> 
> I did initially give such a state machine a try but it proved to be
> quite messy. A big reason for this is that the activity monitors unit
> has multiple counters that can be used for different purposes.
> 
> The amu_feat per_cpu variable only flags that you have the AMU present
> for potential users (in this case FIE) to validate the counters they
> need for their respective usecase. For this reason I don't want to
> overload the meaning of amu_feat. For the same reason I'm not doing the
> validation of the counters in a generic way, but I'm tying it to the
> usecase for particular counters. For example, it would not matter if
> the instructions retired counter is not enabled from firmware for the
> usecase of FIE. For frequency invariance we only need the core and
> constant cycle counters and I'm making it the job of the user (arm64
> topology code) to do the checking.
> 
> Secondly, for amu_scale_freq I could have added such a state machine,
> but I did not think it was useful. The only thing it would change is
> that I would not have to use the cpu_amu_fie variable in the data
> structure that gets passed to the work functions. The only way I would
> eliminate the second workqueue was if I did not do a check of all CPUs
> in a policy, as described above, and rewrite frequency invariance to
> work at CPU granularity and not policy granularity. This would eliminate
> the dependency on cpufreq policy all-together, so it would be worth
> doing if only for this reason alone :).
> 
> But even in that case, it's probably not needed to have more than two
> states for amu_scale_freq.
> 
> What do you think?

I think currently we are the only users of this AMU and if there will
be another in the future, then we can start thinking about the proposed
changes. Let's cross that bridge when we come to it.

Regarding the code, in the arch/arm64/cpufeature.c you can already
read the cycle registers. All the CPUs are going through that code
during start. If you use this fact in the late_initcall() all CPUs
should be checked and you can just ask for cpufreq policy, calculate the 
max_freq ratio, set the per cpu config value to 'ready' state.

Something like in the code below, it is on top of your patch set.

------------------------>8-------------------------------------


diff --git a/arch/arm64/kernel/cpufeature.c b/arch/arm64/kernel/cpufeature.c
index c639b3e052d7..837ea46d8867 100644
--- a/arch/arm64/kernel/cpufeature.c
+++ b/arch/arm64/kernel/cpufeature.c
@@ -1168,19 +1168,26 @@ static bool has_hw_dbm(const struct 
arm64_cpu_capabilities *cap,
   * from the current cpu.
   *  - cpu_has_amu_feat()
   */
-static DEFINE_PER_CPU_READ_MOSTLY(u8, amu_feat);
-
-inline bool cpu_has_amu_feat(void)
-{
-	return !!this_cpu_read(amu_feat);
-}
+DECLARE_PER_CPU(u64, arch_const_cycles_prev);
+DECLARE_PER_CPU(u64, arch_core_cycles_prev);
+DECLARE_PER_CPU(u8, amu_scale_freq);

  static void cpu_amu_enable(struct arm64_cpu_capabilities const *cap)
  {
+	u64 core_cnt, const_cnt;
+
  	if (has_cpuid_feature(cap, SCOPE_LOCAL_CPU)) {
  		pr_info("detected CPU%d: Activity Monitors Unit (AMU)\n",
  			smp_processor_id());
-		this_cpu_write(amu_feat, 1);
+		core_cnt = read_sysreg_s(SYS_AMEVCNTR0_CORE_EL0);
+		const_cnt = read_sysreg_s(SYS_AMEVCNTR0_CONST_EL0);
+
+		this_cpu_write(arch_core_cycles_prev, core_cnt);
+		this_cpu_write(arch_const_cycles_prev, const_cnt);
+
+		this_cpu_write(amu_scale_freq, 1);
+	} else {
+		this_cpu_write(amu_scale_freq, 2);
  	}
  }

diff --git a/arch/arm64/kernel/topology.c b/arch/arm64/kernel/topology.c
index 61f8264afec9..95b34085ae64 100644
--- a/arch/arm64/kernel/topology.c
+++ b/arch/arm64/kernel/topology.c
@@ -144,8 +144,8 @@ static struct cpu_amu_work __percpu *works;
  static cpumask_var_t cpus_to_visit;

  static DEFINE_PER_CPU_READ_MOSTLY(unsigned long, arch_max_freq_scale);
-static DEFINE_PER_CPU(u64, arch_const_cycles_prev);
-static DEFINE_PER_CPU(u64, arch_core_cycles_prev);
+DEFINE_PER_CPU(u64, arch_const_cycles_prev);
+DEFINE_PER_CPU(u64, arch_core_cycles_prev);
  DECLARE_PER_CPU(u8, amu_scale_freq);

  static void cpu_amu_fie_init_workfn(struct work_struct *work)
@@ -323,12 +323,64 @@ static int __init 
register_fie_counters_cpufreq_notifier(void)
  }
  core_initcall(register_fie_counters_cpufreq_notifier);

+static int __init init_amu_feature(void)
+{
+	struct cpufreq_policy *policy;
+	struct cpumask *checked_cpus;
+	int count, total;
+	int cpu, i;
+	s8 amu_config;
+	u64 ratio;
+
+	checked_cpus = kzalloc(cpumask_size(), GFP_KERNEL);
+	if (!checked_cpus)
+		return -ENOMEM;
+
+	for_each_possible_cpu(cpu) {
+		if (cpumask_test_cpu(cpu, checked_cpus))
+			continue;
+
+		policy = cpufreq_cpu_get(cpu);
+		if (!policy) {
+			pr_warn("No cpufreq policy found for CPU%d\n", cpu);
+			continue;
+		}
+
+		count = total = 0;
+
+		for_each_cpu(i, policy->related_cpus) {
+			amu_config = per_cpu(amu_scale_freq, i);
+			if (amu_config == 1)
+				count++;
+			total++;
+		}
+
+		amu_config = (total == count) ? 3 : 4;
+
+		ratio = (u64)arch_timer_get_rate() << (2 * SCHED_CAPACITY_SHIFT);
+		ratio = div64_u64(ratio, policy->cpuinfo.max_freq * 1000);
+
+		for_each_cpu(i, policy->related_cpus) {
+			per_cpu(arch_max_freq_scale, i) = (unsigned long)ratio;
+			per_cpu(amu_scale_freq, i) = amu_config;
+			cpumask_set_cpu(i, checked_cpus);
+		}
+
+		cpufreq_cpu_put(policy);
+	}
+
+	kfree(checked_cpus);
+
+	return 0;
+}
+late_initcall(init_amu_feature);
+
  void topology_scale_freq_tick(void)
  {
  	u64 prev_core_cnt, prev_const_cnt;
  	u64 core_cnt, const_cnt, scale;

-	if (!this_cpu_read(amu_scale_freq))
+	if (this_cpu_read(amu_scale_freq) != 3)
  		return;

  	const_cnt = read_sysreg_s(SYS_AMEVCNTR0_CONST_EL0);


-------------------------8<------------------------------------

Regards,
Lukasz

> 
> Thank you,
> Ionela.
>
Ionela Voinescu Jan. 24, 2020, 1:12 p.m. UTC | #4
Hi Lukasz,

On Friday 24 Jan 2020 at 01:19:31 (+0000), Lukasz Luba wrote:
> 
> 
> On 1/23/20 5:07 PM, Ionela Voinescu wrote:
> > Hi Lukasz,
> > 
> > Thank you for taking a look over the patches.
> > 
> > On Thursday 23 Jan 2020 at 11:49:29 (+0000), Lukasz Luba wrote:
> > > Hi Ionela,
> > > 
> > > Please find my few comments below.
> > > 
> > > On 12/18/19 6:26 PM, Ionela Voinescu wrote:
> > > > The Frequency Invariance Engine (FIE) is providing a frequency
> > > > scaling correction factor that helps achieve more accurate
> > > > load-tracking.
> > > > 
> > > > So far, for arm and arm64 platforms, this scale factor has been
> > > > obtained based on the ratio between the current frequency and the
> > > > maximum supported frequency recorded by the cpufreq policy. The
> > > > setting of this scale factor is triggered from cpufreq drivers by
> > > > calling arch_set_freq_scale. The current frequency used in computation
> > > > is the frequency requested by a governor, but it may not be the
> > > > frequency that was implemented by the platform.
> > > > 
> > > > This correction factor can also be obtained using a core counter and a
> > > > constant counter to get information on the performance (frequency based
> > > > only) obtained in a period of time. This will more accurately reflect
> > > > the actual current frequency of the CPU, compared with the alternative
> > > > implementation that reflects the request of a performance level from
> > > > the OS.
> > > > 
> > > > Therefore, implement arch_scale_freq_tick to use activity monitors, if
> > > > present, for the computation of the frequency scale factor.
> > > > 
> > > > The use of AMU counters depends on:
> > > >    - CONFIG_ARM64_AMU_EXTN - depends on the AMU extension being present
> > > >    - CONFIG_CPU_FREQ - the current frequency obtained using counter
> > > >      information is divided by the maximum frequency obtained from the
> > > >      cpufreq policy.
> > > > 
> > > > While it is possible to have a combination of CPUs in the system with
> > > > and without support for activity monitors, the use of counters for
> > > > frequency invariance is only enabled for a CPU if all related CPUs
> > > > (CPUs in the same frequency domain) support and have enabled the core
> > > 
> > > This looks like an edge case scenario, for which we are designing the
> > > whole machinery with workqueues. AFAIU we cannot run the code in
> > > arch_set_freq_scale() and you want to check all CPUs upfront.
> > > 
> > 
> > Unfortunately, I don't believe it to be an edge case. Given that this
> > is an optional feature, I do believe that people might skip on
> > implementing it on some CPUs(LITTLEs) while keeping it for CPUs(bigs)
> > where power and thermal mitigation is more probable to happen in firmware.
> > This is the main reason to be conservative in the validation of CPUs and
> > cpufreq policies.
> > 
> > In regards to arch_set_freq_scale, I want to be able to tell, when that
> > function is called, if I should return a scale factor based on cpufreq
> > for the current policy. If activity monitors are usable for the CPUs in
> > the full policy, then I'm bailing out and letting the AMU FIE machinery
> > set the scale factor. Unfortunately this works at policy granularity.
> > 
> > This could be done in a nicer way by setting the scale factor per cpu
> > and not for all CPUs in a policy in this arch_set_freq_scale function.
> > But this would require some rewriting for the full frequency invariance
> > support in drivers which we've talked about for a while but it was not
> > the purpose of this patch set. But it would eliminate the policy
> > verification I do with the second workqueue.
> > 
> > > Maybe you can just wait till all CPUs boot and then set the proper
> > > flags and finish initialization. Something like:
> > > per_cpu(s8, amu_feat) /* from the patch 1/6 */
> > > OR
> > > per_cpu(u8, amu_scale_freq) /* from this patch */
> > > with maybe some values:
> > > 0 - not checked yet
> > > 1 - checked and present
> > > -1 - checked and not available
> > > -2 - checked but in conflict with others in the freq domain
> > > -3..-k - other odd configurations
> > > 
> > > could potentially eliminate the need of workqueues.
> > > 
> > > Then, if we could trigger this from e.g. late_initcall, the CPUs
> > > should be online and you can validate them.
> > > 
> > 
> > I did initially give such a state machine a try but it proved to be
> > quite messy. A big reason for this is that the activity monitors unit
> > has multiple counters that can be used for different purposes.
> > 
> > The amu_feat per_cpu variable only flags that you have the AMU present
> > for potential users (in this case FIE) to validate the counters they
> > need for their respective usecase. For this reason I don't want to
> > overload the meaning of amu_feat. For the same reason I'm not doing the
> > validation of the counters in a generic way, but I'm tying it to the
> > usecase for particular counters. For example, it would not matter if
> > the instructions retired counter is not enabled from firmware for the
> > usecase of FIE. For frequency invariance we only need the core and
> > constant cycle counters and I'm making it the job of the user (arm64
> > topology code) to do the checking.
> > 
> > Secondly, for amu_scale_freq I could have added such a state machine,
> > but I did not think it was useful. The only thing it would change is
> > that I would not have to use the cpu_amu_fie variable in the data
> > structure that gets passed to the work functions. The only way I would
> > eliminate the second workqueue was if I did not do a check of all CPUs
> > in a policy, as described above, and rewrite frequency invariance to
> > work at CPU granularity and not policy granularity. This would eliminate
> > the dependency on cpufreq policy all-together, so it would be worth
> > doing if only for this reason alone :).
> > 
> > But even in that case, it's probably not needed to have more than two
> > states for amu_scale_freq.
> > 
> > What do you think?
> 
> I think currently we are the only users of this AMU and if there will
> be another in the future, then we can start thinking about the proposed
> changes. Let's cross that bridge when we come to it.
> 
> Regarding the code, in the arch/arm64/cpufeature.c you can already
> read the cycle registers. All the CPUs are going through that code
> during start. If you use this fact in the late_initcall() all CPUs
> should be checked and you can just ask for cpufreq policy, calculate the
> max_freq ratio, set the per cpu config value to 'ready' state.
> 
> Something like in the code below, it is on top of your patch set.
> 
> ------------------------>8-------------------------------------
> 
> 
> diff --git a/arch/arm64/kernel/cpufeature.c b/arch/arm64/kernel/cpufeature.c
> index c639b3e052d7..837ea46d8867 100644
> --- a/arch/arm64/kernel/cpufeature.c
> +++ b/arch/arm64/kernel/cpufeature.c
> @@ -1168,19 +1168,26 @@ static bool has_hw_dbm(const struct
> arm64_cpu_capabilities *cap,
>   * from the current cpu.
>   *  - cpu_has_amu_feat()
>   */
> -static DEFINE_PER_CPU_READ_MOSTLY(u8, amu_feat);
> -
> -inline bool cpu_has_amu_feat(void)
> -{
> -	return !!this_cpu_read(amu_feat);
> -}
> +DECLARE_PER_CPU(u64, arch_const_cycles_prev);
> +DECLARE_PER_CPU(u64, arch_core_cycles_prev);
> +DECLARE_PER_CPU(u8, amu_scale_freq);
> 
>  static void cpu_amu_enable(struct arm64_cpu_capabilities const *cap)
>  {
> +	u64 core_cnt, const_cnt;
> +
>  	if (has_cpuid_feature(cap, SCOPE_LOCAL_CPU)) {
>  		pr_info("detected CPU%d: Activity Monitors Unit (AMU)\n",
>  			smp_processor_id());
> -		this_cpu_write(amu_feat, 1);
> +		core_cnt = read_sysreg_s(SYS_AMEVCNTR0_CORE_EL0);
> +		const_cnt = read_sysreg_s(SYS_AMEVCNTR0_CONST_EL0);
> +
> +		this_cpu_write(arch_core_cycles_prev, core_cnt);
> +		this_cpu_write(arch_const_cycles_prev, const_cnt);
> +
> +		this_cpu_write(amu_scale_freq, 1);
> +	} else {
> +		this_cpu_write(amu_scale_freq, 2);
>  	}
>  }


Yes, functionally this can be done here (it would need some extra checks
on the initial values of core_cnt and const_cnt), but what I was saying
in my previous comment is that I don't want to mix generic feature
detection, which should happen here, with counter validation for
frequency invariance. As you see, this would already bring here per-cpu
variables for counters and amu_scale_freq flag, and I only see this
getting more messy with the future use of more counters. I don't believe
this code belongs here.

Looking a bit more over the code and checking against the new frequency
invariance code for x86, there is a case of either doing this CPU
validation in smp_prepare_cpus (separately for arm64 and x86) or calling
an arch_init_freq_invariance() maybe in sched_init_smp to be defined with
the proper frequency invariance counter initialisation code separately
for x86 and arm64. I'll have to look more over the details to make sure
this is feasible.
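
As a rough sketch of that second option (the hook name comes from above;
this is not existing kernel API):

	/* generic weak default, overridden by x86 and arm64 */
	void __weak arch_init_freq_invariance(void)
	{
	}

plus a single arch_init_freq_invariance() call added near the end of
sched_init_smp(), once all CPUs are online.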

> 
> diff --git a/arch/arm64/kernel/topology.c b/arch/arm64/kernel/topology.c
> index 61f8264afec9..95b34085ae64 100644
> --- a/arch/arm64/kernel/topology.c
> +++ b/arch/arm64/kernel/topology.c
> @@ -144,8 +144,8 @@ static struct cpu_amu_work __percpu *works;
>  static cpumask_var_t cpus_to_visit;
> 
>  static DEFINE_PER_CPU_READ_MOSTLY(unsigned long, arch_max_freq_scale);
> -static DEFINE_PER_CPU(u64, arch_const_cycles_prev);
> -static DEFINE_PER_CPU(u64, arch_core_cycles_prev);
> +DEFINE_PER_CPU(u64, arch_const_cycles_prev);
> +DEFINE_PER_CPU(u64, arch_core_cycles_prev);
>  DECLARE_PER_CPU(u8, amu_scale_freq);
> 
>  static void cpu_amu_fie_init_workfn(struct work_struct *work)
> @@ -323,12 +323,64 @@ static int __init
> register_fie_counters_cpufreq_notifier(void)
>  }
>  core_initcall(register_fie_counters_cpufreq_notifier);
> 
> +static int __init init_amu_feature(void)
> +{
> +	struct cpufreq_policy *policy;
> +	struct cpumask *checked_cpus;
> +	int count, total;
> +	int cpu, i;
> +	s8 amu_config;
> +	u64 ratio;
> +
> +	checked_cpus = kzalloc(cpumask_size(), GFP_KERNEL);
> +	if (!checked_cpus)
> +		return -ENOMEM;
> +
> +	for_each_possible_cpu(cpu) {
> +		if (cpumask_test_cpu(cpu, checked_cpus))
> +			continue;
> +
> +		policy = cpufreq_cpu_get(cpu);
> +		if (!policy) {
> +			pr_warn("No cpufreq policy found for CPU%d\n", cpu);
> +			continue;
> +		}
> +
> +		count = total = 0;
> +
> +		for_each_cpu(i, policy->related_cpus) {
> +			amu_config = per_cpu(amu_scale_freq, i);
> +			if (amu_config == 1)
> +				count++;
> +			total++;
> +		}
> +
> +		amu_config = (total == count) ? 3 : 4;
> +
> +		ratio = (u64)arch_timer_get_rate() << (2 * SCHED_CAPACITY_SHIFT);
> +		ratio = div64_u64(ratio, policy->cpuinfo.max_freq * 1000);
> +
> +		for_each_cpu(i, policy->related_cpus) {
> +			per_cpu(arch_max_freq_scale, i) = (unsigned long)ratio;
> +			per_cpu(amu_scale_freq, i) = amu_config;
> +			cpumask_set_cpu(i, checked_cpus);
> +		}
> +
> +		cpufreq_cpu_put(policy);
> +	}
> +
> +	kfree(checked_cpus);
> +
> +	return 0;
> +}
> +late_initcall(init_amu_feature);
> +

Yes, with the design I mentioned above, this CPU policy validation could
move to a late_initcall and I could drop the workqueues and the extra
data structure. Thanks for this!

Let me know what you think!

Thank you,
Ionela.

>  void topology_scale_freq_tick(void)
>  {
>  	u64 prev_core_cnt, prev_const_cnt;
>  	u64 core_cnt, const_cnt, scale;
> 
> -	if (!this_cpu_read(amu_scale_freq))
> +	if (this_cpu_read(amu_scale_freq) != 3)
>  		return;
> 
>  	const_cnt = read_sysreg_s(SYS_AMEVCNTR0_CONST_EL0);
> 
> 
> -------------------------8<------------------------------------
> 
> Regards,
> Lukasz
> 
> > 
> > Thank you,
> > Ionela.
> >
Lukasz Luba Jan. 24, 2020, 3:17 p.m. UTC | #5
On 1/24/20 1:12 PM, Ionela Voinescu wrote:
> Hi Lukasz,
> 
> On Friday 24 Jan 2020 at 01:19:31 (+0000), Lukasz Luba wrote:
>>
>>
>> On 1/23/20 5:07 PM, Ionela Voinescu wrote:
>>> Hi Lukasz,
>>>
>>> Thank you for taking a look over the patches.
>>>
>>> On Thursday 23 Jan 2020 at 11:49:29 (+0000), Lukasz Luba wrote:
>>>> Hi Ionela,
>>>>
>>>> Please find my few comments below.
>>>>
>>>> On 12/18/19 6:26 PM, Ionela Voinescu wrote:
>>>>> The Frequency Invariance Engine (FIE) is providing a frequency
>>>>> scaling correction factor that helps achieve more accurate
>>>>> load-tracking.
>>>>>
>>>>> So far, for arm and arm64 platforms, this scale factor has been
>>>>> obtained based on the ratio between the current frequency and the
>>>>> maximum supported frequency recorded by the cpufreq policy. The
>>>>> setting of this scale factor is triggered from cpufreq drivers by
>>>>> calling arch_set_freq_scale. The current frequency used in computation
>>>>> is the frequency requested by a governor, but it may not be the
>>>>> frequency that was implemented by the platform.
>>>>>
>>>>> This correction factor can also be obtained using a core counter and a
>>>>> constant counter to get information on the performance (frequency based
>>>>> only) obtained in a period of time. This will more accurately reflect
>>>>> the actual current frequency of the CPU, compared with the alternative
>>>>> implementation that reflects the request of a performance level from
>>>>> the OS.
>>>>>
>>>>> Therefore, implement arch_scale_freq_tick to use activity monitors, if
>>>>> present, for the computation of the frequency scale factor.
>>>>>
>>>>> The use of AMU counters depends on:
>>>>>     - CONFIG_ARM64_AMU_EXTN - depends on the AMU extension being present
>>>>>     - CONFIG_CPU_FREQ - the current frequency obtained using counter
>>>>>       information is divided by the maximum frequency obtained from the
>>>>>       cpufreq policy.
>>>>>
>>>>> While it is possible to have a combination of CPUs in the system with
>>>>> and without support for activity monitors, the use of counters for
>>>>> frequency invariance is only enabled for a CPU if all related CPUs
>>>>> (CPUs in the same frequency domain) support and have enabled the core
>>>>
>>>> This looks like an edge case scenario, for which we are designing the
>>>> whole machinery with workqueues. AFAIU we cannot run the code in
>>>> arch_set_freq_scale() and you want to check all CPUs upfront.
>>>>
>>>
>>> Unfortunately, I don't believe it to be an edge case. Given that this
>>> is an optional feature, I do believe that people might skip on
>>> implementing it on some CPUs(LITTLEs) while keeping it for CPUs(bigs)
>>> where power and thermal mitigation is more probable to happen in firmware.
>>> This is the main reason to be conservative in the validation of CPUs and
>>> cpufreq policies.
>>>
>>> In regards to arch_set_freq_scale, I want to be able to tell, when that
>>> function is called, if I should return a scale factor based on cpufreq
>>> for the current policy. If activity monitors are usable for the CPUs in
>>> the full policy, then I'm bailing out and letting the AMU FIE machinery
>>> set the scale factor. Unfortunately this works at policy granularity.
>>>
>>> This could be done in a nicer way by setting the scale factor per cpu
>>> and not for all CPUs in a policy in this arch_set_freq_scale function.
>>> But this would require some rewriting for the full frequency invariance
>>> support in drivers which we've talked about for a while but it was not
>>> the purpose of this patch set. But it would eliminate the policy
>>> verification I do with the second workqueue.
>>>
>>>> Maybe you can just wait till all CPUs boot and then set the proper
>>>> flags and finish initialization. Something like:
>>>> per_cpu(s8, amu_feat) /* from the patch 1/6 */
>>>> OR
>>>> per_cpu(u8, amu_scale_freq) /* from this patch */
>>>> with maybe some values:
>>>> 0 - not checked yet
>>>> 1 - checked and present
>>>> -1 - checked and not available
>>>> -2 - checked but in conflict with others in the freq domain
>>>> -3..-k - other odd configurations
>>>>
>>>> could potentially eliminate the need of workqueues.
>>>>
>>>> Then, if we could trigger this from e.g. late_initcall, the CPUs
>>>> should be online and you can validate them.
>>>>
>>>
>>> I did initially give such a state machine a try but it proved to be
>>> quite messy. A big reason for this is that the activity monitors unit
>>> has multiple counters that can be used for different purposes.
>>>
>>> The amu_feat per_cpu variable only flags that you have the AMU present
>>> for potential users (in this case FIE) to validate the counters they
>>> need for their respective usecase. For this reason I don't want to
>>> overload the meaning of amu_feat. For the same reason I'm not doing the
>>> validation of the counters in a generic way, but I'm tying it to the
>>> usecase for particular counters. For example, it would not matter if
>>> the instructions retired counter is not enabled from firmware for the
>>> usecase of FIE. For frequency invariance we only need the core and
>>> constant cycle counters and I'm making it the job of the user (arm64
>>> topology code) to do the checking.
>>>
>>> Secondly, for amu_scale_freq I could have added such a state machine,
>>> but I did not think it was useful. The only thing it would change is
>>> that I would not have to use the cpu_amu_fie variable in the data
>>> structure that gets passed to the work functions. The only way I would
>>> eliminate the second workqueue was if I did not do a check of all CPUs
>>> in a policy, as described above, and rewrite frequency invariance to
>>> work at CPU granularity and not policy granularity. This would eliminate
>>> the dependency on cpufreq policy all-together, so it would be worth
>>> doing if only for this reason alone :).
>>>
>>> But even in that case, it's probably not needed to have more than two
>>> states for amu_scale_freq.
>>>
>>> What do you think?
>>
>> I think currently we are the only users of this AMU and if there will
>> be another in the future, then we can start thinking about the proposed
>> changes. Let's cross that bridge when we come to it.
>>
>> Regarding the code, in the arch/arm64/cpufeature.c you can already
>> read the cycle registers. All the CPUs are going through that code
>> during start. If you use this fact in the late_initcall() all CPUs
>> should be checked and you can just ask for cpufreq policy, calculate the
>> max_freq ratio, set the per cpu config value to 'ready' state.
>>
>> Something like in the code below, it is on top of your patch set.
>>
>> ------------------------>8-------------------------------------
>>
>>
>> diff --git a/arch/arm64/kernel/cpufeature.c b/arch/arm64/kernel/cpufeature.c
>> index c639b3e052d7..837ea46d8867 100644
>> --- a/arch/arm64/kernel/cpufeature.c
>> +++ b/arch/arm64/kernel/cpufeature.c
>> @@ -1168,19 +1168,26 @@ static bool has_hw_dbm(const struct
>> arm64_cpu_capabilities *cap,
>>    * from the current cpu.
>>    *  - cpu_has_amu_feat()
>>    */
>> -static DEFINE_PER_CPU_READ_MOSTLY(u8, amu_feat);
>> -
>> -inline bool cpu_has_amu_feat(void)
>> -{
>> -	return !!this_cpu_read(amu_feat);
>> -}
>> +DECLARE_PER_CPU(u64, arch_const_cycles_prev);
>> +DECLARE_PER_CPU(u64, arch_core_cycles_prev);
>> +DECLARE_PER_CPU(u8, amu_scale_freq);
>>
>>   static void cpu_amu_enable(struct arm64_cpu_capabilities const *cap)
>>   {
>> +	u64 core_cnt, const_cnt;
>> +
>>   	if (has_cpuid_feature(cap, SCOPE_LOCAL_CPU)) {
>>   		pr_info("detected CPU%d: Activity Monitors Unit (AMU)\n",
>>   			smp_processor_id());
>> -		this_cpu_write(amu_feat, 1);
>> +		core_cnt = read_sysreg_s(SYS_AMEVCNTR0_CORE_EL0);
>> +		const_cnt = read_sysreg_s(SYS_AMEVCNTR0_CONST_EL0);
>> +
>> +		this_cpu_write(arch_core_cycles_prev, core_cnt);
>> +		this_cpu_write(arch_const_cycles_prev, const_cnt);
>> +
>> +		this_cpu_write(amu_scale_freq, 1);
>> +	} else {
>> +		this_cpu_write(amu_scale_freq, 2);
>>   	}
>>   }
> 
> 
> Yes, functionally this can be done here (it would need some extra checks
> on the initial values of core_cnt and const_cnt), but what I was saying
> in my previous comment is that I don't want to mix generic feature
> detection, which should happen here, with counter validation for
> frequency invariance. As you see, this would already bring here per-cpu
> variables for counters and amu_scale_freq flag, and I only see this
> getting more messy with the future use of more counters. I don't believe
> this code belongs here.
> 
> Looking a bit more over the code and checking against the new frequency
> invariance code for x86, there is a case of either doing this CPU
> validation in smp_prepare_cpus (separately for arm64 and x86) or calling
> an arch_init_freq_invariance() maybe in sched_init_smp to be defined with
> the proper frequency invariance counter initialisation code separately
> for x86 and arm64. I'll have to look more over the details to make sure
> this is feasible.

I have found that we could simply draw on Mark's solution to a
similar problem. In commit:

commit df857416a13734ed9356f6e4f0152d55e4fb748a
Author: Mark Rutland <mark.rutland@arm.com>
Date:   Wed Jul 16 16:32:44 2014 +0100

     arm64: cpuinfo: record cpu system register values

     Several kernel subsystems need to know details about CPU system register
     values, sometimes for CPUs other than that they are executing on. Rather
     than hard-coding system register accesses and cross-calls for these
     cases, this patch adds logic to record various system register values at
     boot-time. This may be used for feature reporting, firmware bug
     detection, etc.

     Separate hooks are added for the boot and hotplug paths to enable
     one-time intialisation and cold/warm boot value mismatch detection in
     later patches.

     Signed-off-by: Mark Rutland <mark.rutland@arm.com>
     Reviewed-by: Will Deacon <will.deacon@arm.com>
     Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
     Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>


He added a cpuinfo_store_cpu() call in secondary_start_kernel()
[in arm64 smp.c]. Please check the file:
arch/arm64/kernel/cpuinfo.c

We can probably add our read-amu-regs-and-setup-invariance call
just below his cpuinfo_store_cpu.

Then the arm64 cpufeature.c would be clean, we will be called for
each cpu, and late_initcall() will finish the setup with an edge-case
policy check like in the init_amu_feature() code below.
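
Roughly (the helper name below is made up for illustration):

	/* arch/arm64/kernel/smp.c, in secondary_start_kernel() */
	cpuinfo_store_cpu();
	/* read the AMU core/const counters and store the per-CPU state */
	init_this_cpu_amu_fie_counters();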


> 
>>
>> diff --git a/arch/arm64/kernel/topology.c b/arch/arm64/kernel/topology.c
>> index 61f8264afec9..95b34085ae64 100644
>> --- a/arch/arm64/kernel/topology.c
>> +++ b/arch/arm64/kernel/topology.c
>> @@ -144,8 +144,8 @@ static struct cpu_amu_work __percpu *works;
>>   static cpumask_var_t cpus_to_visit;
>>
>>   static DEFINE_PER_CPU_READ_MOSTLY(unsigned long, arch_max_freq_scale);
>> -static DEFINE_PER_CPU(u64, arch_const_cycles_prev);
>> -static DEFINE_PER_CPU(u64, arch_core_cycles_prev);
>> +DEFINE_PER_CPU(u64, arch_const_cycles_prev);
>> +DEFINE_PER_CPU(u64, arch_core_cycles_prev);
>>   DECLARE_PER_CPU(u8, amu_scale_freq);
>>
>>   static void cpu_amu_fie_init_workfn(struct work_struct *work)
>> @@ -323,12 +323,64 @@ static int __init
>> register_fie_counters_cpufreq_notifier(void)
>>   }
>>   core_initcall(register_fie_counters_cpufreq_notifier);
>>
>> +static int __init init_amu_feature(void)
>> +{
>> +	struct cpufreq_policy *policy;
>> +	struct cpumask *checked_cpus;
>> +	int count, total;
>> +	int cpu, i;
>> +	s8 amu_config;
>> +	u64 ratio;
>> +
>> +	checked_cpus = kzalloc(cpumask_size(), GFP_KERNEL);
>> +	if (!checked_cpus)
>> +		return -ENOMEM;
>> +
>> +	for_each_possible_cpu(cpu) {
>> +		if (cpumask_test_cpu(cpu, checked_cpus))
>> +			continue;
>> +
>> +		policy = cpufreq_cpu_get(cpu);
>> +		if (!policy) {
>> +			pr_warn("No cpufreq policy found for CPU%d\n", cpu);
>> +			continue;
>> +		}
>> +
>> +		count = total = 0;
>> +
>> +		for_each_cpu(i, policy->related_cpus) {
>> +			amu_config = per_cpu(amu_scale_freq, i);
>> +			if (amu_config == 1)
>> +				count++;
>> +			total++;
>> +		}
>> +
>> +		amu_config = (total == count) ? 3 : 4;
>> +
>> +		ratio = (u64)arch_timer_get_rate() << (2 * SCHED_CAPACITY_SHIFT);
>> +		ratio = div64_u64(ratio, policy->cpuinfo.max_freq * 1000);
>> +
>> +		for_each_cpu(i, policy->related_cpus) {
>> +			per_cpu(arch_max_freq_scale, i) = (unsigned long)ratio;
>> +			per_cpu(amu_scale_freq, i) = amu_config;
>> +			cpumask_set_cpu(i, checked_cpus);
>> +		}
>> +
>> +		cpufreq_cpu_put(policy);
>> +	}
>> +
>> +	kfree(checked_cpus);
>> +
>> +	return 0;
>> +}
>> +late_initcall(init_amu_feature);
>> +
> 
> Yes, with the design I mentioned above, this CPU policy validation could
> move to a late_initcall and I could drop the workqueues and the extra
> data structure. Thanks for this!
> 
> Let me know what you think!
> 

One thing is still open: the file drivers/base/arch_topology.c and the
#ifdef in the function arch_set_freq_scale().

Generally, if there is such a need, it's better to put such stuff into
the header and make a dual implementation, not polluting generic code with:
#if defined(CONFIG_ARM64_XZY)
#endif
#if defined(CONFIG_POWERPC_ABC)
#endif
#if defined(CONFIG_x86_QAZ)
#endif
...


In our case we would need e.g. linux/topology.h, because it includes
asm/topology.h, which might provide the needed symbol. At the end of
linux/topology.h we can have:

#ifndef arch_cpu_auto_scaling
static __always_inline
bool arch_cpu_auto_scaling(int cpu) { return false; }
#endif

Then, when the symbol is missing and we get the default one,
it should be easily optimized away by the compiler.

We could have a much cleaner arch_set_freq_scale()
in drivers/base/, and all architectures will deal with the specific
#ifdef CONFIG in their <asm/topology.h> implementations or
use the default.

Example:
void arch_set_freq_scale(struct cpumask *cpus, unsigned long cur_freq,
			 unsigned long max_freq)
{
	unsigned long scale;
	int i;

	if (arch_cpu_auto_scaling(cpumask_first(cpus)))
		return;

	scale = (cur_freq << SCHED_CAPACITY_SHIFT) / max_freq;
	for_each_cpu(i, cpus)
		per_cpu(freq_scale, i) = scale;
}
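
For completeness, the arm64 side could then provide the override
(a sketch only; topology_cpu_auto_scaling() is a hypothetical name,
and I assume the per-cpu amu_scale_freq flag from your patch):

/* arch/arm64/include/asm/topology.h */
bool topology_cpu_auto_scaling(int cpu);
#define arch_cpu_auto_scaling topology_cpu_auto_scaling

/* arch/arm64/kernel/topology.c */
bool topology_cpu_auto_scaling(int cpu)
{
	/* nonzero: the scale factor is set from the tick using counters */
	return !!per_cpu(amu_scale_freq, cpu);
}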

Regards,
Lukasz
Ionela Voinescu Jan. 28, 2020, 5:36 p.m. UTC | #6
Hi Lukasz,

On Friday 24 Jan 2020 at 15:17:48 (+0000), Lukasz Luba wrote:
[..]
> > >   static void cpu_amu_enable(struct arm64_cpu_capabilities const *cap)
> > >   {
> > > +	u64 core_cnt, const_cnt;
> > > +
> > >   	if (has_cpuid_feature(cap, SCOPE_LOCAL_CPU)) {
> > >   		pr_info("detected CPU%d: Activity Monitors Unit (AMU)\n",
> > >   			smp_processor_id());
> > > -		this_cpu_write(amu_feat, 1);
> > > +		core_cnt = read_sysreg_s(SYS_AMEVCNTR0_CORE_EL0);
> > > +		const_cnt = read_sysreg_s(SYS_AMEVCNTR0_CONST_EL0);
> > > +
> > > +		this_cpu_write(arch_core_cycles_prev, core_cnt);
> > > +		this_cpu_write(arch_const_cycles_prev, const_cnt);
> > > +
> > > +		this_cpu_write(amu_scale_freq, 1);
> > > +	} else {
> > > +		this_cpu_write(amu_scale_freq, 2);
> > >   	}
> > >   }
> > 
> > 
> > Yes, functionally this can be done here (it would need some extra checks
> > on the initial values of core_cnt and const_cnt), but what I was saying
> > in my previous comment is that I don't want to mix generic feature
> > detection, which should happen here, with counter validation for
> > frequency invariance. As you see, this would already bring here per-cpu
> > variables for counters and amu_scale_freq flag, and I only see this
> > getting more messy with the future use of more counters. I don't believe
> > this code belongs here.
> > 
> > Looking a bit more over the code and checking against the new frequency
> > invariance code for x86, there is a case of either doing this CPU
> > validation in smp_prepare_cpus (separately for arm64 and x86) or calling
> > an arch_init_freq_invariance() maybe in sched_init_smp to be defined with
> > the proper frequency invariance counter initialisation code separately
> > for x86 and arm64. I'll have to look more over the details to make sure
> > this is feasible.
> 
> I have found that we could simply draw on from Mark's solution to
> similar problem. In commit:
> 
> commit df857416a13734ed9356f6e4f0152d55e4fb748a
> Author: Mark Rutland <mark.rutland@arm.com>
> Date:   Wed Jul 16 16:32:44 2014 +0100
> 
>     arm64: cpuinfo: record cpu system register values
> 
>     Several kernel subsystems need to know details about CPU system register
>     values, sometimes for CPUs other than that they are executing on. Rather
>     than hard-coding system register accesses and cross-calls for these
>     cases, this patch adds logic to record various system register values at
>     boot-time. This may be used for feature reporting, firmware bug
>     detection, etc.
> 
>     Separate hooks are added for the boot and hotplug paths to enable
>     one-time initialisation and cold/warm boot value mismatch detection in
>     later patches.
> 
>     Signed-off-by: Mark Rutland <mark.rutland@arm.com>
>     Reviewed-by: Will Deacon <will.deacon@arm.com>
>     Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
>     Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
> 
> 
> He added a cpuinfo_store_cpu() call in secondary_start_kernel()
> [in arm64 smp.c]. Please check the file:
> arch/arm64/kernel/cpuinfo.c
> 
> We can probably add our read-amu-regs-and-setup-invariance call
> just below his cpuinfo_store_cpu.
> 
> Then the arm64 cpufeature.c would be clean: we will be called for
> each CPU, and a late_initcall() will finish the setup with an edge-case
> policy check like in the init_amu_feature() code below.
> 

Yes, this should work: calling an AMU per-CPU validation function in
setup_processor for the boot CPU and in secondary_start_kernel for
secondary and hotplugged CPUs.

I would still like to bring this closer to the scheduler
(sched_init_smp), as frequency invariance is functionality needed by
the scheduler and its initialisation should be part of the scheduler
init code. But this, together with the needed interfaces for other
architectures, can be done in a separate patchset that is not so
AMU/arm64-specific.
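
To illustrate (hypothetical names, nothing implemented yet), the
interface I have in mind is roughly:

/* generic code: weak default, a no-op unless an arch overrides it */
void __weak arch_init_freq_invariance(void)
{
}

with a single call from sched_init_smp() once the secondary CPUs are up,
and an arm64 override doing the per-policy AMU counter validation.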

[..]
> > 
> > Yes, with the design I mentioned above, this CPU policy validation could
> > move to a late_initcall and I could drop the workqueues and the extra
> > data structure. Thanks for this!
> > 
> > Let me know what you think!
> > 
> 
> One thing is still open: the file drivers/base/arch_topology.c and
> the #ifdef in the function arch_set_freq_scale().
> 
> Generally, if there is such a need, it's better to put such stuff into a
> header and provide a dual implementation, so the generic code is not
> polluted with:
> #if defined(CONFIG_ARM64_XZY)
> #endif
> #if defined(CONFIG_POWERPC_ABC)
> #endif
> #if defined(CONFIG_x86_QAZ)
> #endif
> ...
> 
> 
> In our case we would need e.g. linux/topology.h, because it includes
> asm/topology.h, which might provide the needed symbol. At the end of
> linux/topology.h we can have:
> 
> #ifndef arch_cpu_auto_scaling
> static __always_inline
> bool arch_cpu_auto_scaling(int cpu) { return false; }
> #endif
> 
> Then, when the symbol is missing and we get the default one,
> it should be easily optimized away by the compiler.
> 
> We could have a much cleaner arch_set_freq_scale()
> in drivers/base/, and all architectures will deal with the specific
> #ifdef CONFIG in their <asm/topology.h> implementations or
> use the default.
> 
> Example:
> void arch_set_freq_scale(struct cpumask *cpus, unsigned long cur_freq,
> 			 unsigned long max_freq)
> {
> 	unsigned long scale;
> 	int i;
> 
> 	if (arch_cpu_auto_scaling(cpumask_first(cpus)))
> 		return;
> 
> 	scale = (cur_freq << SCHED_CAPACITY_SHIFT) / max_freq;
> 	for_each_cpu(i, cpus)
> 		per_cpu(freq_scale, i) = scale;
> }
> 
> Regards,
> Lukasz
>

Okay, it does look nice and clean. Let me give this a try in v3.

Thank you very much,
Ionela.
Valentin Schneider Jan. 29, 2020, 5:13 p.m. UTC | #7
Only commenting on the bits that should be there regardless of whether
the workqueues are used;

On 18/12/2019 18:26, Ionela Voinescu wrote:
> +static void cpu_amu_fie_init_workfn(struct work_struct *work)
> +{
> +	u64 core_cnt, const_cnt, ratio;
> +	struct cpu_amu_work *amu_work;
> +	int cpu = smp_processor_id();
> +
> +	if (!cpu_has_amu_feat()) {
> +		pr_debug("CPU%d: counters are not supported.\n", cpu);
> +		return;
> +	}
> +
> +	core_cnt = read_sysreg_s(SYS_AMEVCNTR0_CORE_EL0);
> +	const_cnt = read_sysreg_s(SYS_AMEVCNTR0_CONST_EL0);
> +
> +	if (unlikely(!core_cnt || !const_cnt)) {
> +		pr_err("CPU%d: cycle counters are not enabled.\n", cpu);
> +		return;
> +	}
> +
> +	amu_work = container_of(work, struct cpu_amu_work, cpu_work);
> +	if (unlikely(!(amu_work->cpuinfo_max_freq))) {
> +		pr_err("CPU%d: invalid maximum frequency.\n", cpu);
> +		return;
> +	}
> +
> +	/*
> +	 * Pre-compute the fixed ratio between the frequency of the
> +	 * constant counter and the maximum frequency of the CPU (hz).

I can't resist: s/hz/Hz/

> +	 */
> +	ratio = (u64)arch_timer_get_rate() << (2 * SCHED_CAPACITY_SHIFT);
> +	ratio = div64_u64(ratio, amu_work->cpuinfo_max_freq * 1000);

Nit: we're missing a comment somewhere that the unit of this is in kHz
(which explains the * 1000).

> +	this_cpu_write(arch_max_freq_scale, (unsigned long)ratio);
> +

Okay so what we get in the tick is:

  /\ core
  --------
  /\ const

And we want that to be SCHED_CAPACITY_SCALE when running at max freq. IOW we
want to turn

  max_freq
  ----------
  const_freq

into SCHED_CAPACITY_SCALE, so we can just multiply that by:

  const_freq
  ---------- * SCHED_CAPACITY_SCALE
  max_freq

But the ratio you are storing here is 

                          const_freq
  arch_max_freq_scale =   ---------- * SCHED_CAPACITY_SCALE²
                           max_freq

(because x << 2 * SCHED_CAPACITY_SHIFT == x << 20)


In topology_freq_scale_tick() you end up doing

  /\ core   arch_max_freq_scale
  ------- * --------------------
  /\ const  SCHED_CAPACITY_SCALE

which gives us what we want (SCHED_CAPACITY_SCALE at max freq).
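
(Spelling out the last step: over a tick window,

  /\ core    cur_freq
  -------- = ----------
  /\ const   const_freq

so multiplying by arch_max_freq_scale / SCHED_CAPACITY_SCALE leaves

  cur_freq
  -------- * SCHED_CAPACITY_SCALE
  max_freq

which is exactly the scale factor we're after.)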


Now, the reason why we multiply our ratio by the square of
SCHED_CAPACITY_SCALE was not obvious to me, but you pointed out to me that the
frequency of the arch timer can be *really* low compared to the max CPU freq.

For instance on h960:

  [    0.000000] arch_timer: cp15 timer(s) running at 1.92MHz (phys)

  $ root@valsch-h960:~# cat /sys/devices/system/cpu/cpufreq/policy4/cpuinfo_max_freq 
  2362000

So our ratio would be

  1'920'000 * 1024
  ----------------
    2'362'000'000

Which is ~0.83, so that becomes simply 0...


I had a brief look at the Arm ARM; for the arch timer it says it is
"typically in the range 1-50MHz", but then also gives an example with 20KHz
in a low-power mode.

If we take say 5GHz max CPU frequency, our lower bound for the arch timer
(with that SCHED_CAPACITY_SCALE² trick) is about ~4.768KHz. It's not *too*
far from that 20KHz, but I'm not sure we would actually be executing stuff
in that low-power mode.
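
To double-check the arithmetic, a standalone user-space sketch (plain C,
not kernel code; numbers hard-coded from the h960 example above):

#include <stdint.h>
#include <stdio.h>

#define SCHED_CAPACITY_SHIFT 10

int main(void)
{
	/* h960: 1.92MHz arch timer, 2362000 kHz max CPU frequency */
	uint64_t rate = 1920000, max_hz = 2362000ULL * 1000;

	/* single SCHED_CAPACITY_SHIFT: ~0.83 truncates to 0 */
	printf("%llu\n", (unsigned long long)
	       ((rate << SCHED_CAPACITY_SHIFT) / max_hz));
	/* squared trick: ~852, the precision survives */
	printf("%llu\n", (unsigned long long)
	       ((rate << (2 * SCHED_CAPACITY_SHIFT)) / max_hz));
	/* 5GHz CPU: ratio >= 1 needs rate >= max_hz >> 20, i.e. ~4768Hz */
	printf("%llu\n", (unsigned long long)
	       (5000000000ULL >> (2 * SCHED_CAPACITY_SHIFT)));

	return 0;
}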

Long story short, we're probably fine, but it would be nice to shove some of
the above into comments (especially the SCHED_CAPACITY_SCALE² trick)
Ionela Voinescu Jan. 29, 2020, 5:52 p.m. UTC | #8
Hi Valentin,

On Wednesday 29 Jan 2020 at 17:13:53 (+0000), Valentin Schneider wrote:
> Only commenting on the bits that should be there regardless of whether
> the workqueues are used;
> 
> On 18/12/2019 18:26, Ionela Voinescu wrote:
> > +static void cpu_amu_fie_init_workfn(struct work_struct *work)
> > +{
> > +	u64 core_cnt, const_cnt, ratio;
> > +	struct cpu_amu_work *amu_work;
> > +	int cpu = smp_processor_id();
> > +
> > +	if (!cpu_has_amu_feat()) {
> > +		pr_debug("CPU%d: counters are not supported.\n", cpu);
> > +		return;
> > +	}
> > +
> > +	core_cnt = read_sysreg_s(SYS_AMEVCNTR0_CORE_EL0);
> > +	const_cnt = read_sysreg_s(SYS_AMEVCNTR0_CONST_EL0);
> > +
> > +	if (unlikely(!core_cnt || !const_cnt)) {
> > +		pr_err("CPU%d: cycle counters are not enabled.\n", cpu);
> > +		return;
> > +	}
> > +
> > +	amu_work = container_of(work, struct cpu_amu_work, cpu_work);
> > +	if (unlikely(!(amu_work->cpuinfo_max_freq))) {
> > +		pr_err("CPU%d: invalid maximum frequency.\n", cpu);
> > +		return;
> > +	}
> > +
> > +	/*
> > +	 * Pre-compute the fixed ratio between the frequency of the
> > +	 * constant counter and the maximum frequency of the CPU (hz).
> 
> I can't resist: s/hz/Hz/
> 
> > +	 */
> > +	ratio = (u64)arch_timer_get_rate() << (2 * SCHED_CAPACITY_SHIFT);
> > +	ratio = div64_u64(ratio, amu_work->cpuinfo_max_freq * 1000);
> 
> Nit: we're missing a comment somewhere that the unit of this is in kHz
> (which explains the * 1000).
> 

Will do! The previous comment that explained this was ".. while
ensuring max_freq is converted to HZ.", but I deemed it too clear
and replaced it with the obscure "(hz)". I'll revert :).

> > +	this_cpu_write(arch_max_freq_scale, (unsigned long)ratio);
> > +
> 
> Okay so what we get in the tick is:
> 
>   /\ core
>   --------
>   /\ const
> 
> And we want that to be SCHED_CAPACITY_SCALE when running at max freq. IOW we
> want to turn
> 
>   max_freq
>   ----------
>   const_freq
> 
> into SCHED_CAPACITY_SCALE, so we can just multiply that by:
> 
>   const_freq
>   ---------- * SCHED_CAPACITY_SCALE
>   max_freq
> 
> But the ratio you are storing here is 
> 
>                           const_freq
>   arch_max_freq_scale =   ---------- * SCHED_CAPACITY_SCALE²
>                            max_freq
> 
> (because x << 2 * SCHED_CAPACITY_SHIFT == x << 20)
> 
> 
> In topology_freq_scale_tick() you end up doing
> 
>   /\ core   arch_max_freq_scale
>   ------- * --------------------
>   /\ const  SCHED_CAPACITY_SCALE
> 
> which gives us what we want (SCHED_CAPACITY_SCALE at max freq).
> 
> 
> Now, the reason why we multiply our ratio by the square of
> SCHED_CAPACITY_SCALE was not obvious to me, but you pointed out to me that the
> frequency of the arch timer can be *really* low compared to the max CPU freq.
> 
> For instance on h960:
> 
>   [    0.000000] arch_timer: cp15 timer(s) running at 1.92MHz (phys)
> 
>   $ root@valsch-h960:~# cat /sys/devices/system/cpu/cpufreq/policy4/cpuinfo_max_freq 
>   2362000
> 
> So our ratio would be
> 
>   1'920'000 * 1024
>   ----------------
>     2'362'000'000
> 
> Which is ~0.83, so that becomes simply 0...
> 
> 
> > I had a brief look at the Arm ARM; for the arch timer it says it is
> "typically in the range 1-50MHz", but then also gives an example with 20KHz
> in a low-power mode.
> 
> If we take say 5GHz max CPU frequency, our lower bound for the arch timer
> (with that SCHED_CAPACITY_SCALE² trick) is about ~4.768KHz. It's not *too*
> far from that 20KHz, but I'm not sure we would actually be executing stuff
> in that low-power mode.
> 
> Long story short, we're probably fine, but it would be nice to shove some of
> the above into comments (especially the SCHED_CAPACITY_SCALE² trick)

Okay, I'll add some of this documentation as comments in the patches. I
thought about doing it but I was not sure it justified the line count.
But if it saves people at least the hassle of unpacking this computation
understand the logic, it will be worth it.

Thank you for the thorough review,
Ionela.
Valentin Schneider Jan. 29, 2020, 11:39 p.m. UTC | #9
On 29/01/2020 17:13, Valentin Schneider wrote:
> I had a brief look at the Arm ARM; for the arch timer it says it is
> "typically in the range 1-50MHz", but then also gives an example with 20KHz
> in a low-power mode.
> 
> If we take say 5GHz max CPU frequency, our lower bound for the arch timer
> (with that SCHED_CAPACITY_SCALE² trick) is about ~4.768KHz. It's not *too*
> far from that 20KHz, but I'm not sure we would actually be executing stuff
> in that low-power mode.
> 

I mixed up a few things in there; that low-power mode is supposed to do
higher increments, so it would emulate a similar frequency to the non-low-power
mode. Thus the actual frequency matters less than what is reported in CNTFRQ
(though we hope to get the behaviour we're told we should see), so we should
be quite safe from that ~5KHz value. Still, to make it obvious, I don't think
something like this would hurt:

---
diff --git a/drivers/clocksource/arm_arch_timer.c b/drivers/clocksource/arm_arch_timer.c
index 9a5464c625b45..a72832093575a 100644
--- a/drivers/clocksource/arm_arch_timer.c
+++ b/drivers/clocksource/arm_arch_timer.c
@@ -885,6 +885,17 @@ static int arch_timer_starting_cpu(unsigned int cpu)
 	return 0;
 }
 
+static int validate_timer_rate(void)
+{
+	if (!arch_timer_rate)
+		return 1;
+
+	/* Arch timer frequency < 1MHz is shady */
+	WARN_ON(arch_timer_rate < 1000000);
+
+	return 0;
+}
+
 /*
  * For historical reasons, when probing with DT we use whichever (non-zero)
  * rate was probed first, and don't verify that others match. If the first node
@@ -900,7 +911,7 @@ static void arch_timer_of_configure_rate(u32 rate, struct device_node *np)
 		arch_timer_rate = rate;
 
 	/* Check the timer frequency. */
-	if (arch_timer_rate == 0)
+	if (validate_timer_rate())
 		pr_warn("frequency not available\n");
 }
 
@@ -1594,7 +1605,7 @@ static int __init arch_timer_acpi_init(struct acpi_table_header *table)
 	 * CNTFRQ value. This *must* be correct.
 	 */
 	arch_timer_rate = arch_timer_get_cntfrq();
-	if (!arch_timer_rate) {
+	if (validate_timer_rate()) {
 		pr_err(FW_BUG "frequency not available.\n");
 		return -EINVAL;
 	}
---

> > Long story short, we're probably fine, but it would be nice to shove some of
> the above into comments (especially the SCHED_CAPACITY_SCALE² trick)
>
Ionela Voinescu Jan. 30, 2020, 3:49 p.m. UTC | #10
Hi Valentin,

On Wednesday 29 Jan 2020 at 23:39:11 (+0000), Valentin Schneider wrote:
> On 29/01/2020 17:13, Valentin Schneider wrote:
> > I had a brief look at the Arm ARM; for the arch timer it says it is
> > "typically in the range 1-50MHz", but then also gives an example with 20KHz
> > in a low-power mode.
> > 
> > If we take say 5GHz max CPU frequency, our lower bound for the arch timer
> > (with that SCHED_CAPACITY_SCALE² trick) is about ~4.768KHz. It's not *too*
> > far from that 20KHz, but I'm not sure we would actually be executing stuff
> > in that low-power mode.
> > 
> 
> I mixed up a few things in there; that low-power mode is supposed to do
> higher increments, so it would emulate a similar frequency as the non-low-power
> mode. Thus the actual frequency matters less than what is reported in CNTFRQ
> (though we hope to get the behaviour we're told we should see), so we should
> be quite safe from that ~5KHz value. Still, to make it obvious, I don't think
> something like this would hurt:
> 
> ---
> diff --git a/drivers/clocksource/arm_arch_timer.c b/drivers/clocksource/arm_arch_timer.c
> index 9a5464c625b45..a72832093575a 100644
> --- a/drivers/clocksource/arm_arch_timer.c
> +++ b/drivers/clocksource/arm_arch_timer.c
> @@ -885,6 +885,17 @@ static int arch_timer_starting_cpu(unsigned int cpu)
>  	return 0;
>  }
>  
> +static int validate_timer_rate(void)
> +{
> +	if (!arch_timer_rate)
> +		return 1;
> +
> +	/* Arch timer frequency < 1MHz is shady */
> +	WARN_ON(arch_timer_rate < 1000000);
> +
> +	return 0;
> +}
> +
>  /*
>   * For historical reasons, when probing with DT we use whichever (non-zero)
>   * rate was probed first, and don't verify that others match. If the first node
> @@ -900,7 +911,7 @@ static void arch_timer_of_configure_rate(u32 rate, struct device_node *np)
>  		arch_timer_rate = rate;
>  
>  	/* Check the timer frequency. */
> -	if (arch_timer_rate == 0)
> +	if (validate_timer_rate())
>  		pr_warn("frequency not available\n");
>  }
>  
> @@ -1594,7 +1605,7 @@ static int __init arch_timer_acpi_init(struct acpi_table_header *table)
>  	 * CNTFRQ value. This *must* be correct.
>  	 */
>  	arch_timer_rate = arch_timer_get_cntfrq();
> -	if (!arch_timer_rate) {
> +	if (validate_timer_rate()) {
>  		pr_err(FW_BUG "frequency not available.\n");
>  		return -EINVAL;
>  	}
> ---
> 

Okay, I'll add this as a separate patch to the series and put you as
author. That is, if you want me to tie this check to this use case that
proves its usefulness. Otherwise it can stand on its own as well, if
you want to submit it separately.

Regarding the ratio computation for frequency invariance, where this
plays a role: I'll add a check and bail out if the ratio is 0, which I'm
ashamed not to have added before :).
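
Something along these lines, right after the ratio pre-computation (just
a sketch; max_freq stands for whichever maximum frequency source, in kHz,
v3 ends up using):

	ratio = (u64)arch_timer_get_rate() << (2 * SCHED_CAPACITY_SHIFT);
	ratio = div64_u64(ratio, max_freq * 1000);
	if (!ratio) {
		WARN_ONCE(1, "System timer frequency too low.\n");
		return;
	}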

Thanks,
Ionela.


> > Long story short, we're probably fine, but it would be nice to shove some of
> > the above into comments (especially the SCHED_CAPACITY_SCALE² trick)
> >
Valentin Schneider Jan. 30, 2020, 4:11 p.m. UTC | #11
On 30/01/2020 15:49, Ionela Voinescu wrote:
> Okay, I'll add this as a separate patch to the series and put you as
> author. That is, if you want me to tie this check to this use case that
> proves its usefulness. Otherwise it can stand on its own as well, if
> you want to submit it separately.
> 

It's pretty much standalone, but it does make sense to bundle it with this
series, I think. Feel free to grab ownership (I didn't test it) ;)

> Regarding the ratio computation for frequency invariance, where this
> plays a role: I'll add a check and bail out if the ratio is 0, which I'm
> ashamed not to have added before :).

That does sound like something we very much want to have.

> 
> Thanks,
> Ionela.
> 
> 
>>> Long story short, we're probably fine, but it would be nice to shove some of
>>> the above into comments (especially the SCHED_CAPACITY_SCALE² trick)
>>>
diff mbox series

Patch

diff --git a/arch/arm64/include/asm/topology.h b/arch/arm64/include/asm/topology.h
index a4d945db95a2..98412dd27565 100644
--- a/arch/arm64/include/asm/topology.h
+++ b/arch/arm64/include/asm/topology.h
@@ -19,6 +19,15 @@  int pcibus_to_node(struct pci_bus *bus);
 /* Replace task scheduler's default frequency-invariant accounting */
 #define arch_scale_freq_capacity topology_get_freq_scale
 
+#if defined(CONFIG_ARM64_AMU_EXTN) && defined(CONFIG_CPU_FREQ)
+void topology_scale_freq_tick(void);
+/*
+ * Replace task scheduler's default counter-based frequency-invariance
+ * scale factor setting.
+ */
+#define arch_scale_freq_tick topology_scale_freq_tick
+#endif
+
 /* Replace task scheduler's default cpu-invariant accounting */
 #define arch_scale_cpu_capacity topology_get_cpu_scale
 
diff --git a/arch/arm64/kernel/topology.c b/arch/arm64/kernel/topology.c
index fa9528dfd0ce..61f8264afec9 100644
--- a/arch/arm64/kernel/topology.c
+++ b/arch/arm64/kernel/topology.c
@@ -14,6 +14,7 @@ 
 #include <linux/acpi.h>
 #include <linux/arch_topology.h>
 #include <linux/cacheinfo.h>
+#include <linux/cpufreq.h>
 #include <linux/init.h>
 #include <linux/percpu.h>
 
@@ -120,4 +121,236 @@  int __init parse_acpi_topology(void)
 }
 #endif
 
+#if defined(CONFIG_ARM64_AMU_EXTN) && defined(CONFIG_CPU_FREQ)
 
+#undef pr_fmt
+#define pr_fmt(fmt) "AMU: " fmt
+
+static void init_fie_counters_done_workfn(struct work_struct *work);
+static DECLARE_WORK(init_fie_counters_done_work,
+		    init_fie_counters_done_workfn);
+
+static struct workqueue_struct *policy_amu_fie_init_wq;
+static struct workqueue_struct *cpu_amu_fie_init_wq;
+
+struct cpu_amu_work {
+	struct work_struct cpu_work;
+	struct work_struct policy_work;
+	unsigned int cpuinfo_max_freq;
+	struct cpumask policy_cpus;
+	bool cpu_amu_fie;
+};
+static struct cpu_amu_work __percpu *works;
+static cpumask_var_t cpus_to_visit;
+
+static DEFINE_PER_CPU_READ_MOSTLY(unsigned long, arch_max_freq_scale);
+static DEFINE_PER_CPU(u64, arch_const_cycles_prev);
+static DEFINE_PER_CPU(u64, arch_core_cycles_prev);
+DECLARE_PER_CPU(u8, amu_scale_freq);
+
+static void cpu_amu_fie_init_workfn(struct work_struct *work)
+{
+	u64 core_cnt, const_cnt, ratio;
+	struct cpu_amu_work *amu_work;
+	int cpu = smp_processor_id();
+
+	if (!cpu_has_amu_feat()) {
+		pr_debug("CPU%d: counters are not supported.\n", cpu);
+		return;
+	}
+
+	core_cnt = read_sysreg_s(SYS_AMEVCNTR0_CORE_EL0);
+	const_cnt = read_sysreg_s(SYS_AMEVCNTR0_CONST_EL0);
+
+	if (unlikely(!core_cnt || !const_cnt)) {
+		pr_err("CPU%d: cycle counters are not enabled.\n", cpu);
+		return;
+	}
+
+	amu_work = container_of(work, struct cpu_amu_work, cpu_work);
+	if (unlikely(!(amu_work->cpuinfo_max_freq))) {
+		pr_err("CPU%d: invalid maximum frequency.\n", cpu);
+		return;
+	}
+
+	/*
+	 * Pre-compute the fixed ratio between the frequency of the
+	 * constant counter and the maximum frequency of the CPU (hz).
+	 */
+	ratio = (u64)arch_timer_get_rate() << (2 * SCHED_CAPACITY_SHIFT);
+	ratio = div64_u64(ratio, amu_work->cpuinfo_max_freq * 1000);
+	this_cpu_write(arch_max_freq_scale, (unsigned long)ratio);
+
+	this_cpu_write(arch_core_cycles_prev, core_cnt);
+	this_cpu_write(arch_const_cycles_prev, const_cnt);
+	amu_work->cpu_amu_fie = true;
+}
+
+static void policy_amu_fie_init_workfn(struct work_struct *work)
+{
+	struct cpu_amu_work *amu_work;
+	u8 enable;
+	int cpu;
+
+	amu_work = container_of(work, struct cpu_amu_work, policy_work);
+
+	flush_workqueue(cpu_amu_fie_init_wq);
+
+	for_each_cpu(cpu, &amu_work->policy_cpus)
+		if (!(per_cpu_ptr(works, cpu)->cpu_amu_fie))
+			break;
+
+	enable = (cpu >= nr_cpu_ids) ? 1 : 0;
+
+	for_each_cpu(cpu, &amu_work->policy_cpus)
+		per_cpu(amu_scale_freq, cpu) = enable;
+
+	pr_info("CPUs[%*pbl]: counters %s be used for FIE.",
+		cpumask_pr_args(&amu_work->policy_cpus),
+		enable ? "will" : "WON'T");
+}
+
+static int init_fie_counters_callback(struct notifier_block *nb,
+				      unsigned long val,
+				      void *data)
+{
+	struct cpufreq_policy *policy = data;
+	struct cpu_amu_work *work;
+	int cpu;
+
+	if (val != CPUFREQ_CREATE_POLICY)
+		return 0;
+
+	/* Return if not all related CPUs are online */
+	if (!cpumask_equal(policy->cpus, policy->related_cpus)) {
+		pr_info("CPUs[%*pbl]: counters WON'T be used for FIE.",
+			cpumask_pr_args(policy->related_cpus));
+		return 0;
+	}
+
+	/*
+	 * Queue functions on all online CPUs from policy to:
+	 *  - Check support and enablement for AMU counters
+	 *  - Store system freq to max freq ratio per cpu
+	 *  - Flag CPU as valid for use of counters for FIE
+	 */
+	for_each_cpu(cpu, policy->cpus) {
+		work = per_cpu_ptr(works, cpu);
+		work->cpuinfo_max_freq = policy->cpuinfo.max_freq;
+		work->cpu_amu_fie = false;
+		INIT_WORK(&work->cpu_work, cpu_amu_fie_init_workfn);
+		queue_work_on(cpu, cpu_amu_fie_init_wq, &work->cpu_work);
+	}
+
+	/*
+	 * Queue function to validate support at policy level:
+	 *  - Flush all work on online policy CPUs
+	 *  - Verify that all online policy CPUs are flagged as
+	 *    valid for use of counters for FIE
+	 *  - Enable or disable use of counters for FIE on CPUs
+	 */
+	work = per_cpu_ptr(works, cpumask_first(policy->cpus));
+	cpumask_copy(&work->policy_cpus, policy->cpus);
+	INIT_WORK(&work->policy_work, policy_amu_fie_init_workfn);
+	queue_work(policy_amu_fie_init_wq, &work->policy_work);
+
+	cpumask_andnot(cpus_to_visit, cpus_to_visit, policy->cpus);
+	if (cpumask_empty(cpus_to_visit))
+		schedule_work(&init_fie_counters_done_work);
+
+	return 0;
+}
+
+static struct notifier_block init_fie_counters_notifier = {
+	.notifier_call = init_fie_counters_callback,
+};
+
+static void init_fie_counters_done_workfn(struct work_struct *work)
+{
+	cpufreq_unregister_notifier(&init_fie_counters_notifier,
+				    CPUFREQ_POLICY_NOTIFIER);
+
+	/*
+	 * Destroy policy_amu_fie_init_wq first to ensure all policy
+	 * work is finished, which includes flushing of the per-CPU
+	 * work, before cpu_amu_fie_init_wq is destroyed.
+	 */
+	destroy_workqueue(policy_amu_fie_init_wq);
+	destroy_workqueue(cpu_amu_fie_init_wq);
+
+	free_percpu(works);
+	free_cpumask_var(cpus_to_visit);
+}
+
+static int __init register_fie_counters_cpufreq_notifier(void)
+{
+	int ret = -ENOMEM;
+
+	if (!alloc_cpumask_var(&cpus_to_visit, GFP_KERNEL))
+		goto out;
+
+	cpumask_copy(cpus_to_visit, cpu_possible_mask);
+
+	cpu_amu_fie_init_wq = create_workqueue("cpu_amu_fie_init_wq");
+	if (!cpu_amu_fie_init_wq)
+		goto free_cpumask;
+
+	policy_amu_fie_init_wq = create_workqueue("policy_amu_fie_init_wq");
+	if (!policy_amu_fie_init_wq)
+		goto free_cpu_wq;
+
+	works = alloc_percpu(struct cpu_amu_work);
+	if (!works)
+		goto free_policy_wq;
+
+	ret = cpufreq_register_notifier(&init_fie_counters_notifier,
+					CPUFREQ_POLICY_NOTIFIER);
+	if (ret)
+		goto free_works;
+
+	return 0;
+
+free_works:
+	free_percpu(works);
+free_policy_wq:
+	destroy_workqueue(policy_amu_fie_init_wq);
+free_cpu_wq:
+	destroy_workqueue(cpu_amu_fie_init_wq);
+free_cpumask:
+	free_cpumask_var(cpus_to_visit);
+out:
+	return ret;
+}
+core_initcall(register_fie_counters_cpufreq_notifier);
+
+void topology_scale_freq_tick(void)
+{
+	u64 prev_core_cnt, prev_const_cnt;
+	u64 core_cnt, const_cnt, scale;
+
+	if (!this_cpu_read(amu_scale_freq))
+		return;
+
+	const_cnt = read_sysreg_s(SYS_AMEVCNTR0_CONST_EL0);
+	core_cnt = read_sysreg_s(SYS_AMEVCNTR0_CORE_EL0);
+	prev_const_cnt = this_cpu_read(arch_const_cycles_prev);
+	prev_core_cnt = this_cpu_read(arch_core_cycles_prev);
+
+	if (unlikely(core_cnt <= prev_core_cnt ||
+		     const_cnt <= prev_const_cnt))
+		goto store_and_exit;
+
+	scale = core_cnt - prev_core_cnt;
+	scale *= this_cpu_read(arch_max_freq_scale);
+	scale = div64_u64(scale >> SCHED_CAPACITY_SHIFT,
+			  const_cnt - prev_const_cnt);
+
+	scale = min_t(unsigned long, scale, SCHED_CAPACITY_SCALE);
+	this_cpu_write(freq_scale, (unsigned long)scale);
+
+store_and_exit:
+	this_cpu_write(arch_core_cycles_prev, core_cnt);
+	this_cpu_write(arch_const_cycles_prev, const_cnt);
+}
+
+#endif
diff --git a/drivers/base/arch_topology.c b/drivers/base/arch_topology.c
index 1eb81f113786..3ae6091d845e 100644
--- a/drivers/base/arch_topology.c
+++ b/drivers/base/arch_topology.c
@@ -23,12 +23,28 @@ 
 
 DEFINE_PER_CPU(unsigned long, freq_scale) = SCHED_CAPACITY_SCALE;
 
+#if defined(CONFIG_ARM64_AMU_EXTN) && defined(CONFIG_CPU_FREQ)
+DEFINE_PER_CPU_READ_MOSTLY(u8, amu_scale_freq);
+#endif
+
 void arch_set_freq_scale(struct cpumask *cpus, unsigned long cur_freq,
 			 unsigned long max_freq)
 {
 	unsigned long scale;
 	int i;
 
+#if defined(CONFIG_ARM64_AMU_EXTN) && defined(CONFIG_CPU_FREQ)
+	/*
+	 * This function will only be called from CPUFREQ drivers.
+	 * If the use of counters for FIE is enabled, check whether the
+	 * first CPU in the mask has valid, enabled counters (they are
+	 * validated per policy, so checking one CPU is enough). If so,
+	 * just return: the scale factor will be updated from
+	 * arch_scale_freq_tick instead of from cpufreq information.
+	 */
+	if (per_cpu(amu_scale_freq, cpumask_first(cpus)))
+		return;
+#endif
 	scale = (cur_freq << SCHED_CAPACITY_SHIFT) / max_freq;
 
 	for_each_cpu(i, cpus)