[v3,0/3] CPUFreq: Add support for cpu performance dependencies

Message ID: 20201102120115.29993-1-nicola.mazzucato@arm.com (mailing list archive)

Message

Nicola Mazzucato Nov. 2, 2020, 12:01 p.m. UTC
Hi All,

In this V3 posting I have replaced the new dt-binding with minor changes/
improvements to opp (since we are now reusing opp tables instead).
The RFC on how to make this info available to s/w consumers still stands.

In the RFC, I propose a simple addition of a performance-dependencies
cpumask in the CPUFreq core, with an example of how drivers and consumers
would make use of it.
I also propose an alternative approach, which does not require changes in
the CPUFreq core, but - possibly - in all the consumers.

This is to support systems where the exposed cpu performance controls are more
fine-grained than the platform's ability to scale cpus independently.

[v3]
  * Remove the proposal for a new 'cpu-performance-dependencies' binding,
    since we can reuse the opp table instead.
  * Update documentation for devicetree/bindings/opp
  * Minor changes within opp to support an empty opp table
  * Rework the RFC by adding a second proposal

[v2]
  * Fix errors when running make dt_binding_check
  * Improve commit message description for the dt-binding
  * Add RFC for implementation in cpufreq-core and one of its
    drivers.

Nicola Mazzucato (3):
  dt-bindings/opp: Update documentation for opp-shared
  opp/of: Allow empty opp-table with opp-shared
  [RFC] CPUFreq: Add support for cpu-perf-dependencies

 Documentation/devicetree/bindings/opp/opp.txt | 53 +++++++++++++++++++
 drivers/opp/of.c                              | 13 ++++-
 2 files changed, 64 insertions(+), 2 deletions(-)

Comments

Nicola Mazzucato Nov. 2, 2020, 12:01 p.m. UTC | #1
Hi All,

This is a continuation of the previous v2, where we focused mostly on the
dt binding.

I am seeking some feedback/comments on the following two approaches.

Intro:
We have seen that in a system where performance control and hardware
description do not match (i.e. per-cpu controls), we still need the information
of how the v/f lines are shared among the cpus. We call this information
"performance dependencies".
We get this info through opp-shared (the previous 2 patches aim for
that).

Problem:
How do we share such info (retrieved from a cpufreq driver) with other
consumers that rely on it? I have two proposals.

First proposal:
Having a new cpumask 'dependent_cpus' to represent the
CPU performance dependencies seems a viable and scalable way.

The main reason for a new cpumask is that, although 'related_cpus'
could be (or could have been) used for such a purpose, its meaning has
changed over time [1].

Provided that the cpufreq driver populates dependent_cpus and sets
shared_type accordingly, the s/w components that rely on such a description
(we focus on energy-model and cpufreq_cooling for now) will always be
given the correct information when picking the new cpumask.

Proposed changes (at a high level), three in total:

1) cpufreq: Add a new dependent_cpus cpumask in cpufreq_policy

   * New cpumask addition
   <snippet>
struct cpufreq_policy {
        cpumask_var_t           related_cpus; /* Online + Offline CPUs */
        cpumask_var_t           real_cpus; /* Related and present */

+       /*
+        * CPUs with hardware clk/perf dependencies
+        *
+        * For s/w components that rely on h/w info about clk dependencies
+        * when the h/w coordinates. This cpumask should always reflect the
+        * h/w dependencies.
+        */
+       cpumask_var_t           dependent_cpus;                 /* all clk-dependent cpus */
+
        unsigned int            shared_type; /* ACPI: ANY or ALL affected CPUs
   </snippet>

   * Fallback mechanism for dependent_cpus. With this, s/w components can
     always pick dependent_cpus regardless of the coordination type.
   <snippet>
static int cpufreq_online(unsigned int cpu)

                /* related_cpus should at least include policy->cpus. */
                cpumask_copy(policy->related_cpus, policy->cpus);
+
+               /* dependent_cpus should differ only when hw coordination is in place */
+               if (policy->shared_type != CPUFREQ_SHARED_TYPE_HW)
+                       cpumask_copy(policy->dependent_cpus, policy->cpus);
        }
   </snippet>

   * Add sysfs attribute for dependent_cpus
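     A minimal sketch of what this attribute could look like, mirroring the
     existing related_cpus attribute in drivers/cpufreq/cpufreq.c
     (illustrative only, not part of the posted patches):
   <snippet>
/* hypothetical: expose the new mask the same way related_cpus is exposed */
static ssize_t show_dependent_cpus(struct cpufreq_policy *policy, char *buf)
{
        return cpufreq_show_cpus(policy->dependent_cpus, buf);
}
cpufreq_freq_attr_ro(dependent_cpus);
   </snippet>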

2) drivers/thermal/cpufreq_cooling: Replace related_cpus with dependent_cpus

3) drivers/cpufreq/scmi-cpufreq: Get perf-dependencies from dev_pm_opp_of_get_sharing_cpus()

   * Call dev_pm_opp_of_get_sharing_cpus()

   * Populate policy->dependent_cpus if possible

   * Set policy->shared_type accordingly

   * Provide to EM the correct performance dependencies information
   <snippet>
-       em_dev_register_perf_domain(cpu_dev, nr_opp, &em_cb, policy->cpus);
+       /*
+        * EM needs accurate information about perf boundaries, thus provide
+        * the correct cpumask.
+        */
+       if (policy->shared_type == CPUFREQ_SHARED_TYPE_HW)
+               em_dev_register_perf_domain(cpu_dev, nr_opp, &em_cb,
+                                           policy->dependent_cpus);
+       else
+               em_dev_register_perf_domain(cpu_dev, nr_opp, &em_cb,
+                                           policy->cpus);
   </snippet>
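
   For completeness, a rough sketch of the driver-side population (assuming
   the new dependent_cpus field exists; illustrative only, not part of the
   posted patches):
   <snippet>
/* in scmi_cpufreq_init(): ask the OPP layer which CPUs share the v/f line */
ret = dev_pm_opp_of_get_sharing_cpus(cpu_dev, policy->dependent_cpus);
if (ret)
        /* no sharing info available: fall back to the policy cpus */
        cpumask_copy(policy->dependent_cpus, policy->cpus);
else if (!cpumask_equal(policy->dependent_cpus, policy->cpus))
        /* h/w coordination: dependencies are wider than the policy */
        policy->shared_type = CPUFREQ_SHARED_TYPE_HW;
   </snippet>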

Second proposal:

Another option could be for each driver to store internally the performance
dependencies and let the driver directly provide the correct cpumask to
any consumer.
Whilst this works fine for the energy model (see above), it is not the case
(currently) for cpufreq_cooling, thus we would need to add a new interface for
it. That way the driver can call the registration function directly.
This seems to be possible since the CPUFreq core can skip registering
cpufreq_cooling (flag CPUFREQ_IS_COOLING_DEV).
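
A rough sketch of this option for scmi-cpufreq (the perf_deps mask and the
helper name are hypothetical, purely for illustration):
   <snippet>
/*
 * Hypothetical: the driver keeps the dependency cpumask private and hands
 * it to consumers (EM here) itself, instead of going through cpufreq_policy.
 */
static void scmi_register_em(struct device *cpu_dev, int nr_opp,
                             struct em_data_callback *em_cb,
                             struct cpufreq_policy *policy,
                             struct cpumask *perf_deps)
{
        /* perf_deps was filled earlier via dev_pm_opp_of_get_sharing_cpus() */
        if (cpumask_empty(perf_deps))
                cpumask_copy(perf_deps, policy->cpus);

        em_dev_register_perf_domain(cpu_dev, nr_opp, em_cb, perf_deps);
}
   </snippet>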

In comparison with the first proposal this one is less scalable, since each
driver will have to call the registration functions itself, while in some
cases the CPUFreq core could do it.

Any comments/preferences?

Many Thanks
Nicola

[1] https://lkml.org/lkml/2020/9/24/399
Viresh Kumar Nov. 3, 2020, 10:18 a.m. UTC | #2
On 02-11-20, 12:01, Nicola Mazzucato wrote:
> Hi All,
> 
> In this V3 posting I have replaced the new dt-binding with minor changes/
> improvements for opp (since we are now using opp tables instead).
> The RFC still stands on how to make this info available to sw consumers.
> 
> In the RFC, I am proposing a simple addition of a performance dependencies
> cpumask in CPUFreq core and an example of how drivers and consumers would
> make use of it.
> I also propose an alternative approach, which does not require changes in
> CPUFreq core, but - possibly - in all the consumers.
> 
> This is to support systems where exposed cpu performance controls are more
> fine-grained that the platform's ability to scale cpus independently.

I was talking to Vincent about what you are doing here and we got a
bit confused, so here are a few questions that we had:

- Based on my previous understanding, you don't want software
  coordination of frequencies (which is done by cpufreq today), but
  want the hardware to do that and so you want per-cpu cpufreq
  policies.

- What's the real benefit of hardware coordination ? Want to make sure
  I fully understand that.

- Because of hardware co-ordination of otherwise software-co-ordinated CPUs,
  a few things break. Thermal and EAS are some of the examples, and so
  you are trying to fix them here by providing them with the missing
  information again.

- One other thing that breaks with this is freq-invariance in the
  scheduler, as the scheduler won't see the real frequencies the
  various CPUs are running at. Most of the hardware we have today
  doesn't have counters, like AMUs; not sure if all future ones based
  on SCMI will have them, so how are they going to be fixed ?

  And if we even have to fix this (freq invariance), what's hardware
  coordination giving us that makes all this worth it ?

Sorry about the long list :)
Nicola Mazzucato Nov. 4, 2020, 6:04 p.m. UTC | #3
Hi Viresh, thanks for looking into this.

On 11/3/20 10:18 AM, Viresh Kumar wrote:
> On 02-11-20, 12:01, Nicola Mazzucato wrote:
>> Hi All,
>>
>> In this V3 posting I have replaced the new dt-binding with minor changes/
>> improvements for opp (since we are now using opp tables instead).
>> The RFC still stands on how to make this info available to sw consumers.
>>
>> In the RFC, I am proposing a simple addition of a performance dependencies
>> cpumask in CPUFreq core and an example of how drivers and consumers would
>> make use of it.
>> I also propose an alternative approach, which does not require changes in
>> CPUFreq core, but - possibly - in all the consumers.
>>
>> This is to support systems where exposed cpu performance controls are more
>> fine-grained that the platform's ability to scale cpus independently.
> 
> I was talking to Vincent about what you are doing here and we got a
> bit confused and so here are few questions that we had:
> 
> - Based on my previous understanding, you don't want software
>   coordination of frequencies (which is done by cpufreq today), but
>   want the hardware to do that and so you want per-cpu cpufreq
>   policies.

Correct. And this has been done for quite some time in some platforms.

> 
> - What's the real benefit of hardware coordination ? Want to make sure
>   I fully understand that.

The hardware coordination that comes from having per-cpu cpufreq policies
is not new, and works just fine in most systems.

The benefit of having per-cpu controls is that the firmware will take care of
the performance of the entire system. It is purely a delegation of the
performance optimizations to firmware.

> 
> - Because of hardware co-ordination of otherwise co-ordinated CPUs,
>   few things break. Thermal and EAS are some of the examples and so
>   you are trying to fix them here by proving them the missing
>   information again.

Correct. And for this I have proposed two ways.

> 
> - One other thing that breaks with this is freq-invariance in the
>   scheduler, as the scheduler won't see the real frequencies the
>   various CPUs are running at. Most of the hardware we have today
>   doesn't have counters, like AMUs, not sure if all future ones based
>   on SCMI will have that too, so how are they gong to be fixed ?
> 

Correct. freq-invariance without counters tries to do its best based on the
information it has available. It definitely relies on knowledge of the v/f
domains to work at its best, so I think in the per-cpu case it will follow the
same approach as the other affected components (EAS, thermal).

>   And if we even have to fix this (freq invariance), what's hardware
>   coordination giving us that makes all this worth it ?

I suppose this is more a generic question for all the platforms running with h/w
coordination, but in our case the answer is that the f/w will take care of the
performance optimizations for us :)

> 
> Sorry about the long list :)

No problem at all. Thank you for your time on this and I hope I have made
things a bit clearer.

Nicola

>
Vincent Guittot Nov. 5, 2020, 2:25 p.m. UTC | #4
On Wed, 4 Nov 2020 at 19:02, Nicola Mazzucato <nicola.mazzucato@arm.com> wrote:
>
> Hi Viresh, thanks for looking into this.
>
> On 11/3/20 10:18 AM, Viresh Kumar wrote:
> > On 02-11-20, 12:01, Nicola Mazzucato wrote:
> >> Hi All,
> >>
> >> In this V3 posting I have replaced the new dt-binding with minor changes/
> >> improvements for opp (since we are now using opp tables instead).
> >> The RFC still stands on how to make this info available to sw consumers.
> >>
> >> In the RFC, I am proposing a simple addition of a performance dependencies
> >> cpumask in CPUFreq core and an example of how drivers and consumers would
> >> make use of it.
> >> I also propose an alternative approach, which does not require changes in
> >> CPUFreq core, but - possibly - in all the consumers.
> >>
> >> This is to support systems where exposed cpu performance controls are more
> >> fine-grained that the platform's ability to scale cpus independently.
> >
> > I was talking to Vincent about what you are doing here and we got a
> > bit confused and so here are few questions that we had:
> >
> > - Based on my previous understanding, you don't want software
> >   coordination of frequencies (which is done by cpufreq today), but
> >   want the hardware to do that and so you want per-cpu cpufreq
> >   policies.
>
> Correct. And this has been done for quite some time in some platforms.
>
> >
> > - What's the real benefit of hardware coordination ? Want to make sure
> >   I fully understand that.
>
> The hardware coordination that is coming out by having per-cpu cpufreq policies
> is not new, and works just fine in most of the systems.
>
> The benefit of having per-cpu controls is that the firmware will take care of
> the performance of the entire system. It is purely a delegation to firmware for
> the performance optimizations.
>
> >
> > - Because of hardware co-ordination of otherwise co-ordinated CPUs,
> >   few things break. Thermal and EAS are some of the examples and so
> >   you are trying to fix them here by proving them the missing
> >   information again.
>
> Correct. And for this I have proposed two ways.
>
> >
> > - One other thing that breaks with this is freq-invariance in the
> >   scheduler, as the scheduler won't see the real frequencies the
> >   various CPUs are running at. Most of the hardware we have today
> >   doesn't have counters, like AMUs, not sure if all future ones based
> >   on SCMI will have that too, so how are they gong to be fixed ?
> >
>
> Correct. freq-invariance without counters is trying to do its best based on the
> information it has available. It definitely relies on the knowledge of the v/f
> domains to work at its best so I think in the case of per-cpu it will follow the
> same approach as others being affected (EAS, thermal).

As frequency invariance has the same problem as EAS and Thermal, it would
be good to see the solution as part of this proposal, like for EAS and
Thermal.

>
> >   And if we even have to fix this (freq invariance), what's hardware
> >   coordination giving us that makes all this worth it ?
>
> I suppose this is more a generic question for all the platforms running with h/w
> coordination, but for our case is that the f/w will take care of the performance
> optimizations for us :)
>
> >
> > Sorry about the long list :)
>
> No problem at all. Thank you for your time on this and I hope I have made bits
> clearer.
>
> Nicola
>
> >
Ionela Voinescu Nov. 5, 2020, 3:46 p.m. UTC | #5
Hi guys,

On Thursday 05 Nov 2020 at 15:25:53 (+0100), Vincent Guittot wrote:
[..]
> > > - Because of hardware co-ordination of otherwise co-ordinated CPUs,
> > >   few things break. Thermal and EAS are some of the examples and so
> > >   you are trying to fix them here by proving them the missing
> > >   information again.
> >
> > Correct. And for this I have proposed two ways.
> >
> > >
> > > - One other thing that breaks with this is freq-invariance in the
> > >   scheduler, as the scheduler won't see the real frequencies the
> > >   various CPUs are running at. Most of the hardware we have today
> > >   doesn't have counters, like AMUs, not sure if all future ones based
> > >   on SCMI will have that too, so how are they gong to be fixed ?
> > >
> >
> > Correct. freq-invariance without counters is trying to do its best based on the
> > information it has available. It definitely relies on the knowledge of the v/f
> > domains to work at its best so I think in the case of per-cpu it will follow the
> > same approach as others being affected (EAS, thermal).
> 
> As frequency invariance has same problem as EAS and Thermal it would
> be good to see the solution as part of this proposal like EAS and
> Thermal
> 

I think I was waiting for a consensus on patch 3/3, although I believe the
discussion at [1] tended towards option 2: "each driver to store
internally the performance dependencies and let the driver directly
provide the correct cpumask for any consumer."
The alternative was option 1: "add a new dependent_cpus cpumask in
cpufreq_policy", as Nicola mentioned in the commit message for 3/3.

If the choice is clear, I'm happy to take the FIE fixes in a separate
set.

Thanks,
Ionela.

[1] https://lore.kernel.org/linux-arm-kernel/20200924095347.32148-3-nicola.mazzucato@arm.com/

> >
> > >   And if we even have to fix this (freq invariance), what's hardware
> > >   coordination giving us that makes all this worth it ?
> >
> > I suppose this is more a generic question for all the platforms running with h/w
> > coordination, but for our case is that the f/w will take care of the performance
> > optimizations for us :)
> >
> > >
> > > Sorry about the long list :)
> >
> > No problem at all. Thank you for your time on this and I hope I have made bits
> > clearer.
> >
> > Nicola
> >
> > >
Viresh Kumar Nov. 6, 2020, 9:20 a.m. UTC | #6
On 02-11-20, 12:01, Nicola Mazzucato wrote:
> This is a continuation of the previous v2, where we focused mostly on the
> dt binding.
> 
> I am seeking some feedback/comments on the following two approaches.
> 
> Intro:
> We have seen that in a system where performance control and hardware
> description do not match (i.e. per-cpu), we still need the information of
> how the v/f lines are shared among the cpus. We call this information
> "performance dependencies".
> We got this info through the opp-shared (the previous 2 patches aim for
> that).
> 
> Problem:
> How do we share such info (retrieved from a cpufreq driver) to other
> consumers that rely on it? I have two proposals.

I haven't really stopped thinking about what and how we should solve
this, but I have a few concerns first.

> 2) drivers/thermal/cpufreq_cooling: Replace related_cpus with dependent_cpus

I am not sure I completely understand how this is going to be
modified / how it will work.

The only use of related_cpus in the cooling driver is in the helper
cdev->get_requested_power(), where we need to find the total power
being consumed by devices controlled by the cooling device. Right ?

Now the cooling devices today are very closely related to the cpufreq
policy, the registration function itself takes a cpufreq policy as an
argument.

Consider that you have an octa-core platform and all the CPUs are
dependent on each other. With your suggested changes and hw control,
we will have different cpufreq policies for each CPU. And so we will
have a cooling device, cdev, for each CPU as well. When the IPA
governor calls cdev->get_requested_power(), why should we ever bother
to traverse the list of dependent_cpus and not related_cpus only ?

Otherwise the same CPU will have its load contributing to the power of
8 cooling devices.
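
For reference, the relevant loop looks roughly like this (paraphrased from
drivers/thermal/cpufreq_cooling.c, details trimmed), which is why summing over
dependent_cpus instead of related_cpus would count each CPU's load once per
cooling device:
   <snippet>
/* one cooling device sums the load of the CPUs of its own policy */
for_each_cpu(cpu, policy->related_cpus) {
        u32 load = cpu_online(cpu) ? get_load(cpufreq_cdev, cpu, i) : 0;

        total_load += load;
        i++;
}
   </snippet>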
Viresh Kumar Nov. 6, 2020, 9:24 a.m. UTC | #7
On 02-11-20, 12:01, Nicola Mazzucato wrote:
> Hi All,
> 
> This is a continuation of the previous v2, where we focused mostly on the
> dt binding.
> 
> I am seeking some feedback/comments on the following two approaches.
> 
> Intro:
> We have seen that in a system where performance control and hardware
> description do not match (i.e. per-cpu), we still need the information of
> how the v/f lines are shared among the cpus. We call this information
> "performance dependencies".
> We got this info through the opp-shared (the previous 2 patches aim for
> that).
> 
> Problem:
> How do we share such info (retrieved from a cpufreq driver) to other
> consumers that rely on it? I have two proposals.

FWIW, we already have a case in kernel where something similar is
done.

commit f4fd3797848a ("acpi-cpufreq: Add new sysfs attribute freqdomain_cpus")

and the information is shared via:

/sys/devices/system/cpu/cpu#/cpufreq/freqdomain_cpus
Lukasz Luba Nov. 6, 2020, 10:37 a.m. UTC | #8
Hi Viresh,

On 11/6/20 9:20 AM, Viresh Kumar wrote:
> On 02-11-20, 12:01, Nicola Mazzucato wrote:
>> This is a continuation of the previous v2, where we focused mostly on the
>> dt binding.
>>
>> I am seeking some feedback/comments on the following two approaches.
>>
>> Intro:
>> We have seen that in a system where performance control and hardware
>> description do not match (i.e. per-cpu), we still need the information of
>> how the v/f lines are shared among the cpus. We call this information
>> "performance dependencies".
>> We got this info through the opp-shared (the previous 2 patches aim for
>> that).
>>
>> Problem:
>> How do we share such info (retrieved from a cpufreq driver) to other
>> consumers that rely on it? I have two proposals.
> 
> I haven't really stop thinking about what and how we should solve
> this, but I have few concerns first.
> 
>> 2) drivers/thermal/cpufreq_cooling: Replace related_cpus with dependent_cpus
> 
> I am not sure if I understand completely on how this is going to be
> modified/work.
> 
> The only use of related_cpus in the cooling driver is in the helper
> cdev->get_requested_power(), where we need to find the total power
> being consumed by devices controlled by the cooling device. Right ?
> 
> Now the cooling devices today are very closely related to the cpufreq
> policy, the registration function itself takes a cpufreq policy as an
> argument.
> 
> Consider that you have an octa-core platform and all the CPUs are
> dependent on each other. With your suggested changes and hw control,
> we will have different cpufreq policies for each CPU. And so we will
> have a cooling device, cdev, for each CPU as well. When the IPA
> governor calls cdev->get_requested_power(), why should we ever bother
> to traverse the list of dependent_cpus and not related_cpus only ?
> 
> Otherwise the same CPU will have its load contributed to the power of
> 8 cooling devices.
> 

Good question.

How about a different interface for those cpufreq drivers?
That new registration API would allow specifying the cpumask.
Or it could rely on the EM cpumask: em_span_cpus(em).

Currently we have two ways to register a cooling device:
1. when the cpufreq driver sets the flag CPUFREQ_IS_COOLING_DEV, the core
will register the cooling device
2. the cpufreq driver can explicitly call the registration function:
cpufreq_cooling_register() with 'policy' as argument

That would need a substantial change to the cpufreq cooling code, from
policy-oriented to a custom driver-provided cpumask (like EM registration).
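
Purely as an illustration, such an API might take a shape like this (the
function below does not exist today, it is only a hypothetical sketch):
   <snippet>
/*
 * Hypothetical registration variant: the driver passes the cpumask the
 * cooling device should account for, instead of relying on
 * policy->related_cpus.
 */
struct thermal_cooling_device *
cpufreq_cooling_register_masked(struct cpufreq_policy *policy,
                                const struct cpumask *cpus);

/* e.g. a driver that has an EM could then register with the EM span:
 *     policy->cdev = cpufreq_cooling_register_masked(policy,
 *                                                    em_span_cpus(em));
 */
   </snippet>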

Regards,
Lukasz
Viresh Kumar Nov. 6, 2020, 10:55 a.m. UTC | #9
On 06-11-20, 10:37, Lukasz Luba wrote:
> Good question.
> 
> How about a different interface for those cpufreq drivers?
> That new registration API would allow to specify the cpumask.
> Or rely on EM cpumask: em_span_cpus(em)
> 
> Currently we have two ways to register cooling device:
> 1. when the cpufreq driver set a flag CPUFREQ_IS_COOLING_DEV, the core
> will register cooling device
> 2. cpufreq driver can explicitly call the registration function:
> cpufreq_cooling_register() with 'policy' as argument
> 
> That would need substantial change to the cpufreq cooling code, from
> policy oriented to custom driver's cpumask (like EM registration).

I am even wondering if we should really make that change. Why do we
need the combined load of the CPUs to be sent back to the IPA governor
? Why shouldn't they all do that (they == cdev) ?

This is a bit confusing to me, sorry about that. The cpufreq governors
look at all the CPUs' utilization and set the frequency based on
the highest utilization (and not the total util).

While in this case we present the total load of the CPUs to the IPA
(based on the current frequency of the CPUs), in response to which it
tells us the frequency at which all the CPUs of the policy can run
(I am not even sure if that is the right thing to do as the CPUs have
different loads). And how do we fit this dependent_cpus thing into
this ?

Sorry, I am not sure what's the right way of doing thing here.
Lukasz Luba Nov. 6, 2020, 11:14 a.m. UTC | #10
On 11/6/20 10:55 AM, Viresh Kumar wrote:
> On 06-11-20, 10:37, Lukasz Luba wrote:
>> Good question.
>>
>> How about a different interface for those cpufreq drivers?
>> That new registration API would allow to specify the cpumask.
>> Or rely on EM cpumask: em_span_cpus(em)
>>
>> Currently we have two ways to register cooling device:
>> 1. when the cpufreq driver set a flag CPUFREQ_IS_COOLING_DEV, the core
>> will register cooling device
>> 2. cpufreq driver can explicitly call the registration function:
>> cpufreq_cooling_register() with 'policy' as argument
>>
>> That would need substantial change to the cpufreq cooling code, from
>> policy oriented to custom driver's cpumask (like EM registration).
> 
> I am even wondering if we should really make that change. Why do we
> need the combined load of the CPUs to be sent back to the IPA governor
> ? Why shouldn't they all do that (they == cdev) ?
> 
> This is a bit confusing to me, sorry about that. The cpufreq governors
> take a look at all the CPUs utilization and set the frequency based on
> the highest utilization (and not the total util).
> 
> While in this case we present the total load of the CPUs to the IPA
> (based on the current frequency of the CPUs), in response to which it
> tells us the frequency at which all the CPUs of the policy can run at
> (I am not even sure if it is the right thing to do as the CPUs have
> different loads). And how do we fit this dependent_cpus thing into
> this.
> 
> Sorry, I am not sure what's the right way of doing thing here.
> 

I also had similar doubts, because if we make frequency requests
independently for each CPU, why not have N cooling devs, which
would independently set a max freq QoS for each of them...

What convinced me:
EAS and FIE would know the 'real' frequency of the cluster, IPA
can use it also and have only one cooling device per cluster.

We would like to keep this old style 'one cooling device per cpuset'.
I don't have a strong opinion, and if it appears that there are
some errors in freq estimation for the cluster, then maybe it does make
more sense to have a cdev per CPU...
Viresh Kumar Nov. 9, 2020, 6:57 a.m. UTC | #11
On 06-11-20, 11:14, Lukasz Luba wrote:
> I also had similar doubts, because if we make frequency requests
> independently for each CPU, why not having N cooling devs, which
> will set independently QoS max freq for them...
> 
> What convinced me:
> EAS and FIE would know the 'real' frequency of the cluster, IPA
> can use it also and have only one cooling device per cluster.
> 
> We would like to keep this old style 'one cooling device per cpuset'.
> I don't have strong opinion and if it would appear that there are
> some errors in freq estimation for cluster, then maybe it does make
> more sense to have cdev per CPU...

Let me rephrase my question. What is it that doesn't work _correctly_
with cdev per cpufreq policy in your case? What doesn't work well if
the thermal stuff keeps looking at only the related_cpus thing and not
the cpu-perf-dependencies thing?
Lukasz Luba Nov. 16, 2020, 11:33 a.m. UTC | #12
On 11/9/20 6:57 AM, Viresh Kumar wrote:
> On 06-11-20, 11:14, Lukasz Luba wrote:
>> I also had similar doubts, because if we make frequency requests
>> independently for each CPU, why not having N cooling devs, which
>> will set independently QoS max freq for them...
>>
>> What convinced me:
>> EAS and FIE would know the 'real' frequency of the cluster, IPA
>> can use it also and have only one cooling device per cluster.
>>
>> We would like to keep this old style 'one cooling device per cpuset'.
>> I don't have strong opinion and if it would appear that there are
>> some errors in freq estimation for cluster, then maybe it does make
>> more sense to have cdev per CPU...
> 
> Let me rephrase my question. What is it that doesn't work _correctly_
> with cdev per cpufreq policy in your case? What doesn't work well if
> the thermal stuff keeps looking at only the related_cpus thing and not
> the cpu-perf-dependencies thing?
> 

We don't have a platform with this kind of per-cpu freq request yet,
thus it's hard to answer your question. The EAS would work in 'old
style' - cluster mode. I don't know how IPA would work on such a HW
and SW configuration. To figure this out I need a real platform.
Viresh Kumar Nov. 17, 2020, 10:11 a.m. UTC | #13
On 16-11-20, 11:33, Lukasz Luba wrote:
> On 11/9/20 6:57 AM, Viresh Kumar wrote:
> > On 06-11-20, 11:14, Lukasz Luba wrote:
> > > I also had similar doubts, because if we make frequency requests
> > > independently for each CPU, why not having N cooling devs, which
> > > will set independently QoS max freq for them...
> > > 
> > > What convinced me:
> > > EAS and FIE would know the 'real' frequency of the cluster, IPA
> > > can use it also and have only one cooling device per cluster.
> > > 
> > > We would like to keep this old style 'one cooling device per cpuset'.
> > > I don't have strong opinion and if it would appear that there are
> > > some errors in freq estimation for cluster, then maybe it does make
> > > more sense to have cdev per CPU...
> > 
> > Let me rephrase my question. What is it that doesn't work _correctly_
> > with cdev per cpufreq policy in your case? What doesn't work well if
> > the thermal stuff keeps looking at only the related_cpus thing and not
> > the cpu-perf-dependencies thing?
> > 
> 
> We don't have a platform which would be this per-cpu freq request, yet.
> Thus it's hard to answer your question. The EAS would work in 'old
> style' - cluster mode. I don't know how IPA would work on such HW
> and SW configuration. To figure this out I need a real platform.

Hmm, so who are going to be the users of this new stuff (dependent
CPUs) ? I don't think cpufreq-cooling should be updated, unless there
is a compelling reason to.

The other one in energy model ? Why does it need this information ?

Who else ?
Nicola Mazzucato Nov. 17, 2020, 10:47 a.m. UTC | #14
Hi Viresh,


On 11/17/20 10:11 AM, Viresh Kumar wrote:
> On 16-11-20, 11:33, Lukasz Luba wrote:
>> On 11/9/20 6:57 AM, Viresh Kumar wrote:
>>> On 06-11-20, 11:14, Lukasz Luba wrote:
>>>> I also had similar doubts, because if we make frequency requests
>>>> independently for each CPU, why not having N cooling devs, which
>>>> will set independently QoS max freq for them...
>>>>
>>>> What convinced me:
>>>> EAS and FIE would know the 'real' frequency of the cluster, IPA
>>>> can use it also and have only one cooling device per cluster.
>>>>
>>>> We would like to keep this old style 'one cooling device per cpuset'.
>>>> I don't have strong opinion and if it would appear that there are
>>>> some errors in freq estimation for cluster, then maybe it does make
>>>> more sense to have cdev per CPU...
>>>
>>> Let me rephrase my question. What is it that doesn't work _correctly_
>>> with cdev per cpufreq policy in your case? What doesn't work well if
>>> the thermal stuff keeps looking at only the related_cpus thing and not
>>> the cpu-perf-dependencies thing?
>>>
>>
>> We don't have a platform which would be this per-cpu freq request, yet.
>> Thus it's hard to answer your question. The EAS would work in 'old
>> style' - cluster mode. I don't know how IPA would work on such HW
>> and SW configuration. To figure this out I need a real platform.
> 
> Hmm, so who are going to be the users of this new stuff (dependent
> CPUs) ?

In general, any platform that has hardware coordination in place and where
some components need to use the information.

> I don't think cpufreq-cooling should be updated, unless there
> is a compelling reason to.
> 
> The other one in energy model ? Why does it need this information ?

The reasons have probably got lost in the emails, but in a nutshell EM needs
accurate information on performance boundaries to achieve correct task placement.

> 
> Who else ?
> 

Freq-invariance has been mentioned. I suppose the fix will depend on which
strategy we prefer to solve this.

As a reminder, two solutions:
1) dependent_cpus cpumask in cpufreq and involved entities pick this info
or
2) dependent_cpus cpumask in driver but some entities' interfaces may need to change

Hope it helps,
Nicola
Lukasz Luba Nov. 17, 2020, 10:47 a.m. UTC | #15
On 11/17/20 10:11 AM, Viresh Kumar wrote:
> On 16-11-20, 11:33, Lukasz Luba wrote:
>> On 11/9/20 6:57 AM, Viresh Kumar wrote:
>>> On 06-11-20, 11:14, Lukasz Luba wrote:
>>>> I also had similar doubts, because if we make frequency requests
>>>> independently for each CPU, why not having N cooling devs, which
>>>> will set independently QoS max freq for them...
>>>>
>>>> What convinced me:
>>>> EAS and FIE would know the 'real' frequency of the cluster, IPA
>>>> can use it also and have only one cooling device per cluster.
>>>>
>>>> We would like to keep this old style 'one cooling device per cpuset'.
>>>> I don't have strong opinion and if it would appear that there are
>>>> some errors in freq estimation for cluster, then maybe it does make
>>>> more sense to have cdev per CPU...
>>>
>>> Let me rephrase my question. What is it that doesn't work _correctly_
>>> with cdev per cpufreq policy in your case? What doesn't work well if
>>> the thermal stuff keeps looking at only the related_cpus thing and not
>>> the cpu-perf-dependencies thing?
>>>
>>
>> We don't have a platform which would be this per-cpu freq request, yet.
>> Thus it's hard to answer your question. The EAS would work in 'old
>> style' - cluster mode. I don't know how IPA would work on such HW
>> and SW configuration. To figure this out I need a real platform.
> 
> Hmm, so who are going to be the users of this new stuff (dependent
> CPUs) ? I don't think cpufreq-cooling should be updated, unless there
> is a compelling reason to.

Fair enough. What if we come back with experiment results in the future?
(Currently we are trying to conduct experiments with Nicola on modified
Juno firmware and kernel.)

> 
> The other one in energy model ? Why does it need this information ?

The energy model needs this information because it's used during the
sched domain re-build. The sched domain is then used in the EAS main
functions: feec() [1] and compute_energy() [2].
What EAS normally does is 'try different options of putting a task
on different CPUs and observe the potential energy costs'.
For example, we need the 'cluster' frequency because, when you put a task on
a CPU, its freq might need to be set a bit higher. This would affect all
CPUs in the cluster, not only one, and we capture that energy cost. Then
we try to put the task on another CPU in that cluster and, if there appears
to be no need to raise the freq for the whole cluster, that option wins.
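
To illustrate (a simplified paraphrase of the scheduler's compute_energy();
cpu_util_of() is only a placeholder for the real utilization estimation):
   <snippet>
/* energy of a perf domain = f(max util, sum of utils over its span) */
static unsigned long pd_energy_sketch(struct em_perf_domain *pd)
{
        unsigned long max_util = 0, sum_util = 0;
        int cpu;

        /* the span must match the real v/f domain, or the cost is wrong */
        for_each_cpu_and(cpu, em_span_cpus(pd), cpu_online_mask) {
                unsigned long util = cpu_util_of(cpu);  /* placeholder */

                sum_util += util;
                max_util = max(max_util, util);
        }

        return em_cpu_energy(pd, max_util, sum_util);
}
   </snippet>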


> 
> Who else ?
> 

FIE..
Viresh Kumar Nov. 17, 2020, 10:53 a.m. UTC | #16
On 17-11-20, 10:47, Nicola Mazzucato wrote:
> Freq-invariance has been mentioned. I suppose the fix will depend on which
> strategy we prefer to solve this.

I am not sure how FIE will use this information, as I thought the
problem is about knowing the current frequency on a CPU instead of
which CPUs are related. Anyway, EM is good enough to get this stuff
going.

> As a reminder, two solutions:
> 1) dependent_cpus cpumask in cpufreq and involved entities pick this info

Yeah, this one. And it will be called freqdomain_cpus. Add support for
freqdomain_cpus in core, update acpi-cpufreq to reuse it instead of
adding its own and you can have it in other drivers then.
Viresh Kumar Nov. 17, 2020, 10:55 a.m. UTC | #17
On 17-11-20, 10:47, Lukasz Luba wrote:
> Fair enough. What if we come back with experiments results in future?
> Currently we are trying to conduct experiments with Nicola on modified Juno
> firmware and kernel)

Sure. If we think it improves things, why not. I just wanted to make
sure we understand why we would make this change now.
Rafael J. Wysocki Nov. 17, 2020, 1:06 p.m. UTC | #18
On Tue, Nov 17, 2020 at 11:53 AM Viresh Kumar <viresh.kumar@linaro.org> wrote:
>
> On 17-11-20, 10:47, Nicola Mazzucato wrote:
> > Freq-invariance has been mentioned. I suppose the fix will depend on which
> > strategy we prefer to solve this.
>
> I am not sure how FIE will use this information, as I thought the
> problem is about knowing the current frequency on a CPU instead of
> which CPUs are related. Anyway, EM is good enough to get this stuff
> going.
>
> > As a reminder, two solutions:
> > 1) dependent_cpus cpumask in cpufreq and involved entities pick this info
>
> Yeah, this one. And it will be called freqdomain_cpus. Add support for
> freqdomain_cpus in core, update acpi-cpufreq to reuse it instead of
> adding its own and you can have it in other drivers then.

Is this really a cpufreq thing, though, or is it arch stuff?  I think
the latter, because it is not necessary for anything in cpufreq.

Yes, acpi-cpufreq happens to know this information, because it uses
processor_perflib, but the latter may as well be used by the arch
enumeration of CPUs and the freqdomain_cpus mask may be populated from
there.

As far as cpufreq is concerned, if the interface to the hardware is
per-CPU, there is one CPU per policy and cpufreq has no business
knowing anything about the underlying hardware coordination.
Viresh Kumar Nov. 18, 2020, 4:42 a.m. UTC | #19
On 17-11-20, 14:06, Rafael J. Wysocki wrote:
> Is this really a cpufreq thing, though, or is it arch stuff?  I think
> the latter, because it is not necessary for anything in cpufreq.
> 
> Yes, acpi-cpufreq happens to know this information, because it uses
> processor_perflib, but the latter may as well be used by the arch
> enumeration of CPUs and the freqdomain_cpus mask may be populated from
> there.
> 
> As far as cpufreq is concerned, if the interface to the hardware is
> per-CPU, there is one CPU per policy and cpufreq has no business
> knowing anything about the underlying hardware coordination.

It won't be used by cpufreq for now at least, and yes, I understand your
concern. I opted for this because we already have a cpufreq
implementation for the same thing, and it is usually better to reuse
this kind of stuff instead of reinventing it.
Rafael J. Wysocki Nov. 18, 2020, noon UTC | #20
On Wed, Nov 18, 2020 at 5:42 AM Viresh Kumar <viresh.kumar@linaro.org> wrote:
>
> On 17-11-20, 14:06, Rafael J. Wysocki wrote:
> > Is this really a cpufreq thing, though, or is it arch stuff?  I think
> > the latter, because it is not necessary for anything in cpufreq.
> >
> > Yes, acpi-cpufreq happens to know this information, because it uses
> > processor_perflib, but the latter may as well be used by the arch
> > enumeration of CPUs and the freqdomain_cpus mask may be populated from
> > there.
> >
> > As far as cpufreq is concerned, if the interface to the hardware is
> > per-CPU, there is one CPU per policy and cpufreq has no business
> > knowing anything about the underlying hardware coordination.
>
> It won't be used by cpufreq for now at least and yes I understand your
> concern. I opted for this because we already have a cpufreq
> implementation for the same thing and it is usually better to reuse
> this kind of stuff instead of inventing it over.

Do you mean related_cpus and real_cpus?

That's the granularity of the interface to the hardware I'm talking about.

Strictly speaking, it means "these CPUs share a HW interface for perf
control" and it need not mean "these CPUs are in the same
clock/voltage domain".  Specifically, it need not mean "these CPUs are
the only CPUs in the given clock/voltage domain".  That's what it
means when the control is exercised by manipulating OPPs directly, but
not in general.

In the ACPI case, for example, what the firmware tells you need not
reflect the HW topology in principle.  It only tells you whether or
not it wants you to coordinate a given group of CPUs and in what way,
but this may not be the whole picture from the HW perspective.  If you
need the latter, you need more information in general (at least you
need to assume that what the firmware tells you actually does reflect
the HW topology on the given SoC).

So yes, in the particular case of OPP-based perf control, cpufreq
happens to have the same information that is needed by the other
subsystems, but otherwise it may not and what I'm saying is that it
generally is a mistake to expect cpufreq to have that information or
to be able to obtain it without the help of the arch/platform code.
Hence, it would be a mistake to design an interface based on that
expectation.

Or looking at it from a different angle, today a cpufreq driver is
only required to specify the granularity of the HW interface for perf
control via related_cpus.  It is not required to obtain extra
information beyond that.  If a new mask to be populated by it is
added, the driver may need to do more work which is not necessary from
the perf control perspective.  That doesn't look particularly clean to
me.

Moreover, adding such a mask to cpufreq_policy would make the users of
it depend on cpufreq sort of artificially, which need not be useful
even.

IMO, the information needed by all of the subsystems in question
should be obtained and made available at the arch/platform level and
everyone who needs it should be able to access it from there,
including the cpufreq driver for the given platform if that's what it
needs to do.

BTW, cpuidle may need the information in question too, so why should
it be provided via cpufreq rather than via cpuidle?
Viresh Kumar Nov. 19, 2020, 6:40 a.m. UTC | #21
On 18-11-20, 13:00, Rafael J. Wysocki wrote:
> On Wed, Nov 18, 2020 at 5:42 AM Viresh Kumar <viresh.kumar@linaro.org> wrote:
> >
> > On 17-11-20, 14:06, Rafael J. Wysocki wrote:
> > > Is this really a cpufreq thing, though, or is it arch stuff?  I think
> > > the latter, because it is not necessary for anything in cpufreq.
> > >
> > > Yes, acpi-cpufreq happens to know this information, because it uses
> > > processor_perflib, but the latter may as well be used by the arch
> > > enumeration of CPUs and the freqdomain_cpus mask may be populated from
> > > there.
> > >
> > > As far as cpufreq is concerned, if the interface to the hardware is
> > > per-CPU, there is one CPU per policy and cpufreq has no business
> > > knowing anything about the underlying hardware coordination.
> >
> > It won't be used by cpufreq for now at least and yes I understand your
> > concern. I opted for this because we already have a cpufreq
> > implementation for the same thing and it is usually better to reuse
> > this kind of stuff instead of inventing it over.
> 
> Do you mean related_cpus and real_cpus?

Sorry about the confusion, I meant freqdomain_cpus only.

> That's the granularity of the interface to the hardware I'm talking about.
> 
> Strictly speaking, it means "these CPUs share a HW interface for perf
> control" and it need not mean "these CPUs are in the same
> clock/voltage domain".  Specifically, it need not mean "these CPUs are
> the only CPUs in the given clock/voltage domain".  That's what it
> means when the control is exercised by manipulating OPPs directly, but
> not in general.
> 
> In the ACPI case, for example, what the firmware tells you need not
> reflect the HW topology in principle.  It only tells you whether or
> not it wants you to coordinate a given group of CPUs and in what way,
> but this may not be the whole picture from the HW perspective.  If you
> need the latter, you need more information in general (at least you
> need to assume that what the firmware tells you actually does reflect
> the HW topology on the given SoC).
> 
> So yes, in the particular case of OPP-based perf control, cpufreq
> happens to have the same information that is needed by the other
> subsystems, but otherwise it may not and what I'm saying is that it
> generally is a mistake to expect cpufreq to have that information or
> to be able to obtain it without the help of the arch/platform code.
> Hence, it would be a mistake to design an interface based on that
> expectation.
> 
> Or looking at it from a different angle, today a cpufreq driver is
> only required to specify the granularity of the HW interface for perf
> control via related_cpus.  It is not required to obtain extra
> information beyond that.  If a new mask to be populated by it is
> added, the driver may need to do more work which is not necessary from
> the perf control perspective.  That doesn't look particularly clean to
> me.
> 
> Moreover, adding such a mask to cpufreq_policy would make the users of
> it depend on cpufreq sort of artificially, which need not be useful
> even.
> 
> IMO, the information needed by all of the subsystems in question
> should be obtained and made available at the arch/platform level and
> everyone who needs it should be able to access it from there,
> including the cpufreq driver for the given platform if that's what it
> needs to do.
> 
> BTW, cpuidle may need the information in question too, so why should
> it be provided via cpufreq rather than via cpuidle?

Right.
Viresh Kumar Nov. 19, 2020, 6:43 a.m. UTC | #22
On 02-11-20, 12:01, Nicola Mazzucato wrote:
> Second proposal:
> 
> Another option could be for each driver to store internally the performance
> dependencies and let the driver directly provide the correct cpumask for
> any consumer.

From the discussion that happened in this thread, looks like we are
going to do it.

> Whilst this works fine for energy model (see above), it is not the case

Good.

> (currently) for cpufreq_cooling, thus we would need to add a new interface for

And this is not required for now as we discussed.