[v2,0/2] CPUFreq: Add support for cpu performance dependencies

Message ID	20200924095347.32148-1-nicola.mazzucato@arm.com (mailing list archive)
Headers	show Return-Path: <SRS0=e+u9=DB=vger.kernel.org=linux-pm-owner@kernel.org> From: Nicola Mazzucato <nicola.mazzucato@arm.com> To: linux-kernel@vger.kernel.org, linux-arm-kernel@lists.infradead.org, devicetree@vger.kernel.org, linux-pm@vger.kernel.org, sudeep.holla@arm.com, rjw@rjwysocki.net, vireshk@kernel.org, robh+dt@kernel.org, daniel.lezcano@linaro.org Cc: morten.rasmussen@arm.com, chris.redpath@arm.com, nicola.mazzucato@arm.com Subject: [PATCH v2 0/2] CPUFreq: Add support for cpu performance dependencies Date: Thu, 24 Sep 2020 10:53:45 +0100 Message-Id: <20200924095347.32148-1-nicola.mazzucato@arm.com> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit Precedence: bulk
Series	CPUFreq: Add support for cpu performance dependencies \| expand [v2,0/2] CPUFreq: Add support for cpu performance dependencies [v2,1/2] dt-bindings: arm: Add devicetree binding for cpu-performance-dependencies

Nicola Mazzucato Sept. 24, 2020, 9:53 a.m. UTC

Hi Rob, Rafael, Viresh, Daniel and Sudeep,

I am proposing the addition of performance dependencies in cpufreq core
and one if its drivers (the implementation/approach is RFC).
This is to support systems where exposed cpu performance controls are more
fine-grained that the platform's ability to scale cpus independently.

In this V2 posting of the proposed dt-binding to support this, I have
included an outline of how this information can be stored and passed onto
relevant OSPM frameworks.

Please see each patch for detailed information.

[v2]
  * Fix errors when running make dt_binding_check
  * Improve commit message description for the dt-binding
  * Add RFC for implementation in cpufreq-core and one of its
    drivers.

Nicola Mazzucato (2):
  dt-bindings: arm: Add devicetree binding for
    cpu-performance-dependencies
  [RFC] CPUFreq: Add support for cpu-perf-dependencies

 .../bindings/arm/cpu-perf-dependencies.yaml   | 48 +++++++++++++++++++
 1 file changed, 48 insertions(+)
 create mode 100644 Documentation/devicetree/bindings/arm/cpu-perf-dependencies.yaml

Nicola Mazzucato Sept. 24, 2020, 9:53 a.m. UTC | #1

I am seeking some feedback/comments on the following approach.

Intro:
Info of performance depency for cpus will be beneficial for systems
where f/w description of the CPU performance control domain is different
from the clock domain, e.g. per-CPU control with multiple CPUs sharing
clock, and kernel OSPM s/w components need to take CPU performance
dependency into account.
Essentially these s/w components will have to be provided with
this information from dt and this RFC is presenting a possible way
to do so.

Little details about implementation are given, as this RFC aims to
present the overall approach.

Proposal:
The cpufreq framework currently assumes that a policy covers a group of
CPUs that are controlled together. The energy model and thermal
frameworks assume that the policy cpumask describes performance
dependency relation. This assumption is no longer generally valid, so we
need a way to represent both control and performance relation in cpufreq.

The proposal is to have one cpufreq_policy instance per control domain,
and have a new cpumask 'dependent_cpus' to the policy to represent the
CPU performance dependencies.

The main reason for a new cpumaks is that although 'related_cpus'
could be (or could have been) used for such purpose, its meaning has
changed over time. Initially it was designed specifically for this
purpose[1], but eventually it has changed to online + offline cpus when
sw coordination in use [2,3].

There is also a 'shared_type' field in cpufreq_policy which provides
info about coordination type (NONE, SW_ANY, SW_ALL, HW). Currently it's in
use only for ACPI but I assume it can be used to indicate the coordination
type even out of ACPI itself. Currently there is no use of TYPE_HW.

Provided that the cpufreq driver will populate dependent_cpus and
set shared_type, the s/w components that rely on such description (we
focus on energy-model and cpufreq_cooling for now) will always be provided
with the correct information, when picking the new cpumask.

Proposed changes (at high level)(4):

1) cpufreq: Add new dependent_cpus cpumaks in cpufreq_policy

   * New cpumask addition
   <snippet>
struct cpufreq_policy {
        cpumask_var_t           related_cpus; /* Online + Offline CPUs */
        cpumask_var_t           real_cpus; /* Related and present */

+       /*
+        * CPUs with hardware clk/perf dependencies
+        *
+        * For sw components that rely on h/w info of clk dependencies when hw
+        * coordinates. This cpumask should always reflect the hw dependencies.
+        */
+       cpumask_var_t           dependent_cpus;                 /* all clk-dependent cpus */
+
        unsigned int            shared_type; /* ACPI: ANY or ALL affected CPUs
   </snippet>

   * Fallback mechanism for dependent_cpus. With this, s/w components can
     always pick dependent_cpus regardless the coordination type.
   <snippet>
static int cpufreq_online(unsigned int cpu)

                /* related_cpus should at least include policy->cpus. */
                cpumask_copy(policy->related_cpus, policy->cpus);
+
+               /* dependent_cpus should differ only when hw coordination is in place */
+               if (policy->shared_type != CPUFREQ_SHARED_TYPE_HW)
+                       cpumask_copy(policy->dependent_cpus, policy->cpus);
        }
   </snippet>

   * Add sysfs attribute for dependent_cpus

2) drivers/thermal/cpufreq_cooling: Replace related_cpus with dependent_cpus

3) drivers/firmware/arm_scmi/perf.c: Parse dt for `cpu-performance-dependencies`

   * Parse dt for `cpu-performance-dependencies` optional node

   * Store internally performance dependencies

   * Add api to get depedent_cpus if required

4) drivers/cpufreq/scmi-cpufreq: Register EM device with the proper cpumask

   * Check for performance dependencies and get dependent_cpus

   * Set policy->shared_type accordingly

   * Provide to EM the correct performance dependencies information
   <snippet>
static int scmi_cpufreq_init(struct cpufreq_policy *policy)
        policy->fast_switch_possible =
                handle->perf_ops->fast_switch_possible(handle, cpu_dev);

-       em_dev_register_perf_domain(cpu_dev, nr_opp, &em_cb, policy->cpus);
+       /*
+        * EM needs accurate information about clk boundaries, thus provide the
+        * correct cpumask.
+        */
+       if (handle->perf_ops->has_perf_deps(handle))
+               em_dev_register_perf_domain(cpu_dev, nr_opp, &em_cb,
+                                       policy->dependent_cpus);
+       else
+               em_dev_register_perf_domain(cpu_dev, nr_opp, &em_cb,
+                                           policy->cpus);
   </snippet>

Any other suggestions are welcome.

Thanks
Nicola

[1] 'commit e8628dd06d66 ("[CPUFREQ] expose cpufreq coordination requirements regardless of coordination mechanism")'
[2] 'commit 951fc5f45836 ("cpufreq: Update Documentation for cpus and related_cpus")'
[3] 'commit f4fd3797848a ("acpi-cpufreq: Add new sysfs attribute freqdomain_cpus")'

Viresh Kumar Oct. 6, 2020, 7:19 a.m. UTC | #2

On 24-09-20, 10:53, Nicola Mazzucato wrote:
> I am seeking some feedback/comments on the following approach.
> 
> Intro:
> Info of performance depency for cpus will be beneficial for systems
> where f/w description of the CPU performance control domain is different
> from the clock domain, e.g. per-CPU control with multiple CPUs sharing
> clock, and kernel OSPM s/w components need to take CPU performance
> dependency into account.
> Essentially these s/w components will have to be provided with
> this information from dt and this RFC is presenting a possible way
> to do so.

I am not sure I understand what performance control mean here. Can you please
elaborate a bit more on that ? For example, with current code and understanding,
a cpufreq policy belongs to a group of CPUs which change their frequency
together, which also mean that they change their performance level together and
so I am not able to understand what's going on here. Sorry about that.

What kind of hardware configuration doesn't work with this ?

Nicola Mazzucato Oct. 7, 2020, 12:58 p.m. UTC | #3

Hi Viresh,

performance controls is what is exposed by the firmware through a protocol that
is not capable of describing hardware (say SCMI). For example, the firmware can
tell that the platform has N controls, but it can't say to which hardware they
are "wired" to. This is done in dt, where, for example, we map these controls
to cpus, gpus, etc.

Let's focus on cpus.

Normally we would have N of performance controls (what comes from f/w)
that that correspond to hardware clock/dvfs domains.

However, some firmware implementations might benefit from having finer
grained information about the performance requirements (e.g.
per-CPU) and therefore choose to present M performance controls to the
OS. DT would be adjusted accordingly to "wire" these controls to cpus
or set of cpus.
In this scenario, the f/w will make aggregation decisions based on the
requests it receives on these M controls.

Here we would have M cpufreq policies which do not necessarily reflect the
underlying clock domains, thus some s/w components will underperform
(EAS and thermal, for example).

A real example would be a platform in which the firmware describes the system
having M per-cpu control, and the cpufreq subsystem will have M policies while
in fact these cpus are "performance-dependent" each other (e.g. are in the same
clock domain). This performance dependency information is essential for some
components that take information from the cpufreq policy.

To restore functionality we can use the optional node
'cpu-performance-dependencies' in dt which will provide such dependency
information and we can add a new cpumask 'dependency_cpus' in policy.

Hope it gives some clarity.

Nicola

On 10/6/20 8:19 AM, Viresh Kumar wrote:
> On 24-09-20, 10:53, Nicola Mazzucato wrote:
>> I am seeking some feedback/comments on the following approach.
>>
>> Intro:
>> Info of performance depency for cpus will be beneficial for systems
>> where f/w description of the CPU performance control domain is different
>> from the clock domain, e.g. per-CPU control with multiple CPUs sharing
>> clock, and kernel OSPM s/w components need to take CPU performance
>> dependency into account.
>> Essentially these s/w components will have to be provided with
>> this information from dt and this RFC is presenting a possible way
>> to do so.
> 
> I am not sure I understand what performance control mean here. Can you please
> elaborate a bit more on that ? For example, with current code and understanding,
> a cpufreq policy belongs to a group of CPUs which change their frequency
> together, which also mean that they change their performance level together and
> so I am not able to understand what's going on here. Sorry about that.
> 
> What kind of hardware configuration doesn't work with this ?
>

Viresh Kumar Oct. 8, 2020, 11:02 a.m. UTC | #4

On 07-10-20, 13:58, Nicola Mazzucato wrote:
> Hi Viresh,
> 
> performance controls is what is exposed by the firmware through a protocol that
> is not capable of describing hardware (say SCMI). For example, the firmware can
> tell that the platform has N controls, but it can't say to which hardware they
> are "wired" to. This is done in dt, where, for example, we map these controls
> to cpus, gpus, etc.
> 
> Let's focus on cpus.
> 
> Normally we would have N of performance controls (what comes from f/w)
> that that correspond to hardware clock/dvfs domains.
> 
> However, some firmware implementations might benefit from having finer
> grained information about the performance requirements (e.g.
> per-CPU) and therefore choose to present M performance controls to the
> OS. DT would be adjusted accordingly to "wire" these controls to cpus
> or set of cpus.
> In this scenario, the f/w will make aggregation decisions based on the
> requests it receives on these M controls.
> 
> Here we would have M cpufreq policies which do not necessarily reflect the
> underlying clock domains, thus some s/w components will underperform
> (EAS and thermal, for example).
> 
> A real example would be a platform in which the firmware describes the system
> having M per-cpu control, and the cpufreq subsystem will have M policies while
> in fact these cpus are "performance-dependent" each other (e.g. are in the same
> clock domain).

If the CPUs are in the same clock domain, they must be part of the
same cpufreq policy.

> This performance dependency information is essential for some
> components that take information from the cpufreq policy.
> 
> To restore functionality we can use the optional node
> 'cpu-performance-dependencies' in dt which will provide such dependency
> information and we can add a new cpumask 'dependency_cpus' in policy.
> 
> Hope it gives some clarity.

Some, but I am still confused :(

Can you give a real example, with exact number of CPUs, how they share
clocks/voltage domains and what else the firmware needs in terms of
performance-domains ? That may make it easier for me to understand it.

Ionela Voinescu Oct. 8, 2020, 3:03 p.m. UTC | #5

Hi Viresh,

On Thursday 08 Oct 2020 at 16:32:41 (+0530), Viresh Kumar wrote:
> On 07-10-20, 13:58, Nicola Mazzucato wrote:
> > Hi Viresh,
> > 
> > performance controls is what is exposed by the firmware through a protocol that
> > is not capable of describing hardware (say SCMI). For example, the firmware can
> > tell that the platform has N controls, but it can't say to which hardware they
> > are "wired" to. This is done in dt, where, for example, we map these controls
> > to cpus, gpus, etc.
> > 
> > Let's focus on cpus.
> > 
> > Normally we would have N of performance controls (what comes from f/w)
> > that that correspond to hardware clock/dvfs domains.
> > 
> > However, some firmware implementations might benefit from having finer
> > grained information about the performance requirements (e.g.
> > per-CPU) and therefore choose to present M performance controls to the
> > OS. DT would be adjusted accordingly to "wire" these controls to cpus
> > or set of cpus.
> > In this scenario, the f/w will make aggregation decisions based on the
> > requests it receives on these M controls.
> > 
> > Here we would have M cpufreq policies which do not necessarily reflect the
> > underlying clock domains, thus some s/w components will underperform
> > (EAS and thermal, for example).
> > 
> > A real example would be a platform in which the firmware describes the system
> > having M per-cpu control, and the cpufreq subsystem will have M policies while
> > in fact these cpus are "performance-dependent" each other (e.g. are in the same
> > clock domain).
> 
> If the CPUs are in the same clock domain, they must be part of the
> same cpufreq policy.

But cpufreq does not currently support HW_ALL (I'm using the ACPI
coordination type to describe the generic scenario of using hardware
aggregation and coordination when establishing the clock rate of CPUs).

Adding support for HW_ALL* will involve either bypassing some
assumptions around cpufreq policies or making core cpufreq changes.

In the way I see it, support for HW_ALL involves either:

 - (a) Creating per-cpu policies in order to allow each of the CPUs to
   send their own frequency request to the hardware which will do
   aggregation and clock rate decision at the level of the clock
   domain. The PSD domains (ACPI) and the new DT binding will tell
   which CPUs are actually in the same clock domain for whomever is
   interested, despite those CPUs not being in the same policy.
   This requires the extra mask that Nicola introduced.

 - (b) Making deep changes to cpufreq (core/governors/drivers) to allow:
   - Governors to stop aggregating (usually max) the information
     for each of the CPUs in the policy and convey to the core
     information for each CPU.
   - Cpufreq core to be able to receive and pass this information
     down to the drivers.
   - Drivers to be able to have some per cpu structures to hold
     frequency control (let's say SCP fast channel addresses) for
     each of the CPUs in the policy. Or have these structures in the
     cpufreq core/policy, to avoid code duplication in drivers.

Therefore (a) is the least invasive but we'll be bypassing the rule
above. But to make that rule stick we'll have to make invasive cpufreq
changes (b).

This is my current understanding and I'm leaning towards (a). What do
you think?

*in not so many words, this is what these patches are trying to propose,
while also making sure it's supported for both ACPI and DT.

BTW, thank you for your effort in making sense of this!

Regards,
Ionela.

Rafael J. Wysocki Oct. 8, 2020, 3:57 p.m. UTC | #6

On Thu, Oct 8, 2020 at 5:03 PM Ionela Voinescu <ionela.voinescu@arm.com> wrote:
>
> Hi Viresh,
>
> On Thursday 08 Oct 2020 at 16:32:41 (+0530), Viresh Kumar wrote:
> > On 07-10-20, 13:58, Nicola Mazzucato wrote:
> > > Hi Viresh,
> > >
> > > performance controls is what is exposed by the firmware through a protocol that
> > > is not capable of describing hardware (say SCMI). For example, the firmware can
> > > tell that the platform has N controls, but it can't say to which hardware they
> > > are "wired" to. This is done in dt, where, for example, we map these controls
> > > to cpus, gpus, etc.
> > >
> > > Let's focus on cpus.
> > >
> > > Normally we would have N of performance controls (what comes from f/w)
> > > that that correspond to hardware clock/dvfs domains.
> > >
> > > However, some firmware implementations might benefit from having finer
> > > grained information about the performance requirements (e.g.
> > > per-CPU) and therefore choose to present M performance controls to the
> > > OS. DT would be adjusted accordingly to "wire" these controls to cpus
> > > or set of cpus.
> > > In this scenario, the f/w will make aggregation decisions based on the
> > > requests it receives on these M controls.
> > >
> > > Here we would have M cpufreq policies which do not necessarily reflect the
> > > underlying clock domains, thus some s/w components will underperform
> > > (EAS and thermal, for example).
> > >
> > > A real example would be a platform in which the firmware describes the system
> > > having M per-cpu control, and the cpufreq subsystem will have M policies while
> > > in fact these cpus are "performance-dependent" each other (e.g. are in the same
> > > clock domain).
> >
> > If the CPUs are in the same clock domain, they must be part of the
> > same cpufreq policy.
>
> But cpufreq does not currently support HW_ALL (I'm using the ACPI
> coordination type to describe the generic scenario of using hardware
> aggregation and coordination when establishing the clock rate of CPUs).
>
> Adding support for HW_ALL* will involve either bypassing some
> assumptions around cpufreq policies or making core cpufreq changes.
>
> In the way I see it, support for HW_ALL involves either:
>
>  - (a) Creating per-cpu policies in order to allow each of the CPUs to
>    send their own frequency request to the hardware which will do
>    aggregation and clock rate decision at the level of the clock
>    domain.

This has been done for years on many platforms.

>    The PSD domains (ACPI) and the new DT binding will tell
>    which CPUs are actually in the same clock domain for whomever is
>    interested, despite those CPUs not being in the same policy.

And this information hasn't been used so far in those cases.

>    This requires the extra mask that Nicola introduced.

What would you use it for, specifically?

>  - (b) Making deep changes to cpufreq (core/governors/drivers) to allow:

Not an option really.

Nicola Mazzucato Oct. 8, 2020, 4 p.m. UTC | #7

Hi Viresh and Ionela,

@Viresh, I am sorry it's still not crystal clear, please find an example below.
@Ionela, thanks for adding more details.

On 10/8/20 4:03 PM, Ionela Voinescu wrote:
> Hi Viresh,
> 
> On Thursday 08 Oct 2020 at 16:32:41 (+0530), Viresh Kumar wrote:
>> On 07-10-20, 13:58, Nicola Mazzucato wrote:
>>> Hi Viresh,
>>>
>>> performance controls is what is exposed by the firmware through a protocol that
>>> is not capable of describing hardware (say SCMI). For example, the firmware can
>>> tell that the platform has N controls, but it can't say to which hardware they
>>> are "wired" to. This is done in dt, where, for example, we map these controls
>>> to cpus, gpus, etc.
>>>
>>> Let's focus on cpus.
>>>
>>> Normally we would have N of performance controls (what comes from f/w)
>>> that that correspond to hardware clock/dvfs domains.
>>>
>>> However, some firmware implementations might benefit from having finer
>>> grained information about the performance requirements (e.g.
>>> per-CPU) and therefore choose to present M performance controls to the
>>> OS. DT would be adjusted accordingly to "wire" these controls to cpus
>>> or set of cpus.
>>> In this scenario, the f/w will make aggregation decisions based on the
>>> requests it receives on these M controls.
>>>
>>> Here we would have M cpufreq policies which do not necessarily reflect the
>>> underlying clock domains, thus some s/w components will underperform
>>> (EAS and thermal, for example).
>>>
>>> A real example would be a platform in which the firmware describes the system
>>> having M per-cpu control, and the cpufreq subsystem will have M policies while
>>> in fact these cpus are "performance-dependent" each other (e.g. are in the same
>>> clock domain).
>>
>> If the CPUs are in the same clock domain, they must be part of the
>> same cpufreq policy.
> 
> But cpufreq does not currently support HW_ALL (I'm using the ACPI
> coordination type to describe the generic scenario of using hardware
> aggregation and coordination when establishing the clock rate of CPUs).
> 
> Adding support for HW_ALL* will involve either bypassing some
> assumptions around cpufreq policies or making core cpufreq changes.
> 
> In the way I see it, support for HW_ALL involves either:
> 
>  - (a) Creating per-cpu policies in order to allow each of the CPUs to
>    send their own frequency request to the hardware which will do
>    aggregation and clock rate decision at the level of the clock
>    domain. The PSD domains (ACPI) and the new DT binding will tell
>    which CPUs are actually in the same clock domain for whomever is
>    interested, despite those CPUs not being in the same policy.
>    This requires the extra mask that Nicola introduced.
> 
>  - (b) Making deep changes to cpufreq (core/governors/drivers) to allow:
>    - Governors to stop aggregating (usually max) the information
>      for each of the CPUs in the policy and convey to the core
>      information for each CPU.
>    - Cpufreq core to be able to receive and pass this information
>      down to the drivers.
>    - Drivers to be able to have some per cpu structures to hold
>      frequency control (let's say SCP fast channel addresses) for
>      each of the CPUs in the policy. Or have these structures in the
>      cpufreq core/policy, to avoid code duplication in drivers.
> 
> Therefore (a) is the least invasive but we'll be bypassing the rule
> above. But to make that rule stick we'll have to make invasive cpufreq
> changes (b).

Regarding the 'rule' above of one cpufreq policy per clock domain, I would like
to share my understanding on it. Perhaps it's a good opportunity to shed some light.

Looking back in the history of CPUFreq, related_cpus was originally designed
to hold the map of cpus within the same clock. Later on, the meaning of this
cpumask changed [1].
This led to the introduction of a new cpumask 'freqdomain_cpus'
within acpi-cpufreq to keep the knowledge of hardware clock domains for
sysfs consumers since related_cpus was not suitable anymore for this.
Further on, this cpumask was assigned to online+offline cpus within the same clk
domain when sw coordination is in use [2].

My interpretation is that there is no guarantee that related_cpus holds the
'real' hardware clock implementation. As a consequence, it is not true anymore
that cpus that are in the same clock domain will be part of the same
policy.

This guided me to think it would be better to have a cpumask which always holds
the real hw clock domains in the policy.

> 
> This is my current understanding and I'm leaning towards (a). What do
> you think?
> 
> *in not so many words, this is what these patches are trying to propose,
> while also making sure it's supported for both ACPI and DT.
> 
> BTW, thank you for your effort in making sense of this!
> 
> Regards,
> Ionela.
> 

This could be a platform where per-cpu and perf-dependencies will be used:

CPU:              0    1    2    3    4    5    6    7
Type:             A    A    A    A    B    B    B    B
Cluster:         [                                    ]
perf-controls:   [  ] [  ] [  ] [ ]  [ ]  [ ]  [ ]  [ ]
perf-dependency: [                ]  [                ]
HW clock:        [                ]  [                ]

The firmware will present 8 controls to the OS and each control is mapped to a
cpu device via the standard dt. This is done so we can achieve hw coordination.
What is required in these systems is to present to OS the information of which
cpus belong to which clock domain. In other words, when hw coordinates we don't
have any way at present in dt to understand how these cpus are dependent
each other, from performance perspective (as opposed to ACPI where we have
_PSD). Hence my proposal for the new cpu-perf-dependencies.
This is regardless whether we decide to go for either a policy per-cpu or a
policy per-domain.

Hope it helps.

Many thanks
Best regards,
Nicola

[1] 'commit f4fd3797848a ("acpi-cpufreq: Add new sysfs attribute freqdomain_cpus")'
[2] 'commit 951fc5f45836 ("cpufreq: Update Documentation for cpus and
related_cpus")'

Ionela Voinescu Oct. 8, 2020, 5:08 p.m. UTC | #8

Hi Rafael,

On Thursday 08 Oct 2020 at 17:57:23 (+0200), Rafael J. Wysocki wrote:
> On Thu, Oct 8, 2020 at 5:03 PM Ionela Voinescu <ionela.voinescu@arm.com> wrote:
> >
> > Hi Viresh,
> >
> > On Thursday 08 Oct 2020 at 16:32:41 (+0530), Viresh Kumar wrote:
> > > On 07-10-20, 13:58, Nicola Mazzucato wrote:
> > > > Hi Viresh,
> > > >
> > > > performance controls is what is exposed by the firmware through a protocol that
> > > > is not capable of describing hardware (say SCMI). For example, the firmware can
> > > > tell that the platform has N controls, but it can't say to which hardware they
> > > > are "wired" to. This is done in dt, where, for example, we map these controls
> > > > to cpus, gpus, etc.
> > > >
> > > > Let's focus on cpus.
> > > >
> > > > Normally we would have N of performance controls (what comes from f/w)
> > > > that that correspond to hardware clock/dvfs domains.
> > > >
> > > > However, some firmware implementations might benefit from having finer
> > > > grained information about the performance requirements (e.g.
> > > > per-CPU) and therefore choose to present M performance controls to the
> > > > OS. DT would be adjusted accordingly to "wire" these controls to cpus
> > > > or set of cpus.
> > > > In this scenario, the f/w will make aggregation decisions based on the
> > > > requests it receives on these M controls.
> > > >
> > > > Here we would have M cpufreq policies which do not necessarily reflect the
> > > > underlying clock domains, thus some s/w components will underperform
> > > > (EAS and thermal, for example).
> > > >
> > > > A real example would be a platform in which the firmware describes the system
> > > > having M per-cpu control, and the cpufreq subsystem will have M policies while
> > > > in fact these cpus are "performance-dependent" each other (e.g. are in the same
> > > > clock domain).
> > >
> > > If the CPUs are in the same clock domain, they must be part of the
> > > same cpufreq policy.
> >
> > But cpufreq does not currently support HW_ALL (I'm using the ACPI
> > coordination type to describe the generic scenario of using hardware
> > aggregation and coordination when establishing the clock rate of CPUs).
> >
> > Adding support for HW_ALL* will involve either bypassing some
> > assumptions around cpufreq policies or making core cpufreq changes.
> >
> > In the way I see it, support for HW_ALL involves either:
> >
> >  - (a) Creating per-cpu policies in order to allow each of the CPUs to
> >    send their own frequency request to the hardware which will do
> >    aggregation and clock rate decision at the level of the clock
> >    domain.
> 
> This has been done for years on many platforms.

Exactly!

> 
> >    The PSD domains (ACPI) and the new DT binding will tell
> >    which CPUs are actually in the same clock domain for whomever is
> >    interested, despite those CPUs not being in the same policy.
> 
> And this information hasn't been used so far in those cases.
> 
> >    This requires the extra mask that Nicola introduced.
> 
> What would you use it for, specifically?

This would be useful for:

 - Energy Aware Scheduling: for this you need to know how other CPUs in
   a clock domain would be impacted by a task placement. For example,
   if the utilization of a CPU would increase as a result of a certain
   task placement choice and as a result (for schedutil) its clock rate
   need would increase as well, this increase in the clock rate, and
   therefore energy, of the entire domain is considered before making
   that task placement choice.

 - Thermal: the usefulness is dependent on the distribution of thermal
   zones and their attached cooling devices. But with knowledge about
   what CPUs use the same clock, the thermal governors could cap all
   dependent CPUs in one go, while for some governor (IPA) knowing about
   dependent CPUs help with more accurate power allocation, similar to
   EAS.

 - Frequency invariance: ideally one would have counters for this,
   but when lacking counters, even knowing that some CPUs have the same
   frequency and after using some software aggregation (likely maximum)
   to establish what that frequency might be, I believe it would still
   be more useful than no frequency invariance at all.

Even if in these cases you don't have accurate information about the
frequency that hardware will grant, knowing that some CPUs will change
frequency together is useful. Given that some of the above users (EAS,
IPA) are proactive and are trying to predict the future state of a
system, they did not have completely accurate information to begin with.
But not taking into account CPUs sharing a clock will result in too
much inaccuracy (that even control loops and can't compensate for).
This together with the assumption* that predicted frequencies won't be
very far off from granted frequencies will result in maintaining these
features in a more useful state.

*my assumption, until proven otherwise :)

> 
> >  - (b) Making deep changes to cpufreq (core/governors/drivers) to allow:
> 
> Not an option really.

Agreed!

Thanks,
Ionela.

Viresh Kumar Oct. 9, 2020, 5:39 a.m. UTC | #9

On 08-10-20, 17:00, Nicola Mazzucato wrote:
> On 10/8/20 4:03 PM, Ionela Voinescu wrote:
> > Hi Viresh,
> > 
> > On Thursday 08 Oct 2020 at 16:32:41 (+0530), Viresh Kumar wrote:
> >> On 07-10-20, 13:58, Nicola Mazzucato wrote:
> >>> Hi Viresh,
> >>>
> >>> performance controls is what is exposed by the firmware through a protocol that
> >>> is not capable of describing hardware (say SCMI). For example, the firmware can
> >>> tell that the platform has N controls, but it can't say to which hardware they
> >>> are "wired" to. This is done in dt, where, for example, we map these controls
> >>> to cpus, gpus, etc.
> >>>
> >>> Let's focus on cpus.
> >>>
> >>> Normally we would have N of performance controls (what comes from f/w)
> >>> that that correspond to hardware clock/dvfs domains.
> >>>
> >>> However, some firmware implementations might benefit from having finer
> >>> grained information about the performance requirements (e.g.
> >>> per-CPU) and therefore choose to present M performance controls to the
> >>> OS. DT would be adjusted accordingly to "wire" these controls to cpus
> >>> or set of cpus.
> >>> In this scenario, the f/w will make aggregation decisions based on the
> >>> requests it receives on these M controls.
> >>>
> >>> Here we would have M cpufreq policies which do not necessarily reflect the
> >>> underlying clock domains, thus some s/w components will underperform
> >>> (EAS and thermal, for example).
> >>>
> >>> A real example would be a platform in which the firmware describes the system
> >>> having M per-cpu control, and the cpufreq subsystem will have M policies while
> >>> in fact these cpus are "performance-dependent" each other (e.g. are in the same
> >>> clock domain).
> >>
> >> If the CPUs are in the same clock domain, they must be part of the
> >> same cpufreq policy.
> > 
> > But cpufreq does not currently support HW_ALL (I'm using the ACPI
> > coordination type to describe the generic scenario of using hardware
> > aggregation and coordination when establishing the clock rate of CPUs).
> > 
> > Adding support for HW_ALL* will involve either bypassing some
> > assumptions around cpufreq policies or making core cpufreq changes.
> > 
> > In the way I see it, support for HW_ALL involves either:
> > 
> >  - (a) Creating per-cpu policies in order to allow each of the CPUs to
> >    send their own frequency request to the hardware which will do
> >    aggregation and clock rate decision at the level of the clock
> >    domain. The PSD domains (ACPI) and the new DT binding will tell
> >    which CPUs are actually in the same clock domain for whomever is
> >    interested, despite those CPUs not being in the same policy.
> >    This requires the extra mask that Nicola introduced.
> > 
> >  - (b) Making deep changes to cpufreq (core/governors/drivers) to allow:
> >    - Governors to stop aggregating (usually max) the information
> >      for each of the CPUs in the policy and convey to the core
> >      information for each CPU.
> >    - Cpufreq core to be able to receive and pass this information
> >      down to the drivers.
> >    - Drivers to be able to have some per cpu structures to hold
> >      frequency control (let's say SCP fast channel addresses) for
> >      each of the CPUs in the policy. Or have these structures in the
> >      cpufreq core/policy, to avoid code duplication in drivers.
> > 
> > Therefore (a) is the least invasive but we'll be bypassing the rule
> > above. But to make that rule stick we'll have to make invasive cpufreq
> > changes (b).
> 
> Regarding the 'rule' above of one cpufreq policy per clock domain, I would like
> to share my understanding on it. Perhaps it's a good opportunity to shed some light.
> 
> Looking back in the history of CPUFreq, related_cpus was originally designed
> to hold the map of cpus within the same clock. Later on, the meaning of this
> cpumask changed [1].
> This led to the introduction of a new cpumask 'freqdomain_cpus'
> within acpi-cpufreq to keep the knowledge of hardware clock domains for
> sysfs consumers since related_cpus was not suitable anymore for this.
> Further on, this cpumask was assigned to online+offline cpus within the same clk
> domain when sw coordination is in use [2].
> 
> My interpretation is that there is no guarantee that related_cpus holds the
> 'real' hardware clock implementation. As a consequence, it is not true anymore
> that cpus that are in the same clock domain will be part of the same
> policy.
> 
> This guided me to think it would be better to have a cpumask which always holds
> the real hw clock domains in the policy.
> 
> > 
> > This is my current understanding and I'm leaning towards (a). What do
> > you think?
> > 
> > *in not so many words, this is what these patches are trying to propose,
> > while also making sure it's supported for both ACPI and DT.
> > 
> > BTW, thank you for your effort in making sense of this!
> > 
> > Regards,
> > Ionela.
> > 
> 
> This could be a platform where per-cpu and perf-dependencies will be used:
> 
> CPU:              0    1    2    3    4    5    6    7
> Type:             A    A    A    A    B    B    B    B
> Cluster:         [                                    ]
> perf-controls:   [  ] [  ] [  ] [ ]  [ ]  [ ]  [ ]  [ ]
> perf-dependency: [                ]  [                ]
> HW clock:        [                ]  [                ]
> 
> The firmware will present 8 controls to the OS and each control is mapped to a
> cpu device via the standard dt. This is done so we can achieve hw coordination.
> What is required in these systems is to present to OS the information of which
> cpus belong to which clock domain. In other words, when hw coordinates we don't
> have any way at present in dt to understand how these cpus are dependent
> each other, from performance perspective (as opposed to ACPI where we have
> _PSD). Hence my proposal for the new cpu-perf-dependencies.
> This is regardless whether we decide to go for either a policy per-cpu or a
> policy per-domain.
> 
> Hope it helps.

Oh yes, I get it now. Finally. Thanks for helping me out :)

So if I can say all this stuff in simple terms, this is what it will
be like:

- We don't want software aggregation of frequencies and so we need to
  have per-cpu policies even when they share their clock lines.

- But we still need a way for other frameworks to know which CPUs
  share the clock lines (that's what the perf-dependency is all about,
  right ?).

- We can't get it from SCMI, but need a DT based solution.

- Currently for the cpufreq-case we relied for this on the way OPP
  tables for the CPUs were described. i.e. the opp-table is marked as
  "shared" and multiple CPUs point to it.

- I wonder if we can keep using that instead of creating new bindings
  for exact same stuff ? Though the difference here would be that the
  OPP may not have any other entries.

Nicola Mazzucato Oct. 9, 2020, 11:10 a.m. UTC | #10

Hi Viresh, I'm glad it helped.

Please find below my reply.

On 10/9/20 6:39 AM, Viresh Kumar wrote:
> On 08-10-20, 17:00, Nicola Mazzucato wrote:
>> On 10/8/20 4:03 PM, Ionela Voinescu wrote:
>>> Hi Viresh,
>>>
>>> On Thursday 08 Oct 2020 at 16:32:41 (+0530), Viresh Kumar wrote:
>>>> On 07-10-20, 13:58, Nicola Mazzucato wrote:
>>>>> Hi Viresh,
>>>>>
>>>>> performance controls is what is exposed by the firmware through a protocol that
>>>>> is not capable of describing hardware (say SCMI). For example, the firmware can
>>>>> tell that the platform has N controls, but it can't say to which hardware they
>>>>> are "wired" to. This is done in dt, where, for example, we map these controls
>>>>> to cpus, gpus, etc.
>>>>>
>>>>> Let's focus on cpus.
>>>>>
>>>>> Normally we would have N of performance controls (what comes from f/w)
>>>>> that that correspond to hardware clock/dvfs domains.
>>>>>
>>>>> However, some firmware implementations might benefit from having finer
>>>>> grained information about the performance requirements (e.g.
>>>>> per-CPU) and therefore choose to present M performance controls to the
>>>>> OS. DT would be adjusted accordingly to "wire" these controls to cpus
>>>>> or set of cpus.
>>>>> In this scenario, the f/w will make aggregation decisions based on the
>>>>> requests it receives on these M controls.
>>>>>
>>>>> Here we would have M cpufreq policies which do not necessarily reflect the
>>>>> underlying clock domains, thus some s/w components will underperform
>>>>> (EAS and thermal, for example).
>>>>>
>>>>> A real example would be a platform in which the firmware describes the system
>>>>> having M per-cpu control, and the cpufreq subsystem will have M policies while
>>>>> in fact these cpus are "performance-dependent" each other (e.g. are in the same
>>>>> clock domain).
>>>>
>>>> If the CPUs are in the same clock domain, they must be part of the
>>>> same cpufreq policy.
>>>
>>> But cpufreq does not currently support HW_ALL (I'm using the ACPI
>>> coordination type to describe the generic scenario of using hardware
>>> aggregation and coordination when establishing the clock rate of CPUs).
>>>
>>> Adding support for HW_ALL* will involve either bypassing some
>>> assumptions around cpufreq policies or making core cpufreq changes.
>>>
>>> In the way I see it, support for HW_ALL involves either:
>>>
>>>  - (a) Creating per-cpu policies in order to allow each of the CPUs to
>>>    send their own frequency request to the hardware which will do
>>>    aggregation and clock rate decision at the level of the clock
>>>    domain. The PSD domains (ACPI) and the new DT binding will tell
>>>    which CPUs are actually in the same clock domain for whomever is
>>>    interested, despite those CPUs not being in the same policy.
>>>    This requires the extra mask that Nicola introduced.
>>>
>>>  - (b) Making deep changes to cpufreq (core/governors/drivers) to allow:
>>>    - Governors to stop aggregating (usually max) the information
>>>      for each of the CPUs in the policy and convey to the core
>>>      information for each CPU.
>>>    - Cpufreq core to be able to receive and pass this information
>>>      down to the drivers.
>>>    - Drivers to be able to have some per cpu structures to hold
>>>      frequency control (let's say SCP fast channel addresses) for
>>>      each of the CPUs in the policy. Or have these structures in the
>>>      cpufreq core/policy, to avoid code duplication in drivers.
>>>
>>> Therefore (a) is the least invasive but we'll be bypassing the rule
>>> above. But to make that rule stick we'll have to make invasive cpufreq
>>> changes (b).
>>
>> Regarding the 'rule' above of one cpufreq policy per clock domain, I would like
>> to share my understanding on it. Perhaps it's a good opportunity to shed some light.
>>
>> Looking back in the history of CPUFreq, related_cpus was originally designed
>> to hold the map of cpus within the same clock. Later on, the meaning of this
>> cpumask changed [1].
>> This led to the introduction of a new cpumask 'freqdomain_cpus'
>> within acpi-cpufreq to keep the knowledge of hardware clock domains for
>> sysfs consumers since related_cpus was not suitable anymore for this.
>> Further on, this cpumask was assigned to online+offline cpus within the same clk
>> domain when sw coordination is in use [2].
>>
>> My interpretation is that there is no guarantee that related_cpus holds the
>> 'real' hardware clock implementation. As a consequence, it is not true anymore
>> that cpus that are in the same clock domain will be part of the same
>> policy.
>>
>> This guided me to think it would be better to have a cpumask which always holds
>> the real hw clock domains in the policy.
>>
>>>
>>> This is my current understanding and I'm leaning towards (a). What do
>>> you think?
>>>
>>> *in not so many words, this is what these patches are trying to propose,
>>> while also making sure it's supported for both ACPI and DT.
>>>
>>> BTW, thank you for your effort in making sense of this!
>>>
>>> Regards,
>>> Ionela.
>>>
>>
>> This could be a platform where per-cpu and perf-dependencies will be used:
>>
>> CPU:              0    1    2    3    4    5    6    7
>> Type:             A    A    A    A    B    B    B    B
>> Cluster:         [                                    ]
>> perf-controls:   [  ] [  ] [  ] [ ]  [ ]  [ ]  [ ]  [ ]
>> perf-dependency: [                ]  [                ]
>> HW clock:        [                ]  [                ]
>>
>> The firmware will present 8 controls to the OS and each control is mapped to a
>> cpu device via the standard dt. This is done so we can achieve hw coordination.
>> What is required in these systems is to present to OS the information of which
>> cpus belong to which clock domain. In other words, when hw coordinates we don't
>> have any way at present in dt to understand how these cpus are dependent
>> each other, from performance perspective (as opposed to ACPI where we have
>> _PSD). Hence my proposal for the new cpu-perf-dependencies.
>> This is regardless whether we decide to go for either a policy per-cpu or a
>> policy per-domain.
>>
>> Hope it helps.
> 
> Oh yes, I get it now. Finally. Thanks for helping me out :)
> 
> So if I can say all this stuff in simple terms, this is what it will
> be like:
> 
> - We don't want software aggregation of frequencies and so we need to
>   have per-cpu policies even when they share their clock lines.
> 
> - But we still need a way for other frameworks to know which CPUs
>   share the clock lines (that's what the perf-dependency is all about,
>   right ?).
> 
> - We can't get it from SCMI, but need a DT based solution.
> 
> - Currently for the cpufreq-case we relied for this on the way OPP
>   tables for the CPUs were described. i.e. the opp-table is marked as
>   "shared" and multiple CPUs point to it.
> 
> - I wonder if we can keep using that instead of creating new bindings
>   for exact same stuff ? Though the difference here would be that the
>   OPP may not have any other entries.

I thought about it and looked for other platforms' DT to see if can reuse
existing opp information. Unfortunately I don't think it is optimal. The reason
being that, because cpus have the same opp table it does not necessarily mean
that they share a clock wire. It just tells us that they have the same
capabilities (literally just tells us they have the same V/f op points).
Unless I am missing something?

When comparing with ACPI/_PSD it becomes more intuitive that there is no
equivalent way to reveal "perf-dependencies" in DT.

Thank you for time on this.

Regards
Nicola

>

Viresh Kumar Oct. 9, 2020, 11:17 a.m. UTC | #11

On 09-10-20, 12:10, Nicola Mazzucato wrote:
> I thought about it and looked for other platforms' DT to see if can reuse
> existing opp information. Unfortunately I don't think it is optimal. The reason
> being that, because cpus have the same opp table it does not necessarily mean
> that they share a clock wire. It just tells us that they have the same
> capabilities (literally just tells us they have the same V/f op points).

No.

> Unless I am missing something?

Yes.

Here are the different scenarios which can happen.
- Two CPUs have separate OPP tables, even if they are exact copy of
  each other, these CPUs don't share a clock line, but just v/f points
  as you said.

- Two CPUs use the same OPP table, i.e. both point to it, but
  "opp-shared" property is missing. This is same as above case. They
  just share the v/f points and this is the preferred way instead of
  duplicate OPP tables.

- Case two with "opp-shared" property present in the OPP table. The
  CPUs share clock-lines.

And this is exactly how we find out today if CPUs share a policy or
not.

Rob Herring (Arm) Oct. 9, 2020, 2:01 p.m. UTC | #12

On Fri, Oct 09, 2020 at 12:10:03PM +0100, Nicola Mazzucato wrote:
> Hi Viresh, I'm glad it helped.
> 
> Please find below my reply.
> 
> On 10/9/20 6:39 AM, Viresh Kumar wrote:
> > On 08-10-20, 17:00, Nicola Mazzucato wrote:
> >> On 10/8/20 4:03 PM, Ionela Voinescu wrote:
> >>> Hi Viresh,
> >>>
> >>> On Thursday 08 Oct 2020 at 16:32:41 (+0530), Viresh Kumar wrote:
> >>>> On 07-10-20, 13:58, Nicola Mazzucato wrote:
> >>>>> Hi Viresh,
> >>>>>
> >>>>> performance controls is what is exposed by the firmware through a protocol that
> >>>>> is not capable of describing hardware (say SCMI). For example, the firmware can
> >>>>> tell that the platform has N controls, but it can't say to which hardware they
> >>>>> are "wired" to. This is done in dt, where, for example, we map these controls
> >>>>> to cpus, gpus, etc.
> >>>>>
> >>>>> Let's focus on cpus.
> >>>>>
> >>>>> Normally we would have N of performance controls (what comes from f/w)
> >>>>> that that correspond to hardware clock/dvfs domains.
> >>>>>
> >>>>> However, some firmware implementations might benefit from having finer
> >>>>> grained information about the performance requirements (e.g.
> >>>>> per-CPU) and therefore choose to present M performance controls to the
> >>>>> OS. DT would be adjusted accordingly to "wire" these controls to cpus
> >>>>> or set of cpus.
> >>>>> In this scenario, the f/w will make aggregation decisions based on the
> >>>>> requests it receives on these M controls.
> >>>>>
> >>>>> Here we would have M cpufreq policies which do not necessarily reflect the
> >>>>> underlying clock domains, thus some s/w components will underperform
> >>>>> (EAS and thermal, for example).
> >>>>>
> >>>>> A real example would be a platform in which the firmware describes the system
> >>>>> having M per-cpu control, and the cpufreq subsystem will have M policies while
> >>>>> in fact these cpus are "performance-dependent" each other (e.g. are in the same
> >>>>> clock domain).
> >>>>
> >>>> If the CPUs are in the same clock domain, they must be part of the
> >>>> same cpufreq policy.
> >>>
> >>> But cpufreq does not currently support HW_ALL (I'm using the ACPI
> >>> coordination type to describe the generic scenario of using hardware
> >>> aggregation and coordination when establishing the clock rate of CPUs).
> >>>
> >>> Adding support for HW_ALL* will involve either bypassing some
> >>> assumptions around cpufreq policies or making core cpufreq changes.
> >>>
> >>> In the way I see it, support for HW_ALL involves either:
> >>>
> >>>  - (a) Creating per-cpu policies in order to allow each of the CPUs to
> >>>    send their own frequency request to the hardware which will do
> >>>    aggregation and clock rate decision at the level of the clock
> >>>    domain. The PSD domains (ACPI) and the new DT binding will tell
> >>>    which CPUs are actually in the same clock domain for whomever is
> >>>    interested, despite those CPUs not being in the same policy.
> >>>    This requires the extra mask that Nicola introduced.
> >>>
> >>>  - (b) Making deep changes to cpufreq (core/governors/drivers) to allow:
> >>>    - Governors to stop aggregating (usually max) the information
> >>>      for each of the CPUs in the policy and convey to the core
> >>>      information for each CPU.
> >>>    - Cpufreq core to be able to receive and pass this information
> >>>      down to the drivers.
> >>>    - Drivers to be able to have some per cpu structures to hold
> >>>      frequency control (let's say SCP fast channel addresses) for
> >>>      each of the CPUs in the policy. Or have these structures in the
> >>>      cpufreq core/policy, to avoid code duplication in drivers.
> >>>
> >>> Therefore (a) is the least invasive but we'll be bypassing the rule
> >>> above. But to make that rule stick we'll have to make invasive cpufreq
> >>> changes (b).
> >>
> >> Regarding the 'rule' above of one cpufreq policy per clock domain, I would like
> >> to share my understanding on it. Perhaps it's a good opportunity to shed some light.
> >>
> >> Looking back in the history of CPUFreq, related_cpus was originally designed
> >> to hold the map of cpus within the same clock. Later on, the meaning of this
> >> cpumask changed [1].
> >> This led to the introduction of a new cpumask 'freqdomain_cpus'
> >> within acpi-cpufreq to keep the knowledge of hardware clock domains for
> >> sysfs consumers since related_cpus was not suitable anymore for this.
> >> Further on, this cpumask was assigned to online+offline cpus within the same clk
> >> domain when sw coordination is in use [2].
> >>
> >> My interpretation is that there is no guarantee that related_cpus holds the
> >> 'real' hardware clock implementation. As a consequence, it is not true anymore
> >> that cpus that are in the same clock domain will be part of the same
> >> policy.
> >>
> >> This guided me to think it would be better to have a cpumask which always holds
> >> the real hw clock domains in the policy.
> >>
> >>>
> >>> This is my current understanding and I'm leaning towards (a). What do
> >>> you think?
> >>>
> >>> *in not so many words, this is what these patches are trying to propose,
> >>> while also making sure it's supported for both ACPI and DT.
> >>>
> >>> BTW, thank you for your effort in making sense of this!
> >>>
> >>> Regards,
> >>> Ionela.
> >>>
> >>
> >> This could be a platform where per-cpu and perf-dependencies will be used:
> >>
> >> CPU:              0    1    2    3    4    5    6    7
> >> Type:             A    A    A    A    B    B    B    B
> >> Cluster:         [                                    ]
> >> perf-controls:   [  ] [  ] [  ] [ ]  [ ]  [ ]  [ ]  [ ]
> >> perf-dependency: [                ]  [                ]
> >> HW clock:        [                ]  [                ]
> >>
> >> The firmware will present 8 controls to the OS and each control is mapped to a
> >> cpu device via the standard dt. This is done so we can achieve hw coordination.
> >> What is required in these systems is to present to OS the information of which
> >> cpus belong to which clock domain. In other words, when hw coordinates we don't
> >> have any way at present in dt to understand how these cpus are dependent
> >> each other, from performance perspective (as opposed to ACPI where we have
> >> _PSD). Hence my proposal for the new cpu-perf-dependencies.
> >> This is regardless whether we decide to go for either a policy per-cpu or a
> >> policy per-domain.
> >>
> >> Hope it helps.
> > 
> > Oh yes, I get it now. Finally. Thanks for helping me out :)
> > 
> > So if I can say all this stuff in simple terms, this is what it will
> > be like:
> > 
> > - We don't want software aggregation of frequencies and so we need to
> >   have per-cpu policies even when they share their clock lines.
> > 
> > - But we still need a way for other frameworks to know which CPUs
> >   share the clock lines (that's what the perf-dependency is all about,
> >   right ?).
> > 
> > - We can't get it from SCMI, but need a DT based solution.
> > 
> > - Currently for the cpufreq-case we relied for this on the way OPP
> >   tables for the CPUs were described. i.e. the opp-table is marked as
> >   "shared" and multiple CPUs point to it.
> > 
> > - I wonder if we can keep using that instead of creating new bindings
> >   for exact same stuff ? Though the difference here would be that the
> >   OPP may not have any other entries.
> 
> I thought about it and looked for other platforms' DT to see if can reuse
> existing opp information. Unfortunately I don't think it is optimal. The reason
> being that, because cpus have the same opp table it does not necessarily mean
> that they share a clock wire. It just tells us that they have the same
> capabilities (literally just tells us they have the same V/f op points).
> Unless I am missing something?
> 
> When comparing with ACPI/_PSD it becomes more intuitive that there is no
> equivalent way to reveal "perf-dependencies" in DT.

You should be able to by examining the clock tree. But perhaps SCMI 
abstracts all that and just presents virtual clocks without parent 
clocks available to determine what clocks are shared? Fix SCMI if that's 
the case.

Rob

Nicola Mazzucato Oct. 9, 2020, 3:28 p.m. UTC | #13

Hi Viresh and Rob,

first of all, thanks once again for looking into this!

On 10/9/20 3:01 PM, Rob Herring wrote:
> On Fri, Oct 09, 2020 at 12:10:03PM +0100, Nicola Mazzucato wrote:
>> Hi Viresh, I'm glad it helped.
>>
>> Please find below my reply.
>>
>> On 10/9/20 6:39 AM, Viresh Kumar wrote:
>>> On 08-10-20, 17:00, Nicola Mazzucato wrote:
>>>> On 10/8/20 4:03 PM, Ionela Voinescu wrote:
>>>>> Hi Viresh,
>>>>>
>>>>> On Thursday 08 Oct 2020 at 16:32:41 (+0530), Viresh Kumar wrote:
>>>>>> On 07-10-20, 13:58, Nicola Mazzucato wrote:
>>>>>>> Hi Viresh,
>>>>>>>
>>>>>>> performance controls is what is exposed by the firmware through a protocol that
>>>>>>> is not capable of describing hardware (say SCMI). For example, the firmware can
>>>>>>> tell that the platform has N controls, but it can't say to which hardware they
>>>>>>> are "wired" to. This is done in dt, where, for example, we map these controls
>>>>>>> to cpus, gpus, etc.
>>>>>>>
>>>>>>> Let's focus on cpus.
>>>>>>>
>>>>>>> Normally we would have N of performance controls (what comes from f/w)
>>>>>>> that that correspond to hardware clock/dvfs domains.
>>>>>>>
>>>>>>> However, some firmware implementations might benefit from having finer
>>>>>>> grained information about the performance requirements (e.g.
>>>>>>> per-CPU) and therefore choose to present M performance controls to the
>>>>>>> OS. DT would be adjusted accordingly to "wire" these controls to cpus
>>>>>>> or set of cpus.
>>>>>>> In this scenario, the f/w will make aggregation decisions based on the
>>>>>>> requests it receives on these M controls.
>>>>>>>
>>>>>>> Here we would have M cpufreq policies which do not necessarily reflect the
>>>>>>> underlying clock domains, thus some s/w components will underperform
>>>>>>> (EAS and thermal, for example).
>>>>>>>
>>>>>>> A real example would be a platform in which the firmware describes the system
>>>>>>> having M per-cpu control, and the cpufreq subsystem will have M policies while
>>>>>>> in fact these cpus are "performance-dependent" each other (e.g. are in the same
>>>>>>> clock domain).
>>>>>>
>>>>>> If the CPUs are in the same clock domain, they must be part of the
>>>>>> same cpufreq policy.
>>>>>
>>>>> But cpufreq does not currently support HW_ALL (I'm using the ACPI
>>>>> coordination type to describe the generic scenario of using hardware
>>>>> aggregation and coordination when establishing the clock rate of CPUs).
>>>>>
>>>>> Adding support for HW_ALL* will involve either bypassing some
>>>>> assumptions around cpufreq policies or making core cpufreq changes.
>>>>>
>>>>> In the way I see it, support for HW_ALL involves either:
>>>>>
>>>>>  - (a) Creating per-cpu policies in order to allow each of the CPUs to
>>>>>    send their own frequency request to the hardware which will do
>>>>>    aggregation and clock rate decision at the level of the clock
>>>>>    domain. The PSD domains (ACPI) and the new DT binding will tell
>>>>>    which CPUs are actually in the same clock domain for whomever is
>>>>>    interested, despite those CPUs not being in the same policy.
>>>>>    This requires the extra mask that Nicola introduced.
>>>>>
>>>>>  - (b) Making deep changes to cpufreq (core/governors/drivers) to allow:
>>>>>    - Governors to stop aggregating (usually max) the information
>>>>>      for each of the CPUs in the policy and convey to the core
>>>>>      information for each CPU.
>>>>>    - Cpufreq core to be able to receive and pass this information
>>>>>      down to the drivers.
>>>>>    - Drivers to be able to have some per cpu structures to hold
>>>>>      frequency control (let's say SCP fast channel addresses) for
>>>>>      each of the CPUs in the policy. Or have these structures in the
>>>>>      cpufreq core/policy, to avoid code duplication in drivers.
>>>>>
>>>>> Therefore (a) is the least invasive but we'll be bypassing the rule
>>>>> above. But to make that rule stick we'll have to make invasive cpufreq
>>>>> changes (b).
>>>>
>>>> Regarding the 'rule' above of one cpufreq policy per clock domain, I would like
>>>> to share my understanding on it. Perhaps it's a good opportunity to shed some light.
>>>>
>>>> Looking back in the history of CPUFreq, related_cpus was originally designed
>>>> to hold the map of cpus within the same clock. Later on, the meaning of this
>>>> cpumask changed [1].
>>>> This led to the introduction of a new cpumask 'freqdomain_cpus'
>>>> within acpi-cpufreq to keep the knowledge of hardware clock domains for
>>>> sysfs consumers since related_cpus was not suitable anymore for this.
>>>> Further on, this cpumask was assigned to online+offline cpus within the same clk
>>>> domain when sw coordination is in use [2].
>>>>
>>>> My interpretation is that there is no guarantee that related_cpus holds the
>>>> 'real' hardware clock implementation. As a consequence, it is not true anymore
>>>> that cpus that are in the same clock domain will be part of the same
>>>> policy.
>>>>
>>>> This guided me to think it would be better to have a cpumask which always holds
>>>> the real hw clock domains in the policy.
>>>>
>>>>>
>>>>> This is my current understanding and I'm leaning towards (a). What do
>>>>> you think?
>>>>>
>>>>> *in not so many words, this is what these patches are trying to propose,
>>>>> while also making sure it's supported for both ACPI and DT.
>>>>>
>>>>> BTW, thank you for your effort in making sense of this!
>>>>>
>>>>> Regards,
>>>>> Ionela.
>>>>>
>>>>
>>>> This could be a platform where per-cpu and perf-dependencies will be used:
>>>>
>>>> CPU:              0    1    2    3    4    5    6    7
>>>> Type:             A    A    A    A    B    B    B    B
>>>> Cluster:         [                                    ]
>>>> perf-controls:   [  ] [  ] [  ] [ ]  [ ]  [ ]  [ ]  [ ]
>>>> perf-dependency: [                ]  [                ]
>>>> HW clock:        [                ]  [                ]
>>>>
>>>> The firmware will present 8 controls to the OS and each control is mapped to a
>>>> cpu device via the standard dt. This is done so we can achieve hw coordination.
>>>> What is required in these systems is to present to OS the information of which
>>>> cpus belong to which clock domain. In other words, when hw coordinates we don't
>>>> have any way at present in dt to understand how these cpus are dependent
>>>> each other, from performance perspective (as opposed to ACPI where we have
>>>> _PSD). Hence my proposal for the new cpu-perf-dependencies.
>>>> This is regardless whether we decide to go for either a policy per-cpu or a
>>>> policy per-domain.
>>>>
>>>> Hope it helps.
>>>
>>> Oh yes, I get it now. Finally. Thanks for helping me out :)
>>>
>>> So if I can say all this stuff in simple terms, this is what it will
>>> be like:
>>>
>>> - We don't want software aggregation of frequencies and so we need to
>>>   have per-cpu policies even when they share their clock lines.
>>>
>>> - But we still need a way for other frameworks to know which CPUs
>>>   share the clock lines (that's what the perf-dependency is all about,
>>>   right ?).
>>>
>>> - We can't get it from SCMI, but need a DT based solution.
>>>
>>> - Currently for the cpufreq-case we relied for this on the way OPP
>>>   tables for the CPUs were described. i.e. the opp-table is marked as
>>>   "shared" and multiple CPUs point to it.
>>>
>>> - I wonder if we can keep using that instead of creating new bindings
>>>   for exact same stuff ? Though the difference here would be that the
>>>   OPP may not have any other entries.
>>
>> I thought about it and looked for other platforms' DT to see if can reuse
>> existing opp information. Unfortunately I don't think it is optimal. The reason
>> being that, because cpus have the same opp table it does not necessarily mean
>> that they share a clock wire. It just tells us that they have the same
>> capabilities (literally just tells us they have the same V/f op points).
>> Unless I am missing something?
>>
>> When comparing with ACPI/_PSD it becomes more intuitive that there is no
>> equivalent way to reveal "perf-dependencies" in DT.

@Viresh
I am sorry I misread your reply earlier thus I did not pay attention on that
property.
And yes, it is exactly as how you have described :)
In the case 1 (different opps, different clk) and case 2 (same opps, different
clk) we provide v/f points. In case 3, we add 'opp-shared' property in opp table
to tell that the cpus with this opp table share a clock line.

Here are my thoughts on this 3rd case:
- the information of 'share the same clock line' comes correctly from DT as it's
purely a hw description. The same applies for cpu-perf-dependencies.
- because the opp table can come from any firmware, we won't have any opp table
in DT, thus we won't be able to take advantage of 'opp-shared' I am afraid.

> 
> You should be able to by examining the clock tree. But perhaps SCMI 
> abstracts all that and just presents virtual clocks without parent 
> clocks available to determine what clocks are shared? Fix SCMI if that's 
> the case.
> 
> Rob
> 

@Rob
I think that performance is not only dealing with clocks. Possibly my expression
of 'clock line' was not entirely precise, though it was good enough to
understand the problem. It should have rather been "performance line". So the
information from the clock tree would not be precise too.

The issue here is for any DT without opp table, cause it can come from ANY
firmware, and therefore we can't reuse 'opp-shared'.

Once again, thanks for providing your insights!

Best regards,
Nicola

Viresh Kumar Oct. 12, 2020, 4:19 a.m. UTC | #14

On 09-10-20, 16:28, Nicola Mazzucato wrote:
> @Viresh
> I am sorry I misread your reply earlier thus I did not pay attention on that
> property.
> And yes, it is exactly as how you have described :)
> In the case 1 (different opps, different clk) and case 2 (same opps, different
> clk) we provide v/f points. In case 3, we add 'opp-shared' property in opp table
> to tell that the cpus with this opp table share a clock line.
> 
> Here are my thoughts on this 3rd case:
> - the information of 'share the same clock line' comes correctly from DT as it's
> purely a hw description. The same applies for cpu-perf-dependencies.
> - because the opp table can come from any firmware, we won't have any opp table
> in DT, thus we won't be able to take advantage of 'opp-shared' I am afraid.

I wonder if we should use an empty OPP table just for parsing this
information.

Lukasz Luba Oct. 12, 2020, 10:22 a.m. UTC | #15

Hi Rob,

On 10/9/20 3:01 PM, Rob Herring wrote:
> On Fri, Oct 09, 2020 at 12:10:03PM +0100, Nicola Mazzucato wrote:
>> Hi Viresh, I'm glad it helped.
>>
>> Please find below my reply.
>>
>> On 10/9/20 6:39 AM, Viresh Kumar wrote:
>>> On 08-10-20, 17:00, Nicola Mazzucato wrote:
>>>> On 10/8/20 4:03 PM, Ionela Voinescu wrote:
>>>>> Hi Viresh,
>>>>>
>>>>> On Thursday 08 Oct 2020 at 16:32:41 (+0530), Viresh Kumar wrote:
>>>>>> On 07-10-20, 13:58, Nicola Mazzucato wrote:
>>>>>>> Hi Viresh,
>>>>>>>
>>>>>>> performance controls is what is exposed by the firmware through a protocol that
>>>>>>> is not capable of describing hardware (say SCMI). For example, the firmware can
>>>>>>> tell that the platform has N controls, but it can't say to which hardware they
>>>>>>> are "wired" to. This is done in dt, where, for example, we map these controls
>>>>>>> to cpus, gpus, etc.
>>>>>>>
>>>>>>> Let's focus on cpus.
>>>>>>>
>>>>>>> Normally we would have N of performance controls (what comes from f/w)
>>>>>>> that that correspond to hardware clock/dvfs domains.
>>>>>>>
>>>>>>> However, some firmware implementations might benefit from having finer
>>>>>>> grained information about the performance requirements (e.g.
>>>>>>> per-CPU) and therefore choose to present M performance controls to the
>>>>>>> OS. DT would be adjusted accordingly to "wire" these controls to cpus
>>>>>>> or set of cpus.
>>>>>>> In this scenario, the f/w will make aggregation decisions based on the
>>>>>>> requests it receives on these M controls.
>>>>>>>
>>>>>>> Here we would have M cpufreq policies which do not necessarily reflect the
>>>>>>> underlying clock domains, thus some s/w components will underperform
>>>>>>> (EAS and thermal, for example).
>>>>>>>
>>>>>>> A real example would be a platform in which the firmware describes the system
>>>>>>> having M per-cpu control, and the cpufreq subsystem will have M policies while
>>>>>>> in fact these cpus are "performance-dependent" each other (e.g. are in the same
>>>>>>> clock domain).
>>>>>>
>>>>>> If the CPUs are in the same clock domain, they must be part of the
>>>>>> same cpufreq policy.
>>>>>
>>>>> But cpufreq does not currently support HW_ALL (I'm using the ACPI
>>>>> coordination type to describe the generic scenario of using hardware
>>>>> aggregation and coordination when establishing the clock rate of CPUs).
>>>>>
>>>>> Adding support for HW_ALL* will involve either bypassing some
>>>>> assumptions around cpufreq policies or making core cpufreq changes.
>>>>>
>>>>> In the way I see it, support for HW_ALL involves either:
>>>>>
>>>>>   - (a) Creating per-cpu policies in order to allow each of the CPUs to
>>>>>     send their own frequency request to the hardware which will do
>>>>>     aggregation and clock rate decision at the level of the clock
>>>>>     domain. The PSD domains (ACPI) and the new DT binding will tell
>>>>>     which CPUs are actually in the same clock domain for whomever is
>>>>>     interested, despite those CPUs not being in the same policy.
>>>>>     This requires the extra mask that Nicola introduced.
>>>>>
>>>>>   - (b) Making deep changes to cpufreq (core/governors/drivers) to allow:
>>>>>     - Governors to stop aggregating (usually max) the information
>>>>>       for each of the CPUs in the policy and convey to the core
>>>>>       information for each CPU.
>>>>>     - Cpufreq core to be able to receive and pass this information
>>>>>       down to the drivers.
>>>>>     - Drivers to be able to have some per cpu structures to hold
>>>>>       frequency control (let's say SCP fast channel addresses) for
>>>>>       each of the CPUs in the policy. Or have these structures in the
>>>>>       cpufreq core/policy, to avoid code duplication in drivers.
>>>>>
>>>>> Therefore (a) is the least invasive but we'll be bypassing the rule
>>>>> above. But to make that rule stick we'll have to make invasive cpufreq
>>>>> changes (b).
>>>>
>>>> Regarding the 'rule' above of one cpufreq policy per clock domain, I would like
>>>> to share my understanding on it. Perhaps it's a good opportunity to shed some light.
>>>>
>>>> Looking back in the history of CPUFreq, related_cpus was originally designed
>>>> to hold the map of cpus within the same clock. Later on, the meaning of this
>>>> cpumask changed [1].
>>>> This led to the introduction of a new cpumask 'freqdomain_cpus'
>>>> within acpi-cpufreq to keep the knowledge of hardware clock domains for
>>>> sysfs consumers since related_cpus was not suitable anymore for this.
>>>> Further on, this cpumask was assigned to online+offline cpus within the same clk
>>>> domain when sw coordination is in use [2].
>>>>
>>>> My interpretation is that there is no guarantee that related_cpus holds the
>>>> 'real' hardware clock implementation. As a consequence, it is not true anymore
>>>> that cpus that are in the same clock domain will be part of the same
>>>> policy.
>>>>
>>>> This guided me to think it would be better to have a cpumask which always holds
>>>> the real hw clock domains in the policy.
>>>>
>>>>>
>>>>> This is my current understanding and I'm leaning towards (a). What do
>>>>> you think?
>>>>>
>>>>> *in not so many words, this is what these patches are trying to propose,
>>>>> while also making sure it's supported for both ACPI and DT.
>>>>>
>>>>> BTW, thank you for your effort in making sense of this!
>>>>>
>>>>> Regards,
>>>>> Ionela.
>>>>>
>>>>
>>>> This could be a platform where per-cpu and perf-dependencies will be used:
>>>>
>>>> CPU:              0    1    2    3    4    5    6    7
>>>> Type:             A    A    A    A    B    B    B    B
>>>> Cluster:         [                                    ]
>>>> perf-controls:   [  ] [  ] [  ] [ ]  [ ]  [ ]  [ ]  [ ]
>>>> perf-dependency: [                ]  [                ]
>>>> HW clock:        [                ]  [                ]
>>>>
>>>> The firmware will present 8 controls to the OS and each control is mapped to a
>>>> cpu device via the standard dt. This is done so we can achieve hw coordination.
>>>> What is required in these systems is to present to OS the information of which
>>>> cpus belong to which clock domain. In other words, when hw coordinates we don't
>>>> have any way at present in dt to understand how these cpus are dependent
>>>> each other, from performance perspective (as opposed to ACPI where we have
>>>> _PSD). Hence my proposal for the new cpu-perf-dependencies.
>>>> This is regardless whether we decide to go for either a policy per-cpu or a
>>>> policy per-domain.
>>>>
>>>> Hope it helps.
>>>
>>> Oh yes, I get it now. Finally. Thanks for helping me out :)
>>>
>>> So if I can say all this stuff in simple terms, this is what it will
>>> be like:
>>>
>>> - We don't want software aggregation of frequencies and so we need to
>>>    have per-cpu policies even when they share their clock lines.
>>>
>>> - But we still need a way for other frameworks to know which CPUs
>>>    share the clock lines (that's what the perf-dependency is all about,
>>>    right ?).
>>>
>>> - We can't get it from SCMI, but need a DT based solution.
>>>
>>> - Currently for the cpufreq-case we relied for this on the way OPP
>>>    tables for the CPUs were described. i.e. the opp-table is marked as
>>>    "shared" and multiple CPUs point to it.
>>>
>>> - I wonder if we can keep using that instead of creating new bindings
>>>    for exact same stuff ? Though the difference here would be that the
>>>    OPP may not have any other entries.
>>
>> I thought about it and looked for other platforms' DT to see if can reuse
>> existing opp information. Unfortunately I don't think it is optimal. The reason
>> being that, because cpus have the same opp table it does not necessarily mean
>> that they share a clock wire. It just tells us that they have the same
>> capabilities (literally just tells us they have the same V/f op points).
>> Unless I am missing something?
>>
>> When comparing with ACPI/_PSD it becomes more intuitive that there is no
>> equivalent way to reveal "perf-dependencies" in DT.
> 
> You should be able to by examining the clock tree. But perhaps SCMI
> abstracts all that and just presents virtual clocks without parent
> clocks available to determine what clocks are shared? Fix SCMI if that's
> the case.

True, the SCMI clock does not support discovery of clock tree:
(from 4.6.1 Clock management protocol background)
'The protocol does not cover discovery of the clock tree, which must be
described through firmware tables instead.' [1]

In this situation, would it make sense, instead of this binding from
patch 1/2, create a binding for internal firmware/scmi node?

Something like:

firmware {
	scmi {
	...		
		scmi-perf-dep {
			compatible = "arm,scmi-perf-dependencies";
			cpu-perf-dep0 {
				cpu-perf-affinity = <&CPU0>, <&CPU1>;
			};
			cpu-perf-dep1 {
				cpu-perf-affinity = <&CPU3>, <&CPU4>;
			};
			cpu-perf-dep2 {
				cpu-perf-affinity = <&CPU7>;
			};
		};
	};
};

The code which is going to parse the binding would be inside the
scmi perf protocol code and used via API by scmi-cpufreq.c.


Now regarding the 'dependent_cpus' mask.

We could avoid adding a new field 'dependent_cpus' in policy
struct, but I am not sure of one bit - Frequency Invariant Engine,
(which is also not fixed by just adding a new cpumask).

We have 3 subsystems to fix:
1. EAS - EM has API function which takes custom cpumask, so no issue,
   fix would be to use it via the scmi-cpufreq.c
2. IPA (for calculating the power of a cluster, not whole thermal needs
   this knowledge about 'dependent cpus') - this can be fixed internally
3. Frequency Invariant Engine (FIE) - currently it relies on schedutil
   filtering and providing max freq of all cpus in the cluster into the
   FIE; this info is then populated to all 'related_cpus' which will
   have this freq (we know, because there is no other freq requests);
   Issues:
3.1. Schedutil is not going to check all cpus in the cluster to take
   max freq, which is then passed into the cpufreq driver and FIE
3.2. FIE would have to (or maybe we would drop it) have a logic similar
   to what schedutil does (max freq search and set, then filter next
   freq requests from other cpus in the next period e.g. 10ms)
3.3. Schedutil is going to invoke freq change for each cpu independently
   and the current code just calls arch_set_freq_scale() - adding just
   'dependent_cpus' won't help
3.4 What would be the real frequency of these cpus and what would be
   set to FIE
3.5 FIE is going to filter to soon requests from other dependent cpus?

IMHO the FIE needs more bits than just a new cpumask.
Maybe we should consider to move FIE arch_set_freq_scale() call into the
cpufreq driver, which will know better how to aggregate/filter requests
and then call FIE update?

Regards,
Lukasz

[1] https://developer.arm.com/documentation/den0056/b/
> 
> Rob
>

Rafael J. Wysocki Oct. 12, 2020, 10:50 a.m. UTC | #16

On Mon, Oct 12, 2020 at 12:23 PM Lukasz Luba <lukasz.luba@arm.com> wrote:
>
> Hi Rob,
>
> On 10/9/20 3:01 PM, Rob Herring wrote:
> > On Fri, Oct 09, 2020 at 12:10:03PM +0100, Nicola Mazzucato wrote:
> >> Hi Viresh, I'm glad it helped.
> >>
> >> Please find below my reply.
> >>
> >> On 10/9/20 6:39 AM, Viresh Kumar wrote:
> >>> On 08-10-20, 17:00, Nicola Mazzucato wrote:
> >>>> On 10/8/20 4:03 PM, Ionela Voinescu wrote:
> >>>>> Hi Viresh,
> >>>>>
> >>>>> On Thursday 08 Oct 2020 at 16:32:41 (+0530), Viresh Kumar wrote:
> >>>>>> On 07-10-20, 13:58, Nicola Mazzucato wrote:
> >>>>>>> Hi Viresh,
> >>>>>>>
> >>>>>>> performance controls is what is exposed by the firmware through a protocol that
> >>>>>>> is not capable of describing hardware (say SCMI). For example, the firmware can
> >>>>>>> tell that the platform has N controls, but it can't say to which hardware they
> >>>>>>> are "wired" to. This is done in dt, where, for example, we map these controls
> >>>>>>> to cpus, gpus, etc.
> >>>>>>>
> >>>>>>> Let's focus on cpus.
> >>>>>>>
> >>>>>>> Normally we would have N of performance controls (what comes from f/w)
> >>>>>>> that that correspond to hardware clock/dvfs domains.
> >>>>>>>
> >>>>>>> However, some firmware implementations might benefit from having finer
> >>>>>>> grained information about the performance requirements (e.g.
> >>>>>>> per-CPU) and therefore choose to present M performance controls to the
> >>>>>>> OS. DT would be adjusted accordingly to "wire" these controls to cpus
> >>>>>>> or set of cpus.
> >>>>>>> In this scenario, the f/w will make aggregation decisions based on the
> >>>>>>> requests it receives on these M controls.
> >>>>>>>
> >>>>>>> Here we would have M cpufreq policies which do not necessarily reflect the
> >>>>>>> underlying clock domains, thus some s/w components will underperform
> >>>>>>> (EAS and thermal, for example).
> >>>>>>>
> >>>>>>> A real example would be a platform in which the firmware describes the system
> >>>>>>> having M per-cpu control, and the cpufreq subsystem will have M policies while
> >>>>>>> in fact these cpus are "performance-dependent" each other (e.g. are in the same
> >>>>>>> clock domain).
> >>>>>>
> >>>>>> If the CPUs are in the same clock domain, they must be part of the
> >>>>>> same cpufreq policy.
> >>>>>
> >>>>> But cpufreq does not currently support HW_ALL (I'm using the ACPI
> >>>>> coordination type to describe the generic scenario of using hardware
> >>>>> aggregation and coordination when establishing the clock rate of CPUs).
> >>>>>
> >>>>> Adding support for HW_ALL* will involve either bypassing some
> >>>>> assumptions around cpufreq policies or making core cpufreq changes.
> >>>>>
> >>>>> In the way I see it, support for HW_ALL involves either:
> >>>>>
> >>>>>   - (a) Creating per-cpu policies in order to allow each of the CPUs to
> >>>>>     send their own frequency request to the hardware which will do
> >>>>>     aggregation and clock rate decision at the level of the clock
> >>>>>     domain. The PSD domains (ACPI) and the new DT binding will tell
> >>>>>     which CPUs are actually in the same clock domain for whomever is
> >>>>>     interested, despite those CPUs not being in the same policy.
> >>>>>     This requires the extra mask that Nicola introduced.
> >>>>>
> >>>>>   - (b) Making deep changes to cpufreq (core/governors/drivers) to allow:
> >>>>>     - Governors to stop aggregating (usually max) the information
> >>>>>       for each of the CPUs in the policy and convey to the core
> >>>>>       information for each CPU.
> >>>>>     - Cpufreq core to be able to receive and pass this information
> >>>>>       down to the drivers.
> >>>>>     - Drivers to be able to have some per cpu structures to hold
> >>>>>       frequency control (let's say SCP fast channel addresses) for
> >>>>>       each of the CPUs in the policy. Or have these structures in the
> >>>>>       cpufreq core/policy, to avoid code duplication in drivers.
> >>>>>
> >>>>> Therefore (a) is the least invasive but we'll be bypassing the rule
> >>>>> above. But to make that rule stick we'll have to make invasive cpufreq
> >>>>> changes (b).
> >>>>
> >>>> Regarding the 'rule' above of one cpufreq policy per clock domain, I would like
> >>>> to share my understanding on it. Perhaps it's a good opportunity to shed some light.
> >>>>
> >>>> Looking back in the history of CPUFreq, related_cpus was originally designed
> >>>> to hold the map of cpus within the same clock. Later on, the meaning of this
> >>>> cpumask changed [1].
> >>>> This led to the introduction of a new cpumask 'freqdomain_cpus'
> >>>> within acpi-cpufreq to keep the knowledge of hardware clock domains for
> >>>> sysfs consumers since related_cpus was not suitable anymore for this.
> >>>> Further on, this cpumask was assigned to online+offline cpus within the same clk
> >>>> domain when sw coordination is in use [2].
> >>>>
> >>>> My interpretation is that there is no guarantee that related_cpus holds the
> >>>> 'real' hardware clock implementation. As a consequence, it is not true anymore
> >>>> that cpus that are in the same clock domain will be part of the same
> >>>> policy.
> >>>>
> >>>> This guided me to think it would be better to have a cpumask which always holds
> >>>> the real hw clock domains in the policy.
> >>>>
> >>>>>
> >>>>> This is my current understanding and I'm leaning towards (a). What do
> >>>>> you think?
> >>>>>
> >>>>> *in not so many words, this is what these patches are trying to propose,
> >>>>> while also making sure it's supported for both ACPI and DT.
> >>>>>
> >>>>> BTW, thank you for your effort in making sense of this!
> >>>>>
> >>>>> Regards,
> >>>>> Ionela.
> >>>>>
> >>>>
> >>>> This could be a platform where per-cpu and perf-dependencies will be used:
> >>>>
> >>>> CPU:              0    1    2    3    4    5    6    7
> >>>> Type:             A    A    A    A    B    B    B    B
> >>>> Cluster:         [                                    ]
> >>>> perf-controls:   [  ] [  ] [  ] [ ]  [ ]  [ ]  [ ]  [ ]
> >>>> perf-dependency: [                ]  [                ]
> >>>> HW clock:        [                ]  [                ]
> >>>>
> >>>> The firmware will present 8 controls to the OS and each control is mapped to a
> >>>> cpu device via the standard dt. This is done so we can achieve hw coordination.
> >>>> What is required in these systems is to present to OS the information of which
> >>>> cpus belong to which clock domain. In other words, when hw coordinates we don't
> >>>> have any way at present in dt to understand how these cpus are dependent
> >>>> each other, from performance perspective (as opposed to ACPI where we have
> >>>> _PSD). Hence my proposal for the new cpu-perf-dependencies.
> >>>> This is regardless whether we decide to go for either a policy per-cpu or a
> >>>> policy per-domain.
> >>>>
> >>>> Hope it helps.
> >>>
> >>> Oh yes, I get it now. Finally. Thanks for helping me out :)
> >>>
> >>> So if I can say all this stuff in simple terms, this is what it will
> >>> be like:
> >>>
> >>> - We don't want software aggregation of frequencies and so we need to
> >>>    have per-cpu policies even when they share their clock lines.
> >>>
> >>> - But we still need a way for other frameworks to know which CPUs
> >>>    share the clock lines (that's what the perf-dependency is all about,
> >>>    right ?).
> >>>
> >>> - We can't get it from SCMI, but need a DT based solution.
> >>>
> >>> - Currently for the cpufreq-case we relied for this on the way OPP
> >>>    tables for the CPUs were described. i.e. the opp-table is marked as
> >>>    "shared" and multiple CPUs point to it.
> >>>
> >>> - I wonder if we can keep using that instead of creating new bindings
> >>>    for exact same stuff ? Though the difference here would be that the
> >>>    OPP may not have any other entries.
> >>
> >> I thought about it and looked for other platforms' DT to see if can reuse
> >> existing opp information. Unfortunately I don't think it is optimal. The reason
> >> being that, because cpus have the same opp table it does not necessarily mean
> >> that they share a clock wire. It just tells us that they have the same
> >> capabilities (literally just tells us they have the same V/f op points).
> >> Unless I am missing something?
> >>
> >> When comparing with ACPI/_PSD it becomes more intuitive that there is no
> >> equivalent way to reveal "perf-dependencies" in DT.
> >
> > You should be able to by examining the clock tree. But perhaps SCMI
> > abstracts all that and just presents virtual clocks without parent
> > clocks available to determine what clocks are shared? Fix SCMI if that's
> > the case.
>
> True, the SCMI clock does not support discovery of clock tree:
> (from 4.6.1 Clock management protocol background)
> 'The protocol does not cover discovery of the clock tree, which must be
> described through firmware tables instead.' [1]
>
> In this situation, would it make sense, instead of this binding from
> patch 1/2, create a binding for internal firmware/scmi node?
>
> Something like:
>
> firmware {
>         scmi {
>         ...
>                 scmi-perf-dep {
>                         compatible = "arm,scmi-perf-dependencies";
>                         cpu-perf-dep0 {
>                                 cpu-perf-affinity = <&CPU0>, <&CPU1>;
>                         };
>                         cpu-perf-dep1 {
>                                 cpu-perf-affinity = <&CPU3>, <&CPU4>;
>                         };
>                         cpu-perf-dep2 {
>                                 cpu-perf-affinity = <&CPU7>;
>                         };
>                 };
>         };
> };
>
> The code which is going to parse the binding would be inside the
> scmi perf protocol code and used via API by scmi-cpufreq.c.
>
>
> Now regarding the 'dependent_cpus' mask.
>
> We could avoid adding a new field 'dependent_cpus' in policy
> struct, but I am not sure of one bit - Frequency Invariant Engine,
> (which is also not fixed by just adding a new cpumask).
>
> We have 3 subsystems to fix:
> 1. EAS - EM has API function which takes custom cpumask, so no issue,
>    fix would be to use it via the scmi-cpufreq.c
> 2. IPA (for calculating the power of a cluster, not whole thermal needs
>    this knowledge about 'dependent cpus') - this can be fixed internally
> 3. Frequency Invariant Engine (FIE) - currently it relies on schedutil
>    filtering and providing max freq of all cpus in the cluster into the
>    FIE; this info is then populated to all 'related_cpus' which will
>    have this freq (we know, because there is no other freq requests);
>    Issues:
> 3.1. Schedutil is not going to check all cpus in the cluster to take
>    max freq, which is then passed into the cpufreq driver and FIE
> 3.2. FIE would have to (or maybe we would drop it) have a logic similar
>    to what schedutil does (max freq search and set, then filter next
>    freq requests from other cpus in the next period e.g. 10ms)
> 3.3. Schedutil is going to invoke freq change for each cpu independently
>    and the current code just calls arch_set_freq_scale() - adding just
>    'dependent_cpus' won't help
> 3.4 What would be the real frequency of these cpus and what would be
>    set to FIE
> 3.5 FIE is going to filter to soon requests from other dependent cpus?
>
> IMHO the FIE needs more bits than just a new cpumask.
> Maybe we should consider to move FIE arch_set_freq_scale() call into the
> cpufreq driver, which will know better how to aggregate/filter requests
> and then call FIE update?
>
> [1] https://developer.arm.com/documentation/den0056/b/

I'm not sure if I understand your concern correctly, but generally
speaking in the one-CPU-per-policy case with HW-based frequency
coordination feedback registers are needed to implement (approximate)
frequency invariance, which is what happens on x86.

Thanks!

Ionela Voinescu Oct. 12, 2020, 10:59 a.m. UTC | #17

On Monday 12 Oct 2020 at 11:22:57 (+0100), Lukasz Luba wrote:
[..]
> > > I thought about it and looked for other platforms' DT to see if can reuse
> > > existing opp information. Unfortunately I don't think it is optimal. The reason
> > > being that, because cpus have the same opp table it does not necessarily mean
> > > that they share a clock wire. It just tells us that they have the same
> > > capabilities (literally just tells us they have the same V/f op points).
> > > Unless I am missing something?
> > > 
> > > When comparing with ACPI/_PSD it becomes more intuitive that there is no
> > > equivalent way to reveal "perf-dependencies" in DT.
> > 
> > You should be able to by examining the clock tree. But perhaps SCMI
> > abstracts all that and just presents virtual clocks without parent
> > clocks available to determine what clocks are shared? Fix SCMI if that's
> > the case.
> 
> True, the SCMI clock does not support discovery of clock tree:
> (from 4.6.1 Clock management protocol background)
> 'The protocol does not cover discovery of the clock tree, which must be
> described through firmware tables instead.' [1]
> 
> In this situation, would it make sense, instead of this binding from
> patch 1/2, create a binding for internal firmware/scmi node?
> 
> Something like:
> 
> firmware {
> 	scmi {
> 	...		
> 		scmi-perf-dep {
> 			compatible = "arm,scmi-perf-dependencies";
> 			cpu-perf-dep0 {
> 				cpu-perf-affinity = <&CPU0>, <&CPU1>;
> 			};
> 			cpu-perf-dep1 {
> 				cpu-perf-affinity = <&CPU3>, <&CPU4>;
> 			};
> 			cpu-perf-dep2 {
> 				cpu-perf-affinity = <&CPU7>;
> 			};
> 		};
> 	};
> };
> 
> The code which is going to parse the binding would be inside the
> scmi perf protocol code and used via API by scmi-cpufreq.c.
> 

While SCMI cpufreq would be able to benefit from the functionality that
Nicola is trying to introduce, it's not the only driver, and more
importantly, it's not *going* to be the only driver benefiting from
this.

Currently there is also qcom-cpufreq-hw.c and the future
mediatek-cpufreq-hw.c that is currently under review [1]. They both do
their frequency setting by interacting with HW/FW, and could either take
or update their OPP tables from there. Therefore, if the platform would
require it, they could also expose different controls for frequency
setting and could benefit from additional information about clock
domains (either through opp-shared or the new entries in Nicola's patch),
without driver changes.

Another point to be made is that I strongly believe this is going to be
the norm in the future. Directly setting PLLs and regulator voltages
has been proven unsafe and unsecure.

Therefore, I see this as support for a generic cpufreq feature (a
hardware coordination type), rather than support for a specific driver.

[1] https://lkml.org/lkml/2020/9/10/11

> 
> Now regarding the 'dependent_cpus' mask.
> 
> We could avoid adding a new field 'dependent_cpus' in policy
> struct, but I am not sure of one bit - Frequency Invariant Engine,
> (which is also not fixed by just adding a new cpumask).
  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  Let's take it step by step..
> 
> We have 3 subsystems to fix:
> 1. EAS - EM has API function which takes custom cpumask, so no issue,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
	   keep in mind that EAS it's using the max aggregation method
	   that schedutil is using. So if we are to describe the
	   functionality correctly, it needs both a cpumask describing
	   the frequency domains and an aggregation method.

>   fix would be to use it via the scmi-cpufreq.c

> 2. IPA (for calculating the power of a cluster, not whole thermal needs
>   this knowledge about 'dependent cpus') - this can be fixed internally

> 3. Frequency Invariant Engine (FIE) - currently it relies on schedutil
>   filtering and providing max freq of all cpus in the cluster into the
>   FIE; this info is then populated to all 'related_cpus' which will
>   have this freq (we know, because there is no other freq requests);
>   Issues:
> 3.1. Schedutil is not going to check all cpus in the cluster to take
>   max freq, which is then passed into the cpufreq driver and FIE
> 3.2. FIE would have to (or maybe we would drop it) have a logic similar
>   to what schedutil does (max freq search and set, then filter next
>   freq requests from other cpus in the next period e.g. 10ms)
> 3.3. Schedutil is going to invoke freq change for each cpu independently
>   and the current code just calls arch_set_freq_scale() - adding just
>   'dependent_cpus' won't help

I don't believe these are issues. As we need changes for EAS and IPA, we'd
need changes for FIE. We don't need more than the cpumask that shows
frequency domains as we already already have the aggregation method that
schedutil uses to propagate the max frequency in a domain across CPUs.

This would be the default method if cycle counters are not present. It
might not reflect the frequency the cores actually get from HW, but for
that cycle counters should be used.

> 3.4 What would be the real frequency of these cpus and what would be
>   set to FIE
> 3.5 FIE is going to filter to soon requests from other dependent cpus?
> 
> IMHO the FIE needs more bits than just a new cpumask.
> Maybe we should consider to move FIE arch_set_freq_scale() call into the
> cpufreq driver, which will know better how to aggregate/filter requests
> and then call FIE update?

I'm quite strongly against this :). As described before, this is not a
feature that a single driver needs, and even if it was, the aggregation
method for FIE is not a driver policy.

Thanks,
Ionela.

> 
> Regards,
> Lukasz
> 
> [1] https://developer.arm.com/documentation/den0056/b/
> > 
> > Rob
> >

Lukasz Luba Oct. 12, 2020, 11:05 a.m. UTC | #18

On 10/12/20 11:50 AM, Rafael J. Wysocki wrote:
> On Mon, Oct 12, 2020 at 12:23 PM Lukasz Luba <lukasz.luba@arm.com> wrote:
>>
>> Hi Rob,
>>
>> On 10/9/20 3:01 PM, Rob Herring wrote:
>>> On Fri, Oct 09, 2020 at 12:10:03PM +0100, Nicola Mazzucato wrote:
>>>> Hi Viresh, I'm glad it helped.
>>>>
>>>> Please find below my reply.
>>>>
>>>> On 10/9/20 6:39 AM, Viresh Kumar wrote:
>>>>> On 08-10-20, 17:00, Nicola Mazzucato wrote:
>>>>>> On 10/8/20 4:03 PM, Ionela Voinescu wrote:
>>>>>>> Hi Viresh,
>>>>>>>
>>>>>>> On Thursday 08 Oct 2020 at 16:32:41 (+0530), Viresh Kumar wrote:
>>>>>>>> On 07-10-20, 13:58, Nicola Mazzucato wrote:
>>>>>>>>> Hi Viresh,
>>>>>>>>>
>>>>>>>>> performance controls is what is exposed by the firmware through a protocol that
>>>>>>>>> is not capable of describing hardware (say SCMI). For example, the firmware can
>>>>>>>>> tell that the platform has N controls, but it can't say to which hardware they
>>>>>>>>> are "wired" to. This is done in dt, where, for example, we map these controls
>>>>>>>>> to cpus, gpus, etc.
>>>>>>>>>
>>>>>>>>> Let's focus on cpus.
>>>>>>>>>
>>>>>>>>> Normally we would have N of performance controls (what comes from f/w)
>>>>>>>>> that that correspond to hardware clock/dvfs domains.
>>>>>>>>>
>>>>>>>>> However, some firmware implementations might benefit from having finer
>>>>>>>>> grained information about the performance requirements (e.g.
>>>>>>>>> per-CPU) and therefore choose to present M performance controls to the
>>>>>>>>> OS. DT would be adjusted accordingly to "wire" these controls to cpus
>>>>>>>>> or set of cpus.
>>>>>>>>> In this scenario, the f/w will make aggregation decisions based on the
>>>>>>>>> requests it receives on these M controls.
>>>>>>>>>
>>>>>>>>> Here we would have M cpufreq policies which do not necessarily reflect the
>>>>>>>>> underlying clock domains, thus some s/w components will underperform
>>>>>>>>> (EAS and thermal, for example).
>>>>>>>>>
>>>>>>>>> A real example would be a platform in which the firmware describes the system
>>>>>>>>> having M per-cpu control, and the cpufreq subsystem will have M policies while
>>>>>>>>> in fact these cpus are "performance-dependent" each other (e.g. are in the same
>>>>>>>>> clock domain).
>>>>>>>>
>>>>>>>> If the CPUs are in the same clock domain, they must be part of the
>>>>>>>> same cpufreq policy.
>>>>>>>
>>>>>>> But cpufreq does not currently support HW_ALL (I'm using the ACPI
>>>>>>> coordination type to describe the generic scenario of using hardware
>>>>>>> aggregation and coordination when establishing the clock rate of CPUs).
>>>>>>>
>>>>>>> Adding support for HW_ALL* will involve either bypassing some
>>>>>>> assumptions around cpufreq policies or making core cpufreq changes.
>>>>>>>
>>>>>>> In the way I see it, support for HW_ALL involves either:
>>>>>>>
>>>>>>>    - (a) Creating per-cpu policies in order to allow each of the CPUs to
>>>>>>>      send their own frequency request to the hardware which will do
>>>>>>>      aggregation and clock rate decision at the level of the clock
>>>>>>>      domain. The PSD domains (ACPI) and the new DT binding will tell
>>>>>>>      which CPUs are actually in the same clock domain for whomever is
>>>>>>>      interested, despite those CPUs not being in the same policy.
>>>>>>>      This requires the extra mask that Nicola introduced.
>>>>>>>
>>>>>>>    - (b) Making deep changes to cpufreq (core/governors/drivers) to allow:
>>>>>>>      - Governors to stop aggregating (usually max) the information
>>>>>>>        for each of the CPUs in the policy and convey to the core
>>>>>>>        information for each CPU.
>>>>>>>      - Cpufreq core to be able to receive and pass this information
>>>>>>>        down to the drivers.
>>>>>>>      - Drivers to be able to have some per cpu structures to hold
>>>>>>>        frequency control (let's say SCP fast channel addresses) for
>>>>>>>        each of the CPUs in the policy. Or have these structures in the
>>>>>>>        cpufreq core/policy, to avoid code duplication in drivers.
>>>>>>>
>>>>>>> Therefore (a) is the least invasive but we'll be bypassing the rule
>>>>>>> above. But to make that rule stick we'll have to make invasive cpufreq
>>>>>>> changes (b).
>>>>>>
>>>>>> Regarding the 'rule' above of one cpufreq policy per clock domain, I would like
>>>>>> to share my understanding on it. Perhaps it's a good opportunity to shed some light.
>>>>>>
>>>>>> Looking back in the history of CPUFreq, related_cpus was originally designed
>>>>>> to hold the map of cpus within the same clock. Later on, the meaning of this
>>>>>> cpumask changed [1].
>>>>>> This led to the introduction of a new cpumask 'freqdomain_cpus'
>>>>>> within acpi-cpufreq to keep the knowledge of hardware clock domains for
>>>>>> sysfs consumers since related_cpus was not suitable anymore for this.
>>>>>> Further on, this cpumask was assigned to online+offline cpus within the same clk
>>>>>> domain when sw coordination is in use [2].
>>>>>>
>>>>>> My interpretation is that there is no guarantee that related_cpus holds the
>>>>>> 'real' hardware clock implementation. As a consequence, it is not true anymore
>>>>>> that cpus that are in the same clock domain will be part of the same
>>>>>> policy.
>>>>>>
>>>>>> This guided me to think it would be better to have a cpumask which always holds
>>>>>> the real hw clock domains in the policy.
>>>>>>
>>>>>>>
>>>>>>> This is my current understanding and I'm leaning towards (a). What do
>>>>>>> you think?
>>>>>>>
>>>>>>> *in not so many words, this is what these patches are trying to propose,
>>>>>>> while also making sure it's supported for both ACPI and DT.
>>>>>>>
>>>>>>> BTW, thank you for your effort in making sense of this!
>>>>>>>
>>>>>>> Regards,
>>>>>>> Ionela.
>>>>>>>
>>>>>>
>>>>>> This could be a platform where per-cpu and perf-dependencies will be used:
>>>>>>
>>>>>> CPU:              0    1    2    3    4    5    6    7
>>>>>> Type:             A    A    A    A    B    B    B    B
>>>>>> Cluster:         [                                    ]
>>>>>> perf-controls:   [  ] [  ] [  ] [ ]  [ ]  [ ]  [ ]  [ ]
>>>>>> perf-dependency: [                ]  [                ]
>>>>>> HW clock:        [                ]  [                ]
>>>>>>
>>>>>> The firmware will present 8 controls to the OS and each control is mapped to a
>>>>>> cpu device via the standard dt. This is done so we can achieve hw coordination.
>>>>>> What is required in these systems is to present to OS the information of which
>>>>>> cpus belong to which clock domain. In other words, when hw coordinates we don't
>>>>>> have any way at present in dt to understand how these cpus are dependent
>>>>>> each other, from performance perspective (as opposed to ACPI where we have
>>>>>> _PSD). Hence my proposal for the new cpu-perf-dependencies.
>>>>>> This is regardless whether we decide to go for either a policy per-cpu or a
>>>>>> policy per-domain.
>>>>>>
>>>>>> Hope it helps.
>>>>>
>>>>> Oh yes, I get it now. Finally. Thanks for helping me out :)
>>>>>
>>>>> So if I can say all this stuff in simple terms, this is what it will
>>>>> be like:
>>>>>
>>>>> - We don't want software aggregation of frequencies and so we need to
>>>>>     have per-cpu policies even when they share their clock lines.
>>>>>
>>>>> - But we still need a way for other frameworks to know which CPUs
>>>>>     share the clock lines (that's what the perf-dependency is all about,
>>>>>     right ?).
>>>>>
>>>>> - We can't get it from SCMI, but need a DT based solution.
>>>>>
>>>>> - Currently for the cpufreq-case we relied for this on the way OPP
>>>>>     tables for the CPUs were described. i.e. the opp-table is marked as
>>>>>     "shared" and multiple CPUs point to it.
>>>>>
>>>>> - I wonder if we can keep using that instead of creating new bindings
>>>>>     for exact same stuff ? Though the difference here would be that the
>>>>>     OPP may not have any other entries.
>>>>
>>>> I thought about it and looked for other platforms' DT to see if can reuse
>>>> existing opp information. Unfortunately I don't think it is optimal. The reason
>>>> being that, because cpus have the same opp table it does not necessarily mean
>>>> that they share a clock wire. It just tells us that they have the same
>>>> capabilities (literally just tells us they have the same V/f op points).
>>>> Unless I am missing something?
>>>>
>>>> When comparing with ACPI/_PSD it becomes more intuitive that there is no
>>>> equivalent way to reveal "perf-dependencies" in DT.
>>>
>>> You should be able to by examining the clock tree. But perhaps SCMI
>>> abstracts all that and just presents virtual clocks without parent
>>> clocks available to determine what clocks are shared? Fix SCMI if that's
>>> the case.
>>
>> True, the SCMI clock does not support discovery of clock tree:
>> (from 4.6.1 Clock management protocol background)
>> 'The protocol does not cover discovery of the clock tree, which must be
>> described through firmware tables instead.' [1]
>>
>> In this situation, would it make sense, instead of this binding from
>> patch 1/2, create a binding for internal firmware/scmi node?
>>
>> Something like:
>>
>> firmware {
>>          scmi {
>>          ...
>>                  scmi-perf-dep {
>>                          compatible = "arm,scmi-perf-dependencies";
>>                          cpu-perf-dep0 {
>>                                  cpu-perf-affinity = <&CPU0>, <&CPU1>;
>>                          };
>>                          cpu-perf-dep1 {
>>                                  cpu-perf-affinity = <&CPU3>, <&CPU4>;
>>                          };
>>                          cpu-perf-dep2 {
>>                                  cpu-perf-affinity = <&CPU7>;
>>                          };
>>                  };
>>          };
>> };
>>
>> The code which is going to parse the binding would be inside the
>> scmi perf protocol code and used via API by scmi-cpufreq.c.
>>
>>
>> Now regarding the 'dependent_cpus' mask.
>>
>> We could avoid adding a new field 'dependent_cpus' in policy
>> struct, but I am not sure of one bit - Frequency Invariant Engine,
>> (which is also not fixed by just adding a new cpumask).
>>
>> We have 3 subsystems to fix:
>> 1. EAS - EM has API function which takes custom cpumask, so no issue,
>>     fix would be to use it via the scmi-cpufreq.c
>> 2. IPA (for calculating the power of a cluster, not whole thermal needs
>>     this knowledge about 'dependent cpus') - this can be fixed internally
>> 3. Frequency Invariant Engine (FIE) - currently it relies on schedutil
>>     filtering and providing max freq of all cpus in the cluster into the
>>     FIE; this info is then populated to all 'related_cpus' which will
>>     have this freq (we know, because there is no other freq requests);
>>     Issues:
>> 3.1. Schedutil is not going to check all cpus in the cluster to take
>>     max freq, which is then passed into the cpufreq driver and FIE
>> 3.2. FIE would have to (or maybe we would drop it) have a logic similar
>>     to what schedutil does (max freq search and set, then filter next
>>     freq requests from other cpus in the next period e.g. 10ms)
>> 3.3. Schedutil is going to invoke freq change for each cpu independently
>>     and the current code just calls arch_set_freq_scale() - adding just
>>     'dependent_cpus' won't help
>> 3.4 What would be the real frequency of these cpus and what would be
>>     set to FIE
>> 3.5 FIE is going to filter to soon requests from other dependent cpus?
>>
>> IMHO the FIE needs more bits than just a new cpumask.
>> Maybe we should consider to move FIE arch_set_freq_scale() call into the
>> cpufreq driver, which will know better how to aggregate/filter requests
>> and then call FIE update?
>>
>> [1] https://developer.arm.com/documentation/den0056/b/
> 
> I'm not sure if I understand your concern correctly, but generally
> speaking in the one-CPU-per-policy case with HW-based frequency
> coordination feedback registers are needed to implement (approximate)
> frequency invariance, which is what happens on x86.
> 
> Thanks!
> 

Unfortunately, in Arm, we might have platforms which do not have these
counters (AMU - similar to mperf, aperf). Even worse, in the same SoC
some CPUs might have them, while other not. They are 'optional' in Arm
and some vendors might skip to implement them. This is a good example
and requirement to make them 'obligatory' (I'll rise this internally).

I don't know if there is a good solution for these problems. I am not
convinced that a simple new cpumask would solve them all.

Thank you Rafael for this important comment.

Regards,
Lukasz

Lukasz Luba Oct. 12, 2020, 1:48 p.m. UTC | #19

On 10/12/20 11:59 AM, Ionela Voinescu wrote:
> On Monday 12 Oct 2020 at 11:22:57 (+0100), Lukasz Luba wrote:
> [..]
>>>> I thought about it and looked for other platforms' DT to see if can reuse
>>>> existing opp information. Unfortunately I don't think it is optimal. The reason
>>>> being that, because cpus have the same opp table it does not necessarily mean
>>>> that they share a clock wire. It just tells us that they have the same
>>>> capabilities (literally just tells us they have the same V/f op points).
>>>> Unless I am missing something?
>>>>
>>>> When comparing with ACPI/_PSD it becomes more intuitive that there is no
>>>> equivalent way to reveal "perf-dependencies" in DT.
>>>
>>> You should be able to by examining the clock tree. But perhaps SCMI
>>> abstracts all that and just presents virtual clocks without parent
>>> clocks available to determine what clocks are shared? Fix SCMI if that's
>>> the case.
>>
>> True, the SCMI clock does not support discovery of clock tree:
>> (from 4.6.1 Clock management protocol background)
>> 'The protocol does not cover discovery of the clock tree, which must be
>> described through firmware tables instead.' [1]
>>
>> In this situation, would it make sense, instead of this binding from
>> patch 1/2, create a binding for internal firmware/scmi node?
>>
>> Something like:
>>
>> firmware {
>> 	scmi {
>> 	...		
>> 		scmi-perf-dep {
>> 			compatible = "arm,scmi-perf-dependencies";
>> 			cpu-perf-dep0 {
>> 				cpu-perf-affinity = <&CPU0>, <&CPU1>;
>> 			};
>> 			cpu-perf-dep1 {
>> 				cpu-perf-affinity = <&CPU3>, <&CPU4>;
>> 			};
>> 			cpu-perf-dep2 {
>> 				cpu-perf-affinity = <&CPU7>;
>> 			};
>> 		};
>> 	};
>> };
>>
>> The code which is going to parse the binding would be inside the
>> scmi perf protocol code and used via API by scmi-cpufreq.c.
>>
> 
> While SCMI cpufreq would be able to benefit from the functionality that
> Nicola is trying to introduce, it's not the only driver, and more
> importantly, it's not *going* to be the only driver benefiting from
> this.
> 
> Currently there is also qcom-cpufreq-hw.c and the future
> mediatek-cpufreq-hw.c that is currently under review [1]. They both do
> their frequency setting by interacting with HW/FW, and could either take
> or update their OPP tables from there. Therefore, if the platform would
> require it, they could also expose different controls for frequency
> setting and could benefit from additional information about clock
> domains (either through opp-shared or the new entries in Nicola's patch),
> without driver changes.
> 
> Another point to be made is that I strongly believe this is going to be
> the norm in the future. Directly setting PLLs and regulator voltages
> has been proven unsafe and unsecure.
> 
> Therefore, I see this as support for a generic cpufreq feature (a
> hardware coordination type), rather than support for a specific driver.
> 
> [1] https://lkml.org/lkml/2020/9/10/11
> 
>>
>> Now regarding the 'dependent_cpus' mask.
>>
>> We could avoid adding a new field 'dependent_cpus' in policy
>> struct, but I am not sure of one bit - Frequency Invariant Engine,
>> (which is also not fixed by just adding a new cpumask).
>    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>    Let's take it step by step..
>>
>> We have 3 subsystems to fix:
>> 1. EAS - EM has API function which takes custom cpumask, so no issue,
>             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> 	   keep in mind that EAS it's using the max aggregation method
> 	   that schedutil is using. So if we are to describe the
> 	   functionality correctly, it needs both a cpumask describing
> 	   the frequency domains and an aggregation method.

EAS does not use schedutil max agregation, it calculates max_util
internally.

The compute_energy() loops through the CPUs in the domain and
takes the utilization from them via schedutil_cpu_util(cpu_rq(cpu)).
It figures out max_util and then em_cpu_energy() maps it to next
frequency for the cluster. It just needs proper utilization from
CPUs, which is taken from run-queues, which is a sum of utilization
of tasks being there. This leads to problem how we account utilization
of a task. This is the place where the FIE is involved. EAS assumes the
utilization is calculated properly.

> 
>>    fix would be to use it via the scmi-cpufreq.c
> 
>> 2. IPA (for calculating the power of a cluster, not whole thermal needs
>>    this knowledge about 'dependent cpus') - this can be fixed internally
> 
>> 3. Frequency Invariant Engine (FIE) - currently it relies on schedutil
>>    filtering and providing max freq of all cpus in the cluster into the
>>    FIE; this info is then populated to all 'related_cpus' which will
>>    have this freq (we know, because there is no other freq requests);
>>    Issues:
>> 3.1. Schedutil is not going to check all cpus in the cluster to take
>>    max freq, which is then passed into the cpufreq driver and FIE
>> 3.2. FIE would have to (or maybe we would drop it) have a logic similar
>>    to what schedutil does (max freq search and set, then filter next
>>    freq requests from other cpus in the next period e.g. 10ms)
>> 3.3. Schedutil is going to invoke freq change for each cpu independently
>>    and the current code just calls arch_set_freq_scale() - adding just
>>    'dependent_cpus' won't help
> 
> I don't believe these are issues. As we need changes for EAS and IPA, we'd
> need changes for FIE. We don't need more than the cpumask that shows
> frequency domains as we already already have the aggregation method that
> schedutil uses to propagate the max frequency in a domain across CPUs.

Schedutil is going to work in !policy_is_shared() mode, which leads to
sugov_update_single() being the 'main' function. We won't have
schedutil goodness which is handling related_cpus use case.

Then in software FIE would you just change the call from:
	arch_set_freq_scale(policy->related_cpus,...)
to:
	arch_set_freq_scale(policy->dependent_cpus,...)
?

This code would be called from any CPU (without filtering) and it
would loop through cpumask updating freq_scale, which is wrong IMO.
You need some 'logic', which is not currently in there.

Leaving the 'related_cpus' would also be wrong (because real CPU
frequency is different, so we would account task utilization wrongly).

> 
> This would be the default method if cycle counters are not present. It
> might not reflect the frequency the cores actually get from HW, but for
> that cycle counters should be used.

IMHO the configurations with per-cpu freq requests while there are CPUs
'dependent' and there are no HW counters to use for tasks
utilization accounting - should be blocked. Then we don't need
'dependent_cpus' in software FIE. Then one less from your requirements
list for new cpumask.

> 
>> 3.4 What would be the real frequency of these cpus and what would be
>>    set to FIE
>> 3.5 FIE is going to filter to soon requests from other dependent cpus?
>>
>> IMHO the FIE needs more bits than just a new cpumask.
>> Maybe we should consider to move FIE arch_set_freq_scale() call into the
>> cpufreq driver, which will know better how to aggregate/filter requests
>> and then call FIE update?
> 
> I'm quite strongly against this :). As described before, this is not a
> feature that a single driver needs, and even if it was, the aggregation
> method for FIE is not a driver policy.

Software version of FIE has issues in this case, schedutil or EAS won't
help (different code path).

Regards,
Lukasz

Rob Herring (Arm) Oct. 12, 2020, 1:59 p.m. UTC | #20

+Stephen for clock issues

On Mon, Oct 12, 2020 at 5:23 AM Lukasz Luba <lukasz.luba@arm.com> wrote:
>
> Hi Rob,
>
> On 10/9/20 3:01 PM, Rob Herring wrote:
> > On Fri, Oct 09, 2020 at 12:10:03PM +0100, Nicola Mazzucato wrote:
> >> Hi Viresh, I'm glad it helped.
> >>
> >> Please find below my reply.
> >>
> >> On 10/9/20 6:39 AM, Viresh Kumar wrote:
> >>> On 08-10-20, 17:00, Nicola Mazzucato wrote:
> >>>> On 10/8/20 4:03 PM, Ionela Voinescu wrote:
> >>>>> Hi Viresh,
> >>>>>
> >>>>> On Thursday 08 Oct 2020 at 16:32:41 (+0530), Viresh Kumar wrote:
> >>>>>> On 07-10-20, 13:58, Nicola Mazzucato wrote:
> >>>>>>> Hi Viresh,
> >>>>>>>
> >>>>>>> performance controls is what is exposed by the firmware through a protocol that
> >>>>>>> is not capable of describing hardware (say SCMI). For example, the firmware can
> >>>>>>> tell that the platform has N controls, but it can't say to which hardware they
> >>>>>>> are "wired" to. This is done in dt, where, for example, we map these controls
> >>>>>>> to cpus, gpus, etc.
> >>>>>>>
> >>>>>>> Let's focus on cpus.
> >>>>>>>
> >>>>>>> Normally we would have N of performance controls (what comes from f/w)
> >>>>>>> that that correspond to hardware clock/dvfs domains.
> >>>>>>>
> >>>>>>> However, some firmware implementations might benefit from having finer
> >>>>>>> grained information about the performance requirements (e.g.
> >>>>>>> per-CPU) and therefore choose to present M performance controls to the
> >>>>>>> OS. DT would be adjusted accordingly to "wire" these controls to cpus
> >>>>>>> or set of cpus.
> >>>>>>> In this scenario, the f/w will make aggregation decisions based on the
> >>>>>>> requests it receives on these M controls.
> >>>>>>>
> >>>>>>> Here we would have M cpufreq policies which do not necessarily reflect the
> >>>>>>> underlying clock domains, thus some s/w components will underperform
> >>>>>>> (EAS and thermal, for example).
> >>>>>>>
> >>>>>>> A real example would be a platform in which the firmware describes the system
> >>>>>>> having M per-cpu control, and the cpufreq subsystem will have M policies while
> >>>>>>> in fact these cpus are "performance-dependent" each other (e.g. are in the same
> >>>>>>> clock domain).
> >>>>>>
> >>>>>> If the CPUs are in the same clock domain, they must be part of the
> >>>>>> same cpufreq policy.
> >>>>>
> >>>>> But cpufreq does not currently support HW_ALL (I'm using the ACPI
> >>>>> coordination type to describe the generic scenario of using hardware
> >>>>> aggregation and coordination when establishing the clock rate of CPUs).
> >>>>>
> >>>>> Adding support for HW_ALL* will involve either bypassing some
> >>>>> assumptions around cpufreq policies or making core cpufreq changes.
> >>>>>
> >>>>> In the way I see it, support for HW_ALL involves either:
> >>>>>
> >>>>>   - (a) Creating per-cpu policies in order to allow each of the CPUs to
> >>>>>     send their own frequency request to the hardware which will do
> >>>>>     aggregation and clock rate decision at the level of the clock
> >>>>>     domain. The PSD domains (ACPI) and the new DT binding will tell
> >>>>>     which CPUs are actually in the same clock domain for whomever is
> >>>>>     interested, despite those CPUs not being in the same policy.
> >>>>>     This requires the extra mask that Nicola introduced.
> >>>>>
> >>>>>   - (b) Making deep changes to cpufreq (core/governors/drivers) to allow:
> >>>>>     - Governors to stop aggregating (usually max) the information
> >>>>>       for each of the CPUs in the policy and convey to the core
> >>>>>       information for each CPU.
> >>>>>     - Cpufreq core to be able to receive and pass this information
> >>>>>       down to the drivers.
> >>>>>     - Drivers to be able to have some per cpu structures to hold
> >>>>>       frequency control (let's say SCP fast channel addresses) for
> >>>>>       each of the CPUs in the policy. Or have these structures in the
> >>>>>       cpufreq core/policy, to avoid code duplication in drivers.
> >>>>>
> >>>>> Therefore (a) is the least invasive but we'll be bypassing the rule
> >>>>> above. But to make that rule stick we'll have to make invasive cpufreq
> >>>>> changes (b).
> >>>>
> >>>> Regarding the 'rule' above of one cpufreq policy per clock domain, I would like
> >>>> to share my understanding on it. Perhaps it's a good opportunity to shed some light.
> >>>>
> >>>> Looking back in the history of CPUFreq, related_cpus was originally designed
> >>>> to hold the map of cpus within the same clock. Later on, the meaning of this
> >>>> cpumask changed [1].
> >>>> This led to the introduction of a new cpumask 'freqdomain_cpus'
> >>>> within acpi-cpufreq to keep the knowledge of hardware clock domains for
> >>>> sysfs consumers since related_cpus was not suitable anymore for this.
> >>>> Further on, this cpumask was assigned to online+offline cpus within the same clk
> >>>> domain when sw coordination is in use [2].
> >>>>
> >>>> My interpretation is that there is no guarantee that related_cpus holds the
> >>>> 'real' hardware clock implementation. As a consequence, it is not true anymore
> >>>> that cpus that are in the same clock domain will be part of the same
> >>>> policy.
> >>>>
> >>>> This guided me to think it would be better to have a cpumask which always holds
> >>>> the real hw clock domains in the policy.
> >>>>
> >>>>>
> >>>>> This is my current understanding and I'm leaning towards (a). What do
> >>>>> you think?
> >>>>>
> >>>>> *in not so many words, this is what these patches are trying to propose,
> >>>>> while also making sure it's supported for both ACPI and DT.
> >>>>>
> >>>>> BTW, thank you for your effort in making sense of this!
> >>>>>
> >>>>> Regards,
> >>>>> Ionela.
> >>>>>
> >>>>
> >>>> This could be a platform where per-cpu and perf-dependencies will be used:
> >>>>
> >>>> CPU:              0    1    2    3    4    5    6    7
> >>>> Type:             A    A    A    A    B    B    B    B
> >>>> Cluster:         [                                    ]
> >>>> perf-controls:   [  ] [  ] [  ] [ ]  [ ]  [ ]  [ ]  [ ]
> >>>> perf-dependency: [                ]  [                ]
> >>>> HW clock:        [                ]  [                ]
> >>>>
> >>>> The firmware will present 8 controls to the OS and each control is mapped to a
> >>>> cpu device via the standard dt. This is done so we can achieve hw coordination.
> >>>> What is required in these systems is to present to OS the information of which
> >>>> cpus belong to which clock domain. In other words, when hw coordinates we don't
> >>>> have any way at present in dt to understand how these cpus are dependent
> >>>> each other, from performance perspective (as opposed to ACPI where we have
> >>>> _PSD). Hence my proposal for the new cpu-perf-dependencies.
> >>>> This is regardless whether we decide to go for either a policy per-cpu or a
> >>>> policy per-domain.
> >>>>
> >>>> Hope it helps.
> >>>
> >>> Oh yes, I get it now. Finally. Thanks for helping me out :)
> >>>
> >>> So if I can say all this stuff in simple terms, this is what it will
> >>> be like:
> >>>
> >>> - We don't want software aggregation of frequencies and so we need to
> >>>    have per-cpu policies even when they share their clock lines.
> >>>
> >>> - But we still need a way for other frameworks to know which CPUs
> >>>    share the clock lines (that's what the perf-dependency is all about,
> >>>    right ?).
> >>>
> >>> - We can't get it from SCMI, but need a DT based solution.
> >>>
> >>> - Currently for the cpufreq-case we relied for this on the way OPP
> >>>    tables for the CPUs were described. i.e. the opp-table is marked as
> >>>    "shared" and multiple CPUs point to it.
> >>>
> >>> - I wonder if we can keep using that instead of creating new bindings
> >>>    for exact same stuff ? Though the difference here would be that the
> >>>    OPP may not have any other entries.
> >>
> >> I thought about it and looked for other platforms' DT to see if can reuse
> >> existing opp information. Unfortunately I don't think it is optimal. The reason
> >> being that, because cpus have the same opp table it does not necessarily mean
> >> that they share a clock wire. It just tells us that they have the same
> >> capabilities (literally just tells us they have the same V/f op points).
> >> Unless I am missing something?
> >>
> >> When comparing with ACPI/_PSD it becomes more intuitive that there is no
> >> equivalent way to reveal "perf-dependencies" in DT.
> >
> > You should be able to by examining the clock tree. But perhaps SCMI
> > abstracts all that and just presents virtual clocks without parent
> > clocks available to determine what clocks are shared? Fix SCMI if that's
> > the case.
>
> True, the SCMI clock does not support discovery of clock tree:
> (from 4.6.1 Clock management protocol background)
> 'The protocol does not cover discovery of the clock tree, which must be
> described through firmware tables instead.' [1]

That's a shame given we don't describe whole clock trees in DT either.

How does assigned-clocks work with SCMI? Or any case where 2 devices
share a clock which imposes restrictions on the 2 devices ability to
control their clock. I guess more generally, this case is just the
latter. 2 CPUs are just 2 devices which may or may not share a clock.

>
> In this situation, would it make sense, instead of this binding from
> patch 1/2, create a binding for internal firmware/scmi node?

Maybe, but I think that's the least of the issues here.

>
> Something like:
>
> firmware {
>         scmi {
>         ...
>                 scmi-perf-dep {
>                         compatible = "arm,scmi-perf-dependencies";
>                         cpu-perf-dep0 {
>                                 cpu-perf-affinity = <&CPU0>, <&CPU1>;
>                         };
>                         cpu-perf-dep1 {
>                                 cpu-perf-affinity = <&CPU3>, <&CPU4>;
>                         };
>                         cpu-perf-dep2 {
>                                 cpu-perf-affinity = <&CPU7>;
>                         };
>                 };
>         };
> };
>
> The code which is going to parse the binding would be inside the
> scmi perf protocol code and used via API by scmi-cpufreq.c.

Sudeep Holla Oct. 12, 2020, 3:49 p.m. UTC | #21

On Fri, Oct 09, 2020 at 11:09:21AM +0530, Viresh Kumar wrote:
> On 08-10-20, 17:00, Nicola Mazzucato wrote:
> > On 10/8/20 4:03 PM, Ionela Voinescu wrote:
> > > Hi Viresh,
> > > 
> > > On Thursday 08 Oct 2020 at 16:32:41 (+0530), Viresh Kumar wrote:
> > >> On 07-10-20, 13:58, Nicola Mazzucato wrote:
> > >>> Hi Viresh,
> > >>>
> > >>> performance controls is what is exposed by the firmware through a protocol that
> > >>> is not capable of describing hardware (say SCMI). For example, the firmware can
> > >>> tell that the platform has N controls, but it can't say to which hardware they
> > >>> are "wired" to. This is done in dt, where, for example, we map these controls
> > >>> to cpus, gpus, etc.
> > >>>
> > >>> Let's focus on cpus.
> > >>>
> > >>> Normally we would have N of performance controls (what comes from f/w)
> > >>> that that correspond to hardware clock/dvfs domains.
> > >>>
> > >>> However, some firmware implementations might benefit from having finer
> > >>> grained information about the performance requirements (e.g.
> > >>> per-CPU) and therefore choose to present M performance controls to the
> > >>> OS. DT would be adjusted accordingly to "wire" these controls to cpus
> > >>> or set of cpus.
> > >>> In this scenario, the f/w will make aggregation decisions based on the
> > >>> requests it receives on these M controls.
> > >>>
> > >>> Here we would have M cpufreq policies which do not necessarily reflect the
> > >>> underlying clock domains, thus some s/w components will underperform
> > >>> (EAS and thermal, for example).
> > >>>
> > >>> A real example would be a platform in which the firmware describes the system
> > >>> having M per-cpu control, and the cpufreq subsystem will have M policies while
> > >>> in fact these cpus are "performance-dependent" each other (e.g. are in the same
> > >>> clock domain).
> > >>
> > >> If the CPUs are in the same clock domain, they must be part of the
> > >> same cpufreq policy.
> > > 
> > > But cpufreq does not currently support HW_ALL (I'm using the ACPI
> > > coordination type to describe the generic scenario of using hardware
> > > aggregation and coordination when establishing the clock rate of CPUs).
> > > 
> > > Adding support for HW_ALL* will involve either bypassing some
> > > assumptions around cpufreq policies or making core cpufreq changes.
> > > 
> > > In the way I see it, support for HW_ALL involves either:
> > > 
> > >  - (a) Creating per-cpu policies in order to allow each of the CPUs to
> > >    send their own frequency request to the hardware which will do
> > >    aggregation and clock rate decision at the level of the clock
> > >    domain. The PSD domains (ACPI) and the new DT binding will tell
> > >    which CPUs are actually in the same clock domain for whomever is
> > >    interested, despite those CPUs not being in the same policy.
> > >    This requires the extra mask that Nicola introduced.
> > > 
> > >  - (b) Making deep changes to cpufreq (core/governors/drivers) to allow:
> > >    - Governors to stop aggregating (usually max) the information
> > >      for each of the CPUs in the policy and convey to the core
> > >      information for each CPU.
> > >    - Cpufreq core to be able to receive and pass this information
> > >      down to the drivers.
> > >    - Drivers to be able to have some per cpu structures to hold
> > >      frequency control (let's say SCP fast channel addresses) for
> > >      each of the CPUs in the policy. Or have these structures in the
> > >      cpufreq core/policy, to avoid code duplication in drivers.
> > > 
> > > Therefore (a) is the least invasive but we'll be bypassing the rule
> > > above. But to make that rule stick we'll have to make invasive cpufreq
> > > changes (b).
> > 
> > Regarding the 'rule' above of one cpufreq policy per clock domain, I would like
> > to share my understanding on it. Perhaps it's a good opportunity to shed some light.
> > 
> > Looking back in the history of CPUFreq, related_cpus was originally designed
> > to hold the map of cpus within the same clock. Later on, the meaning of this
> > cpumask changed [1].
> > This led to the introduction of a new cpumask 'freqdomain_cpus'
> > within acpi-cpufreq to keep the knowledge of hardware clock domains for
> > sysfs consumers since related_cpus was not suitable anymore for this.
> > Further on, this cpumask was assigned to online+offline cpus within the same clk
> > domain when sw coordination is in use [2].
> > 
> > My interpretation is that there is no guarantee that related_cpus holds the
> > 'real' hardware clock implementation. As a consequence, it is not true anymore
> > that cpus that are in the same clock domain will be part of the same
> > policy.
> > 
> > This guided me to think it would be better to have a cpumask which always holds
> > the real hw clock domains in the policy.
> > 
> > > 
> > > This is my current understanding and I'm leaning towards (a). What do
> > > you think?
> > > 
> > > *in not so many words, this is what these patches are trying to propose,
> > > while also making sure it's supported for both ACPI and DT.
> > > 
> > > BTW, thank you for your effort in making sense of this!
> > > 
> > > Regards,
> > > Ionela.
> > > 
> > 
> > This could be a platform where per-cpu and perf-dependencies will be used:
> > 
> > CPU:              0    1    2    3    4    5    6    7
> > Type:             A    A    A    A    B    B    B    B
> > Cluster:         [                                    ]
> > perf-controls:   [  ] [  ] [  ] [ ]  [ ]  [ ]  [ ]  [ ]
> > perf-dependency: [                ]  [                ]
> > HW clock:        [                ]  [                ]
> > 
> > The firmware will present 8 controls to the OS and each control is mapped to a
> > cpu device via the standard dt. This is done so we can achieve hw coordination.
> > What is required in these systems is to present to OS the information of which
> > cpus belong to which clock domain. In other words, when hw coordinates we don't
> > have any way at present in dt to understand how these cpus are dependent
> > each other, from performance perspective (as opposed to ACPI where we have
> > _PSD). Hence my proposal for the new cpu-perf-dependencies.
> > This is regardless whether we decide to go for either a policy per-cpu or a
> > policy per-domain.
> > 
> > Hope it helps.
> 
> Oh yes, I get it now. Finally. Thanks for helping me out :)
> 
> So if I can say all this stuff in simple terms, this is what it will
> be like:
> 
> - We don't want software aggregation of frequencies and so we need to
>   have per-cpu policies even when they share their clock lines.
> 
> - But we still need a way for other frameworks to know which CPUs
>   share the clock lines (that's what the perf-dependency is all about,
>   right ?).
> 
> - We can't get it from SCMI, but need a DT based solution.
> 
> - Currently for the cpufreq-case we relied for this on the way OPP
>   tables for the CPUs were described. i.e. the opp-table is marked as
>   "shared" and multiple CPUs point to it.
> 
> - I wonder if we can keep using that instead of creating new bindings
>   for exact same stuff ? Though the difference here would be that the
>   OPP may not have any other entries.

Well summarised, sorry for chiming in late. I could have not summarised
any better. Just saw the big thread and was thinking of summarising.
If the last point on OPP is possible(i.e. no OPP entries but just use
it for fetch the information) for $subject patch is trying to achieve,
then it would be good.

Sudeep Holla Oct. 12, 2020, 3:54 p.m. UTC | #22

On Fri, Oct 09, 2020 at 09:01:41AM -0500, Rob Herring wrote:
> On Fri, Oct 09, 2020 at 12:10:03PM +0100, Nicola Mazzucato wrote:
> > Hi Viresh, I'm glad it helped.
> >
> > Please find below my reply.
> >
> > On 10/9/20 6:39 AM, Viresh Kumar wrote:
> > > On 08-10-20, 17:00, Nicola Mazzucato wrote:
> > >> On 10/8/20 4:03 PM, Ionela Voinescu wrote:
> > >>> Hi Viresh,
> > >>>
> > >>> On Thursday 08 Oct 2020 at 16:32:41 (+0530), Viresh Kumar wrote:
> > >>>> On 07-10-20, 13:58, Nicola Mazzucato wrote:
> > >>>>> Hi Viresh,
> > >>>>>
> > >>>>> performance controls is what is exposed by the firmware through a protocol that
> > >>>>> is not capable of describing hardware (say SCMI). For example, the firmware can
> > >>>>> tell that the platform has N controls, but it can't say to which hardware they
> > >>>>> are "wired" to. This is done in dt, where, for example, we map these controls
> > >>>>> to cpus, gpus, etc.
> > >>>>>
> > >>>>> Let's focus on cpus.
> > >>>>>
> > >>>>> Normally we would have N of performance controls (what comes from f/w)
> > >>>>> that that correspond to hardware clock/dvfs domains.
> > >>>>>
> > >>>>> However, some firmware implementations might benefit from having finer
> > >>>>> grained information about the performance requirements (e.g.
> > >>>>> per-CPU) and therefore choose to present M performance controls to the
> > >>>>> OS. DT would be adjusted accordingly to "wire" these controls to cpus
> > >>>>> or set of cpus.
> > >>>>> In this scenario, the f/w will make aggregation decisions based on the
> > >>>>> requests it receives on these M controls.
> > >>>>>
> > >>>>> Here we would have M cpufreq policies which do not necessarily reflect the
> > >>>>> underlying clock domains, thus some s/w components will underperform
> > >>>>> (EAS and thermal, for example).
> > >>>>>
> > >>>>> A real example would be a platform in which the firmware describes the system
> > >>>>> having M per-cpu control, and the cpufreq subsystem will have M policies while
> > >>>>> in fact these cpus are "performance-dependent" each other (e.g. are in the same
> > >>>>> clock domain).
> > >>>>
> > >>>> If the CPUs are in the same clock domain, they must be part of the
> > >>>> same cpufreq policy.
> > >>>
> > >>> But cpufreq does not currently support HW_ALL (I'm using the ACPI
> > >>> coordination type to describe the generic scenario of using hardware
> > >>> aggregation and coordination when establishing the clock rate of CPUs).
> > >>>
> > >>> Adding support for HW_ALL* will involve either bypassing some
> > >>> assumptions around cpufreq policies or making core cpufreq changes.
> > >>>
> > >>> In the way I see it, support for HW_ALL involves either:
> > >>>
> > >>>  - (a) Creating per-cpu policies in order to allow each of the CPUs to
> > >>>    send their own frequency request to the hardware which will do
> > >>>    aggregation and clock rate decision at the level of the clock
> > >>>    domain. The PSD domains (ACPI) and the new DT binding will tell
> > >>>    which CPUs are actually in the same clock domain for whomever is
> > >>>    interested, despite those CPUs not being in the same policy.
> > >>>    This requires the extra mask that Nicola introduced.
> > >>>
> > >>>  - (b) Making deep changes to cpufreq (core/governors/drivers) to allow:
> > >>>    - Governors to stop aggregating (usually max) the information
> > >>>      for each of the CPUs in the policy and convey to the core
> > >>>      information for each CPU.
> > >>>    - Cpufreq core to be able to receive and pass this information
> > >>>      down to the drivers.
> > >>>    - Drivers to be able to have some per cpu structures to hold
> > >>>      frequency control (let's say SCP fast channel addresses) for
> > >>>      each of the CPUs in the policy. Or have these structures in the
> > >>>      cpufreq core/policy, to avoid code duplication in drivers.
> > >>>
> > >>> Therefore (a) is the least invasive but we'll be bypassing the rule
> > >>> above. But to make that rule stick we'll have to make invasive cpufreq
> > >>> changes (b).
> > >>
> > >> Regarding the 'rule' above of one cpufreq policy per clock domain, I would like
> > >> to share my understanding on it. Perhaps it's a good opportunity to shed some light.
> > >>
> > >> Looking back in the history of CPUFreq, related_cpus was originally designed
> > >> to hold the map of cpus within the same clock. Later on, the meaning of this
> > >> cpumask changed [1].
> > >> This led to the introduction of a new cpumask 'freqdomain_cpus'
> > >> within acpi-cpufreq to keep the knowledge of hardware clock domains for
> > >> sysfs consumers since related_cpus was not suitable anymore for this.
> > >> Further on, this cpumask was assigned to online+offline cpus within the same clk
> > >> domain when sw coordination is in use [2].
> > >>
> > >> My interpretation is that there is no guarantee that related_cpus holds the
> > >> 'real' hardware clock implementation. As a consequence, it is not true anymore
> > >> that cpus that are in the same clock domain will be part of the same
> > >> policy.
> > >>
> > >> This guided me to think it would be better to have a cpumask which always holds
> > >> the real hw clock domains in the policy.
> > >>
> > >>>
> > >>> This is my current understanding and I'm leaning towards (a). What do
> > >>> you think?
> > >>>
> > >>> *in not so many words, this is what these patches are trying to propose,
> > >>> while also making sure it's supported for both ACPI and DT.
> > >>>
> > >>> BTW, thank you for your effort in making sense of this!
> > >>>
> > >>> Regards,
> > >>> Ionela.
> > >>>
> > >>
> > >> This could be a platform where per-cpu and perf-dependencies will be used:
> > >>
> > >> CPU:              0    1    2    3    4    5    6    7
> > >> Type:             A    A    A    A    B    B    B    B
> > >> Cluster:         [                                    ]
> > >> perf-controls:   [  ] [  ] [  ] [ ]  [ ]  [ ]  [ ]  [ ]
> > >> perf-dependency: [                ]  [                ]
> > >> HW clock:        [                ]  [                ]
> > >>
> > >> The firmware will present 8 controls to the OS and each control is mapped to a
> > >> cpu device via the standard dt. This is done so we can achieve hw coordination.
> > >> What is required in these systems is to present to OS the information of which
> > >> cpus belong to which clock domain. In other words, when hw coordinates we don't
> > >> have any way at present in dt to understand how these cpus are dependent
> > >> each other, from performance perspective (as opposed to ACPI where we have
> > >> _PSD). Hence my proposal for the new cpu-perf-dependencies.
> > >> This is regardless whether we decide to go for either a policy per-cpu or a
> > >> policy per-domain.
> > >>
> > >> Hope it helps.
> > >
> > > Oh yes, I get it now. Finally. Thanks for helping me out :)
> > >
> > > So if I can say all this stuff in simple terms, this is what it will
> > > be like:
> > >
> > > - We don't want software aggregation of frequencies and so we need to
> > >   have per-cpu policies even when they share their clock lines.
> > >
> > > - But we still need a way for other frameworks to know which CPUs
> > >   share the clock lines (that's what the perf-dependency is all about,
> > >   right ?).
> > >
> > > - We can't get it from SCMI, but need a DT based solution.
> > >
> > > - Currently for the cpufreq-case we relied for this on the way OPP
> > >   tables for the CPUs were described. i.e. the opp-table is marked as
> > >   "shared" and multiple CPUs point to it.
> > >
> > > - I wonder if we can keep using that instead of creating new bindings
> > >   for exact same stuff ? Though the difference here would be that the
> > >   OPP may not have any other entries.
> >
> > I thought about it and looked for other platforms' DT to see if can reuse
> > existing opp information. Unfortunately I don't think it is optimal. The reason
> > being that, because cpus have the same opp table it does not necessarily mean
> > that they share a clock wire. It just tells us that they have the same
> > capabilities (literally just tells us they have the same V/f op points).
> > Unless I am missing something?
> >
> > When comparing with ACPI/_PSD it becomes more intuitive that there is no
> > equivalent way to reveal "perf-dependencies" in DT.
>
> You should be able to by examining the clock tree. But perhaps SCMI
> abstracts all that and just presents virtual clocks without parent
> clocks available to determine what clocks are shared? Fix SCMI if that's
> the case.

We tried, but it goes against the abstraction according to SCMI spec authors
which is hard to argue against. The authors think the whole EAS thing is
OSPM specific and since ACPI has a way to express this even when CPPC
entries are per CPU and _PSD provides the shared clock domain info, they
prefer this to be outside the scope of SCMI and use DT for this. I don't
totally agree with that but I don't have strong case to argue here.

--
Regards,
Sudeep

Sudeep Holla Oct. 12, 2020, 4:02 p.m. UTC | #23

On Mon, Oct 12, 2020 at 11:22:57AM +0100, Lukasz Luba wrote:
[...]

> 
> True, the SCMI clock does not support discovery of clock tree:
> (from 4.6.1 Clock management protocol background)
> 'The protocol does not cover discovery of the clock tree, which must be
> described through firmware tables instead.' [1]
>

By firmware, spec refers to DT or ACPI, just to be clear.

> In this situation, would it make sense, instead of this binding from
> patch 1/2, create a binding for internal firmware/scmi node?
>

Why ? I prefer to solve this in a generic way and make it not scmi
specific issue. If OPP idea Viresh suggested can be made to work, that
would be good.

> Something like:
> 
> firmware {
> 	scmi {
> 	...		
> 		scmi-perf-dep {
> 			compatible = "arm,scmi-perf-dependencies";
> 			cpu-perf-dep0 {
> 				cpu-perf-affinity = <&CPU0>, <&CPU1>;
> 			};
> 			cpu-perf-dep1 {
> 				cpu-perf-affinity = <&CPU3>, <&CPU4>;
> 			};
> 			cpu-perf-dep2 {
> 				cpu-perf-affinity = <&CPU7>;
> 			};
> 		};
> 	};
> };
> 
> The code which is going to parse the binding would be inside the
> scmi perf protocol code and used via API by scmi-cpufreq.c.
>

Not completely against it, just need to understand how is this solved
or will be solved for any DT(non SCMI) and why it can be generic.

Sudeep Holla Oct. 12, 2020, 4:06 p.m. UTC | #24

On Thu, Oct 08, 2020 at 05:57:23PM +0200, Rafael J. Wysocki wrote:
> On Thu, Oct 8, 2020 at 5:03 PM Ionela Voinescu <ionela.voinescu@arm.com> wrote:

[...]

>
> >    The PSD domains (ACPI) and the new DT binding will tell
> >    which CPUs are actually in the same clock domain for whomever is
> >    interested, despite those CPUs not being in the same policy.
>
> And this information hasn't been used so far in those cases.
>

Indeed, do you see that will change for new heterogeneous CPUs ? Are there
any plans to use EAS on those ?

--
Regards,
Sudeep

Ionela Voinescu Oct. 12, 2020, 4:30 p.m. UTC | #25

Hi Lukasz,

On Monday 12 Oct 2020 at 14:48:20 (+0100), Lukasz Luba wrote:
> 
> 
> On 10/12/20 11:59 AM, Ionela Voinescu wrote:
> > On Monday 12 Oct 2020 at 11:22:57 (+0100), Lukasz Luba wrote:
> > [..]
> > > > > I thought about it and looked for other platforms' DT to see if can reuse
> > > > > existing opp information. Unfortunately I don't think it is optimal. The reason
> > > > > being that, because cpus have the same opp table it does not necessarily mean
> > > > > that they share a clock wire. It just tells us that they have the same
> > > > > capabilities (literally just tells us they have the same V/f op points).
> > > > > Unless I am missing something?
> > > > > 
> > > > > When comparing with ACPI/_PSD it becomes more intuitive that there is no
> > > > > equivalent way to reveal "perf-dependencies" in DT.
> > > > 
> > > > You should be able to by examining the clock tree. But perhaps SCMI
> > > > abstracts all that and just presents virtual clocks without parent
> > > > clocks available to determine what clocks are shared? Fix SCMI if that's
> > > > the case.
> > > 
> > > True, the SCMI clock does not support discovery of clock tree:
> > > (from 4.6.1 Clock management protocol background)
> > > 'The protocol does not cover discovery of the clock tree, which must be
> > > described through firmware tables instead.' [1]
> > > 
> > > In this situation, would it make sense, instead of this binding from
> > > patch 1/2, create a binding for internal firmware/scmi node?
> > > 
> > > Something like:
> > > 
> > > firmware {
> > > 	scmi {
> > > 	...		
> > > 		scmi-perf-dep {
> > > 			compatible = "arm,scmi-perf-dependencies";
> > > 			cpu-perf-dep0 {
> > > 				cpu-perf-affinity = <&CPU0>, <&CPU1>;
> > > 			};
> > > 			cpu-perf-dep1 {
> > > 				cpu-perf-affinity = <&CPU3>, <&CPU4>;
> > > 			};
> > > 			cpu-perf-dep2 {
> > > 				cpu-perf-affinity = <&CPU7>;
> > > 			};
> > > 		};
> > > 	};
> > > };
> > > 
> > > The code which is going to parse the binding would be inside the
> > > scmi perf protocol code and used via API by scmi-cpufreq.c.
> > > 
> > 
> > While SCMI cpufreq would be able to benefit from the functionality that
> > Nicola is trying to introduce, it's not the only driver, and more
> > importantly, it's not *going* to be the only driver benefiting from
> > this.
> > 
> > Currently there is also qcom-cpufreq-hw.c and the future
> > mediatek-cpufreq-hw.c that is currently under review [1]. They both do
> > their frequency setting by interacting with HW/FW, and could either take
> > or update their OPP tables from there. Therefore, if the platform would
> > require it, they could also expose different controls for frequency
> > setting and could benefit from additional information about clock
> > domains (either through opp-shared or the new entries in Nicola's patch),
> > without driver changes.
> > 
> > Another point to be made is that I strongly believe this is going to be
> > the norm in the future. Directly setting PLLs and regulator voltages
> > has been proven unsafe and unsecure.
> > 
> > Therefore, I see this as support for a generic cpufreq feature (a
> > hardware coordination type), rather than support for a specific driver.
> > 
> > [1] https://lkml.org/lkml/2020/9/10/11
> > 
> > > 
> > > Now regarding the 'dependent_cpus' mask.
> > > 
> > > We could avoid adding a new field 'dependent_cpus' in policy
> > > struct, but I am not sure of one bit - Frequency Invariant Engine,
> > > (which is also not fixed by just adding a new cpumask).
> >    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> >    Let's take it step by step..
> > > 
> > > We have 3 subsystems to fix:
> > > 1. EAS - EM has API function which takes custom cpumask, so no issue,
> >             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> > 	   keep in mind that EAS it's using the max aggregation method
> > 	   that schedutil is using. So if we are to describe the
> > 	   functionality correctly, it needs both a cpumask describing
> > 	   the frequency domains and an aggregation method.
> 
> EAS does not use schedutil max agregation, it calculates max_util
> internally.
> 

But isn't it the same logic mechanism that schedutil uses?

> The compute_energy() loops through the CPUs in the domain and
> takes the utilization from them via schedutil_cpu_util(cpu_rq(cpu)).
> It figures out max_util and then em_cpu_energy() maps it to next
  ^^^^^^^^^^^^^^^^^^^^^^^

Same for schedutil: sugov_next_freq_shared() calls sugov_get_util()
which then calls schedutil_cpu_util().

If your point is that one is applying the max function in compute_energy()
while the other is doing it in sugov_next_freq_shared(), I'll re-enforce
my argument that they are logically doing the same *type* of
aggregation. EAS is relying on it and schedutil was purposely modified
for this purpose:

938e5e4b0d15  sched/cpufreq: Prepare schedutil for Energy Aware
Scheduling

> frequency for the cluster. It just needs proper utilization from
> CPUs, which is taken from run-queues, which is a sum of utilization
> of tasks being there. This leads to problem how we account utilization
> of a task. This is the place where the FIE is involved. EAS assumes the
> utilization is calculated properly.

This is separate. Above we were discussing the aggregation method and
what CPUs this is applied on. I'll continue on FIE below.

> > 
> > >    fix would be to use it via the scmi-cpufreq.c
> > 
> > > 2. IPA (for calculating the power of a cluster, not whole thermal needs
> > >    this knowledge about 'dependent cpus') - this can be fixed internally
> > 
> > > 3. Frequency Invariant Engine (FIE) - currently it relies on schedutil
> > >    filtering and providing max freq of all cpus in the cluster into the
> > >    FIE; this info is then populated to all 'related_cpus' which will
> > >    have this freq (we know, because there is no other freq requests);
> > >    Issues:
> > > 3.1. Schedutil is not going to check all cpus in the cluster to take
> > >    max freq, which is then passed into the cpufreq driver and FIE
> > > 3.2. FIE would have to (or maybe we would drop it) have a logic similar
> > >    to what schedutil does (max freq search and set, then filter next
> > >    freq requests from other cpus in the next period e.g. 10ms)
> > > 3.3. Schedutil is going to invoke freq change for each cpu independently
> > >    and the current code just calls arch_set_freq_scale() - adding just
> > >    'dependent_cpus' won't help
> > 
> > I don't believe these are issues. As we need changes for EAS and IPA, we'd
> > need changes for FIE. We don't need more than the cpumask that shows
> > frequency domains as we already already have the aggregation method that
> > schedutil uses to propagate the max frequency in a domain across CPUs.
> 
> Schedutil is going to work in !policy_is_shared() mode, which leads to
> sugov_update_single() being the 'main' function. We won't have
> schedutil goodness which is handling related_cpus use case.
> 

Agreed! I did not mean that I'd rely on schedutil to do the aggregation
and hand me the answer. But my suggestion is to use the same logical
method - maximum, for cases where counters are not present.

> Then in software FIE would you just change the call from:
> 	arch_set_freq_scale(policy->related_cpus,...)
> to:
> 	arch_set_freq_scale(policy->dependent_cpus,...)
> ?
> 
> This code would be called from any CPU (without filtering) and it
> would loop through cpumask updating freq_scale, which is wrong IMO.
> You need some 'logic', which is not currently in there.
> 

Definitely! But that's because the FIE changes above are incomplete.
That's why whomever does these changes should go beyond:
s/related_cpus/dependent_cpus.

We don't need more information from DT additional to this dependent_cpus
maks, but it does not mean the end solution for making use of it will be
a simple "s/related_cpus/dependent_cpus".

> Leaving the 'related_cpus' would also be wrong (because real CPU
> frequency is different, so we would account task utilization wrongly).
> 
> > 
> > This would be the default method if cycle counters are not present. It
> > might not reflect the frequency the cores actually get from HW, but for
> > that cycle counters should be used.
> 
> IMHO the configurations with per-cpu freq requests while there are CPUs
> 'dependent' and there are no HW counters to use for tasks
> utilization accounting - should be blocked. Then we don't need
> 'dependent_cpus' in software FIE. Then one less from your requirements
> list for new cpumask.
> 

I'd go for a default.. better have something than removing it
altogether, but we'll see.

I'll stop this here as I think we're distracting a bit from the main
purpose of this RFC. I don't believe FIE brings an additional
requirement. "Software" FIE will need fixing/optimizing/bypassing
(we'll agree later on the implementation) but it does not need anything
else from DT/ACPI.

Thank you,
Ionela.

> > 
> > > 3.4 What would be the real frequency of these cpus and what would be
> > >    set to FIE
> > > 3.5 FIE is going to filter to soon requests from other dependent cpus?
> > > 
> > > IMHO the FIE needs more bits than just a new cpumask.
> > > Maybe we should consider to move FIE arch_set_freq_scale() call into the
> > > cpufreq driver, which will know better how to aggregate/filter requests
> > > and then call FIE update?
> > 
> > I'm quite strongly against this :). As described before, this is not a
> > feature that a single driver needs, and even if it was, the aggregation
> > method for FIE is not a driver policy.
> 
> Software version of FIE has issues in this case, schedutil or EAS won't
> help (different code path).
> 
> Regards,
> Lukasz

Ionela Voinescu Oct. 12, 2020, 4:52 p.m. UTC | #26

On Monday 12 Oct 2020 at 16:49:30 (+0100), Sudeep Holla wrote:
> On Fri, Oct 09, 2020 at 11:09:21AM +0530, Viresh Kumar wrote:
> > On 08-10-20, 17:00, Nicola Mazzucato wrote:
> > > On 10/8/20 4:03 PM, Ionela Voinescu wrote:
> > > > Hi Viresh,
> > > > 
> > > > On Thursday 08 Oct 2020 at 16:32:41 (+0530), Viresh Kumar wrote:
> > > >> On 07-10-20, 13:58, Nicola Mazzucato wrote:
> > > >>> Hi Viresh,
> > > >>>
> > > >>> performance controls is what is exposed by the firmware through a protocol that
> > > >>> is not capable of describing hardware (say SCMI). For example, the firmware can
> > > >>> tell that the platform has N controls, but it can't say to which hardware they
> > > >>> are "wired" to. This is done in dt, where, for example, we map these controls
> > > >>> to cpus, gpus, etc.
> > > >>>
> > > >>> Let's focus on cpus.
> > > >>>
> > > >>> Normally we would have N of performance controls (what comes from f/w)
> > > >>> that that correspond to hardware clock/dvfs domains.
> > > >>>
> > > >>> However, some firmware implementations might benefit from having finer
> > > >>> grained information about the performance requirements (e.g.
> > > >>> per-CPU) and therefore choose to present M performance controls to the
> > > >>> OS. DT would be adjusted accordingly to "wire" these controls to cpus
> > > >>> or set of cpus.
> > > >>> In this scenario, the f/w will make aggregation decisions based on the
> > > >>> requests it receives on these M controls.
> > > >>>
> > > >>> Here we would have M cpufreq policies which do not necessarily reflect the
> > > >>> underlying clock domains, thus some s/w components will underperform
> > > >>> (EAS and thermal, for example).
> > > >>>
> > > >>> A real example would be a platform in which the firmware describes the system
> > > >>> having M per-cpu control, and the cpufreq subsystem will have M policies while
> > > >>> in fact these cpus are "performance-dependent" each other (e.g. are in the same
> > > >>> clock domain).
> > > >>
> > > >> If the CPUs are in the same clock domain, they must be part of the
> > > >> same cpufreq policy.
> > > > 
> > > > But cpufreq does not currently support HW_ALL (I'm using the ACPI
> > > > coordination type to describe the generic scenario of using hardware
> > > > aggregation and coordination when establishing the clock rate of CPUs).
> > > > 
> > > > Adding support for HW_ALL* will involve either bypassing some
> > > > assumptions around cpufreq policies or making core cpufreq changes.
> > > > 
> > > > In the way I see it, support for HW_ALL involves either:
> > > > 
> > > >  - (a) Creating per-cpu policies in order to allow each of the CPUs to
> > > >    send their own frequency request to the hardware which will do
> > > >    aggregation and clock rate decision at the level of the clock
> > > >    domain. The PSD domains (ACPI) and the new DT binding will tell
> > > >    which CPUs are actually in the same clock domain for whomever is
> > > >    interested, despite those CPUs not being in the same policy.
> > > >    This requires the extra mask that Nicola introduced.
> > > > 
> > > >  - (b) Making deep changes to cpufreq (core/governors/drivers) to allow:
> > > >    - Governors to stop aggregating (usually max) the information
> > > >      for each of the CPUs in the policy and convey to the core
> > > >      information for each CPU.
> > > >    - Cpufreq core to be able to receive and pass this information
> > > >      down to the drivers.
> > > >    - Drivers to be able to have some per cpu structures to hold
> > > >      frequency control (let's say SCP fast channel addresses) for
> > > >      each of the CPUs in the policy. Or have these structures in the
> > > >      cpufreq core/policy, to avoid code duplication in drivers.
> > > > 
> > > > Therefore (a) is the least invasive but we'll be bypassing the rule
> > > > above. But to make that rule stick we'll have to make invasive cpufreq
> > > > changes (b).
> > > 
> > > Regarding the 'rule' above of one cpufreq policy per clock domain, I would like
> > > to share my understanding on it. Perhaps it's a good opportunity to shed some light.
> > > 
> > > Looking back in the history of CPUFreq, related_cpus was originally designed
> > > to hold the map of cpus within the same clock. Later on, the meaning of this
> > > cpumask changed [1].
> > > This led to the introduction of a new cpumask 'freqdomain_cpus'
> > > within acpi-cpufreq to keep the knowledge of hardware clock domains for
> > > sysfs consumers since related_cpus was not suitable anymore for this.
> > > Further on, this cpumask was assigned to online+offline cpus within the same clk
> > > domain when sw coordination is in use [2].
> > > 
> > > My interpretation is that there is no guarantee that related_cpus holds the
> > > 'real' hardware clock implementation. As a consequence, it is not true anymore
> > > that cpus that are in the same clock domain will be part of the same
> > > policy.
> > > 
> > > This guided me to think it would be better to have a cpumask which always holds
> > > the real hw clock domains in the policy.
> > > 
> > > > 
> > > > This is my current understanding and I'm leaning towards (a). What do
> > > > you think?
> > > > 
> > > > *in not so many words, this is what these patches are trying to propose,
> > > > while also making sure it's supported for both ACPI and DT.
> > > > 
> > > > BTW, thank you for your effort in making sense of this!
> > > > 
> > > > Regards,
> > > > Ionela.
> > > > 
> > > 
> > > This could be a platform where per-cpu and perf-dependencies will be used:
> > > 
> > > CPU:              0    1    2    3    4    5    6    7
> > > Type:             A    A    A    A    B    B    B    B
> > > Cluster:         [                                    ]
> > > perf-controls:   [  ] [  ] [  ] [ ]  [ ]  [ ]  [ ]  [ ]
> > > perf-dependency: [                ]  [                ]
> > > HW clock:        [                ]  [                ]
> > > 
> > > The firmware will present 8 controls to the OS and each control is mapped to a
> > > cpu device via the standard dt. This is done so we can achieve hw coordination.
> > > What is required in these systems is to present to OS the information of which
> > > cpus belong to which clock domain. In other words, when hw coordinates we don't
> > > have any way at present in dt to understand how these cpus are dependent
> > > each other, from performance perspective (as opposed to ACPI where we have
> > > _PSD). Hence my proposal for the new cpu-perf-dependencies.
> > > This is regardless whether we decide to go for either a policy per-cpu or a
> > > policy per-domain.
> > > 
> > > Hope it helps.
> > 
> > Oh yes, I get it now. Finally. Thanks for helping me out :)
> > 
> > So if I can say all this stuff in simple terms, this is what it will
> > be like:
> > 
> > - We don't want software aggregation of frequencies and so we need to
> >   have per-cpu policies even when they share their clock lines.
> > 
> > - But we still need a way for other frameworks to know which CPUs
> >   share the clock lines (that's what the perf-dependency is all about,
> >   right ?).
> > 
> > - We can't get it from SCMI, but need a DT based solution.
> > 
> > - Currently for the cpufreq-case we relied for this on the way OPP
> >   tables for the CPUs were described. i.e. the opp-table is marked as
> >   "shared" and multiple CPUs point to it.
> > 
> > - I wonder if we can keep using that instead of creating new bindings
> >   for exact same stuff ? Though the difference here would be that the
> >   OPP may not have any other entries.
> 
> Well summarised, sorry for chiming in late. I could have not summarised
> any better. Just saw the big thread and was thinking of summarising.
> If the last point on OPP is possible(i.e. no OPP entries but just use
> it for fetch the information) for $subject patch is trying to achieve,
> then it would be good.
> 

Just to put in my two pennies worth: using opp-shared (in possibly empty
OPP table) as alternative to cpu-perf-dependencies sounds good enough
to me as well.

Thanks,
Ionela.

> -- 
> Regards,
> Sudeep

Lukasz Luba Oct. 12, 2020, 5:18 p.m. UTC | #27

On 10/12/20 5:52 PM, Ionela Voinescu wrote:
> On Monday 12 Oct 2020 at 16:49:30 (+0100), Sudeep Holla wrote:
>> On Fri, Oct 09, 2020 at 11:09:21AM +0530, Viresh Kumar wrote:
>>> On 08-10-20, 17:00, Nicola Mazzucato wrote:
>>>> On 10/8/20 4:03 PM, Ionela Voinescu wrote:
>>>>> Hi Viresh,
>>>>>
>>>>> On Thursday 08 Oct 2020 at 16:32:41 (+0530), Viresh Kumar wrote:
>>>>>> On 07-10-20, 13:58, Nicola Mazzucato wrote:
>>>>>>> Hi Viresh,
>>>>>>>
>>>>>>> performance controls is what is exposed by the firmware through a protocol that
>>>>>>> is not capable of describing hardware (say SCMI). For example, the firmware can
>>>>>>> tell that the platform has N controls, but it can't say to which hardware they
>>>>>>> are "wired" to. This is done in dt, where, for example, we map these controls
>>>>>>> to cpus, gpus, etc.
>>>>>>>
>>>>>>> Let's focus on cpus.
>>>>>>>
>>>>>>> Normally we would have N of performance controls (what comes from f/w)
>>>>>>> that that correspond to hardware clock/dvfs domains.
>>>>>>>
>>>>>>> However, some firmware implementations might benefit from having finer
>>>>>>> grained information about the performance requirements (e.g.
>>>>>>> per-CPU) and therefore choose to present M performance controls to the
>>>>>>> OS. DT would be adjusted accordingly to "wire" these controls to cpus
>>>>>>> or set of cpus.
>>>>>>> In this scenario, the f/w will make aggregation decisions based on the
>>>>>>> requests it receives on these M controls.
>>>>>>>
>>>>>>> Here we would have M cpufreq policies which do not necessarily reflect the
>>>>>>> underlying clock domains, thus some s/w components will underperform
>>>>>>> (EAS and thermal, for example).
>>>>>>>
>>>>>>> A real example would be a platform in which the firmware describes the system
>>>>>>> having M per-cpu control, and the cpufreq subsystem will have M policies while
>>>>>>> in fact these cpus are "performance-dependent" each other (e.g. are in the same
>>>>>>> clock domain).
>>>>>>
>>>>>> If the CPUs are in the same clock domain, they must be part of the
>>>>>> same cpufreq policy.
>>>>>
>>>>> But cpufreq does not currently support HW_ALL (I'm using the ACPI
>>>>> coordination type to describe the generic scenario of using hardware
>>>>> aggregation and coordination when establishing the clock rate of CPUs).
>>>>>
>>>>> Adding support for HW_ALL* will involve either bypassing some
>>>>> assumptions around cpufreq policies or making core cpufreq changes.
>>>>>
>>>>> In the way I see it, support for HW_ALL involves either:
>>>>>
>>>>>   - (a) Creating per-cpu policies in order to allow each of the CPUs to
>>>>>     send their own frequency request to the hardware which will do
>>>>>     aggregation and clock rate decision at the level of the clock
>>>>>     domain. The PSD domains (ACPI) and the new DT binding will tell
>>>>>     which CPUs are actually in the same clock domain for whomever is
>>>>>     interested, despite those CPUs not being in the same policy.
>>>>>     This requires the extra mask that Nicola introduced.
>>>>>
>>>>>   - (b) Making deep changes to cpufreq (core/governors/drivers) to allow:
>>>>>     - Governors to stop aggregating (usually max) the information
>>>>>       for each of the CPUs in the policy and convey to the core
>>>>>       information for each CPU.
>>>>>     - Cpufreq core to be able to receive and pass this information
>>>>>       down to the drivers.
>>>>>     - Drivers to be able to have some per cpu structures to hold
>>>>>       frequency control (let's say SCP fast channel addresses) for
>>>>>       each of the CPUs in the policy. Or have these structures in the
>>>>>       cpufreq core/policy, to avoid code duplication in drivers.
>>>>>
>>>>> Therefore (a) is the least invasive but we'll be bypassing the rule
>>>>> above. But to make that rule stick we'll have to make invasive cpufreq
>>>>> changes (b).
>>>>
>>>> Regarding the 'rule' above of one cpufreq policy per clock domain, I would like
>>>> to share my understanding on it. Perhaps it's a good opportunity to shed some light.
>>>>
>>>> Looking back in the history of CPUFreq, related_cpus was originally designed
>>>> to hold the map of cpus within the same clock. Later on, the meaning of this
>>>> cpumask changed [1].
>>>> This led to the introduction of a new cpumask 'freqdomain_cpus'
>>>> within acpi-cpufreq to keep the knowledge of hardware clock domains for
>>>> sysfs consumers since related_cpus was not suitable anymore for this.
>>>> Further on, this cpumask was assigned to online+offline cpus within the same clk
>>>> domain when sw coordination is in use [2].
>>>>
>>>> My interpretation is that there is no guarantee that related_cpus holds the
>>>> 'real' hardware clock implementation. As a consequence, it is not true anymore
>>>> that cpus that are in the same clock domain will be part of the same
>>>> policy.
>>>>
>>>> This guided me to think it would be better to have a cpumask which always holds
>>>> the real hw clock domains in the policy.
>>>>
>>>>>
>>>>> This is my current understanding and I'm leaning towards (a). What do
>>>>> you think?
>>>>>
>>>>> *in not so many words, this is what these patches are trying to propose,
>>>>> while also making sure it's supported for both ACPI and DT.
>>>>>
>>>>> BTW, thank you for your effort in making sense of this!
>>>>>
>>>>> Regards,
>>>>> Ionela.
>>>>>
>>>>
>>>> This could be a platform where per-cpu and perf-dependencies will be used:
>>>>
>>>> CPU:              0    1    2    3    4    5    6    7
>>>> Type:             A    A    A    A    B    B    B    B
>>>> Cluster:         [                                    ]
>>>> perf-controls:   [  ] [  ] [  ] [ ]  [ ]  [ ]  [ ]  [ ]
>>>> perf-dependency: [                ]  [                ]
>>>> HW clock:        [                ]  [                ]
>>>>
>>>> The firmware will present 8 controls to the OS and each control is mapped to a
>>>> cpu device via the standard dt. This is done so we can achieve hw coordination.
>>>> What is required in these systems is to present to OS the information of which
>>>> cpus belong to which clock domain. In other words, when hw coordinates we don't
>>>> have any way at present in dt to understand how these cpus are dependent
>>>> each other, from performance perspective (as opposed to ACPI where we have
>>>> _PSD). Hence my proposal for the new cpu-perf-dependencies.
>>>> This is regardless whether we decide to go for either a policy per-cpu or a
>>>> policy per-domain.
>>>>
>>>> Hope it helps.
>>>
>>> Oh yes, I get it now. Finally. Thanks for helping me out :)
>>>
>>> So if I can say all this stuff in simple terms, this is what it will
>>> be like:
>>>
>>> - We don't want software aggregation of frequencies and so we need to
>>>    have per-cpu policies even when they share their clock lines.
>>>
>>> - But we still need a way for other frameworks to know which CPUs
>>>    share the clock lines (that's what the perf-dependency is all about,
>>>    right ?).
>>>
>>> - We can't get it from SCMI, but need a DT based solution.
>>>
>>> - Currently for the cpufreq-case we relied for this on the way OPP
>>>    tables for the CPUs were described. i.e. the opp-table is marked as
>>>    "shared" and multiple CPUs point to it.
>>>
>>> - I wonder if we can keep using that instead of creating new bindings
>>>    for exact same stuff ? Though the difference here would be that the
>>>    OPP may not have any other entries.
>>
>> Well summarised, sorry for chiming in late. I could have not summarised
>> any better. Just saw the big thread and was thinking of summarising.
>> If the last point on OPP is possible(i.e. no OPP entries but just use
>> it for fetch the information) for $subject patch is trying to achieve,
>> then it would be good.
>>
> 
> Just to put in my two pennies worth: using opp-shared (in possibly empty
> OPP table) as alternative to cpu-perf-dependencies sounds good enough
> to me as well.

+1

Regards,
Lukasz

> 
> Thanks,
> Ionela.
> 
>> -- 
>> Regards,
>> Sudeep

Lukasz Luba Oct. 12, 2020, 6:19 p.m. UTC | #28

On 10/12/20 5:30 PM, Ionela Voinescu wrote:
> Hi Lukasz,
> 
> On Monday 12 Oct 2020 at 14:48:20 (+0100), Lukasz Luba wrote:
>>
>>
>> On 10/12/20 11:59 AM, Ionela Voinescu wrote:
>>> On Monday 12 Oct 2020 at 11:22:57 (+0100), Lukasz Luba wrote:
>>> [..]
>>>>>> I thought about it and looked for other platforms' DT to see if can reuse
>>>>>> existing opp information. Unfortunately I don't think it is optimal. The reason
>>>>>> being that, because cpus have the same opp table it does not necessarily mean
>>>>>> that they share a clock wire. It just tells us that they have the same
>>>>>> capabilities (literally just tells us they have the same V/f op points).
>>>>>> Unless I am missing something?
>>>>>>
>>>>>> When comparing with ACPI/_PSD it becomes more intuitive that there is no
>>>>>> equivalent way to reveal "perf-dependencies" in DT.
>>>>>
>>>>> You should be able to by examining the clock tree. But perhaps SCMI
>>>>> abstracts all that and just presents virtual clocks without parent
>>>>> clocks available to determine what clocks are shared? Fix SCMI if that's
>>>>> the case.
>>>>
>>>> True, the SCMI clock does not support discovery of clock tree:
>>>> (from 4.6.1 Clock management protocol background)
>>>> 'The protocol does not cover discovery of the clock tree, which must be
>>>> described through firmware tables instead.' [1]
>>>>
>>>> In this situation, would it make sense, instead of this binding from
>>>> patch 1/2, create a binding for internal firmware/scmi node?
>>>>
>>>> Something like:
>>>>
>>>> firmware {
>>>> 	scmi {
>>>> 	...		
>>>> 		scmi-perf-dep {
>>>> 			compatible = "arm,scmi-perf-dependencies";
>>>> 			cpu-perf-dep0 {
>>>> 				cpu-perf-affinity = <&CPU0>, <&CPU1>;
>>>> 			};
>>>> 			cpu-perf-dep1 {
>>>> 				cpu-perf-affinity = <&CPU3>, <&CPU4>;
>>>> 			};
>>>> 			cpu-perf-dep2 {
>>>> 				cpu-perf-affinity = <&CPU7>;
>>>> 			};
>>>> 		};
>>>> 	};
>>>> };
>>>>
>>>> The code which is going to parse the binding would be inside the
>>>> scmi perf protocol code and used via API by scmi-cpufreq.c.
>>>>
>>>
>>> While SCMI cpufreq would be able to benefit from the functionality that
>>> Nicola is trying to introduce, it's not the only driver, and more
>>> importantly, it's not *going* to be the only driver benefiting from
>>> this.
>>>
>>> Currently there is also qcom-cpufreq-hw.c and the future
>>> mediatek-cpufreq-hw.c that is currently under review [1]. They both do
>>> their frequency setting by interacting with HW/FW, and could either take
>>> or update their OPP tables from there. Therefore, if the platform would
>>> require it, they could also expose different controls for frequency
>>> setting and could benefit from additional information about clock
>>> domains (either through opp-shared or the new entries in Nicola's patch),
>>> without driver changes.
>>>
>>> Another point to be made is that I strongly believe this is going to be
>>> the norm in the future. Directly setting PLLs and regulator voltages
>>> has been proven unsafe and unsecure.
>>>
>>> Therefore, I see this as support for a generic cpufreq feature (a
>>> hardware coordination type), rather than support for a specific driver.
>>>
>>> [1] https://lkml.org/lkml/2020/9/10/11
>>>
>>>>
>>>> Now regarding the 'dependent_cpus' mask.
>>>>
>>>> We could avoid adding a new field 'dependent_cpus' in policy
>>>> struct, but I am not sure of one bit - Frequency Invariant Engine,
>>>> (which is also not fixed by just adding a new cpumask).
>>>     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>>>     Let's take it step by step..
>>>>
>>>> We have 3 subsystems to fix:
>>>> 1. EAS - EM has API function which takes custom cpumask, so no issue,
>>>              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>>> 	   keep in mind that EAS it's using the max aggregation method
>>> 	   that schedutil is using. So if we are to describe the
>>> 	   functionality correctly, it needs both a cpumask describing
>>> 	   the frequency domains and an aggregation method.
>>
>> EAS does not use schedutil max agregation, it calculates max_util
>> internally.
>>
> 
> But isn't it the same logic mechanism that schedutil uses?

No, it's standalone, slightly different because no ioboosting.
As I said, on issue, just proper cpumask during EM registration is
needed and proper tasks utilization (assuming FIE works in the system).
No need of a new cpumask in policy.

> 
>> The compute_energy() loops through the CPUs in the domain and
>> takes the utilization from them via schedutil_cpu_util(cpu_rq(cpu)).
>> It figures out max_util and then em_cpu_energy() maps it to next
>    ^^^^^^^^^^^^^^^^^^^^^^^
> 
> Same for schedutil: sugov_next_freq_shared() calls sugov_get_util()
> which then calls schedutil_cpu_util().

Which we won't have in per-cpu solution and we would use simpler:
sugov_update_single()
and in FIE you won't see the aggregated max freq for the cluster.
Instead you will see N freq requests coming from N CPUs and going
into software FIE.


> 
> If your point is that one is applying the max function in compute_energy()
> while the other is doing it in sugov_next_freq_shared(), I'll re-enforce
> my argument that they are logically doing the same *type* of
> aggregation. EAS is relying on it and schedutil was purposely modified
> for this purpose:
> 
> 938e5e4b0d15  sched/cpufreq: Prepare schedutil for Energy Aware
> Scheduling
> 
>> frequency for the cluster. It just needs proper utilization from
>> CPUs, which is taken from run-queues, which is a sum of utilization
>> of tasks being there. This leads to problem how we account utilization
>> of a task. This is the place where the FIE is involved. EAS assumes the
>> utilization is calculated properly.
> 
> This is separate. Above we were discussing the aggregation method and
> what CPUs this is applied on. I'll continue on FIE below.

As I said EAS has no issues, it has it own 'aggregation' just needs
cpumask from EM.
FIE has issue, due to missing schedutil shared policy, in particular the
function:
sugov_next_freq_shared()
which figures out the freq max for the cluster.

> 
>>>
>>>>     fix would be to use it via the scmi-cpufreq.c
>>>
>>>> 2. IPA (for calculating the power of a cluster, not whole thermal needs
>>>>     this knowledge about 'dependent cpus') - this can be fixed internally
>>>
>>>> 3. Frequency Invariant Engine (FIE) - currently it relies on schedutil
>>>>     filtering and providing max freq of all cpus in the cluster into the
>>>>     FIE; this info is then populated to all 'related_cpus' which will
>>>>     have this freq (we know, because there is no other freq requests);
>>>>     Issues:
>>>> 3.1. Schedutil is not going to check all cpus in the cluster to take
>>>>     max freq, which is then passed into the cpufreq driver and FIE
>>>> 3.2. FIE would have to (or maybe we would drop it) have a logic similar
>>>>     to what schedutil does (max freq search and set, then filter next
>>>>     freq requests from other cpus in the next period e.g. 10ms)
>>>> 3.3. Schedutil is going to invoke freq change for each cpu independently
>>>>     and the current code just calls arch_set_freq_scale() - adding just
>>>>     'dependent_cpus' won't help
>>>
>>> I don't believe these are issues. As we need changes for EAS and IPA, we'd
>>> need changes for FIE. We don't need more than the cpumask that shows
>>> frequency domains as we already already have the aggregation method that
>>> schedutil uses to propagate the max frequency in a domain across CPUs.
>>
>> Schedutil is going to work in !policy_is_shared() mode, which leads to
>> sugov_update_single() being the 'main' function. We won't have
>> schedutil goodness which is handling related_cpus use case.
>>
> 
> Agreed! I did not mean that I'd rely on schedutil to do the aggregation
> and hand me the answer. But my suggestion is to use the same logical
> method - maximum, for cases where counters are not present.

In FIE generic code and this new 'dependent_cpus' mask?
Or with help of cpufreq driver, which would have internal cpu
dependency mask?

IMO the old FIE implementation where arch_set_freq_scale()
was called from the drivers had less issues, see below.

> 
>> Then in software FIE would you just change the call from:
>> 	arch_set_freq_scale(policy->related_cpus,...)
>> to:
>> 	arch_set_freq_scale(policy->dependent_cpus,...)
>> ?
>>
>> This code would be called from any CPU (without filtering) and it
>> would loop through cpumask updating freq_scale, which is wrong IMO.
>> You need some 'logic', which is not currently in there.
>>
> 
> Definitely! But that's because the FIE changes above are incomplete.
> That's why whomever does these changes should go beyond:
> s/related_cpus/dependent_cpus.
> 
> We don't need more information from DT additional to this dependent_cpus
> maks, but it does not mean the end solution for making use of it will be
> a simple "s/related_cpus/dependent_cpus".

The previous FIE implementation where arch_set_freq_scale()
was called from the drivers, was better suited for this issue.
Driver could just use internal dependency cpumask or even
do the aggregation to figure out the max freq for cluster
if there is a need, before calling arch_set_freq_scale().

It is not perfect solution for software FIE, but one of possible
when there is no hw counters.

> 
>> Leaving the 'related_cpus' would also be wrong (because real CPU
>> frequency is different, so we would account task utilization wrongly).
>>
>>>
>>> This would be the default method if cycle counters are not present. It
>>> might not reflect the frequency the cores actually get from HW, but for
>>> that cycle counters should be used.
>>
>> IMHO the configurations with per-cpu freq requests while there are CPUs
>> 'dependent' and there are no HW counters to use for tasks
>> utilization accounting - should be blocked. Then we don't need
>> 'dependent_cpus' in software FIE. Then one less from your requirements
>> list for new cpumask.
>>
> 
> I'd go for a default.. better have something than removing it
> altogether, but we'll see.
> 
> I'll stop this here as I think we're distracting a bit from the main
> purpose of this RFC. I don't believe FIE brings an additional
> requirement. 

You've put FIE as 3rd in the list in response to Rafael's question
'What would you use it for, specifically?'
when he ask regarding that new extra cpumask in policy.

That's why I am trying to figure out if we can eliminate FIE
form your list of sub-system which would need the extra mask.


"Software" FIE will need fixing/optimizing/bypassing
> (we'll agree later on the implementation) but it does not need anything
> else from DT/ACPI.

So we agree that software FIE needs fixes.

Difference between new FIE and old FIE (from v5.8) is that the new one
purely relies on schedutil max freq value (which will now be missing),
while the old FIE was called by the driver and thus it was an option to
fix only the affected cpufreq driver [1][2].

IMO we can avoid this new cpumask in policy.

Regards,
Lukasz

[1] 
https://elixir.bootlin.com/linux/v5.8/source/drivers/cpufreq/scmi-cpufreq.c#L58
[2] 
https://elixir.bootlin.com/linux/v5.8/source/drivers/cpufreq/qcom-cpufreq-hw.c#L79

> 
> Thank you,
> Ionela.
> 
>>>
>>>> 3.4 What would be the real frequency of these cpus and what would be
>>>>     set to FIE
>>>> 3.5 FIE is going to filter to soon requests from other dependent cpus?
>>>>
>>>> IMHO the FIE needs more bits than just a new cpumask.
>>>> Maybe we should consider to move FIE arch_set_freq_scale() call into the
>>>> cpufreq driver, which will know better how to aggregate/filter requests
>>>> and then call FIE update?
>>>
>>> I'm quite strongly against this :). As described before, this is not a
>>> feature that a single driver needs, and even if it was, the aggregation
>>> method for FIE is not a driver policy.
>>
>> Software version of FIE has issues in this case, schedutil or EAS won't
>> help (different code path).
>>
>> Regards,
>> Lukasz

Ionela Voinescu Oct. 12, 2020, 10:01 p.m. UTC | #29

Hey Lukasz,

I think after all this discussion (in our own way of describing things)
we agree on how the current cpufreq based FIE implementation is affected
in systems that use hardware coordination.

What we don't agree on is the location where that implementation (that
uses the new mask and aggregation) should be.

On Monday 12 Oct 2020 at 19:19:29 (+0100), Lukasz Luba wrote:
[..]
> The previous FIE implementation where arch_set_freq_scale()
> was called from the drivers, was better suited for this issue.
> Driver could just use internal dependency cpumask or even
> do the aggregation to figure out the max freq for cluster
> if there is a need, before calling arch_set_freq_scale().
> 
> It is not perfect solution for software FIE, but one of possible
> when there is no hw counters.
> 
[..]

> Difference between new FIE and old FIE (from v5.8) is that the new one
> purely relies on schedutil max freq value (which will now be missing),
> while the old FIE was called by the driver and thus it was an option to
> fix only the affected cpufreq driver [1][2].
> 

My final argument is that now you have 2 drivers that would need this
support, next you'll have 3 (the new mediatek driver), and in the future
there will be more. So why limit and duplicate this functionality in the
drivers? Why not make it generic for all drivers to use if the system
is using hardware coordination?

Additionally, I don't think drivers should not even need to know about
these dependency/clock domains. They should act at the level of the
policy, which in this case will be at the level of each CPU.

Thanks,
Ionela.

> IMO we can avoid this new cpumask in policy.
> 
> Regards,
> Lukasz
> 
> [1] https://elixir.bootlin.com/linux/v5.8/source/drivers/cpufreq/scmi-cpufreq.c#L58
> [2] https://elixir.bootlin.com/linux/v5.8/source/drivers/cpufreq/qcom-cpufreq-hw.c#L79
>

Rafael J. Wysocki Oct. 13, 2020, 11:53 a.m. UTC | #30

On Tue, Oct 13, 2020 at 12:01 AM Ionela Voinescu
<ionela.voinescu@arm.com> wrote:
>
> Hey Lukasz,
>
> I think after all this discussion (in our own way of describing things)
> we agree on how the current cpufreq based FIE implementation is affected
> in systems that use hardware coordination.
>
> What we don't agree on is the location where that implementation (that
> uses the new mask and aggregation) should be.
>
> On Monday 12 Oct 2020 at 19:19:29 (+0100), Lukasz Luba wrote:
> [..]
> > The previous FIE implementation where arch_set_freq_scale()
> > was called from the drivers, was better suited for this issue.
> > Driver could just use internal dependency cpumask or even
> > do the aggregation to figure out the max freq for cluster
> > if there is a need, before calling arch_set_freq_scale().
> >
> > It is not perfect solution for software FIE, but one of possible
> > when there is no hw counters.
> >
> [..]
>
> > Difference between new FIE and old FIE (from v5.8) is that the new one
> > purely relies on schedutil max freq value (which will now be missing),
> > while the old FIE was called by the driver and thus it was an option to
> > fix only the affected cpufreq driver [1][2].
> >
>
> My final argument is that now you have 2 drivers that would need this
> support, next you'll have 3 (the new mediatek driver), and in the future
> there will be more. So why limit and duplicate this functionality in the
> drivers? Why not make it generic for all drivers to use if the system
> is using hardware coordination?
>
> Additionally, I don't think drivers should not even need to know about
> these dependency/clock domains. They should act at the level of the
> policy, which in this case will be at the level of each CPU.

The policies come from the driver, though.

The driver decides how many CPUs will be there in a policy and how to
handle them at the initialization time.

The core has no idea whether or not there is HW coordination in the
system, the driver is expected to know that and take that into
account.

Accordingly, it looks like there should be an option for drivers to
arrange things in the most convenient way (from their perspective) and
that option has gone away now.

Ionela Voinescu Oct. 13, 2020, 12:39 p.m. UTC | #31

Hi Rafael,

On Tuesday 13 Oct 2020 at 13:53:37 (+0200), Rafael J. Wysocki wrote:
> On Tue, Oct 13, 2020 at 12:01 AM Ionela Voinescu
> <ionela.voinescu@arm.com> wrote:
> >
> > Hey Lukasz,
> >
> > I think after all this discussion (in our own way of describing things)
> > we agree on how the current cpufreq based FIE implementation is affected
> > in systems that use hardware coordination.
> >
> > What we don't agree on is the location where that implementation (that
> > uses the new mask and aggregation) should be.
> >
> > On Monday 12 Oct 2020 at 19:19:29 (+0100), Lukasz Luba wrote:
> > [..]
> > > The previous FIE implementation where arch_set_freq_scale()
> > > was called from the drivers, was better suited for this issue.
> > > Driver could just use internal dependency cpumask or even
> > > do the aggregation to figure out the max freq for cluster
> > > if there is a need, before calling arch_set_freq_scale().
> > >
> > > It is not perfect solution for software FIE, but one of possible
> > > when there is no hw counters.
> > >
> > [..]
> >
> > > Difference between new FIE and old FIE (from v5.8) is that the new one
> > > purely relies on schedutil max freq value (which will now be missing),
> > > while the old FIE was called by the driver and thus it was an option to
> > > fix only the affected cpufreq driver [1][2].
> > >
> >
> > My final argument is that now you have 2 drivers that would need this
> > support, next you'll have 3 (the new mediatek driver), and in the future
> > there will be more. So why limit and duplicate this functionality in the
> > drivers? Why not make it generic for all drivers to use if the system
> > is using hardware coordination?
> >
> > Additionally, I don't think drivers should not even need to know about
> > these dependency/clock domains. They should act at the level of the
> > policy, which in this case will be at the level of each CPU.
> 
> The policies come from the driver, though.
> 
> The driver decides how many CPUs will be there in a policy and how to
> handle them at the initialization time.

Yes, policies are built based on information populated from the drivers
at .init(): what CPUs will belong to a policy, what methods to use for
setting and getting frequency, etc.

So they do pass this information to the cpufreq core to be stored at the
level of the policy, but later drivers (in the majority of cases) will
not need to store on their own information on what CPUs belong to a
frequency domain, they rely on having passed that information to the
core, and the core mechanisms hold this information on the clock domains
(currently through policy->cpus and policy->related_cpus).

> 
> The core has no idea whether or not there is HW coordination in the
> system, the driver is expected to know that and take that into
> account.
> 

Given that multiple drivers could use hardware coordination, and
drivers already have a way to pass information about the type of
coordination to the core through policy->shared_type, could there be a
case for supporting this in the core, rather than the drivers?

In my head I'm finding this option better compared to having a select
set of drivers that would instruct the core to build the policies
per-cpu, while holding in the driver information about what CPUs
actually belong to clock domains.

Additionally, the cpufreq core will have to be able to present to other
frameworks (scheduler, thermal) this mask when requested, through a
cpufreq interface function. So in the end we'll still potentially end
up passing on this information from the driver to the core and then to
the user.

> Accordingly, it looks like there should be an option for drivers to
> arrange things in the most convenient way (from their perspective) and
> that option has gone away now.

IMO, even if this hardware coordination support is entirely managed by
the driver, one requirement is that other subsystems would be able to
acquire information about dependent cpus. The scheduler FIE should just
be another one of those users, with the decision on how that information
is handled residing in architecture code (arch_set_freq_scale()).
Architecture code might decide to have a default way of handling these
cases or not to support them at all.

Thank you,
Ionela.

Lukasz Luba Oct. 13, 2020, 1:53 p.m. UTC | #32

Hi Viresh,

On 10/9/20 6:39 AM, Viresh Kumar wrote:

<snip>

> 
> Oh yes, I get it now. Finally. Thanks for helping me out :)
> 
> So if I can say all this stuff in simple terms, this is what it will
> be like:
> 
> - We don't want software aggregation of frequencies and so we need to
>    have per-cpu policies even when they share their clock lines.
> 
> - But we still need a way for other frameworks to know which CPUs
>    share the clock lines (that's what the perf-dependency is all about,
>    right ?).
> 
> - We can't get it from SCMI, but need a DT based solution.
> 
> - Currently for the cpufreq-case we relied for this on the way OPP
>    tables for the CPUs were described. i.e. the opp-table is marked as
>    "shared" and multiple CPUs point to it.

I've started wondering based on the OPP code if this is a good solution.
We would end up with one (?) instance of opp_table and list of devices
pinned to it, in: opp_table->dev_list
It can be seen e.g. in function dev_pm_opp_get_sharing_cpus(),
where we retrieve the cpumask simply looping through the devices:

list_for_each_entry(opp_dev, &opp_table->dev_list, node)
	cpumask_set_cpu(opp_dev->dev->id, cpumask);

This means we have a single OPP table for all pinned CPUs.
I wonder if this is not too strong assumption for still being compliant
with SCMI spec, when in theory performance levels might differ...
(please correct me here it that would never happen)

There is also 2nd function dev_pm_opp_of_get_sharing_cpus() which looks
more promising. But I still don't know if the framework will allow us
to have private OPP tables when we use 'shared' in DT.

Could you clarify if we would get 'private' opp table for each CPU,
which could be then populated independently, but still 2nd function will
work?

Regards,
Lukasz

Viresh Kumar Oct. 14, 2020, 4:20 a.m. UTC | #33

On 13-10-20, 14:53, Lukasz Luba wrote:
> I've started wondering based on the OPP code if this is a good solution.
> We would end up with one (?) instance of opp_table and list of devices
> pinned to it, in: opp_table->dev_list
> It can be seen e.g. in function dev_pm_opp_get_sharing_cpus(),
> where we retrieve the cpumask simply looping through the devices:
> 
> list_for_each_entry(opp_dev, &opp_table->dev_list, node)
> 	cpumask_set_cpu(opp_dev->dev->id, cpumask);
> 
> 
> This means we have a single OPP table for all pinned CPUs.
> I wonder if this is not too strong assumption for still being compliant
> with SCMI spec, when in theory performance levels might differ...
> (please correct me here it that would never happen)
> 
> There is also 2nd function dev_pm_opp_of_get_sharing_cpus() which looks
> more promising. But I still don't know if the framework will allow us
> to have private OPP tables when we use 'shared' in DT.
> 
> Could you clarify if we would get 'private' opp table for each CPU,
> which could be then populated independently, but still 2nd function will
> work?

I think there is some misunderstanding here in your part. The
opp-table in the code is shared between CPUs only when the
"opp-shared" property is present in the OPP table. If that property
isn't available, even if same DT opp-table is pointed at by all the
CPUs, the code will have separate copies of the OPP table.

Though in your case (with performance-levels) it shouldn't matter as
code will never create an OPP table I suppose.

Viresh Kumar Oct. 14, 2020, 4:25 a.m. UTC | #34

On 12-10-20, 18:18, Lukasz Luba wrote:
> On 10/12/20 5:52 PM, Ionela Voinescu wrote:
> > On Monday 12 Oct 2020 at 16:49:30 (+0100), Sudeep Holla wrote:
> > > On Fri, Oct 09, 2020 at 11:09:21AM +0530, Viresh Kumar wrote:
> > > > - I wonder if we can keep using that instead of creating new bindings
> > > >    for exact same stuff ? Though the difference here would be that the
> > > >    OPP may not have any other entries.
> > > 
> > > Well summarised, sorry for chiming in late. I could have not summarised
> > > any better. Just saw the big thread and was thinking of summarising.
> > > If the last point on OPP is possible(i.e. no OPP entries but just use
> > > it for fetch the information) for $subject patch is trying to achieve,
> > > then it would be good.

Under normal circumstances, I wouldn't have suggested empty opp-tables
for sure but it doesn't seem worth adding another binding to get this
information out :)

> > 
> > Just to put in my two pennies worth: using opp-shared (in possibly empty
> > OPP table) as alternative to cpu-perf-dependencies sounds good enough
> > to me as well.
> 
> +1

Now that (almost) everyone agrees, I don't think we need to make any
change anywhere, in code or bindings. This should work right now as
well.  The code should never try to create OPP tables and the core
will not create one. Your driver (which want to get this information
out of empty OPP tables) shall call dev_pm_opp_of_get_sharing_cpus(),
which just parses the DT to get this information out.

Lukasz Luba Oct. 14, 2020, 9:11 a.m. UTC | #35

On 10/14/20 5:25 AM, Viresh Kumar wrote:
> On 12-10-20, 18:18, Lukasz Luba wrote:
>> On 10/12/20 5:52 PM, Ionela Voinescu wrote:
>>> On Monday 12 Oct 2020 at 16:49:30 (+0100), Sudeep Holla wrote:
>>>> On Fri, Oct 09, 2020 at 11:09:21AM +0530, Viresh Kumar wrote:
>>>>> - I wonder if we can keep using that instead of creating new bindings
>>>>>     for exact same stuff ? Though the difference here would be that the
>>>>>     OPP may not have any other entries.
>>>>
>>>> Well summarised, sorry for chiming in late. I could have not summarised
>>>> any better. Just saw the big thread and was thinking of summarising.
>>>> If the last point on OPP is possible(i.e. no OPP entries but just use
>>>> it for fetch the information) for $subject patch is trying to achieve,
>>>> then it would be good.
> 
> Under normal circumstances, I wouldn't have suggested empty opp-tables
> for sure but it doesn't seem worth adding another binding to get this
> information out :)
> 
>>>
>>> Just to put in my two pennies worth: using opp-shared (in possibly empty
>>> OPP table) as alternative to cpu-perf-dependencies sounds good enough
>>> to me as well.
>>
>> +1
> 
> Now that (almost) everyone agrees, I don't think we need to make any
> change anywhere, in code or bindings. This should work right now as
> well.  The code should never try to create OPP tables and the core
> will not create one. Your driver (which want to get this information
> out of empty OPP tables) shall call dev_pm_opp_of_get_sharing_cpus(),
> which just parses the DT to get this information out.
> 

Thank you Viresh. We are going to experiment with this and come back
soon.

Regards,
Lukasz

Rafael J. Wysocki Oct. 15, 2020, 3:56 p.m. UTC | #36

On Tue, Oct 13, 2020 at 2:39 PM Ionela Voinescu <ionela.voinescu@arm.com> wrote:
>
> Hi Rafael,
>
> On Tuesday 13 Oct 2020 at 13:53:37 (+0200), Rafael J. Wysocki wrote:
> > On Tue, Oct 13, 2020 at 12:01 AM Ionela Voinescu
> > <ionela.voinescu@arm.com> wrote:
> > >
> > > Hey Lukasz,
> > >
> > > I think after all this discussion (in our own way of describing things)
> > > we agree on how the current cpufreq based FIE implementation is affected
> > > in systems that use hardware coordination.
> > >
> > > What we don't agree on is the location where that implementation (that
> > > uses the new mask and aggregation) should be.
> > >
> > > On Monday 12 Oct 2020 at 19:19:29 (+0100), Lukasz Luba wrote:
> > > [..]
> > > > The previous FIE implementation where arch_set_freq_scale()
> > > > was called from the drivers, was better suited for this issue.
> > > > Driver could just use internal dependency cpumask or even
> > > > do the aggregation to figure out the max freq for cluster
> > > > if there is a need, before calling arch_set_freq_scale().
> > > >
> > > > It is not perfect solution for software FIE, but one of possible
> > > > when there is no hw counters.
> > > >
> > > [..]
> > >
> > > > Difference between new FIE and old FIE (from v5.8) is that the new one
> > > > purely relies on schedutil max freq value (which will now be missing),
> > > > while the old FIE was called by the driver and thus it was an option to
> > > > fix only the affected cpufreq driver [1][2].
> > > >
> > >
> > > My final argument is that now you have 2 drivers that would need this
> > > support, next you'll have 3 (the new mediatek driver), and in the future
> > > there will be more. So why limit and duplicate this functionality in the
> > > drivers? Why not make it generic for all drivers to use if the system
> > > is using hardware coordination?
> > >
> > > Additionally, I don't think drivers should not even need to know about
> > > these dependency/clock domains. They should act at the level of the
> > > policy, which in this case will be at the level of each CPU.
> >
> > The policies come from the driver, though.
> >
> > The driver decides how many CPUs will be there in a policy and how to
> > handle them at the initialization time.
>
> Yes, policies are built based on information populated from the drivers
> at .init(): what CPUs will belong to a policy, what methods to use for
> setting and getting frequency, etc.
>
> So they do pass this information to the cpufreq core to be stored at the
> level of the policy, but later drivers (in the majority of cases) will
> not need to store on their own information on what CPUs belong to a
> frequency domain, they rely on having passed that information to the
> core, and the core mechanisms hold this information on the clock domains
> (currently through policy->cpus and policy->related_cpus).

Strictly speaking, not quite.

In fact policy->related_cpus is a set of CPUs that share a common perf
control HW/FW interface which may or may not match the boundaries of
clock domains etc.  That's what the entire cpufreq needs to know and
cares about.

AFAICS your scale invariance rework patches were based on the
assumption that CPUs sharing an interface like that should also belong
to the same frequency domain, which is reasonable and that's why I
didn't have a problem with it, but if you were really assuming that
policy->related_cpus must always reflect a frequency domain, then I'm
afraid that you were not going in the right direction (the
one-CPU-per-policy with HW coordination example should be sufficient
to illustrate that).

It is correct that drivers generally don't need to know about the HW
clock (or voltage for that matter) coordination dependencies, but the
rest of cpufreq doesn't need to know about them either.  If that
information is needed for something else, I don't see a reason to put
it into cpufreq.

Ionela Voinescu Oct. 15, 2020, 6:38 p.m. UTC | #37

Hi Rafael,

Sorry in advance for the long writing. I hope it makes sense.

On Thursday 15 Oct 2020 at 17:56:56 (+0200), Rafael J. Wysocki wrote:
> On Tue, Oct 13, 2020 at 2:39 PM Ionela Voinescu <ionela.voinescu@arm.com> wrote:
> >
> > Hi Rafael,
> >
> > On Tuesday 13 Oct 2020 at 13:53:37 (+0200), Rafael J. Wysocki wrote:
> > > On Tue, Oct 13, 2020 at 12:01 AM Ionela Voinescu
> > > <ionela.voinescu@arm.com> wrote:
> > > >
> > > > Hey Lukasz,
> > > >
> > > > I think after all this discussion (in our own way of describing things)
> > > > we agree on how the current cpufreq based FIE implementation is affected
> > > > in systems that use hardware coordination.
> > > >
> > > > What we don't agree on is the location where that implementation (that
> > > > uses the new mask and aggregation) should be.
> > > >
> > > > On Monday 12 Oct 2020 at 19:19:29 (+0100), Lukasz Luba wrote:
> > > > [..]
> > > > > The previous FIE implementation where arch_set_freq_scale()
> > > > > was called from the drivers, was better suited for this issue.
> > > > > Driver could just use internal dependency cpumask or even
> > > > > do the aggregation to figure out the max freq for cluster
> > > > > if there is a need, before calling arch_set_freq_scale().
> > > > >
> > > > > It is not perfect solution for software FIE, but one of possible
> > > > > when there is no hw counters.
> > > > >
> > > > [..]
> > > >
> > > > > Difference between new FIE and old FIE (from v5.8) is that the new one
> > > > > purely relies on schedutil max freq value (which will now be missing),
> > > > > while the old FIE was called by the driver and thus it was an option to
> > > > > fix only the affected cpufreq driver [1][2].
> > > > >
> > > >
> > > > My final argument is that now you have 2 drivers that would need this
> > > > support, next you'll have 3 (the new mediatek driver), and in the future
> > > > there will be more. So why limit and duplicate this functionality in the
> > > > drivers? Why not make it generic for all drivers to use if the system
> > > > is using hardware coordination?
> > > >
> > > > Additionally, I don't think drivers should not even need to know about
> > > > these dependency/clock domains. They should act at the level of the
> > > > policy, which in this case will be at the level of each CPU.
> > >
> > > The policies come from the driver, though.
> > >
> > > The driver decides how many CPUs will be there in a policy and how to
> > > handle them at the initialization time.
> >
> > Yes, policies are built based on information populated from the drivers
> > at .init(): what CPUs will belong to a policy, what methods to use for
> > setting and getting frequency, etc.
> >
> > So they do pass this information to the cpufreq core to be stored at the
> > level of the policy, but later drivers (in the majority of cases) will
> > not need to store on their own information on what CPUs belong to a
> > frequency domain, they rely on having passed that information to the
> > core, and the core mechanisms hold this information on the clock domains
> > (currently through policy->cpus and policy->related_cpus).
> 
> Strictly speaking, not quite.
> 
> In fact policy->related_cpus is a set of CPUs that share a common perf
> control HW/FW interface which may or may not match the boundaries of
> clock domains etc.  That's what the entire cpufreq needs to know and
> cares about.
> 

Yes, generally speaking, you are absolutely correct, and actually this
is the motivation behind these patches: policy->related_cpus shows cpus
sharing the same perf controls, but we need "something else" to show us
the cpus sharing the same clocks.

It was my mistake for describing policy->related_cpus in the way I have
above, but my "excuse" is that I was referring to arm/arm64 platforms, in
the context of this cpufreq driver FIE, which is only present on
arm/arm64 platforms. I'll expand on this below.

Speaking practically, again referring to arm/arm64 platforms, for most*
of them, policy->related_cpus does also identify CPUs in the same clock
domain and therefore is used as if that was true. The obvious examples
are exactly the users that were presented as part of the motivation of
these patches: EAS, IPA, cpufreq-based FIE. They treat
policy->related_cpus as dependent cpus (cpus that change they performance
together or are clocked together).

*one of the exceptions is that old BL_SWITCHER support.

Another example of how the two (perf controls vs clock domains) are not
quite as separate in practice, is the cppc_cpufreq driver: that driver
reads the PSD domains and creates the policies based on the domains, not
based on the CPPC controls. It does mention only support for SW_ANY, but
that does not change the fact that the PSD (p-state dependency) is used to
populate policy->related_cpus and not any sharing of controls from _CPC.

Also, generically, policy->cpus (in schedutil or the ondemand governor)
is used as the basis for software coordination. I suppose software
coordination only has meaning at the level of a domain, so policy->cpus
is used as if providing the CPUs in a domain. If that wasn't the case,
their logic would not stand.

> AFAICS your scale invariance rework patches were based on the
> assumption that CPUs sharing an interface like that should also belong
> to the same frequency domain, which is reasonable and that's why I
> didn't have a problem with it, but if you were really assuming that
> policy->related_cpus must always reflect a frequency domain, then I'm
> afraid that you were not going in the right direction (the
> one-CPU-per-policy with HW coordination example should be sufficient
> to illustrate that).
> 

In my defence, I was definitely not assuming that, and I was one of the
first people bringing up HW_ALL in this thread :).

To expand on that, I never assumed policy->related_cpus *must always*
reflect frequency domains. But to that I have to add that FIE, as well
as EAS and IPA, are acting on policy->related_cpus as the closest to
accurate information on frequency domains that they can get. That is
the best information they have got and historically that was good enough.
That is exactly because on arm/arm64 platforms policy->related_cpus has,
more often than not, been describing clock domains as well. And again,
cpufreq-based FIE using policy->related_cpus is only implemented on
arm/arm64 platforms. For anything more accurate, counters should be
used.

Therefore, EAS, FIE, IPA, and even the schedutil and ondmand governors
for that matter, will always work best if policy->related_cpus reflects
the full composition of clock domains as well. It might not be always
true, but that is the best information that they can act on.

But now that new platforms are breaking these historical assumptions,
through these patches, the intention is to give these users better
information to act on, that is information about the real composition of
performance domains.

> It is correct that drivers generally don't need to know about the HW
> clock (or voltage for that matter) coordination dependencies, but the
> rest of cpufreq doesn't need to know about them either.  If that
> information is needed for something else, I don't see a reason to put
> it into cpufreq.

Right! If we say that cpufreq is meant for frequency control only, there
would be no reason. But I believe that is not so easy to separate
frequency control from frequency dependency. I suppose that was the
reason behind having policy->shared_type (as another example to
schedutil and ondemand software coordination). I believe coordination
only has meaning at the level of a domain and therefore should not
belong to a framework that would only reflect control.

Therefore, given precedents, it does seem consistent for information about
dependency domains to be placed in cpufreq as well, and possibly a good
place to make a clear separation between sharing of controls and sharing
of coordination.

But it might just be my interpretation. In any case, I appreciate your
time spent on this.

If cpufreq is not the correct location, are there better options that you
can recommend?

Thank you,
Ionela.

Nicola Mazzucato Oct. 19, 2020, 8:50 a.m. UTC | #38

Hi Viresh,

thank you for your suggestion on using 'opp-shared'.
I think it could work for most of the cases we explained earlier.

Summarising, there are two parts of this entire proposal:
1) where/how to get the information: now we are focusing on taking advantage of
'opp-shared' within an empty opp table
2) and how/where this information will be consumed

Further details below:

1) a CPUFreq driver that takes the OPPs from firmware, can call
dev_pm_opp_of_get_sharing_cpus like you suggested. When doing so, a provided
cpumaksk will be populated with the corresponding cpus that share the same
(empty) table opp in DT.
All good so far.
The current opp core is not expecting an empty table and therefore some errors
are thrown when this happens.
Since we are now allowing this corner-case, I am presenting below where I think
some minor corrections may be needed:

--- a/drivers/opp/of.c
+++ b/drivers/opp/of.c
@@ static void _opp_table_alloc_required_tables(struct opp_table *opp_table,
        struct device_node *required_np, *np;
        int count, i;

        /* Traversing the first OPP node is all we need */
        np = of_get_next_available_child(opp_np, NULL);
        if (!np) {
-               dev_err(dev, "Empty OPP table\n");
+               dev_warn(dev, "Empty OPP table\n");
+
+               /*
+                * With empty table we remove shared_opp. This is to leave the
+                * responsibility to decide which opp are shared to the opp users
+                */
+               opp_table->shared_opp = OPP_TABLE_ACCESS_EXCLUSIVE;
+
                return;
        }

@@ int dev_pm_opp_of_find_icc_paths(struct device *dev,
        int ret, i, count, num_paths;
        struct icc_path **paths;

        ret = _bandwidth_supported(dev, opp_table);
-       if (ret <= 0)
+       if (ret == -EINVAL)
+               return 0; /* Empty OPP table is a valid corner-case, let's not
fail */
+       else if (ret <= 0)
                return ret;

The above are not 'strictly' necessary to achieve the intended goal, but they
make clearer that an empty table is now allowed and not an error anymore.
What it is your point of view on this?

In addition, I think it would also be appropriate to update the documentation
(Documentation/devicetree/bindings/opp/opp.txt) to reflect this new case
(required properties etc).
Any different thoughts?

2) Once the driver gets the 'performance dependencies' by
dev_pm_opp_of_get_sharing_cpus(), this information will have to be shared with
EAS, thermal, etc.. The natural way to do so would be to add a new cpumask like
I proposed in this RFC.
I see this as an improvement for the whole subsystem and a scalable choice since
we can unambiguously provide the correct information to whoever needs it, given
that we don't enforce "hw dependencies" for related_cpus.
The changes would be trivial (it's in the original RFC).
On the other hand, we can't unload this h/w detail into related_cpus IMO as we
are dealing with per-cpu systems in this context.
Hope it makes sense?

Once we are pretty much aligned with our ideas, I can run some other tests and
make a V3.

Thank you very much,

Best regards
Nicola

On 10/14/20 5:25 AM, Viresh Kumar wrote:
> On 12-10-20, 18:18, Lukasz Luba wrote:
>> On 10/12/20 5:52 PM, Ionela Voinescu wrote:
>>> On Monday 12 Oct 2020 at 16:49:30 (+0100), Sudeep Holla wrote:
>>>> On Fri, Oct 09, 2020 at 11:09:21AM +0530, Viresh Kumar wrote:
>>>>> - I wonder if we can keep using that instead of creating new bindings
>>>>>    for exact same stuff ? Though the difference here would be that the
>>>>>    OPP may not have any other entries.
>>>>
>>>> Well summarised, sorry for chiming in late. I could have not summarised
>>>> any better. Just saw the big thread and was thinking of summarising.
>>>> If the last point on OPP is possible(i.e. no OPP entries but just use
>>>> it for fetch the information) for $subject patch is trying to achieve,
>>>> then it would be good.
> 
> Under normal circumstances, I wouldn't have suggested empty opp-tables
> for sure but it doesn't seem worth adding another binding to get this
> information out :)
> 
>>>
>>> Just to put in my two pennies worth: using opp-shared (in possibly empty
>>> OPP table) as alternative to cpu-perf-dependencies sounds good enough
>>> to me as well.
>>
>> +1
> 
> Now that (almost) everyone agrees, I don't think we need to make any
> change anywhere, in code or bindings. This should work right now as
> well.  The code should never try to create OPP tables and the core
> will not create one. Your driver (which want to get this information
> out of empty OPP tables) shall call dev_pm_opp_of_get_sharing_cpus(),
> which just parses the DT to get this information out.
>

Viresh Kumar Oct. 19, 2020, 9:46 a.m. UTC | #39

On 19-10-20, 09:50, Nicola Mazzucato wrote:
> Hi Viresh,
> 
> thank you for your suggestion on using 'opp-shared'.
> I think it could work for most of the cases we explained earlier.
> 
> Summarising, there are two parts of this entire proposal:
> 1) where/how to get the information: now we are focusing on taking advantage of
> 'opp-shared' within an empty opp table
> 2) and how/where this information will be consumed
> 
> Further details below:
> 
> 1) a CPUFreq driver that takes the OPPs from firmware, can call
> dev_pm_opp_of_get_sharing_cpus like you suggested. When doing so, a provided
> cpumaksk will be populated with the corresponding cpus that share the same
> (empty) table opp in DT.
> All good so far.

Great.

> The current opp core is not expecting an empty table and therefore some errors
> are thrown when this happens.
> Since we are now allowing this corner-case, I am presenting below where I think
> some minor corrections may be needed:
> 
> --- a/drivers/opp/of.c
> +++ b/drivers/opp/of.c
> @@ static void _opp_table_alloc_required_tables(struct opp_table *opp_table,
>         struct device_node *required_np, *np;
>         int count, i;
> 
>         /* Traversing the first OPP node is all we need */
>         np = of_get_next_available_child(opp_np, NULL);
>         if (!np) {
> -               dev_err(dev, "Empty OPP table\n");
> +               dev_warn(dev, "Empty OPP table\n");
> +
> +               /*
> +                * With empty table we remove shared_opp. This is to leave the
> +                * responsibility to decide which opp are shared to the opp users
> +                */
> +               opp_table->shared_opp = OPP_TABLE_ACCESS_EXCLUSIVE;
> +
>                 return;
>         }
> 
> @@ int dev_pm_opp_of_find_icc_paths(struct device *dev,
>         int ret, i, count, num_paths;
>         struct icc_path **paths;
> 
>         ret = _bandwidth_supported(dev, opp_table);
> -       if (ret <= 0)
> +       if (ret == -EINVAL)
> +               return 0; /* Empty OPP table is a valid corner-case, let's not
> fail */
> +       else if (ret <= 0)
>                 return ret;
> 
> The above are not 'strictly' necessary to achieve the intended goal, but they
> make clearer that an empty table is now allowed and not an error anymore.
> What it is your point of view on this?

Why is this stuff getting called in your case ? We shouldn't be trying
to create an OPP table here and it should still be an error in the
code if we are asked to parse an empty OPP table.

> In addition, I think it would also be appropriate to update the documentation
> (Documentation/devicetree/bindings/opp/opp.txt) to reflect this new case
> (required properties etc).
> Any different thoughts?

Yes, this needs a small update in the required-opps section.

> 2) Once the driver gets the 'performance dependencies' by
> dev_pm_opp_of_get_sharing_cpus(), this information will have to be shared with
> EAS, thermal, etc.. The natural way to do so would be to add a new cpumask like
> I proposed in this RFC.
> I see this as an improvement for the whole subsystem and a scalable choice since
> we can unambiguously provide the correct information to whoever needs it, given
> that we don't enforce "hw dependencies" for related_cpus.
> The changes would be trivial (it's in the original RFC).
> On the other hand, we can't unload this h/w detail into related_cpus IMO as we
> are dealing with per-cpu systems in this context.
> Hope it makes sense?

I will have another look at this stuff, honestly I haven't looked at
this in detail yet. But I do understand that we can't really use
related-cpu here without changing its earlier meaning.

Nicola Mazzucato Oct. 19, 2020, 1:36 p.m. UTC | #40

Hi Viresh,


On 10/19/20 10:46 AM, Viresh Kumar wrote:
> On 19-10-20, 09:50, Nicola Mazzucato wrote:
>> Hi Viresh,
>>
>> thank you for your suggestion on using 'opp-shared'.
>> I think it could work for most of the cases we explained earlier.
>>
>> Summarising, there are two parts of this entire proposal:
>> 1) where/how to get the information: now we are focusing on taking advantage of
>> 'opp-shared' within an empty opp table
>> 2) and how/where this information will be consumed
>>
>> Further details below:
>>
>> 1) a CPUFreq driver that takes the OPPs from firmware, can call
>> dev_pm_opp_of_get_sharing_cpus like you suggested. When doing so, a provided
>> cpumaksk will be populated with the corresponding cpus that share the same
>> (empty) table opp in DT.
>> All good so far.
> 
> Great.
> 
>> The current opp core is not expecting an empty table and therefore some errors
>> are thrown when this happens.
>> Since we are now allowing this corner-case, I am presenting below where I think
>> some minor corrections may be needed:
>>
>> --- a/drivers/opp/of.c
>> +++ b/drivers/opp/of.c
>> @@ static void _opp_table_alloc_required_tables(struct opp_table *opp_table,
>>         struct device_node *required_np, *np;
>>         int count, i;
>>
>>         /* Traversing the first OPP node is all we need */
>>         np = of_get_next_available_child(opp_np, NULL);
>>         if (!np) {
>> -               dev_err(dev, "Empty OPP table\n");
>> +               dev_warn(dev, "Empty OPP table\n");
>> +
>> +               /*
>> +                * With empty table we remove shared_opp. This is to leave the
>> +                * responsibility to decide which opp are shared to the opp users
>> +                */
>> +               opp_table->shared_opp = OPP_TABLE_ACCESS_EXCLUSIVE;
>> +
>>                 return;
>>         }
>>
>> @@ int dev_pm_opp_of_find_icc_paths(struct device *dev,
>>         int ret, i, count, num_paths;
>>         struct icc_path **paths;
>>
>>         ret = _bandwidth_supported(dev, opp_table);
>> -       if (ret <= 0)
>> +       if (ret == -EINVAL)
>> +               return 0; /* Empty OPP table is a valid corner-case, let's not
>> fail */
>> +       else if (ret <= 0)
>>                 return ret;
>>
>> The above are not 'strictly' necessary to achieve the intended goal, but they
>> make clearer that an empty table is now allowed and not an error anymore.
>> What it is your point of view on this?
> 
> Why is this stuff getting called in your case ? We shouldn't be trying
> to create an OPP table here and it should still be an error in the
> code if we are asked to parse an empty OPP table.

A driver that gets a set of opp-points from f/w needs to add them to each
device. To do so, it will call dev_pm_opp_add(). If an opp_table struct for this
device is not found, one will be created and the opp-point will be added to it.
When allocation a new opp_table the opp will try to initialise it by parsing the
values in DT. It will also try to find_icc_paths.

Everything happens silently if we don't have a table in DT.

> 
>> In addition, I think it would also be appropriate to update the documentation
>> (Documentation/devicetree/bindings/opp/opp.txt) to reflect this new case
>> (required properties etc).
>> Any different thoughts?
> 
> Yes, this needs a small update in the required-opps section.

Cool, I'll sketch something in the next version.

> 
>> 2) Once the driver gets the 'performance dependencies' by
>> dev_pm_opp_of_get_sharing_cpus(), this information will have to be shared with
>> EAS, thermal, etc.. The natural way to do so would be to add a new cpumask like
>> I proposed in this RFC.
>> I see this as an improvement for the whole subsystem and a scalable choice since
>> we can unambiguously provide the correct information to whoever needs it, given
>> that we don't enforce "hw dependencies" for related_cpus.
>> The changes would be trivial (it's in the original RFC).
>> On the other hand, we can't unload this h/w detail into related_cpus IMO as we
>> are dealing with per-cpu systems in this context.
>> Hope it makes sense?
> 
> I will have another look at this stuff, honestly I haven't looked at
> this in detail yet. But I do understand that we can't really use
> related-cpu here without changing its earlier meaning.

Sure. thanks
> 

Hope it helps

Best regards
Nicola

Viresh Kumar Oct. 20, 2020, 10:48 a.m. UTC | #41

On 19-10-20, 14:36, Nicola Mazzucato wrote:
> Hi Viresh,
> 
> 
> On 10/19/20 10:46 AM, Viresh Kumar wrote:
> > On 19-10-20, 09:50, Nicola Mazzucato wrote:
> >> Hi Viresh,
> >>
> >> thank you for your suggestion on using 'opp-shared'.
> >> I think it could work for most of the cases we explained earlier.
> >>
> >> Summarising, there are two parts of this entire proposal:
> >> 1) where/how to get the information: now we are focusing on taking advantage of
> >> 'opp-shared' within an empty opp table
> >> 2) and how/where this information will be consumed
> >>
> >> Further details below:
> >>
> >> 1) a CPUFreq driver that takes the OPPs from firmware, can call
> >> dev_pm_opp_of_get_sharing_cpus like you suggested. When doing so, a provided
> >> cpumaksk will be populated with the corresponding cpus that share the same
> >> (empty) table opp in DT.
> >> All good so far.
> > 
> > Great.
> > 
> >> The current opp core is not expecting an empty table and therefore some errors
> >> are thrown when this happens.
> >> Since we are now allowing this corner-case, I am presenting below where I think
> >> some minor corrections may be needed:
> >>
> >> --- a/drivers/opp/of.c
> >> +++ b/drivers/opp/of.c
> >> @@ static void _opp_table_alloc_required_tables(struct opp_table *opp_table,
> >>         struct device_node *required_np, *np;
> >>         int count, i;
> >>
> >>         /* Traversing the first OPP node is all we need */
> >>         np = of_get_next_available_child(opp_np, NULL);
> >>         if (!np) {
> >> -               dev_err(dev, "Empty OPP table\n");
> >> +               dev_warn(dev, "Empty OPP table\n");
> >> +
> >> +               /*
> >> +                * With empty table we remove shared_opp. This is to leave the
> >> +                * responsibility to decide which opp are shared to the opp users
> >> +                */
> >> +               opp_table->shared_opp = OPP_TABLE_ACCESS_EXCLUSIVE;
> >> +
> >>                 return;
> >>         }
> >>
> >> @@ int dev_pm_opp_of_find_icc_paths(struct device *dev,
> >>         int ret, i, count, num_paths;
> >>         struct icc_path **paths;
> >>
> >>         ret = _bandwidth_supported(dev, opp_table);
> >> -       if (ret <= 0)
> >> +       if (ret == -EINVAL)
> >> +               return 0; /* Empty OPP table is a valid corner-case, let's not
> >> fail */
> >> +       else if (ret <= 0)
> >>                 return ret;
> >>
> >> The above are not 'strictly' necessary to achieve the intended goal, but they
> >> make clearer that an empty table is now allowed and not an error anymore.
> >> What it is your point of view on this?
> > 
> > Why is this stuff getting called in your case ? We shouldn't be trying
> > to create an OPP table here and it should still be an error in the
> > code if we are asked to parse an empty OPP table.
> 
> A driver that gets a set of opp-points from f/w needs to add them to each
> device. To do so, it will call dev_pm_opp_add(). If an opp_table struct for this
> device is not found, one will be created and the opp-point will be added to it.
> When allocation a new opp_table the opp will try to initialise it by parsing the
> values in DT. It will also try to find_icc_paths.
> 
> Everything happens silently if we don't have a table in DT.

Right, you need something like your patch here.

[v2,0/2] CPUFreq: Add support for cpu performance dependencies

Message

Comments