
[1/4] cpufreq: qcom-nvmem: Enable virtual power domain devices

Message ID 20230912-msm8909-cpufreq-v1-1-767ce66b544b@kernkonzept.com
State New
Delegated to: viresh kumar
Series cpufreq: Add basic cpufreq scaling for Qualcomm MSM8909

Commit Message

Stephan Gerhold Sept. 12, 2023, 9:40 a.m. UTC
The genpd core ignores performance state votes from devices that are
runtime suspended as of commit 5937c3ce2122 ("PM: domains: Drop/restore
performance state votes for devices at runtime PM"). However, at the
moment nothing ever enables the virtual devices created in
qcom-cpufreq-nvmem for the cpufreq power domain scaling, so they are
permanently runtime-suspended.

Fix this by enabling the devices after attaching them and using
dev_pm_syscore_device() to ensure the power domain also stays on when
going to suspend. Since it supplies the CPU we can never turn it off
from Linux. There are other mechanisms to turn it off when needed,
usually in the RPM firmware or the cpuidle path.

Without this fix performance state votes are silently ignored, and the
CPU/CPR voltage is never adjusted. This has been broken since 5.14 but
for some reason no one noticed this on QCS404 so far.

Cc: stable@vger.kernel.org
Fixes: 1cb8339ca225 ("cpufreq: qcom: Add support for qcs404 on nvmem driver")
Signed-off-by: Stephan Gerhold <stephan.gerhold@kernkonzept.com>
---
 drivers/cpufreq/qcom-cpufreq-nvmem.c | 21 ++++++++++++++++++++-
 1 file changed, 20 insertions(+), 1 deletion(-)

Comments

Ulf Hansson Sept. 13, 2023, 10:56 a.m. UTC | #1
On Tue, 12 Sept 2023 at 11:40, Stephan Gerhold
<stephan.gerhold@kernkonzept.com> wrote:
>
> The genpd core ignores performance state votes from devices that are
> runtime suspended as of commit 5937c3ce2122 ("PM: domains: Drop/restore
> performance state votes for devices at runtime PM").

I think you are referring to the wrong commit above. Please have a
look at commit 3c5a272202c2 ("PM: domains: Improve runtime PM
performance state handling"), instead.

I also suggest rephrasing the above into saying that the performance
state vote for a device is cached rather than carried out, if
pm_runtime_suspended() returns true for it.

Another relevant piece of information for the commit message would be
that during device-attach, genpd_dev_pm_attach_by_id() calls
pm_runtime_enable() for the device.
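
Roughly, the relevant genpd core logic after that commit is (paraphrased
and trimmed from dev_pm_genpd_set_performance_state() in
drivers/base/power/domain.c):

  genpd_lock(genpd);
  if (pm_runtime_suspended(dev)) {
          /* runtime suspended: only cache the vote for later */
          dev_gpd_data(dev)->rpm_pstate = state;
  } else {
          /* runtime active: actually propagate the vote */
          ret = genpd_set_performance_state(dev, state);
          if (!ret)
                  dev_gpd_data(dev)->rpm_pstate = 0;
  }
  genpd_unlock(genpd);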

> However, at the
> moment nothing ever enables the virtual devices created in
> qcom-cpufreq-nvmem for the cpufreq power domain scaling, so they are
> permanently runtime-suspended.
>
> Fix this by enabling the devices after attaching them and using
> dev_pm_syscore_device() to ensure the power domain also stays on when
> going to suspend. Since it supplies the CPU we can never turn it off
> from Linux. There are other mechanisms to turn it off when needed,
> usually in the RPM firmware or the cpuidle path.
>
> Without this fix performance state votes are silently ignored, and the
> CPU/CPR voltage is never adjusted. This has been broken since 5.14 but
> for some reason no one noticed this on QCS404 so far.
>
> Cc: stable@vger.kernel.org
> Fixes: 1cb8339ca225 ("cpufreq: qcom: Add support for qcs404 on nvmem driver")
> Signed-off-by: Stephan Gerhold <stephan.gerhold@kernkonzept.com>
> ---
>  drivers/cpufreq/qcom-cpufreq-nvmem.c | 21 ++++++++++++++++++++-
>  1 file changed, 20 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/cpufreq/qcom-cpufreq-nvmem.c b/drivers/cpufreq/qcom-cpufreq-nvmem.c
> index 84d7033e5efe..17d6ab14c909 100644
> --- a/drivers/cpufreq/qcom-cpufreq-nvmem.c
> +++ b/drivers/cpufreq/qcom-cpufreq-nvmem.c
> @@ -25,6 +25,7 @@
>  #include <linux/platform_device.h>
>  #include <linux/pm_domain.h>
>  #include <linux/pm_opp.h>
> +#include <linux/pm_runtime.h>
>  #include <linux/slab.h>
>  #include <linux/soc/qcom/smem.h>
>
> @@ -280,6 +281,7 @@ static int qcom_cpufreq_probe(struct platform_device *pdev)
>         }
>
>         for_each_possible_cpu(cpu) {
> +               struct device **virt_devs = NULL;
>                 struct dev_pm_opp_config config = {
>                         .supported_hw = NULL,
>                 };
> @@ -300,7 +302,7 @@ static int qcom_cpufreq_probe(struct platform_device *pdev)
>
>                 if (drv->data->genpd_names) {
>                         config.genpd_names = drv->data->genpd_names;
> -                       config.virt_devs = NULL;
> +                       config.virt_devs = &virt_devs;
>                 }
>
>                 if (config.supported_hw || config.genpd_names) {
> @@ -311,6 +313,23 @@ static int qcom_cpufreq_probe(struct platform_device *pdev)
>                                 goto free_opp;
>                         }
>                 }
> +
> +               if (virt_devs) {
> +                       const char * const *name = config.genpd_names;
> +                       int i;
> +
> +                       for (i = 0; *name; i++, name++) {
> +                               ret = pm_runtime_resume_and_get(virt_devs[i]);
> +                               if (ret) {
> +                                       dev_err(cpu_dev, "failed to resume %s: %d\n",
> +                                               *name, ret);
> +                                       goto free_opp;
> +                               }

Shouldn't we restore the usage count at ->remove() too?
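
A restore at ->remove() could look roughly like this, mirroring the loop
above (sketch only; the driver does not currently save the virt_devs
pointers, so the drv->virt_devs[cpu] bookkeeping is hypothetical):

  for_each_possible_cpu(cpu) {
          const char * const *name = drv->data->genpd_names;
          int i;

          /* hypothetical per-CPU storage filled in at probe time */
          for (i = 0; *name; i++, name++)
                  pm_runtime_put(drv->virt_devs[cpu][i]);
  }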

> +
> +                               /* Keep CPU power domain always-on */
> +                               dev_pm_syscore_device(virt_devs[i], true);

Is this really correct? cpufreq is suspended/resumed by the PM core
during system wide suspend/resume. See dpm_suspend|resume(). Isn't
that sufficient?

Moreover, it looks like the cpr genpd provider supports genpd's
->power_on|off() callbacks. Is there something wrong with this, that I
am missing?


> +                       }
> +               }
>         }
>
>         cpufreq_dt_pdev = platform_device_register_simple("cpufreq-dt", -1,
>

Kind regards
Uffe
Stephan Gerhold Sept. 13, 2023, 12:26 p.m. UTC | #2
On Wed, Sep 13, 2023 at 12:56:16PM +0200, Ulf Hansson wrote:
> On Tue, 12 Sept 2023 at 11:40, Stephan Gerhold
> <stephan.gerhold@kernkonzept.com> wrote:
> >
> > The genpd core ignores performance state votes from devices that are
> > runtime suspended as of commit 5937c3ce2122 ("PM: domains: Drop/restore
> > performance state votes for devices at runtime PM").
> 
> I think you are referring to the wrong commit above. Please have a
> look at commit 3c5a272202c2 ("PM: domains: Improve runtime PM
> performance state handling"), instead.
> 
> I also suggest rephrasing the above into saying that the performance
> state vote for a device is cached rather than carried out, if
> pm_runtime_suspended() returns true for it.
> 
> Another relevant piece of information for the commit message would be
> that during device-attach, genpd_dev_pm_attach_by_id() calls
> pm_runtime_enable() for the device.
> 

Thanks, I will try to clarify this a bit! I was actually looking at that
commit originally but decided to reference the commit that "started the
change", since this commit is marked as a fix of the one I referenced.
But I think you're right, it would be clearer to reference "PM:
domains: Improve runtime PM performance state handling" directly.

> > However, at the
> > moment nothing ever enables the virtual devices created in
> > qcom-cpufreq-nvmem for the cpufreq power domain scaling, so they are
> > permanently runtime-suspended.
> >
> > Fix this by enabling the devices after attaching them and using
> > dev_pm_syscore_device() to ensure the power domain also stays on when
> > going to suspend. Since it supplies the CPU we can never turn it off
> > from Linux. There are other mechanisms to turn it off when needed,
> > usually in the RPM firmware or the cpuidle path.
> >
> > Without this fix performance state votes are silently ignored, and the
> > CPU/CPR voltage is never adjusted. This has been broken since 5.14 but
> > for some reason no one noticed this on QCS404 so far.
> >
> > Cc: stable@vger.kernel.org
> > Fixes: 1cb8339ca225 ("cpufreq: qcom: Add support for qcs404 on nvmem driver")
> > Signed-off-by: Stephan Gerhold <stephan.gerhold@kernkonzept.com>
> > ---
> >  drivers/cpufreq/qcom-cpufreq-nvmem.c | 21 ++++++++++++++++++++-
> >  1 file changed, 20 insertions(+), 1 deletion(-)
> >
> > diff --git a/drivers/cpufreq/qcom-cpufreq-nvmem.c b/drivers/cpufreq/qcom-cpufreq-nvmem.c
> > index 84d7033e5efe..17d6ab14c909 100644
> > --- a/drivers/cpufreq/qcom-cpufreq-nvmem.c
> > +++ b/drivers/cpufreq/qcom-cpufreq-nvmem.c
> > @@ -25,6 +25,7 @@
> >  #include <linux/platform_device.h>
> >  #include <linux/pm_domain.h>
> >  #include <linux/pm_opp.h>
> > +#include <linux/pm_runtime.h>
> >  #include <linux/slab.h>
> >  #include <linux/soc/qcom/smem.h>
> >
> > @@ -280,6 +281,7 @@ static int qcom_cpufreq_probe(struct platform_device *pdev)
> >         }
> >
> >         for_each_possible_cpu(cpu) {
> > +               struct device **virt_devs = NULL;
> >                 struct dev_pm_opp_config config = {
> >                         .supported_hw = NULL,
> >                 };
> > @@ -300,7 +302,7 @@ static int qcom_cpufreq_probe(struct platform_device *pdev)
> >
> >                 if (drv->data->genpd_names) {
> >                         config.genpd_names = drv->data->genpd_names;
> > -                       config.virt_devs = NULL;
> > +                       config.virt_devs = &virt_devs;
> >                 }
> >
> >                 if (config.supported_hw || config.genpd_names) {
> > @@ -311,6 +313,23 @@ static int qcom_cpufreq_probe(struct platform_device *pdev)
> >                                 goto free_opp;
> >                         }
> >                 }
> > +
> > +               if (virt_devs) {
> > +                       const char * const *name = config.genpd_names;
> > +                       int i;
> > +
> > +                       for (i = 0; *name; i++, name++) {
> > +                               ret = pm_runtime_resume_and_get(virt_devs[i]);
> > +                               if (ret) {
> > +                                       dev_err(cpu_dev, "failed to resume %s: %d\n",
> > +                                               *name, ret);
> > +                                       goto free_opp;
> > +                               }
> 
> Shouldn't we restore the usage count at ->remove() too?
> 
> > +
> > +                               /* Keep CPU power domain always-on */
> > +                               dev_pm_syscore_device(virt_devs[i], true);
> 
> Is this really correct? cpufreq is suspended/resumed by the PM core
> during system wide suspend/resume. See dpm_suspend|resume(). Isn't
> that sufficient?
> 
> Moreover, it looks like the cpr genpd provider supports genpd's
> ->power_on|off() callbacks. Is there something wrong with this, that I
> am missing?
> 

I think this question is a quite fundamental one. To explain this
properly I will need to delve a bit into the implementation details of
the two different GENPD providers that are applicable here:

Fundamentally, we are describing the main power supply for the CPU here.
Consider a simple regulator with adjustable voltage. From the Linux
point of view this regulator should be marked as "regulator-always-on".
If we were to turn off this regulator, the CPU would be immediately dead
without proper shutdown done by firmware or hardware.

Representing the regulator as a power domain does not change much, except
that we now have abstract "performance states" instead of actual voltages.
However, for power domains there is currently no generic mechanism like
"regulator-always-on" in the DT, only drivers can specify
GENPD_FLAG_ALWAYS_ON.

The special situation on MSM8909 is that there are two possible setups
for the CPU power supply depending on the PMIC that is used (see
"[PATCH 4/4] cpufreq: qcom-nvmem: Add MSM8909"): CPR or RPMPD. Both are
GENPD providers so in theory we can just have either

  cpu@0 { power-domains = <&cpr>; }; // or
  cpu@0 { power-domains = <&rpmpd MSM8909_VDDCX_AO>; };

in the DT, without handling this specifically on the cpufreq side.

The two GENPD providers behave quite differently though:

 - CPR: CPR is not really a power domain itself. It's more like a monitor
   on a power supply line coming from some other regulator. CPR provides
   suggestions on how to adjust the voltage for best power/stability.

   The GENPD .power_off() disables the CPR state machine and forwards
   this to the regulator with regulator_disable(). On QCS404 the
   regulator is marked regulator-always-on, so it will never be disabled
   from Linux. The SAW/SPM hardware component on Qualcomm SoCs will
   usually disable the regulator during deep cpuidle states. (See the
   sketch right after this list.)

 - RPMPD: This is the generic driver for all the SoC power domains
   managed by the RPM firmware. It's not CPU-specific. However, as
   a special feature each power domain is exposed twice in Linux, e.g.
   "MSM8909_VDDCX" and "MSM8909_VDDCX_AO". The _AO ("active-only")
   variant tells the RPM firmware that the performance/enable vote only
   applies when the CPU is active (not in deep cpuidle state).

   The GENPD .power_off() drops all performance state votes and also
   releases the "enable" vote for the power domain.
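
For reference, the CPR .power_off() path mentioned above boils down to
roughly this (paraphrased and trimmed from the upstream qcom-cpr driver):

  static int cpr_power_off(struct generic_pm_domain *domain)
  {
          struct cpr_drv *drv = container_of(domain, struct cpr_drv, pd);

          /* pauses the CPR state machine and ends with a
           * regulator_disable() of the vdd-apc supply */
          return cpr_disable(drv);
  }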

Now, imagine what happens during system wide suspend/resume:

 - CPR: The CPR state machine gets disabled. The voltage stays as-is.
     - With "regulator-always-on": The CPU keeps running until WFI.
     - Without: I would expect the CPU is dead immediately(?)

 - RPMPD: The performance/enable vote is dropped. The power domain might
   go to minimal voltage or even turn off completely. However, the CPU
   actually needs to keep running at the same frequency until WFI!
   Worst case, the CPU is dead immediately when the power domain votes
   get dropped.

In case of RPMPD, the votes must remain even during system wide suspend.
The special _AO variant of the power domain tells the firmware to
release the votes once the CPU has been shut down cleanly. It will also
restore them once the CPU wakes up (long before the resume handlers run).

My conclusion was that in both cases we want to keep the "power domain"
enabled, since the CPU must keep running for a short while even after
the system suspend handlers have been called.

Does this help with understanding the problem? It's a bit complicated. :D

Thanks!
Stephan

PS: This is essentially just another manifestation of a discussion we
had a few times already over the years about where to enable power
domains used by cpufreq, e.g. [1, 2, 3, 4]. Apparently I already
mentioned back in 2021 that QCS404 is broken [5] (I forgot
about that :')).

[1]: https://lore.kernel.org/linux-pm/YLi5N06Qs+gYHgYg@gerhold.net/
[2]: https://lore.kernel.org/linux-pm/20200826093328.88268-1-stephan@gerhold.net/
[3]: https://lore.kernel.org/linux-pm/20200730080146.25185-1-stephan@gerhold.net/
[4]: https://lore.kernel.org/linux-arm-msm/20200426123140.GA190483@gerhold.net/
[5]: https://lore.kernel.org/linux-pm/YLoTl7MfMfq2g10h@gerhold.net/
Ulf Hansson Sept. 29, 2023, 1:14 p.m. UTC | #3
On Wed, 13 Sept 2023 at 14:26, Stephan Gerhold <stephan@gerhold.net> wrote:
>
> On Wed, Sep 13, 2023 at 12:56:16PM +0200, Ulf Hansson wrote:
> > On Tue, 12 Sept 2023 at 11:40, Stephan Gerhold
> > <stephan.gerhold@kernkonzept.com> wrote:
> > >
> > > The genpd core ignores performance state votes from devices that are
> > > runtime suspended as of commit 5937c3ce2122 ("PM: domains: Drop/restore
> > > performance state votes for devices at runtime PM").
> >
> > I think you are referring to the wrong commit above. Please have a
> > look at commit 3c5a272202c2 ("PM: domains: Improve runtime PM
> > performance state handling"), instead.
> >
> > I also suggest rephrasing the above into saying that the performance
> > state vote for a device is cached rather than carried out, if
> > pm_runtime_suspended() returns true for it.
> >
> > Another relevant piece of information for the commit message would be
> > that during device-attach, genpd_dev_pm_attach_by_id() calls
> > pm_runtime_enable() for the device.
> >
>
> Thanks, I will try to clarify this a bit! I was actually looking at that
> commit originally but decided to reference the commit that "started the
> change", since this commit is marked as a fix of the one I referenced.
> But I think you're right, it would be clearer to reference "PM:
> domains: Improve runtime PM performance state handling" directly.
>
> > > However, at the
> > > moment nothing ever enables the virtual devices created in
> > > qcom-cpufreq-nvmem for the cpufreq power domain scaling, so they are
> > > permanently runtime-suspended.
> > >
> > > Fix this by enabling the devices after attaching them and using
> > > dev_pm_syscore_device() to ensure the power domain also stays on when
> > > going to suspend. Since it supplies the CPU we can never turn it off
> > > from Linux. There are other mechanisms to turn it off when needed,
> > > usually in the RPM firmware or the cpuidle path.
> > >
> > > Without this fix performance state votes are silently ignored, and the
> > > CPU/CPR voltage is never adjusted. This has been broken since 5.14 but
> > > for some reason no one noticed this on QCS404 so far.
> > >
> > > Cc: stable@vger.kernel.org
> > > Fixes: 1cb8339ca225 ("cpufreq: qcom: Add support for qcs404 on nvmem driver")
> > > Signed-off-by: Stephan Gerhold <stephan.gerhold@kernkonzept.com>
> > > ---
> > >  drivers/cpufreq/qcom-cpufreq-nvmem.c | 21 ++++++++++++++++++++-
> > >  1 file changed, 20 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/drivers/cpufreq/qcom-cpufreq-nvmem.c b/drivers/cpufreq/qcom-cpufreq-nvmem.c
> > > index 84d7033e5efe..17d6ab14c909 100644
> > > --- a/drivers/cpufreq/qcom-cpufreq-nvmem.c
> > > +++ b/drivers/cpufreq/qcom-cpufreq-nvmem.c
> > > @@ -25,6 +25,7 @@
> > >  #include <linux/platform_device.h>
> > >  #include <linux/pm_domain.h>
> > >  #include <linux/pm_opp.h>
> > > +#include <linux/pm_runtime.h>
> > >  #include <linux/slab.h>
> > >  #include <linux/soc/qcom/smem.h>
> > >
> > > @@ -280,6 +281,7 @@ static int qcom_cpufreq_probe(struct platform_device *pdev)
> > >         }
> > >
> > >         for_each_possible_cpu(cpu) {
> > > +               struct device **virt_devs = NULL;
> > >                 struct dev_pm_opp_config config = {
> > >                         .supported_hw = NULL,
> > >                 };
> > > @@ -300,7 +302,7 @@ static int qcom_cpufreq_probe(struct platform_device *pdev)
> > >
> > >                 if (drv->data->genpd_names) {
> > >                         config.genpd_names = drv->data->genpd_names;
> > > -                       config.virt_devs = NULL;
> > > +                       config.virt_devs = &virt_devs;
> > >                 }
> > >
> > >                 if (config.supported_hw || config.genpd_names) {
> > > @@ -311,6 +313,23 @@ static int qcom_cpufreq_probe(struct platform_device *pdev)
> > >                                 goto free_opp;
> > >                         }
> > >                 }
> > > +
> > > +               if (virt_devs) {
> > > +                       const char * const *name = config.genpd_names;
> > > +                       int i;
> > > +
> > > +                       for (i = 0; *name; i++, name++) {
> > > +                               ret = pm_runtime_resume_and_get(virt_devs[i]);
> > > +                               if (ret) {
> > > +                                       dev_err(cpu_dev, "failed to resume %s: %d\n",
> > > +                                               *name, ret);
> > > +                                       goto free_opp;
> > > +                               }
> >
> > Shouldn't we restore the usage count at ->remove() too?
> >
> > > +
> > > +                               /* Keep CPU power domain always-on */
> > > +                               dev_pm_syscore_device(virt_devs[i], true);
> >
> > Is this really correct? cpufreq is suspended/resumed by the PM core
> > during system wide suspend/resume. See dpm_suspend|resume(). Isn't
> > that sufficient?
> >
> > Moreover, it looks like the cpr genpd provider supports genpd's
> > ->power_on|off() callbacks. Is there something wrong with this, that I
> > am missing?
> >
>
> I think this question is a quite fundamental one. To explain this
> properly I will need to delve a bit into the implementation details of
> the two different GENPD providers that are applicable here:
>
> Fundamentally, we are describing the main power supply for the CPU here.
> Consider a simple regulator with adjustable voltage. From the Linux
> point of view this regulator should be marked as "regulator-always-on".
> If we were to turn off this regulator, the CPU would be immediately dead
> without proper shutdown done by firmware or hardware.
>
> Representing the regulator as a power domain does not change much, except
> that we now have abstract "performance states" instead of actual voltages.
> However, for power domains there is currently no generic mechanism like
> "regulator-always-on" in the DT, only drivers can specify
> GENPD_FLAG_ALWAYS_ON.

We have relied on genpd providers to act on their compatible strings
to make the correct configuration. If that isn't sufficient, I don't
see why we couldn't add a new DT property corresponding to
GENPD_FLAG_ALWAYS_ON.
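
For a genpd provider this is a one-flag change at init time (sketch with
the existing genpd API; drv->pd is a placeholder for the provider's
generic_pm_domain):

  /* provider decides, e.g. from its compatible, that this domain
   * supplies the CPU and must never be powered off by genpd */
  drv->pd.flags |= GENPD_FLAG_ALWAYS_ON;
  ret = pm_genpd_init(&drv->pd, NULL, false); /* register powered-on */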

>
> The special situation on MSM8909 is that there are two possible setups
> for the CPU power supply depending on the PMIC that is used (see
> "[PATCH 4/4] cpufreq: qcom-nvmem: Add MSM8909"): CPR or RPMPD. Both are
> GENPD providers so in theory we can just have either
>
>   cpu@0 { power-domains = <&cpr>; }; // or
>   cpu@0 { power-domains = <&rpmpd MSM8909_VDDCX_AO>; };
>
> in the DT, without handling this specifically on the cpufreq side.

Looks like it would be nice to get a patch for the MSM8909 DTS too, as
part of the series, to get a better picture of how this is going to be
used. Would that be possible for you to provide?

>
> The two GENPD providers behave quite differently though:
>
>  - CPR: CPR is not really a power domain itself. It's more like a monitor
>    on a power supply line coming from some other regulator. CPR provides
>    suggestions on how to adjust the voltage for best power/stability.
>
>    The GENPD .power_off() disables the CPR state machine and forwards
>    this to the regulator with regulator_disable(). On QCS404 the
>    regulator is marked regulator-always-on, so it will never be disabled
>    from Linux. The SAW/SPM hardware component on Qualcomm SoCs will
>    usually disable the regulator during deep cpuidle states.

Parts of this sound a bit odd to me. The CPR/CPUfreq shouldn't really
need to vote for the CPU's power-rail(s) from a powered-on/off (CPU
idle states) point of view, but only from a performance (voltage
level) point of view.

If the enable/disable voting on the regulator really has an impact on
some platforms, it sounds like it could prevent deeper CPU idle states
too. That's probably not what we want, right?

I also had a look at the existing CPR genpd provider's probe
function/path (cpr_probe()) - and it turns out there is no call to
regulator_enable(). Whatever that means to us.

>
>  - RPMPD: This is the generic driver for all the SoC power domains
>    managed by the RPM firmware. It's not CPU-specific. However, as
>    a special feature each power domain is exposed twice in Linux, e.g.
>    "MSM8909_VDDCX" and "MSM8909_VDDCX_AO". The _AO ("active-only")
>    variant tells the RPM firmware that the performance/enable vote only
>    applies when the CPU is active (not in deep cpuidle state).
>
>    The GENPD .power_off() drops all performance state votes and also
>    releases the "enable" vote for the power domain.
>
> Now, imagine what happens during system wide suspend/resume:
>
>  - CPR: The CPR state machine gets disabled. The voltage stays as-is.
>      - With "regulator-always-on": The CPU keeps running until WFI.
>      - Without: I would expect the CPU is dead immediately(?)

As I indicated above, I am starting to feel that this is a bit upside
down. CPR/CPUfreq should vote on voltages to scale performance, but
not for cpu idle states.

Perhaps what is missing is a synchronization point or a notification,
to inform the CPR driver that its state machine (registers) needs to
be saved/restored, when the power-rails get turned on/off. In fact, we
have a couple mechanisms at hand to support this.

>
>  - RPMPD: The performance/enable vote is dropped. The power domain might
>    go to minimal voltage or even turn off completely. However, the CPU
>    actually needs to keep running at the same frequency until WFI!
>    Worst case, the CPU is dead immediately when the power domain votes
>    get dropped.

Since RPMPD is managing the voting for both performance and low power
states for different kinds of devices, this certainly gets a bit more
complicated.

On the other hand, the CPUfreq driver should really only vote for
performance states for the CPUs and not for low power states. The
latter is a job for cpuidle and other consumers of the RPMPD to
manage, I think.

So, while thinking of this, I just realized that it may not always be
a good idea for genpd to cache a performance state request for an
attached device for which pm_runtime_suspended() returns true. As this
is the default behaviour in genpd, I am thinking that we
need a way to make that behaviour configurable for an attached device.
What do you think about that?
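
One hypothetical shape for this (flag name and placement invented purely
for illustration):

  /* consumer opts out of vote caching after attaching the virt_dev */
  dev_gpd_data(virt_dev)->no_pstate_cache = true;

  /* dev_pm_genpd_set_performance_state() would then propagate the
   * vote via genpd_set_performance_state() even while the device is
   * runtime suspended, instead of caching it in rpm_pstate */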

>
> In case of RPMPD, the votes must remain even during system wide suspend.
> The special _AO variant of the power domain tells the firmware to
> release the votes once the CPU has been shut down cleanly. It will also
> restore them once the CPU wakes up (long before the resume handlers run).
>
> My conclusion was that in both cases we want to keep the "power domain"
> enabled, since the CPU must keep running for a short while even after
> the system suspend handlers have been called.

It looks to me that the system wide suspend/resume case isn't really
much different from the runtime suspend/resume case.

It's not CPUfreq's role (by calling pm_runtime_resume_and_get(), for
example) to prevent the RPMPD from entering a low power state.

>
> Does this help with understanding the problem? It's a bit complicated. :D

Yes, I think so, thanks!

Although, I am afraid my response made this even more complicated. :-)

>
> Thanks!
> Stephan
>
> PS: This is essentially just another manifestation of a discussion we
> had a few times already over the years about where to enable power
> domains used by cpufreq, e.g. [1, 2, 3, 4]. Apparently I already
> mentioned back in 2021 that QCS404 is broken [5] (I forgot
> about that :')).

Right, I recall these discussions now, thanks for the pointers.

Let's try to get to the bottom of this and figure out a proper solution.

>
> [1]: https://lore.kernel.org/linux-pm/YLi5N06Qs+gYHgYg@gerhold.net/
> [2]: https://lore.kernel.org/linux-pm/20200826093328.88268-1-stephan@gerhold.net/
> [3]: https://lore.kernel.org/linux-pm/20200730080146.25185-1-stephan@gerhold.net/
> [4]: https://lore.kernel.org/linux-arm-msm/20200426123140.GA190483@gerhold.net/
> [5]: https://lore.kernel.org/linux-pm/YLoTl7MfMfq2g10h@gerhold.net/

Kind regards
Uffe
Stephan Gerhold Sept. 29, 2023, 5:01 p.m. UTC | #4
On Fri, Sep 29, 2023 at 03:14:07PM +0200, Ulf Hansson wrote:
> On Wed, 13 Sept 2023 at 14:26, Stephan Gerhold <stephan@gerhold.net> wrote:
> > On Wed, Sep 13, 2023 at 12:56:16PM +0200, Ulf Hansson wrote:
> > > On Tue, 12 Sept 2023 at 11:40, Stephan Gerhold
> > > <stephan.gerhold@kernkonzept.com> wrote:
> > > > [...]
> > > > However, at the
> > > > moment nothing ever enables the virtual devices created in
> > > > qcom-cpufreq-nvmem for the cpufreq power domain scaling, so they are
> > > > permanently runtime-suspended.
> > > >
> > > > Fix this by enabling the devices after attaching them and using
> > > > dev_pm_syscore_device() to ensure the power domain also stays on when
> > > > going to suspend. Since it supplies the CPU we can never turn it off
> > > > from Linux. There are other mechanisms to turn it off when needed,
> > > > usually in the RPM firmware or the cpuidle path.
> > > >
> > > > Without this fix performance state votes are silently ignored, and the
> > > > CPU/CPR voltage is never adjusted. This has been broken since 5.14 but
> > > > for some reason no one noticed this on QCS404 so far.
> > > >
> > > > Cc: stable@vger.kernel.org
> > > > Fixes: 1cb8339ca225 ("cpufreq: qcom: Add support for qcs404 on nvmem driver")
> > > > Signed-off-by: Stephan Gerhold <stephan.gerhold@kernkonzept.com>
> > > > ---
> > > >  drivers/cpufreq/qcom-cpufreq-nvmem.c | 21 ++++++++++++++++++++-
> > > >  1 file changed, 20 insertions(+), 1 deletion(-)
> > > >
> > > > diff --git a/drivers/cpufreq/qcom-cpufreq-nvmem.c b/drivers/cpufreq/qcom-cpufreq-nvmem.c
> > > > index 84d7033e5efe..17d6ab14c909 100644
> > > > --- a/drivers/cpufreq/qcom-cpufreq-nvmem.c
> > > > +++ b/drivers/cpufreq/qcom-cpufreq-nvmem.c
> > > > @@ -25,6 +25,7 @@
> > > >  #include <linux/platform_device.h>
> > > >  #include <linux/pm_domain.h>
> > > >  #include <linux/pm_opp.h>
> > > > +#include <linux/pm_runtime.h>
> > > >  #include <linux/slab.h>
> > > >  #include <linux/soc/qcom/smem.h>
> > > >
> > > > @@ -280,6 +281,7 @@ static int qcom_cpufreq_probe(struct platform_device *pdev)
> > > >         }
> > > >
> > > >         for_each_possible_cpu(cpu) {
> > > > +               struct device **virt_devs = NULL;
> > > >                 struct dev_pm_opp_config config = {
> > > >                         .supported_hw = NULL,
> > > >                 };
> > > > @@ -300,7 +302,7 @@ static int qcom_cpufreq_probe(struct platform_device *pdev)
> > > >
> > > >                 if (drv->data->genpd_names) {
> > > >                         config.genpd_names = drv->data->genpd_names;
> > > > -                       config.virt_devs = NULL;
> > > > +                       config.virt_devs = &virt_devs;
> > > >                 }
> > > >
> > > >                 if (config.supported_hw || config.genpd_names) {
> > > > @@ -311,6 +313,23 @@ static int qcom_cpufreq_probe(struct platform_device *pdev)
> > > >                                 goto free_opp;
> > > >                         }
> > > >                 }
> > > > +
> > > > +               if (virt_devs) {
> > > > +                       const char * const *name = config.genpd_names;
> > > > +                       int i;
> > > > +
> > > > +                       for (i = 0; *name; i++, name++) {
> > > > +                               ret = pm_runtime_resume_and_get(virt_devs[i]);
> > > > +                               if (ret) {
> > > > +                                       dev_err(cpu_dev, "failed to resume %s: %d\n",
> > > > +                                               *name, ret);
> > > > +                                       goto free_opp;
> > > > +                               }
> > >
> > > Shouldn't we restore the usage count at ->remove() too?
> > >
> > > > +
> > > > +                               /* Keep CPU power domain always-on */
> > > > +                               dev_pm_syscore_device(virt_devs[i], true);
> > >
> > > Is this really correct? cpufreq is suspended/resumed by the PM core
> > > during system wide suspend/resume. See dpm_suspend|resume(). Isn't
> > > that sufficient?
> > >
> > > Moreover, it looks like the cpr genpd provider supports genpd's
> > > ->power_on|off() callbacks. Is there something wrong with this, that I
> > > am missing?
> > >
> >
> > I think this question is a quite fundamental one. To explain this
> > properly I will need to delve a bit into the implementation details of
> > the two different GENPD providers that are applicable here:
> >
> > Fundamentally, we are describing the main power supply for the CPU here.
> > Consider a simple regulator with adjustable voltage. From the Linux
> > point of view this regulator should be marked as "regulator-always-on".
> > If we were to turn off this regulator, the CPU would be immediately dead
> > without proper shutdown done by firmware or hardware.
> >
> > Representing the regulator as a power domain does not change much, except
> > that we now have abstract "performance states" instead of actual voltages.
> > However, for power domains there is currently no generic mechanism like
> > "regulator-always-on" in the DT, only drivers can specify
> > GENPD_FLAG_ALWAYS_ON.
> 
> We have relied on genpd providers to act on their compatible strings
> to make the correct configuration. If that isn't sufficient, I don't
> see why we couldn't add a new DT property corresponding to
> GENPD_FLAG_ALWAYS_ON.
> 

Right. It's not completely trivial though, since a DT node may provide
many different power domains with #power-domain-cells = <N>. A regulator
on the other hand has a dedicated DT node where you can just add
"regulator-always-on". :')

> >
> > The special situation on MSM8909 is that there are two possible setups
> > for the CPU power supply depending on the PMIC that is used (see
> > "[PATCH 4/4] cpufreq: qcom-nvmem: Add MSM8909"): CPR or RPMPD. Both are
> > GENPD providers so in theory we can just have either
> >
> >   cpu@0 { power-domains = <&cpr>; }; // or
> >   cpu@0 { power-domains = <&rpmpd MSM8909_VDDCX_AO>; };
> >
> > in the DT, without handling this specifically on the cpufreq side.
> 
> Looks like it would be nice to get a patch for the MSM8909 DTS too, as
> part of the series, to get a better picture of how this is going to be
> used. Would that be possible for you to provide?
> 

Sure! Right now I cannot include it as a working patch in this series
since I don't have the base SoC DT (msm8909.dtsi) upstream yet. It's
mostly a copy-paste of msm8916.dtsi so I was trying to finish up the
SoC-specific parts before sending it.

I'm happy to provide links to the full DT and my changes though. Does
that help? If you would like to comment inline I could copy paste the
diffs in a mail or include some kind of RFC patch. It just wouldn't be
possible to apply it successfully. :')

Here are the two commits with my current DT changes (WIP):
  - MSM8909+PM8909 (RPMPD only):
    https://github.com/msm8916-mainline/linux/commit/791e0c5a3162372a0738bc7b0f4a5e87247923db
  - MSM8916 (CPR+RPMPD):
    https://github.com/msm8916-mainline/linux/commit/8880f39108206d7a60a0a8351c0373bddf58657c
  (- QCS404 (CPR only): already in mainline (see qcs404.dtsi))

> >
> > The two GENPD providers behave quite differently though:
> >
> >  - CPR: CPR is not really a power domain itself. It's more like a monitor
> >    on a power supply line coming from some other regulator. CPR provides
> >    suggestions on how to adjust the voltage for best power/stability.
> >
> >    The GENPD .power_off() disables the CPR state machine and forwards
> >    this to the regulator with regulator_disable(). On QCS404 the
> >    regulator is marked regulator-always-on, so it will never be disabled
> >    from Linux. The SAW/SPM hardware component on Qualcomm SoCs will
> >    usually disable the regulator during deep cpuidle states.
> 
> Parts of this sound a bit odd to me. The CPR/CPUfreq shouldn't really
> need to vote for the CPU's power-rail(s) from a powered-on/off (CPU
> idle states) point of view, but only from a performance (voltage
> level) point of view.
> 
> If the enable/disable voting on the regulator really has an impact on
> some platforms, it sounds like it could prevent deeper CPU idle states
> too. That's probably not what we want, right?
> 

I think this heavily depends on what exactly this "regulator"
represents. Are we talking about a physical regulator with a binary
enable/disable signal or actually some hardware/firmware magic that
combines multiple independent "votes"?

If we are talking about a physical regulator then we can never disable
it from Linux. Not even during CPU idle states. It would just cut off
all power immediately and kill the CPU without proper shutdown. Instead,
the platform might have special hardware/firmware functionality that
will control the actual physical enable/disable signal of the regulator.

> I also had a look at the existing CPR genpd provider's probe
> function/path (cpr_probe()) - and it turns out there is no call to
> regulator_enable(). Whatever that means to us.

In most (all?) setups the CPR genpd provider will manage the actual
physical regulator. It could be part of the PMIC or even some
off-the-shelf regulator with I2C control. It doesn't matter. There is
nothing special about that regulator. You have the standard Linux
regulator driver, set up the DT node for it and hook it up to CPR.

Now, to prevent the regulator driver in Linux from touching the physical
enable signal (see above) we add "regulator-always-on". When Linux
requests deep CPU idle states via PSCI the hardware will toggle the
physical enable/disable signal of the regulator for us (after the CPU
has been shut down).

On some platforms CPR is also used for the GPU or other power rails that
are not critical for the CPU to run. In that case it's fine to disable
the regulator directly from Linux. Just not for the CPU.
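
As a sketch, modeled on the upstream QCS404 DTs (trimmed):

  pms405_s3: s3 {
          regulator-min-microvolt = <1048000>;
          regulator-max-microvolt = <1352000>;
          /* only hardware/firmware may toggle the enable signal */
          regulator-always-on;
  };

  cpr: power-controller@b018000 {
          compatible = "qcom,qcs404-cpr", "qcom,cpr";
          vdd-apc-supply = <&pms405_s3>;
          #power-domain-cells = <0>;
  };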

> 
> >
> >  - RPMPD: This is the generic driver for all the SoC power domains
> >    managed by the RPM firmware. It's not CPU-specific. However, as
> >    a special feature each power domain is exposed twice in Linux, e.g.
> >    "MSM8909_VDDCX" and "MSM8909_VDDCX_AO". The _AO ("active-only")
> >    variant tells the RPM firmware that the performance/enable vote only
> >    applies when the CPU is active (not in deep cpuidle state).
> >
> >    The GENPD .power_off() drops all performance state votes and also
> >    releases the "enable" vote for the power domain.
> >
> > Now, imagine what happens during system wide suspend/resume:
> >
> >  - CPR: The CPR state machine gets disabled. The voltage stays as-is.
> >      - With "regulator-always-on": The CPU keeps running until WFI.
> >      - Without: I would expect the CPU is dead immediately(?)
> 
> As I indicated above, I am starting to feel that this is a bit upside
> down. CPR/CPUfreq should vote on voltages to scale performance, but
> not for cpu idle states.
> 
> Perhaps what is missing is a synchronization point or a notification,
> to inform the CPR driver that its state machine (registers) needs to
> be saved/restored, when the power-rails get turned on/off. In fact, we
> have a couple mechanisms at hand to support this.
> 

I think we can ignore this part of CPR for now. AFAICT Qualcomm's vendor
driver does not explicitly disable the CPR state machine during CPU idle
when the power rails are potentially turned off. They only do it during
system wide suspend, for whatever reason. For that we don't need such a
notification mechanism.

> >
> >  - RPMPD: The performance/enable vote is dropped. The power domain might
> >    go to minimal voltage or even turn off completely. However, the CPU
> >    actually needs to keep running at the same frequency until WFI!
> >    Worst case, the CPU is dead immediately when the power domain votes
> >    get dropped.
> 
> Since RPMPD is managing the voting for both performance and low power
> states for different kinds of devices, this certainly gets a bit more
> complicated.
> 
> On the other hand, the CPUfreq driver should really only vote for
> performance states for the CPUs and not for low power states. The
> latter is a job for cpuidle and other consumers of the RPMPD to
> manage, I think.
> 
> So, while thinking of this, I just realized that it may not always be
> a good idea for genpd to cache a performance state request for an
> attached device for which pm_runtime_suspended() returns true. As this
> is the default behaviour in genpd, I am thinking that we
> need a way to make that behaviour configurable for an attached device.
> What do you think about that?
> 

Hm. This would be a bit of a special case of course. But I think this
would be fine to solve the regression for CPR on QCS404.

Then we "just" need to solve the fundamental question from a few years
ago: Who *will* actually vote for enabling the power domains/regulators
required by the CPU? :D

I agree that enabling/disabling power supplies feels closer to cpuidle.
But it's not a perfect fit either, given that we don't actually want to
change our vote while entering CPU idle states. I think on all platforms
I'm looking at here we need a permanent enable vote (effectively making
the regulator/power domains always-on from the Linux point of view).

We could solve this by adding a "regulator-always-on" mechanism in the
DT for power domains. This feels more like a workaround to me than an
actual solution. With this the CPU won't appear as always-on consumer of
the power domains in debugfs. There will just be a "suspended" consumer
attributed to the CPU (from CPUfreq, since we don't have a dedicated
device for CPUfreq).

While this patch is a bit strange from a conceptual perspective, on the
implementation side it effectively makes that CPU consumer appear as
active. So the end result is actually kind of the one we need. :'D

Thanks,
Stephan
Ulf Hansson Oct. 12, 2023, 11:33 a.m. UTC | #5
On Fri, 29 Sept 2023 at 19:01, Stephan Gerhold <stephan@gerhold.net> wrote:
>
> On Fri, Sep 29, 2023 at 03:14:07PM +0200, Ulf Hansson wrote:
> > On Wed, 13 Sept 2023 at 14:26, Stephan Gerhold <stephan@gerhold.net> wrote:
> > > On Wed, Sep 13, 2023 at 12:56:16PM +0200, Ulf Hansson wrote:
> > > > On Tue, 12 Sept 2023 at 11:40, Stephan Gerhold
> > > > <stephan.gerhold@kernkonzept.com> wrote:
> > > > > [...]
> > > > > However, at the
> > > > > moment nothing ever enables the virtual devices created in
> > > > > qcom-cpufreq-nvmem for the cpufreq power domain scaling, so they are
> > > > > permanently runtime-suspended.
> > > > >
> > > > > Fix this by enabling the devices after attaching them and using
> > > > > dev_pm_syscore_device() to ensure the power domain also stays on when
> > > > > going to suspend. Since it supplies the CPU we can never turn it off
> > > > > from Linux. There are other mechanisms to turn it off when needed,
> > > > > usually in the RPM firmware or the cpuidle path.
> > > > >
> > > > > Without this fix performance state votes are silently ignored, and the
> > > > > CPU/CPR voltage is never adjusted. This has been broken since 5.14 but
> > > > > for some reason no one noticed this on QCS404 so far.
> > > > >
> > > > > Cc: stable@vger.kernel.org
> > > > > Fixes: 1cb8339ca225 ("cpufreq: qcom: Add support for qcs404 on nvmem driver")
> > > > > Signed-off-by: Stephan Gerhold <stephan.gerhold@kernkonzept.com>
> > > > > ---
> > > > >  drivers/cpufreq/qcom-cpufreq-nvmem.c | 21 ++++++++++++++++++++-
> > > > >  1 file changed, 20 insertions(+), 1 deletion(-)
> > > > >
> > > > > diff --git a/drivers/cpufreq/qcom-cpufreq-nvmem.c b/drivers/cpufreq/qcom-cpufreq-nvmem.c
> > > > > index 84d7033e5efe..17d6ab14c909 100644
> > > > > --- a/drivers/cpufreq/qcom-cpufreq-nvmem.c
> > > > > +++ b/drivers/cpufreq/qcom-cpufreq-nvmem.c
> > > > > @@ -25,6 +25,7 @@
> > > > >  #include <linux/platform_device.h>
> > > > >  #include <linux/pm_domain.h>
> > > > >  #include <linux/pm_opp.h>
> > > > > +#include <linux/pm_runtime.h>
> > > > >  #include <linux/slab.h>
> > > > >  #include <linux/soc/qcom/smem.h>
> > > > >
> > > > > @@ -280,6 +281,7 @@ static int qcom_cpufreq_probe(struct platform_device *pdev)
> > > > >         }
> > > > >
> > > > >         for_each_possible_cpu(cpu) {
> > > > > +               struct device **virt_devs = NULL;
> > > > >                 struct dev_pm_opp_config config = {
> > > > >                         .supported_hw = NULL,
> > > > >                 };
> > > > > @@ -300,7 +302,7 @@ static int qcom_cpufreq_probe(struct platform_device *pdev)
> > > > >
> > > > >                 if (drv->data->genpd_names) {
> > > > >                         config.genpd_names = drv->data->genpd_names;
> > > > > -                       config.virt_devs = NULL;
> > > > > +                       config.virt_devs = &virt_devs;
> > > > >                 }
> > > > >
> > > > >                 if (config.supported_hw || config.genpd_names) {
> > > > > @@ -311,6 +313,23 @@ static int qcom_cpufreq_probe(struct platform_device *pdev)
> > > > >                                 goto free_opp;
> > > > >                         }
> > > > >                 }
> > > > > +
> > > > > +               if (virt_devs) {
> > > > > +                       const char * const *name = config.genpd_names;
> > > > > +                       int i;
> > > > > +
> > > > > +                       for (i = 0; *name; i++, name++) {
> > > > > +                               ret = pm_runtime_resume_and_get(virt_devs[i]);
> > > > > +                               if (ret) {
> > > > > +                                       dev_err(cpu_dev, "failed to resume %s: %d\n",
> > > > > +                                               *name, ret);
> > > > > +                                       goto free_opp;
> > > > > +                               }
> > > >
> > > > Shouldn't we restore the usage count at ->remove() too?
> > > >
> > > > > +
> > > > > +                               /* Keep CPU power domain always-on */
> > > > > +                               dev_pm_syscore_device(virt_devs[i], true);
> > > >
> > > > Is this really correct? cpufreq is suspended/resumed by the PM core
> > > > during system wide suspend/resume. See dpm_suspend|resume(). Isn't
> > > > that sufficient?
> > > >
> > > > Moreover, it looks like the cpr genpd provider supports genpd's
> > > > ->power_on|off() callbacks. Is there something wrong with this, that I
> > > > am missing?
> > > >
> > >
> > > I think this question is a quite fundamental one. To explain this
> > > properly I will need to delve a bit into the implementation details of
> > > the two different GENPD providers that are applicable here:
> > >
> > > Fundamentally, we are describing the main power supply for the CPU here.
> > > Consider a simple regulator with adjustable voltage. From the Linux
> > > point of view this regulator should be marked as "regulator-always-on".
> > > If we were to turn off this regulator, the CPU would be immediately dead
> > > without proper shutdown done by firmware or hardware.
> > >
> > > Representing the regulator as a power domain does not change much, except
> > > that we now have abstract "performance states" instead of actual voltages.
> > > However, for power domains there is currently no generic mechanism like
> > > "regulator-always-on" in the DT, only drivers can specify
> > > GENPD_FLAG_ALWAYS_ON.
> >
> > We have relied on genpd providers to act on their compatible strings
> > to make the correct configuration. If that isn't sufficient, I don't
> > see why we couldn't add a new DT property corresponding to
> > GENPD_FLAG_ALWAYS_ON.
> >
>
> Right. It's not completely trivial though, since a DT node may provide
> many different power domains with #power-domain-cells = <N>. A regulator
> on the other hand has a dedicated DT node where you can just add
> "regulator-always-on". :')

Sure, it can get a bit messy, but we will work it out if we have to.

Perhaps looking for a specific compatible string for the cpr can work
instead? No?

>
> > >
> > > The special situation on MSM8909 is that there are two possible setups
> > > for the CPU power supply depending on the PMIC that is used (see
> > > "[PATCH 4/4] cpufreq: qcom-nvmem: Add MSM8909"): CPR or RPMPD. Both are
> > > GENPD providers so in theory we can just have either
> > >
> > >   cpu@0 { power-domains = <&cpr>; }; // or
> > >   cpu@0 { power-domains = <&rpmpd MSM8909_VDDCX_AO>; };
> > >
> > > in the DT, without handling this specifically on the cpufreq side.
> >
> > Looks like it would be nice to get a patch for the MSM8909 DTS too, as
> > part of the series, to get a better picture of how this is going to be
> > used. Would that be possible for you to provide?
> >
>
> Sure! Right now I cannot include it as a working patch in this series
> since I don't have the base SoC DT (msm8909.dtsi) upstream yet. It's
> mostly a copy-paste of msm8916.dtsi so I was trying to finish up the
> SoC-specific parts before sending it.
>
> I'm happy to provide links to the full DT and my changes though. Does
> that help? If you would like to comment inline I could copy paste the
> diffs in a mail or include some kind of RFC patch. It just wouldn't be
> possible to apply it successfully. :')
>
> Here are the two commits with my current DT changes (WIP):
>   - MSM8909+PM8909 (RPMPD only):
>     https://github.com/msm8916-mainline/linux/commit/791e0c5a3162372a0738bc7b0f4a5e87247923db

Okay, so this looks pretty straightforward. One thing though, it
looks like we need to update the DT bindings for cpus.

I recently updated Documentation/devicetree/bindings/arm/cpus.yaml
[1], to let "perf" be the common "power-domain-name" for a CPU's SCMI
performance-domain. It looks like we should extend the description to
allow "perf" to be used for all types of performance domains.

>   - MSM8916 (CPR+RPMPD):
>     https://github.com/msm8916-mainline/linux/commit/8880f39108206d7a60a0a8351c0373bddf58657c

This looks a bit odd to me. Does a CPU really have four different
power-domains, where three of them are performance-domains?

In a way it sounds like an option could be to hook up the cpr to the
rpmpd:s instead (possibly even set it as a child-domain to the
rpmpd:s), assuming that is a better description of the HW, which it
may not be, of course.
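
For the record, the genpd API for that already exists; with placeholder
pointers to the two domains it would be something like:

  /* make the CPR domain a child of the VDDCX rpmpd domain */
  ret = pm_genpd_add_subdomain(vddcx_pd, cpr_pd);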

When it comes to the regulator, vdd-apc-supply, it seems fine to me to
set it as an always-on regulator. Maybe another option could simply be
to leave it enabled when the cpr driver has probed.

>   (- QCS404 (CPR only): already in mainline (see qcs404.dtsi))
>

Okay, so in this case it's solely the cpr that manages the performance
scaling for the CPU.

In regards to the vdd-apc-supply, it seems to be used in a similar
way to the case above.

> > >
> > > The two GENPD providers behave quite differently though:
> > >
> > >  - CPR: CPR is not really a power domain itself. It's more like a monitor
> > >    on a power supply line coming from some other regulator. CPR provides
> > >    suggestions on how to adjust the voltage for best power/stability.
> > >
> > >    The GENPD .power_off() disables the CPR state machine and forwards
> > >    this to the regulator with regulator_disable(). On QCS404 the
> > >    regulator is marked regulator-always-on, so it will never be disabled
> > >    from Linux. The SAW/SPM hardware component on Qualcomm SoCs will
> > >    usually disable the regulator during deep cpuidle states.
> >
> > Parts of this sound a bit odd to me. The CPR/CPUfreq shouldn't really
> > need to vote for the CPU's power-rail(s) from a powered-on/off (CPU
> > idle states) point of view, but only from a performance (voltage
> > level) point of view.
> >
> > If the enable/disable voting on the regulator really has an impact on
> > some platforms, it sounds like it could prevent deeper CPU idle states
> > too. That's probably not what we want, right?
> >
>
> I think this heavily depends on what exactly this "regulator"
> represents. Are we talking about a physical regulator with a binary
> enable/disable signal or actually some hardware/firmware magic that
> combines multiple independent "votes"?
>
> If we are talking about a physical regulator then we can never disable
> it from Linux. Not even during CPU idle states. It would just cut off
> all power immediately and kill the CPU without proper shutdown. Instead,
> the platform might have special hardware/firmware functionality that
> will control the actual physical enable/disable signal of the regulator.
>
> > I also had a look at the existing CPR genpd provider's probe
> > function/path (cpr_probe()) - and it turns out there is no call to
> > regulator_enable(). Whatever that means to us.
>
> In most (all?) setups the CPR genpd provider will manage the actual
> physical regulator. It could be part of the PMIC or even some
> off-the-shelf regulator with I2C control. It doesn't matter. There is
> nothing special about that regulator. You have the standard Linux
> regulator driver, set up the DT node for it and hook it up to CPR.
>
> Now, to prevent the regulator driver in Linux from touching the physical
> enable signal (see above) we add "regulator-always-on". When Linux
> requests deep CPU idle states via PSCI the hardware will toggle the
> physical enable/disable signal of the regulator for us (after the CPU
> has been shut down).
>
> On some platforms CPR is also used for the GPU or other power rails that
> are not critical for the CPU to run. In that case it's fine to disable
> the regulator directly from Linux. Just not for the CPU.

Right. I get the point, thanks for clarifying!

Still, the CPR can't just disable the regulator for a GPU without
using some kind of synchronization point for when to do it. The GPU
may be running some use cases, etc. Although, let's leave that out of
this discussion. :-)

>
> >
> > >
> > >  - RPMPD: This is the generic driver for all the SoC power domains
> > >    managed by the RPM firmware. It's not CPU-specific. However, as
> > >    a special feature each power domain is exposed twice in Linux, e.g.
> > >    "MSM8909_VDDCX" and "MSM8909_VDDCX_AO". The _AO ("active-only")
> > >    variant tells the RPM firmware that the performance/enable vote only
> > >    applies when the CPU is active (not in deep cpuidle state).
> > >
> > >    The GENPD .power_off() drops all performance state votes and also
> > >    releases the "enable" vote for the power domain.
> > >
> > > Now, imagine what happens during system wide suspend/resume:
> > >
> > >  - CPR: The CPR state machine gets disabled. The voltage stays as-is.
> > >      - With "regulator-always-on": The CPU keeps running until WFI.
> > >      - Without: I would expect the CPU is dead immediately(?)
> >
> > As I indicated above, I am starting to feel that this is a bit upside
> > down. CPR/CPUfreq should vote on voltages to scale performance, but
> > not for cpu idle states.
> >
> > Perhaps what is missing is a synchronization point or a notification,
> > to inform the CPR driver that its state machine (registers) needs to
> > be saved/restored, when the power-rails get turned on/off. In fact, we
> > have a couple mechanisms at hand to support this.
> >
>
> I think we can ignore this part of CPR for now. AFAICT Qualcomm's vendor
> driver does not explicitly disable the CPR state machine during CPU idle
> when the power rails are potentially turned off. They only do it during
> system wide suspend, for whatever reason. For that we don't need such a
> notification mechanism.

I see.

So, if I understand correctly, we could potentially use the regular
system suspend/resume callbacks for the CPR genpd provider driver,
rather than its genpd->power_on|off() callbacks?
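
Something like this in the cpr driver, perhaps (sketch only; the two
helpers are hypothetical and would pause/restart the CPR state machine
around system sleep):

  static DEFINE_SIMPLE_DEV_PM_OPS(cpr_pm_ops, cpr_pm_suspend, cpr_pm_resume);

  static struct platform_driver cpr_driver = {
          .probe = cpr_probe,
          .driver = {
                  .name = "qcom-cpr",
                  .pm = pm_sleep_ptr(&cpr_pm_ops),
          },
  };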

>
> > >
> > >  - RPMPD: The performance/enable vote is dropped. The power domain might
> > >    go to minimal voltage or even turn off completely. However, the CPU
> > >    actually needs to keep running at the same frequency until WFI!
> > >    Worst case, the CPU is dead immediately when the power domain votes
> > >    get dropped.
> >
> > Since RPMPD is managing the voting for both performance and low power
> > states for different kinds of devices, this certainly gets a bit more
> > complicated.
> >
> > On the other hand, the CPUfreq driver should really only vote for
> > performance states for the CPUs and not for low power states. The
> > latter is a job for cpuidle and other consumers of the RPMPD to
> > manage, I think.
> >
> > So, while thinking of this, I just realized that it may not always be
> > a good idea for genpd to cache a performance state request for an
> > attached device for which pm_runtime_suspended() returns true. As this
> > is the default behaviour in genpd, I am thinking that we
> > need a way to make that behaviour configurable for an attached device.
> > What do you think about that?
> >
>
> Hm. This would be a bit of a special case of course. But I think this
> would be fine to solve the regression for CPR on QCS404.

Okay, I will try to propose and submit something for this shortly. I
will keep you cc:ed.

>
> Then we "just" need to solve the fundamental question from a few years
> ago: Who *will* actually vote for enabling the power domains/regulators
> required by the CPU? :D
>
> I agree that enabling/disabling power supplies feels closer to cpuidle.
> But it's not a perfect fit either, given that we don't actually want to
> change our vote while entering CPU idle states. I think on all platforms
> I'm looking at here we need a permanent enable vote (effectively making
> the regulator/power domains always-on from the Linux point of view).
>
> We could solve this by adding a "regulator-always-on" mechanism in the
> DT for power domains. This feels more like a workaround to me than an
> actual solution.

From the discussions above, it sounded like it would be sufficient to
use the regulator-always-on for the actual regulator supply.

In the case where there is no cpr being used on the platform, there
also is no regulator that needs to stay enabled, right?

> With this the CPU won't appear as an always-on consumer of
> the power domains in debugfs. There will just be a "suspended" consumer
> attributed to the CPU (from CPUfreq, since we don't have a dedicated
> device for CPUfreq).

I didn't quite get this part.

The devices that we hook up to the genpd from cpufreq are used for
performance scaling, not for power-on/off things. It shouldn't matter
if these devices are "suspended" from debugfs/sysfs point of view,
right?

Or did I fail to understand your point?

>
> While this patch is a bit strange from a conceptual perspective, on the
> implementation side it effectively makes that CPU consumer appear as
> active. So the end result is actually kind of the one we need. :'D

Right. It looks like we are concluding on the way forward. :-)

*) The approach you have taken in the $subject patch with the call to
pm_runtime_resume_and_get() works as a fix for QCS404, as there is
only the CPR to attach to. The problem with it is that it doesn't
work for cases where the RPMPD is used for performance scaling, either
separately or in combination with the CPR. It would keep the rpmpd:s
powered-on, which would be wrong. In regards to the
dev_pm_syscore_device() thingy, this should not be needed, as long as
we keep the vdd-apc-supply enabled, right?

**) A more generic solution, that would work for all cases (even
when/if hooking up the CPR to the rpmpd:s), consists of tweaking genpd
to avoid "caching" performance states for these kinds of devices. And
again, I don't see that we need dev_pm_syscore_device(), assuming we
manage the vdd-apc-supply correctly.

Did I miss anything?

Kind regards
Uffe

[1] https://lore.kernel.org/all/20230825112633.236607-9-ulf.hansson@linaro.org/
Stephan Gerhold Oct. 12, 2023, 6:43 p.m. UTC | #6
On Thu, Oct 12, 2023 at 01:33:34PM +0200, Ulf Hansson wrote:
> On Fri, 29 Sept 2023 at 19:01, Stephan Gerhold <stephan@gerhold.net> wrote:
> > On Fri, Sep 29, 2023 at 03:14:07PM +0200, Ulf Hansson wrote:
> > > On Wed, 13 Sept 2023 at 14:26, Stephan Gerhold <stephan@gerhold.net> wrote:
> > > > On Wed, Sep 13, 2023 at 12:56:16PM +0200, Ulf Hansson wrote:
> > > > > On Tue, 12 Sept 2023 at 11:40, Stephan Gerhold
> > > > > <stephan.gerhold@kernkonzept.com> wrote:
> > > > > > [...]
> > > > > > However, at the
> > > > > > moment nothing ever enables the virtual devices created in
> > > > > > qcom-cpufreq-nvmem for the cpufreq power domain scaling, so they are
> > > > > > permanently runtime-suspended.
> > > > > >
> > > > > > Fix this by enabling the devices after attaching them and use
> > > > > > dev_pm_syscore_device() to ensure the power domain also stays on when
> > > > > > going to suspend. Since it supplies the CPU we can never turn it off
> > > > > > from Linux. There are other mechanisms to turn it off when needed,
> > > > > > usually in the RPM firmware or the cpuidle path.
> > > > > >
> > > > > > Without this fix performance states votes are silently ignored, and the
> > > > > > CPU/CPR voltage is never adjusted. This has been broken since 5.14 but
> > > > > > for some reason no one noticed this on QCS404 so far.
> > > > > >
> > > > > > Cc: stable@vger.kernel.org
> > > > > > Fixes: 1cb8339ca225 ("cpufreq: qcom: Add support for qcs404 on nvmem driver")
> > > > > > Signed-off-by: Stephan Gerhold <stephan.gerhold@kernkonzept.com>
> > > > > > ---
> > > > > >  drivers/cpufreq/qcom-cpufreq-nvmem.c | 21 ++++++++++++++++++++-
> > > > > >  1 file changed, 20 insertions(+), 1 deletion(-)
> > > > > >
> > > > > > diff --git a/drivers/cpufreq/qcom-cpufreq-nvmem.c b/drivers/cpufreq/qcom-cpufreq-nvmem.c
> > > > > > index 84d7033e5efe..17d6ab14c909 100644
> > > > > > --- a/drivers/cpufreq/qcom-cpufreq-nvmem.c
> > > > > > +++ b/drivers/cpufreq/qcom-cpufreq-nvmem.c
> > > > > > @@ -25,6 +25,7 @@
> > > > > >  #include <linux/platform_device.h>
> > > > > >  #include <linux/pm_domain.h>
> > > > > >  #include <linux/pm_opp.h>
> > > > > > +#include <linux/pm_runtime.h>
> > > > > >  #include <linux/slab.h>
> > > > > >  #include <linux/soc/qcom/smem.h>
> > > > > >
> > > > > > @@ -280,6 +281,7 @@ static int qcom_cpufreq_probe(struct platform_device *pdev)
> > > > > >         }
> > > > > >
> > > > > >         for_each_possible_cpu(cpu) {
> > > > > > +               struct device **virt_devs = NULL;
> > > > > >                 struct dev_pm_opp_config config = {
> > > > > >                         .supported_hw = NULL,
> > > > > >                 };
> > > > > > @@ -300,7 +302,7 @@ static int qcom_cpufreq_probe(struct platform_device *pdev)
> > > > > >
> > > > > >                 if (drv->data->genpd_names) {
> > > > > >                         config.genpd_names = drv->data->genpd_names;
> > > > > > -                       config.virt_devs = NULL;
> > > > > > +                       config.virt_devs = &virt_devs;
> > > > > >                 }
> > > > > >
> > > > > >                 if (config.supported_hw || config.genpd_names) {
> > > > > > @@ -311,6 +313,23 @@ static int qcom_cpufreq_probe(struct platform_device *pdev)
> > > > > >                                 goto free_opp;
> > > > > >                         }
> > > > > >                 }
> > > > > > +
> > > > > > +               if (virt_devs) {
> > > > > > +                       const char * const *name = config.genpd_names;
> > > > > > +                       int i;
> > > > > > +
> > > > > > +                       for (i = 0; *name; i++, name++) {
> > > > > > +                               ret = pm_runtime_resume_and_get(virt_devs[i]);
> > > > > > +                               if (ret) {
> > > > > > +                                       dev_err(cpu_dev, "failed to resume %s: %d\n",
> > > > > > +                                               *name, ret);
> > > > > > +                                       goto free_opp;
> > > > > > +                               }
> > > > >
> > > > > Shouldn't we restore the usage count at ->remove() too?
> > > > >
> > > > > > +
> > > > > > +                               /* Keep CPU power domain always-on */
> > > > > > +                               dev_pm_syscore_device(virt_devs[i], true);
> > > > >
> > > > > Is this really correct? cpufreq is suspended/resumed by the PM core
> > > > > during system wide suspend/resume. See dpm_suspend|resume(). Isn't
> > > > > that sufficient?
> > > > >
> > > > > Moreover, it looks like the cpr genpd provider supports genpd's
> > > > > ->power_on|off() callbacks. Is there something wrong with this, that I
> > > > > am missing?
> > > > >
> > > >
> > > > I think this question is a quite fundamental one. To explain this
> > > > properly I will need to delve a bit into the implementation details of
> > > > the two different GENPD providers that are applicable here:
> > > >
> > > > Fundamentally, we are describing the main power supply for the CPU here.
> > > > Consider a simple regulator with adjustable voltage. From the Linux
> > > > point of view this regulator should be marked as "regulator-always-on".
> > > > If we would turn off this regulator, the CPU would be immediately dead
> > > > without proper shutdown done by firmware or hardware.
> > > >
> > > > Representing the regulator as power domain does not change much, except
> > > > that we now have abstract "performance states" instead of actual voltages.
> > > > However, for power domains there is currently no generic mechanism like
> > > > "regulator-always-on" in the DT, only drivers can specify
> > > > GENPD_FLAG_ALWAYS_ON.
> > >
> > > We have relied on genpd providers to act on their compatible strings
> > > to make the correct configuration. If that isn't sufficient, I don't
> > > see why we couldn't add a new DT property corresponding to
> > > GENPD_FLAG_ALWAYS_ON.
> > >
> >
> > Right. It's not completely trivial though, since a DT node may provide
> > many different power domains with #power-domain-cells = <N>. A regulator
> > on the other hand has a dedicated DT node where you can just add
> > "regulator-always-on". :')
> 
> Sure, it can get a bit messy, but we will work it out if we have too.
> 
> Perhaps looking for a specific compatible string for the cpr can work
> instead? No?
> 

It's easy for CPR, but more complicated for RPMPD because it manages
multiple power domains from a single DT node. In general, only the ones
used by the CPU need to be always-on (see explanation at the end of the
mail).

> >
> > > >
> > > > The special situation on MSM8909 is that there are two possible setups
> > > > for the CPU power supply depending on the PMIC that is used (see
> > > > "[PATCH 4/4] cpufreq: qcom-nvmem: Add MSM8909"): CPR or RPMPD. Both are
> > > > GENPD providers so in theory we can just have either
> > > >
> > > >   cpu@0 { power-domains = <&cpr>; }; // or
> > > >   cpu@0 { power-domains = <&rpmpd MSM8909_VDDCX_AO>; };
> > > >
> > > > in the DT, without handling this specifically on the cpufreq side.
> > >
> > > Looks like it would be nice to get a patch for the MSM8909 DTS too, as
> > > part of the series, to get a better picture of how this is going to be
> > > used. Would that be possible for you to provide?
> > >
> >
> > Sure! Right now I cannot include it as a working patch in this series
> > since I don't have the base SoC DT (msm8909.dtsi) upstream yet. It's
> > mostly a copy-paste of msm8916.dtsi so I was trying to finish up the
> > SoC-specific parts before sending it.
> >
> > I'm happy to provide links to the full DT and my changes though. Does
> > that help? If you would like to comment inline I could copy paste the
> > diffs in a mail or include some kind of RFC patch. It just wouldn't be
> > possible to apply it successfully. :')
> >
> > Here are the two commits with my current DT changes (WIP):
> >   - MSM8909+PM8909 (RPMPD only):
> >     https://github.com/msm8916-mainline/linux/commit/791e0c5a3162372a0738bc7b0f4a5e87247923db
> 
> Okay, so this looks pretty straightforward. One thing though, it
> looks like we need to update the DT bindings for cpus.
> 
> I recently updated Documentation/devicetree/bindings/arm/cpus.yaml
> [1], to let "perf" be the common "power-domain-name" for a CPU's SCMI
> performance-domain. It looks like we should extend the description to
> allow "perf" to be used for all types of performance domains.
> 

"perf" sounds fine for a single power domain, I just used "apc" here for
consistency with the MSM8916 changes (which scales this power domain and
several others, as you saw).

(BTW, I would appreciate such a generic name for the cpuidle case as
 well, so "idle" instead of "psci" vs "sbi". I have another WIP cpuidle
 driver and didn't want to invent another name there...)

> >   - MSM8916 (CPR+RPMPD):
> >     https://github.com/msm8916-mainline/linux/commit/8880f39108206d7a60a0a8351c0373bddf58657c
> 
> This looks a bit odd to me. Does a CPU really have four different
> power-domains, where three of them are performance-domains?
> 

Good question. I think we're largely entering "uncharted territory" with
> these questions; I can just try to answer them as best I can from the
limited documentation and knowledge I have. :)

The CPU does indeed use four different power domains. There also seem to
be additional power switches that gate power for some components without
having to turn off the entire supply.

I'll list them twice from two points of view: Once mapping component ->
power domain, then again showing each power domain separately to make it
more clear. At the end I also want to make clear that MSM8909 (with the
"single" power domain) is actually exactly the same SoC design, just
with different regulators supplying the power domains.

It's totally fine if you just skim over it. I'm listing it in detail
also as reference for myself. :D

# Components
 - SoC
   - CPU subsystem ("APPS")
     - CPU cluster
       - 4x CPU core (logic and L1 cache) -> VDD_APC
       - Shared L2 cache
         - Logic -> VDD_APC
         - Memory -> VDD_MX
     - CPU clock controller (logic) -> VDD_CX
       - Provides CPU frequency from different clock sources
       - L2 cache runs at 1/2 of CPU frequency
       => Both VDD_APC and VDD_MX must be scaled based on frequency
     - CPU PLL clock source
       - Generates the higher (GHz) CPU frequencies
       - Logic (?, unsure) -> VDD_CX
       - ??? -> VDD_SR2_APPS_PLL
       => VDD_CX must be scaled based on PLL frequency

# Power Domains
## VDD_APC
 - dedicated for CPU
 - powered off completely in deepest cluster cpuidle state

 - per-core power switch (per-core cpuidle)
   - CPU logic
   - L1 cache controller/logic and maybe memory(?, unsure)
 - shared L2 cache controller/logic

 => must be scaled based on CPU frequency

## VDD_MX
 - global SoC power domain for "on-chip memories"
 - always on, reduced to minimal voltage when entire SoC is idle

 - power switch (controlled by deepest cluster cpuidle state?, unsure)
   - L2 cache memory

 => must be scaled based on L2 frequency (=> 1/2 CPU frequency)

## VDD_CX
 - global SoC power domain for "digital logic"
 - always on, reduced to minimal voltage when entire SoC is idle
 - voting for VDD_CX in the RPM firmware also affects VDD_MX performance
   state (firmware implicitly sets VDD_MX >= VDD_CX)

 - CPU clock controller logic, CPU PLL logic(?, unsure)

 => must be scaled based on CPU PLL frequency

## VDD_SR2_APPS_PLL
 - global SoC power domain for CPU clock PLLs
 - on MSM8916: always on with constant voltage

 => ignored in Linux at the moment

# Power Domain Regulators
These power domains are literally input pins on the SoC chip. In theory
one could connect any suitable regulator to each of those. In practice
there are just a couple of standard reference designs that everyone
uses:

## MSM8916 (SoC) + PM8916 (PMIC)
We need to scale 3 power domains together with cpufreq:

 - VDD_APC (CPU logic) = &pm8916_spmi_s2 (via CPR)
 - VDD_MX  (L2 memory) = &pm8916_l3 (via RPMPD: MSM8916_VDDMX)
 - VDD_CX  (CPU PLL)   = &pm8916_s1 (via RPMPD: MSM8916_VDDCX)

## MSM8909 (SoC) + PM8909 (PMIC)
We need to scale 1 power domain together with cpufreq:

 - VDD_APC = VDD_CX    = &pm8909_s1 (via RPMPD: MSM8909_VDDCX)
   (CPU logic, L2 logic and CPU PLL)
(- VDD_MX  (L2 memory) = &pm8909_l3 (RPM firmware enforces VDD_MX >= VDD_CX))

There is implicit magic in the RPM firmware here that saves us from
scaling VDD_MX. VDD_CX/APC are the same power rail.

## MSM8909 (SoC) + PM8916 (PMIC)
When MSM8909 is paired with PM8916 instead of PM8909, the setup is
identical to MSM8916+PM8916. We need to scale 3 power domains.

> In a way it sounds like an option could be to hook up the cpr to the
> rpmpd:s instead (possibly even set it as a child-domain to the
> rpmpd:s), assuming that is a better description of the HW, which it
> may not be, of course.

Hm. It's definitely an option. I must admit I haven't really looked
much at child-domains so far, so spontaneously I'm not sure about
the implications, for both the abstract hardware description and
the implementation.

There seems to be indeed some kind of relation between MX <=> CX/APC:

 - When voting for CX in the RPM firmware, it will always implicitly
   adjust the MX performance state to be MX >= CX.

 - When scaling APC up, we must increase MX before APC.
 - When scaling APC down, we must decrease MX after APC.
 => Clearly MX >= APC. Not in terms of raw voltage, but at least for the
    abstract performance state.

Is this some kind of parent-child relationship between MX <=> CX and
MX <=> APC?

If yes, maybe we could indeed bind MX to the CPR genpd somehow. They use
different performance state numbering, so we need some kind of
translation. I'm not entirely sure how that would be described.

Scaling VDD_CX for the PLL is more complicated. APC <=> CX feel more
like siblings, so I don't think it makes sense to vote for CX as part of
the CPR genpd. Spontaneously I would argue scaling CX belongs into the
CPU PLL driver (since that's what the vote is for). However, for some
reason it was decided to handle such votes on the consumer side (here =
cpufreq) on mainline [1].

[1]: https://lore.kernel.org/linux-arm-msm/20200910162610.GA7008@gerhold.net/

> When it comes to the regulator, vdd-apc-supply, it seems fine to me to
> set it as an always-on regulator. Maybe another option could simply be
> to leave it enabled when the cpr driver has probed.
> 

Agreed.

> >   (- QCS404 (CPR only): already in mainline (see qcs404.dtsi))
> >
> 
> Okay, so in this case it's solely the cpr that manages the performance
> scaling for the CPU.
> 

I'm not sure but I suspect there are also more power domains involved
here, just hidden behind other implicit magic that we don't need to
control ourselves.

> In regards to the vdd-apc-supply, it seems to be used in the similar
> way in the case above.
> 

Yep.

> > > >
> > > > The two GENPD providers behave quite differently though:
> > > >
> > > >  - CPR: CPR is not really a power domain itself. It's more like a monitor
> > > >    on a power supply line coming from some other regulator. CPR provides
> > > >    suggestions how to adjust the voltage for best power/stability.
> > > >
> > > >    The GENPD .power_off() disables the CPR state machine and forwards
> > > >    this to the regulator with regulator_disable(). On QCS404 the
> > > >    regulator is marked regulator-always-on, so it will never be disabled
> > > >    from Linux. The SAW/SPM hardware component on Qualcomm SoCs will
> > > >    usually disable the regulator during deep cpuidle states.
> > >
> > > Parts of this sound a bit odd to me. The CPR/CPUfreq shouldn't really
> > > need to vote for the CPU's power-rail(s) from a powered-on/off (CPU
> > > idle states) point of view, but only from a performance (voltage
> > > level) point of view.
> > >
> > > If the enable/disable voting on the regulator really has an impact on
> > > some platforms, it sounds like it could prevent deeper CPU idle states
> > > too. That's probably not what we want, right?
> > >
> >
> > I think this heavily depends on what exactly this "regulator"
> > represents. Are we talking about a physical regulator with a binary
> > enable/disable signal or actually some hardware/firmware magic that
> > combines multiple independent "votes"?
> >
> > If we are talking about a physical regulator then we can never disable
> > it from Linux. Not even during CPU idle states. It would just cut off
> > all power immediately and kill the CPU without proper shutdown. Instead,
> > the platform might have special hardware/firmware functionality that
> > will control the actual physical enable/disable signal of the regulator.
> >
> > > I also had a look at the existing CPR genpd provider's probe
> > > function/path (cpr_probe()) - and it turns out there is no call to
> > > regulator_enable(). Whatever that means to us.
> >
> > In most (all?) setups the CPR genpd provider will manage the actual
> > physical regulator. It could be part of the PMIC or even some
> > off-the-shelf regulator with I2C control. It doesn't matter. There is
> > nothing special about that regulator. You have the standard Linux
> > regulator driver, set up the DT node for it and hook it up to CPR.
> >
> > Now, to prevent the regulator driver in Linux from touching the physical
> > enable signal (see above) we add "regulator-always-on". When Linux
> > requests deep CPU idle states via PSCI the hardware will toggle the
> > physical enable/disable signal of the regulator for us (after the CPU
> > has been shut down).
> >
> > On some platforms CPR is also used for the GPU or other power rails that
> > are not critical for the CPU to run. In that case it's fine to disable
> > the regulator directly from Linux. Just not for the CPU.
> 
> Right. I get the point, thanks for clarifying!
> 
> Still, the CPR can't just disable the regulator for a GPU without
> using some kind of synchronization point for when to do it. The GPU
> may be running some use cases, etc. Although, let's leave that out of
> this discussion. :-)
> 

(Here I assumed that the Linux GPU driver (running on the CPU) is in
 full control of the GPU. So it explicitly turns the GPU power domain on
 when the GPU is needed and turns it off only when the GPU is idle.)

> >
> > >
> > > >
> > > >  - RPMPD: This is the generic driver for all the SoC power domains
> > > >    managed by the RPM firmware. It's not CPU-specific. However, as
> > > >    special feature each power domain is exposed twice in Linux, e.g.
> > > >    "MSM8909_VDDCX" and "MSM8909_VDDCX_AO". The _AO ("active-only")
> > > >    variant tells the RPM firmware that the performance/enable vote only
> > > >    applies when the CPU is active (not in deep cpuidle state).
> > > >
> > > >    The GENPD .power_off() drops all performance state votes and also
> > > >    releases the "enable" vote for the power domain.
> > > >
> > > > Now, imagine what happens during system wide suspend/resume:
> > > >
> > > >  - CPR: The CPR state machine gets disabled. The voltage stays as-is.
> > > >      - With "regulator-always-on": The CPU keeps running until WFI.
> > > >      - Without: I would expect the CPU is dead immediately(?)
> > >
> > > As I indicated above, I am starting to feel that this is a bit upside
> > > down. CPR/CPUfreq should vote on voltages to scale performance, but
> > > not for cpu idle states.
> > >
> > > Perhaps what is missing is a synchronization point or a notification,
> > > to inform the CPR driver that its state machine (registers) needs to
> > > be saved/restored, when the power-rails get turned on/off. In fact, we
> > > have a couple mechanisms at hand to support this.
> > >
> >
> > I think we can ignore this part of CPR for now. AFAICT Qualcomm's vendor
> > driver does not explicitly disable the CPR state machine during CPU idle
> > when the power rails are potentially turned off. They only do it during
> > system wide suspend, for whatever reason. For that we don't need such a
> > notification mechanism.
> 
> I see.
> 
> So, if I understand correctly, we could potentially use the regular
> system suspend/resume callbacks for the CPR genpd provider driver,
> rather than its genpd->power_on|off() callbacks?
> 

Exactly. At least that's what Qualcomm seems to do...
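
For the cpr driver that could probably be plain dev_pm_ops instead of
the genpd callbacks. Untested sketch, reusing the driver's existing
helpers (cpr_ctl_disable()/cpr_ctl_enable()) and skipping the locking:

static int cpr_pm_suspend(struct device *dev)
{
	struct cpr_drv *drv = dev_get_drvdata(dev);

	/* Quiesce the CPR loop; the rail itself stays under HW control */
	cpr_ctl_disable(drv);
	cpr_irq_clr(drv);
	return 0;
}

static int cpr_pm_resume(struct device *dev)
{
	struct cpr_drv *drv = dev_get_drvdata(dev);

	/* Re-arm the state machine for the currently programmed corner */
	cpr_ctl_enable(drv, drv->corner);
	return 0;
}

static DEFINE_SIMPLE_DEV_PM_OPS(cpr_pm_ops, cpr_pm_suspend, cpr_pm_resume);

plus pointing the platform driver's .driver.pm at it.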

> >
> > > >
> > > >  - RPMPD: The performance/enable vote is dropped. The power domain might
> > > >    go to minimal voltage or even turn off completely. However, the CPU
> > > >    actually needs to keep running at the same frequency until WFI!
> > > >    Worst case, the CPU is dead immediately when the power domain votes
> > > >    get dropped.
> > >
> > > Since RPMPD is managing the voting for both performance and low power
> > > states for different kinds of devices, this certainly gets a bit more
> > > complicated.
> > >
> > > On the other hand, the CPUfreq driver should really only vote for
> > > performance states for the CPUs and not for low power states. The
> > > latter is a job for cpuidle and other consumers of the RPMPD to
> > > manage, I think.
> > >
> > > So, while thinking of this, I just realized that it may not always be
> > > a good idea for genpd to cache a performance state request for an
> > > attached device for which pm_runtime_suspended() returns true.
> > > As this is the default behaviour in genpd, I am thinking that we
> > > need a way to make that behaviour configurable for an attached device.
> > > What do you think about that?
> > >
> >
> > Hm. This would be a bit of a special case of course. But I think this
> > would be fine to solve the regression for CPR on QCS404.
> 
> Okay, I will try to propose and submit something for this shortly. I
> will keep you cc:ed.
> 

Thanks a lot!

> >
> > Then we "just" need to solve the fundamental question from a few years
> > ago: Who *will* actually vote for enabling the power domains/regulators
> > required by the CPU? :D
> >
> > I agree that enabling/disabling power supplies feels closer to cpuidle.
> > But it's not a perfect fit either, given that we don't actually want to
> > change our vote while entering CPU idle states. I think on all platforms
> > I'm looking at here we need a permanent enable vote (effectively making
> > the regulator/power domains always-on from the Linux point of view).
> >
> > We could solve this by adding a "regulator-always-on" mechanism in the
> > DT for power domains. This feels more like a workaround to me than an
> > actual solution.
> 
> From the discussions above, it sounded like it would be sufficient to
> use the regulator-always-on for the actual regulator supply.
> 
> In the case where there is no cpr being used on the platform, there
> also is no regulator that needs to stay enabled, right?
> 

Yes and no. There is no regulator we need to keep enabled. But we need
to keep the CPU-related RPMPDs always-on too.

> > With this the CPU won't appear as an always-on consumer of
> > the power domains in debugfs. There will just be a "suspended" consumer
> > attributed to the CPU (from CPUfreq, since we don't have a dedicated
> > device for CPUfreq).
> 
> I didn't quite get this part.
> 
> The devices that we hook up to the genpd from cpufreq are used for
> performance scaling, not for power-on/off things. It shouldn't matter
> if these devices are "suspended" from debugfs/sysfs point of view,
> right?
> 
> Or did I fail to understand your point?
> 

My point here was: If we only set GENPD_FLAG_ALWAYS_ON for the RPMPDs
needed by the CPU, then it won't be obvious from debugfs that it's the
CPU that is keeping the power domains always-on. It's not a big problem.

> >
> > While this patch is a bit strange from a conceptual perspective, on the
> > implementation side it effectively makes that CPU consumer appear as
> > active. So the end result is actually kind of the one we need. :'D
> 
> Right. It looks like we are concluding on the way forward. :-)
> 
> *) The approach you have taken in the $subject patch with the call to
> pm_runtime_resume_and_get() works as a fix for QCS404, as there is
> only the CPR to attach to. The problem with it is that it doesn't
> work for cases where the RPMPD is used for performance scaling, either
> separately or in combination with the CPR. It would keep the rpmpd:s
> powered-on, which would be wrong. In regards to the
> dev_pm_syscore_device() thingy, this should not be needed, as long as
> we keep the vdd-apc-supply enabled, right?
> 
> **) A more generic solution, that would work for all cases (even
> when/if hooking up the CPR to the rpmpd:s), consists of tweaking genpd
> to avoid "caching" performance states for these kinds of devices. And
> again, I don't see that we need dev_pm_syscore_device(), assuming we
> manage the vdd-apc-supply correctly.
> 
> Did I miss anything?
> 

We do need to keep the CPU-related RPMPDs always-on too.

Keeping the CPU-related RPMPDs always-on is a bit counter-intuitive, but
it's because of this:

> > > >  - RPMPD: This is the generic driver for all the SoC power domains
> > > >    managed by the RPM firmware. It's not CPU-specific. However, as
> > > >    special feature each power domain is exposed twice in Linux, e.g.
> > > >    "MSM8909_VDDCX" and "MSM8909_VDDCX_AO". The _AO ("active-only")
> > > >    variant tells the RPM firmware that the performance/enable vote only
> > > >    applies when the CPU is active (not in deep cpuidle state).

The CPU only uses the "_AO"/active-only variants in RPMPD. Keeping these
always-on effectively means "keep the power domain on as long as the CPU
is active".
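
In the rpmpd driver this could boil down to something like the
following (untested sketch, keying off the driver's existing
active_only flag; whether *all* _AO domains should really get this
treatment is of course part of the open question):

	for (i = 0; i < num; i++) {
		struct rpmpd *pd = rpmpds[i];

		if (pd->active_only)
			pd->pd.flags |= GENPD_FLAG_ALWAYS_ON;

		/* always-on domains must be initialized as powered on */
		pm_genpd_init(&pd->pd, NULL, !pd->active_only);
	}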

I hope that clears up some of the confusion. :)

Thanks a lot for taking the time to discuss this!

Stephan
Ulf Hansson Oct. 16, 2023, 2:47 p.m. UTC | #7
[...]

> > >
> > > Here are the two commits with my current DT changes (WIP):
> > >   - MSM8909+PM8909 (RPMPD only):
> > >     https://github.com/msm8916-mainline/linux/commit/791e0c5a3162372a0738bc7b0f4a5e87247923db
> >
> > Okay, so this looks pretty straightforward. One thing though, it
> > looks like we need to update the DT bindings for cpus.
> >
> > I recently updated Documentation/devicetree/bindings/arm/cpus.yaml
> > [1], to let "perf" be the common "power-domain-name" for a CPU's SCMI
> > performance-domain. I look like we should extend the description to
> > allow "perf" to be used for all types of performance domains.
> >
>
> "perf" sounds fine for a single power domain, I just used "apc" here for
> consistency with the MSM8916 changes (which scales this power domain and
> several others, as you saw).
>
> (BTW, I would appreciate such a generic name for the cpuidle case as
>  well, so "idle" instead of "psci" vs "sbi". I have another WIP cpuidle
>  driver and didn't want to invent another name there...)

Whether it's "idle" or "power" or something else, we should certainly
avoid a provider-specific (psci) name, as has been pointed out earlier
by Rob too.

I will try to get some time to update the DT docs as soon as I can.
Unless you get to it first, feel free to do it.

>
> > >   - MSM8916 (CPR+RPMPD):
> > >     https://github.com/msm8916-mainline/linux/commit/8880f39108206d7a60a0a8351c0373bddf58657c
> >
> > This looks a bit odd to me. Does a CPU really have four different
> > power-domains, where three of them are performance-domains?
> >
>
> Good question. I think we're largely entering "uncharted territory" with
> these questions; I can just try to answer them as best I can from the
> limited documentation and knowledge I have. :)
>
> The CPU does indeed use four different power domains. There also seem to
> be additional power switches that gate power for some components without
> having to turn off the entire supply.
>
> I'll list them twice from two points of view: Once mapping component ->
> power domain, then again showing each power domain separately to make it
> more clear. At the end I also want to make clear that MSM8909 (with the
> "single" power domain) is actually exactly the same SoC design, just
> with different regulators supplying the power domains.
>
> It's totally fine if you just skim over it. I'm listing it in detail
> also as reference for myself. :D
>
> # Components
>  - SoC
>    - CPU subsystem ("APPS")
>      - CPU cluster
>        - 4x CPU core (logic and L1 cache) -> VDD_APC
>        - Shared L2 cache
>          - Logic -> VDD_APC
>          - Memory -> VDD_MX
>      - CPU clock controller (logic) -> VDD_CX
>        - Provides CPU frequency from different clock sources
>        - L2 cache runs at 1/2 of CPU frequency
>        => Both VDD_APC and VDD_MX must be scaled based on frequency
>      - CPU PLL clock source
>        - Generates the higher (GHz) CPU frequencies
>        - Logic (?, unsure) -> VDD_CX
>        - ??? -> VDD_SR2_APPS_PLL
>        => VDD_CX must be scaled based on PLL frequency
>
> # Power Domains
> ## VDD_APC
>  - dedicated for CPU
>  - powered off completely in deepest cluster cpuidle state
>
>  - per-core power switch (per-core cpuidle)
>    - CPU logic
>    - L1 cache controller/logic and maybe memory(?, unsure)
>  - shared L2 cache controller/logic
>
>  => must be scaled based on CPU frequency
>
> ## VDD_MX
>  - global SoC power domain for "on-chip memories"
>  - always on, reduced to minimal voltage when entire SoC is idle
>
>  - power switch (controlled by deepest cluster cpuidle state?, unsure)
>    - L2 cache memory
>
>  => must be scaled based on L2 frequency (=> 1/2 CPU frequency)
>
> ## VDD_CX
>  - global SoC power domain for "digital logic"
>  - always on, reduced to minimal voltage when entire SoC is idle
>  - voting for VDD_CX in the RPM firmware also affects VDD_MX performance
>    state (firmware implicitly sets VDD_MX >= VDD_CX)
>
>  - CPU clock controller logic, CPU PLL logic(?, unsure)
>
>  => must be scaled based on CPU PLL frequency
>
> ## VDD_SR2_APPS_PLL
>  - global SoC power domain for CPU clock PLLs
>  - on MSM8916: always on with constant voltage
>
>  => ignored in Linux at the moment
>
> # Power Domain Regulators
> These power domains are literally input pins on the SoC chip. In theory
> one could connect any suitable regulator to each of those. In practice
> there are just a couple of standard reference designs that everyone
> uses:
>
> ## MSM8916 (SoC) + PM8916 (PMIC)
> We need to scale 3 power domains together with cpufreq:
>
>  - VDD_APC (CPU logic) = &pm8916_spmi_s2 (via CPR)
>  - VDD_MX  (L2 memory) = &pm8916_l3 (via RPMPD: MSM8916_VDDMX)
>  - VDD_CX  (CPU PLL)   = &pm8916_s1 (via RPMPD: MSM8916_VDDCX)
>
> ## MSM8909 (SoC) + PM8909 (PMIC)
> We need to scale 1 power domain together with cpufreq:
>
>  - VDD_APC = VDD_CX    = &pm8909_s1 (via RPMPD: MSM8909_VDDCX)
>    (CPU logic, L2 logic and CPU PLL)
> (- VDD_MX  (L2 memory) = &pm8909_l3 (RPM firmware enforces VDD_MX >= VDD_CX))
>
> There is implicit magic in the RPM firmware here that saves us from
> scaling VDD_MX. VDD_CX/APC are the same power rail.
>
> ## MSM8909 (SoC) + PM8916 (PMIC)
> When MSM8909 is paired with PM8916 instead of PM8909, the setup is
> identical to MSM8916+PM8916. We need to scale 3 power domains.
>
> > In a way it sounds like an option could be to hook up the cpr to the
> > rpmpd:s instead (possibly even set it as a child-domain to the
> > rpmpd:s), assuming that is a better description of the HW, which it
> > may not be, of course.
>
> Hm. It's definitely an option. I must admit I haven't really looked
> much at child-domains so far, so spontaneously I'm not sure about
> the implications, for both the abstract hardware description and
> the implementation.
>
> There seems to be indeed some kind of relation between MX <=> CX/APC:
>
>  - When voting for CX in the RPM firmware, it will always implicitly
>    adjust the MX performance state to be MX >= CX.
>
>  - When scaling APC up, we must increase MX before APC.
>  - When scaling APC down, we must decrease MX after APC.
>  => Clearly MX >= APC. Not in terms of raw voltage, but at least for the
>     abstract performance state.
>
> Is this some kind of parent-child relationship between MX <=> CX and
> MX <=> APC?

Thanks for sharing the above. Yes, to me, it looks like there is a
parent/child-domain relationship that could be worth describing/using.

>
> If yes, maybe we could indeed bind MX to the CPR genpd somehow. They use
> different performance state numbering, so we need some kind of
> translation. I'm not entirely sure how that would be described.

Both the power-domain and the required-opps DT bindings
(Documentation/devicetree/bindings/opp/opp-v2-base.yaml) already
allow us to describe these kinds of hierarchical
dependencies/layouts.

In other words, to scale performance for a child domain, the child may
rely on that we scale performance for the parent domain too. This is
already supported by genpd and through the opp library - so it should
just work. :-)

>
> Scaling VDD_CX for the PLL is more complicated. APC <=> CX feel more
> like siblings, so I don't think it makes sense to vote for CX as part of
> the CPR genpd. Spontaneously I would argue scaling CX belongs into the
> CPU PLL driver (since that's what the vote is for). However, for some
> reason it was decided to handle such votes on the consumer side (here =
> cpufreq) on mainline [1].
>
> [1]: https://lore.kernel.org/linux-arm-msm/20200910162610.GA7008@gerhold.net/

Right. I assume the above works just fine, so probably best to stick
to that for this case too.

[...]

> >
> > *) The approach you have taken in the $subject patch with the call to
> > pm_runtime_resume_and_get() works as a fix for QCS404, as there is
> > only the CPR to attach to. The problem with it is that it doesn't
> > work for cases where the RPMPD is used for performance scaling, either
> > separately or in combination with the CPR. It would keep the rpmpd:s
> > powered-on, which would be wrong. In regards to the
> > dev_pm_syscore_device() thingy, this should not be needed, as long as
> > we keep the vdd-apc-supply enabled, right?
> >
> > **) A more generic solution, that would work for all cases (even
> > when/if hooking up the CPR to the rpmpd:s), consists of tweaking genpd
> > to avoid "caching" performance states for these kinds of devices. And
> > again, I don't see that we need dev_pm_syscore_device(), assuming we
> > manage the vdd-apc-supply correctly.
> >
> > Did I miss anything?
> >
>
> We do need to keep the CPU-related RPMPDs always-on too.
>
> Keeping the CPU-related RPMPDs always-on is a bit counter-intuitive, but
> it's because of this:
>
> > > > >  - RPMPD: This is the generic driver for all the SoC power domains
> > > > >    managed by the RPM firmware. It's not CPU-specific. However, as
> > > > >    special feature each power domain is exposed twice in Linux, e.g.
> > > > >    "MSM8909_VDDCX" and "MSM8909_VDDCX_AO". The _AO ("active-only")
> > > > >    variant tells the RPM firmware that the performance/enable vote only
> > > > >    applies when the CPU is active (not in deep cpuidle state).
>
> The CPU only uses the "_AO"/active-only variants in RPMPD. Keeping these
> always-on effectively means "keep the power domain on as long as the CPU
> is active".
>
> I hope that clears up some of the confusion. :)

Yes, it does, thanks! Is the below the correct conclusion for how we
could move forward then?

*) The pm_runtime_resume_and_get() works for QCS404 as a fix. It also
works fine when there is only one RPMPD that manages the performance
scaling.

**) In cases where we have multiple PM domains to scale performance
for, using pm_runtime_resume_and_get() would work fine too. Possibly
we want to use device_link_add() to set up suppliers, to avoid calling
pm_runtime_resume_and_get() for each and every device.

***) Due to the above, we don't need a new mechanism to avoid
"caching" performance states for genpd. At least for the time being.

Kind regards
Uffe
Stephan Gerhold Oct. 17, 2023, 8:54 p.m. UTC | #8
On Mon, Oct 16, 2023 at 04:47:52PM +0200, Ulf Hansson wrote:
> [...]
> > > >
> > > > Here are the two commits with my current DT changes (WIP):
> > > >   - MSM8909+PM8909 (RPMPD only):
> > > >     https://github.com/msm8916-mainline/linux/commit/791e0c5a3162372a0738bc7b0f4a5e87247923db
> > >
> > > Okay, so this looks pretty straightforward. One thing though, it
> > > looks like we need to update the DT bindings for cpus.
> > >
> > > I recently updated Documentation/devicetree/bindings/arm/cpus.yaml
> > > [1], to let "perf" be the common "power-domain-name" for a CPU's SCMI
> > > performance-domain. I look like we should extend the description to
> > > allow "perf" to be used for all types of performance domains.
> > >
> >
> > "perf" sounds fine for a single power domain, I just used "apc" here for
> > consistency with the MSM8916 changes (which scales this power domain and
> > several others, as you saw).
> >
> > (BTW, I would appreciate such a generic name for the cpuidle case as
> >  well, so "idle" instead of "psci" vs "sbi". I have another WIP cpuidle
> >  driver and didn't want to invent another name there...)
> 
> Whether it's "idle" or "power" or something else, we should certainly
> avoid a provider-specific (psci) name, as has been pointed out earlier
> by Rob too.
> 
> I will try to get some time to update the DT docs as soon as I can.
> Unless you get to it first, feel free to do it.
> 

Thanks! I'm not sure either when I will have time to get back to the
cpuidle driver, so let's just see who finds time first. :D

> > [MSM8916 setup with multiple power domains...]
> > There seems to be indeed some kind of relation between MX <=> CX/APC:
> >
> >  - When voting for CX in the RPM firmware, it will always implicitly
> >    adjust the MX performance state to be MX >= CX.
> >
> >  - When scaling APC up, we must increase MX before APC.
> >  - When scaling APC down, we must decrease MX after APC.
> >  => Clearly MX >= APC. Not in terms of raw voltage, but at least for the
> >     abstract performance state.
> >
> > Is this some kind of parent-child relationship between MX <=> CX and
> > MX <=> APC?
> 
> Thanks for sharing the above. Yes, to me, it looks like there is a
> parent/child-domain relationship that could be worth describing/using.
> 
> >
> > If yes, maybe we could indeed bind MX to the CPR genpd somehow. They use
> > different performance state numbering, so we need some kind of
> > translation. I'm not entirely sure how that would be described.
> 
> Both the power-domain and the required-opps DT bindings
> (Documentation/devicetree/bindings/opp/opp-v2-base.yaml) already
> allow us to describe these kinds of hierarchical
> dependencies/layouts.
> 
> In other words, to scale performance for a child domain, the child may
> rely on that we scale performance for the parent domain too. This is
> already supported by genpd and through the opp library - so it should
> just work. :-)
> 

Oh! I have looked at that code in the genpd core already a few times but
until now I never understood how it works. That's great, thanks!

I will test this and get back to you separately.

Seems like we reached a conclusion for enabling the power domains at
least, which will already help me a lot with MSM8909. :-)

> [...]
> 
> > >
> > > *) The approach you have taken in the $subject patch with the call to
> > > pm_runtime_resume_and_get() works as a fix for QCS404, as there is
> > > only the CPR to attach to. The problem with it is that it doesn't
> > > work for cases where the RPMPD is used for performance scaling, either
> > > separately or in combination with the CPR. It would keep the rpmpd:s
> > > powered-on, which would be wrong. In regards to the
> > > dev_pm_syscore_device() thingy, this should not be needed, as long as
> > > we keep the vdd-apc-supply enabled, right?
> > >
> > > **) A more generic solution, that would work for all cases (even
> > > when/if hooking up the CPR to the rpmpd:s), consists of tweaking genpd
> > > to avoid "caching" performance states for these kinds of devices. And
> > > again, I don't see that we need dev_pm_syscore_device(), assuming we
> > > manage the vdd-apc-supply correctly.
> > >
> > > Did I miss anything?
> > >
> >
> > We do need to keep the CPU-related RPMPDs always-on too.
> >
> > Keeping the CPU-related RPMPDs always-on is a bit counter-intuitive, but
> > it's because of this:
> >
> > > > > >  - RPMPD: This is the generic driver for all the SoC power domains
> > > > > >    managed by the RPM firmware. It's not CPU-specific. However, as
> > > > > >    special feature each power domain is exposed twice in Linux, e.g.
> > > > > >    "MSM8909_VDDCX" and "MSM8909_VDDCX_AO". The _AO ("active-only")
> > > > > >    variant tells the RPM firmware that the performance/enable vote only
> > > > > >    applies when the CPU is active (not in deep cpuidle state).
> >
> > The CPU only uses the "_AO"/active-only variants in RPMPD. Keeping these
> > always-on effectively means "keep the power domain on as long as the CPU
> > is active".
> >
> > I hope that clears up some of the confusion. :)
> 
> Yes, it does, thanks! Is the below the correct conclusion for how we
> could move forward then?
> 
> *) The pm_runtime_resume_and_get() works for QCS404 as a fix. It also
> works fine when there is only one RPMPD that manages the performance
> scaling.
> 

Agreed.

> **) In cases where we have multiple PM domains to scale performance
> for, using pm_runtime_resume_and_get() would work fine too. Possibly
> we want to use device_link_add() to set up suppliers, to avoid calling
> pm_runtime_resume_and_get() for each and every device.
> 

Hm. What would you use as "supplied" device? The CPU device I guess?

I'm looking again at my old patch from 2020 where I implemented this
with device links in the OPP core. Seems like you suggested this back
then too :)

  https://lore.kernel.org/linux-pm/20200826093328.88268-1-stephan@gerhold.net/

However, for the special case of the CPU I think we don't gain any code
simplification from using device links. There will just be a single
resume of each virtual genpd device, as well as one put during remove().
Exactly the same applies when using device links, we need to set up the
device links once for each virtual genpd device, and clean them up again
during remove().

Or can you think of another advantage of using device links?

> ***) Due to the above, we don't need a new mechanism to avoid
> "caching" performance states for genpd. At least for the time being.
> 

Right. Given *) and **) I'll prepare a v2 of $subject patch with the
remove() cleanup fixed and an improved commit description.

I'll wait for a bit in case you have more thoughts about the device
links.

Thanks!
Stephan
Ulf Hansson Oct. 17, 2023, 9:50 p.m. UTC | #9
[...]

> >
> > *) The pm_runtime_resume_and_get() works for QCS404 as a fix. It also
> > works fine when there is only one RPMPD that manages the performance
> > scaling.
> >
>
> Agreed.
>
> > **) In cases where we have multiple PM domains to scale performance
> > for, using pm_runtime_resume_and_get() would work fine too. Possibly
> > we want to use device_link_add() to set up suppliers, to avoid calling
> > pm_runtime_resume_and_get() for each and every device.
> >
>
> Hm. What would you use as "supplied" device? The CPU device I guess?

The consumer would be the device that is used to probe the cpufreq
driver and the supplier(s) the virtual devices returned from genpd
when attaching.
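
Something along these lines (just a sketch, with made-up names like
cpufreq_dev/virt_devs/num_pds):

	for (i = 0; i < num_pds; i++) {
		struct device_link *link;

		/* a resumed consumer keeps all suppliers runtime-resumed */
		link = device_link_add(cpufreq_dev, virt_devs[i],
				       DL_FLAG_PM_RUNTIME | DL_FLAG_RPM_ACTIVE);
		if (!link)
			return -ENODEV;
	}

DL_FLAG_AUTOREMOVE_CONSUMER could perhaps be added too, so the links go
away automatically when the consumer gets unbound.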

>
> I'm looking again at my old patch from 2020 where I implemented this
> with device links in the OPP core. Seems like you suggested this back
> then too :)
>
>   https://lore.kernel.org/linux-pm/20200826093328.88268-1-stephan@gerhold.net/
>
> However, for the special case of the CPU I think we don't gain any code
> simplification from using device links. There will just be a single
> resume of each virtual genpd device, as well as one put during remove().
> Exactly the same applies when using device links, we need to set up the
> device links once for each virtual genpd device, and clean them up again
> during remove().
>
> Or can you think of another advantage of using device links?

No, not at this point.

So, in this particular case it may not matter that much. But when the
number of PM domains starts to vary between platforms it could be a
nice way to abstract some logic. I guess starting without using
device-links and seeing how it evolves could be a way forward too.

>
> > ***) Due to the above, we don't need a new mechanism to avoid
> > "caching" performance states for genpd. At least for the time being.
> >
>
> Right. Given *) and **) I'll prepare a v2 of $subject patch with the
> remove() cleanup fixed and an improved commit description.
>
> I'll wait for a bit in case you have more thoughts about the device
> links.

One more thing though that crossed my mind. In the rpmpd case, is
there anything we need to care about during system suspend/resume that
isn't already taken care of correctly?

Kind regards
Uffe
Stephan Gerhold Oct. 18, 2023, 7:13 a.m. UTC | #10
On Tue, Oct 17, 2023 at 11:50:21PM +0200, Ulf Hansson wrote:
> [...]
> > >
> > > *) The pm_runtime_resume_and_get() works for QCS404 as a fix. It also
> > > works fine when there is only one RPMPD that manages the performance
> > > scaling.
> > >
> >
> > Agreed.
> >
> > > **) In cases where we have multiple PM domains to scale performance
> > > for, using pm_runtime_resume_and_get() would work fine too. Possibly
> > > we want to use device_link_add() to set up suppliers, to avoid calling
> > > pm_runtime_resume_and_get() for each and every device.
> > >
> >
> > Hm. What would you use as "supplied" device? The CPU device I guess?
> 
> The consumer would be the device that is used to probe the cpufreq
> driver and the supplier(s) the virtual devices returned from genpd
> when attaching.
> 
> >
> > I'm looking again at my old patch from 2020 where I implemented this
> > with device links in the OPP core. Seems like you suggested this back
> > then too :)
> >
> >   https://lore.kernel.org/linux-pm/20200826093328.88268-1-stephan@gerhold.net/
> >
> > However, for the special case of the CPU I think we don't gain any code
> > simplification from using device links. There will just be a single
> > resume of each virtual genpd device, as well as one put during remove().
> > Exactly the same applies when using device links, we need to set up the
> > device links once for each virtual genpd device, and clean them up again
> > during remove().
> >
> > Or can you think of another advantage of using device links?
> 
> No, not at this point.
> 
> So, in this particular case it may not matter that much. But when the
> number of PM domains starts to vary between platforms it could be a
> nice way to abstract some logic. I guess starting without using
> device-links and seeing how it evolves could be a way forward too.
> 

Sounds good :)

> >
> > > ***) Due to the above, we don't need a new mechanism to avoid
> > > "caching" performance states for genpd. At least for the time being.
> > >
> >
> > Right. Given *) and **) I'll prepare a v2 of $subject patch with the
> > remove() cleanup fixed and an improved commit description.
> >
> > I'll wait for a bit in case you have more thoughts about the device
> > links.
> 
> One more thing though that crossed my mind. In the rpmpd case, is
> there anything we need to care about during system suspend/resume that
> isn't already taken care of correctly?
> 

No, I don't think so. The RPM firmware makes no distinction between deep
cpuidle and system suspend. As long as we properly enter deep cpuidle as
part of s2idle, it will automatically release our votes for the
"active-only" (_AO) variant of the RPMPDs.

I'll send the adjusted v2 shortly for you to look at. :)

Thanks,
Stephan
Stephan Gerhold Nov. 28, 2023, 12:41 p.m. UTC | #11
Hi Uffe,

On Mon, Oct 16, 2023 at 04:47:52PM +0200, Ulf Hansson wrote:
> [...]
> > > >   - MSM8916 (CPR+RPMPD):
> > > >     https://github.com/msm8916-mainline/linux/commit/8880f39108206d7a60a0a8351c0373bddf58657c
> > >
> > > This looks a bit odd to me. Does a CPU really have four different
> > > power-domains, where three of them are performance-domains?
> > >
> >
> > Good question. I think we're largely entering "uncharted territory" with
> > these questions; I can just try to answer them as best I can from the
> > limited documentation and knowledge I have. :)
> >
> > The CPU does indeed use four different power domains. There also seem to
> > be additional power switches that gate power for some components without
> > having to turn off the entire supply.
> >
> > I'll list them twice from two points of view: Once mapping component ->
> > power domain, then again showing each power domain separately to make it
> > more clear. At the end I also want to make clear that MSM8909 (with the
> > "single" power domain) is actually exactly the same SoC design, just
> > with different regulators supplying the power domains.
> >
> > It's totally fine if you just skim over it. I'm listing it in detail
> > also as reference for myself. :D
> >
> > # Components
> >  - SoC
> >    - CPU subsystem ("APPS")
> >      - CPU cluster
> >        - 4x CPU core (logic and L1 cache) -> VDD_APC
> >        - Shared L2 cache
> >          - Logic -> VDD_APC
> >          - Memory -> VDD_MX
> >      - CPU clock controller (logic) -> VDD_CX
> >        - Provides CPU frequency from different clock sources
> >        - L2 cache runs at 1/2 of CPU frequency
> >        => Both VDD_APC and VDD_MX must be scaled based on frequency
> >      - CPU PLL clock source
> >        - Generates the higher (GHz) CPU frequencies
> >        - Logic (?, unsure) -> VDD_CX
> >        - ??? -> VDD_SR2_APPS_PLL
> >        => VDD_CX must be scaled based on PLL frequency
> >
> > # Power Domains
> > ## VDD_APC
> >  - dedicated for CPU
> >  - powered off completely in deepest cluster cpuidle state
> >
> >  - per-core power switch (per-core cpuidle)
> >    - CPU logic
> >    - L1 cache controller/logic and maybe memory(?, unsure)
> >  - shared L2 cache controller/logic
> >
> >  => must be scaled based on CPU frequency
> >
> > ## VDD_MX
> >  - global SoC power domain for "on-chip memories"
> >  - always on, reduced to minimal voltage when entire SoC is idle
> >
> >  - power switch (controlled by deepest cluster cpuidle state?, unsure)
> >    - L2 cache memory
> >
> >  => must be scaled based on L2 frequency (=> 1/2 CPU frequency)
> >
> > ## VDD_CX
> >  - global SoC power domain for "digital logic"
> >  - always on, reduced to minimal voltage when entire SoC is idle
> >  - voting for VDD_CX in the RPM firmware also affects VDD_MX performance
> >    state (firmware implicitly sets VDD_MX >= VDD_CX)
> >
> >  - CPU clock controller logic, CPU PLL logic(?, unsure)
> >
> >  => must be scaled based on CPU PLL frequency
> >
> > ## VDD_SR2_APPS_PLL
> >  - global SoC power domain for CPU clock PLLs
> >  - on MSM8916: always on with constant voltage
> >
> >  => ignored in Linux at the moment
> >
> > # Power Domain Regulators
> > These power domains are literally input pins on the SoC chip. In theory
> > one could connect any suitable regulator to each of those. In practice
> > there are just a couple of standard reference designs that everyone
> > uses:
> >
> > ## MSM8916 (SoC) + PM8916 (PMIC)
> > We need to scale 3 power domains together with cpufreq:
> >
> >  - VDD_APC (CPU logic) = &pm8916_spmi_s2 (via CPR)
> >  - VDD_MX  (L2 memory) = &pm8916_l3 (via RPMPD: MSM8916_VDDMX)
> >  - VDD_CX  (CPU PLL)   = &pm8916_s1 (via RPMPD: MSM8916_VDDCX)
> >
> > ## MSM8909 (SoC) + PM8909 (PMIC)
> > We need to scale 1 power domain together with cpufreq:
> >
> >  - VDD_APC = VDD_CX    = &pm8909_s1 (via RPMPD: MSM8909_VDDCX)
> >    (CPU logic, L2 logic and CPU PLL)
> > (- VDD_MX  (L2 memory) = &pm8909_l3 (RPM firmware enforces VDD_MX >= VDD_CX))
> >
> > There is implicit magic in the RPM firmware here that saves us from
> > scaling VDD_MX. VDD_CX/APC are the same power rail.
> >
> > ## MSM8909 (SoC) + PM8916 (PMIC)
> > When MSM8909 is paired with PM8916 instead of PM8909, the setup is
> > identical to MSM8916+PM8916. We need to scale 3 power domains.
> >
> > > In a way it sounds like an option could be to hook up the cpr to the
> > > rpmpd:s instead (possibly even set it as a child-domain to the
> > > rpmpd:s), assuming that is a better description of the HW, which it
> > > may not be, of course.
> >
> > Hm. It's definitely an option. I must admit I haven't really looked
> > much at child-domains so far, so spontaneously I'm not sure about
> > the implications, for both the abstract hardware description and
> > the implementation.
> >
> > There seems to be indeed some kind of relation between MX <=> CX/APC:
> >
> >  - When voting for CX in the RPM firmware, it will always implicitly
> >    adjust the MX performance state to be MX >= CX.
> >
> >  - When scaling APC up, we must increase MX before APC.
> >  - When scaling APC down, we must decrease MX after APC.
> >  => Clearly MX >= APC. Not in terms of raw voltage, but at least for the
> >     abstract performance state.
> >
> > Is this some kind of parent-child relationship between MX <=> CX and
> > MX <=> APC?
> 
> Thanks for sharing the above. Yes, to me, it looks like there is a
> parent/child-domain relationship that could be worth describing/using.
> 
> >
> > If yes, maybe we could indeed bind MX to the CPR genpd somehow. They use
> > different performance state numbering, so we need some kind of
> > translation. I'm not entirely sure how that would be described.
> 
> Both the power-domain and the required-opps DT bindings
> (Documentation/devicetree/bindings/opp/opp-v2-base.yaml) already
> allow us to describe these kinds of hierarchical
> dependencies/layouts.
> 
> In other words, to scale performance for a child domain, the child may
> rely on that we scale performance for the parent domain too. This is
> already supported by genpd and through the opp library - so it should
> just work. :-)
> 

I'm getting back to the "multiple power domains" case of MSM8916 now, as
discussed above. I've tried modelling MX as parent genpd of CPR, to
avoid having to scale multiple power domains as part of cpufreq.

Basically, it looks like the following:

	cpr: power-controller@b018000 {
		compatible = "qcom,msm8916-cpr", "qcom,cpr";
		reg = <0x0b018000 0x1000>;
		/* ... */
		#power-domain-cells = <0>;
		operating-points-v2 = <&cpr_opp_table>;
		/* Supposed to be parent domain, not consumer */
		power-domains = <&rpmpd MSM8916_VDDMX_AO>;

		cpr_opp_table: opp-table {
			compatible = "operating-points-v2-qcom-level";

			cpr_opp1: opp1 {
				opp-level = <1>;
				qcom,opp-fuse-level = <1>;
				required-opps = <&rpmpd_opp_svs_soc>;
			};
			cpr_opp2: opp2 {
				opp-level = <2>;
				qcom,opp-fuse-level = <2>;
				required-opps = <&rpmpd_opp_nom>;
			};
			cpr_opp3: opp3 {
				opp-level = <3>;
				qcom,opp-fuse-level = <3>;
				required-opps = <&rpmpd_opp_super_turbo>;
			};
		};
	};

As already discussed [1], it's a bit annoying that the genpd core
attaches the power domain as a consumer by default, but I work around this
by calling of_genpd_add_subdomain() followed by dev_pm_domain_detach()
in the CPR driver.
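
Roughly, the relevant part in the CPR driver looks like the sketch
below. This is a simplified illustration rather than the actual driver
code; the helper name, the exact error handling and the child phandle
setup are my assumptions:

#include <linux/of.h>
#include <linux/platform_device.h>
#include <linux/pm_domain.h>

/*
 * Hypothetical helper: re-wire the "power-domains" reference (MX) as a
 * parent domain of the CPR genpd, instead of leaving it attached as a
 * consumer supply of the CPR device.
 */
static int cpr_link_parent_domain(struct platform_device *pdev)
{
	struct of_phandle_args parent;
	/* CPR itself has #power-domain-cells = <0> */
	struct of_phandle_args child = { .np = pdev->dev.of_node };
	int ret;

	/* "power-domains" in the CPR node points at the MX domain */
	ret = of_parse_phandle_with_args(pdev->dev.of_node, "power-domains",
					 "#power-domain-cells", 0, &parent);
	if (ret)
		return ret;

	/* Make CPR a child domain of MX ... */
	ret = of_genpd_add_subdomain(&parent, &child);
	of_node_put(parent.np);
	if (ret)
		return ret;

	/* ... and drop the consumer link the core attached during probe */
	dev_pm_domain_detach(&pdev->dev, false);
	return 0;
}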

The actual scaling works fine: performance states of the MX power domain
are updated when the CPR performance state changes. I added some debug
prints and it looks e.g. as follows (CPR is the power-controller@b018000):

    [   24.498218] PM: mx_ao set performance state 6
    [   24.498788] PM: power-controller@b018000 set performance state 3
    [   24.511025] PM: mx_ao set performance state 3
    [   24.511526] PM: power-controller@b018000 set performance state 1
    [   24.521189] PM: mx_ao set performance state 4
    [   24.521660] PM: power-controller@b018000 set performance state 2
    [   24.533183] PM: mx_ao set performance state 6
    [   24.533535] PM: power-controller@b018000 set performance state 3

There is one remaining problem here: Consider e.g. the switch from CPR
performance state 3 -> 1. As the log shows, the parent genpd state is
currently set *before* the child genpd state, both when scaling up and
when scaling down. When scaling down, however, the parent genpd state
must be reduced *after* the child genpd state. Otherwise, we can't
guarantee that the parent genpd state is always >= the child state.

In the OPP core, the order of such operations is always chosen based on
whether we are scaling up or down. When scaling up, power domain states
are set before the frequency is changed, and the other way around for
scaling down.
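
As a rough illustration of that ordering rule (my sketch, not the
actual OPP core code; the function and parameter names are made up):

#include <linux/clk.h>
#include <linux/pm_domain.h>

/*
 * Hypothetical helper showing the ordering: raise the power domain
 * state before increasing the clock rate, lower it only afterwards.
 */
static int set_perf_point(struct device *genpd_dev, struct clk *cpu_clk,
			  unsigned long old_freq, unsigned long new_freq,
			  unsigned int pstate)
{
	int ret;

	if (new_freq > old_freq) {
		/* Scaling up: bump the power domain state first */
		ret = dev_pm_genpd_set_performance_state(genpd_dev, pstate);
		if (ret)
			return ret;
		return clk_set_rate(cpu_clk, new_freq);
	}

	/* Scaling down: reduce the power domain state afterwards */
	ret = clk_set_rate(cpu_clk, new_freq);
	if (ret)
		return ret;
	return dev_pm_genpd_set_performance_state(genpd_dev, pstate);
}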

Is this something you could imagine changing in the GENPD core, either
unconditionally for everyone, or as an option?

I tried to hack this in for a quick test and came up with the following
(the diff is unreadable, so I'll just post the entire changed
_genpd_set_performance_state() function). Admittedly it's a bit ugly.

With these changes the sequence from above looks more like:

    [   22.374555] PM: mx_ao set performance state 6
    [   22.375175] PM: power-controller@b018000 set performance state 3
    [   22.424661] PM: power-controller@b018000 set performance state 1
    [   22.425169] PM: mx_ao set performance state 3
    [   22.434932] PM: mx_ao set performance state 4
    [   22.435331] PM: power-controller@b018000 set performance state 2
    [   22.461197] PM: mx_ao set performance state 6
    [   22.461968] PM: power-controller@b018000 set performance state 3

Which is correct now.

Let me know if you have any thoughts about this. :-)

Thanks for taking the time to discuss this!
Stephan

[1]: https://lore.kernel.org/linux-pm/CAPDyKFq+zsoeF-4h5TfT4Z+S46a501_pUq8y2c1x==Tt6EKBGA@mail.gmail.com/

static int _genpd_set_performance_state(struct generic_pm_domain *genpd,
					unsigned int state, int depth);

static void _genpd_rollback_parent_state(struct gpd_link *link, int depth)
{
	struct generic_pm_domain *parent = link->parent;
	int parent_state;

	genpd_lock_nested(parent, depth + 1);

	parent_state = link->prev_performance_state;
	link->performance_state = parent_state;

	parent_state = _genpd_reeval_performance_state(parent, parent_state);
	if (_genpd_set_performance_state(parent, parent_state, depth + 1)) {
		pr_err("%s: Failed to roll back to %d performance state\n",
		       parent->name, parent_state);
	}

	genpd_unlock(parent);
}

static int _genpd_set_parent_state(struct generic_pm_domain *genpd,
				   struct gpd_link *link,
				   unsigned int state, int depth)
{
	struct generic_pm_domain *parent = link->parent;
	int parent_state, ret;

	/* Find parent's performance state */
	ret = genpd_xlate_performance_state(genpd, parent, state);
	if (unlikely(ret < 0))
		return ret;

	parent_state = ret;

	genpd_lock_nested(parent, depth + 1);

	link->prev_performance_state = link->performance_state;
	link->performance_state = parent_state;
	parent_state = _genpd_reeval_performance_state(parent,
						parent_state);
	ret = _genpd_set_performance_state(parent, parent_state, depth + 1);
	if (ret)
		link->performance_state = link->prev_performance_state;

	genpd_unlock(parent);

	return ret;
}

static int _genpd_set_performance_state(struct generic_pm_domain *genpd,
					unsigned int state, int depth)
{
	struct gpd_link *link = NULL;
	int ret;

	if (state == genpd->performance_state)
		return 0;

	/* When scaling up, propagate to parents first in normal order */
	if (state > genpd->performance_state) {
		list_for_each_entry(link, &genpd->child_links, child_node) {
			ret = _genpd_set_parent_state(genpd, link, state, depth);
			if (ret)
				goto rollback_parents_up;
		}
	}

	if (genpd->set_performance_state) {
		pr_err("%s set performance state %d\n", genpd->name, state);
		ret = genpd->set_performance_state(genpd, state);
		if (ret) {
			if (link)
				goto rollback_parents_up;
			return ret;
		}
	}

	/* When scaling down, propagate to parents after in reverse order */
	if (state < genpd->performance_state) {
		list_for_each_entry_reverse(link, &genpd->child_links, child_node) {
			ret = _genpd_set_parent_state(genpd, link, state, depth);
			if (ret)
				goto rollback_parents_down;
		}
	}

	genpd->performance_state = state;
	return 0;

rollback_parents_up:
	list_for_each_entry_continue_reverse(link, &genpd->child_links, child_node)
		_genpd_rollback_parent_state(link, depth);
	return ret;
rollback_parents_down:
	list_for_each_entry_continue(link, &genpd->child_links, child_node)
		_genpd_rollback_parent_state(link, depth);
	return ret;
}
Ulf Hansson Dec. 4, 2023, 10:59 a.m. UTC | #12
On Tue, 28 Nov 2023 at 13:42, Stephan Gerhold <stephan@gerhold.net> wrote:
>
> Hi Uffe,
>
> On Mon, Oct 16, 2023 at 04:47:52PM +0200, Ulf Hansson wrote:
> > [...]
> > > > >   - MSM8916 (CPR+RPMPD):
> > > > >     https://github.com/msm8916-mainline/linux/commit/8880f39108206d7a60a0a8351c0373bddf58657c
> > > >
> > > > This looks a bit odd to me. Does a CPU really have four different
> > > > power-domains, where three of them are performance-domains?
> > > >
> > >
> > > Good question. I think we're largely entering "uncharted territory" with
> > > these questions, I can just try to answer it the best I can from the
> > > limited documentation and knowledge I have. :)
> > >
> > > The CPU does indeed use four different power domains. There also seem to
> > > be additional power switches that gate power for some components without
> > > having to turn off the entire supply.
> > >
> > > I'll list them twice from two points of view: Once mapping component ->
> > > power domain, then again showing each power domain separately to make it
> > > more clear. At the end I also want to make clear that MSM8909 (with the
> > > "single" power domain) is actually exactly the same SoC design, just
> > > with different regulators supplying the power domains.
> > >
> > > It's totally fine if you just skim over it. I'm listing it in detail
> > > also as reference for myself. :D
> > >
> > > # Components
> > >  - SoC
> > >    - CPU subsystem ("APPS")
> > >      - CPU cluster
> > >        - 4x CPU core (logic and L1 cache) -> VDD_APC
> > >        - Shared L2 cache
> > >          - Logic -> VDD_APC
> > >          - Memory -> VDD_MX
> > >      - CPU clock controller (logic) -> VDD_CX
> > >        - Provides CPU frequency from different clock sources
> > >        - L2 cache runs at 1/2 of CPU frequency
> > >        => Both VDD_APC and VDD_MX must be scaled based on frequency
> > >      - CPU PLL clock source
> > >        - Generates the higher (GHz) CPU frequencies
> > >        - Logic (?, unsure) -> VDD_CX
> > >        - ??? -> VDD_SR2_APPS_PLL
> > >        => VDD_CX must be scaled based on PLL frequency
> > >
> > > # Power Domains
> > > ## VDD_APC
> > >  - dedicated for CPU
> > >  - powered off completely in deepest cluster cpuidle state
> > >
> > >  - per-core power switch (per-core cpuidle)
> > >    - CPU logic
> > >    - L1 cache controller/logic and maybe memory(?, unsure)
> > >  - shared L2 cache controller/logic
> > >
> > >  => must be scaled based on CPU frequency
> > >
> > > ## VDD_MX
> > >  - global SoC power domain for "on-chip memories"
> > >  - always on, reduced to minimal voltage when entire SoC is idle
> > >
> > >  - power switch (controlled by deepest cluster cpuidle state?, unsure)
> > >    - L2 cache memory
> > >
> > >  => must be scaled based on L2 frequency (=> 1/2 CPU frequency)
> > >
> > > ## VDD_CX
> > >  - global SoC power domain for "digital logic"
> > >  - always on, reduced to minimal voltage when entire SoC is idle
> > >  - voting for VDD_CX in the RPM firmware also affects VDD_MX performance
> > >    state (firmware implicitly sets VDD_MX >= VDD_CX)
> > >
> > >  - CPU clock controller logic, CPU PLL logic(?, unsure)
> > >
> > >  => must be scaled based on CPU PLL frequency
> > >
> > > ## VDD_SR2_APPS_PLL
> > >  - global SoC power domain for CPU clock PLLs
> > >  - on MSM8916: always on with constant voltage
> > >
> > >  => ignored in Linux at the moment
> > >
> > > # Power Domain Regulators
> > > These power domains are literally input pins on the SoC chip. In theory
> > > one could connect any suitable regulator to each of those. In practice
> > > there are just a couple of standard reference designs that everyone
> > > uses:
> > >
> > > ## MSM8916 (SoC) + PM8916 (PMIC)
> > > We need to scale 3 power domains together with cpufreq:
> > >
> > >  - VDD_APC (CPU logic) = &pm8916_spmi_s2 (via CPR)
> > >  - VDD_MX  (L2 memory) = &pm8916_l3 (via RPMPD: MSM8916_VDDMX)
> > >  - VDD_CX  (CPU PLL)   = &pm8916_s1 (via RPMPD: MSM8916_VDDCX)
> > >
> > > ## MSM8909 (SoC) + PM8909 (PMIC)
> > > We need to scale 1 power domain together with cpufreq:
> > >
> > >  - VDD_APC = VDD_CX    = &pm8909_s1 (via RPMPD: MSM8909_VDDCX)
> > >    (CPU logic, L2 logic and CPU PLL)
> > > (- VDD_MX  (L2 memory) = &pm8909_l3 (RPM firmware enforces VDD_MX >= VDD_CX))
> > >
> > > There is implicit magic in the RPM firmware here that saves us from
> > > scaling VDD_MX. VDD_CX/APC are the same power rail.
> > >
> > > ## MSM8909 (SoC) + PM8916 (PMIC)
> > > When MSM8909 is paired with PM8916 instead of PM8909, the setup is
> > > identical to MSM8916+PM8916. We need to scale 3 power domains.
> > >
> > > > In a way it sounds like an option could be to hook up the cpr to the
> > > > rpmpd:s instead (possibly even set it as a child-domains to the
> > > > rpmpd:s), assuming that is a better description of the HW, which it
> > > > may not be, of course.
> > >
> > > Hm. It's definitely an option. I must admit I haven't really looked
> > > much at child-domains so far, so offhand I'm not sure about
> > > the implications, for both the abstract hardware description and
> > > the implementation.
> > >
> > > There does indeed seem to be some kind of relation between MX <=> CX/APC:
> > >
> > >  - When voting for CX in the RPM firmware, it will always implicitly
> > >    adjust the MX performance state to be MX >= CX.
> > >
> > >  - When scaling APC up, we must increase MX before APC.
> > >  - When scaling APC down, we must decrease MX after APC.
> > >  => Clearly MX >= APC. Not in terms of raw voltage, but at least for the
> > >     abstract performance state.
> > >
> > > Is this some kind of parent-child relationship between MX <=> CX and
> > > MX <=> APC?
> >
> > Thanks for sharing the above. Yes, to me, it looks like there is a
> > parent/child-domain relationship that could be worth describing/using.
> >
> > >
> > > If yes, maybe we could indeed bind MX to the CPR genpd somehow. They use
> > > different performance state numbering, so we need some kind of
> > > translation. I'm not entirely sure how that would be described.
> >
> > Both the power-domain and the required-opps DT bindings
> > (Documentation/devicetree/bindings/opp/opp-v2-base.yaml) are already
> > allowing us to describe these kinds of hierarchical
> > dependencies/layouts.
> >
> > In other words, to scale performance for a child domain, the child can
> > rely on the parent domain's performance being scaled too. This is
> > already supported by genpd and through the opp library - so it should
> > just work. :-)
> >
>
> I'm getting back to the "multiple power domains" case of MSM8916 now, as
> discussed above. I've tried modelling MX as parent genpd of CPR, to
> avoid having to scale multiple power domains as part of cpufreq.
>
> Basically, it looks like the following:
>
>         cpr: power-controller@b018000 {
>                 compatible = "qcom,msm8916-cpr", "qcom,cpr";
>                 reg = <0x0b018000 0x1000>;
>                 /* ... */
>                 #power-domain-cells = <0>;
>                 operating-points-v2 = <&cpr_opp_table>;
>                 /* Supposed to be parent domain, not consumer */
>                 power-domains = <&rpmpd MSM8916_VDDMX_AO>;
>
>                 cpr_opp_table: opp-table {
>                         compatible = "operating-points-v2-qcom-level";
>
>                         cpr_opp1: opp1 {
>                                 opp-level = <1>;
>                                 qcom,opp-fuse-level = <1>;
>                                 required-opps = <&rpmpd_opp_svs_soc>;
>                         };
>                         cpr_opp2: opp2 {
>                                 opp-level = <2>;
>                                 qcom,opp-fuse-level = <2>;
>                                 required-opps = <&rpmpd_opp_nom>;
>                         };
>                         cpr_opp3: opp3 {
>                                 opp-level = <3>;
>                                 qcom,opp-fuse-level = <3>;
>                                 required-opps = <&rpmpd_opp_super_turbo>;
>                         };
>                 };
>         };
>
> As already discussed [1], it's a bit annoying that the genpd core
> attaches the power domain as a consumer by default, but I work around this
> by calling of_genpd_add_subdomain() followed by dev_pm_domain_detach()
> in the CPR driver.

Yep, that seems reasonable to me.

>
> The actual scaling works fine: performance states of the MX power domain
> are updated when the CPR performance state changes. I added some debug
> prints and it looks e.g. as follows (CPR is the power-controller@b018000):
>
>     [   24.498218] PM: mx_ao set performance state 6
>     [   24.498788] PM: power-controller@b018000 set performance state 3
>     [   24.511025] PM: mx_ao set performance state 3
>     [   24.511526] PM: power-controller@b018000 set performance state 1
>     [   24.521189] PM: mx_ao set performance state 4
>     [   24.521660] PM: power-controller@b018000 set performance state 2
>     [   24.533183] PM: mx_ao set performance state 6
>     [   24.533535] PM: power-controller@b018000 set performance state 3
>
> There is one remaining problem here: Consider e.g. the switch from CPR
> performance state 3 -> 1. As the log shows, the parent genpd state is
> currently set *before* the child genpd state, both when scaling up and
> when scaling down. When scaling down, however, the parent genpd state
> must be reduced *after* the child genpd state. Otherwise, we can't
> guarantee that the parent genpd state is always >= the child state.

Good point!

>
> In the OPP core, the order of such operations is always chosen based on
> whether we are scaling up or down. When scaling up, power domain states
> are set before the frequency is changed, and the other way around for
> scaling down.
>
> Is this something you could imagine changing in the GENPD core, either
> unconditionally for everyone, or as an option?

This sounds like a generic problem that we need to fix for genpd. So
for everyone.

>
> I tried to hack this in for a quick test and came up with the following
> (the diff is unreadable, so I'll just post the entire changed
> _genpd_set_performance_state() function). Admittedly it's a bit ugly.
>
> With these changes the sequence from above looks more like:
>
>     [   22.374555] PM: mx_ao set performance state 6
>     [   22.375175] PM: power-controller@b018000 set performance state 3
>     [   22.424661] PM: power-controller@b018000 set performance state 1
>     [   22.425169] PM: mx_ao set performance state 3
>     [   22.434932] PM: mx_ao set performance state 4
>     [   22.435331] PM: power-controller@b018000 set performance state 2
>     [   22.461197] PM: mx_ao set performance state 6
>     [   22.461968] PM: power-controller@b018000 set performance state 3
>
> Which is correct now.
>
> Let me know if you have any thoughts about this. :-)

Makes sense! Please post the below as a formal patch so I can review
and test it!

Kind regards
Uffe

>
> Thanks for taking the time to discuss this!
> Stephan
>
> [1]: https://lore.kernel.org/linux-pm/CAPDyKFq+zsoeF-4h5TfT4Z+S46a501_pUq8y2c1x==Tt6EKBGA@mail.gmail.com/
>
> static int _genpd_set_performance_state(struct generic_pm_domain *genpd,
>                                         unsigned int state, int depth);
>
> static void _genpd_rollback_parent_state(struct gpd_link *link, int depth)
> {
>         struct generic_pm_domain *parent = link->parent;
>         int parent_state;
>
>         genpd_lock_nested(parent, depth + 1);
>
>         parent_state = link->prev_performance_state;
>         link->performance_state = parent_state;
>
>         parent_state = _genpd_reeval_performance_state(parent, parent_state);
>         if (_genpd_set_performance_state(parent, parent_state, depth + 1)) {
>                 pr_err("%s: Failed to roll back to %d performance state\n",
>                        parent->name, parent_state);
>         }
>
>         genpd_unlock(parent);
> }
>
> static int _genpd_set_parent_state(struct generic_pm_domain *genpd,
>                                    struct gpd_link *link,
>                                    unsigned int state, int depth)
> {
>         struct generic_pm_domain *parent = link->parent;
>         int parent_state, ret;
>
>         /* Find parent's performance state */
>         ret = genpd_xlate_performance_state(genpd, parent, state);
>         if (unlikely(ret < 0))
>                 return ret;
>
>         parent_state = ret;
>
>         genpd_lock_nested(parent, depth + 1);
>
>         link->prev_performance_state = link->performance_state;
>         link->performance_state = parent_state;
>         parent_state = _genpd_reeval_performance_state(parent,
>                                                 parent_state);
>         ret = _genpd_set_performance_state(parent, parent_state, depth + 1);
>         if (ret)
>                 link->performance_state = link->prev_performance_state;
>
>         genpd_unlock(parent);
>
>         return ret;
> }
>
> static int _genpd_set_performance_state(struct generic_pm_domain *genpd,
>                                         unsigned int state, int depth)
> {
>         struct gpd_link *link = NULL;
>         int ret;
>
>         if (state == genpd->performance_state)
>                 return 0;
>
>         /* When scaling up, propagate to parents first in normal order */
>         if (state > genpd->performance_state) {
>                 list_for_each_entry(link, &genpd->child_links, child_node) {
>                         ret = _genpd_set_parent_state(genpd, link, state, depth);
>                         if (ret)
>                                 goto rollback_parents_up;
>                 }
>         }
>
>         if (genpd->set_performance_state) {
>                 pr_err("%s set performance state %d\n", genpd->name, state);
>                 ret = genpd->set_performance_state(genpd, state);
>                 if (ret) {
>                         if (link)
>                                 goto rollback_parents_up;
>                         return ret;
>                 }
>         }
>
>         /* When scaling down, propagate to parents after in reverse order */
>         if (state < genpd->performance_state) {
>                 list_for_each_entry_reverse(link, &genpd->child_links, child_node) {
>                         ret = _genpd_set_parent_state(genpd, link, state, depth);
>                         if (ret)
>                                 goto rollback_parents_down;
>                 }
>         }
>
>         genpd->performance_state = state;
>         return 0;
>
> rollback_parents_up:
>         list_for_each_entry_continue_reverse(link, &genpd->child_links, child_node)
>                 _genpd_rollback_parent_state(link, depth);
>         return ret;
> rollback_parents_down:
>         list_for_each_entry_continue(link, &genpd->child_links, child_node)
>                 _genpd_rollback_parent_state(link, depth);
>         return ret;
> }
>
diff mbox series

Patch

diff --git a/drivers/cpufreq/qcom-cpufreq-nvmem.c b/drivers/cpufreq/qcom-cpufreq-nvmem.c
index 84d7033e5efe..17d6ab14c909 100644
--- a/drivers/cpufreq/qcom-cpufreq-nvmem.c
+++ b/drivers/cpufreq/qcom-cpufreq-nvmem.c
@@ -25,6 +25,7 @@ 
 #include <linux/platform_device.h>
 #include <linux/pm_domain.h>
 #include <linux/pm_opp.h>
+#include <linux/pm_runtime.h>
 #include <linux/slab.h>
 #include <linux/soc/qcom/smem.h>
 
@@ -280,6 +281,7 @@  static int qcom_cpufreq_probe(struct platform_device *pdev)
 	}
 
 	for_each_possible_cpu(cpu) {
+		struct device **virt_devs = NULL;
 		struct dev_pm_opp_config config = {
 			.supported_hw = NULL,
 		};
@@ -300,7 +302,7 @@  static int qcom_cpufreq_probe(struct platform_device *pdev)
 
 		if (drv->data->genpd_names) {
 			config.genpd_names = drv->data->genpd_names;
-			config.virt_devs = NULL;
+			config.virt_devs = &virt_devs;
 		}
 
 		if (config.supported_hw || config.genpd_names) {
@@ -311,6 +313,23 @@  static int qcom_cpufreq_probe(struct platform_device *pdev)
 				goto free_opp;
 			}
 		}
+
+		if (virt_devs) {
+			const char * const *name = config.genpd_names;
+			int i;
+
+			for (i = 0; *name; i++, name++) {
+				ret = pm_runtime_resume_and_get(virt_devs[i]);
+				if (ret) {
+					dev_err(cpu_dev, "failed to resume %s: %d\n",
+						*name, ret);
+					goto free_opp;
+				}
+
+				/* Keep CPU power domain always-on */
+				dev_pm_syscore_device(virt_devs[i], true);
+			}
+		}
 	}
 
 	cpufreq_dt_pdev = platform_device_register_simple("cpufreq-dt", -1,