mbox series

[v3,0/7] Fix RK3588 GPU domain

Message ID 20241022154508.63563-1-sebastian.reichel@collabora.com (mailing list archive)
Headers show
Series Fix RK3588 GPU domain | expand

Message

Sebastian Reichel Oct. 22, 2024, 3:41 p.m. UTC
Hi,

I got a report, that the Linux kernel crashes on Rock 5B when the panthor
driver is loaded late after booting. The crash starts with the following
shortened error print:

rockchip-pm-domain fd8d8000.power-management:power-controller: failed to set domain 'gpu', val=0
rockchip-pm-domain fd8d8000.power-management:power-controller: failed to get ack on domain 'gpu', val=0xa9fff
SError Interrupt on CPU4, code 0x00000000be000411 -- SError

This series first does some cleanups in the Rockchip power domain
driver and changes the driver, so that it no longer tries to continue
when it fails to enable a domain. This gets rid of the SError interrupt
and long backtraces. But the kernel still hangs when it fails to enable
a power domain. I have not done further analysis to check if that can
be avoided.

Last but not least this provides a fix for the GPU power domain failing
to get enabled - after some testing from my side it seems to require the
GPU voltage supply to be enabled.

This series is now based on the pull request from Mark Brown:
https://lore.kernel.org/linux-pm/ZvsVfQ1fuSVZpF6A@finisterre.sirena.org.uk/

I added one more patch, which adds devm_of_regulator_get without the
_optional suffix, since that is more sensible for the Rockchip usecase.
Longer explanation can be seen in patch 6, which adds the handling to
the Rockchip driver. My merge suggestion would be that Mark adds the
regulator patch on top of the immutable branch and creates a new pull
request.

The last patch, which updates the RK3588 board files only covers the
boards from 6.12-rc1. Any board missing the update will behave as before,
so it is perfectly fine not to update all DT files at once.

Changes since PATCHv2:
 * https://lore.kernel.org/linux-rockchip/20240919091834.83572-1-sebastian.reichel@collabora.com/
 * Rebase to 6.12-rc1 + devm_of_regulator_get_optional branch (Ulf Hansson, Chen-Yu Tsai)
  - Introduce devm_of_regulator_get()
  - Add code to only request regulators for domains needing them
 * Mention other platforms in the DT binding patch (Rob Murphy)
 * Update more RK3588 DT files (Jonas Karlman)

Changes since PATCHv1:
 * https://lore.kernel.org/all/20240910180530.47194-1-sebastian.reichel@collabora.com/
 * Collect Reviewed-by/Acked-by/Tested-by
 * swap first and second patch to avoid introducing and directly removing a mutex_unlock
 * fix spelling of indentation
 * fix double empty line after rockchip_pd_regulator_disable()

Greetings,

-- Sebastian

Sebastian Reichel (7):
  regulator: Add (devm_)of_regulator_get()
  pmdomain: rockchip: cleanup mutex handling in rockchip_pd_power
  pmdomain: rockchip: forward rockchip_do_pmu_set_power_domain errors
  pmdomain: rockchip: reduce indentation in rockchip_pd_power
  dt-bindings: power: rockchip: add regulator support
  pmdomain: rockchip: add regulator support
  arm64: dts: rockchip: Add GPU power domain regulator dependency for
    RK3588

 .../power/rockchip,power-controller.yaml      |   3 +
 .../boot/dts/rockchip/rk3588-armsom-sige7.dts |   4 +
 arch/arm64/boot/dts/rockchip/rk3588-base.dtsi |   2 +-
 .../boot/dts/rockchip/rk3588-coolpi-cm5.dtsi  |   4 +
 .../rockchip/rk3588-edgeble-neu6a-common.dtsi |   4 +
 .../boot/dts/rockchip/rk3588-evb1-v10.dts     |   4 +
 .../boot/dts/rockchip/rk3588-fet3588-c.dtsi   |   4 +
 .../rockchip/rk3588-friendlyelec-cm3588.dtsi  |   4 +
 .../arm64/boot/dts/rockchip/rk3588-jaguar.dts |   4 +
 .../boot/dts/rockchip/rk3588-nanopc-t6.dtsi   |   4 +
 .../boot/dts/rockchip/rk3588-ok3588-c.dts     |   4 +
 .../dts/rockchip/rk3588-orangepi-5-plus.dts   |   4 +
 .../boot/dts/rockchip/rk3588-quartzpro64.dts  |   4 +
 .../boot/dts/rockchip/rk3588-rock-5-itx.dts   |   4 +
 .../boot/dts/rockchip/rk3588-rock-5b.dts      |   4 +
 .../arm64/boot/dts/rockchip/rk3588-tiger.dtsi |   4 +
 .../boot/dts/rockchip/rk3588-toybrick-x0.dts  |   4 +
 .../boot/dts/rockchip/rk3588-turing-rk1.dtsi  |   4 +
 .../boot/dts/rockchip/rk3588s-coolpi-4b.dts   |   4 +
 .../dts/rockchip/rk3588s-gameforce-ace.dts    |   4 +
 .../dts/rockchip/rk3588s-indiedroid-nova.dts  |   4 +
 .../dts/rockchip/rk3588s-khadas-edge2.dts     |   4 +
 .../boot/dts/rockchip/rk3588s-nanopi-r6s.dts  |   4 +
 .../boot/dts/rockchip/rk3588s-odroid-m2.dts   |   4 +
 .../boot/dts/rockchip/rk3588s-orangepi-5.dts  |   4 +
 .../boot/dts/rockchip/rk3588s-rock-5a.dts     |   4 +
 drivers/pmdomain/rockchip/pm-domains.c        | 190 +++++++++++-------
 drivers/regulator/devres.c                    |  17 ++
 drivers/regulator/of_regulator.c              |  21 ++
 include/linux/regulator/consumer.h            |   6 +
 30 files changed, 266 insertions(+), 69 deletions(-)

Comments

Ulf Hansson Oct. 23, 2024, 10:05 a.m. UTC | #1
On Tue, 22 Oct 2024 at 17:45, Sebastian Reichel
<sebastian.reichel@collabora.com> wrote:
>
> Hi,
>
> I got a report, that the Linux kernel crashes on Rock 5B when the panthor
> driver is loaded late after booting. The crash starts with the following
> shortened error print:
>
> rockchip-pm-domain fd8d8000.power-management:power-controller: failed to set domain 'gpu', val=0
> rockchip-pm-domain fd8d8000.power-management:power-controller: failed to get ack on domain 'gpu', val=0xa9fff
> SError Interrupt on CPU4, code 0x00000000be000411 -- SError
>
> This series first does some cleanups in the Rockchip power domain
> driver and changes the driver, so that it no longer tries to continue
> when it fails to enable a domain. This gets rid of the SError interrupt
> and long backtraces. But the kernel still hangs when it fails to enable
> a power domain. I have not done further analysis to check if that can
> be avoided.
>
> Last but not least this provides a fix for the GPU power domain failing
> to get enabled - after some testing from my side it seems to require the
> GPU voltage supply to be enabled.
>
> This series is now based on the pull request from Mark Brown:
> https://lore.kernel.org/linux-pm/ZvsVfQ1fuSVZpF6A@finisterre.sirena.org.uk/
>
> I added one more patch, which adds devm_of_regulator_get without the
> _optional suffix, since that is more sensible for the Rockchip usecase.
> Longer explanation can be seen in patch 6, which adds the handling to
> the Rockchip driver. My merge suggestion would be that Mark adds the
> regulator patch on top of the immutable branch and creates a new pull
> request.

The merge strategy seems reasonable to me. But I am fine with that
whatever works for Mark.

[...]

Kind regards
Uffe
Heiko Stuebner Oct. 25, 2024, 9:19 a.m. UTC | #2
Am Dienstag, 22. Oktober 2024, 17:41:45 CEST schrieb Sebastian Reichel:
> Hi,
> 
> I got a report, that the Linux kernel crashes on Rock 5B when the panthor
> driver is loaded late after booting. The crash starts with the following
> shortened error print:
> 
> rockchip-pm-domain fd8d8000.power-management:power-controller: failed to set domain 'gpu', val=0
> rockchip-pm-domain fd8d8000.power-management:power-controller: failed to get ack on domain 'gpu', val=0xa9fff
> SError Interrupt on CPU4, code 0x00000000be000411 -- SError
> 
> This series first does some cleanups in the Rockchip power domain
> driver and changes the driver, so that it no longer tries to continue
> when it fails to enable a domain. This gets rid of the SError interrupt
> and long backtraces. But the kernel still hangs when it fails to enable
> a power domain. I have not done further analysis to check if that can
> be avoided.
> 
> Last but not least this provides a fix for the GPU power domain failing
> to get enabled - after some testing from my side it seems to require the
> GPU voltage supply to be enabled.
> 
> This series is now based on the pull request from Mark Brown:
> https://lore.kernel.org/linux-pm/ZvsVfQ1fuSVZpF6A@finisterre.sirena.org.uk/
> 
> I added one more patch, which adds devm_of_regulator_get without the
> _optional suffix, since that is more sensible for the Rockchip usecase.
> Longer explanation can be seen in patch 6, which adds the handling to
> the Rockchip driver. My merge suggestion would be that Mark adds the
> regulator patch on top of the immutable branch and creates a new pull
> request.
> 
> The last patch, which updates the RK3588 board files only covers the
> boards from 6.12-rc1. Any board missing the update will behave as before,
> so it is perfectly fine not to update all DT files at once.

My rk3588 jaguar somehow developed some delay when dhcp'ing for its nfs
root and with that actually started running into that gpu-regulator-issue.

With this series applied, that issue goes away:

Tested-by: Heiko Stuebner <heiko@sntech.de>
Ulf Hansson Nov. 1, 2024, 11:56 a.m. UTC | #3
On Wed, 23 Oct 2024 at 12:05, Ulf Hansson <ulf.hansson@linaro.org> wrote:
>
> On Tue, 22 Oct 2024 at 17:45, Sebastian Reichel
> <sebastian.reichel@collabora.com> wrote:
> >
> > Hi,
> >
> > I got a report, that the Linux kernel crashes on Rock 5B when the panthor
> > driver is loaded late after booting. The crash starts with the following
> > shortened error print:
> >
> > rockchip-pm-domain fd8d8000.power-management:power-controller: failed to set domain 'gpu', val=0
> > rockchip-pm-domain fd8d8000.power-management:power-controller: failed to get ack on domain 'gpu', val=0xa9fff
> > SError Interrupt on CPU4, code 0x00000000be000411 -- SError
> >
> > This series first does some cleanups in the Rockchip power domain
> > driver and changes the driver, so that it no longer tries to continue
> > when it fails to enable a domain. This gets rid of the SError interrupt
> > and long backtraces. But the kernel still hangs when it fails to enable
> > a power domain. I have not done further analysis to check if that can
> > be avoided.
> >
> > Last but not least this provides a fix for the GPU power domain failing
> > to get enabled - after some testing from my side it seems to require the
> > GPU voltage supply to be enabled.
> >
> > This series is now based on the pull request from Mark Brown:
> > https://lore.kernel.org/linux-pm/ZvsVfQ1fuSVZpF6A@finisterre.sirena.org.uk/
> >
> > I added one more patch, which adds devm_of_regulator_get without the
> > _optional suffix, since that is more sensible for the Rockchip usecase.
> > Longer explanation can be seen in patch 6, which adds the handling to
> > the Rockchip driver. My merge suggestion would be that Mark adds the
> > regulator patch on top of the immutable branch and creates a new pull
> > request.
>
> The merge strategy seems reasonable to me. But I am fine with that
> whatever works for Mark.

Mark, any update on this?

If easier, you could also just ack the regulator patch (patch1), and
can just take it all via my tree.

Kind regards
Uffe
Mark Brown Nov. 1, 2024, 2:36 p.m. UTC | #4
On Fri, Nov 01, 2024 at 12:56:16PM +0100, Ulf Hansson wrote:
> On Wed, 23 Oct 2024 at 12:05, Ulf Hansson <ulf.hansson@linaro.org> wrote:

> > The merge strategy seems reasonable to me. But I am fine with that
> > whatever works for Mark.

> Mark, any update on this?

> If easier, you could also just ack the regulator patch (patch1), and
> can just take it all via my tree.

I'm still deciding what I think about the regulator patch, I can see why
it's wanted in this situation but it's also an invitation to misuse by
drivers just blindly requesting all supplies and not caring if things
work.
Chen-Yu Tsai Nov. 1, 2024, 2:41 p.m. UTC | #5
On Fri, Nov 1, 2024 at 10:36 PM Mark Brown <broonie@kernel.org> wrote:
>
> On Fri, Nov 01, 2024 at 12:56:16PM +0100, Ulf Hansson wrote:
> > On Wed, 23 Oct 2024 at 12:05, Ulf Hansson <ulf.hansson@linaro.org> wrote:
>
> > > The merge strategy seems reasonable to me. But I am fine with that
> > > whatever works for Mark.
>
> > Mark, any update on this?
>
> > If easier, you could also just ack the regulator patch (patch1), and
> > can just take it all via my tree.
>
> I'm still deciding what I think about the regulator patch, I can see why
> it's wanted in this situation but it's also an invitation to misuse by
> drivers just blindly requesting all supplies and not caring if things
> work.

I suppose an alternative is to flag which power domains actually need
a regulator supply. The MediaTek power domain driver does this.

There's still the issue of backwards compatibility with older device
trees that are missing said supply though.


ChenYu
Sebastian Reichel Nov. 1, 2024, 7:04 p.m. UTC | #6
Hi,

On Fri, Nov 01, 2024 at 10:41:14PM +0800, Chen-Yu Tsai wrote:
> On Fri, Nov 1, 2024 at 10:36 PM Mark Brown <broonie@kernel.org> wrote:
> > On Fri, Nov 01, 2024 at 12:56:16PM +0100, Ulf Hansson wrote:
> > > On Wed, 23 Oct 2024 at 12:05, Ulf Hansson <ulf.hansson@linaro.org> wrote:
> >
> > > > The merge strategy seems reasonable to me. But I am fine with that
> > > > whatever works for Mark.
> >
> > > Mark, any update on this?
> >
> > > If easier, you could also just ack the regulator patch (patch1), and
> > > can just take it all via my tree.
> >
> > I'm still deciding what I think about the regulator patch, I can see why
> > it's wanted in this situation but it's also an invitation to misuse by
> > drivers just blindly requesting all supplies and not caring if things
> > work.
> 
> I suppose an alternative is to flag which power domains actually need
> a regulator supply. The MediaTek power domain driver does this.

If you look at patch 6/7, which actually makes use of devm_of_regulator_get()
you will notice that I did actually flag which power domains have/need a
regulator.

> There's still the issue of backwards compatibility with older device
> trees that are missing said supply though.

Exactly :)

As far as I can see the same misuse potential also exists for the
plain devm_regulator_get() version.

Greetings,

-- Sebastian
Mark Brown Nov. 1, 2024, 7:22 p.m. UTC | #7
On Fri, Nov 01, 2024 at 08:04:52PM +0100, Sebastian Reichel wrote:
> On Fri, Nov 01, 2024 at 10:41:14PM +0800, Chen-Yu Tsai wrote:

> > There's still the issue of backwards compatibility with older device
> > trees that are missing said supply though.

> Exactly :)

> As far as I can see the same misuse potential also exists for the
> plain devm_regulator_get() version.

You'll get warnings but I'm not sure that's such a huge issue?
Sebastian Reichel Nov. 1, 2024, 9:35 p.m. UTC | #8
Hi,

On Fri, Nov 01, 2024 at 07:22:28PM +0000, Mark Brown wrote:
> On Fri, Nov 01, 2024 at 08:04:52PM +0100, Sebastian Reichel wrote:
> > On Fri, Nov 01, 2024 at 10:41:14PM +0800, Chen-Yu Tsai wrote:
> 
> > > There's still the issue of backwards compatibility with older device
> > > trees that are missing said supply though.
> 
> > Exactly :)
> 
> > As far as I can see the same misuse potential also exists for the
> > plain devm_regulator_get() version.
> 
> You'll get warnings but I'm not sure that's such a huge issue?

I see that as a feature and not as an issue. Obviously the
dependency should be properly described in DT. When we upstreamed
GPU support for RK3588 we did not mark the GPU regulator as always-on
[*] and that has been copied to all other upstreamed RK3588 board DTs.
This means all of them are buggy now. Getting a warning might help
people to understand what is going on. In any case I fixed up every
in-tree user as part of this series.

[*] Older Rockchip platforms (which are not touched by this series)
and downstream RK3588 have the GPU regulator marked as always-on.

Greetings,

-- Sebastian