Message ID | 20241022154508.63563-1-sebastian.reichel@collabora.com (mailing list archive) |
---|---|
Series | Fix RK3588 GPU domain |
On Tue, 22 Oct 2024 at 17:45, Sebastian Reichel
<sebastian.reichel@collabora.com> wrote:
>
> Hi,
>
> I got a report, that the Linux kernel crashes on Rock 5B when the panthor
> driver is loaded late after booting. The crash starts with the following
> shortened error print:
>
> rockchip-pm-domain fd8d8000.power-management:power-controller: failed to set domain 'gpu', val=0
> rockchip-pm-domain fd8d8000.power-management:power-controller: failed to get ack on domain 'gpu', val=0xa9fff
> SError Interrupt on CPU4, code 0x00000000be000411 -- SError
>
> This series first does some cleanups in the Rockchip power domain
> driver and changes the driver, so that it no longer tries to continue
> when it fails to enable a domain. This gets rid of the SError interrupt
> and long backtraces. But the kernel still hangs when it fails to enable
> a power domain. I have not done further analysis to check if that can
> be avoided.
>
> Last but not least this provides a fix for the GPU power domain failing
> to get enabled - after some testing from my side it seems to require the
> GPU voltage supply to be enabled.
>
> This series is now based on the pull request from Mark Brown:
> https://lore.kernel.org/linux-pm/ZvsVfQ1fuSVZpF6A@finisterre.sirena.org.uk/
>
> I added one more patch, which adds devm_of_regulator_get without the
> _optional suffix, since that is more sensible for the Rockchip usecase.
> Longer explanation can be seen in patch 6, which adds the handling to
> the Rockchip driver. My merge suggestion would be that Mark adds the
> regulator patch on top of the immutable branch and creates a new pull
> request.

The merge strategy seems reasonable to me. But I am fine with that
whatever works for Mark.

[...]

Kind regards
Uffe

On Tuesday, 22 October 2024, 17:41:45 CEST, Sebastian Reichel wrote:
> Hi,
>
> I got a report, that the Linux kernel crashes on Rock 5B when the panthor
> driver is loaded late after booting. The crash starts with the following
> shortened error print:
>
> rockchip-pm-domain fd8d8000.power-management:power-controller: failed to set domain 'gpu', val=0
> rockchip-pm-domain fd8d8000.power-management:power-controller: failed to get ack on domain 'gpu', val=0xa9fff
> SError Interrupt on CPU4, code 0x00000000be000411 -- SError
>
> This series first does some cleanups in the Rockchip power domain
> driver and changes the driver, so that it no longer tries to continue
> when it fails to enable a domain. This gets rid of the SError interrupt
> and long backtraces. But the kernel still hangs when it fails to enable
> a power domain. I have not done further analysis to check if that can
> be avoided.
>
> Last but not least this provides a fix for the GPU power domain failing
> to get enabled - after some testing from my side it seems to require the
> GPU voltage supply to be enabled.
>
> This series is now based on the pull request from Mark Brown:
> https://lore.kernel.org/linux-pm/ZvsVfQ1fuSVZpF6A@finisterre.sirena.org.uk/
>
> I added one more patch, which adds devm_of_regulator_get without the
> _optional suffix, since that is more sensible for the Rockchip usecase.
> Longer explanation can be seen in patch 6, which adds the handling to
> the Rockchip driver. My merge suggestion would be that Mark adds the
> regulator patch on top of the immutable branch and creates a new pull
> request.
>
> The last patch, which updates the RK3588 board files only covers the
> boards from 6.12-rc1. Any board missing the update will behave as before,
> so it is perfectly fine not to update all DT files at once.

My rk3588 jaguar somehow developed some delay when dhcp'ing for its nfs
root and with that actually started running into that gpu-regulator-issue.

With this series applied, that issue goes away:

Tested-by: Heiko Stuebner <heiko@sntech.de>

On Wed, 23 Oct 2024 at 12:05, Ulf Hansson <ulf.hansson@linaro.org> wrote:
>
> On Tue, 22 Oct 2024 at 17:45, Sebastian Reichel
> <sebastian.reichel@collabora.com> wrote:
> >
> > Hi,
> >
> > I got a report, that the Linux kernel crashes on Rock 5B when the panthor
> > driver is loaded late after booting. The crash starts with the following
> > shortened error print:
> >
> > rockchip-pm-domain fd8d8000.power-management:power-controller: failed to set domain 'gpu', val=0
> > rockchip-pm-domain fd8d8000.power-management:power-controller: failed to get ack on domain 'gpu', val=0xa9fff
> > SError Interrupt on CPU4, code 0x00000000be000411 -- SError
> >
> > This series first does some cleanups in the Rockchip power domain
> > driver and changes the driver, so that it no longer tries to continue
> > when it fails to enable a domain. This gets rid of the SError interrupt
> > and long backtraces. But the kernel still hangs when it fails to enable
> > a power domain. I have not done further analysis to check if that can
> > be avoided.
> >
> > Last but not least this provides a fix for the GPU power domain failing
> > to get enabled - after some testing from my side it seems to require the
> > GPU voltage supply to be enabled.
> >
> > This series is now based on the pull request from Mark Brown:
> > https://lore.kernel.org/linux-pm/ZvsVfQ1fuSVZpF6A@finisterre.sirena.org.uk/
> >
> > I added one more patch, which adds devm_of_regulator_get without the
> > _optional suffix, since that is more sensible for the Rockchip usecase.
> > Longer explanation can be seen in patch 6, which adds the handling to
> > the Rockchip driver. My merge suggestion would be that Mark adds the
> > regulator patch on top of the immutable branch and creates a new pull
> > request.
>
> The merge strategy seems reasonable to me. But I am fine with that
> whatever works for Mark.

Mark, any update on this?

If easier, you could also just ack the regulator patch (patch1), and
can just take it all via my tree.

Kind regards
Uffe

On Fri, Nov 01, 2024 at 12:56:16PM +0100, Ulf Hansson wrote:
> On Wed, 23 Oct 2024 at 12:05, Ulf Hansson <ulf.hansson@linaro.org> wrote:

> > The merge strategy seems reasonable to me. But I am fine with that
> > whatever works for Mark.

> Mark, any update on this?

> If easier, you could also just ack the regulator patch (patch1), and
> can just take it all via my tree.

I'm still deciding what I think about the regulator patch, I can see why
it's wanted in this situation but it's also an invitation to misuse by
drivers just blindly requesting all supplies and not caring if things
work.

On Fri, Nov 1, 2024 at 10:36 PM Mark Brown <broonie@kernel.org> wrote:
>
> On Fri, Nov 01, 2024 at 12:56:16PM +0100, Ulf Hansson wrote:
> > On Wed, 23 Oct 2024 at 12:05, Ulf Hansson <ulf.hansson@linaro.org> wrote:
>
> > > The merge strategy seems reasonable to me. But I am fine with that
> > > whatever works for Mark.
>
> > Mark, any update on this?
>
> > If easier, you could also just ack the regulator patch (patch1), and
> > can just take it all via my tree.
>
> I'm still deciding what I think about the regulator patch, I can see why
> it's wanted in this situation but it's also an invitation to misuse by
> drivers just blindly requesting all supplies and not caring if things
> work.

I suppose an alternative is to flag which power domains actually need
a regulator supply. The MediaTek power domain driver does this.

There's still the issue of backwards compatibility with older device
trees that are missing said supply though.

ChenYu

Hi,

On Fri, Nov 01, 2024 at 10:41:14PM +0800, Chen-Yu Tsai wrote:
> On Fri, Nov 1, 2024 at 10:36 PM Mark Brown <broonie@kernel.org> wrote:
> > On Fri, Nov 01, 2024 at 12:56:16PM +0100, Ulf Hansson wrote:
> > > On Wed, 23 Oct 2024 at 12:05, Ulf Hansson <ulf.hansson@linaro.org> wrote:
> >
> > > > The merge strategy seems reasonable to me. But I am fine with that
> > > > whatever works for Mark.
> >
> > > Mark, any update on this?
> >
> > > If easier, you could also just ack the regulator patch (patch1), and
> > > can just take it all via my tree.
> >
> > I'm still deciding what I think about the regulator patch, I can see why
> > it's wanted in this situation but it's also an invitation to misuse by
> > drivers just blindly requesting all supplies and not caring if things
> > work.
>
> I suppose an alternative is to flag which power domains actually need
> a regulator supply. The MediaTek power domain driver does this.

If you look at patch 6/7, which actually makes use of
devm_of_regulator_get() you will notice that I did actually flag
which power domains have/need a regulator.

> There's still the issue of backwards compatibility with older device
> trees that are missing said supply though.

Exactly :)

As far as I can see the same misuse potential also exists for the
plain devm_regulator_get() version.

Greetings,

-- Sebastian

On Fri, Nov 01, 2024 at 08:04:52PM +0100, Sebastian Reichel wrote:
> On Fri, Nov 01, 2024 at 10:41:14PM +0800, Chen-Yu Tsai wrote:

> > There's still the issue of backwards compatibility with older device
> > trees that are missing said supply though.

> Exactly :)

> As far as I can see the same misuse potential also exists for the
> plain devm_regulator_get() version.

You'll get warnings but I'm not sure that's such a huge issue?

Hi,

On Fri, Nov 01, 2024 at 07:22:28PM +0000, Mark Brown wrote:
> On Fri, Nov 01, 2024 at 08:04:52PM +0100, Sebastian Reichel wrote:
> > On Fri, Nov 01, 2024 at 10:41:14PM +0800, Chen-Yu Tsai wrote:
>
> > > There's still the issue of backwards compatibility with older device
> > > trees that are missing said supply though.
>
> > Exactly :)
>
> > As far as I can see the same misuse potential also exists for the
> > plain devm_regulator_get() version.
>
> You'll get warnings but I'm not sure that's such a huge issue?

I see that as a feature and not as an issue. Obviously the dependency
should be properly described in DT. When we upstreamed GPU support for
RK3588 we did not mark the GPU regulator as always-on [*] and that has
been copied to all other upstreamed RK3588 board DTs. This means all of
them are buggy now. Getting a warning might help people to understand
what is going on. In any case I fixed up every in-tree user as part of
this series.

[*] Older Rockchip platforms (which are not touched by this series) and
    downstream RK3588 have the GPU regulator marked as always-on.

Greetings,

-- Sebastian
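
Editorial sketch, not part of the thread: the behaviour discussed above (a per-domain "needs a regulator" flag, a non-optional supply lookup that warns and falls back to a dummy when the device tree lacks the supply, and aborting instead of continuing when the domain does not ack) can be modelled in a few lines of userspace C. Everything here is hypothetical: the `model_*` names, the struct layout, and the choice of -ETIMEDOUT are illustration only and are not taken from the Rockchip driver or the regulator core. The `rail_left_on_by_boot` parameter encodes an assumption about why the failure only shows up when panthor is loaded late, which the thread itself does not spell out.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical userspace model; none of these names are kernel APIs. */
struct model_supply {
	bool is_dummy;	/* true when the device tree does not describe the supply */
	bool enabled;
};

struct model_domain {
	const char *name;
	bool needs_regulator;	/* per-domain flag, as discussed for patch 6/7 */
	bool dt_has_supply;	/* does the board's device tree describe the supply? */
	struct model_supply supply;
	bool powered;
};

/* Models a non-optional lookup: a missing supply yields a dummy plus a warning. */
static void model_get_supply(struct model_domain *pd)
{
	pd->supply.is_dummy = !pd->dt_has_supply;
	pd->supply.enabled = false;
	if (pd->supply.is_dummy)
		fprintf(stderr, "%s: supply not found, using dummy regulator\n",
			pd->name);
}

/* Models the hardware: a flagged domain only acks while its rail is up. */
static int model_hw_power_on(const struct model_domain *pd, bool rail_is_up)
{
	if (pd->needs_regulator && !rail_is_up)
		return -ETIMEDOUT;
	return 0;
}

/* Enable the supply first, then the domain; stop on the first failure. */
static int model_power_on(struct model_domain *pd, bool rail_left_on_by_boot)
{
	bool rail_is_up = rail_left_on_by_boot;
	int ret;

	assert(pd != NULL);

	if (pd->needs_regulator) {
		model_get_supply(pd);
		pd->supply.enabled = true;
		if (!pd->supply.is_dummy)
			rail_is_up = true;	/* only a real regulator drives the rail */
	}

	ret = model_hw_power_on(pd, rail_is_up);
	if (ret) {
		/* Fail fast: do not keep touching a domain that did not ack. */
		pd->supply.enabled = false;
		return ret;
	}

	pd->powered = true;
	return 0;
}
```

In this model a board whose device tree describes the supply powers the GPU domain on regardless of when the driver loads. A board with an older device tree gets a warning and behaves as before: it works only if the rail happens to be up, and otherwise fails with an error instead of carrying on.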