diff mbox series

drm/panfrost: Really power off GPU cores in panfrost_gpu_power_off()

Message ID 20231102141507.73481-1-angelogioacchino.delregno@collabora.com (mailing list archive)
State New, archived
Headers show
Series drm/panfrost: Really power off GPU cores in panfrost_gpu_power_off() | expand

Commit Message

AngeloGioacchino Del Regno Nov. 2, 2023, 2:15 p.m. UTC
The layout of the registers {TILER,SHADER,L2}_PWROFF_LO, used to request
powering off cores, is the same as the {TILER,SHADER,L2}_PWRON_LO ones:
this means that in order to request poweroff of cores, we are supposed
to write a bitmask of cores that should be powered off!
This means that the panfrost_gpu_power_off() function has always been
doing nothing.

Fix powering off the GPU by writing a bitmask of the cores to poweroff
to the relevant PWROFF_LO registers and then check that the transition
(from ON to OFF) has finished by polling the relevant PWRTRANS_LO
registers.

While at it, in order to avoid code duplication, move the core mask
logic from panfrost_gpu_power_on() to a new panfrost_get_core_mask()
function, used in both poweron and poweroff.

Fixes: f3ba91228e8e ("drm/panfrost: Add initial panfrost driver")
Signed-off-by: AngeloGioacchino Del Regno <angelogioacchino.delregno@collabora.com>
---
 drivers/gpu/drm/panfrost/panfrost_gpu.c | 65 ++++++++++++++++++-------
 1 file changed, 47 insertions(+), 18 deletions(-)

Comments

Steven Price Nov. 8, 2023, 1:20 p.m. UTC | #1
On 02/11/2023 14:15, AngeloGioacchino Del Regno wrote:
> The layout of the registers {TILER,SHADER,L2}_PWROFF_LO, used to request
> powering off cores, is the same as the {TILER,SHADER,L2}_PWRON_LO ones:
> this means that in order to request poweroff of cores, we are supposed
> to write a bitmask of cores that should be powered off!
> This means that the panfrost_gpu_power_off() function has always been
> doing nothing.
> 
> Fix powering off the GPU by writing a bitmask of the cores to poweroff
> to the relevant PWROFF_LO registers and then check that the transition
> (from ON to OFF) has finished by polling the relevant PWRTRANS_LO
> registers.
> 
> While at it, in order to avoid code duplication, move the core mask
> logic from panfrost_gpu_power_on() to a new panfrost_get_core_mask()
> function, used in both poweron and poweroff.
> 
> Fixes: f3ba91228e8e ("drm/panfrost: Add initial panfrost driver")
> Signed-off-by: AngeloGioacchino Del Regno <angelogioacchino.delregno@collabora.com>

Reviewed-by: Steven Price <steven.price@arm.com>

Thanks,

Steve
Krzysztof Kozlowski Nov. 21, 2023, 3:34 p.m. UTC | #2
On 08/11/2023 14:20, Steven Price wrote:
> On 02/11/2023 14:15, AngeloGioacchino Del Regno wrote:
>> The layout of the registers {TILER,SHADER,L2}_PWROFF_LO, used to request
>> powering off cores, is the same as the {TILER,SHADER,L2}_PWRON_LO ones:
>> this means that in order to request poweroff of cores, we are supposed
>> to write a bitmask of cores that should be powered off!
>> This means that the panfrost_gpu_power_off() function has always been
>> doing nothing.
>>
>> Fix powering off the GPU by writing a bitmask of the cores to poweroff
>> to the relevant PWROFF_LO registers and then check that the transition
>> (from ON to OFF) has finished by polling the relevant PWRTRANS_LO
>> registers.
>>
>> While at it, in order to avoid code duplication, move the core mask
>> logic from panfrost_gpu_power_on() to a new panfrost_get_core_mask()
>> function, used in both poweron and poweroff.
>>
>> Fixes: f3ba91228e8e ("drm/panfrost: Add initial panfrost driver")
>> Signed-off-by: AngeloGioacchino Del Regno <angelogioacchino.delregno@collabora.com>


Hi,

This commit was added to next recently but it causes "external abort on
non-linefetch" during boot of my Odroid HC1 board.

At least bisect points to it.

If fixed, please add:

Reported-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>

[    4.861683] 8<--- cut here ---
[    4.863429] Unhandled fault: external abort on non-linefetch (0x1008) at 0xf0c8802c
[    4.871018] [f0c8802c] *pgd=433ed811, *pte=11800653, *ppte=11800453
...
[    5.164010]  panfrost_gpu_irq_handler from __handle_irq_event_percpu+0xcc/0x31c
[    5.171276]  __handle_irq_event_percpu from handle_irq_event+0x38/0x80
[    5.177765]  handle_irq_event from handle_fasteoi_irq+0x9c/0x250
[    5.183743]  handle_fasteoi_irq from generic_handle_domain_irq+0x28/0x38
[    5.190417]  generic_handle_domain_irq from gic_handle_irq+0x88/0xa8
[    5.196741]  gic_handle_irq from generic_handle_arch_irq+0x34/0x44
[    5.202893]  generic_handle_arch_irq from __irq_svc+0x8c/0xd0

Full log:
https://krzk.eu/#/builders/21/builds/4392/steps/11/logs/serial0

1. exynos_defconfig
2. HW: Odroid HC1
   ARMv7, octa-core (Cortex-A7+A15), Exynos5422 SoC
   arm,mali-t628
   
Bisect log:

git bisect start
# bad: [07b677953b9dca02928be323e2db853511305fa9] Add linux-next specific files for 20231121
git bisect bad 07b677953b9dca02928be323e2db853511305fa9
# good: [98b1cc82c4affc16f5598d4fa14b1858671b2263] Linux 6.7-rc2
git bisect good 98b1cc82c4affc16f5598d4fa14b1858671b2263
# good: [13e2401d5bdc7f5a30f2651c99f0e3374cdda815] Merge branch 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git
git bisect good 13e2401d5bdc7f5a30f2651c99f0e3374cdda815
# bad: [3b586cd6d8e51c428675312e7c3f634eb96337e9] Merge branch 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/sre/linux-power-supply.git
git bisect bad 3b586cd6d8e51c428675312e7c3f634eb96337e9
# bad: [9d63fd5f05248c78d9a66ce5dbc9cf5649054848] Merge branch 'drm-next' of https://gitlab.freedesktop.org/agd5f/linux
git bisect bad 9d63fd5f05248c78d9a66ce5dbc9cf5649054848
# bad: [5dea0c3fedee65413271a5700e653eff633e9a7f] drm/panel-elida-kd35t133: Drop shutdown logic
git bisect bad 5dea0c3fedee65413271a5700e653eff633e9a7f
# good: [48d45fac3940347becd290b96b2fc6d5ad8171f7] accel/ivpu: Remove support for uncached buffers
git bisect good 48d45fac3940347becd290b96b2fc6d5ad8171f7
# bad: [809ef191ee600e8bcbe2f8a769e00d2d54c16094] drm/gpuvm: add drm_gpuvm_flags to drm_gpuvm
git bisect bad 809ef191ee600e8bcbe2f8a769e00d2d54c16094
# good: [a78422e9dff366b3a46ae44caf6ec8ded9c9fc2f] drm/sched: implement dynamic job-flow control
git bisect good a78422e9dff366b3a46ae44caf6ec8ded9c9fc2f
# bad: [e4178256094a76cc36d9b9aabe7482615959b26f] drm/virtio: use uint64_t more in virtio_gpu_context_init_ioctl
git bisect bad e4178256094a76cc36d9b9aabe7482615959b26f
# bad: [56e76c0179185568049913257c18069293f8bde9] drm/panfrost: Implement ability to turn on/off GPU clocks in suspend
git bisect bad 56e76c0179185568049913257c18069293f8bde9
# bad: [57d4e26717b030fd794df3534e6b2e806eb761e4] drm/panfrost: Perform hard reset to recover GPU if soft reset fails
git bisect bad 57d4e26717b030fd794df3534e6b2e806eb761e4
# bad: [22aa1a209018dc2eca78745f7666db63637cd5dc] drm/panfrost: Really power off GPU cores in panfrost_gpu_power_off()
git bisect bad 22aa1a209018dc2eca78745f7666db63637cd5dc
# first bad commit: [22aa1a209018dc2eca78745f7666db63637cd5dc] drm/panfrost: Really power off GPU cores in panfrost_gpu_power_off()
  

Best regards,
Krzysztof
AngeloGioacchino Del Regno Nov. 21, 2023, 4:11 p.m. UTC | #3
Il 21/11/23 16:34, Krzysztof Kozlowski ha scritto:
> On 08/11/2023 14:20, Steven Price wrote:
>> On 02/11/2023 14:15, AngeloGioacchino Del Regno wrote:
>>> The layout of the registers {TILER,SHADER,L2}_PWROFF_LO, used to request
>>> powering off cores, is the same as the {TILER,SHADER,L2}_PWRON_LO ones:
>>> this means that in order to request poweroff of cores, we are supposed
>>> to write a bitmask of cores that should be powered off!
>>> This means that the panfrost_gpu_power_off() function has always been
>>> doing nothing.
>>>
>>> Fix powering off the GPU by writing a bitmask of the cores to poweroff
>>> to the relevant PWROFF_LO registers and then check that the transition
>>> (from ON to OFF) has finished by polling the relevant PWRTRANS_LO
>>> registers.
>>>
>>> While at it, in order to avoid code duplication, move the core mask
>>> logic from panfrost_gpu_power_on() to a new panfrost_get_core_mask()
>>> function, used in both poweron and poweroff.
>>>
>>> Fixes: f3ba91228e8e ("drm/panfrost: Add initial panfrost driver")
>>> Signed-off-by: AngeloGioacchino Del Regno <angelogioacchino.delregno@collabora.com>
> 
> 
> Hi,
> 
> This commit was added to next recently but it causes "external abort on
> non-linefetch" during boot of my Odroid HC1 board.
> 
> At least bisect points to it.
> 
> If fixed, please add:
> 
> Reported-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
> 
> [    4.861683] 8<--- cut here ---
> [    4.863429] Unhandled fault: external abort on non-linefetch (0x1008) at 0xf0c8802c
> [    4.871018] [f0c8802c] *pgd=433ed811, *pte=11800653, *ppte=11800453
> ...
> [    5.164010]  panfrost_gpu_irq_handler from __handle_irq_event_percpu+0xcc/0x31c
> [    5.171276]  __handle_irq_event_percpu from handle_irq_event+0x38/0x80
> [    5.177765]  handle_irq_event from handle_fasteoi_irq+0x9c/0x250
> [    5.183743]  handle_fasteoi_irq from generic_handle_domain_irq+0x28/0x38
> [    5.190417]  generic_handle_domain_irq from gic_handle_irq+0x88/0xa8
> [    5.196741]  gic_handle_irq from generic_handle_arch_irq+0x34/0x44
> [    5.202893]  generic_handle_arch_irq from __irq_svc+0x8c/0xd0
> 
> Full log:
> https://krzk.eu/#/builders/21/builds/4392/steps/11/logs/serial0
> 

Hey Krzysztof,

This is interesting. It might be about the cores that are missing from the partial
core_mask raising interrupts, but an external abort on non-linefetch is strange to
see here.

Would you be available for some tests?

I'm thinking to call power_off on all cores (all shaders, all tilers, all l2s),
regardless of what panfrost_get_core_mask() says, as it could be that your GPU
powers on the cores that are unused by Panfrost by default, and that then we never
turn them off, escalating to this issue.

If you can please please please test:

void panfrost_gpu_power_off(struct panfrost_device *pfdev)
{
	u64 core_mask = panfrost_get_core_mask(pfdev);
	int ret;
	u32 val;

	gpu_write(pfdev, SHADER_PWROFF_LO, pfdev->features.shader_present);
	gpu_write(pfdev, SHADER_PWROFF_HI, pfdev->features.shader_present >> 32);
	ret = readl_relaxed_poll_timeout(pfdev->iomem + SHADER_PWRTRANS_LO,
					 val, !val, 1, 1000);
	if (ret)
		dev_err(pfdev->dev, "shader power transition timeout");

	gpu_write(pfdev, TILER_PWROFF_LO, pfdev->features.tiler_present);
	gpu_write(pfdev, TILER_PWROFF_HI, pfdev->features.tiler_present >> 32);
	ret = readl_relaxed_poll_timeout(pfdev->iomem + TILER_PWRTRANS_LO,
					 val, !val, 1, 1000);
	if (ret)
		dev_err(pfdev->dev, "tiler power transition timeout");

	gpu_write(pfdev, L2_PWROFF_LO, pfdev->features.l2_present);
	ret = readl_poll_timeout(pfdev->iomem + L2_PWRTRANS_LO,
					 val, !val, 0, 1000);
	if (ret)
		dev_err(pfdev->dev, "l2 power transition timeout");

	gpu_write(pfdev, L2_PWROFF_HI, pfdev->features.l2_present >> 32);
	ret = readl_poll_timeout(pfdev->iomem + L2_PWRTRANS_HI,
					 val, !val, 0, 1000);
	if (ret)
		dev_err(pfdev->dev, "l2 power transition timeout");
}


This is a quick hack that might work. If this does actually work, the real fix
will be to PWROFF the extra cores only once, at panfrost_gpu_init() time, before
calling panfrost_gpu_power_on(), and to leave them disabled forever (until Panfrost
actually gets to support those extra cores for real).

Thanks,
Angelo

> 1. exynos_defconfig
> 2. HW: Odroid HC1
>     ARMv7, octa-core (Cortex-A7+A15), Exynos5422 SoC
>     arm,mali-t628
>     
> Bisect log:
> 
> git bisect start
> # bad: [07b677953b9dca02928be323e2db853511305fa9] Add linux-next specific files for 20231121
> git bisect bad 07b677953b9dca02928be323e2db853511305fa9
> # good: [98b1cc82c4affc16f5598d4fa14b1858671b2263] Linux 6.7-rc2
> git bisect good 98b1cc82c4affc16f5598d4fa14b1858671b2263
> # good: [13e2401d5bdc7f5a30f2651c99f0e3374cdda815] Merge branch 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git
> git bisect good 13e2401d5bdc7f5a30f2651c99f0e3374cdda815
> # bad: [3b586cd6d8e51c428675312e7c3f634eb96337e9] Merge branch 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/sre/linux-power-supply.git
> git bisect bad 3b586cd6d8e51c428675312e7c3f634eb96337e9
> # bad: [9d63fd5f05248c78d9a66ce5dbc9cf5649054848] Merge branch 'drm-next' of https://gitlab.freedesktop.org/agd5f/linux
> git bisect bad 9d63fd5f05248c78d9a66ce5dbc9cf5649054848
> # bad: [5dea0c3fedee65413271a5700e653eff633e9a7f] drm/panel-elida-kd35t133: Drop shutdown logic
> git bisect bad 5dea0c3fedee65413271a5700e653eff633e9a7f
> # good: [48d45fac3940347becd290b96b2fc6d5ad8171f7] accel/ivpu: Remove support for uncached buffers
> git bisect good 48d45fac3940347becd290b96b2fc6d5ad8171f7
> # bad: [809ef191ee600e8bcbe2f8a769e00d2d54c16094] drm/gpuvm: add drm_gpuvm_flags to drm_gpuvm
> git bisect bad 809ef191ee600e8bcbe2f8a769e00d2d54c16094
> # good: [a78422e9dff366b3a46ae44caf6ec8ded9c9fc2f] drm/sched: implement dynamic job-flow control
> git bisect good a78422e9dff366b3a46ae44caf6ec8ded9c9fc2f
> # bad: [e4178256094a76cc36d9b9aabe7482615959b26f] drm/virtio: use uint64_t more in virtio_gpu_context_init_ioctl
> git bisect bad e4178256094a76cc36d9b9aabe7482615959b26f
> # bad: [56e76c0179185568049913257c18069293f8bde9] drm/panfrost: Implement ability to turn on/off GPU clocks in suspend
> git bisect bad 56e76c0179185568049913257c18069293f8bde9
> # bad: [57d4e26717b030fd794df3534e6b2e806eb761e4] drm/panfrost: Perform hard reset to recover GPU if soft reset fails
> git bisect bad 57d4e26717b030fd794df3534e6b2e806eb761e4
> # bad: [22aa1a209018dc2eca78745f7666db63637cd5dc] drm/panfrost: Really power off GPU cores in panfrost_gpu_power_off()
> git bisect bad 22aa1a209018dc2eca78745f7666db63637cd5dc
> # first bad commit: [22aa1a209018dc2eca78745f7666db63637cd5dc] drm/panfrost: Really power off GPU cores in panfrost_gpu_power_off()
>    
> 
> Best regards,
> Krzysztof
>
Krzysztof Kozlowski Nov. 21, 2023, 4:35 p.m. UTC | #4
On 21/11/2023 17:11, AngeloGioacchino Del Regno wrote:
> Il 21/11/23 16:34, Krzysztof Kozlowski ha scritto:
>> On 08/11/2023 14:20, Steven Price wrote:
>>> On 02/11/2023 14:15, AngeloGioacchino Del Regno wrote:
>>>> The layout of the registers {TILER,SHADER,L2}_PWROFF_LO, used to request
>>>> powering off cores, is the same as the {TILER,SHADER,L2}_PWRON_LO ones:
>>>> this means that in order to request poweroff of cores, we are supposed
>>>> to write a bitmask of cores that should be powered off!
>>>> This means that the panfrost_gpu_power_off() function has always been
>>>> doing nothing.
>>>>
>>>> Fix powering off the GPU by writing a bitmask of the cores to poweroff
>>>> to the relevant PWROFF_LO registers and then check that the transition
>>>> (from ON to OFF) has finished by polling the relevant PWRTRANS_LO
>>>> registers.
>>>>
>>>> While at it, in order to avoid code duplication, move the core mask
>>>> logic from panfrost_gpu_power_on() to a new panfrost_get_core_mask()
>>>> function, used in both poweron and poweroff.
>>>>
>>>> Fixes: f3ba91228e8e ("drm/panfrost: Add initial panfrost driver")
>>>> Signed-off-by: AngeloGioacchino Del Regno <angelogioacchino.delregno@collabora.com>
>>
>>
>> Hi,
>>
>> This commit was added to next recently but it causes "external abort on
>> non-linefetch" during boot of my Odroid HC1 board.
>>
>> At least bisect points to it.
>>
>> If fixed, please add:
>>
>> Reported-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
>>
>> [    4.861683] 8<--- cut here ---
>> [    4.863429] Unhandled fault: external abort on non-linefetch (0x1008) at 0xf0c8802c
>> [    4.871018] [f0c8802c] *pgd=433ed811, *pte=11800653, *ppte=11800453
>> ...
>> [    5.164010]  panfrost_gpu_irq_handler from __handle_irq_event_percpu+0xcc/0x31c
>> [    5.171276]  __handle_irq_event_percpu from handle_irq_event+0x38/0x80
>> [    5.177765]  handle_irq_event from handle_fasteoi_irq+0x9c/0x250
>> [    5.183743]  handle_fasteoi_irq from generic_handle_domain_irq+0x28/0x38
>> [    5.190417]  generic_handle_domain_irq from gic_handle_irq+0x88/0xa8
>> [    5.196741]  gic_handle_irq from generic_handle_arch_irq+0x34/0x44
>> [    5.202893]  generic_handle_arch_irq from __irq_svc+0x8c/0xd0
>>
>> Full log:
>> https://krzk.eu/#/builders/21/builds/4392/steps/11/logs/serial0
>>
> 
> Hey Krzysztof,
> 
> This is interesting. It might be about the cores that are missing from the partial
> core_mask raising interrupts, but an external abort on non-linefetch is strange to
> see here.
> 
> Would you be available for some tests?
> 
> I'm thinking to call power_off on all cores (all shaders, all tilers, all l2s),
> regardless of what panfrost_get_core_mask() says, as it could be that your GPU
> powers on the cores that are unused by Panfrost by default, and that then we never
> turn them off, escalating to this issue.
> 
> If you can please please please test:
> 
> void panfrost_gpu_power_off(struct panfrost_device *pfdev)
> {
> 	u64 core_mask = panfrost_get_core_mask(pfdev);
> 	int ret;
> 	u32 val;
> 
> 	gpu_write(pfdev, SHADER_PWROFF_LO, pfdev->features.shader_present);
> 	gpu_write(pfdev, SHADER_PWROFF_HI, pfdev->features.shader_present >> 32);
> 	ret = readl_relaxed_poll_timeout(pfdev->iomem + SHADER_PWRTRANS_LO,
> 					 val, !val, 1, 1000);
> 	if (ret)
> 		dev_err(pfdev->dev, "shader power transition timeout");
> 
> 	gpu_write(pfdev, TILER_PWROFF_LO, pfdev->features.tiler_present);
> 	gpu_write(pfdev, TILER_PWROFF_HI, pfdev->features.tiler_present >> 32);
> 	ret = readl_relaxed_poll_timeout(pfdev->iomem + TILER_PWRTRANS_LO,
> 					 val, !val, 1, 1000);
> 	if (ret)
> 		dev_err(pfdev->dev, "tiler power transition timeout");
> 
> 	gpu_write(pfdev, L2_PWROFF_LO, pfdev->features.l2_present);
> 	ret = readl_poll_timeout(pfdev->iomem + L2_PWRTRANS_LO,
> 					 val, !val, 0, 1000);
> 	if (ret)
> 		dev_err(pfdev->dev, "l2 power transition timeout");
> 
> 	gpu_write(pfdev, L2_PWROFF_HI, pfdev->features.l2_present >> 32);
> 	ret = readl_poll_timeout(pfdev->iomem + L2_PWRTRANS_HI,
> 					 val, !val, 0, 1000);
> 	if (ret)
> 		dev_err(pfdev->dev, "l2 power transition timeout");
> }
> 

Send a diff please - against next or some other commit sha from next.

Best regards,
Krzysztof
Boris Brezillon Nov. 21, 2023, 4:55 p.m. UTC | #5
On Tue, 21 Nov 2023 17:11:42 +0100
AngeloGioacchino Del Regno <angelogioacchino.delregno@collabora.com>
wrote:

> Il 21/11/23 16:34, Krzysztof Kozlowski ha scritto:
> > On 08/11/2023 14:20, Steven Price wrote:  
> >> On 02/11/2023 14:15, AngeloGioacchino Del Regno wrote:  
> >>> The layout of the registers {TILER,SHADER,L2}_PWROFF_LO, used to request
> >>> powering off cores, is the same as the {TILER,SHADER,L2}_PWRON_LO ones:
> >>> this means that in order to request poweroff of cores, we are supposed
> >>> to write a bitmask of cores that should be powered off!
> >>> This means that the panfrost_gpu_power_off() function has always been
> >>> doing nothing.
> >>>
> >>> Fix powering off the GPU by writing a bitmask of the cores to poweroff
> >>> to the relevant PWROFF_LO registers and then check that the transition
> >>> (from ON to OFF) has finished by polling the relevant PWRTRANS_LO
> >>> registers.
> >>>
> >>> While at it, in order to avoid code duplication, move the core mask
> >>> logic from panfrost_gpu_power_on() to a new panfrost_get_core_mask()
> >>> function, used in both poweron and poweroff.
> >>>
> >>> Fixes: f3ba91228e8e ("drm/panfrost: Add initial panfrost driver")
> >>> Signed-off-by: AngeloGioacchino Del Regno <angelogioacchino.delregno@collabora.com>  
> > 
> > 
> > Hi,
> > 
> > This commit was added to next recently but it causes "external abort on
> > non-linefetch" during boot of my Odroid HC1 board.
> > 
> > At least bisect points to it.
> > 
> > If fixed, please add:
> > 
> > Reported-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
> > 
> > [    4.861683] 8<--- cut here ---
> > [    4.863429] Unhandled fault: external abort on non-linefetch (0x1008) at 0xf0c8802c
> > [    4.871018] [f0c8802c] *pgd=433ed811, *pte=11800653, *ppte=11800453
> > ...
> > [    5.164010]  panfrost_gpu_irq_handler from __handle_irq_event_percpu+0xcc/0x31c
> > [    5.171276]  __handle_irq_event_percpu from handle_irq_event+0x38/0x80
> > [    5.177765]  handle_irq_event from handle_fasteoi_irq+0x9c/0x250
> > [    5.183743]  handle_fasteoi_irq from generic_handle_domain_irq+0x28/0x38
> > [    5.190417]  generic_handle_domain_irq from gic_handle_irq+0x88/0xa8
> > [    5.196741]  gic_handle_irq from generic_handle_arch_irq+0x34/0x44
> > [    5.202893]  generic_handle_arch_irq from __irq_svc+0x8c/0xd0
> > 
> > Full log:
> > https://krzk.eu/#/builders/21/builds/4392/steps/11/logs/serial0
> >   
> 
> Hey Krzysztof,
> 
> This is interesting. It might be about the cores that are missing from the partial
> core_mask raising interrupts, but an external abort on non-linefetch is strange to
> see here.

I've seen such external aborts in the past, and the fault type has
often been misleading. It's unlikely to have anything to do with a
non-linefetch access, but it might be caused by a register access after
the clock or power domain driving the register bank has been disabled.
The following diff might help validate this theory. If that works, we
probably want to make sure we synchronize IRQs before disabling in the
suspend path.

--->8---
diff --git a/drivers/gpu/drm/panfrost/panfrost_regs.h b/drivers/gpu/drm/panfrost/panfrost_regs.h
index 55ec807550b3..98df66e5cc9b 100644
--- a/drivers/gpu/drm/panfrost/panfrost_regs.h
+++ b/drivers/gpu/drm/panfrost/panfrost_regs.h
@@ -34,8 +34,6 @@
          (GPU_IRQ_FAULT                        |\
           GPU_IRQ_MULTIPLE_FAULT               |\
           GPU_IRQ_RESET_COMPLETED              |\
-          GPU_IRQ_POWER_CHANGED                |\
-          GPU_IRQ_POWER_CHANGED_ALL            |\
           GPU_IRQ_PERFCNT_SAMPLE_COMPLETED     |\
           GPU_IRQ_CLEAN_CACHES_COMPLETED)
 #define GPU_IRQ_MASK_ERROR                     \
Krzysztof Kozlowski Nov. 21, 2023, 5:08 p.m. UTC | #6
On 21/11/2023 17:55, Boris Brezillon wrote:
> On Tue, 21 Nov 2023 17:11:42 +0100
> AngeloGioacchino Del Regno <angelogioacchino.delregno@collabora.com>
> wrote:
> 
>> Il 21/11/23 16:34, Krzysztof Kozlowski ha scritto:
>>> On 08/11/2023 14:20, Steven Price wrote:  
>>>> On 02/11/2023 14:15, AngeloGioacchino Del Regno wrote:  
>>>>> The layout of the registers {TILER,SHADER,L2}_PWROFF_LO, used to request
>>>>> powering off cores, is the same as the {TILER,SHADER,L2}_PWRON_LO ones:
>>>>> this means that in order to request poweroff of cores, we are supposed
>>>>> to write a bitmask of cores that should be powered off!
>>>>> This means that the panfrost_gpu_power_off() function has always been
>>>>> doing nothing.
>>>>>
>>>>> Fix powering off the GPU by writing a bitmask of the cores to poweroff
>>>>> to the relevant PWROFF_LO registers and then check that the transition
>>>>> (from ON to OFF) has finished by polling the relevant PWRTRANS_LO
>>>>> registers.
>>>>>
>>>>> While at it, in order to avoid code duplication, move the core mask
>>>>> logic from panfrost_gpu_power_on() to a new panfrost_get_core_mask()
>>>>> function, used in both poweron and poweroff.
>>>>>
>>>>> Fixes: f3ba91228e8e ("drm/panfrost: Add initial panfrost driver")
>>>>> Signed-off-by: AngeloGioacchino Del Regno <angelogioacchino.delregno@collabora.com>  
>>>
>>>
>>> Hi,
>>>
>>> This commit was added to next recently but it causes "external abort on
>>> non-linefetch" during boot of my Odroid HC1 board.
>>>
>>> At least bisect points to it.
>>>
>>> If fixed, please add:
>>>
>>> Reported-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
>>>
>>> [    4.861683] 8<--- cut here ---
>>> [    4.863429] Unhandled fault: external abort on non-linefetch (0x1008) at 0xf0c8802c
>>> [    4.871018] [f0c8802c] *pgd=433ed811, *pte=11800653, *ppte=11800453
>>> ...
>>> [    5.164010]  panfrost_gpu_irq_handler from __handle_irq_event_percpu+0xcc/0x31c
>>> [    5.171276]  __handle_irq_event_percpu from handle_irq_event+0x38/0x80
>>> [    5.177765]  handle_irq_event from handle_fasteoi_irq+0x9c/0x250
>>> [    5.183743]  handle_fasteoi_irq from generic_handle_domain_irq+0x28/0x38
>>> [    5.190417]  generic_handle_domain_irq from gic_handle_irq+0x88/0xa8
>>> [    5.196741]  gic_handle_irq from generic_handle_arch_irq+0x34/0x44
>>> [    5.202893]  generic_handle_arch_irq from __irq_svc+0x8c/0xd0
>>>
>>> Full log:
>>> https://krzk.eu/#/builders/21/builds/4392/steps/11/logs/serial0
>>>   
>>
>> Hey Krzysztof,
>>
>> This is interesting. It might be about the cores that are missing from the partial
>> core_mask raising interrupts, but an external abort on non-linefetch is strange to
>> see here.
> 
> I've seen such external aborts in the past, and the fault type has
> often been misleading. It's unlikely to have anything to do with a

Yeah, often accessing device with power or clocks gated.

> non-linefetch access, but it might be caused by a register access after
> the clock or power domain driving the register bank has been disabled.
> The following diff might help validate this theory. If that works, we
> probably want to make sure we synchronize IRQs before disabling in the
> suspend path.
> 
> --->8---
> diff --git a/drivers/gpu/drm/panfrost/panfrost_regs.h b/drivers/gpu/drm/panfrost/panfrost_regs.h
> index 55ec807550b3..98df66e5cc9b 100644
> --- a/drivers/gpu/drm/panfrost/panfrost_regs.h
> +++ b/drivers/gpu/drm/panfrost/panfrost_regs.h
> @@ -34,8 +34,6 @@
>           (GPU_IRQ_FAULT                        |\
>            GPU_IRQ_MULTIPLE_FAULT               |\
>            GPU_IRQ_RESET_COMPLETED              |\
> -          GPU_IRQ_POWER_CHANGED                |\
> -          GPU_IRQ_POWER_CHANGED_ALL            |\

This helped, at least for this issue (next-20231121). Much later in
user-space boot I have lockups:
watchdog: BUG: soft lockup - CPU#4 stuck for 26s! [kworker/4:1:61]

[   56.329224]  smp_call_function_single from
__sync_rcu_exp_select_node_cpus+0x29c/0x78c
[   56.337111]  __sync_rcu_exp_select_node_cpus from
sync_rcu_exp_select_cpus+0x334/0x878
[   56.344995]  sync_rcu_exp_select_cpus from wait_rcu_exp_gp+0xc/0x18
[   56.351231]  wait_rcu_exp_gp from process_one_work+0x20c/0x620
[   56.357038]  process_one_work from worker_thread+0x1d0/0x488
[   56.362668]  worker_thread from kthread+0x104/0x138
[   56.367521]  kthread from ret_from_fork+0x14/0x28

But anyway the external abort does not appear.

Best regards,
Krzysztof
Boris Brezillon Nov. 22, 2023, 9:02 a.m. UTC | #7
On Tue, 21 Nov 2023 18:08:44 +0100
Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org> wrote:

> > non-linefetch access, but it might be caused by a register access after
> > the clock or power domain driving the register bank has been disabled.
> > The following diff might help validate this theory. If that works, we
> > probably want to make sure we synchronize IRQs before disabling in the
> > suspend path.
> >   
> > --->8---  
> > diff --git a/drivers/gpu/drm/panfrost/panfrost_regs.h b/drivers/gpu/drm/panfrost/panfrost_regs.h
> > index 55ec807550b3..98df66e5cc9b 100644
> > --- a/drivers/gpu/drm/panfrost/panfrost_regs.h
> > +++ b/drivers/gpu/drm/panfrost/panfrost_regs.h
> > @@ -34,8 +34,6 @@
> >           (GPU_IRQ_FAULT                        |\
> >            GPU_IRQ_MULTIPLE_FAULT               |\
> >            GPU_IRQ_RESET_COMPLETED              |\
> > -          GPU_IRQ_POWER_CHANGED                |\
> > -          GPU_IRQ_POWER_CHANGED_ALL            |\  
> 
> This helped, at least for this issue (next-20231121). Much later in
> user-space boot I have lockups:
> watchdog: BUG: soft lockup - CPU#4 stuck for 26s! [kworker/4:1:61]

Hm, if this doesn't happen with "drm/panfrost: Really power off GPU
cores in panfrost_gpu_power_off()" reverted, it might be related to the
issue Angelo was describing, though I'd expect it to lead to job
timeouts, not the sort of soft lockup reported here.

> 
> [   56.329224]  smp_call_function_single from
> __sync_rcu_exp_select_node_cpus+0x29c/0x78c
> [   56.337111]  __sync_rcu_exp_select_node_cpus from
> sync_rcu_exp_select_cpus+0x334/0x878
> [   56.344995]  sync_rcu_exp_select_cpus from wait_rcu_exp_gp+0xc/0x18
> [   56.351231]  wait_rcu_exp_gp from process_one_work+0x20c/0x620
> [   56.357038]  process_one_work from worker_thread+0x1d0/0x488
> [   56.362668]  worker_thread from kthread+0x104/0x138
> [   56.367521]  kthread from ret_from_fork+0x14/0x28
> 
> But anyway the external abort does not appear.

Thanks for testing! As I said in my previous reply, I think the proper
fix for this particular issue would be to mask all panfrost IRQs
(writing 0 to xxx_INT_MASK) + call synchronize_irq() from
panfrost_device_suspend(), to make sure pending interrupts are flushed
and the handlers can't be called again (or at least not with real
interrupts to process, if the IRQ line is shared) until the device is
resumed.

FWIW, after fighting with annoying bugs in the interrupt handling logic
and its interactions with runtime PM, I've decided to automate some of
this in panthor [1]. Also made the power on/off logic generic [2], with
macros to allow powering on/off specific blocks. Not saying we should
do that in panfrost, just posting it as a reference, in case someone is
interested.

[1]https://gitlab.freedesktop.org/bbrezillon/linux/-/blob/panthor-v3+rk3588/drivers/gpu/drm/panthor/panthor_device.h?ref_type=heads#L279
[2]https://gitlab.freedesktop.org/bbrezillon/linux/-/blob/panthor-v3+rk3588/drivers/gpu/drm/panthor/panthor_gpu.c?ref_type=heads#L279
[3]https://gitlab.freedesktop.org/bbrezillon/linux/-/blob/panthor-v3+rk3588/drivers/gpu/drm/panthor/panthor_gpu.h?ref_type=heads#L24
AngeloGioacchino Del Regno Nov. 22, 2023, 9:06 a.m. UTC | #8
Il 21/11/23 18:08, Krzysztof Kozlowski ha scritto:
> On 21/11/2023 17:55, Boris Brezillon wrote:
>> On Tue, 21 Nov 2023 17:11:42 +0100
>> AngeloGioacchino Del Regno <angelogioacchino.delregno@collabora.com>
>> wrote:
>>
>>> Il 21/11/23 16:34, Krzysztof Kozlowski ha scritto:
>>>> On 08/11/2023 14:20, Steven Price wrote:
>>>>> On 02/11/2023 14:15, AngeloGioacchino Del Regno wrote:
>>>>>> The layout of the registers {TILER,SHADER,L2}_PWROFF_LO, used to request
>>>>>> powering off cores, is the same as the {TILER,SHADER,L2}_PWRON_LO ones:
>>>>>> this means that in order to request poweroff of cores, we are supposed
>>>>>> to write a bitmask of cores that should be powered off!
>>>>>> This means that the panfrost_gpu_power_off() function has always been
>>>>>> doing nothing.
>>>>>>
>>>>>> Fix powering off the GPU by writing a bitmask of the cores to poweroff
>>>>>> to the relevant PWROFF_LO registers and then check that the transition
>>>>>> (from ON to OFF) has finished by polling the relevant PWRTRANS_LO
>>>>>> registers.
>>>>>>
>>>>>> While at it, in order to avoid code duplication, move the core mask
>>>>>> logic from panfrost_gpu_power_on() to a new panfrost_get_core_mask()
>>>>>> function, used in both poweron and poweroff.
>>>>>>
>>>>>> Fixes: f3ba91228e8e ("drm/panfrost: Add initial panfrost driver")
>>>>>> Signed-off-by: AngeloGioacchino Del Regno <angelogioacchino.delregno@collabora.com>
>>>>
>>>>
>>>> Hi,
>>>>
>>>> This commit was added to next recently but it causes "external abort on
>>>> non-linefetch" during boot of my Odroid HC1 board.
>>>>
>>>> At least bisect points to it.
>>>>
>>>> If fixed, please add:
>>>>
>>>> Reported-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
>>>>
>>>> [    4.861683] 8<--- cut here ---
>>>> [    4.863429] Unhandled fault: external abort on non-linefetch (0x1008) at 0xf0c8802c
>>>> [    4.871018] [f0c8802c] *pgd=433ed811, *pte=11800653, *ppte=11800453
>>>> ...
>>>> [    5.164010]  panfrost_gpu_irq_handler from __handle_irq_event_percpu+0xcc/0x31c
>>>> [    5.171276]  __handle_irq_event_percpu from handle_irq_event+0x38/0x80
>>>> [    5.177765]  handle_irq_event from handle_fasteoi_irq+0x9c/0x250
>>>> [    5.183743]  handle_fasteoi_irq from generic_handle_domain_irq+0x28/0x38
>>>> [    5.190417]  generic_handle_domain_irq from gic_handle_irq+0x88/0xa8
>>>> [    5.196741]  gic_handle_irq from generic_handle_arch_irq+0x34/0x44
>>>> [    5.202893]  generic_handle_arch_irq from __irq_svc+0x8c/0xd0
>>>>
>>>> Full log:
>>>> https://krzk.eu/#/builders/21/builds/4392/steps/11/logs/serial0
>>>>    
>>>
>>> Hey Krzysztof,
>>>
>>> This is interesting. It might be about the cores that are missing from the partial
>>> core_mask raising interrupts, but an external abort on non-linefetch is strange to
>>> see here.
>>
>> I've seen such external aborts in the past, and the fault type has
>> often been misleading. It's unlikely to have anything to do with a
> 
> Yeah, often accessing device with power or clocks gated.
> 

Except my commit does *not* gate SoC power, nor SoC clocks 
Krzysztof Kozlowski Nov. 22, 2023, 9:29 a.m. UTC | #9
On 22/11/2023 10:06, AngeloGioacchino Del Regno wrote:

>>>>>    
>>>>
>>>> Hey Krzysztof,
>>>>
>>>> This is interesting. It might be about the cores that are missing from the partial
>>>> core_mask raising interrupts, but an external abort on non-linefetch is strange to
>>>> see here.
>>>
>>> I've seen such external aborts in the past, and the fault type has
>>> often been misleading. It's unlikely to have anything to do with a
>>
>> Yeah, often accessing device with power or clocks gated.
>>
> 
> Except my commit does *not* gate SoC power, nor SoC clocks 
Steven Price Nov. 22, 2023, 9:48 a.m. UTC | #10
On 22/11/2023 09:06, AngeloGioacchino Del Regno wrote:
> Il 21/11/23 18:08, Krzysztof Kozlowski ha scritto:
>> On 21/11/2023 17:55, Boris Brezillon wrote:
>>> On Tue, 21 Nov 2023 17:11:42 +0100
>>> AngeloGioacchino Del Regno <angelogioacchino.delregno@collabora.com>
>>> wrote:
>>>
>>>> Il 21/11/23 16:34, Krzysztof Kozlowski ha scritto:
>>>>> On 08/11/2023 14:20, Steven Price wrote:
>>>>>> On 02/11/2023 14:15, AngeloGioacchino Del Regno wrote:
>>>>>>> The layout of the registers {TILER,SHADER,L2}_PWROFF_LO, used to
>>>>>>> request
>>>>>>> powering off cores, is the same as the {TILER,SHADER,L2}_PWRON_LO
>>>>>>> ones:
>>>>>>> this means that in order to request poweroff of cores, we are
>>>>>>> supposed
>>>>>>> to write a bitmask of cores that should be powered off!
>>>>>>> This means that the panfrost_gpu_power_off() function has always
>>>>>>> been
>>>>>>> doing nothing.
>>>>>>>
>>>>>>> Fix powering off the GPU by writing a bitmask of the cores to
>>>>>>> poweroff
>>>>>>> to the relevant PWROFF_LO registers and then check that the
>>>>>>> transition
>>>>>>> (from ON to OFF) has finished by polling the relevant PWRTRANS_LO
>>>>>>> registers.
>>>>>>>
>>>>>>> While at it, in order to avoid code duplication, move the core mask
>>>>>>> logic from panfrost_gpu_power_on() to a new panfrost_get_core_mask()
>>>>>>> function, used in both poweron and poweroff.
>>>>>>>
>>>>>>> Fixes: f3ba91228e8e ("drm/panfrost: Add initial panfrost driver")
>>>>>>> Signed-off-by: AngeloGioacchino Del Regno
>>>>>>> <angelogioacchino.delregno@collabora.com>
>>>>>
>>>>>
>>>>> Hi,
>>>>>
>>>>> This commit was added to next recently but it causes "external
>>>>> abort on
>>>>> non-linefetch" during boot of my Odroid HC1 board.
>>>>>
>>>>> At least bisect points to it.
>>>>>
>>>>> If fixed, please add:
>>>>>
>>>>> Reported-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
>>>>>
>>>>> [    4.861683] 8<--- cut here ---
>>>>> [    4.863429] Unhandled fault: external abort on non-linefetch
>>>>> (0x1008) at 0xf0c8802c
>>>>> [    4.871018] [f0c8802c] *pgd=433ed811, *pte=11800653, *ppte=11800453
>>>>> ...
>>>>> [    5.164010]  panfrost_gpu_irq_handler from
>>>>> __handle_irq_event_percpu+0xcc/0x31c
>>>>> [    5.171276]  __handle_irq_event_percpu from
>>>>> handle_irq_event+0x38/0x80
>>>>> [    5.177765]  handle_irq_event from handle_fasteoi_irq+0x9c/0x250
>>>>> [    5.183743]  handle_fasteoi_irq from
>>>>> generic_handle_domain_irq+0x28/0x38
>>>>> [    5.190417]  generic_handle_domain_irq from
>>>>> gic_handle_irq+0x88/0xa8
>>>>> [    5.196741]  gic_handle_irq from generic_handle_arch_irq+0x34/0x44
>>>>> [    5.202893]  generic_handle_arch_irq from __irq_svc+0x8c/0xd0
>>>>>
>>>>> Full log:
>>>>> https://krzk.eu/#/builders/21/builds/4392/steps/11/logs/serial0
>>>>>    
>>>>
>>>> Hey Krzysztof,
>>>>
>>>> This is interesting. It might be about the cores that are missing
>>>> from the partial
>>>> core_mask raising interrupts, but an external abort on non-linefetch
>>>> is strange to
>>>> see here.
>>>
>>> I've seen such external aborts in the past, and the fault type has
>>> often been misleading. It's unlikely to have anything to do with a
>>
>> Yeah, often accessing device with power or clocks gated.
>>
> 
> Except my commit does *not* gate SoC power, nor SoC clocks 
Boris Brezillon Nov. 22, 2023, 9:54 a.m. UTC | #11
Hi Angelo,

On Wed, 22 Nov 2023 10:06:19 +0100
AngeloGioacchino Del Regno <angelogioacchino.delregno@collabora.com>
wrote:

> Il 21/11/23 18:08, Krzysztof Kozlowski ha scritto:
> > On 21/11/2023 17:55, Boris Brezillon wrote:  
> >> On Tue, 21 Nov 2023 17:11:42 +0100
> >> AngeloGioacchino Del Regno <angelogioacchino.delregno@collabora.com>
> >> wrote:
> >>  
> >>> Il 21/11/23 16:34, Krzysztof Kozlowski ha scritto:  
> >>>> On 08/11/2023 14:20, Steven Price wrote:  
> >>>>> On 02/11/2023 14:15, AngeloGioacchino Del Regno wrote:  
> >>>>>> The layout of the registers {TILER,SHADER,L2}_PWROFF_LO, used to request
> >>>>>> powering off cores, is the same as the {TILER,SHADER,L2}_PWRON_LO ones:
> >>>>>> this means that in order to request poweroff of cores, we are supposed
> >>>>>> to write a bitmask of cores that should be powered off!
> >>>>>> This means that the panfrost_gpu_power_off() function has always been
> >>>>>> doing nothing.
> >>>>>>
> >>>>>> Fix powering off the GPU by writing a bitmask of the cores to poweroff
> >>>>>> to the relevant PWROFF_LO registers and then check that the transition
> >>>>>> (from ON to OFF) has finished by polling the relevant PWRTRANS_LO
> >>>>>> registers.
> >>>>>>
> >>>>>> While at it, in order to avoid code duplication, move the core mask
> >>>>>> logic from panfrost_gpu_power_on() to a new panfrost_get_core_mask()
> >>>>>> function, used in both poweron and poweroff.
> >>>>>>
> >>>>>> Fixes: f3ba91228e8e ("drm/panfrost: Add initial panfrost driver")
> >>>>>> Signed-off-by: AngeloGioacchino Del Regno <angelogioacchino.delregno@collabora.com>  
> >>>>
> >>>>
> >>>> Hi,
> >>>>
> >>>> This commit was added to next recently but it causes "external abort on
> >>>> non-linefetch" during boot of my Odroid HC1 board.
> >>>>
> >>>> At least bisect points to it.
> >>>>
> >>>> If fixed, please add:
> >>>>
> >>>> Reported-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
> >>>>
> >>>> [    4.861683] 8<--- cut here ---
> >>>> [    4.863429] Unhandled fault: external abort on non-linefetch (0x1008) at 0xf0c8802c
> >>>> [    4.871018] [f0c8802c] *pgd=433ed811, *pte=11800653, *ppte=11800453
> >>>> ...
> >>>> [    5.164010]  panfrost_gpu_irq_handler from __handle_irq_event_percpu+0xcc/0x31c
> >>>> [    5.171276]  __handle_irq_event_percpu from handle_irq_event+0x38/0x80
> >>>> [    5.177765]  handle_irq_event from handle_fasteoi_irq+0x9c/0x250
> >>>> [    5.183743]  handle_fasteoi_irq from generic_handle_domain_irq+0x28/0x38
> >>>> [    5.190417]  generic_handle_domain_irq from gic_handle_irq+0x88/0xa8
> >>>> [    5.196741]  gic_handle_irq from generic_handle_arch_irq+0x34/0x44
> >>>> [    5.202893]  generic_handle_arch_irq from __irq_svc+0x8c/0xd0
> >>>>
> >>>> Full log:
> >>>> https://krzk.eu/#/builders/21/builds/4392/steps/11/logs/serial0
> >>>>      
> >>>
> >>> Hey Krzysztof,
> >>>
> >>> This is interesting. It might be about the cores that are missing from the partial
> >>> core_mask raising interrupts, but an external abort on non-linefetch is strange to
> >>> see here.  
> >>
> >> I've seen such external aborts in the past, and the fault type has
> >> often been misleading. It's unlikely to have anything to do with a  
> > 
> > Yeah, often accessing device with power or clocks gated.
> >   
> 
> Except my commit does *not* gate SoC power, nor SoC clocks 
AngeloGioacchino Del Regno Nov. 22, 2023, 10:23 a.m. UTC | #12
Il 22/11/23 10:54, Boris Brezillon ha scritto:
> Hi Angelo,
> 
> On Wed, 22 Nov 2023 10:06:19 +0100
> AngeloGioacchino Del Regno <angelogioacchino.delregno@collabora.com>
> wrote:
> 
>> Il 21/11/23 18:08, Krzysztof Kozlowski ha scritto:
>>> On 21/11/2023 17:55, Boris Brezillon wrote:
>>>> On Tue, 21 Nov 2023 17:11:42 +0100
>>>> AngeloGioacchino Del Regno <angelogioacchino.delregno@collabora.com>
>>>> wrote:
>>>>   
>>>>> Il 21/11/23 16:34, Krzysztof Kozlowski ha scritto:
>>>>>> On 08/11/2023 14:20, Steven Price wrote:
>>>>>>> On 02/11/2023 14:15, AngeloGioacchino Del Regno wrote:
>>>>>>>> The layout of the registers {TILER,SHADER,L2}_PWROFF_LO, used to request
>>>>>>>> powering off cores, is the same as the {TILER,SHADER,L2}_PWRON_LO ones:
>>>>>>>> this means that in order to request poweroff of cores, we are supposed
>>>>>>>> to write a bitmask of cores that should be powered off!
>>>>>>>> This means that the panfrost_gpu_power_off() function has always been
>>>>>>>> doing nothing.
>>>>>>>>
>>>>>>>> Fix powering off the GPU by writing a bitmask of the cores to poweroff
>>>>>>>> to the relevant PWROFF_LO registers and then check that the transition
>>>>>>>> (from ON to OFF) has finished by polling the relevant PWRTRANS_LO
>>>>>>>> registers.
>>>>>>>>
>>>>>>>> While at it, in order to avoid code duplication, move the core mask
>>>>>>>> logic from panfrost_gpu_power_on() to a new panfrost_get_core_mask()
>>>>>>>> function, used in both poweron and poweroff.
>>>>>>>>
>>>>>>>> Fixes: f3ba91228e8e ("drm/panfrost: Add initial panfrost driver")
>>>>>>>> Signed-off-by: AngeloGioacchino Del Regno <angelogioacchino.delregno@collabora.com>
>>>>>>
>>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> This commit was added to next recently but it causes "external abort on
>>>>>> non-linefetch" during boot of my Odroid HC1 board.
>>>>>>
>>>>>> At least bisect points to it.
>>>>>>
>>>>>> If fixed, please add:
>>>>>>
>>>>>> Reported-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
>>>>>>
>>>>>> [    4.861683] 8<--- cut here ---
>>>>>> [    4.863429] Unhandled fault: external abort on non-linefetch (0x1008) at 0xf0c8802c
>>>>>> [    4.871018] [f0c8802c] *pgd=433ed811, *pte=11800653, *ppte=11800453
>>>>>> ...
>>>>>> [    5.164010]  panfrost_gpu_irq_handler from __handle_irq_event_percpu+0xcc/0x31c
>>>>>> [    5.171276]  __handle_irq_event_percpu from handle_irq_event+0x38/0x80
>>>>>> [    5.177765]  handle_irq_event from handle_fasteoi_irq+0x9c/0x250
>>>>>> [    5.183743]  handle_fasteoi_irq from generic_handle_domain_irq+0x28/0x38
>>>>>> [    5.190417]  generic_handle_domain_irq from gic_handle_irq+0x88/0xa8
>>>>>> [    5.196741]  gic_handle_irq from generic_handle_arch_irq+0x34/0x44
>>>>>> [    5.202893]  generic_handle_arch_irq from __irq_svc+0x8c/0xd0
>>>>>>
>>>>>> Full log:
>>>>>> https://krzk.eu/#/builders/21/builds/4392/steps/11/logs/serial0
>>>>>>       
>>>>>
>>>>> Hey Krzysztof,
>>>>>
>>>>> This is interesting. It might be about the cores that are missing from the partial
>>>>> core_mask raising interrupts, but an external abort on non-linefetch is strange to
>>>>> see here.
>>>>
>>>> I've seen such external aborts in the past, and the fault type has
>>>> often been misleading. It's unlikely to have anything to do with a
>>>
>>> Yeah, often accessing device with power or clocks gated.
>>>    
>>
>> Except my commit does *not* gate SoC power, nor SoC clocks 
AngeloGioacchino Del Regno Nov. 22, 2023, 10:33 a.m. UTC | #13
Il 22/11/23 10:48, Steven Price ha scritto:
> On 22/11/2023 09:06, AngeloGioacchino Del Regno wrote:
>> Il 21/11/23 18:08, Krzysztof Kozlowski ha scritto:
>>> On 21/11/2023 17:55, Boris Brezillon wrote:
>>>> On Tue, 21 Nov 2023 17:11:42 +0100
>>>> AngeloGioacchino Del Regno <angelogioacchino.delregno@collabora.com>
>>>> wrote:
>>>>
>>>>> Il 21/11/23 16:34, Krzysztof Kozlowski ha scritto:
>>>>>> On 08/11/2023 14:20, Steven Price wrote:
>>>>>>> On 02/11/2023 14:15, AngeloGioacchino Del Regno wrote:
>>>>>>>> The layout of the registers {TILER,SHADER,L2}_PWROFF_LO, used to
>>>>>>>> request
>>>>>>>> powering off cores, is the same as the {TILER,SHADER,L2}_PWRON_LO
>>>>>>>> ones:
>>>>>>>> this means that in order to request poweroff of cores, we are
>>>>>>>> supposed
>>>>>>>> to write a bitmask of cores that should be powered off!
>>>>>>>> This means that the panfrost_gpu_power_off() function has always
>>>>>>>> been
>>>>>>>> doing nothing.
>>>>>>>>
>>>>>>>> Fix powering off the GPU by writing a bitmask of the cores to
>>>>>>>> poweroff
>>>>>>>> to the relevant PWROFF_LO registers and then check that the
>>>>>>>> transition
>>>>>>>> (from ON to OFF) has finished by polling the relevant PWRTRANS_LO
>>>>>>>> registers.
>>>>>>>>
>>>>>>>> While at it, in order to avoid code duplication, move the core mask
>>>>>>>> logic from panfrost_gpu_power_on() to a new panfrost_get_core_mask()
>>>>>>>> function, used in both poweron and poweroff.
>>>>>>>>
>>>>>>>> Fixes: f3ba91228e8e ("drm/panfrost: Add initial panfrost driver")
>>>>>>>> Signed-off-by: AngeloGioacchino Del Regno
>>>>>>>> <angelogioacchino.delregno@collabora.com>
>>>>>>
>>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> This commit was added to next recently but it causes "external
>>>>>> abort on
>>>>>> non-linefetch" during boot of my Odroid HC1 board.
>>>>>>
>>>>>> At least bisect points to it.
>>>>>>
>>>>>> If fixed, please add:
>>>>>>
>>>>>> Reported-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
>>>>>>
>>>>>> [    4.861683] 8<--- cut here ---
>>>>>> [    4.863429] Unhandled fault: external abort on non-linefetch
>>>>>> (0x1008) at 0xf0c8802c
>>>>>> [    4.871018] [f0c8802c] *pgd=433ed811, *pte=11800653, *ppte=11800453
>>>>>> ...
>>>>>> [    5.164010]  panfrost_gpu_irq_handler from
>>>>>> __handle_irq_event_percpu+0xcc/0x31c
>>>>>> [    5.171276]  __handle_irq_event_percpu from
>>>>>> handle_irq_event+0x38/0x80
>>>>>> [    5.177765]  handle_irq_event from handle_fasteoi_irq+0x9c/0x250
>>>>>> [    5.183743]  handle_fasteoi_irq from
>>>>>> generic_handle_domain_irq+0x28/0x38
>>>>>> [    5.190417]  generic_handle_domain_irq from
>>>>>> gic_handle_irq+0x88/0xa8
>>>>>> [    5.196741]  gic_handle_irq from generic_handle_arch_irq+0x34/0x44
>>>>>> [    5.202893]  generic_handle_arch_irq from __irq_svc+0x8c/0xd0
>>>>>>
>>>>>> Full log:
>>>>>> https://krzk.eu/#/builders/21/builds/4392/steps/11/logs/serial0
>>>>>>     
>>>>>
>>>>> Hey Krzysztof,
>>>>>
>>>>> This is interesting. It might be about the cores that are missing
>>>>> from the partial
>>>>> core_mask raising interrupts, but an external abort on non-linefetch
>>>>> is strange to
>>>>> see here.
>>>>
>>>> I've seen such external aborts in the past, and the fault type has
>>>> often been misleading. It's unlikely to have anything to do with a
>>>
>>> Yeah, often accessing device with power or clocks gated.
>>>
>>
>> Except my commit does *not* gate SoC power, nor SoC clocks 
Boris Brezillon Nov. 22, 2023, 10:42 a.m. UTC | #14
On Wed, 22 Nov 2023 11:23:05 +0100
AngeloGioacchino Del Regno <angelogioacchino.delregno@collabora.com>
wrote:

> Il 22/11/23 10:54, Boris Brezillon ha scritto:
> > Hi Angelo,
> > 
> > On Wed, 22 Nov 2023 10:06:19 +0100
> > AngeloGioacchino Del Regno <angelogioacchino.delregno@collabora.com>
> > wrote:
> >   
> >> Il 21/11/23 18:08, Krzysztof Kozlowski ha scritto:  
> >>> On 21/11/2023 17:55, Boris Brezillon wrote:  
> >>>> On Tue, 21 Nov 2023 17:11:42 +0100
> >>>> AngeloGioacchino Del Regno <angelogioacchino.delregno@collabora.com>
> >>>> wrote:
> >>>>     
> >>>>> Il 21/11/23 16:34, Krzysztof Kozlowski ha scritto:  
> >>>>>> On 08/11/2023 14:20, Steven Price wrote:  
> >>>>>>> On 02/11/2023 14:15, AngeloGioacchino Del Regno wrote:  
> >>>>>>>> The layout of the registers {TILER,SHADER,L2}_PWROFF_LO, used to request
> >>>>>>>> powering off cores, is the same as the {TILER,SHADER,L2}_PWRON_LO ones:
> >>>>>>>> this means that in order to request poweroff of cores, we are supposed
> >>>>>>>> to write a bitmask of cores that should be powered off!
> >>>>>>>> This means that the panfrost_gpu_power_off() function has always been
> >>>>>>>> doing nothing.
> >>>>>>>>
> >>>>>>>> Fix powering off the GPU by writing a bitmask of the cores to poweroff
> >>>>>>>> to the relevant PWROFF_LO registers and then check that the transition
> >>>>>>>> (from ON to OFF) has finished by polling the relevant PWRTRANS_LO
> >>>>>>>> registers.
> >>>>>>>>
> >>>>>>>> While at it, in order to avoid code duplication, move the core mask
> >>>>>>>> logic from panfrost_gpu_power_on() to a new panfrost_get_core_mask()
> >>>>>>>> function, used in both poweron and poweroff.
> >>>>>>>>
> >>>>>>>> Fixes: f3ba91228e8e ("drm/panfrost: Add initial panfrost driver")
> >>>>>>>> Signed-off-by: AngeloGioacchino Del Regno <angelogioacchino.delregno@collabora.com>  
> >>>>>>
> >>>>>>
> >>>>>> Hi,
> >>>>>>
> >>>>>> This commit was added to next recently but it causes "external abort on
> >>>>>> non-linefetch" during boot of my Odroid HC1 board.
> >>>>>>
> >>>>>> At least bisect points to it.
> >>>>>>
> >>>>>> If fixed, please add:
> >>>>>>
> >>>>>> Reported-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
> >>>>>>
> >>>>>> [    4.861683] 8<--- cut here ---
> >>>>>> [    4.863429] Unhandled fault: external abort on non-linefetch (0x1008) at 0xf0c8802c
> >>>>>> [    4.871018] [f0c8802c] *pgd=433ed811, *pte=11800653, *ppte=11800453
> >>>>>> ...
> >>>>>> [    5.164010]  panfrost_gpu_irq_handler from __handle_irq_event_percpu+0xcc/0x31c
> >>>>>> [    5.171276]  __handle_irq_event_percpu from handle_irq_event+0x38/0x80
> >>>>>> [    5.177765]  handle_irq_event from handle_fasteoi_irq+0x9c/0x250
> >>>>>> [    5.183743]  handle_fasteoi_irq from generic_handle_domain_irq+0x28/0x38
> >>>>>> [    5.190417]  generic_handle_domain_irq from gic_handle_irq+0x88/0xa8
> >>>>>> [    5.196741]  gic_handle_irq from generic_handle_arch_irq+0x34/0x44
> >>>>>> [    5.202893]  generic_handle_arch_irq from __irq_svc+0x8c/0xd0
> >>>>>>
> >>>>>> Full log:
> >>>>>> https://krzk.eu/#/builders/21/builds/4392/steps/11/logs/serial0
> >>>>>>         
> >>>>>
> >>>>> Hey Krzysztof,
> >>>>>
> >>>>> This is interesting. It might be about the cores that are missing from the partial
> >>>>> core_mask raising interrupts, but an external abort on non-linefetch is strange to
> >>>>> see here.  
> >>>>
> >>>> I've seen such external aborts in the past, and the fault type has
> >>>> often been misleading. It's unlikely to have anything to do with a  
> >>>
> >>> Yeah, often accessing device with power or clocks gated.
> >>>      
> >>
> >> Except my commit does *not* gate SoC power, nor SoC clocks 
Marek Szyprowski Nov. 24, 2023, 12:42 p.m. UTC | #15
On 22.11.2023 10:29, Krzysztof Kozlowski wrote:
> On 22/11/2023 10:06, AngeloGioacchino Del Regno wrote:
>>>>> Hey Krzysztof,
>>>>>
>>>>> This is interesting. It might be about the cores that are missing from the partial
>>>>> core_mask raising interrupts, but an external abort on non-linefetch is strange to
>>>>> see here.
>>>> I've seen such external aborts in the past, and the fault type has
>>>> often been misleading. It's unlikely to have anything to do with a
>>> Yeah, often accessing device with power or clocks gated.
>>>
>> Except my commit does *not* gate SoC power, nor SoC clocks 
Marek Szyprowski Nov. 24, 2023, 12:45 p.m. UTC | #16
On 22.11.2023 10:29, Krzysztof Kozlowski wrote:
> On 22/11/2023 10:06, AngeloGioacchino Del Regno wrote:
>>>>> Hey Krzysztof,
>>>>>
>>>>> This is interesting. It might be about the cores that are missing from the partial
>>>>> core_mask raising interrupts, but an external abort on non-linefetch is strange to
>>>>> see here.
>>>> I've seen such external aborts in the past, and the fault type has
>>>> often been misleading. It's unlikely to have anything to do with a
>>> Yeah, often accessing device with power or clocks gated.
>>>
>> Except my commit does *not* gate SoC power, nor SoC clocks 
Marek Szyprowski Nov. 27, 2023, 11:24 a.m. UTC | #17
On 24.11.2023 13:45, Marek Szyprowski wrote:
> On 22.11.2023 10:29, Krzysztof Kozlowski wrote:
>> On 22/11/2023 10:06, AngeloGioacchino Del Regno wrote:
>>>>>> Hey Krzysztof,
>>>>>>
>>>>>> This is interesting. It might be about the cores that are missing 
>>>>>> from the partial
>>>>>> core_mask raising interrupts, but an external abort on 
>>>>>> non-linefetch is strange to
>>>>>> see here.
>>>>> I've seen such external aborts in the past, and the fault type has
>>>>> often been misleading. It's unlikely to have anything to do with a
>>>> Yeah, often accessing device with power or clocks gated.
>>>>
>>> Except my commit does *not* gate SoC power, nor SoC clocks 
AngeloGioacchino Del Regno Nov. 27, 2023, 11:26 a.m. UTC | #18
Il 27/11/23 12:24, Marek Szyprowski ha scritto:
> On 24.11.2023 13:45, Marek Szyprowski wrote:
>> On 22.11.2023 10:29, Krzysztof Kozlowski wrote:
>>> On 22/11/2023 10:06, AngeloGioacchino Del Regno wrote:
>>>>>>> Hey Krzysztof,
>>>>>>>
>>>>>>> This is interesting. It might be about the cores that are missing
>>>>>>> from the partial
>>>>>>> core_mask raising interrupts, but an external abort on
>>>>>>> non-linefetch is strange to
>>>>>>> see here.
>>>>>> I've seen such external aborts in the past, and the fault type has
>>>>>> often been misleading. It's unlikely to have anything to do with a
>>>>> Yeah, often accessing device with power or clocks gated.
>>>>>
>>>> Except my commit does *not* gate SoC power, nor SoC clocks 
Krzysztof Kozlowski Dec. 4, 2023, 7:53 a.m. UTC | #19
On 22/11/2023 10:29, Krzysztof Kozlowski wrote:
> On 22/11/2023 10:06, AngeloGioacchino Del Regno wrote:
> 
>>>>>>    
>>>>>
>>>>> Hey Krzysztof,
>>>>>
>>>>> This is interesting. It might be about the cores that are missing from the partial
>>>>> core_mask raising interrupts, but an external abort on non-linefetch is strange to
>>>>> see here.
>>>>
>>>> I've seen such external aborts in the past, and the fault type has
>>>> often been misleading. It's unlikely to have anything to do with a
>>>
>>> Yeah, often accessing device with power or clocks gated.
>>>
>>
>> Except my commit does *not* gate SoC power, nor SoC clocks 
diff mbox series

Patch

diff --git a/drivers/gpu/drm/panfrost/panfrost_gpu.c b/drivers/gpu/drm/panfrost/panfrost_gpu.c
index f0be7e19b13e..fad75b6e543e 100644
--- a/drivers/gpu/drm/panfrost/panfrost_gpu.c
+++ b/drivers/gpu/drm/panfrost/panfrost_gpu.c
@@ -362,28 +362,38 @@  unsigned long long panfrost_cycle_counter_read(struct panfrost_device *pfdev)
 	return ((u64)hi << 32) | lo;
 }
 
+static u64 panfrost_get_core_mask(struct panfrost_device *pfdev)
+{
+	u64 core_mask;
+
+	if (pfdev->features.l2_present == 1)
+		return U64_MAX;
+
+	/*
+	 * Only support one core group now.
+	 * ~(l2_present - 1) unsets all bits in l2_present except
+	 * the bottom bit. (l2_present - 2) has all the bits in
+	 * the first core group set. AND them together to generate
+	 * a mask of cores in the first core group.
+	 */
+	core_mask = ~(pfdev->features.l2_present - 1) &
+		     (pfdev->features.l2_present - 2);
+	dev_info_once(pfdev->dev, "using only 1st core group (%lu cores from %lu)\n",
+		      hweight64(core_mask),
+		      hweight64(pfdev->features.shader_present));
+
+	return core_mask;
+}
+
 void panfrost_gpu_power_on(struct panfrost_device *pfdev)
 {
 	int ret;
 	u32 val;
-	u64 core_mask = U64_MAX;
+	u64 core_mask;
 
 	panfrost_gpu_init_quirks(pfdev);
+	core_mask = panfrost_get_core_mask(pfdev);
 
-	if (pfdev->features.l2_present != 1) {
-		/*
-		 * Only support one core group now.
-		 * ~(l2_present - 1) unsets all bits in l2_present except
-		 * the bottom bit. (l2_present - 2) has all the bits in
-		 * the first core group set. AND them together to generate
-		 * a mask of cores in the first core group.
-		 */
-		core_mask = ~(pfdev->features.l2_present - 1) &
-			     (pfdev->features.l2_present - 2);
-		dev_info_once(pfdev->dev, "using only 1st core group (%lu cores from %lu)\n",
-			      hweight64(core_mask),
-			      hweight64(pfdev->features.shader_present));
-	}
 	gpu_write(pfdev, L2_PWRON_LO, pfdev->features.l2_present & core_mask);
 	ret = readl_relaxed_poll_timeout(pfdev->iomem + L2_READY_LO,
 		val, val == (pfdev->features.l2_present & core_mask),
@@ -408,11 +418,30 @@  void panfrost_gpu_power_on(struct panfrost_device *pfdev)
 
 void panfrost_gpu_power_off(struct panfrost_device *pfdev)
 {
-	gpu_write(pfdev, TILER_PWROFF_LO, 0);
-	gpu_write(pfdev, SHADER_PWROFF_LO, 0);
-	gpu_write(pfdev, L2_PWROFF_LO, 0);
+	u64 core_mask = panfrost_get_core_mask(pfdev);
+	int ret;
+	u32 val;
+
+	gpu_write(pfdev, SHADER_PWROFF_LO, pfdev->features.shader_present & core_mask);
+	ret = readl_relaxed_poll_timeout(pfdev->iomem + SHADER_PWRTRANS_LO,
+					 val, !val, 1, 1000);
+	if (ret)
+		dev_err(pfdev->dev, "shader power transition timeout");
+
+	gpu_write(pfdev, TILER_PWROFF_LO, pfdev->features.tiler_present);
+	ret = readl_relaxed_poll_timeout(pfdev->iomem + TILER_PWRTRANS_LO,
+					 val, !val, 1, 1000);
+	if (ret)
+		dev_err(pfdev->dev, "tiler power transition timeout");
+
+	gpu_write(pfdev, L2_PWROFF_LO, pfdev->features.l2_present & core_mask);
+	ret = readl_poll_timeout(pfdev->iomem + L2_PWRTRANS_LO,
+					 val, !val, 0, 1000);
+	if (ret)
+		dev_err(pfdev->dev, "l2 power transition timeout");
 }
 
+
 int panfrost_gpu_init(struct panfrost_device *pfdev)
 {
 	int err, irq;