[v3] drm/panfrost: Ignore core_mask for poweroff and sync interrupts

Message ID	20231123120521.147695-1-angelogioacchino.delregno@collabora.com (mailing list archive)
State	New, archived
Headers	show Return-Path: <dri-devel-bounces@lists.freedesktop.org> sender: kholk11) by madras.collabora.co.uk (Postfix) with ESMTPSA id 7F99C66073A3; Thu, 23 Nov 2023 12:05:27 +0000 (GMT) From: AngeloGioacchino Del Regno <angelogioacchino.delregno@collabora.com> To: steven.price@arm.com Subject: [PATCH v3] drm/panfrost: Ignore core_mask for poweroff and sync interrupts Date: Thu, 23 Nov 2023 13:05:21 +0100 Message-ID: <20231123120521.147695-1-angelogioacchino.delregno@collabora.com> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit Precedence: list Cc: linux-kernel@vger.kernel.org, mripard@kernel.org, krzysztof.kozlowski@linaro.org, boris.brezillon@collabora.com, dri-devel@lists.freedesktop.org, tzimmermann@suse.de, angelogioacchino.delregno@collabora.com Errors-To: dri-devel-bounces@lists.freedesktop.org Sender: "dri-devel" <dri-devel-bounces@lists.freedesktop.org>
Series	[v3] drm/panfrost: Ignore core_mask for poweroff and sync interrupts \| expand [v3] drm/panfrost: Ignore core_mask for poweroff and sync interrupts

AngeloGioacchino Del Regno Nov. 23, 2023, 12:05 p.m. UTC

Some SoCs may be equipped with a GPU containing two core groups
and this is exactly the case of Samsung's Exynos 5422 featuring
an ARM Mali-T628 MP6 GPU: the support for this GPU in Panfrost
is partial, as this driver currently supports using only one
core group and that's reflected on all parts of it, including
the power on (and power off, previously to this patch) function.

The issue with this is that even though executing the soft reset
operation should power off all cores unconditionally, on at least
one platform we're seeing a crash that seems to be happening due
to an interrupt firing which may be because we are calling power
transition only on the first core group, leaving the second one
unchanged, or because ISR execution was pending before entering
the panfrost_gpu_power_off() function and executed after powering
off the GPU cores, or all of the above.

Finally, solve this by introducing a new panfrost_gpu_suspend_irq()
helper function and changing the panfrost_device_suspend() flow to
 1. Mask and clear all interrupts: we don't need nor want any, as
    for power_off() we are polling PWRTRANS, but we anyway don't
    want GPU IRQs to fire while it is suspended/powered off;
 2. Call synchronize_irq() after that to make sure that any pending
    ISR is executed before powering off the GPU Shaders/Tilers/L2
    hence avoiding unpowered registers R/W; and
 3. Ignore the core_mask and ask the GPU to poweroff both core groups

Of course it was also necessary to add a `irq` variable to `struct
panfrost_device` as we need to get that in panfrost_gpu_power_off()
for calling synchronize_irq() on it.

Fixes: 22aa1a209018 ("drm/panfrost: Really power off GPU cores in panfrost_gpu_power_off()")
[Regression detected on Odroid HC1, Exynos5422, Mali-T628 MP6]
Reported-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
Signed-off-by: AngeloGioacchino Del Regno <angelogioacchino.delregno@collabora.com>
---

Changes in v3:
 - Fix compile issue

Changes in v2:
 - Fixed the commit hash of "Really power off [...]"
 - Actually based on a clean next-20231121
 - Renamed "irq" to "gpu_irq" as per Boris' suggestion
 - Moved the IRQ mask/clear/sync to a helper function and added
   a call to that in panfrost_device.c instead of doing that in
   panfrost_gpu_power_off().

NOTE: I didn't split 1+2 from 3 as suggested by Boris, and I'm sending
this one without waiting for feedback on my reasons for that which I
explained as a reply to v1 because the former couldn't be applied to
linux-next, and I want to unblock Krzysztof ASAP to get this tested.

 drivers/gpu/drm/panfrost/panfrost_device.c |  1 +
 drivers/gpu/drm/panfrost/panfrost_device.h |  1 +
 drivers/gpu/drm/panfrost/panfrost_gpu.c    | 28 +++++++++++++++-------
 drivers/gpu/drm/panfrost/panfrost_gpu.h    |  1 +
 4 files changed, 23 insertions(+), 8 deletions(-)

Krzysztof Kozlowski Nov. 23, 2023, 12:43 p.m. UTC | #1

On 23/11/2023 13:05, AngeloGioacchino Del Regno wrote:
> Some SoCs may be equipped with a GPU containing two core groups
> and this is exactly the case of Samsung's Exynos 5422 featuring
> an ARM Mali-T628 MP6 GPU: the support for this GPU in Panfrost
> is partial, as this driver currently supports using only one
> core group and that's reflected on all parts of it, including
> the power on (and power off, previously to this patch) function.
> 
> The issue with this is that even though executing the soft reset
> operation should power off all cores unconditionally, on at least
> one platform we're seeing a crash that seems to be happening due
> to an interrupt firing which may be because we are calling power
> transition only on the first core group, leaving the second one
> unchanged, or because ISR execution was pending before entering
> the panfrost_gpu_power_off() function and executed after powering
> off the GPU cores, or all of the above.
> 
> Finally, solve this by introducing a new panfrost_gpu_suspend_irq()
> helper function and changing the panfrost_device_suspend() flow to
>  1. Mask and clear all interrupts: we don't need nor want any, as
>     for power_off() we are polling PWRTRANS, but we anyway don't
>     want GPU IRQs to fire while it is suspended/powered off;
>  2. Call synchronize_irq() after that to make sure that any pending
>     ISR is executed before powering off the GPU Shaders/Tilers/L2
>     hence avoiding unpowered registers R/W; and
>  3. Ignore the core_mask and ask the GPU to poweroff both core groups
> 
> Of course it was also necessary to add a `irq` variable to `struct
> panfrost_device` as we need to get that in panfrost_gpu_power_off()
> for calling synchronize_irq() on it.
> 
> Fixes: 22aa1a209018 ("drm/panfrost: Really power off GPU cores in panfrost_gpu_power_off()")
> [Regression detected on Odroid HC1, Exynos5422, Mali-T628 MP6]
> Reported-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
> Signed-off-by: AngeloGioacchino Del Regno <angelogioacchino.delregno@collabora.com>


Tested-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>

Best regards,
Krzysztof

AngeloGioacchino Del Regno Nov. 23, 2023, 12:44 p.m. UTC | #2

Il 23/11/23 13:43, Krzysztof Kozlowski ha scritto:
> On 23/11/2023 13:05, AngeloGioacchino Del Regno wrote:
>> Some SoCs may be equipped with a GPU containing two core groups
>> and this is exactly the case of Samsung's Exynos 5422 featuring
>> an ARM Mali-T628 MP6 GPU: the support for this GPU in Panfrost
>> is partial, as this driver currently supports using only one
>> core group and that's reflected on all parts of it, including
>> the power on (and power off, previously to this patch) function.
>>
>> The issue with this is that even though executing the soft reset
>> operation should power off all cores unconditionally, on at least
>> one platform we're seeing a crash that seems to be happening due
>> to an interrupt firing which may be because we are calling power
>> transition only on the first core group, leaving the second one
>> unchanged, or because ISR execution was pending before entering
>> the panfrost_gpu_power_off() function and executed after powering
>> off the GPU cores, or all of the above.
>>
>> Finally, solve this by introducing a new panfrost_gpu_suspend_irq()
>> helper function and changing the panfrost_device_suspend() flow to
>>   1. Mask and clear all interrupts: we don't need nor want any, as
>>      for power_off() we are polling PWRTRANS, but we anyway don't
>>      want GPU IRQs to fire while it is suspended/powered off;
>>   2. Call synchronize_irq() after that to make sure that any pending
>>      ISR is executed before powering off the GPU Shaders/Tilers/L2
>>      hence avoiding unpowered registers R/W; and
>>   3. Ignore the core_mask and ask the GPU to poweroff both core groups
>>
>> Of course it was also necessary to add a `irq` variable to `struct
>> panfrost_device` as we need to get that in panfrost_gpu_power_off()
>> for calling synchronize_irq() on it.
>>
>> Fixes: 22aa1a209018 ("drm/panfrost: Really power off GPU cores in panfrost_gpu_power_off()")
>> [Regression detected on Odroid HC1, Exynos5422, Mali-T628 MP6]
>> Reported-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
>> Signed-off-by: AngeloGioacchino Del Regno <angelogioacchino.delregno@collabora.com>
> 
> 
> Tested-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
> 

Thank you for the test and, more importantly, for your patience.

Cheers,
Angelo

> Best regards,
> Krzysztof
>

Boris Brezillon Nov. 23, 2023, 1:13 p.m. UTC | #3

On Thu, 23 Nov 2023 13:05:21 +0100
AngeloGioacchino Del Regno <angelogioacchino.delregno@collabora.com>
wrote:

> Some SoCs may be equipped with a GPU containing two core groups
> and this is exactly the case of Samsung's Exynos 5422 featuring
> an ARM Mali-T628 MP6 GPU: the support for this GPU in Panfrost
> is partial, as this driver currently supports using only one
> core group and that's reflected on all parts of it, including
> the power on (and power off, previously to this patch) function.
> 
> The issue with this is that even though executing the soft reset
> operation should power off all cores unconditionally, on at least
> one platform we're seeing a crash that seems to be happening due
> to an interrupt firing which may be because we are calling power
> transition only on the first core group, leaving the second one
> unchanged, or because ISR execution was pending before entering
> the panfrost_gpu_power_off() function and executed after powering
> off the GPU cores, or all of the above.
> 
> Finally, solve this by introducing a new panfrost_gpu_suspend_irq()
> helper function and changing the panfrost_device_suspend() flow to
>  1. Mask and clear all interrupts: we don't need nor want any, as
>     for power_off() we are polling PWRTRANS, but we anyway don't
>     want GPU IRQs to fire while it is suspended/powered off;
>  2. Call synchronize_irq() after that to make sure that any pending
>     ISR is executed before powering off the GPU Shaders/Tilers/L2
>     hence avoiding unpowered registers R/W; and
>  3. Ignore the core_mask and ask the GPU to poweroff both core groups
> 
> Of course it was also necessary to add a `irq` variable to `struct
> panfrost_device` as we need to get that in panfrost_gpu_power_off()
> for calling synchronize_irq() on it.
> 
> Fixes: 22aa1a209018 ("drm/panfrost: Really power off GPU cores in panfrost_gpu_power_off()")
> [Regression detected on Odroid HC1, Exynos5422, Mali-T628 MP6]
> Reported-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
> Signed-off-by: AngeloGioacchino Del Regno <angelogioacchino.delregno@collabora.com>
> ---
> 
> Changes in v3:
>  - Fix compile issue
> 
> Changes in v2:
>  - Fixed the commit hash of "Really power off [...]"
>  - Actually based on a clean next-20231121
>  - Renamed "irq" to "gpu_irq" as per Boris' suggestion
>  - Moved the IRQ mask/clear/sync to a helper function and added
>    a call to that in panfrost_device.c instead of doing that in
>    panfrost_gpu_power_off().
> 
> NOTE: I didn't split 1+2 from 3 as suggested by Boris, and I'm sending
> this one without waiting for feedback on my reasons for that which I
> explained as a reply to v1 because the former couldn't be applied to
> linux-next, and I want to unblock Krzysztof ASAP to get this tested.
> 
>  drivers/gpu/drm/panfrost/panfrost_device.c |  1 +
>  drivers/gpu/drm/panfrost/panfrost_device.h |  1 +
>  drivers/gpu/drm/panfrost/panfrost_gpu.c    | 28 +++++++++++++++-------
>  drivers/gpu/drm/panfrost/panfrost_gpu.h    |  1 +
>  4 files changed, 23 insertions(+), 8 deletions(-)
> 
> diff --git a/drivers/gpu/drm/panfrost/panfrost_device.c b/drivers/gpu/drm/panfrost/panfrost_device.c
> index c90ad5ee34e7..b0a4f3e513f4 100644
> --- a/drivers/gpu/drm/panfrost/panfrost_device.c
> +++ b/drivers/gpu/drm/panfrost/panfrost_device.c
> @@ -421,6 +421,7 @@ static int panfrost_device_runtime_suspend(struct device *dev)
>  		return -EBUSY;
>  
>  	panfrost_devfreq_suspend(pfdev);
> +	panfrost_gpu_suspend_irq(pfdev);
>  	panfrost_gpu_power_off(pfdev);
>  
>  	return 0;
> diff --git a/drivers/gpu/drm/panfrost/panfrost_device.h b/drivers/gpu/drm/panfrost/panfrost_device.h
> index 0fc558db6bfd..f2b1d4afb9e9 100644
> --- a/drivers/gpu/drm/panfrost/panfrost_device.h
> +++ b/drivers/gpu/drm/panfrost/panfrost_device.h
> @@ -94,6 +94,7 @@ struct panfrost_device {
>  	struct device *dev;
>  	struct drm_device *ddev;
>  	struct platform_device *pdev;
> +	int gpu_irq;
>  
>  	void __iomem *iomem;
>  	struct clk *clock;
> diff --git a/drivers/gpu/drm/panfrost/panfrost_gpu.c b/drivers/gpu/drm/panfrost/panfrost_gpu.c
> index 09f5e1563ebd..2c09aede0945 100644
> --- a/drivers/gpu/drm/panfrost/panfrost_gpu.c
> +++ b/drivers/gpu/drm/panfrost/panfrost_gpu.c
> @@ -425,11 +425,10 @@ void panfrost_gpu_power_on(struct panfrost_device *pfdev)
>  
>  void panfrost_gpu_power_off(struct panfrost_device *pfdev)
>  {
> -	u64 core_mask = panfrost_get_core_mask(pfdev);
>  	int ret;
>  	u32 val;
>  
> -	gpu_write(pfdev, SHADER_PWROFF_LO, pfdev->features.shader_present & core_mask);
> +	gpu_write(pfdev, SHADER_PWROFF_LO, pfdev->features.shader_present);
>  	ret = readl_relaxed_poll_timeout(pfdev->iomem + SHADER_PWRTRANS_LO,
>  					 val, !val, 1, 1000);
>  	if (ret)
> @@ -441,16 +440,29 @@ void panfrost_gpu_power_off(struct panfrost_device *pfdev)
>  	if (ret)
>  		dev_err(pfdev->dev, "tiler power transition timeout");
>  
> -	gpu_write(pfdev, L2_PWROFF_LO, pfdev->features.l2_present & core_mask);
> +	gpu_write(pfdev, L2_PWROFF_LO, pfdev->features.l2_present);
>  	ret = readl_poll_timeout(pfdev->iomem + L2_PWRTRANS_LO,
>  				 val, !val, 0, 1000);
>  	if (ret)
>  		dev_err(pfdev->dev, "l2 power transition timeout");
>  }
>  
> +void panfrost_gpu_suspend_irq(struct panfrost_device *pfdev)
> +{
> +	/* Disable all interrupts before suspending the GPU */
> +	gpu_write(pfdev, GPU_INT_MASK, 0);
> +	gpu_write(pfdev, GPU_INT_CLEAR, GPU_IRQ_MASK_ALL);
> +
> +	/*
> +	 * Make sure that we don't have pending ISRs, otherwise we'll be
> +	 * reading and/or writing registers while the GPU is powered off

I see this comment, plus the fact you call panfrost_gpu_suspend_irq()
before panfrost_gpu_power_off() and I realize there might still be a
misunderstanding. It's not the l2,shader,tiler-poweroff that causes the
invalid reg access, it's what happens in the PM core after we've
returned from panfrost_device_runtime_suspend() (power domain turned
off). The register bank is still accessible when the sub-blocks are off,
otherwise we wouldn't be able to power them on after a reset (which
automatically powers off all the blocks, IIRC).

> +	 */
> +	synchronize_irq(pfdev->gpu_irq);
> +}
> +
>  int panfrost_gpu_init(struct panfrost_device *pfdev)
>  {
> -	int err, irq;
> +	int err;
>  
>  	err = panfrost_gpu_soft_reset(pfdev);
>  	if (err)
> @@ -465,11 +477,11 @@ int panfrost_gpu_init(struct panfrost_device *pfdev)
>  
>  	dma_set_max_seg_size(pfdev->dev, UINT_MAX);
>  
> -	irq = platform_get_irq_byname(to_platform_device(pfdev->dev), "gpu");
> -	if (irq < 0)
> -		return irq;
> +	pfdev->gpu_irq = platform_get_irq_byname(to_platform_device(pfdev->dev), "gpu");
> +	if (pfdev->gpu_irq < 0)
> +		return pfdev->gpu_irq;
>  
> -	err = devm_request_irq(pfdev->dev, irq, panfrost_gpu_irq_handler,
> +	err = devm_request_irq(pfdev->dev, pfdev->gpu_irq, panfrost_gpu_irq_handler,
>  			       IRQF_SHARED, KBUILD_MODNAME "-gpu", pfdev);
>  	if (err) {
>  		dev_err(pfdev->dev, "failed to request gpu irq");
> diff --git a/drivers/gpu/drm/panfrost/panfrost_gpu.h b/drivers/gpu/drm/panfrost/panfrost_gpu.h
> index 876fdad9f721..d841b86504ea 100644
> --- a/drivers/gpu/drm/panfrost/panfrost_gpu.h
> +++ b/drivers/gpu/drm/panfrost/panfrost_gpu.h
> @@ -15,6 +15,7 @@ u32 panfrost_gpu_get_latest_flush_id(struct panfrost_device *pfdev);
>  int panfrost_gpu_soft_reset(struct panfrost_device *pfdev);
>  void panfrost_gpu_power_on(struct panfrost_device *pfdev);
>  void panfrost_gpu_power_off(struct panfrost_device *pfdev);
> +void panfrost_gpu_suspend_irq(struct panfrost_device *pfdev);
>  
>  void panfrost_cycle_counter_get(struct panfrost_device *pfdev);
>  void panfrost_cycle_counter_put(struct panfrost_device *pfdev);

AngeloGioacchino Del Regno Nov. 23, 2023, 1:30 p.m. UTC | #4

Il 23/11/23 14:13, Boris Brezillon ha scritto:
> On Thu, 23 Nov 2023 13:05:21 +0100
> AngeloGioacchino Del Regno <angelogioacchino.delregno@collabora.com>
> wrote:
> 
>> Some SoCs may be equipped with a GPU containing two core groups
>> and this is exactly the case of Samsung's Exynos 5422 featuring
>> an ARM Mali-T628 MP6 GPU: the support for this GPU in Panfrost
>> is partial, as this driver currently supports using only one
>> core group and that's reflected on all parts of it, including
>> the power on (and power off, previously to this patch) function.
>>
>> The issue with this is that even though executing the soft reset
>> operation should power off all cores unconditionally, on at least
>> one platform we're seeing a crash that seems to be happening due
>> to an interrupt firing which may be because we are calling power
>> transition only on the first core group, leaving the second one
>> unchanged, or because ISR execution was pending before entering
>> the panfrost_gpu_power_off() function and executed after powering
>> off the GPU cores, or all of the above.
>>
>> Finally, solve this by introducing a new panfrost_gpu_suspend_irq()
>> helper function and changing the panfrost_device_suspend() flow to
>>   1. Mask and clear all interrupts: we don't need nor want any, as
>>      for power_off() we are polling PWRTRANS, but we anyway don't
>>      want GPU IRQs to fire while it is suspended/powered off;
>>   2. Call synchronize_irq() after that to make sure that any pending
>>      ISR is executed before powering off the GPU Shaders/Tilers/L2
>>      hence avoiding unpowered registers R/W; and
>>   3. Ignore the core_mask and ask the GPU to poweroff both core groups
>>
>> Of course it was also necessary to add a `irq` variable to `struct
>> panfrost_device` as we need to get that in panfrost_gpu_power_off()
>> for calling synchronize_irq() on it.
>>
>> Fixes: 22aa1a209018 ("drm/panfrost: Really power off GPU cores in panfrost_gpu_power_off()")
>> [Regression detected on Odroid HC1, Exynos5422, Mali-T628 MP6]
>> Reported-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
>> Signed-off-by: AngeloGioacchino Del Regno <angelogioacchino.delregno@collabora.com>
>> ---
>>
>> Changes in v3:
>>   - Fix compile issue
>>
>> Changes in v2:
>>   - Fixed the commit hash of "Really power off [...]"
>>   - Actually based on a clean next-20231121
>>   - Renamed "irq" to "gpu_irq" as per Boris' suggestion
>>   - Moved the IRQ mask/clear/sync to a helper function and added
>>     a call to that in panfrost_device.c instead of doing that in
>>     panfrost_gpu_power_off().
>>
>> NOTE: I didn't split 1+2 from 3 as suggested by Boris, and I'm sending
>> this one without waiting for feedback on my reasons for that which I
>> explained as a reply to v1 because the former couldn't be applied to
>> linux-next, and I want to unblock Krzysztof ASAP to get this tested.
>>
>>   drivers/gpu/drm/panfrost/panfrost_device.c |  1 +
>>   drivers/gpu/drm/panfrost/panfrost_device.h |  1 +
>>   drivers/gpu/drm/panfrost/panfrost_gpu.c    | 28 +++++++++++++++-------
>>   drivers/gpu/drm/panfrost/panfrost_gpu.h    |  1 +
>>   4 files changed, 23 insertions(+), 8 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/panfrost/panfrost_device.c b/drivers/gpu/drm/panfrost/panfrost_device.c
>> index c90ad5ee34e7..b0a4f3e513f4 100644
>> --- a/drivers/gpu/drm/panfrost/panfrost_device.c
>> +++ b/drivers/gpu/drm/panfrost/panfrost_device.c
>> @@ -421,6 +421,7 @@ static int panfrost_device_runtime_suspend(struct device *dev)
>>   		return -EBUSY;
>>   
>>   	panfrost_devfreq_suspend(pfdev);
>> +	panfrost_gpu_suspend_irq(pfdev);
>>   	panfrost_gpu_power_off(pfdev);
>>   
>>   	return 0;
>> diff --git a/drivers/gpu/drm/panfrost/panfrost_device.h b/drivers/gpu/drm/panfrost/panfrost_device.h
>> index 0fc558db6bfd..f2b1d4afb9e9 100644
>> --- a/drivers/gpu/drm/panfrost/panfrost_device.h
>> +++ b/drivers/gpu/drm/panfrost/panfrost_device.h
>> @@ -94,6 +94,7 @@ struct panfrost_device {
>>   	struct device *dev;
>>   	struct drm_device *ddev;
>>   	struct platform_device *pdev;
>> +	int gpu_irq;
>>   
>>   	void __iomem *iomem;
>>   	struct clk *clock;
>> diff --git a/drivers/gpu/drm/panfrost/panfrost_gpu.c b/drivers/gpu/drm/panfrost/panfrost_gpu.c
>> index 09f5e1563ebd..2c09aede0945 100644
>> --- a/drivers/gpu/drm/panfrost/panfrost_gpu.c
>> +++ b/drivers/gpu/drm/panfrost/panfrost_gpu.c
>> @@ -425,11 +425,10 @@ void panfrost_gpu_power_on(struct panfrost_device *pfdev)
>>   
>>   void panfrost_gpu_power_off(struct panfrost_device *pfdev)
>>   {
>> -	u64 core_mask = panfrost_get_core_mask(pfdev);
>>   	int ret;
>>   	u32 val;
>>   
>> -	gpu_write(pfdev, SHADER_PWROFF_LO, pfdev->features.shader_present & core_mask);
>> +	gpu_write(pfdev, SHADER_PWROFF_LO, pfdev->features.shader_present);
>>   	ret = readl_relaxed_poll_timeout(pfdev->iomem + SHADER_PWRTRANS_LO,
>>   					 val, !val, 1, 1000);
>>   	if (ret)
>> @@ -441,16 +440,29 @@ void panfrost_gpu_power_off(struct panfrost_device *pfdev)
>>   	if (ret)
>>   		dev_err(pfdev->dev, "tiler power transition timeout");
>>   
>> -	gpu_write(pfdev, L2_PWROFF_LO, pfdev->features.l2_present & core_mask);
>> +	gpu_write(pfdev, L2_PWROFF_LO, pfdev->features.l2_present);
>>   	ret = readl_poll_timeout(pfdev->iomem + L2_PWRTRANS_LO,
>>   				 val, !val, 0, 1000);
>>   	if (ret)
>>   		dev_err(pfdev->dev, "l2 power transition timeout");
>>   }
>>   
>> +void panfrost_gpu_suspend_irq(struct panfrost_device *pfdev)
>> +{
>> +	/* Disable all interrupts before suspending the GPU */
>> +	gpu_write(pfdev, GPU_INT_MASK, 0);
>> +	gpu_write(pfdev, GPU_INT_CLEAR, GPU_IRQ_MASK_ALL);
>> +
>> +	/*
>> +	 * Make sure that we don't have pending ISRs, otherwise we'll be
>> +	 * reading and/or writing registers while the GPU is powered off
> 
> I see this comment, plus the fact you call panfrost_gpu_suspend_irq()
> before panfrost_gpu_power_off() and I realize there might still be a
> misunderstanding. It's not the l2,shader,tiler-poweroff that causes the
> invalid reg access, it's what happens in the PM core after we've
> returned from panfrost_device_runtime_suspend() (power domain turned
> off). The register bank is still accessible when the sub-blocks are off,
> otherwise we wouldn't be able to power them on after a reset (which
> automatically powers off all the blocks, IIRC).
> 

Uhm, should I reword that? Suggestions about how?

There wasn't any misunderstanding, I did get that the issue can be about unclocked
access or unpowered access (unpowered = vregs off or power domains off), but it is
correct to mask and sync IRQs *before* calling shader/tiler/l2 poweroff (at least
for how I see it), because if there's anything that we want to do in the ISR, we
still have the GPU fully powered up and we're sure that, for example, if we want
to read a debug register to get the reason of an error irq, we're sure that it did
not get cleared.

Cheers,
Angelo

Boris Brezillon Nov. 23, 2023, 2:07 p.m. UTC | #5

On Thu, 23 Nov 2023 14:30:08 +0100
AngeloGioacchino Del Regno <angelogioacchino.delregno@collabora.com>
wrote:

> Il 23/11/23 14:13, Boris Brezillon ha scritto:
> > On Thu, 23 Nov 2023 13:05:21 +0100
> > AngeloGioacchino Del Regno <angelogioacchino.delregno@collabora.com>
> > wrote:
> >   
> >> Some SoCs may be equipped with a GPU containing two core groups
> >> and this is exactly the case of Samsung's Exynos 5422 featuring
> >> an ARM Mali-T628 MP6 GPU: the support for this GPU in Panfrost
> >> is partial, as this driver currently supports using only one
> >> core group and that's reflected on all parts of it, including
> >> the power on (and power off, previously to this patch) function.
> >>
> >> The issue with this is that even though executing the soft reset
> >> operation should power off all cores unconditionally, on at least
> >> one platform we're seeing a crash that seems to be happening due
> >> to an interrupt firing which may be because we are calling power
> >> transition only on the first core group, leaving the second one
> >> unchanged, or because ISR execution was pending before entering
> >> the panfrost_gpu_power_off() function and executed after powering
> >> off the GPU cores, or all of the above.
> >>
> >> Finally, solve this by introducing a new panfrost_gpu_suspend_irq()
> >> helper function and changing the panfrost_device_suspend() flow to
> >>   1. Mask and clear all interrupts: we don't need nor want any, as
> >>      for power_off() we are polling PWRTRANS, but we anyway don't
> >>      want GPU IRQs to fire while it is suspended/powered off;
> >>   2. Call synchronize_irq() after that to make sure that any pending
> >>      ISR is executed before powering off the GPU Shaders/Tilers/L2
> >>      hence avoiding unpowered registers R/W; and
> >>   3. Ignore the core_mask and ask the GPU to poweroff both core groups
> >>
> >> Of course it was also necessary to add a `irq` variable to `struct
> >> panfrost_device` as we need to get that in panfrost_gpu_power_off()
> >> for calling synchronize_irq() on it.
> >>
> >> Fixes: 22aa1a209018 ("drm/panfrost: Really power off GPU cores in panfrost_gpu_power_off()")
> >> [Regression detected on Odroid HC1, Exynos5422, Mali-T628 MP6]
> >> Reported-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
> >> Signed-off-by: AngeloGioacchino Del Regno <angelogioacchino.delregno@collabora.com>
> >> ---
> >>
> >> Changes in v3:
> >>   - Fix compile issue
> >>
> >> Changes in v2:
> >>   - Fixed the commit hash of "Really power off [...]"
> >>   - Actually based on a clean next-20231121
> >>   - Renamed "irq" to "gpu_irq" as per Boris' suggestion
> >>   - Moved the IRQ mask/clear/sync to a helper function and added
> >>     a call to that in panfrost_device.c instead of doing that in
> >>     panfrost_gpu_power_off().
> >>
> >> NOTE: I didn't split 1+2 from 3 as suggested by Boris, and I'm sending
> >> this one without waiting for feedback on my reasons for that which I
> >> explained as a reply to v1 because the former couldn't be applied to
> >> linux-next, and I want to unblock Krzysztof ASAP to get this tested.
> >>
> >>   drivers/gpu/drm/panfrost/panfrost_device.c |  1 +
> >>   drivers/gpu/drm/panfrost/panfrost_device.h |  1 +
> >>   drivers/gpu/drm/panfrost/panfrost_gpu.c    | 28 +++++++++++++++-------
> >>   drivers/gpu/drm/panfrost/panfrost_gpu.h    |  1 +
> >>   4 files changed, 23 insertions(+), 8 deletions(-)
> >>
> >> diff --git a/drivers/gpu/drm/panfrost/panfrost_device.c b/drivers/gpu/drm/panfrost/panfrost_device.c
> >> index c90ad5ee34e7..b0a4f3e513f4 100644
> >> --- a/drivers/gpu/drm/panfrost/panfrost_device.c
> >> +++ b/drivers/gpu/drm/panfrost/panfrost_device.c
> >> @@ -421,6 +421,7 @@ static int panfrost_device_runtime_suspend(struct device *dev)
> >>   		return -EBUSY;
> >>   
> >>   	panfrost_devfreq_suspend(pfdev);
> >> +	panfrost_gpu_suspend_irq(pfdev);
> >>   	panfrost_gpu_power_off(pfdev);
> >>   
> >>   	return 0;
> >> diff --git a/drivers/gpu/drm/panfrost/panfrost_device.h b/drivers/gpu/drm/panfrost/panfrost_device.h
> >> index 0fc558db6bfd..f2b1d4afb9e9 100644
> >> --- a/drivers/gpu/drm/panfrost/panfrost_device.h
> >> +++ b/drivers/gpu/drm/panfrost/panfrost_device.h
> >> @@ -94,6 +94,7 @@ struct panfrost_device {
> >>   	struct device *dev;
> >>   	struct drm_device *ddev;
> >>   	struct platform_device *pdev;
> >> +	int gpu_irq;
> >>   
> >>   	void __iomem *iomem;
> >>   	struct clk *clock;
> >> diff --git a/drivers/gpu/drm/panfrost/panfrost_gpu.c b/drivers/gpu/drm/panfrost/panfrost_gpu.c
> >> index 09f5e1563ebd..2c09aede0945 100644
> >> --- a/drivers/gpu/drm/panfrost/panfrost_gpu.c
> >> +++ b/drivers/gpu/drm/panfrost/panfrost_gpu.c
> >> @@ -425,11 +425,10 @@ void panfrost_gpu_power_on(struct panfrost_device *pfdev)
> >>   
> >>   void panfrost_gpu_power_off(struct panfrost_device *pfdev)
> >>   {
> >> -	u64 core_mask = panfrost_get_core_mask(pfdev);
> >>   	int ret;
> >>   	u32 val;
> >>   
> >> -	gpu_write(pfdev, SHADER_PWROFF_LO, pfdev->features.shader_present & core_mask);
> >> +	gpu_write(pfdev, SHADER_PWROFF_LO, pfdev->features.shader_present);
> >>   	ret = readl_relaxed_poll_timeout(pfdev->iomem + SHADER_PWRTRANS_LO,
> >>   					 val, !val, 1, 1000);
> >>   	if (ret)
> >> @@ -441,16 +440,29 @@ void panfrost_gpu_power_off(struct panfrost_device *pfdev)
> >>   	if (ret)
> >>   		dev_err(pfdev->dev, "tiler power transition timeout");
> >>   
> >> -	gpu_write(pfdev, L2_PWROFF_LO, pfdev->features.l2_present & core_mask);
> >> +	gpu_write(pfdev, L2_PWROFF_LO, pfdev->features.l2_present);
> >>   	ret = readl_poll_timeout(pfdev->iomem + L2_PWRTRANS_LO,
> >>   				 val, !val, 0, 1000);
> >>   	if (ret)
> >>   		dev_err(pfdev->dev, "l2 power transition timeout");
> >>   }
> >>   
> >> +void panfrost_gpu_suspend_irq(struct panfrost_device *pfdev)
> >> +{
> >> +	/* Disable all interrupts before suspending the GPU */
> >> +	gpu_write(pfdev, GPU_INT_MASK, 0);
> >> +	gpu_write(pfdev, GPU_INT_CLEAR, GPU_IRQ_MASK_ALL);
> >> +
> >> +	/*
> >> +	 * Make sure that we don't have pending ISRs, otherwise we'll be
> >> +	 * reading and/or writing registers while the GPU is powered off  
> > 
> > I see this comment, plus the fact you call panfrost_gpu_suspend_irq()
> > before panfrost_gpu_power_off() and I realize there might still be a
> > misunderstanding. It's not the l2,shader,tiler-poweroff that causes the
> > invalid reg access, it's what happens in the PM core after we've
> > returned from panfrost_device_runtime_suspend() (power domain turned
> > off). The register bank is still accessible when the sub-blocks are off,
> > otherwise we wouldn't be able to power them on after a reset (which
> > automatically powers off all the blocks, IIRC).
> >   
> 
> Uhm, should I reword that? Suggestions about how?
> 
> There wasn't any misunderstanding, I did get that the issue can be about unclocked
> access or unpowered access (unpowered = vregs off or power domains off), but it is
> correct to mask and sync IRQs *before* calling shader/tiler/l2 poweroff (at least
> for how I see it), because if there's anything that we want to do in the ISR, we
> still have the GPU fully powered up and we're sure that, for example, if we want
> to read a debug register to get the reason of an error irq, we're sure that it did
> not get cleared.

Which is also true after the l2,tiler,core poweroff happened AFAICT (at
least for registers that are in the GPU_xxx range). My point is, just
because some internal blocks are powered off, doesn't mean the
registers exposed to the CPU are invalid, some are just unlikely to
change after that point. And if you're worried about some registers
having invalid content after a poweroff, it's also true for MMU and JOB
registers, but you don't mask+synchronize those in this patch. That's
what I'm talking about when I say it's a partial fix to a potentially
wider issue, and the very reason I suggested going for simpler fix here,
and then fix the irq-sync-on-suspend issue in another patch series.

[v3] drm/panfrost: Ignore core_mask for poweroff and sync interrupts

Commit Message

Comments

Patch