
[v4,1/7] drm/v3d: Don't run jobs that have errors flagged in its fence

Message ID 20250313-v3d-gpu-reset-fixes-v4-1-c1e780d8e096@igalia.com (mailing list archive)
State New
Series drm/v3d: Fix GPU reset issues on the Raspberry Pi 5

Commit Message

Maíra Canal March 13, 2025, 2:43 p.m. UTC
The V3D driver still relies on `drm_sched_increase_karma()` and
`drm_sched_resubmit_jobs()` for resubmissions when a timeout occurs.
The function `drm_sched_increase_karma()` marks the job as guilty, while
`drm_sched_resubmit_jobs()` sets an error (-ECANCELED) in the DMA fence of
that guilty job.

Because of this, we must check whether the job’s DMA fence has been
flagged with an error before executing the job. Otherwise, the same guilty
job may be resubmitted indefinitely, causing repeated GPU resets.

This patch adds a check for an error on the job's fence to prevent running
a guilty job that was previously flagged when the GPU timed out.

Note that the CPU and CACHE_CLEAN queues do not require this check, as
their jobs are executed synchronously once the DRM scheduler starts them.

Cc: stable@vger.kernel.org
Fixes: d223f98f0209 ("drm/v3d: Add support for compute shader dispatch.")
Fixes: 1584f16ca96e ("drm/v3d: Add support for submitting jobs to the TFU.")
Reviewed-by: Iago Toral Quiroga <itoral@igalia.com>
Signed-off-by: Maíra Canal <mcanal@igalia.com>
---
 drivers/gpu/drm/v3d/v3d_sched.c | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)
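
As a rough illustration of the reasoning in the commit message, the loop it
describes can be modelled in plain userspace C. This is only a sketch under
stated assumptions: every `toy_*` name is hypothetical, none of it is the DRM
scheduler or V3D API, and the resubmission loop is a simplification of the
timeout path. A job that hangs is flagged with -ECANCELED on its fence and then
handed back to the run callback; with the fence check, the callback refuses to
run it again.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the scheduler objects; not the kernel's types. */
struct toy_fence {
	int error;			/* 0, or a negative errno such as -ECANCELED */
};

struct toy_job {
	struct toy_fence finished;	/* models s_fence->finished */
	bool hangs;			/* true if running the job times out the GPU */
	unsigned int runs;		/* how many times the job reached the hardware */
};

/* Models the timeout path: the guilty job's fence is flagged with an error. */
static void toy_flag_guilty(struct toy_job *job)
{
	assert(job != NULL);
	job->finished.error = -ECANCELED;
}

/*
 * Models a run_job callback. Returns true if the job was handed to the
 * hardware, false if it was skipped because its fence carries an error.
 */
static bool toy_run_job(struct toy_job *job, bool check_fence_error)
{
	assert(job != NULL);

	if (check_fence_error && job->finished.error)
		return false;

	job->runs++;
	return true;
}

/*
 * Submits a job and, while it keeps hanging, flags it and resubmits it.
 * The loop is capped at max_resets so the unguarded case terminates.
 * Returns the number of simulated GPU resets.
 */
static unsigned int toy_submit(struct toy_job *job, bool check_fence_error,
			       unsigned int max_resets)
{
	unsigned int resets = 0;

	while (toy_run_job(job, check_fence_error) && job->hangs) {
		if (resets == max_resets)
			break;
		resets++;
		toy_flag_guilty(job);
	}

	return resets;
}
```

In this model, calling `toy_submit()` on a hanging job with the check enabled
yields a single reset and a single run, while the same job without the check
keeps running until the artificial `max_resets` cap is reached.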

Comments

Maíra Canal March 13, 2025, 8:20 p.m. UTC | #1
On 13/03/25 11:43, Maíra Canal wrote:
> The V3D driver still relies on `drm_sched_increase_karma()` and
> `drm_sched_resubmit_jobs()` for resubmissions when a timeout occurs.
> The function `drm_sched_increase_karma()` marks the job as guilty, while
> `drm_sched_resubmit_jobs()` sets an error (-ECANCELED) in the DMA fence of
> that guilty job.
> 
> Because of this, we must check whether the job’s DMA fence has been
> flagged with an error before executing the job. Otherwise, the same guilty
> job may be resubmitted indefinitely, causing repeated GPU resets.
> 
> This patch adds a check for an error on the job's fence to prevent running
> a guilty job that was previously flagged when the GPU timed out.
> 
> Note that the CPU and CACHE_CLEAN queues do not require this check, as
> their jobs are executed synchronously once the DRM scheduler starts them.
> 
> Cc: stable@vger.kernel.org
> Fixes: d223f98f0209 ("drm/v3d: Add support for compute shader dispatch.")
> Fixes: 1584f16ca96e ("drm/v3d: Add support for submitting jobs to the TFU.")
> Reviewed-by: Iago Toral Quiroga <itoral@igalia.com>
> Signed-off-by: Maíra Canal <mcanal@igalia.com>

As patches 1/7 and 2/7 prevent the same faulty job from being
resubmitted in a loop, I just applied them to misc/kernel.git
(drm-misc-fixes).

Best Regards,
- Maíra


Patch

diff --git a/drivers/gpu/drm/v3d/v3d_sched.c b/drivers/gpu/drm/v3d/v3d_sched.c
index 80466ce8c7df669280e556c0793490b79e75d2c7..c2010ecdb08f4ba3b54f7783ed33901552d0eba1 100644
--- a/drivers/gpu/drm/v3d/v3d_sched.c
+++ b/drivers/gpu/drm/v3d/v3d_sched.c
@@ -327,11 +327,15 @@  v3d_tfu_job_run(struct drm_sched_job *sched_job)
 	struct drm_device *dev = &v3d->drm;
 	struct dma_fence *fence;
 
+	if (unlikely(job->base.base.s_fence->finished.error))
+		return NULL;
+
+	v3d->tfu_job = job;
+
 	fence = v3d_fence_create(v3d, V3D_TFU);
 	if (IS_ERR(fence))
 		return NULL;
 
-	v3d->tfu_job = job;
 	if (job->base.irq_fence)
 		dma_fence_put(job->base.irq_fence);
 	job->base.irq_fence = dma_fence_get(fence);
@@ -369,6 +373,9 @@  v3d_csd_job_run(struct drm_sched_job *sched_job)
 	struct dma_fence *fence;
 	int i, csd_cfg0_reg;
 
+	if (unlikely(job->base.base.s_fence->finished.error))
+		return NULL;
+
 	v3d->csd_job = job;
 
 	v3d_invalidate_caches(v3d);